-
48386e34ad
Fix torch.compile: use custom autograd Function instead of @torch.compiler.disable
biondizzle
2026-05-18 21:38:28 +00:00
-
85e1cd3b69
Fix torch.compile crash: @torch.compiler.disable on all CuTeDSL run()
biondizzle
2026-05-18 21:07:35 +00:00
-
a94011ec92
Fix torch.compile crash: remove threading.Lock from LUT cache path
biondizzle
2026-05-18 20:54:55 +00:00
-
6326222d68
Fix: add abstract create_weights to CuTeDSLNvfp4LinearMethod
biondizzle
2026-05-18 20:40:48 +00:00
-
450793311c
Wire CuTeDSL kernels into vLLM: replace all BF16 dequant with native NVFP4
biondizzle
2026-05-18 20:27:42 +00:00
-
6ce6a47be9
Add NVFP4 linear runner + attention projection test
biondizzle
2026-05-18 20:14:03 +00:00
-
f07643791e
Fix hidden_size: shared expert uses 7168, not HC_DIM 28672
biondizzle
2026-05-18 20:10:32 +00:00
-
70f50a1ec6
Fix scale assembly: use correctly-sized temp buffer for swizzle
biondizzle
2026-05-18 20:09:50 +00:00
-
97bdd604e9
Fix scale assembly: reshape swizzled output to 2D
biondizzle
2026-05-18 20:09:19 +00:00
-
c1aa4af123
Shared expert: dedicated CuTeDSL runner with proper scale assembly
biondizzle
2026-05-18 20:08:34 +00:00
-
b3451c74f8
Update README and CURRENT_BUG.md with current state
biondizzle
2026-05-18 20:05:03 +00:00
-
e8b289e30d
WIP: CuTeDSL shared expert kernel
biondizzle
2026-05-18 20:02:19 +00:00
-
1836e5fdc7
Add shared experts to post-quant BF16 dequant fix
biondizzle
2026-05-18 19:27:49 +00:00
-
82ac648563
Patch utils.py the standard way: copy modified file into Docker image
biondizzle
2026-05-18 19:10:08 +00:00
-
3c1a76bdcc
Fix Dockerfile: use external patch script instead of inline Python
biondizzle
2026-05-18 19:03:57 +00:00
-
75844a8361
Post-quant fix via Dockerfile patch to process_weights_after_loading
biondizzle
2026-05-18 18:35:34 +00:00
-
a4ad5898c1
Fix post-quant hook: register on inner model, fix module refs
biondizzle
2026-05-18 18:15:36 +00:00
-
a51edd238e
Add post-quant-init forward hook to fix attention NVFP4
biondizzle
2026-05-18 17:56:19 +00:00
-
2835cb040b
Fix input_scale BEFORE process_weights_after_loading runs
biondizzle
2026-05-18 16:43:44 +00:00
-
2fc81ccac4
Revert to BF16 dequant for attention NVFP4 (input_scale fix was too early)
biondizzle
2026-05-18 16:23:41 +00:00
-
4a57399592
Add debug prints for input_global_scale_inv check
biondizzle
2026-05-18 15:59:59 +00:00
-
f86892e26b
Replace BF16 dequant with input_scale warmup fix for attention NVFP4
biondizzle
2026-05-18 15:43:46 +00:00
-
301015b037
Remove all inline diagnostics — incompatible with torch.compile
biondizzle
2026-05-18 15:22:53 +00:00
-
a83d364d45
Switch to cudagraph_mode=NONE (not enforce-eager) for real inference testing
biondizzle
2026-05-18 15:05:52 +00:00
-
2a2a42c6d6
Add attention-internal diagnostics: MLA output, FP8 quant output
biondizzle
2026-05-18 14:45:43 +00:00
-
5c1dda10f6
Add granular attention diagnostics: pre/post attn, embed, dequant stats
biondizzle
2026-05-18 14:24:14 +00:00
-
e0e0528778
Add debug logging for BF16 dequant to find missing attrs
biondizzle
2026-05-18 14:04:12 +00:00
-
2e8c3c961f
Fix: dequantize fused_wqa_wkv instead of separate wq_a/wkv
biondizzle
2026-05-18 13:47:08 +00:00
-
a7216b27df
Fix: keep wo_a as FP8 (fp8_einsum path), dequant others to BF16
biondizzle
2026-05-18 13:22:15 +00:00
-
334e95047e
Fix: dequantize ALL attention NVFP4 projections to BF16
biondizzle
2026-05-18 13:09:36 +00:00
-
a83c332059
Fix docker-compose: remove orphaned compilation-config arg, enforce-eager mode
biondizzle
2026-05-18 12:54:14 +00:00
-
9e7639fba4
Add layer-by-layer diagnostic prints (CLAWMINE_DEBUG=1, enforce-eager)
biondizzle
2026-05-18 12:51:51 +00:00
-
2d1e9f42b1
Remove NaN check — incompatible with Dynamo fullgraph compilation
biondizzle
2026-05-18 12:17:25 +00:00
-
65763a200c
Fix NaN check: wrap in @torch.compiler.disable to prevent Dynamo graph break
biondizzle
2026-05-18 11:33:29 +00:00
-
8758bc93ca
crap shoot
biondizzle
2026-05-18 11:13:29 +00:00
-
b8df4a8cc5
Fix NaN check: use os.environ gate instead of is_current_stream_capturing
biondizzle
2026-05-18 02:20:14 +00:00
-
0c02d84514
Add NaN/Inf detection in DeepseekV4Model.forward layer loop
biondizzle
2026-05-17 23:37:12 +00:00
-
bedcfc4dab
Pipeline test: use max_num_tokens=8192 matching vLLM
biondizzle
2026-05-17 23:04:44 +00:00
-
c45364b3a8
Add MoE scale ratio output
biondizzle
2026-05-17 22:58:27 +00:00
-
bf99ad49ec
Print both MoE and residual cosine
biondizzle
2026-05-17 22:56:56 +00:00
-
8637020487
Fix multi-layer test: add residual connections
biondizzle
2026-05-17 22:55:40 +00:00
-
11dce13afe
Add multi-layer pipeline test to check error accumulation
biondizzle
2026-05-17 22:53:28 +00:00
-
87582fc9f7
HOTFIX: remove NaN checks from run() — torch.isnan().any() does CPU-GPU sync, breaks cudagraph
biondizzle
2026-05-17 22:28:32 +00:00
-
8717e0e411
Fix warmup: use same padded GEMM path as run(), add swiglu_limit clamping
biondizzle
2026-05-17 22:03:48 +00:00
-
d332f4f900
Add NaN debug checks after L1 and L2 GEMM
biondizzle
2026-05-17 22:02:24 +00:00
-
e65f2b2ba2
Update CURRENT_BUG.md with Bug 26 fix
biondizzle
2026-05-17 21:36:25 +00:00
-
72628fb689
Full pipeline test: runner vs BF16 reference
biondizzle
2026-05-17 21:29:16 +00:00
-
2796bd81e8
Fix: scatter FP4 as uint8 (float4 doesn't support index_put)
biondizzle
2026-05-17 21:28:04 +00:00
-
364f8372bb
Fix FP4 buffer shapes: D//2 for packed dimensions
biondizzle
2026-05-17 21:26:46 +00:00
-
5e4d674736
Test fix: quantize slot_hidden, scatter FP4, pass slot_x_sf
biondizzle
2026-05-17 21:25:58 +00:00
-
803e7160d8
Fix: allocate FP4 buffers as uint8 then view-cast
biondizzle
2026-05-17 21:25:04 +00:00
-
7256070dd3
FIX Bug 26: quantize slot tokens, not padded buffer
biondizzle
2026-05-17 21:24:43 +00:00
-
4d0b6d889d
Set runner weights before _ensure_stacked
biondizzle
2026-05-17 21:22:50 +00:00
-
b7acac5e4e
Call _ensure_stacked() before using runner buffers
biondizzle
2026-05-17 21:22:30 +00:00
-
1acf01fc1a
Fix token_indices: repeat each token ID top_k times, not arange
biondizzle
2026-05-17 21:22:11 +00:00
-
a478ca4746
Debug: trace runner logic step by step, test L1 GEMM
biondizzle
2026-05-17 21:21:45 +00:00
-
a100bd11c1
Simplify pipeline test: BF16 ref + bridge ref + full runner
biondizzle
2026-05-17 21:20:41 +00:00
-
6eade5e7f8
Fix: gs values are floats not tensors
biondizzle
2026-05-17 21:19:47 +00:00
-
b05a38a9bd
Test stages 1-2 first: sort + L1 GEMM
biondizzle
2026-05-17 21:19:23 +00:00
-
9728604ea1
Pipeline test: stage-by-stage with BF16 reference comparison
biondizzle
2026-05-17 21:19:17 +00:00
-
7fff5fd39b
Fix: correct intermediate_size=3072, weight key prefix, dequantize shapes
biondizzle
2026-05-17 21:18:20 +00:00
-
4ef345773d
Rewrite pipeline test: load real weights, step-by-step vs BF16 reference
biondizzle
2026-05-17 21:17:18 +00:00
-
b43541afdd
Fix test path setup
biondizzle
2026-05-17 21:00:00 +00:00
-
490ddfa294
Pipeline test: use synthetic weights at 256x512 (JIT at 7168x18432 hangs for hours)
biondizzle
2026-05-17 20:58:06 +00:00
-
c1bb551446
Fix weight loading: skip already-loaded experts correctly
biondizzle
2026-05-17 18:15:51 +00:00
-
955d7533f2
Use system Python for pipeline test (CuTeDSL in system site-packages)
biondizzle
2026-05-17 18:13:42 +00:00
-
925e390b93
Fix import: use direct import from vllm/ subdirectory
biondizzle
2026-05-17 18:12:53 +00:00
-
cd6144b832
Fix imports: all functions are in cutedsl.bridge, not separate modules
biondizzle
2026-05-17 18:11:03 +00:00
-
5e63a0d8a3
Rewrite pipeline test: use raw checkpoint weights, compare runner vs dynamic-gs reference
biondizzle
2026-05-17 18:10:05 +00:00
-
e51eafe288
Rewrite pipeline test: compare runner vs reference with real weights, step-by-step
biondizzle
2026-05-17 18:08:33 +00:00
-
e38d60a6e8
Add pipeline test with real model weights, add swiglu_limit to reference moe_pipeline
biondizzle
2026-05-17 18:07:44 +00:00
-
22e0370e6e
Fix AttributeError: DeepseekV4MegaMoEExperts has no swiglu_limit
biondizzle
2026-05-17 18:06:44 +00:00
-
6692166d0f
Update CURRENT_BUG.md: Bug 25 (swiglu_limit), shared expert path verification, variable padded offsets
biondizzle
2026-05-17 17:56:04 +00:00
-
a10c582cf4
Add swiglu_limit=10.0 activation clamping (was missing)
biondizzle
2026-05-17 17:52:16 +00:00
-
3f2f4e1882
Fix cudaErrorStreamCaptureUnsupported: no dynamic GPU-tensor slicing
biondizzle
2026-05-17 17:24:26 +00:00
-
11b5aa5e37
Scale assembly: full-buffer swizzle, zero CPU syncs, no Python loops
biondizzle
2026-05-17 16:59:51 +00:00
-
94dec5922d
Scale assembly Phase 2: use CPU-computed offsets for Python slicing
biondizzle
2026-05-17 16:56:52 +00:00
-
49c28e6562
Fix: use real padded expert offsets instead of fixed layout
biondizzle
2026-05-17 16:55:47 +00:00
-
87a223f1ac
Update CURRENT_BUG.md: current status, outstanding garbage output issue, hypotheses
biondizzle
2026-05-17 16:52:40 +00:00
-
c03438fc4e
crap shoot
biondizzle
2026-05-17 16:25:38 +00:00
-
7c16f3cb46
Fix: init shared dict before using it, remove duplicate _output_buf
biondizzle
2026-05-17 16:06:58 +00:00
-
ea8acf9852
Share padded_x_sf and output buffers across layers to save ~300 MB
biondizzle
2026-05-17 16:05:53 +00:00
-
3d0b1408b4
Update CURRENT_BUG.md: Bug 21 (shared buffers), clean up status
biondizzle
2026-05-17 15:52:06 +00:00
-
455ecb5631
Fix: define padded_max_slots before using it in shared buffer allocation
biondizzle
2026-05-17 15:47:38 +00:00
-
b1ac74bb4d
Fix shape mismatch: shared padded buffers, revert max_num_tokens cap
biondizzle
2026-05-17 15:47:10 +00:00
-
e2f33596a2
Update CURRENT_BUG.md: status through Bug 20, fixed-layout padding architecture
biondizzle
2026-05-17 15:46:13 +00:00
-
faf7c8cc51
Debug: print runner max_num_tokens and max_chunks
biondizzle
2026-05-17 15:18:07 +00:00
-
c5af1aba6b
Fix OOB: size padded buffers for num_experts*max_chunks*128
biondizzle
2026-05-17 14:59:45 +00:00
-
8ac8e20fa9
Fix OOM: cap buffer pre-allocation at cudagraph max capture size
biondizzle
2026-05-17 14:14:13 +00:00
-
5bb78564f5
Remove dynamic tensor allocation in scale assembly (cudagraph fix)
biondizzle
2026-05-17 14:01:32 +00:00
-
8c31e78359
Fix cudagraph: fully fixed-layout per-expert sections, no GPU scalars in Python control flow
biondizzle
2026-05-17 13:58:58 +00:00
-
ff74b33d2c
Fix cudagraph: static loop for per-expert scale swizzle
biondizzle
2026-05-17 13:56:52 +00:00
-
bf22b6f0e4
Fix scale assembly: variable-size per-expert padding matching GEMM offsets
biondizzle
2026-05-17 13:55:10 +00:00
-
0d3c928ff2
Update CURRENT_BUG.md: full status through Bug 14, vLLM integration status, architecture docs
biondizzle
2026-05-17 13:32:41 +00:00
-
bde81b95f4
Fix GEMM scale layout: pad to 128 tokens per expert
biondizzle
2026-05-17 13:19:31 +00:00
-
7e692c3aec
Fix cudaErrorStreamCaptureUnsupported: pre-allocate all tensors used during capture
biondizzle
2026-05-17 12:31:25 +00:00
-
b0221662e7
Fix warmup: pass local expert IDs (not global), remove incorrect _warmup_done guard
biondizzle
2026-05-17 11:38:19 +00:00
-
b531a98f8f
Fix scale assembly: per-expert 128-row fixed slots, no dynamic sizing
biondizzle
2026-05-17 11:10:59 +00:00
-
04245b664b
Add warmup-based activation global scale computation in finalize_weights
biondizzle
2026-05-17 10:48:24 +00:00
-
4445882ba7
Fix: return 2D scale tensor for GEMM (shape[1] access)
biondizzle
2026-05-17 09:59:57 +00:00