nvfp4-megamoe-kernel

biondizzle/nvfp4-megamoe-kernel

Commit Graph

Branches:
master
pre-b1
pure-nvfp4
v-b1-b2-done-20260603
v-c1-c2-c3-20260602
v-e2e-nvfp4-all-projections
v-e2e-paris-32tok-20260601-0549
v-indexer-fix-20260602
v-nvfp4-fused-router-rewrite-20260601-0715
v-nvfp4-router-oa-20260601-0610
v-official-encoding-path
v-p0p1p2p3-fused-swiglu-cuda-rope-20260602
v-perf-part1-p2-reverted-20260602
v-post-indexer-c-fixes-20260602
v-precision-floor-fix-20260603
v-single-shot-paris-20260601-0539
v-working-e2e-20260601-0515
v0.1-e2e-working

48386e34ad Fix torch.compile: use custom autograd Function instead of @torch.compiler.disable biondizzle 2026-05-18 21:38:28 +00:00
85e1cd3b69 Fix torch.compile crash: @torch.compiler.disable on all CuTeDSL run() biondizzle 2026-05-18 21:07:35 +00:00
a94011ec92 Fix torch.compile crash: remove threading.Lock from LUT cache path biondizzle 2026-05-18 20:54:55 +00:00
6326222d68 Fix: add abstract create_weights to CuTeDSLNvfp4LinearMethod biondizzle 2026-05-18 20:40:48 +00:00
450793311c Wire CuTeDSL kernels into vLLM: replace all BF16 dequant with native NVFP4 biondizzle 2026-05-18 20:27:42 +00:00
6ce6a47be9 Add NVFP4 linear runner + attention projection test biondizzle 2026-05-18 20:14:03 +00:00
f07643791e Fix hidden_size: shared expert uses 7168, not HC_DIM 28672 biondizzle 2026-05-18 20:10:32 +00:00
70f50a1ec6 Fix scale assembly: use correctly-sized temp buffer for swizzle biondizzle 2026-05-18 20:09:50 +00:00
97bdd604e9 Fix scale assembly: reshape swizzled output to 2D biondizzle 2026-05-18 20:09:19 +00:00
c1aa4af123 Shared expert: dedicated CuTeDSL runner with proper scale assembly biondizzle 2026-05-18 20:08:34 +00:00
b3451c74f8 Update README and CURRENT_BUG.md with current state biondizzle 2026-05-18 20:05:03 +00:00
e8b289e30d WIP: CuTeDSL shared expert kernel biondizzle 2026-05-18 20:02:19 +00:00
1836e5fdc7 Add shared experts to post-quant BF16 dequant fix biondizzle 2026-05-18 19:27:49 +00:00
82ac648563 Patch utils.py the standard way: copy modified file into Docker image biondizzle 2026-05-18 19:10:08 +00:00
3c1a76bdcc Fix Dockerfile: use external patch script instead of inline Python biondizzle 2026-05-18 19:03:57 +00:00
75844a8361 Post-quant fix via Dockerfile patch to process_weights_after_loading biondizzle 2026-05-18 18:35:34 +00:00
a4ad5898c1 Fix post-quant hook: register on inner model, fix module refs biondizzle 2026-05-18 18:15:36 +00:00
a51edd238e Add post-quant-init forward hook to fix attention NVFP4 biondizzle 2026-05-18 17:56:19 +00:00
2835cb040b Fix input_scale BEFORE process_weights_after_loading runs biondizzle 2026-05-18 16:43:44 +00:00
2fc81ccac4 Revert to BF16 dequant for attention NVFP4 (input_scale fix was too early) biondizzle 2026-05-18 16:23:41 +00:00
4a57399592 Add debug prints for input_global_scale_inv check biondizzle 2026-05-18 15:59:59 +00:00
f86892e26b Replace BF16 dequant with input_scale warmup fix for attention NVFP4 biondizzle 2026-05-18 15:43:46 +00:00
301015b037 Remove all inline diagnostics — incompatible with torch.compile biondizzle 2026-05-18 15:22:53 +00:00
a83d364d45 Switch to cudagraph_mode=NONE (not enforce-eager) for real inference testing biondizzle 2026-05-18 15:05:52 +00:00
2a2a42c6d6 Add attention-internal diagnostics: MLA output, FP8 quant output biondizzle 2026-05-18 14:45:43 +00:00
5c1dda10f6 Add granular attention diagnostics: pre/post attn, embed, dequant stats biondizzle 2026-05-18 14:24:14 +00:00
e0e0528778 Add debug logging for BF16 dequant to find missing attrs biondizzle 2026-05-18 14:04:12 +00:00
2e8c3c961f Fix: dequantize fused_wqa_wkv instead of separate wq_a/wkv biondizzle 2026-05-18 13:47:08 +00:00
a7216b27df Fix: keep wo_a as FP8 (fp8_einsum path), dequant others to BF16 biondizzle 2026-05-18 13:22:15 +00:00
334e95047e Fix: dequantize ALL attention NVFP4 projections to BF16 biondizzle 2026-05-18 13:09:36 +00:00
a83c332059 Fix docker-compose: remove orphaned compilation-config arg, enforce-eager mode biondizzle 2026-05-18 12:54:14 +00:00
9e7639fba4 Add layer-by-layer diagnostic prints (CLAWMINE_DEBUG=1, enforce-eager) biondizzle 2026-05-18 12:51:51 +00:00
2d1e9f42b1 Remove NaN check — incompatible with Dynamo fullgraph compilation biondizzle 2026-05-18 12:17:25 +00:00
65763a200c Fix NaN check: wrap in @torch.compiler.disable to prevent Dynamo graph break biondizzle 2026-05-18 11:33:29 +00:00
8758bc93ca crap shoot biondizzle 2026-05-18 11:13:29 +00:00
b8df4a8cc5 Fix NaN check: use os.environ gate instead of is_current_stream_capturing biondizzle 2026-05-18 02:20:14 +00:00
0c02d84514 Add NaN/Inf detection in DeepseekV4Model.forward layer loop biondizzle 2026-05-17 23:37:12 +00:00
bedcfc4dab Pipeline test: use max_num_tokens=8192 matching vLLM biondizzle 2026-05-17 23:04:44 +00:00
c45364b3a8 Add MoE scale ratio output biondizzle 2026-05-17 22:58:27 +00:00
bf99ad49ec Print both MoE and residual cosine biondizzle 2026-05-17 22:56:56 +00:00
8637020487 Fix multi-layer test: add residual connections biondizzle 2026-05-17 22:55:40 +00:00
11dce13afe Add multi-layer pipeline test to check error accumulation biondizzle 2026-05-17 22:53:28 +00:00
87582fc9f7 HOTFIX: remove NaN checks from run() — torch.isnan().any() does CPU-GPU sync, breaks cudagraph biondizzle 2026-05-17 22:28:32 +00:00
8717e0e411 Fix warmup: use same padded GEMM path as run(), add swiglu_limit clamping biondizzle 2026-05-17 22:03:48 +00:00
d332f4f900 Add NaN debug checks after L1 and L2 GEMM biondizzle 2026-05-17 22:02:24 +00:00
e65f2b2ba2 Update CURRENT_BUG.md with Bug 26 fix biondizzle 2026-05-17 21:36:25 +00:00
72628fb689 Full pipeline test: runner vs BF16 reference biondizzle 2026-05-17 21:29:16 +00:00
2796bd81e8 Fix: scatter FP4 as uint8 (float4 doesn't support index_put) biondizzle 2026-05-17 21:28:04 +00:00
364f8372bb Fix FP4 buffer shapes: D//2 for packed dimensions biondizzle 2026-05-17 21:26:46 +00:00
5e4d674736 Test fix: quantize slot_hidden, scatter FP4, pass slot_x_sf biondizzle 2026-05-17 21:25:58 +00:00
803e7160d8 Fix: allocate FP4 buffers as uint8 then view-cast biondizzle 2026-05-17 21:25:04 +00:00
7256070dd3 FIX Bug 26: quantize slot tokens, not padded buffer biondizzle 2026-05-17 21:24:43 +00:00
4d0b6d889d Set runner weights before _ensure_stacked biondizzle 2026-05-17 21:22:50 +00:00
b7acac5e4e Call _ensure_stacked() before using runner buffers biondizzle 2026-05-17 21:22:30 +00:00
1acf01fc1a Fix token_indices: repeat each token ID top_k times, not arange biondizzle 2026-05-17 21:22:11 +00:00
a478ca4746 Debug: trace runner logic step by step, test L1 GEMM biondizzle 2026-05-17 21:21:45 +00:00
a100bd11c1 Simplify pipeline test: BF16 ref + bridge ref + full runner biondizzle 2026-05-17 21:20:41 +00:00
6eade5e7f8 Fix: gs values are floats not tensors biondizzle 2026-05-17 21:19:47 +00:00
b05a38a9bd Test stages 1-2 first: sort + L1 GEMM biondizzle 2026-05-17 21:19:23 +00:00
9728604ea1 Pipeline test: stage-by-stage with BF16 reference comparison biondizzle 2026-05-17 21:19:17 +00:00
7fff5fd39b Fix: correct intermediate_size=3072, weight key prefix, dequantize shapes biondizzle 2026-05-17 21:18:20 +00:00
4ef345773d Rewrite pipeline test: load real weights, step-by-step vs BF16 reference biondizzle 2026-05-17 21:17:18 +00:00
b43541afdd Fix test path setup biondizzle 2026-05-17 21:00:00 +00:00
490ddfa294 Pipeline test: use synthetic weights at 256x512 (JIT at 7168x18432 hangs for hours) biondizzle 2026-05-17 20:58:06 +00:00
c1bb551446 Fix weight loading: skip already-loaded experts correctly biondizzle 2026-05-17 18:15:51 +00:00
955d7533f2 Use system Python for pipeline test (CuTeDSL in system site-packages) biondizzle 2026-05-17 18:13:42 +00:00
925e390b93 Fix import: use direct import from vllm/ subdirectory biondizzle 2026-05-17 18:12:53 +00:00
cd6144b832 Fix imports: all functions are in cutedsl.bridge, not separate modules biondizzle 2026-05-17 18:11:03 +00:00
5e63a0d8a3 Rewrite pipeline test: use raw checkpoint weights, compare runner vs dynamic-gs reference biondizzle 2026-05-17 18:10:05 +00:00
e51eafe288 Rewrite pipeline test: compare runner vs reference with real weights, step-by-step biondizzle 2026-05-17 18:08:33 +00:00
e38d60a6e8 Add pipeline test with real model weights, add swiglu_limit to reference moe_pipeline biondizzle 2026-05-17 18:07:44 +00:00
22e0370e6e Fix AttributeError: DeepseekV4MegaMoEExperts has no swiglu_limit biondizzle 2026-05-17 18:06:44 +00:00
6692166d0f Update CURRENT_BUG.md: Bug 25 (swiglu_limit), shared expert path verification, variable padded offsets biondizzle 2026-05-17 17:56:04 +00:00
a10c582cf4 Add swiglu_limit=10.0 activation clamping (was missing) biondizzle 2026-05-17 17:52:16 +00:00
3f2f4e1882 Fix cudaErrorStreamCaptureUnsupported: no dynamic GPU-tensor slicing biondizzle 2026-05-17 17:24:26 +00:00
11b5aa5e37 Scale assembly: full-buffer swizzle, zero CPU syncs, no Python loops biondizzle 2026-05-17 16:59:51 +00:00
94dec5922d Scale assembly Phase 2: use CPU-computed offsets for Python slicing biondizzle 2026-05-17 16:56:52 +00:00
49c28e6562 Fix: use real padded expert offsets instead of fixed layout biondizzle 2026-05-17 16:55:47 +00:00
87a223f1ac Update CURRENT_BUG.md: current status, outstanding garbage output issue, hypotheses biondizzle 2026-05-17 16:52:40 +00:00
c03438fc4e crap shoot biondizzle 2026-05-17 16:25:38 +00:00
7c16f3cb46 Fix: init shared dict before using it, remove duplicate _output_buf biondizzle 2026-05-17 16:06:58 +00:00
ea8acf9852 Share padded_x_sf and output buffers across layers to save ~300 MB biondizzle 2026-05-17 16:05:53 +00:00
3d0b1408b4 Update CURRENT_BUG.md: Bug 21 (shared buffers), clean up status biondizzle 2026-05-17 15:52:06 +00:00
455ecb5631 Fix: define padded_max_slots before using it in shared buffer allocation biondizzle 2026-05-17 15:47:38 +00:00
b1ac74bb4d Fix shape mismatch: shared padded buffers, revert max_num_tokens cap biondizzle 2026-05-17 15:47:10 +00:00
e2f33596a2 Update CURRENT_BUG.md: status through Bug 20, fixed-layout padding architecture biondizzle 2026-05-17 15:46:13 +00:00
faf7c8cc51 Debug: print runner max_num_tokens and max_chunks biondizzle 2026-05-17 15:18:07 +00:00
c5af1aba6b Fix OOB: size padded buffers for num_experts*max_chunks*128 biondizzle 2026-05-17 14:59:45 +00:00
8ac8e20fa9 Fix OOM: cap buffer pre-allocation at cudagraph max capture size biondizzle 2026-05-17 14:14:13 +00:00
5bb78564f5 Remove dynamic tensor allocation in scale assembly (cudagraph fix) biondizzle 2026-05-17 14:01:32 +00:00
8c31e78359 Fix cudagraph: fully fixed-layout per-expert sections, no GPU scalars in Python control flow biondizzle 2026-05-17 13:58:58 +00:00
ff74b33d2c Fix cudagraph: static loop for per-expert scale swizzle biondizzle 2026-05-17 13:56:52 +00:00
bf22b6f0e4 Fix scale assembly: variable-size per-expert padding matching GEMM offsets biondizzle 2026-05-17 13:55:10 +00:00
0d3c928ff2 Update CURRENT_BUG.md: full status through Bug 14, vLLM integration status, architecture docs biondizzle 2026-05-17 13:32:41 +00:00
bde81b95f4 Fix GEMM scale layout: pad to 128 tokens per expert biondizzle 2026-05-17 13:19:31 +00:00
7e692c3aec Fix cudaErrorStreamCaptureUnsupported: pre-allocate all tensors used during capture biondizzle 2026-05-17 12:31:25 +00:00
b0221662e7 Fix warmup: pass local expert IDs (not global), remove incorrect _warmup_done guard biondizzle 2026-05-17 11:38:19 +00:00
b531a98f8f Fix scale assembly: per-expert 128-row fixed slots, no dynamic sizing biondizzle 2026-05-17 11:10:59 +00:00
04245b664b Add warmup-based activation global scale computation in finalize_weights biondizzle 2026-05-17 10:48:24 +00:00
4445882ba7 Fix: return 2D scale tensor for GEMM (shape[1] access) biondizzle 2026-05-17 09:59:57 +00:00
