nvfp4-megamoe-kernel

Author	SHA1	Message	Date
biondizzle	f74447bfd0	Proper NVFP4 integration: quantized compressor/indexer + mapper fixes Weight mapper fixes: - Reorder substr renames: compressor renames first, then .self_attn.compressor. → .attn.mla_attn.compressor., then indexer renames (so indexer keys end up under mla_attn after the compressor rename already fired) - Add compressor param renames: kv_proj→wkv, gate_proj→wgate, kv_norm→norm, position_bias→ape (checkpoint uses NVFP4 naming, model uses internal names) - Add indexer param renames: q_b_proj→wq_b, kv_proj→compressor.wkv, gate_proj→compressor.wgate, kv_norm→k_norm, position_bias→compressor.ape, weights_proj stays (structural: compressor.indexer → indexer.compressor) - Remove broken suffix renames (already fixed in prior commit) Model architecture fixes: - Patch deepseek_compressor.py to pass quant_config (was None, but NVFP4 checkpoint has quantized compressor weights with input_scale/weight_scale) - Patch deepseek_v4_attention.py indexer: weights_proj now uses quant_config (was None, but checkpoint has quantized weights) - Add indexer.compressor.fused_wkv_wgate stacking in load_weights Infrastructure: - Add deepseek_compressor.py to Dockerfile - Force MoE backend to flashinfer_cutedsl (was auto-selecting FLASHINFER_TRTLLM) - Update unit test to 50 cases (compressor + indexer + quantization scales)	2026-05-18 23:20:13 +00:00
biondizzle	17496b2615	Fix NVFP4 weights mapper: add prefix mappings, fix substr order - Add orig_to_new_prefix mappings (layers→model.layers, embed_tokens→model.embed_tokens, etc.) AutoWeightsLoader strips the model. prefix before the mapper runs, so these are required - Move .self_attn.compressor. → .attn.mla_attn.compressor. before .self_attn. → .attn. in substr_renames so compressor keys get the mla_attn prefix before the general rename - Remove suffix renames (head.weight→lm_head.weight, embed.weight→embed_tokens.weight) that were causing double-mapping since the NVFP4 checkpoint already uses lm_head/embed_tokens - Add unit test: tests/test_nvfp4_mapper.py (39 cases, no vLLM/CUDA needed)	2026-05-18 23:03:34 +00:00
biondizzle	6ce6a47be9	Add NVFP4 linear runner + attention projection test - CuTeDSLNvfp4Linear: generic single-GEMM runner for any NVFP4 projection - test_attention.py: tests q_a_proj, q_b_proj, kv_proj, o_b_proj vs BF16 - Same pad+swizzle pattern as shared expert, but no SiLU/fusion	2026-05-18 20:14:03 +00:00
biondizzle	f07643791e	Fix hidden_size: shared expert uses 7168, not HC_DIM 28672	2026-05-18 20:10:32 +00:00
biondizzle	c1aa4af123	Shared expert: dedicated CuTeDSL runner with proper scale assembly - CuTeDSLSharedExpertRunner: num_groups=1 GEMM, no scatter/routing - _assemble_scales_single_group: pad to 128 rows + Blackwell swizzle - All buffers pre-allocated for cudagraph compatibility - Updated test to use dedicated runner instead of MoE runner hack	2026-05-18 20:08:34 +00:00
biondizzle	e8b289e30d	WIP: CuTeDSL shared expert kernel. Dedicated runner (shared_expert_pipeline.py) and test (test_shared_expert.py). Tried reusing MoE runner with 1 expert; this fails because MoE runner assumes hidden_size != HC_DIM for scatter. Need dedicated runner with correct scale assembly. Will continue tomorrow.	2026-05-18 20:02:19 +00:00
biondizzle	bedcfc4dab	Pipeline test: use max_num_tokens=8192 matching vLLM	2026-05-17 23:04:44 +00:00
biondizzle	c45364b3a8	Add MoE scale ratio output	2026-05-17 22:58:27 +00:00
biondizzle	bf99ad49ec	Print both MoE and residual cosine	2026-05-17 22:56:56 +00:00
biondizzle	8637020487	Fix multi-layer test: add residual connections	2026-05-17 22:55:40 +00:00
biondizzle	11dce13afe	Add multi-layer pipeline test to check error accumulation	2026-05-17 22:53:28 +00:00
biondizzle	72628fb689	Full pipeline test: runner vs BF16 reference	2026-05-17 21:29:16 +00:00
biondizzle	2796bd81e8	Fix: scatter FP4 as uint8 (float4 doesn't support index_put)	2026-05-17 21:28:04 +00:00
biondizzle	364f8372bb	Fix FP4 buffer shapes: D//2 for packed dimensions	2026-05-17 21:26:46 +00:00
biondizzle	5e4d674736	Test fix: quantize slot_hidden, scatter FP4, pass slot_x_sf	2026-05-17 21:25:58 +00:00
biondizzle	4d0b6d889d	Set runner weights before _ensure_stacked	2026-05-17 21:22:50 +00:00
biondizzle	b7acac5e4e	Call _ensure_stacked() before using runner buffers	2026-05-17 21:22:30 +00:00
biondizzle	1acf01fc1a	Fix token_indices: repeat each token ID top_k times, not arange	2026-05-17 21:22:11 +00:00
biondizzle	a478ca4746	Debug: trace runner logic step by step, test L1 GEMM	2026-05-17 21:21:45 +00:00
biondizzle	a100bd11c1	Simplify pipeline test: BF16 ref + bridge ref + full runner	2026-05-17 21:20:41 +00:00
biondizzle	6eade5e7f8	Fix: gs values are floats not tensors	2026-05-17 21:19:47 +00:00
biondizzle	b05a38a9bd	Test stages 1-2 first: sort + L1 GEMM	2026-05-17 21:19:23 +00:00
biondizzle	9728604ea1	Pipeline test: stage-by-stage with BF16 reference comparison	2026-05-17 21:19:17 +00:00
biondizzle	7fff5fd39b	Fix: correct intermediate_size=3072, weight key prefix, dequantize shapes	2026-05-17 21:18:20 +00:00
biondizzle	4ef345773d	Rewrite pipeline test: load real weights, step-by-step vs BF16 reference	2026-05-17 21:17:18 +00:00
biondizzle	b43541afdd	Fix test path setup	2026-05-17 21:00:00 +00:00
biondizzle	490ddfa294	Pipeline test: use synthetic weights at 256x512 (JIT at 7168x18432 hangs for hours)	2026-05-17 20:58:06 +00:00
biondizzle	c1bb551446	Fix weight loading: skip already-loaded experts correctly	2026-05-17 18:15:51 +00:00
biondizzle	955d7533f2	Use system Python for pipeline test (CuTeDSL in system site-packages)	2026-05-17 18:13:42 +00:00
biondizzle	925e390b93	Fix import: use direct import from vllm/ subdirectory	2026-05-17 18:12:53 +00:00
biondizzle	cd6144b832	Fix imports: all functions are in cutedsl.bridge, not separate modules	2026-05-17 18:11:03 +00:00
biondizzle	5e63a0d8a3	Rewrite pipeline test: use raw checkpoint weights, compare runner vs dynamic-gs reference	2026-05-17 18:10:05 +00:00
biondizzle	e51eafe288	Rewrite pipeline test: compare runner vs reference with real weights, step-by-step	2026-05-17 18:08:33 +00:00
biondizzle	e38d60a6e8	Add pipeline test with real model weights, add swiglu_limit to reference moe_pipeline	2026-05-17 18:07:44 +00:00
biondizzle	87a223f1ac	Update CURRENT_BUG.md: current status, outstanding garbage output issue, hypotheses	2026-05-17 16:52:40 +00:00
biondizzle	33e28100ee	test: use runner's built-in warmup method	2026-05-17 08:24:27 +00:00
biondizzle	8c9a51e006	fix: call _ensure_stacked in warmup test	2026-05-17 08:07:09 +00:00
biondizzle	5ba77e355f	test: warmup gs computation with safety margin sweep	2026-05-17 08:06:27 +00:00
biondizzle	37fecb588f	fix: separate L1/L2 scale buffers (different K_sf), fix assembly calls	2026-05-17 07:43:05 +00:00
biondizzle	8dadd9a723	test: scale assembly debug	2026-05-17 07:37:47 +00:00
biondizzle	7b95e76723	test: runner vs pipeline comparison + scale assembly comparison	2026-05-17 07:33:20 +00:00
biondizzle	cc75a55bd9	restore: new bridge/moe_pipeline/layertest	2026-05-16 19:55:19 +00:00
biondizzle	0c878b3a9e	temp: restore old layertest+bridge for cosine comparison	2026-05-16 19:54:04 +00:00
biondizzle	d15c43294b	fix: test L2 weight N dim should be hidden_size, not hidden_size//2	2026-05-16 19:07:36 +00:00
biondizzle	28788c6f55	fix: L1 weight N dimension is 2*intermediate (gate+up), not intermediate. float4_e2m1fn_x2 packs 2 values per byte along K, not N. The GEMM output N dimension is the logical N from mat_b.shape[2], not 2x packed. Previous n_dim*2 was wrong; it accidentally worked in the test because intermediate_size*2 == 2*intermediate_size. Real model with N=9216 exposed the bug.	2026-05-16 19:07:08 +00:00
biondizzle	54c470e535	fix: use float16->float8 cast for rand_sf (torch.rand doesn't support float8)	2026-05-16 18:13:14 +00:00
biondizzle	f2de95c526	fix: use randint for float4 dummy weights in cudagraph test	2026-05-16 18:08:45 +00:00
biondizzle	f66d4b69a4	GPU-only scale assembly + cudagraph test harness - assemble_activation_scales_gpu: builds padded+swizzled scale tensor without .item() or .tolist() CPU syncs. Uses GPU index arange + cat + single scatter instead of per-expert Python slicing. - Still has a for e in range(num_experts) loop but num_experts is compile-time constant so torch.compile unrolls it. - Added tests/cudagraph_test.py: attempts CUDA graph capture on the MoE runner, diagnoses sync violations with patched torch functions. - Removed the if total_slots == 0 early return (Python control flow on GPU data)	2026-05-16 18:05:13 +00:00
biondizzle	a0ff8a3278	fix: transpose checkpoint block scales (N,K_sf)→(K_sf,N) for bridge. The bridge's assemble_scales_3d_side expects (K_sf, N) input and transposes to (N, K_sf) internally before swizzling. The checkpoint stores scales as (N, K_sf). Without this transpose, the kernel was reading completely wrong scale data, and cosine dropped to 0.713. Also fixed dual global scale normalization: after transpose, gate/up are along dim 1 (columns), not dim 0 (rows).	2026-05-16 03:43:30 +00:00
biondizzle	389453fbf4	feat: direct NVFP4 path — no BF16 round-trip on weights finalize_weights() now view-casts checkpoint uint8 → float4_e2m1fn_x2 directly. Block scales (float8_e4m3fn) and global scales (float32) pass through unchanged. Zero precision loss on the weights themselves. L1 dual global scale handling: gate and up have different global scales. Normalize to max(gate_gs, up_gs) and fold the ratio into block scales via float32 (one multiply + float8 round-trip on the RATIO only — much better than dequantizing the entire weight matrix). layertest.py: updated to test direct path. Expect cosine improvement from 0.989 → 0.995+ (matching the L1-only result).	2026-05-16 03:41:23 +00:00
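
The two mapper commits (f74447bfd0, 17496b2615) both depend on the order of substring renames. The sketch below shows why, assuming the mapper applies every (old, new) pair to a key in list order. The helper `apply_substr_renames` and the example key are hypothetical and are not taken from the repository; only the two rename pairs come from the commit messages.

```python
def apply_substr_renames(key, renames):
    """Apply each (old, new) substring rename to key, in list order (assumed behaviour)."""
    for old, new in renames:
        key = key.replace(old, new)
    return key


# Specific rule first, general rule second: the order the commits describe.
GOOD_ORDER = [
    (".self_attn.compressor.", ".attn.mla_attn.compressor."),
    (".self_attn.", ".attn."),
]
# Reversed order: the general rule consumes ".self_attn." before the
# compressor rule gets a chance to match.
BAD_ORDER = list(reversed(GOOD_ORDER))

key = "layers.0.self_attn.compressor.kv_proj.weight"  # hypothetical checkpoint key
good = apply_substr_renames(key, GOOD_ORDER)
bad = apply_substr_renames(key, BAD_ORDER)

print(good)  # compressor key lands under mla_attn
print(bad)   # compressor key misses the mla_attn prefix
```

With the general rule first, the compressor key never receives the `mla_attn` prefix, which matches the failure mode the commit messages describe.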


71 Commits
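
Commits 364f8372bb and 28788c6f55 both concern the packed FP4 layout: two 4-bit values share one byte along K, so only the K dimension is halved in storage while N stays logical. A minimal pure-Python sketch of that bookkeeping follows. The low-nibble-first order and the helper names are assumptions for illustration, not the kernel's actual layout code, and K=7168 is used only as an example value.

```python
def pack_fp4_codes(row):
    """Pack a row of 4-bit codes (0..15) into bytes, two codes per byte.

    Assumption: the first code of each pair goes in the low nibble.
    """
    if len(row) % 2 != 0:
        raise ValueError("K must be even to pack two FP4 values per byte")
    return bytes(lo | (hi << 4) for lo, hi in zip(row[0::2], row[1::2]))


def packed_weight_shape(n, k):
    """Storage shape of an (N, K) FP4 weight: only K is halved, N is unchanged."""
    return (n, k // 2)


packed = pack_fp4_codes([1, 2, 3, 4])      # 4 codes along K -> 2 bytes
shape = packed_weight_shape(9216, 7168)    # N=9216 from the commit; K illustrative
print(packed.hex(), shape)
```

Reading the GEMM output N from the packed tensor's N dimension directly, rather than doubling it, follows from this layout.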