nvfp4-megamoe-kernel

Author	SHA1	Message	Date
biondizzle	c289c44920	Fix BF16 wo_a: per-group BMM instead of flat linear. The BF16 wo_a path was calling self.wo_a(o_inv.reshape(num_tokens, -1)), which flattens across groups: (num_tokens, n_local_heads*head_dim)=(tokens, 8192). But wo_a is a BMM with in_features=n_heads*head_dim/n_groups=4096. The FP8 path handles this via einsum 'bhr,hdr->bhd' with per-group shapes. The BF16 path now does the same: reshape o_inv to per-group format, do torch.bmm, then reshape output and handle TP all-gather manually.	2026-05-19 04:10:02 +00:00
biondizzle	6f9a400ae0	Fix hc_head mapping: checkpoint uses hc_head.hc_fn, model params are flat hc_head_fn - Removed hc_head prefix mapping (checkpoint already has model.hc_head.*) - Fixed substr: hc_head.hc_fn→hc_head_fn (not hc_head.fn→hc_head_fn) - The model has self.hc_head_fn as flat params, not inside a sub-module	2026-05-19 03:58:25 +00:00
biondizzle	4cf5b8b751	Fix compressor path: attn.mla_attn.compressor (not attn.compressor). The compressor is inside mla_attn, not directly on the attention wrapper. Debug output confirmed: layers.0.attn.mla_attn.compressor.fused_wkv_wgate.*	2026-05-19 03:47:26 +00:00
biondizzle	fece06f746	Add unit tests for NVFP4 weight mapper and inverse RoPE BF16	2026-05-19 03:22:00 +00:00
biondizzle	b856ee9315	Clean up debug scripts	2026-05-19 02:47:29 +00:00
biondizzle	8fe5546bb3	Fix debug script	2026-05-19 02:43:17 +00:00
biondizzle	788f0aa65a	Add step-by-step debug for wo_a	2026-05-19 02:43:05 +00:00
biondizzle	77e4970d93	Add debug script for wo_a quantization	2026-05-19 02:40:43 +00:00
biondizzle	80122b850b	Add debug script for wo_a	2026-05-19 02:39:55 +00:00
biondizzle	ae233ab648	Fix test: cos_sin_cache on CUDA device	2026-05-19 02:37:50 +00:00
biondizzle	882d4996ff	Replace DeepGEMM fp8_einsum with CuTeDSL NVFP4 for wo_a (o_proj). The B200 container crashes in DeepGEMM's fp8_einsum (t.dim() == N assertion in layout.hpp:39) when processing wo_a (o-projection first half) in the attention layer. The crash is caused by a scale tensor dimension mismatch for the SM100 recipe (1, 1, 128). Instead of fighting DeepGEMM, replace the entire wo_a path with our own CuTeDSL NVFP4 kernel: 1. inverse_rope_bf16(): Python implementation of inverse RoPE (replaces fused_inv_rope_fp8_quant CUDA kernel) 2. CuTeDSLNvfp4WoA: NVFP4 grouped linear for wo_a using ScaledGroupedGemm with n_local_groups=8 groups 3. wo_a weight quantized to NVFP4 instead of FP8 (native NVFP4, no conversion to another quantization). Changes: - cutedsl/inverse_rope.py: BF16 inverse RoPE (conjugate rotation) - cutedsl/wo_a_grouped_linear.py: CuTeDSL NVFP4 grouped GEMM for wo_a - vllm/patches/deepseek_v4_attention.py: Use NVFP4 path when runner is initialized, keep DeepGEMM fallback - vllm/patches/deepseek_v4.py: Init NVFP4 runner instead of FP8 quant - tests/test_wo_a.py: Unit test for inverse RoPE + wo_a GEMM	2026-05-19 02:36:30 +00:00
biondizzle	00fe63b56f	Fix compile test: add warmup for activation global scales	2026-05-19 01:57:16 +00:00
biondizzle	bba3bca4d3	Add torch.compile + custom op integration test	2026-05-19 01:56:46 +00:00
biondizzle	35fab6cff3	Replace autograd.Function with torch.library.custom_op for Dynamo compat. Dynamo (torch.compile fullgraph) cannot trace through CuTeDSL internals (cute.compile, JIT, etc.). The autograd.Function approach was unreliable with fullgraph mode: Dynamo would still try to trace through it. Fix: torch.library.custom_op makes Dynamo treat our GEMM as an opaque black box. No reimplementing the kernel; just route through the existing runner via a registry pattern: - Runners registered in global dict with integer IDs - Custom op takes (tensors, runner_id, shape_hint) -> tensor - Dynamo calls fake impl for shape inference, never touches the runner - At execution time, real impl looks up runner and calls _run_impl. Changes: - New: cutedsl/custom_ops.py (custom op definitions + registry) - New: tests/test_custom_op.py (local unit tests, no GPU needed) - Removed: _Nvfp4LinearApply, _MoEApply (autograd.Function classes) - Updated: nvfp4_linear.py, runner.py, cutedsl.py, nvfp4_cutedsl.py to use custom ops instead of autograd.Function - Updated: cutedsl_quant_method.py to use custom op + registry	2026-05-19 01:54:48 +00:00
biondizzle	f74447bfd0	Proper NVFP4 integration: quantized compressor/indexer + mapper fixes. Weight mapper fixes: - Reorder substr renames: compressor renames first, then .self_attn.compressor. → .attn.mla_attn.compressor., then indexer renames (so indexer keys end up under mla_attn after the compressor rename already fired) - Add compressor param renames: kv_proj→wkv, gate_proj→wgate, kv_norm→norm, position_bias→ape (checkpoint uses NVFP4 naming, model uses internal names) - Add indexer param renames: q_b_proj→wq_b, kv_proj→compressor.wkv, gate_proj→compressor.wgate, kv_norm→k_norm, position_bias→compressor.ape, weights_proj stays (structural: compressor.indexer → indexer.compressor) - Remove broken suffix renames (already fixed in prior commit). Model architecture fixes: - Patch deepseek_compressor.py to pass quant_config (was None, but NVFP4 checkpoint has quantized compressor weights with input_scale/weight_scale) - Patch deepseek_v4_attention.py indexer: weights_proj now uses quant_config (was None, but checkpoint has quantized weights) - Add indexer.compressor.fused_wkv_wgate stacking in load_weights. Infrastructure: - Add deepseek_compressor.py to Dockerfile - Force MoE backend to flashinfer_cutedsl (was auto-selecting FLASHINFER_TRTLLM) - Update unit test to 50 cases (compressor + indexer + quantization scales)	2026-05-18 23:20:13 +00:00
biondizzle	17496b2615	Fix NVFP4 weights mapper: add prefix mappings, fix substr order - Add orig_to_new_prefix mappings (layers→model.layers, embed_tokens→model.embed_tokens, etc.). AutoWeightsLoader strips the "model." prefix before the mapper runs, so these are required - Move .self_attn.compressor. → .attn.mla_attn.compressor. before .self_attn. → .attn. in substr_renames so compressor keys get the mla_attn prefix before the general rename - Remove suffix renames (head.weight→lm_head.weight, embed.weight→embed_tokens.weight) that were causing double-mapping since the NVFP4 checkpoint already uses lm_head/embed_tokens - Add unit test: tests/test_nvfp4_mapper.py (39 cases, no vLLM/CUDA needed)	2026-05-18 23:03:34 +00:00
biondizzle	6ce6a47be9	Add NVFP4 linear runner + attention projection test - CuTeDSLNvfp4Linear: generic single-GEMM runner for any NVFP4 projection - test_attention.py: tests q_a_proj, q_b_proj, kv_proj, o_b_proj vs BF16 - Same pad+swizzle pattern as shared expert, but no SiLU/fusion	2026-05-18 20:14:03 +00:00
biondizzle	f07643791e	Fix hidden_size: shared expert uses 7168, not HC_DIM 28672	2026-05-18 20:10:32 +00:00
biondizzle	c1aa4af123	Shared expert: dedicated CuTeDSL runner with proper scale assembly - CuTeDSLSharedExpertRunner: num_groups=1 GEMM, no scatter/routing - _assemble_scales_single_group: pad to 128 rows + Blackwell swizzle - All buffers pre-allocated for cudagraph compatibility - Updated test to use dedicated runner instead of MoE runner hack	2026-05-18 20:08:34 +00:00
biondizzle	e8b289e30d	WIP: CuTeDSL shared expert kernel. Dedicated runner (shared_expert_pipeline.py) and test (test_shared_expert.py). Tried reusing the MoE runner with 1 expert; this fails because the MoE runner assumes hidden_size != HC_DIM for scatter. Need a dedicated runner with correct scale assembly. Will continue tomorrow.	2026-05-18 20:02:19 +00:00
biondizzle	bedcfc4dab	Pipeline test: use max_num_tokens=8192 matching vLLM	2026-05-17 23:04:44 +00:00
biondizzle	c45364b3a8	Add MoE scale ratio output	2026-05-17 22:58:27 +00:00
biondizzle	bf99ad49ec	Print both MoE and residual cosine	2026-05-17 22:56:56 +00:00
biondizzle	8637020487	Fix multi-layer test: add residual connections	2026-05-17 22:55:40 +00:00
biondizzle	11dce13afe	Add multi-layer pipeline test to check error accumulation	2026-05-17 22:53:28 +00:00
biondizzle	72628fb689	Full pipeline test: runner vs BF16 reference	2026-05-17 21:29:16 +00:00
biondizzle	2796bd81e8	Fix: scatter FP4 as uint8 (float4 doesn't support index_put)	2026-05-17 21:28:04 +00:00
biondizzle	364f8372bb	Fix FP4 buffer shapes: D//2 for packed dimensions	2026-05-17 21:26:46 +00:00
biondizzle	5e4d674736	Test fix: quantize slot_hidden, scatter FP4, pass slot_x_sf	2026-05-17 21:25:58 +00:00
biondizzle	4d0b6d889d	Set runner weights before _ensure_stacked	2026-05-17 21:22:50 +00:00
biondizzle	b7acac5e4e	Call _ensure_stacked() before using runner buffers	2026-05-17 21:22:30 +00:00
biondizzle	1acf01fc1a	Fix token_indices: repeat each token ID top_k times, not arange	2026-05-17 21:22:11 +00:00
biondizzle	a478ca4746	Debug: trace runner logic step by step, test L1 GEMM	2026-05-17 21:21:45 +00:00
biondizzle	a100bd11c1	Simplify pipeline test: BF16 ref + bridge ref + full runner	2026-05-17 21:20:41 +00:00
biondizzle	6eade5e7f8	Fix: gs values are floats not tensors	2026-05-17 21:19:47 +00:00
biondizzle	b05a38a9bd	Test stages 1-2 first: sort + L1 GEMM	2026-05-17 21:19:23 +00:00
biondizzle	9728604ea1	Pipeline test: stage-by-stage with BF16 reference comparison	2026-05-17 21:19:17 +00:00
biondizzle	7fff5fd39b	Fix: correct intermediate_size=3072, weight key prefix, dequantize shapes	2026-05-17 21:18:20 +00:00
biondizzle	4ef345773d	Rewrite pipeline test: load real weights, step-by-step vs BF16 reference	2026-05-17 21:17:18 +00:00
biondizzle	b43541afdd	Fix test path setup	2026-05-17 21:00:00 +00:00
biondizzle	490ddfa294	Pipeline test: use synthetic weights at 256x512 (JIT at 7168x18432 hangs for hours)	2026-05-17 20:58:06 +00:00
biondizzle	c1bb551446	Fix weight loading: skip already-loaded experts correctly	2026-05-17 18:15:51 +00:00
biondizzle	955d7533f2	Use system Python for pipeline test (CuTeDSL in system site-packages)	2026-05-17 18:13:42 +00:00
biondizzle	925e390b93	Fix import: use direct import from vllm/ subdirectory	2026-05-17 18:12:53 +00:00
biondizzle	cd6144b832	Fix imports: all functions are in cutedsl.bridge, not separate modules	2026-05-17 18:11:03 +00:00
biondizzle	5e63a0d8a3	Rewrite pipeline test: use raw checkpoint weights, compare runner vs dynamic-gs reference	2026-05-17 18:10:05 +00:00
biondizzle	e51eafe288	Rewrite pipeline test: compare runner vs reference with real weights, step-by-step	2026-05-17 18:08:33 +00:00
biondizzle	e38d60a6e8	Add pipeline test with real model weights, add swiglu_limit to reference moe_pipeline	2026-05-17 18:07:44 +00:00
biondizzle	87a223f1ac	Update CURRENT_BUG.md: current status, outstanding garbage output issue, hypotheses	2026-05-17 16:52:40 +00:00
biondizzle	33e28100ee	test: use runner's built-in warmup method	2026-05-17 08:24:27 +00:00
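
Commit 17496b2615 moves the compressor rename ahead of the general .self_attn. rename. A minimal sketch of why the order matters, assuming renames are applied sequentially as plain substring replacements in list order. SUBSTR_RENAMES and rename_key are hypothetical names for illustration, not the repository's mapper:

```python
# Hypothetical illustration of ordered substring renames.
# Assumption: each (old, new) pair is applied in list order via str.replace.
SUBSTR_RENAMES = [
    # Specific rename first, so compressor keys gain the mla_attn segment.
    (".self_attn.compressor.", ".attn.mla_attn.compressor."),
    # General rename second, for every remaining self_attn key.
    (".self_attn.", ".attn."),
]


def rename_key(key, renames):
    """Apply each (old, new) substring rename to key, in order."""
    for old, new in renames:
        key = key.replace(old, new)
    return key


key = "layers.0.self_attn.compressor.kv_proj.weight"

# Correct order: the compressor rule fires before the general rule.
good = rename_key(key, SUBSTR_RENAMES)

# Reversed order: the general rule consumes ".self_attn." first,
# so the compressor rule never matches and mla_attn is lost.
bad = rename_key(key, list(reversed(SUBSTR_RENAMES)))

print(good)  # layers.0.attn.mla_attn.compressor.kv_proj.weight
print(bad)   # layers.0.attn.compressor.kv_proj.weight
```

With the general rule first, the key lands at attn.compressor, which is the wrong path that commit 4cf5b8b751 later describes.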


85 Commits
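
Commit 35fab6cff3 describes routing a custom op to an existing runner through a registry keyed by integer IDs. The registry half of that pattern can be sketched in plain Python as below. This is an illustration under stated assumptions, not the code in cutedsl/custom_ops.py: register_runner, run_by_id, and ToyRunner are hypothetical names, and the torch.library.custom_op wrapper and its fake implementation are omitted.

```python
# Hypothetical sketch of a runner registry keyed by integer IDs.
# Assumption: an op can carry only simple arguments such as an int,
# so the runner object is looked up by ID at execution time.
from typing import Dict, List

_RUNNERS: Dict[int, "ToyRunner"] = {}


class ToyRunner:
    """Stand-in for a kernel runner; scales its input by a constant."""

    def __init__(self, scale: float) -> None:
        self.scale = scale

    def _run_impl(self, x: List[float]) -> List[float]:
        return [self.scale * v for v in x]


def register_runner(runner: ToyRunner) -> int:
    """Store the runner in the global dict and return its integer ID."""
    runner_id = len(_RUNNERS)
    _RUNNERS[runner_id] = runner
    return runner_id


def run_by_id(x: List[float], runner_id: int) -> List[float]:
    """Body an opaque op would execute: look up the runner, then call it."""
    return _RUNNERS[runner_id]._run_impl(x)


double_id = register_runner(ToyRunner(2.0))
triple_id = register_runner(ToyRunner(3.0))

print(run_by_id([1.0, 2.0], double_id))  # [2.0, 4.0]
print(run_by_id([1.0, 2.0], triple_id))  # [3.0, 6.0]
```

Passing an integer ID instead of the runner object keeps the op's arguments to tensors and plain scalars, which matches the (tensors, runner_id, shape_hint) signature given in the commit message.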