nvfp4-megamoe-kernel

Author	SHA1	Message	Date
biondizzle	5a4e355d3a	Add model forward test: reproduce vLLM empty output outside container	2026-05-19 07:47:48 +00:00
biondizzle	f5ce728ef2	Fix OOM: add --max-model-len=876544 + revert CPU dummy weight The CPU dummy weight broke torch.mm(compressor.weight.T) which expects GPU tensors. Instead, reduce max_model_len to fit KV cache within available memory (876544 instead of 1048576).	2026-05-19 07:35:43 +00:00
biondizzle	79a41d9197	Save ~5-8 GiB GPU VRAM: move dummy weight to CPU The CuTeDSL kernel never reads layer.weight — it uses the runner's pre-processed fp4/sf/gs tensors. The dummy BF16 weight exists only for vLLM model introspection. Moving it to CPU saves massive VRAM: - q_b_proj alone: 65536 * 1536 * 2 = 192 MiB on GPU → ~0 MiB - All layers combined: ~5-8 GiB saved This should fix the KV cache OOM (needed 10.28 GiB, had 9.36 GiB).	2026-05-19 07:29:38 +00:00
biondizzle	cebc586014	Fix OOM: use 1-token warmup sample + free immediately 8 tokens * 7168 hidden * ~40 NVFP4 layers = ~2.3 MiB per layer * 40 = 92 MiB But the dummy weight param (out_features * in_features * 2 bytes BF16) was the real killer — each layer allocated a BF16 dummy of its full weight shape. With 1 token the warmup still gets a valid gs, and empty_cache frees the sample tensor before KV cache allocation.	2026-05-19 07:28:57 +00:00
biondizzle	5122cadc94	Update CURRENT_BUG.md: root cause found + fix committed	2026-05-19 07:21:30 +00:00
biondizzle	6e6f95dfa8	FIX: Use warmup-based activation global scale in CuTeDSL linear kernel The checkpoint's input_scale is a calibration-time value that doesn't match what quantize_activation_nvfp4 expects at runtime. Using it as the activation global scale produces garbage output (empty EOS tokens). The fix: run a warmup forward pass with sample data and compute the activation global scale from the actual activation distribution, exactly like our standalone test does (which passes with cosine >= 0.994). This is the root cause of the vLLM server returning empty content.	2026-05-19 07:21:07 +00:00
biondizzle	0a7769972f	Fix garbled shared_expert_pipeline.py: imports/class were merged	2026-05-19 07:18:10 +00:00
biondizzle	87453a53b0	Fix checkpoint keys: attn_hc., compressor., q_a_proj/q_b_proj/kv_proj	2026-05-19 07:17:37 +00:00
biondizzle	f97762cc9f	Fix full layer test: use correct checkpoint key names Checkpoint uses q_a_proj/q_b_proj/kv_proj/q_a_norm — NOT the vLLM fused names (fused_wqa_wkv, wq_b, q_norm).	2026-05-19 07:16:33 +00:00
biondizzle	cc48a5715e	Add full layer 0 B200 test: CuTeDSL vs BF16 reference Tests each attention/FFN projection individually against BF16 dequantized reference, then runs full layer forward. Identifies exactly where garbage enters the pipeline. Key finding: checkpoint uses different names than vLLM: - q_a_proj, q_b_proj, kv_proj (not fused_wqa_wkv) - q_a_norm (not q_norm) - compressor.* (C4A layers only) - sinks (attn_sink)	2026-05-19 07:14:58 +00:00
biondizzle	dbaa3d6fe6	Update CURRENT_BUG.md and README with current state Empty output still happening. Documented what's been tried, what works standalone, what we don't know, and the plan to bypass vLLM's kernel selection entirely by calling our runners directly.	2026-05-19 07:05:45 +00:00
biondizzle	62abf41b03	Revert deepseek_v4_attention.py to `ffc2264` — don't nuke existing patches The file at `ffc2264` already had our BF16 wo_a path (_apply_inv_rope_bf16 + BMM + all-gather) with FP8 fallback. I was replacing it from the wrong vllm source, losing all prior work. Restored to the known-good version.	2026-05-19 06:52:40 +00:00
biondizzle	4c2effa2be	Fix attention patch: source from v0.21.0 stable, not local clone The local vllm clone has different imports (breakable_cudagraph) that don't exist in the Docker image. Now sourced from v0.21.0 tag.	2026-05-19 06:44:59 +00:00
biondizzle	284b6a5d57	Fix attention patch: use original vllm imports, only patch forward method Previous version copied the entire file from our local vllm clone which had imports (breakable_cudagraph) missing from the Docker image's vllm. Now we start from the Docker image's original file and only patch the DeepseekV4MultiHeadLatentAttentionWrapper.forward method.	2026-05-19 06:40:58 +00:00
biondizzle	199efe0871	Fix dims: o_groups=16, o_lora_rank=1024 from config	2026-05-19 06:37:25 +00:00
biondizzle	b4fee70151	Fix device mismatch in test	2026-05-19 06:36:22 +00:00
biondizzle	6b4b9774d1	Add B200 test: prove O-projection root cause + validate fix	2026-05-19 06:32:54 +00:00
biondizzle	77baca668e	Patch attention forward: BF16 inv RoPE + BMM wo_a + NVFP4 wo_b The original attention forward uses fused_inv_rope_fp8_quant + deepseek_v4_fp8_einsum which requires wo_a to have FP8 weights and weight_scale_inv. Our checkpoint has wo_a in BF16, so the original path crashes (produces empty output). Replace O projection with: 1. _apply_inv_rope_bf16: pure PyTorch inverse RoPE (no FP8) 2. BMM grouped linear for wo_a (BF16) 3. NVFP4 wo_b via CuTeDSL Also fixes activation global scale bug from previous commit: - input_global_scale_inv IS the activation gs, don't re-invert - w13_input_scale_orig (after undoing convert) IS the MoE gs Test: tests/test_o_projection.py validates inv RoPE roundtrip and wo_a BMM correctness.	2026-05-19 06:30:18 +00:00
biondizzle	ffc2264c41	Fix activation global scale: don't double-invert input_global_scale_inv The activation global scale = amax / (6.0 * 448.0). Both the linear kernel and MoE kernel were taking 1.0 / (value that's already the correct gs), inverting it and producing garbage quantization. Linear kernel: input_global_scale_inv IS the gs, so use it directly. MoE kernel: w13_input_scale_orig (after undoing convert inversion) IS the gs, so use it directly.	2026-05-19 06:03:08 +00:00
biondizzle	918342feeb	MHC: replace monolithic layers/mhc.py with pure PyTorch The nightly vLLM image puts ALL MHC code in layers/mhc.py (not kernels/mhc/). It imports tilelang at top level and JIT-compiles kernels. Replace the entire file with pure PyTorch implementations using direct_register_custom_op for mhc_pre, mhc_post, mhc_fused_post_pre, and hc_head_fused_kernel. No tilelang dependency at all. Also removes the separate mhc_torch_ops.py and kernels/mhc/ patches which don't apply to the nightly image layout.	2026-05-19 05:41:55 +00:00
biondizzle	dfd9c10ae9	Fix MHC import: don't import .torch from layers/mhc.py The layers/mhc.py was trying to import kernels.mhc.torch which failed because our __init__.py was breaking the package. Instead, just import our mhc_torch_ops which has everything we need. Also fix __init__.py to explicitly import mhc_pre_torch and mhc_post_torch from .torch instead of using import *.	2026-05-19 05:36:35 +00:00
biondizzle	e404e18efb	Also replace layers/mhc.py CustomOp dispatch The original layers/mhc.py forward_cuda calls torch.ops.vllm.mhc_pre_tilelang which triggers TileLang JIT. Replace with our torch implementations in forward_cuda. This is what the CustomOp dispatch routes through.	2026-05-19 05:31:05 +00:00
biondizzle	5e6d459145	Fix MHC custom op registration Previous approach used @CustomOp.register which doesn't create torch.ops.vllm.mhc_pre. The model code calls torch.ops.vllm.mhc_pre() directly, which requires direct_register_custom_op. Use direct_register_custom_op to register mhc_pre, mhc_post, mhc_fused_post_pre, and hc_head_fused_kernel as PyTorch custom ops with torch (eager) implementations. Patch kernels/mhc/__init__.py to import from both .torch (original) and .mhc_torch_ops (our replacements), skipping tilelang import.	2026-05-19 05:19:48 +00:00
biondizzle	9ff1679064	Replace MHC TileLang kernels with pure PyTorch TileLang kernels (mhc_pre_big_fuse_tilelang, mhc_fused_tilelang) don't work correctly on Blackwell SM100 and cause empty model output. Replace with pure PyTorch implementations: - mhc_pre_torch: Sinkhorn-normalized HC residual mixing - mhc_post_torch: HC post block (einsum residual + post layer mix) - mhc_fused_post_pre_torch: Fused post+pre (composition of above) - hc_head_fused_torch: RMS norm + linear + sigmoid + weighted sum Patch both layers/mhc.py (CustomOp dispatch) and kernels/mhc/__init__.py (no tilelang import). Also remove tilelang from pyproject.toml deps.	2026-05-19 05:07:41 +00:00
biondizzle	5c770c68ca	Keep MoE scale tensors: framework warmup needs them The framework's deep_gemm_warmup calls get_fused_moe_quant_config which accesses w13_input_scale etc. Setting them to None caused TypeError: float / NoneType. Keep scales (small tensors) and only free the large weight tensors.	2026-05-19 04:50:31 +00:00
biondizzle	e0f385ac45	Fix workspace_shapes: output dim is hidden_dim, not K2 K comes from hidden_states.size(-1) which is the full BF16 dimension (7168), not the packed weight dimension. K2=14336 is wrong. The MoE output is always hidden_dim (7168).	2026-05-19 04:42:22 +00:00
biondizzle	cfd8ec741d	Debug: add shape mismatch logging in MoE apply	2026-05-19 04:35:58 +00:00
biondizzle	ffc1a5c6a8	Fix workspace_shapes: remove wrong assertion, compute output dim from K The framework may pass K in different forms (packed or unpacked). Use max(K*2, hidden_dim) to handle both cases.	2026-05-19 04:28:04 +00:00
biondizzle	f023b3b2c6	Fix: wrap dummy MoE weights in nn.Parameter PyTorch requires module attributes to be nn.Parameter or None. torch.empty can't be assigned to a registered parameter slot.	2026-05-19 04:21:35 +00:00
biondizzle	b06dcb40dc	Fix MoE w1=None crash: keep shape-preserving dummy weights on CPU The modular kernel framework reads w1.shape[0] in its outer apply() before delegating to our expert impl. Setting layer.w13_weight = None caused AttributeError. Replace with shape-preserving CPU dummy tensors to free GPU memory while keeping shape metadata accessible.	2026-05-19 04:17:10 +00:00
biondizzle	c289c44920	Fix BF16 wo_a: per-group BMM instead of flat linear The BF16 wo_a path was calling self.wo_a(o_inv.reshape(num_tokens, -1)) which flattens across groups: (num_tokens, n_local_heads*head_dim)=(tokens, 8192). But wo_a is a BMM with in_features=n_heads*head_dim/n_groups=4096. The FP8 path handles this via einsum 'bhr,hdr->bhd' with per-group shapes. The BF16 path now does the same: reshape o_inv to per-group format, do torch.bmm, then reshape output and handle TP all-gather manually.	2026-05-19 04:10:02 +00:00
biondizzle	6f9a400ae0	Fix hc_head mapping: checkpoint uses hc_head.hc_fn, model params are flat hc_head_fn - Removed hc_head prefix mapping (checkpoint already has model.hc_head.*) - Fixed substr: hc_head.hc_fn→hc_head_fn (not hc_head.fn→hc_head_fn) - The model has self.hc_head_fn as flat params, not inside a sub-module	2026-05-19 03:58:25 +00:00
biondizzle	909a2710e4	Fix double lm_head mapping: NVFP4 checkpoint already uses correct names The checkpoint has lm_head.weight and model.embed_tokens.weight already — the suffix mappings head.weight→lm_head.weight and embed.weight→embed_tokens.weight were incorrectly applying to keys that already had the right prefix, producing lm_lm_head.weight.	2026-05-19 03:54:14 +00:00
biondizzle	4cf5b8b751	Fix compressor path: attn.mla_attn.compressor (not attn.compressor) The compressor is inside mla_attn, not directly on the attention wrapper. Debug output confirmed: layers.0.attn.mla_attn.compressor.fused_wkv_wgate.*	2026-05-19 03:47:26 +00:00
biondizzle	9d41419e9f	Debug: print compressor params to diagnose KeyError	2026-05-19 03:44:40 +00:00
biondizzle	db5192fe41	Patch from Docker image's vLLM (0.20.2rc1) instead of newer upstream The nightly Docker image uses an older vLLM that doesn't have NormGateLinear, breakable_cudagraph, etc. Patching the Docker image's own files ensures compatibility. - deepseek_v4.py: Patches from Docker image + NVFP4 mapper + wo_a BF16 - deepseek_v4_attention.py: Patches from Docker image + inv rope BF16 + weights_proj quant + removed QuantFP8/GroupShape imports	2026-05-19 03:35:15 +00:00
biondizzle	df5a496f5d	Fix: make eager_break_during_capture import conditional for older vLLM	2026-05-19 03:29:05 +00:00
biondizzle	4ed91b81d0	Fix inverse RoPE formula: swap signs on cross terms	2026-05-19 03:22:10 +00:00
biondizzle	fece06f746	Add unit tests for NVFP4 weight mapper and inverse RoPE BF16	2026-05-19 03:22:00 +00:00
biondizzle	b0b5113467	Fix weight mapper: compressor → attn.compressor (not mla_attn), quant weights_proj - The compressor is on attn.compressor (not attn.mla_attn.compressor) - weights_proj in indexer is NVFP4-quantized in our checkpoint	2026-05-19 03:20:41 +00:00
biondizzle	396a83ea56	Clean vLLM integration: use official paths, BF16 wo_a, proper weight mapper - deepseek_v4.py: Fresh upstream copy with minimal NVFP4 changes - wo_a uses quant_config=None (BF16 in NVFP4 checkpoint, no scales) - Added _make_deepseek_v4_nvfp4_weights_mapper() using official WeightsMapper API - Handles: self_attn→attn, mlp→ffn, gate_proj→w1, compressor renames, etc. - Mapper selected by quant_config.get_name() == 'modelopt_fp4' - deepseek_v4_attention.py: Fresh upstream copy with minimal NVFP4 changes - Removed _wo_a_act_quant and custom CuTeDSL wo_a runner - Added _apply_inv_rope_bf16() helper (inverse RoPE in BF16) - Detects BF16 wo_a (no weight_scale_inv) and uses BF16 path - FP8 einsum path kept as fallback for SM90 checkpoints - BF16 path: inverse RoPE → wo_a() → wo_b() (standard linear methods)	2026-05-19 03:13:38 +00:00
biondizzle	b856ee9315	Clean up debug scripts	2026-05-19 02:47:29 +00:00
biondizzle	05cdde1676	Fix wo_a: scatter each group's data at correct offset in padded buffer The grouped GEMM expects each group's tokens at their own offset range: - Group 0: rows [0, padded_T) - Group 1: rows [padded_T, 2*padded_T) - etc. Previously we wrote all groups' data contiguously starting at row 0, so group 1+ would read zeros from the padding area. Now we scatter each group's quantized activation at the correct offset. Also: - Size buffer for total_max_rows = padded_max * n_groups - Use assemble_scales_2d_side for multi-group scale assembly - Extract output per-group at correct offsets	2026-05-19 02:45:57 +00:00
biondizzle	8fe5546bb3	Fix debug script	2026-05-19 02:43:17 +00:00
biondizzle	788f0aa65a	Add step-by-step debug for wo_a	2026-05-19 02:43:05 +00:00
biondizzle	5f5b997fc3	Fix wo_a: permute to groups-first layout for grouped GEMM The grouped GEMM expects mat_a to be laid out contiguously per group: [all tokens for group0, all tokens for group1, ...] A simple reshape of (T, G, D) → (T*G, D) gives interleaved layout which is wrong. Fix: permute to (G, T, D) before flattening. Same fix for output: permute (G, T, R) → (T, G, R).	2026-05-19 02:41:32 +00:00
biondizzle	77e4970d93	Add debug script for wo_a quantization	2026-05-19 02:40:43 +00:00
biondizzle	80122b850b	Add debug script for wo_a	2026-05-19 02:39:55 +00:00
biondizzle	ae233ab648	Fix test: cos_sin_cache on CUDA device	2026-05-19 02:37:50 +00:00
biondizzle	882d4996ff	Replace DeepGEMM fp8_einsum with CuTeDSL NVFP4 for wo_a (o_proj) The B200 container crashes in DeepGEMM's fp8_einsum (t.dim() == N assertion in layout.hpp:39) when processing wo_a (o-projection first half) in the attention layer. The crash is caused by scale tensor dimension mismatch for the SM100 recipe (1, 1, 128). Instead of fighting DeepGEMM, replace the entire wo_a path with our own CuTeDSL NVFP4 kernel: 1. inverse_rope_bf16() — Python implementation of inverse RoPE (replaces fused_inv_rope_fp8_quant CUDA kernel) 2. CuTeDSLNvfp4WoA — NVFP4 grouped linear for wo_a using ScaledGroupedGemm with n_local_groups=8 groups 3. wo_a weight quantized to NVFP4 instead of FP8 (native NVFP4, no conversion to another quantization) Changes: - cutedsl/inverse_rope.py: BF16 inverse RoPE (conjugate rotation) - cutedsl/wo_a_grouped_linear.py: CuTeDSL NVFP4 grouped GEMM for wo_a - vllm/patches/deepseek_v4_attention.py: Use NVFP4 path when runner is initialized, keep DeepGEMM fallback - vllm/patches/deepseek_v4.py: Init NVFP4 runner instead of FP8 quant - tests/test_wo_a.py: Unit test for inverse RoPE + wo_a GEMM	2026-05-19 02:36:30 +00:00
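
Several commits above describe the inverse RoPE as a conjugate rotation (882d4996ff) whose cross terms needed their signs swapped (4ed91b81d0), validated by a round-trip test (77baca668e). The following is a minimal pure-Python sketch of that idea, not the repository's `_apply_inv_rope_bf16`. The function names, the interleaved pairing of elements, and the frequency base of 10000 are all assumptions chosen for illustration; the real kernel may pair dimensions differently and operates on BF16 tensors.

```python
import math


def rope_angles(position, dim, base=10000.0):
    """Return (cos, sin) lists of length dim // 2 for one token position.

    The frequency schedule (base 10000) is an assumption for this sketch.
    """
    half = dim // 2
    freqs = [base ** (-2.0 * i / dim) for i in range(half)]
    cos = [math.cos(position * f) for f in freqs]
    sin = [math.sin(position * f) for f in freqs]
    return cos, sin


def apply_rope(x, cos, sin):
    """Forward RoPE: rotate each interleaved pair (x[2i], x[2i+1]) by +theta."""
    out = []
    for i, (c, s) in enumerate(zip(cos, sin)):
        a, b = x[2 * i], x[2 * i + 1]
        out += [a * c - b * s, a * s + b * c]
    return out


def apply_inv_rope(y, cos, sin):
    """Inverse RoPE: rotate each pair by -theta (the conjugate rotation).

    Compared with apply_rope, only the signs on the sin cross terms flip.
    """
    out = []
    for i, (c, s) in enumerate(zip(cos, sin)):
        a, b = y[2 * i], y[2 * i + 1]
        out += [a * c + b * s, -a * s + b * c]
    return out


# Round-trip check: inverse(forward(x)) should recover x up to float error.
x = [0.5, -1.25, 2.0, 0.75, -0.1, 3.0, 1.5, -2.5]
cos, sin = rope_angles(position=17, dim=len(x))
roundtrip = apply_inv_rope(apply_rope(x, cos, sin), cos, sin)
max_err = max(abs(a - b) for a, b in zip(x, roundtrip))
print(f"max round-trip error: {max_err:.2e}")
```

If the cross-term signs in `apply_inv_rope` matched those in `apply_rope`, the pair would be rotated twice in the same direction and the round trip would no longer recover `x`, which is the kind of error a round-trip test is designed to catch.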


388 Commits