nvfp4-megamoe-kernel

Author	SHA1	Message	Date
biondizzle	facc6509e7	Fix imports in vLLM codepaths test	2026-05-19 17:26:50 +00:00
biondizzle	835e1a0590	Fix f-string syntax	2026-05-19 17:26:40 +00:00
biondizzle	9c30168202	Add test for exact vLLM codepaths (fused_qnorm, kv_write, decode)	2026-05-19 17:26:10 +00:00
biondizzle	2cc1910c45	Fix N for C128A (need 128 tokens)	2026-05-19 16:04:53 +00:00
biondizzle	cea453cbab	Fix compressor key name	2026-05-19 16:04:38 +00:00
biondizzle	04f2b2d8d4	Add CSA sparse attention test (compressed KV gather + SWA merge)	2026-05-19 16:04:19 +00:00
biondizzle	be8566a443	Add decode vs prefill consistency test	2026-05-19 16:00:33 +00:00
biondizzle	2ddd3d0702	Test with all 61 layers (shared experts only)	2026-05-19 15:55:41 +00:00
biondizzle	842e6e1381	Fix view→reshape for non-contiguous tensor	2026-05-19 15:54:40 +00:00
biondizzle	f0f8d8211b	Add e2e decode test (3 layers: C128A, C4A, SWA)	2026-05-19 15:53:29 +00:00
biondizzle	6ceb05327f	Add blackwell_attention module and comprehensive test	2026-05-19 15:30:29 +00:00
biondizzle	85c74e5932	Fix attention for decode (1 query vs N cached KVs)	2026-05-19 15:28:52 +00:00
biondizzle	85099c7e75	Fix fp8 amax in decode test	2026-05-19 15:28:17 +00:00
biondizzle	c66b0b88c0	Add decode attention pipeline test — reproduces KV cache bug	2026-05-19 15:27:55 +00:00
biondizzle	8e6721917e	Fix syntax in RoPE KV test	2026-05-19 10:31:07 +00:00
biondizzle	cbf440f75a	Add RoPE KV test	2026-05-19 10:28:15 +00:00
biondizzle	dd7f2627e8	Add full model forward test (WIP), sparse attention test passes	2026-05-19 09:04:19 +00:00
biondizzle	9781953509	Add CSA/HCA sparse attention kernel test	2026-05-19 09:02:12 +00:00
biondizzle	d60673864a	Fix kv_ref transpose in KV cache test	2026-05-19 08:58:46 +00:00
biondizzle	c1099d76d2	Add KV cache kernel test - fp8 quantize/dequant, paged cache, CSA/HCA compression	2026-05-19 08:57:31 +00:00
biondizzle	c54ddbdae1	Fix NVFP4 attention: slice output to actual N after 128-padding	2026-05-19 08:55:31 +00:00
biondizzle	42285b6c24	Add CuTeDSL NVFP4 attention kernel test - Q×K^T GEMM	2026-05-19 08:54:59 +00:00
biondizzle	9465929e6e	Add DeepSeek-V4 CSA/HCA attention pipeline test (not MLA)	2026-05-19 08:51:16 +00:00
biondizzle	d08a457829	Fix cos_sin cache shape in NVFP4 attention test	2026-05-19 08:38:55 +00:00
biondizzle	7dd8871e84	Add NVFP4 attention test - quantize Q and K for Q×K^T GEMM	2026-05-19 08:38:25 +00:00
biondizzle	3de75c4e37	Add CSA/HCA attention kernel (PyTorch SDPA, Blackwell-safe). Replaces vLLM's broken FlashMLA sparse attention, which doesn't work on SM100 (Blackwell). Uses torch.nn.functional.scaled_dot_product_attention, which works on all GPUs. Architecture: CSA (C128A): Batched sparse gather + SDPA on top-k positions; HCA (C4A): Same with compressed KV + per-layer indexer; SWA: Sliding window attention; Full reference: standard SDPA for testing without compression. Also adds test_csa_attention_b200.py to verify the full attention path.	2026-05-19 07:58:10 +00:00
biondizzle	65f48be38c	Add attention path test: pinpoint FlashMLA failure	2026-05-19 07:54:01 +00:00
biondizzle	04ad6409e5	Rewrite test: diagnose whether warmup gs matters at inference time	2026-05-19 07:49:41 +00:00
biondizzle	496848e158	Fix ffn_hc.scale key name	2026-05-19 07:48:09 +00:00
biondizzle	5a4e355d3a	Add model forward test: reproduce vLLM empty output outside container	2026-05-19 07:47:48 +00:00
biondizzle	87453a53b0	Fix checkpoint keys: attn_hc., compressor., q_a_proj/q_b_proj/kv_proj	2026-05-19 07:17:37 +00:00
biondizzle	f97762cc9f	Fix full layer test: use correct checkpoint key names. Checkpoint uses q_a_proj/q_b_proj/kv_proj/q_a_norm, NOT the vLLM fused names (fused_wqa_wkv, wq_b, q_norm).	2026-05-19 07:16:33 +00:00
biondizzle	cc48a5715e	Add full layer 0 B200 test: CuTeDSL vs BF16 reference. Tests each attention/FFN projection individually against BF16 dequantized reference, then runs full layer forward. Identifies exactly where garbage enters the pipeline. Key finding: checkpoint uses different names than vLLM: q_a_proj, q_b_proj, kv_proj (not fused_wqa_wkv); q_a_norm (not q_norm); compressor.* (C4A layers only); sinks (attn_sink).	2026-05-19 07:14:58 +00:00
biondizzle	199efe0871	Fix dims: o_groups=16, o_lora_rank=1024 from config	2026-05-19 06:37:25 +00:00
biondizzle	b4fee70151	Fix device mismatch in test	2026-05-19 06:36:22 +00:00
biondizzle	6b4b9774d1	Add B200 test: prove O-projection root cause + validate fix	2026-05-19 06:32:54 +00:00
biondizzle	77baca668e	Patch attention forward: BF16 inv RoPE + BMM wo_a + NVFP4 wo_b. The original attention forward uses fused_inv_rope_fp8_quant + deepseek_v4_fp8_einsum, which requires wo_a to have FP8 weights and weight_scale_inv. Our checkpoint has wo_a in BF16, so the original path crashes (produces empty output). Replace O projection with: (1) _apply_inv_rope_bf16: pure PyTorch inverse RoPE (no FP8); (2) BMM grouped linear for wo_a (BF16); (3) NVFP4 wo_b via CuTeDSL. Also fixes activation global scale bug from previous commit: input_global_scale_inv IS the activation gs, don't re-invert; w13_input_scale_orig (after undoing convert) IS the MoE gs. Test: tests/test_o_projection.py validates inv RoPE roundtrip and wo_a BMM correctness.	2026-05-19 06:30:18 +00:00
biondizzle	c289c44920	Fix BF16 wo_a: per-group BMM instead of flat linear. The BF16 wo_a path was calling self.wo_a(o_inv.reshape(num_tokens, -1)), which flattens across groups: (num_tokens, n_local_heads*head_dim)=(tokens, 8192). But wo_a is a BMM with in_features=n_heads*head_dim/n_groups=4096. The FP8 path handles this via einsum 'bhr,hdr->bhd' with per-group shapes. The BF16 path now does the same: reshape o_inv to per-group format, do torch.bmm, then reshape output and handle TP all-gather manually.	2026-05-19 04:10:02 +00:00
biondizzle	6f9a400ae0	Fix hc_head mapping: checkpoint uses hc_head.hc_fn, model params are flat hc_head_fn. Removed hc_head prefix mapping (checkpoint already has model.hc_head.*); fixed substr: hc_head.hc_fn→hc_head_fn (not hc_head.fn→hc_head_fn); the model has self.hc_head_fn as flat params, not inside a sub-module.	2026-05-19 03:58:25 +00:00
biondizzle	4cf5b8b751	Fix compressor path: attn.mla_attn.compressor (not attn.compressor). The compressor is inside mla_attn, not directly on the attention wrapper. Debug output confirmed: layers.0.attn.mla_attn.compressor.fused_wkv_wgate.*	2026-05-19 03:47:26 +00:00
biondizzle	fece06f746	Add unit tests for NVFP4 weight mapper and inverse RoPE BF16	2026-05-19 03:22:00 +00:00
biondizzle	b856ee9315	Clean up debug scripts	2026-05-19 02:47:29 +00:00
biondizzle	8fe5546bb3	Fix debug script	2026-05-19 02:43:17 +00:00
biondizzle	788f0aa65a	Add step-by-step debug for wo_a	2026-05-19 02:43:05 +00:00
biondizzle	77e4970d93	Add debug script for wo_a quantization	2026-05-19 02:40:43 +00:00
biondizzle	80122b850b	Add debug script for wo_a	2026-05-19 02:39:55 +00:00
biondizzle	ae233ab648	Fix test: cos_sin_cache on CUDA device	2026-05-19 02:37:50 +00:00
biondizzle	882d4996ff	Replace DeepGEMM fp8_einsum with CuTeDSL NVFP4 for wo_a (o_proj). The B200 container crashes in DeepGEMM's fp8_einsum (t.dim() == N assertion in layout.hpp:39) when processing wo_a (o-projection first half) in the attention layer. The crash is caused by scale tensor dimension mismatch for the SM100 recipe (1, 1, 128). Instead of fighting DeepGEMM, replace the entire wo_a path with our own CuTeDSL NVFP4 kernel: (1) inverse_rope_bf16(): Python implementation of inverse RoPE (replaces fused_inv_rope_fp8_quant CUDA kernel); (2) CuTeDSLNvfp4WoA: NVFP4 grouped linear for wo_a using ScaledGroupedGemm with n_local_groups=8 groups; (3) wo_a weight quantized to NVFP4 instead of FP8 (native NVFP4, no conversion to another quantization). Changes: cutedsl/inverse_rope.py: BF16 inverse RoPE (conjugate rotation); cutedsl/wo_a_grouped_linear.py: CuTeDSL NVFP4 grouped GEMM for wo_a; vllm/patches/deepseek_v4_attention.py: Use NVFP4 path when runner is initialized, keep DeepGEMM fallback; vllm/patches/deepseek_v4.py: Init NVFP4 runner instead of FP8 quant; tests/test_wo_a.py: Unit test for inverse RoPE + wo_a GEMM.	2026-05-19 02:36:30 +00:00
biondizzle	00fe63b56f	Fix compile test: add warmup for activation global scales	2026-05-19 01:57:16 +00:00
biondizzle	bba3bca4d3	Add torch.compile + custom op integration test	2026-05-19 01:56:46 +00:00
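
Commits 882d4996ff and 77baca668e describe inverse RoPE as a conjugate rotation and mention a roundtrip test. The sketch below illustrates that idea using only the Python standard library. It is not the repository's code: the function names, the pairing of features into (x, y) tuples, and the base of 10000 are all assumptions made for illustration, and the real path operates on BF16 PyTorch tensors.

```python
import cmath

def rope_rotate(pairs, position, base=10000.0, inverse=False):
    """Rotate feature pairs by position-dependent angles (hypothetical helper).

    pairs: list of (x, y) tuples, one per rotary frequency.
    inverse=True applies the conjugate rotation, which undoes the forward one.
    """
    n = len(pairs)
    out = []
    for i, (x, y) in enumerate(pairs):
        theta = position * base ** (-i / n)  # angle for frequency i of n
        if inverse:
            theta = -theta  # conjugate rotation
        z = complex(x, y) * cmath.exp(1j * theta)
        out.append((z.real, z.imag))
    return out

def roundtrip_error(pairs, position):
    """Largest absolute deviation after forward then inverse rotation."""
    rotated = rope_rotate(pairs, position)
    restored = rope_rotate(rotated, position, inverse=True)
    return max(
        max(abs(a - c), abs(b - d))
        for (a, b), (c, d) in zip(pairs, restored)
    )
```

In double precision the roundtrip error stays near machine epsilon. A BF16 implementation would need a much looser tolerance.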


122 Commits
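
Commit c289c44920 replaces a flat linear with a per-group batched matmul that matches the einsum 'bhr,hdr->bhd'. The sketch below shows the shape logic with plain Python lists so that it runs without PyTorch. The helper names `split_groups` and `grouped_linear` are hypothetical, and the tiny dimensions stand in for the real ones (8192 flattened features, 4096 per group).

```python
def split_groups(flat, n_groups):
    """Split one token's flattened features into n_groups equal chunks."""
    if len(flat) % n_groups != 0:
        raise ValueError("feature count must be divisible by n_groups")
    size = len(flat) // n_groups
    return [flat[g * size:(g + 1) * size] for g in range(n_groups)]

def grouped_linear(x, w):
    """Per-group projection, equivalent to einsum 'bhr,hdr->bhd'.

    x: [tokens][groups][in_per_group]
    w: [groups][out][in_per_group]
    returns: [tokens][groups][out]
    """
    return [
        [
            [sum(xi * wi for xi, wi in zip(x_group, w_row)) for w_row in w[g]]
            for g, x_group in enumerate(x_token)
        ]
        for x_token in x
    ]

# One token, two groups, two input features per group, one output per group.
flat_token = [1.0, 2.0, 3.0, 4.0]
x = [split_groups(flat_token, n_groups=2)]
w = [[[1.0, 0.0]], [[0.0, 1.0]]]  # each group has its own weight matrix
y = grouped_linear(x, w)
```

Each group only sees its own slice of the input. That is why feeding the flattened vector to a single linear layer with the per-group `in_features` produces a shape mismatch.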