nvfp4-megamoe-kernel

Author	SHA1	Message	Date
biondizzle	7070fadf72	Add full layer NaN test (attention + MoE, multi-layer chain)	2026-05-19 18:36:49 +00:00
biondizzle	152b0749df	Use 16 experts for MoE runner test (fits in memory)	2026-05-19 18:35:40 +00:00
biondizzle	daa59a7c75	Add MoE runner NaN test (grouped GEMM with real weights)	2026-05-19 18:34:56 +00:00
biondizzle	9308634e65	Fix intermediate size: 3072 not 18432	2026-05-19 18:34:12 +00:00
biondizzle	2b91bb1b71	Rewrite MoE NaN test: per-expert format, activation quantization, grouped GEMM	2026-05-19 18:33:57 +00:00
biondizzle	8904d409f8	Fix MoE weight key names, add fallback	2026-05-19 18:32:49 +00:00
biondizzle	e45ceb2226	Add MoE NaN reproduction test, update CURRENT_BUG.md with NaN tracing and test plan	2026-05-19 18:32:14 +00:00
biondizzle	facc6509e7	Fix imports in vLLM codepaths test	2026-05-19 17:26:50 +00:00
biondizzle	835e1a0590	Fix f-string syntax	2026-05-19 17:26:40 +00:00
biondizzle	9c30168202	Add test for exact vLLM codepaths (fused_qnorm, kv_write, decode)	2026-05-19 17:26:10 +00:00
biondizzle	2cc1910c45	Fix N for C128A (need 128 tokens)	2026-05-19 16:04:53 +00:00
biondizzle	cea453cbab	Fix compressor key name	2026-05-19 16:04:38 +00:00
biondizzle	04f2b2d8d4	Add CSA sparse attention test (compressed KV gather + SWA merge)	2026-05-19 16:04:19 +00:00
biondizzle	be8566a443	Add decode vs prefill consistency test	2026-05-19 16:00:33 +00:00
biondizzle	2ddd3d0702	Test with all 61 layers (shared experts only)	2026-05-19 15:55:41 +00:00
biondizzle	842e6e1381	Fix view→reshape for non-contiguous tensor	2026-05-19 15:54:40 +00:00
biondizzle	f0f8d8211b	Add e2e decode test (3 layers: C128A, C4A, SWA)	2026-05-19 15:53:29 +00:00
biondizzle	6ceb05327f	Add blackwell_attention module and comprehensive test	2026-05-19 15:30:29 +00:00
biondizzle	85c74e5932	Fix attention for decode (1 query vs N cached KVs)	2026-05-19 15:28:52 +00:00
biondizzle	85099c7e75	Fix fp8 amax in decode test	2026-05-19 15:28:17 +00:00
biondizzle	c66b0b88c0	Add decode attention pipeline test — reproduces KV cache bug	2026-05-19 15:27:55 +00:00
biondizzle	8e6721917e	Fix syntax in RoPE KV test	2026-05-19 10:31:07 +00:00
biondizzle	cbf440f75a	Add RoPE KV test	2026-05-19 10:28:15 +00:00
biondizzle	dd7f2627e8	Add full model forward test (WIP), sparse attention test passes	2026-05-19 09:04:19 +00:00
biondizzle	9781953509	Add CSA/HCA sparse attention kernel test	2026-05-19 09:02:12 +00:00
biondizzle	d60673864a	Fix kv_ref transpose in KV cache test	2026-05-19 08:58:46 +00:00
biondizzle	c1099d76d2	Add KV cache kernel test - fp8 quantize/dequant, paged cache, CSA/HCA compression	2026-05-19 08:57:31 +00:00
biondizzle	c54ddbdae1	Fix NVFP4 attention: slice output to actual N after 128-padding	2026-05-19 08:55:31 +00:00
biondizzle	42285b6c24	Add CuTeDSL NVFP4 attention kernel test - Q×K^T GEMM	2026-05-19 08:54:59 +00:00
biondizzle	9465929e6e	Add DeepSeek-V4 CSA/HCA attention pipeline test (not MLA)	2026-05-19 08:51:16 +00:00
biondizzle	d08a457829	Fix cos_sin cache shape in NVFP4 attention test	2026-05-19 08:38:55 +00:00
biondizzle	7dd8871e84	Add NVFP4 attention test - quantize Q and K for Q×K^T GEMM	2026-05-19 08:38:25 +00:00
biondizzle	3de75c4e37	Add CSA/HCA attention kernel (PyTorch SDPA, Blackwell-safe). Replaces vLLM's broken FlashMLA sparse attention, which doesn't work on SM100 (Blackwell). Uses torch.nn.functional.scaled_dot_product_attention, which works on all GPUs. Architecture: CSA (C128A): batched sparse gather + SDPA on top-k positions; HCA (C4A): same with compressed KV + per-layer indexer; SWA: sliding window attention; Full reference: standard SDPA for testing without compression. Also adds test_csa_attention_b200.py to verify the full attention path.	2026-05-19 07:58:10 +00:00
biondizzle	65f48be38c	Add attention path test: pinpoint FlashMLA failure	2026-05-19 07:54:01 +00:00
biondizzle	04ad6409e5	Rewrite test: diagnose whether warmup gs matters at inference time	2026-05-19 07:49:41 +00:00
biondizzle	496848e158	Fix ffn_hc.scale key name	2026-05-19 07:48:09 +00:00
biondizzle	5a4e355d3a	Add model forward test: reproduce vLLM empty output outside container	2026-05-19 07:47:48 +00:00
biondizzle	87453a53b0	Fix checkpoint keys: attn_hc., compressor., q_a_proj/q_b_proj/kv_proj	2026-05-19 07:17:37 +00:00
biondizzle	f97762cc9f	Fix full layer test: use correct checkpoint key names. Checkpoint uses q_a_proj/q_b_proj/kv_proj/q_a_norm, NOT the vLLM fused names (fused_wqa_wkv, wq_b, q_norm).	2026-05-19 07:16:33 +00:00
biondizzle	cc48a5715e	Add full layer 0 B200 test: CuTeDSL vs BF16 reference. Tests each attention/FFN projection individually against BF16 dequantized reference, then runs full layer forward. Identifies exactly where garbage enters the pipeline. Key finding: checkpoint uses different names than vLLM: q_a_proj, q_b_proj, kv_proj (not fused_wqa_wkv); q_a_norm (not q_norm); compressor.* (C4A layers only); sinks (attn_sink).	2026-05-19 07:14:58 +00:00
biondizzle	199efe0871	Fix dims: o_groups=16, o_lora_rank=1024 from config	2026-05-19 06:37:25 +00:00
biondizzle	b4fee70151	Fix device mismatch in test	2026-05-19 06:36:22 +00:00
biondizzle	6b4b9774d1	Add B200 test: prove O-projection root cause + validate fix	2026-05-19 06:32:54 +00:00
biondizzle	77baca668e	Patch attention forward: BF16 inv RoPE + BMM wo_a + NVFP4 wo_b. The original attention forward uses fused_inv_rope_fp8_quant + deepseek_v4_fp8_einsum, which requires wo_a to have FP8 weights and weight_scale_inv. Our checkpoint has wo_a in BF16, so the original path crashes (produces empty output). Replace O projection with: (1) _apply_inv_rope_bf16: pure PyTorch inverse RoPE (no FP8); (2) BMM grouped linear for wo_a (BF16); (3) NVFP4 wo_b via CuTeDSL. Also fixes activation global scale bug from previous commit: input_global_scale_inv IS the activation gs, don't re-invert; w13_input_scale_orig (after undoing convert) IS the MoE gs. Test: tests/test_o_projection.py validates inv RoPE roundtrip and wo_a BMM correctness.	2026-05-19 06:30:18 +00:00
biondizzle	c289c44920	Fix BF16 wo_a: per-group BMM instead of flat linear. The BF16 wo_a path was calling self.wo_a(o_inv.reshape(num_tokens, -1)), which flattens across groups: (num_tokens, n_local_heads*head_dim)=(tokens, 8192). But wo_a is a BMM with in_features=n_heads*head_dim/n_groups=4096. The FP8 path handles this via einsum 'bhr,hdr->bhd' with per-group shapes. The BF16 path now does the same: reshape o_inv to per-group format, do torch.bmm, then reshape output and handle TP all-gather manually.	2026-05-19 04:10:02 +00:00
biondizzle	6f9a400ae0	Fix hc_head mapping: checkpoint uses hc_head.hc_fn, model params are flat hc_head_fn. Removed hc_head prefix mapping (checkpoint already has model.hc_head.*); fixed substr: hc_head.hc_fn→hc_head_fn (not hc_head.fn→hc_head_fn); the model has self.hc_head_fn as flat params, not inside a sub-module.	2026-05-19 03:58:25 +00:00
biondizzle	4cf5b8b751	Fix compressor path: attn.mla_attn.compressor (not attn.compressor). The compressor is inside mla_attn, not directly on the attention wrapper. Debug output confirmed: layers.0.attn.mla_attn.compressor.fused_wkv_wgate.*	2026-05-19 03:47:26 +00:00
biondizzle	fece06f746	Add unit tests for NVFP4 weight mapper and inverse RoPE BF16	2026-05-19 03:22:00 +00:00
biondizzle	b856ee9315	Clean up debug scripts	2026-05-19 02:47:29 +00:00
biondizzle	8fe5546bb3	Fix debug script	2026-05-19 02:43:17 +00:00


129 Commits