nvfp4-megamoe-kernel

Author	SHA1	Message	Date
biondizzle	dd7af0cd8a	feat: GPU-native SWA + sparse decode attention kernels (CuTeDSL)	2026-05-20 05:46:15 +00:00
		- native_swa_decode.py: BlackwellSWADecodeKernel
		  - CTA mapping: 1 CTA per (decode_token, q_head_group)
		  - Online softmax with KV tile streaming (16 tokens/tile)
		  - Pre-dequantized bf16 KV (fp8 dequant on host - MLIR cvt_fpext requires 32-bit aligned vector, no scalar fp8->bf16 support)
		  - Cosine 0.9999+ vs PyTorch batched SDPA reference
		  - Fallback _fallback_batched_sdp when CuTeDSL unavailable
		- native_sparse_decode.py: BlackwellSparseDecodeKernel
		  - Combined SWA + compressed KV in single attention pass
		  - Supports CSA (cr=4) and HCA (cr=128) layers
		  - Sink weight merge on host side
		  - Cosine 0.9999+ vs combined SDPA reference
		- fp8_bf16.py: Documents MLIR limitation (cvt_fpext requires vector<4xf8>, no scalar support). Pre-dequant is the workaround.
		- vLLM wiring (attention.py):
		  - SWA-only layers: native_swa_decode_attention
		  - CSA/HCA layers: native_sparse_decode_attention with topk + attn_sink
		- csa_attention.py updated to use native kernels
		- Tests: test_decode_pipeline.py, test_sparse_decode.py both passing
biondizzle	d775d1075d	Fused SwiGLU epilogue with granularity-8 weight interleave	2026-05-20 04:13:52 +00:00
		- Fix interleave_l1_weights: remove //2 bug (g=granularity_bf16 for N-axis)
		- Apply L1 weight+SF interleave in runner._ensure_stacked() and moe_pipeline
		- De-interleave L1 GEMM output before gate/up split
		- Fused SwiGLU kernel: epi_tile=(128,8) for subtile-level pairing
		- Even subtiles = gate: SiLU in FP32 registers, save to register buffer
		- Odd subtiles = up: silu(gate)*up from buffer
		- Both branches produce same BF16 tensor type (CuTeDSL constraint)
		- run_nvfp4_moe_fused() pipeline: fused L1 + PyTorch L2
		- Runner: fused_swiglu=True option for CuTeDSLMoERunner
		- Layertest: both fused and non-fused paths PASS (cosine 0.988)
		- README.md updated with current status and lessons learned
biondizzle	b1778eedf8	wip: Step 2 gate/up pairing — SiLU validated, runtime conditionals blocked by CuTeDSL	2026-05-20 03:26:20 +00:00
		SiLU in registers: PASS (0.034% error, Step 1 stable)
		Gate/up subtile detection: blocked by CuTeDSL type system
		CuTeDSL compiles the kernel for ALL subtile iterations at once. Runtime conditionals (if is_gate_subtile) that affect:
		- Register tensor assignment → DSLRuntimeError (type structure mismatch)
		- TMA store skipping → corrupted output
		- Mask blending → wrong results
		Path forward: use const_expr debug flag for the BF16 side output, or process gate/up in a separate post-GEMM kernel.
biondizzle	ed89e678be	wip: add run_fused_swiglu_grouped_gemm bridge + step1 test	2026-05-20 03:10:56 +00:00
biondizzle	9cdf79fd9c	wip: fused SwiGLU kernel scaffold + bridge interleave + plan	2026-05-20 03:04:38 +00:00
		- fused_swiglu_grouped_mm.py: copypaste of torch_scaled_grouped_mm.py with class rename and fused_swiglu/swiglu_limit params added
		- bridge.py: added interleave_l1_weights, deinterleave_l1_weights, warmup_fused_swiglu_compilation
		- Pure-PyTorch interleave invariant passes (A@cat vs deinterleave(A@interleave))
		- Standalone GEMM interleave test fails due to kernel-internal N-tiling layout (expected, skipping per plan)
		- FUSED_EPILOGUE_PLAN.md updated with register layout, amax shuffle plan, 4-step implementation strategy
biondizzle	3c6b5a0522	chore: deprecate prepare_weights_from_dequantized and prepare_weights_direct	2026-05-20 02:11:40 +00:00
		Verified that our NVFP4 packing convention (odd<<4|even, round-half-to-even) matches the DeepSeek-V4 checkpoint exactly: 100% byte-identical round-trip across all tested experts. The dequantize->requantize path is lossless in practice but wasteful.
		Marked both prepare_weights_from_dequantized and prepare_weights_direct as deprecated in favor of prepare_weights_from_stacked which loads checkpoint FP4 bytes directly via .view().
		Also added test_fp4_roundtrip.py for future reference.
biondizzle	7070fadf72	Add full layer NaN test (attention + MoE, multi-layer chain)	2026-05-19 18:36:49 +00:00
biondizzle	152b0749df	Use 16 experts for MoE runner test (fits in memory)	2026-05-19 18:35:40 +00:00
biondizzle	daa59a7c75	Add MoE runner NaN test (grouped GEMM with real weights)	2026-05-19 18:34:56 +00:00
biondizzle	9308634e65	Fix intermediate size: 3072 not 18432	2026-05-19 18:34:12 +00:00
biondizzle	2b91bb1b71	Rewrite MoE NaN test: per-expert format, activation quantization, grouped GEMM	2026-05-19 18:33:57 +00:00
biondizzle	8904d409f8	Fix MoE weight key names, add fallback	2026-05-19 18:32:49 +00:00
biondizzle	e45ceb2226	Add MoE NaN reproduction test, update CURRENT_BUG.md with NaN tracing and test plan	2026-05-19 18:32:14 +00:00
biondizzle	facc6509e7	Fix imports in vLLM codepaths test	2026-05-19 17:26:50 +00:00
biondizzle	835e1a0590	Fix f-string syntax	2026-05-19 17:26:40 +00:00
biondizzle	9c30168202	Add test for exact vLLM codepaths (fused_qnorm, kv_write, decode)	2026-05-19 17:26:10 +00:00
biondizzle	2cc1910c45	Fix N for C128A (need 128 tokens)	2026-05-19 16:04:53 +00:00
biondizzle	cea453cbab	Fix compressor key name	2026-05-19 16:04:38 +00:00
biondizzle	04f2b2d8d4	Add CSA sparse attention test (compressed KV gather + SWA merge)	2026-05-19 16:04:19 +00:00
biondizzle	be8566a443	Add decode vs prefill consistency test	2026-05-19 16:00:33 +00:00
biondizzle	2ddd3d0702	Test with all 61 layers (shared experts only)	2026-05-19 15:55:41 +00:00
biondizzle	842e6e1381	Fix view→reshape for non-contiguous tensor	2026-05-19 15:54:40 +00:00
biondizzle	f0f8d8211b	Add e2e decode test (3 layers: C128A, C4A, SWA)	2026-05-19 15:53:29 +00:00
biondizzle	6ceb05327f	Add blackwell_attention module and comprehensive test	2026-05-19 15:30:29 +00:00
biondizzle	85c74e5932	Fix attention for decode (1 query vs N cached KVs)	2026-05-19 15:28:52 +00:00
biondizzle	85099c7e75	Fix fp8 amax in decode test	2026-05-19 15:28:17 +00:00
biondizzle	c66b0b88c0	Add decode attention pipeline test — reproduces KV cache bug	2026-05-19 15:27:55 +00:00
biondizzle	8e6721917e	Fix syntax in RoPE KV test	2026-05-19 10:31:07 +00:00
biondizzle	cbf440f75a	Add RoPE KV test	2026-05-19 10:28:15 +00:00
biondizzle	dd7f2627e8	Add full model forward test (WIP), sparse attention test passes	2026-05-19 09:04:19 +00:00
biondizzle	9781953509	Add CSA/HCA sparse attention kernel test	2026-05-19 09:02:12 +00:00
biondizzle	d60673864a	Fix kv_ref transpose in KV cache test	2026-05-19 08:58:46 +00:00
biondizzle	c1099d76d2	Add KV cache kernel test - fp8 quantize/dequant, paged cache, CSA/HCA compression	2026-05-19 08:57:31 +00:00
biondizzle	c54ddbdae1	Fix NVFP4 attention: slice output to actual N after 128-padding	2026-05-19 08:55:31 +00:00
biondizzle	42285b6c24	Add CuTeDSL NVFP4 attention kernel test - Q×K^T GEMM	2026-05-19 08:54:59 +00:00
biondizzle	9465929e6e	Add DeepSeek-V4 CSA/HCA attention pipeline test (not MLA)	2026-05-19 08:51:16 +00:00
biondizzle	d08a457829	Fix cos_sin cache shape in NVFP4 attention test	2026-05-19 08:38:55 +00:00
biondizzle	7dd8871e84	Add NVFP4 attention test - quantize Q and K for Q×K^T GEMM	2026-05-19 08:38:25 +00:00
biondizzle	3de75c4e37	Add CSA/HCA attention kernel (PyTorch SDPA, Blackwell-safe)	2026-05-19 07:58:10 +00:00
		Replaces vLLM's broken FlashMLA sparse attention which doesn't work on SM100 (Blackwell). Uses torch.nn.functional.scaled_dot_product_attention which works on all GPUs.
		Architecture:
		- CSA (C128A): Batched sparse gather + SDPA on top-k positions
		- HCA (C4A): Same with compressed KV + per-layer indexer
		- SWA: Sliding window attention
		- Full reference: standard SDPA for testing without compression
		Also adds test_csa_attention_b200.py to verify the full attention path.
biondizzle	65f48be38c	Add attention path test: pinpoint FlashMLA failure	2026-05-19 07:54:01 +00:00
biondizzle	04ad6409e5	Rewrite test: diagnose whether warmup gs matters at inference time	2026-05-19 07:49:41 +00:00
biondizzle	496848e158	Fix ffn_hc.scale key name	2026-05-19 07:48:09 +00:00
biondizzle	5a4e355d3a	Add model forward test: reproduce vLLM empty output outside container	2026-05-19 07:47:48 +00:00
biondizzle	87453a53b0	Fix checkpoint keys: attn_hc., compressor., q_a_proj/q_b_proj/kv_proj	2026-05-19 07:17:37 +00:00
biondizzle	f97762cc9f	Fix full layer test: use correct checkpoint key names	2026-05-19 07:16:33 +00:00
		Checkpoint uses q_a_proj/q_b_proj/kv_proj/q_a_norm — NOT the vLLM fused names (fused_wqa_wkv, wq_b, q_norm).
biondizzle	cc48a5715e	Add full layer 0 B200 test: CuTeDSL vs BF16 reference	2026-05-19 07:14:58 +00:00
		Tests each attention/FFN projection individually against BF16 dequantized reference, then runs full layer forward. Identifies exactly where garbage enters the pipeline.
		Key finding: checkpoint uses different names than vLLM:
		- q_a_proj, q_b_proj, kv_proj (not fused_wqa_wkv)
		- q_a_norm (not q_norm)
		- compressor.* (C4A layers only)
		- sinks (attn_sink)
biondizzle	199efe0871	Fix dims: o_groups=16, o_lora_rank=1024 from config	2026-05-19 06:37:25 +00:00
biondizzle	b4fee70151	Fix device mismatch in test	2026-05-19 06:36:22 +00:00
biondizzle	6b4b9774d1	Add B200 test: prove O-projection root cause + validate fix	2026-05-19 06:32:54 +00:00
biondizzle	77baca668e	Patch attention forward: BF16 inv RoPE + BMM wo_a + NVFP4 wo_b	2026-05-19 06:30:18 +00:00
		The original attention forward uses fused_inv_rope_fp8_quant + deepseek_v4_fp8_einsum which requires wo_a to have FP8 weights and weight_scale_inv. Our checkpoint has wo_a in BF16, so the original path crashes (produces empty output).
		Replace O projection with:
		1. _apply_inv_rope_bf16: pure PyTorch inverse RoPE (no FP8)
		2. BMM grouped linear for wo_a (BF16)
		3. NVFP4 wo_b via CuTeDSL
		Also fixes activation global scale bug from previous commit:
		- input_global_scale_inv IS the activation gs, don't re-invert
		- w13_input_scale_orig (after undoing convert) IS the MoE gs
		Test: tests/test_o_projection.py validates inv RoPE roundtrip and wo_a BMM correctness.
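
Note on dd7af0cd8a: the "online softmax with KV tile streaming (16 tokens/tile)" item can be illustrated with a plain-Python sketch for a single decode query. This is not the CuTeDSL kernel. The function names, the list-based data layout, and the default 1/sqrt(d) scale are assumptions made for illustration; only the tile size of 16 comes from the commit message.

```python
import math


def streaming_attention(q, keys, values, tile=16, scale=None):
    """Single-query attention that visits KV in fixed-size tiles.

    Hypothetical reference sketch: keeps a running max (m), a running
    softmax denominator (l) and an unnormalized output accumulator (acc),
    rescaling the old state whenever a tile raises the max.
    """
    scale = 1.0 / math.sqrt(len(q)) if scale is None else scale
    m = float("-inf")
    l = 0.0
    acc = [0.0] * len(values[0])
    for start in range(0, len(keys), tile):
        k_tile = keys[start:start + tile]
        v_tile = values[start:start + tile]
        scores = [scale * sum(a * b for a, b in zip(q, k)) for k in k_tile]
        m_new = max(m, max(scores))
        correction = math.exp(m - m_new)  # 0.0 on the first tile (m = -inf)
        l *= correction
        acc = [a * correction for a in acc]
        for s, v in zip(scores, v_tile):
            p = math.exp(s - m_new)
            l += p
            acc = [a + p * x for a, x in zip(acc, v)]
        m = m_new
    return [a / l for a in acc]


def reference_attention(q, keys, values, scale=None):
    """Same computation with the full softmax materialized at once."""
    scale = 1.0 / math.sqrt(len(q)) if scale is None else scale
    scores = [scale * sum(a * b for a, b in zip(q, k)) for k in keys]
    top = max(scores)
    probs = [math.exp(s - top) for s in scores]
    total = sum(probs)
    dim = len(values[0])
    return [sum(p * v[j] for p, v in zip(probs, values)) / total
            for j in range(dim)]
```

Both functions should agree to floating-point tolerance for any tile size, which is the property the commit checks with a cosine comparison against an SDPA reference.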

135 Commits
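
Note on 3c6b5a0522: the packing convention "odd<<4|even, round-half-to-even" can be sketched as below. This reads the convention as "the element at the odd index goes in the high nibble, the element at the even index in the low nibble", which is an interpretation of the commit message rather than code from the repository. The value table is the standard FP4 E2M1 set; the helper names are hypothetical, and NVFP4 block scale factors are left out entirely.

```python
# Magnitudes representable by FP4 E2M1, indexed by the low 3 bits of the code.
E2M1_VALUES = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]


def quantize_e2m1(x):
    """Map a float to a 4-bit code (bit 3 = sign), ties to the even code."""
    sign = 8 if x < 0 else 0
    mag = min(abs(x), 6.0)  # saturate at the largest magnitude
    best = 0
    for code in range(1, 8):
        d_best = abs(mag - E2M1_VALUES[best])
        d_code = abs(mag - E2M1_VALUES[code])
        # On an exact tie prefer the code whose last mantissa bit is 0.
        if d_code < d_best or (d_code == d_best and code % 2 == 0):
            best = code
    return sign | best


def pack_fp4(codes):
    """Pack pairs of 4-bit codes into bytes as (odd << 4) | even."""
    if len(codes) % 2:
        raise ValueError("need an even number of 4-bit codes")
    return bytes((codes[i + 1] << 4) | codes[i]
                 for i in range(0, len(codes), 2))


def unpack_fp4(packed):
    """Inverse of pack_fp4: low nibble first, then high nibble."""
    codes = []
    for byte in packed:
        codes.append(byte & 0x0F)  # even-index element
        codes.append(byte >> 4)    # odd-index element
    return codes
```

A byte-identical round trip in this sketch means `pack_fp4(unpack_fp4(b)) == b` for any packed bytes `b`, which is the kind of check the commit describes running against checkpoint data.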