nvfp4-megamoe-kernel

Author	SHA1	Message	Date
biondizzle	be8566a443	Add decode vs prefill consistency test	2026-05-19 16:00:33 +00:00
biondizzle	2ddd3d0702	Test with all 61 layers (shared experts only)	2026-05-19 15:55:41 +00:00
biondizzle	842e6e1381	Fix view→reshape for non-contiguous tensor	2026-05-19 15:54:40 +00:00
biondizzle	f0f8d8211b	Add e2e decode test (3 layers: C128A, C4A, SWA)	2026-05-19 15:53:29 +00:00
biondizzle	6ceb05327f	Add blackwell_attention module and comprehensive test	2026-05-19 15:30:29 +00:00
biondizzle	85c74e5932	Fix attention for decode (1 query vs N cached KVs)	2026-05-19 15:28:52 +00:00
biondizzle	85099c7e75	Fix fp8 amax in decode test	2026-05-19 15:28:17 +00:00
biondizzle	c66b0b88c0	Add decode attention pipeline test — reproduces KV cache bug	2026-05-19 15:27:55 +00:00
biondizzle	8e6721917e	Fix syntax in RoPE KV test	2026-05-19 10:31:07 +00:00
biondizzle	cbf440f75a	Add RoPE KV test	2026-05-19 10:28:15 +00:00
biondizzle	dd7f2627e8	Add full model forward test (WIP), sparse attention test passes	2026-05-19 09:04:19 +00:00
biondizzle	9781953509	Add CSA/HCA sparse attention kernel test	2026-05-19 09:02:12 +00:00
biondizzle	d60673864a	Fix kv_ref transpose in KV cache test	2026-05-19 08:58:46 +00:00
biondizzle	c1099d76d2	Add KV cache kernel test - fp8 quantize/dequant, paged cache, CSA/HCA compression	2026-05-19 08:57:31 +00:00
biondizzle	c54ddbdae1	Fix NVFP4 attention: slice output to actual N after 128-padding	2026-05-19 08:55:31 +00:00
biondizzle	42285b6c24	Add CuTeDSL NVFP4 attention kernel test - Q×K^T GEMM	2026-05-19 08:54:59 +00:00
biondizzle	9465929e6e	Add DeepSeek-V4 CSA/HCA attention pipeline test (not MLA)	2026-05-19 08:51:16 +00:00
biondizzle	d08a457829	Fix cos_sin cache shape in NVFP4 attention test	2026-05-19 08:38:55 +00:00
biondizzle	7dd8871e84	Add NVFP4 attention test - quantize Q and K for Q×K^T GEMM	2026-05-19 08:38:25 +00:00
biondizzle	3de75c4e37	Add CSA/HCA attention kernel (PyTorch SDPA, Blackwell-safe) Replaces vLLM's broken FlashMLA sparse attention which doesn't work on SM100 (Blackwell). Uses torch.nn.functional.scaled_dot_product_attention which works on all GPUs. Architecture: - CSA (C128A): Batched sparse gather + SDPA on top-k positions - HCA (C4A): Same with compressed KV + per-layer indexer - SWA: Sliding window attention - Full reference: standard SDPA for testing without compression Also adds test_csa_attention_b200.py to verify the full attention path.	2026-05-19 07:58:10 +00:00
biondizzle	65f48be38c	Add attention path test: pinpoint FlashMLA failure	2026-05-19 07:54:01 +00:00
biondizzle	04ad6409e5	Rewrite test: diagnose whether warmup gs matters at inference time	2026-05-19 07:49:41 +00:00
biondizzle	496848e158	Fix ffn_hc.scale key name	2026-05-19 07:48:09 +00:00
biondizzle	5a4e355d3a	Add model forward test: reproduce vLLM empty output outside container	2026-05-19 07:47:48 +00:00
biondizzle	87453a53b0	Fix checkpoint keys: attn_hc., compressor., q_a_proj/q_b_proj/kv_proj	2026-05-19 07:17:37 +00:00
biondizzle	f97762cc9f	Fix full layer test: use correct checkpoint key names Checkpoint uses q_a_proj/q_b_proj/kv_proj/q_a_norm — NOT the vLLM fused names (fused_wqa_wkv, wq_b, q_norm).	2026-05-19 07:16:33 +00:00
biondizzle	cc48a5715e	Add full layer 0 B200 test: CuTeDSL vs BF16 reference Tests each attention/FFN projection individually against BF16 dequantized reference, then runs full layer forward. Identifies exactly where garbage enters the pipeline. Key finding: checkpoint uses different names than vLLM: - q_a_proj, q_b_proj, kv_proj (not fused_wqa_wkv) - q_a_norm (not q_norm) - compressor.* (C4A layers only) - sinks (attn_sink)	2026-05-19 07:14:58 +00:00
biondizzle	199efe0871	Fix dims: o_groups=16, o_lora_rank=1024 from config	2026-05-19 06:37:25 +00:00
biondizzle	b4fee70151	Fix device mismatch in test	2026-05-19 06:36:22 +00:00
biondizzle	6b4b9774d1	Add B200 test: prove O-projection root cause + validate fix	2026-05-19 06:32:54 +00:00
biondizzle	77baca668e	Patch attention forward: BF16 inv RoPE + BMM wo_a + NVFP4 wo_b The original attention forward uses fused_inv_rope_fp8_quant + deepseek_v4_fp8_einsum which requires wo_a to have FP8 weights and weight_scale_inv. Our checkpoint has wo_a in BF16, so the original path crashes (produces empty output). Replace O projection with: 1. _apply_inv_rope_bf16: pure PyTorch inverse RoPE (no FP8) 2. BMM grouped linear for wo_a (BF16) 3. NVFP4 wo_b via CuTeDSL Also fixes activation global scale bug from previous commit: - input_global_scale_inv IS the activation gs, don't re-invert - w13_input_scale_orig (after undoing convert) IS the MoE gs Test: tests/test_o_projection.py validates inv RoPE roundtrip and wo_a BMM correctness.	2026-05-19 06:30:18 +00:00
biondizzle	c289c44920	Fix BF16 wo_a: per-group BMM instead of flat linear The BF16 wo_a path was calling self.wo_a(o_inv.reshape(num_tokens, -1)) which flattens across groups: (num_tokens, n_local_heads*head_dim)=(tokens, 8192). But wo_a is a BMM with in_features=n_heads*head_dim/n_groups=4096. The FP8 path handles this via einsum 'bhr,hdr->bhd' with per-group shapes. The BF16 path now does the same: reshape o_inv to per-group format, do torch.bmm, then reshape output and handle TP all-gather manually.	2026-05-19 04:10:02 +00:00
biondizzle	6f9a400ae0	Fix hc_head mapping: checkpoint uses hc_head.hc_fn, model params are flat hc_head_fn - Removed hc_head prefix mapping (checkpoint already has model.hc_head.*) - Fixed substr: hc_head.hc_fn→hc_head_fn (not hc_head.fn→hc_head_fn) - The model has self.hc_head_fn as flat params, not inside a sub-module	2026-05-19 03:58:25 +00:00
biondizzle	4cf5b8b751	Fix compressor path: attn.mla_attn.compressor (not attn.compressor) The compressor is inside mla_attn, not directly on the attention wrapper. Debug output confirmed: layers.0.attn.mla_attn.compressor.fused_wkv_wgate.*	2026-05-19 03:47:26 +00:00
biondizzle	fece06f746	Add unit tests for NVFP4 weight mapper and inverse RoPE BF16	2026-05-19 03:22:00 +00:00
biondizzle	b856ee9315	Clean up debug scripts	2026-05-19 02:47:29 +00:00
biondizzle	8fe5546bb3	Fix debug script	2026-05-19 02:43:17 +00:00
biondizzle	788f0aa65a	Add step-by-step debug for wo_a	2026-05-19 02:43:05 +00:00
biondizzle	77e4970d93	Add debug script for wo_a quantization	2026-05-19 02:40:43 +00:00
biondizzle	80122b850b	Add debug script for wo_a	2026-05-19 02:39:55 +00:00
biondizzle	ae233ab648	Fix test: cos_sin_cache on CUDA device	2026-05-19 02:37:50 +00:00
biondizzle	882d4996ff	Replace DeepGEMM fp8_einsum with CuTeDSL NVFP4 for wo_a (o_proj) The B200 container crashes in DeepGEMM's fp8_einsum (t.dim() == N assertion in layout.hpp:39) when processing wo_a (o-projection first half) in the attention layer. The crash is caused by scale tensor dimension mismatch for the SM100 recipe (1, 1, 128). Instead of fighting DeepGEMM, replace the entire wo_a path with our own CuTeDSL NVFP4 kernel: 1. inverse_rope_bf16() — Python implementation of inverse RoPE (replaces fused_inv_rope_fp8_quant CUDA kernel) 2. CuTeDSLNvfp4WoA — NVFP4 grouped linear for wo_a using ScaledGroupedGemm with n_local_groups=8 groups 3. wo_a weight quantized to NVFP4 instead of FP8 (native NVFP4, no conversion to another quantization) Changes: - cutedsl/inverse_rope.py: BF16 inverse RoPE (conjugate rotation) - cutedsl/wo_a_grouped_linear.py: CuTeDSL NVFP4 grouped GEMM for wo_a - vllm/patches/deepseek_v4_attention.py: Use NVFP4 path when runner is initialized, keep DeepGEMM fallback - vllm/patches/deepseek_v4.py: Init NVFP4 runner instead of FP8 quant - tests/test_wo_a.py: Unit test for inverse RoPE + wo_a GEMM	2026-05-19 02:36:30 +00:00
biondizzle	00fe63b56f	Fix compile test: add warmup for activation global scales	2026-05-19 01:57:16 +00:00
biondizzle	bba3bca4d3	Add torch.compile + custom op integration test	2026-05-19 01:56:46 +00:00
biondizzle	35fab6cff3	Replace autograd.Function with torch.library.custom_op for Dynamo compat Dynamo (torch.compile fullgraph) cannot trace through CuTeDSL internals (cute.compile, JIT, etc.). The autograd.Function approach was unreliable with fullgraph mode — Dynamo would still try to trace through it. Fix: torch.library.custom_op makes Dynamo treat our GEMM as an opaque black box. No reimplementing the kernel — just route through the existing runner via a registry pattern: - Runners registered in global dict with integer IDs - Custom op takes (tensors, runner_id, shape_hint) -> tensor - Dynamo calls fake impl for shape inference, never touches the runner - At execution time, real impl looks up runner and calls _run_impl Changes: - New: cutedsl/custom_ops.py (custom op definitions + registry) - New: tests/test_custom_op.py (local unit tests, no GPU needed) - Removed: _Nvfp4LinearApply, _MoEApply (autograd.Function classes) - Updated: nvfp4_linear.py, runner.py, cutedsl.py, nvfp4_cutedsl.py to use custom ops instead of autograd.Function - Updated: cutedsl_quant_method.py to use custom op + registry	2026-05-19 01:54:48 +00:00
biondizzle	f74447bfd0	Proper NVFP4 integration: quantized compressor/indexer + mapper fixes Weight mapper fixes: - Reorder substr renames: compressor renames first, then .self_attn.compressor. → .attn.mla_attn.compressor., then indexer renames (so indexer keys end up under mla_attn after the compressor rename already fired) - Add compressor param renames: kv_proj→wkv, gate_proj→wgate, kv_norm→norm, position_bias→ape (checkpoint uses NVFP4 naming, model uses internal names) - Add indexer param renames: q_b_proj→wq_b, kv_proj→compressor.wkv, gate_proj→compressor.wgate, kv_norm→k_norm, position_bias→compressor.ape, weights_proj stays (structural: compressor.indexer → indexer.compressor) - Remove broken suffix renames (already fixed in prior commit) Model architecture fixes: - Patch deepseek_compressor.py to pass quant_config (was None, but NVFP4 checkpoint has quantized compressor weights with input_scale/weight_scale) - Patch deepseek_v4_attention.py indexer: weights_proj now uses quant_config (was None, but checkpoint has quantized weights) - Add indexer.compressor.fused_wkv_wgate stacking in load_weights Infrastructure: - Add deepseek_compressor.py to Dockerfile - Force MoE backend to flashinfer_cutedsl (was auto-selecting FLASHINFER_TRTLLM) - Update unit test to 50 cases (compressor + indexer + quantization scales)	2026-05-18 23:20:13 +00:00
biondizzle	17496b2615	Fix NVFP4 weights mapper: add prefix mappings, fix substr order - Add orig_to_new_prefix mappings (layers→model.layers, embed_tokens→model.embed_tokens, etc.) AutoWeightsLoader strips the model. prefix before the mapper runs, so these are required - Move .self_attn.compressor. → .attn.mla_attn.compressor. before .self_attn. → .attn. in substr_renames so compressor keys get the mla_attn prefix before the general rename - Remove suffix renames (head.weight→lm_head.weight, embed.weight→embed_tokens.weight) that were causing double-mapping since the NVFP4 checkpoint already uses lm_head/embed_tokens - Add unit test: tests/test_nvfp4_mapper.py (39 cases, no vLLM/CUDA needed)	2026-05-18 23:03:34 +00:00
biondizzle	6ce6a47be9	Add NVFP4 linear runner + attention projection test - CuTeDSLNvfp4Linear: generic single-GEMM runner for any NVFP4 projection - test_attention.py: tests q_a_proj, q_b_proj, kv_proj, o_b_proj vs BF16 - Same pad+swizzle pattern as shared expert, but no SiLU/fusion	2026-05-18 20:14:03 +00:00
biondizzle	f07643791e	Fix hidden_size: shared expert uses 7168, not HC_DIM 28672	2026-05-18 20:10:32 +00:00
biondizzle	c1aa4af123	Shared expert: dedicated CuTeDSL runner with proper scale assembly - CuTeDSLSharedExpertRunner: num_groups=1 GEMM, no scatter/routing - _assemble_scales_single_group: pad to 128 rows + Blackwell swizzle - All buffers pre-allocated for cudagraph compatibility - Updated test to use dedicated runner instead of MoE runner hack	2026-05-18 20:08:34 +00:00
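
The substring-rename ordering issue described in commit 17496b2615 can be illustrated with a small sketch. It assumes the mapper applies each (old, new) rename sequentially to the running key; the helper `apply_substr_renames` and the example checkpoint keys are hypothetical and are not taken from the repository or from vLLM.

```python
# Minimal sketch of order-dependent substring renames (hypothetical helper,
# not the repository's mapper). Assumption: renames are applied one after
# another to the same key, so an earlier rename can hide a later match.

def apply_substr_renames(key, renames):
    """Apply each (old, new) substring rename in list order."""
    for old, new in renames:
        if old in key:
            key = key.replace(old, new, 1)
    return key

# Specific rename first, general rename second (the order the commit moves to).
GOOD_ORDER = [
    (".self_attn.compressor.", ".attn.mla_attn.compressor."),
    (".self_attn.", ".attn."),
]

# General rename first: it rewrites ".self_attn." before the compressor
# rule gets a chance to match, so the mla_attn prefix is never added.
BAD_ORDER = list(reversed(GOOD_ORDER))

# Hypothetical checkpoint keys, used only for illustration.
compressor_key = "layers.0.self_attn.compressor.kv_proj.weight"
plain_key = "layers.0.self_attn.q_a_proj.weight"

good = apply_substr_renames(compressor_key, GOOD_ORDER)
bad = apply_substr_renames(compressor_key, BAD_ORDER)
plain = apply_substr_renames(plain_key, GOOD_ORDER)
```

With the specific rule first, the compressor key picks up the mla_attn segment; with the general rule first, it does not, while non-compressor keys are unaffected by the order.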


116 Commits
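
Commit 882d4996ff describes the BF16 inverse RoPE as a conjugate rotation, and 77baca668e mentions a roundtrip test for it. The sketch below shows that idea in plain Python under stated assumptions: adjacent elements are paired as (real, imag), the angles follow the common base^(-2i/dim) schedule, and the function names are hypothetical. The repository's actual pairing layout, frequency schedule, and tensor code may differ.

```python
import cmath

# Minimal sketch of RoPE and its inverse as complex rotations.
# Assumptions (not confirmed by the commit log): interleaved pairing
# (x[2i], x[2i+1]) and angle theta_i = pos * base ** (-2i / dim).

def rope_angles(pos, dim, base=10000.0):
    """Rotation angle for each of the dim // 2 pairs at position pos."""
    return [pos * base ** (-2.0 * i / dim) for i in range(dim // 2)]

def apply_rope(x, pos, inverse=False, base=10000.0):
    """Rotate each pair by +theta, or by -theta (the conjugate) if inverse."""
    out = []
    for i, theta in enumerate(rope_angles(pos, len(x), base)):
        rot = cmath.exp(1j * theta)
        if inverse:
            rot = rot.conjugate()  # conjugate rotation undoes the forward one
        z = complex(x[2 * i], x[2 * i + 1]) * rot
        out.extend([z.real, z.imag])
    return out

x = [1.0, 2.0, -0.5, 0.25]
rotated = apply_rope(x, pos=7)
restored = apply_rope(rotated, pos=7, inverse=True)
max_err = max(abs(a - b) for a, b in zip(x, restored))
```

Because the forward and inverse factors multiply to 1, the roundtrip recovers the input up to floating-point error, which is the property a roundtrip test would check.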