nvfp4-megamoe-kernel

Author	SHA1	Message	Date
biondizzle	4b85605edf	Fix fp8 amax in decode test	2026-05-19 15:28:17 +00:00
biondizzle	4f23055450	Add decode attention pipeline test — reproduces KV cache bug	2026-05-19 15:27:55 +00:00
biondizzle	31b9cfbdbd	Update README and CURRENT_BUG: BUILD YOUR OWN KERNELS. Stop patching vLLM.	2026-05-19 15:19:55 +00:00
biondizzle	dca8bfc3a8	Fix _apply_rope_kv: use inline RoPE instead of 3D apply_gptj_rope	2026-05-19 10:36:21 +00:00
biondizzle	8e6721917e	Fix syntax in RoPE KV test	2026-05-19 10:31:07 +00:00
biondizzle	cbf440f75a	Add RoPE KV test	2026-05-19 10:28:15 +00:00
biondizzle	a5fabbdf66	Apply RoPE to KV in Blackwell attention path - fix NaN output	2026-05-19 10:27:15 +00:00
biondizzle	7e97551fd3	Fix: use self.scale instead of self.softmax_scale in Blackwell attention path	2026-05-19 10:04:46 +00:00
biondizzle	39310c357d	Patch compressor cache for Blackwell (no FlashMLA alignment) - fixes 91 missing layers	2026-05-19 09:52:23 +00:00
biondizzle	d9cd8fa165	Add debug patch to print layer name mismatch	2026-05-19 09:45:10 +00:00
biondizzle	9a0b015aac	Reduce max_model_len to 256	2026-05-19 09:37:38 +00:00
biondizzle	de1fb839f0	Patch SWA and Indexer cache specs for Blackwell (no FlashMLA alignment)	2026-05-19 09:29:57 +00:00
biondizzle	ea771ff70b	Reduce max_model_len to 512 for initial container test	2026-05-19 09:23:10 +00:00
biondizzle	bcfbd1e25b	Reduce max_model_len to 32768 (876544 requires 204 GiB KV cache)	2026-05-19 09:13:33 +00:00
biondizzle	e91421f06e	Fix KV cache page size patch: separate groups for large SWA pages	2026-05-19 09:05:14 +00:00
biondizzle	dd7f2627e8	Add full model forward test (WIP), sparse attention test passes	2026-05-19 09:04:19 +00:00
biondizzle	9781953509	Add CSA/HCA sparse attention kernel test	2026-05-19 09:02:12 +00:00
biondizzle	d60673864a	Fix kv_ref transpose in KV cache test	2026-05-19 08:58:46 +00:00
biondizzle	c1099d76d2	Add KV cache kernel test - fp8 quantize/dequant, paged cache, CSA/HCA compression	2026-05-19 08:57:31 +00:00
biondizzle	c54ddbdae1	Fix NVFP4 attention: slice output to actual N after 128-padding	2026-05-19 08:55:31 +00:00
biondizzle	42285b6c24	Add CuTeDSL NVFP4 attention kernel test - Q×K^T GEMM	2026-05-19 08:54:59 +00:00
biondizzle	9465929e6e	Add DeepSeek-V4 CSA/HCA attention pipeline test (not MLA)	2026-05-19 08:51:16 +00:00
biondizzle	fa71fbe909	Patch KV cache utils: handle DeepseekV4 SWA page sizes > MLA page sizes	2026-05-19 08:45:44 +00:00
biondizzle	d08a457829	Fix cos_sin cache shape in NVFP4 attention test	2026-05-19 08:38:55 +00:00
biondizzle	7dd8871e84	Add NVFP4 attention test - quantize Q and K for Q×K^T GEMM	2026-05-19 08:38:25 +00:00
biondizzle	2672e98e4c	Remove VLLM_NVFP4_GEMM_BACKEND env var - CuTeDSL auto-selects on Blackwell	2026-05-19 08:35:40 +00:00
biondizzle	914d27fee7	Update README + CURRENT_BUG: full CuTeDSL NVFP4 plan, no more PyTorch fallbacks. Mike's directive: build the full thing with NVFP4/CuTeDSL. No more 'optimize later' or 'just make it work' workarounds. Key updates: (1) README: full architecture docs (CSA/HCA/mHC), current status, NVFP4 coverage; (2) CURRENT_BUG: detailed plan for CuTeDSL NVFP4 attention, KV cache, RoPE; (3) both files document checkpoint key names, compress ratios, config issues; (4) removed all 'TODO: optimize later' hedging: we build it right the first time.	2026-05-19 08:26:16 +00:00
biondizzle	7d5c093c99	Fix KV cache crash: skip SWA cache write on Blackwell. The SWA KV cache uses fp8_ds_mla packed layout (37376 bytes per slot, not 512). Our naive FP8 quant + write had a shape mismatch. Fix: skip the SWA cache write entirely. The compressor (Triton) handles the compressed cache. For full SDPA attention, we use the raw kv tensor directly; we don't need the paged cache at all during prefill.	2026-05-19 08:21:57 +00:00
biondizzle	e1a642452a	Fix Blackwell: skip FlashMLA assertion + force CuTeDSL kernel. (1) DeepseekV4MLAAttention.__init__ had a hard assertion that the attention backend MUST be FlashMLA. On Blackwell, FlashMLA doesn't work but we bypass it via _attention_impl_blackwell(). Added _is_blackwell flag to skip FlashMLA-specific init (fp8_ds_mla cache format conversion). (2) Added VLLM_NVFP4_GEMM_BACKEND=cutedsl env var to docker-compose.yml to force CuTeDSL kernel selection for NVFP4 linear layers. (3) Updated register_cutedsl_kernel.py to also register CuTeDSL in _NVFP4_BACKEND_TO_KERNEL dict (for the env var override path).	2026-05-19 08:19:23 +00:00
biondizzle	2856323360	Fix torch.compile crash: move Blackwell path inside custom op boundary. The previous approach called _forward_blackwell() BEFORE the torch.ops.vllm.deepseek_v4_attention custom op, which broke torch.compile (dynamo can't trace the Python functions). Fix: instead of modifying forward(), modify attention_impl(), which runs INSIDE the custom op boundary. Detect SM100+ and dispatch to _attention_impl_blackwell(), which uses fused_qnorm_rope_kv_insert_py() instead of the C++ kernel and full_sdpa_attention() instead of FlashMLA. Removed dead _forward_blackwell method from forward().	2026-05-19 08:11:58 +00:00
biondizzle	a782ac00ce	Integrate CSA/SDPA attention into vLLM for Blackwell. Changes: add vllm/patches/layers/csa_attention.py, a pure PyTorch replacement for FlashMLA + fused CUDA kernels that don't work on SM100; patch deepseek_v4_attention.py to detect SM100+ and dispatch to _forward_blackwell(), which uses (1) fused_qnorm_rope_kv_insert_py() instead of the C++ kernel, (2) full_sdpa_attention() instead of FlashMLA, (3) BF16 inverse RoPE + BMM for wo_a (same as existing BF16 path); add csa_attention.py to Dockerfile. The Blackwell path: GEMM projections (CuTeDSL) → RMS norm → q_b → RoPE (PyTorch) → SDPA attention → inverse RoPE + wo_a BMM → wo_b → output	2026-05-19 08:04:07 +00:00
biondizzle	81931614e9	Update CURRENT_BUG: CSA kernel works, plan vLLM integration	2026-05-19 08:02:00 +00:00
biondizzle	9d067add90	Fix device reference in full_attention_reference	2026-05-19 08:01:31 +00:00
biondizzle	3e3e998578	Fix attention: manual causal mask for batched single-query	2026-05-19 08:01:08 +00:00
biondizzle	1e675ccc9a	Fix causal mask shape for SDPA: (1,1,T,T) broadcast	2026-05-19 08:00:39 +00:00
biondizzle	57615029a4	Fix KV expand for SDPA: (T,HD) → (T*NH, T, HD)	2026-05-19 08:00:08 +00:00
biondizzle	dd3a12bbda	Fix full_attention_reference: broadcast KV to all heads+positions	2026-05-19 07:59:28 +00:00
biondizzle	910015c47e	Fix kv shape: expand to (T, NH, HD) before reshape	2026-05-19 07:58:42 +00:00
biondizzle	3de75c4e37	Add CSA/HCA attention kernel (PyTorch SDPA, Blackwell-safe). Replaces vLLM's broken FlashMLA sparse attention, which doesn't work on SM100 (Blackwell). Uses torch.nn.functional.scaled_dot_product_attention which works on all GPUs. Architecture: CSA (C128A): batched sparse gather + SDPA on top-k positions; HCA (C4A): same with compressed KV + per-layer indexer; SWA: sliding window attention; full reference: standard SDPA for testing without compression. Also adds test_csa_attention_b200.py to verify the full attention path.	2026-05-19 07:58:10 +00:00
biondizzle	65f48be38c	Add attention path test: pinpoint FlashMLA failure	2026-05-19 07:54:01 +00:00
biondizzle	90d1098935	Update CURRENT_BUG: warmup gs is irrelevant, bug is in vLLM pipeline	2026-05-19 07:51:10 +00:00
biondizzle	04ad6409e5	Rewrite test: diagnose whether warmup gs matters at inference time	2026-05-19 07:49:41 +00:00
biondizzle	496848e158	Fix ffn_hc.scale key name	2026-05-19 07:48:09 +00:00
biondizzle	5a4e355d3a	Add model forward test: reproduce vLLM empty output outside container	2026-05-19 07:47:48 +00:00
biondizzle	f5ce728ef2	Fix OOM: add --max-model-len=876544 + revert CPU dummy weight. The CPU dummy weight broke torch.mm(compressor.weight.T) which expects GPU tensors. Instead, reduce max_model_len to fit KV cache within available memory (876544 instead of 1048576).	2026-05-19 07:35:43 +00:00
biondizzle	79a41d9197	Save ~5-8 GiB GPU VRAM: move dummy weight to CPU. The CuTeDSL kernel never reads layer.weight; it uses the runner's pre-processed fp4/sf/gs tensors. The dummy BF16 weight exists only for vLLM model introspection. Moving it to CPU saves massive VRAM: q_b_proj alone: 65536 * 1536 * 2 = 192 MiB on GPU → ~0 MiB; all layers combined: ~5-8 GiB saved. This should fix the KV cache OOM (needed 10.28 GiB, had 9.36 GiB).	2026-05-19 07:29:38 +00:00
biondizzle	cebc586014	Fix OOM: use 1-token warmup sample + free immediately. 8 tokens * 7168 hidden * ~40 NVFP4 layers = ~2.3 MiB per layer * 40 = 92 MiB. But the dummy weight param (out_features * in_features * 2 bytes BF16) was the real killer: each layer allocated a BF16 dummy of its full weight shape. With 1 token the warmup still gets a valid gs, and empty_cache frees the sample tensor before KV cache allocation.	2026-05-19 07:28:57 +00:00
biondizzle	5122cadc94	Update CURRENT_BUG.md: root cause found + fix committed	2026-05-19 07:21:30 +00:00
biondizzle	6e6f95dfa8	FIX: Use warmup-based activation global scale in CuTeDSL linear kernel. The checkpoint's input_scale is a calibration-time value that doesn't match what quantize_activation_nvfp4 expects at runtime. Using it as the activation global scale produces garbage output (empty EOS tokens). The fix: run a warmup forward pass with sample data and compute the activation global scale from the actual activation distribution, exactly like our standalone test does (which passes with cosine >= 0.994). This is the root cause of the vLLM server returning empty content.	2026-05-19 07:21:07 +00:00
biondizzle	0a7769972f	Fix garbled shared_expert_pipeline.py: imports/class were merged	2026-05-19 07:18:10 +00:00


431 Commits
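
The dummy-weight size quoted in commit 79a41d9197 can be checked with a few lines of arithmetic. This is a minimal sketch, assuming the figure refers to a 65536 x 1536 BF16 tensor for q_b_proj (2 bytes per element, as commit cebc586014 states for BF16). The constant and function names below are hypothetical and do not come from the repository.

```python
# Assumed q_b_proj dummy-weight shape (out_features x in_features).
# These names are illustrative only; they are not taken from the repository.
OUT_FEATURES = 65536
IN_FEATURES = 1536
BYTES_PER_BF16 = 2  # BF16 stores each element in 2 bytes


def dummy_weight_mib(out_features, in_features, bytes_per_elem=BYTES_PER_BF16):
    """Return the size in MiB of a dense weight of the given shape."""
    return out_features * in_features * bytes_per_elem / (1024 ** 2)


q_b_proj_mib = dummy_weight_mib(OUT_FEATURES, IN_FEATURES)
print(f"q_b_proj dummy weight: {q_b_proj_mib} MiB")
```

Under these assumptions the result is 192 MiB, which matches the figure in the commit message.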