nvfp4-megamoe-kernel

Author	SHA1	Message	Date
biondizzle	ea771ff70b	Reduce max_model_len to 512 for initial container test	2026-05-19 09:23:10 +00:00
biondizzle	bcfbd1e25b	Reduce max_model_len to 32768 (876544 requires 204 GiB KV cache)	2026-05-19 09:13:33 +00:00
biondizzle	e91421f06e	Fix KV cache page size patch: separate groups for large SWA pages	2026-05-19 09:05:14 +00:00
biondizzle	dd7f2627e8	Add full model forward test (WIP), sparse attention test passes	2026-05-19 09:04:19 +00:00
biondizzle	9781953509	Add CSA/HCA sparse attention kernel test	2026-05-19 09:02:12 +00:00
biondizzle	d60673864a	Fix kv_ref transpose in KV cache test	2026-05-19 08:58:46 +00:00
biondizzle	c1099d76d2	Add KV cache kernel test - fp8 quantize/dequant, paged cache, CSA/HCA compression	2026-05-19 08:57:31 +00:00
biondizzle	c54ddbdae1	Fix NVFP4 attention: slice output to actual N after 128-padding	2026-05-19 08:55:31 +00:00
biondizzle	42285b6c24	Add CuTeDSL NVFP4 attention kernel test - Q×K^T GEMM	2026-05-19 08:54:59 +00:00
biondizzle	9465929e6e	Add DeepSeek-V4 CSA/HCA attention pipeline test (not MLA)	2026-05-19 08:51:16 +00:00
biondizzle	fa71fbe909	Patch KV cache utils: handle DeepseekV4 SWA page sizes > MLA page sizes	2026-05-19 08:45:44 +00:00
biondizzle	d08a457829	Fix cos_sin cache shape in NVFP4 attention test	2026-05-19 08:38:55 +00:00
biondizzle	7dd8871e84	Add NVFP4 attention test - quantize Q and K for Q×K^T GEMM	2026-05-19 08:38:25 +00:00
biondizzle	2672e98e4c	Remove VLLM_NVFP4_GEMM_BACKEND env var - CuTeDSL auto-selects on Blackwell	2026-05-19 08:35:40 +00:00
biondizzle	914d27fee7	Update README + CURRENT_BUG: full CuTeDSL NVFP4 plan, no more PyTorch fallbacks Mike's directive: build the full thing with NVFP4/CuTeDSL. No more 'optimize later' or 'just make it work' workarounds. Key updates: - README: full architecture docs (CSA/HCA/mHC), current status, NVFP4 coverage - CURRENT_BUG: detailed plan for CuTeDSL NVFP4 attention, KV cache, RoPE - Both files document: checkpoint key names, compress ratios, config issues - Removed all 'TODO: optimize later' hedging — we build it right the first time	2026-05-19 08:26:16 +00:00
biondizzle	7d5c093c99	Fix KV cache crash: skip SWA cache write on Blackwell The SWA KV cache uses fp8_ds_mla packed layout (37376 bytes per slot, not 512). Our naive FP8 quant + write had a shape mismatch. Fix: skip the SWA cache write entirely. The compressor (Triton) handles the compressed cache. For full SDPA attention, we use the raw kv tensor directly — we don't need the paged cache at all during prefill.	2026-05-19 08:21:57 +00:00
biondizzle	e1a642452a	Fix Blackwell: skip FlashMLA assertion + force CuTeDSL kernel 1. DeepseekV4MLAAttention.__init__ had a hard assertion that the attention backend MUST be FlashMLA. On Blackwell, FlashMLA doesn't work but we bypass it via _attention_impl_blackwell(). Added _is_blackwell flag to skip FlashMLA-specific init (fp8_ds_mla cache format conversion). 2. Added VLLM_NVFP4_GEMM_BACKEND=cutedsl env var to docker-compose.yml to force CuTeDSL kernel selection for NVFP4 linear layers. 3. Updated register_cutedsl_kernel.py to also register CuTeDSL in _NVFP4_BACKEND_TO_KERNEL dict (for the env var override path).	2026-05-19 08:19:23 +00:00
biondizzle	2856323360	Fix torch.compile crash: move Blackwell path inside custom op boundary The previous approach called _forward_blackwell() BEFORE the torch.ops.vllm.deepseek_v4_attention custom op, which broke torch.compile (dynamo can't trace the Python functions). Fix: instead of modifying forward(), modify attention_impl() which runs INSIDE the custom op boundary. Detect SM100+ and dispatch to _attention_impl_blackwell() which uses: - fused_qnorm_rope_kv_insert_py() instead of C++ kernel - full_sdpa_attention() instead of FlashMLA Removed dead _forward_blackwell method from forward().	2026-05-19 08:11:58 +00:00
biondizzle	a782ac00ce	Integrate CSA/SDPA attention into vLLM for Blackwell - Add vllm/patches/layers/csa_attention.py: pure PyTorch replacement for FlashMLA + fused CUDA kernels that don't work on SM100 - Patch deepseek_v4_attention.py: detect SM100+ and dispatch to _forward_blackwell() which uses: 1. fused_qnorm_rope_kv_insert_py() instead of C++ kernel 2. full_sdpa_attention() instead of FlashMLA 3. BF16 inverse RoPE + BMM for wo_a (same as existing BF16 path) - Add csa_attention.py to Dockerfile The Blackwell path: GEMM projections (CuTeDSL) → RMS norm → q_b → RoPE (PyTorch) → SDPA attention → inverse RoPE + wo_a BMM → wo_b → output	2026-05-19 08:04:07 +00:00
biondizzle	81931614e9	Update CURRENT_BUG: CSA kernel works, plan vLLM integration	2026-05-19 08:02:00 +00:00
biondizzle	9d067add90	Fix device reference in full_attention_reference	2026-05-19 08:01:31 +00:00
biondizzle	3e3e998578	Fix attention: manual causal mask for batched single-query	2026-05-19 08:01:08 +00:00
biondizzle	1e675ccc9a	Fix causal mask shape for SDPA: (1,1,T,T) broadcast	2026-05-19 08:00:39 +00:00
biondizzle	57615029a4	Fix KV expand for SDPA: (T,HD) → (T*NH, T, HD)	2026-05-19 08:00:08 +00:00
biondizzle	dd3a12bbda	Fix full_attention_reference: broadcast KV to all heads+positions	2026-05-19 07:59:28 +00:00
biondizzle	910015c47e	Fix kv shape: expand to (T, NH, HD) before reshape	2026-05-19 07:58:42 +00:00
biondizzle	3de75c4e37	Add CSA/HCA attention kernel (PyTorch SDPA, Blackwell-safe) Replaces vLLM's broken FlashMLA sparse attention which doesn't work on SM100 (Blackwell). Uses torch.nn.functional.scaled_dot_product_attention which works on all GPUs. Architecture: - CSA (C128A): Batched sparse gather + SDPA on top-k positions - HCA (C4A): Same with compressed KV + per-layer indexer - SWA: Sliding window attention - Full reference: standard SDPA for testing without compression Also adds test_csa_attention_b200.py to verify the full attention path.	2026-05-19 07:58:10 +00:00
biondizzle	65f48be38c	Add attention path test: pinpoint FlashMLA failure	2026-05-19 07:54:01 +00:00
biondizzle	90d1098935	Update CURRENT_BUG: warmup gs is irrelevant, bug is in vLLM pipeline	2026-05-19 07:51:10 +00:00
biondizzle	04ad6409e5	Rewrite test: diagnose whether warmup gs matters at inference time	2026-05-19 07:49:41 +00:00
biondizzle	496848e158	Fix ffn_hc.scale key name	2026-05-19 07:48:09 +00:00
biondizzle	5a4e355d3a	Add model forward test: reproduce vLLM empty output outside container	2026-05-19 07:47:48 +00:00
biondizzle	f5ce728ef2	Fix OOM: add --max-model-len=876544 + revert CPU dummy weight The CPU dummy weight broke torch.mm(compressor.weight.T) which expects GPU tensors. Instead, reduce max_model_len to fit KV cache within available memory (876544 instead of 1048576).	2026-05-19 07:35:43 +00:00
biondizzle	79a41d9197	Save ~5-8 GiB GPU VRAM: move dummy weight to CPU The CuTeDSL kernel never reads layer.weight; it uses the runner's pre-processed fp4/sf/gs tensors. The dummy BF16 weight exists only for vLLM model introspection. Moving it to CPU saves massive VRAM: - q_b_proj alone: 65536*1536*2 = 192 MiB on GPU → ~0 MiB - All layers combined: ~5-8 GiB saved This should fix the KV cache OOM (needed 10.28 GiB, had 9.36 GiB).	2026-05-19 07:29:38 +00:00
biondizzle	cebc586014	Fix OOM: use 1-token warmup sample + free immediately 8 tokens * 7168 hidden * ~40 NVFP4 layers = ~2.3 MiB per layer * 40 = 92 MiB But the dummy weight param (out_features * in_features * 2 bytes BF16) was the real killer — each layer allocated a BF16 dummy of its full weight shape. With 1 token the warmup still gets a valid gs, and empty_cache frees the sample tensor before KV cache allocation.	2026-05-19 07:28:57 +00:00
biondizzle	5122cadc94	Update CURRENT_BUG.md: root cause found + fix committed	2026-05-19 07:21:30 +00:00
biondizzle	6e6f95dfa8	FIX: Use warmup-based activation global scale in CuTeDSL linear kernel The checkpoint's input_scale is a calibration-time value that doesn't match what quantize_activation_nvfp4 expects at runtime. Using it as the activation global scale produces garbage output (empty EOS tokens). The fix: run a warmup forward pass with sample data and compute the activation global scale from the actual activation distribution, exactly like our standalone test does (which passes with cosine >= 0.994). This is the root cause of the vLLM server returning empty content.	2026-05-19 07:21:07 +00:00
biondizzle	0a7769972f	Fix garbled shared_expert_pipeline.py: imports/class were merged	2026-05-19 07:18:10 +00:00
biondizzle	87453a53b0	Fix checkpoint keys: attn_hc.*, compressor.*, q_a_proj/q_b_proj/kv_proj	2026-05-19 07:17:37 +00:00
biondizzle	f97762cc9f	Fix full layer test: use correct checkpoint key names Checkpoint uses q_a_proj/q_b_proj/kv_proj/q_a_norm — NOT the vLLM fused names (fused_wqa_wkv, wq_b, q_norm).	2026-05-19 07:16:33 +00:00
biondizzle	cc48a5715e	Add full layer 0 B200 test: CuTeDSL vs BF16 reference Tests each attention/FFN projection individually against BF16 dequantized reference, then runs full layer forward. Identifies exactly where garbage enters the pipeline. Key finding: checkpoint uses different names than vLLM: - q_a_proj, q_b_proj, kv_proj (not fused_wqa_wkv) - q_a_norm (not q_norm) - compressor.* (C4A layers only) - sinks (attn_sink)	2026-05-19 07:14:58 +00:00
biondizzle	dbaa3d6fe6	Update CURRENT_BUG.md and README with current state Empty output still happening. Documented what's been tried, what works standalone, what we don't know, and the plan to bypass vLLM's kernel selection entirely by calling our runners directly.	2026-05-19 07:05:45 +00:00
biondizzle	62abf41b03	Revert deepseek_v4_attention.py to `ffc2264` — don't nuke existing patches The file at `ffc2264` already had our BF16 wo_a path (_apply_inv_rope_bf16 + BMM + all-gather) with FP8 fallback. I was replacing it from the wrong vllm source, losing all prior work. Restored to the known-good version.	2026-05-19 06:52:40 +00:00
biondizzle	4c2effa2be	Fix attention patch: source from v0.21.0 stable, not local clone The local vllm clone has different imports (breakable_cudagraph) that don't exist in the Docker image. Now sourced from v0.21.0 tag.	2026-05-19 06:44:59 +00:00
biondizzle	284b6a5d57	Fix attention patch: use original vllm imports, only patch forward method Previous version copied the entire file from our local vllm clone which had imports (breakable_cudagraph) missing from the Docker image's vllm. Now we start from the Docker image's original file and only patch the DeepseekV4MultiHeadLatentAttentionWrapper.forward method.	2026-05-19 06:40:58 +00:00
biondizzle	199efe0871	Fix dims: o_groups=16, o_lora_rank=1024 from config	2026-05-19 06:37:25 +00:00
biondizzle	b4fee70151	Fix device mismatch in test	2026-05-19 06:36:22 +00:00
biondizzle	6b4b9774d1	Add B200 test: prove O-projection root cause + validate fix	2026-05-19 06:32:54 +00:00
biondizzle	77baca668e	Patch attention forward: BF16 inv RoPE + BMM wo_a + NVFP4 wo_b The original attention forward uses fused_inv_rope_fp8_quant + deepseek_v4_fp8_einsum which requires wo_a to have FP8 weights and weight_scale_inv. Our checkpoint has wo_a in BF16, so the original path crashes (produces empty output). Replace O projection with: 1. _apply_inv_rope_bf16: pure PyTorch inverse RoPE (no FP8) 2. BMM grouped linear for wo_a (BF16) 3. NVFP4 wo_b via CuTeDSL Also fixes activation global scale bug from previous commit: - input_global_scale_inv IS the activation gs, don't re-invert - w13_input_scale_orig (after undoing convert) IS the MoE gs Test: tests/test_o_projection.py validates inv RoPE roundtrip and wo_a BMM correctness.	2026-05-19 06:30:18 +00:00
biondizzle	ffc2264c41	Fix activation global scale: don't double-invert input_global_scale_inv The activation global scale = amax / (6.0 * 448.0). Both the linear kernel and MoE kernel were taking 1.0 / (value that's already the correct gs), inverting it and producing garbage quantization. Linear kernel: input_global_scale_inv IS the gs, so use it directly. MoE kernel: w13_input_scale_orig (after undoing convert inversion) IS the gs, so use it directly.	2026-05-19 06:03:08 +00:00
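
The double-inversion bug described in ffc2264c41 can be illustrated with a minimal sketch. Only the formula amax / (6.0 * 448.0) comes from the commit message. The helper names `activation_global_scale` and `scaled_amax` are hypothetical, plain Python lists stand in for tensors, and the sketch assumes the quantizer divides activations by the global scale; none of this is taken from the repository's kernels.

```python
FP4_MAX = 6.0    # largest magnitude representable in FP4 (E2M1)
FP8_MAX = 448.0  # largest magnitude representable in FP8 (E4M3)


def activation_global_scale(values):
    """Global scale as stated in the commit: amax / (6.0 * 448.0)."""
    amax = max(abs(v) for v in values)
    return amax / (FP4_MAX * FP8_MAX)


def scaled_amax(values, gs):
    """Largest magnitude after dividing by gs (assumed quantizer convention)."""
    amax = max(abs(v) for v in values)
    return amax / gs


activations = [0.5, -3.0, 1.25, 2.0]
gs = activation_global_scale(activations)

# Using gs directly: the largest activation lands at about 6.0 * 448.0.
correct = scaled_amax(activations, gs)

# Re-inverting a value that is already the gs collapses the range toward zero.
double_inverted = scaled_amax(activations, 1.0 / gs)

print(correct, double_inverted)
```

Under these assumptions, the correct path uses the full representable range, while the double-inverted path squeezes every activation into a tiny interval near zero, which is consistent with the garbage quantization the commit reports.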

719 Commits