nvfp4-megamoe-kernel

Author	SHA1	Message	Date
biondizzle	c54ddbdae1	Fix NVFP4 attention: slice output to actual N after 128-padding	2026-05-19 08:55:31 +00:00
biondizzle	42285b6c24	Add CuTeDSL NVFP4 attention kernel test - Q×K^T GEMM	2026-05-19 08:54:59 +00:00
biondizzle	9465929e6e	Add DeepSeek-V4 CSA/HCA attention pipeline test (not MLA)	2026-05-19 08:51:16 +00:00
biondizzle	fa71fbe909	Patch KV cache utils: handle DeepseekV4 SWA page sizes > MLA page sizes	2026-05-19 08:45:44 +00:00
biondizzle	d08a457829	Fix cos_sin cache shape in NVFP4 attention test	2026-05-19 08:38:55 +00:00
biondizzle	7dd8871e84	Add NVFP4 attention test - quantize Q and K for Q×K^T GEMM	2026-05-19 08:38:25 +00:00
biondizzle	2672e98e4c	Remove VLLM_NVFP4_GEMM_BACKEND env var - CuTeDSL auto-selects on Blackwell	2026-05-19 08:35:40 +00:00
biondizzle	914d27fee7	Update README + CURRENT_BUG: full CuTeDSL NVFP4 plan, no more PyTorch fallbacks Mike's directive: build the full thing with NVFP4/CuTeDSL. No more 'optimize later' or 'just make it work' workarounds. Key updates: - README: full architecture docs (CSA/HCA/mHC), current status, NVFP4 coverage - CURRENT_BUG: detailed plan for CuTeDSL NVFP4 attention, KV cache, RoPE - Both files document: checkpoint key names, compress ratios, config issues - Removed all 'TODO: optimize later' hedging — we build it right the first time	2026-05-19 08:26:16 +00:00
biondizzle	7d5c093c99	Fix KV cache crash: skip SWA cache write on Blackwell The SWA KV cache uses fp8_ds_mla packed layout (37376 bytes per slot, not 512). Our naive FP8 quant + write had a shape mismatch. Fix: skip the SWA cache write entirely. The compressor (Triton) handles the compressed cache. For full SDPA attention, we use the raw kv tensor directly — we don't need the paged cache at all during prefill.	2026-05-19 08:21:57 +00:00
biondizzle	e1a642452a	Fix Blackwell: skip FlashMLA assertion + force CuTeDSL kernel 1. DeepseekV4MLAAttention.__init__ had a hard assertion that the attention backend MUST be FlashMLA. On Blackwell, FlashMLA doesn't work but we bypass it via _attention_impl_blackwell(). Added _is_blackwell flag to skip FlashMLA-specific init (fp8_ds_mla cache format conversion). 2. Added VLLM_NVFP4_GEMM_BACKEND=cutedsl env var to docker-compose.yml to force CuTeDSL kernel selection for NVFP4 linear layers. 3. Updated register_cutedsl_kernel.py to also register CuTeDSL in _NVFP4_BACKEND_TO_KERNEL dict (for the env var override path).	2026-05-19 08:19:23 +00:00
biondizzle	2856323360	Fix torch.compile crash: move Blackwell path inside custom op boundary The previous approach called _forward_blackwell() BEFORE the torch.ops.vllm.deepseek_v4_attention custom op, which broke torch.compile (dynamo can't trace the Python functions). Fix: instead of modifying forward(), modify attention_impl() which runs INSIDE the custom op boundary. Detect SM100+ and dispatch to _attention_impl_blackwell() which uses: - fused_qnorm_rope_kv_insert_py() instead of C++ kernel - full_sdpa_attention() instead of FlashMLA Removed dead _forward_blackwell method from forward().	2026-05-19 08:11:58 +00:00
biondizzle	a782ac00ce	Integrate CSA/SDPA attention into vLLM for Blackwell - Add vllm/patches/layers/csa_attention.py: pure PyTorch replacement for FlashMLA + fused CUDA kernels that don't work on SM100 - Patch deepseek_v4_attention.py: detect SM100+ and dispatch to _forward_blackwell() which uses: 1. fused_qnorm_rope_kv_insert_py() instead of C++ kernel 2. full_sdpa_attention() instead of FlashMLA 3. BF16 inverse RoPE + BMM for wo_a (same as existing BF16 path) - Add csa_attention.py to Dockerfile The Blackwell path: GEMM projections (CuTeDSL) → RMS norm → q_b → RoPE (PyTorch) → SDPA attention → inverse RoPE + wo_a BMM → wo_b → output	2026-05-19 08:04:07 +00:00
biondizzle	81931614e9	Update CURRENT_BUG: CSA kernel works, plan vLLM integration	2026-05-19 08:02:00 +00:00
biondizzle	9d067add90	Fix device reference in full_attention_reference	2026-05-19 08:01:31 +00:00
biondizzle	3e3e998578	Fix attention: manual causal mask for batched single-query	2026-05-19 08:01:08 +00:00
biondizzle	1e675ccc9a	Fix causal mask shape for SDPA: (1,1,T,T) broadcast	2026-05-19 08:00:39 +00:00
biondizzle	57615029a4	Fix KV expand for SDPA: (T,HD) → (T*NH, T, HD)	2026-05-19 08:00:08 +00:00
biondizzle	dd3a12bbda	Fix full_attention_reference: broadcast KV to all heads+positions	2026-05-19 07:59:28 +00:00
biondizzle	910015c47e	Fix kv shape: expand to (T, NH, HD) before reshape	2026-05-19 07:58:42 +00:00
biondizzle	3de75c4e37	Add CSA/HCA attention kernel (PyTorch SDPA, Blackwell-safe) Replaces vLLM's broken FlashMLA sparse attention which doesn't work on SM100 (Blackwell). Uses torch.nn.functional.scaled_dot_product_attention which works on all GPUs. Architecture: - CSA (C128A): Batched sparse gather + SDPA on top-k positions - HCA (C4A): Same with compressed KV + per-layer indexer - SWA: Sliding window attention - Full reference: standard SDPA for testing without compression Also adds test_csa_attention_b200.py to verify the full attention path.	2026-05-19 07:58:10 +00:00
biondizzle	65f48be38c	Add attention path test: pinpoint FlashMLA failure	2026-05-19 07:54:01 +00:00
biondizzle	90d1098935	Update CURRENT_BUG: warmup gs is irrelevant, bug is in vLLM pipeline	2026-05-19 07:51:10 +00:00
biondizzle	04ad6409e5	Rewrite test: diagnose whether warmup gs matters at inference time	2026-05-19 07:49:41 +00:00
biondizzle	496848e158	Fix ffn_hc.scale key name	2026-05-19 07:48:09 +00:00
biondizzle	5a4e355d3a	Add model forward test: reproduce vLLM empty output outside container	2026-05-19 07:47:48 +00:00
biondizzle	f5ce728ef2	Fix OOM: add --max-model-len=876544 + revert CPU dummy weight The CPU dummy weight broke torch.mm(compressor.weight.T) which expects GPU tensors. Instead, reduce max_model_len to fit KV cache within available memory (876544 instead of 1048576).	2026-05-19 07:35:43 +00:00
biondizzle	79a41d9197	Save ~5-8 GiB GPU VRAM: move dummy weight to CPU The CuTeDSL kernel never reads layer.weight; it uses the runner's pre-processed fp4/sf/gs tensors. The dummy BF16 weight exists only for vLLM model introspection. Moving it to CPU saves massive VRAM: - q_b_proj alone: 65536*1536*2 = 192 MiB on GPU → ~0 MiB - All layers combined: ~5-8 GiB saved This should fix the KV cache OOM (needed 10.28 GiB, had 9.36 GiB).	2026-05-19 07:29:38 +00:00
biondizzle	cebc586014	Fix OOM: use 1-token warmup sample + free immediately 8 tokens * 7168 hidden * ~40 NVFP4 layers = ~2.3 MiB per layer * 40 = 92 MiB But the dummy weight param (out_features * in_features * 2 bytes BF16) was the real killer — each layer allocated a BF16 dummy of its full weight shape. With 1 token the warmup still gets a valid gs, and empty_cache frees the sample tensor before KV cache allocation.	2026-05-19 07:28:57 +00:00
biondizzle	5122cadc94	Update CURRENT_BUG.md: root cause found + fix committed	2026-05-19 07:21:30 +00:00
biondizzle	6e6f95dfa8	FIX: Use warmup-based activation global scale in CuTeDSL linear kernel The checkpoint's input_scale is a calibration-time value that doesn't match what quantize_activation_nvfp4 expects at runtime. Using it as the activation global scale produces garbage output (empty EOS tokens). The fix: run a warmup forward pass with sample data and compute the activation global scale from the actual activation distribution, exactly like our standalone test does (which passes with cosine >= 0.994). This is the root cause of the vLLM server returning empty content.	2026-05-19 07:21:07 +00:00
biondizzle	0a7769972f	Fix garbled shared_expert_pipeline.py: imports/class were merged	2026-05-19 07:18:10 +00:00
biondizzle	87453a53b0	Fix checkpoint keys: attn_hc.*, compressor.*, q_a_proj/q_b_proj/kv_proj	2026-05-19 07:17:37 +00:00
biondizzle	f97762cc9f	Fix full layer test: use correct checkpoint key names Checkpoint uses q_a_proj/q_b_proj/kv_proj/q_a_norm — NOT the vLLM fused names (fused_wqa_wkv, wq_b, q_norm).	2026-05-19 07:16:33 +00:00
biondizzle	cc48a5715e	Add full layer 0 B200 test: CuTeDSL vs BF16 reference Tests each attention/FFN projection individually against BF16 dequantized reference, then runs full layer forward. Identifies exactly where garbage enters the pipeline. Key finding: checkpoint uses different names than vLLM: - q_a_proj, q_b_proj, kv_proj (not fused_wqa_wkv) - q_a_norm (not q_norm) - compressor.* (C4A layers only) - sinks (attn_sink)	2026-05-19 07:14:58 +00:00
biondizzle	dbaa3d6fe6	Update CURRENT_BUG.md and README with current state Empty output still happening. Documented what's been tried, what works standalone, what we don't know, and the plan to bypass vLLM's kernel selection entirely by calling our runners directly.	2026-05-19 07:05:45 +00:00
biondizzle	62abf41b03	Revert deepseek_v4_attention.py to `ffc2264` — don't nuke existing patches The file at `ffc2264` already had our BF16 wo_a path (_apply_inv_rope_bf16 + BMM + all-gather) with FP8 fallback. I was replacing it from the wrong vllm source, losing all prior work. Restored to the known-good version.	2026-05-19 06:52:40 +00:00
biondizzle	4c2effa2be	Fix attention patch: source from v0.21.0 stable, not local clone The local vllm clone has different imports (breakable_cudagraph) that don't exist in the Docker image. Now sourced from v0.21.0 tag.	2026-05-19 06:44:59 +00:00
biondizzle	284b6a5d57	Fix attention patch: use original vllm imports, only patch forward method Previous version copied the entire file from our local vllm clone which had imports (breakable_cudagraph) missing from the Docker image's vllm. Now we start from the Docker image's original file and only patch the DeepseekV4MultiHeadLatentAttentionWrapper.forward method.	2026-05-19 06:40:58 +00:00
biondizzle	199efe0871	Fix dims: o_groups=16, o_lora_rank=1024 from config	2026-05-19 06:37:25 +00:00
biondizzle	b4fee70151	Fix device mismatch in test	2026-05-19 06:36:22 +00:00
biondizzle	6b4b9774d1	Add B200 test: prove O-projection root cause + validate fix	2026-05-19 06:32:54 +00:00
biondizzle	77baca668e	Patch attention forward: BF16 inv RoPE + BMM wo_a + NVFP4 wo_b The original attention forward uses fused_inv_rope_fp8_quant + deepseek_v4_fp8_einsum which requires wo_a to have FP8 weights and weight_scale_inv. Our checkpoint has wo_a in BF16, so the original path crashes (produces empty output). Replace O projection with: 1. _apply_inv_rope_bf16: pure PyTorch inverse RoPE (no FP8) 2. BMM grouped linear for wo_a (BF16) 3. NVFP4 wo_b via CuTeDSL Also fixes activation global scale bug from previous commit: - input_global_scale_inv IS the activation gs, don't re-invert - w13_input_scale_orig (after undoing convert) IS the MoE gs Test: tests/test_o_projection.py validates inv RoPE roundtrip and wo_a BMM correctness.	2026-05-19 06:30:18 +00:00
biondizzle	ffc2264c41	Fix activation global scale: don't double-invert input_global_scale_inv The activation global scale = amax / (6.0 * 448.0). Both the linear kernel and MoE kernel were taking 1.0 / (value that's already the correct gs), inverting it and producing garbage quantization. Linear kernel: input_global_scale_inv IS the gs, so use it directly. MoE kernel: w13_input_scale_orig (after undoing convert inversion) IS the gs, so use it directly.	2026-05-19 06:03:08 +00:00
biondizzle	918342feeb	MHC: replace monolithic layers/mhc.py with pure PyTorch The nightly vLLM image puts ALL MHC code in layers/mhc.py (not kernels/mhc/). It imports tilelang at top level and JIT-compiles kernels. Replace the entire file with pure PyTorch implementations using direct_register_custom_op for mhc_pre, mhc_post, mhc_fused_post_pre, and hc_head_fused_kernel. No tilelang dependency at all. Also removes the separate mhc_torch_ops.py and kernels/mhc/ patches which don't apply to the nightly image layout.	2026-05-19 05:41:55 +00:00
biondizzle	dfd9c10ae9	Fix MHC import: don't import .torch from layers/mhc.py The layers/mhc.py was trying to import kernels.mhc.torch which failed because our __init__.py was breaking the package. Instead, just import our mhc_torch_ops which has everything we need. Also fix __init__.py to explicitly import mhc_pre_torch and mhc_post_torch from .torch instead of using import *.	2026-05-19 05:36:35 +00:00
biondizzle	e404e18efb	Also replace layers/mhc.py CustomOp dispatch The original layers/mhc.py forward_cuda calls torch.ops.vllm.mhc_pre_tilelang which triggers TileLang JIT. Replace with our torch implementations in forward_cuda. This is what the CustomOp dispatch routes through.	2026-05-19 05:31:05 +00:00
biondizzle	5e6d459145	Fix MHC custom op registration Previous approach used @CustomOp.register which doesn't create torch.ops.vllm.mhc_pre. The model code calls torch.ops.vllm.mhc_pre() directly, which requires direct_register_custom_op. Use direct_register_custom_op to register mhc_pre, mhc_post, mhc_fused_post_pre, and hc_head_fused_kernel as PyTorch custom ops with torch (eager) implementations. Patch kernels/mhc/__init__.py to import from both .torch (original) and .mhc_torch_ops (our replacements), skipping tilelang import.	2026-05-19 05:19:48 +00:00
biondizzle	9ff1679064	Replace MHC TileLang kernels with pure PyTorch TileLang kernels (mhc_pre_big_fuse_tilelang, mhc_fused_tilelang) don't work correctly on Blackwell SM100 and cause empty model output. Replace with pure PyTorch implementations: - mhc_pre_torch: Sinkhorn-normalized HC residual mixing - mhc_post_torch: HC post block (einsum residual + post layer mix) - mhc_fused_post_pre_torch: Fused post+pre (composition of above) - hc_head_fused_torch: RMS norm + linear + sigmoid + weighted sum Patch both layers/mhc.py (CustomOp dispatch) and kernels/mhc/__init__.py (no tilelang import). Also remove tilelang from pyproject.toml deps.	2026-05-19 05:07:41 +00:00
biondizzle	5c770c68ca	Keep MoE scale tensors: framework warmup needs them The framework's deep_gemm_warmup calls get_fused_moe_quant_config which accesses w13_input_scale etc. Setting them to None caused TypeError: float / NoneType. Keep scales (small tensors) and only free the large weight tensors.	2026-05-19 04:50:31 +00:00
biondizzle	e0f385ac45	Fix workspace_shapes: output dim is hidden_dim, not K2 K comes from hidden_states.size(-1) which is the full BF16 dimension (7168), not the packed weight dimension. K2=14336 is wrong. The MoE output is always hidden_dim (7168).	2026-05-19 04:42:22 +00:00
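
The arithmetic behind two of the commits above can be sketched in a few lines of plain Python. This is an illustrative sketch only, not code from the repository: the function names and the sample activations are hypothetical, the scale formula is the one quoted in ffc2264c41 (amax / (6.0 * 448.0)), and the 65536 x 1536 BF16 shape is an assumed reading of the 192 MiB q_b_proj figure in 79a41d9197.

```python
FP4_E2M1_MAX = 6.0    # largest magnitude representable in FP4 E2M1
FP8_E4M3_MAX = 448.0  # largest magnitude representable in FP8 E4M3


def activation_global_scale(activations):
    """Global scale as quoted in commit ffc2264c41: amax / (6.0 * 448.0)."""
    amax = max(abs(x) for x in activations)
    return amax / (FP4_E2M1_MAX * FP8_E4M3_MAX)


def dense_weight_mib(out_features, in_features, bytes_per_element=2):
    """Size in MiB of a dense weight; 2 bytes per element corresponds to BF16."""
    return out_features * in_features * bytes_per_element / (1024 ** 2)


# Hypothetical warmup activations; the largest magnitude is 1344.0.
sample = [0.5, -2.0, 1.25, -1344.0]

gs = activation_global_scale(sample)  # 1344 / 2688 = 0.5
double_inverted = 1.0 / gs            # the mistake described in ffc2264c41

print(gs)                             # 0.5
print(double_inverted)                # 2.0, four times too large here
print(dense_weight_mib(65536, 1536))  # 192.0
```

With these sample values, inverting a scale that is already correct moves it from 0.5 to 2.0, which is the kind of mismatch the commit message blames for garbage quantization. The size helper reproduces the 192 MiB figure only under the assumed shape.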


412 Commits