nvfp4-megamoe-kernel

Author	SHA1	Message	Date
biondizzle	9cdf79fd9c	wip: fused SwiGLU kernel scaffold + bridge interleave + plan - fused_swiglu_grouped_mm.py: copypaste of torch_scaled_grouped_mm.py with class rename and fused_swiglu/swiglu_limit params added - bridge.py: added interleave_l1_weights, deinterleave_l1_weights, warmup_fused_swiglu_compilation - Pure-PyTorch interleave invariant passes (A@cat vs deinterleave(A@interleave)) - Standalone GEMM interleave test fails due to kernel-internal N-tiling layout (expected, skipping per plan) - FUSED_EPILOGUE_PLAN.md updated with register layout, amax shuffle plan, 4-step implementation strategy	2026-05-20 03:04:38 +00:00
biondizzle	2f8b26c176	chore: remove unused _expert_id_range after bincount migration	2026-05-20 02:17:44 +00:00
biondizzle	7e2adb7e85	perf: replace expert counting O(n*E) comparison with torch.bincount O(n) Bug #5 fix: (sorted_ids.unsqueeze(1) == expert_id_range.unsqueeze(0)).sum(dim=0) materializes a (num_slots × num_experts) bool tensor every forward — 48K × 384 = 18M elements. torch.bincount(sorted_ids, minlength=num_experts) gives the same result in O(n) with no intermediate allocation. ~200× less work. Also removes the now-unused _expert_id_range buffer.	2026-05-20 02:17:23 +00:00
biondizzle	d59b10e170	fix: zero out x_norm for underflow blocks before division in NVFP4 quantization Bug #4 fix: When a block has amax > 0 but amax/6 underflows to 0 in FP8 (amax < 6 * 2^-9 ≈ 0.0117), the block scale is 0, but the division x / clamp(0, 1e-8) inflates x into nonzero FP4 buckets (up to ±6.0). This produces semantically wrong FP4 even though dequant gives 0 (6 * 0 = 0). Root cause: we only detected truly-zero blocks (amax == 0) but not underflow blocks (0 < amax < FP8_threshold). The fix: 1. Detect both zero and underflow blocks: block_amax < 6 * 2^-9 2. Zero out x_reshaped for these blocks BEFORE division 3. Force FP8 scale to 0 for these blocks This ensures x_scaled = 0 → FP4 nibbles = 0 → dequant = 0. Verified: bug scenario now produces nibble=0, scale=0. Checkpoint byte match remains 100%.	2026-05-20 02:16:49 +00:00
biondizzle	c8fa87fac7	fix: detect zero blocks in NVFP4 quantization, force FP4+FP8 to exact zero Bug #3 fix: The clamp(min=1e-8) on block_amax prevented NaN from 0/0 but allowed truly-zero blocks to get a nonzero FP8 scale (5e-12 from underflow). While the kernel produces 0 * 0 = 0 (no NaN), the nonzero scale is semantically wrong and could interact badly with future kernels. Fix: detect zero blocks explicitly (block_amax == 0), clamp only for safe division, then force FP8 scale to exact zero for zero blocks via torch.where. The FP4 nibbles are already zero (0 / anything = 0). Verified: checkpoint byte match remains 100%, zero blocks produce exact-zero dequantization, no NaN propagation. Applies to all three quantization functions: - quantize_to_nvfp4 (activation with computed gs) - quantize_activation_nvfp4 (activation with pre-computed gs) - quantize_weight_to_nvfp4 (weight quantization)	2026-05-20 02:14:50 +00:00
biondizzle	3c6b5a0522	chore: deprecate prepare_weights_from_dequantized and prepare_weights_direct Verified that our NVFP4 packing convention (odd<<4|even, round-half-to-even) matches the DeepSeek-V4 checkpoint exactly: 100% byte-identical round-trip across all tested experts. The dequantize->requantize path is lossless in practice but wasteful. Marked both prepare_weights_from_dequantized and prepare_weights_direct as deprecated in favor of prepare_weights_from_stacked which loads checkpoint FP4 bytes directly via .view(). Also added test_fp4_roundtrip.py for future reference.	2026-05-20 02:11:40 +00:00
biondizzle	3181f74c86	fix: correct scale factor dimensions in warmup (K_sf = ceil_div(K_packed,8) not ceil_div(K_packed,16)) K_packed = original_K // 2. The scale factor dimension is K_sf = ceil_div(original_K, 16) = ceil_div(K_packed * 2, 16) = ceil_div(K_packed, 8). The previous code used ceil_div(K_packed, 16) which was wrong.	2026-05-20 02:08:26 +00:00
biondizzle	cc6b094450	fix: root-cause JIT memory corruption myth, add eager warmup, remove _needs_token_refill Bug #1 fix: The _needs_token_refill workaround was a band-aid over a misdiagnosis. cute.compile does NOT corrupt GPU memory (verified on B200). The original corruption was from a different bug (likely OOB write or weight loading issue). Changes: - bridge.py: Add warmup_compilation() for eager JIT before runtime buffers exist. Pre-allocate workspace per cache entry (no torch.full in hot path). Cache stores {compiled, workspace, workspace_size} instead of just compiled. CuTe tensor wrappers re-created per call (cheap metadata, avoids stale refs). - runner.py: Remove _needs_token_refill hack. Add eager warmup call in _ensure_stacked() for both L1 and L2 GEMM shapes. - nvfp4_linear.py: Add eager warmup in finalize_weights() for single GEMM. The warmup approach ensures cute.compile runs exactly once per shape during model init, before any forward pass. This is deterministic and eliminates any possible interaction between JIT and runtime GPU memory.	2026-05-20 02:08:01 +00:00
biondizzle	039a9e27d6	fix: handle 3D swa_indices and correct kv_bf16 expand dims	2026-05-20 01:36:27 +00:00
biondizzle	b3f6f260ce	feat: add native CuTeDSL SWA decode attention kernel stub + batched SDPA fallback	2026-05-20 01:28:05 +00:00
biondizzle	268dc251c1	fix: replace _allocate_buffers with _ensure_buffer_size for dynamic sizing	2026-05-20 00:02:10 +00:00
biondizzle	09669dded4	fix: dynamic buffer sizing in nvfp4_linear for varying token counts	2026-05-19 23:59:55 +00:00
biondizzle	02b9c1ac20	nuke vllm because this keeps confusing people	2026-05-19 23:04:36 +00:00
biondizzle	02b57071be	Update README.md and CURRENT_BUG.md: eliminate stale issues, document NaN investigation, clarify our kernels are clean	2026-05-19 20:22:10 +00:00
biondizzle	7070fadf72	Add full layer NaN test (attention + MoE, multi-layer chain)	2026-05-19 18:36:49 +00:00
biondizzle	152b0749df	Use 16 experts for MoE runner test (fits in memory)	2026-05-19 18:35:40 +00:00
biondizzle	daa59a7c75	Add MoE runner NaN test (grouped GEMM with real weights)	2026-05-19 18:34:56 +00:00
biondizzle	9308634e65	Fix intermediate size: 3072 not 18432	2026-05-19 18:34:12 +00:00
biondizzle	2b91bb1b71	Rewrite MoE NaN test: per-expert format, activation quantization, grouped GEMM	2026-05-19 18:33:57 +00:00
biondizzle	8904d409f8	Fix MoE weight key names, add fallback	2026-05-19 18:32:49 +00:00
biondizzle	e45ceb2226	Add MoE NaN reproduction test, update CURRENT_BUG.md with NaN tracing and test plan	2026-05-19 18:32:14 +00:00
biondizzle	22ec43e685	Add input NaN debug to trace where NaN starts	2026-05-19 18:15:53 +00:00
biondizzle	b86d0d2dee	Add prefill inputs NaN debug	2026-05-19 18:04:18 +00:00
biondizzle	45a2d8851d	Add prefill attention value debug check	2026-05-19 17:55:35 +00:00
biondizzle	1589b79137	Use module-level Blackwell flag in compressor (works during torch.compile)	2026-05-19 17:37:26 +00:00
biondizzle	658b12cb3d	CRITICAL FIX: Remove double Q normalization and fix RoPE sin slice	2026-05-19 17:27:33 +00:00
biondizzle	facc6509e7	Fix imports in vLLM codepaths test	2026-05-19 17:26:50 +00:00
biondizzle	835e1a0590	Fix f-string syntax	2026-05-19 17:26:40 +00:00
biondizzle	9c30168202	Add test for exact vLLM codepaths (fused_qnorm, kv_write, decode)	2026-05-19 17:26:10 +00:00
biondizzle	8f80991fdf	CRITICAL FIX: Properly dequantize fp8 KV in decode using per-token inv_scale	2026-05-19 17:08:58 +00:00
biondizzle	d67d8613af	FIX: Use vLLM's decode_swa_indices for correct paged KV cache access during decode	2026-05-19 16:55:44 +00:00
biondizzle	3b204c4772	Fix UnboundLocalError: move num_decode_tokens before debug print	2026-05-19 16:43:28 +00:00
biondizzle	30890b621d	CRITICAL FIX: Skip compressor fused attention kernel on Blackwell — it bypasses our attention path	2026-05-19 16:35:07 +00:00
biondizzle	b8e2cf61ad	Add debug logging to Blackwell attention path	2026-05-19 16:31:55 +00:00
biondizzle	d7f686bcfc	Fix wrapper attribute access: kv_cache, attn_sink, max_model_len via mla_attn	2026-05-19 16:19:28 +00:00
biondizzle	114da83090	Add CSA/HCA decode + prefill attention to Blackwell path	2026-05-19 16:06:24 +00:00
biondizzle	2cc1910c45	Fix N for C128A (need 128 tokens)	2026-05-19 16:04:53 +00:00
biondizzle	cea453cbab	Fix compressor key name	2026-05-19 16:04:38 +00:00
biondizzle	04f2b2d8d4	Add CSA sparse attention test (compressed KV gather + SWA merge)	2026-05-19 16:04:19 +00:00
biondizzle	4c6464e7e0	Update CURRENT_BUG: KV cache pipeline verified, all tests passing	2026-05-19 16:01:10 +00:00
biondizzle	be8566a443	Add decode vs prefill consistency test	2026-05-19 16:00:33 +00:00
biondizzle	2ddd3d0702	Test with all 61 layers (shared experts only)	2026-05-19 15:55:41 +00:00
biondizzle	842e6e1381	Fix view→reshape for non-contiguous tensor	2026-05-19 15:54:40 +00:00
biondizzle	f0f8d8211b	Add e2e decode test (3 layers: C128A, C4A, SWA)	2026-05-19 15:53:29 +00:00
biondizzle	255913fba4	Vectorize paged KV cache read/write, kill container	2026-05-19 15:48:16 +00:00
biondizzle	8b2cb41160	Fix KV cache: write to paged cache, handle uint8→fp8 conversion, fix RoPE bug	2026-05-19 15:34:09 +00:00
biondizzle	6ceb05327f	Add blackwell_attention module and comprehensive test	2026-05-19 15:30:29 +00:00
biondizzle	85c74e5932	Fix attention for decode (1 query vs N cached KVs)	2026-05-19 15:28:52 +00:00
biondizzle	85099c7e75	Fix fp8 amax in decode test	2026-05-19 15:28:17 +00:00
biondizzle	c66b0b88c0	Add decode attention pipeline test — reproduces KV cache bug	2026-05-19 15:27:55 +00:00
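
The scale-factor arithmetic in commit 3181f74c86 can be checked with a few lines of plain Python. This is a minimal sketch, not code from the repository: `ceil_div` and `scale_factor_cols` are hypothetical helpers, and the block size of 16 and the two-values-per-byte packing are taken from the commit message.

```python
def ceil_div(a: int, b: int) -> int:
    """Integer ceiling division for non-negative a and positive b."""
    return -(-a // b)


def scale_factor_cols(k_packed: int, block: int = 16) -> int:
    """Number of scale factors along K for a packed FP4 tensor.

    Assumes two FP4 values per packed byte (K_packed = original_K // 2)
    and one scale factor per `block` original elements.
    """
    original_k = k_packed * 2
    return ceil_div(original_k, block)


# ceil_div(K_packed * 2, 16) and ceil_div(K_packed, 8) agree for every size.
for k_packed in (8, 9, 1536, 3584):
    assert scale_factor_cols(k_packed) == ceil_div(k_packed, 8)

# The previous formula, ceil_div(K_packed, 16), yields half as many columns.
k_packed = 1536
correct = scale_factor_cols(k_packed)
previous = ceil_div(k_packed, 16)
```

For `k_packed = 1536` the corrected formula gives 192 scale-factor columns, while the previous one gives 96.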

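
The three-step fix described in commit d59b10e170 (detect zero and underflow blocks, zero the block before division, force the scale to zero) can be sketched in plain Python. This is an illustration under stated assumptions, not the repository's implementation: `sanitize_block` is a hypothetical name, the threshold `6 * 2^-9` is copied from the commit message as-is, and the returned scale is left as an unrounded float rather than a real FP8 value.

```python
FP4_MAX = 6.0
# Threshold from the commit message: blocks whose amax is below this are
# treated as underflowing to a zero FP8 scale.
UNDERFLOW_THRESHOLD = FP4_MAX * 2.0 ** -9  # 0.01171875


def sanitize_block(block):
    """Return (values, scale) for one non-empty quantization block.

    Zero and underflow blocks are zeroed BEFORE any division, and their
    scale is forced to exactly 0.0, so nothing can be inflated into a
    nonzero FP4 bucket. Other blocks pass through with scale amax / 6
    (assumption: no FP8 rounding or global scale is modelled here).
    """
    amax = max(abs(v) for v in block)
    if amax < UNDERFLOW_THRESHOLD:
        return [0.0] * len(block), 0.0
    return list(block), amax / FP4_MAX


underflow_values, underflow_scale = sanitize_block([0.005, -0.01])
normal_values, normal_scale = sanitize_block([3.0, -6.0])
```

The underflow block comes back as all zeros with a zero scale, while the normal block is unchanged with a scale of 1.0.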

479 Commits
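
The packing convention named in commit 3c6b5a0522 (`odd<<4|even`) places the even-indexed FP4 code in the low nibble and the odd-indexed code in the high nibble of each byte. The sketch below shows that layout and its round trip using plain Python; `pack_fp4_pairs` and `unpack_fp4_pairs` are hypothetical helpers written for illustration, and they operate on already-quantized 4-bit codes rather than on floats.

```python
def pack_fp4_pairs(nibbles):
    """Pack 4-bit codes into bytes as (odd << 4) | even."""
    if len(nibbles) % 2:
        raise ValueError("need an even number of 4-bit codes")
    if any(not 0 <= n <= 0xF for n in nibbles):
        raise ValueError("codes must fit in 4 bits")
    return bytes(
        (nibbles[i + 1] << 4) | nibbles[i]
        for i in range(0, len(nibbles), 2)
    )


def unpack_fp4_pairs(packed):
    """Inverse of pack_fp4_pairs: low nibble first, then high nibble."""
    out = []
    for byte in packed:
        out.append(byte & 0xF)  # even-indexed element
        out.append(byte >> 4)   # odd-indexed element
    return out


codes = [0x1, 0x2, 0xF, 0x0]
packed = pack_fp4_pairs(codes)
restored = unpack_fp4_pairs(packed)
```

Here `packed` is the two bytes `0x21 0x0F`, and `restored` equals the original `codes`, which is the kind of byte-identical round trip the commit reports.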