nvfp4-megamoe-kernel

Author	SHA1	Message	Date
biondizzle	50e9b5da81	README: Bug 4 corrected — NOT TMEM alias. A-fragment identical for all PV sizes. Real bug in V/B or output C/D.	2026-05-21 09:47:08 +00:00
biondizzle	d4934371d0	Key finding: PV A-fragment layout is IDENTICAL for (128,128)/(128,32)/(128,16) PV. Bug is NOT TMEM alias. cta_tile_shape_mnk wrong for non-(128,128) PV. V SMEM and O C-fragment sizes look correct. Debugging V/epilogue paths.	2026-05-21 09:44:22 +00:00
biondizzle	422af26024	Update README: Bug 4 status, (128,16) PV zero output, (128,128) PV zero-pad workaround (cosine 1.0)	2026-05-21 09:20:09 +00:00
biondizzle	781684dd89	TMEM alias analysis: (128,16) PV broken, (128,128) PV with zero-pad works. Root cause: PV A-fragment layout differs from QK C-fragment layout for (128,16) PV, causing TMEM column mismatch. Using (128,128) PV as workaround.	2026-05-21 09:10:12 +00:00
biondizzle	96e7210db7	Debugging TMEM alias for (128,16) PV: zero output confirmed, PV reads from wrong TMEM columns. Need to align softmax P write with PV A-fragment layout.	2026-05-21 09:00:42 +00:00
biondizzle	ad3f63033d	Stage B N-tiling: (128,16) PV MMA compiles and runs, cosine 0.36 (TMEM alias mismatch bug). FMHA head_dim=64 passes. Debugging TMEM layout alignment.	2026-05-21 08:45:49 +00:00
biondizzle	5e37ea56e4	FOOTGUN #0 : num_tma_load_bytes MUST include V bytes. Fix v27, v29, comment all. Update README.	2026-05-21 07:13:14 +00:00
biondizzle	dd8d872bec	v29: FIX DEADLOCK - add V bytes to num_tma_load_bytes. V=I(128,128) cosine 1.0	2026-05-21 07:08:29 +00:00
biondizzle	b9b1b808a5	README: update with v28/v29 deadlock investigation, FMHA softmax bridge trace, new footguns	2026-05-21 06:46:02 +00:00
biondizzle	f1c4ee0e4d	v29 (padded V, deadlocks), v30 (diag copy, works) — debugging epilogue deadlock with (128,128) PV	2026-05-21 06:40:27 +00:00
biondizzle	4968ce064d	even more stuff	2026-05-21 05:55:22 +00:00
biondizzle	15c987244f	v28 attempt: PV MMA (128,64) - cosine 0.004, debugging	2026-05-21 05:41:44 +00:00
biondizzle	a7fd2761df	README: Bug 4 root cause — TMEM layout mismatch (128,64) PV A-fragment vs softmax P write - (128,64) PV MMA A-fragment has N_MMA=64, reads P with wrong stride - Softmax writes P with QK C-fragment layout (N_MMA=128) - O[m,d] ≈ P[m,2d] — every other column effect confirmed - All-ones and single-element V pass (uniform/sparse data hides mismatch) - epi_tile must use PV cta_tile (partial fix: 0.01 → 0.876) - Added footguns #9 (TMEM alias N_MMA match) and #10 (epi_tile) - Added diagnostic test results to test table	2026-05-21 05:17:12 +00:00
biondizzle	c20518332e	more stuff	2026-05-21 05:08:57 +00:00
biondizzle	0dc6fe4a7d	Stage B progress: PV works for square (128,128), broken for (128,64) - Bug 1 (V MN-major): Fix applied - Bug 2 (softmax packing): Confirmed correct (V=I test: cosine 1.0) - Bug 3 (ACCUMULATE): Fix applied (first PV must overwrite, not accumulate) - Bug 4 (CURRENT): PV MMA broken for non-square output - (128,128) PV with random V: cosine 0.999999 ✅ - (128,64) PV with MN-major V: cosine ~0.01 ❌ - Softmax packing, layout aliasing, pipeline ordering all verified correct - Root cause unknown — likely epilogue/V layout/MMA tiler issue Added test_pv_diag.py (V=I and random V, 128x128 output — PASS) Added test_layout_compare.py (TMEM layout inspection) Added test_inspect_types.py (TMEM pointer arithmetic verification) Updated test_mma_si_pv.py with head_dim param, pv_mma_tiler_mn fix, ACCUMULATE fix Updated READMEs with current state	2026-05-21 04:40:28 +00:00
biondizzle	7a8945eb76	Stage B: pipeline deadlock fixed, V MN-major applied, PV output garbage Pipeline deadlock fixed: - No cta_layout_vmnk on mma_si PipelineUmmaAsync - TMA warp excluded from tmem.wait_for_alloc - PipelineTmaStore (not TmaStorePipeline) Bug 1 (V MN-major): fix applied - PV MMA uses v_major=OperandMajorMode.MN - V shaped (64,128) strides(1,64) via as_strided Bug 2 (softmax packing): C-fragment composition store applied - FP32 to BF16 packing works - St32x32bOp uses Float32 (not BFloat16) Bug 3 (PV garbage): investigating - PV MMA cosine ~0.01 against reference - Suspected TMEM layout mismatch between softmax P store and PV A-fragment read Test results: - test_mma_si_only: cosine 0.999999 PASS - test_mma_si_pv: cosine 0.01 FAIL (pipeline works, PV output wrong)	2026-05-21 04:10:07 +00:00
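The pass/fail figures quoted throughout this log ("cosine 0.999999 PASS", "cosine 0.01 FAIL") compare kernel output against a reference. The log does not show how the score is computed, so the following is only a minimal sketch, assuming a plain cosine similarity over flattened outputs; the function name is hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two flat sequences of floats.

    Hypothetical helper: the repository's tests may compute this
    differently (e.g. with torch on GPU tensors).
    """
    if len(a) != len(b):
        raise ValueError("inputs must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine is undefined for an all-zero input")
    return dot / (norm_a * norm_b)
```

A score near 1.0 means the kernel output points in the same direction as the reference; a score near 0 means the two are essentially unrelated, which is why it flags layout bugs so clearly.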
biondizzle	467ade37b2	Stage B: C-fragment vs A-fragment TMEM layout mismatch diagnosed Key finding: C-fragment and A-fragment use different physical TMEM address mappings. St32x32bOp with C-fragment writes to C-layout addresses, but PV MMA reads from A-layout addresses. Forward FMHA recast validated FP16 only, not BF16. Working: FP32 ld/st roundtrip, BF16 elemwise, BF16 recast ld S0->st S1 (all cos 0.999999) Broken: C-frag st + A-frag read (NaN), A-frag store + PV MMA (cos -0.02) Next: Fix register data flow (128 FP16/thread load vs 64 BF16/thread store mismatch)	2026-05-21 00:12:47 +00:00
biondizzle	97656a5cd1	Stage B: two MMAs + identity softmax — crash fixed, softmax output still wrong Key fixes: - PipelineUmmaAsync consumer group: 32*4=128 threads (not 4 warps) - TMEM offsets computed from find_tmem_tensor_col_offset (not hardcoded) - P fragment from p_tmem_s.outer + make_fragment_A (matching fmha.py) - V SMEM aliasing via recast_ptr Status: - Stage A: cosine 0.999999 ✅ - Stage B: runs without crash, identity softmax cosine -0.02 ❌ - Diagnostics: TMEM layout inspection, bisection results	2026-05-20 20:26:25 +00:00
biondizzle	a5b48be7d5	stuff	2026-05-20 07:15:01 +00:00
biondizzle	9f0528f150	Update README: reflect current state, add C128A/C4A topk + warmup fixes	2026-05-20 06:51:12 +00:00
biondizzle	67d5e26080	Fix warmup compilation + add sparse topk metadata kernels Bug #2 fix: warmup_compilation and warmup_fused_swiglu_compilation now use valid FP4 data by quantizing random BF16 through quantize_to_nvfp4. Random uint8 bytes as FP4 bit patterns cause cudaErrorIllegalInstruction in Blackwell MMA hardware. Re-enabled warmup calls in runner.py. Bug #1 kernel: sparse_topk_metadata.cu with: - build_c128a_topk_metadata: position-based compressed KV slot lookup via block table for C128A (compress_ratio=128) decode tokens - compute_c4a_global_topk: local topk index -> global slot ID mapping via block table for C4A (compress_ratio=4) decode tokens - Both tested: correct block table lookups, proper padding Bug #3 kernel: C4A uses compute_c4a_global_topk (same .cu file) - Replaces vLLM Triton kernel with our own CUDA kernel Deleted stale STATUS.md, FUSED_EPILOGUE_STATUS.md, FUSED_EPILOGUE_PLAN.md, CURRENT_BUG.md	2026-05-20 06:43:43 +00:00
biondizzle	bbba289bd8	feat: GPU-native SWA + sparse decode attention kernels (CuTeDSL) - native_swa_decode.py: BlackwellSWADecodeKernel - CTA mapping: 1 CTA per (decode_token, q_head_group) - Online softmax with KV tile streaming (16 tokens/tile) - Pre-dequantized bf16 KV (fp8 dequant on host - MLIR cvt_fpext requires 32-bit aligned vector, no scalar fp8->bf16 support) - Cosine 0.9999+ vs PyTorch batched SDPA reference - Fallback _fallback_batched_sdp when CuTeDSL unavailable - native_sparse_decode.py: BlackwellSparseDecodeKernel - Combined SWA + compressed KV in single attention pass - Supports CSA (cr=4) and HCA (cr=128) layers - Sink weight merge on host side - Cosine 0.9999+ vs combined SDPA reference - fp8_bf16.py: Documents MLIR limitation (cvt_fpext requires vector<4xf8>, no scalar support). Pre-dequant is the workaround. - vLLM wiring (attention.py): - SWA-only layers: native_swa_decode_attention - CSA/HCA layers: native_sparse_decode_attention with topk + attn_sink - csa_attention.py updated to use native kernels - Tests: test_decode_pipeline.py, test_sparse_decode.py both passing	2026-05-20 05:46:15 +00:00
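The "online softmax with KV tile streaming (16 tokens/tile)" mentioned in this commit can be illustrated on the host. This is a pure-Python sketch for a single query, not the CuTeDSL kernel: it assumes attention scores are already computed, and the function names are hypothetical.

```python
import math


def online_softmax_attention(scores, values, tile=16):
    """Streaming softmax(scores) @ values, one KV tile at a time.

    scores: non-empty list of floats (one per KV token).
    values: list of equal-length float lists (one row per KV token).
    """
    running_max = float("-inf")
    running_sum = 0.0
    acc = [0.0] * len(values[0])
    for start in range(0, len(scores), tile):
        s_tile = scores[start:start + tile]
        v_tile = values[start:start + tile]
        new_max = max(running_max, max(s_tile))
        # Rescale everything accumulated so far to the new maximum.
        correction = math.exp(running_max - new_max)
        running_sum *= correction
        acc = [a * correction for a in acc]
        for s, v in zip(s_tile, v_tile):
            p = math.exp(s - new_max)
            running_sum += p
            acc = [a + p * x for a, x in zip(acc, v)]
        running_max = new_max
    return [a / running_sum for a in acc]


def reference_attention(scores, values):
    """Two-pass softmax over the whole sequence, for comparison."""
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[d] for w, v in zip(weights, values)) / total
            for d in range(dim)]
```

The point of the rescaling step is that no tile ever needs the global maximum up front, so KV can be streamed through in fixed-size chunks.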
biondizzle	06bf4f482d	README: comprehensive update with current kernel status	2026-05-20 04:42:57 +00:00
biondizzle	a30d9eb523	Update README with final kernel status	2026-05-20 04:39:57 +00:00
biondizzle	04eca7c6da	Custom CUDA kernel for de-interleave plus NVFP4 quantize	2026-05-20 04:39:47 +00:00
biondizzle	061d5692a9	Remove debug print statements from pipeline	2026-05-20 04:20:46 +00:00
biondizzle	aa8563c626	Fused SwiGLU epilogue with granularity-8 weight interleave - Fix interleave_l1_weights: remove //2 bug (g=granularity_bf16 for N-axis) - Apply L1 weight+SF interleave in runner._ensure_stacked() and moe_pipeline - De-interleave L1 GEMM output before gate/up split - Fused SwiGLU kernel: epi_tile=(128,8) for subtile-level pairing - Even subtiles = gate: SiLU in FP32 registers, save to register buffer - Odd subtiles = up: silu(gate)*up from buffer - Both branches produce same BF16 tensor type (CuTeDSL constraint) - run_nvfp4_moe_fused() pipeline: fused L1 + PyTorch L2 - Runner: fused_swiglu=True option for CuTeDSLMoERunner - Layertest: both fused and non-fused paths PASS (cosine 0.988) - README.md updated with current status and lessons learned	2026-05-20 04:13:52 +00:00
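The granularity-8 interleave described in this commit (even subtiles hold gate columns, odd subtiles hold up columns) can be sketched on plain column lists. This is an illustration of the layout idea only; the real interleave_l1_weights and deinterleave_l1_weights operate on weight and scale-factor tensors, and the helper names below are hypothetical.

```python
def interleave_columns(gate, up, granularity=8):
    """Alternate blocks of `granularity` gate columns and up columns."""
    if len(gate) != len(up) or len(gate) % granularity:
        raise ValueError("gate/up must match and be a multiple of granularity")
    mixed = []
    for i in range(0, len(gate), granularity):
        mixed.extend(gate[i:i + granularity])  # even subtile: gate
        mixed.extend(up[i:i + granularity])    # odd subtile: up
    return mixed


def deinterleave_columns(mixed, granularity=8):
    """Inverse of interleave_columns: recover (gate, up)."""
    if len(mixed) % (2 * granularity):
        raise ValueError("length must be a multiple of 2 * granularity")
    gate, up = [], []
    for i in range(0, len(mixed), 2 * granularity):
        gate.extend(mixed[i:i + granularity])
        up.extend(mixed[i + granularity:i + 2 * granularity])
    return gate, up
```

With this layout, an epilogue tile 8 columns wide always sees either pure gate or pure up data, and each gate subtile is immediately followed by its matching up subtile.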
biondizzle	57d4cb714f	docs: rewrite README.md with current project state - Document all 5 correctness bug fixes - Document fused SwiGLU epilogue progress (Step 1 PASS, Step 2 blocked) - Document CuTeDSL runtime conditional limitation - List remaining steps (amax shuffles, NVFP4 quantize, FP4/SF TMA stores) - Document weight interleave and register layout - Capture key lessons learned - Update file structure and test inventory	2026-05-20 03:30:35 +00:00
biondizzle	6c04155167	wip: Step 2 gate/up pairing — SiLU validated, runtime conditionals blocked by CuTeDSL SiLU in registers: PASS (0.034% error, Step 1 stable) Gate/up subtile detection: blocked by CuTeDSL type system CuTeDSL compiles the kernel for ALL subtile iterations at once. Runtime conditionals (if is_gate_subtile) that affect: - Register tensor assignment → DSLRuntimeError (type structure mismatch) - TMA store skipping → corrupted output - Mask blending → wrong results Path forward: use const_expr debug flag for the BF16 side output, or process gate/up in a separate post-GEMM kernel.	2026-05-20 03:26:20 +00:00
biondizzle	9f0c1b8c5d	wip: Step 1 SiLU validation complete, Step 2 gate/up pairing planning Step 1 VALIDATED: - cute.exp works on register tensors in the epilogue - SiLU (x / (1+exp(-x))) produces correct results - Relative error vs PyTorch: 0.034%, max abs: 0.0625 (BF16 precision) Step 2 (gate/up pairing) approach: - Register-level pairing requires understanding acc_vec layout from tiled_copy_r2s - DeepGEMM pattern: (values[0], values[2]) pairs for tcgen05.ld - CuTeDSL retile may produce different layout than direct PTX loads - SMEM-level SiLU is a valid intermediate: avoids GMEM round-trip while working in logical (M, N) coordinate space - Non-interleaved weights + SMEM SiLU is simplest starting point	2026-05-20 03:16:34 +00:00
biondizzle	b84f2f7bf9	fix: cutlass.Float32 not cutlass.float32_t in fused epilogue Step 1 SiLU validation: PASS - cute.exp works on register tensors - SiLU (x / (1+exp(-x))) in registers matches PyTorch reference - Relative error: 0.034%, Max abs error: 0.0625 (BF16 precision limit)	2026-05-20 03:12:23 +00:00
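For reference, the SiLU formula validated here, x / (1 + exp(-x)), and the SwiGLU product it feeds into can be written as a scalar host-side sketch. This is not the register-level CuTeDSL code; it uses double precision, so it will not reproduce the BF16 error figures quoted in the commit.

```python
import math


def silu(x):
    """SiLU as written in the commit message: x / (1 + exp(-x)).

    Sketch only: math.exp overflows for x below roughly -709.
    """
    return x / (1.0 + math.exp(-x))


def swiglu(gate, up):
    """SwiGLU for one (gate, up) pair: silu(gate) * up."""
    return silu(gate) * up
```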
biondizzle	08992b818d	wip: add run_fused_swiglu_grouped_gemm bridge + step1 test	2026-05-20 03:10:56 +00:00
biondizzle	9c43c69a4c	wip: fused SwiGLU Stage 1 - SiLU in registers (full acc_vec) Stage 1 of the fused epilogue: applies SiLU (x * sigmoid(x)) to the full accumulator register tensor before writing BF16 to C. This validates that cute.exp and element-wise FP32 operations work on CuTe register tensors in the epilogue. The gate/up pairing is not yet implemented (Stage 2). The fused_swiglu flag is const_expr(0) by default, so the standard epilogue path is unchanged unless the flag is enabled.	2026-05-20 03:07:02 +00:00
biondizzle	2f053f674e	wip: fused SwiGLU kernel scaffold + bridge interleave + plan - fused_swiglu_grouped_mm.py: copypaste of torch_scaled_grouped_mm.py with class rename and fused_swiglu/swiglu_limit params added - bridge.py: added interleave_l1_weights, deinterleave_l1_weights, warmup_fused_swiglu_compilation - Pure-PyTorch interleave invariant passes (A@cat vs deinterleave(A@interleave)) - Standalone GEMM interleave test fails due to kernel-internal N-tiling layout (expected, skipping per plan) - FUSED_EPILOGUE_PLAN.md updated with register layout, amax shuffle plan, 4-step implementation strategy	2026-05-20 03:04:38 +00:00
biondizzle	4f178d6e9c	chore: remove unused _expert_id_range after bincount migration	2026-05-20 02:17:44 +00:00
biondizzle	84a2f6d441	perf: replace expert counting O(n*E) comparison with torch.bincount O(n) Bug #5 fix: (sorted_ids.unsqueeze(1) == expert_id_range.unsqueeze(0)).sum(dim=0) materializes a (num_slots × num_experts) bool tensor every forward — 48K × 384 = 18M elements. torch.bincount(sorted_ids, minlength=num_experts) gives the same result in O(n) with no intermediate allocation. ~200× less work. Also removes the now-unused _expert_id_range buffer.	2026-05-20 02:17:23 +00:00
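The difference between the two counting strategies can be shown without torch. This pure-Python sketch mirrors the shapes of the two approaches (a full slots × experts comparison table versus a single pass); the function names are hypothetical, and the real code uses torch.bincount(sorted_ids, minlength=num_experts).

```python
def count_by_comparison(sorted_ids, num_experts):
    """O(n * E): materialize an n x E boolean table, then sum columns."""
    table = [[sid == e for e in range(num_experts)] for sid in sorted_ids]
    return [sum(row[e] for row in table) for e in range(num_experts)]


def count_by_bincount(sorted_ids, num_experts):
    """O(n): one pass, no intermediate table (what bincount does)."""
    counts = [0] * num_experts
    for sid in sorted_ids:
        counts[sid] += 1
    return counts
```

Both return one count per expert, including zeros for experts that received no tokens, which is what minlength guarantees in the torch call.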
biondizzle	4882d8553c	fix: zero out x_norm for underflow blocks before division in NVFP4 quantization Bug #4 fix: When a block has amax > 0 but amax/6 underflows to 0 in FP8 (amax < 6*2^-9 ≈ 0.0117), the block scale is 0, but the division x / clamp(0, 1e-8) inflates x into nonzero FP4 buckets (up to ±6.0). This produces semantically wrong FP4 even though dequant gives 0 (6*0=0). Root cause: we only detected truly-zero blocks (amax == 0) but not underflow blocks (0 < amax < FP8_threshold). The fix: 1. Detect both zero and underflow blocks: block_amax < 6 * 2^-9 2. Zero out x_reshaped for these blocks BEFORE division 3. Force FP8 scale to 0 for these blocks This ensures x_scaled = 0 → FP4 nibbles = 0 → dequant = 0. Verified: bug scenario now produces nibble=0, scale=0. Checkpoint byte match remains 100%.	2026-05-20 02:16:49 +00:00
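The three-step fix in this commit can be sketched for a single block. This is a simplified illustration, assuming one block is handled at a time, omitting the global scale and the actual FP8/FP4 rounding; the names are hypothetical and do not correspond to the repository's quantize_to_nvfp4.

```python
FP4_MAX = 6.0                    # largest FP4 (E2M1) magnitude
FP8_MIN_SUBNORMAL = 2.0 ** -9    # smallest positive FP8 E4M3 value
# Threshold from the commit message: block_amax < 6 * 2^-9
UNDERFLOW_THRESHOLD = FP4_MAX * FP8_MIN_SUBNORMAL


def prepare_block(block):
    """Return (scaled_values, scale) for one block before FP4 rounding.

    Zero and underflow blocks are forced to all-zero values with a
    zero scale instead of being divided by a clamped tiny number.
    """
    amax = max(abs(x) for x in block)
    if amax < UNDERFLOW_THRESHOLD:
        # Steps 1-3: detect, zero the inputs, force the scale to 0.
        return [0.0] * len(block), 0.0
    scale = amax / FP4_MAX
    return [x / scale for x in block], scale
```

Without the early return, a block with amax = 0.005 would be divided by the 1e-8 clamp and land in nonzero FP4 buckets even though its stored scale is zero.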
biondizzle	e653712598	fix: detect zero blocks in NVFP4 quantization, force FP4+FP8 to exact zero Bug #3 fix: The clamp(min=1e-8) on block_amax prevented NaN from 0/0 but allowed truly-zero blocks to get a nonzero FP8 scale (5e-12 from underflow). While the kernel produces 0 * 0 = 0 (no NaN), the nonzero scale is semantically wrong and could interact badly with future kernels. Fix: detect zero blocks explicitly (block_amax == 0), clamp only for safe division, then force FP8 scale to exact zero for zero blocks via torch.where. The FP4 nibbles are already zero (0 / anything = 0). Verified: checkpoint byte match remains 100%, zero blocks produce exact-zero dequantization, no NaN propagation. Applies to all three quantization functions: - quantize_to_nvfp4 (activation with computed gs) - quantize_activation_nvfp4 (activation with pre-computed gs) - quantize_weight_to_nvfp4 (weight quantization)	2026-05-20 02:14:50 +00:00
biondizzle	1857bdedc3	chore: deprecate prepare_weights_from_dequantized and prepare_weights_direct Verified that our NVFP4 packing convention (odd<<4|even, round-half-to-even) matches the DeepSeek-V4 checkpoint exactly: 100% byte-identical round-trip across all tested experts. The dequantize->requantize path is lossless in practice but wasteful. Marked both prepare_weights_from_dequantized and prepare_weights_direct as deprecated in favor of prepare_weights_from_stacked which loads checkpoint FP4 bytes directly via .view(). Also added test_fp4_roundtrip.py for future reference.	2026-05-20 02:11:40 +00:00
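The packing convention named here, odd element in the high nibble and even element in the low nibble, looks like this on raw 4-bit codes. It is a minimal sketch of the byte layout only: it does not perform the round-half-to-even quantization step, and the function names are hypothetical.

```python
def pack_fp4(nibbles):
    """Pack 4-bit codes two per byte as (odd << 4) | even."""
    if len(nibbles) % 2:
        raise ValueError("need an even number of 4-bit codes")
    return bytes(
        ((nibbles[i + 1] & 0xF) << 4) | (nibbles[i] & 0xF)
        for i in range(0, len(nibbles), 2)
    )


def unpack_fp4(packed):
    """Inverse of pack_fp4: low nibble first, then high nibble."""
    out = []
    for byte in packed:
        out.append(byte & 0xF)
        out.append(byte >> 4)
    return out
```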
biondizzle	ef398006a7	fix: correct scale factor dimensions in warmup (K_sf = ceil_div(K_packed,8) not ceil_div(K_packed,16)) K_packed = original_K // 2. The scale factor dimension is K_sf = ceil_div(original_K, 16) = ceil_div(K_packed * 2, 16) = ceil_div(K_packed, 8). The previous code used ceil_div(K_packed, 16) which was wrong.	2026-05-20 02:08:26 +00:00
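The arithmetic behind this fix is easy to check numerically. The sketch below assumes one scale factor per 16 original K elements and two FP4 values per packed byte, as stated in the commit; the value 7168 is just an example K, not taken from the source.

```python
def ceil_div(a, b):
    """Integer ceiling division."""
    return -(-a // b)


def scale_factor_cols(original_k):
    """K_sf from the unpacked K: one scale per 16 elements."""
    return ceil_div(original_k, 16)


def scale_factor_cols_from_packed(k_packed):
    """K_sf from the packed K (two FP4 values per byte)."""
    return ceil_div(k_packed, 8)


original_k = 7168              # example size, chosen for illustration
k_packed = original_k // 2     # 3584 packed bytes along K
correct = scale_factor_cols_from_packed(k_packed)  # matches unpacked K
wrong = ceil_div(k_packed, 16)                     # the previous bug
```

Dividing the packed size by 16 yields half as many scale-factor columns as the data actually needs.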
biondizzle	8f1a20562f	fix: root-cause JIT memory corruption myth, add eager warmup, remove _needs_token_refill Bug #1 fix: The _needs_token_refill workaround was a band-aid over a misdiagnosis. cute.compile does NOT corrupt GPU memory (verified on B200). The original corruption was from a different bug (likely OOB write or weight loading issue). Changes: - bridge.py: Add warmup_compilation() for eager JIT before runtime buffers exist. Pre-allocate workspace per cache entry (no torch.full in hot path). Cache stores {compiled, workspace, workspace_size} instead of just compiled. CuTe tensor wrappers re-created per call (cheap metadata, avoids stale refs). - runner.py: Remove _needs_token_refill hack. Add eager warmup call in _ensure_stacked() for both L1 and L2 GEMM shapes. - nvfp4_linear.py: Add eager warmup in finalize_weights() for single GEMM. The warmup approach ensures cute.compile runs exactly once per shape during model init, before any forward pass. This is deterministic and eliminates any possible interaction between JIT and runtime GPU memory.	2026-05-20 02:08:01 +00:00
biondizzle	6ec0afc318	fix: handle 3D swa_indices and correct kv_bf16 expand dims	2026-05-20 01:36:27 +00:00
biondizzle	aa593361e7	feat: add native CuTeDSL SWA decode attention kernel stub + batched SDPA fallback	2026-05-20 01:28:05 +00:00
biondizzle	3599b44c0f	fix: replace _allocate_buffers with _ensure_buffer_size for dynamic sizing	2026-05-20 00:02:10 +00:00
biondizzle	1d5e70adfb	fix: dynamic buffer sizing in nvfp4_linear for varying token counts	2026-05-19 23:59:55 +00:00
biondizzle	1901bf585e	nuke vllm because this keeps confusing people	2026-05-19 23:04:36 +00:00
biondizzle	5fb70b4cd2	Update README.md and CURRENT_BUG.md: eliminate stale issues, document NaN investigation, clarify our kernels are clean	2026-05-19 20:22:10 +00:00
biondizzle	2e6559402c	Add full layer NaN test (attention + MoE, multi-layer chain)	2026-05-19 18:36:49 +00:00
biondizzle	cca145e35c	Use 16 experts for MoE runner test (fits in memory)	2026-05-19 18:35:40 +00:00
biondizzle	7893e7514d	Add MoE runner NaN test (grouped GEMM with real weights)	2026-05-19 18:34:56 +00:00

512 Commits