nvfp4-megamoe-kernel

Author	SHA1	Message	Date
biondizzle	89f6e64057	README: document test harness gotchas (timeout arg, stale procs, screen names)	2026-06-03 08:36:02 +00:00
biondizzle	0bf276f8c9	more doc cleanup	2026-06-03 07:37:13 +00:00
biondizzle	35dbb8d12b	Cleanup Part 2: Fix docs, stale references, dead code - Update README.md package structure to match actual file tree - Remove references to nonexistent fmha.py, fmha_smem_acc, kernels/decode/ - Document live attention path: production.py → fmha_multitile_op → capi.cu → .cuh - Add _archive/ section - Fix loader.py docstring: fused_amax_quantize_nvfp4 → quantize_nvfp4_from_buffer - Remove preload_all() (dead, referenced nonexistent compressor_reduce_quant.cu)	2026-06-02 19:27:28 +00:00
biondizzle	224d7e24c6	harness: add fire_b200_cuda_test + check_b200_cuda, update README Two new turnkey harness scripts for .cu tests: - fire_b200_cuda_test: compile+run+poll, kills everything first, deletes old logs, one test at a time, screen-based, timeout - check_b200_cuda: peek at running test log, or kill hung test README updated with CUDA harness documentation. Removed janky tests/run_cuda_test.sh.	2026-05-28 07:36:10 +00:00
biondizzle	43f0b5d1e8	D1.5: Fix O rescale with paired atoms (incremental approach) Keep epilogue_tma_store for final output (proven path). Only fix the multi-KV-tile O rescale using paired atoms from epilogue_tmem_copy_and_partition. The paired atoms share addressing, making the TMEM->REGS->modify->TMEM cycle lossless. Guarded by const_expr(n_kv_tiles > 1) so single-tile path (n=128) is completely unaffected — zero regression risk. Full correction epilogue (one-way TMEM->REGS->SMEM->GMEM) deferred until we can address the MLIR compilation time issue.	2026-05-26 19:34:26 +00:00
biondizzle	f97aee6eed	plan update	2026-05-26 19:00:22 +00:00
biondizzle	57a8316bc1	update README: D5c sink bias DONE (cos 0.999996, single KV tile)	2026-05-26 15:17:10 +00:00
biondizzle	60a6f2d296	update README: D5b per-row LSE, D3/D4 DONE	2026-05-26 11:03:57 +00:00
biondizzle	4656fa81f9	update README: D3 and D4 status DONE	2026-05-26 10:56:57 +00:00
biondizzle	32850f6974	Update README, STAGE_D, STAGE_D2 with D1 rescale findings and D2 status	2026-05-25 01:18:48 +00:00
biondizzle	335e310c79	Update D2 status in README	2026-05-24 22:58:23 +00:00
biondizzle	dadfad8f89	Docs: Update STAGE_D.md, README.md with hd=512 compilation blocker, lessons learned	2026-05-24 21:35:25 +00:00
biondizzle	6be7690011	Docs: Update STAGE_D.md, README.md status for D1 hd≤256 milestone	2026-05-24 04:32:43 +00:00
biondizzle	98e5b48470	Update all .md files with D5a/D5b progress, tOrP0 fix, LSE formula - README.md: Updated Stage status table (D1 🟡, D5 🟢), D5 section with D5a/D5b results, tOrP0 bug fix docs, new CuTeDSL constraints #11-12 - STAGE_D1.3.md: Added progress update - TMEM-P works, SMEM-P still blocked, recommended next steps - STAGE_D.md was already updated	2026-05-23 22:07:53 +00:00
biondizzle	bfacfeca7b	Rename FmhaV3StageC → FmhaKernel — no dev stage artifacts in production API	2026-05-23 05:45:58 +00:00
biondizzle	787a25516d	Update README: reflect Stage C migration, built indexer/router/compressor, SMEM-P path, CuTeDSL scoping lesson	2026-05-23 05:42:44 +00:00
biondizzle	6dd71aaf56	docs: revised Stage D/E plan — indexer removes paged TMA, one kernel for CSA/HCA/SWA, sink merge	2026-05-23 03:10:41 +00:00
biondizzle	b1fe18acbb	cleanup: remove archive/ (240 stale files), stale example9/10, fix test table, add Stage D plan	2026-05-23 03:05:08 +00:00
biondizzle	6eb9729b06	docs: update README with Stage C TMEM layout mismatch findings and status	2026-05-23 03:01:04 +00:00
biondizzle	8ccbdec1ed	🚀🚀🚀 TMA MULTI-TILE FIX VERIFIED ON B200 🚀🚀🚀 THE BUG: tBgK[(None,None,0,0)] kept modes 0,1 free but set mode 2 (KV tiles) to 0. TMA always loaded from tile 0 regardless of the coordinate value. This was a LAYOUT bug, NOT a JIT bug, NOT a CuTeDSL bug. THE FIX: tBgK[(None,0,None,0)] keeps modes 0 and 2 free. Then tBgK[None, kt] indexes the surviving KV_tiles dim. VERIFIED SHAPES (B200, n=256, inside @cute.kernel): Before slice: tBgK = (((64,128),1), Int32(?), Int32(?), Int32(?)) — 4 modes After (None,0,None,0): tBgK = (((64,128),1), Int32(?)) — 2 modes TEST RESULTS (test_fmha_v3_stage_c.py, identity softmax): n=128: cos 0.999998 ✅ PASS n=256: cos 0.71 (TMA loads 2 tiles, needs O rescale for 0.9999) n=512+: same output as n=256 (pipeline not cycling past kv_stage=2) example10 (real softmax + O rescale): compiles and runs, cos ~0.47 (softmax bugs separate from TMA) LESSON: PRINT THE SHAPES. ALWAYS. Reasoning about mode counts without evidence is how we wasted a day. The 8-mode theory was WRONG — 8-None slice fails with 'weakly congruent' at JIT compile. The tensor has 4 modes. Updated: README (verified shapes, correct fix), MEMORY.md (new rules), test_fmha_v3_stage_c.py, test_fmha_v3_diag.py, example10, test_fmha_v3.py, fire_b200_test (clean git state, kill all old processes).	2026-05-22 23:51:29 +00:00
biondizzle	30eaba39aa	FIX: 8-None no-op pre-slice opens full TMA coordinate space (8 dims) The tma_partition output has 8 TMA coordinate dimensions, not 4. The Python-visible shape shows 4 modes, but the TMA descriptor uses 8 coordinates. Without the 8-None no-op pre-slice, modes 4-7 are collapsed and the GMEM tile axis (mode 4) is pinned to 0. Pattern that works (confirmed on B200 at n=256 in diag test): tBgK = tBgK[(None,None,None,None,None,None,None,None)] # open 8D cute.copy(tma_k, tBgK[None,None,None,None,kt,None,None,None], ...) The old 4-mode indexing tBgK[(None,None,kt,0)] fails with 'rank mismatch: got 2 and 1' because slicing a 4-mode tensor produces wrong rank for the TMA coordinate space. Matches working diag test test_fmha_v3_diag.py exactly.	2026-05-22 23:18:40 +00:00
biondizzle	9c5adcee46	FIX: tma_partition tensors have 4 modes, not 8. Mode 2 is GMEM tile dim. The 8-mode indexing (tBgK[None,None,None,None,kt,None,None,None]) fails at JIT compilation with 'coord and shape are weakly congruent' error. The actual MLIR tensor shape is (((64,128),1),?,?,?) — 4 modes, not 8. The working fix from commit `845ad98` on the B200 used 4-mode indexing all along: tBgK[(None, None, kt, 0)] — mode 2 = GMEM tile dim tVgV[(None, 0, kt, 0)] — mode 2 = GMEM tile dim Updated all files: example10, test_fmha_v3_stage_c, README, docstrings.	2026-05-22 23:08:27 +00:00
biondizzle	0330c1da7a	Fix README: multi-tile was layout bug not JIT bug, add example10, update status	2026-05-22 22:57:53 +00:00
biondizzle	dbd77f2bc4	DOCUMENT: TMA 8-mode indexing — the bug that cost us a full day. README + inline comments.	2026-05-22 21:28:58 +00:00
biondizzle	56769cdbf5	README: add fire_b200_test docs, update multi-tile blocker with real findings	2026-05-22 17:41:23 +00:00
biondizzle	81eee05018	README: add test harness instructions	2026-05-22 17:09:53 +00:00
biondizzle	793e3243d5	README + MEMORY: update Stage C status to single-tile only, document multi-tile blocker - Stage C works for n=128 (0.993) but multi-tile (n>128) is broken - Root cause: tBgK slice hardcodes GMEM iteration to tile 0 - CuTeDSL TMA copy doesn't accept Python int as tile index - Mike's combined K+V barrier fix compiles but deadlocks at runtime - Fallback: kh.count // 2 (untested)	2026-05-22 16:32:31 +00:00
biondizzle	35d532c742	README: update Stage C status to WORKING, add CuTeDSL constraints and target architecture	2026-05-22 09:39:15 +00:00
biondizzle	96f900f5f0	README: add full DSV4 pipeline architecture diagram (CSA/HCA, not MLA)	2026-05-21 17:40:25 +00:00
biondizzle	2ec32eb8da	README: update for new dsv4/ package structure	2026-05-21 17:34:40 +00:00
biondizzle	20564425ec	README: full roadmap — Stage C (real softmax), D (paged KV), E (production kernel) Document canonical test files, obsolete test sprawl, and the path from test_fmha_v3.py → cutedsl/kernel/attention/fmha_kernel.py → vLLM integration. Also: TMEM layout for Stage C, key lessons from A&B.	2026-05-21 15:43:01 +00:00
biondizzle	ad24792fc7	Update both READMEs: Stage B complete, document TMEM overlap root cause - Workspace README: full rewrite with Stage B ✅, Bug 4b root cause (P/O overlap), FMHA V reconstruction, TMEM layout diagram, softmax store pattern, updated footguns - Kernel README: focused on the bug, fix, and current test status - Key lesson documented: NEVER use find_tmem_tensor_col_offset() as O placement	2026-05-21 15:36:06 +00:00
biondizzle	750f1f09c9	README: Bug 4 ROOT CAUSE CONFIRMED - V SMEM 1 K-tile + PV 8 K-phases mismatch. Zero-pad V workaround correct.	2026-05-21 09:59:37 +00:00
biondizzle	50e9b5da81	README: Bug 4 corrected — NOT TMEM alias. A-fragment identical for all PV sizes. Real bug in V/B or output C/D.	2026-05-21 09:47:08 +00:00
biondizzle	5e37ea56e4	FOOTGUN #0 : num_tma_load_bytes MUST include V bytes. Fix v27, v29, comment all. Update README.	2026-05-21 07:13:14 +00:00
biondizzle	b9b1b808a5	README: update with v28/v29 deadlock investigation, FMHA softmax bridge trace, new footguns	2026-05-21 06:46:02 +00:00
biondizzle	a7fd2761df	README: Bug 4 root cause — TMEM layout mismatch (128,64) PV A-fragment vs softmax P write - (128,64) PV MMA A-fragment has N_MMA=64, reads P with wrong stride - Softmax writes P with QK C-fragment layout (N_MMA=128) - O[m,d] ≈ P[m,2d] — every other column effect confirmed - All-ones and single-element V pass (uniform/sparse data hides mismatch) - epi_tile must use PV cta_tile (partial fix: 0.01 → 0.876) - Added footguns #9 (TMEM alias N_MMA match) and #10 (epi_tile) - Added diagnostic test results to test table	2026-05-21 05:17:12 +00:00
biondizzle	0dc6fe4a7d	Stage B progress: PV works for square (128,128), broken for (128,64) - Bug 1 (V MN-major): Fix applied - Bug 2 (softmax packing): Confirmed correct (V=I test: cosine 1.0) - Bug 3 (ACCUMULATE): Fix applied (first PV must overwrite, not accumulate) - Bug 4 (CURRENT): PV MMA broken for non-square output - (128,128) PV with random V: cosine 0.999999 ✅ - (128,64) PV with MN-major V: cosine ~0.01 ❌ - Softmax packing, layout aliasing, pipeline ordering all verified correct - Root cause unknown — likely epilogue/V layout/MMA tiler issue Added test_pv_diag.py (V=I and random V, 128x128 output — PASS) Added test_layout_compare.py (TMEM layout inspection) Added test_inspect_types.py (TMEM pointer arithmetic verification) Updated test_mma_si_pv.py with head_dim param, pv_mma_tiler_mn fix, ACCUMULATE fix Updated READMEs with current state	2026-05-21 04:40:28 +00:00
biondizzle	7a8945eb76	Stage B: pipeline deadlock fixed, V MN-major applied, PV output garbage Pipeline deadlock fixed: - No cta_layout_vmnk on mma_si PipelineUmmaAsync - TMA warp excluded from tmem.wait_for_alloc - PipelineTmaStore (not TmaStorePipeline) Bug 1 (V MN-major): fix applied - PV MMA uses v_major=OperandMajorMode.MN - V shaped (64,128) strides(1,64) via as_strided Bug 2 (softmax packing): C-fragment composition store applied - FP32 to BF16 packing works - St32x32bOp uses Float32 (not BFloat16) Bug 3 (PV garbage): investigating - PV MMA cosine ~0.01 against reference - Suspected TMEM layout mismatch between softmax P store and PV A-fragment read Test results: - test_mma_si_only: cosine 0.999999 PASS - test_mma_si_pv: cosine 0.01 FAIL (pipeline works, PV output wrong)	2026-05-21 04:10:07 +00:00
biondizzle	467ade37b2	Stage B: C-fragment vs A-fragment TMEM layout mismatch diagnosed Key finding: C-fragment and A-fragment use different physical TMEM address mappings. St32x32bOp with C-fragment writes to C-layout addresses, but PV MMA reads from A-layout addresses. Forward FMHA recast validated FP16 only, not BF16. Working: FP32 ld/st roundtrip, BF16 elemwise, BF16 recast ld S0->st S1 (all cos 0.999999) Broken: C-frag st + A-frag read (NaN), A-frag store + PV MMA (cos -0.02) Next: Fix register data flow (128 FP16/thread load vs 64 BF16/thread store mismatch)	2026-05-21 00:12:47 +00:00
biondizzle	97656a5cd1	Stage B: two MMAs + identity softmax — crash fixed, softmax output still wrong Key fixes: - PipelineUmmaAsync consumer group: 32*4=128 threads (not 4 warps) - TMEM offsets computed from find_tmem_tensor_col_offset (not hardcoded) - P fragment from p_tmem_s.outer + make_fragment_A (matching fmha.py) - V SMEM aliasing via recast_ptr Status: - Stage A: cosine 0.999999 ✅ - Stage B: runs without crash, identity softmax cosine -0.02 ❌ - Diagnostics: TMEM layout inspection, bisection results	2026-05-20 20:26:25 +00:00
biondizzle	9f0528f150	Update README: reflect current state, add C128A/C4A topk + warmup fixes	2026-05-20 06:51:12 +00:00
biondizzle	bbba289bd8	feat: GPU-native SWA + sparse decode attention kernels (CuTeDSL) - native_swa_decode.py: BlackwellSWADecodeKernel - CTA mapping: 1 CTA per (decode_token, q_head_group) - Online softmax with KV tile streaming (16 tokens/tile) - Pre-dequantized bf16 KV (fp8 dequant on host - MLIR cvt_fpext requires 32-bit aligned vector, no scalar fp8->bf16 support) - Cosine 0.9999+ vs PyTorch batched SDPA reference - Fallback _fallback_batched_sdp when CuTeDSL unavailable - native_sparse_decode.py: BlackwellSparseDecodeKernel - Combined SWA + compressed KV in single attention pass - Supports CSA (cr=4) and HCA (cr=128) layers - Sink weight merge on host side - Cosine 0.9999+ vs combined SDPA reference - fp8_bf16.py: Documents MLIR limitation (cvt_fpext requires vector<4xf8>, no scalar support). Pre-dequant is the workaround. - vLLM wiring (attention.py): - SWA-only layers: native_swa_decode_attention - CSA/HCA layers: native_sparse_decode_attention with topk + attn_sink - csa_attention.py updated to use native kernels - Tests: test_decode_pipeline.py, test_sparse_decode.py both passing	2026-05-20 05:46:15 +00:00
biondizzle	06bf4f482d	README: comprehensive update with current kernel status	2026-05-20 04:42:57 +00:00
biondizzle	a30d9eb523	Update README with final kernel status	2026-05-20 04:39:57 +00:00
biondizzle	aa8563c626	Fused SwiGLU epilogue with granularity-8 weight interleave - Fix interleave_l1_weights: remove //2 bug (g=granularity_bf16 for N-axis) - Apply L1 weight+SF interleave in runner._ensure_stacked() and moe_pipeline - De-interleave L1 GEMM output before gate/up split - Fused SwiGLU kernel: epi_tile=(128,8) for subtile-level pairing - Even subtiles = gate: SiLU in FP32 registers, save to register buffer - Odd subtiles = up: silu(gate)*up from buffer - Both branches produce same BF16 tensor type (CuTeDSL constraint) - run_nvfp4_moe_fused() pipeline: fused L1 + PyTorch L2 - Runner: fused_swiglu=True option for CuTeDSLMoERunner - Layertest: both fused and non-fused paths PASS (cosine 0.988) - README.md updated with current status and lessons learned	2026-05-20 04:13:52 +00:00
biondizzle	57d4cb714f	docs: rewrite README.md with current project state - Document all 5 correctness bug fixes - Document fused SwiGLU epilogue progress (Step 1 PASS, Step 2 blocked) - Document CuTeDSL runtime conditional limitation - List remaining steps (amax shuffles, NVFP4 quantize, FP4/SF TMA stores) - Document weight interleave and register layout - Capture key lessons learned - Update file structure and test inventory	2026-05-20 03:30:35 +00:00
biondizzle	5fb70b4cd2	Update README.md and CURRENT_BUG.md: eliminate stale issues, document NaN investigation, clarify our kernels are clean	2026-05-19 20:22:10 +00:00
biondizzle	31b9cfbdbd	Update README and CURRENT_BUG: BUILD YOUR OWN KERNELS. Stop patching vLLM.	2026-05-19 15:19:55 +00:00
biondizzle	914d27fee7	Update README + CURRENT_BUG: full CuTeDSL NVFP4 plan, no more PyTorch fallbacks Mike's directive: build the full thing with NVFP4/CuTeDSL. No more 'optimize later' or 'just make it work' workarounds. Key updates: - README: full architecture docs (CSA/HCA/mHC), current status, NVFP4 coverage - CURRENT_BUG: detailed plan for CuTeDSL NVFP4 attention, KV cache, RoPE - Both files document: checkpoint key names, compress ratios, config issues - Removed all 'TODO: optimize later' hedging — we build it right the first time	2026-05-19 08:26:16 +00:00
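
Commit 43f0b5d1e8 fixes the O rescale for the multi-KV-tile path, and 8ccbdec1ed notes that n=256 needs an O rescale to reach cos 0.9999. The snippet below is a minimal pure-Python sketch of the standard online-softmax rescale for a single query row, checked against a one-pass reference by cosine similarity. It is not the kernel's code: the function names, sizes, and random data are illustrative assumptions, and it models none of the TMEM, paired-atom, or CuTeDSL details described in the log.

```python
import math
import random

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def attention_reference(q, K, V):
    """One-pass softmax(q.K^T).V for a single query row."""
    scores = [sum(a * b for a, b in zip(q, k)) for k in K]
    m = max(scores)
    w = [math.exp(s - m) for s in scores]
    z = sum(w)
    return [sum(w[j] * V[j][i] for j in range(len(V))) / z for i in range(len(V[0]))]

def attention_tiled(q, K, V, tile):
    """Stream KV in tiles, rescaling the running O whenever the row max changes."""
    m = -math.inf              # running row max
    l = 0.0                    # running softmax denominator
    o = [0.0] * len(V[0])      # running unnormalized output
    for start in range(0, len(K), tile):
        k_tile = K[start:start + tile]
        v_tile = V[start:start + tile]
        scores = [sum(a * b for a, b in zip(q, k)) for k in k_tile]
        m_new = max(m, max(scores))
        scale = math.exp(m - m_new)   # 0.0 on the first tile, since m is -inf
        p = [math.exp(s - m_new) for s in scores]
        l = l * scale + sum(p)
        o = [o[i] * scale + sum(p[j] * v_tile[j][i] for j in range(len(p)))
             for i in range(len(o))]
        m = m_new
    return [x / l for x in o]

# Illustrative sizes only: two KV tiles of 128, head dim 16.
rng = random.Random(0)
n, d = 256, 16
q = [rng.gauss(0, 1) for _ in range(d)]
K = [[rng.gauss(0, 1) for _ in range(d)] for _ in range(n)]
V = [[rng.gauss(0, 1) for _ in range(d)] for _ in range(n)]

ref = attention_reference(q, K, V)
out = attention_tiled(q, K, V, tile=128)
print(cosine(ref, out))
```

With a single tile the scale factor is applied only once, to an all-zero accumulator, which is consistent with the log's note that the n=128 path is unaffected by the rescale fix.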


58 Commits
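
Commit 8ccbdec1ed describes the multi-tile TMA bug as a slicing mistake: `(None,None,0,0)` pinned mode 2 (KV tiles) to 0, while `(None,0,None,0)` keeps modes 0 and 2 free so a tile index can still be applied. The snippet below is a pure-Python analogy of that slicing rule, not CuTeDSL code. The flat 4-mode shape, its sizes, and the helper names are assumptions made for illustration; the real tensor's mode 0 is nested, as the commit's printed shapes show.

```python
from itertools import product

# Hypothetical flat 4-mode shape; mode 2 plays the role of the KV-tile mode.
SHAPE = (4, 1, 3, 1)

def make_tensor(shape):
    """Map every coordinate to itself so a slice shows what it can reach."""
    return {c: c for c in product(*(range(n) for n in shape))}

def slice_modes(tensor, shape, coord):
    """CuTe-style slice: None keeps a mode free, an int pins it.

    Returns the sliced tensor and the shape of the surviving modes.
    """
    free = [i for i, c in enumerate(coord) if c is None]
    new_shape = tuple(shape[i] for i in free)
    out = {}
    for sub in product(*(range(n) for n in new_shape)):
        full = list(coord)
        for i, v in zip(free, sub):
            full[i] = v
        out[sub] = tensor[tuple(full)]
    return out, new_shape

def tiles_reachable(sliced):
    """Original mode-2 (KV tile) indices reachable through the slice."""
    return sorted({orig[2] for orig in sliced.values()})

t = make_tensor(SHAPE)

# Buggy slice: mode 2 is pinned to 0, so no tile index survives.
bad, bad_shape = slice_modes(t, SHAPE, (None, None, 0, 0))

# Fixed slice: modes 0 and 2 stay free, so a later index can select tile kt.
good, good_shape = slice_modes(t, SHAPE, (None, 0, None, 0))

print(bad_shape, tiles_reachable(bad))
print(good_shape, tiles_reachable(good))
```

Printing the surviving shape after each slice is the same check the commit message recommends ("PRINT THE SHAPES"): the buggy slice leaves no mode that varies with the KV tile, while the fixed slice does.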