THE BUG: tBgK[(None,None,0,0)] kept modes 0,1 free but set mode 2 (KV tiles) to 0.
TMA always loaded from tile 0 regardless of the coordinate value.
This was a LAYOUT bug, NOT a JIT bug, NOT a CuTeDSL bug.
THE FIX: tBgK[(None,0,None,0)] keeps modes 0 and 2 free.
Then tBgK[None, kt] indexes the surviving KV_tiles dim.
VERIFIED SHAPES (B200, n=256, inside @cute.kernel):
Before slice: tBgK = (((64,128),1), Int32(?), Int32(?), Int32(?)) — 4 modes
After (None,0,None,0): tBgK = (((64,128),1), Int32(?)) — 2 modes
TEST RESULTS (test_fmha_v3_stage_c.py, identity softmax):
n=128: cos 0.999998 ✅ PASS
n=256: cos 0.71 (TMA loads 2 tiles, needs O rescale for 0.9999)
n=512+: same output as n=256 (pipeline not cycling past kv_stage=2)
example10 (real softmax + O rescale): compiles and runs, cos ~0.47 (softmax bugs separate from TMA)
LESSON: PRINT THE SHAPES. ALWAYS. Reasoning about mode counts without
evidence is how we wasted a day. The 8-mode theory was WRONG — 8-None
slice fails with 'weakly congruent' at JIT compile. The tensor has 4 modes.
Updated: README (verified shapes, correct fix), MEMORY.md (new rules),
test_fmha_v3_stage_c.py, test_fmha_v3_diag.py, example10, test_fmha_v3.py,
fire_b200_test (clean git state, kill all old processes).
The tma_partition output has 8 TMA coordinate dimensions, not 4.
The Python-visible shape shows 4 modes, but the TMA descriptor uses
8 coordinates. Without the 8-None no-op pre-slice, modes 4-7 are
collapsed and the GMEM tile axis (mode 4) is pinned to 0.
Pattern that works (confirmed on B200 at n=256 in diag test):
tBgK = tBgK[(None,None,None,None,None,None,None,None)] # open 8D
cute.copy(tma_k, tBgK[None,None,None,None,kt,None,None,None], ...)
The old 4-mode indexing tBgK[(None,None,kt,0)] fails with
'rank mismatch: got 2 and 1' because slicing a 4-mode tensor
produces wrong rank for the TMA coordinate space.
Matches working diag test test_fmha_v3_diag.py exactly.
The 8-mode indexing (tBgK[None,None,None,None,kt,None,None,None]) fails at
JIT compilation with 'coord and shape are weakly congruent' error. The actual
MLIR tensor shape is (((64,128),1),?,?,?) — 4 modes, not 8.
The working fix from commit 845ad98 on the B200 used 4-mode indexing all along:
tBgK[(None, None, kt, 0)] — mode 2 = GMEM tile dim
tVgV[(None, 0, kt, 0)] — mode 2 = GMEM tile dim
Updated all files: example10, test_fmha_v3_stage_c, README, docstrings.
- Stage C works for n=128 (0.993) but multi-tile (n>128) is broken
- Root cause: tBgK slice hardcodes GMEM iteration to tile 0
- CuTeDSL TMA copy doesn't accept Python int as tile index
- Mike's combined K+V barrier fix compiles but deadlocks at runtime
- Fallback: kh.count // 2 (untested)
Document canonical test files, obsolete test sprawl, and the path from
test_fmha_v3.py → cutedsl/kernel/attention/fmha_kernel.py → vLLM integration.
Also: TMEM layout for Stage C, key lessons from A&B.
- Workspace README: full rewrite with Stage B ✅, Bug 4b root cause (P/O overlap),
FMHA V reconstruction, TMEM layout diagram, softmax store pattern, updated footguns
- Kernel README: focused on the bug, fix, and current test status
- Key lesson documented: NEVER use find_tmem_tensor_col_offset() as O placement
- (128,64) PV MMA A-fragment has N_MMA=64, reads P with wrong stride
- Softmax writes P with QK C-fragment layout (N_MMA=128)
- O[m,d] ≈ P[m,2d] — every other column effect confirmed
- All-ones and single-element V pass (uniform/sparse data hides mismatch)
- epi_tile must use PV cta_tile (partial fix: 0.01 → 0.876)
- Added footguns #9 (TMEM alias N_MMA match) and #10 (epi_tile)
- Added diagnostic test results to test table
- Fix interleave_l1_weights: remove //2 bug (g=granularity_bf16 for N-axis)
- Apply L1 weight+SF interleave in runner._ensure_stacked() and moe_pipeline
- De-interleave L1 GEMM output before gate/up split
- Fused SwiGLU kernel: epi_tile=(128,8) for subtile-level pairing
- Even subtiles = gate: SiLU in FP32 registers, save to register buffer
- Odd subtiles = up: silu(gate)*up from buffer
- Both branches produce same BF16 tensor type (CuTeDSL constraint)
- run_nvfp4_moe_fused() pipeline: fused L1 + PyTorch L2
- Runner: fused_swiglu=True option for CuTeDSLMoERunner
- Layertest: both fused and non-fused paths PASS (cosine 0.988)
- README.md updated with current status and lessons learned
Mike's directive: build the full thing with NVFP4/CuTeDSL.
No more 'optimize later' or 'just make it work' workarounds.
Key updates:
- README: full architecture docs (CSA/HCA/mHC), current status, NVFP4 coverage
- CURRENT_BUG: detailed plan for CuTeDSL NVFP4 attention, KV cache, RoPE
- Both files document: checkpoint key names, compress ratios, config issues
- Removed all 'TODO: optimize later' hedging — we build it right the first time
- README: updated NVFP4 coverage table, status, and plan
- CURRENT_BUG.md: full debugging journey, what works, what's next
- Both reflect decision to build our own CuTeDSL kernels
README.md: full rewrite explaining how we got here, project structure,
plan, and key lessons learned from the C++ CUTLASS disaster.
Removed:
- DEBUG_LOG.md (old debug timeline, no longer relevant)
- REWRITE_PLAN.md (plan is now in README)
- test_gemm.py (C++ extension test)
Added:
- vllm/nvfp4_cutedsl.py: CuTeDSLMoERunner class for vLLM integration
- Replaces nvfp4_mega_moe_full + SymmBuffer with CuTeDSL kernel
- Handles slot-based routing, L1→SiLU→L2→scatter
- prepare_weights_from_dequantized() for weight prep
Tagged the-last-of-cutlass on the old C++ kernel state.