58 Commits

Author SHA1 Message Date
89f6e64057 README: document test harness gotchas (timeout arg, stale procs, screen names) 2026-06-03 08:36:02 +00:00
0bf276f8c9 more doc cleanup 2026-06-03 07:37:13 +00:00
35dbb8d12b Cleanup Part 2: Fix docs, stale references, dead code
- Update README.md package structure to match actual file tree
  - Remove references to nonexistent fmha.py, fmha_smem_acc, kernels/decode/
  - Document live attention path: production.py → fmha_multitile_op → capi.cu → .cuh
  - Add _archive/ section
- Fix loader.py docstring: fused_amax_quantize_nvfp4 → quantize_nvfp4_from_buffer
- Remove preload_all() (dead, referenced nonexistent compressor_reduce_quant.cu)
2026-06-02 19:27:28 +00:00
224d7e24c6 harness: add fire_b200_cuda_test + check_b200_cuda, update README
Two new turnkey harness scripts for .cu tests:
- fire_b200_cuda_test: compile+run+poll, kills everything first,
  deletes old logs, one test at a time, screen-based, timeout
- check_b200_cuda: peek at running test log, or kill hung test

README updated with CUDA harness documentation.
Removed janky tests/run_cuda_test.sh.
2026-05-28 07:36:10 +00:00
43f0b5d1e8 D1.5: Fix O rescale with paired atoms (incremental approach)
Keep epilogue_tma_store for final output (proven path).
Only fix the multi-KV-tile O rescale using paired atoms from
epilogue_tmem_copy_and_partition. The paired atoms share addressing,
making the TMEM->REGS->modify->TMEM cycle lossless.

Guarded by const_expr(n_kv_tiles > 1) so single-tile path (n=128)
is completely unaffected — zero regression risk.

Full correction epilogue (one-way TMEM->REGS->SMEM->GMEM) deferred
until we can address the MLIR compilation time issue.
2026-05-26 19:34:26 +00:00
f97aee6eed plan update 2026-05-26 19:00:22 +00:00
57a8316bc1 update README: D5c sink bias DONE (cos 0.999996, single KV tile) 2026-05-26 15:17:10 +00:00
60a6f2d296 update README: D5b per-row LSE, D3/D4 DONE 2026-05-26 11:03:57 +00:00
4656fa81f9 update README: D3 and D4 status DONE 2026-05-26 10:56:57 +00:00
32850f6974 Update README, STAGE_D, STAGE_D2 with D1 rescale findings and D2 status 2026-05-25 01:18:48 +00:00
335e310c79 Update D2 status in README 2026-05-24 22:58:23 +00:00
dadfad8f89 Docs: Update STAGE_D.md, README.md with hd=512 compilation blocker, lessons learned 2026-05-24 21:35:25 +00:00
6be7690011 Docs: Update STAGE_D.md, README.md status for D1 hd≤256 milestone 2026-05-24 04:32:43 +00:00
98e5b48470 Update all .md files with D5a/D5b progress, tOrP0 fix, LSE formula
- README.md: Updated Stage status table (D1 🟡, D5 🟢), D5 section with
  D5a/D5b results, tOrP0 bug fix docs, new CuTeDSL constraints #11-12
- STAGE_D1.3.md: Added progress update - TMEM-P works, SMEM-P still blocked,
  recommended next steps
- STAGE_D.md was already updated
2026-05-23 22:07:53 +00:00
bfacfeca7b Rename FmhaV3StageC → FmhaKernel — no dev stage artifacts in production API 2026-05-23 05:45:58 +00:00
787a25516d Update README: reflect Stage C migration, built indexer/router/compressor, SMEM-P path, CuTeDSL scoping lesson 2026-05-23 05:42:44 +00:00
6dd71aaf56 docs: revised Stage D/E plan — indexer removes paged TMA, one kernel for CSA/HCA/SWA, sink merge 2026-05-23 03:10:41 +00:00
b1fe18acbb cleanup: remove archive/ (240 stale files), stale example9/10, fix test table, add Stage D plan 2026-05-23 03:05:08 +00:00
6eb9729b06 docs: update README with Stage C TMEM layout mismatch findings and status 2026-05-23 03:01:04 +00:00
8ccbdec1ed 🚀🚀🚀 TMA MULTI-TILE FIX VERIFIED ON B200 🚀🚀🚀
THE BUG: tBgK[(None,None,0,0)] kept modes 0,1 free but set mode 2 (KV tiles) to 0.
TMA always loaded from tile 0 regardless of the coordinate value.
This was a LAYOUT bug, NOT a JIT bug, NOT a CuTeDSL bug.

THE FIX: tBgK[(None,0,None,0)] keeps modes 0 and 2 free.
Then tBgK[None, kt] indexes the surviving KV_tiles dim.

VERIFIED SHAPES (B200, n=256, inside @cute.kernel):
  Before slice: tBgK = (((64,128),1), Int32(?), Int32(?), Int32(?))  — 4 modes
  After (None,0,None,0): tBgK = (((64,128),1), Int32(?))             — 2 modes

TEST RESULTS (test_fmha_v3_stage_c.py, identity softmax):
  n=128:  cos 0.999998  PASS
  n=256:  cos 0.71    (TMA loads 2 tiles, needs O rescale for 0.9999)
  n=512+: same output as n=256 (pipeline not cycling past kv_stage=2)

example10 (real softmax + O rescale): compiles and runs, cos ~0.47 (softmax bugs separate from TMA)

LESSON: PRINT THE SHAPES. ALWAYS. Reasoning about mode counts without
evidence is how we wasted a day. The 8-mode theory was WRONG — 8-None
slice fails with 'weakly congruent' at JIT compile. The tensor has 4 modes.

Updated: README (verified shapes, correct fix), MEMORY.md (new rules),
test_fmha_v3_stage_c.py, test_fmha_v3_diag.py, example10, test_fmha_v3.py,
fire_b200_test (clean git state, kill all old processes).
2026-05-22 23:51:29 +00:00
30eaba39aa FIX: 8-None no-op pre-slice opens full TMA coordinate space (8 dims)
The tma_partition output has 8 TMA coordinate dimensions, not 4.
The Python-visible shape shows 4 modes, but the TMA descriptor uses
8 coordinates. Without the 8-None no-op pre-slice, modes 4-7 are
collapsed and the GMEM tile axis (mode 4) is pinned to 0.

Pattern that works (confirmed on B200 at n=256 in diag test):
  tBgK = tBgK[(None,None,None,None,None,None,None,None)]  # open 8D
  cute.copy(tma_k, tBgK[None,None,None,None,kt,None,None,None], ...)

The old 4-mode indexing tBgK[(None,None,kt,0)] fails with
'rank mismatch: got 2 and 1' because slicing a 4-mode tensor
produces wrong rank for the TMA coordinate space.

Matches working diag test test_fmha_v3_diag.py exactly.
2026-05-22 23:18:40 +00:00
9c5adcee46 FIX: tma_partition tensors have 4 modes, not 8. Mode 2 is GMEM tile dim.
The 8-mode indexing (tBgK[None,None,None,None,kt,None,None,None]) fails at
JIT compilation with 'coord and shape are weakly congruent' error. The actual
MLIR tensor shape is (((64,128),1),?,?,?) — 4 modes, not 8.

The working fix from commit 845ad98 on the B200 used 4-mode indexing all along:
  tBgK[(None, None, kt, 0)] — mode 2 = GMEM tile dim
  tVgV[(None, 0, kt, 0)] — mode 2 = GMEM tile dim

Updated all files: example10, test_fmha_v3_stage_c, README, docstrings.
2026-05-22 23:08:27 +00:00
0330c1da7a Fix README: multi-tile was layout bug not JIT bug, add example10, update status 2026-05-22 22:57:53 +00:00
dbd77f2bc4 DOCUMENT: TMA 8-mode indexing — the bug that cost us a full day. README + inline comments. 2026-05-22 21:28:58 +00:00
56769cdbf5 README: add fire_b200_test docs, update multi-tile blocker with real findings 2026-05-22 17:41:23 +00:00
81eee05018 README: add test harness instructions 2026-05-22 17:09:53 +00:00
793e3243d5 README + MEMORY: update Stage C status to single-tile only, document multi-tile blocker
- Stage C works for n=128 (0.993) but multi-tile (n>128) is broken
- Root cause: tBgK slice hardcodes GMEM iteration to tile 0
- CuTeDSL TMA copy doesn't accept Python int as tile index
- Mike's combined K+V barrier fix compiles but deadlocks at runtime
- Fallback: kh.count // 2 (untested)
2026-05-22 16:32:31 +00:00
35d532c742 README: update Stage C status to WORKING, add CuTeDSL constraints and target architecture 2026-05-22 09:39:15 +00:00
96f900f5f0 README: add full DSV4 pipeline architecture diagram (CSA/HCA, not MLA) 2026-05-21 17:40:25 +00:00
2ec32eb8da README: update for new dsv4/ package structure 2026-05-21 17:34:40 +00:00
20564425ec README: full roadmap — Stage C (real softmax), D (paged KV), E (production kernel)
Document canonical test files, obsolete test sprawl, and the path from
test_fmha_v3.py → cutedsl/kernel/attention/fmha_kernel.py → vLLM integration.
Also: TMEM layout for Stage C, key lessons from A&B.
2026-05-21 15:43:01 +00:00
ad24792fc7 Update both READMEs: Stage B complete, document TMEM overlap root cause
- Workspace README: full rewrite with Stage B , Bug 4b root cause (P/O overlap),
  FMHA V reconstruction, TMEM layout diagram, softmax store pattern, updated footguns
- Kernel README: focused on the bug, fix, and current test status
- Key lesson documented: NEVER use find_tmem_tensor_col_offset() as O placement
2026-05-21 15:36:06 +00:00
750f1f09c9 README: Bug 4 ROOT CAUSE CONFIRMED - V SMEM 1 K-tile + PV 8 K-phases mismatch. Zero-pad V workaround correct. 2026-05-21 09:59:37 +00:00
50e9b5da81 README: Bug 4 corrected — NOT TMEM alias. A-fragment identical for all PV sizes. Real bug in V/B or output C/D. 2026-05-21 09:47:08 +00:00
5e37ea56e4 FOOTGUN #0: num_tma_load_bytes MUST include V bytes. Fix v27, v29, comment all. Update README. 2026-05-21 07:13:14 +00:00
b9b1b808a5 README: update with v28/v29 deadlock investigation, FMHA softmax bridge trace, new footguns 2026-05-21 06:46:02 +00:00
a7fd2761df README: Bug 4 root cause — TMEM layout mismatch (128,64) PV A-fragment vs softmax P write
- (128,64) PV MMA A-fragment has N_MMA=64, reads P with wrong stride
- Softmax writes P with QK C-fragment layout (N_MMA=128)
- O[m,d] ≈ P[m,2d] — every other column effect confirmed
- All-ones and single-element V pass (uniform/sparse data hides mismatch)
- epi_tile must use PV cta_tile (partial fix: 0.01 → 0.876)
- Added footguns #9 (TMEM alias N_MMA match) and #10 (epi_tile)
- Added diagnostic test results to test table
2026-05-21 05:17:12 +00:00
0dc6fe4a7d Stage B progress: PV works for square (128,128), broken for (128,64)
- Bug 1 (V MN-major): Fix applied
- Bug 2 (softmax packing): Confirmed correct (V=I test: cosine 1.0)
- Bug 3 (ACCUMULATE): Fix applied (first PV must overwrite, not accumulate)
- Bug 4 (CURRENT): PV MMA broken for non-square output
  - (128,128) PV with random V: cosine 0.999999 
  - (128,64) PV with MN-major V: cosine ~0.01 
  - Softmax packing, layout aliasing, pipeline ordering all verified correct
  - Root cause unknown — likely epilogue/V layout/MMA tiler issue

Added test_pv_diag.py (V=I and random V, 128x128 output — PASS)
Added test_layout_compare.py (TMEM layout inspection)
Added test_inspect_types.py (TMEM pointer arithmetic verification)
Updated test_mma_si_pv.py with head_dim param, pv_mma_tiler_mn fix, ACCUMULATE fix
Updated READMEs with current state
2026-05-21 04:40:28 +00:00
7a8945eb76 Stage B: pipeline deadlock fixed, V MN-major applied, PV output garbage
Pipeline deadlock fixed:
- No cta_layout_vmnk on mma_si PipelineUmmaAsync
- TMA warp excluded from tmem.wait_for_alloc
- PipelineTmaStore (not TmaStorePipeline)

Bug 1 (V MN-major): fix applied
- PV MMA uses v_major=OperandMajorMode.MN
- V shaped (64,128) strides(1,64) via as_strided

Bug 2 (softmax packing): C-fragment composition store applied
- FP32 to BF16 packing works
- St32x32bOp uses Float32 (not BFloat16)

Bug 3 (PV garbage): investigating
- PV MMA cosine ~0.01 against reference
- Suspected TMEM layout mismatch between softmax P store and PV A-fragment read

Test results:
- test_mma_si_only: cosine 0.999999 PASS
- test_mma_si_pv: cosine 0.01 FAIL (pipeline works, PV output wrong)
2026-05-21 04:10:07 +00:00
467ade37b2 Stage B: C-fragment vs A-fragment TMEM layout mismatch diagnosed
Key finding: C-fragment and A-fragment use different physical TMEM address
mappings. St32x32bOp with C-fragment writes to C-layout addresses, but PV MMA
reads from A-layout addresses. Forward FMHA recast validated FP16 only, not BF16.

Working: FP32 ld/st roundtrip, BF16 elemwise, BF16 recast ld S0->st S1 (all cos 0.999999)
Broken: C-frag st + A-frag read (NaN), A-frag store + PV MMA (cos -0.02)
Next: Fix register data flow (128 FP16/thread load vs 64 BF16/thread store mismatch)
2026-05-21 00:12:47 +00:00
97656a5cd1 Stage B: two MMAs + identity softmax — crash fixed, softmax output still wrong
Key fixes:
- PipelineUmmaAsync consumer group: 32*4=128 threads (not 4 warps)
- TMEM offsets computed from find_tmem_tensor_col_offset (not hardcoded)
- P fragment from p_tmem_s.outer + make_fragment_A (matching fmha.py)
- V SMEM aliasing via recast_ptr

Status:
- Stage A: cosine 0.999999 
- Stage B: runs without crash, identity softmax cosine -0.02 
- Diagnostics: TMEM layout inspection, bisection results
2026-05-20 20:26:25 +00:00
9f0528f150 Update README: reflect current state, add C128A/C4A topk + warmup fixes 2026-05-20 06:51:12 +00:00
bbba289bd8 feat: GPU-native SWA + sparse decode attention kernels (CuTeDSL)
- native_swa_decode.py: BlackwellSWADecodeKernel
  - CTA mapping: 1 CTA per (decode_token, q_head_group)
  - Online softmax with KV tile streaming (16 tokens/tile)
  - Pre-dequantized bf16 KV (fp8 dequant on host - MLIR cvt_fpext
    requires 32-bit aligned vector, no scalar fp8->bf16 support)
  - Cosine 0.9999+ vs PyTorch batched SDPA reference
  - Fallback _fallback_batched_sdp when CuTeDSL unavailable

- native_sparse_decode.py: BlackwellSparseDecodeKernel
  - Combined SWA + compressed KV in single attention pass
  - Supports CSA (cr=4) and HCA (cr=128) layers
  - Sink weight merge on host side
  - Cosine 0.9999+ vs combined SDPA reference

- fp8_bf16.py: Documents MLIR limitation (cvt_fpext requires
  vector<4xf8>, no scalar support). Pre-dequant is the workaround.

- vLLM wiring (attention.py):
  - SWA-only layers: native_swa_decode_attention
  - CSA/HCA layers: native_sparse_decode_attention with topk + attn_sink
  - csa_attention.py updated to use native kernels

- Tests: test_decode_pipeline.py, test_sparse_decode.py both passing
2026-05-20 05:46:15 +00:00
06bf4f482d README: comprehensive update with current kernel status 2026-05-20 04:42:57 +00:00
a30d9eb523 Update README with final kernel status 2026-05-20 04:39:57 +00:00
aa8563c626 Fused SwiGLU epilogue with granularity-8 weight interleave
- Fix interleave_l1_weights: remove //2 bug (g=granularity_bf16 for N-axis)
- Apply L1 weight+SF interleave in runner._ensure_stacked() and moe_pipeline
- De-interleave L1 GEMM output before gate/up split
- Fused SwiGLU kernel: epi_tile=(128,8) for subtile-level pairing
  - Even subtiles = gate: SiLU in FP32 registers, save to register buffer
  - Odd subtiles = up: silu(gate)*up from buffer
  - Both branches produce same BF16 tensor type (CuTeDSL constraint)
- run_nvfp4_moe_fused() pipeline: fused L1 + PyTorch L2
- Runner: fused_swiglu=True option for CuTeDSLMoERunner
- Layertest: both fused and non-fused paths PASS (cosine 0.988)
- README.md updated with current status and lessons learned
2026-05-20 04:13:52 +00:00
57d4cb714f docs: rewrite README.md with current project state
- Document all 5 correctness bug fixes
- Document fused SwiGLU epilogue progress (Step 1 PASS, Step 2 blocked)
- Document CuTeDSL runtime conditional limitation
- List remaining steps (amax shuffles, NVFP4 quantize, FP4/SF TMA stores)
- Document weight interleave and register layout
- Capture key lessons learned
- Update file structure and test inventory
2026-05-20 03:30:35 +00:00
5fb70b4cd2 Update README.md and CURRENT_BUG.md: eliminate stale issues, document NaN investigation, clarify our kernels are clean 2026-05-19 20:22:10 +00:00
31b9cfbdbd Update README and CURRENT_BUG: BUILD YOUR OWN KERNELS. Stop patching vLLM. 2026-05-19 15:19:55 +00:00
914d27fee7 Update README + CURRENT_BUG: full CuTeDSL NVFP4 plan, no more PyTorch fallbacks
Mike's directive: build the full thing with NVFP4/CuTeDSL.
No more 'optimize later' or 'just make it work' workarounds.

Key updates:
- README: full architecture docs (CSA/HCA/mHC), current status, NVFP4 coverage
- CURRENT_BUG: detailed plan for CuTeDSL NVFP4 attention, KV cache, RoPE
- Both files document: checkpoint key names, compress ratios, config issues
- Removed all 'TODO: optimize later' hedging — we build it right the first time
2026-05-19 08:26:16 +00:00