Go to file

biondizzle 2030d41e41 Fix TMEM overlap in test_pv64_with_softmax.py too — cosine 0.999999

Same P/O overlap bug: O at col 64 overlapped P at [32,96).
Same fixes: O at col 128, FMHA V reconstruction, power-of-2 TMEM alloc.

2026-05-21 15:32:49 +00:00

cutedsl

even more stuff

2026-05-21 05:55:22 +00:00

cutedsl_loader

fix: dynamic buffer sizing in nvfp4_linear for varying token counts

2026-05-19 23:59:55 +00:00

reference

docs: add CuTeDSL rewrite plan + reference files

2026-05-16 02:41:51 +00:00

tests

Fix TMEM overlap in test_pv64_with_softmax.py too — cosine 0.999999

2026-05-21 15:32:49 +00:00

.dockerignore

Patch utils.py the standard way: copy modified file into Docker image

2026-05-18 19:10:08 +00:00

.gitignore

feat: CUTLASS NVFP4 mega_moe kernel — slot-based L1/L2, source-first SF remap

2026-05-15 11:38:18 +00:00

build_and_run.sh

Revert "feat: auto-warmup in build_and_run.sh"

2026-05-16 06:14:28 +00:00

docker-compose.yml

Reduce max_model_len to 256

2026-05-19 09:37:38 +00:00

Dockerfile

nuke vllm because this keep confusing people

2026-05-19 23:04:36 +00:00

pyproject.toml

fix: dynamic buffer sizing in nvfp4_linear for varying token counts

2026-05-19 23:59:55 +00:00

README.md

README: Bug 4 ROOT CAUSE CONFIRMED - V SMEM 1 K-tile + PV 8 K-phases mismatch. Zero-pad V workaround correct.

2026-05-21 09:59:37 +00:00

README.md

DSV4 NVFP4 Workspace

Status (May 21, 2026 — 09:18 UTC)

Stage A ✅ COMPLETE

Bare Q@K^T via tcgen05.mma → TMEM → GMEM. Cosine 0.999999.

Stage B 🔨 IN PROGRESS — TMEM Alias Bug 4

Two MMAs chained: Q@K^T (SMEM source) → identity softmax in TMEM → P@V (TMEM source).

Pipeline deadlock: ✅ FIXED. Softmax packing: ✅ CONFIRMED CORRECT.

Bug 4 (ACTIVE): Non-(128,128) PV MMA — V/B Staging or Output C/D Failure

Summary

The softmax writes P to TMEM using the QK C-fragment layout. The PV MMA reads P from TMEM using the PV A-fragment layout. For (128,128) PV these layouts agree. For (128,16) PV they disagree — the PV A-fragment reads from different TMEM columns than where the softmax wrote, producing zero output.

FMHA uses (128,16) PV with the same construction and works. The root difference is not yet identified despite exhaustive comparison. FMHA references: p_tmem_layout_staged = make_smem_layout_a(pv_mma, pv_mma_tiler, q_dtype, 1) — same call we make.

What Works / What Doesn't

✅ PV (128,128) output, V=I or random → cosine 1.0 / 0.999999
✅ PV (128,128) with zero-padded V (head_dim=16) → cosine 1.0 WORKAROUND
✅ PV (128,64), all-ones V → cosine 0.999999 (uniform hides bug)
✅ PV (128,64), single-element V → cosine 1.0 (sparse hides bug)
❌ PV (128,64), truncated identity V → cosine 0.02
❌ PV (128,16), V=I(128,128) → cosine 0.0 (all zeros)
❌ PV (128,16) with P at S offset (no softmax) → NaN (FP32→BF16 reinterpret)

Root Cause (CONFIRMED May 21 09:50 UTC)

Bug is NOT the TMEM alias. The PV A-fragment layout is identical for all PV sizes (confirmed by C++ source and diagnostics): all PV sizes produce tOrP2_s = (2048, 1, 8), size=16384.

The real bug: V SMEM only holds 1 K-tile (2048 BF16), but the PV MMA iterates 8 K-phases. For non-(128,128) V, most K-phases read wrong V data.

(128,128) PV + V=I works by coincidence (V=I makes the projection self-consistent)
(128,32) PV + V=(32,128) fails because V SMEM only has V[0:16,:], K-phases 1-7 read wrong data
Zero-padded V works because V=(128,128) covers all 8 K-phases; rows beyond head_dim are zero

How FMHA avoids this: FMHA interleaves QK and PV per KV-tile. Each tile loads 16 K-rows of V, and PV processes only that tile. This ensures V SMEM always has the correct data.

Workaround: Zero-pad V to 128 K-rows (2-4x compute waste, but correct). Proper fix: FMHA-style KV-tile interleaving.

Current Workaround

Use (128,128) PV with zero-padded V. This wastes compute (8× for head_dim=16, 2× for head_dim=64) but produces correct results (cosine 1.0). For the production kernel, we'll use this initially and optimize to (128,16) PV once the TMEM alias is resolved.

Required Fixes (Not Yet Applied)

Primary: Softmax must write P using the PV A-fragment TMEM layout, not the QK C-fragment layout. Requires constructing a make_tmem_copy with tP (PV layout) as the destination, and rearranging register data from QK partition to PV partition.
Secondary: epi_tile must use PV's cta tile, and self.cta_tile_shape_mnk must be swapped before epilogue_tma_store. FMHA sets self.epi_tile = self.pv_mma_tiler[:2] directly.
Alternative (for later): Investigate using composition() to create a hybrid layout that both the QK softmax write and PV A-fragment read can agree on.

Bugs 1–3: ✅ FIXED

Bug 1: V B-Operand Must Be MN-Major

FMHA requires V to be MN-major for the PV MMA B-operand. V must be shaped (head_dim, seq) = (64, 128) with strides (1, 64) via as_strided.

Bug 2: C-Fragment Composition Store for P — CONFIRMED CORRECT

FP32→BF16 packing via C-fragment composition store works. ⛔ St32x32bOp MUST use Float32, NOT BFloat16.

Bug 3: First PV Must Use ACCUMULATE=False

If ACCUMULATE=True on the first PV, O = P@V + old_O adds uninitialized TMEM. FMHA: pv_tiled_mma.set(tcgen05.Field.ACCUMULATE, kphase_idx != 0).

Pipeline Deadlock — ✅ FIXED (May 21)

Three root causes found and fixed:

PipelineUmmaAsync for mma_si must NOT pass cta_layout_vmnk
TMA warp must NOT call tmem.wait_for_alloc()
pipeline.PipelineTmaStore (not TmaStorePipeline)

⛔ FOOTGUNS — CUTLASS CuTeDSL Landmines

St32x32bOp with BFloat16 → ILLEGAL MEMORY ACCESS — Must use Float32 + cute.recast_ptr
V major ≠ K major — V must be MN-major, use as_strided
C-fragment → A-fragment TMEM alias only works when N_MMA matches — (128,128) works, (128,64) breaks
PipelineUmmaAsync consumer = thread count, NOT warp count — 32 * len(warp_ids)
mma_si pipeline must NOT pass cta_layout_vmnk
TMA warp excluded from tmem barrier
First PV ACCUMULATE=False
TMEM offset: FP32 ptr + 32 = BF16 ptr + 64 (width scaling)
epi_tile must use PV cta_tile, not QK
CuTe nested layout modes flatten sequentially — ((128,16),1,(4,2)):((65536,1),0,(16,64)) is sequential

Architecture: Per-Tile Flow

For each KV tile:
  1. Load warp writes sKV[stage] (paged FP8 gather via indexed cp.async)
  2. MMA warp issues MMA1: sQ @ sKV[stage]^T → tmem_scores (accumulate=False)
     Signals scores_full_mbar (via PipelineUmmaAsync commit)
  3. Epilogue warps wait on mma_si consumer (scores ready), then:
     a. tcgen05.ld scores from TMEM → register fragments
     b. Compute tile_max, new_max, rescale = exp(old_max - new_max)
     c. Apply rescale to tmem_output IN PLACE (tmem_output *= rescale)
     d. tcgen05.st exp(scores - new_max) back to TMEM → P operand
     e. Release mma_si (softmax_done — MMA warp can re-acquire and issue PV MMA)
  4. MMA warp waits on mma_si acquire (softmax done), MMA2: P @ sV → tmem_output (accumulate=True)
  5. Stage released, load warp can refill it

After all tiles: epilogue warps tcgen05.ld tmem_output, divide by row_sum, cast to BF16, store to GMEM

Environment

Server: root@45.76.247.107 (B200, 180 GiB HBM3e per GPU)
venv: source /root/dsv4-nvfp4-workspace/venv/bin/activate
PYTHONPATH: /root/dsv4-nvfp4-workspace/kernel
Model: /root/nvidia-meeting/DeepSeek-V4-Pro-NVFP4
vLLM repo: /root/dsv4-nvfp4-workspace/vllm (modified for Blackwell)
Pseudocode: /root/fragile-kernel-example/README.md
fmha.py reference: /root/cutlass/examples/python/CuTeDSL/cute/blackwell/kernel/attention/fmha/fmha.py
fmha_bwd.py reference: /root/cutlass/examples/python/CuTeDSL/cute/blackwell/kernel/attention/fmha/fmha_bwd.py

README.md Unescape Escape

DSV4 NVFP4 Workspace

Status (May 21, 2026 — 09:18 UTC)

Stage A ✅ COMPLETE

Stage B 🔨 IN PROGRESS — TMEM Alias Bug 4

Bug 4 (ACTIVE): Non-(128,128) PV MMA — V/B Staging or Output C/D Failure

Summary

What Works / What Doesn't

Root Cause (CONFIRMED May 21 09:50 UTC)

Current Workaround

Required Fixes (Not Yet Applied)

Bugs 1–3: ✅ FIXED

Bug 1: V B-Operand Must Be MN-Major

Bug 2: C-Fragment Composition Store for P — CONFIRMED CORRECT

Bug 3: First PV Must Use ACCUMULATE=False

Pipeline Deadlock — ✅ FIXED (May 21)

⛔ FOOTGUNS — CUTLASS CuTeDSL Landmines

Architecture: Per-Tile Flow

Environment

README.md