
biondizzle ad24792fc7 Update both READMEs: Stage B complete, document TMEM overlap root cause

- Workspace README: full rewrite with Stage B ✅, Bug 4b root cause (P/O overlap),
  FMHA V reconstruction, TMEM layout diagram, softmax store pattern, updated footguns
- Kernel README: focused on the bug, fix, and current test status
- Key lesson documented: NEVER use find_tmem_tensor_col_offset() as O placement

2026-05-21 15:36:06 +00:00

cutedsl

even more stuff

2026-05-21 05:55:22 +00:00

cutedsl_loader

fix: dynamic buffer sizing in nvfp4_linear for varying token counts

2026-05-19 23:59:55 +00:00

reference

docs: add CuTeDSL rewrite plan + reference files

2026-05-16 02:41:51 +00:00

tests

Fix TMEM overlap in test_pv64_with_softmax.py too — cosine 0.999999

2026-05-21 15:32:49 +00:00

.dockerignore

Patch utils.py the standard way: copy modified file into Docker image

2026-05-18 19:10:08 +00:00

.gitignore

feat: CUTLASS NVFP4 mega_moe kernel — slot-based L1/L2, source-first SF remap

2026-05-15 11:38:18 +00:00

build_and_run.sh

Revert "feat: auto-warmup in build_and_run.sh"

2026-05-16 06:14:28 +00:00

docker-compose.yml

Reduce max_model_len to 256

2026-05-19 09:37:38 +00:00

Dockerfile

nuke vllm because this keep confusing people

2026-05-19 23:04:36 +00:00

pyproject.toml

fix: dynamic buffer sizing in nvfp4_linear for varying token counts

2026-05-19 23:59:55 +00:00

README.md

Update both READMEs: Stage B complete, document TMEM overlap root cause

2026-05-21 15:36:06 +00:00

README.md

DSV4 NVFP4 Kernel

Status (May 21, 2026 — 15:35 UTC)

Stage A ✅ COMPLETE

Bare Q@K^T via tcgen05.mma → TMEM → GMEM. Cosine 0.999999.

Stage B ✅ COMPLETE — QK → Softmax → PV pipeline working for (128,64) PV

Cosine 0.999999 with identity softmax and random V.

Stage C 🔨 NEXT

Real softmax (exp, row max, row sum, rescale). Multi-tile with proper accumulation.
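As a reference for what Stage C has to compute, the following is a minimal pure-Python sketch of softmax with a running row max, running row sum, and rescale across KV tiles. It is not kernel code: the function names, the single-query-row shape, and the tile size are all illustrative assumptions, and it only shows that the tiled form matches a one-shot softmax followed by PV.

```python
import math

def softmax_pv_reference(scores, v):
    """One-shot softmax(scores) @ v for a single query row."""
    m = max(scores)
    w = [math.exp(s - m) for s in scores]
    total = sum(w)
    hd = len(v[0])
    return [sum(w[j] * v[j][d] for j in range(len(v))) / total for d in range(hd)]

def softmax_pv_tiled(scores, v, tile):
    """Same result, visiting keys tile by tile with running max/sum and rescale."""
    hd = len(v[0])
    row_max = -math.inf
    row_sum = 0.0
    acc = [0.0] * hd
    for start in range(0, len(scores), tile):
        s_tile = scores[start:start + tile]
        v_tile = v[start:start + tile]
        new_max = max(row_max, max(s_tile))
        # Rescale factor for everything accumulated so far (0.0 on the first tile).
        scale = math.exp(row_max - new_max)
        p = [math.exp(s - new_max) for s in s_tile]
        row_sum = row_sum * scale + sum(p)
        acc = [
            a * scale + sum(p[j] * v_tile[j][d] for j in range(len(p)))
            for d, a in enumerate(acc)
        ]
        row_max = new_max
    # Final normalization by the row sum.
    return [a / row_sum for a in acc]
```

A reference like this could be used on the host side to validate multi-tile accumulation before comparing against the kernel output.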

Stage B — Bug 4b Root Cause & Fix

The Bug: TMEM P/O Region Overlap

Symptom: (128,64) PV produces NaN or zeros. (128,128) PV works fine.

Root cause: The PV output accumulator O was placed at find_tmem_tensor_col_offset(tOtO), which returns 64 for (128,64) PV. P occupies TMEM columns [32, 96), so O at [64, 128) overlaps P at [64, 96). The PV MMA reads P (A-operand) while it writes O (D-operand) into those same columns, so the A-operand is corrupted mid-computation.

For (128,128) PV, find_tmem_tensor_col_offset(tOtO) returns 128, so O starts after P — no overlap. It worked by accident.
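The overlap arithmetic can be checked with a small pure-Python sketch. The helper name `tmem_overlap` is hypothetical, and the O widths assume one FP32 column per output N element (64 columns for (128,64) PV, 128 columns for (128,128) PV).

```python
def tmem_overlap(a_start, a_cols, b_start, b_cols):
    """Return the overlapping half-open column range of two regions, or None."""
    lo = max(a_start, b_start)
    hi = min(a_start + a_cols, b_start + b_cols)
    return (lo, hi) if lo < hi else None

P_START, P_COLS = 32, 64  # P occupies [32, 96)

# (128,64) PV: O placed at column 64, assumed 64 FP32 columns wide
print(tmem_overlap(P_START, P_COLS, 64, 64))    # (64, 96)

# (128,128) PV: O placed at column 128, assumed 128 FP32 columns wide
print(tmem_overlap(P_START, P_COLS, 128, 128))  # None
```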

The Fix

Place O after both S and P regions:

p_cols_fp32 = pv_mma_tiler[2] * q_dtype.width // qk_acc_dtype.width  # 128*16/32 = 64
p_end = tmem_p0_offset + p_cols_fp32  # 32 + 64 = 96
s_cols = qk_mma_tiler[1]  # 128
o_after = max(s_cols, p_end)  # 128
tmem_o0_offset = ((o_after + 31) // 32) * 32  # align to 32 = 128
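The same computation, pulled out as a plain-Python function so the numbers can be checked without a GPU. The function and parameter names are hypothetical; only the arithmetic comes from the snippet above.

```python
def o_column_offset(p_offset, pv_k, p_bits, acc_bits, s_cols, align=32):
    """Column offset for O, placed after both the S and P regions."""
    # P elements are narrower than the FP32 accumulator, so they pack into fewer columns.
    p_cols = pv_k * p_bits // acc_bits
    p_end = p_offset + p_cols
    o_after = max(s_cols, p_end)
    # Round up to the next multiple of `align` columns.
    return ((o_after + align - 1) // align) * align

# Values from this README: P at column 32, K=128, 16-bit P, 32-bit accumulator, 128 S columns.
print(o_column_offset(32, 128, 16, 32, 128))  # 128
```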

Secondary Fix: FMHA-Style V Reconstruction

V from DLPack has logical shape (n, hd) but PV B-operand expects (hd, n). Reconstruct inside CuTe:

v_fmha = cute.make_tensor(
    v.iterator,
    cute.make_layout(
        (HEAD_DIM, s_k, 1),
        stride=(1, HEAD_DIM, HEAD_DIM * s_k),
    ),
)
v_major = LayoutEnum.from_tensor(v_fmha).mma_major_mode()  # MN
# Use v_fmha in make_tiled_tma_atom_B, NOT the DLPack v
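To see why the stride choice gives an (hd, n) view without moving data, here is a tiny pure-Python sketch of the index math. It assumes the DLPack V buffer is contiguous row-major (n, hd); the sizes and accessor names are illustrative only.

```python
HEAD_DIM, S_K = 4, 3

# Row-major (n, hd) buffer: element (n, d) lives at offset n * HEAD_DIM + d.
buf = [n * 10 + d for n in range(S_K) for d in range(HEAD_DIM)]

def v_dlpack(n, d):
    """Logical (n, hd) indexing, as the tensor arrives from DLPack."""
    return buf[n * HEAD_DIM + d]

def v_fmha(d, n):
    """(hd, n) indexing with stride (1, HEAD_DIM) over the same buffer."""
    return buf[d * 1 + n * HEAD_DIM]

# Both views address the same element; only the index order differs.
same = all(
    v_fmha(d, n) == v_dlpack(n, d)
    for d in range(HEAD_DIM)
    for n in range(S_K)
)
print(same)  # True
```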

TMEM Layout

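Col:  0       32              96      128             192             256
      |-------------- S --------------|------ O ------|
      |       |------ P ------| (gap) |

      S: QK acc, 128 FP32, columns [0, 128)
      P: Softmax P, 64 FP32, columns [32, 96)
      (gap): 32 col, columns [96, 128)
      O: PV acc, 64 FP32, columns [128, 192)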

P aliases part of S (softmax overwrites S columns 32-95 with P). O must not overlap P or S.

Softmax P Store (FMHA Pattern)

The store uses a QK C-fragment composition; the read uses a PV A-fragment. These are two separate aliases of the same physical TMEM. The P/A alias works (proven by the no-softmax test) because both layouts depend on M=128 and K, not on the PV output N.

# Store (softmax writes P)
tStP = cute.make_tensor(tStS.iterator + tmem_p0_offset,
    cute.composition(tStS.layout, cute.make_layout((128, p_cols_fp32))))
tiled_tmem_store = tcgen05.make_tmem_copy(store_atom, tStP)

# Read (PV MMA reads P)
tP = cute.make_tensor(tStS.iterator, p_tmem_s.outer)
tOrP = pv_thr.make_fragment_A(tP)[None,None,None,0]
tOrP0 = cute.make_tensor(tOrP.iterator + width_scale * tmem_p0_offset, tOrP.layout)

rP_words = cute.make_rmem_tensor(tScP.shape, qk_acc_dtype)
rP_bf16 = cute.make_tensor(recast_ptr(rP_words.iterator, dtype=q_dtype), tTMEM_LOADrS.layout)

Test Files

tests/test_fmha_v3.py — Full pipeline with KV-tile interleaving. PASS.
tests/test_pv64_with_softmax.py — Single AB pipeline. PASS.
tests/test_128_128_vdiag.py — (128,128) PV baseline. PASS.
tests/test_qkonly.py — QK only. PASS.
tests/test_qk_softmax.py — QK + softmax (no PV). PASS.
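The results in this README are reported as a cosine similarity between kernel output and a reference. A minimal sketch of such a metric over flattened outputs is shown below; the function name is an assumption, and the actual tests may compute it differently (for example with a tensor library).

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length flat sequences of floats."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)
```

Cosine similarity ignores a uniform scale error, so a separate check on magnitudes may be worth adding alongside it.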