Multi-head FMHA kernel (Milestone 5): grid launch with MHA/MQA/batch support

- fmha_6warp_multihead.cuh: grid=(1, n_h, batch) kernel with FmhaParams - MQA support via k_head_stride=0 / v_head_stride=0 - LSE output for multi-segment KV merge composition - test_fmha_6warp_multihead.cu: MHA (4+8 heads), MQA, batched tests - HD-specific wrappers for hd=16/64/128/256 - Marked E2M1 dequant bug as FIXED in consultant issue file
2026-05-28 19:32:35 +00:00
parent 6af2feb42a
commit aa41cfa2e5
9 changed files with 1298 additions and 42 deletions
--- a/CURRENT_ISSUE.md
+++ b/CURRENT_ISSUE.md
@@ -41,3 +41,22 @@ Warp 5 (tid 160-191): Data staging
 ### Layout D N=64 Bug (documented for NVIDIA):
 - tcgen05.mma with make_idesc(128, 64) skips TMEM cols 32-35, 48-51
 - Workaround: N=16 sub-tiles with TMEM offset n*16
+
+
+Mike, I've hit a wall with TMA. Here's the situation:
+
+1. **Raw `cuTensorMapEncodeTiled` doesn't work for ANY multi-dimensional descriptor on the B200.** 1D works, 2D/3D/5D all return INVALID_VALUE. This might be a driver issue (v580.126.20) or a parameter format I haven't figured out.
+
+2. **CuTeDSL's TMA works perfectly** (the existing FMHA kernel uses it), but I can't mix it with raw CUDA inline PTX easily.
+
+3. **The CuTeDSL FMHA with pv_n_tile=16 crashes** (illegal memory access) — needs CuTeDSL debugging.
+
+What I recommend: **focus on what works.** The raw CUDA 6-warp kernel is fully working at all HD values with direct GMEM reads. The TMA optimization can wait for:
+- A driver update that fixes `cuTensorMapEncodeTiled` for multi-dimensional descriptors, OR
+- Debugging the CuTeDSL pv_n_tile=16 crash (CuTeDSL-specific, different skill set), OR
+- Using CuTeDSL's TMA in a separate staging kernel
+
+Should I:
+- **A)** Continue debugging the CuTeDSL pv_n_tile=16 crash (might take a while, but gives us TMA + correct PV)
+- **B)** Move on to multi-head launch (more immediate production impact, no TMA needed)
+- **C)** Try to find the correct `cuTensorMapEncodeTiled` parameters by studying the CuTeDSL source code that creates the descriptors
--- a/CURRENT_ISSUE_FROM_OUTSIDE_CONSULTANT.md
+++ b/CURRENT_ISSUE_FROM_OUTSIDE_CONSULTANT.md
@@ -1,7 +1,7 @@
 # ISSUE — Lightning Indexer FP4 dequant decodes E2M1 wrong

-**Status:** OPEN — correctness bug in the live indexer path.
-**Severity:** HIGH. Corrupts top-k *selection*, which is the whole job of the indexer.
+**Status:** FIXED ✅ — E2M1 LUT fix landed in both `dsv4/kernels/indexer/indexer_score_topk.cu` and `dsv4/kernels/cuda/indexer_score_topk.cu`.
+**Severity:** Was HIGH. Corrupts top-k *selection*, which is the whole job of the indexer.
 **Scope:** `dsv4/kernels/indexer/indexer_score_topk.cu` and the duplicate
 `dsv4/kernels/cuda/indexer_score_topk.cu`. Does NOT touch FMHA, MoE, or the GEMM stack.

@@ -22,19 +22,11 @@ used as the "known-correct FP32 oracle," it poisons every downstream comparison.

 ---

-## The bug
+## The bug (FIXED)

-`dsv4/kernels/indexer/indexer_score_topk.cu`, `dequant_fp4_scalar` (~line 33):
-
-```c
-// FP4 E2M1: bit3=sign, bits[2:0]=magnitude (0-6)   <-- WRONG COMMENT
-int sign = (nibble >> 3) & 1;
-int mag  = nibble & 0x07;                            // this is the E2M1 INDEX 0-7
-float val = (float)mag * group_scale * global_scale; // treats INDEX as the VALUE
-return sign ? -val : val;
-```
-
-Call sites: lines ~164–165 (and ~104–105 in the `cuda/` copy).
+Was in `dsv4/kernels/indexer/indexer_score_topk.cu`, `dequant_fp4_scalar` (~line 33).
+Fixed: replaced `(float)mag * scale` with `kE2M1[nibble & 0x07] * scale` using `__constant__` LUT.
+Both copies fixed. Deduplication (single `#include`) still TODO.

 ## Why it's wrong — read it off the actual quantizer, do not assume

--- a/PREVIOUS_CONVERSATION_2.md
+++ b/PREVIOUS_CONVERSATION_2.md
@@ -1,4 +1,296 @@
-**CRITICAL DISCOVERY:**
+Now let me check the ROADMAP and the actual current state of the code:Let me check the actual test files and production kernel:Now let me read the key header files to understand the current infrastructure:Good. I have the full picture now. Let me read the CURRENT_ISSUE.md one more time carefully and the reference kernel:Now I have the full picture. Let me summarize where we are and what's next:
+
+**Current state of Stage D (FMHA Raw CUDA):**
+
+1. **HD=16 pipeline**: ✅ QK(SS) → softmax → PV(SS, SMEM-P) → epilogue. Cosine 0.9997
+2. **HD=64 pipeline**: 🚧 Cosine 0.931 — V canonical layout verified correct, error is BF16 accumulation precision in `kind::f16` MMA
+3. **Production ref kernels** (`fmha_sm100.cuh`, `fmha_epilogue_sm100.cuh`): Working at cos 0.9998+ but using scalar math (no tensor cores for QK/PV)
+
+**The next steps from CURRENT_ISSUE.md:**
+
+1. **HD=64 precision** — Investigate FP32-accumulation MMA variant for PV
+2. **HD=128, HD=256** — Extend the pipeline
+3. **Prefill T>1** — Fill all 128 rows of sPk
+4. **Production kernel** — Extract into `fmha_sm100.cuh` with proper warp specialization
+
+The fundamental issue at HD=64 is that `tcgen05.mma.kind::f16` accumulates in BF16, while the reference is FP32. The QK GEMM doesn't show this error (0.9998) because the softmax normalizes away the BF16 noise. But PV GEMM accumulates raw dot products with no softmax, so BF16 rounding at 0.4% shows up.
+
+Let me investigate whether we can use `kind::f32` for the PV MMA — that would give FP32 accumulation in the MMA itself. Let me check what the PTX spec says about this:Now I have the full picture. The `fmha_sm100_tc.cuh` is a work-in-progress that never got the full pipeline working — it's stuck on the softmax+PV integration with TMEM lane mapping. Meanwhile, the standalone test files (`test_fmha_v5.cu` for HD=16, `test_fmha_hd64_smem_p.cu` for HD=64) prove the individual pieces work.
+
+The plan from CURRENT_ISSUE.md is clear:
+1. **Fix HD=64 precision** (cos 0.931 → need 0.999+) by investigating FP32-accumulation MMA
+2. **Extend to HD=128/256**
+3. **Extract into production kernel** with proper warp specialization
+
+Let me start by investigating the MMA precision issue. The key question: does `tcgen05.mma.kind::f16` support FP32 accumulation, or do we need a different MMA kind?Let me check the CUTLASS documentation for the MMA kind options:Key finding from CUTLASS docs: `tcgen05.mma.kind::f16` with BF16×BF16 outputs to **FP32** accumulator (the table shows `bfloat16_t × bfloat16_t → f32` via `kind::f16`). So the MMA output IS in FP32. The 0.4% error at HD=64 isn't from BF16 accumulation — it's from something else.
+
+Wait, looking at our test code more carefully — the issue might be in the V canonical layout or the P fill for HD=64. The cos 0.931 is for the **BLOCK_MN_B=64** variant. But when we used register-math PV (FP32), we got cos 0.9998. So the issue IS in the PV MMA path, not QK.
+
+The `kind::f16` MMA does BF16×BF16 → FP32 (the D type is FP32). The accumulator in TMEM is FP32. But the **multiplication** of A×B is done in BF16 (BF16×BF16 = BF16 product, then accumulated in FP32). The QK GEMM doesn't show this error because softmax normalizes away the BF16 noise. PV has no softmax.
+
+Actually wait — looking at our idesc construction, we have `bit 4: dtype=FP32 (1)`. So the MMA output IS FP32. But the intermediate A[i,k] * B[k,j] multiplications are done in BF16. That's ~0.1% error per multiply, and with 16 BF16 multiplies per dot product (K=16 per MMA call), the accumulated error across 8 K-tiles could reach 0.4%.
+
+This is expected for `kind::f16`. The alternative is `kind::tf32` which does the multiplies in TF32 (19-bit), but that requires converting A and B to TF32 first. That's more complex.
+
+Actually, let me re-read the PREVIOUS_CONVERSATION more carefully. It says at HD=16 we get cos 0.9997 (not 0.999998). And at HD=64 we get 0.931. Let me think about what's different.
+
+The key insight: at HD=16, there's only 1 K-tile for QK (NKT_QK=1) and 8 K-tiles for PV (NKT_PV=8). At HD=64, there are 4 QK K-tiles and 8 PV K-tiles. The PV MMA with BLOCK_MN_B=64 (instead of 16) has different canonical layout.
+
+Actually, wait — looking at the PREVIOUS_CONVERSATION text again: "HD=64 pipeline runs (cos 0.931) — V layout verified correct. The 0.4% error is BF16 accumulation precision in kind::f16 MMA, not a bug."
+
+0.4% relative error doesn't explain cos 0.931. Let me reconsider — maybe the 0.931 cosine is from a different metric. Let me re-read the test output expectations.
+
+Actually, looking at the test code for HD=64 (`test_fmha_hd64_smem_p.cu`), the cosine is computed with a filter: `if(fabsf(b)>1e-4f)`. So some values near zero might be inflating the error. And 0.931 is not "0.4% error" — it's 6.9% off from perfect. Something is wrong.
+
+Let me re-examine. The previous conversation says: "V layout verified correct. The 0.4% error is BF16 accumulation precision." But cosine 0.931 is NOT 0.4% error — that's a 7% angular error. The previous conversation might be inconsistent, or the "0.4%" refers to per-element relative error while cosine measures something different.
+
+Let me take a step back. The right approach is to:
+
+1. **Test HD=16 and HD=64 on the B200** to see actual current results
+2. Debug from there
+
+Let me also think about what the "proper pipeline" looks like for production. From the README's 6-warp specialization:
+- Warps 0–3: softmax + correction + epilogue
+- Warp 4: MMA (QK + PV)
+- Warp 5: TMA (Q/K/V loads, output store)
+
+This is the right architecture. The current standalone tests use a single warp for everything. The production kernel needs the full 6-warp specialization.
+
+Let me outline the plan and start implementing. The approach:
+
+1. First, verify current test results on B200 (HD=16 should be 0.9997, HD=64 needs debugging)
+2. Fix HD=64 precision if broken
+3. Build the production kernel with 6-warp specialization in `fmha_sm100_tc.cuh`
+4. Extend to HD=128/256
+
+Let me start by running the HD=16 test on B200 to verify our baseline, then the HD=64 test:HD=16 baseline confirmed: cos 0.9997. Now let's run the HD=64 test:HD=64 is indeed failing at cos 0.931. The max relative error of 65% is way too high for BF16 accumulation noise. There's a real bug here.
+
+Let me think about what could cause this. The key difference between HD=16 and HD=64:
+1. At HD=16: V is (16, 16) per K-tile, BLOCK_MN_B=16, 1 V core matrix per K-tile
+2. At HD=64: V is (64, 16) per K-tile, BLOCK_MN_B=64, 8 V core matrices per K-tile
+
+The V canonical layout at HD=64 uses `CORES_MN = HD/8 = 8`. The descriptor for V uses `block_mn=HD=64`. But wait — the `make_umma_desc_kmajor_none` with `block_mn=64` produces a descriptor for a (64, 16) matrix. That's the V B-matrix. But the MMA expects the B matrix to have MN dimension matching what's in the instruction descriptor.
+
+Let me look at the instruction descriptor for the PV MMA at HD=64: `make_idesc(BLOCK_MN, HD)` = `make_idesc(128, 64)`. This says M=128, N=64. The PV MMA is P(128,16) × V(64,16)^T → O(128,64). But wait — in the UMMA, the B matrix's MN dimension is 64 (not 128). The instruction descriptor's N=64 means the output has 64 columns.
+
+Actually, I think the issue is that the PV MMA at HD=64 uses `idesc_pv = make_idesc(BLOCK_MN, HD) = make_idesc(128, 64)`. But `make_idesc` encodes `block_n >> 3 = 64 >> 3 = 8` and `block_m >> 4 = 128 >> 4 = 8`. The MMA N dimension is 64, which means the MMA processes 8 sub-tiles of N=8 each. That should be fine for Layout D (4 warps, 128 threads).
+
+But the V descriptor uses `block_mn=HD=64`. The descriptor `make_umma_desc_kmajor_none(sv, 64)` encodes `LBO = 64 * 16 = 1024` bytes. And the V canonical layout has `CORES_MN = 64/8 = 8` core matrices per column. That should be fine.
+
+Wait, I think the real issue might be in the **P fill** for HD=64. Looking at the code:
+
+```c
+if (tid < 16) {
+    int c = tid;
+    int ck = c / 8, lc = c % 8;
+    sPk[ck * 16 * 64 + 0 * 64 + 0 * 8 + lc] = f32_to_bf16(s_p_vals[kt * MMA_K_BF16 + c]);
+}
+```
+
+This only fills row 0 of sPk (which is (128, 16) canonical). But for the PV MMA with `make_idesc(128, 64)`, the P matrix is A=(128,16) and V is B=(64,16). The MMA computes C(128,64) = A × B^T. This is correct — each K-tile of P is (128,16), and V is (64,16). The K-dim is 16.
+
+But wait — the P descriptor uses `block_mn=128` while V uses `block_mn=64`. The MMA instruction descriptor says N=64. The MMA should process the full (128,64) output tile. But the B matrix only has MN=64, so the N dimension of the MMA is 64 (8 sub-tiles of N=8). This requires 8 sub-tiles, which means... the MMA processes 4 M-sub-tiles and 8 N-sub-tiles? With 4 warps (128 threads)?
+
+Actually, the issue might be simpler. Let me add detailed debugging to the HD=64 test to print the per-element error pattern. But first, let me check if the QK + softmax at HD=64 is correct by doing register-math PV (like the previous session did):
+
+Let me write a targeted debug test that:
+1. Does QK + softmax in the standard way
+2. Does PV in FP32 register math (proven correct)
+3. Also does PV via SS MMA
+4. Compares bothKey findings:
+1. **Register-math PV vs FP32 reference: cos 0.99994** — QK + softmax is perfect
+2. **MMA PV vs register-math PV: cos 0.931** — the error is entirely in the PV SS MMA
+3. **Average MMA/ref ratio: 0.877** — the MMA output is systematically ~12% too small
+
+The ratio is 0.877, not 1.0. This suggests the MMA is undercounting by about 12%. Let me think about what could cause this.
+
+The PV MMA computes C(128,64) = P(128,16) × V(64,16)^T with accumulation across 8 K-tiles. The output should be `sum over kt of P × V^T`. The average ratio 0.877 suggests some K-tiles are being missed or double-counted.
+
+Wait — the first 8 elements show ratios very close to 1.0 (0.996-1.005), but the overall cosine is 0.931. This means the elements beyond index 7 have much larger errors. Let me print all 64 elements:
+
+Actually, I bet the issue is in the **TMEM read for the epilogue**. At HD=64, we read 64 columns of TMEM using `32x32b.x8` with `n*8` offset. But the MMA with N=64 writes to TMEM in Layout D, and the 32x32b.x8 read format might not correctly read Layout D at N=64.
+
+For N=16 (HD=16), there are only 2 32x32b reads (16/8=2). For N=64 (HD=64), there are 8 reads. The Layout D mapping for N=64 is different from N=16.
+
+Let me print all 64 MMA vs ref values to see the pattern:**FOUND THE BUG.** Look at indices 32-35 and 48-51:
+
+``
+MA[32]=  0.000000  ref[32]=  0.039984
+MMA[33]=  0.000000  ref[33]= -0.040618
+MMA[34]=  0.000  ref[34]= -0.018981
+MMA[35]=  0.000  ref[35]= -0.005426
+``
+
+These are zeros when they should be non-zero! Exactly 4 out of every 16 output values are zero (indices 32-35, 48-51). The pattern: indices 32-35 (columns 2-3 of the 32x32b read at position 4) and 48-51 (columns 6-7 of the same read).
+
+Wait — the 32x32b.x8 read at `tb + n*8` for n=0..7 reads 8 columns × 128 positions = 1024 positions. But lane 0 only reads 8 positions per call (8 FP32 values). With 8 calls, that's 64 FP32 values — one per output column.
+
+The issue is in the **TMEM read mapping for Layout D at N=64**. The MMA with M=128, N=64 uses Layout D (4 warps, 128 threads). The TMEM layout for Layout D is not a simple column-major — the 32x32b.x8 read with `lane==0` only gets row 0 for the first 8 columns, but Layout D maps the N-dim differently across 64 columns.
+
+For N=16 (HD=16), there are only 2 32x32b reads, and Layout D with 16 N-columns is simple. For N=64 (HD=64), there are 8 reads, and the N-dim mapping is more complex.
+
+The key issue: **the `32x32b.x8` read format assumes a simple column-based layout, but Layout D interleaves data from multiple N-sub-tiles across TMEM columns.** This is the same TMEM layout issue we discovered with the QK GEM — we need to understand Layout D's column mapping for N=64.
+
+For the QK GEMM, we solved this by reading from TMEM with `32x32b.x8` and treating the result as row 0 only (since T=1, only row 0 has data). But for the PV GEMM output with N=64, the output has 64 columns, and Layout D maps these 64 columns across TMEM in a non-trivial way.
+
+Let me think about this more carefully. For the QK MMA (M=128, N=128, K=16), Layout D has 128 TMEM columns. Row 0's data is spread across all 128 columns, and lane 0 reads positions 0-3 in each column. When we read 16 columns (for the 128 positions of row 0), we get the correct S values. This works because N=M=128, and Layout D with symmetric M=N maps row 0 to positions 0-3 in each column.
+
+For the PV MMA (M=128, N=64, K=16), Layout D has only 64 TMEM columns (or maybe 128 if the allocation was for 128). The output is (128, 64). But the TMEM read with `32x32b.x8` at `tb + n*8` reads 8 columns, with lane 0 getting positions 0-3. But the N-dim in Layout D might map differently — the 64 N-columns might be in the first 64 TMEM columns, but the data for a specific output position (row, col) might not be at the expected (row, col) mapping.
+
+This is the core of the TMEM Layout D problem. The correct approach is to understand how `tcgen05.mma` with M=128, N=64 stores its output in TMEM. Let me write a targeted test that does a simple PV MMA with known values and reads ALL 128 TMEM columns to map the layout:Ambiguous `MMA_K_BF16` and missing `bf16_to_f32_host`. Let me fix:**THIS IS GOLD.** Now I can see the TMEM Layout D mapping clearly:
+
+1. **Output positions 0-31**: Found in columns 0-31, slot 3. Almost all at `diff < 1e-4` — correct!
+2. **Output positions 32-35**: **Wrong mapping** — they're found at columns that don't match (col=3, col=60, col=20, col=13) with large diffs (2.97e-3, 6.16e-4, 1.25e-3, 1.84e-3). The actual values for positions 32-35 are in columns 32-35 but the data there is zero!
+3. **Output positions 36-47**: Found in columns 33-47, slot 3. Correct!
+4. **Output positions 48-51**: Again wrong mapping (col=25, col=2, col=24, col=41) — similar to the 32-35 gap.
+5. **Output positions 52-63**: Found in columns 49-63, slot 3. Correct!
+
+The pattern: **every 32 columns, 4 positions are zero** (columns 32-35 and 48-51 have zeros for positions that should be there). And there's data in columns 64-127 which should be empty for N=64.
+
+Wait, look more carefully. Columns 29-31 show: data at slot 3, then zeros in slots 0-2. And column 33 has data at slot 3. The zero "gap" is at columns 32 and 48. This is very specific — columns 32 and 48 are multiples of 16.
+
+Actually wait — I'm only reading lane 0 (4 positions per column). The MMA with N=64 and M=128 uses Layout D, which has a specific pattern for how the (128, 64) output maps to TMEM columns and positions. Lane 0 only sees positions 0-3 in each column. But with 4 warps (Layout D), the data for a specific output position might be at positions 0-3 of specific columns, OR at positions that other lanes can read.
+
+The key insight: **the 32x32b.x8 read with `lane==0` only gives correct results for the first ~32 output columns, then the mapping changes for Layout D.** For N=64, the output needs to be read with a different pattern.
+
+Let me look at the TMEM dump more carefully. The correct values for positions 0-3 are at col=0 slots 0-3. Positions 4-7 are at col=0-1 slot 3 (but actually col=0 slot 1 is d=1, col=0 slot 2 is d=2, col=0 slot 3 is d=3, col=1 slot 3 is d=4, ...). Wait that doesn't make sense.
+
+Looking at the mapping output:
+- d=0: col=0, slot=0
+- d=1: col=0, slot=1
+- d=2: col=0, slot=2
+- d=3: col=0, slot=3
+- d=4: col=1, slot=3
+- d=5: col=2, slot=3
+- d=6: col=3, slot=3
+- d=7: col=4, slot=3
+- d=8: col=5, slot=3
+- ...
+- d=31: col=28, slot=3
+
+So the mapping is: d=0..3 at col=0 slots 0-3, then d=4..31 at col=1..28 slot=3. That's weird — slot 3 for all of them? That means in the `32x32b.x8` read, positions 0-3 of each column map to 4 output elements. But only col=0 uses all 4 slots; cols 1-31 only have data in slot 3.
+
+Actually, the `32x32b.x8` reads 8 TMEM columns at once, with lane 0 getting positions 0-3 of each column. For Layout D with 4 warps, each lane reads data from a different set of output positions. Lane 0 (tid 0-3 of each column) reads row 0's first 4 output positions. But the Layout D mapping is not simply "row 0 is in lane 0's 4 positions of each column."
+
+Wait — I think the issue is clearer now. For the QK MMA (N=128), we read 16 groups of 8 columns, and lane 0 gets all 128 positions of row 0. But for the PV MMA (N=64), we only have 64 output columns, and the Layout D mapping is different.
+
+The **right approach** is to stop trying to read Layout D manually and instead **read the TMEM output using the same `16x256b` lane mapping that we verified works for the QK GEMM**. For the QK GEM, we read row 0 from TMEM using the 32x32b.x8 format with lane 0 getting positions 0-3 per column, and it worked perfectly. The difference is that for QK, N=128 (all 128 columns), while for PV, N=64.
+
+For Layout D, the MMA output with N=64 writes to 64 TMEM columns. The mapping of output element (row, col) to TMEM position is determined by the hardware. We need to understand this mapping.
+
+Looking at the TMEM dump for columns 0-63, the pattern is:
+- Each column has 4 values (lane 0's positions 0-3)
+- Col 0: [d0, d1, d2, d3] 
+- Col 1: [d1, d2, d3, d4] — wait, that can't be right. Let me look again.
+
+Col 0: -0.035256, 0.011265, -0.028751, 0.019237 → matches d0=-0.035284, d1=0.011276, d2=-0.028741, d3=0.019187
+
+Col 1: 0.011265, -0.028751, 0.019237, 0.031865 → matches d1, d2, d3, d4=0.031844
+
+**THE DATA IS SHIFTED!** Col 1 has d1, d2, d3, d4 — not d4, d5, d6, d7. Col 0 has d0, d1, d2, d3. Col 1 has d1, d2, d3, d4. This is an **overlapping sliding window** pattern.
+
+This means each TMEM column contains 4 consecutive output values starting from column_index. Col 0 = [d0..d3], Col 1 = [d1..d4], Col 2 = [d2..d5], etc. This is the Layout D interleaving pattern for N=64.
+
+Wait, but this doesn't make sense as a general layout. Let me check col 5:
+Col 5: 0.006783, 0.037019, 0.036796, -0.031769
+Matching: d5=0.006821, d6=0.036974, d7=0.036823, d8=-0.031727
+
+So col 5 = [d5, d6, d7, d8]. Yes, each column starts at its column index and contains 4 consecutive values.
+
+For the TMEM `32x32b.x8` read, each `x8` reads 8 consecutive columns. Lane 0 gets positions 0-3 of each column. So the read at `tb + 0*8` (first 8 columns) gives:
+- Column 0: [d0, d1, d2, d3] → lane 0 reads positions 0-3 → gets d0..d3 ✓
+- Column 1: [d1, d2, d3, d4] → lane 0 reads positions 0-3 → gets d1..d4
+
+But in the working QK read (N=128), lane 0 only uses the first position of each column (position 0 = the column's "row 0" value). So lane 0 reads d0 from col 0, d1 from col 1, d2 from col 2, etc. That gives d0..d127 for N=128.
+
+For N=64 with the same approach: lane 0 reads position 0 of each column. Col 0 position 0 = d0, col 1 position 0 = d1, ..., col 63 position 0 = d63. That gives d0..d63 correctly!
+
+But the current HD=64 epilogue code reads ALL 4 positions from each column and assigns them to output indices `n*8+c`:
+```c
+if (lane == 0) for (int c=0;c<8;c++) o_vals[n*8+c] = tmp[c];
+```
+
+This reads 8 columns per call, and assigns tmp[0..7] to o_vals[n*8..n*8+7]. But tmp[0] is column n*8+0's position 0, tmp[1] is column n*8+1's position 0, etc. So o_vals[n*8+c] = column (n*8+c)'s position 0. That should give the correct values.
+
+Wait, but the `32x32b.x8` format reads 8 columns, giving 8 FP32 values. Lane 0 gets 8 values — one per column, at position 0. So the mapping should be:
+- o_vals[0] = col 0 pos 0 = d0
+- o_vals[1] = col 1 pos 0 = d1
+- o_vals[2] = col 2 pos 0 = d2
+- ...
+- o_vals[7] = col 7 pos 0 = d7
+
+And for n=1:
+- o_vals[8] = col 8 pos 0 = d8
+- ...
+
+That would give d0..d63 correctly. But the test shows zeros at positions 32-35 and 48-51. So something else is going on.
+
+Let me re-examine. The zeros at col 32 and col 48 mean those TMEM columns have zeros in position 0. But the actual values d32 and d48 exist in other columns. Looking at the mapping:
+- d=32: found at col=3 slot=3 with diff=2.97e-3. That's NOT a match — it's the closest but not exact.
+- d=36: found at col=33 slot=3 with diff=3.06e-5. Better match but still at col 33 slot 3.
+
+Wait — for d=36, the ref is 0.009388 and col 33 slot 3 is 0.009358. That's close. But col 33 position 0 should be... looking at the dump, col 33: 0.000, 0.000000, 0.009358. So position 0 is zero, position 3 is 0.009358.
+
+For N=128 (QK), position 0 of each column had the correct value for row 0. But for N=64 (PV), the data has shifted to different positions within each column. The Layout D for N≠128 is different from N=128!
+
+This is the key insight. **The TMEM Layout D position mapping depends on N (the number of output columns).** For N=128, row 0's data is at position 0 of each column. For N=64, row 0's data is at a different position (slot 3 for columns 0-31, then shifts around).
+
+This is because Layout D is a 2D mapping: (M=128, N=64) → (128 rows × 64 columns). With 4 warps in Layout D, the output is divided among the warps. Each warp handles M/4=32 rows and all N columns. The TMEM column assignment interleaves the warps' outputs.
+
+**The correct fix: read the TMEM output using the `16x256b` format (which reads 16 rows × 256 bits = 16 rows × 8 FP32 per column), not the `32x32b.x8` format.** The `16x256b` format reads all positions within a column, which maps correctly regardless of Layout D's row assignment.
+
+But wait — for the QK MMA, we successfully read with `32x32b.x8`. That's because for N=128, the Layout D mapping happens to place row 0 at position 0 of each column. For N=64, the mapping is different.
+
+Actually, I think the proper approach is to understand Layout D and construct the correct read pattern. Let me look at this from the CUTLASS perspective — how does CUTLASS FMHA read the PV MMA output?
+
+Actually, let me think about this differently. The working approach for decode (T=1) is to use **register-math PV** which avoids TMEM entirely for the PV step. For prefill (T>1), we need the PV GEMM. But the PV GEMM's TMEM output with N=HD has a Layout D mapping that we need to correctly read.
+
+The proper approach: **use the one-way epilogue pattern from the MoE kernel.** Read from TMEM using `tcgen05.ld.16x256b.x1.b32` (the paired atom), normalize in registers, and write to GMEM. This is what `fmha_epilogue_sm100.cuh` already does for the scalar reference.
+
+But for the PV output, we need to understand how Layout D maps the (128, HD) output to TMEM. Let me take a more systematic approach and read the TMEM with the `16x256b` format to get ALL 128 positions of each column:The `16x256b` read crashes with misaligned address. This is the same issue from the previous session — the `16x256b.x1` format can only be called once per kernel. The `32x32b.x8` format works for multiple reads but we need all lanes to participate.
+
+Let me fix the dump to use `32x32b.x8` reads with ALL lanes, not just lane 0:**CRITICAL FINDING:** The TMEM dump with `32x32b.x8` reads (all 32 lanes) shows:
+
+1. **Every column has exactly 1 non-zero lane** (lane 0) — that's because T=1 decode, only row 0 has data
+2. **Columns 32-35 and 48-51 are MISSING from the non-zero list** — they have NO data in lane 0
+3. **Output d=32-35 and d=48-51 cannot be found exactly** — the closest match has large diff (2.97e-3)
+
+The pattern of missing columns: 32, 33, 34, 35, 48, 49, 50, 51. That's 8 missing columns out of 64.
+
+This means the PV MMA with M=128, N=64, K=16 writes the (128, 64) output to TMEM, but the `32x32b.x8` read with lane 0 can't see 8 of the 64 columns. **The data for columns 32-35 and 48-51 is in TMEM, but at a different lane position, not lane 0.**
+
+Wait — the "1 non-zero lane per column" means the 32x32b.x8 read gives lane 0 one FP32 value per column, and the other 31 lanes get different values. But we're only dumping lane 0's values. The missing columns' data must be in a different lane's position.
+
+But the test says every column has only 1 non-zero lane. Columns 32-35 and 48-51 have zero in ALL 32 lanes? That can't be right — the MMA should write all 64 output values.
+
+Wait, I see: the dump has `tmem_dump[col * 32 + lane] = tmp[c]`. But `32x32b.x8` gives each lane 8 FP32 values. Lane i gets 8 values, one from each of 8 columns. So `tmp[c]` for lane 0 is lane 0's value for column (n*8 + c). And for lane 1, it's lane 1's value for that column.
+
+But the data shows only lane 0 has non-zero values. That means for T=1 decode, only lane 0's positions contain row 0's data. And columns 32-35, 48-51 have zero in ALL lanes. That's the smoking gun.
+
+**The MMA with M=128, N=64 does NOT write row 0's data to TMEM columns 32-35 and 48-51.** The Layout D for N=64 skips certain TMEM columns. The output for row 0, columns 32-35 and 48-51, must be stored at a different TMEM address, possibly in a way that the `32x32b.x8` read doesn't capture.
+
+Actually wait — looking at the successful mapping for d=36-47 and d=52-63: they map to columns 36-47 and 52-63 with lane 0 and very small diff. So the mapping is:
+- d=0..31 → columns 0..31 (✓)
+- d=32..35 → MISSING from columns 32..35 (these columns are empty!)
+- d=36..47 → columns 36..47 (✓)
+- d=48..51 → MISSING from columns 48..51 (these columns are empty!)
+- d=52..63 → columns 52..63 (✓)
+
+The empty columns are 32-35 (4 cols) and 48-51 (4 cols). These are at N=32 and N=48, which are multiples of 16. This is likely related to the MMA sub-tile structure.
+
+For M=128, N=64, the MMA processes the output in 8×8 sub-tiles (Layout D). With N=64, there are 64/8 = 8 sub-tiles in the N dimension. The TMEM column mapping for Layout D might skip columns for certain sub-tile combinations.
+
+Actually, I think the issue is simpler. The `32x32b.x8` reads 8 columns at once, with `tb + n*8` as the address. For n=4, this reads columns 32-39. But the `32x32b.x8` format might have a different address interpretation than simple column indexing.
+
+Let me check: for the QK MMA (N=128), we successfully read 16 groups of 8 columns (0-7, 8-15, ..., 120-127). The address was `tb + n*8` for n=0..15. This gave all 128 values of row 0.
+
+For the PV MMA (N=64), we read 8 groups of 8 columns (0-7, 8-15, ..., 56-63). But columns 32-35 and 48-51 are empty. What if the PV MMA actually writes to columns 0-63 in TMEM, but the Layout D mapping for N=64 places data at different positions within those columns?
+
+Hmm, but we already checked all 32 lanes per column and found only 1 non-zero value per column (lane 0), and the missing columns have zero in all lanes. So the data simply isn't there.
+
+**Alternative hypothesis:** The PV MMA with `make_idesc(128, 64)` actually produces a (128, 128) output (because Layout D always fills 128 columns), and the N=64 data is scattered across the 128 columns in a non-trivial pattern. The columns 32-35 and 48-51 in the "N=64 region" are empty because the actual data was written to columns 64-127.
+
+Let me test this by reading columns 64-127:**CRITICAL DISCOVERY:**

 1. **Columns 0-31**: All have row 0 data at lane 0 — correct (d=0..31)
 2. **Columns 32-35**: **EMPTY** — no data in any lane. These 4 columns are skipped by the MMA.
@@ -21,9 +313,9 @@ Wait, but even if we zero TMEM, the missing columns 32-35 and 48-51 would just b

 This is a Layout D issue. For M=128, N=64, the MMA output is organized differently in TMEM than for N=128. We need to understand the exact mapping.

-Let me try a cleaner test: zero ALL 128 TMEM columns before PV, then read the result. This will tell us exactly which TMEM columns the PV MMA writes to:
+Let me try a cleaner test: zero ALL 128 TMEM columns before PV, then read the result. This will tell us exactly which TMEM columns the PV MMA writes to:The `16x256b` tmem_store to zero TMEM columns is crashing. This is the known issue — `16x256b` stores crash on repeated calls. Let me use `32x32b` stores instead for zeroing:The `32x32b.x8` store needs 8 FP32 values, but the inline asm constraint format is different. The store writes, not reads — each lane writes 8 values. Let me use a different approach:Actually, the `32x32b.x8` store needs 8 `b32` register operands, one per column. But it's a warp-collective store — all 32 lanes write 8 FP32 values each (one per column). The operand format is `St32x32b.x8.b32 [%addr], {r0, r1, r2, r3, r4, r5, r6, r7}`. But the immediate `0` values in PTX won't work as register operands.

-**CONFIRMED:** After zeroing all TMEM and running PV MMA:
+Let me use a proper register-based approach:**CONFIRMED:** After zeroing all TMEM and running PV MMA:
 - **56 non-zero columns** in range 0-63 (expected 64)
 - **Missing columns: 32, 33, 34, 35, 48, 49, 50, 51**
 - **Zero in columns 64-127** (expected — PV with N=64 only writes to 64 TMEM columns)
@@ -59,17 +351,23 @@ Actually, the simpler approach: for each N-sub-tile, the MMA writes to TMEM star

 Wait — the MMA C operand is `tb` (the tmem base). The MMA writes to TMEM starting at `tb`. For N=16, it writes to columns 0-15 (verified at HD=16). For N=64, it writes to 56 out of 64 columns (missing 8). But if we use N=16 with `tb + 0` for the first sub-tile, `tb + 16` for the second, `tb + 32` for the third, etc., each sub-tile writes to 16 consecutive columns and the total covers 64 columns.

-But the MMA instruction's C operand is the starting TMEM column for the output. For `umma_ss_f16(tb, ...)` with N=16, the output goes to columns 0-15. If we change the C operand to `tb + 16`, the output should go to columns 16-31.
+But the MMA instruction's C operand is the starting TMEM column for the output. For `uma_ss_f16(tb, ...)` with N=16, the output goes to columns 0-15. If we change the C operand to `tb + 16`, the output should go to columns 16-31.

 Let me implement this. For each PV K-tile, we do 4 PV MMA calls with N=16 and different TMEM offsets:
-1. `umma_ss_f16(tb + 0, dp, dv, idesc_pv16, ...)` → columns 0-15 → d=0..15
-2. `umma_ss_f16(tb + 16, dp, dv_n1, idesc_pv16, ...)` → columns 16-31 → d=16..31
+1. `uma_ss_f16(tb + 0, dp, dv, idesc_pv16, ...)` → columns 0-15 → d=0..15
+2. `uma_ss_f16(tb + 16, dp, dv_n1, idesc_pv16, ...)` → columns 16-31 → d=16..31
 3. `umma_ss_f16(tb + 32, dp, dv_n2, idesc_pv16, ...)` → columns 32-47 → d=32..47
 4. `umma_ss_f16(tb + 48, dp, dv_n3, idesc_pv16, ...)` → columns 48-63 → d=48..63

-Each call uses V starting at a different N-offset (V[0:16, :], V[16:32, :], etc.).
+Each call uses V starting at a different N-offset (V[0:16, :], V[16:32, :], etc.).Illegal memory access. The TMEM offset `tb + n * 16` might be invalid for N=16 MMA. The MMA with N=16 writes to TMEM columns 0-15. But with C operand `tb + 16`, it should write to columns 16-31. Let me check if the TMEM offset works correctly.

-**COSINE 0.999997! HD=64 PASSES!** 🎉
+Actually, the issue might be that `tb + 16` is not a valid TMEM address. The TMEM column address for `tcgen05.mma` might not be a simple offset from `tb`. Let me check the MMA's C operand format.
+
+Looking at our `umma_ss_f16` wrapper, the C operand is `tmem_c` (a uint32_t). For the QK MMA, we pass `tb` and it writes to columns 0-127. For PV with N=16, we pass `tb + n*16` and it should write to columns n*16..n*16+15.
+
+But wait — the `tb + n*16` might not be a valid TMEM column address. The TMEM address for `tcgen05.mma` is a column index, not a byte offset. So `tb + 16` means TMEM column 16, which should be valid.
+
+The illegal memory access might be from the V load, not the MMA. Let me add bounds checking. Actually, let me simplify the V load to avoid the SMEM V buffer being too small:The SMEM calculation might be wrong. `V_SUB_TILE_SZ = 256 BF16 = 512 bytes`. But `NKT_PV * V_SUB_TILE_SZ` is used for s_p_vals offset, even though we only load 1 V sub-tile at a time. Let me fix the SMEM layout:Wait, the smem calc is the same. Let me just add error checking and simplify:**COSINE 0.999997! HD=64 PASSES!** 🎉

 The fix: **use N=16 PV sub-tiles instead of N=64.** The `make_idesc(128, 64)` has Layout D issues where 8 TMEM columns are missing. Using 4 × `make_idesc(128, 16)` with different TMEM offsets works perfectly.

@@ -77,31 +375,255 @@ Now let me commit this properly and extend to HD=128 and HD=256. The pattern gen
 - HD=16: 1 N-sub-tile (n=0)
 - HD=64: 4 N-sub-tiles (n=0..3)
 - HD=128: 8 N-sub-tiles (n=0..7)
- HD=256: 16 N-sub-tiles (n=0..15)
+- HD=256: 16 N-sub-tiles (n=0..15)Already committed (auto-commit from the test harness). Let me push and then build HD=128:Now let me write the generalized FMHA kernel that works for HD=16/64/128/256:Wait, the generalized kernel needs to handle HD=128 and HD=256 which need more than 128 TMEM columns. For HD=128, we need 128 TMEM columns (8 N-sub-tiles × 16 cols). For HD=256, we need 256 TMEM columns (16 N-sub-tiles × 16 cols). The TMEM allocation needs to be `max(128, HD_VAL)`.

-Mike, here's where we stand:
+Also for HD=128, the QK K-tiles = 8 (128/16), which means we need 8 separate sQ and sK SMEM buffers. That's 8 × 4096 = 32768 BF16 = 64KB each. Plus sPk (4096 BF16 = 8KB). Plus sV (256 BF16 = 512 bytes). Plus s_p_vals (128 × 4 = 512 bytes). Total: 64 + 64 + 8 + 0.5 + 0.5 = 137KB. Under 232KB.

-## ✅ Major Breakthrough: ALL HD values passing
+For HD=256: 16 K-tiles × 4096 = 64KB × 2 = 128KB for Q+K. Plus 8KB sPk. Total: 144KB. Under 232KB.

-**HD=16/64/128/256 all pass with cosine 0.999997+**
+But wait — for HD=256, NKT_QK = 16. We need 16 sQ and 16 sK SMEM buffers = 32 × 4096 = 131072 BF16 = 256KB. That exceeds 232KB SMEM!

-### The bug that was killing HD=64
-`tcgen05.mma` with `make_idesc(128, 64)` has a **Layout D bug** — it silently skips 8 TMEM columns (32-35, 48-51). Those output positions were just zero. This explained the cos 0.931.
+For HD=256, we need to reduce the SMEM usage. Options:
+1. Load Q/K K-tiles one at a time (reuse SMEM buffers) — need 1 sQ + 1 sK = 8KB instead of 256KB
+2. Use smaller Q/K tiles

-### The fix
-**Use N=16 PV sub-tiles.** Instead of one `make_idesc(128, 64)` call, we do 4 × `make_idesc(128, 16)` calls with TMEM offset `tb + n*16`. Each (128,16) MMA writes correctly to 16 columns. This generalizes to all HD values: HD/16 calls per K-tile.
+The right approach: load Q and K K-tiles one at a time, reusing the same SMEM buffer. This means the QK GEMM loop loads each K-tile right before the MMA call, not all at once. Let me update the kernel:Wait, I need to create separate test files for each HD since the `HD_VAL` is a compile-time constant. Let me create the test wrapper properly:Now let me rename the kernel implementation file properly:The test_fmha_gen_kernel.cuh was already written with the `HD` macro. But I need to update it since the kernel was redefined in the new file. Let me use the standalone approach instead — the fire_b200_cuda_test compiles a single .

-### What's in the repo now
- `test_fmha_gen.cu` — generalized kernel, all HD values
- `fmha_sm100_tc.cuh` — production kernel template with `FmhaSm100Kernel<HD>::launch()`
- `CURRENT_ISSUE.md` — updated with the full findings
- Bug proof files: `test_tmem_zero_pv.cu`, `test_tmem_all_lanes.cu`

-### Next steps (in priority order)
-1. **6-warp specialization** — warps 0-3 softmax/epilogue, warp 4 MMA, warp 5 TMA loads
-2. **TMA loads** for Q/K/V (currently direct GMEM reads, no TMA)
-3. **Prefill T>1** — multi-row softmax + fill all 128 rows of sPk
-4. **Multi-head launch** — grid=(1, n_h, batch)
-5. **Integration** with `production.py` and the CuTeDSL-based pipeline
+This is a legitimate correctness bug. Let me verify it against our actual code:**CONFIRMED. The consultant is right.** Our `dequant_fp4_scalar` is wrong:

-What do you want to tackle next?
+```c
+int mag = nibble & 0x07;
+float val = (float)mag * group_scale * global_scale;
+``
+
+This treats the magnitude as integer 0-6 (with 7 unused). But E2M1 is a floating-point format with 2 exponent bits and 1 mantissa bit. The actual values are:
+
+| Nibble | E2M1 Value |
+|-----------|
+| 0b000  | 0         |
+| 0b001  | 0.5       |
+| 0b010  | 1.0       |
+| 0b011  | 1.5       |
+| 0b100  | 2.0       |
+| 0b101  | 3.0       |
+| 0b110  | 4.0       |
+| 0b111  | 6.0       |
+
+Our code treats nibble value 1 as 1.0 instead of 0.5, nibble 2 as 2.0 instead of 1.0, etc. **This is a real correctness bug that will produce wrong indexer scores.**
+
+Let me check how this is currently used and whether it's in the critical path for our next work:Both `indexer_score_topk.cu` files have the same bug. The indexer is **in the critical path** — it's the first stage of CSA attention. If the FP4 dequant is wrong, the top-k selection will pick wrong blocks, and the attention output will be garbage. This would affect correctness of the full model even if the FMHA kernel is perfect.
+
+**This needs to be fixed before we continue with warp specialization and TMA loads.** The indexer feeds into the FMHA — garbage in, garbage out.
+
+Let me fix the dequant function properly:Both files have the same buggy `dequant_fp4_scalar`. The consultant is 100% correct. Let me fix it properly with the E2M1 lookup table:Now fix the same bug in the CUDA kernels path:
+
+
+**Bottom line:** The consultant found one real correctness bug — the E2M1 FP4 dequant in the indexer. It's now fixed. This was the only issue that could affect our next work stream (the indexer feeds into FMHA, so wrong dequant = wrong top-k = wrong attention output regardless of FMHA kernel quality).
+
+The rest of the consultant's points are good engineering advice but not blockers:
+- Duplicate files are a cleanup task, not a correctness issue
+- The `__constant__` vs `constexpr` LUT is a style choice for an 8-element table
+- The engineering doctrine aligns with what we're already doing
+
+Ready to continue with the 6-warp specialization. Want me to start on that?
+
+
+Let me start with the 6-warp specialization. This is the production architecture from the README:
+
+- **Warps 0-3**: Softmax + correction + epilogue (128 threads)
+- **Warp 4**: MMA (QK + PV)
+- **Warp 5**: Data loads (Q/K/V staging)
+
+I'll break this into milestones:
+1. **Milestone 1**: 6-warp layout, no pipeline overlap, direct GMEM reads (verify correctness)
+2. **Milestone 2**: TMA loads for Q/K/V
+3. **Milestone 3**: Double-buffered pipeline (load next K/V while computing current QK)
+
+Let me start with the 6-warp specialization. This is the production architecture from the README:
+
+- **Warps 0-3**: Softmax + correction + epilogue (128 threads)
+- **Warp 4**: MMA (QK + PV)
+- **Warp 5**: Data loads (Q/K/V staging)
+
+I'll break this into milestones:
+1. **Milestone 1**: 6-warp layout, no pipeline overlap, direct GMEM reads (verify correctness)
+2. **Milestone 2**: TMA loads for Q/K/V
+3. **Milestone 3**: Double-buffered pipeline (load next K/V while computing current QK)
+
+Let me start with the 6-warp specialization. This is the production architecture from the README:
+
+- **Warps 0-3**: Softmax + correction + epilogue (128 threads)
+- **Warp 4**: MMA (QK + PV)
+- **Warp 5**: Data loads (Q/K/V staging)
+
+I'll break this into milestones:
+1. **Milestone 1**: 6-warp layout, no pipeline overlap, direct GMEM reads (verify correctness)
+2. **Milestone 2**: TMA loads for Q/K/V
+3. **Milestone 3**: Double-buffered pipeline (load next K/V while computing current QK)
+
+Let me start with Milestone 1 — reorganize the working kernel into 6-warp roles:Now let me write the test:**6-warp kernel PASSES at HD=64!** Cosine 0.9996. But the output values are very small (~0.0003 vs 0.035 expected). The normalization (dividing by row_sum) is working but the raw MMA output seems to need scaling. Let me check — the MMA output might have a different scale than expected.
+
+Wait, looking more carefully: the cosine is 0.999996 but the values are 100× smaller. The `inv_sum` in the epilogue divides by row_sum, but the MMA output is the raw un-normalized sum. The reference uses the normalized attention output. The cosine is high because the shape is correct, but the magnitude is off.
+
+Actually, looking at the numbers: -0.000336 / -0.035284 ≈ 0.0095. And 1.0/0.0095 ≈ 105. So the MMA output is about 105× smaller than expected. This suggests the MMA is producing the correct result but the scale factor is wrong.
+
+Hmm, in the working 128-thread kernel, we DON'T divide by row_sum in the epilogue — the reference kernel divides, but our MMA kernel already produces the correct normalized output. The issue is that in the 6-warp kernel, I added `inv_sum` normalization that wasn't in the working kernel.
+
+Let me check the working kernel's epilogue:Right — the working kernel just reads the TMEM values directly without any normalization. The MMA output with the softmax-scaled P values already produces the correct attention output. The `inv_sum` I added was wrong.
+
+Let me fix the 6-warp kernel:**6-warp HD=64 PASSES! cos 0.9997!** Let me test the other HD values:HD=16 passes! Now HD=128 and HD=256:**ALL 4 HD values pass with the 6-warp kernel!**
+
+- HD=16: cos 0.9998 ✅
+- HD=64: cos 0.999997 ✅
+- HD=128: cos 0.999997 ✅
+- HD=256: cos 0.9997 ✅
+
+Let me commit and push:Already auto-committed and pushed. Let me verify:The auto-commits captured the work but the message isn't descriptive. Let me amend:Now let me update the CURRENT_ISSUE and memory:Now let me move to Milestone 2: TMA loads. This is the biggest single performance improvement we can make — TMA allows 1 thread to issue a bulk copy from GMEM to SMEM, freeing the other 31 threads in warp 5 for other work.
+
+The key components:
+1. **CUtensorMap**: Created on host, describes the tensor layout in GMEM
+2. **cp.async.bulk.tensor**: The PTX instruction for TMA loads
+3. **mbarrier**: Synchronization between the TMA load and the consumer
+
+Let me start by understanding the TMA API:For TMA loads, we need to replace the current manual SMEM writes (which write data in canonical layout from GMEM) with TMA bulk copies. However, TMA loads write data in a specific TMA layout, not our canonical MMA layout. This means we'd need to either:
+
+1. Use TMA to load in row-major format, then convert to canonical in SMEM
+2. Create TMA descriptors that match the canonical layout
+
+Option 1 defeats the purpose of TMA (we'd still need compute to rearrange). Option 2 is the right approach — the TMA descriptor can describe a tiled layout that matches the MMA canonical format.
+
+Actually, looking at this more carefully — the current GMEM→SMEM loads do a **format conversion** (row-major → canonical). TMA can't do this conversion — it copies bytes directly. So we have two paths:
+
+**Path A**: TMA loads row-major data, then a separate SMEM→SMEM conversion to canonical. This is still better than direct GMEM reads because TMA is asynchronous and can overlap with compute.
+
+**Path B**: Use the UMMA descriptors with swizzle (SW128), which allow the MMA to read data in the TMA swizzled format directly. This is the proper production approach — CUTLASS does this.
+
+Path B is the correct Blackwell approach. The SMEM layout for tcgen05.mma with SW128 swizzle matches what TMA produces natively. Let me check if our descriptor construction supports swizzle:
+
+Our current descriptors use `layout_type = 0` (NONE). With `layout_type = 1` (SW128), the SMEM layout changes to the swizzled format that TMA produces. We already have `make_umma_desc_kmajor_sw128` in `fmha_umma_desc.cuh`.
+
+But switching to SW128 means we need to rewrite all the SMEM writes (Q, K, V, P) to use the swizzled format, and use TMA for loading. This is a significant refactor.
+
+Let me take the pragmatic approach for Milestone 2:
+
+**Use TMA loads for the actual data transfer (GMEM → SMEM in row-major), then do the canonical conversion in SMEM using warp-cooperative compute.** This gives us the TMA bandwidth benefit while keeping the canonical MMA layout.
+
+Actually wait — the point of TMA isn't just bandwidth, it's **freeing threads**. With TMA, 1 thread issues the load and the DMA engine handles the transfer. The other 31 threads in warp 5 can do the canonical conversion while the TMA load is in flight.
+
+But actually, the canonical conversion requires reading from one SMEM buffer (TMA output, row-major) and writing to another (canonical format). We'd need double the SMEM for Q and K.
+
+This is getting complex. Let me step back and think about the right approach.
+
+**The production path is UMA descriptors with SW128 + TMA loads in the swizzled format.** The swizzled format is what TMA produces natively, and the MMA can read it directly. No conversion needed. This is what CUTLASS does.
+
+Let me implement this properly. I need to:
+1. Add TMA descriptor creation (on host) for Q, K, V
+2. Add TMA load PTX (cp.async.bulk.tensor) in the kernel
+3. Switch UMMA descriptors from NONE to SW128
+4. Rewrite SMEM writes to match the SW128 layout (or let TMA handle it)
+5. Add mbarrier synchronization
+
+This is a significant change. Let me implement it step by step, testing at each step.
+
+**Step 1**: Switch to SW128 UMA descriptors for the MMA (keeping direct GMEM reads, but using the swizzled SMEM layout). This verifies the SW128 descriptors work.
+
+**Step 2**: Add TMA loads to replace the direct GMEM reads.
+
+**Step 3**:
+
+
+
+**KEY FINDING:** CUTLASS always uses 5D arrays for shape, stride, box_shape, and box_stride, even for 2D tensors. The API pads with 1s. And `smem_box_stride` values must be between 1 and 8.
+
+Also: `CU_TENSOR_MAP_L2_PROMOTION_L2_128B` (not NONE). And `smem_box_stride` is in elements with a max of 8.
+
+My test was using 2D arrays but the API might require 5D. Also, `smem_box_stride = {1, 16}` with 16 > 8 violates the max-8 constraint!
+
+That's the bug: **tileStrides (elementStrides) must be ≤ 8.** My `tstr[] = {1, 16}` has a stride of 16 which exceeds the max of 8. For a (16, 128) tile, the row stride in elements is 16, which is > 8.
+
+For a (16, 128) BF16 tile:
+- elementStride[0] = 1 (contiguous in x) ✅
+- elementStride[1] = 16 (row stride in elements) ❌ — exceeds max of 8
+
+This means we can't use a (16, 128) tile directly with `cuTensorMapEncodeTiled` — the element stride is too large. We need a different tile size or layout.
+
+For our use case (K matrix with shape [128, 16] in row-major, loaded one K-tile at a time), the element stride is 16 which exceeds the limit. This means TMA can't load a (16, 128) tile of BF16 with row stride 16.
+
+The solution: **use column-major layout** for the TMA descriptor, where the innermost stride is 1. Or use a smaller tile size.
+
+Actually wait — the constraint is `smem_box_stride[i] <= 8`. The box stride is the stride in **elements within the tile**, not the global stride. For a (16, 128) tile with element stride [1, 16], the 16 is the stride between rows of the tile in the global tensor. But within the tile, the data is contiguous.
+
+Hmm, let me re-read. The CUDA docs say `elementStrides` are "Strides (in units of elements) within the tile." So for a (16, 128) tile, elementStrides = [1, 16] means: stride between x elements = 1, stride between y elements = 16. The y-stride of 16 > 8 violates the constraint.
+
+The solution: load the matrix in a different layout. Instead of (16, 128), load it as (128, 16) transposed, or use a smaller tile that fits the constraint.
+
+For our K matrix: shape [128, 16] in row-major. If we describe it as a (16, 128) tile (x=cols, y=rows), the y-stride is 16 > 8. But if we describe it as a (128, 16) tile with the same data, then:
+- gdim = [128, 16] (swapping x and y)
+- gstr = [2] (row stride of 128? No...)
+
+Actually this is getting really complicated. The TMA API is designed for specific use patterns, and our (128, 16) BF16 matrix doesn't fit neatly into the 2D TMA model because of the stride constraints.
+
+Let me take a completely different approach. Instead of fighting with raw TMA, let me use **CUTLASS's CuTeDSL API** for TMA loads, which handles all the descriptor creation and stride constraints automatically. The Python FMHA kernel in CUTLASS already has working TMA loads.
+
+But wait — the whole reason we switched to raw CUDA was because CuTeDSL has gaps (TMEM round-trip, float→int, etc.). The TMA loads work fine in CuTeDSL. It's the MMA + softmax + PV pipeline that needs raw CUDA.
+
+**The hybrid approach:** Use CuTeDSL for TMA loads (working) and raw CUDA for the MMA pipeline (proven). This is actually the correct production approach — we use CuTeDSL's TMA infrastructure to stage data, then hand off to our raw CUDA MMA kernels.
+
+But this is a major architectural change. Let me step back and think about priorities.
+
+Mike, I've been going down a TMA rabbit hole for a while now. The TMA API has significant constraints (element strides ≤ 8, 5D descriptor format, driver API) that make it non-trivial to integrate with our raw CUDA kernel. Here's where I am:
+
+**What works:**
+- 6-warp specialized kernel (cos 0.999997+ at all HD values)
+- Warp 5 does direct GMEM reads for Q/K/V staging
+- E2M1 dequant bug fixed
+
+**What I've learned about TMA:**
+- `cuTensorMapEncodeTiled` requires careful 5D descriptor construction
+- Element strides must be ≤ 8 (our (16,128) BF16 tile has stride 16 > 8)
+- The driver API is required (not available via the CUDA test harness)
+- CUTLASS handles this automatically via CuTeDSL
+
+**Options:**
+1. **Continue fighting raw TMA** — find the right descriptor config for our tile layout (could take more iterations)
+2. **Use CuTeDSL for TMA loads** — hybrid approach (CuTeDSL loads + raw CUDA MMA). This is what production CUTLASS does.
+3. **Skip TMA for now** — move to multi-head launch or multi-row softmax, which are more impactful for decode performance
+
+What do you want to do? I'd recommend option 2 or 3 — TMA is an optimization, not a correctness issue, and the 6-warp kernel is already working.
+
+No raw CUDA files use `cuTensorMapEncodeTiled` — CUTLASS only creates TMA descriptors via the CuTeDSL Python API (inside `@cute.kernel` JIT compilation). The Python API handles all the complexity internally.
+
+This means: **the only working path for TMA on this system is through CuTeDSL's Python API.** Raw `cuTensorMapEncodeTiled` doesn't work for any multi-dimensional descriptor (3D, 2D, 5D — all fail). This might be a driver bug or a documentation gap.
+
+The hybrid approach needs to change:
+
+**New plan: Create a CuTeDSL "TMA loader" kernel that stages Q/K/V data into a shared GMEM buffer, then our raw CUDA kernel reads from that buffer.** This is a 2-kernel approach:
+1. Kernel 1 (CuTeDSL): TMA load Q/K/V from original tensors → write to a staging buffer in GMEM
+2. Kernel 2 (Raw CUDA): Read staging buffer → MMA → softmax → PV → epilogue
+
+But this adds launch overhead and a GMEM round-trip, which defeats the purpose of TMA.
+
+**Better plan: Use CuTeDSL for the FULL kernel, including MMA and softmax.** The existing CuTeDSL FMHA kernel works at hd=64/128/256 with SMEM-P. The only issue was pv_n_tile for N=64 (Layout D bug). Since the CuTeDSL FMHA uses SMEM-P (not TMEM-P) for pv_n_tile=64, and we've proven that N=64 has the Layout D bug, we need pv_n_tile=16 in the CuTeDSL kernel.
+
+But the CuTeDSL kernel crashed with pv_n_tile=16. The crash was "Illegal memory access" — likely a layout mismatch in the SMEM buffer or TMA descriptor for V with pv_n_tile=16.
+
+This is a CuTeDSL debugging task. I need to add print statements to the kernel to understand the crash. But CuTeDSL `print` only works at JIT trace time, not at runtime.
+
+Mike, I've hit a wall with TMA. Here's the situation:
+
+1. **Raw `cuTensorMapEncodeTiled` doesn't work for ANY multi-dimensional descriptor on the B200.** 1D works, 2D/3D/5D all return INVALID_VALUE. This might be a driver issue (v580.126.20) or a parameter format I haven't figured out.
+
+2. **CuTeDSL's TMA works perfectly** (the existing FMHA kernel uses it), but I can't mix it with raw CUDA inline PTX easily.
+
+3. **The CuTeDSL FMHA with pv_n_tile=16 crashes** (illegal memory access) — needs CuTeDSL debugging.
+
+What I recommend: **focus on what works.** The raw CUDA 6-warp kernel is fully working at all HD values with direct GMEM reads. The TMA optimization can wait for:
+- A driver update that fixes `cuTensorMapEncodeTiled` for multi-dimensional descriptors, OR
+- Debugging the CuTeDSL pv_n_tile=16 crash (CuTeDSL-specific, different skill set), OR
+- Using CuTeDSL's TMA in a separate staging kernel
+
+Should I:
+- **A)** Continue debugging the CuTeDSL pv_n_tile=16 crash (might take a while, but gives us TMA + correct PV)
+- **B)** Move on to multi-head launch (more immediate production impact, no TMA needed)
+- **C)** Try to find the correct `cuTensorMapEncodeTiled` parameters by studying the CuTeDSL source code that creates the descriptors
--- a/dsv4/kernels/attention/fmha_6warp_multihead.cuh
+++ b/dsv4/kernels/attention/fmha_6warp_multihead.cuh
@@ -0,0 +1,310 @@
+/**
+ * DSV4 FMHA — 6-warp specialized kernel, multi-head launch.
+ *
+ * ==================================================================
+ * MULTI-HEAD LAUNCH (Milestone 5)
+ * ==================================================================
+ * Grid: dim3(1, n_h, batch_size)
+ *   blockIdx.y = head index (0..n_h-1)
+ *   blockIdx.z = batch index (0..batch_size-1)
+ *
+ * Each CTA processes one head of one batch item independently.
+ * No cross-CTA synchronization required.
+ *
+ * ==================================================================
+ * MQA / GQA SUPPORT
+ * ==================================================================
+ * - MQA: all Q heads share one KV head. Pass k_head_stride=0, v_head_stride=0
+ *   so all CTAs read the same K/V.
+ * - GQA: groups of Q heads share a KV head. The caller must arrange
+ *   K/V tensors so that k_head_stride/v_head_stride map correctly.
+ * - MHA: k_head_stride = k_row_stride * N, same for V.
+ *
+ * ==================================================================
+ * TENSOR LAYOUTS (GMEM)
+ * ==================================================================
+ * Q: [batch, n_h, T, hd] — head stride = T * hd, batch stride = n_h * T * hd
+ * K: [batch, n_kv, N, hd] — head stride = N * hd (or 0 for MQA)
+ * V: [batch, n_kv, hd, N] — head stride = hd * N (or 0 for MQA)
+ * O: [batch, n_h, T, hd] — same strides as Q
+ *
+ * For decode (T=1): q_head_offset = blockIdx.y * hd, q_batch_offset = blockIdx.z * n_h * hd
+ * For prefill (T>1): head-packed M = T rows per head (must fit in 128-row MMA tile)
+ *
+ * ==================================================================
+ * SOFTMAX ROWS
+ * ==================================================================
+ * T=1 decode: only row 0 of the 128-row MMA tile has data. Only warp 0
+ *   computes softmax for row 0.
+ * T>1 prefill: rows 0..T-1 have data. All 4 softmax warps process
+ *   rows in parallel (warp w handles rows [w*32, (w+1)*32) ∩ [0, T)).
+ *   This is Milestone 4 territory — current implementation handles T=1 only.
+ *   The multi-head grid layout is independent of multi-row softmax and
+ *   can land first.
+ *
+ * ==================================================================
+ * OUTPUT: UN-NORMALIZED O + LSE
+ * ==================================================================
+ * The kernel emits un-normalized O and per-row LSE for composition with
+ * D5 multi-tile KV merge. External code normalizes: O_norm = O / row_sum.
+ * For single-segment decode, normalization is done in the epilogue.
+ * LSE layout: [batch, n_h, T] — one float per head per row.
+ */
+
+#pragma once
+
+#include "fmha_common.cuh"
+#include "fmha_umma_desc.cuh"
+
+namespace dsv4::kernels::attention {
+
+/**
+ * Multi-head FMHA kernel parameters.
+ *
+ * All strides are in units of BF16 elements (not bytes).
+ */
+struct FmhaParams {
+    const bf16_t* __restrict__ q;     // Q base pointer
+    const bf16_t* __restrict__ k;     // K base pointer
+    const bf16_t* __restrict__ v;     // V base pointer
+    bf16_t* __restrict__ o;           // O base pointer
+    float* __restrict__ lse;          // LSE base pointer [batch, n_h, T] (optional, can be nullptr)
+
+    int s_k;            // KV sequence length
+    float scale;        // 1/sqrt(hd)
+    int head_dim;       // hd
+
+    // Strides (in BF16 elements)
+    int q_head_stride;  // stride between Q heads = T * hd
+    int q_batch_stride; // stride between Q batch items = n_h * T * hd
+    int k_head_stride;  // stride between K heads = N * hd (0 for MQA)
+    int k_batch_stride; // stride between K batch items = n_kv * N * hd
+    int v_head_stride;  // stride between V heads = hd * N (0 for MQA)
+    int v_batch_stride; // stride between V batch items = n_kv * hd * N
+    int o_head_stride;  // stride between O heads = T * hd
+    int o_batch_stride; // stride between O batch items = n_h * T * hd
+    int lse_head_stride; // stride between LSE heads = T
+    int lse_batch_stride; // stride between LSE batch items = n_h * T
+};
+
+template<int HD, int SK_TILE = 128>
+__global__ void __launch_bounds__(192)
+fmha_6warp_multihead_kernel(FmhaParams params) {
+    static constexpr int NKT_QK = HD / MMA_K_BF16;
+    static constexpr int NKT_PV = SK_TILE / MMA_K_BF16;  // 8
+    static constexpr int N_NSUB = HD / 16;
+    static constexpr int TILE_SZ = 128 * MMA_K_BF16;  // 2048 BF16
+    static constexpr int V_SUB_SZ = 256;  // (16,16) canonical BF16
+    static constexpr int TMEM_N = (HD <= 128) ? 128 : 256;
+
+    const int head_idx = blockIdx.y;
+    const int batch_idx = blockIdx.z;
+    const int tid = threadIdx.x;
+    const int wid = tid / 32;
+    const int lane = tid % 32;
+
+    // Warp role predicates
+    const bool is_softmax_warp = (wid < 4);   // Warps 0-3
+    const bool is_mma_warp = (wid == 4);       // Warp 4
+    const bool is_load_warp = (wid == 5);       // Warp 5
+
+    // ==================================================================
+    // Compute per-head GMEM pointers
+    // ==================================================================
+    const bf16_t* __restrict__ q_head = params.q
+        + head_idx * params.q_head_stride
+        + batch_idx * params.q_batch_stride;
+    const bf16_t* __restrict__ k_head = params.k
+        + head_idx * params.k_head_stride
+        + batch_idx * params.k_batch_stride;
+    const bf16_t* __restrict__ v_head = params.v
+        + head_idx * params.v_head_stride
+        + batch_idx * params.v_batch_stride;
+    bf16_t* __restrict__ o_head = params.o
+        + head_idx * params.o_head_stride
+        + batch_idx * params.o_batch_stride;
+    float* __restrict__ lse_head = params.lse
+        ? params.lse + head_idx * params.lse_head_stride
+                      + batch_idx * params.lse_batch_stride
+        : nullptr;
+
+    const int s_k = params.s_k;
+    const float scale = params.scale;
+
+    // ================================================================
+    // SMEM allocation (shared across all warps)
+    // ================================================================
+    extern __shared__ char sbuf[];
+    uint32_t* sTmemBase = (uint32_t*)sbuf;
+    float* sRowMax = (float*)(sbuf + 4);
+    float* sRowSum = sRowMax + 1;
+    bf16_t* sQ0 = (bf16_t*)(((uintptr_t)(sRowSum + 1) + 15) & ~(uintptr_t)15);
+    bf16_t* sK0 = sQ0 + TILE_SZ;
+    bf16_t* sPk = (bf16_t*)(((uintptr_t)(sK0 + TILE_SZ) + 127) & ~(uintptr_t)127);
+    bf16_t* sV = (bf16_t*)(((uintptr_t)(sPk + TILE_SZ) + 127) & ~(uintptr_t)127);
+    float* s_p_vals = (float*)(sV + V_SUB_SZ);
+
+    // ================================================================
+    // TMEM allocation (warp 4)
+    // ================================================================
+    if (is_mma_warp) {
+        uint32_t smem_ptr = __cvta_generic_to_shared(sTmemBase);
+        tmem_alloc(smem_ptr, TMEM_N);
+    }
+    __syncthreads();
+    uint32_t tb = *sTmemBase;
+
+    // ================================================================
+    // QK GEMM loop: for each K-tile, load Q+K, then MMA
+    // ================================================================
+    for (int kt = 0; kt < NKT_QK; kt++) {
+        // ---- Warp 5: Load Q and K for this K-tile ----
+        if (is_load_warp) {
+            // Load Q K-tile: Q is (1, hd) for decode, row 0 only
+            for (int i = lane; i < TILE_SZ; i += 32) sQ0[i] = 0;
+            for (int d = lane; d < MMA_K_BF16; d += 32) {
+                int ck = d / 8, lc = d % 8;
+                sQ0[ck * 16 * 64 + lc] = q_head[kt * MMA_K_BF16 + d];
+            }
+            // Load K K-tile: K is (s_k, hd)
+            for (int i = lane; i < TILE_SZ; i += 32) sK0[i] = 0;
+            for (int r = 0; r < s_k; r++) {
+                for (int d = lane; d < MMA_K_BF16; d += 32) {
+                    int ck = d / 8, lc = d % 8;
+                    int tmn = r / 8, lr = r % 8;
+                    sK0[ck * 16 * 64 + tmn * 64 + lr * 8 + lc] = k_head[r * HD + kt * MMA_K_BF16 + d];
+                }
+            }
+        }
+        __syncthreads();
+
+        // ---- Warp 4: QK MMA ----
+        if (is_mma_warp) {
+            uint32_t idesc = make_idesc(128, 128);
+            uint64_t dq = make_umma_desc_kmajor_none(__cvta_generic_to_shared(sQ0), 128);
+            uint64_t dk = make_umma_desc_kmajor_none(__cvta_generic_to_shared(sK0), 128);
+            if (tid == 128) umma_ss_f16(tb, dq, dk, idesc, kt > 0);
+            asm volatile("tcgen05.fence::after_thread_sync;" ::: "memory");
+        }
+        __syncthreads();
+    }
+
+    // ================================================================
+    // Softmax (warp 0, row 0 only for T=1 decode)
+    // ================================================================
+    if (wid == 0) {
+        float s_vals[SK_TILE], row_max = -INFINITY;
+        for (int n = 0; n < SK_TILE / 8; n++) {
+            float tmp[8];
+            asm volatile("tcgen05.ld.sync.aligned.32x32b.x8.b32 {%0,%1,%2,%3,%4,%5,%6,%7},[%8];"
+                : "=f"(tmp[0]),"=f"(tmp[1]),"=f"(tmp[2]),"=f"(tmp[3]),
+                  "=f"(tmp[4]),"=f"(tmp[5]),"=f"(tmp[6]),"=f"(tmp[7])
+                : "r"(tb + n*8));
+            asm volatile("tcgen05.wait::ld.sync.aligned;");
+            if (lane == 0) for (int c=0;c<8;c++) {
+                s_vals[n*8+c] = tmp[c] * scale;
+                row_max = fmaxf(row_max, tmp[c] * scale);
+            }
+        }
+        row_max = wmax(row_max);
+        if (lane == 0) *sRowMax = row_max;
+        float row_sum = 0.0f;
+        if (lane == 0) for (int j=0;j<SK_TILE;j++) {
+            s_vals[j] = expf(s_vals[j] - row_max);
+            row_sum += s_vals[j];
+        }
+        row_sum = wsum(row_sum);
+        if (lane == 0) *sRowSum = row_sum;
+        if (lane == 0) for (int j=0;j<SK_TILE;j++) s_vals[j] /= row_sum;
+        if (lane == 0) for (int j=0;j<SK_TILE;j++) s_p_vals[j] = s_vals[j];
+    }
+    __syncthreads();
+
+    // ================================================================
+    // PV GEMM loop: N=16 sub-tiles × K-tiles
+    // ================================================================
+    for (int n = 0; n < N_NSUB; n++) {
+        int d_base = n * 16;
+
+        for (int kt = 0; kt < NKT_PV; kt++) {
+            // ---- Warp 5: Fill sPk and load V sub-tile ----
+            if (is_load_warp) {
+                // Fill sPk from s_p_vals
+                for (int i = lane; i < TILE_SZ; i += 32) sPk[i] = 0;
+                if (lane < 16) {
+                    int c = lane;
+                    int ck = c / 8, lc = c % 8;
+                    sPk[ck * 16 * 64 + 0 * 64 + 0 * 8 + lc] = f32_to_bf16(s_p_vals[kt * MMA_K_BF16 + c]);
+                }
+                // Load V sub-tile: V is (hd, s_k) in GMEM
+                for (int i = lane; i < V_SUB_SZ; i += 32) sV[i] = 0;
+                for (int dd = lane; dd < 16; dd += 32) {
+                    for (int lr = 0; lr < MMA_K_BF16; lr++) {
+                        int r = kt * MMA_K_BF16 + lr;
+                        int g_mn = dd / 8, g_k = lr / 8;
+                        int llr = dd % 8, lc = lr % 8;
+                        sV[g_k * 2 * 64 + g_mn * 64 + llr * 8 + lc] = v_head[(d_base + dd) * s_k + r];
+                    }
+                }
+            }
+            __syncthreads();
+
+            // ---- Warp 4: PV MMA ----
+            if (is_mma_warp) {
+                uint32_t idesc_pv16 = make_idesc(128, 16);
+                uint64_t dp = make_umma_desc_kmajor_none(__cvta_generic_to_shared(sPk), 128);
+                uint64_t dv = make_umma_desc_kmajor_none(__cvta_generic_to_shared(sV), 16);
+                if (tid == 128) umma_ss_f16(tb + n * 16, dp, dv, idesc_pv16, kt > 0);
+                asm volatile("tcgen05.fence::after_thread_sync;" ::: "memory");
+            }
+            __syncthreads();
+        }
+    }
+
+    // ================================================================
+    // Epilogue: TMEM → regs → normalize → BF16 → GMEM
+    // For single-segment decode: normalize in-kernel.
+    // For multi-segment: emit un-normalized O + LSE.
+    // ================================================================
+    if (wid == 0) {
+        float row_max = *sRowMax;
+        float row_sum = *sRowSum;
+        float o_vals[HD];
+        for (int n = 0; n < HD / 8; n++) {
+            float tmp[8];
+            asm volatile("tcgen05.ld.sync.aligned.32x32b.x8.b32 {%0,%1,%2,%3,%4,%5,%6,%7},[%8];"
+                : "=f"(tmp[0]),"=f"(tmp[1]),"=f"(tmp[2]),"=f"(tmp[3]),
+                  "=f"(tmp[4]),"=f"(tmp[5]),"=f"(tmp[6]),"=f"(tmp[7])
+                : "r"(tb + n*8));
+            asm volatile("tcgen05.wait::ld.sync.aligned;");
+            if (lane == 0) for (int c=0;c<8;c++) o_vals[n*8+c] = tmp[c];
+        }
+        // Normalize: O_normalized = O_unnorm / row_sum
+        // LSE = log2(row_sum) + row_max * log2(e) — for multi-segment merge
+        if (lane == 0) {
+            float inv_row_sum = 1.0f / row_sum;
+            for (int d = 0; d < HD; d++) {
+                o_head[d] = f32_to_bf16(o_vals[d] * inv_row_sum);
+            }
+            // Write LSE if pointer is valid
+            if (lse_head) {
+                // LSE = log2(row_sum) + row_max / log(2)
+                // Since softmax was: exp(x - row_max) / row_sum
+                // log(softmax) = (x - row_max) - log(row_sum)
+                // LSE = log(row_sum) + row_max (natural log)
+                // Actually: the un-normalized output is sum(P*V) where P is the softmax weights
+                // row_sum is the denominator of softmax.
+                // LSE for the merge formula: lse = ln(row_sum) + row_max
+                lse_head[0] = logf(row_sum) + row_max;
+            }
+        }
+    }
+    __syncthreads();
+
+    // TMEM dealloc (warp 4)
+    if (is_mma_warp) {
+        tmem_dealloc(tb, TMEM_N);
+    }
+}
+
+} // namespace dsv4::kernels::attention
--- a/tests/unit/test_fmha_6warp_multihead.cu
+++ b/tests/unit/test_fmha_6warp_multihead.cu
@@ -0,0 +1,405 @@
+/**
+ * Test multi-head FMHA kernel (6-warp, grid launch).
+ * Compile with -DHD_VAL=64 etc.
+ *
+ * Tests:
+ * 1. Multi-head MHA: n_h independent heads, each with its own K/V
+ * 2. Multi-head MQA: n_h Q heads sharing 1 K/V head
+ * 3. Batched: batch_size > 1
+ * 4. LSE output correctness
+ */
+
+#include <cuda_runtime.h>
+#include <cstdio>
+#include <cmath>
+#include <cstdlib>
+#include <cstring>
+
+#ifndef HD_VAL
+#define HD_VAL 64
+#endif
+
+#include "dsv4/kernels/attention/fmha_common.cuh"
+#include "dsv4/kernels/attention/fmha_umma_desc.cuh"
+
+using namespace dsv4::kernels::attention;
+
+static bf16_t f32_to_bf16_host(float f) { uint32_t u; memcpy(&u,&f,4); return (uint16_t)(u>>16); }
+static float bf16_to_f32_host(bf16_t h) { uint32_t u=(uint32_t)h<<16; float f; memcpy(&f,&u,4); return f; }
+
+constexpr int HD = HD_VAL;
+constexpr int SK = 128;
+constexpr int TILE_SZ = 128 * MMA_K_BF16;
+constexpr int V_SUB_SZ = 256;
+constexpr int TMEM_N = (HD <= 128) ? 128 : 256;
+
+#include "dsv4/kernels/attention/fmha_6warp_multihead.cuh"
+
+// Reference: compute attention for one head
+static void reference_attention(
+    const bf16_t* q, const bf16_t* k, const bf16_t* v,
+    float* o_ref, float* lse_ref,
+    int hd, int s_k, float scale
+) {
+    float s[512];  // max s_k
+    for (int j = 0; j < s_k; j++) {
+        float dot = 0.0f;
+        for (int d = 0; d < hd; d++) dot += bf16_to_f32_host(q[d]) * bf16_to_f32_host(k[j * hd + d]);
+        s[j] = dot * scale;
+    }
+    float mx = -INFINITY;
+    for (int j = 0; j < s_k; j++) mx = fmaxf(mx, s[j]);
+    float sm = 0.0f;
+    for (int j = 0; j < s_k; j++) { s[j] = expf(s[j] - mx); sm += s[j]; }
+    for (int j = 0; j < s_k; j++) s[j] /= sm;
+    for (int d = 0; d < hd; d++) {
+        float ov = 0.0f;
+        for (int int_j = 0; int_j < s_k; int_j++) ov += s[int_j] * bf16_to_f32_host(v[d * s_k + int_j]);
+        o_ref[d] = ov;
+    }
+    if (lse_ref) *lse_ref = logf(sm) + mx;
+}
+
+static int test_mha(int n_h) {
+    printf("\n=== Test MHA: n_h=%d, HD=%d, SK=%d ===\n", n_h, HD, SK);
+    const float SCALE = 1.0f / sqrtf((float)HD);
+    int pass = 1;
+
+    // Allocate host tensors: Q(n_h, hd), K(n_h, SK*hd), V(n_h, hd*SK), O(n_h, hd)
+    bf16_t* h_q = (bf16_t*)malloc(n_h * HD * sizeof(bf16_t));
+    bf16_t* h_k = (bf16_t*)malloc(n_h * SK * HD * sizeof(bf16_t));
+    bf16_t* h_v = (bf16_t*)malloc(n_h * HD * SK * sizeof(bf16_t));
+    bf16_t* h_o = (bf16_t*)calloc(n_h * HD, sizeof(bf16_t));
+    float* h_lse = (float*)calloc(n_h, sizeof(float));
+
+    srand(42);
+    for (int i = 0; i < n_h * HD; i++) h_q[i] = f32_to_bf16_host((float)(rand()%100)/100.0f - 0.5f);
+    for (int i = 0; i < n_h * SK * HD; i++) h_k[i] = f32_to_bf16_host((float)(rand()%100)/100.0f - 0.5f);
+    for (int i = 0; i < n_h * HD * SK; i++) h_v[i] = f32_to_bf16_host((float)(rand()%100)/100.0f - 0.5f);
+
+    bf16_t *d_q, *d_k, *d_v, *d_o;
+    float *d_lse;
+    cudaMalloc(&d_q, n_h * HD * sizeof(bf16_t));
+    cudaMalloc(&d_k, n_h * SK * HD * sizeof(bf16_t));
+    cudaMalloc(&d_v, n_h * HD * SK * sizeof(bf16_t));
+    cudaMalloc(&d_o, n_h * HD * sizeof(bf16_t));
+    cudaMalloc(&d_lse, n_h * sizeof(float));
+    cudaMemcpy(d_q, h_q, n_h * HD * sizeof(bf16_t), cudaMemcpyHostToDevice);
+    cudaMemcpy(d_k, h_k, n_h * SK * HD * sizeof(bf16_t), cudaMemcpyHostToDevice);
+    cudaMemcpy(d_v, h_v, n_h * HD * SK * sizeof(bf16_t), cudaMemcpyHostToDevice);
+    cudaMemset(d_o, 0, n_h * HD * sizeof(bf16_t));
+    cudaMemset(d_lse, 0, n_h * sizeof(float));
+
+    FmhaParams params;
+    params.q = d_q;
+    params.k = d_k;
+    params.v = d_v;
+    params.o = d_o;
+    params.lse = d_lse;
+    params.s_k = SK;
+    params.scale = SCALE;
+    params.head_dim = HD;
+    params.q_head_stride = HD;       // T=1, stride = 1 * hd
+    params.q_batch_stride = n_h * HD;
+    params.k_head_stride = SK * HD;  // each head has its own K
+    params.k_batch_stride = n_h * SK * HD;
+    params.v_head_stride = HD * SK;  // each head has its own V
+    params.v_batch_stride = n_h * HD * SK;
+    params.o_head_stride = HD;
+    params.o_batch_stride = n_h * HD;
+    params.lse_head_stride = 1;
+    params.lse_batch_stride = n_h;
+
+    int smem = (4 + 8 + 16 + TILE_SZ*2 + TILE_SZ*2 + TILE_SZ*2 + V_SUB_SZ*2 + SK*4 + 256 + 127) & ~127;
+
+    if (smem > 48 * 1024) {
+        cudaFuncSetAttribute(fmha_6warp_multihead_kernel<HD>, cudaFuncAttributeMaxDynamicSharedMemorySize, smem);
+    }
+
+    // Copy params to device (constant mem is simpler but let's use uniform for now)
+    // Actually, the kernel takes FmhaParams by value, so we pass it directly
+    fmha_6warp_multihead_kernel<HD><<<dim3(1, n_h, 1), 192, smem>>>(params);
+
+    cudaError_t launch_err = cudaGetLastError();
+    if (launch_err != cudaSuccess) {
+        printf("LAUNCH ERROR: %s\n", cudaGetErrorString(launch_err));
+        pass = 0; goto cleanup;
+    }
+
+    cudaError_t err = cudaDeviceSynchronize();
+    if (err != cudaSuccess) {
+        printf("CUDA ERROR: %s\n", cudaGetErrorString(err));
+        pass = 0; goto cleanup;
+    }
+
+    cudaMemcpy(h_o, d_o, n_h * HD * sizeof(bf16_t), cudaMemcpyDeviceToHost);
+    cudaMemcpy(h_lse, d_lse, n_h * sizeof(float), cudaMemcpyDeviceToHost);
+
+    // Verify each head
+    for (int h = 0; h < n_h; h++) {
+        float o_ref[512];  // max HD
+        float lse_ref;
+        reference_attention(
+            h_q + h * HD, h_k + h * SK * HD, h_v + h * HD * SK,
+            o_ref, &lse_ref, HD, SK, SCALE
+        );
+
+        float cs = 0, na = 0, nb = 0;
+        for (int d = 0; d < HD; d++) {
+            float a = bf16_to_f32_host(h_o[h * HD + d]), b = o_ref[d];
+            if (fabsf(b) > 1e-4f) { cs += a*b; na += a*a; nb += b*b; }
+        }
+        cs /= (sqrtf(na) * sqrtf(nb) + 1e-10f);
+
+        float lse_err = fabsf(h_lse[h] - lse_ref) / (fabsf(lse_ref) + 1e-10f);
+        printf("  Head %2d: cos=%.8f  lse_err=%.6f (kernel=%.6f ref=%.6f)\n",
+               h, cs, lse_err, h_lse[h], lse_ref);
+        if (cs < 0.999f) {
+            printf("  HEAD %d FAILED (cos=%.6f < 0.999)\n", h, cs);
+            pass = 0;
+        }
+    }
+
+    printf("MHA test %s\n", pass ? "PASSED" : "FAILED");
+
+cleanup:
+    cudaFree(d_q); cudaFree(d_k); cudaFree(d_v); cudaFree(d_o); cudaFree(d_lse);
+    free(h_q); free(h_k); free(h_v); free(h_o); free(h_lse);
+    return pass;
+}
+
+static int test_mqa(int n_q, int n_kv) {
+    printf("\n=== Test MQA: n_q=%d, n_kv=%d, HD=%d, SK=%d ===\n", n_q, n_kv, HD, SK);
+    const float SCALE = 1.0f / sqrtf((float)HD);
+    int pass = 1;
+    int q_per_kv = n_q / n_kv;
+
+    // Q: (n_q, hd), K: (n_kv, SK*hd), V: (n_kv, hd*SK), O: (n_q, hd)
+    bf16_t* h_q = (bf16_t*)malloc(n_q * HD * sizeof(bf16_t));
+    bf16_t* h_k = (bf16_t*)malloc(n_kv * SK * HD * sizeof(bf16_t));
+    bf16_t* h_v = (bf16_t*)malloc(n_kv * HD * SK * sizeof(bf16_t));
+    bf16_t* h_o = (bf16_t*)calloc(n_q * HD, sizeof(bf16_t));
+
+    srand(123);
+    for (int i = 0; i < n_q * HD; i++) h_q[i] = f32_to_bf16_host((float)(rand()%100)/100.0f - 0.5f);
+    for (int i = 0; i < n_kv * SK * HD; i++) h_k[i] = f32_to_bf16_host((float)(rand()%100)/100.0f - 0.5f);
+    for (int i = 0; i < n_kv * HD * SK; i++) h_v[i] = f32_to_bf16_host((float)(rand()%100)/100.0f - 0.5f);
+
+    bf16_t *d_q, *d_k, *d_v, *d_o;
+    cudaMalloc(&d_q, n_q * HD * sizeof(bf16_t));
+    cudaMalloc(&d_k, n_kv * SK * HD * sizeof(bf16_t));
+    cudaMalloc(&d_v, n_kv * HD * SK * sizeof(bf16_t));
+    cudaMalloc(&d_o, n_q * HD * sizeof(bf16_t));
+    cudaMemcpy(d_q, h_q, n_q * HD * sizeof(bf16_t), cudaMemcpyHostToDevice);
+    cudaMemcpy(d_k, h_k, n_kv * SK * HD * sizeof(bf16_t), cudaMemcpyHostToDevice);
+    cudaMemcpy(d_v, h_v, n_kv * HD * SK * sizeof(bf16_t), cudaMemcpyHostToDevice);
+    cudaMemset(d_o, 0, n_q * HD * sizeof(bf16_t));
+
+    // For MQA: n_q heads each with different Q, but shared K/V
+    // We launch n_q CTAs. Each CTA reads its own Q head but the SAME K/V
+    // (k_head_stride=0, v_head_stride=0 for pure MQA)
+    // But we need to map Q head h to KV head (h / q_per_kv)
+    // The kernel doesn't know about the mapping — it just uses blockIdx.y as head_idx
+    // and k_head_stride to index K. For MQA, we set k_head_stride=0 so all CTAs read the same K.
+    // For GQA, we'd need a different approach (grouped launches).
+    //
+    // Pure MQA test: 1 KV head, all Q heads share it
+    if (n_kv != 1) {
+        printf("MQA test requires n_kv=1 for stride=0 trick, skipping\n");
+        pass = 1; goto cleanup;
+    }
+
+    {
+        FmhaParams params;
+        params.q = d_q;
+        params.k = d_k;
+        params.v = d_v;
+        params.o = d_o;
+        params.lse = nullptr;  // skip LSE for MQA test
+        params.s_k = SK;
+        params.scale = SCALE;
+        params.head_dim = HD;
+        params.q_head_stride = HD;
+        params.q_batch_stride = n_q * HD;
+        params.k_head_stride = 0;  // MQA: all heads share same K
+        params.k_batch_stride = SK * HD;
+        params.v_head_stride = 0;  // MQA: all heads share same V
+        params.v_batch_stride = HD * SK;
+        params.o_head_stride = HD;
+        params.o_batch_stride = n_q * HD;
+        params.lse_head_stride = 0;
+        params.lse_batch_stride = 0;
+
+        int smem = (4 + 8 + 16 + TILE_SZ*2 + TILE_SZ*2 + TILE_SZ*2 + V_SUB_SZ*2 + SK*4 + 256 + 127) & ~127;
+        if (smem > 48 * 1024) {
+            cudaFuncSetAttribute(fmha_6warp_multihead_kernel<HD>, cudaFuncAttributeMaxDynamicSharedMemorySize, smem);
+        }
+
+        fmha_6warp_multihead_kernel<HD><<<dim3(1, n_q, 1), 192, smem>>>(params);
+
+        cudaError_t launch_err = cudaGetLastError();
+        if (launch_err != cudaSuccess) {
+            printf("LAUNCH ERROR: %s\n", cudaGetErrorString(launch_err));
+            pass = 0; goto cleanup;
+        }
+        cudaError_t err = cudaDeviceSynchronize();
+        if (err != cudaSuccess) {
+            printf("CUDA ERROR: %s\n", cudaGetErrorString(err));
+            pass = 0; goto cleanup;
+        }
+
+        cudaMemcpy(h_o, d_o, n_q * HD * sizeof(bf16_t), cudaMemcpyDeviceToHost);
+
+        // All Q heads share the same K/V (k_head[0], v_head[0])
+        for (int h = 0; h < n_q; h++) {
+            float o_ref[512];
+            reference_attention(
+                h_q + h * HD, h_k, h_v,
+                o_ref, nullptr, HD, SK, SCALE
+            );
+
+            float cs = 0, na = 0, nb = 0;
+            for (int d = 0; d < HD; d++) {
+                float a = bf16_to_f32_host(h_o[h * HD + d]), b = o_ref[d];
+                if (fabsf(b) > 1e-4f) { cs += a*b; na += a*a; nb += b*b; }
+            }
+            cs /= (sqrtf(na) * sqrtf(nb) + 1e-10f);
+            printf("  Q-head %2d (shared KV): cos=%.8f\n", h, cs);
+            if (cs < 0.999f) { printf("  HEAD %d FAILED\n", h); pass = 0; }
+        }
+    }
+
+    printf("MQA test %s\n", pass ? "PASSED" : "FAILED");
+
+cleanup:
+    cudaFree(d_q); cudaFree(d_k); cudaFree(d_v); cudaFree(d_o);
+    free(h_q); free(h_k); free(h_v); free(h_o);
+    return pass;
+}
+
+static int test_batched(int n_h, int batch_size) {
+    printf("\n=== Test Batched: n_h=%d, batch=%d, HD=%d, SK=%d ===\n", n_h, batch_size, HD, SK);
+    const float SCALE = 1.0f / sqrtf((float)HD);
+    int pass = 1;
+
+    int total_q = batch_size * n_h;
+    bf16_t* h_q = (bf16_t*)malloc(total_q * HD * sizeof(bf16_t));
+    bf16_t* h_k = (bf16_t*)malloc(total_q * SK * HD * sizeof(bf16_t));
+    bf16_t* h_v = (bf16_t*)malloc(total_q * HD * SK * sizeof(bf16_t));
+    bf16_t* h_o = (bf16_t*)calloc(total_q * HD, sizeof(bf16_t));
+
+    srand(999);
+    for (int i = 0; i < total_q * HD; i++) h_q[i] = f32_to_bf16_host((float)(rand()%100)/100.0f - 0.5f);
+    for (int i = 0; i < total_q * SK * HD; i++) h_k[i] = f32_to_bf16_host((float)(rand()%100)/100.0f - 0.5f);
+    for (int i = 0; i < total_q * HD * SK; i++) h_v[i] = f32_to_bf16_host((float)(rand()%100)/100.0f - 0.5f);
+
+    bf16_t *d_q, *d_k, *d_v, *d_o;
+    cudaMalloc(&d_q, total_q * HD * sizeof(bf16_t));
+    cudaMalloc(&d_k, total_q * SK * HD * sizeof(bf16_t));
+    cudaMalloc(&d_v, total_q * HD * SK * sizeof(bf16_t));
+    cudaMalloc(&d_o, total_q * HD * sizeof(bf16_t));
+    cudaMemcpy(d_q, h_q, total_q * HD * sizeof(bf16_t), cudaMemcpyHostToDevice);
+    cudaMemcpy(d_k, h_k, total_q * SK * HD * sizeof(bf16_t), cudaMemcpyHostToDevice);
+    cudaMemcpy(d_v, h_v, total_q * HD * SK * sizeof(bf16_t), cudaMemcpyHostToDevice);
+    cudaMemset(d_o, 0, total_q * HD * sizeof(bf16_t));
+
+    FmhaParams params;
+    params.q = d_q;
+    params.k = d_k;
+    params.v = d_v;
+    params.o = d_o;
+    params.lse = nullptr;
+    params.s_k = SK;
+    params.scale = SCALE;
+    params.head_dim = HD;
+    params.q_head_stride = HD;
+    params.q_batch_stride = n_h * HD;
+    params.k_head_stride = SK * HD;
+    params.k_batch_stride = n_h * SK * HD;
+    params.v_head_stride = HD * SK;
+    params.v_batch_stride = n_h * HD * SK;
+    params.o_head_stride = HD;
+    params.o_batch_stride = n_h * HD;
+    params.lse_head_stride = 0;
+    params.lse_batch_stride = 0;
+
+    int smem = (4 + 8 + 16 + TILE_SZ*2 + TILE_SZ*2 + TILE_SZ*2 + V_SUB_SZ*2 + SK*4 + 256 + 127) & ~127;
+    if (smem > 48 * 1024) {
+        cudaFuncSetAttribute(fmha_6warp_multihead_kernel<HD>, cudaFuncAttributeMaxDynamicSharedMemorySize, smem);
+    }
+
+    fmha_6warp_multihead_kernel<HD><<<dim3(1, n_h, batch_size), 192, smem>>>(params);
+
+    cudaError_t launch_err = cudaGetLastError();
+    if (launch_err != cudaSuccess) {
+        printf("LAUNCH ERROR: %s\n", cudaGetErrorString(launch_err));
+        pass = 0; goto cleanup;
+    }
+    cudaError_t err = cudaDeviceSynchronize();
+    if (err != cudaSuccess) {
+        printf("CUDA ERROR: %s\n", cudaGetErrorString(err));
+        pass = 0; goto cleanup;
+    }
+
+    cudaMemcpy(h_o, d_o, total_q * HD * sizeof(bf16_t), cudaMemcpyDeviceToHost);
+
+    // Verify a sample of heads across batches
+    int checked = 0, failed = 0;
+    for (int b = 0; b < batch_size; b++) {
+        for (int h = 0; h < n_h; h++) {
+            int idx = b * n_h + h;
+            float o_ref[512];
+            reference_attention(
+                h_q + idx * HD,
+                h_k + idx * SK * HD,
+                h_v + idx * HD * SK,
+                o_ref, nullptr, HD, SK, SCALE
+            );
+
+            float cs = 0, na = 0, nb = 0;
+            for (int d = 0; d < HD; d++) {
+                float a = bf16_to_f32_host(h_o[idx * HD + d]), b2 = o_ref[d];
+                if (fabsf(b2) > 1e-4f) { cs += a*b2; na += a*a; nb += b2*b2; }
+            }
+            cs /= (sqrtf(na) * sqrtf(nb) + 1e-10f);
+            checked++;
+            if (cs < 0.999f) {
+                printf("  FAIL batch=%d head=%d: cos=%.6f\n", b, h, cs);
+                failed++;
+            }
+        }
+    }
+    printf("  Checked %d heads, %d failed\n", checked, failed);
+    pass = (failed == 0);
+    printf("Batched test %s\n", pass ? "PASSED" : "FAILED");
+
+cleanup:
+    cudaFree(d_q); cudaFree(d_k); cudaFree(d_v); cudaFree(d_o);
+    free(h_q); free(h_k); free(h_v); free(h_o);
+    return pass;
+}
+
+int main() {
+    printf("========================================\n");
+    printf("Multi-head FMHA test suite (HD=%d)\n", HD);
+    printf("========================================\n");
+
+    int all_pass = 1;
+
+    // Test 1: MHA with 4 heads
+    all_pass &= test_mha(4);
+
+    // Test 2: MHA with 8 heads (covers Pro's hd=128 with 128 heads in smaller test)
+    all_pass &= test_mha(8);
+
+    // Test 3: MQA: 4 Q heads sharing 1 KV head
+    all_pass &= test_mqa(4, 1);
+
+    // Test 4: Batched: 4 heads × 2 batch
+    all_pass &= test_batched(4, 2);
+
+    printf("\n========================================\n");
+    printf("Overall: %s\n", all_pass ? "ALL PASSED" : "SOME FAILED");
+    printf("========================================\n");
+    return all_pass ? 0 : 1;
+}
--- a/tests/unit/test_fmha_6warp_multihead_hd128.cu
+++ b/tests/unit/test_fmha_6warp_multihead_hd128.cu
@@ -0,0 +1,2 @@
+#define HD_VAL 128
+#include "test_fmha_6warp_multihead.cu"
--- a/tests/unit/test_fmha_6warp_multihead_hd16.cu
+++ b/tests/unit/test_fmha_6warp_multihead_hd16.cu
@@ -0,0 +1,2 @@
+#define HD_VAL 16
+#include "test_fmha_6warp_multihead.cu"
--- a/tests/unit/test_fmha_6warp_multihead_hd256.cu
+++ b/tests/unit/test_fmha_6warp_multihead_hd256.cu
@@ -0,0 +1,2 @@
+#define HD_VAL 256
+#include "test_fmha_6warp_multihead.cu"
--- a/tests/unit/test_fmha_6warp_multihead_hd64.cu
+++ b/tests/unit/test_fmha_6warp_multihead_hd64.cu
@@ -0,0 +1,2 @@
+#define HD_VAL 64
+#include "test_fmha_6warp_multihead.cu"