
# DSV4 NVFP4 Workspace

## Status (May 21, 2026 — 09:18 UTC)

### Stage A ✅ COMPLETE
Bare Q@K^T via tcgen05.mma → TMEM → GMEM. Cosine 0.999999.

### Stage B 🔨 IN PROGRESS: Bug 4 (non-(128,128) PV MMA)

Two MMAs chained: Q@K^T (SMEM source) → identity softmax in TMEM → P@V (TMEM source).

**Pipeline deadlock: ✅ FIXED. Softmax packing: ✅ CONFIRMED CORRECT.**

---

## Bug 4 (ACTIVE): Non-(128,128) PV MMA — V/B Staging or Output C/D Failure

### Summary

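Non-(128,128) PV MMA tiles produce wrong output: (128,128) PV matches the reference, while (128,64), (128,32), and (128,16) do not. The original hypothesis was a TMEM alias mismatch, with the softmax writing P in the **QK C-fragment layout** and the PV MMA reading it in the **PV A-fragment layout**. Diagnostics have since ruled that out (see Root Cause): the A-fragment layout is identical for every PV size, so the fault lies in the V/B staging or the output C/D path.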

**FMHA uses (128,16) PV with the same construction and works.** The difference between FMHA and this kernel has not been identified despite exhaustive comparison. FMHA builds the P layout with `p_tmem_layout_staged = make_smem_layout_a(pv_mma, pv_mma_tiler, q_dtype, 1)`, which is the same call we make.

### What Works / What Doesn't

- ✅ PV (128,128) output, V=I or random → cosine 1.0 / 0.999999
- ✅ PV (128,128) with zero-padded V (head_dim=16) → cosine 1.0 **WORKAROUND**
- ✅ PV (128,64), all-ones V → cosine 0.999999 (uniform hides bug)
- ✅ PV (128,64), single-element V → cosine 1.0 (sparse hides bug)
- ❌ PV (128,64), truncated identity V → cosine 0.02
- ❌ PV (128,16), V=I(128,128) → cosine 0.0 (all zeros)
- ❌ PV (128,16) with P at S offset (no softmax) → NaN (FP32→BF16 reinterpret)
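
Why a uniform V can hide the failure is easy to show on the host. The sketch below is plain Python, not kernel code. It models a staging fault as a rotation of V's rows, which is a hypothetical stand-in chosen for illustration and not the diagnosed bug, and uses an identity P as in the identity-softmax tests. The helper names (`matmul`, `cosine`, `rotate_rows`) are illustrative only.

```python
import math

def matmul(a, b):
    """Naive (m,k) @ (k,n) product on nested lists."""
    return [[sum(a[i][p] * b[p][j] for p in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def cosine(x, y):
    """Cosine similarity of two matrices, flattened."""
    xs = [v for row in x for v in row]
    ys = [v for row in y for v in row]
    dot = sum(p * q for p, q in zip(xs, ys))
    norm = math.sqrt(sum(p * p for p in xs)) * math.sqrt(sum(q * q for q in ys))
    return dot / norm if norm else 0.0

def rotate_rows(v, shift):
    """Hypothetical stand-in for a staging fault: V rows arrive rotated."""
    return v[shift:] + v[:shift]

seq, head_dim = 8, 4  # small stand-ins for the real tile sizes

# Identity P, so O equals V exactly as it was staged.
p = [[1.0 if i == j else 0.0 for j in range(seq)] for i in range(seq)]

ones_v = [[1.0] * head_dim for _ in range(seq)]
trunc_identity_v = [[1.0 if r == c else 0.0 for c in range(head_dim)]
                    for r in range(seq)]

# Same fault applied to both inputs; only the structured V exposes it.
cos_ones = cosine(matmul(p, ones_v), matmul(p, rotate_rows(ones_v, 1)))
cos_ident = cosine(matmul(p, trunc_identity_v),
                   matmul(p, rotate_rows(trunc_identity_v, 1)))
print(f"all-ones V: {cos_ones:.6f}, truncated identity V: {cos_ident:.6f}")
```

Any test input that is invariant under the suspected fault will pass, so structured inputs such as a truncated identity are the more useful probes.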

### Root Cause (Updated May 21 09:20 UTC)

**The P/A TMEM alias is NOT the bug.** Diagnostic prints confirm the PV A-fragment layout is IDENTICAL for all PV sizes:

```
(128,128) PV: tOrP2_s = (2048, 1, 8), size=16384, cosine=1.0
(128,32)  PV: tOrP2_s = (2048, 1, 8), size=16384, cosine=0.51
(128,16)  PV: tOrP2_s = (2048, 1, 8), size=16384, cosine=0.36
```

The C++ source confirms: the A-fragment TMEM atom depends on M and K, NOT output N. The softmax writes P to the same TMEM columns regardless of PV size.

**The real bug is in the V/B staging or output C/D path.** When using (128,128) PV with zero-padded V (which keeps the V SMEM, O C-fragment, and epilogue at (128,128) dimensions), cosine=1.0. When using native (128,32) PV with V=(32,128), cosine=0.51. The difference is the V SMEM layout and/or output epilogue.

**Key observations:**
- V smem_size=2048 for (128,16) PV, vs 16384 for (128,128) PV
- O tOtO_size=2048 for (128,16) PV, vs 16384 for (128,128) PV
- cta_tile_shape_mnk=(128,128,64) for BOTH — this is QK's cta tile, not PV's
- epi_tile=(128,16) for (128,16) PV — this IS correct (from PV)
- Swapping cta_tile to PV before epilogue doesn't fix the issue

**Next steps:**
1. Test V TMA load correctness for (128,32) PV — is V data loaded correctly into SMEM?
2. Test PV MMA output directly (skip epilogue) — is the PV MMA producing correct O?
3. Check if V B-operand fragment (tCrV) has the right shape for (128,32) PV

### Current Workaround

Use **(128,128) PV with zero-padded V**. This wastes compute (8× for head_dim=16, 2× for head_dim=64) but produces correct results (cosine 1.0). The production kernel will use this initially and move to (128,16) PV once Bug 4 is resolved.
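
The arithmetic behind the workaround can be checked on the host. This is a minimal sketch in plain Python with small stand-in sizes; `pad_v_columns` and `wasted_compute_factor` are hypothetical helper names, and V is held in its logical (seq, head_dim) orientation rather than the MN-major view the kernel uses.

```python
def matmul(a, b):
    """Naive (m,k) @ (k,n) product on nested lists."""
    return [[sum(a[i][p] * b[p][j] for p in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def pad_v_columns(v, padded_dim):
    """Zero-pad V from (seq, head_dim) to (seq, padded_dim)."""
    return [row + [0.0] * (padded_dim - len(row)) for row in v]

def wasted_compute_factor(padded_dim, head_dim):
    """How many times more output columns the padded MMA computes."""
    return padded_dim // head_dim

seq, head_dim, padded_dim = 4, 2, 8  # stand-ins for 128, 16, 128
p = [[float(i + j) for j in range(seq)] for i in range(seq)]
v = [[float(r * head_dim + c) for c in range(head_dim)] for r in range(seq)]

o_native = matmul(p, v)                              # what a native narrow PV should give
o_padded = matmul(p, pad_v_columns(v, padded_dim))   # what the workaround computes
o_sliced = [row[:head_dim] for row in o_padded]      # keep only the real columns

print(o_sliced == o_native, wasted_compute_factor(128, 16), wasted_compute_factor(128, 64))
```

The padded columns of the output are all zero, so slicing the first `head_dim` columns recovers the native result at the cost of the extra MMA work.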

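### Earlier Proposed Fixes (Not Applied)

These were drafted under the TMEM-alias hypothesis, which the diagnostics under Root Cause have ruled out. They are kept for reference until the V/B staging and output C/D checks are done.

1. **Primary (superseded)**: Have the softmax write P using the PV A-fragment TMEM layout instead of the QK C-fragment layout. This would require constructing a `make_tmem_copy` with `tP` (PV layout) as the destination and rearranging register data from the QK partition to the PV partition. Since the A-fragment layout is identical for all PV sizes, this is no longer expected to help.

2. **Secondary (tried, insufficient)**: `epi_tile` must use PV's cta tile, and `self.cta_tile_shape_mnk` must be swapped before `epilogue_tma_store`. FMHA sets `self.epi_tile = self.pv_mma_tiler[:2]` directly. `epi_tile` is already (128,16) for (128,16) PV, and swapping the cta tile before the epilogue did not fix the output.

3. **Alternative (for later)**: Investigate using `composition()` to create a hybrid layout that both the QK softmax write and the PV A-fragment read can agree on. This only matters if the alias hypothesis is revived.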

---

## Bugs 1–3: ✅ FIXED

### Bug 1: V B-Operand Must Be MN-Major

FMHA requires V to be **MN-major** for the PV MMA B-operand. V must be shaped (head_dim, seq) = (64, 128) with strides (1, 64) via `as_strided`.
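
The stride arithmetic can be sanity-checked without a GPU. The sketch below is plain Python and assumes the underlying buffer is a row-major (seq, head_dim) = (128, 64) array, which is an assumption about the source tensor rather than something stated above. Under that assumption, shape (64, 128) with strides (1, 64) is a transposed view over the same memory; `strided_offset` and `v_view` are illustrative names.

```python
def strided_offset(coord, strides):
    """Linear element offset of a coordinate under the given strides."""
    return sum(c * s for c, s in zip(coord, strides))

seq, head_dim = 128, 64

# Assumed source buffer: row-major (seq, head_dim), element (s, d) at s*head_dim + d.
buffer = list(range(seq * head_dim))

# MN-major view from the text: shape (head_dim, seq), strides (1, head_dim).
shape, strides = (head_dim, seq), (1, head_dim)

def v_view(d, s):
    """Read element (d, s) of the strided view."""
    return buffer[strided_offset((d, s), strides)]

# The view never reads past the end of the buffer.
max_offset = strided_offset((shape[0] - 1, shape[1] - 1), strides)
print(max_offset, len(buffer) - 1)
```

Moving along `head_dim` steps by one element, so `head_dim` is the contiguous mode of the view, and no data is copied to obtain it.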

### Bug 2: C-Fragment Composition Store for P — CONFIRMED CORRECT

FP32→BF16 packing via C-fragment composition store works. ⛔ `St32x32bOp` MUST use Float32, NOT BFloat16.

### Bug 3: First PV Must Use ACCUMULATE=False

If ACCUMULATE=True on the first PV, `O = P@V + old_O` adds uninitialized TMEM. FMHA: `pv_tiled_mma.set(tcgen05.Field.ACCUMULATE, kphase_idx != 0)`.
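
A host-side model makes the effect concrete. This is a sketch in plain Python, not the tcgen05 API: `mma_step` and `run` are hypothetical helpers, `GARBAGE` stands in for whatever an uninitialized TMEM accumulator happens to hold, and the step index plays the role of `kphase_idx`.

```python
def mma_step(p, v, acc, accumulate):
    """Model one MMA: P@V, plus the old accumulator when accumulate is True."""
    rows, inner, cols = len(p), len(v), len(v[0])
    return [[(acc[i][j] if accumulate else 0.0)
             + sum(p[i][k] * v[k][j] for k in range(inner))
             for j in range(cols)] for i in range(rows)]

GARBAGE = 1.0e6  # stand-in for leftover accumulator contents

def run(p_steps, v_steps, first_accumulates):
    """Chain MMA steps into one accumulator that starts out uninitialized."""
    acc = [[GARBAGE] * len(v_steps[0][0]) for _ in p_steps[0]]
    for idx, (p, v) in enumerate(zip(p_steps, v_steps)):
        # Mirrors the quoted FMHA pattern: accumulate only after the first step.
        accumulate = True if first_accumulates else (idx != 0)
        acc = mma_step(p, v, acc, accumulate)
    return acc

p_steps = [[[1.0, 2.0]], [[3.0, 4.0]]]      # two steps, each P is (1, 2)
v_steps = [[[1.0], [1.0]], [[2.0], [0.5]]]  # each V is (2, 1)

correct = run(p_steps, v_steps, first_accumulates=False)
polluted = run(p_steps, v_steps, first_accumulates=True)
print(correct, polluted)
```

With the first step overwriting the accumulator, the leftover value never reaches the output; with accumulation enabled from the start, it is carried through every later step.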

---

## Pipeline Deadlock — ✅ FIXED (May 21)

Three root causes found and fixed:
1. `PipelineUmmaAsync` for mma_si must NOT pass `cta_layout_vmnk`
2. TMA warp must NOT call `tmem.wait_for_alloc()`
3. `pipeline.PipelineTmaStore` (not `TmaStorePipeline`)

---

## ⛔ FOOTGUNS — CUTLASS CuTeDSL Landmines

1. **St32x32bOp with BFloat16 → ILLEGAL MEMORY ACCESS** — Must use Float32 + `cute.recast_ptr`
2. **V major ≠ K major** — V must be MN-major, use `as_strided`
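3. **Non-(128,128) PV currently breaks**: (128,128) works, (128,64) breaks. This was first attributed to the C-fragment → A-fragment TMEM alias depending on N_MMA; diagnostics have ruled that out (see Bug 4)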
4. **PipelineUmmaAsync consumer = thread count, NOT warp count** — `32 * len(warp_ids)`
5. **mma_si pipeline must NOT pass cta_layout_vmnk**
6. **TMA warp excluded from tmem barrier**
7. **First PV ACCUMULATE=False**
8. **TMEM offset: FP32 ptr + 32 = BF16 ptr + 64** (width scaling)
9. **epi_tile must use PV cta_tile, not QK**
10. **CuTe nested layout modes flatten sequentially** — `((128,16),1,(4,2)):((65536,1),0,(16,64))` is sequential
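
Footguns 8 and 10 can be checked with a few lines of host-side arithmetic. The sketch below is plain Python and does not use CuTe. It reads "sequential" as meaning that the column sub-modes (16, 4, 2) with strides (1, 16, 64) tile column offsets 0..127 without gaps, and it treats the stride 65536 as a TMEM lane stride; both readings are assumptions. `flatten`, `offset`, and `rescale_offset` are illustrative names.

```python
def flatten(t):
    """Flatten a nested tuple of ints into a flat list, left to right."""
    if isinstance(t, int):
        return [t]
    out = []
    for x in t:
        out.extend(flatten(x))
    return out

# Layout from footgun 10: ((128,16),1,(4,2)):((65536,1),0,(16,64))
shape = ((128, 16), 1, (4, 2))
stride = ((65536, 1), 0, (16, 64))
flat_shape, flat_stride = flatten(shape), flatten(stride)

def offset(coord):
    """Offset of a flat coordinate: inner product with the flat strides."""
    return sum(c * s for c, s in zip(coord, flat_stride))

# Hold the size-128 sub-mode at 0 and enumerate the column sub-modes.
column_offsets = sorted(offset((0, j, 0, a, b))
                        for j in range(16) for a in range(4) for b in range(2))

def rescale_offset(elems, from_bits, to_bits):
    """Convert an element offset between pointer types of different width."""
    return elems * from_bits // to_bits

print(flat_shape, flat_stride)
print(column_offsets[:4], column_offsets[-1])
print(rescale_offset(32, 32, 16))  # footgun 8: FP32 +32 is BF16 +64
```

The same check applies to any layout printed by the diagnostics: flatten it, enumerate the modes of interest, and confirm the offsets cover the expected range.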

---

## Architecture: Per-Tile Flow

```
For each KV tile:
  1. Load warp writes sKV[stage] (paged FP8 gather via indexed cp.async)
  2. MMA warp issues MMA1: sQ @ sKV[stage]^T → tmem_scores (accumulate=False)
     Signals scores_full_mbar (via PipelineUmmaAsync commit)
  3. Epilogue warps wait on mma_si consumer (scores ready), then:
     a. tcgen05.ld scores from TMEM → register fragments
     b. Compute tile_max, new_max, rescale = exp(old_max - new_max)
     c. Apply rescale to tmem_output IN PLACE (tmem_output *= rescale)
     d. tcgen05.st exp(scores - new_max) back to TMEM → P operand
     e. Release mma_si (softmax_done — MMA warp can re-acquire and issue PV MMA)
  4. MMA warp waits on mma_si acquire (softmax done), MMA2: P @ sV → tmem_output (accumulate=False on the first tile, True afterwards)
  5. Stage released, load warp can refill it

After all tiles: epilogue warps tcgen05.ld tmem_output, divide by row_sum, cast to BF16, store to GMEM
```
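
The per-tile flow above is the usual online-softmax recurrence. As a host-side reference, here is a minimal sketch in plain Python that follows steps 3b through 4 and the final normalization, then compares against a one-shot softmax. It models only the math, not warps, TMEM, or pipelines; the function names and test data are illustrative. Starting the output at zero plays the same role as `accumulate=False` on the first PV.

```python
import math

def online_softmax_attention(score_tiles, v_tiles):
    """score_tiles[t] is (m, n_t) scores; v_tiles[t] is (n_t, d) values."""
    m, d = len(score_tiles[0]), len(v_tiles[0][0])
    row_max = [-math.inf] * m
    row_sum = [0.0] * m
    out = [[0.0] * d for _ in range(m)]
    for scores, v in zip(score_tiles, v_tiles):
        for i in range(m):
            new_max = max(row_max[i], max(scores[i]))
            rescale = math.exp(row_max[i] - new_max)    # 0.0 on the first tile
            p = [math.exp(s - new_max) for s in scores[i]]
            row_sum[i] = row_sum[i] * rescale + sum(p)
            for c in range(d):
                # Rescale the running output, then accumulate P @ V for this tile.
                out[i][c] = out[i][c] * rescale + sum(
                    p[j] * v[j][c] for j in range(len(v)))
            row_max[i] = new_max
    # After all tiles: divide by the running row sum.
    return [[out[i][c] / row_sum[i] for c in range(d)] for i in range(m)]

def reference_attention(score_tiles, v_tiles):
    """One-shot softmax(S) @ V over the concatenated tiles."""
    v_all = [row for v in v_tiles for row in v]
    result = []
    for i in range(len(score_tiles[0])):
        s_all = [s for tile in score_tiles for s in tile[i]]
        mx = max(s_all)
        w = [math.exp(s - mx) for s in s_all]
        total = sum(w)
        result.append([sum(wj * vj[c] for wj, vj in zip(w, v_all)) / total
                       for c in range(len(v_all[0]))])
    return result

score_tiles = [[[math.sin(3 * i + j + 4 * t) * 3.0 for j in range(4)]
                for i in range(3)] for t in range(2)]
v_tiles = [[[math.cos(j + 2 * c + 5 * t) for c in range(2)]
            for j in range(4)] for t in range(2)]

tiled = online_softmax_attention(score_tiles, v_tiles)
full = reference_attention(score_tiles, v_tiles)
max_err = max(abs(a - b) for ra, rb in zip(tiled, full) for a, b in zip(ra, rb))
print(f"max abs error vs one-shot softmax: {max_err:.3e}")
```

A reference like this is useful for checking the real softmax path once the identity-softmax stage is passing.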

---

## Environment

- **Server**: root@45.76.247.107 (B200, 180 GiB HBM3e per GPU)
- **venv**: `source /root/dsv4-nvfp4-workspace/venv/bin/activate`
- **PYTHONPATH**: `/root/dsv4-nvfp4-workspace/kernel`
- **Model**: `/root/nvidia-meeting/DeepSeek-V4-Pro-NVFP4`
- **vLLM repo**: `/root/dsv4-nvfp4-workspace/vllm` (modified for Blackwell)
- **Pseudocode**: `/root/fragile-kernel-example/README.md`
- **fmha.py reference**: `/root/cutlass/examples/python/CuTeDSL/cute/blackwell/kernel/attention/fmha/fmha.py`
- **fmha_bwd.py reference**: `/root/cutlass/examples/python/CuTeDSL/cute/blackwell/kernel/attention/fmha/fmha_bwd.py`
-												README: Bug 4 corrected — NOT TMEM alias. A-fragment identical for all PV sizes. Real bug in V/B or output C/D.

											
										
										
											2026-05-21 09:47:08 +00:00
+								# DSV4 NVFP4 Workspace
-												Initial: TileLang NVFP4 mega_moe kernel package

- nvfp4_mega_moe_full: drop-in replacement for deep_gemm.mega.fp8_nvfp4_mega_moe
- transform_nvfp4_weights_for_mega_moe: weight transformation (tested)
- SymmBuffer + get_symm_buffer_for_nvfp4_mega_moe: API-matching stubs
- MEGA_MOE_STATIC=1 support for pipeline testing
- pyproject.toml for pip install

											
										
										
											2026-05-13 15:44:51 +00:00
-												README: Bug 4 corrected — NOT TMEM alias. A-fragment identical for all PV sizes. Real bug in V/B or output C/D.

											
										
										
											2026-05-21 09:47:08 +00:00
+								## Status (May 21, 2026 — 09:18 UTC)
-												Initial: TileLang NVFP4 mega_moe kernel package

- nvfp4_mega_moe_full: drop-in replacement for deep_gemm.mega.fp8_nvfp4_mega_moe
- transform_nvfp4_weights_for_mega_moe: weight transformation (tested)
- SymmBuffer + get_symm_buffer_for_nvfp4_mega_moe: API-matching stubs
- MEGA_MOE_STATIC=1 support for pipeline testing
- pyproject.toml for pip install

											
										
										
											2026-05-13 15:44:51 +00:00
-												README: Bug 4 corrected — NOT TMEM alias. A-fragment identical for all PV sizes. Real bug in V/B or output C/D.

											
										
										
											2026-05-21 09:47:08 +00:00
+								### Stage A ✅ COMPLETE
 								Bare Q@K^T via tcgen05.mma → TMEM → GMEM. Cosine 0.999999.
-												Update README and CURRENT_BUG: BUILD YOUR OWN KERNELS. Stop patching vLLM.

											
										
										
											2026-05-19 15:19:55 +00:00
-												README: Bug 4 corrected — NOT TMEM alias. A-fragment identical for all PV sizes. Real bug in V/B or output C/D.

											
										
										
											2026-05-21 09:47:08 +00:00
+								### Stage B 🔨 IN PROGRESS — TMEM Alias Bug 4
-												Stage B: two MMAs + identity softmax — crash fixed, softmax output still wrong

Key fixes:
- PipelineUmmaAsync consumer group: 32*4=128 threads (not 4 warps)
- TMEM offsets computed from find_tmem_tensor_col_offset (not hardcoded)
- P fragment from p_tmem_s.outer + make_fragment_A (matching fmha.py)
- V SMEM aliasing via recast_ptr

Status:
- Stage A: cosine 0.999999 ✅
- Stage B: runs without crash, identity softmax cosine -0.02 ❌
- Diagnostics: TMEM layout inspection, bisection results

											
										
										
											2026-05-20 20:26:25 +00:00
-												README: Bug 4 corrected — NOT TMEM alias. A-fragment identical for all PV sizes. Real bug in V/B or output C/D.

											
										
										
											2026-05-21 09:47:08 +00:00
+								Two MMAs chained: Q@K^T (SMEM source) → identity softmax in TMEM → P@V (TMEM source).
-												Stage B: two MMAs + identity softmax — crash fixed, softmax output still wrong

Key fixes:
- PipelineUmmaAsync consumer group: 32*4=128 threads (not 4 warps)
- TMEM offsets computed from find_tmem_tensor_col_offset (not hardcoded)
- P fragment from p_tmem_s.outer + make_fragment_A (matching fmha.py)
- V SMEM aliasing via recast_ptr

Status:
- Stage A: cosine 0.999999 ✅
- Stage B: runs without crash, identity softmax cosine -0.02 ❌
- Diagnostics: TMEM layout inspection, bisection results

											
										
										
											2026-05-20 20:26:25 +00:00
-												README: Bug 4 corrected — NOT TMEM alias. A-fragment identical for all PV sizes. Real bug in V/B or output C/D.

											
										
										
											2026-05-21 09:47:08 +00:00
+								**Pipeline deadlock: ✅ FIXED. Softmax packing: ✅ CONFIRMED CORRECT.**
-												README: update with v28/v29 deadlock investigation, FMHA softmax bridge trace, new footguns

											
										
										
											2026-05-21 06:46:02 +00:00
-												README: Bug 4 root cause — TMEM layout mismatch (128,64) PV A-fragment vs softmax P write

- (128,64) PV MMA A-fragment has N_MMA=64, reads P with wrong stride
- Softmax writes P with QK C-fragment layout (N_MMA=128)
- O[m,d] ≈ P[m,2d] — every other column effect confirmed
- All-ones and single-element V pass (uniform/sparse data hides mismatch)
- epi_tile must use PV cta_tile (partial fix: 0.01 → 0.876)
- Added footguns #9 (TMEM alias N_MMA match) and #10 (epi_tile)
- Added diagnostic test results to test table

											
										
										
											2026-05-21 05:17:12 +00:00
+								---
-												README: Bug 4 corrected — NOT TMEM alias. A-fragment identical for all PV sizes. Real bug in V/B or output C/D.

											
										
										
											2026-05-21 09:47:08 +00:00
+								## Bug 4 (ACTIVE): Non-(128,128) PV MMA — V/B Staging or Output C/D Failure
-												Stage B: C-fragment vs A-fragment TMEM layout mismatch diagnosed

Key finding: C-fragment and A-fragment use different physical TMEM address
mappings. St32x32bOp with C-fragment writes to C-layout addresses, but PV MMA
reads from A-layout addresses. Forward FMHA recast validated FP16 only, not BF16.

Working: FP32 ld/st roundtrip, BF16 elemwise, BF16 recast ld S0->st S1 (all cos 0.999999)
Broken: C-frag st + A-frag read (NaN), A-frag store + PV MMA (cos -0.02)
Next: Fix register data flow (128 FP16/thread load vs 64 BF16/thread store mismatch)

											
										
										
											2026-05-21 00:12:47 +00:00
-												README: Bug 4 corrected — NOT TMEM alias. A-fragment identical for all PV sizes. Real bug in V/B or output C/D.

											
										
										
											2026-05-21 09:47:08 +00:00
+								### Summary
-												Stage B: C-fragment vs A-fragment TMEM layout mismatch diagnosed

Key finding: C-fragment and A-fragment use different physical TMEM address
mappings. St32x32bOp with C-fragment writes to C-layout addresses, but PV MMA
reads from A-layout addresses. Forward FMHA recast validated FP16 only, not BF16.

Working: FP32 ld/st roundtrip, BF16 elemwise, BF16 recast ld S0->st S1 (all cos 0.999999)
Broken: C-frag st + A-frag read (NaN), A-frag store + PV MMA (cos -0.02)
Next: Fix register data flow (128 FP16/thread load vs 64 BF16/thread store mismatch)

											
										
										
											2026-05-21 00:12:47 +00:00
-												README: Bug 4 corrected — NOT TMEM alias. A-fragment identical for all PV sizes. Real bug in V/B or output C/D.

											
										
										
											2026-05-21 09:47:08 +00:00
+								The softmax writes P to TMEM using the **QK C-fragment layout**. The PV MMA reads P from TMEM using the **PV A-fragment layout**. For (128,128) PV these layouts agree. For (128,16) PV they disagree — the PV A-fragment reads from different TMEM columns than where the softmax wrote, producing zero output.
-												Stage B: C-fragment vs A-fragment TMEM layout mismatch diagnosed

Key finding: C-fragment and A-fragment use different physical TMEM address
mappings. St32x32bOp with C-fragment writes to C-layout addresses, but PV MMA
reads from A-layout addresses. Forward FMHA recast validated FP16 only, not BF16.

Working: FP32 ld/st roundtrip, BF16 elemwise, BF16 recast ld S0->st S1 (all cos 0.999999)
Broken: C-frag st + A-frag read (NaN), A-frag store + PV MMA (cos -0.02)
Next: Fix register data flow (128 FP16/thread load vs 64 BF16/thread store mismatch)

											
										
										
											2026-05-21 00:12:47 +00:00
-												README: Bug 4 corrected — NOT TMEM alias. A-fragment identical for all PV sizes. Real bug in V/B or output C/D.

											
										
										
											2026-05-21 09:47:08 +00:00
+								**FMHA uses (128,16) PV with the same construction and works.** The root difference is not yet identified despite exhaustive comparison. FMHA references: `p_tmem_layout_staged = make_smem_layout_a(pv_mma, pv_mma_tiler, q_dtype, 1)` — same call we make.
-												Stage B: C-fragment vs A-fragment TMEM layout mismatch diagnosed

Key finding: C-fragment and A-fragment use different physical TMEM address
mappings. St32x32bOp with C-fragment writes to C-layout addresses, but PV MMA
reads from A-layout addresses. Forward FMHA recast validated FP16 only, not BF16.

Working: FP32 ld/st roundtrip, BF16 elemwise, BF16 recast ld S0->st S1 (all cos 0.999999)
Broken: C-frag st + A-frag read (NaN), A-frag store + PV MMA (cos -0.02)
Next: Fix register data flow (128 FP16/thread load vs 64 BF16/thread store mismatch)

											
										
										
											2026-05-21 00:12:47 +00:00
-												README: Bug 4 corrected — NOT TMEM alias. A-fragment identical for all PV sizes. Real bug in V/B or output C/D.

											
										
										
											2026-05-21 09:47:08 +00:00
+								### What Works / What Doesn't
-												Stage B: C-fragment vs A-fragment TMEM layout mismatch diagnosed

Key finding: C-fragment and A-fragment use different physical TMEM address
mappings. St32x32bOp with C-fragment writes to C-layout addresses, but PV MMA
reads from A-layout addresses. Forward FMHA recast validated FP16 only, not BF16.

Working: FP32 ld/st roundtrip, BF16 elemwise, BF16 recast ld S0->st S1 (all cos 0.999999)
Broken: C-frag st + A-frag read (NaN), A-frag store + PV MMA (cos -0.02)
Next: Fix register data flow (128 FP16/thread load vs 64 BF16/thread store mismatch)

											
										
										
											2026-05-21 00:12:47 +00:00
-												README: Bug 4 corrected — NOT TMEM alias. A-fragment identical for all PV sizes. Real bug in V/B or output C/D.

											
										
										
											2026-05-21 09:47:08 +00:00
+								- ✅ PV (128,128) output, V=I or random → cosine 1.0 / 0.999999
 								- ✅ PV (128,128) with zero-padded V (head_dim=16) → cosine 1.0 **WORKAROUND**
 								- ✅ PV (128,64), all-ones V → cosine 0.999999 (uniform hides bug)
 								- ✅ PV (128,64), single-element V → cosine 1.0 (sparse hides bug)
 								- ❌ PV (128,64), truncated identity V → cosine 0.02
 								- ❌ PV (128,16), V=I(128,128) → cosine 0.0 (all zeros)
 								- ❌ PV (128,16) with P at S offset (no softmax) → NaN (FP32→BF16 reinterpret)
-												Stage B: C-fragment vs A-fragment TMEM layout mismatch diagnosed

Key finding: C-fragment and A-fragment use different physical TMEM address
mappings. St32x32bOp with C-fragment writes to C-layout addresses, but PV MMA
reads from A-layout addresses. Forward FMHA recast validated FP16 only, not BF16.

Working: FP32 ld/st roundtrip, BF16 elemwise, BF16 recast ld S0->st S1 (all cos 0.999999)
Broken: C-frag st + A-frag read (NaN), A-frag store + PV MMA (cos -0.02)
Next: Fix register data flow (128 FP16/thread load vs 64 BF16/thread store mismatch)

											
										
										
											2026-05-21 00:12:47 +00:00
-												README: Bug 4 corrected — NOT TMEM alias. A-fragment identical for all PV sizes. Real bug in V/B or output C/D.

											
										
										
											2026-05-21 09:47:08 +00:00
+								### Root Cause (Updated May 21 09:20 UTC)
-												Stage B: C-fragment vs A-fragment TMEM layout mismatch diagnosed

Key finding: C-fragment and A-fragment use different physical TMEM address
mappings. St32x32bOp with C-fragment writes to C-layout addresses, but PV MMA
reads from A-layout addresses. Forward FMHA recast validated FP16 only, not BF16.

Working: FP32 ld/st roundtrip, BF16 elemwise, BF16 recast ld S0->st S1 (all cos 0.999999)
Broken: C-frag st + A-frag read (NaN), A-frag store + PV MMA (cos -0.02)
Next: Fix register data flow (128 FP16/thread load vs 64 BF16/thread store mismatch)

											
										
										
											2026-05-21 00:12:47 +00:00
-												README: Bug 4 corrected — NOT TMEM alias. A-fragment identical for all PV sizes. Real bug in V/B or output C/D.

											
										
										
											2026-05-21 09:47:08 +00:00
+								**The P/A TMEM alias is NOT the bug.** Diagnostic prints confirm the PV A-fragment layout is IDENTICAL for all PV sizes:
-												Stage B: C-fragment vs A-fragment TMEM layout mismatch diagnosed

Key finding: C-fragment and A-fragment use different physical TMEM address
mappings. St32x32bOp with C-fragment writes to C-layout addresses, but PV MMA
reads from A-layout addresses. Forward FMHA recast validated FP16 only, not BF16.

Working: FP32 ld/st roundtrip, BF16 elemwise, BF16 recast ld S0->st S1 (all cos 0.999999)
Broken: C-frag st + A-frag read (NaN), A-frag store + PV MMA (cos -0.02)
Next: Fix register data flow (128 FP16/thread load vs 64 BF16/thread store mismatch)

											
										
										
											2026-05-21 00:12:47 +00:00
-												README: Bug 4 corrected — NOT TMEM alias. A-fragment identical for all PV sizes. Real bug in V/B or output C/D.

											
										
										
											2026-05-21 09:47:08 +00:00
+								```
 								(128,128) PV: tOrP2_s = (2048, 1, 8), size=16384, cosine=1.0
 								(128,32)  PV: tOrP2_s = (2048, 1, 8), size=16384, cosine=0.51
 								(128,16)  PV: tOrP2_s = (2048, 1, 8), size=16384, cosine=0.36
 								```
-												Stage B: C-fragment vs A-fragment TMEM layout mismatch diagnosed

Key finding: C-fragment and A-fragment use different physical TMEM address
mappings. St32x32bOp with C-fragment writes to C-layout addresses, but PV MMA
reads from A-layout addresses. Forward FMHA recast validated FP16 only, not BF16.

Working: FP32 ld/st roundtrip, BF16 elemwise, BF16 recast ld S0->st S1 (all cos 0.999999)
Broken: C-frag st + A-frag read (NaN), A-frag store + PV MMA (cos -0.02)
Next: Fix register data flow (128 FP16/thread load vs 64 BF16/thread store mismatch)

											
										
										
											2026-05-21 00:12:47 +00:00
-												README: Bug 4 corrected — NOT TMEM alias. A-fragment identical for all PV sizes. Real bug in V/B or output C/D.

											
										
										
											2026-05-21 09:47:08 +00:00
+								The C++ source confirms: the A-fragment TMEM atom depends on M and K, NOT output N. The softmax writes P to the same TMEM columns regardless of PV size.
-												Stage B: C-fragment vs A-fragment TMEM layout mismatch diagnosed

Key finding: C-fragment and A-fragment use different physical TMEM address
mappings. St32x32bOp with C-fragment writes to C-layout addresses, but PV MMA
reads from A-layout addresses. Forward FMHA recast validated FP16 only, not BF16.

Working: FP32 ld/st roundtrip, BF16 elemwise, BF16 recast ld S0->st S1 (all cos 0.999999)
Broken: C-frag st + A-frag read (NaN), A-frag store + PV MMA (cos -0.02)
Next: Fix register data flow (128 FP16/thread load vs 64 BF16/thread store mismatch)

											
										
										
											2026-05-21 00:12:47 +00:00
-												README: Bug 4 corrected — NOT TMEM alias. A-fragment identical for all PV sizes. Real bug in V/B or output C/D.

											
										
										
											2026-05-21 09:47:08 +00:00
+								**The real bug is in the V/B staging or output C/D path.** When using (128,128) PV with zero-padded V (which keeps the V SMEM, O C-fragment, and epilogue at (128,128) dimensions), cosine=1.0. When using native (128,32) PV with V=(32,128), cosine=0.51. The difference is the V SMEM layout and/or output epilogue.
-												Stage B: C-fragment vs A-fragment TMEM layout mismatch diagnosed

Key finding: C-fragment and A-fragment use different physical TMEM address
mappings. St32x32bOp with C-fragment writes to C-layout addresses, but PV MMA
reads from A-layout addresses. Forward FMHA recast validated FP16 only, not BF16.

Working: FP32 ld/st roundtrip, BF16 elemwise, BF16 recast ld S0->st S1 (all cos 0.999999)
Broken: C-frag st + A-frag read (NaN), A-frag store + PV MMA (cos -0.02)
Next: Fix register data flow (128 FP16/thread load vs 64 BF16/thread store mismatch)

											
										
										
											2026-05-21 00:12:47 +00:00
-												README: Bug 4 corrected — NOT TMEM alias. A-fragment identical for all PV sizes. Real bug in V/B or output C/D.

											
										
										
											2026-05-21 09:47:08 +00:00
+								**Key observations:**
 								- V smem_size=2048 for (128,16) PV, vs 16384 for (128,128) PV
 								- O tOtO_size=2048 for (128,16) PV, vs 16384 for (128,128) PV
 								- cta_tile_shape_mnk=(128,128,64) for BOTH — this is QK's cta tile, not PV's
 								- epi_tile=(128,16) for (128,16) PV — this IS correct (from PV)
 								- Swapping cta_tile to PV before epilogue doesn't fix the issue
-												Stage B: C-fragment vs A-fragment TMEM layout mismatch diagnosed

Key finding: C-fragment and A-fragment use different physical TMEM address
mappings. St32x32bOp with C-fragment writes to C-layout addresses, but PV MMA
reads from A-layout addresses. Forward FMHA recast validated FP16 only, not BF16.

Working: FP32 ld/st roundtrip, BF16 elemwise, BF16 recast ld S0->st S1 (all cos 0.999999)
Broken: C-frag st + A-frag read (NaN), A-frag store + PV MMA (cos -0.02)
Next: Fix register data flow (128 FP16/thread load vs 64 BF16/thread store mismatch)

											
										
										
											2026-05-21 00:12:47 +00:00
-												README: Bug 4 corrected — NOT TMEM alias. A-fragment identical for all PV sizes. Real bug in V/B or output C/D.

											
										
										
											2026-05-21 09:47:08 +00:00
+								**Next steps:**
 . Test V TMA load correctness for (128,32) PV — is V data loaded correctly into SMEM?
 . Test PV MMA output directly (skip epilogue) — is the PV MMA producing correct O?
 . Check if V B-operand fragment (tCrV) has the right shape for (128,32) PV
-												Stage B: C-fragment vs A-fragment TMEM layout mismatch diagnosed

Key finding: C-fragment and A-fragment use different physical TMEM address
mappings. St32x32bOp with C-fragment writes to C-layout addresses, but PV MMA
reads from A-layout addresses. Forward FMHA recast validated FP16 only, not BF16.

Working: FP32 ld/st roundtrip, BF16 elemwise, BF16 recast ld S0->st S1 (all cos 0.999999)
Broken: C-frag st + A-frag read (NaN), A-frag store + PV MMA (cos -0.02)
Next: Fix register data flow (128 FP16/thread load vs 64 BF16/thread store mismatch)

											
										
										
											2026-05-21 00:12:47 +00:00
-												README: Bug 4 corrected — NOT TMEM alias. A-fragment identical for all PV sizes. Real bug in V/B or output C/D.

											
										
										
											2026-05-21 09:47:08 +00:00
+								### Current Workaround
-												Stage B: C-fragment vs A-fragment TMEM layout mismatch diagnosed

Key finding: C-fragment and A-fragment use different physical TMEM address
mappings. St32x32bOp with C-fragment writes to C-layout addresses, but PV MMA
reads from A-layout addresses. Forward FMHA recast validated FP16 only, not BF16.

Working: FP32 ld/st roundtrip, BF16 elemwise, BF16 recast ld S0->st S1 (all cos 0.999999)
Broken: C-frag st + A-frag read (NaN), A-frag store + PV MMA (cos -0.02)
Next: Fix register data flow (128 FP16/thread load vs 64 BF16/thread store mismatch)

											
										
										
											2026-05-21 00:12:47 +00:00
-												README: Bug 4 corrected — NOT TMEM alias. A-fragment identical for all PV sizes. Real bug in V/B or output C/D.

											
										
										
											2026-05-21 09:47:08 +00:00
+								Use **(128,128) PV with zero-padded V**. This wastes compute (8× for head_dim=16, 2× for head_dim=64) but produces correct results (cosine 1.0). For the production kernel, we'll use this initially and optimize to (128,16) PV once the TMEM alias is resolved.
-												Update README and CURRENT_BUG: BUILD YOUR OWN KERNELS. Stop patching vLLM.

											
										
										
											2026-05-19 15:19:55 +00:00
-												README: Bug 4 corrected — NOT TMEM alias. A-fragment identical for all PV sizes. Real bug in V/B or output C/D.

											
										
										
											2026-05-21 09:47:08 +00:00
+								### Required Fixes (Not Yet Applied)
-												Update README + CURRENT_BUG: full CuTeDSL NVFP4 plan, no more PyTorch fallbacks

Mike's directive: build the full thing with NVFP4/CuTeDSL.
No more 'optimize later' or 'just make it work' workarounds.

Key updates:
- README: full architecture docs (CSA/HCA/mHC), current status, NVFP4 coverage
- CURRENT_BUG: detailed plan for CuTeDSL NVFP4 attention, KV cache, RoPE
- Both files document: checkpoint key names, compress ratios, config issues
- Removed all 'TODO: optimize later' hedging — we build it right the first time

											
										
										
											2026-05-19 08:26:16 +00:00
-												README: Bug 4 corrected — NOT TMEM alias. A-fragment identical for all PV sizes. Real bug in V/B or output C/D.

											
										
										
											2026-05-21 09:47:08 +00:00
+. **Primary**: Softmax must write P using the PV A-fragment TMEM layout, not the QK C-fragment layout. Requires constructing a `make_tmem_copy` with `tP` (PV layout) as the destination, and rearranging register data from QK partition to PV partition.
-												Update README + CURRENT_BUG: full CuTeDSL NVFP4 plan, no more PyTorch fallbacks

Mike's directive: build the full thing with NVFP4/CuTeDSL.
No more 'optimize later' or 'just make it work' workarounds.

Key updates:
- README: full architecture docs (CSA/HCA/mHC), current status, NVFP4 coverage
- CURRENT_BUG: detailed plan for CuTeDSL NVFP4 attention, KV cache, RoPE
- Both files document: checkpoint key names, compress ratios, config issues
- Removed all 'TODO: optimize later' hedging — we build it right the first time

											
										
										
											2026-05-19 08:26:16 +00:00
-												README: Bug 4 corrected — NOT TMEM alias. A-fragment identical for all PV sizes. Real bug in V/B or output C/D.

											
										
										
											2026-05-21 09:47:08 +00:00
+. **Secondary**: `epi_tile` must use PV's cta tile, and `self.cta_tile_shape_mnk` must be swapped before `epilogue_tma_store`. FMHA sets `self.epi_tile = self.pv_mma_tiler[:2]` directly.
-												feat: CUTLASS NVFP4 mega_moe kernel — slot-based L1/L2, source-first SF remap

Major changes from initial TileLang prototype:

Kernel:
- CUTLASS NVFP4 block-scaled GEMM (SM100 Blackwell, OpClassBlockScaledTensorOp)
- Slot-based dispatch: L1 GEMM → SiLU+Mul per-slot → L2 GEMM → index_add scatter
- 1D slot_expert_ids passed to both L1 and L2 (no 2D topk_ids rebuild)
- slot_token gathered in cutlass_grouped_nvfp4_gemm when provided

SF Remap (source-first):
- Iterates logical (m, k_sf) source grid, uses layout_sf(make_coord(m, k_sf))
  for CUTLASS dest index — no idx2crd/flatten coordinate extraction
- 2D kernel launch: dim3 block(32,8), grid over (K_sf, MN)
- Uses cute::cosize() for physical allocation size (not cute::size)
- SFA: (MN, K_sf) row-major; SFB: (K_sf, MN) row-major (col-major)

Weight transform:
- UE4M3 unpack with bit reinterpret (not value cast)
- Global scale folding (weight_scale_2) for gate/up split
- clamp(0,448) → float8_e4m3fn, transpose (N,K)→(K,N) for CUTLASS

No prepack cache:
- SFB remapped per-call inside CUTLASS (~µs, not the bottleneck)
- See README for why prepack cache must never return (OOM, CUDA graphs,
  M-dependent layout, cross-layer collisions)

Stage activation:
- Nearest-neighbor E2M1 quantization (no clamp, no uniform steps)
- Per-tensor global scale → alpha for L2 GEMM

Bug fixes:
- _fold_global_scale: removed broken logical_widths branch
- unpack_ue4m3_u32: int32 for CUDA bitwise, view not to, ND support
- Correct expert param mapping for NVFP4 checkpoint
- SiLU applied per-slot (not after summing expert paths)

											
										
										
											2026-05-15 11:38:18 +00:00
-												README: Bug 4 corrected — NOT TMEM alias. A-fragment identical for all PV sizes. Real bug in V/B or output C/D.

											
										
										
											2026-05-21 09:47:08 +00:00
- **Alternative (for later)**: Investigate using `composition()` to create a hybrid layout that both the QK softmax write and the PV A-fragment read can agree on.

---
## Bugs 1–3: ✅ FIXED

### Bug 1: V B-Operand Must Be MN-Major

FMHA requires V to be **MN-major** for the PV MMA B-operand. V must be shaped (head_dim, seq) = (64, 128) with strides (1, 64) via `as_strided`.
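To make the stride arithmetic concrete, here is a minimal pure-Python sketch that needs no GPU. It assumes V starts out stored row-major as (seq, head_dim) = (128, 64), which the text above does not state explicitly, and `strided_view` is a hypothetical helper that mimics what an `as_strided`-style call does on a flat buffer.

```python
SEQ, HEAD_DIM = 128, 64

# Assumption: V is stored row-major as (seq, head_dim), so element (s, d)
# lives at flat offset s * HEAD_DIM + d.
flat_v = [s * HEAD_DIM + d for s in range(SEQ) for d in range(HEAD_DIM)]


def strided_view(buf, shape, strides):
    """Hypothetical helper: read buf through a 2-D shape/stride pair."""
    rows, cols = shape
    return [[buf[i * strides[0] + j * strides[1]] for j in range(cols)]
            for i in range(rows)]


# MN-major view: shape (head_dim, seq) with strides (1, head_dim).
v_mn = strided_view(flat_v, (HEAD_DIM, SEQ), (1, HEAD_DIM))

# The view is the transpose of the original matrix; no data was moved.
v_rows = [flat_v[s * HEAD_DIM:(s + 1) * HEAD_DIM] for s in range(SEQ)]
is_transpose = all(v_mn[d][s] == v_rows[s][d]
                   for d in range(HEAD_DIM) for s in range(SEQ))
print(len(v_mn), len(v_mn[0]), is_transpose)
```

With a contiguous (128, 64) PyTorch tensor `v`, the corresponding call would presumably be `torch.as_strided(v, (64, 128), (1, 64))`; the exact call used in the test harness is not shown in this README.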
### Bug 2: C-Fragment Composition Store for P — CONFIRMED CORRECT

FP32→BF16 packing via C-fragment composition store works. ⛔ `St32x32bOp` MUST use Float32, NOT BFloat16.
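One way to picture the packing: two BF16 values share a single 32-bit word, and the store moves 32-bit elements. The sketch below shows this in pure Python. The round-to-nearest-even conversion and the low-half-first ordering are assumptions for illustration only; the kernel's actual rounding mode and lane order are not specified above, and all function names are hypothetical.

```python
import struct


def fp32_to_bf16_bits(x: float) -> int:
    """Convert a float to BF16 bits (round-to-nearest-even; NaN not handled)."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    # Add the rounding bias, then keep the top 16 bits.
    bits += 0x7FFF + ((bits >> 16) & 1)
    return (bits >> 16) & 0xFFFF


def pack_bf16_pair(lo: float, hi: float) -> int:
    """Pack two BF16 values into one 32-bit word (assumed: lo in bits 0-15)."""
    return (fp32_to_bf16_bits(hi) << 16) | fp32_to_bf16_bits(lo)


def word_as_fp32(word: int) -> float:
    """Reinterpret the packed word as FP32, as a 32-bit typed store would see it."""
    return struct.unpack("<f", struct.pack("<I", word))[0]


word = pack_bf16_pair(1.0, 2.0)
print(hex(word))  # 0x40003f80
```

The packed word is only meaningful as a bit pattern: `word_as_fp32(word)` returns an unrelated float, so the FP32 typing here describes element width, not value semantics.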
### Bug 3: First PV Must Use ACCUMULATE=False

If ACCUMULATE=True on the first PV, the MMA computes `O = P@V + old_O`, where `old_O` is uninitialized TMEM. FMHA sets the flag per k-phase: `pv_tiled_mma.set(tcgen05.Field.ACCUMULATE, kphase_idx != 0)`.

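The effect of the flag can be reproduced with a toy model. This is a sketch under stated assumptions, not the kernel code: `mma` is a hypothetical pure-Python stand-in for the PV MMA, and `GARBAGE` stands in for whatever uninitialized TMEM happens to contain.

```python
GARBAGE = 1e30  # stands in for uninitialized TMEM contents


def mma(p, v, old_o, accumulate):
    """Toy MMA: O = P @ V, plus old_O when accumulate is True."""
    rows, inner, cols = len(p), len(v), len(v[0])
    out = [[sum(p[i][k] * v[k][j] for k in range(inner)) for j in range(cols)]
           for i in range(rows)]
    if accumulate:
        out = [[out[i][j] + old_o[i][j] for j in range(cols)]
               for i in range(rows)]
    return out


def run(tiles, always_accumulate):
    """Chain PV MMAs over all tiles, starting from a garbage accumulator."""
    o = [[GARBAGE, GARBAGE], [GARBAGE, GARBAGE]]
    for kphase_idx, (p, v) in enumerate(tiles):
        # Same rule as the FMHA line quoted above: accumulate only after phase 0.
        accumulate = True if always_accumulate else (kphase_idx != 0)
        o = mma(p, v, o, accumulate)
    return o


identity = [[1.0, 0.0], [0.0, 1.0]]
tiles = [(identity, [[1.0, 2.0], [3.0, 4.0]]),
         (identity, [[10.0, 20.0], [30.0, 40.0]])]

good = run(tiles, always_accumulate=False)
bad = run(tiles, always_accumulate=True)
print(good)        # [[11.0, 22.0], [33.0, 44.0]]
print(bad[0][0])   # dominated by the garbage initial value
```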

---

## Pipeline Deadlock — ✅ FIXED (May 21)
Three root causes found and fixed:

1. `PipelineUmmaAsync` for mma_si must NOT pass `cta_layout_vmnk`
2. TMA warp must NOT call `tmem.wait_for_alloc()`
3. `pipeline.PipelineTmaStore` (not `TmaStorePipeline`)

---
## ⛔ FOOTGUNS — CUTLASS CuTeDSL Landmines

1. **St32x32bOp with BFloat16 → ILLEGAL MEMORY ACCESS** — Must use Float32 + `cute.recast_ptr`
2. **V major ≠ K major** — V must be MN-major, use `as_strided`
3. **C-fragment → A-fragment TMEM alias only works when N_MMA matches** — (128,128) works, (128,64) breaks
4. **PipelineUmmaAsync consumer = thread count, NOT warp count** — `32 * len(warp_ids)`
5. **mma_si pipeline must NOT pass cta_layout_vmnk**
6. **TMA warp excluded from tmem barrier**
7. **First PV ACCUMULATE=False**
8. **TMEM offset: FP32 ptr + 32 = BF16 ptr + 64** (width scaling)
9. **epi_tile must use PV cta_tile, not QK**
10. **CuTe nested layout modes flatten sequentially** — `((128,16),1,(4,2)):((65536,1),0,(16,64))` is sequential
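Three of the items above are plain arithmetic and can be checked without a GPU. The sketch below is one reading of them, written in pure Python rather than CuTeDSL: `flatten` and `idx_to_offset` are hypothetical helpers that follow the usual CuTe convention (nested modes flatten left to right, leftmost mode varies fastest), and the warp ids are made up for the example.

```python
from math import prod


def flatten(x):
    """Flatten a nested tuple of ints, left to right."""
    if isinstance(x, int):
        return (x,)
    return tuple(v for item in x for v in flatten(item))


def idx_to_offset(idx, shape, stride):
    """Map a 1-D index to an offset; the leftmost flattened mode varies fastest."""
    offset = 0
    for extent, step in zip(flatten(shape), flatten(stride)):
        offset += (idx % extent) * step
        idx //= extent
    return offset


# Layout quoted in the list above.
shape = ((128, 16), 1, (4, 2))
stride = ((65536, 1), 0, (16, 64))

flat_shape = flatten(shape)
flat_stride = flatten(stride)
size = prod(flat_shape)
probe = [idx_to_offset(i, shape, stride) for i in (1, 128, 2048, 8192)]
print(flat_shape, flat_stride, size, probe)

# Consumer arrival count is in threads, not warps.
warp_ids = (0, 1, 2, 3)  # hypothetical warp ids
consumer_threads = 32 * len(warp_ids)

# Width scaling: the same position is 32 FP32 elements or 64 BF16 elements in.
fp32_elems, fp32_bytes, bf16_bytes = 32, 4, 2
bf16_elems = fp32_elems * fp32_bytes // bf16_bytes
print(consumer_threads, bf16_elems)
```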

---
## Architecture: Per-Tile Flow

```
For each KV tile:
  1. Load warp writes sKV[stage] (paged FP8 gather via indexed cp.async)
  2. MMA warp issues MMA1: sQ @ sKV[stage]^T → tmem_scores (accumulate=False)
     Signals scores_full_mbar (via PipelineUmmaAsync commit)
  3. Epilogue warps wait on mma_si consumer (scores ready), then:
     a. tcgen05.ld scores from TMEM → register fragments
     b. Compute tile_max, new_max, rescale = exp(old_max - new_max)
     c. Apply rescale to tmem_output IN PLACE (tmem_output *= rescale)
     d. tcgen05.st exp(scores - new_max) back to TMEM → P operand
     e. Release mma_si (softmax_done: MMA warp can re-acquire and issue PV MMA)
  4. MMA warp waits on mma_si acquire (softmax done), MMA2: P @ sV → tmem_output
     (accumulate=False on the first KV tile, True afterwards; see Bug 3)
  5. Stage released, load warp can refill it
After all tiles: epilogue warps tcgen05.ld tmem_output, divide by row_sum, cast to BF16, store to GMEM
```
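The numerics of this flow can be checked against a plain softmax in a few lines. The following is a minimal single-query-row sketch in pure Python, not the kernel: the function names are hypothetical, and the running `row_sum` update is an assumption, since the flow only mentions `row_sum` in the final normalization.

```python
import math


def attention_reference(score_tiles, v_tiles):
    """Full softmax over all scores, then a weighted sum of the V rows."""
    scores = [s for tile in score_tiles for s in tile]
    values = [v for tile in v_tiles for v in tile]
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[j] for w, v in zip(weights, values)) / total
            for j in range(dim)]


def attention_per_tile(score_tiles, v_tiles):
    """Tile-by-tile version: rescale the running output, then accumulate P @ V."""
    dim = len(v_tiles[0][0])
    output = [0.0] * dim          # plays the role of tmem_output
    row_sum = 0.0                 # assumed running denominator
    old_max = -math.inf
    for scores, values in zip(score_tiles, v_tiles):
        new_max = max(old_max, max(scores))
        rescale = math.exp(old_max - new_max)      # exp(-inf) = 0.0 on tile 0
        output = [o * rescale for o in output]     # tmem_output *= rescale
        row_sum *= rescale
        p = [math.exp(s - new_max) for s in scores]  # the P operand
        row_sum += sum(p)
        for w, v in zip(p, values):                # P @ V, accumulated
            output = [o + w * x for o, x in zip(output, v)]
        old_max = new_max
    return [o / row_sum for o in output]           # divide by row_sum


score_tiles = [[0.5, 2.0, -1.0], [3.0, 0.1, 1.5]]
v_tiles = [[[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]],
           [[2.0, -1.0], [0.5, 0.5], [-1.0, 3.0]]]

ref = attention_reference(score_tiles, v_tiles)
tiled = attention_per_tile(score_tiles, v_tiles)
max_err = max(abs(a - b) for a, b in zip(ref, tiled))
print(max_err < 1e-9)
```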

---
## Environment

- **Server**: root@45.76.247.107 (B200, 180 GiB HBM3e per GPU)
- **venv**: `source /root/dsv4-nvfp4-workspace/venv/bin/activate`
- **PYTHONPATH**: `/root/dsv4-nvfp4-workspace/kernel`
- **Model**: `/root/nvidia-meeting/DeepSeek-V4-Pro-NVFP4`
- **vLLM repo**: `/root/dsv4-nvfp4-workspace/vllm` (modified for Blackwell)
- **Pseudocode**: `/root/fragile-kernel-example/README.md`
- **fmha.py reference**: `/root/cutlass/examples/python/CuTeDSL/cute/blackwell/kernel/attention/fmha/fmha.py`
- **fmha_bwd.py reference**: `/root/cutlass/examples/python/CuTeDSL/cute/blackwell/kernel/attention/fmha/fmha_bwd.py`