nvfp4-megamoe-kernel/FINAL_STRETCH.md

# DSV4 Audit — Decode Repetition + Precision / Tensor-Core Plan

# PART B — Precision / NVFP4 / tensor-core (WE ARE SKIPPING PART A FOR RIGHT NOW AND WILL REVISIT IT)

Goal: native NVFP4 where the math allows, FP8_E4M3 where it doesn't, BF16/FP32 only where required. Validate each change with per-layer cosine vs `dsv4/reference` before trusting it.

## B0 — What's already optimal: DO NOT "fix" the MoE
`dsv4/layers/moe.py` already runs **native NVFP4**: expert weights and activations are `float4_e2m1fn_x2`, block scales are `float8_e4m3fn`. This matches the paper (routed experts in FP4). Leave it. The remaining wins are in **attention** and the **indexer**, not MoE.

### P5 — Fused mHC pre_block + RMSNorm + NVFP4 quantize: ✅ DONE
- `fused_mhc_rmsnorm_quantize.cu` — 2-kernel approach (mhc_rmsnorm_amax_gsa + mhc_rmsnorm_quantize_nvfp4)
- **Integrated into `forward_layer`** for BOTH attn and ffn mHC paths (commit 0b6ca0d)
- Unit test: cos=0.999 vs unfused, 0.995 vs true mHC+RMSNorm at T=1/8/128

### P4 — Fused RMSNorm + NVFP4 quantize: ✅ DONE
- `fused_rmsnorm_quantize.cu` — 2-kernel approach
- gsa scalar fix in `Nvfp4Linear.run_from_quantized`

### Stale Lock Fix: ✅ DONE (commit 845227c)

## B1 — FP8_E4M3 FMHA: ✅ DONE

**Implementation**: `dsv4/kernels/attention/fmha_mixed_fp8_decode.cuh` + C API + Python bridge.

Storage-native DSV4 attention: noPE KV stays FP8_E4M3, RoPE KV stays BF16, no global FP8→BF16 dequant.

### Unit Test Results (2026-06-03, `tests/unit/test_b1_mixed_fp8_fmha.py`)

| Test | Status |
|------|--------|
| quantize_q_fp8_split | ✅ PASS (cos=0.9997) |
| gather_mixed kernels | ✅ PASS |
| FMHA cosine (N=128..2048, H=128) | ✅ PASS (cos=0.9999..0.9997) |
| Attention sinks | ✅ PASS |
| GQA/MQA (128 Q heads) | ✅ PASS |
| Weight loading verification | ✅ PASS |
| Batch sizes (B=1,2,4) | ✅ PASS |

### Bugs Found and Fixed

1. **V matrix canonical layout swap** (commit 4fe7f9d): `canon_idx_bf16_16x16(kk, dd)` was wrong — should be `canon_idx_bf16_16x16(dd, kk)`. The SMEM group structure was transposed vs the working TMA-loaded V in the multitile kernel. This caused cos=0.158 vs BF16 reference. After fix: cos=0.999972 at N=128.

### Known Limitations
- **Prefill batch size**: T=1..128 supported. For T>128, caller must split. T_BATCH=32 sub-batches used internally.
- Specialized for DSV4 HD=512/NOPE=448/ROPE=64.

### Bug Fix (2026-06-03)
1. **CRITICAL: T-dimension strides were wrong for T>1** — the kernel used `q_nope_head_stride` (stride(1) = T*NOPE) for the T dimension, but the correct stride is `stride(2) = NOPE`. For T=1 this is invisible (qr=0 always), but for T>1 it reads garbage from adjacent heads' data. Fix: added explicit T-dimension strides (`q_nope_t_stride`, `q_scale_t_stride`, `q_rope_t_stride`) to params struct, C API, and Python wrapper. All 16 T>1 test configs now pass (cos >= 0.999887).

## B2 — FP8 tensor-core indexer scoring: ✅ DONE

**Implementation**: `dsv4/kernels/cuda/indexer_fp8_score_topk.cu`

Native Blackwell FP8 GEMM via tcgen05 for CSA Lightning Indexer scoring. No PyTorch einsum fallback.

### Unit Test Results (2026-06-03, `tests/unit/test_b2_indexer_fp8.py`)

| Test | Status |
|------|--------|
| Score cosine vs FP32 reference (n_comp=128..8192) | ✅ PASS (100% overlap ≤1024, ~88% at 8192) |
| Score distribution sanity | ✅ PASS |
| Determinism | ✅ PASS |
| Edge cases (n_comp < top_k, n_comp=1) | ✅ PASS |
| Weight format verification | ✅ PASS |

### Bugs Found and Fixed

1. **Broken `16x256b.x1` TMEM read** — instruction was hanging. Root cause: the `16x256b.x1` PTX instruction either doesn't exist on SM100 or has different alignment requirements. **Fix**: use the proven `32x32b.x8` instruction from B1 FMHA.

2. **TMEM_COLS too small** — TMEM_COLS=128 was insufficient for the 128×128 MMA output. The MMA writes ALL 128 rows, requiring 4 row-groups × 128 columns = 512 TMEM columns. **Fix**: TMEM_COLS=512.

3. **Wrong TMEM offset for rows 32-63** — tried `tb + SK_TILE + col_base` and `tb + 16 + col_base`, both gave wrong results. **Root cause**: the `32x32b.x8` instruction maps different warps to different row slices from the SAME TMEM address. Warp 0 reads rows 0-31, warp 1 reads rows 32-63, all from `tb + col_base`. **Fix**: warps 0-1 both read from the same address, accumulate into separate SMEM partitions, then merge.

4. **Cross-warp accumulation race condition** — initial attempt used shared `sLogits[c]` with first-warp-writes/second-warp-adds pattern, which was non-deterministic. **Fix**: per-warp score partitions (`sWarpScores[0..SK_TILE-1]` and `sWarpScores[SK_TILE..2*SK_TILE-1]`), merged after `__syncthreads()`.

### Production Configuration
- n_ih=64, ihd=128, top_k=1024
- Warps 0-1: TMEM read + per-warp score accumulation
- Warp 4: MMA (FP8 GEMM)
- Per-thread local top-k (INDEXER_LOCAL_K=8) → block-level merge

## B3 — Fused rmsnorm→quant for q_a_norm / kv_norm: ✅ DONE
- `q_a_norm` → `q_b` path uses fused `rmsnorm_quantize_nvfp4` + `run_from_quantized`
- `kv_norm` still uses unfused rmsnorm — requires FP8 FMHA (B1) to fully benefit

## B4 — General "producer BF16 → consumer FP32" sweep: NOT STARTED

## B5 — Residual-stream precision: NOT STARTED (low priority)

---

# PART D — Dangling TODOS

  - Batched Prefill: ✅ DONE (T=1..128, mixed FP8/BF16 kernel)
  - Need to wire prefill into single_shot_inference.py (replace T=1 token-by-token prefill)
  - Need T>128 support (split into multiple launches)