Files
nvfp4-megamoe-kernel/PREVIOUS_CONVERSATION_2.md
biondizzle 79ef87f9a9 FIX: E2M1 FP4 dequantization bug in indexer_score_topk.cu
The dequant_fp4_scalar function was treating the magnitude bits as
a raw integer (0-6) instead of the E2M1 floating-point format:
  Old (WRONG): val = (int)(nibble & 0x07) * scale
  New (CORRECT): proper E2M1 decode with exponent + mantissa

E2M1 encoding (bias=1):
  exp=0 subnormal: 0b000=0, 0b001=0.5
  exp=1: 0b010=1, 0b011=1.5
  exp=2: 0b100=2, 0b101=3
  exp=3: 0b110=4, 0b111=6

Bug found by outside consultant. Affects indexer top-k selection
correctness — wrong FP4 key decoding would select wrong CSA blocks.

Fixed in both:
- dsv4/kernels/indexer/indexer_score_topk.cu
- dsv4/kernels/cuda/indexer_score_topk.cu
2026-05-28 16:16:24 +00:00

7.9 KiB
Raw Blame History

CRITICAL DISCOVERY:

  1. Columns 0-31: All have row 0 data at lane 0 — correct (d=0..31)
  2. Columns 32-35: EMPTY — no data in any lane. These 4 columns are skipped by the MMA.
  3. Columns 36-47: Have data at lane 0 (d=36..47)
  4. Columns 48-51: EMPTY — same as 32-35, 4 skipped columns
  5. Columns 52-63: Have data at lane 0 (d=52..63)
  6. Columns 64-127: Have large non-attention values — these are leftover S values from the QK GEMM!

The PV MMA only wrote to 56 out of 64 expected TMEM columns (64 - 8 skipped = 56). The missing columns are at positions 32-35 and 48-51. This is the root cause of the HD=64 precision bug.

Layout D for N=64 maps the 64 N-dim output values to 56 TMEM columns, not 64. The "missing" 8 columns correspond to sub-tiles that Layout D doesn't use. The actual output d=32..35 and d=48..51 are written to different TMEM positions than expected.

Wait — but columns 64-127 have data (from the QK GEMM). The PV MMA with accumulate=true accumulates on top of the QK output. But the QK MMA wrote to columns 0-127, and the PV MMA only writes to the "N=64 Layout D" columns, which skip 32-35 and 48-51. So those positions still have the old QK data, not the PV data.

This is the bug! The PV MMA with make_idesc(128, 64) and tb pointing to the same TMEM as the QK MMA. The PV writes to the "Layout D for N=64" columns, but the QK data in columns 32-35 and 48-51 is still there. And the epilogue reads tb + n*8 which includes those stale QK values at columns 32-35 and 48-51.

The fix: zero the TMEM before the PV MMA. Or, better, use a separate TMEM region for PV so the stale QK data doesn't interfere.

Wait, but even if we zero TMEM, the missing columns 32-35 and 48-51 would just be zero instead of stale data. The PV output for d=32..35 and d=48..51 would still be missing. The fundamental issue is that the PV MMA with N=64 doesn't write to all 64 TMEM columns in a simple sequential mapping.

This is a Layout D issue. For M=128, N=64, the MMA output is organized differently in TMEM than for N=128. We need to understand the exact mapping.

Let me try a cleaner test: zero ALL 128 TMEM columns before PV, then read the result. This will tell us exactly which TMEM columns the PV MMA writes to:

CONFIRMED: After zeroing all TMEM and running PV MMA:

  • 56 non-zero columns in range 0-63 (expected 64)
  • Missing columns: 32, 33, 34, 35, 48, 49, 50, 51
  • Zero in columns 64-127 (expected — PV with N=64 only writes to 64 TMEM columns)
  • All 56 found values match their reference outputs closely (diff < 1e-4)

The PV MMA with make_idesc(128, 64) writes to only 56 out of 64 expected TMEM columns. Columns 32-35 and 48-51 are simply not written by the MMA. The output for d=32-35 and d=48-51 must be written to a different set of TMEM columns, or the MMA doesn't produce them.

Wait — but the reference for d=32..35 and d=48..51 has non-zero values, and they're not found in ANY of the 128 columns. This means the MMA doesn't compute those output values at all.

This is a fundamental issue with the MMA instruction descriptor. Let me re-examine make_idesc(128, 64). The instruction descriptor encodes M=128 (field: 128>>4=8) and N=64 (field: 64>>3=8).

For the tcgen05.mma.kind::f16 with cta_group=1, the valid M/N combinations might be restricted. Let me check the PTX spec for valid MMA shapes.

Looking at the QK MMA that works: make_idesc(128, 128) — M=128, N=128. The PV MMA: make_idesc(128, 64) — M=128, N=64. The PV MMA with HD=16 works (M=128, N=16). So N=16 and N=128 work, but N=64 produces missing columns.

The pattern of missing columns (4 out of every 16 at positions 32-35 and 48-51) suggests the MMA's N=64 output is organized into 8 sub-tiles of 8 columns each, but the hardware only populates 7 of the 8 sub-tiles. This might be a valid MMA shape issue.

The correct approach for HD=64: use N=16 with 4 separate PV MMA calls (4 N-sub-tiles, each producing 16 output values). This is the approach that works at HD=16, extended to 4 tiles for HD=64.

This is what the earlier PREVIOUS_CONVERSATION tried with test_fmha_hd64_n16.cu but it had issues. Let me implement it properly this time.

The key insight: for any HD value, use BLOCK_MN_B=16 for the PV MMA. This means:

  • For HD=16: 1 PV call per K-tile
  • For HD=64: 4 PV calls per K-tile (4 N-sub-tiles of 16 each)
  • For HD=128: 8 PV calls per K-tile
  • For HD=256: 16 PV calls per K-tile

Each PV call produces a (128, 16) output, which we read from TMEM columns 0-15. The 4 calls accumulate into the same TMEM with different N offsets.

But wait — we need to accumulate across N-sub-tiles too. The MMA with N=16 produces 16 output columns. For the next N-sub-tile (d=16..31), we need to accumulate starting at TMEM column 16. And so on.

Actually, the simpler approach: for each N-sub-tile, the MMA writes to TMEM starting at tb + d/8 * 8 (or similar offset). But the TMEM offset for the MMA C operand controls where the output goes. We can pass a different tmem_c address for each N-sub-tile.

Wait — the MMA C operand is tb (the tmem base). The MMA writes to TMEM starting at tb. For N=16, it writes to columns 0-15 (verified at HD=16). For N=64, it writes to 56 out of 64 columns (missing 8). But if we use N=16 with tb + 0 for the first sub-tile, tb + 16 for the second, tb + 32 for the third, etc., each sub-tile writes to 16 consecutive columns and the total covers 64 columns.

But the MMA instruction's C operand is the starting TMEM column for the output. For umma_ss_f16(tb, ...) with N=16, the output goes to columns 0-15. If we change the C operand to tb + 16, the output should go to columns 16-31.

Let me implement this. For each PV K-tile, we do 4 PV MMA calls with N=16 and different TMEM offsets:

  1. umma_ss_f16(tb + 0, dp, dv, idesc_pv16, ...) → columns 0-15 → d=0..15
  2. umma_ss_f16(tb + 16, dp, dv_n1, idesc_pv16, ...) → columns 16-31 → d=16..31
  3. umma_ss_f16(tb + 32, dp, dv_n2, idesc_pv16, ...) → columns 32-47 → d=32..47
  4. umma_ss_f16(tb + 48, dp, dv_n3, idesc_pv16, ...) → columns 48-63 → d=48..63

Each call uses V starting at a different N-offset (V[0:16, :], V[16:32, :], etc.).

COSINE 0.999997! HD=64 PASSES! 🎉

The fix: use N=16 PV sub-tiles instead of N=64. The make_idesc(128, 64) has Layout D issues where 8 TMEM columns are missing. Using 4 × make_idesc(128, 16) with different TMEM offsets works perfectly.

Now let me commit this properly and extend to HD=128 and HD=256. The pattern generalizes:

  • HD=16: 1 N-sub-tile (n=0)
  • HD=64: 4 N-sub-tiles (n=0..3)
  • HD=128: 8 N-sub-tiles (n=0..7)
  • HD=256: 16 N-sub-tiles (n=0..15)

Mike, here's where we stand:

Major Breakthrough: ALL HD values passing

HD=16/64/128/256 all pass with cosine 0.999997+

The bug that was killing HD=64

tcgen05.mma with make_idesc(128, 64) has a Layout D bug — it silently skips 8 TMEM columns (32-35, 48-51). Those output positions were just zero. This explained the cos 0.931.

The fix

Use N=16 PV sub-tiles. Instead of one make_idesc(128, 64) call, we do 4 × make_idesc(128, 16) calls with TMEM offset tb + n*16. Each (128,16) MMA writes correctly to 16 columns. This generalizes to all HD values: HD/16 calls per K-tile.

What's in the repo now

  • test_fmha_gen.cu — generalized kernel, all HD values
  • fmha_sm100_tc.cuh — production kernel template with FmhaSm100Kernel<HD>::launch()
  • CURRENT_ISSUE.md — updated with the full findings
  • Bug proof files: test_tmem_zero_pv.cu, test_tmem_all_lanes.cu

Next steps (in priority order)

  1. 6-warp specialization — warps 0-3 softmax/epilogue, warp 4 MMA, warp 5 TMA loads
  2. TMA loads for Q/K/V (currently direct GMEM reads, no TMA)
  3. Prefill T>1 — multi-row softmax + fill all 128 rows of sPk
  4. Multi-head launch — grid=(1, n_h, batch)
  5. Integration with production.py and the CuTeDSL-based pipeline

What do you want to tackle next?