Key finding: the root cause is that each epilogue thread owns MULTIPLE rows
in the QK C-fragment, so scalar row_max/row_sum are wrong (global across
all rows, not per-row). The V=ones diagnostic confirmed: all 128 threads
use the same row_sum (from row 114).
Tried: TMEM vector store+load of row_sum (composition(tStS, (128,2))).
This is a no-op because both write and read use the SAME QK partition
with a scalar row_sum. The vector approach only helps when different
partitions are used for write vs read, or when per-row values are stored.
Next steps:
1. Need PER-ROW row_max and row_sum, not per-thread scalar
2. The CUTLASS FMHA works because each thread owns exactly 1 row
3. Options: restructure thread layout, or compute per-row values differently
4. The vector must store ALL 128 per-row values, then read per-row in C9
Key finding: cute.size(v, mode=[0]) in @cute.jit produces wrong code.
Hardcoding s_k=128 (matching Stage B) fixes the base pipeline.
Current status: kernel produces non-zero output but softmax math is still wrong.
Applied fixes: pv_done_bar, acc_scale with scale, fastmath=True
Need to debug row_sum computation and C9 normalization.
- scale_softmax_log2 was missing from _setup (patch artifact)
- C9 normalization: load O from TMEM, multiply by 1/row_sum, store back
instead of trying to capture runtime value in const_expr lambda
- Then use standard epilogue_tma_store with identity transform
- C1: Real softmax reference (torch.softmax, not identity)
- C2: Per-thread row_max/row_sum registers
- C3: QK scale folded (1/sqrt(d) * log2(e))
- C4: Row max via .reduce(MAX)
- C5: Rescale factor (exp2(old_max - new_max))
- C6: O rescale in TMEM (correction_rescale pattern)
- C7: Real exp2 for P computation
- C8: Row sum via packed f32x2 reduction
- C9: Final normalization (1/row_sum in epilogue)
- Dynamic s_k for V FMHA reconstruction
- fastmath=False for correctness first
Document canonical test files, obsolete test sprawl, and the path from
test_fmha_v3.py → cutedsl/kernel/attention/fmha_kernel.py → vLLM integration.
Also: TMEM layout for Stage C, key lessons from A&B.
- Workspace README: full rewrite with Stage B ✅, Bug 4b root cause (P/O overlap),
FMHA V reconstruction, TMEM layout diagram, softmax store pattern, updated footguns
- Kernel README: focused on the bug, fix, and current test status
- Key lesson documented: NEVER use find_tmem_tensor_col_offset() as O placement
Root cause: PV output O started at TMEM column 64 (from find_tmem_tensor_col_offset),
overlapping with P at columns [32,96). PV MMA reading P while writing O to overlapping
columns corrupted the A operand mid-computation.
For (128,128) PV, O started at 128 (no overlap) so it worked by accident.
For (128,64) PV, O started at 64, overlapping P [32,96) -> NaN/garbage.
Fix: Place O at column 128 (after both S [0,128) and P [32,96)).
Also added FMHA-style V reconstruction: logical (HEAD_DIM, s_k, 1) stride (1, hd, hd*s_k)
instead of passing DLPack V directly to TMA.
test_fmha_v3.py: (128,64) PV with random V -> cosine 0.999999 PASS
Key findings:
- P/A alias WORKS: PV reads non-zero P from TMEM at offset 32 (proven by no-softmax test)
- V mode bug: V=(128,64) only loads 64 K-values, PV needs 128. Output = sum(S[:,:64]) = 0.67 cosine
- FMHA-style V reconstruction (hd,n,1) stride (1,hd) gives NaN for (128,64) PV
- K-major V (64,128) contiguous gives NaN for (128,64) PV
- Square (128,128) PV works with ALL V approaches (cosine 0.999999)
- Non-square PV consistently broken regardless of V layout
Test files:
- test_128_128_fmha_v.py: (128,128) with FMHA V - PASS
- test_pv64_fmha_v.py: (128,64) with FMHA V - NaN
- test_pv64_kmajor_v.py: (128,64) with K-major V - NaN
- test_pv64_with_softmax.py: (128,64) with original V - 0.67
- test_pv64_no_softmax.py: proves P/A alias works
- test_fmha_v3.py: full pipeline with QK C-fragment composition store
- test_pv64.py: (128,64) PV with separate V SMEM, single ab pipeline
Result: cosine 0.669848 — data path works but P layout mismatch
Softmax writes P via QK C-fragment layout, PV reads via PV A-fragment layout
These differ for non-(128,128) PV — Bug 1 from README
- test_fmha_v2_fixed.py: KV-tile interleaved pipeline with fixes
Fix 1: per-pipeline tx_count (Q vs KV separate byte counts)
Fix 2: NamedBarrier for softmax-done signal (replaces double-acquire deadlock)
Fix 3: Separate SMEM for V (no recast_ptr overlap with K)
Still produces zeros — needs P layout fix (same root cause as test_pv64)
- (128,64) PV MMA A-fragment has N_MMA=64, reads P with wrong stride
- Softmax writes P with QK C-fragment layout (N_MMA=128)
- O[m,d] ≈ P[m,2d] — every other column effect confirmed
- All-ones and single-element V pass (uniform/sparse data hides mismatch)
- epi_tile must use PV cta_tile (partial fix: 0.01 → 0.876)
- Added footguns #9 (TMEM alias N_MMA match) and #10 (epi_tile)
- Added diagnostic test results to test table
Bug #2 fix: warmup_compilation and warmup_fused_swiglu_compilation now
use valid FP4 data by quantizing random BF16 through quantize_to_nvfp4.
Random uint8 bytes as FP4 bit patterns cause cudaErrorIllegalInstruction
in Blackwell MMA hardware. Re-enabled warmup calls in runner.py.
Bug #1 kernel: sparse_topk_metadata.cu with:
- build_c128a_topk_metadata: position-based compressed KV slot lookup
via block table for C128A (compress_ratio=128) decode tokens
- compute_c4a_global_topk: local topk index -> global slot ID mapping
via block table for C4A (compress_ratio=4) decode tokens
- Both tested: correct block table lookups, proper padding
Bug #3 kernel: C4A uses compute_c4a_global_topk (same .cu file)
- Replaces vLLM Triton kernel with our own CUDA kernel
Deleted stale STATUS.md, FUSED_EPILOGUE_STATUS.md, FUSED_EPILOGUE_PLAN.md, CURRENT_BUGMD
- Fix interleave_l1_weights: remove //2 bug (g=granularity_bf16 for N-axis)
- Apply L1 weight+SF interleave in runner._ensure_stacked() and moe_pipeline
- De-interleave L1 GEMM output before gate/up split
- Fused SwiGLU kernel: epi_tile=(128,8) for subtile-level pairing
- Even subtiles = gate: SiLU in FP32 registers, save to register buffer
- Odd subtiles = up: silu(gate)*up from buffer
- Both branches produce same BF16 tensor type (CuTeDSL constraint)
- run_nvfp4_moe_fused() pipeline: fused L1 + PyTorch L2
- Runner: fused_swiglu=True option for CuTeDSLMoERunner
- Layertest: both fused and non-fused paths PASS (cosine 0.988)
- README.md updated with current status and lessons learned
SiLU in registers: PASS (0.034% error, Step 1 stable)
Gate/up subtile detection: blocked by CuTeDSL type system
CuTeDSL compiles the kernel for ALL subtile iterations at once.
Runtime conditionals (if is_gate_subtile) that affect:
- Register tensor assignment → DSLRuntimeError (type structure mismatch)
- TMA store skipping → corrupted output
- Mask blending → wrong results
Path forward: use const_expr debug flag for the BF16 side output,
or process gate/up in a separate post-GEMM kernel.
Step 1 VALIDATED:
- cute.exp works on register tensors in the epilogue
- SiLU (x / (1+exp(-x))) produces correct results
- Relative error vs PyTorch: 0.034%, max abs: 0.0625 (BF16 precision)
Step 2 (gate/up pairing) approach:
- Register-level pairing requires understanding acc_vec layout from tiled_copy_r2s
- DeepGEMM pattern: (values[0], values[2]) pairs for tcgen05.ld
- CuTeDSL retile may produce different layout than direct PTX loads
- SMEM-level SiLU is a valid intermediate: avoids GMEM round-trip while
working in logical (M, N) coordinate space
- Non-interleaved weights + SMEM SiLU is simplest starting point