- New kernel: dsv4/kernels/cuda/indexer_fp8_score_topk.cu
- Native Blackwell FP8 GEMM via tcgen05.mma.kind::f8f6f4
- Q (n_ih=64, ihd=128) quantized BF16→FP8, K consumed directly as FP8_E4M3
- TMEM read using tcgen05.ld 16x256b.x1 (4 warps in parallel, proven in B1 FMHA)
- On-the-fly: dequant (q_scale*k_scale) → ReLU → weighted sum → top-k
- No global BF16 staging of indexer keys, no FP32 einsum on CUDA cores
- Per-thread register heap top-k (same algorithm as indexer_score_topk.cu)
- Modified: single_shot_inference.py
- Indexer.forward() now takes kv_cache directly (not comp_idx_kv BF16)
- Consumes FP8 indexer keys from cache without BF16 dequantization
- Dispatches to B2 FP8 kernel for T=1, n_ih=64, ihd=128 (production decode)
- FP32 einsum fallback retained only for T>1 (prefill)
- Removed 'Intentional first-pass limits' section from B1 doc
(those limits ARE the correct production design, not shortcuts)
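
Reference sketch of the scoring path above (dequant → ReLU → weighted sum → top-k) in plain Python, for checking kernel output on small inputs. This is not the kernel code. The function names, the per-head q_scale / per-key k_scale granularity, and the per-head weights `head_w` are assumptions for illustration. The dispatch predicate only restates the T=1, n_ih=64, ihd=128 condition.

```python
import heapq

def indexer_scores(q, k, q_scale, k_scale, head_w):
    """q: [n_heads][d] quantized query values, k: [n_keys][d] quantized keys.
    q_scale: [n_heads], k_scale: [n_keys], head_w: [n_heads] (all assumed layouts)."""
    scores = []
    for t, k_row in enumerate(k):
        total = 0.0
        for h, q_row in enumerate(q):
            raw = sum(a * b for a, b in zip(q_row, k_row))  # GEMM accumulator
            logit = raw * q_scale[h] * k_scale[t]           # dequant
            total += head_w[h] * max(logit, 0.0)            # ReLU, then weighted sum
        scores.append(total)
    return scores

def topk_min_heap(scores, k):
    """Keep the k best (score, index) pairs in a min-heap; return indices, best first."""
    heap = []
    for idx, s in enumerate(scores):
        if len(heap) < k:
            heapq.heappush(heap, (s, idx))
        elif s > heap[0][0]:
            heapq.heapreplace(heap, (s, idx))  # evict the current minimum
    return [idx for _, idx in sorted(heap, reverse=True)]

def use_fp8_kernel(T, n_ih, ihd):
    """Hypothetical dispatch predicate: FP8 kernel for production decode only."""
    return T == 1 and n_ih == 64 and ihd == 128

scores = indexer_scores(
    q=[[1.0, 0.0], [0.0, 1.0]],
    k=[[1.0, 1.0], [-1.0, 2.0], [2.0, -1.0]],
    q_scale=[0.5, 0.5],
    k_scale=[1.0, 2.0, 1.0],
    head_w=[1.0, 2.0],
)
top2 = topk_min_heap(scores, 2)
```

The min-heap keeps only k candidates live at a time, which is the property that lets a per-thread version stay in registers.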
docs/p7_tmem_column_layout.md: Verified that tcgen05.ld 32x32b.x8 is
the correct instruction for multi-row softmax. Each call reads 8 KV
positions for 32 rows. No instruction change is needed relative to the single-row path.
test_p7_multi_row_softmax.py: Tests T=1,4,32,64,128 at various HD and N.
Gate: cosine similarity >= 0.999996.
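
Minimal sketch of the gate check, assuming "cos" means cosine similarity between the flattened kernel output and a reference output. The helper names are hypothetical and not taken from test_p7_multi_row_softmax.py.

```python
import math

GATE = 0.999996  # threshold from the note above

def cosine_similarity(a, b):
    """Cosine similarity of two equal-length flat sequences of floats."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def passes_gate(out, ref, threshold=GATE):
    """True when the output is close enough in direction to the reference."""
    return cosine_similarity(out, ref) >= threshold
```

Cosine similarity ignores overall scale, so a gate like this is normally paired with a magnitude check if scale errors matter.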
Evidence: TMA loads succeed with 128B-aligned GMEM on all descriptor configs.
The bit-21 workaround was NOT needed. The 'misaligned address' crashes were
caused by passing non-128B-aligned GMEM pointers to cp.async.bulk.tensor.
Added docs/p4_tma_hang_resolution.md with root cause and fix.
Cleaned up stale P4 test files.
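
Sketch of a host-side guard for the alignment issue above, assuming the base address is available as an integer (for example from a framework's data-pointer accessor). The 128-byte figure is taken from this note; the helper names are hypothetical.

```python
TMA_GMEM_ALIGN = 128  # bytes, per the finding above

def is_tma_aligned(addr, align=TMA_GMEM_ALIGN):
    """True when a global-memory address is a multiple of `align` bytes."""
    return addr % align == 0

def align_up(addr, align=TMA_GMEM_ALIGN):
    """Round an address (or byte offset) up to the next multiple of `align`."""
    return (addr + align - 1) // align * align

def check_tma_pointer(addr, name="tensor"):
    """Fail early on the host instead of crashing inside the kernel."""
    if not is_tma_aligned(addr):
        raise ValueError(
            f"{name} base address {addr:#x} is not {TMA_GMEM_ALIGN}B-aligned"
        )
```

Sub-tensor views and sliced buffers are the usual source of unaligned base addresses, so the check belongs on the pointer actually passed to the descriptor, not on the parent allocation.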
Key findings documented in docs/cuda13_tma_notes.md:
- CUDA 13 globalStrides are in BYTES, not elements (root cause of descriptor creation failures)
- BFLOAT16 data type available in CUDA 13
- Driver API descriptors create OK but cp.async.bulk.tensor hangs on driver 13.0 + toolkit 13.2
- CuTeDSL tma_partition works (production path)
Archived (not deleted):
- fmha_tma_driver_api.cuh, fmha_6warp_tma_driver_api.cuh, test_fmha_tma_driver_api.cu
- These will work once the driver version matches the toolkit version
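
Sketch of the byte-stride computation behind the globalStrides finding, assuming a dense row-major tensor and the driver API convention that strides are listed innermost-first with the innermost (contiguous) dimension omitted. The function name is hypothetical; the multiple-of-16 check reflects the documented constraint on globalStrides.

```python
def tma_global_strides_bytes(shape, elem_bytes):
    """shape is outermost-first (row-major). Returns strides in BYTES for every
    dimension except the innermost, ordered innermost-first."""
    strides = []
    stride = elem_bytes
    for dim in reversed(shape[1:]):
        stride *= dim            # bytes to step one index in the next-outer dim
        strides.append(stride)
    for s in strides:
        if s % 16 != 0:
            raise ValueError(f"stride {s} bytes is not a multiple of 16")
    return strides

# BF16 (2 bytes) tensor of 64 rows x 128 columns: one stride entry, the row pitch.
row_pitch = tma_global_strides_bytes((64, 128), 2)
```

Passing element counts here (128 instead of 256 for the example) is the mistake described above: the value is read as bytes, so the descriptor no longer describes the intended layout.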