Two critical fixes:
1. prefill_read_pv_all_subs: was missing 'tb' base in TMEM read address
2. PV MMA ACCUMULATE: use pv_kt == 0 (not kv_tile==0 && pv_kt==0 && n_sub==0)
so each query row's PV starts fresh instead of accumulating into previous row's result
Critical bug: prefill_read_pv_row only read n_sub=0 (16 out of 512 HD dims).
Replaced with prefill_read_pv_all_subs that iterates over all 32 n_sub groups.
Also fixed TMEM row-group/warp mapping for rows 32-127.
Phase 1: Run full pipeline to populate KV caches with real model weights.
Phase 2: For each layer, gather KV in mixed FP8/BF16 format, run both
production FMHA and PyTorch SDPA, compare cosine similarity.
Uses random Q (not model-generated) to isolate FMHA kernel accuracy
from upstream pipeline issues.
Test loads real model weights, runs attention forward for layers 0-4,
compares production B1 mixed FP8 FMHA output vs PyTorch SDPA reference.
This will reveal the FMHA cosine degradation (was 0.679 at L1) with
real data patterns, not just synthetic random data.
Production values: HD=512, NOPE=448, ROPE=64, H=128, 8 GPUs.
Key insight: tcgen05.ld.32x32b.x8 maps warp 0 to rows 0-31 and warp 1 to
rows 32-63 from the SAME TMEM address. The hardware routes row slices
based on warp position in the warpgroup.
Fix approach (from external LLM review):
- Warps 0-1 both read from tb + col_base (same address)
- Each warp writes partial scores to its own sWarpScores partition
- After __syncthreads(), merge both partitions for final 64-head scores
- No race conditions, no cross-warp accumulation bugs
Revert to 16x256b.x1 approach (reads 64 rows from single column).
Previous hang was likely due to TMEM_COLS=128 (too small).
With TMEM_COLS=512, the full 128-row MMA output fits in TMEM.
Lane i reads rows 4i..4i+3. Lanes 0-15 cover rows 0-63.
4 warps (0-3) each process 32 columns, computing weighted ReLU scores.
ROOT CAUSE: The TMEM read for rows 32-63 was wrong. The 32x32b.x8
instruction reads 32 rows per warp. Per P7 docs, warp 0 sees rows 0-31
and warp 1 sees rows 32-63 from the SAME TMEM address. There is no TMEM
offset for different row groups — the row-to-lane mapping depends on
the warp ID.
Fix: warp 0 reads heads 0-31, warp 1 reads heads 32-63 from tb + col_base.
Cross-warp reduce via SMEM to compute full 64-head weighted ReLU scores.
The MMA produces 128 rows × 128 cols = 4 row-groups × 128 TMEM cols = 512 total.
Even though we only read rows 0-63, the MMA writes all 128 rows.
TMEM_COLS must match the MMA output size, not just the read size.
ROOT CAUSES:
1. tcgen05.ld.16x256b.x1 was hanging — either invalid instruction or unaligned
2. TMEM_COLS=128 was too small for 64-row MMA output (needs 256 for 2 row-groups)
3. TMEM row-group addressing: rows 32-63 are at offset SK_TILE (128) in TMEM
Fixes:
- Use tcgen05.ld.32x32b.x8 (proven in B1 FMHA) instead of 16x256b.x1
- Increase TMEM_COLS from 128 to 256
- Read both row-groups (0-31 and 32-63) per 8-column chunk
- Each lane handles head i (from row-group 0) and head 32+i (from row-group 1)
- Warp-level reduce sums contributions from all 64 heads per column
ROOT CAUSE: canon_idx_bf16_16x16(kk, dd) was swapping the outer/inner group
structure compared to the working TMA-loaded V layout in the multitile kernel.
Working layout: (lr/8)*128 + (dd/8)*64 + (dd%8)*8 + (lr%8)
B1 with (kk,dd): (dd/8)*128 + (kk/8)*64 + (kk%8)*8 + (dd%8) <- WRONG
B1 with (dd,kk): (kk/8)*128 + (dd/8)*64 + (dd%8)*8 + (kk%8) <- CORRECT
This caused the V matrix to be loaded into SMEM with transposed group
structure, producing garbage output (cos=0.158 vs BF16 reference).
- New kernel: dsv4/kernels/cuda/indexer_fp8_score_topk.cu
- Native Blackwell FP8 GEMM via tcgen05.mma.kind::f8f6f4
- Q (n_ih=64, ihd=128) quantized BF16→FP8, K consumed directly as FP8_E4M3
- TMEM read using 16x256b.x1 (4-warps parallel, proven from B1 FMHA)
- On-the-fly: dequant (q_scale*k_scale) → ReLU → weighted sum → top-k
- No global BF16 staging of indexer keys, no FP32 einsum on CUDA cores
- Per-thread register heap top-k (same algorithm as indexer_score_topk.cu)
- Modified: single_shot_inference.py
- Indexer.forward() now takes kv_cache directly (not comp_idx_kv BF16)
- Consumes FP8 indexer keys from cache without BF16 dequantization
- Dispatches to B2 FP8 kernel for T=1, n_ih=64, ihd=128 (production decode)
- FP32 einsum fallback retained only for T>1 (prefill)
- Removed 'Intentional first-pass limits' section from B1 doc
(those limits ARE the correct production design, not shortcuts)
torch.utils.cpp_extension.load creates a 'lock' file in the build
directory during compilation. If the compiling process is killed
(OOM, timeout, user interrupt), the lock file is never removed and
subsequent processes spin forever polling it (clock_nanosleep(100ms)
→ stat(lock) → repeat).
Fix: _cleanup_stale_lock() removes lock files older than 10 minutes
before any compilation attempt. This is the correct threshold — CUDA
kernel compilation should never take more than a few minutes, so a
10-minute-old lock is guaranteed stale.
DSV4 is a reasoning model. The standard prompt format is:
BOS <|User|> prompt <|Assistant|> ◇
Without the ◇ priming, the model is out-of-distribution — it expects to
be inside a thinking block but never received the sentinel. This causes
degenerate output from step 0 (France instead of Paris, looping on
newlines/repeated tokens).
With ◇, the model will:
1. Generate thinking content (reasoning)
2. Emit ◇ (think_end=128822) to close the thinking block
3. Produce the actual answer
4. Emit EOS (token 1)
This matches the pattern described in the Kimi K2 accuracy blog:
https://vllm.ai/blog/2025-10-28-kimi-k2-accuracy — malformed
prompt formatting is the #1 cause of degenerate output in chat-tuned
reasoning models.
Previously only stopped on tokenizer.eos_token_id. DSV4 uses special
turn-end tokens (<|end_of_sentence|>, USER_TOKEN=128803) that indicate
the assistant turn is complete. Missing these caused decode to continue
past the model's natural stopping point, producing degenerate output.
Also increased diagnostic logging (every step for first 20 steps) to
catch turn-end token emissions.
1. score_topk.py: Fix docstring — K^IComp[s] is shared (MQA), not per-head K^IComp[s,h]
Matches the .cu kernel and production Indexer.forward() einsum.
2. score_topk.py: Add WARNING about valid_lens broadcast being wrong for batched prefill
3. csa_indexer.py: Replace random weights with RuntimeError — CSAIndexer has no
checkpoint loading. Production uses the Indexer class in single_shot_inference.py.
4. csa_indexer.py: Document RoPE assumption — indexer queries/keys have no RoPE.
NEEDS VERIFICATION against HF reference.
The CUDA loader (dsv4/kernels/cuda/loader.py) resolves all .cu
files relative to dsv4/kernels/cuda/. The indexer/ subfolder copies
were never loaded — they were dead code that could silently diverge
from the canonical copies in cuda/.
- Added --ab-compare flag to run both fused and unfused paths for first 3 layers
- Compares x_normed, gsa values, FP4 data, and GEMM outputs (q_a, kv)
- Added --no-fused-rmsnorm to disable P4 and use unfused path
- This will help diagnose the correctness regression introduced by P4