Commit Graph

2293 Commits

Author SHA1 Message Date
801bfc9a83 add router mode debug print 2026-06-03 05:15:52 +00:00
b385ecc05e PART A: decode diagnostics test — production vs reference per-layer X comparison at decode step 2026-06-03 05:06:40 +00:00
d518fcb82a test: correct sink bias reference — denominator-only, no V contribution 2026-06-03 04:57:37 +00:00
9574a9dc2e test: add sink bias to reference SDPA in decode FMHA comparison 2026-06-03 04:53:55 +00:00
9a9b347b2b test: add per-head magnitude ratio diagnostics to decode FMHA test 2026-06-03 04:50:23 +00:00
f5fa20c581 fix: syntax error — missing closing paren in indexer.forward call 2026-06-03 04:46:41 +00:00
693975ec92 fix: device mismatches in decode FMHA test — dec_pos must be on per-layer GPU 2026-06-03 04:46:24 +00:00
e1d96c509d test: decode FMHA layer comparison — checks FMHA accuracy during decode step 2026-06-03 04:39:12 +00:00
1ebe7f0dde Add PART_A_NEXT_SESSION.md: clues for decode degeneration debugging 2026-06-03 04:34:28 +00:00
d8306be3f2 Fix PART A test: proper FP8 quantization and MQA reference 2026-06-03 04:20:36 +00:00
4126909dfb Simplify PART A test: compressor + FMHA at production scale 2026-06-03 04:18:13 +00:00
8c54cfa748 Fix KVCache init in PART A test 2026-06-03 04:15:41 +00:00
04cf8ca848 Add PART A diagnostic tests: compressor + KV cache + FMHA at production scale 2026-06-03 04:13:53 +00:00
75288bd12f Wire prefill FMHA into production.py and single_shot
- Add dsv4_attention_mixed_fp8_prefill to production.py
- _run_production_fmha_mixed now dispatches to prefill kernel for T>1
- Remove decode-only T==1 restriction
- Update FINAL_STRETCH.md: prefill marked DONE, batched prefill TODO noted
2026-06-03 03:49:57 +00:00
5417f65b08 CRITICAL FIX: Add T-dimension strides to prefill FMHA kernel
The kernel was using head strides for the T (query row) dimension,
which happened to work for T=1 (qr=0 always) but was wrong for T>1.

For (B,H,T,NOPE) layout:
- Head stride = T*NOPE, but T stride = NOPE
- Scale head stride = T, but T stride = 1
- RoPE head stride = T*ROPE, but T stride = ROPE

Added q_nope_t_stride, q_scale_t_stride, q_rope_t_stride to params
struct, C API, and Python wrapper.
2026-06-03 03:48:17 +00:00
dd1cbe1faa Fix smem size for prefill debug test 2026-06-03 03:47:01 +00:00
09384a637a Fix constexpr issues in prefill debug test 2026-06-03 03:46:29 +00:00
d3dc8cf901 Add prefill T=2 debug CUDA test with intermediate value printing 2026-06-03 03:46:14 +00:00
223c22488f Simplify prefill PV read: use decode kernel's exact pattern
Replace complex n_sub-iterating read with the same HD/8 iteration
pattern as the proven decode kernel. Extract from lane qr%32 instead
of always lane 0. For qr>=32, use warp 1; for qr>=64, add TMEM offset.

This should fix the row 1 accuracy issue (was cos=0.94 vs decode).
2026-06-03 03:22:49 +00:00
2bf5e74e61 Add prefill debug test: compare T=1 decode vs prefill kernel step by step 2026-06-03 03:05:25 +00:00
eb69c3bfb9 CRITICAL FIX: add missing tb base in QK TMEM read address
prefill_read_qk_rows was reading from address 0 (sg_off + n * 8)
instead of tb + sg_off + n * 8. This caused garbage QK values,
explaining the 0.928 cosine for T=1 and NaN for T>1.
2026-06-03 03:00:57 +00:00
99b6de316b Fix prefill kernel: add missing tb base in PV TMEM read, fix ACCUMULATE for per-row PV
Two critical fixes:
1. prefill_read_pv_all_subs: was missing 'tb' base in TMEM read address
2. PV MMA ACCUMULATE: use pv_kt == 0 (not kv_tile==0 && pv_kt==0 && n_sub==0)
   so each query row's PV starts fresh instead of accumulating into previous row's result
2026-06-03 02:59:19 +00:00
9034f67b0f Fix prefill kernel: read ALL n_sub PV results (was only n_sub=0)
Critical bug: prefill_read_pv_row only read n_sub=0 (16 out of 512 HD dims).
Replaced with prefill_read_pv_all_subs that iterates over all 32 n_sub groups.
Also fixed TMEM row-group/warp mapping for rows 32-127.
2026-06-03 02:54:59 +00:00
a4ef6c3454 Add B1 mixed FP8 prefill FMHA kernel (T>1 support)
New files:
- fmha_mixed_fp8_prefill.cuh: kernel supporting T=1..128
  - Sub-batch processing (T_BATCH=32) to fit in 232KB SMEM
  - Multi-row QK TMEM read using tcgen05.ld.32x32b.x8
  - Per-row online softmax
  - Per-row PV MMA (correctness first; batched PV is TODO)
  - Attention sink support
- fmha_mixed_fp8_prefill_capi.cu: C API bridge
- fmha_mixed_fp8_prefill_op.py: Python ctypes loader
- test_b1_mixed_fp8_prefill.py: unit test (T=1..32, N=128..4096)

Also: fix production FMHA layer test (BF16 fallback for o_a_proj,
router gate BF16 quantize path, missing DEVICE constant)
2026-06-03 02:50:27 +00:00
1f757151ef Fix router gate BF16 quantize path for production FMHA test 2026-06-03 02:47:47 +00:00
07168357cc Fix o_a_proj weight loading: add BF16 fallback for grouped linear 2026-06-03 02:38:00 +00:00
27d8d80a40 Fix missing DEVICE constant in production FMHA test 2026-06-03 02:31:11 +00:00
26a817c2f2 Fix production FMHA layer test: compare raw FMHA vs SDPA on production gathered KV
Phase 1: Run full pipeline to populate KV caches with real model weights.
Phase 2: For each layer, gather KV in mixed FP8/BF16 format, run both
production FMHA and PyTorch SDPA, compare cosine similarity.

Uses random Q (not model-generated) to isolate FMHA kernel accuracy
from upstream pipeline issues.
2026-06-03 02:26:37 +00:00
ba67e055f7 Add production FMHA layer comparison test
Test loads real model weights, runs attention forward for layers 0-4,
compares production B1 mixed FP8 FMHA output vs PyTorch SDPA reference.
This will reveal the FMHA cosine degradation (was 0.679 at L1) with
real data patterns, not just synthetic random data.

Production values: HD=512, NOPE=448, ROPE=64, H=128, 8 GPUs.
2026-06-03 02:22:23 +00:00
af58f2c5b2 Add B1 weight/format verification at L0 in single_shot v-b1-b2-done-20260603 2026-06-03 01:52:55 +00:00
8df5de5477 Update B1 docs with test results and bug fix 2026-06-03 01:50:59 +00:00
3e3b352e7e Update FINAL_STRETCH.md: B1 and B2 marked DONE with test results and bug fixes 2026-06-03 01:50:21 +00:00
84a02f8995 Remove debug test files, keep production B1/B2 unit tests 2026-06-03 01:49:39 +00:00
6fa9ad7852 B2 indexer: adopt TMEM warp-to-row mapping fix
Key insight: tcgen05.ld.32x32b.x8 maps warp 0 to rows 0-31 and warp 1 to
rows 32-63 from the SAME TMEM address. The hardware routes row slices
based on warp position in the warpgroup.

Fix approach (from external LLM review):
- Warps 0-1 both read from tb + col_base (same address)
- Each warp writes partial scores to its own sWarpScores partition
- After __syncthreads(), merge both partitions for final 64-head scores
- No race conditions, no cross-warp accumulation bugs
2026-06-03 01:42:38 +00:00
6c92ff91f3 B2 indexer: temporary heads 0-31 only while figuring out TMEM row 32-63 layout 2026-06-03 01:12:10 +00:00
7732c93f62 Fix B2 indexer: use 16x256b.x1 TMEM read with TMEM_COLS=512
Revert to 16x256b.x1 approach (reads 64 rows from single column).
Previous hang was likely due to TMEM_COLS=128 (too small).
With TMEM_COLS=512, the full 128-row MMA output fits in TMEM.

Lane i reads rows 4i..4i+3. Lanes 0-15 cover rows 0-63.
4 warps (0-3) each process 32 columns, computing weighted ReLU scores.
2026-06-03 01:08:48 +00:00
a75a9843af Fix B2 indexer: add sLogits scratch buffer to SMEM layout 2026-06-03 00:59:06 +00:00
cc7b17fdaa Fix B2 indexer: use 2-warps for TMEM read (P7 row-slice model)
ROOT CAUSE: The TMEM read for rows 32-63 was wrong. The 32x32b.x8
instruction reads 32 rows per warp. Per P7 docs, warp 0 sees rows 0-31
and warp 1 sees rows 32-63 from the SAME TMEM address. There is no TMEM
offset for different row groups — the row-to-lane mapping depends on
the warp ID.

Fix: warp 0 reads heads 0-31, warp 1 reads heads 32-63 from tb + col_base.
Cross-warp reduce via SMEM to compute full 64-head weighted ReLU scores.
2026-06-03 00:55:27 +00:00
8d0a02ca67 B2 TMEM debug: try stride=SK_TILE/8=16 for row group 32-63 2026-06-03 00:52:32 +00:00
fdf702470c Add B2 TMEM read debug kernel and test 2026-06-03 00:50:52 +00:00
f1cf4c0215 Add B2 QK debug test with w_h=1 for simple comparison 2026-06-03 00:46:48 +00:00
d36dbba01c Fix B2 indexer: increase TMEM_COLS to 512 for full 128-row MMA output
The MMA produces 128 rows × 128 cols = 4 row-groups × 128 TMEM cols = 512 total.
Even though we only read rows 0-63, the MMA writes all 128 rows.
TMEM_COLS must match the MMA output size, not just the read size.
2026-06-03 00:45:15 +00:00
797345dfe9 Add B2 score debug test 2026-06-03 00:43:44 +00:00
afb82b9c89 Fix B2 indexer: replace broken 16x256b TMEM read with proven 32x32b.x8
ROOT CAUSES:
1. tcgen05.ld.16x256b.x1 was hanging — either invalid instruction or unaligned
2. TMEM_COLS=128 was too small for 64-row MMA output (needs 256 for 2 row-groups)
3. TMEM row-group addressing: rows 32-63 are at offset SK_TILE (128) in TMEM

Fixes:
- Use tcgen05.ld.32x32b.x8 (proven in B1 FMHA) instead of 16x256b.x1
- Increase TMEM_COLS from 128 to 256
- Read both row-groups (0-31 and 32-63) per 8-column chunk
- Each lane handles head i (from row-group 0) and head 32+i (from row-group 1)
- Warp-level reduce sums contributions from all 64 heads per column
2026-06-03 00:39:49 +00:00
99e50fcb58 Add B2 minimal debug test to find hang point 2026-06-03 00:35:48 +00:00
e21bd14408 Fix B1 test LSE reference shape handling 2026-06-03 00:25:53 +00:00
4fe7f9dc37 Fix B1 FMHA: swap V matrix canonical layout args (dd, kk) not (kk, dd)
ROOT CAUSE: canon_idx_bf16_16x16(kk, dd) was swapping the outer/inner group
structure compared to the working TMA-loaded V layout in the multitile kernel.

Working layout: (lr/8)*128 + (dd/8)*64 + (dd%8)*8 + (lr%8)
B1 with (kk,dd): (dd/8)*128 + (kk/8)*64 + (kk%8)*8 + (dd%8)  <- WRONG
B1 with (dd,kk): (kk/8)*128 + (dd/8)*64 + (dd%8)*8 + (kk%8)  <- CORRECT

This caused the V matrix to be loaded into SMEM with transposed group
structure, producing garbage output (cos=0.158 vs BF16 reference).
2026-06-03 00:24:20 +00:00
29a95a3db6 Add B1 QK vs PV isolation test 2026-06-03 00:23:35 +00:00
c322e3f301 Add B1 FMHA debug test for cosine failure investigation 2026-06-03 00:22:00 +00:00
5447d1d1dc Add comprehensive B2 FP8 indexer unit test 2026-06-03 00:21:29 +00:00