Commit Graph

1621 Commits

SHA1 Message Date
95003eced2 test: 16x256b.x1 loads with uint32_t regs, matching working pattern 2026-05-28 23:03:10 +00:00
fffb493b0e fix: 16x256b.x1 load syntax — single address operand 2026-05-28 23:02:23 +00:00
44dcd6e8d0 test: 16x256b.x1 multiple LOADS — do they crash like stores? 2026-05-28 23:02:03 +00:00
d54bce6a6d fix: correct SMEM size for MMA 4-warp test 2026-05-28 23:01:12 +00:00
be45e87891 test: MMA→4-warp TMEM read — do warps see different rows? 2026-05-28 23:00:27 +00:00
6b0d57074a test: TMEM cross-warp visibility with different sync strategies 2026-05-28 22:59:31 +00:00
77d190278e test: simpler TMEM 4-warp read — direct store+load 2026-05-28 22:58:48 +00:00
91b03bd6bd test: verify 4-warp TMEM read with 32x32b.x8 after MMA 2026-05-28 22:57:59 +00:00
28e04a5ea8 fix: use __cvta_generic_to_shared directly for 64-bit compat 2026-05-28 22:56:29 +00:00
1d6a95df32 fix: typo in tmem row offset test 2026-05-28 22:56:15 +00:00
cf6fe71368 test: verify TMEM 32x32b.x8 row offset addressing 2026-05-28 22:56:00 +00:00
4cfb707405 fix: correct SMEM size calculation in multirow test 2026-05-28 22:53:46 +00:00
863a030c3b fmha_multirow: rewrite with 32x32b.x8 only, no s_p_vals, row_page addressing
- Kill 64KB s_p_vals buffer — P is streamed per K-tile through sPk
- All TMEM ops use 32x32b.x8 exclusively (16x256b.x1 crashes on 2nd call)
- T>32: 4 softmax warps use row_page offset in TMEM address (row<<16)
- Lane l in warp w handles row w*32+l
- Two-pass softmax: pass 1 row_max, pass 2 exp/sum interleaved with PV
- PV: N=16 sub-tiles, SS MMA sPk(128,16) × sV(16,16) → TMEM
- Epilogue: 32x32b.x8 TMEM read, normalize, BF16 → GMEM
- SMEM budget: ~14KB (well within 232KB)
2026-05-28 22:52:52 +00:00
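The lane-to-row mapping and the `row<<16` offset from the entry above can be sketched on the host as follows. This is a minimal sketch, assuming the TMEM address keeps the row in its upper 16 bits and the column in its lower 16 bits, as the `row<<16` note suggests. `row_for_lane` and `tmem_addr` are hypothetical helper names, not identifiers from the repository.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper: lane l of softmax warp w owns row w*32 + l.
inline uint32_t row_for_lane(uint32_t warp, uint32_t lane) {
    assert(lane < 32u);  // a warp has 32 lanes
    return warp * 32u + lane;
}

// Hypothetical helper: place the row in the upper 16 bits of the address
// (assumed layout) and the column in the lower 16 bits.
inline uint32_t tmem_addr(uint32_t base, uint32_t row, uint32_t col) {
    assert(row < 65536u && col < 65536u);
    return base + (row << 16) + col;
}
```

With four softmax warps this covers rows 0 through 127, for example `row_for_lane(3, 31)` is row 127.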
1ba304db3e stuff 2026-05-28 21:08:13 +00:00
deaa3ec725 CRITICAL FIX: Q/K SMEM canonical layout must use local d (0..15) not full_d — UMMA descriptor reads from sQ0/sK0 start, not offset 2026-05-28 20:13:52 +00:00
08694b8136 Fix multi-row softmax v3: 32x32b.x8 with per-lane per-row (no wmax/wsum), per-row sRowMax/sRowSum arrays 2026-05-28 20:10:13 +00:00
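The per-row softmax described above, in which each lane reduces its own row with no cross-lane max or sum, might look like this on the host. It is a reference sketch only, not the kernel code. `RowStats` and `softmax_row_inplace` are hypothetical names, and the returned values stand in for one entry each of the per-row `sRowMax` and `sRowSum` arrays.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <vector>

// Hypothetical stand-in for one entry of sRowMax / sRowSum.
struct RowStats {
    float row_max;
    float row_sum;
};

// Softmax over a single row, using only that row's own values.
inline RowStats softmax_row_inplace(std::vector<float>& scores) {
    assert(!scores.empty());
    // Pass 1: row maximum, for numerical stability.
    float row_max = scores[0];
    for (float s : scores) row_max = std::max(row_max, s);
    // Pass 2: exponentiate and accumulate the row sum.
    float row_sum = 0.0f;
    for (float& s : scores) {
        s = std::exp(s - row_max);
        row_sum += s;
    }
    // Normalize so the row sums to 1.
    for (float& s : scores) s /= row_sum;
    return RowStats{row_max, row_sum};
}
```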
aaa76c1af1 Rewrite multi-row softmax using 16x256b.x1 TMEM reads for proper multi-row access 2026-05-28 20:08:30 +00:00
5e3c61184c Fix multi-row softmax: remove cross-lane wmax/wsum — each lane handles its own row independently 2026-05-28 20:06:16 +00:00
bf4dfd131b Fix nvcc goto-bypasses-init: move var decls before goto targets 2026-05-28 20:04:59 +00:00
2b09d4f2ef Fix nvcc goto-bypasses-init in multi-row test 2026-05-28 20:04:45 +00:00
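The pattern behind the two goto fixes above can be illustrated with a small example. In C++ a `goto` may not jump past the initialization of a variable that is still in scope at the label, so declarations are hoisted above the first `goto` and later given values by plain assignment. `scaled_or_error` is a hypothetical function written for illustration, not code from the repository.

```cpp
// Returns -1 for negative input, otherwise n scaled by 2.
inline int scaled_or_error(int n) {
    // Declared before any goto, so no jump crosses an initialization.
    int result = -1;
    int scale = 0;

    if (n < 0) goto done;

    scale = 2;           // assignment, not initialization
    result = n * scale;

done:
    return result;
}
```

Had `int scale = 2;` been declared after the `goto`, the compiler would reject the jump to `done`.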
d8b421ccee Multi-row FMHA kernel (Milestone 4): T>1 prefill support with 4-warp parallel softmax 2026-05-28 20:04:29 +00:00
adc88613fa Milestone 5 COMPLETE: multi-head FMHA grid launch verified on B200
All HD=16/64/128/256 pass across MHA (4+8 heads), MQA, batched modes.
cos 0.999997+, LSE matches reference. Updated CURRENT_ISSUE.md.
2026-05-28 19:35:06 +00:00
3fd302e7a0 Fix nvcc goto-bypasses-init errors in multi-head test 2026-05-28 19:33:04 +00:00
aa41cfa2e5 Multi-head FMHA kernel (Milestone 5): grid launch with MHA/MQA/batch support
- fmha_6warp_multihead.cuh: grid=(1, n_h, batch) kernel with FmhaParams
- MQA support via k_head_stride=0 / v_head_stride=0
- LSE output for multi-segment KV merge composition
- test_fmha_6warp_multihead.cu: MHA (4+8 heads), MQA, batched tests
- HD-specific wrappers for hd=16/64/128/256
- Marked E2M1 dequant bug as FIXED in consultant issue file
2026-05-28 19:32:35 +00:00
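The entry above mentions an LSE output for multi-segment KV merge composition. One common way to combine two partial attention results is sketched below. It assumes each segment's output is already normalized by its own softmax sum and that the LSE is a natural-log sum of exponentials. The log does not show the repository's actual merge code, and `Partial` and `merge_segments` are hypothetical names.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical container for one KV segment's result for a single query row.
struct Partial {
    std::vector<float> out;  // normalized attention output
    float lse;               // log-sum-exp of that segment's scores
};

// Merge two segments by weighting each output with exp(lse - max_lse).
inline Partial merge_segments(const Partial& a, const Partial& b) {
    assert(a.out.size() == b.out.size());
    const float m = std::max(a.lse, b.lse);
    const float wa = std::exp(a.lse - m);
    const float wb = std::exp(b.lse - m);
    const float denom = wa + wb;
    Partial r;
    r.out.resize(a.out.size());
    for (std::size_t i = 0; i < a.out.size(); ++i) {
        r.out[i] = (wa * a.out[i] + wb * b.out[i]) / denom;
    }
    r.lse = m + std::log(denom);  // LSE over the union of both segments
    return r;
}
```

Because the merged result carries its own LSE, further segments can be folded in one at a time.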
6af2feb42a TMA 5D test: element stride decomposition 2026-05-28 19:18:01 +00:00
96f2f0bb90 auto: pre-test commit 2026-05-28 19:12:23 +00:00
015435b1ab auto: pre-test commit 2026-05-28 19:09:50 +00:00
41343fdc6b auto: pre-test commit 2026-05-28 19:08:04 +00:00
a723b524f7 TMA alignment test 2026-05-28 17:00:20 +00:00
c54a83960d TMA debug: fix globalStrides to tensorRank-1 elements 2026-05-28 16:58:30 +00:00
944e567b6c TMA debug: test various CUtensorMap configs 2026-05-28 16:55:25 +00:00
55d289c65b Fix TMA: use CU_TENSOR_MAP_DATA_TYPE_BFLOAT16 not UINT16 2026-05-28 16:51:40 +00:00
0fd3e12a52 Fix TMA test: globalStrides in bytes not elements 2026-05-28 16:46:56 +00:00
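The two `globalStrides` fixes above (strides in bytes rather than elements, and tensorRank-1 entries) can be summarized in a small host-side helper. This is a sketch assuming a densely packed tensor whose dimensions are listed innermost first. `tma_global_strides_bytes` is a hypothetical name, and the result would still need to satisfy the driver's alignment requirements for `cuTensorMapEncodeTiled`.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Byte strides for dimensions 1..rank-1 of a densely packed tensor.
// dims[0] is the innermost (contiguous) dimension; its stride is implicit,
// so the returned array has rank-1 entries.
inline std::vector<uint64_t> tma_global_strides_bytes(
        const std::vector<uint64_t>& dims, std::size_t elem_bytes) {
    assert(!dims.empty() && elem_bytes > 0);
    std::vector<uint64_t> strides;
    uint64_t running = elem_bytes;
    for (std::size_t i = 0; i + 1 < dims.size(); ++i) {
        running *= dims[i];          // bytes spanned by one step of dim i+1
        strides.push_back(running);
    }
    return strides;
}
```

For a BF16 tensor with dimensions {64, 128, 4} this gives two strides, 128 and 16384 bytes.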
ad8050bbad WIP: TMA load test infrastructure (manual compile needed) 2026-05-28 16:45:04 +00:00
d9df1e6486 auto: pre-test commit 2026-05-28 16:42:24 +00:00
a4211559cf auto: pre-test commit 2026-05-28 16:40:51 +00:00
3b8fdcc823 auto: pre-test commit 2026-05-28 16:39:45 +00:00
072fbf0b5d auto: pre-test commit 2026-05-28 16:36:53 +00:00
090f2866ae Update CURRENT_ISSUE: 6-warp Milestone 1 complete 2026-05-28 16:35:02 +00:00
b3020c2811 6-warp specialized FMHA kernel — ALL HD=16/64/128/256 PASS cos 0.999997+
Warp layout (192 threads):
- Warps 0-3: Softmax + correction + epilogue
- Warp 4: MMA (QK + PV GEMM)
- Warp 5: Data staging (Q/K/V loads, direct GMEM for now)
CTA-wide __syncthreads() sync between phases.

Fix: removed spurious inv_sum normalization in epilogue
(MMA output is already correctly scaled with softmax'd P).

Files: fmha_6warp.cuh + test_fmha_6warp*.cu
2026-05-28 16:34:14 +00:00
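The warp layout listed in the entry above maps a thread index to a role as sketched below. `WarpRole` and `role_for_thread` are hypothetical names chosen for illustration, and the sketch assumes a 192-thread block with 32 threads per warp.

```cpp
#include <cassert>

// Roles from the 6-warp layout: warps 0-3, warp 4, warp 5.
enum class WarpRole { Softmax, Mma, Staging };

// Map a thread index within the 192-thread CTA to its warp's role.
inline WarpRole role_for_thread(unsigned tid) {
    assert(tid < 192u);  // 6 warps x 32 threads
    const unsigned warp = tid / 32u;
    if (warp < 4u) return WarpRole::Softmax;  // softmax + correction + epilogue
    if (warp == 4u) return WarpRole::Mma;     // QK and PV GEMM
    return WarpRole::Staging;                 // Q/K/V loads
}
```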
2a6d72912a auto: pre-test commit 2026-05-28 16:28:58 +00:00
e74c84458c Clean up E2M1 dequant: use LUT approach (consultant recommendation)
Both indexer files now use a constexpr LUT matching Python's
E2M1_MAGNITUDES = [0, 0.5, 1, 1.5, 2, 3, 4, 6].
This is cleaner and more auditable than bit-manipulation.
2026-05-28 16:17:47 +00:00
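A lookup-table decode along the lines described above might look like this. The magnitudes are taken from the commit message. The sign handling (bit 3 of the nibble) and the names `kE2M1Magnitudes` and `dequant_e2m1` are assumptions made for this sketch, not the repository's code.

```cpp
#include <cstdint>

// Magnitudes for the 3 low bits of an E2M1 nibble, as listed in the commit.
constexpr float kE2M1Magnitudes[8] = {0.0f, 0.5f, 1.0f, 1.5f,
                                      2.0f, 3.0f, 4.0f, 6.0f};

// Decode one FP4 nibble and apply a scale.
// Assumption: bit 3 is the sign bit, bits 0-2 index the magnitude table.
inline float dequant_e2m1(uint8_t nibble, float scale) {
    const float mag = kE2M1Magnitudes[nibble & 0x07];
    const float signed_mag = (nibble & 0x08) ? -mag : mag;
    return signed_mag * scale;
}
```

For example, nibble `0x7` with scale 1 decodes to 6, where a raw integer interpretation would have produced 7.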
79ef87f9a9 FIX: E2M1 FP4 dequantization bug in indexer_score_topk.cu
The dequant_fp4_scalar function was treating the magnitude bits as
a raw integer (0-7) instead of the E2M1 floating-point format:
  Old (WRONG): val = (int)(nibble & 0x07) * scale
  New (CORRECT): proper E2M1 decode with exponent + mantissa

E2M1 encoding (bias=1):
  exp=0 subnormal: 0b000=0, 0b001=0.5
  exp=1: 0b010=1, 0b011=1.5
  exp=2: 0b100=2, 0b101=3
  exp=3: 0b110=4, 0b111=6

Bug found by outside consultant. Affects indexer top-k selection
correctness — wrong FP4 key decoding would select wrong CSA blocks.

Fixed in both:
- dsv4/kernels/indexer/indexer_score_topk.cu
- dsv4/kernels/cuda/indexer_score_topk.cu
2026-05-28 16:16:24 +00:00
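The exponent and mantissa decode from the encoding table above can be written out directly. This is a minimal sketch of the magnitude decode only (sign and scale are omitted). `decode_e2m1_magnitude` is a hypothetical name and not the repository's `dequant_fp4_scalar`.

```cpp
#include <cmath>
#include <cstdint>

// Decode the 3 magnitude bits of an E2M1 value (2 exponent bits, 1 mantissa
// bit, exponent bias 1).
inline float decode_e2m1_magnitude(uint8_t bits3) {
    const int exp = (bits3 >> 1) & 0x3;
    const int man = bits3 & 0x1;
    if (exp == 0) {
        return 0.5f * static_cast<float>(man);  // subnormal: 0 or 0.5
    }
    // Normal: (1 + man/2) * 2^(exp - bias), with bias = 1.
    return std::ldexp(1.0f + 0.5f * static_cast<float>(man), exp - 1);
}
```

This reproduces the eight magnitudes 0, 0.5, 1, 1.5, 2, 3, 4 and 6 listed in the commit message.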
44c4bade5f Rewrite fmha_sm100_tc.cuh with working N=16 PV sub-tile approach
Production FMHA kernel template for Blackwell SM100:
- FmhaSm100Kernel<HD>::launch(q, k, v, o, s_k, scale, stream)
- QK: SS MMA N=128, one K-tile at a time
- PV: SS MMA N=16 sub-tiles (HD/16 calls per K-tile)
- Epilogue: TMEM → regs → BF16 → GMEM
- ~25KB SMEM for all HD values
- All HD=16/64/128/256 pass with cos 0.999997+
2026-05-28 16:04:11 +00:00
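The PV sub-tiling arithmetic mentioned above (HD/16 calls per K-tile) is small enough to state as code. This is only a sketch of the counting, assuming HD is a multiple of 16. `kPvSubtileN`, `pv_calls_per_ktile` and `pv_subtile_col` are hypothetical names.

```cpp
#include <cassert>

// Width of one PV sub-tile along the head dimension.
constexpr int kPvSubtileN = 16;

// Number of PV MMA calls issued for one K-tile.
inline int pv_calls_per_ktile(int hd) {
    assert(hd > 0 && hd % kPvSubtileN == 0);
    return hd / kPvSubtileN;
}

// First head-dimension column covered by sub-tile i.
inline int pv_subtile_col(int i) {
    assert(i >= 0);
    return i * kPvSubtileN;
}
```

For the tested head dimensions 16, 64, 128 and 256 this gives 1, 4, 8 and 16 calls per K-tile.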
a18d9c1584 Update CURRENT_ISSUE: ALL HD=16/64/128/256 PASS cos 0.999997+
Documented Layout D N=64 bug and N=16 sub-tile workaround.
2026-05-28 16:03:05 +00:00
01319d7247 auto: pre-test commit 2026-05-28 15:59:22 +00:00
43516ed4ec auto: pre-test commit 2026-05-28 15:55:59 +00:00
1ec3e1ed2c auto: pre-test commit 2026-05-28 15:55:18 +00:00
babff1f402 auto: pre-test commit 2026-05-28 15:54:05 +00:00
2b007d2008 auto: pre-test commit 2026-05-28 15:53:39 +00:00