-
d8b421ccee
Multi-row FMHA kernel (Milestone 4): T>1 prefill support with 4-warp parallel softmax
biondizzle
2026-05-28 20:04:29 +00:00
-
adc88613fa
Milestone 5 COMPLETE: multi-head FMHA grid launch verified on B200
biondizzle
2026-05-28 19:35:06 +00:00
-
3fd302e7a0
Fix nvcc goto-bypasses-init errors in multi-head test
biondizzle
2026-05-28 19:33:04 +00:00
-
aa41cfa2e5
Multi-head FMHA kernel (Milestone 5): grid launch with MHA/MQA/batch support
biondizzle
2026-05-28 19:32:35 +00:00
-
6af2feb42a
TMA 5D test: element stride decomposition
biondizzle
2026-05-28 19:18:01 +00:00
-
96f2f0bb90
auto: pre-test commit
biondizzle
2026-05-28 19:12:23 +00:00
-
015435b1ab
auto: pre-test commit
biondizzle
2026-05-28 19:09:50 +00:00
-
41343fdc6b
auto: pre-test commit
biondizzle
2026-05-28 19:08:04 +00:00
-
a723b524f7
TMA alignment test
biondizzle
2026-05-28 17:00:20 +00:00
-
c54a83960d
TMA debug: fix globalStrides to tensorRank-1 elements
biondizzle
2026-05-28 16:58:30 +00:00
-
944e567b6c
TMA debug: test various CUtensorMap configs
biondizzle
2026-05-28 16:55:25 +00:00
-
55d289c65b
Fix TMA: use CU_TENSOR_MAP_DATA_TYPE_BFLOAT16 not UINT16
biondizzle
2026-05-28 16:51:40 +00:00
-
0fd3e12a52
Fix TMA test: globalStrides in bytes not elements
biondizzle
2026-05-28 16:46:56 +00:00
-
ad8050bbad
WIP: TMA load test infrastructure (manual compile needed)
biondizzle
2026-05-28 16:45:04 +00:00
-
d9df1e6486
auto: pre-test commit
biondizzle
2026-05-28 16:42:24 +00:00
-
a4211559cf
auto: pre-test commit
biondizzle
2026-05-28 16:40:51 +00:00
-
3b8fdcc823
auto: pre-test commit
biondizzle
2026-05-28 16:39:45 +00:00
-
072fbf0b5d
auto: pre-test commit
biondizzle
2026-05-28 16:36:53 +00:00
-
090f2866ae
Update CURRENT_ISSUE: 6-warp Milestone 1 complete
biondizzle
2026-05-28 16:35:02 +00:00
-
b3020c2811
6-warp specialized FMHA kernel — ALL HD=16/64/128/256 PASS cos 0.999997+
biondizzle
2026-05-28 16:30:55 +00:00
-
2a6d72912a
auto: pre-test commit
biondizzle
2026-05-28 16:28:58 +00:00
-
e74c84458c
Clean up E2M1 dequant: use LUT approach (consultant recommendation)
biondizzle
2026-05-28 16:17:47 +00:00
-
79ef87f9a9
FIX: E2M1 FP4 dequantization bug in indexer_score_topk.cu
biondizzle
2026-05-28 16:16:24 +00:00
-
44c4bade5f
Rewrite fmha_sm100_tc.cuh with working N=16 PV sub-tile approach
biondizzle
2026-05-28 16:04:11 +00:00
-
a18d9c1584
Update CURRENT_ISSUE: ALL HD=16/64/128/256 PASS cos 0.999997+
biondizzle
2026-05-28 16:03:05 +00:00
-
01319d7247
auto: pre-test commit
biondizzle
2026-05-28 15:59:22 +00:00
-
43516ed4ec
auto: pre-test commit
biondizzle
2026-05-28 15:55:59 +00:00
-
1ec3e1ed2c
auto: pre-test commit
biondizzle
2026-05-28 15:55:18 +00:00
-
babff1f402
auto: pre-test commit
biondizzle
2026-05-28 15:54:05 +00:00
-
2b007d2008
auto: pre-test commit
biondizzle
2026-05-28 15:53:39 +00:00
-
84b997881f
auto: pre-test commit
biondizzle
2026-05-28 15:53:04 +00:00
-
6e5401df3b
auto: pre-test commit
biondizzle
2026-05-28 15:51:55 +00:00
-
102174fade
auto: pre-test commit
biondizzle
2026-05-28 15:50:52 +00:00
-
2dcfc0089f
auto: pre-test commit
biondizzle
2026-05-28 15:49:47 +00:00
-
1cdb90462f
auto: pre-test commit
biondizzle
2026-05-28 15:48:15 +00:00
-
80fd612132
auto: pre-test commit
biondizzle
2026-05-28 15:47:58 +00:00
-
9583cbc67a
auto: pre-test commit
biondizzle
2026-05-28 15:46:53 +00:00
-
1b86860c19
auto: pre-test commit
biondizzle
2026-05-28 15:46:16 +00:00
-
66cc117e11
auto: pre-test commit
biondizzle
2026-05-28 15:44:45 +00:00
-
2b32b51882
Update CURRENT_ISSUE with final session status
biondizzle
2026-05-28 15:22:32 +00:00
-
6249989cf6
Clean up HD=64 test, V layout verified correct
biondizzle
2026-05-28 15:21:33 +00:00
-
e1daad6955
Verify V SMEM values vs GMEM for HD=64
biondizzle
2026-05-28 15:19:31 +00:00
-
bafd26707b
FMHA HD=64 with BLOCK_MN_B=16, 4 N-tiles per K-tile
biondizzle
2026-05-28 15:17:40 +00:00
-
6896d1aebb
Update CURRENT_ISSUE: HD=16 done, HD=64 in progress
biondizzle
2026-05-28 15:16:19 +00:00
-
6b9b06647a
Clean up HD=64 debug prints, keep register-math PV check
biondizzle
2026-05-28 15:15:22 +00:00
-
5c9d471162
Add register-math PV reference for HD=64 debug
biondizzle
2026-05-28 15:13:47 +00:00
-
43e9efbc2b
Fix string literal
biondizzle
2026-05-28 15:12:20 +00:00
-
906be7ce50
Add filtered cosine (exclude near-zero)
biondizzle
2026-05-28 15:11:14 +00:00
-
40c83c769a
Fix: remove ×2 QK scale correction (MMA scale is 1.0, not 0.5)
biondizzle
2026-05-28 15:09:57 +00:00
-
6ea7356fdd
Debug: print P values for HD=64
biondizzle
2026-05-28 15:07:55 +00:00
-
4b052f22a5
Fix: opt into >48KB shared memory for HD=64
biondizzle
2026-05-28 15:06:37 +00:00
-
7becbfc07e
Fix: printf after var declarations
biondizzle
2026-05-28 15:03:25 +00:00
-
2d44f8e356
Debug: check if HD=64 kernel starts
biondizzle
2026-05-28 15:02:00 +00:00
-
46e4d07c71
Test PV SS MMA with B=(64,16) BLOCK_MN=64
biondizzle
2026-05-28 14:58:10 +00:00
-
465e089a2b
Add launch error check for HD=64
biondizzle
2026-05-28 14:56:07 +00:00
-
2fd64c464d
FMHA HD=64 with BLOCK_MN_B=64 for V, proper output dimensions
biondizzle
2026-05-28 14:54:10 +00:00
-
15ecc1f616
Full FMHA HD=64 with PV SS MMA (SMEM-P)
biondizzle
2026-05-28 14:52:29 +00:00
-
5b2e690936
Milestone: Full FMHA HD=16 with PV SS MMA (SMEM-P) — cosine 0.9997
biondizzle
2026-05-28 14:50:43 +00:00
-
78026839b7
Fix V canonical layout: swap g_mn/g_k indices (d=MN, lr=K)
biondizzle
2026-05-28 14:49:17 +00:00
-
9a3b43c42b
Fix reference to also use uniform P
biondizzle
2026-05-28 14:47:10 +00:00
-
75bdcbf728
Debug: override P with uniform 1/128
biondizzle
2026-05-28 14:46:21 +00:00
-
af93c283c7
Enable all 8 PV K-tiles
biondizzle
2026-05-28 14:45:13 +00:00
-
6f5be8a4e4
Debug: print P values
biondizzle
2026-05-28 14:44:09 +00:00
-
3d15f5bb21
Debug: 1 PV K-tile
biondizzle
2026-05-28 14:43:01 +00:00
-
284a06ddf1
FMHA v5: clean rewrite with QK + softmax + PV SS per K-tile
biondizzle
2026-05-28 14:42:13 +00:00
-
342193e0b4
Fix tb scope
biondizzle
2026-05-28 14:40:55 +00:00
-
a6f7ef7c45
Add softmax read from TMEM
biondizzle
2026-05-28 14:40:35 +00:00
-
38b0ff0bf8
Add QK GEMM to minimal PV test
biondizzle
2026-05-28 14:39:51 +00:00
-
e9f8f9e6e3
Minimal PV with s_p_vals in SMEM
biondizzle
2026-05-28 14:38:58 +00:00
-
97ebb964a2
Move s_p_vals to dynamic SMEM
biondizzle
2026-05-28 14:38:03 +00:00
-
d2387dd858
Full FMHA v4: per-K-tile P fill into reusable (128,16) buffer
biondizzle
2026-05-28 14:37:11 +00:00
-
78b470317f
PV accumulation debug with detailed TMEM read
biondizzle
2026-05-28 14:35:29 +00:00
-
dacbf53081
Test K-tiles 0-1 accumulated
biondizzle
2026-05-28 14:33:31 +00:00
-
bad31d9476
Test K-tile 1
biondizzle
2026-05-28 14:32:51 +00:00
-
9198ed734f
Test 1 PV K-tile from (128,128) P at offset 0
biondizzle
2026-05-28 14:32:10 +00:00
-
ce88cd6e9e
Zero TMEM manually, all K-tiles accumulate=true
biondizzle
2026-05-28 14:31:22 +00:00
-
727c509454
PV SS MMA with 8 K-tile accumulation
biondizzle
2026-05-28 14:30:09 +00:00
-
d5b0941f2e
PV SS MMA with (128,128) P layout
biondizzle
2026-05-28 14:29:13 +00:00
-
f94693fdc2
Fix: add back cudaDeviceSynchronize
biondizzle
2026-05-28 14:28:24 +00:00
-
fb8af865f4
Check launch error
biondizzle
2026-05-28 14:28:02 +00:00
-
738e39cb63
Debug: add printf at kernel start
biondizzle
2026-05-28 14:27:12 +00:00
-
9e13096bf8
Debug: skip QK, write P directly to SMEM, 1 PV K-tile
biondizzle
2026-05-28 14:26:36 +00:00
-
11da4daa01
Debug: single PV K-tile
biondizzle
2026-05-28 14:25:47 +00:00
-
8cb32cabc9
Fix asm constraint typo
biondizzle
2026-05-28 14:25:04 +00:00
-
36a50962b3
Full FMHA SMEM-P with scale calibration
biondizzle
2026-05-28 14:24:53 +00:00
-
4a36da9845
Minimal PV SS MMA test: A=128x16, B=16x16
biondizzle
2026-05-28 14:23:42 +00:00
-
77901834a9
Fix P K-tile offset: 2048 BF16 per (128,16) tile, not 1024
biondizzle
2026-05-28 14:22:27 +00:00
-
0bfc943cec
FMHA with SMEM-P approach: PV via SS MMA avoids TMEM layout mismatch
biondizzle
2026-05-28 14:21:42 +00:00
-
faeedd3643
Test TS MMA with non-uniform A data
biondizzle
2026-05-28 14:19:45 +00:00
-
570c5b5154
Test softmax→PV with 1 K-tile in isolation
biondizzle
2026-05-28 14:18:39 +00:00
-
a29ef77b64
QK→PV layout test: skip softmax to test TMEM layout compatibility
biondizzle
2026-05-28 14:17:37 +00:00
-
acf17e001e
Fix SMEM allocation (was half the needed size) + re-enable full pipeline
biondizzle
2026-05-28 14:16:43 +00:00
-
fa6c124163
Debug: QK only, skip softmax+PV
biondizzle
2026-05-28 14:15:18 +00:00
-
79cee32125
Debug: skip PV step entirely
biondizzle
2026-05-28 14:14:34 +00:00
-
47e9b8a413
Debug: single PV K-tile
biondizzle
2026-05-28 14:13:57 +00:00
-
414b3f4f92
Full FMHA HD=16 with PV GEMM via tcgen05.mma TS
biondizzle
2026-05-28 14:13:11 +00:00
-
ed8f48dddf
Add systematic SS+TS sequence test to debug MMA coexistence crash
biondizzle
2026-05-28 14:10:07 +00:00
-
6a3159dfd9
test: PV then QK to find ordering issue
biondizzle
2026-05-28 13:53:35 +00:00
-
640233cb87
test: PV GEMM first (before QK) to test ordering
biondizzle
2026-05-28 13:52:18 +00:00
-
d4ed3fa06f
test: QK GEMM + PV GEMM combined test
biondizzle
2026-05-28 13:50:47 +00:00