|
|
3fd302e7a0
|
Fix nvcc goto-bypasses-init errors in multi-head test
|
2026-05-28 19:33:04 +00:00 |
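Note on the goto fix above: the diff is not shown here, so the following is only a sketch of the usual pattern for this class of error. A `goto` may not jump forward past the initialization of a local that is still in scope at the label. Declaring such locals before the first `goto`, or wrapping them in a block that closes before the label, avoids the diagnostic. The function `run_case` and its parameters are hypothetical.

```cpp
// Hypothetical test helper: returns 0 on success, 1 on failure.
int run_case(bool alloc_ok, bool launch_ok) {
    int status = 1;        // declared before any goto, so no jump crosses it
    float* buf = nullptr;  // same: visible at the label, initialized up front

    if (!alloc_ok) goto cleanup;
    {
        // Block scope: n_heads goes out of scope before the label, so the
        // goto above does not bypass an initialization that is live there.
        int n_heads = 8;
        if (!launch_ok) goto cleanup;
        status = (n_heads > 0) ? 0 : 1;
    }

cleanup:
    (void)buf;  // a real test would free device buffers here
    return status;
}
```

Calling `run_case(true, true)` takes the success path; either flag set to `false` jumps straight to `cleanup` with the failure status.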
|
|
|
aa41cfa2e5
|
Multi-head FMHA kernel (Milestone 5): grid launch with MHA/MQA/batch support
- fmha_6warp_multihead.cuh: grid=(1, n_h, batch) kernel with FmhaParams
- MQA support via k_head_stride=0 / v_head_stride=0
- LSE output for multi-segment KV merge composition
- test_fmha_6warp_multihead.cu: MHA (4+8 heads), MQA, batched tests
- HD-specific wrappers for hd=16/64/128/256
- Marked E2M1 dequant bug as FIXED in consultant issue file
|
2026-05-28 19:32:35 +00:00 |
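Note on the LSE output mentioned above: a per-row log-sum-exp lets partial attention results computed over separate KV segments be combined afterwards. The sketch below shows the standard merge for two segments on the host. It assumes each partial output is already softmax-normalized within its segment and that the LSE is a natural log; the kernel may use a different base or layout. `Partial` and `merge_partials` are hypothetical names, not taken from the repository.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <vector>

// One query row's result over one KV segment (hypothetical host-side type).
struct Partial {
    std::vector<float> out;  // softmax-normalized output, length = head dim
    float lse;               // natural log of sum(exp(score)) over the segment
};

// Merge two segment results into the result over the concatenated KV range.
Partial merge_partials(const Partial& a, const Partial& b) {
    const float m = std::max(a.lse, b.lse);  // subtract the max for stability
    const float wa = std::exp(a.lse - m);    // relative softmax mass of a
    const float wb = std::exp(b.lse - m);    // relative softmax mass of b
    const float denom = wa + wb;

    Partial r;
    r.lse = m + std::log(denom);
    r.out.resize(a.out.size());
    for (std::size_t i = 0; i < a.out.size(); ++i) {
        r.out[i] = (wa * a.out[i] + wb * b.out[i]) / denom;
    }
    return r;
}
```

Because the merge is associative, more than two segments can be folded in one at a time.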
|
|
|
6af2feb42a
|
TMA 5D test: element stride decomposition
|
2026-05-28 19:18:01 +00:00 |
|
|
|
96f2f0bb90
|
auto: pre-test commit
|
2026-05-28 19:12:23 +00:00 |
|
|
|
015435b1ab
|
auto: pre-test commit
|
2026-05-28 19:09:50 +00:00 |
|
|
|
41343fdc6b
|
auto: pre-test commit
|
2026-05-28 19:08:04 +00:00 |
|
|
|
a723b524f7
|
TMA alignment test
|
2026-05-28 17:00:20 +00:00 |
|
|
|
c54a83960d
|
TMA debug: fix globalStrides to tensorRank-1 elements
|
2026-05-28 16:58:30 +00:00 |
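Note on the two globalStrides fixes in this log: `cuTensorMapEncodeTiled` expects `globalStrides` to hold `tensorRank - 1` entries, expressed in bytes, with the innermost dimension's stride implied by the element size. Each entry must also be a multiple of 16. The host-only sketch below computes such an array for a densely packed tensor. The helper names are hypothetical and the dense-packing assumption may not match the test's actual layouts.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// global_dim[0] is the innermost (contiguous) dimension, in elements.
// Returns rank-1 strides in BYTES for dimensions 1..rank-1.
std::vector<std::uint64_t> tma_global_strides_bytes(
    const std::vector<std::uint64_t>& global_dim, std::size_t elem_bytes) {
    std::vector<std::uint64_t> strides;
    std::uint64_t s = elem_bytes;
    for (std::size_t d = 0; d + 1 < global_dim.size(); ++d) {
        s *= global_dim[d];    // bytes spanned by one step in dimension d+1
        strides.push_back(s);
    }
    return strides;
}

// The driver requires every global stride to be a multiple of 16 bytes.
bool strides_are_16b_aligned(const std::vector<std::uint64_t>& strides) {
    for (std::uint64_t s : strides) {
        if (s % 16 != 0) return false;
    }
    return true;
}
```

For a bf16 tensor of shape (64, 128, 4), innermost first, this gives two strides of 128 and 16384 bytes.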
|
|
|
944e567b6c
|
TMA debug: test various CUtensorMap configs
|
2026-05-28 16:55:25 +00:00 |
|
|
|
55d289c65b
|
Fix TMA: use CU_TENSOR_MAP_DATA_TYPE_BFLOAT16 not UINT16
|
2026-05-28 16:51:40 +00:00 |
|
|
|
0fd3e12a52
|
Fix TMA test: globalStrides in bytes not elements
|
2026-05-28 16:46:56 +00:00 |
|
|
|
ad8050bbad
|
WIP: TMA load test infrastructure (manual compile needed)
|
2026-05-28 16:45:04 +00:00 |
|
|
|
d9df1e6486
|
auto: pre-test commit
|
2026-05-28 16:42:24 +00:00 |
|
|
|
a4211559cf
|
auto: pre-test commit
|
2026-05-28 16:40:51 +00:00 |
|
|
|
3b8fdcc823
|
auto: pre-test commit
|
2026-05-28 16:39:45 +00:00 |
|
|
|
072fbf0b5d
|
auto: pre-test commit
|
2026-05-28 16:36:53 +00:00 |
|
|
|
2a6d72912a
|
auto: pre-test commit
|
2026-05-28 16:28:58 +00:00 |
|
|
|
01319d7247
|
auto: pre-test commit
|
2026-05-28 15:59:22 +00:00 |
|
|
|
43516ed4ec
|
auto: pre-test commit
|
2026-05-28 15:55:59 +00:00 |
|
|
|
1ec3e1ed2c
|
auto: pre-test commit
|
2026-05-28 15:55:18 +00:00 |
|
|
|
babff1f402
|
auto: pre-test commit
|
2026-05-28 15:54:05 +00:00 |
|
|
|
2b007d2008
|
auto: pre-test commit
|
2026-05-28 15:53:39 +00:00 |
|
|
|
84b997881f
|
auto: pre-test commit
|
2026-05-28 15:53:04 +00:00 |
|
|
|
6e5401df3b
|
auto: pre-test commit
|
2026-05-28 15:51:55 +00:00 |
|
|
|
102174fade
|
auto: pre-test commit
|
2026-05-28 15:50:52 +00:00 |
|
|
|
2dcfc0089f
|
auto: pre-test commit
|
2026-05-28 15:49:47 +00:00 |
|
|
|
1cdb90462f
|
auto: pre-test commit
|
2026-05-28 15:48:15 +00:00 |
|
|
|
80fd612132
|
auto: pre-test commit
|
2026-05-28 15:47:58 +00:00 |
|
|
|
9583cbc67a
|
auto: pre-test commit
|
2026-05-28 15:46:53 +00:00 |
|
|
|
1b86860c19
|
auto: pre-test commit
|
2026-05-28 15:46:16 +00:00 |
|
|
|
6249989cf6
|
Clean up HD=64 test, V layout verified correct
|
2026-05-28 15:21:33 +00:00 |
|
|
|
e1daad6955
|
Verify V SMEM values vs GMEM for HD=64
|
2026-05-28 15:19:31 +00:00 |
|
|
|
bafd26707b
|
FMHA HD=64 with BLOCK_MN_B=16, 4 N-tiles per K-tile
|
2026-05-28 15:17:40 +00:00 |
|
|
|
6b9b06647a
|
Clean up HD=64 debug prints, keep register-math PV check
|
2026-05-28 15:15:22 +00:00 |
|
|
|
5c9d471162
|
Add register-math PV reference for HD=64 debug
|
2026-05-28 15:13:47 +00:00 |
|
|
|
43e9efbc2b
|
Fix string literal
|
2026-05-28 15:12:20 +00:00 |
|
|
|
906be7ce50
|
Add filtered cosine (exclude near-zero)
|
2026-05-28 15:11:14 +00:00 |
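Note on the filtered cosine above: the commit does not say how the filter is defined. One plausible reading, sketched here, skips positions where the reference value is close to zero so that they do not affect the similarity score. The function name, the choice to filter on the reference rather than the output, and the default threshold are all assumptions.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Cosine similarity over positions where |ref[i]| >= eps (assumed filter).
// Returns 0.0 if no positions survive or either filtered norm is zero.
double filtered_cosine(const std::vector<float>& ref,
                       const std::vector<float>& out,
                       float eps = 1e-6f) {
    double dot = 0.0, norm_ref = 0.0, norm_out = 0.0;
    const std::size_t n = ref.size() < out.size() ? ref.size() : out.size();
    for (std::size_t i = 0; i < n; ++i) {
        if (std::fabs(ref[i]) < eps) continue;  // skip near-zero reference
        dot += static_cast<double>(ref[i]) * out[i];
        norm_ref += static_cast<double>(ref[i]) * ref[i];
        norm_out += static_cast<double>(out[i]) * out[i];
    }
    const double denom = std::sqrt(norm_ref * norm_out);
    return denom > 0.0 ? dot / denom : 0.0;
}
```

With this definition, a large output value at a position where the reference is zero does not change the score, which is worth keeping in mind when reading the reported cosine figures.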
|
|
|
40c83c769a
|
Fix: remove ×2 QK scale correction (MMA scale is 1.0, not 0.5)
|
2026-05-28 15:09:57 +00:00 |
|
|
|
6ea7356fdd
|
Debug: print P values for HD=64
|
2026-05-28 15:07:55 +00:00 |
|
|
|
4b052f22a5
|
Fix: opt into >48KB shared memory for HD=64
|
2026-05-28 15:06:37 +00:00 |
|
|
|
7becbfc07e
|
Fix: printf after var declarations
|
2026-05-28 15:03:25 +00:00 |
|
|
|
2d44f8e356
|
Debug: check if HD=64 kernel starts
|
2026-05-28 15:02:00 +00:00 |
|
|
|
46e4d07c71
|
Test PV SS MMA with B=(64,16) BLOCK_MN=64
|
2026-05-28 14:58:10 +00:00 |
|
|
|
465e089a2b
|
Add launch error check for HD=64
|
2026-05-28 14:56:07 +00:00 |
|
|
|
2fd64c464d
|
FMHA HD=64 with BLOCK_MN_B=64 for V, proper output dimensions
|
2026-05-28 14:54:10 +00:00 |
|
|
|
15ecc1f616
|
Full FMHA HD=64 with PV SS MMA (SMEM-P)
|
2026-05-28 14:52:29 +00:00 |
|
|
|
5b2e690936
|
Milestone: Full FMHA HD=16 with PV SS MMA (SMEM-P) — cosine 0.9997
|
2026-05-28 14:50:43 +00:00 |
|
|
|
78026839b7
|
Fix V canonical layout: swap g_mn/g_k indices (d=MN, lr=K)
|
2026-05-28 14:49:17 +00:00 |
|
|
|
9a3b43c42b
|
Fix reference to also use uniform P
|
2026-05-28 14:47:10 +00:00 |
|
|
|
75bdcbf728
|
Debug: override P with uniform 1/128
|
2026-05-28 14:46:21 +00:00 |
|