|
|
fee022a485
|
test: MMA→4-warp read using proven fmha_common+umma_desc infra
|
2026-05-28 23:05:29 +00:00 |
|
|
|
e1a708a187
|
test: try 16x256b.x1 with column step=4 (4 cols per read)
|
2026-05-28 23:03:51 +00:00 |
|
|
|
95003eced2
|
test: 16x256b.x1 loads with uint32_t regs, matching working pattern
|
2026-05-28 23:03:10 +00:00 |
|
|
|
fffb493b0e
|
fix: 16x256b.x1 load syntax — single address operand
|
2026-05-28 23:02:23 +00:00 |
|
|
|
44dcd6e8d0
|
test: 16x256b.x1 multiple LOADS — do they crash like stores?
|
2026-05-28 23:02:03 +00:00 |
|
|
|
d54bce6a6d
|
fix: correct SMEM size for MMA 4-warp test
|
2026-05-28 23:01:12 +00:00 |
|
|
|
be45e87891
|
test: MMA→4-warp TMEM read — do warps see different rows?
|
2026-05-28 23:00:27 +00:00 |
|
|
|
6b0d57074a
|
test: TMEM cross-warp visibility with different sync strategies
|
2026-05-28 22:59:31 +00:00 |
|
|
|
77d190278e
|
test: simpler TMEM 4-warp read — direct store+load
|
2026-05-28 22:58:48 +00:00 |
|
|
|
91b03bd6bd
|
test: verify 4-warp TMEM read with 32x32b.x8 after MMA
|
2026-05-28 22:57:59 +00:00 |
|
|
|
28e04a5ea8
|
fix: use __cvta_generic_to_shared directly for 64-bit compat
|
2026-05-28 22:56:29 +00:00 |
|
|
|
1d6a95df32
|
fix: typo in tmem row offset test
|
2026-05-28 22:56:15 +00:00 |
|
|
|
cf6fe71368
|
test: verify TMEM 32x32b.x8 row offset addressing
|
2026-05-28 22:56:00 +00:00 |
|
|
|
4cfb707405
|
fix: correct SMEM size calculation in multirow test
|
2026-05-28 22:53:46 +00:00 |
|
|
|
863a030c3b
|
fmha_multirow: rewrite with 32x32b.x8 only, no s_p_vals, row_page addressing
- Kill 64KB s_p_vals buffer — P is streamed per K-tile through sPk
- All TMEM ops use 32x32b.x8 exclusively (16x256b.x1 crashes on 2nd call)
- T>32: 4 softmax warps use row_page offset in TMEM address (row<<16)
- Lane l in warp w handles row w*32+l
- Two-pass softmax: pass 1 row_max, pass 2 exp/sum interleaved with PV
- PV: N=16 sub-tiles, SS MMA sPk(128,16) × sV(16,16) → TMEM
- Epilogue: 32x32b.x8 TMEM read, normalize, BF16 → GMEM
- SMEM budget: ~14KB (well within 232KB)
|
2026-05-28 22:52:52 +00:00 |
|
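The row mapping and two-pass softmax described in commit 863a030c3b can be sketched on the host as follows. This is a minimal reference under stated assumptions, not the kernel's code: the helper names `row_for_lane`, `tmem_addr`, and `softmax_two_pass` are hypothetical, and the address layout (row index in the upper 16 bits, column in the lower 16 bits) is inferred from the `row<<16` note in the commit message.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <vector>

// Lane l of softmax warp w owns row w*32 + l (per the commit message).
inline uint32_t row_for_lane(uint32_t warp, uint32_t lane) {
    assert(lane < 32);
    return warp * 32u + lane;
}

// Assumed TMEM address layout: row in bits 31..16, column in bits 15..0,
// which is what makes "row << 16" a row offset.
inline uint32_t tmem_addr(uint32_t base, uint32_t row, uint32_t col) {
    return base + (row << 16) + col;
}

// Two-pass softmax over one row of scores.
inline std::vector<float> softmax_two_pass(const std::vector<float>& s) {
    assert(!s.empty());
    // Pass 1: row maximum, for numerical stability.
    float row_max = s[0];
    for (float v : s) row_max = std::fmax(row_max, v);
    // Pass 2: exponentiate and accumulate the row sum.
    std::vector<float> p(s.size());
    float row_sum = 0.0f;
    for (std::size_t i = 0; i < s.size(); ++i) {
        p[i] = std::exp(s[i] - row_max);
        row_sum += p[i];
    }
    // Normalization, which the commit defers to the epilogue.
    for (float& v : p) v /= row_sum;
    return p;
}
```

In the kernel, pass 2 is interleaved with the PV MMA and the division happens once in the epilogue; the sketch collapses both into a single function for readability.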
|
|
08694b8136
|
Fix multi-row softmax v3: 32x32b.x8 with per-lane, per-row handling (no wmax/wsum) and per-row sRowMax/sRowSum arrays
|
2026-05-28 20:10:13 +00:00 |
|
|
|
bf4dfd131b
|
Fix nvcc goto-bypasses-init: move var decls before goto targets
|
2026-05-28 20:04:59 +00:00 |
|
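The "goto bypasses initialization" fix in commit bf4dfd131b follows a common C++ pattern: a `goto` may not jump past the initialization of a variable that is still in scope at the label, so the declarations move above the first jump. A minimal sketch, with the hypothetical function `sum_or_fail` standing in for the test code:

```cpp
#include <cassert>

// Returns 0 + 1 + ... + (n-1), or -1 on the error path.
inline int sum_or_fail(int n) {
    // Declared before any goto, so no jump can skip an
    // initialization that is still in scope at the label.
    int total = 0;
    int i = 0;

    if (n < 0) goto fail;

    for (i = 0; i < n; ++i) total += i;
    return total;

fail:
    return -1;
}
```

Writing `int total = 0;` after the `if (n < 0) goto fail;` line would make the program ill-formed, which is the class of error the commit message names.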
|
|
2b09d4f2ef
|
Fix nvcc goto-bypasses-init in multi-row test
|
2026-05-28 20:04:45 +00:00 |
|
|
|
d8b421ccee
|
Multi-row FMHA kernel (Milestone 4): T>1 prefill support with 4-warp parallel softmax
|
2026-05-28 20:04:29 +00:00 |
|
|
|
3fd302e7a0
|
Fix nvcc goto-bypasses-init errors in multi-head test
|
2026-05-28 19:33:04 +00:00 |
|
|
|
aa41cfa2e5
|
Multi-head FMHA kernel (Milestone 5): grid launch with MHA/MQA/batch support
- fmha_6warp_multihead.cuh: grid=(1, n_h, batch) kernel with FmhaParams
- MQA support via k_head_stride=0 / v_head_stride=0
- LSE output for multi-segment KV merge composition
- test_fmha_6warp_multihead.cu: MHA (4+8 heads), MQA, batched tests
- HD-specific wrappers for hd=16/64/128/256
- Marked E2M1 dequant bug as FIXED in consultant issue file
|
2026-05-28 19:32:35 +00:00 |
|
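Two ideas from commit aa41cfa2e5 can be illustrated on the host: a head stride of zero makes every query head read the same K/V block (MQA), and a per-segment LSE lets partial attention outputs over disjoint KV segments be merged afterwards. The sketch below is an assumption-laden illustration; `kv_head_offset`, `Partial`, and `merge_segments` are hypothetical names, not identifiers from `fmha_6warp_multihead.cuh`.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>

// Element offset of a head's K (or V) data. With head_stride == 0 all
// heads map to offset 0, i.e. they share one K/V block (the MQA case).
inline std::size_t kv_head_offset(std::size_t head, std::size_t head_stride) {
    return head * head_stride;
}

// Hypothetical per-segment result: one element of the normalized
// output plus the log-sum-exp of that segment's scores.
struct Partial {
    float out;
    float lse;
};

// Merge two partial results computed over disjoint KV segments.
inline Partial merge_segments(Partial a, Partial b) {
    float m = std::fmax(a.lse, b.lse);   // subtract the max for stability
    float wa = std::exp(a.lse - m);      // relative weight of segment a
    float wb = std::exp(b.lse - m);      // relative weight of segment b
    float denom = wa + wb;
    return { (wa * a.out + wb * b.out) / denom, m + std::log(denom) };
}
```

Because the merged result carries its own LSE, the same function can be applied repeatedly to fold in any number of segments.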
|
|
6af2feb42a
|
TMA 5D test: element stride decomposition
|
2026-05-28 19:18:01 +00:00 |
|
|
|
96f2f0bb90
|
auto: pre-test commit
|
2026-05-28 19:12:23 +00:00 |
|
|
|
015435b1ab
|
auto: pre-test commit
|
2026-05-28 19:09:50 +00:00 |
|
|
|
41343fdc6b
|
auto: pre-test commit
|
2026-05-28 19:08:04 +00:00 |
|
|
|
a723b524f7
|
TMA alignment test
|
2026-05-28 17:00:20 +00:00 |
|
|
|
c54a83960d
|
TMA debug: fix globalStrides to tensorRank-1 elements
|
2026-05-28 16:58:30 +00:00 |
|
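The two `globalStrides` fixes in this log (strides in bytes, and an array of tensorRank-1 entries) can be combined into one host-side helper. This is a sketch assuming a densely packed tensor whose innermost dimension is `dims[0]`; the function name `global_strides_bytes` is hypothetical, and the code only computes the array without calling the driver API.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Byte strides for dimensions 1..rank-1 of a densely packed tensor.
// The innermost dimension's stride is implicit (one element), so the
// result has rank-1 entries.
inline std::vector<uint64_t> global_strides_bytes(const std::vector<uint64_t>& dims,
                                                  uint64_t elem_bytes) {
    assert(!dims.empty());
    std::vector<uint64_t> strides;
    uint64_t stride = elem_bytes;            // start from the element size in bytes
    for (std::size_t i = 0; i + 1 < dims.size(); ++i) {
        stride *= dims[i];                   // bytes spanned by dimensions 0..i
        strides.push_back(stride);
    }
    return strides;
}
```

For a BF16 tensor with dimensions {64, 128, 4}, the helper yields two strides, 128 and 16384 bytes. Real tensor maps add further constraints, such as stride alignment, that this sketch does not check.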
|
|
944e567b6c
|
TMA debug: test various CUtensorMap configs
|
2026-05-28 16:55:25 +00:00 |
|
|
|
55d289c65b
|
Fix TMA: use CU_TENSOR_MAP_DATA_TYPE_BFLOAT16 not UINT16
|
2026-05-28 16:51:40 +00:00 |
|
|
|
0fd3e12a52
|
Fix TMA test: globalStrides in bytes not elements
|
2026-05-28 16:46:56 +00:00 |
|
|
|
ad8050bbad
|
WIP: TMA load test infrastructure (manual compile needed)
|
2026-05-28 16:45:04 +00:00 |
|
|
|
d9df1e6486
|
auto: pre-test commit
|
2026-05-28 16:42:24 +00:00 |
|
|
|
a4211559cf
|
auto: pre-test commit
|
2026-05-28 16:40:51 +00:00 |
|
|
|
3b8fdcc823
|
auto: pre-test commit
|
2026-05-28 16:39:45 +00:00 |
|
|
|
072fbf0b5d
|
auto: pre-test commit
|
2026-05-28 16:36:53 +00:00 |
|
|
|
2a6d72912a
|
auto: pre-test commit
|
2026-05-28 16:28:58 +00:00 |
|
|
|
01319d7247
|
auto: pre-test commit
|
2026-05-28 15:59:22 +00:00 |
|
|
|
43516ed4ec
|
auto: pre-test commit
|
2026-05-28 15:55:59 +00:00 |
|
|
|
1ec3e1ed2c
|
auto: pre-test commit
|
2026-05-28 15:55:18 +00:00 |
|
|
|
babff1f402
|
auto: pre-test commit
|
2026-05-28 15:54:05 +00:00 |
|
|
|
2b007d2008
|
auto: pre-test commit
|
2026-05-28 15:53:39 +00:00 |
|
|
|
84b997881f
|
auto: pre-test commit
|
2026-05-28 15:53:04 +00:00 |
|
|
|
6e5401df3b
|
auto: pre-test commit
|
2026-05-28 15:51:55 +00:00 |
|
|
|
102174fade
|
auto: pre-test commit
|
2026-05-28 15:50:52 +00:00 |
|
|
|
2dcfc0089f
|
auto: pre-test commit
|
2026-05-28 15:49:47 +00:00 |
|
|
|
1cdb90462f
|
auto: pre-test commit
|
2026-05-28 15:48:15 +00:00 |
|
|
|
80fd612132
|
auto: pre-test commit
|
2026-05-28 15:47:58 +00:00 |
|
|
|
9583cbc67a
|
auto: pre-test commit
|
2026-05-28 15:46:53 +00:00 |
|
|
|
1b86860c19
|
auto: pre-test commit
|
2026-05-28 15:46:16 +00:00 |
|
|
|
6249989cf6
|
Clean up HD=64 test, V layout verified correct
|
2026-05-28 15:21:33 +00:00 |
|