| Commit | Message | Date |
|---|---|---|
| f2592ea0da | fix: native TMEM columns for hd_chunk (no remapping) | 2026-05-30 07:01:42 +00:00 |
| dcf89fdd1c | debug: check full HD for chunk1 test | 2026-05-30 07:00:46 +00:00 |
| 3dbd3c5e7f | debug: test chunk 1 only | 2026-05-30 07:00:14 +00:00 |
| 72779e7f71 | debug: compare only first HD_CHUNK values | 2026-05-30 06:59:39 +00:00 |
| 9227b0e93f | debug: skip hd_chunk>0 to isolate chunk0 | 2026-05-30 06:59:01 +00:00 |
| 25aeaca9ab | fix: PV accumulate flag | 2026-05-30 06:56:53 +00:00 |
| 1da785c070 | D1.5: HD tiling (HD_CHUNK=256) for HD=512 support | 2026-05-30 06:56:09 +00:00 |
| 700524f183 | test: HD=128/256 variants for D1.5 | 2026-05-30 04:49:33 +00:00 |
| f2544a4600 | test: full matrix for D1.5 multirow multitile | 2026-05-30 04:49:00 +00:00 |
| 5544d3a0a4 | fix: TMEM reads must be outside my_row_active (warp-collective) | 2026-05-30 04:48:26 +00:00 |
| 1dca8d8cfa | debug: unbuffered stdout | 2026-05-30 04:46:11 +00:00 |
| 8be8813d54 | debug: more prints | 2026-05-30 04:44:41 +00:00 |
| 570396b4be | debug: simplify test, add fflush | 2026-05-30 04:42:35 +00:00 |
| 0ad35f8be6 | debug: add prints to multirow multitile test | 2026-05-30 04:40:06 +00:00 |
| dd3e0fdfc8 | D1.5: multi-row + multi-tile FMHA with SMEM accumulator in-kernel rescale | 2026-05-30 04:37:33 +00:00 |
| 10ae8f3346 | auto: pre-test commit | 2026-05-30 03:46:38 +00:00 |
| 8b1ac380ac | feat: HD=512 support — TMEM_N=512, test variants for all three TMA kernels | 2026-05-30 03:45:05 +00:00 |
| 762f054d6d | feat: double-buffer TMA pipeline in multi-row kernel | 2026-05-30 03:20:49 +00:00 |
| 4a9c850e9c | feat: double-buffer TMA pipeline for K loads in single-tile kernel | 2026-05-30 03:14:06 +00:00 |
| afa949071b | fix: brace structure in V TMA conversion | 2026-05-29 22:59:18 +00:00 |
| ec577f71ee | feat: V TMA loads in single-tile kernel too | 2026-05-29 22:57:59 +00:00 |
| 422e7bb312 | cleanup: v_head reference in multi-row (V via TMA now) | 2026-05-29 22:54:44 +00:00 |
| 88c72a887e | feat: V TMA loads in multi-row kernel | 2026-05-29 22:51:24 +00:00 |
| 13403d2808 | cleanup: remove unused v_head in multi-tile (V via TMA) | 2026-05-29 22:48:50 +00:00 |
| 74145a31cc | feat: V TMA loads in multi-tile kernel | 2026-05-29 22:46:21 +00:00 |
| 680d2ebf64 | test: V TMA diagnostic — isolate V TMA descriptor issue | 2026-05-29 22:42:46 +00:00 |
| 077fbdf3c5 | test: HD=128/256 multi-tile variants | 2026-05-29 20:02:00 +00:00 |
| 7df17384fd | test: multi-tile s_k=128/256/384/512 | 2026-05-29 19:59:21 +00:00 |
| d47b2bfcce | fix: use un-normalized P for multi-tile PV (correct online softmax merge) | 2026-05-29 19:57:54 +00:00 |
| 43ae3e7f98 | fix: reload Q per-K-sub-tile in multi-tile kernel (same as single-tile) | 2026-05-29 19:56:35 +00:00 |
| 7598d548ee | debug: test multi-tile with s_k=128 only | 2026-05-29 19:53:02 +00:00 |
| 8e99bd50e6 | feat: 6-warp TMA multi-tile KV kernel with register accumulator + test | 2026-05-29 19:49:53 +00:00 |
| 1814510195 | wip: add n_kv_tiles param for multi-tile KV (not yet used) | 2026-05-29 19:47:48 +00:00 |
| d20792aa9d | fix: TMA descriptor index for batched multi-head (batch*n_h + head) | 2026-05-29 19:45:44 +00:00 |
| 754c6a692c | feat: per-head TMA descriptors for multi-head FMHA | 2026-05-29 19:44:58 +00:00 |
| 9eb193458e | test: refactored multi-row TMA test with multi-head and batch | 2026-05-29 19:43:41 +00:00 |
| 832a04181d | test: relax relative error threshold to 5% for BF16, use cosine > 0.999 as pass criterion | 2026-05-29 19:41:40 +00:00 |
| bfef94f5d0 | test: HD=128/256 multi-row TMA FMHA | 2026-05-29 19:40:32 +00:00 |
| a1b2ab79a1 | feat: 6-warp TMA FMHA multi-row kernel + test | 2026-05-29 19:39:17 +00:00 |
| d0a50f1f2e | fix: remove double normalization in TMA epilogue (P already normalized before PV) | 2026-05-29 19:36:41 +00:00 |
| fb971781aa | fix: revert V to direct load (V TMA needs debugging), K TMA works | 2026-05-29 19:35:44 +00:00 |
| cd2c028b39 | feat: TMA loads for both K and V in 6-warp FMHA kernel | 2026-05-29 19:34:48 +00:00 |
| 523d3838a2 | test: HD=128/256 variants for TMA FMHA | 2026-05-29 19:32:49 +00:00 |
| bd4f09d514 | fix: ambiguous MMA_K_BF16 in test | 2026-05-29 19:32:15 +00:00 |
| 4459ddefdd | feat: 6-warp TMA FMHA kernel + test — TMA for K loads | 2026-05-29 19:32:02 +00:00 |
| 7a8ba8eeb6 | fix: SMEM size calculation — TILE_SZ is in BF16 elements, need *sizeof(bf16_t) for bytes | 2026-05-29 19:30:50 +00:00 |
| aac1b25442 | test: TMA QK diagnostic — 3 variants to isolate failure | 2026-05-29 19:29:35 +00:00 |
| 9dfada6626 | test: TMA + canonical + QK GEMM incremental | 2026-05-29 19:28:23 +00:00 |
| 0435e229bd | fix: typo cuda_SUCCESS -> cudaSuccess | 2026-05-29 19:27:30 +00:00 |
| 74514e2680 | test: TMA sub-tile load — exact pattern from test_qk_softmax | 2026-05-29 19:26:56 +00:00 |
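
Commit d47b2bfcce switches the multi-tile PV step to un-normalized P so that the online softmax merge across KV tiles comes out right. As a rough illustration of why that matters, here is a minimal host-side sketch of the standard online-softmax recurrence for a single query row. It is not the kernel's code: the function names, the plain-list data layout, and the `tile` parameter are all hypothetical, and the real kernels run this on BF16 tiles on the GPU.

```python
import math

def attention_row_tiled(scores, values, tile):
    """One query row processed tile by tile (hypothetical reference helper).

    scores[j] is the q.k_j logit, values[j] is the j-th V row (list of floats).
    P stays un-normalized inside the loop; the division by the running sum
    happens exactly once, after the last tile.
    """
    hd = len(values[0])
    m = float("-inf")     # running row maximum
    l = 0.0               # running sum of exp(score - m)
    acc = [0.0] * hd      # un-normalized output accumulator
    for start in range(0, len(scores), tile):
        s_tile = scores[start:start + tile]
        v_tile = values[start:start + tile]
        m_new = max(m, max(s_tile))
        scale = math.exp(m - m_new)                 # 0.0 on the first tile
        p = [math.exp(s - m_new) for s in s_tile]   # un-normalized P
        l = l * scale + sum(p)
        acc = [a * scale for a in acc]              # rescale earlier tiles
        for pj, vj in zip(p, v_tile):
            for d in range(hd):
                acc[d] += pj * vj[d]
        m = m_new
    return [a / l for a in acc]                     # single final normalization

def attention_row_reference(scores, values):
    """Same row computed in one pass, for comparison."""
    m = max(scores)
    p = [math.exp(s - m) for s in scores]
    l = sum(p)
    hd = len(values[0])
    return [sum(pj * vj[d] for pj, vj in zip(p, values)) / l for d in range(hd)]
```

If each tile's P were divided by that tile's own sum before the PV product, the per-tile results could no longer be combined by a simple rescale, because each tile would carry a different denominator. Keeping P un-normalized and dividing once at the end avoids that, and it also avoids the double normalization that commit d0a50f1f2e removes from the single-tile epilogue.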