-
2833eb56e7
P6: Add minimal CUDA test for TMA store epilogue
biondizzle
2026-05-30 16:59:45 +00:00
-
6a7726e764
P6: Add integration test for TMA store epilogue
biondizzle
2026-05-30 16:58:24 +00:00
-
fd7c0cb773
P6: Fix TMA store — use bulk_group (commit+wait) not mbarrier
biondizzle
2026-05-30 16:57:35 +00:00
-
212fc85627
P6: One-way TMEM→regs→SMEM→TMA store epilogue
biondizzle
2026-05-30 16:56:07 +00:00
-
05b5bf9db1
docs: mark P5 as done in NEXT_PRIORITIES.md
biondizzle
2026-05-30 10:54:21 +00:00
-
95e0c8c464
P5: fix multi-tile test — use same Q data for kernel and reference
biondizzle
2026-05-30 10:49:12 +00:00
-
e701a1411c
P5: use multi-tile kernel for N>128 in integration test
biondizzle
2026-05-30 10:47:00 +00:00
-
5932e928a8
cleanup: remove debug test files (P4, P5)
biondizzle
2026-05-30 10:46:14 +00:00
-
8fef46ce73
P5: add reference comparison to Python multi-tile test
biondizzle
2026-05-30 10:45:02 +00:00
-
897a70a491
P5: minimal Python multi-tile test
biondizzle
2026-05-30 10:43:26 +00:00
-
a2627359fb
P5: fix TMA desc creation — write to HOST then cudaMemcpy to device
biondizzle
2026-05-30 10:40:01 +00:00
-
f370bfb1f1
P5: re-enable multi-tile Python tests, fix CAPI to use create_tma_desc_2d_bf16
biondizzle
2026-05-30 10:38:33 +00:00
-
da54f6439f
P5: fix TMA multitile test (include cuda.h first, proper SMEM calc)
biondizzle
2026-05-30 10:35:34 +00:00
-
34320653e9
P5: standalone TMA multi-tile test with 128B-aligned memory
biondizzle
2026-05-30 10:34:20 +00:00
-
a1d05b3055
P5: disable multi-tile Python tests (TMA descriptor alignment issue)
biondizzle
2026-05-30 10:32:44 +00:00
-
97531a68e6
fix: remove n_kv_tiles from capi too
biondizzle
2026-05-30 10:30:40 +00:00
-
a5b47602b5
fix: remove n_kv_tiles from standalone test (struct doesn't have it anymore)
biondizzle
2026-05-30 10:28:38 +00:00
-
f032800eaa
P5: integrate WORKING multi-tile kernel (fmha_6warp_tma_multirow_multitile) into production
biondizzle
2026-05-30 10:27:38 +00:00
-
032cb4c7b2
P5: add single-tile merge comparison to multitile test
biondizzle
2026-05-30 09:06:57 +00:00
-
d424ccbcc1
fix: const not constexpr for SCALE
biondizzle
2026-05-30 09:04:45 +00:00
-
3da31de4c0
P5: fix BF16 host helpers for standalone test
biondizzle
2026-05-30 09:04:05 +00:00
-
9e6ba25a98
P5: standalone multi-tile CUDA test (2 KV tiles, hd=64)
biondizzle
2026-05-30 09:01:52 +00:00
-
b61df2657b
P5: fix reference attention for MQA/GQA (kv_idx = h // q_per_kv)
biondizzle
2026-05-30 08:59:50 +00:00
-
c55030a340
P5: clean kernel with runtime branch (single-tile unchanged, multi-tile separate path)
biondizzle
2026-05-30 08:57:00 +00:00
-
5f4856d771
P5: fix sOacc init race — use single thread (tid==0) instead of 4 softmax warps
biondizzle
2026-05-30 08:53:50 +00:00
-
66b126ded8
P5: fix standalone test template — add n_kv_tiles to FmhaParams
biondizzle
2026-05-30 08:50:38 +00:00
-
0f34f60494
P5: fix single-tile backward compat (normalized P for n_kv_tiles==1)
biondizzle
2026-05-30 08:47:47 +00:00
-
2649488d13
P5: in-kernel multi-KV-tile FA2 online softmax in fmha_6warp_multihead.cuh
biondizzle
2026-05-30 08:46:09 +00:00
-
6421f7c3f3
P4 RESOLVED: TMA hang was GMEM misalignment, not descriptor/driver issue
biondizzle
2026-05-30 08:42:18 +00:00
-
58c087416b
P4: 128B-aligned GMEM, proper SMEM alignment, bit21 test
biondizzle
2026-05-30 08:41:15 +00:00
-
90c806733f
P4: test TMA with bit-21 workaround and innermost-first dims
biondizzle
2026-05-30 08:40:21 +00:00
-
16027018df
P4: fix TMA load test (32-bit SMEM addrs, proper mbarrier)
biondizzle
2026-05-30 08:38:55 +00:00
-
e2ecdc42d8
P4: TMA load test kernel (swizzle vs no-swizzle hang diagnosis)
biondizzle
2026-05-30 08:38:11 +00:00
-
bd104c2ab2
P4: fix OOB fill enum name
biondizzle
2026-05-30 08:37:05 +00:00
-
cdd1babf1f
P4: correct CUDA 13.2 API (dataType before rank, FloatOOBfill, globalDim)
biondizzle
2026-05-30 08:36:24 +00:00
-
8df3ccecea
P4: CUDA 13.2 has 10-param cuTensorMapEncodeTiled (no OOB fill)
biondizzle
2026-05-30 08:35:34 +00:00
-
d8ffdb66e1
P4: fix API signature rank/dtype order, OOB_FILL defines
biondizzle
2026-05-30 08:35:04 +00:00
-
277689f8b8
P4: use proper CUDA enum names
biondizzle
2026-05-30 08:34:19 +00:00
-
6d624a1b14
P4: remove explicit enum casts
biondizzle
2026-05-30 08:33:42 +00:00
-
4898a946eb
P4: fix TMA descriptor dump API order (dtype before rank)
biondizzle
2026-05-30 08:33:12 +00:00
-
3943be6063
P4: fix TMA descriptor dump (cuuint64_t dims, proper CUtensorMap API)
biondizzle
2026-05-30 08:32:34 +00:00
-
4df6ea2d8c
P4: TMA descriptor dump test (cuTensorMapEncodeTiled)
biondizzle
2026-05-30 08:31:56 +00:00
-
ae425b5522
P3: clean up test, remove debug files, final integration test
biondizzle
2026-05-30 08:29:25 +00:00
-
10915c4e70
fix: remove double normalization in fmha_6warp_multihead epilogue
biondizzle
2026-05-30 08:26:20 +00:00
-
cfac224b59
debug: single head sanity test with known values
biondizzle
2026-05-30 08:25:20 +00:00
-
1c74d35fb4
debug: V layout reference comparison
biondizzle
2026-05-30 08:24:35 +00:00
-
a3c5f817e1
debug: compare api vs direct kernel vs reference
biondizzle
2026-05-30 08:23:43 +00:00
-
78e6d58b85
debug: V layout comparison test
biondizzle
2026-05-30 08:22:49 +00:00
-
074c4c4f42
P3: call fmha_multihead_decode_raw directly (skip custom op)
biondizzle
2026-05-30 08:21:53 +00:00
-
1b9cdf89fb
P3: add full API integration test
biondizzle
2026-05-30 08:20:53 +00:00
-
0608d9d09e
P3: fix GQA via K/V repeat_interleave, relax threshold to 0.999990
biondizzle
2026-05-30 08:20:01 +00:00
-
d5c0086737
P3: fix SMEM computation, pad K/V to 128, remove stale files
biondizzle
2026-05-30 08:19:16 +00:00
-
094b3c9e6c
P3: fix test — create V in kernel layout (hd,N), transpose for reference
biondizzle
2026-05-30 08:18:20 +00:00
-
7b5b3342fa
P3: fix integration test — V transpose, direct ctypes call
biondizzle
2026-05-30 08:17:33 +00:00
-
8a5070aa38
test: minimal ctypes debug test for P3
biondizzle
2026-05-30 08:16:50 +00:00
-
63645a3c7b
fix: -Xcompiler -fPIC instead of -fPIC for nvcc
biondizzle
2026-05-30 08:16:04 +00:00
-
adcf3e04ab
P3: ctypes loader for 6-warp FMHA (bypass torch JIT sm_100 arch issue)
biondizzle
2026-05-30 08:15:31 +00:00
-
1e6adf5e01
P3: wire 6-warp multi-head FMHA decode fast path into production.py
biondizzle
2026-05-30 08:12:23 +00:00
-
20f3ccd992
D1.5 complete: HD=512 support via hd_chunk tiling with native TMEM columns
biondizzle
2026-05-30 07:02:41 +00:00
-
f2592ea0da
fix: native TMEM columns for hd_chunk (no remapping)
biondizzle
2026-05-30 07:01:42 +00:00
-
dcf89fdd1c
debug: check full HD for chunk1 test
biondizzle
2026-05-30 07:00:46 +00:00
-
3dbd3c5e7f
debug: test chunk 1 only
biondizzle
2026-05-30 07:00:14 +00:00
-
72779e7f71
debug: compare only first HD_CHUNK values
biondizzle
2026-05-30 06:59:39 +00:00
-
9227b0e93f
debug: skip hd_chunk>0 to isolate chunk0
biondizzle
2026-05-30 06:59:01 +00:00
-
25aeaca9ab
fix: PV accumulate flag
biondizzle
2026-05-30 06:56:53 +00:00
-
1da785c070
D1.5: HD tiling (HD_CHUNK=256) for HD=512 support
biondizzle
2026-05-30 06:56:09 +00:00
-
700524f183
test: HD=128/256 variants for D1.5
biondizzle
2026-05-30 04:49:33 +00:00
-
f2544a4600
test: full matrix for D1.5 multirow multitile
biondizzle
2026-05-30 04:49:00 +00:00
-
5544d3a0a4
fix: TMEM reads must be outside my_row_active (warp-collective)
biondizzle
2026-05-30 04:48:26 +00:00
-
1dca8d8cfa
debug: unbuffered stdout
biondizzle
2026-05-30 04:46:11 +00:00
-
8be8813d54
debug: more prints
biondizzle
2026-05-30 04:44:41 +00:00
-
570396b4be
debug: simplify test, add fflush
biondizzle
2026-05-30 04:42:35 +00:00
-
0ad35f8be6
debug: add prints to multirow multitile test
biondizzle
2026-05-30 04:40:06 +00:00
-
dd3e0fdfc8
D1.5: multi-row + multi-tile FMHA with SMEM accumulator in-kernel rescale
biondizzle
2026-05-30 04:37:33 +00:00
-
10ae8f3346
auto: pre-test commit
biondizzle
2026-05-30 03:46:38 +00:00
-
8b1ac380ac
feat: HD=512 support — TMEM_N=512, test variants for all three TMA kernels
biondizzle
2026-05-30 03:45:05 +00:00
-
762f054d6d
feat: double-buffer TMA pipeline in multi-row kernel
biondizzle
2026-05-30 03:20:49 +00:00
-
4a9c850e9c
feat: double-buffer TMA pipeline for K loads in single-tile kernel
biondizzle
2026-05-30 03:14:06 +00:00
-
afa949071b
fix: brace structure in V TMA conversion
biondizzle
2026-05-29 22:59:18 +00:00
-
ec577f71ee
feat: V TMA loads in single-tile kernel too
biondizzle
2026-05-29 22:57:59 +00:00
-
422e7bb312
cleanup: v_head reference in multi-row (V via TMA now)
biondizzle
2026-05-29 22:54:44 +00:00
-
88c72a887e
feat: V TMA loads in multi-row kernel
biondizzle
2026-05-29 22:51:24 +00:00
-
13403d2808
cleanup: remove unused v_head in multi-tile (V via TMA)
biondizzle
2026-05-29 22:48:50 +00:00
-
74145a31cc
feat: V TMA loads in multi-tile kernel
biondizzle
2026-05-29 22:46:21 +00:00
-
680d2ebf64
test: V TMA diagnostic — isolate V TMA descriptor issue
biondizzle
2026-05-29 22:42:46 +00:00
-
077fbdf3c5
test: HD=128/256 multi-tile variants
biondizzle
2026-05-29 20:02:00 +00:00
-
7df17384fd
test: multi-tile s_k=128/256/384/512
biondizzle
2026-05-29 19:59:21 +00:00
-
d47b2bfcce
fix: use un-normalized P for multi-tile PV (correct online softmax merge)
biondizzle
2026-05-29 19:57:54 +00:00
-
43ae3e7f98
fix: reload Q per-K-sub-tile in multi-tile kernel (same as single-tile)
biondizzle
2026-05-29 19:56:35 +00:00
-
7598d548ee
debug: test multi-tile with s_k=128 only
biondizzle
2026-05-29 19:53:02 +00:00
-
8e99bd50e6
feat: 6-warp TMA multi-tile KV kernel with register accumulator + test
biondizzle
2026-05-29 19:49:53 +00:00
-
1814510195
wip: add n_kv_tiles param for multi-tile KV (not yet used)
biondizzle
2026-05-29 19:47:48 +00:00
-
d20792aa9d
fix: TMA descriptor index for batched multi-head (batch*n_h + head)
biondizzle
2026-05-29 19:45:44 +00:00
-
754c6a692c
feat: per-head TMA descriptors for multi-head FMHA
biondizzle
2026-05-29 19:44:58 +00:00
-
9eb193458e
test: refactored multi-row TMA test with multi-head and batch
biondizzle
2026-05-29 19:43:41 +00:00
-
832a04181d
test: relax relative error threshold to 5% for BF16, use cosine > 0.999 as pass criterion
biondizzle
2026-05-29 19:41:40 +00:00
-
bfef94f5d0
test: HD=128/256 multi-row TMA FMHA
biondizzle
2026-05-29 19:40:32 +00:00
-
a1b2ab79a1
feat: 6-warp TMA FMHA multi-row kernel + test
biondizzle
2026-05-29 19:39:17 +00:00
-
d0a50f1f2e
fix: remove double normalization in TMA epilogue (P already normalized before PV)
biondizzle
2026-05-29 19:36:41 +00:00
-
fb971781aa
fix: revert V to direct load (V TMA needs debugging), K TMA works
biondizzle
2026-05-29 19:35:44 +00:00