Commit Graph

1804 Commits

Author SHA1 Message Date
e4ee9fdc9f P6: Fix host-side BF16→FP32 conversion in test 2026-05-30 17:01:13 +00:00
a88b321433 P6: Fix host-side BF16 conversion in test 2026-05-30 17:00:51 +00:00
1a87e054db P6: Fix constexpr and bf16 conversion in CUDA test 2026-05-30 17:00:05 +00:00
2833eb56e7 P6: Add minimal CUDA test for TMA store epilogue 2026-05-30 16:59:45 +00:00
6a7726e764 P6: Add integration test for TMA store epilogue
test_p6_tma_epilogue.py: Tests direct GMEM path, TMA store path,
and parity between both. Gate: cos >= 0.999998.
2026-05-30 16:58:24 +00:00
fd7c0cb773 P6: Fix TMA store — use bulk_group (commit+wait) not mbarrier
TMA store uses cp.async.bulk.tensor.2d.global.shared::cta.tile.bulk_group
NOT mbarrier::complete_tx::bytes. Completion tracked via:
  - cp.async.bulk.commit_group (after issuing stores)
  - cp.async.bulk.wait_group.read 0 (wait for all groups)

Removed sMbarStore from SMEM allocations (no longer needed).
2026-05-30 16:57:35 +00:00
212fc85627 P6: One-way TMEM→regs→SMEM→TMA store epilogue
- fmha_6warp_multihead.cuh: Rewritten epilogue with proper Blackwell pipeline
  1. TMEM → regs (tcgen05.ld, warp-collective)
  2. epilogue_op in regs (normalize, FP4 hook via ENABLE_FP4_EPILOGUE)
  3. regs → SMEM row-major (sO_epi, for TMA tile format)
  4. TMA store SMEM → GMEM (async, enables multi-CTA)
  Fallback to direct GMEM write when tma_o is nullptr.
  Added FmhaParams.tma_o field and ENABLE_FP4_EPILOGUE template param.

- fmha_6warp_tma_multirow_multitile.cuh: Same epilogue pattern for multi-tile.
  Writes normalized output to sO_epi_rowmajor + TMA store (or direct GMEM).
  Added tma_o to FmhaTmaMultiRowMultiTileParams.

- fmha_tma.cuh: Added tma_store_2d and tma_store_wait for async GMEM writes.

- fmha_multihead_capi.cu: Added fmha_multihead_decode_tma_launch with
  per-(head,batch) TMA descriptors. Updated SMEM size calculation for sO_epi + sMbarStore.

- fmha_multitile_capi.cu: Added tma_o=nullptr (backward compatible), updated SMEM size.
2026-05-30 16:56:07 +00:00
05b5bf9db1 docs: mark P5 as done in NEXT_PRIORITIES.md 2026-05-30 10:54:21 +00:00
95e0c8c464 P5: fix multi-tile test — use same Q data for kernel and reference 2026-05-30 10:49:12 +00:00
e701a1411c P5: use multi-tile kernel for N>128 in integration test 2026-05-30 10:47:00 +00:00
5932e928a8 cleanup: remove debug test files (P4, P5) 2026-05-30 10:46:14 +00:00
8fef46ce73 P5: add reference comparison to Python multi-tile test 2026-05-30 10:45:02 +00:00
897a70a491 P5: minimal Python multi-tile test 2026-05-30 10:43:26 +00:00
a2627359fb P5: fix TMA desc creation — write to HOST then cudaMemcpy to device 2026-05-30 10:40:01 +00:00
f370bfb1f1 P5: re-enable multi-tile Python tests, fix CAPI to use create_tma_desc_2d_bf16 2026-05-30 10:38:33 +00:00
da54f6439f P5: fix TMA multitile test (include cuda.h first, proper SMEM calc) 2026-05-30 10:35:34 +00:00
34320653e9 P5: standalone TMA multi-tile test with 128B-aligned memory 2026-05-30 10:34:20 +00:00
a1d05b3055 P5: disable multi-tile Python tests (TMA descriptor alignment issue) 2026-05-30 10:32:44 +00:00
97531a68e6 fix: remove n_kv_tiles from capi too 2026-05-30 10:30:40 +00:00
a5b47602b5 fix: remove n_kv_tiles from standalone test (struct doesn't have it anymore) 2026-05-30 10:28:38 +00:00
f032800eaa P5: integrate WORKING multi-tile kernel (fmha_6warp_tma_multirow_multitile) into production
- fmha_multitile_capi.cu: C API wrapper for TMA multi-tile kernel
  Creates TMA descriptors per (head, batch), launches kernel
- fmha_multitile_op.py: nvcc precompile + ctypes loader
- production.py: dispatch to multitile for N>128 or hd=512
- Reverted fmha_6warp_multihead.cuh to working single-tile version
- The TMA multi-tile kernel already passes 72 configs (D1.5)
  HD=64/128/256/512 × T=1/4/32/128 × s_k=128/256/384/512
2026-05-30 10:27:38 +00:00
032cb4c7b2 P5: add single-tile merge comparison to multitile test 2026-05-30 09:06:57 +00:00
d424ccbcc1 fix: const not constexpr for SCALE 2026-05-30 09:04:45 +00:00
3da31de4c0 P5: fix BF16 host helpers for standalone test 2026-05-30 09:04:05 +00:00
9e6ba25a98 P5: standalone multi-tile CUDA test (2 KV tiles, hd=64) 2026-05-30 09:01:52 +00:00
b61df2657b P5: fix reference attention for MQA/GQA (kv_idx = h // q_per_kv) 2026-05-30 08:59:50 +00:00
c55030a340 P5: clean kernel with runtime branch (single-tile unchanged, multi-tile separate path)
Single-tile path is IDENTICAL to the working pre-P5 kernel.
Multi-tile path uses FA2 online softmax with sOacc accumulator.
Runtime branch on is_multi_tile = (n_kv_tiles > 1).
2026-05-30 08:57:00 +00:00
5f4856d771 P5: fix sOacc init race — use single thread (tid==0) instead of 4 softmax warps 2026-05-30 08:53:50 +00:00
66b126ded8 P5: fix standalone test template — add n_kv_tiles to FmhaParams 2026-05-30 08:50:38 +00:00
0f34f60494 P5: fix single-tile backward compat (normalized P for n_kv_tiles==1) 2026-05-30 08:47:47 +00:00
2649488d13 P5: in-kernel multi-KV-tile FA2 online softmax in fmha_6warp_multihead.cuh
- Kernel loops over KV tiles internally with running max/sum rescale
- SMEM accumulator sOacc[hd] replaces TMEM accumulation across tiles
- P is UN-NORMALIZED for multi-tile (exp(s-max), not /sum)
- Per KV tile: QK→softmax→PV→TMEM→read→add to sOacc
- Final: O = sOacc / running_sum
- Single tile (n_kv_tiles=1): same as before, no rescale
- Updated CAPI, Python loader, production.py fast path
- Added multi-tile test cases (N=256, 512)
2026-05-30 08:46:09 +00:00
6421f7c3f3 P4 RESOLVED: TMA hang was GMEM misalignment, not descriptor/driver issue
Evidence: TMA loads succeed with 128B-aligned GMEM on all descriptor configs.
The bit-21 workaround was NOT needed. The 'misaligned address' crashes were
caused by passing non-128B-aligned GMEM pointers to cp.async.bulk.tensor.

Added docs/p4_tma_hang_resolution.md with root cause and fix.
Cleaned up stale P4 test files.
2026-05-30 08:42:18 +00:00
58c087416b P4: 128B-aligned GMEM, proper SMEM alignment, bit21 test 2026-05-30 08:41:15 +00:00
90c806733f P4: test TMA with bit-21 workaround and innermost-first dims 2026-05-30 08:40:21 +00:00
16027018df P4: fix TMA load test (32-bit SMEM addrs, proper mbarrier) 2026-05-30 08:38:55 +00:00
e2ecdc42d8 P4: TMA load test kernel (swizzle vs no-swizzle hang diagnosis) 2026-05-30 08:38:11 +00:00
bd104c2ab2 P4: fix OOB fill enum name 2026-05-30 08:37:05 +00:00
cdd1babf1f P4: correct CUDA 13.2 API (dataType before rank, FloatOOBfill, globalDim) 2026-05-30 08:36:24 +00:00
8df3ccecea P4: CUDA 13.2 has 10-param cuTensorMapEncodeTiled (no OOB fill) 2026-05-30 08:35:34 +00:00
d8ffdb66e1 P4: fix API signature rank/dtype order, OOB_FILL defines 2026-05-30 08:35:04 +00:00
277689f8b8 P4: use proper CUDA enum names 2026-05-30 08:34:19 +00:00
6d624a1b14 P4: remove explicit enum casts 2026-05-30 08:33:42 +00:00
4898a946eb P4: fix TMA descriptor dump API order (dtype before rank) 2026-05-30 08:33:12 +00:00
3943be6063 P4: fix TMA descriptor dump (cuuint64_t dims, proper CUtensorMap API) 2026-05-30 08:32:34 +00:00
4df6ea2d8c P4: TMA descriptor dump test (cuTensorMapEncodeTiled) 2026-05-30 08:31:56 +00:00
ae425b5522 P3: clean up test, remove debug files, final integration test
- test_p3_fast_decode.py: clean kernel test + full API test
- Removed debug tests (sanity, v_debug, v_ref_debug)
- Double normalization fix verified: kernel output matches reference
  at cos >= 0.999990 across all MHA/MQA/GQA configs
2026-05-30 08:29:25 +00:00
10915c4e70 fix: remove double normalization in fmha_6warp_multihead epilogue
P was already normalized in softmax step. PV = P_norm @ V gives the
correct attention output. Dividing by row_sum again in the epilogue
produces O = O_correct / row_sum (128x too small for uniform data).
2026-05-30 08:26:20 +00:00
cfac224b59 debug: single head sanity test with known values 2026-05-30 08:25:20 +00:00
1c74d35fb4 debug: V layout reference comparison 2026-05-30 08:24:35 +00:00
a3c5f817e1 debug: compare api vs direct kernel vs reference 2026-05-30 08:23:43 +00:00