biondizzle
  • Joined on 2025-12-10
biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel 2026-05-30 08:57:01 +00:00
c55030a340 P5: clean kernel with runtime branch (single-tile unchanged, multi-tile separate path)
biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel 2026-05-30 08:53:52 +00:00
5f4856d771 P5: fix sOacc init race — use single thread (tid==0) instead of 4 softmax warps
biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel 2026-05-30 08:50:48 +00:00
66b126ded8 P5: fix standalone test template — add n_kv_tiles to FmhaParams
biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel 2026-05-30 08:47:49 +00:00
0f34f60494 P5: fix single-tile backward compat (normalized P for n_kv_tiles==1)
biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel 2026-05-30 08:46:10 +00:00
2649488d13 P5: in-kernel multi-KV-tile FA2 online softmax in fmha_6warp_multihead.cuh
biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel 2026-05-30 08:42:21 +00:00
6421f7c3f3 P4 RESOLVED: TMA hang was GMEM misalignment, not descriptor/driver issue
biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel 2026-05-30 08:41:21 +00:00
58c087416b P4: 128B-aligned GMEM, proper SMEM alignment, bit21 test
biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel 2026-05-30 08:40:23 +00:00
90c806733f P4: test TMA with bit-21 workaround and innermost-first dims
biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel 2026-05-30 08:38:57 +00:00
16027018df P4: fix TMA load test (32-bit SMEM addrs, proper mbarrier)
biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel 2026-05-30 08:38:12 +00:00
e2ecdc42d8 P4: TMA load test kernel (swizzle vs no-swizzle hang diagnosis)
biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel 2026-05-30 08:37:07 +00:00
bd104c2ab2 P4: fix OOB fill enum name
biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel 2026-05-30 08:36:26 +00:00
cdd1babf1f P4: correct CUDA 13.2 API (dataType before rank, FloatOOBfill, globalDim)
biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel 2026-05-30 08:35:35 +00:00
8df3ccecea P4: CUDA 13.2 has 10-param cuTensorMapEncodeTiled (no OOB fill)
biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel 2026-05-30 08:35:06 +00:00
d8ffdb66e1 P4: fix API signature rank/dtype order, OOB_FILL defines
biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel 2026-05-30 08:34:20 +00:00
277689f8b8 P4: use proper CUDA enum names
biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel 2026-05-30 08:33:43 +00:00
6d624a1b14 P4: remove explicit enum casts
biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel 2026-05-30 08:33:14 +00:00
4898a946eb P4: fix TMA descriptor dump API order (dtype before rank)
biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel 2026-05-30 08:32:35 +00:00
3943be6063 P4: fix TMA descriptor dump (cuuint64_t dims, proper CUtensorMap API)
biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel 2026-05-30 08:31:58 +00:00
4df6ea2d8c P4: TMA descriptor dump test (cuTensorMapEncodeTiled)
biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel 2026-05-30 08:29:27 +00:00
ae425b5522 P3: clean up test, remove debug files, final integration test
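
The P5 entries above mention a multi-KV-tile "FA2 online softmax" and a single-tile path that keeps a normalized P. As a rough illustration of that general technique only (this is not the code in fmha_6warp_multihead.cuh, and every name below is hypothetical), here is a minimal pure-Python sketch for one query row. It assumes the usual FlashAttention-2 style recurrence: keep a running max and running sum, rescale the unnormalized output accumulator whenever the max grows, and divide once at the end.

```python
import math


def softmax_attention_reference(scores, values):
    """One-pass softmax(scores) @ values for a single query row."""
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]
    denom = sum(weights)
    dim = len(values[0])
    return [sum(w * v[d] for w, v in zip(weights, values)) / denom
            for d in range(dim)]


def softmax_attention_tiled(scores, values, tile_size):
    """Same result, visiting the KV sequence one tile at a time.

    Hypothetical illustration: running_max, running_sum and acc stand in
    for whatever per-row state a real kernel would keep on chip.
    """
    dim = len(values[0])
    running_max = -math.inf
    running_sum = 0.0
    acc = [0.0] * dim  # unnormalized output accumulator

    for start in range(0, len(scores), tile_size):
        tile_scores = scores[start:start + tile_size]
        tile_values = values[start:start + tile_size]

        new_max = max(running_max, max(tile_scores))
        # Rescale factor for everything accumulated so far.
        # On the first tile this is exp(-inf) == 0.0.
        scale = math.exp(running_max - new_max)

        probs = [math.exp(s - new_max) for s in tile_scores]
        running_sum = running_sum * scale + sum(probs)
        for d in range(dim):
            acc[d] = acc[d] * scale + sum(
                p * v[d] for p, v in zip(probs, tile_values))
        running_max = new_max

    # Single normalization at the end.
    return [a / running_sum for a in acc]


if __name__ == "__main__":
    scores = [0.5, -1.0, 3.0, 2.0, 0.0]
    values = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0], [-1.0, 0.5], [0.25, 0.75]]
    print(softmax_attention_reference(scores, values))
    print(softmax_attention_tiled(scores, values, tile_size=2))
```

With a tile size that covers the whole sequence, the loop runs once and reduces to the ordinary normalized softmax, which is the behavior the single-tile compatibility fix above appears to preserve.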