biondizzle
  • Joined on 2025-12-10
biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel 2026-05-29 19:32:03 +00:00
4459ddefdd feat: 6-warp TMA FMHA kernel + test — TMA for K loads
biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel 2026-05-29 19:30:52 +00:00
7a8ba8eeb6 fix: SMEM size calculation — TILE_SZ is in BF16 elements, need *sizeof(bf16_t) for bytes
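
The unit mix-up this commit describes can be sketched on the host side as follows. This is a minimal illustration, not the repository's code: the `bf16_t` struct is a hypothetical 2-byte stand-in, and the `TILE_SZ` value and the `smem_bytes` helper are assumptions chosen only to make the arithmetic concrete.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical stand-in for the kernel's bf16_t: any 2-byte type gives the same sizing.
struct bf16_t { std::uint16_t bits; };

// Assumed tile size, counted in BF16 elements (not bytes).
constexpr std::size_t TILE_SZ = 128 * 64;

// Hypothetical helper: bytes of shared memory needed for `num_tiles` tiles.
// Multiplying by sizeof(bf16_t) converts the element count to a byte count.
constexpr std::size_t smem_bytes(std::size_t num_tiles) {
    return num_tiles * TILE_SZ * sizeof(bf16_t);
}

static_assert(sizeof(bf16_t) == 2, "BF16 occupies 2 bytes");
static_assert(smem_bytes(1) == 2 * TILE_SZ, "bytes are twice the element count");
```

Passing `TILE_SZ` directly as a byte count would request only half the shared memory the tile occupies, which is the kind of error the commit message points at.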
biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel 2026-05-29 19:29:37 +00:00
aac1b25442 test: TMA QK diagnostic — 3 variants to isolate failure
biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel 2026-05-29 19:28:24 +00:00
9dfada6626 test: TMA + canonical + QK GEMM incremental
biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel 2026-05-29 19:27:35 +00:00
0435e229bd fix: typo cuda_SUCCESS -> cudaSuccess
biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel 2026-05-29 19:26:59 +00:00
74514e2680 test: TMA sub-tile load — exact pattern from test_qk_softmax
biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel 2026-05-29 19:26:10 +00:00
e449d6d5e1 test: TMA diagnostic with 192 threads
biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel 2026-05-29 19:25:39 +00:00
0b36b6047a test: TMA diagnostic with 128 threads
biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel 2026-05-29 19:25:04 +00:00
a766b488c2 test: minimal TMA diagnostic — isolate multi-warp TMA bug
biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel 2026-05-29 19:12:29 +00:00
fe3b6b8d13 test: QK+softmax T=1 first
biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel 2026-05-29 19:11:21 +00:00
a9a87fe7b8 fix: P write with lane stride, use sRowSum
biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel 2026-05-29 19:10:10 +00:00
fd6a9b00ae test: QK + softmax — verify P values against reference
biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel 2026-05-29 19:08:49 +00:00
5eff53c145 fix: SMEM layout and printf in PV-only test
biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel 2026-05-29 19:06:54 +00:00
106f103c83 test: PV-only GEMM — isolate PV from full FMHA pipeline
biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel 2026-05-29 18:57:42 +00:00
5542a9da00 debug: V loaded directly from GMEM (not TMA) to isolate PV issue
biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel 2026-05-29 18:54:03 +00:00
2262e10fca fix: PV GEMM — V canonical uses CORES_MN_V=2 (block_mn=16), not 16
biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel 2026-05-29 18:51:00 +00:00
90c3372040 refactor: TMA FMHA kernel — 4-warp, proven pattern, full pipeline
biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel 2026-05-29 18:48:41 +00:00
d5e20b2d42 fix: reference should be raw dot product (MMA is unscaled)
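
A host-side reference matching an unscaled MMA result might look like the sketch below. The function name `qk_reference`, the row-major layouts, and the use of `float` are assumptions for illustration; the only point taken from the commit message is that no scale factor is applied to the dot product.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical reference: S[i][j] = sum over k of Q[i][k] * K[j][k].
// Q is M x D and K is N x D, both row-major (assumed layout).
// No softmax scale is applied, so the result is the raw dot product.
std::vector<float> qk_reference(const std::vector<float>& Q,
                                const std::vector<float>& K,
                                std::size_t M, std::size_t N, std::size_t D) {
    std::vector<float> S(M * N, 0.0f);
    for (std::size_t i = 0; i < M; ++i) {
        for (std::size_t j = 0; j < N; ++j) {
            float acc = 0.0f;
            for (std::size_t k = 0; k < D; ++k) {
                acc += Q[i * D + k] * K[j * D + k];
            }
            S[i * N + j] = acc;  // raw value, compared directly against the MMA output
        }
    }
    return S;
}
```

If the reference applied an attention scale while the kernel output did not, every element would differ by the same constant factor, which is an easy mismatch to misread as a kernel bug.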
biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel 2026-05-29 18:47:30 +00:00
2b945f255b test: TMA K-load + QK GEMM — incremental from working pattern
biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel 2026-05-29 18:46:10 +00:00
f33746f183 test: minimal TMA K-load — no MMA/TMEM, just verify TMA + canonical