|
|
7a4ff959bf
|
D1.4: Use cutlass.range loop for k_sub (reduce IR), guard O rescale with const_expr(n_kv_tiles>1)
|
2026-05-24 14:22:45 +00:00 |
|
|
|
449a6e7ede
|
Fix: add cutlass import to test_d1_qk512
|
2026-05-24 14:20:32 +00:00 |
|
|
|
ce267909ad
|
Fix: add cpasync import to test_d1_qk512
|
2026-05-24 14:20:01 +00:00 |
|
|
|
625837fd44
|
D1.4: Add hd=512 QK-only and standalone test for compilation debugging
|
2026-05-24 14:19:26 +00:00 |
|
|
|
592873b560
|
D1.4: Reduce pv_n_tile to 128 for hd=512 to fit SMEM budget (192KB)
|
2026-05-24 08:07:32 +00:00 |
|
|
|
e7c146dbfd
|
D1: Unrolled k_sub path (hardcoded k_sub=0,1) to avoid cutlass.range IR explosion
|
2026-05-24 07:03:14 +00:00 |
|
|
|
dd39c2ebdf
|
D1: Use cutlass.range for k_sub loops (CuTeDSL immutable handle)
|
2026-05-24 06:43:30 +00:00 |
|
|
|
2bf3ee40aa
|
D1: Fix kvh scoping - define before loops, consume V via pipeline
|
2026-05-24 06:42:26 +00:00 |
|
|
|
f2170fc1b3
|
D1: Fix kvb→kvh typo in PV GEMM
|
2026-05-24 06:41:25 +00:00 |
|
|
|
e2b914be5e
|
D1: Remove qh.commit() - pipeline handles commit internally
|
2026-05-24 06:40:10 +00:00 |
|
|
|
583c509bcd
|
D1: TMA producer uses acquire_and_advance + commit (no wait_and_advance)
|
2026-05-24 06:38:15 +00:00 |
|
|
|
3bf1e62b58
|
D1: Use same pipeline API as working code (acquire_and_advance) for k_sub path
|
2026-05-24 06:36:19 +00:00 |
|
|
|
85af7f4cf3
|
D1: Add PipelineState for k_sub TMA path
|
2026-05-24 05:02:17 +00:00 |
|
|
|
622089ad16
|
D1: Fix pipeline API for K sub-tile path (producer_acquire/commit)
|
2026-05-24 04:59:41 +00:00 |
|
|
|
b9e806f09d
|
D1: K sub-tile MMA path using pipeline barriers
|
2026-05-24 04:57:08 +00:00 |
|
|
|
98e974403c
|
D1: Fix TMA copies in k_sub path (no mbarrier, use cp_async wait)
|
2026-05-24 04:53:46 +00:00 |
|
|
|
e637d3ae73
|
D1: Add K sub-tile loop for hd=512 (const_expr guarded, hd≤256 path unchanged)
|
2026-05-24 04:51:51 +00:00 |
|
|
|
24b9310682
|
D1: Debug TMA partition shapes at hd=512
|
2026-05-24 04:43:12 +00:00 |
|
|
|
9201a844dd
|
D1: K sub-tiling - qk_mma_tiler K-dim = k_tile=256, SMEM fits at hd=512
|
2026-05-24 04:41:12 +00:00 |
|
|
|
6be7690011
|
Docs: Update STAGE_D.md, README.md status for D1 hd≤256 milestone
|
2026-05-24 04:32:43 +00:00 |
|
|
|
787d0160a1
|
D1: Full test with TMEM-P at hd=64,128,256,512
|
2026-05-24 04:07:40 +00:00 |
|
|
|
d234297712
|
D1: Remove debug prints, clean up
|
2026-05-24 04:06:26 +00:00 |
|
|
|
3b63405ad4
|
D1: const_expr for sP layout selection (CuTeDSL)
|
2026-05-24 04:05:17 +00:00 |
|
|
|
1c8b043702
|
D1: Python if for sP layout (trace-time, not MLIR)
|
2026-05-24 04:04:27 +00:00 |
|
|
|
3aa8e5185a
|
D1: Tiny 4-mode sP placeholder for TMEM-P path
|
2026-05-24 04:03:28 +00:00 |
|
|
|
03ad730a9b
|
D1: Conditional sP allocation (saves 64KB SMEM for TMEM-P at hd=256)
|
2026-05-24 04:02:02 +00:00 |
|
|
|
975829e5c7
|
D1: Fix sP dummy allocation
|
2026-05-24 04:00:19 +00:00 |
|
|
|
5fda73b53b
|
D1: Skip sP allocation when use_smem_p=False (saves 64KB at hd=256)
|
2026-05-24 03:59:27 +00:00 |
|
|
|
93590eb1ad
|
D1: Fix syntax (separate kv_stage line)
|
2026-05-24 03:58:12 +00:00 |
|
|
|
2958cad75d
|
D1: Reduce kv_stage to 1 at hd>128 to avoid SMEM overflow
|
2026-05-24 03:55:44 +00:00 |
|
|
|
d6f7d9009d
|
D1: FIX qk_mma_tiler K-dim = head_dim (was hardcoded to 64, broke hd>64)
|
2026-05-24 03:53:19 +00:00 |
|
|
|
b4bf6818c6
|
D1: Print qk_ik in _setup
|
2026-05-24 03:51:40 +00:00 |
|
|
|
0953708f2c
|
D1: Add more debug prints (QK/PV mode2 sizes)
|
2026-05-24 03:49:55 +00:00 |
|
|
|
24b9ebfba9
|
D1: SMEM-P test at hd=128
|
2026-05-24 03:48:37 +00:00 |
|
|
|
d9bc430570
|
D1: Add sP shape debug print
|
2026-05-24 03:46:27 +00:00 |
|
|
|
0f50933f69
|
D1: Fix SMEM-P (coordinate store), LSE (FP32), add TMEM-P-only test
|
2026-05-24 03:27:14 +00:00 |
|
|
|
c995a2ca46
|
D1: Fix SMEM-P - coordinate-indexed store (replaces make_tiled_copy_C)
|
2026-05-24 03:24:44 +00:00 |
|
|
|
0de0f20799
|
feat: SMEM-P make_tiled_copy_C + zero-fill dest tensor
|
2026-05-24 03:23:53 +00:00 |
|
|
|
99b2e12fd8
|
Merge branch 'master' of ssh://sweetapi.com:2222/biondizzle/nvfp4-megamoe-kernel
|
2026-05-24 03:23:22 +00:00 |
|
|
|
f645f3994a
|
D1: LSE diagnostic at various hd
|
2026-05-24 03:23:16 +00:00 |
|
|
|
54915f6b56
|
feat: SMEM-P using make_tiled_copy_C(qk_mma) approach
|
2026-05-24 03:22:57 +00:00 |
|
|
|
c042fcf6c7
|
D1: Add diagnostic test (TMEM-P vs SMEM-P at various hd)
|
2026-05-24 03:22:23 +00:00 |
|
|
|
09c7d8eb36
|
Merge branch 'master' of ssh://sweetapi.com:2222/biondizzle/nvfp4-megamoe-kernel
|
2026-05-24 03:21:06 +00:00 |
|
|
|
1c5d6475e5
|
D1 test: compare un-norm O + norm using ref row_sum + LSE verification
|
2026-05-24 03:21:01 +00:00 |
|
|
|
ea4b6b10bc
|
fix: LSE type mismatch Float32→BFloat16
|
2026-05-24 03:20:26 +00:00 |
|
|
|
850f16b2a3
|
merge: keep our fmha.py (coordinate-indexed SMEM-P + epilogue_tma_store)
|
2026-05-24 03:19:52 +00:00 |
|
|
|
53bc54ed17
|
D1.5: Fix SMEM-P - use coordinate-indexed store (same proven pattern)
|
2026-05-24 03:19:32 +00:00 |
|
|
|
6c0ca13aed
|
feat: SMEM-P with make_tiled_copy_tv + partition_S
|
2026-05-24 03:19:18 +00:00 |
|
|
|
93e7fe97f7
|
D1.5: Always output un-normalized O + LSE (epilogue_tma_store only, no TMEM round-trip normalize)
|
2026-05-24 03:18:38 +00:00 |
|
|
|
b22ab84f1a
|
feat: SMEM-P using make_tiled_copy_A from PV MMA
|
2026-05-24 03:16:34 +00:00 |
|