6f371d6b31
D2: add flat_divide shape print, try different coordinate order
biondizzle
2026-05-24 23:42:04 +00:00
7007a9db79
D2: use flat_divide for runtime coordinate indexing (like CUTLASS)
biondizzle
2026-05-24 23:40:37 +00:00
3e340a0eee
D2: fix local_tile coordinate for 4D Q (2 rest modes, not 3)
biondizzle
2026-05-24 23:38:48 +00:00
b5cd1b88c9
D2: add shape debug print for mQ/mK
biondizzle
2026-05-24 23:37:10 +00:00
df3146eb53
D2: hardcode a_major=MN for multi-CTA (Q is always MN-major in FMHA)
biondizzle
2026-05-24 23:35:49 +00:00
e809e71253
D2: use tensor indexing q[0] instead of local_tile for layout extraction
biondizzle
2026-05-24 23:34:38 +00:00
49c4189195
D2: fix LayoutEnum for multi-dim Q (use head-0 view for layout)
biondizzle
2026-05-24 23:33:27 +00:00
2b76b691cb
fix: block_idx() returns tuple, use [1] for y
biondizzle
2026-05-24 23:29:59 +00:00
4c79e5533e
D2: add multi-CTA grid with block_idx_y for Q/O head indexing
biondizzle
2026-05-24 23:27:38 +00:00
335e310c79
Update D2 status in README
biondizzle
2026-05-24 22:58:23 +00:00
e3e67c3992
NVFP4-3: enable 2-CTA UMMA when MMA tile M >= 256 (1.7-1.9x throughput)
biondizzle
2026-05-24 22:57:49 +00:00
e0339a92fc
D2: revert multi-CTA grid params (using per-head launch approach instead)
biondizzle
2026-05-24 22:52:21 +00:00
a5271821a8
D2: add scale test (more heads, larger hd)
biondizzle
2026-05-24 22:49:44 +00:00
d563c93fc5
D2: add per-head launch test
biondizzle
2026-05-24 22:48:22 +00:00
9b476d87f9
fix: compare un-normalized O against un-normalized reference
biondizzle
2026-05-24 22:44:11 +00:00
0ca7b58a6a
D1: fully revert LSE change back to original sfw_idx==0 guard
biondizzle
2026-05-24 22:41:32 +00:00
db353ec35a
D2: add simple n_h=1 regression test
biondizzle
2026-05-24 22:39:25 +00:00
4418e04a28
D1: revert per-row LSE to sfw_idx=0 for now (debugging D2 regression)
biondizzle
2026-05-24 22:28:11 +00:00
2cc66bff68
D2: add initial multi-head test file
biondizzle
2026-05-24 22:26:10 +00:00
49e66fb6e4
D1: corrected KV merge test with proper normalized output formula
biondizzle
2026-05-24 22:24:27 +00:00
c47f648617
fix lse verify
biondizzle
2026-05-24 22:23:08 +00:00
3577e09603
D1: add LSE verification test
biondizzle
2026-05-24 22:22:31 +00:00
674c5b9c18
D1: fix per-row LSE output + add KV merge test v2 with per-row LSE
biondizzle
2026-05-24 22:21:51 +00:00
18f3274c0b
D1: DEBUG - NO-OP O rescale (multiply by 1.0) to test TMEM round-trip
biondizzle
2026-05-24 22:19:16 +00:00
c33185ca0a
D1: add rescale diagnostic
biondizzle
2026-05-24 22:18:12 +00:00
02edff5ac7
D1: add KV merge test using log-sum-exp (avoids TMEM round-trip)
biondizzle
2026-05-24 22:17:24 +00:00
0f30319e06
Revert "D1: move O rescale atoms outside const_expr guard (match CUTLASS pattern)"
biondizzle
2026-05-24 22:15:38 +00:00
aaf21d8ac1
D1: move O rescale atoms outside const_expr guard (match CUTLASS pattern)
biondizzle
2026-05-24 22:07:18 +00:00
35a3c04e8e
fix debug test
biondizzle
2026-05-24 22:04:51 +00:00
a391aa1fd3
D1: add rescale debug test
biondizzle
2026-05-24 22:04:20 +00:00
55c6903980
D1: fix O rescale identity tensor - use PV MMA shape not QK shape
biondizzle
2026-05-24 22:02:55 +00:00
f1aab1bfc1
D1: add multi-KV-tile O rescale test (s_k=256,384,512)
biondizzle
2026-05-24 22:00:42 +00:00
77b366d44b
Scrub B200 password from markdown files
biondizzle
2026-05-24 21:52:54 +00:00
83506e6ad2
Add MAY_24_26_PLAN.md: next session startup plan
biondizzle
2026-05-24 21:50:32 +00:00
9435bf9653
Restore NVFP4 Precision Roadmap + add O rescale gap to D1.5
biondizzle
2026-05-24 21:48:58 +00:00
03cbd8ffa6
Add STAGE_D2.md: Multi-query grid + head packing plan
biondizzle
2026-05-24 21:43:04 +00:00
f4e0315af9
Remove obsolete STAGE_D1.3.md and SMEM_P_GUIDANCE_REQUEST.md
biondizzle
2026-05-24 21:41:17 +00:00
dadfad8f89
Docs: Update STAGE_D.md, README.md with hd=512 compilation blocker, lessons learned
biondizzle
2026-05-24 21:35:25 +00:00
a5fef69363
D1.4: Use cutlass.range(unroll=1) for k_sub loops in both TMA and MMA warps
biondizzle
2026-05-24 17:55:33 +00:00
c11ac38ceb
D1.4: Remove --opt-level 0 from hd512 test (use default opt level)
biondizzle
2026-05-24 16:42:01 +00:00
b14d88f37f
D1.4: Fix merge test - use use_smem_p=False for hd=256 kernel (SMEM budget)
biondizzle
2026-05-24 16:36:48 +00:00
e6c9e6c0d0
D1.4: Add external k_sub merge test for hd=512 (avoids slow in-kernel k_sub compilation)
biondizzle
2026-05-24 16:31:06 +00:00
13fcf16b14
D1.4: Use --opt-level 0 only (ptxas -j not supported, MLIR is the bottleneck)
biondizzle
2026-05-24 15:43:17 +00:00
b4da412b30
D1.4: Use options string for compile flags (--ptxas-options -j64 --opt-level 0)
biondizzle
2026-05-24 15:40:39 +00:00
4f69dffc93
D1.4: Add PtxasOptions -j64 + OptLevel(0) for faster hd=512 compilation
biondizzle
2026-05-24 15:36:35 +00:00
331ddb29b7
D1.4: Fix regression test for un-normalized O output (D5a)
biondizzle
2026-05-24 15:13:16 +00:00
25201d0c3d
D1.4: Guard LSE computation with const_expr(not normalize) - fixes BF16 type mismatch in regression test
biondizzle
2026-05-24 15:11:39 +00:00
7f64a11eea
D1.4: Switch k_sub from cutlass.range to Python range (unrolled at trace time)
biondizzle
2026-05-24 15:10:28 +00:00
6d7b8fed3e
D1.4: Fix tTMrO placeholder - define only inside const_expr block
biondizzle
2026-05-24 14:23:22 +00:00
7a4ff959bf
D1.4: Use cutlass.range loop for k_sub (reduce IR), guard O rescale with const_expr(n_kv_tiles>1)
biondizzle
2026-05-24 14:22:45 +00:00
449a6e7ede
Fix: add cutlass import to test_d1_qk512
biondizzle
2026-05-24 14:20:32 +00:00
ce267909ad
Fix: add cpasync import to test_d1_qk512
biondizzle
2026-05-24 14:20:01 +00:00
625837fd44
D1.4: Add hd=512 QK-only and standalone test for compilation debugging
biondizzle
2026-05-24 14:19:26 +00:00
592873b560
D1.4: Reduce pv_n_tile to 128 for hd=512 to fit SMEM budget (192KB)
biondizzle
2026-05-24 08:07:32 +00:00
e7c146dbfd
D1: Unrolled k_sub path (hardcoded k_sub=0,1) to avoid cutlass.range IR explosion
biondizzle
2026-05-24 07:03:14 +00:00
dd39c2ebdf
D1: Use cutlass.range for k_sub loops (CuTeDSL immutable handle)
biondizzle
2026-05-24 06:43:30 +00:00
2bf3ee40aa
D1: Fix kvh scoping - define before loops, consume V via pipeline
biondizzle
2026-05-24 06:42:26 +00:00
f2170fc1b3
D1: Fix kvb→kvh typo in PV GEMM
biondizzle
2026-05-24 06:41:25 +00:00
e2b914be5e
D1: Remove qh.commit() - pipeline handles commit internally
biondizzle
2026-05-24 06:40:10 +00:00
583c509bcd
D1: TMA producer uses acquire_and_advance + commit (no wait_and_advance)
biondizzle
2026-05-24 06:38:15 +00:00
3bf1e62b58
D1: Use same pipeline API as working code (acquire_and_advance) for k_sub path
biondizzle
2026-05-24 06:36:19 +00:00
85af7f4cf3
D1: Add PipelineState for k_sub TMA path
biondizzle
2026-05-24 05:02:17 +00:00
622089ad16
D1: Fix pipeline API for K sub-tile path (producer_acquire/commit)
biondizzle
2026-05-24 04:59:41 +00:00
b9e806f09d
D1: K sub-tile MMA path using pipeline barriers
biondizzle
2026-05-24 04:57:08 +00:00
98e974403c
D1: Fix TMA copies in k_sub path (no mbarrier, use cp_async wait)
biondizzle
2026-05-24 04:53:46 +00:00
e637d3ae73
D1: Add K sub-tile loop for hd=512 (const_expr guarded, hd≤256 path unchanged)
biondizzle
2026-05-24 04:51:51 +00:00
24b9310682
D1: Debug TMA partition shapes at hd=512
biondizzle
2026-05-24 04:43:12 +00:00
9201a844dd
D1: K sub-tiling - qk_mma_tiler K-dim = k_tile=256, SMEM fits at hd=512
biondizzle
2026-05-24 04:41:12 +00:00
6be7690011
Docs: Update STAGE_D.md, README.md status for D1 hd≤256 milestone
biondizzle
2026-05-24 04:32:43 +00:00
787d0160a1
D1: Full test with TMEM-P at hd=64,128,256,512
biondizzle
2026-05-24 04:07:40 +00:00
d234297712
D1: Remove debug prints, clean up
biondizzle
2026-05-24 04:06:26 +00:00
3b63405ad4
D1: const_expr for sP layout selection (CuTeDSL)
biondizzle
2026-05-24 04:05:17 +00:00
1c8b043702
D1: Python if for sP layout (trace-time, not MLIR)
biondizzle
2026-05-24 04:04:27 +00:00
3aa8e5185a
D1: Tiny 4-mode sP placeholder for TMEM-P path
biondizzle
2026-05-24 04:03:28 +00:00
03ad730a9b
D1: Conditional sP allocation (saves 64KB SMEM for TMEM-P at hd=256)
biondizzle
2026-05-24 04:02:02 +00:00
975829e5c7
D1: Fix sP dummy allocation
biondizzle
2026-05-24 04:00:19 +00:00
5fda73b53b
D1: Skip sP allocation when use_smem_p=False (saves 64KB at hd=256)
biondizzle
2026-05-24 03:59:27 +00:00
93590eb1ad
D1: Fix syntax (separate kv_stage line)
biondizzle
2026-05-24 03:58:12 +00:00
2958cad75d
D1: Reduce kv_stage to 1 at hd>128 to avoid SMEM overflow
biondizzle
2026-05-24 03:55:44 +00:00
d6f7d9009d
D1: FIX qk_mma_tiler K-dim = head_dim (was hardcoded to 64, broke hd>64)
biondizzle
2026-05-24 03:53:19 +00:00
b4bf6818c6
D1: Print qk_ik in _setup
biondizzle
2026-05-24 03:51:40 +00:00
0953708f2c
D1: Add more debug prints (QK/PV mode2 sizes)
biondizzle
2026-05-24 03:49:55 +00:00
24b9ebfba9
D1: SMEM-P test at hd=128
biondizzle
2026-05-24 03:48:37 +00:00
d9bc430570
D1: Add sP shape debug print
biondizzle
2026-05-24 03:46:27 +00:00
0f50933f69
D1: Fix SMEM-P (coordinate store), LSE (FP32), add TMEM-P-only test
biondizzle
2026-05-24 03:27:14 +00:00
c995a2ca46
D1: Fix SMEM-P - coordinate-indexed store (replaces make_tiled_copy_C)
biondizzle
2026-05-24 03:24:44 +00:00
0de0f20799
feat: SMEM-P make_tiled_copy_C + zero-fill dest tensor
biondizzle
2026-05-24 03:23:48 +00:00
99b2e12fd8
Merge branch 'master' of ssh://sweetapi.com:2222/biondizzle/nvfp4-megamoe-kernel
biondizzle
2026-05-24 03:23:22 +00:00
f645f3994a
D1: LSE diagnostic at various hd
biondizzle
2026-05-24 03:23:16 +00:00
54915f6b56
feat: SMEM-P using make_tiled_copy_C(qk_mma) approach
biondizzle
2026-05-24 03:22:51 +00:00
c042fcf6c7
D1: Add diagnostic test (TMEM-P vs SMEM-P at various hd)
biondizzle
2026-05-24 03:22:23 +00:00
09c7d8eb36
Merge branch 'master' of ssh://sweetapi.com:2222/biondizzle/nvfp4-megamoe-kernel
biondizzle
2026-05-24 03:21:06 +00:00
1c5d6475e5
D1 test: compare un-norm O + norm using ref row_sum + LSE verification
biondizzle
2026-05-24 03:21:01 +00:00
ea4b6b10bc
fix: LSE type mismatch Float32→BFloat16
biondizzle
2026-05-24 03:20:24 +00:00
850f16b2a3
merge: keep our fmha.py (coordinate-indexed SMEM-P + epilogue_tma_store)
biondizzle
2026-05-24 03:19:52 +00:00
53bc54ed17
D1.5: Fix SMEM-P - use coordinate-indexed store (same proven pattern)
biondizzle
2026-05-24 03:19:32 +00:00
6c0ca13aed
feat: SMEM-P with make_tiled_copy_tv + partition_S
biondizzle
2026-05-24 03:19:10 +00:00
93e7fe97f7
D1.5: Always output un-normalized O + LSE (epilogue_tma_store only, no TMEM round-trip normalize)
biondizzle
2026-05-24 03:18:33 +00:00
b22ab84f1a
feat: SMEM-P using make_tiled_copy_A from PV MMA
biondizzle
2026-05-24 03:16:34 +00:00
7357b1a866
fix: fence_proxy not fence
biondizzle
2026-05-24 02:44:20 +00:00