biondizzle
  • Joined on 2025-12-10
biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel 2026-05-23 23:21:18 +00:00
e0a11e32f8 D1.3: Fix coord extraction - identity tensor stores (m,k) pairs as values
biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel 2026-05-23 23:20:13 +00:00
a7171fa5e1 D1.3: Fix coordinate indexing - tTMEM_LOADcS first mode is (32,1) nested tuple
biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel 2026-05-23 23:19:23 +00:00
df7bc40d37 D1.3: Direct coordinate-indexed SMEM-P write using tTMEM_LOADcS coords
biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel 2026-05-23 23:17:32 +00:00
2bbe55b08c D1.3: Use make_cotiled_copy for SMEM-P — custom TV layout from TMEM-load coords to sP
biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel 2026-05-23 23:03:37 +00:00
2e86ed939e Add SMEM-P guidance request document for CUTLASS LLM consultation
biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel 2026-05-23 22:30:57 +00:00
029c21a2af D1.3: Use const_expr for lse None check
biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel 2026-05-23 22:30:16 +00:00
1720a0e86b D1.3: Fix LSE with const_expr, always create valid mLSE tensor
biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel 2026-05-23 22:29:12 +00:00
bce31176aa D1.3: Try make_tiled_copy_C(qk_mma) for SMEM-P copy - zero-fill source for compile test
biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel 2026-05-23 22:28:16 +00:00
c80bd021c9 D1.3: Define SMEM-P copy atoms unconditionally (CuTeDSL scoping)
biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel 2026-05-23 22:27:12 +00:00
43bb501acb D1.3: Use full sP (4D) for make_tiled_copy_D partition
biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel 2026-05-23 22:26:11 +00:00
06fd2f63e9 D1.3: SMEM-P via get_smem_store_op + make_tiled_copy_D
biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel 2026-05-23 22:24:18 +00:00
c507d0640c D1.3: Enhanced diagnostic - test QK C-fragment as source for make_tiled_copy_C
biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel 2026-05-23 22:21:33 +00:00
bf896c0894 D1.3: Skip fragment creation in diagnostic, just print layouts
biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel 2026-05-23 22:20:17 +00:00
0c435b3e51 D1.3: Fix diagnostic - use dummy ptr 0 for shape analysis
biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel 2026-05-23 22:19:12 +00:00
55caf8be38 D1.3: Fix sP allocation - p_smem_s.outer is already a layout
biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel 2026-05-23 22:17:58 +00:00
d1c600f599 D1.3: Fix layout diagnostic - compute c_major outside kernel
biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel 2026-05-23 22:16:58 +00:00
d3d0020b4e D1.3: Layout diagnostic v2 - run inside JIT-compiled kernel
biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel 2026-05-23 22:15:49 +00:00
ec8fd1474c D1.3: Fix layout diagnostic - remove JIT-dependent code
biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel 2026-05-23 22:14:37 +00:00
c3a7c30f20 D1.3: Layout diagnostic - print all QK C-fragment and PV A-operand shapes
biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel 2026-05-23 22:07:54 +00:00
b59aca4655 Update all .md files with D5a/D5b progress, tOrP0 fix, LSE formula