biondizzle
  • Joined on 2025-12-10
biondizzle pushed to mega-moe-nvfp4 at biondizzle/deepseek-v4-quant 2026-05-12 15:07:37 +00:00
787d427847 test: fix NVFP4 mega_moe test dimensions for SMEM alignment
biondizzle pushed to nvfp4-mega-moe at biondizzle/DeepGEMM 2026-05-12 15:04:26 +00:00
94b30dc2bc revert: block_n/4 was correct (SwiGLU halving × FP4 packing)
biondizzle pushed to nvfp4-mega-moe at biondizzle/DeepGEMM 2026-05-12 14:58:12 +00:00
c71fb97687 fix: L1 output TMA smem_inner_dim was block_n/4, should be block_n/2
biondizzle pushed to mega-moe-nvfp4 at biondizzle/deepseek-v4-quant 2026-05-12 14:53:49 +00:00
8737fd57c0 remove crap
biondizzle pushed to nvfp4-mega-moe at biondizzle/DeepGEMM 2026-05-12 14:31:41 +00:00
d8ae7a3225 debug: print SF shape/strides before interleave
biondizzle pushed to nvfp4-mega-moe at biondizzle/DeepGEMM 2026-05-12 14:23:04 +00:00
e498a2c729 fix: single transpose back to MN-major, don't double-transpose
biondizzle pushed to nvfp4-mega-moe at biondizzle/DeepGEMM 2026-05-12 14:22:07 +00:00
916f03d528 debug: add transform output shape/stride prints
biondizzle pushed to nvfp4-mega-moe at biondizzle/DeepGEMM 2026-05-12 14:11:56 +00:00
1f13b24354 debug: add strides to SF debug prints
biondizzle pushed to nvfp4-mega-moe at biondizzle/DeepGEMM 2026-05-12 14:02:00 +00:00
bfe612969b fix: preserve MN-major layout when interleaving L1 SF tensors
biondizzle pushed to nvfp4-mega-moe at biondizzle/DeepGEMM 2026-05-12 13:48:47 +00:00
76220ac6ee fix: force contiguous on SF tensors before C++ call
biondizzle pushed to nvfp4-mega-moe at biondizzle/DeepGEMM 2026-05-12 13:28:33 +00:00
bf5bf8d995 fix: unpack weight tuples before printing debug info
biondizzle pushed to mega-moe-nvfp4 at biondizzle/deepseek-v4-quant 2026-05-12 13:10:44 +00:00
52c3aefe73 bump cache busters to 33 for debug build
biondizzle pushed to nvfp4-mega-moe at biondizzle/DeepGEMM 2026-05-12 13:10:34 +00:00
5ac151d0a5 debug: print tensor dtypes/shapes at C++ call boundary in fp8_nvfp4_mega_moe
biondizzle pushed to mega-moe-nvfp4 at biondizzle/deepseek-v4-quant 2026-05-12 12:23:46 +00:00
ca1d306890 fix: use torch.int8 for packed FP4 tensors (kPackedFP4=kInt8, not uint8)
biondizzle pushed to mega-moe-nvfp4 at biondizzle/deepseek-v4-quant 2026-05-12 11:15:07 +00:00
b8f95ffad3 docker: add OMP_NUM_THREADS=64, remove --tool initcheck, mount cubin cache
5840291ea3 fix staging kernel packed_k_mask double-count
biondizzle pushed to nvfp4-mega-moe at biondizzle/DeepGEMM 2026-05-12 08:08:21 +00:00
26a8ab75a1 NVFP4: fix SF pipeline — 2 K-cols per BLOCK_K for group=16
biondizzle pushed to mega-moe-nvfp4 at biondizzle/deepseek-v4-quant 2026-05-12 07:24:37 +00:00
5ea5b579c3 Trim banner, no code changes
biondizzle pushed to nvfp4-mega-moe at biondizzle/DeepGEMM 2026-05-12 07:08:09 +00:00
680874d067 NVFP4 L1 epilogue: group_size=16 SF layout
biondizzle pushed to nvfp4-mega-moe at biondizzle/DeepGEMM 2026-05-12 06:51:42 +00:00
c0850a6859 Fix weight TMA descriptors: packed E2M1 needs K/2, block_k/2, swizzle/2
biondizzle pushed to nvfp4-mega-moe at biondizzle/DeepGEMM 2026-05-12 05:52:35 +00:00
fbfeb54c9a Fix fold_global_scale: UE4M3 scales use .to(float32), not shift-by-23