biondizzle

biondizzle pushed to mega-moe-nvfp4 at biondizzle/deepseek-v4-quant

2026-05-12 15:07:37 +00:00

787d427847 test: fix NVFP4 mega_moe test dimensions for SMEM alignment

biondizzle pushed to nvfp4-mega-moe at biondizzle/DeepGEMM

2026-05-12 15:04:26 +00:00

94b30dc2bc revert: block_n/4 was correct (SwiGLU halving × FP4 packing)

biondizzle pushed to nvfp4-mega-moe at biondizzle/DeepGEMM

2026-05-12 14:58:12 +00:00

c71fb97687 fix: L1 output TMA smem_inner_dim was block_n/4, should be block_n/2

biondizzle pushed to mega-moe-nvfp4 at biondizzle/deepseek-v4-quant

2026-05-12 14:53:49 +00:00

8737fd57c0 remove crap

biondizzle pushed to nvfp4-mega-moe at biondizzle/DeepGEMM

2026-05-12 14:31:41 +00:00

d8ae7a3225 debug: print SF shape/strides before interleave

biondizzle pushed to nvfp4-mega-moe at biondizzle/DeepGEMM

2026-05-12 14:23:04 +00:00

e498a2c729 fix: single transpose back to MN-major, don't double-transpose

biondizzle pushed to nvfp4-mega-moe at biondizzle/DeepGEMM

2026-05-12 14:22:07 +00:00

916f03d528 debug: add transform output shape/stride prints

biondizzle pushed to nvfp4-mega-moe at biondizzle/DeepGEMM

2026-05-12 14:11:56 +00:00

1f13b24354 debug: add strides to SF debug prints

biondizzle pushed to nvfp4-mega-moe at biondizzle/DeepGEMM

2026-05-12 14:02:00 +00:00

bfe612969b fix: preserve MN-major layout when interleaving L1 SF tensors

biondizzle pushed to nvfp4-mega-moe at biondizzle/DeepGEMM

2026-05-12 13:48:47 +00:00

76220ac6ee fix: force contiguous on SF tensors before C++ call

biondizzle pushed to nvfp4-mega-moe at biondizzle/DeepGEMM

2026-05-12 13:28:33 +00:00

bf5bf8d995 fix: unpack weight tuples before printing debug info

biondizzle pushed to mega-moe-nvfp4 at biondizzle/deepseek-v4-quant

2026-05-12 13:10:44 +00:00

52c3aefe73 bump cache busters to 33 for debug build

biondizzle pushed to nvfp4-mega-moe at biondizzle/DeepGEMM

2026-05-12 13:10:34 +00:00

5ac151d0a5 debug: print tensor dtypes/shapes at C++ call boundary in fp8_nvfp4_mega_moe

biondizzle pushed to mega-moe-nvfp4 at biondizzle/deepseek-v4-quant

2026-05-12 12:23:46 +00:00

ca1d306890 fix: use torch.int8 for packed FP4 tensors (kPackedFP4=kInt8, not uint8)

biondizzle pushed to mega-moe-nvfp4 at biondizzle/deepseek-v4-quant

2026-05-12 11:15:07 +00:00

b8f95ffad3 docker: add OMP_NUM_THREADS=64, remove --tool initcheck, mount cubin cache

5840291ea3 fix staging kernel packed_k_mask double-count

biondizzle pushed to nvfp4-mega-moe at biondizzle/DeepGEMM

2026-05-12 08:08:21 +00:00

26a8ab75a1 NVFP4: fix SF pipeline — 2 K-cols per BLOCK_K for group=16

biondizzle pushed to mega-moe-nvfp4 at biondizzle/deepseek-v4-quant

2026-05-12 07:24:37 +00:00

5ea5b579c3 Trim banner, no code changes

biondizzle pushed to nvfp4-mega-moe at biondizzle/DeepGEMM

2026-05-12 07:08:09 +00:00

680874d067 NVFP4 L1 epilogue: group_size=16 SF layout

biondizzle pushed to nvfp4-mega-moe at biondizzle/DeepGEMM

2026-05-12 06:51:42 +00:00

c0850a6859 Fix weight TMA descriptors: packed E2M1 needs K/2, block_k/2, swizzle/2
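
A minimal sketch of why a packed E2M1 tensor has half the byte extent along K, which is presumably what the K/2 and block_k/2 in this commit title refer to. The helper names `pack_fp4` and `unpack_fp4` are hypothetical, and the low-nibble-first order is an assumption; neither comes from DeepGEMM.

```python
# Hypothetical helpers for illustration only; not DeepGEMM's API.
# E2M1 magnitudes for the 3-bit codes 0..7 (bit 3 of each 4-bit code is the sign).
E2M1_MAGNITUDES = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]

def pack_fp4(codes):
    """Pack 4-bit codes two per byte (assumed order: low nibble first)."""
    if len(codes) % 2 != 0:
        raise ValueError("K must be even to pack two FP4 values per byte")
    return bytes((codes[i] & 0xF) | ((codes[i + 1] & 0xF) << 4)
                 for i in range(0, len(codes), 2))

def unpack_fp4(packed):
    """Inverse of pack_fp4: recover the 4-bit codes."""
    codes = []
    for byte in packed:
        codes.append(byte & 0xF)
        codes.append(byte >> 4)
    return codes

def decode_e2m1(code):
    """Decode one 4-bit E2M1 code to a Python float."""
    magnitude = E2M1_MAGNITUDES[code & 0x7]
    return -magnitude if code & 0x8 else magnitude

row = [1, 2, 3, 4, 5, 6, 7, 9] * 16   # K = 128 logical elements
packed_row = pack_fp4(row)            # K / 2 = 64 bytes in memory
```

Any descriptor that measures the K dimension in bytes therefore sees 64 rather than 128 for this row.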

biondizzle pushed to nvfp4-mega-moe at biondizzle/DeepGEMM

2026-05-12 05:52:35 +00:00

fbfeb54c9a Fix fold_global_scale: UE4M3 scales use .to(float32), not shift-by-23
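
A minimal sketch of the distinction this commit title appears to draw. Shifting a scale byte left by 23 places it in the float32 exponent field, which decodes an exponent-only (E8M0-style) scale correctly but not a UE4M3 byte, since the latter carries 4 exponent bits with bias 7 plus 3 mantissa bits. The function names below are hypothetical, and reading the commit this way is an assumption.

```python
import struct

def decode_ue4m3(byte: int) -> float:
    """Decode an unsigned E4M3 byte: 4 exponent bits, 3 mantissa bits, bias 7."""
    exp = (byte >> 3) & 0xF
    man = byte & 0x7
    if exp == 0xF and man == 0x7:
        return float("nan")                  # the single NaN encoding in E4M3
    if exp == 0:
        return (man / 8.0) * 2.0 ** -6       # subnormal range
    return (1.0 + man / 8.0) * 2.0 ** (exp - 7)

def shift_by_23(byte: int) -> float:
    """Reinterpret (byte << 23) as float32 bits; correct only for exponent-only scales."""
    return struct.unpack("<f", struct.pack("<I", (byte & 0xFF) << 23))[0]

one_as_ue4m3 = 0x38                          # exp field 7, mantissa 0, i.e. 1.0
correct = decode_ue4m3(one_as_ue4m3)         # proper decode of the UE4M3 byte
wrong = shift_by_23(one_as_ue4m3)            # treats 0x38 as a float32 exponent
```

In PyTorch the proper decode corresponds to viewing the bytes as an FP8 E4M3 dtype and calling `.to(torch.float32)`, which matches the wording of the commit title.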