biondizzle
  • Joined on 2025-12-10
biondizzle pushed to mega-moe-nvfp4 at biondizzle/deepseek-v4-quant 2026-05-12 05:52:32 +00:00
74af9984f6 Bug fixes: UE4M3 scale conversion, staging kernel SF/E2M1 packing, wo_a UE4M3, README overhaul
biondizzle pushed to nvfp4-mega-moe at biondizzle/DeepGEMM 2026-05-11 23:58:09 +00:00
af092fa7ba fix: double SMEM SF allocation for NVFP4 group=16 + clean stale comments
biondizzle pushed to nvfp4-mega-moe at biondizzle/DeepGEMM 2026-05-11 23:44:13 +00:00
aa97a3f949 fix: correct TMEM column layout for scale_vec::4X
biondizzle pushed to nvfp4-mega-moe at biondizzle/DeepGEMM 2026-05-11 23:17:53 +00:00
d6551617c0 fix: 4 kernel compilation fixes for packed FP4
biondizzle pushed to nvfp4-mega-moe at biondizzle/DeepGEMM 2026-05-11 22:55:30 +00:00
49e5646b42 fix: remove duplicate kInt8 case — kPackedFP4 is already kInt8
biondizzle pushed to nvfp4-mega-moe at biondizzle/DeepGEMM 2026-05-11 22:54:51 +00:00
80df24a641 fix: add kInt8 dtype support to TMA descriptor + change activation tensors to kInt8
biondizzle pushed to nvfp4-mega-moe at biondizzle/DeepGEMM 2026-05-11 22:40:11 +00:00
e608a20dec docs: major README update — packed FP4 SMEM layout, L1 epilogue, TMA descriptors
biondizzle pushed to mega-moe-nvfp4 at biondizzle/deepseek-v4-quant 2026-05-11 22:39:40 +00:00
a36bf47f11 fix: use tl.split instead of indexing for E2M1 pair packing
biondizzle pushed to mega-moe-nvfp4 at biondizzle/deepseek-v4-quant 2026-05-11 22:23:12 +00:00
27dbf2850f fix: replace nested tl.where with sum-of-comparisons for E2M1 quantization
biondizzle pushed to mega-moe-nvfp4 at biondizzle/deepseek-v4-quant 2026-05-11 22:08:51 +00:00
3d1f3de190 fix: syntax error — move triton imports before docstring, remove orphan @triton.jit
biondizzle pushed to mega-moe-nvfp4 at biondizzle/deepseek-v4-quant 2026-05-11 21:59:58 +00:00
79d866995f bump cache buster 32 for packed FP4 mxf4nvf4 fix
biondizzle pushed to nvfp4-mega-moe at biondizzle/DeepGEMM 2026-05-11 21:59:39 +00:00
30d72e7ef5 fix: packed FP4 for mxf4nvf4 — correct SMEM layout, UMMA descriptors, L1 epilogue
biondizzle pushed to mega-moe-nvfp4 at biondizzle/deepseek-v4-quant 2026-05-11 21:29:35 +00:00
c85b84b0fe fix: staging kernel outputs unpacked E2M1 (1 byte/element, not packed 2/byte)
biondizzle pushed to nvfp4-mega-moe at biondizzle/DeepGEMM 2026-05-11 21:27:36 +00:00
0ac73a82f9 fix: L1 output uses unpacked E2M1 (1 byte/element) like FP8
biondizzle pushed to mega-moe-nvfp4 at biondizzle/deepseek-v4-quant 2026-05-11 21:05:56 +00:00
01cfd02759 fix: same reshape fix in main patch file
biondizzle pushed to mega-moe-nvfp4 at biondizzle/deepseek-v4-quant 2026-05-11 21:04:55 +00:00
076d325c97 fix: use reshape instead of risky [0::2] slicing for E2M1 packing
biondizzle pushed to mega-moe-nvfp4 at biondizzle/deepseek-v4-quant 2026-05-11 21:02:22 +00:00
8dc917c498 fix: topk_weights_out store missing topk_offsets multiplier
biondizzle pushed to nvfp4-mega-moe at biondizzle/DeepGEMM 2026-05-11 20:57:36 +00:00
091b974736 fix: L1 epilogue uses STSM with XOR swizzle for E2M1 FP4 output
biondizzle pushed to nvfp4-mega-moe at biondizzle/DeepGEMM 2026-05-11 20:48:05 +00:00
a554de8b24 fix: dispatch TMA byte counts for FP4 (kHidden/2), rename fp8→fp4 layout refs
biondizzle pushed to mega-moe-nvfp4 at biondizzle/deepseek-v4-quant 2026-05-11 20:30:16 +00:00
17ba5a9d7b bump cache buster 30 for FP4 staging + DeepGEMM FP4 activations