biondizzle

biondizzle pushed to mega-moe-nvfp4 at biondizzle/deepseek-v4-quant

2026-05-12 05:52:32 +00:00

74af9984f6 Bug fixes: UE4M3 scale conversion, staging kernel SF/E2M1 packing, wo_a UE4M3, README overhaul

biondizzle pushed to nvfp4-mega-moe at biondizzle/DeepGEMM

2026-05-11 23:58:09 +00:00

af092fa7ba fix: double SMEM SF allocation for NVFP4 group=16 + clean stale comments

biondizzle pushed to nvfp4-mega-moe at biondizzle/DeepGEMM

2026-05-11 23:44:13 +00:00

aa97a3f949 fix: correct TMEM column layout for scale_vec::4X

biondizzle pushed to nvfp4-mega-moe at biondizzle/DeepGEMM

2026-05-11 23:17:53 +00:00

d6551617c0 fix: 4 kernel compilation fixes for packed FP4

biondizzle pushed to nvfp4-mega-moe at biondizzle/DeepGEMM

2026-05-11 22:55:30 +00:00

49e5646b42 fix: remove duplicate kInt8 case — kPackedFP4 is already kInt8

biondizzle pushed to nvfp4-mega-moe at biondizzle/DeepGEMM

2026-05-11 22:54:51 +00:00

80df24a641 fix: add kInt8 dtype support to TMA descriptor + change activation tensors to kInt8

biondizzle pushed to nvfp4-mega-moe at biondizzle/DeepGEMM

2026-05-11 22:40:11 +00:00

e608a20dec docs: major README update — packed FP4 SMEM layout, L1 epilogue, TMA descriptors

biondizzle pushed to mega-moe-nvfp4 at biondizzle/deepseek-v4-quant

2026-05-11 22:39:40 +00:00

a36bf47f11 fix: use tl.split instead of indexing for E2M1 pair packing

biondizzle pushed to mega-moe-nvfp4 at biondizzle/deepseek-v4-quant

2026-05-11 22:23:12 +00:00

27dbf2850f fix: replace nested tl.where with sum-of-comparisons for E2M1 quantization

biondizzle pushed to mega-moe-nvfp4 at biondizzle/deepseek-v4-quant

2026-05-11 22:08:51 +00:00

3d1f3de190 fix: syntax error — move triton imports before docstring, remove orphan @triton.jit

biondizzle pushed to mega-moe-nvfp4 at biondizzle/deepseek-v4-quant

2026-05-11 21:59:58 +00:00

79d866995f bump cache buster 32 for packed FP4 mxf4nvf4 fix

biondizzle pushed to nvfp4-mega-moe at biondizzle/DeepGEMM

2026-05-11 21:59:39 +00:00

30d72e7ef5 fix: packed FP4 for mxf4nvf4 — correct SMEM layout, UMMA descriptors, L1 epilogue

biondizzle pushed to mega-moe-nvfp4 at biondizzle/deepseek-v4-quant

2026-05-11 21:29:35 +00:00

c85b84b0fe fix: staging kernel outputs unpacked E2M1 (1 byte/element, not packed 2/byte)

biondizzle pushed to nvfp4-mega-moe at biondizzle/DeepGEMM

2026-05-11 21:27:36 +00:00

0ac73a82f9 fix: L1 output uses unpacked E2M1 (1 byte/element) like FP8

biondizzle pushed to mega-moe-nvfp4 at biondizzle/deepseek-v4-quant

2026-05-11 21:05:56 +00:00

01cfd02759 fix: same reshape fix in main patch file

biondizzle pushed to mega-moe-nvfp4 at biondizzle/deepseek-v4-quant

2026-05-11 21:04:55 +00:00

076d325c97 fix: use reshape instead of risky [0::2] slicing for E2M1 packing

biondizzle pushed to mega-moe-nvfp4 at biondizzle/deepseek-v4-quant

2026-05-11 21:02:22 +00:00

8dc917c498 fix: topk_weights_out store missing topk_offsets multiplier

biondizzle pushed to nvfp4-mega-moe at biondizzle/DeepGEMM

2026-05-11 20:57:36 +00:00

091b974736 fix: L1 epilogue uses STSM with XOR swizzle for E2M1 FP4 output

biondizzle pushed to nvfp4-mega-moe at biondizzle/DeepGEMM

2026-05-11 20:48:05 +00:00

a554de8b24 fix: dispatch TMA byte counts for FP4 (kHidden/2), rename fp8→fp4 layout refs

biondizzle pushed to mega-moe-nvfp4 at biondizzle/deepseek-v4-quant

2026-05-11 20:30:16 +00:00

17ba5a9d7b bump cache buster 30 for FP4 staging + DeepGEMM FP4 activations