deepseek-v4-quant

Author	SHA1	Message	Date
biondizzle	5840291ea3	fix staging kernel packed_k_mask double-count	2026-05-12 08:08:24 +00:00
biondizzle	74af9984f6	Bug fixes: UE4M3 scale conversion, staging kernel SF/E2M1 packing, wo_a UE4M3, README overhaul. Fix _ue8m0_to_float32: checkpoint is float8_e4m3fn (UE4M3), not UE8M0; changed from shift-by-23 to .to(torch.float32) in both copies; fix fold_global_scale in DeepGEMM mega/__init__.py; fix staging kernel SF pack: int32 shift >= 32 is UB on GPU; split 8-group pack into two int32 writes (groups 0-3, 4-7); fix staging kernel E2M1 output: was writing unpacked (1 byte/elem) into packed buffer (hidden/2 bytes), causing 2x overflow; now packs even/odd nibble pairs correctly; fix wo_a on-the-fly BF16→NVFP4: was encoding UE8M0, now UE4M3; use .clamp(0, 448).to(float8_e4m3fn) instead of log2/exp trick; remove dead code: _ue8m0_uint8_to_float, tmp/, .bak, .s11, quant_module_patched.py, patch_finegrained_fp8_blackwell.py, patch_vllm_weights.py; remove SCALE-FMT diagnostic histogram clutter; update stale UE8M0 comments throughout; rewrite README: clean instructions, confirmed format details	2026-05-12 05:52:30 +00:00
biondizzle	c85b84b0fe	fix: staging kernel outputs unpacked E2M1 (1 byte/element, not packed 2/byte). Matches the SMEM layout: float_e2m1_unpacksmem_t is 1 byte/element. L1→L2 handoff uses unpacked format (same byte count as FP8). No bandwidth savings at L1→L2 for v1; can optimize later.	2026-05-11 21:29:33 +00:00
biondizzle	076d325c97	fix: use reshape instead of risky [0::2] slicing for E2M1 packing	2026-05-11 21:04:53 +00:00
biondizzle	8dc917c498	fix: topk_weights_out store missing topk_offsets multiplier	2026-05-11 21:02:19 +00:00
biondizzle	7a4403fa98	feat: FP4 staging kernel - BF16 → E2M1 packed + UE4M3 block16 scales. mxf4nvf4 requires FP4×FP4, not FP8×FP4. New staging kernel: E2M1 nearest-neighbor quantization; output: uint8 packed (2 E2M1 per byte) + UE4M3 packed int32 scales; added CUDA sync diagnostics for error localization	2026-05-11 20:29:36 +00:00
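
The staging-kernel commits above mention three packing steps: nearest-neighbor E2M1 quantization, two E2M1 codes per byte, and eight 1-byte scale factors split across two int32 words. The following is a minimal host-side sketch of those steps in plain Python, not the repository's kernel. The function names, the tie-breaking rule, and the nibble and byte order (even element in the low nibble, group 0 in the low byte) are assumptions made for illustration.

```python
# Illustrative only: names, tie-breaking, and nibble/byte order are assumptions,
# not taken from the repository's staging kernel.

# The eight non-negative magnitudes representable in E2M1 (magnitude codes 0..7).
E2M1_MAGNITUDES = (0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0)


def quantize_e2m1(x: float) -> int:
    """Return the 4-bit E2M1 code (sign bit 0x8 + 3-bit magnitude index) nearest to x."""
    sign = 0x8 if x < 0 else 0x0
    mag = min(abs(x), E2M1_MAGNITUDES[-1])  # saturate at the largest magnitude, 6.0
    # Nearest neighbor; on a tie this sketch keeps the lower code.
    idx = min(range(8), key=lambda i: abs(E2M1_MAGNITUDES[i] - mag))
    return sign | idx


def pack_e2m1(codes: list[int]) -> bytes:
    """Pack pairs of 4-bit codes: even index in the low nibble, odd index in the high nibble."""
    if len(codes) % 2:
        raise ValueError("need an even number of codes")
    return bytes(
        (codes[i] & 0xF) | ((codes[i + 1] & 0xF) << 4)
        for i in range(0, len(codes), 2)
    )


def pack_scales(scale_bytes: list[int]) -> tuple[int, int]:
    """Pack eight 1-byte scale factors into two 32-bit words (groups 0-3 and 4-7)."""
    if len(scale_bytes) != 8:
        raise ValueError("expected 8 scale bytes")

    def pack4(group: list[int]) -> int:
        word = 0
        for j, b in enumerate(group):
            word |= (b & 0xFF) << (8 * j)  # shift amounts stay in 0..24
        return word

    return pack4(scale_bytes[:4]), pack4(scale_bytes[4:])
```

Packing four bytes per word keeps every shift below 32, which is the constraint the "int32 shift >= 32 is UB" fix refers to. A packed buffer holds half as many bytes as there are elements, so writing one byte per element into it overruns it by a factor of two.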

6 Commits
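
As a reading aid for the UE4M3 versus UE8M0 fix in 74af9984f6, the sketch below decodes a single scale byte both ways using only the Python standard library. It assumes the standard float8_e4m3fn layout (1 sign bit, 4 exponent bits, 3 mantissa bits, bias 7) and models the "shift-by-23" path as placing the byte in the float32 exponent field. The helper names are hypothetical and do not come from the repository.

```python
import struct


def decode_e4m3fn(byte: int) -> float:
    """Decode one float8_e4m3fn byte (1 sign, 4 exponent, 3 mantissa bits, bias 7)."""
    sign = -1.0 if byte & 0x80 else 1.0
    exp = (byte >> 3) & 0xF
    man = byte & 0x7
    if exp == 0xF and man == 0x7:
        return float("nan")  # the only NaN pattern; the format has no infinities
    if exp == 0:
        return sign * (man / 8.0) * 2.0 ** -6  # subnormal values
    return sign * (1.0 + man / 8.0) * 2.0 ** (exp - 7)


def decode_via_shift23(byte: int) -> float:
    """Treat the byte as a float32 exponent field (the UE8M0-style interpretation)."""
    bits = (byte & 0xFF) << 23
    return struct.unpack("<f", struct.pack("<I", bits))[0]


# 0x38 encodes 1.0 in float8_e4m3fn, but reads as 2**-71 when shifted by 23.
scale_byte = 0x38
as_e4m3 = decode_e4m3fn(scale_byte)
as_shifted = decode_via_shift23(scale_byte)
```

The largest finite float8_e4m3fn value is 448 (byte 0x7E), which matches the upper bound used in the `.clamp(0, 448)` call named in the commit message.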