74af9984f6
Bug fixes: UE4M3 scale conversion, staging kernel SF/E2M1 packing, wo_a UE4M3, README overhaul
...
- Fix _ue8m0_to_float32: checkpoint is float8_e4m3fn (UE4M3), not UE8M0
- Changed from shift-by-23 to .to(torch.float32) in both copies
- Fix fold_global_scale in DeepGEMM mega/__init__.py
- Fix staging kernel SF pack: int32 shift >= 32 is UB on GPU
- Split 8-group pack into two int32 writes (groups 0-3, 4-7)
- Fix staging kernel E2M1 output: was writing unpacked (1 byte/elem)
into packed buffer (hidden/2 bytes), causing 2x overflow
- Now packs even/odd nibble pairs correctly
- Fix wo_a on-the-fly BF16→NVFP4: was encoding UE8M0, now UE4M3
- Use .clamp(0, 448).to(float8_e4m3fn) instead of log2/exp trick
- Remove dead code: _ue8m0_uint8_to_float, tmp/, .bak, .s11,
quant_module_patched.py, patch_finegrained_fp8_blackwell.py,
patch_vllm_weights.py
- Remove SCALE-FMT diagnostic histogram clutter
- Update stale UE8M0 comments throughout
- Rewrite README: clean instructions, confirmed format details
2026-05-12 05:52:30 +00:00