From 7febeaeb719fda8e7c9ca132104eb606a592f906 Mon Sep 17 00:00:00 2001
From: biondizzle
Date: Mon, 11 May 2026 04:28:38 +0000
Subject: [PATCH] README: document bugs #5 (input_scale) and #6 (fused_skip_regex), add version banner section, update status

---
 README.md | 54 ++++++++++++++++++++++++++++++++++++++++++++++--------
 1 file changed, 46 insertions(+), 8 deletions(-)

diff --git a/README.md b/README.md
index 14298bd..57ea183 100644
--- a/README.md
+++ b/README.md
@@ -14,7 +14,7 @@ Full NVFP4 quantization of DeepSeek V4 Pro and vLLM serving on 8× NVIDIA B200 G
 | MoE Expert Serving | ✅ FusedMoE NVFP4 (FLASHINFER_TRTLLM backend) |
 | Profile/Warmup Run | ✅ Passes |
 | API Server | ✅ Running on port 8000 |
-| Output Quality | 🔧 Under investigation (FP4 quantization loss + scale tuning) |
+| Output Quality | 🔧 Garbled — likely remaining dequant/scale bug |
 
 ## B200 Node
 
@@ -57,7 +57,7 @@ Our NVFP4 weights are uint8 packed FP4 with separate block/global scales.
 
 **Solution** (`_convert_nvfp4_to_fp8`):
 1. Unpack NVFP4 uint8 → BF16 using E2M1 lookup table
-2. Dequantize: `weight_bf16 * block_scale * global_scale * input_scale`
+2. Dequantize: `weight_bf16 * block_scale * global_scale` (NO input_scale — it's for activations)
 3. Re-quantize BF16 → FP8 e4m3 with per-tensor scale (`w_amax / fp8_max`)
 4. Create block scale tensor filled with `fp8_scale` (same scale for every 128×128 block)
 5. Call `deepgemm_post_process_fp8_weight_block(wq, ws, quant_block_shape=(128,128), use_e8m0=True, is_bmm=True, bmm_batch_size=N)`
@@ -78,7 +78,7 @@ NVFP4 weights (uint8) can't be used directly.
 
 **Solution** (`_convert_nvfp4_to_bf16`):
 1. Unpack NVFP4 → BF16
-2. Dequantize with block/global/input scales
+2. Dequantize with block/global scales (input_scale is for activations, not weights)
 3. Replace `mod.weight` with BF16 parameter
 4. Set `quant_method = UnquantizedLinearMethod()`
 5. Remove NVFP4 scale attributes (`weight_scale`, `weight_scale_2`, `input_scale`)
@@ -199,13 +199,51 @@ Checkpoint (NVFP4 safetensors)
 └── MoE experts: stay NVFP4 (FusedMoE backend)
 ```
 
+## Bugs Found and Fixed (continued)
+
+### `input_scale` Multiplied into Weight Dequantization (CRITICAL)
+- **Root cause**: `_convert_nvfp4_to_bf16`, `_convert_nvfp4_to_fp8`, and
+  `_reconstruct_compressor_weight` all multiplied by `input_scale` during weight
+  dequantization. `input_scale` is for **activations**, not weights. The correct
+  formula is: `weight_bf16 = e2m1 * block_scale * global_scale` (NO input_scale).
+  Including it made weights ~5000× too small, causing garbage output.
+- **Fix**: Removed `* input_scale` from all three dequant paths.
+
+### `fused_skip_regex` Skipping Non-Fused Layer Scales (CRITICAL)
+- **Root cause**: The skip list included `q_b_proj`, `o_a_proj`, `o_b_proj` weight
+  scales. These are **NOT fused/stacked** — they're individual Linear layers
+  (`wq_b`, `wo_a`, `wo_b`) converted in-place. Skipping their scales caused
+  `process_weights_after_loading` to read `torch.empty()` garbage for
+  `weight_scale_inv`, producing garbled output.
+- **Fix**: Removed `q_b_proj`, `o_a_proj`, `o_b_proj` scale entries from
+  `fused_skip_regex`. Only truly stacked params remain skipped:
+  `compressor.{kv_proj,gate_proj}` → `fused_wkv_wgate`,
+  `self_attn.{kv_proj,q_a_proj}` → `fused_wqa_wkv`,
+  `shared_experts.{gate_proj,up_proj}` → `gate_up_proj`.
+
+## Version Banner
+
+The patch prints a version banner at import time (visible in `docker logs`):
+```
+======================================================================
+ DeepSeek V4 NVFP4 Patch
+ Commit: 26aaaba
+ Loaded: 2026-05-11 04:25:00 UTC
+ Node: ...
+
+ Architecture: ...
+ Bugs fixed: #1-#6
+======================================================================
+```
+This ensures you can always verify what's running inside the container.
+
 ## Known Issues
 
-1. **Output quality**: FP4 is very aggressive quantization. The model produces
-   tokens but they may be incoherent. This could be:
-   - Normal FP4 quality degradation
-   - Subtle dequantization bugs (sign handling, scale ordering)
-   - The per-tensor FP8 requantization of wo_a losing per-block precision
+1. **Output quality**: Model produces tokens but they're garbled/incoherent.
+   All 6 known bugs are fixed. The remaining issue is under investigation —
+   likely a subtle dequantization bug (sign handling, scale ordering, or
+   E2M1 unpack edge case). The version banner in the logs helps debug which
+   patch version is active.
 
 2. **Runtime performance**: Not yet benchmarked. The DeepGEMM einsum + FusedMoE
    path should be efficient on B200, but the BF16 layers go through