README: document bugs #5 (input_scale) and #6 (fused_skip_regex), add version banner section, update status

2026-05-11 04:28:38 +00:00
parent 26aaaba4a2
commit 7febeaeb71
1 changed file with 46 additions and 8 deletions
--- a/README.md
+++ b/README.md
@@ -14,7 +14,7 @@ Full NVFP4 quantization of DeepSeek V4 Pro and vLLM serving on 8× NVIDIA B200 G
 | MoE Expert Serving | ✅ FusedMoE NVFP4 (FLASHINFER_TRTLLM backend) |
 | Profile/Warmup Run | ✅ Passes |
 | API Server | ✅ Running on port 8000 |
-| Output Quality | 🔧 Under investigation (FP4 quantization loss + scale tuning) |
+| Output Quality | 🔧 Garbled — likely remaining dequant/scale bug |

 ## B200 Node

@@ -57,7 +57,7 @@ Our NVFP4 weights are uint8 packed FP4 with separate block/global scales.

 **Solution** (`_convert_nvfp4_to_fp8`):
 1. Unpack NVFP4 uint8 → BF16 using E2M1 lookup table
-2. Dequantize: `weight_bf16 * block_scale * global_scale * input_scale`
+2. Dequantize: `weight_bf16 * block_scale * global_scale` (NO input_scale — it's for activations)
 3. Re-quantize BF16 → FP8 e4m3 with per-tensor scale (`w_amax / fp8_max`)
 4. Create block scale tensor filled with `fp8_scale` (same scale for every 128×128 block)
 5. Call `deepgemm_post_process_fp8_weight_block(wq, ws, quant_block_shape=(128,128), use_e8m0=True, is_bmm=True, bmm_batch_size=N)`
@@ -78,7 +78,7 @@ NVFP4 weights (uint8) can't be used directly.

 **Solution** (`_convert_nvfp4_to_bf16`):
 1. Unpack NVFP4 → BF16
-2. Dequantize with block/global/input scales
+2. Dequantize with block/global scales (input_scale is for activations, not weights)
 3. Replace `mod.weight` with BF16 parameter
 4. Set `quant_method = UnquantizedLinearMethod()`
 5. Remove NVFP4 scale attributes (`weight_scale`, `weight_scale_2`, `input_scale`)
@@ -199,13 +199,51 @@ Checkpoint (NVFP4 safetensors)
       └── MoE experts: stay NVFP4 (FusedMoE backend)
 ```

+## Bugs Found and Fixed (continued)
+
+### `input_scale` Multiplied into Weight Dequantization (CRITICAL)
+- **Root cause**: `_convert_nvfp4_to_bf16`, `_convert_nvfp4_to_fp8`, and
+  `_reconstruct_compressor_weight` all multiplied by `input_scale` during weight
+  dequantization. `input_scale` is for **activations**, not weights. The correct
+  formula is: `weight_bf16 = e2m1 * block_scale * global_scale` (NO input_scale).
+  Including it made weights ~5000× too small, causing garbage output.
+- **Fix**: Removed `* input_scale` from all three dequant paths.
+
+### `fused_skip_regex` Skipping Non-Fused Layer Scales (CRITICAL)
+- **Root cause**: The skip list included `q_b_proj`, `o_a_proj`, `o_b_proj` weight
+  scales. These are **NOT fused/stacked** — they're individual Linear layers
+  (`wq_b`, `wo_a`, `wo_b`) converted in-place. Skipping their scales caused
+  `process_weights_after_loading` to read `torch.empty()` garbage for
+  `weight_scale_inv`, producing garbled output.
+- **Fix**: Removed `q_b_proj`, `o_a_proj`, `o_b_proj` scale entries from
+  `fused_skip_regex`. Only truly stacked params remain skipped:
+  `compressor.{kv_proj,gate_proj}` → `fused_wkv_wgate`,
+  `self_attn.{kv_proj,q_a_proj}` → `fused_wqa_wkv`,
+  `shared_experts.{gate_proj,up_proj}` → `gate_up_proj`.
+
+## Version Banner
+
+The patch prints a version banner at import time (visible in `docker logs`):
+```
+======================================================================
+  DeepSeek V4 NVFP4 Patch
+  Commit:   26aaaba
+  Loaded:   2026-05-11 04:25:00 UTC
+  Node:     ...
+  
+  Architecture: ...
+  Bugs fixed: #1-#6
+======================================================================
+```
+This ensures you can always verify what's running inside the container.
+
 ## Known Issues

-1. **Output quality**: FP4 is very aggressive quantization. The model produces
-   tokens but they may be incoherent. This could be:
-   - Normal FP4 quality degradation
-   - Subtle dequantization bugs (sign handling, scale ordering)
-   - The per-tensor FP8 requantization of wo_a losing per-block precision
+1. **Output quality**: Model produces tokens but they're garbled/incoherent.
+   All 6 known bugs are fixed. The remaining issue is under investigation —
+   likely a subtle dequantization bug (sign handling, scale ordering, or
+   E2M1 unpack edge case). The version banner in the logs helps debug which
+   patch version is active.

 2. **Runtime performance**: Not yet benchmarked. The DeepGEMM einsum + FusedMoE
   path should be efficient on B200, but the BF16 layers go through