README: document bugs #5 (input_scale) and #6 (fused_skip_regex), add version banner section, update status

2026-05-11 04:28:38 +00:00
parent 26aaaba4a2
commit 7febeaeb71


@@ -14,7 +14,7 @@ Full NVFP4 quantization of DeepSeek V4 Pro and vLLM serving on 8× NVIDIA B200 G
| MoE Expert Serving | ✅ FusedMoE NVFP4 (FLASHINFER_TRTLLM backend) |
| Profile/Warmup Run | ✅ Passes |
| API Server | ✅ Running on port 8000 |
| Output Quality | 🔧 Under investigation (FP4 quantization loss + scale tuning) |
| Output Quality | 🔧 Garbled — likely remaining dequant/scale bug |
## B200 Node
@@ -57,7 +57,7 @@ Our NVFP4 weights are uint8 packed FP4 with separate block/global scales.
**Solution** (`_convert_nvfp4_to_fp8`):
1. Unpack NVFP4 uint8 → BF16 using E2M1 lookup table
2. Dequantize: `weight_bf16 * block_scale * global_scale * input_scale`
2. Dequantize: `weight_bf16 * block_scale * global_scale` (NO input_scale — it's for activations)
3. Re-quantize BF16 → FP8 e4m3 with per-tensor scale (`w_amax / fp8_max`)
4. Create block scale tensor filled with `fp8_scale` (same scale for every 128×128 block)
5. Call `deepgemm_post_process_fp8_weight_block(wq, ws, quant_block_shape=(128,128), use_e8m0=True, is_bmm=True, bmm_batch_size=N)`
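Steps 3 and 4 can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions, not the patch's code: the helper names are hypothetical, nested lists stand in for tensors, `448.0` is taken as the e4m3 maximum (`fp8_max`), and rounding onto the FP8 grid is omitted.

```python
import math

FP8_E4M3_MAX = 448.0  # largest finite e4m3fn value, used here as fp8_max


def per_tensor_fp8_scale(weight_rows):
    """Step 3: one scale for the whole tensor, w_amax / fp8_max."""
    w_amax = max(abs(v) for row in weight_rows for v in row)
    # Guard against an all-zero tensor (an assumption of this sketch).
    return w_amax / FP8_E4M3_MAX if w_amax > 0 else 1.0


def scale_to_fp8_range(weight_rows, scale):
    """Divide by the scale and clamp to the FP8 range (no grid rounding)."""
    return [
        [max(-FP8_E4M3_MAX, min(FP8_E4M3_MAX, v / scale)) for v in row]
        for row in weight_rows
    ]


def block_scale_grid(n_rows, n_cols, scale, block=128):
    """Step 4: one entry per 128x128 block, all holding the same scale."""
    return [
        [scale] * math.ceil(n_cols / block)
        for _ in range(math.ceil(n_rows / block))
    ]
```

Because every block carries the same value, the block-scale tensor only satisfies the layout that the block-wise kernel expects; it adds no per-block precision.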
@@ -78,7 +78,7 @@ NVFP4 weights (uint8) can't be used directly.
**Solution** (`_convert_nvfp4_to_bf16`):
1. Unpack NVFP4 → BF16
2. Dequantize with block/global/input scales
2. Dequantize with block/global scales (input_scale is for activations, not weights)
3. Replace `mod.weight` with BF16 parameter
4. Set `quant_method = UnquantizedLinearMethod()`
5. Remove NVFP4 scale attributes (`weight_scale`, `weight_scale_2`, `input_scale`)
@@ -199,13 +199,51 @@ Checkpoint (NVFP4 safetensors)
└── MoE experts: stay NVFP4 (FusedMoE backend)
```
## Bugs Found and Fixed (continued)
### `input_scale` Multiplied into Weight Dequantization (CRITICAL)
- **Root cause**: `_convert_nvfp4_to_bf16`, `_convert_nvfp4_to_fp8`, and
`_reconstruct_compressor_weight` all multiplied by `input_scale` during weight
dequantization. `input_scale` is for **activations**, not weights. The correct
formula is: `weight_bf16 = e2m1 * block_scale * global_scale` (NO input_scale).
Including it made weights ~5000× too small, causing garbage output.
- **Fix**: Removed `* input_scale` from all three dequant paths.
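The corrected formula can be illustrated with a small pure-Python sketch. It is not the patch's implementation: the function names are hypothetical, the low-nibble-first packing order and the block size of 16 are assumptions, and plain lists stand in for tensors.

```python
# E2M1 magnitudes for the low 3 bits (2 exponent bits, 1 mantissa bit).
E2M1_MAGNITUDES = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]


def decode_e2m1(nibble):
    """Decode one 4-bit E2M1 code; bit 3 is the sign bit."""
    magnitude = E2M1_MAGNITUDES[nibble & 0x7]
    return -magnitude if nibble & 0x8 else magnitude


def unpack_fp4(packed_bytes):
    """Unpack two FP4 values per byte. ASSUMPTION: low nibble comes first."""
    values = []
    for byte in packed_bytes:
        values.append(decode_e2m1(byte & 0x0F))
        values.append(decode_e2m1(byte >> 4))
    return values


def dequantize_weight(packed_bytes, block_scales, global_scale, block_size=16):
    """weight = e2m1 * block_scale * global_scale.

    input_scale is deliberately not a parameter: it describes activations.
    """
    values = unpack_fp4(packed_bytes)
    return [
        v * block_scales[i // block_size] * global_scale
        for i, v in enumerate(values)
    ]
```

Leaving `input_scale` out of the signature makes the mistake described above hard to reintroduce, since the weight path never sees the activation scale.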
### `fused_skip_regex` Skipping Non-Fused Layer Scales (CRITICAL)
- **Root cause**: The skip list included `q_b_proj`, `o_a_proj`, `o_b_proj` weight
scales. These are **NOT fused/stacked** — they're individual Linear layers
(`wq_b`, `wo_a`, `wo_b`) converted in-place. Skipping their scales caused
`process_weights_after_loading` to read `torch.empty()` garbage for
`weight_scale_inv`, producing garbled output.
- **Fix**: Removed `q_b_proj`, `o_a_proj`, `o_b_proj` scale entries from
`fused_skip_regex`. Only truly stacked params remain skipped:
`compressor.{kv_proj,gate_proj}` → `fused_wkv_wgate`,
`self_attn.{kv_proj,q_a_proj}` → `fused_wqa_wkv`,
`shared_experts.{gate_proj,up_proj}` → `gate_up_proj`.
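As a rough illustration of the intended behavior, the sketch below skips scale tensors only for the stacked projections listed above. The pattern, the `should_skip` helper, and the parameter names in the usage lines are hypothetical; the actual `fused_skip_regex` in the patch may be written differently.

```python
import re

# HYPOTHETICAL pattern: skip scales only for projections that are stacked
# into a fused parameter. q_b_proj, o_a_proj and o_b_proj are absent on
# purpose, so their scales are loaded normally.
FUSED_SKIP_REGEX = re.compile(
    r"(compressor\.(kv_proj|gate_proj)"
    r"|self_attn\.(kv_proj|q_a_proj)"
    r"|shared_experts\.(gate_proj|up_proj))"
    r"\.(weight_scale|weight_scale_2|input_scale)$"
)


def should_skip(param_name):
    """Return True if this checkpoint tensor should not be loaded directly."""
    return FUSED_SKIP_REGEX.search(param_name) is not None


# Example parameter names are made up for illustration.
print(should_skip("model.layers.0.self_attn.q_a_proj.weight_scale"))  # stacked
print(should_skip("model.layers.0.self_attn.q_b_proj.weight_scale"))  # individual
```

A check like this is cheap to run over the checkpoint's tensor names before loading, to confirm that no individual Linear layer loses its scales.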
## Version Banner
The patch prints a version banner at import time (visible in `docker logs`):
```
======================================================================
DeepSeek V4 NVFP4 Patch
Commit: 26aaaba
Loaded: 2026-05-11 04:25:00 UTC
Node: ...
Architecture: ...
Bugs fixed: #1-#6
======================================================================
```
This ensures you can always verify what's running inside the container.
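A banner of this shape could be produced by a helper along the following lines. This is a sketch only: `format_banner` and its parameters are hypothetical, the rule width of 70 is an assumption, and the node and architecture strings are passed in by the caller because the README does not say how the patch obtains them.

```python
from datetime import datetime, timezone


def format_banner(commit, node, architecture, bugs_fixed, now=None):
    """Build the banner text; `now` is injectable so the output is testable."""
    now = now or datetime.now(timezone.utc)
    rule = "=" * 70
    lines = [
        rule,
        "DeepSeek V4 NVFP4 Patch",
        f"Commit: {commit}",
        f"Loaded: {now:%Y-%m-%d %H:%M:%S} UTC",
        f"Node: {node}",
        f"Architecture: {architecture}",
        f"Bugs fixed: {bugs_fixed}",
        rule,
    ]
    return "\n".join(lines)


if __name__ == "__main__":
    # Printing at module import time is what makes it show up in `docker logs`.
    print(format_banner("26aaaba", "...", "...", "#1-#6"))
```

Keeping the formatting in a pure function separates it from the import-time side effect, so the banner text can be checked without starting the server.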
## Known Issues
1. **Output quality**: FP4 is very aggressive quantization. The model produces
tokens but they may be incoherent. This could be:
- Normal FP4 quality degradation
- Subtle dequantization bugs (sign handling, scale ordering)
- The per-tensor FP8 requantization of wo_a losing per-block precision
1. **Output quality**: Model produces tokens but they're garbled/incoherent.
All 6 known bugs are fixed. The remaining issue is under investigation —
likely a subtle dequantization bug (sign handling, scale ordering, or
E2M1 unpack edge case). The version banner in the logs helps debug which
patch version is active.
2. **Runtime performance**: Not yet benchmarked. The DeepGEMM einsum + FusedMoE
path should be efficient on B200, but the BF16 layers go through