README: document bugs #5 (input_scale) and #6 (fused_skip_regex), add version banner section, update status
README.md (54 changed lines)
@@ -14,7 +14,7 @@ Full NVFP4 quantization of DeepSeek V4 Pro and vLLM serving on 8× NVIDIA B200 G
 | MoE Expert Serving | ✅ FusedMoE NVFP4 (FLASHINFER_TRTLLM backend) |
 | Profile/Warmup Run | ✅ Passes |
 | API Server | ✅ Running on port 8000 |
-| Output Quality | 🔧 Under investigation (FP4 quantization loss + scale tuning) |
+| Output Quality | 🔧 Garbled — likely remaining dequant/scale bug |
 
 ## B200 Node
 
@@ -57,7 +57,7 @@ Our NVFP4 weights are uint8 packed FP4 with separate block/global scales.
 
 **Solution** (`_convert_nvfp4_to_fp8`):
 1. Unpack NVFP4 uint8 → BF16 using E2M1 lookup table
-2. Dequantize: `weight_bf16 * block_scale * global_scale * input_scale`
+2. Dequantize: `weight_bf16 * block_scale * global_scale` (NO input_scale — it's for activations)
 3. Re-quantize BF16 → FP8 e4m3 with per-tensor scale (`w_amax / fp8_max`)
 4. Create block scale tensor filled with `fp8_scale` (same scale for every 128×128 block)
 5. Call `deepgemm_post_process_fp8_weight_block(wq, ws, quant_block_shape=(128,128), use_e8m0=True, is_bmm=True, bmm_batch_size=N)`
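Steps 3 and 4 of the list above can be sketched in plain Python. This is a minimal illustration, not the patch's code: the helper names (`per_tensor_fp8_scale`, `scale_into_fp8_range`, `constant_block_scales`) are hypothetical, weights are nested lists rather than tensors, and the final rounding to FP8 (for example a cast to `torch.float8_e4m3fn`) is left out. It assumes `fp8_max` is 448, the largest finite e4m3 value.

```python
import math

FP8_E4M3_MAX = 448.0  # largest finite value of the e4m3 (fn) format


def per_tensor_fp8_scale(weight):
    """Step 3: one scale for the whole tensor, w_amax / fp8_max."""
    w_amax = max(abs(v) for row in weight for v in row)
    # Guard against an all-zero tensor (assumption: fall back to 1.0).
    return w_amax / FP8_E4M3_MAX if w_amax > 0 else 1.0


def scale_into_fp8_range(weight, scale):
    """Divide by the scale and clamp; a real path would then round to FP8."""
    return [
        [max(-FP8_E4M3_MAX, min(FP8_E4M3_MAX, v / scale)) for v in row]
        for row in weight
    ]


def constant_block_scales(rows, cols, scale, block=(128, 128)):
    """Step 4: a block-scale grid in which every block holds the same scale."""
    n_row_blocks = math.ceil(rows / block[0])
    n_col_blocks = math.ceil(cols / block[1])
    return [[scale] * n_col_blocks for _ in range(n_row_blocks)]


weight = [[0.5, -2.0], [1.0, 0.25]]
scale = per_tensor_fp8_scale(weight)
scaled = scale_into_fp8_range(weight, scale)
grid = constant_block_scales(256, 300, scale)  # 2 x 3 blocks of 128x128
```

Because a single per-tensor scale is copied into every block, the block-scale tensor carries no extra precision; it only exists to satisfy the block-scaled layout that the DeepGEMM post-processing call expects.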
@@ -78,7 +78,7 @@ NVFP4 weights (uint8) can't be used directly.
 
 **Solution** (`_convert_nvfp4_to_bf16`):
 1. Unpack NVFP4 → BF16
-2. Dequantize with block/global/input scales
+2. Dequantize with block/global scales (input_scale is for activations, not weights)
 3. Replace `mod.weight` with BF16 parameter
 4. Set `quant_method = UnquantizedLinearMethod()`
 5. Remove NVFP4 scale attributes (`weight_scale`, `weight_scale_2`, `input_scale`)
@@ -199,13 +199,51 @@ Checkpoint (NVFP4 safetensors)
 └── MoE experts: stay NVFP4 (FusedMoE backend)
 ```
 
+## Bugs Found and Fixed (continued)
+
+### `input_scale` Multiplied into Weight Dequantization (CRITICAL)
+- **Root cause**: `_convert_nvfp4_to_bf16`, `_convert_nvfp4_to_fp8`, and
+`_reconstruct_compressor_weight` all multiplied by `input_scale` during weight
+dequantization. `input_scale` is for **activations**, not weights. The correct
+formula is: `weight_bf16 = e2m1 * block_scale * global_scale` (NO input_scale).
+Including it made weights ~5000× too small, causing garbage output.
+- **Fix**: Removed `* input_scale` from all three dequant paths.
+
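The corrected formula can be illustrated with a small pure-Python sketch. It is not the patch's implementation: the function names are hypothetical, the low-nibble-first packing order and the block size of 16 are assumptions about the checkpoint layout, and the `input_scale` value is made up to reproduce a shrink factor of the size described above.

```python
# E2M1 magnitudes for the 3 low bits of a 4-bit code; bit 3 is the sign.
E2M1_MAGNITUDES = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]


def decode_e2m1(code):
    """Map one 4-bit E2M1 code to its float value."""
    sign = -1.0 if code & 0x8 else 1.0
    return sign * E2M1_MAGNITUDES[code & 0x7]


def unpack_nvfp4(packed):
    """Two FP4 codes per byte; low nibble first is an assumption here."""
    values = []
    for byte in packed:
        values.append(decode_e2m1(byte & 0x0F))
        values.append(decode_e2m1(byte >> 4))
    return values


def dequantize(packed, block_scales, global_scale, block_size=16):
    """Correct path: e2m1 * block_scale * global_scale, no input_scale."""
    return [
        v * block_scales[i // block_size] * global_scale
        for i, v in enumerate(unpack_nvfp4(packed))
    ]


packed = bytes([0x72, 0xF9])  # codes 2, 7, 9, 15
correct = dequantize(packed, block_scales=[0.25], global_scale=2.0)
input_scale = 2e-4  # hypothetical activation scale
buggy = [w * input_scale for w in correct]  # effect of the extra factor
```

With an activation scale of this magnitude, every entry of `buggy` is 5000 times smaller than the matching entry of `correct`, which is the kind of shrinkage the root-cause note describes.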
+### `fused_skip_regex` Skipping Non-Fused Layer Scales (CRITICAL)
+- **Root cause**: The skip list included `q_b_proj`, `o_a_proj`, `o_b_proj` weight
+scales. These are **NOT fused/stacked** — they're individual Linear layers
+(`wq_b`, `wo_a`, `wo_b`) converted in-place. Skipping their scales caused
+`process_weights_after_loading` to read `torch.empty()` garbage for
+`weight_scale_inv`, producing garbled output.
+- **Fix**: Removed `q_b_proj`, `o_a_proj`, `o_b_proj` scale entries from
+`fused_skip_regex`. Only truly stacked params remain skipped:
+`compressor.{kv_proj,gate_proj}` → `fused_wkv_wgate`,
+`self_attn.{kv_proj,q_a_proj}` → `fused_wqa_wkv`,
+`shared_experts.{gate_proj,up_proj}` → `gate_up_proj`.
+
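A skip pattern with the shape described in the fix might look like the sketch below. The regex itself, the `should_skip` helper, and the checkpoint parameter names are all hypothetical; the actual `fused_skip_regex` in the patch may be written differently. The point is only that scales are skipped for stacked projections and still loaded for `q_b_proj`, `o_a_proj`, and `o_b_proj`.

```python
import re

# Hypothetical pattern: skip scale tensors only for projections that are
# stacked into a fused parameter at load time.
FUSED_SKIP_REGEX = re.compile(
    r"(compressor\.(kv_proj|gate_proj)"
    r"|self_attn\.(kv_proj|q_a_proj)"
    r"|shared_experts\.(gate_proj|up_proj))"
    r"\.(weight_scale|weight_scale_2|input_scale)$"
)


def should_skip(param_name):
    """Return True when the scale belongs to a stacked (fused) projection."""
    return FUSED_SKIP_REGEX.search(param_name) is not None


# Illustrative names only; real checkpoint keys may differ.
names = [
    "model.layers.0.self_attn.kv_proj.weight_scale",         # stacked
    "model.layers.0.self_attn.q_a_proj.weight_scale_2",      # stacked
    "model.layers.0.shared_experts.up_proj.input_scale",     # stacked
    "model.layers.0.self_attn.q_b_proj.weight_scale",        # individual layer
    "model.layers.0.self_attn.o_a_proj.weight_scale",        # individual layer
    "model.layers.0.self_attn.o_b_proj.weight_scale_2",      # individual layer
]
loaded = [n for n in names if not should_skip(n)]
```

In this sketch `loaded` keeps exactly the three individual-layer scales, so a later post-load step would find real values for them instead of uninitialized memory.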
+## Version Banner
+
+The patch prints a version banner at import time (visible in `docker logs`):
+```
+======================================================================
+DeepSeek V4 NVFP4 Patch
+Commit: 26aaaba
+Loaded: 2026-05-11 04:25:00 UTC
+Node: ...
+
+Architecture: ...
+Bugs fixed: #1-#6
+======================================================================
+```
+This ensures you can always verify what's running inside the container.
+
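One way to produce a banner like this at import time is sketched below. The `build_banner` helper and the constants are hypothetical, and filling `Node` from `platform.node()` and `Architecture` from `platform.machine()` is an assumption, since the README elides both values.

```python
import platform
from datetime import datetime, timezone

PATCH_COMMIT = "26aaaba"  # commit shown in the example banner
BUGS_FIXED = "#1-#6"


def build_banner(commit, bugs_fixed, now=None):
    """Assemble the banner text; `now` is injectable to make testing easy."""
    now = now or datetime.now(timezone.utc)
    rule = "=" * 70
    lines = [
        rule,
        "DeepSeek V4 NVFP4 Patch",
        f"Commit: {commit}",
        f"Loaded: {now:%Y-%m-%d %H:%M:%S} UTC",
        f"Node: {platform.node()}",             # assumed source of the node name
        f"Architecture: {platform.machine()}",  # assumed meaning of this field
        f"Bugs fixed: {bugs_fixed}",
        rule,
    ]
    return "\n".join(lines)


# Module-level call, so the banner is printed once when the module is imported.
print(build_banner(PATCH_COMMIT, BUGS_FIXED), flush=True)
```

Printing with `flush=True` makes it more likely that the banner reaches the container log promptly, even when stdout is buffered.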
 ## Known Issues
 
-1. **Output quality**: FP4 is very aggressive quantization. The model produces
-tokens but they may be incoherent. This could be:
-- Normal FP4 quality degradation
-- Subtle dequantization bugs (sign handling, scale ordering)
-- The per-tensor FP8 requantization of wo_a losing per-block precision
+1. **Output quality**: Model produces tokens but they're garbled/incoherent.
+All 6 known bugs are fixed. The remaining issue is under investigation —
+likely a subtle dequantization bug (sign handling, scale ordering, or
+E2M1 unpack edge case). The version banner in the logs helps debug which
+patch version is active.
 
 2. **Runtime performance**: Not yet benchmarked. The DeepGEMM einsum + FusedMoE
 path should be efficient on B200, but the BF16 layers go through