Server running on B200 port 8000 with full NVFP4→vLLM bridge. All critical bugs fixed: DeepGEMM scale format, compressor shapes, block scale values.
# MEMORY.md — Long-Term Memory
## Mike
- Working on DeepSeek V4 Pro NVFP4 quantization + vLLM serving on B200 node
- B200 node: 45.76.247.107, root, password in project .env
- Repo: https://sweetapi.com/biondizzle/deepseek-v4-quant.git (modelopt-nvfp4 branch)
## DeepSeek V4 NVFP4 Project
- Successfully quantized: 881GB NVFP4 (Run 11), 8× B200, $161/run
- modelopt 0.45.0.dev64 + transformers 5.8.0.dev0
- **vLLM server running on B200 port 8000** as of May 11, 2026 🎉
- We built the entire NVFP4→vLLM bridge from scratch (NVIDIA hasn't done this)
- Abandoned mega_moe (no kernel available, format mismatch); using the standard FusedMoE path instead
### Key Technical Decisions
- **wo_a**: NVFP4→BF16→FP8 with DeepGEMM block-scale format for BMM einsum
- **Attention layers**: NVFP4→BF16 dequantization, UnquantizedLinearMethod
- **Compressor**: Reconstructed fused_wkv_wgate from separate kv_proj+gate_proj in checkpoint
- **MoE experts**: Stay NVFP4, use FLASHINFER_TRTLLM FusedMoE backend
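
The NVFP4→BF16 dequantization step used for the attention layers and `wo_a` can be sketched in plain Python. This is a minimal illustration, not the code in patches/deepseek_v4.py. It assumes the usual NVFP4 layout (4-bit E2M1 codes, one block scale per 16 elements, one per-tensor scale), that codes are packed low nibble first, and that both scales act as multipliers. The names `decode_e2m1` and `dequantize_row` are hypothetical.

```python
# E2M1 magnitudes indexed by the low 3 bits of a 4-bit code; bit 3 is the sign.
E2M1_MAGNITUDES = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]
BLOCK_SIZE = 16  # NVFP4 shares one block scale across 16 consecutive elements


def decode_e2m1(code):
    """Decode one 4-bit E2M1 code to a Python float."""
    magnitude = E2M1_MAGNITUDES[code & 0x7]
    return -magnitude if code & 0x8 else magnitude


def dequantize_row(packed, block_scales, global_scale):
    """Dequantize one packed NVFP4 row to floats.

    packed:       bytes, two 4-bit codes per byte (ASSUMPTION: low nibble first)
    block_scales: one already-decoded float scale per 16 elements
    global_scale: per-tensor float scale (ASSUMPTION: applied as a multiplier)
    """
    values = []
    for byte in packed:
        values.append(decode_e2m1(byte & 0x0F))  # low nibble -> even index
        values.append(decode_e2m1(byte >> 4))    # high nibble -> odd index
    if len(values) != len(block_scales) * BLOCK_SIZE:
        raise ValueError("expected one block scale per 16 elements")
    return [
        v * block_scales[i // BLOCK_SIZE] * global_scale
        for i, v in enumerate(values)
    ]


# One 16-element block: codes alternate 0x1 (0.5) and 0x2 (1.0).
packed = bytes([0x21] * 8)
row = dequantize_row(packed, block_scales=[2.0], global_scale=0.5)
```

In the real bridge the same arithmetic would run on tensors and the result would be cast to BF16. The nibble order and scale convention should be checked against the checkpoint before relying on this.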
### Critical Bugs Fixed (May 11)
1. DeepGEMM `sf.dim()` crash: `weight_scale_inv` must be a DeepGEMM-formatted block-scale tensor
2. Compressor indexer shape mismatch: checkpoint keys have `.indexer.` sub-path
3. All-ones block scale → garbage output: must use `torch.full(..., fp8_scale)` not `torch.ones`
4. Block scale dtype: must be float32, not float8_e4m3fn
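
Bugs 3 and 4 come down to what the block-scale tensor contains. The sketch below uses plain Python lists in place of tensors to show why an all-ones grid produces garbage: weights quantized with a per-tensor `fp8_scale` only round-trip if every block-scale entry holds that same value as an ordinary float. The 128×128 block size, the helper names, and the omission of FP8 rounding are assumptions made for illustration.

```python
import math

FP8_E4M3_MAX = 448.0  # largest finite float8_e4m3fn value
BLOCK = 128           # ASSUMPTION: one scale per 128 x 128 weight tile


def per_tensor_fp8_scale(weight):
    """Scale that maps the largest |w| onto the FP8 E4M3 maximum."""
    amax = max(abs(w) for row in weight for w in row)
    return amax / FP8_E4M3_MAX if amax > 0 else 1.0


def block_scale_grid(rows, cols, fill):
    """One float per tile, all set to `fill` (the torch.full analogue)."""
    n_r = math.ceil(rows / BLOCK)
    n_c = math.ceil(cols / BLOCK)
    return [[float(fill)] * n_c for _ in range(n_r)]


def dequantize(q, grid):
    """Multiply each quantized value by the scale of the tile it falls in."""
    return [
        [q[r][c] * grid[r // BLOCK][c // BLOCK] for c in range(len(q[r]))]
        for r in range(len(q))
    ]


weight = [[0.02, -0.01], [0.005, 0.04]]
fp8_scale = per_tensor_fp8_scale(weight)

# Quantize by dividing by the scale (FP8 rounding is skipped in this sketch).
q = [[w / fp8_scale for w in row] for row in weight]

good = dequantize(q, block_scale_grid(2, 2, fp8_scale))  # filled with fp8_scale
bad = dequantize(q, block_scale_grid(2, 2, 1.0))         # all-ones grid

# good recovers the original magnitudes; bad is off by a factor of 1/fp8_scale,
# so the largest weight comes back near 448 instead of 0.04.
```

With tensors, the equivalent of `block_scale_grid` would be a `torch.full` call with `dtype=torch.float32`, matching fixes 3 and 4 above.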
### Outstanding
- Output quality still under investigation; FP4 is an aggressive quantization format
- All code in patches/deepseek_v4.py on modelopt-nvfp4 branch