deepseek-v4-quant/MEMORY.md
biondizzle 02b8ea536f Update MEMORY.md and memory files with vLLM NVFP4 serving progress
Server running on B200 port 8000 with full NVFP4→vLLM bridge.
All critical bugs fixed: DeepGEMM scale format, compressor shapes, block scale values.
2026-05-11 02:02:49 +00:00


MEMORY.md — Long-Term Memory

Mike

DeepSeek V4 NVFP4 Project

  • Successfully quantized: 881GB NVFP4 (Run 11), 8× B200, $161/run
  • modelopt 0.45.0.dev64 + transformers 5.8.0.dev0
  • vLLM server running on B200 port 8000 as of May 11, 2026 🎉
  • We built the entire NVFP4→vLLM bridge from scratch (NVIDIA hasn't done this)
  • Abandoned mega_moe (no kernel, format mismatch), using standard FusedMoE instead

Key Technical Decisions

  • wo_a: NVFP4→BF16→FP8 with DeepGEMM block-scale format for BMM einsum
  • Attention layers: NVFP4→BF16 dequantization, UnquantizedLinearMethod
  • Compressor: Reconstructed fused_wkv_wgate from separate kv_proj+gate_proj in checkpoint
  • MoE experts: Stay NVFP4, use FLASHINFER_TRTLLM FusedMoE backend
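
For reference, the NVFP4→BF16 dequantization step mentioned above can be sketched in plain Python. This is a minimal illustration, not the code in patches/deepseek_v4.py: it assumes two E2M1 codes are packed per byte with the low nibble first, one scale per 16 elements, and that the effective scale is the block scale times a per-tensor global scale. The function names are hypothetical.

```python
# E2M1 magnitudes indexed by the low 3 bits of a 4-bit code; bit 3 is the sign.
E2M1_MAGNITUDES = (0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0)
BLOCK_SIZE = 16  # NVFP4 shares one block scale across 16 consecutive elements


def decode_e2m1(code):
    """Decode one 4-bit E2M1 code (sign, 2 exponent bits, 1 mantissa bit)."""
    sign = -1.0 if code & 0x8 else 1.0
    return sign * E2M1_MAGNITUDES[code & 0x7]


def unpack_nibbles(packed):
    """Split each byte into two 4-bit codes.

    Assumption: the low nibble holds the earlier element.
    """
    codes = []
    for byte in packed:
        codes.append(byte & 0x0F)
        codes.append(byte >> 4)
    return codes


def dequantize_row(packed, block_scales, global_scale):
    """Return the row as Python floats (a real pipeline would cast to BF16).

    Assumption: value = e2m1 * block_scale * global_scale.
    """
    codes = unpack_nibbles(packed)
    if len(codes) != len(block_scales) * BLOCK_SIZE:
        raise ValueError("need exactly one block scale per 16 elements")
    return [
        decode_e2m1(code) * block_scales[i // BLOCK_SIZE] * global_scale
        for i, code in enumerate(codes)
    ]
```

The block scales are passed here as ordinary floats; in the checkpoint they would first be decoded from their stored format.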

Critical Bugs Fixed (May 11)

  1. DeepGEMM sf.dim() crash: weight_scale_inv must be a DeepGEMM-formatted block-scale tensor
  2. Compressor indexer shape mismatch: checkpoint keys have .indexer. sub-path
  3. All-ones block scale → garbage output: must use torch.full(..., fp8_scale) not torch.ones
  4. Block scale dtype: must be float32, not float8_e4m3fn
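
Bugs 3 and 4 can be illustrated with a small pure-Python sketch. It assumes the weight is quantized with a single per-tensor FP8 scale and that the block-scale grid uses 128×128 blocks; both are assumptions for illustration, FP8 rounding is omitted, and the helper names are hypothetical. The tensor equivalent of `make_block_scales` would be `torch.full((n_r, n_c), fp8_scale, dtype=torch.float32)`.

```python
import math

FP8_E4M3_MAX = 448.0  # largest finite float8_e4m3fn magnitude
BLOCK = 128           # assumed block edge for the block-scale grid


def per_tensor_scale(weight):
    """Scale that maps the largest |w| onto the FP8 E4M3 maximum."""
    amax = max(abs(v) for row in weight for v in row)
    return amax / FP8_E4M3_MAX if amax > 0.0 else 1.0


def make_block_scales(rows, cols, fill, block=BLOCK):
    """Grid with one float scale per block, every entry set to `fill`."""
    n_r = math.ceil(rows / block)
    n_c = math.ceil(cols / block)
    return [[float(fill)] * n_c for _ in range(n_r)]


def quantize(weight, scale):
    """Divide by the scale (rounding to FP8 is omitted in this sketch)."""
    return [[v / scale for v in row] for row in weight]


def dequantize(q, block_scales, block=BLOCK):
    """Multiply each element by the scale of the block it falls in."""
    return [
        [v * block_scales[r // block][c // block] for c, v in enumerate(row)]
        for r, row in enumerate(q)
    ]
```

Filling the grid with 1.0 instead of the real scale leaves every dequantized weight too large by a factor of 1/scale, which is consistent with the garbage output described in bug 3.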

Outstanding

  • Output quality under investigation — FP4 is aggressive quantization
  • All code in patches/deepseek_v4.py on modelopt-nvfp4 branch