biondizzle cfead0012d docs: comprehensive README update through build 22
- Full architecture diagram and NVFP4→vLLM bridge details
- All 8 bugs documented with fixes
- SM100 hardware limitation (mxf4nvf4 unsupported)
- MegaMoE kernel architecture and debugging log (builds 1-22)
- Three paths forward (A: FlashInfer, B: BF16 mega_moe, C: SM103+)
- Container build pipeline, NVFP4 format spec, hard rules
2026-05-11 13:53:41 +00:00

DeepSeek V4 Pro → NVFP4 Quantization + vLLM Serving

Full NVFP4 quantization of DeepSeek V4 Pro and vLLM serving on 8× NVIDIA B200 GPUs.

Quick Status

| Component | Status |
|---|---|
| NVFP4 Quantization | 881GB (Run 11), modelopt 0.45.0.dev64 |
| Weight Loading | 95 safetensors shards, all 8 TP ranks |
| NVFP4→FP8 Conversion (wo_a) | DeepGEMM block-scale format |
| NVFP4→BF16 Dequantization | 305 attn/shared, 91 compressor layers |
| Compressor Reconstruction | Separate kv_proj/gate_proj → fused_wkv_wgate |
| MoE Expert Serving (FusedMoE) | FLASHINFER_TRTLLM backend |
| MoE Expert Serving (MegaMoE) | 🔧 Kernel compiles, runs, but garbled (SM100 HW limit) |
| API Server | Running on port 8000 |
| Output Quality | 🔧 Garbled: UE8M0 scale precision loss + attention bugs |

B200 Node

  • IP: 45.76.247.107
  • User: root
  • Password: see .env
  • GPUs: 8× NVIDIA B200 (SM100)
  • RAM: ~2.7 TB
  • Model weights: /root/nvidia-meeting/DeepSeek-V4-Pro-NVFP4/
  • BF16 reference: /root/nvidia-meeting/DeepSeek-V4-Pro-BF16/

Repositories

| Repo | Branch | Purpose |
|---|---|---|
| deepseek-v4-quant | modelopt-nvfp4 | Main patch (FlashInfer FusedMoE path) |
| deepseek-v4-quant | mega-moe-nvfp4 | MegaMoE patch (DeepGEMM mega_moe path) |
| DeepGEMM | nvfp4-mega-moe | NVFP4 mega_moe kernel fork |

Architecture

DeepSeek V4 Pro (1.2T params, 61 layers)
├── MLA Attention (61 layers)
│   ├── fused_wqa_wkv → BF16 (UnquantizedLinearMethod)
│   ├── wo_a → FP8 (DeepGEMM block-scale, BMM einsum)
│   ├── wo_b → BF16 (UnquantizedLinearMethod)
│   └── compressor.fused_wkv_wgate → BF16 (reconstructed from NVFP4)
├── MoE Experts (384 experts, 61 layers)
│   ├── [FusedMoE path] → NVFP4 (FLASHINFER_TRTLLM backend)
│   └── [MegaMoE path] → NVFP4 (DeepGEMM mxf8f6f4, UE4M3→UE8M0 adapted)
└── Shared Expert → FP8 (Fp8LinearMethod, DeepGEMM)

The NVFP4 → vLLM Gap

ModelOpt quantizes to NVFP4 (4-bit FP4 with block scales). vLLM's DeepSeek V4 attention code expects FP8 with DeepGEMM block-scale einsum. These formats were never integrated — we're ahead of NVIDIA on this. Key gaps we had to bridge:

1. wo_a: NVFP4 → FP8 + DeepGEMM Block Scale

Problem: wo_a uses deepseek_v4_fp8_einsum (BMM with DeepGEMM), which expects:

  • Weight: float8_e4m3fn in 3D shape (g, r, d) for batched matmul
  • Scale: DeepGEMM-formatted block scale tensor (not a per-tensor scalar)

Our NVFP4 weights are uint8 packed FP4 with separate block/global scales.

Solution (_convert_nvfp4_to_fp8):

  1. Unpack NVFP4 uint8 → BF16 using E2M1 lookup table
  2. Dequantize: weight_bf16 * block_scale * global_scale (NO input_scale)
  3. Re-quantize BF16 → FP8 e4m3 with per-tensor scale (w_amax / fp8_max)
  4. Create block scale tensor filled with fp8_scale
  5. Call deepgemm_post_process_fp8_weight_block with quant_block_shape=(128,128), use_e8m0=True, is_bmm=True
  6. Store: weight_scale_inv = dg_ws, weight = w_fp8 (3D BMM shape)
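
Steps 1 to 4 can be sketched in pure Python on a toy tensor. This is an illustrative sketch, not the code in _convert_nvfp4_to_fp8: the helper names are hypothetical, the low-nibble-first packing order and the one-scale-per-16-consecutive-values layout are assumptions, and the real path works on torch tensors before handing off to deepgemm_post_process_fp8_weight_block.

```python
import math

# E2M1 code -> value: 1 sign bit, 2 exponent bits, 1 mantissa bit.
E2M1_LUT = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0,
            -0.0, -0.5, -1.0, -1.5, -2.0, -3.0, -4.0, -6.0]
FP8_E4M3_MAX = 448.0
GROUP_SIZE = 16  # NVFP4 block-scale group size


def unpack_e2m1(packed):
    """Step 1: expand each byte into two FP4 values (low nibble first, assumed)."""
    values = []
    for byte in packed:
        values.append(E2M1_LUT[byte & 0x0F])
        values.append(E2M1_LUT[byte >> 4])
    return values


def dequantize(packed, block_scales, global_scale):
    """Step 2: value * block_scale * global_scale (no input_scale)."""
    values = unpack_e2m1(packed)
    return [v * block_scales[i // GROUP_SIZE] * global_scale
            for i, v in enumerate(values)]


def fp8_per_tensor_scale(weights):
    """Step 3: per-tensor scale so the largest weight maps to the FP8 max."""
    w_amax = max(abs(w) for w in weights)
    return w_amax / FP8_E4M3_MAX


def block_scale_grid(rows, cols, fp8_scale, block=(128, 128)):
    """Step 4: one scale per 128x128 block, every entry set to fp8_scale."""
    n_row_blocks = math.ceil(rows / block[0])
    n_col_blocks = math.ceil(cols / block[1])
    return [[fp8_scale] * n_col_blocks for _ in range(n_row_blocks)]


# Toy example: 16 values (8 packed bytes) sharing one block scale.
packed = bytes([0x21] * 8)  # low nibble 1 -> 0.5, high nibble 2 -> 1.0
weights = dequantize(packed, block_scales=[2.0], global_scale=0.5)
fp8_scale = fp8_per_tensor_scale(weights)
grid = block_scale_grid(256, 384, fp8_scale)  # 2 x 3 blocks
```

Filling the grid with fp8_scale rather than ones is the point of bug #5 below: the block-scale tensor has to carry the actual per-tensor scale.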

2. Attention Layers: NVFP4 → BF16

Solution (_convert_nvfp4_to_bf16): Unpack → dequantize → set UnquantizedLinearMethod.

3. Compressor: Reconstructing fused_wkv_wgate from NVFP4

Solution (_reconstruct_compressor_weight): Read kv_proj+gate_proj from safetensors, dequantize, concatenate. Critical: indexer compressor at .compressor.indexer.{kv_proj,gate_proj} not .compressor.{kv_proj,gate_proj}.

4. MoE Experts: NVFP4 FusedMoE

Solution: Keep expert weights as NVFP4, use FLASHINFER_TRTLLM MoE backend.

5. BF16 wo_a Layers: BF16 → FP8

Solution (_convert_bf16_to_fp8): Directly quantize BF16 → FP8 with block scale.

Bugs Found and Fixed

| # | Bug | Impact | Fix |
|---|---|---|---|
| 1 | DeepGEMM sf.dim() crash | Server crash | deepgemm_post_process_fp8_weight_block for block-scale format |
| 2 | Block scale dtype float8_e4m3fn | Crash | Use float32 |
| 3 | Missing deepgemm_post_process args | Crash | Pass quant_block_shape, use_e8m0 |
| 4 | Compressor indexer shape mismatch | Crash | .indexer. sub-path in checkpoint keys |
| 5 | All-ones block scale | Garbage output | torch.full(..., fp8_scale) not torch.ones |
| 6 | fused_skip_regex skipping q_b/o_a/o_b scales | Garbage output | Remove non-fused scale entries from skip list |
| 7 | UE8M0 block scale misinterpreted as E4M3 | Garbled output | _ue8m0_to_float32(): reinterpret raw uint8 as IEEE 754 exponent |
| 8 | wo_a BF16 weight into uint8 param (suspected) | Double-conversion loss | On-the-fly BF16→NVFP4 in weight_loader, or BF16→FP8 directly |

Bug #7 Detail: UE8M0 → float32 Misinterpretation

Root cause: weight_scale bytes are E8M0 format (power-of-2 only, 8-bit exponent), but .to(torch.float32) interprets the raw byte as E4M3 (8-bit: sign+exp+mantissa), producing a completely wrong float value.

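Fix: _ue8m0_to_float32() reinterprets the raw uint8 as the 8-bit exponent field of an IEEE 754 float32 (bits 30:23). The byte must be widened before shifting, since a uint8 shifted left by 23 overflows: (uint8_value.to(torch.int32) << 23).view(torch.float32). Applied to all dequant paths.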
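
The difference between the two interpretations is easy to see on a single byte. The sketch below uses only the standard library; the function names are hypothetical stand-ins and not the patch's torch implementation.

```python
import math
import struct


def ue8m0_to_float32(byte):
    """Place the byte in the float32 exponent field (bits 30:23) and reinterpret."""
    return struct.unpack("<f", struct.pack("<I", (byte & 0xFF) << 23))[0]


def e4m3fn_to_float32(byte):
    """Decode the same byte as float8_e4m3fn: 1 sign, 4 exponent, 3 mantissa bits."""
    sign = -1.0 if byte & 0x80 else 1.0
    exponent = (byte >> 3) & 0x0F
    mantissa = byte & 0x07
    if exponent == 0x0F and mantissa == 0x07:
        return math.nan  # e4m3fn reserves this bit pattern for NaN
    if exponent == 0:
        return sign * (mantissa / 8.0) * 2.0 ** -6  # subnormal
    return sign * (1.0 + mantissa / 8.0) * 2.0 ** (exponent - 7)


raw = 0x78  # the same stored byte, read two ways
as_ue8m0 = ue8m0_to_float32(raw)   # 2**(120 - 127) = 0.0078125
as_e4m3 = e4m3fn_to_float32(raw)   # 2**(15 - 7) = 256.0
```

Here the two readings of one byte differ by a factor of 32768. Note that this bit trick maps byte 0 to 0.0 and byte 255 to inf, so those two codes need separate handling if they can occur.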

Bug #8 Detail: wo_a BF16 Loading

o_a_proj.weight is BF16 in checkpoint, but ModelOptNvFp4Config creates a uint8 param (shape mismatch: BF16 (16384,4096) vs uint8 (16384,2048)). The weight_loader does on-the-fly BF16→NVFP4 quantization, but the double conversion (BF16→NVFP4→BF16→FP8) is lossy. Diagnostics added but fix pending.
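
The shape mismatch follows directly from the NVFP4 packing (2 values per byte, one block scale per 16 values). A small sketch, assuming packing and scale groups run along the last dimension; the function and dictionary key names are hypothetical:

```python
def nvfp4_param_shapes(rows, cols, group_size=16):
    """Shapes of the NVFP4 parameters for a logical (rows, cols) weight."""
    if cols % group_size != 0:
        raise ValueError("cols must be a multiple of the block-scale group size")
    return {
        "weight_packed_uint8": (rows, cols // 2),    # two E2M1 values per byte
        "block_scale": (rows, cols // group_size),   # one scale per 16 values
    }


# The BF16 o_a_proj.weight is (16384, 4096); the uint8 param expects half the columns.
shapes = nvfp4_param_shapes(16384, 4096)
```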

NVFP4 Mega MoE Kernel

What We Built

A native NVFP4 mega_moe kernel in our DeepGEMM fork that avoids dequantizing expert weights to BF16 before the GEMM. The kernel keeps weights in E2M1 packed format and uses block-scaled MMA directly.

SM100 Hardware Limitation (CRITICAL)

B200 (SM100) does NOT support kind::mxf4nvf4 (neither scale_vec::2X nor 4X). This PTX instruction requires SM103 (B300) or SM120 (GB300). On SM100, the only FP4 block-scaled MMA is kind::mxf8f6f4 with UE8M0 scales (block32, group_size=32).

| Parameter | NVFP4 Checkpoint | Kernel (SM100 Adapted) |
|---|---|---|
| Weight format | E2M1 uint8 | E2M1 uint8 (unchanged) |
| Block scale format | UE4M3 (float8_e4m3fn) | UE8M0 (uint8), adapted for HW |
| Block size | 16 | 32 (merged adjacent pairs, max) |
| Global scale | float32 | Folded in before UE4M3→UE8M0 |
| PTX instruction | N/A (requires SM103+) | mxf8f6f4.block_scale (same as MXFP4) |

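Result: Server starts and serves, but output is garbled. UE8M0 has no mantissa bits, so the UE4M3→UE8M0 conversion drops all 3 mantissa bits of every scale: the 8 values UE4M3 can represent within each power-of-2 interval collapse to one. Because the conversion extracts the exponent byte, a scale such as 1.875 × 2^k becomes 2^k, an error of nearly 2×, which destroys output quality. The E2M1 weights are correct, but the power-of-2-only UE8M0 scales can't faithfully represent the original UE4M3 values.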
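
The size of the scale error can be reproduced in a few lines. This is a simplified sketch of the conversion, not the DeepGEMM fork's code: the function names are hypothetical, the folded scales are kept as Python floats instead of being re-encoded as UE4M3, and only positive normal values are handled.

```python
import struct


def float_to_ue8m0(x):
    """Keep only the float32 exponent byte; the mantissa is truncated."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    return (bits >> 23) & 0xFF


def ue8m0_to_float(byte):
    """Inverse view: put the byte back in the float32 exponent field."""
    return struct.unpack("<f", struct.pack("<I", byte << 23))[0]


def merge_block16_to_block32(scales):
    """Merge adjacent block-16 scales into one block-32 scale by taking the max."""
    return [max(scales[i], scales[i + 1]) for i in range(0, len(scales), 2)]


def convert_scales(block16_scales, global_scale):
    """Fold the global scale, merge to block 32, then extract exponent bytes."""
    folded = [s * global_scale for s in block16_scales]
    merged = merge_block16_to_block32(folded)
    return merged, [float_to_ue8m0(s) for s in merged]


block16 = [1.75, 0.5, 3.0, 2.0]
merged, encoded = convert_scales(block16, global_scale=1.0)
decoded = [ue8m0_to_float(b) for b in encoded]
ratios = [m / d for m, d in zip(merged, decoded)]  # how far each scale is off
```

In this toy case the merged scales 1.75 and 3.0 decode as 1.0 and 2.0. The merge adds a second error on top: the block whose own scale was 0.5 now shares the scale chosen for its neighbour.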

Kernel Architecture

sm100_fp8_nvfp4_mega_moe_impl  (adapted from sm100_fp8_fp4_mega_moe_impl)
├── Same E2M1 weight packing as MXFP4
├── Same TMEM layout as MXFP4 (2X, block32)
├── Same UTCCP copy (4x32 transpose, i*4 stride)
├── mxf8f6f4.block_scale PTX instruction (UE8M0)
├── float_ue8m0_t instruction descriptor
└── UE8M0 L1 epilogue (>> 23 activation scales)

Python API:
├── fp8_nvfp4_mega_moe() — recipe=(1,1,32)
├── transform_nvfp4_weights_for_mega_moe()
│   ├── fold_global_scale(): UE4M3 * FP32 → UE4M3
│   ├── merge_block16_to_block32(): max of adjacent pairs
│   ├── UE4M3 → float32 → UE8M0 (extract exponent byte)
│   └── pack_uint8_to_int32() + transform_sf_into_required_layout()
└── get_symm_buffer_for_nvfp4_mega_moe() — 2x SF buffer

C++ Bindings:
├── csrc/apis/mega_nvfp4.hpp
├── csrc/jit_kernels/impls/sm100_fp8_nvfp4_mega_moe.hpp
└── csrc/apis/layout.hpp — gran_k=32 support

Container Build Pipeline

Dockerfile → FROM atl.vultrcr.com/vllm/vllm-with-lmcache:dream-build
  ├── DeepGEMM (nvfp4-mega-moe branch) — JIT-compiled at runtime
  ├── vLLM patch (deepseek_v4.py) — COPY over model file
  └── NVRTC symlink for CUDA compilation

build_push.sh → build → login to CR → push → update docker-compose
  Container registry: atl.vultrcr.com/vllm/vllm-dsv4-nvfp4:latest
  Always run builds in screen: screen -S nvfp4-build

Debugging Log (Builds 1-22)

| Build | Error | Fix |
|---|---|---|
| 1-6 | Various Dockerfile/build issues | NVRTC symlink, CPATH, PYTHONPATH |
| 7 | kPackedFP4 type mismatch | uint8→int8 view on weights |
| 9 | SF stride assertion | Need MN-major layout + TMA alignment |
| 10 | transform_sf_into_required_layout no gran_k=16 | C++ fix |
| 11 | SF dtype float8_e4m3fn rejected | Pack UE4M3→int32 first |
| 12-14 | SF stride layout | Transpose to MN-major before transform |
| 15 | SymmBuffer too small (NVFP4 has 2× SF) | NVFP4-specific SymmBuffer |
| 16 | ImportError: deep_gemm.mega.nvfp4 | Python wrapper in mega/__init__.py |
| 17 | NVCC: scale_vec::4X not supported on sm_100f | |
| 18 | NVCC: scale_vec::2X ALSO not supported | |
| 19 | kGranK=16 still in C++ binding | Change to 32 |
| 20 | UE4M3→UE8M0 uint32 >> 23 fails | Cast to int32 first |
| 22 | Server UP, but garbled output | UE4M3→UE8M0 precision loss |

Path Forward

The mega_moe approach has a hardware ceiling on B200 — the Tensor Core can't consume UE4M3 block scales. Three options:

Option A: FlashInfer FusedMoE (modelopt-nvfp4 branch)

The modelopt-nvfp4 branch already uses FlashInfer FP4 MoE, which dequantizes NVFP4→BF16 before GEMM. This avoids the UE4M3→UE8M0 precision loss. The garbled output on that branch is likely from the attention layer bugs (#7 and #8), not the MoE. Fix those and we should get coherent output.

Option B: Dequant NVFP4→BF16 in MegaMoE Shared Memory

Build a mega_moe that dequantizes in shared memory, then uses BF16 MMA. Slower than FlashInfer but gets the mega_moe communication pattern.

Option C: Wait for SM103+ Hardware

B300 (SM103) and GB300 (SM120) support mxf4nvf4 natively with UE4M3 scales. The kernel we built would work correctly on that hardware.

Running

# On B200 node
cd /root/nvidia-meeting
docker compose up -d

# Check logs
docker logs -f nvidia-meeting-vllm-1

# Test
curl http://localhost:8000/v1/models
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "/model", "messages": [{"role": "user", "content": "Hello"}], "max_tokens": 50}'
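
The same chat request can be issued from Python when scripting smoke tests. A minimal standard-library sketch, assuming the server from the commands above is reachable on localhost:8000; the helper names are hypothetical:

```python
import json
import urllib.request


def build_chat_request(prompt, base_url="http://localhost:8000",
                       model="/model", max_tokens=50):
    """Build the same POST request as the curl chat-completions call."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }
    return urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(request, timeout=60):
    """Send the request and return the parsed JSON reply."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)


request = build_chat_request("Hello")
# reply = send(request)  # requires the server to be running
# print(reply["choices"][0]["message"]["content"])
```

Printing the returned message content makes it quick to see whether a build still produces garbled text.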

Files

| File | Purpose |
|---|---|
| patches/deepseek_v4.py | Main patch: NVFP4 post-load conversion, weight reconstruction |
| patches/modelopt.py | ModelOpt FP4 config patches for weight loading |
| .env | B200 node credentials |
| Dockerfile | Container build (extends dream-build with DeepGEMM + patch) |
| build_push.sh | Build, push to CR, update docker-compose |

NVFP4 Format Specification

  • Weights: E2M1 packed uint8 (2 values per byte)
  • Block scales: float8_e4m3fn (UE4M3), group_size=16
  • Global scales: float32 (weight_scale_2), per-tensor
  • Dequant formula: value = unpacked_E2M1 * block_scale * global_scale
  • Block scale range: [0, 448] (UE4M3 max = 448, E2M1 max = 6, so the largest magnitude before the global scale is 6×448 = 2688)
  • UE8M0: Power-of-2 only. Encoded as uint8 = float32_exponent_bits[30:23] (bit 31 is the sign bit)
  • UE4M3: 4-bit exponent + 3-bit mantissa, unsigned; stored as float8_e4m3fn with the sign bit unused. Max = 448.

HARD RULES

  • NEVER convert DeepSeek MoE experts to MXFP4. Experts stay in NVFP4. Period.