mxf4nvf4 requires FP4×FP4, not FP8×FP4. - New staging kernel: E2M1 nearest-neighbor quantization - Output: uint8 packed (2 E2M1 per byte) + UE4M3 packed int32 scales - Added CUDA sync diagnostics for error localization
DeepSeek V4 Pro → NVFP4 Quantization + vLLM Serving
Full NVFP4 quantization of DeepSeek V4 Pro and vLLM serving on 8× NVIDIA B200 GPUs.
Quick Status
| Component | Status |
|---|---|
| NVFP4 Quantization | ✅ 881GB (Run 11), modelopt 0.45.0.dev64 |
| Weight Loading | ✅ 95 safetensors shards, all 8 TP ranks |
| NVFP4→FP8 Conversion (wo_a) | ✅ DeepGEMM block-scale format |
| NVFP4→BF16 Dequantization | ✅ 305 attn/shared, 91 compressor layers |
| Compressor Reconstruction | ✅ Separate kv_proj/gate_proj → fused_wkv_wgate |
| MoE Expert Serving (FusedMoE) | ✅ FLASHINFER_TRTLLM backend |
| MoE Expert Serving (MegaMoE) | 🔧 Kernel compiles, runs, but garbled (wrong arch flag sm_100 vs sm_100a) |
| API Server | ✅ Running on port 8000 |
| Output Quality | 🔧 Garbled — wrong MMA kind (used mxf8f6f4 instead of mxf4nvf4) |
B200 Node
- IP:
45.76.247.107 - User:
root - Password: see
.env - GPUs: 8× NVIDIA B200 (SM100)
- RAM: ~2.7 TB
- Model weights:
/root/nvidia-meeting/DeepSeek-V4-Pro-NVFP4/ - BF16 reference:
/root/nvidia-meeting/DeepSeek-V4-Pro-BF16/
Repositories
| Repo | Branch | Purpose |
|---|---|---|
deepseek-v4-quant |
modelopt-nvfp4 |
Main patch (FlashInfer FusedMoE path) |
deepseek-v4-quant |
mega-moe-nvfp4 |
MegaMoE patch (DeepGEMM mega_moe path) |
DeepGEMM |
nvfp4-mega-moe |
NVFP4 mega_moe kernel fork |
Architecture
DeepSeek V4 Pro (1.2T params, 61 layers)
├── MLA Attention (61 layers)
│ ├── fused_wqa_wkv → BF16 (UnquantizedLinearMethod)
│ ├── wo_a → FP8 (DeepGEMM block-scale, BMM einsum)
│ ├── wo_b → BF16 (UnquantizedLinearMethod)
│ └── compressor.fused_wkv_wgate → BF16 (reconstructed from NVFP4)
├── MoE Experts (384 experts, 61 layers)
│ ├── [FusedMoE path] → NVFP4 (FLASHINFER_TRTLLM backend)
│ └── [MegaMoE path] → NVFP4 (DeepGEMM mxf8f6f4, UE4M3→UE8M0 adapted)
└── Shared Expert → FP8 (Fp8LinearMethod, DeepGEMM)
The NVFP4 → vLLM Gap
ModelOpt quantizes to NVFP4 (4-bit FP4 with block scales). vLLM's DeepSeek V4 attention code expects FP8 with DeepGEMM block-scale einsum. These formats were never integrated — we're ahead of NVIDIA on this. Key gaps we had to bridge:
1. wo_a: NVFP4 → FP8 + DeepGEMM Block Scale
Problem: wo_a uses deepseek_v4_fp8_einsum (BMM with DeepGEMM), which expects:
- Weight:
float8_e4m3fnin 3D shape(g, r, d)for batched matmul - Scale: DeepGEMM-formatted block scale tensor (not a per-tensor scalar)
Our NVFP4 weights are uint8 packed FP4 with separate block/global scales.
Solution (_convert_nvfp4_to_fp8):
- Unpack NVFP4 uint8 → BF16 using E2M1 lookup table
- Dequantize:
weight_bf16 * block_scale * global_scale(NO input_scale) - Re-quantize BF16 → FP8 e4m3 with per-tensor scale (
w_amax / fp8_max) - Create block scale tensor filled with
fp8_scale - Call
deepgemm_post_process_fp8_weight_blockwithquant_block_shape=(128,128),use_e8m0=True,is_bmm=True - Store:
weight_scale_inv = dg_ws,weight = w_fp8(3D BMM shape)
2. Attention Layers: NVFP4 → BF16
Solution (_convert_nvfp4_to_bf16): Unpack → dequantize → set UnquantizedLinearMethod.
3. Compressor: Reconstructing fused_wkv_wgate from NVFP4
Solution (_reconstruct_compressor_weight): Read kv_proj+gate_proj from
safetensors, dequantize, concatenate. Critical: indexer compressor at
.compressor.indexer.{kv_proj,gate_proj} not .compressor.{kv_proj,gate_proj}.
4. MoE Experts: NVFP4 FusedMoE
Solution: Keep expert weights as NVFP4, use FLASHINFER_TRTLLM MoE backend.
5. BF16 wo_a Layers: BF16 → FP8
Solution (_convert_bf16_to_fp8): Directly quantize BF16 → FP8 with block scale.
Bugs Found and Fixed
| # | Bug | Impact | Fix |
|---|---|---|---|
| 1 | DeepGEMM sf.dim() crash |
Server crash | deepgemm_post_process_fp8_weight_block for block-scale format |
| 2 | Block scale dtype float8_e4m3fn |
Crash | Use float32 |
| 3 | Missing deepgemm_post_process args |
Crash | Pass quant_block_shape, use_e8m0 |
| 4 | Compressor indexer shape mismatch | Crash | .indexer. sub-path in checkpoint keys |
| 5 | All-ones block scale | Garbage output | torch.full(..., fp8_scale) not torch.ones |
| 6 | fused_skip_regex skipping q_b/o_a/o_b scales |
Garbage output | Remove non-fused scale entries from skip list |
| 7 | UE8M0 block scale misinterpreted as E4M3 | Garbled output | _ue8m0_to_float32(): reinterpret raw uint8 as IEEE 754 exponent |
| 8 | wo_a BF16 weight into uint8 param (suspected) | Double-conversion loss | On-the-fly BF16→NVFP4 in weight_loader, or BF16→FP8 directly |
Bug #7 Detail: UE8M0 → float32 Misinterpretation
Root cause: weight_scale bytes are E8M0 format (power-of-2 only, 8-bit exponent),
but .to(torch.float32) interprets the raw byte as E4M3 (8-bit: sign+exp+mantissa),
producing a completely wrong float value.
Fix: _ue8m0_to_float32() — reinterpret the raw uint8 bits as the upper 8 bits
of an IEEE 754 float32 exponent: (uint8_value << 23).view(float32). Applied to all
dequant paths.
Bug #8 Detail: wo_a BF16 Loading
o_a_proj.weight is BF16 in checkpoint, but ModelOptNvFp4Config creates a uint8
param (shape mismatch: BF16 (16384,4096) vs uint8 (16384,2048)). The weight_loader
does on-the-fly BF16→NVFP4 quantization, but the double conversion (BF16→NVFP4→BF16→FP8)
is lossy. Diagnostics added but fix pending.
NVFP4 Mega MoE Kernel
What We Built
A native NVFP4 mega_moe kernel in our DeepGEMM fork that avoids dequantizing expert weights to BF16 before the GEMM. The kernel keeps weights in E2M1 packed format and uses block-scaled MMA directly.
SM100a (B200) Hardware Support (CORRECTED)
B200 (SM100a) DOES support kind::mxf4nvf4 with scale_vec::4X (block16, UE4M3 scales).
This is documented in PTX ISA 8.7 (CUDA 12.8+) and confirmed by NVIDIA/CUTLASS/Colfax.
Build 22 produced garbled output because we incorrectly fell back to mxf8f6f4 (UE8M0/block32),
losing 3 bits of mantissa precision per scale. The mxf4nvf4 instruction was never actually
tried with the correct target. The real issues were:
- Wrong arch flag: JIT compiled for
sm_100instead ofsm_100a(asuffix required) - No NVFP4 MMA kind enum: DeepGEMM only had BF16 + MXFP8FP4 — NVFP4 was just MXFP4 in disguise
- Wrong SF layout: Block16 (scale_vec::4X) has different TMEM layout than block32 (2X)
Fix: Target sm_100a, emit tcgen05.mma.kind::mxf4nvf4.block_scale.scale_vec::4X,
keep E2M1 weights + UE4M3 scales + block16. No scale conversion needed. Full precision.
| Parameter | NVFP4 Checkpoint | Kernel (sm_100a, CORRECT) |
|---|---|---|
| Weight format | E2M1 uint8 | E2M1 uint8 (unchanged) |
| Block scale format | UE4M3 (float8_e4m3fn) | UE4M3 (native, no conversion) |
| Block size | 16 | 16 (native) |
| Global scale | float32 | Folded in before packing |
| PTX instruction | mxf4nvf4.block_scale.scale_vec::4X |
Same |
| Instruction descriptor | float_ue4m3_t |
Same |
Kernel Architecture (TARGET — sm_100a with mxf4nvf4)
sm100_fp8_nvfp4_mega_moe_impl (adapted from sm100_fp8_fp4_mega_moe_impl)
├── kGranK = 16 (NVFP4 native block size)
├── kind::mxf4nvf4.block_scale.scale_vec::4X PTX instruction
├── float_ue4m3_t instruction descriptor
├── Block16 SF layout: scale_vec::4X, 4 TMEM sub-columns per UMMA atom
├── UTCCP copy: i*8 stride (4X layout)
├── kNumSFATmemCols = SF_BLOCK_M / 32 * 4
├── kNumSFBTmemCols = SF_BLOCK_N / 32 * 4
├── kNumSFUint32 = kHidden / 64 (4 UE4M3 per int32)
├── UE4M3 L1 epilogue (float → cutlass::float_e4m3_t cast)
└── recipe = (1, 1, 16)
Python API:
├── fp8_nvfp4_mega_moe() — recipe=(1,1,16)
├── transform_nvfp4_weights_for_mega_moe()
│ ├── fold_global_scale(): UE4M3 * FP32 → UE4M3
│ ├── NO block16→block32 merge (native block16)
│ ├── NO UE4M3→UE8M0 conversion (native UE4M3)
│ └── pack_ue4m3_to_int32() + transform_sf_into_required_layout(gran_k=16)
└── get_symm_buffer_for_nvfp4_mega_moe() — 2x SF buffer
C++ Bindings:
├── csrc/apis/mega_nvfp4.hpp — kGranK=16, SF stride K/16
├── csrc/jit_kernels/impls/sm100_fp8_nvfp4_mega_moe.hpp
└── csrc/apis/layout.hpp — gran_k=16 support
Current State (Build 22 — WRONG PATH)
Build 22 used mxf8f6f4 with UE8M0/block32, which produces garbled output.
The kernel architecture below is what was deployed. It needs to be reverted to the
mxf4nvf4 path with sm_100a targeting.
Build 22 (GARBLED — do not use):
├── kGranK = 32, mxf8f6f4, float_ue8m0_t
├── Same TMEM layout as MXFP4 (2X, block32)
├── UE4M3→UE8M0 scale conversion (lossy)
├── block16→block32 scale merge (lossy)
└── recipe = (1, 1, 32)
The earlier Build 17 code had the RIGHT instruction (mxf4nvf4.scale_vec::4X)
but wrong arch flag (sm_100 instead of sm_100a). We need to go back to that
code and fix the arch flag.
Container Build Pipeline
Dockerfile → FROM atl.vultrcr.com/vllm/vllm-with-lmcache:dream-build
├── DeepGEMM (nvfp4-mega-moe branch) — JIT-compiled at runtime
├── vLLM patch (deepseek_v4.py) — COPY over model file
└── NVRTC symlink for CUDA compilation
build_push.sh → build → login to CR → push → update docker-compose
Container registry: atl.vultrcr.com/vllm/vllm-dsv4-nvfp4:latest
Always run builds in screen: screen -S nvfp4-build
Debugging Log (Builds 1–22)
| Build | Error | Fix |
|---|---|---|
| 1–6 | Various Dockerfile/build issues | NVRTC symlink, CPATH, PYTHONPATH |
| 7 | kPackedFP4 type mismatch |
uint8→int8 view on weights |
| 9 | SF stride assertion | Need MN-major layout + TMA alignment |
| 10 | transform_sf_into_required_layout no gran_k=16 |
C++ fix |
| 11 | SF dtype float8_e4m3fn rejected | Pack UE4M3→int32 first |
| 12–14 | SF stride layout | Transpose to MN-major before transform |
| 15 | SymmBuffer too small (NVFP4 has 2× SF) | NVFP4-specific SymmBuffer |
| 16 | ImportError: deep_gemm.mega.nvfp4 |
Python wrapper in mega/init.py |
| 17 | NVCC: scale_vec::4X not supported on sm_100f |
Wrong arch: need sm_100a not sm_100 |
| 18 | NVCC: scale_vec::2X ALSO not supported |
Same — sm_100a required |
| 19 | kGranK=16 still in C++ binding | → 32 |
| 20 | UE4M3→UE8M0 uint32 >> 23 fails |
Cast to int32 first |
| 22 | Server UP, but garbled output | Fell back to mxf8f6f4 — should use mxf4nvf4 on sm_100a |
Path Forward
The mxf4nvf4 instruction IS supported on B200 (SM100a). Build 17-18 failed because
we targeted sm_100 instead of sm_100a. Build 22 garbled because we fell back to
mxf8f6f4 unnecessarily. The correct fix:
Fix the NVFP4 MegaMoE Kernel (Recommended)
- Target
sm_100ain DeepGEMM's JIT compiler - Add NVFP4 MMA kind enum (not just MXFP8FP4 with an NVFP4 hat)
- Emit
tcgen05.mma.kind::mxf4nvf4.block_scale.scale_vec::4X - Use
float_ue4m3_tinstruction descriptor - Block16 SF layout (scale_vec::4X) — different TMEM layout from block32 (2X)
- Keep E2M1 weights + UE4M3 scales + block16 — no scale conversion needed
This gives full NVFP4 precision with zero scale format conversion.
Running
# On B200 node
cd /root/nvidia-meeting
docker compose up -d
# Check logs
docker logs -f nvidia-meeting-vllm-1
# Test
curl http://localhost:8000/v1/models
curl http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{"model": "/model", "messages": [{"role": "user", "content": "Hello"}], "max_tokens": 50}'
Files
| File | Purpose |
|---|---|
patches/deepseek_v4.py |
Main patch: NVFP4 post-load conversion, weight reconstruction |
patches/modelopt.py |
ModelOpt FP4 config patches for weight loading |
.env |
B200 node credentials |
Dockerfile |
Container build (extends dream-build with DeepGEMM + patch) |
build_push.sh |
Build, push to CR, update docker-compose |
NVFP4 Format Specification
- Weights: E2M1 packed uint8 (2 values per byte)
- Block scales: float8_e4m3fn (UE4M3), group_size=16
- Global scales: float32 (weight_scale_2), per-tensor
- Dequant formula:
value = packed_E2M1 * block_scale * global_scale - Block scale range: [0, 448] (UE4M3 max = 448, E2M1 max = 6, so 6×448 = 2688)
- UE8M0: Power-of-2 only. Encoded as uint8 = float32_exponent_bits[31:23]
- UE4M3: 3-bit mantissa + 4-bit exponent + sign. Max = 448.
HARD RULES
- NEVER convert DeepSeek MoE experts to MXFP4. Experts stay in NVFP4. Period.