Go to file

biondizzle 27dbf2850f fix: replace nested tl.where with sum-of-comparisons for E2M1 quantization

Triton can't compile deeply nested tl.where. Use arithmetic instead:
idx = sum(abs_s >= threshold_i) for 7 threshold values.

2026-05-11 22:23:05 +00:00

memory

Update MEMORY.md and memory files with vLLM NVFP4 serving progress

2026-05-11 02:02:49 +00:00

patches

fix: replace nested tl.where with sum-of-comparisons for E2M1 quantization

2026-05-11 22:23:05 +00:00

scripts

vLLM serving: patched deepseek_v4.py, disabled mega_moe, updated docs

2026-05-10 16:14:17 +00:00

tmp

Update MEMORY.md and memory files with vLLM NVFP4 serving progress

2026-05-11 02:02:49 +00:00

.env

Purge OpenClaw session files, memory dumps, __pycache__. Update .gitignore

2026-05-08 17:09:59 +00:00

.gitignore

Replace shell wrapper with in-process quantize script

2026-05-09 06:07:22 +00:00

build_push.sh

fix: add --enable-expert-parallel to compose command

2026-05-11 06:15:11 +00:00

docker-compose.yml

vLLM serving: patched deepseek_v4.py, disabled mega_moe, updated docs

2026-05-10 16:14:17 +00:00

Dockerfile

bump cache buster 32 for packed FP4 mxf4nvf4 fix

2026-05-11 21:59:56 +00:00

index.yaml

Purge INT4 references — expert weights are FP4 (E2M1), not INT4

2026-05-08 02:33:46 +00:00

MEMORY.md

Update MEMORY.md and memory files with vLLM NVFP4 serving progress

2026-05-11 02:02:49 +00:00

README.md

docs: CORRECTED — mxf4nvf4 IS supported on sm_100a (B200)

2026-05-11 14:24:13 +00:00

requirements.txt

NVIDIA Model Optimizer branch: nvfp4_experts_only PTQ for DeepSeek V4 Pro

2026-05-07 00:11:31 +00:00

README.md

DeepSeek V4 Pro → NVFP4 Quantization + vLLM Serving

Full NVFP4 quantization of DeepSeek V4 Pro and vLLM serving on 8× NVIDIA B200 GPUs.

Quick Status

Component	Status
NVFP4 Quantization	✅ 881GB (Run 11), modelopt 0.45.0.dev64
Weight Loading	✅ 95 safetensors shards, all 8 TP ranks
NVFP4→FP8 Conversion (wo_a)	✅ DeepGEMM block-scale format
NVFP4→BF16 Dequantization	✅ 305 attn/shared, 91 compressor layers
Compressor Reconstruction	✅ Separate kv_proj/gate_proj → fused_wkv_wgate
MoE Expert Serving (FusedMoE)	✅ FLASHINFER_TRTLLM backend
MoE Expert Serving (MegaMoE)	🔧 Kernel compiles, runs, but garbled (wrong arch flag sm_100 vs sm_100a)
API Server	✅ Running on port 8000
Output Quality	🔧 Garbled — wrong MMA kind (used mxf8f6f4 instead of mxf4nvf4)

B200 Node

IP: 45.76.247.107
User: root
Password: see .env
GPUs: 8× NVIDIA B200 (SM100)
RAM: ~2.7 TB
Model weights: /root/nvidia-meeting/DeepSeek-V4-Pro-NVFP4/
BF16 reference: /root/nvidia-meeting/DeepSeek-V4-Pro-BF16/

Repositories

Repo	Branch	Purpose
`deepseek-v4-quant`	`modelopt-nvfp4`	Main patch (FlashInfer FusedMoE path)
`deepseek-v4-quant`	`mega-moe-nvfp4`	MegaMoE patch (DeepGEMM mega_moe path)
`DeepGEMM`	`nvfp4-mega-moe`	NVFP4 mega_moe kernel fork

Architecture

DeepSeek V4 Pro (1.2T params, 61 layers)
├── MLA Attention (61 layers)
│   ├── fused_wqa_wkv → BF16 (UnquantizedLinearMethod)
│   ├── wo_a → FP8 (DeepGEMM block-scale, BMM einsum)
│   ├── wo_b → BF16 (UnquantizedLinearMethod)
│   └── compressor.fused_wkv_wgate → BF16 (reconstructed from NVFP4)
├── MoE Experts (384 experts, 61 layers)
│   ├── [FusedMoE path] → NVFP4 (FLASHINFER_TRTLLM backend)
│   └── [MegaMoE path] → NVFP4 (DeepGEMM mxf8f6f4, UE4M3→UE8M0 adapted)
└── Shared Expert → FP8 (Fp8LinearMethod, DeepGEMM)

The NVFP4 → vLLM Gap

ModelOpt quantizes to NVFP4 (4-bit FP4 with block scales). vLLM's DeepSeek V4 attention code expects FP8 with DeepGEMM block-scale einsum. These formats were never integrated — we're ahead of NVIDIA on this. Key gaps we had to bridge:

1. wo_a: NVFP4 → FP8 + DeepGEMM Block Scale

Problem: wo_a uses deepseek_v4_fp8_einsum (BMM with DeepGEMM), which expects:

Weight: float8_e4m3fn in 3D shape (g, r, d) for batched matmul
Scale: DeepGEMM-formatted block scale tensor (not a per-tensor scalar)

Our NVFP4 weights are uint8 packed FP4 with separate block/global scales.

Solution (_convert_nvfp4_to_fp8):

Unpack NVFP4 uint8 → BF16 using E2M1 lookup table
Dequantize: weight_bf16 * block_scale * global_scale (NO input_scale)
Re-quantize BF16 → FP8 e4m3 with per-tensor scale (w_amax / fp8_max)
Create block scale tensor filled with fp8_scale
Call deepgemm_post_process_fp8_weight_block with quant_block_shape=(128,128), use_e8m0=True, is_bmm=True
Store: weight_scale_inv = dg_ws, weight = w_fp8 (3D BMM shape)

2. Attention Layers: NVFP4 → BF16

Solution (_convert_nvfp4_to_bf16): Unpack → dequantize → set UnquantizedLinearMethod.

3. Compressor: Reconstructing fused_wkv_wgate from NVFP4

Solution (_reconstruct_compressor_weight): Read kv_proj+gate_proj from safetensors, dequantize, concatenate. Critical: indexer compressor at .compressor.indexer.{kv_proj,gate_proj} not .compressor.{kv_proj,gate_proj}.

4. MoE Experts: NVFP4 FusedMoE

Solution: Keep expert weights as NVFP4, use FLASHINFER_TRTLLM MoE backend.

5. BF16 wo_a Layers: BF16 → FP8

Solution (_convert_bf16_to_fp8): Directly quantize BF16 → FP8 with block scale.

Bugs Found and Fixed

#	Bug	Impact	Fix
1	DeepGEMM `sf.dim()` crash	Server crash	`deepgemm_post_process_fp8_weight_block` for block-scale format
2	Block scale dtype `float8_e4m3fn`	Crash	Use `float32`
3	Missing `deepgemm_post_process` args	Crash	Pass `quant_block_shape`, `use_e8m0`
4	Compressor indexer shape mismatch	Crash	`.indexer.` sub-path in checkpoint keys
5	All-ones block scale	Garbage output	`torch.full(..., fp8_scale)` not `torch.ones`
6	`fused_skip_regex` skipping q_b/o_a/o_b scales	Garbage output	Remove non-fused scale entries from skip list
7	UE8M0 block scale misinterpreted as E4M3	Garbled output	`_ue8m0_to_float32()`: reinterpret raw uint8 as IEEE 754 exponent
8	wo_a BF16 weight into uint8 param (suspected)	Double-conversion loss	On-the-fly BF16→NVFP4 in weight_loader, or BF16→FP8 directly

Bug #7 Detail: UE8M0 → float32 Misinterpretation

Root cause: weight_scale bytes are E8M0 format (power-of-2 only, 8-bit exponent), but .to(torch.float32) interprets the raw byte as E4M3 (8-bit: sign+exp+mantissa), producing a completely wrong float value.

Fix: _ue8m0_to_float32() — reinterpret the raw uint8 bits as the upper 8 bits of an IEEE 754 float32 exponent: (uint8_value << 23).view(float32). Applied to all dequant paths.

Bug #8 Detail: wo_a BF16 Loading

o_a_proj.weight is BF16 in checkpoint, but ModelOptNvFp4Config creates a uint8 param (shape mismatch: BF16 (16384,4096) vs uint8 (16384,2048)). The weight_loader does on-the-fly BF16→NVFP4 quantization, but the double conversion (BF16→NVFP4→BF16→FP8) is lossy. Diagnostics added but fix pending.

NVFP4 Mega MoE Kernel

What We Built

A native NVFP4 mega_moe kernel in our DeepGEMM fork that avoids dequantizing expert weights to BF16 before the GEMM. The kernel keeps weights in E2M1 packed format and uses block-scaled MMA directly.

SM100a (B200) Hardware Support (CORRECTED)

B200 (SM100a) DOES support kind::mxf4nvf4 with scale_vec::4X (block16, UE4M3 scales). This is documented in PTX ISA 8.7 (CUDA 12.8+) and confirmed by NVIDIA/CUTLASS/Colfax.

Build 22 produced garbled output because we incorrectly fell back to mxf8f6f4 (UE8M0/block32), losing 3 bits of mantissa precision per scale. The mxf4nvf4 instruction was never actually tried with the correct target. The real issues were:

Wrong arch flag: JIT compiled for sm_100 instead of sm_100a (a suffix required)
No NVFP4 MMA kind enum: DeepGEMM only had BF16 + MXFP8FP4 — NVFP4 was just MXFP4 in disguise
Wrong SF layout: Block16 (scale_vec::4X) has different TMEM layout than block32 (2X)

Fix: Target sm_100a, emit tcgen05.mma.kind::mxf4nvf4.block_scale.scale_vec::4X, keep E2M1 weights + UE4M3 scales + block16. No scale conversion needed. Full precision.

Parameter	NVFP4 Checkpoint	Kernel (sm_100a, CORRECT)
Weight format	E2M1 uint8	E2M1 uint8 (unchanged)
Block scale format	UE4M3 (float8_e4m3fn)	UE4M3 (native, no conversion)
Block size	16	16 (native)
Global scale	float32	Folded in before packing
PTX instruction	`mxf4nvf4.block_scale.scale_vec::4X`	Same
Instruction descriptor	`float_ue4m3_t`	Same

Kernel Architecture (TARGET — sm_100a with mxf4nvf4)

sm100_fp8_nvfp4_mega_moe_impl  (adapted from sm100_fp8_fp4_mega_moe_impl)
├── kGranK = 16 (NVFP4 native block size)
├── kind::mxf4nvf4.block_scale.scale_vec::4X PTX instruction
├── float_ue4m3_t instruction descriptor
├── Block16 SF layout: scale_vec::4X, 4 TMEM sub-columns per UMMA atom
├── UTCCP copy: i*8 stride (4X layout)
├── kNumSFATmemCols = SF_BLOCK_M / 32 * 4
├── kNumSFBTmemCols = SF_BLOCK_N / 32 * 4
├── kNumSFUint32 = kHidden / 64 (4 UE4M3 per int32)
├── UE4M3 L1 epilogue (float → cutlass::float_e4m3_t cast)
└── recipe = (1, 1, 16)

Python API:
├── fp8_nvfp4_mega_moe() — recipe=(1,1,16)
├── transform_nvfp4_weights_for_mega_moe()
│   ├── fold_global_scale(): UE4M3 * FP32 → UE4M3
│   ├── NO block16→block32 merge (native block16)
│   ├── NO UE4M3→UE8M0 conversion (native UE4M3)
│   └── pack_ue4m3_to_int32() + transform_sf_into_required_layout(gran_k=16)
└── get_symm_buffer_for_nvfp4_mega_moe() — 2x SF buffer

C++ Bindings:
├── csrc/apis/mega_nvfp4.hpp — kGranK=16, SF stride K/16
├── csrc/jit_kernels/impls/sm100_fp8_nvfp4_mega_moe.hpp
└── csrc/apis/layout.hpp — gran_k=16 support

Current State (Build 22 — WRONG PATH)

Build 22 used mxf8f6f4 with UE8M0/block32, which produces garbled output. The kernel architecture below is what was deployed. It needs to be reverted to the mxf4nvf4 path with sm_100a targeting.

Build 22 (GARBLED — do not use):
├── kGranK = 32, mxf8f6f4, float_ue8m0_t
├── Same TMEM layout as MXFP4 (2X, block32)
├── UE4M3→UE8M0 scale conversion (lossy)
├── block16→block32 scale merge (lossy)
└── recipe = (1, 1, 32)

The earlier Build 17 code had the RIGHT instruction (mxf4nvf4.scale_vec::4X) but wrong arch flag (sm_100 instead of sm_100a). We need to go back to that code and fix the arch flag.

Container Build Pipeline

Dockerfile → FROM atl.vultrcr.com/vllm/vllm-with-lmcache:dream-build
  ├── DeepGEMM (nvfp4-mega-moe branch) — JIT-compiled at runtime
  ├── vLLM patch (deepseek_v4.py) — COPY over model file
  └── NVRTC symlink for CUDA compilation

build_push.sh → build → login to CR → push → update docker-compose
  Container registry: atl.vultrcr.com/vllm/vllm-dsv4-nvfp4:latest
  Always run builds in screen: screen -S nvfp4-build

Debugging Log (Builds 1–22)

Build	Error	Fix
1–6	Various Dockerfile/build issues	NVRTC symlink, CPATH, PYTHONPATH
7	`kPackedFP4` type mismatch	uint8→int8 view on weights
9	SF stride assertion	Need MN-major layout + TMA alignment
10	`transform_sf_into_required_layout` no gran_k=16	C++ fix
11	SF dtype float8_e4m3fn rejected	Pack UE4M3→int32 first
12–14	SF stride layout	Transpose to MN-major before transform
15	SymmBuffer too small (NVFP4 has 2× SF)	NVFP4-specific SymmBuffer
16	`ImportError: deep_gemm.mega.nvfp4`	Python wrapper in mega/init.py
17	NVCC: `scale_vec::4X` not supported on sm_100f	Wrong arch: need `sm_100a` not `sm_100`
18	NVCC: `scale_vec::2X` ALSO not supported	Same — `sm_100a` required
19	kGranK=16 still in C++ binding	→ 32
20	UE4M3→UE8M0 `uint32 >> 23` fails	Cast to int32 first
22	Server UP, but garbled output	Fell back to mxf8f6f4 — should use mxf4nvf4 on sm_100a

Path Forward

The mxf4nvf4 instruction IS supported on B200 (SM100a). Build 17-18 failed because we targeted sm_100 instead of sm_100a. Build 22 garbled because we fell back to mxf8f6f4 unnecessarily. The correct fix:

Fix the NVFP4 MegaMoE Kernel (Recommended)

Target sm_100a in DeepGEMM's JIT compiler
Add NVFP4 MMA kind enum (not just MXFP8FP4 with an NVFP4 hat)
Emit tcgen05.mma.kind::mxf4nvf4.block_scale.scale_vec::4X
Use float_ue4m3_t instruction descriptor
Block16 SF layout (scale_vec::4X) — different TMEM layout from block32 (2X)
Keep E2M1 weights + UE4M3 scales + block16 — no scale conversion needed

This gives full NVFP4 precision with zero scale format conversion.

Running

# On B200 node
cd /root/nvidia-meeting
docker compose up -d

# Check logs
docker logs -f nvidia-meeting-vllm-1

# Test
curl http://localhost:8000/v1/models
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "/model", "messages": [{"role": "user", "content": "Hello"}], "max_tokens": 50}'

Files

File	Purpose
`patches/deepseek_v4.py`	Main patch: NVFP4 post-load conversion, weight reconstruction
`patches/modelopt.py`	ModelOpt FP4 config patches for weight loading
`.env`	B200 node credentials
`Dockerfile`	Container build (extends dream-build with DeepGEMM + patch)
`build_push.sh`	Build, push to CR, update docker-compose

NVFP4 Format Specification

Weights: E2M1 packed uint8 (2 values per byte)
Block scales: float8_e4m3fn (UE4M3), group_size=16
Global scales: float32 (weight_scale_2), per-tensor
Dequant formula: value = packed_E2M1 * block_scale * global_scale
Block scale range: [0, 448] (UE4M3 max = 448, E2M1 max = 6, so 6×448 = 2688)
UE8M0: Power-of-2 only. Encoded as uint8 = float32_exponent_bits[31:23]
UE4M3: 3-bit mantissa + 4-bit exponent + sign. Max = 448.

HARD RULES

NEVER convert DeepSeek MoE experts to MXFP4. Experts stay in NVFP4. Period.

README.md Unescape Escape

DeepSeek V4 Pro → NVFP4 Quantization + vLLM Serving

Quick Status

B200 Node

Repositories

Architecture

The NVFP4 → vLLM Gap

1. wo_a: NVFP4 → FP8 + DeepGEMM Block Scale

2. Attention Layers: NVFP4 → BF16

3. Compressor: Reconstructing fused_wkv_wgate from NVFP4

4. MoE Experts: NVFP4 FusedMoE

5. BF16 wo_a Layers: BF16 → FP8

Bugs Found and Fixed

Bug #7 Detail: UE8M0 → float32 Misinterpretation

Bug #8 Detail: wo_a BF16 Loading

NVFP4 Mega MoE Kernel

What We Built

SM100a (B200) Hardware Support (CORRECTED)

Kernel Architecture (TARGET — sm_100a with mxf4nvf4)

Current State (Build 22 — WRONG PATH)

Container Build Pipeline

Debugging Log (Builds 1–22)

Path Forward

Fix the NVFP4 MegaMoE Kernel (Recommended)

Running

Files

NVFP4 Format Specification

HARD RULES

README.md