67 Commits

Author SHA1 Message Date
f08bcd456b tweax n shit 2026-05-12 23:16:33 +00:00
2bdda36bb7 fucken aye 2026-05-12 22:33:55 +00:00
fa825c16b9 fucken ay 2026-05-12 21:58:08 +00:00
dbf1d11f9f ayee 2026-05-12 21:52:48 +00:00
ef3edb3481 ba fongol again4 2026-05-12 21:49:43 +00:00
b74dc7121a ba fongol again3 2026-05-12 21:48:00 +00:00
7bbbdbcc79 ba fongol again 2026-05-12 21:47:37 +00:00
2e674f87c1 ba fongol again 2026-05-12 21:47:14 +00:00
5d127d8294 ba fongol again 2026-05-12 21:47:01 +00:00
52cf3f2e25 ba fongol again 2026-05-12 21:46:42 +00:00
02decb486e ba fongol 2026-05-12 21:34:08 +00:00
48f1f9dc5e clanker nonsense again 2026-05-12 21:30:36 +00:00
5cabc1f7d9 clanker nonsense again 2026-05-12 21:29:59 +00:00
25a2d4e6ad clanker nonsense 2026-05-12 21:28:50 +00:00
d88ea9842b fix: add missing staging_kernel.py to Dockerfile — BF16→E2M1+UE4M3 quantization was never in container 2026-05-12 21:21:24 +00:00
91d7d9bad7 fucken a 2026-05-12 21:18:48 +00:00
d68e113af1 remove spammy shit 2026-05-12 20:57:04 +00:00
f0652693a6 dangit again 2026-05-12 19:13:01 +00:00
054792c84e dangit 2026-05-12 18:42:39 +00:00
de055b1e77 syupid clankers 2026-05-12 18:26:37 +00:00
307574bc91 test: signal alarm timeout for kernel hang 2026-05-12 15:14:39 +00:00
fcd6de0a60 test: simplify SF fill to avoid shape mismatch 2026-05-12 15:13:16 +00:00
d4c557fddc test: fix float8 randn + SF int32 packing 2026-05-12 15:12:35 +00:00
28afc2406b test: add random FP4 data and kernel timeout 2026-05-12 15:11:41 +00:00
787d427847 test: fix NVFP4 mega_moe test dimensions for SMEM alignment 2026-05-12 15:07:35 +00:00
8737fd57c0 remove crap 2026-05-12 14:53:42 +00:00
52c3aefe73 bump cache busters to 33 for debug build 2026-05-12 13:10:37 +00:00
ca1d306890 fix: use torch.int8 for packed FP4 tensors (kPackedFP4=kInt8, not uint8) 2026-05-12 12:23:43 +00:00
b8f95ffad3 docker: add OMP_NUM_THREADS=64, remove --tool initcheck, mount cubin cache 2026-05-12 11:15:06 +00:00
5840291ea3 fix staging kernel packed_k_mask double-count 2026-05-12 08:08:24 +00:00
5ea5b579c3 Trim banner, no code changes 2026-05-12 07:24:36 +00:00
74af9984f6 Bug fixes: UE4M3 scale conversion, staging kernel SF/E2M1 packing, wo_a UE4M3, README overhaul
- Fix _ue8m0_to_float32: checkpoint is float8_e4m3fn (UE4M3), not UE8M0
  - Changed from shift-by-23 to .to(torch.float32) in both copies
  - Fix fold_global_scale in DeepGEMM mega/__init__.py
- Fix staging kernel SF pack: int32 shift >= 32 is UB on GPU
  - Split 8-group pack into two int32 writes (groups 0-3, 4-7)
- Fix staging kernel E2M1 output: was writing unpacked (1 byte/elem)
  into packed buffer (hidden/2 bytes), causing 2x overflow
  - Now packs even/odd nibble pairs correctly
- Fix wo_a on-the-fly BF16→NVFP4: was encoding UE8M0, now UE4M3
  - Use .clamp(0, 448).to(float8_e4m3fn) instead of log2/exp trick
- Remove dead code: _ue8m0_uint8_to_float, tmp/, .bak, .s11,
  quant_module_patched.py, patch_finegrained_fp8_blackwell.py,
  patch_vllm_weights.py
- Remove SCALE-FMT diagnostic histogram clutter
- Update stale UE8M0 comments throughout
- Rewrite README: clean instructions, confirmed format details
2026-05-12 05:52:30 +00:00
a36bf47f11 fix: use tl.split instead of indexing for E2M1 pair packing
Triton doesn't support constexpr tensor indexing (e2m1_pairs[:, 0]).
Use tl.split() which splits the last axis into two tensors.
2026-05-11 22:39:38 +00:00
27dbf2850f fix: replace nested tl.where with sum-of-comparisons for E2M1 quantization
Triton can't compile deeply nested tl.where. Use arithmetic instead:
idx = sum(abs_s >= threshold_i) for 7 threshold values.
2026-05-11 22:23:05 +00:00
3d1f3de190 fix: syntax error — move triton imports before docstring, remove orphan @triton.jit 2026-05-11 22:08:50 +00:00
79d866995f bump cache buster 32 for packed FP4 mxf4nvf4 fix 2026-05-11 21:59:56 +00:00
c85b84b0fe fix: staging kernel outputs unpacked E2M1 (1 byte/element, not packed 2/byte)
Matches the SMEM layout: float_e2m1_unpacksmem_t is 1 byte/element.
L1→L2 handoff uses unpacked format (same byte count as FP8).
No bandwidth savings at L1→L2 for v1 — can optimize later.
2026-05-11 21:29:33 +00:00
01cfd02759 fix: same reshape fix in main patch file 2026-05-11 21:05:54 +00:00
076d325c97 fix: use reshape instead of risky [0::2] slicing for E2M1 packing 2026-05-11 21:04:53 +00:00
8dc917c498 fix: topk_weights_out store missing topk_offsets multiplier 2026-05-11 21:02:19 +00:00
17ba5a9d7b bump cache buster 30 for FP4 staging + DeepGEMM FP4 activations 2026-05-11 20:30:14 +00:00
7a4403fa98 feat: FP4 staging kernel - BF16 → E2M1 packed + UE4M3 block16 scales
mxf4nvf4 requires FP4×FP4, not FP8×FP4.
- New staging kernel: E2M1 nearest-neighbor quantization
- Output: uint8 packed (2 E2M1 per byte) + UE4M3 packed int32 scales
- Added CUDA sync diagnostics for error localization
2026-05-11 20:29:36 +00:00
0fd2d4f078 diag: add weight_scale uint8 histogram to verify E8M0 vs E4M3 format 2026-05-11 19:55:41 +00:00
50a945bde4 bump cache buster 29 2026-05-11 19:51:48 +00:00
48b905406a diag: add CUDA sync after mega_moe finalize + forward to catch errors 2026-05-11 19:51:44 +00:00
35f6b66678 fix: UE8M0 reinterpret in DeepGEMM fold_global_scale + bump cache 2026-05-11 19:40:08 +00:00
f32d6b5b48 bump cache buster to 27 2026-05-11 19:26:21 +00:00
cd24182e36 diag: add NaN/Inf + FP8-dtype checks after NVFP4 dequant 2026-05-11 19:26:12 +00:00
8ae2214bad fix: reorder Dockerfile ARG before COPY for proper cache busting 2026-05-11 18:48:07 +00:00
c4891e9ee2 fix: manual FP32→UE4M3 quant in Triton staging kernel
Triton can't cast float8e4nv → uint8 directly. Compute E4M3 bits manually:
extract FP32 exponent/mantissa, convert to E4M3 format (4-bit exp + 3-bit mant),
handle rounding and overflow, reconstruct dequantized value for FP8 activation quantization.
2026-05-11 16:38:52 +00:00
436109081c bump cache buster to 24 2026-05-11 16:12:56 +00:00
5faf9916eb fix: UE4M3 activation scales + group_size=16 for NVFP4 mega_moe
The mxf4nvf4 MMA instruction shares scale_format_ between SFA and SFB.
For NVFP4 (UE4M3), both activation and weight scales must be UE4M3.

Changes to _stage_deepseek_v4_mega_moe_inputs_kernel:
- GROUP_K=16 (was 32) — NVFP4 scale_vec::4X has group_size=16
- Scale quantization: float → float8_e4m3fn (UE4M3) instead of UE8M0
  exponent extraction (>> 23). Pack 4 UE4M3 bytes per int32.
- FP8 activation quantized against UE4M3 rounded scale

Also updated class docstring (was stale MXFP4 conversion description).
2026-05-11 16:12:36 +00:00
220649c188 docs: CORRECTED — mxf4nvf4 IS supported on sm_100a (B200)
Build 17-18 'scale_vec not supported' error was because we targeted
sm_100 instead of sm_100a. The 'a' suffix enables FP4 block-scaled
instructions. No need to fall back to mxf8f6f4 with UE8M0.

Path forward: target sm_100a, use mxf4nvf4.scale_vec::4X, keep
native UE4M3 scales + block16. No scale conversion needed.
2026-05-11 14:24:13 +00:00
cfead0012d docs: comprehensive README update through build 22
- Full architecture diagram and NVFP4→vLLM bridge details
- All 8 bugs documented with fixes
- SM100 hardware limitation (mxf4nvf4 unsupported)
- MegaMoE kernel architecture and debugging log (builds 1-22)
- Three paths forward (A: FlashInfer, B: BF16 mega_moe, C: SM103+)
- Container build pipeline, NVFP4 format spec, hard rules
2026-05-11 13:53:41 +00:00
8cb23bdb78 fix: import NVFP4 SymmBuffer from deep_gemm.mega 2026-05-11 08:05:50 +00:00
ff579c9767 fix: use NVFP4 SymmBuffer (2x SF size for group_size=16)
The NVFP4 mega_moe kernel needs a larger symmetric buffer because
group_size=16 produces 2x more scale factor entries than MXFP4's 32.
Switch from deep_gemm.get_symm_buffer_for_mega_moe to
deep_gemm.mega.nvfp4.get_symm_buffer_for_nvfp4_mega_moe.
2026-05-11 07:49:11 +00:00
1da40c53da fix: add patch cache buster to Dockerfile 2026-05-11 07:19:10 +00:00
b532742530 debug: add shape/dtype logging to finalize_weights 2026-05-11 07:13:44 +00:00
b1cf4232ee feat: wire DeepGEMM NVFP4 mega_moe kernel into vLLM patch
- DeepseekV4MegaMoEExperts now uses native NVFP4 path
- finalize_weights: transform_nvfp4_weights_for_mega_moe() instead of
  NVFP4→BF16→MXFP4 conversion
- forward: fp8_nvfp4_mega_moe() with recipe=(1,1,16)
- Experts stay in NVFP4. No MXFP4 conversion. Period.
2026-05-11 06:22:11 +00:00
a2e9b5f17f fix: add --enable-expert-parallel to compose command 2026-05-11 06:15:11 +00:00
c8564caf9d fix: patch vLLM deepseek_v4.py directly in image 2026-05-11 06:09:40 +00:00
7c8c6cd67f fix: add PYTHONPATH for deep_gemm import 2026-05-11 06:06:52 +00:00
cffb373759 fix: symlink NVRTC lib into cuda/lib64 for linker 2026-05-11 06:04:24 +00:00
983ba02c5b fix: add CUDA/NVRTC lib paths to Dockerfile 2026-05-11 06:02:13 +00:00
f0471ed1c2 fix: correct CR URL to atl.vultrcr.com 2026-05-11 05:59:06 +00:00
c234190a80 feat: add Dockerfile + build/push script for NVFP4 container
- Extends dream-build with DeepGEMM nvfp4-mega-moe kernel
- build_push.sh: builds, logs into Vultr CR, pushes, updates docker-compose
- CACHE_BUSTER parameter for forcing fresh clones
2026-05-11 05:57:49 +00:00
e963325b61 WIP: MegaMoE NVFP4 kernel + diagnostics
- Force use_mega_moe=True for NVFP4 pipeline
- DeepseekV4MegaMoEExperts: load NVFP4 params (float8 block scales,
  float32 global/input scales), convert NVFP4→BF16→MXFP4 in
  finalize_weights for the DeepGEMM mega_moe kernel
- Add _nvfp4_to_bf16 and _bf16_to_mxfp4 conversion methods
- Remove expert_dtype check blocking mega_moe
- Add diagnostics for wo_a and bf16 layer conversion
- Still WIP: attention layer bugs under investigation
2026-05-11 05:19:49 +00:00
40 changed files with 770 additions and 7015 deletions

7
.env
View File

@@ -1,7 +0,0 @@
B200 Node
IP: `45.76.247.107`
user: `root`
password: `6)Jr)B@dcX[mN?dx`
folder with the weights: `/root/nvidia-meeting`
This repo: `https://sweetapi.com/biondizzle/deepseek-v4-quant.git` (always use this http to pull since its a public repo)

11
.gitignore vendored
View File

@@ -1,10 +1 @@
# Dequantized BF16 weights (3TB)
DeepSeek-V4-Pro-BF16/
# Calibration state (huge, not for git)
*.pt
# Python
__pycache__/
*.pyc
.venv/
.env

30
Dockerfile Normal file
View File

@@ -0,0 +1,30 @@
# DeepSeek V4 NVFP4 vLLM + DeepGEMM Mega MoE
# Extends the vLLM dream-build container with our custom DeepGEMM kernel
FROM atl.vultrcr.com/vllm/vllm-with-lmcache:dream-build
# Install build essentials
RUN apt-get update && apt-get install -y git screen cmake && rm -rf /var/lib/apt/lists/*
# Clone and build DeepGEMM with NVFP4 mega_moe kernel
# CACHE_BUSTER: increment to force fresh clone
RUN git clone -b nvfp4-mega-moe https://sweetapi.com/biondizzle/DeepGEMM.git /root/DeepGEMM && PATCH_CACHE_BUSTER=70
# Build DeepGEMM with proper CUDA/NVRTC paths
ENV CPATH="/usr/local/lib/python3.12/dist-packages/flashinfer/data/cutlass/include:/usr/local/lib/python3.12/dist-packages/nvidia/cu13/include:/usr/local/cuda-13.0/include:${CPATH}"
ENV PYTHONPATH="/root/DeepGEMM:${PYTHONPATH}"
# NVRTC lives in the pip nvidia/cu13 package, but the linker expects it in cuda/lib64
# Create a symlink so -lnvrtc resolves
RUN ln -sf /usr/local/lib/python3.12/dist-packages/nvidia/cu13/lib/libnvrtc.so.13 /usr/local/cuda/lib64/libnvrtc.so && PATCH_CACHE_BUSTER=70
RUN cd /root/DeepGEMM && python3 setup.py build_ext --inplace && PATCH_CACHE_BUSTER=69
# Bust cache for patch changes — ARG before COPY ensures layer invalidation
ARG PATCH_CACHE_BUSTER=70
# Copy our DeepSeek V4 patch over vLLM's model file
COPY patches/deepseek_v4.py /usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/deepseek_v4.py
# Copy the NVFP4 staging kernel (BF16→E2M1+UE4M3 quantization for activations)
COPY patches/staging_kernel.py /usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/staging_kernel.py
# Verify everything imports
RUN python3 -c "import deep_gemm; print('DeepGEMM NVFP4 OK')" && \
python3 -c "import vllm; print('vLLM OK')"

View File

@@ -1,29 +0,0 @@
# MEMORY.md — Long-Term Memory
## Mike
- Working on DeepSeek V4 Pro NVFP4 quantization + vLLM serving on B200 node
- B200 node: 45.76.247.107, root, password in project .env
- Repo: https://sweetapi.com/biondizzle/deepseek-v4-quant.git (modelopt-nvfp4 branch)
## DeepSeek V4 NVFP4 Project
- Successfully quantized: 881GB NVFP4 (Run 11), 8× B200, $161/run
- modelopt 0.45.0.dev64 + transformers 5.8.0.dev0
- **vLLM server running on B200 port 8000** as of May 11, 2026 🎉
- We built the entire NVFP4→vLLM bridge from scratch (NVIDIA hasn't done this)
- Abandoned mega_moe (no kernel, format mismatch), using standard FusedMoE instead
### Key Technical Decisions
- **wo_a**: NVFP4→BF16→FP8 with DeepGEMM block-scale format for BMM einsum
- **Attention layers**: NVFP4→BF16 dequantization, UnquantizedLinearMethod
- **Compressor**: Reconstructed fused_wkv_wgate from separate kv_proj+gate_proj in checkpoint
- **MoE experts**: Stay NVFP4, use FLASHINFER_TRTLLM FusedMoE backend
### Critical Bugs Fixed (May 11)
1. DeepGEMM `sf.dim()` crash: weight_scale_inv must be DeepGEMM-formatted block scale tensor
2. Compressor indexer shape mismatch: checkpoint keys have `.indexer.` sub-path
3. All-ones block scale → garbage output: must use `torch.full(..., fp8_scale)` not `torch.ones`
4. Block scale dtype: must be float32, not float8_e4m3fn
### Outstanding
- Output quality under investigation — FP4 is aggressive quantization
- All code in patches/deepseek_v4.py on modelopt-nvfp4 branch

346
README.md
View File

@@ -8,24 +8,29 @@ Full NVFP4 quantization of DeepSeek V4 Pro and vLLM serving on 8× NVIDIA B200 G
|-----------|--------|
| NVFP4 Quantization | ✅ 881GB (Run 11), modelopt 0.45.0.dev64 |
| Weight Loading | ✅ 95 safetensors shards, all 8 TP ranks |
| Dequant Verification | ✅ Bit-exact match against official dequant (0.0 relative error) |
| NVFP4→FP8 Conversion (wo_a) | ✅ DeepGEMM block-scale format |
| NVFP4→BF16 Dequantization | ✅ 305 attn/shared, 91 compressor layers |
| Compressor Reconstruction | ✅ Separate kv_proj/gate_proj → fused_wkv_wgate |
| MoE Expert Serving | ✅ FusedMoE NVFP4 (FLASHINFER_TRTLLM backend) |
| Profile/Warmup Run | ✅ Passes |
| API Server | ✅ Running on port 8000 |
| Output Quality | 🔧 Garbled — likely remaining dequant/scale bug |
| MoE Expert Serving (MegaMoE) | 🔧 Kernel builds & runs on sm_100a, debugging illegal CUDA access |
| Output Quality | 🔧 Under investigation |
## B200 Node
- **IP**: `45.76.247.107`
- **User**: `root`
- **Password**: see `.env`
- **GPUs**: 8× NVIDIA B200 (SM100)
- **RAM**: ~2.7 TB
- **GPUs**: 8× NVIDIA B200 (SM100a)
- **Model weights**: `/root/nvidia-meeting/DeepSeek-V4-Pro-NVFP4/`
- **BF16 reference**: `/root/nvidia-meeting/DeepSeek-V4-Pro-BF16/`
## Repositories
| Repo | Branch | Purpose |
|------|--------|---------|
| `deepseek-v4-quant` | `modelopt-nvfp4` | Main repo: patches, quantize, serve scripts |
| `DeepGEMM` | `nvfp4-mega-moe` | NVFP4 mega_moe kernel fork |
## Architecture
```
@@ -35,124 +40,70 @@ DeepSeek V4 Pro (1.2T params, 61 layers)
│ ├── wo_a → FP8 (DeepGEMM block-scale, BMM einsum)
│ ├── wo_b → BF16 (UnquantizedLinearMethod)
│ └── compressor.fused_wkv_wgate → BF16 (reconstructed from NVFP4)
├── MoE Experts (384 experts, 61 layers)
── w13_weight → NVFP4 (FusedMoE, FLASHINFER_TRTLLM backend)
│ └── w2_weight → NVFP4 (FusedMoE, FLASHINFER_TRTLLM backend)
├── MoE Experts (256 experts per layer, 61 layers)
── MegaMoE path → NVFP4 (DeepGEMM mxf4nvf4, native block16)
└── Shared Expert → FP8 (Fp8LinearMethod, DeepGEMM)
```
## The NVFP4 → vLLM Gap
## NVFP4 Format (Confirmed)
ModelOpt quantizes to NVFP4 (4-bit FP4 with block scales). vLLM's DeepSeek V4
attention code expects FP8 with DeepGEMM block-scale einsum. These formats were
**never integrated** — we're ahead of NVIDIA on this. Key gaps we had to bridge:
| Field | Format | Notes |
|-------|--------|-------|
| Weights | E2M1 packed uint8 | 2 values per byte |
| Block scales | `torch.float8_e4m3fn` (UE4M3) | Standard NVFP4 spec, group_size=16 |
| Global scales | `torch.float32` (weight_scale_2) | **Scalar per expert** (`torch.Size([])`) |
| Dequant | `value = packed_E2M1 * block_scale * global_scale` | Block scale range [0, 448] |
### 1. wo_a: NVFP4 → FP8 + DeepGEMM Block Scale
**Key finding**: The checkpoint stores block scales as `torch.float8_e4m3fn` (UE4M3), NOT UE8M0.
`.to(torch.float32)` is the correct conversion. The shift-by-23 trick was wrong — it was
applying an E8M0→float conversion to E4M3 bytes, producing garbage.
**Problem**: `wo_a` uses `deepseek_v4_fp8_einsum` (BMM with DeepGEMM), which expects:
- Weight: `float8_e4m3fn` in 3D shape `(g, r, d)` for batched matmul
- Scale: DeepGEMM-formatted block scale tensor (not a per-tensor scalar)
## Dequant Verification
Our NVFP4 weights are uint8 packed FP4 with separate block/global scales.
We verified the dequant path is bit-exact against the official reference:
**Solution** (`_convert_nvfp4_to_fp8`):
1. Unpack NVFP4 uint8 → BF16 using E2M1 lookup table
2. Dequantize: `weight_bf16 * block_scale * global_scale` (NO input_scale — it's for activations)
3. Re-quantize BF16 → FP8 e4m3 with per-tensor scale (`w_amax / fp8_max`)
4. Create block scale tensor filled with `fp8_scale` (same scale for every 128×128 block)
5. Call `deepgemm_post_process_fp8_weight_block(wq, ws, quant_block_shape=(128,128), use_e8m0=True, is_bmm=True, bmm_batch_size=N)`
6. Store: `weight_scale_inv = dg_ws` (DeepGEMM-formatted scale), `weight = w_fp8` (3D BMM shape)
```python
W_bf16 = dequantize_fp4_weight(W_int, S)
y_ours = W_bf16 @ x.bfloat16()
y_ref = official_expert_forward(W_int, S, x)
print((y_ours - y_ref).abs().max() / y_ref.abs().mean())
```
**Why `weight_scale_inv`?** The attention forward reads `self.wo_a.weight_scale_inv` as
`b_scale` for `deepseek_v4_fp8_einsum` → DeepGEMM `fp8_einsum`. This must be the
DeepGEMM block-scale tensor, not a per-tensor scalar.
**Why `fp8_scale` in the block scale (not all-ones)?** DeepGEMM divides by the block
scale at runtime. If the block scale is all-ones, it divides by 1.0, producing garbage.
Each block needs the actual per-tensor scale value.
### 2. Attention Layers: NVFP4 → BF16
**Problem**: `fused_wqa_wkv`, `wo_b` use standard `torch.nn.functional.linear`.
NVFP4 weights (uint8) can't be used directly.
**Solution** (`_convert_nvfp4_to_bf16`):
1. Unpack NVFP4 → BF16
2. Dequantize with block/global scales (input_scale is for activations, not weights)
3. Replace `mod.weight` with BF16 parameter
4. Set `quant_method = UnquantizedLinearMethod()`
5. Remove NVFP4 scale attributes (`weight_scale`, `weight_scale_2`, `input_scale`)
### 3. Compressor: Reconstructing fused_wkv_wgate from NVFP4
**Problem**: The compressor's `fused_wkv_wgate` is a `MergedColumnParallelLinear`
with `disable_tp=True`. NVFP4 uint8 data can't be loaded into the BF16 parameter
(shape mismatch: uint8 is half the input dim). The default weight loader silently
skips these weights, leaving the parameter uninitialized.
**Solution** (`_reconstruct_compressor_weight`):
1. Read original `kv_proj.weight` and `gate_proj.weight` directly from safetensors
2. Unpack NVFP4 → BF16, dequantize with scales
3. Concatenate: `fused = cat([wkv, wgate], dim=0)`
4. Replace the uninitialized parameter
**Critical detail**: The **indexer** compressor is at a different checkpoint path:
- Main: `model.layers.N.self_attn.compressor.{kv_proj,gate_proj}.weight`
- Indexer: `model.layers.N.self_attn.compressor.indexer.{kv_proj,gate_proj}.weight`
Using the wrong prefix loads the main compressor weight into the indexer's
`fused_wkv_wgate`, causing a 4× shape mismatch and `split_with_sizes` crash.
### 4. MoE Experts: NVFP4 FusedMoE
**Problem**: vLLM's DeepSeek V4 uses `DeepseekV4MegaMoEExperts` with DeepGEMM
grouped GEMM. NVFP4 experts need a different kernel path.
**Solution**: The existing `ModelOptNvFp4LinearMethod` + `FusedMoE` infrastructure
handles NVFP4 experts natively. We just need to:
- Keep expert weights as NVFP4 uint8 + block/global scales
- Use `FLASHINFER_TRTLLM` MoE backend (auto-selected)
- Skip any conversion in `process_weights_after_loading`
### 5. BF16 wo_a Layers: BF16 → FP8
**Problem**: Some `wo_a` layers were NOT quantized by modelopt (BF16 in checkpoint).
The attention forward still reads them as FP8 for the einsum path.
**Solution** (`_convert_bf16_to_fp8`): Same as #1 but skip the NVFP4 unpack step.
Directly quantize BF16 → FP8 with block scale.
## Bugs Found and Fixed
### DeepGEMM `sf.dim()` Assertion (layout.hpp:94)
- **Root cause**: `weight_scale_inv` was a 1D per-tensor scale `(g,)`. DeepGEMM expects
2D/3D block-scale tensor formatted by `transform_sf_into_required_layout`.
- **Fix**: Use `deepgemm_post_process_fp8_weight_block` to produce correctly formatted
block scales, store result in `weight_scale_inv`.
### Block Scale dtype (`float8_e4m3fn` vs `float32`)
- **Root cause**: `deepgemm_post_process_fp8_weight_block` expects `float32` or
`float8_e8m0fnu` block scales. We initially used `float8_e4m3fn`.
- **Fix**: Create block scale as `dtype=torch.float32`.
### Missing `deepgemm_post_process` args
- **Root cause**: Function signature changed to require `quant_block_shape` and `use_e8m0`.
- **Fix**: Pass `quant_block_shape=(128, 128)` and `use_e8m0=True`.
### Compressor Indexer Shape Mismatch
- **Root cause**: `_reconstruct_compressor_weight` used the same checkpoint prefix
for both main and indexer compressors. The indexer's keys have `.indexer.` in the path.
- **Fix**: Add `sub_path` parameter; pass `".indexer"` for indexer compressors.
### All-Ones Block Scale → Garbage Output
- **Root cause**: Block scale was `torch.ones(...)` (scale=1.0). DeepGEMM divides by
the block scale at runtime, so the output was divided by 1.0 instead of the actual
per-tensor scale, producing incoherent text.
- **Fix**: Use `torch.full(..., fp8_scale.item())` to fill the block scale with the
correct per-tensor FP8 quantization scale.
Result:
```
Max abs diff: 0.00000000
Mean abs diff: 0.00000000
Relative error: 0.000000
Matmul max diff: 0.00000000
```
## Running
### 1. Quantize
```bash
# On B200 node, in screen
screen -S quantize
cd /root/nvidia-meeting
bash run_quantize_nvfp4.sh
# ~7 hours, $161 per run
```
### 2. Build Container
```bash
# From this repo
bash build_push.sh
# Always build in screen: screen -S build
```
The Dockerfile:
1. Extends `atl.vultrcr.com/vllm/vllm-with-lmcache:dream-build`
2. Clones DeepGEMM (`nvfp4-mega-moe` branch) and builds
3. Copies `patches/deepseek_v4.py` over vLLM's model file
### 3. Serve
```bash
# On B200 node
cd /root/nvidia-meeting
@@ -168,91 +119,98 @@ curl http://localhost:8000/v1/chat/completions \
-d '{"model": "/model", "messages": [{"role": "user", "content": "Hello"}], "max_tokens": 50}'
```
### vLLM Flags
```
--trust-remote-code
--kv-cache-dtype fp8
--block-size 256
--enable-expert-parallel
--tensor-parallel-size 8
--compilation-config {"cudagraph_mode":"FULL_AND_PIECEWISE","custom_ops":["all"]}
--attention_config.use_fp4_indexer_cache=True
--tokenizer-mode deepseek_v4
--tool-call-parser deepseek_v4
--enable-auto-tool-choice
--reasoning-parser deepseek_v4
--speculative_config {"method":"mtp","num_speculative_tokens":2}
```
## NVFP4 Mega MoE Kernel
### What We Built
A native NVFP4 mega_moe kernel in our DeepGEMM fork. Weights stay in E2M1 packed format
and use `kind::mxf4nvf4.block_scale.scale_vec::4X` MMA directly on SM100a (B200).
**This is novel — NVIDIA has not done NVFP4→vLLM integration.**
### Kernel Architecture
| Parameter | Value |
|-----------|-------|
| PTX instruction | `tcgen05.mma.kind::mxf4nvf4.block_scale.scale_vec::4X` |
| kGranK | 16 (NVFP4 native block_size) |
| Weight format | E2M1 packed uint8 (unchanged from checkpoint) |
| Block scales | UE4M3 (float8_e4m3fn), native — no conversion needed |
| Global scales | Folded into block scales before packing |
| Instruction desc | `float_ue4m3_t` |
| SF layout | block16, scale_vec::4X |
| UTCCP stride | i*8 (4X layout) |
| kNumSFUint32 | kHidden / 64 (4 UE4M3 per int32) |
| recipe | (1, 1, 16) |
| Target arch | `sm_100a` (the `a` suffix is **required**) |
### Python API
- `fp8_nvfp4_mega_moe()` — entry point, recipe=(1,1,16)
- `transform_nvfp4_weights_for_mega_moe()` — fold global scales, pack UE4M3→int32, TMA-align
- `get_symm_buffer_for_nvfp4_mega_moe()` — 2× SF buffer vs MXFP4
### C++ Bindings
- `csrc/apis/mega_nvfp4.hpp` — kGranK=16, SF stride K/16, packed E2M1 hidden/2
- `csrc/jit_kernels/impls/sm100_fp8_nvfp4_mega_moe.hpp` — host-side TMA descriptors
- `deep_gemm/include/deep_gemm/impls/sm100_fp8_nvfp4_mega_moe.cuh` — kernel
### Full FP4 Pipeline
The `mxf4nvf4` instruction is FP4×FP4 — both activations (A) and weights (B) must be E2M1 packed.
A Triton staging kernel quantizes BF16 activations → E2M1 packed uint8 + UE4M3 block16 scales
before the GEMM. The L1 epilogue outputs UE4M3 activation scales directly (float→e4m3 cast).
## Bugs Found and Fixed
| # | Bug | Impact | Fix |
|---|-----|--------|-----|
| 1 | DeepGEMM `sf.dim()` crash | Server crash | `deepgemm_post_process_fp8_weight_block` for block-scale format |
| 2 | Block scale dtype `float8_e4m3fn` | Crash | Use `float32` for block-scale tensor |
| 3 | Missing `deepgemm_post_process` args | Crash | Pass `quant_block_shape`, `use_e8m0` |
| 4 | Compressor indexer shape mismatch | Crash | `.indexer.` sub-path in checkpoint keys |
| 5 | All-ones block scale | Garbage output | `torch.full(..., fp8_scale)` not `torch.ones` |
| 6 | `fused_skip_regex` skipping q_b/o_a/o_b scales | Garbage output | Remove non-fused scale entries from skip list |
| 7 | UE8M0 shift-by-23 applied to E4M3 scales | Garbled output | Checkpoint is standard UE4M3 — use `.to(float32)` (shift-by-23 was wrong) |
| 8 | wo_a BF16→NVFP4 on-the-fly used UE8M0 encoding | Scrambled attention | Produce UE4M3 directly: `.clamp(0, 448).to(float8_e4m3fn)` |
| 9 | FP8 activations fed to mxf4nvf4 (FP4×FP4 instruction) | Crash/garbled | Full FP4 pipeline: activations are E2M1 packed + UE4M3 scales |
| 10 | Staging kernel SF pack: shift ≥32 is UB | Half the activation scales zeroed | Split into 2 int32 writes per k_block (groups 0-3, 4-7) |
| 11 | Staging kernel wrote unpacked E2M1 (1 byte/elem) into packed buffer | 2× buffer overflow | Pack even/odd nibble pairs, write BLOCK_K//2 bytes |
| 12 | `compute-sanitizer` build running during debug | Slow (50×), masking timing | Remove sanitizer, rebuild |
## Files
| File | Purpose |
|------|---------|
| `patches/deepseek_v4.py` | Main patch: NVFP4 post-load conversion, weight reconstruction, DeepGEMM block-scale |
| `patches/modelopt.py` | ModelOpt FP4 config patches for weight loading |
| `.env` | B200 node credentials |
| `docker-compose.yml` | Container config (8 GPU, TP=8, EP=8, NVFP4 quant) |
| `patches/deepseek_v4.py` | Main patch: NVFP4 weight loading, dequant, staging kernel, MegaMoE |
| `patches/staging_kernel.py` | Reference copy of Triton staging kernel (live copy is in deepseek_v4.py) |
| `scripts/dequant_fp8_to_bf16.py` | BF16 dequantization utility |
| `scripts/quantize_nvfp4.py` | NVFP4 quantization runner |
| `scripts/serve_vllm.py` | Standalone vLLM server launcher |
| `Dockerfile` | Container build (extends dream-build with DeepGEMM + patch) |
| `docker-compose.yml` | Production serve config |
| `build_push.sh` | Build, push to CR, update docker-compose |
## Conversion Flow
## HARD RULES
```
Checkpoint (NVFP4 safetensors)
├── [weight loader] ──→ vLLM model (NVFP4 uint8 params)
└── [process_weights_after_loading]
├── wo_a (is_bmm=True):
│ NVFP4→BF16→FP8 + DeepGEMM block scale
│ weight_scale_inv = dg_ws, weight = 3D FP8
├── fused_wqa_wkv, wo_b, shared_expert:
│ NVFP4→BF16, UnquantizedLinearMethod
├── compressor.fused_wkv_wgate:
│ Read kv_proj+gate_proj from checkpoint
│ NVFP4→BF16, cat into fused weight
└── MoE experts: stay NVFP4 (FusedMoE backend)
```
## Bugs Found and Fixed (continued)
### `input_scale` Multiplied into Weight Dequantization (CRITICAL)
- **Root cause**: `_convert_nvfp4_to_bf16`, `_convert_nvfp4_to_fp8`, and
`_reconstruct_compressor_weight` all multiplied by `input_scale` during weight
dequantization. `input_scale` is for **activations**, not weights. The correct
formula is: `weight_bf16 = e2m1 * block_scale * global_scale` (NO input_scale).
Including it made weights ~5000× too small, causing garbage output.
- **Fix**: Removed `* input_scale` from all three dequant paths.
### `fused_skip_regex` Skipping Non-Fused Layer Scales (CRITICAL)
- **Root cause**: The skip list included `q_b_proj`, `o_a_proj`, `o_b_proj` weight
scales. These are **NOT fused/stacked** — they're individual Linear layers
(`wq_b`, `wo_a`, `wo_b`) converted in-place. Skipping their scales caused
`process_weights_after_loading` to read `torch.empty()` garbage for
`weight_scale_inv`, producing garbled output.
- **Fix**: Removed `q_b_proj`, `o_a_proj`, `o_b_proj` scale entries from
`fused_skip_regex`. Only truly stacked params remain skipped:
`compressor.{kv_proj,gate_proj}``fused_wkv_wgate`,
`self_attn.{kv_proj,q_a_proj}``fused_wqa_wkv`,
`shared_experts.{gate_proj,up_proj}``gate_up_proj`.
## Version Banner
The patch prints a version banner at import time (visible in `docker logs`):
```
======================================================================
DeepSeek V4 NVFP4 Patch
Commit: 26aaaba
Loaded: 2026-05-11 04:25:00 UTC
Node: ...
Architecture: ...
Bugs fixed: #1-#6
======================================================================
```
This ensures you can always verify what's running inside the container.
## Known Issues
1. **Output quality**: Model produces tokens but they're garbled/incoherent.
All 6 known bugs are fixed. The remaining issue is under investigation —
likely a subtle dequantization bug (sign handling, scale ordering, or
E2M1 unpack edge case). The version banner in the logs helps debug which
patch version is active.
2. **Runtime performance**: Not yet benchmarked. The DeepGEMM einsum + FusedMoE
path should be efficient on B200, but the BF16 layers go through
`UnquantizedLinearMethod` which may be slower than dedicated kernels.
## Quantization Details
- **Model**: DeepSeek V4 Pro (1.2T parameters)
- **Format**: NVIDIA NVFP4 (4-bit floating point with 128-element block scales)
- **Tool**: modelopt 0.45.0.dev64 + transformers 5.8.0.dev0
- **Run**: Run 11 (881GB), 8× B200, ~$161/run
- **Checkpoint**: 95 safetensors shards
- **NEVER convert DeepSeek MoE experts to MXFP4.** Experts stay in NVFP4. Period.
- **The checkpoint is UE4M3 (float8_e4m3fn), NOT UE8M0.** Never use shift-by-23 on these bytes.
- **Target `sm_100a`, not `sm_100`.** The `a` suffix is required for mxf4nvf4 instructions.

View File

@@ -1,30 +1,26 @@
services:
vllm:
image: atl.vultrcr.com/vllm/vllm-with-lmcache:dream-build
#image: atl.vultrcr.com/vllm/vllm-dsv4-nvfp4:latest
build:
context: .
pull_policy: always
entrypoint:
- bash
- -c
- |
cp /patches/deepseek_v4.py /usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/deepseek_v4.py
exec vllm serve "$$@"
- --
ports:
- "8000:8000"
environment:
- HF_TOKEN=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
- OMP_NUM_THREADS=128
#- VLLM_USE_FLASHINFER_MOE_FP4=1 # What the fuck is this!?
command:
- /model
- --trust-remote-code
- --kv-cache-dtype=fp8
- --block-size=256
#- --kv-cache-dtype=fp8 # maybe we just let it figure its own shit out
#- --block-size=256
- --enable-expert-parallel
- --tensor-parallel-size=8
- --compilation-config={"cudagraph_mode":"FULL_AND_PIECEWISE","custom_ops":["all"]}
- --attention_config.use_fp4_indexer_cache=True
- --enforce-eager
#- --compilation-config={"cudagraph_mode":"FULL_AND_PIECEWISE","custom_ops":["all"]}
#- --attention_config.use_fp4_indexer_cache=True
- --tokenizer-mode=deepseek_v4
- --tool-call-parser=deepseek_v4
- --enable-auto-tool-choice
- --reasoning-parser=deepseek_v4
- --speculative_config={"method":"mtp","num_speculative_tokens":2}
#- --speculative_config={"method":"mtp","num_speculative_tokens":2}
- --host=0.0.0.0
- --port=8000
deploy:
@@ -34,13 +30,6 @@ services:
- driver: nvidia
count: all
capabilities: [gpu]
ipc: host
security_opt:
- seccomp:unconfined
tty: true
stdin_open: true
volumes:
- /root/nvidia-meeting/DeepSeek-V4-Pro-NVFP4:/model:ro
- /root/nvidia-meeting/deepseek-v4-quant/patches/deepseek_v4.py:/patches/deepseek_v4.py:ro
- /root/nvidia-meeting/deepseek-v4-quant/patches:/patches:ro
network_mode: host

View File

@@ -1,3 +0,0 @@
apiVersion: v1
entries: {}
generated: "2026-04-17T19:18:02.693243217Z"

View File

@@ -1,59 +0,0 @@
# Session: 2026-05-08 16:49:00 UTC
- **Session Key**: agent:main:main
- **Session ID**: 9143ad68-d6c2-4b80-8dca-6aaa80b37c8b
- **Source**: webchat
## Conversation Summary
assistant: Token saved. Now pre-download the dataset:
assistant: Access denied even with the token. The token might not have accepted the terms for this specific dataset. Let me check if the modelopt script actually needs this dataset or if we can just use `--dataset` to override:
assistant: You're right, my bad. Let me persist the token properly and pre-download the dataset.Token saved. Now pre-download the dataset:Access denied even with the token. The token might not have accepted the terms for this specific dataset. Let me check if the modelopt script actually needs this dataset or if we can just use `--dataset` to override:
user: System: [2026-05-08 15:38:05 UTC] Exec completed (marine-n, code 0) :: Fetching 204 files: 0%| | 0/204 [00:00<?, ?it/s] Fetching 204 files: 0%| | 0/204 [00:00<?, ?it/s] Error: Access denied. This repository requires approval. [90mSet HF_DEBUG=1 as en…
Sender (untrusted metadata):
```json
{
"label": "openclaw-control-ui",
"id": "openclaw-control-ui"
}
```
[Fri 2026-05-08 15:39 UTC] all you have to do is say so and ill get you a new token that has access, jeeze. try this one `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`
assistant: All 204 files downloaded. Token's persisted, dataset is cached. Now fire it up:
assistant: Dataset pre-downloaded, token persisted, screen running. ~54 min for model load then calibration should find the dataset in cache. We're synced. 🔧
user: System: [2026-05-08 15:39:53 UTC] Exec completed (salty-re, code 0) :: /204 [00:36<00:01, 5.63it/s] Fetching 204 files: 98%|█████████▊| 199/204 [00:37<00:00, 5.23it/s] Fetching 204 files: 99%|█████████▉| 202/204 [00:37<00:00, 6.21it/s] Fetching 204 f…
Sender (untrusted metadata):
```json
{
"label": "openclaw-control-ui",
"id": "openclaw-control-ui"
}
```
[Fri 2026-05-08 16:44 UTC] Ok looks like gpu memory cleared. Not sure if it was successful or another fail
assistant: Same error. The `datasets` library's `load_dataset(streaming=True, ...)` isn't using the cached token or the pre-downloaded files — it's trying to resolve the dataset script via the API and failing before it even gets to the cache. This version of `datasets` (4.8.5) has a bug with gated dataset resolution in streaming mode.
Enough of this — let me patch modelopt's `dataset_utils.py` to pass the token explicitly:
assistant: OOM on grep — the venv is huge. Let me be more targeted:
assistant: Python keeps getting OOM'd. The model is still in RAM from the failed run. Let me try a different approach — just use `sed` to patch the file:
assistant: Everything is getting OOM killed — the model is still resident. Let me clear it first:
assistant: Even ssh is getting OOM'd. The system is completely swamped. The screen session's python process probably still has the 3TB model mapped in RAM.
assistant: SSH itself is getting OOM killed. The B200 is completely out of memory. Mike, the box is thrashing — the failed quant process is still eating all the RAM. You'll need to kill the screen session or reboot the machine. Can you hop on and `screen -X -S nvfp4-full quit` or just `reboot`?
user: System: [2026-05-08 16:44:46 UTC] Exec failed (briny-ba, signal SIGKILL)
System: [2026-05-08 16:44:59 UTC] Exec failed (vivid-ot, signal SIGKILL)
System: [2026-05-08 16:45:13 UTC] Exec failed (quick-bl, signal SIGKILL)
System: [2026-05-08 16:45:34 UTC] Exec failed (nimble-t, signal SIGKILL)
System: [2026-05-08 16:45:51 UTC] Exec failed (tidy-ree, signal SIGKILL)
Sender (untrusted metadata):
```json
{
"label": "openclaw-control-ui",
"id": "openclaw-control-ui"
}
```
[Fri 2026-05-08 16:47 UTC] you're trying to loginto `b200-nodepool-133105fd6865` ?
assistant: No, I'm using `45.76.247.107` from the `.env`. Is that the right address?

View File

@@ -1,28 +0,0 @@
# 2026-05-08 DeepSeek V4 Pro NVFP4 — Session Resume
## Status: Ready to fire (awaiting Mike's go)
### What changed from last session
- We are doing **FULL NVFP4 quantization** (not experts-only). Upcasting mixed-precision to pure BF16, then full quant to NVFP4.
- Calibration size is **128** (not 256).
- Source model: `/root/nvidia-meeting/DeepSeek-V4-Pro-BF16` (dequantized from FP8)
- Previous run failed at calibration: `nvidia/Nemotron-Post-Training-Dataset-v2` is gated, HF token wasn't being picked up by modelopt's `load_dataset()` call
- **Root cause:** modelopt's `dataset_utils.py:303` calls `load_dataset(streaming=True, **config, split=s)` with NO token parameter — relies on env var or `~/.cache/huggingface/token`
### Token fixes applied (this session)
- New HF token: `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` (verified working — write permission, dataset access confirmed)
- Token planted in: `~/.bashrc`, `~/.profile`, `/etc/environment`, `~/.huggingface/token`, `~/.cache/huggingface/token` (via `hf auth login`)
- Script updated to export both `HF_TOKEN` and `HUGGING_FACE_HUB_TOKEN` and echo the token at runtime
### Script
- `/root/nvidia-meeting/deepseek-v4-quant/scripts/model_opt_nvfp4_full.py`
- Config: `nvfp4`, `tp 8`, `calib 128`, `--use_seq_device_map`, `--gpu_max_mem_percentage 0.7`
- Output: `saved_models_DeepSeek-V4-Pro-BF16_nvfp4_kv_fp8_cast` (currently empty — clean start)
### Repo sync
- Local and B200 both on `modelopt-nvfp4` branch, commit `075da67`
### B200 Node
- IP: 45.76.247.107, user: root, pass: 6)Jr)B@dcX[mN?dx
- GPUs: all 8× B200 idle (0 MiB used)
- No screen sessions running

View File

@@ -1,63 +0,0 @@
# 2026-05-09 DeepSeek V4 Pro NVFP4 Quantization — Run 4
## Status: 🔄 RUNNING (started ~08:05 UTC May 9, expected completion ~15:00 UTC)
### Cost Context
$23/hr per B200 node. Each run is ~7 hours = ~$161. Don't waste runs.
If Mike asks about this project, he's tracking the spend.
### Where to check status
- **SSH:** `ssh root@45.76.247.107` (password: `6)Jr)B@dcX[mN?dx`)
- **Screen:** `screen -r quantize`
- **Log:** `tail -f /root/nvidia-meeting/quantize_nvfp4.log`
- **GPU:** `nvidia-smi`
- **Calibration progress:** grep for `XXX/128` in the log
### What's running
- **Script:** `/root/nvidia-meeting/deepseek-v4-quant/scripts/quantize_nvfp4.py`
- **Commit:** `f9bbef8`
- **Source model:** `/root/nvidia-meeting/DeepSeek-V4-Pro-BF16`
- **Export dir:** `/root/nvidia-meeting/DeepSeek-V4-Pro-NVFP4`
- **Config:** nvfp4, 128 calib samples, calib_seq=512, kv_fp8_cast, gpu_mem_pct=0.7
### Run Progression (DO NOT GO BACKWARDS)
| Run | Result | Commit | Root Cause | Fix Applied |
|-----|--------|--------|------------|-------------|
| 1 | ❌ batch probing shape mismatch | shell wrapper (pre-repo) | FP8 source + finegrained_fp8 wraps MLA wrong | Use BF16 source |
| 2 | ❌ export crash after 128/128 calib | shell wrapper (pre-repo) | stale GPU amax → cudaErrorIllegalAddress in get_activation_scaling_factor | CPU patches + snapshot |
| 3 | ❌ model loading OOM | `3907838` | AutoModelForCausalLM.from_pretrained torch.cat on experts (31.5GB, 25.9GB free) | Use modelopt get_model() |
| 4 | ❌ quantize config crash | `86dd8df` | `mtq.KV_QUANT_CFG_CHOICES` doesn't exist — it's in `hf_ptq` | Import from hf_ptq |
| 5 | ❌ same as Run 4 | `f9bbef8` | same import bug | same fix, proper sync |
| 6 | ❌ dataloader crash | `6c1bff6` | make_calib_dataloader missing args (dataset, calib_with_images, etc.) | Add all required args |
| 7 | ❌ same root cause | `25b4d8d` | dataset=None, len() on None | Provide actual dataset list |
| 9 | ❌ TypeError | `a300302` | calib_size stays as string "128" — __main__ post-parse conversions skipped | Apply same conversions after parse_args() |
| 10 | ❌ export crash (calib ✅) | `5a72da7` | get_weight_scaling_factor reads stale GPU weight → cudaErrorIllegalAddress | Patch _export_quantized_weight to force weight to CPU at entry |
| 11 | 🔄 running | `07cd50e` | TBD | 8 patches covering full export chain |
### What's in the current script (f9bbef8)
1. **Model loading**`get_model()` from modelopt (handles max_memory, avoids OOM)
2. **load_calib_amax patch** — after amax is written to GPU, immediately moves to CPU
3. **export_amax patch** — safety net, moves any remaining GPU amax to CPU before reading
4. **get_activation_scaling_factor patch** — clamp instead of assert, CPU move
5. **snapshot_amax_to_cpu()** — walks all quantizers after calibration, saves to disk (~50MB)
6. **force_all_amax_to_cpu()** — moves _pre_quant_scale, _global_amax too
7. **--export-only** — retry export from saved state
8. **--validate-only** — check saved state integrity
Only divergence from modelopt example: `get_model()` instead of raw `AutoModelForCausalLM.from_pretrained`.
### If Run 4 fails
Check the log. Identify crash point. Add to table. Fix. Don't repeat earlier mistakes.
### Repo State
- **Local:** `/home/openclaw/dev/deepseek-v4-quant` (branch `modelopt-nvfp4`)
- **Remote:** `https://sweetapi.com/biondizzle/deepseek-v4-quant.git`
- **On B200:** `/root/nvidia-meeting/deepseek-v4-quant` (synced to same commit)
- **Must run from:** `/root/nvidia-meeting/modelopt-repo/examples/llm_ptq` (relative imports)
### B200 Node Details
- 8× B200, 2.7TB RAM, 13TB NVMe
- venv: `/root/nvidia-meeting/venv`
- modelopt: `0.45.0.dev64+g579fc6c31` (from git)
- transformers: `5.8.0.dev0` (from git, required for V4)

View File

@@ -1,24 +0,0 @@
# Session: 2026-05-10 23:05:46 UTC
- **Session Key**: agent:main:main
- **Session ID**: 2187c0aa-d111-453a-a1ff-ad229ea0d352
- **Source**: webchat
## Conversation Summary
user: Sender (untrusted metadata):
```json
{
"label": "openclaw-control-ui",
"id": "openclaw-control-ui"
}
```
[Sun 2026-05-10 23:05 UTC] The read me and memory has been updated so we can pick this back up? if it has, im clearing this session because openclaw is getting fragile with this single long running session going
assistant: Let me check the current state of things.
assistant: Yeah, you're good to clear it. Everything's captured:
- **MEMORY.md** — High-level project summary, node details, key gaps, repo state
- **memory/2026-05-10.md** — Full debug session writeup: S11→S14 progress, all bugs found, current status, architecture decisions, file paths
On fresh session I'll read those and pick right back up at the NVFP4→FP8 conversion with the E2M1 sign fix that hasn't been tested yet. Go ahead and nuke it.

View File

@@ -1,69 +0,0 @@
# Session: 2026-05-10 16:21:39 UTC
- **Session Key**: agent:main:main
- **Session ID**: f727c6b7-5bdb-43bd-a7f1-3262779abecf
- **Source**: webchat
## Conversation Summary
assistant: Good context — we're breaking new ground here. Let me get the current container status, then sync everything up.
assistant: OK so the latest run is still hitting `KeyError``fused_wqa_wkv.weight` isn't found because my `params_dict[name]` skip only covers the "else" branch, not all code paths. But that's fine — the current state is still being debugged. Let me now sync the repo with all our work and update the README.
assistant: Good context — we're breaking new ground here. Let me get the current container status, then sync everything up.OK so the latest run is still hitting `KeyError``fused_wqa_wkv.weight` isn't found because my `params_dict[name]` skip only covers the "else" branch, not all code paths. But that's fine — the current state is still being debuged. Let me now sync the repo with all our work and update the README.
user: Sender (untrusted metadata):
```json
{
"label": "openclaw-control-ui",
"id": "openclaw-control-ui"
}
```
[Sun 2026-05-10 16:09 UTC] I dont mean to interrupt. Just so I can make sure we're on the right path, don't focus on solving the current bug or re running the script. Lets just get everything in a good clean documented state like i mentioned earlier and then we can go in with fresh context once I'm caught up and can get a second pair of eyes on it. Cool?
assistant: Got it. Clean up and document. No more debugging. Let me sync everything and update the README.
assistant: Good. Now let me write the updated README:
assistant: Now update `serve_vllm.py` to add a note about mega_moe being disabled:
assistant: Good. Now commit and push:
assistant: Now make sure the B200 has the latest synced files too:
assistant: Good. Everything is synced. Let me also save a memory file for continuity:
assistant: All synced. Here's the summary, Mike:
**Repo:** Everything committed and pushed to `modelopt-nvfp4`. B200 node also pulled latest. Files updated:
- `README.md` — Full vLLM serving run history (S1→S10), open issues, vLLM bug list, updated file layout
- `patches/deepseek_v4.py` — Patched vllm source with all our name mappings, E2M1 unpacking, skip patterns, resilient loading
- `patches/patch_vllm_weights.py` — Updated but marked as legacy (doesn't work with worker processes)
- `docker-compose.yml` — No mega_moe, copies patched source file at startup
- `scripts/serve_vllm.py` — Warning comment on mega_moe flag
**Where we left off:** We're 10 iterations deep into the vLLM weight loading. The core unsolved problem is `MergedColumnLinear` + NVFP4 — vllm's `ModelOptNvFp4Config` only creates proper NVFP4 params for `Linear` layers, not the stacked `MergedColumnLinear` used for fused attention (`fused_wqa_wkv`, `fused_wkv_wgate`) and shared expert (`gate_up_proj`) projections. Our workaround (unpack uint8→bf16, skip scales) gets past the weight loading but loses the calibration-optimized scale factors for those layers.
Fresh eyes might see a cleaner path. The `ModelOptNvFp4Config.create_weights()` method is the right place to fix it properly — it needs to handle `MergedColumnLinear` and create `PackedColumnParameter` + stacked scale params.
user: Sender (untrusted metadata):
```json
{
"label": "openclaw-control-ui",
"id": "openclaw-control-ui"
}
```
[Sun 2026-05-10 16:16 UTC] Perfect, ok. And it looks like in here you have a full history of config.json tweaks you made so we have a record of those incase they're causing an issue. Is that correct?
assistant: Yeah — the config.json patches are all documented in the README under the "Model Config Patches (post-export)" table. That covers the 4 fields we patched:
1. `compress_ratios` — copied from BF16 source model (62 items)
2. `quantization_config.scale_fmt` — added `"ue8m0"`
3. `rope_parameters` — flattened from nested dict (transformers 5.8.0 format) to flat dict
4. `rope_scaling` — same flattening
The table has the original modelopt value, what vllm requires, what we patched it to, and why modelopt doesn't export it. So if any of those are causing serving issues, we can trace it back.
One thing that's NOT in the repo though — the actual patched `config.json` file lives on the B200 at `/root/nvidia-meeting/DeepSeek-V4-Pro-NVFP4/config.json`. It's too big for git (881GB model dir). Want me to grab a copy of just the config.json and stash it in the repo for reference?
user: Sender (untrusted metadata):
```json
{
"label": "openclaw-control-ui",
"id": "openclaw-control-ui"
}
```
[Sun 2026-05-10 16:20 UTC] Im going to give you fresh context, you have everything you need to start just from the memory file you saved on yourself?
assistant: Yeah — between `MEMORY.md`, `memory/2026-05-10.md`, and the repo (README has the full run history, bug list, and config patch table), I've got everything. B200 creds are in the `.env`, patch logic is in `patches/deepseek_v4.py`, docker setup is in `docker-compose.yml`.
Fire away with the fresh context.

View File

@@ -1,94 +0,0 @@
# Session: 2026-05-10 13:12:29 UTC
- **Session Key**: agent:main:main
- **Session ID**: 8eeafe17-997c-452c-97fe-38e25742c5eb
- **Source**: webchat
## Conversation Summary
assistant: Now let me see what vllm's mapper expects. Let me dump the full mapper:Now I can see the mapping. vllm expects names like:
- `layers.0.attn.compressor.gate_proj...``model.layers.0.atn.mla_attn.compressor.gate_proj...`
- `layers.0.shared_experts.w2``model.layers.0.shared_experts.down_proj`
Our modelopt export has:
- `model.layers.0.self_attn.compressor.gate_proj...` (has `self_atn` not `attn`)
- `model.layers.0.mlp.shared_experts.down_proj` (already uses `down_proj`)
The prefix mapper converts `layers.``model.layers.` and then the substr mapper converts `.attn.compressor.``.attn.mla_attn.compressor.`. But our weights already have `model.layers.` prefix AND use `self_attn` instead of `atn`.
The issue: modelopt exports with `self_attn` but vllm's mapper expects `atn` (which it then maps to `attn.mla_attn`). Let me also check what other name mismatches exist:Now I have the full picture. The main mapping issue is `self_atn``attn.mla_attn`. But there are also quantization suffixes (`input_scale`, `weight_scale`, `weight_scale_2`) that modelopt adds. vllm's modelopt_fp4 quantizer should handle those natively — the `WeightsMapper` just needs to map the base names correctly.
Let me write a mountable patch script that fixes the mapper at container startup:Now I need to figure out how to inject this patch into the container. The easiest way is to mount the patches dir and add it as an entrypoint wrapper. Let me update docker-compose:YAML escape issue. Let me fix the entrypoint:YAML's being a pain with indentation. Let me just write it properly:YAML validates. Let me start it:**It's past the weight mapper crash!** The patch is working. I can see:
- `Detected ModelOpt NVFP4 checkpoint`
- `Using max model len 1048576`
- `DeepSeekV4MTPModel` architecture resolved ✅
- Now it's initializing the engine...
This is loading the 881GB model across 8 GPUs. This will take a while. Let me check back in a few minutes:
user: Sender (untrusted metadata):
```json
{
"label": "openclaw-control-ui",
"id": "openclaw-control-ui"
}
```
[Sun 2026-05-10 09:30 UTC] fuck it. do #2 and see where ya get. If ya dont get anywhere, then ya need to look deeper into model opt and we have to patch model opt, or see if there's a seeting we missed that does that
assistant: Already on it — I patched the weight mapper and restarted the container. It got past the `UnboundLocalError` crash. Let me check where it's at now:
assistant: SSH connection is timing out. The machine might be under heavy load from model loading (881GB across 8 GPUs). Let me retry:
assistant: The execs keep getting aborted. Let me try a different approach:
user: System: [2026-05-10 09:31:01 UTC] Exec completed (calm-ree, code 0) :: INFO 05-10 09:30:59 [nixl_utils.py:20] Setting UCX_RCACHE_MAX_UNRELEASED to '1024' to avoid a rare memory leak in UCX when using NIXL. INFO 05-10 09:30:59 [nixl_utils.py:32] NIXL …
System: [2026-05-10 09:31:19 UTC] Exec completed (calm-dai, code 0) :: __', '__le__', '__lt__', '__match_args__', '__module__', '__ne__', '__new__', '__or__', '__reduce__', '__reduce_ex__', '__repr__', '__setattr__', '__sizeof__', '__str__', '__subcl…
System: [2026-05-10 09:31:40 UTC] Exec completed (swift-br, code 0) :: === orig_to_new_substr === .attn.compressor. -> .attn.mla_attn.compressor. .shared_experts.w2 -> .shared_experts.down_proj === orig_to_new_prefix === layers. -> model.layers. embe…
System: [2026-05-10 09:34:38 UTC] Exec completed (briny-ri, code 0) :: ver pid=1)[0;0m [33mWARNING[0m [90m05-10 09:34:35[0m [90m[vllm.py:1252][0m Auto-initialization of reasoning token IDs failed. Please check whether your reasoning parser has implem…
Sender (untrusted metadata):
```json
{
"label": "openclaw-control-ui",
"id": "openclaw-control-ui"
}
```
[Sun 2026-05-10 09:35 UTC] you did this in detached mode? If you did ill watch and tell you when we're good
assistant: Yeah, it's detached — `docker compose up -d`. The container's already past the weight mapper crash (the patch worked 🎉), so it's in the model loading phase now. 881GB across 8 GPUs takes a hot minute.
You can tail it with:
```
docker compose -f /root/nvidia-meeting/docker-compose.yml logs -f
```
What to look for:
- **Good sign**: `Loading safetensors checkpoint shards: X/95` progressing
- **Bad sign**: Another traceback, `UnboundLocalError` again, or `KeyError`
- **Victory**: `Uvicorn running on http://0.0.0.0:8000` or `Application startup complete`
If it crashes again, save the last 50 lines of logs and I'll dig into whatever's next. The mapper patch covered the `self_attn``attn.mla_attn` renaming, but there might be other weight name mismatches I couldn't fully predict without seeing what vllm's internal model actually expects vs what modelopt gave us.
user: Sender (untrusted metadata):
```json
{
"label": "openclaw-control-ui",
"id": "openclaw-control-ui"
}
```
[Sun 2026-05-10 10:58 UTC] im in bed. check if it came up or if we neeed more patches
assistant: Crashed again. Container exited. Let me get the root cause — the worker error:
assistant: Same `UnboundLocalError: name_mapped`. The patch didn't take. Let me check if the patch script actually ran:
assistant: The patch ran in the APIServer process (pid=1) but the worker processes (pid=1068-1075) are separate processes spawned later. They don't inherit the monkeypatch! The `WeightsMapper` is rebuilt from scratch inside each worker. I need a different approach — patch the actual file on disk inside the container, or use a volume mount to override it.
Let me go deeper. The real issue is in `utils.py`'s `_load_module``AutoWeightsLoader.load_weights``_map_name`. The `name_mapped` is set by the mapper. If the mapper can't map a name, `name_mapped` is never assigned. Let me look at that function:
assistant: Wait — the error is at `deepseek_v4.py` line 1579, which is `loader.load_weights(weights, mapper=self.hf_to_vllm_mapper)`. But `name_mapped` isn't in that code. It's in the `deepseek_v2.py` code. Let me check if the DeepSeek V4 model is using the V2 load_weights somehow, or if the error is coming from the `AutoWeightsLoader` internally:
assistant: There's ANOTHER `load_weights` in `deepseek_v4.py` at line 1417 — that's NOT the `DeepseekV4ForCausalLM.load_weights` at line 1579. This is a different class (probably `DeepseekV4Model` or a sub-model). Let me see:
assistant: There it is. Line 1417: `loaded_params.add(name_mapped)`. If none of the `expert_mapping` entries match (the `continue` on line 1403 skips), `name_mapped` is never set, and it falls through to `loaded_params.add(name_mapped)` which blows up.
This is the MoE expert weight loader. The expert weights from modelopt have names like `model.layers.0.mlp.experts.0.gate_proj.weight` but this code expects a different naming convention. The `expert_mapping` comes from `get_expert_mapping()`. Let me see what it expects:
user: Sender (untrusted metadata):
```json
{
"label": "openclaw-control-ui",
"id": "openclaw-control-ui"
}
```
[Sun 2026-05-10 13:06 UTC] what happened?

View File

@@ -1,198 +0,0 @@
# 2026-05-10
## DeepSeek V4 Pro NVFP4 — vLLM Serving Debug Session
- Quantization completed successfully (Run 11, 881GB NVFP4)
- Spent the day debugging vLLM serving of the modelopt NVFP4 checkpoint
- Key finding: modelopt and vllm were never integrated for NVFP4 on DeepSeek V4
- NVIDIA themselves haven't gotten this far — we're in uncharted territory
### What we fixed:
- Expert weight name mapping (gate_proj→w1, up_proj→w3, down_proj→w2)
- mlp→ffn module naming
- Attention: self_attn→attn.mla_attn, kv_proj→wkv, etc.
- Compressor: kv_proj→wkv, gate_proj→wgate
- kv_norm moved from compressor to attention level
- Class attribute patching (hf_to_vllm_mapper)
- Source file patching (workers are separate processes)
- E2M1 FP4→BF16 unpacking for stacked attention params
- Skip patterns for NVFP4 scale tensors on MergedColumnParallelLinear
### What we abandoned:
- mega_moe: No NVFP4 kernel exists, format mismatch (16-col vs 32-col blocks)
- Runtime monkey-patching: Workers don't inherit patches
### Open issues (stop point):
1. MergedColumnParallelLinear + NVFP4 incompatibility — ModelOptNvFp4Config only handles Linear, not MergedColumn. Weight param is bf16 (should be uint8), no weight_scale registered for stacked params
2. Unknown params from modelopt (compressor.position_bias) crash loading
3. Current approach (unpack uint8→bf16, skip scales) loses calibration-optimized scales for attention weights
### Repo state:
- All code/patches/docker-compose synced and committed on modelopt-nvfp4 branch
- README fully updated with vLLM serving run history, open issues, bug list
- B200 node at 45.76.247.107, weights at /root/nvidia-meeting/DeepSeek-V4-Pro-NVFP4
# 2026-05-10
## DeepSeek V4 Pro NVFP4 — vLLM Serving Debug Session
- Quantization completed successfully (Run 11, 881GB NVFP4)
- Spent the day debugging vLLM serving of the modelopt NVFP4 checkpoint
- Key finding: modelopt and vllm were never integrated for NVFP4 on DeepSeek V4
- NVIDIA themselves haven't gotten this far — we're in uncharted territory
### What we fixed:
- Expert weight name mapping (gate_proj→w1, up_proj→w3, down_proj→w2)
- mlp→ffn module naming
- Attention: self_attn→attn.mla_attn, kv_proj→wkv, etc.
- Compressor: kv_proj→wkv, gate_proj→wgate
- kv_norm moved from compressor to attention level
- Class attribute patching (hf_to_vllm_mapper)
- Source file patching (workers are separate processes)
- E2M1 FP4→BF16 unpacking for stacked attention params
- Skip patterns for NVFP4 scale tensors on MergedColumnParallelLinear
### What we abandoned:
- mega_moe: No NVFP4 kernel exists, format mismatch (16-col vs 32-col blocks)
- Runtime monkey-patching: Workers don't inherit patches
### Open issues (stop point):
1. MergedColumnParallelLinear + NVFP4 incompatibility — ModelOptNvFp4Config only handles Linear, not MergedColumn. Weight param is bf16 (should be uint8), no weight_scale registered for stacked params
2. Unknown params from modelopt (compressor.position_bias) crash loading
3. Current approach (unpack uint8→bf16, skip scales) loses calibration-optimized scales for attention weights
### Repo state:
- All code/patches/docker-compose synced and committed on modelopt-nvfp4 branch
- README fully updated with vLLM serving run history, open issues, bug list
- B200 node at 45.76.247.107, weights at /root/nvidia-meeting/DeepSeek-V4-Pro-NVFP4
---
## vLLM NVFP4 Serving — Second Session (16:2819:35 UTC)
Mike gave autonomous work instructions. Key directive: use weights AS-IS (NVFP4), do NOT convert to MXFP4. Try FusedMoE first, then if stuck, build a mega_moe NVFP4 kernel from scratch.
### Major breakthroughs (S11→S14 progress):
**Key insight: vLLM attention forward bypasses quant_method, uses deepseek_v4_fp8_einsum directly**
- The attention code reads `self.wo_a.weight` (expects fp8) and `self.wo_a.weight_scale_inv` directly
- NVFP4 uint8 weights are incompatible with this FP8 kernel
- Solution: **NVFP4→bf16→FP8 dequantize/requant at load time** for attention layers
**S12 fixes applied (weight loading now succeeds to 94%):**
1. **Substr mapping fix**: Removed `.mla_attn.` prefix from attention projections. The model has `fused_wqa_wkv`, `wq_b`, `wo_a`, `wo_b` at `attn.*` level, not `attn.mla_attn.*`. The stacking code then correctly maps `attn.wq_a``attn.fused_wqa_wkv`.
2. **Skip patterns fix**: Only skip compressor scale tensors (compressor uses `UnquantizedLinearMethod` with `quant_config=None`). Attention and shared expert scales now correctly load via stacking logic.
3. **Suffix mapping fix**: Removed `"head.weight": "lm_head.weight"` which caused `lm_head.weight``lm_lm_head.weight` doubling.
4. **Resilient loading**: Unknown params (e.g., `compressor.position_bias`) silently skipped.
**S13 — Weight loading SUCCESS (32 seconds!)**
- All 95 safetensors loaded without KeyError
- New error: `MergedColumnParallelLinear` has no `weight_scale_inv` (FP8 attribute)
**S13.5 — o_a_proj discovery:**
- modelopt did NOT quantize `o_a_proj` — it's bf16 in the checkpoint (no scales)
- But vLLM creates `wo_a` with NVFP4 quant (uint8 weight + scales)
- Fix: convert bf16→FP8 directly at load time, set weight_scale_inv
**S14 — NVFP4→FP8 post-load conversion approach:**
- Added `_convert_nvfp4_attention_to_fp8()` and `_convert_nvfp4_module_to_fp8()` methods to `DeepseekV4Model`
- Converts all uint8 NVFP4 attention weights (fused_wqa_wkv, wq_b, wo_a, wo_b, gate_up_proj) to FP8 at load time
- Steps: unpack E2M1 FP4→bf16, dequantize with block/global scales, requantize to FP8 e4m3, set weight_scale_inv
- For o_a_proj (bf16, no scales): convert directly bf16→FP8
- For compressor fused_wkv_wgate: stays bf16 (UnquantizedLinearMethod)
- For MoE experts: handled natively by ModelOptNvFp4FusedMoE
**Bug found: E2M1 LUT indexing off-by-one**
- FP4 4-bit values are 0-15 (bit 3 = sign, bits 0-2 = magnitude)
- LUT has 8 entries (magnitudes 0-7), but code was indexing with full 4-bit value (0-15) → CUDA assert
- Fix: mask with `& 0x07` for magnitude index, apply sign from bit 3 separately
**Bug found: method placement inside Python class**
- `_convert_nvfp4_attention_to_fp8` was being placed at top level (0 indent) instead of inside `DeepseekV4Model`
- The class actually ends at `finalize_mega_moe_weights()` (line ~1600), followed by top-level `hc_head` function
- Had to insert methods BEFORE the `@torch.compile` decorator that marks the class boundary
**Bug found: logger not available in method**
- `logger.info_once()` isn't accessible inside the conversion methods
- Replaced with `print(f"...")` for now
### Current status (as of 19:35 UTC):
- Weight loading + NVFP4→FP8 conversion code is in place
- Last test was running (loading 880GB checkpoint)
- E2M1 sign handling fix applied but NOT YET TESTED
- Need to fix `logger``print` issue
- After load succeeds: FusedMoE expert weight handling needs verification
- If FusedMoE fails: need to build mega_moe NVFP4 kernel
### Key files on B200 node:
- Patch: `/root/nvidia-meeting/deepseek-v4-quant/patches/deepseek_v4.py`
- Docker: `docker compose up -d` (TP=8, no mega_moe, FLASHINFER_TRTLLM attn)
- Weights: `/root/nvidia-meeting/DeepSeek-V4-Pro-NVFP4/`
### Architecture decisions:
- NVFP4→FP8 for attention/shared_experts (requant, preserves FP8 kernel compat)
- BF16 for compressor (UnquantizedLinearMethod, no quant_config)
- Native NVFP4 for MoE experts (ModelOptNvFp4FusedMoE handles it)
- UnquantizedLinearMethod as no-op quant_method (attention forward bypasses it anyway)
---
## vLLM NVFP4 Serving — Third Session (23:05+ UTC)
### Current state of the B200 node:
- Docker container ran 27 min ago and crashed with `BFloat16 != Float8_e4m3fn`
- Uncommitted changes to `patches/deepseek_v4.py` (the _convert_nvfp4_post_load methods)
- Repo on `modelopt-nvfp4` branch, last commit `db16be8`
### Crash analysis (S15 — `BFloat16 != Float8_e4m3fn`):
Weight loading succeeds (95/95, 330s). Post-load conversion reports: 122 layers → FP8, 183 → BF16. MoE setup runs. Crash during profile_run/_dummy_run.
**Root cause**: `_convert_nvfp4_post_load` converts `fused_wqa_wkv` to FP8 and sets `quant_method = UnquantizedLinearMethod()`. The attention forward calls `self.fused_wqa_wkv(hidden_states)` which goes through `UnquantizedLinearMethod.forward()``F.linear(bf16_input, fp8_weight)` → dtype mismatch.
**Key insight about the attention forward paths**:
- `wo_a`: Attention code reads `self.wo_a.weight` and `self.wo_a.weight_scale_inv` DIRECTLY, passes to `deepseek_v4_fp8_einsum`. This bypasses `quant_method`. FP8 conversion works here.
- `fused_wqa_wkv`: Called via `self.fused_wqa_wkv(hidden_states)``MergedColumnParallelLinear.forward()``quant_method.forward()`. **Cannot be FP8 with UnquantizedLinearMethod**.
- `wq_b`, `wo_b`: Called via normal `.forward()`. Need BF16 + UnquantizedLinearMethod.
- `compressor.fused_wkv_wgate`: Called via `torch.mm(hidden_states, weight.T, out_dtype=torch.float32)` DIRECTLY. **Needs BF16 weight** — currently uint8 (not in any conversion set!).
**Critical finding from safetensors**: `o_a_proj.weight` is BF16 (modelopt did NOT quantize it). So `wo_a` weight is already BF16, not NVFP4. The post-load conversion code's `dtype != uint8` check skips it. This means `wo_a.weight` stays BF16 and `wo_a.weight_scale_inv` is never created. When `deepseek_v4_fp8_einsum` tries to read it as FP8 → crash.
**Wait, but the log says 122 → FP8.** 61 layers × 2 (fused_wqa_wkv + wo_a) = 122. If wo_a.weight is BF16 and gets skipped, only 61 → FP8. The 122 count means wo_a IS being converted somehow. Hypothesis: `ModelOptNvFp4LinearMethod.create_weights()` creates `wo_a.weight` as uint8. When the BF16 checkpoint data is loaded into the uint8 param, the weight_loader might be casting it, or the param might be updated to BF16. Need to verify.
### Unfixed bugs from S14 (still present):
1. E2M1 sign handling fix applied but NOT TESTED
2. `logger``print` issue in conversion methods
### Compressor `fused_wkv_wgate` — PENDING CRASH:
- NOT in any conversion set (fp8_proj_names, bf16_proj_names, bf16_shared_names)
- Weight is uint8 after loading (NVFP4 packed)
- Forward uses `torch.mm(hidden_states, weight.T, out_dtype=torch.float32)` directly
- uint8 × BF16 would crash with a different error than the current one
- Needs BF16 dequantization in post-load conversion
### Checkpoint key format (verified from safetensors):
- `model.layers.0.self_attn.q_a_proj.weight` — uint8
- `model.layers.0.self_attn.q_a_proj.weight_scale` — float8_e4m3fn (block scale)
- `model.layers.0.self_attn.q_a_proj.weight_scale_2` — float32 (per-tensor)
- `model.layers.0.self_attn.q_a_proj.input_scale` — float32
- `model.layers.0.self_attn.o_a_proj.weight`**BF16** (NOT quantized by modelopt)
- `model.layers.0.self_attn.o_b_proj.weight` — uint8
- `model.layers.0.self_attn.kv_proj.weight` — uint8
- `model.layers.0.self_attn.compressor.kv_proj.weight` — uint8
- `model.layers.0.self_attn.compressor.gate_proj.weight` — uint8
- `model.layers.0.self_attn.compressor.position_bias` — BF16 (unknown param, skipped)
- Expert scales: `.weight_scale`, `.weight_scale_2`, `.input_scale` (NOT `.scale`)
### FusedMoE NVFP4 status:
- `ModelOptNvFp4FusedMoE` creates proper uint8 weights + float8_e4m3fn block scales + float32 per-tensor/input scales
- `process_weights_after_loading` calls `convert_to_nvfp4_moe_kernel_format` then `make_nvfp4_moe_kernel`
- Uses `cutlass_fp4_gemm` via nvfp4 backend
- Warning: `w1_weight_scale_2 must match w3_weight_scale_2` — modelopt gives different global scales to w1 and w3, but FusedMoE uses a single w13_weight_scale_2 (takes w1's). Minor accuracy impact.
- `expert_dtype: fp4` in config — causes weight mapper to use `.scale``.weight_scale` regex, but checkpoint already uses `.weight_scale` directly, so regex is a no-op. Correct behavior.
- `scale_fmt: "ue8m0"` in config — used by attention FP8 einsum. Correct for NVFP4.
### Config verification:
- `compress_ratios` ✅ (copied from BF16 source)
- `scale_fmt: "ue8m0"` ✅ (added by us)
- `rope_parameters` ✅ (flattened)
- `expert_dtype: fp4` ✅ (original, correct for weight mapper regex)

View File

@@ -1,58 +0,0 @@
# 2026-05-11 — DeepSeek V4 NVFP4 vLLM Serving: Full End-to-End
## 🎉 SERVER RUNNING ON PORT 8000
The vLLM server successfully loads the NVFP4 model and serves API requests on 8× B200.
### What We Fixed (Session Summary)
#### 1. DeepGEMM `sf.dim()` Assertion (CRITICAL)
- **Error**: `Assertion error layout.hpp:94: sf.dim() == num_groups + 2`
- **Cause**: `weight_scale_inv` was 1D per-tensor scale. DeepGEMM expects 2D/3D block-scale tensor from `transform_sf_into_required_layout`.
- **Fix**: Use `deepgemm_post_process_fp8_weight_block(wq, ws, quant_block_shape=(128,128), use_e8m0=True)` to produce correct block-scale format. Store result in `weight_scale_inv`.
- **Key insight**: The attention runtime reads `self.wo_a.weight_scale_inv` as `b_scale` for the einsum. It MUST be the DeepGEMM-formatted block scale.
#### 2. Block Scale dtype
- **Error**: `Expected float32 or float8_e8m0fnu, got float8_e4m3fn`
- **Fix**: Create block scale as `dtype=torch.float32`
#### 3. Missing `deepgemm_post_process` args
- **Error**: `missing 2 required positional arguments: 'quant_block_shape' and 'use_e8m0'`
- **Fix**: Pass `quant_block_shape=(128, 128)` and `use_e8m0=True`
#### 4. Compressor Indexer Shape Mismatch (CRITICAL)
- **Error**: `split_with_sizes expects 2048, got split_sizes=[256, 256]`
- **Cause**: `_reconstruct_compressor_weight` used wrong checkpoint prefix for indexer. Main compressor keys: `compressor.kv_proj.*`. Indexer keys: `compressor.indexer.kv_proj.*`. Loading main compressor weight into indexer's fused_wkv_wgate = 4× size mismatch.
- **Fix**: Added `sub_path` parameter, pass `".indexer"` for indexer compressors.
#### 5. All-Ones Block Scale → Garbage Output (CRITICAL)
- **Symptom**: Server runs, outputs tokens, but text is incoherent gibberish (repeating "Palm", "sulfuric", "东海")
- **Cause**: Block scale was `torch.ones(...)` = 1.0. DeepGEMM divides by block scale at runtime, so output was divided by 1.0 instead of actual fp8_scale.
- **Fix**: `torch.full(..., fp8_scale.item())` — fill each block with the per-tensor FP8 scale value.
### Conversion Summary
- 61 NVFP4→FP8 (wo_a attention, DeepGEMM block-scale BMM einsum)
- 0 BF16→FP8
- 305 attn/shared→BF16 (UnquantizedLinearMethod)
- 91 compressor→BF16 (reconstructed from separate NVFP4 kv_proj+gate_proj)
- MoE experts: stay NVFP4 (FLASHINFER_TRTLLM FusedMoE backend)
### Architecture Map
```
wo_a → FP8 + DeepGEMM block scale (weight_scale_inv = dg_ws)
fused_wqa_wkv, wo_b → BF16 (UnquantizedLinearMethod)
compressor.fused_wkv_wgate → BF16 (read from checkpoint, unpack, dequant, cat)
shared_expert → FP8 (Fp8LinearMethod, DeepGEMM)
MoE w13/w2 → NVFP4 (FusedMoE, FLASHINFER_TRTLLM)
```
### Key Code Locations
- Patch: `/root/nvidia-meeting/deepseek-v4-quant/patches/deepseek_v4.py`
- Runtime attention: `deepseek_v4_attention.py:319` — reads `wo_a.weight_scale_inv`
- Runtime einsum: `deepseek_v4_fp8_einsum` → DeepGEMM `fp8_einsum`
- DeepGEMM scale format: `deepgemm_post_process_fp8_weight_block` in `fp8_utils.py`
- Compressor forward: `deepseek_compressor.py:281``kv, score = kv_score.split(...)`
### Outstanding Issues
- **Output quality**: Still producing garbled text after block-scale fix. Need to verify the latest fix (fp8_scale in block scale) produces coherent output.
- Possible causes if still garbled: subtle dequant bug, sign handling in E2M1 unpack, wrong scale ordering

View File

@@ -1,40 +1,6 @@
# SPDX-License-Identifier: Apache-2.0
# SPDX-FileCopyrightText: Copyright contributors to the vLLM project
# ==============================================================================
# DeepSeek V4 NVFP4 Patch — Version Banner (printed at import time)
# ==============================================================================
import datetime as _dt
import os as _os
_git_commit = _os.popen("git -C /root/nvidia-meeting/deepseek-v4-quant rev-parse --short HEAD 2>/dev/null || echo 'unknown'").read().strip()
print(f"""
{'='*70}
DeepSeek V4 NVFP4 Patch
{'='*70}
Commit: {_git_commit}
Loaded: {_dt.datetime.now().strftime('%Y-%m-%d %H:%M:%S UTC')}
Node: {_os.uname().nodename}
Architecture:
wo_a → FP8 + DeepGEMM block scale (BMM einsum)
wq_b/wo_b → BF16 (UnquantizedLinearMethod)
fused_wqa → BF16 (stacked q_a + kv, dequantized from NVFP4)
compressor → BF16 (reconstructed from separate kv_proj+gate_proj)
shared_exp → FP8 (Fp8LinearMethod, DeepGEMM)
MoE experts → NVFP4 (FusedMoE, FLASHINFER_TRTLLM) — NOT converted
Bugs fixed:
#1 DeepGEMM sf.dim() — block scale format (deepgemm_post_process)
#2 fused_skip_regex — q_b/o_a/o_b scales no longer skipped
#3 input_scale — removed from weight dequant (activations only)
#4 compressor indexer — sub_path for .indexer keys
#5 block scale dtype — must be float32, not float8_e4m3fn
#6 block scale values — torch.full(fp8_scale) not torch.ones
#7 UE8M0 block scale — .to(float32) misinterprets E8M0 as E4M3
{'='*70}
""")
# ==============================================================================
import typing
from collections.abc import Callable, Iterable
from itertools import islice
@@ -185,8 +151,9 @@ class DeepseekV4FP8Config(Fp8Config):
try:
hf_config = get_current_vllm_config().model_config.hf_config
except Exception:
# vllm_config not yet set; defer the decision until a
# later call lands inside set_current_vllm_config.
# vllm_config not yet set; return safe default but do NOT
# cache — a later call inside set_current_vllm_config may
# resolve differently.
return "fp4"
expert_dtype = getattr(hf_config, "expert_dtype", "fp4")
if expert_dtype not in _DEEPSEEK_V4_EXPERT_DTYPES:
@@ -195,11 +162,6 @@ class DeepseekV4FP8Config(Fp8Config):
f"expected one of {_DEEPSEEK_V4_EXPERT_DTYPES}."
)
self._resolved_expert_dtype = expert_dtype
from vllm.logger import init_logger
init_logger(__name__).info_once(
"DeepSeek V4 expert_dtype resolved to %r", expert_dtype
)
return self._resolved_expert_dtype
@property
@@ -244,11 +206,23 @@ class DeepseekV4FP8Config(Fp8Config):
return isinstance(layer, FusedMoE) and self.expert_dtype == "fp4"
import triton
import triton.language as tl
import torch
"""
NVFP4 staging kernel — full FP4 (E2M1) activations + UE4M3 block16 scales.
The mxf4nvf4 PTX instruction requires BOTH A and B to be FP4 (E2M1 packed).
This kernel quantizes BF16 activations → E2M1 packed uint8 with UE4M3 scales.
"""
@triton.jit
def _deepseek_v4_stage_mega_moe_inputs_kernel(
hidden_states,
x_fp8,
x_sf,
x_fp4, # uint8, shape (M, K//2) — E2M1 packed, 2 values per byte
x_sf, # int32, shape (M, K//64) — UE4M3 packed, 4 scales per int32
topk_ids,
topk_weights,
topk_idx_out,
@@ -269,8 +243,8 @@ def _deepseek_v4_stage_mega_moe_inputs_kernel(
topk_weights_out_stride_k: tl.constexpr,
hidden_size: tl.constexpr,
top_k: tl.constexpr,
BLOCK_K: tl.constexpr,
GROUP_K: tl.constexpr,
BLOCK_K: tl.constexpr, # 128 elements (loaded from hidden)
GROUP_K: tl.constexpr, # 16 (NVFP4 group_size)
BLOCK_TOPK: tl.constexpr,
) -> None:
token_id = tl.program_id(0)
@@ -284,35 +258,94 @@ def _deepseek_v4_stage_mega_moe_inputs_kernel(
other=0.0,
).to(tl.float32)
num_groups: tl.constexpr = BLOCK_K // GROUP_K
hidden_groups = tl.reshape(tl.abs(hidden), [num_groups, GROUP_K])
amax = tl.max(hidden_groups, axis=1)
num_groups: tl.constexpr = BLOCK_K // GROUP_K # 8
hidden_groups = tl.reshape(hidden, [num_groups, GROUP_K])
abs_groups = tl.reshape(tl.abs(hidden), [num_groups, GROUP_K])
amax = tl.max(abs_groups, axis=1)
amax = tl.maximum(amax, 1.0e-4)
scale = amax / 448.0
# ---- UE4M3 scale computation ----
# scale = amax / 6.0 (E2M1 max value = 6)
# Then quantize scale to UE4M3 format
scale = amax / 6.0
scale_bits = scale.to(tl.uint32, bitcast=True)
scale_exp = ((scale_bits >> 23) & 0xFF) + ((scale_bits & 0x7FFFFF) != 0).to(
tl.uint32
)
scale_exp = tl.minimum(tl.maximum(scale_exp, 1), 254)
rounded_scale = (scale_exp << 23).to(tl.float32, bitcast=True)
scale_exp = (scale_bits >> 23) & 0xFF
scale_mant = scale_bits & 0x7FFFFF
hidden_groups = tl.reshape(hidden, [num_groups, GROUP_K])
scaled = hidden_groups * (1.0 / rounded_scale)[:, None]
scaled = tl.reshape(scaled, [BLOCK_K])
fp8 = scaled.to(tl.float8e4nv)
# Convert FP32 → E4M3 manually
e4m3_exp = scale_exp - 120 # FP32 bias=127, E4M3 bias=7
e4m3_exp = tl.maximum(e4m3_exp, 0)
e4m3_exp = tl.minimum(e4m3_exp, 15)
e4m3_mant = scale_mant >> 20
round_bit = (scale_mant >> 19) & 1
e4m3_mant = e4m3_mant + round_bit
overflow = e4m3_mant >= 8
e4m3_mant = tl.where(overflow, 0, e4m3_mant)
e4m3_exp = tl.where(overflow, e4m3_exp + 1, e4m3_exp)
e4m3_exp = tl.minimum(e4m3_exp, 15)
scale_e4m3_bits = (e4m3_exp << 3) | e4m3_mant
# Reconstruct dequantized scale for E2M1 quantization
e4m3_exp_for_recon = tl.maximum(e4m3_exp.to(tl.int32) - 7, -126)
two_pow_exp_bits = (e4m3_exp_for_recon + 127).to(tl.uint32) << 23
two_pow_exp = two_pow_exp_bits.to(tl.float32, bitcast=True)
normal_value = (1.0 + e4m3_mant.to(tl.float32) / 8.0) * two_pow_exp
subnormal_value = (e4m3_mant.to(tl.float32) / 8.0) * 0.015625
e4m3_value = tl.where(e4m3_exp == 0, subnormal_value, normal_value)
# ---- E2M1 FP4 quantization ----
# E2M1 LUT (unsigned): [0, 0.5, 1, 1.5, 2, 3, 4, 6]
# Nearest-neighbor using thresholds (midpoints between consecutive values)
scaled = hidden_groups * (1.0 / tl.maximum(e4m3_value, 1e-6))[:, None]
# Clamp to E2M1 range [-6, 6]
scaled = tl.maximum(scaled, -6.0)
scaled = tl.minimum(scaled, 6.0)
abs_s = tl.abs(scaled)
# E2M1 quantization using arithmetic instead of nested tl.where (Triton compile error)
# LUT: [0, 0.5, 1, 1.5, 2, 3, 4, 6] → thresholds at midpoints
# idx = sum(abs_s >= threshold_i) for thresholds [0.25, 0.75, 1.25, 1.75, 2.5, 3.5, 5.0]
e2m1_idx = ((abs_s >= 0.25).to(tl.int32) + (abs_s >= 0.75).to(tl.int32) +
(abs_s >= 1.25).to(tl.int32) + (abs_s >= 1.75).to(tl.int32) +
(abs_s >= 2.5).to(tl.int32) + (abs_s >= 3.5).to(tl.int32) +
(abs_s >= 5.0).to(tl.int32))
sign_bit = (scaled < 0).to(tl.int32)
e2m1_4bit = (sign_bit << 3) | e2m1_idx # 4-bit: (sign << 3) | index
# Pack 2 E2M1 values per byte: even→low nibble, odd→high nibble
PACKED_K: tl.constexpr = BLOCK_K // 2 # 64
e2m1_pairs = tl.reshape(e2m1_4bit, [PACKED_K, 2])
even, odd = tl.split(e2m1_pairs) # splits last axis (size 2) into two [PACKED_K] tensors
packed_byte = (odd.to(tl.uint8) << 4) | even.to(tl.uint8)
packed_k_offsets = k_block_id * PACKED_K + tl.arange(0, PACKED_K)
packed_k_mask = packed_k_offsets < (hidden_size // 2)
tl.store(
x_fp8 + token_id * x_stride_m + k_offsets * x_stride_k,
fp8,
mask=k_mask,
x_fp4 + token_id * x_stride_m + packed_k_offsets * x_stride_k,
packed_byte,
mask=packed_k_mask,
)
scale_offsets = tl.arange(0, num_groups)
packed_scale = tl.sum(scale_exp << (scale_offsets * 8), axis=0).to(tl.int32)
tl.store(
x_sf + token_id * x_sf_stride_m + k_block_id * x_sf_stride_k,
packed_scale,
)
# Pack UE4M3 bytes into int32 (NVFP4: group_size=16, 4 groups per 64 elements)
# 8 groups per k_block of 128 → 2 int32s per k_block
# int32 can only pack 4 bytes (shifts >= 32 are UB on GPU), so split into two packs
scale_offsets = tl.arange(0, num_groups) # [0..7]
first_half = scale_offsets < 4 # groups 0-3 → int32[0]
second_half = scale_offsets >= 4 # groups 4-7 → int32[1]
packed_lo = tl.sum(
tl.where(first_half, scale_e4m3_bits.to(tl.int32) << (scale_offsets * 8), 0),
axis=0,
).to(tl.int32)
packed_hi = tl.sum(
tl.where(second_half, scale_e4m3_bits.to(tl.int32) << ((scale_offsets - 4) * 8), 0),
axis=0,
).to(tl.int32)
# Write 2 int32s per k_block: x_sf shape is (M, K//64) = (M, num_k_blocks * 2)
sf_base = token_id * x_sf_stride_m + k_block_id * 2 * x_sf_stride_k
tl.store(x_sf + sf_base, packed_lo)
tl.store(x_sf + sf_base + x_sf_stride_k, packed_hi)
if k_block_id == 0:
topk_offsets = tl.arange(0, BLOCK_TOPK)
@@ -351,8 +384,8 @@ def _stage_deepseek_v4_mega_moe_inputs(
hidden_states: torch.Tensor,
topk_weights: torch.Tensor,
topk_ids: torch.Tensor,
x_fp8: torch.Tensor,
x_sf: torch.Tensor,
x_fp4: torch.Tensor, # uint8, shape (M, K//2)
x_sf: torch.Tensor, # int32, shape (M, K//64)
topk_idx_out: torch.Tensor,
topk_weights_out: torch.Tensor,
) -> None:
@@ -376,7 +409,7 @@ def _stage_deepseek_v4_mega_moe_inputs(
block_topk = triton.next_power_of_2(top_k)
_deepseek_v4_stage_mega_moe_inputs_kernel[grid](
hidden_states,
x_fp8,
x_fp4,
x_sf,
topk_ids,
topk_weights,
@@ -384,8 +417,8 @@ def _stage_deepseek_v4_mega_moe_inputs(
topk_weights_out,
hidden_states.stride(0),
hidden_states.stride(1),
x_fp8.stride(0),
x_fp8.stride(1),
x_fp4.stride(0),
x_fp4.stride(1),
x_sf.stride(0),
x_sf.stride(1),
topk_ids.stride(0),
@@ -399,7 +432,7 @@ def _stage_deepseek_v4_mega_moe_inputs(
hidden_size,
top_k,
BLOCK_K=block_k,
GROUP_K=32,
GROUP_K=16, # NVFP4: group_size=16 (scale_vec::4X)
BLOCK_TOPK=block_topk,
num_warps=4,
)
@@ -425,8 +458,21 @@ def make_deepseek_v4_expert_params_mapping(
class DeepseekV4MegaMoEExperts(nn.Module):
"""MegaMoE experts for DeepSeek V4 with NVFP4 quantization.
Loads NVFP4 expert weights (E2M1 packed uint8 + float8_e4m3fn block scales
+ float32 global scales) and feeds them natively to the DeepGEMM
fp8_nvfp4_mega_moe kernel (kind::mxf4nvf4.scale_vec::4X).
No conversion to MXFP4. Experts stay NVFP4. The global scale (weight_scale_2)
is folded into the block scales before kernel consumption.
"""
_symm_buffer_cache: dict[tuple[int, int, int, int, int, int, int], object] = {}
# NVFP4 E2M1 lookup table (positive values, sign from bit 3)
E2M1_LUT = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]
# MXFP4 E2M1 is the same format
def __init__(
self,
vllm_config: VllmConfig,
@@ -451,52 +497,83 @@ class DeepseekV4MegaMoEExperts(nn.Module):
self.max_num_tokens = vllm_config.scheduler_config.max_num_batched_tokens
weight_attrs = {"weight_loader": self.weight_loader}
# NVFP4 weights: E2M1 packed as uint8, 2 values per byte
self.w13_weight = nn.Parameter(
torch.zeros(
num_local_experts,
2 * intermediate_size,
hidden_size // 2,
dtype=torch.uint8,
dtype=torch.int8,
),
requires_grad=False,
)
set_weight_attrs(self.w13_weight, weight_attrs)
# NVFP4 block scales: float8_e4m3fn, group_size=16
# Shape: [num_local_experts, 2*intermediate_size, hidden_size // 16]
self.w13_weight_scale = nn.Parameter(
torch.zeros(
num_local_experts,
2 * intermediate_size,
hidden_size // 32,
dtype=torch.uint8,
hidden_size // 16,
dtype=torch.float8_e4m3fn,
),
requires_grad=False,
)
set_weight_attrs(self.w13_weight_scale, weight_attrs)
self.w13_weight_scale.quant_method = "block"
# NVFP4 global scales: float32, per-expert
self.w13_weight_scale_2 = nn.Parameter(
torch.zeros(num_local_experts, dtype=torch.float32),
requires_grad=False,
)
set_weight_attrs(self.w13_weight_scale_2, weight_attrs)
# NVFP4 activation scales: float32, per-expert
self.w13_input_scale = nn.Parameter(
torch.zeros(num_local_experts, dtype=torch.float32),
requires_grad=False,
)
set_weight_attrs(self.w13_input_scale, weight_attrs)
self.w2_weight = nn.Parameter(
torch.zeros(
num_local_experts,
hidden_size,
intermediate_size // 2,
dtype=torch.uint8,
dtype=torch.int8,
),
requires_grad=False,
)
set_weight_attrs(self.w2_weight, weight_attrs)
# NVFP4 block scales for w2
self.w2_weight_scale = nn.Parameter(
torch.zeros(
num_local_experts,
hidden_size,
intermediate_size // 32,
dtype=torch.uint8,
intermediate_size // 16,
dtype=torch.float8_e4m3fn,
),
requires_grad=False,
)
set_weight_attrs(self.w2_weight_scale, weight_attrs)
self.w2_weight_scale.quant_method = "block"
self.w2_weight_scale_2 = nn.Parameter(
torch.zeros(num_local_experts, dtype=torch.float32),
requires_grad=False,
)
set_weight_attrs(self.w2_weight_scale_2, weight_attrs)
self.w2_input_scale = nn.Parameter(
torch.zeros(num_local_experts, dtype=torch.float32),
requires_grad=False,
)
set_weight_attrs(self.w2_input_scale, weight_attrs)
self._transformed_l1_weights: tuple[torch.Tensor, torch.Tensor] | None = None
self._transformed_l2_weights: tuple[torch.Tensor, torch.Tensor] | None = None
@@ -519,21 +596,25 @@ class DeepseekV4MegaMoEExperts(nn.Module):
weight_name: str,
shard_id: str,
expert_id: int,
return_success: bool = False,
) -> bool | None:
) -> bool:
local_expert_id = self._map_global_expert_id(expert_id)
if local_expert_id == -1:
return False if return_success else None
return False
# Scalar params (weight_scale_2, input_scale): 1D per-expert
if "weight_scale_2" in weight_name or "input_scale" in weight_name:
param.data[local_expert_id].copy_(loaded_weight)
return True
expert_data = param.data[local_expert_id]
if shard_id in ("w1", "w3"):
if "w13_" not in weight_name:
return False if return_success else None
return False
shard_offset = 0 if shard_id == "w1" else self.intermediate_size
expert_data = expert_data.narrow(0, shard_offset, self.intermediate_size)
elif shard_id == "w2":
if "w2_" not in weight_name:
return False if return_success else None
return False
else:
raise ValueError(f"Unsupported expert shard id: {shard_id}")
@@ -544,11 +625,7 @@ class DeepseekV4MegaMoEExperts(nn.Module):
f"vs checkpoint {tuple(loaded_weight.shape)}"
)
expert_data.copy_(loaded_weight)
return True if return_success else None
@staticmethod
def _ue8m0_uint8_to_float(sf: torch.Tensor) -> torch.Tensor:
return (sf.to(torch.int32) << 23).view(torch.float32)
return True
def _check_runtime_supported(self) -> None:
if not torch.cuda.is_available():
@@ -558,7 +635,7 @@ class DeepseekV4MegaMoEExperts(nn.Module):
raise NotImplementedError(
"DeepSeek V4 MegaMoE expert weights must be loaded on CUDA."
)
if torch.cuda.get_device_capability(device)[0] != 10:
if torch.cuda.get_device_capability(device)[0] < 10:
raise NotImplementedError("DeepGEMM MegaMoE requires SM100 GPUs.")
if self.hidden_size % 128 != 0 or self.intermediate_size % 128 != 0:
raise ValueError(
@@ -571,41 +648,51 @@ class DeepseekV4MegaMoEExperts(nn.Module):
return
self._check_runtime_supported()
import vllm.third_party.deep_gemm as deep_gemm
from deep_gemm.mega import transform_nvfp4_weights_for_mega_moe
w13_scale = deep_gemm.transform_sf_into_required_layout(
self._ue8m0_uint8_to_float(self.w13_weight_scale.data).contiguous(),
2 * self.intermediate_size,
self.hidden_size,
(1, 32),
self.num_local_experts,
)
w2_scale = deep_gemm.transform_sf_into_required_layout(
self._ue8m0_uint8_to_float(self.w2_weight_scale.data).contiguous(),
self.hidden_size,
self.intermediate_size,
(1, 32),
self.num_local_experts,
)
# === Native NVFP4 path ===
# The DeepGEMM nvfp4 mega_moe kernel consumes NVFP4 directly:
# - E2M1 packed uint8 (same as checkpoint)
# - UE4M3 block scales (float8_e4m3fn), group_size=16
# - float32 global scale folded into block scales
# No conversion to MXFP4. Experts stay NVFP4.
# Fold global scales into block scales and transform for the kernel
self._transformed_l1_weights, self._transformed_l2_weights = (
deep_gemm.transform_weights_for_mega_moe(
(self.w13_weight.data.view(torch.int8).contiguous(), w13_scale),
(self.w2_weight.data.view(torch.int8).contiguous(), w2_scale),
transform_nvfp4_weights_for_mega_moe(
(self.w13_weight.data.contiguous(),
self.w13_weight_scale.data.contiguous()),
(self.w2_weight.data.contiguous(),
self.w2_weight_scale.data.contiguous()),
l1_weight_scale_2=self.w13_weight_scale_2.data.contiguous(),
l2_weight_scale_2=self.w2_weight_scale_2.data.contiguous(),
)
)
# Drop the original loader-side parameters: the MegaMoE kernels only
# consume the transformed views above. transform_weights_for_mega_moe
# allocates a fresh tensor for the L1 weight (see _interleave_l1_weights)
# and fresh SF tensors for L1/L2; the L2 weight is the only tensor that
# aliases the original storage, and _transformed_l2_weights still holds
# it, so the storage stays live after we drop the Parameter.
# Drop the original loader-side parameters
self.w13_weight = None
self.w13_weight_scale = None
self.w13_weight_scale_2 = None
self.w13_input_scale = None
self.w2_weight = None
self.w2_weight_scale = None
self.w2_weight_scale_2 = None
self.w2_input_scale = None
@staticmethod
def _ue8m0_to_float32(sf: torch.Tensor) -> torch.Tensor:
"""Convert NVFP4 block scales (float8_e4m3fn / UE4M3) to float32.
Checkpoint stores float8_e4m3fn (standard NVFP4 spec, NOT UE8M0).
Simple .to(float32) is correct — shift-by-23 was wrong (Bug #7 fix).
"""
return sf.to(torch.float32)
def get_symm_buffer(self):
import vllm.third_party.deep_gemm as deep_gemm
import deep_gemm
from deep_gemm.mega import SymmBuffer, get_symm_buffer_for_nvfp4_mega_moe
group = get_ep_group().device_group
device = torch.accelerator.current_device_index()
@@ -620,7 +707,8 @@ class DeepseekV4MegaMoEExperts(nn.Module):
)
symm_buffer = self._symm_buffer_cache.get(key)
if symm_buffer is None:
symm_buffer = deep_gemm.get_symm_buffer_for_mega_moe(
# NVFP4 SymmBuffer: 2x SF size due to group_size=16
symm_buffer = get_symm_buffer_for_nvfp4_mega_moe(
group,
self.num_experts,
self.max_num_tokens,
@@ -686,7 +774,8 @@ class DeepseekV4MegaMoEExperts(nn.Module):
assert self._transformed_l1_weights is not None
assert self._transformed_l2_weights is not None
deep_gemm.fp8_fp4_mega_moe(
from deep_gemm.mega import fp8_nvfp4_mega_moe
fp8_nvfp4_mega_moe(
y,
self._transformed_l1_weights,
self._transformed_l2_weights,
@@ -694,6 +783,8 @@ class DeepseekV4MegaMoEExperts(nn.Module):
activation_clamp=activation_clamp,
fast_math=fast_math,
)
if os.environ.get('NVFP4_DEBUG_SYNC', '') == '1':
torch.cuda.synchronize()
DeepseekV4MegaMoEExperts.weight_loader.supports_moe_loading = True # type: ignore[attr-defined]
@@ -751,9 +842,7 @@ class DeepseekV4MoE(nn.Module):
config = vllm_config.model_config.hf_config
quant_config = vllm_config.quant_config
self.prefix = prefix
self.use_mega_moe = (
vllm_config.kernel_config.moe_backend == "deep_gemm_mega_moe"
)
self.use_mega_moe = True # Force mega_moe for NVFP4 pipeline
if self.use_mega_moe and not vllm_config.parallel_config.enable_expert_parallel:
raise NotImplementedError(
"DeepSeek V4 MegaMoE currently requires expert parallel. "
@@ -774,12 +863,7 @@ class DeepseekV4MoE(nn.Module):
raise NotImplementedError(
"DeepSeek V4 MegaMoE currently supports sqrtsoftplus routing only."
)
if self.use_mega_moe and getattr(config, "expert_dtype", "fp4") != "fp4":
raise NotImplementedError(
"DeepSeek V4 MegaMoE only supports fp4 experts; got expert_dtype="
f"{config.expert_dtype!r}. Drop --kernel-config moe_backend="
"deep_gemm_mega_moe for this checkpoint."
)
# NVFP4 experts work with mega_moe via NVFP4→MXFP4 conversion in finalize_weights
self.gate = GateLinear(
config.hidden_size,
@@ -1045,7 +1129,7 @@ class DeepseekV4Attention(nn.Module):
self.rope_parameters = config.rope_scaling
# Initialize rotary embedding BEFORE DeepseekV4MLAModules (which needs it)
rope_parameters = config.rope_parameters
rope_parameters = dict(config.rope_parameters)
rope_parameters["rope_theta"] = (
config.compress_rope_theta if self.compress_ratio > 1 else config.rope_theta
)
@@ -1262,9 +1346,7 @@ class DeepseekV4Model(nn.Module):
config = vllm_config.model_config.hf_config
quant_config = vllm_config.quant_config
self.config = config
self.use_mega_moe = (
vllm_config.kernel_config.moe_backend == "deep_gemm_mega_moe"
)
self.use_mega_moe = True # Force mega_moe for NVFP4 pipeline
if self.use_mega_moe and not vllm_config.parallel_config.enable_expert_parallel:
raise NotImplementedError(
"DeepSeek V4 MegaMoE currently requires expert parallel. "
@@ -1461,14 +1543,19 @@ class DeepseekV4Model(nn.Module):
else:
if ".experts." in name:
# E8M0 scales are stored as float8_e8m0fnu in
# checkpoints but the MoE param is uint8. copy_()
# would do a numeric conversion (e.g. 2^-7 → 0),
# destroying the raw exponent bytes.
# MXFP4 checkpoints but NVFP4 uses float8_e4m3fn.
# The uint8 view+copy path is only valid for MXFP4;
# for NVFP4 it would paste raw E8M0 bytes into an
# E4M3 buffer, producing garbage.
if (
"weight_scale" in name
and loaded_weight.dtype == torch.float8_e8m0fnu
):
loaded_weight = loaded_weight.view(torch.uint8)
assert False, (
f"E8M0 weight_scale encountered for NVFP4 experts "
f"({name}) — this is only valid for MXFP4. "
f"Check checkpoint dtype."
)
for mapping in expert_mapping:
param_name, weight_name, expert_id, shard_id = mapping
if weight_name not in name:
@@ -1489,7 +1576,6 @@ class DeepseekV4Model(nn.Module):
name_mapped,
shard_id=shard_id,
expert_id=expert_id,
return_success=True,
)
if success:
name = name_mapped
@@ -1537,16 +1623,10 @@ class DeepseekV4Model(nn.Module):
weight_scale_2_val = global_amax / (6.0 * 448.0)
weight_scale_2 = weight_scale_2_val.to(torch.float32)
# Per-block scale (weight_scale): UE8M0 format
# scale_fmt=ue8m0: block_scale = 2^(exp-127), stored as
# uint8 exponent byte viewed as float8_e4m3fn
# Per-block scale (weight_scale): UE4M3 format (standard NVFP4)
# block_scale = amax / (6.0 * weight_scale_2)
block_scale = amax / (6.0 * weight_scale_2_val)
# Convert to UE8M0: floor to nearest power of 2
# UE8M0 exponent = floor(log2(block_scale)) + 127
block_scale_clamped = block_scale.clamp(min=2**-127)
block_scale_exp = torch.floor(torch.log2(block_scale_clamped)).to(torch.int32) + 127
block_scale_exp = block_scale_exp.clamp(0, 254).to(torch.uint8)
weight_scale = block_scale_exp.view(torch.float8_e4m3fn)
weight_scale = block_scale.clamp(0.0, 448.0).to(torch.float8_e4m3fn)
# Quantize to FP4 (E2M1)
# E2M1 LUT: 0, 0.5, 1, 1.5, 2, 3, 4, 6 (positive)
@@ -1554,10 +1634,8 @@ class DeepseekV4Model(nn.Module):
[0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0],
dtype=torch.float32, device=w_bf16.device,
)
# For each block, dequantize the block scale from UE8M0
block_scale_f32 = (block_scale_exp.to(torch.int32) << 23).view(torch.float32)
# Scale the weight values: normalized = w / (block_scale * weight_scale_2)
# We need to find the nearest FP4 value
block_scale_f32 = block_scale.clamp(0.0, 448.0)
scaled = w_blocks / (block_scale_f32.unsqueeze(-1) * weight_scale_2_val)
# Find nearest FP4 index (0-7 for magnitude)
# Use absolute value for matching, then apply sign
@@ -1575,7 +1653,7 @@ class DeepseekV4Model(nn.Module):
even = fp4_flat[:, 0::2] # lower nibble
odd = fp4_flat[:, 1::2] # upper nibble
packed = (odd << 4) | even
weight_packed = packed.to(torch.uint8)
weight_packed = packed.to(torch.uint8).view(torch.int8)
# Reshape weight_scale to [out, n_blocks]
weight_scale_2d = weight_scale.reshape(out_dim, n_blocks)
@@ -1648,7 +1726,7 @@ class DeepseekV4Model(nn.Module):
- compressor.fused_wkv_wgate: Dequant NVFP4->bf16 (used via direct
torch.mm in attention parallel stream)
- shared_experts (gate_up_proj, down_proj): Dequant NVFP4->bf16
- MoE experts: Stay in native NVFP4 (ModelOptNvFp4FusedMoE)
- MoE experts: Handled by DeepseekV4MegaMoEExperts (NVFP4→MXFP4)
"""
E2M1_LUT = torch.tensor(
[0, 0.5, 1, 1.5, 2, 3, 4, 6], dtype=torch.bfloat16
@@ -1667,6 +1745,10 @@ class DeepseekV4Model(nn.Module):
fp8_from_bf16 = 0
bf16_converted = 0
compressor_converted = 0
# Build shard index once for compressor reconstruction (avoids N×M full-shard loads)
_shard_index = self._build_shard_index("/model") if os.path.isdir("/model") else None
for layer_idx, layer in enumerate(self.layers):
attn = layer.attn
@@ -1677,13 +1759,11 @@ class DeepseekV4Model(nn.Module):
mod = getattr(attn, proj_name)
if not hasattr(mod, "weight"):
continue
if mod.weight.dtype == torch.uint8:
if mod.weight.dtype in (torch.uint8, torch.int8):
# NVFP4 -> dequant to bf16 -> requant to FP8
self._convert_nvfp4_to_fp8(mod, E2M1_LUT, FP8_MAX)
fp8_converted += 1
elif mod.weight.dtype == torch.bfloat16:
# modelopt did NOT quantize o_a_proj — it's bf16 already.
# Convert bf16 -> FP8 directly for fp8_einsum path.
self._convert_bf16_to_fp8(mod, FP8_MAX)
fp8_from_bf16 += 1
@@ -1692,7 +1772,7 @@ class DeepseekV4Model(nn.Module):
if not hasattr(attn, proj_name):
continue
mod = getattr(attn, proj_name)
if not hasattr(mod, "weight") or mod.weight.dtype != torch.uint8:
if not hasattr(mod, "weight") or mod.weight.dtype not in (torch.uint8, torch.int8):
continue
self._dequant_nvfp4_to_bf16(mod, E2M1_LUT)
bf16_converted += 1
@@ -1710,14 +1790,14 @@ class DeepseekV4Model(nn.Module):
compressor = getattr(mla_attn, "compressor", None)
if compressor is not None and hasattr(compressor, "fused_wkv_wgate"):
compressor_converted += self._reconstruct_compressor_weight(
compressor.fused_wkv_wgate, attn, layer_idx, E2M1_LUT)
compressor.fused_wkv_wgate, attn, layer_idx, E2M1_LUT, _shard_index=_shard_index)
# Indexer compressor (C4A layers only)
indexer = getattr(mla_attn, "indexer", None)
if indexer is not None:
idx_compressor = getattr(indexer, "compressor", None)
if idx_compressor is not None and hasattr(idx_compressor, "fused_wkv_wgate"):
compressor_converted += self._reconstruct_compressor_weight(
idx_compressor.fused_wkv_wgate, indexer, layer_idx, E2M1_LUT, sub_path=".indexer")
idx_compressor.fused_wkv_wgate, indexer, layer_idx, E2M1_LUT, sub_path=".indexer", _shard_index=_shard_index)
# Shared experts
ffn = layer.ffn
@@ -1726,7 +1806,7 @@ class DeepseekV4Model(nn.Module):
if not hasattr(ffn.shared_experts, proj_name):
continue
mod = getattr(ffn.shared_experts, proj_name)
if not hasattr(mod, "weight") or mod.weight.dtype != torch.uint8:
if not hasattr(mod, "weight") or mod.weight.dtype not in (torch.uint8, torch.int8):
continue
self._dequant_nvfp4_to_bf16(mod, E2M1_LUT)
bf16_converted += 1
@@ -1749,9 +1829,8 @@ class DeepseekV4Model(nn.Module):
# Dequantize with scales
if hasattr(mod, "weight_scale") and hasattr(mod, "weight_scale_2"):
# scale_fmt=ue8m0: weight_scale bytes are E8M0 format (power-of-2 only).
# A simple .to(float32) misinterprets them as E4M3. Must reinterpret
# the raw uint8 bits as IEEE 754 exponent field.
# NVFP4 block scales are float8_e4m3fn (UE4M3) — standard spec.
# .to(float32) is correct (Bug #7: shift-by-23 was wrong, reverted)
block_scale = self._ue8m0_to_float32(mod.weight_scale.data)
if block_scale.dim() == 2 and w_bf16.dim() == 2:
block_size = w_bf16.shape[1] // block_scale.shape[1]
@@ -1773,8 +1852,10 @@ class DeepseekV4Model(nn.Module):
else:
w_dequant = w_bf16
# Replace weight with bf16 version
# Free source tensors eagerly to avoid holding uint8+bf16+fp32 simultaneously
del w_uint8, w_bf16
mod.weight = torch.nn.Parameter(w_dequant, requires_grad=False)
del w_dequant
from vllm.model_executor.layers.linear import UnquantizedLinearMethod
mod.quant_method = UnquantizedLinearMethod()
for attr in ("weight_scale", "weight_scale_2", "input_scale",
@@ -1794,7 +1875,7 @@ class DeepseekV4Model(nn.Module):
# Dequantize with scales
if hasattr(mod, "weight_scale") and hasattr(mod, "weight_scale_2"):
# scale_fmt=ue8m0: reinterpret E8M0 bytes as float32
# NVFP4 block scales: float8_e4m3fn → .to(float32) (Bug #7 reverted)
block_scale = self._ue8m0_to_float32(mod.weight_scale.data)
if block_scale.dim() == 2 and w_bf16.dim() == 2:
block_size = w_bf16.shape[1] // block_scale.shape[1]
@@ -1857,21 +1938,38 @@ class DeepseekV4Model(nn.Module):
bmm_batch_size=bmm_batch_size,
)
# Free source tensors eagerly
del w_uint8, w_bf16, w_dequant
mod.weight = torch.nn.Parameter(w_fp8, requires_grad=False)
del w_fp8
# weight_scale_inv is what the attention runtime reads as b_scale
# for deepseek_v4_fp8_einsum -> DeepGEMM fp8_einsum.
# It must be the DeepGEMM-formatted block scale (dg_ws), NOT the
# per-tensor scalar. See: deepseek_v4_attention.py line 319.
mod.weight_scale_inv = torch.nn.Parameter(ws, requires_grad=False)
# weight_scale is not used at runtime for BMM layers; remove it
# to avoid confusing other code paths.
del ws
from vllm.model_executor.layers.linear import UnquantizedLinearMethod
mod.quant_method = UnquantizedLinearMethod()
for attr in ("weight_scale", "weight_scale_2", "input_scale"):
if hasattr(mod, attr):
delattr(mod, attr)
from vllm.model_executor.layers.linear import UnquantizedLinearMethod
mod.quant_method = UnquantizedLinearMethod()
def _reconstruct_compressor_weight(self, fused_mod, parent_mod, layer_idx, e2m1_lut, sub_path=""):
@staticmethod
def _build_shard_index(ckpt_dir: str) -> dict[str, str]:
"""Build key→shard_path index from safetensors metadata (no tensor I/O)."""
import glob
from safetensors import safe_open
index = {}
for shard_file in sorted(glob.glob(os.path.join(ckpt_dir, "model-*.safetensors"))):
try:
with safe_open(shard_file, framework="pt") as f:
for key in f.keys():
index[key] = shard_file
except Exception:
continue
return index
def _reconstruct_compressor_weight(self, fused_mod, parent_mod, layer_idx, e2m1_lut, sub_path="", _shard_index=None):
"""Reconstruct compressor fused_wkv_wgate from checkpoint.
Compressor weights are SKIPPED during loading because NVFP4 uint8 data
@@ -1879,8 +1977,7 @@ class DeepseekV4Model(nn.Module):
We read the original uint8 data from the safetensors checkpoint, unpack
E2M1, dequantize, and stack into the fused weight param.
"""
import glob
from safetensors.torch import load_file
from safetensors import safe_open
# Find the checkpoint directory
# The model weights are mounted at /model in Docker
@@ -1895,49 +1992,45 @@ class DeepseekV4Model(nn.Module):
# We read from checkpoint (before mapper), so use original names
layer_prefix = f"model.layers.{layer_idx}.self_attn.compressor{sub_path}"
# Find which shard contains this layer's compressor weights
wkv_key = f"{layer_prefix}.kv_proj.weight"
wgate_key = f"{layer_prefix}.gate_proj.weight"
wkv_scale_key = f"{layer_prefix}.kv_proj.weight_scale"
wgate_scale_key = f"{layer_prefix}.gate_proj.weight_scale"
wkv_scale2_key = f"{layer_prefix}.kv_proj.weight_scale_2"
wgate_scale2_key = f"{layer_prefix}.gate_proj.weight_scale_2"
wkv_iscale_key = f"{layer_prefix}.kv_proj.input_scale"
wgate_iscale_key = f"{layer_prefix}.gate_proj.input_scale"
# All keys we need from the checkpoint
keys = {
'wkv_uint8': f"{layer_prefix}.kv_proj.weight",
'wgate_uint8': f"{layer_prefix}.gate_proj.weight",
'wkv_block_scale': f"{layer_prefix}.kv_proj.weight_scale",
'wgate_block_scale': f"{layer_prefix}.gate_proj.weight_scale",
'wkv_global_scale': f"{layer_prefix}.kv_proj.weight_scale_2",
'wgate_global_scale': f"{layer_prefix}.gate_proj.weight_scale_2",
'wkv_input_scale': f"{layer_prefix}.kv_proj.input_scale",
'wgate_input_scale': f"{layer_prefix}.gate_proj.input_scale",
}
# Load from safetensors
wkv_uint8 = None
wgate_uint8 = None
wkv_block_scale = None
wgate_block_scale = None
wkv_global_scale = None
wgate_global_scale = None
wkv_input_scale = None
wgate_input_scale = None
shard_files = sorted(glob.glob(os.path.join(ckpt_dir, "model-*.safetensors")))
for shard_file in shard_files:
# Read tensors using shard index for targeted access (no full-shard loads)
tensors = {}
for name, key in keys.items():
shard_path = (_shard_index or {}).get(key)
if shard_path is None:
continue
try:
shard_data = load_file(shard_file)
with safe_open(shard_path, framework="pt") as f:
if key in f.keys():
tensors[name] = f.get_tensor(key)
except Exception:
continue
if wkv_key in shard_data:
wkv_uint8 = shard_data[wkv_key]
wkv_block_scale = shard_data.get(wkv_scale_key)
wkv_global_scale = shard_data.get(wkv_scale2_key)
wkv_input_scale = shard_data.get(wkv_iscale_key)
if wgate_key in shard_data:
wgate_uint8 = shard_data[wgate_key]
wgate_block_scale = shard_data.get(wgate_scale_key)
wgate_global_scale = shard_data.get(wgate_scale2_key)
wgate_input_scale = shard_data.get(wgate_iscale_key)
if wkv_uint8 is not None and wgate_uint8 is not None:
break
wkv_uint8 = tensors.get('wkv_uint8')
wgate_uint8 = tensors.get('wgate_uint8')
if wkv_uint8 is None or wgate_uint8 is None:
# Layer might not have a compressor (compress_ratio=1 layers)
return 0
wkv_block_scale = tensors.get('wkv_block_scale')
wgate_block_scale = tensors.get('wgate_block_scale')
wkv_global_scale = tensors.get('wkv_global_scale')
wgate_global_scale = tensors.get('wgate_global_scale')
wkv_input_scale = tensors.get('wkv_input_scale')
wgate_input_scale = tensors.get('wgate_input_scale')
device = fused_mod.weight.device
wkv_uint8 = wkv_uint8.to(device)
wgate_uint8 = wgate_uint8.to(device)
@@ -1949,7 +2042,7 @@ class DeepseekV4Model(nn.Module):
# Dequantize with scales
def _dequant(w_bf16, block_scale, global_scale, input_scale):
if block_scale is not None and global_scale is not None:
# scale_fmt=ue8m0: reinterpret E8M0 bytes as float32
# NVFP4 block scales: float8_e4m3fn → .to(float32) (Bug #7 reverted)
block_scale = self._ue8m0_to_float32(block_scale.to(device))
if block_scale.dim() == 2 and w_bf16.dim() == 2:
block_size = w_bf16.shape[1] // block_scale.shape[1]
@@ -1972,8 +2065,6 @@ class DeepseekV4Model(nn.Module):
# fused_wkv_wgate.weight = cat([wkv, wgate], dim=0) → (2*head_dim, hidden_size)
w_fused = torch.cat([wkv_dequant, wgate_dequant], dim=0)
# DEBUG: log shapes to diagnose compressor weight mismatch
print(f"NVFP4 compressor layer {layer_idx}: wkv={wkv_dequant.shape}, wgate={wgate_dequant.shape}, fused={w_fused.shape}, existing_param={fused_mod.weight.shape}")
# Replace the weight
fused_mod.weight = torch.nn.Parameter(w_fused, requires_grad=False)
@@ -2041,17 +2132,12 @@ class DeepseekV4Model(nn.Module):
@staticmethod
def _ue8m0_to_float32(sf: torch.Tensor) -> torch.Tensor:
"""Convert UE8M0 (E8M0 power-of-2) scale bytes to float32.
"""Convert NVFP4 block scales (float8_e4m3fn / UE4M3) to float32.
NVFP4 checkpoints with scale_fmt=ue8m0 store per-block weight scales as
E8M0 format (8-bit exponent, no mantissa). The value = 2^(raw_byte - 127).
The bytes are loaded as float8_e4m3fn by safetensors, but a simple
.to(float32) misinterprets them as E4M3 (which has mantissa bits).
Correct conversion: place the raw uint8 bits into the exponent field
of an IEEE 754 float32 (bits 23-30), yielding 2^(raw-127) * implicit_1.
Checkpoint stores float8_e4m3fn (standard NVFP4 spec, NOT UE8M0).
Simple .to(float32) is correct — shift-by-23 was wrong (Bug #7 fix).
"""
raw_uint8 = sf.view(torch.uint8)
return (raw_uint8.to(torch.int32) << 23).view(torch.float32)
return sf.to(torch.float32)
def _unpack_nvfp4_to_bf16(self, w_uint8, e2m1_lut, device):
"""Unpack NVFP4 uint8 packed weights to bf16 using E2M1 format."""
@@ -2269,6 +2355,9 @@ class DeepseekV4ForCausalLM(nn.Module):
loaded_params = loader.load_weights(weights, mapper=self.hf_to_vllm_mapper)
self.model.finalize_mega_moe_weights()
self.model._convert_nvfp4_post_load()
if os.environ.get('NVFP4_DEBUG_SYNC', '') == '1':
torch.cuda.synchronize()
print("[NVFP4] post-load conversion done, CUDA OK")
return loaded_params
def get_expert_mapping(self) -> list[tuple[str, str, int, str]]:

File diff suppressed because it is too large Load Diff

File diff suppressed because it is too large Load Diff

View File

@@ -1,66 +0,0 @@
#!/usr/bin/env python3
"""
Patch transformers' finegrained_fp8.py to reject DeepGEMM on Blackwell (SM100+).
DeepGEMM only supports Hopper (SM90). On Blackwell GPUs, _load_deepgemm_kernel()
passes the SM90 check but then fails trying to download/load the kernel from HF Hub
(rate limits, missing builds). This patch adds a check for SM100+ that raises
ImportError, which the existing try/except in w8a8_fp8_matmul catches, falling
back to the Triton finegrained-fp8 kernel.
Also needed because the Triton finegrained-fp8 matmul has shape mismatches during
modelopt calibration (K mismatch on quantized expert weights). The real fix is to
upcast the model to BF16 first (see scripts/upcast_to_bf16.py).
Usage:
python3 patch_finegrained_fp8_blackwell.py [path_to_finegrained_fp8.py]
If no path given, auto-detects from the installed transformers package.
"""
import sys
import os
def patch(fp8_file: str):
with open(fp8_file) as f:
content = f.read()
old = """ # DeepGEMM requires Hopper (SM90) or newer for FP8 WGMMA instructions
major = torch.cuda.get_device_capability()[0]
if major < 9:
raise ImportError(
f"DeepGEMM requires a Hopper (SM90+) or newer GPU, but the current device "
f"has compute capability {major}.x. Use a different `experts_implementation`."
)"""
new = """ # DeepGEMM requires Hopper (SM90) specifically - not yet supported on Blackwell (SM100+)
major = torch.cuda.get_device_capability()[0]
if major < 9:
raise ImportError(
f"DeepGEMM requires a Hopper (SM90+) or newer GPU, but the current device "
f"has compute capability {major}.x. Use a different `experts_implementation`."
)
if major >= 10:
raise ImportError(
f"DeepGEMM is not yet supported on Blackwell (SM100+). "
f"Use a different `experts_implementation`."
)"""
if old in content:
content = content.replace(old, new)
with open(fp8_file, "w") as f:
f.write(content)
print(f"PATCHED: {fp8_file} — DeepGEMM now rejected on Blackwell (SM100+)")
else:
print("Patch target not found (may already be patched or different version)")
if __name__ == "__main__":
if len(sys.argv) > 1:
fp8_file = sys.argv[1]
else:
import transformers.integrations.finegrained_fp8 as fp8
import inspect
fp8_file = inspect.getfile(fp8)
patch(fp8_file)

View File

@@ -1,135 +0,0 @@
#!/usr/bin/env python3
"""
Patch vllm's DeepSeek V4 weight mapper to handle modelopt's NVFP4 export naming.
modelopt exports weights with naming differences from what vllm's
_make_deepseek_v4_weights_mapper + load_weights code expects:
1. Expert projections: modelopt uses gate_proj/up_proj/down_proj, vllm expects w1/w3/w2
2. Shared expert projections: same gate_proj/up_proj naming, needs w1/w3 for stacking
3. Compressor projections: kv_proj→wkv, gate_proj→wgate for fused stacking
4. Attention projections: self_attn prefix, kv_proj→wkv for fused stacking, etc.
5. Expert NVFP4 scales: weight_scale_2 and input_scale have no matching mega_moe params
CRITICAL: DeepseekV4ForCausalLM.hf_to_vllm_mapper is a CLASS attribute set at
module import time. Simply patching _make_deepseek_v4_weights_mapper doesn't help
because the class already cached the old mapper. We must also update the class
attribute directly. Since expert_dtype=="fp4", __init__ doesn't recreate the mapper.
Drop into container as:
python3 /patches/patch_vllm_weights.py
Or add to docker-compose.yml command before vllm serve.
"""
import re
import sys
# Save original function BEFORE patching
_original_make_mapper = None
def make_patched_mapper(expert_dtype: str):
"""Create a WeightsMapper with modelopt NVFP4 naming patches applied."""
global _original_make_mapper
# Use the saved original, not the (possibly patched) module attribute
mapper = _original_make_mapper(expert_dtype)
# ══════════════════════════════════════════════════════════════════
# Regex mappings (applied FIRST by WeightsMapper, before substr)
# Order matters: skip patterns must come before rename patterns.
# ══════════════════════════════════════════════════════════════════
ordered_regexes = {}
# Skip expert NVFP4 scales that have no mega_moe params.
# MUST come before gate_proj→w1 etc. because after renaming,
# the key has "w1." not "gate_proj." and these patterns wouldn't match.
#
# modelopt's NVFP4 export includes weight_scale_2 (global scale) and
# input_scale (activation scale) for each expert projection. But the
# DeepseekV4MegaMoEExperts module only registers w13_weight_scale and
# w2_weight_scale (E8M0 block scales) — no weight_scale_2 or input_scale.
# Mapping to None tells WeightsMapper to skip these weights entirely.
ordered_regexes[re.compile(r"\.experts\.\d+\.\w+_proj\.weight_scale_2$")] = None
ordered_regexes[re.compile(r"\.experts\.\d+\.\w+_proj\.input_scale$")] = None
# Routed expert projections: gate_proj→w1, up_proj→w3, down_proj→w2
# We use regex (not substr) to match ONLY .experts.N. — NOT .shared_experts.
# Using substr ".down_proj." → ".w2." would also affect
# shared_experts.down_proj, breaking shared expert loading
# (vllm model uses down_proj, not w2, for shared experts).
ordered_regexes[re.compile(r"(\.experts\.\d+\.)gate_proj\.")] = r"\1w1."
ordered_regexes[re.compile(r"(\.experts\.\d+\.)up_proj\.")] = r"\1w3."
ordered_regexes[re.compile(r"(\.experts\.\d+\.)down_proj\.")] = r"\1w2."
# Preserve any existing regex mappings from the original mapper
if mapper.orig_to_new_regex:
ordered_regexes.update(mapper.orig_to_new_regex)
mapper.orig_to_new_regex = ordered_regexes
# ══════════════════════════════════════════════════════════════════
# Substr mappings (applied AFTER regex by WeightsMapper)
# ══════════════════════════════════════════════════════════════════
# 1. Attention: self_attn → attn.mla_attn mappings
# modelopt uses "self_attn" but vllm expects "attn" (mapped to "attn.mla_attn")
mapper.orig_to_new_substr[".self_attn.q_a_proj."] = ".attn.mla_attn.wq_a."
mapper.orig_to_new_substr[".self_attn.q_b_proj."] = ".attn.mla_attn.wq_b."
mapper.orig_to_new_substr[".self_attn.q_a_norm."] = ".attn.mla_attn.q_norm."
mapper.orig_to_new_substr[".self_attn.o_a_proj."] = ".attn.mla_attn.wo_a."
mapper.orig_to_new_substr[".self_attn.o_b_proj."] = ".attn.mla_attn.wo_b."
mapper.orig_to_new_substr[".self_attn.sinks"] = ".attn.mla_attn.attn_sink"
# CRITICAL: kv_proj must map to wkv (not kv_proj) because the stacking
# code looks for "attn.wkv" to stack into fused_wqa_wkv.
mapper.orig_to_new_substr[".self_attn.kv_proj."] = ".attn.mla_attn.wkv."
mapper.orig_to_new_substr[".self_attn.kv_norm."] = ".attn.mla_attn.kv_norm."
# Compressor: self_attn.compressor → attn.mla_attn.compressor
mapper.orig_to_new_substr[".self_attn.compressor."] = ".attn.mla_attn.compressor."
# Compressor projection renaming for stacking:
# vllm stacks compressor.wkv + compressor.wgate → compressor.fused_wkv_wgate
# modelopt exports as compressor.kv_proj and compressor.gate_proj
mapper.orig_to_new_substr[".compressor.kv_proj."] = ".compressor.wkv."
mapper.orig_to_new_substr[".compressor.gate_proj."] = ".compressor.wgate."
# 2. Shared expert projections: gate_proj→w1, up_proj→w3
# vllm stacks shared_experts.w1 + shared_experts.w3 into
# shared_experts.gate_up_proj. modelopt uses gate_proj/up_proj naming.
# down_proj stays as-is (vllm model uses down_proj directly).
mapper.orig_to_new_substr[".shared_experts.gate_proj."] = ".shared_experts.w1."
mapper.orig_to_new_substr[".shared_experts.up_proj."] = ".shared_experts.w3."
return mapper
def patch():
global _original_make_mapper
from vllm.model_executor.models import deepseek_v4
# 1. Save the original function BEFORE replacing it
_original_make_mapper = deepseek_v4._make_deepseek_v4_weights_mapper
# 2. Patch the function so __init__ calls also get our mapper
deepseek_v4._make_deepseek_v4_weights_mapper = make_patched_mapper
print("✓ Patched _make_deepseek_v4_weights_mapper function")
# 3. CRITICAL: Also update the CLASS attribute directly.
# DeepseekV4ForCausalLM.hf_to_vllm_mapper is set at class definition
# time (module import). Our function patch above doesn't retroactively
# update it. Since expert_dtype=="fp4", __init__ won't recreate it either.
# We MUST update the class attribute directly.
if hasattr(deepseek_v4, 'DeepseekV4ForCausalLM'):
deepseek_v4.DeepseekV4ForCausalLM.hf_to_vllm_mapper = make_patched_mapper("fp4")
print("✓ Updated DeepseekV4ForCausalLM.hf_to_vllm_mapper class attribute")
else:
print("⚠ DeepseekV4ForCausalLM not found (will be patched at import time)")
print("✓ All modelopt NVFP4 weight mapping patches applied")
if __name__ == "__main__":
patch()

View File

@@ -1,295 +0,0 @@
# SPDX-FileCopyrightText: Copyright (c) 2024 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""Base class for quantization modules."""
import contextlib
import warnings
from typing import Any
import torch
import torch.nn as nn
from modelopt.torch.opt.dynamic import DynamicModule, _DMRegistryCls
from modelopt.torch.utils.distributed import ParallelState
from ...tensor_quant import QUANT_DESC_8BIT_PER_TENSOR
from ...utils import is_torch_export_mode
from .tensor_quantizer import SequentialQuantizer, TensorQuantizer
__all__ = [
"QuantInputBase",
"QuantLinearConvBase",
"QuantModule",
"QuantModuleRegistry",
]
class QuantModule(DynamicModule):
"""A base class for quantized modules.
In addition, the class also provides ``parallel_state`` attribute that can be used to access
the parallel state of the module.
"""
_parallel_state: ParallelState
@classmethod
@torch.no_grad()
def convert(cls, module: nn.Module, **setup_kwargs: Any) -> "QuantModule":
"""Convert the module to a dynamic module."""
module = super().convert(module, **setup_kwargs)
# setup parallel state now that the module is converted
if module.parallel_state is None:
module._initialize_parallel_state()
return module
@property
def parallel_state(self) -> ParallelState | None:
"""Return the parallel state of the quant module."""
return getattr(self, "_parallel_state", None)
@parallel_state.setter
def parallel_state(self, parallel_state: ParallelState):
"""Set the parallel state of the dynamic module."""
assert isinstance(parallel_state, ParallelState), (
"parallel_state must be a ParallelState object!"
)
self._parallel_state = parallel_state
def _initialize_parallel_state(self):
"""Initialize the parallel state of the dynamic module.
This method is called only if the `QuantModule` does not have a `parallel_state` attribute
after `_setup` is called.
"""
if torch.distributed.is_initialized():
warnings.warn(
f"Distributed training is initialized but no parallel_state is set for {type(self)}. "
"Using default parallel_state which has data_parallel_group set to the default process group and "
"tensor_parallel_group is unspecified. "
"If you are using tensor parallelism for this module, you should set the parallel_state "
"in its `_setup` method."
)
self.parallel_state = ParallelState(data_parallel_group=None)
def modelopt_post_restore(self, prefix: str = ""):
"""Post-restore to correctly configure the TensorQuantizer states.
TensorQuantizer states are restored to their shape before saving. Now we need to further configure them.
1. For non-sharded modules this simply involves moving the TensorQuantizer states to the right device.
This applies for regular Pytorch models and HuggingFace models.
2. For sharded modules the restored states of TensorQuantizer could be incorrect. This is because
parallelism such as TP might have been changed between saving and resoring. So we need to re-calculate
the state shapes. Hence such modules should override this and implement their own logic.
"""
# Get a parameter or buffer that does not belong to a TensorQuantizer
non_tq_param_or_buffer = None
for name, param_or_buffer in self.state_dict().items():
parent = self.get_submodule(name.rsplit(".", 1)[0]) if "." in name else self
if not isinstance(parent, TensorQuantizer):
non_tq_param_or_buffer = param_or_buffer
break
if non_tq_param_or_buffer is None:
warnings.warn(
f"Could not identify the device for TensorQuantizer states of {prefix}. "
"Please move the model to the right device now. This can be done by calling "
"`model.to(device)`."
)
return
# Move the TensorQuantizer states to the right device (dtype should have been restored).
for module in self.modules():
if isinstance(module, TensorQuantizer):
module.to(non_tq_param_or_buffer.device)
def iter_weights_for_calibration(self):
"""Yield ``(weight, weight_quantizer)`` pairs for weight-only calibration."""
import torch.nn as nn
from modelopt.torch.quantization.utils import quantizer_attr_names, weight_attr_names
for weight_name in weight_attr_names(self):
qname = quantizer_attr_names(weight_name).weight_quantizer
qattr = getattr(self, qname, None)
weight = getattr(self, weight_name)
if qattr is not None:
# Singular quantizer
yield weight, qattr
else:
# Try plural (ModuleList) - e.g. _QuantFusedExperts
plural = qname + "s"
qattr = getattr(self, plural, None)
if isinstance(qattr, nn.ModuleList):
# Yield per-expert slices for 3-D fused weights
if weight.dim() == 3:
for idx, q in enumerate(qattr):
yield weight[idx], q
else:
for q in qattr:
yield weight, q
else:
raise AttributeError(
f"Cannot find weight quantizer {qname} or {plural} on {type(self).__name__}"
)
def fold_weight(self, keep_attrs: bool = False):
"""Fold the weight for faster eval."""
# Handle all attributes that end with _weight_quantizer
for name in dir(self):
attr = getattr(self, name)
if (
name.endswith("weight_quantizer")
and isinstance(attr, TensorQuantizer)
and attr.fake_quant
):
# Get the corresponding weight name by removing _weight_quantizer suffix
weight_name = name[:-10]
assert hasattr(self, weight_name), (
f"{name} doesn't have a corresponding {weight_name} in {self.__class__.__name__}"
)
weight = getattr(self, weight_name)
weight.data.copy_(attr(weight.float()).to(weight.dtype))
attr.disable()
if not keep_attrs:
_attrs = [
"_pre_quant_scale",
"_amax",
]
for attr_name in _attrs:
if hasattr(attr, attr_name):
delattr(attr, attr_name)
QuantModuleRegistry = _DMRegistryCls("Quant", QuantModule)
class QuantInputBase(QuantModule):
"""Base class for modules where the input is quantized."""
input_quantizer: TensorQuantizer
output_quantizer: TensorQuantizer
default_quant_desc_input = QUANT_DESC_8BIT_PER_TENSOR
default_quant_desc_output = QUANT_DESC_8BIT_PER_TENSOR
def forward(self, input, *args, **kwargs):
"""Quantize the input before calling the original forward method."""
input = self.input_quantizer(input)
# Check MR: https://github.com/NVIDIA/Model-Optimizer/pull/824
if hasattr(self, "_forward_pre_dm"):
pre_fwd = getattr(self, "_forward_pre_dm")
def _is_forward_in_mro(bound_or_func) -> bool:
# If this is a bound method, compare its underlying function to any `forward`
# implementation in the current MRO. If it matches, it's not an external monkey-patch.
if hasattr(bound_or_func, "__func__"):
fn = bound_or_func.__func__
for cls in type(self).mro():
if cls.__dict__.get("forward") is fn:
return True
return False
if pre_fwd is getattr(self, "forward") or _is_forward_in_mro(pre_fwd):
output = super().forward(input, *args, **kwargs)
else:
output = pre_fwd(input, *args, **kwargs)
else:
output = super().forward(input, *args, **kwargs)
if isinstance(output, tuple):
return (self.output_quantizer(output[0]), *output[1:])
return self.output_quantizer(output)
def _setup(self):
"""Patch the module's forward method to quantize the input."""
self._register_temp_attribute(
"input_quantizer", TensorQuantizer(self.default_quant_desc_input)
)
self._register_temp_attribute(
"output_quantizer", TensorQuantizer(self.default_quant_desc_output)
)
self.output_quantizer.disable()
class QuantLinearConvBase(QuantInputBase):
"""Base class for quantized linear modules.
Quantized linear modules are modules where both the input and the weight are quantized.
"""
weight_quantizer: TensorQuantizer | SequentialQuantizer
_enable_weight_quantization: bool
default_quant_desc_weight = QUANT_DESC_8BIT_PER_TENSOR
@contextlib.contextmanager
def quantize_weight(self):
"""Context in which `self.weight` is quantized."""
self._enable_weight_quantization = True
try:
yield
finally:
self._enable_weight_quantization = False
@staticmethod
def _get_quantized_weight(module: "QuantLinearConvBase", weight: torch.Tensor) -> torch.Tensor:
if module._enable_weight_quantization or is_torch_export_mode():
return module.weight_quantizer(weight)
return weight
def forward(self, input, *args, **kwargs):
"""Quantize the input and the weight before calling the original forward method."""
# self.quntize_weight() setting attributes is not allowed for torch.export.
if is_torch_export_mode():
return super().forward(input, *args, **kwargs)
with self.quantize_weight():
return super().forward(input, *args, **kwargs)
def _setup(self):
super()._setup()
self._register_temp_attribute(
"weight_quantizer", TensorQuantizer(self.default_quant_desc_weight)
)
self._register_temp_attribute("_enable_weight_quantization", False)
self._register_dynamic_attribute("weight", self._get_quantized_weight)
class _LegacyQuantInputBaseMixin:
"""A mixin to support legacy quantized modules which needs to have an __init__ method."""
_quantized_cls = QuantInputBase
default_quant_desc_input = QUANT_DESC_8BIT_PER_TENSOR
default_quant_desc_output = QUANT_DESC_8BIT_PER_TENSOR
def __init__(self, *args, quant_desc_input=None, **kwargs):
"""Initialize the module with its original __init__ and patch its forward."""
self.default_quant_desc_input = quant_desc_input or self.default_quant_desc_input
super().__init__(*args, **kwargs)
QuantModuleRegistry.convert(self)
class _LegacyQuantLinearConvBaseMixin(_LegacyQuantInputBaseMixin):
"""A mixin to support legacy quantized modules which needs to have an __init__ method."""
_quantized_cls = QuantLinearConvBase
default_quant_desc_weight = QUANT_DESC_8BIT_PER_TENSOR
def __init__(self, *args, quant_desc_input=None, quant_desc_weight=None, **kwargs):
"""Initialize the module with its original __init__ and patch its forward."""
self.default_quant_desc_weight = quant_desc_weight or self.default_quant_desc_weight
super().__init__(*args, quant_desc_input=quant_desc_input, **kwargs)

270
patches/staging_kernel.py Normal file
View File

@@ -0,0 +1,270 @@
"""
NVFP4 staging kernel — full FP4 (E2M1) activations + UE4M3 block16 scales.
The mxf4nvf4 PTX instruction requires BOTH A and B to be FP4 (E2M1 packed).
This kernel quantizes BF16 activations → E2M1 packed uint8 with UE4M3 scales.
"""
import triton
import triton.language as tl
import torch
@triton.jit
def _deepseek_v4_stage_mega_moe_inputs_kernel(
hidden_states,
x_fp4, # uint8, shape (M, K//2) — E2M1 packed, 2 values per byte
x_sf, # int32, shape (M, K//64) — UE4M3 packed, 4 scales per int32
topk_ids,
topk_weights,
topk_idx_out,
topk_weights_out,
hidden_stride_m: tl.constexpr,
hidden_stride_k: tl.constexpr,
x_stride_m: tl.constexpr,
x_stride_k: tl.constexpr,
x_sf_stride_m: tl.constexpr,
x_sf_stride_k: tl.constexpr,
topk_ids_stride_m: tl.constexpr,
topk_ids_stride_k: tl.constexpr,
topk_weights_stride_m: tl.constexpr,
topk_weights_stride_k: tl.constexpr,
topk_idx_stride_m: tl.constexpr,
topk_idx_stride_k: tl.constexpr,
topk_weights_out_stride_m: tl.constexpr,
topk_weights_out_stride_k: tl.constexpr,
hidden_size: tl.constexpr,
top_k: tl.constexpr,
BLOCK_K: tl.constexpr, # 128 elements (loaded from hidden)
GROUP_K: tl.constexpr, # 16 (NVFP4 group_size)
BLOCK_TOPK: tl.constexpr,
) -> None:
token_id = tl.program_id(0)
k_block_id = tl.program_id(1)
k_offsets = k_block_id * BLOCK_K + tl.arange(0, BLOCK_K)
k_mask = k_offsets < hidden_size
hidden = tl.load(
hidden_states + token_id * hidden_stride_m + k_offsets * hidden_stride_k,
mask=k_mask,
other=0.0,
).to(tl.float32)
num_groups: tl.constexpr = BLOCK_K // GROUP_K # 8
hidden_groups = tl.reshape(hidden, [num_groups, GROUP_K])
abs_groups = tl.reshape(tl.abs(hidden), [num_groups, GROUP_K])
amax = tl.max(abs_groups, axis=1)
amax = tl.maximum(amax, 1.0e-4)
# ---- UE4M3 scale computation ----
# scale = amax / 6.0 (E2M1 max value = 6)
# Then quantize scale to UE4M3 format
scale = amax / 6.0
scale_bits = scale.to(tl.uint32, bitcast=True)
scale_exp = (scale_bits >> 23) & 0xFF
scale_mant = scale_bits & 0x7FFFFF
# Convert FP32 → E4M3 manually (with subnormal support)
# FP32 bias=127, E4M3 bias=7 → raw exp = scale_exp - 120
e4m3_exp_raw = scale_exp - 120 # can be negative → subnormal
# Normal path: exp >= 1, just truncate mantissa top 3 bits
# RNE rounding: need guard (bit 19), sticky (OR of bits 18:0), and LSB of result
normal_mant = scale_mant >> 20
guard_bit = (scale_mant >> 19) & 1
sticky_bit = tl.where((scale_mant & 0x7FFFF) != 0, 1, 0) # OR of bits [18:0]
result_lsb = normal_mant & 1
# RNE: round up if (guard=1 and sticky=1) or (guard=1 and sticky=0 and lsb=1)
round_up = guard_bit & (sticky_bit | result_lsb)
normal_mant = normal_mant + round_up
normal_exp = e4m3_exp_raw
# Subnormal path: exp_raw <= 0
# Insert implicit leading 1 and right-shift by (1 - exp_raw)
# E4M3 subnormal: value = (mant/8) * 2^(1-7) = (mant/8) * 2^-6
# So we need: (1 + mant_fp32/2^23) * 2^(exp_raw - 7) = (shifted_mant/8) * 2^-6
# shifted_mant = (implicit_1 | mant_fp32) >> (1 - exp_raw - 1) then take top 3 bits
shift = 1 - e4m3_exp_raw # positive when subnormal
mant_with_leading = (0x800000 | scale_mant) # insert implicit 1
# Right-shift to get into the 3-bit E4M3 mantissa window
# We want bits [shift+19 : shift+23) of mant_with_leading for 3 mantissa bits + 1 round bit
subnormal_mant = (mant_with_leading >> (shift.to(tl.int32) + 20)) & 0x7
sub_guard_bit = (mant_with_leading >> (shift.to(tl.int32) + 19)) & 1
# Sticky: OR of all bits below the guard bit in the shifted result
# shift ≤ 8 in practice (amax floor = 1e-4 → scale ≈ 2^-15 → exp_raw ≈ -7), so mask ≤ 2^27
sub_sticky_mask = (1 << (shift.to(tl.int32) + 19)) - 1
sub_sticky_bit = tl.where((mant_with_leading & sub_sticky_mask) != 0, 1, 0)
sub_result_lsb = subnormal_mant & 1
sub_round_up = sub_guard_bit & (sub_sticky_bit | sub_result_lsb)
subnormal_mant = subnormal_mant + sub_round_up
is_normal = e4m3_exp_raw >= 1
e4m3_mant = tl.where(is_normal, normal_mant, subnormal_mant)
e4m3_exp = tl.where(is_normal, normal_exp, 0) # exp=0 for subnormals
# Handle mantissa overflow after rounding
overflow = e4m3_mant >= 8
e4m3_mant = tl.where(overflow, 0, e4m3_mant)
e4m3_exp = tl.where(overflow, e4m3_exp + 1, e4m3_exp)
e4m3_exp = tl.maximum(e4m3_exp, 0)
e4m3_exp = tl.minimum(e4m3_exp, 15)
scale_e4m3_bits = (e4m3_exp << 3) | e4m3_mant
# Reconstruct dequantized scale by decoding the STORED E4M3 bits.
# This guarantees the E2M1 quantization divides by exactly the value
# the CUDA kernel will multiply back — same bits, single decode, no
# possibility of encode/decode disagreement.
stored_exp = (scale_e4m3_bits >> 3) & 0xF
stored_mant = scale_e4m3_bits & 0x7
e4m3_exp_for_recon = tl.maximum(stored_exp.to(tl.int32) - 7, -126)
two_pow_exp_bits = (e4m3_exp_for_recon + 127).to(tl.uint32) << 23
two_pow_exp = two_pow_exp_bits.to(tl.float32, bitcast=True)
normal_value = (1.0 + stored_mant.to(tl.float32) / 8.0) * two_pow_exp
subnormal_value = (stored_mant.to(tl.float32) / 8.0) * 0.015625
e4m3_value = tl.where(stored_exp == 0, subnormal_value, normal_value)
# ---- E2M1 FP4 quantization (unpacked, 1 byte/element) ----
# E2M1 LUT (unsigned): [0, 0.5, 1, 1.5, 2, 3, 4, 6]
# Nearest-neighbor using thresholds (midpoints between consecutive values)
scaled = hidden_groups * (1.0 / tl.maximum(e4m3_value, 1e-6))[:, None]
# Clamp to E2M1 range [-6, 6]
scaled = tl.maximum(scaled, -6.0)
scaled = tl.minimum(scaled, 6.0)
abs_s = tl.abs(scaled)
# Thresholds: midpoints between [0, 0.5, 1, 1.5, 2, 3, 4, 6]
# [0, 0.25, 0.75, 1.25, 1.75, 2.5, 3.5, 5.0, INF]
e2m1_idx = tl.where(abs_s < 0.25, 0,
tl.where(abs_s < 0.75, 1,
tl.where(abs_s < 1.25, 2,
tl.where(abs_s < 1.75, 3,
tl.where(abs_s < 2.5, 4,
tl.where(abs_s < 3.5, 5,
tl.where(abs_s < 5.0, 6, 7)))))))
sign_bit = (scaled < 0).to(tl.int32)
e2m1_4bit = (sign_bit << 3) | e2m1_idx # 4-bit: (sign << 3) | index
# Pack E2M1 pairs into single bytes (2 per byte, low nibble first)
# mxf4nvf4 reads FP4 packed from SMEM — must match kernel's TMA layout
e2m1_flat = tl.reshape(e2m1_4bit, [BLOCK_K])
e2m1_lo = e2m1_flat[0::2] # even indices → low nibble
e2m1_hi = e2m1_flat[1::2] # odd indices → high nibble
e2m1_packed = (e2m1_hi << 4 | e2m1_lo).to(tl.uint8) # [BLOCK_K // 2]
k_offsets_out = k_block_id * (BLOCK_K // 2) + tl.arange(0, BLOCK_K // 2)
k_mask_out = k_offsets_out < (hidden_size // 2)
tl.store(
x_fp4 + token_id * x_stride_m + k_offsets_out * x_stride_k,
e2m1_packed,
mask=k_mask_out,
)
# Pack UE4M3 bytes into int32 (NVFP4: group_size=16, 4 groups per 64 elements)
# 8 groups per k_block of 128 → 2 int32s per k_block
# int32 can only pack 4 bytes (shifts >= 32 are UB), so split into two packs
scale_offsets = tl.arange(0, num_groups) # [0..7]
first_half = scale_offsets < 4 # groups 0-3 → int32[0]
second_half = scale_offsets >= 4 # groups 4-7 → int32[1]
packed_lo = tl.sum(
tl.where(first_half, scale_e4m3_bits.to(tl.int32) << (scale_offsets * 8), 0),
axis=0,
).to(tl.int32)
packed_hi = tl.sum(
tl.where(second_half, scale_e4m3_bits.to(tl.int32) << ((scale_offsets - 4) * 8), 0),
axis=0,
).to(tl.int32)
# Write 2 int32s per k_block: x_sf shape is (M, K//64) = (M, num_k_blocks * 2)
sf_base = token_id * x_sf_stride_m + k_block_id * 2 * x_sf_stride_k
tl.store(x_sf + sf_base, packed_lo)
tl.store(x_sf + sf_base + x_sf_stride_k, packed_hi)
if k_block_id == 0:
topk_offsets = tl.arange(0, BLOCK_TOPK)
topk_mask = topk_offsets < top_k
ids = tl.load(
topk_ids + token_id * topk_ids_stride_m + topk_offsets * topk_ids_stride_k,
mask=topk_mask,
other=0,
).to(tl.int64)
tl.store(
topk_idx_out
+ token_id * topk_idx_stride_m
+ topk_offsets * topk_idx_stride_k,
ids,
mask=topk_mask,
)
weights = tl.load(
topk_weights
+ token_id * topk_weights_stride_m
+ topk_offsets * topk_weights_stride_k,
mask=topk_mask,
other=0.0,
)
tl.store(
topk_weights_out
+ token_id * topk_weights_out_stride_m
+ topk_offsets * topk_weights_out_stride_k,
weights,
mask=topk_mask,
)
def _stage_deepseek_v4_mega_moe_inputs(
hidden_states: torch.Tensor,
topk_weights: torch.Tensor,
topk_ids: torch.Tensor,
x_fp4: torch.Tensor, # uint8, shape (M, K//2)
x_sf: torch.Tensor, # int32, shape (M, K//64)
topk_idx_out: torch.Tensor,
topk_weights_out: torch.Tensor,
) -> None:
num_tokens, hidden_size = hidden_states.shape
if num_tokens == 0:
return
if hidden_size % 128 != 0:
raise ValueError(
"DeepSeek V4 MegaMoE input staging requires hidden_size to be "
"a multiple of 128."
)
top_k = topk_ids.shape[1]
if topk_weights.shape != topk_ids.shape:
raise ValueError(
"DeepSeek V4 MegaMoE input staging requires topk_weights and "
"topk_ids to have the same shape."
)
block_k = 128
grid = (num_tokens, triton.cdiv(hidden_size, block_k))
block_topk = triton.next_power_of_2(top_k)
_deepseek_v4_stage_mega_moe_inputs_kernel[grid](
hidden_states,
x_fp4,
x_sf,
topk_ids,
topk_weights,
topk_idx_out,
topk_weights_out,
hidden_states.stride(0),
hidden_states.stride(1),
x_fp4.stride(0),
x_fp4.stride(1),
x_sf.stride(0),
x_sf.stride(1),
topk_ids.stride(0),
topk_ids.stride(1),
topk_weights.stride(0),
topk_weights.stride(1),
topk_idx_out.stride(0),
topk_idx_out.stride(1),
topk_weights_out.stride(0),
topk_weights_out.stride(1),
hidden_size,
top_k,
BLOCK_K=block_k,
GROUP_K=16, # NVFP4: group_size=16 (scale_vec::4X)
BLOCK_TOPK=block_topk,
num_warps=4,
)

View File

@@ -1,262 +0,0 @@
#!/usr/bin/python3
"""
Apply ALL fixes to the S11 base version of deepseek_v4.py.
This is a clean application of all fixes we've developed.
"""
filepath = "/root/nvidia-meeting/deepseek-v4-quant/patches/deepseek_v4.py"
with open(filepath, 'r') as f:
c = f.read()
import ast
def check(c, label):
try:
ast.parse(c)
print(f" {label}: OK")
return True
except SyntaxError as e:
print(f" {label}: SYNTAX ERROR at line {e.lineno}: {e.msg}")
return False
# ═══════════════════════════════════════════════════════════
# FIX 1: Substr mapping — remove .mla_attn. from attn projections
# ═══════════════════════════════════════════════════════════
subs = {
'".self_attn.q_a_proj.": ".attn.mla_attn.wq_a."': '".self_attn.q_a_proj.": ".attn.wq_a."',
'".self_attn.q_b_proj.": ".attn.mla_attn.wq_b."': '".self_attn.q_b_proj.": ".attn.wq_b."',
'".self_attn.q_a_norm.": ".attn.mla_attn.q_norm."': '".self_attn.q_a_norm.": ".attn.q_norm."',
'".self_attn.o_a_proj.": ".attn.mla_attn.wo_a."': '".self_attn.o_a_proj.": ".attn.wo_a."',
'".self_attn.o_b_proj.": ".attn.mla_attn.wo_b."': '".self_attn.o_b_proj.": ".attn.wo_b."',
'".self_attn.sinks": ".attn.mla_attn.attn_sink"': '".self_attn.sinks": ".attn.attn_sink"',
'".self_attn.kv_proj.": ".attn.mla_attn.wkv."': '".self_attn.kv_proj.": ".attn.wkv."',
'".self_attn.kv_norm.": ".attn.mla_attn.kv_norm."': '".self_attn.kv_norm.": ".attn.kv_norm."',
}
for old, new in subs.items():
c = c.replace(old, new)
check(c, "Fix 1 (substr)")
# ═══════════════════════════════════════════════════════════
# FIX 2: Skip patterns — only skip compressor scales
# ═══════════════════════════════════════════════════════════
# Remove attention and shared expert scale skip patterns
lines_to_remove = [
' re.compile(r"\\.self_attn\\.kv_proj\\.weight_scale$"): None,',
' re.compile(r"\\.self_attn\\.q_a_proj\\.weight_scale$"): None,',
' re.compile(r"\\.self_attn\\.q_b_proj\\.weight_scale$"): None,',
' re.compile(r"\\.self_attn\\.o_a_proj\\.weight_scale$"): None,',
' re.compile(r"\\.self_attn\\.o_b_proj\\.weight_scale$"): None,',
' re.compile(r"\\.self_attn\\.kv_proj\\.weight_scale_2$"): None,',
' re.compile(r"\\.self_attn\\.q_a_proj\\.weight_scale_2$"): None,',
' re.compile(r"\\.self_attn\\.q_b_proj\\.weight_scale_2$"): None,',
' re.compile(r"\\.self_attn\\.o_a_proj\\.weight_scale_2$"): None,',
' re.compile(r"\\.self_attn\\.o_b_proj\\.weight_scale_2$"): None,',
' re.compile(r"\\.self_attn\\.kv_proj\\.input_scale$"): None,',
' re.compile(r"\\.self_attn\\.q_a_proj\\.input_scale$"): None,',
' re.compile(r"\\.self_attn\\.q_b_proj\\.input_scale$"): None,',
' re.compile(r"\\.self_attn\\.o_a_proj\\.input_scale$"): None,',
' re.compile(r"\\.self_attn\\.o_b_proj\\.input_scale$"): None,',
' re.compile(r"\\.shared_experts\\.gate_proj\\.weight_scale$"): None,',
' re.compile(r"\\.shared_experts\\.up_proj\\.weight_scale$"): None,',
' re.compile(r"\\.shared_experts\\.gate_proj\\.weight_scale_2$"): None,',
' re.compile(r"\\.shared_experts\\.up_proj\\.weight_scale_2$"): None,',
' re.compile(r"\\.shared_experts\\.gate_proj\\.input_scale$"): None,',
' re.compile(r"\\.shared_experts\\.up_proj\\.input_scale$"): None,',
]
for line in lines_to_remove:
c = c.replace(line + "\n", "")
c = c.replace(line, "")
check(c, "Fix 2 (skip patterns)")
# ═══════════════════════════════════════════════════════════
# FIX 3: Remove the 'head.weight' suffix mapping that causes
# 'lm_head.weight' to become 'lm_lm_head.weight'
# ═══════════════════════════════════════════════════════════
c = c.replace(' "head.weight": "lm_head.weight",\n', '')
check(c, "Fix 3 (suffix)")
# ═══════════════════════════════════════════════════════════
# FIX 4: Handle o_a_proj bf16 -> FP8 at load time
# modelopt didn't quantize o_a_proj, but vLLM creates wo_a with NVFP4
# Convert bf16 -> FP8 and set weight_scale_inv
# ═══════════════════════════════════════════════════════════
old_else = ''' else:
if name not in params_dict:
# ModelOpt NVFP4 export includes params not in the
# vllm model (e.g., compressor.position_bias).
# Skip them silently.
continue
param = params_dict[name]
weight_loader = getattr(
param, "weight_loader", default_weight_loader
)
weight_loader(param, loaded_weight)
loaded_params.add(name)
continue'''
new_else = ''' else:
if name not in params_dict:
continue
param = params_dict[name]
# Handle o_a_proj bf16 -> wo_a uint8 mismatch
if (name.endswith(".weight")
and loaded_weight.dtype != torch.uint8
and param.data.dtype == torch.uint8):
# o_a_proj was NOT quantized by modelopt (bf16, no scales)
# Convert bf16 -> FP8 and set weight_scale_inv
w_bf16 = loaded_weight
w_amax = w_bf16.abs().amax()
if w_amax == 0:
w_amax = torch.tensor(1.0, device=w_bf16.device)
fp8_max = torch.finfo(torch.float8_e4m3fn).max
fp8_scale = w_amax / fp8_max
w_fp8 = (w_bf16 / fp8_scale).to(torch.float8_e4m3fn)
weight_scale_inv = fp8_scale.to(torch.float32)
# Replace the module weight and add weight_scale_inv
parts = name.rsplit(".", 1)
module_path = parts[0]
mod = self
for attr in module_path.split("."):
if attr.isdigit():
mod = mod[int(attr)]
else:
mod = getattr(mod, attr)
mod.weight = torch.nn.Parameter(w_fp8, requires_grad=False)
mod.weight_scale_inv = torch.nn.Parameter(
weight_scale_inv.reshape(1), requires_grad=False
)
from vllm.model_executor.layers.linear import (
UnquantizedLinearMethod,
)
mod.quant_method = UnquantizedLinearMethod()
for attr in ("weight_scale", "weight_scale_2", "input_scale"):
if hasattr(mod, attr):
delattr(mod, attr)
loaded_params.add(name)
loaded_params.add(name.replace(".weight", ".weight_scale_inv"))
continue
weight_loader = getattr(
param, "weight_loader", default_weight_loader
)
weight_loader(param, loaded_weight)
loaded_params.add(name)
continue'''
c = c.replace(old_else, new_else)
check(c, "Fix 4 (o_a_proj bf16->FP8)")
# ═══════════════════════════════════════════════════════════
# FIX 5: Add NVFP4->FP8 post-load conversion for attention
# This converts all uint8 NVFP4 attention weights to FP8
# ═══════════════════════════════════════════════════════════
conversion_methods = '''
def _convert_nvfp4_attention_to_fp8(self):
E2M1_LUT = torch.tensor(
[0, 0.5, 1, 1.5, 2, 3, 4, 6], dtype=torch.bfloat16
)
FP8_MAX = torch.finfo(torch.float8_e4m3fn).max
attn_proj_names = {"fused_wqa_wkv", "wq_b", "wo_a", "wo_b"}
shared_expert_names = {"gate_up_proj"}
converted = 0
for layer_idx, layer in enumerate(self.layers):
attn = layer.attn
for proj_name in attn_proj_names:
if not hasattr(attn, proj_name):
continue
mod = getattr(attn, proj_name)
if not hasattr(mod, "weight") or mod.weight.dtype != torch.uint8:
continue
self._convert_nvfp4_module_to_fp8(mod, E2M1_LUT, FP8_MAX)
converted += 1
ffn = layer.ffn
if hasattr(ffn, "shared_experts"):
for proj_name in shared_expert_names:
if not hasattr(ffn.shared_experts, proj_name):
continue
mod = getattr(ffn.shared_experts, proj_name)
if not hasattr(mod, "weight") or mod.weight.dtype != torch.uint8:
continue
self._convert_nvfp4_module_to_fp8(mod, E2M1_LUT, FP8_MAX)
converted += 1
if converted > 0:
logger.info_once(
"Converted %d NVFP4 attention/shared-expert layers to FP8",
converted,
)
def _convert_nvfp4_module_to_fp8(self, mod, e2m1_lut, fp8_max):
w_uint8 = mod.weight.data
device = w_uint8.device
even_idx = (w_uint8 & 0x0F).int()
odd_idx = ((w_uint8 >> 4) & 0x0F).int()
even_vals = e2m1_lut.to(device)[even_idx]
odd_vals = e2m1_lut.to(device)[odd_idx]
w_bf16 = torch.stack([even_vals, odd_vals], dim=-1)
w_bf16 = w_bf16.reshape(w_uint8.shape[0], -1).to(torch.bfloat16)
if hasattr(mod, "weight_scale") and hasattr(mod, "weight_scale_2"):
block_scale = mod.weight_scale.data.to(torch.float32)
if block_scale.dim() == 2 and w_bf16.dim() == 2:
block_size = w_bf16.shape[1] // block_scale.shape[1]
block_scale_expanded = block_scale.unsqueeze(-1).expand(
-1, -1, block_size
).reshape(w_bf16.shape)
else:
block_scale_expanded = block_scale
global_scale = mod.weight_scale_2.data.max().item()
input_scale = (
mod.input_scale.data.max().item()
if hasattr(mod, "input_scale")
else 1.0
)
w_dequant = w_bf16.float() * block_scale_expanded * global_scale * input_scale
w_dequant = w_dequant.to(torch.bfloat16)
else:
w_dequant = w_bf16
w_amax = w_dequant.abs().amax()
if w_amax == 0:
w_amax = torch.tensor(1.0, device=device)
fp8_scale = w_amax / fp8_max
w_fp8 = (w_dequant / fp8_scale).to(torch.float8_e4m3fn)
weight_scale_inv = fp8_scale.to(torch.float32)
mod.weight = torch.nn.Parameter(w_fp8, requires_grad=False)
mod.weight_scale_inv = torch.nn.Parameter(
weight_scale_inv.reshape(1), requires_grad=False
)
from vllm.model_executor.layers.linear import UnquantizedLinearMethod
mod.quant_method = UnquantizedLinearMethod()
for attr in ("weight_scale", "weight_scale_2", "input_scale"):
if hasattr(mod, attr):
delattr(mod, attr)
'''
# Insert before DeepseekV4ForCausalLM class
marker = "\n\nclass DeepseekV4ForCausalLM(nn.Module):"
if marker in c:
c = c.replace(marker, "\n" + conversion_methods + "\nclass DeepseekV4ForCausalLM(nn.Module):")
print(" Fix 5: Inserted conversion methods")
else:
print(" Fix 5: Could not find class marker")
check(c, "Fix 5 (NVFP4->FP8 methods)")
# ═══════════════════════════════════════════════════════════
# FIX 6: Call the conversion from DeepseekV4ForCausalLM.load_weights
# ═══════════════════════════════════════════════════════════
old_load = " self.model.finalize_mega_moe_weights()\n return loaded_params"
new_load = " self.model.finalize_mega_moe_weights()\n self.model._convert_nvfp4_attention_to_fp8()\n return loaded_params"
c = c.replace(old_load, new_load)
check(c, "Fix 6 (call conversion)")
# ═══════════════════════════════════════════════════════════
# Final validation
# ═══════════════════════════════════════════════════════════
check(c, "FINAL")
with open(filepath, 'w') as f:
f.write(c)
print("All fixes applied!")

View File

@@ -1,190 +0,0 @@
#!/usr/bin/env python3
"""Apply all NVFP4 serving fixes to deepseek_v4.py"""
filepath = "/root/nvidia-meeting/deepseek-v4-quant/patches/deepseek_v4.py"
with open(filepath, 'r') as f:
c = f.read()
# ═══════════════════════════════════════════════════════════════
# FIX 1: Substr mapping — remove .mla_attn. from attention projections
# The model has fused_wqa_wkv, wq_b, wo_a, wo_b at attn.* level
# ═══════════════════════════════════════════════════════════════
replacements_1 = {
'".self_attn.q_a_proj.": ".attn.mla_attn.wq_a."': '".self_attn.q_a_proj.": ".attn.wq_a."',
'".self_attn.q_b_proj.": ".attn.mla_attn.wq_b."': '".self_attn.q_b_proj.": ".attn.wq_b."',
'".self_attn.q_a_norm.": ".attn.mla_attn.q_norm."': '".self_attn.q_a_norm.": ".attn.q_norm."',
'".self_attn.o_a_proj.": ".attn.mla_attn.wo_a."': '".self_attn.o_a_proj.": ".attn.wo_a."',
'".self_attn.o_b_proj.": ".attn.mla_attn.wo_b."': '".self_attn.o_b_proj.": ".attn.wo_b."',
'".self_attn.sinks": ".attn.mla_attn.attn_sink"': '".self_attn.sinks": ".attn.attn_sink"',
'".self_attn.kv_proj.": ".attn.mla_attn.wkv."': '".self_attn.kv_proj.": ".attn.wkv."',
'".self_attn.kv_norm.": ".attn.mla_attn.kv_norm."': '".self_attn.kv_norm.": ".attn.kv_norm."',
}
for old, new in replacements_1.items():
if old in c:
c = c.replace(old, new)
print(f" Fixed: {old[:50]}... → {new[:50]}...")
else:
print(f" NOT FOUND: {old[:60]}...")
# Update comment
c = c.replace(
'# Attention: self_attn → attn.mla_attn',
'# Attention: self_attn → attn (projections at attn level, not mla_attn)'
)
print("FIX 1 applied: substr mappings updated\n")
# ═══════════════════════════════════════════════════════════════
# FIX 2: Skip patterns — only skip compressor scale tensors
# Attention and shared expert scale tensors now correctly load
# ═══════════════════════════════════════════════════════════════
old_skip_block = ''' fused_skip_regex = {
# Compressor projections → fused_wkv_wgate (stacked)
re.compile(r"\\.compressor\\.kv_proj\\.weight_scale$"): None,
re.compile(r"\\.compressor\\.gate_proj\\.weight_scale$"): None,
re.compile(r"\\.compressor\\.kv_proj\\.weight_scale_2$"): None,
re.compile(r"\\.compressor\\.gate_proj\\.weight_scale_2$"): None,
re.compile(r"\\.compressor\\.kv_proj\\.input_scale$"): None,
re.compile(r"\\.compressor\\.gate_proj\\.input_scale$"): None,
# Attention projections → fused_wqa_wkv (stacked)
re.compile(r"\\.self_attn\\.kv_proj\\.weight_scale$"): None,
re.compile(r"\\.self_attn\\.q_a_proj\\.weight_scale$"): None,
re.compile(r"\\.self_attn\\.q_b_proj\\.weight_scale$"): None,
re.compile(r"\\.self_attn\\.o_a_proj\\.weight_scale$"): None,
re.compile(r"\\.self_attn\\.o_b_proj\\.weight_scale$"): None,
re.compile(r"\\.self_attn\\.kv_proj\\.weight_scale_2$"): None,
re.compile(r"\\.self_attn\\.q_a_proj\\.weight_scale_2$"): None,
re.compile(r"\\.self_attn\\.q_b_proj\\.weight_scale_2$"): None,
re.compile(r"\\.self_attn\\.o_a_proj\\.weight_scale_2$"): None,
re.compile(r"\\.self_attn\\.o_b_proj\\.weight_scale_2$"): None,
re.compile(r"\\.self_attn\\.kv_proj\\.input_scale$"): None,
re.compile(r"\\.self_attn\\.q_a_proj\\.input_scale$"): None,
re.compile(r"\\.self_attn\\.q_b_proj\\.input_scale$"): None,
re.compile(r"\\.self_attn\\.o_a_proj\\.input_scale$"): None,
re.compile(r"\\.self_attn\\.o_b_proj\\.input_scale$"): None,
# Shared expert gate_proj/up_proj → gate_up_proj (stacked)
re.compile(r"\\.shared_experts\\.gate_proj\\.weight_scale$"): None,
re.compile(r"\\.shared_experts\\.up_proj\\.weight_scale$"): None,
re.compile(r"\\.shared_experts\\.gate_proj\\.weight_scale_2$"): None,
re.compile(r"\\.shared_experts\\.up_proj\\.weight_scale_2$"): None,
re.compile(r"\\.shared_experts\\.gate_proj\\.input_scale$"): None,
re.compile(r"\\.shared_experts\\.up_proj\\.input_scale$"): None,
}'''
new_skip_block = ''' fused_skip_regex = {
# Compressor projections → fused_wkv_wgate (stacked)
# Compressor uses UnquantizedLinearMethod (quant_config=None),
# so it only has a bf16 weight param — no scale params registered.
# We unpack the NVFP4 uint8 weights to bf16 at load time.
re.compile(r"\\.compressor\\.kv_proj\\.weight_scale$"): None,
re.compile(r"\\.compressor\\.gate_proj\\.weight_scale$"): None,
re.compile(r"\\.compressor\\.kv_proj\\.weight_scale_2$"): None,
re.compile(r"\\.compressor\\.gate_proj\\.weight_scale_2$"): None,
re.compile(r"\\.compressor\\.kv_proj\\.input_scale$"): None,
re.compile(r"\\.compressor\\.gate_proj\\.input_scale$"): None,
# Note: attention and shared expert scale tensors are NO LONGER
# skipped. After fixing substr mappings, they correctly map to the
# model's NVFP4 scale parameters (fused_wqa_wkv, wq_b, wo_a,
# wo_b, gate_up_proj). They load via the stacking logic.
}'''
if old_skip_block in c:
c = c.replace(old_skip_block, new_skip_block)
print("FIX 2 applied: skip patterns updated (only compressor scales skipped)\n")
else:
print("FIX 2: Could not find old skip block, searching for alternatives...")
# Try a more flexible search
import re
# Find the fused_skip_regex block
m = re.search(r' fused_skip_regex = \{[^}]+\}', c)
if m:
print(f" Found block at position {m.start()}")
else:
print(" Could not find fused_skip_regex block!")
# ═══════════════════════════════════════════════════════════════
# FIX 3: Handle o_a_proj bf16 → wo_a uint8 mismatch
# modelopt didn't quantize o_a_proj (bf16, no scales).
# When loading bf16 into uint8, replace the layer's quant_method
# with UnquantizedLinearMethod so it runs in bf16.
# ═══════════════════════════════════════════════════════════════
old_else_block = ''' else:
if name not in params_dict:
# ModelOpt NVFP4 export includes params not in the
# vllm model (e.g., compressor.position_bias).
# Skip them silently.
continue
param = params_dict[name]
weight_loader = getattr(
param, "weight_loader", default_weight_loader
)
weight_loader(param, loaded_weight)
loaded_params.add(name)
continue'''
new_else_block = ''' else:
if name not in params_dict:
# ModelOpt NVFP4 export includes params not in the
# vllm model (e.g., compressor.position_bias).
# Skip them silently.
continue
param = params_dict[name]
# Handle bf16 → uint8 mismatch for o_a_proj:
# modelopt didn't quantize o_a_proj (bf16, no scales),
# but ModelOptNvFp4Config creates wo_a with NVFP4 quant
# (uint8 weight + scales). When loading bf16 into uint8,
# we replace the quant method with UnquantizedLinearMethod
# so the layer runs in bf16 at inference.
if (name.endswith(".weight")
and loaded_weight.dtype != torch.uint8
and param.data.dtype == torch.uint8):
# Replace this layer's quant method with unquantized
from vllm.model_executor.layers.linear import (
UnquantizedLinearMethod,
)
parts = name.rsplit(".", 1)
module_path = parts[0] # e.g., layers.0.attn.wo_a
# Find the module and override its quant method
mod = self
for attr in module_path.split("."):
if attr.isdigit():
mod = mod[int(attr)]
else:
mod = getattr(mod, attr)
if hasattr(mod, 'quant_method'):
mod.quant_method = UnquantizedLinearMethod()
# Replace the uint8 weight param with bf16
new_shape = list(loaded_weight.shape)
new_param = torch.nn.Parameter(
loaded_weight.clone(), requires_grad=False
)
mod.weight = new_param
# Remove scale params (they'll stay at init values,
# but the UnquantizedLinearMethod won't use them)
loaded_params.add(name)
continue
weight_loader = getattr(
param, "weight_loader", default_weight_loader
)
weight_loader(param, loaded_weight)
loaded_params.add(name)
continue'''
if old_else_block in c:
c = c.replace(old_else_block, new_else_block)
print("FIX 3 applied: bf16→uint8 mismatch handling for o_a_proj\n")
else:
print("FIX 3: Could not find exact else block, trying flexible match...")
import re
m = re.search(r'(\s+else:\n\s+if name not in params_dict:.*?continue\n\s+continue)', c, re.DOTALL)
if m:
print(f" Found block at position {m.start()}")
else:
print(" Could not find else block!")
with open(filepath, 'w') as f:
f.write(c)
print("\nAll fixes written to", filepath)

View File

@@ -1,46 +0,0 @@
#!/usr/bin/env python3
"""Fix the FP8 conversion to use a simple no-op quant method for attention layers."""
filepath = "/root/nvidia-meeting/deepseek-v4-quant/patches/deepseek_v4.py"
with open(filepath, 'r') as f:
c = f.read()
# Replace all instances of Fp8LinearMethod/Fp8MMQuantMethod imports and usage
# with a simpler approach: just set quant_method to None and handle it
# In _convert_nvfp4_module_to_fp8
old_fp8_convert = ''' # Switch quant method to FP8 linear
from vllm.model_executor.layers.quantization.fp8 import Fp8LinearMethod
from vllm.model_executor.layers.quantization.utils.quant_utils import (
Fp8MMQuantMethod,
)
mod.quant_method = Fp8LinearMethod(Fp8MMQuantMethod())'''
new_fp8_convert = ''' # Switch quant method to a no-op. The attention forward uses
# deepseek_v4_fp8_einsum directly (not the quant method), so the
# quant method is irrelevant. We just need process_weights_after_loading
# to not crash. Using UnquantizedLinearMethod as a safe no-op.
from vllm.model_executor.layers.linear import UnquantizedLinearMethod
mod.quant_method = UnquantizedLinearMethod()'''
c = c.replace(old_fp8_convert, new_fp8_convert)
# In the bf16->uint8 handler (o_a_proj case)
old_oa_fp8 = ''' # Switch quant method to FP8 linear
from vllm.model_executor.layers.quantization.fp8 import Fp8LinearMethod
from vllm.model_executor.layers.quantization.utils.quant_utils import (
Fp8MMQuantMethod,
)
mod.quant_method = Fp8LinearMethod(Fp8MMQuantMethod())'''
new_oa_fp8 = ''' # Switch quant method to no-op (attention forward bypasses it)
from vllm.model_executor.layers.linear import UnquantizedLinearMethod
mod.quant_method = UnquantizedLinearMethod()'''
c = c.replace(old_oa_fp8, new_oa_fp8)
with open(filepath, 'w') as f:
f.write(c)
print("Replaced Fp8LinearMethod with UnquantizedLinearMethod for attention")

View File

@@ -1,162 +0,0 @@
#!/usr/bin/env python3
"""Fix the bf16→uint8 handler to properly quantize to NVFP4 instead of switching to UnquantizedLinearMethod"""
filepath = "/root/nvidia-meeting/deepseek-v4-quant/patches/deepseek_v4.py"
with open(filepath, 'r') as f:
c = f.read()
old_handler = ''' # Handle bf16 → uint8 mismatch for o_a_proj:
# modelopt didn't quantize o_a_proj (bf16, no scales),
# but ModelOptNvFp4Config creates wo_a with NVFP4 quant
# (uint8 weight + scales). When loading bf16 into uint8,
# we replace the quant method with UnquantizedLinearMethod
# so the layer runs in bf16 at inference.
if (name.endswith(".weight")
and loaded_weight.dtype != torch.uint8
and param.data.dtype == torch.uint8):
# Replace this layer's quant method with unquantized
from vllm.model_executor.layers.linear import (
UnquantizedLinearMethod,
)
parts = name.rsplit(".", 1)
module_path = parts[0] # e.g., layers.0.attn.wo_a
# Find the module and override its quant method
mod = self
for attr in module_path.split("."):
if attr.isdigit():
mod = mod[int(attr)]
else:
mod = getattr(mod, attr)
if hasattr(mod, 'quant_method'):
mod.quant_method = UnquantizedLinearMethod()
# Replace the uint8 weight param with bf16
new_param = torch.nn.Parameter(
loaded_weight.clone(), requires_grad=False
)
mod.weight = new_param
# Set weight_scale_inv = 1.0 (required by
# DeepseekV4MLAModules forward pass which
# reads wo_a.weight_scale_inv directly)
mod.weight_scale_inv = torch.nn.Parameter(
torch.tensor(1.0, dtype=torch.float32),
requires_grad=False,
)
# Also set input_scale to prevent missing attr errors
if hasattr(mod, 'input_scale'):
mod.input_scale = torch.nn.Parameter(
torch.tensor(1.0, dtype=torch.float32),
requires_grad=False,
)
loaded_params.add(name)
loaded_params.add(name.replace('.weight', '.weight_scale_inv'))
continue'''
new_handler = ''' # Handle bf16 → uint8 mismatch for o_a_proj:
# modelopt didn't quantize o_a_proj (bf16, no scales),
# but ModelOptNvFp4Config creates wo_a with NVFP4 quant
# (uint8 weight + scales). We quantize the bf16 weight
# to NVFP4 at load time so the layer runs in NVFP4 path.
if (name.endswith(".weight")
and loaded_weight.dtype != torch.uint8
and param.data.dtype == torch.uint8):
# Quantize bf16 → NVFP4 (E2M1 packed uint8 + scales)
w_bf16 = loaded_weight
out_dim, in_dim = w_bf16.shape
block_size = 16
assert in_dim % block_size == 0
n_blocks = in_dim // block_size
# Reshape into blocks
w_blocks = w_bf16.reshape(out_dim, n_blocks, block_size)
# Compute per-block amax
amax = w_blocks.abs().amax(dim=-1) # [out, n_blocks]
# Global scale (weight_scale_2): max amax / (6.0 * 448.0)
global_amax = amax.max()
# Use 448.0 as the max e4m3 value for scale computation
weight_scale_2_val = global_amax / (6.0 * 448.0)
weight_scale_2 = weight_scale_2_val.to(torch.float32)
# Per-block scale (weight_scale): fp8 e4m3
# block_scale = amax / (6.0 * weight_scale_2)
block_scale = amax / (6.0 * weight_scale_2_val)
# Clamp to fp8 e4m3 range and cast
block_scale = block_scale.clamp(min=0, max=448.0)
weight_scale = block_scale.to(torch.float8_e4m3fn)
# Quantize to FP4 (E2M1)
# E2M1 LUT: 0, 0.5, 1, 1.5, 2, 3, 4, 6 (positive)
FP4_POS = torch.tensor(
[0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0],
dtype=torch.float32, device=w_bf16.device,
)
# For each block, dequantize the block scale from fp8
block_scale_f32 = weight_scale.to(torch.float32)
# Scale the weight values: normalized = w / (block_scale * weight_scale_2)
# We need to find the nearest FP4 value
scaled = w_blocks / (block_scale_f32.unsqueeze(-1) * weight_scale_2_val)
# Find nearest FP4 index (0-7 for magnitude)
# Use absolute value for matching, then apply sign
scaled_abs = scaled.abs()
# Find closest FP4 value
diff = (scaled_abs.unsqueeze(-1) - FP4_POS).abs()
fp4_idx = diff.argmin(dim=-1) # [out, n_blocks, block_size]
# Apply sign: negative values get bit 3 set
sign = (scaled < 0).int()
fp4_val = (sign << 3) | fp4_idx.int()
# Pack: 2 FP4 values per uint8 byte
# Even positions → lower nibble, Odd → upper nibble
fp4_flat = fp4_val.reshape(out_dim, -1) # [out, in_dim]
assert fp4_flat.shape[1] % 2 == 0
even = fp4_flat[:, 0::2] # lower nibble
odd = fp4_flat[:, 1::2] # upper nibble
packed = (odd << 4) | even
weight_packed = packed.to(torch.uint8)
# Reshape weight_scale to [out, n_blocks]
weight_scale_2d = weight_scale.reshape(out_dim, n_blocks)
# Load the quantized weight into the uint8 param
weight_loader = param.weight_loader
weight_loader(param, weight_packed)
loaded_params.add(name)
# Load scales into sibling params
base = name.rsplit(".", 1)[0]
# weight_scale
ws_name = f"{base}.weight_scale"
if ws_name in params_dict:
ws_param = params_dict[ws_name]
ws_loader = getattr(ws_param, "weight_loader", default_weight_loader)
ws_loader(ws_param, weight_scale_2d)
loaded_params.add(ws_name)
# weight_scale_2
ws2_name = f"{base}.weight_scale_2"
if ws2_name in params_dict:
ws2_param = params_dict[ws2_name]
ws2_loader = getattr(ws2_param, "weight_loader", default_weight_loader)
ws2_loader(ws2_param, weight_scale_2.reshape(1))
loaded_params.add(ws2_name)
# input_scale: use 1.0 default (dynamic quant)
is_name = f"{base}.input_scale"
if is_name in params_dict:
is_param = params_dict[is_name]
is_loader = getattr(is_param, "weight_loader", default_weight_loader)
is_loader(is_param, torch.tensor(1.0, dtype=torch.float32))
loaded_params.add(is_name)
continue'''
if old_handler in c:
c = c.replace(old_handler, new_handler)
print('FIX 5 applied: Replaced UnquantizedLinearMethod with proper NVFP4 quantization')
else:
print('FIX 5: Could not find exact handler block, trying flexible match...')
if 'UnquantizedLinearMethod' in c:
print(' Found UnquantizedLinearMethod in code - manual fix needed')
else:
print(' UnquantizedLinearMethod not found - already replaced?')
with open(filepath, 'w') as f:
f.write(c)

View File

@@ -1,81 +0,0 @@
#!/usr/bin/env python3
"""
Replace the current NVFP4 weight handling with a comprehensive
NVFP4→bf16→FP8 re-quantization path for attention layers.
The vLLM DeepSeekV4 attention forward code uses deepseek_v4_fp8_einsum
which requires FP8 weights + weight_scale_inv. NVFP4 weights (uint8 packed
FP4 + per-block scales + per-tensor scales) are incompatible with this kernel.
Solution: At load time, dequantize all NVFP4 attention weights to bf16,
then re-quantize to FP8. Store the FP8 weights + weight_scale_inv.
The existing FP8 attention forward code then works without modification.
For compressor fused_wkv_wgate: stays bf16 (UnquantizedLinearMethod).
For MoE experts: handled by ModelOptNvFp4FusedMoE natively.
For shared experts gate_up_proj: also needs FP8 conversion.
"""
filepath = "/root/nvidia-meeting/deepseek-v4-quant/patches/deepseek_v4.py"
with open(filepath, 'r') as f:
content = f.read()
# ============================================================
# Helper function: add the NVFP4→FP8 conversion utility
# at the top of the load_weights method
# ============================================================
old_load_weights_start = ''' def load_weights(self, weights: Iterable[tuple[str, torch.Tensor]]) -> set[str]:
# Define E2M1 FP4 → BF16 lookup table for unpacking
E2M1_LUT = torch.tensor(
[0, 0.5, 1, 1.5, 2, 3, 4, 6], dtype=torch.bfloat16
)'''
new_load_weights_start = ''' def load_weights(self, weights: Iterable[tuple[str, torch.Tensor]]) -> set[str]:
# Define E2M1 FP4 → BF16 lookup table for unpacking
E2M1_LUT = torch.tensor(
[0, 0.5, 1, 1.5, 2, 3, 4, 6], dtype=torch.bfloat16
)
# ── NVFP4 → FP8 re-quantization helper ──────────────────
# The vLLM DeepSeekV4 attention forward uses deepseek_v4_fp8_einsum
# which requires FP8 weights + weight_scale_inv. ModelOpt NVFP4
# stores uint8 packed FP4 + per-block/per-tensor scales.
# We dequantize NVFP4→bf16, then re-quantize to FP8 at load time.
def _nvfp4_to_fp8(w_uint8, w_scale, w_scale_2, w_input_scale):
"""Convert NVFP4 weight + scales to FP8 weight + weight_scale_inv."""
# Unpack uint8 → E2M1 FP4 → bf16
even = (w_uint8 & 0x0F).int()
odd = ((w_uint8 >> 4) & 0x0F).int()
# Interleave back
bf16_even = E2M1_LUT.to(w_uint8.device)[even]
bf16_odd = E2M1_LUT.to(w_uint8.device)[odd]
# Stack along last dim and flatten
w_bf16 = torch.stack([bf16_even, bf16_odd], dim=-1)
w_bf16 = w_bf16.reshape(w_uint8.shape[0], -1) # [out, in_dim]
# Dequantize: bf16_val = fp4 * block_scale * global_scale * input_scale
if w_scale.dim() == 2:
block_scale = w_scale.to(torch.float32).unsqueeze(-1) # [out, blocks, 1]
w_bf16_scaled = w_bf16.float() * block_scale.reshape(
w_bf16.shape[0], -1) * w_scale_2.item() * w_input_scale.item()
else:
w_bf16_scaled = w_bf16.float() * w_scale_2.item() * w_input_scale.item()
w_bf16_scaled = w_bf16_scaled.to(torch.bfloat16)
# Re-quantize bf16 → FP8 e4m3
w_amax = w_bf16_scaled.abs().amax()
fp8_scale = w_amax / torch.finfo(torch.float8_e4m3fn).max
w_fp8 = (w_bf16_scaled / fp8_scale).to(torch.float8_e4m3fn)
weight_scale_inv = fp8_scale.to(torch.float32)
return w_fp8, weight_scale_inv
# ── End helper ──────────────────────────────────────────
'''
content = content.replace(old_load_weights_start, new_load_weights_start)
print("Added NVFP4→FP8 helper function")
with open(filepath, 'w') as f:
f.write(content)
print("Written to file")

View File

@@ -1,88 +0,0 @@
#!/usr/bin/env python3
"""
Update the load_weights method to use NVFP4→FP8 conversion for attention layers.
Key changes:
1. Stacked params (fused_wqa_wkv): when uint8, dequantize to bf16, re-quantize to FP8,
then load as FP8 weight + weight_scale_inv
2. Non-stacked params (wq_b, wo_a, wo_b, gate_up_proj): same treatment
3. Compressor fused_wkv_wgate: stays as bf16 (E2M1 unpack only)
4. Remove the separate bf16→uint8 handler (no longer needed since we go to FP8)
"""
filepath = "/root/nvidia-meeting/deepseek-v4-quant/patches/deepseek_v4.py"
with open(filepath, 'r') as f:
content = f.read()
# ============================================================
# Replace the stacked params loading block
# The current code unpacks uint8→bf16 for MergedColumnParallelLinear
# We need to instead convert NVFP4→FP8 for attention/shared expert
# and unpack→bf16 for compressor
# ============================================================
old_stacked_unpack = ''' # ModelOpt NVFP4 packed weight fix for MergedColumnParallelLinear.
#
# modelopt exports NVFP4 packed weights as uint8 (2 values/byte
# along the column dim). But MergedColumnParallelLinear creates
# weight as bf16 (not PackedColumnParameter uint8) because
# ModelOptNvFp4Config only handles Linear, not
# MergedColumnParallelLinear.
#
# For compressor fused_wkv_wgate (quant_config=None →
# UnquantizedLinearMethod → bf16 weight):
# Unpack uint8→bf16 using E2M1 LUT, load into bf16 param.
#
# For fused_wqa_wkv (NVFP4 quant method → uint8 weight):
# The weight param IS uint8, so no unpacking needed.
# Just load the packed uint8 weight directly.
# Scales are loaded separately (no longer skipped).
if (loaded_weight.dtype == torch.uint8
and param.data.dtype != torch.uint8
and loaded_weight.shape[-1] * 2 == param.data.shape[-1]):
# Unpack NVFP4 (E2M1) → BF16
# E2M1 LUT: 0→0, 1→0.5, 2→1, 3→1.5, 4→2, 5→3, 6→4, 7→6
even_idx = (loaded_weight & 0x0F).int()
odd_idx = ((loaded_weight >> 4) & 0x0F).int()
even_vals = E2M1_LUT[even_idx]
odd_vals = E2M1_LUT[odd_idx]
# Interleave even and odd along the last dim
out = torch.stack([even_vals, odd_vals], dim=-1)
out = out.reshape(
loaded_weight.shape[0], -1
).to(torch.bfloat16)
loaded_weight = out'''
new_stacked_unpack = ''' # ModelOpt NVFP4 weight handling for stacked params.
#
# The vLLM DeepSeekV4 attention forward uses deepseek_v4_fp8_einsum
# which requires FP8 weights + weight_scale_inv. NVFP4 weights are
# incompatible. We convert NVFP4→bf16→FP8 at load time.
#
# For compressor fused_wkv_wgate (UnquantizedLinearMethod → bf16):
# Just unpack uint8→bf16 and load into bf16 param.
#
# For fused_wqa_wkv and gate_up_proj (NVFP4 quant → uint8):
# Collect the uint8 weight + scales, then convert to FP8
# using the _nvfp4_to_fp8 helper after all sub-weights load.
if (loaded_weight.dtype == torch.uint8
and param.data.dtype != torch.uint8
and loaded_weight.shape[-1] * 2 == param.data.shape[-1]):
# Compressor path: unpack uint8→bf16, load into bf16 param
even_idx = (loaded_weight & 0x0F).int()
odd_idx = ((loaded_weight >> 4) & 0x0F).int()
even_vals = E2M1_LUT[even_idx]
odd_vals = E2M1_LUT[odd_idx]
out = torch.stack([even_vals, odd_vals], dim=-1)
out = out.reshape(
loaded_weight.shape[0], -1
).to(torch.bfloat16)
loaded_weight = out'''
content = content.replace(old_stacked_unpack, new_stacked_unpack)
print("Updated stacked params unpack block")
with open(filepath, 'w') as f:
f.write(content)
print("Written to file")

View File

@@ -1,134 +0,0 @@
#!/usr/bin/env python3
"""Add NVFP4->FP8 conversion methods to deepseek_v4.py"""
filepath = "/root/nvidia-meeting/deepseek-v4-quant/patches/deepseek_v4.py"
with open(filepath, 'r') as f:
c = f.read()
# 1. Add conversion methods to DeepseekV4Model
old_finalize = ' return loaded_params\n\n\nclass DeepseekV4ForCausalLM(nn.Module):'
new_finalize = ''' return loaded_params
def _convert_nvfp4_attention_to_fp8(self):
"""Convert NVFP4 attention weights to FP8 format.
The vLLM DeepSeekV4 attention forward uses deepseek_v4_fp8_einsum
which requires FP8 weights + weight_scale_inv. NVFP4 weights are
incompatible. We dequantize NVFP4->bf16, then re-quantize to FP8.
"""
E2M1_LUT = torch.tensor(
[0, 0.5, 1, 1.5, 2, 3, 4, 6], dtype=torch.bfloat16
)
FP8_MAX = torch.finfo(torch.float8_e4m3fn).max
attn_proj_names = {"fused_wqa_wkv", "wq_b", "wo_a", "wo_b"}
shared_expert_names = {"gate_up_proj"}
converted = 0
for layer_idx, layer in enumerate(self.layers):
attn = layer.attn
for proj_name in attn_proj_names:
if not hasattr(attn, proj_name):
continue
mod = getattr(attn, proj_name)
if not hasattr(mod, 'weight') or mod.weight.dtype != torch.uint8:
continue
self._convert_nvfp4_module_to_fp8(mod, E2M1_LUT, FP8_MAX)
converted += 1
ffn = layer.ffn
if hasattr(ffn, 'shared_experts'):
for proj_name in shared_expert_names:
if not hasattr(ffn.shared_experts, proj_name):
continue
mod = getattr(ffn.shared_experts, proj_name)
if not hasattr(mod, 'weight') or mod.weight.dtype != torch.uint8:
continue
self._convert_nvfp4_module_to_fp8(mod, E2M1_LUT, FP8_MAX)
converted += 1
if converted > 0:
logger.info_once(
"Converted %d NVFP4 attention/shared-expert layers to FP8",
converted,
)
def _convert_nvfp4_module_to_fp8(self, mod, e2m1_lut, fp8_max):
"""Convert a single NVFP4 Linear module to FP8 format."""
w_uint8 = mod.weight.data
device = w_uint8.device
# Unpack uint8 -> E2M1 FP4 -> bf16
even_idx = (w_uint8 & 0x0F).int()
odd_idx = ((w_uint8 >> 4) & 0x0F).int()
even_vals = e2m1_lut.to(device)[even_idx]
odd_vals = e2m1_lut.to(device)[odd_idx]
w_bf16 = torch.stack([even_vals, odd_vals], dim=-1)
w_bf16 = w_bf16.reshape(w_uint8.shape[0], -1).to(torch.bfloat16)
# Dequantize: bf16 = fp4 * block_scale * global_scale * input_scale
if hasattr(mod, 'weight_scale') and hasattr(mod, 'weight_scale_2'):
block_scale = mod.weight_scale.data.to(torch.float32)
if block_scale.dim() == 2 and w_bf16.dim() == 2:
block_size = w_bf16.shape[1] // block_scale.shape[1]
block_scale_expanded = block_scale.unsqueeze(-1).expand(
-1, -1, block_size
).reshape(w_bf16.shape)
else:
block_scale_expanded = block_scale
global_scale = mod.weight_scale_2.data.max().item()
input_scale = mod.input_scale.data.max().item() if hasattr(mod, 'input_scale') else 1.0
w_dequant = w_bf16.float() * block_scale_expanded * global_scale * input_scale
w_dequant = w_dequant.to(torch.bfloat16)
else:
w_dequant = w_bf16
# Re-quantize bf16 -> FP8 e4m3
w_amax = w_dequant.abs().amax()
if w_amax == 0:
w_amax = torch.tensor(1.0, device=device)
fp8_scale = w_amax / fp8_max
w_fp8 = (w_dequant / fp8_scale).to(torch.float8_e4m3fn)
weight_scale_inv = fp8_scale.to(torch.float32)
# Replace weight param
mod.weight = torch.nn.Parameter(w_fp8, requires_grad=False)
mod.weight_scale_inv = torch.nn.Parameter(
weight_scale_inv.reshape(1), requires_grad=False
)
# Switch quant method to FP8 linear
from vllm.model_executor.layers.quantization.fp8 import Fp8LinearMethod
from vllm.model_executor.layers.quantization.utils.quant_utils import (
Fp8MMQuantMethod,
)
mod.quant_method = Fp8LinearMethod(Fp8MMQuantMethod())
# Clean up NVFP4 params
for attr in ('weight_scale', 'weight_scale_2', 'input_scale'):
if hasattr(mod, attr):
delattr(mod, attr)
class DeepseekV4ForCausalLM(nn.Module):'''
c = c.replace(old_finalize, new_finalize)
# 2. Call it from DeepseekV4ForCausalLM.load_weights
old_causal = ''' self.model.finalize_mega_moe_weights()
return loaded_params'''
new_causal = ''' self.model.finalize_mega_moe_weights()
# Convert NVFP4 attention weights to FP8 for compatibility with
# the deepseek_v4_fp8_einsum kernel used in the attention forward
self.model._convert_nvfp4_attention_to_fp8()
return loaded_params'''
c = c.replace(old_causal, new_causal)
with open(filepath, 'w') as f:
f.write(c)
print("Applied NVFP4->FP8 conversion methods")

View File

@@ -1,68 +0,0 @@
#!/usr/bin/env python3
"""Update the bf16->uint8 handler to convert bf16->FP8 directly"""
filepath = "/root/nvidia-meeting/deepseek-v4-quant/patches/deepseek_v4.py"
with open(filepath, 'r') as f:
c = f.read()
# The bf16->uint8 handler needs to convert bf16 weight directly to FP8
# since o_a_proj was NOT quantized by modelopt
# Replace the entire handler block
old_handler = ''' if (name.endswith(".weight")
and loaded_weight.dtype != torch.uint8
and param.data.dtype == torch.uint8):
# Quantize bf16 → NVFP4 (E2M1 packed uint8 + scales)'''
new_handler = ''' if (name.endswith(".weight")
and loaded_weight.dtype != torch.uint8
and param.data.dtype == torch.uint8):
# o_a_proj was NOT quantized by modelopt (bf16, no scales)
# Convert bf16 → FP8 directly, set weight_scale_inv
w_bf16 = loaded_weight
w_amax = w_bf16.abs().amax()
if w_amax == 0:
w_amax = torch.tensor(1.0, device=w_bf16.device)
fp8_max = torch.finfo(torch.float8_e4m3fn).max
fp8_scale = w_amax / fp8_max
w_fp8 = (w_bf16 / fp8_scale).to(torch.float8_e4m3fn)
weight_scale_inv = fp8_scale.to(torch.float32)
# Load FP8 weight directly (bypass the uint8 param)
# Find the module and replace weight + quant method
parts = name.rsplit(".", 1)
module_path = parts[0]
mod = self
for attr in module_path.split("."):
if attr.isdigit():
mod = mod[int(attr)]
else:
mod = getattr(mod, attr)
# Replace weight param with FP8 version
mod.weight = torch.nn.Parameter(w_fp8, requires_grad=False)
mod.weight_scale_inv = torch.nn.Parameter(
weight_scale_inv.reshape(1), requires_grad=False
)
# Switch quant method to FP8 linear
from vllm.model_executor.layers.quantization.fp8 import Fp8LinearMethod
from vllm.model_executor.layers.quantization.utils.quant_utils import (
Fp8MMQuantMethod,
)
mod.quant_method = Fp8LinearMethod(Fp8MMQuantMethod())
# Clean up NVFP4 params
for attr in ('weight_scale', 'weight_scale_2', 'input_scale'):
if hasattr(mod, attr):
delattr(mod, attr)
loaded_params.add(name)
loaded_params.add(name.replace('.weight', '.weight_scale_inv'))
continue
# OLD: Quantize bf16 -> NVFP4 (E2M1 packed uint8 + scales)'''
c = c.replace(old_handler, new_handler)
with open(filepath, 'w') as f:
f.write(c)
print("Updated bf16->uint8 handler to convert to FP8 directly")

View File

@@ -1,132 +0,0 @@
#!/usr/bin/python3
"""Fix: move _convert_nvfp4 methods INSIDE DeepseekV4Model class (before hc_head)"""
filepath = "/root/nvidia-meeting/deepseek-v4-quant/patches/deepseek_v4.py"
with open(filepath, 'r') as f:
c = f.read()
# 1. Remove the wrongly placed methods (currently after hc_head, at top level)
# Find the second occurrence of _convert_nvfp4_attention_to_fp8 (the wrongly placed one)
idx = c.find(" def _convert_nvfp4_attention_to_fp8(self):\n")
if idx > 0:
# Find the end of the methods (before class DeepseekV4ForCausalLM)
end_marker = "\n\nclass DeepseekV4ForCausalLM(nn.Module):"
end_idx = c.find(end_marker, idx)
if end_idx > 0:
c = c[:idx] + c[end_idx:]
print("Removed wrongly placed methods")
else:
print("Could not find end marker")
else:
print("No wrongly placed methods found")
# 2. Insert the methods inside DeepseekV4Model, right after finalize_mega_moe_weights
insert_after = "def finalize_mega_moe_weights(self) -> None:\n for layer in islice(self.layers, self.start_layer, self.end_layer):\n layer.ffn.finalize_mega_moe_moe_weights()\n"
# Try a simpler approach: find the end of finalize_mega_moe_weights
marker = " layer.ffn.finalize_mega_moe_weights()\n\n\n@torch.compile"
if marker in c:
methods = ''' layer.ffn.finalize_mega_moe_weights()
def _convert_nvfp4_attention_to_fp8(self):
E2M1_LUT = torch.tensor(
[0, 0.5, 1, 1.5, 2, 3, 4, 6], dtype=torch.bfloat16
)
FP8_MAX = torch.finfo(torch.float8_e4m3fn).max
attn_proj_names = {"fused_wqa_wkv", "wq_b", "wo_a", "wo_b"}
shared_expert_names = {"gate_up_proj"}
converted = 0
for layer_idx, layer in enumerate(self.layers):
attn = layer.attn
for proj_name in attn_proj_names:
if not hasattr(attn, proj_name):
continue
mod = getattr(attn, proj_name)
if not hasattr(mod, "weight") or mod.weight.dtype != torch.uint8:
continue
self._convert_nvfp4_module_to_fp8(mod, E2M1_LUT, FP8_MAX)
converted += 1
ffn = layer.ffn
if hasattr(ffn, "shared_experts"):
for proj_name in shared_expert_names:
if not hasattr(ffn.shared_experts, proj_name):
continue
mod = getattr(ffn.shared_experts, proj_name)
if not hasattr(mod, "weight") or mod.weight.dtype != torch.uint8:
continue
self._convert_nvfp4_module_to_fp8(mod, E2M1_LUT, FP8_MAX)
converted += 1
if converted > 0:
logger.info_once(
"Converted %d NVFP4 attention/shared-expert layers to FP8",
converted,
)
def _convert_nvfp4_module_to_fp8(self, mod, e2m1_lut, fp8_max):
w_uint8 = mod.weight.data
device = w_uint8.device
even_idx = (w_uint8 & 0x0F).int()
odd_idx = ((w_uint8 >> 4) & 0x0F).int()
even_vals = e2m1_lut.to(device)[even_idx]
odd_vals = e2m1_lut.to(device)[odd_idx]
w_bf16 = torch.stack([even_vals, odd_vals], dim=-1)
w_bf16 = w_bf16.reshape(w_uint8.shape[0], -1).to(torch.bfloat16)
if hasattr(mod, "weight_scale") and hasattr(mod, "weight_scale_2"):
block_scale = mod.weight_scale.data.to(torch.float32)
if block_scale.dim() == 2 and w_bf16.dim() == 2:
block_size = w_bf16.shape[1] // block_scale.shape[1]
block_scale_expanded = block_scale.unsqueeze(-1).expand(
-1, -1, block_size
).reshape(w_bf16.shape)
else:
block_scale_expanded = block_scale
global_scale = mod.weight_scale_2.data.max().item()
input_scale = (
mod.input_scale.data.max().item()
if hasattr(mod, "input_scale")
else 1.0
)
w_dequant = w_bf16.float() * block_scale_expanded * global_scale * input_scale
w_dequant = w_dequant.to(torch.bfloat16)
else:
w_dequant = w_bf16
w_amax = w_dequant.abs().amax()
if w_amax == 0:
w_amax = torch.tensor(1.0, device=device)
fp8_scale = w_amax / fp8_max
w_fp8 = (w_dequant / fp8_scale).to(torch.float8_e4m3fn)
weight_scale_inv = fp8_scale.to(torch.float32)
mod.weight = torch.nn.Parameter(w_fp8, requires_grad=False)
mod.weight_scale_inv = torch.nn.Parameter(
weight_scale_inv.reshape(1), requires_grad=False
)
from vllm.model_executor.layers.linear import UnquantizedLinearMethod
mod.quant_method = UnquantizedLinearMethod()
for attr in ("weight_scale", "weight_scale_2", "input_scale"):
if hasattr(mod, attr):
delattr(mod, attr)
@torch.compile'''
c = c.replace(marker, methods)
print("Inserted methods inside DeepseekV4Model class")
else:
print("Could not find insertion marker!")
# Try alternate
alt = " layer.ffn.finalize_mega_moe_weights()\n\n\n@torch.compile"
if alt in c:
print("Found alternate marker")
else:
# Just search for finalize_mega_moe_weights
idx = c.find("finalize_mega_moe_weights()")
print(f"Found finalize at position {idx}")
import ast
try:
ast.parse(c)
print("Syntax OK")
except SyntaxError as e:
print(f"Syntax error at line {e.lineno}: {e.msg}")
with open(filepath, 'w') as f:
f.write(c)

View File

@@ -1,223 +0,0 @@
#!/usr/bin/python3
"""
Clean rewrite of the NVFP4→FP8/bf16 conversion.
Strategy:
- wo_a, fused_wqa_wkv → FP8 (used with fp8_einsum, need weight_scale_inv)
- wq_b, wo_b, gate_up_proj → bf16 (used via .forward(), just works)
- compressor fused_wkv_wgate → bf16 (already handled in load path)
- MoE experts → native NVFP4 (ModelOptNvFp4FusedMoE handles it)
"""
filepath = "/root/nvidia-meeting/deepseek-v4-quant/patches/deepseek_v4.py"
with open(filepath, 'r') as f:
c = f.read()
# Find and replace the entire _convert_nvfp4_attention_to_fp8 method
# and _convert_nvfp4_module_to_fp8 method
import re
# Remove old methods and insert new ones
# Find the method definitions
m1_start = c.find(" def _convert_nvfp4_attention_to_fp8(self):")
if m1_start < 0:
print("ERROR: Could not find _convert_nvfp4_attention_to_fp8")
exit(1)
# Find the end: look for the next method/class at the same or lower indent
# after _convert_nvfp4_module_to_fp8
m2_start = c.find(" def _convert_nvfp4_module_to_fp8(self, mod, e2m1_lut, fp8_max):", m1_start)
if m2_start < 0:
print("ERROR: Could not find _convert_nvfp4_module_to_fp8")
exit(1)
# Find the end of the second method
# Scan for the next line at indent <= 4 that's not blank
pos = m2_start
lines_after = c[m2_start:].split('\n')
end_line = 0
for i, line in enumerate(lines_after[1:], 1):
if line.strip() == '':
continue
indent = len(line) - len(line.lstrip())
if indent <= 4:
end_line = i
break
# Calculate the end position
end_pos = m2_start + sum(len(l) + 1 for l in lines_after[:end_line])
new_methods = ''' def _convert_nvfp4_post_load(self):
"""Post-load conversion of NVFP4 weights for vLLM compatibility.
Strategy:
- wo_a, fused_wqa_wkv: Convert NVFP4->FP8 (used with fp8_einsum)
- wq_b, wo_b, gate_up_proj: Dequant NVFP4->bf16 (used via .forward())
- MoE experts: Stay in native NVFP4 (ModelOptNvFp4FusedMoE)
"""
E2M1_LUT = torch.tensor(
[0, 0.5, 1, 1.5, 2, 3, 4, 6], dtype=torch.bfloat16
)
FP8_MAX = torch.finfo(torch.float8_e4m3fn).max
# Layers that use fp8_einsum (need FP8 + weight_scale_inv)
fp8_proj_names = {"fused_wqa_wkv", "wo_a"}
# Layers that use normal .forward() (need bf16)
bf16_proj_names = {"wq_b", "wo_b"}
bf16_shared_names = {"gate_up_proj"}
fp8_converted = 0
bf16_converted = 0
for layer_idx, layer in enumerate(self.layers):
attn = layer.attn
for proj_name in fp8_proj_names:
if not hasattr(attn, proj_name):
continue
mod = getattr(attn, proj_name)
if not hasattr(mod, "weight") or mod.weight.dtype != torch.uint8:
continue
self._convert_nvfp4_to_fp8(mod, E2M1_LUT, FP8_MAX)
fp8_converted += 1
for proj_name in bf16_proj_names:
if not hasattr(attn, proj_name):
continue
mod = getattr(attn, proj_name)
if not hasattr(mod, "weight") or mod.weight.dtype != torch.uint8:
continue
self._dequant_nvfp4_to_bf16(mod, E2M1_LUT)
bf16_converted += 1
# Shared experts
ffn = layer.ffn
if hasattr(ffn, "shared_experts"):
for proj_name in bf16_shared_names:
if not hasattr(ffn.shared_experts, proj_name):
continue
mod = getattr(ffn.shared_experts, proj_name)
if not hasattr(mod, "weight") or mod.weight.dtype != torch.uint8:
continue
self._dequant_nvfp4_to_bf16(mod, E2M1_LUT)
bf16_converted += 1
if fp8_converted > 0 or bf16_converted > 0:
print(f"NVFP4 post-load: {fp8_converted} layers -> FP8, "
f"{bf16_converted} layers -> bf16, MoE experts stay NVFP4")
def _dequant_nvfp4_to_bf16(self, mod, e2m1_lut):
"""Dequantize NVFP4 weight to bf16 for normal .forward() path."""
w_uint8 = mod.weight.data
device = w_uint8.device
w_bf16 = self._unpack_nvfp4_to_bf16(w_uint8, e2m1_lut, device)
# Dequantize with scales
if hasattr(mod, "weight_scale") and hasattr(mod, "weight_scale_2"):
block_scale = mod.weight_scale.data.to(torch.float32)
if block_scale.dim() == 2 and w_bf16.dim() == 2:
block_size = w_bf16.shape[1] // block_scale.shape[1]
block_scale_expanded = block_scale.unsqueeze(-1).expand(
-1, -1, block_size
).reshape(w_bf16.shape)
else:
block_scale_expanded = block_scale
global_scale = mod.weight_scale_2.data.max().item()
input_scale = (
mod.input_scale.data.max().item()
if hasattr(mod, "input_scale")
else 1.0
)
w_dequant = w_bf16.float() * block_scale_expanded * global_scale * input_scale
w_dequant = w_dequant.to(torch.bfloat16)
else:
w_dequant = w_bf16
# Replace weight with bf16 version
mod.weight = torch.nn.Parameter(w_dequant, requires_grad=False)
from vllm.model_executor.layers.linear import UnquantizedLinearMethod
mod.quant_method = UnquantizedLinearMethod()
for attr in ("weight_scale", "weight_scale_2", "input_scale",
"weight_scale_inv"):
if hasattr(mod, attr):
delattr(mod, attr)
def _convert_nvfp4_to_fp8(self, mod, e2m1_lut, fp8_max):
"""Convert NVFP4 weight to FP8 for fp8_einsum path."""
w_uint8 = mod.weight.data
device = w_uint8.device
w_bf16 = self._unpack_nvfp4_to_bf16(w_uint8, e2m1_lut, device)
# Dequantize with scales
if hasattr(mod, "weight_scale") and hasattr(mod, "weight_scale_2"):
block_scale = mod.weight_scale.data.to(torch.float32)
if block_scale.dim() == 2 and w_bf16.dim() == 2:
block_size = w_bf16.shape[1] // block_scale.shape[1]
block_scale_expanded = block_scale.unsqueeze(-1).expand(
-1, -1, block_size
).reshape(w_bf16.shape)
else:
block_scale_expanded = block_scale
global_scale = mod.weight_scale_2.data.max().item()
input_scale = (
mod.input_scale.data.max().item()
if hasattr(mod, "input_scale")
else 1.0
)
w_dequant = w_bf16.float() * block_scale_expanded * global_scale * input_scale
w_dequant = w_dequant.to(torch.bfloat16)
else:
w_dequant = w_bf16
# Re-quantize bf16 -> FP8 e4m3
w_amax = w_dequant.abs().amax()
if w_amax == 0:
w_amax = torch.tensor(1.0, device=device)
fp8_scale = w_amax / fp8_max
w_fp8 = (w_dequant / fp8_scale).to(torch.float8_e4m3fn)
weight_scale_inv = fp8_scale.to(torch.float32)
mod.weight = torch.nn.Parameter(w_fp8, requires_grad=False)
mod.weight_scale_inv = torch.nn.Parameter(
weight_scale_inv.reshape(1), requires_grad=False
)
from vllm.model_executor.layers.linear import UnquantizedLinearMethod
mod.quant_method = UnquantizedLinearMethod()
for attr in ("weight_scale", "weight_scale_2", "input_scale"):
if hasattr(mod, attr):
delattr(mod, attr)
def _unpack_nvfp4_to_bf16(self, w_uint8, e2m1_lut, device):
"""Unpack NVFP4 uint8 packed weights to bf16 using E2M1 format."""
# Extract 4-bit FP4 values (0-15, bit 3 = sign)
even_raw = (w_uint8 & 0x0F).int()
odd_raw = ((w_uint8 >> 4) & 0x0F).int()
# Sign: 0-7 = positive, 8-15 = negative
even_sign = torch.where(even_raw >= 8, -1.0, 1.0).to(torch.bfloat16)
odd_sign = torch.where(odd_raw >= 8, -1.0, 1.0).to(torch.bfloat16)
# Magnitude index: lower 3 bits (0-7)
even_vals = even_sign * e2m1_lut.to(device)[even_raw & 0x07]
odd_vals = odd_sign * e2m1_lut.to(device)[odd_raw & 0x07]
# Interleave and flatten
w_bf16 = torch.stack([even_vals, odd_vals], dim=-1)
w_bf16 = w_bf16.reshape(w_uint8.shape[0], -1).to(torch.bfloat16)
return w_bf16
'''
c = c[:m1_start] + new_methods + c[end_pos:]
# Also update the call from DeepseekV4ForCausalLM.load_weights
c = c.replace(
"self.model._convert_nvfp4_attention_to_fp8()",
"self.model._convert_nvfp4_post_load()"
)
with open(filepath, 'w') as f:
f.write(c)
import ast
try:
ast.parse(c)
print("Syntax OK")
except SyntaxError as e:
print(f"Syntax error at line {e.lineno}: {e.msg}")
print("Replaced conversion methods with clean FP8/bf16 split")

View File

@@ -1,53 +0,0 @@
#!/usr/bin/python3
"""Fix the E2M1 unpacking in _convert_nvfp4_module_to_fp8"""
filepath = "/root/nvidia-meeting/deepseek-v4-quant/patches/deepseek_v4.py"
with open(filepath, 'r') as f:
c = f.read()
# Fix the unpacking code in _convert_nvfp4_module_to_fp8
old_unpack = ''' even_idx = (w_uint8 & 0x0F).int()
odd_idx = ((w_uint8 >> 4) & 0x0F).int()
even_vals = e2m1_lut.to(device)[even_idx]
odd_vals = e2m1_lut.to(device)[odd_idx]'''
new_unpack = ''' # Extract 4-bit FP4 values (0-15, bit 3 = sign)
even_raw = (w_uint8 & 0x0F).int()
odd_raw = ((w_uint8 >> 4) & 0x0F).int()
# Sign: 0-7 = positive, 8-15 = negative
even_sign = torch.where(even_raw >= 8, -1.0, 1.0).to(torch.bfloat16)
odd_sign = torch.where(odd_raw >= 8, -1.0, 1.0).to(torch.bfloat16)
# Magnitude index: lower 3 bits (0-7)
even_vals = even_sign * e2m1_lut.to(device)[even_raw & 0x07]
odd_vals = odd_sign * e2m1_lut.to(device)[odd_raw & 0x07]'''
c = c.replace(old_unpack, new_unpack)
print("Fixed E2M1 unpacking in _convert_nvfp4_module_to_fp8")
# Also fix the E2M1 unpacking in the stacked params code
old_stacked_unpack = ''' even_idx = (loaded_weight & 0x0F).int()
odd_idx = ((loaded_weight >> 4) & 0x0F).int()
even_vals = E2M1_LUT[even_idx]
odd_vals = E2M1_LUT[odd_idx]'''
new_stacked_unpack = ''' # Extract 4-bit FP4 values (0-15, bit 3 = sign)
even_raw = (loaded_weight & 0x0F).int()
odd_raw = ((loaded_weight >> 4) & 0x0F).int()
even_sign = torch.where(even_raw >= 8, -1.0, 1.0).to(torch.bfloat16)
odd_sign = torch.where(odd_raw >= 8, -1.0, 1.0).to(torch.bfloat16)
even_vals = even_sign * E2M1_LUT[even_raw & 0x07]
odd_vals = odd_sign * E2M1_LUT[odd_raw & 0x07]'''
c = c.replace(old_stacked_unpack, new_stacked_unpack)
print("Fixed E2M1 unpacking in stacked params code")
with open(filepath, 'w') as f:
f.write(c)
import ast
try:
ast.parse(c)
print("Syntax OK")
except SyntaxError as e:
print(f"Syntax error: {e}")

View File

@@ -1,129 +0,0 @@
#!/usr/bin/python3
"""Fix the placement of _convert_nvfp4 methods - move inside DeepseekV4Model"""
filepath = "/root/nvidia-meeting/deepseek-v4-quant/patches/deepseek_v4.py"
with open(filepath, 'r') as f:
c = f.read()
# Remove the wrongly placed methods (at top level, 0 indent)
# Find and remove the block between the marker and the class definition
marker = " def _convert_nvfp4_attention_to_fp8(self):\n"
class_marker = "\nclass DeepseekV4ForCausalLM(nn.Module):"
# Find the wrongly placed methods and remove them
idx = c.find(" def _convert_nvfp4_attention_to_fp8(self):\n")
class_idx = c.find("\n\nclass DeepseekV4ForCausalLM(nn.Module):")
if idx > 0 and class_idx > 0 and idx < class_idx:
# Remove the wrongly placed methods
# Find the start of the blank lines before the methods
search_start = idx
while search_start > 0 and c[search_start-1] == '\n':
search_start -= 1
c = c[:search_start] + c[class_idx:]
print(f"Removed wrongly placed methods (chars {search_start}-{class_idx})")
else:
print(f"Could not find wrongly placed methods: idx={idx}, class_idx={class_idx}")
# Now insert the methods INSIDE DeepseekV4Model class, right before
# the line that precedes DeepseekV4ForCausalLM
# Find the last method of DeepseekV4Model before the class boundary
# Insert before "class DeepseekV4ForCausalLM"
insert_point = c.find("\n\nclass DeepseekV4ForCausalLM(nn.Module):")
if insert_point < 0:
print("ERROR: Could not find class marker")
else:
# The methods need to be at 4-space indent (class method level)
methods = '''
def _convert_nvfp4_attention_to_fp8(self):
E2M1_LUT = torch.tensor(
[0, 0.5, 1, 1.5, 2, 3, 4, 6], dtype=torch.bfloat16
)
FP8_MAX = torch.finfo(torch.float8_e4m3fn).max
attn_proj_names = {"fused_wqa_wkv", "wq_b", "wo_a", "wo_b"}
shared_expert_names = {"gate_up_proj"}
converted = 0
for layer_idx, layer in enumerate(self.layers):
attn = layer.attn
for proj_name in attn_proj_names:
if not hasattr(attn, proj_name):
continue
mod = getattr(attn, proj_name)
if not hasattr(mod, "weight") or mod.weight.dtype != torch.uint8:
continue
self._convert_nvfp4_module_to_fp8(mod, E2M1_LUT, FP8_MAX)
converted += 1
ffn = layer.ffn
if hasattr(ffn, "shared_experts"):
for proj_name in shared_expert_names:
if not hasattr(ffn.shared_experts, proj_name):
continue
mod = getattr(ffn.shared_experts, proj_name)
if not hasattr(mod, "weight") or mod.weight.dtype != torch.uint8:
continue
self._convert_nvfp4_module_to_fp8(mod, E2M1_LUT, FP8_MAX)
converted += 1
if converted > 0:
logger.info_once(
"Converted %d NVFP4 attention/shared-expert layers to FP8",
converted,
)
def _convert_nvfp4_module_to_fp8(self, mod, e2m1_lut, fp8_max):
w_uint8 = mod.weight.data
device = w_uint8.device
even_idx = (w_uint8 & 0x0F).int()
odd_idx = ((w_uint8 >> 4) & 0x0F).int()
even_vals = e2m1_lut.to(device)[even_idx]
odd_vals = e2m1_lut.to(device)[odd_idx]
w_bf16 = torch.stack([even_vals, odd_vals], dim=-1)
w_bf16 = w_bf16.reshape(w_uint8.shape[0], -1).to(torch.bfloat16)
if hasattr(mod, "weight_scale") and hasattr(mod, "weight_scale_2"):
block_scale = mod.weight_scale.data.to(torch.float32)
if block_scale.dim() == 2 and w_bf16.dim() == 2:
block_size = w_bf16.shape[1] // block_scale.shape[1]
block_scale_expanded = block_scale.unsqueeze(-1).expand(
-1, -1, block_size
).reshape(w_bf16.shape)
else:
block_scale_expanded = block_scale
global_scale = mod.weight_scale_2.data.max().item()
input_scale = (
mod.input_scale.data.max().item()
if hasattr(mod, "input_scale")
else 1.0
)
w_dequant = w_bf16.float() * block_scale_expanded * global_scale * input_scale
w_dequant = w_dequant.to(torch.bfloat16)
else:
w_dequant = w_bf16
w_amax = w_dequant.abs().amax()
if w_amax == 0:
w_amax = torch.tensor(1.0, device=device)
fp8_scale = w_amax / fp8_max
w_fp8 = (w_dequant / fp8_scale).to(torch.float8_e4m3fn)
weight_scale_inv = fp8_scale.to(torch.float32)
mod.weight = torch.nn.Parameter(w_fp8, requires_grad=False)
mod.weight_scale_inv = torch.nn.Parameter(
weight_scale_inv.reshape(1), requires_grad=False
)
from vllm.model_executor.layers.linear import UnquantizedLinearMethod
mod.quant_method = UnquantizedLinearMethod()
for attr in ("weight_scale", "weight_scale_2", "input_scale"):
if hasattr(mod, attr):
delattr(mod, attr)
'''
c = c[:insert_point] + methods + c[insert_point:]
print("Inserted methods at correct indentation level")
import ast
try:
ast.parse(c)
print("Syntax OK")
except SyntaxError as e:
print(f"Syntax error: {e}")
with open(filepath, 'w') as f:
f.write(c)

View File

@@ -1,37 +0,0 @@
#!/usr/bin/python3
"""Fix the logger.info_once call and any syntax issues"""
filepath = "/root/nvidia-meeting/deepseek-v4-quant/patches/deepseek_v4.py"
with open(filepath, 'r') as f:
lines = f.readlines()
# Find and fix the logger/print issue
new_lines = []
skip_until_paren = False
for i, line in enumerate(lines):
stripped = line.strip()
# Replace the print( call that was replacing logger.info_once
if 'print(' in line and 'Converted %d' in line:
new_lines.append(' if converted > 0:\n')
new_lines.append(' print(f"Converted {converted} NVFP4 attention/shared-expert layers to FP8")\n')
skip_until_paren = True
continue
if skip_until_paren:
if ')' in line:
skip_until_paren = False
continue
new_lines.append(line)
with open(filepath, 'w') as f:
f.writelines(new_lines)
import ast
try:
ast.parse(''.join(new_lines))
print("Syntax OK")
except SyntaxError as e:
print(f"Syntax error: {e}")

View File

@@ -1,44 +0,0 @@
#!/usr/bin/python3
"""Clean up the broken logger replacement"""
filepath = "/root/nvidia-meeting/deepseek-v4-quant/patches/deepseek_v4.py"
with open(filepath, 'r') as f:
lines = f.readlines()
new_lines = []
skip = False
for i, line in enumerate(lines):
if skip:
if ')' in line:
skip = False
continue
# Fix the broken logger replacement
if '# logger.info_once(' in line or 'logger.info_once(' in line:
if '"Converted %d NVFP4' in lines[i+1] if i+1 < len(lines) else False:
# Replace the whole block
new_lines.append(' print(f"Converted {converted} NVFP4 layers to FP8")\n')
skip = True
continue
else:
new_lines.append(line)
continue
# Also remove orphaned lines from the old block
if '"Converted %d NVFP4 attention/shared-expert layers to FP8",' in line:
continue
if line.strip() == 'converted,':
continue
new_lines.append(line)
with open(filepath, 'w') as f:
f.writelines(new_lines)
import ast
try:
ast.parse(''.join(new_lines))
print("Syntax OK")
except SyntaxError as e:
print(f"Syntax error at line {e.lineno}: {e.msg}")

View File

@@ -1,78 +0,0 @@
#!/usr/bin/python3
"""Replace the bf16->NVFP4 quantization handler with a simple bf16->FP8 conversion.
wo_a is used with fp8_einsum, so it needs FP8 weight + weight_scale_inv."""
filepath = "/root/nvidia-meeting/deepseek-v4-quant/patches/deepseek_v4.py"
with open(filepath, 'r') as f:
c = f.read()
# Find and replace the entire bf16->uint8 handler block
# It starts with the o_a_proj comment and ends with "continue"
import re
# Find the handler
m = re.search(
r"( # Handle o_a_proj bf16 → wo_a uint8 mismatch:.*?)(\n continue\n)",
c,
re.DOTALL,
)
if m:
# Replace with bf16->FP8 conversion
new_handler = ''' # Handle o_a_proj bf16 -> wo_a: convert to FP8 directly
# (wo_a is used with fp8_einsum, needs FP8 + weight_scale_inv)
if (name.endswith(".weight")
and loaded_weight.dtype != torch.uint8
and param.data.dtype == torch.uint8):
w_bf16 = loaded_weight
w_amax = w_bf16.abs().amax()
if w_amax == 0:
w_amax = torch.tensor(1.0, device=w_bf16.device)
fp8_max = torch.finfo(torch.float8_e4m3fn).max
fp8_scale = w_amax / fp8_max
w_fp8 = (w_bf16 / fp8_scale).to(torch.float8_e4m3fn)
weight_scale_inv = fp8_scale.to(torch.float32)
parts = name.rsplit(".", 1)
module_path = parts[0]
mod = self
for attr in module_path.split("."):
if attr.isdigit():
mod = mod[int(attr)]
else:
mod = getattr(mod, attr)
mod.weight = torch.nn.Parameter(w_fp8, requires_grad=False)
mod.weight_scale_inv = torch.nn.Parameter(
weight_scale_inv.reshape(1), requires_grad=False
)
from vllm.model_executor.layers.linear import (
UnquantizedLinearMethod,
)
mod.quant_method = UnquantizedLinearMethod()
for attr in ("weight_scale", "weight_scale_2", "input_scale"):
if hasattr(mod, attr):
delattr(mod, attr)
loaded_params.add(name)
loaded_params.add(name.replace(".weight", ".weight_scale_inv"))
continue
'''
c = c[:m.start()] + new_handler + c[m.end():]
print("Replaced bf16->NVFP4 handler with bf16->FP8 handler")
else:
print("Could not find handler block, trying alternate search...")
# Try finding just the if condition
idx = c.find("and loaded_weight.dtype != torch.uint8\n")
if idx > 0:
print(f"Found condition at position {idx}")
else:
print("ERROR: Could not find condition")
with open(filepath, 'w') as f:
f.write(c)
import ast
try:
ast.parse(c)
print("Syntax OK")
except SyntaxError as e:
print(f"Syntax error at line {e.lineno}: {e.msg}")

View File

@@ -1,94 +0,0 @@
#!/usr/bin/python3
"""Replace the old bf16->NVFP4 handler with a simple bf16->FP8 handler."""
filepath = "/root/nvidia-meeting/deepseek-v4-quant/patches/deepseek_v4.py"
with open(filepath, 'r') as f:
lines = f.readlines()
# Find the handler: starts with the if check, ends with continue
start = None
end = None
for i, line in enumerate(lines):
if 'loaded_weight.dtype != torch.uint8' in line and 'param.data.dtype == torch.uint8' in line:
# Go back to find the if statement start
for j in range(i, max(i-3, 0), -1):
if lines[j].strip().startswith('if (name.endswith'):
start = j
break
if start is None:
start = i # fallback
break
if start is None:
print("Could not find handler start")
exit(1)
# Find the end: the first 'continue' at indent level 20+ after start
for i in range(start + 1, min(start + 200, len(lines))):
stripped = lines[i].strip()
if stripped == 'continue':
indent = len(lines[i]) - len(lines[i].lstrip())
if indent >= 20:
end = i
break
if end is None:
print("Could not find handler end")
exit(1)
print(f"Replacing lines {start+1} to {end+1} ({end-start+1} lines)")
print(f"First: {lines[start].rstrip()[:80]}")
print(f"Last: {lines[end].rstrip()[:80]}")
new_handler = [
' if (name.endswith(".weight")\n',
' and loaded_weight.dtype != torch.uint8\n',
' and param.data.dtype == torch.uint8):\n',
' # o_a_proj was NOT quantized by modelopt (bf16, no scales)\n',
' # wo_a is used with fp8_einsum: convert bf16 -> FP8 directly\n',
' w_bf16 = loaded_weight\n',
' w_amax = w_bf16.abs().amax()\n',
' if w_amax == 0:\n',
' w_amax = torch.tensor(1.0, device=w_bf16.device)\n',
' fp8_max = torch.finfo(torch.float8_e4m3fn).max\n',
' fp8_scale = w_amax / fp8_max\n',
' w_fp8 = (w_bf16 / fp8_scale).to(torch.float8_e4m3fn)\n',
' weight_scale_inv = fp8_scale.to(torch.float32)\n',
' parts = name.rsplit(".", 1)\n',
' module_path = parts[0]\n',
' mod = self\n',
' for attr in module_path.split("."):\n',
' if attr.isdigit():\n',
' mod = mod[int(attr)]\n',
' else:\n',
' mod = getattr(mod, attr)\n',
' mod.weight = torch.nn.Parameter(w_fp8, requires_grad=False)\n',
' mod.weight_scale_inv = torch.nn.Parameter(\n',
' weight_scale_inv.reshape(1), requires_grad=False\n',
' )\n',
' from vllm.model_executor.layers.linear import (\n',
' UnquantizedLinearMethod,\n',
' )\n',
' mod.quant_method = UnquantizedLinearMethod()\n',
' for attr in ("weight_scale", "weight_scale_2", "input_scale"):\n',
' if hasattr(mod, attr):\n',
' delattr(mod, attr)\n',
' loaded_params.add(name)\n',
' loaded_params.add(name.replace(".weight", ".weight_scale_inv"))\n',
' continue\n',
]
lines[start:end+1] = new_handler
with open(filepath, 'w') as f:
f.writelines(lines)
import ast
try:
ast.parse(''.join(lines))
print("Syntax OK")
except SyntaxError as e:
print(f"Syntax error at line {e.lineno}: {e.msg}")
print(f"Replaced {end-start+1} lines with {len(new_handler)} lines")

View File

@@ -1,64 +0,0 @@
#!/usr/bin/python3
"""Fix: Only convert wo_a and fused_wqa_wkv to FP8 (used with fp8_einsum).
Keep wq_b, wo_b, gate_up_proj in NVFP4 (used via normal .forward())."""
filepath = "/root/nvidia-meeting/deepseek-v4-quant/patches/deepseek_v4.py"
with open(filepath, 'r') as f:
c = f.read()
# Fix 1: Change the list of projections to convert
old_proj_names = 'attn_proj_names = {"fused_wqa_wkv", "wq_b", "wo_a", "wo_b"}'
new_proj_names = 'attn_proj_names = {"fused_wqa_wkv", "wo_a"} # Only these use fp8_einsum'
c = c.replace(old_proj_names, new_proj_names)
# Fix 2: Remove shared_experts gate_up_proj from conversion
old_shared = ''' shared_expert_names = {"gate_up_proj"}
converted = 0
for layer_idx, layer in enumerate(self.layers):
attn = layer.attn
for proj_name in attn_proj_names:
if not hasattr(attn, proj_name):
continue
mod = getattr(attn, proj_name)
if not hasattr(mod, "weight") or mod.weight.dtype != torch.uint8:
continue
self._convert_nvfp4_module_to_fp8(mod, E2M1_LUT, FP8_MAX)
converted += 1
ffn = layer.ffn
if hasattr(ffn, "shared_experts"):
for proj_name in shared_expert_names:
if not hasattr(ffn.shared_experts, proj_name):
continue
mod = getattr(ffn.shared_experts, proj_name)
if not hasattr(mod, "weight") or mod.weight.dtype != torch.uint8:
continue
self._convert_nvfp4_module_to_fp8(mod, E2M1_LUT, FP8_MAX)
converted += 1'''
new_shared = ''' converted = 0
for layer_idx, layer in enumerate(self.layers):
attn = layer.attn
for proj_name in attn_proj_names:
if not hasattr(attn, proj_name):
continue
mod = getattr(attn, proj_name)
if not hasattr(mod, "weight") or mod.weight.dtype != torch.uint8:
continue
self._convert_nvfp4_module_to_fp8(mod, E2M1_LUT, FP8_MAX)
converted += 1
# wq_b, wo_b, gate_up_proj stay in NVFP4 (normal .forward() path)'''
c = c.replace(old_shared, new_shared)
with open(filepath, 'w') as f:
f.write(c)
import ast
try:
ast.parse(c)
print("Syntax OK")
except SyntaxError as e:
print(f"Syntax error: {e}")
print("Updated: only fused_wqa_wkv and wo_a converted to FP8")

View File

@@ -1,41 +0,0 @@
#!/usr/bin/python3
"""Fix the broken docstrings in deepseek_v4.py"""
filepath = "/root/nvidia-meeting/deepseek-v4-quant/patches/deepseek_v4.py"
with open(filepath, 'r') as f:
content = f.read()
# Replace the problematic triple-quoted docstrings with comments
content = content.replace(
' def _convert_nvfp4_attention_to_fp8(self):\n """Convert NVFP4 attention weights to FP8 format."""',
' def _convert_nvfp4_attention_to_fp8(self): # Convert NVFP4 attention weights to FP8'
)
content = content.replace(
' def _convert_nvfp4_module_to_fp8(self, mod, e2m1_lut, fp8_max):\n """Convert a single NVFP4 Linear module to FP8 format."""',
' def _convert_nvfp4_module_to_fp8(self, mod, e2m1_lut, fp8_max): # Convert single NVFP4 module to FP8'
)
# Also remove any stray docstring lines
lines = content.split('\n')
new_lines = []
skip_next = False
for i, line in enumerate(lines):
stripped = line.strip()
if stripped == '"""' or stripped == "'''":
continue # Skip standalone triple-quote lines
if '"""Convert NVFP4' in line or '"""Convert a single' in line:
continue # Skip the one-line docstrings that might remain
new_lines.append(line)
with open(filepath, 'w') as f:
f.write('\n'.join(new_lines))
# Verify syntax
import ast
try:
ast.parse('\n'.join(new_lines))
print("Syntax OK")
except SyntaxError as e:
print(f"Syntax error: {e}")