deepseek-v4-quant

Files

biondizzle 653e2d7a50 vLLM NVFP4 serving: full end-to-end pipeline working

Bridged the gap between ModelOpt NVFP4 and vLLM DeepSeek V4 attention.
Server loads and serves tokens on 8x B200 with TP=8, EP=8.

Key changes:
- wo_a: NVFP4->BF16->FP8, stored in DeepGEMM block-scale format for the BMM einsum
  Uses deepgemm_post_process_fp8_weight_block for the correct scale layout
  weight_scale_inv = DeepGEMM-formatted block scale (NOT a per-tensor scalar)
  Block scale is filled with fp8_scale (NOT all-ones, which produces garbage output)
- Attention: NVFP4->BF16 dequantization, UnquantizedLinearMethod
- Compressor: reconstruct fused_wkv_wgate from separate kv_proj+gate_proj
  Fixed indexer path: now loads compressor.indexer.kv_proj (was loading the main compressor's weights)
- MoE experts: stay NVFP4, FLASHINFER_TRTLLM FusedMoE backend
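
The wo_a scale handling above can be illustrated with a minimal pure-Python sketch. It is not the vLLM or DeepGEMM code: the 128x128 block size, the helper names, and the use of plain floats in place of real FP8 rounding are all assumptions made for illustration. It only shows why a block-scale grid must carry the real fp8_scale instead of ones.

```python
import math

FP8_E4M3_MAX = 448.0  # largest finite float8_e4m3fn magnitude
BLOCK = 128           # assumed block edge (hypothetical; stands in for quant_block_shape)


def quantize_per_tensor(weight):
    """Scale a 2-D weight into the FP8 range. Returns (scaled values, scale).

    Real FP8 rounding is skipped; values stay as Python floats.
    """
    amax = max(abs(v) for row in weight for v in row)
    scale = amax / FP8_E4M3_MAX if amax > 0 else 1.0
    return [[v / scale for v in row] for row in weight], scale


def block_scale_grid(rows, cols, fill, block=BLOCK):
    """Build one scale per (block x block) tile, every entry set to `fill`."""
    return [[fill] * math.ceil(cols / block) for _ in range(math.ceil(rows / block))]


def dequantize(q, grid, block=BLOCK):
    """Multiply each element by the scale of the tile it falls in."""
    return [
        [v * grid[i // block][j // block] for j, v in enumerate(row)]
        for i, row in enumerate(q)
    ]


weight = [[0.5, -2.0], [1.0, 0.25]]
q, fp8_scale = quantize_per_tensor(weight)

good_grid = block_scale_grid(2, 2, fp8_scale)  # grid filled with the real scale
ones_grid = block_scale_grid(2, 2, 1.0)        # the buggy all-ones grid

restored = dequantize(q, good_grid)  # close to the original weight
garbage = dequantize(q, ones_grid)   # still in the FP8 range, off by 1/fp8_scale
```

Whichever direction the real kernel applies the scale, a grid of ones leaves every weight off by a constant factor of fp8_scale, which matches the garbage output described above.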

Bugs fixed:
1. DeepGEMM sf.dim() assertion: weight_scale_inv must be a block-scale tensor
2. Block scale dtype: must be float32 (not float8_e4m3fn)
3. Missing deepgemm_post_process_fp8_weight_block args: quant_block_shape, use_e8m0
4. Compressor indexer shape mismatch: caused by the wrong checkpoint key prefix
5. All-ones block scale: DeepGEMM divides by 1.0 instead of the actual scale
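
The compressor fix (bug 4) can be sketched as follows. This is a hedged illustration, not the repository's loader: the key layout with an `attn_prefix` and a `.weight` suffix, the helper names, and the kv-then-gate row order of the fused tensor are all assumptions. Weights are nested lists so the example runs without any tensor library.

```python
def proj_key(attn_prefix, proj, for_indexer):
    """Build a checkpoint key (hypothetical layout) for a compressor projection."""
    scope = "compressor.indexer" if for_indexer else "compressor"
    return f"{attn_prefix}.{scope}.{proj}.weight"


def fuse_wkv_wgate(checkpoint, attn_prefix, for_indexer):
    """Rebuild a fused weight by stacking kv_proj rows on top of gate_proj rows.

    The stacking order is an assumption taken from the name fused_wkv_wgate.
    """
    kv = checkpoint[proj_key(attn_prefix, "kv_proj", for_indexer)]
    gate = checkpoint[proj_key(attn_prefix, "gate_proj", for_indexer)]
    if len(kv[0]) != len(gate[0]):
        raise ValueError("kv_proj and gate_proj must share the input dimension")
    return kv + gate  # row-wise concatenation


def shape(matrix):
    return (len(matrix), len(matrix[0]))


# Toy checkpoint: the indexer's projections are smaller than the main compressor's.
prefix = "layers.0.attn"
checkpoint = {
    f"{prefix}.compressor.kv_proj.weight": [[0.0] * 4 for _ in range(2)],
    f"{prefix}.compressor.gate_proj.weight": [[0.0] * 4 for _ in range(2)],
    f"{prefix}.compressor.indexer.kv_proj.weight": [[0.0] * 4 for _ in range(1)],
    f"{prefix}.compressor.indexer.gate_proj.weight": [[0.0] * 4 for _ in range(1)],
}

main_fused = fuse_wkv_wgate(checkpoint, prefix, for_indexer=False)
indexer_fused = fuse_wkv_wgate(checkpoint, prefix, for_indexer=True)
```

Loading the indexer with `for_indexer=False` would return the main compressor's larger tensor, which is the kind of shape mismatch the fix addresses.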

Updated README with full technical documentation of all fixes.

2026-05-11 02:01:46 +00:00

deepseek_v4.py

vLLM NVFP4 serving: full end-to-end pipeline working

2026-05-11 02:01:46 +00:00

deepseek_v4.py.bak

S11: Fixed substr mapping, stacking, suffix, and o_a_proj. Loads weights, but the attention forward pass uses an FP8 einsum that is incompatible with NVFP4

2026-05-10 17:45:53 +00:00

deepseek_v4.py.s11

S11: Fixed substr mapping, stacking, suffix, and o_a_proj. Loads weights, but the attention forward pass uses an FP8 einsum that is incompatible with NVFP4

2026-05-10 17:45:53 +00:00

patch_finegrained_fp8_blackwell.py

Add BF16 upcast script and Blackwell DeepGEMM patch

2026-05-07 14:25:30 +00:00

patch_vllm_weights.py

vLLM serving: patched deepseek_v4.py, disabled mega_moe, updated docs

2026-05-10 16:14:17 +00:00

quant_module_patched.py

Add ModelOpt NVFP4 pipeline: patch, run script, README

2026-05-07 07:22:54 +00:00