deepseek-v4-quant

biondizzle/deepseek-v4-quant

Commit Graph

Author	SHA1	Message	Date

biondizzle

b8f95ffad3

docker: add OMP_NUM_THREADS=64, remove --tool initcheck, mount cubin cache

2026-05-12 11:15:06 +00:00

biondizzle

6fd03a0aa0

vLLM serving: patched deepseek_v4.py, disabled mega_moe, updated docs

- Add patches/deepseek_v4.py: patched vllm source file with modelopt NVFP4
  weight name mappings (expert gate_proj→w1, mlp→ffn, self_attn→attn.mla_attn,
  compressor.kv_proj→wkv, etc.), E2M1 FP4→BF16 unpacking for stacked params,
  skip patterns for NVFP4 scale tensors on MergedColumnParallelLinear, and
  resilient loading for unknown params.

- Update docker-compose.yml: copy patched deepseek_v4.py over original at
  container startup, remove --moe-backend=deep_gemm_mega_moe (no NVFP4 kernel).

- Update patches/patch_vllm_weights.py: legacy runtime monkey-patch approach
  (doesn't work with worker processes), kept for reference.

- Update README.md: added vLLM serving run history table (S1-S10), documented
  all open issues (MergedColumnParallelLinear+NVFP4, no mega_moe kernel,
  resilient loading), added vLLM-specific bug list and key notes.

- Update scripts/serve_vllm.py: add WARN comment on mega_moe flag.

2026-05-10 16:14:17 +00:00
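
The "E2M1 FP4→BF16 unpacking" mentioned in commit 6fd03a0aa0 can be illustrated with a minimal sketch. This is not the code from patches/deepseek_v4.py; the function names and the low-nibble-first packing order are assumptions made for illustration. It decodes raw E2M1 values only and leaves out the multiplication by the NVFP4 scale tensors and the final cast to BF16.

```python
# E2M1 = 1 sign bit, 2 exponent bits, 1 mantissa bit (exponent bias 1).
# The low 3 bits of each 4-bit code select one of these magnitudes.
E2M1_MAGNITUDES = (0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0)


def decode_e2m1(nibble: int) -> float:
    """Decode one 4-bit E2M1 code (0..15) to a Python float."""
    sign = -1.0 if nibble & 0x8 else 1.0
    return sign * E2M1_MAGNITUDES[nibble & 0x7]


def unpack_fp4_bytes(packed: bytes, low_nibble_first: bool = True) -> list[float]:
    """Unpack two FP4 values from every byte.

    ASSUMPTION: the first element of each pair sits in the low nibble.
    Pass low_nibble_first=False if the checkpoint packs the other way.
    """
    out: list[float] = []
    for byte in packed:
        lo, hi = byte & 0x0F, byte >> 4
        first, second = (lo, hi) if low_nibble_first else (hi, lo)
        out.append(decode_e2m1(first))
        out.append(decode_e2m1(second))
    return out
```

For example, `unpack_fp4_bytes(bytes([0x21]))` yields the pair `0.5, 1.0` under the low-nibble-first assumption. A real loader would do the same lookup with tensor operations and then apply the scale tensors before casting.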

biondizzle

d88793dee6

Add vllm weight mapper patch and docker-compose

2026-05-10 09:33:48 +00:00

3 Commits