Files
nvfp4-megamoe-kernel/Dockerfile
biondizzle f74447bfd0 Proper NVFP4 integration: quantized compressor/indexer + mapper fixes
Weight mapper fixes:
- Reorder substr renames: compressor renames first, then .self_attn.compressor.
  → .attn.mla_attn.compressor., then indexer renames (so indexer keys end up
  under mla_attn after the compressor rename already fired)
- Add compressor param renames: kv_proj→wkv, gate_proj→wgate, kv_norm→norm,
  position_bias→ape (checkpoint uses NVFP4 naming, model uses internal names)
- Add indexer param renames: q_b_proj→wq_b, kv_proj→compressor.wkv,
  gate_proj→compressor.wgate, kv_norm→k_norm, position_bias→compressor.ape,
  weights_proj stays (structural: compressor.indexer → indexer.compressor)
- Remove broken suffix renames (already fixed in prior commit)

Model architecture fixes:
- Patch deepseek_compressor.py to pass quant_config (was None, but NVFP4
  checkpoint has quantized compressor weights with input_scale/weight_scale)
- Patch deepseek_v4_attention.py indexer: weights_proj now uses quant_config
  (was None, but checkpoint has quantized weights)
- Add indexer.compressor.fused_wkv_wgate stacking in load_weights

Infrastructure:
- Add deepseek_compressor.py to Dockerfile
- Force MoE backend to flashinfer_cutedsl (was auto-selecting FLASHINFER_TRTLLM)
- Update unit test to 50 cases (compressor + indexer + quantization scales)
2026-05-18 23:20:13 +00:00

60 lines
2.9 KiB
Docker

# DeepSeek V4 NVFP4 vLLM + CuTeDSL NVFP4 MoE Kernel
FROM vllm/vllm-openai:nightly-x86_64
# Remove broken nixl_ep (built against CUDA 12, image is CUDA 13)
RUN pip uninstall -y nixl-ep; rm -rf /usr/local/lib/python3.12/dist-packages/nixl_ep
RUN apt-get update && apt-get install -y git screen cmake libcusolver-dev-13-0 libcusparse-dev-13-0 libcublas-dev-13-0 libcurand-dev-13-0 libcufft-dev-13-0 libnvjitlink-dev-13-0 && rm -rf /var/lib/apt/lists/*
# Remove the broken symlink if it exists
RUN rm -f /usr/local/cuda/lib64/libcudrt.so.12
ENV CUDA_HOME=/usr/local/cuda
ENV TORCH_CUDA_ARCH_LIST="10.0"
# Install CuTeDSL (NVFP4 block-scaled GEMM kernel framework)
RUN pip install nvidia-cutlass-dsl==4.5.0 nvidia-cutlass-dsl-libs-base==4.5.0
ARG CACHE_BUSTER=${TIMESTAMP}
# Copy the NVFP4 mega_moe Python kernel (no C++ build needed)
COPY src/ /root/nvfp4-megamoe-kernel/src/
COPY pyproject.toml /root/nvfp4-megamoe-kernel/pyproject.toml
RUN cd /root/nvfp4-megamoe-kernel && pip install -e .
# Copy the CuTeDSL kernel and bridge layer
COPY cutedsl/ /root/nvfp4-megamoe-kernel/cutedsl/
ENV PYTHONPATH="/root/nvfp4-megamoe-kernel:${PYTHONPATH}"
# Patch vLLM — overwrite model files and register architecture
ARG VLLM_MODELS_DIR=/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models
ARG VLLM_LAYERS_DIR=/usr/local/lib/python3.12/dist-packages/vllm/model_executor/layers
ARG VLLM_QUANT_DIR=/usr/local/lib/python3.12/dist-packages/vllm/model_executor/layers/quantization
ARG VLLM_FUSED_MOE_DIR=/usr/local/lib/python3.12/dist-packages/vllm/model_executor/layers/fused_moe
ARG VLLM_LOADER_DIR=/usr/local/lib/python3.12/dist-packages/vllm/model_executor/model_loader
# Core model patches
COPY vllm/patches/deepseek_v4.py ${VLLM_MODELS_DIR}/deepseek_v4.py
COPY vllm/patches/deepseek_v4_attention.py ${VLLM_LAYERS_DIR}/deepseek_v4_attention.py
COPY vllm/patches/layers/deepseek_compressor.py ${VLLM_LAYERS_DIR}/deepseek_compressor.py
# Config patches (add cutedsl to MoEBackend)
ARG VLLM_CONFIG_DIR=/usr/local/lib/python3.12/dist-packages/vllm/config
COPY vllm/patches/kernel.py ${VLLM_CONFIG_DIR}/kernel.py
# NVFP4 MoE backend registration
COPY vllm/patches/fused_moe/oracle/nvfp4.py ${VLLM_FUSED_MOE_DIR}/oracle/nvfp4.py
COPY vllm/patches/fused_moe/experts/cutedsl_moe.py ${VLLM_FUSED_MOE_DIR}/experts/cutedsl_moe.py
# Register DeepseekV4ForCausalLM model architecture (if not already in upstream)
RUN grep -q '"DeepseekV4ForCausalLM"' ${VLLM_MODELS_DIR}/registry.py || \
sed -i 's/"DeepseekV32ForCausalLM": ("deepseek_v2", "DeepseekV3ForCausalLM"),/"DeepseekV32ForCausalLM": ("deepseek_v2", "DeepseekV3ForCausalLM"),\n "DeepseekV4ForCausalLM": ("deepseek_v4", "DeepseekV4ForCausalLM"),/' \
${VLLM_MODELS_DIR}/registry.py
# Verify
RUN python3 -c "import torch; print(f'PyTorch {torch.__version__} OK')" && \
python3 -c "import vllm; print('vLLM OK')" && \
python3 -c "import nvfp4_megamoe_kernel; print('NVFP4 kernel OK')" && \
python3 -c "import cutlass; print('CuTeDSL OK')"