Files
nvfp4-megamoe-kernel/Dockerfile

29 lines
1.1 KiB
Docker
Raw Normal View History

# DeepSeek V4 NVFP4 vLLM + CuTeDSL NVFP4 MoE Kernel
feat: CUTLASS NVFP4 mega_moe kernel — slot-based L1/L2, source-first SF remap Major changes from initial TileLang prototype: Kernel: - CUTLASS NVFP4 block-scaled GEMM (SM100 Blackwell, OpClassBlockScaledTensorOp) - Slot-based dispatch: L1 GEMM → SiLU+Mul per-slot → L2 GEMM → index_add scatter - 1D slot_expert_ids passed to both L1 and L2 (no 2D topk_ids rebuild) - slot_token gathered in cutlass_grouped_nvfp4_gemm when provided SF Remap (source-first): - Iterates logical (m, k_sf) source grid, uses layout_sf(make_coord(m, k_sf)) for CUTLASS dest index — no idx2crd/flatten coordinate extraction - 2D kernel launch: dim3 block(32,8), grid over (K_sf, MN) - Uses cute::cosize() for physical allocation size (not cute::size) - SFA: (MN, K_sf) row-major; SFB: (K_sf, MN) row-major (col-major) Weight transform: - UE4M3 unpack with bit reinterpret (not value cast) - Global scale folding (weight_scale_2) for gate/up split - clamp(0,448) → float8_e4m3fn, transpose (N,K)→(K,N) for CUTLASS No prepack cache: - SFB remapped per-call inside CUTLASS (~µs, not the bottleneck) - See README for why prepack cache must never return (OOM, CUDA graphs, M-dependent layout, cross-layer collisions) Stage activation: - Nearest-neighbor E2M1 quantization (no clamp, no uniform steps) - Per-tensor global scale → alpha for L2 GEMM Bug fixes: - _fold_global_scale: removed broken logical_widths branch - unpack_ue4m3_u32: int32 for CUDA bitwise, view not to, ND support - Correct expert param mapping for NVFP4 checkpoint - SiLU applied per-slot (not after summing expert paths)
2026-05-15 11:38:18 +00:00
FROM vllm/vllm-openai:nightly-x86_64
# Remove broken nixl_ep (built against CUDA 12, image is CUDA 13)
RUN pip uninstall -y nixl-ep; rm -rf /usr/local/lib/python3.12/dist-packages/nixl_ep
RUN apt-get update && apt-get install -y git screen cmake libcusolver-dev-13-0 libcusparse-dev-13-0 libcublas-dev-13-0 libcurand-dev-13-0 libcufft-dev-13-0 libnvjitlink-dev-13-0 && rm -rf /var/lib/apt/lists/*
# Remove the broken symlink if it exists
RUN rm -f /usr/local/cuda/lib64/libcudrt.so.12
feat: CUTLASS NVFP4 mega_moe kernel — slot-based L1/L2, source-first SF remap Major changes from initial TileLang prototype: Kernel: - CUTLASS NVFP4 block-scaled GEMM (SM100 Blackwell, OpClassBlockScaledTensorOp) - Slot-based dispatch: L1 GEMM → SiLU+Mul per-slot → L2 GEMM → index_add scatter - 1D slot_expert_ids passed to both L1 and L2 (no 2D topk_ids rebuild) - slot_token gathered in cutlass_grouped_nvfp4_gemm when provided SF Remap (source-first): - Iterates logical (m, k_sf) source grid, uses layout_sf(make_coord(m, k_sf)) for CUTLASS dest index — no idx2crd/flatten coordinate extraction - 2D kernel launch: dim3 block(32,8), grid over (K_sf, MN) - Uses cute::cosize() for physical allocation size (not cute::size) - SFA: (MN, K_sf) row-major; SFB: (K_sf, MN) row-major (col-major) Weight transform: - UE4M3 unpack with bit reinterpret (not value cast) - Global scale folding (weight_scale_2) for gate/up split - clamp(0,448) → float8_e4m3fn, transpose (N,K)→(K,N) for CUTLASS No prepack cache: - SFB remapped per-call inside CUTLASS (~µs, not the bottleneck) - See README for why prepack cache must never return (OOM, CUDA graphs, M-dependent layout, cross-layer collisions) Stage activation: - Nearest-neighbor E2M1 quantization (no clamp, no uniform steps) - Per-tensor global scale → alpha for L2 GEMM Bug fixes: - _fold_global_scale: removed broken logical_widths branch - unpack_ue4m3_u32: int32 for CUDA bitwise, view not to, ND support - Correct expert param mapping for NVFP4 checkpoint - SiLU applied per-slot (not after summing expert paths)
2026-05-15 11:38:18 +00:00
ENV CUDA_HOME=/usr/local/cuda
ENV TORCH_CUDA_ARCH_LIST="10.0"
# Install CuTeDSL (NVFP4 block-scaled GEMM kernel framework)
RUN pip install nvidia-cutlass-dsl==4.5.0 nvidia-cutlass-dsl-libs-base==4.5.0
feat: CUTLASS NVFP4 mega_moe kernel — slot-based L1/L2, source-first SF remap Major changes from initial TileLang prototype: Kernel: - CUTLASS NVFP4 block-scaled GEMM (SM100 Blackwell, OpClassBlockScaledTensorOp) - Slot-based dispatch: L1 GEMM → SiLU+Mul per-slot → L2 GEMM → index_add scatter - 1D slot_expert_ids passed to both L1 and L2 (no 2D topk_ids rebuild) - slot_token gathered in cutlass_grouped_nvfp4_gemm when provided SF Remap (source-first): - Iterates logical (m, k_sf) source grid, uses layout_sf(make_coord(m, k_sf)) for CUTLASS dest index — no idx2crd/flatten coordinate extraction - 2D kernel launch: dim3 block(32,8), grid over (K_sf, MN) - Uses cute::cosize() for physical allocation size (not cute::size) - SFA: (MN, K_sf) row-major; SFB: (K_sf, MN) row-major (col-major) Weight transform: - UE4M3 unpack with bit reinterpret (not value cast) - Global scale folding (weight_scale_2) for gate/up split - clamp(0,448) → float8_e4m3fn, transpose (N,K)→(K,N) for CUTLASS No prepack cache: - SFB remapped per-call inside CUTLASS (~µs, not the bottleneck) - See README for why prepack cache must never return (OOM, CUDA graphs, M-dependent layout, cross-layer collisions) Stage activation: - Nearest-neighbor E2M1 quantization (no clamp, no uniform steps) - Per-tensor global scale → alpha for L2 GEMM Bug fixes: - _fold_global_scale: removed broken logical_widths branch - unpack_ue4m3_u32: int32 for CUDA bitwise, view not to, ND support - Correct expert param mapping for NVFP4 checkpoint - SiLU applied per-slot (not after summing expert paths)
2026-05-15 11:38:18 +00:00
ARG CACHE_BUSTER=${TIMESTAMP}
# Copy the NVFP4 mega_moe Python kernel (no C++ build needed)
feat: CUTLASS NVFP4 mega_moe kernel — slot-based L1/L2, source-first SF remap Major changes from initial TileLang prototype: Kernel: - CUTLASS NVFP4 block-scaled GEMM (SM100 Blackwell, OpClassBlockScaledTensorOp) - Slot-based dispatch: L1 GEMM → SiLU+Mul per-slot → L2 GEMM → index_add scatter - 1D slot_expert_ids passed to both L1 and L2 (no 2D topk_ids rebuild) - slot_token gathered in cutlass_grouped_nvfp4_gemm when provided SF Remap (source-first): - Iterates logical (m, k_sf) source grid, uses layout_sf(make_coord(m, k_sf)) for CUTLASS dest index — no idx2crd/flatten coordinate extraction - 2D kernel launch: dim3 block(32,8), grid over (K_sf, MN) - Uses cute::cosize() for physical allocation size (not cute::size) - SFA: (MN, K_sf) row-major; SFB: (K_sf, MN) row-major (col-major) Weight transform: - UE4M3 unpack with bit reinterpret (not value cast) - Global scale folding (weight_scale_2) for gate/up split - clamp(0,448) → float8_e4m3fn, transpose (N,K)→(K,N) for CUTLASS No prepack cache: - SFB remapped per-call inside CUTLASS (~µs, not the bottleneck) - See README for why prepack cache must never return (OOM, CUDA graphs, M-dependent layout, cross-layer collisions) Stage activation: - Nearest-neighbor E2M1 quantization (no clamp, no uniform steps) - Per-tensor global scale → alpha for L2 GEMM Bug fixes: - _fold_global_scale: removed broken logical_widths branch - unpack_ue4m3_u32: int32 for CUDA bitwise, view not to, ND support - Correct expert param mapping for NVFP4 checkpoint - SiLU applied per-slot (not after summing expert paths)
2026-05-15 11:38:18 +00:00
COPY src/ /root/nvfp4-megamoe-kernel/src/
COPY pyproject.toml /root/nvfp4-megamoe-kernel/pyproject.toml
RUN cd /root/nvfp4-megamoe-kernel && pip install -e .
# Copy the CuTeDSL kernel and bridge layer
COPY cutedsl/ /root/nvfp4-megamoe-kernel/cutedsl/
feat: CUTLASS NVFP4 mega_moe kernel — slot-based L1/L2, source-first SF remap Major changes from initial TileLang prototype: Kernel: - CUTLASS NVFP4 block-scaled GEMM (SM100 Blackwell, OpClassBlockScaledTensorOp) - Slot-based dispatch: L1 GEMM → SiLU+Mul per-slot → L2 GEMM → index_add scatter - 1D slot_expert_ids passed to both L1 and L2 (no 2D topk_ids rebuild) - slot_token gathered in cutlass_grouped_nvfp4_gemm when provided SF Remap (source-first): - Iterates logical (m, k_sf) source grid, uses layout_sf(make_coord(m, k_sf)) for CUTLASS dest index — no idx2crd/flatten coordinate extraction - 2D kernel launch: dim3 block(32,8), grid over (K_sf, MN) - Uses cute::cosize() for physical allocation size (not cute::size) - SFA: (MN, K_sf) row-major; SFB: (K_sf, MN) row-major (col-major) Weight transform: - UE4M3 unpack with bit reinterpret (not value cast) - Global scale folding (weight_scale_2) for gate/up split - clamp(0,448) → float8_e4m3fn, transpose (N,K)→(K,N) for CUTLASS No prepack cache: - SFB remapped per-call inside CUTLASS (~µs, not the bottleneck) - See README for why prepack cache must never return (OOM, CUDA graphs, M-dependent layout, cross-layer collisions) Stage activation: - Nearest-neighbor E2M1 quantization (no clamp, no uniform steps) - Per-tensor global scale → alpha for L2 GEMM Bug fixes: - _fold_global_scale: removed broken logical_widths branch - unpack_ue4m3_u32: int32 for CUDA bitwise, view not to, ND support - Correct expert param mapping for NVFP4 checkpoint - SiLU applied per-slot (not after summing expert paths)
2026-05-15 11:38:18 +00:00
ENV PYTHONPATH="/root/nvfp4-megamoe-kernel:${PYTHONPATH}"