docker-compose.yml

services:
  vllm:
    build:
      context: .
      dockerfile: Dockerfile
    ports:
      - "8000:8000"
    environment:
      - OMP_NUM_THREADS=128
      - CUDA_LAUNCH_BLOCKING=0
      - PYTHONUNBUFFERED=1
      - VLLM_RPC_TIMEOUT_MS=600000
    command:
      - /model
      - --trust-remote-code
      - --enable-expert-parallel
      - --tensor-parallel-size=8
      #- --enforce-eager
      - --compilation-config
      #- '{"cudagraph_mode": "NONE", "custom_ops": ["all"]}'
      - '{"cudagraph_mode": "FULL_DECODE_ONLY", "custom_ops": ["all"], "cudagraph_capture_sizes": [1, 2, 4, 8], "max_cudagraph_capture_size": 8}' # This is what is running right now
      #- '{"cudagraph_mode":"FULL_AND_PIECEWISE", "custom_ops":["all"]}'
      #- --moe-backend=deep_gemm_mega_moe
      - --tokenizer-mode=deepseek_v4
      #- --attention_config.use_fp4_indexer_cache=True
      - --tool-call-parser=deepseek_v4
      - --enable-auto-tool-choice
      - --reasoning-parser=deepseek_v4
      - --gpu_memory_utilization=0.9
      - --host=0.0.0.0
      - --port=8000
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]
    volumes:
      - /root/nvidia-meeting/DeepSeek-V4-Pro-NVFP4:/model:ro
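
The `--compilation-config` value above is a JSON string embedded in a YAML list item, which is easy to break when editing by hand. A minimal Python sketch for generating it follows. The helper name `build_compilation_config` is hypothetical, and setting `max_cudagraph_capture_size` to the largest capture size is an assumption taken from the active line above, not a confirmed vLLM requirement.

```python
import json


def build_compilation_config(mode, capture_sizes):
    """Return a JSON string for --compilation-config (hypothetical helper).

    Assumption: max_cudagraph_capture_size mirrors the largest entry of
    cudagraph_capture_sizes, as in the active line of the compose file.
    """
    sizes = sorted(set(capture_sizes))  # deduplicate and order the batch sizes
    if not sizes or sizes[0] < 1:
        raise ValueError("capture sizes must be positive integers")
    return json.dumps(
        {
            "cudagraph_mode": mode,
            "custom_ops": ["all"],
            "cudagraph_capture_sizes": sizes,
            "max_cudagraph_capture_size": sizes[-1],
        }
    )


if __name__ == "__main__":
    # Paste the printed string, wrapped in single quotes, into the command list.
    print(build_compilation_config("FULL_DECODE_ONLY", [1, 2, 4, 8]))
```

Generating the string with `json.dumps` guarantees valid JSON; single-quoting it in the YAML keeps the inner double quotes intact.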
feat: CUTLASS NVFP4 mega_moe kernel (slot-based L1/L2, source-first SF remap)

Major changes from initial TileLang prototype:

Kernel:
- CUTLASS NVFP4 block-scaled GEMM (SM100 Blackwell, OpClassBlockScaledTensorOp)
- Slot-based dispatch: L1 GEMM → SiLU+Mul per-slot → L2 GEMM → index_add scatter
- 1D slot_expert_ids passed to both L1 and L2 (no 2D topk_ids rebuild)
- slot_token gathered in cutlass_grouped_nvfp4_gemm when provided

SF Remap (source-first):
- Iterates logical (m, k_sf) source grid, uses layout_sf(make_coord(m, k_sf)) for CUTLASS dest index; no idx2crd/flatten coordinate extraction
- 2D kernel launch: dim3 block(32,8), grid over (K_sf, MN)
- Uses cute::cosize() for physical allocation size (not cute::size)
- SFA: (MN, K_sf) row-major; SFB: (K_sf, MN) row-major (col-major)

Weight transform:
- UE4M3 unpack with bit reinterpret (not value cast)
- Global scale folding (weight_scale_2) for gate/up split
- clamp(0,448) → float8_e4m3fn, transpose (N,K)→(K,N) for CUTLASS

No prepack cache:
- SFB remapped per-call inside CUTLASS (~µs, not the bottleneck)
- See README for why prepack cache must never return (OOM, CUDA graphs, M-dependent layout, cross-layer collisions)

Stage activation:
- Nearest-neighbor E2M1 quantization (no clamp, no uniform steps)
- Per-tensor global scale → alpha for L2 GEMM

Bug fixes:
- _fold_global_scale: removed broken logical_widths branch
- unpack_ue4m3_u32: int32 for CUDA bitwise, view not to, ND support
- Correct expert param mapping for NVFP4 checkpoint
- SiLU applied per-slot (not after summing expert paths)

2026-05-15 11:38:18 +00:00			`services:`
			`vllm:`
			`build:`
			`context: .`
			`dockerfile: Dockerfile`
			`ports:`
			`- "8000:8000"`
			`environment:`
			`- OMP_NUM_THREADS=128`
			`- CUDA_LAUNCH_BLOCKING=0`
fix: PYTHONUNBUFFERED=1 so progress bars stream in real time

Python buffers stdout by default. Docker only sees the buffer dumps, so all progress bars appear at once when the step completes. PYTHONUNBUFFERED=1 disables buffering, so prints flush immediately.

2026-05-16 04:18:07 +00:00			`- PYTHONUNBUFFERED=1`
fix: increase vLLM RPC timeout to 10 min for first-request JIT

First inference triggers Triton/TileLang kernel JIT compilation (2-3 min). The default 5-min RPC timeout kills the engine. Bumped to 10 min via VLLM_RPC_TIMEOUT_MS so the first request survives compilation.

Not ideal: warming up the kernels during startup would be preferable, but CUDA graphs don't work well with grouped GEMMs and variable expert counts. Will investigate vLLM warmup shape config later.

2026-05-16 06:02:11 +00:00			`- VLLM_RPC_TIMEOUT_MS=600000`
feat: CUTLASS NVFP4 mega_moe kernel (slot-based L1/L2, source-first SF remap) 2026-05-15 11:38:18 +00:00			`command:`
			`- /model`
			`- --trust-remote-code`
			`- --enable-expert-parallel`
			`- --tensor-parallel-size=8`
allow for cuda graphs again 2026-05-16 19:23:41 +00:00			`#- --enforce-eager`
			`- --compilation-config`
Update CURRENT_BUG.md: current status, outstanding garbage output issue, hypotheses 2026-05-17 16:52:40 +00:00			`#- '{"cudagraph_mode": "NONE", "custom_ops": ["all"]}'`
			`- '{"cudagraph_mode": "FULL_DECODE_ONLY", "custom_ops": ["all"], "cudagraph_capture_sizes": [1, 2, 4, 8], "max_cudagraph_capture_size": 8}' # This is what is running right now`
crap shoot 2026-05-17 16:25:38 +00:00			`#- '{"cudagraph_mode":"FULL_AND_PIECEWISE", "custom_ops":["all"]}'`
vllm tweaks 2026-05-17 07:14:58 +00:00			`#- --moe-backend=deep_gemm_mega_moe`
feat: CUTLASS NVFP4 mega_moe kernel (slot-based L1/L2, source-first SF remap) 2026-05-15 11:38:18 +00:00			`- --tokenizer-mode=deepseek_v4`
vllm tweaks 2026-05-17 07:10:16 +00:00			`#- --attention_config.use_fp4_indexer_cache=True`
vllm tweaks 2026-05-17 07:14:58 +00:00			`- --tool-call-parser=deepseek_v4`
vllm tweaks 2026-05-17 07:10:16 +00:00			`- --enable-auto-tool-choice`
vllm tweaks 2026-05-17 07:14:58 +00:00			`- --reasoning-parser=deepseek_v4`
crap shoot 2026-05-17 16:25:38 +00:00			`- --gpu_memory_utilization=0.9`
feat: CUTLASS NVFP4 mega_moe kernel (slot-based L1/L2, source-first SF remap) 2026-05-15 11:38:18 +00:00			`- --host=0.0.0.0`
			`- --port=8000`
			`deploy:`
			`resources:`
			`reservations:`
			`devices:`
			`- driver: nvidia`
			`count: all`
			`capabilities: [gpu]`
			`volumes:`
			`- /root/nvidia-meeting/DeepSeek-V4-Pro-NVFP4:/model:ro`
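
The CUTLASS NVFP4 commit message above mentions nearest-neighbor E2M1 quantization for stage activations. As a rough illustration only, and not the repository's kernel code, the idea can be sketched in plain Python. The function name `quantize_e2m1`, the `scale` parameter, and breaking ties toward the smaller magnitude are assumptions; the eight magnitudes are the values representable in the E2M1 (FP4) format.

```python
# Magnitudes representable by E2M1 (1 sign bit, 2 exponent bits, 1 mantissa bit).
# The grid is non-uniform: steps of 0.5, then 1.0, then 2.0.
E2M1_MAGNITUDES = (0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0)


def quantize_e2m1(x, scale=1.0):
    """Snap x / scale to the nearest E2M1 value (hypothetical helper).

    No explicit clamp is needed: values beyond the grid land on +/-6.0
    because that is the nearest grid point. Ties go to the smaller
    magnitude here, which is an assumption of this sketch.
    """
    v = x / scale
    mag = abs(v)
    # min() returns the first minimum, so ties resolve to the smaller magnitude.
    nearest = min(E2M1_MAGNITUDES, key=lambda g: abs(g - mag))
    return -nearest if v < 0 else nearest


if __name__ == "__main__":
    for sample in (0.2, 2.4, -5.2, 100.0):
        print(sample, "->", quantize_e2m1(sample))
```

Searching the grid directly, rather than rounding to a fixed step, is what handles the non-uniform spacing between representable values.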