Commit Graph

22 Commits

Author SHA1 Message Date
1901bf585e nuke vllm because this keep confusing people 2026-05-19 23:04:36 +00:00
39310c357d Patch compressor cache for Blackwell (no FlashMLA alignment) - fixes 91 missing layers 2026-05-19 09:52:23 +00:00
d9cd8fa165 Add debug patch to print layer name mismatch 2026-05-19 09:45:10 +00:00
de1fb839f0 Patch SWA and Indexer cache specs for Blackwell (no FlashMLA alignment) 2026-05-19 09:29:57 +00:00
42285b6c24 Add CuTeDSL NVFP4 attention kernel test - Q×K^T GEMM 2026-05-19 08:54:59 +00:00
fa71fbe909 Patch KV cache utils: handle DeepseekV4 SWA page sizes > MLA page sizes 2026-05-19 08:45:44 +00:00
a782ac00ce Integrate CSA/SDPA attention into vLLM for Blackwell
- Add vllm/patches/layers/csa_attention.py: pure PyTorch replacement
  for FlashMLA + fused CUDA kernels that don't work on SM100
- Patch deepseek_v4_attention.py: detect SM100+ and dispatch to
  _forward_blackwell() which uses:
  1. fused_qnorm_rope_kv_insert_py() instead of C++ kernel
  2. full_sdpa_attention() instead of FlashMLA
  3. BF16 inverse RoPE + BMM for wo_a (same as existing BF16 path)
- Add csa_attention.py to Dockerfile

The Blackwell path:
  GEMM projections (CuTeDSL) → RMS norm → q_b → RoPE (PyTorch) →
  SDPA attention → inverse RoPE + wo_a BMM → wo_b → output
2026-05-19 08:04:07 +00:00
918342feeb MHC: replace monolithic layers/mhc.py with pure PyTorch
The nightly vLLM image puts ALL MHC code in layers/mhc.py (not kernels/mhc/).
It imports tilelang at top level and JIT-compiles kernels.

Replace the entire file with pure PyTorch implementations using
direct_register_custom_op for mhc_pre, mhc_post, mhc_fused_post_pre,
and hc_head_fused_kernel. No tilelang dependency at all.

Also removes the separate mhc_torch_ops.py and kernels/mhc/ patches
which don't apply to the nightly image layout.
2026-05-19 05:41:55 +00:00
dfd9c10ae9 Fix MHC import: don't import .torch from layers/mhc.py
The layers/mhc.py was trying to import kernels.mhc.torch which
failed because our __init__.py was breaking the package. Instead,
just import our mhc_torch_ops which has everything we need.

Also fix __init__.py to explicitly import mhc_pre_torch and
mhc_post_torch from .torch instead of using import *.
2026-05-19 05:36:35 +00:00
e404e18efb Also replace layers/mhc.py CustomOp dispatch
The original layers/mhc.py forward_cuda calls
torch.ops.vllm.mhc_pre_tilelang which triggers TileLang JIT.
Replace with our torch implementations in forward_cuda.
This is what the CustomOp dispatch routes through.
2026-05-19 05:31:05 +00:00
5e6d459145 Fix MHC custom op registration
Previous approach used @CustomOp.register which doesn't create
torch.ops.vllm.mhc_pre. The model code calls torch.ops.vllm.mhc_pre()
directly, which requires direct_register_custom_op.

Use direct_register_custom_op to register mhc_pre, mhc_post,
mhc_fused_post_pre, and hc_head_fused_kernel as PyTorch custom ops
with torch (eager) implementations.

Patch kernels/mhc/__init__.py to import from both .torch (original)
and .mhc_torch_ops (our replacements), skipping tilelang import.
2026-05-19 05:19:48 +00:00
9ff1679064 Replace MHC TileLang kernels with pure PyTorch
TileLang kernels (mhc_pre_big_fuse_tilelang, mhc_fused_tilelang) don't
work correctly on Blackwell SM100 and cause empty model output.

Replace with pure PyTorch implementations:
- mhc_pre_torch: Sinkhorn-normalized HC residual mixing
- mhc_post_torch: HC post block (einsum residual + post layer mix)
- mhc_fused_post_pre_torch: Fused post+pre (composition of above)
- hc_head_fused_torch: RMS norm + linear + sigmoid + weighted sum

Patch both layers/mhc.py (CustomOp dispatch) and kernels/mhc/__init__.py
(no tilelang import). Also remove tilelang from pyproject.toml deps.
2026-05-19 05:07:41 +00:00
c043a11bcc Register CuTeDSL as proper NvFp4LinearKernel for NVFP4 linear layers
- Create CuTeDSLNvFp4LinearKernel extending NvFp4LinearKernel base class
- Register it via init_nvfp4_linear_kernel() selection mechanism
  (inserted at top of _POSSIBLE_NVFP4_KERNELS, before FlashInfer)
- process_weights_after_loading: uint8→FP4, permute, create CuTeDSL runner
- apply_weights: route through CuTeDSL GEMM
- Update Dockerfile: copy kernel + registration script
- Fix attention: always use forward() for quantized compressor/indexer
  layers (dtype check was fragile after kernel swaps weights to dummy BF16)
2026-05-19 00:44:44 +00:00
f74447bfd0 Proper NVFP4 integration: quantized compressor/indexer + mapper fixes
Weight mapper fixes:
- Reorder substr renames: compressor renames first, then .self_attn.compressor.
  → .attn.mla_attn.compressor., then indexer renames (so indexer keys end up
  under mla_attn after the compressor rename already fired)
- Add compressor param renames: kv_proj→wkv, gate_proj→wgate, kv_norm→norm,
  position_bias→ape (checkpoint uses NVFP4 naming, model uses internal names)
- Add indexer param renames: q_b_proj→wq_b, kv_proj→compressor.wkv,
  gate_proj→compressor.wgate, kv_norm→k_norm, position_bias→compressor.ape,
  weights_proj stays (structural: compressor.indexer → indexer.compressor)
- Remove broken suffix renames (already fixed in prior commit)

Model architecture fixes:
- Patch deepseek_compressor.py to pass quant_config (was None, but NVFP4
  checkpoint has quantized compressor weights with input_scale/weight_scale)
- Patch deepseek_v4_attention.py indexer: weights_proj now uses quant_config
  (was None, but checkpoint has quantized weights)
- Add indexer.compressor.fused_wkv_wgate stacking in load_weights

Infrastructure:
- Add deepseek_compressor.py to Dockerfile
- Force MoE backend to flashinfer_cutedsl (was auto-selecting FLASHINFER_TRTLLM)
- Update unit test to 50 cases (compressor + indexer + quantization scales)
2026-05-18 23:20:13 +00:00
5d37674fb1 Add cutedsl to MoEBackend type in kernel config 2026-05-18 22:38:41 +00:00
a7ed8faec6 Proper NVFP4 integration: use ModelOptNvFp4Config + FusedMoE framework
Major refactor to eliminate all post-load hacks:
- deepseek_v4.py: use upstream model with NVFP4 weight mapper only
  (gate_proj→w1, up_proj→w3, down_proj→w2, .self_attn→.attn, .mlp→.ffn)
- Add CuTeDSLMoEExperts as a FusedMoEExpertsModular subclass
  that wraps our CuTeDSL runner as a proper vLLM MoE backend
- Register CUTEDSL backend in the NVFP4 oracle
- Use ModelOptNvFp4Config for quantization dispatch (not DeepseekV4FP8Config)
- ModelOptNvFp4LinearMethod handles NVFP4 attention/shared expert projections
- Remove nvfp4_cutedsl.py, cutedsl_quant_method.py, utils.py from Dockerfile
- CuTeDSL runner moved to cutedsl/runner.py for clean imports
- cos_sin_cache float32 fix in deepseek_v4_attention.py

No more monkey-patching, no _convert_nvfp4_post_load, no CuTeDSLNvfp4Method.
2026-05-18 22:19:23 +00:00
450793311c Wire CuTeDSL kernels into vLLM: replace all BF16 dequant with native NVFP4
- CuTeDSLNvfp4Method: custom quant method that creates CuTeDSL runners
  during process_weights_after_loading, then swaps to CuTeDSLNvfp4LinearMethod
  for forward dispatch
- Attention projections (fused_wqa_wkv, wq_b, wo_b) now route through
  CuTeDSLNvfp4Linear (cosine 0.992-0.996 vs BF16 reference)
- Shared expert now uses CuTeDSLSharedExpertRunner (cosine 0.992 vs BF16)
  with monkey-patched forward for fused L1+SiLU+L2 pipeline
- Deleted all BF16 dequant code (_dequant_nvfp4_to_bf16, _post_quant_fix,
  input_scale fixes)
- Deleted _post_quant_fix hook from utils.py
- Fixed SwiGLU clamp: gate clamped BEFORE SiLU (matching SiluAndMulWithClamp)
- Cleaned up all debug prints
- Updated Dockerfile with new kernel files
2026-05-18 20:27:42 +00:00
82ac648563 Patch utils.py the standard way: copy modified file into Docker image
Instead of fragile inline Dockerfile patching, just copy a modified
utils.py (with _post_quant_fix call) into the image, same pattern
as deepseek_v4.py and deepseek_v4_attention.py patches.
2026-05-18 19:10:08 +00:00
3c1a76bdcc Fix Dockerfile: use external patch script instead of inline Python
Docker's parser chokes on multi-line Python in RUN. Moved to
scripts/patch_utils.py and COPY + RUN it.
2026-05-18 19:03:57 +00:00
75844a8361 Post-quant fix via Dockerfile patch to process_weights_after_loading
Forward pre-hook approach didn't work — torch.compile and model
wrappers bypass hooks. Instead, patch vLLM's utils.py to call
model._post_quant_fix() at the end of process_weights_after_loading.
This guarantees the fix runs AFTER quant methods set up their attrs.

Dockerfile now patches:
  model_loader/utils.py → calls model._post_quant_fix() if it exists

DeepseekV4ForCausalLM._post_quant_fix() dequantizes attention
NVFP4 weights to BF16 and replaces quant_method.
2026-05-18 18:35:34 +00:00
b04bff7e8b feat: clean Dockerfile, docker-compose, import fixes for CuTeDSL build
Dockerfile:
- Removed: C++ CUTLASS extension build, TileLang install, CUTLASS clone
- Added: nvidia-cutlass-dsl==4.5.0 install, cutedsl/ copy
- Copy nvfp4_cutedsl.py to vllm models dir
- Verify step checks cutlass import

docker-compose.yml:
- Removed stale env vars (MEGA_MOE_DEBUG, MEGA_MOE_STATIC, etc.)

deepseek_v4.py:
- Fix import: vllm.nvfp4_cutedsl → vllm.model_executor.models.nvfp4_cutedsl

README.md:
- Updated results: 0% weight loss confirmed (bit-identical view-cast)
- 1.1% cosine loss is entirely from activation quantization
2026-05-16 03:50:07 +00:00
9908fd64d9 feat: CUTLASS NVFP4 mega_moe kernel — slot-based L1/L2, source-first SF remap
Major changes from initial TileLang prototype:

Kernel:
- CUTLASS NVFP4 block-scaled GEMM (SM100 Blackwell, OpClassBlockScaledTensorOp)
- Slot-based dispatch: L1 GEMM → SiLU+Mul per-slot → L2 GEMM → index_add scatter
- 1D slot_expert_ids passed to both L1 and L2 (no 2D topk_ids rebuild)
- slot_token gathered in cutlass_grouped_nvfp4_gemm when provided

SF Remap (source-first):
- Iterates logical (m, k_sf) source grid, uses layout_sf(make_coord(m, k_sf))
  for CUTLASS dest index — no idx2crd/flatten coordinate extraction
- 2D kernel launch: dim3 block(32,8), grid over (K_sf, MN)
- Uses cute::cosize() for physical allocation size (not cute::size)
- SFA: (MN, K_sf) row-major; SFB: (K_sf, MN) row-major (col-major)

Weight transform:
- UE4M3 unpack with bit reinterpret (not value cast)
- Global scale folding (weight_scale_2) for gate/up split
- clamp(0,448) → float8_e4m3fn, transpose (N,K)→(K,N) for CUTLASS

No prepack cache:
- SFB remapped per-call inside CUTLASS (~µs, not the bottleneck)
- See README for why prepack cache must never return (OOM, CUDA graphs,
  M-dependent layout, cross-layer collisions)

Stage activation:
- Nearest-neighbor E2M1 quantization (no clamp, no uniform steps)
- Per-tensor global scale → alpha for L2 GEMM

Bug fixes:
- _fold_global_scale: removed broken logical_widths branch
- unpack_ue4m3_u32: int32 for CUDA bitwise, view not to, ND support
- Correct expert param mapping for NVFP4 checkpoint
- SiLU applied per-slot (not after summing expert paths)
2026-05-15 11:38:18 +00:00