350 Commits

Author SHA1 Message Date
fece06f746 Add unit tests for NVFP4 weight mapper and inverse RoPE BF16 2026-05-19 03:22:00 +00:00
b0b5113467 Fix weight mapper: compressor → attn.compressor (not mla_attn), quant weights_proj
- The compressor is on attn.compressor (not attn.mla_attn.compressor)
- weights_proj in indexer is NVFP4-quantized in our checkpoint
2026-05-19 03:20:41 +00:00
396a83ea56 Clean vLLM integration: use official paths, BF16 wo_a, proper weight mapper
- deepseek_v4.py: Fresh upstream copy with minimal NVFP4 changes
  - wo_a uses quant_config=None (BF16 in NVFP4 checkpoint, no scales)
  - Added _make_deepseek_v4_nvfp4_weights_mapper() using official WeightsMapper API
  - Handles: self_attn→attn, mlp→ffn, gate_proj→w1, compressor renames, etc.
  - Mapper selected by quant_config.get_name() == 'modelopt_fp4'

- deepseek_v4_attention.py: Fresh upstream copy with minimal NVFP4 changes
  - Removed _wo_a_act_quant and custom CuTeDSL wo_a runner
  - Added _apply_inv_rope_bf16() helper (inverse RoPE in BF16)
  - Detects BF16 wo_a (no weight_scale_inv) and uses BF16 path
  - FP8 einsum path kept as fallback for SM90 checkpoints
  - BF16 path: inverse RoPE → wo_a() → wo_b() (standard linear methods)
2026-05-19 03:13:38 +00:00
b856ee9315 Clean up debug scripts 2026-05-19 02:47:29 +00:00
05cdde1676 Fix wo_a: scatter each group's data at correct offset in padded buffer
The grouped GEMM expects each group's tokens at their own offset range:
- Group 0: rows [0, padded_T)
- Group 1: rows [padded_T, 2*padded_T)
- etc.

Previously we wrote all groups' data contiguously starting at row 0,
so group 1+ would read zeros from the padding area. Now we scatter
each group's quantized activation at the correct offset.

Also:
- Size buffer for total_max_rows = padded_max * n_groups
- Use assemble_scales_2d_side for multi-group scale assembly
- Extract output per-group at correct offsets
2026-05-19 02:45:57 +00:00
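
The offset arithmetic described in 05cdde1676 can be sketched in NumPy. This is only an illustration under assumptions: the alignment value `pad_to`, the helper `group_rows`, and the fill values are hypothetical and do not come from the repository, and plain float arrays stand in for the quantized activations.

```python
import numpy as np

n_groups, T, D = 3, 5, 2
pad_to = 4  # hypothetical row alignment required by the grouped GEMM
padded_T = ((T + pad_to - 1) // pad_to) * pad_to  # round T up to a multiple of pad_to
total_rows = padded_T * n_groups                  # buffer sized for every group's range

# acts[g] holds group g's T rows, filled with g + 1 so zeros always mean padding.
acts = [np.full((T, D), g + 1, dtype=np.float32) for g in range(n_groups)]

# Scatter: group g owns rows [g * padded_T, (g + 1) * padded_T).
buf = np.zeros((total_rows, D), dtype=np.float32)
for g, a in enumerate(acts):
    start = g * padded_T
    buf[start:start + T] = a

def group_rows(result, g):
    """Extract group g's valid rows from a buffer laid out like `buf`."""
    start = g * padded_T
    return result[start:start + T]
```

Writing all groups contiguously from row 0 would instead leave rows `[padded_T, 2 * padded_T)` mostly zero, which matches the symptom the commit describes for group 1 and later.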
8fe5546bb3 Fix debug script 2026-05-19 02:43:17 +00:00
788f0aa65a Add step-by-step debug for wo_a 2026-05-19 02:43:05 +00:00
5f5b997fc3 Fix wo_a: permute to groups-first layout for grouped GEMM
The grouped GEMM expects mat_a to be laid out contiguously per group:
[all tokens for group0, all tokens for group1, ...]
A simple reshape of (T, G, D) → (T*G, D) gives interleaved layout
which is wrong. Fix: permute to (G, T, D) before flattening.
Same fix for output: permute (G, T, R) → (T, G, R).
2026-05-19 02:41:32 +00:00
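
The layout difference described in 5f5b997fc3 is easy to see on a toy array. The sketch below uses NumPy in place of the real tensors; the sizes and the labelling scheme (`10 * g + t`) are illustrative assumptions.

```python
import numpy as np

T, G, D = 3, 2, 4  # tokens, groups, features per group (illustrative sizes)

# Label every row: x[t, g, :] == 10 * g + t, so the source of each row is visible.
x = np.empty((T, G, D), dtype=np.int64)
for t in range(T):
    for g in range(G):
        x[t, g, :] = 10 * g + t

# Plain reshape keeps token-major order, so rows alternate between groups.
interleaved = x.reshape(T * G, D)

# Permute to (G, T, D) first, then flatten: each group's rows become contiguous.
groups_first = np.ascontiguousarray(x.transpose(1, 0, 2)).reshape(G * T, D)

# Output side: a (G, T, R) result is permuted back to (T, G, R).
R = 5
out_gtr = np.zeros((G, T, R))
out_tgr = out_gtr.transpose(1, 0, 2)
```

Here the first column of `interleaved` reads 0, 10, 1, 11, 2, 12 while the first column of `groups_first` reads 0, 1, 2, 10, 11, 12, which is the per-group contiguous order the commit says the grouped GEMM expects.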
77e4970d93 Add debug script for wo_a quantization 2026-05-19 02:40:43 +00:00
80122b850b Add debug script for wo_a 2026-05-19 02:39:55 +00:00
ae233ab648 Fix test: cos_sin_cache on CUDA device 2026-05-19 02:37:50 +00:00
882d4996ff Replace DeepGEMM fp8_einsum with CuTeDSL NVFP4 for wo_a (o_proj)
The B200 container crashes in DeepGEMM's fp8_einsum (t.dim() == N assertion
in layout.hpp:39) when processing wo_a (o-projection first half) in the
attention layer. The crash is caused by scale tensor dimension mismatch
for the SM100 recipe (1, 1, 128).

Instead of fighting DeepGEMM, replace the entire wo_a path with our own
CuTeDSL NVFP4 kernel:

1. inverse_rope_bf16() — Python implementation of inverse RoPE
   (replaces fused_inv_rope_fp8_quant CUDA kernel)
2. CuTeDSLNvfp4WoA — NVFP4 grouped linear for wo_a using
   ScaledGroupedGemm with n_local_groups=8 groups
3. wo_a weight quantized to NVFP4 instead of FP8 (native NVFP4,
   no conversion to another quantization)

Changes:
- cutedsl/inverse_rope.py: BF16 inverse RoPE (conjugate rotation)
- cutedsl/wo_a_grouped_linear.py: CuTeDSL NVFP4 grouped GEMM for wo_a
- vllm/patches/deepseek_v4_attention.py: Use NVFP4 path when runner
  is initialized, keep DeepGEMM fallback
- vllm/patches/deepseek_v4.py: Init NVFP4 runner instead of FP8 quant
- tests/test_wo_a.py: Unit test for inverse RoPE + wo_a GEMM
2026-05-19 02:36:30 +00:00
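
The "conjugate rotation" mentioned for cutedsl/inverse_rope.py can be illustrated without any GPU code. The sketch below is not the repository's implementation: it assumes adjacent feature pairs form one complex number, uses the common `base ** (-i / n_pairs)` frequency schedule, and works in Python floats rather than BF16. The function names are hypothetical.

```python
import cmath
import math

def rope_angles(pos, n_pairs, base=10000.0):
    """Rotation angle for each feature pair at position `pos`."""
    return [pos * base ** (-i / n_pairs) for i in range(n_pairs)]

def apply_rope(x, pos, inverse=False):
    """Rotate adjacent pairs (x[2i], x[2i+1]); inverse uses the conjugate."""
    n_pairs = len(x) // 2
    out = []
    for i, theta in enumerate(rope_angles(pos, n_pairs)):
        rot = cmath.rect(1.0, theta)  # unit complex number e^{i*theta}
        if inverse:
            rot = rot.conjugate()     # e^{-i*theta} undoes the rotation
        z = complex(x[2 * i], x[2 * i + 1]) * rot
        out.extend([z.real, z.imag])
    return out

x = [1.0, 2.0, -0.5, 3.0]
rotated = apply_rope(x, pos=7)
restored = apply_rope(rotated, pos=7, inverse=True)
```

Because each rotation is a unit complex number, multiplying by its conjugate recovers the original pair up to floating-point rounding.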
bab1f75f29 Fix gs None error in legacy _ensure_stacked path 2026-05-19 02:17:53 +00:00
48fa64dfda Eliminate weight copies: pass stacked checkpoint tensors directly
Memory optimization for MoE weight processing:

Before (3-4 copies of weights in memory):
1. Original checkpoint weights in layer.w13_weight (copy 1)
2. Per-expert permuted copies (copy 2)
3. torch.stack() in runner._ensure_stacked (copy 3)
4. make_b_k_major re-stride (copy 4)
5. Scales: permute then assemble_scales_3d_side un-permutes (wasted)

After (1-2 copies):
1. View checkpoint as fp4 (NO copy — byte-preserving view)
2. Pass (E, N, K) stacked tensor directly to runner
3. Runner permutes to (E, K, N) contiguous (copy 1), frees stacked ref
4. make_b_k_major re-strides (copy 2), frees (E, K, N) ref
5. Scales: already (N, K_sf) from checkpoint, call assembly directly
6. Free layer.w13_weight etc. immediately after extracting views

Also: assemble_scales_3d_side transposes (K_sf, N)→(N, K_sf) internally,
but checkpoint scales are ALREADY (N, K_sf). Skip the double-transpose
by calling assemble_raw_scales_2d3d_3d_side directly.
2026-05-19 02:16:43 +00:00
0612c1ab54 Use proper backend 2026-05-19 02:08:18 +00:00
00fe63b56f Fix compile test: add warmup for activation global scales 2026-05-19 01:57:16 +00:00
bba3bca4d3 Add torch.compile + custom op integration test 2026-05-19 01:56:46 +00:00
35fab6cff3 Replace autograd.Function with torch.library.custom_op for Dynamo compat
Dynamo (torch.compile fullgraph) cannot trace through CuTeDSL internals
(cute.compile, JIT, etc.). The autograd.Function approach was unreliable
with fullgraph mode — Dynamo would still try to trace through it.

Fix: torch.library.custom_op makes Dynamo treat our GEMM as an opaque
black box. No reimplementing the kernel — just route through the existing
runner via a registry pattern:
  - Runners registered in global dict with integer IDs
  - Custom op takes (tensors, runner_id, shape_hint) -> tensor
  - Dynamo calls fake impl for shape inference, never touches the runner
  - At execution time, real impl looks up runner and calls _run_impl

Changes:
  - New: cutedsl/custom_ops.py (custom op definitions + registry)
  - New: tests/test_custom_op.py (local unit tests, no GPU needed)
  - Removed: _Nvfp4LinearApply, _MoEApply (autograd.Function classes)
  - Updated: nvfp4_linear.py, runner.py, cutedsl.py, nvfp4_cutedsl.py
    to use custom ops instead of autograd.Function
  - Updated: cutedsl_quant_method.py to use custom op + registry
2026-05-19 01:54:48 +00:00
98153002c0 Go back to torch.library.custom_op with correct GEMM impl
allow_in_graph doesn't work — Dynamo can't create proxies for Python
objects (the runner). The custom op approach requires only tensor args.

This time the GEMM impl correctly:
- Uses quantize_activation_nvfp4 for activation quantization
- Pads x_fp4 via uint8 + view(float4) for torch.zeros compat
- Assembles A-side scales with pad + swizzle
- Uses int32 expert_offsets (CuTeDSL requirement)
- Passes runner's pre-assembled mat_b, scale_b, gsb tensors
2026-05-19 01:24:41 +00:00
02c500bbb1 Switch to allow_in_graph for Dynamo opacity instead of custom op
The custom op approach required reimplementing the GEMM (wrong scale
assembly, wrong tensor formats, cudaErrorIllegalAddress). Instead,
use torch.autograd.Function + torch._dynamo.allow_in_graph which
tells Dynamo to treat the function as an opaque kernel call, while
still using the runner's battle-tested _run_impl for the actual GEMM.

allow_in_graph is the proper way to register opaque ops for Dynamo
without reimplementing the computation.
2026-05-19 01:20:07 +00:00
581d87f9a6 Remove warmup forward from process_weights_after_loading
The warmup custom op call hit cudaErrorIllegalAddress because our
custom op GEMM implementation doesn't match the runner's call convention.
Skip warmup for now — MoE kernel warmup handles CuTeDSL JIT cleanup.
2026-05-19 01:18:54 +00:00
5d49849156 Fix: torch.zeros doesn't support Float4_e2m1fn_x2 dtype
Allocate as uint8 then view as float4_e2m1fn_x2 for padding buffer.
2026-05-19 01:15:24 +00:00
e1fcfc4f01 Add CuTeDSL warmup + CUDA sync after JIT compilation
CuTeDSL cute.compile corrupts GPU memory. Add warmup forward +
torch.cuda.synchronize() + health check after finalize_weights,
matching the MoE runner pattern.
2026-05-19 01:11:44 +00:00
1d9c0f996c Fix expert_offsets dtype: CuTeDSL expects int32 not int64
The DSLRuntimeError 'prev_off is Int32...update to Int64 inside if' was
caused by passing int64 expert_offsets when the kernel expects int32.
2026-05-19 01:05:20 +00:00
b81200f427 Fix CuTeDSL NVFP4 linear: correct scale assembly in custom op
- pad_and_swizzle_single takes 1 arg (2D tensor), not 4
- Inline the scale assembly logic: pad x_sf → swizzle → unsqueeze for 1 group
- Remove unused CuTeDSLNvfp4Linear import from custom op impl
2026-05-19 01:01:42 +00:00
e0eb436914 Fix custom_op registration: use as decorator with proper type hints 2026-05-19 00:54:30 +00:00
c609e9ba3c Use torch.library.custom_op for CuTeDSL NVFP4 linear GEMM
Dynamo in fullgraph mode traces through torch.autograd.Function, hitting
CuTeDSL JIT internals (Path.cwd) and crashing. Registering as a custom op
makes it opaque to Dynamo — tracing calls the fake impl, real impl only
runs during inference.

Custom op: cutedsl::nvfp4_gemm(x, mat_b, scale_b, global_scale_b,
    in_features, out_features, activation_global_scale) -> Tensor

Store finalized weight tensors on the layer (from runner._mat_b etc.)
instead of the runner object, since custom ops can only accept tensors.
2026-05-19 00:50:43 +00:00
c043a11bcc Register CuTeDSL as proper NvFp4LinearKernel for NVFP4 linear layers
- Create CuTeDSLNvFp4LinearKernel extending NvFp4LinearKernel base class
- Register it via init_nvfp4_linear_kernel() selection mechanism
  (inserted at top of _POSSIBLE_NVFP4_KERNELS, before FlashInfer)
- process_weights_after_loading: uint8→FP4, permute, create CuTeDSL runner
- apply_weights: route through CuTeDSL GEMM
- Update Dockerfile: copy kernel + registration script
- Fix attention: always use forward() for quantized compressor/indexer
  layers (dtype check was fragile after kernel swaps weights to dummy BF16)
2026-05-19 00:44:44 +00:00
358830925a Fix unpack error: handle both tuple and tensor returns from NVFP4 forward() 2026-05-19 00:33:43 +00:00
d9dc042ff7 Fix compressor kv_score: use forward() for NVFP4 quantized weights
Raw torch.mm doesn't work with packed uint8 NVFP4 weights.
Use MergedColumnParallelLinear.forward() which handles dequantization.
2026-05-19 00:29:43 +00:00
10c14ddb49 Fix NVFP4 mapper: layer norms, hc params, indexer path, q_a_norm
- input_layernorm → attn_norm, post_attention_layernorm → ffn_norm
- hc_head.fn/base/scale → hc_head_fn/base/scale
- attn_hc/ffn_hc → hc_attn/hc_ffn (dot to underscore)
- q_a_norm → q_norm, sinks → attn_sink
- Indexer params: self_attn.compressor.indexer → attn.indexer
  (not attn.mla_attn.compressor.indexer)
2026-05-19 00:24:26 +00:00
540e7ee8fc Fix: layer.self_attn → layer.attn (model uses attn, not self_attn) 2026-05-19 00:14:09 +00:00
201a40e6c4 Fix zero-dim tensor concatenation in compressor scale buffer
input_scale and weight_scale_2 are 0-dim scalars in the NVFP4 checkpoint.
torch.cat can't concatenate scalars — reshape to 1-d first.
2026-05-19 00:10:13 +00:00
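
The zero-dim problem in 201a40e6c4 has a direct NumPy analogue, shown here as a sketch. NumPy is a stand-in for the actual tensors and the variable names are illustrative: `np.concatenate` refuses 0-d arrays, and reshaping each scalar to 1-d first makes the concatenation valid.

```python
import numpy as np

# Two per-tensor scales stored as 0-d arrays, one per fused shard.
wkv_input_scale = np.asarray(0.5, dtype=np.float32)
wgate_input_scale = np.asarray(0.25, dtype=np.float32)

try:
    np.concatenate([wkv_input_scale, wgate_input_scale])
    failed = False
except ValueError:
    # Zero-dimensional arrays cannot be concatenated.
    failed = True

# Fix: give each scalar one dimension before concatenating.
fixed = np.concatenate([wkv_input_scale.reshape(1), wgate_input_scale.reshape(1)])
```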
d41a48aa1f Fix KeyError for missing stacked params (indexer.compressor)
Not all layers have the same indexer structure. The stacking path
was trying to access params that don't exist in params_dict. Added
checks to skip missing stacked params instead of KeyError.
2026-05-18 23:54:02 +00:00
4b0d8263f6 Fix NameError: use print instead of logger (not imported) 2026-05-18 23:49:42 +00:00
e3c24769e2 Handle wo_a as bfloat16 (unquantized in NVFP4 checkpoint)
o_a_proj is NOT quantized by modelopt in the checkpoint (bfloat16),
but the attention forward pass expects FP8 (weight + weight_scale_inv).

- Create wo_a with quant_config=None to load bfloat16 weights
- Add FP8 quantization of wo_a in finalize_mega_moe_weights:
  per-tensor symmetric quantization to float8_e4m3fn + weight_scale_inv
- This matches what the fused_inv_rope_fp8_quant + einsum expects
2026-05-18 23:41:39 +00:00
9d016aa1c0 Use print instead of logger for weight load debug 2026-05-18 23:30:58 +00:00
a6f61bda5d Add debug logging for weight loading failures 2026-05-18 23:28:15 +00:00
eef0ef76af Fix NVFP4 compressor scale loading: buffer and concatenate scale shards
The stacked params mapping (wkv + wgate → fused_wkv_wgate) uses
weight_loader(param, weight, shard_id), but PerTensorScaleParameter
and ModelWeightParameter for NVFP4 scale params don't support shard_id
in load_column_parallel_weight (asserts shape equality).

Fix: buffer input_scale, weight_scale, weight_scale_2 for fused_wkv_wgate
shards, then concatenate along dim 0 and copy_ into the param after all
weights are loaded.
2026-05-18 23:24:08 +00:00
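
The buffer-then-concatenate fix in eef0ef76af might look roughly like the following. This is a sketch under assumptions: `ShardBuffer` is a hypothetical helper, lists stand in for scale tensors, and shard IDs are taken to be 0 for wkv and 1 for wgate.

```python
class ShardBuffer:
    """Collect per-shard values and join them in shard order once all arrived."""

    def __init__(self, n_shards):
        self.n_shards = n_shards
        self._pending = {}  # param name -> {shard_id: values}

    def add(self, param_name, shard_id, values):
        self._pending.setdefault(param_name, {})[shard_id] = list(values)

    def concat(self, param_name):
        shards = self._pending[param_name]
        missing = [i for i in range(self.n_shards) if i not in shards]
        if missing:
            raise KeyError(f"{param_name}: missing shards {missing}")
        out = []
        for i in range(self.n_shards):  # concatenate along dim 0 in shard order
            out.extend(shards[i])
        del self._pending[param_name]
        return out

buf = ShardBuffer(n_shards=2)
# Shards may arrive in any order while the checkpoint is being read.
buf.add("fused_wkv_wgate.weight_scale", 1, [0.3, 0.4])
buf.add("fused_wkv_wgate.weight_scale", 0, [0.1, 0.2])
fused = buf.concat("fused_wkv_wgate.weight_scale")
```

Deferring the join until every shard is present avoids calling a per-shard loader on parameters that assert the incoming shape equals the full parameter shape.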
f74447bfd0 Proper NVFP4 integration: quantized compressor/indexer + mapper fixes
Weight mapper fixes:
- Reorder substr renames: compressor renames first, then .self_attn.compressor.
  → .attn.mla_attn.compressor., then indexer renames (so indexer keys end up
  under mla_attn after the compressor rename already fired)
- Add compressor param renames: kv_proj→wkv, gate_proj→wgate, kv_norm→norm,
  position_bias→ape (checkpoint uses NVFP4 naming, model uses internal names)
- Add indexer param renames: q_b_proj→wq_b, kv_proj→compressor.wkv,
  gate_proj→compressor.wgate, kv_norm→k_norm, position_bias→compressor.ape,
  weights_proj stays (structural: compressor.indexer → indexer.compressor)
- Remove broken suffix renames (already fixed in prior commit)

Model architecture fixes:
- Patch deepseek_compressor.py to pass quant_config (was None, but NVFP4
  checkpoint has quantized compressor weights with input_scale/weight_scale)
- Patch deepseek_v4_attention.py indexer: weights_proj now uses quant_config
  (was None, but checkpoint has quantized weights)
- Add indexer.compressor.fused_wkv_wgate stacking in load_weights

Infrastructure:
- Add deepseek_compressor.py to Dockerfile
- Force MoE backend to flashinfer_cutedsl (was auto-selecting FLASHINFER_TRTLLM)
- Update unit test to 50 cases (compressor + indexer + quantization scales)
2026-05-18 23:20:13 +00:00
17496b2615 Fix NVFP4 weights mapper: add prefix mappings, fix substr order
- Add orig_to_new_prefix mappings (layers→model.layers, embed_tokens→model.embed_tokens, etc.)
  AutoWeightsLoader strips the model. prefix before the mapper runs, so these are required
- Move .self_attn.compressor. → .attn.mla_attn.compressor. before .self_attn. → .attn.
  in substr_renames so compressor keys get the mla_attn prefix before the general rename
- Remove suffix renames (head.weight→lm_head.weight, embed.weight→embed_tokens.weight)
  that were causing double-mapping since the NVFP4 checkpoint already uses lm_head/embed_tokens
- Add unit test: tests/test_nvfp4_mapper.py (39 cases, no vLLM/CUDA needed)
2026-05-18 23:03:34 +00:00
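
Why the order of substring renames matters in 17496b2615 can be shown with a few lines of Python. The sketch assumes the renames are applied one after another in list order, each to the result of the previous one; `apply_substr_renames` is a hypothetical helper, and the two rename pairs are the ones named in this commit.

```python
def apply_substr_renames(key, renames):
    """Apply (old, new) substring replacements sequentially, in list order."""
    for old, new in renames:
        key = key.replace(old, new)
    return key

specific_first = [
    (".self_attn.compressor.", ".attn.mla_attn.compressor."),
    (".self_attn.", ".attn."),
]
general_first = list(reversed(specific_first))

key = "layers.0.self_attn.compressor.kv_proj.weight"
with_specific_first = apply_substr_renames(key, specific_first)
with_general_first = apply_substr_renames(key, general_first)
```

With the general rule first, `.self_attn.` is consumed before the compressor rule can match, so the key ends up as `layers.0.attn.compressor.kv_proj.weight` and never receives the `mla_attn` segment this commit intended.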
b039123207 Fix NVFP4 mapper: add attention projection renames, remove norm_gate renames
- Add specific .self_attn.{q_a,kv,q_b,o_a,o_b}_proj → .attn.{wq_a,wkv,wq_b,wo_a,wo_b}
- Remove norm_gate suffix renames (nightly uses 'gate' not 'norm_gate')
- Order substr renames: specific before general
2026-05-18 22:53:09 +00:00
ea648a9bc2 Fix NVFP4 mapper: keep model. prefix (model params use it) 2026-05-18 22:49:40 +00:00
1528d4e182 Fix NVFP4 mapper: strip model. prefix from checkpoint keys
The NVFP4 checkpoint uses model.layers.* but vLLM's AutoWeightsLoader
expects layers.* (relative to the model module). Strip the model. prefix
instead of adding it.
2026-05-18 22:46:04 +00:00
5d37674fb1 Add cutedsl to MoEBackend type in kernel config 2026-05-18 22:38:41 +00:00
7409204d71 Use nightly's deepseek_v4.py + attention as base, add only NVFP4 mapper
The upstream deepseek_v4.py has imports that don't exist in the nightly
Docker image (norm_gate_linear, breakable_cudagraph, etc.). Use the
nightly's own files as the base and add only the minimal NVFP4 changes:
- Add _make_deepseek_v4_nvfp4_weights_mapper() for checkpoint key mapping
- Select NVFP4 mapper when quant_config is modelopt_fp4
- cos_sin_cache float32 fix in attention
- Remove utils.py patch (not needed)
2026-05-18 22:33:51 +00:00
a19ed4a18e Remove breakable_cudagraph import (not in nightly) 2026-05-18 22:29:24 +00:00
b007937a68 Fix garbled imports in cutedsl/runner.py 2026-05-18 22:22:52 +00:00
a7ed8faec6 Proper NVFP4 integration: use ModelOptNvFp4Config + FusedMoE framework
Major refactor to eliminate all post-load hacks:
- deepseek_v4.py: use upstream model with NVFP4 weight mapper only
  (gate_proj→w1, up_proj→w3, down_proj→w2, .self_attn→.attn, .mlp→.ffn)
- Add CuTeDSLMoEExperts as a FusedMoEExpertsModular subclass
  that wraps our CuTeDSL runner as a proper vLLM MoE backend
- Register CUTEDSL backend in the NVFP4 oracle
- Use ModelOptNvFp4Config for quantization dispatch (not DeepseekV4FP8Config)
- ModelOptNvFp4LinearMethod handles NVFP4 attention/shared expert projections
- Remove nvfp4_cutedsl.py, cutedsl_quant_method.py, utils.py from Dockerfile
- CuTeDSL runner moved to cutedsl/runner.py for clean imports
- cos_sin_cache float32 fix in deepseek_v4_attention.py

No more monkey-patching, no _convert_nvfp4_post_load, no CuTeDSLNvfp4Method.
2026-05-18 22:19:23 +00:00
48386e34ad Fix torch.compile: use custom autograd Function instead of @torch.compiler.disable
torch.compile fullgraph mode can't handle @torch.compiler.disable (skips
the function and refuses to compile). Custom autograd Functions are treated
as opaque ops by torch.compile — they execute eagerly without the compiler
trying to trace into CuTeDSL internals (JIT, Path.cwd, etc).
2026-05-18 21:38:28 +00:00