Commit Graph

163 Commits

Author SHA1 Message Date
48fa64dfda Eliminate weight copies: pass stacked checkpoint tensors directly
Memory optimization for MoE weight processing:

Before (3-4 copies of weights in memory):
1. Original checkpoint weights in layer.w13_weight (copy 1)
2. Per-expert permuted copies (copy 2)
3. torch.stack() in runner._ensure_stacked (copy 3)
4. make_b_k_major re-stride (copy 4)
5. Scales: permute then assemble_scales_3d_side un-permutes (wasted)

After (1-2 copies):
1. View checkpoint as fp4 (NO copy — byte-preserving view)
2. Pass (E, N, K) stacked tensor directly to runner
3. Runner permutes to (E, K, N) contiguous (copy 1), frees stacked ref
4. make_b_k_major re-strides (copy 2), frees (E, K, N) ref
5. Scales: already (N, K_sf) from checkpoint, call assembly directly
6. Free layer.w13_weight etc. immediately after extracting views

Also: assemble_scales_3d_side transposes (K_sf, N)→(N, K_sf) internally,
but checkpoint scales are ALREADY (N, K_sf). Skip the double-transpose
by calling assemble_raw_scales_2d3d_3d_side directly.
2026-05-19 02:16:43 +00:00
35fab6cff3 Replace autograd.Function with torch.library.custom_op for Dynamo compat
Dynamo (torch.compile fullgraph) cannot trace through CuTeDSL internals
(cute.compile, JIT, etc.). The autograd.Function approach was unreliable
with fullgraph mode — Dynamo would still try to trace through it.

Fix: torch.library.custom_op makes Dynamo treat our GEMM as an opaque
black box. No reimplementing the kernel — just route through the existing
runner via a registry pattern:
  - Runners registered in global dict with integer IDs
  - Custom op takes (tensors, runner_id, shape_hint) -> tensor
  - Dynamo calls fake impl for shape inference, never touches the runner
  - At execution time, real impl looks up runner and calls _run_impl

Changes:
  - New: cutedsl/custom_ops.py (custom op definitions + registry)
  - New: tests/test_custom_op.py (local unit tests, no GPU needed)
  - Removed: _Nvfp4LinearApply, _MoEApply (autograd.Function classes)
  - Updated: nvfp4_linear.py, runner.py, cutedsl.py, nvfp4_cutedsl.py
    to use custom ops instead of autograd.Function
  - Updated: cutedsl_quant_method.py to use custom op + registry
2026-05-19 01:54:48 +00:00
98153002c0 Go back to torch.library.custom_op with correct GEMM impl
allow_in_graph doesn't work — Dynamo can't create proxies for Python
objects (the runner). The custom op approach requires only tensor args.

This time the GEMM impl correctly:
- Uses quantize_activation_nvfp4 for activation quantization
- Pads x_fp4 via uint8 + view(float4) for torch.zeros compat
- Assembles A-side scales with pad + swizzle
- Uses int32 expert_offsets (CuTeDSL requirement)
- Passes runner's pre-assembled mat_b, scale_b, gsb tensors
2026-05-19 01:24:41 +00:00
02c500bbb1 Switch to allow_in_graph for Dynamo opacity instead of custom op
The custom op approach required reimplementing the GEMM (wrong scale
assembly, wrong tensor formats, cudaErrorIllegalAddress). Instead,
use torch.autograd.Function + torch._dynamo.allow_in_graph which
tells Dynamo to treat the function as an opaque kernel call, while
still using the runner's battle-tested _run_impl for the actual GEMM.

allow_in_graph is the proper way to register opaque ops for Dynamo
without reimplementing the computation.
2026-05-19 01:20:07 +00:00
581d87f9a6 Remove warmup forward from process_weights_after_loading
The warmup custom op call hit cudaErrorIllegalAddress because our
custom op GEMM implementation doesn't match the runner's call convention.
Skip warmup for now — MoE kernel warmup handles CuTeDSL JIT cleanup.
2026-05-19 01:18:54 +00:00
5d49849156 Fix: torch.zeros doesn't support Float4_e2m1fn_x2 dtype
Allocate as uint8 then view as float4_e2m1fn_x2 for padding buffer.
2026-05-19 01:15:24 +00:00
e1fcfc4f01 Add CuTeDSL warmup + CUDA sync after JIT compilation
CuTeDSL cute.compile corrupts GPU memory. Add warmup forward +
torch.cuda.synchronize() + health check after finalize_weights,
matching the MoE runner pattern.
2026-05-19 01:11:44 +00:00
1d9c0f996c Fix expert_offsets dtype: CuTeDSL expects int32 not int64
The DSLRuntimeError 'prev_off is Int32...update to Int64 inside if' was
caused by passing int64 expert_offsets when the kernel expects int32.
2026-05-19 01:05:20 +00:00
b81200f427 Fix CuTeDSL NVFP4 linear: correct scale assembly in custom op
- pad_and_swizzle_single takes 1 arg (2D tensor), not 4
- Inline the scale assembly logic: pad x_sf → swizzle → unsqueeze for 1 group
- Remove unused CuTeDSLNvfp4Linear import from custom op impl
2026-05-19 01:01:42 +00:00
e0eb436914 Fix custom_op registration: use as decorator with proper type hints 2026-05-19 00:54:30 +00:00
c609e9ba3c Use torch.library.custom_op for CuTeDSL NVFP4 linear GEMM
Dynamo in fullgraph mode traces through torch.autograd.Function, hitting
CuTeDSL JIT internals (Path.cwd) and crashing. Registering as a custom op
makes it opaque to Dynamo — tracing calls the fake impl, real impl only
runs during inference.

Custom op: cutedsl::nvfp4_gemm(x, mat_b, scale_b, global_scale_b,
    in_features, out_features, activation_global_scale) -> Tensor

Store finalized weight tensors on the layer (from runner._mat_b etc.)
instead of the runner object, since custom ops can only accept tensors.
2026-05-19 00:50:43 +00:00
c043a11bcc Register CuTeDSL as proper NvFp4LinearKernel for NVFP4 linear layers
- Create CuTeDSLNvFp4LinearKernel extending NvFp4LinearKernel base class
- Register it via init_nvfp4_linear_kernel() selection mechanism
  (inserted at top of _POSSIBLE_NVFP4_KERNELS, before FlashInfer)
- process_weights_after_loading: uint8→FP4, permute, create CuTeDSL runner
- apply_weights: route through CuTeDSL GEMM
- Update Dockerfile: copy kernel + registration script
- Fix attention: always use forward() for quantized compressor/indexer
  layers (dtype check was fragile after kernel swaps weights to dummy BF16)
2026-05-19 00:44:44 +00:00
358830925a Fix unpack error: handle both tuple and tensor returns from NVFP4 forward() 2026-05-19 00:33:43 +00:00
d9dc042ff7 Fix compressor kv_score: use forward() for NVFP4 quantized weights
Raw torch.mm doesn't work with packed uint8 NVFP4 weights.
Use MergedColumnParallelLinear.forward() which handles dequantization.
2026-05-19 00:29:43 +00:00
10c14ddb49 Fix NVFP4 mapper: layer norms, hc params, indexer path, q_a_norm
- input_layernorm → attn_norm, post_attention_layernorm → ffn_norm
- hc_head.fn/base/scale → hc_head_fn/base/scale
- attn_hc/ffn_hc → hc_attn/hc_ffn (dot to underscore)
- q_a_norm → q_norm, sinks → attn_sink
- Indexer params: self_attn.compressor.indexer → attn.indexer
  (not attn.mla_attn.compressor.indexer)
2026-05-19 00:24:26 +00:00
540e7ee8fc Fix: layer.self_attn → layer.attn (model uses attn, not self_attn) 2026-05-19 00:14:09 +00:00
201a40e6c4 Fix zero-dim tensor concatenation in compressor scale buffer
input_scale and weight_scale_2 are 0-dim scalars in the NVFP4 checkpoint.
torch.cat can't concatenate scalars — reshape to 1-d first.
2026-05-19 00:10:13 +00:00
d41a48aa1f Fix KeyError for missing stacked params (indexer.compressor)
Not all layers have the same indexer structure. The stacking path
was trying to access params that don't exist in params_dict. Added
checks to skip missing stacked params instead of KeyError.
2026-05-18 23:54:02 +00:00
4b0d8263f6 Fix NameError: use print instead of logger (not imported) 2026-05-18 23:49:42 +00:00
e3c24769e2 Handle wo_a as bfloat16 (unquantized in NVFP4 checkpoint)
o_a_proj is NOT quantized by modelopt in the checkpoint (bfloat16),
but the attention forward pass expects FP8 (weight + weight_scale_inv).

- Create wo_a with quant_config=None to load bfloat16 weights
- Add FP8 quantization of wo_a in finalize_mega_moe_weights:
  per-tensor symmetric quantization to float8_e4m3fn + weight_scale_inv
- This matches what the fused_inv_rope_fp8_quant + einsum expects
2026-05-18 23:41:39 +00:00
9d016aa1c0 Use print instead of logger for weight load debug 2026-05-18 23:30:58 +00:00
a6f61bda5d Add debug logging for weight loading failures 2026-05-18 23:28:15 +00:00
eef0ef76af Fix NVFP4 compressor scale loading: buffer and concatenate scale shards
The stacked params mapping (wkv + wgate → fused_wkv_wgate) uses
weight_loader(param, weight, shard_id), but PerTensorScaleParameter
and ModelWeightParameter for NVFP4 scale params don't support shard_id
in load_column_parallel_weight (asserts shape equality).

Fix: buffer input_scale, weight_scale, weight_scale_2 for fused_wkv_wgate
shards, then concatenate along dim 0 and copy_ into the param after all
weights are loaded.
2026-05-18 23:24:08 +00:00
f74447bfd0 Proper NVFP4 integration: quantized compressor/indexer + mapper fixes
Weight mapper fixes:
- Reorder substr renames: compressor renames first, then .self_attn.compressor.
  → .attn.mla_attn.compressor., then indexer renames (so indexer keys end up
  under mla_attn after the compressor rename already fired)
- Add compressor param renames: kv_proj→wkv, gate_proj→wgate, kv_norm→norm,
  position_bias→ape (checkpoint uses NVFP4 naming, model uses internal names)
- Add indexer param renames: q_b_proj→wq_b, kv_proj→compressor.wkv,
  gate_proj→compressor.wgate, kv_norm→k_norm, position_bias→compressor.ape,
  weights_proj stays (structural: compressor.indexer → indexer.compressor)
- Remove broken suffix renames (already fixed in prior commit)

Model architecture fixes:
- Patch deepseek_compressor.py to pass quant_config (was None, but NVFP4
  checkpoint has quantized compressor weights with input_scale/weight_scale)
- Patch deepseek_v4_attention.py indexer: weights_proj now uses quant_config
  (was None, but checkpoint has quantized weights)
- Add indexer.compressor.fused_wkv_wgate stacking in load_weights

Infrastructure:
- Add deepseek_compressor.py to Dockerfile
- Force MoE backend to flashinfer_cutedsl (was auto-selecting FLASHINFER_TRTLLM)
- Update unit test to 50 cases (compressor + indexer + quantization scales)
2026-05-18 23:20:13 +00:00
17496b2615 Fix NVFP4 weights mapper: add prefix mappings, fix substr order
- Add orig_to_new_prefix mappings (layers→model.layers, embed_tokens→model.embed_tokens, etc.)
  AutoWeightsLoader strips the model. prefix before the mapper runs, so these are required
- Move .self_attn.compressor. → .attn.mla_attn.compressor. before .self_attn. → .attn.
  in substr_renames so compressor keys get the mla_attn prefix before the general rename
- Remove suffix renames (head.weight→lm_head.weight, embed.weight→embed_tokens.weight)
  that were causing double-mapping since the NVFP4 checkpoint already uses lm_head/embed_tokens
- Add unit test: tests/test_nvfp4_mapper.py (39 cases, no vLLM/CUDA needed)
2026-05-18 23:03:34 +00:00
b039123207 Fix NVFP4 mapper: add attention projection renames, remove norm_gate renames
- Add specific .self_attn.{q_a,kv,q_b,o_a,o_b}_proj → .attn.{wq_a,wkv,wq_b,wo_a,wo_b}
- Remove norm_gate suffix renames (nightly uses 'gate' not 'norm_gate')
- Order substr renames: specific before general
2026-05-18 22:53:09 +00:00
ea648a9bc2 Fix NVFP4 mapper: keep model. prefix (model params use it) 2026-05-18 22:49:40 +00:00
1528d4e182 Fix NVFP4 mapper: strip model. prefix from checkpoint keys
The NVFP4 checkpoint uses model.layers.* but vLLM's AutoWeightsLoader
expects layers.* (relative to the model module). Strip the model. prefix
instead of adding it.
2026-05-18 22:46:04 +00:00
5d37674fb1 Add cutedsl to MoEBackend type in kernel config 2026-05-18 22:38:41 +00:00
7409204d71 Use nightly's deepseek_v4.py + attention as base, add only NVFP4 mapper
The upstream deepseek_v4.py has imports that don't exist in the nightly
Docker image (norm_gate_linear, breakable_cudagraph, etc.). Use the
nightly's own files as the base and add only the minimal NVFP4 changes:
- Add _make_deepseek_v4_nvfp4_weights_mapper() for checkpoint key mapping
- Select NVFP4 mapper when quant_config is modelopt_fp4
- cos_sin_cache float32 fix in attention
- Remove utils.py patch (not needed)
2026-05-18 22:33:51 +00:00
a19ed4a18e Remove breakable_cudagraph import (not in nightly) 2026-05-18 22:29:24 +00:00
b007937a68 Fix garbled imports in cutedsl/runner.py 2026-05-18 22:22:52 +00:00
a7ed8faec6 Proper NVFP4 integration: use ModelOptNvFp4Config + FusedMoE framework
Major refactor to eliminate all post-load hacks:
- deepseek_v4.py: use upstream model with NVFP4 weight mapper only
  (gate_proj→w1, up_proj→w3, down_proj→w2, .self_attn→.attn, .mlp→.ffn)
- Add CuTeDSLMoEExperts as a FusedMoEExpertsModular subclass
  that wraps our CuTeDSL runner as a proper vLLM MoE backend
- Register CUTEDSL backend in the NVFP4 oracle
- Use ModelOptNvFp4Config for quantization dispatch (not DeepseekV4FP8Config)
- ModelOptNvFp4LinearMethod handles NVFP4 attention/shared expert projections
- Remove nvfp4_cutedsl.py, cutedsl_quant_method.py, utils.py from Dockerfile
- CuTeDSL runner moved to cutedsl/runner.py for clean imports
- cos_sin_cache float32 fix in deepseek_v4_attention.py

No more monkey-patching, no _convert_nvfp4_post_load, no CuTeDSLNvfp4Method.
2026-05-18 22:19:23 +00:00
48386e34ad Fix torch.compile: use custom autograd Function instead of @torch.compiler.disable
torch.compile fullgraph mode can't handle @torch.compiler.disable (skips
the function and refuses to compile). Custom autograd Functions are treated
as opaque ops by torch.compile — they execute eagerly without the compiler
trying to trace into CuTeDSL internals (JIT, Path.cwd, etc).
2026-05-18 21:38:28 +00:00
85e1cd3b69 Fix torch.compile crash: @torch.compiler.disable on all CuTeDSL run()
CuTeDSL internals (Path.cwd, threading, JIT) are incompatible with
torch.dynamo tracing. Marking run() as compiler-disabled makes the
runners opaque to torch.compile — they execute eagerly while the
rest of the model gets compiled.
2026-05-18 21:07:35 +00:00
a94011ec92 Fix torch.compile crash: remove threading.Lock from LUT cache path
The _NVFP4_STEP_LUT_LOCK caused 'Unsupported context manager' under
torch.compile/cudagraph. LUT is now pre-populated during warmup so
the fast path (cache hit) never hits a lock.

Also removed all init/warmup debug prints from CuTeDSL kernels.
2026-05-18 20:54:55 +00:00
6326222d68 Fix: add abstract create_weights to CuTeDSLNvfp4LinearMethod 2026-05-18 20:40:48 +00:00
450793311c Wire CuTeDSL kernels into vLLM: replace all BF16 dequant with native NVFP4
- CuTeDSLNvfp4Method: custom quant method that creates CuTeDSL runners
  during process_weights_after_loading, then swaps to CuTeDSLNvfp4LinearMethod
  for forward dispatch
- Attention projections (fused_wqa_wkv, wq_b, wo_b) now route through
  CuTeDSLNvfp4Linear (cosine 0.992-0.996 vs BF16 reference)
- Shared expert now uses CuTeDSLSharedExpertRunner (cosine 0.992 vs BF16)
  with monkey-patched forward for fused L1+SiLU+L2 pipeline
- Deleted all BF16 dequant code (_dequant_nvfp4_to_bf16, _post_quant_fix,
  input_scale fixes)
- Deleted _post_quant_fix hook from utils.py
- Fixed SwiGLU clamp: gate clamped BEFORE SiLU (matching SiluAndMulWithClamp)
- Cleaned up all debug prints
- Updated Dockerfile with new kernel files
2026-05-18 20:27:42 +00:00
1836e5fdc7 Add shared experts to post-quant BF16 dequant fix
Shared experts also use FlashInferCutlassNvFp4LinearKernel with
broken input_scale. They need the same BF16 dequant treatment.
gate_up_proj and down_proj on ffn.shared_experts.
2026-05-18 19:27:49 +00:00
82ac648563 Patch utils.py the standard way: copy modified file into Docker image
Instead of fragile inline Dockerfile patching, just copy a modified
utils.py (with _post_quant_fix call) into the image, same pattern
as deepseek_v4.py and deepseek_v4_attention.py patches.
2026-05-18 19:10:08 +00:00
75844a8361 Post-quant fix via Dockerfile patch to process_weights_after_loading
Forward pre-hook approach didn't work — torch.compile and model
wrappers bypass hooks. Instead, patch vLLM's utils.py to call
model._post_quant_fix() at the end of process_weights_after_loading.
This guarantees the fix runs AFTER quant methods set up their attrs.

Dockerfile now patches:
  model_loader/utils.py → calls model._post_quant_fix() if it exists

DeepseekV4ForCausalLM._post_quant_fix() dequantizes attention
NVFP4 weights to BF16 and replaces quant_method.
2026-05-18 18:35:34 +00:00
a4ad5898c1 Fix post-quant hook: register on inner model, fix module refs
vLLM V1 calls DeepseekV4Model.forward() directly, not
DeepseekV4ForCausalLM.forward(). Hook on the outer model never fires.
Moved hook to self.model (inner) and fixed module.model.layers →
module.layers.
2026-05-18 18:15:36 +00:00
a51edd238e Add post-quant-init forward hook to fix attention NVFP4
The key insight: process_weights_after_loading runs AFTER load_weights
and sets up FlashInferCutlassNvFp4LinearKernel with broken
input_global_scale_inv. Any fix inside load_weights gets overwritten.

Solution: register a one-shot forward pre-hook that runs on the first
forward call (guaranteed after all init). It dequantizes attention
NVFP4 weights to BF16 and replaces quant_method with
UnquantizedLinearMethod. Since process_weights_after_loading already
ran, our changes won't be overwritten.

Standalone test confirmed: all attention weights produce valid
non-NaN output when dequantized to BF16.
2026-05-18 17:56:19 +00:00
2835cb040b Fix input_scale BEFORE process_weights_after_loading runs
Instead of dequantizing to BF16 (which gets overwritten by
process_weights_after_loading), fix the input_scale parameter
on the module before the quant method reads it. The quant method
computes input_global_scale_inv = input_scale.max(), so fixing
input_scale propagates the correct activation scale.

Computes correct input_scale by temporarily dequantizing weight
to BF16, running warmup forward, and computing act_amax.
input_scale = 1/(act_amax * headroom).
2026-05-18 16:43:44 +00:00
2fc81ccac4 Revert to BF16 dequant for attention NVFP4 (input_scale fix was too early)
process_weights_after_loading sets input_global_scale_inv AFTER
_convert_nvfp4_post_load runs, so the fix couldn't find the attrs.
Going back to BF16 dequant approach. The zeros in the dummy run are
expected (attention_impl returns early with out.zero_()). Need to test
with a real request under cudagraph_mode=NONE.
2026-05-18 16:23:41 +00:00
4a57399592 Add debug prints for input_global_scale_inv check 2026-05-18 15:59:59 +00:00
f86892e26b Replace BF16 dequant with input_scale warmup fix for attention NVFP4
Instead of dequantizing attention weights to BF16 (which had issues
with MergedColumnParallelLinear and different weight_scale_2 values),
keep the NVFP4 path but fix the activation global scale.

Compute correct input_global_scale_inv by:
1. Temporarily dequantizing weight to BF16
2. Running warmup forward with random input
3. Computing actual activation amax
4. Setting scale_inv = amax * headroom

This preserves the original NVFP4 quantization pipeline.
2026-05-18 15:43:46 +00:00
301015b037 Remove all inline diagnostics — incompatible with torch.compile
Data-dependent expressions (amax().item(), isnan().any().item())
cause Dynamo guard failures even when gated by os.environ.
cudagraph_mode=NONE still uses torch.compile, so these break.
Will need enforce-eager for diagnostics going forward.
2026-05-18 15:22:53 +00:00
2a2a42c6d6 Add attention-internal diagnostics: MLA output, FP8 quant output 2026-05-18 14:45:43 +00:00
5c1dda10f6 Add granular attention diagnostics: pre/post attn, embed, dequant stats 2026-05-18 14:24:14 +00:00