Dynamo (torch.compile fullgraph) cannot trace through CuTeDSL internals
(cute.compile, JIT, etc.). The autograd.Function approach was unreliable
with fullgraph mode — Dynamo would still try to trace through it.
Fix: torch.library.custom_op makes Dynamo treat our GEMM as an opaque
black box. No reimplementing the kernel — just route through the existing
runner via a registry pattern:
- Runners registered in global dict with integer IDs
- Custom op takes (tensors, runner_id, shape_hint) -> tensor
- Dynamo calls fake impl for shape inference, never touches the runner
- At execution time, real impl looks up runner and calls _run_impl
Changes:
- New: cutedsl/custom_ops.py (custom op definitions + registry)
- New: tests/test_custom_op.py (local unit tests, no GPU needed)
- Removed: _Nvfp4LinearApply, _MoEApply (autograd.Function classes)
- Updated: nvfp4_linear.py, runner.py, cutedsl.py, nvfp4_cutedsl.py
to use custom ops instead of autograd.Function
- Updated: cutedsl_quant_method.py to use custom op + registry
allow_in_graph doesn't work — Dynamo can't create proxies for Python
objects (the runner). The custom op approach requires only tensor args.
This time the GEMM impl correctly:
- Uses quantize_activation_nvfp4 for activation quantization
- Pads x_fp4 via uint8 + view(float4) for torch.zeros compat
- Assembles A-side scales with pad + swizzle
- Uses int32 expert_offsets (CuTeDSL requirement)
- Passes runner's pre-assembled mat_b, scale_b, gsb tensors
The custom op approach required reimplementing the GEMM (wrong scale
assembly, wrong tensor formats, cudaErrorIllegalAddress). Instead,
use torch.autograd.Function + torch._dynamo.allow_in_graph which
tells Dynamo to treat the function as an opaque kernel call, while
still using the runner's battle-tested _run_impl for the actual GEMM.
allow_in_graph is the proper way to register opaque ops for Dynamo
without reimplementing the computation.
The warmup custom op call hit cudaErrorIllegalAddress because our
custom op GEMM implementation doesn't match the runner's call convention.
Skip warmup for now — MoE kernel warmup handles CuTeDSL JIT cleanup.
- pad_and_swizzle_single takes 1 arg (2D tensor), not 4
- Inline the scale assembly logic: pad x_sf → swizzle → unsqueeze for 1 group
- Remove unused CuTeDSLNvfp4Linear import from custom op impl
Dynamo in fullgraph mode traces through torch.autograd.Function, hitting
CuTeDSL JIT internals (Path.cwd) and crashing. Registering as a custom op
makes it opaque to Dynamo — tracing calls the fake impl, real impl only
runs during inference.
Custom op: cutedsl::nvfp4_gemm(x, mat_b, scale_b, global_scale_b,
in_features, out_features, activation_global_scale) -> Tensor
Store finalized weight tensors on the layer (from runner._mat_b etc.)
instead of the runner object, since custom ops can only accept tensors.
- Create CuTeDSLNvFp4LinearKernel extending NvFp4LinearKernel base class
- Register it via init_nvfp4_linear_kernel() selection mechanism
(inserted at top of _POSSIBLE_NVFP4_KERNELS, before FlashInfer)
- process_weights_after_loading: uint8→FP4, permute, create CuTeDSL runner
- apply_weights: route through CuTeDSL GEMM
- Update Dockerfile: copy kernel + registration script
- Fix attention: always use forward() for quantized compressor/indexer
layers (dtype check was fragile after kernel swaps weights to dummy BF16)
Not all layers have the same indexer structure. The stacking path
was trying to access params that don't exist in params_dict. Added
checks to skip missing stacked params instead of KeyError.
o_a_proj is NOT quantized by modelopt in the checkpoint (bfloat16),
but the attention forward pass expects FP8 (weight + weight_scale_inv).
- Create wo_a with quant_config=None to load bfloat16 weights
- Add FP8 quantization of wo_a in finalize_mega_moe_weights:
per-tensor symmetric quantization to float8_e4m3fn + weight_scale_inv
- This matches what the fused_inv_rope_fp8_quant + einsum expects
The stacked params mapping (wkv + wgate → fused_wkv_wgate) uses
weight_loader(param, weight, shard_id), but PerTensorScaleParameter
and ModelWeightParameter for NVFP4 scale params don't support shard_id
in load_column_parallel_weight (asserts shape equality).
Fix: buffer input_scale, weight_scale, weight_scale_2 for fused_wkv_wgate
shards, then concatenate along dim 0 and copy_ into the param after all
weights are loaded.
- Add orig_to_new_prefix mappings (layers→model.layers, embed_tokens→model.embed_tokens, etc.)
AutoWeightsLoader strips the model. prefix before the mapper runs, so these are required
- Move .self_attn.compressor. → .attn.mla_attn.compressor. before .self_attn. → .attn.
in substr_renames so compressor keys get the mla_attn prefix before the general rename
- Remove suffix renames (head.weight→lm_head.weight, embed.weight→embed_tokens.weight)
that were causing double-mapping since the NVFP4 checkpoint already uses lm_head/embed_tokens
- Add unit test: tests/test_nvfp4_mapper.py (39 cases, no vLLM/CUDA needed)
- Add specific .self_attn.{q_a,kv,q_b,o_a,o_b}_proj → .attn.{wq_a,wkv,wq_b,wo_a,wo_b}
- Remove norm_gate suffix renames (nightly uses 'gate' not 'norm_gate')
- Order substr renames: specific before general
The NVFP4 checkpoint uses model.layers.* but vLLM's AutoWeightsLoader
expects layers.* (relative to the model module). Strip the model. prefix
instead of adding it.
The upstream deepseek_v4.py has imports that don't exist in the nightly
Docker image (norm_gate_linear, breakable_cudagraph, etc.). Use the
nightly's own files as the base and add only the minimal NVFP4 changes:
- Add _make_deepseek_v4_nvfp4_weights_mapper() for checkpoint key mapping
- Select NVFP4 mapper when quant_config is modelopt_fp4
- cos_sin_cache float32 fix in attention
- Remove utils.py patch (not needed)
Major refactor to eliminate all post-load hacks:
- deepseek_v4.py: use upstream model with NVFP4 weight mapper only
(gate_proj→w1, up_proj→w3, down_proj→w2, .self_attn→.attn, .mlp→.ffn)
- Add CuTeDSLMoEExperts as a FusedMoEExpertsModular subclass
that wraps our CuTeDSL runner as a proper vLLM MoE backend
- Register CUTEDSL backend in the NVFP4 oracle
- Use ModelOptNvFp4Config for quantization dispatch (not DeepseekV4FP8Config)
- ModelOptNvFp4LinearMethod handles NVFP4 attention/shared expert projections
- Remove nvfp4_cutedsl.py, cutedsl_quant_method.py, utils.py from Dockerfile
- CuTeDSL runner moved to cutedsl/runner.py for clean imports
- cos_sin_cache float32 fix in deepseek_v4_attention.py
No more monkey-patching, no _convert_nvfp4_post_load, no CuTeDSLNvfp4Method.
torch.compile fullgraph mode can't handle @torch.compiler.disable (skips
the function and refuses to compile). Custom autograd Functions are treated
as opaque ops by torch.compile — they execute eagerly without the compiler
trying to trace into CuTeDSL internals (JIT, Path.cwd, etc).
CuTeDSL internals (Path.cwd, threading, JIT) are incompatible with
torch.dynamo tracing. Marking run() as compiler-disabled makes the
runners opaque to torch.compile — they execute eagerly while the
rest of the model gets compiled.
The _NVFP4_STEP_LUT_LOCK caused 'Unsupported context manager' under
torch.compile/cudagraph. LUT is now pre-populated during warmup so
the fast path (cache hit) never hits a lock.
Also removed all init/warmup debug prints from CuTeDSL kernels.
- CuTeDSLNvfp4Method: custom quant method that creates CuTeDSL runners
during process_weights_after_loading, then swaps to CuTeDSLNvfp4LinearMethod
for forward dispatch
- Attention projections (fused_wqa_wkv, wq_b, wo_b) now route through
CuTeDSLNvfp4Linear (cosine 0.992-0.996 vs BF16 reference)
- Shared expert now uses CuTeDSLSharedExpertRunner (cosine 0.992 vs BF16)
with monkey-patched forward for fused L1+SiLU+L2 pipeline
- Deleted all BF16 dequant code (_dequant_nvfp4_to_bf16, _post_quant_fix,
input_scale fixes)
- Deleted _post_quant_fix hook from utils.py
- Fixed SwiGLU clamp: gate clamped BEFORE SiLU (matching SiluAndMulWithClamp)
- Cleaned up all debug prints
- Updated Dockerfile with new kernel files
Shared experts also use FlashInferCutlassNvFp4LinearKernel with
broken input_scale. They need the same BF16 dequant treatment.
gate_up_proj and down_proj on ffn.shared_experts.
Instead of fragile inline Dockerfile patching, just copy a modified
utils.py (with _post_quant_fix call) into the image, same pattern
as deepseek_v4.py and deepseek_v4_attention.py patches.
Forward pre-hook approach didn't work — torch.compile and model
wrappers bypass hooks. Instead, patch vLLM's utils.py to call
model._post_quant_fix() at the end of process_weights_after_loading.
This guarantees the fix runs AFTER quant methods set up their attrs.
Dockerfile now patches:
model_loader/utils.py → calls model._post_quant_fix() if it exists
DeepseekV4ForCausalLM._post_quant_fix() dequantizes attention
NVFP4 weights to BF16 and replaces quant_method.
vLLM V1 calls DeepseekV4Model.forward() directly, not
DeepseekV4ForCausalLM.forward(). Hook on the outer model never fires.
Moved hook to self.model (inner) and fixed module.model.layers →
module.layers.
The key insight: process_weights_after_loading runs AFTER load_weights
and sets up FlashInferCutlassNvFp4LinearKernel with broken
input_global_scale_inv. Any fix inside load_weights gets overwritten.
Solution: register a one-shot forward pre-hook that runs on the first
forward call (guaranteed after all init). It dequantizes attention
NVFP4 weights to BF16 and replaces quant_method with
UnquantizedLinearMethod. Since process_weights_after_loading already
ran, our changes won't be overwritten.
Standalone test confirmed: all attention weights produce valid
non-NaN output when dequantized to BF16.
Instead of dequantizing to BF16 (which gets overwritten by
process_weights_after_loading), fix the input_scale parameter
on the module before the quant method reads it. The quant method
computes input_global_scale_inv = input_scale.max(), so fixing
input_scale propagates the correct activation scale.
Computes correct input_scale by temporarily dequantizing weight
to BF16, running warmup forward, and computing act_amax.
input_scale = 1/(act_amax * headroom).
process_weights_after_loading sets input_global_scale_inv AFTER
_convert_nvfp4_post_load runs, so the fix couldn't find the attrs.
Going back to BF16 dequant approach. The zeros in the dummy run are
expected (attention_impl returns early with out.zero_()). Need to test
with a real request under cudagraph_mode=NONE.
Instead of dequantizing attention weights to BF16 (which had issues
with MergedColumnParallelLinear and different weight_scale_2 values),
keep the NVFP4 path but fix the activation global scale.
Compute correct input_global_scale_inv by:
1. Temporarily dequantizing weight to BF16
2. Running warmup forward with random input
3. Computing actual activation amax
4. Setting scale_inv = amax * headroom
This preserves the original NVFP4 quantization pipeline.
Data-dependent expressions (amax().item(), isnan().any().item())
cause Dynamo guard failures even when gated by os.environ.
cudagraph_mode=NONE still uses torch.compile, so these break.
Will need enforce-eager for diagnostics going forward.