The grouped GEMM expects each group's tokens at their own offset range:
- Group 0: rows [0, padded_T)
- Group 1: rows [padded_T, 2*padded_T)
- etc.
Previously we wrote all groups' data contiguously starting at row 0,
so group 1+ would read zeros from the padding area. Now we scatter
each group's quantized activation at the correct offset.
Also:
- Size buffer for total_max_rows = padded_max * n_groups
- Use assemble_scales_2d_side for multi-group scale assembly
- Extract output per-group at correct offsets
The grouped GEMM expects mat_a to be laid out contiguously per group:
[all tokens for group0, all tokens for group1, ...]
A simple reshape of (T, G, D) → (T*G, D) gives interleaved layout
which is wrong. Fix: permute to (G, T, D) before flattening.
Same fix for output: permute (G, T, R) → (T, G, R).
The B200 container crashes in DeepGEMM's fp8_einsum (t.dim() == N assertion
in layout.hpp:39) when processing wo_a (o-projection first half) in the
attention layer. The crash is caused by scale tensor dimension mismatch
for the SM100 recipe (1, 1, 128).
Instead of fighting DeepGEMM, replace the entire wo_a path with our own
CuTeDSL NVFP4 kernel:
1. inverse_rope_bf16() — Python implementation of inverse RoPE
(replaces fused_inv_rope_fp8_quant CUDA kernel)
2. CuTeDSLNvfp4WoA — NVFP4 grouped linear for wo_a using
ScaledGroupedGemm with n_local_groups=8 groups
3. wo_a weight quantized to NVFP4 instead of FP8 (native NVFP4,
no conversion to another quantization)
Changes:
- cutedsl/inverse_rope.py: BF16 inverse RoPE (conjugate rotation)
- cutedsl/wo_a_grouped_linear.py: CuTeDSL NVFP4 grouped GEMM for wo_a
- vllm/patches/deepseek_v4_attention.py: Use NVFP4 path when runner
is initialized, keep DeepGEMM fallback
- vllm/patches/deepseek_v4.py: Init NVFP4 runner instead of FP8 quant
- tests/test_wo_a.py: Unit test for inverse RoPE + wo_a GEMM
Dynamo (torch.compile fullgraph) cannot trace through CuTeDSL internals
(cute.compile, JIT, etc.). The autograd.Function approach was unreliable
with fullgraph mode — Dynamo would still try to trace through it.
Fix: torch.library.custom_op makes Dynamo treat our GEMM as an opaque
black box. No reimplementing the kernel — just route through the existing
runner via a registry pattern:
- Runners registered in global dict with integer IDs
- Custom op takes (tensors, runner_id, shape_hint) -> tensor
- Dynamo calls fake impl for shape inference, never touches the runner
- At execution time, real impl looks up runner and calls _run_impl
Changes:
- New: cutedsl/custom_ops.py (custom op definitions + registry)
- New: tests/test_custom_op.py (local unit tests, no GPU needed)
- Removed: _Nvfp4LinearApply, _MoEApply (autograd.Function classes)
- Updated: nvfp4_linear.py, runner.py, cutedsl.py, nvfp4_cutedsl.py
to use custom ops instead of autograd.Function
- Updated: cutedsl_quant_method.py to use custom op + registry
allow_in_graph doesn't work — Dynamo can't create proxies for Python
objects (the runner). The custom op approach requires only tensor args.
This time the GEMM impl correctly:
- Uses quantize_activation_nvfp4 for activation quantization
- Pads x_fp4 via uint8 + view(float4) for torch.zeros compat
- Assembles A-side scales with pad + swizzle
- Uses int32 expert_offsets (CuTeDSL requirement)
- Passes runner's pre-assembled mat_b, scale_b, gsb tensors
The custom op approach required reimplementing the GEMM (wrong scale
assembly, wrong tensor formats, cudaErrorIllegalAddress). Instead,
use torch.autograd.Function + torch._dynamo.allow_in_graph which
tells Dynamo to treat the function as an opaque kernel call, while
still using the runner's battle-tested _run_impl for the actual GEMM.
allow_in_graph is the proper way to register opaque ops for Dynamo
without reimplementing the computation.
The warmup custom op call hit cudaErrorIllegalAddress because our
custom op GEMM implementation doesn't match the runner's call convention.
Skip warmup for now — MoE kernel warmup handles CuTeDSL JIT cleanup.
- pad_and_swizzle_single takes 1 arg (2D tensor), not 4
- Inline the scale assembly logic: pad x_sf → swizzle → unsqueeze for 1 group
- Remove unused CuTeDSLNvfp4Linear import from custom op impl
Dynamo in fullgraph mode traces through torch.autograd.Function, hitting
CuTeDSL JIT internals (Path.cwd) and crashing. Registering as a custom op
makes it opaque to Dynamo — tracing calls the fake impl, real impl only
runs during inference.
Custom op: cutedsl::nvfp4_gemm(x, mat_b, scale_b, global_scale_b,
in_features, out_features, activation_global_scale) -> Tensor
Store finalized weight tensors on the layer (from runner._mat_b etc.)
instead of the runner object, since custom ops can only accept tensors.
- Create CuTeDSLNvFp4LinearKernel extending NvFp4LinearKernel base class
- Register it via init_nvfp4_linear_kernel() selection mechanism
(inserted at top of _POSSIBLE_NVFP4_KERNELS, before FlashInfer)
- process_weights_after_loading: uint8→FP4, permute, create CuTeDSL runner
- apply_weights: route through CuTeDSL GEMM
- Update Dockerfile: copy kernel + registration script
- Fix attention: always use forward() for quantized compressor/indexer
layers (dtype check was fragile after kernel swaps weights to dummy BF16)
Not all layers have the same indexer structure. The stacking path
was trying to access params that don't exist in params_dict. Added
checks to skip missing stacked params instead of KeyError.
o_a_proj is NOT quantized by modelopt in the checkpoint (bfloat16),
but the attention forward pass expects FP8 (weight + weight_scale_inv).
- Create wo_a with quant_config=None to load bfloat16 weights
- Add FP8 quantization of wo_a in finalize_mega_moe_weights:
per-tensor symmetric quantization to float8_e4m3fn + weight_scale_inv
- This matches what the fused_inv_rope_fp8_quant + einsum expects
The stacked params mapping (wkv + wgate → fused_wkv_wgate) uses
weight_loader(param, weight, shard_id), but PerTensorScaleParameter
and ModelWeightParameter for NVFP4 scale params don't support shard_id
in load_column_parallel_weight (asserts shape equality).
Fix: buffer input_scale, weight_scale, weight_scale_2 for fused_wkv_wgate
shards, then concatenate along dim 0 and copy_ into the param after all
weights are loaded.
- Add orig_to_new_prefix mappings (layers→model.layers, embed_tokens→model.embed_tokens, etc.)
AutoWeightsLoader strips the model. prefix before the mapper runs, so these are required
- Move .self_attn.compressor. → .attn.mla_attn.compressor. before .self_attn. → .attn.
in substr_renames so compressor keys get the mla_attn prefix before the general rename
- Remove suffix renames (head.weight→lm_head.weight, embed.weight→embed_tokens.weight)
that were causing double-mapping since the NVFP4 checkpoint already uses lm_head/embed_tokens
- Add unit test: tests/test_nvfp4_mapper.py (39 cases, no vLLM/CUDA needed)
- Add specific .self_attn.{q_a,kv,q_b,o_a,o_b}_proj → .attn.{wq_a,wkv,wq_b,wo_a,wo_b}
- Remove norm_gate suffix renames (nightly uses 'gate' not 'norm_gate')
- Order substr renames: specific before general
The NVFP4 checkpoint uses model.layers.* but vLLM's AutoWeightsLoader
expects layers.* (relative to the model module). Strip the model. prefix
instead of adding it.
The upstream deepseek_v4.py has imports that don't exist in the nightly
Docker image (norm_gate_linear, breakable_cudagraph, etc.). Use the
nightly's own files as the base and add only the minimal NVFP4 changes:
- Add _make_deepseek_v4_nvfp4_weights_mapper() for checkpoint key mapping
- Select NVFP4 mapper when quant_config is modelopt_fp4
- cos_sin_cache float32 fix in attention
- Remove utils.py patch (not needed)
Major refactor to eliminate all post-load hacks:
- deepseek_v4.py: use upstream model with NVFP4 weight mapper only
(gate_proj→w1, up_proj→w3, down_proj→w2, .self_attn→.attn, .mlp→.ffn)
- Add CuTeDSLMoEExperts as a FusedMoEExpertsModular subclass
that wraps our CuTeDSL runner as a proper vLLM MoE backend
- Register CUTEDSL backend in the NVFP4 oracle
- Use ModelOptNvFp4Config for quantization dispatch (not DeepseekV4FP8Config)
- ModelOptNvFp4LinearMethod handles NVFP4 attention/shared expert projections
- Remove nvfp4_cutedsl.py, cutedsl_quant_method.py, utils.py from Dockerfile
- CuTeDSL runner moved to cutedsl/runner.py for clean imports
- cos_sin_cache float32 fix in deepseek_v4_attention.py
No more monkey-patching, no _convert_nvfp4_post_load, no CuTeDSLNvfp4Method.
torch.compile fullgraph mode can't handle @torch.compiler.disable (skips
the function and refuses to compile). Custom autograd Functions are treated
as opaque ops by torch.compile — they execute eagerly without the compiler
trying to trace into CuTeDSL internals (JIT, Path.cwd, etc).