Commit Graph

368 Commits

Author SHA1 Message Date
dfd9c10ae9 Fix MHC import: don't import .torch from layers/mhc.py
layers/mhc.py was trying to import kernels.mhc.torch, which
failed because our __init__.py was breaking the package. Instead,
just import our mhc_torch_ops which has everything we need.

Also fix __init__.py to explicitly import mhc_pre_torch and
mhc_post_torch from .torch instead of using import *.
2026-05-19 05:36:35 +00:00
e404e18efb Also replace layers/mhc.py CustomOp dispatch
The original layers/mhc.py forward_cuda calls
torch.ops.vllm.mhc_pre_tilelang which triggers TileLang JIT.
Replace with our torch implementations in forward_cuda.
This is what the CustomOp dispatch routes through.
2026-05-19 05:31:05 +00:00
5e6d459145 Fix MHC custom op registration
Previous approach used @CustomOp.register which doesn't create
torch.ops.vllm.mhc_pre. The model code calls torch.ops.vllm.mhc_pre()
directly, which requires direct_register_custom_op.

Use direct_register_custom_op to register mhc_pre, mhc_post,
mhc_fused_post_pre, and hc_head_fused_kernel as PyTorch custom ops
with torch (eager) implementations.

Patch kernels/mhc/__init__.py to import from both .torch (original)
and .mhc_torch_ops (our replacements), skipping tilelang import.
2026-05-19 05:19:48 +00:00
9ff1679064 Replace MHC TileLang kernels with pure PyTorch
TileLang kernels (mhc_pre_big_fuse_tilelang, mhc_fused_tilelang) don't
work correctly on Blackwell SM100 and cause empty model output.

Replace with pure PyTorch implementations:
- mhc_pre_torch: Sinkhorn-normalized HC residual mixing
- mhc_post_torch: HC post block (einsum residual + post layer mix)
- mhc_fused_post_pre_torch: Fused post+pre (composition of above)
- hc_head_fused_torch: RMS norm + linear + sigmoid + weighted sum

Patch both layers/mhc.py (CustomOp dispatch) and kernels/mhc/__init__.py
(no tilelang import). Also remove tilelang from pyproject.toml deps.
2026-05-19 05:07:41 +00:00
5c770c68ca Keep MoE scale tensors: framework warmup needs them
The framework's deep_gemm_warmup calls get_fused_moe_quant_config
which accesses w13_input_scale etc. Setting them to None caused
TypeError: float / NoneType. Keep scales (small tensors) and only
free the large weight tensors.
2026-05-19 04:50:31 +00:00
e0f385ac45 Fix workspace_shapes: output dim is hidden_dim, not K*2
K comes from hidden_states.size(-1) which is the full BF16 dimension
(7168), not the packed weight dimension. K*2=14336 is wrong.
The MoE output is always hidden_dim (7168).
2026-05-19 04:42:22 +00:00
cfd8ec741d Debug: add shape mismatch logging in MoE apply 2026-05-19 04:35:58 +00:00
ffc1a5c6a8 Fix workspace_shapes: remove wrong assertion, compute output dim from K
The framework may pass K in different forms (packed or unpacked).
Use max(K*2, hidden_dim) to handle both cases.
2026-05-19 04:28:04 +00:00
f023b3b2c6 Fix: wrap dummy MoE weights in nn.Parameter
PyTorch only accepts nn.Parameter or None when reassigning an attribute
that is already registered as a parameter. A plain tensor from
torch.empty can't be assigned to a registered parameter slot.
2026-05-19 04:21:35 +00:00
b06dcb40dc Fix MoE w1=None crash: keep shape-preserving dummy weights on CPU
The modular kernel framework reads w1.shape[0] in its outer apply()
before delegating to our expert impl. Setting layer.w13_weight = None
caused AttributeError. Replace with shape-preserving CPU dummy tensors
to free GPU memory while keeping shape metadata accessible.
2026-05-19 04:17:10 +00:00
c289c44920 Fix BF16 wo_a: per-group BMM instead of flat linear
The BF16 wo_a path was calling self.wo_a(o_inv.reshape(num_tokens, -1))
which flattens across groups: (num_tokens, n_local_heads*head_dim)=(tokens, 8192).
But wo_a is a BMM with in_features=n_heads*head_dim/n_groups=4096.

The FP8 path handles this via einsum 'bhr,hdr->bhd' with per-group shapes.
The BF16 path now does the same: reshape o_inv to per-group format,
do torch.bmm, then reshape output and handle TP all-gather manually.
2026-05-19 04:10:02 +00:00
6f9a400ae0 Fix hc_head mapping: checkpoint uses hc_head.hc_fn, model params are flat hc_head_fn
- Removed hc_head prefix mapping (checkpoint already has model.hc_head.*)
- Fixed substring: hc_head.hc_fn→hc_head_fn (not hc_head.fn→hc_head_fn)
- The model has self.hc_head_fn as flat params, not inside a sub-module
2026-05-19 03:58:25 +00:00
909a2710e4 Fix double lm_head mapping: NVFP4 checkpoint already uses correct names
The checkpoint has lm_head.weight and model.embed_tokens.weight
already — the suffix mappings head.weight→lm_head.weight and
embed.weight→embed_tokens.weight were incorrectly applying to keys
that already had the right prefix, producing lm_lm_head.weight.
2026-05-19 03:54:14 +00:00
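The double-prefix failure described in 909a2710e4 can be reproduced in a few lines of plain Python. This is a minimal sketch, not the patch itself: `SUFFIX_RENAMES`, `rename_naive`, and `rename_guarded` are hypothetical names, and the real mapper may be structured differently. It only shows why a suffix rule without a boundary check rewrites a key that is already correct.

```python
# Hypothetical rename table for illustration; the real mapper may differ.
SUFFIX_RENAMES = {
    "head.weight": "lm_head.weight",
    "embed.weight": "embed_tokens.weight",
}


def rename_naive(key: str) -> str:
    """Suffix replacement with no boundary check (the failure mode)."""
    for old, new in SUFFIX_RENAMES.items():
        if key.endswith(old):
            return key[: -len(old)] + new
    return key


def rename_guarded(key: str) -> str:
    """Rename only when the old name is a whole dot-delimited suffix."""
    for old, new in SUFFIX_RENAMES.items():
        if key == old or key.endswith("." + old):
            return key[: -len(old)] + new
    return key


# "lm_head.weight" ends with "head.weight", so the naive rule fires again.
print(rename_naive("lm_head.weight"))    # lm_lm_head.weight
print(rename_guarded("lm_head.weight"))  # lm_head.weight
print(rename_guarded("head.weight"))     # lm_head.weight
```

The guarded variant still renames old-style keys, so one rule set can serve checkpoints with either naming, assuming keys are dot-delimited.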
4cf5b8b751 Fix compressor path: attn.mla_attn.compressor (not attn.compressor)
The compressor is inside mla_attn, not directly on the attention wrapper.
Debug output confirmed: layers.0.attn.mla_attn.compressor.fused_wkv_wgate.*
2026-05-19 03:47:26 +00:00
9d41419e9f Debug: print compressor params to diagnose KeyError 2026-05-19 03:44:40 +00:00
db5192fe41 Patch from Docker image's vLLM (0.20.2rc1) instead of newer upstream
The nightly Docker image uses an older vLLM that doesn't have
NormGateLinear, breakable_cudagraph, etc. Patching the Docker
image's own files ensures compatibility.

- deepseek_v4.py: Patches from Docker image + NVFP4 mapper + wo_a BF16
- deepseek_v4_attention.py: Patches from Docker image + inv rope BF16
  + weights_proj quant + removed QuantFP8/GroupShape imports
2026-05-19 03:35:15 +00:00
df5a496f5d Fix: make eager_break_during_capture import conditional for older vLLM 2026-05-19 03:29:05 +00:00
4ed91b81d0 Fix inverse RoPE formula: swap signs on cross terms 2026-05-19 03:22:10 +00:00
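For context on the sign fix in 4ed91b81d0: the commit does not show the formula, so the following is a sketch of the standard 2D rotation and its inverse, assuming RoPE rotates each (x, y) pair by an angle theta. The function names are hypothetical. The inverse keeps the cosine terms and flips the sign of both sine (cross) terms.

```python
import math


def rope_pair(x: float, y: float, theta: float) -> tuple[float, float]:
    """Forward rotation of one (x, y) pair by theta."""
    c, s = math.cos(theta), math.sin(theta)
    return x * c - y * s, x * s + y * c


def inv_rope_pair(x: float, y: float, theta: float) -> tuple[float, float]:
    """Inverse rotation: same cos terms, opposite sign on the sin terms."""
    c, s = math.cos(theta), math.sin(theta)
    return x * c + y * s, -x * s + y * c


# Round trip: rotating and then un-rotating should recover the input.
rx, ry = rope_pair(0.3, -1.2, 0.7)
bx, by = inv_rope_pair(rx, ry, 0.7)
assert math.isclose(bx, 0.3, abs_tol=1e-12)
assert math.isclose(by, -1.2, abs_tol=1e-12)
```

A round-trip check like this is a cheap way to catch a sign error, since applying the forward formula twice does not return the input.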
fece06f746 Add unit tests for NVFP4 weight mapper and inverse RoPE BF16 2026-05-19 03:22:00 +00:00
b0b5113467 Fix weight mapper: compressor → attn.compressor (not mla_attn), quant weights_proj
- The compressor is on attn.compressor (not attn.mla_attn.compressor)
- weights_proj in indexer is NVFP4-quantized in our checkpoint
2026-05-19 03:20:41 +00:00
396a83ea56 Clean vLLM integration: use official paths, BF16 wo_a, proper weight mapper
- deepseek_v4.py: Fresh upstream copy with minimal NVFP4 changes
  - wo_a uses quant_config=None (BF16 in NVFP4 checkpoint, no scales)
  - Added _make_deepseek_v4_nvfp4_weights_mapper() using official WeightsMapper API
  - Handles: self_attn→attn, mlp→ffn, gate_proj→w1, compressor renames, etc.
  - Mapper selected by quant_config.get_name() == 'modelopt_fp4'

- deepseek_v4_attention.py: Fresh upstream copy with minimal NVFP4 changes
  - Removed _wo_a_act_quant and custom CuTeDSL wo_a runner
  - Added _apply_inv_rope_bf16() helper (inverse RoPE in BF16)
  - Detects BF16 wo_a (no weight_scale_inv) and uses BF16 path
  - FP8 einsum path kept as fallback for SM90 checkpoints
  - BF16 path: inverse RoPE → wo_a() → wo_b() (standard linear methods)
2026-05-19 03:13:38 +00:00
b856ee9315 Clean up debug scripts 2026-05-19 02:47:29 +00:00
05cdde1676 Fix wo_a: scatter each group's data at correct offset in padded buffer
The grouped GEMM expects each group's tokens at their own offset range:
- Group 0: rows [0, padded_T)
- Group 1: rows [padded_T, 2*padded_T)
- etc.

Previously we wrote all groups' data contiguously starting at row 0,
so group 1+ would read zeros from the padding area. Now we scatter
each group's quantized activation at the correct offset.

Also:
- Size buffer for total_max_rows = padded_max * n_groups
- Use assemble_scales_2d_side for multi-group scale assembly
- Extract output per-group at correct offsets
2026-05-19 02:45:57 +00:00
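The offset layout described in 05cdde1676 can be sketched with plain Python lists standing in for tensor rows. `scatter_groups` and `gather_groups` are hypothetical helpers and `padded_t` stands for the per-group padded row count; the actual patch operates on quantized activation tensors.

```python
def scatter_groups(per_group_rows, padded_t, fill=0):
    """Place group g's rows at [g * padded_t, (g + 1) * padded_t)."""
    n_groups = len(per_group_rows)
    buffer = [fill] * (padded_t * n_groups)  # total rows = padded_t * n_groups
    for g, rows in enumerate(per_group_rows):
        if len(rows) > padded_t:
            raise ValueError("group has more rows than its padded slot")
        start = g * padded_t
        buffer[start:start + len(rows)] = rows
    return buffer


def gather_groups(buffer, counts, padded_t):
    """Read each group's valid rows back from its own offset range."""
    return [buffer[g * padded_t:g * padded_t + n] for g, n in enumerate(counts)]


groups = [["a0", "a1"], ["b0", "b1"]]
buf = scatter_groups(groups, padded_t=4)
print(buf)  # ['a0', 'a1', 0, 0, 'b0', 'b1', 0, 0]
print(gather_groups(buf, [2, 2], padded_t=4))  # [['a0', 'a1'], ['b0', 'b1']]
```

Writing both groups contiguously from row 0 would instead give `['a0', 'a1', 'b0', 'b1', 0, 0, 0, 0]`, leaving group 1's range `[4, 8)` filled with padding only.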
8fe5546bb3 Fix debug script 2026-05-19 02:43:17 +00:00
788f0aa65a Add step-by-step debug for wo_a 2026-05-19 02:43:05 +00:00
5f5b997fc3 Fix wo_a: permute to groups-first layout for grouped GEMM
The grouped GEMM expects mat_a to be laid out contiguously per group:
[all tokens for group0, all tokens for group1, ...]
A simple reshape of (T, G, D) → (T*G, D) gives interleaved layout
which is wrong. Fix: permute to (G, T, D) before flattening.
Same fix for output: permute (G, T, R) → (T, G, R).
2026-05-19 02:41:32 +00:00
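The layout difference in 5f5b997fc3 is easy to see on a tiny example. The sketch below uses nested lists indexed as `x[t][g]` in place of a `(T, G, D)` tensor, with a string label standing in for each row of length D. Both helper names are hypothetical.

```python
def flatten_token_major(x):
    """Like reshape (T, G, D) -> (T*G, D): groups are interleaved per token."""
    return [row for token in x for row in token]


def flatten_group_major(x):
    """Like permute to (G, T, D) then flatten: each group's rows are contiguous."""
    n_groups = len(x[0])
    return [token[g] for g in range(n_groups) for token in x]


# Two tokens, two groups.
x = [["t0g0", "t0g1"],
     ["t1g0", "t1g1"]]

print(flatten_token_major(x))  # ['t0g0', 't0g1', 't1g0', 't1g1']
print(flatten_group_major(x))  # ['t0g0', 't1g0', 't0g1', 't1g1']
```

Only the second ordering matches the "[all tokens for group0, all tokens for group1, ...]" layout the commit says the grouped GEMM expects. The output needs the reverse reordering to get back to token-major rows.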
77e4970d93 Add debug script for wo_a quantization 2026-05-19 02:40:43 +00:00
80122b850b Add debug script for wo_a 2026-05-19 02:39:55 +00:00
ae233ab648 Fix test: cos_sin_cache on CUDA device 2026-05-19 02:37:50 +00:00
882d4996ff Replace DeepGEMM fp8_einsum with CuTeDSL NVFP4 for wo_a (o_proj)
The B200 container crashes in DeepGEMM's fp8_einsum (t.dim() == N assertion
in layout.hpp:39) when processing wo_a (o-projection first half) in the
attention layer. The crash is caused by scale tensor dimension mismatch
for the SM100 recipe (1, 1, 128).

Instead of fighting DeepGEMM, replace the entire wo_a path with our own
CuTeDSL NVFP4 kernel:

1. inverse_rope_bf16() — Python implementation of inverse RoPE
   (replaces fused_inv_rope_fp8_quant CUDA kernel)
2. CuTeDSLNvfp4WoA — NVFP4 grouped linear for wo_a using
   ScaledGroupedGemm with n_local_groups=8 groups
3. wo_a weight quantized to NVFP4 instead of FP8 (native NVFP4,
   no conversion to another quantization)

Changes:
- cutedsl/inverse_rope.py: BF16 inverse RoPE (conjugate rotation)
- cutedsl/wo_a_grouped_linear.py: CuTeDSL NVFP4 grouped GEMM for wo_a
- vllm/patches/deepseek_v4_attention.py: Use NVFP4 path when runner
  is initialized, keep DeepGEMM fallback
- vllm/patches/deepseek_v4.py: Init NVFP4 runner instead of FP8 quant
- tests/test_wo_a.py: Unit test for inverse RoPE + wo_a GEMM
2026-05-19 02:36:30 +00:00
bab1f75f29 Fix gs None error in legacy _ensure_stacked path 2026-05-19 02:17:53 +00:00
48fa64dfda Eliminate weight copies: pass stacked checkpoint tensors directly
Memory optimization for MoE weight processing:

Before (3-4 copies of weights in memory):
1. Original checkpoint weights in layer.w13_weight (copy 1)
2. Per-expert permuted copies (copy 2)
3. torch.stack() in runner._ensure_stacked (copy 3)
4. make_b_k_major re-stride (copy 4)
5. Scales: permute then assemble_scales_3d_side un-permutes (wasted)

After (1-2 copies):
1. View checkpoint as fp4 (NO copy — byte-preserving view)
2. Pass (E, N, K) stacked tensor directly to runner
3. Runner permutes to (E, K, N) contiguous (copy 1), frees stacked ref
4. make_b_k_major re-strides (copy 2), frees (E, K, N) ref
5. Scales: already (N, K_sf) from checkpoint, call assembly directly
6. Free layer.w13_weight etc. immediately after extracting views

Also: assemble_scales_3d_side transposes (K_sf, N)→(N, K_sf) internally,
but checkpoint scales are ALREADY (N, K_sf). Skip the double-transpose
by calling assemble_raw_scales_2d3d_3d_side directly.
2026-05-19 02:16:43 +00:00
0612c1ab54 use proper backend 2026-05-19 02:08:18 +00:00
00fe63b56f Fix compile test: add warmup for activation global scales 2026-05-19 01:57:16 +00:00
bba3bca4d3 Add torch.compile + custom op integration test 2026-05-19 01:56:46 +00:00
35fab6cff3 Replace autograd.Function with torch.library.custom_op for Dynamo compat
Dynamo (torch.compile fullgraph) cannot trace through CuTeDSL internals
(cute.compile, JIT, etc.). The autograd.Function approach was unreliable
with fullgraph mode — Dynamo would still try to trace through it.

Fix: torch.library.custom_op makes Dynamo treat our GEMM as an opaque
black box. No reimplementing the kernel — just route through the existing
runner via a registry pattern:
  - Runners registered in global dict with integer IDs
  - Custom op takes (tensors, runner_id, shape_hint) -> tensor
  - Dynamo calls fake impl for shape inference, never touches the runner
  - At execution time, real impl looks up runner and calls _run_impl

Changes:
  - New: cutedsl/custom_ops.py (custom op definitions + registry)
  - New: tests/test_custom_op.py (local unit tests, no GPU needed)
  - Removed: _Nvfp4LinearApply, _MoEApply (autograd.Function classes)
  - Updated: nvfp4_linear.py, runner.py, cutedsl.py, nvfp4_cutedsl.py
    to use custom ops instead of autograd.Function
  - Updated: cutedsl_quant_method.py to use custom op + registry
2026-05-19 01:54:48 +00:00
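The registry pattern from 35fab6cff3 can be outlined without PyTorch. This is a plain-Python stand-in under stated assumptions: `ToyRunner`, `register_runner`, `op_fake`, and `op_real` are hypothetical names, lists stand in for tensors, and the actual `torch.library.custom_op` registration is deliberately left out. The point is only that the op signature carries an integer ID rather than the runner object, and that the shape-only path never touches the runner.

```python
from itertools import count

_RUNNERS = {}   # runner_id -> runner object
_IDS = count()  # source of fresh integer IDs


class ToyRunner:
    """Hypothetical stand-in for a runner that owns prepared weights."""

    def __init__(self, scale):
        self.scale = scale

    def _run_impl(self, x):
        return [v * self.scale for v in x]


def register_runner(runner) -> int:
    """Store the runner and hand back an integer the op can accept."""
    runner_id = next(_IDS)
    _RUNNERS[runner_id] = runner
    return runner_id


def op_fake(x, runner_id: int, shape_hint):
    """Shape inference only: uses shape_hint, never looks up the runner."""
    return tuple(shape_hint)


def op_real(x, runner_id: int, shape_hint):
    """Execution path: look up the runner by ID and delegate to it."""
    return _RUNNERS[runner_id]._run_impl(x)


rid = register_runner(ToyRunner(scale=2))
print(op_fake([1, 2, 3], rid, (3,)))  # (3,)
print(op_real([1, 2, 3], rid, (3,)))  # [2, 4, 6]
```

One consequence of this design is that runners must stay registered for as long as any compiled graph may call the op with their ID.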
98153002c0 Go back to torch.library.custom_op with correct GEMM impl
allow_in_graph doesn't work — Dynamo can't create proxies for Python
objects (the runner). The custom op approach requires only tensor args.

This time the GEMM impl correctly:
- Uses quantize_activation_nvfp4 for activation quantization
- Pads x_fp4 via uint8 + view(float4) for torch.zeros compat
- Assembles A-side scales with pad + swizzle
- Uses int32 expert_offsets (CuTeDSL requirement)
- Passes runner's pre-assembled mat_b, scale_b, gsb tensors
2026-05-19 01:24:41 +00:00
02c500bbb1 Switch to allow_in_graph for Dynamo opacity instead of custom op
The custom op approach required reimplementing the GEMM (wrong scale
assembly, wrong tensor formats, cudaErrorIllegalAddress). Instead,
use torch.autograd.Function + torch._dynamo.allow_in_graph which
tells Dynamo to treat the function as an opaque kernel call, while
still using the runner's battle-tested _run_impl for the actual GEMM.

allow_in_graph is the proper way to register opaque ops for Dynamo
without reimplementing the computation.
2026-05-19 01:20:07 +00:00
581d87f9a6 Remove warmup forward from process_weights_after_loading
The warmup custom op call hit cudaErrorIllegalAddress because our
custom op GEMM implementation doesn't match the runner's call convention.
Skip warmup for now — MoE kernel warmup handles CuTeDSL JIT cleanup.
2026-05-19 01:18:54 +00:00
5d49849156 Fix: torch.zeros doesn't support Float4_e2m1fn_x2 dtype
Allocate as uint8 then view as float4_e2m1fn_x2 for padding buffer.
2026-05-19 01:15:24 +00:00
e1fcfc4f01 Add CuTeDSL warmup + CUDA sync after JIT compilation
CuTeDSL cute.compile corrupts GPU memory. Add warmup forward +
torch.cuda.synchronize() + health check after finalize_weights,
matching the MoE runner pattern.
2026-05-19 01:11:44 +00:00
1d9c0f996c Fix expert_offsets dtype: CuTeDSL expects int32 not int64
The DSLRuntimeError 'prev_off is Int32...update to Int64 inside if' was
caused by passing int64 expert_offsets when the kernel expects int32.
2026-05-19 01:05:20 +00:00
b81200f427 Fix CuTeDSL NVFP4 linear: correct scale assembly in custom op
- pad_and_swizzle_single takes 1 arg (2D tensor), not 4
- Inline the scale assembly logic: pad x_sf → swizzle → unsqueeze for 1 group
- Remove unused CuTeDSLNvfp4Linear import from custom op impl
2026-05-19 01:01:42 +00:00
e0eb436914 Fix custom_op registration: use as decorator with proper type hints 2026-05-19 00:54:30 +00:00
c609e9ba3c Use torch.library.custom_op for CuTeDSL NVFP4 linear GEMM
Dynamo in fullgraph mode traces through torch.autograd.Function, hitting
CuTeDSL JIT internals (Path.cwd) and crashing. Registering as a custom op
makes it opaque to Dynamo — tracing calls the fake impl, real impl only
runs during inference.

Custom op: cutedsl::nvfp4_gemm(x, mat_b, scale_b, global_scale_b,
    in_features, out_features, activation_global_scale) -> Tensor

Store finalized weight tensors on the layer (from runner._mat_b etc.)
instead of the runner object, since custom ops can only accept tensors.
2026-05-19 00:50:43 +00:00
c043a11bcc Register CuTeDSL as proper NvFp4LinearKernel for NVFP4 linear layers
- Create CuTeDSLNvFp4LinearKernel extending NvFp4LinearKernel base class
- Register it via init_nvfp4_linear_kernel() selection mechanism
  (inserted at top of _POSSIBLE_NVFP4_KERNELS, before FlashInfer)
- process_weights_after_loading: uint8→FP4, permute, create CuTeDSL runner
- apply_weights: route through CuTeDSL GEMM
- Update Dockerfile: copy kernel + registration script
- Fix attention: always use forward() for quantized compressor/indexer
  layers (dtype check was fragile after kernel swaps weights to dummy BF16)
2026-05-19 00:44:44 +00:00
358830925a Fix unpack error: handle both tuple and tensor returns from NVFP4 forward() 2026-05-19 00:33:43 +00:00
d9dc042ff7 Fix compressor kv_score: use forward() for NVFP4 quantized weights
Raw torch.mm doesn't work with packed uint8 NVFP4 weights.
Use MergedColumnParallelLinear.forward() which handles dequantization.
2026-05-19 00:29:43 +00:00
10c14ddb49 Fix NVFP4 mapper: layer norms, hc params, indexer path, q_a_norm
- input_layernorm → attn_norm, post_attention_layernorm → ffn_norm
- hc_head.fn/base/scale → hc_head_fn/base/scale
- attn_hc/ffn_hc → hc_attn/hc_ffn (dot to underscore)
- q_a_norm → q_norm, sinks → attn_sink
- Indexer params: self_attn.compressor.indexer → attn.indexer
  (not attn.mla_attn.compressor.indexer)
2026-05-19 00:24:26 +00:00
540e7ee8fc Fix: layer.self_attn → layer.attn (model uses attn, not self_attn) 2026-05-19 00:14:09 +00:00