80 Commits

Author SHA1 Message Date
10c14ddb49 Fix NVFP4 mapper: layer norms, hc params, indexer path, q_a_norm
- input_layernorm → attn_norm, post_attention_layernorm → ffn_norm
- hc_head.fn/base/scale → hc_head_fn/base/scale
- attn_hc/ffn_hc → hc_attn/hc_ffn (dot to underscore)
- q_a_norm → q_norm, sinks → attn_sink
- Indexer params: self_attn.compressor.indexer → attn.indexer
  (not attn.mla_attn.compressor.indexer)
2026-05-19 00:24:26 +00:00
540e7ee8fc Fix: layer.self_attn → layer.attn (model uses attn, not self_attn) 2026-05-19 00:14:09 +00:00
201a40e6c4 Fix zero-dim tensor concatenation in compressor scale buffer
input_scale and weight_scale_2 are 0-dim scalars in the NVFP4 checkpoint.
torch.cat can't concatenate scalars — reshape to 1-d first.
2026-05-19 00:10:13 +00:00
d41a48aa1f Fix KeyError for missing stacked params (indexer.compressor)
Not all layers have the same indexer structure. The stacking path
was trying to access params that don't exist in params_dict. Added
checks to skip missing stacked params instead of raising KeyError.
2026-05-18 23:54:02 +00:00
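The guard described in d41a48aa1f can be sketched roughly as follows. This is a minimal illustration in plain Python, not the actual loader code: `STACKED_MAPPING` and `resolve_stacked_name` are hypothetical names, and only `params_dict` and the `wkv`/`wgate`/`fused_wkv_wgate` names come from the commit messages.

```python
from typing import Optional

# Hypothetical (fused_name, shard_name, shard_id) triples, for illustration only.
STACKED_MAPPING = [
    ("fused_wkv_wgate", "wkv", 0),
    ("fused_wkv_wgate", "wgate", 1),
]


def resolve_stacked_name(name: str, params_dict: dict) -> Optional[tuple]:
    """Return (fused_param_name, shard_id), or None if the layer lacks the fused param."""
    for fused, shard, shard_id in STACKED_MAPPING:
        if f".{shard}." not in name:
            continue
        candidate = name.replace(f".{shard}.", f".{fused}.")
        if candidate not in params_dict:
            # This layer has no such stacked param: skip instead of raising KeyError.
            return None
        return candidate, shard_id
    return None


params_dict = {"layers.0.attn.indexer.compressor.fused_wkv_wgate.weight": object()}
hit = resolve_stacked_name("layers.0.attn.indexer.compressor.wkv.weight", params_dict)
miss = resolve_stacked_name("layers.1.attn.indexer.compressor.wkv.weight", params_dict)
```

Here `hit` resolves to the fused parameter with shard 0, while `miss` is `None` because layer 1 has no indexer compressor in this toy `params_dict`.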
4b0d8263f6 Fix NameError: use print instead of logger (not imported) 2026-05-18 23:49:42 +00:00
e3c24769e2 Handle wo_a as bfloat16 (unquantized in NVFP4 checkpoint)
o_a_proj is NOT quantized by modelopt in the checkpoint (bfloat16),
but the attention forward pass expects FP8 (weight + weight_scale_inv).

- Create wo_a with quant_config=None to load bfloat16 weights
- Add FP8 quantization of wo_a in finalize_mega_moe_weights:
  per-tensor symmetric quantization to float8_e4m3fn + weight_scale_inv
- This matches what the fused_inv_rope_fp8_quant + einsum expects
2026-05-18 23:41:39 +00:00
9d016aa1c0 Use print instead of logger for weight load debug 2026-05-18 23:30:58 +00:00
a6f61bda5d Add debug logging for weight loading failures 2026-05-18 23:28:15 +00:00
eef0ef76af Fix NVFP4 compressor scale loading: buffer and concatenate scale shards
The stacked params mapping (wkv + wgate → fused_wkv_wgate) uses
weight_loader(param, weight, shard_id), but PerTensorScaleParameter
and ModelWeightParameter for NVFP4 scale params don't support shard_id
in load_column_parallel_weight (asserts shape equality).

Fix: buffer input_scale, weight_scale, weight_scale_2 for fused_wkv_wgate
shards, then concatenate along dim 0 and copy_ into the param after all
weights are loaded.
2026-05-18 23:24:08 +00:00
f74447bfd0 Proper NVFP4 integration: quantized compressor/indexer + mapper fixes
Weight mapper fixes:
- Reorder substr renames: compressor renames first, then .self_attn.compressor.
  → .attn.mla_attn.compressor., then indexer renames (so indexer keys end up
  under mla_attn after the compressor rename already fired)
- Add compressor param renames: kv_proj→wkv, gate_proj→wgate, kv_norm→norm,
  position_bias→ape (checkpoint uses NVFP4 naming, model uses internal names)
- Add indexer param renames: q_b_proj→wq_b, kv_proj→compressor.wkv,
  gate_proj→compressor.wgate, kv_norm→k_norm, position_bias→compressor.ape,
  weights_proj stays (structural: compressor.indexer → indexer.compressor)
- Remove broken suffix renames (already fixed in prior commit)

Model architecture fixes:
- Patch deepseek_compressor.py to pass quant_config (was None, but NVFP4
  checkpoint has quantized compressor weights with input_scale/weight_scale)
- Patch deepseek_v4_attention.py indexer: weights_proj now uses quant_config
  (was None, but checkpoint has quantized weights)
- Add indexer.compressor.fused_wkv_wgate stacking in load_weights

Infrastructure:
- Add deepseek_compressor.py to Dockerfile
- Force MoE backend to flashinfer_cutedsl (was auto-selecting FLASHINFER_TRTLLM)
- Update unit test to 50 cases (compressor + indexer + quantization scales)
2026-05-18 23:20:13 +00:00
17496b2615 Fix NVFP4 weights mapper: add prefix mappings, fix substr order
- Add orig_to_new_prefix mappings (layers→model.layers, embed_tokens→model.embed_tokens, etc.)
  AutoWeightsLoader strips the model. prefix before the mapper runs, so these are required
- Move .self_attn.compressor. → .attn.mla_attn.compressor. before .self_attn. → .attn.
  in substr_renames so compressor keys get the mla_attn prefix before the general rename
- Remove suffix renames (head.weight→lm_head.weight, embed.weight→embed_tokens.weight)
  that were causing double-mapping since the NVFP4 checkpoint already uses lm_head/embed_tokens
- Add unit test: tests/test_nvfp4_mapper.py (39 cases, no vLLM/CUDA needed)
2026-05-18 23:03:34 +00:00
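The ordering issue fixed in 17496b2615 can be reproduced with a small stand-alone sketch. It assumes that renames are applied sequentially and that each one replaces the first matching substring; `apply_substr_renames` is a hypothetical helper, not the mapper's real API.

```python
def apply_substr_renames(key, renames):
    """Apply each (old, new) substring rename in order, first occurrence only."""
    for old, new in renames:
        if old in key:
            key = key.replace(old, new, 1)
    return key


SPECIFIC_FIRST = [
    (".self_attn.compressor.", ".attn.mla_attn.compressor."),
    (".self_attn.", ".attn."),
]
GENERAL_FIRST = list(reversed(SPECIFIC_FIRST))

key = "layers.3.self_attn.compressor.kv_proj.weight"
good = apply_substr_renames(key, SPECIFIC_FIRST)
bad = apply_substr_renames(key, GENERAL_FIRST)
```

With the general rule first, `.self_attn.` is consumed before the compressor rule can match, so `bad` never receives the `mla_attn` prefix.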
b039123207 Fix NVFP4 mapper: add attention projection renames, remove norm_gate renames
- Add specific .self_attn.{q_a,kv,q_b,o_a,o_b}_proj → .attn.{wq_a,wkv,wq_b,wo_a,wo_b}
- Remove norm_gate suffix renames (nightly uses 'gate' not 'norm_gate')
- Order substr renames: specific before general
2026-05-18 22:53:09 +00:00
ea648a9bc2 Fix NVFP4 mapper: keep model. prefix (model params use it) 2026-05-18 22:49:40 +00:00
1528d4e182 Fix NVFP4 mapper: strip model. prefix from checkpoint keys
The NVFP4 checkpoint uses model.layers.* but vLLM's AutoWeightsLoader
expects layers.* (relative to the model module). Strip the model. prefix
instead of adding it.
2026-05-18 22:46:04 +00:00
7409204d71 Use nightly's deepseek_v4.py + attention as base, add only NVFP4 mapper
The upstream deepseek_v4.py has imports that don't exist in the nightly
Docker image (norm_gate_linear, breakable_cudagraph, etc.). Use the
nightly's own files as the base and add only the minimal NVFP4 changes:
- Add _make_deepseek_v4_nvfp4_weights_mapper() for checkpoint key mapping
- Select NVFP4 mapper when quant_config is modelopt_fp4
- cos_sin_cache float32 fix in attention
- Remove utils.py patch (not needed)
2026-05-18 22:33:51 +00:00
a7ed8faec6 Proper NVFP4 integration: use ModelOptNvFp4Config + FusedMoE framework
Major refactor to eliminate all post-load hacks:
- deepseek_v4.py: use upstream model with NVFP4 weight mapper only
  (gate_proj→w1, up_proj→w3, down_proj→w2, .self_attn→.attn, .mlp→.ffn)
- Add CuTeDSLMoEExperts as a FusedMoEExpertsModular subclass
  that wraps our CuTeDSL runner as a proper vLLM MoE backend
- Register CUTEDSL backend in the NVFP4 oracle
- Use ModelOptNvFp4Config for quantization dispatch (not DeepseekV4FP8Config)
- ModelOptNvFp4LinearMethod handles NVFP4 attention/shared expert projections
- Remove nvfp4_cutedsl.py, cutedsl_quant_method.py, utils.py from Dockerfile
- CuTeDSL runner moved to cutedsl/runner.py for clean imports
- cos_sin_cache float32 fix in deepseek_v4_attention.py

No more monkey-patching, no _convert_nvfp4_post_load, no CuTeDSLNvfp4Method.
2026-05-18 22:19:23 +00:00
450793311c Wire CuTeDSL kernels into vLLM: replace all BF16 dequant with native NVFP4
- CuTeDSLNvfp4Method: custom quant method that creates CuTeDSL runners
  during process_weights_after_loading, then swaps to CuTeDSLNvfp4LinearMethod
  for forward dispatch
- Attention projections (fused_wqa_wkv, wq_b, wo_b) now route through
  CuTeDSLNvfp4Linear (cosine 0.992-0.996 vs BF16 reference)
- Shared expert now uses CuTeDSLSharedExpertRunner (cosine 0.992 vs BF16)
  with monkey-patched forward for fused L1+SiLU+L2 pipeline
- Deleted all BF16 dequant code (_dequant_nvfp4_to_bf16, _post_quant_fix,
  input_scale fixes)
- Deleted _post_quant_fix hook from utils.py
- Fixed SwiGLU clamp: gate clamped BEFORE SiLU (matching SiluAndMulWithClamp)
- Cleaned up all debug prints
- Updated Dockerfile with new kernel files
2026-05-18 20:27:42 +00:00
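The clamp order named in 450793311c (gate clamped before SiLU) is sketched below on scalars. The limit of 10.0 and the [-limit, limit] clamp on `up` are taken from the earlier swiglu_limit commit message; the function itself is an illustrative assumption and not a copy of SiluAndMulWithClamp.

```python
import math

SWIGLU_LIMIT = 10.0  # value quoted in the swiglu_limit=10.0 commit


def silu(x: float) -> float:
    """Numerically stable SiLU: x * sigmoid(x)."""
    if x >= 0:
        return x / (1.0 + math.exp(-x))
    e = math.exp(x)
    return x * e / (1.0 + e)


def silu_and_mul_with_clamp(gate: float, up: float, limit: float = SWIGLU_LIMIT) -> float:
    gate = min(gate, limit)           # clamp gate from above, BEFORE SiLU
    up = max(-limit, min(up, limit))  # clamp up to [-limit, limit]
    return silu(gate) * up


unclamped = silu(50.0) * 50.0
clamped = silu_and_mul_with_clamp(50.0, 50.0)
```

For large inputs the unclamped product grows roughly quadratically, while the clamped version stays just under `limit * limit`.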
1836e5fdc7 Add shared experts to post-quant BF16 dequant fix
Shared experts also use FlashInferCutlassNvFp4LinearKernel with
broken input_scale. They need the same BF16 dequant treatment.
gate_up_proj and down_proj on ffn.shared_experts.
2026-05-18 19:27:49 +00:00
75844a8361 Post-quant fix via Dockerfile patch to process_weights_after_loading
Forward pre-hook approach didn't work — torch.compile and model
wrappers bypass hooks. Instead, patch vLLM's utils.py to call
model._post_quant_fix() at the end of process_weights_after_loading.
This guarantees the fix runs AFTER quant methods set up their attrs.

Dockerfile now patches:
  model_loader/utils.py → calls model._post_quant_fix() if it exists

DeepseekV4ForCausalLM._post_quant_fix() dequantizes attention
NVFP4 weights to BF16 and replaces quant_method.
2026-05-18 18:35:34 +00:00
a4ad5898c1 Fix post-quant hook: register on inner model, fix module refs
vLLM V1 calls DeepseekV4Model.forward() directly, not
DeepseekV4ForCausalLM.forward(). Hook on the outer model never fires.
Moved hook to self.model (inner) and fixed module.model.layers →
module.layers.
2026-05-18 18:15:36 +00:00
a51edd238e Add post-quant-init forward hook to fix attention NVFP4
The key insight: process_weights_after_loading runs AFTER load_weights
and sets up FlashInferCutlassNvFp4LinearKernel with broken
input_global_scale_inv. Any fix inside load_weights gets overwritten.

Solution: register a one-shot forward pre-hook that runs on the first
forward call (guaranteed after all init). It dequantizes attention
NVFP4 weights to BF16 and replaces quant_method with
UnquantizedLinearMethod. Since process_weights_after_loading already
ran, our changes won't be overwritten.

Standalone test confirmed: all attention weights produce valid
non-NaN output when dequantized to BF16.
2026-05-18 17:56:19 +00:00
2835cb040b Fix input_scale BEFORE process_weights_after_loading runs
Instead of dequantizing to BF16 (which gets overwritten by
process_weights_after_loading), fix the input_scale parameter
on the module before the quant method reads it. The quant method
computes input_global_scale_inv = input_scale.max(), so fixing
input_scale propagates the correct activation scale.

Computes correct input_scale by temporarily dequantizing weight
to BF16, running warmup forward, and computing act_amax.
input_scale = 1/(act_amax * headroom).
2026-05-18 16:43:44 +00:00
2fc81ccac4 Revert to BF16 dequant for attention NVFP4 (input_scale fix was too early)
process_weights_after_loading sets input_global_scale_inv AFTER
_convert_nvfp4_post_load runs, so the fix couldn't find the attrs.
Going back to BF16 dequant approach. The zeros in the dummy run are
expected (attention_impl returns early with out.zero_()). Need to test
with a real request under cudagraph_mode=NONE.
2026-05-18 16:23:41 +00:00
4a57399592 Add debug prints for input_global_scale_inv check 2026-05-18 15:59:59 +00:00
f86892e26b Replace BF16 dequant with input_scale warmup fix for attention NVFP4
Instead of dequantizing attention weights to BF16 (which had issues
with MergedColumnParallelLinear and different weight_scale_2 values),
keep the NVFP4 path but fix the activation global scale.

Compute correct input_global_scale_inv by:
1. Temporarily dequantizing weight to BF16
2. Running warmup forward with random input
3. Computing actual activation amax
4. Setting scale_inv = amax * headroom

This preserves the original NVFP4 quantization pipeline.
2026-05-18 15:43:46 +00:00
301015b037 Remove all inline diagnostics — incompatible with torch.compile
Data-dependent expressions (amax().item(), isnan().any().item())
cause Dynamo guard failures even when gated by os.environ.
cudagraph_mode=NONE still uses torch.compile, so these break.
Will need enforce-eager for diagnostics going forward.
2026-05-18 15:22:53 +00:00
2a2a42c6d6 Add attention-internal diagnostics: MLA output, FP8 quant output 2026-05-18 14:45:43 +00:00
5c1dda10f6 Add granular attention diagnostics: pre/post attn, embed, dequant stats 2026-05-18 14:24:14 +00:00
e0e0528778 Add debug logging for BF16 dequant to find missing attrs 2026-05-18 14:04:12 +00:00
2e8c3c961f Fix: dequantize fused_wqa_wkv instead of separate wq_a/wkv
wq_a and wkv are fused into a single MergedColumnParallelLinear
called fused_wqa_wkv. Was checking for non-existent separate attrs.
2026-05-18 13:47:08 +00:00
a7216b27df Fix: keep wo_a as FP8 (fp8_einsum path), dequant others to BF16
wo_a uses fp8_einsum which is weight-only FP8 (no input_scale).
Only q_a, q_b, kv, o_b need BF16 dequant to avoid broken input_scale.
2026-05-18 13:22:15 +00:00
334e95047e Fix: dequantize ALL attention NVFP4 projections to BF16
Root cause of NaN from layer 0: FlashInferCutlassNvFp4LinearKernel
uses checkpoint input_scale for activation quantization, which produces
NaN immediately. Fix: dequantize all attention NVFP4 weights (wq_a,
wq_b, wkv, wo_a, wo_b) to BF16 at load time, bypassing the broken
input_scale entirely. Uses existing _dequant_nvfp4_to_bf16 method.

This trades memory for correctness. Future optimization: add warmup
for attention input_global_scale_inv (same as MoE warmup).
2026-05-18 13:09:36 +00:00
9e7639fba4 Add layer-by-layer diagnostic prints (CLAWMINE_DEBUG=1, enforce-eager)
When CLAWMINE_DEBUG=1, prints amax/mean/NaN/Inf after each layer.
Must run with --enforce-eager (data-dependent prints break Dynamo).
Gated by os.environ so dead-code-eliminated during compilation.
2026-05-18 12:51:51 +00:00
2d1e9f42b1 Remove NaN check — incompatible with Dynamo fullgraph compilation
Dynamo fullgraph mode rejects BOTH data-dependent branching AND
torch.compiler.disable as graph breaks. The NaN check cannot coexist
with vLLM's AOT compilation. Use layertest/cudagraph_test for debugging.
2026-05-18 12:17:25 +00:00
65763a200c Fix NaN check: wrap in @torch.compiler.disable to prevent Dynamo graph break
The inline os.environ gate doesn't work — Dynamo still sees the
data-dependent branching (torch.isnan().any()) and crashes with
'Unsupported: Data-dependent branching'. Extracting into a
@torch.compiler.disable decorated function makes Dynamo skip it.
2026-05-18 11:33:29 +00:00
b8df4a8cc5 Fix NaN check: use os.environ gate instead of is_current_stream_capturing
torch.cuda.is_current_stream_capturing() returns bool, which breaks
Dynamo FX tracing (non-Tensor output). Switch to env var gate:
CLAWMINE_NAN_CHECK=1 enables NaN/Inf detection.

Dynamo evaluates os.environ at trace time — if the env var is not set,
the entire NaN check block is compiled away. Set it before first
inference to get NaN detection during prefill only.
2026-05-18 02:20:14 +00:00
0c02d84514 Add NaN/Inf detection in DeepseekV4Model.forward layer loop
- Checks every layer during prefill (not during cudagraph capture)
- is_current_stream_capturing() gate prevents CPU-GPU syncs during capture
- Prints amax every 10 layers for magnitude tracking
- Breaks on first NaN/Inf to avoid wasting compute
2026-05-17 23:37:12 +00:00
22e0370e6e Fix AttributeError: DeepseekV4MegaMoEExperts has no swiglu_limit
Get swiglu_limit from vllm_config.model_config.hf_config instead
of self (it was only set on the parent DeepseekV4MoE class).
2026-05-17 18:06:44 +00:00
a10c582cf4 Add swiglu_limit=10.0 activation clamping (was missing)
DeepSeek-V4 uses SiluAndMulWithClamp(10.0) which clamps:
- silu(gate) to max 10.0
- up to [-10.0, 10.0]

Our runner was doing plain F.silu(gate) * up without clamping.
Large gate values could produce unbounded SiLU output, causing
numerical issues in the L2 GEMM. This is likely contributing to
garbage model output.
2026-05-17 17:52:16 +00:00
b1ac74bb4d Fix shape mismatch: shared padded buffers, revert max_num_tokens cap
Root cause: capping max_num_tokens to 512 made buffers too small for the
actual 8192-token warmup. slot_hidden had 49152 rows but padded_hidden
only had 6144.

Fix: Revert the 512 cap. Use SHARED padded buffers (not per-layer) to
avoid OOM. Only 72 MB total (not 4.3 GB) since layers run sequentially
and reuse the same buffer. Cudagraph-safe since capture and replay both
run layers sequentially on the same tensor.
2026-05-17 15:47:10 +00:00
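The memory figures quoted in b1ac74bb4d and 8ac8e20fa9 follow from simple arithmetic, shown here with the numbers from the commit messages (72 MB per buffer set, 60 layers, decimal units assumed):

```python
MB_PER_LAYER = 72  # padded_hidden/activated buffers at max_num_tokens=8192
NUM_LAYERS = 60

per_layer_total_mb = MB_PER_LAYER * NUM_LAYERS  # one buffer set per layer
shared_total_mb = MB_PER_LAYER                  # one buffer set reused by all layers

per_layer_total_gb = per_layer_total_mb / 1000
```

Sharing is only safe because layers execute one after another, so no two layers need the buffer at the same time.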
8ac8e20fa9 Fix OOM: cap buffer pre-allocation at cudagraph max capture size
padded_hidden/activated buffers were sized for max_num_tokens=8192,
which is 72 MB per layer × 60 layers = 4.3 GB → OOM with 178 GB GPUs
(almost full from model + KV cache).

Now cap at max cudagraph capture size (512 tokens). Eager-mode runs
with >512 tokens will need dynamic allocation, but vLLM always uses
cudagraph for inference after warmup.
2026-05-17 14:14:13 +00:00
b0221662e7 Fix warmup: pass local expert IDs (not global), remove incorrect _warmup_done guard
compute_activation_global_scales expects local IDs (0..num_experts-1),
not global IDs. EP5/EP7 were getting L2 gs=0 because global IDs (240+,
336+) didn't match expert_id_range (0..47), so no tokens matched any
expert → L1 GEMM got zero inputs → L2 gs=0 → NaN/crash.

Also removed _warmup_done guard since each layer needs its own warmup
(different weights, different gs values).
2026-05-17 11:38:19 +00:00
b531a98f8f Fix scale assembly: per-expert 128-row fixed slots, no dynamic sizing
- Reverted from full-buffer swizzle to per-expert 128-row slots
- Scatter into e*128 fixed positions (cudagraph-compatible, fixed shape)
- Clamp local_row to 127 for experts with >128 tokens (GEMM uses expert_offsets)
- Buffer sized for num_experts*128 rows (not max_tokens*top_k)
- Add _warmup_done guard to only run warmup once (not 60x)
2026-05-17 11:10:59 +00:00
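A minimal sketch of the fixed-slot layout from b531a98f8f: each expert owns 128 rows starting at `e * 128`, and rows beyond the slot are clamped to row 127. `slot_destinations` is a hypothetical helper written as a Python loop for clarity; the real code would compute these indices on the GPU.

```python
SLOT_ROWS = 128  # rows reserved per expert


def slot_destinations(expert_ids, num_experts):
    """Map each routed token to a row in a fixed (num_experts * 128)-row buffer."""
    next_row = [0] * num_experts
    dest = []
    for e in expert_ids:
        local_row = min(next_row[e], SLOT_ROWS - 1)  # clamp overflow to row 127
        dest.append(e * SLOT_ROWS + local_row)
        next_row[e] += 1
    return dest


dest = slot_destinations([0, 2, 0, 1], num_experts=3)
overflow = slot_destinations([1] * 130, num_experts=3)
```

Because the buffer shape depends only on `num_experts`, the scatter indices always stay in range regardless of how many tokens are routed to one expert, which is what makes the layout usable under cudagraph capture.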
04245b664b Add warmup-based activation global scale computation in finalize_weights
The checkpoint input_scale is a calibration value that produces wrong gs
(Here gs is the activation global scale.)
at runtime (too small → block scales saturate → garbage output → EOS).

Now calls compute_activation_global_scales() with sample data during weight
finalization, before cudagraph capture. This observes actual activation
magnitudes and computes correct L1 and L2 gs values.
2026-05-17 10:48:24 +00:00
d9bae6d770 Fix OOB in scale assembly: size padded_x_sf for max tokens, fix top_k/max_num_tokens passing, support variable-size expert blocks
Bug 9: padded_x_sf was sized for num_experts*128 rows, but with 8192 tokens
and top_k=6, the actual padded row count can exceed 6144. Also:
- Pass top_k and max_num_tokens from deepseek_v4.py (was defaulting to 8/8192)
- Phase 2 of scale assembly now handles experts with >128 tokens (multiple 128-row chunks)
- Remove debug prints
2026-05-17 09:56:28 +00:00
ca3cba5bbd Fix global→local expert ID remapping for EP and remove .cpu() sync
Root cause of CUDA_ERROR_ASSERT index out of bounds:
- topk_ids contains GLOBAL expert IDs (0-255) but runner treated them
  as local IDs (0-31 with EP=8). Tokens for non-local experts got
  wrong expert assignments, causing out-of-bounds scatter indices
  in _assemble_scales_cudagraph_safe.

Fixes:
1. Add experts_start_idx param to CuTeDSLMoERunner
2. In run(), remap global→local IDs and zero weights for non-local experts
3. Move _token_indices from CPU to GPU (remove sort_idx.cpu() sync)
4. Add _fill_token_indices() and _needs_token_refill to handle CuTeDSL
   JIT GPU memory corruption (refill after first GEMM call)
2026-05-17 08:58:43 +00:00
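Fix 2 in ca3cba5bbd (remap global IDs to local ones and zero the weights of non-local experts) can be sketched on plain lists. `experts_start_idx` comes from the commit message; `remap_to_local`, `num_local_experts`, and the choice of 0 as the placeholder ID are assumptions for illustration.

```python
def remap_to_local(topk_ids, topk_weights, experts_start_idx, num_local_experts):
    """Shift global expert IDs into [0, num_local_experts) and mask non-local ones."""
    local_ids, local_weights = [], []
    for gid, w in zip(topk_ids, topk_weights):
        lid = gid - experts_start_idx
        if 0 <= lid < num_local_experts:
            local_ids.append(lid)
            local_weights.append(w)
        else:
            local_ids.append(0)        # any in-range placeholder keeps indexing safe
            local_weights.append(0.0)  # zero weight removes its contribution
    return local_ids, local_weights


# EP rank 1 of 8 with 32 local experts owns global IDs 32..63.
ids, weights = remap_to_local(
    [5, 40, 63, 64], [0.1, 0.2, 0.3, 0.4],
    experts_start_idx=32, num_local_experts=32,
)
```

Every returned ID is a valid local index, so downstream scatter operations cannot go out of bounds, and tokens meant for other ranks contribute nothing to the output.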
d2965b432d fix: set _l1_activation_global_scale (with underscore) — attribute name mismatch 2026-05-17 03:35:20 +00:00
b382a7a528 fix: handle input_scale as 1D or 2D (EP splits change the shape) 2026-05-16 22:49:30 +00:00
139c9c37cd fix: read input_scale from nn.Parameter before it's freed 2026-05-16 22:23:24 +00:00
152648789d fix: use checkpoint input_scale for activation global scale (not hardcoded 1/2688)
The checkpoint stores input_scale per projection — the pre-computed
activation normalization factor. Using 1/2688 was wrong for most layers
(e.g. down_proj input_scale=0.031 vs 1/2688=0.000372 — 83x off).
This caused under-quantized activations and garbage output.
2026-05-16 21:46:00 +00:00
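The "83x off" figure in 152648789d can be checked directly from the two values quoted in the message:

```python
hardcoded = 1 / 2688          # previous hardcoded activation scale
checkpoint_down_proj = 0.031  # down_proj input_scale quoted in the commit

ratio = checkpoint_down_proj / hardcoded
```

The ratio is simply `0.031 * 2688`, which lands between 83 and 84.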