Commit Graph

312 Commits

SHA1 Message Date
eef0ef76af Fix NVFP4 compressor scale loading: buffer and concatenate scale shards
The stacked params mapping (wkv + wgate → fused_wkv_wgate) uses
weight_loader(param, weight, shard_id), but PerTensorScaleParameter
and ModelWeightParameter for NVFP4 scale params don't support shard_id
in load_column_parallel_weight (asserts shape equality).

Fix: buffer input_scale, weight_scale, weight_scale_2 for fused_wkv_wgate
shards, then concatenate along dim 0 and copy_ into the param after all
weights are loaded.
2026-05-18 23:24:08 +00:00
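
The buffer-then-concatenate idea in eef0ef76af can be sketched in plain Python, with nested lists standing in for tensors. This is an illustration only, not the repository's loader: the class name, method names, and the wkv-then-wgate shard order are assumptions, and the real code copies the result into the parameter with copy_ rather than returning it.

```python
from collections import defaultdict

# Assumed order of the shards inside the fused parameter (hypothetical).
SHARD_ORDER = ("wkv", "wgate")


class ScaleShardBuffer:
    """Collect per-shard scale rows, then concatenate once all shards are seen."""

    def __init__(self):
        # param name -> {shard id -> list of rows}
        self._pending = defaultdict(dict)

    def add(self, param_name, shard_id, rows):
        # Copy the rows so later mutation of the input does not affect the buffer.
        self._pending[param_name][shard_id] = [list(r) for r in rows]

    def finalize(self):
        """Return {param name: rows concatenated along dim 0 in SHARD_ORDER}."""
        fused = {}
        for name, shards in self._pending.items():
            missing = [s for s in SHARD_ORDER if s not in shards]
            if missing:
                raise KeyError(f"{name}: missing shards {missing}")
            fused[name] = [row for s in SHARD_ORDER for row in shards[s]]
        return fused


buf = ScaleShardBuffer()
# Shards may arrive in any order while the checkpoint is streamed.
buf.add("fused_wkv_wgate.weight_scale", "wgate", [[3.0], [4.0]])
buf.add("fused_wkv_wgate.weight_scale", "wkv", [[1.0], [2.0]])
fused = buf.finalize()
```

Deferring the concatenation until every shard is present avoids the shape-equality assertion, because the destination parameter only ever receives a value of its full size.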
f74447bfd0 Proper NVFP4 integration: quantized compressor/indexer + mapper fixes
Weight mapper fixes:
- Reorder substr renames: compressor renames first, then .self_attn.compressor.
  → .attn.mla_attn.compressor., then indexer renames (so indexer keys end up
  under mla_attn after the compressor rename already fired)
- Add compressor param renames: kv_proj→wkv, gate_proj→wgate, kv_norm→norm,
  position_bias→ape (checkpoint uses NVFP4 naming, model uses internal names)
- Add indexer param renames: q_b_proj→wq_b, kv_proj→compressor.wkv,
  gate_proj→compressor.wgate, kv_norm→k_norm, position_bias→compressor.ape,
  weights_proj stays (structural: compressor.indexer → indexer.compressor)
- Remove broken suffix renames (already fixed in prior commit)

Model architecture fixes:
- Patch deepseek_compressor.py to pass quant_config (was None, but NVFP4
  checkpoint has quantized compressor weights with input_scale/weight_scale)
- Patch deepseek_v4_attention.py indexer: weights_proj now uses quant_config
  (was None, but checkpoint has quantized weights)
- Add indexer.compressor.fused_wkv_wgate stacking in load_weights

Infrastructure:
- Add deepseek_compressor.py to Dockerfile
- Force MoE backend to flashinfer_cutedsl (was auto-selecting FLASHINFER_TRTLLM)
- Update unit test to 50 cases (compressor + indexer + quantization scales)
2026-05-18 23:20:13 +00:00
17496b2615 Fix NVFP4 weights mapper: add prefix mappings, fix substr order
- Add orig_to_new_prefix mappings (layers→model.layers, embed_tokens→model.embed_tokens, etc.)
  AutoWeightsLoader strips the model. prefix before the mapper runs, so these are required
- Move .self_attn.compressor. → .attn.mla_attn.compressor. before .self_attn. → .attn.
  in substr_renames so compressor keys get the mla_attn prefix before the general rename
- Remove suffix renames (head.weight→lm_head.weight, embed.weight→embed_tokens.weight)
  that were causing double-mapping since the NVFP4 checkpoint already uses lm_head/embed_tokens
- Add unit test: tests/test_nvfp4_mapper.py (39 cases, no vLLM/CUDA needed)
2026-05-18 23:03:34 +00:00
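
Why the rename order in 17496b2615 matters can be shown with a small stand-in mapper. The function below is a simplified sketch written for this illustration, not vLLM's mapper class, and the example key is hypothetical; only the rename pairs are taken from the commit message.

```python
def apply_mapper(key, prefix_map, substr_renames):
    """Apply at most one prefix mapping, then every substring rename in list order."""
    for old, new in prefix_map:
        if key.startswith(old):
            key = new + key[len(old):]
            break
    for old, new in substr_renames:
        key = key.replace(old, new)
    return key


PREFIX_MAP = [
    ("layers.", "model.layers."),
    ("embed_tokens.", "model.embed_tokens."),
]

# Specific rename first, general rename second (the order the commit moves to).
GOOD_ORDER = [
    (".self_attn.compressor.", ".attn.mla_attn.compressor."),
    (".self_attn.", ".attn."),
]
# General rename first: the specific pattern can no longer match afterwards.
BAD_ORDER = list(reversed(GOOD_ORDER))

key = "layers.0.self_attn.compressor.kv_proj.weight"  # hypothetical checkpoint key
good = apply_mapper(key, PREFIX_MAP, GOOD_ORDER)
bad = apply_mapper(key, PREFIX_MAP, BAD_ORDER)
```

With the general rename first, the key ends up under .attn.compressor. and never receives the mla_attn segment, which is the failure the reordering avoids.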
b039123207 Fix NVFP4 mapper: add attention projection renames, remove norm_gate renames
- Add specific .self_attn.{q_a,kv,q_b,o_a,o_b}_proj → .attn.{wq_a,wkv,wq_b,wo_a,wo_b}
- Remove norm_gate suffix renames (nightly uses 'gate' not 'norm_gate')
- Order substr renames: specific before general
2026-05-18 22:53:09 +00:00
ea648a9bc2 Fix NVFP4 mapper: keep model. prefix (model params use it) 2026-05-18 22:49:40 +00:00
1528d4e182 Fix NVFP4 mapper: strip model. prefix from checkpoint keys
The NVFP4 checkpoint uses model.layers.* but vLLM's AutoWeightsLoader
expects layers.* (relative to the model module). Strip the model. prefix
instead of adding it.
2026-05-18 22:46:04 +00:00
5d37674fb1 Add cutedsl to MoEBackend type in kernel config 2026-05-18 22:38:41 +00:00
7409204d71 Use nightly's deepseek_v4.py + attention as base, add only NVFP4 mapper
The upstream deepseek_v4.py has imports that don't exist in the nightly
Docker image (norm_gate_linear, breakable_cudagraph, etc.). Use the
nightly's own files as the base and add only the minimal NVFP4 changes:
- Add _make_deepseek_v4_nvfp4_weights_mapper() for checkpoint key mapping
- Select NVFP4 mapper when quant_config is modelopt_fp4
- cos_sin_cache float32 fix in attention
- Remove utils.py patch (not needed)
2026-05-18 22:33:51 +00:00
a19ed4a18e Remove breakable_cudagraph import (not in nightly) 2026-05-18 22:29:24 +00:00
b007937a68 Fix garbled imports in cutedsl/runner.py 2026-05-18 22:22:52 +00:00
a7ed8faec6 Proper NVFP4 integration: use ModelOptNvFp4Config + FusedMoE framework
Major refactor to eliminate all post-load hacks:
- deepseek_v4.py: use upstream model with NVFP4 weight mapper only
  (gate_proj→w1, up_proj→w3, down_proj→w2, .self_attn→.attn, .mlp→.ffn)
- Add CuTeDSLMoEExperts as a FusedMoEExpertsModular subclass
  that wraps our CuTeDSL runner as a proper vLLM MoE backend
- Register CUTEDSL backend in the NVFP4 oracle
- Use ModelOptNvFp4Config for quantization dispatch (not DeepseekV4FP8Config)
- ModelOptNvFp4LinearMethod handles NVFP4 attention/shared expert projections
- Remove nvfp4_cutedsl.py, cutedsl_quant_method.py, utils.py from Dockerfile
- CuTeDSL runner moved to cutedsl/runner.py for clean imports
- cos_sin_cache float32 fix in deepseek_v4_attention.py

No more monkey-patching, no _convert_nvfp4_post_load, no CuTeDSLNvfp4Method.
2026-05-18 22:19:23 +00:00
48386e34ad Fix torch.compile: use custom autograd Function instead of @torch.compiler.disable
torch.compile fullgraph mode can't handle @torch.compiler.disable (skips
the function and refuses to compile). Custom autograd Functions are treated
as opaque ops by torch.compile — they execute eagerly without the compiler
trying to trace into CuTeDSL internals (JIT, Path.cwd, etc).
2026-05-18 21:38:28 +00:00
85e1cd3b69 Fix torch.compile crash: @torch.compiler.disable on all CuTeDSL run()
CuTeDSL internals (Path.cwd, threading, JIT) are incompatible with
torch.dynamo tracing. Marking run() as compiler-disabled makes the
runners opaque to torch.compile — they execute eagerly while the
rest of the model gets compiled.
2026-05-18 21:07:35 +00:00
a94011ec92 Fix torch.compile crash: remove threading.Lock from LUT cache path
The _NVFP4_STEP_LUT_LOCK caused 'Unsupported context manager' under
torch.compile/cudagraph. LUT is now pre-populated during warmup so
the fast path (cache hit) never hits a lock.

Also removed all init/warmup debug prints from CuTeDSL kernels.
2026-05-18 20:54:55 +00:00
6326222d68 Fix: add abstract create_weights to CuTeDSLNvfp4LinearMethod 2026-05-18 20:40:48 +00:00
450793311c Wire CuTeDSL kernels into vLLM: replace all BF16 dequant with native NVFP4
- CuTeDSLNvfp4Method: custom quant method that creates CuTeDSL runners
  during process_weights_after_loading, then swaps to CuTeDSLNvfp4LinearMethod
  for forward dispatch
- Attention projections (fused_wqa_wkv, wq_b, wo_b) now route through
  CuTeDSLNvfp4Linear (cosine 0.992-0.996 vs BF16 reference)
- Shared expert now uses CuTeDSLSharedExpertRunner (cosine 0.992 vs BF16)
  with monkey-patched forward for fused L1+SiLU+L2 pipeline
- Deleted all BF16 dequant code (_dequant_nvfp4_to_bf16, _post_quant_fix,
  input_scale fixes)
- Deleted _post_quant_fix hook from utils.py
- Fixed SwiGLU clamp: gate clamped BEFORE SiLU (matching SiluAndMulWithClamp)
- Cleaned up all debug prints
- Updated Dockerfile with new kernel files
2026-05-18 20:27:42 +00:00
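
The SwiGLU clamp fix in 450793311c concerns the order of two operations. The sketch below assumes an upper clamp on the gate only and a hypothetical limit value; the exact semantics of SiluAndMulWithClamp are not given in this log, so treat this as an illustration of the ordering rather than the kernel's definition.

```python
import math


def silu(x):
    """SiLU activation: x * sigmoid(x)."""
    return x / (1.0 + math.exp(-x))


def swiglu_clamp_before(gate, up, limit):
    """Clamp the gate first, then apply SiLU and multiply by up."""
    return [silu(min(g, limit)) * u for g, u in zip(gate, up)]


def swiglu_clamp_after(gate, up, limit):
    """The other order, for comparison: SiLU first, clamp afterwards."""
    return [min(silu(g), limit) * u for g, u in zip(gate, up)]


LIMIT = 7.0  # hypothetical clamp limit
before = swiglu_clamp_before([10.0], [1.0], LIMIT)[0]
after = swiglu_clamp_after([10.0], [1.0], LIMIT)[0]
```

For a gate value above the limit the two orders give different results, since SiLU of the clamped gate is slightly below the limit while clamping after SiLU saturates exactly at it.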
6ce6a47be9 Add NVFP4 linear runner + attention projection test
- CuTeDSLNvfp4Linear: generic single-GEMM runner for any NVFP4 projection
- test_attention.py: tests q_a_proj, q_b_proj, kv_proj, o_b_proj vs BF16
- Same pad+swizzle pattern as shared expert, but no SiLU/fusion
2026-05-18 20:14:03 +00:00
f07643791e Fix hidden_size: shared expert uses 7168, not HC_DIM 28672 2026-05-18 20:10:32 +00:00
70f50a1ec6 Fix scale assembly: use correctly-sized temp buffer for swizzle 2026-05-18 20:09:50 +00:00
97bdd604e9 Fix scale assembly: reshape swizzled output to 2D 2026-05-18 20:09:19 +00:00
c1aa4af123 Shared expert: dedicated CuTeDSL runner with proper scale assembly
- CuTeDSLSharedExpertRunner: num_groups=1 GEMM, no scatter/routing
- _assemble_scales_single_group: pad to 128 rows + Blackwell swizzle
- All buffers pre-allocated for cudagraph compatibility
- Updated test to use dedicated runner instead of MoE runner hack
2026-05-18 20:08:34 +00:00
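
The row-padding half of _assemble_scales_single_group in c1aa4af123 can be sketched as follows, with nested lists in place of tensors. The zero fill value and the function name are assumptions, and the Blackwell swizzle step is deliberately left out because its layout is not described in this log.

```python
def pad_rows(rows, multiple=128, fill=0.0):
    """Return a copy of rows padded with fill rows so len(result) % multiple == 0."""
    if not rows:
        return []
    width = len(rows[0])
    padded_len = -(-len(rows) // multiple) * multiple  # round up to next multiple
    extra = padded_len - len(rows)
    return [list(r) for r in rows] + [[fill] * width for _ in range(extra)]


scales = [[1.0, 2.0] for _ in range(130)]
padded = pad_rows(scales)
```

Here 130 rows round up to 256, so 126 fill rows are appended and the original rows keep their positions.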
b3451c74f8 Update README and CURRENT_BUG.md with current state
- README: updated NVFP4 coverage table, status, and plan
- CURRENT_BUG.md: full debugging journey, what works, what's next
- Both reflect decision to build our own CuTeDSL kernels
2026-05-18 20:05:03 +00:00
e8b289e30d WIP: CuTeDSL shared expert kernel
Dedicated runner (shared_expert_pipeline.py) and test (test_shared_expert.py).
Tried reusing MoE runner with 1 expert — fails because MoE runner assumes
hidden_size != HC_DIM for scatter. Need dedicated runner with correct
scale assembly. Will continue tomorrow.
2026-05-18 20:02:19 +00:00
1836e5fdc7 Add shared experts to post-quant BF16 dequant fix
Shared experts also use FlashInferCutlassNvFp4LinearKernel with
broken input_scale. They need the same BF16 dequant treatment.
This covers gate_up_proj and down_proj on ffn.shared_experts.
2026-05-18 19:27:49 +00:00
82ac648563 Patch utils.py the standard way: copy modified file into Docker image
Instead of fragile inline Dockerfile patching, just copy a modified
utils.py (with _post_quant_fix call) into the image, same pattern
as deepseek_v4.py and deepseek_v4_attention.py patches.
2026-05-18 19:10:08 +00:00
3c1a76bdcc Fix Dockerfile: use external patch script instead of inline Python
Docker's parser chokes on multi-line Python in RUN. Moved to
scripts/patch_utils.py and COPY + RUN it.
2026-05-18 19:03:57 +00:00
75844a8361 Post-quant fix via Dockerfile patch to process_weights_after_loading
Forward pre-hook approach didn't work — torch.compile and model
wrappers bypass hooks. Instead, patch vLLM's utils.py to call
model._post_quant_fix() at the end of process_weights_after_loading.
This guarantees the fix runs AFTER quant methods set up their attrs.

Dockerfile now patches:
  model_loader/utils.py → calls model._post_quant_fix() if it exists

DeepseekV4ForCausalLM._post_quant_fix() dequantizes attention
NVFP4 weights to BF16 and replaces quant_method.
2026-05-18 18:35:34 +00:00
a4ad5898c1 Fix post-quant hook: register on inner model, fix module refs
vLLM V1 calls DeepseekV4Model.forward() directly, not
DeepseekV4ForCausalLM.forward(). Hook on the outer model never fires.
Moved hook to self.model (inner) and fixed module.model.layers →
module.layers.
2026-05-18 18:15:36 +00:00
a51edd238e Add post-quant-init forward hook to fix attention NVFP4
The key insight: process_weights_after_loading runs AFTER load_weights
and sets up FlashInferCutlassNvFp4LinearKernel with broken
input_global_scale_inv. Any fix inside load_weights gets overwritten.

Solution: register a one-shot forward pre-hook that runs on the first
forward call (guaranteed after all init). It dequantizes attention
NVFP4 weights to BF16 and replaces quant_method with
UnquantizedLinearMethod. Since process_weights_after_loading already
ran, our changes won't be overwritten.

Standalone test confirmed: all attention weights produce valid
non-NaN output when dequantized to BF16.
2026-05-18 17:56:19 +00:00
2835cb040b Fix input_scale BEFORE process_weights_after_loading runs
Instead of dequantizing to BF16 (which gets overwritten by
process_weights_after_loading), fix the input_scale parameter
on the module before the quant method reads it. The quant method
computes input_global_scale_inv = input_scale.max(), so fixing
input_scale propagates the correct activation scale.

Computes correct input_scale by temporarily dequantizing weight
to BF16, running warmup forward, and computing act_amax.
input_scale = 1/(act_amax * headroom).
2026-05-18 16:43:44 +00:00
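
The formula quoted in 2835cb040b, input_scale = 1/(act_amax * headroom), works out as in the small example below. The function name, the sample activations, and the headroom value are hypothetical; the commit does not state which headroom was used.

```python
def input_scale_from_activations(activations, headroom):
    """Return 1 / (act_amax * headroom), the formula quoted in the commit."""
    act_amax = max(abs(x) for x in activations)
    if act_amax == 0.0:
        raise ValueError("activations are all zero; the scale is undefined")
    return 1.0 / (act_amax * headroom)


acts = [0.5, -4.0, 2.0]  # stand-in for warmup activations
scale = input_scale_from_activations(acts, headroom=2.0)
```

With an absolute maximum of 4.0 and a headroom of 2.0, the resulting scale is 1 / 8.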
2fc81ccac4 Revert to BF16 dequant for attention NVFP4 (input_scale fix was too early)
process_weights_after_loading sets input_global_scale_inv AFTER
_convert_nvfp4_post_load runs, so the fix couldn't find the attrs.
Going back to BF16 dequant approach. The zeros in the dummy run are
expected (attention_impl returns early with out.zero_()). Need to test
with a real request under cudagraph_mode=NONE.
2026-05-18 16:23:41 +00:00
4a57399592 Add debug prints for input_global_scale_inv check 2026-05-18 15:59:59 +00:00
f86892e26b Replace BF16 dequant with input_scale warmup fix for attention NVFP4
Instead of dequantizing attention weights to BF16 (which had issues
with MergedColumnParallelLinear and different weight_scale_2 values),
keep the NVFP4 path but fix the activation global scale.

Compute correct input_global_scale_inv by:
1. Temporarily dequantizing weight to BF16
2. Running warmup forward with random input
3. Computing actual activation amax
4. Setting scale_inv = amax * headroom

This preserves the original NVFP4 quantization pipeline.
2026-05-18 15:43:46 +00:00
301015b037 Remove all inline diagnostics — incompatible with torch.compile
Data-dependent expressions (amax().item(), isnan().any().item())
cause Dynamo guard failures even when gated by os.environ.
cudagraph_mode=NONE still uses torch.compile, so these break.
Will need enforce-eager for diagnostics going forward.
2026-05-18 15:22:53 +00:00
a83d364d45 Switch to cudagraph_mode=NONE (not enforce-eager) for real inference testing 2026-05-18 15:05:52 +00:00
2a2a42c6d6 Add attention-internal diagnostics: MLA output, FP8 quant output 2026-05-18 14:45:43 +00:00
5c1dda10f6 Add granular attention diagnostics: pre/post attn, embed, dequant stats 2026-05-18 14:24:14 +00:00
e0e0528778 Add debug logging for BF16 dequant to find missing attrs 2026-05-18 14:04:12 +00:00
2e8c3c961f Fix: dequantize fused_wqa_wkv instead of separate wq_a/wkv
wq_a and wkv are fused into a single MergedColumnParallelLinear
called fused_wqa_wkv. Was checking for non-existent separate attrs.
2026-05-18 13:47:08 +00:00
a7216b27df Fix: keep wo_a as FP8 (fp8_einsum path), dequant others to BF16
wo_a uses fp8_einsum which is weight-only FP8 (no input_scale).
Only q_a, q_b, kv, o_b need BF16 dequant to avoid broken input_scale.
2026-05-18 13:22:15 +00:00
334e95047e Fix: dequantize ALL attention NVFP4 projections to BF16
Root cause of NaN from layer 0: FlashInferCutlassNvFp4LinearKernel
uses checkpoint input_scale for activation quantization, which produces
NaN immediately. Fix: dequantize all attention NVFP4 weights (wq_a,
wq_b, wkv, wo_a, wo_b) to BF16 at load time, bypassing the broken
input_scale entirely. Uses existing _dequant_nvfp4_to_bf16 method.

This trades memory for correctness. Future optimization: add warmup
for attention input_global_scale_inv (same as MoE warmup).
2026-05-18 13:09:36 +00:00
a83c332059 Fix docker-compose: remove orphaned compilation-config arg, enforce-eager mode 2026-05-18 12:54:14 +00:00
9e7639fba4 Add layer-by-layer diagnostic prints (CLAWMINE_DEBUG=1, enforce-eager)
When CLAWMINE_DEBUG=1, prints amax/mean/NaN/Inf after each layer.
Must run with --enforce-eager (data-dependent prints break Dynamo).
Gated by os.environ, so the prints are dead-code-eliminated during compilation.
2026-05-18 12:51:51 +00:00
2d1e9f42b1 Remove NaN check — incompatible with Dynamo fullgraph compilation
Dynamo fullgraph mode rejects BOTH data-dependent branching AND
torch.compiler.disable as graph breaks. The NaN check cannot coexist
with vLLM's AOT compilation. Use layertest/cudagraph_test for debugging.
2026-05-18 12:17:25 +00:00
65763a200c Fix NaN check: wrap in @torch.compiler.disable to prevent Dynamo graph break
The inline os.environ gate doesn't work — Dynamo still sees the
data-dependent branching (torch.isnan().any()) and crashes with
'Unsupported: Data-dependent branching'. Extracting into a
@torch.compiler.disable decorated function makes Dynamo skip it.
2026-05-18 11:33:29 +00:00
8758bc93ca crap shoot 2026-05-18 11:13:29 +00:00
b8df4a8cc5 Fix NaN check: use os.environ gate instead of is_current_stream_capturing
torch.cuda.is_current_stream_capturing() returns bool, which breaks
Dynamo FX tracing (non-Tensor output). Switch to env var gate:
CLAWMINE_NAN_CHECK=1 enables NaN/Inf detection.

Dynamo evaluates os.environ at trace time — if the env var is not set,
the entire NaN check block is compiled away. Set it before first
inference to get NaN detection during prefill only.
2026-05-18 02:20:14 +00:00
0c02d84514 Add NaN/Inf detection in DeepseekV4Model.forward layer loop
- Checks every layer during prefill (not during cudagraph capture)
- is_current_stream_capturing() gate prevents CPU-GPU syncs during capture
- Prints amax every 10 layers for magnitude tracking
- Breaks on first NaN/Inf to avoid wasting compute
2026-05-17 23:37:12 +00:00
bedcfc4dab Pipeline test: use max_num_tokens=8192 matching vLLM 2026-05-17 23:04:44 +00:00
c45364b3a8 Add MoE scale ratio output 2026-05-17 22:58:27 +00:00