Commit Graph

337 Commits

Author SHA1 Message Date
3c1a76bdcc Fix Dockerfile: use external patch script instead of inline Python
Docker's parser chokes on multi-line Python in RUN. Moved to
scripts/patch_utils.py and COPY + RUN it.
2026-05-18 19:03:57 +00:00
75844a8361 Post-quant fix via Dockerfile patch to process_weights_after_loading
Forward pre-hook approach didn't work — torch.compile and model
wrappers bypass hooks. Instead, patch vLLM's utils.py to call
model._post_quant_fix() at the end of process_weights_after_loading.
This guarantees the fix runs AFTER quant methods set up their attrs.

Dockerfile now patches:
  model_loader/utils.py → calls model._post_quant_fix() if it exists

DeepseekV4ForCausalLM._post_quant_fix() dequantizes attention
NVFP4 weights to BF16 and replaces quant_method.
2026-05-18 18:35:34 +00:00
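
The patch described above (append an optional hook call to the end of a loader function) could be sketched as a small text-rewriting script. This is a minimal illustration run against a toy source string, not the contents of scripts/patch_utils.py. The helper name patch_source, the marker comment, and the assumptions of a top-level function with 4-space indentation and no trailing return are all hypothetical.

```python
HOOK_MARKER = "# post-quant hook (patched)"  # hypothetical marker, used for idempotence


def patch_source(source: str, func_name: str) -> str:
    """Append an optional model._post_quant_fix() call to a top-level function.

    Assumes the function takes a parameter named `model`, uses 4-space
    indentation, and does not end with a `return` statement.
    """
    if HOOK_MARKER in source:
        return source  # already patched, leave untouched

    lines = source.splitlines()
    start = None
    for i, line in enumerate(lines):
        if line.startswith(f"def {func_name}("):
            start = i
            break
    if start is None:
        raise ValueError(f"function {func_name!r} not found")

    # The body ends at the next non-blank, non-indented line. A line starting
    # with ")" is treated as the tail of a multi-line signature, not as the end.
    end = len(lines)
    for i in range(start + 1, len(lines)):
        line = lines[i]
        if line.strip() and not line[0].isspace() and not line.startswith(")"):
            end = i
            break
    # Keep blank separator lines after the inserted hook, not before it.
    while end > start + 1 and not lines[end - 1].strip():
        end -= 1

    hook = [
        f"    {HOOK_MARKER}",
        "    if hasattr(model, '_post_quant_fix'):",
        "        model._post_quant_fix()",
    ]
    return "\n".join(lines[:end] + hook + lines[end:]) + "\n"
```

Because the hook is guarded by hasattr, models that do not define _post_quant_fix() are unaffected, and the marker check makes it safe to run the script more than once.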
a4ad5898c1 Fix post-quant hook: register on inner model, fix module refs
vLLM V1 calls DeepseekV4Model.forward() directly, not
DeepseekV4ForCausalLM.forward(). Hook on the outer model never fires.
Moved hook to self.model (inner) and fixed module.model.layers →
module.layers.
2026-05-18 18:15:36 +00:00
a51edd238e Add post-quant-init forward hook to fix attention NVFP4
The key insight: process_weights_after_loading runs AFTER load_weights
and sets up FlashInferCutlassNvFp4LinearKernel with broken
input_global_scale_inv. Any fix inside load_weights gets overwritten.

Solution: register a one-shot forward pre-hook that runs on the first
forward call (guaranteed after all init). It dequantizes attention
NVFP4 weights to BF16 and replaces quant_method with
UnquantizedLinearMethod. Since process_weights_after_loading already
ran, our changes won't be overwritten.

Standalone test confirmed: all attention weights produce valid
non-NaN output when dequantized to BF16.
2026-05-18 17:56:19 +00:00
2835cb040b Fix input_scale BEFORE process_weights_after_loading runs
Instead of dequantizing to BF16 (which gets overwritten by
process_weights_after_loading), fix the input_scale parameter
on the module before the quant method reads it. The quant method
computes input_global_scale_inv = input_scale.max(), so fixing
input_scale propagates the correct activation scale.

Computes correct input_scale by temporarily dequantizing weight
to BF16, running warmup forward, and computing act_amax.
input_scale = 1/(act_amax * headroom).
2026-05-18 16:43:44 +00:00
2fc81ccac4 Revert to BF16 dequant for attention NVFP4 (input_scale fix was too early)
process_weights_after_loading sets input_global_scale_inv AFTER
_convert_nvfp4_post_load runs, so the fix couldn't find the attrs.
Going back to BF16 dequant approach. The zeros in the dummy run are
expected (attention_impl returns early with out.zero_()). Need to test
with a real request under cudagraph_mode=NONE.
2026-05-18 16:23:41 +00:00
4a57399592 Add debug prints for input_global_scale_inv check 2026-05-18 15:59:59 +00:00
f86892e26b Replace BF16 dequant with input_scale warmup fix for attention NVFP4
Instead of dequantizing attention weights to BF16 (which had issues
with MergedColumnParallelLinear and different weight_scale_2 values),
keep the NVFP4 path but fix the activation global scale.

Compute correct input_global_scale_inv by:
1. Temporarily dequantizing weight to BF16
2. Running warmup forward with random input
3. Computing actual activation amax
4. Setting scale_inv = amax * headroom

This preserves the original NVFP4 quantization pipeline.
2026-05-18 15:43:46 +00:00
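
Steps 3 and 4 of the list above reduce to a small calculation once the warmup activations are available. The sketch below covers only that part, on plain Python lists. The function name calibrate_scale_inv and the headroom value 1.5 are illustrative assumptions, since the commit does not state the headroom used.

```python
def calibrate_scale_inv(activations, headroom):
    """Return amax * headroom for a batch of activation rows."""
    # Step 3: largest absolute activation observed during warmup.
    amax = max(abs(v) for row in activations for v in row)
    # Step 4: widen the range by the headroom factor.
    return amax * headroom


# Toy warmup activations (two tokens, two features each).
warmup_activations = [[0.5, -2.0], [1.5, 0.25]]
scale_inv = calibrate_scale_inv(warmup_activations, headroom=1.5)
```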
301015b037 Remove all inline diagnostics — incompatible with torch.compile
Data-dependent expressions (amax().item(), isnan().any().item())
cause Dynamo guard failures even when gated by os.environ.
cudagraph_mode=NONE still uses torch.compile, so these break.
Will need enforce-eager for diagnostics going forward.
2026-05-18 15:22:53 +00:00
a83d364d45 Switch to cudagraph_mode=NONE (not enforce-eager) for real inference testing 2026-05-18 15:05:52 +00:00
2a2a42c6d6 Add attention-internal diagnostics: MLA output, FP8 quant output 2026-05-18 14:45:43 +00:00
5c1dda10f6 Add granular attention diagnostics: pre/post attn, embed, dequant stats 2026-05-18 14:24:14 +00:00
e0e0528778 Add debug logging for BF16 dequant to find missing attrs 2026-05-18 14:04:12 +00:00
2e8c3c961f Fix: dequantize fused_wqa_wkv instead of separate wq_a/wkv
wq_a and wkv are fused into a single MergedColumnParallelLinear
called fused_wqa_wkv. Was checking for non-existent separate attrs.
2026-05-18 13:47:08 +00:00
a7216b27df Fix: keep wo_a as FP8 (fp8_einsum path), dequant others to BF16
wo_a uses fp8_einsum which is weight-only FP8 (no input_scale).
Only q_a, q_b, kv, o_b need BF16 dequant to avoid broken input_scale.
2026-05-18 13:22:15 +00:00
334e95047e Fix: dequantize ALL attention NVFP4 projections to BF16
Root cause of NaN from layer 0: FlashInferCutlassNvFp4LinearKernel
uses checkpoint input_scale for activation quantization, which produces
NaN immediately. Fix: dequantize all attention NVFP4 weights (wq_a,
wq_b, wkv, wo_a, wo_b) to BF16 at load time, bypassing the broken
input_scale entirely. Uses existing _dequant_nvfp4_to_bf16 method.

This trades memory for correctness. Future optimization: add warmup
for attention input_global_scale_inv (same as MoE warmup).
2026-05-18 13:09:36 +00:00
a83c332059 Fix docker-compose: remove orphaned compilation-config arg, enforce-eager mode 2026-05-18 12:54:14 +00:00
9e7639fba4 Add layer-by-layer diagnostic prints (CLAWMINE_DEBUG=1, enforce-eager)
When CLAWMINE_DEBUG=1, prints amax/mean/NaN/Inf after each layer.
Must run with --enforce-eager (data-dependent prints break Dynamo).
Gated by os.environ, so the prints are dead-code-eliminated during compilation.
2026-05-18 12:51:51 +00:00
2d1e9f42b1 Remove NaN check — incompatible with Dynamo fullgraph compilation
Dynamo fullgraph mode rejects BOTH data-dependent branching AND
torch.compiler.disable as graph breaks. The NaN check cannot coexist
with vLLM's AOT compilation. Use layertest/cudagraph_test for debugging.
2026-05-18 12:17:25 +00:00
65763a200c Fix NaN check: wrap in @torch.compiler.disable to prevent Dynamo graph break
The inline os.environ gate doesn't work — Dynamo still sees the
data-dependent branching (torch.isnan().any()) and crashes with
'Unsupported: Data-dependent branching'. Extracting into a
@torch.compiler.disable decorated function makes Dynamo skip it.
2026-05-18 11:33:29 +00:00
8758bc93ca crap shoot 2026-05-18 11:13:29 +00:00
b8df4a8cc5 Fix NaN check: use os.environ gate instead of is_current_stream_capturing
torch.cuda.is_current_stream_capturing() returns bool, which breaks
Dynamo FX tracing (non-Tensor output). Switch to env var gate:
CLAWMINE_NAN_CHECK=1 enables NaN/Inf detection.

Dynamo evaluates os.environ at trace time — if the env var is not set,
the entire NaN check block is compiled away. Set it before first
inference to get NaN detection during prefill only.
2026-05-18 02:20:14 +00:00
0c02d84514 Add NaN/Inf detection in DeepseekV4Model.forward layer loop
- Checks every layer during prefill (not during cudagraph capture)
- is_current_stream_capturing() gate prevents CPU-GPU syncs during capture
- Prints amax every 10 layers for magnitude tracking
- Breaks on first NaN/Inf to avoid wasting compute
2026-05-17 23:37:12 +00:00
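
The layer-loop check listed above can be illustrated without tensors. The sketch below is a plain-Python stand-in that runs a list of layer callables over a list of floats, reports amax every report_every layers, and stops at the first layer whose output contains NaN or Inf. It omits the stream-capture gate, and the name first_bad_layer is hypothetical.

```python
import math


def first_bad_layer(layers, hidden, report_every=10):
    """Run layers in order; return (hidden, index of first NaN/Inf layer or None)."""
    for idx, layer in enumerate(layers):
        hidden = layer(hidden)
        if any(math.isnan(v) or math.isinf(v) for v in hidden):
            # Stop immediately so later layers do not waste compute.
            return hidden, idx
        if idx % report_every == 0:
            amax = max(abs(v) for v in hidden)
            print(f"layer {idx}: amax={amax:.4g}")
    return hidden, None
```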
bedcfc4dab Pipeline test: use max_num_tokens=8192 matching vLLM 2026-05-17 23:04:44 +00:00
c45364b3a8 Add MoE scale ratio output 2026-05-17 22:58:27 +00:00
bf99ad49ec Print both MoE and residual cosine 2026-05-17 22:56:56 +00:00
8637020487 Fix multi-layer test: add residual connections 2026-05-17 22:55:40 +00:00
11dce13afe Add multi-layer pipeline test to check error accumulation 2026-05-17 22:53:28 +00:00
87582fc9f7 HOTFIX: remove NaN checks from run() — torch.isnan().any() does CPU-GPU sync, breaks cudagraph 2026-05-17 22:28:32 +00:00
8717e0e411 Fix warmup: use same padded GEMM path as run(), add swiglu_limit clamping 2026-05-17 22:03:48 +00:00
d332f4f900 Add NaN debug checks after L1 and L2 GEMM 2026-05-17 22:02:24 +00:00
e65f2b2ba2 Update CURRENT_BUG.md with Bug 26 fix 2026-05-17 21:36:25 +00:00
72628fb689 Full pipeline test: runner vs BF16 reference 2026-05-17 21:29:16 +00:00
2796bd81e8 Fix: scatter FP4 as uint8 (float4 doesn't support index_put) 2026-05-17 21:28:04 +00:00
364f8372bb Fix FP4 buffer shapes: D//2 for packed dimensions 2026-05-17 21:26:46 +00:00
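
The D//2 shape follows from storing two 4-bit codes in each byte. A minimal sketch of that packing on plain Python bytes is shown here. The nibble order (first code in the low nibble) is an assumption for illustration and may not match the kernel's layout.

```python
def pack_fp4(codes):
    """Pack a sequence of 4-bit codes (0..15) into len(codes) // 2 bytes."""
    if len(codes) % 2:
        raise ValueError("packed dimension needs an even number of codes")
    packed = bytearray(len(codes) // 2)
    for i in range(0, len(codes), 2):
        low, high = codes[i] & 0xF, codes[i + 1] & 0xF
        packed[i // 2] = low | (high << 4)  # assumed order: low nibble first
    return packed


def unpack_fp4(packed):
    """Inverse of pack_fp4: two 4-bit codes per byte."""
    codes = []
    for byte in packed:
        codes.append(byte & 0xF)
        codes.append(byte >> 4)
    return codes
```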
5e4d674736 Test fix: quantize slot_hidden, scatter FP4, pass slot_x_sf 2026-05-17 21:25:58 +00:00
803e7160d8 Fix: allocate FP4 buffers as uint8 then view-cast 2026-05-17 21:25:04 +00:00
7256070dd3 FIX Bug 26: quantize slot tokens, not padded buffer
The runner was quantizing the padded_hidden (4096 rows) and then
taking x_sf[:num_slots] (first 48 rows). This only got scales for
expert 0 (the first 48 rows of the padded buffer), not the scales
for tokens scattered across padded positions (expert 1 at row 128, etc.).

Fix: quantize slot_hidden (sorted tokens, num_slots rows) to get
correct per-token x_sf, then scatter x_fp4 into padded FP4 buffer
for the GEMM. The scale assembly now receives the correct x_sf.

Added hidden_fp4 and activated_fp4 padded buffers for FP4 scatter.
2026-05-17 21:24:43 +00:00
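
The indexing mistake behind Bug 26 can be reproduced on a toy layout. The sketch below uses plain lists, a block size of 4 instead of 128, and a per-row amax as a stand-in for the real per-token scale factors. The helper names and the one-block-per-expert layout are assumptions for illustration.

```python
def row_scales(rows):
    """Stand-in for per-token scale factors: one amax per row."""
    return [max(abs(v) for v in row) for row in rows]


def padded_positions(expert_ids, block):
    """Padded-buffer row for each slot; each expert owns one aligned block."""
    positions, used = [], {}
    for expert in expert_ids:
        offset = used.get(expert, 0)
        positions.append(expert * block + offset)
        used[expert] = offset + 1
    return positions


def scatter(slot_rows, positions, num_rows):
    """Copy slot rows into a zero-filled padded buffer."""
    width = len(slot_rows[0])
    padded = [[0.0] * width for _ in range(num_rows)]
    for row, pos in zip(slot_rows, positions):
        padded[pos] = list(row)
    return padded


# Three slots sorted by expert: two for expert 0, one for expert 1.
slot_hidden = [[1.0, -2.0], [0.5, 0.25], [4.0, 3.0]]
expert_ids = [0, 0, 1]
block = 4

positions = padded_positions(expert_ids, block)
padded_hidden = scatter(slot_hidden, positions, num_rows=2 * block)

# Buggy: scale the padded buffer, then take the first num_slots rows.
buggy_sf = row_scales(padded_hidden)[: len(slot_hidden)]
# Fixed: scale the slot rows directly, then scatter the quantized data.
fixed_sf = row_scales(slot_hidden)
```

In this toy layout the third slot lives at padded row 4, so the buggy slice picks up the scale of an empty padding row instead.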
4d0b6d889d Set runner weights before _ensure_stacked 2026-05-17 21:22:50 +00:00
b7acac5e4e Call _ensure_stacked() before using runner buffers 2026-05-17 21:22:30 +00:00
1acf01fc1a Fix token_indices: repeat each token ID top_k times, not arange 2026-05-17 21:22:11 +00:00
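
The difference between the two index constructions is easiest to see on a small case. This sketch assumes the indices are built before any sort by expert, and the function names are hypothetical.

```python
def token_indices_repeat(num_tokens, top_k):
    """Each token ID repeated top_k times, one entry per (token, expert) slot."""
    return [token for token in range(num_tokens) for _ in range(top_k)]


def token_indices_arange(num_tokens, top_k):
    """The earlier form: a plain range over all slots."""
    return list(range(num_tokens * top_k))


fixed = token_indices_repeat(3, 2)
earlier = token_indices_arange(3, 2)
```

With 3 tokens and top_k=2, the plain range yields IDs up to 5 even though only tokens 0 to 2 exist, so it cannot be used to map slots back to their source tokens.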
a478ca4746 Debug: trace runner logic step by step, test L1 GEMM 2026-05-17 21:21:45 +00:00
a100bd11c1 Simplify pipeline test: BF16 ref + bridge ref + full runner 2026-05-17 21:20:41 +00:00
6eade5e7f8 Fix: gs values are floats not tensors 2026-05-17 21:19:47 +00:00
b05a38a9bd Test stages 1-2 first: sort + L1 GEMM 2026-05-17 21:19:23 +00:00
9728604ea1 Pipeline test: stage-by-stage with BF16 reference comparison 2026-05-17 21:19:17 +00:00
7fff5fd39b Fix: correct intermediate_size=3072, weight key prefix, dequantize shapes 2026-05-17 21:18:20 +00:00
4ef345773d Rewrite pipeline test: load real weights, step-by-step vs BF16 reference 2026-05-17 21:17:18 +00:00
b43541afdd Fix test path setup 2026-05-17 21:00:00 +00:00
490ddfa294 Pipeline test: use synthetic weights at 256x512 (JIT at 7168x18432 hangs for hours) 2026-05-17 20:58:06 +00:00