Forward pre-hook approach didn't work — torch.compile and model
wrappers bypass hooks. Instead, patch vLLM's utils.py to call
model._post_quant_fix() at the end of process_weights_after_loading.
This guarantees the fix runs AFTER quant methods set up their attrs.
Dockerfile now patches:
model_loader/utils.py → calls model._post_quant_fix() if it exists
DeepseekV4ForCausalLM._post_quant_fix() dequantizes attention
NVFP4 weights to BF16 and replaces quant_method.
vLLM V1 calls DeepseekV4Model.forward() directly, not
DeepseekV4ForCausalLM.forward(). Hook on the outer model never fires.
Moved hook to self.model (inner) and fixed module.model.layers →
module.layers.
The key insight: process_weights_after_loading runs AFTER load_weights
and sets up FlashInferCutlassNvFp4LinearKernel with broken
input_global_scale_inv. Any fix inside load_weights gets overwritten.
Solution: register a one-shot forward pre-hook that runs on the first
forward call (guaranteed after all init). It dequantizes attention
NVFP4 weights to BF16 and replaces quant_method with
UnquantizedLinearMethod. Since process_weights_after_loading already
ran, our changes won't be overwritten.
Standalone test confirmed: all attention weights produce valid
non-NaN output when dequantized to BF16.
Instead of dequantizing to BF16 (which gets overwritten by
process_weights_after_loading), fix the input_scale parameter
on the module before the quant method reads it. The quant method
computes input_global_scale_inv = input_scale.max(), so fixing
input_scale propagates the correct activation scale.
Computes correct input_scale by temporarily dequantizing weight
to BF16, running warmup forward, and computing act_amax.
input_scale = 1/(act_amax * headroom).
process_weights_after_loading sets input_global_scale_inv AFTER
_convert_nvfp4_post_load runs, so the fix couldn't find the attrs.
Going back to BF16 dequant approach. The zeros in the dummy run are
expected (attention_impl returns early with out.zero_()). Need to test
with a real request under cudagraph_mode=NONE.
Instead of dequantizing attention weights to BF16 (which had issues
with MergedColumnParallelLinear and different weight_scale_2 values),
keep the NVFP4 path but fix the activation global scale.
Compute correct input_global_scale_inv by:
1. Temporarily dequantizing weight to BF16
2. Running warmup forward with random input
3. Computing actual activation amax
4. Setting scale_inv = amax * headroom
This preserves the original NVFP4 quantization pipeline.
Data-dependent expressions (amax().item(), isnan().any().item())
cause Dynamo guard failures even when gated by os.environ.
cudagraph_mode=NONE still uses torch.compile, so these break.
Will need enforce-eager for diagnostics going forward.
Root cause of NaN from layer 0: FlashInferCutlassNvFp4LinearKernel
uses checkpoint input_scale for activation quantization, which produces
NaN immediately. Fix: dequantize all attention NVFP4 weights (wq_a,
wq_b, wkv, wo_a, wo_b) to BF16 at load time, bypassing the broken
input_scale entirely. Uses existing _dequant_nvfp4_to_bf16 method.
This trades memory for correctness. Future optimization: add warmup
for attention input_global_scale_inv (same as MoE warmup).
When CLAWMINE_DEBUG=1, prints amax/mean/NaN/Inf after each layer.
Must run with --enforce-eager (data-dependent prints break Dynamo).
Gated by os.environ so dead-code-eliminated during compilation.
Dynamo fullgraph mode rejects BOTH data-dependent branching AND
torch.compiler.disable as graph breaks. The NaN check cannot coexist
with vLLM's AOT compilation. Use layertest/cudagraph_test for debugging.
The inline os.environ gate doesn't work — Dynamo still sees the
data-dependent branching (torch.isnan().any()) and crashes with
'Unsupported: Data-dependent branching'. Extracting into a
@torch.compiler.disable decorated function makes Dynamo skip it.
torch.cuda.is_current_stream_capturing() returns bool, which breaks
Dynamo FX tracing (non-Tensor output). Switch to env var gate:
CLAWMINE_NAN_CHECK=1 enables NaN/Inf detection.
Dynamo evaluates os.environ at trace time — if the env var is not set,
the entire NaN check block is compiled away. Set it before first
inference to get NaN detection during prefill only.
- Checks every layer during prefill (not during cudagraph capture)
- is_current_stream_capturing() gate prevents CPU-GPU syncs during capture
- Prints amax every 10 layers for magnitude tracking
- Breaks on first NaN/Inf to avoid wasting compute
The runner was quantizing the padded_hidden (4096 rows) and then
taking x_sf[:num_slots] (first 48 rows). This only got scales for
expert 0 (the first 48 rows of the padded buffer), not the scales
for tokens scattered across padded positions (expert 1 at row 128, etc).
Fix: quantize slot_hidden (sorted tokens, num_slots rows) to get
correct per-token x_sf, then scatter x_fp4 into padded FP4 buffer
for the GEMM. The scale assembly now receives the correct x_sf.
Added hidden_fp4 and activated_fp4 padded buffers for FP4 scatter.