torch.cuda.is_current_stream_capturing() returns a Python bool, which
breaks Dynamo FX tracing (non-Tensor output). Switch to an env var gate:
CLAWMINE_NAN_CHECK=1 enables NaN/Inf detection.
Dynamo evaluates os.environ at trace time, so if the env var is not set
the entire NaN check block is compiled away. Set it before the first
inference call to get NaN detection during prefill only.
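A minimal sketch of the gate, assuming the flag is read as a plain
Python value before tracing. nan_check_enabled and its env parameter
are hypothetical names, not the runner's code:

```python
import os

def nan_check_enabled(env=os.environ):
    # Plain Python lookup: a tracer sees the result as a constant bool,
    # so a branch guarded by it is either kept or dropped at trace time.
    return env.get("CLAWMINE_NAN_CHECK", "0") == "1"

# Read once, before the first inference call.
NAN_CHECK = nan_check_enabled()
```

Changing the variable after tracing is not expected to take effect
without a retrace.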
- Checks every layer during prefill (not during cudagraph capture)
- is_current_stream_capturing() gate prevents CPU-GPU syncs during capture
- Prints amax every 10 layers for magnitude tracking
- Breaks on first NaN/Inf to avoid wasting compute
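The per-layer check above can be sketched as follows. This is an
illustration only: scan_layers is a hypothetical helper, the amax
values are passed in as Python floats rather than computed from
tensors, and the stream-capture gate is left out:

```python
import math

def scan_layers(layer_amax, log_every=10):
    """Return (index of first NaN/Inf layer or None, logged samples)."""
    logged = []
    for i, amax in enumerate(layer_amax):
        if not math.isfinite(amax):
            # Stop at the first bad layer; later layers are not visited.
            return i, logged
        if i % log_every == 0:
            # Record amax every log_every layers for magnitude tracking.
            logged.append((i, amax))
    return None, logged
```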
The runner was quantizing padded_hidden (4096 rows) and then
taking x_sf[:num_slots] (the first 48 rows). This only picked up scales
for expert 0 (the first 48 rows of the padded buffer), not the scales
for tokens scattered across padded positions (expert 1 at row 128, etc.).
Fix: quantize slot_hidden (sorted tokens, num_slots rows) to get
correct per-token x_sf, then scatter x_fp4 into padded FP4 buffer
for the GEMM. The scale assembly now receives the correct x_sf.
Added hidden_fp4 and activated_fp4 padded buffers for FP4 scatter.
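A toy sketch of the corrected order (quantize the sorted rows first,
then scatter into the padded buffer). Everything here is a stand-in:
row_scale uses one max-abs scale per row instead of real FP4 block
scales, and quantize_then_scatter and dest_rows are hypothetical names:

```python
def row_scale(row):
    # Toy per-row scale; a real FP4 path uses per-block scales.
    return max(abs(v) for v in row) or 1.0

def quantize_then_scatter(slot_hidden, dest_rows, padded_rows):
    width = len(slot_hidden[0])
    # One scale per real token, in sorted-slot order (num_slots entries).
    x_sf = [row_scale(r) for r in slot_hidden]
    # Padded buffer starts as zeros; only real tokens are written.
    padded = [[0.0] * width for _ in range(padded_rows)]
    for row, s, dst in zip(slot_hidden, x_sf, dest_rows):
        padded[dst] = [v / s for v in row]
    return x_sf, padded
```

With this order x_sf has exactly num_slots entries, so no slice of a
padded scale buffer is needed to recover the per-token scales.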
DeepSeek-V4 uses SiluAndMulWithClamp(10.0), which clamps:
- silu(gate) to a maximum of 10.0
- up into the range [-10.0, 10.0]
Our runner was doing plain F.silu(gate) * up without clamping.
Large gate values could produce unbounded SiLU output, causing
numerical issues in the L2 GEMM. This is likely contributing to
garbage model output.
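A scalar sketch of the clamped activation as described above. It
follows this message's description of SiluAndMulWithClamp, not the
reference implementation, and silu_and_mul_with_clamp is a
hypothetical name:

```python
import math

def silu(x):
    # Numerically stable x * sigmoid(x) for both signs of x.
    if x >= 0:
        return x / (1.0 + math.exp(-x))
    e = math.exp(x)
    return x * e / (1.0 + e)

def silu_and_mul_with_clamp(gate, up, limit=10.0):
    act = min(silu(gate), limit)          # clamp silu(gate) from above
    up = max(-limit, min(up, limit))      # clamp up into [-limit, limit]
    return act * up
```

The unclamped version corresponds to returning silu(gate) * up
directly, which grows without bound as gate grows.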
Dynamic slicing with GPU scalars (e.g. buf[:gpu_scalar]) forces a
device-to-host sync to read the slice bound, which is not permitted
during stream capture. Use full pre-allocated buffers instead of
dynamic slices. The GEMM only reads rows indicated by expert_offsets,
ignoring the zero padding.
Also pass x_sf[:num_slots] (Python int slicing, cudagraph-safe) to
scale assembly so it only processes real token scale data.
Removed .cpu().tolist() and per-expert Python loops. Apply the
Blackwell 32_4_4 swizzle to the entire padded_x_sf buffer at once.
The buffer is already 128-row aligned (padded per expert) and 4-col
aligned, so the full-buffer swizzle produces the correct layout.
The GEMM reads scale_a using padded_expert_offsets, which matches
the scatter layout. Fully GPU, zero CPU syncs, cudagraph-safe.
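A pure-Python sketch of why one full-buffer swizzle matches the
per-expert result. The index formula is an assumption about the
32_4_4 layout (128x4 tiles of 512 entries each), not code taken from
the runner, and swizzle_32_4_4 is a hypothetical name:

```python
def swizzle_32_4_4(sf, rows, cols):
    """Reorder a flat row-major scale list into 128x4 swizzled tiles."""
    assert rows % 128 == 0 and cols % 4 == 0
    out = [None] * (rows * cols)
    col_tiles = cols // 4
    for r in range(rows):
        for c in range(cols):
            # Each 128-row by 4-col tile is permuted independently.
            tile = (r // 128) * col_tiles + (c // 4)
            rr = r % 128
            dst = tile * 512 + (rr % 32) * 16 + (rr // 32) * 4 + (c % 4)
            out[dst] = sf[r * cols + c]
    return out
```

Since every tile is handled on its own, swizzling a buffer whose
expert regions all start on 128-row boundaries gives the same bytes as
swizzling each expert region separately and concatenating.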
GPU scalars can't be used for Python indexing (requires sync).
Compute padded_expert_offsets on CPU via .cpu().tolist() for
the Python loop. This is OK for cudagraph: Python code only
runs during capture, not replay. The GPU kernel launches
recorded during capture are deterministic.
Root cause of garbage output: fixed-layout padding with
max_chunks=ceil(avg) was too small for uneven expert assignment.
Tokens beyond max_chunks*128 per expert were silently dropped
(clamped_local overwrote the same row).
Fix: compute padded_expert_offsets from the actual tokens_per_expert
(each count padded up to a multiple of 128). No clamping is needed,
since each expert gets exactly the space it needs. Pass
padded_expert_offsets to scale assembly and GEMM.
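The offset computation can be sketched on plain Python ints as below.
This is an illustration, not the runner's GPU code, and the function
names are hypothetical:

```python
def padded_expert_offsets(tokens_per_expert, align=128):
    """Prefix sums of per-expert token counts rounded up to align."""
    offsets = [0]
    for n in tokens_per_expert:
        padded = -(-n // align) * align   # ceil(n / align) * align
        offsets.append(offsets[-1] + padded)
    return offsets

def dest_row(offsets, expert, local_index):
    # Row in the padded buffer for the local_index-th token of expert.
    return offsets[expert] + local_index
```

For counts [48, 0, 130, 128] this yields offsets [0, 128, 128, 384,
512]: the expert with 130 tokens gets two 128-row chunks, so its last
token lands on its own row instead of overwriting an earlier one.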