Commit Graph

163 Commits

Author SHA1 Message Date
e0e0528778 Add debug logging for BF16 dequant to find missing attrs 2026-05-18 14:04:12 +00:00
2e8c3c961f Fix: dequantize fused_wqa_wkv instead of separate wq_a/wkv
wq_a and wkv are fused into a single MergedColumnParallelLinear
called fused_wqa_wkv. Was checking for non-existent separate attrs.
2026-05-18 13:47:08 +00:00
a7216b27df Fix: keep wo_a as FP8 (fp8_einsum path), dequant others to BF16
wo_a uses fp8_einsum which is weight-only FP8 (no input_scale).
Only q_a, q_b, kv, o_b need BF16 dequant to avoid broken input_scale.
2026-05-18 13:22:15 +00:00
334e95047e Fix: dequantize ALL attention NVFP4 projections to BF16
Root cause of NaN from layer 0: FlashInferCutlassNvFp4LinearKernel
uses checkpoint input_scale for activation quantization, which produces
NaN immediately. Fix: dequantize all attention NVFP4 weights (wq_a,
wq_b, wkv, wo_a, wo_b) to BF16 at load time, bypassing the broken
input_scale entirely. Uses existing _dequant_nvfp4_to_bf16 method.

This trades memory for correctness. Future optimization: add warmup
for attention input_global_scale_inv (same as MoE warmup).
2026-05-18 13:09:36 +00:00
9e7639fba4 Add layer-by-layer diagnostic prints (CLAWMINE_DEBUG=1, enforce-eager)
When CLAWMINE_DEBUG=1, prints amax/mean/NaN/Inf after each layer.
Must run with --enforce-eager (data-dependent prints break Dynamo).
Gated by os.environ so dead-code-eliminated during compilation.
2026-05-18 12:51:51 +00:00
2d1e9f42b1 Remove NaN check — incompatible with Dynamo fullgraph compilation
Dynamo fullgraph mode rejects BOTH data-dependent branching AND
torch.compiler.disable as graph breaks. The NaN check cannot coexist
with vLLM's AOT compilation. Use layertest/cudagraph_test for debugging.
2026-05-18 12:17:25 +00:00
65763a200c Fix NaN check: wrap in @torch.compiler.disable to prevent Dynamo graph break
The inline os.environ gate doesn't work — Dynamo still sees the
data-dependent branching (torch.isnan().any()) and crashes with
'Unsupported: Data-dependent branching'. Extracting into a
@torch.compiler.disable decorated function makes Dynamo skip it.
2026-05-18 11:33:29 +00:00
b8df4a8cc5 Fix NaN check: use os.environ gate instead of is_current_stream_capturing
torch.cuda.is_current_stream_capturing() returns bool, which breaks
Dynamo FX tracing (non-Tensor output). Switch to env var gate:
CLAWMINE_NAN_CHECK=1 enables NaN/Inf detection.

Dynamo evaluates os.environ at trace time — if the env var is not set,
the entire NaN check block is compiled away. Set it before first
inference to get NaN detection during prefill only.
2026-05-18 02:20:14 +00:00
0c02d84514 Add NaN/Inf detection in DeepseekV4Model.forward layer loop
- Checks every layer during prefill (not during cudagraph capture)
- is_current_stream_capturing() gate prevents CPU-GPU syncs during capture
- Prints amax every 10 layers for magnitude tracking
- Breaks on first NaN/Inf to avoid wasting compute
2026-05-17 23:37:12 +00:00
87582fc9f7 HOTFIX: remove NaN checks from run() — torch.isnan().any() does CPU-GPU sync, breaks cudagraph 2026-05-17 22:28:32 +00:00
8717e0e411 Fix warmup: use same padded GEMM path as run(), add swiglu_limit clamping 2026-05-17 22:03:48 +00:00
d332f4f900 Add NaN debug checks after L1 and L2 GEMM 2026-05-17 22:02:24 +00:00
2796bd81e8 Fix: scatter FP4 as uint8 (float4 doesn't support index_put) 2026-05-17 21:28:04 +00:00
364f8372bb Fix FP4 buffer shapes: D//2 for packed dimensions 2026-05-17 21:26:46 +00:00
803e7160d8 Fix: allocate FP4 buffers as uint8 then view-cast 2026-05-17 21:25:04 +00:00
7256070dd3 FIX Bug 26: quantize slot tokens, not padded buffer
The runner was quantizing the padded_hidden (4096 rows) and then
taking x_sf[:num_slots] (first 48 rows). This only got scales for
expert 0 (the first 48 rows of the padded buffer), not the scales
for tokens scattered across padded positions (expert 1 at row 128, etc).

Fix: quantize slot_hidden (sorted tokens, num_slots rows) to get
correct per-token x_sf, then scatter x_fp4 into padded FP4 buffer
for the GEMM. The scale assembly now receives the correct x_sf.

Added hidden_fp4 and activated_fp4 padded buffers for FP4 scatter.
2026-05-17 21:24:43 +00:00
22e0370e6e Fix AttributeError: DeepseekV4MegaMoEExperts has no swiglu_limit
Get swiglu_limit from vllm_config.model_config.hf_config instead
of self (it was only set on the parent DeepseekV4MoE class).
2026-05-17 18:06:44 +00:00
a10c582cf4 Add swiglu_limit=10.0 activation clamping (was missing)
DeepSeek-V4 uses SiluAndMulWithClamp(10.0) which clamps:
- silu(gate) to max 10.0
- up to [-10.0, 10.0]

Our runner was doing plain F.silu(gate) * up without clamping.
Large gate values could produce unbounded SiLU output, causing
numerical issues in the L2 GEMM. This is likely contributing to
garbage model output.
2026-05-17 17:52:16 +00:00
3f2f4e1882 Fix cudaErrorStreamCaptureUnsupported: no dynamic GPU-tensor slicing
Dynamic slicing with GPU scalars (e.g. buf[:gpu_scalar]) is a CUDA
operation not permitted during stream capture. Use full pre-allocated
buffers instead of dynamic slices. The GEMM only reads rows indicated
by expert_offsets, ignoring the zero padding.

Also pass x_sf[:num_slots] (Python int slicing, cudagraph-safe) to
scale assembly so it only processes real token scale data.
2026-05-17 17:24:26 +00:00
11b5aa5e37 Scale assembly: full-buffer swizzle, zero CPU syncs, no Python loops
Removed .cpu().tolist() and per-expert Python loops. Apply the
Blackwell 32_4_4 swizzle to the entire padded_x_sf buffer at once.
The buffer is already 128-row aligned (padded per expert) and 4-col
aligned, so the full-buffer swizzle produces the correct layout.

The GEMM reads scale_a using padded_expert_offsets, which matches
the scatter layout. Fully GPU, zero CPU syncs, cudagraph-safe.
2026-05-17 16:59:51 +00:00
94dec5922d Scale assembly Phase 2: use CPU-computed offsets for Python slicing
GPU scalars can't be used for Python indexing (requires sync).
Compute padded_expert_offsets on CPU via .cpu().tolist() for
the Python loop. This is OK for cudagraph: Python code only
runs during capture, not replay. The GPU kernel launches
recorded during capture are deterministic.
2026-05-17 16:56:52 +00:00
49c28e6562 Fix: use real padded expert offsets instead of fixed layout
Root cause of garbage output: fixed-layout padding with
max_chunks=ceil(avg) was too small for uneven expert assignment.
Tokens beyond max_chunks*128 per expert were silently dropped
(clamped_local overwrote the same row).

Fix: compute padded_expert_offsets from actual tokens_per_expert
(padded to 128). No clamping needed — each expert gets exactly
the space it needs. Pass padded_expert_offsets to scale assembly
and GEMM.
2026-05-17 16:55:47 +00:00
7c16f3cb46 Fix: init shared dict before using it, remove duplicate _output_buf 2026-05-17 16:06:58 +00:00
ea8acf9852 Share padded_x_sf and output buffers across layers to save ~300 MB
Per-layer padded_xsf (2.4 MB) + output_buf (4.2 MB) × 60 layers = ~400 MB.
Sharing reduces to ~3.6 MB total. Layers run sequentially during both
capture and replay.
2026-05-17 16:05:53 +00:00
455ecb5631 Fix: define padded_max_slots before using it in shared buffer allocation 2026-05-17 15:47:38 +00:00
b1ac74bb4d Fix shape mismatch: shared padded buffers, revert max_num_tokens cap
Root cause: capping max_num_tokens to 512 made buffers too small for the
actual 8192-token warmup. slot_hidden had 49152 rows but padded_hidden
only had 6144.

Fix: Revert the 512 cap. Use SHARED padded buffers (not per-layer) to
avoid OOM. Only 72 MB total (not 4.3 GB) since layers run sequentially
and reuse the same buffer. Cudagraph-safe since capture and replay both
run layers sequentially on the same tensor.
2026-05-17 15:47:10 +00:00
faf7c8cc51 Debug: print runner max_num_tokens and max_chunks 2026-05-17 15:18:07 +00:00
c5af1aba6b Fix OOB: size padded buffers for num_experts*max_chunks*128
padded_max_slots was computed from max_tokens*top_k (3072) but
total_padded_slots in run() is num_experts*max_chunks*128 (6144).
The buffer was too small, causing index out of bounds.
2026-05-17 14:59:45 +00:00
8ac8e20fa9 Fix OOM: cap buffer pre-allocation at cudagraph max capture size
padded_hidden/activated buffers were sized for max_num_tokens=8192,
which is 72 MB per layer × 60 layers = 4.3 GB → OOM with 178 GB GPUs
(almost full from model + KV cache).

Now cap at max cudagraph capture size (512 tokens). Eager-mode runs
with >512 tokens will need dynamic allocation, but vLLM always uses
cudagraph for inference after warmup.
2026-05-17 14:14:13 +00:00
5bb78564f5 Remove dynamic tensor allocation in scale assembly (cudagraph fix)
Removed torch.zeros() call that created padded_expert_offsets during
scale assembly. Now uses fixed layout computed from Python constants.
Also removed dead reference to padded_expert_offsets variable.
2026-05-17 14:01:32 +00:00
8c31e78359 Fix cudagraph: fully fixed-layout per-expert sections, no GPU scalars in Python control flow
- Each expert gets max_chunks*128 rows at fixed offsets (e*max_chunks*128)
- Phase 1 scatters into fixed offsets with clamped local_row
- Phase 2 reads from fixed offsets (pure Python arithmetic, no GPU sync)
- padded_x_sf_buf sized for num_experts * max_chunks * 128
- padded_expert_offsets pre-computed in _allocate_buffers
2026-05-17 13:58:58 +00:00
ff74b33d2c Fix cudagraph: static loop for per-expert scale swizzle
The while loop had variable trip count (GPU scalar in condition),
requiring CPU-GPU sync. Replaced with fixed max_chunks_per_expert
iterations. Unused chunks are zero buffers (harmless for GEMM).
2026-05-17 13:56:52 +00:00
bf22b6f0e4 Fix scale assembly: variable-size per-expert padding matching GEMM offsets
- Compute padded_expert_offsets from real expert_offsets (ceil to 128)
- Scatter x_sf into padded positions matching those offsets
- Per-expert swizzle in 128-row chunks (supports >128 tokens per expert)
- Pad slot_hidden/activated using same padded offsets for GEMM input
- Pre-allocated buffers sized for max_tokens*top_k (not num_experts*128)
2026-05-17 13:55:10 +00:00
bde81b95f4 Fix GEMM scale layout: pad to 128 tokens per expert
Root cause of garbage output: the GEMM reads scale_a according to
expert_offsets (e.g. [0, 500, 1024, ...]) but scale_a had data at
fixed e*128 offsets. When expert 0 has 500 tokens, the GEMM reads
scale_a[0:500] but only rows 0-127 had valid data.

Fix: pad slot_hidden to num_experts*128 rows (128 per expert) and
pass padded_expert_offsets=[0, 128, 256, ...] to the GEMM. Scale
assembly's fixed 128-row layout now matches the GEMM's expectations.
Padding tokens' GEMM output is discarded (scatter_add only uses
sorted_token_ids for real tokens).
2026-05-17 13:19:31 +00:00
7e692c3aec Fix cudaErrorStreamCaptureUnsupported: pre-allocate all tensors used during capture
torch.full(), torch.zeros(), torch.arange() allocate new tensors during
cudagraph capture, which triggers cudaErrorStreamCaptureUnsupported.

Pre-allocate:
- _l1_gsa_buf / _l2_gsa_buf (use .fill_() instead of torch.full)
- _output_buf (use .zero_() on pre-allocated slice)
- _row_indices_buf (pre-allocated arange, sliced during use)
2026-05-17 12:31:25 +00:00
b0221662e7 Fix warmup: pass local expert IDs (not global), remove incorrect _warmup_done guard
compute_activation_global_scales expects local IDs (0..num_experts-1),
not global IDs. EP5/EP7 were getting L2 gs=0 because global IDs (240+,
336+) didn't match expert_id_range (0..47), so no tokens matched any
expert → L1 GEMM got zero inputs → L2 gs=0 → NaN/crash.

Also removed _warmup_done guard since each layer needs its own warmup
(different weights, different gs values).
2026-05-17 11:38:19 +00:00
b531a98f8f Fix scale assembly: per-expert 128-row fixed slots, no dynamic sizing
- Reverted from full-buffer swizzle to per-expert 128-row slots
- Scatter into e*128 fixed positions (cudagraph-compatible, fixed shape)
- Clamp local_row to 127 for experts with >128 tokens (GEMM uses expert_offsets)
- Buffer sized for num_experts*128 rows (not max_tokens*top_k)
- Add _warmup_done guard to only run warmup once (not 60x)
2026-05-17 11:10:59 +00:00
04245b664b Add warmup-based activation global scale computation in finalize_weights
The checkpoint input_scale is a calibration value that produces wrong gs
at runtime (too small → block scales saturate → garbage output → EOS).

Now calls compute_activation_global_scales() with sample data during weight
finalization, before cudagraph capture. This observes actual activation
magnitudes and computes correct L1 and L2 gs values.
2026-05-17 10:48:24 +00:00
4445882ba7 Fix: return 2D scale tensor for GEMM (shape[1] access) 2026-05-17 09:59:57 +00:00
3cd910193c Rewrite scale assembly: no .item() calls, no Python loops, fully GPU
Apply to_blocked swizzle on entire padded buffer at once instead of
per-expert loops. No .item()/.cpu() calls. Fully cudagraph-safe.
2026-05-17 09:59:12 +00:00
4f6217acb9 Fix padded_cols calculation in scale assembly 2026-05-17 09:58:09 +00:00
918aa8aede Fix scale assembly output shape: reshape to 2D for GEMM 2026-05-17 09:57:27 +00:00
d9bae6d770 Fix OOB in scale assembly: size padded_x_sf for max tokens, fix top_k/max_num_tokens passing, support variable-size expert blocks
Bug 9: padded_x_sf was sized for num_experts*128 rows, but with 8192 tokens
and top_k=6, the actual padded row count can exceed 6144. Also:
- Pass top_k and max_num_tokens from deepseek_v4.py (was defaulting to 8/8192)
- Phase 2 of scale assembly now handles experts with >128 tokens (multiple 128-row chunks)
- Remove debug prints
2026-05-17 09:56:28 +00:00
55ac60eb91 Add detailed debug prints for OOB investigation 2026-05-17 09:39:42 +00:00
fed3c417ba Add debug OOB check for sorted_token_ids 2026-05-17 09:19:10 +00:00
ca3cba5bbd Fix global→local expert ID remapping for EP and remove .cpu() sync
Root cause of CUDA_ERROR_ASSERT index out of bounds:
- topk_ids contains GLOBAL expert IDs (0-255) but runner treated them
  as local IDs (0-31 with EP=8). Tokens for non-local experts got
  wrong expert assignments, causing out-of-bounds scatter indices
  in _assemble_scales_cudagraph_safe.

Fixes:
1. Add experts_start_idx param to CuTeDSLMoERunner
2. In run(), remap global→local IDs and zero weights for non-local experts
3. Move _token_indices from CPU to GPU (remove sort_idx.cpu() sync)
4. Add _fill_token_indices() and _needs_token_refill to handle CuTeDSL
   JIT GPU memory corruption (refill after first GEMM call)
2026-05-17 08:58:43 +00:00
1330e2b2cf cleanup: remove debug prints, ready for testing
Current state:
- Token indices on CPU (avoids CuTeDSL GPU memory corruption)
- Scale assembly uses per-expert swizzle + scatter (matches reference)
- compute_activation_global_scales warmup gets ~0.97 cosine
- expert_offsets passed without leading 0 (matches pipeline)
- layertest + cudagraph_test pass
2026-05-17 08:30:41 +00:00
d635dcbbb6 fix: keep token_indices on CPU, index with CPU sort_idx
CuTeDSL's cute.compile corrupts GPU memory during JIT compilation.
Keeping token_indices on CPU and using sort_idx.cpu() for indexing
avoids the corruption. The .to(device) call after indexing moves the
result back to GPU for the hidden_states indexing.
2026-05-17 08:29:18 +00:00
235d5b314f fix: fallback token indices allocation with verify+rebuild 2026-05-17 08:27:47 +00:00
dd0b3fd4f9 debug: print sorted_token_ids in warmup 2026-05-17 08:25:25 +00:00