Commit Graph

267 Commits

Author SHA1 Message Date
8758bc93ca crap shoot 2026-05-18 11:13:29 +00:00
b8df4a8cc5 Fix NaN check: use os.environ gate instead of is_current_stream_capturing
torch.cuda.is_current_stream_capturing() returns a bool, which breaks
Dynamo FX tracing (non-Tensor output). Switch to an env var gate:
CLAWMINE_NAN_CHECK=1 enables NaN/Inf detection.

Dynamo evaluates os.environ at trace time, so if the env var is not set,
the entire NaN check block is compiled away. Set it before the first
inference to get NaN detection during prefill only.
2026-05-18 02:20:14 +00:00
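A minimal plain-Python sketch of the env var gate described in this commit. Only the CLAWMINE_NAN_CHECK variable comes from the commit message; the helper names are hypothetical, and lists of floats stand in for the tensors the real model code would check. It does not model Dynamo tracing, only the on/off behavior of the gate.

```python
import math
import os


def nan_check_enabled(environ=os.environ):
    """Return True only when CLAWMINE_NAN_CHECK is set to "1"."""
    return environ.get("CLAWMINE_NAN_CHECK") == "1"


def first_bad_layer(layer_outputs, enabled):
    """Return the index of the first layer containing NaN/Inf, or None.

    layer_outputs is a list of lists of floats standing in for per-layer
    activations. When the gate is off, no values are inspected at all.
    """
    if not enabled:
        return None
    for idx, values in enumerate(layer_outputs):
        if any(not math.isfinite(v) for v in values):
            return idx  # stop at the first bad layer
    return None
```

With the gate off the loop is skipped entirely, which mirrors the intent that an unset variable costs nothing at inference time.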
0c02d84514 Add NaN/Inf detection in DeepseekV4Model.forward layer loop
- Checks every layer during prefill (not during cudagraph capture)
- is_current_stream_capturing() gate prevents CPU-GPU syncs during capture
- Prints amax every 10 layers for magnitude tracking
- Breaks on first NaN/Inf to avoid wasting compute
2026-05-17 23:37:12 +00:00
bedcfc4dab Pipeline test: use max_num_tokens=8192 matching vLLM 2026-05-17 23:04:44 +00:00
c45364b3a8 Add MoE scale ratio output 2026-05-17 22:58:27 +00:00
bf99ad49ec Print both MoE and residual cosine 2026-05-17 22:56:56 +00:00
8637020487 Fix multi-layer test: add residual connections 2026-05-17 22:55:40 +00:00
11dce13afe Add multi-layer pipeline test to check error accumulation 2026-05-17 22:53:28 +00:00
87582fc9f7 HOTFIX: remove NaN checks from run() — torch.isnan().any() does CPU-GPU sync, breaks cudagraph 2026-05-17 22:28:32 +00:00
8717e0e411 Fix warmup: use same padded GEMM path as run(), add swiglu_limit clamping 2026-05-17 22:03:48 +00:00
d332f4f900 Add NaN debug checks after L1 and L2 GEMM 2026-05-17 22:02:24 +00:00
e65f2b2ba2 Update CURRENT_BUG.md with Bug 26 fix 2026-05-17 21:36:25 +00:00
72628fb689 Full pipeline test: runner vs BF16 reference 2026-05-17 21:29:16 +00:00
2796bd81e8 Fix: scatter FP4 as uint8 (float4 doesn't support index_put) 2026-05-17 21:28:04 +00:00
364f8372bb Fix FP4 buffer shapes: D//2 for packed dimensions 2026-05-17 21:26:46 +00:00
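Why a packed FP4 dimension is D//2: two 4-bit codes share one byte. The sketch below is illustrative only. The function names are hypothetical, and the nibble order (first code in the low nibble) is an assumption, not something the commit states.

```python
def pack_fp4(codes):
    """Pack a list of 4-bit codes (0..15) into bytes, two codes per byte."""
    if len(codes) % 2:
        raise ValueError("need an even number of 4-bit codes")
    return bytes(
        (codes[i] & 0xF) | ((codes[i + 1] & 0xF) << 4)  # assumed: low nibble first
        for i in range(0, len(codes), 2)
    )


def unpack_fp4(packed):
    """Inverse of pack_fp4: expand each byte back into two 4-bit codes."""
    out = []
    for byte in packed:
        out.append(byte & 0xF)
        out.append(byte >> 4)
    return out


codes = [1, 15, 0, 7, 8, 3]   # D = 6 logical FP4 values
packed = pack_fp4(codes)      # D // 2 = 3 bytes of storage
```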
5e4d674736 Test fix: quantize slot_hidden, scatter FP4, pass slot_x_sf 2026-05-17 21:25:58 +00:00
803e7160d8 Fix: allocate FP4 buffers as uint8 then view-cast 2026-05-17 21:25:04 +00:00
7256070dd3 FIX Bug 26: quantize slot tokens, not padded buffer
The runner was quantizing the padded_hidden (4096 rows) and then
taking x_sf[:num_slots] (first 48 rows). This only got scales for
expert 0 (the first 48 rows of the padded buffer), not the scales
for tokens scattered across padded positions (expert 1 at row 128, etc).

Fix: quantize slot_hidden (sorted tokens, num_slots rows) to get
correct per-token x_sf, then scatter x_fp4 into padded FP4 buffer
for the GEMM. The scale assembly now receives the correct x_sf.

Added hidden_fp4 and activated_fp4 padded buffers for FP4 scatter.
2026-05-17 21:24:43 +00:00
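A toy reproduction of Bug 26 in plain Python, under stated assumptions: per-row scales are modeled as the row's largest absolute value (a stand-in for the real FP4 scale computation), the alignment is 4 rather than 128, and all helper names are hypothetical. It shows why slicing the scales of the padded buffer by num_slots picks up pad rows and misses later experts' tokens.

```python
def row_scales(rows):
    """One scale per row: the row's largest absolute value (stand-in scaling)."""
    return [max((abs(v) for v in row), default=0.0) for row in rows]


def scatter_to_padded(slot_rows, tokens_per_expert, align):
    """Place each expert's sorted rows at an aligned offset; pad rows stay zero."""
    width = len(slot_rows[0])
    padded, src = [], 0
    for count in tokens_per_expert:
        padded_count = -(-count // align) * align  # round up to a multiple of align
        padded.extend(slot_rows[src:src + count])
        padded.extend([[0.0] * width for _ in range(padded_count - count)])
        src += count
    return padded


slot_hidden = [[1.0, -2.0], [3.0, 0.5], [-4.0, 1.0],  # expert 0 (3 tokens)
               [5.0, 0.0], [0.0, -6.0]]               # expert 1 (2 tokens)
tokens_per_expert = [3, 2]
num_slots = len(slot_hidden)

padded_hidden = scatter_to_padded(slot_hidden, tokens_per_expert, align=4)

# Buggy order: scale the padded buffer, then keep the first num_slots rows.
buggy_x_sf = row_scales(padded_hidden)[:num_slots]
# Fixed order: scale the sorted slot rows directly, one scale per real token.
fixed_x_sf = row_scales(slot_hidden)
```

In this toy case the buggy slice contains a zero scale from a pad row and drops the last token of expert 1, while the fixed version yields exactly one scale per real token.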
4d0b6d889d Set runner weights before _ensure_stacked 2026-05-17 21:22:50 +00:00
b7acac5e4e Call _ensure_stacked() before using runner buffers 2026-05-17 21:22:30 +00:00
1acf01fc1a Fix token_indices: repeat each token ID top_k times, not arange 2026-05-17 21:22:11 +00:00
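A small sketch of the token_indices fix, assuming each token occupies top_k consecutive slots before any per-expert sort. The function names are hypothetical. An arange over all slots produces slot IDs, which run past the last valid token ID; repeating each token ID top_k times keeps every index inside the token range.

```python
def token_indices_repeat(num_tokens, top_k):
    """Each token ID repeated top_k times: slot s maps back to token s // top_k."""
    return [t for t in range(num_tokens) for _ in range(top_k)]


def token_indices_arange(num_tokens, top_k):
    """The buggy variant: slot IDs, not token IDs."""
    return list(range(num_tokens * top_k))


good = token_indices_repeat(3, 2)
bad = token_indices_arange(3, 2)
```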
a478ca4746 Debug: trace runner logic step by step, test L1 GEMM 2026-05-17 21:21:45 +00:00
a100bd11c1 Simplify pipeline test: BF16 ref + bridge ref + full runner 2026-05-17 21:20:41 +00:00
6eade5e7f8 Fix: gs values are floats not tensors 2026-05-17 21:19:47 +00:00
b05a38a9bd Test stages 1-2 first: sort + L1 GEMM 2026-05-17 21:19:23 +00:00
9728604ea1 Pipeline test: stage-by-stage with BF16 reference comparison 2026-05-17 21:19:17 +00:00
7fff5fd39b Fix: correct intermediate_size=3072, weight key prefix, dequantize shapes 2026-05-17 21:18:20 +00:00
4ef345773d Rewrite pipeline test: load real weights, step-by-step vs BF16 reference 2026-05-17 21:17:18 +00:00
b43541afdd Fix test path setup 2026-05-17 21:00:00 +00:00
490ddfa294 Pipeline test: use synthetic weights at 256x512 (JIT at 7168x18432 hangs for hours) 2026-05-17 20:58:06 +00:00
c1bb551446 Fix weight loading: skip already-loaded experts correctly 2026-05-17 18:15:51 +00:00
955d7533f2 Use system Python for pipeline test (CuTeDSL in system site-packages) 2026-05-17 18:13:42 +00:00
925e390b93 Fix import: use direct import from vllm/ subdirectory 2026-05-17 18:12:53 +00:00
cd6144b832 Fix imports: all functions are in cutedsl.bridge, not separate modules 2026-05-17 18:11:03 +00:00
5e63a0d8a3 Rewrite pipeline test: use raw checkpoint weights, compare runner vs dynamic-gs reference 2026-05-17 18:10:05 +00:00
e51eafe288 Rewrite pipeline test: compare runner vs reference with real weights, step-by-step 2026-05-17 18:08:33 +00:00
e38d60a6e8 Add pipeline test with real model weights, add swiglu_limit to reference moe_pipeline 2026-05-17 18:07:44 +00:00
22e0370e6e Fix AttributeError: DeepseekV4MegaMoEExperts has no swiglu_limit
Get swiglu_limit from vllm_config.model_config.hf_config instead
of self (it was only set on the parent DeepseekV4MoE class).
2026-05-17 18:06:44 +00:00
6692166d0f Update CURRENT_BUG.md: Bug 25 (swiglu_limit), shared expert path verification, variable padded offsets 2026-05-17 17:56:04 +00:00
a10c582cf4 Add swiglu_limit=10.0 activation clamping (was missing)
DeepSeek-V4 uses SiluAndMulWithClamp(10.0) which clamps:
- silu(gate) to max 10.0
- up to [-10.0, 10.0]

Our runner was doing plain F.silu(gate) * up without clamping.
Large gate values could produce unbounded SiLU output, causing
numerical issues in the L2 GEMM. This is likely contributing to
garbage model output.
2026-05-17 17:52:16 +00:00
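A scalar sketch of the clamping described in this commit, following the commit's wording (clamp silu(gate) to at most the limit, clamp up to [-limit, limit]). The function names are hypothetical and this is not the SiluAndMulWithClamp implementation itself; the real code operates on tensors.

```python
import math


def silu(x):
    """Numerically stable SiLU: x * sigmoid(x)."""
    if x >= 0:
        return x / (1.0 + math.exp(-x))
    e = math.exp(x)
    return x * e / (1.0 + e)


def silu_and_mul_with_clamp(gate, up, limit=10.0):
    """Clamp silu(gate) to a maximum of limit and up to [-limit, limit], then multiply."""
    act = min(silu(gate), limit)
    up_clamped = max(-limit, min(up, limit))
    return act * up_clamped


def silu_and_mul_plain(gate, up):
    """The unclamped variant the runner used before this fix."""
    return silu(gate) * up
```

For gate=50.0 and up=3.0 the plain variant returns roughly 150, while the clamped variant returns 30.0, which is the kind of bound the L2 GEMM input needs.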
3f2f4e1882 Fix cudaErrorStreamCaptureUnsupported: no dynamic GPU-tensor slicing
Dynamic slicing with GPU scalars (e.g. buf[:gpu_scalar]) is a CUDA
operation not permitted during stream capture. Use full pre-allocated
buffers instead of dynamic slices. The GEMM only reads rows indicated
by expert_offsets, ignoring the zero padding.

Also pass x_sf[:num_slots] (Python int slicing, cudagraph-safe) to
scale assembly so it only processes real token scale data.
2026-05-17 17:24:26 +00:00
11b5aa5e37 Scale assembly: full-buffer swizzle, zero CPU syncs, no Python loops
Removed .cpu().tolist() and per-expert Python loops. Apply the
Blackwell 32_4_4 swizzle to the entire padded_x_sf buffer at once.
The buffer is already 128-row aligned (padded per expert) and 4-col
aligned, so the full-buffer swizzle produces the correct layout.

The GEMM reads scale_a using padded_expert_offsets, which matches
the scatter layout. Fully GPU, zero CPU syncs, cudagraph-safe.
2026-05-17 16:59:51 +00:00
94dec5922d Scale assembly Phase 2: use CPU-computed offsets for Python slicing
GPU scalars can't be used for Python indexing (requires sync).
Compute padded_expert_offsets on CPU via .cpu().tolist() for
the Python loop. This is OK for cudagraph: Python code only
runs during capture, not replay. The GPU kernel launches
recorded during capture are deterministic.
2026-05-17 16:56:52 +00:00
49c28e6562 Fix: use real padded expert offsets instead of fixed layout
Root cause of garbage output: fixed-layout padding with
max_chunks=ceil(avg) was too small for uneven expert assignment.
Tokens beyond max_chunks*128 per expert were silently dropped
(clamped_local overwrote the same row).

Fix: compute padded_expert_offsets from actual tokens_per_expert
(padded to 128). No clamping needed — each expert gets exactly
the space it needs. Pass padded_expert_offsets to scale assembly
and GEMM.
2026-05-17 16:55:47 +00:00
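The offset computation described in this commit can be sketched in plain Python as follows. The function name is hypothetical, and the real runner computes this on the GPU; the only details taken from the commit are that each expert's token count is padded to a multiple of 128 and that offsets are derived from the actual counts.

```python
def padded_expert_offsets(tokens_per_expert, align=128):
    """Cumulative start offsets where each expert's region is rounded up to align rows.

    Returns len(tokens_per_expert) + 1 entries; the last one is the total
    number of padded rows.
    """
    offsets = [0]
    for count in tokens_per_expert:
        padded = -(-count // align) * align  # ceil(count / align) * align
        offsets.append(offsets[-1] + padded)
    return offsets


offsets = padded_expert_offsets([48, 200, 0, 129])
```

Because every expert receives as many 128-row chunks as its own count requires, an expert with far more than the average number of tokens no longer overflows a fixed region.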
87a223f1ac Update CURRENT_BUG.md: current status, outstanding garbage output issue, hypotheses 2026-05-17 16:52:40 +00:00
c03438fc4e crap shoot 2026-05-17 16:25:38 +00:00
7c16f3cb46 Fix: init shared dict before using it, remove duplicate _output_buf 2026-05-17 16:06:58 +00:00
ea8acf9852 Share padded_x_sf and output buffers across layers to save ~300 MB
(Per-layer padded_xsf (2.4 MB) + output_buf (4.2 MB)) × 60 layers = ~400 MB.
Sharing reduces to ~3.6 MB total. Layers run sequentially during both
capture and replay.
2026-05-17 16:05:53 +00:00
3d0b1408b4 Update CURRENT_BUG.md: Bug 21 (shared buffers), clean up status 2026-05-17 15:52:06 +00:00
455ecb5631 Fix: define padded_max_slots before using it in shared buffer allocation 2026-05-17 15:47:38 +00:00