74 Commits

Author SHA1 Message Date
35fab6cff3 Replace autograd.Function with torch.library.custom_op for Dynamo compat
Dynamo (torch.compile fullgraph) cannot trace through CuTeDSL internals
(cute.compile, JIT, etc.). The autograd.Function approach was unreliable
with fullgraph mode — Dynamo would still try to trace through it.

Fix: torch.library.custom_op makes Dynamo treat our GEMM as an opaque
black box. No reimplementing the kernel — just route through the existing
runner via a registry pattern:
  - Runners registered in global dict with integer IDs
  - Custom op takes (tensors, runner_id, shape_hint) -> tensor
  - Dynamo calls fake impl for shape inference, never touches the runner
  - At execution time, real impl looks up runner and calls _run_impl

Changes:
  - New: cutedsl/custom_ops.py (custom op definitions + registry)
  - New: tests/test_custom_op.py (local unit tests, no GPU needed)
  - Removed: _Nvfp4LinearApply, _MoEApply (autograd.Function classes)
  - Updated: nvfp4_linear.py, runner.py, cutedsl.py, nvfp4_cutedsl.py
    to use custom ops instead of autograd.Function
  - Updated: cutedsl_quant_method.py to use custom op + registry
2026-05-19 01:54:48 +00:00
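
The registry idea in 35fab6cff3 can be sketched in plain Python. Custom op schemas accept tensors and primitive values such as integers, not arbitrary Python objects, which is presumably why the runner is passed by ID. The sketch shows only the ID registry and lookup. The torch.library.custom_op decoration and the fake impl are omitted, and the names register_runner, run_by_id and _RUNNERS are hypothetical, not taken from cutedsl/custom_ops.py.

```python
from typing import Callable, Dict, List

# Hypothetical runner type: a callable standing in for a CuTeDSL GEMM runner.
Runner = Callable[[List[float]], List[float]]

# Global registry: integer ID -> runner object.
_RUNNERS: Dict[int, Runner] = {}


def register_runner(runner: Runner) -> int:
    """Store the runner and return an integer handle for it."""
    runner_id = len(_RUNNERS)
    _RUNNERS[runner_id] = runner
    return runner_id


def run_by_id(x: List[float], runner_id: int) -> List[float]:
    """Stand-in for the real impl: look up the runner and call it."""
    return _RUNNERS[runner_id](x)


# Usage: register once at load time, pass only the integer at call time.
double_id = register_runner(lambda x: [2.0 * v for v in x])
result = run_by_id([1.0, 2.0], double_id)
```

Because only an integer crosses the op boundary, a tracer has no Python object to follow into the runner.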
48386e34ad Fix torch.compile: use custom autograd Function instead of @torch.compiler.disable
torch.compile fullgraph mode can't handle @torch.compiler.disable (skips
the function and refuses to compile). Custom autograd Functions are treated
as opaque ops by torch.compile — they execute eagerly without the compiler
trying to trace into CuTeDSL internals (JIT, Path.cwd, etc).
2026-05-18 21:38:28 +00:00
85e1cd3b69 Fix torch.compile crash: @torch.compiler.disable on all CuTeDSL run()
CuTeDSL internals (Path.cwd, threading, JIT) are incompatible with
torch.dynamo tracing. Marking run() as compiler-disabled makes the
runners opaque to torch.compile — they execute eagerly while the
rest of the model gets compiled.
2026-05-18 21:07:35 +00:00
a94011ec92 Fix torch.compile crash: remove threading.Lock from LUT cache path
The _NVFP4_STEP_LUT_LOCK caused 'Unsupported context manager' under
torch.compile/cudagraph. LUT is now pre-populated during warmup so
the fast path (cache hit) never hits a lock.

Also removed all init/warmup debug prints from CuTeDSL kernels.
2026-05-18 20:54:55 +00:00
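
A minimal sketch of the warmup pre-population described in a94011ec92, assuming the LUT is a dict keyed by step. The table contents and the names warmup_lut, lookup_lut and _build_entry are placeholders. The point is that the lock is taken only during warmup, so the lookup path contains no context manager.

```python
import threading
from typing import Dict, Iterable, List

_LUT: Dict[int, List[int]] = {}
_LUT_LOCK = threading.Lock()  # used only during warmup


def _build_entry(step: int) -> List[int]:
    # Placeholder payload; the real LUT contents are not shown in the log.
    return [step * i for i in range(4)]


def warmup_lut(steps: Iterable[int]) -> None:
    """Populate every entry up front, under the lock, outside any compiled region."""
    with _LUT_LOCK:
        for step in steps:
            if step not in _LUT:
                _LUT[step] = _build_entry(step)


def lookup_lut(step: int) -> List[int]:
    """Lock-free fast path: a plain dict read. Raises KeyError if warmup missed the key."""
    return _LUT[step]


warmup_lut([1, 2])
entry = lookup_lut(2)
```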
450793311c Wire CuTeDSL kernels into vLLM: replace all BF16 dequant with native NVFP4
- CuTeDSLNvfp4Method: custom quant method that creates CuTeDSL runners
  during process_weights_after_loading, then swaps to CuTeDSLNvfp4LinearMethod
  for forward dispatch
- Attention projections (fused_wqa_wkv, wq_b, wo_b) now route through
  CuTeDSLNvfp4Linear (cosine 0.992-0.996 vs BF16 reference)
- Shared expert now uses CuTeDSLSharedExpertRunner (cosine 0.992 vs BF16)
  with monkey-patched forward for fused L1+SiLU+L2 pipeline
- Deleted all BF16 dequant code (_dequant_nvfp4_to_bf16, _post_quant_fix,
  input_scale fixes)
- Deleted _post_quant_fix hook from utils.py
- Fixed SwiGLU clamp: gate clamped BEFORE SiLU (matching SiluAndMulWithClamp)
- Cleaned up all debug prints
- Updated Dockerfile with new kernel files
2026-05-18 20:27:42 +00:00
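
The SwiGLU clamp ordering fixed in 450793311c can be illustrated on scalars. This is a sketch under assumptions: the gate gets an upper bound only (consistent with "to max 10.0" in a10c582cf4), up is clamped to [-limit, limit], and swiglu_clamped is a hypothetical name, not the vLLM SiluAndMulWithClamp implementation.

```python
import math


def silu(x: float) -> float:
    """Numerically stable SiLU: x * sigmoid(x)."""
    if x >= 0.0:
        return x / (1.0 + math.exp(-x))
    e = math.exp(x)  # safe for large negative x (underflows to 0.0)
    return x * e / (1.0 + e)


def swiglu_clamped(gate: float, up: float, limit: float = 10.0) -> float:
    """Clamp the gate BEFORE SiLU, clamp up to [-limit, limit], then multiply."""
    gate = min(gate, limit)            # assumed: upper bound only
    up = max(-limit, min(up, limit))
    return silu(gate) * up


bounded = swiglu_clamped(100.0, 100.0)   # both inputs hit the limit
unclamped = swiglu_clamped(1.0, 2.0)     # inside the limits, plain SwiGLU
```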
87582fc9f7 HOTFIX: remove NaN checks from run() — torch.isnan().any() does CPU-GPU sync, breaks cudagraph 2026-05-17 22:28:32 +00:00
8717e0e411 Fix warmup: use same padded GEMM path as run(), add swiglu_limit clamping 2026-05-17 22:03:48 +00:00
d332f4f900 Add NaN debug checks after L1 and L2 GEMM 2026-05-17 22:02:24 +00:00
2796bd81e8 Fix: scatter FP4 as uint8 (float4 doesn't support index_put) 2026-05-17 21:28:04 +00:00
364f8372bb Fix FP4 buffer shapes: D//2 for packed dimensions 2026-05-17 21:26:46 +00:00
803e7160d8 Fix: allocate FP4 buffers as uint8 then view-cast 2026-05-17 21:25:04 +00:00
7256070dd3 FIX Bug 26: quantize slot tokens, not padded buffer
The runner was quantizing the padded_hidden (4096 rows) and then
taking x_sf[:num_slots] (first 48 rows). This only got scales for
expert 0 (the first 48 rows of the padded buffer), not the scales
for tokens scattered across padded positions (expert 1 at row 128, etc).

Fix: quantize slot_hidden (sorted tokens, num_slots rows) to get
correct per-token x_sf, then scatter x_fp4 into padded FP4 buffer
for the GEMM. The scale assembly now receives the correct x_sf.

Added hidden_fp4 and activated_fp4 padded buffers for FP4 scatter.
2026-05-17 21:24:43 +00:00
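
Bug 26 can be reproduced on a toy example. The sketch uses a per-row max-abs value as a stand-in for the NVFP4 scale factor (an assumption made for brevity; the real quantizer is not shown in the log), and all function names are hypothetical.

```python
from typing import List

Row = List[float]


def row_scale(row: Row) -> float:
    # Stand-in for the per-token scale factor.
    return max(abs(v) for v in row)


def scales_via_padded(slot_hidden: List[Row], dest_rows: List[int],
                      padded_rows: int) -> List[float]:
    """Buggy path: scatter first, quantize the padded buffer, take the first num_slots rows."""
    width = len(slot_hidden[0])
    padded = [[0.0] * width for _ in range(padded_rows)]
    for row, dest in zip(slot_hidden, dest_rows):
        padded[dest] = row
    all_scales = [row_scale(r) for r in padded]
    return all_scales[:len(slot_hidden)]


def scales_via_slots(slot_hidden: List[Row]) -> List[float]:
    """Fixed path: quantize the sorted slot tokens directly."""
    return [row_scale(r) for r in slot_hidden]


# Two tokens for expert 0 (rows 0-1) and one for expert 1 (row 128).
slot_hidden = [[1.0, -2.0], [3.0, 0.5], [-4.0, 1.0]]
dest_rows = [0, 1, 128]

buggy = scales_via_padded(slot_hidden, dest_rows, padded_rows=256)
fixed = scales_via_slots(slot_hidden)
```

In the buggy result the third scale comes from an empty padding row, so the token routed to expert 1 loses its scale.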
a10c582cf4 Add swiglu_limit=10.0 activation clamping (was missing)
DeepSeek-V4 uses SiluAndMulWithClamp(10.0) which clamps:
- silu(gate) to max 10.0
- up to [-10.0, 10.0]

Our runner was doing plain F.silu(gate) * up without clamping.
Large gate values could produce unbounded SiLU output, causing
numerical issues in the L2 GEMM. This is likely contributing to
garbage model output.
2026-05-17 17:52:16 +00:00
3f2f4e1882 Fix cudaErrorStreamCaptureUnsupported: no dynamic GPU-tensor slicing
Dynamic slicing with GPU scalars (e.g. buf[:gpu_scalar]) is a CUDA
operation not permitted during stream capture. Use full pre-allocated
buffers instead of dynamic slices. The GEMM only reads rows indicated
by expert_offsets, ignoring the zero padding.

Also pass x_sf[:num_slots] (Python int slicing, cudagraph-safe) to
scale assembly so it only processes real token scale data.
2026-05-17 17:24:26 +00:00
11b5aa5e37 Scale assembly: full-buffer swizzle, zero CPU syncs, no Python loops
Removed .cpu().tolist() and per-expert Python loops. Apply the
Blackwell 32_4_4 swizzle to the entire padded_x_sf buffer at once.
The buffer is already 128-row aligned (padded per expert) and 4-col
aligned, so the full-buffer swizzle produces the correct layout.

The GEMM reads scale_a using padded_expert_offsets, which matches
the scatter layout. Fully GPU, zero CPU syncs, cudagraph-safe.
2026-05-17 16:59:51 +00:00
94dec5922d Scale assembly Phase 2: use CPU-computed offsets for Python slicing
GPU scalars can't be used for Python indexing (requires sync).
Compute padded_expert_offsets on CPU via .cpu().tolist() for
the Python loop. This is OK for cudagraph: Python code only
runs during capture, not replay. The GPU kernel launches
recorded during capture are deterministic.
2026-05-17 16:56:52 +00:00
49c28e6562 Fix: use real padded expert offsets instead of fixed layout
Root cause of garbage output: fixed-layout padding with
max_chunks=ceil(avg) was too small for uneven expert assignment.
Tokens beyond max_chunks*128 per expert were silently dropped
(clamped_local overwrote the same row).

Fix: compute padded_expert_offsets from actual tokens_per_expert
(padded to 128). No clamping needed — each expert gets exactly
the space it needs. Pass padded_expert_offsets to scale assembly
and GEMM.
2026-05-17 16:55:47 +00:00
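
The token loss described in 49c28e6562 can be shown with plain integers. The sketch assumes max_chunks is the ceiling of the average tokens per expert divided by 128, and that rows past the fixed capacity are clamped onto the last row. Function names are hypothetical.

```python
from typing import List


def ceil_div(a: int, b: int) -> int:
    return -(-a // b)


def fixed_layout_rows(tokens_per_expert: List[int], max_chunks: int,
                      align: int = 128) -> List[int]:
    """Old scheme: every expert owns max_chunks*align rows; overflow is clamped."""
    cap = max_chunks * align
    rows = []
    for e, n in enumerate(tokens_per_expert):
        for r in range(n):
            rows.append(e * cap + min(r, cap - 1))
    return rows


def padded_expert_offsets(tokens_per_expert: List[int], align: int = 128) -> List[int]:
    """New scheme: prefix sum of per-expert counts rounded up to a multiple of align."""
    offsets = [0]
    for n in tokens_per_expert:
        offsets.append(offsets[-1] + ceil_div(n, align) * align)
    return offsets


def real_offset_rows(tokens_per_expert: List[int], align: int = 128) -> List[int]:
    offsets = padded_expert_offsets(tokens_per_expert, align)
    return [offsets[e] + r for e, n in enumerate(tokens_per_expert) for r in range(n)]


tokens_per_expert = [300, 4]                       # uneven assignment
avg = sum(tokens_per_expert) // len(tokens_per_expert)
max_chunks = ceil_div(avg, 128)                    # 2 chunks, 256 rows per expert

old_rows = fixed_layout_rows(tokens_per_expert, max_chunks)
new_rows = real_offset_rows(tokens_per_expert)
dropped_old = len(old_rows) - len(set(old_rows))   # tokens overwritten by clamping
dropped_new = len(new_rows) - len(set(new_rows))
```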
7c16f3cb46 Fix: init shared dict before using it, remove duplicate _output_buf 2026-05-17 16:06:58 +00:00
ea8acf9852 Share padded_x_sf and output buffers across layers to save ~300 MB
Per-layer: (padded_x_sf 2.4 MB + output_buf 4.2 MB) × 60 layers = ~400 MB.
Sharing reduces to ~3.6 MB total. Layers run sequentially during both
capture and replay.
2026-05-17 16:05:53 +00:00
455ecb5631 Fix: define padded_max_slots before using it in shared buffer allocation 2026-05-17 15:47:38 +00:00
b1ac74bb4d Fix shape mismatch: shared padded buffers, revert max_num_tokens cap
Root cause: capping max_num_tokens to 512 made buffers too small for the
actual 8192-token warmup. slot_hidden had 49152 rows but padded_hidden
only had 6144.

Fix: Revert the 512 cap. Use SHARED padded buffers (not per-layer) to
avoid OOM. Only 72 MB total (not 4.3 GB) since layers run sequentially
and reuse the same buffer. Cudagraph-safe since capture and replay both
run layers sequentially on the same tensor.
2026-05-17 15:47:10 +00:00
faf7c8cc51 Debug: print runner max_num_tokens and max_chunks 2026-05-17 15:18:07 +00:00
c5af1aba6b Fix OOB: size padded buffers for num_experts*max_chunks*128
padded_max_slots was computed from max_tokens*top_k (3072) but
total_padded_slots in run() is num_experts*max_chunks*128 (6144).
The buffer was too small, causing index out of bounds.
2026-05-17 14:59:45 +00:00
8ac8e20fa9 Fix OOM: cap buffer pre-allocation at cudagraph max capture size
padded_hidden/activated buffers were sized for max_num_tokens=8192,
which is 72 MB per layer × 60 layers = 4.3 GB → OOM with 178 GB GPUs
(almost full from model + KV cache).

Now cap at max cudagraph capture size (512 tokens). Eager-mode runs
with >512 tokens will need dynamic allocation, but vLLM always uses
cudagraph for inference after warmup.
2026-05-17 14:14:13 +00:00
5bb78564f5 Remove dynamic tensor allocation in scale assembly (cudagraph fix)
Removed torch.zeros() call that created padded_expert_offsets during
scale assembly. Now uses fixed layout computed from Python constants.
Also removed dead reference to padded_expert_offsets variable.
2026-05-17 14:01:32 +00:00
8c31e78359 Fix cudagraph: fully fixed-layout per-expert sections, no GPU scalars in Python control flow
- Each expert gets max_chunks*128 rows at fixed offsets (e*max_chunks*128)
- Phase 1 scatters into fixed offsets with clamped local_row
- Phase 2 reads from fixed offsets (pure Python arithmetic, no GPU sync)
- padded_x_sf_buf sized for num_experts * max_chunks * 128
- padded_expert_offsets pre-computed in _allocate_buffers
2026-05-17 13:58:58 +00:00
ff74b33d2c Fix cudagraph: static loop for per-expert scale swizzle
The while loop had variable trip count (GPU scalar in condition),
requiring CPU-GPU sync. Replaced with fixed max_chunks_per_expert
iterations. Unused chunks are zero buffers (harmless for GEMM).
2026-05-17 13:56:52 +00:00
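
The static-loop change in ff74b33d2c can be sketched as follows: the trip count is a Python constant known before the loop starts, and chunks past the real data are zero-filled. The function name and the list-based data are illustrative only.

```python
from typing import List


def fixed_chunks(rows: List[float], max_chunks: int, chunk: int = 128) -> List[List[float]]:
    """Split rows into exactly max_chunks chunks of `chunk` entries, zero-padding the rest."""
    out = []
    for c in range(max_chunks):  # fixed trip count, independent of the data
        part = rows[c * chunk:(c + 1) * chunk]
        out.append(part + [0.0] * (chunk - len(part)))
    return out


# An expert with 130 rows of scale data and room for 3 chunks.
chunks = fixed_chunks([1.0] * 130, max_chunks=3)
```

Here the second chunk holds 2 real rows and 126 zeros, and the third chunk is all zeros.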
bf22b6f0e4 Fix scale assembly: variable-size per-expert padding matching GEMM offsets
- Compute padded_expert_offsets from real expert_offsets (ceil to 128)
- Scatter x_sf into padded positions matching those offsets
- Per-expert swizzle in 128-row chunks (supports >128 tokens per expert)
- Pad slot_hidden/activated using same padded offsets for GEMM input
- Pre-allocated buffers sized for max_tokens*top_k (not num_experts*128)
2026-05-17 13:55:10 +00:00
bde81b95f4 Fix GEMM scale layout: pad to 128 tokens per expert
Root cause of garbage output: the GEMM reads scale_a according to
expert_offsets (e.g. [0, 500, 1024, ...]) but scale_a had data at
fixed e*128 offsets. When expert 0 has 500 tokens, the GEMM reads
scale_a[0:500] but only rows 0-127 had valid data.

Fix: pad slot_hidden to num_experts*128 rows (128 per expert) and
pass padded_expert_offsets=[0, 128, 256, ...] to the GEMM. Scale
assembly's fixed 128-row layout now matches the GEMM's expectations.
Padding tokens' GEMM output is discarded (scatter_add only uses
sorted_token_ids for real tokens).
2026-05-17 13:19:31 +00:00
7e692c3aec Fix cudaErrorStreamCaptureUnsupported: pre-allocate all tensors used during capture
torch.full(), torch.zeros(), torch.arange() allocate new tensors during
cudagraph capture, which triggers cudaErrorStreamCaptureUnsupported.

Pre-allocate:
- _l1_gsa_buf / _l2_gsa_buf (use .fill_() instead of torch.full)
- _output_buf (use .zero_() on pre-allocated slice)
- _row_indices_buf (pre-allocated arange, sliced during use)
2026-05-17 12:31:25 +00:00
b531a98f8f Fix scale assembly: per-expert 128-row fixed slots, no dynamic sizing
- Reverted from full-buffer swizzle to per-expert 128-row slots
- Scatter into e*128 fixed positions (cudagraph-compatible, fixed shape)
- Clamp local_row to 127 for experts with >128 tokens (GEMM uses expert_offsets)
- Buffer sized for num_experts*128 rows (not max_tokens*top_k)
- Add _warmup_done guard to only run warmup once (not 60x)
2026-05-17 11:10:59 +00:00
4445882ba7 Fix: return 2D scale tensor for GEMM (shape[1] access) 2026-05-17 09:59:57 +00:00
3cd910193c Rewrite scale assembly: no .item() calls, no Python loops, fully GPU
Apply to_blocked swizzle on entire padded buffer at once instead of
per-expert loops. No .item()/.cpu() calls. Fully cudagraph-safe.
2026-05-17 09:59:12 +00:00
4f6217acb9 Fix padded_cols calculation in scale assembly 2026-05-17 09:58:09 +00:00
918aa8aede Fix scale assembly output shape: reshape to 2D for GEMM 2026-05-17 09:57:27 +00:00
d9bae6d770 Fix OOB in scale assembly: size padded_x_sf for max tokens, fix top_k/max_num_tokens passing, support variable-size expert blocks
Bug 9: padded_x_sf was sized for num_experts*128 rows, but with 8192 tokens
and top_k=6, the actual padded row count can exceed 6144. Also:
- Pass top_k and max_num_tokens from deepseek_v4.py (was defaulting to 8/8192)
- Phase 2 of scale assembly now handles experts with >128 tokens (multiple 128-row chunks)
- Remove debug prints
2026-05-17 09:56:28 +00:00
55ac60eb91 Add detailed debug prints for OOB investigation 2026-05-17 09:39:42 +00:00
fed3c417ba Add debug OOB check for sorted_token_ids 2026-05-17 09:19:10 +00:00
ca3cba5bbd Fix global→local expert ID remapping for EP and remove .cpu() sync
Root cause of CUDA_ERROR_ASSERT index out of bounds:
- topk_ids contains GLOBAL expert IDs (0-255) but runner treated them
  as local IDs (0-31 with EP=8). Tokens for non-local experts got
  wrong expert assignments, causing out-of-bounds scatter indices
  in _assemble_scales_cudagraph_safe.

Fixes:
1. Add experts_start_idx param to CuTeDSLMoERunner
2. In run(), remap global→local IDs and zero weights for non-local experts
3. Move _token_indices from CPU to GPU (remove sort_idx.cpu() sync)
4. Add _fill_token_indices() and _needs_token_refill to handle CuTeDSL
   JIT GPU memory corruption (refill after first GEMM call)
2026-05-17 08:58:43 +00:00
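
The global→local remap from ca3cba5bbd, sketched on plain lists. Mapping non-local tokens to local ID 0 with weight 0.0 is an assumption about how "zero weights for non-local experts" is realized, and remap_to_local is a hypothetical name.

```python
from typing import List, Tuple


def remap_to_local(topk_ids: List[int], topk_weights: List[float],
                   experts_start_idx: int,
                   num_local_experts: int) -> Tuple[List[int], List[float]]:
    """Convert global expert IDs to local ones; neutralize experts owned by other ranks."""
    local_ids, local_weights = [], []
    for gid, w in zip(topk_ids, topk_weights):
        lid = gid - experts_start_idx
        if 0 <= lid < num_local_experts:
            local_ids.append(lid)
            local_weights.append(w)
        else:
            local_ids.append(0)        # any in-range index keeps scatter in bounds
            local_weights.append(0.0)  # zero weight removes its contribution
    return local_ids, local_weights


# 256 global experts, EP=8: rank 1 owns global IDs 32..63.
ids, weights = remap_to_local([32, 63, 0, 255], [0.5, 0.25, 0.125, 0.125],
                              experts_start_idx=32, num_local_experts=32)
```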
1330e2b2cf cleanup: remove debug prints, ready for testing
Current state:
- Token indices on CPU (avoids CuTeDSL GPU memory corruption)
- Scale assembly uses per-expert swizzle + scatter (matches reference)
- compute_activation_global_scales warmup gets ~0.97 cosine
- expert_offsets passed without leading 0 (matches pipeline)
- layertest + cudagraph_test pass
2026-05-17 08:30:41 +00:00
d635dcbbb6 fix: keep token_indices on CPU, index with CPU sort_idx
CuTeDSL's cute.compile corrupts GPU memory during JIT compilation.
Keeping token_indices on CPU and using sort_idx.cpu() for indexing
avoids the corruption. The .to(device) call after indexing moves the
result back to GPU for the hidden_states indexing.
2026-05-17 08:29:18 +00:00
235d5b314f fix: fallback token indices allocation with verify+rebuild 2026-05-17 08:27:47 +00:00
dd0b3fd4f9 debug: print sorted_token_ids in warmup 2026-05-17 08:25:25 +00:00
04999d86cf fix: add quantize_to_nvfp4 import 2026-05-17 08:24:57 +00:00
7073daaffa fix: allocate token_indices on CPU, move to GPU AFTER JIT compilation
CuTeDSL's cute.compile corrupts GPU memory during JIT compilation.
Tensors allocated on GPU before/during compilation get zeroed.
Fix: create token_indices on CPU, then .to(device) after JIT is done.
2026-05-17 08:22:51 +00:00
0e7b06b55c debug: clone + sync token indices before JIT 2026-05-17 08:22:11 +00:00
70c0618361 fix: allocate token_indices before CuTeDSL JIT compilation
CuTeDSL's cute.compile appears to corrupt GPU memory state,
causing torch.arange to produce zero-filled tensors when allocated
after the JIT compilation. Moving token_indices allocation before
the weight stacking operations fixes the corruption.
2026-05-17 08:20:41 +00:00
2bbe04efd8 debug: remove assert, test token corruption 2026-05-17 08:19:45 +00:00
66627926c5 debug: int32 token indices with sync verify 2026-05-17 08:18:37 +00:00
da02a5dc11 debug: assert token indices are correct after allocation 2026-05-17 08:16:09 +00:00