Commit Graph

161 Commits

Author SHA1 Message Date
af087e655e docs: update README — vLLM cudagraph inference running, output quality in progress 2026-05-16 21:40:59 +00:00
0a5cfe0433 add kernel compile caching — compile once, invoke on subsequent calls
First call: cute.compile() with real tensors (warmup).
Subsequent calls: just invoke compiled() with new CuTe views.
No cute.compile() in the forward path = cudagraph-safe.
2026-05-16 20:45:46 +00:00
3465b9d471 remove torch.cuda.synchronize() from run_nvfp4_grouped_gemm (cudagraph-safe) 2026-05-16 20:42:49 +00:00
5e245bc0c6 fix: missing newline 2026-05-16 20:40:18 +00:00
288e179f88 add quantize_activation_nvfp4 (cudagraph-safe, fixed global scale) 2026-05-16 20:39:37 +00:00
521e11e468 test: old bridge + LUT quantization only (step 1 of cudagraph migration) 2026-05-16 20:37:42 +00:00
f51be76e8f temp: restore EXACT old bridge.py from b685112 2026-05-16 20:34:45 +00:00
58dc36e21c fix: compile fresh each call — cached compile produces wrong TMA descriptors
The CuTeDSL kernel's TMA descriptors are bound to the
compilation-time tensor addresses. Caching the compiled kernel
and reusing it with different tensor allocations produces wrong
memory access patterns (cosine 0.5 instead of 0.99).

Fresh compilation is proven correct (cosine 0.989). We can
optimize later with proper TMA descriptor reinitialization.
2026-05-16 20:28:15 +00:00
98cc6ac1f3 fix: invert cache check logic (compile when NOT in cache) 2026-05-16 20:25:16 +00:00
e337ec86a3 debug: test with cache enabled 2026-05-16 20:24:04 +00:00
bc56452be8 debug: disable kernel cache to test fresh compilation 2026-05-16 20:22:51 +00:00
647c03b2ee fix: make_b_k_major must preserve shape — use double-permute trick
permute(K,N).contiguous().permute(K,N) gives same (E,K,N) shape
but with K-contiguous memory. Single permute changes the shape.
2026-05-16 20:19:21 +00:00
ed4f501bba fix: make_b_k_major stride check — K-major means stride[1]==1, not stride[2]==1
For (E, K, N): stride[2]==1 is N-major (columns contiguous).
K-major requires stride[1]==1 (rows contiguous).
2026-05-16 20:18:18 +00:00
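
The stride check and the double-permute trick from commits ed4f501bba and 647c03b2ee can be reproduced on a small CPU tensor. This is a sketch under one assumption: "permute(K,N)" in the commit message is read as swapping dims 1 and 2 of an (E, K, N) tensor. The function body is illustrative, not the repository's `make_b_k_major`.

```python
import torch

E, K, N = 2, 4, 6
b = torch.arange(E * K * N, dtype=torch.float32).reshape(E, K, N)
# A freshly created (E, K, N) tensor is N-major: stride[2] == 1.
assert b.stride() == (K * N, N, 1)


def make_b_k_major(t: torch.Tensor) -> torch.Tensor:
    """Return a tensor with the same (E, K, N) shape but K contiguous."""
    if t.stride(1) == 1:          # already K-major
        return t
    # Swap K and N, materialize that layout, then swap back.
    # The shape returns to (E, K, N) while memory stays K-contiguous.
    return t.permute(0, 2, 1).contiguous().permute(0, 2, 1)


b_k = make_b_k_major(b)
```

A single `permute(0, 2, 1).contiguous()` would also give K-contiguous memory, but with shape (E, N, K), which is the shape change the commit message warns about.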
2162cee4ad fix: restore proper quantize_weight_to_nvfp4 — K is the packed dim, not N
quantize_to_nvfp4() only packs the last dimension, but for weight
matrices (K, N), K is the packed dimension. The weight quantizer
reshapes to (k_blocks, block_size, N) and computes block scales along
the K block dimension. This was accidentally replaced with a simple
delegation to quantize_to_nvfp4, producing wrong tensor shapes.
2026-05-16 20:16:28 +00:00
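
The reshape described in commit 2162cee4ad can be sketched as follows. Assumptions, none of which come from the repository: a block size of 16, a divisor of 6.0 (the largest E2M1 magnitude), and the helper name `weight_block_scales`. The global scale, the FP8 cast of the scales and the actual 4-bit packing are left out.

```python
import torch

BLOCK_SIZE = 16   # assumed NVFP4 block length
E2M1_MAX = 6.0    # largest magnitude representable in FP4 E2M1


def weight_block_scales(w: torch.Tensor) -> torch.Tensor:
    """Per-block scales for a (K, N) weight, with blocks taken along K."""
    K, N = w.shape
    assert K % BLOCK_SIZE == 0, "sketch assumes K is a multiple of the block size"
    # Row k of w lands in block k // BLOCK_SIZE, position k % BLOCK_SIZE.
    blocks = w.reshape(K // BLOCK_SIZE, BLOCK_SIZE, N)
    amax = blocks.abs().amax(dim=1)          # shape (k_blocks, N)
    return amax / E2M1_MAX


torch.manual_seed(0)
w = torch.randn(64, 8)
scales = weight_block_scales(w)
```

Quantizing along the last dimension instead would produce scales of shape (K, N // 16), which is the kind of shape mismatch the commit describes.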
10f1dca982 fix: import ceil_div from correct module 2026-05-16 20:09:02 +00:00
81632e2f21 fix: correct cutlass_torch import (cutlass.torch, not top-level) 2026-05-16 20:08:21 +00:00
16c4fad025 fix: remove cutlass.cute.backend import 2026-05-16 20:06:38 +00:00
44b40d41fe fix: compile CuTeDSL kernel with real tensors, not dummy shapes
The kernel's TMA descriptors are sized from compilation-time shapes.
Dummy 256x256 caused wrong memory access for real 3584x6144 data.
Now compiles with actual runtime tensors on first use, cached by
(num_experts, K, N). Compilation happens once during warmup.
Forward call remains cudagraph-safe.
2026-05-16 20:05:59 +00:00
79281b6fda fix: compute K_packed/N_packed before passing to _get_compiled_kernel 2026-05-16 20:00:35 +00:00
caf93d6c45 fix: pass K_packed/N_packed to _get_compiled_kernel 2026-05-16 19:59:43 +00:00
ecc7b83334 fix: compile CuTeDSL kernel with actual tensor shapes, not dummy 256x256
The compiled kernel's TMA descriptors are sized based on compilation
shapes. Using dummy 256x256 shapes caused wrong memory access patterns
for the real 3584x6144 data. Now uses actual K_packed and N_packed
from the runtime tensors.
2026-05-16 19:58:13 +00:00
cc75a55bd9 restore: new bridge/moe_pipeline/layertest 2026-05-16 19:55:19 +00:00
0c878b3a9e temp: restore old layertest+bridge for cosine comparison 2026-05-16 19:54:04 +00:00
0069769d12 debug: print global scales 2026-05-16 19:38:31 +00:00
84589fe984 debug: more prints 2026-05-16 19:31:54 +00:00
fa2d5708c5 debug: add L1 GEMM and SiLU output debug prints 2026-05-16 19:29:42 +00:00
4c06c51ec3 fix: moe_pipeline.py gate/up split — L1 output is 2*intermediate, not intermediate 2026-05-16 19:28:15 +00:00
da31ce7e1a allow for cuda graphs again 2026-05-16 19:23:41 +00:00
d15c43294b fix: test L2 weight N dim should be hidden_size, not hidden_size//2 2026-05-16 19:07:36 +00:00
28788c6f55 fix: L1 weight N dimension is 2*intermediate (gate+up), not intermediate
float4_e2m1fn_x2 packs 2 values per byte along K, not N.
The GEMM output N dimension is the logical N from mat_b.shape[2],
not 2x packed. Previous n_dim*2 was wrong — it accidentally worked
in the test because intermediate_size*2 == 2*intermediate_size.
Real model with N=9216 exposed the bug.
2026-05-16 19:07:08 +00:00
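
The shape bookkeeping behind commits 28788c6f55 and 4c06c51ec3 can be checked with unquantized tensors. This is a sketch with made-up sizes; the gate-first ordering of the two halves is an assumption, and the plain matmul stands in for the grouped FP4 GEMM.

```python
import torch
import torch.nn.functional as F

M, K, intermediate = 4, 32, 8
N = 2 * intermediate          # L1 weight holds gate and up side by side

x = torch.randn(M, K)
w1 = torch.randn(K, N)

# Packing two 4-bit values per byte halves the stored K extent only.
K_packed = K // 2

out = x @ w1                  # logical output is (M, N), not (M, 2 * N)

# Split the L1 output into its two halves (gate first is assumed here).
gate, up = out.split(intermediate, dim=-1)
act = F.silu(gate) * up       # (M, intermediate), the input to L2
```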
f7e29fdf1e docs: update README with cudagraph compatibility work and decisions 2026-05-16 18:55:47 +00:00
103fd451ce fix: use full padded_scales_buf (no GPU scalar slicing in cudagraph)
buf[:gpu_scalar, :] triggers cudaErrorStreamCaptureInvalidated.
Always use the full pre-allocated buffer; extra rows are zeros.
2026-05-16 18:50:35 +00:00
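
One way to avoid `buf[:gpu_scalar, :]`, as commit 103fd451ce describes, is to keep the buffer at its full size and zero the unused rows with a mask. The sketch below runs on CPU and only shows the indexing pattern; `fill_scales` is a hypothetical helper, and whether each op is capture-safe on a given PyTorch version would need to be verified separately.

```python
import torch

MAX_ROWS, COLS = 8, 4
padded_scales_buf = torch.zeros(MAX_ROWS, COLS)   # allocated once, up front


def fill_scales(buf: torch.Tensor, scales: torch.Tensor,
                num_valid: torch.Tensor) -> torch.Tensor:
    """Write scales into a fixed-size buffer; rows >= num_valid become zero.

    num_valid is a 0-d tensor and is never converted to a Python int.
    """
    row_ids = torch.arange(buf.shape[0], device=buf.device)
    mask = (row_ids < num_valid).unsqueeze(1)      # (MAX_ROWS, 1)
    buf.copy_(torch.where(mask, scales, torch.zeros_like(scales)))
    return buf    # always the full buffer, never buf[:num_valid]


scales = torch.ones(MAX_ROWS, COLS)
num_valid = torch.tensor(5)
full = fill_scales(padded_scales_buf, scales, num_valid)
```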
2f68c7ba77 fix: cache E2M1 step_to_idx LUT per device (no CPU->CUDA copy in forward)
torch.tensor() and new_tensor() both trigger CPU->CUDA copies during
cudagraph capture. Pre-cache the LUT on first use per device.
2026-05-16 18:48:31 +00:00
6c298be842 fix: use new_tensor instead of torch.tensor for cudagraph (no CPU→CUDA copy)
torch.tensor() creates on CPU then copies to CUDA, which is forbidden
during cudagraph capture. new_tensor() creates directly on the
source tensor's device.
2026-05-16 18:47:39 +00:00
53c25bee0b rewrite: cudagraph-safe runner - no dynamic slicing, no GPU scalar indices
- Removed all [:total_slots] dynamic slicing with GPU scalars
- slot_hidden gathers from hidden_states directly using sorted_token_ids
- scatter_add uses full sorted_token_ids (padding slots have zero weight)
- _assemble_scales_cudagraph_safe returns 2D via padded_scales.shape[0]
- Fixed padded_scales_buf allocation via float16->float8 cast
- GEMM output size: n_dim * 2 for float4_e2m1fn_x2 packed format
2026-05-16 18:44:25 +00:00
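
The gather and scatter_add steps listed in commit 53c25bee0b rely on padding slots carrying zero weight. A minimal CPU sketch, assuming padding slots point at token 0 and using the identity in place of the expert computation:

```python
import torch

torch.manual_seed(0)
num_tokens, hidden = 3, 4
hidden_states = torch.randn(num_tokens, hidden)

# Five fixed slots; the last two are padding that points at token 0.
sorted_token_ids = torch.tensor([0, 2, 1, 0, 0])
slot_weights = torch.tensor([1.0, 1.0, 1.0, 0.0, 0.0])

# Gather every slot, padding included, so no dynamic slice is needed.
slot_hidden = hidden_states[sorted_token_ids]          # (5, hidden)
weighted = slot_hidden * slot_weights.unsqueeze(1)     # padding rows become 0

# Scatter all slots back; zero rows add nothing to token 0.
out = torch.zeros_like(hidden_states)
index = sorted_token_ids.unsqueeze(1).expand(-1, hidden)
out.scatter_add_(0, index, weighted)
```

Every tensor here has a size fixed by the slot count, so nothing depends on how many slots are actually in use.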
4300775bfe fix: remove .item() sync in scale reshape — use padded_scales.shape[0] instead 2026-05-16 18:29:12 +00:00
5a79065b2b fix: GEMM output should be 2x packed N (float4_e2m1fn_x2 packs 2 per element) 2026-05-16 18:27:44 +00:00
95a1345b92 fix: return 2D scale tensor from _assemble_scales_cudagraph_safe 2026-05-16 18:26:57 +00:00
533089c9d2 fix: token_indices slice bug + torch.zeros for float4/float8 dtypes 2026-05-16 18:21:27 +00:00
54c470e535 fix: use float16->float8 cast for rand_sf (torch.rand doesn't support float8) 2026-05-16 18:13:14 +00:00
f2de95c526 fix: use randint for float4 dummy weights in cudagraph test 2026-05-16 18:08:45 +00:00
f66d4b69a4 GPU-only scale assembly + cudagraph test harness
- assemble_activation_scales_gpu: builds padded+swizzled scale tensor
  without .item() or .tolist() CPU syncs. Uses GPU index arange + cat
  + single scatter instead of per-expert Python slicing.
- Still has a for e in range(num_experts) loop but num_experts is
  compile-time constant so torch.compile unrolls it.
- Added tests/cudagraph_test.py: attempts CUDA graph capture on the
  MoE runner, diagnoses sync violations with patched torch functions.
- Removed the if total_slots == 0 early return (Python control flow
  on GPU data)
2026-05-16 18:05:13 +00:00
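
The idea of diagnosing sync violations with patched torch functions, mentioned in commit f66d4b69a4, can be sketched by wrapping `torch.Tensor.item`. This is not the contents of tests/cudagraph_test.py; `record_item_calls` and `runner_with_sync` are hypothetical names, and only `.item()` is traced.

```python
import contextlib
from unittest import mock

import torch


@contextlib.contextmanager
def record_item_calls():
    """Record the shape of every tensor on which .item() is called."""
    calls = []
    original_item = torch.Tensor.item

    def traced_item(self):
        calls.append(tuple(self.shape))
        return original_item(self)

    # mock.patch.object restores the original method on exit.
    with mock.patch.object(torch.Tensor, "item", traced_item):
        yield calls


def runner_with_sync(x: torch.Tensor) -> torch.Tensor:
    # Deliberate violation: .item() forces a device-to-host sync on CUDA.
    n = x.sum().item()
    return x * n


with record_item_calls() as item_calls:
    runner_with_sync(torch.ones(3))
```

The same wrapper shape extends to other host-syncing calls such as `tolist()`.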
5121074782 cudagraph-safe CuTeDSL MoE: searchsorted-based scale assembly
Key changes for cudagraph compatibility:
- No .item() or .tolist() calls (zero CPU-GPU syncs)
- Pre-allocated buffers at max_num_tokens size
- GPU-only expert offsets via bincount+cumsum
- searchsorted to map rows to experts (no Python for-loop with GPU indices)
- Single scatter operation for scale padding
- Pre-allocated token_indices reused for searchsorted row mapping
- quantize_activation_nvfp4 with fixed global scale (no .max() sync)
- Cached CuTeDSL kernel (no cute.compile per forward)
- No torch.cuda.synchronize() in forward path
2026-05-16 18:01:47 +00:00
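
The bincount + cumsum + searchsorted combination from commit 5121074782 can be traced on a small example. The sketch runs on CPU with made-up expert ids and assumes slots are already sorted by expert; it demonstrates the index arithmetic only and makes no claim about which of these ops are capture-safe.

```python
import torch

num_experts = 3
# Expert id per slot, already sorted by expert.
sorted_expert_ids = torch.tensor([0, 0, 1, 2, 2, 2])

# Rows per expert, with a fixed output length of num_experts.
counts = torch.bincount(sorted_expert_ids, minlength=num_experts)
# Exclusive end offset of each expert's row range.
ends = torch.cumsum(counts, dim=0)

# Map each row index back to its expert without a Python loop:
# row r belongs to the first expert whose end offset is > r.
rows = torch.arange(sorted_expert_ids.numel())
row_to_expert = torch.searchsorted(ends, rows, right=True)
```

With `right=True`, a row index equal to an end offset is assigned to the next expert, which matches half-open row ranges.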
ab126b0c0d fix: revert to .item() based scale assembly (fixes index OOB)
The fully GPU-vectorized _assemble_scales_gpu() caused index out of
bounds errors because tensor slicing with GPU-computed indices from
Python is undefined behavior.

Went back to .item() on expert_offsets for the per-expert scale split.
This forces CPU-GPU syncs (breaks cudagraph) but produces correct results.

The path to cudagraph compatibility is either:
1. Modify CuTeDSL scale assembly API to accept flat tensor + offsets
2. Use the CUTLASS kernel (already verified working)
2026-05-16 17:55:32 +00:00
7594968482 WIP: cudagraph-compatible CuTeDSL MoE runner
- Cache compiled CuTeDSL kernel (compile once, reuse every forward)
- Remove torch.cuda.synchronize() from forward path
- Add quantize_activation_nvfp4() (no .max() CPU-GPU sync)
- Pre-allocate buffers (token_indices, expert_id_range, output_bufs)
- GPU-only expert offset computation (bincount + cumsum)
- Replace Python for-loop scale assembly with GPU-vectorized version

Still TODO:
- Test with FULL_AND_PIECEWISE cudagraph mode
- Add vllm::deepseek_v4_mega_moe_experts to splitting_ops
- Verify CuTeDSL kernel launch is cudagraph-safe
2026-05-16 16:36:19 +00:00
f0c1be3ced fix: remove broken hc_head warmup (wrong tensor shape)
hc_head_fuse_tilelang expects fn shape[0]=hc_mult (4) but we passed
hc_mult*(2+hc_mult) (24). Since --enforce-eager disables @torch.compile
anyway, hc_head runs eagerly and doesn't need warmup.
2026-05-16 10:11:34 +00:00
c803180706 fix: handle freed weight lists in _check_runtime_supported and _run_mega_moe
After _ensure_stacked frees per-expert lists, code that accesses
l1_fp4 or w13_weight.device crashes with NoneType errors. Fix:
- _check_runtime_supported: fall back to _l1_mat_b.device
- _run_mega_moe assertion: check _l1_mat_b as alternative
- finalize_weights guard: check _l1_mat_b as alternative
2026-05-16 09:16:24 +00:00
cdd813cf7e fix: free per-expert weight lists after stacking in CuTeDSL runner
_ensure_stacked() created stacked copies of all weights but never freed
the per-expert lists. For 256 experts on a 175GB model, this doubled
weight memory to ~350GB, causing OOM.

Now the per-expert lists (l1_fp4, l1_sf, l1_gs, l2_fp4, l2_sf, l2_gs)
are set to None after stacking, keeping only the single stacked copy.
2026-05-16 08:54:52 +00:00
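
The free-after-stacking change in commit cdd813cf7e, together with the device fallback from commit c803180706, can be sketched as below. The attribute names `l1_fp4` and `_l1_mat_b` come from the commit messages; the class `ExpertWeights` and its methods are hypothetical and much simpler than the real runner.

```python
from typing import List, Optional

import torch


class ExpertWeights:
    def __init__(self, per_expert: List[torch.Tensor]):
        self.l1_fp4: Optional[List[torch.Tensor]] = per_expert
        self._l1_mat_b: Optional[torch.Tensor] = None

    def ensure_stacked(self) -> None:
        if self._l1_mat_b is None:
            self._l1_mat_b = torch.stack(self.l1_fp4, dim=0)
            # Drop the list so the per-expert tensors can be freed.
            self.l1_fp4 = None

    @property
    def device(self) -> torch.device:
        # After stacking, l1_fp4 is None, so fall back to the stacked copy.
        if self._l1_mat_b is not None:
            return self._l1_mat_b.device
        return self.l1_fp4[0].device


weights = ExpertWeights([torch.zeros(4, 2) for _ in range(3)])
weights.ensure_stacked()
weights.ensure_stacked()   # second call is a no-op
```

Memory is only released if nothing else still references the per-expert tensors.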
99c11c218d fucken a 2026-05-16 08:39:13 +00:00
906ee80a42 Add tilelang kernel warmup in load_weights
Force-compile all lazy tilelang JIT kernels (mhc_pre, mhc_post)
and torch.compile'd hc_head during model loading, BEFORE the HTTP
server comes up. This eliminates the crash when eager mode inference
hits the model before tilelang compilation finishes.

Fixes the core issue: cudagraph capture forced eager compilation but
ate all GPU memory. Now we can run eager mode safely.
2026-05-16 08:28:39 +00:00