CuTeDSL's cute.compile corrupts GPU memory during JIT compilation.
Tensors allocated on GPU before/during compilation get zeroed.
Fix: create token_indices on CPU, then .to(device) after JIT is done.
CuTeDSL's cute.compile appears to corrupt GPU memory state,
causing torch.arange to produce zero-filled tensors when allocated
after the JIT compilation. Moving token_indices allocation before
the weight stacking operations fixes the corruption.
Uses quantize_to_nvfp4 during warmup to get exact gs values for L1 and L2.
L1 gs comes from slot_hidden, L2 gs from the actual L1 GEMM output.
These values are then used with quantize_activation_nvfp4 (cudagraph-safe)
during inference.
The checkpoint stores input_scale per projection — the pre-computed
activation normalization factor. Using 1/2688 was wrong for most layers
(e.g. down_proj input_scale=0.031 vs 1/2688=0.000372 — 83x off).
This caused under-quantized activations and garbage output.
- Removed all [:total_slots] dynamic slicing with GPU scalars
- slot_hidden gathers from hidden_states directly using sorted_token_ids
- scatter_add uses full sorted_token_ids (padding slots have zero weight)
- _assemble_scales_cudagraph_safe returns 2D via padded_scales.shape[0]
- Fixed padded_scales_buf allocation via float16->float8 cast
- GEMM output size: n_dim * 2 for float4_e2m1fn_x2 packed format
Key changes for cudagraph compatibility:
- No .item() or .tolist() calls (zero CPU-GPU syncs)
- Pre-allocated buffers at max_num_tokens size
- GPU-only expert offsets via bincount+cumsum
- searchsorted to map rows to experts (no Python for-loop with GPU indices)
- Single scatter operation for scale padding
- Pre-allocated token_indices reused for searchsorted row mapping
- quantize_activation_nvfp4 with fixed global scale (no .max() sync)
- Cached CuTeDSL kernel (no cute.compile per forward)
- No torch.cuda.synchronize() in forward path
The fully GPU-vectorized _assemble_scales_gpu() caused index out of
bounds errors because tensor slicing with GPU-computed indices from
Python is undefined behavior.
Went back to .item() on expert_offsets for the per-expert scale split.
This forces CPU-GPU syncs (breaks cudagraph) but produces correct results.
The path to cudagraph compatibility is either:
1. Modify CuTeDSL scale assembly API to accept flat tensor + offsets
2. Use the CUTLASS kernel (already verified working)
hc_head_fuse_tilelang expects fn shape[0]=hc_mult (4) but we passed
hc_mult*(2+hc_mult) (24). Since --enforce-eager disables @torch.compile
anyway, hc_head runs eagerly and doesn't need warmup.
After _ensure_stacked frees per-expert lists, code that accesses
l1_fp4 or w13_weight.device crashes with NoneType errors. Fix:
- _check_runtime_supported: fall back to _l1_mat_b.device
- _run_mega_moe assertion: check _l1_mat_b as alternative
- finalize_weights guard: check _l1_mat_b as alternative
_ensure_stacked() creates stacked copies of all weights but never freed
the per-expert lists. For 256 experts on a 175GB model, this doubles
weight memory to ~350GB, causing OOM.
Now the per-expert lists (l1_fp4, l1_sf, l1_gs, l2_fp4, l2_sf, l2_gs)
are set to None after stacking, keeping only the single stacked copy.
Force-compile all lazy tilelang JIT kernels (mhc_pre, mhc_post)
and torch.compile'd hc_head during model loading, BEFORE the HTTP
server comes up. This eliminates the crash when eager mode inference
hits the model before tilelang compilation finishes.
Fixes the core issue: cudagraph capture forced eager compilation but
ate all GPU memory. Now we can run eager mode safely.
The warmup allocated 1GB of dummy tensors but the model already
uses 175.7GB of the 178.35GB per GPU. No room.
With FULL_AND_PIEWISE CUDA graph mode, the kernel compiles during
the graph capture phase (which manages memory properly). The warmup
was a band-aid for eager mode and is now redundant.
CuTeDSL's grouped GEMM uses int32 for expert offsets internally.
Our cumsum produced int64, causing a type mismatch inside a dynamic
if-branch (prev_off changes from Int32 to Int64).
Also cast tokens_per_expert to int32 before cumsum.
CUDA graphs forbid CPU-GPU syncs (.item()) and Python loops over
tokens during graph capture. The old scatter loop did both.
Changes:
- Slot routing: replaced Python loop with GPU-native argsort + gather
(sort tokens by expert id, gather hidden states in slot order)
- Scatter: replaced Python loop with torch.scatter_add_ (GPU-native)
- Weight stacking: lazily pre-built once, reused every forward call
- Removed all .item() calls from the forward path
- expert_offsets built from GPU tensor operations
This is required for FULL_AND_PIECEWISE CUDA graph mode which
compiles and captures graphs during startup.
The 5-minute gap after safetensors load is GPU weight upload — no
output, k8s marks the pod unhealthy. Now prints a heartbeat every
256 weight loads during the expert loading phase.
Also adds checkpoint-ready and model-ready prints around finalize:
Checkpoint loaded. Transferring weights to GPU & preparing NVFP4...
(JIT compile)NVFP4 MoE layers: 50%|██████████░░░░░░░░░░| 31/61
NVFP4 model ready ✓
_convert_nvfp4_post_load() was converting wq_b, wo_b, fused_wqa_wkv
from NVFP4→BF16. These layers already have FlashInferCutlassNvFp4LinearKernel
registered as their quant_method — they CAN run native NVFP4.
Now only wo_a gets FP8 conversion (fp8_einsum requires FP8) and
compressor gets BF16 reconstruction (weight_loader issue).
Everything else stays NVFP4 native — Blackwell FP4 acceleration
for the full model, not just the MoE experts.
This also eliminates the 5-minute NVFP4→BF16 conversion loop.
The outer loop tqdm now covers the full finalize_weights + warmup for
each MoE layer. CuTeDSL caches by (M,N,K) so every layer shape gets
compiled during warmup — no RPC timeouts during inference.
(JIT compile)NVFP4 MoE layers: 50%|██████████░░░░░░░░░░| 31/61
CuTeDSL caches kernels by (M, N, K) shape. Different layer shapes
(L1 vs L2, different expert counts) trigger new compiles. We can't
skip the warmup call — only suppress the print spam.
Flag now gates the message, not the warmup.
The warmup was running for every MoE layer (61 layers × 8 ranks = 488
compile attempts). The kernel is cached after the first compile —
subsequent calls are instant. But the print spam was insane.
Now uses a class-level flag to compile exactly once per process.
All 61 layers on a rank share the same compiled kernel.
Progress now shows per-layer instead of per-expert — cleaner and
covers the full finalize_mega_moe_weights loop (61 layers) which was
the silent 5-minute gap after checkpoint loading.
(view-cast)uint8→NVFP4 experts: 80%|████████████████░░░░| 49/61
(upcast)NVFP4→FP8/BF16 convert: 30%|██████░░░░░░░░░░░░░░| 20/61
The CuTeDSL kernel uses MMA tiler (128,128,256). With only 1 token,
the kernel can't fill a tile and may access illegal memory. Using 128
tokens for the warmup.
Also improved error message — after CUDA illegal memory access, the
context is corrupted and can't recover.
JIT compiles the MLIR→PTX during finalize_weights instead of on the
first inference request. Prevents vLLM's 5-min RPC timeout from
killing the engine while workers are busy compiling.
Warmup runs a single-token, single-expert forward pass — just enough
to trigger compilation. Takes ~1-2 min, same as layertest.
Makes it crystal clear what's happening:
- Experts: direct uint8→float4 view-cast (Blackwell native, no BF16)
- Convert: NVFP4→FP8/BF16 for attention weights (non-expert path)
Visual feedback during the slow parts of model loading:
NVFP4 experts [████████████████░░░░] 80% (26/32)
NVFP4 convert [██████░░░░░░░░░░░░░░] 30% (20/61)
Updates every 10% so it's not spammy.