CUDA graph capture needs extra memory on top of the model weights.
With 175GB model on 178GB GPUs, there's no room.
Going back to --enforce-eager with 10-min RPC timeout. The first
inference request will be slow (2-3 min JIT compilation) but won't
crash. Subsequent requests are fast.
CUDA graph mode requires either more GPU memory or a smaller model.
quantize_to_nvfp4 was allocating a (..., n_blocks, block_size, 8)
float32 tensor for nearest-neighbor distances to all 8 E2M1 values.
That's 32x the input size — 10.5GB for a typical batch, causing OOM
with only 3GB free.
New approach: clamp to [0, 6], scale to half-integer steps, round,
then map through a 13-byte lookup table to E2M1 indices.
Peak memory is now ~2x input (x_f32 + x_scaled) instead of 32x.
This makes activation quantization CUDA-graph-safe for the
memory-constrained DeepSeek-V4 on B200 (175GB model / 178GB GPU).
The warmup allocated 1GB of dummy tensors but the model already
uses 175.7GB of the 178.35GB per GPU. No room.
With FULL_AND_PIEWISE CUDA graph mode, the kernel compiles during
the graph capture phase (which manages memory properly). The warmup
was a band-aid for eager mode and is now redundant.
CuTeDSL's grouped GEMM uses int32 for expert offsets internally.
Our cumsum produced int64, causing a type mismatch inside a dynamic
if-branch (prev_off changes from Int32 to Int64).
Also cast tokens_per_expert to int32 before cumsum.
CUDA graphs forbid CPU-GPU syncs (.item()) and Python loops over
tokens during graph capture. The old scatter loop did both.
Changes:
- Slot routing: replaced Python loop with GPU-native argsort + gather
(sort tokens by expert id, gather hidden states in slot order)
- Scatter: replaced Python loop with torch.scatter_add_ (GPU-native)
- Weight stacking: lazily pre-built once, reused every forward call
- Removed all .item() calls from the forward path
- expert_offsets built from GPU tensor operations
This is required for FULL_AND_PIECEWISE CUDA graph mode which
compiles and captures graphs during startup.
After the container starts, the script waits for the API to come up,
then sends a warmup request to trigger all JIT compilation (Triton,
TileLang, CuTeDSL). This way the first real inference request is fast.
Also added tqdm for expert weight loading:
Loading Native NVFP4 Expert Weights: 50%|██████████░░| 480/960
First inference triggers Triton/TileLang kernel JIT compilation (2-3 min).
The default 5-min RPC timeout kills the engine. Bumped to 10 min via
VLLM_RPC_TIMEOUT_MS so the first request survives compilation.
Not ideal — would prefer to warm up the kernels during startup.
But CUDA graphs don't work well with grouped GEMMs and variable
expert counts. Will investigate vLLM warmup shape config later.
The 5-minute gap after safetensors load is GPU weight upload — no
output, k8s marks the pod unhealthy. Now prints a heartbeat every
256 weight loads during the expert loading phase.
Also adds checkpoint-ready and model-ready prints around finalize:
Checkpoint loaded. Transferring weights to GPU & preparing NVFP4...
(JIT compile)NVFP4 MoE layers: 50%|██████████░░░░░░░░░░| 31/61
NVFP4 model ready ✓
_convert_nvfp4_post_load() was converting wq_b, wo_b, fused_wqa_wkv
from NVFP4→BF16. These layers already have FlashInferCutlassNvFp4LinearKernel
registered as their quant_method — they CAN run native NVFP4.
Now only wo_a gets FP8 conversion (fp8_einsum requires FP8) and
compressor gets BF16 reconstruction (weight_loader issue).
Everything else stays NVFP4 native — Blackwell FP4 acceleration
for the full model, not just the MoE experts.
This also eliminates the 5-minute NVFP4→BF16 conversion loop.
The outer loop tqdm now covers the full finalize_weights + warmup for
each MoE layer. CuTeDSL caches by (M,N,K) so every layer shape gets
compiled during warmup — no RPC timeouts during inference.
(JIT compile)NVFP4 MoE layers: 50%|██████████░░░░░░░░░░| 31/61
CuTeDSL caches kernels by (M, N, K) shape. Different layer shapes
(L1 vs L2, different expert counts) trigger new compiles. We can't
skip the warmup call — only suppress the print spam.
Flag now gates the message, not the warmup.
The warmup was running for every MoE layer (61 layers × 8 ranks = 488
compile attempts). The kernel is cached after the first compile —
subsequent calls are instant. But the print spam was insane.
Now uses a class-level flag to compile exactly once per process.
All 61 layers on a rank share the same compiled kernel.
Progress now shows per-layer instead of per-expert — cleaner and
covers the full finalize_mega_moe_weights loop (61 layers) which was
the silent 5-minute gap after checkpoint loading.
(view-cast)uint8→NVFP4 experts: 80%|████████████████░░░░| 49/61
(upcast)NVFP4→FP8/BF16 convert: 30%|██████░░░░░░░░░░░░░░| 20/61
The CuTeDSL kernel uses MMA tiler (128,128,256). With only 1 token,
the kernel can't fill a tile and may access illegal memory. Using 128
tokens for the warmup.
Also improved error message — after CUDA illegal memory access, the
context is corrupted and can't recover.
JIT compiles the MLIR→PTX during finalize_weights instead of on the
first inference request. Prevents vLLM's 5-min RPC timeout from
killing the engine while workers are busy compiling.
Warmup runs a single-token, single-expert forward pass — just enough
to trigger compilation. Takes ~1-2 min, same as layertest.
Makes it crystal clear what's happening:
- Experts: direct uint8→float4 view-cast (Blackwell native, no BF16)
- Convert: NVFP4→FP8/BF16 for attention weights (non-expert path)
Python buffers stdout by default. Docker only sees the buffer dumps,
so all progress bars appear at once when the step completes.
PYTHONUNBUFFERED=1 disables buffering — prints flush immediately.
Visual feedback during the slow parts of model loading:
NVFP4 experts [████████████████░░░░] 80% (26/32)
NVFP4 convert [██████░░░░░░░░░░░░░░] 30% (20/61)
Updates every 10% so it's not spammy.
intermediate_size=3072 is the size of gate OR up, not gate+up.
Split L1 output at intermediate_size, not intermediate_size//2.
gate = l1_out[:, :3072], up = l1_out[:, 3072:]
The bridge's assemble_scales_3d_side expects (K_sf, N) input and
transposes to (N, K_sf) internally before swizzling. The checkpoint
stores scales as (N, K_sf). Without this transpose, the kernel was
reading completely wrong scale data — cosine dropped to 0.713.
Also fixed dual global scale normalization: after transpose, gate/up
are along dim 1 (columns), not dim 0 (rows).
finalize_weights() now view-casts checkpoint uint8 → float4_e2m1fn_x2
directly. Block scales (float8_e4m3fn) and global scales (float32)
pass through unchanged. Zero precision loss on the weights themselves.
L1 dual global scale handling: gate and up have different global scales.
Normalize to max(gate_gs, up_gs) and fold the ratio into block scales
via float32 (one multiply + float8 round-trip on the RATIO only —
much better than dequantizing the entire weight matrix).
layertest.py: updated to test direct path. Expect cosine improvement
from 0.989 → 0.995+ (matching the L1-only result).
README.md: full rewrite explaining how we got here, project structure,
plan, and key lessons learned from the C++ CUTLASS disaster.
Removed:
- DEBUG_LOG.md (old debug timeline, no longer relevant)
- REWRITE_PLAN.md (plan is now in README)
- test_gemm.py (C++ extension test)
Added:
- vllm/nvfp4_cutedsl.py: CuTeDSLMoERunner class for vLLM integration
- Replaces nvfp4_mega_moe_full + SymmBuffer with CuTeDSL kernel
- Handles slot-based routing, L1→SiLU→L2→scatter
- prepare_weights_from_dequantized() for weight prep
Tagged the-last-of-cutlass on the old C++ kernel state.
The layertest dequantizes checkpoint NVFP4→BF16 then re-quantizes
BF16→NVFP4. This double quantization costs ~1% cosine. The kernel
itself is correct — the 0.989 cosine is expected quantization noise.
cutedsl/moe_pipeline.py: complete pipeline
- stage_activation: BF16 → NVFP4 (keeps data in FP4)
- L1 GEMM: NVFP4 × NVFP4 → BF16 (gate+up)
- SiLU(gate) * up: BF16 (only nonlinear, can't avoid)
- Re-quantize: BF16 → NVFP4 (back to native)
- L2 GEMM: NVFP4 × NVFP4 → BF16 (down_proj)
- Scatter with routing weights → BF16 output
layertest.py: now tests the FULL MoE pipeline against BF16 reference.
NVFP4-native: both GEMMs use float4_e2m1fn_x2 for A and B,
float8_e4m3fn for block scales, float32 for global scales.
BF16 only for SiLU activation and final scatter.
Tokens must be laid out as [expert0_tokens | expert1_tokens | ...]
for the 2Dx3D grouped GEMM. Each expert gets its own contiguous
block of tokens. Scale factors split by expert offsets.
Copied from CUTLASS examples (no more runtime dependency on
/root/cutlass/examples/). Fixed all imports to use cutedsl.kernel.*
instead of blackwell.kernel.*.
Structure:
cutedsl/__init__.py
cutedsl/kernel/__init__.py
cutedsl/kernel/moe/ (the MoE scaled grouped GEMM)
cutedsl/kernel/blockscaled_gemm/ (dense blockscaled GEMM)
test_cutedsl.py updated to import from our local copy.
Tests the NVIDIA reference kernel with our quantization pipeline:
1. Quantize BF16 → NVFP4 (our stage_activation logic)
2. Pad and swizzle scale factors (to_blocked)
3. Run ScaledGroupedGemmKernel (2Dx3D scenario)
4. Compare against BF16 matmul reference
Also adds cutedsl/moe.py module for the future pipeline integration.
The C++ CUTLASS kernel is fundamentally broken (cosine 0.05 with real
data). Switching to NVIDIA's CuTeDSL approach based on their official
MoE scaled grouped GEMM example.
Reference files copied:
- moe_torch_scaled_grouped_mm.py (3900 lines — our new kernel)
- moe_utils.py, moe_persistent_scheduler.py, moe_sched_extension.py
- grouped_blockscaled_gemm.py, dense_blockscaled_gemm_persistent.py
- blockscaled_layout.py