Commit Graph

126 Commits

Author SHA1 Message Date
4300775bfe fix: remove .item() sync in scale reshape — use padded_scales.shape[0] instead 2026-05-16 18:29:12 +00:00
5a79065b2b fix: GEMM output should be 2x packed N (float4_e2m1fn_x2 packs 2 per element) 2026-05-16 18:27:44 +00:00
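The "2x packed N" relationship in commit 5a79065b2b can be illustrated with a small pure-Python sketch. It is not the repository's code: the helper names are hypothetical, and placing the even-indexed code in the low nibble is an assumption made only for this illustration.

```python
def pack_fp4_pairs(codes):
    """Pack 4-bit codes two per byte (assumed order: even index in the low nibble)."""
    if len(codes) % 2:
        raise ValueError("need an even number of 4-bit codes")
    return bytes(
        (codes[i] & 0xF) | ((codes[i + 1] & 0xF) << 4)
        for i in range(0, len(codes), 2)
    )


def unpack_fp4_pairs(packed):
    """Inverse of pack_fp4_pairs: each byte yields two 4-bit codes."""
    out = []
    for b in packed:
        out.append(b & 0xF)  # low nibble first (same assumption as above)
        out.append(b >> 4)
    return out


codes = [0, 1, 2, 3, 4, 5, 6, 7]
packed = pack_fp4_pairs(codes)
# A packed dimension of len(packed) holds twice as many logical FP4 values.
logical_n = 2 * len(packed)
```

The point is only the size relationship: a tensor whose packed dimension is N stores 2N FP4 values along that dimension.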
95a1345b92 fix: return 2D scale tensor from _assemble_scales_cudagraph_safe 2026-05-16 18:26:57 +00:00
533089c9d2 fix: token_indices slice bug + torch.zeros for float4/float8 dtypes 2026-05-16 18:21:27 +00:00
54c470e535 fix: use float16->float8 cast for rand_sf (torch.rand doesn't support float8) 2026-05-16 18:13:14 +00:00
f2de95c526 fix: use randint for float4 dummy weights in cudagraph test 2026-05-16 18:08:45 +00:00
f66d4b69a4 GPU-only scale assembly + cudagraph test harness
- assemble_activation_scales_gpu: builds padded+swizzled scale tensor
  without .item() or .tolist() CPU syncs. Uses GPU index arange + cat
  + single scatter instead of per-expert Python slicing.
- Still has a for e in range(num_experts) loop, but num_experts is a
  compile-time constant, so torch.compile unrolls it.
- Added tests/cudagraph_test.py: attempts CUDA graph capture on the
  MoE runner, diagnoses sync violations with patched torch functions.
- Removed the if total_slots == 0 early return (Python control flow
  on GPU data)
2026-05-16 18:05:13 +00:00
5121074782 cudagraph-safe CuTeDSL MoE: searchsorted-based scale assembly
Key changes for cudagraph compatibility:
- No .item() or .tolist() calls (zero CPU-GPU syncs)
- Pre-allocated buffers at max_num_tokens size
- GPU-only expert offsets via bincount+cumsum
- searchsorted to map rows to experts (no Python for-loop with GPU indices)
- Single scatter operation for scale padding
- Pre-allocated token_indices reused for searchsorted row mapping
- quantize_activation_nvfp4 with fixed global scale (no .max() sync)
- Cached CuTeDSL kernel (no cute.compile per forward)
- No torch.cuda.synchronize() in forward path
2026-05-16 18:01:47 +00:00
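The offset and row-mapping steps listed in commit 5121074782 (bincount + cumsum, then searchsorted) can be sketched on the CPU with the standard library. This is a reference for the index math only, under the assumption that rows are already sorted by expert id; the function names are hypothetical. On the GPU the corresponding calls would presumably be torch.bincount, torch.cumsum, and torch.searchsorted with right=True.

```python
from bisect import bisect_right
from itertools import accumulate


def expert_offsets(expert_ids, num_experts):
    """bincount + cumsum: offsets[e] is the first row owned by expert e."""
    counts = [0] * num_experts
    for e in expert_ids:
        counts[e] += 1
    return [0] + list(accumulate(counts))  # length num_experts + 1


def rows_to_experts(offsets, num_rows):
    """searchsorted: row r belongs to the first expert whose end offset exceeds r."""
    ends = offsets[1:]
    return [bisect_right(ends, r) for r in range(num_rows)]


sorted_expert_ids = [0, 0, 2, 2, 2, 3]  # expert 1 receives no rows
offsets = expert_offsets(sorted_expert_ids, num_experts=4)
row_expert = rows_to_experts(offsets, num_rows=len(sorted_expert_ids))
```

Using the right-sided search is what lets empty experts (here expert 1) be skipped without any data-dependent branching.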
ab126b0c0d fix: revert to .item() based scale assembly (fixes index OOB)
The fully GPU-vectorized _assemble_scales_gpu() caused index out of
bounds errors because tensor slicing with GPU-computed indices from
Python is undefined behavior.

Went back to .item() on expert_offsets for the per-expert scale split.
This forces CPU-GPU syncs (breaks cudagraph) but produces correct results.

The path to cudagraph compatibility is either:
1. Modify CuTeDSL scale assembly API to accept flat tensor + offsets
2. Use the CUTLASS kernel (already verified working)
2026-05-16 17:55:32 +00:00
7594968482 WIP: cudagraph-compatible CuTeDSL MoE runner
- Cache compiled CuTeDSL kernel (compile once, reuse every forward)
- Remove torch.cuda.synchronize() from forward path
- Add quantize_activation_nvfp4() (no .max() CPU-GPU sync)
- Pre-allocate buffers (token_indices, expert_id_range, output_bufs)
- GPU-only expert offset computation (bincount + cumsum)
- Replace Python for-loop scale assembly with GPU-vectorized version

Still TODO:
- Test with FULL_AND_PIECEWISE cudagraph mode
- Add vllm::deepseek_v4_mega_moe_experts to splitting_ops
- Verify CuTeDSL kernel launch is cudagraph-safe
2026-05-16 16:36:19 +00:00
f0c1be3ced fix: remove broken hc_head warmup (wrong tensor shape)
hc_head_fuse_tilelang expects fn shape[0]=hc_mult (4) but we passed
hc_mult*(2+hc_mult) (24). Since --enforce-eager disables @torch.compile
anyway, hc_head runs eagerly and doesn't need warmup.
2026-05-16 10:11:34 +00:00
c803180706 fix: handle freed weight lists in _check_runtime_supported and _run_mega_moe
After _ensure_stacked frees per-expert lists, code that accesses
l1_fp4 or w13_weight.device crashes with NoneType errors. Fix:
- _check_runtime_supported: fall back to _l1_mat_b.device
- _run_mega_moe assertion: check _l1_mat_b as alternative
- finalize_weights guard: check _l1_mat_b as alternative
2026-05-16 09:16:24 +00:00
cdd813cf7e fix: free per-expert weight lists after stacking in CuTeDSL runner
_ensure_stacked() creates stacked copies of all weights but never freed
the per-expert lists. For 256 experts on a 175GB model, this doubles
weight memory to ~350GB, causing OOM.

Now the per-expert lists (l1_fp4, l1_sf, l1_gs, l2_fp4, l2_sf, l2_gs)
are set to None after stacking, keeping only the single stacked copy.
2026-05-16 08:54:52 +00:00
99c11c218d fucken a 2026-05-16 08:39:13 +00:00
906ee80a42 Add tilelang kernel warmup in load_weights
Force-compile all lazy tilelang JIT kernels (mhc_pre, mhc_post)
and torch.compile'd hc_head during model loading, BEFORE the HTTP
server comes up. This eliminates the crash when eager mode inference
hits the model before tilelang compilation finishes.

Fixes the core issue: cudagraph capture forced eager compilation but
ate all GPU memory. Now we can run eager mode safely.
2026-05-16 08:28:39 +00:00
a51ef3d2cf fucken a 2026-05-16 08:23:27 +00:00
72bf750a0b fix: revert to eager mode — CUDA graphs OOM with 175GB model
CUDA graph capture needs extra memory on top of the model weights.
With a 175GB model on 178GB GPUs, there's no room.

Going back to --enforce-eager with 10-min RPC timeout. The first
inference request will be slow (2-3 min JIT compilation) but won't
crash. Subsequent requests are fast.

CUDA graph mode requires either more GPU memory or a smaller model.
2026-05-16 08:07:44 +00:00
baf44c92f8 fix: memory-efficient E2M1 quantization — no 32x distance tensor
quantize_to_nvfp4 was allocating a (..., n_blocks, block_size, 8)
float32 tensor for nearest-neighbor distances to all 8 E2M1 values.
That's 32x the input size — 10.5GB for a typical batch, causing OOM
with only 3GB free.

New approach: clamp to [0, 6], scale to half-integer steps, round,
then map through a 13-byte lookup table to E2M1 indices.
Peak memory is now ~2x input (x_f32 + x_scaled) instead of 32x.

This makes activation quantization CUDA-graph-safe for the
memory-constrained DeepSeek-V4 on B200 (175GB model / 178GB GPU).
2026-05-16 07:49:38 +00:00
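The clamp / half-step / lookup-table approach from commit baf44c92f8 can be sketched for a single scalar as follows. This is an illustration, not the repository's quantize_to_nvfp4: the table contents, the tie handling at 2.5, 3.5 and 5.0 (resolved here toward the value with an even mantissa), and the sign-bit placement are all assumptions, and block scaling is left out.

```python
import math

# The 8 non-negative E2M1 magnitudes, indexed by their 3-bit code.
E2M1_VALUES = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]

# 13-entry table: index = round(2 * |x|) in 0..12, entry = E2M1 code.
# Entries for the half-steps 2.5, 3.5 and 5.0 are an assumed tie policy.
HALF_STEP_TO_CODE = bytes([0, 1, 2, 3, 4, 4, 5, 6, 6, 6, 6, 7, 7])


def quantize_e2m1(x):
    """Return a 4-bit code: sign in bit 3 (assumed), magnitude code in bits 0..2."""
    mag = min(abs(x), 6.0)                 # clamp to [0, 6]
    step = int(math.floor(mag * 2.0 + 0.5))  # scale to half-integer steps, round
    code = HALF_STEP_TO_CODE[step]         # table lookup, no distance tensor
    sign = 1 if x < 0 else 0
    return (sign << 3) | code


def dequantize_e2m1(nibble):
    v = E2M1_VALUES[nibble & 0x7]
    return -v if nibble & 0x8 else v
```

Because the input is rounded to a half-step before the lookup, values near the 2.5, 3.5 and 5.0 midpoints may not map to the strictly nearest E2M1 value in this sketch (2.7 maps to 2.0 rather than 3.0, for example); the commit's actual table may treat those cases differently.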
a2cac7a7fe fix: remove CuTeDSL warmup — OOM with 175GB model loaded
The warmup allocated 1GB of dummy tensors but the model already
uses 175.7GB of the 178.35GB per GPU. No room.

With FULL_AND_PIECEWISE CUDA graph mode, the kernel compiles during
the graph capture phase (which manages memory properly). The warmup
was a band-aid for eager mode and is now redundant.
2026-05-16 07:32:17 +00:00
e0814eb54e fix: cast expert_offsets to int32 for CuTeDSL kernel
CuTeDSL's grouped GEMM uses int32 for expert offsets internally.
Our cumsum produced int64, causing a type mismatch inside a dynamic
if-branch (prev_off changes from Int32 to Int64).

Also cast tokens_per_expert to int32 before cumsum.
2026-05-16 07:15:57 +00:00
4b0a9557f0 fix: rewrite CuTeDSLMoERunner for CUDA graph compatibility
CUDA graphs forbid CPU-GPU syncs (.item()) and Python loops over
tokens during graph capture. The old scatter loop did both.

Changes:
- Slot routing: replaced Python loop with GPU-native argsort + gather
  (sort tokens by expert id, gather hidden states in slot order)
- Scatter: replaced Python loop with torch.scatter_add_ (GPU-native)
- Weight stacking: lazily pre-built once, reused every forward call
- Removed all .item() calls from the forward path
- expert_offsets built from GPU tensor operations

This is required for FULL_AND_PIECEWISE CUDA graph mode which
compiles and captures graphs during startup.
2026-05-16 07:03:08 +00:00
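The slot routing and scatter described in commit 4b0a9557f0 (argsort by expert id, gather in slot order, scatter-add back to tokens) can be shown with a CPU reference on scalars. The function and parameter names are hypothetical, and expert_fn stands in for the per-expert GEMMs. The Python loops here exist only to make the index math visible; on the GPU they would presumably be torch.argsort, tensor indexing, and Tensor.scatter_add_.

```python
def route_and_combine(hidden, token_ids, expert_ids, expert_fn):
    """hidden[t] is token t's input; (token_ids[i], expert_ids[i]) is one routing assignment."""
    n = len(token_ids)
    # argsort: order assignments by expert id (stable, so ties keep input order)
    order = sorted(range(n), key=lambda i: expert_ids[i])
    # gather: hidden states in slot order, grouped by expert
    slot_tokens = [token_ids[i] for i in order]
    slot_experts = [expert_ids[i] for i in order]
    slot_hidden = [hidden[t] for t in slot_tokens]
    slot_out = [expert_fn(e, h) for e, h in zip(slot_experts, slot_hidden)]
    # scatter-add: accumulate each slot's output into its source token
    result = [0.0] * len(hidden)
    for tok, out in zip(slot_tokens, slot_out):
        result[tok] += out
    return order, result


hidden = [1.0, 2.0, 3.0]
token_ids = [0, 0, 1, 2]
expert_ids = [1, 0, 1, 0]
order, combined = route_and_combine(
    hidden, token_ids, expert_ids, expert_fn=lambda e, h: (e + 1) * h
)
```

Token 0 is routed to two experts, so its result is the sum of both expert outputs, which is the behaviour scatter-add provides.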
dab31b0961 fix: missing tqdm import in weight_loader 2026-05-16 06:31:14 +00:00
8496ac99bc dang clonkurs 2026-05-16 06:28:16 +00:00
e7c6274107 Revert "feat: auto-warmup in build_and_run.sh"
This reverts commit f792537719.
2026-05-16 06:14:28 +00:00
f792537719 feat: auto-warmup in build_and_run.sh
After the container starts, the script waits for the API to come up,
then sends a warmup request to trigger all JIT compilation (Triton,
TileLang, CuTeDSL). This way the first real inference request is fast.

Also added tqdm for expert weight loading:
  Loading Native NVFP4 Expert Weights: 50%|██████████░░| 480/960
2026-05-16 06:11:38 +00:00
5d975d00d9 feat: tqdm progress bar for expert weight loading
Replaces heartbeat prints with a clean tqdm bar:
  Loading Native NVFP4 Expert Weights: 50%|██████████░░| 480/960
2026-05-16 06:09:22 +00:00
2e4ff6b8d4 fix: increase vLLM RPC timeout to 10 min for first-request JIT
First inference triggers Triton/TileLang kernel JIT compilation (2-3 min).
The default 5-min RPC timeout kills the engine. Bumped to 10 min via
VLLM_RPC_TIMEOUT_MS so the first request survives compilation.

Not ideal — would prefer to warm up the kernels during startup.
But CUDA graphs don't work well with grouped GEMMs and variable
expert counts. Will investigate vLLM warmup shape config later.
2026-05-16 06:02:11 +00:00
a569612df5 feat: add load progress heartbeats to prevent k8s health check kills
The 5-minute gap after safetensors load is GPU weight upload — no
output, k8s marks the pod unhealthy. Now prints a heartbeat every
256 weight loads during the expert loading phase.

Also adds checkpoint-ready and model-ready prints around finalize:
  Checkpoint loaded. Transferring weights to GPU & preparing NVFP4...
  (JIT compile)NVFP4 MoE layers: 50%|██████████░░░░░░░░░░| 31/61
  NVFP4 model ready ✓
2026-05-16 05:51:35 +00:00
e5370140cb docs: update README with full NVFP4 coverage, dequant anti-pattern, v2 status
- Added NVFP4 coverage table (what's native, what's converted, why)
- Documented the dequant→requant anti-pattern that caused vLLM hangs
- Updated plan: Phase 2 done, Phase 3 targets remaining conversions
- Removed stale REWRITE_PLAN reference
- Updated project structure (nvfp4_cutedsl.py, removed old refs)
2026-05-16 05:43:33 +00:00
3445bd24c1 feat: keep attention weights native NVFP4 — stop dequantizing to BF16
_convert_nvfp4_post_load() was converting wq_b, wo_b, fused_wqa_wkv
from NVFP4→BF16. These layers already have FlashInferCutlassNvFp4LinearKernel
registered as their quant_method — they CAN run native NVFP4.

Now only wo_a gets FP8 conversion (fp8_einsum requires FP8) and
compressor gets BF16 reconstruction (weight_loader issue).
Everything else stays NVFP4 native — Blackwell FP4 acceleration
for the full model, not just the MoE experts.

This also eliminates the 5-minute NVFP4→BF16 conversion loop.
2026-05-16 05:36:34 +00:00
4d4cfa6b28 fix: tqdm over MoE layer warmup, compile every layer, no print spam
The outer loop tqdm now covers the full finalize_weights + warmup for
each MoE layer. CuTeDSL caches by (M,N,K) so every layer shape gets
compiled during warmup — no RPC timeouts during inference.

  (JIT compile)NVFP4 MoE layers:  50%|██████████░░░░░░░░░░| 31/61
2026-05-16 05:21:11 +00:00
3838561c19 fix: only suppress compile message, still warmup all layers
CuTeDSL caches kernels by (M, N, K) shape. Different layer shapes
(L1 vs L2, different expert counts) trigger new compiles. We can't
skip the warmup call — only suppress the print spam.

Flag now gates the message, not the warmup.
2026-05-16 05:18:10 +00:00
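The behaviour described in commits f19932d8db and 3838561c19 (compile per (M, N, K) shape, but print the message only once per process) can be sketched with a class-level cache. KernelCache and compile_fn are hypothetical names for this illustration; they are not the CuTeDSL API.

```python
class KernelCache:
    """Class-level state is shared by every layer in the process."""

    _compiled = {}      # (M, N, K) -> compiled kernel
    _announced = False  # gates the message, not the compilation
    compile_count = 0

    @classmethod
    def get(cls, m, n, k, compile_fn):
        key = (m, n, k)
        if key not in cls._compiled:
            if not cls._announced:
                print("Compiling kernel (first use in this process)")
                cls._announced = True
            cls._compiled[key] = compile_fn(m, n, k)
            cls.compile_count += 1
        return cls._compiled[key]


def fake_compile(m, n, k):
    # Stand-in for an expensive JIT compile step.
    return f"kernel_{m}x{n}x{k}"


# 61 layers with the same shape compile once; a second shape compiles once more.
for _ in range(61):
    KernelCache.get(128, 128, 256, fake_compile)
other = KernelCache.get(128, 256, 256, fake_compile)
```

Keeping the "announced" flag separate from the cache lookup is what lets every new shape still be compiled during warmup while the log stays quiet.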
f19932d8db fix: compile CuTeDSL kernel once per process, not per MoE layer
The warmup was running for every MoE layer (61 layers × 8 ranks = 488
compile attempts). The kernel is cached after the first compile —
subsequent calls are instant. But the print spam was insane.

Now uses a class-level flag to compile exactly once per process.
All 61 layers on a rank share the same compiled kernel.
2026-05-16 05:16:53 +00:00
936982c5aa fix: add layer-level tqdm for expert finalization, remove inner expert tqdm
Progress now shows per-layer instead of per-expert — cleaner and
covers the full finalize_mega_moe_weights loop (61 layers) which was
the silent 5-minute gap after checkpoint loading.

  (view-cast)uint8→NVFP4 experts:  80%|████████████████░░░░| 49/61
  (upcast)NVFP4→FP8/BF16 convert:  30%|██████░░░░░░░░░░░░░░| 20/61
2026-05-16 05:01:20 +00:00
cf0731cf4b fix: warmup with 128 tokens (fills MMA tile), better error handling
The CuTeDSL kernel uses MMA tiler (128,128,256). With only 1 token,
the kernel can't fill a tile and may access illegal memory. Using 128
tokens for the warmup.

Also improved error message — after CUDA illegal memory access, the
context is corrupted and can't recover.
2026-05-16 04:56:45 +00:00
a70d2d3984 fix: clearer warmup message — 'Compiling CuTeDSL NVFP4 MegaMoE kernel' 2026-05-16 04:40:31 +00:00
f191af7e29 feat: warm up CuTeDSL kernel during model loading
JIT compiles the MLIR→PTX during finalize_weights instead of on the
first inference request. Prevents vLLM's 5-min RPC timeout from
killing the engine while workers are busy compiling.

Warmup runs a single-token, single-expert forward pass — just enough
to trigger compilation. Takes ~1-2 min, same as layertest.
2026-05-16 04:39:05 +00:00
4d67b570b9 fix: descriptive tqdm labels — uint8→NVFP4 and NVFP4→FP8/BF16
Makes it crystal clear what's happening:
- Experts: direct uint8→float4 view-cast (Blackwell native, no BF16)
- Convert: NVFP4→FP8/BF16 for attention weights (non-expert path)
2026-05-16 04:28:25 +00:00
8efdd165da fix: use tqdm for progress bars — single line, live updating
Replaces manual bar printing with tqdm. Overwrites the same line
instead of spewing one line per update.
2026-05-16 04:26:43 +00:00
830f042443 fix: PYTHONUNBUFFERED=1 so progress bars stream in real-time
Python block-buffers stdout when it is not attached to a terminal, so Docker
only sees output when the buffer flushes, and all progress bars appear at once
when the step completes.
PYTHONUNBUFFERED=1 disables buffering — prints flush immediately.
2026-05-16 04:18:07 +00:00
00b766af60 feat: add progress bars for expert quantization and post-load conversion
Visual feedback during the slow parts of model loading:
  NVFP4 experts [████████████████░░░░]  80% (26/32)
  NVFP4 convert [██████░░░░░░░░░░░░░░]  30% (20/61)

Updates every 10% so it's not spammy.
2026-05-16 04:14:07 +00:00
b465579a02 cleanup: nuke all debug prints and env var gates from vLLM patch
Removed:
- [WT-LOAD] weight loader debug (MEGA_MOE_DEBUG gate)
- [NVFP4 DEBUG] shape logging in _run_mega_moe
- [NVFP4_DEBUG] post-load expert weight counting
- [NVFP4] post-load sync + CUDA OK print (NVFP4_DEBUG_SYNC gate)
- [POST-LOAD] all-zero param tensor scanning
- [LOGITS] top-k printing + Paris probe
- SKIP_ATTENTION env var gate for skipping attention
- Unused total_fp8/total_bf16 variables

Debugging belongs in layertest.py, not in the vLLM serving path.
These prints polluted logs, bloated context windows, and slowed loading.
2026-05-16 04:10:42 +00:00
174ad70dca fix: same gate/up split fix in moe_pipeline.py 2026-05-16 04:04:53 +00:00
6d17988b51 fix: L1 gate/up split — intermediate_size is per-projection, not fused
intermediate_size=3072 is the size of gate OR up, not gate+up.
Split L1 output at intermediate_size, not intermediate_size//2.
gate = l1_out[:, :3072], up = l1_out[:, 3072:]
2026-05-16 04:04:40 +00:00
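The split fixed in commit 6d17988b51 can be checked with a small pure-Python sketch on one output row. The gate-first layout follows the commit message; combining the halves as SiLU(gate) * up is an assumption about the activation, and the function names are hypothetical.

```python
import math


def silu(x):
    return x / (1.0 + math.exp(-x))


def split_gate_up(l1_row, intermediate_size):
    """L1 output holds gate then up, each of width intermediate_size."""
    if len(l1_row) != 2 * intermediate_size:
        raise ValueError("L1 output width must be 2 * intermediate_size")
    return l1_row[:intermediate_size], l1_row[intermediate_size:]


def gated_activation(l1_row, intermediate_size):
    gate, up = split_gate_up(l1_row, intermediate_size)
    return [silu(g) * u for g, u in zip(gate, up)]  # assumed SwiGLU-style combine


row = [0.0, 1.0, 2.0, 3.0]  # intermediate_size = 2: gate = [0, 1], up = [2, 3]
gate, up = split_gate_up(row, intermediate_size=2)
act = gated_activation(row, intermediate_size=2)
```

Splitting at intermediate_size // 2 instead would hand the gate only half of its columns, which is the bug the commit describes.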
37aa0cbeab debug: add try/except with shape logging to _run_mega_moe 2026-05-16 04:02:01 +00:00
b04bff7e8b feat: clean Dockerfile, docker-compose, import fixes for CuTeDSL build
Dockerfile:
- Removed: C++ CUTLASS extension build, TileLang install, CUTLASS clone
- Added: nvidia-cutlass-dsl==4.5.0 install, cutedsl/ copy
- Copy nvfp4_cutedsl.py to vllm models dir
- Verify step checks cutlass import

docker-compose.yml:
- Removed stale env vars (MEGA_MOE_DEBUG, MEGA_MOE_STATIC, etc.)

deepseek_v4.py:
- Fix import: vllm.nvfp4_cutedsl → vllm.model_executor.models.nvfp4_cutedsl

README.md:
- Updated results: 0% weight loss confirmed (bit-identical view-cast)
- 1.1% cosine loss is entirely from activation quantization
2026-05-16 03:50:07 +00:00
a0ff8a3278 fix: transpose checkpoint block scales (N,K_sf)→(K_sf,N) for bridge
The bridge's assemble_scales_3d_side expects (K_sf, N) input and
transposes to (N, K_sf) internally before swizzling. The checkpoint
stores scales as (N, K_sf). Without this transpose, the kernel was
reading completely wrong scale data — cosine dropped to 0.713.

Also fixed dual global scale normalization: after transpose, gate/up
are along dim 1 (columns), not dim 0 (rows).
2026-05-16 03:43:30 +00:00
389453fbf4 feat: direct NVFP4 path — no BF16 round-trip on weights
finalize_weights() now view-casts checkpoint uint8 → float4_e2m1fn_x2
directly. Block scales (float8_e4m3fn) and global scales (float32)
pass through unchanged. Zero precision loss on the weights themselves.

L1 dual global scale handling: gate and up have different global scales.
Normalize to max(gate_gs, up_gs) and fold the ratio into block scales
via float32 (one multiply + float8 round-trip on the RATIO only —
much better than dequantizing the entire weight matrix).

layertest.py: updated to test direct path. Expect cosine improvement
from 0.989 → 0.995+ (matching the L1-only result).
2026-05-16 03:41:23 +00:00
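The dual global scale handling in commit 389453fbf4 can be sketched as below, assuming the dequantized weight is fp4_value * block_scale * global_scale. The function name is hypothetical, and the float8 rounding of the adjusted block scales is omitted, so this shows only the algebra.

```python
def fold_global_scales(gate_sf, up_sf, gate_gs, up_gs):
    """Normalize gate/up to one global scale by folding the ratio into block scales."""
    common_gs = max(gate_gs, up_gs)
    gate_ratio = gate_gs / common_gs  # <= 1, so block scales never grow
    up_ratio = up_gs / common_gs
    new_gate_sf = [s * gate_ratio for s in gate_sf]
    new_up_sf = [s * up_ratio for s in up_sf]
    return new_gate_sf, new_up_sf, common_gs


gate_sf, up_sf = [1.0, 4.0], [8.0, 2.0]
gate_gs, up_gs = 2.0, 0.5
new_gate_sf, new_up_sf, common_gs = fold_global_scales(gate_sf, up_sf, gate_gs, up_gs)
```

The product block_scale * global_scale is unchanged for both halves, so the effective weights are preserved up to whatever rounding the real float8 block scales introduce.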
8fd9579127 feat: vLLM integration — replace C++ kernel with CuTeDSL
deepseek_v4.py changes:
- finalize_weights(): dequantize checkpoint → BF16 → re-quantize to
  float4_e2m1fn_x2 via CuTeDSLMoERunner (replaces transform_nvfp4_weights_for_mega_moe)
- _run_mega_moe(): calls CuTeDSLMoERunner.run() (replaces nvfp4_mega_moe_full)
- Removed get_symm_buffer() and SymmBuffer (CuTeDSL manages its own workspace)
- Removed _transformed_l1_weights / _transformed_l2_weights
- Added _cutedsl_runner class variable
- Weight loader unchanged (checkpoint loading is the same)

vllm/nvfp4_cutedsl.py:
- CuTeDSLMoERunner class handles the full pipeline
- prepare_weights_from_dequantized() for weight prep
- run() does L1→SiLU→L2→scatter with NVFP4-native GEMMs
2026-05-16 03:36:12 +00:00
3ec9c3074b docs: rewrite README, nuke DEBUG_LOG, add vLLM integration stub
README.md: full rewrite explaining how we got here, project structure,
plan, and key lessons learned from the C++ CUTLASS disaster.

Removed:
- DEBUG_LOG.md (old debug timeline, no longer relevant)
- REWRITE_PLAN.md (plan is now in README)
- test_gemm.py (C++ extension test)

Added:
- vllm/nvfp4_cutedsl.py: CuTeDSLMoERunner class for vLLM integration
  - Replaces nvfp4_mega_moe_full + SymmBuffer with CuTeDSL kernel
  - Handles slot-based routing, L1→SiLU→L2→scatter
  - prepare_weights_from_dequantized() for weight prep

Tagged the-last-of-cutlass on the old C++ kernel state.
2026-05-16 03:33:16 +00:00