- Reverted from full-buffer swizzle to per-expert 128-row slots
- Scatter into e*128 fixed positions (cudagraph-compatible, fixed shape)
- Clamp local_row to 127 for experts with >128 tokens (GEMM uses expert_offsets)
- Buffer sized for num_experts*128 rows (not max_tokens*top_k)
- Add _warmup_done guard to only run warmup once (not 60x)
The checkpoint input_scale is a calibration value that produces wrong gs
at runtime (too small → block scales saturate → garbage output → EOS).
Now calls compute_activation_global_scales() with sample data during weight
finalization, before cudagraph capture. This observes actual activation
magnitudes and computes correct L1 and L2 gs values.
Bug 9: padded_x_sf was sized for num_experts*128 rows, but with 8192 tokens
and top_k=6, the actual padded row count can exceed 6144. Also:
- Pass top_k and max_num_tokens from deepseek_v4.py (was defaulting to 8/8192)
- Phase 2 of scale assembly now handles experts with >128 tokens (multiple 128-row chunks)
- Remove debug prints
Root cause of CUDA_ERROR_ASSERT index out of bounds:
- topk_ids contains GLOBAL expert IDs (0-255) but runner treated them
as local IDs (0-31 with EP=8). Tokens for non-local experts got
wrong expert assignments, causing out-of-bounds scatter indices
in _assemble_scales_cudagraph_safe.
Fixes:
1. Add experts_start_idx param to CuTeDSLMoERunner
2. In run(), remap global→local IDs and zero weights for non-local experts
3. Move _token_indices from CPU to GPU (remove sort_idx.cpu() sync)
4. Add _fill_token_indices() and _needs_token_refill to handle CuTeDSL
JIT GPU memory corruption (refill after first GEMM call)
The checkpoint stores input_scale per projection — the pre-computed
activation normalization factor. Using 1/2688 was wrong for most layers
(e.g. down_proj input_scale=0.031 vs 1/2688=0.000372 — 83x off).
This caused under-quantized activations and garbage output.
hc_head_fuse_tilelang expects fn shape[0]=hc_mult (4) but we passed
hc_mult*(2+hc_mult) (24). Since --enforce-eager disables @torch.compile
anyway, hc_head runs eagerly and doesn't need warmup.
After _ensure_stacked frees per-expert lists, code that accesses
l1_fp4 or w13_weight.device crashes with NoneType errors. Fix:
- _check_runtime_supported: fall back to _l1_mat_b.device
- _run_mega_moe assertion: check _l1_mat_b as alternative
- finalize_weights guard: check _l1_mat_b as alternative
Force-compile all lazy tilelang JIT kernels (mhc_pre, mhc_post)
and torch.compile'd hc_head during model loading, BEFORE the HTTP
server comes up. This eliminates the crash when eager mode inference
hits the model before tilelang compilation finishes.
Fixes the core issue: cudagraph capture forced eager compilation but
ate all GPU memory. Now we can run eager mode safely.
The warmup allocated 1GB of dummy tensors but the model already
uses 175.7GB of the 178.35GB per GPU. No room.
With FULL_AND_PIEWISE CUDA graph mode, the kernel compiles during
the graph capture phase (which manages memory properly). The warmup
was a band-aid for eager mode and is now redundant.
The 5-minute gap after safetensors load is GPU weight upload — no
output, k8s marks the pod unhealthy. Now prints a heartbeat every
256 weight loads during the expert loading phase.
Also adds checkpoint-ready and model-ready prints around finalize:
Checkpoint loaded. Transferring weights to GPU & preparing NVFP4...
(JIT compile)NVFP4 MoE layers: 50%|██████████░░░░░░░░░░| 31/61
NVFP4 model ready ✓
_convert_nvfp4_post_load() was converting wq_b, wo_b, fused_wqa_wkv
from NVFP4→BF16. These layers already have FlashInferCutlassNvFp4LinearKernel
registered as their quant_method — they CAN run native NVFP4.
Now only wo_a gets FP8 conversion (fp8_einsum requires FP8) and
compressor gets BF16 reconstruction (weight_loader issue).
Everything else stays NVFP4 native — Blackwell FP4 acceleration
for the full model, not just the MoE experts.
This also eliminates the 5-minute NVFP4→BF16 conversion loop.
The outer loop tqdm now covers the full finalize_weights + warmup for
each MoE layer. CuTeDSL caches by (M,N,K) so every layer shape gets
compiled during warmup — no RPC timeouts during inference.
(JIT compile)NVFP4 MoE layers: 50%|██████████░░░░░░░░░░| 31/61
CuTeDSL caches kernels by (M, N, K) shape. Different layer shapes
(L1 vs L2, different expert counts) trigger new compiles. We can't
skip the warmup call — only suppress the print spam.
Flag now gates the message, not the warmup.
The warmup was running for every MoE layer (61 layers × 8 ranks = 488
compile attempts). The kernel is cached after the first compile —
subsequent calls are instant. But the print spam was insane.
Now uses a class-level flag to compile exactly once per process.
All 61 layers on a rank share the same compiled kernel.
Progress now shows per-layer instead of per-expert — cleaner and
covers the full finalize_mega_moe_weights loop (61 layers) which was
the silent 5-minute gap after checkpoint loading.
(view-cast)uint8→NVFP4 experts: 80%|████████████████░░░░| 49/61
(upcast)NVFP4→FP8/BF16 convert: 30%|██████░░░░░░░░░░░░░░| 20/61
The CuTeDSL kernel uses MMA tiler (128,128,256). With only 1 token,
the kernel can't fill a tile and may access illegal memory. Using 128
tokens for the warmup.
Also improved error message — after CUDA illegal memory access, the
context is corrupted and can't recover.
JIT compiles the MLIR→PTX during finalize_weights instead of on the
first inference request. Prevents vLLM's 5-min RPC timeout from
killing the engine while workers are busy compiling.
Warmup runs a single-token, single-expert forward pass — just enough
to trigger compilation. Takes ~1-2 min, same as layertest.
Makes it crystal clear what's happening:
- Experts: direct uint8→float4 view-cast (Blackwell native, no BF16)
- Convert: NVFP4→FP8/BF16 for attention weights (non-expert path)
Visual feedback during the slow parts of model loading:
NVFP4 experts [████████████████░░░░] 80% (26/32)
NVFP4 convert [██████░░░░░░░░░░░░░░] 30% (20/61)
Updates every 10% so it's not spammy.
The bridge's assemble_scales_3d_side expects (K_sf, N) input and
transposes to (N, K_sf) internally before swizzling. The checkpoint
stores scales as (N, K_sf). Without this transpose, the kernel was
reading completely wrong scale data — cosine dropped to 0.713.
Also fixed dual global scale normalization: after transpose, gate/up
are along dim 1 (columns), not dim 0 (rows).
finalize_weights() now view-casts checkpoint uint8 → float4_e2m1fn_x2
directly. Block scales (float8_e4m3fn) and global scales (float32)
pass through unchanged. Zero precision loss on the weights themselves.
L1 dual global scale handling: gate and up have different global scales.
Normalize to max(gate_gs, up_gs) and fold the ratio into block scales
via float32 (one multiply + float8 round-trip on the RATIO only —
much better than dequantizing the entire weight matrix).
layertest.py: updated to test direct path. Expect cosine improvement
from 0.989 → 0.995+ (matching the L1-only result).
- Renamed misleading _ue8m0_to_float32 to _block_scale_to_float32
(our checkpoint uses float8_e4m3fn, NOT E8M0)
- Removed dead is_scale_e8m0 property (never referenced)
- Removed dead _block_scale_to_float32 copy in MegaMoEExperts class
- Cleaned up stale E8M0/UE8M0/shift-by-23 comments
- Simplified E8M0 assertion to ValueError (not assert False)
- Updated DeepseekV4FP8Config docstring for NVFP4
Using checkpoint input_scale as the normalization scale saturates
FP4 values (all block scales = 448). The input_scale is a calibration
constant, NOT the amax/(6*448) normalization scale.
Reverted to dynamic amax/(6*448) for activation quantization.
The correct use of checkpoint input_scale is still under investigation.
Preserved: _w13_input_scale and _w2_input_scale in finalize_weights
for future use once we understand the correct alpha contract.
Critical fix: the checkpoint's input_scale was used during weight
calibration but we were computing dynamic scale from data (amax/2688).
This was 13x off from the checkpoint value.
Changes:
- stage_activation() accepts optional input_global_scale parameter
- nvfp4_mega_moe_full() accepts l1_input_scale and l2_input_scale
- vLLM patch preserves w13/w2_input_scale in finalize_weights
- L1 activation uses checkpoint w13_input_scale for quantization
- L2 activation uses checkpoint w2_input_scale for quantization
- alpha = input_scale * weight_scale_2 (correct calibration contract)
The fold block_sf (float8) * global_sf (float32) -> float8 loses ~25% precision.
Product of ~56-448 block_sf * ~4.65e-05 global_sf lands in float8 low-precision
zone where step size is 25%. This makes model output garbage despite finite values.
Fix: keep block scales as original float8, return global scales separately as
float32 per-expert vectors. Apply global scale as per-expert GEMM alpha in
cutlass_grouped_nvfp4_gemm (already iterates per-expert). For L1 with separate
gate/up global scales, use gate_gs as alpha and apply up_correction ratio to
the up half post-GEMM.
weight_transform.py: no more _fold_global_scale, returns (w, sf, global_sf)
nvfp4_mega_moe.py: per-expert alpha = activation_gs * weight_gs
kernel.py: per_expert_alpha parameter in grouped GEMM
deepseek_v4.py: updated type hints and comments