nvfp4-megamoe-kernel

Author	SHA1	Message	Date
biondizzle	103fd451ce	fix: use full padded_scales_buf (no GPU scalar slicing in cudagraph) buf[:gpu_scalar, :] triggers cudaErrorStreamCaptureInvalidated. Always use the full pre-allocated buffer; extra rows are zeros.	2026-05-16 18:50:35 +00:00
biondizzle	53c25bee0b	rewrite: cudagraph-safe runner - no dynamic slicing, no GPU scalar indices - Removed all [:total_slots] dynamic slicing with GPU scalars - slot_hidden gathers from hidden_states directly using sorted_token_ids - scatter_add uses full sorted_token_ids (padding slots have zero weight) - _assemble_scales_cudagraph_safe returns 2D via padded_scales.shape[0] - Fixed padded_scales_buf allocation via float16->float8 cast - GEMM output size: n_dim * 2 for float4_e2m1fn_x2 packed format	2026-05-16 18:44:25 +00:00
biondizzle	4300775bfe	fix: remove .item() sync in scale reshape — use padded_scales.shape[0] instead	2026-05-16 18:29:12 +00:00
biondizzle	95a1345b92	fix: return 2D scale tensor from _assemble_scales_cudagraph_safe	2026-05-16 18:26:57 +00:00
biondizzle	533089c9d2	fix: token_indices slice bug + torch.zeros for float4/float8 dtypes	2026-05-16 18:21:27 +00:00
biondizzle	5121074782	cudagraph-safe CuTeDSL MoE: searchsorted-based scale assembly Key changes for cudagraph compatibility: - No .item() or .tolist() calls (zero CPU-GPU syncs) - Pre-allocated buffers at max_num_tokens size - GPU-only expert offsets via bincount+cumsum - searchsorted to map rows to experts (no Python for-loop with GPU indices) - Single scatter operation for scale padding - Pre-allocated token_indices reused for searchsorted row mapping - quantize_activation_nvfp4 with fixed global scale (no .max() sync) - Cached CuTeDSL kernel (no cute.compile per forward) - No torch.cuda.synchronize() in forward path	2026-05-16 18:01:47 +00:00
biondizzle	ab126b0c0d	fix: revert to .item() based scale assembly (fixes index OOB) The fully GPU-vectorized _assemble_scales_gpu() caused index out of bounds errors because tensor slicing with GPU-computed indices from Python is undefined behavior. Went back to .item() on expert_offsets for the per-expert scale split. This forces CPU-GPU syncs (breaks cudagraph) but produces correct results. The path to cudagraph compatibility is either: 1. Modify CuTeDSL scale assembly API to accept flat tensor + offsets 2. Use the CUTLASS kernel (already verified working)	2026-05-16 17:55:32 +00:00
biondizzle	7594968482	WIP: cudagraph-compatible CuTeDSL MoE runner - Cache compiled CuTeDSL kernel (compile once, reuse every forward) - Remove torch.cuda.synchronize() from forward path - Add quantize_activation_nvfp4() (no .max() CPU-GPU sync) - Pre-allocate buffers (token_indices, expert_id_range, output_bufs) - GPU-only expert offset computation (bincount + cumsum) - Replace Python for-loop scale assembly with GPU-vectorized version Still TODO: - Test with FULL_AND_PIECEWISE cudagraph mode - Add vllm::deepseek_v4_mega_moe_experts to splitting_ops - Verify CuTeDSL kernel launch is cudagraph-safe	2026-05-16 16:36:19 +00:00
biondizzle	f0c1be3ced	fix: remove broken hc_head warmup (wrong tensor shape) hc_head_fuse_tilelang expects fn shape[0]=hc_mult (4) but we passed hc_mult*(2+hc_mult) (24). Since --enforce-eager disables @torch.compile anyway, hc_head runs eagerly and doesn't need warmup.	2026-05-16 10:11:34 +00:00
biondizzle	c803180706	fix: handle freed weight lists in _check_runtime_supported and _run_mega_moe After _ensure_stacked frees per-expert lists, code that accesses l1_fp4 or w13_weight.device crashes with NoneType errors. Fix: - _check_runtime_supported: fall back to _l1_mat_b.device - _run_mega_moe assertion: check _l1_mat_b as alternative - finalize_weights guard: check _l1_mat_b as alternative	2026-05-16 09:16:24 +00:00
biondizzle	cdd813cf7e	fix: free per-expert weight lists after stacking in CuTeDSL runner _ensure_stacked() creates stacked copies of all weights but never freed the per-expert lists. For 256 experts on a 175GB model, this doubles weight memory to ~350GB, causing OOM. Now the per-expert lists (l1_fp4, l1_sf, l1_gs, l2_fp4, l2_sf, l2_gs) are set to None after stacking, keeping only the single stacked copy.	2026-05-16 08:54:52 +00:00
biondizzle	906ee80a42	Add tilelang kernel warmup in load_weights Force-compile all lazy tilelang JIT kernels (mhc_pre, mhc_post) and torch.compile'd hc_head during model loading, BEFORE the HTTP server comes up. This eliminates the crash when eager mode inference hits the model before tilelang compilation finishes. Fixes the core issue: cudagraph capture forced eager compilation but ate all GPU memory. Now we can run eager mode safely.	2026-05-16 08:28:39 +00:00
biondizzle	a2cac7a7fe	fix: remove CuTeDSL warmup (OOM with 175GB model loaded) The warmup allocated 1GB of dummy tensors but the model already uses 175.7GB of the 178.35GB per GPU. No room. With FULL_AND_PIECEWISE CUDA graph mode, the kernel compiles during the graph capture phase (which manages memory properly). The warmup was a band-aid for eager mode and is now redundant.	2026-05-16 07:32:17 +00:00
biondizzle	e0814eb54e	fix: cast expert_offsets to int32 for CuTeDSL kernel CuTeDSL's grouped GEMM uses int32 for expert offsets internally. Our cumsum produced int64, causing a type mismatch inside a dynamic if-branch (prev_off changes from Int32 to Int64). Also cast tokens_per_expert to int32 before cumsum.	2026-05-16 07:15:57 +00:00
biondizzle	4b0a9557f0	fix: rewrite CuTeDSLMoERunner for CUDA graph compatibility CUDA graphs forbid CPU-GPU syncs (.item()) and Python loops over tokens during graph capture. The old scatter loop did both. Changes: - Slot routing: replaced Python loop with GPU-native argsort + gather (sort tokens by expert id, gather hidden states in slot order) - Scatter: replaced Python loop with torch.scatter_add_ (GPU-native) - Weight stacking: lazily pre-built once, reused every forward call - Removed all .item() calls from the forward path - expert_offsets built from GPU tensor operations This is required for FULL_AND_PIECEWISE CUDA graph mode which compiles and captures graphs during startup.	2026-05-16 07:03:08 +00:00
biondizzle	dab31b0961	fix: missing tqdm import in weight_loader	2026-05-16 06:31:14 +00:00
biondizzle	8496ac99bc	dang clonkurs	2026-05-16 06:28:16 +00:00
biondizzle	5d975d00d9	feat: tqdm progress bar for expert weight loading Replaces heartbeat prints with a clean tqdm bar: Loading Native NVFP4 Expert Weights: 50%|██████████░░| 480/960	2026-05-16 06:09:22 +00:00
biondizzle	a569612df5	feat: add load progress heartbeats to prevent k8s health check kills The 5-minute gap after safetensors load is the GPU weight upload; it produces no output, so k8s marks the pod unhealthy. Now prints a heartbeat every 256 weight loads during the expert loading phase. Also adds checkpoint-ready and model-ready prints around finalize: Checkpoint loaded. Transferring weights to GPU & preparing NVFP4... (JIT compile)NVFP4 MoE layers: 50%|██████████░░░░░░░░░░| 31/61 NVFP4 model ready ✓	2026-05-16 05:51:35 +00:00
biondizzle	3445bd24c1	feat: keep attention weights native NVFP4 — stop dequantizing to BF16 _convert_nvfp4_post_load() was converting wq_b, wo_b, fused_wqa_wkv from NVFP4→BF16. These layers already have FlashInferCutlassNvFp4LinearKernel registered as their quant_method — they CAN run native NVFP4. Now only wo_a gets FP8 conversion (fp8_einsum requires FP8) and compressor gets BF16 reconstruction (weight_loader issue). Everything else stays NVFP4 native — Blackwell FP4 acceleration for the full model, not just the MoE experts. This also eliminates the 5-minute NVFP4→BF16 conversion loop.	2026-05-16 05:36:34 +00:00
biondizzle	4d4cfa6b28	fix: tqdm over MoE layer warmup, compile every layer, no print spam The outer loop tqdm now covers the full finalize_weights + warmup for each MoE layer. CuTeDSL caches by (M,N,K) so every layer shape gets compiled during warmup, with no RPC timeouts during inference. (JIT compile)NVFP4 MoE layers: 50%|██████████░░░░░░░░░░| 31/61	2026-05-16 05:21:11 +00:00
biondizzle	3838561c19	fix: only suppress compile message, still warmup all layers CuTeDSL caches kernels by (M, N, K) shape. Different layer shapes (L1 vs L2, different expert counts) trigger new compiles. We can't skip the warmup call — only suppress the print spam. Flag now gates the message, not the warmup.	2026-05-16 05:18:10 +00:00
biondizzle	f19932d8db	fix: compile CuTeDSL kernel once per process, not per MoE layer The warmup was running for every MoE layer (61 layers × 8 ranks = 488 compile attempts). The kernel is cached after the first compile — subsequent calls are instant. But the print spam was insane. Now uses a class-level flag to compile exactly once per process. All 61 layers on a rank share the same compiled kernel.	2026-05-16 05:16:53 +00:00
biondizzle	936982c5aa	fix: add layer-level tqdm for expert finalization, remove inner expert tqdm Progress now shows per-layer instead of per-expert, which is cleaner and covers the full finalize_mega_moe_weights loop (61 layers), the silent 5-minute gap after checkpoint loading. (view-cast)uint8→NVFP4 experts: 80%|████████████████░░░░| 49/61 (upcast)NVFP4→FP8/BF16 convert: 30%|██████░░░░░░░░░░░░░░| 20/61	2026-05-16 05:01:20 +00:00
biondizzle	cf0731cf4b	fix: warmup with 128 tokens (fills MMA tile), better error handling The CuTeDSL kernel uses MMA tiler (128,128,256). With only 1 token, the kernel can't fill a tile and may access illegal memory. Using 128 tokens for the warmup. Also improved error message — after CUDA illegal memory access, the context is corrupted and can't recover.	2026-05-16 04:56:45 +00:00
biondizzle	a70d2d3984	fix: clearer warmup message — 'Compiling CuTeDSL NVFP4 MegaMoE kernel'	2026-05-16 04:40:31 +00:00
biondizzle	f191af7e29	feat: warm up CuTeDSL kernel during model loading JIT compiles the MLIR→PTX during finalize_weights instead of on the first inference request. Prevents vLLM's 5-min RPC timeout from killing the engine while workers are busy compiling. Warmup runs a single-token, single-expert forward pass — just enough to trigger compilation. Takes ~1-2 min, same as layertest.	2026-05-16 04:39:05 +00:00
biondizzle	4d67b570b9	fix: descriptive tqdm labels — uint8→NVFP4 and NVFP4→FP8/BF16 Makes it crystal clear what's happening: - Experts: direct uint8→float4 view-cast (Blackwell native, no BF16) - Convert: NVFP4→FP8/BF16 for attention weights (non-expert path)	2026-05-16 04:28:25 +00:00
biondizzle	8efdd165da	fix: use tqdm for progress bars — single line, live updating Replaces manual bar printing with tqdm. Overwrites the same line instead of spewing one line per update.	2026-05-16 04:26:43 +00:00
biondizzle	00b766af60	feat: add progress bars for expert quantization and post-load conversion Visual feedback during the slow parts of model loading: NVFP4 experts [████████████████░░░░] 80% (26/32) NVFP4 convert [██████░░░░░░░░░░░░░░] 30% (20/61) Updates every 10% so it's not spammy.	2026-05-16 04:14:07 +00:00
biondizzle	b465579a02	cleanup: nuke all debug prints and env var gates from vLLM patch Removed: - [WT-LOAD] weight loader debug (MEGA_MOE_DEBUG gate) - [NVFP4 DEBUG] shape logging in _run_mega_moe - [NVFP4_DEBUG] post-load expert weight counting - [NVFP4] post-load sync + CUDA OK print (NVFP4_DEBUG_SYNC gate) - [POST-LOAD] all-zero param tensor scanning - [LOGITS] top-k printing + Paris probe - SKIP_ATTENTION env var gate for skipping attention - Unused total_fp8/total_bf16 variables Debugging belongs in layertest.py, not in the vLLM serving path. These prints polluted logs, bloated context windows, and slowed loading.	2026-05-16 04:10:42 +00:00
biondizzle	6d17988b51	fix: L1 gate/up split — intermediate_size is per-projection, not fused intermediate_size=3072 is the size of gate OR up, not gate+up. Split L1 output at intermediate_size, not intermediate_size//2. gate = l1_out[:, :3072], up = l1_out[:, 3072:]	2026-05-16 04:04:40 +00:00
biondizzle	37aa0cbeab	debug: add try/except with shape logging to _run_mega_moe	2026-05-16 04:02:01 +00:00
biondizzle	b04bff7e8b	feat: clean Dockerfile, docker-compose, import fixes for CuTeDSL build Dockerfile: - Removed: C++ CUTLASS extension build, TileLang install, CUTLASS clone - Added: nvidia-cutlass-dsl==4.5.0 install, cutedsl/ copy - Copy nvfp4_cutedsl.py to vllm models dir - Verify step checks cutlass import docker-compose.yml: - Removed stale env vars (MEGA_MOE_DEBUG, MEGA_MOE_STATIC, etc.) deepseek_v4.py: - Fix import: vllm.nvfp4_cutedsl → vllm.model_executor.models.nvfp4_cutedsl README.md: - Updated results: 0% weight loss confirmed (bit-identical view-cast) - 1.1% cosine loss is entirely from activation quantization	2026-05-16 03:50:07 +00:00
biondizzle	a0ff8a3278	fix: transpose checkpoint block scales (N,K_sf)→(K_sf,N) for bridge The bridge's assemble_scales_3d_side expects (K_sf, N) input and transposes to (N, K_sf) internally before swizzling. The checkpoint stores scales as (N, K_sf). Without this transpose, the kernel was reading completely wrong scale data — cosine dropped to 0.713. Also fixed dual global scale normalization: after transpose, gate/up are along dim 1 (columns), not dim 0 (rows).	2026-05-16 03:43:30 +00:00
biondizzle	389453fbf4	feat: direct NVFP4 path — no BF16 round-trip on weights finalize_weights() now view-casts checkpoint uint8 → float4_e2m1fn_x2 directly. Block scales (float8_e4m3fn) and global scales (float32) pass through unchanged. Zero precision loss on the weights themselves. L1 dual global scale handling: gate and up have different global scales. Normalize to max(gate_gs, up_gs) and fold the ratio into block scales via float32 (one multiply + float8 round-trip on the RATIO only — much better than dequantizing the entire weight matrix). layertest.py: updated to test direct path. Expect cosine improvement from 0.989 → 0.995+ (matching the L1-only result).	2026-05-16 03:41:23 +00:00
biondizzle	8fd9579127	feat: vLLM integration — replace C++ kernel with CuTeDSL deepseek_v4.py changes: - finalize_weights(): dequantize checkpoint → BF16 → re-quantize to float4_e2m1fn_x2 via CuTeDSLMoERunner (replaces transform_nvfp4_weights_for_mega_moe) - _run_mega_moe(): calls CuTeDSLMoERunner.run() (replaces nvfp4_mega_moe_full) - Removed get_symm_buffer() and SymmBuffer (CuTeDSL manages its own workspace) - Removed _transformed_l1_weights / _transformed_l2_weights - Added _cutedsl_runner class variable - Weight loader unchanged (checkpoint loading is the same) vllm/nvfp4_cutedsl.py: - CuTeDSLMoERunner class handles the full pipeline - prepare_weights_from_dequantized() for weight prep - run() does L1→SiLU→L2→scatter with NVFP4-native GEMMs	2026-05-16 03:36:12 +00:00
biondizzle	3ec9c3074b	docs: rewrite README, nuke DEBUG_LOG, add vLLM integration stub README.md: full rewrite explaining how we got here, project structure, plan, and key lessons learned from the C++ CUTLASS disaster. Removed: - DEBUG_LOG.md (old debug timeline, no longer relevant) - REWRITE_PLAN.md (plan is now in README) - test_gemm.py (C++ extension test) Added: - vllm/nvfp4_cutedsl.py: CuTeDSLMoERunner class for vLLM integration - Replaces nvfp4_mega_moe_full + SymmBuffer with CuTeDSL kernel - Handles slot-based routing, L1→SiLU→L2→scatter - prepare_weights_from_dequantized() for weight prep Tagged the-last-of-cutlass on the old C++ kernel state.	2026-05-16 03:33:16 +00:00
biondizzle	294e9f98f2	cleanup: rename _ue8m0_to_float32 → _block_scale_to_float32, remove dead code - Renamed misleading _ue8m0_to_float32 to _block_scale_to_float32 (our checkpoint uses float8_e4m3fn, NOT E8M0) - Removed dead is_scale_e8m0 property (never referenced) - Removed dead _block_scale_to_float32 copy in MegaMoEExperts class - Cleaned up stale E8M0/UE8M0/shift-by-23 comments - Simplified E8M0 assertion to ValueError (not assert False) - Updated DeepseekV4FP8Config docstring for NVFP4	2026-05-16 01:55:56 +00:00
biondizzle	79b9becf9c	revert: don't use checkpoint input_scale for activation normalization Using checkpoint input_scale as the normalization scale saturates FP4 values (all block scales = 448). The input_scale is a calibration constant, NOT the amax/(6*448) normalization scale. Reverted to dynamic amax/(6*448) for activation quantization. The correct use of checkpoint input_scale is still under investigation. Preserved: _w13_input_scale and _w2_input_scale in finalize_weights for future use once we understand the correct alpha contract.	2026-05-16 00:12:41 +00:00
biondizzle	a7eae10ef4	fix: use checkpoint input_scale for activation quantization Critical fix: the checkpoint's input_scale was used during weight calibration but we were computing dynamic scale from data (amax/2688). This was 13x off from the checkpoint value. Changes: - stage_activation() accepts optional input_global_scale parameter - nvfp4_mega_moe_full() accepts l1_input_scale and l2_input_scale - vLLM patch preserves w13/w2_input_scale in finalize_weights - L1 activation uses checkpoint w13_input_scale for quantization - L2 activation uses checkpoint w2_input_scale for quantization - alpha = input_scale * weight_scale_2 (correct calibration contract)	2026-05-15 23:57:08 +00:00
biondizzle	fd59222fc0	fix: stop folding global scale into float8 block scales The fold block_sf (float8) * global_sf (float32) -> float8 loses ~25% precision. Product of ~56-448 block_sf * ~4.65e-05 global_sf lands in float8 low-precision zone where step size is 25%. This makes model output garbage despite finite values. Fix: keep block scales as original float8, return global scales separately as float32 per-expert vectors. Apply global scale as per-expert GEMM alpha in cutlass_grouped_nvfp4_gemm (already iterates per-expert). For L1 with separate gate/up global scales, use gate_gs as alpha and apply up_correction ratio to the up half post-GEMM. weight_transform.py: no more _fold_global_scale, returns (w, sf, global_sf) nvfp4_mega_moe.py: per-expert alpha = activation_gs * weight_gs kernel.py: per_expert_alpha parameter in grouped GEMM deepseek_v4.py: updated type hints and comments	2026-05-15 12:42:53 +00:00
biondizzle	9908fd64d9	feat: CUTLASS NVFP4 mega_moe kernel — slot-based L1/L2, source-first SF remap Major changes from initial TileLang prototype: Kernel: - CUTLASS NVFP4 block-scaled GEMM (SM100 Blackwell, OpClassBlockScaledTensorOp) - Slot-based dispatch: L1 GEMM → SiLU+Mul per-slot → L2 GEMM → index_add scatter - 1D slot_expert_ids passed to both L1 and L2 (no 2D topk_ids rebuild) - slot_token gathered in cutlass_grouped_nvfp4_gemm when provided SF Remap (source-first): - Iterates logical (m, k_sf) source grid, uses layout_sf(make_coord(m, k_sf)) for CUTLASS dest index — no idx2crd/flatten coordinate extraction - 2D kernel launch: dim3 block(32,8), grid over (K_sf, MN) - Uses cute::cosize() for physical allocation size (not cute::size) - SFA: (MN, K_sf) row-major; SFB: (K_sf, MN) row-major (col-major) Weight transform: - UE4M3 unpack with bit reinterpret (not value cast) - Global scale folding (weight_scale_2) for gate/up split - clamp(0,448) → float8_e4m3fn, transpose (N,K)→(K,N) for CUTLASS No prepack cache: - SFB remapped per-call inside CUTLASS (~µs, not the bottleneck) - See README for why prepack cache must never return (OOM, CUDA graphs, M-dependent layout, cross-layer collisions) Stage activation: - Nearest-neighbor E2M1 quantization (no clamp, no uniform steps) - Per-tensor global scale → alpha for L2 GEMM Bug fixes: - _fold_global_scale: removed broken logical_widths branch - unpack_ue4m3_u32: int32 for CUDA bitwise, view not to, ND support - Correct expert param mapping for NVFP4 checkpoint - SiLU applied per-slot (not after summing expert paths)	2026-05-15 11:38:18 +00:00
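
Several commits above (4b0a9557f0, 5121074782) replace per-token Python loops with a sort by expert id, bincount + cumsum for expert offsets, and searchsorted to map rows back to experts. The index arithmetic can be sketched on the CPU with the standard library standing in for the torch ops. This is only an illustration under that assumption: `route_slots` and its variable names are hypothetical and are not taken from the repository.

```python
from bisect import bisect_right
from itertools import accumulate


def route_slots(slot_expert_ids, num_experts):
    """CPU stand-in for argsort / bincount / cumsum / searchsorted routing."""
    # argsort: slot indices ordered by expert id (stable, so ties keep slot order)
    order = sorted(range(len(slot_expert_ids)), key=lambda i: slot_expert_ids[i])

    # bincount: number of slots routed to each expert
    counts = [0] * num_experts
    for expert in slot_expert_ids:
        counts[expert] += 1

    # cumsum with a leading zero: expert e owns sorted rows offsets[e]:offsets[e + 1]
    offsets = [0] + list(accumulate(counts))

    # searchsorted (right side): map each sorted row back to its expert id
    row_expert = [bisect_right(offsets, row) - 1 for row in range(len(order))]
    return order, offsets, row_expert


slot_expert_ids = [2, 0, 2, 1, 0]  # hypothetical routing for five slots
order, offsets, row_expert = route_slots(slot_expert_ids, num_experts=4)
print(order)       # [1, 4, 3, 0, 2]
print(offsets)     # [0, 2, 3, 5, 5]
print(row_expert)  # [0, 0, 1, 2, 2]
```

Because the row-to-expert map comes from a search over the offsets rather than from slicing with values read back to the host, nothing in this shape of computation needs a `.item()` call, which is the property the cudagraph-related commits are after.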

43 Commits
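
The activation-scale commits (a7eae10ef4, 79b9becf9c) turn on the constant 2688 = 6 × 448, where 6 is the largest E2M1 (FP4) magnitude and 448 is the largest finite float8_e4m3fn value. The sketch below works through that arithmetic, assuming a dynamic per-tensor scale of amax / 2688 and unrounded block scales clamped to 448. The helper names and sample values are hypothetical, and the snippet does not reproduce the repository's quantization code.

```python
E2M1_MAX = 6.0    # largest magnitude representable in FP4 E2M1
E4M3_MAX = 448.0  # largest finite value in float8_e4m3fn


def global_scale(amax):
    """Per-tensor scale chosen so the largest block scale lands at E4M3_MAX."""
    return amax / (E2M1_MAX * E4M3_MAX)


def block_scale(block_amax, gs):
    """Unrounded per-block scale, clamped to the float8_e4m3fn range."""
    return min(block_amax / (E2M1_MAX * gs), E4M3_MAX)


block_amaxes = [0.5, 2.0, 8.0]  # hypothetical per-block absolute maxima
gs_dynamic = global_scale(max(block_amaxes))
scales_dynamic = [block_scale(a, gs_dynamic) for a in block_amaxes]

# A global scale much smaller than amax / 2688 pushes every block into the clamp.
gs_too_small = gs_dynamic / 16.0
scales_clamped = [block_scale(a, gs_too_small) for a in block_amaxes]

print(scales_dynamic)  # roughly 28, 112, 448
print(scales_clamped)  # every entry clamps at (about) 448
```

With the dynamic scale, block scales spread across the float8 range. With a global scale that is too small for the data, every block hits the 448 ceiling, which matches the saturation symptom described in the revert.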