nvfp4-megamoe-kernel

Author	SHA1	Message	Date
biondizzle	647c03b2ee	fix: make_b_k_major must preserve shape — use double-permute trick permute(K,N).contiguous().permute(K,N) gives same (E,K,N) shape but with K-contiguous memory. Single permute changes the shape.	2026-05-16 20:19:21 +00:00
biondizzle	ed4f501bba	fix: make_b_k_major stride check — K-major means stride[1]==1, not stride[2]==1 For (E, K, N): stride[2]==1 is N-major (columns contiguous). K-major requires stride[1]==1 (rows contiguous).	2026-05-16 20:18:18 +00:00
biondizzle	2162cee4ad	fix: restore proper quantize_weight_to_nvfp4 — K is the packed dim, not N quantize_to_nvfp4() only packs the last dimension, but for weight matrices (K, N), K is the packed dimension. The weight quantizer reshapes (k_blocks, block_size, N) and computes block scales along the K block dimension. This was accidentally replaced with a simple delegation to quantize_to_nvfp4, producing wrong tensor shapes.	2026-05-16 20:16:28 +00:00
biondizzle	10f1dca982	fix: import ceil_div from correct module	2026-05-16 20:09:02 +00:00
biondizzle	81632e2f21	fix: correct cutlass_torch import (cutlass.torch, not top-level)	2026-05-16 20:08:21 +00:00
biondizzle	16c4fad025	fix: remove cutlass.cute.backend import	2026-05-16 20:06:38 +00:00
biondizzle	44b40d41fe	fix: compile CuTeDSL kernel with real tensors, not dummy shapes The kernel's TMA descriptors are sized from compilation-time shapes. Dummy 256x256 caused wrong memory access for real 3584x6144 data. Now compiles with actual runtime tensors on first use, cached by (num_experts, K, N). Compilation happens once during warmup. Forward call remains cudagraph-safe.	2026-05-16 20:05:59 +00:00
biondizzle	79281b6fda	fix: compute K_packed/N_packed before passing to _get_compiled_kernel	2026-05-16 20:00:35 +00:00
biondizzle	caf93d6c45	fix: pass K_packed/N_packed to _get_compiled_kernel	2026-05-16 19:59:43 +00:00
biondizzle	ecc7b83334	fix: compile CuTeDSL kernel with actual tensor shapes, not dummy 256x256 The compiled kernel's TMA descriptors are sized based on compilation shapes. Using dummy 256x256 shapes caused wrong memory access patterns for the real 3584x6144 data. Now uses actual K_packed and N_packed from the runtime tensors.	2026-05-16 19:58:13 +00:00
biondizzle	cc75a55bd9	restore: new bridge/moe_pipeline/layertest	2026-05-16 19:55:19 +00:00
biondizzle	0c878b3a9e	temp: restore old layertest+bridge for cosine comparison	2026-05-16 19:54:04 +00:00
biondizzle	0069769d12	debug: print global scales	2026-05-16 19:38:31 +00:00
biondizzle	84589fe984	debug: more prints	2026-05-16 19:31:54 +00:00
biondizzle	fa2d5708c5	debug: add L1 GEMM and SiLU output debug prints	2026-05-16 19:29:42 +00:00
biondizzle	4c06c51ec3	fix: moe_pipeline.py gate/up split — L1 output is 2*intermediate, not intermediate	2026-05-16 19:28:15 +00:00
biondizzle	da31ce7e1a	allow for cuda graphs again	2026-05-16 19:23:41 +00:00
biondizzle	d15c43294b	fix: test L2 weight N dim should be hidden_size, not hidden_size//2	2026-05-16 19:07:36 +00:00
biondizzle	28788c6f55	fix: L1 weight N dimension is 2*intermediate (gate+up), not intermediate float4_e2m1fn_x2 packs 2 values per byte along K, not N. The GEMM output N dimension is the logical N from mat_b.shape[2], not 2x packed. Previous n_dim*2 was wrong: it accidentally worked in the test because intermediate_size*2 == 2*intermediate_size. Real model with N=9216 exposed the bug.	2026-05-16 19:07:08 +00:00
biondizzle	f7e29fdf1e	docs: update README with cudagraph compatibility work and decisions	2026-05-16 18:55:47 +00:00
biondizzle	103fd451ce	fix: use full padded_scales_buf (no GPU scalar slicing in cudagraph) buf[:gpu_scalar, :] triggers cudaErrorStreamCaptureInvalidated. Always use the full pre-allocated buffer; extra rows are zeros.	2026-05-16 18:50:35 +00:00
biondizzle	2f68c7ba77	fix: cache E2M1 step_to_idx LUT per device (no CPU->CUDA copy in forward) torch.tensor() and new_tensor() both trigger CPU->CUDA copies during cudagraph capture. Pre-cache the LUT on first use per device.	2026-05-16 18:48:31 +00:00
biondizzle	6c298be842	fix: use new_tensor instead of torch.tensor for cudagraph (no CPU→CUDA copy) torch.tensor() creates on CPU then copies to CUDA, which is forbidden during cudagraph capture. new_tensor() creates directly on the source tensor's device.	2026-05-16 18:47:39 +00:00
biondizzle	53c25bee0b	rewrite: cudagraph-safe runner - no dynamic slicing, no GPU scalar indices - Removed all [:total_slots] dynamic slicing with GPU scalars - slot_hidden gathers from hidden_states directly using sorted_token_ids - scatter_add uses full sorted_token_ids (padding slots have zero weight) - _assemble_scales_cudagraph_safe returns 2D via padded_scales.shape[0] - Fixed padded_scales_buf allocation via float16->float8 cast - GEMM output size: n_dim * 2 for float4_e2m1fn_x2 packed format	2026-05-16 18:44:25 +00:00
biondizzle	4300775bfe	fix: remove .item() sync in scale reshape — use padded_scales.shape[0] instead	2026-05-16 18:29:12 +00:00
biondizzle	5a79065b2b	fix: GEMM output should be 2x packed N (float4_e2m1fn_x2 packs 2 per element)	2026-05-16 18:27:44 +00:00
biondizzle	95a1345b92	fix: return 2D scale tensor from _assemble_scales_cudagraph_safe	2026-05-16 18:26:57 +00:00
biondizzle	533089c9d2	fix: token_indices slice bug + torch.zeros for float4/float8 dtypes	2026-05-16 18:21:27 +00:00
biondizzle	54c470e535	fix: use float16->float8 cast for rand_sf (torch.rand doesn't support float8)	2026-05-16 18:13:14 +00:00
biondizzle	f2de95c526	fix: use randint for float4 dummy weights in cudagraph test	2026-05-16 18:08:45 +00:00
biondizzle	f66d4b69a4	GPU-only scale assembly + cudagraph test harness - assemble_activation_scales_gpu: builds padded+swizzled scale tensor without .item() or .tolist() CPU syncs. Uses GPU index arange + cat + single scatter instead of per-expert Python slicing. - Still has a for e in range(num_experts) loop but num_experts is compile-time constant so torch.compile unrolls it. - Added tests/cudagraph_test.py: attempts CUDA graph capture on the MoE runner, diagnoses sync violations with patched torch functions. - Removed the if total_slots == 0 early return (Python control flow on GPU data)	2026-05-16 18:05:13 +00:00
biondizzle	5121074782	cudagraph-safe CuTeDSL MoE: searchsorted-based scale assembly Key changes for cudagraph compatibility: - No .item() or .tolist() calls (zero CPU-GPU syncs) - Pre-allocated buffers at max_num_tokens size - GPU-only expert offsets via bincount+cumsum - searchsorted to map rows to experts (no Python for-loop with GPU indices) - Single scatter operation for scale padding - Pre-allocated token_indices reused for searchsorted row mapping - quantize_activation_nvfp4 with fixed global scale (no .max() sync) - Cached CuTeDSL kernel (no cute.compile per forward) - No torch.cuda.synchronize() in forward path	2026-05-16 18:01:47 +00:00
biondizzle	ab126b0c0d	fix: revert to .item() based scale assembly (fixes index OOB) The fully GPU-vectorized _assemble_scales_gpu() caused index out of bounds errors because tensor slicing with GPU-computed indices from Python is undefined behavior. Went back to .item() on expert_offsets for the per-expert scale split. This forces CPU-GPU syncs (breaks cudagraph) but produces correct results. The path to cudagraph compatibility is either: 1. Modify CuTeDSL scale assembly API to accept flat tensor + offsets 2. Use the CUTLASS kernel (already verified working)	2026-05-16 17:55:32 +00:00
biondizzle	7594968482	WIP: cudagraph-compatible CuTeDSL MoE runner - Cache compiled CuTeDSL kernel (compile once, reuse every forward) - Remove torch.cuda.synchronize() from forward path - Add quantize_activation_nvfp4() (no .max() CPU-GPU sync) - Pre-allocate buffers (token_indices, expert_id_range, output_bufs) - GPU-only expert offset computation (bincount + cumsum) - Replace Python for-loop scale assembly with GPU-vectorized version Still TODO: - Test with FULL_AND_PIECEWISE cudagraph mode - Add vllm::deepseek_v4_mega_moe_experts to splitting_ops - Verify CuTeDSL kernel launch is cudagraph-safe	2026-05-16 16:36:19 +00:00
biondizzle	f0c1be3ced	fix: remove broken hc_head warmup (wrong tensor shape) hc_head_fuse_tilelang expects fn shape[0]=hc_mult (4) but we passed hc_mult*(2+hc_mult) (24). Since --enforce-eager disables @torch.compile anyway, hc_head runs eagerly and doesn't need warmup.	2026-05-16 10:11:34 +00:00
biondizzle	c803180706	fix: handle freed weight lists in _check_runtime_supported and _run_mega_moe After _ensure_stacked frees per-expert lists, code that accesses l1_fp4 or w13_weight.device crashes with NoneType errors. Fix: - _check_runtime_supported: fall back to _l1_mat_b.device - _run_mega_moe assertion: check _l1_mat_b as alternative - finalize_weights guard: check _l1_mat_b as alternative	2026-05-16 09:16:24 +00:00
biondizzle	cdd813cf7e	fix: free per-expert weight lists after stacking in CuTeDSL runner _ensure_stacked() creates stacked copies of all weights but never freed the per-expert lists. For 256 experts on a 175GB model, this doubles weight memory to ~350GB, causing OOM. Now the per-expert lists (l1_fp4, l1_sf, l1_gs, l2_fp4, l2_sf, l2_gs) are set to None after stacking, keeping only the single stacked copy.	2026-05-16 08:54:52 +00:00
biondizzle	99c11c218d	fucken a	2026-05-16 08:39:13 +00:00
biondizzle	906ee80a42	Add tilelang kernel warmup in load_weights Force-compile all lazy tilelang JIT kernels (mhc_pre, mhc_post) and torch.compile'd hc_head during model loading, BEFORE the HTTP server comes up. This eliminates the crash when eager mode inference hits the model before tilelang compilation finishes. Fixes the core issue: cudagraph capture forced eager compilation but ate all GPU memory. Now we can run eager mode safely.	2026-05-16 08:28:39 +00:00
biondizzle	a51ef3d2cf	fucken a	2026-05-16 08:23:27 +00:00
biondizzle	72bf750a0b	fix: revert to eager mode — CUDA graphs OOM with 175GB model CUDA graph capture needs extra memory on top of the model weights. With 175GB model on 178GB GPUs, there's no room. Going back to --enforce-eager with 10-min RPC timeout. The first inference request will be slow (2-3 min JIT compilation) but won't crash. Subsequent requests are fast. CUDA graph mode requires either more GPU memory or a smaller model.	2026-05-16 08:07:44 +00:00
biondizzle	baf44c92f8	fix: memory-efficient E2M1 quantization — no 32x distance tensor quantize_to_nvfp4 was allocating a (..., n_blocks, block_size, 8) float32 tensor for nearest-neighbor distances to all 8 E2M1 values. That's 32x the input size — 10.5GB for a typical batch, causing OOM with only 3GB free. New approach: clamp to [0, 6], scale to half-integer steps, round, then map through a 13-byte lookup table to E2M1 indices. Peak memory is now ~2x input (x_f32 + x_scaled) instead of 32x. This makes activation quantization CUDA-graph-safe for the memory-constrained DeepSeek-V4 on B200 (175GB model / 178GB GPU).	2026-05-16 07:49:38 +00:00
biondizzle	a2cac7a7fe	fix: remove CuTeDSL warmup - OOM with 175GB model loaded The warmup allocated 1GB of dummy tensors but the model already uses 175.7GB of the 178.35GB per GPU. No room. With FULL_AND_PIECEWISE CUDA graph mode, the kernel compiles during the graph capture phase (which manages memory properly). The warmup was a band-aid for eager mode and is now redundant.	2026-05-16 07:32:17 +00:00
biondizzle	e0814eb54e	fix: cast expert_offsets to int32 for CuTeDSL kernel CuTeDSL's grouped GEMM uses int32 for expert offsets internally. Our cumsum produced int64, causing a type mismatch inside a dynamic if-branch (prev_off changes from Int32 to Int64). Also cast tokens_per_expert to int32 before cumsum.	2026-05-16 07:15:57 +00:00
biondizzle	4b0a9557f0	fix: rewrite CuTeDSLMoERunner for CUDA graph compatibility CUDA graphs forbid CPU-GPU syncs (.item()) and Python loops over tokens during graph capture. The old scatter loop did both. Changes: - Slot routing: replaced Python loop with GPU-native argsort + gather (sort tokens by expert id, gather hidden states in slot order) - Scatter: replaced Python loop with torch.scatter_add_ (GPU-native) - Weight stacking: lazily pre-built once, reused every forward call - Removed all .item() calls from the forward path - expert_offsets built from GPU tensor operations This is required for FULL_AND_PIECEWISE CUDA graph mode which compiles and captures graphs during startup.	2026-05-16 07:03:08 +00:00
biondizzle	dab31b0961	fix: missing tqdm import in weight_loader	2026-05-16 06:31:14 +00:00
biondizzle	8496ac99bc	dang clonkurs	2026-05-16 06:28:16 +00:00
biondizzle	e7c6274107	Revert "feat: auto-warmup in build_and_run.sh" This reverts commit `f792537719`.	2026-05-16 06:14:28 +00:00
biondizzle	f792537719	feat: auto-warmup in build_and_run.sh After the container starts, the script waits for the API to come up, then sends a warmup request to trigger all JIT compilation (Triton, TileLang, CuTeDSL). This way the first real inference request is fast. Also added tqdm for expert weight loading: Loading Native NVFP4 Expert Weights: 50%|██████████░░| 480/960	2026-05-16 06:11:38 +00:00
biondizzle	5d975d00d9	feat: tqdm progress bar for expert weight loading Replaces heartbeat prints with a clean tqdm bar: Loading Native NVFP4 Expert Weights: 50%|██████████░░| 480/960	2026-05-16 06:09:22 +00:00
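
Note on commits 647c03b2ee and ed4f501bba: the K-major layout they describe can be checked with plain stride arithmetic. The sketch below is not the repository's make_b_k_major; it only models what a permute, a contiguous copy, and a second permute do to the shape and strides of an (E, K, N) tensor. The helper names and the dimension order (0, 2, 1) are assumptions made for illustration, since the commit message writes the permute informally as permute(K,N).

```python
def contiguous_strides(shape):
    """Row-major (C-contiguous) element strides for a given shape."""
    strides = []
    running = 1
    for dim in reversed(shape):
        strides.append(running)
        running *= dim
    return tuple(reversed(strides))


def permute(shape, strides, order):
    """Reorder dimensions without moving data, like a tensor view permute."""
    return (tuple(shape[i] for i in order), tuple(strides[i] for i in order))


def make_k_major(shape):
    """Model permute(0, 2, 1) -> contiguous copy -> permute(0, 2, 1)."""
    # Step 1: view (E, K, N) as (E, N, K). Only the shape matters here because
    # step 2 materializes a fresh row-major copy in that dimension order.
    swapped_shape, _ = permute(shape, contiguous_strides(shape), (0, 2, 1))
    # Step 2: the contiguous copy has row-major strides for (E, N, K).
    copied_strides = contiguous_strides(swapped_shape)
    # Step 3: permute back, restoring the (E, K, N) shape with new strides.
    return permute(swapped_shape, copied_strides, (0, 2, 1))


E, K, N = 4, 6, 8
shape, strides = make_k_major((E, K, N))
print(shape, strides)  # shape is unchanged; strides[1] == 1 marks K-major
```

A plain row-major (E, K, N) tensor has stride 1 on the last dimension, which is the N-major case the second commit distinguishes from K-major.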
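
Note on commit baf44c92f8: the lookup-table quantization it describes (clamp to [0, 6], scale to half-integer steps, round, map through a 13-entry table) can be sketched on scalars as follows. This is an illustration under assumptions, not the repository's quantize_to_nvfp4: the function names are hypothetical, the sign is assumed to be handled separately so only the magnitude is quantized, rounding is half-up, and ties between two E2M1 values go to the smaller one. The commit does not state its tie rule.

```python
# The eight non-negative magnitudes representable in E2M1.
E2M1_VALUES = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]


def build_step_to_idx():
    """Build the 13-entry table mapping half-integer steps 0..12 to E2M1 indices."""
    lut = []
    for step in range(13):
        value = step / 2.0
        # Nearest representable magnitude; min() keeps the first (smaller)
        # candidate on ties, which is an assumed tie-breaking rule.
        idx = min(range(len(E2M1_VALUES)), key=lambda i: abs(E2M1_VALUES[i] - value))
        lut.append(idx)
    return lut


STEP_TO_IDX = build_step_to_idx()


def quantize_magnitude(x):
    """Return the E2M1 index for |x| using clamp, scale, round, and table lookup."""
    clamped = min(abs(x), 6.0)        # clamp the magnitude to [0, 6]
    step = int(clamped * 2.0 + 0.5)   # scale to half steps, round half up
    return STEP_TO_IDX[step]


print(STEP_TO_IDX)
print([E2M1_VALUES[quantize_magnitude(v)] for v in (0.2, 1.0, 2.4, 100.0)])
```

The table replaces a per-element distance computation against all eight values, which is the memory saving the commit message reports. Rounding to a half step first is an approximation, so inputs near a midpoint may land on a different neighbor than an exact nearest-value search would choose.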
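
Note on commit 5121074782: the expert-offset and row-mapping steps it lists (bincount, cumsum, searchsorted) can be sketched with standard-library stand-ins. The code below is an assumption-laden illustration rather than the runner's implementation: the function names are hypothetical, the loops stand in for torch.bincount and torch.cumsum, and bisect_right stands in for a right-sided torch.searchsorted over the cumulative end offsets.

```python
from bisect import bisect_right
from itertools import accumulate


def tokens_per_expert(expert_ids, num_experts):
    """Count slots routed to each expert (stand-in for a bincount)."""
    counts = [0] * num_experts
    for expert in expert_ids:
        counts[expert] += 1
    return counts


def expert_end_offsets(counts):
    """Inclusive prefix sum: entry e is one past the last row of expert e."""
    return list(accumulate(counts))


def rows_to_experts(end_offsets, num_rows):
    """Map each row to its expert with a right-sided sorted search."""
    # bisect_right skips experts with zero rows, because their end offset
    # equals the previous expert's end offset.
    return [bisect_right(end_offsets, row) for row in range(num_rows)]


# Slots already sorted by expert id; expert 1 receives no tokens.
sorted_expert_ids = [0, 0, 2, 2, 2]
counts = tokens_per_expert(sorted_expert_ids, 3)
ends = expert_end_offsets(counts)
row_expert = rows_to_experts(ends, len(sorted_expert_ids))
print(counts, ends, row_expert)
```

Because every step is a whole-array operation with no data-dependent Python branching, the tensor version needs no .item() or .tolist() call, which is the property the commit is after.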


150 Commits