nvfp4-megamoe-kernel

Author	SHA1	Message	Date
biondizzle	35fab6cff3	Replace autograd.Function with torch.library.custom_op for Dynamo compat Dynamo (torch.compile fullgraph) cannot trace through CuTeDSL internals (cute.compile, JIT, etc.). The autograd.Function approach was unreliable with fullgraph mode — Dynamo would still try to trace through it. Fix: torch.library.custom_op makes Dynamo treat our GEMM as an opaque black box. No reimplementing the kernel — just route through the existing runner via a registry pattern: - Runners registered in global dict with integer IDs - Custom op takes (tensors, runner_id, shape_hint) -> tensor - Dynamo calls fake impl for shape inference, never touches the runner - At execution time, real impl looks up runner and calls _run_impl Changes: - New: cutedsl/custom_ops.py (custom op definitions + registry) - New: tests/test_custom_op.py (local unit tests, no GPU needed) - Removed: _Nvfp4LinearApply, _MoEApply (autograd.Function classes) - Updated: nvfp4_linear.py, runner.py, cutedsl.py, nvfp4_cutedsl.py to use custom ops instead of autograd.Function - Updated: cutedsl_quant_method.py to use custom op + registry	2026-05-19 01:54:48 +00:00
biondizzle	48386e34ad	Fix torch.compile: use custom autograd Function instead of @torch.compiler.disable torch.compile fullgraph mode can't handle @torch.compiler.disable (skips the function and refuses to compile). Custom autograd Functions are treated as opaque ops by torch.compile — they execute eagerly without the compiler trying to trace into CuTeDSL internals (JIT, Path.cwd, etc).	2026-05-18 21:38:28 +00:00
biondizzle	85e1cd3b69	Fix torch.compile crash: @torch.compiler.disable on all CuTeDSL run() CuTeDSL internals (Path.cwd, threading, JIT) are incompatible with torch.dynamo tracing. Marking run() as compiler-disabled makes the runners opaque to torch.compile — they execute eagerly while the rest of the model gets compiled.	2026-05-18 21:07:35 +00:00
biondizzle	a94011ec92	Fix torch.compile crash: remove threading.Lock from LUT cache path The _NVFP4_STEP_LUT_LOCK caused 'Unsupported context manager' under torch.compile/cudagraph. LUT is now pre-populated during warmup so the fast path (cache hit) never hits a lock. Also removed all init/warmup debug prints from CuTeDSL kernels.	2026-05-18 20:54:55 +00:00
biondizzle	450793311c	Wire CuTeDSL kernels into vLLM: replace all BF16 dequant with native NVFP4 - CuTeDSLNvfp4Method: custom quant method that creates CuTeDSL runners during process_weights_after_loading, then swaps to CuTeDSLNvfp4LinearMethod for forward dispatch - Attention projections (fused_wqa_wkv, wq_b, wo_b) now route through CuTeDSLNvfp4Linear (cosine 0.992-0.996 vs BF16 reference) - Shared expert now uses CuTeDSLSharedExpertRunner (cosine 0.992 vs BF16) with monkey-patched forward for fused L1+SiLU+L2 pipeline - Deleted all BF16 dequant code (_dequant_nvfp4_to_bf16, _post_quant_fix, input_scale fixes) - Deleted _post_quant_fix hook from utils.py - Fixed SwiGLU clamp: gate clamped BEFORE SiLU (matching SiluAndMulWithClamp) - Cleaned up all debug prints - Updated Dockerfile with new kernel files	2026-05-18 20:27:42 +00:00
biondizzle	87582fc9f7	HOTFIX: remove NaN checks from run() — torch.isnan().any() does CPU-GPU sync, breaks cudagraph	2026-05-17 22:28:32 +00:00
biondizzle	8717e0e411	Fix warmup: use same padded GEMM path as run(), add swiglu_limit clamping	2026-05-17 22:03:48 +00:00
biondizzle	d332f4f900	Add NaN debug checks after L1 and L2 GEMM	2026-05-17 22:02:24 +00:00
biondizzle	2796bd81e8	Fix: scatter FP4 as uint8 (float4 doesn't support index_put)	2026-05-17 21:28:04 +00:00
biondizzle	364f8372bb	Fix FP4 buffer shapes: D//2 for packed dimensions	2026-05-17 21:26:46 +00:00
biondizzle	803e7160d8	Fix: allocate FP4 buffers as uint8 then view-cast	2026-05-17 21:25:04 +00:00
biondizzle	7256070dd3	FIX Bug 26: quantize slot tokens, not padded buffer The runner was quantizing the padded_hidden (4096 rows) and then taking x_sf[:num_slots] (first 48 rows). This only got scales for expert 0 (the first 48 rows of the padded buffer), not the scales for tokens scattered across padded positions (expert 1 at row 128, etc). Fix: quantize slot_hidden (sorted tokens, num_slots rows) to get correct per-token x_sf, then scatter x_fp4 into padded FP4 buffer for the GEMM. The scale assembly now receives the correct x_sf. Added hidden_fp4 and activated_fp4 padded buffers for FP4 scatter.	2026-05-17 21:24:43 +00:00
biondizzle	a10c582cf4	Add swiglu_limit=10.0 activation clamping (was missing) DeepSeek-V4 uses SiluAndMulWithClamp(10.0) which clamps: - silu(gate) to max 10.0 - up to [-10.0, 10.0] Our runner was doing plain F.silu(gate) * up without clamping. Large gate values could produce unbounded SiLU output, causing numerical issues in the L2 GEMM. This is likely contributing to garbage model output.	2026-05-17 17:52:16 +00:00
biondizzle	3f2f4e1882	Fix cudaErrorStreamCaptureUnsupported: no dynamic GPU-tensor slicing Dynamic slicing with GPU scalars (e.g. buf[:gpu_scalar]) is a CUDA operation not permitted during stream capture. Use full pre-allocated buffers instead of dynamic slices. The GEMM only reads rows indicated by expert_offsets, ignoring the zero padding. Also pass x_sf[:num_slots] (Python int slicing, cudagraph-safe) to scale assembly so it only processes real token scale data.	2026-05-17 17:24:26 +00:00
biondizzle	11b5aa5e37	Scale assembly: full-buffer swizzle, zero CPU syncs, no Python loops Removed .cpu().tolist() and per-expert Python loops. Apply the Blackwell 32_4_4 swizzle to the entire padded_x_sf buffer at once. The buffer is already 128-row aligned (padded per expert) and 4-col aligned, so the full-buffer swizzle produces the correct layout. The GEMM reads scale_a using padded_expert_offsets, which matches the scatter layout. Fully GPU, zero CPU syncs, cudagraph-safe.	2026-05-17 16:59:51 +00:00
biondizzle	94dec5922d	Scale assembly Phase 2: use CPU-computed offsets for Python slicing GPU scalars can't be used for Python indexing (requires sync). Compute padded_expert_offsets on CPU via .cpu().tolist() for the Python loop. This is OK for cudagraph: Python code only runs during capture, not replay. The GPU kernel launches recorded during capture are deterministic.	2026-05-17 16:56:52 +00:00
biondizzle	49c28e6562	Fix: use real padded expert offsets instead of fixed layout Root cause of garbage output: fixed-layout padding with max_chunks=ceil(avg) was too small for uneven expert assignment. Tokens beyond max_chunks*128 per expert were silently dropped (clamped_local overwrote the same row). Fix: compute padded_expert_offsets from actual tokens_per_expert (padded to 128). No clamping needed — each expert gets exactly the space it needs. Pass padded_expert_offsets to scale assembly and GEMM.	2026-05-17 16:55:47 +00:00
biondizzle	7c16f3cb46	Fix: init shared dict before using it, remove duplicate _output_buf	2026-05-17 16:06:58 +00:00
biondizzle	ea8acf9852	Share padded_x_sf and output buffers across layers to save ~300 MB Per-layer padded_xsf (2.4 MB) + output_buf (4.2 MB) × 60 layers = ~400 MB. Sharing reduces to ~3.6 MB total. Layers run sequentially during both capture and replay.	2026-05-17 16:05:53 +00:00
biondizzle	455ecb5631	Fix: define padded_max_slots before using it in shared buffer allocation	2026-05-17 15:47:38 +00:00
biondizzle	b1ac74bb4d	Fix shape mismatch: shared padded buffers, revert max_num_tokens cap Root cause: capping max_num_tokens to 512 made buffers too small for the actual 8192-token warmup. slot_hidden had 49152 rows but padded_hidden only had 6144. Fix: Revert the 512 cap. Use SHARED padded buffers (not per-layer) to avoid OOM. Only 72 MB total (not 4.3 GB) since layers run sequentially and reuse the same buffer. Cudagraph-safe since capture and replay both run layers sequentially on the same tensor.	2026-05-17 15:47:10 +00:00
biondizzle	faf7c8cc51	Debug: print runner max_num_tokens and max_chunks	2026-05-17 15:18:07 +00:00
biondizzle	c5af1aba6b	Fix OOB: size padded buffers for num_experts*max_chunks*128 padded_max_slots was computed from max_tokens*top_k (3072) but total_padded_slots in run() is num_experts*max_chunks*128 (6144). The buffer was too small, causing index out of bounds.	2026-05-17 14:59:45 +00:00
biondizzle	8ac8e20fa9	Fix OOM: cap buffer pre-allocation at cudagraph max capture size padded_hidden/activated buffers were sized for max_num_tokens=8192, which is 72 MB per layer × 60 layers = 4.3 GB → OOM with 178 GB GPUs (almost full from model + KV cache). Now cap at max cudagraph capture size (512 tokens). Eager-mode runs with >512 tokens will need dynamic allocation, but vLLM always uses cudagraph for inference after warmup.	2026-05-17 14:14:13 +00:00
biondizzle	5bb78564f5	Remove dynamic tensor allocation in scale assembly (cudagraph fix) Removed torch.zeros() call that created padded_expert_offsets during scale assembly. Now uses fixed layout computed from Python constants. Also removed dead reference to padded_expert_offsets variable.	2026-05-17 14:01:32 +00:00
biondizzle	8c31e78359	Fix cudagraph: fully fixed-layout per-expert sections, no GPU scalars in Python control flow - Each expert gets max_chunks*128 rows at fixed offsets (e*max_chunks*128) - Phase 1 scatters into fixed offsets with clamped local_row - Phase 2 reads from fixed offsets (pure Python arithmetic, no GPU sync) - padded_x_sf_buf sized for num_experts * max_chunks * 128 - padded_expert_offsets pre-computed in _allocate_buffers	2026-05-17 13:58:58 +00:00
biondizzle	ff74b33d2c	Fix cudagraph: static loop for per-expert scale swizzle The while loop had variable trip count (GPU scalar in condition), requiring CPU-GPU sync. Replaced with fixed max_chunks_per_expert iterations. Unused chunks are zero buffers (harmless for GEMM).	2026-05-17 13:56:52 +00:00
biondizzle	bf22b6f0e4	Fix scale assembly: variable-size per-expert padding matching GEMM offsets - Compute padded_expert_offsets from real expert_offsets (ceil to 128) - Scatter x_sf into padded positions matching those offsets - Per-expert swizzle in 128-row chunks (supports >128 tokens per expert) - Pad slot_hidden/activated using same padded offsets for GEMM input - Pre-allocated buffers sized for max_tokens*top_k (not num_experts*128)	2026-05-17 13:55:10 +00:00
biondizzle	bde81b95f4	Fix GEMM scale layout: pad to 128 tokens per expert Root cause of garbage output: the GEMM reads scale_a according to expert_offsets (e.g. [0, 500, 1024, ...]) but scale_a had data at fixed e*128 offsets. When expert 0 has 500 tokens, the GEMM reads scale_a[0:500] but only rows 0-127 had valid data. Fix: pad slot_hidden to num_experts*128 rows (128 per expert) and pass padded_expert_offsets=[0, 128, 256, ...] to the GEMM. Scale assembly's fixed 128-row layout now matches the GEMM's expectations. Padding tokens' GEMM output is discarded (scatter_add only uses sorted_token_ids for real tokens).	2026-05-17 13:19:31 +00:00
biondizzle	7e692c3aec	Fix cudaErrorStreamCaptureUnsupported: pre-allocate all tensors used during capture torch.full(), torch.zeros(), torch.arange() allocate new tensors during cudagraph capture, which triggers cudaErrorStreamCaptureUnsupported. Pre-allocate: - _l1_gsa_buf / _l2_gsa_buf (use .fill_() instead of torch.full) - _output_buf (use .zero_() on pre-allocated slice) - _row_indices_buf (pre-allocated arange, sliced during use)	2026-05-17 12:31:25 +00:00
biondizzle	b531a98f8f	Fix scale assembly: per-expert 128-row fixed slots, no dynamic sizing - Reverted from full-buffer swizzle to per-expert 128-row slots - Scatter into e*128 fixed positions (cudagraph-compatible, fixed shape) - Clamp local_row to 127 for experts with >128 tokens (GEMM uses expert_offsets) - Buffer sized for num_experts*128 rows (not max_tokens*top_k) - Add _warmup_done guard to only run warmup once (not 60x)	2026-05-17 11:10:59 +00:00
biondizzle	4445882ba7	Fix: return 2D scale tensor for GEMM (shape[1] access)	2026-05-17 09:59:57 +00:00
biondizzle	3cd910193c	Rewrite scale assembly: no .item() calls, no Python loops, fully GPU Apply to_blocked swizzle on entire padded buffer at once instead of per-expert loops. No .item()/.cpu() calls. Fully cudagraph-safe.	2026-05-17 09:59:12 +00:00
biondizzle	4f6217acb9	Fix padded_cols calculation in scale assembly	2026-05-17 09:58:09 +00:00
biondizzle	918aa8aede	Fix scale assembly output shape: reshape to 2D for GEMM	2026-05-17 09:57:27 +00:00
biondizzle	d9bae6d770	Fix OOB in scale assembly: size padded_x_sf for max tokens, fix top_k/max_num_tokens passing, support variable-size expert blocks Bug 9: padded_x_sf was sized for num_experts*128 rows, but with 8192 tokens and top_k=6, the actual padded row count can exceed 6144. Also: - Pass top_k and max_num_tokens from deepseek_v4.py (was defaulting to 8/8192) - Phase 2 of scale assembly now handles experts with >128 tokens (multiple 128-row chunks) - Remove debug prints	2026-05-17 09:56:28 +00:00
biondizzle	55ac60eb91	Add detailed debug prints for OOB investigation	2026-05-17 09:39:42 +00:00
biondizzle	fed3c417ba	Add debug OOB check for sorted_token_ids	2026-05-17 09:19:10 +00:00
biondizzle	ca3cba5bbd	Fix global→local expert ID remapping for EP and remove .cpu() sync Root cause of CUDA_ERROR_ASSERT index out of bounds: - topk_ids contains GLOBAL expert IDs (0-255) but runner treated them as local IDs (0-31 with EP=8). Tokens for non-local experts got wrong expert assignments, causing out-of-bounds scatter indices in _assemble_scales_cudagraph_safe. Fixes: 1. Add experts_start_idx param to CuTeDSLMoERunner 2. In run(), remap global→local IDs and zero weights for non-local experts 3. Move _token_indices from CPU to GPU (remove sort_idx.cpu() sync) 4. Add _fill_token_indices() and _needs_token_refill to handle CuTeDSL JIT GPU memory corruption (refill after first GEMM call)	2026-05-17 08:58:43 +00:00
biondizzle	1330e2b2cf	cleanup: remove debug prints, ready for testing Current state: - Token indices on CPU (avoids CuTeDSL GPU memory corruption) - Scale assembly uses per-expert swizzle + scatter (matches reference) - compute_activation_global_scales warmup gets ~0.97 cosine - expert_offsets passed without leading 0 (matches pipeline) - layertest + cudagraph_test pass	2026-05-17 08:30:41 +00:00
biondizzle	d635dcbbb6	fix: keep token_indices on CPU, index with CPU sort_idx CuTeDSL's cute.compile corrupts GPU memory during JIT compilation. Keeping token_indices on CPU and using sort_idx.cpu() for indexing avoids the corruption. The .to(device) call after indexing moves the result back to GPU for the hidden_states indexing.	2026-05-17 08:29:18 +00:00
biondizzle	235d5b314f	fix: fallback token indices allocation with verify+rebuild	2026-05-17 08:27:47 +00:00
biondizzle	dd0b3fd4f9	debug: print sorted_token_ids in warmup	2026-05-17 08:25:25 +00:00
biondizzle	04999d86cf	fix: add quantize_to_nvfp4 import	2026-05-17 08:24:57 +00:00
biondizzle	7073daaffa	fix: allocate token_indices on CPU, move to GPU AFTER JIT compilation CuTeDSL's cute.compile corrupts GPU memory during JIT compilation. Tensors allocated on GPU before/during compilation get zeroed. Fix: create token_indices on CPU, then .to(device) after JIT is done.	2026-05-17 08:22:51 +00:00
biondizzle	0e7b06b55c	debug: clone + sync token indices before JIT	2026-05-17 08:22:11 +00:00
biondizzle	70c0618361	fix: allocate token_indices before CuTeDSL JIT compilation CuTeDSL's cute.compile appears to corrupt GPU memory state, causing torch.arange to produce zero-filled tensors when allocated after the JIT compilation. Moving token_indices allocation before the weight stacking operations fixes the corruption.	2026-05-17 08:20:41 +00:00
biondizzle	2bbe04efd8	debug: remove assert, test token corruption	2026-05-17 08:19:45 +00:00
biondizzle	66627926c5	debug: int32 token indices with sync verify	2026-05-17 08:18:37 +00:00
biondizzle	da02a5dc11	debug: assert token indices are correct after allocation	2026-05-17 08:16:09 +00:00
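
Several of the commits above (49c28e6562, bf22b6f0e4) describe computing padded_expert_offsets by rounding each expert's token count up to a multiple of 128 and accumulating the results. The sketch below illustrates that arithmetic in plain Python. It is not the repository's code: the function name, the leading 0 in the returned list, and the choice to give a zero-token expert zero rows are assumptions made for illustration, and the real runner performs this on GPU tensors.

```python
def padded_expert_offsets(tokens_per_expert, align=128):
    """Hypothetical helper: cumulative row offsets after padding each
    expert's token count up to a multiple of `align`.

    Returns len(tokens_per_expert) + 1 entries, starting at 0 (an
    assumption; the log notes the pipeline passes offsets without it).
    """
    offsets = [0]
    for count in tokens_per_expert:
        # Integer ceil-to-multiple: -(-a // b) is ceil(a / b) for ints.
        padded = -(-count // align) * align
        offsets.append(offsets[-1] + padded)
    return offsets


# Example: expert 0 has 48 tokens, so expert 1 starts at row 128,
# matching the situation described in commit 7256070dd3.
example = padded_expert_offsets([48, 500, 0, 130])
print(example)
```

With uneven assignment, each expert gets exactly the padded space it needs, which is why the later commits drop the fixed max_chunks layout and its clamping.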


74 Commits