nvfp4-megamoe-kernel

Author	SHA1	Message	Date
biondizzle	48fa64dfda	Eliminate weight copies: pass stacked checkpoint tensors directly Memory optimization for MoE weight processing: Before (3-4 copies of weights in memory): 1. Original checkpoint weights in layer.w13_weight (copy 1) 2. Per-expert permuted copies (copy 2) 3. torch.stack() in runner._ensure_stacked (copy 3) 4. make_b_k_major re-stride (copy 4) 5. Scales: permute then assemble_scales_3d_side un-permutes (wasted) After (1-2 copies): 1. View checkpoint as fp4 (NO copy — byte-preserving view) 2. Pass (E, N, K) stacked tensor directly to runner 3. Runner permutes to (E, K, N) contiguous (copy 1), frees stacked ref 4. make_b_k_major re-strides (copy 2), frees (E, K, N) ref 5. Scales: already (N, K_sf) from checkpoint, call assembly directly 6. Free layer.w13_weight etc. immediately after extracting views Also: assemble_scales_3d_side transposes (K_sf, N)→(N, K_sf) internally, but checkpoint scales are ALREADY (N, K_sf). Skip the double-transpose by calling assemble_raw_scales_2d3d_3d_side directly.	2026-05-19 02:16:43 +00:00
biondizzle	35fab6cff3	Replace autograd.Function with torch.library.custom_op for Dynamo compat Dynamo (torch.compile fullgraph) cannot trace through CuTeDSL internals (cute.compile, JIT, etc.). The autograd.Function approach was unreliable with fullgraph mode — Dynamo would still try to trace through it. Fix: torch.library.custom_op makes Dynamo treat our GEMM as an opaque black box. No reimplementing the kernel — just route through the existing runner via a registry pattern: - Runners registered in global dict with integer IDs - Custom op takes (tensors, runner_id, shape_hint) -> tensor - Dynamo calls fake impl for shape inference, never touches the runner - At execution time, real impl looks up runner and calls _run_impl Changes: - New: cutedsl/custom_ops.py (custom op definitions + registry) - New: tests/test_custom_op.py (local unit tests, no GPU needed) - Removed: _Nvfp4LinearApply, _MoEApply (autograd.Function classes) - Updated: nvfp4_linear.py, runner.py, cutedsl.py, nvfp4_cutedsl.py to use custom ops instead of autograd.Function - Updated: cutedsl_quant_method.py to use custom op + registry	2026-05-19 01:54:48 +00:00
biondizzle	b007937a68	Fix garbled imports in cutedsl/runner.py	2026-05-18 22:22:52 +00:00
biondizzle	a7ed8faec6	Proper NVFP4 integration: use ModelOptNvFp4Config + FusedMoE framework Major refactor to eliminate all post-load hacks: - deepseek_v4.py: use upstream model with NVFP4 weight mapper only (gate_proj→w1, up_proj→w3, down_proj→w2, .self_attn→.attn, .mlp→.ffn) - Add CuTeDSLMoEExperts as a FusedMoEExpertsModular subclass that wraps our CuTeDSL runner as a proper vLLM MoE backend - Register CUTEDSL backend in the NVFP4 oracle - Use ModelOptNvFp4Config for quantization dispatch (not DeepseekV4FP8Config) - ModelOptNvFp4LinearMethod handles NVFP4 attention/shared expert projections - Remove nvfp4_cutedsl.py, cutedsl_quant_method.py, utils.py from Dockerfile - CuTeDSL runner moved to cutedsl/runner.py for clean imports - cos_sin_cache float32 fix in deepseek_v4_attention.py No more monkey-patching, no _convert_nvfp4_post_load, no CuTeDSLNvfp4Method.	2026-05-18 22:19:23 +00:00
biondizzle	48386e34ad	Fix torch.compile: use custom autograd Function instead of @torch.compiler.disable torch.compile fullgraph mode can't handle @torch.compiler.disable (skips the function and refuses to compile). Custom autograd Functions are treated as opaque ops by torch.compile — they execute eagerly without the compiler trying to trace into CuTeDSL internals (JIT, Path.cwd, etc).	2026-05-18 21:38:28 +00:00
biondizzle	85e1cd3b69	Fix torch.compile crash: @torch.compiler.disable on all CuTeDSL run() CuTeDSL internals (Path.cwd, threading, JIT) are incompatible with torch.dynamo tracing. Marking run() as compiler-disabled makes the runners opaque to torch.compile — they execute eagerly while the rest of the model gets compiled.	2026-05-18 21:07:35 +00:00
biondizzle	a94011ec92	Fix torch.compile crash: remove threading.Lock from LUT cache path The _NVFP4_STEP_LUT_LOCK caused 'Unsupported context manager' under torch.compile/cudagraph. LUT is now pre-populated during warmup so the fast path (cache hit) never hits a lock. Also removed all init/warmup debug prints from CuTeDSL kernels.	2026-05-18 20:54:55 +00:00
biondizzle	450793311c	Wire CuTeDSL kernels into vLLM: replace all BF16 dequant with native NVFP4 - CuTeDSLNvfp4Method: custom quant method that creates CuTeDSL runners during process_weights_after_loading, then swaps to CuTeDSLNvfp4LinearMethod for forward dispatch - Attention projections (fused_wqa_wkv, wq_b, wo_b) now route through CuTeDSLNvfp4Linear (cosine 0.992-0.996 vs BF16 reference) - Shared expert now uses CuTeDSLSharedExpertRunner (cosine 0.992 vs BF16) with monkey-patched forward for fused L1+SiLU+L2 pipeline - Deleted all BF16 dequant code (_dequant_nvfp4_to_bf16, _post_quant_fix, input_scale fixes) - Deleted _post_quant_fix hook from utils.py - Fixed SwiGLU clamp: gate clamped BEFORE SiLU (matching SiluAndMulWithClamp) - Cleaned up all debug prints - Updated Dockerfile with new kernel files	2026-05-18 20:27:42 +00:00
biondizzle	6ce6a47be9	Add NVFP4 linear runner + attention projection test - CuTeDSLNvfp4Linear: generic single-GEMM runner for any NVFP4 projection - test_attention.py: tests q_a_proj, q_b_proj, kv_proj, o_b_proj vs BF16 - Same pad+swizzle pattern as shared expert, but no SiLU/fusion	2026-05-18 20:14:03 +00:00
biondizzle	70f50a1ec6	Fix scale assembly: use correctly-sized temp buffer for swizzle	2026-05-18 20:09:50 +00:00
biondizzle	97bdd604e9	Fix scale assembly: reshape swizzled output to 2D	2026-05-18 20:09:19 +00:00
biondizzle	c1aa4af123	Shared expert: dedicated CuTeDSL runner with proper scale assembly - CuTeDSLSharedExpertRunner: num_groups=1 GEMM, no scatter/routing - _assemble_scales_single_group: pad to 128 rows + Blackwell swizzle - All buffers pre-allocated for cudagraph compatibility - Updated test to use dedicated runner instead of MoE runner hack	2026-05-18 20:08:34 +00:00
biondizzle	e8b289e30d	WIP: CuTeDSL shared expert kernel Dedicated runner (shared_expert_pipeline.py) and test (test_shared_expert.py). Tried reusing MoE runner with 1 expert — fails because MoE runner assumes hidden_size != HC_DIM for scatter. Need dedicated runner with correct scale assembly. Will continue tomorrow.	2026-05-18 20:02:19 +00:00
biondizzle	e38d60a6e8	Add pipeline test with real model weights, add swiglu_limit to reference moe_pipeline	2026-05-17 18:07:44 +00:00
biondizzle	0a5cfe0433	add kernel compile caching — compile once, invoke on subsequent calls First call: cute.compile() with real tensors (warmup). Subsequent calls: just invoke compiled() with new CuTe views. No cute.compile() in the forward path = cudagraph-safe.	2026-05-16 20:45:46 +00:00
biondizzle	3465b9d471	remove torch.cuda.synchronize() from run_nvfp4_grouped_gemm (cudagraph-safe)	2026-05-16 20:42:49 +00:00
biondizzle	5e245bc0c6	fix: missing newline	2026-05-16 20:40:18 +00:00
biondizzle	288e179f88	add quantize_activation_nvfp4 (cudagraph-safe, fixed global scale)	2026-05-16 20:39:37 +00:00
biondizzle	521e11e468	test: old bridge + LUT quantization only (step 1 of cudagraph migration)	2026-05-16 20:37:42 +00:00
biondizzle	f51be76e8f	temp: restore EXACT old bridge.py from `b685112`	2026-05-16 20:34:45 +00:00
biondizzle	58dc36e21c	fix: compile fresh each call — cached compile produces wrong TMA descriptors The CuTeDSL kernel's TMA descriptors are bound to the compilation-time tensor addresses. Caching the compiled kernel and reusing it with different tensor allocations produces wrong memory access patterns (cosine 0.5 instead of 0.99). Fresh compilation is proven correct (cosine 0.989). We can optimize later with proper TMA descriptor reinitialization.	2026-05-16 20:28:15 +00:00
biondizzle	98cc6ac1f3	fix: invert cache check logic (compile when NOT in cache)	2026-05-16 20:25:16 +00:00
biondizzle	e337ec86a3	debug: test with cache enabled	2026-05-16 20:24:04 +00:00
biondizzle	bc56452be8	debug: disable kernel cache to test fresh compilation	2026-05-16 20:22:51 +00:00
biondizzle	647c03b2ee	fix: make_b_k_major must preserve shape — use double-permute trick permute(K,N).contiguous().permute(K,N) gives same (E,K,N) shape but with K-contiguous memory. Single permute changes the shape.	2026-05-16 20:19:21 +00:00
biondizzle	ed4f501bba	fix: make_b_k_major stride check — K-major means stride[1]==1, not stride[2]==1 For (E, K, N): stride[2]==1 is N-major (columns contiguous). K-major requires stride[1]==1 (rows contiguous).	2026-05-16 20:18:18 +00:00
biondizzle	2162cee4ad	fix: restore proper quantize_weight_to_nvfp4 — K is the packed dim, not N quantize_to_nvfp4() only packs the last dimension, but for weight matrices (K, N), K is the packed dimension. The weight quantizer reshapes (k_blocks, block_size, N) and computes block scales along the K block dimension. This was accidentally replaced with a simple delegation to quantize_to_nvfp4, producing wrong tensor shapes.	2026-05-16 20:16:28 +00:00
biondizzle	10f1dca982	fix: import ceil_div from correct module	2026-05-16 20:09:02 +00:00
biondizzle	81632e2f21	fix: correct cutlass_torch import (cutlass.torch, not top-level)	2026-05-16 20:08:21 +00:00
biondizzle	16c4fad025	fix: remove cutlass.cute.backend import	2026-05-16 20:06:38 +00:00
biondizzle	44b40d41fe	fix: compile CuTeDSL kernel with real tensors, not dummy shapes The kernel's TMA descriptors are sized from compilation-time shapes. Dummy 256x256 caused wrong memory access for real 3584x6144 data. Now compiles with actual runtime tensors on first use, cached by (num_experts, K, N). Compilation happens once during warmup. Forward call remains cudagraph-safe.	2026-05-16 20:05:59 +00:00
biondizzle	79281b6fda	fix: compute K_packed/N_packed before passing to _get_compiled_kernel	2026-05-16 20:00:35 +00:00
biondizzle	caf93d6c45	fix: pass K_packed/N_packed to _get_compiled_kernel	2026-05-16 19:59:43 +00:00
biondizzle	ecc7b83334	fix: compile CuTeDSL kernel with actual tensor shapes, not dummy 256x256 The compiled kernel's TMA descriptors are sized based on compilation shapes. Using dummy 256x256 shapes caused wrong memory access patterns for the real 3584x6144 data. Now uses actual K_packed and N_packed from the runtime tensors.	2026-05-16 19:58:13 +00:00
biondizzle	cc75a55bd9	restore: new bridge/moe_pipeline/layertest	2026-05-16 19:55:19 +00:00
biondizzle	0c878b3a9e	temp: restore old layertest+bridge for cosine comparison	2026-05-16 19:54:04 +00:00
biondizzle	0069769d12	debug: print global scales	2026-05-16 19:38:31 +00:00
biondizzle	84589fe984	debug: more prints	2026-05-16 19:31:54 +00:00
biondizzle	fa2d5708c5	debug: add L1 GEMM and SiLU output debug prints	2026-05-16 19:29:42 +00:00
biondizzle	4c06c51ec3	fix: moe_pipeline.py gate/up split — L1 output is 2*intermediate, not intermediate	2026-05-16 19:28:15 +00:00
biondizzle	28788c6f55	fix: L1 weight N dimension is 2*intermediate (gate+up), not intermediate float4_e2m1fn_x2 packs 2 values per byte along K, not N. The GEMM output N dimension is the logical N from mat_b.shape[2], not 2x packed. Previous n_dim*2 was wrong: it accidentally worked in the test because intermediate_size*2 == 2*intermediate_size. Real model with N=9216 exposed the bug.	2026-05-16 19:07:08 +00:00
biondizzle	2f68c7ba77	fix: cache E2M1 step_to_idx LUT per device (no CPU->CUDA copy in forward) torch.tensor() and new_tensor() both trigger CPU->CUDA copies during cudagraph capture. Pre-cache the LUT on first use per device.	2026-05-16 18:48:31 +00:00
biondizzle	6c298be842	fix: use new_tensor instead of torch.tensor for cudagraph (no CPU→CUDA copy) torch.tensor() creates on CPU then copies to CUDA, which is forbidden during cudagraph capture. new_tensor() creates directly on the source tensor's device.	2026-05-16 18:47:39 +00:00
biondizzle	5a79065b2b	fix: GEMM output should be 2x packed N (float4_e2m1fn_x2 packs 2 per element)	2026-05-16 18:27:44 +00:00
biondizzle	533089c9d2	fix: token_indices slice bug + torch.zeros for float4/float8 dtypes	2026-05-16 18:21:27 +00:00
biondizzle	7594968482	WIP: cudagraph-compatible CuTeDSL MoE runner - Cache compiled CuTeDSL kernel (compile once, reuse every forward) - Remove torch.cuda.synchronize() from forward path - Add quantize_activation_nvfp4() (no .max() CPU-GPU sync) - Pre-allocate buffers (token_indices, expert_id_range, output_bufs) - GPU-only expert offset computation (bincount + cumsum) - Replace Python for-loop scale assembly with GPU-vectorized version Still TODO: - Test with FULL_AND_PIECEWISE cudagraph mode - Add vllm::deepseek_v4_mega_moe_experts to splitting_ops - Verify CuTeDSL kernel launch is cudagraph-safe	2026-05-16 16:36:19 +00:00
biondizzle	baf44c92f8	fix: memory-efficient E2M1 quantization — no 32x distance tensor quantize_to_nvfp4 was allocating a (..., n_blocks, block_size, 8) float32 tensor for nearest-neighbor distances to all 8 E2M1 values. That's 32x the input size — 10.5GB for a typical batch, causing OOM with only 3GB free. New approach: clamp to [0, 6], scale to half-integer steps, round, then map through a 13-byte lookup table to E2M1 indices. Peak memory is now ~2x input (x_f32 + x_scaled) instead of 32x. This makes activation quantization CUDA-graph-safe for the memory-constrained DeepSeek-V4 on B200 (175GB model / 178GB GPU).	2026-05-16 07:49:38 +00:00
biondizzle	174ad70dca	fix: same gate/up split fix in moe_pipeline.py	2026-05-16 04:04:53 +00:00
biondizzle	09ff5c5b98	feat: full NVFP4 MoE pipeline (L1→SiLU→L2→scatter) cutedsl/moe_pipeline.py: complete pipeline - stage_activation: BF16 → NVFP4 (keeps data in FP4) - L1 GEMM: NVFP4 × NVFP4 → BF16 (gate+up) - SiLU(gate) * up: BF16 (only nonlinear, can't avoid) - Re-quantize: BF16 → NVFP4 (back to native) - L2 GEMM: NVFP4 × NVFP4 → BF16 (down_proj) - Scatter with routing weights → BF16 output layertest.py: now tests the FULL MoE pipeline against BF16 reference. NVFP4-native: both GEMMs use float4_e2m1fn_x2 for A and B, float8_e4m3fn for block scales, float32 for global scales. BF16 only for SiLU activation and final scatter.	2026-05-16 03:22:43 +00:00
biondizzle	0cdcc4144a	refactor: add cutedsl/bridge.py, rewrite layertest to use it bridge.py: clean API for CuTeDSL kernel - quantize_to_nvfp4 / quantize_weight_to_nvfp4 - assemble_scales_2d_side / assemble_scales_3d_side - make_b_k_major (stride conversion) - compute_expert_offsets - run_nvfp4_grouped_gemm (full kernel launch) layertest.py: now uses bridge layer, tests with real DeepSeek-V4 layer 0 weights (7168 hidden, 6144 intermediate). The bridge code will be reused by the vLLM integration layer.	2026-05-16 03:13:54 +00:00
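
Note on commit baf44c92f8 (memory-efficient E2M1 quantization): the message describes clamping to [0, 6], scaling to half-integer steps, rounding, and mapping through a 13-entry lookup table to E2M1 indices. The following is a minimal pure-Python sketch of that idea for a single scalar, not the repository's code. The names `E2M1_VALUES`, `STEP_LUT`, `quantize_e2m1`, and `dequantize_e2m1` are hypothetical, and the tie-breaking rule (ties go to the smaller magnitude) and the sign-bit layout are assumptions. Because the input is rounded to a half-integer step before the lookup, the result can differ from an exact nearest-neighbor search for inputs near 5.0.

```python
import math

# The eight non-negative magnitudes representable in E2M1 (FP4).
E2M1_VALUES = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]


def build_step_lut():
    """Map each half-integer step s (value s / 2, s = 0..12) to an E2M1 index."""
    lut = []
    for step in range(13):
        value = step / 2.0
        # min() keeps the first minimum, so ties go to the smaller magnitude.
        # This tie rule is an assumption; the real kernel may round differently.
        idx = min(range(8), key=lambda i: abs(E2M1_VALUES[i] - value))
        lut.append(idx)
    return lut


STEP_LUT = build_step_lut()  # 13 entries, one per half-integer step in [0, 6]


def quantize_e2m1(x, scale=1.0):
    """Return a 4-bit code: assumed layout is sign in bit 3, magnitude index in bits 0-2."""
    mag = min(abs(x) / scale, 6.0)               # clamp magnitude to [0, 6]
    step = int(math.floor(mag * 2.0 + 0.5))      # round to the nearest half-integer step
    sign = 1 if x < 0 else 0
    return (sign << 3) | STEP_LUT[step]


def dequantize_e2m1(code, scale=1.0):
    """Inverse of quantize_e2m1 for the assumed bit layout."""
    mag = E2M1_VALUES[code & 0x7] * scale
    return -mag if code & 0x8 else mag
```

The point of the table is that no per-element distance tensor over all eight candidates is needed: each value costs one clamp, one round, and one lookup.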


51 Commits
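
Note on commits ed4f501bba and 647c03b2ee (make_b_k_major): the messages state that for an (E, K, N) tensor, K-major means stride[1] == 1, and that a permute, a contiguous copy, and a second permute keep the (E, K, N) shape while making K the contiguous dimension. The sketch below reproduces only the stride arithmetic in pure Python so it can be checked without a GPU. The helper names are hypothetical, and the swap of the last two dimensions is an assumption about what the commit's shorthand `permute(K,N)` means.

```python
def contiguous_strides(shape):
    """Row-major (C-order) element strides for a freshly allocated tensor."""
    strides = []
    running = 1
    for dim in reversed(shape):
        strides.append(running)
        running *= dim
    return tuple(reversed(strides))


def permute(shape, strides, dims):
    """A permuted view reorders shape and strides together; no data moves."""
    return tuple(shape[d] for d in dims), tuple(strides[d] for d in dims)


def is_k_major(strides):
    """For (E, K, N), K-major means consecutive K elements are adjacent in memory."""
    return strides[1] == 1


def make_k_major_layout(shape):
    """Layout after: swap K and N, copy to contiguous memory, swap back."""
    e, k, n = shape
    swapped_shape = (e, n, k)
    # The contiguous copy gives the swapped view fresh row-major strides.
    swapped_strides = contiguous_strides(swapped_shape)
    # Swapping back restores the (E, K, N) shape but keeps the new memory order.
    return permute(swapped_shape, swapped_strides, (0, 2, 1))


shape, strides = make_k_major_layout((2, 4, 3))
```

For shape (2, 4, 3) the plain row-major strides are (12, 3, 1), which is N-major; after the double swap the strides are (12, 1, 4), so the shape is unchanged and stride[1] == 1.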