nvfp4-megamoe-kernel

Author	SHA1	Message	Date
biondizzle	48fa64dfda	Eliminate weight copies: pass stacked checkpoint tensors directly. Memory optimization for MoE weight processing. Before (3-4 copies of weights in memory): (1) original checkpoint weights in layer.w13_weight (copy 1); (2) per-expert permuted copies (copy 2); (3) torch.stack() in runner._ensure_stacked (copy 3); (4) make_b_k_major re-stride (copy 4); (5) scales: permute, then assemble_scales_3d_side un-permutes (wasted work). After (1-2 copies): (1) view checkpoint as fp4 (no copy, a byte-preserving view); (2) pass the (E, N, K) stacked tensor directly to the runner; (3) runner permutes to (E, K, N) contiguous (copy 1) and frees the stacked ref; (4) make_b_k_major re-strides (copy 2) and frees the (E, K, N) ref; (5) scales are already (N, K_sf) from the checkpoint, so call assembly directly; (6) free layer.w13_weight etc. immediately after extracting views. Also: assemble_scales_3d_side transposes (K_sf, N)→(N, K_sf) internally, but checkpoint scales are already (N, K_sf), so skip the double transpose by calling assemble_raw_scales_2d3d_3d_side directly.	2026-05-19 02:16:43 +00:00
biondizzle	35fab6cff3	Replace autograd.Function with torch.library.custom_op for Dynamo compat. Dynamo (torch.compile fullgraph) cannot trace through CuTeDSL internals (cute.compile, JIT, etc.). The autograd.Function approach was unreliable in fullgraph mode: Dynamo would still try to trace through it. Fix: torch.library.custom_op makes Dynamo treat our GEMM as an opaque black box. The kernel is not reimplemented; calls are routed through the existing runner via a registry pattern: runners are registered in a global dict with integer IDs; the custom op takes (tensors, runner_id, shape_hint) -> tensor; Dynamo calls the fake impl for shape inference and never touches the runner; at execution time, the real impl looks up the runner and calls _run_impl. Changes: New: cutedsl/custom_ops.py (custom op definitions + registry). New: tests/test_custom_op.py (local unit tests, no GPU needed). Removed: _Nvfp4LinearApply, _MoEApply (autograd.Function classes). Updated: nvfp4_linear.py, runner.py, cutedsl.py, nvfp4_cutedsl.py to use custom ops instead of autograd.Function. Updated: cutedsl_quant_method.py to use custom op + registry.	2026-05-19 01:54:48 +00:00
biondizzle	b007937a68	Fix garbled imports in cutedsl/runner.py	2026-05-18 22:22:52 +00:00
biondizzle	a7ed8faec6	Proper NVFP4 integration: use ModelOptNvFp4Config + FusedMoE framework. Major refactor to eliminate all post-load hacks: in deepseek_v4.py, use the upstream model with an NVFP4 weight mapper only (gate_proj→w1, up_proj→w3, down_proj→w2, .self_attn→.attn, .mlp→.ffn); add CuTeDSLMoEExperts as a FusedMoEExpertsModular subclass that wraps our CuTeDSL runner as a proper vLLM MoE backend; register the CUTEDSL backend in the NVFP4 oracle; use ModelOptNvFp4Config for quantization dispatch (not DeepseekV4FP8Config); ModelOptNvFp4LinearMethod handles NVFP4 attention/shared expert projections; remove nvfp4_cutedsl.py, cutedsl_quant_method.py, utils.py from the Dockerfile; CuTeDSL runner moved to cutedsl/runner.py for clean imports; cos_sin_cache float32 fix in deepseek_v4_attention.py. No more monkey-patching, no _convert_nvfp4_post_load, no CuTeDSLNvfp4Method.	2026-05-18 22:19:23 +00:00
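
Note on a7ed8faec6: the weight-name mapping listed in that commit (gate_proj→w1, up_proj→w3, down_proj→w2, .self_attn→.attn, .mlp→.ffn) can be pictured as a plain substring rename over checkpoint parameter names. The sketch below is only an illustration in standard Python; NVFP4_NAME_MAP, remap_weight_name, and the example parameter names are hypothetical and are not taken from the repository, which may well use vLLM's own mapper machinery instead.

```python
# Hypothetical sketch of the substring renames listed in commit a7ed8faec6.
# NVFP4_NAME_MAP and remap_weight_name are illustrative names, not the repo's API.
NVFP4_NAME_MAP = {
    ".self_attn": ".attn",
    ".mlp": ".ffn",
    "gate_proj": "w1",
    "up_proj": "w3",
    "down_proj": "w2",
}


def remap_weight_name(name: str) -> str:
    """Apply every rename in NVFP4_NAME_MAP to a checkpoint parameter name."""
    for old, new in NVFP4_NAME_MAP.items():
        name = name.replace(old, new)
    return name


if __name__ == "__main__":
    # Example parameter names are made up for illustration.
    print(remap_weight_name("model.layers.0.mlp.experts.3.gate_proj.weight"))
    print(remap_weight_name("model.layers.0.self_attn.q_proj.weight"))
```

Names that contain none of the listed substrings pass through unchanged, so the same function can be applied to every key in a checkpoint.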

4 Commits
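
Note on 35fab6cff3: the registry pattern described in that commit (runners stored in a global dict under integer IDs, with the op receiving only a runner_id and looking the runner up at execution time) can be sketched without PyTorch. An integer handle is presumably used because a custom op's arguments are limited to tensors and simple scalar types, so the runner object itself cannot be passed. Everything below is an assumption for illustration: register_runner, get_runner, EchoRunner, and run_op are hypothetical names, and only _run_impl is a name that appears in the commit message. The actual torch.library.custom_op registration and fake impl are not shown.

```python
# Hypothetical, torch-free sketch of the runner registry from commit 35fab6cff3.
from itertools import count
from typing import Any, Dict, List

_RUNNERS: Dict[int, Any] = {}   # global registry: integer ID -> runner object
_next_id = count()              # monotonically increasing ID source


def register_runner(runner: Any) -> int:
    """Store a runner and return the integer handle the op will receive."""
    runner_id = next(_next_id)
    _RUNNERS[runner_id] = runner
    return runner_id


def get_runner(runner_id: int) -> Any:
    """Look a runner up by handle; raises KeyError for unknown IDs."""
    return _RUNNERS[runner_id]


class EchoRunner:
    """Stand-in for the real CuTeDSL runner; it just scales its inputs."""

    def __init__(self, scale: float) -> None:
        self.scale = scale

    def _run_impl(self, xs: List[float]) -> List[float]:
        return [x * self.scale for x in xs]


def run_op(xs: List[float], runner_id: int) -> List[float]:
    """Plays the role of the op's real implementation: resolve the handle, then run."""
    return get_runner(runner_id)._run_impl(xs)


if __name__ == "__main__":
    rid = register_runner(EchoRunner(2.0))
    print(run_op([1.0, 2.0, 3.0], rid))
```

In the real setup the shape-inference (fake) implementation would need only the input shapes and the shape hint, which is why it never has to touch the registry.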