nvfp4-megamoe-kernel

biondizzle/nvfp4-megamoe-kernel

Fork 0

Commit Graph

Author	SHA1	Message	Date
biondizzle	48fa64dfda	Eliminate weight copies: pass stacked checkpoint tensors directly Memory optimization for MoE weight processing: Before (3-4 copies of weights in memory): 1. Original checkpoint weights in layer.w13_weight (copy 1) 2. Per-expert permuted copies (copy 2) 3. torch.stack() in runner._ensure_stacked (copy 3) 4. make_b_k_major re-stride (copy 4) 5. Scales: permute then assemble_scales_3d_side un-permutes (wasted) After (1-2 copies): 1. View checkpoint as fp4 (NO copy — byte-preserving view) 2. Pass (E, N, K) stacked tensor directly to runner 3. Runner permutes to (E, K, N) contiguous (copy 1), frees stacked ref 4. make_b_k_major re-strides (copy 2), frees (E, K, N) ref 5. Scales: already (N, K_sf) from checkpoint, call assembly directly 6. Free layer.w13_weight etc. immediately after extracting views Also: assemble_scales_3d_side transposes (K_sf, N)→(N, K_sf) internally, but checkpoint scales are ALREADY (N, K_sf). Skip the double-transpose by calling assemble_raw_scales_2d3d_3d_side directly.	2026-05-19 02:16:43 +00:00
biondizzle	a7ed8faec6	Proper NVFP4 integration: use ModelOptNvFp4Config + FusedMoE framework Major refactor to eliminate all post-load hacks: - deepseek_v4.py: use upstream model with NVFP4 weight mapper only (gate_proj→w1, up_proj→w3, down_proj→w2, .self_attn→.attn, .mlp→.ffn) - Add CuTeDSLMoEExperts as a FusedMoEExpertsModular subclass that wraps our CuTeDSL runner as a proper vLLM MoE backend - Register CUTEDSL backend in the NVFP4 oracle - Use ModelOptNvFp4Config for quantization dispatch (not DeepseekV4FP8Config) - ModelOptNvFp4LinearMethod handles NVFP4 attention/shared expert projections - Remove nvfp4_cutedsl.py, cutedsl_quant_method.py, utils.py from Dockerfile - CuTeDSL runner moved to cutedsl/runner.py for clean imports - cos_sin_cache float32 fix in deepseek_v4_attention.py No more monkey-patching, no _convert_nvfp4_post_load, no CuTeDSLNvfp4Method.	2026-05-18 22:19:23 +00:00

Author

SHA1

Message

Date

biondizzle

48fa64dfda

Eliminate weight copies: pass stacked checkpoint tensors directly

Memory optimization for MoE weight processing:

Before (3-4 copies of weights in memory):
1. Original checkpoint weights in layer.w13_weight (copy 1)
2. Per-expert permuted copies (copy 2)
3. torch.stack() in runner._ensure_stacked (copy 3)
4. make_b_k_major re-stride (copy 4)
5. Scales: permute then assemble_scales_3d_side un-permutes (wasted)

After (1-2 copies):
1. View checkpoint as fp4 (NO copy — byte-preserving view)
2. Pass (E, N, K) stacked tensor directly to runner
3. Runner permutes to (E, K, N) contiguous (copy 1), frees stacked ref
4. make_b_k_major re-strides (copy 2), frees (E, K, N) ref
5. Scales: already (N, K_sf) from checkpoint, call assembly directly
6. Free layer.w13_weight etc. immediately after extracting views

Also: assemble_scales_3d_side transposes (K_sf, N)→(N, K_sf) internally,
but checkpoint scales are ALREADY (N, K_sf). Skip the double-transpose
by calling assemble_raw_scales_2d3d_3d_side directly.

2026-05-19 02:16:43 +00:00

biondizzle

a7ed8faec6

Proper NVFP4 integration: use ModelOptNvFp4Config + FusedMoE framework

Major refactor to eliminate all post-load hacks:
- deepseek_v4.py: use upstream model with NVFP4 weight mapper only
  (gate_proj→w1, up_proj→w3, down_proj→w2, .self_attn→.attn, .mlp→.ffn)
- Add CuTeDSLMoEExperts as a FusedMoEExpertsModular subclass
  that wraps our CuTeDSL runner as a proper vLLM MoE backend
- Register CUTEDSL backend in the NVFP4 oracle
- Use ModelOptNvFp4Config for quantization dispatch (not DeepseekV4FP8Config)
- ModelOptNvFp4LinearMethod handles NVFP4 attention/shared expert projections
- Remove nvfp4_cutedsl.py, cutedsl_quant_method.py, utils.py from Dockerfile
- CuTeDSL runner moved to cutedsl/runner.py for clean imports
- cos_sin_cache float32 fix in deepseek_v4_attention.py

No more monkey-patching, no _convert_nvfp4_post_load, no CuTeDSLNvfp4Method.

2026-05-18 22:19:23 +00:00

2 Commits