nvfp4-megamoe-kernel

Author	SHA1	Message	Date
biondizzle	fece06f746	Add unit tests for NVFP4 weight mapper and inverse RoPE BF16	2026-05-19 03:22:00 +00:00
biondizzle	b0b5113467	Fix weight mapper: compressor → attn.compressor (not mla_attn), quant weights_proj - The compressor is on attn.compressor (not attn.mla_attn.compressor) - weights_proj in indexer is NVFP4-quantized in our checkpoint	2026-05-19 03:20:41 +00:00
biondizzle	396a83ea56	Clean vLLM integration: use official paths, BF16 wo_a, proper weight mapper - deepseek_v4.py: Fresh upstream copy with minimal NVFP4 changes - wo_a uses quant_config=None (BF16 in NVFP4 checkpoint, no scales) - Added _make_deepseek_v4_nvfp4_weights_mapper() using official WeightsMapper API - Handles: self_attn→attn, mlp→ffn, gate_proj→w1, compressor renames, etc. - Mapper selected by quant_config.get_name() == 'modelopt_fp4' - deepseek_v4_attention.py: Fresh upstream copy with minimal NVFP4 changes - Removed _wo_a_act_quant and custom CuTeDSL wo_a runner - Added _apply_inv_rope_bf16() helper (inverse RoPE in BF16) - Detects BF16 wo_a (no weight_scale_inv) and uses BF16 path - FP8 einsum path kept as fallback for SM90 checkpoints - BF16 path: inverse RoPE → wo_a() → wo_b() (standard linear methods)	2026-05-19 03:13:38 +00:00
biondizzle	b856ee9315	Clean up debug scripts	2026-05-19 02:47:29 +00:00
biondizzle	05cdde1676	Fix wo_a: scatter each group's data at correct offset in padded buffer The grouped GEMM expects each group's tokens at their own offset range: - Group 0: rows [0, padded_T) - Group 1: rows [padded_T, 2*padded_T) - etc. Previously we wrote all groups' data contiguously starting at row 0, so group 1+ would read zeros from the padding area. Now we scatter each group's quantized activation at the correct offset. Also: - Size buffer for total_max_rows = padded_max * n_groups - Use assemble_scales_2d_side for multi-group scale assembly - Extract output per-group at correct offsets	2026-05-19 02:45:57 +00:00
biondizzle	8fe5546bb3	Fix debug script	2026-05-19 02:43:17 +00:00
biondizzle	788f0aa65a	Add step-by-step debug for wo_a	2026-05-19 02:43:05 +00:00
biondizzle	5f5b997fc3	Fix wo_a: permute to groups-first layout for grouped GEMM The grouped GEMM expects mat_a to be laid out contiguously per group: [all tokens for group0, all tokens for group1, ...] A simple reshape of (T, G, D) → (T*G, D) gives interleaved layout which is wrong. Fix: permute to (G, T, D) before flattening. Same fix for output: permute (G, T, R) → (T, G, R).	2026-05-19 02:41:32 +00:00
biondizzle	77e4970d93	Add debug script for wo_a quantization	2026-05-19 02:40:43 +00:00
biondizzle	80122b850b	Add debug script for wo_a	2026-05-19 02:39:55 +00:00
biondizzle	ae233ab648	Fix test: cos_sin_cache on CUDA device	2026-05-19 02:37:50 +00:00
biondizzle	882d4996ff	Replace DeepGEMM fp8_einsum with CuTeDSL NVFP4 for wo_a (o_proj) The B200 container crashes in DeepGEMM's fp8_einsum (t.dim() == N assertion in layout.hpp:39) when processing wo_a (o-projection first half) in the attention layer. The crash is caused by scale tensor dimension mismatch for the SM100 recipe (1, 1, 128). Instead of fighting DeepGEMM, replace the entire wo_a path with our own CuTeDSL NVFP4 kernel: 1. inverse_rope_bf16() — Python implementation of inverse RoPE (replaces fused_inv_rope_fp8_quant CUDA kernel) 2. CuTeDSLNvfp4WoA — NVFP4 grouped linear for wo_a using ScaledGroupedGemm with n_local_groups=8 groups 3. wo_a weight quantized to NVFP4 instead of FP8 (native NVFP4, no conversion to another quantization) Changes: - cutedsl/inverse_rope.py: BF16 inverse RoPE (conjugate rotation) - cutedsl/wo_a_grouped_linear.py: CuTeDSL NVFP4 grouped GEMM for wo_a - vllm/patches/deepseek_v4_attention.py: Use NVFP4 path when runner is initialized, keep DeepGEMM fallback - vllm/patches/deepseek_v4.py: Init NVFP4 runner instead of FP8 quant - tests/test_wo_a.py: Unit test for inverse RoPE + wo_a GEMM	2026-05-19 02:36:30 +00:00
biondizzle	bab1f75f29	Fix gs None error in legacy _ensure_stacked path	2026-05-19 02:17:53 +00:00
biondizzle	48fa64dfda	Eliminate weight copies: pass stacked checkpoint tensors directly Memory optimization for MoE weight processing: Before (3-4 copies of weights in memory): 1. Original checkpoint weights in layer.w13_weight (copy 1) 2. Per-expert permuted copies (copy 2) 3. torch.stack() in runner._ensure_stacked (copy 3) 4. make_b_k_major re-stride (copy 4) 5. Scales: permute then assemble_scales_3d_side un-permutes (wasted) After (1-2 copies): 1. View checkpoint as fp4 (NO copy — byte-preserving view) 2. Pass (E, N, K) stacked tensor directly to runner 3. Runner permutes to (E, K, N) contiguous (copy 1), frees stacked ref 4. make_b_k_major re-strides (copy 2), frees (E, K, N) ref 5. Scales: already (N, K_sf) from checkpoint, call assembly directly 6. Free layer.w13_weight etc. immediately after extracting views Also: assemble_scales_3d_side transposes (K_sf, N)→(N, K_sf) internally, but checkpoint scales are ALREADY (N, K_sf). Skip the double-transpose by calling assemble_raw_scales_2d3d_3d_side directly.	2026-05-19 02:16:43 +00:00
biondizzle	0612c1ab54	use proper backend	2026-05-19 02:08:18 +00:00
biondizzle	00fe63b56f	Fix compile test: add warmup for activation global scales	2026-05-19 01:57:16 +00:00
biondizzle	bba3bca4d3	Add torch.compile + custom op integration test	2026-05-19 01:56:46 +00:00
biondizzle	35fab6cff3	Replace autograd.Function with torch.library.custom_op for Dynamo compat Dynamo (torch.compile fullgraph) cannot trace through CuTeDSL internals (cute.compile, JIT, etc.). The autograd.Function approach was unreliable with fullgraph mode — Dynamo would still try to trace through it. Fix: torch.library.custom_op makes Dynamo treat our GEMM as an opaque black box. No reimplementing the kernel — just route through the existing runner via a registry pattern: - Runners registered in global dict with integer IDs - Custom op takes (tensors, runner_id, shape_hint) -> tensor - Dynamo calls fake impl for shape inference, never touches the runner - At execution time, real impl looks up runner and calls _run_impl Changes: - New: cutedsl/custom_ops.py (custom op definitions + registry) - New: tests/test_custom_op.py (local unit tests, no GPU needed) - Removed: _Nvfp4LinearApply, _MoEApply (autograd.Function classes) - Updated: nvfp4_linear.py, runner.py, cutedsl.py, nvfp4_cutedsl.py to use custom ops instead of autograd.Function - Updated: cutedsl_quant_method.py to use custom op + registry	2026-05-19 01:54:48 +00:00
biondizzle	98153002c0	Go back to torch.library.custom_op with correct GEMM impl allow_in_graph doesn't work — Dynamo can't create proxies for Python objects (the runner). The custom op approach requires only tensor args. This time the GEMM impl correctly: - Uses quantize_activation_nvfp4 for activation quantization - Pads x_fp4 via uint8 + view(float4) for torch.zeros compat - Assembles A-side scales with pad + swizzle - Uses int32 expert_offsets (CuTeDSL requirement) - Passes runner's pre-assembled mat_b, scale_b, gsb tensors	2026-05-19 01:24:41 +00:00
biondizzle	02c500bbb1	Switch to allow_in_graph for Dynamo opacity instead of custom op The custom op approach required reimplementing the GEMM (wrong scale assembly, wrong tensor formats, cudaErrorIllegalAddress). Instead, use torch.autograd.Function + torch._dynamo.allow_in_graph which tells Dynamo to treat the function as an opaque kernel call, while still using the runner's battle-tested _run_impl for the actual GEMM. allow_in_graph is the proper way to register opaque ops for Dynamo without reimplementing the computation.	2026-05-19 01:20:07 +00:00
biondizzle	581d87f9a6	Remove warmup forward from process_weights_after_loading The warmup custom op call hit cudaErrorIllegalAddress because our custom op GEMM implementation doesn't match the runner's call convention. Skip warmup for now — MoE kernel warmup handles CuTeDSL JIT cleanup.	2026-05-19 01:18:54 +00:00
biondizzle	5d49849156	Fix: torch.zeros doesn't support Float4_e2m1fn_x2 dtype Allocate as uint8 then view as float4_e2m1fn_x2 for padding buffer.	2026-05-19 01:15:24 +00:00
biondizzle	e1fcfc4f01	Add CuTeDSL warmup + CUDA sync after JIT compilation CuTeDSL cute.compile corrupts GPU memory. Add warmup forward + torch.cuda.synchronize() + health check after finalize_weights, matching the MoE runner pattern.	2026-05-19 01:11:44 +00:00
biondizzle	1d9c0f996c	Fix expert_offsets dtype: CuTeDSL expects int32 not int64 The DSLRuntimeError 'prev_off is Int32...update to Int64 inside if' was caused by passing int64 expert_offsets when the kernel expects int32.	2026-05-19 01:05:20 +00:00
biondizzle	b81200f427	Fix CuTeDSL NVFP4 linear: correct scale assembly in custom op - pad_and_swizzle_single takes 1 arg (2D tensor), not 4 - Inline the scale assembly logic: pad x_sf → swizzle → unsqueeze for 1 group - Remove unused CuTeDSLNvfp4Linear import from custom op impl	2026-05-19 01:01:42 +00:00
biondizzle	e0eb436914	Fix custom_op registration: use as decorator with proper type hints	2026-05-19 00:54:30 +00:00
biondizzle	c609e9ba3c	Use torch.library.custom_op for CuTeDSL NVFP4 linear GEMM Dynamo in fullgraph mode traces through torch.autograd.Function, hitting CuTeDSL JIT internals (Path.cwd) and crashing. Registering as a custom op makes it opaque to Dynamo — tracing calls the fake impl, real impl only runs during inference. Custom op: cutedsl::nvfp4_gemm(x, mat_b, scale_b, global_scale_b, in_features, out_features, activation_global_scale) -> Tensor Store finalized weight tensors on the layer (from runner._mat_b etc.) instead of the runner object, since custom ops can only accept tensors.	2026-05-19 00:50:43 +00:00
biondizzle	c043a11bcc	Register CuTeDSL as proper NvFp4LinearKernel for NVFP4 linear layers - Create CuTeDSLNvFp4LinearKernel extending NvFp4LinearKernel base class - Register it via init_nvfp4_linear_kernel() selection mechanism (inserted at top of _POSSIBLE_NVFP4_KERNELS, before FlashInfer) - process_weights_after_loading: uint8→FP4, permute, create CuTeDSL runner - apply_weights: route through CuTeDSL GEMM - Update Dockerfile: copy kernel + registration script - Fix attention: always use forward() for quantized compressor/indexer layers (dtype check was fragile after kernel swaps weights to dummy BF16)	2026-05-19 00:44:44 +00:00
biondizzle	358830925a	Fix unpack error: handle both tuple and tensor returns from NVFP4 forward()	2026-05-19 00:33:43 +00:00
biondizzle	d9dc042ff7	Fix compressor kv_score: use forward() for NVFP4 quantized weights Raw torch.mm doesn't work with packed uint8 NVFP4 weights. Use MergedColumnParallelLinear.forward() which handles dequantization.	2026-05-19 00:29:43 +00:00
biondizzle	10c14ddb49	Fix NVFP4 mapper: layer norms, hc params, indexer path, q_a_norm - input_layernorm → attn_norm, post_attention_layernorm → ffn_norm - hc_head.fn/base/scale → hc_head_fn/base/scale - attn_hc/ffn_hc → hc_attn/hc_ffn (dot to underscore) - q_a_norm → q_norm, sinks → attn_sink - Indexer params: self_attn.compressor.indexer → attn.indexer (not attn.mla_attn.compressor.indexer)	2026-05-19 00:24:26 +00:00
biondizzle	540e7ee8fc	Fix: layer.self_attn → layer.attn (model uses attn, not self_attn)	2026-05-19 00:14:09 +00:00
biondizzle	201a40e6c4	Fix zero-dim tensor concatenation in compressor scale buffer input_scale and weight_scale_2 are 0-dim scalars in the NVFP4 checkpoint. torch.cat can't concatenate scalars — reshape to 1-d first.	2026-05-19 00:10:13 +00:00
biondizzle	d41a48aa1f	Fix KeyError for missing stacked params (indexer.compressor) Not all layers have the same indexer structure. The stacking path was trying to access params that don't exist in params_dict. Added checks to skip missing stacked params instead of KeyError.	2026-05-18 23:54:02 +00:00
biondizzle	4b0d8263f6	Fix NameError: use print instead of logger (not imported)	2026-05-18 23:49:42 +00:00
biondizzle	e3c24769e2	Handle wo_a as bfloat16 (unquantized in NVFP4 checkpoint) o_a_proj is NOT quantized by modelopt in the checkpoint (bfloat16), but the attention forward pass expects FP8 (weight + weight_scale_inv). - Create wo_a with quant_config=None to load bfloat16 weights - Add FP8 quantization of wo_a in finalize_mega_moe_weights: per-tensor symmetric quantization to float8_e4m3fn + weight_scale_inv - This matches what the fused_inv_rope_fp8_quant + einsum expects	2026-05-18 23:41:39 +00:00
biondizzle	9d016aa1c0	Use print instead of logger for weight load debug	2026-05-18 23:30:58 +00:00
biondizzle	a6f61bda5d	Add debug logging for weight loading failures	2026-05-18 23:28:15 +00:00
biondizzle	eef0ef76af	Fix NVFP4 compressor scale loading: buffer and concatenate scale shards The stacked params mapping (wkv + wgate → fused_wkv_wgate) uses weight_loader(param, weight, shard_id), but PerTensorScaleParameter and ModelWeightParameter for NVFP4 scale params don't support shard_id in load_column_parallel_weight (asserts shape equality). Fix: buffer input_scale, weight_scale, weight_scale_2 for fused_wkv_wgate shards, then concatenate along dim 0 and copy_ into the param after all weights are loaded.	2026-05-18 23:24:08 +00:00
biondizzle	f74447bfd0	Proper NVFP4 integration: quantized compressor/indexer + mapper fixes Weight mapper fixes: - Reorder substr renames: compressor renames first, then .self_attn.compressor. → .attn.mla_attn.compressor., then indexer renames (so indexer keys end up under mla_attn after the compressor rename already fired) - Add compressor param renames: kv_proj→wkv, gate_proj→wgate, kv_norm→norm, position_bias→ape (checkpoint uses NVFP4 naming, model uses internal names) - Add indexer param renames: q_b_proj→wq_b, kv_proj→compressor.wkv, gate_proj→compressor.wgate, kv_norm→k_norm, position_bias→compressor.ape, weights_proj stays (structural: compressor.indexer → indexer.compressor) - Remove broken suffix renames (already fixed in prior commit) Model architecture fixes: - Patch deepseek_compressor.py to pass quant_config (was None, but NVFP4 checkpoint has quantized compressor weights with input_scale/weight_scale) - Patch deepseek_v4_attention.py indexer: weights_proj now uses quant_config (was None, but checkpoint has quantized weights) - Add indexer.compressor.fused_wkv_wgate stacking in load_weights Infrastructure: - Add deepseek_compressor.py to Dockerfile - Force MoE backend to flashinfer_cutedsl (was auto-selecting FLASHINFER_TRTLLM) - Update unit test to 50 cases (compressor + indexer + quantization scales)	2026-05-18 23:20:13 +00:00
biondizzle	17496b2615	Fix NVFP4 weights mapper: add prefix mappings, fix substr order - Add orig_to_new_prefix mappings (layers→model.layers, embed_tokens→model.embed_tokens, etc.) AutoWeightsLoader strips the model. prefix before the mapper runs, so these are required - Move .self_attn.compressor. → .attn.mla_attn.compressor. before .self_attn. → .attn. in substr_renames so compressor keys get the mla_attn prefix before the general rename - Remove suffix renames (head.weight→lm_head.weight, embed.weight→embed_tokens.weight) that were causing double-mapping since the NVFP4 checkpoint already uses lm_head/embed_tokens - Add unit test: tests/test_nvfp4_mapper.py (39 cases, no vLLM/CUDA needed)	2026-05-18 23:03:34 +00:00
biondizzle	b039123207	Fix NVFP4 mapper: add attention projection renames, remove norm_gate renames - Add specific .self_attn.{q_a,kv,q_b,o_a,o_b}_proj → .attn.{wq_a,wkv,wq_b,wo_a,wo_b} - Remove norm_gate suffix renames (nightly uses 'gate' not 'norm_gate') - Order substr renames: specific before general	2026-05-18 22:53:09 +00:00
biondizzle	ea648a9bc2	Fix NVFP4 mapper: keep model. prefix (model params use it)	2026-05-18 22:49:40 +00:00
biondizzle	1528d4e182	Fix NVFP4 mapper: strip model. prefix from checkpoint keys The NVFP4 checkpoint uses model.layers.* but vLLM's AutoWeightsLoader expects layers.* (relative to the model module). Strip the model. prefix instead of adding it.	2026-05-18 22:46:04 +00:00
biondizzle	5d37674fb1	Add cutedsl to MoEBackend type in kernel config	2026-05-18 22:38:41 +00:00
biondizzle	7409204d71	Use nightly's deepseek_v4.py + attention as base, add only NVFP4 mapper The upstream deepseek_v4.py has imports that don't exist in the nightly Docker image (norm_gate_linear, breakable_cudagraph, etc.). Use the nightly's own files as the base and add only the minimal NVFP4 changes: - Add _make_deepseek_v4_nvfp4_weights_mapper() for checkpoint key mapping - Select NVFP4 mapper when quant_config is modelopt_fp4 - cos_sin_cache float32 fix in attention - Remove utils.py patch (not needed)	2026-05-18 22:33:51 +00:00
biondizzle	a19ed4a18e	Remove breakable_cudagraph import (not in nightly)	2026-05-18 22:29:24 +00:00
biondizzle	b007937a68	Fix garbled imports in cutedsl/runner.py	2026-05-18 22:22:52 +00:00
biondizzle	a7ed8faec6	Proper NVFP4 integration: use ModelOptNvFp4Config + FusedMoE framework Major refactor to eliminate all post-load hacks: - deepseek_v4.py: use upstream model with NVFP4 weight mapper only (gate_proj→w1, up_proj→w3, down_proj→w2, .self_attn→.attn, .mlp→.ffn) - Add CuTeDSLMoEExperts as a FusedMoEExpertsModular subclass that wraps our CuTeDSL runner as a proper vLLM MoE backend - Register CUTEDSL backend in the NVFP4 oracle - Use ModelOptNvFp4Config for quantization dispatch (not DeepseekV4FP8Config) - ModelOptNvFp4LinearMethod handles NVFP4 attention/shared expert projections - Remove nvfp4_cutedsl.py, cutedsl_quant_method.py, utils.py from Dockerfile - CuTeDSL runner moved to cutedsl/runner.py for clean imports - cos_sin_cache float32 fix in deepseek_v4_attention.py No more monkey-patching, no _convert_nvfp4_post_load, no CuTeDSLNvfp4Method.	2026-05-18 22:19:23 +00:00
biondizzle	48386e34ad	Fix torch.compile: use custom autograd Function instead of @torch.compiler.disable torch.compile fullgraph mode can't handle @torch.compiler.disable (skips the function and refuses to compile). Custom autograd Functions are treated as opaque ops by torch.compile — they execute eagerly without the compiler trying to trace into CuTeDSL internals (JIT, Path.cwd, etc).	2026-05-18 21:38:28 +00:00
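
Commits 5f5b997fc3 and 05cdde1676 above describe the input layout the grouped GEMM expects for wo_a: all tokens of group 0 first, then all tokens of group 1, with each group starting at its own padded offset. The following is a minimal sketch of that indexing only. It uses plain nested lists in place of tensors, and the function names and the tiny shapes are hypothetical, not taken from the repository.

```python
def flatten_token_major(x):
    """Naive (T, G, D) -> (T*G, D) flatten: rows of different groups interleave."""
    return [list(row) for token in x for row in token]


def flatten_groups_first(x, padded_t):
    """Groups-first flatten with padding.

    Group g occupies rows [g * padded_t, (g + 1) * padded_t) of the output;
    rows past the real token count stay zero-filled.
    """
    n_tokens, n_groups, dim = len(x), len(x[0]), len(x[0][0])
    out = [[0] * dim for _ in range(n_groups * padded_t)]
    for g in range(n_groups):
        for t in range(n_tokens):
            out[g * padded_t + t] = list(x[t][g])
    return out


# T=2 tokens, G=2 groups, D=1; each value encodes (group, token) as 10*g + t + 1.
x = [[[10 * g + t + 1] for g in range(2)] for t in range(2)]

naive = flatten_token_major(x)
grouped = flatten_groups_first(x, padded_t=4)

print(naive)    # groups interleaved row by row
print(grouped)  # group 0 in rows 0..3, group 1 in rows 4..7
```

In the naive result, consecutive rows alternate between groups, which is the interleaving the first fix removes. In the groups-first result, group 1 begins at row `padded_t` rather than directly after group 0's last real token, which is the offset the second fix restores.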


350 Commits
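
Several of the mapper commits above (b039123207, 17496b2615) turn on the order of substring renames: specific patterns must run before general ones. The sketch below shows why, assuming renames are applied sequentially to each checkpoint key. The helper `rename_key` and the example key are hypothetical, and this is not vLLM's `WeightsMapper` implementation; only the rename pairs themselves come from the commit messages.

```python
def rename_key(key, substr_renames):
    """Apply each (old, new) substring rename in order, at most once per pair."""
    for old, new in substr_renames:
        if old in key:
            key = key.replace(old, new, 1)
    return key


# Specific projection rename listed before the general .self_attn. rename.
SPECIFIC_FIRST = [
    (".self_attn.o_a_proj.", ".attn.wo_a."),
    (".self_attn.", ".attn."),
    (".mlp.", ".ffn."),
]

# Same pairs, but the general rename fires first and hides the specific one.
GENERAL_FIRST = [SPECIFIC_FIRST[1], SPECIFIC_FIRST[0], SPECIFIC_FIRST[2]]

key = "layers.0.self_attn.o_a_proj.weight"  # hypothetical checkpoint key
good = rename_key(key, SPECIFIC_FIRST)
bad = rename_key(key, GENERAL_FIRST)

print(good)
print(bad)
```

With the general rename first, `.self_attn.` is already rewritten to `.attn.` by the time the specific pattern is tried, so `o_a_proj` is never mapped to `wo_a` and the key would not match a model parameter.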