nvfp4-megamoe-kernel

Author	SHA1	Message	Date
biondizzle	48fa64dfda	Eliminate weight copies: pass stacked checkpoint tensors directly Memory optimization for MoE weight processing: Before (3-4 copies of weights in memory): 1. Original checkpoint weights in layer.w13_weight (copy 1) 2. Per-expert permuted copies (copy 2) 3. torch.stack() in runner._ensure_stacked (copy 3) 4. make_b_k_major re-stride (copy 4) 5. Scales: permute then assemble_scales_3d_side un-permutes (wasted) After (1-2 copies): 1. View checkpoint as fp4 (NO copy — byte-preserving view) 2. Pass (E, N, K) stacked tensor directly to runner 3. Runner permutes to (E, K, N) contiguous (copy 1), frees stacked ref 4. make_b_k_major re-strides (copy 2), frees (E, K, N) ref 5. Scales: already (N, K_sf) from checkpoint, call assembly directly 6. Free layer.w13_weight etc. immediately after extracting views Also: assemble_scales_3d_side transposes (K_sf, N)→(N, K_sf) internally, but checkpoint scales are ALREADY (N, K_sf). Skip the double-transpose by calling assemble_raw_scales_2d3d_3d_side directly.	2026-05-19 02:16:43 +00:00
biondizzle	35fab6cff3	Replace autograd.Function with torch.library.custom_op for Dynamo compat Dynamo (torch.compile fullgraph) cannot trace through CuTeDSL internals (cute.compile, JIT, etc.). The autograd.Function approach was unreliable with fullgraph mode — Dynamo would still try to trace through it. Fix: torch.library.custom_op makes Dynamo treat our GEMM as an opaque black box. No reimplementing the kernel — just route through the existing runner via a registry pattern: - Runners registered in global dict with integer IDs - Custom op takes (tensors, runner_id, shape_hint) -> tensor - Dynamo calls fake impl for shape inference, never touches the runner - At execution time, real impl looks up runner and calls _run_impl Changes: - New: cutedsl/custom_ops.py (custom op definitions + registry) - New: tests/test_custom_op.py (local unit tests, no GPU needed) - Removed: _Nvfp4LinearApply, _MoEApply (autograd.Function classes) - Updated: nvfp4_linear.py, runner.py, cutedsl.py, nvfp4_cutedsl.py to use custom ops instead of autograd.Function - Updated: cutedsl_quant_method.py to use custom op + registry	2026-05-19 01:54:48 +00:00
biondizzle	98153002c0	Go back to torch.library.custom_op with correct GEMM impl allow_in_graph doesn't work — Dynamo can't create proxies for Python objects (the runner). The custom op approach requires only tensor args. This time the GEMM impl correctly: - Uses quantize_activation_nvfp4 for activation quantization - Pads x_fp4 via uint8 + view(float4) for torch.zeros compat - Assembles A-side scales with pad + swizzle - Uses int32 expert_offsets (CuTeDSL requirement) - Passes runner's pre-assembled mat_b, scale_b, gsb tensors	2026-05-19 01:24:41 +00:00
biondizzle	02c500bbb1	Switch to allow_in_graph for Dynamo opacity instead of custom op The custom op approach required reimplementing the GEMM (wrong scale assembly, wrong tensor formats, cudaErrorIllegalAddress). Instead, use torch.autograd.Function + torch._dynamo.allow_in_graph which tells Dynamo to treat the function as an opaque kernel call, while still using the runner's battle-tested _run_impl for the actual GEMM. allow_in_graph is the proper way to register opaque ops for Dynamo without reimplementing the computation.	2026-05-19 01:20:07 +00:00
biondizzle	581d87f9a6	Remove warmup forward from process_weights_after_loading The warmup custom op call hit cudaErrorIllegalAddress because our custom op GEMM implementation doesn't match the runner's call convention. Skip warmup for now — MoE kernel warmup handles CuTeDSL JIT cleanup.	2026-05-19 01:18:54 +00:00
biondizzle	5d49849156	Fix: torch.zeros doesn't support Float4_e2m1fn_x2 dtype Allocate as uint8 then view as float4_e2m1fn_x2 for padding buffer.	2026-05-19 01:15:24 +00:00
biondizzle	e1fcfc4f01	Add CuTeDSL warmup + CUDA sync after JIT compilation CuTeDSL cute.compile corrupts GPU memory. Add warmup forward + torch.cuda.synchronize() + health check after finalize_weights, matching the MoE runner pattern.	2026-05-19 01:11:44 +00:00
biondizzle	1d9c0f996c	Fix expert_offsets dtype: CuTeDSL expects int32 not int64 The DSLRuntimeError 'prev_off is Int32...update to Int64 inside if' was caused by passing int64 expert_offsets when the kernel expects int32.	2026-05-19 01:05:20 +00:00
biondizzle	b81200f427	Fix CuTeDSL NVFP4 linear: correct scale assembly in custom op - pad_and_swizzle_single takes 1 arg (2D tensor), not 4 - Inline the scale assembly logic: pad x_sf → swizzle → unsqueeze for 1 group - Remove unused CuTeDSLNvfp4Linear import from custom op impl	2026-05-19 01:01:42 +00:00
biondizzle	e0eb436914	Fix custom_op registration: use as decorator with proper type hints	2026-05-19 00:54:30 +00:00
biondizzle	c609e9ba3c	Use torch.library.custom_op for CuTeDSL NVFP4 linear GEMM Dynamo in fullgraph mode traces through torch.autograd.Function, hitting CuTeDSL JIT internals (Path.cwd) and crashing. Registering as a custom op makes it opaque to Dynamo — tracing calls the fake impl, real impl only runs during inference. Custom op: cutedsl::nvfp4_gemm(x, mat_b, scale_b, global_scale_b, in_features, out_features, activation_global_scale) -> Tensor Store finalized weight tensors on the layer (from runner._mat_b etc.) instead of the runner object, since custom ops can only accept tensors.	2026-05-19 00:50:43 +00:00
biondizzle	c043a11bcc	Register CuTeDSL as proper NvFp4LinearKernel for NVFP4 linear layers - Create CuTeDSLNvFp4LinearKernel extending NvFp4LinearKernel base class - Register it via init_nvfp4_linear_kernel() selection mechanism (inserted at top of _POSSIBLE_NVFP4_KERNELS, before FlashInfer) - process_weights_after_loading: uint8→FP4, permute, create CuTeDSL runner - apply_weights: route through CuTeDSL GEMM - Update Dockerfile: copy kernel + registration script - Fix attention: always use forward() for quantized compressor/indexer layers (dtype check was fragile after kernel swaps weights to dummy BF16)	2026-05-19 00:44:44 +00:00
biondizzle	358830925a	Fix unpack error: handle both tuple and tensor returns from NVFP4 forward()	2026-05-19 00:33:43 +00:00
biondizzle	d9dc042ff7	Fix compressor kv_score: use forward() for NVFP4 quantized weights Raw torch.mm doesn't work with packed uint8 NVFP4 weights. Use MergedColumnParallelLinear.forward() which handles dequantization.	2026-05-19 00:29:43 +00:00
biondizzle	10c14ddb49	Fix NVFP4 mapper: layer norms, hc params, indexer path, q_a_norm - input_layernorm → attn_norm, post_attention_layernorm → ffn_norm - hc_head.fn/base/scale → hc_head_fn/base/scale - attn_hc/ffn_hc → hc_attn/hc_ffn (dot to underscore) - q_a_norm → q_norm, sinks → attn_sink - Indexer params: self_attn.compressor.indexer → attn.indexer (not attn.mla_attn.compressor.indexer)	2026-05-19 00:24:26 +00:00
biondizzle	540e7ee8fc	Fix: layer.self_attn → layer.attn (model uses attn, not self_attn)	2026-05-19 00:14:09 +00:00
biondizzle	201a40e6c4	Fix zero-dim tensor concatenation in compressor scale buffer input_scale and weight_scale_2 are 0-dim scalars in the NVFP4 checkpoint. torch.cat can't concatenate scalars — reshape to 1-d first.	2026-05-19 00:10:13 +00:00
biondizzle	d41a48aa1f	Fix KeyError for missing stacked params (indexer.compressor) Not all layers have the same indexer structure. The stacking path was trying to access params that don't exist in params_dict. Added checks to skip missing stacked params instead of KeyError.	2026-05-18 23:54:02 +00:00
biondizzle	4b0d8263f6	Fix NameError: use print instead of logger (not imported)	2026-05-18 23:49:42 +00:00
biondizzle	e3c24769e2	Handle wo_a as bfloat16 (unquantized in NVFP4 checkpoint) o_a_proj is NOT quantized by modelopt in the checkpoint (bfloat16), but the attention forward pass expects FP8 (weight + weight_scale_inv). - Create wo_a with quant_config=None to load bfloat16 weights - Add FP8 quantization of wo_a in finalize_mega_moe_weights: per-tensor symmetric quantization to float8_e4m3fn + weight_scale_inv - This matches what the fused_inv_rope_fp8_quant + einsum expects	2026-05-18 23:41:39 +00:00
biondizzle	9d016aa1c0	Use print instead of logger for weight load debug	2026-05-18 23:30:58 +00:00
biondizzle	a6f61bda5d	Add debug logging for weight loading failures	2026-05-18 23:28:15 +00:00
biondizzle	eef0ef76af	Fix NVFP4 compressor scale loading: buffer and concatenate scale shards The stacked params mapping (wkv + wgate → fused_wkv_wgate) uses weight_loader(param, weight, shard_id), but PerTensorScaleParameter and ModelWeightParameter for NVFP4 scale params don't support shard_id in load_column_parallel_weight (asserts shape equality). Fix: buffer input_scale, weight_scale, weight_scale_2 for fused_wkv_wgate shards, then concatenate along dim 0 and copy_ into the param after all weights are loaded.	2026-05-18 23:24:08 +00:00
biondizzle	f74447bfd0	Proper NVFP4 integration: quantized compressor/indexer + mapper fixes Weight mapper fixes: - Reorder substr renames: compressor renames first, then .self_attn.compressor. → .attn.mla_attn.compressor., then indexer renames (so indexer keys end up under mla_attn after the compressor rename already fired) - Add compressor param renames: kv_proj→wkv, gate_proj→wgate, kv_norm→norm, position_bias→ape (checkpoint uses NVFP4 naming, model uses internal names) - Add indexer param renames: q_b_proj→wq_b, kv_proj→compressor.wkv, gate_proj→compressor.wgate, kv_norm→k_norm, position_bias→compressor.ape, weights_proj stays (structural: compressor.indexer → indexer.compressor) - Remove broken suffix renames (already fixed in prior commit) Model architecture fixes: - Patch deepseek_compressor.py to pass quant_config (was None, but NVFP4 checkpoint has quantized compressor weights with input_scale/weight_scale) - Patch deepseek_v4_attention.py indexer: weights_proj now uses quant_config (was None, but checkpoint has quantized weights) - Add indexer.compressor.fused_wkv_wgate stacking in load_weights Infrastructure: - Add deepseek_compressor.py to Dockerfile - Force MoE backend to flashinfer_cutedsl (was auto-selecting FLASHINFER_TRTLLM) - Update unit test to 50 cases (compressor + indexer + quantization scales)	2026-05-18 23:20:13 +00:00
biondizzle	17496b2615	Fix NVFP4 weights mapper: add prefix mappings, fix substr order - Add orig_to_new_prefix mappings (layers→model.layers, embed_tokens→model.embed_tokens, etc.) AutoWeightsLoader strips the model. prefix before the mapper runs, so these are required - Move .self_attn.compressor. → .attn.mla_attn.compressor. before .self_attn. → .attn. in substr_renames so compressor keys get the mla_attn prefix before the general rename - Remove suffix renames (head.weight→lm_head.weight, embed.weight→embed_tokens.weight) that were causing double-mapping since the NVFP4 checkpoint already uses lm_head/embed_tokens - Add unit test: tests/test_nvfp4_mapper.py (39 cases, no vLLM/CUDA needed)	2026-05-18 23:03:34 +00:00
biondizzle	b039123207	Fix NVFP4 mapper: add attention projection renames, remove norm_gate renames - Add specific .self_attn.{q_a,kv,q_b,o_a,o_b}_proj → .attn.{wq_a,wkv,wq_b,wo_a,wo_b} - Remove norm_gate suffix renames (nightly uses 'gate' not 'norm_gate') - Order substr renames: specific before general	2026-05-18 22:53:09 +00:00
biondizzle	ea648a9bc2	Fix NVFP4 mapper: keep model. prefix (model params use it)	2026-05-18 22:49:40 +00:00
biondizzle	1528d4e182	Fix NVFP4 mapper: strip model. prefix from checkpoint keys The NVFP4 checkpoint uses model.layers.* but vLLM's AutoWeightsLoader expects layers.* (relative to the model module). Strip the model. prefix instead of adding it.	2026-05-18 22:46:04 +00:00
biondizzle	5d37674fb1	Add cutedsl to MoEBackend type in kernel config	2026-05-18 22:38:41 +00:00
biondizzle	7409204d71	Use nightly's deepseek_v4.py + attention as base, add only NVFP4 mapper The upstream deepseek_v4.py has imports that don't exist in the nightly Docker image (norm_gate_linear, breakable_cudagraph, etc.). Use the nightly's own files as the base and add only the minimal NVFP4 changes: - Add _make_deepseek_v4_nvfp4_weights_mapper() for checkpoint key mapping - Select NVFP4 mapper when quant_config is modelopt_fp4 - cos_sin_cache float32 fix in attention - Remove utils.py patch (not needed)	2026-05-18 22:33:51 +00:00
biondizzle	a19ed4a18e	Remove breakable_cudagraph import (not in nightly)	2026-05-18 22:29:24 +00:00
biondizzle	b007937a68	Fix garbled imports in cutedsl/runner.py	2026-05-18 22:22:52 +00:00
biondizzle	a7ed8faec6	Proper NVFP4 integration: use ModelOptNvFp4Config + FusedMoE framework Major refactor to eliminate all post-load hacks: - deepseek_v4.py: use upstream model with NVFP4 weight mapper only (gate_proj→w1, up_proj→w3, down_proj→w2, .self_attn→.attn, .mlp→.ffn) - Add CuTeDSLMoEExperts as a FusedMoEExpertsModular subclass that wraps our CuTeDSL runner as a proper vLLM MoE backend - Register CUTEDSL backend in the NVFP4 oracle - Use ModelOptNvFp4Config for quantization dispatch (not DeepseekV4FP8Config) - ModelOptNvFp4LinearMethod handles NVFP4 attention/shared expert projections - Remove nvfp4_cutedsl.py, cutedsl_quant_method.py, utils.py from Dockerfile - CuTeDSL runner moved to cutedsl/runner.py for clean imports - cos_sin_cache float32 fix in deepseek_v4_attention.py No more monkey-patching, no _convert_nvfp4_post_load, no CuTeDSLNvfp4Method.	2026-05-18 22:19:23 +00:00
biondizzle	48386e34ad	Fix torch.compile: use custom autograd Function instead of @torch.compiler.disable torch.compile fullgraph mode can't handle @torch.compiler.disable (skips the function and refuses to compile). Custom autograd Functions are treated as opaque ops by torch.compile — they execute eagerly without the compiler trying to trace into CuTeDSL internals (JIT, Path.cwd, etc).	2026-05-18 21:38:28 +00:00
biondizzle	85e1cd3b69	Fix torch.compile crash: @torch.compiler.disable on all CuTeDSL run() CuTeDSL internals (Path.cwd, threading, JIT) are incompatible with torch.dynamo tracing. Marking run() as compiler-disabled makes the runners opaque to torch.compile — they execute eagerly while the rest of the model gets compiled.	2026-05-18 21:07:35 +00:00
biondizzle	a94011ec92	Fix torch.compile crash: remove threading.Lock from LUT cache path The _NVFP4_STEP_LUT_LOCK caused 'Unsupported context manager' under torch.compile/cudagraph. LUT is now pre-populated during warmup so the fast path (cache hit) never hits a lock. Also removed all init/warmup debug prints from CuTeDSL kernels.	2026-05-18 20:54:55 +00:00
biondizzle	6326222d68	Fix: add abstract create_weights to CuTeDSLNvfp4LinearMethod	2026-05-18 20:40:48 +00:00
biondizzle	450793311c	Wire CuTeDSL kernels into vLLM: replace all BF16 dequant with native NVFP4 - CuTeDSLNvfp4Method: custom quant method that creates CuTeDSL runners during process_weights_after_loading, then swaps to CuTeDSLNvfp4LinearMethod for forward dispatch - Attention projections (fused_wqa_wkv, wq_b, wo_b) now route through CuTeDSLNvfp4Linear (cosine 0.992-0.996 vs BF16 reference) - Shared expert now uses CuTeDSLSharedExpertRunner (cosine 0.992 vs BF16) with monkey-patched forward for fused L1+SiLU+L2 pipeline - Deleted all BF16 dequant code (_dequant_nvfp4_to_bf16, _post_quant_fix, input_scale fixes) - Deleted _post_quant_fix hook from utils.py - Fixed SwiGLU clamp: gate clamped BEFORE SiLU (matching SiluAndMulWithClamp) - Cleaned up all debug prints - Updated Dockerfile with new kernel files	2026-05-18 20:27:42 +00:00
biondizzle	1836e5fdc7	Add shared experts to post-quant BF16 dequant fix Shared experts also use FlashInferCutlassNvFp4LinearKernel with broken input_scale. They need the same BF16 dequant treatment. gate_up_proj and down_proj on ffn.shared_experts.	2026-05-18 19:27:49 +00:00
biondizzle	82ac648563	Patch utils.py the standard way: copy modified file into Docker image Instead of fragile inline Dockerfile patching, just copy a modified utils.py (with _post_quant_fix call) into the image, same pattern as deepseek_v4.py and deepseek_v4_attention.py patches.	2026-05-18 19:10:08 +00:00
biondizzle	75844a8361	Post-quant fix via Dockerfile patch to process_weights_after_loading Forward pre-hook approach didn't work — torch.compile and model wrappers bypass hooks. Instead, patch vLLM's utils.py to call model._post_quant_fix() at the end of process_weights_after_loading. This guarantees the fix runs AFTER quant methods set up their attrs. Dockerfile now patches: model_loader/utils.py → calls model._post_quant_fix() if it exists DeepseekV4ForCausalLM._post_quant_fix() dequantizes attention NVFP4 weights to BF16 and replaces quant_method.	2026-05-18 18:35:34 +00:00
biondizzle	a4ad5898c1	Fix post-quant hook: register on inner model, fix module refs vLLM V1 calls DeepseekV4Model.forward() directly, not DeepseekV4ForCausalLM.forward(). Hook on the outer model never fires. Moved hook to self.model (inner) and fixed module.model.layers → module.layers.	2026-05-18 18:15:36 +00:00
biondizzle	a51edd238e	Add post-quant-init forward hook to fix attention NVFP4 The key insight: process_weights_after_loading runs AFTER load_weights and sets up FlashInferCutlassNvFp4LinearKernel with broken input_global_scale_inv. Any fix inside load_weights gets overwritten. Solution: register a one-shot forward pre-hook that runs on the first forward call (guaranteed after all init). It dequantizes attention NVFP4 weights to BF16 and replaces quant_method with UnquantizedLinearMethod. Since process_weights_after_loading already ran, our changes won't be overwritten. Standalone test confirmed: all attention weights produce valid non-NaN output when dequantized to BF16.	2026-05-18 17:56:19 +00:00
biondizzle	2835cb040b	Fix input_scale BEFORE process_weights_after_loading runs Instead of dequantizing to BF16 (which gets overwritten by process_weights_after_loading), fix the input_scale parameter on the module before the quant method reads it. The quant method computes input_global_scale_inv = input_scale.max(), so fixing input_scale propagates the correct activation scale. Computes correct input_scale by temporarily dequantizing weight to BF16, running warmup forward, and computing act_amax. input_scale = 1/(act_amax * headroom).	2026-05-18 16:43:44 +00:00
biondizzle	2fc81ccac4	Revert to BF16 dequant for attention NVFP4 (input_scale fix was too early) process_weights_after_loading sets input_global_scale_inv AFTER _convert_nvfp4_post_load runs, so the fix couldn't find the attrs. Going back to BF16 dequant approach. The zeros in the dummy run are expected (attention_impl returns early with out.zero_()). Need to test with a real request under cudagraph_mode=NONE.	2026-05-18 16:23:41 +00:00
biondizzle	4a57399592	Add debug prints for input_global_scale_inv check	2026-05-18 15:59:59 +00:00
biondizzle	f86892e26b	Replace BF16 dequant with input_scale warmup fix for attention NVFP4 Instead of dequantizing attention weights to BF16 (which had issues with MergedColumnParallelLinear and different weight_scale_2 values), keep the NVFP4 path but fix the activation global scale. Compute correct input_global_scale_inv by: 1. Temporarily dequantizing weight to BF16 2. Running warmup forward with random input 3. Computing actual activation amax 4. Setting scale_inv = amax * headroom This preserves the original NVFP4 quantization pipeline.	2026-05-18 15:43:46 +00:00
biondizzle	301015b037	Remove all inline diagnostics — incompatible with torch.compile Data-dependent expressions (amax().item(), isnan().any().item()) cause Dynamo guard failures even when gated by os.environ. cudagraph_mode=NONE still uses torch.compile, so these break. Will need enforce-eager for diagnostics going forward.	2026-05-18 15:22:53 +00:00
biondizzle	2a2a42c6d6	Add attention-internal diagnostics: MLA output, FP8 quant output	2026-05-18 14:45:43 +00:00
biondizzle	5c1dda10f6	Add granular attention diagnostics: pre/post attn, embed, dequant stats	2026-05-18 14:24:14 +00:00
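
Note on the mapper commits (17496b2615, b039123207, f74447bfd0): they all turn on the order of substring renames, with specific rules placed before general ones so that compressor keys pick up the mla_attn prefix. The sketch below illustrates only that ordering effect. It assumes renames are applied as plain ordered substring replacements; `apply_renames` and the sample checkpoint key are hypothetical and are not vLLM's mapper API.

```python
# Hypothetical helper: applies (old, new) substring renames in list order.
# This is an illustration of rule ordering, not the real weights mapper.
def apply_renames(key, renames):
    for old, new in renames:
        if old in key:
            key = key.replace(old, new)
    return key


# Specific rule first, general rule second (the order the commits settle on).
SPECIFIC_FIRST = [
    (".self_attn.compressor.", ".attn.mla_attn.compressor."),
    (".self_attn.", ".attn."),
]
# Same rules, general rule first (the order the commits move away from).
GENERAL_FIRST = list(reversed(SPECIFIC_FIRST))

# Hypothetical checkpoint key, chosen only to exercise both rules.
key = "layers.0.self_attn.compressor.kv_proj.weight"

good = apply_renames(key, SPECIFIC_FIRST)
bad = apply_renames(key, GENERAL_FIRST)

print(good)  # compressor key lands under attn.mla_attn
print(bad)   # general rule fired first, so the mla_attn prefix is lost
```

With the general rule first, `.self_attn.` is already rewritten by the time the compressor rule runs, so the compressor rule never matches.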


163 Commits
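
Note on commit 35fab6cff3: the registry pattern it describes (runners stored in a global dict under integer IDs, an op that receives only tensors plus a runner ID, a fake impl used for shape inference, and a real impl that looks up the runner and calls `_run_impl`) can be sketched without PyTorch. This is a torch-free illustration of the registry half only: `register_runner`, `gemm_real_impl`, `gemm_fake_impl`, and `DummyRunner` are hypothetical names, Python lists stand in for tensors, and the actual code in cutedsl/custom_ops.py may differ.

```python
import itertools

# Global registry: integer ID -> runner object. The op itself never receives
# the runner, only the ID, so its arguments stay tensor/scalar-only.
_RUNNERS = {}
_NEXT_ID = itertools.count()


def register_runner(runner):
    """Store a runner and return the integer ID used to find it later."""
    runner_id = next(_NEXT_ID)
    _RUNNERS[runner_id] = runner
    return runner_id


def gemm_real_impl(x, runner_id):
    """Execution-time path: look up the runner and delegate to _run_impl."""
    return _RUNNERS[runner_id]._run_impl(x)


def gemm_fake_impl(x_shape, out_features):
    """Tracing-time path: compute the output shape without touching a runner."""
    return (*x_shape[:-1], out_features)


class DummyRunner:
    """Stand-in for a real kernel runner; scales each element of a list."""

    def __init__(self, scale):
        self.scale = scale

    def _run_impl(self, x):
        return [v * self.scale for v in x]


rid = register_runner(DummyRunner(scale=2))
result = gemm_real_impl([1, 2, 3], rid)
shape = gemm_fake_impl((4, 8), 16)
print(result, shape)
```

The point of the indirection is that the traced signature contains nothing but tensors and plain integers, which is the constraint commit 98153002c0 cites for abandoning `allow_in_graph`.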