nvfp4-megamoe-kernel

Author	SHA1	Message	Date
biondizzle	10c14ddb49	Fix NVFP4 mapper: layer norms, hc params, indexer path, q_a_norm - input_layernorm → attn_norm, post_attention_layernorm → ffn_norm - hc_head.fn/base/scale → hc_head_fn/base/scale - attn_hc/ffn_hc → hc_attn/hc_ffn (dot to underscore) - q_a_norm → q_norm, sinks → attn_sink - Indexer params: self_attn.compressor.indexer → attn.indexer (not attn.mla_attn.compressor.indexer)	2026-05-19 00:24:26 +00:00
biondizzle	540e7ee8fc	Fix: layer.self_attn → layer.attn (model uses attn, not self_attn)	2026-05-19 00:14:09 +00:00
biondizzle	201a40e6c4	Fix zero-dim tensor concatenation in compressor scale buffer input_scale and weight_scale_2 are 0-dim scalars in the NVFP4 checkpoint. torch.cat can't concatenate scalars — reshape to 1-d first.	2026-05-19 00:10:13 +00:00
biondizzle	d41a48aa1f	Fix KeyError for missing stacked params (indexer.compressor) Not all layers have the same indexer structure. The stacking path was trying to access params that don't exist in params_dict. Added checks to skip missing stacked params instead of KeyError.	2026-05-18 23:54:02 +00:00
biondizzle	4b0d8263f6	Fix NameError: use print instead of logger (not imported)	2026-05-18 23:49:42 +00:00
biondizzle	e3c24769e2	Handle wo_a as bfloat16 (unquantized in NVFP4 checkpoint) o_a_proj is NOT quantized by modelopt in the checkpoint (bfloat16), but the attention forward pass expects FP8 (weight + weight_scale_inv). - Create wo_a with quant_config=None to load bfloat16 weights - Add FP8 quantization of wo_a in finalize_mega_moe_weights: per-tensor symmetric quantization to float8_e4m3fn + weight_scale_inv - This matches what the fused_inv_rope_fp8_quant + einsum expects	2026-05-18 23:41:39 +00:00
biondizzle	9d016aa1c0	Use print instead of logger for weight load debug	2026-05-18 23:30:58 +00:00
biondizzle	a6f61bda5d	Add debug logging for weight loading failures	2026-05-18 23:28:15 +00:00
biondizzle	eef0ef76af	Fix NVFP4 compressor scale loading: buffer and concatenate scale shards The stacked params mapping (wkv + wgate → fused_wkv_wgate) uses weight_loader(param, weight, shard_id), but PerTensorScaleParameter and ModelWeightParameter for NVFP4 scale params don't support shard_id in load_column_parallel_weight (asserts shape equality). Fix: buffer input_scale, weight_scale, weight_scale_2 for fused_wkv_wgate shards, then concatenate along dim 0 and copy_ into the param after all weights are loaded.	2026-05-18 23:24:08 +00:00
biondizzle	f74447bfd0	Proper NVFP4 integration: quantized compressor/indexer + mapper fixes Weight mapper fixes: - Reorder substr renames: compressor renames first, then .self_attn.compressor. → .attn.mla_attn.compressor., then indexer renames (so indexer keys end up under mla_attn after the compressor rename already fired) - Add compressor param renames: kv_proj→wkv, gate_proj→wgate, kv_norm→norm, position_bias→ape (checkpoint uses NVFP4 naming, model uses internal names) - Add indexer param renames: q_b_proj→wq_b, kv_proj→compressor.wkv, gate_proj→compressor.wgate, kv_norm→k_norm, position_bias→compressor.ape, weights_proj stays (structural: compressor.indexer → indexer.compressor) - Remove broken suffix renames (already fixed in prior commit) Model architecture fixes: - Patch deepseek_compressor.py to pass quant_config (was None, but NVFP4 checkpoint has quantized compressor weights with input_scale/weight_scale) - Patch deepseek_v4_attention.py indexer: weights_proj now uses quant_config (was None, but checkpoint has quantized weights) - Add indexer.compressor.fused_wkv_wgate stacking in load_weights Infrastructure: - Add deepseek_compressor.py to Dockerfile - Force MoE backend to flashinfer_cutedsl (was auto-selecting FLASHINFER_TRTLLM) - Update unit test to 50 cases (compressor + indexer + quantization scales)	2026-05-18 23:20:13 +00:00
biondizzle	17496b2615	Fix NVFP4 weights mapper: add prefix mappings, fix substr order - Add orig_to_new_prefix mappings (layers→model.layers, embed_tokens→model.embed_tokens, etc.) AutoWeightsLoader strips the model. prefix before the mapper runs, so these are required - Move .self_attn.compressor. → .attn.mla_attn.compressor. before .self_attn. → .attn. in substr_renames so compressor keys get the mla_attn prefix before the general rename - Remove suffix renames (head.weight→lm_head.weight, embed.weight→embed_tokens.weight) that were causing double-mapping since the NVFP4 checkpoint already uses lm_head/embed_tokens - Add unit test: tests/test_nvfp4_mapper.py (39 cases, no vLLM/CUDA needed)	2026-05-18 23:03:34 +00:00
biondizzle	b039123207	Fix NVFP4 mapper: add attention projection renames, remove norm_gate renames - Add specific .self_attn.{q_a,kv,q_b,o_a,o_b}_proj → .attn.{wq_a,wkv,wq_b,wo_a,wo_b} - Remove norm_gate suffix renames (nightly uses 'gate' not 'norm_gate') - Order substr renames: specific before general	2026-05-18 22:53:09 +00:00
biondizzle	ea648a9bc2	Fix NVFP4 mapper: keep model. prefix (model params use it)	2026-05-18 22:49:40 +00:00
biondizzle	1528d4e182	Fix NVFP4 mapper: strip model. prefix from checkpoint keys The NVFP4 checkpoint uses model.layers.* but vLLM's AutoWeightsLoader expects layers.* (relative to the model module). Strip the model. prefix instead of adding it.	2026-05-18 22:46:04 +00:00
biondizzle	7409204d71	Use nightly's deepseek_v4.py + attention as base, add only NVFP4 mapper The upstream deepseek_v4.py has imports that don't exist in the nightly Docker image (norm_gate_linear, breakable_cudagraph, etc.). Use the nightly's own files as the base and add only the minimal NVFP4 changes: - Add _make_deepseek_v4_nvfp4_weights_mapper() for checkpoint key mapping - Select NVFP4 mapper when quant_config is modelopt_fp4 - cos_sin_cache float32 fix in attention - Remove utils.py patch (not needed)	2026-05-18 22:33:51 +00:00
biondizzle	a7ed8faec6	Proper NVFP4 integration: use ModelOptNvFp4Config + FusedMoE framework Major refactor to eliminate all post-load hacks: - deepseek_v4.py: use upstream model with NVFP4 weight mapper only (gate_proj→w1, up_proj→w3, down_proj→w2, .self_attn→.attn, .mlp→.ffn) - Add CuTeDSLMoEExperts as a FusedMoEExpertsModular subclass that wraps our CuTeDSL runner as a proper vLLM MoE backend - Register CUTEDSL backend in the NVFP4 oracle - Use ModelOptNvFp4Config for quantization dispatch (not DeepseekV4FP8Config) - ModelOptNvFp4LinearMethod handles NVFP4 attention/shared expert projections - Remove nvfp4_cutedsl.py, cutedsl_quant_method.py, utils.py from Dockerfile - CuTeDSL runner moved to cutedsl/runner.py for clean imports - cos_sin_cache float32 fix in deepseek_v4_attention.py No more monkey-patching, no _convert_nvfp4_post_load, no CuTeDSLNvfp4Method.	2026-05-18 22:19:23 +00:00
biondizzle	450793311c	Wire CuTeDSL kernels into vLLM: replace all BF16 dequant with native NVFP4 - CuTeDSLNvfp4Method: custom quant method that creates CuTeDSL runners during process_weights_after_loading, then swaps to CuTeDSLNvfp4LinearMethod for forward dispatch - Attention projections (fused_wqa_wkv, wq_b, wo_b) now route through CuTeDSLNvfp4Linear (cosine 0.992-0.996 vs BF16 reference) - Shared expert now uses CuTeDSLSharedExpertRunner (cosine 0.992 vs BF16) with monkey-patched forward for fused L1+SiLU+L2 pipeline - Deleted all BF16 dequant code (_dequant_nvfp4_to_bf16, _post_quant_fix, input_scale fixes) - Deleted _post_quant_fix hook from utils.py - Fixed SwiGLU clamp: gate clamped BEFORE SiLU (matching SiluAndMulWithClamp) - Cleaned up all debug prints - Updated Dockerfile with new kernel files	2026-05-18 20:27:42 +00:00
biondizzle	1836e5fdc7	Add shared experts to post-quant BF16 dequant fix Shared experts also use FlashInferCutlassNvFp4LinearKernel with broken input_scale. They need the same BF16 dequant treatment. gate_up_proj and down_proj on ffn.shared_experts.	2026-05-18 19:27:49 +00:00
biondizzle	75844a8361	Post-quant fix via Dockerfile patch to process_weights_after_loading Forward pre-hook approach didn't work — torch.compile and model wrappers bypass hooks. Instead, patch vLLM's utils.py to call model._post_quant_fix() at the end of process_weights_after_loading. This guarantees the fix runs AFTER quant methods set up their attrs. Dockerfile now patches: model_loader/utils.py → calls model._post_quant_fix() if it exists DeepseekV4ForCausalLM._post_quant_fix() dequantizes attention NVFP4 weights to BF16 and replaces quant_method.	2026-05-18 18:35:34 +00:00
biondizzle	a4ad5898c1	Fix post-quant hook: register on inner model, fix module refs vLLM V1 calls DeepseekV4Model.forward() directly, not DeepseekV4ForCausalLM.forward(). Hook on the outer model never fires. Moved hook to self.model (inner) and fixed module.model.layers → module.layers.	2026-05-18 18:15:36 +00:00
biondizzle	a51edd238e	Add post-quant-init forward hook to fix attention NVFP4 The key insight: process_weights_after_loading runs AFTER load_weights and sets up FlashInferCutlassNvFp4LinearKernel with broken input_global_scale_inv. Any fix inside load_weights gets overwritten. Solution: register a one-shot forward pre-hook that runs on the first forward call (guaranteed after all init). It dequantizes attention NVFP4 weights to BF16 and replaces quant_method with UnquantizedLinearMethod. Since process_weights_after_loading already ran, our changes won't be overwritten. Standalone test confirmed: all attention weights produce valid non-NaN output when dequantized to BF16.	2026-05-18 17:56:19 +00:00
biondizzle	2835cb040b	Fix input_scale BEFORE process_weights_after_loading runs Instead of dequantizing to BF16 (which gets overwritten by process_weights_after_loading), fix the input_scale parameter on the module before the quant method reads it. The quant method computes input_global_scale_inv = input_scale.max(), so fixing input_scale propagates the correct activation scale. Computes correct input_scale by temporarily dequantizing weight to BF16, running warmup forward, and computing act_amax. input_scale = 1/(act_amax * headroom).	2026-05-18 16:43:44 +00:00
biondizzle	2fc81ccac4	Revert to BF16 dequant for attention NVFP4 (input_scale fix was too early) process_weights_after_loading sets input_global_scale_inv AFTER _convert_nvfp4_post_load runs, so the fix couldn't find the attrs. Going back to BF16 dequant approach. The zeros in the dummy run are expected (attention_impl returns early with out.zero_()). Need to test with a real request under cudagraph_mode=NONE.	2026-05-18 16:23:41 +00:00
biondizzle	4a57399592	Add debug prints for input_global_scale_inv check	2026-05-18 15:59:59 +00:00
biondizzle	f86892e26b	Replace BF16 dequant with input_scale warmup fix for attention NVFP4 Instead of dequantizing attention weights to BF16 (which had issues with MergedColumnParallelLinear and different weight_scale_2 values), keep the NVFP4 path but fix the activation global scale. Compute correct input_global_scale_inv by: 1. Temporarily dequantizing weight to BF16 2. Running warmup forward with random input 3. Computing actual activation amax 4. Setting scale_inv = amax * headroom This preserves the original NVFP4 quantization pipeline.	2026-05-18 15:43:46 +00:00
biondizzle	301015b037	Remove all inline diagnostics — incompatible with torch.compile Data-dependent expressions (amax().item(), isnan().any().item()) cause Dynamo guard failures even when gated by os.environ. cudagraph_mode=NONE still uses torch.compile, so these break. Will need enforce-eager for diagnostics going forward.	2026-05-18 15:22:53 +00:00
biondizzle	2a2a42c6d6	Add attention-internal diagnostics: MLA output, FP8 quant output	2026-05-18 14:45:43 +00:00
biondizzle	5c1dda10f6	Add granular attention diagnostics: pre/post attn, embed, dequant stats	2026-05-18 14:24:14 +00:00
biondizzle	e0e0528778	Add debug logging for BF16 dequant to find missing attrs	2026-05-18 14:04:12 +00:00
biondizzle	2e8c3c961f	Fix: dequantize fused_wqa_wkv instead of separate wq_a/wkv wq_a and wkv are fused into a single MergedColumnParallelLinear called fused_wqa_wkv. Was checking for non-existent separate attrs.	2026-05-18 13:47:08 +00:00
biondizzle	a7216b27df	Fix: keep wo_a as FP8 (fp8_einsum path), dequant others to BF16 wo_a uses fp8_einsum which is weight-only FP8 (no input_scale). Only q_a, q_b, kv, o_b need BF16 dequant to avoid broken input_scale.	2026-05-18 13:22:15 +00:00
biondizzle	334e95047e	Fix: dequantize ALL attention NVFP4 projections to BF16 Root cause of NaN from layer 0: FlashInferCutlassNvFp4LinearKernel uses checkpoint input_scale for activation quantization, which produces NaN immediately. Fix: dequantize all attention NVFP4 weights (wq_a, wq_b, wkv, wo_a, wo_b) to BF16 at load time, bypassing the broken input_scale entirely. Uses existing _dequant_nvfp4_to_bf16 method. This trades memory for correctness. Future optimization: add warmup for attention input_global_scale_inv (same as MoE warmup).	2026-05-18 13:09:36 +00:00
biondizzle	9e7639fba4	Add layer-by-layer diagnostic prints (CLAWMINE_DEBUG=1, enforce-eager) When CLAWMINE_DEBUG=1, prints amax/mean/NaN/Inf after each layer. Must run with --enforce-eager (data-dependent prints break Dynamo). Gated by os.environ so dead-code-eliminated during compilation.	2026-05-18 12:51:51 +00:00
biondizzle	2d1e9f42b1	Remove NaN check — incompatible with Dynamo fullgraph compilation Dynamo fullgraph mode rejects BOTH data-dependent branching AND torch.compiler.disable as graph breaks. The NaN check cannot coexist with vLLM's AOT compilation. Use layertest/cudagraph_test for debugging.	2026-05-18 12:17:25 +00:00
biondizzle	65763a200c	Fix NaN check: wrap in @torch.compiler.disable to prevent Dynamo graph break The inline os.environ gate doesn't work — Dynamo still sees the data-dependent branching (torch.isnan().any()) and crashes with 'Unsupported: Data-dependent branching'. Extracting into a @torch.compiler.disable decorated function makes Dynamo skip it.	2026-05-18 11:33:29 +00:00
biondizzle	b8df4a8cc5	Fix NaN check: use os.environ gate instead of is_current_stream_capturing torch.cuda.is_current_stream_capturing() returns bool, which breaks Dynamo FX tracing (non-Tensor output). Switch to env var gate: CLAWMINE_NAN_CHECK=1 enables NaN/Inf detection. Dynamo evaluates os.environ at trace time — if the env var is not set, the entire NaN check block is compiled away. Set it before first inference to get NaN detection during prefill only.	2026-05-18 02:20:14 +00:00
biondizzle	0c02d84514	Add NaN/Inf detection in DeepseekV4Model.forward layer loop - Checks every layer during prefill (not during cudagraph capture) - is_current_stream_capturing() gate prevents CPU-GPU syncs during capture - Prints amax every 10 layers for magnitude tracking - Breaks on first NaN/Inf to avoid wasting compute	2026-05-17 23:37:12 +00:00
biondizzle	22e0370e6e	Fix AttributeError: DeepseekV4MegaMoEExperts has no swiglu_limit Get swiglu_limit from vllm_config.model_config.hf_config instead of self (it was only set on the parent DeepseekV4MoE class).	2026-05-17 18:06:44 +00:00
biondizzle	a10c582cf4	Add swiglu_limit=10.0 activation clamping (was missing) DeepSeek-V4 uses SiluAndMulWithClamp(10.0) which clamps: - silu(gate) to max 10.0 - up to [-10.0, 10.0] Our runner was doing plain F.silu(gate) * up without clamping. Large gate values could produce unbounded SiLU output, causing numerical issues in the L2 GEMM. This is likely contributing to garbage model output.	2026-05-17 17:52:16 +00:00
biondizzle	b1ac74bb4d	Fix shape mismatch: shared padded buffers, revert max_num_tokens cap Root cause: capping max_num_tokens to 512 made buffers too small for the actual 8192-token warmup. slot_hidden had 49152 rows but padded_hidden only had 6144. Fix: Revert the 512 cap. Use SHARED padded buffers (not per-layer) to avoid OOM. Only 72 MB total (not 4.3 GB) since layers run sequentially and reuse the same buffer. Cudagraph-safe since capture and replay both run layers sequentially on the same tensor.	2026-05-17 15:47:10 +00:00
biondizzle	8ac8e20fa9	Fix OOM: cap buffer pre-allocation at cudagraph max capture size padded_hidden/activated buffers were sized for max_num_tokens=8192, which is 72 MB per layer × 60 layers = 4.3 GB → OOM with 178 GB GPUs (almost full from model + KV cache). Now cap at max cudagraph capture size (512 tokens). Eager-mode runs with >512 tokens will need dynamic allocation, but vLLM always uses cudagraph for inference after warmup.	2026-05-17 14:14:13 +00:00
biondizzle	b0221662e7	Fix warmup: pass local expert IDs (not global), remove incorrect _warmup_done guard compute_activation_global_scales expects local IDs (0..num_experts-1), not global IDs. EP5/EP7 were getting L2 gs=0 because global IDs (240+, 336+) didn't match expert_id_range (0..47), so no tokens matched any expert → L1 GEMM got zero inputs → L2 gs=0 → NaN/crash. Also removed _warmup_done guard since each layer needs its own warmup (different weights, different gs values).	2026-05-17 11:38:19 +00:00
biondizzle	b531a98f8f	Fix scale assembly: per-expert 128-row fixed slots, no dynamic sizing - Reverted from full-buffer swizzle to per-expert 128-row slots - Scatter into e*128 fixed positions (cudagraph-compatible, fixed shape) - Clamp local_row to 127 for experts with >128 tokens (GEMM uses expert_offsets) - Buffer sized for num_experts*128 rows (not max_tokens*top_k) - Add _warmup_done guard to only run warmup once (not 60x)	2026-05-17 11:10:59 +00:00
biondizzle	04245b664b	Add warmup-based activation global scale computation in finalize_weights The checkpoint input_scale is a calibration value that produces wrong gs at runtime (too small → block scales saturate → garbage output → EOS). Now calls compute_activation_global_scales() with sample data during weight finalization, before cudagraph capture. This observes actual activation magnitudes and computes correct L1 and L2 gs values.	2026-05-17 10:48:24 +00:00
biondizzle	d9bae6d770	Fix OOB in scale assembly: size padded_x_sf for max tokens, fix top_k/max_num_tokens passing, support variable-size expert blocks Bug 9: padded_x_sf was sized for num_experts*128 rows, but with 8192 tokens and top_k=6, the actual padded row count can exceed 6144. Also: - Pass top_k and max_num_tokens from deepseek_v4.py (was defaulting to 8/8192) - Phase 2 of scale assembly now handles experts with >128 tokens (multiple 128-row chunks) - Remove debug prints	2026-05-17 09:56:28 +00:00
biondizzle	ca3cba5bbd	Fix global→local expert ID remapping for EP and remove .cpu() sync Root cause of CUDA_ERROR_ASSERT index out of bounds: - topk_ids contains GLOBAL expert IDs (0-255) but runner treated them as local IDs (0-31 with EP=8). Tokens for non-local experts got wrong expert assignments, causing out-of-bounds scatter indices in _assemble_scales_cudagraph_safe. Fixes: 1. Add experts_start_idx param to CuTeDSLMoERunner 2. In run(), remap global→local IDs and zero weights for non-local experts 3. Move _token_indices from CPU to GPU (remove sort_idx.cpu() sync) 4. Add _fill_token_indices() and _needs_token_refill to handle CuTeDSL JIT GPU memory corruption (refill after first GEMM call)	2026-05-17 08:58:43 +00:00
biondizzle	d2965b432d	fix: set _l1_activation_global_scale (with underscore) — attribute name mismatch	2026-05-17 03:35:20 +00:00
biondizzle	b382a7a528	fix: handle input_scale as 1D or 2D (EP splits change the shape)	2026-05-16 22:49:30 +00:00
biondizzle	139c9c37cd	fix: read input_scale from nn.Parameter before it's freed	2026-05-16 22:23:24 +00:00
biondizzle	152648789d	fix: use checkpoint input_scale for activation global scale (not hardcoded 1/2688) The checkpoint stores input_scale per projection — the pre-computed activation normalization factor. Using 1/2688 was wrong for most layers (e.g. down_proj input_scale=0.031 vs 1/2688=0.000372 — 83x off). This caused under-quantized activations and garbage output.	2026-05-16 21:46:00 +00:00
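
Several of the mapper commits above (17496b2615, b039123207, f74447bfd0) depend on applying substring renames in order, specific before general. The sketch below illustrates why that order matters. It is not vLLM's mapper: `apply_renames` and the two rename lists are hypothetical names, and only the key fragments are taken from the commit messages.

```python
# Hypothetical rename tables. The key fragments come from the commit
# messages; the helper and list names are illustrative assumptions.
SPECIFIC_FIRST = [
    (".self_attn.compressor.", ".attn.mla_attn.compressor."),
    (".self_attn.", ".attn."),
]
GENERAL_FIRST = list(reversed(SPECIFIC_FIRST))


def apply_renames(key, renames):
    """Apply each (old, new) substring rename in list order."""
    for old, new in renames:
        if old in key:
            key = key.replace(old, new)
    return key


key = "model.layers.3.self_attn.compressor.kv_proj.weight"

# Specific rule fires first, so the compressor key gains the mla_attn prefix.
good = apply_renames(key, SPECIFIC_FIRST)

# General rule fires first and consumes ".self_attn.", so the specific
# rule no longer matches and the mla_attn prefix is lost.
bad = apply_renames(key, GENERAL_FIRST)

print(good)
print(bad)
```

With the general rule first, the compressor key ends up under `.attn.compressor.` instead of `.attn.mla_attn.compressor.`, which matches the failure mode that commit 17496b2615 describes fixing.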


80 Commits
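
Commits ca3cba5bbd and b0221662e7 both concern global versus local expert IDs under expert parallelism. The following is a minimal pure-Python sketch of that remapping, assuming contiguous expert ranges per rank. `experts_start_idx` is the parameter name from the commit message; `remap_to_local`, `num_local_experts`, and the choice of slot 0 as the placeholder for non-local experts are assumptions made for illustration, and the real runner operates on GPU tensors.

```python
def remap_to_local(topk_ids, topk_weights, experts_start_idx, num_local_experts):
    """Map global expert IDs to local ones; zero the weight of non-local experts."""
    local_ids, local_weights = [], []
    for global_id, weight in zip(topk_ids, topk_weights):
        local_id = global_id - experts_start_idx
        if 0 <= local_id < num_local_experts:
            local_ids.append(local_id)
            local_weights.append(weight)
        else:
            # Non-local expert: keep an in-range index so later scatter
            # indices stay in bounds, and drop its contribution.
            local_ids.append(0)
            local_weights.append(0.0)
    return local_ids, local_weights


# 256 global experts over 8 ranks gives 32 local experts per rank;
# this hypothetical rank owns global IDs 32..63.
ids, weights = remap_to_local(
    [3, 40, 255], [0.5, 0.3, 0.2], experts_start_idx=32, num_local_experts=32
)
print(ids, weights)
```

Only global ID 40 belongs to this rank, so it maps to local ID 8 and keeps its weight, while the other two entries are zeroed. Keeping the output the same length as the input, instead of filtering, preserves fixed shapes, which the log repeatedly notes as a cudagraph requirement.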