nvfp4-megamoe-kernel

Author	SHA1	Message	Date
biondizzle	7adfaef113	fix: in-place prepack to avoid 2× peak memory. torch.stack(packed) held all expert tensors plus the final stack (~3.5 GiB). Now pre-allocate the output and fill it in place, so only 1 expert tmp + the final tensor are in memory at any time.	2026-05-15 10:38:44 +00:00
biondizzle	5dc18df494	feat: MEGA_MOE_PREPACK_CACHE_MAX env var (default 2) with CUDA graph warning. Added big comment block explaining the cache sizing rationale and the CUDA graph trap: default of 2 works for sequential layer execution but will cause use-after-free if CUDA graphs capture multiple layers. Set MEGA_MOE_PREPACK_CACHE_MAX to cover all captured layers in that case.	2026-05-15 10:33:53 +00:00
biondizzle	90313f3a92	fix: LRU(2) eviction for prepack cache, prevents OOM across 61 layers. The cache was growing unbounded: 61 MoE layers × 2 = 122 prepacked SFB tensors permanently in GPU memory (~1.75 GiB each). With sequential layer execution, only 2 entries are needed at a time (current L1 + L2). Added LRU eviction to keep at most 2 entries.	2026-05-15 10:31:57 +00:00
biondizzle	1da6726a86	fix: assert float8_e4m3fn dtype in _prepack_weight_sf. Non-prepacked path handles uint32 unpacking, but prepack sends weight_sf directly to _C.prepack_sfb(). If scales aren't float8, this would silently produce wrong layout/values.	2026-05-15 10:14:29 +00:00
biondizzle	489c620159	docs: document M_for_layout=128 assumption in _prepack_weight_sf. SFB layout size may depend on M. Currently unverified; only tested with M=128. Added TODO to test with M=1 and M=256.	2026-05-15 10:13:19 +00:00
biondizzle	b7c7e9fb50	refactor: clean up slot_token handling in cutlass_grouped_nvfp4_gemm. Changes: split provided_slot_token vs slot_token_out (returned to caller); no gather when slot_token=None (L2 path), no unnecessary alloc; .contiguous() on gathered tensors for CUTLASS alignment; return slot_token_out consistently.	2026-05-15 10:11:40 +00:00
biondizzle	7a1538d0c8	fix: gather on slot_token presence, add shape asserts L1→L2. Changes: remove torch.equal heuristic and just gather when slot_token is provided; add asserts for slot mapping shapes (ndim, numel == num_slots); add post-L1 and pre-L2 shape asserts (l1_slots, activated, l1_fp4, l1_sf_out).	2026-05-15 10:06:07 +00:00
biondizzle	3cc00b12df	fix: prepack cache key includes data_ptr, shape, dtype, device, N, K. Old cache used only tag ('l1'/'l2'), so layer 1 would reuse layer 0's packed scales if the function object persisted. Now keyed by (tag, data_ptr, shape, dtype, device, N, K), which is safe across layers.	2026-05-15 10:03:37 +00:00
biondizzle	3ba41b9322	fix: use slot_token identity check instead of shape heuristic for gather. Shape-based check (x_fp4.shape[0] != num_slots) silently fails when num_tokens == num_slots in L1 (topk=1). Now checks if slot_token is the identity mapping and only gathers when slot ordering differs from token ordering.	2026-05-15 10:00:41 +00:00
biondizzle	ded80be133	refactor: unify L1/L2 to use 1D slot_expert_ids consistently. Both L1 and L2 now pass pre-built 1D slot_expert_ids and slot_token to cutlass_grouped_nvfp4_gemm instead of the 2D topk_ids. The 2D path was broken for expert parallelism: local_mask matched ALL local experts, producing mismatched slot_token/slot_k lengths that caused vectorized_gather_kernel index out of bounds. cutlass_grouped_nvfp4_gemm now: takes 1D slot_expert_ids + optional slot_token; gathers x_fp4 by slot_token when needed (L1: tokens→slots); skips gather when x_fp4 already has num_slots rows (L2).	2026-05-15 09:56:46 +00:00
biondizzle	093babadc6	docs: clarify L1 interleave removal — transpose is still needed	2026-05-15 09:49:13 +00:00
biondizzle	c7db2242ee	fix: pass slot_expert_ids directly to L2 instead of rebuilding from topk_ids. The L2 function was rebuilding slot_expert_ids by scanning topk_ids with a local_mask. This produced mismatched slot_k (all-expert mask) vs slot_token (rank-local mask), causing vectorized_gather_kernel index out of bounds. Now slot_expert_local is passed directly from the outer routing logic, matching the same slot ordering as L1.	2026-05-15 09:42:13 +00:00
biondizzle	f29b96de09	bug fixes	2026-05-15 09:25:33 +00:00
biondizzle	a780bb5fde	bug fix	2026-05-15 09:11:05 +00:00
biondizzle	91338428d9	some optimizations	2026-05-15 09:09:35 +00:00
biondizzle	fae418c3a3	final scatter	2026-05-15 08:57:43 +00:00
biondizzle	f2cacfc2f2	fix the L2 path and the clamping math	2026-05-15 08:51:23 +00:00
biondizzle	d22dae2df3	we're getting close	2026-05-15 08:28:40 +00:00
biondizzle	d493193d25	fix the god damn projections	2026-05-15 08:02:02 +00:00
biondizzle	9810de7109	more debug	2026-05-15 07:45:34 +00:00
biondizzle	1a37b66922	dang python	2026-05-15 07:23:10 +00:00
biondizzle	7b3a853465	more debugging	2026-05-15 07:10:13 +00:00
biondizzle	6b4b59c6a4	double check that weird line	2026-05-15 06:40:38 +00:00
biondizzle	beacc31569	is paris in the top n?	2026-05-15 06:38:20 +00:00
biondizzle	311b28bd9f	fixey wixey	2026-05-15 06:07:18 +00:00
biondizzle	685bce48b4	actually handle expert param mapping	2026-05-15 06:01:50 +00:00
biondizzle	f17efa340d	are the weights ever not zero?	2026-05-15 05:48:38 +00:00
biondizzle	c5d800f133	can we see the wt in?	2026-05-15 05:41:12 +00:00
biondizzle	6a4f52cedc	god dam i just want the gemm in	2026-05-15 05:31:13 +00:00
biondizzle	3b3c506af5	whoops	2026-05-15 05:21:42 +00:00
biondizzle	76e9b078a2	more debug2	2026-05-15 05:08:53 +00:00
biondizzle	912e4622d7	more debug	2026-05-15 04:53:26 +00:00
biondizzle	c7f6a1dc4d	fix: transpose B and SFB on the Python side at weight-load time, and adjust the SFB remap kernel to read from column-major source layout	2026-05-15 04:35:45 +00:00
biondizzle	c56cc34ae1	fix: LayoutBTag is now RowMajor	2026-05-15 04:30:27 +00:00
biondizzle	9975558c23	Add always-on alpha/x_sf debug prints for L1 and L2 GEMM calls	2026-05-15 03:59:07 +00:00
biondizzle	9c318c3353	force no cache	2026-05-15 03:52:00 +00:00
biondizzle	ff6bb32684	Plumb global scale as GEMM alpha instead of folding into UE4M3. stage_activation now returns (x_fp4, x_sf, input_global_scale). The global scale is applied as the CUTLASS GEMM alpha parameter in the epilogue: D = alpha * A @ B, avoiding the fp32→UE4M3 round-trip that folding would introduce. Changes: stage_activation: returns global scale as 3rd value; cutlass_nvfp4_gemm C++ binding: alpha param (was hardcoded 1.0); cutlass_grouped_nvfp4_gemm: passes alpha to per-expert GEMM; nvfp4_mega_moe_l1/l2: accept alpha, pass to grouped GEMM; nvfp4_moe_full: reads symm_buffer.input_global_scale for L1, uses stage_activation's returned global scale for L2; SymmBuffer: added input_global_scale field; vllm patch: stores global scale from stage_activation.	2026-05-15 03:32:19 +00:00
biondizzle	d547da2948	stage_activation: add per-tensor global scale matching NVFP4 spec. Without a global scale, block scales (block_max / 6.0) could exceed UE4M3 max (448.0) for large activations, causing saturation and garbage MoE outputs. The degeneration pattern (positions 1-5 OK, then constant spaces) is consistent with UE4M3 overflow: first few tokens have small activations that fit, but once SiLU(mul(gate, up)) produces larger values, block scales overflow and the GEMM produces zeros/garbage. Fix: compute input_global_scale = amax / (6.0 * 448.0), normalize before block quantization, then fold global scale back into block scales (same as weight_transform.py folds weight_scale_2). This ensures block scales are always ≤ 448.0 in UE4M3 range.	2026-05-15 03:27:47 +00:00
biondizzle	108ff07569	debug: remove one-shot gate from logit dump, log every forward	2026-05-15 03:01:05 +00:00
biondizzle	3600a4b06a	debug: add logit quality dump in compute_logits (ungated, once)	2026-05-15 02:37:23 +00:00
biondizzle	29f8b8c174	fix: load lm_head.weight in outer model before forwarding to inner. lm_head lives on DeepseekV4ForCausalLM, not DeepseekV4Model. The inner load_weights silently drops it (not in params_dict). Extract it in the outer loader, load it directly, then forward the rest to the inner model.	2026-05-15 02:17:16 +00:00
biondizzle	46536e5ccf	fix: hc param renames missing leading dot. The rename .attn_hc.base -> hc_attn_base produced layers.0hc_attn_base (no dot). Need .hc_attn_base to preserve the dot separator.	2026-05-15 01:54:53 +00:00
biondizzle	086f3fa5c5	fix: hc params dot→underscore + compressor position_bias→ape combined rule. Two fixes: (1) attn_hc.base → hc_attn_base (underscore not dot before base/fn/scale). Same for fn, scale, and ffn_hc variants. (2) compressor.position_bias → compressor.ape was never firing because the .self_attn.compressor. rule matched first (break). Added combined .self_attn.compressor.position_bias → .attn.mla_attn.compressor.ape.	2026-05-15 01:29:00 +00:00
biondizzle	44d4b6c225	fix: add missing renames for Hadamard coding + compressor.ape. Three more dropped checkpoint→model mappings: (1) hc_head: checkpoint has hc_head.hc_base/fn/scale, model has hc_head_base/fn/scale (underscore not dot separator). (2) attn_hc/ffn_hc: checkpoint has .attn_hc. and .ffn_hc., model has .hc_attn. and .hc_ffn. (word order reversed). (3) compressor.position_bias → compressor.ape: checkpoint name is position_bias, model attr is ape (absolute position encoding). All 461 remaining zero params should now be just indexer.k_norm.bias (legit zero - no bias in checkpoint, only weight).	2026-05-15 01:16:19 +00:00
biondizzle	af6583eb19	fix: unpack uint8 NVFP4→bf16 for non-stacked params (weights_proj). indexer.weights_proj is uint8 [64,3584] in checkpoint but bf16 [64,7168] in model. The uint8→bf16 unpack logic only ran in the stacked_params loop, so non-stacked NVFP4 params hit a size mismatch assertion.	2026-05-15 00:51:50 +00:00
biondizzle	e6ed9facf3	fix: indexer + shared_experts + compressor checkpoint→model key renames. Three categories of missed renames in CKPT_KEY_SUBST: (1) Shared experts: .shared_experts.gate_proj.→.ffn.shared_experts.w1. fired but break prevented .mlp.→.ffn. from also applying, producing mlp.ffn.shared_experts.w1. (double prefix). Fixed by including .mlp. in the pattern. Added missing .shared_experts.down_proj. rule. (2) Indexer (layers 2+): .self_attn.compressor.indexer.* was caught by the generic .self_attn.compressor.→.attn.mla_attn.compressor. rule, producing wrong path attn.mla_attn.compressor.indexer.* instead of attn.indexer.*. Added indexer-specific patterns (q_b_proj→wq_b, kv_norm→k_norm, position_bias→compressor.ape, gate_proj→compressor.wgate, kv_proj→compressor.wkv) before the generic compressor rule. (3) Compressor kv_proj/gate_proj: old .compressor.kv_proj.→.compressor.wkv. pattern could never fire because .self_attn.compressor. matched first (break). Merged into combined patterns that handle both the self_attn.compressor→attn.mla_attn.compressor path AND the projection rename in one step.	2026-05-15 00:39:37 +00:00
biondizzle	21018fca8a	fix: shared_experts missing ffn. prefix in checkpoint→model rename. Checkpoint keys are model.layers.N.shared_experts.gate_proj.weight but model params are layers.N.ffn.shared_experts.gate_up_proj.weight. The .ffn. was missing from the rename, so stacked gate_up_proj never matched params_dict.	2026-05-15 00:17:59 +00:00
biondizzle	483046b9d6	fix: shared_experts gate_up_proj stacking was skipped by .experts. check. The stacking logic skipped any key containing '.experts.' to avoid MoE routed expert weights. But 'shared_experts' also matches that substring, so gate_proj and up_proj were never stacked into gate_up_proj. Changed to '.ffn.experts.' which only matches the routed experts path. Also includes POST-LOAD all-zero param scan.	2026-05-15 00:08:04 +00:00
biondizzle	78dc83dc6e	a little more debug1	2026-05-15 00:02:22 +00:00
biondizzle	8dbd616add	a little more debug1	2026-05-15 00:02:00 +00:00
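
Commit d547da2948 above explains why a per-tensor global scale is needed: without it, a block scale of block_max / 6.0 can exceed the UE4M3 maximum of 448.0. The arithmetic can be sketched in plain Python. This is only an illustration of the formula quoted in the commit message, not the repository's stage_activation; the function names, the block size of 16, and the fallback of 1.0 for an all-zero input are assumptions made for the example.

```python
FP4_MAX = 6.0      # divisor used for block scales in the commit message
UE4M3_MAX = 448.0  # UE4M3 maximum quoted in the commit message
BLOCK = 16         # assumed block size for this sketch


def block_scales(values, block=BLOCK):
    """Per-block scale = block_max / 6.0, as described in the commit."""
    return [
        max(abs(v) for v in values[i:i + block]) / FP4_MAX
        for i in range(0, len(values), block)
    ]


def global_scale(values):
    """input_global_scale = amax / (6.0 * 448.0); 1.0 for all-zero input (assumption)."""
    amax = max(abs(v) for v in values)
    return amax / (FP4_MAX * UE4M3_MAX) if amax > 0 else 1.0


def normalized_block_scales(values, block=BLOCK):
    """Divide by the global scale first, then compute block scales."""
    g = global_scale(values)
    normalized = [v / g for v in values]
    return g, block_scales(normalized, block)


# One block of small activations followed by one block of large ones.
activations = [0.5] * BLOCK + [4000.0] * BLOCK

raw = block_scales(activations)                  # second block overflows UE4M3
g, scaled = normalized_block_scales(activations)

print("raw block scales:       ", raw)
print("global scale:           ", g)
print("normalized block scales:", scaled)
```

With the global scale applied, the largest block scale lands at the UE4M3 maximum (up to floating-point rounding) instead of above it. Commit ff6bb32684 later passes this scale as the GEMM alpha instead of folding it back into the block scales.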


134 Commits
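
Commits 3cc00b12df, 90313f3a92 and 5dc18df494 together describe a prepack cache keyed by (tag, data_ptr, shape, dtype, device, N, K) with LRU eviction, sized by MEGA_MOE_PREPACK_CACHE_MAX (default 2). A minimal sketch of that behaviour, assuming an OrderedDict-based cache, is shown here. The class name PrepackCache, the get_or_create method, and the string stand-ins for tensors are hypothetical; the real code holds GPU tensors and may be structured differently.

```python
import os
from collections import OrderedDict


class PrepackCache:
    """Hypothetical LRU cache for prepacked scale tensors."""

    def __init__(self, max_entries=None):
        if max_entries is None:
            # Env var name and default of 2 come from the commit message.
            max_entries = int(os.environ.get("MEGA_MOE_PREPACK_CACHE_MAX", "2"))
        self.max_entries = max_entries
        self._entries = OrderedDict()

    def get_or_create(self, key, build):
        if key in self._entries:
            self._entries.move_to_end(key)  # mark as most recently used
            return self._entries[key]
        value = build()
        self._entries[key] = value
        while len(self._entries) > self.max_entries:
            self._entries.popitem(last=False)  # evict least recently used
        return value

    def __contains__(self, key):
        return key in self._entries

    def __len__(self):
        return len(self._entries)


def make_key(tag, data_ptr, shape, dtype, device, n, k):
    """Key layout from commit 3cc00b12df; values here are plain Python stand-ins."""
    return (tag, data_ptr, tuple(shape), dtype, device, n, k)


cache = PrepackCache(max_entries=2)
keys = []
for layer in range(4):
    for tag in ("l1", "l2"):
        # A fake data_ptr per (layer, tag) stands in for tensor.data_ptr().
        key = make_key(tag, 0x1000 * (2 * layer + (tag == "l2")), (8, 64),
                       "float8_e4m3fn", "cuda:0", 64, 128)
        keys.append(key)
        cache.get_or_create(key, lambda layer=layer, tag=tag: f"packed-{layer}-{tag}")

print(len(cache), "entries kept after", len(keys), "insertions")
```

Because each layer's key differs by data_ptr, a later layer never reuses an earlier layer's packed scales, and only the two most recent entries stay resident. As the commit warns, a limit this small is only safe when layers run sequentially; evicted entries must not still be referenced by a captured CUDA graph.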