nvfp4-megamoe-kernel

Author	SHA1	Message	Date
biondizzle	dfd9c10ae9	Fix MHC import: don't import .torch from layers/mhc.py The layers/mhc.py was trying to import kernels.mhc.torch which failed because our __init__.py was breaking the package. Instead, just import our mhc_torch_ops which has everything we need. Also fix __init__.py to explicitly import mhc_pre_torch and mhc_post_torch from .torch instead of using import *.	2026-05-19 05:36:35 +00:00
biondizzle	e404e18efb	Also replace layers/mhc.py CustomOp dispatch The original layers/mhc.py forward_cuda calls torch.ops.vllm.mhc_pre_tilelang which triggers TileLang JIT. Replace with our torch implementations in forward_cuda. This is what the CustomOp dispatch routes through.	2026-05-19 05:31:05 +00:00
biondizzle	5e6d459145	Fix MHC custom op registration Previous approach used @CustomOp.register which doesn't create torch.ops.vllm.mhc_pre. The model code calls torch.ops.vllm.mhc_pre() directly, which requires direct_register_custom_op. Use direct_register_custom_op to register mhc_pre, mhc_post, mhc_fused_post_pre, and hc_head_fused_kernel as PyTorch custom ops with torch (eager) implementations. Patch kernels/mhc/__init__.py to import from both .torch (original) and .mhc_torch_ops (our replacements), skipping tilelang import.	2026-05-19 05:19:48 +00:00
biondizzle	9ff1679064	Replace MHC TileLang kernels with pure PyTorch TileLang kernels (mhc_pre_big_fuse_tilelang, mhc_fused_tilelang) don't work correctly on Blackwell SM100 and cause empty model output. Replace with pure PyTorch implementations: - mhc_pre_torch: Sinkhorn-normalized HC residual mixing - mhc_post_torch: HC post block (einsum residual + post layer mix) - mhc_fused_post_pre_torch: Fused post+pre (composition of above) - hc_head_fused_torch: RMS norm + linear + sigmoid + weighted sum Patch both layers/mhc.py (CustomOp dispatch) and kernels/mhc/__init__.py (no tilelang import). Also remove tilelang from pyproject.toml deps.	2026-05-19 05:07:41 +00:00
biondizzle	5c770c68ca	Keep MoE scale tensors: framework warmup needs them The framework's deep_gemm_warmup calls get_fused_moe_quant_config which accesses w13_input_scale etc. Setting them to None caused TypeError: float / NoneType. Keep scales (small tensors) and only free the large weight tensors.	2026-05-19 04:50:31 +00:00
biondizzle	e0f385ac45	Fix workspace_shapes: output dim is hidden_dim, not K2 K comes from hidden_states.size(-1) which is the full BF16 dimension (7168), not the packed weight dimension. K2=14336 is wrong. The MoE output is always hidden_dim (7168).	2026-05-19 04:42:22 +00:00
biondizzle	cfd8ec741d	Debug: add shape mismatch logging in MoE apply	2026-05-19 04:35:58 +00:00
biondizzle	ffc1a5c6a8	Fix workspace_shapes: remove wrong assertion, compute output dim from K The framework may pass K in different forms (packed or unpacked). Use max(K*2, hidden_dim) to handle both cases.	2026-05-19 04:28:04 +00:00
biondizzle	f023b3b2c6	Fix: wrap dummy MoE weights in nn.Parameter PyTorch requires module attributes to be nn.Parameter or None. torch.empty can't be assigned to a registered parameter slot.	2026-05-19 04:21:35 +00:00
biondizzle	b06dcb40dc	Fix MoE w1=None crash: keep shape-preserving dummy weights on CPU The modular kernel framework reads w1.shape[0] in its outer apply() before delegating to our expert impl. Setting layer.w13_weight = None caused AttributeError. Replace with shape-preserving CPU dummy tensors to free GPU memory while keeping shape metadata accessible.	2026-05-19 04:17:10 +00:00
biondizzle	c289c44920	Fix BF16 wo_a: per-group BMM instead of flat linear The BF16 wo_a path was calling self.wo_a(o_inv.reshape(num_tokens, -1)) which flattens across groups: (num_tokens, n_local_heads*head_dim)=(tokens, 8192). But wo_a is a BMM with in_features=n_heads*head_dim/n_groups=4096. The FP8 path handles this via einsum 'bhr,hdr->bhd' with per-group shapes. The BF16 path now does the same: reshape o_inv to per-group format, do torch.bmm, then reshape output and handle TP all-gather manually.	2026-05-19 04:10:02 +00:00
biondizzle	6f9a400ae0	Fix hc_head mapping: checkpoint uses hc_head.hc_fn, model params are flat hc_head_fn - Removed hc_head prefix mapping (checkpoint already has model.hc_head.*) - Fixed substr: hc_head.hc_fn→hc_head_fn (not hc_head.fn→hc_head_fn) - The model has self.hc_head_fn as flat params, not inside a sub-module	2026-05-19 03:58:25 +00:00
biondizzle	909a2710e4	Fix double lm_head mapping: NVFP4 checkpoint already uses correct names The checkpoint has lm_head.weight and model.embed_tokens.weight already — the suffix mappings head.weight→lm_head.weight and embed.weight→embed_tokens.weight were incorrectly applying to keys that already had the right prefix, producing lm_lm_head.weight.	2026-05-19 03:54:14 +00:00
biondizzle	4cf5b8b751	Fix compressor path: attn.mla_attn.compressor (not attn.compressor) The compressor is inside mla_attn, not directly on the attention wrapper. Debug output confirmed: layers.0.attn.mla_attn.compressor.fused_wkv_wgate.*	2026-05-19 03:47:26 +00:00
biondizzle	9d41419e9f	Debug: print compressor params to diagnose KeyError	2026-05-19 03:44:40 +00:00
biondizzle	db5192fe41	Patch from Docker image's vLLM (0.20.2rc1) instead of newer upstream The nightly Docker image uses an older vLLM that doesn't have NormGateLinear, breakable_cudagraph, etc. Patching the Docker image's own files ensures compatibility. - deepseek_v4.py: Patches from Docker image + NVFP4 mapper + wo_a BF16 - deepseek_v4_attention.py: Patches from Docker image + inv rope BF16 + weights_proj quant + removed QuantFP8/GroupShape imports	2026-05-19 03:35:15 +00:00
biondizzle	df5a496f5d	Fix: make eager_break_during_capture import conditional for older vLLM	2026-05-19 03:29:05 +00:00
biondizzle	4ed91b81d0	Fix inverse RoPE formula: swap signs on cross terms	2026-05-19 03:22:10 +00:00
biondizzle	fece06f746	Add unit tests for NVFP4 weight mapper and inverse RoPE BF16	2026-05-19 03:22:00 +00:00
biondizzle	b0b5113467	Fix weight mapper: compressor → attn.compressor (not mla_attn), quant weights_proj - The compressor is on attn.compressor (not attn.mla_attn.compressor) - weights_proj in indexer is NVFP4-quantized in our checkpoint	2026-05-19 03:20:41 +00:00
biondizzle	396a83ea56	Clean vLLM integration: use official paths, BF16 wo_a, proper weight mapper - deepseek_v4.py: Fresh upstream copy with minimal NVFP4 changes - wo_a uses quant_config=None (BF16 in NVFP4 checkpoint, no scales) - Added _make_deepseek_v4_nvfp4_weights_mapper() using official WeightsMapper API - Handles: self_attn→attn, mlp→ffn, gate_proj→w1, compressor renames, etc. - Mapper selected by quant_config.get_name() == 'modelopt_fp4' - deepseek_v4_attention.py: Fresh upstream copy with minimal NVFP4 changes - Removed _wo_a_act_quant and custom CuTeDSL wo_a runner - Added _apply_inv_rope_bf16() helper (inverse RoPE in BF16) - Detects BF16 wo_a (no weight_scale_inv) and uses BF16 path - FP8 einsum path kept as fallback for SM90 checkpoints - BF16 path: inverse RoPE → wo_a() → wo_b() (standard linear methods)	2026-05-19 03:13:38 +00:00
biondizzle	b856ee9315	Clean up debug scripts	2026-05-19 02:47:29 +00:00
biondizzle	05cdde1676	Fix wo_a: scatter each group's data at correct offset in padded buffer The grouped GEMM expects each group's tokens at their own offset range: - Group 0: rows [0, padded_T) - Group 1: rows [padded_T, 2*padded_T) - etc. Previously we wrote all groups' data contiguously starting at row 0, so group 1+ would read zeros from the padding area. Now we scatter each group's quantized activation at the correct offset. Also: - Size buffer for total_max_rows = padded_max * n_groups - Use assemble_scales_2d_side for multi-group scale assembly - Extract output per-group at correct offsets	2026-05-19 02:45:57 +00:00
biondizzle	8fe5546bb3	Fix debug script	2026-05-19 02:43:17 +00:00
biondizzle	788f0aa65a	Add step-by-step debug for wo_a	2026-05-19 02:43:05 +00:00
biondizzle	5f5b997fc3	Fix wo_a: permute to groups-first layout for grouped GEMM The grouped GEMM expects mat_a to be laid out contiguously per group: [all tokens for group0, all tokens for group1, ...] A simple reshape of (T, G, D) → (T*G, D) gives interleaved layout which is wrong. Fix: permute to (G, T, D) before flattening. Same fix for output: permute (G, T, R) → (T, G, R).	2026-05-19 02:41:32 +00:00
biondizzle	77e4970d93	Add debug script for wo_a quantization	2026-05-19 02:40:43 +00:00
biondizzle	80122b850b	Add debug script for wo_a	2026-05-19 02:39:55 +00:00
biondizzle	ae233ab648	Fix test: cos_sin_cache on CUDA device	2026-05-19 02:37:50 +00:00
biondizzle	882d4996ff	Replace DeepGEMM fp8_einsum with CuTeDSL NVFP4 for wo_a (o_proj) The B200 container crashes in DeepGEMM's fp8_einsum (t.dim() == N assertion in layout.hpp:39) when processing wo_a (o-projection first half) in the attention layer. The crash is caused by scale tensor dimension mismatch for the SM100 recipe (1, 1, 128). Instead of fighting DeepGEMM, replace the entire wo_a path with our own CuTeDSL NVFP4 kernel: 1. inverse_rope_bf16() — Python implementation of inverse RoPE (replaces fused_inv_rope_fp8_quant CUDA kernel) 2. CuTeDSLNvfp4WoA — NVFP4 grouped linear for wo_a using ScaledGroupedGemm with n_local_groups=8 groups 3. wo_a weight quantized to NVFP4 instead of FP8 (native NVFP4, no conversion to another quantization) Changes: - cutedsl/inverse_rope.py: BF16 inverse RoPE (conjugate rotation) - cutedsl/wo_a_grouped_linear.py: CuTeDSL NVFP4 grouped GEMM for wo_a - vllm/patches/deepseek_v4_attention.py: Use NVFP4 path when runner is initialized, keep DeepGEMM fallback - vllm/patches/deepseek_v4.py: Init NVFP4 runner instead of FP8 quant - tests/test_wo_a.py: Unit test for inverse RoPE + wo_a GEMM	2026-05-19 02:36:30 +00:00
biondizzle	bab1f75f29	Fix gs None error in legacy _ensure_stacked path	2026-05-19 02:17:53 +00:00
biondizzle	48fa64dfda	Eliminate weight copies: pass stacked checkpoint tensors directly Memory optimization for MoE weight processing: Before (3-4 copies of weights in memory): 1. Original checkpoint weights in layer.w13_weight (copy 1) 2. Per-expert permuted copies (copy 2) 3. torch.stack() in runner._ensure_stacked (copy 3) 4. make_b_k_major re-stride (copy 4) 5. Scales: permute then assemble_scales_3d_side un-permutes (wasted) After (1-2 copies): 1. View checkpoint as fp4 (NO copy — byte-preserving view) 2. Pass (E, N, K) stacked tensor directly to runner 3. Runner permutes to (E, K, N) contiguous (copy 1), frees stacked ref 4. make_b_k_major re-strides (copy 2), frees (E, K, N) ref 5. Scales: already (N, K_sf) from checkpoint, call assembly directly 6. Free layer.w13_weight etc. immediately after extracting views Also: assemble_scales_3d_side transposes (K_sf, N)→(N, K_sf) internally, but checkpoint scales are ALREADY (N, K_sf). Skip the double-transpose by calling assemble_raw_scales_2d3d_3d_side directly.	2026-05-19 02:16:43 +00:00
biondizzle	0612c1ab54	use proper backend	2026-05-19 02:08:18 +00:00
biondizzle	00fe63b56f	Fix compile test: add warmup for activation global scales	2026-05-19 01:57:16 +00:00
biondizzle	bba3bca4d3	Add torch.compile + custom op integration test	2026-05-19 01:56:46 +00:00
biondizzle	35fab6cff3	Replace autograd.Function with torch.library.custom_op for Dynamo compat Dynamo (torch.compile fullgraph) cannot trace through CuTeDSL internals (cute.compile, JIT, etc.). The autograd.Function approach was unreliable with fullgraph mode — Dynamo would still try to trace through it. Fix: torch.library.custom_op makes Dynamo treat our GEMM as an opaque black box. No reimplementing the kernel — just route through the existing runner via a registry pattern: - Runners registered in global dict with integer IDs - Custom op takes (tensors, runner_id, shape_hint) -> tensor - Dynamo calls fake impl for shape inference, never touches the runner - At execution time, real impl looks up runner and calls _run_impl Changes: - New: cutedsl/custom_ops.py (custom op definitions + registry) - New: tests/test_custom_op.py (local unit tests, no GPU needed) - Removed: _Nvfp4LinearApply, _MoEApply (autograd.Function classes) - Updated: nvfp4_linear.py, runner.py, cutedsl.py, nvfp4_cutedsl.py to use custom ops instead of autograd.Function - Updated: cutedsl_quant_method.py to use custom op + registry	2026-05-19 01:54:48 +00:00
biondizzle	98153002c0	Go back to torch.library.custom_op with correct GEMM impl allow_in_graph doesn't work — Dynamo can't create proxies for Python objects (the runner). The custom op approach requires only tensor args. This time the GEMM impl correctly: - Uses quantize_activation_nvfp4 for activation quantization - Pads x_fp4 via uint8 + view(float4) for torch.zeros compat - Assembles A-side scales with pad + swizzle - Uses int32 expert_offsets (CuTeDSL requirement) - Passes runner's pre-assembled mat_b, scale_b, gsb tensors	2026-05-19 01:24:41 +00:00
biondizzle	02c500bbb1	Switch to allow_in_graph for Dynamo opacity instead of custom op The custom op approach required reimplementing the GEMM (wrong scale assembly, wrong tensor formats, cudaErrorIllegalAddress). Instead, use torch.autograd.Function + torch._dynamo.allow_in_graph which tells Dynamo to treat the function as an opaque kernel call, while still using the runner's battle-tested _run_impl for the actual GEMM. allow_in_graph is the proper way to register opaque ops for Dynamo without reimplementing the computation.	2026-05-19 01:20:07 +00:00
biondizzle	581d87f9a6	Remove warmup forward from process_weights_after_loading The warmup custom op call hit cudaErrorIllegalAddress because our custom op GEMM implementation doesn't match the runner's call convention. Skip warmup for now — MoE kernel warmup handles CuTeDSL JIT cleanup.	2026-05-19 01:18:54 +00:00
biondizzle	5d49849156	Fix: torch.zeros doesn't support Float4_e2m1fn_x2 dtype Allocate as uint8 then view as float4_e2m1fn_x2 for padding buffer.	2026-05-19 01:15:24 +00:00
biondizzle	e1fcfc4f01	Add CuTeDSL warmup + CUDA sync after JIT compilation CuTeDSL cute.compile corrupts GPU memory. Add warmup forward + torch.cuda.synchronize() + health check after finalize_weights, matching the MoE runner pattern.	2026-05-19 01:11:44 +00:00
biondizzle	1d9c0f996c	Fix expert_offsets dtype: CuTeDSL expects int32 not int64 The DSLRuntimeError 'prev_off is Int32...update to Int64 inside if' was caused by passing int64 expert_offsets when the kernel expects int32.	2026-05-19 01:05:20 +00:00
biondizzle	b81200f427	Fix CuTeDSL NVFP4 linear: correct scale assembly in custom op - pad_and_swizzle_single takes 1 arg (2D tensor), not 4 - Inline the scale assembly logic: pad x_sf → swizzle → unsqueeze for 1 group - Remove unused CuTeDSLNvfp4Linear import from custom op impl	2026-05-19 01:01:42 +00:00
biondizzle	e0eb436914	Fix custom_op registration: use as decorator with proper type hints	2026-05-19 00:54:30 +00:00
biondizzle	c609e9ba3c	Use torch.library.custom_op for CuTeDSL NVFP4 linear GEMM Dynamo in fullgraph mode traces through torch.autograd.Function, hitting CuTeDSL JIT internals (Path.cwd) and crashing. Registering as a custom op makes it opaque to Dynamo — tracing calls the fake impl, real impl only runs during inference. Custom op: cutedsl::nvfp4_gemm(x, mat_b, scale_b, global_scale_b, in_features, out_features, activation_global_scale) -> Tensor Store finalized weight tensors on the layer (from runner._mat_b etc.) instead of the runner object, since custom ops can only accept tensors.	2026-05-19 00:50:43 +00:00
biondizzle	c043a11bcc	Register CuTeDSL as proper NvFp4LinearKernel for NVFP4 linear layers - Create CuTeDSLNvFp4LinearKernel extending NvFp4LinearKernel base class - Register it via init_nvfp4_linear_kernel() selection mechanism (inserted at top of _POSSIBLE_NVFP4_KERNELS, before FlashInfer) - process_weights_after_loading: uint8→FP4, permute, create CuTeDSL runner - apply_weights: route through CuTeDSL GEMM - Update Dockerfile: copy kernel + registration script - Fix attention: always use forward() for quantized compressor/indexer layers (dtype check was fragile after kernel swaps weights to dummy BF16)	2026-05-19 00:44:44 +00:00
biondizzle	358830925a	Fix unpack error: handle both tuple and tensor returns from NVFP4 forward()	2026-05-19 00:33:43 +00:00
biondizzle	d9dc042ff7	Fix compressor kv_score: use forward() for NVFP4 quantized weights Raw torch.mm doesn't work with packed uint8 NVFP4 weights. Use MergedColumnParallelLinear.forward() which handles dequantization.	2026-05-19 00:29:43 +00:00
biondizzle	10c14ddb49	Fix NVFP4 mapper: layer norms, hc params, indexer path, q_a_norm - input_layernorm → attn_norm, post_attention_layernorm → ffn_norm - hc_head.fn/base/scale → hc_head_fn/base/scale - attn_hc/ffn_hc → hc_attn/hc_ffn (dot to underscore) - q_a_norm → q_norm, sinks → attn_sink - Indexer params: self_attn.compressor.indexer → attn.indexer (not attn.mla_attn.compressor.indexer)	2026-05-19 00:24:26 +00:00
biondizzle	540e7ee8fc	Fix: layer.self_attn → layer.attn (model uses attn, not self_attn)	2026-05-19 00:14:09 +00:00
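
Commit 5f5b997fc3 describes why a plain reshape of (T, G, D) to (T*G, D) gives the grouped GEMM the wrong row order, and why permuting to (G, T, D) first fixes it. The following is a minimal pure-Python sketch of that layout difference using nested lists in place of tensors. The function names are hypothetical and do not come from the repository; in PyTorch the equivalent would presumably be `x.permute(1, 0, 2).reshape(G * T, D)`.

```python
def flatten_interleaved(x):
    """Naive reshape (T, G, D) -> (T*G, D): rows alternate between groups."""
    return [row for token in x for row in token]


def flatten_groups_first(x):
    """Permute (T, G, D) -> (G, T, D), then flatten to (G*T, D)."""
    n_groups = len(x[0])
    return [token[g] for g in range(n_groups) for token in x]


def unflatten_groups_first(rows, n_tokens, n_groups):
    """Inverse step for the output: (G*T, R) -> (T, G, R)."""
    return [
        [rows[g * n_tokens + t] for g in range(n_groups)]
        for t in range(n_tokens)
    ]


# T=3 tokens, G=2 groups, D=1; each row is labelled with its (token, group).
x = [[[f"t{t}g{g}"] for g in range(2)] for t in range(3)]

interleaved = flatten_interleaved(x)    # t0g0, t0g1, t1g0, t1g1, ...
groups_first = flatten_groups_first(x)  # t0g0, t1g0, t2g0, t0g1, ...
restored = unflatten_groups_first(groups_first, n_tokens=3, n_groups=2)
```

With the groups-first order, rows [0, T) all belong to group 0 and rows [T, 2*T) to group 1, which is the contiguous per-group layout the commit message says the grouped GEMM expects.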

368 Commits
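
Commits 882d4996ff and 4ed91b81d0 refer to inverse RoPE as a conjugate rotation whose cross terms carry the opposite sign from the forward rotation. The sketch below illustrates that relationship on a single scalar pair. It is an assumption-laden illustration: the function names are hypothetical, and the pairing convention (interleaved or half-split) and tensor shapes used by the real `cutedsl/inverse_rope.py` are not stated in the log.

```python
import math


def rope_rotate(x1, x2, theta):
    """Forward rotation of one (x1, x2) pair by angle theta."""
    c, s = math.cos(theta), math.sin(theta)
    return x1 * c - x2 * s, x1 * s + x2 * c


def rope_rotate_inverse(y1, y2, theta):
    """Conjugate rotation: same cos terms, signs flipped on the sin cross terms."""
    c, s = math.cos(theta), math.sin(theta)
    return y1 * c + y2 * s, -y1 * s + y2 * c


# Round trip: rotating and then applying the inverse recovers the input.
x = (0.75, -1.25)
theta = 0.3
y = rope_rotate(*x, theta)
x_back = rope_rotate_inverse(*y, theta)
```

Getting the cross-term signs wrong turns the "inverse" into a second forward rotation by the same angle, so a round-trip check like the one above is a cheap way to catch the kind of sign error that 4ed91b81d0 fixes.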