nvfp4-megamoe-kernel

Author	SHA1	Message	Date
biondizzle	eef0ef76af	Fix NVFP4 compressor scale loading: buffer and concatenate scale shards The stacked params mapping (wkv + wgate → fused_wkv_wgate) uses weight_loader(param, weight, shard_id), but PerTensorScaleParameter and ModelWeightParameter for NVFP4 scale params don't support shard_id in load_column_parallel_weight (asserts shape equality). Fix: buffer input_scale, weight_scale, weight_scale_2 for fused_wkv_wgate shards, then concatenate along dim 0 and copy_ into the param after all weights are loaded.	2026-05-18 23:24:08 +00:00
biondizzle	f74447bfd0	Proper NVFP4 integration: quantized compressor/indexer + mapper fixes Weight mapper fixes: - Reorder substr renames: compressor renames first, then .self_attn.compressor. → .attn.mla_attn.compressor., then indexer renames (so indexer keys end up under mla_attn after the compressor rename already fired) - Add compressor param renames: kv_proj→wkv, gate_proj→wgate, kv_norm→norm, position_bias→ape (checkpoint uses NVFP4 naming, model uses internal names) - Add indexer param renames: q_b_proj→wq_b, kv_proj→compressor.wkv, gate_proj→compressor.wgate, kv_norm→k_norm, position_bias→compressor.ape, weights_proj stays (structural: compressor.indexer → indexer.compressor) - Remove broken suffix renames (already fixed in prior commit) Model architecture fixes: - Patch deepseek_compressor.py to pass quant_config (was None, but NVFP4 checkpoint has quantized compressor weights with input_scale/weight_scale) - Patch deepseek_v4_attention.py indexer: weights_proj now uses quant_config (was None, but checkpoint has quantized weights) - Add indexer.compressor.fused_wkv_wgate stacking in load_weights Infrastructure: - Add deepseek_compressor.py to Dockerfile - Force MoE backend to flashinfer_cutedsl (was auto-selecting FLASHINFER_TRTLLM) - Update unit test to 50 cases (compressor + indexer + quantization scales)	2026-05-18 23:20:13 +00:00
biondizzle	17496b2615	Fix NVFP4 weights mapper: add prefix mappings, fix substr order - Add orig_to_new_prefix mappings (layers→model.layers, embed_tokens→model.embed_tokens, etc.) AutoWeightsLoader strips the model. prefix before the mapper runs, so these are required - Move .self_attn.compressor. → .attn.mla_attn.compressor. before .self_attn. → .attn. in substr_renames so compressor keys get the mla_attn prefix before the general rename - Remove suffix renames (head.weight→lm_head.weight, embed.weight→embed_tokens.weight) that were causing double-mapping since the NVFP4 checkpoint already uses lm_head/embed_tokens - Add unit test: tests/test_nvfp4_mapper.py (39 cases, no vLLM/CUDA needed)	2026-05-18 23:03:34 +00:00
biondizzle	b039123207	Fix NVFP4 mapper: add attention projection renames, remove norm_gate renames - Add specific .self_attn.{q_a,kv,q_b,o_a,o_b}_proj → .attn.{wq_a,wkv,wq_b,wo_a,wo_b} - Remove norm_gate suffix renames (nightly uses 'gate' not 'norm_gate') - Order substr renames: specific before general	2026-05-18 22:53:09 +00:00
biondizzle	ea648a9bc2	Fix NVFP4 mapper: keep model. prefix (model params use it)	2026-05-18 22:49:40 +00:00
biondizzle	1528d4e182	Fix NVFP4 mapper: strip model. prefix from checkpoint keys The NVFP4 checkpoint uses model.layers.* but vLLM's AutoWeightsLoader expects layers.* (relative to the model module). Strip the model. prefix instead of adding it.	2026-05-18 22:46:04 +00:00
biondizzle	5d37674fb1	Add cutedsl to MoEBackend type in kernel config	2026-05-18 22:38:41 +00:00
biondizzle	7409204d71	Use nightly's deepseek_v4.py + attention as base, add only NVFP4 mapper The upstream deepseek_v4.py has imports that don't exist in the nightly Docker image (norm_gate_linear, breakable_cudagraph, etc.). Use the nightly's own files as the base and add only the minimal NVFP4 changes: - Add _make_deepseek_v4_nvfp4_weights_mapper() for checkpoint key mapping - Select NVFP4 mapper when quant_config is modelopt_fp4 - cos_sin_cache float32 fix in attention - Remove utils.py patch (not needed)	2026-05-18 22:33:51 +00:00
biondizzle	a19ed4a18e	Remove breakable_cudagraph import (not in nightly)	2026-05-18 22:29:24 +00:00
biondizzle	b007937a68	Fix garbled imports in cutedsl/runner.py	2026-05-18 22:22:52 +00:00
biondizzle	a7ed8faec6	Proper NVFP4 integration: use ModelOptNvFp4Config + FusedMoE framework Major refactor to eliminate all post-load hacks: - deepseek_v4.py: use upstream model with NVFP4 weight mapper only (gate_proj→w1, up_proj→w3, down_proj→w2, .self_attn→.attn, .mlp→.ffn) - Add CuTeDSLMoEExperts as a FusedMoEExpertsModular subclass that wraps our CuTeDSL runner as a proper vLLM MoE backend - Register CUTEDSL backend in the NVFP4 oracle - Use ModelOptNvFp4Config for quantization dispatch (not DeepseekV4FP8Config) - ModelOptNvFp4LinearMethod handles NVFP4 attention/shared expert projections - Remove nvfp4_cutedsl.py, cutedsl_quant_method.py, utils.py from Dockerfile - CuTeDSL runner moved to cutedsl/runner.py for clean imports - cos_sin_cache float32 fix in deepseek_v4_attention.py No more monkey-patching, no _convert_nvfp4_post_load, no CuTeDSLNvfp4Method.	2026-05-18 22:19:23 +00:00
biondizzle	48386e34ad	Fix torch.compile: use custom autograd Function instead of @torch.compiler.disable torch.compile fullgraph mode can't handle @torch.compiler.disable (skips the function and refuses to compile). Custom autograd Functions are treated as opaque ops by torch.compile — they execute eagerly without the compiler trying to trace into CuTeDSL internals (JIT, Path.cwd, etc).	2026-05-18 21:38:28 +00:00
biondizzle	85e1cd3b69	Fix torch.compile crash: @torch.compiler.disable on all CuTeDSL run() CuTeDSL internals (Path.cwd, threading, JIT) are incompatible with torch.dynamo tracing. Marking run() as compiler-disabled makes the runners opaque to torch.compile — they execute eagerly while the rest of the model gets compiled.	2026-05-18 21:07:35 +00:00
biondizzle	a94011ec92	Fix torch.compile crash: remove threading.Lock from LUT cache path The _NVFP4_STEP_LUT_LOCK caused 'Unsupported context manager' under torch.compile/cudagraph. LUT is now pre-populated during warmup so the fast path (cache hit) never hits a lock. Also removed all init/warmup debug prints from CuTeDSL kernels.	2026-05-18 20:54:55 +00:00
biondizzle	6326222d68	Fix: add abstract create_weights to CuTeDSLNvfp4LinearMethod	2026-05-18 20:40:48 +00:00
biondizzle	450793311c	Wire CuTeDSL kernels into vLLM: replace all BF16 dequant with native NVFP4 - CuTeDSLNvfp4Method: custom quant method that creates CuTeDSL runners during process_weights_after_loading, then swaps to CuTeDSLNvfp4LinearMethod for forward dispatch - Attention projections (fused_wqa_wkv, wq_b, wo_b) now route through CuTeDSLNvfp4Linear (cosine 0.992-0.996 vs BF16 reference) - Shared expert now uses CuTeDSLSharedExpertRunner (cosine 0.992 vs BF16) with monkey-patched forward for fused L1+SiLU+L2 pipeline - Deleted all BF16 dequant code (_dequant_nvfp4_to_bf16, _post_quant_fix, input_scale fixes) - Deleted _post_quant_fix hook from utils.py - Fixed SwiGLU clamp: gate clamped BEFORE SiLU (matching SiluAndMulWithClamp) - Cleaned up all debug prints - Updated Dockerfile with new kernel files	2026-05-18 20:27:42 +00:00
biondizzle	6ce6a47be9	Add NVFP4 linear runner + attention projection test - CuTeDSLNvfp4Linear: generic single-GEMM runner for any NVFP4 projection - test_attention.py: tests q_a_proj, q_b_proj, kv_proj, o_b_proj vs BF16 - Same pad+swizzle pattern as shared expert, but no SiLU/fusion	2026-05-18 20:14:03 +00:00
biondizzle	f07643791e	Fix hidden_size: shared expert uses 7168, not HC_DIM 28672	2026-05-18 20:10:32 +00:00
biondizzle	70f50a1ec6	Fix scale assembly: use correctly-sized temp buffer for swizzle	2026-05-18 20:09:50 +00:00
biondizzle	97bdd604e9	Fix scale assembly: reshape swizzled output to 2D	2026-05-18 20:09:19 +00:00
biondizzle	c1aa4af123	Shared expert: dedicated CuTeDSL runner with proper scale assembly - CuTeDSLSharedExpertRunner: num_groups=1 GEMM, no scatter/routing - _assemble_scales_single_group: pad to 128 rows + Blackwell swizzle - All buffers pre-allocated for cudagraph compatibility - Updated test to use dedicated runner instead of MoE runner hack	2026-05-18 20:08:34 +00:00
biondizzle	b3451c74f8	Update README and CURRENT_BUG.md with current state - README: updated NVFP4 coverage table, status, and plan - CURRENT_BUG.md: full debugging journey, what works, what's next - Both reflect decision to build our own CuTeDSL kernels	2026-05-18 20:05:03 +00:00
biondizzle	e8b289e30d	WIP: CuTeDSL shared expert kernel Dedicated runner (shared_expert_pipeline.py) and test (test_shared_expert.py). Tried reusing MoE runner with 1 expert — fails because MoE runner assumes hidden_size != HC_DIM for scatter. Need dedicated runner with correct scale assembly. Will continue tomorrow.	2026-05-18 20:02:19 +00:00
biondizzle	1836e5fdc7	Add shared experts to post-quant BF16 dequant fix Shared experts also use FlashInferCutlassNvFp4LinearKernel with broken input_scale. They need the same BF16 dequant treatment. gate_up_proj and down_proj on ffn.shared_experts.	2026-05-18 19:27:49 +00:00
biondizzle	82ac648563	Patch utils.py the standard way: copy modified file into Docker image Instead of fragile inline Dockerfile patching, just copy a modified utils.py (with _post_quant_fix call) into the image, same pattern as deepseek_v4.py and deepseek_v4_attention.py patches.	2026-05-18 19:10:08 +00:00
biondizzle	3c1a76bdcc	Fix Dockerfile: use external patch script instead of inline Python Docker's parser chokes on multi-line Python in RUN. Moved to scripts/patch_utils.py and COPY + RUN it.	2026-05-18 19:03:57 +00:00
biondizzle	75844a8361	Post-quant fix via Dockerfile patch to process_weights_after_loading Forward pre-hook approach didn't work — torch.compile and model wrappers bypass hooks. Instead, patch vLLM's utils.py to call model._post_quant_fix() at the end of process_weights_after_loading. This guarantees the fix runs AFTER quant methods set up their attrs. Dockerfile now patches: model_loader/utils.py → calls model._post_quant_fix() if it exists DeepseekV4ForCausalLM._post_quant_fix() dequantizes attention NVFP4 weights to BF16 and replaces quant_method.	2026-05-18 18:35:34 +00:00
biondizzle	a4ad5898c1	Fix post-quant hook: register on inner model, fix module refs vLLM V1 calls DeepseekV4Model.forward() directly, not DeepseekV4ForCausalLM.forward(). Hook on the outer model never fires. Moved hook to self.model (inner) and fixed module.model.layers → module.layers.	2026-05-18 18:15:36 +00:00
biondizzle	a51edd238e	Add post-quant-init forward hook to fix attention NVFP4 The key insight: process_weights_after_loading runs AFTER load_weights and sets up FlashInferCutlassNvFp4LinearKernel with broken input_global_scale_inv. Any fix inside load_weights gets overwritten. Solution: register a one-shot forward pre-hook that runs on the first forward call (guaranteed after all init). It dequantizes attention NVFP4 weights to BF16 and replaces quant_method with UnquantizedLinearMethod. Since process_weights_after_loading already ran, our changes won't be overwritten. Standalone test confirmed: all attention weights produce valid non-NaN output when dequantized to BF16.	2026-05-18 17:56:19 +00:00
biondizzle	2835cb040b	Fix input_scale BEFORE process_weights_after_loading runs Instead of dequantizing to BF16 (which gets overwritten by process_weights_after_loading), fix the input_scale parameter on the module before the quant method reads it. The quant method computes input_global_scale_inv = input_scale.max(), so fixing input_scale propagates the correct activation scale. Computes correct input_scale by temporarily dequantizing weight to BF16, running warmup forward, and computing act_amax. input_scale = 1/(act_amax * headroom).	2026-05-18 16:43:44 +00:00
biondizzle	2fc81ccac4	Revert to BF16 dequant for attention NVFP4 (input_scale fix was too early) process_weights_after_loading sets input_global_scale_inv AFTER _convert_nvfp4_post_load runs, so the fix couldn't find the attrs. Going back to BF16 dequant approach. The zeros in the dummy run are expected (attention_impl returns early with out.zero_()). Need to test with a real request under cudagraph_mode=NONE.	2026-05-18 16:23:41 +00:00
biondizzle	4a57399592	Add debug prints for input_global_scale_inv check	2026-05-18 15:59:59 +00:00
biondizzle	f86892e26b	Replace BF16 dequant with input_scale warmup fix for attention NVFP4 Instead of dequantizing attention weights to BF16 (which had issues with MergedColumnParallelLinear and different weight_scale_2 values), keep the NVFP4 path but fix the activation global scale. Compute correct input_global_scale_inv by: 1. Temporarily dequantizing weight to BF16 2. Running warmup forward with random input 3. Computing actual activation amax 4. Setting scale_inv = amax * headroom This preserves the original NVFP4 quantization pipeline.	2026-05-18 15:43:46 +00:00
biondizzle	301015b037	Remove all inline diagnostics — incompatible with torch.compile Data-dependent expressions (amax().item(), isnan().any().item()) cause Dynamo guard failures even when gated by os.environ. cudagraph_mode=NONE still uses torch.compile, so these break. Will need enforce-eager for diagnostics going forward.	2026-05-18 15:22:53 +00:00
biondizzle	a83d364d45	Switch to cudagraph_mode=NONE (not enforce-eager) for real inference testing	2026-05-18 15:05:52 +00:00
biondizzle	2a2a42c6d6	Add attention-internal diagnostics: MLA output, FP8 quant output	2026-05-18 14:45:43 +00:00
biondizzle	5c1dda10f6	Add granular attention diagnostics: pre/post attn, embed, dequant stats	2026-05-18 14:24:14 +00:00
biondizzle	e0e0528778	Add debug logging for BF16 dequant to find missing attrs	2026-05-18 14:04:12 +00:00
biondizzle	2e8c3c961f	Fix: dequantize fused_wqa_wkv instead of separate wq_a/wkv wq_a and wkv are fused into a single MergedColumnParallelLinear called fused_wqa_wkv. Was checking for non-existent separate attrs.	2026-05-18 13:47:08 +00:00
biondizzle	a7216b27df	Fix: keep wo_a as FP8 (fp8_einsum path), dequant others to BF16 wo_a uses fp8_einsum which is weight-only FP8 (no input_scale). Only q_a, q_b, kv, o_b need BF16 dequant to avoid broken input_scale.	2026-05-18 13:22:15 +00:00
biondizzle	334e95047e	Fix: dequantize ALL attention NVFP4 projections to BF16 Root cause of NaN from layer 0: FlashInferCutlassNvFp4LinearKernel uses checkpoint input_scale for activation quantization, which produces NaN immediately. Fix: dequantize all attention NVFP4 weights (wq_a, wq_b, wkv, wo_a, wo_b) to BF16 at load time, bypassing the broken input_scale entirely. Uses existing _dequant_nvfp4_to_bf16 method. This trades memory for correctness. Future optimization: add warmup for attention input_global_scale_inv (same as MoE warmup).	2026-05-18 13:09:36 +00:00
biondizzle	a83c332059	Fix docker-compose: remove orphaned compilation-config arg, enforce-eager mode	2026-05-18 12:54:14 +00:00
biondizzle	9e7639fba4	Add layer-by-layer diagnostic prints (CLAWMINE_DEBUG=1, enforce-eager) When CLAWMINE_DEBUG=1, prints amax/mean/NaN/Inf after each layer. Must run with --enforce-eager (data-dependent prints break Dynamo). Gated by os.environ so dead-code-eliminated during compilation.	2026-05-18 12:51:51 +00:00
biondizzle	2d1e9f42b1	Remove NaN check — incompatible with Dynamo fullgraph compilation Dynamo fullgraph mode rejects BOTH data-dependent branching AND torch.compiler.disable as graph breaks. The NaN check cannot coexist with vLLM's AOT compilation. Use layertest/cudagraph_test for debugging.	2026-05-18 12:17:25 +00:00
biondizzle	65763a200c	Fix NaN check: wrap in @torch.compiler.disable to prevent Dynamo graph break The inline os.environ gate doesn't work — Dynamo still sees the data-dependent branching (torch.isnan().any()) and crashes with 'Unsupported: Data-dependent branching'. Extracting into a @torch.compiler.disable decorated function makes Dynamo skip it.	2026-05-18 11:33:29 +00:00
biondizzle	8758bc93ca	crap shoot	2026-05-18 11:13:29 +00:00
biondizzle	b8df4a8cc5	Fix NaN check: use os.environ gate instead of is_current_stream_capturing torch.cuda.is_current_stream_capturing() returns bool, which breaks Dynamo FX tracing (non-Tensor output). Switch to env var gate: CLAWMINE_NAN_CHECK=1 enables NaN/Inf detection. Dynamo evaluates os.environ at trace time — if the env var is not set, the entire NaN check block is compiled away. Set it before first inference to get NaN detection during prefill only.	2026-05-18 02:20:14 +00:00
biondizzle	0c02d84514	Add NaN/Inf detection in DeepseekV4Model.forward layer loop - Checks every layer during prefill (not during cudagraph capture) - is_current_stream_capturing() gate prevents CPU-GPU syncs during capture - Prints amax every 10 layers for magnitude tracking - Breaks on first NaN/Inf to avoid wasting compute	2026-05-17 23:37:12 +00:00
biondizzle	bedcfc4dab	Pipeline test: use max_num_tokens=8192 matching vLLM	2026-05-17 23:04:44 +00:00
biondizzle	c45364b3a8	Add MoE scale ratio output	2026-05-17 22:58:27 +00:00
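
Note on the mapper commits (b039123207, 17496b2615, f74447bfd0): each depends on the order of the substring renames, with the specific `.self_attn.compressor.` rule firing before the general `.self_attn.` rule. The sketch below shows why the order matters. It assumes renames are applied one after another as plain string replacements. `apply_renames` and the example key are hypothetical and are not vLLM's mapper API.

```python
def apply_renames(key, renames):
    """Apply (old, new) substring renames in list order (hypothetical helper)."""
    for old, new in renames:
        key = key.replace(old, new)
    return key


# Rule pairs taken from the commit messages; the list structure is an assumption.
SPECIFIC_FIRST = [
    (".self_attn.compressor.", ".attn.mla_attn.compressor."),
    (".self_attn.", ".attn."),
]
GENERAL_FIRST = list(reversed(SPECIFIC_FIRST))

# Hypothetical checkpoint key, used only for illustration.
key = "layers.0.self_attn.compressor.kv_proj.weight"

good = apply_renames(key, SPECIFIC_FIRST)
bad = apply_renames(key, GENERAL_FIRST)

print(good)  # layers.0.attn.mla_attn.compressor.kv_proj.weight
print(bad)   # layers.0.attn.compressor.kv_proj.weight
```

With the general rule first, `.self_attn.` is consumed before the compressor rule can match, so the key never receives the `mla_attn` prefix.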

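Note on the input_scale commits (f86892e26b, 2835cb040b): the two messages give reciprocal formulas, `scale_inv = amax * headroom` and `input_scale = 1/(act_amax * headroom)`. A small arithmetic sketch of that relationship follows. The function name, the sample activations, and the headroom value are assumptions for illustration, since the log does not state the headroom actually used.

```python
def activation_scales(activations, headroom):
    """Return (input_scale, scale_inv) from observed activations.

    Hypothetical helper: mirrors the formulas quoted in the commit
    messages, using a plain list of floats in place of a tensor.
    """
    act_amax = max(abs(x) for x in activations)
    scale_inv = act_amax * headroom   # f86892e26b: scale_inv = amax * headroom
    input_scale = 1.0 / scale_inv     # 2835cb040b: input_scale = 1/(act_amax * headroom)
    return input_scale, scale_inv


# Illustrative values only; the headroom of 2.0 is an assumption.
input_scale, scale_inv = activation_scales([0.5, -2.0, 1.25], headroom=2.0)
print(input_scale, scale_inv)  # 0.25 4.0
```
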
312 Commits
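
Note on eef0ef76af: the fix buffers the `input_scale`, `weight_scale`, and `weight_scale_2` shards for `fused_wkv_wgate`, then concatenates them along dim 0 once all weights are loaded. The sketch below shows that buffer-then-concatenate pattern with nested lists standing in for tensors. The class name, the string shard ids, and the `wkv`-then-`wgate` order are assumptions. Real code would presumably use `torch.cat(..., dim=0)` followed by `copy_` into the parameter, as the commit message describes.

```python
from collections import defaultdict

# Assumed shard order inside the fused parameter (wkv rows first, then wgate).
SHARD_ORDER = ("wkv", "wgate")


class ScaleShardBuffer:
    """Collect per-shard scale rows and fuse them after loading (hypothetical)."""

    def __init__(self):
        self._pending = defaultdict(dict)

    def add(self, param_name, shard_id, rows):
        # Shards may arrive in any order; store them keyed by shard id.
        self._pending[param_name][shard_id] = rows

    def flush(self):
        """Concatenate shards along dim 0 (the row axis) in SHARD_ORDER."""
        fused = {}
        for name, shards in self._pending.items():
            missing = [s for s in SHARD_ORDER if s not in shards]
            if missing:
                raise ValueError(f"{name}: missing shards {missing}")
            fused[name] = [row for s in SHARD_ORDER for row in shards[s]]
        self._pending.clear()
        return fused


buf = ScaleShardBuffer()
# Shards deliberately added out of order to show that flush() reorders them.
buf.add("fused_wkv_wgate.weight_scale", "wgate", [[3.0], [4.0]])
buf.add("fused_wkv_wgate.weight_scale", "wkv", [[1.0], [2.0]])
fused = buf.flush()
print(fused["fused_wkv_wgate.weight_scale"])  # [[1.0], [2.0], [3.0], [4.0]]
```

Deferring the concatenation until every shard is present avoids the shape-equality assertion that the commit message attributes to `load_column_parallel_weight`.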