nvfp4-megamoe-kernel

Author	SHA1	Message	Date
biondizzle	3c1a76bdcc	Fix Dockerfile: use external patch script instead of inline Python. Docker's parser chokes on multi-line Python in RUN. Moved to scripts/patch_utils.py and COPY + RUN it.	2026-05-18 19:03:57 +00:00
biondizzle	75844a8361	Post-quant fix via Dockerfile patch to process_weights_after_loading. The forward pre-hook approach didn't work: torch.compile and model wrappers bypass hooks. Instead, patch vLLM's utils.py to call model._post_quant_fix() at the end of process_weights_after_loading. This guarantees the fix runs AFTER quant methods set up their attrs. Dockerfile now patches model_loader/utils.py to call model._post_quant_fix() if it exists. DeepseekV4ForCausalLM._post_quant_fix() dequantizes attention NVFP4 weights to BF16 and replaces quant_method.	2026-05-18 18:35:34 +00:00
biondizzle	a4ad5898c1	Fix post-quant hook: register on inner model, fix module refs. vLLM V1 calls DeepseekV4Model.forward() directly, not DeepseekV4ForCausalLM.forward(), so a hook on the outer model never fires. Moved hook to self.model (inner) and fixed module.model.layers → module.layers.	2026-05-18 18:15:36 +00:00
biondizzle	a51edd238e	Add post-quant-init forward hook to fix attention NVFP4. The key insight: process_weights_after_loading runs AFTER load_weights and sets up FlashInferCutlassNvFp4LinearKernel with broken input_global_scale_inv. Any fix inside load_weights gets overwritten. Solution: register a one-shot forward pre-hook that runs on the first forward call (guaranteed after all init). It dequantizes attention NVFP4 weights to BF16 and replaces quant_method with UnquantizedLinearMethod. Since process_weights_after_loading already ran, our changes won't be overwritten. Standalone test confirmed: all attention weights produce valid non-NaN output when dequantized to BF16.	2026-05-18 17:56:19 +00:00
biondizzle	2835cb040b	Fix input_scale BEFORE process_weights_after_loading runs. Instead of dequantizing to BF16 (which gets overwritten by process_weights_after_loading), fix the input_scale parameter on the module before the quant method reads it. The quant method computes input_global_scale_inv = input_scale.max(), so fixing input_scale propagates the correct activation scale. Computes correct input_scale by temporarily dequantizing weight to BF16, running warmup forward, and computing act_amax. input_scale = 1/(act_amax * headroom).	2026-05-18 16:43:44 +00:00
biondizzle	2fc81ccac4	Revert to BF16 dequant for attention NVFP4 (input_scale fix was too early). process_weights_after_loading sets input_global_scale_inv AFTER _convert_nvfp4_post_load runs, so the fix couldn't find the attrs. Going back to BF16 dequant approach. The zeros in the dummy run are expected (attention_impl returns early with out.zero_()). Need to test with a real request under cudagraph_mode=NONE.	2026-05-18 16:23:41 +00:00
biondizzle	4a57399592	Add debug prints for input_global_scale_inv check	2026-05-18 15:59:59 +00:00
biondizzle	f86892e26b	Replace BF16 dequant with input_scale warmup fix for attention NVFP4. Instead of dequantizing attention weights to BF16 (which had issues with MergedColumnParallelLinear and different weight_scale_2 values), keep the NVFP4 path but fix the activation global scale. Compute correct input_global_scale_inv by: (1) temporarily dequantizing weight to BF16; (2) running warmup forward with random input; (3) computing actual activation amax; (4) setting scale_inv = amax * headroom. This preserves the original NVFP4 quantization pipeline.	2026-05-18 15:43:46 +00:00
biondizzle	301015b037	Remove all inline diagnostics: incompatible with torch.compile. Data-dependent expressions (amax().item(), isnan().any().item()) cause Dynamo guard failures even when gated by os.environ. cudagraph_mode=NONE still uses torch.compile, so these break. Will need enforce-eager for diagnostics going forward.	2026-05-18 15:22:53 +00:00
biondizzle	a83d364d45	Switch to cudagraph_mode=NONE (not enforce-eager) for real inference testing	2026-05-18 15:05:52 +00:00
biondizzle	2a2a42c6d6	Add attention-internal diagnostics: MLA output, FP8 quant output	2026-05-18 14:45:43 +00:00
biondizzle	5c1dda10f6	Add granular attention diagnostics: pre/post attn, embed, dequant stats	2026-05-18 14:24:14 +00:00
biondizzle	e0e0528778	Add debug logging for BF16 dequant to find missing attrs	2026-05-18 14:04:12 +00:00
biondizzle	2e8c3c961f	Fix: dequantize fused_wqa_wkv instead of separate wq_a/wkv. wq_a and wkv are fused into a single MergedColumnParallelLinear called fused_wqa_wkv. Was checking for non-existent separate attrs.	2026-05-18 13:47:08 +00:00
biondizzle	a7216b27df	Fix: keep wo_a as FP8 (fp8_einsum path), dequant others to BF16. wo_a uses fp8_einsum which is weight-only FP8 (no input_scale). Only q_a, q_b, kv, o_b need BF16 dequant to avoid broken input_scale.	2026-05-18 13:22:15 +00:00
biondizzle	334e95047e	Fix: dequantize ALL attention NVFP4 projections to BF16. Root cause of NaN from layer 0: FlashInferCutlassNvFp4LinearKernel uses checkpoint input_scale for activation quantization, which produces NaN immediately. Fix: dequantize all attention NVFP4 weights (wq_a, wq_b, wkv, wo_a, wo_b) to BF16 at load time, bypassing the broken input_scale entirely. Uses existing _dequant_nvfp4_to_bf16 method. This trades memory for correctness. Future optimization: add warmup for attention input_global_scale_inv (same as MoE warmup).	2026-05-18 13:09:36 +00:00
biondizzle	a83c332059	Fix docker-compose: remove orphaned compilation-config arg, enforce-eager mode	2026-05-18 12:54:14 +00:00
biondizzle	9e7639fba4	Add layer-by-layer diagnostic prints (CLAWMINE_DEBUG=1, enforce-eager). When CLAWMINE_DEBUG=1, prints amax/mean/NaN/Inf after each layer. Must run with --enforce-eager (data-dependent prints break Dynamo). Gated by os.environ so dead-code-eliminated during compilation.	2026-05-18 12:51:51 +00:00
biondizzle	2d1e9f42b1	Remove NaN check: incompatible with Dynamo fullgraph compilation. Dynamo fullgraph mode rejects BOTH data-dependent branching AND torch.compiler.disable as graph breaks. The NaN check cannot coexist with vLLM's AOT compilation. Use layertest/cudagraph_test for debugging.	2026-05-18 12:17:25 +00:00
biondizzle	65763a200c	Fix NaN check: wrap in @torch.compiler.disable to prevent Dynamo graph break. The inline os.environ gate doesn't work: Dynamo still sees the data-dependent branching (torch.isnan().any()) and crashes with 'Unsupported: Data-dependent branching'. Extracting into a @torch.compiler.disable decorated function makes Dynamo skip it.	2026-05-18 11:33:29 +00:00
biondizzle	8758bc93ca	crap shoot	2026-05-18 11:13:29 +00:00
biondizzle	b8df4a8cc5	Fix NaN check: use os.environ gate instead of is_current_stream_capturing. torch.cuda.is_current_stream_capturing() returns bool, which breaks Dynamo FX tracing (non-Tensor output). Switch to env var gate: CLAWMINE_NAN_CHECK=1 enables NaN/Inf detection. Dynamo evaluates os.environ at trace time; if the env var is not set, the entire NaN check block is compiled away. Set it before first inference to get NaN detection during prefill only.	2026-05-18 02:20:14 +00:00
biondizzle	0c02d84514	Add NaN/Inf detection in DeepseekV4Model.forward layer loop: checks every layer during prefill (not during cudagraph capture); is_current_stream_capturing() gate prevents CPU-GPU syncs during capture; prints amax every 10 layers for magnitude tracking; breaks on first NaN/Inf to avoid wasting compute.	2026-05-17 23:37:12 +00:00
biondizzle	bedcfc4dab	Pipeline test: use max_num_tokens=8192 matching vLLM	2026-05-17 23:04:44 +00:00
biondizzle	c45364b3a8	Add MoE scale ratio output	2026-05-17 22:58:27 +00:00
biondizzle	bf99ad49ec	Print both MoE and residual cosine	2026-05-17 22:56:56 +00:00
biondizzle	8637020487	Fix multi-layer test: add residual connections	2026-05-17 22:55:40 +00:00
biondizzle	11dce13afe	Add multi-layer pipeline test to check error accumulation	2026-05-17 22:53:28 +00:00
biondizzle	87582fc9f7	HOTFIX: remove NaN checks from run() — torch.isnan().any() does CPU-GPU sync, breaks cudagraph	2026-05-17 22:28:32 +00:00
biondizzle	8717e0e411	Fix warmup: use same padded GEMM path as run(), add swiglu_limit clamping	2026-05-17 22:03:48 +00:00
biondizzle	d332f4f900	Add NaN debug checks after L1 and L2 GEMM	2026-05-17 22:02:24 +00:00
biondizzle	e65f2b2ba2	Update CURRENT_BUG.md with Bug 26 fix	2026-05-17 21:36:25 +00:00
biondizzle	72628fb689	Full pipeline test: runner vs BF16 reference	2026-05-17 21:29:16 +00:00
biondizzle	2796bd81e8	Fix: scatter FP4 as uint8 (float4 doesn't support index_put)	2026-05-17 21:28:04 +00:00
biondizzle	364f8372bb	Fix FP4 buffer shapes: D//2 for packed dimensions	2026-05-17 21:26:46 +00:00
biondizzle	5e4d674736	Test fix: quantize slot_hidden, scatter FP4, pass slot_x_sf	2026-05-17 21:25:58 +00:00
biondizzle	803e7160d8	Fix: allocate FP4 buffers as uint8 then view-cast	2026-05-17 21:25:04 +00:00
biondizzle	7256070dd3	FIX Bug 26: quantize slot tokens, not padded buffer. The runner was quantizing the padded_hidden (4096 rows) and then taking x_sf[:num_slots] (first 48 rows). This only got scales for expert 0 (the first 48 rows of the padded buffer), not the scales for tokens scattered across padded positions (expert 1 at row 128, etc). Fix: quantize slot_hidden (sorted tokens, num_slots rows) to get correct per-token x_sf, then scatter x_fp4 into padded FP4 buffer for the GEMM. The scale assembly now receives the correct x_sf. Added hidden_fp4 and activated_fp4 padded buffers for FP4 scatter.	2026-05-17 21:24:43 +00:00
biondizzle	4d0b6d889d	Set runner weights before _ensure_stacked	2026-05-17 21:22:50 +00:00
biondizzle	b7acac5e4e	Call _ensure_stacked() before using runner buffers	2026-05-17 21:22:30 +00:00
biondizzle	1acf01fc1a	Fix token_indices: repeat each token ID top_k times, not arange	2026-05-17 21:22:11 +00:00
biondizzle	a478ca4746	Debug: trace runner logic step by step, test L1 GEMM	2026-05-17 21:21:45 +00:00
biondizzle	a100bd11c1	Simplify pipeline test: BF16 ref + bridge ref + full runner	2026-05-17 21:20:41 +00:00
biondizzle	6eade5e7f8	Fix: gs values are floats not tensors	2026-05-17 21:19:47 +00:00
biondizzle	b05a38a9bd	Test stages 1-2 first: sort + L1 GEMM	2026-05-17 21:19:23 +00:00
biondizzle	9728604ea1	Pipeline test: stage-by-stage with BF16 reference comparison	2026-05-17 21:19:17 +00:00
biondizzle	7fff5fd39b	Fix: correct intermediate_size=3072, weight key prefix, dequantize shapes	2026-05-17 21:18:20 +00:00
biondizzle	4ef345773d	Rewrite pipeline test: load real weights, step-by-step vs BF16 reference	2026-05-17 21:17:18 +00:00
biondizzle	b43541afdd	Fix test path setup	2026-05-17 21:00:00 +00:00
biondizzle	490ddfa294	Pipeline test: use synthetic weights at 256x512 (JIT at 7168x18432 hangs for hours)	2026-05-17 20:58:06 +00:00
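
Note on Bug 26 (commit 7256070dd3): the indexing mistake described in that message can be reproduced in a few lines of plain Python. This is only a toy sketch, not the repository's code. The helper names (row_scale, slot_to_padded_rows), the 8-row alignment, and the use of a per-row absolute maximum as a stand-in for real NVFP4 scale factors are all assumptions made for illustration.

```python
def row_scale(row):
    """Toy per-row scale: absolute maximum of the row (stand-in for NVFP4 scales)."""
    return max(abs(v) for v in row)


def slot_to_padded_rows(tokens_per_expert, align):
    """Map each sorted slot to its row in a buffer where every expert's
    region starts at a multiple of `align` (hypothetical layout)."""
    rows, start = [], 0
    for n in tokens_per_expert:
        rows.extend(start + i for i in range(n))
        start += -(-n // align) * align  # round n up to a multiple of align
    return rows, start


tokens_per_expert = [2, 3]   # expert 0 owns 2 slots, expert 1 owns 3
align = 8                    # toy alignment; the commit mentions expert 1 at row 128
slot_hidden = [[1.0, -2.0], [0.5, 0.25], [4.0, 1.0], [-8.0, 2.0], [3.0, -3.0]]
num_slots = len(slot_hidden)

rows, total_rows = slot_to_padded_rows(tokens_per_expert, align)

# Scatter the sorted slot tokens into the zero-padded buffer.
padded_hidden = [[0.0, 0.0] for _ in range(total_rows)]
for slot, row in enumerate(rows):
    padded_hidden[row] = slot_hidden[slot]

# Buggy pattern: scale every padded row, then keep the first num_slots entries.
# Those rows all lie inside expert 0's region, so expert 1's scales are lost.
buggy_x_sf = [row_scale(r) for r in padded_hidden][:num_slots]

# Fixed pattern: scale the sorted slot tokens directly, one scale per slot.
fixed_x_sf = [row_scale(r) for r in slot_hidden]

print("padded rows per slot:", rows)
print("buggy x_sf:", buggy_x_sf)
print("fixed x_sf:", fixed_x_sf)
```

In this toy layout the buggy slice returns zeros for every slot owned by expert 1, which mirrors the symptom in the commit message: only expert 0's scales reached the scale assembly.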

337 Commits