nvfp4-megamoe-kernel

Author	SHA1	Message	Date
biondizzle	37fecb588f	fix: separate L1/L2 scale buffers (different K_sf), fix assembly calls	2026-05-17 07:43:05 +00:00
biondizzle	b824b838a9	fix: 128-row-align each expert's scales in padded buffer	2026-05-17 07:39:49 +00:00
biondizzle	8dadd9a723	test: scale assembly debug	2026-05-17 07:37:47 +00:00
biondizzle	8642946274	fix: padded x_sf buffer for fixed-shape scale assembly	2026-05-17 07:37:04 +00:00
biondizzle	418e29f7f5	fix: per-expert scale assembly (match assemble_scales_2d_side)	2026-05-17 07:35:49 +00:00
biondizzle	7b95e76723	test: runner vs pipeline comparison + scale assembly comparison	2026-05-17 07:33:20 +00:00
biondizzle	366a0240a5	vllm tweaks	2026-05-17 07:14:58 +00:00
biondizzle	34c43958d0	vllm tweaks	2026-05-17 07:10:16 +00:00
biondizzle	48e4cb625d	fix: default activation global_scale so runner works without finalize_weights	2026-05-17 06:24:15 +00:00
biondizzle	d2965b432d	fix: set _l1_activation_global_scale (with underscore) — attribute name mismatch	2026-05-17 03:35:20 +00:00
biondizzle	b382a7a528	fix: handle input_scale as 1D or 2D (EP splits change the shape)	2026-05-16 22:49:30 +00:00
biondizzle	139c9c37cd	fix: read input_scale from nn.Parameter before it's freed	2026-05-16 22:23:24 +00:00
biondizzle	152648789d	fix: use checkpoint input_scale for activation global scale (not hardcoded 1/2688). The checkpoint stores input_scale per projection: the pre-computed activation normalization factor. Using 1/2688 was wrong for most layers (e.g. down_proj input_scale=0.031 vs 1/2688=0.000372, 83x off). This caused under-quantized activations and garbage output.	2026-05-16 21:46:00 +00:00
biondizzle	af087e655e	docs: update README — vLLM cudagraph inference running, output quality in progress	2026-05-16 21:40:59 +00:00
biondizzle	0a5cfe0433	add kernel compile caching: compile once, invoke on subsequent calls. First call: cute.compile() with real tensors (warmup). Subsequent calls: just invoke compiled() with new CuTe views. No cute.compile() in the forward path = cudagraph-safe.	2026-05-16 20:45:46 +00:00
biondizzle	3465b9d471	remove torch.cuda.synchronize() from run_nvfp4_grouped_gemm (cudagraph-safe)	2026-05-16 20:42:49 +00:00
biondizzle	5e245bc0c6	fix: missing newline	2026-05-16 20:40:18 +00:00
biondizzle	288e179f88	add quantize_activation_nvfp4 (cudagraph-safe, fixed global scale)	2026-05-16 20:39:37 +00:00
biondizzle	521e11e468	test: old bridge + LUT quantization only (step 1 of cudagraph migration)	2026-05-16 20:37:42 +00:00
biondizzle	f51be76e8f	temp: restore EXACT old bridge.py from `b685112`	2026-05-16 20:34:45 +00:00
biondizzle	58dc36e21c	fix: compile fresh each call; cached compile produces wrong TMA descriptors. The CuTeDSL kernel's TMA descriptors are bound to the compilation-time tensor addresses. Caching the compiled kernel and reusing it with different tensor allocations produces wrong memory access patterns (cosine 0.5 instead of 0.99). Fresh compilation is proven correct (cosine 0.989). We can optimize later with proper TMA descriptor reinitialization.	2026-05-16 20:28:15 +00:00
biondizzle	98cc6ac1f3	fix: invert cache check logic (compile when NOT in cache)	2026-05-16 20:25:16 +00:00
biondizzle	e337ec86a3	debug: test with cache enabled	2026-05-16 20:24:04 +00:00
biondizzle	bc56452be8	debug: disable kernel cache to test fresh compilation	2026-05-16 20:22:51 +00:00
biondizzle	647c03b2ee	fix: make_b_k_major must preserve shape; use double-permute trick. permute(K,N).contiguous().permute(K,N) gives same (E,K,N) shape but with K-contiguous memory. Single permute changes the shape.	2026-05-16 20:19:21 +00:00
biondizzle	ed4f501bba	fix: make_b_k_major stride check; K-major means stride[1]==1, not stride[2]==1. For (E, K, N): stride[2]==1 is N-major (columns contiguous). K-major requires stride[1]==1 (rows contiguous).	2026-05-16 20:18:18 +00:00
biondizzle	2162cee4ad	fix: restore proper quantize_weight_to_nvfp4; K is the packed dim, not N. quantize_to_nvfp4() only packs the last dimension, but for weight matrices (K, N), K is the packed dimension. The weight quantizer reshapes (k_blocks, block_size, N) and computes block scales along the K block dimension. This was accidentally replaced with a simple delegation to quantize_to_nvfp4, producing wrong tensor shapes.	2026-05-16 20:16:28 +00:00
biondizzle	10f1dca982	fix: import ceil_div from correct module	2026-05-16 20:09:02 +00:00
biondizzle	81632e2f21	fix: correct cutlass_torch import (cutlass.torch, not top-level)	2026-05-16 20:08:21 +00:00
biondizzle	16c4fad025	fix: remove cutlass.cute.backend import	2026-05-16 20:06:38 +00:00
biondizzle	44b40d41fe	fix: compile CuTeDSL kernel with real tensors, not dummy shapes. The kernel's TMA descriptors are sized from compilation-time shapes. Dummy 256x256 caused wrong memory access for real 3584x6144 data. Now compiles with actual runtime tensors on first use, cached by (num_experts, K, N). Compilation happens once during warmup. Forward call remains cudagraph-safe.	2026-05-16 20:05:59 +00:00
biondizzle	79281b6fda	fix: compute K_packed/N_packed before passing to _get_compiled_kernel	2026-05-16 20:00:35 +00:00
biondizzle	caf93d6c45	fix: pass K_packed/N_packed to _get_compiled_kernel	2026-05-16 19:59:43 +00:00
biondizzle	ecc7b83334	fix: compile CuTeDSL kernel with actual tensor shapes, not dummy 256x256. The compiled kernel's TMA descriptors are sized based on compilation shapes. Using dummy 256x256 shapes caused wrong memory access patterns for the real 3584x6144 data. Now uses actual K_packed and N_packed from the runtime tensors.	2026-05-16 19:58:13 +00:00
biondizzle	cc75a55bd9	restore: new bridge/moe_pipeline/layertest	2026-05-16 19:55:19 +00:00
biondizzle	0c878b3a9e	temp: restore old layertest+bridge for cosine comparison	2026-05-16 19:54:04 +00:00
biondizzle	0069769d12	debug: print global scales	2026-05-16 19:38:31 +00:00
biondizzle	84589fe984	debug: more prints	2026-05-16 19:31:54 +00:00
biondizzle	fa2d5708c5	debug: add L1 GEMM and SiLU output debug prints	2026-05-16 19:29:42 +00:00
biondizzle	4c06c51ec3	fix: moe_pipeline.py gate/up split — L1 output is 2*intermediate, not intermediate	2026-05-16 19:28:15 +00:00
biondizzle	da31ce7e1a	allow for cuda graphs again	2026-05-16 19:23:41 +00:00
biondizzle	d15c43294b	fix: test L2 weight N dim should be hidden_size, not hidden_size//2	2026-05-16 19:07:36 +00:00
biondizzle	28788c6f55	fix: L1 weight N dimension is 2*intermediate (gate+up), not intermediate. float4_e2m1fn_x2 packs 2 values per byte along K, not N. The GEMM output N dimension is the logical N from mat_b.shape[2], not 2x packed. Previous n_dim*2 was wrong; it accidentally worked in the test because intermediate_size*2 == 2*intermediate_size. Real model with N=9216 exposed the bug.	2026-05-16 19:07:08 +00:00
biondizzle	f7e29fdf1e	docs: update README with cudagraph compatibility work and decisions	2026-05-16 18:55:47 +00:00
biondizzle	103fd451ce	fix: use full padded_scales_buf (no GPU scalar slicing in cudagraph). buf[:gpu_scalar, :] triggers cudaErrorStreamCaptureInvalidated. Always use the full pre-allocated buffer; extra rows are zeros.	2026-05-16 18:50:35 +00:00
biondizzle	2f68c7ba77	fix: cache E2M1 step_to_idx LUT per device (no CPU->CUDA copy in forward). torch.tensor() and new_tensor() both trigger CPU->CUDA copies during cudagraph capture. Pre-cache the LUT on first use per device.	2026-05-16 18:48:31 +00:00
biondizzle	6c298be842	fix: use new_tensor instead of torch.tensor for cudagraph (no CPU→CUDA copy). torch.tensor() creates on CPU then copies to CUDA, which is forbidden during cudagraph capture. new_tensor() creates directly on the source tensor's device.	2026-05-16 18:47:39 +00:00
biondizzle	53c25bee0b	rewrite: cudagraph-safe runner (no dynamic slicing, no GPU scalar indices). Removed all [:total_slots] dynamic slicing with GPU scalars; slot_hidden gathers from hidden_states directly using sorted_token_ids; scatter_add uses full sorted_token_ids (padding slots have zero weight); _assemble_scales_cudagraph_safe returns 2D via padded_scales.shape[0]; fixed padded_scales_buf allocation via float16->float8 cast; GEMM output size: n_dim * 2 for float4_e2m1fn_x2 packed format.	2026-05-16 18:44:25 +00:00
biondizzle	4300775bfe	fix: remove .item() sync in scale reshape — use padded_scales.shape[0] instead	2026-05-16 18:29:12 +00:00
biondizzle	5a79065b2b	fix: GEMM output should be 2x packed N (float4_e2m1fn_x2 packs 2 per element)	2026-05-16 18:27:44 +00:00
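
The layout reasoning in commits 647c03b2ee and ed4f501bba (K-major means stride[1]==1 for an (E, K, N) tensor, and a single permute changes the shape while the double permute does not) can be checked with plain stride arithmetic. The following is a minimal sketch, not the repository's make_b_k_major: it models strides in pure Python rather than calling PyTorch, and the helper names contiguous_strides, permute and double_permute are hypothetical. It assumes the commit's "permute(K,N)" means swapping the last two dimensions, i.e. a permute with dims (0, 2, 1).

```python
def contiguous_strides(shape):
    """Row-major (C-order) strides, in elements, for a dense tensor of `shape`."""
    strides = [1] * len(shape)
    for i in range(len(shape) - 2, -1, -1):
        strides[i] = strides[i + 1] * shape[i + 1]
    return tuple(strides)


def permute(shape, strides, dims):
    """Reorder dimensions without moving data: shape and strides move together."""
    return tuple(shape[d] for d in dims), tuple(strides[d] for d in dims)


def double_permute(shape):
    """Model of permute -> contiguous -> permute on an (E, K, N) tensor."""
    # Step 1: view the dense (E, K, N) tensor as (E, N, K).
    swapped_shape, _ = permute(shape, contiguous_strides(shape), (0, 2, 1))
    # Step 2: materializing that view makes (E, N, K) dense, so K is innermost.
    dense_strides = contiguous_strides(swapped_shape)
    # Step 3: permute back; the shape is (E, K, N) again, the memory order is not.
    return permute(swapped_shape, dense_strides, (0, 2, 1))


E, K, N = 2, 4, 3
original = (E, K, N)

# A freshly allocated (E, K, N) tensor is N-major: stride[2] == 1.
n_major_strides = contiguous_strides(original)

# A single permute gives K-contiguous strides but changes the shape to (E, N, K).
single_shape, single_strides = permute(original, n_major_strides, (0, 2, 1))

# The double permute keeps the (E, K, N) shape and ends with stride[1] == 1.
k_major_shape, k_major_strides = double_permute(original)

print("N-major:", original, n_major_strides)
print("single permute:", single_shape, single_strides)
print("double permute:", k_major_shape, k_major_strides)
```

In this model the K-major result has strides (K*N, 1, K), which is the condition the stride check in ed4f501bba tests for; whether the real helper is implemented this way is not shown in the log.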


174 Commits