nvfp4-megamoe-kernel

Author	SHA1	Message	Date
biondizzle	918aa8aede	Fix scale assembly output shape: reshape to 2D for GEMM	2026-05-17 09:57:27 +00:00
biondizzle	d9bae6d770	Fix OOB in scale assembly: size padded_x_sf for max tokens, fix top_k/max_num_tokens passing, support variable-size expert blocks. Bug 9: padded_x_sf was sized for num_experts*128 rows, but with 8192 tokens and top_k=6, the actual padded row count can exceed 6144. Also: (1) pass top_k and max_num_tokens from deepseek_v4.py (was defaulting to 8/8192); (2) Phase 2 of scale assembly now handles experts with >128 tokens (multiple 128-row chunks); (3) remove debug prints.	2026-05-17 09:56:28 +00:00
biondizzle	55ac60eb91	Add detailed debug prints for OOB investigation	2026-05-17 09:39:42 +00:00
biondizzle	fed3c417ba	Add debug OOB check for sorted_token_ids	2026-05-17 09:19:10 +00:00
biondizzle	eb7d4f099b	Update CURRENT_BUG.md with Bug 8 (global→local expert ID) and Bug 8b (.cpu() sync)	2026-05-17 09:01:24 +00:00
biondizzle	ca3cba5bbd	Fix global→local expert ID remapping for EP and remove .cpu() sync. Root cause of CUDA_ERROR_ASSERT index out of bounds: topk_ids contains GLOBAL expert IDs (0-255) but runner treated them as local IDs (0-31 with EP=8). Tokens for non-local experts got wrong expert assignments, causing out-of-bounds scatter indices in _assemble_scales_cudagraph_safe. Fixes: (1) add experts_start_idx param to CuTeDSLMoERunner; (2) in run(), remap global→local IDs and zero weights for non-local experts; (3) move _token_indices from CPU to GPU (remove sort_idx.cpu() sync); (4) add _fill_token_indices() and _needs_token_refill to handle CuTeDSL JIT GPU memory corruption (refill after first GEMM call).	2026-05-17 08:58:43 +00:00
biondizzle	1330e2b2cf	cleanup: remove debug prints, ready for testing. Current state: (1) token indices on CPU (avoids CuTeDSL GPU memory corruption); (2) scale assembly uses per-expert swizzle + scatter (matches reference); (3) compute_activation_global_scales warmup gets ~0.97 cosine; (4) expert_offsets passed without leading 0 (matches pipeline); (5) layertest + cudagraph_test pass.	2026-05-17 08:30:41 +00:00
biondizzle	d635dcbbb6	fix: keep token_indices on CPU, index with CPU sort_idx. CuTeDSL's cute.compile corrupts GPU memory during JIT compilation. Keeping token_indices on CPU and using sort_idx.cpu() for indexing avoids the corruption. The .to(device) call after indexing moves the result back to GPU for the hidden_states indexing.	2026-05-17 08:29:18 +00:00
biondizzle	235d5b314f	fix: fallback token indices allocation with verify+rebuild	2026-05-17 08:27:47 +00:00
biondizzle	dd0b3fd4f9	debug: print sorted_token_ids in warmup	2026-05-17 08:25:25 +00:00
biondizzle	04999d86cf	fix: add quantize_to_nvfp4 import	2026-05-17 08:24:57 +00:00
biondizzle	33e28100ee	test: use runner's built-in warmup method	2026-05-17 08:24:27 +00:00
biondizzle	7073daaffa	fix: allocate token_indices on CPU, move to GPU AFTER JIT compilation. CuTeDSL's cute.compile corrupts GPU memory during JIT compilation. Tensors allocated on GPU before/during compilation get zeroed. Fix: create token_indices on CPU, then .to(device) after JIT is done.	2026-05-17 08:22:51 +00:00
biondizzle	0e7b06b55c	debug: clone + sync token indices before JIT	2026-05-17 08:22:11 +00:00
biondizzle	70c0618361	fix: allocate token_indices before CuTeDSL JIT compilation. CuTeDSL's cute.compile appears to corrupt GPU memory state, causing torch.arange to produce zero-filled tensors when allocated after the JIT compilation. Moving token_indices allocation before the weight stacking operations fixes the corruption.	2026-05-17 08:20:41 +00:00
biondizzle	2bbe04efd8	debug: remove assert, test token corruption	2026-05-17 08:19:45 +00:00
biondizzle	66627926c5	debug: int32 token indices with sync verify	2026-05-17 08:18:37 +00:00
biondizzle	da02a5dc11	debug: assert token indices are correct after allocation	2026-05-17 08:16:09 +00:00
biondizzle	c0d016a472	feat: compute_activation_global_scales warmup method. Uses quantize_to_nvfp4 during warmup to get exact gs values for L1 and L2. L1 gs comes from slot_hidden, L2 gs from the actual L1 GEMM output. These values are then used with quantize_activation_nvfp4 (cudagraph-safe) during inference.	2026-05-17 08:11:01 +00:00
biondizzle	8c9a51e006	fix: call _ensure_stacked in warmup test	2026-05-17 08:07:09 +00:00
biondizzle	5ba77e355f	test: warmup gs computation with safety margin sweep	2026-05-17 08:06:27 +00:00
biondizzle	ae6b879d38	fix: pass expert_offsets without leading 0 to GEMM (matches pipeline)	2026-05-17 07:59:00 +00:00
biondizzle	a1e6f5f891	fix: searchsorted right=True for correct expert assignment	2026-05-17 07:57:00 +00:00
biondizzle	ddffb7d8df	docs: current bug analysis — scale_a layout vs expert_offsets mismatch	2026-05-17 07:53:58 +00:00
biondizzle	ed90341ea9	fix: scatter+per-expert-swizzle scale assembly (cudagraph-safe)	2026-05-17 07:47:14 +00:00
biondizzle	37fecb588f	fix: separate L1/L2 scale buffers (different K_sf), fix assembly calls	2026-05-17 07:43:05 +00:00
biondizzle	b824b838a9	fix: 128-row-align each expert's scales in padded buffer	2026-05-17 07:39:49 +00:00
biondizzle	8dadd9a723	test: scale assembly debug	2026-05-17 07:37:47 +00:00
biondizzle	8642946274	fix: padded x_sf buffer for fixed-shape scale assembly	2026-05-17 07:37:04 +00:00
biondizzle	418e29f7f5	fix: per-expert scale assembly (match assemble_scales_2d_side)	2026-05-17 07:35:49 +00:00
biondizzle	7b95e76723	test: runner vs pipeline comparison + scale assembly comparison	2026-05-17 07:33:20 +00:00
biondizzle	366a0240a5	vllm tweaks	2026-05-17 07:14:58 +00:00
biondizzle	34c43958d0	vllm tweaks	2026-05-17 07:10:16 +00:00
biondizzle	48e4cb625d	fix: default activation global_scale so runner works without finalize_weights	2026-05-17 06:24:15 +00:00
biondizzle	d2965b432d	fix: set _l1_activation_global_scale (with underscore) — attribute name mismatch	2026-05-17 03:35:20 +00:00
biondizzle	b382a7a528	fix: handle input_scale as 1D or 2D (EP splits change the shape)	2026-05-16 22:49:30 +00:00
biondizzle	139c9c37cd	fix: read input_scale from nn.Parameter before it's freed	2026-05-16 22:23:24 +00:00
biondizzle	152648789d	fix: use checkpoint input_scale for activation global scale (not hardcoded 1/2688). The checkpoint stores input_scale per projection: the pre-computed activation normalization factor. Using 1/2688 was wrong for most layers (e.g. down_proj input_scale=0.031 vs 1/2688=0.000372, 83x off). This caused under-quantized activations and garbage output.	2026-05-16 21:46:00 +00:00
biondizzle	af087e655e	docs: update README — vLLM cudagraph inference running, output quality in progress	2026-05-16 21:40:59 +00:00
biondizzle	0a5cfe0433	add kernel compile caching: compile once, invoke on subsequent calls. First call: cute.compile() with real tensors (warmup). Subsequent calls: just invoke compiled() with new CuTe views. No cute.compile() in the forward path = cudagraph-safe.	2026-05-16 20:45:46 +00:00
biondizzle	3465b9d471	remove torch.cuda.synchronize() from run_nvfp4_grouped_gemm (cudagraph-safe)	2026-05-16 20:42:49 +00:00
biondizzle	5e245bc0c6	fix: missing newline	2026-05-16 20:40:18 +00:00
biondizzle	288e179f88	add quantize_activation_nvfp4 (cudagraph-safe, fixed global scale)	2026-05-16 20:39:37 +00:00
biondizzle	521e11e468	test: old bridge + LUT quantization only (step 1 of cudagraph migration)	2026-05-16 20:37:42 +00:00
biondizzle	f51be76e8f	temp: restore EXACT old bridge.py from `b685112`	2026-05-16 20:34:45 +00:00
biondizzle	58dc36e21c	fix: compile fresh each call; cached compile produces wrong TMA descriptors. The CuTeDSL kernel's TMA descriptors are bound to the compilation-time tensor addresses. Caching the compiled kernel and reusing it with different tensor allocations produces wrong memory access patterns (cosine 0.5 instead of 0.99). Fresh compilation is proven correct (cosine 0.989). We can optimize later with proper TMA descriptor reinitialization.	2026-05-16 20:28:15 +00:00
biondizzle	98cc6ac1f3	fix: invert cache check logic (compile when NOT in cache)	2026-05-16 20:25:16 +00:00
biondizzle	e337ec86a3	debug: test with cache enabled	2026-05-16 20:24:04 +00:00
biondizzle	bc56452be8	debug: disable kernel cache to test fresh compilation	2026-05-16 20:22:51 +00:00
biondizzle	647c03b2ee	fix: make_b_k_major must preserve shape: use double-permute trick. permute(K,N).contiguous().permute(K,N) gives same (E,K,N) shape but with K-contiguous memory. Single permute changes the shape.	2026-05-16 20:19:21 +00:00
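
The global→local expert ID remapping described in commit ca3cba5bbd can be illustrated with a minimal pure-Python sketch. It takes the figures from the commit message (global IDs 0-255, EP=8, so 32 local experts per rank). The function name `remap_to_local` and the choice to park non-local slots on local expert 0 are hypothetical, chosen for illustration; the runner's actual code may differ.

```python
def remap_to_local(topk_ids, topk_weights, experts_start_idx, num_local_experts):
    """Map global expert IDs to local IDs and zero the weights of non-local experts."""
    local_ids, local_weights = [], []
    for gid, weight in zip(topk_ids, topk_weights):
        lid = gid - experts_start_idx
        if 0 <= lid < num_local_experts:
            # Expert lives on this rank: keep its weight, use the local ID.
            local_ids.append(lid)
            local_weights.append(weight)
        else:
            # Assumption: point non-local slots at local expert 0 with weight 0,
            # so every index stays in bounds and the slot contributes nothing.
            local_ids.append(0)
            local_weights.append(0.0)
    return local_ids, local_weights


# Rank 1 of EP=8 owns global experts 32..63 (experts_start_idx=32).
ids, weights = remap_to_local([5, 40, 63, 200], [0.4, 0.3, 0.2, 0.1], 32, 32)
print(ids)      # [0, 8, 31, 0]
print(weights)  # [0.0, 0.3, 0.2, 0.0]
```

Without the remap, a global ID such as 200 would be used directly as an index into buffers sized for 32 local experts, which is consistent with the out-of-bounds scatter the commit reports.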

199 Commits
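
The buffer-sizing issue from commit d9bae6d770 (Bug 9) comes down to simple arithmetic: when each expert's rows are rounded up to a multiple of 128, the padded total depends on how tokens are distributed, not just on the expert count. The sketch below is hypothetical (the helper names are not from the repository) and assumes experts with zero tokens take no rows. The figure of 48 experts is inferred from 6144 / 128 in the commit message and is not confirmed by the source.

```python
def padded_rows(tokens_per_expert, align=128):
    """Total rows after rounding each expert's token count up to a multiple of `align`."""
    return sum(-(-n // align) * align for n in tokens_per_expert)


def worst_case_padded_rows(max_num_tokens, top_k, num_experts, align=128):
    """Upper bound on padded rows: each expert wastes at most align - 1 rows."""
    return max_num_tokens * top_k + num_experts * (align - 1)


# Toy case: 4 experts, but one block per expert (4 * 128 = 512 rows) is not enough.
print(padded_rows([1, 129, 0, 256]))  # 640

# Figures from the commit message; 48 experts is an inferred assumption.
print(worst_case_padded_rows(8192, 6, 48))  # 55248
```

Sizing the buffer from the worst-case bound rather than from `num_experts * 128` keeps the allocation fixed-shape, which is what a CUDA-graph-captured path needs.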