nvfp4-megamoe-kernel

Author	SHA1	Message	Date
biondizzle	21018fca8a	fix: shared_experts missing ffn. prefix in checkpoint→model rename Checkpoint keys are model.layers.N.shared_experts.gate_proj.weight but model params are layers.N.ffn.shared_experts.gate_up_proj.weight. The .ffn. was missing from the rename, so stacked gate_up_proj never matched params_dict.	2026-05-15 00:17:59 +00:00
biondizzle	483046b9d6	fix: shared_experts gate_up_proj stacking was skipped by .experts. check The stacking logic skipped any key containing '.experts.' to avoid MoE routed expert weights. But 'shared_experts' also matches that substring, so gate_proj and up_proj were never stacked into gate_up_proj. Changed to '.ffn.experts.' which only matches the routed experts path. Also includes POST-LOAD all-zero param scan.	2026-05-15 00:08:04 +00:00
biondizzle	78dc83dc6e	a little more debug1	2026-05-15 00:02:22 +00:00
biondizzle	8dbd616add	a little more debug1	2026-05-15 00:02:00 +00:00
biondizzle	756ea2192f	clean up and possible big fix	2026-05-14 23:41:10 +00:00
biondizzle	9f01307c5b	debug more7	2026-05-14 23:20:19 +00:00
biondizzle	e4f52c8900	debug more5	2026-05-14 23:01:59 +00:00
biondizzle	e46ff41569	debug more4	2026-05-14 22:50:51 +00:00
biondizzle	fd5f04eb15	debug more3	2026-05-14 22:36:34 +00:00
biondizzle	7573f12659	debug more2	2026-05-14 22:26:22 +00:00
biondizzle	11bbf675af	debug more	2026-05-14 22:21:30 +00:00
biondizzle	ce4c4b6fcb	debug empty output	2026-05-14 22:13:32 +00:00
biondizzle	09d1307d78	damn clankers2	2026-05-14 20:34:51 +00:00
biondizzle	5bbe51357c	damn clankers	2026-05-14 20:23:42 +00:00
biondizzle	6aae8f1393	more fixes7	2026-05-14 20:11:37 +00:00
biondizzle	4363eee2ce	more fixes6	2026-05-14 20:08:25 +00:00
biondizzle	40b980b9d6	more fixes5	2026-05-14 19:55:34 +00:00
biondizzle	d56e86b40e	more fixes4	2026-05-14 19:51:56 +00:00
biondizzle	bf17bd3fc4	more fixes3	2026-05-14 19:47:02 +00:00
biondizzle	c68f4e9d6e	more fixes2	2026-05-14 19:43:24 +00:00
biondizzle	4749a92fca	more fixes	2026-05-14 19:39:16 +00:00
biondizzle	1ceff541b0	more fixes	2026-05-14 19:35:39 +00:00
biondizzle	3be051e140	fix	2026-05-14 19:29:47 +00:00
biondizzle	57512d5f0d	clean up	2026-05-14 19:20:08 +00:00
biondizzle	0d8e1bd035	restructure: move Dockerfile and docker-compose to root, docker/ → vllm/	2026-05-14 18:47:30 +00:00
biondizzle	878ad4fc5b	fix Dockerfile patch paths and add explicit env vars for debug suppression	2026-05-14 18:44:08 +00:00
biondizzle	072a1d4a0b	clean up	2026-05-14 18:40:15 +00:00
biondizzle	1150e325bb	Consolidate serving into kernel repo - Dockerfile: COPY kernel source instead of git clone - docker-compose: build context at repo root, all debug flags OFF - vLLM patches: deepseek_v4.py, staging_kernel.py, deepseek_v4_attention.py - serve_vllm.py script - .dockerignore to keep image clean	2026-05-14 18:20:20 +00:00
biondizzle	2687d1fc53	fix: convert global expert IDs to local before GEMM vLLM's symm_buffer stores topk_ids as GLOBAL expert IDs (0..383). Our weight tensors are indexed by LOCAL IDs (0..47 per rank). Each rank r handles experts [r*48, r*48+47]. Without conversion, topk_ids like 137, 222, 378 would index way out of bounds in the weight tensor (shape (48, N, K)), producing garbage. Derive experts_start_idx from the topk_ids and subtract to get local IDs. This was why all ranks except rank 0 produced zero expert matches → zero output → garbage text.	2026-05-14 17:43:58 +00:00
biondizzle	128ff84358	fix: 384 experts (not 256), clarify cross-rank reduce is in caller DeepSeek-V4-Pro has 384 routed experts, 48 per rank (384/8). The cross-rank all-reduce happens in the parent DeepseekV4MoE.forward, not in our kernel. Our kernel writes local output; caller does reduce. Fixed README, nvfp4_mega_moe.py comments.	2026-05-14 17:33:59 +00:00
biondizzle	1c15dadaa5	cleanup: remove dead _pack_ue4m3_to_uint32, fix data format docs weight_transform.py returns float8_e4m3fn scales, NOT packed uint32. The _pack_ue4m3_to_uint32 function was never called. Removed it. Updated README data formats to accurately reflect the pipeline: - Weight scales: float8_e4m3fn (direct to CUTLASS, no unpack) - Activation scales: uint32 packed (from staging kernel, unpacked to float8)	2026-05-14 17:28:12 +00:00
biondizzle	008f8cccbd	docs: comprehensive README with SF remap probe data, bug history, coordinate table Added detailed SF remap section with the empirical coordinate dump table showing flat_rank=8 decomposition. Documented all 5 bugs found/fixed, the diagnostic trail (constant-scale test, single-element probes), and the 6 verification probes confirming the extraction formula.	2026-05-14 17:02:53 +00:00
biondizzle	1e0cea055c	cleanup: remove all debug printfs from CUDA kernel and weight_transform Removed printf from remap kernel (flat_rank dump, coordinate probes, first-coord log). Removed weight_scale_2 debug prints from weight_transform.py. Production-ready now.	2026-05-14 16:57:32 +00:00
biondizzle	839835cba4	fix: correct SF remap coordinate extraction for flat_rank=8 m = f0 + f1*32 + f2*128 (CuTe 'first sub varies fastest') k_sf = f4 + f5*4 f3 is the Step<2> stride (degenerate, always=total), NOT a coordinate. Previous formula (f3*2+f2)*128 was catastrophically wrong — mapped everything to m=0 or m=huge.	2026-05-14 16:40:48 +00:00
biondizzle	1ef2fbc2fd	debug: more indices for SF layout dump	2026-05-14 16:26:15 +00:00
biondizzle	c4b5b52a33	debug: single-thread SF layout dump at specific indices	2026-05-14 16:13:05 +00:00
biondizzle	17e6033ade	debug: print specific indices for SF layout coordinate decomposition	2026-05-14 15:57:55 +00:00
biondizzle	8ee3f90e44	debug: handle flat_rank=8 for SF remap, add coordinate dump Previous approach assumed rank 2-6, but actual rank is 8. For R==8: 4 M sub-indices (inner_32, inner_4, tile_interleave, tile_m) 4 K sub-indices (inner_16, inner_4_k, tile_k_interleave, tile_k) m = (f3*2 + f2)*128 + f0*4 + f1 k_sf = f5 + f6*4 (tentative, needs printf verification) Added printf of all 8 flat values for first 3 indices.	2026-05-14 15:45:52 +00:00
biondizzle	d2c1c76f5b	debug: idx2crd+flatten approach with printf to determine flat_rank Going back to the idx2crd approach which compiles and runs. Added printf for flat_rank, MN, K_sf, and first coordinate extraction. Handles ranks 2-6 with logical (m, k_sf) extraction. This will tell us the actual flat_rank and whether our extraction is correct.	2026-05-14 15:34:46 +00:00
biondizzle	2ac3a7d631	fix: construct nested coordinate for CuTe layout shape ((32,4), K) layout_sf(m, k_elem) with flat ints fails: Mismatched Ranks because the layout shape is ((32,4), K_padded), not (M, K). Decompose m into (inner_m, sub_m) = (m/4, m%4) to match the (32,4) sub-shape, and pass as make_tuple(make_tuple(inner, sub), k_elem).	2026-05-14 15:32:12 +00:00
biondizzle	593ae998f8	fix: clean rewrite of cutlass_nvfp4_gemm.cu — no more file splicing Removed dead code from old idx2crd approach. File is now clean: - Source-iterating SF remap kernel with layout_sf(m, k_elem) - Zero-init dest buffers before remap - Proper extern C wrapping	2026-05-14 15:31:03 +00:00
biondizzle	196ee37fdb	fix: rewrite SF remap kernel — source-iterating with layout_sf(m, k_elem) Ripped out idx2crd + flatten + get<> approach entirely. New kernel iterates over source indices (m, k_group) and uses layout_sf(m, k_elem) to compute the CUTLASS destination offset. CuTe handles nested shape decomposition internally — no rank inspection needed. K coordinate is in element-space (k_group * SFVecSize) as the layout expects. Iterates over groups (not every element) since all 16 elements within a group share one SF byte — avoids 16x redundant writes. Grid size based on source count (MN * K_sf), not dest buffer size.	2026-05-14 15:28:44 +00:00
biondizzle	fb390b24e2	debug: add printf to SF remap kernel to check flat_rank and layout shape	2026-05-14 15:24:18 +00:00
biondizzle	8f5322ca31	fix: add missing extern "C" opening brace lost during file reconstruction	2026-05-14 15:04:43 +00:00
biondizzle	a8bd962452	fix: SF remap — iterate dest indices, extract logical (m, k_sf) from nested coord The forward-map approach (src -> layout_sf(m, k)) failed because CuTe's layout operator requires coordinates matching the nested shape rank, and passing flat (int, int) to a ((32,4),K) shape triggers Mismatched Ranks. New approach: iterate over CUTLASS dest indices, use idx2crd to get the hierarchical coordinate, flatten it, then extract logical (m, k_sf) by interpreting the flattened sub-coordinates correctly: flat[0..2] = (inner_M, sub_M, tile_M) -> m = tile_M*128 + inner_M*4 + sub_M flat[3..5] = (inner_K, sub_K, tile_K) -> k_sf = tile_K*4 + sub_K (inner_K is within one SF group — same byte, so ignored for k_sf) Previous bug: get<0> and get<1> of flatten gave (inner_M, sub_M) — both M sub-indices. K information was never extracted, so only k_group=0 worked. Dest buffer is zero-initialized so padding slots (where m >= MN or k_sf >= K_sf) stay zero.	2026-05-14 15:01:47 +00:00
biondizzle	395cc31883	fix: use layout_sf(m, k_elem) instead of make_coord for nested shapes make_coord(m, k_elem) produces rank-2 coord, but tile_to_shape creates nested shapes like ((32,4,tiles_m), (16,4,tiles_k)) which expect matching nested coords. layout_sf(m, k) operator handles hierarchical projection automatically.	2026-05-14 14:57:13 +00:00
biondizzle	d90967d6e9	fix: SF remap — element-space K coords + zero-init dest buffer Two fixes: 1. CuTe layout uses element-space K, not group-space. k_group=3 with SFVecSize=16 maps to k_elem=48 in the layout, not k=3. Added SFVecSize param to remap kernel, multiply k_sf * SFVecSize before passing to layout_sf(). 2. Zero-init CUTLASS dest buffer before remap. The layout pads to tile boundaries (128x64), so dest is larger than M*K_sf. Unmapped padding slots reading garbage causes sporadic wrong results. Also fixed grid size to use source count (M*K_sf), not dest size.	2026-05-14 14:54:18 +00:00
biondizzle	5968ebad9f	fix: SF remap was using idx2crd+flatten which gives atom sub-indices, not logical (m,k) The remap kernel iterated over CUTLASS linear indices and tried to reverse-map with idx2crd + flatten. But flatten() on the nested CuTe coordinate (from tile_to_shape(SfAtom{}, ...)) gives atom-level sub-indices, not logical (m, k). This caused all K-groups > 0 in SFA to map to m*K_sf+0, losing K-group information entirely. Proof: setting SFA[0,0]=2.0 changed row 0, but SFA[0,3]=2.0 produced zero change. Only K-group 0 was being read. Fix: iterate over SOURCE indices (row-major m, k) and use the CuTe layout forward: layout_sf(make_coord(m, k)) -> CUTLASS dst index. This is the correct forward direction that CuTe handles natively. Constant-scale test (all SF=1.0) gave cosine=1.0, confirming the FP4 data path is correct. The bug was purely in the SF remap.	2026-05-14 14:51:02 +00:00
biondizzle	cf796e37cf	debug: add weight_scale_2 shape/value logging in weight transform	2026-05-14 14:19:35 +00:00
biondizzle	879adc324d	fix: _fold_global_scale — remove broken logical_widths branch The logical_widths branch took expert 0 and 1's global scales and applied them to ALL experts. For L1 with logical_widths=[3072,3072], every expert got expert-0's scale on its gate half and expert-1's scale on its up half. All other experts' global scales were discarded. The else branch correctly broadcasts each expert's own (E,1) global scale across (E, N, K//16). Removed the dead logical_widths code.	2026-05-14 14:17:44 +00:00
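
The global-to-local expert ID conversion described in commit 2687d1fc53 can be illustrated with a minimal sketch. It assumes the figures stated in the log (384 routed experts, 8 ranks, 48 experts per rank). The function name `to_local_expert_ids`, the explicit `rank` argument, and the `-1` marker for experts owned by other ranks are hypothetical choices for illustration. The commit itself says `experts_start_idx` is derived from the `topk_ids`, which this sketch does not attempt to reproduce.

```python
# Figures taken from the commit log: 384 routed experts split over 8 ranks.
EXPERTS_TOTAL = 384
NUM_RANKS = 8
EXPERTS_PER_RANK = EXPERTS_TOTAL // NUM_RANKS  # 48


def to_local_expert_ids(topk_ids, rank):
    """Map global expert IDs to local IDs for one rank.

    Hypothetical helper: IDs outside [rank*48, rank*48+47] are marked -1
    so they can never index into the local (48, N, K) weight tensor.
    """
    start = rank * EXPERTS_PER_RANK   # plays the role of experts_start_idx
    end = start + EXPERTS_PER_RANK
    return [gid - start if start <= gid < end else -1 for gid in topk_ids]


if __name__ == "__main__":
    # The example IDs from the commit message, seen from rank 2 (owns 96..143).
    print(to_local_expert_ids([137, 222, 378], rank=2))
```

Without the subtraction, an ID such as 137 would be used directly as an index into a tensor with only 48 expert slots, which is the out-of-bounds access the commit describes.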

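The coordinate extraction from commit 839835cba4 and the element-space K conversion from commit d90967d6e9 can be checked with plain arithmetic. The sketch below is not the CUDA kernel. The function names are hypothetical, and the sub-index ranges (f0 in 0..31, f1 in 0..3) are an assumption inferred from the multipliers 32 and 128 in the formula. Flat entries other than f0, f1, f2, f4, and f5 are ignored here.

```python
SF_VEC_SIZE = 16  # elements sharing one scale factor, per the commit log


def flat_to_logical(f):
    """Recover logical (m, k_sf) from 8 flattened sub-coordinates.

    Uses the formula from the commit message:
    m = f0 + f1*32 + f2*128 and k_sf = f4 + f5*4.
    f[3] is skipped because the log says it is not a coordinate.
    """
    m = f[0] + f[1] * 32 + f[2] * 128
    k_sf = f[4] + f[5] * 4
    return m, k_sf


def m_to_flat(m):
    """Inverse of the m formula, assuming f0 < 32 and f1 < 4."""
    return m % 32, (m // 32) % 4, m // 128


def k_group_to_k_elem(k_group):
    """Convert a scale-factor group index to the element-space K coordinate."""
    return k_group * SF_VEC_SIZE


if __name__ == "__main__":
    # Round-trip every row index across four 128-row tiles.
    for m in range(512):
        f0, f1, f2 = m_to_flat(m)
        assert flat_to_logical((f0, f1, f2, 0, 0, 0, 0, 0))[0] == m
    print(k_group_to_k_elem(3))  # the k_group=3 example from the log
```

The round trip only confirms that the formula is self-consistent under the assumed ranges. It says nothing about whether the actual CuTe layout produces these sub-indices, which the log reports was verified separately with probes.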

88 Commits