nvfp4-megamoe-kernel

Author	SHA1	Message	Date
biondizzle	78dc83dc6e	a little more debug1	2026-05-15 00:02:22 +00:00
biondizzle	8dbd616add	a little more debug1	2026-05-15 00:02:00 +00:00
biondizzle	756ea2192f	clean up and possible big fix	2026-05-14 23:41:10 +00:00
biondizzle	9f01307c5b	debug more7	2026-05-14 23:20:19 +00:00
biondizzle	e4f52c8900	debug more5	2026-05-14 23:01:59 +00:00
biondizzle	e46ff41569	debug more4	2026-05-14 22:50:51 +00:00
biondizzle	fd5f04eb15	debug more3	2026-05-14 22:36:34 +00:00
biondizzle	7573f12659	debug more2	2026-05-14 22:26:22 +00:00
biondizzle	11bbf675af	debug more	2026-05-14 22:21:30 +00:00
biondizzle	ce4c4b6fcb	debug empty output	2026-05-14 22:13:32 +00:00
biondizzle	09d1307d78	damn clankers2	2026-05-14 20:34:51 +00:00
biondizzle	5bbe51357c	damn clankers	2026-05-14 20:23:42 +00:00
biondizzle	6aae8f1393	more fixes7	2026-05-14 20:11:37 +00:00
biondizzle	4363eee2ce	more fixes6	2026-05-14 20:08:25 +00:00
biondizzle	40b980b9d6	more fixes5	2026-05-14 19:55:34 +00:00
biondizzle	d56e86b40e	more fixes4	2026-05-14 19:51:56 +00:00
biondizzle	bf17bd3fc4	more fixes3	2026-05-14 19:47:02 +00:00
biondizzle	c68f4e9d6e	more fixes2	2026-05-14 19:43:24 +00:00
biondizzle	4749a92fca	more fixes	2026-05-14 19:39:16 +00:00
biondizzle	1ceff541b0	more fixes	2026-05-14 19:35:39 +00:00
biondizzle	3be051e140	fix	2026-05-14 19:29:47 +00:00
biondizzle	57512d5f0d	clean up	2026-05-14 19:20:08 +00:00
biondizzle	0d8e1bd035	restructure: move Dockerfile and docker-compose to root, docker/ → vllm/	2026-05-14 18:47:30 +00:00
biondizzle	878ad4fc5b	fix Dockerfile patch paths and add explicit env vars for debug suppression	2026-05-14 18:44:08 +00:00
biondizzle	072a1d4a0b	clean up	2026-05-14 18:40:15 +00:00
biondizzle	1150e325bb	Consolidate serving into kernel repo - Dockerfile: COPY kernel source instead of git clone - docker-compose: build context at repo root, all debug flags OFF - vLLM patches: deepseek_v4.py, staging_kernel.py, deepseek_v4_attention.py - serve_vllm.py script - .dockerignore to keep image clean	2026-05-14 18:20:20 +00:00
biondizzle	2687d1fc53	fix: convert global expert IDs to local before GEMM vLLM's symm_buffer stores topk_ids as GLOBAL expert IDs (0..383). Our weight tensors are indexed by LOCAL IDs (0..47 per rank). Each rank r handles experts [r*48, r*48+47]. Without conversion, topk_ids like 137, 222, 378 would index way out of bounds in the weight tensor (shape (48, N, K)), producing garbage. Derive experts_start_idx from the topk_ids and subtract to get local IDs. This was why all ranks except rank 0 produced zero expert matches → zero output → garbage text.	2026-05-14 17:43:58 +00:00
biondizzle	128ff84358	fix: 384 experts (not 256), clarify cross-rank reduce is in caller DeepSeek-V4-Pro has 384 routed experts, 48 per rank (384/8). The cross-rank all-reduce happens in the parent DeepseekV4MoE.forward, not in our kernel. Our kernel writes local output; caller does reduce. Fixed README, nvfp4_mega_moe.py comments.	2026-05-14 17:33:59 +00:00
biondizzle	1c15dadaa5	cleanup: remove dead _pack_ue4m3_to_uint32, fix data format docs weight_transform.py returns float8_e4m3fn scales, NOT packed uint32. The _pack_ue4m3_to_uint32 function was never called. Removed it. Updated README data formats to accurately reflect the pipeline: - Weight scales: float8_e4m3fn (direct to CUTLASS, no unpack) - Activation scales: uint32 packed (from staging kernel, unpacked to float8)	2026-05-14 17:28:12 +00:00
biondizzle	008f8cccbd	docs: comprehensive README with SF remap probe data, bug history, coordinate table Added detailed SF remap section with the empirical coordinate dump table showing flat_rank=8 decomposition. Documented all 5 bugs found/fixed, the diagnostic trail (constant-scale test, single-element probes), and the 6 verification probes confirming the extraction formula.	2026-05-14 17:02:53 +00:00
biondizzle	1e0cea055c	cleanup: remove all debug printfs from CUDA kernel and weight_transform Removed printf from remap kernel (flat_rank dump, coordinate probes, first-coord log). Removed weight_scale_2 debug prints from weight_transform.py. Production-ready now.	2026-05-14 16:57:32 +00:00
biondizzle	839835cba4	fix: correct SF remap coordinate extraction for flat_rank=8 m = f0 + f1*32 + f2*128 (CuTe 'first sub varies fastest'), k_sf = f4 + f5*4. f3 is the Step<2> stride (degenerate, always=total), NOT a coordinate. Previous formula (f3*2+f2)*128 was catastrophically wrong: it mapped everything to m=0 or m=huge.	2026-05-14 16:40:48 +00:00
biondizzle	1ef2fbc2fd	debug: more indices for SF layout dump	2026-05-14 16:26:15 +00:00
biondizzle	c4b5b52a33	debug: single-thread SF layout dump at specific indices	2026-05-14 16:13:05 +00:00
biondizzle	17e6033ade	debug: print specific indices for SF layout coordinate decomposition	2026-05-14 15:57:55 +00:00
biondizzle	8ee3f90e44	debug: handle flat_rank=8 for SF remap, add coordinate dump Previous approach assumed rank 2-6, but actual rank is 8. For R==8: 4 M sub-indices (inner_32, inner_4, tile_interleave, tile_m), 4 K sub-indices (inner_16, inner_4_k, tile_k_interleave, tile_k). m = (f3*2 + f2)*128 + f0*4 + f1, k_sf = f5 + f6*4 (tentative, needs printf verification). Added printf of all 8 flat values for first 3 indices.	2026-05-14 15:45:52 +00:00
biondizzle	d2c1c76f5b	debug: idx2crd+flatten approach with printf to determine flat_rank Going back to the idx2crd approach which compiles and runs. Added printf for flat_rank, MN, K_sf, and first coordinate extraction. Handles ranks 2-6 with logical (m, k_sf) extraction. This will tell us the actual flat_rank and whether our extraction is correct.	2026-05-14 15:34:46 +00:00
biondizzle	2ac3a7d631	fix: construct nested coordinate for CuTe layout shape ((32,4), K) layout_sf(m, k_elem) with flat ints fails: Mismatched Ranks because the layout shape is ((32,4), K_padded), not (M, K). Decompose m into (inner_m, sub_m) = (m/4, m%4) to match the (32,4) sub-shape, and pass as make_tuple(make_tuple(inner, sub), k_elem).	2026-05-14 15:32:12 +00:00
biondizzle	593ae998f8	fix: clean rewrite of cutlass_nvfp4_gemm.cu — no more file splicing Removed dead code from old idx2crd approach. File is now clean: - Source-iterating SF remap kernel with layout_sf(m, k_elem) - Zero-init dest buffers before remap - Proper extern C wrapping	2026-05-14 15:31:03 +00:00
biondizzle	196ee37fdb	fix: rewrite SF remap kernel — source-iterating with layout_sf(m, k_elem) Ripped out idx2crd + flatten + get<> approach entirely. New kernel iterates over source indices (m, k_group) and uses layout_sf(m, k_elem) to compute the CUTLASS destination offset. CuTe handles nested shape decomposition internally — no rank inspection needed. K coordinate is in element-space (k_group * SFVecSize) as the layout expects. Iterates over groups (not every element) since all 16 elements within a group share one SF byte — avoids 16x redundant writes. Grid size based on source count (MN * K_sf), not dest buffer size.	2026-05-14 15:28:44 +00:00
biondizzle	fb390b24e2	debug: add printf to SF remap kernel to check flat_rank and layout shape	2026-05-14 15:24:18 +00:00
biondizzle	8f5322ca31	fix: add missing extern "C" opening brace lost during file reconstruction	2026-05-14 15:04:43 +00:00
biondizzle	a8bd962452	fix: SF remap - iterate dest indices, extract logical (m, k_sf) from nested coord The forward-map approach (src -> layout_sf(m, k)) failed because CuTe's layout operator requires coordinates matching the nested shape rank, and passing flat (int, int) to a ((32,4),K) shape triggers Mismatched Ranks. New approach: iterate over CUTLASS dest indices, use idx2crd to get the hierarchical coordinate, flatten it, then extract logical (m, k_sf) by interpreting the flattened sub-coordinates correctly: flat[0..2] = (inner_M, sub_M, tile_M) -> m = tile_M*128 + inner_M*4 + sub_M; flat[3..5] = (inner_K, sub_K, tile_K) -> k_sf = tile_K*4 + sub_K (inner_K is within one SF group, i.e. the same byte, so it is ignored for k_sf). Previous bug: get<0> and get<1> of flatten gave (inner_M, sub_M), both M sub-indices. K information was never extracted, so only k_group=0 worked. Dest buffer is zero-initialized so padding slots (where m >= MN or k_sf >= K_sf) stay zero.	2026-05-14 15:01:47 +00:00
biondizzle	395cc31883	fix: use layout_sf(m, k_elem) instead of make_coord for nested shapes make_coord(m, k_elem) produces rank-2 coord, but tile_to_shape creates nested shapes like ((32,4,tiles_m), (16,4,tiles_k)) which expect matching nested coords. layout_sf(m, k) operator handles hierarchical projection automatically.	2026-05-14 14:57:13 +00:00
biondizzle	d90967d6e9	fix: SF remap - element-space K coords + zero-init dest buffer Two fixes: 1. CuTe layout uses element-space K, not group-space. k_group=3 with SFVecSize=16 maps to k_elem=48 in the layout, not k=3. Added SFVecSize param to remap kernel, multiply k_sf * SFVecSize before passing to layout_sf(). 2. Zero-init CUTLASS dest buffer before remap. The layout pads to tile boundaries (128x64), so dest is larger than M*K_sf. Unmapped padding slots reading garbage causes sporadic wrong results. Also fixed grid size to use source count (M*K_sf), not dest size.	2026-05-14 14:54:18 +00:00
biondizzle	5968ebad9f	fix: SF remap was using idx2crd+flatten which gives atom sub-indices, not logical (m,k) The remap kernel iterated over CUTLASS linear indices and tried to reverse-map with idx2crd + flatten. But flatten() on the nested CuTe coordinate (from tile_to_shape(SfAtom{}, ...)) gives atom-level sub-indices, not logical (m, k). This caused all K-groups > 0 in SFA to map to m*K_sf+0, losing K-group information entirely. Proof: setting SFA[0,0]=2.0 changed row 0, but SFA[0,3]=2.0 produced zero change. Only K-group 0 was being read. Fix: iterate over SOURCE indices (row-major m, k) and use the CuTe layout forward: layout_sf(make_coord(m, k)) -> CUTLASS dst index. This is the correct forward direction that CuTe handles natively. Constant-scale test (all SF=1.0) gave cosine=1.0, confirming the FP4 data path is correct. The bug was purely in the SF remap.	2026-05-14 14:51:02 +00:00
biondizzle	cf796e37cf	debug: add weight_scale_2 shape/value logging in weight transform	2026-05-14 14:19:35 +00:00
biondizzle	879adc324d	fix: _fold_global_scale — remove broken logical_widths branch The logical_widths branch took expert 0 and 1's global scales and applied them to ALL experts. For L1 with logical_widths=[3072,3072], every expert got expert-0's scale on its gate half and expert-1's scale on its up half. All other experts' global scales were discarded. The else branch correctly broadcasts each expert's own (E,1) global scale across (E, N, K//16). Removed the dead logical_widths code.	2026-05-14 14:17:44 +00:00
biondizzle	ef9cd023a9	fix: unpack_ue4m3_u32 — uint32 lacks CUDA bitwise ops, use int32 PyTorch doesn't implement bitwise_and/shift for UInt32 on CUDA. Cast to int32 first, then extract bytes, then uint8 → view float8.	2026-05-14 13:44:42 +00:00
biondizzle	1c39e21d87	fix: remove broken L1 weight interleave The interleave assumed gate/up were pre-interleaved in groups of 16 and that we needed 2CTA UMMA layout. Both wrong: 1. vLLM w13_weight is plain concat [gate; up] along output dim 2. Our CUTLASS kernel uses ClusterShape 1x1x1, not 2CTA The interleave was shuffling weights into nonsense, making L1 GEMM compute the wrong thing, and chunk(2) would split wrong halves.	2026-05-14 13:05:45 +00:00
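
The global-to-local expert ID conversion described in commit 2687d1fc53 can be sketched roughly as follows. This is a minimal illustration in plain Python, not the repository's code. The function name `to_local_expert_ids`, the explicit `rank` argument, and the `-1` sentinel for experts owned by another rank are all assumptions made for the example (the commit itself derives the start index from topk_ids rather than taking the rank as a parameter).

```python
# Sizes taken from the commit messages: 384 routed experts over 8 ranks.
NUM_EXPERTS = 384
NUM_RANKS = 8
EXPERTS_PER_RANK = NUM_EXPERTS // NUM_RANKS  # 48


def to_local_expert_ids(topk_ids, rank):
    """Map global expert IDs (0..383) to this rank's local IDs (0..47).

    Hypothetical helper: IDs owned by a different rank become -1 so they
    can never index into the local (48, N, K) weight tensor.
    """
    start = rank * EXPERTS_PER_RANK  # first global ID owned by this rank
    local_ids = []
    for global_id in topk_ids:
        local_id = global_id - start
        if 0 <= local_id < EXPERTS_PER_RANK:
            local_ids.append(local_id)
        else:
            local_ids.append(-1)  # not handled on this rank
    return local_ids


if __name__ == "__main__":
    # The out-of-bounds IDs mentioned in the commit message.
    print(to_local_expert_ids([137, 222, 378], rank=2))  # [41, -1, -1]
```

With the IDs quoted in the commit message, rank 2 (global range 96..143) keeps only expert 137 as local ID 41. Without the subtraction, 137 would be used directly as an index into a tensor with only 48 expert slots.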


86 Commits
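
Several commits above (ef9cd023a9, 1c15dadaa5) deal with activation scales packed four to a uint32 and unpacked to float8. The byte extraction and E4M3 decoding involved can be sketched in plain Python as below. This is only an illustration under stated assumptions, not the repository's unpack_ue4m3_u32: the names `decode_ue4m3` and `unpack_ue4m3_word` are hypothetical, and the byte order (scale 0 in the lowest 8 bits) is an assumption that the commit messages do not confirm.

```python
import math


def decode_ue4m3(byte):
    """Decode one E4M3 byte with the sign bit clear to a Python float.

    Layout: 4 exponent bits (bias 7) followed by 3 mantissa bits.
    """
    exponent = (byte >> 3) & 0xF
    mantissa = byte & 0x7
    if exponent == 0xF and mantissa == 0x7:
        return math.nan  # the only NaN encoding in float8_e4m3fn
    if exponent == 0:
        return (mantissa / 8.0) * 2.0 ** -6  # subnormal
    return (1.0 + mantissa / 8.0) * 2.0 ** (exponent - 7)


def unpack_ue4m3_word(word):
    """Split a 32-bit word into four scale values.

    Assumption: scale i lives in bits [8*i, 8*i + 7] (lowest byte first).
    """
    return [decode_ue4m3((word >> (8 * i)) & 0xFF) for i in range(4)]


if __name__ == "__main__":
    print(unpack_ue4m3_word(0x3C304038))  # [1.0, 2.0, 0.5, 1.5]
```

Python integers have no fixed width, so the shift and mask work directly here. Commit ef9cd023a9 notes that the same operations on a CUDA UInt32 tensor in PyTorch required casting to int32 first.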