nvfp4-megamoe-kernel

Author	SHA1	Message	Date
biondizzle	2ef71dc21a	fix: B tensor K-major strides, scale_b axis swap Two fixes: 1. B tensor: permute(0,2,1).contiguous().permute(0,2,1) gives K-major stride (16384,1,128) matching reference 2. scale_b: transpose to (N, K_sf) before swizzling — reference uses (intermediate, hidden//16) not (hidden//16, intermediate)	2026-05-16 03:04:31 +00:00
biondizzle	6294b84213	fix: B tensor must be K-major (transpose last 2 dims) Reference shows B stride=(16384,1,128) — K is stride-1 (K-major). Our stack produces N-major stride=(16384,128,1). Added .T.contiguous().	2026-05-16 03:03:00 +00:00
biondizzle	7c882fe2e0	fix: correct weight quantization for CuTeDSL kernel Weight K dimension (hidden) must be the packed dimension, not N. Block scales computed along K dim. FP4 packing along K.	2026-05-16 02:58:55 +00:00
biondizzle	ca28f1335d	refactor: copy CuTeDSL kernel into repo with local imports Copied from CUTLASS examples (no more runtime dependency on /root/cutlass/examples/). Fixed all imports to use cutedsl.kernel.* instead of blackwell.kernel.*. Structure: cutedsl/__init__.py cutedsl/kernel/__init__.py cutedsl/kernel/moe/ (the MoE scaled grouped GEMM) cutedsl/kernel/blockscaled_gemm/ (dense blockscaled GEMM) test_cutedsl.py updated to import from our local copy.	2026-05-16 02:57:54 +00:00
biondizzle	a3aa2d201e	fix: clarify import path setup for CuTeDSL	2026-05-16 02:55:25 +00:00
biondizzle	f951d284e7	test: add CuTeDSL NVFP4 GEMM test using reference ScaledGroupedGemmKernel Tests the NVIDIA reference kernel with our quantization pipeline: 1. Quantize BF16 → NVFP4 (our stage_activation logic) 2. Pad and swizzle scale factors (to_blocked) 3. Run ScaledGroupedGemmKernel (2Dx3D scenario) 4. Compare against BF16 matmul reference Also adds cutedsl/moe.py module for the future pipeline integration.	2026-05-16 02:55:04 +00:00
biondizzle	a2ea836c74	docs: add CuTeDSL rewrite plan + reference files The C++ CUTLASS kernel is fundamentally broken (cosine 0.05 with real data). Switching to NVIDIA's CuTeDSL approach based on their official MoE scaled grouped GEMM example. Reference files copied: - moe_torch_scaled_grouped_mm.py (3900 lines — our new kernel) - moe_utils.py, moe_persistent_scheduler.py, moe_sched_extension.py - grouped_blockscaled_gemm.py, dense_blockscaled_gemm_persistent.py - blockscaled_layout.py	2026-05-16 02:41:51 +00:00
biondizzle	c4a262bd54	test: streamline layertest — kernel vs BF16 ref only, exit on fail Removed original checkpoint loading (already verified 0.997 cosine). Test now: load NVFP4 → dequant BF16 ref → run kernel → compare. Exits with code 1 if cosine < 0.99.	2026-05-16 02:29:41 +00:00
biondizzle	de9b50cbe7	fix: use setup.py install for CUTLASS extension build	2026-05-16 02:21:17 +00:00
biondizzle	882bff8fb7	fix: also build CUTLASS C++ extension in run_test.sh	2026-05-16 02:19:40 +00:00
biondizzle	55d9a24bf6	fix: handle model. prefix normalization in checkpoint keys	2026-05-16 02:18:52 +00:00
biondizzle	bdf9f31ae2	fix: checkpoint keys don't have 'model.' prefix	2026-05-16 02:17:13 +00:00
biondizzle	ea5ee7c1f7	fix: remove prefix_filter from layer tensor loading	2026-05-16 02:15:55 +00:00
biondizzle	303b6a8993	cleanup: move useful tests to tests/, nuke stale debug tests Kept (moved to tests/): - test_uniform_fp4.py — proves GEMM math (72.0 = 1.5² × K) - test_b_layout.py — proves B matrix column layout - test_quick_rand.py — quick GEMM sanity check Removed (stale SF remap debug artifacts): - test_forward_map.py, test_gemm_sweep.py, test_m1_gemm.py - test_minimal_gemm.py, test_rand_gemm.py, test_sf_check.py - test_sf_remap.py, test_sf_signed.py, test_sf_layout_diag.cu	2026-05-16 02:14:37 +00:00
biondizzle	2114bd11be	test: add standalone layer 0 comparison test (no vLLM, no Docker) tests/layertest.py: - Loads layer 0 expert weights from both original (MXFP4) and NVFP4 checkpoints - Dequantizes both to BF16 for reference comparison - Runs MoE forward pass in pure BF16 (no kernel) - Runs same forward pass through our NVFP4 CUTLASS kernel - Compares cosine similarity: kernel vs BF16 reference tests/run_test.sh: - Creates venv, installs deps, builds kernel from source, runs test Isolates our kernel completely from vLLM's weight loading, tensor parallelism, and MoE routing. If cosine ≈ 1.0, bug is in vLLM. If cosine ≈ 0, bug is in our kernel pipeline.	2026-05-16 02:13:18 +00:00
biondizzle	294e9f98f2	cleanup: rename _ue8m0_to_float32 → _block_scale_to_float32, remove dead code - Renamed misleading _ue8m0_to_float32 to _block_scale_to_float32 (our checkpoint uses float8_e4m3fn, NOT E8M0) - Removed dead is_scale_e8m0 property (never referenced) - Removed dead _block_scale_to_float32 copy in MegaMoEExperts class - Cleaned up stale E8M0/UE8M0/shift-by-23 comments - Simplified E8M0 assertion to ValueError (not assert False) - Updated DeepseekV4FP8Config docstring for NVFP4	2026-05-16 01:55:56 +00:00
biondizzle	4a624879ca	docs: update DEBUG_LOG — input_scale red herring, current state, next steps	2026-05-16 01:15:49 +00:00
biondizzle	79b9becf9c	revert: don't use checkpoint input_scale for activation normalization Using checkpoint input_scale as the normalization scale saturates FP4 values (all block scales = 448). The input_scale is a calibration constant, NOT the amax/(6*448) normalization scale. Reverted to dynamic amax/(6*448) for activation quantization. The correct use of checkpoint input_scale is still under investigation. Preserved: _w13_input_scale and _w2_input_scale in finalize_weights for future use once we understand the correct alpha contract.	2026-05-16 00:12:41 +00:00
biondizzle	a7eae10ef4	fix: use checkpoint input_scale for activation quantization Critical fix: the checkpoint's input_scale was used during weight calibration but we were computing dynamic scale from data (amax/2688). This was 13x off from the checkpoint value. Changes: - stage_activation() accepts optional input_global_scale parameter - nvfp4_mega_moe_full() accepts l1_input_scale and l2_input_scale - vLLM patch preserves w13/w2_input_scale in finalize_weights - L1 activation uses checkpoint w13_input_scale for quantization - L2 activation uses checkpoint w2_input_scale for quantization - alpha = input_scale * weight_scale_2 (correct calibration contract)	2026-05-15 23:57:08 +00:00
biondizzle	af50e98fe9	test: B layout test with N=128 K=256	2026-05-15 23:52:22 +00:00
biondizzle	efd7a2c56d	test: B matrix weight layout verification via one-hot A	2026-05-15 23:52:00 +00:00
biondizzle	bb5a1ba4c8	cleanup: remove unused slot_token from nvfp4_moe_l2 L2 input is already slot-major, so slot_token was accepted but never passed to the GEMM. Made it explicit by removing the parameter.	2026-05-15 23:50:39 +00:00
biondizzle	887360281e	docs: major update — SF remap verified correct, BF16 ref is the red herring Key finding: the 0.2 cosine was always a wrong reference, not a wrong GEMM. Proof: uniform FP4+SF produces mathematically exact output, and the roundtrip SF verifier passes with 0 errors. Do NOT re-investigate SF remap.	2026-05-15 23:38:34 +00:00
biondizzle	eb26d291cb	test: uniform FP4 + uniform SF sanity check	2026-05-15 23:36:08 +00:00
biondizzle	1f09b51168	test: check SF signed vs unsigned interpretation	2026-05-15 23:35:06 +00:00
biondizzle	4f857d5f99	docs: major DEBUG_LOG update — forward mapping, verifier, full debug timeline	2026-05-15 23:02:30 +00:00
biondizzle	aa209ddd21	debug: add SF remap roundtrip verifier Checks that forward remap wrote the correct bytes by comparing src[mn*stride_mn + k_sf*stride_ksf] against dst[layout_sf(make_coord(mn, k_sf*16, 0))]. Prints error count for SFA and SFB on first GEMM call.	2026-05-15 22:59:44 +00:00
biondizzle	6626b75a2f	fix: use filter_zeros for SF allocation + no-branch forward mapping - Allocation: cute::size(cute::filter_zeros(layout)) matches CUTLASS examples - Kernel: layout_sf(make_coord(mn, k_sf*16, 0)) — no branching on LayoutRank - Avoids silent fallthrough that wrote dst[0] for all threads	2026-05-15 22:58:51 +00:00
biondizzle	6fc8fa61e0	fix: use flat logical coordinate layout_sf(make_coord(mn, k_elem, 0)) CuTe maps compatible flat coordinates into the natural hierarchical coordinate before applying strides. No manual decomposition needed. k_elem = k_sf * 16 (logical K element, not compact SF index).	2026-05-15 22:53:57 +00:00
biondizzle	a48717ccf5	fix: remove duplicate dst_idx declaration	2026-05-15 22:31:05 +00:00
biondizzle	5ff1b9e401	fix: use hierarchical coordinates for layout_sf forward mapping Flat make_coord(mn, k*16) doesn't decompose into the nested atom shape. Must manually decompose: mn -> (m0, m1, mt) where m0=mn%32, m1=(mn/32)%4, mt=mn/128 k_sf -> (k0, k1, kt) where k0=0 (stride-0), k1=k_sf%4, kt=k_sf/4	2026-05-15 22:11:14 +00:00
biondizzle	3b4a7b591f	test: verify forward mapping with prepack vs live SFB	2026-05-15 22:09:56 +00:00
biondizzle	a1fd4d6233	revert: back to layout_sf(make_coord(...)) — crd2idx was unnecessary	2026-05-15 21:55:00 +00:00
biondizzle	ea678ece64	fix: remove duplicate template declaration	2026-05-15 21:54:10 +00:00
biondizzle	59dad8e2fb	fix: use crd2idx instead of layout operator() for SF forward mapping	2026-05-15 21:52:02 +00:00
biondizzle	a09d8e477e	fix: remove static_assert in constexpr else (build fix)	2026-05-15 21:27:27 +00:00
biondizzle	7285331395	fix: replace col_major_src with explicit source strides SFA: src_stride_mn=K_sf, src_stride_ksf=1 (row-major M, K_sf) SFB: src_stride_mn=1, src_stride_ksf=N (row-major K_sf, N after transpose) Removes ambiguity about physical memory layout. The source indexing now uses mn*src_stride_mn + k_sf*src_stride_ksf which works for any contiguous or transposed layout.	2026-05-15 21:23:21 +00:00
biondizzle	f6fd549800	fix: restore col_major_src handling for SFB source layout SFB scales arrive as (K_sf, N) row-major after transpose+contiguous in weight_transform.py. The col_major_src flag correctly describes this. Don't assume both sources are (MN, K_sf).	2026-05-15 21:19:58 +00:00
biondizzle	63e67e1025	fix: rewrite SF remap as forward mapping (source→dst) - Iterate over source indices (MN * K_sf) instead of dst indices - Use layout_sf forward mapping: layout_sf(make_coord(mn, k_sf*16)) - No more idx2crd reverse extraction or stride-0 ambiguity - Cleaner, less error-prone, blog-compatible	2026-05-15 20:51:30 +00:00
biondizzle	30b6c89424	fix: correct SF remap coordinate extraction - First flattened group IS M/N (not K as previously assumed) - mn = f0 + 32*f1 + 128*f2 - k_sf = f4 + 4*f5 (f3 is stride-0 inner K, ignored) - The atom stride-0 dimension (f3) maps to offset 0, not a meaningful K sub-index. The actual k_sf comes from f4 (sub_k) + f5*4 (tile_k) - Original code had group assignment right but k_sf extraction wrong	2026-05-15 20:44:46 +00:00
biondizzle	ff5a0843dc	fix: divide K element index by SFVecSize to get k_sf Based on veitner bearblog analysis of CUTLASS SF layout: - Shape is ((32,4,K_tiles), (SFVecSize,4,M_tiles)) for SFA - get<0..2> covers K dimension, get<3..5> covers M dimension - k_sf = K_element_index / SFVecSize	2026-05-15 20:17:24 +00:00
biondizzle	a09b9b53a3	cleanup: remove printf and diag function from CUDA kernel (build fix)	2026-05-15 20:11:40 +00:00
biondizzle	e7c3341317	docs: update DEBUG_LOG with M/K swap root cause	2026-05-15 20:03:20 +00:00
biondizzle	deb6b3231a	debug: swap M/K in SF remap + add printf diagnostics	2026-05-15 20:01:47 +00:00
biondizzle	22f0457ccf	test: isolate SFA vs SFB remap bug	2026-05-15 19:59:39 +00:00
biondizzle	9eaf6d07e8	test: quick random test	2026-05-15 19:58:57 +00:00
biondizzle	fa7b394571	docs: update DEBUG_LOG with root cause (size→cosize) and full debug timeline	2026-05-15 18:56:09 +00:00
biondizzle	c3841983a0	fix: SF remap uses cute::cosize() instead of cute::size() The comment explicitly warned about this: allocation uses cosize (physical size including tile padding) but the iteration bound used size (logical size). This meant padding positions in the CUTLASS SF layout were never written, leaving them as zero instead of their actual SF values. With uniform data (all-ones), all SF values are the same so the bug was invisible. With random data, different SF values are needed at different positions and the missing writes corrupt the result.	2026-05-15 18:52:23 +00:00
biondizzle	67dcfa83f5	test: random data at small dims + alpha sweep	2026-05-15 18:51:52 +00:00
biondizzle	60f7f60818	test: ultra-minimal GEMM with all-ones	2026-05-15 18:51:31 +00:00
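
Sketch for commits 6294b84213 and 2ef71dc21a (B tensor strides). The pure-Python model below tracks only shapes and strides to show why permute(0,2,1).contiguous().permute(0,2,1) turns the N-major stride (16384,128,1) into the K-major stride (16384,1,128). The helper names and the sizes E=4, K=128, N=128 are assumptions picked to reproduce the strides quoted in the log; no tensor library is involved.

```python
def contiguous_strides(shape):
    """Row-major (C-contiguous) element strides for `shape`."""
    strides = []
    step = 1
    for dim in reversed(shape):
        strides.append(step)
        step *= dim
    return tuple(reversed(strides))


def permute(shape, strides, order):
    """Reorder dimensions as a view: shape and strides move together."""
    return tuple(shape[i] for i in order), tuple(strides[i] for i in order)


def make_contiguous(shape, strides):
    """Model a copy into fresh row-major storage: shape kept, strides reset."""
    return shape, contiguous_strides(shape)


# Hypothetical sizes (assumption): E experts, B viewed as (E, K, N).
E, K, N = 4, 128, 128
shape = (E, K, N)
n_major = contiguous_strides(shape)           # N is the stride-1 dimension

view_shape, view_strides = permute(shape, n_major, (0, 2, 1))
view_shape, view_strides = make_contiguous(view_shape, view_strides)
final_shape, k_major = permute(view_shape, view_strides, (0, 2, 1))

print(n_major)   # (16384, 128, 1)
print(k_major)   # (16384, 1, 128)
```

The logical shape is unchanged by the round trip; only the physical order of the last two dimensions differs, which is the property the reference kernel is reported to require.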
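
Sketch for commits 5ff1b9e401 and 6fc8fa61e0 (SF forward mapping). The decomposition mn -> (m0, m1, mt) and k_sf -> (k1, kt) comes from the commit message; the strides applied to those coordinates (16, 4, 1, and 512 entries per 128x4 tile, with K tiles varying fastest) are an assumption based on the commonly documented blocked scale-factor layout, not something the log confirms. The function name sf_offset is hypothetical.

```python
def sf_offset(mn, k_sf, k_sf_total):
    """Destination offset of scale factor (mn, k_sf) in a blocked SF buffer."""
    # Decomposition quoted in commit 5ff1b9e401.
    m0, m1, mt = mn % 32, (mn // 32) % 4, mn // 128
    k1, kt = k_sf % 4, k_sf // 4
    # Assumed strides: each 128x4 tile holds 512 entries, K tiles vary fastest.
    k_tiles = (k_sf_total + 3) // 4
    return (mt * k_tiles + kt) * 512 + m0 * 16 + m1 * 4 + k1


# With MN = 256 rows and K_sf = 8 scale columns, every offset is hit once.
offsets = sorted(sf_offset(mn, k, 8) for mn in range(256) for k in range(8))
print(offsets == list(range(256 * 8)))  # True
```

Checking that the mapping is a bijection over the padded extent is a cheap way to catch the "all threads wrote dst[0]" class of failure described in commit 6626b75a2f.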
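
Sketch for commit 7285331395 (explicit source strides). This illustrates how a single index expression, mn*src_stride_mn + k_sf*src_stride_ksf, can read both the SFA source, stored row-major as (MN, K_sf), and the SFB source, stored row-major as (K_sf, N). The tiny sizes and the list-of-tuples buffers are assumptions for illustration only.

```python
def src_index(mn, k_sf, stride_mn, stride_ksf):
    """Flat source index from explicit strides, as described in the commit."""
    return mn * stride_mn + k_sf * stride_ksf


MN, K_SF = 4, 3  # hypothetical small sizes

# SFA source: row-major (MN, K_sf), so stride_mn = K_sf and stride_ksf = 1.
sfa = [(mn, k) for mn in range(MN) for k in range(K_SF)]
# SFB source: row-major (K_sf, N), so stride_mn = 1 and stride_ksf = N.
sfb = [(mn, k) for k in range(K_SF) for mn in range(MN)]

sfa_ok = all(sfa[src_index(mn, k, K_SF, 1)] == (mn, k)
             for mn in range(MN) for k in range(K_SF))
sfb_ok = all(sfb[src_index(mn, k, 1, MN)] == (mn, k)
             for mn in range(MN) for k in range(K_SF))
print(sfa_ok, sfb_ok)  # True True
```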
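
Sketch for commits c3841983a0 and 6626b75a2f (size, cosize, and filter_zeros). The model below treats a layout as a flat (shape, stride) pair and compares three counts: the logical size, the cosize (largest offset plus one), and the size after dropping stride-0 modes, which is how filter_zeros is modeled here. The example layouts are assumptions, far simpler than the real CUTLASS SF layout.

```python
from itertools import product
from math import prod


def offsets(shape, stride):
    """All memory offsets produced by a flat (shape, stride) layout."""
    coords = product(*(range(n) for n in shape))
    return [sum(c * s for c, s in zip(coord, stride)) for coord in coords]


def size(shape):
    """Number of logical coordinates."""
    return prod(shape)


def cosize(shape, stride):
    """Extent of the offset range: largest offset plus one."""
    return max(offsets(shape, stride)) + 1


def filtered_size(shape, stride):
    """Logical size after dropping stride-0 (broadcast) modes."""
    return prod(n for n, s in zip(shape, stride) if s != 0)


padded = ((3, 4), (8, 1))      # rows padded from 4 to 8 entries
broadcast = ((4, 16), (1, 0))  # second mode is stride-0

print(size(padded[0]), cosize(*padded))                  # 12 20
print(size(broadcast[0]), cosize(*broadcast),
      filtered_size(*broadcast))                         # 64 4 4
```

In the padded case the buffer must hold 20 entries although only 12 logical coordinates exist; in the broadcast case the logical size overcounts the 4 distinct entries by a factor of 16.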
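
Sketch for commits a7eae10ef4 and 79b9becf9c (activation scale). The arithmetic below shows why a global scale much smaller than amax/(6*448) could drive every block scale to the 448 ceiling. Here 6 is the largest FP4 E2M1 magnitude and 448 the largest finite float8_e4m3fn value; the block-scale formula, the sample amax values, and the use of a 13x smaller scale as a stand-in for the checkpoint constant are assumptions, not the repository's code.

```python
FP4_MAX = 6.0    # largest magnitude representable in FP4 E2M1
FP8_MAX = 448.0  # largest finite float8_e4m3fn value


def global_scale(amax):
    """Dynamic normalization scale from the tensor-wide absolute maximum."""
    return amax / (FP4_MAX * FP8_MAX)


def block_scale(block_amax, gscale):
    """Per-block scale stored in FP8, clamped to its representable maximum."""
    return min(block_amax / (FP4_MAX * gscale), FP8_MAX)


block_amaxes = [8.0, 2.0, 1.0]           # hypothetical per-block maxima
dynamic = global_scale(max(block_amaxes))
too_small = dynamic / 13.0               # stand-in for a mismatched constant

dynamic_scales = [block_scale(a, dynamic) for a in block_amaxes]
saturated_scales = [block_scale(a, too_small) for a in block_amaxes]
print(dynamic_scales)     # approximately [448, 112, 56]
print(saturated_scales)   # [448.0, 448.0, 448.0]
```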
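
Sketch for commit c4a262bd54 (cosine gate in the layer test). This is a minimal pure-Python version of the pass/fail rule described there: compare kernel output against the BF16 reference and report exit code 1 when cosine similarity falls below 0.99. The function names and the flat-list inputs are assumptions; the real test presumably works on tensors.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def exit_code(kernel_out, reference, threshold=0.99):
    """Return 0 on pass and 1 on fail, mirroring the rule in the commit."""
    return 0 if cosine_similarity(kernel_out, reference) >= threshold else 1


reference = [1.0, 2.0, 3.0, 4.0]
close = [1.01, 1.98, 3.02, 3.99]     # small quantization-like error
unrelated = [4.0, -3.0, 2.0, -1.0]   # orthogonal to the reference

print(exit_code(close, reference))      # 0
print(exit_code(unrelated, reference))  # 1
```

A caller would pass the result to sys.exit so that a shell script such as run_test.sh can stop on failure.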

719 Commits