nvfp4-megamoe-kernel

Author	SHA1	Message	Date
biondizzle	09ff5c5b98	feat: full NVFP4 MoE pipeline (L1→SiLU→L2→scatter) cutedsl/moe_pipeline.py: complete pipeline - stage_activation: BF16 → NVFP4 (keeps data in FP4) - L1 GEMM: NVFP4 × NVFP4 → BF16 (gate+up) - SiLU(gate) * up: BF16 (only nonlinear, can't avoid) - Re-quantize: BF16 → NVFP4 (back to native) - L2 GEMM: NVFP4 × NVFP4 → BF16 (down_proj) - Scatter with routing weights → BF16 output layertest.py: now tests the FULL MoE pipeline against BF16 reference. NVFP4-native: both GEMMs use float4_e2m1fn_x2 for A and B, float8_e4m3fn for block scales, float32 for global scales. BF16 only for SiLU activation and final scatter.	2026-05-16 03:22:43 +00:00
biondizzle	0359215ab4	fix: compare kernel vs BF16 in slot-major layout	2026-05-16 03:18:41 +00:00
biondizzle	ed18638a3c	fix: slot-major token layout for grouped GEMM Tokens must be laid out as [expert0_tokens | expert1_tokens | ...] for the 2Dx3D grouped GEMM. Each expert gets its own contiguous block of tokens. Scale factors split by expert offsets.	2026-05-16 03:17:19 +00:00
biondizzle	5385de3142	fix: layertest tests L1 GEMM only with correct output size L1 produces (tokens, 6144) gate+up, not (tokens, 7168) hidden. Compare against BF16 L1 reference only.	2026-05-16 03:15:29 +00:00
biondizzle	0cdcc4144a	refactor: add cutedsl/bridge.py, rewrite layertest to use it bridge.py: clean API for CuTeDSL kernel - quantize_to_nvfp4 / quantize_weight_to_nvfp4 - assemble_scales_2d_side / assemble_scales_3d_side - make_b_k_major (stride conversion) - compute_expert_offsets - run_nvfp4_grouped_gemm (full kernel launch) layertest.py: now uses bridge layer, tests with real DeepSeek-V4 layer 0 weights (7168 hidden, 6144 intermediate). The bridge code will be reused by the vLLM integration layer.	2026-05-16 03:13:54 +00:00
biondizzle	2ef71dc21a	fix: B tensor K-major strides, scale_b axis swap Two fixes: 1. B tensor: permute(0,2,1).contiguous().permute(0,2,1) gives K-major stride (16384,1,128) matching reference 2. scale_b: transpose to (N, K_sf) before swizzling — reference uses (intermediate, hidden//16) not (hidden//16, intermediate)	2026-05-16 03:04:31 +00:00
biondizzle	6294b84213	fix: B tensor must be K-major (transpose last 2 dims) Reference shows B stride=(16384,1,128) — K is stride-1 (K-major). Our stack produces N-major stride=(16384,128,1). Added .T.contiguous().	2026-05-16 03:03:00 +00:00
biondizzle	7c882fe2e0	fix: correct weight quantization for CuTeDSL kernel Weight K dimension (hidden) must be the packed dimension, not N. Block scales computed along K dim. FP4 packing along K.	2026-05-16 02:58:55 +00:00
biondizzle	ca28f1335d	refactor: copy CuTeDSL kernel into repo with local imports Copied from CUTLASS examples (no more runtime dependency on /root/cutlass/examples/). Fixed all imports to use cutedsl.kernel.* instead of blackwell.kernel.*. Structure: cutedsl/__init__.py cutedsl/kernel/__init__.py cutedsl/kernel/moe/ (the MoE scaled grouped GEMM) cutedsl/kernel/blockscaled_gemm/ (dense blockscaled GEMM) test_cutedsl.py updated to import from our local copy.	2026-05-16 02:57:54 +00:00
biondizzle	a3aa2d201e	fix: clarify import path setup for CuTeDSL	2026-05-16 02:55:25 +00:00
biondizzle	f951d284e7	test: add CuTeDSL NVFP4 GEMM test using reference ScaledGroupedGemmKernel Tests the NVIDIA reference kernel with our quantization pipeline: 1. Quantize BF16 → NVFP4 (our stage_activation logic) 2. Pad and swizzle scale factors (to_blocked) 3. Run ScaledGroupedGemmKernel (2Dx3D scenario) 4. Compare against BF16 matmul reference Also adds cutedsl/moe.py module for the future pipeline integration.	2026-05-16 02:55:04 +00:00
biondizzle	a2ea836c74	docs: add CuTeDSL rewrite plan + reference files The C++ CUTLASS kernel is fundamentally broken (cosine 0.05 with real data). Switching to NVIDIA's CuTeDSL approach based on their official MoE scaled grouped GEMM example. Reference files copied: - moe_torch_scaled_grouped_mm.py (3900 lines — our new kernel) - moe_utils.py, moe_persistent_scheduler.py, moe_sched_extension.py - grouped_blockscaled_gemm.py, dense_blockscaled_gemm_persistent.py - blockscaled_layout.py	2026-05-16 02:41:51 +00:00
biondizzle	c4a262bd54	test: streamline layertest — kernel vs BF16 ref only, exit on fail Removed original checkpoint loading (already verified 0.997 cosine). Test now: load NVFP4 → dequant BF16 ref → run kernel → compare. Exits with code 1 if cosine < 0.99.	2026-05-16 02:29:41 +00:00
biondizzle	de9b50cbe7	fix: use setup.py install for CUTLASS extension build	2026-05-16 02:21:17 +00:00
biondizzle	882bff8fb7	fix: also build CUTLASS C++ extension in run_test.sh	2026-05-16 02:19:40 +00:00
biondizzle	55d9a24bf6	fix: handle model. prefix normalization in checkpoint keys	2026-05-16 02:18:52 +00:00
biondizzle	bdf9f31ae2	fix: checkpoint keys don't have 'model.' prefix	2026-05-16 02:17:13 +00:00
biondizzle	ea5ee7c1f7	fix: remove prefix_filter from layer tensor loading	2026-05-16 02:15:55 +00:00
biondizzle	303b6a8993	cleanup: move useful tests to tests/, nuke stale debug tests Kept (moved to tests/): - test_uniform_fp4.py — proves GEMM math (72.0 = 1.5² × K) - test_b_layout.py — proves B matrix column layout - test_quick_rand.py — quick GEMM sanity check Removed (stale SF remap debug artifacts): - test_forward_map.py, test_gemm_sweep.py, test_m1_gemm.py - test_minimal_gemm.py, test_rand_gemm.py, test_sf_check.py - test_sf_remap.py, test_sf_signed.py, test_sf_layout_diag.cu	2026-05-16 02:14:37 +00:00
biondizzle	2114bd11be	test: add standalone layer 0 comparison test (no vLLM, no Docker) tests/layertest.py: - Loads layer 0 expert weights from both original (MXFP4) and NVFP4 checkpoints - Dequantizes both to BF16 for reference comparison - Runs MoE forward pass in pure BF16 (no kernel) - Runs same forward pass through our NVFP4 CUTLASS kernel - Compares cosine similarity: kernel vs BF16 reference tests/run_test.sh: - Creates venv, installs deps, builds kernel from source, runs test Isolates our kernel completely from vLLM's weight loading, tensor parallelism, and MoE routing. If cosine ≈ 1.0, bug is in vLLM. If cosine ≈ 0, bug is in our kernel pipeline.	2026-05-16 02:13:18 +00:00
biondizzle	294e9f98f2	cleanup: rename _ue8m0_to_float32 → _block_scale_to_float32, remove dead code - Renamed misleading _ue8m0_to_float32 to _block_scale_to_float32 (our checkpoint uses float8_e4m3fn, NOT E8M0) - Removed dead is_scale_e8m0 property (never referenced) - Removed dead _block_scale_to_float32 copy in MegaMoEExperts class - Cleaned up stale E8M0/UE8M0/shift-by-23 comments - Simplified E8M0 assertion to ValueError (not assert False) - Updated DeepseekV4FP8Config docstring for NVFP4	2026-05-16 01:55:56 +00:00
biondizzle	4a624879ca	docs: update DEBUG_LOG — input_scale red herring, current state, next steps	2026-05-16 01:15:49 +00:00
biondizzle	79b9becf9c	revert: don't use checkpoint input_scale for activation normalization Using checkpoint input_scale as the normalization scale saturates FP4 values (all block scales = 448). The input_scale is a calibration constant, NOT the amax/(6*448) normalization scale. Reverted to dynamic amax/(6*448) for activation quantization. The correct use of checkpoint input_scale is still under investigation. Preserved: _w13_input_scale and _w2_input_scale in finalize_weights for future use once we understand the correct alpha contract.	2026-05-16 00:12:41 +00:00
biondizzle	a7eae10ef4	fix: use checkpoint input_scale for activation quantization Critical fix: the checkpoint's input_scale was used during weight calibration but we were computing dynamic scale from data (amax/2688). This was 13x off from the checkpoint value. Changes: - stage_activation() accepts optional input_global_scale parameter - nvfp4_mega_moe_full() accepts l1_input_scale and l2_input_scale - vLLM patch preserves w13/w2_input_scale in finalize_weights - L1 activation uses checkpoint w13_input_scale for quantization - L2 activation uses checkpoint w2_input_scale for quantization - alpha = input_scale * weight_scale_2 (correct calibration contract)	2026-05-15 23:57:08 +00:00
biondizzle	af50e98fe9	test: B layout test with N=128 K=256	2026-05-15 23:52:22 +00:00
biondizzle	efd7a2c56d	test: B matrix weight layout verification via one-hot A	2026-05-15 23:52:00 +00:00
biondizzle	bb5a1ba4c8	cleanup: remove unused slot_token from nvfp4_moe_l2 L2 input is already slot-major, so slot_token was accepted but never passed to the GEMM. Made it explicit by removing the parameter.	2026-05-15 23:50:39 +00:00
biondizzle	887360281e	docs: major update — SF remap verified correct, BF16 ref is the red herring Key finding: the 0.2 cosine was always a wrong reference, not a wrong GEMM. Proof: uniform FP4+SF produces mathematically exact output, and the roundtrip SF verifier passes with 0 errors. Do NOT re-investigate SF remap.	2026-05-15 23:38:34 +00:00
biondizzle	eb26d291cb	test: uniform FP4 + uniform SF sanity check	2026-05-15 23:36:08 +00:00
biondizzle	1f09b51168	test: check SF signed vs unsigned interpretation	2026-05-15 23:35:06 +00:00
biondizzle	4f857d5f99	docs: major DEBUG_LOG update — forward mapping, verifier, full debug timeline	2026-05-15 23:02:30 +00:00
biondizzle	aa209ddd21	debug: add SF remap roundtrip verifier Checks that forward remap wrote the correct bytes by comparing src[mn*stride_mn + k_sf*stride_ksf] against dst[layout_sf(make_coord(mn, k_sf*16, 0))]. Prints error count for SFA and SFB on first GEMM call.	2026-05-15 22:59:44 +00:00
biondizzle	6626b75a2f	fix: use filter_zeros for SF allocation + no-branch forward mapping - Allocation: cute::size(cute::filter_zeros(layout)) matches CUTLASS examples - Kernel: layout_sf(make_coord(mn, k_sf*16, 0)) — no branching on LayoutRank - Avoids silent fallthrough that wrote dst[0] for all threads	2026-05-15 22:58:51 +00:00
biondizzle	6fc8fa61e0	fix: use flat logical coordinate layout_sf(make_coord(mn, k_elem, 0)) CuTe maps compatible flat coordinates into the natural hierarchical coordinate before applying strides. No manual decomposition needed. k_elem = k_sf * 16 (logical K element, not compact SF index).	2026-05-15 22:53:57 +00:00
biondizzle	a48717ccf5	fix: remove duplicate dst_idx declaration	2026-05-15 22:31:05 +00:00
biondizzle	5ff1b9e401	fix: use hierarchical coordinates for layout_sf forward mapping Flat make_coord(mn, k*16) doesn't decompose into the nested atom shape. Must manually decompose: mn -> (m0, m1, mt) where m0=mn%32, m1=(mn/32)%4, mt=mn/128 k_sf -> (k0, k1, kt) where k0=0 (stride-0), k1=k_sf%4, kt=k_sf/4	2026-05-15 22:11:14 +00:00
biondizzle	3b4a7b591f	test: verify forward mapping with prepack vs live SFB	2026-05-15 22:09:56 +00:00
biondizzle	a1fd4d6233	revert: back to layout_sf(make_coord(...)) — crd2idx was unnecessary	2026-05-15 21:55:00 +00:00
biondizzle	ea678ece64	fix: remove duplicate template declaration	2026-05-15 21:54:10 +00:00
biondizzle	59dad8e2fb	fix: use crd2idx instead of layout operator() for SF forward mapping	2026-05-15 21:52:02 +00:00
biondizzle	a09d8e477e	fix: remove static_assert in constexpr else (build fix)	2026-05-15 21:27:27 +00:00
biondizzle	7285331395	fix: replace col_major_src with explicit source strides SFA: src_stride_mn=K_sf, src_stride_ksf=1 (row-major M, K_sf) SFB: src_stride_mn=1, src_stride_ksf=N (row-major K_sf, N after transpose) Removes ambiguity about physical memory layout. The source indexing now uses mn*src_stride_mn + k_sf*src_stride_ksf which works for any contiguous or transposed layout.	2026-05-15 21:23:21 +00:00
biondizzle	f6fd549800	fix: restore col_major_src handling for SFB source layout SFB scales arrive as (K_sf, N) row-major after transpose+contiguous in weight_transform.py. The col_major_src flag correctly describes this. Don't assume both sources are (MN, K_sf).	2026-05-15 21:19:58 +00:00
biondizzle	63e67e1025	fix: rewrite SF remap as forward mapping (source→dst) - Iterate over source indices (MN * K_sf) instead of dst indices - Use layout_sf forward mapping: layout_sf(make_coord(mn, k_sf*16)) - No more idx2crd reverse extraction or stride-0 ambiguity - Cleaner, less error-prone, blog-compatible	2026-05-15 20:51:30 +00:00
biondizzle	30b6c89424	fix: correct SF remap coordinate extraction - First flattened group IS M/N (not K as previously assumed) - mn = f0 + 32*f1 + 128*f2 - k_sf = f4 + 4*f5 (f3 is stride-0 inner K, ignored) - The atom stride-0 dimension (f3) maps to offset 0, not a meaningful K sub-index. The actual k_sf comes from f4 (sub_k) + f5*4 (tile_k) - Original code had group assignment right but k_sf extraction wrong	2026-05-15 20:44:46 +00:00
biondizzle	ff5a0843dc	fix: divide K element index by SFVecSize to get k_sf Based on veitner bearblog analysis of CUTLASS SF layout: - Shape is ((32,4,K_tiles), (SFVecSize,4,M_tiles)) for SFA - get<0..2> covers K dimension, get<3..5> covers M dimension - k_sf = K_element_index / SFVecSize	2026-05-15 20:17:24 +00:00
biondizzle	a09b9b53a3	cleanup: remove printf and diag function from CUDA kernel (build fix)	2026-05-15 20:11:40 +00:00
biondizzle	e7c3341317	docs: update DEBUG_LOG with M/K swap root cause	2026-05-15 20:03:20 +00:00
biondizzle	deb6b3231a	debug: swap M/K in SF remap + add printf diagnostics	2026-05-15 20:01:47 +00:00
biondizzle	22f0457ccf	test: isolate SFA vs SFB remap bug	2026-05-15 19:59:39 +00:00
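
The SF remap commits above (63e67e1025 through aa209ddd21) describe a forward mapping from a source scale at (mn, k_sf) to an offset in the swizzled destination buffer, followed by a roundtrip check. The sketch below illustrates that idea in plain Python. It is not the repository's CUDA code. It assumes 128x4 scale-factor tiles of 512 entries, in-tile strides of 16 (m0), 4 (m1) and 1 (k1), and K tiles stored innermost; the decomposition mn -> (m0, m1, mt), k_sf -> (k1, kt) is taken from commit 5ff1b9e401. The function names are hypothetical.

```python
# Sketch of a scale-factor (SF) forward remap: source (mn, k_sf) -> swizzled offset.
# ASSUMPTIONS (not confirmed by the log): 128x4 SF tiles of 512 entries,
# in-tile strides m0*16 + m1*4 + k1, and K tiles innermost in tile order.

TILE_ROWS, TILE_COLS = 128, 4
TILE_SIZE = TILE_ROWS * TILE_COLS  # 512 scale factors per tile


def sf_dst_index(mn, k_sf, num_k_sf):
    """Destination offset of scale (mn, k_sf); hypothetical helper."""
    k_tiles = (num_k_sf + TILE_COLS - 1) // TILE_COLS  # pad K_sf up to a multiple of 4
    m0, m1, mt = mn % 32, (mn // 32) % 4, mn // 128
    k1, kt = k_sf % 4, k_sf // 4
    return (mt * k_tiles + kt) * TILE_SIZE + m0 * 16 + m1 * 4 + k1


def remap_scales(src, rows, num_k_sf, stride_mn, stride_ksf, pad=0):
    """Iterate over source indices and write each scale to its swizzled slot."""
    m_tiles = (rows + TILE_ROWS - 1) // TILE_ROWS
    k_tiles = (num_k_sf + TILE_COLS - 1) // TILE_COLS
    dst = [pad] * (m_tiles * k_tiles * TILE_SIZE)
    for mn in range(rows):
        for k_sf in range(num_k_sf):
            dst[sf_dst_index(mn, k_sf, num_k_sf)] = src[mn * stride_mn + k_sf * stride_ksf]
    return dst


def verify_roundtrip(src, dst, rows, num_k_sf, stride_mn, stride_ksf):
    """Count positions where dst does not hold the expected source value."""
    errors = 0
    for mn in range(rows):
        for k_sf in range(num_k_sf):
            expected = src[mn * stride_mn + k_sf * stride_ksf]
            if dst[sf_dst_index(mn, k_sf, num_k_sf)] != expected:
                errors += 1
    return errors
```

With explicit source strides, the same loop covers both cases named in commit 7285331395: an SFA-style source uses stride_mn=K_sf, stride_ksf=1, and an SFB-style source stored as (K_sf, N) row-major uses stride_mn=1, stride_ksf=N.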

74 Commits
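
Commits a7eae10ef4 and 79b9becf9c refer to a dynamic global scale of amax/2688 for activation quantization, where 2688 = 6 × 448 (6 is the largest E2M1 magnitude and 448 the largest float8_e4m3fn value). The sketch below works through that arithmetic for one 16-element block in plain Python. It is an illustration under stated assumptions, not the repository's stage_activation: the function names are hypothetical, the block scale is clamped to 448 but not actually rounded to float8_e4m3fn, and values are rounded to the nearest E2M1 magnitude.

```python
import math

E2M1_MAX = 6.0      # largest magnitude representable in FP4 E2M1
E4M3_MAX = 448.0    # largest finite value of float8_e4m3fn
E2M1_GRID = (0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0)


def global_scale(amax):
    """Dynamic per-tensor scale: amax / (6 * 448) = amax / 2688."""
    return amax / (E2M1_MAX * E4M3_MAX)


def quantize_block(block, gscale):
    """Quantize one block; returns (block_scale, list of signed E2M1 values)."""
    block_amax = max(abs(v) for v in block)
    if block_amax == 0.0 or gscale == 0.0:
        return 0.0, [0.0] * len(block)
    # Block scale in units of gscale. ASSUMPTION: a real pipeline would round
    # this to float8_e4m3fn; here it is only clamped to the E4M3 maximum.
    bscale = min(block_amax / E2M1_MAX / gscale, E4M3_MAX)
    step = bscale * gscale
    codes = []
    for v in block:
        target = abs(v) / step
        mag = min(E2M1_GRID, key=lambda g: abs(target - g))
        codes.append(math.copysign(mag, v))
    return bscale, codes


def dequantize_block(codes, bscale, gscale):
    """Reconstruct approximate values: code * block_scale * global_scale."""
    return [c * bscale * gscale for c in codes]
```

Under this scheme the block holding the tensor-wide amax gets a block scale of 448. If the global scale is much smaller than amax/2688, every block scale hits the 448 clamp and the FP4 values saturate, which matches the symptom described in commit 79b9becf9c.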