nvfp4-megamoe-kernel

Author	SHA1	Message	Date
biondizzle	ea5ee7c1f7	fix: remove prefix_filter from layer tensor loading	2026-05-16 02:15:55 +00:00
biondizzle	303b6a8993	cleanup: move useful tests to tests/, nuke stale debug tests Kept (moved to tests/): - test_uniform_fp4.py — proves GEMM math (72.0 = 1.5² × K) - test_b_layout.py — proves B matrix column layout - test_quick_rand.py — quick GEMM sanity check Removed (stale SF remap debug artifacts): - test_forward_map.py, test_gemm_sweep.py, test_m1_gemm.py - test_minimal_gemm.py, test_rand_gemm.py, test_sf_check.py - test_sf_remap.py, test_sf_signed.py, test_sf_layout_diag.cu	2026-05-16 02:14:37 +00:00
biondizzle	2114bd11be	test: add standalone layer 0 comparison test (no vLLM, no Docker) tests/layertest.py: - Loads layer 0 expert weights from both original (MXFP4) and NVFP4 checkpoints - Dequantizes both to BF16 for reference comparison - Runs MoE forward pass in pure BF16 (no kernel) - Runs same forward pass through our NVFP4 CUTLASS kernel - Compares cosine similarity: kernel vs BF16 reference tests/run_test.sh: - Creates venv, installs deps, builds kernel from source, runs test Isolates our kernel completely from vLLM's weight loading, tensor parallelism, and MoE routing. If cosine ≈ 1.0, bug is in vLLM. If cosine ≈ 0, bug is in our kernel pipeline.	2026-05-16 02:13:18 +00:00
biondizzle	294e9f98f2	cleanup: rename _ue8m0_to_float32 → _block_scale_to_float32, remove dead code - Renamed misleading _ue8m0_to_float32 to _block_scale_to_float32 (our checkpoint uses float8_e4m3fn, NOT E8M0) - Removed dead is_scale_e8m0 property (never referenced) - Removed dead _block_scale_to_float32 copy in MegaMoEExperts class - Cleaned up stale E8M0/UE8M0/shift-by-23 comments - Simplified E8M0 assertion to ValueError (not assert False) - Updated DeepseekV4FP8Config docstring for NVFP4	2026-05-16 01:55:56 +00:00
biondizzle	4a624879ca	docs: update DEBUG_LOG — input_scale red herring, current state, next steps	2026-05-16 01:15:49 +00:00
biondizzle	79b9becf9c	revert: don't use checkpoint input_scale for activation normalization Using checkpoint input_scale as the normalization scale saturates FP4 values (all block scales = 448). The input_scale is a calibration constant, NOT the amax/(6*448) normalization scale. Reverted to dynamic amax/(6*448) for activation quantization. The correct use of checkpoint input_scale is still under investigation. Preserved: _w13_input_scale and _w2_input_scale in finalize_weights for future use once we understand the correct alpha contract.	2026-05-16 00:12:41 +00:00
biondizzle	a7eae10ef4	fix: use checkpoint input_scale for activation quantization Critical fix: the checkpoint's input_scale was used during weight calibration but we were computing dynamic scale from data (amax/2688). This was 13x off from the checkpoint value. Changes: - stage_activation() accepts optional input_global_scale parameter - nvfp4_mega_moe_full() accepts l1_input_scale and l2_input_scale - vLLM patch preserves w13/w2_input_scale in finalize_weights - L1 activation uses checkpoint w13_input_scale for quantization - L2 activation uses checkpoint w2_input_scale for quantization - alpha = input_scale * weight_scale_2 (correct calibration contract)	2026-05-15 23:57:08 +00:00
biondizzle	af50e98fe9	test: B layout test with N=128 K=256	2026-05-15 23:52:22 +00:00
biondizzle	efd7a2c56d	test: B matrix weight layout verification via one-hot A	2026-05-15 23:52:00 +00:00
biondizzle	bb5a1ba4c8	cleanup: remove unused slot_token from nvfp4_moe_l2 L2 input is already slot-major, so slot_token was accepted but never passed to the GEMM. Made it explicit by removing the parameter.	2026-05-15 23:50:39 +00:00
biondizzle	887360281e	docs: major update — SF remap verified correct, BF16 ref is the red herring Key finding: the 0.2 cosine was always a wrong reference, not a wrong GEMM. Proof: uniform FP4+SF produces mathematically exact output, and the roundtrip SF verifier passes with 0 errors. Do NOT re-investigate SF remap.	2026-05-15 23:38:34 +00:00
biondizzle	eb26d291cb	test: uniform FP4 + uniform SF sanity check	2026-05-15 23:36:08 +00:00
biondizzle	1f09b51168	test: check SF signed vs unsigned interpretation	2026-05-15 23:35:06 +00:00
biondizzle	4f857d5f99	docs: major DEBUG_LOG update — forward mapping, verifier, full debug timeline	2026-05-15 23:02:30 +00:00
biondizzle	aa209ddd21	debug: add SF remap roundtrip verifier Checks that forward remap wrote the correct bytes by comparing src[mn*stride_mn + k_sf*stride_ksf] against dst[layout_sf(make_coord(mn, k_sf*16, 0))]. Prints error count for SFA and SFB on first GEMM call.	2026-05-15 22:59:44 +00:00
biondizzle	6626b75a2f	fix: use filter_zeros for SF allocation + no-branch forward mapping - Allocation: cute::size(cute::filter_zeros(layout)) matches CUTLASS examples - Kernel: layout_sf(make_coord(mn, k_sf*16, 0)) — no branching on LayoutRank - Avoids silent fallthrough that wrote dst[0] for all threads	2026-05-15 22:58:51 +00:00
biondizzle	6fc8fa61e0	fix: use flat logical coordinate layout_sf(make_coord(mn, k_elem, 0)) CuTe maps compatible flat coordinates into the natural hierarchical coordinate before applying strides. No manual decomposition needed. k_elem = k_sf * 16 (logical K element, not compact SF index).	2026-05-15 22:53:57 +00:00
biondizzle	a48717ccf5	fix: remove duplicate dst_idx declaration	2026-05-15 22:31:05 +00:00
biondizzle	5ff1b9e401	fix: use hierarchical coordinates for layout_sf forward mapping Flat make_coord(mn, k*16) doesn't decompose into the nested atom shape. Must manually decompose: mn -> (m0, m1, mt) where m0=mn%32, m1=(mn/32)%4, mt=mn/128 k_sf -> (k0, k1, kt) where k0=0 (stride-0), k1=k_sf%4, kt=k_sf/4	2026-05-15 22:11:14 +00:00
biondizzle	3b4a7b591f	test: verify forward mapping with prepack vs live SFB	2026-05-15 22:09:56 +00:00
biondizzle	a1fd4d6233	revert: back to layout_sf(make_coord(...)) — crd2idx was unnecessary	2026-05-15 21:55:00 +00:00
biondizzle	ea678ece64	fix: remove duplicate template declaration	2026-05-15 21:54:10 +00:00
biondizzle	59dad8e2fb	fix: use crd2idx instead of layout operator() for SF forward mapping	2026-05-15 21:52:02 +00:00
biondizzle	a09d8e477e	fix: remove static_assert in constexpr else (build fix)	2026-05-15 21:27:27 +00:00
biondizzle	7285331395	fix: replace col_major_src with explicit source strides SFA: src_stride_mn=K_sf, src_stride_ksf=1 (row-major M, K_sf) SFB: src_stride_mn=1, src_stride_ksf=N (row-major K_sf, N after transpose) Removes ambiguity about physical memory layout. The source indexing now uses mn*src_stride_mn + k_sf*src_stride_ksf which works for any contiguous or transposed layout.	2026-05-15 21:23:21 +00:00
biondizzle	f6fd549800	fix: restore col_major_src handling for SFB source layout SFB scales arrive as (K_sf, N) row-major after transpose+contiguous in weight_transform.py. The col_major_src flag correctly describes this. Don't assume both sources are (MN, K_sf).	2026-05-15 21:19:58 +00:00
biondizzle	63e67e1025	fix: rewrite SF remap as forward mapping (source→dst) - Iterate over source indices (MN * K_sf) instead of dst indices - Use layout_sf forward mapping: layout_sf(make_coord(mn, k_sf*16)) - No more idx2crd reverse extraction or stride-0 ambiguity - Cleaner, less error-prone, blog-compatible	2026-05-15 20:51:30 +00:00
biondizzle	30b6c89424	fix: correct SF remap coordinate extraction - First flattened group IS M/N (not K as previously assumed) - mn = f0 + 32*f1 + 128*f2 - k_sf = f4 + 4*f5 (f3 is stride-0 inner K, ignored) - The atom stride-0 dimension (f3) maps to offset 0, not a meaningful K sub-index. The actual k_sf comes from f4 (sub_k) + f5*4 (tile_k) - Original code had group assignment right but k_sf extraction wrong	2026-05-15 20:44:46 +00:00
biondizzle	ff5a0843dc	fix: divide K element index by SFVecSize to get k_sf Based on veitner bearblog analysis of CUTLASS SF layout: - Shape is ((32,4,K_tiles), (SFVecSize,4,M_tiles)) for SFA - get<0..2> covers K dimension, get<3..5> covers M dimension - k_sf = K_element_index / SFVecSize	2026-05-15 20:17:24 +00:00
biondizzle	a09b9b53a3	cleanup: remove printf and diag function from CUDA kernel (build fix)	2026-05-15 20:11:40 +00:00
biondizzle	e7c3341317	docs: update DEBUG_LOG with M/K swap root cause	2026-05-15 20:03:20 +00:00
biondizzle	deb6b3231a	debug: swap M/K in SF remap + add printf diagnostics	2026-05-15 20:01:47 +00:00
biondizzle	22f0457ccf	test: isolate SFA vs SFB remap bug	2026-05-15 19:59:39 +00:00
biondizzle	9eaf6d07e8	test: quick random test	2026-05-15 19:58:57 +00:00
biondizzle	fa7b394571	docs: update DEBUG_LOG with root cause (size→cosize) and full debug timeline	2026-05-15 18:56:09 +00:00
biondizzle	c3841983a0	fix: SF remap uses cute::cosize() instead of cute::size() The comment explicitly warned about this: allocation uses cosize (physical size including tile padding) but the iteration bound used size (logical size). This meant padding positions in the CUTLASS SF layout were never written, leaving them as zero instead of their actual SF values. With uniform data (all-ones), all SF values are the same so the bug was invisible. With random data, different SF values are needed at different positions and the missing writes corrupt the result.	2026-05-15 18:52:23 +00:00
biondizzle	67dcfa83f5	test: random data at small dims + alpha sweep	2026-05-15 18:51:52 +00:00
biondizzle	60f7f60818	test: ultra-minimal GEMM with all-ones	2026-05-15 18:51:31 +00:00
biondizzle	363dd893f0	test: dimension sweep to isolate GEMM bug	2026-05-15 18:51:09 +00:00
biondizzle	fee5a97ebb	fix: cosine_similarity dim for M>0	2026-05-15 18:50:45 +00:00
biondizzle	f9330a1777	test: standalone M=1 GEMM test with deterministic data	2026-05-15 18:47:26 +00:00
biondizzle	1b63a46168	docs: update DEBUG_LOG with cosine≈0 finding + new hypotheses	2026-05-15 18:35:00 +00:00
biondizzle	773967452f	debug: fix gs scalar conversion + add traceback	2026-05-15 18:27:44 +00:00
biondizzle	df916b87eb	debug: fix gs.item() for multi-element tensor	2026-05-15 18:09:41 +00:00
biondizzle	755f9ad567	debug: fix per_expert_alpha ref + clean up BF16 reference scaling	2026-05-15 17:55:11 +00:00
biondizzle	de8acc7965	debug: dump raw GEMM inputs + first 8 output values	2026-05-15 17:02:40 +00:00
biondizzle	9159cb6bb3	docs: add debug log — current state, hypotheses, fixes	2026-05-15 15:48:57 +00:00
biondizzle	2fd55a94c6	fix: weight reshape bug + igs double-count in BF16 reference	2026-05-15 15:46:16 +00:00
biondizzle	c421a668f3	debug: BF16 reference GEMM + cosine comparison for L1	2026-05-15 14:16:24 +00:00
biondizzle	995589ac8a	debug: add FP4 quantization round-trip diagnostic	2026-05-15 13:41:09 +00:00
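
The coordinate arithmetic quoted in commits 5ff1b9e401 and 30b6c89424 can be checked in isolation. The following is a minimal Python sketch of just that index split and its inverse; the function names are hypothetical and do not come from the repository, and it does not compute any physical offset (no strides are applied). Per commit 6fc8fa61e0, the kernel ended up passing a flat coordinate to layout_sf and letting CuTe perform this decomposition, so this is only an illustration of the numbers in the messages.

```python
# Sketch of the SF coordinate split described in the commit messages.
# Names are hypothetical; only the arithmetic is taken from the log.

def decompose_sf_coord(mn: int, k_sf: int):
    """Split flat (mn, k_sf) into the nested pieces named in 5ff1b9e401."""
    m0 = mn % 32           # position inside a 32-row group
    m1 = (mn // 32) % 4    # which of the 4 groups inside a 128-row tile
    mt = mn // 128         # 128-row tile index
    k0 = 0                 # stride-0 inner K mode, carries no information
    k1 = k_sf % 4          # sub-K index inside a group of 4 scale factors
    kt = k_sf // 4         # K tile index
    return (m0, m1, mt), (k0, k1, kt)


def recompose_sf_coord(m, k):
    """Inverse, as written in 30b6c89424: mn = f0 + 32*f1 + 128*f2, k_sf = f4 + 4*f5."""
    m0, m1, mt = m
    _k0, k1, kt = k        # the stride-0 mode is ignored
    return m0 + 32 * m1 + 128 * mt, k1 + 4 * kt


# Roundtrip over a small range of coordinates.
for mn in range(512):
    for k_sf in range(64):
        assert recompose_sf_coord(*decompose_sf_coord(mn, k_sf)) == (mn, k_sf)
```

A roundtrip like this only shows that the two formulas agree with each other; it says nothing about whether the destination buffer was written correctly, which is what the verifier in aa209ddd21 checks.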


57 Commits
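
The amax/(6*448) normalization mentioned in 79b9becf9c can be illustrated with plain Python. This is a sketch under assumptions, not the repository's stage_activation(): the helper names, the block size of 16 (suggested by the k_sf*16 indexing in the log), the clamp to 448, and the made-up factor of 10 are all illustrative choices. The constants 6 and 448 are the largest magnitudes representable in FP4 E2M1 and FP8 E4M3.

```python
# Illustration of dynamic global-scale normalization for FP4 block scaling.
# Helper names and the block size are assumptions made for this sketch.

FP4_E2M1_MAX = 6.0     # largest FP4 (E2M1) magnitude
FP8_E4M3_MAX = 448.0   # largest FP8 (E4M3) magnitude


def dynamic_global_scale(values):
    """Global scale computed from the data: amax / (6 * 448)."""
    amax = max(abs(v) for v in values)
    return amax / (FP4_E2M1_MAX * FP8_E4M3_MAX)


def block_scales(values, global_scale, block=16):
    """Per-block scale factors, clamped to the FP8 E4M3 maximum."""
    scales = []
    for i in range(0, len(values), block):
        blk_amax = max(abs(v) for v in values[i:i + block])
        raw = blk_amax / (FP4_E2M1_MAX * global_scale)
        scales.append(min(raw, FP8_E4M3_MAX))
    return scales


activations = [0.5] * 16 + [3.0] * 16

# With the dynamic scale, only the block holding the global amax reaches 448.
dyn = block_scales(activations, dynamic_global_scale(activations))

# With a global scale that is too small (arbitrary factor of 10 here),
# every block scale hits the clamp.
sat = block_scales(activations, dynamic_global_scale(activations) / 10.0)
```

In this toy setup a mismatched global scale pushes every block scale to the clamp, which is one way the "all block scales = 448" symptom could arise; the log itself does not say that this is the exact mechanism.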