nvfp4-megamoe-kernel

Author	SHA1	Message	Date
biondizzle	efd7a2c56d	test: B matrix weight layout verification via one-hot A	2026-05-15 23:52:00 +00:00
biondizzle	bb5a1ba4c8	cleanup: remove unused slot_token from nvfp4_moe_l2 L2 input is already slot-major, so slot_token was accepted but never passed to the GEMM. Made it explicit by removing the parameter.	2026-05-15 23:50:39 +00:00
biondizzle	887360281e	docs: major update — SF remap verified correct, BF16 ref is the red herring Key finding: the 0.2 cosine was always a wrong reference, not a wrong GEMM. Proof: uniform FP4+SF produces mathematically exact output, and the roundtrip SF verifier passes with 0 errors. Do NOT re-investigate SF remap.	2026-05-15 23:38:34 +00:00
biondizzle	eb26d291cb	test: uniform FP4 + uniform SF sanity check	2026-05-15 23:36:08 +00:00
biondizzle	1f09b51168	test: check SF signed vs unsigned interpretation	2026-05-15 23:35:06 +00:00
biondizzle	4f857d5f99	docs: major DEBUG_LOG update — forward mapping, verifier, full debug timeline	2026-05-15 23:02:30 +00:00
biondizzle	aa209ddd21	debug: add SF remap roundtrip verifier Checks that forward remap wrote the correct bytes by comparing src[mn*stride_mn + k_sf*stride_ksf] against dst[layout_sf(make_coord(mn, k_sf*16, 0))]. Prints error count for SFA and SFB on first GEMM call.	2026-05-15 22:59:44 +00:00
biondizzle	6626b75a2f	fix: use filter_zeros for SF allocation + no-branch forward mapping - Allocation: cute::size(cute::filter_zeros(layout)) matches CUTLASS examples - Kernel: layout_sf(make_coord(mn, k_sf*16, 0)) — no branching on LayoutRank - Avoids silent fallthrough that wrote dst[0] for all threads	2026-05-15 22:58:51 +00:00
biondizzle	6fc8fa61e0	fix: use flat logical coordinate layout_sf(make_coord(mn, k_elem, 0)) CuTe maps compatible flat coordinates into the natural hierarchical coordinate before applying strides. No manual decomposition needed. k_elem = k_sf * 16 (logical K element, not compact SF index).	2026-05-15 22:53:57 +00:00
biondizzle	a48717ccf5	fix: remove duplicate dst_idx declaration	2026-05-15 22:31:05 +00:00
biondizzle	5ff1b9e401	fix: use hierarchical coordinates for layout_sf forward mapping Flat make_coord(mn, k*16) doesn't decompose into the nested atom shape. Must manually decompose: mn -> (m0, m1, mt) where m0=mn%32, m1=(mn/32)%4, mt=mn/128 k_sf -> (k0, k1, kt) where k0=0 (stride-0), k1=k_sf%4, kt=k_sf/4	2026-05-15 22:11:14 +00:00
biondizzle	3b4a7b591f	test: verify forward mapping with prepack vs live SFB	2026-05-15 22:09:56 +00:00
biondizzle	a1fd4d6233	revert: back to layout_sf(make_coord(...)) — crd2idx was unnecessary	2026-05-15 21:55:00 +00:00
biondizzle	ea678ece64	fix: remove duplicate template declaration	2026-05-15 21:54:10 +00:00
biondizzle	59dad8e2fb	fix: use crd2idx instead of layout operator() for SF forward mapping	2026-05-15 21:52:02 +00:00
biondizzle	a09d8e477e	fix: remove static_assert in constexpr else (build fix)	2026-05-15 21:27:27 +00:00
biondizzle	7285331395	fix: replace col_major_src with explicit source strides SFA: src_stride_mn=K_sf, src_stride_ksf=1 (row-major M, K_sf) SFB: src_stride_mn=1, src_stride_ksf=N (row-major K_sf, N after transpose) Removes ambiguity about physical memory layout. The source indexing now uses mn*src_stride_mn + k_sf*src_stride_ksf which works for any contiguous or transposed layout.	2026-05-15 21:23:21 +00:00
biondizzle	f6fd549800	fix: restore col_major_src handling for SFB source layout SFB scales arrive as (K_sf, N) row-major after transpose+contiguous in weight_transform.py. The col_major_src flag correctly describes this. Don't assume both sources are (MN, K_sf).	2026-05-15 21:19:58 +00:00
biondizzle	63e67e1025	fix: rewrite SF remap as forward mapping (source→dst) - Iterate over source indices (MN * K_sf) instead of dst indices - Use layout_sf forward mapping: layout_sf(make_coord(mn, k_sf*16)) - No more idx2crd reverse extraction or stride-0 ambiguity - Cleaner, less error-prone, blog-compatible	2026-05-15 20:51:30 +00:00
biondizzle	30b6c89424	fix: correct SF remap coordinate extraction - First flattened group IS M/N (not K as previously assumed) - mn = f0 + 32*f1 + 128*f2 - k_sf = f4 + 4*f5 (f3 is stride-0 inner K, ignored) - The atom stride-0 dimension (f3) maps to offset 0, not a meaningful K sub-index. The actual k_sf comes from f4 (sub_k) + f5*4 (tile_k) - Original code had group assignment right but k_sf extraction wrong	2026-05-15 20:44:46 +00:00
biondizzle	ff5a0843dc	fix: divide K element index by SFVecSize to get k_sf Based on veitner bearblog analysis of CUTLASS SF layout: - Shape is ((32,4,K_tiles), (SFVecSize,4,M_tiles)) for SFA - get<0..2> covers K dimension, get<3..5> covers M dimension - k_sf = K_element_index / SFVecSize	2026-05-15 20:17:24 +00:00
biondizzle	a09b9b53a3	cleanup: remove printf and diag function from CUDA kernel (build fix)	2026-05-15 20:11:40 +00:00
biondizzle	e7c3341317	docs: update DEBUG_LOG with M/K swap root cause	2026-05-15 20:03:20 +00:00
biondizzle	deb6b3231a	debug: swap M/K in SF remap + add printf diagnostics	2026-05-15 20:01:47 +00:00
biondizzle	22f0457ccf	test: isolate SFA vs SFB remap bug	2026-05-15 19:59:39 +00:00
biondizzle	9eaf6d07e8	test: quick random test	2026-05-15 19:58:57 +00:00
biondizzle	fa7b394571	docs: update DEBUG_LOG with root cause (size→cosize) and full debug timeline	2026-05-15 18:56:09 +00:00
biondizzle	c3841983a0	fix: SF remap uses cute::cosize() instead of cute::size() The comment explicitly warned about this: allocation uses cosize (physical size including tile padding) but the iteration bound used size (logical size). This meant padding positions in the CUTLASS SF layout were never written, leaving them as zero instead of their actual SF values. With uniform data (all-ones), all SF values are the same so the bug was invisible. With random data, different SF values are needed at different positions and the missing writes corrupt the result.	2026-05-15 18:52:23 +00:00
biondizzle	67dcfa83f5	test: random data at small dims + alpha sweep	2026-05-15 18:51:52 +00:00
biondizzle	60f7f60818	test: ultra-minimal GEMM with all-ones	2026-05-15 18:51:31 +00:00
biondizzle	363dd893f0	test: dimension sweep to isolate GEMM bug	2026-05-15 18:51:09 +00:00
biondizzle	fee5a97ebb	fix: cosine_similarity dim for M>0	2026-05-15 18:50:45 +00:00
biondizzle	f9330a1777	test: standalone M=1 GEMM test with deterministic data	2026-05-15 18:47:26 +00:00
biondizzle	1b63a46168	docs: update DEBUG_LOG with cosine≈0 finding + new hypotheses	2026-05-15 18:35:00 +00:00
biondizzle	773967452f	debug: fix gs scalar conversion + add traceback	2026-05-15 18:27:44 +00:00
biondizzle	df916b87eb	debug: fix gs.item() for multi-element tensor	2026-05-15 18:09:41 +00:00
biondizzle	755f9ad567	debug: fix per_expert_alpha ref + clean up BF16 reference scaling	2026-05-15 17:55:11 +00:00
biondizzle	de8acc7965	debug: dump raw GEMM inputs + first 8 output values	2026-05-15 17:02:40 +00:00
biondizzle	9159cb6bb3	docs: add debug log — current state, hypotheses, fixes	2026-05-15 15:48:57 +00:00
biondizzle	2fd55a94c6	fix: weight reshape bug + igs double-count in BF16 reference	2026-05-15 15:46:16 +00:00
biondizzle	c421a668f3	debug: BF16 reference GEMM + cosine comparison for L1	2026-05-15 14:16:24 +00:00
biondizzle	995589ac8a	debug: add FP4 quantization round-trip diagnostic	2026-05-15 13:41:09 +00:00
biondizzle	d0ed3d84a8	debug: add L2, SiLU, and scatter pipeline prints	2026-05-15 13:21:25 +00:00
biondizzle	da5572f497	clean: remove diagnostic scripts from repo	2026-05-15 12:50:14 +00:00
biondizzle	fd59222fc0	fix: stop folding global scale into float8 block scales The fold block_sf (float8) * global_sf (float32) -> float8 loses ~25% precision. Product of ~56-448 block_sf * ~4.65e-05 global_sf lands in float8 low-precision zone where step size is 25%. This makes model output garbage despite finite values. Fix: keep block scales as original float8, return global scales separately as float32 per-expert vectors. Apply global scale as per-expert GEMM alpha in cutlass_grouped_nvfp4_gemm (already iterates per-expert). For L1 with separate gate/up global scales, use gate_gs as alpha and apply up_correction ratio to the up half post-GEMM. weight_transform.py: no more _fold_global_scale, returns (w, sf, global_sf) nvfp4_mega_moe.py: per-expert alpha = activation_gs * weight_gs kernel.py: per_expert_alpha parameter in grouped GEMM deepseek_v4.py: updated type hints and comments	2026-05-15 12:42:53 +00:00
biondizzle	56e62e916d	revert: idx2crd remap approach — source-first needs hierarchical coords cute::crd2idx requires hierarchical coordinates matching the layout's nested shape, which we don't have from flat (m, k_sf). Reverted to idx2crd dest-first approach. The real bug was cute::size vs cute::cosize for allocation, not the remap direction.	2026-05-15 11:44:38 +00:00
biondizzle	d5949a23b4	fix: use cute::crd2idx for SF remap — layout_sf() not directly callable CuTe Layout objects with hierarchical shapes can't be called directly with flat (m, k_sf). Use cute::crd2idx(make_coord(m, k_sf), layout_sf) to convert logical coordinates to physical indices.	2026-05-15 11:39:57 +00:00
biondizzle	9908fd64d9	feat: CUTLASS NVFP4 mega_moe kernel — slot-based L1/L2, source-first SF remap Major changes from initial TileLang prototype: Kernel: - CUTLASS NVFP4 block-scaled GEMM (SM100 Blackwell, OpClassBlockScaledTensorOp) - Slot-based dispatch: L1 GEMM → SiLU+Mul per-slot → L2 GEMM → index_add scatter - 1D slot_expert_ids passed to both L1 and L2 (no 2D topk_ids rebuild) - slot_token gathered in cutlass_grouped_nvfp4_gemm when provided SF Remap (source-first): - Iterates logical (m, k_sf) source grid, uses layout_sf(make_coord(m, k_sf)) for CUTLASS dest index — no idx2crd/flatten coordinate extraction - 2D kernel launch: dim3 block(32,8), grid over (K_sf, MN) - Uses cute::cosize() for physical allocation size (not cute::size) - SFA: (MN, K_sf) row-major; SFB: (K_sf, MN) row-major (col-major) Weight transform: - UE4M3 unpack with bit reinterpret (not value cast) - Global scale folding (weight_scale_2) for gate/up split - clamp(0,448) → float8_e4m3fn, transpose (N,K)→(K,N) for CUTLASS No prepack cache: - SFB remapped per-call inside CUTLASS (~µs, not the bottleneck) - See README for why prepack cache must never return (OOM, CUDA graphs, M-dependent layout, cross-layer collisions) Stage activation: - Nearest-neighbor E2M1 quantization (no clamp, no uniform steps) - Per-tensor global scale → alpha for L2 GEMM Bug fixes: - _fold_global_scale: removed broken logical_widths branch - unpack_ue4m3_u32: int32 for CUDA bitwise, view not to, ND support - Correct expert param mapping for NVFP4 checkpoint - SiLU applied per-slot (not after summing expert paths)	2026-05-15 11:38:18 +00:00
biondizzle	c2b752c2fe	Initial: TileLang NVFP4 mega_moe kernel package - nvfp4_mega_moe_full: drop-in replacement for deep_gemm.mega.fp8_nvfp4_mega_moe - transform_nvfp4_weights_for_mega_moe: weight transformation (tested) - SymmBuffer + get_symm_buffer_for_nvfp4_mega_moe: API-matching stubs - MEGA_MOE_STATIC=1 support for pipeline testing - pyproject.toml for pip install	2026-05-13 15:44:51 +00:00
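
The hierarchical decomposition written out in commit 5ff1b9e401, and the matching recomposition formulas from commit 30b6c89424, can be sketched in plain Python. This is only an illustration of the index arithmetic quoted in those messages: the function names `decompose_sf_coord` and `recompose_sf_coord` are hypothetical and do not appear in the repository, and the sketch does not compute a physical offset in the CUTLASS scale-factor layout. Per commit 6fc8fa61e0, the kernel ended up passing a flat coordinate to `layout_sf` and letting CuTe do this decomposition itself.

```python
def decompose_sf_coord(mn, k_sf):
    """Split flat (mn, k_sf) into the nested parts named in commit 5ff1b9e401.

    Hypothetical helper for illustration only; not part of the repository.
    """
    m0 = mn % 32          # position inside a 32-row group
    m1 = (mn // 32) % 4   # which of the 4 groups inside a 128-row tile
    mt = mn // 128        # 128-row tile index
    k0 = 0                # stride-0 inner K mode, carries no information
    k1 = k_sf % 4         # scale-factor column inside a 4-wide tile
    kt = k_sf // 4        # K tile index
    return (m0, m1, mt), (k0, k1, kt)


def recompose_sf_coord(m_parts, k_parts):
    """Inverse mapping: mn = f0 + 32*f1 + 128*f2, k_sf = f4 + 4*f5."""
    m0, m1, mt = m_parts
    _k0, k1, kt = k_parts  # the stride-0 part is ignored
    return m0 + 32 * m1 + 128 * mt, k1 + 4 * kt


if __name__ == "__main__":
    parts = decompose_sf_coord(200, 9)
    print(parts)
    print(recompose_sf_coord(*parts))
```

A roundtrip over every `(mn, k_sf)` pair in a small grid is a cheap way to check that the two directions agree before comparing against the device-side verifier.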

49 Commits
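
The precision loss described in commit fd59222fc0 (folding a float32 global scale into a float8 block scale) can be reproduced with a small pure-Python model of float8 E4M3 rounding. This is a rough sketch under stated assumptions: `quantize_e4m3` is a hypothetical helper that models round-to-nearest for non-negative values with 3 mantissa bits, a minimum normal exponent of -6, and saturation at 448. It is not the conversion the repository uses. The sample values 56 and 4.65e-05 are taken from the commit message.

```python
import math


def quantize_e4m3(x):
    """Round a non-negative float to the nearest float8 E4M3 value.

    Hypothetical model for illustration: 3 mantissa bits, minimum normal
    exponent -6 (subnormals below it), saturating at 448.
    """
    if x <= 0.0:
        return 0.0
    x = min(x, 448.0)
    _, e = math.frexp(x)           # x = m * 2**e with m in [0.5, 1)
    exponent = max(e - 1, -6)      # subnormals share the exponent -6
    step = 2.0 ** (exponent - 3)   # spacing between neighbouring values
    return min(round(x / step) * step, 448.0)


def relative_error(approx, exact):
    return abs(approx - exact) / exact


block_sf = 56.0      # low end of the block-scale range quoted in the commit
global_sf = 4.65e-05 # global scale quoted in the commit
exact = block_sf * global_sf

# Folded: the product is stored as a single float8 value.
folded = quantize_e4m3(exact)

# Separate: the block scale stays float8, the global scale stays float32.
separate = quantize_e4m3(block_sf) * global_sf

if __name__ == "__main__":
    print("folded error:  ", relative_error(folded, exact))
    print("separate error:", relative_error(separate, exact))
```

In this model the folded product falls in the subnormal range, where neighbouring values are 2^-9 apart, so the rounding error is a large fraction of the value itself. Keeping the global scale in float32 and applying it as a per-expert GEMM alpha, as the commit does, avoids that rounding step.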