nvfp4-megamoe-kernel

Author	SHA1	Message	Date
biondizzle	a8bd962452	fix: SF remap — iterate dest indices, extract logical (m, k_sf) from nested coord The forward-map approach (src -> layout_sf(m, k)) failed because CuTe's layout operator requires coordinates matching the nested shape rank, and passing flat (int, int) to a ((32,4),K) shape triggers Mismatched Ranks. New approach: iterate over CUTLASS dest indices, use idx2crd to get the hierarchical coordinate, flatten it, then extract logical (m, k_sf) by interpreting the flattened sub-coordinates correctly: flat[0..2] = (inner_M, sub_M, tile_M) -> m = tile_M128 + inner_M4 + sub_M flat[3..5] = (inner_K, sub_K, tile_K) -> k_sf = tile_K*4 + sub_K (inner_K is within one SF group — same byte, so ignored for k_sf) Previous bug: get<0> and get<1> of flatten gave (inner_M, sub_M) — both M sub-indices. K information was never extracted, so only k_group=0 worked. Dest buffer is zero-initialized so padding slots (where m >= MN or k_sf >= K_sf) stay zero.	2026-05-14 15:01:47 +00:00
biondizzle	395cc31883	fix: use layout_sf(m, k_elem) instead of make_coord for nested shapes make_coord(m, k_elem) produces rank-2 coord, but tile_to_shape creates nested shapes like ((32,4,tiles_m), (16,4,tiles_k)) which expect matching nested coords. layout_sf(m, k) operator handles hierarchical projection automatically.	2026-05-14 14:57:13 +00:00
biondizzle	d90967d6e9	fix: SF remap — element-space K coords + zero-init dest buffer Two fixes: 1. CuTe layout uses element-space K, not group-space. k_group=3 with SFVecSize=16 maps to k_elem=48 in the layout, not k=3. Added SFVecSize param to remap kernel, multiply k_sf * SFVecSize before passing to layout_sf(). 2. Zero-init CUTLASS dest buffer before remap. The layout pads to tile boundaries (128x64), so dest is larger than MK_sf. Unmapped padding slots reading garbage causes sporadic wrong results. Also fixed grid size to use source count (MK_sf), not dest size.	2026-05-14 14:54:18 +00:00
biondizzle	5968ebad9f	fix: SF remap was using idx2crd+flatten which gives atom sub-indices, not logical (m,k) The remap kernel iterated over CUTLASS linear indices and tried to reverse-map with idx2crd + flatten. But flatten() on the nested CuTe coordinate (from tile_to_shape(SfAtom{}, ...)) gives atom-level sub-indices, not logical (m, k). This caused all K-groups > 0 in SFA to map to m*K_sf+0, losing K-group information entirely. Proof: setting SFA[0,0]=2.0 changed row 0, but SFA[0,3]=2.0 produced zero change. Only K-group 0 was being read. Fix: iterate over SOURCE indices (row-major m, k) and use the CuTe layout forward: layout_sf(make_coord(m, k)) -> CUTLASS dst index. This is the correct forward direction that CuTe handles natively. Constant-scale test (all SF=1.0) gave cosine=1.0, confirming the FP4 data path is correct. The bug was purely in the SF remap.	2026-05-14 14:51:02 +00:00
biondizzle	cf796e37cf	debug: add weight_scale_2 shape/value logging in weight transform	2026-05-14 14:19:35 +00:00
biondizzle	879adc324d	fix: _fold_global_scale — remove broken logical_widths branch The logical_widths branch took expert 0 and 1's global scales and applied them to ALL experts. For L1 with logical_widths=[3072,3072], every expert got expert-0's scale on its gate half and expert-1's scale on its up half. All other experts' global scales were discarded. The else branch correctly broadcasts each expert's own (E,1) global scale across (E, N, K//16). Removed the dead logical_widths code.	2026-05-14 14:17:44 +00:00
biondizzle	ef9cd023a9	fix: unpack_ue4m3_u32 — uint32 lacks CUDA bitwise ops, use int32 PyTorch doesn't implement bitwise_and/shift for UInt32 on CUDA. Cast to int32 first, then extract bytes, then uint8 → view float8.	2026-05-14 13:44:42 +00:00
biondizzle	1c39e21d87	fix: remove broken L1 weight interleave The interleave assumed gate/up were pre-interleaved in groups of 16 and that we needed 2CTA UMMA layout. Both wrong: 1. vLLM w13_weight is plain concat [gate; up] along output dim 2. Our CUTLASS kernel uses ClusterShape 1x1x1, not 2CTA The interleave was shuffling weights into nonsense, making L1 GEMM compute the wrong thing, and chunk(2) would split wrong halves.	2026-05-14 13:05:45 +00:00
biondizzle	80495c0cd6	docs: clarify SF layout remap is in CUDA, not sf_layout.py sf_layout.py was a no-op (return sf) but the actual remap happens in remap_sf_to_cutlass_kernel in cutlass_nvfp4_gemm.cu. Updated sf_layout.py to pure reference docs so nobody gets confused again.	2026-05-14 13:04:31 +00:00
biondizzle	16f91ff0e1	fix: rewrite stage_activation with proper E2M1 quantization Three bugs fixed: 1. clamp(0,15) was destroying sign bits — E2M1 is sign-magnitude 4-bit nibbles, not unsigned. Half the activation was zeroed. 2. Scale stored block_max but divided by block_max/6, so stored scale was 6× too large. Now correctly stores block_max/6 (the actual dequant factor). 3. Uniform 0.5 step doesn't match E2M1 values {0,0.5,1,1.5,2,3,4,6}. Now snaps to nearest E2M1 representable magnitude. New _quantize_to_e2m1 helper handles all three correctly: - Sign-magnitude 4-bit nibble packing (bit3=sign, bits2:0=mag index) - Correct block scale (block_max / 6.0) - Nearest-neighbor to actual E2M1 values	2026-05-14 13:02:10 +00:00
biondizzle	3bcc0ac057	fix: unpack_ue4m3_u32 was value-casting instead of bit-reinterpreting Byte 0x3F was becoming float8(63.0) instead of the float8 whose bit pattern IS 0x3F (~0.984). Pack uses .view() (correct), unpack used .to() (wrong) — they were not inverses. This corrupted every activation scale fed to the L1 GEMM while weight scales were fine.	2026-05-14 12:59:20 +00:00
biondizzle	d3f35c9465	cleanup: remove abandoned TileLang and Mojo files - Deleted: layout.mojo, mega_moe.mojo, quantize.mojo (Mojo attempt) - Deleted: nvfp4_blockscaled_gemm.py, staging.py, nvfp4_mega_moe.py (TileLang top-level) - Deleted: tilelang_nvfp4_gemm.py, tilelang_kernels.py, nvfp4_dequant.py (TileLang package) - Deleted: src/weight_transform.py (duplicate of package version) - Fixed nvfp4_mega_moe.py: inlined unpack_ue4m3_u32, removed TileLang fallback imports - Fixed weight_transform.py: renamed function, removed TileLang alias, updated docs - Fixed __init__.py: removed TileLang alias, updated docstring - CUTLASS is the only kernel path now	2026-05-14 12:44:47 +00:00
biondizzle	802c4ee12c	Revert stage_activation to simple quantize (staging kernel API incompatible with L1 output dims)	2026-05-14 12:14:01 +00:00
biondizzle	69e0174792	Fix stage_activation: use Triton staging kernel instead of broken simple quantize	2026-05-14 12:01:34 +00:00
biondizzle	c016e66e23	Add CUDA sync + NaN/Inf check after each expert GEMM in grouped kernel	2026-05-14 11:27:58 +00:00
biondizzle	904fc37ad8	Fix: use idx2crd instead of get_coord for CuTe layout coordinate lookup	2026-05-14 10:50:26 +00:00
biondizzle	494d30b6ab	Fix: use CuTe get_coord for proper scale factor remap to CUTLASS interleaved layout	2026-05-14 10:48:58 +00:00
biondizzle	869151d211	Fix kernel.py: remove broken expand on scale factors (was expanding sf to weight size)	2026-05-14 10:36:16 +00:00
biondizzle	84becfac93	Test: pass scales directly to CUTLASS (no remap) to diagnose layout issue	2026-05-14 10:23:02 +00:00
biondizzle	a272bc49b0	Fix: torch::kBFloat16	2026-05-14 10:21:10 +00:00
biondizzle	3f62e49e6e	Fix PyTorch API: use c10::cuda and at::kBF16	2026-05-14 10:20:00 +00:00
biondizzle	2ee4e26772	Fix: remove compile-time SM100 guard from pytorch binding, use runtime check instead	2026-05-14 10:18:36 +00:00
biondizzle	540e68593f	Add scale factor remap kernel: remap simple row-major SFs to CUTLASS interleaved layout	2026-05-14 10:05:38 +00:00
biondizzle	2998c889e7	Implement simple FP4 quantization for L1→L2 re-quant step (no vLLM fp4_utils dependency)	2026-05-14 09:50:52 +00:00
biondizzle	98913c9b1a	Fix stage_activation: use Triton staging kernel from vLLM patch instead of fp4_utils	2026-05-14 09:38:50 +00:00
biondizzle	25cbc85afe	Replace kernel.py with thin wrapper around pre-compiled _C extension	2026-05-14 09:25:56 +00:00
biondizzle	33e5d67326	Add CUTLASS_CHECK macro	2026-05-13 23:28:03 +00:00
biondizzle	b7c5cba407	Fix device_memory include path	2026-05-13 23:27:06 +00:00
biondizzle	3299d22ad6	Fix type casts and includes for CUTLASS NVFP4 GEMM	2026-05-13 23:26:18 +00:00
biondizzle	1eb9c43217	Rewrite CUTLASS kernel based on NVIDIA example 72b (nv_float4_t, CollectiveBuilder, OpClassBlockScaledTensorOp)	2026-05-13 23:25:20 +00:00
biondizzle	8a9af441dc	Fix includes: use cutlass/float_subbyte.h (has float_e2m1_t and float_ue4m3_t), point to latest CUTLASS	2026-05-13 23:23:01 +00:00
biondizzle	d789f5e3e0	Add CCCL include path for CUTLASS 3.x	2026-05-13 23:18:26 +00:00
biondizzle	12588047fd	Fix setup.py: use include_dirs and extra_compile_args (correct PyTorch extension API)	2026-05-13 23:17:30 +00:00
biondizzle	1b1c3a42fe	Fix setup.py source paths	2026-05-13 23:14:05 +00:00
biondizzle	f375c80bfe	feat: CUTLASS NVFP4 block-scaled GEMM kernel (native SM100 Blackwell) - Native NVFP4 block-scaled MMA using CUTLASS MainloopSm100TmaUmmaWarpSpecializedBlockScaled - Invokes mxf8f6f4.block_scale tensor core instructions (tcgen05.mma) - E2M1 (packed int8) + UE4M3 (float8_e4m3fn) block-16 scales → BF16 output - No dequantization: hardware block-scaled MMA avoids costly dequantize+BF16 path - PyTorch CUDA extension with CollectiveBuilder auto-deduction - Grouped expert GEMM for MoE dispatch (32 experts/rank, top-6 routing) - Integrated into nvfp4_mega_moe.py as primary path with TileLang fallback - Standalone C API (cutlass_nvfp4_gemm.cu) for direct B200 compilation - Build script, setup.py, and test script for B200 deployment Files: cutlass_nvfp4_gemm/ — Kernel source, PyTorch binding, build/test scripts nvfp4_mega_moe.py — Updated to use CUTLASS kernel when available	2026-05-13 23:11:15 +00:00
biondizzle	56c7880296	Native NVFP4 TileLang kernel: tcgen05 block-scaled MMA	2026-05-13 23:02:06 +00:00
biondizzle	bf13665dbe	Implement TileLang NVFP4 mega_moe L1/L2 kernels - nvfp4_mega_moe_l1: L1 GEMM (gate_up_proj) with FP4 dequant → BF16 GEMM - nvfp4_mega_moe_l2: L2 GEMM (down_proj) with FP4 dequant → BF16 GEMM - nvfp4_dequant.py: E2M1 packed → BF16 with UE4M3 block16 scales - tilelang_kernels.py: Grouped expert GEMM with TileLang-compiled BF16 GEMM - Full pipeline: L1 GEMM → SiLU+Mul → re-quantize → L2 GEMM → output - MEGA_MOE_STATIC=1 bypass still works for pipeline testing Current approach: dequantize FP4→BF16 then run BF16 GEMM via TileLang T.gemm (auto-lowers to tcgen05 on Blackwell). Will be upgraded to native FP4 block-scaled MMA (tcgen05.mma kind::mxf8f6f4.block_scale) once TileLang adds E2M1+UE4M3 support.	2026-05-13 22:36:58 +00:00
biondizzle	ebc0ab0cac	Fix: keep scales as float8_e4m3fn, don't pack to uint32 (min_all_cuda unsupported)	2026-05-13 21:54:39 +00:00
biondizzle	94233c4dd3	Fix __init__.py: remove private imports	2026-05-13 21:43:47 +00:00
biondizzle	1a452ffabd	Fix weight_transform signature to match nightly vLLM finalize_weights call	2026-05-13 21:41:43 +00:00
biondizzle	47ca5631d8	Fix __init__.py: only import from package modules	2026-05-13 16:08:37 +00:00
biondizzle	c2b752c2fe	Initial: TileLang NVFP4 mega_moe kernel package - nvfp4_mega_moe_full: drop-in replacement for deep_gemm.mega.fp8_nvfp4_mega_moe - transform_nvfp4_weights_for_mega_moe: weight transformation (tested) - SymmBuffer + get_symm_buffer_for_nvfp4_mega_moe: API-matching stubs - MEGA_MOE_STATIC=1 support for pipeline testing - pyproject.toml for pip install	2026-05-13 15:44:51 +00:00

42 Commits