DeepGEMM

Author	SHA1	Message	Date
biondizzle	b0094175a2	fix: restore elem_size declaration for TMA desc build	2026-05-12 17:40:25 +00:00
biondizzle	48b5b2b702	fix: TMA dimensions for packed FP4 must be in individual FP4 values (not bytes) CUDA docs: 'Dimension for the packed data types must reflect the number of individual U# values.' For 16U4_ALIGN8B, gmem/smem inner dims must be FP4 value counts, not byte counts. Double the byte-oriented dimensions passed by callers. gmem_outer_stride stays in bytes.	2026-05-12 17:39:07 +00:00
biondizzle	75f1c8544b	fix: remove smem_inner_dim doubling for packed FP4 TMA — must match MMA row width (BLOCK_K/2)	2026-05-12 17:14:44 +00:00
biondizzle	b95f9eb446	revert: remove SMEM warp transpose (deadlock in elect_one_sync, not needed with transform_sf_token_idx)	2026-05-12 17:11:19 +00:00
biondizzle	54a7de03a0	fix: add UTCCP SMEM warp transpose for NVFP4 packed UE4M3 scales	2026-05-12 16:48:06 +00:00
biondizzle	8a53228745	fix: no GPU tensor ops in crash handler (CUDA is broken after 715)	2026-05-12 16:20:11 +00:00
biondizzle	9115f83afb	debug: try/catch around mega_moe kernel with data diagnostics on crash	2026-05-12 16:05:55 +00:00
biondizzle	758389645a	fix: contiguous copy for SF byte view sanity check	2026-05-12 15:44:50 +00:00
biondizzle	cc3e3da45c	debug: check for zero/NaN/Inf in weight SF values	2026-05-12 15:30:38 +00:00
biondizzle	94b30dc2bc	revert: block_n/4 was correct (SwiGLU halving × FP4 packing)	2026-05-12 15:04:23 +00:00
biondizzle	c71fb97687	fix: L1 output TMA smem_inner_dim was block_n/4, should be block_n/2 Packed E2M1 output has 2 elements per byte, so block_n elements = block_n/2 bytes. block_n/4 was under-sizing the TMA SMEM row by 2x → OOB write → LAUNCH_FAILED.	2026-05-12 14:58:11 +00:00
biondizzle	d8ae7a3225	debug: print SF shape/strides before interleave	2026-05-12 14:31:41 +00:00
biondizzle	e498a2c729	fix: single transpose back to MN-major, don't double-transpose The .contiguous().transpose() dance was swapping dims back. A single transpose from (g,k,mn) gives (g,mn,k) with stride(-2)=1, which is exactly the MN-major layout TMA expects.	2026-05-12 14:23:02 +00:00
biondizzle	916f03d528	debug: add transform output shape/stride prints	2026-05-12 14:22:05 +00:00
biondizzle	1f13b24354	debug: add strides to SF debug prints	2026-05-12 14:11:53 +00:00
biondizzle	bfe612969b	fix: preserve MN-major layout when interleaving L1 SF tensors _interleave_l1_weights used empty_like+copy_ which destroyed the MN-major stride layout required by TMA. Added interleave_sf_mn_major that works in K-major, interleaves, then transposes back to MN-major.	2026-05-12 14:01:58 +00:00
biondizzle	76220ac6ee	fix: force contiguous on SF tensors before C++ call	2026-05-12 13:48:45 +00:00
biondizzle	bf5bf8d995	fix: unpack weight tuples before printing debug info	2026-05-12 13:28:32 +00:00
biondizzle	5ac151d0a5	debug: print tensor dtypes/shapes at C++ call boundary in fp8_nvfp4_mega_moe	2026-05-12 13:10:32 +00:00
biondizzle	26a8ab75a1	NVFP4: fix SF pipeline — 2 K-cols per BLOCK_K for group=16 - TMA: issue two tma::copy calls per K-block (K_box=1, 2 SF K-columns) - UTCCP: double loop for 2 K-columns, correct SMEM offsets - TMEM: double SFA/SFB column counts (SF_BLOCK_M/32 * 2) - Heuristic: fix smem_size (2× SF, packed FP4 A/B, packed send buffers, no amax) - Staging kernel: fix double-count bug in packed_k_mask	2026-05-12 08:08:17 +00:00
biondizzle	680874d067	NVFP4 L1 epilogue: group_size=16 SF layout - Single amax per warp (16 N-elements = 1 SF group, no warp-pair reduction) - Single sf_val instead of sf.x/sf.y split - All 4 warps write SF (k_idx = n_block_idx*4 + warp_idx_in_wg) - Remove dead SMEM amax storage, reclaim barrier offset space - Remove dead __syncwarp after register-local amax	2026-05-12 07:08:08 +00:00
biondizzle	c0850a6859	Fix weight TMA descriptors: packed E2M1 needs K/2, block_k/2, swizzle/2 Weights are packed E2M1 (2 per byte) but TMA descriptors were using unpacked dimensions — K-dim in elements instead of bytes, 128B swizzle instead of 64B, full block_k instead of block_k/2. This caused OOB reads and swizzle mismatch with the UMMA descriptor, producing illegal instruction traps.	2026-05-12 06:51:39 +00:00
biondizzle	fbfeb54c9a	Fix fold_global_scale: UE4M3 scales use .to(float32), not shift-by-23 Checkpoint stores float8_e4m3fn (standard NVFP4), not UE8M0. The shift-by-23 was misinterpreting E4M3 bytes as E8M0 exponents.	2026-05-12 05:52:33 +00:00
biondizzle	af092fa7ba	fix: double SMEM SF allocation for NVFP4 group=16 + clean stale comments - SMEM_SFA/SFB_SIZE_PER_STAGE doubled: group=16 needs 8 SFs per token per BLOCK_K=128 (vs 4 for group=32) - arrive_and_expect_tx updated to use SMEM_SFA/SFB constants - Removed stale comments about 8/16 TMEM columns	2026-05-11 23:58:07 +00:00
biondizzle	aa97a3f949	fix: correct TMEM column layout for scale_vec::4X UTCCP 4x32dp128bit always writes 4 TMEM cols per 128-element group regardless of 1X vs 4X. The 4X only changes MMA interpretation, not UTCCP column count. Reverted from (*4, stride i*8) to (same as 1X, stride i*4): - kNumSFATmemCols: SF_BLOCK_M/32 (was SF_BLOCK_M/32*4) - kNumSFBTmemCols: SF_BLOCK_N/32 (was SF_BLOCK_N/32*4) - UTCCP stride: i*4 (was i*8)	2026-05-11 23:44:12 +00:00
biondizzle	d6551617c0	fix: 4 kernel compilation fixes for packed FP4 1. sizeof_bits_v→sizeof_bits<T>::value (our CUTLASS lacks C++17 _v form) 2. reinterpret_cast<uint8_t*> for TMA copy and UMMA desc calls (smem_a returns float_e2m1_t* but templates expect uint8_t*) 3. kNumChunks extended to 4 (packed FP4 halved SMEM, need more chunks) 4. No code changes to PatternVisitor — all fixes at call sites	2026-05-11 23:17:51 +00:00
biondizzle	49e5646b42	fix: remove duplicate kInt8 case — kPackedFP4 is already kInt8 kPackedFP4 = torch::kInt8, so the kInt8 case was a duplicate. The real fix was in mega_nvfp4.hpp: changing kUInt8→kInt8 so tensors match the existing kPackedFP4 path in the TMA switch.	2026-05-11 22:55:28 +00:00
biondizzle	80df24a641	fix: add kInt8 dtype support to TMA descriptor + change activation tensors to kInt8 - runtime_utils.hpp: added kInt8 -> CU_TENSOR_MAP_DATA_TYPE_UINT8 mapping - mega_nvfp4.hpp: changed activation tensor dtypes from kUInt8 to kInt8 (same byte layout, but kInt8 is recognized by the TMA dtype switch)	2026-05-11 22:54:47 +00:00
biondizzle	e608a20dec	docs: major README update — packed FP4 SMEM layout, L1 epilogue, TMA descriptors Added detailed documentation of the packed FP4 architecture: - mxf4nvf4 reads packed (2 per byte), NOT unpacked like mxf8f6f4 - SMEM layout: float_e2m1_t, BLOCK_K/2 swizzle, UMMA desc byte math - L1 epilogue: st.shared.u16, no swizzle, kWarpBytesPerRow - Host TMA: hidden/2 K-dim, block_k/2 inner, fp4_unpacked_smem=false - Build history through Build 35	2026-05-11 22:40:09 +00:00
biondizzle	30d72e7ef5	fix: packed FP4 for mxf4nvf4 — correct SMEM layout, UMMA descriptors, L1 epilogue Key changes: - a_dtype_t/b_dtype_t: float_e2m1_t (packed 4-bit) with sizeof_bits_v==4 assert - kSwizzleAMode/BMode: BLOCK_K/2 (64 bytes packed, not 128 unpacked) - SMEM sizes: LOAD_BLOCK_M * BLOCK_K / 2 (packed byte count) - Token layouts: kHidden/2, kIntermediateHidden/2 (packed bytes) - TMA loads: BLOCK_K/2 inner dim, uint8_t, byte offsets k_block_idx*(BLOCK_K/2) - UMMA descriptors: BLOCK_K/2 template param, uint8_t dtype, UMMA_K/2 advance - L1 epilogue: dropped STSM, direct st.shared.u16 with packed nibbles, no swizzle (v1) - Pybind buffer sizes: hidden/2, intermediate_hidden/2 with packed tensor shapes - Host TMA descriptors: hidden/2 K-dims, block_k/2 inner, fp4_unpacked_smem=false - L1 output TMA: block_n/4 inner, no swizzle (CU_TENSOR_MAP_SWIZZLE_NONE)	2026-05-11 21:59:21 +00:00
biondizzle	0ac73a82f9	fix: L1 output uses unpacked E2M1 (1 byte/element) like FP8 - float_e2m1_unpacksmem_t: sizeof=1, SMEM is 1 byte/element (not packed) - TMA load unpacks 2 E2M1/global-byte → 2 SMEM bytes - UMMA reads unpacked SMEM, packs internally for mxf4nvf4 - L1→L2 handoff: unpacked format (same byte count as FP8) - Epilogue: 4 E2M1 bytes per uint32 STSM atom, same as FP8 - Dispatch TMA: kHidden bytes (unpacked), not kHidden/2 - Added static_assert on sizeof(a_dtype_t) and sizeof(b_dtype_t) - Note: no bandwidth savings at L1→L2 boundary for v1	2026-05-11 21:27:35 +00:00
biondizzle	091b974736	fix: L1 epilogue uses STSM with XOR swizzle for E2M1 FP4 output Keep STSM (not naive SMEM write) so TMA reads correct bank layout. Pack 4 E2M1 nibbles into uint32 per STSM atom with XOR swizzle. Known perf note: 32B swizzle zone for L1 output (land for v1).	2026-05-11 20:57:34 +00:00
biondizzle	a554de8b24	fix: dispatch TMA byte counts for FP4 (kHidden/2), rename fp8→fp4 layout refs	2026-05-11 20:47:58 +00:00
biondizzle	b3d1aae038	feat: full FP4 activations for mxf4nvf4 - E2M1 packed A side + UE4M3 scales mxf4nvf4 requires BOTH A and B to be FP4 (E2M1 packed). Changes: - a_dtype_t: float_e4m3_t → float_e2m1_unpacksmem_t - UMMA_K: 32 → 64 (FP4 MMA atom) - L1 epilogue: FP8 quant → E2M1 FP4 quantization with nearest-neighbor - L1 output SMEM: packed E2M1 (2 per byte), TMA store uint8 - TMA descriptors: adjusted for FP4 packing (K/2 bytes per row) - SymmBuffer: uint8 activations, shape (M, K//2) - Staging kernel: BF16 → E2M1 packed + UE4M3 block16 scales	2026-05-11 20:29:08 +00:00
biondizzle	2cd86ff5e7	fix: UE8M0→float32 reinterpret in fold_global_scale (Bug #7)	2026-05-11 19:40:01 +00:00
biondizzle	47621bb990	add NVFP4SymmBuffer + get_symm_buffer_for_nvfp4_mega_moe Python wrapper The C++ binding was registered but there was no Python wrapper. vLLM patch imports get_symm_buffer_for_nvfp4_mega_moe from deep_gemm.mega.	2026-05-11 16:25:08 +00:00
biondizzle	86a1263f44	fix: gran_k=16 in transform_sf + sm_100a arch for NVFP4 mega_moe - transform_sf_into_required_layout: add gran_k=16 branch for NVFP4 UE4M3 scales (4 per int32, group_size=16). Previously only handled 32/128. - get_arch: always return '100a' for SM100, never '100f'. The family variant lacks mxf4nvf4 (NVFP4 block-scaled MMA) support, causing 'scale_vec::4X not supported on sm_100f' errors. - transform_nvfp4_weights_for_mega_moe: fold weight_scale_2 into block scales, pack UE4M3→int32, transpose MN-major, call transform_sf_into_required_layout with gran_k=16.	2026-05-11 16:11:11 +00:00
biondizzle	fbdddaccf4	revert: restore mxf4nvf4/block16 code (correct path for sm_100a) Reverted to commit 36b439e's NVFP4 kernel code: - kGranK=16, mxf4nvf4.block_scale.scale_vec::4X - float_ue4m3_t instruction descriptor - Block16 SF layout (4X TMEM) - UE4M3 L1 epilogue - No UE4M3→UE8M0 conversion, no block16→block32 merge The mxf4nvf4.scale_vec::4X PTX instruction compiles successfully on both sm_100 and sm_100f with CUDA 13.0. The previous build 17 error was likely from a different cause, not the arch flag. Python: reverted transform_nvfp4_weights_for_mega_moe to use pack_ue4m3_to_int32 with gran_k=16, no UE8M0 conversion.	2026-05-11 15:02:47 +00:00
biondizzle	e80fe9af60	docs: CORRECTED — mxf4nvf4 IS supported on sm_100a (B200) The build 17-18 'scale_vec not supported on sm_100f' error was because we targeted sm_100 instead of sm_100a. The 'a' suffix is required for FP4 block-scaled MMA instructions. Reverting to mxf4nvf4 with correct arch target is the path forward.	2026-05-11 14:24:55 +00:00
biondizzle	c2f4a30780	docs: comprehensive README update through build 22	2026-05-11 13:55:17 +00:00
biondizzle	57c629ed1b	fix: cast to int32 before >> 23 (uint32 doesn't support right-shift)	2026-05-11 09:45:54 +00:00
biondizzle	6d7231a50e	fix: reinterpret float32 bits as uint32 before >> 23 for UE8M0	2026-05-11 09:42:03 +00:00
biondizzle	f44ff7f6ca	docs: document SM100 hardware constraint and full debugging log	2026-05-11 09:30:44 +00:00
biondizzle	03b8c99ee1	fix: use mxf8f6f4 (UE8M0) on SM100 — mxf4nvf4 requires SM103+ B200 (SM100) does NOT support kind::mxf4nvf4 at all (neither 2X nor 4X). Only mxf8f6f4.block_scale with UE8M0 scales is available on SM100. Strategy: keep NVFP4 E2M1 weights, convert UE4M3 block scales → UE8M0 in the weight transformation. This is a scale format adaptation for hardware compatibility, not a format conversion. Changes: - Kernel: back to mxf8f6f4 instruction + float_ue8m0_t descriptor - L1 epilogue: back to UE8M0 (>> 23) activation scales - Python: merge block16→block32, convert UE4M3→float32→UE8M0 - Packing: uint8 (UE8M0) → int32, same as MXFP4	2026-05-11 09:28:45 +00:00
biondizzle	b856c57ba6	fix: kGranK=32 in C++ binding (was still 16 from old block16 code)	2026-05-11 09:09:32 +00:00
biondizzle	cd7a612175	debug: add shape logging to SF packing	2026-05-11 08:54:14 +00:00
biondizzle	dcebe033e2	fix: use scale_vec::2X (block32) for SM100 B200 compatibility scale_vec::4X (block16) requires SM103/SM120 (B300/GB300), not SM100 (B200). Revert to block32 with UE4M3 scales. Same TMEM layout as MXFP4 but with UE4M3 scale format instead of UE8M0. Changes: - kGranK: 16 → 32 - PTX: scale_vec::4X → scale_vec::2X - SF layout: same as MXFP4 (K/32, K/128 for int32 packed) - UTCCP: i*8 → i*4 (2X layout, same as MXFP4) - TMEM columns: same as MXFP4 (SF_BLOCK_M/32, SF_BLOCK_N/32) - Python: merge NVFP4 block16→block32 scales (max of adjacent pairs) - recipe: (1,1,16) → (1,1,32)	2026-05-11 08:36:59 +00:00
biondizzle	deff80c9c1	fix: add Python wrapper for NVFP4 SymmBuffer allocation get_symm_buffer_for_nvfp4_mega_moe uses _C.get_symm_buffer_size_for_nvfp4_mega_moe to allocate the correct buffer size (2x SF entries due to group_size=16). Custom init to avoid SymmBuffer's hardcoded MXFP4 allocation.	2026-05-11 08:05:21 +00:00
biondizzle	acbe006498	docs: update debugging log in README	2026-05-11 07:33:02 +00:00
biondizzle	8d02eb38fa	fix: transpose SF to MN-major layout before TMA stride checks transform_sf_into_required_layout expects MN-major input (stride(-2)=1). Our packed int32 SF is K-major (stride(-1)=1). Transpose the last two dims, make contiguous, then transpose back so data is in MN-major order.	2026-05-11 07:32:10 +00:00
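
Several of the commits above turn on the same size arithmetic for packed FP4, so a small sketch may help when reading them. It is only an illustration: the helper names are hypothetical and do not come from the repository, and the numbers are the ones quoted in the commit messages (2 E2M1 values per byte, SwiGLU halving of the L1 output, 8 vs 4 scale factors per BLOCK_K=128). These are byte counts; per commit 48b5b2b702, the TMA inner dimensions for the packed type are given in individual FP4 values instead.

```python
# Illustrative only: helper names are hypothetical, not DeepGEMM APIs.
# The constants mirror what the commit messages state.

FP4_VALUES_PER_BYTE = 2  # packed E2M1: two 4-bit values per byte


def packed_fp4_bytes(num_elements: int) -> int:
    """Bytes needed to hold `num_elements` packed E2M1 values."""
    if num_elements % FP4_VALUES_PER_BYTE:
        raise ValueError("packed FP4 needs an even element count")
    return num_elements // FP4_VALUES_PER_BYTE


def l1_output_row_bytes(block_n: int) -> int:
    """L1 output row: SwiGLU halves N, then FP4 packing halves the bytes."""
    return packed_fp4_bytes(block_n // 2)


def scales_per_block_k(block_k: int, group_size: int) -> int:
    """Scale factors per token for one K-block at the given group size."""
    return block_k // group_size


if __name__ == "__main__":
    print(packed_fp4_bytes(128))        # 64  (BLOCK_K/2 bytes per row)
    print(l1_output_row_bytes(128))     # 32  (block_n/4)
    print(scales_per_block_k(128, 16))  # 8   (NVFP4, group=16)
    print(scales_per_block_k(128, 32))  # 4   (group=32)
```

Read this way, the block_n/4 revert (94b30dc2bc) and the doubled SF allocation for group=16 (af092fa7ba) follow from the same two divisions.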

249 Commits