Commit Graph

254 Commits

Author SHA1 Message Date
4442c06ba8 diag: remove format=5 override, keep block_m=128 baseline test 2026-05-12 20:01:37 +00:00
2c09545faa diag: force block_m=128 to test UMMA_N=192 validity for mxf4nvf4 2026-05-12 19:37:11 +00:00
c1cbe488f3 diag: force a_format/b_format=5 (MXF8F6F4Format::E2M1), re-enable MMA, dump k=0+k=1 2026-05-12 19:06:28 +00:00
3b8aa5fd4d diag: stub MMA + dump descriptors for ILLEGAL_INSTRUCTION debug 2026-05-12 18:37:59 +00:00
c56f5dda7e fix: use UINT8 TMA for packed FP4 instead of 16U4_ALIGN8B
The 16U4_ALIGN8B TMA data type is not supported on this driver
(CUDA_ERROR_INVALID_VALUE). Use UINT8 TMA to load raw bytes and let
the UMMA descriptor interpret SMEM as packed FP4 for mxf4nvf4.
TMA dimensions stay in bytes (like UINT8).
2026-05-12 18:05:11 +00:00
b0094175a2 fix: restore elem_size declaration for TMA desc build 2026-05-12 17:40:25 +00:00
48b5b2b702 fix: TMA dimensions for packed FP4 must be in individual FP4 values (not bytes)
CUDA docs: 'Dimension for the packed data types must reflect the number
of individual U# values.' For 16U4_ALIGN8B, gmem/smem inner dims must be
FP4 value counts, not byte counts. Double the byte-oriented dimensions
passed by callers. gmem_outer_stride stays in bytes.
2026-05-12 17:39:07 +00:00
75f1c8544b fix: remove smem_inner_dim doubling for packed FP4 TMA — must match MMA row width (BLOCK_K/2) 2026-05-12 17:14:44 +00:00
b95f9eb446 revert: remove SMEM warp transpose (deadlock in elect_one_sync, not needed with transform_sf_token_idx) 2026-05-12 17:11:19 +00:00
54a7de03a0 fix: add UTCCP SMEM warp transpose for NVFP4 packed UE4M3 scales 2026-05-12 16:48:06 +00:00
8a53228745 fix: no GPU tensor ops in crash handler (CUDA context is unusable after error 715, CUDA_ERROR_ILLEGAL_INSTRUCTION) 2026-05-12 16:20:11 +00:00
9115f83afb debug: try/catch around mega_moe kernel with data diagnostics on crash 2026-05-12 16:05:55 +00:00
758389645a fix: contiguous copy for SF byte view sanity check 2026-05-12 15:44:50 +00:00
cc3e3da45c debug: check for zero/NaN/Inf in weight SF values 2026-05-12 15:30:38 +00:00
94b30dc2bc revert: block_n/4 was correct (SwiGLU halving × FP4 packing) 2026-05-12 15:04:23 +00:00
c71fb97687 fix: L1 output TMA smem_inner_dim was block_n/4, should be block_n/2
Packed E2M1 output has 2 elements per byte, so block_n elements = block_n/2 bytes.
block_n/4 was under-sizing the TMA SMEM row by 2x → OOB write → LAUNCH_FAILED.
2026-05-12 14:58:11 +00:00
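The row-width arithmetic disputed between this commit and the revert listed above it (94b30dc2bc) can be written out as a small sketch. The function name is hypothetical, and the assumption that SwiGLU collapses each gate/up pair of a block_n-wide tile into one output element is taken from the revert message, not from the kernel source.

```python
def l1_output_row_bytes(block_n: int) -> int:
    """Bytes per L1 output row, following the reasoning in the revert message."""
    if block_n % 4 != 0:
        raise ValueError("block_n must be a multiple of 4")
    out_elems = block_n // 2   # SwiGLU halving: one output per gate/up pair (assumed)
    return out_elems // 2      # packed E2M1: two 4-bit elements per byte


row_bytes = l1_output_row_bytes(128)
```

Under those assumptions both factors of two apply, which gives block_n/4 bytes per row rather than the block_n/2 argued in this commit.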
d8ae7a3225 debug: print SF shape/strides before interleave 2026-05-12 14:31:41 +00:00
e498a2c729 fix: single transpose back to MN-major, don't double-transpose
The .contiguous().transpose() dance was swapping dims back.
A single transpose from (g,k,mn) gives (g,mn,k) with stride(-2)=1,
which is exactly the MN-major layout TMA expects.
2026-05-12 14:23:02 +00:00
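The stride claim in this commit can be checked without a tensor library. The sketch below models only shape and stride bookkeeping; the helper names and the example sizes are hypothetical, and it assumes the (g, k, mn) input is contiguous.

```python
def contiguous_strides(shape):
    """Row-major strides (in elements) for a contiguous tensor of this shape."""
    strides, step = [], 1
    for dim in reversed(shape):
        strides.append(step)
        step *= dim
    return tuple(reversed(strides))


def transpose_last_two(shape, strides):
    """Swap the last two dims as a view: shape and strides swap, data stays put."""
    new_shape = shape[:-2] + (shape[-1], shape[-2])
    new_strides = strides[:-2] + (strides[-1], strides[-2])
    return new_shape, new_strides


g, k, mn = 2, 8, 128                      # illustrative sizes only
shape = (g, k, mn)
strides = contiguous_strides(shape)       # (k * mn, mn, 1)
t_shape, t_strides = transpose_last_two(shape, strides)
tt_shape, tt_strides = transpose_last_two(t_shape, t_strides)
```

A single transpose yields shape (g, mn, k) with a stride of 1 on the mn axis, while applying the transpose a second time restores the original K-major view, which is the failure mode the commit describes.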
916f03d528 debug: add transform output shape/stride prints 2026-05-12 14:22:05 +00:00
1f13b24354 debug: add strides to SF debug prints 2026-05-12 14:11:53 +00:00
bfe612969b fix: preserve MN-major layout when interleaving L1 SF tensors
_interleave_l1_weights used empty_like+copy_ which destroyed the
MN-major stride layout required by TMA. Added interleave_sf_mn_major
that works in K-major, interleaves, then transposes back to MN-major.
2026-05-12 14:01:58 +00:00
76220ac6ee fix: force contiguous on SF tensors before C++ call 2026-05-12 13:48:45 +00:00
bf5bf8d995 fix: unpack weight tuples before printing debug info 2026-05-12 13:28:32 +00:00
5ac151d0a5 debug: print tensor dtypes/shapes at C++ call boundary in fp8_nvfp4_mega_moe 2026-05-12 13:10:32 +00:00
26a8ab75a1 NVFP4: fix SF pipeline — 2 K-cols per BLOCK_K for group=16
- TMA: issue two tma::copy calls per K-block (K_box=1, 2 SF K-columns)
- UTCCP: double loop for 2 K-columns, correct SMEM offsets
- TMEM: double SFA/SFB column counts (SF_BLOCK_M/32 * 2)
- Heuristic: fix smem_size (2× SF, packed FP4 A/B, packed send buffers, no amax)
- Staging kernel: fix double-count bug in packed_k_mask
2026-05-12 08:08:17 +00:00
680874d067 NVFP4 L1 epilogue: group_size=16 SF layout
- Single amax per warp (16 N-elements = 1 SF group, no warp-pair reduction)
- Single sf_val instead of sf.x/sf.y split
- All 4 warps write SF (k_idx = n_block_idx*4 + warp_idx_in_wg)
- Remove dead SMEM amax storage, reclaim barrier offset space
- Remove dead __syncwarp after register-local amax
2026-05-12 07:08:08 +00:00
c0850a6859 Fix weight TMA descriptors: packed E2M1 needs K/2, block_k/2, swizzle/2
Weights are packed E2M1 (2 per byte) but TMA descriptors were using
unpacked dimensions — K-dim in elements instead of bytes, 128B swizzle
instead of 64B, full block_k instead of block_k/2. This caused OOB
reads and swizzle mismatch with the UMMA descriptor, producing
illegal instruction traps.
2026-05-12 06:51:39 +00:00
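The three halvings named in this commit can be summarized as a small sketch. The function and dictionary keys are hypothetical, and the 128-byte unpacked swizzle default is taken from the commit text; this is not the project's descriptor-building code.

```python
def packed_fp4_tma_dims(k: int, block_k: int, unpacked_swizzle_bytes: int = 128) -> dict:
    """Convert element-based weight dimensions to byte-based ones for packed E2M1."""
    if k % 2 or block_k % 2 or unpacked_swizzle_bytes % 2:
        raise ValueError("all inputs must be even: two E2M1 values share one byte")
    return {
        "gmem_inner_bytes": k // 2,                    # K elements -> K/2 bytes
        "smem_inner_bytes": block_k // 2,              # block_k elements -> block_k/2 bytes
        "swizzle_bytes": unpacked_swizzle_bytes // 2,  # 128B swizzle -> 64B
    }


dims = packed_fp4_tma_dims(k=7168, block_k=128)  # k=7168 is an illustrative size
```

With block_k=128, the SMEM row and the swizzle span both come out at 64 bytes, so the TMA descriptor and the UMMA descriptor agree on the row width.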
fbfeb54c9a Fix fold_global_scale: UE4M3 scales use .to(float32), not shift-by-23
Checkpoint stores float8_e4m3fn (standard NVFP4), not UE8M0.
The shift-by-23 was misinterpreting E4M3 bytes as E8M0 exponents.
2026-05-12 05:52:33 +00:00
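The misinterpretation described in this commit can be reproduced in pure Python by decoding the same byte under both formats. The function names are hypothetical; the bit layouts follow the usual definitions of float8_e4m3fn (bias 7, no infinities) and E8M0 (a bare exponent with bias 127). Placing a byte in the float32 exponent field with a shift by 23 is equivalent to the E8M0 reading for bytes 1 through 254.

```python
import math


def decode_e4m3fn(byte: int) -> float:
    """Decode one float8_e4m3fn byte: 1 sign, 4 exponent, 3 mantissa bits, bias 7."""
    sign = -1.0 if byte & 0x80 else 1.0
    exp = (byte >> 3) & 0xF
    man = byte & 0x7
    if exp == 0xF and man == 0x7:
        return math.nan                          # only NaN encoding; no infinities
    if exp == 0:
        return sign * (man / 8.0) * 2.0 ** -6    # subnormal
    return sign * (1.0 + man / 8.0) * 2.0 ** (exp - 7)


def decode_e8m0(byte: int) -> float:
    """Decode one E8M0 byte: a power-of-two exponent with bias 127."""
    if byte == 0xFF:
        return math.nan
    return 2.0 ** (byte - 127)


raw = 0x38                      # E4M3 encoding of 1.0
as_e4m3 = decode_e4m3fn(raw)    # correct reading for this checkpoint
as_e8m0 = decode_e8m0(raw)      # what the shift-by-23 path effectively computed
```

The same byte decodes to 1.0 as E4M3 but to 2^-71 as E8M0, so scales read the wrong way collapse toward zero.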
af092fa7ba fix: double SMEM SF allocation for NVFP4 group=16 + clean stale comments
- SMEM_SFA/SFB_SIZE_PER_STAGE doubled: group=16 needs 8 SFs per token
  per BLOCK_K=128 (vs 4 for group=32)
- arrive_and_expect_tx updated to use SMEM_SFA/SFB constants
- Removed stale comments about 8/16 TMEM columns
2026-05-11 23:58:07 +00:00
aa97a3f949 fix: correct TMEM column layout for scale_vec::4X
UTCCP 4x32dp128bit always writes 4 TMEM cols per 128-element group
regardless of 1X vs 4X. The 4X only changes MMA interpretation,
not UTCCP column count. Reverted from (*4, stride i*8) to (same as 1X, stride i*4):
- kNumSFATmemCols: SF_BLOCK_M/32 (was SF_BLOCK_M/32*4)
- kNumSFBTmemCols: SF_BLOCK_N/32 (was SF_BLOCK_N/32*4)
- UTCCP stride: i*4 (was i*8)
2026-05-11 23:44:12 +00:00
d6551617c0 fix: 4 kernel compilation fixes for packed FP4
1. sizeof_bits_v→sizeof_bits<T>::value (our CUTLASS lacks C++17 _v form)
2. reinterpret_cast<uint8_t*> for TMA copy and UMMA desc calls
   (smem_a returns float_e2m1_t* but templates expect uint8_t*)
3. kNumChunks extended to 4 (packed FP4 halved SMEM, need more chunks)
4. No code changes to PatternVisitor — all fixes at call sites
2026-05-11 23:17:51 +00:00
49e5646b42 fix: remove duplicate kInt8 case — kPackedFP4 is already kInt8
kPackedFP4 = torch::kInt8, so the kInt8 case was a duplicate.
The real fix was in mega_nvfp4.hpp: changing kUInt8→kInt8 so
tensors match the existing kPackedFP4 path in the TMA switch.
2026-05-11 22:55:28 +00:00
80df24a641 fix: add kInt8 dtype support to TMA descriptor + change activation tensors to kInt8
- runtime_utils.hpp: added kInt8 -> CU_TENSOR_MAP_DATA_TYPE_UINT8 mapping
- mega_nvfp4.hpp: changed activation tensor dtypes from kUInt8 to kInt8
  (same byte layout, but kInt8 is recognized by the TMA dtype switch)
2026-05-11 22:54:47 +00:00
e608a20dec docs: major README update — packed FP4 SMEM layout, L1 epilogue, TMA descriptors
Added detailed documentation of the packed FP4 architecture:
- mxf4nvf4 reads packed (2 per byte), NOT unpacked like mxf8f6f4
- SMEM layout: float_e2m1_t, BLOCK_K/2 swizzle, UMMA desc byte math
- L1 epilogue: st.shared.u16, no swizzle, kWarpBytesPerRow
- Host TMA: hidden/2 K-dim, block_k/2 inner, fp4_unpacked_smem=false
- Build history through Build 35
2026-05-11 22:40:09 +00:00
30d72e7ef5 fix: packed FP4 for mxf4nvf4 — correct SMEM layout, UMMA descriptors, L1 epilogue
Key changes:
- a_dtype_t/b_dtype_t: float_e2m1_t (packed 4-bit) with sizeof_bits_v==4 assert
- kSwizzleAMode/BMode: BLOCK_K/2 (64 bytes packed, not 128 unpacked)
- SMEM sizes: LOAD_BLOCK_M * BLOCK_K / 2 (packed byte count)
- Token layouts: kHidden/2, kIntermediateHidden/2 (packed bytes)
- TMA loads: BLOCK_K/2 inner dim, uint8_t, byte offsets k_block_idx*(BLOCK_K/2)
- UMMA descriptors: BLOCK_K/2 template param, uint8_t dtype, UMMA_K/2 advance
- L1 epilogue: dropped STSM, direct st.shared.u16 with packed nibbles, no swizzle (v1)
- Pybind buffer sizes: hidden/2, intermediate_hidden/2 with packed tensor shapes
- Host TMA descriptors: hidden/2 K-dims, block_k/2 inner, fp4_unpacked_smem=false
- L1 output TMA: block_n/4 inner, no swizzle (CU_TENSOR_MAP_SWIZZLE_NONE)
2026-05-11 21:59:21 +00:00
0ac73a82f9 fix: L1 output uses unpacked E2M1 (1 byte/element) like FP8
- float_e2m1_unpacksmem_t: sizeof=1, SMEM is 1 byte/element (not packed)
- TMA load unpacks 2 E2M1/global-byte → 2 SMEM bytes
- UMMA reads unpacked SMEM, packs internally for mxf4nvf4
- L1→L2 handoff: unpacked format (same byte count as FP8)
- Epilogue: 4 E2M1 bytes per uint32 STSM atom, same as FP8
- Dispatch TMA: kHidden bytes (unpacked), not kHidden/2
- Added static_assert on sizeof(a_dtype_t) and sizeof(b_dtype_t)
- Note: no bandwidth savings at L1→L2 boundary for v1
2026-05-11 21:27:35 +00:00
091b974736 fix: L1 epilogue uses STSM with XOR swizzle for E2M1 FP4 output
Keep STSM (not naive SMEM write) so TMA reads correct bank layout.
Pack 4 E2M1 nibbles into uint32 per STSM atom with XOR swizzle.
Known perf note: 32B swizzle zone for L1 output (land for v1).
2026-05-11 20:57:34 +00:00
a554de8b24 fix: dispatch TMA byte counts for FP4 (kHidden/2), rename fp8→fp4 layout refs 2026-05-11 20:47:58 +00:00
b3d1aae038 feat: full FP4 activations for mxf4nvf4 - E2M1 packed A side + UE4M3 scales
mxf4nvf4 requires BOTH A and B to be FP4 (E2M1 packed).
Changes:
- a_dtype_t: float_e4m3_t → float_e2m1_unpacksmem_t
- UMMA_K: 32 → 64 (FP4 MMA atom)
- L1 epilogue: FP8 quant → E2M1 FP4 quantization with nearest-neighbor
- L1 output SMEM: packed E2M1 (2 per byte), TMA store uint8
- TMA descriptors: adjusted for FP4 packing (K/2 bytes per row)
- SymmBuffer: uint8 activations, shape (M, K//2)
- Staging kernel: BF16 → E2M1 packed + UE4M3 block16 scales
2026-05-11 20:29:08 +00:00
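A minimal sketch of nearest-value E2M1 quantization and two-per-byte packing, as described in this commit. The function names are hypothetical. Two details are assumptions rather than facts from the source: the first element of each pair goes in the low nibble, and ties resolve toward the smaller magnitude. Per-block scaling (the UE4M3 block16 scales) is left out.

```python
# Magnitudes representable by E2M1, indexed by the 3-bit magnitude code 0..7.
E2M1_MAGNITUDES = (0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0)


def quantize_e2m1(x: float) -> int:
    """Return the 4-bit E2M1 code (bit 3 = sign, bits 2..0 = magnitude) nearest to x."""
    sign = 0x8 if x < 0 else 0x0
    mag = min(abs(x), 6.0)  # saturate at the largest representable magnitude
    code = min(range(8), key=lambda i: abs(E2M1_MAGNITUDES[i] - mag))
    return sign | code


def pack_e2m1(values) -> bytes:
    """Quantize floats and pack two codes per byte (low nibble first, assumed)."""
    if len(values) % 2:
        raise ValueError("need an even number of values to fill whole bytes")
    out = bytearray()
    for lo, hi in zip(values[0::2], values[1::2]):
        out.append(quantize_e2m1(lo) | (quantize_e2m1(hi) << 4))
    return bytes(out)


packed_row = pack_e2m1([0.0] * 128)  # one row of K=128 values
```

A row of K values occupies K/2 bytes after packing, which is where the (M, K//2) buffer shape in this commit comes from.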
2cd86ff5e7 fix: UE8M0→float32 reinterpret in fold_global_scale (Bug #7) 2026-05-11 19:40:01 +00:00
47621bb990 add NVFP4SymmBuffer + get_symm_buffer_for_nvfp4_mega_moe Python wrapper
The C++ binding was registered but there was no Python wrapper.
vLLM patch imports get_symm_buffer_for_nvfp4_mega_moe from deep_gemm.mega.
2026-05-11 16:25:08 +00:00
86a1263f44 fix: gran_k=16 in transform_sf + sm_100a arch for NVFP4 mega_moe
- transform_sf_into_required_layout: add gran_k=16 branch for NVFP4 UE4M3
  scales (4 per int32, group_size=16). Previously only handled 32/128.
- get_arch: always return '100a' for SM100, never '100f'. The family
  variant lacks mxf4nvf4 (NVFP4 block-scaled MMA) support, causing
  'scale_vec::4X not supported on sm_100f' errors.
- transform_nvfp4_weights_for_mega_moe: fold weight_scale_2 into block
  scales, pack UE4M3→int32, transpose MN-major, call
  transform_sf_into_required_layout with gran_k=16.
2026-05-11 16:11:11 +00:00
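The "4 per int32, group_size=16" packing in this commit can be sketched as follows. The function names are hypothetical, and the byte order (first scale factor in the least significant byte) is an assumption; the project's actual packing helper may order bytes differently.

```python
import struct


def pack_sf_bytes_to_int32(sf_bytes: bytes) -> list:
    """Pack every 4 consecutive 1-byte scale factors into one signed int32."""
    if len(sf_bytes) % 4 != 0:
        raise ValueError("number of scale bytes must be a multiple of 4")
    # "<i" reads 4 bytes as a little-endian signed 32-bit integer.
    return [value for (value,) in struct.iter_unpack("<i", sf_bytes)]


def num_sf_int32_per_row(k: int, gran_k: int = 16) -> int:
    """Number of packed int32 words covering one row of K elements."""
    groups = k // gran_k          # one scale factor per gran_k elements
    if k % gran_k or groups % 4:
        raise ValueError("K must be a multiple of 4 * gran_k")
    return groups // 4


words = pack_sf_bytes_to_int32(bytes([0x38, 0x40, 0x30, 0x48]))
```

For K=128, gran_k=16 gives 8 scale factors and 2 packed words, while gran_k=32 gives 4 scale factors and 1 word.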
fbdddaccf4 revert: restore mxf4nvf4/block16 code (correct path for sm_100a)
Reverted to commit 36b439e's NVFP4 kernel code:
- kGranK=16, mxf4nvf4.block_scale.scale_vec::4X
- float_ue4m3_t instruction descriptor
- Block16 SF layout (4X TMEM)
- UE4M3 L1 epilogue
- No UE4M3→UE8M0 conversion, no block16→block32 merge

The mxf4nvf4.scale_vec::4X PTX instruction compiles successfully
on both sm_100 and sm_100f with CUDA 13.0. The previous build 17
error was likely from a different cause, not the arch flag.

Python: reverted transform_nvfp4_weights_for_mega_moe to use
pack_ue4m3_to_int32 with gran_k=16, no UE8M0 conversion.
2026-05-11 15:02:47 +00:00
e80fe9af60 docs: CORRECTED — mxf4nvf4 IS supported on sm_100a (B200)
The build 17-18 'scale_vec not supported on sm_100f' error was because
we targeted sm_100 instead of sm_100a. The 'a' suffix is required for
FP4 block-scaled MMA instructions. Reverting to mxf4nvf4 with correct
arch target is the path forward.
2026-05-11 14:24:55 +00:00
c2f4a30780 docs: comprehensive README update through build 22 2026-05-11 13:55:17 +00:00
57c629ed1b fix: cast to int32 before >> 23 (uint32 doesn't support right-shift) 2026-05-11 09:45:54 +00:00
6d7231a50e fix: reinterpret float32 bits as uint32 before >> 23 for UE8M0 2026-05-11 09:42:03 +00:00
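The two fixes above concern extracting the float32 exponent field as a UE8M0 byte. A pure-Python sketch of that bit manipulation, using the standard struct module in place of tensor ops (function names are hypothetical):

```python
import struct


def float32_to_ue8m0(x: float) -> int:
    """Return the biased 8-bit exponent field of x stored as float32."""
    # Reinterpret the float32 bit pattern as an unsigned integer; a numeric
    # cast would instead convert the value and lose the exponent field.
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    return (bits >> 23) & 0xFF  # drop the 23 mantissa bits, mask off the sign


def ue8m0_to_float32(e: int) -> float:
    """Rebuild the power of two by placing e back in the exponent field."""
    bits = (e & 0xFF) << 23
    return struct.unpack("<f", struct.pack("<I", bits))[0]


exp_of_one = float32_to_ue8m0(1.0)
round_trip = ue8m0_to_float32(float32_to_ue8m0(0.75))
```

Because the shift discards the mantissa, the round trip rounds a positive value down to the nearest power of two (0.75 becomes 0.5).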
f44ff7f6ca docs: document SM100 hardware constraint and full debugging log 2026-05-11 09:30:44 +00:00
03b8c99ee1 fix: use mxf8f6f4 (UE8M0) on SM100 — mxf4nvf4 requires SM103+
B200 (SM100) does NOT support kind::mxf4nvf4 at all (neither 2X nor 4X).
Only mxf8f6f4.block_scale with UE8M0 scales is available on SM100.

Strategy: keep NVFP4 E2M1 weights, convert UE4M3 block scales → UE8M0
in the weight transformation. This is a scale format adaptation for
hardware compatibility, not a format conversion.

Changes:
- Kernel: back to mxf8f6f4 instruction + float_ue8m0_t descriptor
- L1 epilogue: back to UE8M0 (>> 23) activation scales
- Python: merge block16→block32, convert UE4M3→float32→UE8M0
- Packing: uint8 (UE8M0) → int32, same as MXFP4
2026-05-11 09:28:45 +00:00
b856c57ba6 fix: kGranK=32 in C++ binding (was still 16 from old block16 code) 2026-05-11 09:09:32 +00:00