CUDA docs: 'Dimension for the packed data types must reflect the number
of individual U# values.' For 16U4_ALIGN8B, gmem/smem inner dims must be
FP4 value counts, not byte counts. Double the byte-oriented dimensions
passed by callers. gmem_outer_stride stays in bytes.
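Sketch of the dimension fix above, for illustration only. The helper name `packed_fp4_tma_dims` and the example sizes are hypothetical; the only facts taken from the note are that inner dims become FP4 value counts (2 per byte) and that gmem_outer_stride stays in bytes.

```python
FP4_VALUES_PER_BYTE = 2  # two 4-bit values share one byte


def packed_fp4_tma_dims(gmem_inner_bytes, smem_inner_bytes, gmem_outer_stride_bytes):
    """Hypothetical helper: turn byte-oriented inner dims into FP4 value counts."""
    return {
        # Inner dims are reported as numbers of individual 4-bit values.
        "gmem_inner": gmem_inner_bytes * FP4_VALUES_PER_BYTE,
        "smem_inner": smem_inner_bytes * FP4_VALUES_PER_BYTE,
        # The outer stride is left untouched: it is still measured in bytes.
        "gmem_outer_stride": gmem_outer_stride_bytes,
    }


# Example sizes are made up for the illustration.
dims = packed_fp4_tma_dims(gmem_inner_bytes=2048, smem_inner_bytes=64,
                           gmem_outer_stride_bytes=2048)
```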
Packed E2M1 output has 2 elements per byte, so block_n elements = block_n/2 bytes.
block_n/4 was under-sizing the TMA SMEM row by 2x → OOB write → LAUNCH_FAILED.
The extra .contiguous().transpose() step was swapping the dims back.
A single transpose from (g,k,mn) gives (g,mn,k) with stride(-2)=1,
which is exactly the MN-major layout TMA expects.
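Stride arithmetic behind the single-transpose claim, as a pure-Python model rather than the actual tensor code. The helper names and the sizes g=2, k=4, mn=8 are made up for the example.

```python
def contiguous_strides(shape):
    """Row-major (contiguous) strides, in elements, for the given shape."""
    strides = []
    step = 1
    for dim in reversed(shape):
        strides.append(step)
        step *= dim
    return tuple(reversed(strides))


def transpose_last_two(shape, strides):
    """Swap the last two dims without moving data (a view, like a transpose)."""
    new_shape = shape[:-2] + (shape[-1], shape[-2])
    new_strides = strides[:-2] + (strides[-1], strides[-2])
    return new_shape, new_strides


g, k, mn = 2, 4, 8
shape = (g, k, mn)
strides = contiguous_strides(shape)
t_shape, t_strides = transpose_last_two(shape, strides)
# The transposed view has shape (g, mn, k) and unit stride along mn,
# i.e. stride(-2) == 1.
```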
_interleave_l1_weights used empty_like+copy_ which destroyed the
MN-major stride layout required by TMA. Added interleave_sf_mn_major
that works in K-major, interleaves, then transposes back to MN-major.
- Single amax per warp (16 N-elements = 1 SF group, no warp-pair reduction)
- Single sf_val instead of sf.x/sf.y split
- All 4 warps write SF (k_idx = n_block_idx*4 + warp_idx_in_wg)
- Remove dead SMEM amax storage, reclaim barrier offset space
- Remove dead __syncwarp after register-local amax
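Host-side model of the per-warp SF indexing and amax listed above. This is a sketch, not the kernel code: the function names are hypothetical, and only the formula k_idx = n_block_idx*4 + warp_idx_in_wg and the 16-element group size come from the note.

```python
WARPS_PER_WG = 4   # all 4 warps of the warpgroup write a scale factor
SF_GROUP = 16      # 16 N-elements form one SF group


def sf_k_index(n_block_idx, warp_idx_in_wg):
    """SF slot written by a warp: k_idx = n_block_idx*4 + warp_idx_in_wg."""
    return n_block_idx * WARPS_PER_WG + warp_idx_in_wg


def warp_amax(values):
    """Single amax over one warp's 16 elements (no warp-pair reduction)."""
    assert len(values) == SF_GROUP
    return max(abs(v) for v in values)


indices = [sf_k_index(2, w) for w in range(WARPS_PER_WG)]
amax = warp_amax([0.5, -3.0] + [1.0] * 14)
```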
Weights are packed E2M1 (2 per byte) but TMA descriptors were using
unpacked dimensions: K-dim in elements instead of bytes, 128B swizzle
instead of 64B, full block_k instead of block_k/2. This caused OOB
reads and swizzle mismatch with the UMMA descriptor, producing
illegal instruction traps.
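Byte arithmetic for the packed box, as a rough sketch. It assumes block_k = 128 and that the swizzle mode is chosen so its span equals the inner row size in bytes; both are assumptions made for the illustration, and the helper names are hypothetical.

```python
# Swizzle span in bytes -> CUDA driver enum name.
SWIZZLE_BY_SPAN = {
    32: "CU_TENSOR_MAP_SWIZZLE_32B",
    64: "CU_TENSOR_MAP_SWIZZLE_64B",
    128: "CU_TENSOR_MAP_SWIZZLE_128B",
}


def packed_e2m1_row_bytes(num_elements):
    """Bytes occupied by a row of packed E2M1 values (2 per byte)."""
    assert num_elements % 2 == 0
    return num_elements // 2


def swizzle_mode_for(row_bytes):
    """Assumption: pick the swizzle whose span equals the inner row bytes."""
    if row_bytes not in SWIZZLE_BY_SPAN:
        raise ValueError(f"no swizzle span matches {row_bytes} bytes")
    return SWIZZLE_BY_SPAN[row_bytes]


block_k = 128                                   # assumed tile size along K
packed_bytes = packed_e2m1_row_bytes(block_k)   # block_k/2 bytes per row
packed_swizzle = swizzle_mode_for(packed_bytes)
# Treating each element as a full byte reproduces the wrong 128B choice.
unpacked_swizzle = swizzle_mode_for(block_k)
```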
- SMEM_SFA/SFB_SIZE_PER_STAGE doubled: group=16 needs 8 SFs per token
per BLOCK_K=128 (vs 4 for group=32)
- arrive_and_expect_tx updated to use SMEM_SFA/SFB constants
- Removed stale comments about 8/16 TMEM columns
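The SF-count arithmetic behind the doubling, as a small check. The function name is hypothetical; the numbers (BLOCK_K=128, group sizes 16 and 32) are the ones quoted above.

```python
def sf_per_token(block_k, group_size):
    """Number of scale factors one token needs per BLOCK_K slice."""
    assert block_k % group_size == 0
    return block_k // group_size


BLOCK_K = 128
sf_group16 = sf_per_token(BLOCK_K, 16)
sf_group32 = sf_per_token(BLOCK_K, 32)
# group=16 needs twice as many SFs per token as group=32.
growth = sf_group16 // sf_group32
```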
1. sizeof_bits_v→sizeof_bits<T>::value (our CUTLASS lacks C++17 _v form)
2. reinterpret_cast<uint8_t*> for TMA copy and UMMA desc calls
(smem_a returns float_e2m1_t* but templates expect uint8_t*)
3. kNumChunks extended to 4 (packed FP4 halved SMEM, need more chunks)
4. No code changes to PatternVisitor — all fixes at call sites
kPackedFP4 = torch::kInt8, so the kInt8 case was a duplicate.
The real fix was in mega_nvfp4.hpp: changing kUInt8→kInt8 so
tensors match the existing kPackedFP4 path in the TMA switch.
- runtime_utils.hpp: added kInt8 -> CU_TENSOR_MAP_DATA_TYPE_UINT8 mapping
- mega_nvfp4.hpp: changed activation tensor dtypes from kUInt8 to kInt8
(same byte layout, but kInt8 is recognized by the TMA dtype switch)
Keep STSM (not naive SMEM write) so TMA reads correct bank layout.
Pack 4 E2M1 nibbles into uint32 per STSM atom with XOR swizzle.
Known perf note: 32B swizzle zone for L1 output (land for v1).
Reverted to commit 36b439e's NVFP4 kernel code:
- kGranK=16, mxf4nvf4.block_scale.scale_vec::4X
- float_ue4m3_t instruction descriptor
- Block16 SF layout (4X TMEM)
- UE4M3 L1 epilogue
- No UE4M3→UE8M0 conversion, no block16→block32 merge
The mxf4nvf4.scale_vec::4X PTX instruction compiles successfully
on both sm_100 and sm_100f with CUDA 13.0. The previous build 17
error was likely from a different cause, not the arch flag.
Python: reverted transform_nvfp4_weights_for_mega_moe to use
pack_ue4m3_to_int32 with gran_k=16, no UE8M0 conversion.
The build 17-18 'scale_vec not supported on sm_100f' error was because
we targeted sm_100 instead of sm_100a. The 'a' suffix is required for
FP4 block-scaled MMA instructions. Reverting to mxf4nvf4 with correct
arch target is the path forward.
B200 (SM100) does NOT support kind::mxf4nvf4 at all (neither 2X nor 4X).
Only mxf8f6f4.block_scale with UE8M0 scales is available on SM100.
Strategy: keep NVFP4 E2M1 weights, convert UE4M3 block scales → UE8M0
in the weight transformation. This adapts only the scale format for
hardware compatibility; the E2M1 weight values are not converted.
Changes:
- Kernel: back to mxf8f6f4 instruction + float_ue8m0_t descriptor
- L1 epilogue: back to UE8M0 (>> 23) activation scales
- Python: merge block16→block32, convert UE4M3→float32→UE8M0
- Packing: uint8 (UE8M0) → int32, same as MXFP4
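Sketch of the UE4M3→float32→UE8M0 path and the uint8 → int32 packing, in plain Python. The function names are hypothetical. Rounding the exponent up when the mantissa is nonzero and little-endian byte order in the packed int32 are assumptions; the actual transform may round or order differently. NaN encodings are not handled.

```python
import struct


def ue4m3_to_float(byte):
    """Decode an unsigned E4M3 scale (4 exponent bits, 3 mantissa bits, bias 7)."""
    exp = (byte >> 3) & 0xF
    man = byte & 0x7
    if exp == 0:
        return (man / 8.0) * 2.0 ** -6          # subnormal range
    return (1.0 + man / 8.0) * 2.0 ** (exp - 7)


def float_to_ue8m0(x):
    """Keep only the float32 exponent field (bits >> 23) of a non-negative x."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    exponent = (bits >> 23) & 0xFF
    mantissa = bits & 0x7FFFFF
    if mantissa != 0 and exponent < 0xFE:
        exponent += 1   # assumption: round up so the scale never shrinks
    return exponent


def pack_u8x4_to_int32(four_bytes):
    """Pack 4 uint8 scales into one int32 (assumed little-endian order)."""
    assert len(four_bytes) == 4
    return struct.unpack("<i", bytes(four_bytes))[0]


one = ue4m3_to_float(0x38)           # exp=7, man=0
one_and_half = ue4m3_to_float(0x3C)  # exp=7, man=4
packed = pack_u8x4_to_int32([float_to_ue8m0(one), float_to_ue8m0(one_and_half),
                             float_to_ue8m0(one), float_to_ue8m0(one)])
```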
scale_vec::4X (block16) requires SM103/SM120 (B300/GB300), not SM100 (B200).
Revert to block32 with UE4M3 scales. Same TMEM layout as MXFP4 but with
UE4M3 scale format instead of UE8M0.
Changes:
- kGranK: 16 → 32
- PTX: scale_vec::4X → scale_vec::2X
- SF layout: same as MXFP4 (K/32, K/128 for int32 packed)
- UTCCP: i*8 → i*4 (2X layout, same as MXFP4)
- TMEM columns: same as MXFP4 (SF_BLOCK_M/32, SF_BLOCK_N/32)
- Python: merge NVFP4 block16→block32 scales (max of adjacent pairs)
- recipe: (1,1,16) → (1,1,32)
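Sketch of the block16→block32 scale merge (max of adjacent pairs). The function name is hypothetical, and the example works on already-decoded float scales for one row with K=128.

```python
def merge_block16_to_block32(sf16):
    """Merge per-16 scales into per-32 scales by taking the max of each pair."""
    assert len(sf16) % 2 == 0
    return [max(sf16[i], sf16[i + 1]) for i in range(0, len(sf16), 2)]


# K=128 gives 8 block16 scales, which collapse to K/32 = 4 block32 scales.
sf16 = [0.5, 1.0, 2.0, 0.25, 4.0, 4.0, 0.125, 8.0]
sf32 = merge_block16_to_block32(sf16)
```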
get_symm_buffer_for_nvfp4_mega_moe uses _C.get_symm_buffer_size_for_nvfp4_mega_moe
to allocate the correct buffer size (2x SF entries due to group_size=16).
Custom init to avoid SymmBuffer's hardcoded MXFP4 allocation.
transform_sf_into_required_layout expects MN-major input (stride(-2)=1).
Our packed int32 SF is K-major (stride(-1)=1). Transpose the last two
dims, make contiguous, then transpose back so data is in MN-major order.
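Pure-Python model of what the transpose, contiguous, transpose-back sequence does to memory order (in torch terms, roughly `sf.transpose(-1, -2).contiguous().transpose(-1, -2)`). The helper names and the 2x3 example are made up; this is an illustration, not the library code.

```python
def to_mn_major(rows):
    """rows[m][k] is K-major input; return an MN-major flat buffer and strides."""
    mn, k = len(rows), len(rows[0])
    # Consecutive memory slots walk along mn first, then along k.
    flat = [rows[m][kk] for kk in range(k) for m in range(mn)]
    strides = (1, mn)   # stride(-2) == 1, stride(-1) == mn
    return flat, strides


def read(flat, strides, m, kk):
    """Index the logical (mn, k) view through its strides."""
    return flat[m * strides[0] + kk * strides[1]]


rows = [[0, 1, 2], [10, 11, 12]]   # logical shape (mn=2, k=3)
flat, strides = to_mn_major(rows)
# The logical shape and values are unchanged; only the memory order differs.
```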