kPackedFP4 = torch::kInt8, so the kInt8 case was a duplicate.
The real fix was in mega_nvfp4.hpp: changing kUInt8→kInt8 so
tensors match the existing kPackedFP4 path in the TMA switch.
- runtime_utils.hpp: added kInt8 -> CU_TENSOR_MAP_DATA_TYPE_UINT8 mapping
- mega_nvfp4.hpp: changed activation tensor dtypes from kUInt8 to kInt8
(same byte layout, but kInt8 is recognized by the TMA dtype switch)
Keep STSM (not naive SMEM write) so TMA reads correct bank layout.
Pack 4 E2M1 nibbles into uint32 per STSM atom with XOR swizzle.
Known perf note: 32B swizzle zone for L1 output (land for v1).
Reverted to commit 36b439e's NVFP4 kernel code:
- kGranK=16, mxf4nvf4.block_scale.scale_vec::4X
- float_ue4m3_t instruction descriptor
- Block16 SF layout (4X TMEM)
- UE4M3 L1 epilogue
- No UE4M3→UE8M0 conversion, no block16→block32 merge
The mxf4nvf4.scale_vec::4X PTX instruction compiles successfully
on both sm_100 and sm_100f with CUDA 13.0. The previous build 17
error was likely from a different cause, not the arch flag.
Python: reverted transform_nvfp4_weights_for_mega_moe to use
pack_ue4m3_to_int32 with gran_k=16, no UE8M0 conversion.
The build 17-18 'scale_vec not supported on sm_100f' error was because
we targeted sm_100 instead of sm_100a. The 'a' suffix is required for
FP4 block-scaled MMA instructions. Reverting to mxf4nvf4 with correct
arch target is the path forward.
B200 (SM100) does NOT support kind::mxf4nvf4 at all (neither 2X nor 4X).
Only mxf8f6f4.block_scale with UE8M0 scales is available on SM100.
Strategy: keep NVFP4 E2M1 weights, convert UE4M3 block scales → UE8M0
in the weight transformation. This is a scale format adaptation for
hardware compatibility, not a format conversion.
Changes:
- Kernel: back to mxf8f6F4 instruction + float_ue8m0_t descriptor
- L1 epilogue: back to UE8M0 (>> 23) activation scales
- Python: merge block16→block32, convert UE4M3→float32→UE8M0
- Packing: uint8 (UE8M0) → int32, same as MXFP4
scale_vec::4X (block16) requires SM103/SM120 (B300/GB300), not SM100 (B200).
Revert to block32 with UE4M3 scales. Same TMEM layout as MXFP4 but with
UE4M3 scale format instead of UE8M0.
Changes:
- kGranK: 16 → 32
- PTX: scale_vec::4X → scale_vec::2X
- SF layout: same as MXFP4 (K/32, K/128 for int32 packed)
- UTCCP: i*8 → i*4 (2X layout, same as MXFP4)
- TMEM columns: same as MXFP4 (SF_BLOCK_M/32, SF_BLOCK_N/32)
- Python: merge NVFP4 block16→block32 scales (max of adjacent pairs)
- recipe: (1,1,16) → (1,1,32)
get_symm_buffer_for_nvfp4_mega_moe uses _C.get_symm_buffer_size_for_nvfp4_mega_moe
to allocate the correct buffer size (2x SF entries due to group_size=16).
Custom init to avoid SymmBuffer's hardcoded MXFP4 allocation.
transform_sf_into_required_layout expects MN-major input (stride(-2)=1).
Our packed int32 SF is K-major (stride(-1)=1). Transpose the last two
dims, make contiguous, then transpose back so data is in MN-major order.
The C++ check_sf_layout stride assertion fails on 3D (experts, mn, K//64)
tensors. Reshape to 2D (experts*mn, K//64) before calling the transform
function, matching the expected stride layout.
The C++ transform function expects int32 (for kInt type) with 4 UE4M3
bytes packed per int32. We pack first, then transform for TMA alignment
and UTCCP transpose with recipe (1, 16).
Instead of custom _pack_nvfp4_sf_for_utccp, use DeepGEMM's C++
transform_sf_into_required_layout with recipe (1, 1, 16) for NVFP4.
This handles TMA alignment and UTCCP transpose correctly.
- transform_nvfp4_weights_for_mega_moe now accepts weight_scale_2
- Folds global scale into block scales: UE4M3 * FP32 -> UE4M3
- Dequantize to f32, multiply by global scale, clamp [0,448], re-quantize
- This is needed because the kernel only applies one level of block scaling
The function sm90_bf16_k_grouped_gemm was incorrectly using SM100ArchSpec
to calculate TMA descriptor block sizes. Since this file is the SM90
implementation, it should consistently use SM90ArchSpec like the other
functions in this file (sm90_bf16_gemm, sm90_m_grouped_bf16_gemm_contiguous,
etc.).
This fixes a copy-paste error that could cause incorrect block size
calculations on SM90 (Hopper) GPUs.
Fixes#242🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>