The 'scale_vec not supported on sm_100f' error in builds 17–18 occurred because we targeted sm_100 instead of sm_100a. The 'a' suffix is required for FP4 block-scaled MMA instructions. The path forward is to revert to mxf4nvf4 with the correct arch target.
DeepGEMM NVFP4 Mega MoE Kernel
Overview
A native NVFP4 mega MoE kernel for DeepGEMM that uses kind::mxf4nvf4.block_scale.scale_vec::4X
to consume NVFP4 weights (E2M1 + UE4M3 block scales, group_size=16) directly on B200 (SM100a).
HARD RULE: MoE experts stay in NVFP4. Never convert to MXFP4.
SM100a (B200) Hardware Support
B200 (SM100a) DOES support kind::mxf4nvf4 with scale_vec::4X (block16, UE4M3 scales).
Documented in PTX ISA 8.7 (CUDA 12.8+), confirmed by NVIDIA/CUTLASS/Colfax.
The key requirement: target sm_100a (not sm_100). The a suffix enables the FP4
block-scaled instructions including mxf4nvf4. Targeting plain sm_100 will produce
"Feature '.scale_vec::4X' not supported on .target 'sm_100f'" errors.
Kernel Architecture (TARGET)
sm100_fp8_nvfp4_mega_moe_impl
├── kGranK = 16 (NVFP4 native block size)
├── kind::mxf4nvf4.block_scale.scale_vec::4X PTX instruction
├── float_ue4m3_t instruction descriptor
├── SF layout: scale_vec::4X, 4 TMEM sub-columns per UMMA atom
├── UTCCP copy: i*8 stride (4X layout, 8 TMEM cols per 128-element group)
├── kNumSFATmemCols = SF_BLOCK_M / 32 * 4
├── kNumSFBTmemCols = SF_BLOCK_N / 32 * 4
├── kNumSFUint32 = kHidden / 64 (4 UE4M3 per int32)
├── UE4M3 L1 epilogue (float → cutlass::float_e4m3_t cast, sign bit cleared)
└── recipe = (1, 1, 16)
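The size relations in the tree above can be checked with a few lines of arithmetic. This is a minimal sketch, not the kernel's code: the class name `Nvfp4SfConfig` and its field names are hypothetical, and only the formulas (`SF_BLOCK_M / 32 * 4`, `kHidden / 64`, `i * 8`) come from this document.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Nvfp4SfConfig:
    """Hypothetical helper mirroring the scale-factor constants listed above."""

    sf_block_m: int
    sf_block_n: int
    hidden: int
    gran_k: int = 16  # NVFP4 native block size

    def __post_init__(self) -> None:
        if self.sf_block_m % 32 or self.sf_block_n % 32:
            raise ValueError("SF block sizes must be multiples of 32")
        if self.hidden % (self.gran_k * 4):
            raise ValueError("hidden must be a multiple of gran_k * 4")

    @property
    def num_sfa_tmem_cols(self) -> int:
        # kNumSFATmemCols = SF_BLOCK_M / 32 * 4
        return self.sf_block_m // 32 * 4

    @property
    def num_sfb_tmem_cols(self) -> int:
        # kNumSFBTmemCols = SF_BLOCK_N / 32 * 4
        return self.sf_block_n // 32 * 4

    @property
    def num_sf_uint32(self) -> int:
        # One UE4M3 scale per gran_k elements, four scales per 32-bit word,
        # which gives kHidden / 64 for gran_k = 16.
        return self.hidden // (self.gran_k * 4)

    @staticmethod
    def utccp_col_offset(i: int) -> int:
        # 4X layout: 8 TMEM columns per 128-element group.
        return i * 8


cfg = Nvfp4SfConfig(sf_block_m=128, sf_block_n=256, hidden=7168)
print(cfg.num_sfa_tmem_cols, cfg.num_sfb_tmem_cols, cfg.num_sf_uint32)
```

The example shape (128, 256, 7168) is chosen only for illustration.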
Weight Transformation Pipeline
NVFP4 Checkpoint                          Kernel Format
┌─────────────────────┐                   ┌────────────────────────┐
│ weight: uint8       │──────────────────→│ int8 (E2M1, same)      │
│ (E2M1, 2 per byte)  │  .view(int8)      │ packed, interleaved    │
├─────────────────────┤                   ├────────────────────────┤
│ weight_scale:       │  1. fold global   │ int32 (TMA-aligned     │
│ float8_e4m3fn       │  2. pack 4→i32    │ UTCCP layout,          │
│ (UE4M3, group=16)   │  3. transpose     │ gran_k=16)             │
├─────────────────────┤  4. TMA-align     └────────────────────────┘
│ weight_scale_2:     │
│ float32 (global)    │──folded into block scales before packing
└─────────────────────┘
NO UE4M3→UE8M0 conversion. NO block16→block32 merge. The kernel consumes native UE4M3 scales with block16 grouping.
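Steps 2 and 3 of the pipeline (pack 4→i32, transpose) can be sketched in plain Python. The function names `pack_ue4m3_scales` and `transpose_to_mn_major` are hypothetical, and the little-endian byte order (first scale in the low byte) is an assumption. The real kernel format also requires TMA alignment, which is not modeled here.

```python
import struct


def pack_ue4m3_scales(scale_bytes: bytes) -> list[int]:
    """Pack consecutive UE4M3 scale bytes into signed int32 words, 4 per word."""
    if len(scale_bytes) % 4 != 0:
        raise ValueError("number of scales must be a multiple of 4")
    count = len(scale_bytes) // 4
    # '<' means little-endian: the first scale lands in the low byte (assumed).
    return list(struct.unpack(f"<{count}i", scale_bytes))


def transpose_to_mn_major(rows: list[list[int]]) -> list[list[int]]:
    """Turn a [mn][k_word] table into [k_word][mn] so MN is the fast axis."""
    return [list(col) for col in zip(*rows)]


# Two rows of eight block scales each, i.e. 128 K elements at group_size=16.
row0 = bytes([0x38, 0x40, 0x30, 0x48, 0x38, 0x38, 0x38, 0x38])
row1 = bytes([0x30, 0x30, 0x30, 0x30, 0x40, 0x40, 0x40, 0x40])
packed = [pack_ue4m3_scales(r) for r in (row0, row1)]
print([[hex(w) for w in row] for row in transpose_to_mn_major(packed)])
```

Because the top bit of every unsigned UE4M3 code is clear, the packed words are always non-negative when read as int32.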
Key Differences from MXFP4 mega_moe
| Parameter | MXFP4 | NVFP4 (this kernel) |
|---|---|---|
| kGranK | 32 | 16 |
| PTX instruction | mxf8f6f4.block_scale | mxf4nvf4.block_scale.scale_vec::4X |
| Scale factor type | float_ue8m0_t | float_ue4m3_t |
| SF vector size | block32 / 2X | block16 / 4X |
| TMEM SF cols (SFA) | SF_BLOCK_M / 32 | SF_BLOCK_M / 32 * 4 |
| UTCCP col stride | i * 4 | i * 8 |
| kNumSFUint32 | kHidden / 128 | kHidden / 64 |
| L1 epilogue | UE8M0 (>> 23) | UE4M3 (float→e4m3 cast) |
| recipe | (1, 1, 32) | (1, 1, 16) |
Critical Implementation Details
scale_format_ constraint
The CUTLASS instruction descriptor has a single scale_format_ bit (0=E4M3, 1=E8M0)
that applies to BOTH A and B scale factors. For NVFP4 (E4M3), both activation (SFA)
and weight (SFB) scales must use UE4M3. The L1 epilogue outputs UE4M3 activation scales
(float → cutlass::float_e4m3_t with sign bit cleared).
Arch flag
The JIT compiler MUST target sm_100a, not sm_100. Without the a suffix, the
mxf4nvf4 instruction is unavailable and compilation will fail with
"Feature '.scale_vec::4X' not supported on .target 'sm_100f'".
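A guard of the following kind could catch the wrong target before nvcc does. It is only a sketch: `nvfp4_gencode_flags` is a hypothetical helper, not part of DeepGEMM's JIT, and how DeepGEMM actually assembles its compiler flags is not shown in this document. The `-gencode arch=compute_100a,code=sm_100a` form is standard nvcc syntax.

```python
import re


def nvfp4_gencode_flags(arch: str) -> list[str]:
    """Return nvcc -gencode flags, rejecting targets without the 'a' suffix."""
    match = re.fullmatch(r"sm_(\d+)(a?)", arch)
    if match is None:
        raise ValueError(f"unrecognized arch string: {arch!r}")
    number, suffix = match.groups()
    if suffix != "a":
        # Plain sm_100 lacks the FP4 block-scaled MMA instructions.
        raise ValueError(
            f"{arch} cannot compile mxf4nvf4.block_scale; use sm_{number}a"
        )
    return ["-gencode", f"arch=compute_{number}a,code=sm_{number}a"]


print(nvfp4_gencode_flags("sm_100a"))
```

Failing early in Python gives a clearer message than the `.scale_vec::4X` feature error raised deep inside compilation.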
Weight scale_2 folding
The NVFP4 checkpoint has dual-level scaling: per-block UE4M3 + per-tensor float32.
The weight_scale_2 must be folded into the block scales before packing:
effective_scale = block_scale * global_scale, then re-quantize to UE4M3.
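The folding step can be illustrated on raw UE4M3 codes. This is a sketch under stated assumptions: the function names are hypothetical, rounding is round-to-nearest with ties to the even code, and values above the E4M3 maximum of 448 saturate. The rounding and saturation behavior of the actual conversion path is not specified in this document.

```python
def decode_ue4m3(code: int) -> float:
    """Decode an unsigned E4M3 code (0x00..0x7E) to its float value."""
    if not 0 <= code <= 0x7E:
        raise ValueError("code must be in 0x00..0x7E (0x7F is NaN)")
    exp, man = code >> 3, code & 0x7
    if exp == 0:
        return man * 2.0 ** -9  # subnormal: (man / 8) * 2^-6
    return (1.0 + man / 8.0) * 2.0 ** (exp - 7)


_UE4M3_VALUES = [decode_ue4m3(c) for c in range(0x7F)]


def encode_ue4m3(value: float) -> int:
    """Quantize a non-negative float to the nearest UE4M3 code."""
    if value != value or value < 0.0:
        raise ValueError("value must be a non-negative number")
    clamped = min(value, _UE4M3_VALUES[-1])  # saturate at 448 (assumption)
    # Nearest code; on an exact tie prefer the even code.
    return min(range(0x7F), key=lambda c: (abs(_UE4M3_VALUES[c] - clamped), c & 1))


def fold_global_scale(block_codes: list[int], global_scale: float) -> list[int]:
    """effective_scale = block_scale * global_scale, re-quantized to UE4M3."""
    return [encode_ue4m3(decode_ue4m3(c) * global_scale) for c in block_codes]


print([hex(c) for c in fold_global_scale([0x38, 0x40], 0.5)])
```

Since UE4M3 carries only 3 mantissa bits and tops out at 448, the re-quantization can round or saturate the folded scale, which is worth checking in the end-to-end quality test.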
Build History
| Build | Error | Fix |
|---|---|---|
| 1–6 | Dockerfile/build issues | NVRTC symlink, CPATH, PYTHONPATH |
| 7 | kPackedFP4 type mismatch | uint8→int8 view |
| 9 | SF stride assertion | MN-major layout + TMA alignment |
| 10 | transform_sf no gran_k=16 | C++ fix |
| 11 | SF dtype float8_e4m3fn rejected | Pack UE4M3→int32 first |
| 12–14 | SF stride layout | Transpose to MN-major |
| 15 | SymmBuffer too small | NVFP4-specific SymmBuffer (2× SF) |
| 16 | ImportError | Python wrapper |
| 17 | NVCC: scale_vec::4X not on sm_100f | Wrong arch: need sm_100a |
| 18 | scale_vec::2X also failed | Same: sm_100a required |
| 19 | kGranK still 16 in C++ binding | Should stay 16; was wrongly changed to 32 |
| 20 | uint32 >> 23 fails | Cast to int32 first |
| 22 | Garbled output | Fell back to mxf8f6f4; should use mxf4nvf4 on sm_100a |
Remaining Work
- Fix DeepGEMM JIT to target sm_100a instead of sm_100
- Add NVFP4 MMA kind enum to DeepGEMM runtime (not just MXFP8FP4 with NVFP4 hat)
- Revert to Build 17's mxf4nvf4.scale_vec::4X instruction (was correct, just wrong arch)
- Revert kGranK to 16, UE4M3 scales, block16 SF layout
- Add get_sf_uttcp_aligned_block_sizes branch for block16 layout
- Remove UE4M3→UE8M0 conversion and block16→block32 merge from Python
- Verify TMEM 4X layout (i*8 stride, 4 sub-columns)
- End-to-end quality test on B200