biondizzle e80fe9af60 docs: CORRECTED — mxf4nvf4 IS supported on sm_100a (B200)

The build 17-18 'scale_vec not supported on sm_100f' error was because
we targeted sm_100 instead of sm_100a. The 'a' suffix is required for
FP4 block-scaled MMA instructions. Reverting to mxf4nvf4 with correct
arch target is the path forward.

2026-05-11 14:24:55 +00:00

DeepGEMM NVFP4 Mega MoE Kernel

Overview

A native NVFP4 mega MoE kernel for DeepGEMM that uses kind::mxf4nvf4.block_scale.scale_vec::4X to consume NVFP4 weights (E2M1 + UE4M3 block scales, group_size=16) directly on B200 (SM100a).

HARD RULE: MoE experts stay in NVFP4. Never convert to MXFP4.
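To make the format concrete, here is a minimal sketch of how one NVFP4 block (16 E2M1 values sharing one UE4M3 scale) could be dequantized on the host. The function names are hypothetical, and the nibble order (low nibble holds the even element) is an assumption that should be checked against the checkpoint's packing convention.

```python
# E2M1 magnitudes indexed by the low 3 bits of a nibble; bit 3 is the sign.
E2M1_MAGNITUDES = (0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0)


def decode_e2m1(nibble: int) -> float:
    """Decode one 4-bit E2M1 value (1 sign, 2 exponent, 1 mantissa bit)."""
    magnitude = E2M1_MAGNITUDES[nibble & 0x7]
    return -magnitude if nibble & 0x8 else magnitude


def decode_ue4m3(code: int) -> float:
    """Decode an E4M3 byte as an unsigned scale (bias 7, sign bit ignored)."""
    code &= 0x7F
    if code == 0x7F:
        return float("nan")  # the only NaN encoding in the e4m3fn variant
    exponent = code >> 3
    mantissa = code & 0x7
    if exponent == 0:
        return mantissa / 8.0 * 2.0 ** -6  # subnormal range
    return (1.0 + mantissa / 8.0) * 2.0 ** (exponent - 7)


def dequantize_block(packed: bytes, scale_code: int) -> list[float]:
    """Dequantize one group of 16 values: 8 packed bytes plus one scale byte."""
    assert len(packed) == 8, "group_size=16 means 8 bytes per block"
    scale = decode_ue4m3(scale_code)
    values = []
    for byte in packed:
        # Assumption: low nibble is the even element, high nibble the odd one.
        values.append(decode_e2m1(byte & 0xF) * scale)
        values.append(decode_e2m1(byte >> 4) * scale)
    return values
```

For example, `dequantize_block(bytes([0x21, 0xF7, 0, 0, 0, 0, 0, 0]), 0x40)` uses a scale of 2.0, so the first four elements come out as 1.0, 2.0, 12.0 and -12.0. The per-tensor global scale is not applied here.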

SM100a (B200) Hardware Support

B200 (SM100a) DOES support kind::mxf4nvf4 with scale_vec::4X (block16, UE4M3 scales). Documented in PTX ISA 8.7 (CUDA 12.8+), confirmed by NVIDIA/CUTLASS/Colfax.

The key requirement is to target sm_100a, not sm_100. The 'a' suffix enables the FP4 block-scaled instructions, including mxf4nvf4. Targeting plain sm_100 produces the error "Feature '.scale_vec::4X' not supported on .target 'sm_100f'".

Kernel Architecture (TARGET)

sm100_fp8_nvfp4_mega_moe_impl
├── kGranK = 16 (NVFP4 native block size)
├── kind::mxf4nvf4.block_scale.scale_vec::4X PTX instruction
├── float_ue4m3_t instruction descriptor
├── SF layout: scale_vec::4X, 4 TMEM sub-columns per UMMA atom
├── UTCCP copy: i*8 stride (4X layout, 8 TMEM cols per 128-element group)
├── kNumSFATmemCols = SF_BLOCK_M / 32 * 4
├── kNumSFBTmemCols = SF_BLOCK_N / 32 * 4
├── kNumSFUint32 = kHidden / 64 (4 UE4M3 per int32)
├── UE4M3 L1 epilogue (float → cutlass::float_e4m3_t cast, sign bit cleared)
└── recipe = (1, 1, 16)

Weight Transformation Pipeline

NVFP4 Checkpoint                         Kernel Format
┌─────────────────────┐                 ┌────────────────────────┐
│ weight: uint8       │────────────────→│ int8 (E2M1, same)      │
│ (E2M1, 2 per byte)  │  .view(int8)    │ packed, interleaved    │
├─────────────────────┤                 ├────────────────────────┤
│ weight_scale:       │ 1. fold global  │ int32 (TMA-aligned     │
│ float8_e4m3fn       │ 2. pack 4→i32   │  UTCCP layout,         │
│ (UE4M3, group=16)   │ 3. transpose    │  gran_k=16)            │
├─────────────────────┤ 4. TMA-align    └────────────────────────┘
│ weight_scale_2:     │
│ float32 (global)    │──folded into block scales before packing
└─────────────────────┘

NO UE4M3→UE8M0 conversion. NO block16→block32 merge. The kernel consumes native UE4M3 scales with block16 grouping.
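Steps 2 to 4 of the scale path (pack, transpose, align) can be sketched in plain Python as follows. This is an illustration, not the repository's transform: the function name is hypothetical, the byte order (first scale in the low byte) and the 16-byte alignment of the MN dimension are assumptions, and step 1 (folding the global scale) is taken to have happened already.

```python
import struct


def pack_scales_mn_major(scales: list[list[int]], align_bytes: int = 16) -> list[list[int]]:
    """Pack UE4M3 scale bytes into int32 words and return them MN-major.

    scales[row][k_block] holds one UE4M3 byte per 16-element block.
    The result is indexed as packed[k_word][row], with the row (MN)
    dimension zero-padded to a multiple of align_bytes.
    """
    num_rows = len(scales)
    num_k_blocks = len(scales[0])
    assert num_k_blocks % 4 == 0, "need a multiple of 4 scales per row"
    num_k_words = num_k_blocks // 4

    # Step 4: pad the MN dimension (int32 words are 4 bytes each).
    words_per_align = align_bytes // 4
    aligned_rows = -(-num_rows // words_per_align) * words_per_align

    # Step 3: allocate in transposed (MN-major) order up front.
    packed = [[0] * aligned_rows for _ in range(num_k_words)]
    for row, row_scales in enumerate(scales):
        for word in range(num_k_words):
            four = bytes(row_scales[4 * word: 4 * word + 4])
            # Step 2: four UE4M3 bytes become one int32.
            # Assumption: little-endian, first scale in the low byte.
            packed[word][row] = struct.unpack("<i", four)[0]
    return packed
```

With K = 128 there are 8 block scales per row, so each row contributes 2 int32 words, which matches kNumSFUint32 = kHidden / 64.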

Key Differences from MXFP4 mega_moe

| Parameter | MXFP4 | NVFP4 (this kernel) |
| --- | --- | --- |
| `kGranK` | 32 | 16 |
| PTX instruction | `mxf8f6f4.block_scale` | `mxf4nvf4.block_scale.scale_vec::4X` |
| Scale factor type | `float_ue8m0_t` | `float_ue4m3_t` |
| SF vector size | block32 / 2X | block16 / 4X |
| TMEM SF cols (SFA) | `SF_BLOCK_M / 32` | `SF_BLOCK_M / 32 * 4` |
| UTCCP col stride | `i * 4` | `i * 8` |
| `kNumSFUint32` | `kHidden / 128` | `kHidden / 64` |
| L1 epilogue | UE8M0 (`>> 23`) | UE4M3 (float→e4m3 cast) |
| recipe | `(1, 1, 32)` | `(1, 1, 16)` |
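The derived constants in the table follow mechanically from the block size and the number of scales per int32. A small sketch of that arithmetic is below. `ScaleLayout` and `scale_layout` are hypothetical names, and the MXFP4 SFB column count is assumed to mirror the SFA formula, since the table lists only SFA.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ScaleLayout:
    gran_k: int            # elements covered by one scale factor
    sfa_tmem_cols: int     # TMEM columns for activation scales
    sfb_tmem_cols: int     # TMEM columns for weight scales
    num_sf_uint32: int     # packed scale words along the hidden dimension
    utccp_col_stride: int  # multiplier applied to i in the UTCCP copy


def scale_layout(fmt: str, sf_block_m: int, sf_block_n: int, hidden: int) -> ScaleLayout:
    """Compute the scale-factor layout constants for 'mxfp4' or 'nvfp4'."""
    if fmt == "nvfp4":
        gran_k, col_multiplier, stride = 16, 4, 8
    elif fmt == "mxfp4":
        gran_k, col_multiplier, stride = 32, 1, 4
    else:
        raise ValueError(f"unknown format: {fmt}")
    scales_per_word = 4  # four 1-byte scales per 32-bit word
    return ScaleLayout(
        gran_k=gran_k,
        sfa_tmem_cols=sf_block_m // 32 * col_multiplier,
        sfb_tmem_cols=sf_block_n // 32 * col_multiplier,
        num_sf_uint32=hidden // (gran_k * scales_per_word),
        utccp_col_stride=stride,
    )
```

For an illustrative hidden size of 7168 and 128-wide SF blocks, `scale_layout("nvfp4", 128, 128, 7168)` gives 16 TMEM columns per operand and 112 packed words, against 4 columns and 56 words for `"mxfp4"`.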

Critical Implementation Details

scale_format_ constraint

The CUTLASS instruction descriptor has a single scale_format_ bit (0=E4M3, 1=E8M0) that applies to BOTH A and B scale factors. For NVFP4 (E4M3), both activation (SFA) and weight (SFB) scales must use UE4M3. The L1 epilogue outputs UE4M3 activation scales (float → cutlass::float_e4m3_t with sign bit cleared).
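Because one bit covers both operands, a host-side guard can reject mismatched scale types before a kernel is ever generated. The sketch below is a hypothetical helper, not part of CUTLASS or DeepGEMM. Only the bit values (0=E4M3, 1=E8M0) come from the text above.

```python
# Bit values as described above: 0 selects E4M3 scales, 1 selects E8M0.
SCALE_FORMAT_BITS = {"ue4m3": 0, "ue8m0": 1}


def scale_format_bit(sfa_type: str, sfb_type: str) -> int:
    """Return the shared scale_format_ bit, or raise if SFA and SFB differ."""
    if sfa_type not in SCALE_FORMAT_BITS or sfb_type not in SCALE_FORMAT_BITS:
        raise ValueError(f"unsupported scale types: {sfa_type}, {sfb_type}")
    if sfa_type != sfb_type:
        raise ValueError(
            "SFA and SFB share one scale_format_ bit, so both must use "
            f"the same type (got {sfa_type} and {sfb_type})"
        )
    return SCALE_FORMAT_BITS[sfa_type]
```

Calling `scale_format_bit("ue4m3", "ue4m3")` returns 0, while mixing `"ue8m0"` activation scales with `"ue4m3"` weight scales raises immediately.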

Arch flag

The JIT compiler MUST target sm_100a, not sm_100. Without the 'a' suffix, the mxf4nvf4 instruction is unavailable and compilation fails with "Feature '.scale_vec::4X' not supported on .target 'sm_100f'".
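As an illustration of the fix, a JIT wrapper could build the nvcc target from the device's compute capability and append the suffix explicitly. `nvcc_arch_flag` is a hypothetical helper and does not reflect how DeepGEMM's JIT actually assembles its command line. The `--generate-code` option with `arch=compute_…,code=sm_…` is standard nvcc syntax.

```python
def nvcc_arch_flag(major: int, minor: int, arch_specific: bool = True) -> str:
    """Build an nvcc --generate-code flag for the given compute capability.

    With arch_specific=True the 'a' suffix is appended, which is what the
    FP4 block-scaled MMA instructions need on B200 (capability 10.0).
    """
    suffix = "a" if arch_specific else ""
    arch = f"{major}{minor}{suffix}"
    return f"--generate-code=arch=compute_{arch},code=sm_{arch}"
```

For B200, `nvcc_arch_flag(10, 0)` yields `--generate-code=arch=compute_100a,code=sm_100a`.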

Weight scale_2 folding

The NVFP4 checkpoint has dual-level scaling: per-block UE4M3 + per-tensor float32. The weight_scale_2 must be folded into the block scales before packing: effective_scale = block_scale * global_scale, then re-quantize to UE4M3.
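The folding step can be sketched in plain Python as below. The helper names are hypothetical, and the rounding mode (round to nearest, ties to even, saturating at 448) is an assumption about how the re-quantization should behave. A real pipeline would more likely do this with tensor operations.

```python
def decode_ue4m3(code: int) -> float:
    """Decode a finite UE4M3 code (0x00..0x7E) to a float."""
    exponent = (code >> 3) & 0xF
    mantissa = code & 0x7
    if exponent == 0:
        return mantissa / 8.0 * 2.0 ** -6  # subnormal range
    return (1.0 + mantissa / 8.0) * 2.0 ** (exponent - 7)


# All finite non-negative values, indexed by code (0x7F is NaN and excluded).
_UE4M3_VALUES = [decode_ue4m3(code) for code in range(0x7F)]
_UE4M3_MAX = _UE4M3_VALUES[-1]  # 448.0


def encode_ue4m3(value: float) -> int:
    """Round a non-negative float to the nearest UE4M3 code.

    Ties go to the code with an even mantissa bit; values above the
    maximum saturate (assumed behavior for this sketch).
    """
    if value != value or value < 0.0:
        raise ValueError("UE4M3 scales must be non-negative numbers")
    value = min(value, _UE4M3_MAX)
    return min(range(0x7F), key=lambda c: (abs(_UE4M3_VALUES[c] - value), c & 1))


def fold_global_scale(block_codes: list[int], global_scale: float) -> list[int]:
    """effective_scale = block_scale * global_scale, re-quantized to UE4M3."""
    return [encode_ue4m3(decode_ue4m3(code) * global_scale) for code in block_codes]
```

For instance, `fold_global_scale([0x38, 0x40], 0.5)` maps block scales 1.0 and 2.0 to 0.5 and 1.0. In this sketch, products above 448 saturate and products well below 2^-9 round to zero, so the range of the folded scales is worth checking on real checkpoints.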

Build History

| Build | Error | Fix |
| --- | --- | --- |
| 1–6 | Dockerfile/build issues | NVRTC symlink, CPATH, PYTHONPATH |
| 7 | `kPackedFP4` type mismatch | uint8→int8 view |
| 9 | SF stride assertion | MN-major layout + TMA alignment |
| 10 | `transform_sf` no gran_k=16 | C++ fix |
| 11 | SF dtype float8_e4m3fn rejected | Pack UE4M3→int32 first |
| 12–14 | SF stride layout | Transpose to MN-major |
| 15 | SymmBuffer too small | NVFP4-specific SymmBuffer (2× SF) |
| 16 | `ImportError` | Python wrapper |
| 17 | NVCC: `scale_vec::4X` not on sm_100f | Wrong arch: need `sm_100a` |
| 18 | `scale_vec::2X` also failed | Same cause: `sm_100a` required |
| 19 | kGranK still 16 in C++ binding | Should stay 16; was wrongly changed to 32 |
| 20 | `uint32 >> 23` fails | Cast to int32 first |
| 22 | Garbled output | Fell back to mxf8f6f4; should use mxf4nvf4 on sm_100a |

Remaining Work

- Fix DeepGEMM JIT to target sm_100a instead of sm_100
- Add NVFP4 MMA kind enum to DeepGEMM runtime (not just MXFP8FP4 with NVFP4 hat)
- Revert to Build 17's mxf4nvf4.scale_vec::4X instruction (was correct, just wrong arch)
- Revert kGranK to 16, UE4M3 scales, block16 SF layout
- Add get_sf_uttcp_aligned_block_sizes branch for block16 layout
- Remove UE4M3→UE8M0 conversion and block16→block32 merge from Python
- Verify TMEM 4X layout (i*8 stride, 4 sub-columns)
- End-to-end quality test on B200