README_NVFP4.md

# DeepGEMM NVFP4 Mega MoE Kernel

## Overview

A native NVFP4 mega MoE kernel for DeepGEMM that uses `kind::mxf4nvf4.block_scale.scale_vec::4X`
to consume NVFP4 weights (E2M1 + UE4M3 block scales, group_size=16) directly on B200 (SM100a).

**HARD RULE: MoE experts stay in NVFP4. Never convert to MXFP4.**

## SM100a (B200) Hardware Support

**B200 (SM100a) DOES support `kind::mxf4nvf4` with `scale_vec::4X`** (block16, UE4M3 scales).
Documented in PTX ISA 8.7 (CUDA 12.8+), confirmed by NVIDIA/CUTLASS/Colfax.

The key requirement: target **`sm_100a`** (not `sm_100`). The `a` suffix enables the FP4
block-scaled instructions, including `mxf4nvf4`. Without it, compilation fails with
"Feature '.scale_vec::4X' not supported on .target 'sm_100f'". The error names the family
target `sm_100f`, so that target is rejected as well as plain `sm_100`.

## Kernel Architecture (TARGET)

```
sm100_fp8_nvfp4_mega_moe_impl
├── kGranK = 16 (NVFP4 native block size)
├── kind::mxf4nvf4.block_scale.scale_vec::4X PTX instruction
├── float_ue4m3_t instruction descriptor
├── SF layout: scale_vec::4X, 4 TMEM sub-columns per UMMA atom
├── UTCCP copy: i*8 stride (4X layout, 8 TMEM cols per 128-element group)
├── kNumSFATmemCols = SF_BLOCK_M / 32 * 4
├── kNumSFBTmemCols = SF_BLOCK_N / 32 * 4
├── kNumSFUint32 = kHidden / 64 (4 UE4M3 per int32)
├── UE4M3 L1 epilogue (float → cutlass::float_e4m3_t cast, sign bit cleared)
└── recipe = (1, 1, 16)
```

## Weight Transformation Pipeline

```
NVFP4 Checkpoint                         Kernel Format
┌─────────────────────┐                 ┌────────────────────────┐
│ weight: uint8       │────────────────→│ int8 (E2M1, same)     │
│ (E2M1, 2 per byte)  │  .view(int8)    │ packed, interleaved    │
├─────────────────────┤                 ├────────────────────────┤
│ weight_scale:       │ 1. fold global  │ int32 (TMA-aligned     │
│ float8_e4m3fn       │ 2. pack 4→i32   │  UTCCP layout,         │
│ (UE4M3, group=16)   │ 3. transpose    │  gran_k=16)            │
├─────────────────────┤ 4. TMA-align    └────────────────────────┘
│ weight_scale_2:     │
│ float32 (global)    │──folded into block scales before packing
└─────────────────────┘
```

**NO UE4M3→UE8M0 conversion. NO block16→block32 merge.** The kernel consumes
native UE4M3 scales with block16 grouping.
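
The `.view(int8)` reinterpretation and the "pack 4→i32" step can be sketched in plain Python. This is a minimal illustration, not the repository's code: the helper names `pack_ue4m3_to_int32` and `view_uint8_as_int8` are hypothetical, and little-endian byte order inside each int32 is an assumption. The transpose and TMA-alignment steps are not shown.

```python
import struct


def view_uint8_as_int8(values):
    """Reinterpret unsigned bytes as signed bytes without changing any bits."""
    return [v - 256 if v >= 128 else v for v in values]


def pack_ue4m3_to_int32(scale_bytes: bytes) -> list:
    """Pack every 4 consecutive UE4M3 scale bytes into one signed int32.

    Assumption: little-endian packing, so the first scale of each group
    lands in the lowest byte of the word.
    """
    if len(scale_bytes) % 4 != 0:
        raise ValueError("number of scales must be a multiple of 4")
    count = len(scale_bytes) // 4
    return list(struct.unpack(f"<{count}i", scale_bytes))


# Four UE4M3 scales (1.0, 2.0, 4.0, 8.0) become a single int32 word.
packed = pack_ue4m3_to_int32(bytes([0x38, 0x40, 0x48, 0x50]))
print(hex(packed[0]))
```

Each int32 carries 4 scales and each scale covers 16 elements, so one packed word spans 64 elements along K. That is where `kNumSFUint32 = kHidden / 64` comes from.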

## Key Differences from MXFP4 mega_moe

| Parameter | MXFP4 | NVFP4 (this kernel) |
|-----------|-------|---------------------|
| `kGranK` | 32 | 16 |
| PTX instruction | `mxf8f6f4.block_scale` | `mxf4nvf4.block_scale.scale_vec::4X` |
| Scale factor type | `float_ue8m0_t` | `float_ue4m3_t` |
| SF vector size | block32 / 2X | block16 / 4X |
| TMEM SF cols (SFA) | `SF_BLOCK_M / 32` | `SF_BLOCK_M / 32 * 4` |
| UTCCP col stride | `i * 4` | `i * 8` |
| `kNumSFUint32` | `kHidden / 128` | `kHidden / 64` |
| L1 epilogue | UE8M0 (`>> 23`) | UE4M3 (float→e4m3 cast) |
| recipe | `(1, 1, 32)` | `(1, 1, 16)` |
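
The size formulas in the table can be checked with a small calculation. The sketch below only restates the table rows; `SFLayout`, `derived_sizes`, and the example sizes (`hidden=4096`, `sf_block_m=128`) are hypothetical names and values chosen for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SFLayout:
    gran_k: int         # elements covered by one scale factor
    tmem_col_mult: int  # multiplier applied to SF_BLOCK_M / 32
    utccp_stride: int   # column stride per UTCCP copy (i * stride)


MXFP4 = SFLayout(gran_k=32, tmem_col_mult=1, utccp_stride=4)
NVFP4 = SFLayout(gran_k=16, tmem_col_mult=4, utccp_stride=8)

SCALES_PER_INT32 = 4  # four 1-byte scales per packed word


def derived_sizes(layout: SFLayout, hidden: int, sf_block_m: int) -> dict:
    """Compute the two derived constants listed in the comparison table."""
    return {
        "kNumSFUint32": hidden // (layout.gran_k * SCALES_PER_INT32),
        "kNumSFATmemCols": sf_block_m // 32 * layout.tmem_col_mult,
    }


print("MXFP4:", derived_sizes(MXFP4, hidden=4096, sf_block_m=128))
print("NVFP4:", derived_sizes(NVFP4, hidden=4096, sf_block_m=128))
```

For the same `hidden`, NVFP4 needs twice as many packed scale words as MXFP4, which is consistent with the "2× SF" SymmBuffer fix recorded for build 15.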

## Critical Implementation Details

### scale_format_ constraint
The CUTLASS instruction descriptor has a single `scale_format_` bit (0=E4M3, 1=E8M0)
that applies to BOTH A and B scale factors. For NVFP4 (E4M3), both activation (SFA)
and weight (SFB) scales must use UE4M3. The L1 epilogue outputs UE4M3 activation scales
(float → `cutlass::float_e4m3_t` with sign bit cleared).
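
To make the float → UE4M3 step concrete, here is a reference encoder in plain Python. It is a sketch under stated assumptions, not the kernel's epilogue: `float_to_ue4m3` is a hypothetical name, round-to-nearest-even and saturation at 448 are assumptions, and the exact behavior of the `cutlass::float_e4m3_t` cast may differ at the edges.

```python
import math

E4M3_BIAS = 7
E4M3_MAX_CODE = 0x7E     # 448.0, the largest finite E4M3 value
E4M3_NAN_CODE = 0x7F
MIN_NORMAL = 2.0 ** -6   # smallest normal E4M3 value


def float_to_ue4m3(x: float) -> int:
    """Encode a float as an 8-bit E4M3 code with the sign bit cleared."""
    if math.isnan(x):
        return E4M3_NAN_CODE
    x = abs(x)                       # "sign bit cleared"
    if x >= 448.0:
        return E4M3_MAX_CODE         # assumption: saturate, no infinity
    if x < MIN_NORMAL:
        # Subnormals are multiples of 2^-9; a result of 8 is the code of
        # the smallest normal, so the carry works out naturally.
        return round(x * 512)
    frac, e = math.frexp(x)          # x = frac * 2**e, frac in [0.5, 1)
    exp = e - 1                      # x = (2 * frac) * 2**exp
    mant = round((2.0 * frac - 1.0) * 8)   # 3 mantissa bits, ties to even
    if mant == 8:                    # mantissa overflow carries into exp
        mant, exp = 0, exp + 1
    return min(((exp + E4M3_BIAS) << 3) | mant, E4M3_MAX_CODE)


for value in (0.0, 1.0, 1.5, -2.0, 448.0, 1000.0):
    print(value, hex(float_to_ue4m3(value)))
```

Because both SFA and SFB share one `scale_format_` bit, activation scales produced this way use the same encoding as the checkpoint's weight scales.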

### Arch flag
The JIT compiler MUST target `sm_100a`, not `sm_100`. Without the `a` suffix, the
`mxf4nvf4` instruction is unavailable and compilation will fail with
"Feature '.scale_vec::4X' not supported on .target 'sm_100f'".

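A guard like the following could catch the wrong target before any kernel is compiled. It is only a sketch: `nvcc_arch_flags` is a hypothetical helper, not part of DeepGEMM's JIT, and it emits the standard nvcc `-gencode arch=compute_XXX,code=sm_XXX` form.

```python
def nvcc_arch_flags(target: str) -> list:
    """Return nvcc -gencode flags for an arch-specific target such as 'sm_100a'.

    Hypothetical helper: rejects targets without the 'a' suffix, since the
    README states that mxf4nvf4 with scale_vec::4X requires sm_100a.
    """
    if not target.startswith("sm_"):
        raise ValueError(f"expected a target like 'sm_100a', got {target!r}")
    arch = target[len("sm_"):]       # e.g. '100a'
    if not arch.endswith("a"):
        raise ValueError(
            f"{target!r} lacks the 'a' suffix required for mxf4nvf4; use 'sm_100a'"
        )
    return ["-gencode", f"arch=compute_{arch},code=sm_{arch}"]


print(nvcc_arch_flags("sm_100a"))
```

Both `sm_100` and `sm_100f` raise `ValueError` here, mirroring the compile-time failure described above.
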
### Weight scale_2 folding
The NVFP4 checkpoint has dual-level scaling: per-block UE4M3 + per-tensor float32.
The `weight_scale_2` must be folded into the block scales before packing:
`effective_scale = block_scale * global_scale`, then re-quantize to UE4M3.
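
The folding step can be sketched as decode, multiply, and re-quantize to the nearest representable value. This is an illustrative reference in plain Python, not the repository's implementation: `ue4m3_to_float`, `nearest_ue4m3`, and `fold_global_scale` are hypothetical names, and clamping to the finite range [0, 448] with ties going to the even code is an assumption.

```python
import bisect


def ue4m3_to_float(code: int) -> float:
    """Decode a sign-cleared E4M3 code (0x00..0x7E) to a float."""
    exp, mant = (code >> 3) & 0xF, code & 0x7
    if exp == 0:
        return mant * 2.0 ** -9                    # subnormal
    return (1.0 + mant / 8.0) * 2.0 ** (exp - 7)   # normal, bias 7


# All 127 finite non-negative values, ascending; index == code.
# 0x7F is NaN in e4m3fn and is deliberately left out.
FINITE_VALUES = [ue4m3_to_float(c) for c in range(0x7F)]


def nearest_ue4m3(x: float) -> int:
    """Return the code of the representable value closest to x."""
    x = min(max(x, 0.0), FINITE_VALUES[-1])        # clamp to [0, 448]
    hi = bisect.bisect_left(FINITE_VALUES, x)
    if hi == 0:
        return 0
    lo = hi - 1
    d_lo, d_hi = x - FINITE_VALUES[lo], FINITE_VALUES[hi] - x
    if d_lo != d_hi:
        return lo if d_lo < d_hi else hi
    return lo if lo % 2 == 0 else hi               # tie: pick the even code


def fold_global_scale(block_scales: bytes, global_scale: float) -> bytes:
    """effective_scale = block_scale * global_scale, re-quantized to UE4M3."""
    return bytes(
        nearest_ue4m3(ue4m3_to_float(code) * global_scale) for code in block_scales
    )


print(fold_global_scale(bytes([0x38, 0x40]), 2.0).hex())
```

Re-quantization is lossy whenever the product is not exactly representable, and products outside [2^-9, 448] underflow or saturate, so it is worth checking the range of `weight_scale_2` in the actual checkpoint before relying on this step.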

## Build History

| Build | Error | Fix |
|-------|-------|-----|
| 1–6 | Dockerfile/build issues | NVRTC symlink, CPATH, PYTHONPATH |
| 7 | `kPackedFP4` type mismatch | uint8→int8 view |
| 9 | SF stride assertion | MN-major layout + TMA alignment |
| 10 | `transform_sf` no gran_k=16 | C++ fix |
| 11 | SF dtype float8_e4m3fn rejected | Pack UE4M3→int32 first |
| 12–14 | SF stride layout | Transpose to MN-major |
| 15 | SymmBuffer too small | NVFP4-specific SymmBuffer (2× SF) |
| 16 | `ImportError` | Python wrapper |
| **17** | **NVCC: `scale_vec::4X` not on sm_100f** | **Wrong arch: need `sm_100a`** |
| 18 | `scale_vec::2X` also failed | Same — `sm_100a` required |
| 19 | kGranK still 16 in C++ binding | Should stay 16 — was wrongly changed to 32 |
| 20 | `uint32 >> 23` fails | Cast to int32 first |
| 22 | Garbled output | Fell back to mxf8f6f4 — should use mxf4nvf4 on sm_100a |

## Remaining Work

- [ ] Fix DeepGEMM JIT to target `sm_100a` instead of `sm_100`
- [ ] Add a dedicated NVFP4 MMA kind enum to the DeepGEMM runtime (rather than reusing MXFP8FP4 for NVFP4)
- [ ] Revert to Build 17's `mxf4nvf4.scale_vec::4X` instruction (was correct, just wrong arch)
- [ ] Revert `kGranK` to 16, UE4M3 scales, block16 SF layout
- [ ] Add `get_sf_uttcp_aligned_block_sizes` branch for block16 layout
- [ ] Remove UE4M3→UE8M0 conversion and block16→block32 merge from Python
- [ ] Verify TMEM 4X layout (i*8 stride, 4 sub-columns)
- [ ] End-to-end quality test on B200