DeepGEMM/README_NVFP4.md
biondizzle e80fe9af60 docs: CORRECTED — mxf4nvf4 IS supported on sm_100a (B200)
The build 17-18 'scale_vec not supported on sm_100f' error was because
we targeted sm_100 instead of sm_100a. The 'a' suffix is required for
FP4 block-scaled MMA instructions. Reverting to mxf4nvf4 with correct
arch target is the path forward.
2026-05-11 14:24:55 +00:00

# DeepGEMM NVFP4 Mega MoE Kernel
## Overview
A native NVFP4 mega MoE kernel for DeepGEMM that uses `kind::mxf4nvf4.block_scale.scale_vec::4X`
to consume NVFP4 weights (E2M1 + UE4M3 block scales, group_size=16) directly on B200 (SM100a).
**HARD RULE: MoE experts stay in NVFP4. Never convert to MXFP4.**
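To make the format concrete, here is a minimal sketch of how one NVFP4 group (16 E2M1 values sharing one UE4M3 scale) dequantizes. The helper names are hypothetical, and the nibble order (low nibble first) is an assumption about the checkpoint packing, not something confirmed here. The per-tensor `weight_scale_2` is left out.

```python
# E2M1 magnitudes indexed by the low 3 bits of a 4-bit code; bit 3 is the sign.
E2M1_MAGNITUDES = (0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0)


def decode_e2m1(code: int) -> float:
    """Decode one 4-bit E2M1 code (1 sign, 2 exponent, 1 mantissa bit)."""
    magnitude = E2M1_MAGNITUDES[code & 0x7]
    return -magnitude if code & 0x8 else magnitude


def decode_ue4m3(byte: int) -> float:
    """Decode an E4M3 byte whose sign bit is clear (bias 7, 3 mantissa bits)."""
    if byte & 0x7F == 0x7F:
        return float("nan")  # the only NaN encoding in float8_e4m3fn
    exponent = (byte >> 3) & 0xF
    mantissa = byte & 0x7
    if exponent == 0:
        return mantissa / 8.0 * 2.0 ** -6  # subnormal range
    return (1.0 + mantissa / 8.0) * 2.0 ** (exponent - 7)


def dequantize_block(packed: bytes, scale_byte: int) -> list[float]:
    """Dequantize one group of 16 E2M1 values (8 bytes) with one UE4M3 scale.

    Assumption: the low nibble of each byte holds the even-indexed element.
    """
    if len(packed) != 8:
        raise ValueError("one group is 16 values, i.e. 8 packed bytes")
    scale = decode_ue4m3(scale_byte)
    values = []
    for byte in packed:
        values.append(decode_e2m1(byte & 0xF) * scale)
        values.append(decode_e2m1(byte >> 4) * scale)
    return values
```

For example, `dequantize_block(bytes([0x72] + [0] * 7), 0x40)` starts with `2.0, 12.0`: the nibbles decode to 1.0 and 6.0, and the scale byte `0x40` decodes to 2.0.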
## SM100a (B200) Hardware Support
**B200 (SM100a) DOES support `kind::mxf4nvf4` with `scale_vec::4X`** (block16, UE4M3 scales).
Documented in PTX ISA 8.7 (CUDA 12.8+), confirmed by NVIDIA/CUTLASS/Colfax.
The key requirement: target **`sm_100a`** (not `sm_100`). The `a` suffix enables the FP4
block-scaled instructions including `mxf4nvf4`. Targeting plain `sm_100` will produce
"Feature '.scale_vec::4X' not supported on .target 'sm_100f'" errors.
## Kernel Architecture (TARGET)
```
sm100_fp8_nvfp4_mega_moe_impl
├── kGranK = 16 (NVFP4 native block size)
├── kind::mxf4nvf4.block_scale.scale_vec::4X PTX instruction
├── float_ue4m3_t instruction descriptor
├── SF layout: scale_vec::4X, 4 TMEM sub-columns per UMMA atom
├── UTCCP copy: i*8 stride (4X layout, 8 TMEM cols per 128-element group)
├── kNumSFATmemCols = SF_BLOCK_M / 32 * 4
├── kNumSFBTmemCols = SF_BLOCK_N / 32 * 4
├── kNumSFUint32 = kHidden / 64 (4 UE4M3 per int32)
├── UE4M3 L1 epilogue (float → cutlass::float_e4m3_t cast, sign bit cleared)
└── recipe = (1, 1, 16)
```
## Weight Transformation Pipeline
```
NVFP4 Checkpoint                        Kernel Format
┌─────────────────────┐                 ┌────────────────────────┐
│ weight: uint8       │────────────────→│ int8 (E2M1, same)      │
│ (E2M1, 2 per byte)  │   .view(int8)   │ packed, interleaved    │
├─────────────────────┤                 ├────────────────────────┤
│ weight_scale:       │ 1. fold global  │ int32 (TMA-aligned     │
│ float8_e4m3fn       │ 2. pack 4→i32   │ UTCCP layout,          │
│ (UE4M3, group=16)   │ 3. transpose    │ gran_k=16)             │
├─────────────────────┤ 4. TMA-align    └────────────────────────┘
│ weight_scale_2:     │
│ float32 (global)    │──folded into block scales before packing
└─────────────────────┘
```
**NO UE4M3→UE8M0 conversion. NO block16→block32 merge.** The kernel consumes
native UE4M3 scales with block16 grouping.
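The two byte-level steps of the pipeline (the `uint8` to `int8` view and "pack 4→i32") can be sketched with the standard library alone. This is an illustration, not the project's transform code: the function names are hypothetical, and little-endian packing (first scale in the lowest byte) is an assumption. The transpose and TMA-alignment steps are not shown.

```python
import struct


def view_uint8_as_int8(values: bytes) -> list[int]:
    """Reinterpret packed E2M1 bytes as signed int8 without changing any bits."""
    return list(struct.unpack(f"{len(values)}b", values))


def pack_scales_to_int32(scale_bytes: bytes) -> list[int]:
    """Pack every 4 consecutive UE4M3 scale bytes into one int32.

    Assumption: little-endian order, so the first scale lands in the low byte.
    """
    if len(scale_bytes) % 4 != 0:
        raise ValueError("number of scales must be a multiple of 4")
    count = len(scale_bytes) // 4
    return list(struct.unpack(f"<{count}i", scale_bytes))
```

With `gran_k=16`, a row of 128 weights carries 8 scale bytes and therefore packs into 2 words, which matches `kNumSFUint32 = kHidden / 64`.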
## Key Differences from MXFP4 mega_moe
| Parameter | MXFP4 | NVFP4 (this kernel) |
|-----------|-------|---------------------|
| `kGranK` | 32 | 16 |
| PTX instruction | `mxf8f6f4.block_scale` | `mxf4nvf4.block_scale.scale_vec::4X` |
| Scale factor type | `float_ue8m0_t` | `float_ue4m3_t` |
| SF vector size | block32 / 2X | block16 / 4X |
| TMEM SF cols (SFA) | `SF_BLOCK_M / 32` | `SF_BLOCK_M / 32 * 4` |
| UTCCP col stride | `i * 4` | `i * 8` |
| `kNumSFUint32` | `kHidden / 128` | `kHidden / 64` |
| L1 epilogue | UE8M0 (`>> 23`) | UE4M3 (float→e4m3 cast) |
| recipe | `(1, 1, 32)` | `(1, 1, 16)` |
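The size-related rows of the table follow from two numbers per format: the block size and the number of TMEM scale-factor columns per 32 rows. The sketch below recomputes them; `SFConfig` and `derive` are hypothetical names, the block sizes and hidden size in the usage note are illustrative, and treating SFB like SFA for MXFP4 is an assumption (the table only lists SFA).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SFConfig:
    gran_k: int               # elements covered by one scale factor
    sf_cols_per_32_rows: int  # TMEM columns per 32 rows of the SF block
    utccp_col_stride: int     # multiplier applied to i in the UTCCP copy


MXFP4 = SFConfig(gran_k=32, sf_cols_per_32_rows=1, utccp_col_stride=4)
NVFP4 = SFConfig(gran_k=16, sf_cols_per_32_rows=4, utccp_col_stride=8)


def derive(cfg: SFConfig, sf_block_m: int, sf_block_n: int, hidden: int) -> dict:
    """Recompute the derived constants from the table for one format."""
    if sf_block_m % 32 or sf_block_n % 32:
        raise ValueError("SF block sizes must be multiples of 32")
    if hidden % (cfg.gran_k * 4):
        raise ValueError("hidden must hold a whole number of packed SF words")
    return {
        "kNumSFATmemCols": sf_block_m // 32 * cfg.sf_cols_per_32_rows,
        "kNumSFBTmemCols": sf_block_n // 32 * cfg.sf_cols_per_32_rows,
        "kNumSFUint32": hidden // (cfg.gran_k * 4),  # 4 scales per 32-bit word
        "recipe": (1, 1, cfg.gran_k),
    }
```

For `sf_block_m = sf_block_n = 128` and `hidden = 4096`, `derive(NVFP4, 128, 128, 4096)` gives 16 TMEM columns per operand and 64 packed words, against 4 columns and 32 words for `MXFP4`.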
## Critical Implementation Details
### scale_format_ constraint
The CUTLASS instruction descriptor has a single `scale_format_` bit (0=E4M3, 1=E8M0)
that applies to BOTH A and B scale factors. For NVFP4 (E4M3), both activation (SFA)
and weight (SFB) scales must use UE4M3. The L1 epilogue outputs UE4M3 activation scales
(float → `cutlass::float_e4m3_t` with sign bit cleared).
### Arch flag
The JIT compiler MUST target `sm_100a`, not `sm_100`. Without the `a` suffix, the
`mxf4nvf4` instruction is unavailable and compilation will fail with
"Feature '.scale_vec::4X' not supported on .target 'sm_100f'".
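As an illustration of the flag in question, the helper below builds an NVCC `-gencode` option with the architecture-specific suffix and rejects a target that lacks it. It is a hypothetical sketch and does not reflect how the DeepGEMM JIT actually assembles its command line.

```python
def nvcc_arch_flag(compute_capability: str = "100", arch_specific: bool = True) -> str:
    """Build a -gencode option; arch_specific=True appends the 'a' suffix."""
    arch = compute_capability + ("a" if arch_specific else "")
    return f"-gencode=arch=compute_{arch},code=sm_{arch}"


def require_arch_specific(flag: str) -> str:
    """Fail early if a flag targets plain sm_100 instead of sm_100a."""
    if not flag.endswith("a"):
        raise ValueError(f"FP4 block-scaled MMA needs an 'a' target, got: {flag}")
    return flag
```

`require_arch_specific(nvcc_arch_flag())` returns `-gencode=arch=compute_100a,code=sm_100a`, while a flag built with `arch_specific=False` raises before any compilation is attempted.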
### Weight scale_2 folding
The NVFP4 checkpoint has dual-level scaling: per-block UE4M3 + per-tensor float32.
The `weight_scale_2` must be folded into the block scales before packing:
`effective_scale = block_scale * global_scale`, then re-quantize to UE4M3.
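A minimal sketch of the fold, operating on raw UE4M3 bytes. The function names are hypothetical, and the rounding behaviour (round to nearest, ties to even, saturate at 448) is an assumption rather than a statement about what the actual conversion does.

```python
def decode_ue4m3(byte: int) -> float:
    """Decode a finite E4M3 byte with the sign bit clear (bias 7)."""
    exponent = (byte >> 3) & 0xF
    mantissa = byte & 0x7
    if exponent == 0:
        return mantissa / 8.0 * 2.0 ** -6  # subnormal range
    return (1.0 + mantissa / 8.0) * 2.0 ** (exponent - 7)


# Every finite non-negative code; 0x7F is NaN in float8_e4m3fn and is excluded.
_CODES = range(0x7F)
_VALUES = [decode_ue4m3(code) for code in _CODES]


def encode_ue4m3(value: float) -> int:
    """Quantize a non-negative float to the nearest UE4M3 code.

    Assumption: ties go to the even code, and values above 448 saturate.
    """
    if value != value or value < 0.0:
        raise ValueError("UE4M3 scales must be non-negative and not NaN")
    value = min(value, 448.0)
    return min(_CODES, key=lambda code: (abs(_VALUES[code] - value), code & 1))


def fold_global_scale(block_scales: bytes, global_scale: float) -> bytes:
    """effective_scale = block_scale * global_scale, re-quantized to UE4M3."""
    return bytes(encode_ue4m3(decode_ue4m3(b) * global_scale) for b in block_scales)
```

For example, `fold_global_scale(bytes([0x38, 0x40]), 0.5).hex()` is `"3038"`: block scales 1.0 and 2.0 become 0.5 and 1.0. Because the result is re-quantized, a global scale that pushes products above 448 or into the subnormal range loses precision in this sketch, so the value range is worth checking on real checkpoints.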
## Build History
| Build | Error | Fix |
|-------|-------|-----|
| 1-6 | Dockerfile/build issues | NVRTC symlink, CPATH, PYTHONPATH |
| 7 | `kPackedFP4` type mismatch | uint8→int8 view |
| 9 | SF stride assertion | MN-major layout + TMA alignment |
| 10 | `transform_sf` no gran_k=16 | C++ fix |
| 11 | SF dtype float8_e4m3fn rejected | Pack UE4M3→int32 first |
| 12-14 | SF stride layout | Transpose to MN-major |
| 15 | SymmBuffer too small | NVFP4-specific SymmBuffer (2× SF) |
| 16 | `ImportError` | Python wrapper |
| **17** | **NVCC: `scale_vec::4X` not on sm_100f** | **Wrong arch: need `sm_100a`** |
| 18 | `scale_vec::2X` also failed | Same — `sm_100a` required |
| 19 | kGranK still 16 in C++ binding | Should stay 16 — was wrongly changed to 32 |
| 20 | `uint32 >> 23` fails | Cast to int32 first |
| 22 | Garbled output | Fell back to mxf8f6f4 — should use mxf4nvf4 on sm_100a |
## Remaining Work
- [ ] Fix DeepGEMM JIT to target `sm_100a` instead of `sm_100`
- [ ] Add a dedicated NVFP4 MMA kind enum to the DeepGEMM runtime (instead of reusing MXFP8FP4 to stand in for NVFP4)
- [ ] Revert to Build 17's `mxf4nvf4.scale_vec::4X` instruction (was correct, just wrong arch)
- [ ] Revert `kGranK` to 16, UE4M3 scales, block16 SF layout
- [ ] Add `get_sf_uttcp_aligned_block_sizes` branch for block16 layout
- [ ] Remove UE4M3→UE8M0 conversion and block16→block32 merge from Python
- [ ] Verify TMEM 4X layout (i*8 stride, 4 sub-columns)
- [ ] End-to-end quality test on B200