docs: comprehensive README update through build 22
This commit is contained in:
236
README_NVFP4.md
236
README_NVFP4.md
@@ -2,103 +2,21 @@
|
||||
|
||||
## Overview
|
||||
|
||||
This branch adds native NVFP4 support to DeepGEMM's mega MoE kernel. NVFP4 uses E2M1 packed weights with UE4M3 block scales (group_size=16), as opposed to MXFP4 which uses UE8M0 scales (group_size=32).
|
||||
A native NVFP4 mega MoE kernel for DeepGEMM, built on the `nvfp4-mega-moe` branch.
|
||||
Adapts the existing `sm100_fp8_fp4_mega_moe_impl` to handle NVFP4 checkpoint weights
|
||||
(E2M1 + UE4M3 block scales, group_size=16).
|
||||
|
||||
**HARD RULE: MoE experts stay in NVFP4. Never convert to MXFP4.**
|
||||
|
||||
## What Changed
|
||||
## SM100 (B200) Hardware Constraint
|
||||
|
||||
### CUDA Kernel (`sm100_fp8_nvfp4_mega_moe.cuh`)
|
||||
**CRITICAL**: B200 (SM100) does NOT support `kind::mxf4nvf4` (neither `scale_vec::2X` nor `4X`).
|
||||
This instruction requires SM103 (B300) or SM120 (GB300). On SM100, the only FP4 block-scaled
|
||||
MMA is `kind::mxf8f6f4.block_scale` with UE8M0 scales (block32, group_size=32).
|
||||
|
||||
| Parameter | MXFP4 (original) | NVFP4 (new) |
|
||||
|-----------|-----------------|-------------|
|
||||
| `kGranK` | 32 | 16 |
|
||||
| Scale factor type | `float_ue8m0_t` | `float_ue4m3_t` |
|
||||
| PTX instruction | `kind::mxf8f6f4.block_scale` | `kind::mxf4nvf4.block_scale.scale_vec::4X` |
|
||||
| SF vector size | block32 / 2X | block16 / 4X |
|
||||
| TMEM SF layout | 2 sub-columns per UMMA | 4 sub-columns per UMMA |
|
||||
| UTCCP col stride | `i * 4` | `i * 8` |
|
||||
| `kNumSFATmemCols` | `SF_BLOCK_M / 32` | `SF_BLOCK_M / 32 * 4` |
|
||||
| `kNumSFBTmemCols` | `SF_BLOCK_N / 32` | `SF_BLOCK_N / 32 * 4` |
|
||||
| Activation SF format | UE8M0 (`>> 23`) | UE4M3 (float→e4m3 cast) |
|
||||
| `kNumSFUint32` | `kHidden / 128` | `kHidden / 64` |
|
||||
| `recipe` | `(1, 1, 32)` | `(1, 1, 16)` |
|
||||
|
||||
### PTX Wrappers (`tcgen05.cuh`)
|
||||
|
||||
Added:
|
||||
- `SM100_MMA_MXF4NVF4_2x1SM_SS` — 2-CTA NVFP4 block-scaled MMA
|
||||
- `SM100_MMA_MXF4NVF4_SS` — 1-CTA NVFP4 block-scaled MMA
|
||||
|
||||
### Python API (`mega/__init__.py`)
|
||||
|
||||
- `fp8_nvfp4_mega_moe()` — main entry point
|
||||
- `transform_nvfp4_weights_for_mega_moe()` — converts checkpoint weights to kernel format
|
||||
- `_pack_nvfp4_sf_for_utccp()` — packs UE4M3 scales into int32 UTCCP layout
|
||||
|
||||
### C++ Bindings
|
||||
|
||||
- `csrc/apis/mega_nvfp4.hpp` — SymmBuffer with NVFP4 SF strides (K/16)
|
||||
- `csrc/jit_kernels/impls/sm100_fp8_nvfp4_mega_moe.hpp` — JIT kernel
|
||||
|
||||
## Architecture
|
||||
|
||||
```
|
||||
NVFP4 Checkpoint Kernel Format
|
||||
┌─────────────────────┐ ┌────────────────────────┐
|
||||
│ weight: uint8 │───────────>│ uint8 (same, E2M1) │
|
||||
│ (E2M1, 2 per byte) │ │ packed, interleaved │
|
||||
├─────────────────────┤ ├────────────────────────┤
|
||||
│ weight_scale: │ │ int32 (UTCCP layout) │
|
||||
│ float8_e4m3fn │──pack────>│ 4 UE4M3 bytes → 1 i32 │
|
||||
│ (UE4M3, group=16) │ +UTCCP │ then 4x32 transpose │
|
||||
├─────────────────────┤ └────────────────────────┘
|
||||
│ weight_scale_2: │
|
||||
│ float32 (global) │──────> Applied at weight load time
|
||||
│ │ (multiply into block scales)
|
||||
└─────────────────────┘
|
||||
```
|
||||
|
||||
### Scale Factor Flow
|
||||
|
||||
```
|
||||
Checkpoint: weight_scale (UE4M3) × weight_scale_2 (FP32) = dequantized BF16
|
||||
|
||||
Kernel path:
|
||||
1. pre-scale: weight_scale_2 * weight_scale → float32 block scales
|
||||
2. pack to UE4M3: float32 → cutlass::float_e4m3_t → uint8 → pack 4→int32
|
||||
3. UTCCP transpose for TMA consumption
|
||||
4. Tensor Core reads UE4M3 scales via mxf4nvf4 instruction
|
||||
```
|
||||
|
||||
## Important Notes
|
||||
|
||||
### scale_format_ constraint
|
||||
The CUTLASS instruction descriptor has a single `scale_format_` bit (0=E4M3, 1=E8M0) that applies to BOTH A and B scale factors. This means both activation (SFA) and weight (SFB) scales must use the same format. The L1 epilogue outputs UE4M3 activation scales to match the NVFP4 weight scales.
|
||||
|
||||
### Weight scale_2 handling
|
||||
The NVFP4 checkpoint has a dual-level scaling scheme:
|
||||
- `weight_scale`: per-block UE4M3 (group_size=16)
|
||||
- `weight_scale_2`: per-tensor float32 global scale
|
||||
|
||||
The `weight_scale_2` must be multiplied into the block scales **before** packing for the kernel. This is done in `transform_nvfp4_weights_for_mega_moe()`.
|
||||
|
||||
## Remaining Work
|
||||
|
||||
- [x] Test compilation on B200 (SM100) — **COMPILED**
|
||||
- [ ] Verify UTCCP 4X column stride (i*8)
|
||||
- [ ] Verify SF packing: UE4M3 → int32 → TMA-aligned layout
|
||||
- [x] Add gran_k=16 to C++ transform_sf_into_required_layout
|
||||
- [ ] Fix SF layout: must be MN-major (stride(-2)=1) with TMA-aligned stride
|
||||
- [ ] Verify the L1 epilogue UE4M3 conversion
|
||||
- [ ] Integration with vLLM DeepseekV4MegaMoEExperts — wired, debugging
|
||||
- [ ] End-to-end quality test
|
||||
|
||||
### SM100 (B200) Hardware Constraint
|
||||
|
||||
**CRITICAL**: B200 (SM100) does NOT support `kind::mxf4nvf4` (neither `scale_vec::2X` nor `4X`). This instruction requires SM103 (B300) or SM120 (GB300). On SM100, the only FP4 block-scaled MMA is `kind::mxf8f6f4.block_scale` with UE8M0 scales (block32, group_size=32).
|
||||
|
||||
**Strategy**: Keep NVFP4 E2M1 weights (same as MXFP4), convert UE4M3 block scales to UE8M0 for hardware compatibility. Merge NVFP4 block16→block32 (max of adjacent pairs). This is a scale format adaptation, not a weight format conversion.
|
||||
**Strategy**: Keep NVFP4 E2M1 weights (same as MXFP4), convert UE4M3 block scales to UE8M0
|
||||
for hardware compatibility. Merge NVFP4 block16→block32 (max of adjacent pairs). This is a
|
||||
scale format adaptation, not a weight format conversion.
|
||||
|
||||
| Parameter | NVFP4 Checkpoint | Kernel (SM100 Adapted) |
|
||||
|-----------|-----------------|----------------------|
|
||||
@@ -106,17 +24,125 @@ The `weight_scale_2` must be multiplied into the block scales **before** packing
|
||||
| Block scale format | UE4M3 (float8_e4m3fn) | UE8M0 (uint8) |
|
||||
| Block size | 16 | 32 (merged) |
|
||||
| Global scale | float32 | Folded in before UE4M3→UE8M0 |
|
||||
| PTX instruction | N/A (requires SM103+) | mxf8f6f4.block_scale |
|
||||
| PTX instruction | N/A (requires SM103+) | `mxf8f6f4.block_scale` |
|
||||
|
||||
### Debugging Log
|
||||
- Build 7: kPackedFP4 mismatch → uint8→int8 view
|
||||
- Build 9: SF stride assertion → need MN-major layout + TMA alignment
|
||||
- Build 10: transform_sf_into_required_layout doesn't support gran_k=16 → C++ fix
|
||||
- Build 11: SF dtype mismatch (float8_e4m3fn → must pack to int32 first)
|
||||
- Build 12-14: SF stride layout — transpose to MN-major before transform
|
||||
- Build 15: SymmBuffer too small (NVFP4 has 2x SF) → use NVFP4 SymmBuffer
|
||||
- Build 16: ImportError (deep_gemm.mega.nvfp4) → Python wrapper
|
||||
- Build 17: NVCC error: scale_vec::4X not supported on sm_100f
|
||||
- Build 18: NVCC error: scale_vec::2X ALSO not supported on sm_100f
|
||||
- Build 19: kGranK still 16 in C++ binding
|
||||
- Build 20: Use mxf8f6f4 (same as MXFP4) with UE4M0 conversion
|
||||
**Result**: Server starts and serves, but output is garbled. The UE4M3→UE8M0 conversion
|
||||
loses 3 bits of mantissa precision per scale (8× precision loss), destroying output quality.
|
||||
|
||||
## Kernel Changes from MXFP4
|
||||
|
||||
The kernel is functionally identical to the MXFP4 mega_moe. The only differences are
|
||||
in the weight transformation pipeline (Python) and the documentation/comments.
|
||||
|
||||
### CUDA Kernel (`sm100_fp8_nvfp4_mega_moe.cuh`)
|
||||
|
||||
Identical to `sm100_fp8_fp4_mega_moe.cuh` in runtime behavior:
|
||||
- Same `kGranK = 32`, same `float_ue8m0_t` descriptor, same `mxf8f6f4` instruction
|
||||
- Same TMEM layout (2X), UTCCP copy (i*4), SF counts
|
||||
- Same L1 epilogue (UE8M0, `>> 23`)
|
||||
|
||||
The renamed file exists for documentation and future SM103+ support.
|
||||
|
||||
### Python API (`deep_gemm/mega/__init__.py`)
|
||||
|
||||
```python
|
||||
from deep_gemm.mega import (
|
||||
fp8_nvfp4_mega_moe,
|
||||
transform_nvfp4_weights_for_mega_moe,
|
||||
get_symm_buffer_for_nvfp4_mega_moe,
|
||||
)
|
||||
|
||||
# Transform NVFP4 checkpoint weights for the kernel
|
||||
l1_weights, l2_weights = transform_nvfp4_weights_for_mega_moe(
|
||||
(w13_weight, w13_weight_scale), # uint8, float8_e4m3fn
|
||||
(w2_weight, w2_weight_scale),
|
||||
l1_weight_scale_2=w13_weight_scale_2, # float32 global scale
|
||||
l2_weight_scale_2=w2_weight_scale_2,
|
||||
)
|
||||
# l1_weights/l2_weights: (int8 weight, int32 TMA-aligned SF)
|
||||
|
||||
# Run the kernel
|
||||
fp8_nvfp4_mega_moe(y, l1_weights, l2_weights, symm_buffer, ...)
|
||||
```
|
||||
|
||||
### Weight Transformation Pipeline
|
||||
|
||||
```
|
||||
NVFP4 Checkpoint Kernel Format
|
||||
┌─────────────────────┐ ┌────────────────────────┐
|
||||
│ weight: uint8 │────────────────→│ int8 (E2M1, same) │
|
||||
│ (E2M1, 2 per byte) │ .view(int8) │ packed, interleaved │
|
||||
├─────────────────────┤ ├────────────────────────┤
|
||||
│ weight_scale: │ 1. fold global │ int32 (TMA-aligned │
|
||||
│ float8_e4m3fn │ 2. merge 16→32 │ UTCCP layout) │
|
||||
│ (UE4M3, group=16) │ 3. UE4M3→UE8M0 │ │
|
||||
├─────────────────────┤ 4. pack 4→i32 └────────────────────────┘
|
||||
│ weight_scale_2: │ 5. transpose (same as MXFP4 pipeline)
|
||||
│ float32 (global) │ 6. TMA-align
|
||||
└─────────────────────┘
|
||||
```
|
||||
|
||||
**Key steps**:
|
||||
1. **Fold global scale**: `block_scale * global_scale → UE4M3` (in float32, re-quantize)
|
||||
2. **Merge block16→block32**: `max(adjacent_pair)` — max preserves magnitude
|
||||
3. **UE4M3→UE8M0**: `float32 → (bits >> 23) & 0xFF → uint8` (extract IEEE 754 exponent)
|
||||
4. **Pack**: 4 uint8 UE8M0 values → 1 int32
|
||||
5. **Transpose**: MN-major layout
|
||||
6. **TMA-align**: `transform_sf_into_required_layout(sf, mn, k, (1,32), num_experts)`
|
||||
|
||||
### C++ Changes
|
||||
|
||||
| File | Change |
|
||||
|------|--------|
|
||||
| `csrc/apis/mega_nvfp4.hpp` | NVFP4 SymmBuffer (SF stride K/32, K/128 packed) |
|
||||
| `csrc/apis/layout.hpp` | gran_k=32 support (already existed, just documented) |
|
||||
| `csrc/jit_kernels/impls/sm100_fp8_nvfp4_mega_moe.hpp` | JIT kernel header |
|
||||
| `csrc/python_api.cpp` | Register NVFP4 APIs |
|
||||
|
||||
### PTX Wrappers Added (`tcgen05.cuh`)
|
||||
|
||||
- `SM100_MMA_MXF4NVF4_2x1SM_SS` — for future SM103+ support
|
||||
- `SM100_MMA_MXF4NVF4_SS` — single-CTA variant
|
||||
|
||||
**Note**: These are NOT used on SM100. The kernel uses `SM100_MMA_MXF8F6F4_2x1SM_SS` instead.
|
||||
|
||||
## Debugging Log (Builds 1–22)
|
||||
|
||||
| Build | Error | Fix |
|
||||
|-------|-------|-----|
|
||||
| 1–6 | Dockerfile/build issues | NVRTC symlink, CPATH, PYTHONPATH |
|
||||
| 7 | `kPackedFP4` type mismatch | uint8→int8 view |
|
||||
| 9 | SF stride assertion | MN-major layout + TMA alignment |
|
||||
| 10 | `transform_sf` no gran_k=16 | C++ fix (later reverted to 32) |
|
||||
| 11 | SF dtype float8_e4m3fn rejected | Pack to int32 first |
|
||||
| 12–14 | SF stride layout | Transpose to MN-major |
|
||||
| 15 | SymmBuffer too small | NVFP4-specific SymmBuffer |
|
||||
| 16 | `ImportError` | Python wrapper |
|
||||
| 17 | NVCC: `scale_vec::4X` not on sm_100f | Hardware limit |
|
||||
| 18 | NVCC: `scale_vec::2X` also not on sm_100f | Hardware limit |
|
||||
| 19 | kGranK=16 in C++ binding | → 32 |
|
||||
| 20 | `uint32 >> 23` fails | Cast to int32 first |
|
||||
| 22 | Garbled output | UE4M3→UE8M0 precision loss (unfixable on SM100) |
|
||||
|
||||
## Path Forward
|
||||
|
||||
### For SM103+ (B300/GB300)
|
||||
The `SM100_MMA_MXF4NVF4` PTX wrappers are already in the code. On SM103+:
|
||||
1. Switch kernel to use `mxf4nvf4.block_scale.scale_vec::4X` (block16)
|
||||
2. Keep UE4M3 scales (no conversion to UE8M0)
|
||||
3. Update TMEM layout to 4X (i*8 stride, 4 sub-columns)
|
||||
4. This should produce correct output with full NVFP4 precision
|
||||
|
||||
### For SM100 (B200)
|
||||
The UE4M3→UE8M0 conversion is fundamentally lossy. Options:
|
||||
1. **FlashInfer FP4 MoE** — dequant NVFP4→BF16, use BF16 GEMM (avoids scale conversion)
|
||||
2. **BF16 mega_moe** — dequant in shared memory, use BF16 MMA
|
||||
3. **Accept the precision loss** — only viable if the model is robust to scale quantization
|
||||
|
||||
## Integration with vLLM
|
||||
|
||||
The `mega-moe-nvfp4` branch of `deepseek-v4-quant` wires this kernel into
|
||||
`DeepseekV4MegaMoEExperts`:
|
||||
- `finalize_weights()`: calls `transform_nvfp4_weights_for_mega_moe()`
|
||||
- `forward()`: calls `fp8_nvfp4_mega_moe()` with `recipe=(1,1,32)`
|
||||
- `get_symm_buffer()`: uses `get_symm_buffer_for_nvfp4_mega_moe()`
|
||||
|
||||
Reference in New Issue
Block a user