From c2f4a3078065d3031ab72fe649acb0b224d49b54 Mon Sep 17 00:00:00 2001 From: biondizzle Date: Mon, 11 May 2026 13:55:17 +0000 Subject: [PATCH] docs: comprehensive README update through build 22 --- README_NVFP4.md | 236 +++++++++++++++++++++++++++--------------------- 1 file changed, 131 insertions(+), 105 deletions(-) diff --git a/README_NVFP4.md b/README_NVFP4.md index e93fdf9..a1373c2 100644 --- a/README_NVFP4.md +++ b/README_NVFP4.md @@ -2,103 +2,21 @@ ## Overview -This branch adds native NVFP4 support to DeepGEMM's mega MoE kernel. NVFP4 uses E2M1 packed weights with UE4M3 block scales (group_size=16), as opposed to MXFP4 which uses UE8M0 scales (group_size=32). +A native NVFP4 mega MoE kernel for DeepGEMM, built on the `nvfp4-mega-moe` branch. +Adapts the existing `sm100_fp8_fp4_mega_moe_impl` to handle NVFP4 checkpoint weights +(E2M1 + UE4M3 block scales, group_size=16). **HARD RULE: MoE experts stay in NVFP4. Never convert to MXFP4.** -## What Changed +## SM100 (B200) Hardware Constraint -### CUDA Kernel (`sm100_fp8_nvfp4_mega_moe.cuh`) +**CRITICAL**: B200 (SM100) does NOT support `kind::mxf4nvf4` (neither `scale_vec::2X` nor `4X`). +This instruction requires SM103 (B300) or SM120 (GB300). On SM100, the only FP4 block-scaled +MMA is `kind::mxf8f6f4.block_scale` with UE8M0 scales (block32, group_size=32). -| Parameter | MXFP4 (original) | NVFP4 (new) | -|-----------|-----------------|-------------| -| `kGranK` | 32 | 16 | -| Scale factor type | `float_ue8m0_t` | `float_ue4m3_t` | -| PTX instruction | `kind::mxf8f6f4.block_scale` | `kind::mxf4nvf4.block_scale.scale_vec::4X` | -| SF vector size | block32 / 2X | block16 / 4X | -| TMEM SF layout | 2 sub-columns per UMMA | 4 sub-columns per UMMA | -| UTCCP col stride | `i * 4` | `i * 8` | -| `kNumSFATmemCols` | `SF_BLOCK_M / 32` | `SF_BLOCK_M / 32 * 4` | -| `kNumSFBTmemCols` | `SF_BLOCK_N / 32` | `SF_BLOCK_N / 32 * 4` | -| Activation SF format | UE8M0 (`>> 23`) | UE4M3 (float→e4m3 cast) | -| `kNumSFUint32` | `kHidden / 128` | `kHidden / 64` | -| `recipe` | `(1, 1, 32)` | `(1, 1, 16)` | - -### PTX Wrappers (`tcgen05.cuh`) - -Added: -- `SM100_MMA_MXF4NVF4_2x1SM_SS` — 2-CTA NVFP4 block-scaled MMA -- `SM100_MMA_MXF4NVF4_SS` — 1-CTA NVFP4 block-scaled MMA - -### Python API (`mega/__init__.py`) - -- `fp8_nvfp4_mega_moe()` — main entry point -- `transform_nvfp4_weights_for_mega_moe()` — converts checkpoint weights to kernel format -- `_pack_nvfp4_sf_for_utccp()` — packs UE4M3 scales into int32 UTCCP layout - -### C++ Bindings - -- `csrc/apis/mega_nvfp4.hpp` — SymmBuffer with NVFP4 SF strides (K/16) -- `csrc/jit_kernels/impls/sm100_fp8_nvfp4_mega_moe.hpp` — JIT kernel - -## Architecture - -``` -NVFP4 Checkpoint Kernel Format -┌─────────────────────┐ ┌────────────────────────┐ -│ weight: uint8 │───────────>│ uint8 (same, E2M1) │ -│ (E2M1, 2 per byte) │ │ packed, interleaved │ -├─────────────────────┤ ├────────────────────────┤ -│ weight_scale: │ │ int32 (UTCCP layout) │ -│ float8_e4m3fn │──pack────>│ 4 UE4M3 bytes → 1 i32 │ -│ (UE4M3, group=16) │ +UTCCP │ then 4x32 transpose │ -├─────────────────────┤ └────────────────────────┘ -│ weight_scale_2: │ -│ float32 (global) │──────> Applied at weight load time -│ │ (multiply into block scales) -└─────────────────────┘ -``` - -### Scale Factor Flow - -``` -Checkpoint: weight_scale (UE4M3) × weight_scale_2 (FP32) = dequantized BF16 - -Kernel path: - 1. pre-scale: weight_scale_2 * weight_scale → float32 block scales - 2. pack to UE4M3: float32 → cutlass::float_e4m3_t → uint8 → pack 4→int32 - 3. UTCCP transpose for TMA consumption - 4. Tensor Core reads UE4M3 scales via mxf4nvf4 instruction -``` - -## Important Notes - -### scale_format_ constraint -The CUTLASS instruction descriptor has a single `scale_format_` bit (0=E4M3, 1=E8M0) that applies to BOTH A and B scale factors. This means both activation (SFA) and weight (SFB) scales must use the same format. The L1 epilogue outputs UE4M3 activation scales to match the NVFP4 weight scales. - -### Weight scale_2 handling -The NVFP4 checkpoint has a dual-level scaling scheme: -- `weight_scale`: per-block UE4M3 (group_size=16) -- `weight_scale_2`: per-tensor float32 global scale - -The `weight_scale_2` must be multiplied into the block scales **before** packing for the kernel. This is done in `transform_nvfp4_weights_for_mega_moe()`. - -## Remaining Work - -- [x] Test compilation on B200 (SM100) — **COMPILED** -- [ ] Verify UTCCP 4X column stride (i*8) -- [ ] Verify SF packing: UE4M3 → int32 → TMA-aligned layout -- [x] Add gran_k=16 to C++ transform_sf_into_required_layout -- [ ] Fix SF layout: must be MN-major (stride(-2)=1) with TMA-aligned stride -- [ ] Verify the L1 epilogue UE4M3 conversion -- [ ] Integration with vLLM DeepseekV4MegaMoEExperts — wired, debugging -- [ ] End-to-end quality test - -### SM100 (B200) Hardware Constraint - -**CRITICAL**: B200 (SM100) does NOT support `kind::mxf4nvf4` (neither `scale_vec::2X` nor `4X`). This instruction requires SM103 (B300) or SM120 (GB300). On SM100, the only FP4 block-scaled MMA is `kind::mxf8f6f4.block_scale` with UE8M0 scales (block32, group_size=32). - -**Strategy**: Keep NVFP4 E2M1 weights (same as MXFP4), convert UE4M3 block scales to UE8M0 for hardware compatibility. Merge NVFP4 block16→block32 (max of adjacent pairs). This is a scale format adaptation, not a weight format conversion. +**Strategy**: Keep NVFP4 E2M1 weights (same as MXFP4), convert UE4M3 block scales to UE8M0 +for hardware compatibility. Merge NVFP4 block16→block32 (max of adjacent pairs). This is a +scale format adaptation, not a weight format conversion. | Parameter | NVFP4 Checkpoint | Kernel (SM100 Adapted) | |-----------|-----------------|----------------------| @@ -106,17 +24,125 @@ The `weight_scale_2` must be multiplied into the block scales **before** packing | Block scale format | UE4M3 (float8_e4m3fn) | UE8M0 (uint8) | | Block size | 16 | 32 (merged) | | Global scale | float32 | Folded in before UE4M3→UE8M0 | -| PTX instruction | N/A (requires SM103+) | mxf8f6f4.block_scale | +| PTX instruction | N/A (requires SM103+) | `mxf8f6f4.block_scale` | -### Debugging Log -- Build 7: kPackedFP4 mismatch → uint8→int8 view -- Build 9: SF stride assertion → need MN-major layout + TMA alignment -- Build 10: transform_sf_into_required_layout doesn't support gran_k=16 → C++ fix -- Build 11: SF dtype mismatch (float8_e4m3fn → must pack to int32 first) -- Build 12-14: SF stride layout — transpose to MN-major before transform -- Build 15: SymmBuffer too small (NVFP4 has 2x SF) → use NVFP4 SymmBuffer -- Build 16: ImportError (deep_gemm.mega.nvfp4) → Python wrapper -- Build 17: NVCC error: scale_vec::4X not supported on sm_100f -- Build 18: NVCC error: scale_vec::2X ALSO not supported on sm_100f -- Build 19: kGranK still 16 in C++ binding -- Build 20: Use mxf8f6f4 (same as MXFP4) with UE4M0 conversion +**Result**: Server starts and serves, but output is garbled. The UE4M3→UE8M0 conversion +loses 3 bits of mantissa precision per scale (8× precision loss), destroying output quality. + +## Kernel Changes from MXFP4 + +The kernel is functionally identical to the MXFP4 mega_moe. The only differences are +in the weight transformation pipeline (Python) and the documentation/comments. + +### CUDA Kernel (`sm100_fp8_nvfp4_mega_moe.cuh`) + +Identical to `sm100_fp8_fp4_mega_moe.cuh` in runtime behavior: +- Same `kGranK = 32`, same `float_ue8m0_t` descriptor, same `mxf8f6f4` instruction +- Same TMEM layout (2X), UTCCP copy (i*4), SF counts +- Same L1 epilogue (UE8M0, `>> 23`) + +The renamed file exists for documentation and future SM103+ support. + +### Python API (`deep_gemm/mega/__init__.py`) + +```python +from deep_gemm.mega import ( + fp8_nvfp4_mega_moe, + transform_nvfp4_weights_for_mega_moe, + get_symm_buffer_for_nvfp4_mega_moe, +) + +# Transform NVFP4 checkpoint weights for the kernel +l1_weights, l2_weights = transform_nvfp4_weights_for_mega_moe( + (w13_weight, w13_weight_scale), # uint8, float8_e4m3fn + (w2_weight, w2_weight_scale), + l1_weight_scale_2=w13_weight_scale_2, # float32 global scale + l2_weight_scale_2=w2_weight_scale_2, +) +# l1_weights/l2_weights: (int8 weight, int32 TMA-aligned SF) + +# Run the kernel +fp8_nvfp4_mega_moe(y, l1_weights, l2_weights, symm_buffer, ...) +``` + +### Weight Transformation Pipeline + +``` +NVFP4 Checkpoint Kernel Format +┌─────────────────────┐ ┌────────────────────────┐ +│ weight: uint8 │────────────────→│ int8 (E2M1, same) │ +│ (E2M1, 2 per byte) │ .view(int8) │ packed, interleaved │ +├─────────────────────┤ ├────────────────────────┤ +│ weight_scale: │ 1. fold global │ int32 (TMA-aligned │ +│ float8_e4m3fn │ 2. merge 16→32 │ UTCCP layout) │ +│ (UE4M3, group=16) │ 3. UE4M3→UE8M0 │ │ +├─────────────────────┤ 4. pack 4→i32 └────────────────────────┘ +│ weight_scale_2: │ 5. transpose (same as MXFP4 pipeline) +│ float32 (global) │ 6. TMA-align +└─────────────────────┘ +``` + +**Key steps**: +1. **Fold global scale**: `block_scale * global_scale → UE4M3` (in float32, re-quantize) +2. **Merge block16→block32**: `max(adjacent_pair)` — max preserves magnitude +3. **UE4M3→UE8M0**: `float32 → (bits >> 23) & 0xFF → uint8` (extract IEEE 754 exponent) +4. **Pack**: 4 uint8 UE8M0 values → 1 int32 +5. **Transpose**: MN-major layout +6. **TMA-align**: `transform_sf_into_required_layout(sf, mn, k, (1,32), num_experts)` + +### C++ Changes + +| File | Change | +|------|--------| +| `csrc/apis/mega_nvfp4.hpp` | NVFP4 SymmBuffer (SF stride K/32, K/128 packed) | +| `csrc/apis/layout.hpp` | gran_k=32 support (already existed, just documented) | +| `csrc/jit_kernels/impls/sm100_fp8_nvfp4_mega_moe.hpp` | JIT kernel header | +| `csrc/python_api.cpp` | Register NVFP4 APIs | + +### PTX Wrappers Added (`tcgen05.cuh`) + +- `SM100_MMA_MXF4NVF4_2x1SM_SS` — for future SM103+ support +- `SM100_MMA_MXF4NVF4_SS` — single-CTA variant + +**Note**: These are NOT used on SM100. The kernel uses `SM100_MMA_MXF8F6F4_2x1SM_SS` instead. + +## Debugging Log (Builds 1–22) + +| Build | Error | Fix | +|-------|-------|-----| +| 1–6 | Dockerfile/build issues | NVRTC symlink, CPATH, PYTHONPATH | +| 7 | `kPackedFP4` type mismatch | uint8→int8 view | +| 9 | SF stride assertion | MN-major layout + TMA alignment | +| 10 | `transform_sf` no gran_k=16 | C++ fix (later reverted to 32) | +| 11 | SF dtype float8_e4m3fn rejected | Pack to int32 first | +| 12–14 | SF stride layout | Transpose to MN-major | +| 15 | SymmBuffer too small | NVFP4-specific SymmBuffer | +| 16 | `ImportError` | Python wrapper | +| 17 | NVCC: `scale_vec::4X` not on sm_100f | Hardware limit | +| 18 | NVCC: `scale_vec::2X` also not on sm_100f | Hardware limit | +| 19 | kGranK=16 in C++ binding | → 32 | +| 20 | `uint32 >> 23` fails | Cast to int32 first | +| 22 | Garbled output | UE4M3→UE8M0 precision loss (unfixable on SM100) | + +## Path Forward + +### For SM103+ (B300/GB300) +The `SM100_MMA_MXF4NVF4` PTX wrappers are already in the code. On SM103+: +1. Switch kernel to use `mxf4nvf4.block_scale.scale_vec::4X` (block16) +2. Keep UE4M3 scales (no conversion to UE8M0) +3. Update TMEM layout to 4X (i*8 stride, 4 sub-columns) +4. This should produce correct output with full NVFP4 precision + +### For SM100 (B200) +The UE4M3→UE8M0 conversion is fundamentally lossy. Options: +1. **FlashInfer FP4 MoE** — dequant NVFP4→BF16, use BF16 GEMM (avoids scale conversion) +2. **BF16 mega_moe** — dequant in shared memory, use BF16 MMA +3. **Accept the precision loss** — only viable if the model is robust to scale quantization + +## Integration with vLLM + +The `mega-moe-nvfp4` branch of `deepseek-v4-quant` wires this kernel into +`DeepseekV4MegaMoEExperts`: +- `finalize_weights()`: calls `transform_nvfp4_weights_for_mega_moe()` +- `forward()`: calls `fp8_nvfp4_mega_moe()` with `recipe=(1,1,32)` +- `get_symm_buffer()`: uses `get_symm_buffer_for_nvfp4_mega_moe()`