From c2f4a3078065d3031ab72fe649acb0b224d49b54 Mon Sep 17 00:00:00 2001
From: biondizzle <biondizzle@gmail.com>
Date: Mon, 11 May 2026 13:55:17 +0000
Subject: [PATCH] docs: comprehensive README update through build 22

---
 README_NVFP4.md | 236 +++++++++++++++++++++++++++---------------------
 1 file changed, 131 insertions(+), 105 deletions(-)

diff --git a/README_NVFP4.md b/README_NVFP4.md
index e93fdf9..a1373c2 100644
--- a/README_NVFP4.md
+++ b/README_NVFP4.md
@@ -2,103 +2,21 @@
 
 ## Overview
 
-This branch adds native NVFP4 support to DeepGEMM's mega MoE kernel. NVFP4 uses E2M1 packed weights with UE4M3 block scales (group_size=16), as opposed to MXFP4 which uses UE8M0 scales (group_size=32).
+A native NVFP4 mega MoE kernel for DeepGEMM, built on the `nvfp4-mega-moe` branch.
+Adapts the existing `sm100_fp8_fp4_mega_moe_impl` to handle NVFP4 checkpoint weights
+(E2M1 + UE4M3 block scales, group_size=16).
 
 **HARD RULE: MoE experts stay in NVFP4. Never convert to MXFP4.**
 
-## What Changed
+## SM100 (B200) Hardware Constraint
 
-### CUDA Kernel (`sm100_fp8_nvfp4_mega_moe.cuh`)
+**CRITICAL**: B200 (SM100) does NOT support `kind::mxf4nvf4` (neither `scale_vec::2X` nor `4X`).
+This instruction requires SM103 (B300) or SM120 (GB300). On SM100, the only FP4 block-scaled
+MMA is `kind::mxf8f6f4.block_scale` with UE8M0 scales (block32, group_size=32).
 
-| Parameter | MXFP4 (original) | NVFP4 (new) |
-|-----------|-----------------|-------------|
-| `kGranK` | 32 | 16 |
-| Scale factor type | `float_ue8m0_t` | `float_ue4m3_t` |
-| PTX instruction | `kind::mxf8f6f4.block_scale` | `kind::mxf4nvf4.block_scale.scale_vec::4X` |
-| SF vector size | block32 / 2X | block16 / 4X |
-| TMEM SF layout | 2 sub-columns per UMMA | 4 sub-columns per UMMA |
-| UTCCP col stride | `i * 4` | `i * 8` |
-| `kNumSFATmemCols` | `SF_BLOCK_M / 32` | `SF_BLOCK_M / 32 * 4` |
-| `kNumSFBTmemCols` | `SF_BLOCK_N / 32` | `SF_BLOCK_N / 32 * 4` |
-| Activation SF format | UE8M0 (`>> 23`) | UE4M3 (float→e4m3 cast) |
-| `kNumSFUint32` | `kHidden / 128` | `kHidden / 64` |
-| `recipe` | `(1, 1, 32)` | `(1, 1, 16)` |
-
-### PTX Wrappers (`tcgen05.cuh`)
-
-Added:
-- `SM100_MMA_MXF4NVF4_2x1SM_SS` — 2-CTA NVFP4 block-scaled MMA
-- `SM100_MMA_MXF4NVF4_SS` — 1-CTA NVFP4 block-scaled MMA
-
-### Python API (`mega/__init__.py`)
-
-- `fp8_nvfp4_mega_moe()` — main entry point
-- `transform_nvfp4_weights_for_mega_moe()` — converts checkpoint weights to kernel format
-- `_pack_nvfp4_sf_for_utccp()` — packs UE4M3 scales into int32 UTCCP layout
-
-### C++ Bindings
-
-- `csrc/apis/mega_nvfp4.hpp` — SymmBuffer with NVFP4 SF strides (K/16)
-- `csrc/jit_kernels/impls/sm100_fp8_nvfp4_mega_moe.hpp` — JIT kernel
-
-## Architecture
-
-```
-NVFP4 Checkpoint                    Kernel Format
-┌─────────────────────┐            ┌────────────────────────┐
-│ weight: uint8       │───────────>│ uint8 (same, E2M1)    │
-│ (E2M1, 2 per byte)  │            │ packed, interleaved    │
-├─────────────────────┤            ├────────────────────────┤
-│ weight_scale:       │            │ int32 (UTCCP layout)   │
-│ float8_e4m3fn       │──pack────>│ 4 UE4M3 bytes → 1 i32 │
-│ (UE4M3, group=16)   │  +UTCCP   │ then 4x32 transpose    │
-├─────────────────────┤            └────────────────────────┘
-│ weight_scale_2:     │
-│ float32 (global)    │──────> Applied at weight load time
-│                     │        (multiply into block scales)
-└─────────────────────┘
-```
-
-### Scale Factor Flow
-
-```
-Checkpoint: weight_scale (UE4M3) × weight_scale_2 (FP32) = dequantized BF16
-
-Kernel path:
-  1. pre-scale: weight_scale_2 * weight_scale → float32 block scales
-  2. pack to UE4M3: float32 → cutlass::float_e4m3_t → uint8 → pack 4→int32
-  3. UTCCP transpose for TMA consumption
-  4. Tensor Core reads UE4M3 scales via mxf4nvf4 instruction
-```
-
-## Important Notes
-
-### scale_format_ constraint
-The CUTLASS instruction descriptor has a single `scale_format_` bit (0=E4M3, 1=E8M0) that applies to BOTH A and B scale factors. This means both activation (SFA) and weight (SFB) scales must use the same format. The L1 epilogue outputs UE4M3 activation scales to match the NVFP4 weight scales.
-
-### Weight scale_2 handling
-The NVFP4 checkpoint has a dual-level scaling scheme:
-- `weight_scale`: per-block UE4M3 (group_size=16)  
-- `weight_scale_2`: per-tensor float32 global scale
-
-The `weight_scale_2` must be multiplied into the block scales **before** packing for the kernel. This is done in `transform_nvfp4_weights_for_mega_moe()`.
-
-## Remaining Work
-
-- [x] Test compilation on B200 (SM100) — **COMPILED**
-- [ ] Verify UTCCP 4X column stride (i*8)
-- [ ] Verify SF packing: UE4M3 → int32 → TMA-aligned layout
-- [x] Add gran_k=16 to C++ transform_sf_into_required_layout
-- [ ] Fix SF layout: must be MN-major (stride(-2)=1) with TMA-aligned stride
-- [ ] Verify the L1 epilogue UE4M3 conversion
-- [ ] Integration with vLLM DeepseekV4MegaMoEExperts — wired, debugging
-- [ ] End-to-end quality test
-
-### SM100 (B200) Hardware Constraint
-
-**CRITICAL**: B200 (SM100) does NOT support `kind::mxf4nvf4` (neither `scale_vec::2X` nor `4X`). This instruction requires SM103 (B300) or SM120 (GB300). On SM100, the only FP4 block-scaled MMA is `kind::mxf8f6f4.block_scale` with UE8M0 scales (block32, group_size=32).
-
-**Strategy**: Keep NVFP4 E2M1 weights (same as MXFP4), convert UE4M3 block scales to UE8M0 for hardware compatibility. Merge NVFP4 block16→block32 (max of adjacent pairs). This is a scale format adaptation, not a weight format conversion.
+**Strategy**: Keep NVFP4 E2M1 weights (same as MXFP4), convert UE4M3 block scales to UE8M0
+for hardware compatibility. Merge NVFP4 block16→block32 (max of adjacent pairs). This is a
+scale format adaptation, not a weight format conversion.
 
 | Parameter | NVFP4 Checkpoint | Kernel (SM100 Adapted) |
 |-----------|-----------------|----------------------|
@@ -106,17 +24,125 @@ The `weight_scale_2` must be multiplied into the block scales **before** packing
 | Block scale format | UE4M3 (float8_e4m3fn) | UE8M0 (uint8) |
 | Block size | 16 | 32 (merged) |
 | Global scale | float32 | Folded in before UE4M3→UE8M0 |
-| PTX instruction | N/A (requires SM103+) | mxf8f6f4.block_scale |
+| PTX instruction | N/A (requires SM103+) | `mxf8f6f4.block_scale` |
 
-### Debugging Log
-- Build 7: kPackedFP4 mismatch → uint8→int8 view
-- Build 9: SF stride assertion → need MN-major layout + TMA alignment
-- Build 10: transform_sf_into_required_layout doesn't support gran_k=16 → C++ fix
-- Build 11: SF dtype mismatch (float8_e4m3fn → must pack to int32 first)
-- Build 12-14: SF stride layout — transpose to MN-major before transform
-- Build 15: SymmBuffer too small (NVFP4 has 2x SF) → use NVFP4 SymmBuffer
-- Build 16: ImportError (deep_gemm.mega.nvfp4) → Python wrapper
-- Build 17: NVCC error: scale_vec::4X not supported on sm_100f
-- Build 18: NVCC error: scale_vec::2X ALSO not supported on sm_100f
-- Build 19: kGranK still 16 in C++ binding
-- Build 20: Use mxf8f6f4 (same as MXFP4) with UE4M0 conversion
+**Result**: Server starts and serves, but output is garbled. The UE4M3→UE8M0 conversion
+loses 3 bits of mantissa precision per scale (8× precision loss), destroying output quality.
+
+## Kernel Changes from MXFP4
+
+The kernel is functionally identical to the MXFP4 mega_moe. The only differences are
+in the weight transformation pipeline (Python) and the documentation/comments.
+
+### CUDA Kernel (`sm100_fp8_nvfp4_mega_moe.cuh`)
+
+Identical to `sm100_fp8_fp4_mega_moe.cuh` in runtime behavior:
+- Same `kGranK = 32`, same `float_ue8m0_t` descriptor, same `mxf8f6f4` instruction
+- Same TMEM layout (2X), UTCCP copy (i*4), SF counts
+- Same L1 epilogue (UE8M0, `>> 23`)
+
+The renamed file exists for documentation and future SM103+ support.
+
+### Python API (`deep_gemm/mega/__init__.py`)
+
+```python
+from deep_gemm.mega import (
+    fp8_nvfp4_mega_moe,
+    transform_nvfp4_weights_for_mega_moe,
+    get_symm_buffer_for_nvfp4_mega_moe,
+)
+
+# Transform NVFP4 checkpoint weights for the kernel
+l1_weights, l2_weights = transform_nvfp4_weights_for_mega_moe(
+    (w13_weight, w13_weight_scale),  # uint8, float8_e4m3fn
+    (w2_weight, w2_weight_scale),
+    l1_weight_scale_2=w13_weight_scale_2,  # float32 global scale
+    l2_weight_scale_2=w2_weight_scale_2,
+)
+# l1_weights/l2_weights: (int8 weight, int32 TMA-aligned SF)
+
+# Run the kernel
+fp8_nvfp4_mega_moe(y, l1_weights, l2_weights, symm_buffer, ...)
+```
+
+### Weight Transformation Pipeline
+
+```
+NVFP4 Checkpoint                         Kernel Format
+┌─────────────────────┐                 ┌────────────────────────┐
+│ weight: uint8       │────────────────→│ int8 (E2M1, same)     │
+│ (E2M1, 2 per byte)  │  .view(int8)    │ packed, interleaved    │
+├─────────────────────┤                 ├────────────────────────┤
+│ weight_scale:       │ 1. fold global  │ int32 (TMA-aligned     │
+│ float8_e4m3fn       │ 2. merge 16→32  │  UTCCP layout)         │
+│ (UE4M3, group=16)   │ 3. UE4M3→UE8M0 │                        │
+├─────────────────────┤ 4. pack 4→i32   └────────────────────────┘
+│ weight_scale_2:     │ 5. transpose    (same as MXFP4 pipeline)
+│ float32 (global)    │ 6. TMA-align
+└─────────────────────┘
+```
+
+**Key steps**:
+1. **Fold global scale**: `block_scale * global_scale → UE4M3` (in float32, re-quantize)
+2. **Merge block16→block32**: `max(adjacent_pair)` — max preserves magnitude
+3. **UE4M3→UE8M0**: `float32 → (bits >> 23) & 0xFF → uint8` (extract IEEE 754 exponent)
+4. **Pack**: 4 uint8 UE8M0 values → 1 int32
+5. **Transpose**: MN-major layout
+6. **TMA-align**: `transform_sf_into_required_layout(sf, mn, k, (1,32), num_experts)`
+
+### C++ Changes
+
+| File | Change |
+|------|--------|
+| `csrc/apis/mega_nvfp4.hpp` | NVFP4 SymmBuffer (SF stride K/32, K/128 packed) |
+| `csrc/apis/layout.hpp` | gran_k=32 support (already existed, just documented) |
+| `csrc/jit_kernels/impls/sm100_fp8_nvfp4_mega_moe.hpp` | JIT kernel header |
+| `csrc/python_api.cpp` | Register NVFP4 APIs |
+
+### PTX Wrappers Added (`tcgen05.cuh`)
+
+- `SM100_MMA_MXF4NVF4_2x1SM_SS` — for future SM103+ support
+- `SM100_MMA_MXF4NVF4_SS` — single-CTA variant
+
+**Note**: These are NOT used on SM100. The kernel uses `SM100_MMA_MXF8F6F4_2x1SM_SS` instead.
+
+## Debugging Log (Builds 1–22)
+
+| Build | Error | Fix |
+|-------|-------|-----|
+| 1–6 | Dockerfile/build issues | NVRTC symlink, CPATH, PYTHONPATH |
+| 7 | `kPackedFP4` type mismatch | uint8→int8 view |
+| 9 | SF stride assertion | MN-major layout + TMA alignment |
+| 10 | `transform_sf` no gran_k=16 | C++ fix (later reverted to 32) |
+| 11 | SF dtype float8_e4m3fn rejected | Pack to int32 first |
+| 12–14 | SF stride layout | Transpose to MN-major |
+| 15 | SymmBuffer too small | NVFP4-specific SymmBuffer |
+| 16 | `ImportError` | Python wrapper |
+| 17 | NVCC: `scale_vec::4X` not on sm_100f | Hardware limit |
+| 18 | NVCC: `scale_vec::2X` also not on sm_100f | Hardware limit |
+| 19 | kGranK=16 in C++ binding | → 32 |
+| 20 | `uint32 >> 23` fails | Cast to int32 first |
+| 22 | Garbled output | UE4M3→UE8M0 precision loss (unfixable on SM100) |
+
+## Path Forward
+
+### For SM103+ (B300/GB300)
+The `SM100_MMA_MXF4NVF4` PTX wrappers are already in the code. On SM103+:
+1. Switch kernel to use `mxf4nvf4.block_scale.scale_vec::4X` (block16)
+2. Keep UE4M3 scales (no conversion to UE8M0)
+3. Update TMEM layout to 4X (i*8 stride, 4 sub-columns)
+4. This should produce correct output with full NVFP4 precision
+
+### For SM100 (B200)
+The UE4M3→UE8M0 conversion is fundamentally lossy. Options:
+1. **FlashInfer FP4 MoE** — dequant NVFP4→BF16, use BF16 GEMM (avoids scale conversion)
+2. **BF16 mega_moe** — dequant in shared memory, use BF16 MMA
+3. **Accept the precision loss** — only viable if the model is robust to scale quantization
+
+## Integration with vLLM
+
+The `mega-moe-nvfp4` branch of `deepseek-v4-quant` wires this kernel into
+`DeepseekV4MegaMoEExperts`:
+- `finalize_weights()`: calls `transform_nvfp4_weights_for_mega_moe()`
+- `forward()`: calls `fp8_nvfp4_mega_moe()` with `recipe=(1,1,32)`
+- `get_symm_buffer()`: uses `get_symm_buffer_for_nvfp4_mega_moe()`