From e80fe9af606e7324b294d6fd011ab80ae86820d9 Mon Sep 17 00:00:00 2001
From: biondizzle <biondizzle@gmail.com>
Date: Mon, 11 May 2026 14:24:55 +0000
Subject: [PATCH] =?UTF-8?q?docs:=20CORRECTED=20=E2=80=94=20mxf4nvf4=20IS?=
 =?UTF-8?q?=20supported=20on=20sm=5F100a=20(B200)?=
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

The build 17-18 'scale_vec not supported on sm_100f' error was because
we targeted sm_100 instead of sm_100a. The 'a' suffix is required for
FP4 block-scaled MMA instructions. Reverting to mxf4nvf4 with correct
arch target is the path forward.
---
 README_NVFP4.md | 180 ++++++++++++++++++++----------------------------
 1 file changed, 73 insertions(+), 107 deletions(-)

diff --git a/README_NVFP4.md b/README_NVFP4.md
index a1373c2..74ab0c3 100644
--- a/README_NVFP4.md
+++ b/README_NVFP4.md
@@ -2,70 +2,37 @@
 
 ## Overview
 
-A native NVFP4 mega MoE kernel for DeepGEMM, built on the `nvfp4-mega-moe` branch.
-Adapts the existing `sm100_fp8_fp4_mega_moe_impl` to handle NVFP4 checkpoint weights
-(E2M1 + UE4M3 block scales, group_size=16).
+A native NVFP4 mega MoE kernel for DeepGEMM that uses `kind::mxf4nvf4.block_scale.scale_vec::4X`
+to consume NVFP4 weights (E2M1 + UE4M3 block scales, group_size=16) directly on B200 (SM100a).
 
 **HARD RULE: MoE experts stay in NVFP4. Never convert to MXFP4.**
 
-## SM100 (B200) Hardware Constraint
+## SM100a (B200) Hardware Support
 
-**CRITICAL**: B200 (SM100) does NOT support `kind::mxf4nvf4` (neither `scale_vec::2X` nor `4X`).
-This instruction requires SM103 (B300) or SM120 (GB300). On SM100, the only FP4 block-scaled
-MMA is `kind::mxf8f6f4.block_scale` with UE8M0 scales (block32, group_size=32).
+**B200 (SM100a) DOES support `kind::mxf4nvf4` with `scale_vec::4X`** (block16, UE4M3 scales).
+Documented in PTX ISA 8.7 (CUDA 12.8+), confirmed by NVIDIA/CUTLASS/Colfax.
 
-**Strategy**: Keep NVFP4 E2M1 weights (same as MXFP4), convert UE4M3 block scales to UE8M0
-for hardware compatibility. Merge NVFP4 block16→block32 (max of adjacent pairs). This is a
-scale format adaptation, not a weight format conversion.
+The key requirement: target **`sm_100a`** (not `sm_100`). The `a` suffix enables the FP4
+block-scaled instructions including `mxf4nvf4`. Targeting plain `sm_100` will produce
+"Feature '.scale_vec::4X' not supported on .target 'sm_100f'" errors.
 
-| Parameter | NVFP4 Checkpoint | Kernel (SM100 Adapted) |
-|-----------|-----------------|----------------------|
-| Weight format | E2M1 uint8 | E2M1 uint8 (unchanged) |
-| Block scale format | UE4M3 (float8_e4m3fn) | UE8M0 (uint8) |
-| Block size | 16 | 32 (merged) |
-| Global scale | float32 | Folded in before UE4M3→UE8M0 |
-| PTX instruction | N/A (requires SM103+) | `mxf8f6f4.block_scale` |
+## Kernel Architecture (TARGET)
 
-**Result**: Server starts and serves, but output is garbled. The UE4M3→UE8M0 conversion
-loses 3 bits of mantissa precision per scale (8× precision loss), destroying output quality.
-
-## Kernel Changes from MXFP4
-
-The kernel is functionally identical to the MXFP4 mega_moe. The only differences are
-in the weight transformation pipeline (Python) and the documentation/comments.
-
-### CUDA Kernel (`sm100_fp8_nvfp4_mega_moe.cuh`)
-
-Identical to `sm100_fp8_fp4_mega_moe.cuh` in runtime behavior:
-- Same `kGranK = 32`, same `float_ue8m0_t` descriptor, same `mxf8f6f4` instruction
-- Same TMEM layout (2X), UTCCP copy (i*4), SF counts
-- Same L1 epilogue (UE8M0, `>> 23`)
-
-The renamed file exists for documentation and future SM103+ support.
-
-### Python API (`deep_gemm/mega/__init__.py`)
-
-```python
-from deep_gemm.mega import (
-    fp8_nvfp4_mega_moe,
-    transform_nvfp4_weights_for_mega_moe,
-    get_symm_buffer_for_nvfp4_mega_moe,
-)
-
-# Transform NVFP4 checkpoint weights for the kernel
-l1_weights, l2_weights = transform_nvfp4_weights_for_mega_moe(
-    (w13_weight, w13_weight_scale),  # uint8, float8_e4m3fn
-    (w2_weight, w2_weight_scale),
-    l1_weight_scale_2=w13_weight_scale_2,  # float32 global scale
-    l2_weight_scale_2=w2_weight_scale_2,
-)
-# l1_weights/l2_weights: (int8 weight, int32 TMA-aligned SF)
-
-# Run the kernel
-fp8_nvfp4_mega_moe(y, l1_weights, l2_weights, symm_buffer, ...)
+```
+sm100_fp8_nvfp4_mega_moe_impl
+├── kGranK = 16 (NVFP4 native block size)
+├── kind::mxf4nvf4.block_scale.scale_vec::4X PTX instruction
+├── float_ue4m3_t instruction descriptor
+├── SF layout: scale_vec::4X, 4 TMEM sub-columns per UMMA atom
+├── UTCCP copy: i*8 stride (4X layout, 8 TMEM cols per 128-element group)
+├── kNumSFATmemCols = SF_BLOCK_M / 32 * 4
+├── kNumSFBTmemCols = SF_BLOCK_N / 32 * 4
+├── kNumSFUint32 = kHidden / 64 (4 UE4M3 per int32)
+├── UE4M3 L1 epilogue (float → cutlass::float_e4m3_t cast, sign bit cleared)
+└── recipe = (1, 1, 16)
 ```
 
-### Weight Transformation Pipeline
+## Weight Transformation Pipeline
 
 ```
 NVFP4 Checkpoint                         Kernel Format
@@ -74,75 +41,74 @@ NVFP4 Checkpoint                         Kernel Format
 │ (E2M1, 2 per byte)  │  .view(int8)    │ packed, interleaved    │
 ├─────────────────────┤                 ├────────────────────────┤
 │ weight_scale:       │ 1. fold global  │ int32 (TMA-aligned     │
-│ float8_e4m3fn       │ 2. merge 16→32  │  UTCCP layout)         │
-│ (UE4M3, group=16)   │ 3. UE4M3→UE8M0 │                        │
-├─────────────────────┤ 4. pack 4→i32   └────────────────────────┘
-│ weight_scale_2:     │ 5. transpose    (same as MXFP4 pipeline)
-│ float32 (global)    │ 6. TMA-align
+│ float8_e4m3fn       │ 2. pack 4→i32   │  UTCCP layout,         │
+│ (UE4M3, group=16)   │ 3. transpose    │  gran_k=16)            │
+├─────────────────────┤ 4. TMA-align    └────────────────────────┘
+│ weight_scale_2:     │
+│ float32 (global)    │──folded into block scales before packing
 └─────────────────────┘
 ```
 
-**Key steps**:
-1. **Fold global scale**: `block_scale * global_scale → UE4M3` (in float32, re-quantize)
-2. **Merge block16→block32**: `max(adjacent_pair)` — max preserves magnitude
-3. **UE4M3→UE8M0**: `float32 → (bits >> 23) & 0xFF → uint8` (extract IEEE 754 exponent)
-4. **Pack**: 4 uint8 UE8M0 values → 1 int32
-5. **Transpose**: MN-major layout
-6. **TMA-align**: `transform_sf_into_required_layout(sf, mn, k, (1,32), num_experts)`
+**NO UE4M3→UE8M0 conversion. NO block16→block32 merge.** The kernel consumes
+native UE4M3 scales with block16 grouping.
 
-### C++ Changes
+## Key Differences from MXFP4 mega_moe
 
-| File | Change |
-|------|--------|
-| `csrc/apis/mega_nvfp4.hpp` | NVFP4 SymmBuffer (SF stride K/32, K/128 packed) |
-| `csrc/apis/layout.hpp` | gran_k=32 support (already existed, just documented) |
-| `csrc/jit_kernels/impls/sm100_fp8_nvfp4_mega_moe.hpp` | JIT kernel header |
-| `csrc/python_api.cpp` | Register NVFP4 APIs |
+| Parameter | MXFP4 | NVFP4 (this kernel) |
+|-----------|-------|---------------------|
+| `kGranK` | 32 | 16 |
+| PTX instruction | `mxf8f6f4.block_scale` | `mxf4nvf4.block_scale.scale_vec::4X` |
+| Scale factor type | `float_ue8m0_t` | `float_ue4m3_t` |
+| SF vector size | block32 / 2X | block16 / 4X |
+| TMEM SF cols (SFA) | `SF_BLOCK_M / 32` | `SF_BLOCK_M / 32 * 4` |
+| UTCCP col stride | `i * 4` | `i * 8` |
+| `kNumSFUint32` | `kHidden / 128` | `kHidden / 64` |
+| L1 epilogue | UE8M0 (`>> 23`) | UE4M3 (float→e4m3 cast) |
+| recipe | `(1, 1, 32)` | `(1, 1, 16)` |
 
-### PTX Wrappers Added (`tcgen05.cuh`)
+## Critical Implementation Details
 
-- `SM100_MMA_MXF4NVF4_2x1SM_SS` — for future SM103+ support
-- `SM100_MMA_MXF4NVF4_SS` — single-CTA variant
+### scale_format_ constraint
+The CUTLASS instruction descriptor has a single `scale_format_` bit (0=E4M3, 1=E8M0)
+that applies to BOTH A and B scale factors. For NVFP4 (E4M3), both activation (SFA)
+and weight (SFB) scales must use UE4M3. The L1 epilogue outputs UE4M3 activation scales
+(float → `cutlass::float_e4m3_t` with sign bit cleared).
 
-**Note**: These are NOT used on SM100. The kernel uses `SM100_MMA_MXF8F6F4_2x1SM_SS` instead.
+### Arch flag
+The JIT compiler MUST target `sm_100a`, not `sm_100`. Without the `a` suffix, the
+`mxf4nvf4` instruction is unavailable and compilation will fail with
+"Feature '.scale_vec::4X' not supported on .target 'sm_100f'".
 
-## Debugging Log (Builds 1–22)
+### Weight scale_2 folding
+The NVFP4 checkpoint has dual-level scaling: per-block UE4M3 + per-tensor float32.
+The `weight_scale_2` must be folded into the block scales before packing:
+`effective_scale = block_scale * global_scale`, then re-quantize to UE4M3.
+
+## Build History
 
 | Build | Error | Fix |
 |-------|-------|-----|
 | 1–6 | Dockerfile/build issues | NVRTC symlink, CPATH, PYTHONPATH |
 | 7 | `kPackedFP4` type mismatch | uint8→int8 view |
 | 9 | SF stride assertion | MN-major layout + TMA alignment |
-| 10 | `transform_sf` no gran_k=16 | C++ fix (later reverted to 32) |
-| 11 | SF dtype float8_e4m3fn rejected | Pack to int32 first |
+| 10 | `transform_sf` no gran_k=16 | C++ fix |
+| 11 | SF dtype float8_e4m3fn rejected | Pack UE4M3→int32 first |
 | 12–14 | SF stride layout | Transpose to MN-major |
-| 15 | SymmBuffer too small | NVFP4-specific SymmBuffer |
+| 15 | SymmBuffer too small | NVFP4-specific SymmBuffer (2× SF) |
 | 16 | `ImportError` | Python wrapper |
-| 17 | NVCC: `scale_vec::4X` not on sm_100f | Hardware limit |
-| 18 | NVCC: `scale_vec::2X` also not on sm_100f | Hardware limit |
-| 19 | kGranK=16 in C++ binding | → 32 |
+| **17** | **NVCC: `scale_vec::4X` not on sm_100f** | **Wrong arch: need `sm_100a`** |
+| 18 | `scale_vec::2X` also failed | Same — `sm_100a` required |
+| 19 | kGranK still 16 in C++ binding | Should stay 16 — was wrongly changed to 32 |
 | 20 | `uint32 >> 23` fails | Cast to int32 first |
-| 22 | Garbled output | UE4M3→UE8M0 precision loss (unfixable on SM100) |
+| 22 | Garbled output | Fell back to mxf8f6f4 — should use mxf4nvf4 on sm_100a |
 
-## Path Forward
+## Remaining Work
 
-### For SM103+ (B300/GB300)
-The `SM100_MMA_MXF4NVF4` PTX wrappers are already in the code. On SM103+:
-1. Switch kernel to use `mxf4nvf4.block_scale.scale_vec::4X` (block16)
-2. Keep UE4M3 scales (no conversion to UE8M0)
-3. Update TMEM layout to 4X (i*8 stride, 4 sub-columns)
-4. This should produce correct output with full NVFP4 precision
-
-### For SM100 (B200)
-The UE4M3→UE8M0 conversion is fundamentally lossy. Options:
-1. **FlashInfer FP4 MoE** — dequant NVFP4→BF16, use BF16 GEMM (avoids scale conversion)
-2. **BF16 mega_moe** — dequant in shared memory, use BF16 MMA
-3. **Accept the precision loss** — only viable if the model is robust to scale quantization
-
-## Integration with vLLM
-
-The `mega-moe-nvfp4` branch of `deepseek-v4-quant` wires this kernel into
-`DeepseekV4MegaMoEExperts`:
-- `finalize_weights()`: calls `transform_nvfp4_weights_for_mega_moe()`
-- `forward()`: calls `fp8_nvfp4_mega_moe()` with `recipe=(1,1,32)`
-- `get_symm_buffer()`: uses `get_symm_buffer_for_nvfp4_mega_moe()`
+- [ ] Fix DeepGEMM JIT to target `sm_100a` instead of `sm_100`
+- [ ] Add NVFP4 MMA kind enum to DeepGEMM runtime (not just MXFP8FP4 with NVFP4 hat)
+- [ ] Revert to Build 17's `mxf4nvf4.scale_vec::4X` instruction (was correct, just wrong arch)
+- [ ] Revert `kGranK` to 16, UE4M3 scales, block16 SF layout
+- [ ] Add `get_sf_uttcp_aligned_block_sizes` branch for block16 layout
+- [ ] Remove UE4M3→UE8M0 conversion and block16→block32 merge from Python
+- [ ] Verify TMEM 4X layout (i*8 stride, 4 sub-columns)
+- [ ] End-to-end quality test on B200