diff --git a/README_NVFP4.md b/README_NVFP4.md
index 962b204..e93fdf9 100644
--- a/README_NVFP4.md
+++ b/README_NVFP4.md
@@ -94,9 +94,29 @@ The `weight_scale_2` must be multiplied into the block scales **before** packing
 - [ ] Integration with vLLM DeepseekV4MegaMoEExperts — wired, debugging
 - [ ] End-to-end quality test
 
+### SM100 (B200) Hardware Constraint
+
+**CRITICAL**: B200 (SM100) does NOT support `kind::mxf4nvf4` (neither `scale_vec::2X` nor `scale_vec::4X`). This instruction requires SM103 (B300) or SM120 (GB300). On SM100, the only FP4 block-scaled MMA is `kind::mxf8f6f4.block_scale` with UE8M0 scales (block32, group_size=32).
+
+**Strategy**: Keep the NVFP4 E2M1 weights (the same element format as MXFP4) and convert the UE4M3 block scales to UE8M0 for hardware compatibility. Merge each adjacent pair of NVFP4 block16 scales into one block32 scale by taking the max of the pair. This is a scale format adaptation, not a weight format conversion.
+
+| Parameter | NVFP4 Checkpoint | Kernel (SM100 Adapted) |
+|-----------|------------------|------------------------|
+| Weight format | E2M1 uint8 | E2M1 uint8 (unchanged) |
+| Block scale format | UE4M3 (float8_e4m3fn) | UE8M0 (uint8) |
+| Block size | 16 | 32 (merged) |
+| Global scale | float32 | Folded in before UE4M3→UE8M0 |
+| PTX instruction | N/A (requires SM103+) | mxf8f6f4.block_scale |
+
 ### Debugging Log
 - Build 7: kPackedFP4 mismatch → uint8→int8 view
 - Build 9: SF stride assertion → need MN-major layout + TMA alignment
 - Build 10: transform_sf_into_required_layout doesn't support gran_k=16 → C++ fix
 - Build 11: SF dtype mismatch (float8_e4m3fn → must pack to int32 first)
 - Build 12-14: SF stride layout — transpose to MN-major before transform
+- Build 15: SymmBuffer too small (NVFP4 has 2x SF) → use NVFP4 SymmBuffer
+- Build 16: ImportError (deep_gemm.mega.nvfp4) → Python wrapper
+- Build 17: NVCC error: scale_vec::4X not supported on sm_100f
+- Build 18: NVCC error: scale_vec::2X ALSO not supported on sm_100f
+- Build 19: kGranK still 16 in C++ binding
+- Build 20: Use mxf8f6f4 (same as MXFP4) with UE8M0 conversion
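
The scale adaptation described in the SM100 strategy (fold in the global scale, merge block16 pairs by max, re-encode as a power-of-two exponent) can be sketched in NumPy as follows. This is a minimal illustration, not the code in this PR or in deep_gemm/vLLM: the function names `ue4m3_block16_to_ue8m0_block32` and `decode_ue8m0` are hypothetical, the block scales are assumed to be already decoded from UE4M3 to ordinary floats, and rounding each merged scale *up* to the next power of two (and mapping zero scales to the smallest exponent) is an assumption about the desired behaviour.

```python
import numpy as np

UE8M0_BIAS = 127  # UE8M0 stores a biased exponent: value = 2**(e - 127)


def ue4m3_block16_to_ue8m0_block32(block_scales, global_scale):
    """Hypothetical helper: block16 float scales -> block32 UE8M0 bytes.

    block_scales: array of shape [rows, K // 16], already decoded from
                  UE4M3 to floats (decoding is out of scope here).
    global_scale: the float32 per-tensor scale, folded in first.
    """
    scales = np.asarray(block_scales, dtype=np.float64) * float(global_scale)
    rows, n_blocks = scales.shape
    if n_blocks % 2 != 0:
        raise ValueError("need an even number of block16 scales per row")

    # Merge each adjacent pair of block16 scales into one block32 scale.
    merged = scales.reshape(rows, n_blocks // 2, 2).max(axis=-1)

    # Exact ceil(log2(x)) via frexp: x = mant * 2**exp with mant in [0.5, 1).
    # Assumption: round up so the power-of-two scale is never below the input.
    safe = np.where(merged > 0, merged, 1.0)
    mant, exp = np.frexp(safe)
    exponents = np.where(mant == 0.5, exp - 1, exp)

    # Assumption: zero scales map to the smallest encodable exponent (0).
    biased = np.where(merged > 0, exponents + UE8M0_BIAS, 0)
    # 0xFF is reserved for NaN in UE8M0, so clamp to 254.
    return np.clip(biased, 0, 254).astype(np.uint8)


def decode_ue8m0(biased):
    """Decode UE8M0 bytes back to float scales (2**(e - 127))."""
    return np.ldexp(1.0, biased.astype(np.int32) - UE8M0_BIAS)


if __name__ == "__main__":
    sf16 = [[0.5, 1.0, 3.0, 2.0]]  # four block16 scales -> two block32 scales
    sf32 = ue4m3_block16_to_ue8m0_block32(sf16, global_scale=0.25)
    print(sf32, decode_ue8m0(sf32))
```

Because UE8M0 can only represent powers of two, the re-encoded scale generally differs from the original `block_scale * global_scale` product; whether rounding up, rounding to nearest, or some other policy gives acceptable quality is something the end-to-end quality test would need to confirm.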