docs: document SM100 hardware constraint and full debugging log
@@ -94,9 +94,29 @@ The `weight_scale_2` must be multiplied into the block scales **before** packing
- [ ] Integration with vLLM DeepseekV4MegaMoEExperts (wired, debugging)
- [ ] End-to-end quality test

### SM100 (B200) Hardware Constraint

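**CRITICAL**: B200 (SM100) does NOT support `kind::mxf4nvf4` with either `scale_vec::2X` or `scale_vec::4X`; NVCC rejects both when targeting `sm_100f` (see Builds 17 and 18 in the debugging log). This instruction requires SM103 (B300) or SM120 (GB300). On SM100, the only FP4 block-scaled MMA available is `kind::mxf8f6f4.block_scale`, which takes UE8M0 scales with one scale per 32 elements (block32, `group_size=32`).

**Strategy**: Keep the NVFP4 E2M1 weights unchanged (the same element format MXFP4 uses) and convert the UE4M3 block scales to UE8M0 for hardware compatibility. Because each UE8M0 scale covers 32 elements, merge every adjacent pair of NVFP4 block16 scales into one block32 scale by taking the maximum of the pair. This adapts the scale format only; the weight format is not converted.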
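
The scale adaptation described above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the code used in this repository: the function names are hypothetical, the UE4M3 scales are assumed to be already decoded to Python floats, and rounding each merged scale *up* to the next power of two is an assumed choice (the notes above do not say which rounding mode is used).

```python
import math

UE8M0_BIAS = 127  # UE8M0 stores only a biased exponent: value = 2 ** (byte - 127)


def merge_block16_to_block32(scales16):
    """Merge adjacent block16 scales into block32 scales (max of each pair)."""
    if len(scales16) % 2:
        raise ValueError("expected an even number of block16 scales")
    return [max(scales16[i], scales16[i + 1]) for i in range(0, len(scales16), 2)]


def to_ue8m0(scale):
    """Encode a float scale as a UE8M0 byte, rounding up to a power of two.

    Assumption: non-positive scales map to the smallest exponent (byte 0).
    """
    if scale <= 0.0:
        return 0
    mantissa, exp = math.frexp(scale)  # scale == mantissa * 2**exp, 0.5 <= mantissa < 1
    exponent = exp - 1 if mantissa == 0.5 else exp  # ceil(log2(scale))
    # Byte 255 encodes NaN in UE8M0, so clamp to 254.
    return min(max(exponent + UE8M0_BIAS, 0), 254)


def convert_scales(block_scales, global_scale):
    """Fold in the global scale, merge block16 -> block32, then encode as UE8M0."""
    folded = [s * global_scale for s in block_scales]
    return [to_ue8m0(s) for s in merge_block16_to_block32(folded)]
```

For example, `convert_scales([1.0, 2.0, 0.5, 0.75], 1.0)` merges the four block16 scales into `[2.0, 0.75]` and encodes them as the bytes `[128, 127]`, that is, 2^1 and 2^0. The sketch covers only the scale arithmetic; it does not touch the E2M1 weight codes.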

| Parameter | NVFP4 Checkpoint | Kernel (SM100 Adapted) |
|-----------|------------------|------------------------|
| Weight format | E2M1 (packed in uint8) | E2M1 (packed in uint8, unchanged) |
| Block scale format | UE4M3 (`float8_e4m3fn`) | UE8M0 (uint8) |
| Block size | 16 | 32 (merged) |
| Global scale | float32 | Folded into the block scales before the UE4M3→UE8M0 conversion |
| PTX instruction | N/A (`kind::mxf4nvf4` requires SM103+) | `kind::mxf8f6f4.block_scale` |

### Debugging Log

- Build 7: `kPackedFP4` mismatch → view uint8 as int8
- Build 9: SF stride assertion → needs MN-major layout and TMA alignment
- Build 10: `transform_sf_into_required_layout` doesn't support `gran_k=16` → C++ fix
- Build 11: SF dtype mismatch → `float8_e4m3fn` must be packed to int32 first
- Builds 12-14: SF stride layout → transpose to MN-major before the transform
- Build 15: SymmBuffer too small (NVFP4 has 2x as many SFs) → use NVFP4 SymmBuffer
- Build 16: ImportError (`deep_gemm.mega.nvfp4`) → add Python wrapper
- Build 17: NVCC error: `scale_vec::4X` not supported on `sm_100f`
- Build 18: NVCC error: `scale_vec::2X` also not supported on `sm_100f`
- Build 19: `kGranK` still 16 in C++ binding
- Build 20: Use `mxf8f6f4` (same as MXFP4) with UE4M3→UE8M0 scale conversion
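
Build 11 notes that the scale factors must be packed to int32 before they reach the kernel. The following is a rough sketch of what such packing could look like, assuming four one-byte scales per int32 in little-endian byte order. The log does not state the layout the kernel actually expects, and `pack_scales_to_int32` is a hypothetical helper, not a function from this repository.

```python
import struct


def pack_scales_to_int32(scale_bytes):
    """Pack groups of four uint8 scale bytes into signed 32-bit integers.

    Assumption: little-endian order, so the first byte of each group
    lands in the least significant byte of the resulting int32.
    """
    if len(scale_bytes) % 4:
        raise ValueError("number of scale bytes must be a multiple of 4")
    raw = bytes(scale_bytes)  # raises ValueError if any value is outside 0..255
    return [value for (value,) in struct.iter_unpack("<i", raw)]
```

With this layout, `pack_scales_to_int32([127, 128, 126, 127])` yields the single word `0x7F7E807F`. A group whose last byte has its top bit set becomes a negative int32, which is expected for a bit-level reinterpretation.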