diff --git a/README_NVFP4.md b/README_NVFP4.md
index 620b757..962b204 100644
--- a/README_NVFP4.md
+++ b/README_NVFP4.md
@@ -85,12 +85,18 @@ The `weight_scale_2` must be multiplied into the block scales **before** packing
 
 ## Remaining Work
 
-- [ ] Test compilation on B200 (SM100)
-- [ ] Verify UTCCP 4X column stride (i*8) — may need adjustment based on TMEM layout diagrams
-- [ ] Verify SF packing: UE4M3 bytes → int32 layout matches what the UTCCP instruction expects
-- [ ] Verify the L1 epilogue UE4M3 conversion (float → e4m3 cast + sign bit clear)
-- [ ] Validate scale_format_ bit value: currently set by `make_instr_desc_block_scaled` which sets scale_format_=0 (E4M3)
-- [ ] Verify kNumSFATmemCols and kNumSFBTmemCols calculations for 4X layout
-- [ ] Integration with vLLM DeepseekV4MegaMoEExperts class
-- [ ] Weight loading: map NVFP4 checkpoint params to DeepseekV4MegaMoEExperts
-- [ ] End-to-end quality test: compare NVFP4 mega_moe output vs FlashInfer FP4 MoE
+- [x] Test compilation on B200 (SM100) — **COMPILED**
+- [ ] Verify UTCCP 4X column stride (i*8)
+- [ ] Verify SF packing: UE4M3 → int32 → TMA-aligned layout
+- [x] Add gran_k=16 to C++ transform_sf_into_required_layout
+- [ ] Fix SF layout: must be MN-major (stride(-2)=1) with TMA-aligned stride
+- [ ] Verify the L1 epilogue UE4M3 conversion
+- [ ] Integration with vLLM DeepseekV4MegaMoEExperts — wired, debugging
+- [ ] End-to-end quality test
+
+### Debugging Log
+- Build 7: kPackedFP4 mismatch → uint8→int8 view
+- Build 9: SF stride assertion → need MN-major layout + TMA alignment
+- Build 10: transform_sf_into_required_layout doesn't support gran_k=16 → C++ fix
+- Build 11: SF dtype mismatch (float8_e4m3fn → must pack to int32 first)
+- Build 12-14: SF stride layout — transpose to MN-major before transform