docs: update debugging log in README

2026-05-11 07:33:02 +00:00
parent 8d02eb38fa
commit acbe006498
1 changed file with 15 additions and 9 deletions
--- a/README_NVFP4.md
+++ b/README_NVFP4.md
@@ -85,12 +85,18 @@ The `weight_scale_2` must be multiplied into the block scales **before** packing

 ## Remaining Work

- [ ] Test compilation on B200 (SM100)
- [ ] Verify UTCCP 4X column stride (i*8) — may need adjustment based on TMEM layout diagrams
- [ ] Verify SF packing: UE4M3 bytes → int32 layout matches what the UTCCP instruction expects
- [ ] Verify the L1 epilogue UE4M3 conversion (float → e4m3 cast + sign bit clear)
- [ ] Validate scale_format_ bit value: currently set by `make_instr_desc_block_scaled<float_ue4m3_t>` which sets scale_format_=0 (E4M3)
- [ ] Verify kNumSFATmemCols and kNumSFBTmemCols calculations for 4X layout
- [ ] Integration with vLLM DeepseekV4MegaMoEExperts class
- [ ] Weight loading: map NVFP4 checkpoint params to DeepseekV4MegaMoEExperts
- [ ] End-to-end quality test: compare NVFP4 mega_moe output vs FlashInfer FP4 MoE
+- [x] Test compilation on B200 (SM100) — **COMPILED**
+- [ ] Verify UTCCP 4X column stride (i*8)
+- [ ] Verify SF packing: UE4M3 → int32 → TMA-aligned layout
+- [x] Add gran_k=16 to C++ transform_sf_into_required_layout
+- [ ] Fix SF layout: must be MN-major (stride(-2)=1) with TMA-aligned stride
+- [ ] Verify the L1 epilogue UE4M3 conversion
+- [ ] Integration with vLLM DeepseekV4MegaMoEExperts — wired, debugging
+- [ ] End-to-end quality test
+
+### Debugging Log
+- Build 7: kPackedFP4 mismatch; fixed by viewing the uint8 tensor as int8
+- Build 9: SF stride assertion failed; SF needs an MN-major layout with a TMA-aligned stride
+- Build 10: `transform_sf_into_required_layout` did not support gran_k=16; fixed in C++
+- Build 11: SF dtype mismatch; float8_e4m3fn scales must be packed into int32 first
+- Builds 12-14: SF stride layout; transpose to MN-major before the transform