docs: update debugging log in README
@@ -85,12 +85,18 @@ The `weight_scale_2` must be multiplied into the block scales **before** packing
## Remaining Work
- [ ] Test compilation on B200 (SM100)
- [ ] Verify UTCCP 4X column stride (i*8); may need adjustment based on the TMEM layout diagrams
- [ ] Verify SF packing: UE4M3 bytes → int32 layout matches what the UTCCP instruction expects
- [ ] Verify the L1 epilogue UE4M3 conversion (float → e4m3 cast + sign bit clear)
- [ ] Validate the `scale_format_` bit value: it is currently set by `make_instr_desc_block_scaled<float_ue4m3_t>`, which sets `scale_format_=0` (E4M3)
- [ ] Verify the `kNumSFATmemCols` and `kNumSFBTmemCols` calculations for the 4X layout
- [ ] Integrate with the vLLM `DeepseekV4MegaMoEExperts` class
- [ ] Weight loading: map NVFP4 checkpoint parameters to `DeepseekV4MegaMoEExperts`
- [ ] End-to-end quality test: compare NVFP4 `mega_moe` output against FlashInfer FP4 MoE
- [x] Test compilation on B200 (SM100): **COMPILED**
- [ ] Verify UTCCP 4X column stride (i*8)
- [ ] Verify SF packing: UE4M3 → int32 → TMA-aligned layout
- [x] Add `gran_k=16` support to the C++ `transform_sf_into_required_layout`
- [ ] Fix SF layout: it must be MN-major (`stride(-2) == 1`) with a TMA-aligned stride
- [ ] Verify the L1 epilogue UE4M3 conversion
- [ ] Integrate with vLLM `DeepseekV4MegaMoEExperts`: wired up, still debugging
- [ ] End-to-end quality test
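
The SF-packing and UE4M3-conversion items above can be sanity-checked on the host. The following is a minimal pure-Python sketch, assuming the `float8_e4m3fn` encoding (1 sign bit, 4 exponent bits with bias 7, 3 mantissa bits, max 448, no infinities). It also assumes that four consecutive UE4M3 scales go into one `int32` in little-endian byte order, which is what a `uint8` → `int32` view produces on x86/ARM hosts. Whether that order matches what UTCCP expects is still an open item on the checklist. All function names are hypothetical helpers, not part of DeepGEMM or vLLM.

```python
import math
import struct

E4M3_MAX = 448.0  # largest finite float8_e4m3fn value


def e4m3_byte_to_float(b: int) -> float:
    """Decode one E4M3 byte: sign, 4-bit exponent (bias 7), 3-bit mantissa."""
    sign = -1.0 if b & 0x80 else 1.0
    exp = (b >> 3) & 0xF
    man = b & 0x7
    if exp == 0xF and man == 0x7:  # S.1111.111 is NaN; there is no infinity
        return math.nan
    if exp == 0:  # subnormal: (man / 8) * 2^-6
        return sign * (man / 8.0) * 2.0 ** -6
    return sign * (1.0 + man / 8.0) * 2.0 ** (exp - 7)


# All finite non-negative codes (0x00..0x7E) and the values they encode.
_POS_CODES = range(0x7F)
_POS_VALUES = [e4m3_byte_to_float(c) for c in _POS_CODES]


def float_to_e4m3_byte(x: float) -> int:
    """Round-to-nearest-even cast of a finite float to E4M3, saturating at +/-448."""
    sign = 0x80 if x < 0 else 0
    mag = min(abs(x), E4M3_MAX)
    # Nearest code wins; on an exact tie prefer the even code (mantissa LSB == 0).
    best = min(_POS_CODES, key=lambda c: (abs(_POS_VALUES[c] - mag), c & 1))
    return sign | best


def to_ue4m3(x: float) -> int:
    """E4M3 cast followed by clearing the sign bit, giving an unsigned UE4M3 byte."""
    return float_to_e4m3_byte(x) & 0x7F


def pack_ue4m3x4(scales: list[int]) -> int:
    """Pack four UE4M3 bytes into one int32, scales[0] in the least significant byte."""
    assert len(scales) == 4 and all(0 <= s <= 0x7F for s in scales)
    return struct.unpack("<i", bytes(scales))[0]
```

Comparing `pack_ue4m3x4` results against the raw `int32` SF tensor handed to the kernel is one way to confirm (or refute) the assumed byte order.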
### Debugging Log
- Build 7: `kPackedFP4` dtype mismatch → fixed by viewing the `uint8` tensor as `int8`
- Build 9: SF stride assertion failure → SFs need an MN-major layout with TMA alignment
- Build 10: `transform_sf_into_required_layout` doesn't support `gran_k=16` → fixed in C++
- Build 11: SF dtype mismatch (`float8_e4m3fn` SFs must be packed into `int32` first)
- Builds 12–14: SF stride layout → transpose to MN-major before calling `transform_sf_into_required_layout`
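
Builds 9 and 12–14 both came down to the SF tensor's memory layout. The sketch below shows the target layout in plain Python. It assumes a 2-D `[MN][K]` matrix of already-packed `int32` SFs and the 16-byte global-stride alignment that TMA descriptors require. Elements along MN become contiguous (`stride(-2) == 1`), and the K stride is MN rounded up to a multiple of 4 `int32` elements. The helper names and the zero padding are illustrative assumptions; this is not the implementation of `transform_sf_into_required_layout`.

```python
TMA_ALIGNMENT_BYTES = 16  # TMA global strides must be multiples of 16 bytes


def ceil_div(a: int, b: int) -> int:
    return (a + b - 1) // b


def tma_aligned_size(n: int, element_size: int) -> int:
    """Round n elements up so that n * element_size is a multiple of 16 bytes."""
    assert TMA_ALIGNMENT_BYTES % element_size == 0
    alignment = TMA_ALIGNMENT_BYTES // element_size
    return ceil_div(n, alignment) * alignment


def to_mn_major_tma_aligned(sf: list[list[int]], element_size: int = 4):
    """Copy a row-major (K-major) [MN][K] SF matrix into an MN-major flat buffer.

    Element (i, j) lands at offset i + j * aligned_mn, so the returned strides
    are (1, aligned_mn): stride(-2) == 1 and the K stride is TMA-aligned.
    Padding slots (rows mn..aligned_mn-1) are zero-filled.
    """
    mn, k = len(sf), len(sf[0])
    aligned_mn = tma_aligned_size(mn, element_size)
    buf = [0] * (aligned_mn * k)
    for i in range(mn):
        for j in range(k):
            buf[i + j * aligned_mn] = sf[i][j]
    return buf, (mn, k), (1, aligned_mn)


# MN = 3 is padded to 4 int32 elements (16 bytes) per K column.
buf, shape, strides = to_mn_major_tma_aligned([[1, 2], [3, 4], [5, 6]])
# buf == [1, 3, 5, 0, 2, 4, 6, 0], shape == (3, 2), strides == (1, 4)
```

In PyTorch the same strides come from slicing the transpose of a `(K, aligned_mn)` buffer. For example, `torch.zeros(k, aligned_mn, dtype=torch.int32).t()[:mn]` has shape `(mn, k)` and strides `(1, aligned_mn)`.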