Commit Graph

6 Commits

Author SHA1 Message Date
128ff84358 fix: 384 experts (not 256), clarify cross-rank reduce is in caller
DeepSeek-V4-Pro has 384 routed experts, 48 per rank (384/8).
The cross-rank all-reduce happens in the parent DeepseekV4MoE.forward,
not in our kernel. Our kernel writes local output; caller does reduce.
Fixed README, nvfp4_mega_moe.py comments.
2026-05-14 17:33:59 +00:00
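Note on 128ff84358: the arithmetic and the division of labor described in this commit can be sketched as follows. This is a minimal illustration in plain Python, not the package's code. The contiguous expert-to-rank assignment and the names `local_expert_range`, `local_kernel_output`, and `caller_forward` are assumptions made for the example; only the counts (384 experts, 8 ranks, 48 per rank) and the "kernel writes local output, caller reduces" split come from the commit message.

```python
NUM_ROUTED_EXPERTS = 384  # from the commit message
NUM_RANKS = 8             # from the commit message (384 / 8)


def experts_per_rank(num_experts=NUM_ROUTED_EXPERTS, num_ranks=NUM_RANKS):
    """Number of routed experts hosted by each rank."""
    if num_experts % num_ranks != 0:
        raise ValueError("experts must divide evenly across ranks")
    return num_experts // num_ranks


def local_expert_range(rank, num_experts=NUM_ROUTED_EXPERTS, num_ranks=NUM_RANKS):
    """Expert ids owned by `rank`. ASSUMPTION: contiguous assignment."""
    if not 0 <= rank < num_ranks:
        raise ValueError("rank out of range")
    per_rank = experts_per_rank(num_experts, num_ranks)
    return range(rank * per_rank, (rank + 1) * per_rank)


def local_kernel_output(rank, token_expert_ids, expert_fn):
    """Hypothetical stand-in for the kernel: sums only the experts this rank owns."""
    owned = local_expert_range(rank)
    return sum(expert_fn(e) for e in token_expert_ids if e in owned)


def caller_forward(token_expert_ids, expert_fn):
    """Hypothetical stand-in for the parent forward: reduces the per-rank partials."""
    partials = [
        local_kernel_output(rank, token_expert_ids, expert_fn)
        for rank in range(NUM_RANKS)
    ]
    return sum(partials)  # plays the role of the cross-rank all-reduce
```

In this sketch no single rank produces the final value for a token whose experts live on several ranks; only the sum computed in the caller is complete, which is why the reduce belongs outside the kernel.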
1c15dadaa5 cleanup: remove dead _pack_ue4m3_to_uint32, fix data format docs
weight_transform.py returns float8_e4m3fn scales, NOT packed uint32.
The _pack_ue4m3_to_uint32 function was never called. Removed it.
Updated README data formats to accurately reflect the pipeline:
- Weight scales: float8_e4m3fn (direct to CUTLASS, no unpack)
- Activation scales: uint32 packed (from staging kernel, unpacked to float8)
2026-05-14 17:28:12 +00:00
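Note on 1c15dadaa5: the difference between the two scale formats can be illustrated with a small pure-Python sketch that unpacks one uint32 word into four 8-bit scale codes and decodes each code as float8_e4m3fn. The byte order (lowest byte first) and the function names are assumptions for illustration; the commit does not state how the staging kernel orders scales within a word. The decode follows the standard E4M3FN layout (1 sign bit, 4 exponent bits, 3 mantissa bits, bias 7, no infinities).

```python
import math
import struct


def unpack_uint32_to_bytes(word):
    """Split a uint32 into four bytes. ASSUMPTION: little-endian, lowest byte first."""
    if not 0 <= word <= 0xFFFFFFFF:
        raise ValueError("word must fit in 32 bits")
    return list(struct.pack("<I", word))


def decode_e4m3fn(byte):
    """Decode one float8_e4m3fn code (0..255) to a Python float."""
    sign = -1.0 if byte & 0x80 else 1.0
    exponent = (byte >> 3) & 0xF
    mantissa = byte & 0x7
    if exponent == 0xF and mantissa == 0x7:
        return math.nan  # E4M3FN reserves only this pattern for NaN
    if exponent == 0:
        # Subnormal: no implicit leading 1, fixed exponent of 1 - bias = -6
        return sign * (mantissa / 8.0) * 2.0 ** -6
    return sign * (1.0 + mantissa / 8.0) * 2.0 ** (exponent - 7)


def unpack_scales(word):
    """Hypothetical helper: one packed uint32 -> four decoded scale values."""
    return [decode_e4m3fn(b) for b in unpack_uint32_to_bytes(word)]
```

Under these assumptions, weight scales that are already float8_e4m3fn skip the first step entirely, while packed activation scales need the split before they can be read as float8 values.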
008f8cccbd docs: comprehensive README with SF remap probe data, bug history, coordinate table
Added detailed SF remap section with the empirical coordinate dump table
showing flat_rank=8 decomposition. Documented all 5 bugs found/fixed,
the diagnostic trail (constant-scale test, single-element probes), and
the 6 verification probes confirming the extraction formula.
2026-05-14 17:02:53 +00:00
8b7fa0c91e add README: pipeline diagram, file map, data formats, known issues
2026-05-14 12:48:08 +00:00
1dfe5ffd05 Add comprehensive README documenting quirks, pitfalls, and setup
2026-05-14 11:23:32 +00:00
c2b752c2fe Initial: TileLang NVFP4 mega_moe kernel package
- nvfp4_mega_moe_full: drop-in replacement for deep_gemm.mega.fp8_nvfp4_mega_moe
- transform_nvfp4_weights_for_mega_moe: weight transformation (tested)
- SymmBuffer + get_symm_buffer_for_nvfp4_mega_moe: API-matching stubs
- MEGA_MOE_STATIC=1 support for pipeline testing
- pyproject.toml for pip install
2026-05-13 15:44:51 +00:00