DeepSeek-V4-Pro has 384 routed experts, 48 per rank (384/8).
The cross-rank all-reduce happens in the parent DeepseekV4MoE.forward,
not in our kernel. Our kernel writes only the local output; the caller
performs the reduce. Fixed the README and nvfp4_mega_moe.py comments to match.
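
For reference, a minimal sketch of the split described above. It assumes a contiguous expert-to-rank assignment and uses a plain Python sum as a stand-in for the all-reduce; the helper names (local_expert_range, local_moe_output, caller_reduce) are hypothetical and do not exist in the repo.

```python
NUM_ROUTED_EXPERTS = 384
WORLD_SIZE = 8
EXPERTS_PER_RANK = NUM_ROUTED_EXPERTS // WORLD_SIZE  # 384 / 8 = 48


def local_expert_range(rank):
    """Expert ids owned by `rank` (assumption: contiguous blocks of 48)."""
    start = rank * EXPERTS_PER_RANK
    return range(start, start + EXPERTS_PER_RANK)


def local_moe_output(rank, routed):
    """Partial output of one rank: only contributions from its own experts.

    `routed` is a list of (expert_id, contribution) pairs for one token.
    This mirrors "kernel writes local output": no cross-rank sum here.
    """
    owned = local_expert_range(rank)
    return sum(value for expert_id, value in routed if expert_id in owned)


def caller_reduce(partials):
    """Stand-in for the all-reduce done by the caller, not by the kernel."""
    return sum(partials)


# One token routed to four experts living on ranks 0, 0, 1 and 7.
routed = [(0, 1.0), (47, 2.0), (48, 4.0), (383, 8.0)]
partials = [local_moe_output(rank, routed) for rank in range(WORLD_SIZE)]
total = caller_reduce(partials)
```

Each rank's partial result is incomplete on its own; only the reduced value is the full MoE output for the token.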
weight_transform.py returns float8_e4m3fn scales, NOT packed uint32.
The _pack_ue4m3_to_uint32 function was never called. Removed it.
Updated README data formats to accurately reflect the pipeline:
- Weight scales: float8_e4m3fn (direct to CUTLASS, no unpack)
- Activation scales: uint32 packed (from staging kernel, unpacked to float8)
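
A rough illustration of the activation-scale unpack step, in pure Python. The byte order (scale 0 in the low byte of the uint32) is an assumption, not confirmed by the staging kernel, and the function names are hypothetical. The decode follows the standard float8 E4M3 "fn" layout: 1 sign bit, 4 exponent bits with bias 7, 3 mantissa bits, no infinities.

```python
import math


def unpack_uint32_to_bytes(word):
    """Split one packed uint32 into four scale bytes.

    Assumption: little-endian lane order, i.e. scale 0 is the low byte.
    """
    if not 0 <= word <= 0xFFFFFFFF:
        raise ValueError("expected an unsigned 32-bit value")
    return [(word >> (8 * lane)) & 0xFF for lane in range(4)]


def decode_e4m3fn(byte):
    """Decode one float8_e4m3fn byte to a Python float."""
    sign = -1.0 if byte & 0x80 else 1.0
    exponent = (byte >> 3) & 0xF
    mantissa = byte & 0x7
    if exponent == 0xF and mantissa == 0x7:
        return math.nan  # the only NaN encoding; e4m3fn has no infinities
    if exponent == 0:
        return sign * (mantissa / 8.0) * 2.0 ** -6  # subnormal
    return sign * (1.0 + mantissa / 8.0) * 2.0 ** (exponent - 7)


# Four scales 1.0, 2.0, 0.5, 1.5 packed into one word (low byte first).
packed = 0x38 | (0x40 << 8) | (0x30 << 16) | (0x3C << 24)
scales = [decode_e4m3fn(b) for b in unpack_uint32_to_bytes(packed)]
```

Weight scales skip this step entirely, since weight_transform.py already returns float8_e4m3fn.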
Added a detailed SF remap section with the empirical coordinate dump table
showing the flat_rank=8 decomposition. Documented all 5 bugs found and fixed,
the diagnostic trail (constant-scale test, single-element probes), and
the 6 verification probes confirming the extraction formula.