Without a global scale, block scales (block_max / 6.0) can exceed the
UE4M3 max (448.0) for large activations, which saturates the scales and
yields garbage MoE outputs. The degeneration pattern (positions 1-5 OK,
then constant spaces) is consistent with UE4M3 overflow: the first few
tokens have small activations that fit, but once SiLU(gate) * up produces
larger values, block scales overflow and the GEMM produces zeros/garbage.
Fix: compute input_global_scale = amax / (6.0 * 448.0), normalize by it
before block quantization, then fold the global scale back into the block
scales (the same way weight_transform.py folds weight_scale_2). This
keeps block scales within the UE4M3 range (≤ 448.0).
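A minimal pure-Python sketch of the scaling arithmetic described above, not the actual staging kernel. The block size of 16, the helper names, and the sample activations are assumptions for illustration; only the two formulas (block_max / 6.0 and amax / (6.0 * 448.0)) come from the text.

```python
UE4M3_MAX = 448.0  # largest finite UE4M3 scale value
FP4_MAX = 6.0      # largest E2M1 (FP4) magnitude
BLOCK = 16         # assumed NVFP4 block size


def block_scales(values, block=BLOCK):
    """Per-block scale = block_max / 6.0."""
    return [
        max(abs(v) for v in values[i:i + block]) / FP4_MAX
        for i in range(0, len(values), block)
    ]


def global_scale(values):
    """input_global_scale = amax / (6.0 * 448.0); 1.0 for all-zero input."""
    amax = max(abs(v) for v in values)
    return amax / (FP4_MAX * UE4M3_MAX) if amax > 0 else 1.0


# One small block and one large block (made-up values).
activations = [0.5] * BLOCK + [40000.0] * BLOCK

raw = block_scales(activations)             # second scale is far above 448
g = global_scale(activations)
normalized = [v / g for v in activations]   # amax becomes about 6 * 448
scaled = block_scales(normalized)           # every scale now fits in UE4M3

print(max(raw), max(scaled))
```

Multiplying a normalized value by g recovers the original activation, which is the sense in which the global scale can be folded back in after quantization.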
lm_head lives on DeepseekV4ForCausalLM, not DeepseekV4Model. The inner
load_weights silently drops it (not in params_dict). Extract it in the
outer loader, load it directly, then forward the rest to the inner model.
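A rough sketch of the split described above, assuming the weights arrive as (name, tensor) pairs and that the head keys start with "lm_head."; split_lm_head is a hypothetical helper, not the real loader code.

```python
def split_lm_head(weights, prefix="lm_head."):
    """Partition (name, tensor) pairs into lm_head entries and the rest.

    Note: this materializes both lists; a real loader would more likely
    stream the pairs to avoid holding every tensor at once.
    """
    head, rest = [], []
    for name, tensor in weights:
        (head if name.startswith(prefix) else rest).append((name, tensor))
    return head, rest


# Strings stand in for tensors in this illustration.
ckpt = [
    ("model.layers.0.attn.wq_a.weight", "t0"),
    ("lm_head.weight", "t1"),
    ("model.norm.weight", "t2"),
]
head, rest = split_lm_head(ckpt)
print(head)  # loaded directly by the outer loader
print(rest)  # forwarded to the inner model's load_weights
```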
Two fixes:
1. attn_hc.base → hc_attn_base (underscore, not dot, before base/fn/scale).
   Same for fn, scale, and the ffn_hc variants.
2. compressor.position_bias → compressor.ape never fired because the
   .self_attn.compressor. rule matched first and the loop breaks on the
   first match. Added a combined rule:
   .self_attn.compressor.position_bias → .attn.mla_attn.compressor.ape.
Three more dropped checkpoint→model mappings:
1. hc_head: checkpoint has hc_head.hc_base/fn/scale, model has
   hc_head_base/fn/scale (underscore, not dot, separator).
2. attn_hc/ffn_hc: checkpoint has .attn_hc. and .ffn_hc., model has
   .hc_attn. and .hc_ffn. (word order reversed).
3. compressor.position_bias → compressor.ape: checkpoint name is
   position_bias, model attr is ape (absolute position encoding).
All 461 remaining zero params should now be just indexer.k_norm.bias
(legitimately zero: the checkpoint has only a weight, no bias).
indexer.weights_proj is uint8 [64,3584] in checkpoint but bf16 [64,7168]
in model. The uint8→bf16 unpack logic only ran in the stacked_params
loop, so non-stacked NVFP4 params hit a size mismatch assertion.
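The shape change from [64,3584] to [64,7168] matches two 4-bit values per byte. A sketch of such an unpack is shown here, assuming the payload is E2M1 (FP4) with the low nibble first; the nibble order and the helper names are assumptions, and block scales are not applied.

```python
# Magnitudes representable by E2M1, indexed by the low 3 bits of the nibble.
E2M1_MAGNITUDES = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]


def decode_e2m1(nibble):
    """Decode one 4-bit E2M1 code: bit 3 is the sign, bits 0-2 the magnitude."""
    sign = -1.0 if nibble & 0x8 else 1.0
    return sign * E2M1_MAGNITUDES[nibble & 0x7]


def unpack_fp4_row(packed):
    """Expand a row of packed bytes into twice as many float values."""
    out = []
    for byte in packed:
        out.append(decode_e2m1(byte & 0x0F))  # assumed: low nibble first
        out.append(decode_e2m1(byte >> 4))
    return out


row = unpack_fp4_row([0x21, 0xF7])
print(row)
print(len(unpack_fp4_row([0] * 3584)))  # one packed row of weights_proj
```

Whatever the exact decode is, the point of the fix is that it has to run for non-stacked params too, before the shape assertion.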
Three categories of missed renames in CKPT_KEY_SUBST:
1. Shared experts: .shared_experts.gate_proj.→.ffn.shared_experts.w1. fired,
   but the break prevented .mlp.→.ffn. from also applying, producing
   mlp.ffn.shared_experts.w1. (double prefix). Fixed by including .mlp.
   in the pattern. Added the missing .shared_experts.down_proj. rule.
2. Indexer (layers 2+): .self_attn.compressor.indexer.* was caught by the
   generic .self_attn.compressor.→.attn.mla_attn.compressor. rule, producing
   the wrong path attn.mla_attn.compressor.indexer.* instead of
   attn.indexer.*. Added indexer-specific patterns (q_b_proj→wq_b,
   kv_norm→k_norm, position_bias→compressor.ape, gate_proj→compressor.wgate,
   kv_proj→compressor.wkv) before the generic compressor rule.
3. Compressor kv_proj/gate_proj: the old .compressor.kv_proj.→.compressor.wkv.
   pattern could never fire because .self_attn.compressor. matched first
   and the loop breaks on the first match. Merged into combined patterns
   that handle both the self_attn.compressor→attn.mla_attn.compressor path
   AND the projection rename in one step.
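The first-match behaviour behind all three categories can be reproduced with a small sketch. It assumes CKPT_KEY_SUBST is an ordered list of (old, new) substring pairs and that the loop stops at the first hit; rename is a hypothetical stand-in for the real loop.

```python
def rename(key, rules):
    """Apply only the first matching (old, new) substitution, then stop."""
    for old, new in rules:
        if old in key:
            return key.replace(old, new)  # mirrors the break after a match
    return key


GENERIC = (".self_attn.compressor.", ".attn.mla_attn.compressor.")
OLD_KV = (".compressor.kv_proj.", ".compressor.wkv.")
COMBINED = (".self_attn.compressor.kv_proj.", ".attn.mla_attn.compressor.wkv.")

key = "layers.0.self_attn.compressor.kv_proj.weight"

# Old ordering: the generic rule wins, so kv_proj is never renamed.
broken = rename(key, [GENERIC, OLD_KV])

# Fixed ordering: the combined rule sits before the generic one.
fixed = rename(key, [COMBINED, GENERIC])

print(broken)
print(fixed)
```

With first-match semantics, more specific patterns have to come before any generic pattern that shares their prefix.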
Checkpoint keys are model.layers.N.shared_experts.gate_proj.weight
but model params are layers.N.ffn.shared_experts.gate_up_proj.weight.
The .ffn. was missing from the rename, so stacked gate_up_proj
never matched params_dict.
The stacking logic skipped any key containing '.experts.' to avoid
MoE routed expert weights. But 'shared_experts' also matches that
substring, so gate_proj and up_proj were never stacked into
gate_up_proj. Changed to '.ffn.experts.' which only matches the
routed experts path.
Also adds a post-load scan for all-zero params.
vLLM's symm_buffer stores topk_ids as GLOBAL expert IDs (0..383).
Our weight tensors are indexed by LOCAL IDs (0..47 per rank).
Each rank r handles experts [r*48, r*48+47]. Without conversion,
topk_ids such as 137, 222, 378 index far out of bounds in the
weight tensor (shape (48, N, K)), producing garbage.
Fix: derive experts_start_idx from the topk_ids and subtract it to get
local IDs. This is why every rank except rank 0 produced zero
expert matches → zero output → garbage text.
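A sketch of the global-to-local conversion, with two simplifications that are assumptions of this example rather than the kernel's behaviour: the start index is computed from the rank instead of being derived from topk_ids, and IDs owned by other ranks are marked with -1.

```python
NUM_EXPERTS = 384
NUM_RANKS = 8
EXPERTS_PER_RANK = NUM_EXPERTS // NUM_RANKS  # 48


def to_local_ids(topk_ids, rank):
    """Map global expert IDs to this rank's local IDs (0..47), else -1."""
    experts_start_idx = rank * EXPERTS_PER_RANK
    local = []
    for gid in topk_ids:
        lid = gid - experts_start_idx
        local.append(lid if 0 <= lid < EXPERTS_PER_RANK else -1)
    return local


topk_ids = [137, 222, 378]
for rank in (2, 4, 7):
    print(rank, to_local_ids(topk_ids, rank))
```

On rank 0 the start index is 0, so global and local IDs coincide, which is consistent with rank 0 being the only rank that worked before the fix.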
DeepSeek-V4-Pro has 384 routed experts, 48 per rank (384/8).
The cross-rank all-reduce happens in the parent DeepseekV4MoE.forward,
not in our kernel. Our kernel writes local output; caller does reduce.
Fixed README, nvfp4_mega_moe.py comments.
weight_transform.py returns float8_e4m3fn scales, NOT packed uint32.
The _pack_ue4m3_to_uint32 function was never called. Removed it.
Updated README data formats to accurately reflect the pipeline:
- Weight scales: float8_e4m3fn (direct to CUTLASS, no unpack)
- Activation scales: uint32 packed (from staging kernel, unpacked to float8)
Added detailed SF remap section with the empirical coordinate dump table
showing flat_rank=8 decomposition. Documented all 5 bugs found/fixed,
the diagnostic trail (constant-scale test, single-element probes), and
the 6 verification probes confirming the extraction formula.
  m    = f0 + f1*32 + f2*128   (CuTe 'first sub varies fastest')
  k_sf = f4 + f5*4
f3 is the Step<2> stride (degenerate, always = total), NOT a coordinate.
The previous formula, (f3*2+f2)*128, was catastrophically wrong: it mapped
everything to m=0 or a huge m.
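The extraction formula can be checked on the host with a few lines of Python. This is a sketch only: sf_coords and flat_from_m are hypothetical helpers, and the sub-index ranges (f0 < 32, f1 < 4) are inferred from the multipliers 32 and 128 rather than taken from the coordinate dump.

```python
def sf_coords(flat):
    """Recover logical (m, k_sf) from the flattened rank-8 coordinate."""
    f0, f1, f2, _f3, f4, f5 = flat[:6]  # f3 is skipped: not a coordinate
    m = f0 + f1 * 32 + f2 * 128
    k_sf = f4 + f5 * 4
    return m, k_sf


def flat_from_m(m):
    """Inverse of the m part, assuming f0 < 32 and f1 < 4."""
    return m % 32, (m // 32) % 4, m // 128


m, k_sf = sf_coords((5, 2, 1, 0, 3, 1, 0, 0))
print(m, k_sf)

# Round trip over a few tiles of rows.
for row in range(512):
    f0, f1, f2 = flat_from_m(row)
    assert sf_coords((f0, f1, f2, 0, 0, 0, 0, 0))[0] == row
```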
Previous approach assumed rank 2-6, but actual rank is 8.
For R==8:
  4 M sub-indices (inner_32, inner_4, tile_interleave, tile_m)
  4 K sub-indices (inner_16, inner_4_k, tile_k_interleave, tile_k)
  m    = (f3*2 + f2)*128 + f0*4 + f1
  k_sf = f5 + f6*4   (tentative, needs printf verification)
Added printf of all 8 flat values for first 3 indices.
Going back to the idx2crd approach which compiles and runs.
Added printf for flat_rank, MN, K_sf, and first coordinate extraction.
Handles ranks 2-6 with logical (m, k_sf) extraction.
This will tell us the actual flat_rank and whether our extraction is correct.
layout_sf(m, k_elem) with flat ints fails with "Mismatched Ranks" because
the layout shape is ((32,4), K_padded), not (M, K).
Decompose m into (inner_m, sub_m) = (m/4, m%4) to match the (32,4)
sub-shape, and pass it as make_tuple(make_tuple(inner_m, sub_m), k_elem).
Removed dead code from old idx2crd approach. File is now clean:
- Source-iterating SF remap kernel with layout_sf(m, k_elem)
- Zero-init dest buffers before remap
- Proper extern "C" wrapping