Revert SE BF16 fallback — produced garbage output

The dequantize_nvfp4 path for the shared expert (SE) made output worse
(random Chinese tokens, gibberish) than the NVFP4 GEMM path, which at
least produces 'OK'. The SE NVFP4 GEMM works; the dequant scale
computation was likely wrong. Keep the BF16 router gate, which improved
output from a 'response' loop to 'OK'.
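
Since the scale computation in the dequant path is the suspected cause, a
small pure-Python reference may help when checking it on a single tile. The
sketch below is not the repository's dequantize_nvfp4. It assumes the usual
NVFP4 layout: 4-bit E2M1 values, one scale per block of 16 elements, and one
per-tensor global scale. The helper names, the low-nibble-first packing, the
already-decoded block scales, and the treatment of the global scale as a
multiplier (some checkpoints store its reciprocal) are all assumptions.

```python
# Hypothetical reference for NVFP4 dequantization; names are illustrative.

# E2M1 magnitudes indexed by the low 3 bits of a 4-bit code; bit 3 is the sign.
E2M1_MAGNITUDES = (0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0)
BLOCK_SIZE = 16  # one block scale is shared by 16 consecutive elements


def decode_e2m1(code):
    """Decode one 4-bit E2M1 code to a Python float."""
    magnitude = E2M1_MAGNITUDES[code & 0x7]
    return -magnitude if code & 0x8 else magnitude


def dequantize_row(packed, block_scales, global_scale):
    """Dequantize one packed row to a list of floats.

    packed:       bytes, two codes per byte (ASSUMPTION: low nibble first).
    block_scales: one already-decoded float per 16-element block.
    global_scale: per-tensor multiplier (ASSUMPTION: not its reciprocal).
    """
    codes = []
    for byte in packed:
        codes.append(byte & 0x0F)  # first element of the pair
        codes.append(byte >> 4)    # second element of the pair
    if len(codes) != BLOCK_SIZE * len(block_scales):
        raise ValueError("expected one block scale per 16 elements")
    return [
        decode_e2m1(code) * block_scales[i // BLOCK_SIZE] * global_scale
        for i, code in enumerate(codes)
    ]


if __name__ == "__main__":
    row = bytes([0x21, 0xF9] + [0x00] * 6)  # 8 bytes = 16 elements
    print(dequantize_row(row, [2.0], 0.5)[:4])
```

Comparing a few rows from this reference against the fallback's BF16
weights would show whether a scale is missing, applied per tensor instead
of per block, or inverted.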
2026-06-03 13:48:44 +00:00
parent 0c3796966d
commit f05ee6cd69

@@ -1359,9 +1359,7 @@ def main():
se.set_fused_swiglu(True)
# EAGERLY process shared expert weights
se._ensure_initialized()
# BF16 fallback for shared expert — dequantize NVFP4 weights to BF16
se._use_runtime_gsa = True
se.enable_bf16_fallback() # sets _fused_swiglu=False, pre-materializes BF16 weights
se_runners[li] = se
if (li+1) % 10 == 0: print(f" Built {li+1}/{n_layers} MoE layers")
torch.cuda.empty_cache()