Revert SE BF16 fallback — produced garbage output

The dequantize_nvfp4 path for the shared expert (SE) made output worse (random Chinese tokens, gibberish) than the NVFP4 GEMM path, which at least produces 'OK'. The SE NVFP4 GEMM is working; the dequant scale computation was likely wrong. Keeping the BF16 router gate, which improved output from a 'response' loop to 'OK'.
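
To narrow down the suspected scale bug, the dequantized weights could be compared against a small reference dequantizer. The sketch below is not the repository's dequantize_nvfp4. It assumes the usual NVFP4 layout: 4-bit E2M1 values packed two per byte, one scale per block of 16 values, and one per-tensor scale. The function name, the low-nibble-first packing order, the block scales passed in as already-decoded floats, and the choice to multiply (rather than divide) by the per-tensor scale are all assumptions. Some checkpoints store the reciprocal of the per-tensor scale, which is one plausible way to get gibberish output.

```python
# Reference NVFP4 dequantization in pure Python (illustrative sketch only).
# Assumed layout: E2M1 codes packed two per byte, low nibble first;
# one block scale per 16 values; one per-tensor (global) scale.

# E2M1 magnitudes for the low 3 bits of a 4-bit code; bit 3 is the sign.
E2M1_MAGNITUDES = (0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0)
BLOCK_SIZE = 16


def decode_e2m1(code):
    """Decode one 4-bit E2M1 code (0..15) to a Python float."""
    magnitude = E2M1_MAGNITUDES[code & 0x7]
    return -magnitude if code & 0x8 else magnitude


def dequantize_nvfp4_reference(packed, block_scales, global_scale):
    """Hypothetical helper: unpack nibbles, then apply both scale levels.

    packed        -- bytes, two E2M1 codes per byte (low nibble first, assumed)
    block_scales  -- one float per block of 16 values (already decoded)
    global_scale  -- per-tensor scale, assumed to be a multiplier
    """
    values = []
    for byte in packed:
        values.append(decode_e2m1(byte & 0xF))
        values.append(decode_e2m1(byte >> 4))
    if len(values) != len(block_scales) * BLOCK_SIZE:
        raise ValueError("block_scales does not match the number of packed values")
    return [
        v * block_scales[i // BLOCK_SIZE] * global_scale
        for i, v in enumerate(values)
    ]
```

If a reference like this agrees with the GEMM output on a few rows but the BF16 fallback does not, the fallback's scale handling (multiply vs. divide, or block-scale indexing) would be the first place to look.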
2026-06-03 13:48:44 +00:00
parent 0c3796966d
commit f05ee6cd69
1 changed files with 0 additions and 2 deletions
--- a/single_shot_inference.py
+++ b/single_shot_inference.py
@@ -1359,9 +1359,7 @@ def main():
        se.set_fused_swiglu(True)
        # EAGERLY process shared expert weights
        se._ensure_initialized()
-        # BF16 fallback for shared expert — dequantize NVFP4 weights to BF16
        se._use_runtime_gsa = True
-        se.enable_bf16_fallback()  # sets _fused_swiglu=False, pre-materializes BF16 weights
        se_runners[li] = se
        if (li+1) % 10 == 0: print(f"  Built {li+1}/{n_layers} MoE layers")
        torch.cuda.empty_cache()