Revert SE BF16 fallback — produced garbage output
The dequantize_nvfp4 path for the shared expert (SE) made output WORSE (random Chinese tokens, gibberish) than the NVFP4 GEMM path, which at least produces 'OK'. The SE NVFP4 GEMM is working; the scale computation in the dequant path was likely wrong. Keeping the BF16 router gate, which improved output from a 'response' loop to 'OK'.
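For anyone revisiting the dequant path: a slow reference to diff against can help narrow down where the scales go wrong. The sketch below is not the dequantize_nvfp4 from this repo. The function names are hypothetical, and it assumes the usual NVFP4 layout (E2M1 4-bit codes, one block scale per 16 elements, one per-tensor global scale), low-nibble-first packing, and a global scale that is multiplied in. Some toolchains store the reciprocal of the global scale and divide instead, so that convention is worth checking first.

```python
# E2M1 magnitudes indexed by the low 3 bits of a 4-bit code; bit 3 is the sign.
E2M1_MAGNITUDES = (0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0)
BLOCK_SIZE = 16  # NVFP4 shares one block scale across 16 consecutive elements


def decode_e2m1(code):
    """Decode one 4-bit E2M1 code (0..15) to a Python float."""
    magnitude = E2M1_MAGNITUDES[code & 0x7]
    return -magnitude if code & 0x8 else magnitude


def dequantize_nvfp4_reference(packed, block_scales, global_scale):
    """Hypothetical reference dequantizer for one row of packed FP4 weights.

    packed:       bytes, two codes per byte
                  (ASSUMPTION: low nibble holds the even-indexed element)
    block_scales: one already-decoded float per 16 elements
    global_scale: per-tensor float
                  (ASSUMPTION: multiplied in, not stored as a reciprocal)
    """
    n_elements = 2 * len(packed)
    if n_elements != BLOCK_SIZE * len(block_scales):
        raise ValueError("expected exactly one block scale per 16 elements")
    out = []
    for i, byte in enumerate(packed):
        for j, code in enumerate((byte & 0x0F, byte >> 4)):
            idx = 2 * i + j
            scale = block_scales[idx // BLOCK_SIZE] * global_scale
            out.append(decode_e2m1(code) * scale)
    return out
```

Comparing this against the fast path on a single weight row should show whether the mismatch is in nibble order, block-scale indexing, or the direction of the global scale.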
@@ -1359,9 +1359,7 @@ def main():
se.set_fused_swiglu(True)
# EAGERLY process shared expert weights
se._ensure_initialized()
# BF16 fallback for shared expert — dequantize NVFP4 weights to BF16
se._use_runtime_gsa = True
se.enable_bf16_fallback() # sets _fused_swiglu=False, pre-materializes BF16 weights
se_runners[li] = se
if (li+1) % 10 == 0: print(f" Built {li+1}/{n_layers} MoE layers")
torch.cuda.empty_cache()