The L1 GEMM produces gate+up combined output with 2*intermediate_size
BF16 columns, but _l1_out_buf was only allocated with intermediate_size
columns. The GEMM wrote past the buffer boundary, corrupting GPU memory
and causing cudaErrorInvalidValue on subsequent operations.
This was the root cause of ALL the cudaErrorInvalidValue errors in the
shared expert and MoE L2 paths — the corrupted memory from the L1 buffer
overflow propagated downstream.
Fix: _l1_out_buf shape (max_rows, 2*intermediate_size) instead of
(max_rows, intermediate_size). Applied to both shared_expert.py and moe.py.
Also removed all DEBUG sync/print statements from quantize.py and
shared_expert.py — the bug was not in the quantize kernels, it was
the buffer overflow.