stage_activation now returns (x_fp4, x_sf, input_global_scale).
The global scale is passed as the CUTLASS GEMM alpha parameter and
applied in the epilogue: D = alpha * (A @ B). This avoids the
fp32→UE4M3 round-trip that folding the scale into the block scale
factors would introduce.
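
A minimal NumPy sketch of why the epilogue route is preferable. It
is not the kernel code: the operand shapes, the alpha value, and the
helper round_to_4_sig_bits are hypothetical, and the helper only
imitates UE4M3's 3 stored mantissa bits (exponent range, subnormals
and saturation are ignored).

```python
import numpy as np

def round_to_4_sig_bits(x):
    """Round positive values to 4 significant bits (1 implicit + 3 stored).

    Simplified stand-in for UE4M3 rounding: exponent range, subnormals
    and saturation are ignored.
    """
    mantissa, exponent = np.frexp(x)  # x = mantissa * 2**exponent, mantissa in [0.5, 1)
    return np.ldexp(np.round(mantissa * 16.0) / 16.0, exponent)

rng = np.random.default_rng(0)
BLOCK = 16  # elements sharing one block scale factor
fp4_magnitudes = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])

# Hypothetical operands: A is block-scaled, B is kept dense for brevity.
signs = rng.choice([-1.0, 1.0], size=(4, 32))
a_vals = rng.choice(fp4_magnitudes, size=(4, 32)) * signs
a_sf = round_to_4_sig_bits(rng.uniform(0.5, 4.0, size=(4, 32 // BLOCK)))
b = rng.standard_normal((32, 3))
alpha = 0.37  # hypothetical global scale, kept in full precision

def dequant(vals, sf):
    """Expand one scale factor per block and apply it to the values."""
    return vals * np.repeat(sf, BLOCK, axis=1)

# Exact result: global scale applied to A before the matmul.
reference = (dequant(a_vals, a_sf) * alpha) @ b
# Epilogue route: D = alpha * (A @ B), alpha never leaves full precision.
epilogue = alpha * (dequant(a_vals, a_sf) @ b)
# Folding route: alpha is multiplied into the block scales, which must
# then be re-rounded to the low-precision scale format.
folded = dequant(a_vals, round_to_4_sig_bits(a_sf * alpha)) @ b

err_epilogue = np.abs(epilogue - reference).max()
err_folded = np.abs(folded - reference).max()
print(f"epilogue alpha error: {err_epilogue:.2e}")
print(f"folded scale error:   {err_folded:.2e}")
```

In this sketch the epilogue result differs from the reference only by
floating-point reassociation, while the folded result carries the
re-rounding error of every block scale.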

Changes:
- stage_activation: returns the global scale as a third value
- cutlass_nvfp4_gemm C++ binding: adds an alpha parameter
  (previously hardcoded to 1.0)
- cutlass_grouped_nvfp4_gemm: passes alpha to each per-expert GEMM
- nvfp4_mega_moe_l1/l2: accept alpha and pass it to the grouped GEMM
- nvfp4_moe_full: reads symm_buffer.input_global_scale for L1 and
  uses the global scale returned by stage_activation for L2
- SymmBuffer: adds an input_global_scale field
- vLLM patch: stores the global scale returned by stage_activation