cutedsl/moe_pipeline.py: complete pipeline
- stage_activation: BF16 → NVFP4 (keeps data in FP4)
- L1 GEMM: NVFP4 × NVFP4 → BF16 (gate+up)
- SiLU(gate) * up: BF16 (the only nonlinear step, so it cannot stay in FP4)
- Re-quantize: BF16 → NVFP4 (back to the native format)
- L2 GEMM: NVFP4 × NVFP4 → BF16 (down_proj)
- Scatter with routing weights → BF16 output
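For orientation, the stages above can be sketched as an unquantized reference in NumPy. This is only an illustration, not the code in moe_pipeline.py: the function name `moe_reference`, the tensor layouts, and the assumption that the fused gate+up output stores the gate half first are all hypothetical, and float32 stands in for BF16.

```python
import numpy as np

def silu(x):
    # SiLU(x) = x * sigmoid(x)
    return x / (1.0 + np.exp(-x))

def moe_reference(x, w_gate_up, w_down, topk_ids, topk_weights):
    """Unquantized MoE forward pass (hypothetical layouts).

    x:            [tokens, hidden]
    w_gate_up:    [experts, 2 * inter, hidden]  (gate rows first: an assumption)
    w_down:       [experts, hidden, inter]
    topk_ids:     [tokens, k] expert index per routing slot
    topk_weights: [tokens, k] routing weight per routing slot
    """
    num_tokens, hidden = x.shape
    inter = w_down.shape[2]
    out = np.zeros((num_tokens, hidden), dtype=np.float32)
    for e in range(w_gate_up.shape[0]):
        tok, slot = np.nonzero(topk_ids == e)   # tokens routed to expert e
        if tok.size == 0:
            continue
        h = x[tok] @ w_gate_up[e].T             # L1 GEMM (gate+up)
        gate, up = h[:, :inter], h[:, inter:]
        act = silu(gate) * up                   # the only nonlinear step
        y = act @ w_down[e].T                   # L2 GEMM (down_proj)
        # Scatter back to token order, scaled by the routing weights.
        np.add.at(out, tok, topk_weights[tok, slot][:, None] * y)
    return out

# Usage: 4 tokens, hidden=8, inter=16, 2 experts, top-1 routing.
rng = np.random.default_rng(0)
x = rng.standard_normal((4, 8)).astype(np.float32)
w13 = rng.standard_normal((2, 32, 8)).astype(np.float32)
w2 = rng.standard_normal((2, 8, 16)).astype(np.float32)
ids = np.array([[0], [1], [1], [0]])
wts = np.ones((4, 1), dtype=np.float32)
y = moe_reference(x, w13, w2, ids, wts)
```

In the quantized pipeline, the two GEMMs would consume NVFP4 operands and a re-quantization step would sit between `act` and the L2 GEMM; this sketch leaves both out.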
layertest.py: now tests the FULL MoE pipeline against a BF16 reference.
NVFP4-native: both GEMMs use float4_e2m1fn_x2 for A and B,
float8_e4m3fn for block scales, float32 for global scales.
BF16 is used only for the SiLU activation and the final scatter.
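
To make the format above concrete, here is a rough NumPy simulation of block-scaled FP4 quantization (quantize, then dequantize). It is a sketch under stated assumptions, not the kernel's code: the name `fake_quant_nvfp4` is hypothetical, the 16-element block size is assumed rather than taken from this message, and the block scale is kept in float32 instead of being rounded to float8_e4m3fn with a separate float32 global scale.

```python
import numpy as np

# Magnitudes representable in E2M1 (the sign is a separate bit).
E2M1_GRID = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0], dtype=np.float32)
BLOCK = 16  # assumed block size; not stated in this message

def fake_quant_nvfp4(x, block=BLOCK):
    """Round x to a block-scaled E2M1 grid and return the dequantized values.

    Simplification: the per-block scale stays in float32 here. The real
    format stores it as float8_e4m3fn alongside a float32 global scale.
    x.size must be a multiple of `block`.
    """
    x = np.asarray(x, dtype=np.float32)
    blocks = x.reshape(-1, block)
    amax = np.abs(blocks).max(axis=1, keepdims=True)
    # Map each block's largest magnitude onto 6.0, the largest E2M1 value.
    scale = np.where(amax > 0, amax / 6.0, 1.0)
    scaled = blocks / scale
    # Snap every magnitude to the nearest grid point, then restore the sign.
    idx = np.abs(np.abs(scaled)[..., None] - E2M1_GRID).argmin(axis=-1)
    q = np.sign(scaled) * E2M1_GRID[idx]
    return (q * scale).reshape(x.shape)

# Usage: measure the round-trip error on random activations.
rng = np.random.default_rng(0)
a = rng.standard_normal((4, 32)).astype(np.float32)
err = np.abs(fake_quant_nvfp4(a) - a).max()
```

Because the widest gap in the grid is between 4 and 6, the worst-case error in a block is one sixth of that block's largest magnitude, which gives a sense of the tolerance a comparison against a BF16 reference needs.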