nvfp4-megamoe-kernel/dsv4/ops
biondizzle d53e0a33a9 NVFP4-3: add use_2cta_instrs conditional to gemm_runner
- run_nvfp4_grouped_gemm: use_2cta = tokens_sum >= 256 and cluster_m is even
- run_fused_swiglu_grouped_gemm: same conditional
- Auto-warms up on first use via lazy compilation cache
- 1.7-1.9× throughput at prefill shapes (M>=256)
- Decode (M<256) stays 1-CTA (correct, no waste)
2026-05-25 16:42:02 +00:00
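
The selection rule in the commit message can be sketched in a few lines of Python. This is a minimal illustration, not the code in gemm_runner: `should_use_2cta`, `get_kernel`, `run`, and `TWO_CTA_MIN_TOKENS` are hypothetical names, and the "kernel" is a string standing in for a compiled object. Only the threshold (256) and the even `cluster_m` requirement come from the commit message. The `lru_cache` wrapper shows one way a lazy compilation cache might behave, where the first call for a configuration pays the compile cost and later calls reuse the result.

```python
from functools import lru_cache

# Threshold quoted in the commit message: prefill shapes have M >= 256.
TWO_CTA_MIN_TOKENS = 256


def should_use_2cta(tokens_sum: int, cluster_m: int) -> bool:
    """Return True when the 2-CTA path applies (hypothetical helper).

    Mirrors the rule in the commit message: enough tokens across all
    groups, and an even cluster_m so CTAs can be paired.
    """
    return tokens_sum >= TWO_CTA_MIN_TOKENS and cluster_m % 2 == 0


@lru_cache(maxsize=None)
def get_kernel(use_2cta: bool, cluster_m: int) -> str:
    """Stand-in for lazy compilation (assumption, not the real cache).

    The body runs once per distinct (use_2cta, cluster_m) pair; repeat
    calls with the same arguments return the cached value.
    """
    variant = "2cta" if use_2cta else "1cta"
    return f"gemm[{variant}, cluster_m={cluster_m}]"


def run(tokens_sum: int, cluster_m: int) -> str:
    """Pick the variant for this shape and fetch the cached kernel."""
    return get_kernel(should_use_2cta(tokens_sum, cluster_m), cluster_m)


if __name__ == "__main__":
    print(run(1024, 2))  # prefill shape, even cluster_m
    print(run(8, 2))     # decode shape, below the threshold
    print(run(1024, 1))  # odd cluster_m, cannot pair CTAs
```

Keying the cache on the boolean rather than on `tokens_sum` keeps the number of compiled variants small: every decode shape shares the 1-CTA entry and every qualifying prefill shape shares the 2-CTA entry.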