nvfp4-megamoe-kernel

Files

biondizzle d53e0a33a9 NVFP4-3: add use_2cta_instrs conditional to gemm_runner

- run_nvfp4_grouped_gemm: use_2cta = tokens_sum >= 256 and cluster_m even
- run_fused_swiglu_grouped_gemm: same conditional
- Auto-warms up on first use via lazy compilation cache
- 1.7-1.9× throughput at prefill shapes (M>=256)
- Decode (M<256) stays 1-CTA (correct, no waste)

2026-05-25 16:42:02 +00:00

__init__.py

Restructure: cutedsl/ -> dsv4/ with proper layering

2026-05-21 17:30:44 +00:00

custom_ops.py

Restructure: cutedsl/ -> dsv4/ with proper layering

2026-05-21 17:30:44 +00:00

decode_sparse.py

Restructure: cutedsl/ -> dsv4/ with proper layering

2026-05-21 17:30:44 +00:00

decode_swa.py

Restructure: cutedsl/ -> dsv4/ with proper layering