nvfp4-megamoe-kernel

biondizzle/nvfp4-megamoe-kernel

Commit Graph

Author	SHA1	Message	Date

biondizzle

d53e0a33a9

NVFP4-3: add use_2cta_instrs conditional to gemm_runner

- run_nvfp4_grouped_gemm: use_2cta = tokens_sum >= 256 and cluster_m even
- run_fused_swiglu_grouped_gemm: same conditional
- Auto-warms up on first use via lazy compilation cache
- 1.7-1.9× throughput at prefill shapes (M>=256)
- Decode (M<256) stays 1-CTA (correct, no waste)

2026-05-25 16:42:02 +00:00
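
The dispatch rule described in commit d53e0a33a9 can be sketched roughly as follows. This is a minimal illustration, not the repository's code: only the 256-token threshold and the "cluster_m even" condition come from the commit message. The helper names (should_use_2cta, get_kernel, select_kernel) are hypothetical, "even" is assumed to mean cluster_m % 2 == 0, and the cached string stands in for whatever the real lazy compilation cache stores.

```python
from functools import lru_cache

# Threshold taken from the commit message; all names below are hypothetical.
TWO_CTA_MIN_TOKENS = 256


def should_use_2cta(tokens_sum: int, cluster_m: int) -> bool:
    """Return True when the 2-CTA path applies.

    Assumption: "cluster_m even" means cluster_m % 2 == 0.
    """
    return tokens_sum >= TWO_CTA_MIN_TOKENS and cluster_m % 2 == 0


@lru_cache(maxsize=None)
def get_kernel(use_2cta: bool) -> str:
    """Stand-in for kernel compilation, cached per variant.

    The real runner presumably compiles a kernel here; this sketch only
    returns a label so that the caching behaviour is observable.
    """
    return "gemm_2cta" if use_2cta else "gemm_1cta"


def select_kernel(tokens_sum: int, cluster_m: int) -> str:
    """Pick the variant for a call; the first use of a variant fills the cache."""
    return get_kernel(should_use_2cta(tokens_sum, cluster_m))
```

Under these assumptions, a prefill-sized call such as select_kernel(1024, 2) takes the 2-CTA variant, while a decode-sized call such as select_kernel(8, 2) stays on the 1-CTA variant. Each variant is built once and then reused, which is one way to read "auto-warms up on first use".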

biondizzle

e3e67c3992

NVFP4-3: enable 2-CTA UMMA when MMA tile M >= 256 (1.7-1.9x throughput)

2026-05-24 22:57:49 +00:00

biondizzle

3fb3c925af

Restructure: cutedsl/ -> dsv4/ with proper layering

- Split bridge.py -> ops/quantize.py, ops/layouts.py, ops/gemm_runner.py
- Renamed classes: CuTeDSLNvfp4Linear -> Nvfp4Linear, etc.
- Moved kernel code to dsv4/kernels/ (gemm, attention, compressor, decode, cuda)
- Moved PyTorch bridges to dsv4/ops/
- Moved nn.Module layers to dsv4/layers/
- Moved reference implementations to dsv4/reference/
- Moved vendored CUTLASS code to vendored/
- Archived ~190 debug tests to tests/archive/
- Kept ~15 canonical tests in tests/unit/
- Updated all import paths
- Added stubs for future components (model/, cache/, loader/)
- Updated pyproject.toml: dsv4-inference package name

2026-05-21 17:30:44 +00:00

3 Commits