nvfp4-megamoe-kernel

biondizzle/nvfp4-megamoe-kernel

Commit Graph

Author	SHA1	Message	Date

biondizzle

1c18c16c68

Fix production rope.py: FP32 arithmetic for forward_rope_partial + inverse_rope_bf16

2026-05-31 09:17:36 +00:00
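
The motivation for keeping rotary arithmetic in FP32 can be illustrated with a minimal sketch. This is not the repository's code: it emulates BF16 rounding in pure Python and compares a rotation where every intermediate is rounded to BF16 against one that rounds only the final result. The helper names (`to_bf16`, `rotate_bf16_intermediates`, `rotate_fp32_then_round`, `max_abs_error`) are hypothetical, and Python floats (FP64) stand in for FP32 accumulation.

```python
import math
import struct


def to_bf16(x: float) -> float:
    """Round a Python float to the nearest BF16 value (round-to-nearest-even)."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]  # FP32 bit pattern
    bias = 0x7FFF + ((bits >> 16) & 1)                   # ties go to even
    bits = ((bits + bias) >> 16) << 16                   # keep the top 16 bits
    return struct.unpack("<f", struct.pack("<I", bits & 0xFFFFFFFF))[0]


def rotate_bf16_intermediates(x0, x1, theta):
    """Rotate (x0, x1) with cos/sin, products, and sums all rounded to BF16."""
    c, s = to_bf16(math.cos(theta)), to_bf16(math.sin(theta))
    y0 = to_bf16(to_bf16(x0 * c) - to_bf16(x1 * s))
    y1 = to_bf16(to_bf16(x0 * s) + to_bf16(x1 * c))
    return y0, y1


def rotate_fp32_then_round(x0, x1, theta):
    """Rotate in higher precision (FP64 here, standing in for FP32), round once."""
    c, s = math.cos(theta), math.sin(theta)
    return to_bf16(x0 * c - x1 * s), to_bf16(x0 * s + x1 * c)


def max_abs_error(rotate, samples=256):
    """Worst absolute error of `rotate` against an unrounded FP64 reference."""
    worst = 0.0
    for i in range(samples):
        theta = 0.37 * i                    # arbitrary test angles
        x0 = to_bf16(math.sin(1.3 * i))     # inputs are already stored as BF16
        x1 = to_bf16(math.cos(0.7 * i))
        ref0 = x0 * math.cos(theta) - x1 * math.sin(theta)
        ref1 = x0 * math.sin(theta) + x1 * math.cos(theta)
        y0, y1 = rotate(x0, x1, theta)
        worst = max(worst, abs(y0 - ref0), abs(y1 - ref1))
    return worst


if __name__ == "__main__":
    print("BF16 intermediates:", max_abs_error(rotate_bf16_intermediates))
    print("single final round:", max_abs_error(rotate_fp32_then_round))
```

The first path goes through several rounding steps per output element, while the second rounds only once, so its error is bounded by a single BF16 rounding of the result.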

biondizzle

300dddedc0

E1-E4: gather kernels, handle wiring, rope, sync removal, e2e test

E1: LayerCacheHandle now exposes gather_compressed_kv,
    gather_all_compressed_kv, gather_swa_kv, num_query_heads, head_dim.
    Gather kernels in dsv4/kernels/cuda/gather_swa.cu + gather_kv.cu.
    Python wrapper in dsv4/kernels/cache/gather.py.

E2: tests/e2e/test_one_layer.py — SWA path smoke test.

E3: Compressor/indexer __init__.py bridges (NotImplementedError stubs
    for CSA/HCA compress_and_store, compute_index_scores_topk).

E4: Removed torch.cuda.synchronize() from fmha_multitile_op.py fast path.
    Error checking via C API return code instead.

Also: forward_rope_partial in ops/rope.py (GPT-J interleaved, last 64 dims).

2026-05-30 21:10:26 +00:00
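
The "GPT-J interleaved, last 64 dims" rotary scheme named in this commit can be sketched as follows. This is an illustration under stated assumptions, not the code in ops/rope.py: the function name `rope_partial_interleaved`, the frequency base of 10000, the `inverse` flag, and the list-of-floats interface are all hypothetical. In the interleaved layout, adjacent entries (2i, 2i+1) form a rotation pair, and only the trailing `rot_dim` entries of each head vector are rotated.

```python
import math


def rope_partial_interleaved(x, position, rot_dim=64, base=10000.0, inverse=False):
    """Rotate the last `rot_dim` entries of one head vector `x` (list of floats).

    GPT-J interleaved layout: entries (2i, 2i+1) are rotated together.
    The leading len(x) - rot_dim entries pass through unchanged.
    Python floats are FP64; a GPU kernel would typically accumulate in FP32.
    """
    if rot_dim % 2 or rot_dim > len(x):
        raise ValueError("rot_dim must be even and no larger than len(x)")
    out = list(x)
    start = len(x) - rot_dim
    sign = -1.0 if inverse else 1.0  # inverse RoPE rotates by the negated angle
    for i in range(rot_dim // 2):
        # Assumed frequency schedule: theta_i = base^(-2i / rot_dim).
        theta = sign * position * base ** (-2.0 * i / rot_dim)
        c, s = math.cos(theta), math.sin(theta)
        a, b = x[start + 2 * i], x[start + 2 * i + 1]
        out[start + 2 * i] = a * c - b * s
        out[start + 2 * i + 1] = a * s + b * c
    return out


if __name__ == "__main__":
    head = [float(i) for i in range(128)]  # head_dim = 128 is illustrative only
    rotated = rope_partial_interleaved(head, position=5)
    restored = rope_partial_interleaved(rotated, position=5, inverse=True)
    print(max(abs(r - h) for r, h in zip(restored, head)))
```

Applying the function with `inverse=True` at the same position undoes the rotation up to floating-point error, which makes a round trip a convenient smoke test.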

biondizzle

3fb3c925af

Restructure: cutedsl/ -> dsv4/ with proper layering

- Split bridge.py -> ops/quantize.py, ops/layouts.py, ops/gemm_runner.py
- Renamed classes: CuTeDSLNvfp4Linear -> Nvfp4Linear, etc.
- Moved kernel code to dsv4/kernels/ (gemm, attention, compressor, decode, cuda)
- Moved PyTorch bridges to dsv4/ops/
- Moved nn.Module layers to dsv4/layers/
- Moved reference implementations to dsv4/reference/
- Moved vendored CUTLASS code to vendored/
- Archived ~190 debug tests to tests/archive/
- Kept ~15 canonical tests in tests/unit/
- Updated all import paths
- Added stubs for future components (model/, cache/, loader/)
- Updated pyproject.toml: dsv4-inference package name

2026-05-21 17:30:44 +00:00

3 Commits