Files
nvfp4-megamoe-kernel/tests
biondizzle 2bbbead984 P3: CUDA RoPE kernel — single launch per call (vs 5-6 PyTorch ops)
New files:
- dsv4/kernels/cuda/rope_cuda.cu: GPT-J interleaved RoPE kernel (forward+inverse)
- dsv4/ops/rope_cuda.py: Python bridge with ctypes loading
- tests/unit/test_rope_cuda.py: correctness test (cos >= 0.999998)

Savings: ~915 launches/token → 183 launches/token
2026-06-02 09:05:22 +00:00
..
2026-05-30 21:22:34 +00:00
2026-05-31 20:11:37 +00:00