- Standalone kernel cos 0.979 (128x512)
- Post-SwiGLU quantization cos 0.976 (vs Python 0.995)
- Larger shape cos 0.979 (512x4096)
- FP8 scale match 100% across all tests
- GPU kernel replaces CPU-GPU sync quantize path
- Ready for integration into MoE pipeline
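
For readers reproducing the numbers above, here is a minimal sketch of how a cosine-similarity and scale-match check could be computed. It is not the test harness used for this change. The helper names (`cosine_similarity`, `per_row_scales`, `scale_match_rate`), the per-row scaling granularity, the E4M3 maximum of 448, and the synthetic inputs are all assumptions made for illustration.

```python
import numpy as np

# Assumption: the kernel targets FP8 E4M3, whose largest finite value is 448.
FP8_E4M3_MAX = 448.0


def cosine_similarity(a, b):
    """Cosine similarity between two tensors, compared as flat vectors."""
    a = np.asarray(a, dtype=np.float64).ravel()
    b = np.asarray(b, dtype=np.float64).ravel()
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(np.dot(a, b) / denom) if denom > 0 else 0.0


def per_row_scales(x):
    """Hypothetical per-row scale: row absolute maximum divided by the FP8 max."""
    amax = np.abs(np.asarray(x, dtype=np.float32)).max(axis=1)
    return np.maximum(amax, 1e-12) / FP8_E4M3_MAX


def scale_match_rate(scales_a, scales_b, rtol=1e-6):
    """Fraction of scales that agree within a relative tolerance."""
    return float(np.mean(np.isclose(scales_a, scales_b, rtol=rtol, atol=0.0)))


# Synthetic stand-ins: `reference` plays the Python path, `candidate` the GPU kernel.
rng = np.random.default_rng(0)
reference = rng.standard_normal((128, 512)).astype(np.float32)
candidate = reference + 0.05 * rng.standard_normal((128, 512)).astype(np.float32)

print("cos:", cosine_similarity(reference, candidate))
print("scale match:", scale_match_rate(per_row_scales(reference),
                                       per_row_scales(reference)))
```

In a real comparison, `candidate` would be the dequantized kernel output and the second set of scales would come from the kernel rather than from the reference tensor.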