- test_p3_fast_decode.py: clean kernel test + full API test
- Removed debug tests (sanity, v_debug, v_ref_debug)
- Double normalization fix verified: kernel output matches reference at cos >= 0.999990 across all MHA/MQA/GQA configs
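
For readers unfamiliar with the `cos >= 0.999990` criterion, here is a minimal sketch of a cosine-similarity check of that kind. It is an illustration only, not the code in test_p3_fast_decode.py: the function names `cosine_similarity` and `matches_reference` are hypothetical, and the outputs are assumed to be already flattened into plain lists of floats.

```python
import math

# Threshold quoted in the note above.
COS_THRESHOLD = 0.999990


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length flat sequences of floats."""
    if len(a) != len(b):
        raise ValueError("inputs must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


def matches_reference(kernel_out, reference_out, threshold=COS_THRESHOLD):
    """Return True when the kernel output is within the cosine threshold."""
    return cosine_similarity(kernel_out, reference_out) >= threshold
```

A test harness would presumably run a check like this once per attention configuration (MHA, MQA, GQA) and fail if any configuration falls below the threshold.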