- New: fmha_mixed_fp8_decode.cuh (Blackwell FP8 tensor-core FMHA kernel)
- New: fmha_mixed_fp8_capi.cu (C ABI launcher)
- New: fmha_mixed_fp8_op.py (Python ctypes/nvcc bridge)
- New: fp8_attention_io.cu (Q quantize + mixed KV gather kernels)
- New: fmha_umma_desc.cuh additions (f8f6f4 UMMA + idesc helpers)
- Modified: production.py (dsv4_attention_mixed_fp8_decode API)
- Modified: single_shot_inference.py (B1 gather + FMHA path)
- Modified: __init__.py (export mixed FP8 API)
- New: docs/B1_MIXED_FP8_FMHA.md, FINAL_STRETCH.md

noPE KV stays FP8_E4M3 + per-row scale, RoPE stays BF16.
No global FP8->BF16 KV staging before FMHA.
Decode-only (T==1), specialized HD=512/NOPE=448/ROPE=64.
CUDA compile/runtime validation pending on B200.
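The mixed layout described above (448 noPE dims in FP8_E4M3 with one scale per row, 64 RoPE dims left in higher precision) can be illustrated with a small pure-Python sketch. This is not the code from `fp8_attention_io.cu`. The helper names `round_to_e4m3`, `quantize_nope_row`, and `split_and_quantize` are hypothetical, and both the "noPE first, RoPE last" ordering and the `amax / 448` scale choice are assumptions made for illustration.

```python
import math

# Dimensions stated in the commit message.
HEAD_DIM, NOPE_DIM, ROPE_DIM = 512, 448, 64
E4M3_MAX = 448.0  # largest finite magnitude in FP8 E4M3 (e4m3fn variant)


def round_to_e4m3(x: float) -> float:
    """Round x to the nearest FP8 E4M3 value, saturating at +/-448."""
    if x == 0.0:
        return 0.0
    sign = -1.0 if x < 0 else 1.0
    mag = min(abs(x), E4M3_MAX)
    # frexp gives mag = m * 2**e with m in [0.5, 1), so the binade is e - 1.
    # Below the minimum normal exponent (-6) values are subnormal and share
    # a fixed step of 2**-9.
    exp = max(math.frexp(mag)[1] - 1, -6)
    step = 2.0 ** (exp - 3)  # 3 mantissa bits: 8 steps per binade
    return sign * round(mag / step) * step  # round() is half-to-even


def quantize_nope_row(row):
    """Assumed per-row scheme: map the row's max magnitude onto E4M3_MAX."""
    amax = max(abs(v) for v in row)
    scale = amax / E4M3_MAX if amax > 0 else 1.0
    return [round_to_e4m3(v / scale) for v in row], scale


def split_and_quantize(kv_row):
    """Split one KV row into quantized noPE dims, its scale, and RoPE dims."""
    assert len(kv_row) == HEAD_DIM
    # Assumption: noPE dims come first, RoPE dims last.
    nope, rope = kv_row[:NOPE_DIM], kv_row[NOPE_DIM:]
    q_nope, scale = quantize_nope_row(nope)
    return q_nope, scale, rope  # rope is left unquantized (BF16 in the commit)


if __name__ == "__main__":
    row = [math.sin(i) for i in range(HEAD_DIM)]
    q_nope, scale, rope = split_and_quantize(row)
    worst = max(abs(q * scale - v) for q, v in zip(q_nope, row))
    print(f"scale={scale:.6f} worst abs error={worst:.6f}")
```

Under this layout one KV row needs 448 × 1 + 64 × 2 = 576 bytes plus the scale, compared with 512 × 2 = 1024 bytes if the whole row were BF16.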
8 lines
355 B
Python
"""DSV4 Attention kernels — public integration API.

The live inference path uses dsv4.kernels.attention.production directly.
See production.py for the dsv4_attention function used by single_shot_inference.py.
"""
from dsv4.kernels.attention.production import dsv4_attention
from dsv4.kernels.attention.production import dsv4_attention_mixed_fp8_decode
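Since the package now exports both entry points and the commit describes the mixed FP8 kernel as decode-only and shape-specialized, a caller would likely need some guard before choosing it. The sketch below is a hypothetical illustration of such a guard. `can_use_mixed_fp8_decode` and `pick_attention_api` are not part of the commit, the real signatures of the two exported functions are not shown here, and falling back to `dsv4_attention` is an assumption.

```python
# Shapes the commit message says the kernel is specialized for.
SUPPORTED_HEAD_DIM = 512
SUPPORTED_NOPE_DIM = 448
SUPPORTED_ROPE_DIM = 64


def can_use_mixed_fp8_decode(new_tokens: int, head_dim: int,
                             nope_dim: int, rope_dim: int) -> bool:
    """Hypothetical guard: True only for decode (T == 1) at the stated shapes."""
    return (
        new_tokens == 1
        and head_dim == SUPPORTED_HEAD_DIM
        and nope_dim == SUPPORTED_NOPE_DIM
        and rope_dim == SUPPORTED_ROPE_DIM
    )


def pick_attention_api(new_tokens: int, head_dim: int,
                       nope_dim: int, rope_dim: int) -> str:
    """Return the name of the exported function a caller might dispatch to."""
    if can_use_mixed_fp8_decode(new_tokens, head_dim, nope_dim, rope_dim):
        return "dsv4_attention_mixed_fp8_decode"
    return "dsv4_attention"  # assumed fallback for prefill or other shapes


if __name__ == "__main__":
    print(pick_attention_api(1, 512, 448, 64))
    print(pick_attention_api(16, 512, 448, 64))
```

Returning a name rather than calling the kernel keeps the sketch independent of the actual argument lists, which this page does not show.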