- production.py: head-packed M dimension for MQA/GQA (q_per_kv*T rows in single launch per KV group, eliminating redundant K/V TMA loads)
- production.py: batch dimension support (outer Python loop)
- production.py: warmup_attention_kernels() for pre-compilation
- production.py: dsv4_attention_per_head() for exact per-head sink bias
- __init__.py: sparse_fmha_with_swa, dense_fmha_with_swa, swa_only_fmha integration functions bridging AttentionSubBlock → production FMHA
- custom_ops.py: dsv4::sparse_fmha_with_swa custom_op registration
- test_production.py: comprehensive tests (MHA/MQA/GQA, head-packed vs per-head parity, multi-segment KV, SWA+causal+sink, batch, edge cases)
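
The head-packing idea in the first item can be illustrated with a small pure-Python reference. This is only a sketch under stated assumptions, not the code in production.py: the function names (`attend`, `per_head`, `head_packed`), the head-major row order inside a packed block, and the plain causal mask are all hypothetical choices made for illustration. It shows that stacking the `q_per_kv` query heads of one KV group into `q_per_kv*T` rows lets each K/V head be read once per group while matching the per-head result.

```python
import math
import random


def attend(q_rows, row_tokens, K, V):
    """Causal softmax attention for a block of query rows against one K/V head.

    q_rows[i] is a D-vector; row_tokens[i] is the token position of row i.
    """
    scale = 1.0 / math.sqrt(len(K[0]))
    out = []
    for q, t in zip(q_rows, row_tokens):
        # Causal mask: a row at token position t sees keys 0..t only.
        scores = [scale * sum(a * b for a, b in zip(q, K[s])) for s in range(t + 1)]
        m = max(scores)
        w = [math.exp(x - m) for x in scores]
        z = sum(w)
        out.append([sum(w[s] * V[s][d] for s in range(t + 1)) / z
                    for d in range(len(V[0]))])
    return out


def per_head(Q, K, V, q_per_kv):
    """Baseline: one call per query head, so each K/V head is re-read q_per_kv times."""
    T = len(Q[0])
    return [attend(Q[h], list(range(T)), K[h // q_per_kv], V[h // q_per_kv])
            for h in range(len(Q))]


def head_packed(Q, K, V, q_per_kv):
    """Packed: one call per KV group over q_per_kv*T stacked query rows."""
    T = len(Q[0])
    out = []
    for g in range(len(K)):
        heads = Q[g * q_per_kv:(g + 1) * q_per_kv]
        rows = [row for head in heads for row in head]  # (q_per_kv*T) x D, head-major (assumed)
        tokens = [r % T for r in range(len(rows))]      # packed row index -> token position
        packed = attend(rows, tokens, K[g], V[g])       # K[g], V[g] used once per group
        out.extend(packed[i * T:(i + 1) * T] for i in range(q_per_kv))  # unpack to heads
    return out


def rand_tensor(n, t, d):
    return [[[random.gauss(0.0, 1.0) for _ in range(d)] for _ in range(t)]
            for _ in range(n)]


random.seed(0)
H_KV, Q_PER_KV, T, D = 2, 3, 4, 8  # GQA: 6 query heads sharing 2 KV heads
Q = rand_tensor(H_KV * Q_PER_KV, T, D)
K = rand_tensor(H_KV, T, D)
V = rand_tensor(H_KV, T, D)

ref = per_head(Q, K, V, Q_PER_KV)
got = head_packed(Q, K, V, Q_PER_KV)
max_err = max(abs(a - b)
              for hr, hg in zip(ref, got)
              for rr, rg in zip(hr, hg)
              for a, b in zip(rr, rg))
```

With `Q_PER_KV = 1` the packed path degenerates to MHA, and with `H_KV = 1` it covers MQA, so the same parity check can be reused across all three layouts.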