[Misc] Fix up attention benchmarks (#33810)
Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com>
Signed-off-by: Matthew Bonanni <mbonanni@redhat.com>
Co-authored-by: Matthew Bonanni <mbonanni@redhat.com>
@@ -25,10 +25,18 @@ batch_specs:
- "4q1k_16q1s2k" # 4 prefill + 16 decode
- "2q4k_32q1s1k" # 2 large prefill + 32 decode
# Context extension
- "q1ks2k" # 1k query, 2k sequence (chunked prefill)
# Speculative decode (q <= 8)
- "16q2s1k" # 16 requests, 2 spec tokens, 1k KV cache
- "16q4s1k" # 16 requests, 4 spec tokens, 1k KV cache
- "16q8s1k" # 16 requests, 8 spec tokens, 1k KV cache
- "32q4s2k" # 32 requests, 4 spec tokens, 2k KV cache
- "8q8s4k" # 8 requests, 8 spec tokens, 4k KV cache
# Context extension (chunked prefill)
- "q1ks2k" # 1k query, 2k sequence
- "2q1ks4k" # 2 requests: 1k query, 4k sequence
# Available backends: flash, triton, flashinfer
backends:
- flash
- triton
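The batch spec strings in this config follow a compact naming scheme that the YAML comments only describe by example. Below is a minimal Python sketch of one way to parse them. It assumes a grammar inferred from those comments: an optional request count, then `q` plus the query length, then an optional `s` plus the sequence (KV) length, which defaults to the query length (pure prefill). It also assumes a `k` suffix multiplies by 1024 and that `_` joins request groups. `parse_batch_spec`, `RequestGroup`, and `classify` are hypothetical names for illustration, not vLLM's benchmark API. The spec-decode cutoff of 8 comes from the `q <= 8` comment.

```python
import re
from dataclasses import dataclass

# Hypothetical grammar, inferred from the YAML comments (not vLLM's parser):
#   spec  := group ("_" group)*
#   group := [count] "q" size ["s" size]
#   size  := digits ["k"]
_GROUP_RE = re.compile(r"(\d*)q(\d+k?)(?:s(\d+k?))?")

# Assumption: a "k" suffix multiplies by 1024.
_K = 1024


@dataclass(frozen=True)
class RequestGroup:
    count: int      # number of identical requests in this group
    query_len: int  # new tokens per request
    seq_len: int    # sequence (KV) length per request (assumed to include the query)


def _parse_size(token: str) -> int:
    """Convert '2k' -> 2048 and '8' -> 8."""
    if token.endswith("k"):
        return int(token[:-1]) * _K
    return int(token)


def parse_batch_spec(spec: str) -> list[RequestGroup]:
    """Split a spec such as '4q1k_16q1s2k' into request groups."""
    groups = []
    for part in spec.split("_"):
        match = _GROUP_RE.fullmatch(part)
        if match is None:
            raise ValueError(f"unrecognized batch spec group: {part!r}")
        count_str, q_str, s_str = match.groups()
        count = int(count_str) if count_str else 1
        query_len = _parse_size(q_str)
        # Without an "s" part, every token is new (pure prefill).
        seq_len = _parse_size(s_str) if s_str else query_len
        if seq_len < query_len:
            raise ValueError(f"sequence shorter than query in {part!r}")
        groups.append(RequestGroup(count, query_len, seq_len))
    return groups


def classify(group: RequestGroup, max_spec_len: int = 8) -> str:
    """Label a group the way the YAML comments do (cutoff 8 from 'q <= 8')."""
    if group.query_len == group.seq_len:
        return "prefill"
    if group.query_len == 1:
        return "decode"
    if group.query_len <= max_spec_len:
        return "spec_decode"
    return "chunked_prefill"


if __name__ == "__main__":
    for spec in ["4q1k_16q1s2k", "16q4s1k", "2q1ks4k"]:
        for group in parse_batch_spec(spec):
            print(spec, group, classify(group))
```

Under these assumptions, `"4q1k_16q1s2k"` splits into 4 prefill requests of 1024 tokens plus 16 decode requests over a 2048-token context. That matches the "4 prefill + 16 decode" comment.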