[Frontend][Responses API] Fix arrival_time recording for TTFT on initial request (#37498)
Signed-off-by: Andrew Xia <axia@meta.com>
@@ -244,6 +244,7 @@ statistics relating to that iteration:
prefill in this iteration. However, we calculate this interval
relative to when the request was first received by the frontend
(`arrival_time`) in order to account for input processing time.
Currently `arrival_time` starts when tokenization begins.

For any requests that were completed in a given iteration, we also
record:
@@ -587,7 +588,7 @@ see:
- [Benchmarking LLM Workloads for Performance Evaluation and Autoscaling in Kubernetes](https://docs.google.com/document/d/1k4Q4X14hW4vftElIuYGDu5KDe2LtV1XammoG-Xi3bbQ)
- [Inference Perf](https://github.com/kubernetes-sigs/wg-serving/tree/main/proposals/013-inference-perf)
- <https://github.com/vllm-project/vllm/issues/5041> and <https://github.com/vllm-project/vllm/pull/12726>.

This is a non-trivial topic. Consider this comment from Rob:

> I think this metric should focus on trying to estimate what the max