[Frontend][Responses API] Fix arrival_time recording for TTFT on initial request (#37498)

Signed-off-by: Andrew Xia <axia@meta.com>
Andrew Xia
2026-03-23 02:58:08 -07:00
committed by GitHub
parent 27d5ee3e6f
commit 9ace378a63
2 changed files with 4 additions and 1 deletions

@@ -244,6 +244,7 @@ statistics relating to that iteration:
prefill in this iteration. However, we calculate this interval
relative to when the request was first received by the frontend
(`arrival_time`) in order to account for input processing time.
Currently `arrival_time` starts when tokenization begins.
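
As a rough illustration of the interval described above, the sketch below computes TTFT relative to `arrival_time` rather than to a later timestamp. The `RequestTimings` class, its `scheduled_time` and `first_token_time` fields, and both helper functions are hypothetical names chosen for this example and are not vLLM's actual API.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class RequestTimings:
    """Hypothetical per-request timestamps, in seconds (not vLLM's real class)."""
    arrival_time: float                      # request received by the frontend
    scheduled_time: Optional[float] = None   # assumed: input processing finished
    first_token_time: Optional[float] = None # first output token produced


def time_to_first_token(t: RequestTimings) -> float:
    """TTFT measured from arrival, so input processing time is included."""
    if t.first_token_time is None:
        raise ValueError("no token has been generated yet")
    return t.first_token_time - t.arrival_time


def ttft_excluding_input_processing(t: RequestTimings) -> float:
    """For comparison only: an interval that ignores input processing."""
    if t.first_token_time is None or t.scheduled_time is None:
        raise ValueError("timestamps are incomplete")
    return t.first_token_time - t.scheduled_time


# Fixed timestamps keep the example deterministic.
timings = RequestTimings(arrival_time=10.0, scheduled_time=10.1, first_token_time=10.35)
ttft = time_to_first_token(timings)
ttft_short = ttft_excluding_input_processing(timings)
```

With these made-up timestamps, the interval measured from `arrival_time` is longer than the one measured from `scheduled_time`, because the former also covers the time spent on input processing.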
For any requests that were completed in a given iteration, we also
record:
@@ -587,7 +588,7 @@ see:
- [Benchmarking LLM Workloads for Performance Evaluation and Autoscaling in Kubernetes](https://docs.google.com/document/d/1k4Q4X14hW4vftElIuYGDu5KDe2LtV1XammoG-Xi3bbQ)
- [Inference Perf](https://github.com/kubernetes-sigs/wg-serving/tree/main/proposals/013-inference-perf)
- <https://github.com/vllm-project/vllm/issues/5041> and <https://github.com/vllm-project/vllm/pull/12726>.
This is a non-trivial topic. Consider this comment from Rob:
> I think this metric should focus on trying to estimate what the max