[Frontend][Responses API] Fix arrival_time recording for TTFT on initial request (#37498)

Signed-off-by: Andrew Xia <axia@meta.com>
2026-03-23 02:58:08 -07:00
parent 27d5ee3e6f
commit 9ace378a63
2 changed files with 4 additions and 1 deletion
--- a/docs/design/metrics.md
+++ b/docs/design/metrics.md
@@ -244,6 +244,7 @@ statistics relating to that iteration:
  prefill in this iteration. However, we calculate this interval
  relative to when the request was first received by the frontend
  (`arrival_time`) in order to account for input processing time.
+  Currently `arrival_time` starts when tokenization begins.

 For any requests that were completed in a given iteration, we also
 record:
@@ -587,7 +588,7 @@ see:
 - [Benchmarking LLM Workloads for Performance Evaluation and Autoscaling in Kubernetes](https://docs.google.com/document/d/1k4Q4X14hW4vftElIuYGDu5KDe2LtV1XammoG-Xi3bbQ)
 - [Inference Perf](https://github.com/kubernetes-sigs/wg-serving/tree/main/proposals/013-inference-perf)
 - <https://github.com/vllm-project/vllm/issues/5041> and <https://github.com/vllm-project/vllm/pull/12726>.
-  
+
 This is a non-trivial topic. Consider this comment from Rob:

 > I think this metric should focus on trying to estimate what the max