[Core][Observability] Add KV cache residency metrics (#27793)

Introduces three new Prometheus histograms for fine-grained observability of KV cache residency behavior:

- vllm:kv_block_lifetime_seconds: total lifetime from allocation to free
- vllm:kv_block_idle_before_evict_seconds: idle duration before eviction
- vllm:kv_block_reuse_gap_seconds: time between consecutive reuses of the same block

These metrics help operators analyze KV cache efficiency, reuse patterns, and eviction timing beyond simple utilization rates.

Implementation uses monotonic timestamps for accuracy, 1% sampling for minimal overhead (~48 bytes/block), and is fully thread-safe with zero runtime cost when disabled.

Two new runtime flags are introduced:

- --kv-cache-metrics: enable KV cache residency metrics
- --kv-cache-metrics-sample: control sampling ratio (default: 0.01)

Signed-off-by: Shivam <shivamprasad91@gmail.com>
@@ -263,6 +263,29 @@ record:
- End-to-end latency - the interval between frontend `arrival_time`
  and the frontend receiving the final token.

### KV Cache Residency Metrics

We also emit a set of histograms that describe how long sampled KV cache
blocks stay resident and how often they are reused. Sampling
(`--kv-cache-metrics-sample`) keeps the overhead tiny; when a block is
chosen we record:

- `lifetime` – allocation ⟶ eviction
- `idle before eviction` – last touch ⟶ eviction
- `reuse gaps` – the pauses between touches when the block gets reused

Those map directly to the Prometheus metrics:

- `vllm:kv_block_lifetime_seconds` – how long each sampled block exists.
- `vllm:kv_block_idle_before_evict_seconds` – idle tail after the final access.
- `vllm:kv_block_reuse_gap_seconds` – time between consecutive touches.
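
As a rough illustration of how the three quantities above relate, here is a minimal sketch of a sampled residency tracker. `ResidencyTracker`, `BlockTimes`, and the `on_alloc` / `on_touch` / `on_evict` hooks are hypothetical names chosen for this example, not the identifiers used in vLLM. It also assumes that the first touch after allocation counts as a reuse gap, which the real implementation may treat differently.

```python
import random
import time
from dataclasses import dataclass, field
from typing import Callable, Optional


@dataclass
class BlockTimes:
    """Timestamps kept for one sampled block (hypothetical structure)."""
    allocated_at: float
    last_access_at: float


@dataclass
class ResidencyTracker:
    """Illustrative tracker; the names here are not taken from vLLM."""
    sample_rate: float = 0.01
    clock: Callable[[], float] = time.monotonic  # monotonic: immune to wall-clock jumps
    rng: random.Random = field(default_factory=random.Random)
    tracked: dict = field(default_factory=dict)
    reuse_gaps: list = field(default_factory=list)

    def on_alloc(self, block_id: int) -> None:
        # Only a sampled fraction of blocks is tracked at all.
        if self.rng.random() < self.sample_rate:
            now = self.clock()
            self.tracked[block_id] = BlockTimes(now, now)

    def on_touch(self, block_id: int) -> None:
        times = self.tracked.get(block_id)
        if times is None:
            return  # block was not sampled
        now = self.clock()
        # Reuse gap: time since the previous access (assumption: the
        # allocation itself counts as the first access).
        self.reuse_gaps.append(now - times.last_access_at)
        times.last_access_at = now

    def on_evict(self, block_id: int) -> Optional[tuple]:
        times = self.tracked.pop(block_id, None)
        if times is None:
            return None
        now = self.clock()
        lifetime = now - times.allocated_at   # allocation to eviction
        idle = now - times.last_access_at     # last touch to eviction
        return lifetime, idle
```

Because unsampled blocks never enter `tracked`, the per-block bookkeeping cost is paid only for the sampled fraction.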

The engine core only ships raw eviction events via `SchedulerStats`; the
frontend drains them, turns them into Prometheus observations, and also
exposes the same data through `LLM.get_metrics()` when logging is on.
Looking at lifetime and idle time on one chart makes it easy to spot
stranded cache or workloads that pin prompts for a long decode.
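
The drain step described above might look roughly like the following sketch. `EvictionEvent`, its field names, `SimpleHistogram`, and `drain` are all assumptions made for illustration; the actual event layout carried by `SchedulerStats` and the bucket boundaries used by vLLM are not shown in this section. A plain-Python histogram stands in for a Prometheus client histogram so the example stays dependency-free.

```python
from bisect import bisect_left
from dataclasses import dataclass, field


@dataclass
class EvictionEvent:
    """Hypothetical raw event; real field names in vLLM may differ."""
    lifetime_seconds: float
    idle_seconds: float
    reuse_gaps_seconds: tuple = ()


@dataclass
class SimpleHistogram:
    """Minimal stand-in for a Prometheus histogram (per-bucket counts)."""
    bounds: tuple = (0.1, 1.0, 10.0, 60.0)  # illustrative upper bounds, in seconds
    counts: list = field(init=False, default_factory=list)
    total: float = 0.0

    def __post_init__(self):
        # One slot per upper bound plus a final +Inf slot.
        self.counts = [0] * (len(self.bounds) + 1)

    def observe(self, value: float) -> None:
        # bisect_left picks the first bucket whose upper bound is >= value.
        self.counts[bisect_left(self.bounds, value)] += 1
        self.total += value


def drain(events, lifetime, idle, reuse_gap):
    """Turn raw eviction events into histogram observations, then clear them."""
    for event in events:
        lifetime.observe(event.lifetime_seconds)
        idle.observe(event.idle_seconds)
        for gap in event.reuse_gaps_seconds:
            reuse_gap.observe(gap)
    events.clear()
```

Keeping the engine side limited to raw events means the histogram bookkeeping happens entirely in the frontend process.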

### Metrics Publishing - Logging

The `LoggingStatLogger` metrics publisher outputs a log `INFO` message