[Core] Switch Flat logprob control from environment variable to SamplingParams (#28914)

Signed-off-by: Jialin Ouyang <Jialin.Ouyang@gmail.com> Co-authored-by: 22quinn <33176974+22quinn@users.noreply.github.com>
2025-11-18 18:10:02 -08:00
parent da94c7c0eb
commit 40b6b38f2c
6 changed files with 33 additions and 41 deletions
--- a/vllm/sampling_params.py
+++ b/vllm/sampling_params.py
@@ -204,6 +204,12 @@ class SamplingParams(
    prompt_logprobs: int | None = None
    """Number of log probabilities to return per prompt token.
    When set to -1, return all `vocab_size` log probabilities."""
+    flat_logprobs: bool = False
+    """Whether to return logprobs in flatten format (i.e. FlatLogprob)
+    for better performance.
+    NOTE: GC costs of FlatLogprobs is significantly smaller than
+    list[dict[int, Logprob]]. After enabled, PromptLogprobs and
+    SampleLogprobs would populated as FlatLogprobs."""
    # NOTE: This parameter is only exposed at the engine level for now.
    # It is not exposed in the OpenAI API server, as the OpenAI API does
    # not support returning only a list of token IDs.