[V1] [Doc] Update V1 docs for Mamba models (#20499)

Signed-off-by: Thomas Parnell <tpa@zurich.ibm.com>
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com>
Thomas Parnell
2025-07-09 10:02:41 +02:00
committed by GitHub
parent 2155e95ef1
commit 5358cce5ff
2 changed files with 17 additions and 9 deletions


@@ -83,7 +83,7 @@ based on assigned priority, with FCFS as a tie-breaker), configurable via the
| **Decoder-only Models** | <nobr>🚀 Optimized</nobr> |
| **Encoder-Decoder Models** | <nobr>🟠 Delayed</nobr> |
| **Embedding Models** | <nobr>🟢 Functional</nobr> |
-| **Mamba Models** | <nobr>🚧 WIP (<gh-pr:19327>)</nobr> |
+| **Mamba Models** | <nobr>🟢 (Mamba-2), 🟡 (Mamba-1)</nobr> |
| **Multimodal Models** | <nobr>🟢 Functional</nobr> |
vLLM V1 currently excludes model architectures with the `SupportsV0Only` protocol.
@@ -104,8 +104,16 @@ to enable simultaneous generation and embedding using the same engine instance i
#### Mamba Models
-Models using selective state-space mechanisms instead of standard transformer attention (e.g., `MambaForCausalLM`, `JambaForCausalLM`)
-will be supported via <gh-pr:19327>.
+Models using selective state-space mechanisms instead of standard transformer attention are partially supported.
+Models that use Mamba-2 layers (e.g., `Mamba2ForCausalLM`) are supported, but models that use older Mamba-1 layers
+(e.g., `MambaForCausalLM`, `JambaForCausalLM`) are not yet supported. Please note that these models currently require
+enforcing eager mode and disabling prefix caching in V1.
+Models that combine Mamba-2 layers with standard attention layers are also supported (e.g., `BambaForCausalLM`,
+`Zamba2ForCausalLM`, `NemotronHForCausalLM`, `FalconH1ForCausalLM` and `GraniteMoeHybridForCausalLM`). Please note that
+these models currently require enforcing eager mode, disabling prefix caching, and using the FlashInfer attention
+backend in V1. It is also necessary to pass a non-standard block size for attention layers (this is not possible
+using the `vllm serve` CLI yet).
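
As context for the Mamba-2 constraints the added text describes, here is a minimal sketch using the offline `LLM` API. This example is not part of the commit: the model name is illustrative, and it assumes V1 is opted into via the `VLLM_USE_V1` environment variable.

```python
# Minimal sketch (not part of this commit): running a Mamba-2 model on V1.
# The model name is illustrative; the flags reflect the constraints in the
# doc text above (eager mode enforced, prefix caching disabled).
import os

os.environ["VLLM_USE_V1"] = "1"  # opt into the V1 engine

from vllm import LLM, SamplingParams

llm = LLM(
    model="mistralai/Mamba-Codestral-7B-v0.1",  # example Mamba2ForCausalLM model
    enforce_eager=True,            # V1 currently requires eager mode for these models
    enable_prefix_caching=False,   # prefix caching must be disabled
)

outputs = llm.generate(["The capital of France is"], SamplingParams(max_tokens=16))
print(outputs[0].outputs[0].text)
```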
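Similarly, a hedged sketch of the hybrid-model configuration: the FlashInfer backend is selected via the `VLLM_ATTENTION_BACKEND` environment variable, and the non-standard attention block size goes through the Python API since, as noted above, `vllm serve` cannot set it yet. The model name is illustrative and the block size value is a placeholder, not a verified value for any particular model.

```python
# Hedged sketch (not part of this commit): a hybrid attention + Mamba-2 model
# on V1 with the FlashInfer attention backend.
import os

os.environ["VLLM_USE_V1"] = "1"
os.environ["VLLM_ATTENTION_BACKEND"] = "FLASHINFER"  # FlashInfer backend required in V1

from vllm import LLM

# Placeholder: the required non-standard block size is model-dependent.
attn_block_size = 528

llm = LLM(
    model="ibm-ai-platform/Bamba-9B",  # example hybrid model (BambaForCausalLM)
    enforce_eager=True,                # eager mode enforced
    enable_prefix_caching=False,       # prefix caching disabled
    block_size=attn_block_size,        # non-standard block size for attention layers
)
```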
#### Encoder-Decoder Models