[V1] [Hybrid] Refactor mamba state shape calculation; enable V1 via cli (#20840)

Signed-off-by: Thomas Parnell <tpa@zurich.ibm.com>
Author: Thomas Parnell
Date: 2025-07-15 13:04:35 +02:00
Committed by: GitHub
Parent: c586b55667
Commit: 3534c39a20

14 changed files with 441 additions and 353 deletions

@@ -112,8 +112,7 @@ enforcing eager mode and disabling prefix caching in V1.
Models that combine Mamba-2 layers with standard attention layers are also supported (e.g., `BambaForCausalLM`,
`Zamba2ForCausalLM`, `NemotronHForCausalLM`, `FalconH1ForCausalLM` and `GraniteMoeHybridForCausalLM`). Please note that
these models currently require enforcing eager mode, disabling prefix caching, and using the FlashInfer attention
-backend in V1. It is also necessary to pass a non-standard block size for attention layers (this is not possible
-using the `vllm serve` CLI yet).
+backend in V1.
#### Encoder-Decoder Models
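
The hunk above keeps the documented V1 requirements for hybrid Mamba-2/attention models (eager mode, prefix caching disabled, FlashInfer backend) and drops only the caveat about the attention block size not being settable through the `vllm serve` CLI, which the CLI change in this commit appears to lift. As a minimal sketch of what those requirements look like in practice, the snippet below uses vLLM's standard Python API and environment variables; the checkpoint name is only an illustrative assumption, not something taken from this diff.

```python
# Minimal sketch: loading a hybrid Mamba-2/attention model under the V1
# engine with the constraints described in the doc text above.
import os

os.environ["VLLM_USE_V1"] = "1"                      # opt in to the V1 engine
os.environ["VLLM_ATTENTION_BACKEND"] = "FLASHINFER"  # required backend for these models

from vllm import LLM

llm = LLM(
    model="ibm-ai-platform/Bamba-9B",  # example hybrid checkpoint (assumption)
    enforce_eager=True,                # eager mode is currently required
    enable_prefix_caching=False,       # prefix caching must be disabled
)

outputs = llm.generate("Hybrid Mamba-2 models combine ")
print(outputs[0].outputs[0].text)
```

On the server side the same settings map to `vllm serve` with `--enforce-eager` and `--no-enable-prefix-caching`, plus the two environment variables above; per the removed lines, the non-standard attention block size previously had to be set outside the CLI.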