Commit Graph

  • 127c8b782a Add gather_indexer_k_quant_cache kernel (#25931) Barry Kang 2025-10-08 12:58:57 +08:00
  • cd9890544b fix(v1/kv_cache): resolve async KV transfer bug in cascade attention (#23485) Ayush Satyam 2025-10-08 10:16:33 +05:30
  • 067da2d1df [Core] Simplify setting new_token_ids in CachedRequestData (#26388) Nick Hill 2025-10-07 20:32:37 -07:00
  • 046118b938 Add SwigluOAI implementation for CPUFusedMOE (#26347) isharif168 2025-10-08 03:17:49 +01:00
  • b32260ab85 [torchao] safetensors integration (#25969) liangel-02 2025-10-07 19:12:35 -07:00
  • f80e7866c0 [Misc] Clean up cruft from previous FlashMLA sparse implementation (#26125) Lucas Wilkinson 2025-10-07 22:09:34 -04:00
  • 31a4b3e6c4 Revert #24446 and #26168 (#26332) Thomas Parnell 2025-10-08 00:38:19 +02:00
  • caf8b1c084 [Bugfix] Fix MTP+FlashInfer crash when trtllm kernels are available but disabled (#26361) Benjamin Chislett 2025-10-07 18:12:26 -04:00
  • 1b86bd8e18 Add more libraries to rlhf.md (#26374) Michael Goin 2025-10-07 16:59:41 -04:00
  • 59012df99b [TPU] update TPU benchmark threshold (#25713) Johnny Yang 2025-10-07 13:53:09 -07:00
  • 01efc7ef78 [ci] fix wheel names for arm wheels (#24898) (tag: v0.10.2) Simon Mo 2025-09-15 14:39:08 -07:00
  • 3d1f67616d [Spec Decode] Enable efficient speculative decoding with FlashInfer-MLA (#25984) Benjamin Chislett 2025-10-07 16:05:59 -04:00
  • 6ebaf43ee4 [V1] Logit processors for rejection sampler (#19482) Sergei Skvortsov 2025-10-07 21:02:49 +01:00
  • 0c824fc46f [Frontend] CompilationConfig overhaul (#20283): deprecate use_inductor in favor of backend, simplify custom_ops (#26113) Morrison Turnansky 2025-10-07 15:53:43 -04:00
  • eb577e4655 [Bugfix] Add missing sink tensor into flash attn cascade attn implementation (#26325) Pei-Lun Liao 2025-10-07 11:56:39 -07:00
  • 8f36850f73 [Bug] Fix Shape Validation for Fallback while Enabling E8M0 for DeepGEMM (#26322) Wentao Ye 2025-10-07 13:50:30 -04:00
  • 29fd2662ba [deepseek] add EP8 FusedMOE config for H200 and B200 (#26331) Chen Zhang 2025-10-08 01:38:54 +08:00
  • 30a3e5af69 [CI] Add Qwen3 MoE NVFP4 to Blackwell lm-eval (#26316) Michael Goin 2025-10-07 13:36:15 -04:00
  • a38c1bfe09 [ci] Rename test_mxfp4_moe.py to test_ocp_mx_moe.py (#26364) fxmarty-amd 2025-10-07 18:52:24 +02:00
  • 320feae6f5 [Model] Lfm2Moe (#26344) Paul Pak 2025-10-08 01:03:05 +09:00
  • 1e4ecca1d0 [V0 Deprecation] Remove VLLM_USE_V1 from tests (#26341) Cyrus Leung 2025-10-07 23:42:31 +08:00
  • c0a7b89d8e [Misc] Move LRUCache into its own file (#26342) Cyrus Leung 2025-10-07 23:08:40 +08:00
  • 6f59beaf0b [Model] Add support for ModernBertForTokenClassification (#26340) antrec 2025-10-07 16:29:19 +02:00
  • 41f1cf38f2 [Feature][OCP MX] Support mxfp6 and mixed mxfp6-mxfp4 (#21166) fxmarty-amd 2025-10-07 15:35:26 +02:00
  • 08d26a1b7e [Model] Use merge_by_field_config for MM models (Ovis family) (#26308) Isotr0py 2025-10-07 20:54:22 +08:00
  • 63773a6200 [Docs] add docs for cuda graph v1 (#24374) fhl2000 2025-10-07 20:25:05 +08:00
  • 883b42896a Add TRL example notebook to RLHF docs (#26346) Sergio Paniego Blanco 2025-10-07 13:31:28 +02:00
  • e1098ced95 Add topk logits torch op for DS3.2. (#25945) Daniel Cámpora 2025-10-07 12:07:32 +02:00
  • d100d78eb3 Optimize KV cache distribution for asymmetric pipeline parallelism (#25164) Grant Holmes (Ren) 2025-10-07 04:20:30 -05:00
  • 7e4cd070b0 [V0 Deprecation] Remove VLLM_USE_V1 from docs and scripts (#26336) Cyrus Leung 2025-10-07 16:46:44 +08:00
  • 46b0779996 [BugFix] Update KV block hash type from BlockHash to ExternalBlockHash in kv_events_subscriber - #26264 (#26265) Snehlata 2025-10-07 14:12:28 +05:30
  • de342585ff [Model] Define merge_by_field_config MM interface (R-T) (#26260) Ayush Satyam 2025-10-07 13:40:55 +05:30
  • 185d8ed44f [responsesAPI][bugfix] serialize harmony messages (#26185) Andrew Xia 2025-10-07 00:07:53 -07:00
  • d9836d4517 [Deprecation] Deprecate LLM.set_tokenizer (#26333) Cyrus Leung 2025-10-07 14:50:57 +08:00
  • 5f7e8a916a [Model] Define merge_by_field_config MM interface (U-Z) (#26261) Ayush Satyam 2025-10-07 12:15:49 +05:30
  • 4dbdf4a294 [BUG] Fix file parsing for load_format runai_streamer_sharded (#26324) ahao-anyscale 2025-10-06 20:23:07 -07:00
  • c6873c4e6d [UX] Support nested dicts in hf_overrides (#25727) Michael Goin 2025-10-06 23:19:16 -04:00
  • 2111b4643c [Core] Simplify the Dp padding/should ubatch coordination logic (#25768) Sage Moore 2025-10-06 18:57:49 -07:00
  • c50901f3b9 [Docs][DBO] Add initial doc that describes the DBO implementation (#26024) Sage Moore 2025-10-06 17:47:28 -07:00
  • 8229280a9c [Misc] Define EP kernel arch list in Dockerfile (#25635) Simon Mo 2025-10-06 17:05:33 -07:00
  • f77df94647 [Perf] Add decode full-graph support to FlashInfer-MLA backend (#26313) Benjamin Chislett 2025-10-06 19:03:49 -04:00
  • f231e5bc21 [ROCm] Split AITER unified attention into its own backend (#25507) Gregory Shtrasberg 2025-10-06 18:49:23 -04:00
  • 2161efe978 [Bugfix] Allow skipping MoE in NVFP4 (fix for MTP) (#25987) Benjamin Chislett 2025-10-06 16:16:30 -04:00
  • f23b4c04fd [BugFix] Pad input buffers in _dummy_run (#26209) Varun Sundar Rabindranath 2025-10-06 16:07:51 -04:00
  • 93540958b8 [Docs] Fix broken table in moe_kernel_features doc (#26314) Varun Sundar Rabindranath 2025-10-06 15:58:05 -04:00
  • 44b9af5bb2 [Benchmark] Enable MM Embedding benchmarks (#26310) Cyrus Leung 2025-10-07 03:51:58 +08:00
  • 7cd95dc8a3 [Bugfix] Fix gemma3 with transformers backend (#23178) Raushan Turganbay 2025-10-06 20:42:32 +02:00
  • c02058c222 Add bias handling to CPUFusedMOE kernel (#26289) Crefeda Rodrigues 2025-10-06 19:39:10 +01:00
  • b2ea5ba677 [Bugfix][Spec Decode] Fix wrong valid_mask for padded speculation when chunked prefill occurs (#26231) 7mile 2025-10-07 02:24:22 +08:00
  • 824a3f403f [Misc] auto_tune: kill specific vllm process (#26304) Karan Goel 2025-10-06 11:02:51 -07:00
  • 05f6846ede Support llama3 eagle3 head with llama4 verifier (#25961) Rahul Tuli 2025-10-06 23:26:08 +05:30
  • 20db99cc69 [CI Bugfix] Make sure TRTLLM attention is available in test_blackwell_moe (#26188) Michael Goin 2025-10-06 13:50:11 -04:00
  • 6431be808f [Tests] conftest: Extending VllmRunner and HfRunner to accept token_ids as input (#26295) Yannick Schnider 2025-10-06 19:19:34 +02:00
  • 4727a8afa7 [Attention] Remove unused reorder_batch method (#24463) Matthew Bonanni 2025-10-06 13:13:39 -04:00
  • b8f603cebe [Model] EVS support for nano_nemotron_vl (#26269) tomeras91 2025-10-06 19:23:37 +03:00
  • fc679696f8 Fix DotsOCR tensor type (#26281) Chatcharin Sangbutsarakum 2025-10-06 19:23:43 +07:00
  • ab5e7d93f4 [Bugfix] Fix mrope in Transformers Backend (#26087) Raushan Turganbay 2025-10-06 13:40:50 +02:00
  • 0340f45553 Support expert parallel load balancing in Transformers backend (#26287) Harry Mellor 2025-10-06 12:20:16 +01:00
  • 19a00eb210 [Model] Use merge_by_field_config for MM models (Llava family) (#26280) Cyrus Leung 2025-10-06 17:45:26 +08:00
  • 391612e78b [Frontend] Consolidate tokenizer init code (#26276) Cyrus Leung 2025-10-06 17:34:52 +08:00
  • 77c95f72f7 [Doc] add KAITO to integrations (#25521) abhisheksheth28 2025-10-06 02:30:03 -07:00
  • 59f30d0448 [Docs] Edit HF Inference Endpoints documentation (#26275) Aritra Roy Gosthipaty 2025-10-06 14:43:09 +05:30
  • 43c146ca42 [Misc] Clean up unnecessary E501 ignore (#26274) Roger Wang 2025-10-06 00:29:18 -07:00
  • 7c2ec0fe87 [Benchmarking] Add disable_shuffle option for dataset loading (#26258) Yasmin Moslem 2025-10-06 08:05:44 +01:00
  • 039b6bade3 Bump actions/stale from 10.0.0 to 10.1.0 (#26272) dependabot[bot] 2025-10-06 07:01:21 +00:00
  • 6c04638214 Fix per file ruff ignores related to line length (#26262) Harry Mellor 2025-10-06 06:12:40 +01:00
  • 91ac7f764d [CI][gpt-oss] Enable python tool tests in CI (#24315) wuhang 2025-10-06 12:20:06 +08:00
  • 4be7d7c1c9 [MISC] Add heheda12345 to CODEOWNERS of vllm/config/cache.py (#26270) Chen Zhang 2025-10-05 19:58:59 -07:00
  • 59b477645c [Doc] Edited minor typo (#26266) orangeng 2025-10-05 19:53:09 -07:00
  • 778f554157 [V1] [Hybrid] Some additional clean-up in Mamba2 prefix caching (#26222) Thomas Parnell 2025-10-06 04:40:30 +02:00
  • d3c84297c3 [CI] Add comment about the single cudagraph capture size that is used (#26252) Thomas Parnell 2025-10-06 04:35:37 +02:00
  • f509a20846 [DOC] Update production-stack.md (#26177) Elieser Pereira 2025-10-05 18:32:48 -03:00
  • 60bc25e74c [CI] Add Blackwell LM Eval Small Models test to nightly (#26052) Michael Goin 2025-10-05 16:59:50 -04:00
  • b893d661b1 Fix per file ruff ignores related to simplification (#26259) Harry Mellor 2025-10-05 21:31:53 +01:00
  • 6b6e98775f [NVIDIA] flashinfer TRTLLM attention prefill token limit (#25998) Jason Li 2025-10-05 16:24:37 -04:00
  • 9c3c21c519 [CI] fix mamba kernel test (#26250) Jiangyun Zhu 2025-10-06 02:26:59 +08:00
  • 512b8affa4 Update ruff pre-commit hooks version (#26255) Harry Mellor 2025-10-05 17:50:50 +01:00
  • 1c0c68202c Fix per file ruff ignores related to typing (#26254) Harry Mellor 2025-10-05 17:37:55 +01:00
  • 5f317530ec fix(tests): Resolve late binding of loop variable in assert message lambda (#26249) ihb2032 2025-10-06 00:18:22 +08:00
  • 557b2e961d Remove all cases of fmt: on/off (#26253) Harry Mellor 2025-10-05 17:18:14 +01:00
  • 4e256cadc2 Remove all references to yapf as it's no longer used (#26251) Harry Mellor 2025-10-05 17:18:11 +01:00
  • d6953beb91 Convert formatting to use ruff instead of yapf + isort (#26247) Harry Mellor 2025-10-05 15:06:22 +01:00
  • 17edd8a807 [Platform][Kernel] platform-specific kernel loading (#25823) Hank_ 2025-10-05 19:25:15 +08:00
  • 3303cfb4ac [Bugfix][Hardware][RISC-V] Limit supported dtypes to float32 to avoid scheduler segfault (#26228) ihb2032 2025-10-05 18:36:54 +08:00
  • b7e8e4e6be [Bugfix] Always apply MM processor even when no MM items are passed (#26240) Cyrus Leung 2025-10-05 18:10:20 +08:00
  • 432e1cbc23 [Bugfix]: Assertion error when using FlashInfer backend (#25933) Simon Danielsson 2025-10-05 09:46:36 +01:00
  • 201c971e96 [Perf][Easy] Early stop in request_block_hasher (#26112) Jialin Ouyang 2025-10-05 01:46:03 -07:00
  • e0986ea07b Add documentation for granite 4 tool calling (#26175) Maximilien de Bayser 2025-10-05 04:35:42 -03:00
  • a964e5e6c3 [Bugfix] Allow --skip-tokenizer-init with echo and return_token_ids (#26238) Cyrus Leung 2025-10-05 13:38:53 +08:00
  • 78c1d5bfd2 [Easy] Add str repr for IterationStats (#26232) 22quinn 2025-10-04 22:00:21 -07:00
  • 59a85c366e [Model] Use merge_by_field_config for MM models (H-L) (#26230) Cyrus Leung 2025-10-05 11:54:17 +08:00
  • 119f00630b [Renderer] Clean up renderer code (#26216) Cyrus Leung 2025-10-05 01:05:29 +08:00
  • a42d2df75f [Frontend] Cache chat template kwargs resolution (#26227) Isotr0py 2025-10-04 23:32:30 +08:00
  • 5c057e068f [CPU] Refine batch reorder of CPU attention backend (#26096) Li, Jiang 2025-10-04 21:54:35 +08:00
  • ed3aeb25a4 [V1] [Hybrid] Remove code to override default CUDA graph configuration (#26226) Thomas Parnell 2025-10-04 15:47:48 +02:00
  • 86ee949128 Fix tensor device and dtype placement in Qwen2VL model (#26219) yuafng 2025-10-04 06:41:39 -07:00
  • 4570535ec4 [Model] CLIP Embedding Support (#26010) Cyrus Leung 2025-10-04 21:21:42 +08:00
  • 2a6dc67eb5 [Bugfix] Fix _reqs_to_process leak on abort (#26012) Nicolò Lucchesi 2025-10-04 13:39:31 +02:00
  • f05fea1f5e [Core] Enable decode of context length equal to max model length (#26168) Yannick Schnider 2025-10-04 11:59:26 +02:00
  • d0df145c2a Add Olmo 3 reasoning parser (#26054) Luca Soldaini 2025-10-04 02:48:29 -07:00