Commit Graph

  • 54cf1cae62 [Misc] Do not print async output warning for v1 (#21151) Woosuk Kwon 2025-07-17 21:57:02 -07:00
  • 5780121c95 [Perf] Add swap_ab to SM90 FP8 non-block CUTLASS moe grouped gemm (#20911) shixianc 2025-07-17 21:34:43 -07:00
  • c7d8724e78 [Core] FlashInfer CUTLASS fused MoE backend (NVFP4) (#20037) Shu Wang 2025-07-17 23:32:45 -05:00
  • b38baabcf9 [Doc] Add inplace weights loading example (#19640) 22quinn 2025-07-17 21:12:23 -07:00
  • 89cab4d01f [Attention] Make local attention backend agnostic (#21093) Lucas Wilkinson 2025-07-18 00:10:42 -04:00
  • b9a21e9173 [Docs] Update supported models documentation with missing models (#20844) Lucia Fang 2025-07-18 11:12:13 +08:00
  • c4e3b12524 [Docs] Add minimal demo of Ray Data API usage (#21080) Ricardo Decal 2025-07-17 20:09:19 -07:00
  • 8dfb45ca33 [Bugfix] Fix the tensor non-contiguous issue for Flashinfer TRT-LLM backend attention kernel (#21133) elvischenv 2025-07-18 08:35:58 +08:00
  • 8a8fc94639 [Log] Debugging Log with more Information (#20770) Wentao Ye 2025-07-17 20:19:46 -04:00
  • 4de7146351 [V0 deprecation] Remove V0 HPU backend (#21131) Woosuk Kwon 2025-07-17 16:37:36 -07:00
  • ac9fb732a5 On environments where numa cannot be detected we get 0 (#21115) Eric Curtin 2025-07-17 19:52:17 +01:00
  • a3a6c695f4 [Misc] Qwen MoE model supports LoRA (#20932) Jee Jee Li 2025-07-18 02:32:52 +08:00
  • 90bd2ab6e3 [Model] Update pooling model interface (#21058) Cyrus Leung 2025-07-18 00:05:40 +08:00
  • 9fb2d22032 [Performance] Performance improvements in non-blockwise fp8 CUTLASS MoE (#20762) ElizaWszola 2025-07-17 15:56:44 +02:00
  • 2d6a38209b [Docs] Move code block out of admonition now that it's short (#21118) Harry Mellor 2025-07-17 14:12:29 +01:00
  • 89e3c4e9b4 [Misc] Avoid unnecessary import (#21106) wangxiyuan 2025-07-17 20:57:41 +08:00
  • fe8a2c544a [Docs] Improve docstring formatting for FusedMoEParallelConfig.make (#21117) Harry Mellor 2025-07-17 12:13:00 +01:00
  • 4ef00b5cac [VLM] Add Nemotron-Nano-VL-8B-V1 support (#20349) kYLe 2025-07-17 05:07:55 -05:00
  • 5a7fb3ab9e [Model] Add ToolParser and MoE Config for Hunyuan A13B (#20820) Asher 2025-07-17 17:10:09 +08:00
  • 11dfdf21bf [Kernel] DeepGemm MoE : Integrate triton permute / unpermute kernels (#20903) Varun Sundar Rabindranath 2025-07-17 13:40:37 +05:30
  • fdc5b43d20 [Bugfix]: Fix final_res_batch list index out of range error (#21055) Chauncey 2025-07-17 15:29:09 +08:00
  • c5b8b5953a [Misc] Fix PhiMoE expert mapping (#21085) Jee Jee Li 2025-07-17 13:47:49 +08:00
  • 4fcef49ec4 [V1] [KVConnector] Fix MultiprocExecutor worker output aggregation (#21048) David Ben-David 2025-07-17 08:29:45 +03:00
  • 8a4e5c5f3c [V1][P/D]Enhance Performance and code readability for P2pNcclConnector (#20906) Zhonghua Deng 2025-07-17 13:13:00 +08:00
  • 76b494444f [Attention] Refactor attention metadata builder interface (#20466) Lucas Wilkinson 2025-07-17 00:44:25 -04:00
  • 28a6d5423d [Bugfix] Fix Machete zero point issue for GPTQ models on SM90 (#21066) Michael Goin 2025-07-16 22:54:45 -04:00
  • 58760e12b1 [TPU] Start using python 3.12 (#21000) XiongfeiWei 2025-07-16 19:37:44 -07:00
  • a50d918225 [Docker] Allow FlashInfer to be built in the ARM CUDA Dockerfile (#21013) Michael Goin 2025-07-16 22:37:13 -04:00
  • c9ba8104ed [Bugfix] weight loading use correct tp_group with patch_tensor_parallel_group (#21024) Kevin_Xiong 2025-07-17 10:36:36 +08:00
  • 4e7dfbe7b4 Update PyTorch to torch==2.7.1 for CUDA (#21011) Michael Goin 2025-07-16 22:30:44 -04:00
  • 72ad273582 Remove torch_xla.tpu.version() from pallas.py. (#21065) QiliangCui 2025-07-16 17:25:26 -07:00
  • 01513a334a Support FP8 Quantization and Inference Run on Intel Gaudi (HPU) using INC (Intel Neural Compressor) (#12010) Nir David 2025-07-16 22:33:41 +03:00
  • ac2bf41e53 [Model] Remove model sampler (#21059) Cyrus Leung 2025-07-17 03:03:37 +08:00
  • a931b4cdcf Remove Qwen Omni workaround that's no longer necessary (#21057) Harry Mellor 2025-07-16 17:25:23 +01:00
  • a0f8a79646 [fix] fix qwen image_embeds input (#21049) Avshalom Manevich 2025-07-16 17:17:20 +02:00
  • 18bdcf4113 feat - add a new endpoint get_tokenizer_info to provide tokenizer/chat-template information (#20575) Mac Misiura 2025-07-16 14:52:14 +01:00
  • 1c3198b6c4 [Model] Consolidate pooler implementations (#20927) Cyrus Leung 2025-07-16 21:39:13 +08:00
  • 260127ea54 [Docs] Add intro and fix 1-2-3 list in frameworks/open-webui.md (#19199) Michael Yao 2025-07-16 21:11:38 +08:00
  • d0dc4cfca4 Fix inadvertently silenced PP tests for mp, add DeepSeek V2/V3 model family to PP tests (#20831) Seiji Eicher 2025-07-16 00:14:49 -07:00
  • d31a647124 [BugFix] Fix import error on non-blackwell machines (#21020) Lucas Wilkinson 2025-07-16 01:27:29 -04:00
  • 85431bd9ad [TPU] fix kv_cache_update kernel block size choosing logic (#21007) Chengji Yao 2025-07-15 21:39:48 -07:00
  • c11013db8b [Meta] Llama4 EAGLE Support (#20591) zhiweiz 2025-07-15 21:14:15 -07:00
  • 1eb2b9c102 [CI] update typos config for CI pre-commit and fix some spells (#20919) Peter Pan 2025-07-16 12:12:40 +08:00
  • 6ebf313790 Avoid direct comparison of floating point numbers (#21002) Maximilien de Bayser 2025-07-16 01:12:14 -03:00
  • cfbcb9ed87 [Voxtral] Add more tests (#21010) Patrick von Platen 2025-07-16 06:11:49 +02:00
  • 76ddeff293 [Doc] Remove duplicate docstring (#21012) Wentao Ye 2025-07-15 23:09:13 -04:00
  • f46098335b [Bugfix] Fix Mistral3 support on SM100/SM120 (#20998) Michael Goin 2025-07-15 23:08:41 -04:00
  • e9534c7202 [CI][HPU] update for v0 deprecate by switching to VLLM_TARGET_DEVICE=empty (#21006) Chendi.Xue 2025-07-15 22:07:05 -05:00
  • 7976446015 Add Dockerfile argument for VLLM_USE_PRECOMPILED environment (#20943) Doug Smith 2025-07-15 22:53:57 -04:00
  • fcb9f879c1 [Bugfix] Correct per_act_token in CompressedTensorsW8A8Fp8MoECutlassM… (#20937) Ming Yang 2025-07-15 19:53:42 -07:00
  • 3ed94f9d0a [Docs] Enhance Anyscale documentation, add quickstart links for vLLM (#21018) Ricardo Decal 2025-07-15 22:46:56 -04:00
  • fa839565f2 [Misc] Refactor: Improve argument handling for conda command (#20481) Reid 2025-07-16 10:43:19 +08:00
  • 75a99b98bf [Chore] Remove outdated transformers check (#20989) Brayden Zhong 2025-07-15 22:42:40 -04:00
  • b5c3b68359 [Misc] bump xgrammar version to v0.1.21 (#20992) Chauncey 2025-07-16 10:42:16 +08:00
  • 6cbc4d4bea [Model] Add ModelConfig class for GraniteMoeHybrid to override default max_seq_len_to_capture (#20923) Thomas Parnell 2025-07-16 04:19:10 +02:00
  • 153c6f1e61 [Frontend] Remove print left in FrontendArgs.add_cli_args (#21004) Michael Goin 2025-07-15 22:18:41 -04:00
  • 34cda778a0 [Frontend] OpenAI Responses API supports input image (#20975) Chauncey 2025-07-16 08:59:36 +08:00
  • 30800b01c2 [Nvidia] Integrate SM100 cudnn prefill API to MLA prefill (#20411) Elfie Guo 2025-07-15 17:56:45 -07:00
  • 10be209493 [Bug Fix] get_distributed_init_method should get the ip from get_ip i… (#20889) Chen LI 2025-07-15 14:23:52 -07:00
  • 19c863068b [Frontend] Support cache_salt in /v1/completions and /v1/responses (#20981) Marko Rosenmueller 2025-07-15 23:01:04 +02:00
  • f29fd8a7f8 [BugFix] fix 3 issues: (1) using metadata for causal-conv1d, (2) indexing overflow in v1 vLLM, and (3) init_states in v0 (#20838) Tuan, Hoang-Trong 2025-07-15 16:08:26 -04:00
  • ed10f3cea1 [ROCm] warpSize is being made non constexpr in ROCm 7.0 (#20330) Gregory Shtrasberg 2025-07-15 14:01:44 -04:00
  • b637e9dcb8 Add full serve CLI reference back to docs (#20978) Harry Mellor 2025-07-15 18:42:30 +01:00
  • 1e36c8687e [Deprecation] Remove nullable_kvs (#20969) Harry Mellor 2025-07-15 18:21:50 +01:00
  • 5bac61362b Configure Gemini (#20971) Harry Mellor 2025-07-15 17:37:05 +01:00
  • 313ae8c16a [Deprecation] Remove everything scheduled for removal in v0.10.0 (#20979) Harry Mellor 2025-07-15 16:57:53 +01:00
  • c847e34b39 [CI/Build] Fix wrong path in Transformers Nightly Models Test (#20994) Cyrus Leung 2025-07-15 23:53:16 +08:00
  • e7e3e6d263 Voxtral (#20970) Patrick von Platen 2025-07-15 16:35:30 +02:00
  • 4ffd963fa0 [v1][core] Support for attention free models (#20811) Christian Pinto 2025-07-15 15:20:01 +01:00
  • 56fe4bedd6 [Deprecation] Remove TokenizerPoolConfig (#20968) Harry Mellor 2025-07-15 15:00:50 +01:00
  • d91278181d [doc] Add more details for Ray-based DP (#20948) Rui Qiao 2025-07-15 05:37:12 -07:00
  • 20149d84d9 [MISC] Add init files for python package (#20908) Li Wang 2025-07-15 20:16:33 +08:00
  • 3534c39a20 [V1] [Hybrid] Refactor mamba state shape calculation; enable V1 via cli (#20840) Thomas Parnell 2025-07-15 13:04:35 +02:00
  • c586b55667 [TPU] Optimize kv cache update kernel (#20415) Yifei Teng 2025-07-15 03:56:43 -07:00
  • 33d560001e [Docs] Improve documentation for ray cluster launcher helper script (#20602) Ricardo Decal 2025-07-15 06:55:45 -04:00
  • f148c44c6a [frontend] Refactor CLI Args for a better modular integration (#20206) kourosh hakhamaneshi 2025-07-15 02:23:42 -07:00
  • 235bfd5dfe [Docs] Improve documentation for RLHF example (#20598) Ricardo Decal 2025-07-15 04:54:10 -04:00
  • 68d28e37b0 [frontend] Add --help=page option for paginated help output (#20961) Reid 2025-07-15 15:42:00 +08:00
  • 37a7d5d74a [Misc] Refactor AllReduceFusionPass. Remove parameter (#20918) Ilya Markov 2025-07-15 08:57:40 +02:00
  • d4d309409f Implement Async Scheduling (#19970) Woosuk Kwon 2025-07-14 23:01:46 -07:00
  • 85bd6599e4 [Model] Add AutoWeightsLoader support for BERT, RoBERTa (#20534) Jennifer He 2025-07-15 01:34:24 -04:00
  • 91b3d190ae [cold start] replace VLLM_COMPILE_DEPYF with debug_dump_dir (#20940) Boyuan Feng 2025-07-14 22:02:17 -07:00
  • fc017915f5 [Doc] Clearer mistral3 and pixtral model support description (#20926) Isotr0py 2025-07-15 12:56:53 +08:00
  • 9ad0a4588b [Bugfix] Switch bailout logic for kv-cache-dtype with SM100 Flashinfer (#20934) Pavani Majety 2025-07-14 20:27:50 -07:00
  • 016b8d1b7f Enabled BnB NF4 inference on Gaudi (#20172) Ruheena Suhani Shaik 2025-07-15 08:56:08 +05:30
  • 80305c1b24 [CI] Fix flaky test_streaming_response test (#20913) Nicolò Lucchesi 2025-07-15 05:15:15 +02:00
  • 37e2ecace2 feat: add image zoom to improve image viewing experience (#20763) Reid 2025-07-15 11:14:23 +08:00
  • 054c8657e3 [Docs] Add Kuberay to deployment integrations (#20592) Ricardo Decal 2025-07-14 23:13:55 -04:00
  • d4170fad39 Use w8a8 quantized matmul Pallas kernel (#19170) XiongfeiWei 2025-07-14 20:06:33 -07:00
  • 946aadb4a0 [CI/Build] Split Entrypoints Test into LLM and API Server (#20945) Michael Goin 2025-07-15 11:44:18 +09:00
  • bcdfb2a330 [Bugfix] Fix incorrect dispatch for CutlassBlockScaledGroupedGemm and DeepGEMM (#20933) Michael Goin 2025-07-15 10:42:17 +09:00
  • ba8c300018 [BugFix] VLLM_DISABLE_COMPILE_CACHE=1 should disable all reads and writes from the cache (#20942) Richard Zou 2025-07-14 21:26:18 -04:00
  • 8cdc371217 SM100 Cutlass MLA decode with unrestricted num_heads (< 128) for DeepSeek TP (#20769) Alexander Matveev 2025-07-14 21:06:38 -04:00
  • 61e20828da Fall back if flashinfer comm module not found (#20936) Yong Hoon Shin 2025-07-14 16:11:18 -07:00
  • 55e1c66da5 [Docs] remove outdated performance benchmark (#20935) Kuntai Du 2025-07-14 15:14:17 -07:00
  • 86f3ac21ce Fix overflow indexing in causal_conv1d kernel (#20938) Thomas Parnell 2025-07-14 23:43:07 +02:00
  • 149f2435a5 [Misc] Relax translations tests (#20856) Nicolò Lucchesi 2025-07-14 22:08:36 +02:00
  • c0569dbc82 [Misc] ModularKernel : Perform WeightAndReduce inside TritonExperts & DeepGemmExperts (#20725) Varun Sundar Rabindranath 2025-07-15 01:17:16 +05:30
  • 8bb43b9c9e Add benchmark dataset for mlperf llama tasks (#20338) Michael Goin 2025-07-15 04:10:07 +09:00
  • 559756214b Change default model to Qwen3-0.6B (#20335) Tyler Michael Smith 2025-07-14 12:54:52 -04:00