Commit Graph

  • e1fe7591f2 [Misc]Code Cleanup (#13859) Chenguang Li 2025-02-26 10:44:30 +08:00
  • 5629f26df7 [V1][Spec Decode] Change Spec Decode Rejection Sampling API (#13729) Lily Liu 2025-02-25 18:14:48 -08:00
  • 9ba28043b5 [misc] Show driver IP info when Ray fails to allocate driver worker (#13858) Rui Qiao 2025-02-25 17:53:43 -08:00
  • 24679788ed DeepSeek V2/V3/R1 only place lm_head on last pp rank (#13833) Harry Mellor 2025-02-26 01:24:57 +00:00
  • 07c4353057 [Model] Support Grok1 (#13795) Michael Goin 2025-02-25 20:07:12 -05:00
  • 34e3494e70 Fix failing MyGemma2Embedding test (#13820) Harry Mellor 2025-02-25 20:33:03 +00:00
  • f75aa72732 [Neuron] Add custom_ops for neuron backend (#13246) Liangfu Chen 2025-02-25 11:47:49 -08:00
  • 340e39e387 Fix string parsing error (#13825) Chen1022 2025-02-26 00:20:29 +08:00
  • f4133ce4e5 [Bugfix] Revert inspection code in #13743 (#13832) Cyrus Leung 2025-02-26 00:18:50 +08:00
  • 6522d55b6f Fix /v1/audio/transcriptions Bad Request Error (#13811) Wen Sun 2025-02-25 22:03:33 +08:00
  • 6ff518626c [Bugfix] Fix deepseek-vl2 inference with more than 2 images (#13818) Isotr0py 2025-02-25 22:03:02 +08:00
  • fa82074167 [Bugfix] Flush TunableOp results before worker processes are destroyed. (#13623) Nichols A. Romero 2025-02-25 05:08:20 -06:00
  • 75e9d49796 [Bugfix] Initialize attention bias on the same device as Query/Key/Value (#13468) Junlin Zhou 2025-02-25 18:13:09 +08:00
  • 32c3b6bfd1 [Misc]Clarify Error Handling for Non-existent Model Paths and HF Repo IDs (#13724) Chen1022 2025-02-25 18:12:19 +08:00
  • 37b6cb4985 [CI/Build] Fix V1 LoRA failure (#13767) Jee Jee Li 2025-02-25 18:01:15 +08:00
  • aabeb2688f [ROCm][Quantization][Kernel] Using HIP FP8 header (#12593) Gregory Shtrasberg 2025-02-25 03:39:59 -05:00
  • 2f42a4888c [Feature] Support KV cache offloading and disagg prefill with LMCache connector. (#12953) Jiayi Yao 2025-02-25 02:38:42 -06:00
  • 3173c3b34e [misc] Clean up ray compiled graph type hints (#13731) Rui Qiao 2025-02-25 00:37:08 -08:00
  • 2d87d7d1ac [Bugfix] Modify modelscope api usage in transformer_utils (#13807) Shanshan Shen 2025-02-25 16:36:07 +08:00
  • aab392774b [Core] xgrammar: Expand list of unsupported jsonschema keywords (#13783) Russell Bryant 2025-02-25 03:21:25 -05:00
  • 6724e79164 [Misc] Check that the model can be inspected upon registration (#13743) Cyrus Leung 2025-02-25 16:18:19 +08:00
  • 03f48b3db6 [Core] LoRA V1 - Add add/pin/list/remove_lora functions (#13705) Varun Sundar Rabindranath 2025-02-25 13:48:02 +05:30
  • 4d251ad00e Fix CompressedTensorsWNA16MoE with grouped scales (#13769) Michael Goin 2025-02-25 03:17:14 -05:00
  • 18e505930d [Bugfix] Support MLA for CompressedTensorsWNA16 (#13725) Michael Goin 2025-02-25 01:10:31 -05:00
  • 4a8cfc7551 [Bugfix] Fix deepseek-v2 error: "missing 1 required positional argument: 'residual'" (#13802) Lucas Wilkinson 2025-02-24 23:33:59 -05:00
  • bc32bc73aa [V1][Metrics] Implement vllm:lora_requests_info metric (#13504) Mark McLoughlin 2025-02-25 04:01:33 +00:00
  • ab1091d5f2 [Misc][Attention][Quantization] init property earlier (#13733) wangxiyuan 2025-02-25 11:19:30 +08:00
  • 1e15aaef56 [Bugfix][Quantization] Fix FP8 + EP (#13784) Tyler Michael Smith 2025-02-24 21:54:17 -05:00
  • 51010a1807 [Misc] set single whitespace between log sentences (#13771) cjackal 2025-02-25 11:26:12 +09:00
  • 7196a3b1db [Doc] arg_utils.py: fixed a typo (#13785) Eli Boyarski 2025-02-25 04:23:04 +02:00
  • cdc1fa12eb Remove unused kwargs from model definitions (#13555) Harry Mellor 2025-02-25 01:13:52 +00:00
  • f61528d46d [Misc][Chore] Clean Up AsyncOutputProcessing Logs (#13780) Robert Shaw 2025-02-24 19:39:07 -05:00
  • 1f0ae3ed0a [Misc] Clean Up EngineArgs.create_engine_config (#13734) Robert Shaw 2025-02-24 13:52:21 -05:00
  • db986c19ea Fix precommit fail in fused_moe intermediate_cache2 chunking (#13772) Michael Goin 2025-02-24 12:25:47 -05:00
  • 227578480d Revert "[V1][Core] Fix memory issue with logits & sampling" (#13775) Roger Wang 2025-02-24 09:16:05 -08:00
  • befc402d34 [V1] V1 engine implements parallel sampling (AsyncLLM and LLMEngine) (#10980) afeldman-nm 2025-02-24 11:29:41 -05:00
  • 444b0f0f62 [Misc][Docs] Raise error when flashinfer is not installed and VLLM_ATTENTION_BACKEND is set (#12513) Nicolò Lucchesi 2025-02-24 16:43:21 +01:00
  • ccc00515fd [BugFix] Illegal memory access for MoE On H20 (#13693) Zhonghua Deng 2025-02-24 23:37:32 +08:00
  • 781096e385 Expert Parallelism (EP) Support for DeepSeek V2 (#12583) Jongseok Park 2025-02-24 07:33:20 -08:00
  • 7940d8a6a7 [CI/Build] add python-json-logger to requirements-common (#12842) Roger Meier 2025-02-24 15:10:33 +01:00
  • c0e3ecd6d2 [Bugfix] fix(logging): add missing opening square bracket (#13011) Roger Meier 2025-02-24 15:10:25 +01:00
  • 23eca9cf68 [model][refactor] remove cuda hard code in models and layers (#13658) Mengqing Cao 2025-02-24 22:10:14 +08:00
  • 437b76ff59 [V1][Core] Fix memory issue with logits & sampling (#13721) Roger Wang 2025-02-24 06:10:06 -08:00
  • f90a375593 [ci] Add logic to change model to S3 path only when S3 CI env var is on (#13727) Kevin H. Luu 2025-02-23 22:32:11 -08:00
  • e7ef74e26e Fix some issues with benchmark data output (#13641) Huy Do 2025-02-23 18:23:18 -08:00
  • cbae7af552 [V1][BugFix] Fix engine core client shutdown hangs (#13298) Nick Hill 2025-02-23 13:07:43 -08:00
  • eb24dc4a45 [v1] torchrun compatibility (#13642) youkaichao 2025-02-23 22:47:24 +08:00
  • 9bebc9512f [Misc] Deprecate --dataset from benchmark_serving.py (#13708) Roger Wang 2025-02-23 05:32:20 -08:00
  • 5a2ba16f5c [Core][Distributed] Use IPC (domain socket) ZMQ socket for local comms (#13688) Nick Hill 2025-02-23 02:54:29 -08:00
  • ba5106e519 [LMM] Implement merged multimodal processor for whisper (#13278) Isotr0py 2025-02-23 17:46:03 +08:00
  • d5ca2110f1 [Quant] BaiChuan SupportsQuant (#13710) Kyle Sayers 2025-02-22 22:21:15 -05:00
  • 2c5e637b57 [ci] Use env var to control whether to use S3 bucket in CI (#13634) Kevin H. Luu 2025-02-22 19:19:45 -08:00
  • 322d2a27d6 [BugFix] Minor: logger import in attention backend (#13706) Andy Lo 2025-02-23 00:51:13 +00:00
  • 82e0d601fc [CI/Build] Fix pre-commit errors from #13571 (#13709) Roger Wang 2025-02-22 16:50:38 -08:00
  • 78ac0f591d [CI/Build] fix uv caching in Dockerfile (#13611) Daniele 2025-02-22 17:25:20 +01:00
  • b56155e7f3 [XPU]fix setuptools version for xpu (#13548) Yan Ma 2025-02-23 00:05:35 +08:00
  • 382f66fb08 [Bugfix] Fix boolean conversion for OpenVINO env variable (#13615) Helena Kloosterman 2025-02-22 17:04:12 +01:00
  • 8354f6640c [Doc] Dockerfile instructions for optional dependencies and dev transformers (#13699) Cyrus Leung 2025-02-22 22:04:31 +08:00
  • c904fdddf6 [ROCm] Apply FP8 weights padding to values not divisible by 512 bytes on ROCm (#13231) Gregory Shtrasberg 2025-02-22 08:54:38 -05:00
  • 558db8083c [V1][Kernel] Refactor the prefix_prefill kernel so that the caller no longer has to pass in the context lengths (#13095) Sage Moore 2025-02-22 05:25:41 -08:00
  • e109e598c7 [NVIDIA] Support nvfp4 cutlass gemm (#13571) Kaixi Hou 2025-02-22 05:24:05 -08:00
  • 8db1b9d0a1 Support SSL Key Rotation in HTTP Server (#13495) Keyun Tong 2025-02-22 05:17:44 -08:00
  • 2382ad29d1 [ci] fix linter (#13701) youkaichao 2025-02-22 20:28:59 +08:00
  • 3e472d882a [core] set up data parallel communication (#13591) youkaichao 2025-02-22 19:28:59 +08:00
  • 7f6bae561c [CI/Build] Fix pre-commit errors (#13696) Cyrus Leung 2025-02-22 16:31:26 +08:00
  • 105b8ce4c0 [Misc] Reduce LoRA-related static variable (#13166) Jee Jee Li 2025-02-22 16:21:30 +08:00
  • 2cb8c1540e [Metrics] Add --show-hidden-metrics-for-version CLI arg (#13295) Mark McLoughlin 2025-02-22 08:20:45 +00:00
  • 1cd981da4f [V1][Metrics] Support vllm:cache_config_info (#13299) Mark McLoughlin 2025-02-22 08:20:00 +00:00
  • fca20841c2 Correction to TP logic for Mamba Mixer 2 when Num Groups not divisible by TP Size (#13660) Yu Chin Fabian Lim 2025-02-22 16:19:10 +08:00
  • da31b5333e [Bugfix] V1 Memory Profiling: V0 Sampler Integration without Rejection Sampler (#13594) Jennifer Zhao 2025-02-22 00:08:29 -08:00
  • bb78fb318e [v1] Support allowed_token_ids in v1 Sampler (#13210) Lu Fang 2025-02-21 22:13:05 -08:00
  • 8aca27fa11 [Bugfix] Fix benchmark script bug: inaccurate stats for vllm backend when max_model_len < input_len + output_len (#13691) Robin 2025-02-22 14:10:38 +08:00
  • 95c617e04b [Misc] Bump compressed-tensors (#13619) Dipika Sikka 2025-02-22 01:09:04 -05:00
  • 9a1f1da5d1 [Bugfix][Model] OLMo 2: split qkv correctly for GQA and MQA (#13687) Shane A 2025-02-21 22:07:45 -08:00
  • 68d630a0c7 [ROCM] fix native attention function call (#13650) Gordon Wong 2025-02-22 14:07:04 +08:00
  • 68d535ef44 [Misc] Capture and log the time of loading weights (#13666) Jun Duan 2025-02-22 01:06:34 -05:00
  • c6ed93860f [Bugfix][API Server] Fix invalid usage of 'ge' and 'le' in port valid… (#13672) Robin 2025-02-22 14:05:28 +08:00
  • 0ffdf8ce0c [HTTP Server] Make model param optional in request (#13568) Keyun Tong 2025-02-21 21:55:50 -08:00
  • 8c0dd3d4df docs: Add a note on full CI run in contributing guide (#13646) Yuan Tang 2025-02-22 00:53:59 -05:00
  • ada7c780d5 [Misc] Fix yapf linting tools etc not running on pre-commit (#13695) Isotr0py 2025-02-22 13:10:43 +08:00
  • 288cc6c234 [Attention] MLA with chunked prefill (#12639) Lucas Wilkinson 2025-02-21 18:30:12 -05:00
  • 900edbfa48 fix typo of grafana dashboard, with correct datasource (#13668) John Zheng 2025-02-22 02:21:05 +08:00
  • b2c3fc5d65 [Bugfix][CPU] Fix cpu all-reduce using native pytorch implementation (#13586) Isotr0py 2025-02-21 14:24:17 +08:00
  • 839b27c6cc [Kernel]Add streamK for block-quantized CUTLASS kernels (#12978) leoneo 2025-02-21 14:14:24 +08:00
  • 34ad27fe83 [ci] Fix metrics test model path (#13635) Kevin H. Luu 2025-02-20 22:12:10 -08:00
  • 1c3c975766 [FEATURE] Enables /score endpoint for embedding models (#12846) Gabriel Marinho 2025-02-21 03:09:47 -03:00
  • 1cdc88614a Missing comment explaining VDR variable in GGUF kernels (#13290) Szymon Ożóg 2025-02-21 07:06:54 +01:00
  • 31aa045c11 [V1][Sampler] Avoid an operation during temperature application (#13587) Nick Hill 2025-02-20 22:05:56 -08:00
  • a30c093502 [Bugfix] Add mm_processor_kwargs to chat-related protocols (#13644) Roger Wang 2025-02-20 22:04:33 -08:00
  • c7b07a95a6 Use pre-commit to update requirements-test.txt (#13617) Harry Mellor 2025-02-21 06:03:27 +00:00
  • 27a09dc52c [NVIDIA] Fix an issue to use current stream for the nvfp4 quant (#13632) Kaixi Hou 2025-02-20 22:01:48 -08:00
  • 981f3c831e [Misc] Adding script to setup ray for multi-node vllm deployments (#12913) Edwin Hernandez 2025-02-20 21:16:40 -08:00
  • 44c33f01f3 Add llmaz as another integration (#13643) Kante Yin 2025-02-21 11:52:40 +08:00
  • 33170081f1 [Neuron][Kernel] Vectorize KV cache load in FlashPagedAttention to maximize DMA bandwidth (#13245) Lingfan Yu 2025-02-20 17:45:45 -08:00
  • 71face8540 [Bugfix] Fix max_num_batched_tokens for MLA (#13620) Michael Goin 2025-02-20 20:45:20 -05:00
  • bfbc0b32c6 [Frontend] Add backend-specific options for guided decoding (#13505) Joe Runde 2025-02-20 13:07:58 -07:00
  • 6a417b8600 fix neuron performance issue (#13589) ajayvohra2005 2025-02-20 13:59:36 -05:00
  • d3ea50113c [V1][Minor] Print KV cache size in token counts (#13596) Woosuk Kwon 2025-02-20 09:24:31 -08:00
  • 34aad515c8 Update pre-commit's isort version to remove warnings (#13614) Harry Mellor 2025-02-20 16:00:14 +00:00
  • ed6e9075d3 [Bugfix] Fix deepseekv3 grouped topk error (#13474) (tag: v0.7.3) chenxiaobing 2025-02-20 22:47:01 +08:00