Commit Graph

  • 635b897246 [distributed] remove pynccl's redundant stream (#11744) cennn 2025-01-05 23:09:11 +08:00
  • 4068f4b5b5 [MISC] Replace c10::optional with std::optional (#11730) Lu Fang 2025-01-04 17:20:34 -08:00
  • 47831430cc [Bugfix][V1] Fix test_kv_cache_utils.py (#11738) Jee Jee Li 2025-01-05 00:07:59 +08:00
  • 65c08928c2 [Model] Remove unnecessary weight initialization logic (#11736) Cyrus Leung 2025-01-04 23:46:21 +08:00
  • ba214dffbe [Bugfix] Fix precision error in LLaVA-NeXT (#11735) Cyrus Leung 2025-01-04 23:45:57 +08:00
  • eed11ebee9 [VLM] Merged multi-modal processors for LLaVA-NeXT-Video and LLaVA-OneVision (#11717) Cyrus Leung 2025-01-04 19:40:53 +08:00
  • 300acb8347 [Core][Bugfix] Use correct device to initialize GPU data during CUDA-graph-capture (#11233) Yan Burman 2025-01-04 08:50:16 +02:00
  • d91457d529 [V1] Add kv cache utils tests. (#11513) xcnick 2025-01-04 14:49:46 +08:00
  • fbf2564554 [V1] Add RayExecutor support for AsyncLLM (api server) (#11712) Kunshang Ji 2025-01-04 14:41:31 +08:00
  • d1d49397e7 Update bnb.md with example for OpenAI (#11718) Alberto Ferrer 2025-01-04 00:29:02 -06:00
  • 9c93636d84 Update tool_calling.md (#11701) Hust_YangXian 2025-01-04 14:16:30 +08:00
  • e5d7ed0c53 [V1] log GPU blocks num for MultiprocExecutor (#11656) WangErXiao 2025-01-04 08:13:12 +08:00
  • ad0d567e1c [V1] Chore: cruft removal (#11724) Robert Shaw 2025-01-03 18:25:02 -05:00
  • bf0d97d786 Update requirements-tpu.txt to support python 3.9 and 3.11 (#11695) Michael Goin 2025-01-03 17:36:46 -05:00
  • a655eb3025 [Misc]Add BNB quantization for Qwen2VL (#11719) Jee Jee Li 2025-01-04 06:19:02 +08:00
  • 1543914c04 [V1] Improve TP>1 Error Handling + Stack Trace (#11721) Robert Shaw 2025-01-03 16:29:11 -05:00
  • 61fed92c7e [Bugfix] Fix ColumnParallelLinearWithLoRA slice (#11708) ZincCat 2025-01-03 13:02:34 -08:00
  • 80c751e7f6 [V1] Simplify Shutdown (#11659) Robert Shaw 2025-01-03 12:25:38 -05:00
  • e1a5c2f0a1 [Model] Whisper model implementation (#11280) Aurick Qiao 2025-01-03 03:39:19 -05:00
  • fd3a62a122 [perf-benchmark] Fix dependency for steps in benchmark pipeline (#11710) Kevin H. Luu 2025-01-03 13:38:37 +07:00
  • 07064cb1d4 [Bugfix] Check chain_speculative_sampling before calling it (#11673) Lu Fang 2025-01-02 16:58:56 -08:00
  • 2f1e8e8f54 Update default max_num_batch_tokens for chunked prefill (#11694) Sachin Varghese 2025-01-02 19:25:53 -05:00
  • 68d37809b9 [Misc] Minimum requirements for SageMaker compatibility (#11576) Nathan Azrak 2025-01-03 10:59:25 +11:00
  • 5dba257506 Resolve race conditions in Marlin kernel (#11493) wchen61 2025-01-03 06:58:56 +08:00
  • 187e32997c [Bugfix] Change kv scaling factor by param json on nvidia gpu (#11688) bjmsong 2025-01-03 05:11:39 +08:00
  • b55ed6ef8a [V1][Minor] Optimize token_ids_cpu copy (#11692) Woosuk Kwon 2025-01-03 04:04:58 +09:00
  • 2f385183f3 [Bugfix] Free cross attention block table for preempted-for-recompute sequence group. (#10013) Kathy Yu 2025-01-02 10:28:09 -08:00
  • 84c35c374a According to vllm.EngineArgs, the name should be distributed_executor_backend (#11689) Chunyang Wen 2025-01-03 02:14:16 +08:00
  • 8c38ee7007 [VLM] Merged multi-modal processor for LLaVA-NeXT (#11682) Cyrus Leung 2025-01-03 00:39:27 +08:00
  • b6087a6bee [mypy] Pass type checking in vllm/inputs (#11680) Tobias Pitters 2025-01-02 17:18:15 +01:00
  • 23c1b10a4c [VLM][Bugfix] Multi-modal processor compatible with V1 multi-input (#11674) Cyrus Leung 2025-01-02 17:00:00 +08:00
  • a115ac46b5 [VLM] Move supported limits and max tokens to merged multi-modal processor (#11669) Cyrus Leung 2025-01-01 23:44:42 +08:00
  • 73001445fb [V1] Implement Cascade Attention (#11635) Woosuk Kwon 2025-01-01 21:56:46 +09:00
  • 6d70198b17 [Doc] Fix typo (#11666) Kazuhiro Serizawa 2025-01-01 17:10:10 +09:00
  • f962f426bc [Misc] Replace space with - in the file names (#11667) Lu Fang 2024-12-31 23:39:30 -08:00
  • 11d8a091c6 [Misc] Optimize Qwen2-VL LoRA test (#11663) Jee Jee Li 2025-01-01 14:42:23 +08:00
  • 365801fedd [VLM] Add max-count checking in data parser for single image models (#11661) Cyrus Leung 2025-01-01 14:15:21 +08:00
  • 4db72e57f6 [Bugfix][Refactor] Unify model management in frontend (#11660) Joe Runde 2024-12-31 18:21:51 -08:00
  • 0c6f998554 [Benchmark] Add benchmark script for CPU offloading (#11533) Yihua Cheng 2024-12-31 18:10:55 -06:00
  • e7c7c5e822 [V1][VLM] V1 support for selected single-image models. (#11632) Roger Wang 2024-12-31 13:17:22 -08:00
  • 8c3230d8c1 [V1] Simpify vision block hash for prefix caching by removing offset from hash (#11646) Chen Zhang 2024-12-31 16:56:01 +08:00
  • 2c5718809b [Bugfix] Move the _touch(computed_blocks) call in the allocate_slots method to after the check for allocating new blocks. (#11565) sakunkun 2024-12-31 14:29:04 +08:00
  • 82c49d3260 [Misc][LoRA] Support Rank Stabilized LoRA (RSLoRA) (#6909) John Giorgi 2024-12-31 01:15:58 -05:00
  • 74fa1d123c [Bugfix] Fix OpenAI parallel sampling when using xgrammar (#11637) Michael Goin 2024-12-30 22:43:54 -05:00
  • a2a40bcd0d [Model][LoRA]LoRA support added for MolmoForCausalLM (#11439) Matthias Vogler 2024-12-31 02:33:06 +01:00
  • ccb1aabcca [benchmark] Remove dependency for H100 benchmark step (#11572) Kevin H. Luu 2024-12-30 12:27:07 -08:00
  • 36e7670045 [Bugfix] Validate and concatenate image embeddings in MiniCPMVBaseModel (#11631) whyiug 2024-12-31 02:51:04 +08:00
  • 5886aa496e [V1] [6/N] API Server: Better Shutdown (#11586) Robert Shaw 2024-12-30 10:51:02 -05:00
  • 8d9b6721e7 [VLM] Abstract out multi-modal data parsing in merged processor (#11620) Cyrus Leung 2024-12-30 23:01:35 +08:00
  • b12e87f942 [platforms] enable platform plugins (#11602) youkaichao 2024-12-30 20:24:45 +08:00
  • 5dbf854553 [CI/Build][CPU] Fix CPU CI by lazy importing triton FP8 kernels (#11618) Li, Jiang 2024-12-30 18:17:04 +08:00
  • 970d6d0776 [Build][Kernel] Update CUTLASS to v3.6.0 (#11607) Tyler Michael Smith 2024-12-30 04:22:13 -05:00
  • 628ec6c17b [Docker] bump up neuron sdk v2.21 (#11593) Liangfu Chen 2024-12-29 21:46:14 -08:00
  • 3682e33f9f [v1] fix compilation cache (#11598) youkaichao 2024-12-30 12:24:12 +08:00
  • 0aa38d16f5 Remove print statement in DeepseekScalingRotaryEmbedding (#11604) Michael Goin 2024-12-29 15:16:46 -05:00
  • faef77c0d6 [Misc] KV cache transfer connector registry (#11481) Kuntai Du 2024-12-29 10:08:09 -06:00
  • dba4d9dec6 [v1][bugfix] fix cudagraph with inplace buffer assignment (#11596) youkaichao 2024-12-29 17:03:49 +08:00
  • 32b4c63f02 [Doc] Convert list tables to MyST (#11594) Cyrus Leung 2024-12-29 15:56:22 +08:00
  • 4fb8e329fd [V1] [5/N] API Server: unify Detokenizer and EngineCore input (#11545) Robert Shaw 2024-12-28 15:51:57 -05:00
  • 328841d002 [bugfix] interleaving sliding window for cohere2 model (#11583) youkaichao 2024-12-29 00:55:42 +08:00
  • d427e5cfda [Doc] Minor documentation fixes (#11580) Cyrus Leung 2024-12-28 21:53:59 +08:00
  • 42bb201fd6 [V1][Minor] Set pin_memory=False for token_ids_cpu tensor (#11581) Woosuk Kwon 2024-12-28 22:33:12 +09:00
  • 59d6bb4c86 [Hardware][AMD]: Replace HIPCC version with more precise ROCm version (#11515) hj-wei 2024-12-28 19:17:35 +08:00
  • b7dcc003dc [Model] Remove hardcoded image tokens ids from Pixtral (#11582) Roger Wang 2024-12-28 02:54:23 -08:00
  • d34be24bb1 [Model] Support InternLM2 Reward models (#11571) Isotr0py 2024-12-28 14:14:10 +08:00
  • b5cbe8eeb3 [Bugfix] Last token measurement fix (#11376) Rajveer Bachkaniwala 2024-12-27 22:34:46 -05:00
  • df04dffade [V1] [4/N] API Server: ZMQ/MP Utilities (#11541) Robert Shaw 2024-12-27 20:45:08 -05:00
  • a60731247f [Doc] Update mllama example based on official doc (#11567) Chen Zhang 2024-12-28 08:31:10 +08:00
  • ac79799403 [Bugfix] Fix for ROCM compressed tensor support (#11561) Selali 2024-12-27 12:12:11 -08:00
  • dde1fa18c9 [Misc] Improve BNB loader to handle mixture of sharded and merged weights with same suffix (#11566) Isotr0py 2024-12-28 03:45:13 +08:00
  • 0240402c46 [Misc]Add BNB quantization for MolmoForCausalLM (#11551) Jee Jee Li 2024-12-28 02:48:24 +08:00
  • 55509c2114 [MODEL] LoRA support for Jamba model (#11209) ErezSC42 2024-12-27 19:58:21 +02:00
  • 101418096f [VLM] Support caching in merged multi-modal processor (#11396) Cyrus Leung 2024-12-28 01:22:48 +08:00
  • 5ce4627a7e [Doc] Add xgrammar in doc (#11549) Chen1022 2024-12-27 21:05:10 +08:00
  • 7af553ea30 [Misc] Abstract the logic for reading and writing media content (#11527) Cyrus Leung 2024-12-27 19:21:23 +08:00
  • 2c9b8ea2b0 [Bugfix] Fix TeleChat2ForCausalLM weights mapper (#11546) Jee Jee Li 2024-12-27 18:39:15 +08:00
  • d003f3ea39 Update deploying_with_k8s.md with AMD ROCm GPU example (#11465) AlexHe99 2024-12-27 18:00:04 +08:00
  • 6c6f7fe8a8 [Platform] Move model arch check to platform (#11503) Mengqing Cao 2024-12-27 16:45:25 +08:00
  • 2339d59f92 [BugFix] Fix quantization for all other methods (#11547) (tag: v0.6.6.post1) Robert Shaw 2024-12-27 01:23:29 -05:00
  • 1b875a0ef3 [V1][3/N] API Server: Reduce Task Switching + Handle Abort Properly (#11534) Robert Shaw 2024-12-27 00:19:21 -05:00
  • eb881ed006 [misc] fix typing (#11540) youkaichao 2024-12-27 11:05:08 +08:00
  • 46d4359450 [CI] Fix broken CI (#11543) Robert Shaw 2024-12-26 21:49:16 -05:00
  • 81b979f2a8 [V1] Fix yapf (#11538) Woosuk Kwon 2024-12-27 09:47:10 +09:00
  • 371d04d39b [V1] Use FlashInfer Sampling Kernel for Top-P & Top-K Sampling (#11394) Woosuk Kwon 2024-12-27 09:32:38 +09:00
  • 0c0c2015c5 Update openai_compatible_server.md (#11536) Robert Shaw 2024-12-26 19:26:18 -05:00
  • 82d24f7aac [Docs] Document Deepseek V3 support (#11535) Simon Mo 2024-12-26 16:21:56 -08:00
  • f49777ba62 Deepseek v3 (#11502) (tag: v0.6.6) Simon Mo 2024-12-26 16:09:44 -08:00
  • 55fb97f7bd [2/N] API Server: Avoid ulimit footgun (#11530) Robert Shaw 2024-12-26 18:43:05 -05:00
  • 2072924d14 [Model] [Quantization] Support deepseek_v3 w8a8 fp8 block-wise quantization (#11523) Michael Goin 2024-12-26 18:33:30 -05:00
  • 720b10fdc6 [1/N] API Server (Remove Proxy) (#11529) Robert Shaw 2024-12-26 18:03:43 -05:00
  • b85a977822 [Doc] Add video example to openai client for multimodal (#11521) Isotr0py 2024-12-27 01:31:29 +08:00
  • eec906d811 [Misc] Add placeholder module (#11501) Cyrus Leung 2024-12-26 21:12:51 +08:00
  • f57ee5650d [Model] Modify MolmoForCausalLM MLP (#11510) Jee Jee Li 2024-12-26 21:12:05 +08:00
  • dcb1a944d4 [V1] Adding min tokens/repetition/presence/frequence penalties to V1 sampler (#10681) sroy745 2024-12-26 02:02:58 -08:00
  • 7492a36207 [Doc] Add QVQ and QwQ to the list of supported models (#11509) Roger Wang 2024-12-26 01:44:32 -08:00
  • aa25985bd1 [Misc][LoRA] Fix LoRA weight mapper (#11495) Jee Jee Li 2024-12-26 15:52:48 +08:00
  • dbeac95dbb Mypy checking for vllm/compilation (#11496) Lucas Tucker 2024-12-25 23:04:07 -06:00
  • 51a624bf02 [Misc] Move some multimodal utils to modality-specific modules (#11494) Cyrus Leung 2024-12-26 12:23:20 +08:00
  • 6ad909fdda [Doc] Improve GitHub links (#11491) Cyrus Leung 2024-12-26 06:49:26 +08:00
  • b689ada91e [Frontend] Enable decord to load video from base64 (#11492) Cyrus Leung 2024-12-26 00:33:55 +08:00