Michael Goin
04222984f8
[Docs] Add nsight guide to profiling docs (#14298)
Signed-off-by: mgoin <mgoin64@gmail.com>
2025-03-06 14:19:58 -08:00
|
Michael Goin
6832707e90
[V1][Bugfix] Standardize quantized kv cache rejection for attention backends (#14221)
Signed-off-by: mgoin <mgoin64@gmail.com>
2025-03-06 14:18:29 -08:00
|
Michael Goin
6b2ef5cd17
[Bug] Fix Attention when ignored in by quant_method (#14313)
Signed-off-by: mgoin <mgoin64@gmail.com>
2025-03-06 14:18:06 -08:00
|
Tyler Michael Smith
958adce478
[Bugfix] Fix use_direct_call condition in FusedMoE layer for (#14382)
Signed-off-by: Tyler Michael Smith <tyler@neuralmagic.com>
2025-03-06 14:17:21 -08:00
|
Tyler Michael Smith
99b0915d3b
[Kernel] Add needs_fixed_stride_order tag to most GEMMs (#14306)
Signed-off-by: Tyler Michael Smith <tyler@neuralmagic.com>
2025-03-06 14:17:09 -08:00
|
Thomas Parnell
8ca2b21c98
[CI] Disable spawn when running V1 Test (#14345)
Signed-off-by: Thomas Parnell <tpa@zurich.ibm.com>
2025-03-06 21:52:46 +00:00
|
Michael Goin
d9292786e1
[CI/Build] Use uv python for docker rather than ppa:deadsnakes/ppa (#13569)
Signed-off-by: mgoin <mgoin64@gmail.com>
2025-03-06 16:08:36 -05:00
|
Tyler Michael Smith
cc2f9b32c8
[Distributed] Add enable_expert_parallel arg (#14305)
Signed-off-by: Tyler Michael Smith <tyler@neuralmagic.com>
2025-03-06 18:54:45 +00:00
|
Himanshu Jaju
cd579352bf
[V1] Do not detokenize if sampling param detokenize is False (#14224)
Signed-off-by: Himanshu Jaju <hj@mistral.ai>
Signed-off-by: Nick Hill <nhill@redhat.com>
Co-authored-by: Nick Hill <nhill@redhat.com>
2025-03-06 10:40:24 -08:00
|
Ying Zhong
9f1710f1ac
Fix mla prefill context performance (#13897)
Signed-off-by: ZhongYingMatrix <zhongyingmatrix@gmail.com>
2025-03-06 09:35:49 -08:00
|
Thomas Parnell
e642ec962c
Add authors to license header. (#14371)
Signed-off-by: Thomas Parnell <tpa@zurich.ibm.com>
Co-authored-by: Burkhard Ringlein <ngl@zurich.ibm.com>
Co-authored-by: Jan van Lunteren <jvl@zurich.ibm.com>
2025-03-06 08:43:09 -08:00
|
Dilip Gowda Bhagavan
ada19210a3
Adding cpu inference with VXE ISA for s390x architecture (#12613)
Signed-off-by: Dilip Gowda Bhagavan <dilip.bhagavan@ibm.com>
Signed-off-by: Rishika Kedia <rishika.kedia@in.ibm.com>
Co-authored-by: Rishika Kedia <rishika.kedia@in.ibm.com>
2025-03-06 08:40:53 -08:00
|
Harry Mellor
bf0560bda9
Reinstate best_of for V0 (#14356)
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
2025-03-06 08:34:22 -08:00
|
youkaichao
151b08e0fe
[RLHF] use worker_extension_cls for compatibility with V0 and V1 (#14185)
Signed-off-by: youkaichao <youkaichao@gmail.com>
2025-03-07 00:32:46 +08:00
|
Jitse Klomp
81b2f4a45f
[Doc] Fix date typo in README.md (#14366)
Signed-off-by: Jitse Klomp <jitse.klomp@conclusionxforce.nl>
2025-03-06 08:29:57 -08:00
|
Cyrus Leung
82551ad616
[Core] Don't use cache during multi-modal profiling (#14336)
2025-03-06 08:03:31 -08:00
|
courage17340
caac5c2e59
[Bugfix][Core] fix abort_seq_group and memory leak when n>1 (#14326)
Signed-off-by: courage17340 <courage17340@163.com>
2025-03-06 23:59:32 +08:00
|
Thomas Parnell
6bd1dd9d26
[Kernel] [V1] Improved performance for V1 Triton (ROCm) backend (#14152)
2025-03-06 07:39:16 -08:00
|
Irina Yuryeva
4f27044aab
[Doc] Correct beam_search using in generative_models.md (#14363)
2025-03-06 15:37:10 +00:00
|
Yanyi Liu
0ddc991f5c
[Doc] Update reasoning with stream example to use OpenAI library (#14077)
Signed-off-by: liuyanyi <wolfsonliu@163.com>
2025-03-06 13:20:37 +00:00
|
Nicolò Lucchesi
fa82b93853
[Frontend][Docs] Transcription API streaming (#13301)
Signed-off-by: NickLucche <nlucches@redhat.com>
2025-03-06 10:39:35 +00:00
|
Nicolò Lucchesi
69ff99fdcd
[Core] Optimizing cross-attention QKVParallelLinear computation (#12325)
Signed-off-by: NickLucche <nlucches@redhat.com>
Signed-off-by: NickLucche <nick@nlucches-4xa100.c.openshift-330514.internal>
Co-authored-by: NickLucche <nick@nlucches-4xa100.c.openshift-330514.internal>
2025-03-06 09:37:26 +00:00
|
lkchen
5d802522a7
[V1][VLM][Pixtral-HF] Support Pixtral-HF on V1 (#14275)
Signed-off-by: Linkun Chen <github@lkchen.net>
2025-03-06 08:58:41 +00:00
|
kYLe
1769928079
[Model] Update Paligemma multimodal processing with PromptUpdate (#14015)
Signed-off-by: Kyle Huang <kylhuang@nvidia.com>
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
Co-authored-by: Cyrus Leung <tlleungac@connect.ust.hk>
2025-03-06 08:31:38 +00:00
|
Pavani Majety
ed6ea06577
[Hardware] Update the flash attn tag to support Blackwell (#14244)
2025-03-05 22:01:37 -08:00
|
Nicolò Lucchesi
5ee10e990d
[Bugfix][CI] ALiBi test case in xformers multi_query_kv_attention (#11301)
2025-03-05 20:00:53 -08:00
|
Varun Sundar Rabindranath
3dbd2d813a
[V1] LoRA - Enable more V1 tests (#14315)
Signed-off-by: Varun Sundar Rabindranath <varun@neuralmagic.com>
Co-authored-by: Varun Sundar Rabindranath <varun@neuralmagic.com>
2025-03-06 11:55:42 +08:00
|
Ce Gao
f5f7f00cd9
[Bugfix][Structured Output] Support outlines engine with reasoning outputs for DeepSeek R1 (#14114)
2025-03-06 03:49:20 +00:00
|
Rui Qiao
abcc61e0af
[misc] Mention ray list nodes command to troubleshoot ray issues (#14318)
Signed-off-by: Rui Qiao <ruisearch42@gmail.com>
2025-03-06 02:00:36 +00:00
|
Lucas Wilkinson
f6bb18fd9a
[BugFix] MLA + V1, illegal memory access and accuracy issues (#14253)
Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com>
2025-03-05 17:10:13 -08:00
|
Yuan Tang
71eaf8969b
[Build] Add UV_HTTP_TIMEOUT to avoid timeout during installation (#13850)
Signed-off-by: Yuan Tang <terrytangyuan@gmail.com>
2025-03-05 17:09:29 -08:00
|
Michael Goin
ca100c90fe
Add benchmark for DeepGEMM and vLLM Block FP8 Dense GEMM (#13917)
Signed-off-by: mgoin <mgoin64@gmail.com>
2025-03-05 17:08:51 -08:00
|
Russell Bryant
ffad94397d
[CI/Build] Use spawn multiprocessing mode for V1 test pipeline (#14243)
Signed-off-by: Russell Bryant <rbryant@redhat.com>
2025-03-05 17:08:02 -08:00
|
Lucas Wilkinson
4dacaa4a83
[BugFix] Fix prefix caching V0 MLA (#14255)
Signed-off-by: Lucas Wilkinson <lwilkinson@neuralmagic.com>
Co-authored-by: Ying Zhong <zhongyingmatrix@gmail.com>
2025-03-05 17:07:42 -08:00
|
Tyler Michael Smith
a7ea35aa67
[Bugfix] Remove num_tokens_across_dp (#14302)
Signed-off-by: Tyler Michael Smith <tyler@neuralmagic.com>
2025-03-05 23:55:55 +00:00
|
pyc96
1e3e76b6cc
[Bugfix] Fix DeepSeek MTP crash when using TP1ModelRunner with CUDA graph due to shape mismatch (#14237)
Signed-off-by: pyc96 <pychen96@gmail.com>
2025-03-05 22:22:40 +00:00
|
Lu Fang
53ea6ad830
[V1][Easy] Add empty allowed_token_ids in the v1 sampler test (#14308)
Signed-off-by: Lu Fang <lufang@fb.com>
2025-03-05 21:41:18 +00:00
|
Serena
1b7624bf5c
[misc] Add FlashMLA as a new option of VLLM_ATTENTION_BACKEND env (#14267)
2025-03-05 21:28:50 +00:00
|
Nick Hill
ac60dc7fe1
[V1][BugFix] Fix for mixed top_k batch (#14301)
Signed-off-by: Nick Hill <nhill@redhat.com>
Co-authored-by: Ye Cao <caoye.cao@alibaba-inc.com>
2025-03-05 20:43:04 +00:00
|
Vincent
a4f1ee35d6
Deprecate best_of Sampling Parameter in anticipation for vLLM V1 (#13997)
Signed-off-by: vincent-4 <vincentzhongy+githubvincent4@gmail.com>
Signed-off-by: Brayden Zhong <b8zhong@uwaterloo.ca>
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
Co-authored-by: Brayden Zhong <b8zhong@uwaterloo.ca>
Co-authored-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
2025-03-05 20:22:43 +00:00
|
Nick Hill
a32c8669ca
[V1][Minor] Remove obsolete FIXME comment (#14304)
Signed-off-by: Nick Hill <nhill@redhat.com>
2025-03-05 11:59:23 -08:00
|
Simon Mo
ca2ca8de57
[Docs] Add Meta Slides (#14297)
Signed-off-by: simon-mo <simon.mo@hey.com>
2025-03-05 08:30:23 -08:00
|
Isotr0py
f71b00a19e
[Bugfix] Fix broken vision language example (#14292)
Signed-off-by: Isotr0py <2037008807@qq.com>
2025-03-05 15:57:10 +00:00
|
DaividFrank
8f808cf86e
prefix_caching.md: Fixed typo (#14293)
Signed-off-by: Daivid Savernin-Frenk <daivid.frank@TurboNext.ai>
2025-03-05 15:43:13 +00:00
|
Jee Jee Li
7bab4bb048
[Misc] Add Qwen2MoeForCausalLM moe tuning support (#14276)
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com>
2025-03-05 23:11:29 +08:00
|
Isotr0py
e17e4488bd
[LoRA] Remove linear hack outside transformers backend (#14177)
Signed-off-by: Isotr0py <2037008807@qq.com>
2025-03-05 15:06:28 +00:00
|
Robert Shaw
257e200a25
[V1][Frontend] Add Testing For V1 Runtime Parameters (#14159)
Signed-off-by: rshaw@neuralmagic.com <rshaw@neuralmagic.com>
2025-03-05 14:18:55 +00:00
|
Zhe Zhang
47d4a7e004
Small update for external_launcher backend docs (#14288)
2025-03-05 21:30:00 +08:00
|
Cyrus Leung
7f89a594dd
[Doc] [3/N] Refer code examples for common cases in dev multimodal processor (#14278)
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
2025-03-05 12:29:50 +00:00
|
Iacopo Poli
961644e6a8
[Doc] Update nginx guide: remove privileged from vllm container run and add target GPU ID (#14217)
Signed-off-by: Iacopo Poli <iacopo@lighton.ai>
2025-03-05 11:44:10 +00:00