biondizzle/vllm - vllm - Gitea: Git with a cup of tea

Author	SHA1	Message	Date
Wentao Ye	14561fabfd	[Perf] Optimize pooling model redundant copy, 1.8% throughput improvement (#35127 ) Signed-off-by: yewentao256 <zhyanwentao@126.com>	2026-02-24 04:13:11 -08:00
Zhengxu Chen	c77f3e1207	[compile] Save aot compile artifacts atomically. (#35117 ) Signed-off-by: zhxchen17 <zhxchen17@fb.com>	2026-02-24 04:11:01 -08:00
Dor Huri	012dee9233	[Feature] Add LoRA tower/connector support for Llama 4 Vision (mllama4) (#35147 ) Signed-off-by: dorhuri123 <dor.huri1@live.biu.ac.il> Co-authored-by: Jee Jee Li <pandaleefree@gmail.com>	2026-02-24 04:10:32 -08:00
Tugsbayasgalan Manlaibaatar	f1c664545b	Make voxtral compile friendly (#33959 ) Signed-off-by: Tugsbayasgalan Manlaibaatar <tmanlaibaatar@fb.com> Co-authored-by: Nicolò Lucchesi <nlucches@redhat.com>	2026-02-24 09:33:35 +01:00
Xin Yang	c870eb9e0f	[LoRA] Update LoRA expand kernel block_n calculation (#32621 ) Signed-off-by: Xin Yang <xyangx@amazon.com>	2026-02-23 23:17:53 -08:00
BadrBasowid	6af03f2394	[Refactor] [1/N] Reorganize kernel abstraction directory (#34055 ) Signed-off-by: BadrBasowid <badr.basowid@gmail.com> Co-authored-by: vllmellm <vllm.ellm@embeddedllm.com> Co-authored-by: TJian <tunjian.tan@embeddedllm.com>	2026-02-24 06:47:22 +00:00
Vlad Tiberiu Mihailescu	1a6cf39dec	[CI/Build] Remove redundant OpenTelemetry pip install from CI configs (#35032 ) Signed-off-by: Vlad Mihailescu <vtmihailescu@gmail.com>	2026-02-23 22:24:11 -08:00
Nicolò Lucchesi	f91808ae0d	[MM] Allow audio chunking for offline LLM (#34628 ) Signed-off-by: NickLucche <nlucches@redhat.com>	2026-02-23 21:04:28 -08:00
Vadim Gimpelson	33a0d43c71	[BUGFIX][Qwen3.5] Hardcode `mlp.gate` as not quantizable (#35156 ) Signed-off-by: Vadim Gimpelson <vadim.gimpelson@gmail.com>	2026-02-23 19:42:24 -08:00
pschlan-amd	80d93fd6da	gpu_model_runner: Cache is_encoder_decoder from model config (#35099 ) Signed-off-by: Patrick Schlangen <pschlan@amd.com>	2026-02-23 19:08:34 -08:00
Jia Guo	ec85340531	[Quantization] Support FP8 MoE bias for models like GPT-OSS (#34906 ) Signed-off-by: jasperjiaguo <jasperg662@gmail.com> Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-02-23 19:07:47 -08:00
Rohan Potdar	2ff4e51152	[ROCm] AITER fused RoPE+KVCache (#33443 ) Signed-off-by: Rohan138 <rohanpotdar138@gmail.com> Signed-off-by: charlifu <charlifu@amd.com> Signed-off-by: Rohan Potdar <66227218+Rohan138@users.noreply.github.com> Co-authored-by: charlifu <charlifu@amd.com> Co-authored-by: Luka Govedič <ProExpertProg@users.noreply.github.com> Co-authored-by: Douglas Lehr <91553416+dllehr-amd@users.noreply.github.com>	2026-02-23 19:06:00 -08:00
Asaf Gardin	95642441d0	[Mamba1] - Change supports_update_block_table to True (#35054 ) Signed-off-by: Josephasafg <ajgard7@gmail.com>	2026-02-23 19:05:57 -08:00
Xin Yang	a7c9f7b7ec	[Bugfix] Fix lora_ids in FusedMoE LoRA test (#35135 ) Signed-off-by: Xin Yang <xyangx@amazon.com>	2026-02-23 21:49:25 -05:00
Michael Goin	a4bd661fb3	[Perf] Enable FlashInfer DeepGEMM swapAB on SM90 by default (#34924 ) Signed-off-by: mgoin <mgoin64@gmail.com> Co-authored-by: Robert Shaw <114415538+robertgshaw2-redhat@users.noreply.github.com>	2026-02-23 17:34:41 -08:00
Michael Goin	3ef9fd0f98	[Bugfix] Fix DSV3 kernels breaking _C and _moe_C on unsupported arches (#35123 ) Signed-off-by: mgoin <mgoin64@gmail.com>	2026-02-23 17:11:27 -08:00
Michael Goin	22a97e6613	[Perf] Improve default triton fused moe configs (#34846 ) Signed-off-by: mgoin <mgoin64@gmail.com> Co-authored-by: Robert Shaw <114415538+robertgshaw2-redhat@users.noreply.github.com>	2026-02-23 16:01:28 -08:00
Aaron Hao	596ed1f02e	[RL] Validation for pause_mode='keep' (#34992 ) Signed-off-by: ahao-anyscale <ahao@anyscale.com>	2026-02-23 16:30:56 -05:00
Nicolò Lucchesi	b8d8b7e934	[Misc] Monitor interface changes (#35113 ) Signed-off-by: NickLucche <nlucches@redhat.com>	2026-02-23 17:14:51 +00:00
Harry Mellor	28c5e69ba0	Enforce that `model` is the first positional arg when `--served-model-name` is used (#34973 ) Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>	2026-02-23 08:38:05 -08:00
Harry Mellor	864167d376	Fix custom processors that use deleted import for Transformers v5 (#35101 ) Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>	2026-02-23 08:38:00 -08:00
haosdent	a2ba6a5244	[Bugfix] Fix prefix caching for Mamba 'all' mode (Nemotron models) (#34874 ) Signed-off-by: haosdent <haosdent@gmail.com>	2026-02-23 17:31:51 +01:00
Harry Mellor	c4f38696f7	Use Xet high performance mode for Transformers v5 (#35098 ) Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>	2026-02-23 08:19:30 -08:00
haosdent	a7f341c323	[Bugfix] Fix MRotaryEmbedding missing `truncate` attr with YaRN scaling (#35080 ) Signed-off-by: haosdent <haosdent@gmail.com>	2026-02-23 16:05:52 +00:00
Robert Shaw	d13ece38d7	[CI] Skip Responses API (#34990 ) Signed-off-by: Robert Shaw <robshaw@redhat.com> Co-authored-by: Robert Shaw <robshaw@redhat.com>	2026-02-23 07:46:45 -08:00
Mark McLoughlin	5cc7c4452e	[Metrics] Add Prometheus counters for Model FLOPs Utilization (MFU) (#30950 ) Export the existing Model FLOPs Utilization (MFU) metrics via Prometheus. `--enable-mfu-metrics` is required for these to be exposed. Co-authored-by: Harry Mellor <19981378+hmellor@users.noreply.github.com> Signed-off-by: Mark McLoughlin <markmc@redhat.com>	2026-02-23 15:01:07 +00:00
Eldar Kurtić	b95bb6927f	[kv-cache, ct] Use compressed-tensors as a source of ground-truth for quant strategies (#34254 ) Signed-off-by: Your Name <you@example.com> Co-authored-by: Your Name <you@example.com>	2026-02-23 07:37:55 -07:00
Cyrus Leung	392645454b	[Refactor] Decouple TimingContext from InputProcessingContext (#35083 ) Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>	2026-02-23 14:15:50 +00:00
Eldar Kurtić	1e8438a89a	[Llama4,CI] Bring back Llama-4 bug fixes, and also fix Maverick tests (#35033 ) Signed-off-by: Eldar Kurtic <you@example.com> Co-authored-by: Eldar Kurtic <you@example.com>	2026-02-23 09:04:34 -05:00
Robert Shaw	8435b2e049	[ModelBash][DSV3] Add TRTLLM DSV3 Router GEMM kernel (6% B1 Speedup) (#34302 ) Signed-off-by: Robert Shaw <robshaw@redhat.com> Co-authored-by: Robert Shaw <robshaw@redhat.com>	2026-02-23 14:02:26 +00:00
Yan Ma	b1b5e045df	[XPU] allow TORCH_SDPA/TRITON_ATTN as XPU vit Backend (#35010 ) Signed-off-by: Yan Ma <yan.ma@intel.com>	2026-02-23 05:06:44 -08:00
Andreas Karatzas	5f68464f92	[ROCm][CI] Fix spec decode profile assertion and logprob test determinism (#35043 ) Signed-off-by: Andreas Karatzas <akaratza@amd.com>	2026-02-23 05:05:54 -08:00
Vincent Gimenes	aa08a30fc9	[CLEANING] Remove unused disable_by_batch_size from SpeculativeConfig (#35060 ) Signed-off-by: Vincent Gimenes <vincent.gimenes@gmail.com>	2026-02-23 05:05:36 -08:00
Wentao Ye	7f40e9e516	[Refactor] Remove dead private func `_fp8_perm` and `_extract_mask_for_item` (#35068 ) Signed-off-by: yewentao256 <zhyanwentao@126.com>	2026-02-23 05:05:20 -08:00
Harry Mellor	103e614b14	Fix pipeline parallel with embed scaling in the Transformers modelling backend (#35094 ) Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>	2026-02-23 05:04:47 -08:00
Neil Schemenauer	54e2f83d0a	[Feature] Lazy import for the "mistral" tokenizer module. (#34651 ) Signed-off-by: Neil Schemenauer <nas@arctrix.com>	2026-02-23 00:43:01 -08:00
Gabe Goodhart	e631f8e78e	fix: Apply embedding_multiplier to inputs_embeds (#34813 ) Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com> Co-authored-by: Cyrus Leung <tlleungac@connect.ust.hk>	2026-02-23 00:42:46 -08:00
Martin Hickey	e97c46a92d	[BugFix]: Fix local mypy issues (#34739 ) Signed-off-by: Martin Hickey <martin.hickey@ie.ibm.com> Co-authored-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>	2026-02-23 00:40:29 -08:00
Jee Jee Li	7291d1b288	[Bugfix] Fix kernel benchmark (#33752 ) Signed-off-by: Jee Jee Li <pandaleefree@gmail.com>	2026-02-22 21:18:08 -08:00
Cyrus Leung	987506bca6	[Refactor] Simplify dummy data generation (#35025 ) Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>	2026-02-22 20:55:27 -08:00
Woosuk Kwon	c645e9a214	[Model Runner V2] Remove propose_draft method (#35070 ) Signed-off-by: Woosuk Kwon <woosuk@inferact.ai>	2026-02-22 18:27:12 -08:00
Nick Hill	944ffb5968	[Model Runner V2][Minor] Remove redundant `do_spec_decode` field (#35039 ) Signed-off-by: Nick Hill <nickhill123@gmail.com> Co-authored-by: Woosuk Kwon <woosuk@inferact.ai>	2026-02-22 16:18:04 -08:00
qizixi	2bcf71b9c0	[Spec Decode] Reduce TP communication for speculative decoding draft token generation (#34049 ) Signed-off-by: qizixi <qizixi@meta.com> Co-authored-by: Lu Fang <30275821+houseroad@users.noreply.github.com>	2026-02-22 14:59:16 -08:00
tacos8me	b7892a3bef	[Model] Add NVFP4 quantization support for Step3.5-Flash (#34478 ) Signed-off-by: tacos8me <ian@cloudhabit.com> Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com> Co-authored-by: Michael Goin <mgoin64@gmail.com>	2026-02-22 12:30:46 -07:00
Benjamin Chislett	682566b18e	[Bug] Refactor max_num_batched_tokens to account for drafting (#34898 ) Signed-off-by: Benjamin Chislett <bchislett@nvidia.com>	2026-02-22 11:18:46 -05:00
qizixi	b9c2a565cc	[Spec Decode] Defer clearing KV connector metadata for EAGLE3 speculative decode + prefill / decode disagg setup (#34529 ) Signed-off-by: qizixi <qizixi@meta.com> Co-authored-by: Lu Fang <30275821+houseroad@users.noreply.github.com>	2026-02-22 08:08:32 -08:00
Andreas Karatzas	dd8c3a7fb2	[ROCm][CI] Fix realtime test timeouts caused by aiter JIT compilation delays (#35052 ) Signed-off-by: Andreas Karatzas <akaratza@amd.com>	2026-02-22 10:07:18 +00:00
Andreas Karatzas	a8a47c17b6	[ROCm][CI] Fix flaky embedding chat test by using tolerance-based comparison (#35050 ) Signed-off-by: Andreas Karatzas <akaratza@amd.com>	2026-02-22 09:03:44 +00:00
Roger Wang	40f88d8318	[Bugfix] Fix Qwen3/Qwen3.5 Reasoning Parser (#34779 ) Signed-off-by: Roger Wang <hey@rogerw.io>	2026-02-21 23:15:35 -08:00
Woosuk Kwon	2cbf9656ce	[Model Runner V2] Enable CUDA graph for Eagle3 (#35040 ) Signed-off-by: Woosuk Kwon <woosuk@inferact.ai>	2026-02-21 21:42:50 -08:00

15309 Commits