biondizzle/vllm - vllm - Gitea: Git with a cup of tea

Author	SHA1	Message	Date
biondizzle	c77342da87	KV cache: prefer CPU placement, zero via CPU not GPU. Two critical fixes for managed-memory KV cache allocation: (1) The preferred location is now CPU, not GPU. The KV cache is too large for HBM (50-100+ GiB), and setting the preferred location to GPU makes the driver try to migrate the entire allocation to HBM, causing OOM. With CPU as the preferred location, pages stay in LPDDR/EGM and page-fault to the GPU on demand during attention ops. (2) Memory is zeroed via CPU memset, not cudaMemset. cudaMemset runs on the device and forces ALL pages to migrate to the GPU before zeroing, which is exactly what we are trying to avoid; CPU memset keeps pages in LPDDR. Also added SetAccessedBy(GPU) so the GPU can access pages remotely over C2C NVLink without triggering page migration to the GPU.	2026-04-12 03:44:16 +00:00
biondizzle	7f35bc4158	Targeted KV cache managed memory allocation. Instead of swapping the global CUDA allocator (which broke cuBLAS), allocate the KV cache via cudaMallocManaged directly in _allocate_kv_cache_tensors(). Controlled by the VLLM_KV_CACHE_USE_MANAGED_MEMORY env var. Model weights and compute intermediates stay in HBM via the default cudaMalloc; only the KV cache spills into EGM/LPDDR.	2026-04-11 02:14:34 +00:00
biondizzle	487dd34e04	Selective prefetch: only prefetch allocations <2 GiB to GPU. Model weights (small tensors) must be in HBM for cuBLAS GEMM ops, which can't page-fault into managed memory. KV cache blocks are large and numerous; prefetching them all fills HBM and causes OOM. The 2 GiB threshold separates compute data from cache data.	2026-04-10 14:58:57 +00:00
biondizzle	a15f86ecfa	Remove cudaMemPrefetchAsync from managed allocator. Eager prefetching was filling HBM+EGM, causing subsequent cudaMallocManaged calls to fail after model loading. On GH200 with EGM, pages should migrate on demand via hardware page faults over C2C NVLink. The cudaMemAdviseSetPreferredLocation(GPU) hint is sufficient to prefer GPU placement with LPDDR fallback.	2026-04-10 05:58:11 +00:00
Michael	2a69949bda	[Bugfix]: Fix Gemma4ToolParser.__init__() missing `tools` parameter (#38847) Signed-off-by: Michael Hospedales <hospedales@me.com> (cherry picked from commit `bb39382b2b`) (tag: v0.19.0)	2026-04-02 16:45:38 -07:00
Luciano Martins	8adcf8c40a	feat(models): implement Google Gemma 4 architecture support (MoE, Multimodal, Reasoning, Tool-Use) (#38826) Signed-off-by: Luciano Martins <lucianommartins@users.noreply.github.com> Signed-off-by: Luciano Martins <lucianomartins@google.com> Co-authored-by: Luciano Martins <lucianommartins@users.noreply.github.com> Co-authored-by: Isotr0py <2037008807@qq.com> (cherry picked from commit `08ed2b9688`)	2026-04-02 11:49:53 -07:00
khluu	cfad6a509c	Revert "[Bugfix] Restrict TRTLLM attention to SM100, fixing GB300 (SM103) hang (#38730)" This reverts commit `c284a6671c`.	2026-04-01 15:14:58 -07:00
Stefano Castagnetta	c284a6671c	[Bugfix] Restrict TRTLLM attention to SM100, fixing GB300 (SM103) hang (#38730) Signed-off-by: Stefano Castagnetta <scastagnetta@nvidia.com> (cherry picked from commit `6183cae1bd`) (tag: v0.19.0rc1)	2026-04-01 12:11:03 -07:00
Chauncey	3a30a1a6a8	[Misc] Rename think_start_str/think_end_str to reasoning_start_str/reasoning_end_str (#38242) Signed-off-by: chaunceyjiang <chaunceyjiang@gmail.com> (cherry picked from commit `cbe7d18096`)	2026-04-01 12:10:53 -07:00
Juan Pérez de Algaba	29982d48b3	(security) Enforce frame limit in VideoMediaIO (#38636) Signed-off-by: jperezde <jperezde@redhat.com> (cherry picked from commit `58ee614221`)	2026-04-01 12:10:40 -07:00
Yifan Qiao	1dbbafd3f3	[Feat][v1] Simple yet General CPU KV Cache Offloading (#37160) Signed-off-by: Yifan Qiao <yifanqiao@berkeley.edu> Signed-off-by: Yifan Qiao <yifanqiao@inferact.ai> (cherry picked from commit `91e4521f9f`) (tag: v0.19.0rc0)	2026-04-01 01:03:14 -07:00
Lucas Wilkinson	0ee3b7fc3d	[Bugfix][MLA] Add logits size budget to sparse indexer prefill chunking (#36178) Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com> (cherry picked from commit `eb47454987`)	2026-04-01 01:02:58 -07:00
Matthew Bonanni	268bed9cf3	[Bugfix][Async] Fix async spec decoding with hybrid models (#38556) Signed-off-by: SandishKumarHN <sandishkumarhn@gmail.com> Signed-off-by: Matthew Bonanni <mbonanni@redhat.com> Co-authored-by: SandishKumarHN <sandishkumarhn@gmail.com> (cherry picked from commit `757068dc65`)	2026-04-01 01:02:35 -07:00
Jiangyun Zhu	bcc0fdd0f3	[CI] fix LM Eval Qwen3.5 Models (B200) (#38632) Signed-off-by: zjy0516 <riverclouds.zhu@qq.com> (cherry picked from commit `ea7bfde6e4`)	2026-04-01 01:02:20 -07:00
wang.yuqi	69b8bd4b33	[CI Failure] pin colmodernvbert revision (#38612) Signed-off-by: wang.yuqi <yuqi.wang@daocloud.io> Signed-off-by: wang.yuqi <noooop@126.com> Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com> (cherry picked from commit `719735d6c5`)	2026-04-01 01:02:04 -07:00
Li, Jiang	12449f9492	[Bugfix][CPU] Skip set_num_threads after thread binding (#38535) Signed-off-by: jiang1.li <jiang1.li@intel.com> (cherry picked from commit `6557f4937f`)	2026-03-30 23:01:42 -07:00
haosdent	b92312dfd7	[CI] Fix SPLADE pooler test broken by #38139 (#38495) Signed-off-by: haosdent <haosdent@gmail.com> (cherry picked from commit `a08b7733fd`)	2026-03-30 21:52:13 -07:00
Jaewon	d816834c1a	[MoE] Add RoutingMethodType.Simulated to TRT-LLM FP8/NVFP4 kernel allowlists (#38329) Signed-off-by: Jaewon Lee <jaewon@meta.com>	2026-03-29 22:53:43 -07:00
Roger Wang	92f0db57a8	[Misc] Always use `forward_mulmat` for `Conv3d` on newer versions of torch. (#38487)	2026-03-30 05:39:41 +00:00
Andreas Karatzas	bea23536f6	[CI] Add temperature=0.0, reduce max_tokens, and add debug prints to audio_in_video tests (#38492) Signed-off-by: Andreas Karatzas <akaratza@amd.com>	2026-03-30 05:36:45 +00:00
Jiangyun Zhu	c133f33746	Add @ZJY0516 to CODEOWNERS (#38497) Signed-off-by: zjy0516 <riverclouds.zhu@qq.com>	2026-03-29 21:10:00 -07:00
Stanislav Kirillov	a6db99ba02	[Bugfix] Support multi-type params parsing for DeepSeek v3.2 (#33703) Signed-off-by: Stanislav Kirillov <stas@nebius.com> Co-authored-by: Stanislav Kirillov <stas@nebius.com> Co-authored-by: Chauncey <chaunceyjiang@gmail.com>	2026-03-30 04:07:28 +00:00
Andreas Karatzas	4f2ed5fddb	[ROCm][CI] Enable hybrid chunked prefill test (#38317) Signed-off-by: Andreas Karatzas <akaratza@amd.com>	2026-03-30 10:30:26 +08:00
Kyle Sayers	d28d86e8a3	[QeRL] Fix online quantized reloading (#38442) Signed-off-by: Kyle Sayers <kylesayrs@gmail.com>	2026-03-29 14:56:41 -06:00
Wentao Ye	995dea1354	[Perf] Remove redundant device copies for CPU-only pooling token IDs, 48.9% E2E throughput improvement (#38139) Signed-off-by: yewentao256 <zhyanwentao@126.com>	2026-03-29 18:12:50 +00:00
allgather	8c0b6267d7	[Transformers v5] fix missing pixtral/voxtral multimodal dispatch (#38410) Signed-off-by: allgather <all2allops@gmail.com>	2026-03-29 09:59:06 +00:00
Andreas Karatzas	43cc5138e5	[ROCm][CI] Fix cross-attention dispatch for encoder-decoder models (#38450) Signed-off-by: Andreas Karatzas <akaratza@amd.com>	2026-03-28 22:08:03 -07:00
Shubhra Pandit	5b8c30d62b	[Spec Decode, BugFix] Propagate norm_before_fc from Eagle3 speculator (#38111) Signed-off-by: Shubhra Pandit <shubhra.pandit@gmail.com>	2026-03-29 00:42:06 +00:00
haosdent	d39b8daf5f	[Feature] Add Qwen3-ForcedAligner support via token classification pooling (#35367) Signed-off-by: haosdent <haosdent@gmail.com>	2026-03-29 00:27:52 +00:00
Walter Beller-Morales	fafca38adc	[BugFix][Frontend] apply task instruction as system prompt in cohere v2/embed (#38362) Signed-off-by: walterbm <walter.beller.morales@gmail.com>	2026-03-28 18:30:54 +00:00
Kunshang Ji	aa4eb0db78	[CI]revert initialize_model context manager (#38426) Signed-off-by: Kunshang Ji <jikunshang95@gmail.com> Co-authored-by: wang.yuqi <yuqi.wang@daocloud.io>	2026-03-28 16:56:50 +00:00
Andreas Karatzas	af89140efc	[ROCm][CI] Fix UV install in Dockerfile.rocm to detect curl failures and retry (#38415) Signed-off-by: Andreas Karatzas <akaratza@amd.com>	2026-03-29 00:47:42 +08:00
haosdent	b2bc736b12	[CI] Fix Ernie4.5-VL initialization test (#38429) Signed-off-by: haosdent <haosdent@gmail.com>	2026-03-28 22:43:24 +08:00
whyiug	58c959a767	[Misc]: clean up non-core lint issues (#37049) Signed-off-by: whyiug <whyiug@hotmail.com>	2026-03-28 10:28:16 -04:00
Bvicii	bda3eda82d	[Bugfix] Disallow renderer_num_workers > 1 with mm processor cache (#38418) Signed-off-by: Bvicii <yizhanhuang2002@gmail.com>	2026-03-28 06:32:52 -07:00
Michael Goin	2bf5b70ae8	[CI Bugfix] Pre-download missing FlashInfer headers in Docker build (#38391) Signed-off-by: mgoin <mgoin64@gmail.com>	2026-03-28 06:09:00 -07:00
yzong-rh	6dad4c5722	[Test] Fix flaky race condition in test_abort_final_step (#38414) Signed-off-by: Yifan <yzong@redhat.com>	2026-03-28 09:06:56 +00:00
Liwen	171775f306	Fix Device Index for ROCm Ray Workers in MoE Benchmark (#38108) Signed-off-by: Liwen <53441624+li-liwen@users.noreply.github.com> Co-authored-by: Kunshang Ji <kunshang.ji@intel.com>	2026-03-28 08:27:11 +00:00
TJian	58a249bc61	[ROCm] [Release] Update ROCm variant from rocm700 to rocm721 (#38413) Signed-off-by: tjtanaa <tunjian.tan@embeddedllm.com>	2026-03-28 06:07:03 +00:00
IriKa	148a5c1226	[Bugfix]fix output Nan/Inf in marlin if dtype=float16 (#33972) Signed-off-by: IriKa Qiu <qiujie.jq@gmail.com>	2026-03-27 16:36:08 -07:00
Wei Zhao	b69bf2f0b1	[Perf] Use torch compile to fuse pack topk in trtllm moe (#37695) Signed-off-by: wzhao18 <wzhao18.sz@gmail.com> Signed-off-by: Wei Zhao <51183510+wzhao18@users.noreply.github.com>	2026-03-27 17:30:46 -06:00
rongfu.leng	88149b635e	Add nvidia h800 moe config (#31201) Signed-off-by: rongfu.leng <rongfu.leng@daocloud.io>	2026-03-27 16:28:48 -07:00
Hongxia Yang	83a4df049d	[ROCm][Documentation] update quickstart and installation to include rocm nightly docker tips (#38367) Signed-off-by: Hongxia Yang <hongxiay.yang@amd.com> Co-authored-by: Hongxia Yang <hongxiay.yang@amd.com>	2026-03-27 23:20:19 +00:00
Gregory Shtrasberg	731285c939	[ROCm][CI/Build] ROCm 7.2.1 release version; torch 2.10; triton 3.6 (#38252) Signed-off-by: Gregory Shtrasberg <Gregory.Shtrasberg@amd.com>	2026-03-27 18:03:12 -05:00
Johnny	97d19197bc	[NVIDIA] Fix DGX Spark logic (#38126) Signed-off-by: johnnynunez <johnnynuca14@gmail.com> Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk> Signed-off-by: Nick Hill <nickhill123@gmail.com> Signed-off-by: Mark McLoughlin <markmc@redhat.com> Signed-off-by: Andreas Karatzas <akaratza@amd.com> Signed-off-by: Matthew Wong <Matthew.Wong2@amd.com> Signed-off-by: Sathish Sanjeevi <sathish.krishnan.p.s@gmail.com> Signed-off-by: guillaume_guy <guillaume.guy@airbnb.com> Signed-off-by: Guillaume Guy <guillaume.c.guy@gmail.com> Co-authored-by: Yongye Zhu <zyy1102000@gmail.com> Co-authored-by: Cyrus Leung <tlleungac@connect.ust.hk> Co-authored-by: Nick Hill <nickhill123@gmail.com> Co-authored-by: Mark McLoughlin <markmc@redhat.com> Co-authored-by: Andreas Karatzas <akaratza@amd.com> Co-authored-by: Woosuk Kwon <woosuk.kwon@berkeley.edu> Co-authored-by: Matthew Wong <Matthew.Wong2@amd.com> Co-authored-by: Sathish Sanjeevi <SKPsanjeevi@users.noreply.github.com> Co-authored-by: Guillaume Guy <guillaume.c.guy@gmail.com> Co-authored-by: guillaume_guy <guillaume.guy@airbnb.com> Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com>	2026-03-27 15:26:07 -07:00
Giancarlo Delfin	384e4d5f48	[Model Runner V2] Rebuild attention metadata before eagle decode full… (#38311) Signed-off-by: Giancarlo Delfin <gdelfin@inferact.ai>	2026-03-27 13:46:42 -07:00
Nicolò Lucchesi	44a6528028	[CI] Skip failing test (#38369) Signed-off-by: NickLucche <nlucches@redhat.com>	2026-03-27 13:25:19 -07:00
Kyle Sayers	648edcf729	[QeRL] Compose online quantization with quantized reloading (#38032) Signed-off-by: Kyle Sayers <kylesayrs@gmail.com>	2026-03-27 13:22:33 -07:00
Michael Goin	7ba425e916	Add short flag `-sc` for `--speculative-config` argument (#38380) Co-authored-by: Claude <noreply@anthropic.com>	2026-03-27 12:04:22 -07:00
Gregory Shtrasberg	b8665383df	[ROCm] Fix GPT-OSS import for triton 3.6 (#37453) Signed-off-by: Gregory Shtrasberg <Gregory.Shtrasberg@amd.com>	2026-03-27 18:00:57 +00:00
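
Note on the managed-memory commits (c77342da87, 7f35bc4158, 487dd34e04): the placement rules described in those messages can be summarized as a small decision table. The following is only a sketch of the policy as the commit messages state it, not code from the repository. The names PlacementPlan, managed_kv_cache_enabled, plan_allocation and should_prefetch_to_gpu are hypothetical, and the accepted truthy spellings of the env var are an assumption. Only the env var name, the 2 GiB threshold, and the CUDA call names come from the messages. The prefetch rule belongs to the earlier global-allocator approach that 7f35bc4158 replaced.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional

GiB = 1024 ** 3
# Threshold quoted in commit 487dd34e04: only allocations below 2 GiB are prefetched.
PREFETCH_THRESHOLD_BYTES = 2 * GiB


@dataclass(frozen=True)
class PlacementPlan:
    """Hypothetical summary of how one allocation is placed."""
    allocator: str                     # "cudaMalloc" or "cudaMallocManaged"
    preferred_location: Optional[str]  # "cpu" for the managed KV cache, else None
    gpu_accessed_by: bool              # SetAccessedBy(GPU) advice applied
    zero_on_cpu: bool                  # zero with CPU memset instead of cudaMemset


def managed_kv_cache_enabled(env: Optional[Mapping[str, str]] = None) -> bool:
    """Read VLLM_KV_CACHE_USE_MANAGED_MEMORY (name from commit 7f35bc4158).

    Which values count as "on" is an assumption made for this sketch.
    """
    env = os.environ if env is None else env
    value = env.get("VLLM_KV_CACHE_USE_MANAGED_MEMORY", "0")
    return value.strip().lower() in {"1", "true", "yes"}


def plan_allocation(is_kv_cache: bool,
                    env: Optional[Mapping[str, str]] = None) -> PlacementPlan:
    """Only the KV cache uses managed memory; everything else stays on cudaMalloc."""
    if is_kv_cache and managed_kv_cache_enabled(env):
        # Per c77342da87: prefer CPU, let the GPU access remotely, zero on the CPU.
        return PlacementPlan("cudaMallocManaged", "cpu", True, True)
    # Weights and compute intermediates: default device allocation in HBM.
    return PlacementPlan("cudaMalloc", None, False, False)


def should_prefetch_to_gpu(size_bytes: int) -> bool:
    """Selective prefetch rule from 487dd34e04 (strictly less than 2 GiB)."""
    return size_bytes < PREFETCH_THRESHOLD_BYTES
```

With the variable set to 1, plan_allocation(True) selects cudaMallocManaged with CPU as the preferred location, while plan_allocation(False) always stays on cudaMalloc regardless of the flag.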

15359 Commits