Commit Graph - vllm - Gitea: Git with a cup of tea

biondizzle/vllm

Commit Graph

9fa6c68fa6 [ROCm] Enabling encoder and encoder-decoder on ROCm and AITER unified backends (#35334) Gregory Shtrasberg 2026-02-27 15:32:55 -06:00
2ce6f3cf67 [Feat][RL][2/2] Native Weight Syncing API: IPC (#34171) Aaron Hao 2026-02-27 12:45:21 -08:00
1f3dbd95fd [Bugfix][Model] Fix gpt-oss batch invariance (#35404) Jakub Zakrzewski 2026-02-27 21:41:24 +01:00
1d532f9d8f [DP] Only use DP padding when cudagraphs are actually used (#34102) Lucas Wilkinson 2026-02-27 15:14:31 -05:00
234a65b781 [Bugfix] Add monkeypatch to prevent race condition from writing (#35420) Lucas Kabela 2026-02-27 11:51:36 -08:00
2decec9856 [Transformers backend] Ignore MTP weights when num_nextn_predict_layers=0 (#34888) SteadfastAsArt 2026-02-28 03:39:23 +08:00
29b35477b0 [compile] Fix caching error over pytree slice node. (#35308) Zhengxu Chen 2026-02-27 14:34:16 -05:00
b1d9f5372d [Model Runner V2] Warmup kernels (#35172) Nick Hill 2026-02-27 10:43:30 -08:00
fd6de37fca [BugFix] Fix 3D rope in transformers backend (#35097) Raushan Turganbay 2026-02-27 19:34:49 +01:00
c8aca0c9e1 Support parakeet as audio encoder for nemotron-nano-vl (#35100) Netanel Haber 2026-02-27 20:07:38 +02:00
b602e4f299 [Doc] Fix link to Llama chat template for usability (#35525) Martin Hickey 2026-02-27 17:51:09 +00:00
157722da75 [perf] Use pinned memory for async H2D transfer in do_mamba_copy_block (#35480) Huamin Li 2026-02-27 09:50:37 -08:00
1d897ff04f [Misc] Fill in some v1 CODEOWNERS gaps (#35524) Nick Hill 2026-02-27 09:34:37 -08:00
905d76b51d [Model] Add huggingface skt/A.X-K1 model (#32407) fort726 2026-02-28 02:26:02 +09:00
9098ce690c [Kernel] [Helion] [7/N] Use HOP to represent Helion Kernel call to enable fx tracing and pattern matching (#34390) Yanan Cao 2026-02-27 09:21:35 -08:00
876312f0b5 [Core] Fix gpu_worker.py pre-commit errors (#35312) Nick Hill 2026-02-27 07:54:24 -08:00
5de98abc12 Add @BoyuanFeng to CODEOWNERS (#35317) Boyuan Feng 2026-02-27 07:53:47 -08:00
9251ed5c4f [Bugfix] Handle case when kimi ends reasoning with a tool call (#33646) Koushik Dutta 2026-02-27 06:58:28 -08:00
e8249378e4 [Bugfix] Fix check_interleaved_audio_video false positive for batched non-interleaved requests (#35487) Yueqian Lin 2026-02-27 09:48:25 -05:00
6d4f9d3ad5 [Bugfix] Fix DCP + FA3 crash due to missing num_splits in _forward_with_dcp (#35082) haosdent 2026-02-27 22:27:06 +08:00
fbe3f0120a Revert "Add GlmOcrConfig for GLM-OCR model type recognition" (#35512) Harry Mellor 2026-02-27 14:13:27 +00:00
66c1751d13 [compile] Cleanup: Remove unnecessary +rms_norm forcing for sequence parallelism (#35410) Jason Li 2026-02-27 05:36:37 -08:00
6467b635b6 [Bugfix] Add missing activation attr to RMSNormGated (#35423) Tib 2026-02-27 13:53:35 +01:00
9c3fe9936b Flashinfer cuDNN backend for Qwen3 VL ViT attention (#34580) Max Hu 2026-02-27 20:20:23 +08:00
b66a74649e [Bugfix] Replace assert with ValueError for response_format validation in completions endpoint (#35456) Umut Polat 2026-02-27 11:01:06 +03:00
07bdabef03 [Bugfix] Use 'sum' reduction instead of 'avg' in Async TP reduce-scatter (#33088) Wang Xingran 2026-02-27 15:06:08 +08:00
a572baff5e [Model Performance] Add Qwen3MoE tuned MoE configs for H200 (#35457) Chengyi Nie 2026-02-26 21:51:14 -08:00
516cf26698 [Bug] correct out dtype of rms_norm_gated native path (#35369) zofia 2026-02-27 13:19:51 +08:00
487e5c51f7 [Bugfix] disable allreduce_rms_fusion by default when pp size > 1 (#35424) Jiangyun Zhu 2026-02-27 12:18:52 +08:00
1a8c71674e [BugFix] Repo utils debug print patch (#35434) Daniel Huang 2026-02-26 19:50:56 -08:00
062b789632 [Bug] Fix outdated links in source code (#35314) Wentao Ye 2026-02-26 22:50:46 -05:00
a532c83849 use 'max_active_experts' for moe lora input size (#33197) gnovack 2026-02-26 19:50:43 -08:00
1e5ad9b74f [Bugfix] Fix Qwen3NextForCausalLM packed_modules_mapping (#35413) Jee Jee Li 2026-02-27 11:46:30 +08:00
cabdaa7619 [Misc] Move GPUModelRunner.prepare_kernel_block_sizes to utils (#35400) Nicolò Lucchesi 2026-02-27 04:42:51 +01:00
06be53563b [Core]Extract is_last_rank in Ray for tpu to override (#33012) Chenyaaang 2026-02-26 19:18:52 -08:00
c29ee9c326 [compile] Invalidate cache for cpu flags (#35119) Angela Yi 2026-02-26 18:54:11 -08:00
d43048ce05 [Bugfix] Emit reasoning_part events in simple streaming path for Resp… (#35184) daniel-salib 2026-02-26 17:49:06 -08:00
4fec53cfcb [CI] Actually run tests/kernels/quantization/test_block_fp8.py in CI (#34274) Michael Goin 2026-02-26 19:58:03 -05:00
38c498b8e3 [Performance] Cublas Bf16 Gate with Fp32 Output (#35121) roikoren755 2026-02-27 02:51:28 +02:00
56a6371706 [Update] Use FlashInfer fast_decode_plan directly instead of replication (#34687) Andrii Skliar 2026-02-27 01:31:43 +01:00
6283021142 [Bugfix] Fix KV Scale loading for MLA Models (#35430) Pavani Majety 2026-02-26 15:38:19 -08:00
01923eec70 [ROCm][Quantization] GPT OSS Upstream MoE wmxfp4_afp8 with static scales (#30357) Aleksandr Malyshev 2026-02-26 14:50:16 -08:00
31fb6f43da [Kernel][perf] optimize NCCL symm_mem vs custom_AR selection thresholds (#33839) pkousha 2026-02-26 14:35:58 -08:00
eb19955c37 [WideEP] Remove pplx all2all backend (#33724) Tyler Michael Smith 2026-02-26 17:30:10 -05:00
0f2f24c8b2 [Bugfix] Fix MessageQueue connect_ip for cross-node data parallelism (#35429) Lucia Fang 2026-02-26 14:08:16 -08:00
d0105b84f0 add mixed precision support for modelopt (#35047) sychen52 2026-02-26 13:56:24 -08:00
832a780f3a Nemotron: use per-layer config in NemotronHMLPDecoderLayer for heterogeneous models (#35396) danielafrimi 2026-02-26 23:55:19 +02:00
98217b09f9 [Performance] Extract KV cache update op from flashinfer forward (#35422) ElizaWszola 2026-02-26 22:29:01 +01:00
967572dd5f fix(reasoning): Qwen3ReasoningParser returns truncated output as reasoning (#35230) 不做了睡大觉 2026-02-27 04:30:45 +08:00
3d66502e1b [Model Runner V2] Prepare attn metadata in ModelState [2/N] (#35383) Woosuk Kwon 2026-02-26 11:47:02 -08:00
c66aa48e99 [Model Runner V2] Add model states [1/N] (#35350) Woosuk Kwon 2026-02-26 11:20:35 -08:00
b6d5a17298 [Model Runner V2] Fix error-handling (#35063) Nick Hill 2026-02-26 11:00:19 -08:00
5e58bdc711 [Bugfix] Remove erroneous lower bound on LoRA vocab size constraint (#35354) Lucas Wilkinson 2026-02-26 13:44:50 -05:00
a1f53addb1 [BugFix] Align fused MoE-LoRA kernel config with actual weight shapes (#34396) Runkai Tao 2026-02-26 13:03:10 -05:00
05970c772c [Refactor] Remove dead code for attention benchmark script (#35418) Wentao Ye 2026-02-26 12:53:46 -05:00
d940607629 [Core] Support min_tokens with speculative decoding (#32642) Yiliu Dong 2026-02-27 01:31:28 +08:00
99c7892c5b [Perf] Optimize maxsim scores computation for pooling models, 13.9% E2E throughput improvement (#35330) Wentao Ye 2026-02-26 12:14:54 -05:00
ec8f943db1 Add GlmOcrConfig for GLM-OCR model type recognition (#34982) hujia177 2026-02-26 09:04:42 -08:00
f2ad952f40 [BugFix][kv_offload]: Fix kernel block size detection (#35125) Or Ozeri 2026-02-26 18:29:34 +02:00
9e2cabdf9c [ROCm] Update the torch version in rocm_build.txt to use the official 2.10 release (#34387) Sage Moore 2026-02-26 08:28:45 -08:00
ec8ab9d254 [ROCm] Add dynamic mxfp4 quantization for DeepSeek V2 projection layers (#34157) Douglas Lehr 2026-02-26 10:00:49 -06:00
05972ea7e5 [Refactor] Remove dead or duplicate func utils or variables (#35318) Wentao Ye 2026-02-26 10:57:56 -05:00
111d869069 [Model] Add nvidia/llama-nemotron-embed-vl-1b-v2 multimodal embedding model (#35297) Jakub Zakrzewski 2026-02-26 15:17:17 +01:00
7fea7250a4 [Bug] Fix missing <think> tag after tool call in MiniMax 2.1 (#35352) stingoChen 2026-02-26 22:11:07 +08:00
845ee348ef [Misc] Standardize handling of mm_processor_kwargs.size (#35284) Cyrus Leung 2026-02-26 21:05:46 +08:00
ec13e549d3 [Bugfix] Fix uint32 overflow in Mamba selective scan state pointer arithmetic (#35275) Asaf Gardin 2026-02-26 14:22:06 +02:00
c6ca51598a [Bugfix] fix device_name for routing replay (#34336) Li-Yongwen 2026-02-26 20:18:38 +08:00
c0615a296d [Bugfix] Fix Qwen2.5-Omni and Qwen3-Omni mixed-modality embed regression (#35368) Yueqian Lin 2026-02-26 06:58:23 -05:00
01914445b0 Remove bc-lint (#35274) Harry Mellor 2026-02-26 11:01:01 +00:00
5281713e11 [XPU] use fixed UMD version in dockerfile.xpu (#35392) Kunshang Ji 2026-02-26 18:54:55 +08:00
32693db8ce [Bugfix] [Qwen3.5]Fix Qwen3.5 FP8 quantization: tuple shard_id weight loading (#35289) HZY 2026-02-26 18:26:15 +08:00
e03ddcfbd4 [Hardware][Powerpc]Enable prefix caching and chunked prefill for ppc64le (#35081) Akash kaothalkar 2026-02-26 15:51:24 +05:30
02acd16861 [Benchmarks] Plot benchmark timeline and requests statistics (#35220) Sophie du Couédic 2026-02-26 11:17:43 +01:00
ab87f85231 [Model] Ring 2.5 (#35102) Jiangyun Zhu 2026-02-26 18:17:11 +08:00
3827c8c55a [Test] Add tests for n parameter in chat completions API (#35283) (tag: v0.16.1rc0) Krish Gupta 2026-02-26 14:44:07 +05:30
ade81f17fe [Bugfix][Hardware][AMD] Gate FP4 ops on gfx950 to prevent MI300X crash (#35250) Kevin McKay 2026-02-26 02:11:07 -06:00
6042e66cd5 [ROCm] Add extra step in config initialization to populate custom ops before compilation config init (#34848) Gregory Shtrasberg 2026-02-26 02:05:40 -06:00
9f9a675b23 [XPU][8/N] Fix kernel bugs in XPU LoRA and MOE LORA (#34115) Chaojun Zhang 2026-02-26 15:46:44 +08:00
a07c4c5939 [BugFix][XPU] Fix speculative decoding on Intel XPU due to bug with IGC_ForceOCLSIMDWidth=16 (#35298) Ofir Zafrir 2026-02-26 09:15:16 +02:00
d3a51da92a [Benchmark] Simplify SLA scan (#35306) Cyrus Leung 2026-02-26 14:35:41 +08:00
186ea22efe [Misc][Harmony] Move Responses API only harmony utils to responses/harmony.py (#35339) Flora Feng 2026-02-26 01:35:16 -05:00
4a9c07a0a2 [BugFix] anthropic/serving_messages: fix tool call arguments streaming (#34887) Daniele 2026-02-26 06:39:48 +01:00
9d37941017 [torch.compile] Sequence Parallelism threshold compile ranges (#28672) Jason Li 2026-02-25 21:00:12 -08:00
4171ff6dd9 [CPU][Feat] Enable KleidiAI INT8_W4A8 for all input dtypes (#34890) Fadi Arafeh 2026-02-26 05:00:10 +00:00
13025e71e8 [Model Runner V2] Add coding style guide (#35325) Woosuk Kwon 2026-02-25 20:42:40 -08:00
71dfce6aa6 [Kernel] Refactor FlashInfer allreduce for mnnvl backend (#34109) Hanjie Qiu 2026-02-25 19:17:20 -08:00
2aa4140402 openpangu-vl support video input (#34134) hujiaxin0 2026-02-26 11:08:09 +08:00
86c3b5a808 [BugFix] Fix fp4 quant kernel on CUDA 12.8 (#35210) Roberto L. Castro 2026-02-26 03:32:50 +01:00
160424a937 [Bugfix] Fix CUDA compatibility path setting for both datacenter and consumer NVIDIA GPUs (#33992) Seungmin Kim 2026-02-26 11:15:51 +09:00
9511a3f8ee [Bugfix] Fix AttributeError in SMControlContextManager (#35338) Lucas Wilkinson 2026-02-25 21:01:10 -05:00
de527e1cec [UX] Add --moe-backend arg for explicit kernel selection (#33807) Michael Goin 2026-02-25 20:44:44 -05:00
1976356ee6 [MoE Refactor] MXFP4 Cutlass Experts to MK (#34542) Yongye Zhu 2026-02-25 17:32:39 -08:00
cbf8f7028c [UX] Add --performance-mode {balanced,interactivity,throughput} (#34936) Michael Goin 2026-02-25 20:28:31 -05:00
6831650c40 [offloader] v2: Hide weight onloading latency via prefetching (#29941) Ming Yang 2026-02-25 17:20:59 -08:00
ed42507f6d [ROCm][CI] Amending deletion of AMD mirror (#35322) Andreas Karatzas 2026-02-25 16:17:56 -06:00
9571e99945 [ROCm][CI] Extending attention backend coverage for Eagle spec decode tests (#35265) Andreas Karatzas 2026-02-25 16:16:18 -06:00
c97234c08b fix(mxfp4): Disable monolithic path for TRITON backend with EP (#34270) Elizabeth Thomas 2026-02-25 15:33:42 -06:00
b188bab441 [CI][AMD][BugFix] Add torch.cuda.set_device to test_punica_ops so punica kernels execute on same device as tensor (#34985) rasmith 2026-02-25 13:18:00 -06:00
15d76f74e2 Revert "[Misc] Enable weights loading tracking for quantized models" (#35309) Lucas Wilkinson 2026-02-25 12:20:15 -05:00
8fd6975479 [ROCm][CI] Disable skinny GEMMs in multimodal tests to fix non-deterministic results (#35049) Andreas Karatzas 2026-02-25 10:48:37 -06:00
