Commit Graph

  • a3a51d20e7 [Benchmark] Improvements to attention benchmark script (#37115) Wei Zhao 2026-03-16 18:22:40 -04:00
  • e5b807607c [Quant][Feature] Support online MXFP8 quantization for MoE and dense models (#35448) EdalatiAli 2026-03-16 18:07:39 -04:00
  • fd4d96302a Fix eplb nvfp4 experts hook (#37217) Elvir Crnčević 2026-03-16 23:03:54 +01:00
  • c0f011918d [Bugfix] opcheck false mutation error in rms_norm_per_block_quant (#36688) (#36779) Krish Gupta 2026-03-17 02:41:33 +05:30
  • e6ae4b1be1 [compile] Enable mega aot artifact for torch 2.12+. (#37198) Zhengxu Chen 2026-03-16 17:05:51 -04:00
  • 2dccb38f73 [Bugfix][MultiConnector] Fix MultiConnector for SupportsHMA sub-connectors (#36549) (tag: v0.18.0rc0) zhanqiuhu 2026-03-16 16:51:04 -04:00
  • d157216093 [BUGFIX][Mamba] Use uint64 for address in KVBlockZeroer (#37197) Kunshang Ji 2026-03-17 04:39:56 +08:00
  • 93f3c8e531 [Misc] Add float16 to CacheDType (#37199) Matthew Bonanni 2026-03-16 16:24:48 -04:00
  • 2cc26c3a99 [CI][BugFix][MORI][AMD] Add transfer_id to kv transfer params for test (#37213) rasmith 2026-03-16 15:22:57 -05:00
  • dfa8852db2 [Refactor] Consolidate GPT-OSS reasoning parser tests (#36915) Flora Feng 2026-03-16 15:53:07 -04:00
  • 714c6e0eab [torch.compile][BE] Modify cudagraph callable to check for is_forward_context_set (#36288) Lucas Kabela 2026-03-16 12:42:34 -07:00
  • 0fefd00e6c [Bugfix] Fix render server crash for quantized models on CPU-only hosts (#37215) Sage 2026-03-16 20:59:01 +02:00
  • f5c081d432 [PD][Nixl] Add support for hybrid SSM-FA models (#36687) Nicolò Lucchesi 2026-03-16 19:58:06 +01:00
  • c88ea8338b [MTP][Sparse MLA] Take advantage of native MTP support in indexer when possible (#36982) Matthew Bonanni 2026-03-16 13:51:21 -04:00
  • 9f9ecff4cd Add simple granite4 tool parser (#36827) Max de Bayser 2026-03-16 14:49:09 -03:00
  • ca1954d58c [Bugfix] Disable cross-layer KV cache for MLA attention backends (#37090) haosdent 2026-03-17 01:03:10 +08:00
  • 55e6d3d5c0 [Bugfix] Make siglip/clip compatible with transformers v5 (#37200) Raushan Turganbay 2026-03-16 17:48:18 +01:00
  • 6682c231fa [Bugfix] Add error handling for FINISHED_ERROR in OpenAIServing (#37148) Chauncey 2026-03-17 00:27:47 +08:00
  • 5ae685c1c8 [Bugfix] Relax TRTLLM KV cache contiguity assertion for cross-layer layout (#34158) Itay Etelis 2026-03-16 17:20:51 +02:00
  • ce8cf9161d [Compile] Fix compile warning st256_cs in cuda_vec_utils.cuh (#36693) Wentao Ye 2026-03-16 11:12:15 -04:00
  • 18be11fd59 [BUGFIX]fix CUDA OOM ERROR : invalid argument at cumem_allocator.cpp:119 (#35594) xjx 2026-03-16 23:10:42 +08:00
  • 8d8855fdae [Bugfix] Add safety check and fallback for null scaling factor (#36106) Yuanheng Zhao 2026-03-16 22:27:29 +08:00
  • e855d380fa [Compile] Fix compile warning in moe_permute (#36529) Wentao Ye 2026-03-16 10:16:14 -04:00
  • 0e5a9382af [Bugfix] accept redacted thinking blocks in Anthropic messages (#36992) Benjamin Bartels 2026-03-16 14:01:57 +00:00
  • 04bf5a35fa [Spec Decode] Update extract_hidden_states to use deferred kv_connector clear (#37013) Fynn Schmitt-Ulms 2026-03-16 09:53:45 -04:00
  • 43a73f853b Remove unused EVS functions in qwen3_vl.py (#37183) Tianyu Guo 2026-03-16 21:09:09 +08:00
  • ffbc2e5bdb Patch Mistral config (#37104) Julien Denize 2026-03-16 13:22:18 +01:00
  • f9e6db3034 [Models][Qwen3 ViT] Keep max_seqlen on CPU to prevent D2H sync (#37139) Lukas Geiger 2026-03-16 12:11:59 +00:00
  • d61d2b08e9 [Build] Fix API rate limit exceeded when using VLLM_USE_PRECOMPILED=1 (#36229) elvischenv 2026-03-16 20:09:27 +08:00
  • f5e59ee7a6 [Performance] Add prefetch for checkpoints to OS page cache (#36012) Artem Perevedentsev 2026-03-16 13:32:02 +02:00
  • 9b005edc48 [Docs] Make the link to hardware plugins clearer (#37174) Harry Mellor 2026-03-16 11:12:58 +00:00
  • bf9a185395 GLM4 tool parser: fix streaming mode (#35208) Robin Nabel 2026-03-16 11:48:52 +01:00
  • ad041c79db Fix text only inputs for MRoPE models with the Transformers modelling backend (#37055) Harry Mellor 2026-03-16 10:31:16 +00:00
  • 747b068136 [Hardware] Replace memory related torch.cuda APIs (#37031) Kunshang Ji 2026-03-16 18:24:48 +08:00
  • 122f75d939 Fix pipeline parallel with multimodal models with the Transformers modelling backend (#37057) Harry Mellor 2026-03-16 10:20:37 +00:00
  • d8f8a7aad2 [Misc] Sync pre-commit to 4.5.1 in workflows and docs (#36675) SoluMilken 2026-03-16 18:03:21 +08:00
  • 0115e957d4 [Frontend][Misc] Remove unused log in /is_sleeping (#37093) Roy Wang 2026-03-16 17:46:28 +08:00
  • 116ed130f4 [Bugfix] Fix GDN attention crash with mixed decode/spec-decode batches (#34871) haosdent 2026-03-16 17:30:23 +08:00
  • 8374387bd8 [FlashInfer] Revert block_size 16 + head_size 256 workaround on Blackwell (#36987) Vadim Gimpelson 2026-03-16 13:04:29 +04:00
  • 912fbe9555 [Bugfix] Fix Qwen2.5-Omni/Qwen3-Omni use_audio_in_video with multi-video inputs (#37147) Isotr0py 2026-03-16 16:56:06 +08:00
  • 52131f88d9 use skip_all_guards_unsafe to drop global_state and torch_function_mode_stack guards instead of previous hacks (#36204) Laith Sakka 2026-03-16 01:52:31 -07:00
  • 821eb80c0d [Performance][Model Loader] Skip non-local expert weights during EP model loading (#37136) Roy Wang 2026-03-16 16:33:36 +08:00
  • a2956a0f8e [ROCm][CI] Retrying in case of batch variance effects and reducing flakiness (#36442) Andreas Karatzas 2026-03-16 03:08:51 -05:00
  • 911355e216 [ROCm] Fix KV copy methods and auto-select attention backend for ROCm (#36845) Andreas Karatzas 2026-03-16 03:07:27 -05:00
  • 8d3f8f485e [Bugfix] fix Qwen3.5 tool calling bug (#36774) Chauncey 2026-03-16 15:38:42 +08:00
  • 96efb91480 [Model Runner V2] Fix processed logits in sample() (#37144) Woosuk Kwon 2026-03-16 00:35:49 -07:00
  • 2754231ba3 [Kernel] Add FlashInfer MoE A2A Kernel (#36022) leo-cf-tian 2026-03-16 02:45:32 -04:00
  • 2390d44209 [Model] Add HyperCLOVAX-SEED-Think-14B language model support (#37107) bigshanedogg 2026-03-16 15:40:05 +09:00
  • 7362b4450a [Bugfix] Avoid LD_PRELOAD check on MacOS (#37145) Li, Jiang 2026-03-16 14:31:44 +08:00
  • 57a314d155 [CI][Bugfix] Fix 500 errors from priority overflow and TemplateError subclasses in schema fuzz tests (#37127) Andreas Karatzas 2026-03-16 00:27:21 -05:00
  • d4c57863f7 [ROCm][CI] Fix engine teardown and text normalization to stabilize voxtral test (#37138) Andreas Karatzas 2026-03-15 23:49:31 -05:00
  • 68e1b711f1 [XPU] Add deepseek_scaling_rope fused kernel (#36612) Wang, Yiting 2026-03-16 12:35:08 +08:00
  • 0024f39a32 [ROCm][P/D][MORI][BugFix] Add transfer_id for moriio_connector so moriio_connector to restore P/D functionality (#34907) rasmith 2026-03-15 21:36:51 -05:00
  • e9163b536e [responsesAPI][ez] add a unit test for SimpleContext logprobs (#37126) Andrew Xia 2026-03-15 17:12:26 -07:00
  • 7acaea634c In-Tree AMD Zen CPU Backend via zentorch [1/N] (#35970) Lalithnarayan C 2026-03-16 05:05:35 +05:30
  • 697e4ff352 [GDN] add a config for gdn kernel selection (#36647) Jiangyun Zhu 2026-03-16 00:40:17 +08:00
  • a3e2e250f0 [Feature] Add Azure Blob Storage support for RunAI Model Streamer (#34614) Hari 2026-03-15 17:08:21 +05:30
  • 143e4dccdf [Misc] Add online audio_in_video test (#36775) Isotr0py 2026-03-15 15:14:11 +08:00
  • 6590a3ecda [Frontend] Remove torchcodec from audio dependency (#37061) Isotr0py 2026-03-15 13:15:59 +08:00
  • b3debb7e77 [Build] Upgrade xgrammar to get a security fix (#36168) Russell Bryant 2026-03-14 23:13:48 -04:00
  • 458c1a4b2d [Frontend] Reduce chat template warmup logging levels (#37062) Nick Hill 2026-03-14 13:48:59 -07:00
  • 821fde2df4 [Bugfix] Fix xgrammar dtype mismatch on macOS CPU inference (#32384) Karan Bansal 2026-03-14 22:59:06 +05:30
  • 8c29042bb9 [Feature] Add InstantTensor weight loader (#36139) arlo 2026-03-15 01:05:23 +08:00
  • 5467d137b3 [Frontend] Avoid startup error log for models without chat template (#37040) Cyrus Leung 2026-03-15 00:36:11 +08:00
  • 3ed46f374b [Model Runner V2] Add Support for XD-RoPE (#36817) Santino Ramos 2026-03-14 09:27:55 -07:00
  • 84868e4793 [Bugfix][Frontend] Fix audio transcription for MP4, M4A, and WebM formats (#35109) seanmamasde 2026-03-14 23:44:03 +08:00
  • a8e8d62dd8 [Misc] Clean up Kimi-audio whisper encoder loading (#36903) Isotr0py 2026-03-14 23:37:52 +08:00
  • e42b49bd69 Mistral common v10 (#36971) Julien Denize 2026-03-14 15:26:43 +01:00
  • 4a718e770d [Bug] Fix Failure in /v1/chat/completions/render for Multimodal Requests (https://github.com/vllm-project/vllm/issues/35665) (#35684) Sergey Zinchenko 2026-03-14 17:10:11 +03:00
  • 600a039f57 [CI] Shard Multi-Modal Models (Standard) into 4 parallel jobs (#37014) Kevin H. Luu 2026-03-14 01:26:54 -07:00
  • ffa5d74f15 Enable loading of fused expert weights in the Transformers modelling backend (#36997) Harry Mellor 2026-03-14 07:01:06 +00:00
  • 74fe80ee95 [CI] Split Distributed Tests (4 GPUs) into 3 parallel jobs (#37015) Kevin H. Luu 2026-03-13 21:21:13 -07:00
  • bcfdadb1bc [Refactor] Relocate chat completion and anthropic tests (#36919) Flora Feng 2026-03-14 00:16:16 -04:00
  • 236de72e49 [CI] Pin helion version (#37012) Yanan Cao 2026-03-13 20:25:29 -07:00
  • a116f96930 [V1] Remove pin_memory() in async_copy_to_gpu to fix sporadic stalls (#37006) sbeurnier 2026-03-14 02:37:32 +01:00
  • 092ace9e3a [UX] Improve UX of CPU backend (#36968) Li, Jiang 2026-03-14 09:27:29 +08:00
  • f680dc1b39 [responsesAPI] prioritize content over summary in reasoning item input (#36516) Andrew Xia 2026-03-13 18:20:30 -07:00
  • b41aa264f9 fix: resolve chat template names before kwargs detection (#36937) Giulio Leone 2026-03-14 01:20:16 +01:00
  • 367cf5cd3e [Feat][Bugfix] Enable additional dimension for Flashinfer MLA and fix routing dtype (#36931) Dimitrios Bariamis 2026-03-14 00:41:16 +01:00
  • 6d53efd2a5 [Bugfix] Fix MLA attention crash with AWQ/GPTQ quantized models (#34695) haosdent 2026-03-14 07:25:41 +08:00
  • 8b346309a5 [Refactor] Consolidate SupportsEagle (#36063) Benjamin Chislett 2026-03-13 19:22:40 -04:00
  • 54a6db827f [BugFix] Fix "DP Coordinator receives unexpected..." messages (#37008) Nick Hill 2026-03-13 16:18:05 -07:00
  • 9efc4db965 [Bugfix] Fix DeepSeek-V3.2 tokenizer stripping spaces (#37004) Matthew Bonanni 2026-03-13 18:55:36 -04:00
  • f1816fb192 [CI] Split V1 e2e + engine (1 GPU) into separate jobs (#36945) Kevin H. Luu 2026-03-13 14:16:02 -07:00
  • 0005d2a3c9 Use Transformers v5 WeightRenaming for Transformers modeling backend (#31545) Harry Mellor 2026-03-13 20:49:08 +00:00
  • d0b402974f [Bugfix][Spec Decode] Avoid double call of Ngram CPU (#36952) Ekagra Ranjan 2026-03-13 16:33:19 -04:00
  • 6341d43043 [ROCm][Quantization] add quark w4a8 mxfp4_fp8 for LinearLayer (#35316) Divakar Verma 2026-03-13 15:44:24 -04:00
  • 7afe0faab1 [Frontend][Core] Re-add shutdown timeout - allowing in-flight requests to finish (#36666) Mark McLoughlin 2026-03-13 19:10:06 +00:00
  • 5a3f1eb62f [Misc] Set default kv_buffer_device in a better way (#36862) Harry Mellor 2026-03-13 19:07:33 +00:00
  • b3ce711b93 Fp8 lora dense kernel (#35242) yugong333 2026-03-13 12:05:08 -07:00
  • abf61aaa8e [Bugfix] Fix Qwen2.5-omni/Qwen3-omni mm_processor cache for audio_in_video request (#36800) Isotr0py 2026-03-14 02:16:05 +08:00
  • 4508532fbd [Bugfix] fix paddleocr crash on some image shape (#36959) bigmoyan 2026-03-13 21:46:55 +08:00
  • d5af196c18 [2/N] Elastic EP Milestone 2: Integrating NIXL-EP (#35627) Itay Alroy 2026-03-13 15:25:33 +02:00
  • 82f836d976 [XPU] Support LoRA via torch.compile on XPU platform (#36962) Chaojun Zhang 2026-03-13 18:34:59 +08:00
  • 4fccd30f19 [ROCm][CI] Upgrading orchestrator to handle python pipeline markers and options (#36181) Andreas Karatzas 2026-03-13 04:04:22 -05:00
  • cfaf4668f7 [kv_offload+HMA][1/N]: Support multiple KV groups in OffloadingSpec (#36610) Or Ozeri 2026-03-13 10:04:21 +02:00
  • 99a57bdf74 [ROCm][CI] Corrected the GPT-OSS test root path (#36711) Andreas Karatzas 2026-03-13 02:53:43 -05:00
  • a2268617cf [Frontend] Delegate preprocessing to OpenAIServingRender (#36483) Sage 2026-03-13 09:39:43 +02:00
  • a4ad9db541 Enable RoPE+KV cache fusion for ROCm AITER FA (non-shuffle layout) (#35786) Rohan Potdar 2026-03-13 02:33:22 -05:00
  • b373b5102a [Tests] Shutdown test RemoteVLLMServer cleanly (#36950) Nick Hill 2026-03-13 00:32:55 -07:00