Commit Graph

  • 5ce2d10e4a Fix models which use layer_type_validation for Transformers v5 (#37398) Harry Mellor 2026-03-18 18:41:51 +00:00
  • 738d0a281f [Bugfix] Fix incorrect use of merge_size in Qwen3-VL video timestamp calculation (#37439) Chengyu Fang 2026-03-19 02:36:34 +08:00
  • 70b81c4f3d [bugfix][async scheduling] fix extra cuda context in device 0 with EP/DP (#37449) youkaichao 2026-03-19 02:32:30 +08:00
  • 7476d148db [Model] Remove unnecessary processor definition for Nemotron Parse (#37456) Cyrus Leung 2026-03-19 02:25:13 +08:00
  • f3732bd931 [Misc] Clean up model registry (#37457) Cyrus Leung 2026-03-19 02:24:44 +08:00
  • 0ef7f79054 [Perf] Add tuned triton moe config for Qwen3.5 H200, 9.9% E2E throughput improvement (#37340) Wentao Ye 2026-03-18 14:18:34 -04:00
  • 16c971dbc7 [CI] Fix PaddleOCR-VL HF test failure due to create_causal_mask API rename (#37328) Andreas Karatzas 2026-03-18 04:44:12 -05:00
  • 5dd8df0701 [kv_offload+HMA][2/N]: Support multiple KV groups in GPULoadStoreSpec (#36642) Or Ozeri 2026-03-18 19:26:40 +02:00
  • 39bfb57b7c Add API docs link if the CLI arg is a config class (#37432) Harry Mellor 2026-03-18 17:19:35 +00:00
  • c9d838fc33 Adding deterministic lora benchmarking to vLLM Bench (#36057) RonaldBXu 2026-03-18 09:02:03 -07:00
  • b1169d7be8 [Kernel] Add gpt-oss Router GEMM kernel (#37205) Xin Yang 2026-03-18 08:15:56 -07:00
  • 17808394bc standardize load_weights using AutoWeightsLoader for kimi_linear and minimax_text_01 (#37371) XLiu-2000 2026-03-18 23:05:37 +08:00
  • 296839a1b0 [Perf] Eliminate padding and slicing op for GPT-OSS with Flashinfer MXFP4 MXFP8 MoE (#30647) elvischenv 2026-03-18 23:01:26 +08:00
  • c373b5c00d [Log] Reduce duplicate log (#37313) Wentao Ye 2026-03-18 10:57:44 -04:00
  • de1a86b7de elastic_ep: Fix stateless group port races (#36330) Itay Alroy 2026-03-18 16:36:18 +02:00
  • 99267c23ca [2/3] Refactor InternVL-based processors (#37324) Cyrus Leung 2026-03-18 22:22:19 +08:00
  • 525f2eeb0b [kv_offload+HMA][6/N]: Split offloading_connector.py (#37405) Or Ozeri 2026-03-18 15:42:46 +02:00
  • 918b7890a1 [Bugfix] Fix base64 JPEG video frames returning empty metadata (#37301) Yufeng He 2026-03-18 21:40:03 +08:00
  • 98b09ddc27 [NIXL][Bugfix] metrics & testing minor bug (#36051) Andy Lo 2026-03-18 13:39:14 +00:00
  • cef1f302d2 [Model] Enable LoRA support for tower and connector in H2OVL (#31696) Shwetha Poojary 2026-03-18 18:56:47 +05:30
  • 17c47fb869 [Bugfix] Fix EP weight filter breaking EPLB and NVFP4 accuracy (#37322) Elvir Crnčević 2026-03-18 11:30:29 +01:00
  • b322b197f1 [Build] Bump python openai version (#32316) Chauncey 2026-03-18 18:20:10 +08:00
  • eaf7c9b976 [CI] Fix PaddleOCR-VL HF test failure due to create_causal_mask API rename (#37328) Andreas Karatzas 2026-03-18 04:44:12 -05:00
  • 47a1f11bff [docs] Add docs for new RL flows (#36188) Aaron Hao 2026-03-18 02:04:26 -07:00
  • 262ddd0d81 [cherry-pick][Bugfix] Fix EP weight filter breaking EPLB and NVFP4 accuracy #37322 (tag: v0.18.0rc1) khluu 2026-03-18 01:48:32 -07:00
  • e60c1674b3 [Bugfix] Avoid OpenMP thread reallocation in CPU torch compile (#37391) Li, Jiang 2026-03-18 15:51:39 +08:00
  • faa80947f5 [Performance] Add --enable-ep-weight-filter CLI option (#37351) Roy Wang 2026-03-18 09:36:55 +08:00
  • eeabf740bb [Custom Ops] Add functional + out variant for scaled_fp4_quant (#34389) Terry Gao 2026-03-16 15:51:46 -07:00
  • cdcffafef8 Fix eplb nvfp4 experts hook (#37217) Elvir Crnčević 2026-03-16 23:03:54 +01:00
  • fad09e8a1f fix(glm47): improve tool call parsing and content normalization (#37386) Karan Bansal 2026-03-18 13:42:21 +05:30
  • 8c31f47c63 [LoRA] Make LoRA respect language_model_only (#37375) Jee Jee Li 2026-03-18 15:53:34 +08:00
  • 261801242f [Bugfix] Avoid OpenMP thread reallocation in CPU torch compile (#37391) Li, Jiang 2026-03-18 15:51:39 +08:00
  • fcf0687b27 [kv_offload+HMA][0/N]: Support block-level preemption handling (#34805) Or Ozeri 2026-03-18 08:49:53 +02:00
  • 86b7e3c95a [XPU] skip unsupported ut and update test_nixl_connector (#37179) liuzhenwei 2026-03-18 13:32:59 +08:00
  • 0e95916155 [responsesAPI] parser.extract_response_outputs can take in token IDs (#37130) Andrew Xia 2026-03-17 22:31:31 -07:00
  • ce2ef42fd3 [CI] Stabilize test_cpu_offloading by waiting for async offload before cache reset (#37335) Andreas Karatzas 2026-03-18 00:26:20 -05:00
  • 8b6325758c [ROCm][CI] Add ROCM_EXTRA_ARGS to audio_in_video test server fixture (#37349) Andreas Karatzas 2026-03-17 23:55:40 -05:00
  • a0dd1995c7 [Hardware][TPU] Add supports_async_scheduling() method to Executor interface so that it can be extended for Executor implementations. (#36924) gxd3 2026-03-17 21:53:28 -07:00
  • f1740006e4 [Perf] Enable dual stream execution of input projection for Qwen3 (#36795) Xin Yang 2026-03-17 20:13:27 -07:00
  • 58cde5c026 [ROCm][CI] Skip trtllm kvfp8 dequant tests on ROCm (#37330) Andreas Karatzas 2026-03-17 22:12:26 -05:00
  • 761e0aa7a0 [Performance] Add --enable-ep-weight-filter CLI option (#37351) Roy Wang 2026-03-18 09:36:55 +08:00
  • ff9fbc9aff [Kernel][Helion] [16/N] Refactor register_kernel API to be more Dynamo-friendly (#36705) Yanan Cao 2026-03-17 18:23:35 -07:00
  • e6c4797704 [ROCm][Quantization] add fp8xfp8 attn support for rocm_aiter_unified_attn (#36927) Divakar Verma 2026-03-17 20:49:32 -04:00
  • 09e4576f65 [Kernel] Add non-gated support for NVFP4 CUTLASS MoE (#37320) Michael Goin 2026-03-17 23:12:04 +01:00
  • 3ed7b1e6e0 [ROCm] Validate block_size for explicitly selected attention backends (#36846) Andreas Karatzas 2026-03-17 17:04:40 -05:00
  • e8f9dbc369 [Bugfix][ROCm] Fix worker startup OOM on ROCm by skipping unreliable cudagraph memory profiling (#36720) JartX 2026-03-17 22:55:34 +01:00
  • de35c06c66 Make KV connector metadata build overridable via plugin (#37336) Yong Hoon Shin 2026-03-17 14:29:06 -07:00
  • c0745a851a [Model] Add ColQwen3.5 4.5B support (#36887) Athrael Soju 2026-03-17 21:17:02 +00:00
  • b5ca9c3557 [Models] Cohere ASR (#35809) Ekagra Ranjan 2026-03-17 17:04:17 -04:00
  • 245758992e [Bugfix] Rescale NVFP4 weight scales to fix BF16 dequant underflow (#34577) Chao-Ju Chen 2026-03-18 04:48:42 +08:00
  • 1204cf0a9d [Bugfix] Fix mock.patch resolution failure for standalone_compile.FakeTensorMode on Python <= 3.10 (#37158) Dimitrios Bariamis 2026-03-17 21:13:06 +01:00
  • b36adfa349 [Perf] Set Flashinfer sparse MLA as default backend for FP8 kv cache (#37252) Wei Zhao 2026-03-17 16:09:20 -04:00
  • e78821b438 [Deprecation] Deprecate --calculate-kv-scales option (#37201) Michael Goin 2026-03-17 20:57:24 +01:00
  • 51f0acda79 [Model] Remove unused handle_oov_mm_token (#37321) Cyrus Leung 2026-03-18 03:44:52 +08:00
  • fa75204b16 bump compressed-tensors version to 0.14.0.1 (#36988) Brian Dellabetta 2026-03-17 15:36:19 -04:00
  • bdb903bb5f [Bug] Fix FlashInfer MNNVL socket collisions under concurrent vLLM jobs (#36674) Wentao Ye 2026-03-17 15:19:52 -04:00
  • 68f783a727 [Torch 2.11] Guard torch._C._cpu attribute checks for forward compatibility (#35673) Andrey Talman 2026-03-17 14:47:59 -04:00
  • c5030c439d [CI] Split Distributed Tests (4 GPUs) and Kernel MoE tests (#37100) Avinash Singh 2026-03-18 00:14:55 +05:30
  • 51b2333be1 [Perf] Optimize top-k search in apply_top_k_top_p_triton sampler (#37225) Michael Goin 2026-03-17 19:35:17 +01:00
  • 4ed51308c8 [CI] Fix GPU memory leak when RemoteOpenAIServer fails to start in __init__ (#37230) Andreas Karatzas 2026-03-17 11:08:08 -05:00
  • c781fbbab3 [Bugfix] Standardize custom HF Processor init (#37289) Cyrus Leung 2026-03-17 23:38:55 +08:00
  • 979ff44cea [BugFix] PyTorch Compilation Tests should error if any test fails (#37300) Richard Zou 2026-03-17 11:26:38 -04:00
  • f63ed7b5ac [Bugfix] Fix DP MTP Dummy Run (#35243) Benjamin Chislett 2026-03-17 11:16:48 -04:00
  • c9e5096256 [openapi] remove redundant exception stack trace[4/N] (#37157) Ning Xie 2026-03-17 23:06:25 +08:00
  • 2ff0ad9694 [UltraVox] Fix output type (#37224) Anton Vlasjuk 2026-03-17 15:51:17 +01:00
  • a836524d20 [Chore] Replace all base64 usages with faster pybase64 package (#37290) Isotr0py 2026-03-17 22:44:19 +08:00
  • 3717a4dd47 [Misc][LoRA] Add --lora-target-modules to restrict LoRA to specific modules (#34984) Bhoomit 2026-03-17 07:36:41 -07:00
  • ecfcdd2ce4 Fix Phi3 test that fails with Transformers v5 (#37298) Harry Mellor 2026-03-17 14:29:24 +00:00
  • c25dbc2d27 [Bugfix] Fix unclean shutdown crash with AllReduce Fusion workspace (#36955) Siew's Capital Jarvis 2026-03-17 22:22:09 +08:00
  • 77d2a5f17b pick up tuned prefill configs for FP8 FA3 (#36265) Jonas M. Kübler 2026-03-17 15:00:26 +01:00
  • 59192dfd39 [Frontend] Complete OpenAI render delegation (#37287) Sage 2026-03-17 15:53:55 +02:00
  • 56cb1baa66 [Misc] Use VLLMValidationError in batch, pooling, and tokenize protocol validators (#36256) Umut Polat 2026-03-17 16:52:30 +03:00
  • f340324335 [1/2] Move InternVL-based processors (#37260) Cyrus Leung 2026-03-17 21:50:56 +08:00
  • 2660b9289c Bugfix for offloading+prefetch for GLM-4.7-FP8 (#37178) sfbemerk 2026-03-17 14:22:09 +01:00
  • 293f036e6d Add gigachat 3.1 tool parser + fix gigachat3 tool parser (#36664) Viacheslav 2026-03-17 15:03:20 +03:00
  • 0fb142a454 [perf][connector] optimize build_connector_meta when host buffer transfer is not used (#37165) youkaichao 2026-03-17 19:59:35 +08:00
  • 00f8e0d211 [Frontend] Delegate tokenization serving preprocessing to OpenAIServingRender (#37266) Sage 2026-03-17 13:22:54 +02:00
  • 4af9ed21cb [Bugfix](xpu): prevent “selected index k out of range” in TP decode path (#37259) zhao, zhenhui 2026-03-17 19:14:07 +08:00
  • 9c7cab5ebb [Feature]: Support for multiple embedding types in a single inference call (#35829) Augusto Yao 2026-03-17 17:05:42 +08:00
  • 132bfd45b6 [Bugfix][ResponsesAPI] Fix crash when tool_choice=required exceeds max_output_tokens (#37258) Chauncey 2026-03-17 16:54:52 +08:00
  • 24b4272a8c Fix infinite recursive search issue in quark.py (#32779) xiao-llm 2026-03-17 03:19:15 -04:00
  • 8a680463fa [Bugfix] Fix NemotronH MTP + Chunked Prefill (#35447) Benjamin Chislett 2026-03-17 02:07:33 -04:00
  • 20b14095a4 [Bugfix] Fix loading Music Flamingo (#35535) Nick Cao 2026-03-17 01:24:40 -04:00
  • 17c1bdf371 [Bugfix] dtype mismatch in ngram gpu propose (#37246) PatchyTIS 2026-03-17 13:19:55 +08:00
  • 3e3d320c1b [Refactor] Relocate responses API tests (#37241) Flora Feng 2026-03-17 01:14:52 -04:00
  • 4d22667c32 [Feature][Frontend] add support for Cohere Embed v2 API (#37074) Walter Beller-Morales 2026-03-16 19:55:53 -04:00
  • 1fe3932c8b [ROCm] Fix AttributeError for torch.compiler.skip_all_guards_unsafe on older PyTorch (#37219) Andreas Karatzas 2026-03-16 22:34:49 -05:00
  • 54a62a79f7 [ROCm] Fix AttributeError for torch.compiler.skip_all_guards_unsafe on older PyTorch (#37219) (tag: v0.17.2rc0) Andreas Karatzas 2026-03-16 22:34:49 -05:00
  • 384dc7f77b [Refactor] Relocate completion and chat completion tests (#37125) Flora Feng 2026-03-16 23:31:23 -04:00
  • f04d5226f8 [CI] Fix flaky tool_use chat completion tests with deterministic seed (#37027) Flora Feng 2026-03-16 23:24:34 -04:00
  • 0a0a1a198b Add ability to replace oot ops when using lora (#37181) Kyuyeun Kim 2026-03-16 18:04:15 -07:00
  • 6c1cfbad32 Support non-contiguous KV cache in TRTLLM fp8 dequant kernel (#36867) Vadim Gimpelson 2026-03-17 04:48:42 +04:00
  • 45f526d652 [BugFix] Correct max memory usage for multiple KV-cache groups (#36030) Harry Huang 2026-03-17 08:38:52 +08:00
  • 5db91f0aaf Fix some Mistral parser issues (#37209) Julien Denize 2026-03-17 01:08:56 +01:00
  • 061980c36a [Feature][Frontend] add support for Cohere Embed v2 API (#37074) Walter Beller-Morales 2026-03-16 19:55:53 -04:00
  • 7a49742b88 [CI/Build] Add common tool call parser test suite (#27599) Ben Browning 2026-03-16 19:46:20 -04:00
  • 3e6a1e1686 [Custom Ops] Add functional + out variant for scaled_fp4_quant (#34389) Terry Gao 2026-03-16 15:51:46 -07:00
  • 7961486a9b Fix EagleMistralLarge3Model initialization (#37232) Julien Denize 2026-03-16 23:41:00 +01:00
  • 4f9b14c21c [CI] Stabilize multinode DP internal LB completion tests (#36356) Andreas Karatzas 2026-03-16 17:40:23 -05:00
  • 31a458c091 [Doc] Clarify schema enforcement behavior for tool_choice modes (#37064) Yuchen Fama 2026-03-16 18:27:42 -04:00