Commit Graph

  • 01a08739e0 [misc] split engine_model into json file for nsys profile tool (#23117) Grace Ho 2025-08-19 00:44:53 -07:00
  • fda9537c5e [Model] Support Pipeline Parallelism for moonshotai/Kimi-VL-A3B-Thinking-2506 (#23114) Jiangyun Zhu 2025-08-19 14:24:31 +08:00
  • 90bbe0a5ad [Log] Warning Once for Cutlass MLA (#23137) Wentao Ye 2025-08-19 02:24:16 -04:00
  • e75f342261 Migrate InternVLImagePixelInputs (in nemotron_vl.py) to TensorSchema (#22023) Benji Beck 2025-08-18 22:48:26 -07:00
  • 78dba404ad [Hardware][IBM Z]Enable v1 for s390x and s390x dockerfile fixes (#22725) Nikhil Suryawanshi 2025-08-19 10:10:37 +05:30
  • e9d6a3db69 [TPU] make ptxla not imported when using tpu_commons (#23081) Chengji Yao 2025-08-18 20:46:42 -07:00
  • a4454e9401 chore: disable enable_cpp_symbolic_shape_guards (#23048) Xiao 2025-08-18 20:08:05 -07:00
  • 14006840ea [V0 Deprecation] Remove V0 FlashInfer attention backend (#22776) Woosuk Kwon 2025-08-18 19:54:16 -07:00
  • 6603288736 [CI][V0 Deprecation] Removed V0 Only Chunked Prefill and Prefix Caching Tests (#22871) Robert Shaw 2025-08-18 20:39:01 -04:00
  • 95e3095136 [Misc] Add @tdoublep as a maintainer of hybrid model and Triton-attention related code (#23122) Thomas Parnell 2025-08-19 02:31:38 +02:00
  • c9b38be8aa [Spec Decode] Make propose_draft_token_ids non-blocking for lower TTFT (#23041) Woosuk Kwon 2025-08-18 17:20:38 -07:00
  • 0dd3f4f5ab [Misc] Minor refactoring for prepare_inputs (#23116) Woosuk Kwon 2025-08-18 16:58:05 -07:00
  • 498259ccce Install tpu_info==0.4.0 to fix core dump for TPU (#23135) Xiang Xu 2025-08-18 16:23:33 -07:00
  • aab549870d Use Blackwell FlashInfer MXFP4 MoE by default if available (#23008) (tag: v0.10.1) Michael Goin 2025-08-18 18:25:49 -04:00
  • ba6928cf13 fix: OpenAI SDK compat (ResponseTextConfig) (#23126) Breno Baldas Skuk 2025-08-19 00:22:59 +02:00
  • befedf86a8 [CI Bugfix] Pin openai<1.100 to unblock CI (#23118) Michael Goin 2025-08-18 15:14:01 -04:00
  • 6d25e3fd6e Use Blackwell FlashInfer MXFP4 MoE by default if available (#23008) Michael Goin 2025-08-18 18:25:49 -04:00
  • ac6eb49de3 fix: OpenAI SDK compat (ResponseTextConfig) (#23126) Breno Baldas Skuk 2025-08-19 00:22:59 +02:00
  • bf756321c7 [CI Bugfix] Pin openai<1.100 to unblock CI (#23118) Michael Goin 2025-08-18 15:14:01 -04:00
  • 0e3bb543f0 [Bugfix] Support compile for Transformers multimodal (#23095) Raushan Turganbay 2025-08-18 15:35:48 +02:00
  • 569aefd134 chore: remove unnecessary patch_padding_side for the chatglm model (#23090) 杨朱 · Kiki 2025-08-18 20:32:13 +08:00
  • d3f71f1224 [Refactor] Get prompt updates earlier (#23097) Cyrus Leung 2025-08-18 20:31:53 +08:00
  • 5a30bd10d8 [Bugfix] fix IntermediateTensors equal method (#23027) Ning Xie 2025-08-18 17:58:11 +08:00
  • 27e8d1ea3e [Refactor] Define MultiModalKwargsItems separate from MultiModalKwargs (#23053) Cyrus Leung 2025-08-18 17:52:00 +08:00
  • 5c79b0d648 [XPU][CI]add xpu env vars in CI scripts (#22946) Kunshang Ji 2025-08-18 17:47:03 +08:00
  • 5f5664b3e4 [XPU] Fix compile size for xpu (#23069) Kunshang Ji 2025-08-18 15:04:08 +08:00
  • 89657a557c [Misc] Fix backward compatibility from #23030 (#23070) Roger Wang 2025-08-17 23:33:29 -07:00
  • 08d5f7113a [Misc] refactor function name (#23029) Ning Xie 2025-08-18 13:16:21 +08:00
  • b2fd0b81e0 [Bugfix][CI] Machete kernels: deterministic ordering for more cache hits (#23055) Andy Lo 2025-08-18 07:10:26 +02:00
  • 9f1c642254 [Bugfix] fix Qwen2.5-Omni processor output mapping (#23058) double7 2025-08-18 13:09:11 +08:00
  • 7be3a59d8e [Misc] enhance static type hint (#23059) Ning Xie 2025-08-18 13:09:08 +08:00
  • 8ea0c2753a [Misc] Minor code cleanup for _get_prompt_logprobs_dict (#23064) Woosuk Kwon 2025-08-17 18:16:03 -07:00
  • 0fc8fa751a fix: gptq marlin weight loading failure (#23066) (tag: v0.10.1rc1) Simon Mo 2025-08-17 15:56:07 -07:00
  • 21e39436c8 [XPU] fix xpu to set cudagraph batch sizes (#23044) Calvin Chen 2025-08-18 05:45:42 +08:00
  • 6d243efeda [Misc] Convert use_structured_output property into constant (#23060) Woosuk Kwon 2025-08-17 12:41:38 -07:00
  • c55bc1db26 [Misc] Remove dead return (#23061) Woosuk Kwon 2025-08-17 10:36:46 -07:00
  • 292084e72a [BugFix] Fix for IMA in FA3 varlen combine (#22967) Lucas Wilkinson 2025-08-17 11:52:04 -04:00
  • 16bff144be [Misc] fix typo in the multimodal doc (#23051) Kevinzz 2025-08-17 16:56:20 +08:00
  • fe0411fc6f [Bugfix] should use stack instead of concat (#22972) 947132885 2025-08-17 16:46:36 +08:00
  • 4d4061b6e7 [Kernel] Add cuda kernel for gpt_oss activation (#22951) Jee Jee Li 2025-08-17 13:03:24 +08:00
  • 87f48623a5 [Misc] method name typo fix (#23042) Ning Xie 2025-08-17 12:49:14 +08:00
  • 5c32143b9d [Refactor] Defer tensor data construction in MultiModalKwargs (#23030) Cyrus Leung 2025-08-17 12:05:50 +08:00
  • 94096a47c9 [UX] Separate marlin moe config logic from triton moe (#23006) Michael Goin 2025-08-16 22:16:42 -04:00
  • a258ad8bcc [Bugfix] fix qwen3 moe fp8 accuracy issue (#23031) Jinzhen Lin 2025-08-17 08:41:23 +08:00
  • bf7f470b22 [V1] Logits processors extensibility (#19912) afeldman-nm 2025-08-16 15:59:17 -04:00
  • 4fc722eca4 [Kernel/Quant] Remove AQLM (#22943) Michael Goin 2025-08-16 15:38:21 -04:00
  • 3253ae765e [Flaky CI] Increase timeout tolerance for test_mp_crash_detection+test_default_mm_lora_chat_completions (#23028) Michael Goin 2025-08-16 14:33:08 -04:00
  • 000cceca8c [Bugfix gpt-oss] Fix float32 convert for flashinfer sink support (#23016) Michael Goin 2025-08-16 14:16:00 -04:00
  • 68373d3126 [Frontend] Added support for HermesToolParser for models without special tokens (#16890) Woonggi Min 2025-08-17 02:38:42 +09:00
  • 52ce1420e9 Fix handling of max_num_batched_tokens for pooling tasks (#23004) Maximilien de Bayser 2025-08-16 14:36:30 -03:00
  • 829bbd7882 [New Model]mBART model (#22883) 汪志鹏 2025-08-16 20:16:58 +08:00
  • 4dff91c93d [Refactor] Allow optional MultiModalKwargsItem in IPC (#23022) Cyrus Leung 2025-08-16 19:30:49 +08:00
  • de9cb61763 Add docs for PrefixRepetitionDataset + enable usage with vllm bench throughput (#23012) Seiji Eicher 2025-08-16 03:21:20 -07:00
  • 2dbccce8a6 [CI][Bugfix] Skip Ovis2 generation test because of broken remote code (#22954) Isotr0py 2025-08-16 17:44:19 +08:00
  • 933f45334a [Core] Make cudagraph check cuda platform only (#23005) Chengji Yao 2025-08-16 00:46:00 -07:00
  • cc826a202b [Multimodal] Update Tensor schema test to cover arbitrary shape mm inputs (#22867) Isotr0py 2025-08-16 15:44:50 +08:00
  • 6d3da472bc [Misc] Add --save-dir option to benchmark_moe (#23020) Jee Jee Li 2025-08-16 15:26:10 +08:00
  • 78863f8c5c [BugFix] Add support for loading prompt embeds tensors serialized on unavailable devices and sparse tensors (#22962) Andrew Sansom 2025-08-16 01:25:10 -05:00
  • 5157827cfc [Build] Env var to disable sccache (#22968) Lucas Wilkinson 2025-08-16 01:36:27 -04:00
  • 7caec10e7b [XPU]avoid circular import during XPU init (#23017) Kunshang Ji 2025-08-16 13:16:34 +08:00
  • 1f83e7d849 [misc] nsys profile output kernel classifier and visualizer (#22971) Grace Ho 2025-08-15 19:52:51 -07:00
  • e4e37ded56 [V1] support min_tokens for detokener (#22014) Calvin Chen 2025-08-16 10:28:10 +08:00
  • f6b5040590 [Frontend] Avoid list copies in serving_chat.py (#22947) Nick Hill 2025-08-15 19:06:30 -07:00
  • fbd88728b3 [Bugfix] Fix DeepSeek MTP (#22934) Benjamin Chislett 2025-08-15 21:25:06 -04:00
  • 070da660c1 [Kernel] Simplify get_kv_cache_layout and cache use_trtllm_attention env-dependent bit (#22735) Nicolò Lucchesi 2025-08-16 02:14:08 +02:00
  • ad0297d113 [Misc] Support passing multiple request ids at once to AsyncLLM.abort() (#22944) Nick Hill 2025-08-15 17:00:36 -07:00
  • 236b864e4f [BugFix] Make run_once thread-safe (#22978) Yichen Yan 2025-08-16 07:56:17 +08:00
  • 3e2f7985a2 Support multiple attention groups for KV sharing (#22672) Yong Hoon Shin 2025-08-15 16:54:10 -07:00
  • c280066f9d [v1] Move block_hashes from KVCacheManager to Request.block_hashes (#19728) Or Ozeri 2025-08-16 02:52:52 +03:00
  • b9dc9d2607 [BugFix] Handle case where async utility call is cancelled (#22996) Nick Hill 2025-08-15 16:38:42 -07:00
  • 1fc375dc05 [Structured Outputs] [Bug] Fix misalignment in apply_grammar_bitmask causing unintended masking and NaN logits (#22963) rishitdholakia13 2025-08-15 17:25:05 -06:00
  • 76144adf76 ci: Add CUDA + arm64 release builds (#21201) Eli Uriegas 2025-08-15 16:16:23 -07:00
  • f5d412bafb [BugFix] Fix regression caused by mamba state dtype PR (#22998) Thomas Parnell 2025-08-16 00:55:26 +02:00
  • 177e55e3bd [Attention] FA3 Attention Sinks Perf Boost (#22478) Lucas Wilkinson 2025-08-15 17:41:07 -04:00
  • 1723ef1aae minor: zero workspace buffer init for flashinfer trtllm-gen attn (#22603) eigen 2025-08-15 17:38:10 -04:00
  • 00d6cba0cf Add PrefixRepetitionRandomDataset to vllm bench serve datasets (#20638) Seiji Eicher 2025-08-15 14:09:23 -07:00
  • 7f89ed248f [Fix] enable swap_ab for pplx problem size computation (#22991) shixianc 2025-08-15 14:02:12 -07:00
  • 8a87cd27d9 [CI] Speed up Whisper tests by reusing server (#22859) Michael Goin 2025-08-15 16:56:31 -04:00
  • a344a1a7da Use regex in convert-results-json-to-markdown.py (#22989) Michael Goin 2025-08-15 16:54:20 -04:00
  • 79899b63f6 [Bugfix] Added more env vars to hash (#22449) nvjullin 2025-08-16 04:08:37 +08:00
  • 6e670778cd [Core] direct indexing on self.block_table_np in compute_slot_mapping (#22940) Zebing Lin 2025-08-15 15:12:12 -04:00
  • df5afa82e5 [Log] Debug Once for Randomizing dummy data for DP Rank (#22860) Wentao Ye 2025-08-15 14:51:50 -04:00
  • 6cd69f51bf [Model] Granite-4 support loading quantized checkpoint (#22925) Chih-Chieh Yang 2025-08-15 14:47:56 -04:00
  • 8ad7285ea2 [Kernels] Clean up FusedMoeMethodBase and modular kernel setup. Remove extra arguments from modular kernel methods. (#22035) bnellnm 2025-08-15 14:46:00 -04:00
  • 48b01fd4d4 [Structured Output] Make the output of structured output example more complete (#22481) Shanshan Shen 2025-08-16 02:29:25 +08:00
  • 993d3d122b [Benchmarks] Include image data when ShareGPT4V dataset is used. (#22955) Chenheli Hua 2025-08-15 11:23:06 -07:00
  • 68af77e51c [FIXBUG] Correctly Apply Grammar Bitmask in Mixed Batches (#22896) JartX 2025-08-15 19:42:49 +02:00
  • 6b04039a72 [BugFix] Skip the Q component for QKVParallelLinear in the case of QKVCrossParallelLinear since its width is 0 (#22369) sstamenk 2025-08-15 19:17:31 +02:00
  • 7e2fb3c507 Merge branch 'main' into wye-refactor-quant-folder Wentao Ye 2025-08-15 11:24:28 -04:00
  • 1c859a1387 [V0 Deprecation] Remove advance_step (#22969) Woosuk Kwon 2025-08-15 08:22:31 -07:00
  • 74f441f4b5 [Core] Allow full cudagraph with separate attention routines and orthogonal to compilation, add support for FA2 and FlashInfer (#20059) fhl2000 2025-08-15 22:01:39 +08:00
  • a0632a3e03 [Frontend] Expose do_log_stats interval to env (#22905) Csrayz 2025-08-15 21:00:20 +08:00
  • e8b40c7fa2 [CI] Remove duplicated docs build from buildkite (#22924) Harry Mellor 2025-08-15 13:58:06 +01:00
  • 48f4636927 [Misc] Ignore ep_kernels_workspace (#22807) Jee Jee Li 2025-08-15 20:58:03 +08:00
  • 75531a6c13 [V1] [Hybrid] Support using float32 for state in Hybrid Models (Mamba2, Mamba1, Minimax) (#22928) Thomas Parnell 2025-08-15 14:57:06 +02:00
  • 22341b996e Improve multimodal hasher performance for re-used Image prompts (#22825) Staszek Paśko 2025-08-15 14:32:56 +02:00
  • 49252cf59e [MM] Allow skipping memory profiling for multimodal models. (#22950) Roger Wang 2025-08-15 04:41:38 -07:00
  • 3e6dd40016 [Bugfix] fix cuda 12.6 and 11.8 build (#22952) Jinzhen Lin 2025-08-15 18:10:22 +08:00
  • aa300c438d [Bugfix] Unquote file uri before reading image (#22912) Sayandip Dutta 2025-08-15 14:58:00 +05:30
  • fe91ce9591 [V1] - Split Prefill and Decode for Mamba1 models (#22653) amirai21 2025-08-15 11:59:52 +03:00