Commit Graph

  • 84d57342b6 [BugFix][MM] Fix Nonetype error when video is cache in qwen2.5-omni-thinker (#26004) Wenlong Wang 2025-10-01 01:03:25 -07:00
  • 57b46d769e [Doc] updating torch.compile doc link (#25989) nadathurv 2025-10-01 00:04:56 -07:00
  • f48b6a03ba [Misc]allow disable pynccl (#25421) Lucia Fang 2025-09-30 23:04:13 -07:00
  • e4beabd2c8 [BugFix] Fix default kv-cache-dtype default for DeepseekV3.2 (#25988) (tag: v0.11.0rc4) Lucas Wilkinson 2025-10-01 00:58:31 -04:00
  • febb688356 [Bugfix] Fix __syncwarp on ROCM (#25996) Zhewen Li 2025-09-30 21:15:11 -07:00
  • a1825fe645 [MM] Add text-only mode for Qwen3-VL (#26000) Roger Wang 2025-09-30 21:13:42 -07:00
  • bab9231bf1 [Model] MTP fallback to eager for DeepSeek v32 (#25982) Lucia Fang 2025-09-30 18:53:22 -07:00
  • c214d699fd [spec decode] Consolidate speculative decode method name for MTP (#25232) qizixi 2025-09-26 15:27:05 -07:00
  • c3dfb0f6dd [Bench] Add DeepSeekV32 to MoE benchmark (#25962) Jee Jee Li 2025-10-01 05:13:48 +08:00
  • 83f3c9beae [bugfix][deepseek] fix flashmla kernel selection (#25956) youkaichao 2025-10-01 00:30:36 +08:00
  • d0b178cef1 [NIXL] Add support for MLA caches with different latent dim (#25902) Nicolò Lucchesi 2025-09-30 14:18:29 +02:00
  • b3230e1ac0 [New Model] DeepSeek-V3.2 (Rebased to Main) (#25896) Yongye Zhu 2025-09-30 05:14:41 -04:00
  • 03df0fb5d2 [BugFix] Fix DP/EP hang (#25906) Lucas Wilkinson 2025-09-30 00:18:59 -04:00
  • 9471879bd4 [Bug] Fix Weight Loading for Block FP8 Cutlass SM90 (#25909) Wentao Ye 2025-09-29 21:15:19 -04:00
  • ab5b6459df [Bugfix] Fallback ViT attn backend to SDPA for blackwell (#25851) Roger Wang 2025-09-28 23:03:51 -07:00
  • 2a69ab4899 Update to Transformers v4.56.2 (#24638) Harry Mellor 2025-10-01 06:07:07 +01:00
  • 8d7da92fd7 [BugFix] Fix default kv-cache-dtype default for DeepseekV3.2 (#25988) Lucas Wilkinson 2025-10-01 00:58:31 -04:00
  • e952eee698 [Bugfix] Fix __syncwarp on ROCM (#25996) Zhewen Li 2025-09-30 21:15:11 -07:00
  • 66bca9b8bd [MM] Add text-only mode for Qwen3-VL (#26000) Roger Wang 2025-09-30 21:13:42 -07:00
  • 99028fda44 Fix INT8 quantization error on Blackwell GPUs (SM100+) (#25935) Param 2025-09-30 22:19:53 -04:00
  • 1244948885 [Log] Optimize Log for FP8MOE (#25709) Wentao Ye 2025-09-30 22:18:43 -04:00
  • a73f6491c8 Update launch_bounds_utils.h for correct compile on Multiple Cuda Arch - PTXAS out of range Warning (#25843) Salvatore Cena 2025-10-01 04:18:19 +02:00
  • 001e50c92c [Model] MTP fallback to eager for DeepSeek v32 (#25982) Lucia Fang 2025-09-30 18:53:22 -07:00
  • 96ebcaa3ad [Misc] Make EP kernels install script support uv (#25785) Lucas Wilkinson 2025-09-30 19:38:34 -04:00
  • 5db1870bb9 [gpt-oss] use vLLM instead of openai types for streaming (#25186) Andrew Xia 2025-09-30 15:47:07 -07:00
  • 2ce26b9b5d [Docs] Remove API Reference from search index (#25949) Harry Mellor 2025-09-30 23:10:02 +01:00
  • a388252ac4 Add explicit pooling classes for the Transformers backend (#25322) Harry Mellor 2025-09-30 23:07:06 +01:00
  • 9a9f48dff7 [V1] [P/D] Add Support for KV Load Failure Recovery (#19330) David Ben-David 2025-10-01 00:57:08 +03:00
  • 67f3fb0844 [Bench] Add DeepSeekV32 to MoE benchmark (#25962) Jee Jee Li 2025-10-01 05:13:48 +08:00
  • 43b752c325 [Llama4] [multimodal] Fix misplaced dtype cast of cos_sin_cache in Llama4VisionRotaryEmbedding (#25889) cjackal 2025-10-01 05:35:15 +09:00
  • cfd302db9b OffloadingConnector: Fix GPU block tracking bug (#25856) Or Ozeri 2025-09-30 22:53:04 +03:00
  • fb610ae684 [Docs] Add moe kernel features doc (#25297) bnellnm 2025-09-30 15:03:15 -04:00
  • 2f652e6cdf [Doc] Improve MM Pooling model documentation (#25966) Cyrus Leung 2025-10-01 02:58:29 +08:00
  • e6a226efba [Bug] Fix AttributeError: 'QKVParallelLinear' object has no attribute 'orig_dtype' (#25958) Wentao Ye 2025-09-30 14:13:03 -04:00
  • a2e6fa7e03 [bugfix][deepseek] fix flashmla kernel selection (#25956) youkaichao 2025-10-01 00:30:36 +08:00
  • 9f1c4ecaf2 [Bugfix] Token type and position embeddings fail to be applied to inputs_embeds (#25922) Cyrus Leung 2025-10-01 00:23:12 +08:00
  • ef283548f7 [Bugfix] Fix accuracy issue of TRTLLM FP8 MOE and improve logging (#25895) Pavani Majety 2025-09-30 07:51:31 -07:00
  • f4db5e6de1 [Bugfix][Model] Fix inference for Hunyuan dense models (#25354) Anion 2025-09-30 22:38:07 +08:00
  • 099aaee536 Add Hugging Face Inference Endpoints guide to Deployment docs (#25886) Sergio Paniego Blanco 2025-09-30 16:35:06 +02:00
  • 35fe398c7c [Kernel][Moe Configs] Add more tuned triton configs for ExpertsInt8 and FP8 (#25858) Asaf Joseph Gardin 2025-09-30 17:30:44 +03:00
  • bb6d43047e [Fix] Improve CPU backend compatibility for RISC-V (#25816) ihb2032 2025-09-30 21:48:07 +08:00
  • bc546f76a1 [CI] Move applicable tests to CPU (#24080) Reza Barazesh 2025-09-30 09:45:20 -04:00
  • 80608ba5af [NIXL] Add support for MLA caches with different latent dim (#25902) Nicolò Lucchesi 2025-09-30 14:18:29 +02:00
  • e184c9c510 [perf] Use CPU tensor to reduce GPU->CPU sync (#25884) Lehua Ding 2025-09-30 19:51:16 +08:00
  • d7e34b4210 [Model] Move vision_feature_select_strategy into resolve_visual_encoder_outputs (#25938) Cyrus Leung 2025-09-30 19:24:57 +08:00
  • ef6e0e7132 [Bugfix][Model]fix ernie45 moe gate&bias dtype to float32 (#25936) CSWYF3634076 2025-09-30 19:11:21 +08:00
  • 1ad3aca682 Updated TRL integration docs (#25684) Sergio Paniego Blanco 2025-09-30 12:10:55 +02:00
  • 8d0afa9b42 [Doc] Add Cambricon MLU support (#25942) a120092009 2025-09-30 17:59:47 +08:00
  • fa7e254a7f [New Model] DeepSeek-V3.2 (Rebased to Main) (#25896) Yongye Zhu 2025-09-30 05:14:41 -04:00
  • e23cacda35 [Bugfix]: Clean up chunked prefill logging when using whisper (#25075) Simon Danielsson 2025-09-30 10:17:49 +02:00
  • 2e1b8bc2b6 [Model][Bugfix] Fix MiDashengLM audio encoder mask by removing incorrect logical_not (#25925) Zhou Jiahao 2025-09-30 16:15:23 +08:00
  • e47433b3c1 [BugFix] Pass config_format via try_get_generation_config (#25912) acisseJZhong 2025-09-29 22:09:50 -07:00
  • 23194d83e8 [BugFix] Fix DP/EP hang (#25906) Lucas Wilkinson 2025-09-30 00:18:59 -04:00
  • 61aedb5ffe MoveVllmConfig from config/__init__.py to config/vllm.py (#25271) Harry Mellor 2025-09-30 03:49:49 +01:00
  • d3bd171123 [Benchmark] Support benchmark throughput for external launcher DP (#25913) Zhuohan Li 2025-09-29 18:43:57 -07:00
  • 89e4050af4 [Bug] Fix Weight Loading for Block FP8 Cutlass SM90 (#25909) Wentao Ye 2025-09-29 21:15:19 -04:00
  • 78a47f87ce Test Prompt Embeds/LoRA compatibility and Enable LoRA Support for OPT Models (#25717) Andrew Sansom 2025-09-29 19:10:58 -05:00
  • 6a113d9aed [V0 Deprecation] Remove vllm.worker and update according imports (#25901) Aaron Pham 2025-09-29 19:26:11 -04:00
  • 2e4fe48c37 [NIXL] Increase default KV block eviction timeout on P (#25897) Nicolò Lucchesi 2025-09-29 23:35:14 +02:00
  • 8eb0a1d906 [Doc] Polish example for torchrun dp (#25899) Zhuohan Li 2025-09-29 14:31:34 -07:00
  • fea3e476aa [Kernel] Chunk-aligned mamba2 (#24683) Thomas Parnell 2025-09-29 23:18:25 +02:00
  • 61a3431613 [Bugfix][ROCm] Fixing trying to import non-existent symbols from libnccl.so (#25605) Gregory Shtrasberg 2025-09-29 17:01:50 -04:00
  • 9bedac9623 [Doc] Add documentation for vLLM continuous benchmarking and profiling (#25819) Naman Lalit 2025-09-29 13:49:49 -07:00
  • c42ff4f4fd [BugFix][torch.compile] KV scale calculation issues with FP8 quantization (#25513) Adrian Abeyta 2025-09-29 14:52:04 -05:00
  • d5ab28511c [Bugfix] Use correct key "ignore" for config.json non-quantized layers (#25706) Lee Nau 2025-09-29 12:07:29 -07:00
  • e61eb5e09d [Model] Remove MotifForCausalLM (#25866) Jee Jee Li 2025-09-30 00:36:30 +08:00
  • 0899ba5b42 [CI/Build] Include Transformers backend test in nightly transformers test (#25885) Isotr0py 2025-09-30 00:33:39 +08:00
  • 145ac73317 [Bugfix][Speculative Decoding] Fix Eagle3 quantization config issue (#25883) Rahul Tuli 2025-09-29 21:07:20 +05:30
  • d0d138bc55 [Nixl][P/D] Add cuda2cpu support (HD->DH transfer) (#24690) Chenxi Yang 2025-09-29 07:31:51 -07:00
  • 43227236ec [torch.compile] serialize cudagraph_mode as its enum name instead of value (#25868) Jiangyun Zhu 2025-09-29 21:54:52 +08:00
  • 8616300ae2 [Model][Bugfix] Fix issues in MiDashengLM implementation for quantized models (#25854) Zhou Jiahao 2025-09-29 18:59:04 +08:00
  • edbaadd91f [Bugfix] Fix requirements paths in install instructions (#25827) Yingjun Mou 2025-09-29 03:49:35 -07:00
  • 9360d34fa1 update to latest deepgemm for dsv3.2 (#25871) youkaichao 2025-09-29 17:51:43 +08:00
  • 1b67b04656 [Misc] Remove more get_input_embeddings_v0 (#25857) Cyrus Leung 2025-09-29 16:03:37 +08:00
  • bd51f78e39 [V0 Deprecation][Models] Remove all V0 condition for mm embeddings merge (#25331) Isotr0py 2025-09-29 14:09:18 +08:00
  • 65ecb4f134 [Bugfix] Fallback ViT attn backend to SDPA for blackwell (#25851) Roger Wang 2025-09-28 23:03:51 -07:00
  • 8ce5d3198d [P/D] NIXL Updates (#25844) (tag: v0.11.0rc3) Robert Shaw 2025-09-29 00:46:30 -04:00
  • 09c2cbc04a [Bugfix] fix Qwen3VLMoe load when pp > 1 (#25838) JJJYmmm 2025-09-29 01:56:12 +08:00
  • 143844fa43 [XPU]Fix xpu spec decoding UTs, avoid using cuda graph (#25847) Kunshang Ji 2025-09-29 13:15:10 +08:00
  • 219cfbe7f6 Add Phi4FlashForCausalLM to _PREVIOUSLY_SUPPORTED_MODELS (#25832) Thomas Parnell 2025-09-29 07:08:17 +02:00
  • 9b44a7d926 [P/D] NIXL Updates (#25844) Robert Shaw 2025-09-29 00:46:30 -04:00
  • a3ae45a38c [Misc] fix tests failure by using current_platform (#25825) Juechen Liu 2025-09-28 21:18:57 -07:00
  • 0307428d65 Remove redundant cudagraph dispatcher warning (#25841) Michael Goin 2025-09-28 17:12:42 -04:00
  • 471997adf6 [Bugfix] fix Qwen3VLMoe load when pp > 1 (#25838) JJJYmmm 2025-09-29 01:56:12 +08:00
  • b1ded114b9 Update GLM-4.5 Doc transformers version (#25830) Yuxuan Zhang 2025-09-28 20:05:51 +08:00
  • f4e4088c99 Fix random dataset mismatched token length with config. (#24937) weiliang 2025-09-28 16:23:44 +08:00
  • 4c347044c9 [VLM] Update Qwen3-VL max_num_video_tokens calculation for configurable video profiling (#25557) (tag: v0.11.0rc2) Isotr0py 2025-09-28 12:21:01 +08:00
  • 19e7ab7315 [Bugfix] Fix Qwen3-VL regression from #24982 (#25814) Roger Wang 2025-09-27 20:21:09 -07:00
  • 6de3d431d9 [MM] Optimize memory profiling for scattered multimodal embeddings (#25810) Roger Wang 2025-09-27 19:17:58 -07:00
  • b14773bd64 [Bugfix][NIXL] Fix Async Scheduler timeout issue (#25808) Nicolò Lucchesi 2025-09-27 21:17:35 +02:00
  • 26a7a33b88 [Bugfix][WideEP] Apply TP Attn + EP MoE fix to other models (#24982) Tyler Michael Smith 2025-09-27 10:22:28 -04:00
  • 5aa5811a16 [CI] Fix FlashInfer AOT in release docker image (#25730) Michael Goin 2025-09-26 17:11:40 -04:00
  • c2fa2d4dc9 [Bugfix] Allow Only SDPA Backend for ViT on B200 for Qwen3-VL (#25788) Wentao Ye 2025-09-26 23:44:52 -04:00
  • 32335c8b34 Add option to restrict media domains (#25783) Russell Bryant 2025-09-26 21:23:52 -04:00
  • 04c2b26972 Add filtering for chat template kwargs (#25794) Russell Bryant 2025-09-27 06:46:49 -04:00
  • ee10d7e6ff Validate API tokens in constant time (#25781) Russell Bryant 2025-09-27 06:09:26 -04:00
  • bb79c4da2f Reduce the Cuda Graph memory footprint when running with DBO (#25779) Sage Moore 2025-09-26 15:29:56 -07:00
  • 0efd540dbc [VLM] Update Qwen3-VL max_num_video_tokens calculation for configurable video profiling (#25557) Isotr0py 2025-09-28 12:21:01 +08:00
  • 6144754014 [Bugfix] Fix Qwen3-VL regression from #24982 (#25814) Roger Wang 2025-09-27 20:21:09 -07:00
  • 69311446ba [MM] Optimize memory profiling for scattered multimodal embeddings (#25810) Roger Wang 2025-09-27 19:17:58 -07:00