Commit Graph

  • 0730414999 [Core] Add audio_embeds support to chat completions (#29059) jeremyteboul 2025-11-20 19:39:47 -08:00
  • a982f5b5ea [kernel][perf] support uncontiguous input for rms_norm kernel (#28103) zhrrr 2025-11-21 11:39:09 +08:00
  • 0e741c12e3 [Bugfix] Fix Plamo3 rope handling (#29092) Cyrus Leung 2025-11-21 11:38:35 +08:00
  • 56669c1f29 [CI] Fix mypy for vllm/v1/worker (#29037) Wentao Ye 2025-11-20 22:36:07 -05:00
  • 3f5f36da3f [ROCm] Fix for import when building with upstream triton for gfx1100 for gpt-oss serving (#29127) Hongxia Yang 2025-11-20 22:30:07 -05:00
  • e1eefa4c40 [Bug] Fix torch warning of tf32 usage (#29112) Wentao Ye 2025-11-20 20:54:59 -05:00
  • ed6ae1e36a [AITER] [ROCm] Fix crash when loading llama4 model with old aiter version installed, fallback to forward_native implementation (#29124) Xiao Li 2025-11-20 17:54:35 -08:00
  • 9875be6431 [LoRA][2/2]Remove LoRA extra vocab (#28545) Jee Jee Li 2025-11-21 09:46:43 +08:00
  • df44df0143 [Feature] Shared Experts Overlap with FI deepgemm swap kernel, 2.2% throughput improvement and 3.6% TTFT improvement (#28879) Wentao Ye 2025-11-20 20:41:49 -05:00
  • 87cbbdff63 Update model references for OLMo3 (#29099) Michael Goin 2025-11-20 20:16:52 -05:00
  • 986ab5db63 [CI Bugfix] Fix Kernels DeepGEMM Test (H100) (#29106) Michael Goin 2025-11-20 19:42:33 -05:00
  • dd39f91edb [Doc] cleanup TPU documentation and remove outdated examples (#29048) Rob Mulla 2025-11-20 19:05:59 -05:00
  • c7a29d2c8d [CI/Build] Remove skip global cleanup in test_struct_output_generate.py (#29022) rasmith 2025-11-20 15:44:37 -06:00
  • 8237ab8a2b [CI/Build] Skip lm-format-enforcer tests in test_struct_output_generate.py for now (#29021) rasmith 2025-11-20 15:35:14 -06:00
  • 3fd74189db Fixes bench (#29058) Driss Guessous 2025-11-20 13:21:54 -08:00
  • 5e5a7eb16f [CI/Build] Make test_attention_selector.py run tests on correct platform (#29064) rasmith 2025-11-20 14:45:56 -06:00
  • 3d84ef9054 [CI/Build][AMD] Skip if flash_attn_varlen_func not available in test_aiter_flash_attn.py (#29043) rasmith 2025-11-20 14:39:49 -06:00
  • 4d01b64284 [Bugfix] - Add Trace Headers to Beam Search Path (#29100) Software Developer 2025-11-20 21:00:33 +01:00
  • 114b0e2500 [chore] Update annotate release scripts (#29077) Kevin H. Luu 2025-11-20 10:22:40 -08:00
  • 647464719b [KVConnector][Core] Support cross-layer KV blocks (#27743) Or Ozeri 2025-11-20 20:09:59 +02:00
  • e5bfcb6a88 [BugFix][PD]: make example proxy usable with P2pNcclConnector (#26628) Pan Li 2025-11-21 01:38:31 +08:00
  • 22924383e1 Updating the mirror of test-amd.yaml as of 2025-11-18 (#29016) Alexei-V-Ivanov-AMD 2025-11-20 11:07:06 -06:00
  • 56f45eddaf [Frontend] Optimize beam search loop by sorting and then splicing (#19347) rookie 2025-11-21 01:02:30 +08:00
  • 82b05b15e6 [BugFix] [FEAT] Enable fastsafetensors for ROCm platform (#28225) TJian 2025-11-20 23:34:11 +07:00
  • a2e9ebe9e2 [BugFix] Fix flash_attn import in siglip2navit.py (#29082) Fanli Lin 2025-11-20 20:14:29 +08:00
  • 93c8672ceb [Bugfix] Fix spec decode memory regression after #28549 (#28819) Zhewen Li 2025-11-20 03:05:50 -08:00
  • 371b1d4c61 [RL] Add Pause and Resume Generation for Asynchronous RL Training (#28037) Samit 2025-11-20 19:01:03 +08:00
  • c9e093116c [MODEL] Implement plamo3 (#28834) Shinichi Hemmi 2025-11-20 20:00:19 +09:00
  • c0c2dd1e0b [BugFix] kv_offloading: Fix bug in loading of partial cpu blocks (#28951) Or Ozeri 2025-11-20 12:55:10 +02:00
  • 06c20c9904 [ROCm] Add AMD GPU support on Deepseek v3.2 and SparseMLA (#26670) Pleaplusone 2025-11-20 18:54:01 +08:00
  • 6eb745d9bd Add truncate arg to yarn to match openai implementation of gpt-oss (#28244) Anna Shors 2025-11-20 02:53:50 -08:00
  • 66483a9d00 [Chore] Update xgrammar version from 0.1.25 to 0.1.27 (#28221) cjackal 2025-11-20 19:53:09 +09:00
  • edfe867208 [Misc] don't cache CUTLASS_REVISION var in CMakeLists.txt (#28518) Jinzhen Lin 2025-11-20 18:52:53 +08:00
  • dc45efc8ef [BugFix] Fix Llama4 Pipeline Parallelism Assert Error (#28577) Dezhan 2025-11-20 02:52:36 -08:00
  • fb8851f254 [Bugfix][cache_kernels]: Fix OOB in cache_kernels.cu (#28760) Vensen 2025-11-20 18:52:02 +08:00
  • a903d59ffa cleanup at::Tag::needs_fixed_stride_order (#28974) Boyuan Feng 2025-11-20 02:51:36 -08:00
  • 322cb02872 [CI/Build][AMD] Fix import errors in tests/kernels/attention (#29032) rasmith 2025-11-20 03:48:09 -06:00
  • 2c52c7fd9a [Bug] Fix torch dynamo warning Dynamo detected a call to a functools.lru_cache (#29038) Wentao Ye 2025-11-20 03:52:23 -05:00
  • 1e1c06789e [ci][amd] fix EPLB execution test (#28742) Bradley D 2025-11-19 23:53:38 -08:00
  • 7218f83992 [ROCm][BugFix] Fix shared expert loading error when disable VLLM_ROCM_USE_AITER_FUSION_SHARED_EXPERTS (#28633) Pleaplusone 2025-11-20 15:50:23 +08:00
  • 20e4497be2 [V0 Deprecation] Remove num_lookahead_slots (#29000) Cyrus Leung 2025-11-20 14:39:10 +08:00
  • 1c7bcc55b8 [Frontend] Allow parsed tool arguments (#28820) Quentin Gallouédec 2025-11-19 23:20:12 -07:00
  • a9705a290a [Model][QwenVL] Replace torch.repeat_interleave with faster np.repeat (#28964) Lukas Geiger 2025-11-20 06:04:23 +00:00
  • 64192d5624 [Bugfix] Revert custom attention mask for gemma3-mm (#28995) Isotr0py 2025-11-20 13:23:22 +08:00
  • fe25772aa9 [Bugfix] Handle broken frames in video loading (#29001) Canlin Guo 2025-11-20 12:38:12 +08:00
  • 0cca9b4d13 [Bugfix] Fix precision loss in LoRA-wrapped RowParallelLinear by fusing bias into GEMM (#28972) prashanth058 2025-11-19 19:50:37 -08:00
  • a8c536829c Consolidate Nvidia ModelOpt quant config handling for all quantization methods (#28076) Shengliang Xu 2025-11-19 19:39:36 -08:00
  • fcbcba6c70 [Feat] Iteration-level profiling for Torch and CUDA profiler (#28987) Benjamin Chislett 2025-11-19 22:17:48 -05:00
  • 3168285fca [cpu][ci] Add initial set of tests for Arm CPUs (#28657) Fadi Arafeh 2025-11-20 02:37:09 +00:00
  • 3fb0d90999 [AMD] Use Decoupled Kernel Block Size to Support AITER MLA block_size=1 (#27715) Qiang Zhang 2025-11-20 10:11:52 +08:00
  • 05c2dee7e9 [DeepSeek + LMCache Multiprocess] handle MLA for deepseek model + LMCache Multiprocess connector (#29039) Kuntai Du 2025-11-20 09:40:49 +08:00
  • 1d642872a2 [torchao] fix safetensors for sharding (#28169) liangel-02 2025-11-19 19:39:45 -05:00
  • 9ccef8e333 [Misc] Colorize logs (#29017) Nick Hill 2025-11-19 16:26:04 -08:00
  • 537cc635c7 [GC Debugger] Simply and improve GC Debugger Utils (#29029) Jialin Ouyang 2025-11-19 16:10:22 -08:00
  • 5031cd5d55 [Refactor] Optimize select_experts (#28069) Wentao Ye 2025-11-19 18:53:15 -05:00
  • 3aaa94ac99 [Performance] Reduce DeepGEMM N dim restriction from 128 to 64 multiplier (#28687) Alexander Matveev 2025-11-19 18:47:13 -05:00
  • 8e38e99829 [Feature] EPLB on Qwen3VLMoe and CompressedTensorsWNA16MoEMethod (#28849) JartX 2025-11-20 00:30:08 +01:00
  • 0075bfffd4 [CI] Fix precommit rope_theta issue (#29040) Wentao Ye 2025-11-19 17:22:43 -05:00
  • 275de34170 [BugFix] Fix false assertion with spec-decode=[2,4,..] and TP>2 (#29036) v0.11.2 Lucas Wilkinson 2025-11-19 16:43:54 -05:00
  • fa3ffb4365 [BugFix] Ray with multiple nodes (#28873) Julien Denize 2025-11-19 22:20:58 +01:00
  • 6d5974369c [BugFix] Fix async-scheduling + FlashAttn MLA (#28990) Lucas Wilkinson 2025-11-19 10:04:07 -05:00
  • 0ce9990d2c [NVIDIA] Guard SM100 CUTLASS MoE macro to SM100 builds v2 (#28938) Johnny 2025-11-19 01:44:27 +01:00
  • cb0a7b4bea [Bugfix] Move flashinfer kernel check into `__init__` function of `FusedMoE` (#29018) Max Hu 2025-11-19 16:54:15 -05:00
  • 8f4f77a727 [BugFix] Fix false assertion with spec-decode=[2,4,..] and TP>2 (#29036) Lucas Wilkinson 2025-11-19 16:43:54 -05:00
  • 22e44ad589 [ROCm][CI] Fix Weight Loading With Multiple GPU Tests on ROCm (#28984) Micah Williamson 2025-11-19 15:31:33 -06:00
  • 88f5b19f0b [DeepSeek] Fix DeepSeek V3.2 Rope Embedding (#28968) Yongye Zhu 2025-11-19 16:30:04 -05:00
  • 613abb50d5 [MoE] Nvfp4 Masked Gemm: Add flashinfer grouped_gemm_nt_masked (#25990) Shu Wang 2025-11-19 15:29:06 -06:00
  • cdeec2e606 [BugFix] Ray with multiple nodes (#28873) Julien Denize 2025-11-19 22:20:58 +01:00
  • 1607e664f0 [Bug] Fix Batch Invariant MLA test (#28967) Wentao Ye 2025-11-19 16:18:32 -05:00
  • 68d7231991 [CI/Build] Fix test_prefix_prefill for AMD (#28905) Ryan Rock 2025-11-19 15:04:36 -06:00
  • 2fd893b4ce [Feature] Prefill Context Parallel (PCP) basic support (#28718) Qiu 2025-11-20 04:52:44 +08:00
  • 02f5903b84 Eagle: MM Cuda Graphs with MRope (#28896) Izzy Putterman 2025-11-19 12:01:05 -08:00
  • ac10fd3c69 Upstreaming aiter triton attention backend as a new backend (#28701) Aleksandr Malyshev 2025-11-19 11:59:30 -08:00
  • 9d2d561257 [Bugfix] Fix precision corruption when shared_experts_stream=None (#28942) 杰兮 2025-11-20 03:30:57 +08:00
  • fe69f331f8 [Kernels] Improve H200 Fused MoE Config (#28992) Robert Shaw 2025-11-19 14:23:54 -05:00
  • 3319a493fc [Core] Reuse created spec tokens lists to mitigate GC cost (#28917) Jialin Ouyang 2025-11-19 11:20:22 -08:00
  • 61728cd1df Re-enable FlashInfer for Llama4 on Blackwell in e2e fusion tests (#28966) Copilot 2025-11-19 13:32:19 -05:00
  • 0c80efd94f GLM-V video segmentation solution adjustment (#28941) Yuxuan Zhang 2025-11-20 01:32:55 +08:00
  • a8b70304d6 Update rope_scaling to rope_parameters in preparation for Transformers v5 (#28542) Harry Mellor 2025-11-19 18:06:36 +01:00
  • d44e9df7d4 [Model][Mamba] Add selector for mamba attention backend and make it pluggable for other device (#26487) Shanshan Shen 2025-11-20 00:24:55 +08:00
  • 48fc8b1e59 [BugFix] Fix async-scheduling + FlashAttn MLA (#28990) Lucas Wilkinson 2025-11-19 10:04:07 -05:00
  • 1ffe934c8a [torch.compile] caching of config fields should be opt-out by default (#26468) vnadathur 2025-11-19 06:13:54 -08:00
  • 2c8b9182b5 [CI] Reorganize compile tests so new tests are automatically included in CI (#28625) Yanan Cao 2025-11-19 06:13:50 -08:00
  • 4f5299f717 Relax Transformers modeling backend MoE experts check (#28952) Harry Mellor 2025-11-19 14:50:30 +01:00
  • 09540cd918 [Doc]: fix typos in various files (#29010) Didier Durand 2025-11-19 13:56:21 +01:00
  • da2f6800e0 [Feat][Perf] Enable deepep-low-latency with round-robin expert placement. (#28449) Chen Bruce 2025-11-19 20:46:24 +08:00
  • ba558c029a [config] Expose get_total_num_hidden_layers() in ModelConfig (#28961) Tova Movshovitz 2025-11-19 13:37:11 +02:00
  • 97cfa99d59 [Docs] Take env var definition out of folded admonition (#29005) Harry Mellor 2025-11-19 12:32:04 +01:00
  • bbc6c2f1e5 [CI/Build] Fix broken build on Apple M1 (#28999) j20120307 2025-11-19 03:07:22 -08:00
  • 8151609583 refactor(cpu_types_scalar.hpp): Unify scalar loop implementations using unroll_loop (#28847) ihb2032 2025-11-19 19:05:44 +08:00
  • fdf93486d6 [Docs] Clean up moe_kernel_features.md (#28530) Michael Yao 2025-11-19 18:35:29 +08:00
  • d69062c67a add support for --fully-sharded-loras in fused_moe (#28761) gnovack 2025-11-19 00:32:00 -08:00
  • ae4821a108 Add CPU support model (#28697) Louie Tsai 2025-11-18 23:47:57 -08:00
  • 7ed27f3cb5 [Doc]: fix typos in various files (#28945) Didier Durand 2025-11-19 07:52:30 +01:00
  • a4511e38db Speed up macOS smoke test (#28954) Michael Goin 2025-11-19 01:46:32 -05:00
  • 71d0ae1c54 [Misc] Update embedding/cross encoder tests to use mteb v2 (#27329) Roman Solomatin 2025-11-19 09:28:40 +03:00
  • 3d4e7d34be [Model][QwenVL] Simplify cos/sin rotary embedding indexing (#28962) Lukas Geiger 2025-11-19 05:43:01 +00:00
  • 6a25ea5f0e [Docs] Update oneshot imports (#28188) Uranus 2025-11-19 13:30:08 +08:00
  • 73ff872db0 [Bugfix] Fix typo in Qwen3 Next model executor (#28960) Gleb Kurchanov 2025-11-19 08:21:02 +03:00
  • 468a8d72ba [Bugfix] Fix FusedMoEModularKernel for triton backend (#28913) Xin Yang 2025-11-18 21:05:22 -08:00