Commit Graph

  • 1ecfabe525 glm 4.6 fused tuned inference config for B200 (#32958) navmarri14 2026-02-08 10:55:47 -08:00
  • 4df841fe75 [torch.compile] Add an option to force-enable the MOE cold start optimization (#33735) Richard Zou 2026-02-08 13:42:56 -05:00
  • a263aa6140 [BugFix] Change support no act and mul for marlin (#34088) TomerBN-Nvidia 2026-02-08 19:18:22 +02:00
  • 179ae7da8f [Revert] Fix performance regression for GLM-4.7-GPTQ decode and MTP acceptance rate (#33771) aabbccddwasd 2026-02-09 00:13:24 +08:00
  • c4df59ad43 Add embedding input functionality for disabled modalities [remake] (#32493) Reagan Lee 2026-02-08 04:57:16 -08:00
  • 785cf28fff [ROCm] [CI] Reduce Resource of two test groups (#34059) TJian 2026-02-08 15:17:26 +08:00
  • a96197f564 [Perf] Simplify DeepseekV32 tokenizer, ensure fast detokenization used (#33855) Nick Hill 2026-02-07 23:16:34 -08:00
  • ab10d79855 [ROCm][Bugfix] fix act_quant_fusion module import error (#34069) Andreas Karatzas 2026-02-07 21:21:12 -06:00
  • 7fcb705b80 [CI/Build] Skip GCS test (#34057) Cyrus Leung 2026-02-08 00:52:38 +08:00
  • b956cdf818 [Doc] Fix run_batch docs (#34056) Cyrus Leung 2026-02-07 22:18:16 +08:00
  • ed17f54c8b Perf tuning and expansion of cases covered for wvSplitKrc (#33493) Hashem Hashemi 2026-02-07 05:33:11 -08:00
  • 860981d8d8 Make directory exist ok for ray spinning up multiple replicas on a single instance (#33604) Jiang Wu 2026-02-07 05:30:49 -08:00
  • 52181baaea Update DeepGEMM version pin in Dockerfile to match #32479 (#33935) zifeitong 2026-02-07 05:30:22 -08:00
  • de3869bb4d move checks out of unified_kv_cache_update custom op (#33943) Rohan Potdar 2026-02-07 07:30:09 -06:00
  • ce9b3cd3e9 [PluggableLayer][3/N] Apply PluggableLayer to mamba layers. (#33660) whx 2026-02-07 21:26:05 +08:00
  • db4ede9743 [Model] Enable Step3p5ForCausalLM testing (#33755) Jee Jee Li 2026-02-07 21:25:24 +08:00
  • 2cb2340f7a [Frontend]Add support for transcriptions and translations to run_batch (#33934) Pooya Davoodi 2026-02-07 05:24:57 -08:00
  • 4df44c16ba Enable Eagle3 speculative decoding for Mistral3ForConditionalGeneration to support eagle3 (#33939) TundeAtSN 2026-02-07 08:24:52 -05:00
  • 81fe69cae5 [torch.compile] Stop compiling identical artifacts (#34003) Richard Zou 2026-02-07 08:24:48 -05:00
  • dd6a6e1190 [Kernel] Add KernelConfig flag to enable/disable FlashInfer autotune (#34006) Mohammad Miadh Angkad 2026-02-07 21:24:44 +08:00
  • edb359cce4 [Renderer] Define render_cmpl and render_chat (#34039) Cyrus Leung 2026-02-07 21:24:40 +08:00
  • 6ed5eda300 [CI][Build] Pin grpcio-tools==1.78.0 (#34048) wang.yuqi 2026-02-07 21:24:35 +08:00
  • 11a4c9d30d [Misc] Simplify get_max_tokens (#34036) Cyrus Leung 2026-02-07 16:59:49 +08:00
  • 15a0b9e570 Fix spelling errors (#33978) lukec 2026-02-07 15:58:50 +08:00
  • c490d8cc73 [ROCm][CI] Pinning lm-eval version to resolve multi-modal small eval bug (#34038) Andreas Karatzas 2026-02-07 00:21:08 -06:00
  • 48312e579a [Misc] Make PlaceholderRange.get_num_embeds a method (#34035) Cyrus Leung 2026-02-07 13:30:17 +08:00
  • bc32444b23 [Kernel] Add enable_sm120_or_later for SM121 (DGX Spark) CUTLASS support (#33517) Vel 2026-02-06 20:28:01 -08:00
  • 18e8545297 [Revert] Add util handle_deprecated back (#33998) Wentao Ye 2026-02-06 23:14:45 -05:00
  • 6f7adc533a fix description in plugin_system.md (#33999) 果冻虾仁 2026-02-07 11:37:02 +08:00
  • 40218a82ba [ModelRunner V2] Revert token rank comparison difference for now (#34017) Nick Hill 2026-02-06 19:11:05 -08:00
  • 1c3b22058f [Misc] Add backward-compatible import aliases for renamed translations module (#34015) kourosh hakhamaneshi 2026-02-06 19:01:41 -08:00
  • 3920cafdd6 [Bugfix] Fix _fused_moe_lora_expand signature mismatch (#33821) Xin Yang 2026-02-06 18:45:59 -08:00
  • ec28784fdc [CI][AMD][Bugfix] Check that model_config is not None in enable_norm_pad_fusion (#34007) rasmith 2026-02-06 20:43:25 -06:00
  • 55aeec04f5 [Bugfix] Fix Whisper tokenization (#34011) Nicolò Lucchesi 2026-02-07 03:42:52 +01:00
  • 906077181b [Bugfix] Fix QK Norm+RoPE fusion pattern matching on B200+FP8 (#33967) Ikenna 2026-02-06 21:27:33 -05:00
  • 89a385d79f [Feat][RL] Pause and Resume with keep requests for single engine (#32351) Aaron Hao 2026-02-06 16:08:58 -08:00
  • 4a2d00eafd [bugfix] [ROCm] Fix premature CUDA initialization in platform detection (#33941) kourosh hakhamaneshi 2026-02-06 14:17:55 -08:00
  • 207c3a0c20 Fix RoutingMethodType logic (#33919) Dimitrios Bariamis 2026-02-06 23:03:34 +01:00
  • ae2e93f89b [Fix] Fix logprobs=0 handling for /inference/v1/generate endpoint (#34010) Sumanth R Hegde 2026-02-06 12:33:40 -08:00
  • 9e9acce577 [Bugfix] Fix no attribute error of SharedFusedMoE (DeepSeek-V3.1 as test model) (#33993) xuebwang-amd 2026-02-07 03:11:32 +08:00
  • fe5438200b [Rocm][Bugfix] Fix dtype not same for gemm_a4w4 op (#33734) Charlie Fu 2026-02-06 13:09:59 -06:00
  • 77c09e1130 [Refactor] Remove align block size logic in moe_permute (#33449) Wentao Ye 2026-02-06 13:57:06 -05:00
  • 16786da735 [Model Runner V2] support apply penalty for spec decode (#33251) zhrrr 2026-02-07 02:56:48 +08:00
  • aaa2efbe98 [DOC] [ROCm] Update docker deployment doc (#33971) vllmellm 2026-02-07 02:05:35 +08:00
  • aca5967416 [KV Connector] Add missing method overrides to MultiConnector (#33292) Seiji Eicher 2026-02-06 09:58:21 -08:00
  • 67a746e87f [Log] Optimize duplicate startup log (#33944) Wentao Ye 2026-02-06 12:49:56 -05:00
  • 7bec435130 [Bugfix] Fix the issue where tool calling does not work when using fast detokenization with dsv32 (#33964) Chauncey 2026-02-07 01:23:44 +08:00
  • 5c52644b10 [Docs] Update link to Benchmark CLI documentation (#33254) Eldar Kurtić 2026-02-06 17:00:59 +01:00
  • 2ce9fe4ad0 [XPU][5/N] add wna16 xpu kernel (#33973) zofia 2026-02-06 23:59:53 +08:00
  • cd8b405bd0 [Refactor] Consolidate sequence normalization and enc-dec parsing (#33928) Cyrus Leung 2026-02-06 23:43:47 +08:00
  • 4707f7ebb4 [Model] Support MiniCPM-o 4.5 (#33431) tc-mb 2026-02-06 23:29:10 +08:00
  • c39ee9ee2b [Docs] Add sections on process architecture and minimum CPU resources (#33940) Michael Goin 2026-02-06 10:26:43 -05:00
  • 350ca72c04 [ROCm][AITER] Fix AITER import regression for explicit backend selection (#33749) Andreas Karatzas 2026-02-06 09:08:16 -06:00
  • 1fb0495a72 [FIX] guidance: use max(vocab_size, len(tokenizer)) for n_vocab (#33509) FredericOdermatt 2026-02-06 15:23:03 +01:00
  • 85ee1d962b [Bugfix] Fix models and tests for transformers v5 (#33977) Raushan Turganbay 2026-02-06 14:47:41 +01:00
  • 51a7bda625 Update WeightTransferConfig to be more standard like the others (#33989) Harry Mellor 2026-02-06 13:15:00 +00:00
  • 6e7b1c4b59 [Docs] Improve documentation (#33799) SorenDreano 2026-02-06 13:57:09 +01:00
  • 2991dd3d22 [Bugfix][Model] Support LoRA on Qwen3 Output Embedding (#29816) Kurt Shuster 2026-02-06 04:25:31 -08:00
  • ac32e66cf9 [torch.compile] Reorganize vllm/compilation and tests/compile (0/N for vLLM IR) (#33731) Luka Govedič 2026-02-06 07:19:49 -05:00
  • f79d9dce16 [CPU][BugFix] Fix loading of w8a8int models with bias (#33582) Fadi Arafeh 2026-02-06 11:59:20 +00:00
  • ba5cbbf107 Bump HF Hub client to get bug fix (#33984) Harry Mellor 2026-02-06 11:25:33 +00:00
  • 233b26ab35 [PaddleOCR-VL] Add BC for transformers 5.0 config (#33976) zhang-prog 2026-02-06 18:33:49 +08:00
  • 791a94bed0 Consolidate and fix forbidden import pre-commit checks (#33982) Harry Mellor 2026-02-06 09:47:41 +00:00
  • e969a169ef support view_from_cpu_tensor on XPU (#33868) Xinyu Chen 2026-02-06 16:34:20 +08:00
  • 6d8d34be6d Fix main pre-commit (#33975) Harry Mellor 2026-02-06 08:08:05 +00:00
  • 1363e3d6d5 [cpu][performance] CPU Paged Attention NEON BFMMLA BF16 Implementation (#32263) Gassan Salama 2026-02-06 07:01:48 +00:00
  • 965525667b Onboard voyage-4-nano (#33720) chengchengpei 2026-02-05 22:23:34 -08:00
  • 6550815c3a [XPU]Replace pip in docker.xpu with uv pip (#31112) sihao_li 2026-02-06 14:02:33 +08:00
  • 7439e4f41b [XPU][4/N] add mxfp4 moe model support (#33679) Kunshang Ji 2026-02-06 13:03:59 +08:00
  • ac04dd374f [CPU] Add BF16 Kernel type for s390x (#33788) R3hankhan 2026-02-06 10:27:02 +05:30
  • 035a6cb09a [Misc] Update code for encoder-decoder models (#33900) Cyrus Leung 2026-02-06 11:38:39 +08:00
  • a32cb49b60 feat(frontend): early-fail tokenization guard for user requests (#31366) Mingliang Li 2026-02-06 11:38:02 +08:00
  • 20d7454c9b fix(ROCm): Make flash_attn import optional in MLA attention (#33511) Rabi Mishra 2026-02-06 07:52:53 +05:30
  • 5819ca8944 [Docs] Add reo analytics (#33957) Simon Mo 2026-02-05 17:42:22 -08:00
  • 79028d4388 [Perf] Disable clean_logits in deepgemm fp8_mqa_logits kernel (#33568) Xin Yang 2026-02-05 17:34:00 -08:00
  • 325ab6b0a8 [Feature] OTEL tracing during loading (#31162) emricksini-h 2026-02-06 01:59:28 +01:00
  • 91a07ff618 [Bugfix] Fix DeepSeek v3.2 tokenizer outputting None issue (#33832) Wei Zhao 2026-02-05 18:50:49 -05:00
  • d5c4800112 Adds padding and perf improvements to wvSplitK_fp8 (#33527) Hashem Hashemi 2026-02-05 14:16:02 -08:00
  • 42d5d705f9 [Minor] Sort safetensors files to ensure deterministic loading order (#33491) Lumosis 2026-02-05 14:05:09 -08:00
  • 116880a5a0 [Bugfix] Make MM batching more robust (#33817) Cyrus Leung 2026-02-06 04:40:58 +08:00
  • 4145e50d85 [Bugfix] Fix DSV3.2 NVFP4 (#33932) Matthew Bonanni 2026-02-05 14:22:19 -05:00
  • 20f5d185a6 [Misc] Rename translations to speech_to_text for OAI serving component (#33904) Nicolò Lucchesi 2026-02-05 20:16:52 +01:00
  • 1887acca9e Fix tokenizer test for renamed attr on Transformers v5 (#33902) Harry Mellor 2026-02-05 19:16:20 +00:00
  • 92e7562a99 [Bugfix] Suppress non-TTY color output on the process name part of the log (#29714) Tsukasa OI 2026-02-06 03:47:09 +09:00
  • 87d0d17ab5 [Models] Consolidate Deepseek-OCR2 processor (#33909) Isotr0py 2026-02-06 02:29:20 +08:00
  • a57c8228ff [Moe Refactor] Make Inplace Flag for FusedMoEModularKernel part of the constructor (#33375) bnellnm 2026-02-05 13:07:18 -05:00
  • 1ee95841bd [Bugfix] Fix swapped engine_ids in NIXL Llama 4 local attention path (#33795) zackyoray 2026-02-05 19:51:58 +02:00
  • 7d8c6804e2 [Misc] Add debug logs (#33931) Nicolò Lucchesi 2026-02-05 18:42:40 +01:00
  • af3162d3aa [Spec Decode] Unified Parallel Drafting (#32887) Benjamin Chislett 2026-02-05 12:37:18 -05:00
  • 5b2a9422f0 [BugFix] Fix LoRA Fp8 (#33879) danisereb 2026-02-05 19:25:55 +02:00
  • c1858b7ec8 [Feat][RL][1/2] Native Weight Syncing API: NCCL (#31943) Aaron Hao 2026-02-05 09:13:23 -08:00
  • 82914d2ae8 [Bugfix] Fix step3p5 parser when using mtp (#33690) Mario Hong 2026-02-06 00:04:04 +08:00
  • 81a90e5277 [Docs] Add bart-plugin to docs (#33905) Nicolò Lucchesi 2026-02-05 13:20:25 +01:00
  • 1c3a221d3b [Bugfix] Fix corner case of sparse embedding (#33886) wang.yuqi 2026-02-05 18:51:22 +08:00
  • 7bd42e609d [Refactor] Clean up input preprocessing (#33687) Cyrus Leung 2026-02-05 18:43:42 +08:00
  • a2522839d8 [Bugfix] Fix Kimi-K2.5 NVFP4 checkpoints weight loading (#33876) Isotr0py 2026-02-05 18:29:54 +08:00
  • 59a5cb387a [perf] Integrate flashinfer concat_mla_k (#31171) jiahanc 2026-02-05 18:23:11 +08:00
  • 8322d4e47f Enable Cross layers KV cache layout at NIXL Connector V2 (#33339) liranschour 2026-02-05 12:17:02 +02:00
  • 3e472e81f9 [ROCm][Bugfix][CI] Fix hybrid models and their tests (Mamba/Jamba/Bamba) (#32710) Andreas Karatzas 2026-02-05 04:01:23 -06:00
  • 038914b7c8 [Refactor] Move task outside of PoolingParams.verify (#33796) Cyrus Leung 2026-02-05 17:33:11 +08:00