Commit Graph

  • 1c46dea001 Revert "[Kernels][FI] Skip trtllm attention when num_kv_heads=1 (#308… (#31617) shyeh25 2026-01-11 04:39:59 +08:00
  • 028599739d [BugFix] scheduler: Fix resuming of preempted requests after async load (#31583) Or Ozeri 2026-01-10 22:39:25 +02:00
  • d1fd802fa3 fused_moe_kernel - cast accumulator after applying router weights (#32002) gnovack 2026-01-10 12:36:45 -08:00
  • 543c23be78 [LoRA][Perf] Improve FusedMoE LoRA performance for small rank (#32019) Xin Yang 2026-01-10 11:04:18 -08:00
  • b8bf5c45bb [Kernel] Optimize Sliding Window Attention in 3D Triton Kernel (#31984) jvlunteren 2026-01-10 19:13:44 +01:00
  • e6c6f2c79d [Quant] Support MXFP4 W4A16 for compressed-tensors dense models (#31926) Michael Goin 2026-01-10 09:44:35 -05:00
  • 07286ec5a6 [Bugfix] Fix integer overflow in Gemma3n audio processing (#31657) Jeremy Teboul 2026-01-10 01:52:53 -08:00
  • 14fc7a68c7 [Bugfix] fix offline chat output prompt (#32076) Ning Xie 2026-01-10 15:50:57 +08:00
  • 5f2385a4c8 [Benchmark][1/2] Generalize SLA criterion validation from binary flags to margins (#32075) Cyrus Leung 2026-01-10 15:11:03 +08:00
  • a01a1c0d69 [Bugfix] fix encoder cache leak of waiting requests in scheduler to solve stuck in CPU scheduling (#31857) Frelam 2026-01-10 14:27:58 +08:00
  • da6709c9fe [Misc] Delay deprecation of CommonAttentionMetadata properties (#32074) Lucas Wilkinson 2026-01-10 00:06:44 -05:00
  • d83becd503 [ROCm][CI] Fix flaky test_function_calling_with_stream and reduce schema test examples (#32063) Andreas Karatzas 2026-01-09 23:02:35 -06:00
  • 0c9614876e Update modelopt KV cache quantization resolution to new scheme (#31895) roikoren755 2026-01-10 06:54:13 +02:00
  • 583a90e005 [Refactor] Separate sequence and token pooling types (#32026) Cyrus Leung 2026-01-10 12:53:24 +08:00
  • 52d428295d [Core] Refactor ColumnParallelLinear: remove unused parameter and optimize forward (#31939) maang 2026-01-10 12:19:49 +08:00
  • c60578de0a [Bugfix][Hardware][AMD] Use dynamic WARP_SIZE in sampler vectorized_process (#31295) Kevin McKay 2026-01-09 21:57:38 -06:00
  • 80fead8bf6 Fuse RoPE and MLA KV-cache write (#25774) PatrykSaffer 2026-01-10 04:18:37 +01:00
  • e45946bd91 feature/issac 0.2 (#31550) Akshat Shrivastava 2026-01-09 19:18:05 -08:00
  • ea6d067a2a [Misc][LLaMa4] Compile LLaMa Vision Encoder (#30709) Lucas Kabela 2026-01-09 19:01:38 -08:00
  • abd9224280 resolve pydantic error in startup benchmark (#31348) Ning Xie 2026-01-10 10:41:27 +08:00
  • 4dc0d606b7 [Bugfix] Narrow broad exceptions in compilation backends (#31616) Kevin McKay 2026-01-09 20:39:22 -06:00
  • ac0675ff6b [CI] Allow Deprecated Quantization For LM Eval Tests (#32065) Micah Williamson 2026-01-09 20:10:47 -06:00
  • e18464a57d [Perf] Optimize async scheduling placeholder using empty (#32056) Wentao Ye 2026-01-09 19:46:11 -05:00
  • 1963245ed1 [Core] Use weights_only=True with torch.load (#32045) Russell Bryant 2026-01-09 19:28:57 -05:00
  • 0308901975 [2/N][Attention] Fix pre-commit errors (#32052) Matthew Bonanni 2026-01-09 19:27:15 -05:00
  • aaf4b70aae [Misc][BE] Type coverage for vllm/compilation [2/3] (#31744) Lucas Kabela 2026-01-09 15:30:38 -08:00
  • 3adffd5b90 [Misc] Enable async scheduling by default with spec decoding (#31998) Nick Hill 2026-01-09 15:09:34 -08:00
  • 97ba96fbe9 [perf][async] support non cpu sync get logprob tensors for spec (#31336) zhrrr 2026-01-10 05:24:51 +08:00
  • 94578127a4 [NIXL] refine decoder side post process for heterogeneous BlockSize and kv_layout (#30275) Chendi.Xue 2026-01-09 15:22:19 -06:00
  • 2612ba9285 [1/N][Attention] Restructure attention: move files (#31916) Matthew Bonanni 2026-01-09 16:10:24 -05:00
  • 1f8b7c536b [responsesAPI] fix incomplete_messages for simple/parsable context (#31836) Andrew Xia 2026-01-09 16:00:57 -05:00
  • 0a0aa07747 [Quant] Make static quant support all group shapes (#30833) Lucas Wilkinson 2026-01-09 15:49:27 -05:00
  • f9e2a75a1e [fix] add cutedsl to global sf (#32001) jiahanc 2026-01-09 12:03:02 -08:00
  • a4d5d663e2 Add unpermute-aware fused MoE path and small-batch fallback (#29354) Runkai Tao 2026-01-09 14:58:39 -05:00
  • 657e9c0e18 [Fix] Introduce audio channels spec (#31595) Jeremy Teboul 2026-01-09 11:34:51 -08:00
  • 308feab33f [Perf] Optimize cutlass moe problem size calculation, 5.3% E2E Throughput improvement, 2.2% TTFT improvement (#31830) Wentao Ye 2026-01-09 14:13:43 -05:00
  • 28ae32a5d3 [Refactor] Remove numpy split in async scheduling (#32034) Wentao Ye 2026-01-09 14:09:02 -05:00
  • f32c629eb4 [Frontend][gpt-oss] Allow system message to overwrite model identity (#31737) Andrew Xia 2026-01-09 14:03:57 -05:00
  • cd4a95e3aa [Feat][Core] Support multiple KV cache groups in Hybrid KV Coordinator (#31707) Yifan Qiao 2026-01-09 10:53:20 -08:00
  • d5ec6c056f [UX] Add vLLM model inspection view (#29450) Michael Goin 2026-01-09 12:12:35 -05:00
  • 08d954f036 [Doc] Add developer guide for CustomOp (#30886) Shanshan Shen 2026-01-10 00:21:11 +08:00
  • ac9f9330e6 Rename --exclude-log-deltas to --enable-log-deltas (#32020) Kevin Šuc 2026-01-09 16:30:40 +01:00
  • 2d0c5b630e [Doc] Remove hardcoded Whisper in example openai translation client (#32027) Isotr0py 2026-01-09 22:44:52 +08:00
  • 34cd32fe30 [Perf][Kernel] Fused SiLU+Mul+Quant kernel for NVFP4 cutlass_moe (#31832) Michael Goin 2026-01-09 09:40:33 -05:00
  • 8e27663b6a [CPU] Add head sizes 80 and 112 with vec16 fallback (#31968) R3hankhan 2026-01-09 19:44:46 +05:30
  • 7cdf7e2fe0 [Model] Remove redundant None check in DeepSeekOCR image input processing (#32016) maang 2026-01-09 22:12:44 +08:00
  • bbf80ede43 Fix type error (#31999) Adolfo Victoria 2026-01-09 06:03:32 -08:00
  • 4505849b30 [ROCm][PD] add moriio kv connector. (#29304) inkcherry 2026-01-09 22:01:57 +08:00
  • db07433ce5 [Misc] Skip hashing kwargs if value is None (#32025) Roger Wang 2026-01-09 05:20:59 -08:00
  • e02706d2d2 [ROCm][CI][V1] Fix nixl_connector test failure and achieve CUDA parity in test_async_scheduling (#32000) Andreas Karatzas 2026-01-09 06:48:32 -06:00
  • b474782ad7 [Feature][Benchmarks] Custom dataset: read output length from dataset (#31881) Sophie du Couédic 2026-01-09 13:40:59 +01:00
  • 55212c1404 fix: remove duplicate engine_id check in nixl_connector (#31948) Bofeng Xue 2026-01-09 20:13:17 +08:00
  • e7b68f4d6c [Bugfix] Fix Triton FusedMoE LoRA (#30585) Xin Yang 2026-01-09 03:46:59 -08:00
  • 1a19e9cd87 [Bugfix][ROCm]Fix Qwen3-Next-80B-A3B-Thinking inference and optimize non-standard block size (544) support under rocm_atten (#31380) vllmellm 2026-01-09 12:28:02 +01:00
  • c8ed39b9dd [Model] Reorganize pooling layers (#31973) Cyrus Leung 2026-01-09 19:02:14 +08:00
  • 020732800c [Bugfix] Fix OpenAPI schema test failures (#31921) Andreas Karatzas 2026-01-09 04:56:20 -06:00
  • dc77cb7129 [Bugfix] Fix Var Length Batched Padding in Granite Speech (#31906) Alex Brooks 2026-01-09 03:28:43 -07:00
  • bde38c11df fix lora moe sharding when rank < max_lora_rank (#31994) gnovack 2026-01-08 22:43:25 -08:00
  • 707b240d7e [Bugfix] Fix FusedMoE LoRA w2_output_size (#31949) Xin Yang 2026-01-08 21:54:05 -08:00
  • 29ce48221c [Cleanup] Remove obsolete spec decoding compatibility logic (#32003) Nick Hill 2026-01-08 21:44:18 -08:00
  • 7a05d2dc65 [CI] [ROCm] Fix tests/entrypoints/test_grpc_server.py on ROCm (#31970) TJian 2026-01-09 12:54:20 +08:00
  • a1648c4045 [ROCm][CI] Fix test_token_classification.py::test_bert_models (#31993) Divakar Verma 2026-01-08 22:04:33 -06:00
  • e2d49ec2a4 [Bugfix] missing tokens occur in harmony streaming (#30437) RioS 2026-01-09 12:59:34 +09:00
  • 8413868dab [Bugfix] Fix typo in FusedMoE LoRA reshape comment (#31992) Xin Yang 2026-01-08 18:46:05 -08:00
  • 8ff4a99566 [Async][Feat] support apply penalty or bad_words for async + spec (#30495) zhrrr 2026-01-09 10:31:50 +08:00
  • a4ec0c5595 [Frontend] Add MCP tool streaming support to Responses API (#31761) daniel-salib 2026-01-08 17:19:34 -08:00
  • 0fa8dd24d2 [Bugfix] Fix Typo from NVFP4 Refactor (#31977) Robert Shaw 2026-01-08 19:18:50 -05:00
  • 6ebe34d6fa [Feature] Add iteration level logging and enhance nvtx marker (#31193) Max Hu 2026-01-08 19:13:39 -05:00
  • 11cec296dd [BugFix] Add spec-decode-incompatible request param validation (#31982) Nick Hill 2026-01-08 16:08:21 -08:00
  • 5825bbc1f7 [Quantization] Deprecate Long Tail of Schemes (#31688) Robert Shaw 2026-01-08 19:07:45 -05:00
  • d62cfe546d [MoE Refactoring][Bugfix]Wrap WNA16 Triton kernel into mk and change compressed tensor kernel selection (#31752) Yongye Zhu 2026-01-08 16:01:30 -08:00
  • 6cdf015c3c [Misc] Fix Current vLLM config is not set. warnings, assert to avoid issues in the future (#31747) Lucas Wilkinson 2026-01-08 18:20:49 -05:00
  • 5d3b6097ad [Compressed-Tensors] Simplify NVFP4 Conditions, enable marlin support for NVFP4A16 MoEs (#30881) Dipika Sikka 2026-01-08 17:45:17 -05:00
  • e74698c27a [Misc][Refactor] Add FusedMoERouter object (#30519) bnellnm 2026-01-08 15:52:55 -05:00
  • aa125ecf0e [Frontend] Improve error message (#31987) Cyrus Leung 2026-01-09 04:07:03 +08:00
  • f16bfbe5bc [Documentation][torch.compile] Add documentation for torch.compile + multimodal encoders (#31627) Lucas Kabela 2026-01-08 11:33:24 -08:00
  • 87e07a6b46 Revert "feat(moe): Add is_act_and_mul=False support for Triton MoE kernels" (#31978) Michael Goin 2026-01-08 14:31:53 -05:00
  • 7508243249 [Model Runner V2] Simplify BlockTables with UVA (#31965) Woosuk Kwon 2026-01-08 10:24:26 -08:00
  • 83e1c76dbe [CI][ROCm] Fix NIXL tests on ROCm (#31728) Nicolò Lucchesi 2026-01-08 18:34:43 +01:00
  • a563866b48 Fix ijson build for Power. (#31702) Nishidha Panpaliya 2026-01-08 22:42:33 +05:30
  • a3d909ad2b [Misc] Tidy up some spec decode logic in GPUModelRunner (#31591) Nick Hill 2026-01-08 09:10:07 -08:00
  • 49568d5cf9 [Doc] Improve MM models LoRA notes (#31979) Jee Jee Li 2026-01-09 00:55:22 +08:00
  • b8112c1d85 [Bugfix] Fix vllm serve failure with Nemotron Nano V3 FP8 (#31960) danisereb 2026-01-08 18:08:37 +02:00
  • eaba8ece77 [Bugfix]: Fix Step3ReasoningParser missing is_reasoning_end_streaming (#31969) Chauncey 2026-01-08 23:28:13 +08:00
  • fe86be66c5 [Model] Support IQuestCoder model (#31575) yxing-bj 2026-01-08 22:42:57 +08:00
  • 1da3a5441a [Docs]: update claude code url (#31971) Chauncey 2026-01-08 22:04:55 +08:00
  • 72c068b8e0 [CI] [Bugfix] Fix unbounded variable in run-multi-node-test.sh (#31967) TJian 2026-01-08 21:42:01 +08:00
  • 7645bc524b [OpenAI] Fix tool_choice=required streaming when output has trailing extra data (#31610) Mary 2026-01-08 13:01:42 +00:00
  • 1123a87892 [Model] Enable LoRA support for Pixtral (#31724) Ce Zhao 2026-01-08 08:00:57 -05:00
  • 03fd76c570 [Model] Add LFM2-VL model support (#31758) tianshu-Michael-yu 2026-01-08 05:00:27 -08:00
  • 59d260f5e4 [Model] Add Grok-2 (#31847) Bijaya Dangol 2026-01-08 13:59:48 +01:00
  • 18d4e481d0 [Voxtral] Fix speech transcription api (#31388) Patrick von Platen 2026-01-08 12:34:19 +02:00
  • 2972a05473 [MM Encoder]: Make MMEncoderAttention's scale takes effect properly (#31950) Isotr0py 2026-01-08 18:33:48 +08:00
  • 5576227bc1 [Model] Standardize common vision encoders (#31947) Cyrus Leung 2026-01-08 18:33:16 +08:00
  • d1b6fe007f [Chore] Further cleanup pooler (#31951) Cyrus Leung 2026-01-08 18:16:21 +08:00
  • 04a49669d1 RayLLM Bugfix - Preserve obj store URL for multi engine_config creation (#30803) omer-dayan 2026-01-08 12:00:25 +02:00
  • 96fcd3c267 [Misc] Support qwen3-next lora (#31719) BingjiaWang 2026-01-08 17:27:50 +08:00
  • 1f214290d6 fix(compile): apply partition wrapper when loading AOT cached functions (#31536) DevByteAI 2026-01-08 11:27:26 +02:00
  • 8cbdc7eb94 [CI/Build] Enable test_kv_cache_events_dp for AMD (#31834) Ryan Rock 2026-01-08 03:00:24 -06:00
  • b634e619bb Decouple page_size_bytes calculation in AttentionSpec for TPU/RPA Compatibility. (#31635) Lumosis 2026-01-08 01:00:07 -08:00