Commit Graph

  • c59a0eca42 [KV offload][4/N] Offloading KV connector (#22595) Or Ozeri 2025-09-19 22:07:17 +03:00
  • b716ab93a7 [bugfix] fix structured outputs key missing issue from #24929 (#25195) Lucia Fang 2025-09-19 11:37:57 -07:00
  • 138f0d1e75 [Docs] add __init__.py to vllm/model_executor/layers/quantization/compressed_tensors/transform (#24974) samzong 2025-09-20 02:32:27 +08:00
  • 2506ce5189 [Core][Prefix Hash] Fix prefix hash metrics sliding window maintainance (#24990) Jialin Ouyang 2025-09-19 11:22:53 -07:00
  • 47fd08aaf9 [CI/Build] fix test function_calling (#25072) Chauncey 2025-09-20 02:16:32 +08:00
  • 12aed7e453 Encoder model support for the Transformers backend (#25174) Harry Mellor 2025-09-19 19:15:22 +01:00
  • d90e212a3a Remove Redundant Assignment in Qwen3_VisionPatchMerger (#25224) LJH-LBJ 2025-09-20 02:15:13 +08:00
  • 2821986450 [Core] Modify the initialization parameters of the lora manager (#25249) Jee Jee Li 2025-09-20 02:01:28 +08:00
  • 6c117cff7d [Frontend] Pass API server count to each process (#23717) Cyrus Leung 2025-09-20 01:15:19 +08:00
  • 7ac67ea525 [KV offload][3/N] Add worker-side CPU support (#21448) Or Ozeri 2025-09-19 19:53:45 +03:00
  • ce75e15373 refactor(benchmarks): add type annotations to wait_for_endpoint parameters (#25218) samzong 2025-09-20 00:36:52 +08:00
  • aed16879a9 Move ModelConfig from config/__init__.py to config/model.py (#25252) Harry Mellor 2025-09-19 17:22:33 +01:00
  • cf278ff3b2 Update CODEOWNERS (#25269) Harry Mellor 2025-09-19 17:12:55 +01:00
  • 838d7116ba [Qwen] Remove cuda hard-code in qwen3 next (#25243) Icey 2025-09-19 20:25:12 +08:00
  • 5089fd749c [V0 Deprecation] Remove V0 logic from get_input_embeddings interface (#25242) Cyrus Leung 2025-09-19 19:10:52 +08:00
  • a3d087adec [P/D][Nixl] Introduce KVTransferMetrics and aggregation strategy (#22188) Nicolò Lucchesi 2025-09-19 13:09:14 +02:00
  • 058525b997 Move PoolerConfig from config/__init__.py to config/pooler.py (#25181) Harry Mellor 2025-09-19 12:02:55 +01:00
  • 1dfea5f4a9 [Bugfix][Perf] Misc fixes for Qwen3 VL (#25238) Roger Wang 2025-09-19 03:46:16 -07:00
  • cea91a32f2 [Kernel][Performance] Add Triton kernel for Qwen3-VL interleaved MRoPE (#25055) Isotr0py 2025-09-19 18:27:49 +08:00
  • a684c0124c [bugfix] fix MHA for models like OpenGVLab/InternVL3_5-38B (#25146) Yan Ma 2025-09-19 16:45:06 +08:00
  • f2718d2948 [Misc] Cleanup test conftest for deprecated encoder-decoder models (#25231) Isotr0py 2025-09-19 15:44:56 +08:00
  • 825fdb11ad [Bugfix][CPU] Add placeholder to avoid import errors when using fused_moe ops on platforms without triton (#25137) Li, Jiang 2025-09-19 15:41:12 +08:00
  • 8c1d4acbfe [CPU] Disable oneDNN linear on non-x86 platforms (#25166) Li, Jiang 2025-09-19 15:27:22 +08:00
  • 486c5599e3 [Build] Update Xgrammar to 0.1.24 to get a CVE fix (#25188) Russell Bryant 2025-09-19 02:27:17 -04:00
  • a6149aa587 [OOT] Support sync_model_loading for OOT (#25126) Chendi.Xue 2025-09-19 00:41:53 -05:00
  • 6c8a3c099b [Docs] Fix griffe warnings in vllm/multimodal (#25216) Michael Yao 2025-09-19 13:10:44 +08:00
  • 31a8a2a7bc [Misc] Clean up MM profiling warnings (#25222) Roger Wang 2025-09-18 21:46:57 -07:00
  • 1a0a04dae9 [Perf] Optimize memory peak during EAGLE model loading. (#24585) Chen Ding 2025-09-19 11:31:16 +08:00
  • 6d8246aaff [gpt-oss] Add ResponseReasoningPartAddedEvent, ResponseReasoningPartDoneEvent for streaming (#24938) Andrew Xia 2025-09-18 19:11:59 -07:00
  • 9d1c50a5ac [KV offload][2/N] Introduce LRU-based CPU offloading management (#20075) Or Ozeri 2025-09-19 03:20:51 +03:00
  • 9a4600e4dc [CORE] Prompt Embeddings Support for v1 Engine (#24278) Andrew Sansom 2025-09-18 19:03:09 -05:00
  • 9fac6aa30b [BugFix] Fix DeepGEMM warmup, no m.weight_scale_inv (#25206) Lucas Wilkinson 2025-09-18 17:26:28 -04:00
  • a53ad626d6 [KV offload][1b/N] rename offloading to kv_offload (#25191) Or Ozeri 2025-09-18 23:53:52 +03:00
  • 1c3dad22ff [V0 Deprecation] Remove unused async_timeout.py (#25190) Woosuk Kwon 2025-09-18 13:35:21 -07:00
  • d2a30a2d93 [Bug] Fix torch Compilation Cache Hit Error (#25093) Wentao Ye 2025-09-18 15:38:37 -04:00
  • 75fb112d80 [Bug] Fix returned_lse not Defined issue (#25106) Wentao Ye 2025-09-18 15:32:24 -04:00
  • 38db529f66 [feat]: Create interface for model-specific M-RoPE (#24194) Aziz 2025-09-18 21:18:56 +02:00
  • 064cac7bb7 [fix]: remove data type hardcoding from gptoss model implementation (#23807) Nikhil Gupta 2025-09-18 19:15:23 +01:00
  • e19bce40a1 [V0 Deprecation] Remove AsyncLLMEngine (#25025) Woosuk Kwon 2025-09-18 11:07:42 -07:00
  • 505805b645 [KV offload][1/N] Introduce an offloading component (#19848) Or Ozeri 2025-09-18 20:57:07 +03:00
  • bbdc0f2366 [ROCm][AITER][Bugfix] Switch AITER to use PIECEWISE_AND_FULL compilation (#25104) Rohan Potdar 2025-09-18 12:46:47 -05:00
  • dc34059360 [ROCm][CI/Build] Use ROCm7.0 as the base (#25178) Gregory Shtrasberg 2025-09-18 12:36:55 -04:00
  • c4cb0af98a [spec decode] Fix MTP inference path for MiMo-7B model (#25136) qizixi 2025-09-18 09:12:19 -07:00
  • 1c3b1634aa [Misc] Add codeowner for Transformers backend (#25180) Harry Mellor 2025-09-18 17:01:50 +01:00
  • 2ea50e977a Enable Allgather/ReduceScatter backend for NaiveAllToAll (#23964) Shu Wang 2025-09-18 10:52:58 -05:00
  • b419937c78 [Docs] Fix warnings in mkdocs build (continued) (#25163) Hyogeun Oh (오효근) 2025-09-19 00:23:26 +09:00
  • 5f696c33b1 [New Model] Support BertForTokenClassification / Named Entity Recognition (NER) task (#24872) wang.yuqi 2025-09-18 23:22:01 +08:00
  • 67244c86f0 feat(api): Return 503 on /health when engine is dead (#24897) dongbo910220 2025-09-18 22:29:40 +08:00
  • 072d7e53e5 [PERF] Add conv1d metadata to GDN attn (#25105) Vadim Gimpelson 2025-09-18 18:27:49 +04:00
  • 01a583fea4 [Kernel] Decouple Tile Size from Block Size in Triton Unified Attention Kernel (#21197) jvlunteren 2025-09-18 16:27:01 +02:00
  • bc19d75985 [Misc] Add kv-connector label (#25156) Nicolò Lucchesi 2025-09-18 15:56:07 +02:00
  • fbd6523ac0 Refactor dense FP8 tensor/channel/block utils and add CT FP8 block (#21404) Michael Goin 2025-09-18 08:53:45 -04:00
  • 470484a4f5 [Structured Output][Refactor] Move apply_grammar_bitmask() method from ModelRunner to structured output utils (#21999) Shanshan Shen 2025-09-18 20:44:31 +08:00
  • 21da73343a [Misc] Clean up flags in vllm bench serve (#25138) Roger Wang 2025-09-18 05:43:33 -07:00
  • 66072b36db [Bugfix][Mamba] - Fix Conv State Kernel FP32 Support (#24883) Asaf Joseph Gardin 2025-09-18 15:21:17 +03:00
  • 3ed1ec4af2 Fix validate-config pre-commit check (#25157) Harry Mellor 2025-09-18 13:06:28 +01:00
  • 5a33ae9a3f Fix forward reference warning in documentation (#25150) Harry Mellor 2025-09-18 12:41:41 +01:00
  • c9ff9e6f0c [Docs] add the parallel sampling usage in LLMEngine and AsyncLLM (#24222) William Song 2025-09-18 20:37:08 +09:00
  • eaffe4486c [Docs] Fix pooling-params doc references in openai_compatible_server.md (#24939) Kay Yan 2025-09-18 19:36:47 +08:00
  • 8ed039d527 Move StructuredOutputsConfig from config/__init__.py to config/structured_outputs.py (#25153) Harry Mellor 2025-09-18 12:24:27 +01:00
  • 37970105fe [Model] Improve Pooling Model (#25149) Jee Jee Li 2025-09-18 19:04:21 +08:00
  • cc935fdd7e [Frontend] Support setting logprobs to -1 (#25031) Chauncey 2025-09-18 18:34:42 +08:00
  • abdfcd4f3d silu-v1: Fix EPS not being used during max-reduction (#25069) Elvir Crnčević 2025-09-18 12:25:12 +02:00
  • 4f02b77de4 Fix: Add explicit #include <omp.h> for OpenMP compatibility on certain toolchains (#24951) ihb2032 2025-09-18 17:43:23 +08:00
  • 29283e8976 [Chore] Cleanup guided namespace, move to structured outputs config (#22772) Aaron Pham 2025-09-18 05:20:27 -04:00
  • 05b044e698 [Doc] Fix cross-reference warnings (#25058) Punitvara 2025-09-18 14:35:16 +05:30
  • aa3f105c59 Add 'path' option to ImagePrompt data_format (#25081) Gerard Finol 2025-09-18 11:02:14 +02:00
  • ef7eefe17a [Qwen] Add fp8 checkpoint support for qwen3-next. (#25079) Tao He 2025-09-18 16:16:04 +08:00
  • 350c94deb3 [Bugfix] when use s3 model cannot use default load_format (#24435) rongfu.leng 2025-09-18 15:47:43 +08:00
  • f4cd80f944 Retrieve sliding_window from text config in Gemma3 MM (#25085) Harry Mellor 2025-09-18 07:29:05 +01:00
  • 349e0e3462 [Docs] Fix API Reference (#25140) Harry Mellor 2025-09-18 07:23:29 +01:00
  • 81b16a2bc9 [Kernel] Better inf handling for grouped topk cu (#24886) Lumina 2025-09-18 13:53:55 +08:00
  • e111d5b0ae [CLI] Use streaming in CLI chat and completion commands (#23769) Simon Mo 2025-09-17 22:30:26 -07:00
  • a904ea78ea [benchmark] add peak throughput metrics and plot (#23867) Simon Mo 2025-09-17 22:30:02 -07:00
  • b7433ca1a4 [Spec Decode] Efficient padded speculation (#24539) Benjamin Chislett 2025-09-18 01:07:24 -04:00
  • 5c65a72bb1 [V0 Deprecation] Remove more V0 tests (#25117) Woosuk Kwon 2025-09-17 22:05:25 -07:00
  • 9d8a2d86d2 [EPLB] Add EPLB support for hunyuan_v1 (#23078) YiwenC 2025-09-17 21:51:35 -07:00
  • 3bc18127ff [XPU] Whisper model support on XPU Platform (#25123) Chaojun Zhang 2025-09-18 12:30:10 +08:00
  • bec060fd99 Mark prompt logprobs as incompatible with prompt embeds at API level (#25077) Andrew Sansom 2025-09-17 23:25:07 -05:00
  • 52bc9d5b3e [Model] enable data parallel for InternVL vision encoder (#23909) YiwenC 2025-09-17 21:11:46 -07:00
  • dc2979c585 [Kernels] Overlap shared experts with combine instead of dispatch (#24254) bnellnm 2025-09-18 00:10:21 -04:00
  • 027d37df38 [Bugfix][Qwen3-Next] add prefixes to shared_expert in qwen3-next and mlp in qwen2moe to successfully load ignored params in quantized models (#24960) toncao 2025-09-18 11:08:50 +07:00
  • b98219670f [Core][MM] Cleanup MultiModalCache (#25006) Lukas Geiger 2025-09-18 05:08:41 +01:00
  • 32baf1d036 [Docs] Clean up the contributing README (#25099) Harry Mellor 2025-09-18 05:05:18 +01:00
  • 3127274d02 [MM Encoder] Apply DP ViT for Qwen3-VL model series (#24955) Roger Wang 2025-09-17 21:04:21 -07:00
  • 4ac510f484 [Kernels] Enable DeepGEMM by default (#24462) bnellnm 2025-09-17 23:19:52 -04:00
  • 7fb2a5be28 [V0 Deprecation] Skip PP test (#25128) Woosuk Kwon 2025-09-17 20:18:36 -07:00
  • 6c036615dc [V0 Deprecation] Remove misc V0 tests (#25118) Woosuk Kwon 2025-09-17 19:41:55 -07:00
  • 2fc24e94f9 [V0 Deprecation] Remove V0 Tracing & Metrics tests (#25115) Woosuk Kwon 2025-09-17 19:40:44 -07:00
  • 2c3c1bd07a [V0 Deprecation] Remove V0 Engine tests (#25114) Woosuk Kwon 2025-09-17 19:38:09 -07:00
  • 5963b98b46 [Kernel] Delegate construction of FusedMoEQuantConfig to FusedMoEMethodBase subclasses (#22537) bnellnm 2025-09-17 19:43:31 -04:00
  • e6585ddb45 [Bugfix] Fix accuracy issue for silu_mul + nvfp4 quant fusion kernel (#24833) elvischenv 2025-09-18 07:37:23 +08:00
  • 2a4d6412e6 Add a batched auto tune script (#25076) Karan Goel 2025-09-17 15:41:18 -07:00
  • e67a79db03 [Bugfix] Refactor Flashinfer TRTLLM attention kernel selection logic (#24600) elvischenv 2025-09-18 06:36:29 +08:00
  • 9f882d8791 Disable failing GPT-OSS Eval (Blackwell) for now (#25107) Michael Goin 2025-09-17 18:36:00 -04:00
  • 1a456c7c90 Aiter mha fp8 fix (#24991) Douglas Lehr 2025-09-17 17:29:14 -05:00
  • fedb75fa27 [Bugfix][B200] Fix cutlass_mla hang (#24966) Alexander Matveev 2025-09-17 18:06:38 -04:00
  • bff2e5f1d6 [gpt-oss][2] fix types for streaming (#24556) Andrew Xia 2025-09-17 15:04:28 -07:00
  • 3c068c637b [Kernel] Faster pre-processing time for W4A8 (#23972) czhu-cohere 2025-09-17 17:35:32 -04:00
  • f20c3b0951 [BUG] Exclude .pth files when pulling remote files (#25092) ahao-anyscale 2025-09-17 13:42:09 -07:00