Commit Graph

  • e94cfd51da [BUG] Qwen3-next MTP. Fix attn metadata build bug (#26564) Vadim Gimpelson 2025-10-10 22:59:03 +04:00
  • 7c12763b24 Fix some typing issues found by mypy==1.18.2 (#26596) Harry Mellor 2025-10-10 19:21:25 +01:00
  • b8b302cde4 Update CUDA architecture list in build pipeline for 12.9.1 wheels (#26592) (ref: v0.11.0) Will Eaton 2025-10-10 14:15:27 -04:00
  • 3b780a4bbb Update CUDA architecture list in build pipeline for 12.9.1 wheels (#26592) Will Eaton 2025-10-10 14:15:27 -04:00
  • 30f78af147 Update pre-commit hook versions (#26591) Harry Mellor 2025-10-10 18:03:44 +01:00
  • 19a9b169bf Add Qwen3-Omni moe thinker (#25550) Xiong Wang 2025-10-11 01:00:56 +08:00
  • 96ad65b7fe [Transform] [Quantization] Add QuTLASS support to vLLM (#24440) Roberto L. Castro 2025-10-10 18:43:40 +02:00
  • 8d2b8c0ff2 [Model] Add FlexOlmo model implementation (#24923) Shane A 2025-10-10 09:43:15 -07:00
  • b2155ed317 [Model][Qwen3VL] Compute cu_seqlens on CPU to remove (#26496) Lukas Geiger 2025-10-10 17:42:17 +01:00
  • 910abdbd08 [Bugfix] fixed top_logprobs: -1 does not appear to work as intended (#26470) Chauncey 2025-10-11 00:41:17 +08:00
  • cddce79fda [torch.compile] Make inductor partition rules respect splitting_ops #25691 (#25845) baonudesifeizhai 2025-10-10 12:35:28 -04:00
  • e519281920 [Metrics] Add test for multi-modal cache stats logging (#26588) Mark McLoughlin 2025-10-10 17:00:50 +01:00
  • 7b03584de8 Silu v2 (#25074) Elvir Crnčević 2025-10-10 17:19:53 +02:00
  • ae9d0e7da5 [Bugfix] Make DP padding optional in coordinate_batch_across_dp (#26375) Sage Moore 2025-10-10 07:53:33 -07:00
  • 0e67102d93 Added test_top_k_per_row to test-pipeline.yaml. (#26569) Daniel Cámpora 2025-10-10 16:48:33 +02:00
  • f4ba2061cf [BugFix][torch.compile] Fix fused_scaled_matmul_reduce_scatter signature for PyTorch 2.8 (#26038) Jason Li 2025-10-10 10:42:13 -04:00
  • 1e6848a65d [CI] fix test_run_batch.py::test_completions - AssertionError (#26578) Chauncey 2025-10-10 22:16:28 +08:00
  • 67661375fa [BugFix] Fix noop elimination edge case (#26394) Andy Lo 2025-10-10 14:33:04 +01:00
  • 213b64452a [Bugfix] Convert untraceable GroupShape to list for AMD impl (#26535) Lucas Kabela 2025-10-10 06:32:29 -07:00
  • 784c231151 [NIXL] Ignore abort on already-finished request (#25067) Mark McLoughlin 2025-10-10 11:21:56 +01:00
  • 606b00e80f [bugfix][DCP] fix block_size of hash in DCP prefix caching (#26296) Chen Zhang 2025-10-10 18:02:49 +08:00
  • 720d3cd0f0 [CI] fix ruff format (#26579) Chauncey 2025-10-10 18:02:12 +08:00
  • ab196edefb Remove LoRA bias support (#25807) Ashwin Phadke 2025-10-10 15:20:33 +05:30
  • 3ee202ea1e [GPT-OSS] Add support for arrays at tool message content (#25593) Luis Tomas Bolivar 2025-10-10 11:00:45 +02:00
  • ad430a67ca [Metrics] Log multi-modal cache stats and fix reset (#26285) Cyrus Leung 2025-10-10 16:45:55 +08:00
  • 6f0f570c43 [deepseek] kernel block size for UniformTypeKVCacheSpecs (#26559) Chen Zhang 2025-10-10 16:40:41 +08:00
  • b545a0b207 fix test_simple_inductor_graph_partition (#26522) Boyuan Feng 2025-10-09 23:39:19 -07:00
  • 29255cfc3b [Spec-Decode] Support piecewise cudagraphs for Eagle head (#25109) Lucas Wilkinson 2025-10-10 01:20:31 -04:00
  • da4455609d [Chore]: One pythonic tool parser test uses the wrong parser (#26515) Ben Browning 2025-10-10 00:03:55 -04:00
  • aafb99a4d4 [Core] Small simplification in GPUModelRunner._update_states() (#26508) Nick Hill 2025-10-09 19:53:58 -07:00
  • 757fa4a4da [DP][ray] Support different VLLM_RAY_DP_PACK_STRATEGY (#23849) Rui Qiao 2025-10-09 19:53:43 -07:00
  • c6187f55f7 Refactor MistralTokenizer (#26358) Julien Denize 2025-10-10 00:48:58 +02:00
  • 8983e0216f [CI] Fix Pre-commit Issue Cannot determine type of "rank" and "world_size" (#26448) Wentao Ye 2025-10-09 18:16:48 -04:00
  • 1ee35382cb [Bug] Fix modular_kernel: ZeroDivisionError: integer division or modulo by zero (#26528) Wentao Ye 2025-10-09 18:13:27 -04:00
  • 6e783bc54b [Bugfix] Fix CUDA graph selection bug in FlashInfer at high concurrency (#26499) Benjamin Chislett 2025-10-09 17:12:34 -04:00
  • c9d33c60dc [UX] Add FlashInfer as default CUDA dependency (#26443) Michael Goin 2025-10-09 17:10:02 -04:00
  • 2e54db4d2b [Core] Remove unused prev_sampled_token_ids_invalid_indices input batch field (#26514) Nick Hill 2025-10-09 13:22:14 -07:00
  • 44f633dba1 [Flashinfer][gpt-oss] Support FP8-qkv Flashinfer TRTLLM Sinks Attention (#25674) elvischenv 2025-10-10 04:13:39 +08:00
  • a462331e36 [Bugfix] Disable moe inplace for torch >= 2.9 (#26497) bnellnm 2025-10-09 14:07:38 -04:00
  • 4069db3f2e [Bugfix] Enable padded FP4 quantization (#25947) roikoren755 2025-10-09 20:59:41 +03:00
  • 0d37450eb7 [BUGFIX] Add cu_tokens_across_sp to DPMetadata (#26457) Sage Moore 2025-10-09 10:13:56 -07:00
  • 47e66c24e2 [Model] Apply shared experts overlap optimization to all models with shared experts (#26145) bnellnm 2025-10-09 11:31:04 -04:00
  • 3b736e1c38 [Attention][DCP] Support DCP with query length > 1 (MTP) with FA3 (#25049) Ming Yang 2025-10-09 08:06:29 -07:00
  • 2c1c7dfb35 [Models][Qwen] Replace pad with cat for better performance (#26486) Lukas Geiger 2025-10-09 15:51:26 +01:00
  • e246ad6f0c Upgrade Pydantic to v2.12.0 and remove hack for Python 3.13 (#26481) Harry Mellor 2025-10-09 14:02:40 +01:00
  • 5728da11ea Revert #26113 "[Frontend] CompilationConfig overhaul (#20283): deprecate use_inductor in favor of backend, simplify custom_ops" (#26472) Jiangyun Zhu 2025-10-09 20:43:55 +08:00
  • 92be3f3517 [Feature] Use pydantic validation in parallel.py config (#26417) Simon Danielsson 2025-10-09 14:41:31 +02:00
  • d1ddf340c8 [V0 deprecation] Remove QKVCrossParallelLinear implementation (#26475) Isotr0py 2025-10-09 18:52:27 +08:00
  • ec10fd0abc [Bugfix] Move current_platform import to avoid python import cache. (#16601) Wenzheng Bi 2025-10-09 18:46:19 +08:00
  • 0426e3c5e1 [Models][Qwen3VL] Optimise _validate_and_reshape_mm_tensor (#26426) Lukas Geiger 2025-10-09 11:25:48 +01:00
  • 4bdf7ac593 [Bugfix] Fix SHM cache initialization (#26427) Cyrus Leung 2025-10-09 17:48:04 +08:00
  • dc7976dd9f [Misc] Upgrade more code to Python 3.10 (#26463) Cyrus Leung 2025-10-09 17:43:53 +08:00
  • e4791438ed [Feature] Use pydantic validation in lora.py and load.py configs (#26413) Simon Danielsson 2025-10-09 11:38:33 +02:00
  • e6e898f95d [doc] add Volcengine as a compute sponsor (#26477) youkaichao 2025-10-09 17:11:47 +08:00
  • ddcbc2f334 [Misc] Misc code simplifications (#26450) Nick Hill 2025-10-09 02:10:06 -07:00
  • a83ff278d6 [torchao] Add support for ModuleFqnToConfig using regex (#26001) Jerry Zhang 2025-10-09 01:32:32 -07:00
  • cf4cd6c24f Add: Support for multiple hidden layers in Eagle3 (#26164) Rahul Tuli 2025-10-09 13:00:50 +05:30
  • b960441812 Enable RMSNorm substitution for Transformers backend (#26353) Harry Mellor 2025-10-09 08:28:51 +01:00
  • 1317028aa8 [Model] Gemma3: Fix GGUF loading and quantization (#26189) Luciano Martins 2025-10-09 04:00:53 -03:00
  • 5e49c3e777 Bump Flashinfer to v0.4.0 (#26326) elvischenv 2025-10-09 14:58:44 +08:00
  • 0d7c3cb51d Update Dockerfile and install runai-model-streamer[gcs] package (#26464) pwschuurman 2025-10-08 23:48:51 -07:00
  • 1b2c440cd6 [Core] Relax the LoRA max rank (#26461) Jee Jee Li 2025-10-09 14:47:14 +08:00
  • 0f29dca988 [CI/Build] Fix model nightly tests (#26466) Cyrus Leung 2025-10-09 14:44:16 +08:00
  • d24cf322e1 [Hybrid]: Decouple Kernel Block Size from KV Page Size (#24486) Zhiyuan Li 2025-10-09 14:43:39 +08:00
  • d17f0fbf30 [Core][KVConnector] Propagate all tokens on resumed preemptions (#24926) Qier Li 2025-10-09 02:43:31 -04:00
  • 43ab8cfaa5 [MM][Doc] Add documentation for configurable mm profiling (#26200) Wenlong Wang 2025-10-08 23:21:20 -07:00
  • de253d63b7 [Hardware][AMD] Enable FlexAttention backend on ROCm (#26439) Matt 2025-10-09 01:20:18 -05:00
  • 8bd696fa53 [Bugfix] Incorrect another MM data format in vllm bench throughput (#26462) Huy Do 2025-10-08 22:58:46 -07:00
  • bb6d8c21f9 [Bugfix] Catch and log invalid token ids in detokenizer #2 (#26445) Nick Hill 2025-10-08 21:20:25 -07:00
  • ebf6ef1a9b [Minor] Change warning->warning_once in preprocess (#26455) Zhuohan Li 2025-10-08 21:09:06 -07:00
  • 0c52d6ef81 [Bugfix] Set the minimum python version for gpt-oss (#26392) Jee Jee Li 2025-10-09 11:35:49 +08:00
  • 467a4f98f1 [Misc] Redact ray runtime env before logging (#26302) Rui Qiao 2025-10-08 17:43:34 -07:00
  • e614ab7806 Separate MLAAttention class from Attention (#25103) Naveenraj Kamalakannan 2025-10-08 20:11:11 -04:00
  • 2a03f93de9 [Attention] Register FLASHMLA_SPARSE (#26441) Matthew Bonanni 2025-10-08 18:28:52 -04:00
  • da364615fc [Kernels] Modular kernel refactor (#24812) bnellnm 2025-10-08 17:51:52 -04:00
  • f08919b7d1 [Bugfix] Respect min_tokens in scheduler stop check (#26317) Elaine Zhao 2025-10-08 14:08:24 -07:00
  • 93f2c0aa08 [Models] Improve iteration over layers (#26425) Lukas Geiger 2025-10-08 21:48:33 +01:00
  • 4ebc9108a7 [Kernel] Centralize platform kernel import in current_platform.import_kernels (#26286) Nicolò Lucchesi 2025-10-08 22:25:31 +02:00
  • e1ba235668 [BugFix] Fix failing test quantization/test_compressed_tensors.py::test_compressed_tensors_fp8_block_enabled (#26436) Morrison Turnansky 2025-10-08 16:04:12 -04:00
  • b82f4307c9 [Bugfix][Flashinfer] fix VLLM_USE_TRTLLM_ATTENTION issue for models with diff hyperparameters (#25924) elvischenv 2025-10-09 03:54:48 +08:00
  • 76879cc160 [Attention] Implement universal BACKEND_MAP (#25900) Matthew Bonanni 2025-10-08 15:00:25 -04:00
  • b25d7b5657 [Feature] Change cache.py with pydantic validation (#26390) Vinay R Damodaran 2025-10-08 11:12:59 -07:00
  • e09d1753ec Remove Python 3.9 support ahead of PyTorch 2.9 in v0.11.1 (#26416) Harry Mellor 2025-10-08 18:40:42 +01:00
  • 4ba8875749 [Bug] Fix Test in Batch Invariant (#26128) Wentao Ye 2025-10-08 13:13:47 -04:00
  • 6273fe8d3d [Benchmarks] Fix imports in FP8 tuning script (#26407) Lukas Geiger 2025-10-08 17:31:59 +01:00
  • 9fb3ae4e6f [Bug] Fix DeepGEMM Attention Test (#26423) Wentao Ye 2025-10-08 12:23:41 -04:00
  • 76afe4edf8 [Bugfix] Fix vllm bench ... on CPU-only head nodes (#25283) Aydin Abiar 2025-10-08 09:06:42 -07:00
  • c1b06fc182 [CI Failure] Fix pre-commit issue for install_nixl_from_source_ubuntu.py (#26424) Michael Goin 2025-10-08 10:55:43 -04:00
  • 241b4cfe66 [Refactor] Refactor FP8 & INT8 Quant Folder inside w8a8 (#25293) Wentao Ye 2025-10-08 10:20:48 -04:00
  • 9fc983c707 [NIXL][non-cuda] Add install script for nixl with non-cuda ucx (#25959) Chendi.Xue 2025-10-08 09:19:53 -05:00
  • 2f99f2f506 Tidy vllm/config/__init__.py to only add classes and functions (#26405) Harry Mellor 2025-10-08 15:10:00 +01:00
  • 338b1bf04f [Benchmarks] Add support for Qwen 3 VL MoE tuning (#26419) Lukas Geiger 2025-10-08 15:01:08 +01:00
  • e39dc46f8f [CI] Pooling models mteb test disable enforce_eager (#26408) wang.yuqi 2025-10-08 20:15:36 +08:00
  • 10c75b5439 [Docs] Have mergify leave a comment with the docs preview link (#26412) Harry Mellor 2025-10-08 13:04:00 +01:00
  • f9582fd8f4 [Model] Allow passing custom number of max tiles to Nano 2 VL (#26403) Eugene Khvedchenya 2025-10-08 14:19:39 +03:00
  • f377333bd7 [Misc] add usedforsecurity=False in md5 hash call (#26357) Daniele 2025-10-08 12:18:32 +02:00
  • f8607863d8 [Feature] Enable E8M0 by Default on Hopper for DeepGEMM, 5% E2E throughput improvement (#26197) Wentao Ye 2025-10-08 03:33:56 -04:00
  • 335b28f7d1 [TPU] Rename tpu_commons to tpu_inference (#26279) Utkarsh Sharma 2025-10-08 12:00:52 +05:30
  • 5e65d6b2ad fix[DP][v1]: Prevent hangs from mismatched worker configurations (#26218) Ayush Satyam 2025-10-08 11:25:08 +05:30
  • 0d4f48fa10 [Bugfix] Incorrect MM data format in vllm bench throughput (#26395) Cyrus Leung 2025-10-08 13:52:19 +08:00