Commit Graph

  • d419aa5dc4 [V1] Enable TPU V1 backend by default (#17673) Michael Goin 2025-05-06 09:49:49 -04:00
  • f9bc5a0693 [Bugfix] Fix triton import with local TritonPlaceholder (#17446) Mengqing Cao 2025-05-06 17:53:09 +08:00
  • 05e1f96419 Fix dockerfilegraph pre-commit hook (#17698) Harry Mellor 2025-05-06 09:56:48 +01:00
  • 6eae34533a [Misc] Fix ScalarType float4 naming (#17690) Lucas Wilkinson 2025-05-06 04:07:15 -04:00
  • 63ced7b43f [Doc] Update notes for H2O-VL and Gemma3 (#17219) Cyrus Leung 2025-05-06 15:51:02 +08:00
  • dc47ba32f8 [Bugfix] Fixed prompt length for random dataset (#17408) Mikhail Podvitskii 2025-05-06 09:00:08 +02:00
  • edbf2d609e [easy] Fix logspam on PiecewiseBackend errors (#17138) Richard Zou 2025-05-06 02:46:11 -04:00
  • 999328be0d [Model] Add GraniteMoeHybrid 4.0 model (#17497) Stan Wozniak 2025-05-06 06:00:31 +02:00
  • 98834fefaa Update nm to rht in doc links + refine fp8 doc (#17678) Michael Goin 2025-05-05 20:41:14 -04:00
  • 90bd2ae172 [Bugfix] LoRA - Retire unused maxnreg LoRA kernel argument (#17677) Varun Sundar Rabindranath 2025-05-06 06:04:29 +05:30
  • 5941e0b7ea [TPU][V1] Add support for top-logprobs (#17072) Nicolò Lucchesi 2025-05-05 23:20:15 +02:00
  • 9765940824 [TPU] Enable gemma3-27b with TP>1 on multi-chips. (#17335) XiongfeiWei 2025-05-05 14:19:58 -07:00
  • 5ea5c514da [BugFix] Increase timeout for startup failure test (#17642) Nick Hill 2025-05-05 13:53:19 -07:00
  • d3efde8176 [Benchmarks] Remove invalid option under V1 engine (#17651) Russell Bryant 2025-05-05 16:30:22 -04:00
  • aea302be6c Use git-path commit in hook (#17616) Thomas J. Fan 2025-05-05 13:55:32 -04:00
  • cc05b90d86 [Doc] Fix broken cuda installation doc rendering (#17654) Isotr0py 2025-05-06 01:52:40 +08:00
  • 1d0c9d6b2d [Kernel] some optimizations for dense marlin and moe marlin (#16850) Jinzhen Lin 2025-05-06 00:39:30 +08:00
  • f62cad6431 [Build/CI] Upgrade CUTLASS to 3.9.2 (#17641) Tyler Michael Smith 2025-05-04 22:23:17 -04:00
  • 5394ad7387 [Bugfix] fix KeyError on top logprobs are special tokens (#17637) Chauncey 2025-05-05 10:22:35 +08:00
  • 68e1ee0072 [Bugfix][Easy] Fix whitespace in shm_broadcast.py logging (#17635) Tyler Michael Smith 2025-05-04 22:20:19 -04:00
  • 2858830c39 [Bugfix] Prioritize dtype in root config before checking text config (#17629) Cyrus Leung 2025-05-04 20:43:05 +08:00
  • d6484ef3c3 Add full API docs and improve the UX of navigating them (#17485) Harry Mellor 2025-05-04 03:42:43 +01:00
  • 46fae69cf0 [Misc] V0 fallback for --enable-prompt-embeds (#17615) Cyrus Leung 2025-05-04 06:59:24 +08:00
  • f66f1e0fa3 [Bugfix] Fix broken Qwen2.5-omni tests (#17613) Isotr0py 2025-05-04 01:08:14 +08:00
  • 887d7af882 [Core] Gate prompt_embeds behind a feature flag (#17607) Cyrus Leung 2025-05-04 00:19:20 +08:00
  • a92842454c [Bugfix][ROCm] Using device_type because on ROCm the API is still torch.cuda (#17601) Gregory Shtrasberg 2025-05-03 01:25:47 -04:00
  • c8386fa61d [Build/CI] Upgrade CUTLASS to 3.9.1 (#17602) Tyler Michael Smith 2025-05-03 01:25:14 -04:00
  • 87baebebd8 [Frontend][TPU] Add TPU default max-num-batched-tokens based on device name (#17508) Chenyaaang 2025-05-02 21:42:44 -07:00
  • e3d0a1d190 [Quantizaton] [AMD] Add support for running DeepSeek int8 w8a8 MoE on ROCm (#17558) rasmith 2025-05-02 23:41:10 -05:00
  • d47b605eca Update test requirements to CUDA 12.8 (#17576) 22quinn 2025-05-02 21:40:15 -07:00
  • 22c6f6397f [Neuron][Build] Require setuptools >= 77.0.3 for PEP 639 (#17603) Liangfu Chen 2025-05-02 19:41:59 -07:00
  • 3ec97e2cc5 [release] Add command to clean up Docker containers/images in TPU release machine (#17606) Kevin H. Luu 2025-05-02 18:54:34 -07:00
  • 9b103a1d76 fix typo in logging (#17605) Eric Hartford 2025-05-02 21:04:40 -04:00
  • b90b0852e9 [easy] Print number of needed GPUs in skip message (#17594) Richard Zou 2025-05-02 18:27:43 -04:00
  • 9352cdb56d [Hardware][AMD] Improve OAM device ID + llama4 Maverick MOE tuning (#16263) Xiaodong Wang 2025-05-02 12:44:19 -07:00
  • 182f40ea8b Add NVIDIA TensorRT Model Optimizer in vLLM documentation (#17561) Zhiyu 2025-05-02 11:36:46 -07:00
  • 3e887d2e0c permute/unpermute kernel for moe optimization (#14568) Caleb_Du 2025-05-03 02:31:55 +08:00
  • 3015d5634e [BugFix][Attention] Fix sliding window attention in V1 giving incorrect results (#17574) v0.8.5.post1 Lucas Wilkinson 2025-05-02 14:01:38 -04:00
  • edb5286ea5 [BugFix] Fix Memory Leak (#17567) Robert Shaw 2025-05-02 04:07:03 -04:00
  • 0f87d8f7b2 [BugFix][Attention] Fix sliding window attention in V1 giving incorrect results (#17574) Lucas Wilkinson 2025-05-02 14:01:38 -04:00
  • 4c33d67321 [Bugfix] fix tmp_out and exp_sums dimensions (#17438) Hui Liu 2025-05-02 09:44:07 -07:00
  • cb234955df [Misc] Clean up input processing (#17582) Cyrus Leung 2025-05-02 23:11:53 +08:00
  • 3a500cd0b6 [doc] miss result (#17589) Reid 2025-05-02 22:04:49 +08:00
  • 868c546da4 Support W8A8 INT8 MoE for compressed-tensors (#16745) Michael Goin 2025-05-02 08:03:32 -06:00
  • 99404f53c7 [Security] Fix image hash collision (#17378) Cyrus Leung 2025-05-02 20:36:39 +08:00
  • 785d75a03b Automatically tell users that dict args must be valid JSON in CLI (#17577) Harry Mellor 2025-05-02 13:24:55 +01:00
  • 6d1479ca4b [doc] add the print result (#17584) Reid 2025-05-02 20:24:45 +08:00
  • b8b0859b5c add more pytorch related tests for torch nightly (#17422) Yang Wang 2025-05-02 03:29:59 -07:00
  • d7543862bd [Misc] Rename assets for testing (#17575) Cyrus Leung 2025-05-02 18:29:25 +08:00
  • c777df79f7 [BugFix] Fix Memory Leak (#17567) Robert Shaw 2025-05-02 04:07:03 -04:00
  • cc2a77d7f1 [Core] [Bugfix] Add Input Embeddings (#15428) Andrew Sansom 2025-05-02 03:06:39 -05:00
  • 9e2de9b9e9 [Bugifx] Remove TritonPlaceholder from sys.modules (#17317) Isotr0py 2025-05-02 15:45:01 +08:00
  • 109e15a335 Add pt_load_map_location to allow loading to cuda (#16869) Jerry Zhang 2025-05-01 23:23:42 -07:00
  • f192ca90e6 Fix PixtralHF missing spatial_merge_size (#17571) Michael Goin 2025-05-01 23:14:09 -06:00
  • f89d0e11bf [Misc] Continue refactoring model tests (#17573) Cyrus Leung 2025-05-02 13:06:08 +08:00
  • b4003d11fc Check if bitblas is installed during support check (#17572) Michael Goin 2025-05-01 22:32:54 -06:00
  • 292fc59d61 [CI] Actually run tests/kv_transfer/test_disagg.py in CI (#17555) Michael Goin 2025-05-01 22:05:04 -06:00
  • afcb3f8863 [Attention] MLA move o_proj q_proj into cuda-graph region (#17484) Lucas Wilkinson 2025-05-01 23:16:26 -04:00
  • afb12e4294 [Doc] note that not all unit tests pass on CPU platforms (#17554) David Xia 2025-05-01 22:57:21 -04:00
  • 24aebae177 [Bugfix] Disable gptq_bitblas for <SM80 to fix GPTQ on V100/T4 (#17541) Michael Goin 2025-05-01 18:59:35 -06:00
  • 39c0813a7f [V1][Spec Decode] Apply torch.compile & cudagraph to EAGLE3 (#17504) qizixi 2025-05-01 16:19:30 -07:00
  • 9b70e2b4c1 [Misc][Tools][Benchmark] Publish script to auto tune server parameters (#17207) Chenyaaang 2025-05-01 12:53:03 -07:00
  • 173daac19d [Bug]change the position of cuda_graph_sizes in dataclasses (#17548) Chen Xia 2025-05-01 11:52:37 -07:00
  • 04f2cfc894 Remove duplicate code from dbrx.py (#17550) sstamenk 2025-05-01 20:51:58 +02:00
  • 811a6c0972 [ROCM] Add gfx950 to the custom attention archs (#16034) Juan Villamizar 2025-05-01 13:18:28 -05:00
  • 9b1769dd9a [Bugfix] Fix lint error (#17547) Cyrus Leung 2025-05-02 02:12:19 +08:00
  • 61c299f81f [Misc]add configurable cuda graph size (#17201) Chen Xia 2025-05-01 11:04:50 -07:00
  • 4acfa3354a [ROCm] update installation guide to include build aiter from source instructions (#17542) Hongxia Yang 2025-05-01 14:01:28 -04:00
  • 88c8304104 [Model] Refactor Ovis2 to support original tokenizer (#17537) Isotr0py 2025-05-02 02:00:53 +08:00
  • 6768ff4a22 Move the last arguments in arg_utils.py to be in their final groups (#17531) Harry Mellor 2025-05-01 18:31:44 +01:00
  • f2e7af9b86 [CI/Build] Remove awscli dependency (#17532) Cyrus Leung 2025-05-02 00:20:54 +08:00
  • 7423cf0a9b [Misc] refactor example - cpu_offload_lmcache (#17460) Reid 2025-05-01 23:05:24 +08:00
  • 460a2b1100 [torch.compile] Add torch inductor pass for fusing silu_and_mul with subsequent scaled_fp8_quant operations (#10867) Sage Moore 2025-05-01 07:59:28 -07:00
  • 28566d73b3 [ROCm] remove unsupported archs from rocm triton flash-attention supported list (#17536) Hongxia Yang 2025-05-01 10:54:25 -04:00
  • 98060b001d [Feature][Frontend]: Deprecate --enable-reasoning (#17452) Chauncey 2025-05-01 21:46:16 +08:00
  • f5a3c655b2 [FEAT] [ROCm]: Add Qwen/Qwen3-235B-A22B-FP8 TP4 triton fused moe config (#17535) TJian 2025-05-01 21:37:17 +08:00
  • 7169f87ad0 [doc] add streamlit integration (#17522) Reid 2025-05-01 21:34:02 +08:00
  • b74d888c63 Fix more broken speculative decode tests (#17450) Huy Do 2025-05-01 06:05:58 -07:00
  • 2007d4d54f [FEAT] [ROCm]: Add Qwen/Qwen3-30B-A3B-FP8 fused moe config for MI300X (#17530) TJian 2025-05-01 21:03:13 +08:00
  • 48e925fab5 [Misc] Clean up test docstrings and names (#17521) Cyrus Leung 2025-05-01 20:19:32 +08:00
  • 1903c0b8a3 [Frontend] Show progress bar for adding requests (#17525) Cyrus Leung 2025-05-01 20:15:32 +08:00
  • 86a1f67a3b [Bugfix][Benchmarks] Allow benchmark of deepspeed-mii backend to select a model (#17285) Teruaki Ishizaki 2025-05-01 20:54:51 +09:00
  • a257d9bccc Improve configs - ObservabilityConfig (#17453) Harry Mellor 2025-05-01 11:52:05 +01:00
  • 015069b017 [Misc] Optimize the Qwen3_ReasoningParser extract_reasoning_content (#17515) Chauncey 2025-05-01 18:29:01 +08:00
  • fbefc8a78d [Core] Enable IPv6 with vllm.utils.make_zmq_socket() (#16506) Russell Bryant 2025-05-01 05:38:18 -04:00
  • 26bc4bbcd8 Avoid overwriting vllm_compile_cache.py (#17418) Keyun Tong 2025-05-01 00:30:57 -07:00
  • 3c3d767201 [BugFix] Fix mla cpu - missing 3 required positional arguments (#17494) Lucas Wilkinson 2025-05-01 02:36:52 -04:00
  • 13cf6b6236 [BugFix] fix speculative decoding memory leak when speculation is disabled (#15506) Noah Yoshida 2025-04-30 23:28:17 -07:00
  • 90d0a54c4d [ROCm] Effort to reduce the number of environment variables in command line (#17229) Hongxia Yang 2025-05-01 02:27:06 -04:00
  • 7a0a146c54 [Build] Require setuptools >= 77.0.3 for PEP 639 (#17389) Russell Bryant 2025-05-01 02:25:36 -04:00
  • 7ab643e425 FIxing the AMD test failures caused by PR#16457 (#17511) Alexei-V-Ivanov-AMD 2025-05-01 01:23:07 -05:00
  • afb4429b4f [CI/Build] Reorganize models tests (#17459) Cyrus Leung 2025-05-01 14:03:08 +08:00
  • aa4502e7f3 [CI][Bugfix] Fix failing V1 Test due to missing 'cache_salt' arg (#17500) Michael Goin 2025-04-30 22:03:30 -06:00
  • 17b4d85f63 [CI][TPU] Skip structured outputs+spec decode tests on TPU (#17510) Michael Goin 2025-04-30 21:36:20 -06:00
  • 1144a8efe7 [Bugfix] Temporarily disable gptq_bitblas on ROCm (#17411) NaLan ZeYu 2025-05-01 10:51:45 +08:00
  • 08fb5587b4 [Bugfix][ROCm] Fix import error on ROCm (#17495) Gregory Shtrasberg 2025-04-30 22:51:42 -04:00
  • dbc18e7816 [CI][TPU] Skip Multimodal test (#17488) Siyuan Liu 2025-04-30 19:51:39 -07:00
  • 02bd654846 [Misc] Rename Audios -> Audio in Qwen2audio Processing (#17507) Alex Brooks 2025-04-30 20:51:36 -06:00
  • 200bbf92e8 Bump Compressed Tensors version to 0.9.4 (#17478) Rahul Tuli 2025-04-30 17:24:45 -05:00
  • 81ecf425f0 [v1][Spec Decode] Make sliding window compatible with eagle prefix caching (#17398) Chen Zhang 2025-05-01 02:25:53 +08:00