Commit Graph

  • 56c76c2e0e [Bugfix] clean up duplicated code (#16485) rongfu.leng 2025-04-12 07:19:40 +08:00
  • c09632a66c Update openai_compatible_server.md (#16507) Christian Sears 2025-04-11 18:54:58 -04:00
  • a3bf8d4a2b [Kernel] Add tuned FusedMoE kernel config for Llama4 Scout, TP=8 on H100 (#16488) Yong Hoon Shin 2025-04-11 15:26:55 -07:00
  • 16eda8c43a [Frontend] Added chat templates for LLaMa4 pythonic tool calling (#16463) Ye (Charlotte) Qi 2025-04-11 15:26:17 -07:00
  • cd77382ac1 Improve configs - LoadConfig (#16422) Harry Mellor 2025-04-11 21:27:27 +01:00
  • 71b9cde010 [Bugfix] handle alignment of encoder_seq_lens in mllama.py (#14784) Travis Johnson 2025-04-11 13:59:50 -06:00
  • 5285589f37 [Doc] Document InternVL3 support (#16495) Isotr0py 2025-04-12 03:41:09 +08:00
  • f41647ee6b [Kernel] Support W8A8 channel-wise weights and per-token activations in triton fused_moe_kernel (#16366) Michael Goin 2025-04-11 11:54:08 -06:00
  • 4d022cbc75 [TPU][V1] Make --disable_chunked_mm_input mandatory for serving MM models (#16483) Nicolò Lucchesi 2025-04-11 19:06:14 +02:00
  • 70de35a881 Fix erroneous "model doesn't support compile" warning (#16486) Richard Zou 2025-04-11 12:24:36 -04:00
  • 34b2cf3b33 [Hardware][Intel-Gaudi] Multi-step scheduling implementation for HPU (#12779) Tomasz Zielinski 2025-04-11 16:38:36 +02:00
  • 9e90c9f73f [Bugfix] Fix bugs of running Quark quantized models (#16236) chaow-amd 2025-04-11 22:18:32 +08:00
  • e9528f6dc6 [Kernel] support merge_attn_states CUDA kernel, 3x speedup (#16173) DefTruth 2025-04-11 20:50:50 +08:00
  • 51baa9c333 Don't install triton on ppc64le platform (#16470) Harry Mellor 2025-04-11 11:11:00 +01:00
  • 35e076b3a8 [Misc] update api_client example (#16459) Reid 2025-04-11 18:05:40 +08:00
  • a26f59ccbc [Misc] Raise error for V1 not supporting Long LoRA. (#16415) Jee Jee Li 2025-04-11 16:51:20 +08:00
  • aa3b3d76e0 Enforce valid max_num_batched_tokens when disable_chunked_mm_input=True (#16447) Michael Goin 2025-04-11 02:09:52 -06:00
  • f7030df3be [Core][LoRA][1/N] Add LoRA for EncoderDecoderModelRunner (#15990) Jee Jee Li 2025-04-11 15:32:37 +08:00
  • 905e91e9ac Revert "[Model] use AutoWeightsLoader for deepseek_v2, internlm2" (#16453) DefTruth 2025-04-11 14:44:22 +08:00
  • f8f9c0ba62 [Bugfix] Don't set an upper bound on repetition penalty (#16403) Alex Brooks 2025-04-11 00:19:40 -06:00
  • dda811021a [CPU][Bugfix] Fix CPU docker issues (#16454) Li, Jiang 2025-04-11 14:19:07 +08:00
  • 93195146ea [Bugfix][VLM] Fix failing Phi-4-MM multi-images tests and add vision-speech test (#16424) Isotr0py 2025-04-11 12:57:16 +08:00
  • ed37599544 Update supported_hardware.md for TPU INT8 (#16437) Michael Goin 2025-04-10 22:28:07 -06:00
  • 99ef59cf7f [Llama4] Enable attention temperature tuning by default for long context (>32k) (#16439) Yong Hoon Shin 2025-04-10 21:26:07 -07:00
  • d544d141ec update benchmark_serving_structured_output to include auto backend (#16438) Chenyaaang 2025-04-10 21:25:52 -07:00
  • 3e397a9484 check input length of sonnet samples (#16423) Alexey Belyakov 2025-04-11 03:15:06 +01:00
  • 268c325078 Fix range_ratio Bug in RandomDataset (#16126) WWW 2025-04-11 06:31:17 +08:00
  • 3cc9af88ff [TPU][V1] Disable per-request seed/Generator (#16172) Nicolò Lucchesi 2025-04-10 23:05:44 +02:00
  • 7cd0bd7212 [Bugfix] Fix output token length check logic (#16419) look 2025-04-11 04:16:48 +08:00
  • 56d4aefa33 [VLM] Avoid unnecessary dummy multimodal data during processing (#16416) Cyrus Leung 2025-04-11 03:32:14 +08:00
  • dd143ef541 [V1] Zero-copy tensor/ndarray serialization/transmission (#13790) Nick Hill 2025-04-10 12:23:14 -07:00
  • daefed052c [Model] Reduce redundant computations in mamba2 blocks for Bamba-9B (#15423) Chih-Chieh Yang 2025-04-10 15:07:07 -04:00
  • 5fbab20e02 [Bugfix] Fix bug when dataset is json (#15899) Chenyaaang 2025-04-10 11:35:41 -07:00
  • e8224f3dca [V1][Spec Decode] Eagle Model loading (#16035) Lily Liu 2025-04-10 11:21:48 -07:00
  • 9665313c39 [V1] Set structured output backend to auto by default (#15724) Russell Bryant 2025-04-10 13:53:26 -04:00
  • 0c54fc7273 Improve configs - ParallelConfig (#16332) Harry Mellor 2025-04-10 18:34:37 +01:00
  • c1b57855ec [TPU][V1] Use language_model interface for getting text backbone in MM (#16410) Nicolò Lucchesi 2025-04-10 19:32:04 +02:00
  • 83b824c8b4 [VLM] Remove BaseProcessingInfo.get_mm_max_tokens_per_item (#16408) Cyrus Leung 2025-04-11 00:06:58 +08:00
  • 7678fcd5b6 Fix the torch version parsing logic (#15857) Lu Fang 2025-04-10 07:37:47 -07:00
  • 8661c0241d [CI] Add auto update workflow for Dockerfile graph (#11879) wineandchord 2025-04-10 21:43:05 +08:00
  • ce8d6b75fc [doc] update the wrong link (#16401) Reid 2025-04-10 21:02:37 +08:00
  • 61de3ef74b [Model] Remove image mm limit for LLaMa4 (#16365) Ye (Charlotte) Qi 2025-04-10 02:36:27 -07:00
  • ec1f9c8c91 Update Numba to 0.61.2 (#16376) cyyever 2025-04-10 15:59:37 +08:00
  • 65e09094c4 [doc] add download model tips (#16389) Reid 2025-04-10 15:45:26 +08:00
  • c70cf0fe06 [Kernel] Use moe_wna16 kernel for compressed tensors wna16 moe models (#16038) Michael Goin 2025-04-10 01:08:47 -06:00
  • a5d11a54dc [Bugfix] Fix validation error for text-only Mllama 3.2 (#16377) Cyrus Leung 2025-04-10 14:19:42 +08:00
  • 3d4c87758e [Misc] Update transformers version limits of multi-modal tests (#16381) Cyrus Leung 2025-04-10 14:03:33 +08:00
  • a9bd832fc5 [Model] use AutoWeightsLoader for deepseek_v2, internlm2 (#16383) Aaron Ang 2025-04-10 02:01:00 -04:00
  • 417bcefbae fix sonnet dataset sample when prefix len is very small (#16379) Chenyaaang 2025-04-09 22:35:07 -07:00
  • baada0e737 [Bugfix][TPU] Fix TPU validate_request (#16369) Michael Goin 2025-04-09 22:55:12 -06:00
  • 82eb61dd4c [misc] use tqdm.auto where appropriate (#16290) Benjamin Kitor 2025-04-09 21:54:54 -07:00
  • 0d4d06fe2f [CI][Bugfix] Pin triton version for CPU (#16384) Roger Wang 2025-04-09 21:35:00 -07:00
  • 4aed0ca6a2 [bugfix] Avoid the time consumption caused by creating dummy videos. (#16371) Jintao 2025-04-10 12:30:05 +08:00
  • 1621b25288 [TPU] Fix dummy loading OOM (#16372) Chengji Yao 2025-04-09 21:06:16 -07:00
  • a564797151 [Model] use AutoWeightsLoader for granite, granitemoe, granitemoeshared, grok1, mixtral (#16325) Aaron Ang 2025-04-09 23:07:40 -04:00
  • 1da6a09274 [Bugfix]: do not shutdown server if skip_special_use=False for MistralTokenizer (#14094) Guillaume Calmettes 2025-04-10 04:43:09 +02:00
  • 1e44ffc3ff Add GLM-4-0414 support (#16338) Yuxuan Zhang 2025-04-10 09:19:42 +08:00
  • a454748544 [TPU][V1] Refine tpu_model_runner to mitigate future recompilation issues (#16275) Chengji Yao 2025-04-09 17:51:51 -07:00
  • 1bff42c4b7 [Misc] refactor Structured Outputs example (#16322) Reid 2025-04-10 07:32:42 +08:00
  • cb391d85dc [Hardware] add platform-specific request validation api (#16291) Joe Runde 2025-04-09 21:50:01 +02:00
  • fee5b8d37f [Build/CI] Add tracing deps to vllm container image (#15224) Russell Bryant 2025-04-09 15:14:06 -04:00
  • b2ce859bd2 Fix benchmark_throughput.py --backend=hf (#16352) Michael Goin 2025-04-09 13:09:28 -06:00
  • 566f10a929 [CI]Fix hpu docker and numpy version for CI (#16355) Chendi.Xue 2025-04-09 12:52:26 -05:00
  • c3b5189137 [Bugfix] catch AssertionError in MistralTokenizer as ValueError (#16344) Guillaume Calmettes 2025-04-09 19:33:24 +02:00
  • a25866ac8d [Bugfix] Fix profiling.py (#16202) zh Wang 2025-04-10 01:03:34 +08:00
  • 098900d7c2 Revert "Update label-tpu mergify and remove removal bot" (#16350) Michael Goin 2025-04-09 08:59:36 -06:00
  • 98d01d3ce2 [Bugfix][Frontend] respect provided default guided decoding backend (#15476) Guillaume Calmettes 2025-04-09 14:11:10 +02:00
  • d55244df31 [Model] Add SupportsMultiModal.get_language_model interface (#16007) Nicolò Lucchesi 2025-04-09 13:12:54 +02:00
  • 04149cce27 [BugFix] fix some typos found by typos. (#16314) yihong 2025-04-09 18:43:59 +08:00
  • 24834f4894 update neuron config (#16289) ajayvohra2005 2025-04-09 06:43:22 -04:00
  • ec7da6fcf3 [BugFix] llama4 qknorm should be not shared across head (#16311) Lucia Fang 2025-04-09 00:59:14 -07:00
  • 819d548e8a [BugFix] logger is not callable (#16312) yihong 2025-04-09 15:59:02 +08:00
  • 477d2a8aa2 Update label-tpu mergify and remove removal bot (#16298) Michael Goin 2025-04-09 01:56:25 -06:00
  • e484e02857 [Bugfix] Avoid transferring cached multi-modal items from P0 to P1 (#16273) Cyrus Leung 2025-04-09 15:51:27 +08:00
  • 24f6b9a713 [Misc] Fix test_sharded_state_loader.py(#16004) (#16005) Accelerator1996 2025-04-09 14:47:30 +08:00
  • 9cdde47289 [BugFix] Fix fusion test and add them to CI (#16287) Luka Govedič 2025-04-09 02:46:45 -04:00
  • b1eb4ca152 [TPU] Update PyTorch/XLA (#16288) Chengji Yao 2025-04-08 23:46:32 -07:00
  • 87b4ac56c2 [CI][Bugfix] Fix bad tolerance for test_batch_base64_embedding (#16221) Michael Goin 2025-04-08 22:14:46 -06:00
  • cb84e45ac7 [Core] Upgrade to xgrammar 0.1.18, add cache size limit (#16283) Russell Bryant 2025-04-08 22:13:22 -04:00
  • 4716377fbc [Feature] Estimate max-model-len use available KV cache memory (#16168) rongfu.leng 2025-04-09 10:12:51 +08:00
  • 4e9cf8c1dd [Bugfix] fix gettid method is not define (#16084) rongfu.leng 2025-04-09 10:12:44 +08:00
  • 2976dc27e9 [Bug] [ROCm] Fix Llama 4 Enablement Bug on ROCm: V0 ROCmFlashAttentionImpl and Triton Fused MoE bugs (#16198) TJian 2025-04-09 10:12:34 +08:00
  • 102bf967f0 [Model] Add smolvlm support (#16017) Chauncey 2025-04-09 10:12:17 +08:00
  • 1f4b09b525 Add support to modelopt quantization of Mixtral model (#15961) yueshen2016 2025-04-08 18:53:31 -07:00
  • 86c3369eb8 [CI/Build] Fix CI LoRA failure (#16270) Jee Jee Li 2025-04-09 09:13:56 +08:00
  • 2755c34a8f [V1] Update structured output offline inference example (#15721) Russell Bryant 2025-04-08 18:34:09 -04:00
  • db10422184 [Bugfix] fix deepseek fp16 scale bug (#14809) Jinzhen Lin 2025-04-09 04:56:09 +08:00
  • e1a2c699dd [BugFix] Fix Llama4 - Index Error When Single Request Near Max Context (#16209) Lucas Wilkinson 2025-04-08 14:56:51 -04:00
  • 0115ccd5c0 Add warning that content below line in template will be removed (#16276) Harry Mellor 2025-04-08 19:18:40 +01:00
  • 40b4284fe3 [Bugfix] Handle process_weights_after_loading for QKVCrossParallelLinear (#15328) Isotr0py 2025-04-09 01:02:23 +08:00
  • 4ebc0b9640 [Bugfix] Proper input validation for multi-modal encoder-decoder models (#16156) Cyrus Leung 2025-04-09 00:45:21 +08:00
  • dc96fd54c6 [Misc] Avoid stripping meaningful whitespace from nvidia-smi topo -m output in collect_env.py (#16272) Kero Liang 2025-04-09 00:08:09 +08:00
  • 1f5d13ab9f [New Model]: jinaai/jina-embeddings-v3 (#16120) wang.yuqi 2025-04-08 23:39:12 +08:00
  • 90cb44eb02 Update to transformers==4.51.1 (#16257) Harry Mellor 2025-04-08 14:53:39 +01:00
  • e11880deea [Bugfix] Remove triton do_bench fast_flush arg (#16256) Kebe 2025-04-08 21:51:06 +08:00
  • 9351f91be9 [BugFix][ROCm] Fix GGUF MoE Dispatch Block_Dim for ROCm (#16247) TY-AMD 2025-04-08 20:10:26 +08:00
  • 5a1e1c8353 [Model] use AutoWeightsLoader for phimoe,qwen2_moe,qwen3_moe (#16203) rongfu.leng 2025-04-08 19:05:47 +08:00
  • 69ecaa7c79 [Misc] Add warning for multimodal data in LLM.beam_search (#16241) Alex Brooks 2025-04-08 05:05:27 -06:00
  • 7f00899ff7 [Misc] format and refactor some examples (#16252) Reid 2025-04-08 18:42:32 +08:00
  • 995e3d1f41 [Docs] Add Slides from Singapore Meetup (#16213) Simon Mo 2025-04-08 00:20:22 -07:00