56c76c2e0e
[Bugfix] clean up duplicated code (#16485 )
rongfu.leng
2025-04-12 07:19:40 +08:00
c09632a66c
Update openai_compatible_server.md (#16507 )
Christian Sears
2025-04-11 18:54:58 -04:00
a3bf8d4a2b
[Kernel] Add tuned FusedMoE kernel config for Llama4 Scout, TP=8 on H100 (#16488 )
Yong Hoon Shin
2025-04-11 15:26:55 -07:00
16eda8c43a
[Frontend] Added chat templates for LLaMa4 pythonic tool calling (#16463 )
Ye (Charlotte) Qi
2025-04-11 15:26:17 -07:00
cd77382ac1
Improve configs - LoadConfig (#16422 )
Harry Mellor
2025-04-11 21:27:27 +01:00
71b9cde010
[Bugfix] handle alignment of encoder_seq_lens in mllama.py (#14784 )
Travis Johnson
2025-04-11 13:59:50 -06:00
5285589f37
[Doc] Document InternVL3 support (#16495 )
Isotr0py
2025-04-12 03:41:09 +08:00
f41647ee6b
[Kernel] Support W8A8 channel-wise weights and per-token activations in triton fused_moe_kernel (#16366 )
Michael Goin
2025-04-11 11:54:08 -06:00
4d022cbc75
[TPU][V1] Make --disable_chunked_mm_input mandatory for serving MM models (#16483 )
Nicolò Lucchesi
2025-04-11 19:06:14 +02:00
70de35a881
Fix erroneous "model doesn't support compile" warning (#16486 )
Richard Zou
2025-04-11 12:24:36 -04:00
34b2cf3b33
[Hardware][Intel-Gaudi] Multi-step scheduling implementation for HPU (#12779 )
Tomasz Zielinski
2025-04-11 16:38:36 +02:00
9e90c9f73f
[Bugfix] Fix bugs of running Quark quantized models (#16236 )
chaow-amd
2025-04-11 22:18:32 +08:00
e9528f6dc6
[Kernel] support merge_attn_states CUDA kernel, 3x speedup (#16173 )
DefTruth
2025-04-11 20:50:50 +08:00
51baa9c333
Don't install triton on ppc64le platform (#16470 )
Harry Mellor
2025-04-11 11:11:00 +01:00
35e076b3a8
[Misc] update api_client example (#16459 )
Reid
2025-04-11 18:05:40 +08:00
a26f59ccbc
[Misc] Raise error for V1 not supporting Long LoRA. (#16415 )
Jee Jee Li
2025-04-11 16:51:20 +08:00
aa3b3d76e0
Enforce valid max_num_batched_tokens when disable_chunked_mm_input=True (#16447 )
Michael Goin
2025-04-11 02:09:52 -06:00
f7030df3be
[Core][LoRA][1/N] Add LoRA for EncoderDecoderModelRunner (#15990 )
Jee Jee Li
2025-04-11 15:32:37 +08:00
905e91e9ac
Revert "[Model] use AutoWeightsLoader for deepseek_v2, internlm2" (#16453 )
DefTruth
2025-04-11 14:44:22 +08:00
f8f9c0ba62
[Bugfix] Don't set an upper bound on repetition penalty (#16403 )
Alex Brooks
2025-04-11 00:19:40 -06:00
dda811021a
[CPU][Bugfix] Fix CPU docker issues (#16454 )
Li, Jiang
2025-04-11 14:19:07 +08:00
93195146ea
[Bugfix][VLM] Fix failing Phi-4-MM multi-images tests and add vision-speech test (#16424 )
Isotr0py
2025-04-11 12:57:16 +08:00
ed37599544
Update supported_hardware.md for TPU INT8 (#16437 )
Michael Goin
2025-04-10 22:28:07 -06:00
99ef59cf7f
[Llama4] Enable attention temperature tuning by default for long context (>32k) (#16439 )
Yong Hoon Shin
2025-04-10 21:26:07 -07:00
d544d141ec
update benchmark_serving_structured_output to include auto backend (#16438 )
Chenyaaang
2025-04-10 21:25:52 -07:00
3e397a9484
check input length of sonnet samples (#16423 )
Alexey Belyakov
2025-04-11 03:15:06 +01:00
268c325078
Fix range_ratio Bug in RandomDataset (#16126 )
WWW
2025-04-11 06:31:17 +08:00
3cc9af88ff
[TPU][V1] Disable per-request seed/Generator (#16172 )
Nicolò Lucchesi
2025-04-10 23:05:44 +02:00
7cd0bd7212
[Bugfix] Fix output token length check logic (#16419 )
look
2025-04-11 04:16:48 +08:00
56d4aefa33
[VLM] Avoid unnecessary dummy multimodal data during processing (#16416 )
Cyrus Leung
2025-04-11 03:32:14 +08:00
dd143ef541
[V1] Zero-copy tensor/ndarray serialization/transmission (#13790 )
Nick Hill
2025-04-10 12:23:14 -07:00
daefed052c
[Model] Reduce redundant computations in mamba2 blocks for Bamba-9B (#15423 )
Chih-Chieh Yang
2025-04-10 15:07:07 -04:00
5fbab20e02
[Bugfix] Fix bug when dataset is json (#15899 )
Chenyaaang
2025-04-10 11:35:41 -07:00
e8224f3dca
[V1][Spec Decode] Eagle Model loading (#16035 )
Lily Liu
2025-04-10 11:21:48 -07:00
9665313c39
[V1] Set structured output backend to auto by default (#15724 )
Russell Bryant
2025-04-10 13:53:26 -04:00
0c54fc7273
Improve configs - ParallelConfig (#16332 )
Harry Mellor
2025-04-10 18:34:37 +01:00
c1b57855ec
[TPU][V1] Use language_model interface for getting text backbone in MM (#16410 )
Nicolò Lucchesi
2025-04-10 19:32:04 +02:00
83b824c8b4
[VLM] Remove BaseProcessingInfo.get_mm_max_tokens_per_item (#16408 )
Cyrus Leung
2025-04-11 00:06:58 +08:00
7678fcd5b6
Fix the torch version parsing logic (#15857 )
Lu Fang
2025-04-10 07:37:47 -07:00
8661c0241d
[CI] Add auto update workflow for Dockerfile graph (#11879 )
wineandchord
2025-04-10 21:43:05 +08:00
ce8d6b75fc
[doc] update the wrong link (#16401 )
Reid
2025-04-10 21:02:37 +08:00
61de3ef74b
[Model] Remove image mm limit for LLaMa4 (#16365 )
Ye (Charlotte) Qi
2025-04-10 02:36:27 -07:00
ec1f9c8c91
Update Numba to 0.61.2 (#16376 )
cyyever
2025-04-10 15:59:37 +08:00
65e09094c4
[doc] add download model tips (#16389 )
Reid
2025-04-10 15:45:26 +08:00
c70cf0fe06
[Kernel] Use moe_wna16 kernel for compressed tensors wna16 moe models (#16038 )
Michael Goin
2025-04-10 01:08:47 -06:00
a5d11a54dc
[Bugfix] Fix validation error for text-only Mllama 3.2 (#16377 )
Cyrus Leung
2025-04-10 14:19:42 +08:00
3d4c87758e
[Misc] Update transformers version limits of multi-modal tests (#16381 )
Cyrus Leung
2025-04-10 14:03:33 +08:00
a9bd832fc5
[Model] use AutoWeightsLoader for deepseek_v2, internlm2 (#16383 )
Aaron Ang
2025-04-10 02:01:00 -04:00
417bcefbae
fix sonnet dataset sample when prefix len is very small (#16379 )
Chenyaaang
2025-04-09 22:35:07 -07:00
baada0e737
[Bugfix][TPU] Fix TPU validate_request (#16369 )
Michael Goin
2025-04-09 22:55:12 -06:00
82eb61dd4c
[misc] use tqdm.auto where appropriate (#16290 )
Benjamin Kitor
2025-04-09 21:54:54 -07:00
0d4d06fe2f
[CI][Bugfix] Pin triton version for CPU (#16384 )
Roger Wang
2025-04-09 21:35:00 -07:00
4aed0ca6a2
[bugfix] Avoid the time consumption caused by creating dummy videos. (#16371 )
Jintao
2025-04-10 12:30:05 +08:00
1621b25288
[TPU] Fix dummy loading OOM (#16372 )
Chengji Yao
2025-04-09 21:06:16 -07:00
a564797151
[Model] use AutoWeightsLoader for granite, granitemoe, granitemoeshared, grok1, mixtral (#16325 )
Aaron Ang
2025-04-09 23:07:40 -04:00
1da6a09274
[Bugfix]: do not shutdown server if skip_special_use=False for MistralTokenizer (#14094 )
Guillaume Calmettes
2025-04-10 04:43:09 +02:00
1e44ffc3ff
Add GLM-4-0414 support (#16338 )
Yuxuan Zhang
2025-04-10 09:19:42 +08:00
a454748544
[TPU][V1] Refine tpu_model_runner to mitigate future recompilation issues (#16275 )
Chengji Yao
2025-04-09 17:51:51 -07:00
1bff42c4b7
[Misc] refactor Structured Outputs example (#16322 )
Reid
2025-04-10 07:32:42 +08:00
cb391d85dc
[Hardware] add platform-specific request validation api (#16291 )
Joe Runde
2025-04-09 21:50:01 +02:00
fee5b8d37f
[Build/CI] Add tracing deps to vllm container image (#15224 )
Russell Bryant
2025-04-09 15:14:06 -04:00
b2ce859bd2
Fix benchmark_throughput.py --backend=hf (#16352 )
Michael Goin
2025-04-09 13:09:28 -06:00
566f10a929
[CI]Fix hpu docker and numpy version for CI (#16355 )
Chendi.Xue
2025-04-09 12:52:26 -05:00
c3b5189137
[Bugfix] catch AssertionError in MistralTokenizer as ValueError (#16344 )
Guillaume Calmettes
2025-04-09 19:33:24 +02:00
a25866ac8d
[Bugfix] Fix profiling.py (#16202 )
zh Wang
2025-04-10 01:03:34 +08:00
098900d7c2
Revert "Update label-tpu mergify and remove removal bot" (#16350 )
Michael Goin
2025-04-09 08:59:36 -06:00
98d01d3ce2
[Bugfix][Frontend] respect provided default guided decoding backend (#15476 )
Guillaume Calmettes
2025-04-09 14:11:10 +02:00
d55244df31
[Model] Add SupportsMultiModal.get_language_model interface (#16007 )
Nicolò Lucchesi
2025-04-09 13:12:54 +02:00
04149cce27
[BugFix] fix some typos found by typos. (#16314 )
yihong
2025-04-09 18:43:59 +08:00
24834f4894
update neuron config (#16289 )
ajayvohra2005
2025-04-09 06:43:22 -04:00
ec7da6fcf3
[BugFix] llama4 qknorm should be not shared across head (#16311 )
Lucia Fang
2025-04-09 00:59:14 -07:00
819d548e8a
[BugFix] logger is not callable (#16312 )
yihong
2025-04-09 15:59:02 +08:00
477d2a8aa2
Update label-tpu mergify and remove removal bot (#16298 )
Michael Goin
2025-04-09 01:56:25 -06:00
e484e02857
[Bugfix] Avoid transferring cached multi-modal items from P0 to P1 (#16273 )
Cyrus Leung
2025-04-09 15:51:27 +08:00
24f6b9a713
[Misc] Fix test_sharded_state_loader.py(#16004 ) (#16005 )
Accelerator1996
2025-04-09 14:47:30 +08:00
9cdde47289
[BugFix] Fix fusion test and add them to CI (#16287 )
Luka Govedič
2025-04-09 02:46:45 -04:00
b1eb4ca152
[TPU] Update PyTorch/XLA (#16288 )
Chengji Yao
2025-04-08 23:46:32 -07:00
87b4ac56c2
[CI][Bugfix] Fix bad tolerance for test_batch_base64_embedding (#16221 )
Michael Goin
2025-04-08 22:14:46 -06:00
cb84e45ac7
[Core] Upgrade to xgrammar 0.1.18, add cache size limit (#16283 )
Russell Bryant
2025-04-08 22:13:22 -04:00
4716377fbc
[Feature] Estimate max-model-len use available KV cache memory (#16168 )
rongfu.leng
2025-04-09 10:12:51 +08:00
4e9cf8c1dd
[Bugfix] fix gettid method is not define (#16084 )
rongfu.leng
2025-04-09 10:12:44 +08:00
2976dc27e9
[Bug] [ROCm] Fix Llama 4 Enablement Bug on ROCm: V0 ROCmFlashAttentionImpl and Triton Fused MoE bugs (#16198 )
TJian
2025-04-09 10:12:34 +08:00
102bf967f0
[Model] Add smolvlm support (#16017 )
Chauncey
2025-04-09 10:12:17 +08:00
1f4b09b525
Add support to modelopt quantization of Mixtral model (#15961 )
yueshen2016
2025-04-08 18:53:31 -07:00
86c3369eb8
[CI/Build] Fix CI LoRA failure (#16270 )
Jee Jee Li
2025-04-09 09:13:56 +08:00
2755c34a8f
[V1] Update structured output offline inference example (#15721 )
Russell Bryant
2025-04-08 18:34:09 -04:00
db10422184
[Bugfix] fix deepseek fp16 scale bug (#14809 )
Jinzhen Lin
2025-04-09 04:56:09 +08:00
e1a2c699dd
[BugFix] Fix Llama4 - Index Error When Single Request Near Max Context (#16209 )
Lucas Wilkinson
2025-04-08 14:56:51 -04:00
0115ccd5c0
Add warning that content below line in template will be removed (#16276 )
Harry Mellor
2025-04-08 19:18:40 +01:00
40b4284fe3
[Bugfix] Handle process_weights_after_loading for QKVCrossParallelLinear (#15328 )
Isotr0py
2025-04-09 01:02:23 +08:00
4ebc0b9640
[Bugfix] Proper input validation for multi-modal encoder-decoder models (#16156 )
Cyrus Leung
2025-04-09 00:45:21 +08:00
dc96fd54c6
[Misc] Avoid stripping meaningful whitespace from nvidia-smi topo -m output in collect_env.py (#16272 )
Kero Liang
2025-04-09 00:08:09 +08:00
1f5d13ab9f
[New Model]: jinaai/jina-embeddings-v3 (#16120 )
wang.yuqi
2025-04-08 23:39:12 +08:00
90cb44eb02
Update to transformers==4.51.1 (#16257 )
Harry Mellor
2025-04-08 14:53:39 +01:00
e11880deea
[Bugfix] Remove triton do_bench fast_flush arg (#16256 )
Kebe
2025-04-08 21:51:06 +08:00
9351f91be9
[BugFix][ROCm] Fix GGUF MoE Dispatch Block_Dim for ROCm (#16247 )
TY-AMD
2025-04-08 20:10:26 +08:00
5a1e1c8353
[Model] use AutoWeightsLoader for phimoe,qwen2_moe,qwen3_moe (#16203 )
rongfu.leng
2025-04-08 19:05:47 +08:00
69ecaa7c79
[Misc] Add warning for multimodal data in LLM.beam_search (#16241 )
Alex Brooks
2025-04-08 05:05:27 -06:00
7f00899ff7
[Misc] format and refactor some examples (#16252 )
Reid
2025-04-08 18:42:32 +08:00
995e3d1f41
[Docs] Add Slides from Singapore Meetup (#16213 )
Simon Mo
2025-04-08 00:20:22 -07:00