Commit Graph

  • b4ac449a83 [Misc] Merge the logs of pp layers partitions (#16225) Kebe 2025-04-08 15:18:15 +08:00
  • 8e5314a468 [V1] Add disable_chunked_mm_input arg to disable partial mm input prefill (#15837) Michael Goin 2025-04-08 00:24:07 -06:00
  • 87918e40c4 [torch.compile][TPU] Make @support_torch_compile work for XLA backend (#15782) Siyuan Liu 2025-04-07 23:23:53 -07:00
  • f6b32efb7f [Bugfix] Fix and reorganize broken GGUF tests and bump gguf version (#16194) Isotr0py 2025-04-08 13:38:13 +08:00
  • b99733d092 [Bugfix] Do not skip "empty" parts of chats that are parsable (#16219) Michael Goin 2025-04-07 23:14:15 -06:00
  • 05a015d6a5 Add warning for Attention backends that do not support irope yet (#16212) Yong Hoon Shin 2025-04-07 20:59:26 -07:00
  • ad971af8c7 [Bugfix] fix use-ep bug to enable ep by dp/tp size > 1 (#16161) zxfan-cpu 2025-04-08 11:48:47 +08:00
  • f2ebb6f541 [V1] Scatter and gather placeholders in the model runner (#16076) Roger Wang 2025-04-07 19:43:41 -07:00
  • 1d01211264 Update BASE_IMAGE to 2.22 release of Neuron (#16218) Satyajith Chilappagari 2025-04-07 19:11:18 -07:00
  • f94ab12f79 [Misc] Update compressed-tensors to version 0.9.3 (#16196) Miles Williams 2025-04-08 03:09:06 +01:00
  • a865bc1ca6 [core] do not send error across process (#16174) youkaichao 2025-04-08 10:09:03 +08:00
  • 21802c4b6d [ROCm][Bugfix][FP8] Make fp8 quant respect fused modules mapping (#16031) Michael Goin 2025-04-07 19:28:14 -06:00
  • 652907b354 Torchao (#14231) Driss Guessous 2025-04-07 16:39:28 -07:00
  • 24f1c01e0f [Bugfix][V0] XGrammar structured output supports Enum (#15878) leon-seidel 2025-04-08 00:38:25 +02:00
  • fad6e2538e [Misc] add description attribute in CLI (#15921) Reid 2025-04-08 06:30:35 +08:00
  • 7f6d47c1a2 [V1][BugFix] Exit properly if engine core fails during startup (#16137) Nick Hill 2025-04-07 15:30:15 -07:00
  • 3147586ebd [Bugfix] Fix guidance backend for Qwen models (#16210) Benjamin Chislett 2025-04-07 18:15:43 -04:00
  • ed636d99ca [Misc] Move Llama 4 projector call into encoder execution (#16201) Roger Wang 2025-04-07 14:02:05 -07:00
  • 090c856d76 [Misc] Human-readable max-model-len cli arg (#16181) Nicolò Lucchesi 2025-04-07 20:40:58 +02:00
  • ad434d4cfe Print the warning only once (#16193) Gregory Shtrasberg 2025-04-07 14:30:06 -04:00
  • 66d433b94f [V1] Revert the default max_num_seqs to V0 values for most hardware (#16158) Cyrus Leung 2025-04-08 01:54:36 +08:00
  • 027b204ff1 [Bugfix] Re-enable support for ChatGLMForConditionalGeneration (#16187) Cyrus Leung 2025-04-07 23:15:58 +08:00
  • 55dcce91df Upstream Llama4 Support to Main (#16113) Lu Fang 2025-04-07 08:06:27 -07:00
  • 8017c8db7f [Doc]Update image to latest version (#16186) Robin 2025-04-07 22:17:39 +08:00
  • dc3529dbf6 [Misc] improve example mlpspeculator and llm_engine_example (#16175) Reid 2025-04-07 19:53:52 +08:00
  • 7699258ef0 [Model] Add Qwen3 and Qwen3MoE (#15289) YamPengLi 2025-04-07 19:06:41 +08:00
  • e9ba99f296 [V1][Structured Output] Add supports_structured_output() method to Platform (#16148) Shanshan Shen 2025-04-07 19:06:24 +08:00
  • 7c80368710 [VLM] Florence-2 supports online serving (#16164) Isotr0py 2025-04-07 19:04:02 +08:00
  • 95d63f38c0 doc: fix some typos in doc (#16154) yihong 2025-04-07 13:32:06 +08:00
  • bb8dab821e [CI] Set max transformers version for Ultravox model test (#16149) Roger Wang 2025-04-06 21:37:58 -07:00
  • fc0f87768a [Bugfix] Make dummy encoder prompt padding alternative and add missing warnings (#16129) Isotr0py 2025-04-07 12:07:15 +08:00
  • 0a57386721 [Misc] Update Mistral-3.1 example (#16147) Cyrus Leung 2025-04-07 11:57:37 +08:00
  • 3749e28774 [V1][Minor] Minor simplification for get_computed_blocks (#16139) Woosuk Kwon 2025-04-06 20:38:12 -07:00
  • 86fc2321ff [Metrics] Add bucket for request_latency, time_to_first_token and time_per_output_token (#15202) Kay Yan 2025-04-07 11:34:51 +08:00
  • 2549c0dfef Fix requires-python (#16132) Martin Hoyer 2025-04-07 04:22:25 +02:00
  • b10e519895 [V1][Minor] Optimize get_cached_block (#16135) Woosuk Kwon 2025-04-06 13:48:14 -07:00
  • 9bde5ba127 [TPU] Update PyTorch/XLA (#16130) Chengji Yao 2025-04-06 11:25:55 -07:00
  • 72c8f1ad04 [Misc] update requires-python in pyproject.toml (#16116) Reid 2025-04-06 22:56:34 +08:00
  • da224daaa9 [Bugfix] add hf_token to EngineArgs (#16093) paolovic 2025-04-06 16:47:33 +02:00
  • 3a100b9278 [Bugfix] LoRA : Fix the order in which the kernels process LoRAs (#16040) Varun Sundar Rabindranath 2025-04-06 10:04:50 -04:00
  • 242a637aea [Model] use AutoWeightsLoader for stablelm,starcoder2,zamba2 (#16103) rongfu.leng 2025-04-06 20:52:01 +08:00
  • c2a9671510 [Misc] Improve model redirect to accept json dictionary (#16119) Isotr0py 2025-04-06 20:51:45 +08:00
  • d5ae4f7f42 [Doc][Bugfix] Add missing EOF in k8s deploy doc (#16025) Paul Schweigert 2025-04-06 08:10:57 -04:00
  • b6c502a150 [Misc] refactor example eagle (#16100) Reid 2025-04-06 17:42:48 +08:00
  • 9ca710e525 [CI][V1] Fix passing tokenizer as kwarg to validate_guidance_grammar (#16117) Roger Wang 2025-04-06 01:18:00 -07:00
  • eb07c8cb5b [Frontend] Fix typo in tool chat templates for llama3.2 and toolace (#14501) Ben Jackson 2025-04-06 00:44:36 -07:00
  • ba10801961 [Benchmark] Add sampling parameters to benchmark_serving. (#16022) Hyesoo Yang 2025-04-05 21:30:35 -07:00
  • 620fc2d09e [Model] fix model testing for TeleChat2ForCausalLM and V0 llama4 (#16112) Lucia Fang 2025-04-05 21:23:40 -07:00
  • 296c6572dd Revert "[V1] DP scale-out (1/N): Use zmq ROUTER/DEALER sockets for input queue (#15906)" (tag: v0.8.3) simon-mo 2025-04-05 21:10:57 -07:00
  • c575232395 [Model] Support Llama4 in vLLM (#16104) Lu Fang 2025-04-05 21:01:00 -07:00
  • 29283eaa7e [Model] use AutoWeightsLoader for phi, gemma, deepseek (#16088) Jonghyun Choe 2025-04-06 12:34:38 +09:00
  • 2fa66ef713 [Bugfix] fix use_atomic_add support of marlin kernel when using v1 engine (#15946) Jinzhen Lin 2025-04-06 11:04:22 +08:00
  • 13affc432d [Misc] Remove redundant code (#16098) Chauncey 2025-04-06 11:03:50 +08:00
  • d8f094a92a [Misc] format output for encoder_decoder.py (#16095) Reid 2025-04-06 10:57:18 +08:00
  • 97ae6d777f Fix some capitalisations in generated examples doc titles (#16094) Harry Mellor 2025-04-05 14:44:03 +01:00
  • 6baeee70d1 Revert "doc: add info for macos clang errors (#16049)" (#16091) yihong 2025-04-05 19:51:51 +08:00
  • d2517a4939 [doc] fix 404 (#16082) Reid 2025-04-05 19:39:18 +08:00
  • 6342adc438 fix: support clang17 for macos and fix the real libomp (#16086) yihong 2025-04-05 19:00:12 +08:00
  • 0adba91547 [CI] Fix benchmark script level (#16089) Kevin H. Luu 2025-04-05 03:36:01 -07:00
  • 4285e423a6 [Misc] Auto detect bitsandbytes pre-quantized models (#16027) Tristan Leclercq 2025-04-05 08:30:45 +02:00
  • 63375f0cdb [V1][Spec Decode] Update N-gram Proposer Interface (#15750) (tag: v0.8.3rc1) Woosuk Kwon 2025-04-04 16:32:54 -07:00
  • 70ad3f9e98 [Bugfix][TPU] Fix V1 TPU worker for sliding window (#16059) Michael Goin 2025-04-04 17:31:19 -06:00
  • d6fc629f4d [Kernel][Minor] Re-fuse triton moe weight application (#16071) bnellnm 2025-04-04 19:27:34 -04:00
  • af51d80fa1 Revert "[V1] Scatter and gather placeholders in the model runner" (#16075) Roger Wang 2025-04-04 14:50:57 -07:00
  • f5722a5052 [V1] Scatter and gather placeholders in the model runner (#15712) Cyrus Leung 2025-04-05 05:26:44 +08:00
  • 651cf0fec1 [V1] DP scale-out (1/N): Use zmq ROUTER/DEALER sockets for input queue (#15906) Nick Hill 2025-04-04 12:56:43 -07:00
  • 4dc52e1c53 [CI] Reorganize .buildkite directory (#16001) Kevin H. Luu 2025-04-04 12:16:20 -07:00
  • 4708f13a9c [Bugfix] Fix default behavior/fallback for pp in v1 (#16057) Michael Goin 2025-04-04 11:58:08 -06:00
  • a6d042df0a [ROCm][Bugfix] Bring back fallback to eager mode removed in #14917, but for ROCm only (#15413) Gregory Shtrasberg 2025-04-04 12:40:37 -04:00
  • 40a36ccfeb [ROCm][Bugfix] Use platform specific FP8 dtype (#15717) Gregory Shtrasberg 2025-04-04 12:40:20 -04:00
  • ef608c37a7 [Distributed] [ROCM] Fix custom allreduce enable checks (#16010) Ilya Markov 2025-04-04 18:39:08 +02:00
  • 2386803f2a [CPU] Change default block_size for CPU backend (#16002) Li, Jiang 2025-04-05 00:39:05 +08:00
  • 95862f7b4d [Benchmark][Doc] Update throughput benchmark and README (#15998) Ziji Shi (Steven) 2025-04-04 09:39:02 -07:00
  • 230b131b54 [Bugfix][kernels] Fix half2float conversion in gguf kernels (#15995) Isotr0py 2025-04-05 00:38:58 +08:00
  • 0812d8dd41 [Hardware][Gaudi][BugFix] fix arguments of hpu fused moe (#15945) liuzhenwei 2025-04-05 00:38:55 +08:00
  • bf7e3c51ae [Model] use AutoWeightsLoader for baichuan, gpt-neox, mpt (#15939) Jonghyun Choe 2025-04-05 01:38:52 +09:00
  • a35a8a8392 [V1][Spec Decode] Avoid logging useless nan metrics (#16023) Mark McLoughlin 2025-04-04 16:52:41 +01:00
  • 4ef0bb1fcf doc: add info for macos clang errors (#16049) yihong 2025-04-04 22:58:16 +08:00
  • fadc59c0e6 [TPU][V1] Remove ragged attention kernel parameter hard coding (#16041) Chengji Yao 2025-04-04 04:48:50 -07:00
  • 86cbd2eee9 [Misc] improve gguf check (#15974) Reid 2025-04-04 09:33:36 +08:00
  • 092475f738 [ROCm] Tweak the benchmark script to run on ROCm (#14252) Huy Do 2025-04-03 17:12:48 -07:00
  • dcc56d62da [Bugfix] Fix function names in test_block_fp8.py (#16033) bnellnm 2025-04-03 19:01:34 -04:00
  • f15e70d906 [TPU] Switch Test to Non-Sliding Window (#15981) Robert Shaw 2025-04-03 14:28:45 -07:00
  • b6be6f8d1e [TPU] Support sliding window and logit soft capping in the paged attention kernel for TPU. (#15732) iefgnoix 2025-04-03 14:23:28 -07:00
  • 03a70eacaf Re-enable the AMD Testing for the passing tests. (#15586) Alexei-V-Ivanov-AMD 2025-04-03 13:05:17 -05:00
  • 45b1ff7a25 [Misc][Performance] Advance tpu.txt to the most recent nightly torch … (#16024) yarongmu-google 2025-04-03 10:32:54 -07:00
  • 15ba07ef25 [Minor] Fused experts refactor (#15914) bnellnm 2025-04-03 13:19:38 -04:00
  • d2b58ca203 [Neuron][kernel] Fuse kv cache into a single tensor (#15911) Liangfu Chen 2025-04-03 09:51:32 -07:00
  • 82e7e19a6e [SupportsQuant] Chameleon, Chatglm, Commandr (#15952) Kyle Sayers 2025-04-03 11:25:22 -04:00
  • 421c462948 [SupportsQuant] Bert, Blip, Blip2, Bloom (#15573) Kyle Sayers 2025-04-03 11:23:19 -04:00
  • 84884cd9ac fix: tiny fix make format.sh excutable (#16015) yihong 2025-04-03 23:18:05 +08:00
  • a43aa183dc [doc] update contribution link (#15922) Reid 2025-04-03 18:47:31 +08:00
  • 463bbb1835 [Bugfix][V1] Fix bug from putting llm_engine.model_executor in a background process (#15367) wwl2755 2025-04-03 02:32:10 -05:00
  • 5e125e74d1 [misc] improve error message for "Failed to infer device type" (#15994) youkaichao 2025-04-03 14:45:03 +08:00
  • 06f21ce7a5 [Benchmark] Add AIMO Dataset to Benchmark (#15955) Ziji Shi (Steven) 2025-04-02 23:09:18 -07:00
  • 57a810db9c [ROCM][V0] PA kennel selection when no sliding window provided (#15982) Aleksandr Malyshev 2025-04-02 22:28:44 -07:00
  • 8b664706aa [bugfix] add seed in torchrun_example.py (#15980) youkaichao 2025-04-03 12:25:01 +08:00
  • 37bfee92bf fix: better error message for get_config close #13889 (#15943) yihong 2025-04-03 11:53:19 +08:00
  • e73ff24e31 [ROCM][KERNEL] Paged attention for V1 (#15720) Aleksandr Malyshev 2025-04-02 19:48:00 -07:00
  • bd7599d34a [V1][TPU] Do not compile sampling more than needed (#15883) Nicolò Lucchesi 2025-04-03 03:36:01 +02:00