Commit Graph

  • 39a22dcaac [Misc] Minor code simplification for spec decode (#24053) Woosuk Kwon 2025-09-01 08:54:01 -07:00
  • 41c80698b3 Document multi-proc method selection for profiling (#23802) Julien Debache 2025-09-01 15:28:26 +02:00
  • 7c8271cd1e [Model]: support KeyeVL-1_5-8B (#23838) Kwai-Keye 2025-09-01 18:50:27 +08:00
  • 3e330fcb21 [Doc]: Fix CPU install docs: force torch-backend=cpu to avoid GPU torchvision errors (#24033) Kay Yan 2025-09-01 18:34:52 +08:00
  • d46934b229 [Frontend] Gemma3n audio transcriptions/translations endpoint (#23735) Nicolò Lucchesi 2025-09-01 12:07:46 +02:00
  • 107284959a [Doc]: fix typos in Python comments (#24026) Didier Durand 2025-09-01 11:38:20 +02:00
  • dc1a53186d [Kernel] Update DeepGEMM to latest commit (#23915) Jee Jee Li 2025-09-01 17:38:04 +08:00
  • 55602bb2e6 [Frontend] Update the warning log when using VLLM_ALLOW_LONG_MAX_MODEL_LEN (#20904) wang.yuqi 2025-09-01 16:50:25 +08:00
  • d7fbc6ddac [Misc] Enable V1 FP16 inference on pre-Ampere GPUs (#24022) Isotr0py 2025-09-01 16:12:22 +08:00
  • 5438967fbc [Misc] add hash_function doc string (#24014) Ning Xie 2025-09-01 14:11:20 +08:00
  • 422e793fa6 [Bugfix] Add support for <tool_call> format in streaming mode for XLAM Tool Parser (#22769) Code Jesus 2025-08-31 23:07:54 -07:00
  • 1cb39dbcdd [Misc] IO Processor plugins for pooling models (#22820) Christian Pinto 2025-09-01 07:07:12 +01:00
  • 437c3ce026 Migrate Phi4 inputs to TensorSchema (#23471) Benji Beck 2025-08-31 23:05:59 -07:00
  • 499b074bfd [Misc] refactor code by import as for torch._inductor.config (#23677) Ning Xie 2025-09-01 14:05:42 +08:00
  • ff0e59d83a [CI/Build] Improve Tensor Schema tests speed by avoid engine core initialization (#23357) Isotr0py 2025-09-01 13:52:20 +08:00
  • b55713683c [Misc] Move fast prefill logic to separate method (#24013) Woosuk Kwon 2025-08-31 22:40:38 -07:00
  • acc1a6e10a Fix the bug related to loading GPTQ INT3 weights. (#23328) Jun-Howie 2025-09-01 13:39:57 +08:00
  • 8c742a66d1 [Misc] Avoid redundant copy for encoder-only models (#24012) Woosuk Kwon 2025-08-31 21:02:43 -07:00
  • 183a70967a [BUGFIX] GPTQ quantization compatibility for Qwen3 MOE models (AutoGPTQ and AutoRound-GPTQ) (#23994) JartX 2025-09-01 05:33:40 +02:00
  • 14b4326b94 v1: Support KV events from connectors (#19737) Or Ozeri 2025-09-01 04:13:21 +03:00
  • 752d2e1c36 [Minor] Fix some random typos in comments (#24009) Nick Hill 2025-08-31 16:42:17 -07:00
  • 81eea3d348 vllm fix check on max vocab size (#22471) Xiaodong Wang 2025-08-31 05:57:05 -07:00
  • 9701352e4b [Doc]: fix typos in Python comments (#24001) Didier Durand 2025-08-31 10:21:59 +02:00
  • 749be00a98 [Core][Multimodal] Allow passing multi_modal_uuids as multimodal identifiers. (#23394) Roger Wang 2025-08-30 18:01:22 -07:00
  • 5b8077b8ac Fix wrong truncate_prompt_tokens type hint (#22761) Gabriel Marinho 2025-08-30 17:39:38 -03:00
  • 038e9be4eb [LoRA] Much faster startup when LoRA is enabled (#23777) Andy Lo 2025-08-30 16:37:39 +01:00
  • 68a349114f [Misc] enhance type hint for rearrange return value (#23519) Ning Xie 2025-08-30 21:43:33 +08:00
  • e80bca309e [Refactor] refactor freezing_value/cuda_event initialize outside try finally (#23758) Ning Xie 2025-08-30 21:42:25 +08:00
  • fb4983e112 [Misc] add reorder_batch AttentionMetadataBuilder (#23798) Ning Xie 2025-08-30 21:41:45 +08:00
  • 379ea2823a Add LoRA support for DeepSeek models (V2, V3, R1-0528) (#23971) sadegh.shokatian 2025-08-30 06:40:02 -07:00
  • 3a6acad431 [Model] Enable encoder DP for MiniCPM-V (#23948) Jiangyun Zhu 2025-08-30 21:31:26 +08:00
  • 5490d633ce [UT] fix unify_kv_cache_configs when kv cache config needs sort (#23843) Ning Xie 2025-08-30 19:22:14 +08:00
  • 628d00cd7b [Bugfix] Fix test_lora_resolvers.py (#23984) Jee Jee Li 2025-08-30 19:16:11 +08:00
  • 4071c76cf3 [V1] [Hybrid] Move MiniMaxLinearAttention into layers/mamba (#23831) Thomas Parnell 2025-08-30 09:16:15 +02:00
  • f1bddbd852 [Core] Cleanup TPU model runner for MM (#23894) Cyrus Leung 2025-08-30 15:14:58 +08:00
  • 9748c5198b [CI] Fix broken compile tests due to unsupported SiluMul+Nvfp4Quant fusion (#23973) Yong Hoon Shin 2025-08-30 00:14:43 -07:00
  • ee52a32705 [CI] Move testing image from remote URL to S3 (#23980) Roger Wang 2025-08-29 21:41:25 -07:00
  • 8fb85b7bb6 Add routed_scaling_factor to MoE grouped topk (#23123) Xin Yang 2025-08-29 21:36:48 -07:00
  • 5b31cb1781 [Bugfix] Fix --config arg expansion called from api_server.py (#23944) dubejf 2025-08-30 00:36:39 -04:00
  • d660c98c1b [CI] Fix unavailable image remote URL (#23966) Roger Wang 2025-08-29 15:40:04 -07:00
  • 5674a40366 [Misc] Make download_weights_from_hf more reliable (#23863) Harry Mellor 2025-08-29 20:37:24 +01:00
  • 8c3e199998 Revert gemma3n fast prefill changes (#23897) Yong Hoon Shin 2025-08-29 12:16:57 -07:00
  • 1c26b42296 [Docs] [V1] [Hybrid] Add new documentation re: contributing mamba-based models (#23824) Thomas Parnell 2025-08-29 20:47:58 +02:00
  • b7adf94c4a Tuned H100/H200 triton fp8 block configs for fused_qkv_a_proj (#23939) Michael Goin 2025-08-29 13:28:35 -04:00
  • 4d7fe40fc0 [RL][BugFix] Fix missing tokenizer error for token-in-token-out (#23904) 22quinn 2025-08-29 10:09:55 -07:00
  • 0dc9532065 [BUGFIX] fix undefined silu_and_mul_nvfp4_quant (#23929) yzds 2025-08-30 00:36:39 +08:00
  • 72a69132dc [CI] Add aiter to matching list of issue auto labeller for rocm tag (#23942) vllmellm 2025-08-29 23:29:21 +08:00
  • d90d8eb674 [BugFix] Async scheduling and PP compatibility with DP (#23770) Nick Hill 2025-08-29 08:17:27 -07:00
  • 0a2f4c0793 [Models] Use in-place adds in Idefics2Vision (#23932) Lukas Geiger 2025-08-29 15:42:57 +01:00
  • 1cf3753b90 [MODEL] Apertus and XIELU (#23068) EduardDurech 2025-08-29 14:29:18 +02:00
  • 4f7cde7272 Adds json_count_leaves utility function (#23899) Adit Chawdhary 2025-08-29 17:58:13 +05:30
  • 67c14906aa Update PyTorch to 2.8.0 (#20358) Huy Do 2025-08-29 03:57:35 -07:00
  • 69f46359dd [Multimodal] Consolidate mm inputs into MultiModalFeatureSpec (#23779) Flora Feng 2025-08-29 03:36:57 -07:00
  • d9e00dbd1f [Performance] V1 Classify Models E2E Performance Optimization (#23541) wang.yuqi 2025-08-29 18:12:32 +08:00
  • ad39106b16 [CPU] Enable data parallel for CPU backend (#23903) Li, Jiang 2025-08-29 17:19:58 +08:00
  • 2554b27baa [V0 Deprecation] Remove pooling model support in V0 (#23434) Maximilien de Bayser 2025-08-29 04:04:02 -03:00
  • 934bebf192 Better errors for Transformers backend missing features (#23759) Harry Mellor 2025-08-29 08:01:40 +01:00
  • 885ca6d31d [Misc] Fix warnings for mistral model (#23552) Jiangyun Zhu 2025-08-29 14:58:48 +08:00
  • 2d0afcc9dc [mrope][Qwen2-VL] Fix edge case where getting index of image/video token can potentially throw in default vl mrope implementation. (#23895) Chenheli Hua 2025-08-28 23:29:13 -07:00
  • b4f9e9631c [CI/Build] Clean up LoRA test (#23890) Jee Jee Li 2025-08-29 14:28:35 +08:00
  • 05d839c19e Fix(async): Add support for truncate_prompt_tokens in AsyncLLM (#23800) Raghavan 2025-08-29 11:25:06 +05:30
  • 6597d7a456 [Platform] import activation_quant_fusion for CUDA only (#23882) wangxiyuan 2025-08-29 13:54:16 +08:00
  • 5264015d74 [BugFix][AMD][Deepseek] fix a dtype mismatch error for deepseek running on AMD (#23864) Jinghui Zhang 2025-08-28 22:54:12 -07:00
  • 98ac0cb32d [Bugfix] Use ReplicatedLinear for SequenceClassification head (#23836) Isotr0py 2025-08-29 12:41:20 +08:00
  • c8b3b299c9 [tests] Improve speed and reliability of test_transcription_api_correctness (#23854) Russell Bryant 2025-08-29 00:25:33 -04:00
  • 006477e60b [ROCm][Fix] Fix rocm build caused by #23791 (#23847) Charlie Fu 2025-08-28 21:52:27 -05:00
  • de533ab2a1 [Models] Improve iteration over layers (#19497) Lukas Geiger 2025-08-29 02:26:34 +01:00
  • 235c9db8a7 [XPU] support data parallel for MoE models on XPU (#22887) Chaojun Zhang 2025-08-29 09:23:04 +08:00
  • b668055a11 [V0 Deprecation] Remove V0 Samplers test (#23862) Woosuk Kwon 2025-08-28 18:05:52 -07:00
  • d3d2aad5a2 [Log] Use Debug Once for DeepGEMM E8M0 When not Enabled (#23858) Wentao Ye 2025-08-28 18:18:10 -04:00
  • cb293f6a79 [V1] Enable prefill optimization for Gemma3n (#22628) Yong Hoon Shin 2025-08-28 14:54:30 -07:00
  • 7ffbf27239 [BugFix][FlashInfer] Fix potential race condition for paged_kv_indptr_cpu (#23737) Woosuk Kwon 2025-08-28 14:22:46 -07:00
  • 27e88cee74 chore: build release image by default (#23852) Simon Mo 2025-08-28 13:17:15 -07:00
  • 16a45b3a28 [NVIDIA] Support SiluMul + NVFP4 quant fusion (#23671) elvischenv 2025-08-29 03:36:50 +08:00
  • 57d4ede520 [bugfix] [spec-decoding] fix data race in sample_recovered_tokens_kernel (vLLM v1) (#23829) Jingkai He 2025-08-29 03:05:20 +08:00
  • 04d1dd7f4a [ROCm][Aiter] Add triton fp8 bmm kernel for mla (#23264) Divakar Verma 2025-08-28 13:18:08 -05:00
  • f32a5bc505 Migrate Llama4ImagePatchInputs to TensorSchema (#22021) Benji Beck 2025-08-28 10:29:37 -07:00
  • 8805ad9fa9 Add scale_config.yml file for Meta autoscalers for GH Actions (#23840) Jean Schmidt 2025-08-28 18:31:20 +02:00
  • 0583578f42 [ci] breaks down V1 Test into 3 groups of approx 30 minutes runtime (#23757) Jean Schmidt 2025-08-28 17:59:19 +02:00
  • db74d60490 [Bugfix] Add fake mode around passes (#23349) Angela Yi 2025-08-28 08:25:56 -07:00
  • 95089607fa [Model][gpt-oss] Support DP+EP for GPT-OSS with FlashInfer trtllm-gen MoE (#23819) Po-Han Huang (NVIDIA) 2025-08-28 21:56:20 +08:00
  • 1f096f9b95 [CI] Fix linting error on main (#23835) Thomas Parnell 2025-08-28 15:52:01 +02:00
  • 66548f6603 [Bugfix] Fix benchmark_moe.py for blockwise fp8. (#23823) YUQI.CHENG 2025-08-28 21:44:09 +08:00
  • d3da2eea54 [Doc]: fix typos in Python scripts (#23828) Didier Durand 2025-08-28 14:37:38 +02:00
  • bfab219648 [Model] [gpt-oss] fix gpt-oss pp support (#23815) Jiangyun Zhu 2025-08-28 20:36:55 +08:00
  • a3432f18fd [BugFix][Spec Decode] Use float64 for uniform_probs (#23803) Woosuk Kwon 2025-08-28 05:26:45 -07:00
  • 67cee40da0 [CI/Build][Bugfix] Fix Qwen VL tests on CPU (#23818) Li, Jiang 2025-08-28 19:57:05 +08:00
  • d99c3a4f7b [Doc]: fix typos in .md files (including those of #23751) (#23825) Didier Durand 2025-08-28 13:38:19 +02:00
  • 3462c1c522 [FIXBUG] Add return_success parameter to moe_wna16_weight_loader function (#22797) JartX 2025-08-28 11:03:22 +02:00
  • c5d004aaaf [Model] Add PP support and VLM backbone compatibility for GPT-OSS (#23680) Isotr0py 2025-08-28 16:03:28 +08:00
  • 11a7fafaa8 [New Model]: Support GteNewModelForSequenceClassification (#23524) wang.yuqi 2025-08-28 15:36:42 +08:00
  • 186aced5ff [Kernel] cuda kernels for upcoming decode context parallel feature (#23791) yzds 2025-08-28 15:29:11 +08:00
  • daa1273b14 [Bugfix] when set offline model running error (#23711) rongfu.leng 2025-08-28 15:27:45 +08:00
  • c07a73317d [CI] enable idefics3 and fuyu-8b test in multimodal test (#23790) Jiangyun Zhu 2025-08-28 14:51:24 +08:00
  • 22feac8e95 [Transform] [Quantization] Add transforms to compressed tensors (#22486) Kyle Sayers 2025-08-28 02:43:48 -04:00
  • c8851a4723 Add deprecation warning for lora_extra_vocab_size (#23635) Jinheng 2025-08-28 13:34:29 +08:00
  • f48a9af892 [CI] make all multi-gpu weight loading tests run nightly (#23792) Alex 2025-08-27 23:27:36 -05:00
  • a11adafdca Gracefully handle edge cases in harmony utils (#23155) Jan Kessler 2025-08-28 05:14:00 +02:00
  • a781e84ec2 [Perf] Tune configs for triton block fp8 gemm H100/H200 (#23748) Michael Goin 2025-08-27 23:12:53 -04:00
  • 1b7b161a09 [Feature] models: pass layer prefix to replace_linear_class for per-layer quantization routing. Addresses #23239 (#23556) Shrey Gupta 2025-08-28 08:42:44 +05:30