Commit Graph

  • 013b73e9b2 Fix managed KV cache: use __cuda_array_interface__ instead of UntypedStorage.from_blob cmm biondizzle 2026-04-12 06:56:52 +00:00
  • c77342da87 KV cache: prefer CPU placement, zero via CPU not GPU biondizzle 2026-04-12 03:44:16 +00:00
  • 7f35bc4158 Targeted KV cache managed memory allocation biondizzle 2026-04-11 02:14:34 +00:00
  • 487dd34e04 Selective prefetch: only prefetch allocations <2 GiB to GPU biondizzle 2026-04-10 14:58:57 +00:00
  • a15f86ecfa Remove cudaMemPrefetchAsync from managed allocator biondizzle 2026-04-10 05:58:11 +00:00
  • e5de19ff9a [CI/Build] Don't auto-rebase PRs with CI failures (#39443) main Cyrus Leung 2026-04-10 04:57:37 +08:00
  • edee96519a [Spec Decode] fix returning size mismatch on extract hidden states proposer (#38610) zzaebok 2026-04-10 04:39:39 +08:00
  • adaabb8a55 Add nightly b200 test for spec decode eagle correctness (#38577) Rishi Puri 2026-04-09 13:09:09 -07:00
  • f7cad67412 [ASR] Fix spacing bw chunks in multi chunk audio transcription (#39116) Ekagra Ranjan 2026-04-09 15:46:33 -04:00
  • a8134aef4e [XPU] check is_xccl_available before oneccl warmup (#39302) Xinyu Chen 2026-04-10 03:42:17 +08:00
  • 2800706f06 [Refactor] Move NVFP4 GEMM management into NvFp4LinearKernel (#39129) Michael Goin 2026-04-09 21:05:36 +02:00
  • 0d310ffbeb [CI/Build] Update auto-rebase rule (#39429) Cyrus Leung 2026-04-10 01:59:56 +08:00
  • d5f75fdf50 [ROCm] Correctly guard fused_silu_mul_block_quant on ROCm (#39387) Micah Williamson 2026-04-09 12:59:03 -05:00
  • 827268e98d [Quantization] Support Quark W8A8 INT8 MoE inference (#36320) PikaPikachu 2026-04-10 01:24:43 +08:00
  • 56e19d7ee2 [Model Runner V2] Fix flex attention kv blocks calculation issue (#39353) Wentao Ye 2026-04-09 13:07:43 -04:00
  • 9036d4c464 [ROCm][CI] Resolved nvidia package deps issue (#39421) Andreas Karatzas 2026-04-09 11:06:06 -05:00
  • a8c6ee9b78 [Performance Improvement] Update batched_count_greater_than to handle batch size 1 without recompile (#38933) Lucas Kabela 2026-04-09 08:51:31 -07:00
  • 3b1d9c3156 [CI/Build] Fix memory cleanup in MM test (#39411) Cyrus Leung 2026-04-09 23:50:45 +08:00
  • 54d244f28f [UX] Improve error message for MM input too long (#39409) Cyrus Leung 2026-04-09 21:20:19 +08:00
  • 6c749399b7 [BugFix] fix tests/kernels/moe/test_moe_layer.py (#39404) Richard Zou 2026-04-09 14:48:59 +02:00
  • 91eea72330 [Tests] Add Qwen3-VL multimodal memory leak check (#39268) lalit10 2026-04-09 04:54:46 -07:00
  • df2503e125 nemotron-nano-vl: Allow use_audio_in_video to be passed at vllm serve time (#38538) Andrii Skliar 2026-04-09 13:44:39 +02:00
  • c8d98f81f6 [Core] Simplify API server handshake (#39364) Nick Hill 2026-04-09 03:56:15 -07:00
  • d87fb264df [Docs] Bring README updates into docs README (#39397) Harry Mellor 2026-04-09 12:35:00 +02:00
  • 66c079ae83 [Frontend][4/n] Improve pooling entrypoints | pooling. (#39153) wang.yuqi 2026-04-09 18:09:45 +08:00
  • b6c9be509e [CI] fix possible user permission issues in nightly index generation (#39390) Shengqi Chen 2026-04-09 16:14:07 +08:00
  • ed733802f0 Fix NUMA binding on non-CDMM Grace-Blackwell systems (#39361) Qidong Su 2026-04-09 03:36:51 -04:00
  • 8a34c5087a [ROCm] Remove unnecessary fp8 roundtrip in gather cache NHD dequant (#39122) Andrew Barnes 2026-04-09 03:12:22 -04:00
  • ed2f282bc8 [Perf] Optimize redundant sync for pooling model, 3.7% Throughput Improvement (#39113) Wentao Ye 2026-04-09 02:12:23 -04:00
  • 9e78555743 [Docker] Add fastsafetensors to NVIDIA Dockerfile (#38950) Zhewen Li 2026-04-08 22:21:37 -07:00
  • e80e633927 [XPU] Skip VLLM_BATCH_INVARIANT for XPU in EAGLE DP test (#39164) sihao_li 2026-04-09 12:45:16 +08:00
  • 490f17d0c7 [Multimodal] Fix nested_tensors_equal: add length check for lists and tuple support (#38388) Khairul Kabir 2026-04-08 21:40:37 -07:00
  • 2e98406048 [Refactor] Improve indexer decode path metadata preparation (#38865) Yongye Zhu 2026-04-08 23:49:15 -04:00
  • ef5a226819 [PD][HeteroArch]Fix accuracy issue with CPU_ATTN as Decoder and Flash_ATTN as prefiller (#38935) Chendi.Xue 2026-04-08 22:19:07 -05:00
  • aec18492d0 [CI] Fix mypy for vllm/v1/ops (#39219) Wentao Ye 2026-04-08 23:06:34 -04:00
  • 2a49284c8a Fix Responses JSON schema alias serialization (#38519) noobHappylife 2026-04-09 10:50:16 +08:00
  • d37b378762 [Model] Update ColModernVBERT to support latest HF checkpoint (#39307) Ilya Boytsov 2026-04-09 04:48:51 +02:00
  • 92fbec391b [Bug] Fix routing bias dtype for trtllm per-block fp8 moe (#38989) Wei Zhao 2026-04-08 22:42:43 -04:00
  • 2f41d6c063 [Bugfix] Fix cpu-offload-gb assertion with non-default block sizes (#36461) Ajay Anubolu 2026-04-08 19:42:16 -07:00
  • 3aecdf08b4 [Gemma4] Support quantized MoE (#39045) Dipika Sikka 2026-04-08 21:57:53 -04:00
  • eb4205fee5 [UX] Integrate DeepGEMM into vLLM wheel via CMake (#37980) Michael Goin 2026-04-09 03:56:32 +02:00
  • 83aea2147f [XPU][UT] update UTs in CI (#39296) liuzhenwei 2026-04-09 09:38:16 +08:00
  • 2e9034c998 [W8A8 Block Linear Refactor][2/N] Remove W8A8Fp8BlockLinearOp and adopt Fp8 block linear kernel selections. (#33892) Maral 2026-04-09 08:50:39 +08:00
  • 8332078cfd [Bugfix] FlashInfer MXINT4 MoE crashes, missing do_finalize (#39315) Benjamin Chislett 2026-04-08 20:36:33 -04:00
  • ba4a78eb5d [torch.compile] Allow usage of Opaque Objects in PyTorch 2.11 (#39286) Richard Zou 2026-04-09 01:21:10 +02:00
  • f3c7941ec8 [Bugfix]Fix EP precision for Qwen3.5, Qwen3-Next (#39181) Kai Song 2026-04-09 05:47:48 +08:00
  • 3352bf8b03 [CI Bug] Fix pre-commit issue in main (#39347) Wentao Ye 2026-04-08 17:10:05 -04:00
  • 7c94ae16c6 [BugFix] --max-model-len=-1 causes over-limit requests to hang and starve the entire service (#39102) triangleXIV 2026-04-09 05:03:17 +08:00
  • ad05edfbca tests/v1/e2e/spec_decode: assert async scheduling is used (#39206) Rishi Puri 2026-04-08 13:30:03 -07:00
  • 2018137242 [Feature] Batch invariant nvfp4 linear support (#39322) Wentao Ye 2026-04-08 16:29:13 -04:00
  • a776a48b1c [MoE] Move DEEP_GEMM into experts/ subdirectory (#39005) Jackmin801 2026-04-08 12:23:08 -07:00
  • 8477fe427d [Tool] adjust_request to reasoning parser, and Gemma4 fixes (#39027) Ben Browning 2026-04-08 15:04:04 -04:00
  • e24e0a43a4 [Attention] relax the head dim 512 and paged kv for sm90+FA4 (#38835) Lain 2026-04-08 11:23:18 -07:00
  • b55d830ec7 [Perf][Kernel] Persistent TopK scheduler: unified CUDAGraph-safe kernel with dynamic per-row dispatch - DeepSeek-V3.2 DSA decode (#37421) Roberto L. Castro 2026-04-08 19:35:57 +02:00
  • 75e01a39a1 [Feature] NUMA binding support for GPU workers (#38635) Shengqi Chen 2026-04-09 00:55:24 +08:00
  • 512c5eb455 [kv_offload+HMA][5/N]: Track group block hashes and block IDs (#37109) Or Ozeri 2026-04-08 19:50:28 +03:00
  • 13151a4df4 [Bugfix] Fix Gemma4 streaming tool call corruption for split boolean/number values (#39114) Flora Feng 2026-04-08 12:46:27 -04:00
  • 56c976c1b5 [ROCm] Enable fused_silu_mul_block_quant on ROCm (#38817) Gregory Shtrasberg 2026-04-08 11:23:32 -05:00
  • d74a306c4b [Core] Use tuple_return in split_module for tuple-conformant subgraphs (#38752) Frederik Gossen 2026-04-08 12:09:58 -04:00
  • 0e9f0a516c [ROCm][CI-Build] Cherry pick triton BUFFER_OPS fix and update AITER (#38580) Gregory Shtrasberg 2026-04-08 10:38:03 -05:00
  • 8904fc4d19 [Bugfix] Fix V1 logprobs empty strings for multi-byte UTF-8 tokens when logprobs > 0 (#34875) haosdent 2026-04-08 23:30:00 +08:00
  • 1a2c17634e [Bugfix] Add missing ASRDataset import and CLI args in benchmarks/throughput.py (#38114) nemanjaudovic 2026-04-08 15:53:53 +02:00
  • 308cec5864 [FlashAttention] Symlink FA4 instead of copying when using VLLM_FLASH_ATTN_SRC_DIR (#38814) Matthew Bonanni 2026-04-08 08:04:34 -04:00
  • 4e2ab1861d [CI Failure] pin nomic-embed-text-v1 revision (#39292) wang.yuqi 2026-04-08 19:43:06 +08:00
  • 140cbb1186 [Bugfix] Cuda Clean up scales Kvcache fp8/int8_per_token_head (#39224) JartX 2026-04-08 13:08:04 +02:00
  • 6155bbd1dd [Bugfix][Docs] Fix ReadTheDocs build crash from mocked torch decorator (#39284) Kevin H. Luu 2026-04-08 02:43:01 -07:00
  • 78434b923c [CI][AMD][BugFix][Kernel] Cast induction variable to int64 on MI350 for chunk_gated_delta_rule_fwd_kernel_h_blockdim64 to avoid illegal memory access (#39087) rasmith 2026-04-08 03:57:18 -05:00
  • 2488d1dca2 [Docs] Update README (#39251) Michael Goin 2026-04-08 05:34:07 +02:00
  • d734445fcd [Bugfix][Frontend] Fix Gemma4 streaming HTML duplication after tool calls (#38909) yoke 2026-04-08 11:03:54 +08:00
  • 927975ead8 [Parser] Migrate response api streaming to unified parser (#38755) Flora Feng 2026-04-07 22:09:00 -04:00
  • 9ea7d670d8 [Bugfix] Fix Qwen3 tool parser for Responses API tools (#38848) Flora Feng 2026-04-07 22:08:51 -04:00
  • 7b80cd8ac3 [Docs] Add Phi-4-reasoning-vision to supported models + examples (#39232) Varun Sundar Rabindranath 2026-04-07 19:02:26 -07:00
  • 2111997f96 [release 2.11] Update to torch 2.11 (#34644) Andrey Talman 2026-04-07 21:55:48 -04:00
  • 5af684c319 [CI] Add reasoning parser tests to CI (#37025) Flora Feng 2026-04-07 20:57:36 -04:00
  • d521dcdbcc docs: clarify SMT and OMP acronyms in CpuPlatform (#39085) Md. Mekayel Anik 2026-04-08 06:42:07 +06:00
  • 5daf62271d [Model Runner V2] Fuse probabilistic rejection sample kernels (#38496) Giancarlo Delfin 2026-04-07 17:37:37 -07:00
  • ad3304425b [XPU] add xpu backend implementation of mxfp8 quant (#38682) zofia 2026-04-08 08:30:35 +08:00
  • 70406eb1dc [Attention][V0 Deprecation] Deprecate accept output buffer (#39125) Lucas Wilkinson 2026-04-07 17:14:58 -04:00
  • 08bfedc152 [Bugfix] Fix extract_hidden_states crash with quantized KV cache dtype (#39160) Yubo Wang 2026-04-07 11:18:33 -07:00
  • 0102bd2f4c [Parser] Pass request.tools to tool parser (#38860) Flora Feng 2026-04-07 13:36:21 -04:00
  • 83d09d36b5 [CI][Bugfix][AMD] Ensure weights created when using emulating OCP MXFP4 (#36993) rasmith 2026-04-07 11:37:16 -05:00
  • 92b9afeecd [XPU] Quick fix for TritonMLA to remove cuda hardcode (#39088) Chendi.Xue 2026-04-07 11:17:58 -05:00
  • 7310555482 [Bugfix] Fix marlin nvfp4 rescaling (#37502) Jinzhen Lin 2026-04-07 23:57:17 +08:00
  • 96b5004b71 [KVConnector] Support 3FS KVConnector (#37636) ibifrost 2026-04-07 23:46:00 +08:00
  • 98e1a43af7 [Bugfix][Quantization] Fix PerTensorScale loading with tuple shard_id in MergedColumnParallelLinear (#38517) kkyyxhll 2026-04-07 23:16:26 +08:00
  • 729eb59f60 [KVConnector]: prioritize external connector over internal registry (#38301) maobaolong 2026-04-07 23:03:11 +08:00
  • 6e1100889e fix(test): recompute Jina ColBERT rotary inv_freq cleared by transformers v5 weight loader (#39176) Ilya Boytsov 2026-04-07 16:40:55 +02:00
  • edcc37a8ce Fix Mistral yarn warning in Transformers v5 (#37292) Harry Mellor 2026-04-07 15:23:33 +02:00
  • 79df4a794d Automatically add links to API docs for matching strings in docs (#37434) Harry Mellor 2026-04-07 15:21:18 +02:00
  • 7c139ab23f [KV Offload] Clean up ARC/LRU refactoring leftovers: group ARC tests and fix stale comment (#38217) Ronen Schaffer 2026-04-07 15:14:45 +03:00
  • 0be9516ea4 [Bug] Fix Trtllm Fp8 MoE Weight Shuffle Memory Fragmentation (#39054) Wei Zhao 2026-04-07 08:04:08 -04:00
  • 7b9de7c892 [Bugfix] Correct mistake in chained comparison in static assert logic (#38699) Kyle Mylonakis 2026-04-07 11:24:39 +01:00
  • dd9342e6bc only patch runtime_env for torch >= 2.10 (#38763) Rohan Potdar 2026-04-07 04:29:23 -05:00
  • 8060bb0333 [vLLM IR] rework gemma_rms_norm (#39014) Jiangyun Zhu 2026-04-07 16:37:00 +08:00
  • da4c0e4db9 [Model] Use AutoWeightsLoader for FalconH1 (#39092) Rishapveer Singh 2026-04-07 10:25:17 +02:00
  • a9a0e0551f nano-nemotron-vl: get_mm_max_tokens_per_item for audio, video, image == seq_len (#38727) Netanel Haber 2026-04-07 10:23:29 +03:00
  • 5c35517a3e [ROCm] Remove unused IS_FNUZ parameter from reshape_and_cache_shuffle_kernel (#39123) Andrew Barnes 2026-04-07 03:17:59 -04:00
  • a435e3108d [ROCm][CI] Fix test repo-root assumptions (#39053) Andreas Karatzas 2026-04-07 00:36:21 -05:00
  • 2df2c85be4 [Kernels][MoE] Fix legacy_routing to use bitmatrix-based routing path (#38504) Andreas Karatzas 2026-04-06 21:57:09 -05:00
  • 62095e82c1 [BugFix][MRV2] Fix cuda event reuse race (#39115) Nick Hill 2026-04-06 17:21:09 -07:00