biondizzle/vllm

Commit Graph

013b73e9b2 Fix managed KV cache: use __cuda_array_interface__ instead of UntypedStorage.from_blob cmm biondizzle 2026-04-12 06:56:52 +00:00
c77342da87 KV cache: prefer CPU placement, zero via CPU not GPU biondizzle 2026-04-12 03:44:16 +00:00
7f35bc4158 Targeted KV cache managed memory allocation biondizzle 2026-04-11 02:14:34 +00:00
487dd34e04 Selective prefetch: only prefetch allocations <2 GiB to GPU biondizzle 2026-04-10 14:58:57 +00:00
a15f86ecfa Remove cudaMemPrefetchAsync from managed allocator biondizzle 2026-04-10 05:58:11 +00:00
e5de19ff9a [CI/Build] Don't auto-rebase PRs with CI failures (#39443) main Cyrus Leung 2026-04-10 04:57:37 +08:00
edee96519a [Spec Decode] fix returning size mismatch on extract hidden states proposer (#38610) zzaebok 2026-04-10 04:39:39 +08:00
adaabb8a55 Add nightly b200 test for spec decode eagle correctness (#38577) Rishi Puri 2026-04-09 13:09:09 -07:00
f7cad67412 [ASR] Fix spacing bw chunks in multi chunk audio transcription (#39116) Ekagra Ranjan 2026-04-09 15:46:33 -04:00
a8134aef4e [XPU] check is_xccl_available before oneccl warmup (#39302) Xinyu Chen 2026-04-10 03:42:17 +08:00
2800706f06 [Refactor] Move NVFP4 GEMM management into NvFp4LinearKernel (#39129) Michael Goin 2026-04-09 21:05:36 +02:00
0d310ffbeb [CI/Build] Update auto-rebase rule (#39429) Cyrus Leung 2026-04-10 01:59:56 +08:00
d5f75fdf50 [ROCm] Correctly guard fused_silu_mul_block_quant on ROCm (#39387) Micah Williamson 2026-04-09 12:59:03 -05:00
827268e98d [Quantization] Support Quark W8A8 INT8 MoE inference (#36320) PikaPikachu 2026-04-10 01:24:43 +08:00
56e19d7ee2 [Model Runner V2] Fix flex attention kv blocks calculation issue (#39353) Wentao Ye 2026-04-09 13:07:43 -04:00
9036d4c464 [ROCm][CI] Resolved nvidia package deps issue (#39421) Andreas Karatzas 2026-04-09 11:06:06 -05:00
a8c6ee9b78 [Performance Improvement] Update batched_count_greater_than to handle batch size 1 without recompile (#38933) Lucas Kabela 2026-04-09 08:51:31 -07:00
3b1d9c3156 [CI/Build] Fix memory cleanup in MM test (#39411) Cyrus Leung 2026-04-09 23:50:45 +08:00
54d244f28f [UX] Improve error message for MM input too long (#39409) Cyrus Leung 2026-04-09 21:20:19 +08:00
6c749399b7 [BugFix] fix tests/kernels/moe/test_moe_layer.py (#39404) Richard Zou 2026-04-09 14:48:59 +02:00
91eea72330 [Tests] Add Qwen3-VL multimodal memory leak check (#39268) lalit10 2026-04-09 04:54:46 -07:00
df2503e125 nemotron-nano-vl: Allow use_audio_in_video to be passed at vllm serve time (#38538) Andrii Skliar 2026-04-09 13:44:39 +02:00
c8d98f81f6 [Core] Simplify API server handshake (#39364) Nick Hill 2026-04-09 03:56:15 -07:00
d87fb264df [Docs] Bring README updates into docs README (#39397) Harry Mellor 2026-04-09 12:35:00 +02:00
66c079ae83 [Frontend][4/n] Improve pooling entrypoints | pooling. (#39153) wang.yuqi 2026-04-09 18:09:45 +08:00
b6c9be509e [CI] fix possible user permission issues in nightly index generation (#39390) Shengqi Chen 2026-04-09 16:14:07 +08:00
ed733802f0 Fix NUMA binding on non-CDMM Grace-Blackwell systems (#39361) Qidong Su 2026-04-09 03:36:51 -04:00
8a34c5087a [ROCm] Remove unnecessary fp8 roundtrip in gather cache NHD dequant (#39122) Andrew Barnes 2026-04-09 03:12:22 -04:00
ed2f282bc8 [Perf] Optimize redundant sync for pooling model, 3.7% Throughput Improvement (#39113) Wentao Ye 2026-04-09 02:12:23 -04:00
9e78555743 [Docker] Add fastsafetensors to NVIDIA Dockerfile (#38950) Zhewen Li 2026-04-08 22:21:37 -07:00
e80e633927 [XPU] Skip VLLM_BATCH_INVARIANT for XPU in EAGLE DP test (#39164) sihao_li 2026-04-09 12:45:16 +08:00
490f17d0c7 [Multimodal] Fix nested_tensors_equal: add length check for lists and tuple support (#38388) Khairul Kabir 2026-04-08 21:40:37 -07:00
2e98406048 [Refactor] Improve indexer decode path metadata preparation (#38865) Yongye Zhu 2026-04-08 23:49:15 -04:00
ef5a226819 [PD][HeteroArch]Fix accuracy issue with CPU_ATTN as Decoder and Flash_ATTN as prefiller (#38935) Chendi.Xue 2026-04-08 22:19:07 -05:00
aec18492d0 [CI] Fix mypy for vllm/v1/ops (#39219) Wentao Ye 2026-04-08 23:06:34 -04:00
2a49284c8a Fix Responses JSON schema alias serialization (#38519) noobHappylife 2026-04-09 10:50:16 +08:00
d37b378762 [Model] Update ColModernVBERT to support latest HF checkpoint (#39307) Ilya Boytsov 2026-04-09 04:48:51 +02:00
92fbec391b [Bug] Fix routing bias dtype for trtllm per-block fp8 moe (#38989) Wei Zhao 2026-04-08 22:42:43 -04:00
2f41d6c063 [Bugfix] Fix cpu-offload-gb assertion with non-default block sizes (#36461) Ajay Anubolu 2026-04-08 19:42:16 -07:00
3aecdf08b4 [Gemma4] Support quantized MoE (#39045) Dipika Sikka 2026-04-08 21:57:53 -04:00
eb4205fee5 [UX] Integrate DeepGEMM into vLLM wheel via CMake (#37980) Michael Goin 2026-04-09 03:56:32 +02:00
83aea2147f [XPU][UT] update UTs in CI (#39296) liuzhenwei 2026-04-09 09:38:16 +08:00
2e9034c998 [W8A8 Block Linear Refactor][2/N] Remove W8A8Fp8BlockLinearOp and adopt Fp8 block linear kernel selections. (#33892) Maral 2026-04-09 08:50:39 +08:00
8332078cfd [Bugfix] FlashInfer MXINT4 MoE crashes, missing do_finalize (#39315) Benjamin Chislett 2026-04-08 20:36:33 -04:00
ba4a78eb5d [torch.compile] Allow usage of Opaque Objects in PyTorch 2.11 (#39286) Richard Zou 2026-04-09 01:21:10 +02:00
f3c7941ec8 [Bugfix]Fix EP precision for Qwen3.5, Qwen3-Next (#39181) Kai Song 2026-04-09 05:47:48 +08:00
3352bf8b03 [CI Bug] Fix pre-commit issue in main (#39347) Wentao Ye 2026-04-08 17:10:05 -04:00
7c94ae16c6 [BugFix] --max-model-len=-1 causes over-limit requests to hang and starve the entire service (#39102) triangleXIV 2026-04-09 05:03:17 +08:00
ad05edfbca tests/v1/e2e/spec_decode: assert async scheduling is used (#39206) Rishi Puri 2026-04-08 13:30:03 -07:00
2018137242 [Feature] Batch invariant nvfp4 linear support (#39322) Wentao Ye 2026-04-08 16:29:13 -04:00
a776a48b1c [MoE] Move DEEP_GEMM into experts/ subdirectory (#39005) Jackmin801 2026-04-08 12:23:08 -07:00
8477fe427d [Tool] adjust_request to reasoning parser, and Gemma4 fixes (#39027) Ben Browning 2026-04-08 15:04:04 -04:00
e24e0a43a4 [Attention] relax the head dim 512 and paged kv for sm90+FA4 (#38835) Lain 2026-04-08 11:23:18 -07:00
b55d830ec7 [Perf][Kernel] Persistent TopK scheduler: unified CUDAGraph-safe kernel with dynamic per-row dispatch - DeepSeek-V3.2 DSA decode (#37421) Roberto L. Castro 2026-04-08 19:35:57 +02:00
75e01a39a1 [Feature] NUMA binding support for GPU workers (#38635) Shengqi Chen 2026-04-09 00:55:24 +08:00
512c5eb455 [kv_offload+HMA][5/N]: Track group block hashes and block IDs (#37109) Or Ozeri 2026-04-08 19:50:28 +03:00
13151a4df4 [Bugfix] Fix Gemma4 streaming tool call corruption for split boolean/number values (#39114) Flora Feng 2026-04-08 12:46:27 -04:00
56c976c1b5 [ROCm] Enable fused_silu_mul_block_quant on ROCm (#38817) Gregory Shtrasberg 2026-04-08 11:23:32 -05:00
d74a306c4b [Core] Use tuple_return in split_module for tuple-conformant subgraphs (#38752) Frederik Gossen 2026-04-08 12:09:58 -04:00
0e9f0a516c [ROCm][CI-Build] Cherry pick triton BUFFER_OPS fix and update AITER (#38580) Gregory Shtrasberg 2026-04-08 10:38:03 -05:00
8904fc4d19 [Bugfix] Fix V1 logprobs empty strings for multi-byte UTF-8 tokens when logprobs > 0 (#34875) haosdent 2026-04-08 23:30:00 +08:00
1a2c17634e [Bugfix] Add missing ASRDataset import and CLI args in benchmarks/throughput.py (#38114) nemanjaudovic 2026-04-08 15:53:53 +02:00
308cec5864 [FlashAttention] Symlink FA4 instead of copying when using VLLM_FLASH_ATTN_SRC_DIR (#38814) Matthew Bonanni 2026-04-08 08:04:34 -04:00
4e2ab1861d [CI Failure] pin nomic-embed-text-v1 revision (#39292) wang.yuqi 2026-04-08 19:43:06 +08:00
140cbb1186 [Bugfix] Cuda Clean up scales Kvcache fp8/int8_per_token_head (#39224) JartX 2026-04-08 13:08:04 +02:00
6155bbd1dd [Bugfix][Docs] Fix ReadTheDocs build crash from mocked torch decorator (#39284) Kevin H. Luu 2026-04-08 02:43:01 -07:00
78434b923c [CI][AMD][BugFix][Kernel] Cast induction variable to int64 on MI350 for chunk_gated_delta_rule_fwd_kernel_h_blockdim64 to avoid illegal memory access (#39087) rasmith 2026-04-08 03:57:18 -05:00
2488d1dca2 [Docs] Update README (#39251) Michael Goin 2026-04-08 05:34:07 +02:00
d734445fcd [Bugfix][Frontend] Fix Gemma4 streaming HTML duplication after tool calls (#38909) yoke 2026-04-08 11:03:54 +08:00
927975ead8 [Parser] Migrate response api streaming to unified parser (#38755) Flora Feng 2026-04-07 22:09:00 -04:00
9ea7d670d8 [Bugfix] Fix Qwen3 tool parser for Responses API tools (#38848) Flora Feng 2026-04-07 22:08:51 -04:00
7b80cd8ac3 [Docs] Add Phi-4-reasoning-vision to supported models + examples (#39232) Varun Sundar Rabindranath 2026-04-07 19:02:26 -07:00
2111997f96 [release 2.11] Update to torch 2.11 (#34644) Andrey Talman 2026-04-07 21:55:48 -04:00
5af684c319 [CI] Add reasoning parser tests to CI (#37025) Flora Feng 2026-04-07 20:57:36 -04:00
d521dcdbcc docs: clarify SMT and OMP acronyms in CpuPlatform (#39085) Md. Mekayel Anik 2026-04-08 06:42:07 +06:00
5daf62271d [Model Runner V2] Fuse probabilistic rejection sample kernels (#38496) Giancarlo Delfin 2026-04-07 17:37:37 -07:00
ad3304425b [XPU] add xpu backend implementation of mxfp8 quant (#38682) zofia 2026-04-08 08:30:35 +08:00
70406eb1dc [Attention][V0 Deprecation] Deprecate accept output buffer (#39125) Lucas Wilkinson 2026-04-07 17:14:58 -04:00
08bfedc152 [Bugfix] Fix extract_hidden_states crash with quantized KV cache dtype (#39160) Yubo Wang 2026-04-07 11:18:33 -07:00
0102bd2f4c [Parser] Pass request.tools to tool parser (#38860) Flora Feng 2026-04-07 13:36:21 -04:00
83d09d36b5 [CI][Bugfix][AMD] Ensure weights created when using emulating OCP MXFP4 (#36993) rasmith 2026-04-07 11:37:16 -05:00
92b9afeecd [XPU] Quick fix for TritonMLA to remove cuda hardcode (#39088) Chendi.Xue 2026-04-07 11:17:58 -05:00
7310555482 [Bugfix] Fix marlin nvfp4 rescaling (#37502) Jinzhen Lin 2026-04-07 23:57:17 +08:00
96b5004b71 [KVConnector] Support 3FS KVConnector (#37636) ibifrost 2026-04-07 23:46:00 +08:00
98e1a43af7 [Bugfix][Quantization] Fix PerTensorScale loading with tuple shard_id in MergedColumnParallelLinear (#38517) kkyyxhll 2026-04-07 23:16:26 +08:00
729eb59f60 [KVConnector]: prioritize external connector over internal registry (#38301) maobaolong 2026-04-07 23:03:11 +08:00
6e1100889e fix(test): recompute Jina ColBERT rotary inv_freq cleared by transformers v5 weight loader (#39176) Ilya Boytsov 2026-04-07 16:40:55 +02:00
edcc37a8ce Fix Mistral yarn warning in Transformers v5 (#37292) Harry Mellor 2026-04-07 15:23:33 +02:00
79df4a794d Automatically add links to API docs for matching strings in docs (#37434) Harry Mellor 2026-04-07 15:21:18 +02:00
7c139ab23f [KV Offload] Clean up ARC/LRU refactoring leftovers: group ARC tests and fix stale comment (#38217) Ronen Schaffer 2026-04-07 15:14:45 +03:00
0be9516ea4 [Bug] Fix Trtllm Fp8 MoE Weight Shuffle Memory Fragmentation (#39054) Wei Zhao 2026-04-07 08:04:08 -04:00
7b9de7c892 [Bugfix] Correct mistake in chained comparison in static assert logic (#38699) Kyle Mylonakis 2026-04-07 11:24:39 +01:00
dd9342e6bc only patch runtime_env for torch >= 2.10 (#38763) Rohan Potdar 2026-04-07 04:29:23 -05:00
8060bb0333 [vLLM IR] rework gemma_rms_norm (#39014) Jiangyun Zhu 2026-04-07 16:37:00 +08:00
da4c0e4db9 [Model] Use AutoWeightsLoader for FalconH1 (#39092) Rishapveer Singh 2026-04-07 10:25:17 +02:00
a9a0e0551f nano-nemotron-vl: get_mm_max_tokens_per_item for audio, video, image == seq_len (#38727) Netanel Haber 2026-04-07 10:23:29 +03:00
5c35517a3e [ROCm] Remove unused IS_FNUZ parameter from reshape_and_cache_shuffle_kernel (#39123) Andrew Barnes 2026-04-07 03:17:59 -04:00
a435e3108d [ROCm][CI] Fix test repo-root assumptions (#39053) Andreas Karatzas 2026-04-07 00:36:21 -05:00
2df2c85be4 [Kernels][MoE] Fix legacy_routing to use bitmatrix-based routing path (#38504) Andreas Karatzas 2026-04-06 21:57:09 -05:00
62095e82c1 [BugFix][MRV2] Fix cuda event reuse race (#39115) Nick Hill 2026-04-06 17:21:09 -07:00