Commit Graph - vllm - Gitea: Git with a cup of tea

biondizzle/vllm

Commit Graph

b2b2c5239e [MoE Refactor] Split up compressed_tensors_moe.py (#38960) bnellnm 2026-04-06 20:07:54 -04:00
00d7b497b3 [NVFP4] Support NVFP4 dense models from modelopt and compressed-tensors on AMD Instinct MI300, MI355X and Hopper through emulation (#35733) fxmarty-amd 2026-04-07 00:18:27 +02:00
9c81f35b1a [Attention][MLA] Re-enable FA4 as default MLA prefill backend (#38819) Matthew Bonanni 2026-04-06 17:51:46 -04:00
f186cfe75e [MRV2] Fix hanging issue with DeepSeek V3.2 by setting skip_attn=False (#39098) Woosuk Kwon 2026-04-06 12:55:13 -07:00
dfa5062a8f NemotronH default mamba_ssm_cache_dtype=float32; enable auto-hook for NemotronHNanoVLV2Config (#39032) Netanel Haber 2026-04-06 22:47:46 +03:00
e8ebbdde83 [Quantization] Add FlashInfer CuteDSL batched experts backend for NVFP4 MoE (#38251) Yongye Zhu 2026-04-06 14:57:53 -04:00
94fbb09894 [EASY] Drop duplicate KV-cache initialization (#38799) namgyu-youn 2026-04-07 03:05:39 +09:00
419e73cdfa [Bug] Fix mistral version dependency (#39086) Wentao Ye 2026-04-06 13:31:19 -04:00
f01482408c [MoE Refactor][Test] FusedMoE layer test (#24675) bnellnm 2026-04-06 13:17:23 -04:00
bfdc0a3a99 [NIXL][Mamba][3/N] Heterogeneous TP: 3-read conv state transfer (#37635) zhanqiuhu 2026-04-06 13:07:02 -04:00
93bada494f [MoE Refactor] Split of DefaultMoERunner class (#35326) bnellnm 2026-04-06 12:41:59 -04:00
608914de30 [Core] Re-enable Inductor pre-grad passes in standalone compile (torch>=2.12) (#38944) Frederik Gossen 2026-04-06 12:37:13 -04:00
4ae218c122 [Refactor] Remove unused dead code (#38842) Wentao Ye 2026-04-06 11:52:05 -04:00
f40d9879f2 [Models][GDN] Remove GPU/CPU syncs in GDNAttentionMetadata.build during speculative decoding (#38047) Lukas Geiger 2026-04-06 16:39:37 +01:00
47e605092b [Gemma4] Enable Fast Prefill Optimization (#38879) Lucas Wilkinson 2026-04-06 11:19:39 -04:00
e69a265135 [Feat][Core] safely abort requests when FSM fails to advance (#38663) Walter Beller-Morales 2026-04-06 11:00:16 -04:00
fef56c1855 [Mistral Grammar] Support Grammar Factory (#38150) Julien Denize 2026-04-06 16:28:51 +02:00
c5e3454e5a [Model] Add support for BharatGen's Param2MoE model (#38000) bhargav-patel-29 2026-04-06 13:49:56 +05:30
f6983f01de MiniMax-M2: add Eagle3 speculative decoding support (#37512) liuchenbing2026 2026-04-06 10:50:18 +08:00
780ba37458 [ROCm][Quantization] Add asymmetric INT8 quantization support to TritonInt8ScaledMMLinearKernel (#38501) Andreas Karatzas 2026-04-05 20:42:10 -05:00
9570654c6d [ROCm][CI] Run Kernels Core Operation Test On MI325 and mitigate flakiness (#38184) Micah Williamson 2026-04-05 20:42:02 -05:00
d56e952239 nano_nemotron_vl: fix tensor device mismatch exception when video profiling (#39029) Netanel Haber 2026-04-06 01:23:45 +03:00
56de443db1 [ci] Switch some CI jobs to H200 MIG slices (#38956) Kevin H. Luu 2026-04-05 13:26:11 -07:00
4dd49b06f8 [Bug] Fix Import paths for encoder_cudagraph modules (#38997) Greg Pereira 2026-04-05 12:11:58 -07:00
f53fa26e05 [Bugfix] Fix invalid JSON in Gemma 4 streaming tool calls by stripping partial delimiters (#38992) Greg Pereira 2026-04-05 10:11:18 -07:00
1af6f78ae5 [Perf] Change Trtllm fp8 MoE to use Shuffled Weights and BlockMajorK Layout (#38993) Wei Zhao 2026-04-05 10:54:31 -04:00
228023b3a5 [Bugfix][MoE] Fix 6-8% decode regression: prefer multi-stream shared expert overlap (#38990) Martin Vit 2026-04-05 16:28:31 +02:00
9a528260ef [Bugfix][Spec Decode] Fix extract_hidden_states for VLM models (#38987) Aaron Batilo 2026-04-05 03:41:54 -06:00
968ed02ace [Quantization][Deprecation] Remove Petit NVFP4 (#32694) Robert Shaw 2026-04-04 20:07:45 -04:00
7d266abb22 Revert "[vLLM IR] gemma_rms_norm" (#38998) Robert Shaw 2026-04-04 17:48:08 -04:00
156405d243 [vLLM IR] gemma_rms_norm (#38780) Xiaoshuang Wang 2026-04-05 01:55:52 +08:00
99e5539a67 [Perf][GDN] Align TMA usage with upstream FLA (#38981) Artem Perevedentsev 2026-04-04 19:38:02 +03:00
a88ce94bbb [IR][RmsNorm] pass None if not has_weight (#38961) Linkun 2026-04-04 08:02:30 -07:00
2a36d8fb72 [Bugfix][CPU] Fix macOS compatibility broken by #36487 (#38970) Ziming Qi 2026-04-04 10:05:58 -04:00
93726b2a1c Refactor Arctic loading to use AutoWeightsLoader (#38955) lalit10 2026-04-03 22:01:09 -07:00
8617f8676b [Bugfix] Fix DSV32 weight loading (#38870) Yongye Zhu 2026-04-03 22:57:52 -04:00
06fd9ffcc4 [ROCm][CI] Fix ROCm Dockerfile conftest generation for older Docker parsers (#38959) Andreas Karatzas 2026-04-03 21:41:41 -05:00
cab4064cd5 [Bug] Fix workspace manager _current_workspaces size (#38853) Wentao Ye 2026-04-03 21:29:45 -04:00
062f1a2d70 [Bug] Fix compile error for swap_blocks_batch in CUDA 13 (#38915) Wentao Ye 2026-04-03 19:56:38 -04:00
81994e1d0e [Bugfix][LoRA] Fix missing in_proj_z in Qwen3_5ForConditionalGenerati… (#38927) elenalil-aws 2026-04-03 16:30:09 -07:00
4b506ff90a [ROCm][CI] Minor missing import patch (#38951) Andreas Karatzas 2026-04-03 18:01:20 -05:00
5875bb2e9c [ROCm][CI] Added back missing common deps (#38937) Andreas Karatzas 2026-04-03 17:58:57 -05:00
f0d3ad9f3e [ci] Remove soft fail for AMD image build job (#38941) Kevin H. Luu 2026-04-03 13:42:33 -07:00
121ea5a21f Removed GPU state confirmation and cleanup steps. (#38238) Divin Honnappa 2026-04-03 15:11:08 -05:00
ab79863e6c Remove MQ multi-node tests (#38934) Jeffrey Wang 2026-04-03 13:00:08 -07:00
5f1de2b14b [Model Runner V2] Add config validation for not-yet-supported features (#38758) Nick Hill 2026-04-03 12:08:08 -07:00
a5a623d961 [Bugfix] Re-enable Renormalize routing for TRT-LLM MoE experts (#38859) yzong-rh 2026-04-03 13:48:17 -04:00
f8c3af2d85 [vLLM IR] add import_ir_kernels() to support OOT platforms (#38807) Xiaoshuang Wang 2026-04-04 01:25:19 +08:00
50cd5674b3 Fix invalid logprobs with MTP enabled and sync scheduling (#38711) danisereb 2026-04-03 19:24:37 +03:00
7b1a7423be [Frontend] new online quantization frontend (#38138) Vasiliy Kuznetsov 2026-04-03 11:58:39 -04:00
97f92c6b47 [KVConnector] Skip register_kv_caches on profiling (#38558) Nicolò Lucchesi 2026-04-03 17:40:16 +02:00
46f02e00f2 [Bugfix] Fix AWQ models batch invariance issues (#38670) Yusuf Mohammad 2026-04-03 15:54:15 +01:00
6b4872240f [XPU] bump up xpu-kernel v0.1.5, transpose moe weights (#38342) Qiming Zhang 2026-04-03 07:10:02 -07:00
580090db6b [Kernel] Add swapAB support for SM120 CUTLASS blockwise FP8 GEMM (#38325) Necofish 2026-04-03 21:49:59 +08:00
cb10b7e80b [GDN] Eliminate GPU->CPU sync in prepare_chunk_indices during prefill (#38361) Artem Perevedentsev 2026-04-03 16:38:02 +03:00
bf8b022e60 [Intel][Triton] Support round_int8 for Intel backend (#38825) Mieszko Dziadowiec 2026-04-03 14:47:35 +02:00
40ee64c00e [XPU][CI] Skip test_topp_only and test_topk_and_topp cases on Intel GPU in CI (#38904) xiangdong 2026-04-03 20:44:52 +08:00
1b117cb0ac [ROCm] Fix aiter persistent mode mla with q/o nhead<16 for kimi-k2.5 tp8 (#38615) wufann 2026-04-03 18:54:00 +08:00
abebd9323d [CPU] Replace OMP initialization (#36487) Anton Ivanov 2026-04-03 11:42:43 +01:00
25f2b55319 [Frontend] feat: add streaming support for token generation endpoint (#37171) Hyeonki Hong 2026-04-03 19:20:32 +09:00
cb4ff07f8b [XPU][CI] Skip test_topk_only cases on Intel GPU in CI (#38899) xiangdong 2026-04-03 17:50:41 +08:00
a7d79fa133 [ROCm][CI/Build] Fix the pytest hook to properly print out the summary (#38585) Gregory Shtrasberg 2026-04-03 04:24:26 -05:00
fa9e68022d Fix Nano Nemotron VL regressions (#38655) Netanel Haber 2026-04-03 10:22:06 +03:00
5506435419 [Misc] Clean up Gemma4 implementation (#38872) v0.19.1rc0 Isotr0py 2026-04-03 13:47:02 +08:00
311c981647 [MRV2][KVConnector] Fix missing build_connector_worker_meta (#38698) Yifan Qiao 2026-04-02 22:42:52 -07:00
21d7ecc5b0 [CI/Build] Add audio deps in Dockerfile.cpu (#38876) Li, Jiang 2026-04-03 13:05:14 +08:00
4729b90838 [Bug] Add e_score_correction_bias to SKIP_TENSORS (#38746) Aaron Hao 2026-04-02 21:15:05 -07:00
8b141ed8c3 full cudagraph for flex-attn (#36298) shunting314 2026-04-02 21:15:01 -07:00
2ad7c0335f [Model] Add Phi4ForCausalLMV for microsoft/Phi-4-reasoning-vision-15B (#38306) Varun Sundar Rabindranath 2026-04-03 00:14:57 -04:00
201d2ea5bf [CI][ROCm] Add Qwen3.5-35B-A3B-MXFP4 model eval into CI (#38664) Bowen Bao 2026-04-02 21:05:45 -07:00
103f0de565 [ROCm][Quantization][1/N] Refactor quark_moe w_mxfp4 w/ oracle (#38774) Bowen Bao 2026-04-02 20:29:57 -07:00
32e0c0bfa2 refactor hard coded device string in test files under tests/v1 and tests/lora (#37566) wliao2 2026-04-02 20:21:47 -07:00
4a06e1246e [Perf] Batch KV cache swap copies via cuMemcpyBatchAsync (#38460) Itay Etelis 2026-04-03 06:13:23 +03:00
3bc2734dd0 [Kernel] Fuse FP8 output quantization into merge_attn_states (#36518) Carl Y 2026-04-02 18:47:04 -07:00
1f5ec2889c [mla] Support fused FP8/NVFP4 output quantization in MLA attention (#35792) (#36205) Carl Y 2026-04-02 18:16:11 -07:00
ee3cf45739 [XPU] Initial support for GDN attention on Qwen3-next/Qwen3.5 (#33657) Yan Ma 2026-04-03 08:59:11 +08:00
05e68e1f81 [CI] Fix test_nixl_connector (#38838) Matthew Bonanni 2026-04-02 20:52:13 -04:00
771913e4a0 [Bugfix] Fix NVFP4+MTP crash: force unquantized mtp.fc for Qwen3.5 (#38832) Vadim Gimpelson 2026-04-03 04:45:57 +04:00
71a9125c67 [New Model]: add support for telechat3 (#38510) 1096125073 2026-04-03 08:26:22 +08:00
66e86f1dbd [Kernel] Mamba support different layout for Conv state (#37416) Nicolò Lucchesi 2026-04-03 01:50:09 +02:00
2a69949bda [Bugfix]: Fix Gemma4ToolParser.__init__() missing tools parameter (#38847) v0.19.0 Michael 2026-04-02 17:35:19 -04:00
bb39382b2b [Bugfix]: Fix Gemma4ToolParser.__init__() missing tools parameter (#38847) Michael 2026-04-02 17:35:19 -04:00
7b743ba953 [CI] Fix: pass string cache_dtype in test_register_kv_caches (#38836) zhanqiuhu 2026-04-02 15:42:09 -04:00
188defbd0b [CI] Add flashinfer.py to attention test source deps (#38792) Stefano Castagnetta 2026-04-02 21:24:29 +02:00
8adcf8c40a feat(models): implement Google Gemma 4 architecture support (MoE, Multimodal, Reasoning, Tool-Use) (#38826) Luciano Martins 2026-04-02 15:13:28 -03:00
08ed2b9688 feat(models): implement Google Gemma 4 architecture support (MoE, Multimodal, Reasoning, Tool-Use) (#38826) Luciano Martins 2026-04-02 15:13:28 -03:00
ecd5443dbc Bump helion dependency from 0.3.2 to 0.3.3 (#38062) Yanan Cao 2026-04-02 10:59:33 -07:00
58262dec6e [Bugfix] Fix test mocks after SM100 restriction in #38730 (#38791) Stefano Castagnetta 2026-04-02 19:12:58 +02:00
cb3935a8fc [FA4] Update flash-attention to latest upstream FA4 (#38690) Lucas Wilkinson 2026-04-02 13:02:37 -04:00
82a006beeb [CI][ROCm] Add gpt-oss w4a8 in CI (#38292) Bowen Bao 2026-04-02 09:06:01 -07:00
a9b4f07ba2 [Frontend] Re-enable running MaxSim on GPU (#38620) wang.yuqi 2026-04-03 00:03:13 +08:00
d9408ffba3 Triton MLA perf fixes (#33529) Koushik Dutta 2026-04-02 06:40:01 -07:00
16a65e4173 [Bugfix] Enable batch-invariant Triton matmul on all Ampere GPUs (SM 8x) (#38427) Yusuf Mohammad 2026-04-02 14:29:58 +01:00
c0817e4d39 [Model] Add support for Cheers multimodal model (#38788) bsliu 2026-04-02 21:01:40 +08:00
dfe5e31689 Don't compile vision encoder for Transformers backend (#30518) Harry Mellor 2026-04-02 13:42:29 +01:00
2ce3d0ce36 [Feature] KV cache per-token-head INT8/FP8 quantization (#38378) JartX 2026-04-02 14:13:26 +02:00
4eefbf9609 [Perf] fuse kernels in gdn (#37813) Jiangyun Zhu 2026-04-02 19:52:18 +08:00
551b3fb39f [ROCm] Enable VLLM triton FP8 moe for gfx1201, tuned for Qwen3-30B-A3B-FP8 tp=2 and Qwen/Qwen3.5-35B-A3B-FP8 tp=2 (#38086) vllmellm 2026-04-02 16:13:42 +08:00
c6f722b93e [CPU] Support gelu act in cpu_fused_moe (#38770) Li, Jiang 2026-04-02 14:14:32 +08:00
9bd7231106 Revert "[Kernel] Add gpt-oss Router GEMM kernel (#37205)" (#38778) Xin Yang 2026-04-01 22:02:32 -07:00
