b2b2c5239e
[MoE Refactor] Split up compressed_tensors_moe.py (#38960)
bnellnm
2026-04-06 20:07:54 -04:00
00d7b497b3
[NVFP4] Support NVFP4 dense models from modelopt and compressed-tensors on AMD Instinct MI300, MI355X and Hopper through emulation (#35733)
fxmarty-amd
2026-04-07 00:18:27 +02:00
9c81f35b1a
[Attention][MLA] Re-enable FA4 as default MLA prefill backend (#38819)
Matthew Bonanni
2026-04-06 17:51:46 -04:00
f186cfe75e
[MRV2] Fix hanging issue with DeepSeek V3.2 by setting skip_attn=False (#39098)
Woosuk Kwon
2026-04-06 12:55:13 -07:00
dfa5062a8f
NemotronH default mamba_ssm_cache_dtype=float32; enable auto-hook for NemotronHNanoVLV2Config (#39032)
Netanel Haber
2026-04-06 22:47:46 +03:00
e8ebbdde83
[Quantization] Add FlashInfer CuteDSL batched experts backend for NVFP4 MoE (#38251)
Yongye Zhu
2026-04-06 14:57:53 -04:00
94fbb09894
[EASY] Drop duplicate KV-cache initialization (#38799)
namgyu-youn
2026-04-07 03:05:39 +09:00
419e73cdfa
[Bug] Fix mistral version dependency (#39086)
Wentao Ye
2026-04-06 13:31:19 -04:00
f01482408c
[MoE Refactor][Test] FusedMoE layer test (#24675)
bnellnm
2026-04-06 13:17:23 -04:00
bfdc0a3a99
[NIXL][Mamba][3/N] Heterogeneous TP: 3-read conv state transfer (#37635)
zhanqiuhu
2026-04-06 13:07:02 -04:00
93bada494f
[MoE Refactor] Split off DefaultMoERunner class (#35326)
bnellnm
2026-04-06 12:41:59 -04:00
608914de30
[Core] Re-enable Inductor pre-grad passes in standalone compile (torch>=2.12) (#38944)
Frederik Gossen
2026-04-06 12:37:13 -04:00
4ae218c122
[Refactor] Remove unused dead code (#38842)
Wentao Ye
2026-04-06 11:52:05 -04:00
f40d9879f2
[Models][GDN] Remove GPU/CPU syncs in GDNAttentionMetadata.build during speculative decoding (#38047)
Lukas Geiger
2026-04-06 16:39:37 +01:00
47e605092b
[Gemma4] Enable Fast Prefill Optimization (#38879)
Lucas Wilkinson
2026-04-06 11:19:39 -04:00
e69a265135
[Feat][Core] safely abort requests when FSM fails to advance (#38663)
Walter Beller-Morales
2026-04-06 11:00:16 -04:00
fef56c1855
[Mistral Grammar] Support Grammar Factory (#38150)
Julien Denize
2026-04-06 16:28:51 +02:00
c5e3454e5a
[Model] Add support for BharatGen's Param2MoE model (#38000)
bhargav-patel-29
2026-04-06 13:49:56 +05:30
f6983f01de
MiniMax-M2: add Eagle3 speculative decoding support (#37512)
liuchenbing2026
2026-04-06 10:50:18 +08:00
780ba37458
[ROCm][Quantization] Add asymmetric INT8 quantization support to TritonInt8ScaledMMLinearKernel (#38501)
Andreas Karatzas
2026-04-05 20:42:10 -05:00
9570654c6d
[ROCm][CI] Run Kernels Core Operation Test On MI325 and mitigate flakiness (#38184)
Micah Williamson
2026-04-05 20:42:02 -05:00
d56e952239
nano_nemotron_vl: fix tensor device mismatch exception during video profiling (#39029)
Netanel Haber
2026-04-06 01:23:45 +03:00
56de443db1
[ci] Switch some CI jobs to H200 MIG slices (#38956)
Kevin H. Luu
2026-04-05 13:26:11 -07:00
4dd49b06f8
[Bug] Fix Import paths for encoder_cudagraph modules (#38997)
Greg Pereira
2026-04-05 12:11:58 -07:00
f53fa26e05
[Bugfix] Fix invalid JSON in Gemma 4 streaming tool calls by stripping partial delimiters (#38992)
Greg Pereira
2026-04-05 10:11:18 -07:00
1af6f78ae5
[Perf] Change Trtllm fp8 MoE to use Shuffled Weights and BlockMajorK Layout (#38993)
Wei Zhao
2026-04-05 10:54:31 -04:00
228023b3a5
[Bugfix][MoE] Fix 6-8% decode regression: prefer multi-stream shared expert overlap (#38990)
Martin Vit
2026-04-05 16:28:31 +02:00
9a528260ef
[Bugfix][Spec Decode] Fix extract_hidden_states for VLM models (#38987)
Aaron Batilo
2026-04-05 03:41:54 -06:00
968ed02ace
[Quantization][Deprecation] Remove Petit NVFP4 (#32694)
Robert Shaw
2026-04-04 20:07:45 -04:00
7d266abb22
Revert "[vLLM IR] gemma_rms_norm" (#38998)
Robert Shaw
2026-04-04 17:48:08 -04:00
156405d243
[vLLM IR] gemma_rms_norm (#38780)
Xiaoshuang Wang
2026-04-05 01:55:52 +08:00
99e5539a67
[Perf][GDN] Align TMA usage with upstream FLA (#38981)
Artem Perevedentsev
2026-04-04 19:38:02 +03:00
a88ce94bbb
[IR][RmsNorm] pass None if not has_weight (#38961)
Linkun
2026-04-04 08:02:30 -07:00
2a36d8fb72
[Bugfix][CPU] Fix macOS compatibility broken by #36487 (#38970)
Ziming Qi
2026-04-04 10:05:58 -04:00
93726b2a1c
Refactor Arctic loading to use AutoWeightsLoader (#38955)
lalit10
2026-04-03 22:01:09 -07:00
8617f8676b
[Bugfix] Fix DSV32 weight loading (#38870)
Yongye Zhu
2026-04-03 22:57:52 -04:00
06fd9ffcc4
[ROCm][CI] Fix ROCm Dockerfile conftest generation for older Docker parsers (#38959)
Andreas Karatzas
2026-04-03 21:41:41 -05:00
cab4064cd5
[Bug] Fix workspace manager _current_workspaces size (#38853)
Wentao Ye
2026-04-03 21:29:45 -04:00
062f1a2d70
[Bug] Fix compile error for swap_blocks_batch in CUDA 13 (#38915)
Wentao Ye
2026-04-03 19:56:38 -04:00
81994e1d0e
[Bugfix][LoRA] Fix missing in_proj_z in Qwen3_5ForConditionalGenerati… (#38927)
elenalil-aws
2026-04-03 16:30:09 -07:00
4b506ff90a
[ROCm][CI] Minor missing import patch (#38951)
Andreas Karatzas
2026-04-03 18:01:20 -05:00
5875bb2e9c
[ROCm][CI] Added back missing common deps (#38937)
Andreas Karatzas
2026-04-03 17:58:57 -05:00
f0d3ad9f3e
[ci] Remove soft fail for AMD image build job (#38941)
Kevin H. Luu
2026-04-03 13:42:33 -07:00
121ea5a21f
Removed GPU state confirmation and cleanup steps (#38238)
Divin Honnappa
2026-04-03 15:11:08 -05:00
ab79863e6c
Remove MQ multi-node tests (#38934)
Jeffrey Wang
2026-04-03 13:00:08 -07:00
5f1de2b14b
[Model Runner V2] Add config validation for not-yet-supported features (#38758)
Nick Hill
2026-04-03 12:08:08 -07:00
a5a623d961
[Bugfix] Re-enable Renormalize routing for TRT-LLM MoE experts (#38859)
yzong-rh
2026-04-03 13:48:17 -04:00
f8c3af2d85
[vLLM IR] add import_ir_kernels() to support OOT platforms (#38807)
Xiaoshuang Wang
2026-04-04 01:25:19 +08:00
50cd5674b3
Fix invalid logprobs with MTP enabled and sync scheduling (#38711)
danisereb
2026-04-03 19:24:37 +03:00
7b1a7423be
[Frontend] new online quantization frontend (#38138)
Vasiliy Kuznetsov
2026-04-03 11:58:39 -04:00
97f92c6b47
[KVConnector] Skip register_kv_caches on profiling (#38558)
Nicolò Lucchesi
2026-04-03 17:40:16 +02:00
46f02e00f2
[Bugfix] Fix AWQ models batch invariance issues (#38670)
Yusuf Mohammad
2026-04-03 15:54:15 +01:00
6b4872240f
[XPU] bump up xpu-kernel v0.1.5, transpose moe weights (#38342)
Qiming Zhang
2026-04-03 07:10:02 -07:00
580090db6b
[Kernel] Add swapAB support for SM120 CUTLASS blockwise FP8 GEMM (#38325)
Necofish
2026-04-03 21:49:59 +08:00
cb10b7e80b
[GDN] Eliminate GPU->CPU sync in prepare_chunk_indices during prefill (#38361)
Artem Perevedentsev
2026-04-03 16:38:02 +03:00
bf8b022e60
[Intel][Triton] Support round_int8 for Intel backend (#38825)
Mieszko Dziadowiec
2026-04-03 14:47:35 +02:00
40ee64c00e
[XPU][CI] Skip test_topp_only and test_topk_and_topp cases on Intel GPU in CI (#38904)
xiangdong
2026-04-03 20:44:52 +08:00
1b117cb0ac
[ROCm] Fix aiter persistent mode mla with q/o nhead<16 for kimi-k2.5 tp8 (#38615)
wufann
2026-04-03 18:54:00 +08:00
abebd9323d
[CPU] Replace OMP initialization (#36487)
Anton Ivanov
2026-04-03 11:42:43 +01:00
25f2b55319
[Frontend] feat: add streaming support for token generation endpoint (#37171)
Hyeonki Hong
2026-04-03 19:20:32 +09:00
cb4ff07f8b
[XPU][CI] Skip test_topk_only cases on Intel GPU in CI (#38899)
xiangdong
2026-04-03 17:50:41 +08:00
a7d79fa133
[ROCm][CI/Build] Fix the pytest hook to properly print out the summary (#38585)
Gregory Shtrasberg
2026-04-03 04:24:26 -05:00
fa9e68022d
Fix Nano Nemotron VL regressions (#38655)
Netanel Haber
2026-04-03 10:22:06 +03:00
5506435419
[Misc] Clean up Gemma4 implementation (#38872)
v0.19.1rc0
Isotr0py
2026-04-03 13:47:02 +08:00
311c981647
[MRV2][KVConnector] Fix missing build_connector_worker_meta (#38698)
Yifan Qiao
2026-04-02 22:42:52 -07:00
21d7ecc5b0
[CI/Build] Add audio deps in Dockerfile.cpu (#38876)
Li, Jiang
2026-04-03 13:05:14 +08:00
4729b90838
[Bug] Add e_score_correction_bias to SKIP_TENSORS (#38746)
Aaron Hao
2026-04-02 21:15:05 -07:00
8b141ed8c3
full cudagraph for flex-attn (#36298)
shunting314
2026-04-02 21:15:01 -07:00
2ad7c0335f
[Model] Add Phi4ForCausalLMV for microsoft/Phi-4-reasoning-vision-15B (#38306)
Varun Sundar Rabindranath
2026-04-03 00:14:57 -04:00
201d2ea5bf
[CI][ROCm] Add Qwen3.5-35B-A3B-MXFP4 model eval into CI (#38664)
Bowen Bao
2026-04-02 21:05:45 -07:00
103f0de565
[ROCm][Quantization][1/N] Refactor quark_moe w_mxfp4 w/ oracle (#38774)
Bowen Bao
2026-04-02 20:29:57 -07:00
32e0c0bfa2
refactor hard coded device string in test files under tests/v1 and tests/lora (#37566)
wliao2
2026-04-02 20:21:47 -07:00
4a06e1246e
[Perf] Batch KV cache swap copies via cuMemcpyBatchAsync (#38460)
Itay Etelis
2026-04-03 06:13:23 +03:00
3bc2734dd0
[Kernel] Fuse FP8 output quantization into merge_attn_states (#36518)
Carl Y
2026-04-02 18:47:04 -07:00
1f5ec2889c
[mla] Support fused FP8/NVFP4 output quantization in MLA attention (#35792) (#36205)
Carl Y
2026-04-02 18:16:11 -07:00
ee3cf45739
[XPU] Initial support for GDN attention on Qwen3-next/Qwen3.5 (#33657)
Yan Ma
2026-04-03 08:59:11 +08:00
05e68e1f81
[CI] Fix test_nixl_connector (#38838)
Matthew Bonanni
2026-04-02 20:52:13 -04:00
771913e4a0
[Bugfix] Fix NVFP4+MTP crash: force unquantized mtp.fc for Qwen3.5 (#38832)
Vadim Gimpelson
2026-04-03 04:45:57 +04:00
71a9125c67
[New Model]: add support for telechat3 (#38510)
1096125073
2026-04-03 08:26:22 +08:00
66e86f1dbd
[Kernel] Mamba support different layout for Conv state (#37416)
Nicolò Lucchesi
2026-04-03 01:50:09 +02:00
2a69949bda
[Bugfix]: Fix Gemma4ToolParser.__init__() missing tools parameter (#38847)
v0.19.0
Michael
2026-04-02 17:35:19 -04:00
bb39382b2b
[Bugfix]: Fix Gemma4ToolParser.__init__() missing tools parameter (#38847)
Michael
2026-04-02 17:35:19 -04:00
7b743ba953
[CI] Fix: pass string cache_dtype in test_register_kv_caches (#38836)
zhanqiuhu
2026-04-02 15:42:09 -04:00
188defbd0b
[CI] Add flashinfer.py to attention test source deps (#38792)
Stefano Castagnetta
2026-04-02 21:24:29 +02:00
8adcf8c40a
feat(models): implement Google Gemma 4 architecture support (MoE, Multimodal, Reasoning, Tool-Use) (#38826)
Luciano Martins
2026-04-02 15:13:28 -03:00
08ed2b9688
feat(models): implement Google Gemma 4 architecture support (MoE, Multimodal, Reasoning, Tool-Use) (#38826)
Luciano Martins
2026-04-02 15:13:28 -03:00
ecd5443dbc
Bump helion dependency from 0.3.2 to 0.3.3 (#38062)
Yanan Cao
2026-04-02 10:59:33 -07:00
58262dec6e
[Bugfix] Fix test mocks after SM100 restriction in #38730 (#38791)
Stefano Castagnetta
2026-04-02 19:12:58 +02:00
cb3935a8fc
[FA4] Update flash-attention to latest upstream FA4 (#38690)
Lucas Wilkinson
2026-04-02 13:02:37 -04:00
82a006beeb
[CI][ROCm] Add gpt-oss w4a8 in CI (#38292)
Bowen Bao
2026-04-02 09:06:01 -07:00
a9b4f07ba2
[Frontend] Re-enable running MaxSim on GPU (#38620)
wang.yuqi
2026-04-03 00:03:13 +08:00
d9408ffba3
Triton MLA perf fixes (#33529)
Koushik Dutta
2026-04-02 06:40:01 -07:00
16a65e4173
[Bugfix] Enable batch-invariant Triton matmul on all Ampere GPUs (SM 8x) (#38427)
Yusuf Mohammad
2026-04-02 14:29:58 +01:00
c0817e4d39
[Model] Add support for Cheers multimodal model (#38788)
bsliu
2026-04-02 21:01:40 +08:00
dfe5e31689
Don't compile vision encoder for Transformers backend (#30518)
Harry Mellor
2026-04-02 13:42:29 +01:00
2ce3d0ce36
[Feature] KV cache per-token-head INT8/FP8 quantization (#38378)
JartX
2026-04-02 14:13:26 +02:00
4eefbf9609
[Perf] fuse kernels in gdn (#37813)
Jiangyun Zhu
2026-04-02 19:52:18 +08:00
551b3fb39f
[ROCm] Enable VLLM triton FP8 moe for gfx1201, tuned for Qwen3-30B-A3B-FP8 tp=2 and Qwen/Qwen3.5-35B-A3B-FP8 tp=2 (#38086)
vllmellm
2026-04-02 16:13:42 +08:00
c6f722b93e
[CPU] Support gelu act in cpu_fused_moe (#38770)
Li, Jiang
2026-04-02 14:14:32 +08:00
9bd7231106
Revert "[Kernel] Add gpt-oss Router GEMM kernel (#37205)" (#38778)
Xin Yang
2026-04-01 22:02:32 -07:00