Commit Graph

  • b2b2c5239e [MoE Refactor] Split up compressed_tensors_moe.py (#38960) bnellnm 2026-04-06 20:07:54 -04:00
  • 00d7b497b3 [NVFP4] Support NVFP4 dense models from modelopt and compressed-tensors on AMD Instinct MI300, MI355X and Hopper through emulation (#35733) fxmarty-amd 2026-04-07 00:18:27 +02:00
  • 9c81f35b1a [Attention][MLA] Re-enable FA4 as default MLA prefill backend (#38819) Matthew Bonanni 2026-04-06 17:51:46 -04:00
  • f186cfe75e [MRV2] Fix hanging issue with DeepSeek V3.2 by setting skip_attn=False (#39098) Woosuk Kwon 2026-04-06 12:55:13 -07:00
  • dfa5062a8f NemotronH default mamba_ssm_cache_dtype=float32; enable auto-hook for NemotronHNanoVLV2Config (#39032) Netanel Haber 2026-04-06 22:47:46 +03:00
  • e8ebbdde83 [Quantization] Add FlashInfer CuteDSL batched experts backend for NVFP4 MoE (#38251) Yongye Zhu 2026-04-06 14:57:53 -04:00
  • 94fbb09894 [EASY] Drop duplicate KV-cache initialization (#38799) namgyu-youn 2026-04-07 03:05:39 +09:00
  • 419e73cdfa [Bug] Fix mistral version dependency (#39086) Wentao Ye 2026-04-06 13:31:19 -04:00
  • f01482408c [MoE Refactor][Test] FusedMoE layer test (#24675) bnellnm 2026-04-06 13:17:23 -04:00
  • bfdc0a3a99 [NIXL][Mamba][3/N] Heterogeneous TP: 3-read conv state transfer (#37635) zhanqiuhu 2026-04-06 13:07:02 -04:00
  • 93bada494f [MoE Refactor] Split of DefaultMoERunner class (#35326) bnellnm 2026-04-06 12:41:59 -04:00
  • 608914de30 [Core] Re-enable Inductor pre-grad passes in standalone compile (torch>=2.12) (#38944) Frederik Gossen 2026-04-06 12:37:13 -04:00
  • 4ae218c122 [Refactor] Remove unused dead code (#38842) Wentao Ye 2026-04-06 11:52:05 -04:00
  • f40d9879f2 [Models][GDN] Remove GPU/CPU syncs in GDNAttentionMetadata.build during speculative decoding (#38047) Lukas Geiger 2026-04-06 16:39:37 +01:00
  • 47e605092b [Gemma4] Enable Fast Prefill Optimization (#38879) Lucas Wilkinson 2026-04-06 11:19:39 -04:00
  • e69a265135 [Feat][Core] safely abort requests when FSM fails to advance (#38663) Walter Beller-Morales 2026-04-06 11:00:16 -04:00
  • fef56c1855 [Mistral Grammar] Support Grammar Factory (#38150) Julien Denize 2026-04-06 16:28:51 +02:00
  • c5e3454e5a [Model] Add support for BharatGen's Param2MoE model (#38000) bhargav-patel-29 2026-04-06 13:49:56 +05:30
  • f6983f01de MiniMax-M2: add Eagle3 speculative decoding support (#37512) liuchenbing2026 2026-04-06 10:50:18 +08:00
  • 780ba37458 [ROCm][Quantization] Add asymmetric INT8 quantization support to TritonInt8ScaledMMLinearKernel (#38501) Andreas Karatzas 2026-04-05 20:42:10 -05:00
  • 9570654c6d [ROCm][CI] Run Kernels Core Operation Test On MI325 and mitigate flakiness (#38184) Micah Williamson 2026-04-05 20:42:02 -05:00
  • d56e952239 nano_nemotron_vl: fix tensor device mismatch exception when video profiling (#39029) Netanel Haber 2026-04-06 01:23:45 +03:00
  • 56de443db1 [ci] Switch some CI jobs to H200 MIG slices (#38956) Kevin H. Luu 2026-04-05 13:26:11 -07:00
  • 4dd49b06f8 [Bug] Fix Import paths for encoder_cudagraph modules (#38997) Greg Pereira 2026-04-05 12:11:58 -07:00
  • f53fa26e05 [Bugfix] Fix invalid JSON in Gemma 4 streaming tool calls by stripping partial delimiters (#38992) Greg Pereira 2026-04-05 10:11:18 -07:00
  • 1af6f78ae5 [Perf] Change Trtllm fp8 MoE to use Shuffled Weights and BlockMajorK Layout (#38993) Wei Zhao 2026-04-05 10:54:31 -04:00
  • 228023b3a5 [Bugfix][MoE] Fix 6-8% decode regression: prefer multi-stream shared expert overlap (#38990) Martin Vit 2026-04-05 16:28:31 +02:00
  • 9a528260ef [Bugfix][Spec Decode] Fix extract_hidden_states for VLM models (#38987) Aaron Batilo 2026-04-05 03:41:54 -06:00
  • 968ed02ace [Quantization][Deprecation] Remove Petit NVFP4 (#32694) Robert Shaw 2026-04-04 20:07:45 -04:00
  • 7d266abb22 Revert "[vLLM IR] gemma_rms_norm" (#38998) Robert Shaw 2026-04-04 17:48:08 -04:00
  • 156405d243 [vLLM IR] gemma_rms_norm (#38780) Xiaoshuang Wang 2026-04-05 01:55:52 +08:00
  • 99e5539a67 [Perf][GDN] Align TMA usage with upstream FLA (#38981) Artem Perevedentsev 2026-04-04 19:38:02 +03:00
  • a88ce94bbb [IR][RmsNorm] pass None if not has_weight (#38961) Linkun 2026-04-04 08:02:30 -07:00
  • 2a36d8fb72 [Bugfix][CPU] Fix macOS compatibility broken by #36487 (#38970) Ziming Qi 2026-04-04 10:05:58 -04:00
  • 93726b2a1c Refactor Arctic loading to use AutoWeightsLoader (#38955) lalit10 2026-04-03 22:01:09 -07:00
  • 8617f8676b [Bugfix] Fix DSV32 weight loading (#38870) Yongye Zhu 2026-04-03 22:57:52 -04:00
  • 06fd9ffcc4 [ROCm][CI] Fix ROCm Dockerfile conftest generation for older Docker parsers (#38959) Andreas Karatzas 2026-04-03 21:41:41 -05:00
  • cab4064cd5 [Bug] Fix workspace manager _current_workspaces size (#38853) Wentao Ye 2026-04-03 21:29:45 -04:00
  • 062f1a2d70 [Bug] Fix compile error for swap_blocks_batch in CUDA 13 (#38915) Wentao Ye 2026-04-03 19:56:38 -04:00
  • 81994e1d0e [Bugfix][LoRA] Fix missing in_proj_z in Qwen3_5ForConditionalGenerati… (#38927) elenalil-aws 2026-04-03 16:30:09 -07:00
  • 4b506ff90a [ROCm][CI] Minor missing import patch (#38951) Andreas Karatzas 2026-04-03 18:01:20 -05:00
  • 5875bb2e9c [ROCm][CI] Added back missing common deps (#38937) Andreas Karatzas 2026-04-03 17:58:57 -05:00
  • f0d3ad9f3e [ci] Remove soft fail for AMD image build job (#38941) Kevin H. Luu 2026-04-03 13:42:33 -07:00
  • 121ea5a21f Removed GPU state confirmation and cleanup steps. (#38238) Divin Honnappa 2026-04-03 15:11:08 -05:00
  • ab79863e6c Remove MQ multi-node tests (#38934) Jeffrey Wang 2026-04-03 13:00:08 -07:00
  • 5f1de2b14b [Model Runner V2] Add config validation for not-yet-supported features (#38758) Nick Hill 2026-04-03 12:08:08 -07:00
  • a5a623d961 [Bugfix] Re-enable Renormalize routing for TRT-LLM MoE experts (#38859) yzong-rh 2026-04-03 13:48:17 -04:00
  • f8c3af2d85 [vLLM IR] add import_ir_kernels() to support OOT platforms (#38807) Xiaoshuang Wang 2026-04-04 01:25:19 +08:00
  • 50cd5674b3 Fix invalid logprobs with MTP enabled and sync scheduling (#38711) danisereb 2026-04-03 19:24:37 +03:00
  • 7b1a7423be [Frontend] new online quantization frontend (#38138) Vasiliy Kuznetsov 2026-04-03 11:58:39 -04:00
  • 97f92c6b47 [KVConnector] Skip register_kv_caches on profiling (#38558) Nicolò Lucchesi 2026-04-03 17:40:16 +02:00
  • 46f02e00f2 [Bugfix] Fix AWQ models batch invariance issues (#38670) Yusuf Mohammad 2026-04-03 15:54:15 +01:00
  • 6b4872240f [XPU] bump up xpu-kernel v0.1.5, transpose moe weights (#38342) Qiming Zhang 2026-04-03 07:10:02 -07:00
  • 580090db6b [Kernel] Add swapAB support for SM120 CUTLASS blockwise FP8 GEMM (#38325) Necofish 2026-04-03 21:49:59 +08:00
  • cb10b7e80b [GDN] Eliminate GPU->CPU sync in prepare_chunk_indices during prefill (#38361) Artem Perevedentsev 2026-04-03 16:38:02 +03:00
  • bf8b022e60 [Intel][Triton] Support round_int8 for Intel backend (#38825) Mieszko Dziadowiec 2026-04-03 14:47:35 +02:00
  • 40ee64c00e [XPU][CI] Skip test_topp_only and test_topk_and_topp cases on Intel GPU in CI (#38904) xiangdong 2026-04-03 20:44:52 +08:00
  • 1b117cb0ac [ROCm] Fix aiter persistent mode mla with q/o nhead<16 for kimi-k2.5 tp8 (#38615) wufann 2026-04-03 18:54:00 +08:00
  • abebd9323d [CPU] Replace OMP initialization (#36487) Anton Ivanov 2026-04-03 11:42:43 +01:00
  • 25f2b55319 [Frontend] feat: add streaming support for token generation endpoint (#37171) Hyeonki Hong 2026-04-03 19:20:32 +09:00
  • cb4ff07f8b [XPU][CI] Skip test_topk_only cases on Intel GPU in CI (#38899) xiangdong 2026-04-03 17:50:41 +08:00
  • a7d79fa133 [ROCm][CI/Build] Fix the pytest hook to properly print out the summary (#38585) Gregory Shtrasberg 2026-04-03 04:24:26 -05:00
  • fa9e68022d Fix Nano Nemotron VL regressions (#38655) Netanel Haber 2026-04-03 10:22:06 +03:00
  • 5506435419 [Misc] Clean up Gemma4 implementation (#38872) (tag: v0.19.1rc0) Isotr0py 2026-04-03 13:47:02 +08:00
  • 311c981647 [MRV2][KVConnector] Fix missing build_connector_worker_meta (#38698) Yifan Qiao 2026-04-02 22:42:52 -07:00
  • 21d7ecc5b0 [CI/Build] Add audio deps in Dockerfile.cpu (#38876) Li, Jiang 2026-04-03 13:05:14 +08:00
  • 4729b90838 [Bug] Add e_score_correction_bias to SKIP_TENSORS (#38746) Aaron Hao 2026-04-02 21:15:05 -07:00
  • 8b141ed8c3 full cudagraph for flex-attn (#36298) shunting314 2026-04-02 21:15:01 -07:00
  • 2ad7c0335f [Model] Add Phi4ForCausalLMV for microsoft/Phi-4-reasoning-vision-15B (#38306) Varun Sundar Rabindranath 2026-04-03 00:14:57 -04:00
  • 201d2ea5bf [CI][ROCm] Add Qwen3.5-35B-A3B-MXFP4 model eval into CI (#38664) Bowen Bao 2026-04-02 21:05:45 -07:00
  • 103f0de565 [ROCm][Quantization][1/N] Refactor quark_moe w_mxfp4 w/ oracle (#38774) Bowen Bao 2026-04-02 20:29:57 -07:00
  • 32e0c0bfa2 refactor hard coded device string in test files under tests/v1 and tests/lora (#37566) wliao2 2026-04-02 20:21:47 -07:00
  • 4a06e1246e [Perf] Batch KV cache swap copies via cuMemcpyBatchAsync (#38460) Itay Etelis 2026-04-03 06:13:23 +03:00
  • 3bc2734dd0 [Kernel] Fuse FP8 output quantization into merge_attn_states (#36518) Carl Y 2026-04-02 18:47:04 -07:00
  • 1f5ec2889c [mla] Support fused FP8/NVFP4 output quantization in MLA attention (#35792) (#36205) Carl Y 2026-04-02 18:16:11 -07:00
  • ee3cf45739 [XPU] Initial support for GDN attention on Qwen3-next/Qwen3.5 (#33657) Yan Ma 2026-04-03 08:59:11 +08:00
  • 05e68e1f81 [CI] Fix test_nixl_connector (#38838) Matthew Bonanni 2026-04-02 20:52:13 -04:00
  • 771913e4a0 [Bugfix] Fix NVFP4+MTP crash: force unquantized mtp.fc for Qwen3.5 (#38832) Vadim Gimpelson 2026-04-03 04:45:57 +04:00
  • 71a9125c67 [New Model]: add support for telechat3 (#38510) 1096125073 2026-04-03 08:26:22 +08:00
  • 66e86f1dbd [Kernel] Mamba support different layout for Conv state (#37416) Nicolò Lucchesi 2026-04-03 01:50:09 +02:00
  • 2a69949bda [Bugfix]: Fix Gemma4ToolParser.__init__() missing tools parameter (#38847) (tag: v0.19.0) Michael 2026-04-02 17:35:19 -04:00
  • bb39382b2b [Bugfix]: Fix Gemma4ToolParser.__init__() missing tools parameter (#38847) Michael 2026-04-02 17:35:19 -04:00
  • 7b743ba953 [CI] Fix: pass string cache_dtype in test_register_kv_caches (#38836) zhanqiuhu 2026-04-02 15:42:09 -04:00
  • 188defbd0b [CI] Add flashinfer.py to attention test source deps (#38792) Stefano Castagnetta 2026-04-02 21:24:29 +02:00
  • 8adcf8c40a feat(models): implement Google Gemma 4 architecture support (MoE, Multimodal, Reasoning, Tool-Use) (#38826) Luciano Martins 2026-04-02 15:13:28 -03:00
  • 08ed2b9688 feat(models): implement Google Gemma 4 architecture support (MoE, Multimodal, Reasoning, Tool-Use) (#38826) Luciano Martins 2026-04-02 15:13:28 -03:00
  • ecd5443dbc Bump helion dependency from 0.3.2 to 0.3.3 (#38062) Yanan Cao 2026-04-02 10:59:33 -07:00
  • 58262dec6e [Bugfix] Fix test mocks after SM100 restriction in #38730 (#38791) Stefano Castagnetta 2026-04-02 19:12:58 +02:00
  • cb3935a8fc [FA4] Update flash-attention to latest upstream FA4 (#38690) Lucas Wilkinson 2026-04-02 13:02:37 -04:00
  • 82a006beeb [CI][ROCm] Add gpt-oss w4a8 in CI (#38292) Bowen Bao 2026-04-02 09:06:01 -07:00
  • a9b4f07ba2 [Frontend] Re-enable running MaxSim on GPU (#38620) wang.yuqi 2026-04-03 00:03:13 +08:00
  • d9408ffba3 Triton MLA perf fixes (#33529) Koushik Dutta 2026-04-02 06:40:01 -07:00
  • 16a65e4173 [Bugfix] Enable batch-invariant Triton matmul on all Ampere GPUs (SM 8x) (#38427) Yusuf Mohammad 2026-04-02 14:29:58 +01:00
  • c0817e4d39 [Model] Add support for Cheers multimodal model (#38788) bsliu 2026-04-02 21:01:40 +08:00
  • dfe5e31689 Don't compile vision encoder for Transformers backend (#30518) Harry Mellor 2026-04-02 13:42:29 +01:00
  • 2ce3d0ce36 [Feature] KV cache per-token-head INT8/FP8 quantization (#38378) JartX 2026-04-02 14:13:26 +02:00
  • 4eefbf9609 [Perf] fuse kernels in gdn (#37813) Jiangyun Zhu 2026-04-02 19:52:18 +08:00
  • 551b3fb39f [ROCm] Enable VLLM triton FP8 moe for gfx1201, tuned for Qwen3-30B-A3B-FP8 tp=2 and Qwen/Qwen3.5-35B-A3B-FP8 tp=2 (#38086) vllmellm 2026-04-02 16:13:42 +08:00
  • c6f722b93e [CPU] Support gelu act in cpu_fused_moe (#38770) Li, Jiang 2026-04-02 14:14:32 +08:00
  • 9bd7231106 Revert "[Kernel] Add gpt-oss Router GEMM kernel (#37205)" (#38778) Xin Yang 2026-04-01 22:02:32 -07:00