Commit Graph

  • cea96a0156 [Bugfix] Fix sync_and_slice_intermediate_tensors (#21537) Rui Qiao 2025-07-25 17:07:58 -07:00
  • 2eddd437ba Add interleaved RoPE test for Llama4 (Maverick) (#21478) Yong Hoon Shin 2025-07-25 17:07:26 -07:00
  • 75d29cf4e1 [Perf] Cuda Kernel for Int8 Per Token Group Quant (#21476) Wentao Ye 2025-07-25 20:07:07 -04:00
  • 41d3082c41 Add Unsloth to RLHF.md (#21636) Daniel Han 2025-07-25 17:06:48 -07:00
  • 7cfea0df39 [TPU][Test] Rollback PR-21550. (#21619) QiliangCui 2025-07-25 13:22:01 -07:00
  • 5ac3168ee3 [Docs] add auto-round quantization readme (#21600) Wenhua Cheng 2025-07-25 23:52:42 +08:00
  • 396ee94180 [CI] Unifying Dockerfiles for ARM and X86 Builds (#21343) Kebe 2025-07-25 22:33:56 +08:00
  • e189b50f53 Add support for Prithvi in Online serving mode (#21518) mgazz 2025-07-25 15:01:27 +01:00
  • 136d750f5f [Kernel] Improve machete memory bound perf (#21556) czhu-cohere 2025-07-25 06:53:21 -07:00
  • b3caeb82e7 [ROCm][AITER] Enable fp8 kv cache on rocm aiter backend. (#20295) who who who 2025-07-25 21:50:21 +08:00
  • eab2f3980c [Model] Replace Mamba2 RMSNorm Gated with Fused Triton Kernel (#20839) Chih-Chieh Yang 2025-07-25 09:49:36 -04:00
  • 9fe98d4250 [Frontend] Add request_id to the Request object so they can be controlled better via external load balancers (#21009) kourosh hakhamaneshi 2025-07-25 06:49:11 -07:00
  • 29c6fbe58c [MODEL] New model support for naver-hyperclovax/HyperCLOVAX-SEED-Vision-Instruct-3B (#20931) bigshanedogg 2025-07-25 22:05:42 +09:00
  • c72f049cb4 [Model] Fix Ernie4.5MoE e_score_correction_bias parameter (#21586) xyxinyang 2025-07-25 21:02:53 +08:00
  • f3a683b7c9 [Bugfix][Logprobs] Fix logprobs op to support more backend (#21591) Mengqing Cao 2025-07-25 20:53:07 +08:00
  • 46d81d6951 [V1] Get supported tasks from model runner instead of model config (#21585) Cyrus Leung 2025-07-25 20:36:45 +08:00
  • 5c3f2628d5 [Quantization] Enable BNB support for more MoE models (#21370) Jee Jee Li 2025-07-25 18:57:34 +08:00
  • 7311f74468 [Bugfix] GGUF: fix AttributeError: 'PosixPath' object has no attribute 'startswith' (#21579) Kebe 2025-07-25 18:42:23 +08:00
  • 8ed01e32f7 Add H20-3e fused MoE kernel tuning configs for Qwen3-Coder-480B-A35B-Instruct (#21598) Xu Wenqing 2025-07-25 17:36:55 +08:00
  • e38e96a3c0 [Tests] Harden DP tests (#21508) Nick Hill 2025-07-25 10:27:24 +01:00
  • 40d86ee412 [TPU][Bugfix] fix OOM issue in CI test (#21550) Chengji Yao 2025-07-24 23:01:53 -07:00
  • 85d051f026 [Misc] Removed undefined cmake variables MOE_PERMUTE_ARCHS (#21262) Yang Chen 2025-07-24 22:54:23 -07:00
  • 5140f54b89 [CI/Build] fix cpu_extension for apple silicon (#21195) Ignacio Sica 2025-07-25 02:53:59 -03:00
  • 947edd099e [Misc][Tools] make max-model-len a parameter in auto_tune script (#21321) Chengji Yao 2025-07-24 22:46:43 -07:00
  • fde60ee775 [Model] Fix a check for None but the return value was empty list in Gemma3 MM vision_embeddings (#21479) hfan 2025-07-25 01:46:06 -04:00
  • b38bc652ac [Model] Support tensor parallel for timm ViT in Deepseek_vl2 (#21494) Jason Gu 2025-07-25 13:45:16 +08:00
  • adaf2c6d4f [Bugfix] fix modelscope snapshot_download serialization (#21536) Ning Xie 2025-07-25 13:44:38 +08:00
  • 42343f1f89 [CI] Update CODEOWNERS for CPU and Intel GPU (#21582) Li, Jiang 2025-07-25 12:58:03 +08:00
  • 965bc71b04 Integrate TensorSchema with shape validation for Phi3VImagePixelInputs (#21232) Benji Beck 2025-07-24 21:43:52 -07:00
  • 807a328bb6 [Docs] Add requirements/common.txt to run unit tests (#21572) Zhou Fang 2025-07-24 20:51:15 -07:00
  • e0be2c4d09 [TPU][Test] Temporarily suspend this MoE model in test_basic.py. (#21560) QiliangCui 2025-07-24 20:44:50 -07:00
  • 9c8b2c2a8a [DP] Support api-server-count > 0 in hybrid DP LB mode (#21510) Nick Hill 2025-07-25 04:18:16 +01:00
  • 2212cd6cfb [Bugfix] DeepGemm utils : Fix hardcoded type-cast (#21517) Varun Sundar Rabindranath 2025-07-25 08:47:29 +05:30
  • ce3a9b1378 [Kernel] adding fused_moe configs for upcoming granite4 (#21332) Burkhard Ringlein 2025-07-25 05:16:59 +02:00
  • 2ce90e5b01 Fix GLM-4 PP Missing Layer When using with PP. (#21531) Yuxuan Zhang 2025-07-25 11:07:38 +08:00
  • 633f6e804b [Bug] Fix DeepGemm Init Error (#21554) Wentao Ye 2025-07-24 23:07:22 -04:00
  • b57296bb9a [Docs] Fix site_url for RunLLM (#21564) Harry Mellor 2025-07-25 04:05:58 +01:00
  • 34ddcf9ff4 [Frontend] run-batch supports V1 (#21541) Cyrus Leung 2025-07-25 11:05:55 +08:00
  • fe56180c7f [MoE] More balanced expert sharding (#21497) Woosuk Kwon 2025-07-24 15:56:08 -07:00
  • 07d80d7b0e [TPU][TEST] HF_HUB_DISABLE_XET=1 the test 3. (#21539) QiliangCui 2025-07-24 15:33:04 -07:00
  • 2dd72d23d9 update flashinfer to v0.2.9rc1 (#21485) weiliang 2025-07-25 05:06:11 +08:00
  • a6c7fb8cff [Docs] Add Expert Parallelism Initial Documentation (#21373) Simon Mo 2025-07-24 12:36:06 -07:00
  • a7272c23d0 [Docs][minor] Fix broken gh-file link in distributed serving docs (#21543) Ricardo Decal 2025-07-24 10:36:56 -07:00
  • 6066284914 [P/D] Support CPU Transfer in NixlConnector (#18293) Juncheng Gu 2025-07-24 09:58:42 -07:00
  • 1e9ea8e69d [P/D] Move FakeNixlWrapper to test dir (#21328) Rui Qiao 2025-07-24 08:53:45 -07:00
  • d9f9a3fd96 [XPU] Conditionally import CUDA-specific passes to avoid import errors on xpu platform (#21036) Chaojun Zhang 2025-07-24 23:23:36 +08:00
  • 1b25f1fe75 Update flashinfer CUTLASS MoE Kernel (#21408) Shu Wang 2025-07-24 10:13:31 -05:00
  • e8cb0d0495 [Bug] Fix Compressed Tensor NVFP4 cutlass_fp4_group_mm illegal memory access (#21465) Wentao Ye 2025-07-24 11:13:24 -04:00
  • 684174115d [Docs] Rewrite Distributed Inference and Serving guide (#20593) Ricardo Decal 2025-07-24 08:13:05 -07:00
  • cdb79ee63d [Docs] Update Tensorizer usage documentation (#21190) Sanger Steel 2025-07-24 09:56:18 -04:00
  • 5a19a6c670 [Fix] Update mamba_ssm to 2.2.5 (#21421) elvischenv 2025-07-24 18:25:41 +08:00
  • 2ded067fd2 [Bugfix] Fix CUDA arch flags for MoE permute (#21426) Ming Yang 2025-07-24 03:23:59 -07:00
  • 13abd0eaf9 [Model] Officially support Emu3 with Transformers backend (#21319) Harry Mellor 2025-07-24 11:22:12 +01:00
  • 61b8cea3b4 [Attention] Optimize FlashInfer MetadataBuilder Build call (#21137) Lucas Wilkinson 2025-07-24 06:21:46 -04:00
  • 526078a96c bump flashinfer to v0.2.8 (#21385) cjackal 2025-07-24 19:20:38 +09:00
  • 6da0078523 [Feat] Allow custom naming of vLLM processes (#21445) Chauncey 2025-07-24 18:15:23 +08:00
  • 73e3949d07 [Misc] Improve comment for DPEngineCoreActor._set_cuda_visible_devices() (#21501) Rui Qiao 2025-07-24 03:13:40 -07:00
  • 6eca337ce0 Replace --expand-tools-even-if-tool-choice-none with --exclude-tools-when-tool-choice-none for v0.10.0 (#20544) Shintarou Okada 2025-07-24 18:56:36 +09:00
  • 85bda9e7d0 remove GLM-4.5 quantization wrong Code (#21435) Yuxuan Zhang 2025-07-24 16:52:43 +08:00
  • 610852a423 [Core] Support model loader plugins (#21067) 22quinn 2025-07-24 01:49:44 -07:00
  • f0f4de8f26 [Misc] Fix duplicate FusedMoEConfig debug messages (#21455) Nick Hill 2025-07-24 09:27:30 +01:00
  • fc5f756db4 [v1][Core] Clean up usages of SpecializedManager (#21407) Zhou Fang 2025-07-24 00:40:11 -07:00
  • e74bfc70e4 [TPU][Bugfix] fix moe layer (#21340) Chengji Yao 2025-07-24 00:38:39 -07:00
  • 90eeea8f85 [Bugfix][ROCm] Fix for warp_size uses on host (#21205) Gregory Shtrasberg 2025-07-24 03:37:19 -04:00
  • dde295a934 Deduplicate Transformers backend code using inheritance (#21461) Harry Mellor 2025-07-24 08:16:23 +01:00
  • 6d8d0a24c0 Add think chunk (#21333) (tags: v0.10.0rc2, v0.10.0) Julien Denize 2025-07-24 06:51:32 +02:00
  • 11ef7a611e [BugFix] Set CUDA_VISIBLE_DEVICES before spawning the subprocesses (#21211) Yinghai Lu 2025-07-23 21:44:04 -07:00
  • dc2f159f8a Dump input metadata on crash for async scheduling (#21258) Woosuk Kwon 2025-07-23 21:10:30 -07:00
  • d5b981f8b1 [DP] Internal Load Balancing Per Node [one-pod-per-node] (#21238) Robert Shaw 2025-07-23 23:57:32 -04:00
  • eec6942014 [BugFix] Fix KVConnector TP worker aggregation (#21473) Nick Hill 2025-07-24 04:56:49 +01:00
  • fd48d99ffd [BugFix]: Batch generation from prompt_embeds fails for long prompts (#21390) KazusatoOoko 2025-07-23 20:43:17 -07:00
  • f8c15c4efb [Bugfix] Fix example disagg_example_p2p_nccl_xpyd.sh zombie process (#21437) WeiQing Chen 2025-07-24 11:42:11 +08:00
  • aa08a954f9 [Bugfix] Fix casing warning (#21468) Matthew Bonanni 2025-07-23 23:41:23 -04:00
  • 13e4ee1dc3 [XPU][UT] increase intel xpu CI test scope (#21492) Liangliang Ma 2025-07-24 11:24:04 +08:00
  • 772ce5af97 [Misc] Add dummy maverick test to CI (#21324) Ming Yang 2025-07-23 20:22:42 -07:00
  • 63d92abb7c [Frontend] Set MAX_AUDIO_CLIP_FILESIZE_MB via env var instead of hardcoding (#21374) deven-labovitch 2025-07-23 23:22:19 -04:00
  • 11599b0e1f feat(gguf_loader): accept HF repo paths & URLs for GGUF (#20793) Hardik Gupta 2025-07-23 20:21:02 -07:00
  • f3137cdd81 [Core] Freeze gc during cuda graph capture to speed up init (#21146) Michael Goin 2025-07-23 20:20:14 -04:00
  • 82ec66f514 [V0 Deprecation] Remove Prompt Adapters (#20588) Michael Goin 2025-07-23 19:36:48 -04:00
  • 78c13e30e1 [V1] Fix local chunked attention always disabled (#21419) Yong Hoon Shin 2025-07-23 15:59:30 -07:00
  • 5c9b807b34 [Core] Add reload_weights RPC method (#20096) 22quinn 2025-07-23 14:24:52 -07:00
  • 14bf19e39f [TPU][TEST] Fix the downloading issue in TPU v1 test 11. (#21418) QiliangCui 2025-07-23 11:29:36 -07:00
  • 4ac7713e32 Add test case for compiling multiple graphs (#21044) Yong Hoon Shin 2025-07-23 11:00:47 -07:00
  • 8560a5b258 [Core][Model] PrithviMAE Enablement on vLLM v1 engine (#20577) Christian Pinto 2025-07-23 19:00:23 +01:00
  • 316b1bf706 [Tests] Add tests for headless internal DP LB (#21450) Nick Hill 2025-07-23 15:49:25 +01:00
  • 7c734ee09b [Bugfix][Qwen][DCA] fixes bug in dual-chunk-flash-attn backend for qwen 1m models. (#21364) Tao He 2025-07-23 21:34:37 +08:00
  • f59ec35b7f [V1] Check all pooling tasks during profiling (#21299) Cyrus Leung 2025-07-23 20:53:26 +08:00
  • 2671334d45 [Model] add Hunyuan V1 Dense Model support. (#21368) Asher 2025-07-23 18:54:08 +08:00
  • 2cc5016a19 [Docs] Clean up v1/metrics.md (#21449) Michael Yao 2025-07-23 18:37:25 +08:00
  • 6929f8b437 [Misc] fixed nvfp4_moe test failures due to invalid kwargs (#21246) Yang Chen 2025-07-23 01:41:43 -07:00
  • 32ec9e2f2a Mamba V2 Test not Asserting Failures. (#21379) Yu Chin Fabian Lim 2025-07-23 04:40:27 -04:00
  • accac82928 [Sampler] Introduce logprobs mode for logging (#21398) Lu Fang 2025-07-23 01:39:25 -07:00
  • 23637dcdef [Docs] Fix bullets and grammars in tool_calling.md (#21440) Michael Yao 2025-07-23 16:23:20 +08:00
  • 6364af92f8 Fixed typo in profiling logs (#21441) Sergio Paniego Blanco 2025-07-23 10:18:54 +02:00
  • 7aaa2bd5a8 [Bugfix] ensure tool_choice is popped when tool_choice:null is passed in json payload (#19679) Guillaume Calmettes 2025-07-23 09:30:05 +02:00
  • 2f5c14de6a add clear messages for deprecated models (#21424) youkaichao 2025-07-23 15:03:16 +08:00
  • f002e9a870 [Cleanup] Only log MoE DP setup warning if DP is enabled (#21315) Michael Goin 2025-07-23 03:02:48 -04:00
  • a1f3610fc6 [Core] Add basic unit test for maybe_evict_cached_block (#21400) Jialin Ouyang 2025-07-23 00:02:02 -07:00
  • 4ecedd1806 [Bugfix] Fix nightly transformers CI failure (#21427) Isotr0py 2025-07-23 15:01:01 +08:00
  • 107111a859 Changing "amdproduction" allocation. (#21409) Alexei-V-Ivanov-AMD 2025-07-22 22:48:31 -05:00