biondizzle/vllm - vllm

Author	SHA1	Message	Date
Tyler Michael Smith	eb5741ad42	[Kernel][Quantization] Integrate block-quantized CUTLASS kernels for DeepSeekV3 (#12587 ) Integrates the block-quantized kernels introduced in https://github.com/vllm-project/vllm/pull/11868 for use in linear layers. Signed-off-by: Tyler Michael Smith <tyler@neuralmagic.com>	2025-01-31 15:29:11 -08:00
Robert Shaw	325f679f32	[BugFix] Fix Torch.Compile For DeepSeek (#12594 ) Co-authored-by: simon-mo <xmo@berkeley.edu>	2025-01-31 12:06:39 -08:00
Dipika Sikka	eb5cb5e528	[BugFix] Fix parameter names and `process_after_weight_loading` for W4A16 MoE Group Act Order (#11528 ) Signed-off-by: ElizaWszola <eliza@neuralmagic.com> Co-authored-by: ElizaWszola <eliza@neuralmagic.com> Co-authored-by: Michael Goin <michael@neuralmagic.com>	2025-01-23 21:40:33 +00:00
Gregory Shtrasberg	b5b57e301e	[AMD][FP8] Using MI300 FP8 format on ROCm for block_quant (#12134 ) Signed-off-by: Gregory Shtrasberg <Gregory.Shtrasberg@amd.com>	2025-01-17 17:12:26 +00:00
Elfie Guo	fa0050db08	[Core] Default to using per_token quantization for fp8 when cutlass is supported. (#8651 ) Signed-off-by: mgoin <michael@neuralmagic.com> Co-authored-by: Michael Goin <mgoin@redhat.com> Co-authored-by: mgoin <michael@neuralmagic.com>	2025-01-16 04:31:27 +00:00
Cyrus Leung	d848800e88	[Misc] Move `print_*_once` from utils to logger (#11298 ) Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk> Signed-off-by: Maxime Fournioux <55544262+mfournioux@users.noreply.github.com> Co-authored-by: Maxime Fournioux <55544262+mfournioux@users.noreply.github.com>	2025-01-09 12:48:12 +08:00
Li, Jiang	5dbf854553	[CI/Build][CPU] Fix CPU CI by lazy importing triton FP8 kernels (#11618 ) Signed-off-by: jiang1.li <jiang1.li@intel.com>	2024-12-30 10:17:04 +00:00
Robert Shaw	2339d59f92	[BugFix] Fix quantization for all other methods (#11547 )	2024-12-26 22:23:29 -08:00
Simon Mo	f49777ba62	Deepseek v3 (#11502 ) Signed-off-by: mgoin <michael@neuralmagic.com> Co-authored-by: mgoin <michael@neuralmagic.com> Co-authored-by: robertgshaw2-neuralmagic <rshaw@neuralmagic.com>	2024-12-26 16:09:44 -08:00
Michael Goin	2072924d14	[Model] [Quantization] Support deepseek_v3 w8a8 fp8 block-wise quantization (#11523 ) Signed-off-by: mgoin <michael@neuralmagic.com> Signed-off-by: simon-mo <simon.mo@hey.com> Signed-off-by: simon-mo <xmo@berkeley.edu> Co-authored-by: simon-mo <simon.mo@hey.com> Co-authored-by: simon-mo <xmo@berkeley.edu> Co-authored-by: HandH1998 <1335248067@qq.com>	2024-12-26 15:33:30 -08:00
Michael Goin	399c798608	Remove ScaledActivation for AWQ (#10057 ) Signed-off-by: mgoin <michael@neuralmagic.com>	2024-11-06 14:27:06 +00:00
wangshuai09	4e2d95e372	[Hardware][ROCM] using current_platform.is_rocm (#9642 ) Signed-off-by: wangshuai09 <391746016@qq.com>	2024-10-28 04:07:00 +00:00
Cyrus Leung	6ffa3f314c	[CI/Build] Avoid CUDA initialization (#8534 )	2024-09-18 10:38:11 +00:00
Wenxiang	1248e8506a	[Model] Adding support for MSFT Phi-3.5-MoE (#7729 ) Co-authored-by: Your Name <you@example.com> Co-authored-by: Zeqi Lin <zelin@microsoft.com> Co-authored-by: Zeqi Lin <Zeqi.Lin@microsoft.com>	2024-08-30 13:42:57 -06:00
Dipika Sikka	fc911880cc	[Kernel] Expand MoE weight loading + Add Fused Marlin MoE Kernel (#7766 ) Co-authored-by: ElizaWszola <eliza@neuralmagic.com>	2024-08-27 15:07:09 -07:00
Dipika Sikka	955b5191c9	[Misc] update fp8 to use `vLLMParameter` (#7437 )	2024-08-22 08:36:18 -04:00
Michael Goin	aae74ef95c	Revert "[Kernel] Expand MoE weight loading + Add Fused Marlin MoE Kernel (#7527 )" (#7764 )	2024-08-22 03:42:14 +00:00
Dipika Sikka	8678a69ab5	[Kernel] Expand MoE weight loading + Add Fused Marlin MoE Kernel (#7527 ) Co-authored-by: ElizaWszola <eliza@neuralmagic.com>	2024-08-21 16:17:10 -07:00
Mor Zusman	7fc23be81c	[Kernel] W8A16 Int8 inside FusedMoE (#7415 )	2024-08-16 10:06:51 -07:00
Charlie Fu	e837b624f2	[Feature][Hardware][Amd] Add fp8 Linear Layer for Rocm (#7210 )	2024-08-16 10:06:30 -07:00
Dipika Sikka	d3bdfd3ab9	[Misc] Update Fused MoE weight loading (#7334 )	2024-08-13 14:57:45 -04:00
Michael Goin	5223199e03	[Bugfix][FP8] Fix dynamic FP8 Marlin quantization (#7219 )	2024-08-07 11:23:12 -07:00
Thomas Parnell	9a7e2d0534	[Bugfix] Allow vllm to still work if triton is not installed. (#6786 ) Signed-off-by: Thomas Parnell <tpa@zurich.ibm.com>	2024-07-29 14:51:27 -07:00
Robert Shaw	889da130e7	[ Misc ] `fp8-marlin` channelwise via `compressed-tensors` (#6524 ) Co-authored-by: mgoin <michael@neuralmagic.com>	2024-07-25 09:46:04 -07:00
Michael Goin	0eb0757bef	[Misc] Add ignored layers for `fp8` quantization (#6657 )	2024-07-23 14:04:04 -04:00
Michael Goin	9e0b558a09	[Misc] Support FP8 kv cache scales from compressed-tensors (#6528 )	2024-07-23 04:11:50 +00:00
Robert Shaw	683e3cb9c4	[ Misc ] `fbgemm` checkpoints (#6559 )	2024-07-20 09:36:57 -07:00
Robert Shaw	4cc24f01b1	[ Kernel ] Enable Dynamic Per Token `fp8` (#6547 )	2024-07-19 23:08:15 +00:00
Michael Goin	978aed5300	[Kernel][Attention] Separate `Attention.kv_scale` into `k_scale` and `v_scale` (#6081 )	2024-07-16 15:31:32 -07:00
Robert Shaw	fb6af8bc08	[ Misc ] Apply MoE Refactor to Deepseekv2 To Support Fp8 (#6417 )	2024-07-13 20:03:58 -07:00
Robert Shaw	b675069d74	[ Misc ] Refactor Marlin Python Utilities (#6082 ) Co-authored-by: Robert Shaw <rshaw@neuralmagic.com>	2024-07-11 15:40:11 +00:00
Robert Shaw	abfe705a02	[ Misc ] Support Fp8 via `llm-compressor` (#6110 ) Co-authored-by: Robert Shaw <rshaw@neuralmagic>	2024-07-07 20:42:11 +00:00
Michael Goin	47f0954af0	[Kernel] Expand FP8 support to Ampere GPUs using FP8 Marlin (#5975 )	2024-07-03 17:38:00 +00:00
youkaichao	482045ee77	[hardware][misc] introduce platform abstraction (#6080 )	2024-07-02 20:12:22 -07:00
Robert Shaw	7c008c51a9	[ Misc ] Refactor MoE to isolate Fp8 From Mixtral (#5970 ) Co-authored-by: Robert Shaw <rshaw@neuralmagic> Co-authored-by: Michael Goin <michael@neuralmagic.com>	2024-07-02 21:54:35 +00:00
youkaichao	614aa51203	[misc][cuda] use nvml to avoid accidentally cuda initialization (#6007 )	2024-06-30 20:07:34 -07:00
Robert Shaw	af9ad46fca	[ Misc ] Refactor w8a8 to use `process_weights_after_load` (Simplify Weight Loading) (#5940 ) Co-authored-by: Robert Shaw <rshaw@neuralmagic>	2024-06-30 23:06:27 +00:00
Robert Shaw	2cd402e169	[ Bugfix ] Enabling Loading Models With Fused QKV/MLP on Disk with FP8 (#5921 ) Co-authored-by: Robert Shaw <rshaw@neuralmagic>	2024-06-28 18:43:49 +00:00
Tyler Michael Smith	3f3b6b2150	[Bugfix] Fix the CUDA version check for FP8 support in the CUTLASS kernels (#5715 )	2024-06-20 18:36:10 +00:00
Tyler Michael Smith	703475f6c2	[Kernel] Fix CUTLASS 3.x custom broadcast load epilogue (#5516 )	2024-06-14 09:30:15 -07:00
Tyler Michael Smith	e38042d4af	[Kernel] Disable CUTLASS kernels for fp8 (#5505 )	2024-06-13 13:38:05 -07:00
Tyler Michael Smith	85657b5607	[Kernel] Factor out epilogues from cutlass kernels (#5391 ) Co-authored-by: Michael Goin <michael@neuralmagic.com> Co-authored-by: youkaichao <youkaichao@gmail.com> Co-authored-by: zifeitong <zifei.tong@parasail.io> Co-authored-by: Robert Shaw <114415538+robertgshaw2-neuralmagic@users.noreply.github.com>	2024-06-13 11:22:19 -07:00
Michael Goin	c09dade2a2	[Misc][Breaking] Change FP8 checkpoint format from act_scale -> input_scale (#5353 )	2024-06-08 13:54:05 -04:00
Cheng Li	e69ded7d1c	[Bug Fix] Fix the support check for FP8 CUTLASS (#5352 ) Bug description: With torch 2.4.0.dev20240603+cu121, cutlass_fp8_supported outputs False, and the (capability, version) before the comparison is (90, 11111111112) This PR fixes the support check for FP8 CUTLASS ( cutlass_fp8_supported) which was introduced in https://github.com/vllm-project/vllm/pull/5183.	2024-06-08 00:42:05 +00:00
Tyler Michael Smith	8d75fe48ca	[Kernel] Switch fp8 layers to use the CUTLASS kernels (#5183 ) Switching from torch._scaled_mm to vLLM's cutlass fp8 kernels when supported as we are seeing 5-15% improvement in e2e performance on neuralmagic/Meta-Llama-3-8B-Instruct-FP8 see https://docs.google.com/spreadsheets/d/1GiAnmzyGHgZ6zL_LDSTm35Bdrt4A8AaFEurDlISYYA4/ for some quick e2e benchmarks and #5144 for comparisons across different GEMM sizes.	2024-06-07 08:42:35 +00:00
Cody Yu	5563a4dea8	[Model] Correct Mixtral FP8 checkpoint loading (#5231 )	2024-06-05 10:58:50 -07:00
Cody Yu	a3a73ab069	[Misc] Load FP8 kv-cache scaling factors from checkpoints (#4893 ) The 2nd PR for #4532. This PR supports loading FP8 kv-cache scaling factors from a FP8 checkpoint (with .kv_scale parameter).	2024-05-22 13:28:20 -07:00
Philipp Moritz	379da6dcb5	[Kernel] [FP8] Improve FP8 linear layer performance (#4691 ) This PR improves the FP8 performance of linear layers, which had been lacking before (#4118 (comment) and #4118 (comment)). We noticed that CUBLASLt can find a better algorithm if the first dimension of the matrix is greater than 16. So this PR enlarges matrices appropriately during quantization. This improves FP8 performance and removes the performance regression vs. FP16, in many cases exceeding FP16 performance. Here are benchmarks on llama3 70b (ITL numbers for 1000 input and 50 output tokens at fixed qps and at TP 4), all FP8 measurements are for dynamic quantization: qps = 1: 24 ms (FP8, this PR), 32 ms (FP8, previous main), 26 ms (FP16) qps = 2: 26 ms (FP8, this PR), 34ms (FP8, previous main), 28 ms (FP16) qps = 4: 33 ms (FP8, this PR), 44 ms (FP8, previous main), 36 ms (FP16) qps = 6: 46 ms (FP8, this PR), 56 ms (FP8, previous main), 54 ms (FP16) qps = 8: 85 ms (FP8, this PR), 85 ms (FP8, previous main), 138 ms (FP16)	2024-05-09 16:38:07 -07:00
Robert Shaw	111815d482	[Kernel] Support Fp8 Checkpoints (Dynamic + Static) (#4332 ) Co-authored-by: Philipp Moritz <pcmoritz@gmail.com> Co-authored-by: Woosuk Kwon <woosuk.kwon@berkeley.edu> Co-authored-by: mgoin <michael@neuralmagic.com> Co-authored-by: Tyler Michael Smith <tyler@neuralmagic.com> Co-authored-by: Cody Yu <hao.yu.cody@gmail.com>	2024-04-30 21:46:12 +00:00
Philipp Moritz	12628d3c78	[Kernel] Optimize FP8 support for MoE kernel / Mixtral via static scales (#4343 ) Co-authored-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>	2024-04-27 04:49:59 +00:00

54 Commits