biondizzle/vllm - vllm

Author	SHA1	Message	Date
Mor Zusman	9fb12f7848	[BugFix][Kernel] Fix Illegal memory access in causal_conv1d in H100 (#9838) Signed-off-by: mzusman <mor.zusmann@gmail.com>	2024-10-31 20:06:25 +00:00
wangshuai09	622b7ab955	[Hardware] using current_platform.seed_everything (#9785) Signed-off-by: wangshuai09 <391746016@qq.com>	2024-10-29 14:47:44 +00:00
youkaichao	32176fee73	[torch.compile] support moe models (#9632) Signed-off-by: youkaichao <youkaichao@gmail.com>	2024-10-27 21:58:04 -07:00
wangshuai09	4e2d95e372	[Hardware][ROCM] using current_platform.is_rocm (#9642) Signed-off-by: wangshuai09 <391746016@qq.com>	2024-10-28 04:07:00 +00:00
Mengqing Cao	5cbdccd151	[Hardware][openvino] is_openvino --> current_platform.is_openvino (#9716)	2024-10-26 10:59:06 +00:00
Charlie Fu	59449095ab	[Performance][Kernel] Fused_moe Performance Improvement (#9384) Signed-off-by: charlifu <charlifu@amd.com>	2024-10-24 15:37:52 -07:00
Jee Jee Li	295a061fb3	[Kernel] add kernel for FATReLU (#9610) Signed-off-by: Jee Jee Li <pandaleefree@gmail.com>	2024-10-24 16:18:27 +08:00
wangshuai09	3ddbe25502	[Hardware][CPU] using current_platform.is_cpu (#9536)	2024-10-22 00:50:43 -07:00
Chen Zhang	4fa3e33349	[Kernel] Support sliding window in flash attention backend (#9403)	2024-10-20 10:57:52 -07:00
bnellnm	eca2c5f7c0	[Bugfix] Fix support for dimension like integers and ScalarType (#9299)	2024-10-17 19:08:34 +00:00
Mor Zusman	fb60ae9b91	[Kernel][Model] Improve continuous batching for Jamba and Mamba (#9189)	2024-10-16 12:12:43 -04:00
Cyrus Leung	7e7eae338d	[Misc] Standardize RoPE handling for Qwen2-VL (#9250)	2024-10-16 13:56:17 +08:00
Tyler Michael Smith	7342a7d7f8	[Model] Support Mamba (#6484)	2024-10-11 15:40:06 +00:00
Lucas Wilkinson	a64e7b9407	[Bugfix] Machete garbage results for some models (large K dim) (#9212)	2024-10-10 14:16:17 +08:00
bnellnm	bd37b9fbe2	[Bugfix] Try to handle older versions of pytorch (#9086)	2024-10-08 14:28:12 -07:00
ElizaWszola	05d686432f	[Kernel] Zero point support in fused MarlinMoE kernel + AWQ Fused MoE (#8973) Co-authored-by: Dipika <dipikasikka1@gmail.com> Co-authored-by: Dipika Sikka <ds3822@columbia.edu>	2024-10-04 12:34:44 -06:00
youkaichao	9aaf14c62e	[misc] add forward context for attention (#9029)	2024-10-03 12:09:42 -07:00
Mor Zusman	f13a07b1f8	[Kernel][Model] Varlen prefill + Prefill chunking support for mamba kernels and Jamba model (#8533)	2024-09-29 17:35:58 -04:00
ElizaWszola	d081da0064	[Bugfix] Fix Marlin MoE act order when is_k_full == False (#8741) Co-authored-by: Tyler Michael Smith <tyler@neuralmagic.com>	2024-09-28 18:19:40 -07:00
youkaichao	a9b15c606f	[torch.compile] use empty tensor instead of None for profiling (#8875)	2024-09-27 08:11:32 -07:00
bnellnm	300da09177	[Kernel] Fullgraph and opcheck tests (#8479)	2024-09-25 08:35:52 -06:00
Lucas Wilkinson	86e9c8df29	[Kernel] (2/N) Machete - Integrate into CompressedTensorsWNA16 and GPTQMarlin (#7701) Co-authored-by: mgoin <michael@neuralmagic.com> Co-authored-by: Divakar Verma <137818590+divakar-amd@users.noreply.github.com> Co-authored-by: Tyler Michael Smith <tyler@neuralmagic.com>	2024-09-23 13:46:26 -04:00
Charlie Fu	9cc373f390	[Kernel][Amd] Add fp8 kv cache support for rocm custom paged attention (#8577)	2024-09-19 17:37:57 +00:00
Tyler Michael Smith	db9120cded	[Kernel] Change interface to Mamba selective_state_update for continuous batching (#8039)	2024-09-18 20:05:06 +00:00
Cyrus Leung	6ffa3f314c	[CI/Build] Avoid CUDA initialization (#8534)	2024-09-18 10:38:11 +00:00
Tyler Michael Smith	8110e44529	[Kernel] Change interface to Mamba causal_conv1d_update for continuous batching (#8012)	2024-09-17 23:44:27 +00:00
Simon Mo	546034b466	[refactor] remove triton based sampler (#8524)	2024-09-16 20:04:48 -07:00
Luka Govedič	5d73ae49d6	[Kernel] AQ AZP 3/4: Asymmetric quantization kernels (#7270)	2024-09-16 11:52:40 -07:00
ElizaWszola	a091e2da3e	[Kernel] Enable 8-bit weights in Fused Marlin MoE (#8032) Co-authored-by: Dipika <dipikasikka1@gmail.com>	2024-09-16 09:47:19 -06:00
Isotr0py	fc990f9795	[Bugfix][Kernel] Add `IQ1_M` quantization implementation to GGUF kernel (#8357)	2024-09-15 16:51:44 -06:00
Charlie Fu	1ef0d2efd0	[Kernel][Hardware][Amd]Custom paged attention kernel for rocm (#8310)	2024-09-13 17:01:11 -07:00
Cyrus Leung	a84e598e21	[CI/Build] Reorganize models tests (#7820)	2024-09-13 10:20:06 -07:00
bnellnm	73202dbe77	[Kernel][Misc] register ops to prevent graph breaks (#6917) Co-authored-by: Sage Moore <sage@neuralmagic.com>	2024-09-11 12:52:19 -07:00
Dipika Sikka	6cd5e5b07e	[Misc] Fused MoE Marlin support for GPTQ (#8217)	2024-09-09 23:02:52 -04:00
Elfie Guo	e39ebf5cf5	[Core/Bugfix] Add query dtype as per FlashInfer API requirements. (#8173)	2024-09-05 05:12:26 +00:00
Pavani Majety	6b3421567d	[Core][Kernels] Enable FP8 KV Cache with Flashinfer backend. + BugFix for kv_cache_dtype=auto (#7985) Co-authored-by: Simon Mo <simon.mo@hey.com> Co-authored-by: Cody Yu <hao.yu.cody@gmail.com>	2024-08-29 14:53:11 -04:00
youkaichao	ef99a78760	Revert "[Core][Kernels] Use FlashInfer backend for FP8 KV Cache when available." (#7982)	2024-08-28 21:27:06 -07:00
Mor Zusman	fdd9daafa3	[Kernel/Model] Migrate mamba_ssm and causal_conv1d kernels to vLLM (#7651)	2024-08-28 15:06:52 -07:00
rasmith	e5697d161c	[Kernel] [Triton] [AMD] Adding Triton implementations awq_dequantize and awq_gemm to support AWQ (#7386)	2024-08-28 15:37:47 -04:00
Pavani Majety	b98cc28f91	[Core][Kernels] Use FlashInfer backend for FP8 KV Cache when available. (#7798) Co-authored-by: Simon Mo <simon.mo@hey.com>	2024-08-28 10:01:22 -07:00
LI MOU	53328d7536	[BUG] fix crash on flashinfer backend with cudagraph disabled, when attention group_size not in [1,2,4,8] (#7509)	2024-08-21 08:54:31 -07:00
Lucas Wilkinson	5288c06aa0	[Kernel] (1/N) Machete - Hopper Optimized Mixed Precision Linear Kernel (#7174)	2024-08-20 07:09:33 -06:00
Charlie Fu	e837b624f2	[Feature][Hardware][Amd] Add fp8 Linear Layer for Rocm (#7210)	2024-08-16 10:06:30 -07:00
youkaichao	54bd9a03c4	register custom op for flash attn and use from torch.ops (#7536)	2024-08-15 22:38:56 -07:00
jon-chuang	50b8d08dbd	[Misc/Testing] Use `torch.testing.assert_close` (#7324)	2024-08-16 04:24:04 +00:00
jon-chuang	a046f86397	[Core/Bugfix] Add FP8 K/V Scale and dtype conversion for prefix/prefill Triton Kernel (#7208) Co-authored-by: Cody Yu <hao.yu.cody@gmail.com>	2024-08-12 22:47:41 +00:00
Luka Govedič	5fb4a3f678	[Bugfix][Kernel] Increased atol to fix failing tests (#7305)	2024-08-08 12:16:13 -04:00
afeldman-nm	fd95e026e0	[Core] Subclass ModelRunner to support cross-attention & encoder sequences (towards eventual encoder/decoder model support) (#4942) Co-authored-by: Andrew Feldman <afeld2012@gmail.com> Co-authored-by: Nick Hill <nickhill@us.ibm.com>	2024-08-06 16:51:47 -04:00
Luka Govedič	8d59dbb000	[Kernel] Add per-tensor and per-token AZP epilogues (#5941) Co-authored-by: Tyler Michael Smith <tyler@neuralmagic.com>	2024-08-06 18:17:08 +00:00
Lucas Wilkinson	a8d604ca2a	[Misc] Disambiguate quantized types via a new ScalarType (#6396)	2024-08-02 13:51:58 -07:00

330 Commits