Commit Graph

746 Commits

Author SHA1 Message Date
youkaichao
05d1f8c9c6 [misc] move functions to config.py (#10624)
Signed-off-by: youkaichao <youkaichao@gmail.com>
2024-11-25 09:27:30 +00:00
youkaichao
eebad39f26 [torch.compile] support all attention backends (#10558)
Signed-off-by: youkaichao <youkaichao@gmail.com>
2024-11-22 14:04:42 -08:00
Lucas Wilkinson
d200972e7f [Bugfix] Marlin 2:4 temp fix for large M dim (>256) (#10464)
Signed-off-by: Lucas Wilkinson <lwilkinson@neuralmagic.com>
2024-11-19 19:40:33 -08:00
ElizaWszola
b00b33d77e [Model][Quantization] HQQ support through Marlin kernel expansion (#9766)
Signed-off-by: ElizaWszola <eliza@neuralmagic.com>
2024-11-19 13:31:12 -08:00
Mengqing Cao
8c1fb50705 [Platform][Refactor] Extract func get_default_attn_backend to Platform (#10358)
Signed-off-by: Mengqing Cao <cmq0113@163.com>
2024-11-19 11:22:26 +08:00
Lucas Wilkinson
96d999fbe8 [Kernel] Initial Machete W4A8 support + Refactors (#9855)
Signed-off-by: Lucas Wilkinson <lwilkinson@neuralmagic.com>
2024-11-18 12:59:29 -07:00
ElizaWszola
79ee45b428 [Misc] Bump up test_fused_moe tolerance (#10364)
Signed-off-by: ElizaWszola <eliza@neuralmagic.com>
2024-11-15 16:31:18 +00:00
Luka Govedič
bf2ddc6610 [bugfix] Fix static asymmetric quantization case (#10334)
Signed-off-by: Daniël de Kok <me@danieldk.eu>
Signed-off-by: luka <luka@neuralmagic.com>
Co-authored-by: Daniël de Kok <me@danieldk.eu>
2024-11-15 09:35:11 +08:00
rasmith
127c07480e [Kernel][Triton] Add Triton implementation for scaled_mm_triton to support fp8 and int8 SmoothQuant, symmetric case (#9857)
Signed-off-by: Randall Smith <Randall.Smith@amd.com>
2024-11-08 19:59:22 -05:00
Luka Govedič
4f93dfe952 [torch.compile] Fuse RMSNorm with quant (#9138)
Signed-off-by: luka <luka@neuralmagic.com>
Co-authored-by: youkaichao <youkaichao@126.com>
2024-11-08 21:20:08 +00:00
Joe Runde
d58268c56a [V1] Make v1 more testable (#9888)
Signed-off-by: Joe Runde <Joseph.Runde@ibm.com>
2024-11-06 11:57:35 -08:00
Aaron Pham
21063c11c7 [CI/Build] drop support for Python 3.8 EOL (#8464)
Signed-off-by: Aaron Pham <contact@aarnphm.xyz>
2024-11-06 07:11:55 +00:00
Michael Goin
235366fe2e [CI] Prune back the number of tests in tests/kernels/* (#9932)
Signed-off-by: mgoin <michael@neuralmagic.com>
2024-11-05 16:02:32 -05:00
sroy745
a78dd3303e [Encoder Decoder] Add flash_attn kernel support for encoder-decoder models (#9559) 2024-11-01 23:22:49 -07:00
Peter Salas
6c0b7f548d [Core][VLM] Add precise multi-modal placeholder tracking (#8346)
Signed-off-by: Peter Salas <peter@fixie.ai>
2024-11-01 16:21:10 -07:00
Pavani Majety
598b6d7b07 [Bugfix/Core] Flashinfer k_scale and v_scale (#9861) 2024-11-01 12:15:05 -07:00
Mor Zusman
9fb12f7848 [BugFix][Kernel] Fix Illegal memory access in causal_conv1d in H100 (#9838)
Signed-off-by: mzusman <mor.zusmann@gmail.com>
2024-10-31 20:06:25 +00:00
wangshuai09
622b7ab955 [Hardware] using current_platform.seed_everything (#9785)
Signed-off-by: wangshuai09 <391746016@qq.com>
2024-10-29 14:47:44 +00:00
youkaichao
32176fee73 [torch.compile] support moe models (#9632)
Signed-off-by: youkaichao <youkaichao@gmail.com>
2024-10-27 21:58:04 -07:00
wangshuai09
4e2d95e372 [Hardware][ROCM] using current_platform.is_rocm (#9642)
Signed-off-by: wangshuai09 <391746016@qq.com>
2024-10-28 04:07:00 +00:00
Mengqing Cao
5cbdccd151 [Hardware][openvino] is_openvino --> current_platform.is_openvino (#9716) 2024-10-26 10:59:06 +00:00
Charlie Fu
59449095ab [Performance][Kernel] Fused_moe Performance Improvement (#9384)
Signed-off-by: charlifu <charlifu@amd.com>
2024-10-24 15:37:52 -07:00
Jee Jee Li
295a061fb3 [Kernel] add kernel for FATReLU (#9610)
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com>
2024-10-24 16:18:27 +08:00
wangshuai09
3ddbe25502 [Hardware][CPU] using current_platform.is_cpu (#9536) 2024-10-22 00:50:43 -07:00
Chen Zhang
4fa3e33349 [Kernel] Support sliding window in flash attention backend (#9403) 2024-10-20 10:57:52 -07:00
bnellnm
eca2c5f7c0 [Bugfix] Fix support for dimension like integers and ScalarType (#9299) 2024-10-17 19:08:34 +00:00
Mor Zusman
fb60ae9b91 [Kernel][Model] Improve continuous batching for Jamba and Mamba (#9189) 2024-10-16 12:12:43 -04:00
Cyrus Leung
7e7eae338d [Misc] Standardize RoPE handling for Qwen2-VL (#9250) 2024-10-16 13:56:17 +08:00
Tyler Michael Smith
7342a7d7f8 [Model] Support Mamba (#6484) 2024-10-11 15:40:06 +00:00
Lucas Wilkinson
a64e7b9407 [Bugfix] Machete garbage results for some models (large K dim) (#9212) 2024-10-10 14:16:17 +08:00
bnellnm
bd37b9fbe2 [Bugfix] Try to handle older versions of pytorch (#9086) 2024-10-08 14:28:12 -07:00
ElizaWszola
05d686432f [Kernel] Zero point support in fused MarlinMoE kernel + AWQ Fused MoE (#8973)
Co-authored-by: Dipika <dipikasikka1@gmail.com>
Co-authored-by: Dipika Sikka <ds3822@columbia.edu>
2024-10-04 12:34:44 -06:00
youkaichao
9aaf14c62e [misc] add forward context for attention (#9029) 2024-10-03 12:09:42 -07:00
Mor Zusman
f13a07b1f8 [Kernel][Model] Varlen prefill + Prefill chunking support for mamba kernels and Jamba model (#8533) 2024-09-29 17:35:58 -04:00
ElizaWszola
d081da0064 [Bugfix] Fix Marlin MoE act order when is_k_full == False (#8741)
Co-authored-by: Tyler Michael Smith <tyler@neuralmagic.com>
2024-09-28 18:19:40 -07:00
youkaichao
a9b15c606f [torch.compile] use empty tensor instead of None for profiling (#8875) 2024-09-27 08:11:32 -07:00
bnellnm
300da09177 [Kernel] Fullgraph and opcheck tests (#8479) 2024-09-25 08:35:52 -06:00
Lucas Wilkinson
86e9c8df29 [Kernel] (2/N) Machete - Integrate into CompressedTensorsWNA16 and GPTQMarlin (#7701)
Co-authored-by: mgoin <michael@neuralmagic.com>
Co-authored-by: Divakar Verma <137818590+divakar-amd@users.noreply.github.com>
Co-authored-by: Tyler Michael Smith <tyler@neuralmagic.com>
2024-09-23 13:46:26 -04:00
Charlie Fu
9cc373f390 [Kernel][Amd] Add fp8 kv cache support for rocm custom paged attention (#8577) 2024-09-19 17:37:57 +00:00
Tyler Michael Smith
db9120cded [Kernel] Change interface to Mamba selective_state_update for continuous batching (#8039) 2024-09-18 20:05:06 +00:00
Cyrus Leung
6ffa3f314c [CI/Build] Avoid CUDA initialization (#8534) 2024-09-18 10:38:11 +00:00
Tyler Michael Smith
8110e44529 [Kernel] Change interface to Mamba causal_conv1d_update for continuous batching (#8012) 2024-09-17 23:44:27 +00:00
Simon Mo
546034b466 [refactor] remove triton based sampler (#8524) 2024-09-16 20:04:48 -07:00
Luka Govedič
5d73ae49d6 [Kernel] AQ AZP 3/4: Asymmetric quantization kernels (#7270) 2024-09-16 11:52:40 -07:00
ElizaWszola
a091e2da3e [Kernel] Enable 8-bit weights in Fused Marlin MoE (#8032)
Co-authored-by: Dipika <dipikasikka1@gmail.com>
2024-09-16 09:47:19 -06:00
Isotr0py
fc990f9795 [Bugfix][Kernel] Add IQ1_M quantization implementation to GGUF kernel (#8357) 2024-09-15 16:51:44 -06:00
Charlie Fu
1ef0d2efd0 [Kernel][Hardware][Amd]Custom paged attention kernel for rocm (#8310) 2024-09-13 17:01:11 -07:00
Cyrus Leung
a84e598e21 [CI/Build] Reorganize models tests (#7820) 2024-09-13 10:20:06 -07:00
bnellnm
73202dbe77 [Kernel][Misc] register ops to prevent graph breaks (#6917)
Co-authored-by: Sage Moore <sage@neuralmagic.com>
2024-09-11 12:52:19 -07:00
Dipika Sikka
6cd5e5b07e [Misc] Fused MoE Marlin support for GPTQ (#8217) 2024-09-09 23:02:52 -04:00