Commit Graph

9263 Commits

Author SHA1 Message Date
Amit Garg
18ae428a0d [Bugfix] Fix Phi3.5 mini and MoE LoRA inference (#8571) 2024-09-20 08:54:02 +08:00
bnellnm
de6f90a13d [Misc] guard against change in cuda library name (#8609) 2024-09-20 06:36:30 +08:00
Alexey Kondratiev(AMD)
6cb748e190 [CI/Build] Re-enabling Entrypoints tests on ROCm, excluding ones that fail (#8551) 2024-09-19 13:06:32 -07:00
Simon Mo
9e99407e3c Create SECURITY.md (#8642) 2024-09-19 12:16:28 -07:00
Isotr0py
ea4647b7d7 [Doc] Add documentation for GGUF quantization (#8618) 2024-09-19 13:15:55 -06:00
盏一
e42c634acb [Core] simplify logits resort in _apply_top_k_top_p (#8619) 2024-09-19 18:28:25 +00:00
Charlie Fu
9cc373f390 [Kernel][Amd] Add fp8 kv cache support for rocm custom paged attention (#8577) 2024-09-19 17:37:57 +00:00
Nick Hill
76515f303b [Frontend] Use MQLLMEngine for embeddings models too (#8584) 2024-09-19 12:51:06 -04:00
Kunshang Ji
855c8ae2c9 [MISC] remove engine_use_ray in benchmark_throughput.py (#8615) 2024-09-18 22:33:20 -07:00
Kuntai Du
c52ec5f034 [Bugfix] fixing sonnet benchmark bug in benchmark_serving.py (#8616) 2024-09-19 05:24:24 +00:00
Roger Wang
02c9afa2d0 Revert "[Misc][Bugfix] Disable guided decoding for mistral tokenizer" (#8593) 2024-09-19 04:14:28 +00:00
sroy745
3118f63385 [Bugfix] [Encoder-Decoder] Bugfix for encoder specific metadata construction during decode of encoder-decoder models. (#8545) 2024-09-19 02:24:15 +00:00
Tyler Michael Smith
4c34ce8916 [Kernel] Remove marlin moe templating on thread_m_blocks (#8573)
Co-authored-by: lwilkinson@neuralmagic.com
2024-09-19 01:42:49 +00:00
Joe Runde
0d47bf3bf4 [Bugfix] add dead_error property to engine client (#8574)
Signed-off-by: Joe Runde <Joseph.Runde@ibm.com>
2024-09-18 22:10:01 +00:00
Nick Hill
d9cd78eb71 [BugFix] Nonzero exit code if MQLLMEngine startup fails (#8572) 2024-09-18 20:17:55 +00:00
Tyler Michael Smith
db9120cded [Kernel] Change interface to Mamba selective_state_update for continuous batching (#8039) 2024-09-18 20:05:06 +00:00
Gregory Shtrasberg
b3195bc9e4 [AMD][ROCm]Quantization methods on ROCm; Fix _scaled_mm call (#8380)
Co-authored-by: Alexei-V-Ivanov-AMD <156011006+Alexei-V-Ivanov-AMD@users.noreply.github.com>
Co-authored-by: Michael Goin <michael@neuralmagic.com>
2024-09-18 10:41:08 -07:00
Geun, Lim
e18749ff09 [Model] Support Solar Model (#8386)
Co-authored-by: Michael Goin <michael@neuralmagic.com>
2024-09-18 11:04:00 -06:00
Russell Bryant
d65798f78c [Core] zmq: bind only to 127.0.0.1 for local-only usage (#8543)
Signed-off-by: Russell Bryant <rbryant@redhat.com>
2024-09-18 16:10:27 +00:00
afeldman-nm
a8c1d161a7 [Core] *Prompt* logprobs support in Multi-step (#8199) 2024-09-18 08:38:43 -07:00
Alexander Matveev
7c7714d856 [Core][Bugfix][Perf] Introduce MQLLMEngine to avoid asyncio OH (#8157)
Co-authored-by: Nick Hill <nickhill@us.ibm.com>
Co-authored-by: rshaw@neuralmagic.com <rshaw@neuralmagic.com>
Co-authored-by: Robert Shaw <114415538+robertgshaw2-neuralmagic@users.noreply.github.com>
Co-authored-by: Simon Mo <simon.mo@hey.com>
2024-09-18 13:56:58 +00:00
Aaron Pham
9d104b5beb [CI/Build] Update Ruff version (#8469)
Signed-off-by: Aaron Pham <contact@aarnphm.xyz>
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com>
2024-09-18 11:00:56 +00:00
Cyrus Leung
6ffa3f314c [CI/Build] Avoid CUDA initialization (#8534) 2024-09-18 10:38:11 +00:00
Jiaxin Shan
e351572900 [Misc] Add argument to disable FastAPI docs (#8554) 2024-09-18 09:51:59 +00:00
Daniele
95965d31b6 [CI/Build] fix Dockerfile.cpu on podman (#8540) 2024-09-18 10:49:53 +08:00
Tyler Michael Smith
8110e44529 [Kernel] Change interface to Mamba causal_conv1d_update for continuous batching (#8012) 2024-09-17 23:44:27 +00:00
Alexey Kondratiev(AMD)
09deb4721f [CI/Build] Excluding kernels/test_gguf.py from ROCm (#8520) 2024-09-17 16:40:29 -07:00
youkaichao
fa0c114fad [doc] improve installation doc (#8550)
Co-authored-by: Andy Dai <76841985+Imss27@users.noreply.github.com>
2024-09-17 16:24:06 -07:00
Joe Runde
98f9713399 [Bugfix] Fix TP > 1 for new granite (#8544)
Signed-off-by: Joe Runde <Joseph.Runde@ibm.com>
2024-09-17 23:17:08 +00:00
Nick Hill
56c3de018c [Misc] Don't dump contents of kvcache tensors on errors (#8527) 2024-09-17 12:24:29 -07:00
Patrick von Platen
a54ed80249 [Model] Add mistral function calling format to all models loaded with "mistral" format (#8515)
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com>
2024-09-17 17:50:37 +00:00
chenqianfzh
9855b99502 [Feature][kernel] tensor parallelism with bitsandbytes quantization (#8434) 2024-09-17 08:09:12 -07:00
sroy745
1009e93c5d [Encoder decoder] Add cuda graph support during decoding for encoder-decoder models (#7631) 2024-09-17 07:35:01 -07:00
Isotr0py
1b6de8352b [Benchmark] Support sample from HF datasets and image input for benchmark_serving (#8495) 2024-09-17 07:34:27 +00:00
Rui Qiao
cbdb252259 [Misc] Limit to ray[adag] 2.35 to avoid backward incompatible change (#8509)
Signed-off-by: Rui Qiao <ruisearch42@gmail.com>
2024-09-17 00:06:26 -07:00
youkaichao
99aa4eddaf [torch.compile] register allreduce operations as custom ops (#8526) 2024-09-16 22:57:57 -07:00
Roger Wang
ee2bceaaa6 [Misc][Bugfix] Disable guided decoding for mistral tokenizer (#8521) 2024-09-16 22:22:45 -07:00
Alex Brooks
1c1bb388e0 [Frontend] Improve Nullable kv Arg Parsing (#8525)
Signed-off-by: Alex-Brooks <Alex.Brooks@ibm.com>
2024-09-17 04:17:32 +00:00
Simon Mo
546034b466 [refactor] remove triton based sampler (#8524) 2024-09-16 20:04:48 -07:00
Joe Runde
cca61642e0 [Bugfix] Fix 3.12 builds on main (#8510)
Signed-off-by: Joe Runde <Joseph.Runde@ibm.com>
2024-09-17 00:01:45 +00:00
Simon Mo
5ce45eb54d [misc] small qol fixes for release process (#8517) 2024-09-16 15:11:27 -07:00
Simon Mo
5478c4b41f [perf bench] set timeout to debug hanging (#8516) 2024-09-16 14:30:02 -07:00
Kevin Lin
47f5e03b5b [Bugfix] Bind api server port before starting engine (#8491) 2024-09-16 13:56:28 -07:00
youkaichao
2759a43a26 [doc] update doc on testing and debugging (#8514) 2024-09-16 12:10:23 -07:00
Luka Govedič
5d73ae49d6 [Kernel] AQ AZP 3/4: Asymmetric quantization kernels (#7270) 2024-09-16 11:52:40 -07:00
sasha0552
781e3b9a42 [Bugfix][Kernel] Fix build for sm_60 in GGUF kernel (#8506) 2024-09-16 12:15:57 -06:00
Nick Hill
acd5511b6d [BugFix] Fix clean shutdown issues (#8492) 2024-09-16 09:33:46 -07:00
lewtun
837c1968f9 [Frontend] Expose revision arg in OpenAI server (#8501) 2024-09-16 15:55:26 +00:00
ElizaWszola
a091e2da3e [Kernel] Enable 8-bit weights in Fused Marlin MoE (#8032)
Co-authored-by: Dipika <dipikasikka1@gmail.com>
2024-09-16 09:47:19 -06:00
Isotr0py
fc990f9795 [Bugfix][Kernel] Add IQ1_M quantization implementation to GGUF kernel (#8357) 2024-09-15 16:51:44 -06:00