[Core/Bugfix] Add FP8 K/V Scale and dtype conversion for prefix/prefill Triton Kernel (#7208)

Co-authored-by: Cody Yu <hao.yu.cody@gmail.com>
2024-08-12 15:47:41 -07:00
parent 4ddc4743d7
commit a046f86397
10 changed files with 208 additions and 47 deletions
--- a/docs/source/quantization/fp8_e5m2_kvcache.rst
+++ b/docs/source/quantization/fp8_e5m2_kvcache.rst
@@ -32,5 +32,3 @@ Here is an example of how to enable this feature:
        print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")


-Note, current prefix caching doesn't work with FP8 KV cache enabled, forward_prefix kernel should handle different KV and cache type.
-