[Attention][Perf][Kernel] Replace torch.cat with a vectorized CUDA kernel for MLA query concat - DeepSeek-V3.2 (#34917)

Signed-off-by: LopezCastroRoberto <rocastro@redhat.com>
Signed-off-by: Roberto L. Castro <38211239+LopezCastroRoberto@users.noreply.github.com>
This commit is contained in:
Roberto L. Castro
2026-03-09 17:50:36 +01:00
committed by GitHub
parent 2b28b9b269
commit 580864d81e
10 changed files with 415 additions and 15 deletions
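For context, the concat this commit optimizes joins the non-rotary and rotary MLA query components along the head dimension. A minimal sketch of the semantics, using numpy in place of torch and illustrative shapes/names (`q_nope`, `q_pe`, and the dimension sizes are assumptions for illustration, not taken from this diff):

```python
import numpy as np

# Hypothetical MLA query shapes (illustrative only):
# q_nope: [num_tokens, num_heads, nope_dim], q_pe: [num_tokens, num_heads, rope_dim]
num_tokens, num_heads, nope_dim, rope_dim = 4, 2, 128, 64

rng = np.random.default_rng(0)
q_nope = rng.standard_normal((num_tokens, num_heads, nope_dim), dtype=np.float32)
q_pe = rng.standard_normal((num_tokens, num_heads, rope_dim), dtype=np.float32)

# The slow path being replaced: a generic concat along the last dim
# (torch.cat in the original; np.concatenate stands in here).
q = np.concatenate([q_nope, q_pe], axis=-1)

# A fused kernel instead writes both halves directly into one
# preallocated output buffer, avoiding the intermediate copy:
q_fused = np.empty((num_tokens, num_heads, nope_dim + rope_dim), dtype=np.float32)
q_fused[..., :nope_dim] = q_nope
q_fused[..., nope_dim:] = q_pe

assert np.array_equal(q, q_fused)
print(q_fused.shape)  # → (4, 2, 192)
```

The dedicated CUDA kernel in this commit applies the same idea on-device with vectorized loads/stores, which is why it can beat the general-purpose `torch.cat`.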


@@ -8,8 +8,9 @@ steps:
     - csrc/
     - tests/kernels/core
     - tests/kernels/test_top_k_per_row.py
+    - tests/kernels/test_concat_mla_q.py
   commands:
-  - pytest -v -s kernels/core kernels/test_top_k_per_row.py
+  - pytest -v -s kernels/core kernels/test_top_k_per_row.py kernels/test_concat_mla_q.py
 - label: Kernels Attention Test %N
   timeout_in_minutes: 35