3 May 2024 • Shaoyuan Chen, Yutong Lin, Mingxing Zhang, Yongwei Wu
To enhance the efficiency and cost-effectiveness of LLM serving, we introduce the concept of attention offloading.
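A minimal sketch of the idea, under the assumption (not stated in this excerpt) that "attention offloading" means placing the memory-bound attention-over-KV-cache step on a separate, memory-optimized device while the compute-bound projections stay on the main accelerator; all function and variable names here are illustrative, not from the paper:

```python
import numpy as np

def attention(q, K, V):
    # Memory-bound step: streams the whole KV cache for each new token.
    # In an attention-offloading design, this part would run on the
    # cheaper memory-optimized device that holds K and V.
    scores = K @ q / np.sqrt(q.shape[0])
    w = np.exp(scores - scores.max())
    w /= w.sum()
    return w @ V

def decode_step(x, Wq, Wo, K, V):
    # Compute-bound projections stay on the high-end accelerator;
    # only the small query vector and the attention output would
    # cross the interconnect between the two devices.
    q = Wq @ x                  # accelerator side
    ctx = attention(q, K, V)    # offloaded side (owns the KV cache)
    return Wo @ ctx             # accelerator side
```

In this sketch the split point is deliberately the attention call: the tensors crossing it are O(d) per token, while the KV cache, which grows with sequence length, never leaves the memory device.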