no code implementations • 12 May 2024 • Sayeh Sharify, Zifei Xu, Wanzin Yazar, Xin Wang
Large Language Models (LLMs) have distinguished themselves with outstanding performance in complex language modeling tasks, yet they come with significant computational and storage challenges.
no code implementations • 14 Apr 2024 • Tian Jin, Wanzin Yazar, Zifei Xu, Sayeh Sharify, Xin Wang
We demonstrate that using this custom CUDA kernel improves the throughput of LLM inference by 28%.