1 code implementation • 7 May 2024 • Yujun Lin, Haotian Tang, Shang Yang, Zhekai Zhang, Guangxuan Xiao, Chuang Gan, Song Han
The key insight driving QServe is that the efficiency of LLM serving on GPUs is critically influenced by operations on low-throughput CUDA cores.
no code implementations • 17 Dec 2020 • Hanrui Wang, Zhekai Zhang, Song Han
Inspired by the high redundancy of human languages, we propose the novel cascade token pruning to prune away unimportant tokens in the sentence.
1 code implementation • ICLR 2020 • Han Cai, Chuang Gan, Tianzhe Wang, Zhekai Zhang, Song Han
Most of the traditional approaches either manually design or use neural architecture search (NAS) to find a specialized neural network and train it from scratch for each case, which is computationally expensive and unscalable.
no code implementations • 20 Feb 2020 • Zhekai Zhang, Hanrui Wang, Song Han, William J. Dally
We then propose a condensed matrix representation that reduces the number of partial matrices by three orders of magnitude and thus reduces DRAM access by 5.4x.
Hardware Architecture • Distributed, Parallel, and Cluster Computing
10 code implementations • 26 Aug 2019 • Han Cai, Chuang Gan, Tianzhe Wang, Zhekai Zhang, Song Han
On diverse edge devices, OFA consistently outperforms state-of-the-art (SOTA) NAS methods (up to 4.0% ImageNet top1 accuracy improvement over MobileNetV3, or same accuracy but 1.5x faster than MobileNetV3, 2.6x faster than EfficientNet w.r.t. measured latency) while reducing many orders of magnitude GPU hours and $CO_2$ emission.
Ranked #76 on Neural Architecture Search on ImageNet
no code implementations • 13 Jan 2018 • Wenda Qiu, Yueyang Xianzang, Zhekai Zhang
This paper proposes a method for evaluating it.