Search Results for author: Zhekai Zhang

Found 6 papers, 3 papers with code

QServe: W4A8KV4 Quantization and System Co-design for Efficient LLM Serving

1 code implementation • 7 May 2024 • Yujun Lin, Haotian Tang, Shang Yang, Zhekai Zhang, Guangxuan Xiao, Chuang Gan, Song Han

The key insight driving QServe is that the efficiency of LLM serving on GPUs is critically influenced by operations on low-throughput CUDA cores.

Language Modelling • Large Language Model • +1

144

SpAtten: Efficient Sparse Attention Architecture with Cascade Token and Head Pruning

no code implementations • 17 Dec 2020 • Hanrui Wang, Zhekai Zhang, Song Han

Inspired by the high redundancy of human languages, we propose the novel cascade token pruning to prune away unimportant tokens in the sentence.

Quantization • Sentence

Once for All: Train One Network and Specialize it for Efficient Deployment

1 code implementation • ICLR 2020 • Han Cai, Chuang Gan, Tianzhe Wang, Zhekai Zhang, Song Han

Most of the traditional approaches either manually design or use neural architecture search (NAS) to find a specialized neural network and train it from scratch for each case, which is computationally expensive and unscalable.

Neural Architecture Search

1,840

SpArch: Efficient Architecture for Sparse Matrix Multiplication

no code implementations • 20 Feb 2020 • Zhekai Zhang, Hanrui Wang, Song Han, William J. Dally

We then propose a condensed matrix representation that reduces the number of partial matrices by three orders of magnitude and thus reduces DRAM access by 5.4x.

Hardware Architecture • Distributed, Parallel, and Cluster Computing

Once-for-All: Train One Network and Specialize it for Efficient Deployment

10 code implementations • 26 Aug 2019 • Han Cai, Chuang Gan, Tianzhe Wang, Zhekai Zhang, Song Han

On diverse edge devices, OFA consistently outperforms state-of-the-art (SOTA) NAS methods (up to 4.0% ImageNet top1 accuracy improvement over MobileNetV3, or same accuracy but 1.5x faster than MobileNetV3, 2.6x faster than EfficientNet w.r.t. measured latency) while reducing many orders of magnitude GPU hours and $CO_2$ emission.

Ranked #76 on Neural Architecture Search on ImageNet

Neural Architecture Search

1,840

Benchmark Visual Question Answer Models by using Focus Map

no code implementations • 13 Jan 2018 • Wenda Qiu, Yueyang Xianzang, Zhekai Zhang

This paper proposed a method for evaluating it.

Visual Reasoning
