Succinct Compression: Near-Optimal and Lossless Compression of Deep Neural Networks during Inference Runtime

29 Sep 2021 · Yicun Duan, Xiangjun Peng

Recent advances in Deep Neural Network (DNN) compression (e.g., pruning and quantization) significantly reduce storage consumption, making models easier to deploy on low-cost devices. However, these techniques do not keep the compressed representation during inference runtime, which incurs significant overheads in both performance and space consumption. We introduce "Succinct Compression", a three-stage framework that enables DNN inference with near-optimal compression and much better performance during inference runtime. The key insight of our method is to leverage Succinct Data Structures, which support fast queries directly on the compressed representation without decompression. Our method first transforms DNN models into our proposed formulations, in either an Element-wise or a Block-wise manner, so that Succinct Data Structures can be applied. Then, our method compresses the transformed DNN models using Succinct Data Structures. Finally, our method uses specialized execution pipelines, tailored to each model formulation, to retrieve the data needed for DNN inference. Our experimental results show that our method retains near-optimal compression while achieving at least 8.7x/11.5x speedups on AlexNet/VGG-16 inference compared with Huffman Coding. We also show experimentally that our method is highly synergistic with pruning and quantization.
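To illustrate the core idea behind Succinct Data Structures (answering queries directly on a compressed representation, with no decompression step), here is a minimal sketch, not the paper's implementation: a sparse weight matrix is stored as an occupancy bit vector plus its nonzero values, and a rank query on the bit vector locates any element. The class and variable names below are hypothetical and chosen for clarity.

```python
import numpy as np

class SuccinctSparseMatrix:
    """Hypothetical sketch: query weights directly from a compressed layout."""

    def __init__(self, dense_weights: np.ndarray):
        self.shape = dense_weights.shape
        flat = dense_weights.ravel()
        self.occupancy = flat != 0            # 1 bit per position
        self.values = flat[self.occupancy]    # only the nonzero weights
        # Precomputed prefix sums emulate a constant-time rank structure.
        self.rank = np.concatenate(([0], np.cumsum(self.occupancy)))

    def get(self, row: int, col: int) -> float:
        """Read a single weight without reconstructing the dense matrix."""
        pos = row * self.shape[1] + col
        if not self.occupancy[pos]:
            return 0.0
        # rank(pos) = number of nonzeros before `pos` -> index into `values`.
        return float(self.values[self.rank[pos]])

# Usage: queries run on the compressed form; no dense matrix is rebuilt.
w = np.array([[0.0, 1.5, 0.0],
              [0.0, 0.0, -2.0]])
m = SuccinctSparseMatrix(w)
assert m.get(0, 1) == 1.5 and m.get(1, 0) == 0.0
```

A real succinct rank/select structure would store the prefix counts in o(n) extra bits rather than a full integer array; the sketch trades that space bound for readability.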
