More Side Information, Better Pruning: Shared-Label Classification as a Case Study

1 Jan 2021 · Omer Leibovitch, Nir Ailon

Pruning of neural networks, also known as compression or sparsification, is the task of replacing a given network, which may be too expensive to use (in prediction) on low-resource platforms, with another 'lean' network that performs almost as well as the original one while using considerably fewer resources. By turning the compression ratio knob, the practitioner can trade off the information gain against the necessary computational resources, where information gain is a measure of the reduction of uncertainty in the prediction. In certain cases, however, the practitioner may readily possess some information on the prediction from other sources. The main question we study here is whether it is possible to take advantage of this additional side information in order to further reduce the computational resources, in tandem with the pruning process. Motivated by a real-world application, we distill the following elegantly stated problem. We are given a multi-class prediction problem, combined with a (possibly pre-trained) network architecture for solving it on a given instance distribution, and also a method for pruning the network that allows trading off prediction speed against accuracy. We assume the network and the pruning method are state-of-the-art, and it is not our goal here to improve them. However, instead of being asked to predict the label of a single drawn instance $x$, we are asked to predict the label of an $n$-tuple of instances $(x_1,\dots,x_n)$, with the additional side information that all tuple instances share the same label. The shared label distribution is identical to the distribution on which the network was trained. One trivial way to do this is to obtain individual raw predictions for each of the $n$ instances (separately), using our given network pruned for a desired accuracy, and then take the average to obtain a single, more accurate prediction.
This is simple to implement but intuitively sub-optimal, because the $n$ independent instantiations of the network do not share any information and would probably waste resources on overlapping computation. We propose various methods for performing this task, and compare them using extensive experiments on public benchmark data sets for image classification. Our comparison is based on measures of relative information (RI) and $n$-accuracy, which we define. Interestingly, we empirically find that i) sharing information between the $n$ independently computed hidden representations of $x_1,\dots,x_n$, using an LSTM-based gadget, performs best among all methods we experiment with, and ii) for all methods studied, we observe a sweet spot phenomenon, which sheds light on the compression-information trade-off and may assist a practitioner in choosing the desired compression ratio.
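
The trivial averaging baseline described in the abstract can be illustrated with a minimal sketch. It assumes the pruned network has already produced one vector of raw class scores (logits) per instance; the function names, the use of a softmax before averaging, and the toy numbers are illustrative assumptions and are not taken from the paper.

```python
import numpy as np


def softmax(logits: np.ndarray) -> np.ndarray:
    """Row-wise softmax, shifted by the row maximum for numerical stability."""
    shifted = logits - logits.max(axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)


def average_tuple_prediction(logits: np.ndarray):
    """Averaging baseline for an n-tuple of instances that share one label.

    logits: array of shape (n, num_classes), one row per instance, assumed to
            come from n independent forward passes of the pruned network.
    Returns the predicted shared label and the averaged class distribution.
    """
    probs = softmax(logits)          # per-instance class probabilities
    mean_probs = probs.mean(axis=0)  # average over the n instances
    return int(mean_probs.argmax()), mean_probs


# Hypothetical scores for a tuple of n = 3 instances over 3 classes.
toy_logits = np.array([
    [2.0, 1.0, 0.1],
    [0.5, 1.5, 0.2],
    [1.8, 0.3, 0.4],
])
label, mean_probs = average_tuple_prediction(toy_logits)
```

In this toy tuple the second instance on its own favors a different class than the other two, and averaging lets the tuple-level prediction follow the majority evidence. Each row is still computed independently, which is the lack of information sharing that the abstract identifies as the weakness of this baseline.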
