Rigging the Lottery Making All Tickets Winners
Rigging the Lottery: Making All Tickets Winners [Supplemental] ICML 2020
This paper proposes a new method to train a sparse neural network, which raises the theoretical limit of a sparse neural network’s trainable size under memory constraint and yields higher accuracy. The method is concise and proven effective over image classification and language modeling tasks with ResNet and MobileNet.
There are two minor drawbacks from my perspective. First, the experiment to compare a dense network’s performance with one of the sparse network trained using the proposed method is not strict enough, which has the risk of exaggerating a sparse network’s performance. Second, the paper does not account for a sparse computing device’s possible slowdown compared to a dense one in the experiment.
Despite these two minor drawbacks, which apply to almost every paper on sparse neural network training, this paper is excellent work, and I highly recommend to read and try it in action.
The inductive bias (also known as learning bias) of a learning algorithm is the set of assumptions that the learner uses to predict outputs of given inputs that it has not encountered .
Sparse weight is a common inductive bias and has many successful use cases in deep learning:
- As the building block of a convolutional neural network, convolution could be represented as a sparse matrix multiplication with a Toeplitz weight matrix .
- Depthwise and pointwise convolution  could be viewed as a regular convolution with CP decomposition weight tensor.
The sparsity pattern in the above two examples is deterministic and hand-designed. In recent years, people are exploring methods to design arbitrary and even learnable sparsity pattern. In the particular field of convolutional neural networks, the research trend brings us two problems.
- How to train a sparse convolutional neural network
- How to implement an efficient sparse convolution computing device
This paper focuses on the first problem.
The most popular method to train a sparse neural network is pruning , which trains a dense network (network with sparsity level of 1.) initially and then gradually prunes weights to zero and approaches the target level of sparsity at the end. This method works well, but it limits the sparse network’s size to less than the dense network size.
This paper proposes to train the sparse network with the target level of sparsity and random initialized sparse connections from the very beginning. Suppose we use this method to train a network with a sparsity level of 0.9; the sparse network can be ten times larger than the dense one.
However, a network with random sparse connections is unlikely to be the lottery  and pruned to local optimal. The paper proposes to jumps out of those local optimal by dropping weak connections and form new connections periodically while maintaining the overall sparsity rate.
Under the main idea, some details of this method are listed below:
- The method forms the new connections by selecting the zero weights with the largest gradients at that current batch (instantaneous gradient information).
- The authors also find that zero initialization for the newly formed connection brings the best result.
- They use cosine annealing to decay the fraction of updated connections each time.
- Given a certain overall sparsity level, they find that the sparsity level is beneficial to be proportional to each layer’s number of connections (make small layer denser and large layer more sparse).
Experiments on image classification and language modeling are conducted to verify the method to be effective. These experiments show that a sparse network trained using the new method outperforms a dense one with equal FLOPs by a considerable margin consistently.
I select the MobileNet on ImageNet result to show above, and other experiments share similar patterns.
Below I share some personal thoughts on this work:
- The experimental design has the risk of exaggerating the performance of a sparse network. This paper’s method is analogous to the retraining trick, which reinitializes weights in some layers and continues the training. Retraining usually brings higher generalization performance, which explains why longer training time benefits the sparse network rather than the dense one. For a strictly fair comparison, I recommend training the dense network with a cosine annealing learning rate schedule , rather than a stepwise learning rate schedule, and train for five times longer to match the sparse training time.
- Recall the two problems that we mentioned above. Many papers about sparse network training ignore that the first problem depends heavily on the second one. To improve this trend, we should study how to estimate the bounds of the sparse computing device’s slowdown, which will be valuable work.
- Amazingly, sparsity works in many such cases, and I think we will figure out how to use this concept better if we could understand more about why it works so well.
: A guide to convolution arithmetic for deep learning https://arxiv.org/abs/1603.07285
: MobileNets: Efficient Convolutional Neural Networks for Mobile Vision Applications https://arxiv.org/abs/1704.04861
: TO PRUNE, OR NOT TO PRUNE: EXPLORING THE EFFICACY OF PRUNING FOR MODEL COMPRESSION https://arxiv.org/abs/1710.01878
: The Lottery Ticket Hypothesis: Finding Sparse, Trainable Neural Networks https://arxiv.org/abs/1803.03635
: SGDR: Stochastic Gradient Descent with Warm Restarts https://arxiv.org/abs/1608.03983v5