It is widely agreed in the deep learning research community, and in the broader literature on the subject, that sparse neural networks perform better than dense neural networks on image classification tasks. This is a bit of a surprise; for me it was a recent revelation. Why is that the case? Today we will try to learn more about this with our choice of paper to review, by a group of researchers from the Indian Institute of Science in Bangalore, India.

Anyone with some knowledge of machine learning practice, and of deep neural networks in particular, knows that the subject is often plagued with implementation issues. These mostly revolve around optimization algorithms, the bias-variance trade-off, and how to balance accuracy against overfitting. But there are also questions about the computational costs associated with the particular choices of features and parameters needed for better performance. The reasons why sparse neural networks perform better than dense ones pertain basically to those particular choices and trade-offs, as we will see next in this review:

### Training Sparse Neural Networks

*Abstract*

Deep neural networks with lots of parameters are typically used for large scale computer vision tasks such as image classification. This is a result of using dense matrix multiplications and convolutions. However, sparse computations are known to be much more efficient. In this work, we train and build neural networks which implicitly use sparse computations. We introduce additional gate variables to perform parameter selection and show that this is equivalent to using a spike-and-slab prior. We experimentally validate our method on both small and large networks and achieve state-of-the-art compression results for sparse neural network models.

### Introduction

For large-scale tasks such as image classification, large networks with many millions of parameters are often used (Krizhevsky, Sutskever, and Hinton 2012), (Simonyan and Zisserman 2015), (Szegedy et al. 2015). However, these networks typically use dense computations. Would it be advantageous to use sparse computations instead? Apart from having fewer parameters to store (O(mn) to O(k)), sparse computations also decrease feedforward evaluation time (O(mnp) to O(kp)). Further, having a lower parameter count may help in avoiding overfitting.
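The storage and speed argument above can be illustrated with a small numpy/scipy sketch. The sizes and names here are my own hypothetical illustration, not from the paper: a mostly-zero weight matrix stored in CSR format holds only its k nonzeros, and the feedforward product touches only those entries.

```python
import numpy as np
from scipy import sparse

rng = np.random.default_rng(0)

# Hypothetical sizes: an m x n layer applied to a batch of p inputs.
m, n, p = 1000, 1000, 64

# Dense weight matrix with roughly 99% of its entries zeroed out.
dense_w = rng.standard_normal((m, n))
dense_w *= rng.random((m, n)) < 0.01   # keep ~1% of the weights

# CSR storage keeps only the k nonzeros: O(k) instead of O(mn).
sparse_w = sparse.csr_matrix(dense_w)

x = rng.standard_normal((n, p))

# Both evaluate the same feedforward product W @ x, but the sparse
# version touches O(kp) entries instead of O(mnp).
y_dense = dense_w @ x
y_sparse = sparse_w @ x

print(sparse_w.nnz, "nonzeros out of", m * n)
```

The two products agree numerically; only the storage layout and work performed differ.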

The first paragraph of the introduction outlines what we said earlier about the effectiveness of sparse neural networks on large-scale tasks such as image classification. To deal with overfitting, the authors propose regularizers of a special kind: regularizers that explicitly restrict the total number of parameters of a network, going beyond the common strategy of sparsity-inducing regularizers such as an *l1* penalty on the parameter vector:

Regularizers are often used to discourage overfitting. These usually restrict the magnitude (l2/l1) of weights. However, to restrict the computational complexity of neural networks, we need a regularizer which restricts the total number of parameters of a network. A common strategy to obtain sparse parameters is to apply sparsity-inducing regularizers such as the l1 penalty on the parameter vector. However, this is often insufficient in inducing sparsity in case of large non-convex problems like deep neural network training as shown in (Collins and Kohli 2014). The contribution of this paper is to be able to induce sparsity in a tractable way for such models.

The main contributions of the paper are then stated clearly:

The overall contributions of the paper are as follows:

- We propose a novel regularizer that restricts the total number of parameters in the network. (Section 2)
- We perform experimental analysis to understand the behaviour of our method. (Section 4)
- We apply our method on LeNet-5, AlexNet and VGG-16 network architectures to achieve state-of-the-art results on network compression. (Section 4)

### Problem Formulation

One important definition in this paper is that of computational complexity. Computational complexity theory addresses a broad set of questions about computational problems, so the authors felt the need to define the concept specifically for neural networks and their particular form of computation:

To understand the motivation behind our method, let us first define our notion of computational complexity of a neural network.

Let Φ = {gˆs_1, gˆs_2, …, gˆs_m} be a set of m vectors. This represents an m-layer dense neural network architecture where gˆs_i is a vector of parameter indices for the i-th layer, i.e. gˆs_i ∈ {0, 1}ˆn_i. Here, each layer gˆs_i contains n_i elements. Zero indicates the absence of a parameter and one indicates its presence. Thus, for a dense neural network, gˆs_i is a vector of all ones, i.e. gˆs_i = {1}ˆn_i. For a sparse parameter vector, gˆs_i would consist mostly of zeros. Let us call Φ the index set of a neural network.

For these vectors, our notion of complexity is simply the total number of parameters in the network.

Definition 1. The complexity of an m-layer neural network with index set Φ is given by ||Φ|| = Σ (i = 1 to m) ||gˆs_i||_0, i.e. the total number of ones (present parameters) summed over all m layers.
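As a minimal sketch (my own illustration, not code from the paper), this complexity measure is just a count of the ones in each layer's binary index vector:

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical index set for a 3-layer network: one binary vector per layer.
layer_sizes = (300, 100, 10)
phi = [rng.integers(0, 2, size=n) for n in layer_sizes]

def complexity(index_set):
    """||Phi||: total number of active (one-valued) parameters across layers."""
    return int(sum(int(g.sum()) for g in index_set))

# A dense network has all-ones index vectors, so its complexity is
# simply the total parameter count.
dense_phi = [np.ones(n, dtype=int) for n in layer_sizes]

print(complexity(dense_phi))  # 410
print(complexity(phi))        # somewhere between 0 and 410
```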

We now aim to solve the following optimization problem:

min over (θ, Φ) of ℓ(ŷ(θ, Φ), y) + λ ||Φ|| (Eqn. 1)

where θ denotes the weights of the neural network, and Φ the index set. ℓ(ŷ(θ, Φ), y) denotes the loss function, which depends on the underlying task to be solved. Here, we learn both the weights as well as the index set of the neural network. Using the formalism of the index set, we are able to penalize the total number of network parameters. While easy to state, we note that this problem is difficult to solve, primarily because Φ contains elements ∈ {0, 1}.

The following sections of the paper neatly describe the methodology and its theoretical basis; they make for an important read, and I will present the main points next, along with noteworthy figures and paragraphs:

How do we incorporate the index set formalism in neural networks? Assume that the index set (Gˆs in Fig. 1) is multiplied pointwise with the weight matrix. This results in a weight matrix that is effectively sparse, if the index set has lots of zeros rather than ones. In other words, we end up learning two sets of variables to ensure that one of them – weights – becomes sparse.
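This pointwise masking is easy to sketch in numpy (a hypothetical illustration with my own shapes and names, not the paper's code):

```python
import numpy as np

rng = np.random.default_rng(2)

# A small weight matrix W and a sampled binary gate matrix G^s of the
# same shape, with mostly zeros (the sparse case).
W = rng.standard_normal((4, 5))
G_s = (rng.random((4, 5)) < 0.3).astype(W.dtype)

# Pointwise product: the effectively sparse weights used in the forward pass.
W_eff = G_s * W

x = rng.standard_normal(5)
h = W_eff @ x   # only the gated-in weights contribute to the activation

print(np.count_nonzero(W_eff), "of", W.size, "weights are active")
```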

How do we learn such binary parameters in the first place? To facilitate this, we interpret index set variables (Gˆs) as draws from a bernoulli random variable. As a result, we end up learning the real-valued bernoulli parameters (G in Fig. 1), or gate variables, rather than the index set variables themselves. Here the sampled binary gate matrix Gˆs corresponds exactly to the index set, or the Φ matrix described above.

To clarify our notation, G and g stand for the real-valued gate variables, while the superscript (·)ˆs indicates binary sampled variables. When we draw from a bernoulli distribution, we have two choices – we can either perform an unbiased draw (the usual sampling process), or we can perform a so-called maximum likelihood (ML) draw. The ML draw involves simply thresholding the values of G at 0.5. To ensure determinism, we use the ML draw or thresholding in this work.
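The two kinds of draw are easy to contrast in a few lines of numpy (my own sketch, with hypothetical gate values):

```python
import numpy as np

rng = np.random.default_rng(3)

# Hypothetical real-valued gate parameters, each in [0, 1].
G = np.array([0.05, 0.4, 0.6, 0.95])

# Unbiased draw: sample each gate from Bernoulli(g); varies run to run.
unbiased = rng.binomial(1, G)

# Maximum likelihood (ML) draw: deterministic thresholding at 0.5.
ml_draw = (G > 0.5).astype(int)

print(ml_draw)   # [0 0 1 1], the same every time
```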

(…)

Our overall regularizer is simply a combination of this bimodal regularizer as well as the traditional l2 or l1 regularizer for the individual gate variables. Our objective function is now stated as follows:

min over (θ, g) of ℓ(ŷ(θ, Gˆs), y) + λ1 Σ(i,j) g[i,j] (1 − g[i,j]) + λ2 Σ(i,j) |g[i,j]| (Eqn. 2)

where g[i,j] denotes the j-th gate parameter in the i-th layer. Note that for g[i,j] ∈ {0, 1}, the second term in Eqn. 2 vanishes and the third term becomes λ||Φ||, thus reducing to Eqn. 1.

The interesting step in the methodology comes next, when the original objective function (Eqn. 1) is derived from the objective in Eqn. 2 by way of an intermediate Eqn. 3. Since the sampled gates *gˆs* are bernoulli random variables, the stochastic objective can be made tractable by Monte-Carlo sampling: the loss is approximated by an average over samples, while the random gate variables are replaced by their expectations.

In this equation, the g[i,j] are equal to E(gˆs[i,j]), the expectations of the sampled binary gates.

While this formulation is sufficient to solve the original problem, we impose another condition on this objective. We would like to minimize the number of Monte-Carlo evaluations in the loss term. This amounts to reducing [ (1/t) Σₜ ℓ(ŷ(θ, Gˆs), y) − E(ℓ(ŷ(θ, Gˆs), y)) ]² for a fixed t, or reducing the variance of the loss term. This is done by reducing the variance of gˆs, the only random variable in the equation. To account for this, we add another penalty term corresponding to Var(gˆs) = g × (1 − g). Imposing this additional penalty and then using t = 1 gives us back Eqn. 2.
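The penalty term is just the variance of a Bernoulli(g) variable, which vanishes exactly when a gate is fully on or fully off. A small numpy sketch (my own illustration):

```python
import numpy as np

def bernoulli_variance(g):
    # Var(g^s) for g^s ~ Bernoulli(g): zero at g in {0, 1}, maximal at 0.5.
    return g * (1.0 - g)

g = np.array([0.0, 0.25, 0.5, 0.75, 1.0])
print(bernoulli_variance(g))   # [0.  0.1875  0.25  0.1875  0.]

# The penalty in Eqn. 2 sums this quantity over every gate, so
# minimizing it pushes each gate toward a deterministic 0 or 1.
penalty = bernoulli_variance(g).sum()
print(penalty)   # 0.625
```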

### Relation to Spike-and-Slab priors

The work methodology for the problem formulation in this paper resembles the so-called spike-and-slab priors used in Bayesian Statistics:

We observe that our problem formulation closely resembles spike-and-slab type priors used in Bayesian statistics for variable selection (Mitchell and Beauchamp 1988).

Broadly speaking, these priors are mixtures of two distributions – one with very low variance (spike), and another with comparatively large variance (slab). By placing a large mass on the spike, we can expect to obtain parameter vectors with large sparsity.

Here, δ(·) denotes the dirac delta distribution, and Z denotes the normalizing constant, and α is the mixture coefficient. Also note that like (Mitchell and Beauchamp 1988), we assume that wi ∈ [−k, k] for some k > 0. This is visualized in Fig. 2b. Note that this is a multiplicative mixture of distributions, rather than additive. By taking negative logarithm of this term and ignoring constant terms, we obtain

The authors compare their methodology with earlier related work in this field by prominent researchers such as Andrew Ng and Yoshua Bengio, especially in the choice of activation function for the backpropagation process. They show that a *clip* function worked better in their implementation of a straight-through estimator for computing gradients through the discrete-valued, bernoulli-distributed stochastic gates.
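As a rough numpy sketch of the idea (my own simplification; the authors' exact estimator and clip function may differ): the forward pass thresholds the gates, while the backward pass lets the gradient through only where the real-valued gate lies inside the clipped interval [0, 1].

```python
import numpy as np

def ste_forward(g):
    # Forward pass: hard threshold at 0.5 (non-differentiable on its own).
    return (g > 0.5).astype(float)

def ste_backward(g, upstream_grad):
    # Straight-through estimator with a clip: pass the gradient through
    # unchanged only where the real-valued gate lies in [0, 1].
    passthrough = ((g >= 0.0) & (g <= 1.0)).astype(float)
    return upstream_grad * passthrough

g = np.array([-0.3, 0.2, 0.7, 1.4])
print(ste_forward(g))                    # [0. 0. 1. 1.]
print(ste_backward(g, np.ones_like(g)))  # [0. 1. 1. 0.]
```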

A comparison with the LASSO methodology for attaining sparsity is also provided:

LASSO is a commonly used method to attain sparsity and perform variable selection. The main difference between the above method and LASSO is that LASSO is primarily a shrinkage operator, i.e. it shrinks all parameters until lots of them are close to zero. This is not true for the case of spike-and-slab priors, which can have high sparsity and encourage large values at the same time. This is due to the richer parameterization of these priors.
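The contrast can be made concrete with a toy numpy example (my own, not from the paper): the LASSO proximal operator shrinks every surviving weight, whereas gate-style selection zeroes some weights and leaves the survivors at full magnitude.

```python
import numpy as np

w = np.array([-3.0, -0.2, 0.1, 2.5])

def soft_threshold(w, lam):
    """LASSO proximal (shrinkage) operator: shrinks every weight toward 0."""
    return np.sign(w) * np.maximum(np.abs(w) - lam, 0.0)

def gate_prune(w, gates):
    """Gate-style selection: zeroes pruned weights, leaves survivors intact."""
    return w * gates

# LASSO shrinks the surviving weights too: -3.0 becomes -2.5, 2.5 becomes 2.0.
print(soft_threshold(w, 0.5))

# Gating keeps the surviving weights at full magnitude.
print(gate_prune(w, np.array([1.0, 0.0, 0.0, 1.0])))
```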

### Related Work

I found this paper to present a potentially important new methodology and an innovative way to achieve higher performance in image classification with deep neural networks. It may well become a reference that other researchers in the field build upon and enhance. As the authors remark in a paragraph reviewing related work, their approach is novel in that pruning and the learning of the network's weights are achieved simultaneously:

There have been many recent works which perform compression of neural networks. Weight-pruning techniques were popularized by LeCun et al.(1989) and Hassibi et al.(1993), who introduced Optimal Brain Damage and Optimal Brain Surgery respectively. Recently, Srinivas and Babu (2015a) proposed a neuron pruning technique, which relied on neuronal similarity. In contrast, we perform weight pruning based on learning, rather than hand-crafted rules.

Previous attempts have also been made to sparsify neural networks. Han et al.(2015) create sparse networks by alternating between weight pruning and network training. A similar strategy is followed by Collins and Kohli (2014). On the other hand, our method performs both weight pruning and network training simultaneously. Further, our method has considerably fewer hyper-parameters to determine (λ1, λ2) compared to the other methods, which have n thresholds to be set for each of the n layers in a neural network.

### Experiments

In this section we perform experiments to evaluate the effectiveness of our method. First, we perform some experiments designed to understand typical behaviour of the method. These experiments are done primarily on LeNet-5 (LeCun et al. 1998). Second, we use our method to perform network compression on two networks – LeNet-5 and AlexNet. These networks are trained on MNIST and ILSVRC-2012 dataset respectively. Our implementation is based on Lasagne, a Theano-based library.

### Conclusion

Finally, I will conclude the review of this interesting paper with the last word from the authors. As always, reading the complete analysis of the proposed methodology in the paper is recommended; it discloses the full picture of the effects of the hyper-parameters when using maximum likelihood sampling instead of unbiased bernoulli sampling:

Effect of hyper-parameters: In Section 2.1 we described that we used maximum likelihood sampling (i.e. thresholding) instead of unbiased sampling from a bernoulli. In these experiments, we shall study the relative effects of hyper-parameters on both methods. In the sampling case, sparsity is difficult to measure as different samples may lead to slightly different sparsities. As a result, we measure the expected sparsity as well as its variance. Our methods primarily have the following hyper-parameters: λ1, λ2 and the initialization for each gate value. As a result, if we have a network with n layers, we have n + 2 hyper-parameters to determine.
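The notion of expected sparsity and its variance under unbiased sampling can be estimated empirically. Here is a small numpy sketch with hypothetical gate values of my own:

```python
import numpy as np

rng = np.random.default_rng(4)

# Hypothetical learned gate values for one layer.
G = rng.random(1000)

# Under unbiased sampling, each draw gives a slightly different sparsity,
# so we estimate the expected sparsity and its variance over many draws.
draws = rng.binomial(1, G, size=(500, G.size))
sparsities = 1.0 - draws.mean(axis=1)   # fraction of zeroed gates per draw
print(sparsities.mean(), sparsities.var())

# Thresholding (the ML draw) instead yields a single deterministic sparsity.
print(1.0 - float((G > 0.5).mean()))
```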

This analysis is presented together with all the results for sparsity and layer-wise compression of the different frameworks, compared in easy-to-read tables.

*Conclusion*

We have introduced a novel method to learn neural networks with sparse connections. This can be interpreted as learning weights and performing pruning simultaneously. By introducing a learning-based approach to pruning weights, we are able to obtain the optimal level of sparsity. This enables us to achieve state-of-the-art results on compression of deep neural networks.

*Featured Image: Yarin Gal Cambridge Machine Learning website*