I am coming back to reviewing a paper that I deem relevant or with some probable line of further research that may prove worth to pursue in the future about deep neural networks. I was parsing one of O’Reilly Artificial Intelligence latest newsletter, and I found the paper below about ensembles of neural networks. It makes for a good technical reading and it will be assumed by The Intelligence of Information that the reader understands the basics of deep learning and deep neural networks – something I recognize to skew a bit the audience, but with an increasing number joining the ranks by the day.
The proposal in the paper to obtain the seemingly contradictory goal of ensembling multiple neural networks at no additional training cost, set us right from the reading of the abstract to the core of the paper. This is achieved by a clever combination of optimization techniques within the parameter space with a rapid convergence procedure by leveraging recent efforts in cyclic learning rate schedules:
The resulting technique, which we refer to as Snapshot Ensembling, is simple, yet surprisingly effective. We show in a series of experiments that our approach is compatible with diverse network architectures and learning tasks. It consistently yields lower error rates than state-of-the-art single models at no additional training cost, and compares favorably with traditional network ensembles. On CIFAR-10 and CIFAR-100 our DenseNet Snapshot Ensembles obtain error rates of 3.4% and 17.4% respectively.
This research effort was a joint work between chinese Tsinghua University and Cornell University. The authors believe that in spite of the success of stochastic gradient descent (SGD) in escaping local spurious saddle points and local minima in an optimization schedule, some valuable information might be missed after its implementation. That information is actually essential to improve model performance. From that fact it follows that in order to generalize better, deep learning models need to find the best trade-off between a good and a bad minima with respect to generalization when SGD is applied:
SGD tends to avoid sharper local minima because gradients are computed from small mini-batches and are therefore inexact (Keskar et al., 2016). If the learning rate is sufficiently large, the intrinsic random motion across gradient steps prevents the optimizer from reaching any of the sharp basins along its optimization path. However, if the learning rate is small, the model tends to converge into the closest local minimum. These two very different behaviors of SGD are typically exploited in different phases of optimization (He et al., 2016a). Initially the learning rate is kept high to move into the general vicinity of a flat local minimum. Once this search has reached a stage in which no further progress is made, the learning rate is dropped (once or twice), triggering a descent, and ultimately convergence, to the final local minimum.
The new interesting perspective taken in this paper – that might have been missed by earlier researchers, or thought out but not actually tried – is the fact the a deep neural network with millions of parameters is initialized through different minibatch ordering and will converge to different solutions if two distinct architectures are implemented. This can be exploited through ensembling, in which multiple neural networks are trained from different initialization and then combined with majority voting or averaging. The main deep learning competitions have been won by implementing ensembling of deep learning architectures.
But why this isn´t widely used by the practitioners and deep learning industry as of today? There is a high cost in ensembling multiple deep neural networks.
Despite its obvious advantages, the use of ensembling for deep networks is not nearly as widespread as it is for other algorithms. One likely reason for this lack of adaptation may be the cost of learning multiple neural networks. Training deep networks can last for weeks, even on high performance hardware with GPU acceleration. As the training cost for ensembles increases linearly, ensembles can quickly becomes uneconomical for most researchers without access to industrial scale computational resources.
What the authors of this paper devised was a method to skip this cost with a clever procedure whereby the convergence characteristics of SGDis combined with a cyclic learning rate procedure. In this way the optimization path of the SGD isn’t allowed to run its full course and instead it is stopped and initialized again cyclically, providing a trade-off in th cost of ensembling:
In this paper we focus on the seemingly-contradictory goal of learning an ensemble of multiple neural networks without incurring any additional training costs. We achieve this goal with a training method that is simple and straight-forward to implement. Our approach leverages the non-convex nature of neural networks and the ability of SGD to converge to and escape from local minima on demand. Instead of training M neural networks independently from scratch, we let SGD converge M times to local minima along its optimization path. Each time the model converges, we save the weights and add the corresponding network to our ensemble. We then restart the optimization with a large learning rate to escape the current local minimum. More specifically, we adopt the cycling procedure suggested by Loshchilov & Hutter (2016), in which the learning rate is abruptly raised and then quickly lowered to follow a cosine function. Because our final ensemble consists of snapshots of the optimization path, we refer to our approach as Snapshot Ensembling
And Snapshot Ensembling can even itself be ensembled:
In contrast to traditional ensembles, the training time for the entire ensemble is identical to the time required to train a single traditional model. During testing time, one can evaluate and average the last (and therefore most accurate) m out of M models. Our approach is naturally compatible with other methods to improve the accuracy, such as data augmentation, stochastic depth (Huang et al., 2016b), or batch normalization (Ioffe & Szegedy, 2015). In fact, Snapshot Ensembles can even be ensembled, if for example parallel resources are available during training. In this case, an ensemble of K Snapshot Ensembles yields K × M models at K times the training cost.
The results are point to an always reducing error rate without increasing training costs by this method:
We evaluate the efficacy of Snapshot Ensembles on three state-of-the-art deep learning architectures for object recognition: ResNet (He et al., 2016b), Wide-ResNet (Zagoruyko & Komodakis, 2016), and DenseNet (Huang et al., 2016a). We show across four different data sets that Snapshot Ensembles almost always reduce error without increasing training costs. For example, on CIFAR-10 and CIFAR-100, Snapshot Ensembles obtains error rates of 3.44% and 17.41% respectively.
Snapshot Ensembling produces an ensemble of accurate and diverse models from a single training process. At the heart of Snapshot Ensembling is an optimization process which visits several local minima before converging to a final solution. We take model snapshots at these various minima, and average their predictions at test time.
Neat observation of how deep neural networks optimize/perform during training time yielded the inspiration for this effort.
Ensembles work best if the individual models (1) have low test error and (2) do not overlap in the set of examples they misclassify. Along most of the optimization path, the weight assignments of a neural network tend not to correspond to low test error. In fact, it is commonly observed that the validation error drops significantly only after the learning rate has been reduced, which is typically done after several hundred epochs. Our approach is inspired by the observation that training neural networks for fewer epochs and dropping the learning rate earlier has minor impact on the final test error (Loshchilov & Hutter, 2016). This seems to suggest that local minima along the optimization path become promising (in terms of generalization error) after only a few epochs.
Cyclic Cosine Annealing:
We lower the learning rate at a very fast pace, encouraging the model to converge towards its first local minimum after as few as 50 epochs. The optimization is then continued at a larger learning rate, which perturbs the model and dislodges it from the minimum. We repeat this process several times to obtain multiple convergences. Formally, the learning rate α has the form:
where t is the iteration number, T is the total number of training iterations, and f is a monotonically decreasing function.
The large learning rate α = f(0) provides the model enough energy to escape from a critical point, while the small learning rate α = f(dT /Me) drives the model to a well-behaved local minimum. In our experiments, we set f to be the shifted cosine function proposed by Loshchilov & Hutter (2016):
where α0 is the initial learning rate. Intuitively, this function anneals the learning rate from its initial value α0 to f(dT /Me) ≈ 0 over the course of a cycle. Following (Loshchilov & Hutter, 2016), we update the learning rate at each iteration rather than at every epoch. This improves the convergence of short cycles, even when a large initial learning rate is used.
Snapshot Ensembling.Figure 2 depicts the training process using cyclic and traditional learning rate schedules. At the end of each training cycle, it is apparent that the model reaches a local minimum with respect to the training loss. Thus, before raising the learning rate, we take a “snapshot” of the model weights (indicated as vertical dashed black lines). After training M cycles, we have M model snapshots, f1 . . . fM, each of which will be used in the final ensemble. It is important to highlight that the total training time of the M snapshots is the same as training a model with a standard schedule (indicated in blue). In some cases, the standard learning rate schedule achieves lower training loss than the cyclic schedule; however, as we will show in the next section, the benefits of ensembling outweigh this difference.
And indeed that was the case.
Following is a brief overview of the experimental setup. Actually I just highlight a small part of this section, foe reasons of blog post limitations. Further reading of all experimental details of a research such as this one is always warranted:
We then test the Snapshot Ensemble algorithm trained with the cyclic cosine learning rate as described in (2). We test models with the max learning rate α0 set to 0.1 and 0.2. In both cases, we divide the training process into learning rate cycles. Model snapshots are taken after each learning rate cycle. Additionally, we train a Snapshot Ensemble with a non-cyclic learning rate schedule. This NoCycle Snapshot Ensemble, which uses the same schedule as the Single Model and Dropout baselines, is meant to highlight the impact of cyclic learning rates for our method. To accurately compare with the cyclic Snapshot Ensembles, we take the same number of snapshots equally spaced throughout the training process. Finally, we compare against SingleCycle Ensembles, a Snapshot Ensemble variant in which the network is re-initialized at the beginning of every cosine learning rate cycle, rather than using the parameters from the previous optimization cycle. This baseline essentially creates a traditional ensemble, yet each network only has 1/M of the typical training time. This variant is meant to highlight the tradeoff between model diversity and model convergence. Though SingleCycle Ensembles should in theory explore more of the parameter space, the models do not benefit from the optimization of previous cycles.
Activation space. To further explore the diversity of models, we compute the pairwise correlation of softmax outputs for every pair of snapshots. (…)
Firstly, there are large correlations between the last 3 snapshots of the non-cyclic training schedule (right). These snapshots are taken after dropping the learning rate, suggesting that each snapshot has converged to the same minimum. Though there is more diversity amongst the earlier snapshots, these snapshots have much higher error rates and are therefore not ideal for ensembling. Conversely, there is less correlation between all cyclic snapshots (left). Because all snapshots have similar accuracy (as can be seen in Figure 5), these differences in predictions can be exploited to create effective ensembles.
I thought this research effort to be a significant addition to the methodological procedures for efficiently train deep neural networks – a known problematic issue in the space. The introduction of an innovative ensembling method for training deep neural networks, combining Stochastic Gradient Descent convergence characteristics with a pair of clever cyclic learning rate procedure and a model snapshot procedure, this effort might have an indisputable impact within this research field. And the authors promise to further improve their methodology in future research:
DISCUSSION We introduce Snapshot Ensembling, a simple method to obtain ensembles of neural networks without any additional training cost. Our method exploits the ability of SGD to converge to and escape from local minima as the learning rate is lowered, which allows the model to visit several weight assignments that lead to increasingly accurate predictions over the course of training. We harness this power with the cyclical learning rate schedule proposed by Loshchilov & Hutter (2016), saving model snapshots at each point of convergence. We show in several experiments that all snapshots are accurate, yet produce different predictions from one another, and therefore are well suited for test-time ensembles. Ensembles of these snapshots significantly improve the state-of-the-art on CIFAR-10, CIFAR-100 and SVHN. Future work will explore combining Snapshot Ensembles with traditional ensembles. In particular, we will investigate how to balance growing an ensemble with new models (with random initializations) and refining existing models with further training cycles under a fixed training budget.
featured image: Finding Optimal Weights of Ensemble Learner using Neural Network