I have been a keen follower of The Morning Paper blog for over a year now. Adrian Colyer is an experienced computer scientist and an expert on high-technology ventures (companies and capital alike, I suppose), specifically in cloud computing and data science. He therefore really knows what he writes about, even though he writes with an agreeable humility of tone, which often shows in his posts.
This week he has been posting about a series of papers on Convolutional Neural Networks drawn from the long awesome deep learning papers list of top 100 papers. It is a long list of relevant papers, worth reading in a field that surprises by the sheer volume and quality of its research output. Sometimes we might suspect that quantity comes at the expense of quality, but I do not think that is the case with Machine Learning/Deep Learning research. Obviously some efforts will turn out to be more significant than others, but overall I really do feel that the balance of quantity and quality in these efforts is quite positive and impressive.
The three posts from the mentioned list of papers that The Morning Paper has covered can be consulted here, here and here. Here in this blog I have also been posting with some healthy frequency on Machine Learning/Deep Learning issues, papers and topics. But the reader may have noticed already that, for my own personal reasons, I tend to prefer the topic of Computer Vision in relation to Machine Learning/Deep Learning. It is the closest to my own professional and academic background, and I find it the most interesting, indeed. The other topics are also important and deeply interesting, but we should always stick with where, in the end, we might contribute positively and, who knows, make some kind of difference (a bit of wishful thinking does not hurt anyone here, I suppose, and for the best reasons possible…). Convolutional Neural Networks are the state-of-the-art technique in the application of Deep Learning to Computer Vision, so these three posts in The Morning Paper should be mantra reading for people like me.
The very last paper of this long list surveyed by The Morning Paper caught my attention precisely because of its close link to Computer Vision proper. But also because it comes from one of the technological giants that sponsor scientific/technological research of world-class quality, Google, perhaps the most important such behemoth; and, last but not least, because it is a paper that invites us to think critically about its own content and conclusions, within a sometimes difficult to understand scientific/technological field of study. The pursuit of challenging tasks is encouraged by this blog and should be a value much closer to the attention of other parts of human endeavour… But we will not bother with unfortunate mindsets here.
The paper in question is Rethinking the Inception Architecture for Computer Vision, and The Information Age shall now take a deep breath and share it with its followers. It will be a brief, contained overview, though. Recently I have been preparing another personal project (something I should do to render all other efforts, like this blog, worth their salt), and long reviews will become less frequent in this blog. Nevertheless, the commitment to present the best view possible of the main points worth thinking about in the papers I read is of course maintained.
Convolutional networks are at the core of most state-of-the-art computer vision solutions for a wide variety of tasks. Since 2014 very deep convolutional networks started to become mainstream, yielding substantial gains in various benchmarks. Although increased model size and computational cost tend to translate to immediate quality gains for most tasks (as long as enough labeled data is provided for training), computational efficiency and low parameter count are still enabling factors for various use cases such as mobile vision and big-data scenarios. Here we are exploring ways to scale up networks in ways that aim at utilizing the added computation as efficiently as possible by suitably factorized convolutions and aggressive regularization. We benchmark our methods on the ILSVRC 2012 classification challenge validation set and demonstrate substantial gains over the state of the art: 21.2% top-1 and 5.6% top-5 error for single frame evaluation using a network with a computational cost of 5 billion multiply-adds per inference and with using less than 25 million parameters. With an ensemble of 4 models and multi-crop evaluation, we report 3.5% top-5 error and 17.3% top-1 error on the validation set and 3.6% top-5 error on the official test set.
One of the most important issues in deep neural networks concerns the computational efficiency with which these networks process the vast amounts of data entering their input layers. This paper is one more research effort that tries to offer some solutions, partial or otherwise, to this issue:
Much of the original gains of the GoogLeNet network arise from a very generous use of dimension reduction, just like in the “Network in network” architecture by Lin et al. This can be viewed as a special case of factorizing convolutions in a computationally efficient manner. Consider for example the case of a 1×1 convolutional layer followed by a 3×3 convolutional layer. In a vision network, it is expected that the outputs of near-by activations are highly correlated. Therefore, we can expect that their activations can be reduced before aggregation and that this should result in similarly expressive local representations.
Here we explore other ways of factorizing convolutions in various settings, especially in order to increase the computational efficiency of the solution. Since Inception networks are fully convolutional, each weight corresponds to one multiplication per activation. Therefore, any reduction in computational cost results in reduced number of parameters. This means that with suitable factorization, we can end up with more disentangled parameters and therefore with faster training. Also, we can use the computational and memory savings to increase the filter-bank sizes of our network while maintaining our ability to train each model replica on a single computer.
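To make the arithmetic behind this concrete, here is a rough back-of-the-envelope sketch (my own illustration, not taken from the paper; the grid and channel sizes are arbitrary assumptions) of how a 1×1 dimension-reduction layer cuts the multiply-add cost of a following 3×3 convolution:

```python
# Rough multiply-add counts for a 3x3 convolution on a 35x35 grid with
# 288 input and 288 output channels, with and without a hypothetical
# 1x1 "bottleneck" that first reduces the channels to 96.
# (All sizes here are illustrative assumptions, not values from the paper.)

def conv_cost(h, w, k, c_in, c_out):
    """Multiply-adds for a k x k convolution with 'same' padding."""
    return h * w * k * k * c_in * c_out

h = w = 35
direct = conv_cost(h, w, 3, 288, 288)      # plain 3x3 on the full width
reduced = (conv_cost(h, w, 1, 288, 96)     # 1x1 reduction down to 96 channels
           + conv_cost(h, w, 3, 96, 288))  # 3x3 on the thinner input

print(f"direct:  {direct:,} multiply-adds")
print(f"reduced: {reduced:,} multiply-adds")
print(f"savings: {1 - reduced / direct:.0%}")
```

With these (assumed) sizes the factorized version costs roughly a third of the direct one, which is the kind of saving that lets the filter banks grow without blowing the compute budget.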
The Morning Paper comments
In this section I briefly sketch what Adrian Colyer wrote about this paper; the conclusions section follows further below.
This paper comes a little out-of-order in our series, as it covers the Inception v3 architecture. The bulk of the paper though is a collection of advice for designing image processing deep convolutional networks. Inception v3 just happens to be the result of applying that advice.
Avoid representational bottlenecks – representation size should gently decrease from the inputs to the outputs before reaching the final representation used for the task at hand. Big jumps (downward) in representation size cause extreme compression of the representation and bottleneck the model.
Higher dimensional representations are easier to process locally in a network; more activations per tile allow for more disentangled features. The resulting networks train faster.
Spatial aggregation of lower dimensional embeddings can be done without much or any loss in representational power. “For example, before performing a more spread out (e.g. 3×3 convolution), one can reduce the dimension of the input representation before the spatial aggregation without expecting serious adverse effects.”
Balance network width and depth, optimal performance is achieved by balancing the number of filters per stage and the depth of the network. I.e., if you want to go deeper you should also consider going wider.
Although these principles might make sense, it is not straightforward to use them to improve the quality of networks out of the box. The idea is to use them judiciously in ambiguous situations only.
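One concrete factorization the paper advocates, consistent with these principles, is replacing a 5×5 convolution by two stacked 3×3 convolutions that cover the same receptive field. A minimal parameter-count sketch (the channel count is an arbitrary assumption and biases are ignored):

```python
# Weight counts for a single 5x5 convolutional layer versus two stacked
# 3x3 layers with the same receptive field. The channel count c = 192
# is an illustrative assumption; biases are ignored.

def params(kernels, c):
    # Each k x k layer with c input and c output channels has k*k*c*c weights.
    return sum(k * k * c * c for k in kernels)

c = 192
five = params([5], c)           # single 5x5 layer: 25 * c^2 weights
two_threes = params([3, 3], c)  # two 3x3 layers:   18 * c^2 weights

print(five, two_threes, f"{1 - two_threes / five:.0%} fewer parameters")
```

The ratio 18/25 holds for any channel count, so the stacked version is always 28% cheaper in weights (and multiply-adds) while adding an extra nonlinearity in between.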
The authors also revisit the question of the auxiliary classifiers used to aid training in the original Inception. “Interestingly, we found that auxiliary classifiers did not result in improved convergence early in the training… near the end of training the network with the auxiliary branches starts to overtake the accuracy of the network without, and reaches a slightly higher plateau.” Removing the lower of the two auxiliary classifiers also had no effect.
The conclusions section of the paper is quoted below. The rest of the paper provides guidance that is likely useful for researchers struggling with their own models when applying convolutional neural networks to a computer vision problem. Of course, the precise way in which each model performs in its context matters, and these recommendations should always be balanced against all the factors that might come into play:
We have provided several design principles to scale up convolutional networks and studied them in the context of the Inception architecture. This guidance can lead to high performance vision networks that have a relatively modest computation cost compared to simpler, more monolithic architectures. Our highest quality version of Inception-v2 reaches 21.2% top-1 and 5.6% top-5 error for single crop evaluation on the ILSVRC 2012 classification, setting a new state of the art. This is achieved with relatively modest (2.5×) increase in computational cost compared to the network described in Ioffe et al. Still our solution uses much less computation than the best published results based on denser networks: our model outperforms the results of He et al – cutting the top-5 (top-1) error by 25% (14%) relative, respectively – while being six times cheaper computationally and using at least five times less parameters (estimated). The combination of lower parameter count and additional regularization with batch-normalized auxiliary classifiers and label-smoothing allows for training high quality networks on relatively modest sized training sets.
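The label smoothing mentioned in these conclusions replaces the one-hot training target with a mixture of the ground-truth distribution and a uniform distribution over the K classes; the paper uses ε = 0.1 with K = 1000 classes for ImageNet. A minimal sketch of the smoothed target:

```python
# Label smoothing: the target for class y becomes (1 - eps) + eps/K,
# and every other class gets eps/K. Epsilon = 0.1 and K = 1000 are the
# values the paper reports for ImageNet.

def smooth_labels(true_class, num_classes, epsilon=0.1):
    uniform = epsilon / num_classes
    q = [uniform] * num_classes
    q[true_class] = 1.0 - epsilon + uniform
    return q

q = smooth_labels(true_class=3, num_classes=1000)
print(q[3], q[0])  # true class gets ~0.9001, every other class ~0.0001
```

Because the target is never exactly 0 or 1, the network is discouraged from driving logits to extreme values, which the authors credit with better generalization on modestly sized training sets.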
featured image: Highlights of ICCV 2015