The UC Berkeley Computer Vision Group in California, US, is one of the most important research groups in the world for Computer Vision, with consistently high-quality work in the field. The field itself is going through a healthy period in its development, driven by recent contributions from machine intelligence and artificial intelligence that have completely changed it for the better. The Information Age has already posted extensively on Convolutional Neural Networks, currently the state of the art for Computer Vision algorithms, so readers will be familiar with the subject of today’s post.

The group recently published another good paper about deep convolutional neural networks, and I would like to post about it here today. The paper was submitted and approved in the last months of 2016, with the main reviews probably having been done in January 2017. This review and commentary will be short, because I admit that I am not an expert. I do consider myself in a good position to express thoughts and views on it, though, and I feel a need to post this kind of review. My background is well suited both to deep learning subjects (on the mathematical and conceptual side) and to Computer Vision in particular, given that I’ve done work in the Physics of Optics and Lasers. But that was in the deep past. The present is all about a completely new field of Computer Vision with sophisticated embedded machine learning algorithms, one that I increasingly feel I really should care about… Maybe even in a seriously professional way.

Having said all this, now comes the research paper. I chose to review it for three reasons. First, it introduces a new deep convolutional neural network framework, SqueezeDet. Second, the set of possible applications for the framework seems to me to go beyond autonomous vehicles, even though the paper deals mainly with the implementation for autonomous driving, which is already important enough. And last but not least, the framework was specifically designed for real-time object detection, a critical issue for autonomous driving, and one I imagine to be critical in myriad other applications (smart home devices or smart surveillance cameras, to name just two…).

### SqueezeDet: Unified, Small, Low Power Fully Convolutional Neural Networks for Real-Time Object Detection for Autonomous Driving

Abstract

Object detection is a crucial task for autonomous driving. In addition to requiring high accuracy to ensure safety, object detection for autonomous driving also requires realtime inference speed to guarantee prompt vehicle control, as well as small model size and energy efficiency to enable embedded system deployment. In this work, we propose SqueezeDet, a fully convolutional neural network for object detection that aims to simultaneously satisfy all of the above constraints. In our network we use convolutional layers not only to extract feature maps, but also as the output layer to compute bounding boxes and class probabilities. The detection pipeline of our model only contains a single forward pass of a neural network, thus it is extremely fast. Our model is fully convolutional, which leads to small model size and better energy efficiency. Finally, our experiments show that our model is very accurate, achieving state-of-the-art accuracy on the KITTI [9] benchmark. The source code of SqueezeDet is open-source released.

### Introduction

The heart of the matter for SqueezeDet is object detection. This is a crucial task, essential if autonomous vehicles are to become commercially viable and feasible some day:

A safe and robust autonomous driving system relies on accurate perception of the environment. To be more specific, an autonomous vehicle needs to accurately detect cars, pedestrians, cyclists, road signs, and other objects in realtime in order to make the right control decisions that ensure safety. Moreover, to be economical and widely deployable, this object detector must operate on embedded processors that dissipate far less power than powerful GPUs used for benchmarking in typical computer vision experiments.

Although the autonomous vehicles in development today use a number and variety of sensors, with different solutions for different vehicles, image sensors make the huge amounts of data that deep neural networks require relatively cheap to obtain. The pace of development of these networks also promises to increase the accuracy and robustness of computer vision frameworks, with ever more possible scenarios in the loop:

(…) Image sensors are cheap compared with others such as LIDAR. Image data (including video) are much more abundant than, for example, LIDAR cloud points, and are much easier to collect and annotate. Recent progress in deep learning shows a promising trend that with more and more data that cover all kinds of long-tail scenarios, we can always design more powerful neural networks with more parameters to digest the data and become more accurate and robust.

For autonomous driving to become fully feasible, accuracy is not the only thing that must enter the equation. There is also speed, model size and energy efficiency. Frameworks with low power consumption will therefore hold a comparative advantage:

For autonomous driving some basic requirements for image object detectors include the following: a) Accuracy. More specifically, the detector ideally should achieve 100% recall with high precision on objects of interest. b) Speed. The detector should have real-time or faster inference speed to reduce the latency of the vehicle control loop. c) Small model size. As discussed in [16], smaller model size brings benefits of more efficient distributed training, less communication overhead to export new models to clients through wireless update, less energy consumption and more feasible embedded system deployment. d) Energy efficiency.

(…)

While precise figures vary, the new Xavier processor from Nvidia, for example, is targeting a 20W thermal design point. Processors targeting mobile applications have an even smaller power budget and must fit in the 3W–10W range. Without addressing the problems of a) accuracy, b) speed, c) small model size, and d) energy efficiency, we won’t be able to truly leverage the power of deep neural networks for autonomous driving.

The framework introduced in this paper meets all of the above requirements. SqueezeDet delivers accuracy, speed, small model size and energy efficiency in the following way:

The detection pipeline of SqueezeDet is inspired by [21]: first, we use stacked convolution filters to extract a high dimensional, low resolution feature map for the input image. Then, we use ConvDet, a convolutional layer to take the feature map as input and compute a large amount of object bounding boxes and predict their categories. Finally, we filter these bounding boxes to obtain final detections.

The “backbone” convolutional neural net (CNN) architecture of our network is SqueezeNet [16], which achieves AlexNet level ImageNet accuracy with a model size of < 5MB that can be further compressed to 0.5MB. After strengthening the SqueezeNet model with additional layers followed by ConvDet, the total model size is still less than 8MB.

The inference speed of our model can reach 57.2 FPS with input image resolution of 1242×375. Benefiting from the small model size and activation size, SqueezeDet has a much smaller memory footprint and requires fewer DRAM accesses, thus it consumes only 1.4J of energy per image on a TITAN X GPU, which is about 84X less than a Faster RCNN model described in [2].

SqueezeDet is also very accurate. One of our trained SqueezeDet models achieved the best average precision in all three difficulty levels of cyclist detection in the KITTI object detection challenge [9].

### SqueezeDet is a fully convolutional neural network

One important aspect of SqueezeDet is that it is a fully convolutional neural network (FCN), that is, a network without fully connected layers. This means that the final parameterized layer outputs a grid and not a single vector. For image classification, a CNN (convolutional neural network) normally ends in fully connected layers that output a 1×1×Channels vector of class probabilities. In the fully convolutional alternative, the last convolutional layer outputs an H×W×Channels grid, which is then *downsampled* by average-pooling to a 1×1×Channels vector of class probabilities. The same idea carries over to R-CNN (region-based convolutional neural network, where the classification occurs in specified regions of interest in an image), giving the so-called R-FCN, a region-based detector that is itself fully convolutional:

FCN models have been applied in other areas as well. To address the image classification problem, a CNN needs to output a 1-dimensional vector of class probabilities. One common approach is to have one or more fully connected layers, which by definition output a 1D vector – 1×1×Channels (e.g. [18, 23]). However, an alternative approach is to have the final parameterized layer be a convolutional layer that outputs a grid (H×W×Channels), and to then use average-pooling to downsample the grid to 1×1×Channels to produce a vector of class probabilities (e.g. [16, 19]). Finally, the R-FCN method that we mentioned earlier in this section is a fully-convolutional network.
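To make the average-pooling step concrete, here is a minimal sketch in plain Python. The grid values and the helper names `global_average_pool` and `softmax` are my own illustration, not code from the paper, and I assume a softmax is applied after pooling to obtain probabilities.

```python
import math

def global_average_pool(grid):
    """Average an H x W x C grid (nested lists) over H and W, giving C values."""
    height = len(grid)
    width = len(grid[0])
    channels = len(grid[0][0])
    sums = [0.0] * channels
    for row in grid:
        for cell in row:
            for c, value in enumerate(cell):
                sums[c] += value
    return [s / (height * width) for s in sums]

def softmax(scores):
    """Turn raw scores into probabilities that sum to 1."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical 2 x 2 grid with 3 channels (one channel per class).
grid = [
    [[1.0, 0.0, 2.0], [3.0, 0.0, 2.0]],
    [[1.0, 0.0, 2.0], [3.0, 0.0, 2.0]],
]
pooled = global_average_pool(grid)  # one value per channel: 1 x 1 x C
probs = softmax(pooled)             # class probabilities
```

The point of the design is that the pooling step has no parameters at all, so the H×W×Channels grid is reduced to a vector without the large weight matrix a fully connected layer would need.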

### Model Description – ConvDet

As the caption of the paper’s Fig. 1 explains, the SqueezeDet detection pipeline first extracts a feature map from the input image and feeds it into a convolutional layer called *ConvDet*. *ConvDet* computes bounding boxes on a grid over the image, where each box is associated with a confidence score and a set of conditional class probabilities. Formally, the confidence score and the conditional probabilities are related like this:

The other C scalars represent the conditional class probability distribution given that the object exists within the bounding box. More formally, we denote the conditional probabilities as Pr(class_c | Object), c ∈ [1, C]. We assign the label with the highest conditional probability to this bounding box and we use

$$
\max_{c} \Pr(\text{class}_c \mid \text{Object}) \cdot \Pr(\text{Object}) \cdot \text{IOU}^{\text{pred}}_{\text{truth}}
$$

as the metric to estimate the confidence of the bounding box prediction.
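As a small illustration of this metric, the sketch below picks the most likely class for a box and multiplies its conditional probability by the box confidence score. I assume here that the confidence score is the YOLO-style product of objectness and IOU; the function name `box_score` and the numbers are hypothetical.

```python
def box_score(class_probs, confidence):
    """Return (best class index, final score) for one predicted box.

    class_probs: conditional probabilities Pr(class_c | Object), one per class.
    confidence:  the box confidence score (assumed: Pr(Object) * IOU).
    """
    best_class = max(range(len(class_probs)), key=lambda c: class_probs[c])
    return best_class, class_probs[best_class] * confidence

# Hypothetical box: three classes, confidence score 0.5.
label, score = box_score([0.1, 0.7, 0.2], 0.5)
```

Boxes with a low final score can then be discarded in the filtering step that closes the detection pipeline.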

The authors then define *ConvDet* in the following way:

ConvDet is essentially a convolutional layer that is trained to output bounding box coordinates and class probabilities. It works as a sliding window that moves through each spatial position on the feature map. At each position, it computes K × (4 + 1 + C) values that encode the bounding box predictions. Here, K is the number of reference bounding boxes with pre-selected shapes. Using the notation from [22], we call these reference bounding boxes as anchor. (…)

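For each anchor (i, j, k), we compute 4 relative coordinates (δx(ijk), δy(ijk), δw(ijk), δh(ijk)) to transform the anchor into a predicted bounding box, as shown in Fig. 2. Following [12], the transformation is described by

$$
x^p_i = \hat{x}_i + \hat{w}_k \, \delta x_{ijk}, \qquad
y^p_j = \hat{y}_j + \hat{h}_k \, \delta y_{ijk}, \qquad
w^p_k = \hat{w}_k \exp(\delta w_{ijk}), \qquad
h^p_k = \hat{h}_k \exp(\delta h_{ijk}),
$$

where (x̂_i, ŷ_j) is the center of the reference grid position (i, j), ŵ_k and ĥ_k are the width and height of the k-th anchor, and (x^p_i, y^p_j, w^p_k, h^p_k) are the predicted bounding box coordinates.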
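The anchor transformation can be sketched in a few lines of Python. I assume the usual anchor-box convention here: the center is shifted by a fraction of the anchor size and the width and height are scaled by an exponential. The function name `decode_box` and the example values are hypothetical.

```python
import math

def decode_box(anchor, deltas):
    """Apply relative coordinates to an anchor and return the predicted box.

    anchor: (cx, cy, w, h) center and size of the reference box.
    deltas: (dx, dy, dw, dh) relative coordinates predicted for that anchor.
    """
    ax, ay, aw, ah = anchor
    dx, dy, dw, dh = deltas
    return (
        ax + aw * dx,       # shift the center by a fraction of the anchor width
        ay + ah * dy,       # shift the center by a fraction of the anchor height
        aw * math.exp(dw),  # scale the width; exp keeps it positive
        ah * math.exp(dh),  # scale the height
    )

# Hypothetical anchor and deltas: move right and up, keep width, double height.
box = decode_box((100.0, 50.0, 40.0, 20.0), (0.5, -0.5, 0.0, math.log(2.0)))
```

With these inputs the predicted box is centered at roughly (120, 40) with a width and height of about 40 each. All-zero deltas return the anchor unchanged, which is why picking an anchor close to the true box keeps the required transformation small.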

As explained in the previous section, the other (C + 1) outputs for each anchor encode the confidence score for this prediction and conditional class probabilities.
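To see how the K × (4 + 1 + C) values at one grid position break down, here is a hedged sketch. The channel ordering (4 deltas, then 1 confidence value, then C class scores per anchor) and the use of a sigmoid for the confidence are my assumptions for illustration; the released code may order and squash the outputs differently.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def softmax(scores):
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def split_outputs(values, num_anchors, num_classes):
    """Split the K * (4 + 1 + C) values at one grid position into per-anchor parts.

    Assumed (hypothetical) layout per anchor: 4 deltas, 1 confidence logit,
    C class logits.
    """
    per_anchor = 4 + 1 + num_classes
    assert len(values) == num_anchors * per_anchor
    predictions = []
    for k in range(num_anchors):
        chunk = values[k * per_anchor:(k + 1) * per_anchor]
        predictions.append({
            "deltas": chunk[:4],                  # relative box coordinates
            "confidence": sigmoid(chunk[4]),      # assumption: sigmoid squashing
            "class_probs": softmax(chunk[5:]),    # conditional class probabilities
        })
    return predictions

K, C = 2, 3                      # hypothetical: 2 anchors, 3 classes
raw = [0.0] * (K * (4 + 1 + C))  # 16 raw values at this grid position
preds = split_outputs(raw, K, C)
```

For a W × H feature map the layer therefore emits W × H × K × (4 + 1 + C) values in a single forward pass, one full set of box predictions per anchor per position.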

Specifically, the authors compare *ConvDet* with the region proposal network (RPN) of Faster R-CNN:

ConvDet is similar to the last layer of RPN in Faster R-CNN [22]. The major difference is that, RPN is regarded as a “weak” detector that is only responsible for detecting whether an object exists and generating bounding box proposals for the object. The classification is handed over to fully connected layers, which are regarded as a “strong” classifier. But in fact, convolutional layers are “strong” enough to detect, localize, and classify objects at the same time.

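This meets the requirement of a smaller model size, since a convolutional output layer needs far fewer parameters than its fully connected counterpart.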

### Some remarks about the Training Protocol

In this section I would like to share the more difficult part of the paper, from the computational and mathematical point of view. It deals with the training protocol used to train SqueezeDet and verify its overall quality. It features some heavy mathematical and computational concepts, but the reader should not be afraid. The important part is that it really works, an excuse common to all the black box models of the world, I know. Nevertheless I am always moved by the elegance and intricate complexity of the mathematics. And no, I am not afraid of that complexity, and I never confuse it with the other, more pessimistic and biased use of the word:

Unlike Faster R-CNN [22], which deploys a (4-step) alternating training strategy to train RPN and detector network, our SqueezeDet detection network can be trained end-to-end, similarly to YOLO [21]. To train the ConvDet layer to learn detection, localization and classification, we define a multi-task loss function:

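The first part of the loss function is the bounding box regression. (δx(ijk), δy(ijk), δw(ijk), δh(ijk)) corresponds to the relative coordinates of anchor-k located at grid center-(i, j). They are outputs of the ConvDet layer. The ground truth bounding box δ^G(ijk), or (δx^G(ijk), δy^G(ijk), δw^G(ijk), δh^G(ijk)), is computed as:

$$
\delta x^G_{ijk} = \frac{x^G - \hat{x}_i}{\hat{w}_k}, \qquad
\delta y^G_{ijk} = \frac{y^G - \hat{y}_j}{\hat{h}_k}, \qquad
\delta w^G_{ijk} = \log\frac{w^G}{\hat{w}_k}, \qquad
\delta h^G_{ijk} = \log\frac{h^G}{\hat{h}_k},
$$

where (x^G, y^G, w^G, h^G) are the coordinates of the ground truth box, (x̂_i, ŷ_j) is the center of the reference grid position (i, j), and ŵ_k and ĥ_k are the width and height of the k-th anchor.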

(…)

During training, we compare ground truth bounding boxes with all anchors and assign them to the anchors that have the largest overlap (IOU) with each of them. The reason is that we want to select the “closest” anchor to match the ground truth box such that the transformation needed is reduced to minimum. I(ijk) evaluates to 1 if the k-th anchor at position-(i, j) has the largest overlap with a ground truth box, and to 0 if no ground truth is assigned to it. This way, we only include the loss generated by the “responsible” anchors. As there can be multiple objects per image, we normalize the loss by dividing it by the number of objects.
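The matching rule described above can be sketched as follows. This is my own simplified illustration, with hypothetical boxes in (center x, center y, width, height) form: it only picks the anchor with the largest IOU for each ground truth box and ignores corner cases such as ties or ground truth boxes that overlap no anchor.

```python
def iou(box_a, box_b):
    """Intersection over union of two (cx, cy, w, h) boxes."""
    def corners(box):
        cx, cy, w, h = box
        return cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2

    ax1, ay1, ax2, ay2 = corners(box_a)
    bx1, by1, bx2, by2 = corners(box_b)
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = box_a[2] * box_a[3] + box_b[2] * box_b[3] - inter
    return inter / union if union > 0 else 0.0

def assign_anchors(ground_truths, anchors):
    """For each ground truth box, return the index of the anchor with the largest IOU."""
    return [
        max(range(len(anchors)), key=lambda idx: iou(gt, anchors[idx]))
        for gt in ground_truths
    ]

# Hypothetical anchors and ground truth boxes.
anchors = [(10.0, 10.0, 4.0, 4.0), (10.0, 10.0, 8.0, 8.0), (30.0, 10.0, 8.0, 8.0)]
ground_truths = [(11.0, 10.0, 8.0, 8.0), (29.0, 10.0, 6.0, 6.0)]
responsible = assign_anchors(ground_truths, anchors)
```

The anchors whose indices appear in `responsible` are the ones for which I(ijk) would be 1; every other anchor contributes nothing to the bounding box regression term.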

(…)

The last part of the loss function is just cross-entropy loss for classification. l^G(c) ∈ {0, 1} is the ground truth label and p(c) ∈ [0, 1], c ∈ [1, C] is the probability distribution predicted by the neural net. We used softmax to normalize the corresponding ConvDet output to make sure that p(c) is ranged between [0, 1].

The hyper-parameters in Equation 2 are selected empirically. In our experiments, we set λ(bbox) = 5, λ^+(conf) = 75, λ^−(conf) = 100. This loss function can be optimized directly using back-propagation.
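To tie the three parts together, here is a rough sketch of a loss with this shape, using the hyper-parameter values quoted above as defaults. It is not the authors’ implementation. In particular, the excerpts above skip the confidence term, so I assume a squared error against a target confidence for the responsible anchors and a squared penalty on the confidence of all other anchors, each normalized by the size of its own group. The function name and the dictionary layout are hypothetical.

```python
import math

def squeezedet_style_loss(anchors, lambda_bbox=5.0,
                          lambda_conf_pos=75.0, lambda_conf_neg=100.0):
    """Multi-task loss over a flat list of anchors (a sketch, not reference code).

    Each anchor is a dict with:
      "responsible": True if a ground truth box is assigned to it
      "deltas", "gt_deltas": predicted and target (dx, dy, dw, dh)
      "conf", "gt_conf": predicted confidence and its target
      "probs", "label": predicted class probabilities and true class index
    Non-responsible anchors only need "responsible" and "conf".
    """
    num_obj = sum(1 for a in anchors if a["responsible"])
    num_bg = len(anchors) - num_obj
    bbox = conf_pos = conf_neg = cls = 0.0
    for a in anchors:
        if a["responsible"]:
            # Bounding box regression: squared error on the four deltas.
            bbox += sum((p - t) ** 2 for p, t in zip(a["deltas"], a["gt_deltas"]))
            # Confidence regression for anchors that own an object (assumed form).
            conf_pos += (a["conf"] - a["gt_conf"]) ** 2
            # Cross-entropy with a one-hot ground truth label.
            cls += -math.log(a["probs"][a["label"]])
        else:
            # Push the confidence of background anchors towards zero (assumed form).
            conf_neg += a["conf"] ** 2
    loss = 0.0
    if num_obj:
        loss += (lambda_bbox * bbox + lambda_conf_pos * conf_pos + cls) / num_obj
    if num_bg:
        loss += lambda_conf_neg * conf_neg / num_bg
    return loss

# Hypothetical example: one responsible anchor and one background anchor.
example = [
    {"responsible": True, "deltas": (0.1, 0.0, 0.0, 0.0),
     "gt_deltas": (0.0, 0.0, 0.0, 0.0), "conf": 0.5, "gt_conf": 0.5,
     "probs": [1.0, 0.0, 0.0], "label": 0},
    {"responsible": False, "conf": 0.1},
]
loss = squeezedet_style_loss(example)
```

Because every term is a plain differentiable function of the ConvDet outputs, a single loss like this can be minimized end-to-end with back-propagation, which is the contrast with the alternating training of Faster R-CNN drawn in the quote.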

### Conclusion

Given the space limitations of a blog post, I will conclude here the review of this interesting and important research paper from the UC Berkeley Computer Vision Group. This is one of the research groups in the field whose output, in quality and significance, should be eagerly followed by every interested reader. The rest of the paper, which I intentionally skipped, details the squeeze layer used to reduce model size (*a 1×1 convolutional layer that compresses an input tensor with large channel size to one with the same batch and spatial dimension, but smaller channel size*), the experimental setup, with all the accuracy, average precision and recall figures, and finally the design space exploration and the specification of the model architecture. As always, reading the full paper is encouraged by this blog. The conclusion paragraph went as follows:

We proposed SqueezeDet, a fully convolutional neural network for real-time object detection. We integrated the region proposition and classification into ConvDet, which is orders of magnitude smaller than its fully-connected counterpart. With the constraints of autonomous driving in mind, our proposed SqueezeDet and SqueezeDet+ models are designed to be small, fast, energy efficient, and accurate. On all of these metrics, our models advance the state-of-the-art.

*featured image: from the body of the paper – Figure 3. Comparing RPN, ConvDet and the detection layer of YOLO [21]. Activations are represented as blue cubes and layers (and their parameters) are represented as orange ones. Activation and parameter dimensions are also annotated.*