Today I would like to review an interesting paper on the application of an optimization technique to problems in computer vision.

The technique is called the fast bilateral solver. It is a linear solver (an optimization algorithm) that, as the authors underline in the abstract, *is a novel algorithm for edge-aware smoothing that combines the flexibility and speed of simple filtering approaches with the accuracy of domain-specific optimization algorithms.*

This is the kind of research The Information Age has been especially good at spotting for review, sharing, and highlighting: a pointer to innovative developments that could lead to improvements or guide the direction of further research and development.

Also of note is the combination of theoretical developments in mathematical optimization with direct empirical application to domain-specific, state-of-the-art computer vision problems, along with ease of integration with other state-of-the-art developments such as deep learning pipelines:

### The Fast Bilateral Solver

We present the bilateral solver, a novel algorithm for edge-aware smoothing that combines the flexibility and speed of simple filtering approaches with the accuracy of domain-specific optimization algorithms. Our technique is capable of matching or improving upon state-of-the-art results on several different computer vision tasks (stereo, depth superresolution, colorization, and semantic segmentation) while being 10-1000 times faster than competing approaches. The bilateral solver is fast, robust, straightforward to generalize to new domains, and simple to integrate into deep learning pipelines.

### Problem Formulation

The problem is framed around the relationship between the interior of a natural image and its edges and boundaries: an image's pixels are smoothly correlated within regions but present discontinuities at or across edges. Automated systems strive for better awareness of these discontinuities in order to produce the consistency required of a properly resolved image. As the complexity and difficulty of challenging image tasks increase, the need for better, more sophisticated optimization techniques comes to the fore (the bracketed numbers below refer to the paper's references section):

Traditional approaches to edge-aware smoothing apply an image-dependent filter to a signal of interest. Examples of this include joint bilateral filtering [37,40] and upsampling [20], adaptive manifolds [12], the domain transform [11], the guided filter [17,16], MST-based filtering [41], and weighted median filtering [30,44]. These techniques are flexible and computationally efficient, but often insufficient for solving more challenging computer vision tasks. Difficult tasks often necessitate complex iterative inference or optimization procedures that encourage smoothness while maintaining fidelity with respect to some observation. Optimization algorithms of this nature have been used in global stereo [34], depth superresolution [10,19,24,26,29,32], colorization [25], and semantic segmentation [6,22,28,45]. These approaches are tailored to their specific task, and are generally computationally expensive. In this work we present an optimization algorithm that is 10-1000× faster than existing domain-specific approaches with comparable accuracy, and produces higher-quality output than lightweight filtering techniques with comparable runtimes.

Building on prior research in bilateral-space optimization, the researchers present their proposal, a novel approach that they claim improves the efficiency of the earlier formulation:

In this paper we present a new form of bilateral-space optimization which we call the bilateral solver, which efficiently solves a regularized least-squares optimization problem to produce an output that is bilateral-smooth and close to the input. This approach has a number of benefits:

**General.** The bilateral solver is a single intuitive abstraction that can be applied to many different problems, while matching or beating the specialized state-of-the-art algorithms for each of these problems. It can be generalized to a variety of loss functions using standard techniques from M-estimation [14].

**Differentiable.** Unlike other approaches for edge-aware smoothness which require a complicated and expensive “unrolling” to perform backpropagation [45], the backward pass through our solver is as simple and fast as the forward pass, allowing it to be easily incorporated into deep learning architectures.

**Fast.** The bilateral solver is expressible as a linear least-squares optimization problem, unlike the non-linear optimization problem used in [2]. This enables a number of optimization improvements including a hierarchical preconditioner and initialization technique that hasten convergence, as well as efficient methods for solving multiple problems at once.

(…)

We begin by presenting the objective and optimization techniques that make up our bilateral solver. Let us assume that we have some per-pixel input quantities t (the “target” value, see Figure 1a) and some per-pixel confidence of those quantities c (Figure 1c), both represented as vectorized images. Let us also assume that we have some “reference” image (Figure 1d), which is a normal RGB image. Our goal is to recover an “output” vector x (Figure 1b), which will resemble the input target where the confidence is large while being smooth and tightly aligned to edges in the reference image. We will accomplish this by constructing an optimization problem consisting of an image-dependent smoothness term that encourages x to be bilateral-smooth, and a data-fidelity term that minimizes the squared residual between x and the target t weighted by our confidence c:

minimize_x   (λ/2) ∑_{i,j} Ŵ_{i,j} (x_i − x_j)² + ∑_i c_i (x_i − t_i)²    (1)

The smoothness term in this optimization problem is built around an affinity matrix Ŵ, which is a bistochastized version of a bilateral affinity matrix W. Each element W_{i,j} of the bilateral affinity matrix reflects the affinity between pixels i and j in the reference image in the YUV colorspace (…)
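To make the objective concrete, here is a minimal pixel-space sketch in NumPy. It is illustrative only: a small dense Gaussian bilateral affinity stands in for the paper's bistochastized, grid-based Ŵ, and the minimizer of Equation 1 is obtained by setting its gradient to zero, which gives the linear system (λ(D − W) + diag(c)) x = c ∘ t. All function names are my own, not from the paper's implementation.

```python
import numpy as np

def bilateral_affinity(ref, sigma_xy=3.0, sigma_l=0.2):
    """Dense bilateral affinity W for a tiny grayscale 'reference' image.
    Illustrative only: the paper uses a bistochastized W-hat in a sparse
    bilateral grid, never a dense pixel-space matrix."""
    h, w = ref.shape
    ys, xs = np.mgrid[0:h, 0:w]
    pos = np.stack([ys.ravel(), xs.ravel()], axis=1).astype(float)
    lum = ref.ravel()
    d_xy = ((pos[:, None, :] - pos[None, :, :]) ** 2).sum(-1)
    d_l = (lum[:, None] - lum[None, :]) ** 2
    W = np.exp(-d_xy / (2 * sigma_xy**2) - d_l / (2 * sigma_l**2))
    np.fill_diagonal(W, 0.0)
    return W

def solve_objective(W, t, c, lam=4.0):
    """Minimize (lam/2) * sum_ij W_ij (x_i - x_j)^2 + sum_i c_i (x_i - t_i)^2.
    Zeroing the gradient gives (lam*(D - W) + diag(c)) x = c * t,
    where D = diag(W @ 1) is the degree matrix."""
    D = np.diag(W.sum(axis=1))
    A = lam * (D - W) + np.diag(c)
    b = c * t
    return np.linalg.solve(A, b)
```

With very high confidence c the output reproduces the target; with low confidence the smoothness term dominates and the output is an edge-aware blend of the target values.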

It is indeed interesting how the researchers' algorithm achieves adaptive, edge-aware filtering of discontinuities in an image while using common interpolation (“slicing”) methods, all within an automatic, near real-time paradigm:

This W matrix is commonly used in the bilateral filter [40], an edge-preserving filter that blurs within regions but not across edges by locally adapting the filter to the image content. There are techniques for speeding up bilateral filtering [1,5] which treat the filter as a “splat/blur/slice” procedure: pixel values are “splatted” onto a small set of vertices in a grid [2,5] or lattice [1] (a soft histogramming operation), then those vertex values are blurred, and then the filtered pixel values are produced via a “slice” (an interpolation) of the blurred vertex values. These splat/blur/slice filtering approaches all correspond to a compact and efficient factorization of W:

W = Sᵀ B̄ S    (3)
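A toy illustration of this splat/blur/slice structure, under stated simplifications: hard nearest-vertex splatting stands in for the soft (multilinear or barycentric) splatting of the bilateral grid and permutohedral lattice, so this sketches the shape of the factorization W = Sᵀ B̄ S rather than reproducing the paper's implementation.

```python
import numpy as np

def splat_matrix(ref, grid_xy=2, grid_l=4):
    """Hard-assignment splat matrix S (vertices x pixels): each pixel maps
    to one bilateral-grid vertex keyed by (y, x, luminance) bins. The real
    methods splat softly onto several neighboring vertices."""
    h, w = ref.shape
    ys, xs = np.mgrid[0:h, 0:w]
    coords = np.stack([ys.ravel() // grid_xy,
                       xs.ravel() // grid_xy,
                       (ref.ravel() * grid_l).astype(int)], axis=1)
    verts, idx = np.unique(coords, axis=0, return_inverse=True)
    S = np.zeros((len(verts), h * w))
    S[idx, np.arange(h * w)] = 1.0
    return S, verts

def bilateral_filter(ref, t):
    """Approximate bilateral filtering of t as splat -> blur -> slice,
    normalized by the same pipeline applied to an image of ones."""
    S, verts = splat_matrix(ref)
    d = ((verts[:, None, :] - verts[None, :, :]) ** 2).sum(-1)
    B = np.exp(-d / 2.0)                     # blur between grid vertices
    num = S.T @ (B @ (S @ t))                # splat, blur, slice
    den = S.T @ (B @ (S @ np.ones_like(t)))  # normalization weights
    return num / den
```

The key efficiency point survives the simplification: the blur acts on a small set of vertices rather than on every pixel pair.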

Our authors outline their approach in the following way:

With this we can describe our algorithm, which we will refer to as the “bilateral solver.” The input to the solver is a reference RGB image, a target image that contains noisy observed quantities which we wish to improve, and a confidence image. We construct a simplified bilateral grid from the reference image, which is bistochastized as in [2] (see the supplement for details), and with that we construct the A matrix and b vector described in Equation 6 which are used to solve the linear system in Equation 8 to produce an output image. If we have multiple target images (with the same reference and confidence images) then we can construct a larger linear system in which b has many columns, and solve for each channel simultaneously using the same A matrix. In this many-target case, if b is low rank then that property can be exploited to accelerate optimization, as we show in the supplement.
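The many-target case can be sketched as follows: because A depends only on the reference and confidence images, a single factorization of A can be reused for every target channel, i.e., every column of b. The dense Cholesky factorization below is a stand-in for the sparse preconditioned solver the paper actually uses; the matrices are arbitrary stand-ins, not from the paper.

```python
import numpy as np

def solve_many(A, B):
    """Solve A X = B where each column of B is one target channel's
    right-hand side. A is factored once (Cholesky, since A is symmetric
    positive definite) and both triangular solves handle all columns."""
    L = np.linalg.cholesky(A)
    Y = np.linalg.solve(L, B)       # forward substitution, all columns
    return np.linalg.solve(L.T, Y)  # back substitution, all columns

rng = np.random.default_rng(0)
M = rng.standard_normal((6, 6))
A = M @ M.T + 6 * np.eye(6)       # stand-in SPD system matrix
B = rng.standard_normal((6, 3))   # three target channels, three columns
X = solve_many(A, B)
```

When b is additionally low rank, the solve can be restricted to a small basis of right-hand sides, which is the acceleration the authors defer to their supplement.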

### Integration with Backpropagation and Deep Learning pipelines

Without further ado:

Integrating any operation into a deep learning framework requires that it is possible to backpropagate through that operation. Backpropagating through global operators such as our bilateral solver is generally understood to be difficult, and is an active research area [18]. Unlike most global smoothing operators, our model is easy to backpropagate through by construction. Note that we do not mean backpropagating through a multiplication of a matrix inverse A⁻¹, which would simply be another multiplication by A⁻¹. Instead, we will backpropagate onto the A matrix used in the least-squares solve that underpins the bilateral solver, thereby allowing us to backpropagate through the bilateral solver itself.

The authors backpropagate through the bilateral solver onto the diagonal elements of the sparse A matrix rather than the complete matrix, because the off-diagonal elements of A do not depend on the input signal or confidence; this yields enhanced efficiency:

We will use these observations to backpropagate through the bilateral solver. The bilateral solver takes some input target t and some input confidence c, and then constructs a linear system that gives us a bilateral-space solution ŷ, from which we can “slice” out a pixel-space solution x̂:

ŷ = A⁻¹ b,    x̂ = Sᵀ ŷ    (13)

They emphasize the advantages of their approach over previous methods this way:

Because the computational cost of the backwards pass is dominated by the least squares solve necessary to compute ∂f /∂b, computing the backward pass through the solver is no more costly than computing the forward pass. Contrast this with past approaches for using iterative optimization algorithms in deep learning architectures, which create a sequence of layers, one for each iteration in optimization [45]. The backward pass in these networks is a fixed function of the forward pass and so cannot adapt like the bilateral solver to the structure of the error gradient at the output. Furthermore, in these “unrolled” architectures, the output at each iteration (layer) must be stored during training, causing the memory requirement to grow linearly with the number of iterations. In the bilateral solver, the memory requirements are small and independent of the number of iterations, as we only need to store the bilateral-space output of the solver ˆy during training. These properties make the bilateral solver an attractive option for deep learning architectures where speed and memory usage are important.
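The central identities can be checked numerically. For a symmetric A, backpropagating an output gradient g through y = A⁻¹b costs exactly one more solve with the same A, and the gradient with respect to the diagonal of A is −y ∘ (A⁻¹g). The sketch below (names and matrices are illustrative) verifies ∂L/∂b against finite differences for the loss L = Σᵢ yᵢ:

```python
import numpy as np

def solver_backward(A, b, g):
    """Backward pass through y = A^{-1} b for symmetric A.
      dL/db       =  A^{-1} g          (one extra solve, same A)
      dL/ddiag(A) = -y * (A^{-1} g)    (only the diagonal depends on
                                        the input confidence c)"""
    y = np.linalg.solve(A, b)
    gb = np.linalg.solve(A, g)   # A symmetric, so A^{-T} = A^{-1}
    return gb, -y * gb

rng = np.random.default_rng(1)
M = rng.standard_normal((4, 4))
A = M @ M.T + 4 * np.eye(4)      # stand-in SPD system matrix
b = rng.standard_normal(4)
g = np.ones(4)                   # gradient of L = sum(y) w.r.t. y
gb, gdiag = solver_backward(A, b, g)

# Finite-difference check of dL/db (L is linear in b, so this is exact
# up to floating point).
eps = 1e-6
fd = np.array([(np.linalg.solve(A, b + eps * np.eye(4)[i]).sum()
                - np.linalg.solve(A, b).sum()) / eps for i in range(4)])
```

This is the sense in which the backward pass is "no more costly than the forward pass": both are dominated by one solve against the same matrix.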

### Hierarchical Preconditioning

The optimization problem described in this paper, the quadratic objective of the bilateral solver, can be sped up and improved with proper initialization and preconditioning techniques. The paper uses a hierarchical preconditioning technique:

Hierarchical preconditioners have been studied extensively for image interpolation and optimization tasks. Unfortunately, techniques based on image pyramids [38] are not applicable to our task as our optimization occurs in a sparse 5-dimensional bilateral-space. More sophisticated image-dependent or graph-based techniques [21,23,39] are effective preconditioners, but in our experiments the cost of constructing the preconditioner greatly outweighs the savings provided by the improved conditioning. We will present a novel preconditioner which is similar in spirit to hierarchical basis functions [38] or push-pull interpolation [13], but adapted to our task using the bilateral pyramid techniques presented in [2]. Because of its bilateral nature, our preconditioner is inherently locally adapted and so resembles image-adapted preconditioners [23,39].
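The bilateral-pyramid preconditioner itself is involved, but the place where any preconditioner plugs into the solver is easy to show. Below is a standard preconditioned conjugate-gradient loop in which a simple Jacobi (diagonal) preconditioner stands in for the paper's hierarchical one; only the `M_inv` argument would change:

```python
import numpy as np

def pcg(A, b, M_inv, iters=200, tol=1e-10):
    """Preconditioned conjugate gradient for SPD A. M_inv is the
    (elementwise) inverse of a diagonal preconditioner; a hierarchical
    preconditioner would replace the 'M_inv * r' applications."""
    x = np.zeros_like(b)
    r = b - A @ x
    z = M_inv * r                 # apply the preconditioner
    p = z.copy()
    rz = r @ z
    for _ in range(iters):
        Ap = A @ p
        alpha = rz / (p @ Ap)
        x += alpha * p
        r -= alpha * Ap
        if np.linalg.norm(r) < tol:
            break
        z = M_inv * r             # apply the preconditioner again
        rz_new = r @ z
        p = z + (rz_new / rz) * p
        rz = rz_new
    return x
```

A better preconditioner shrinks the number of iterations the loop needs; the paper's point is that its bilateral pyramid is both cheap to build and well adapted to the image.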

There is an interesting application in semantic segmentation, where integration with convolutional and recurrent neural networks is shown to be possible:

Semantic segmentation is the problem of assigning a category label to each pixel in an image. State-of-the-art approaches to semantic segmentation use large convolutional neural networks (CNNs) to map from pixels to labels [6,28]. The output of these CNNs is often smoothed across image boundaries, so recent approaches refine their output with a CRF ([6,45]). These CRF-based approaches improve per-pixel labeling accuracy, but this accuracy comes at a computational cost: inference in a fully connected CRF on a 500 × 500 image can take up to a second (see Table 4). To evaluate whether the bilateral solver could improve the efficiency of semantic segmentation pipelines, we use it instead of the CRF component in two state-of-the-art models: DeepLab-LargeFOV [6] and CRF-RNN [45]. The DeepLab model consists of a CNN trained on Pascal VOC12 and then augmented with a fixed dense CRF. The CRF-RNN model generalizes the CRF with a recurrent neural network, and trains this component jointly with the CNN on Pascal and MSCOCO.

(Note: CRF stands for Conditional Random Field)

### Conclusion

A simple but bold conclusion follows, with hints at lines of future research:

We have presented the bilateral solver, a flexible and fast technique for inducing edge-aware smoothness. We have demonstrated that the solver can produce or improve state-of-the-art results on a variety of different computer vision tasks, while being faster or more accurate than other approaches. Its speed and generality suggests that the bilateral solver is a useful tool in the construction of computer vision algorithms and deep learning pipelines.

We applied our bilateral solver to the class probability maps using uniform confidence. The resulting discrete segmentations are more accurate and qualitatively smoother than the CNN outputs, despite our per-channel smoothing providing no explicit smoothness guarantees on the argmax of the filtered per-class probabilities (Table 4, Figure 5). The bilateral solver is 8–10× faster than the CRF and CRF-RNN approaches when applied to the same inputs (Table 4). Although the bilateral solver performs slightly worse than the CRF-based approaches, its speed suggests that it may be a useful tool in contexts such as robotics and autonomous driving, where low latency is necessary.
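A toy version of this experiment, with a 1-D chain-graph Laplacian standing in for the bilateral smoothness term: each class-probability channel is smoothed with the same system matrix (uniform confidence, so one solve with a multi-column right-hand side), and an argmax then produces the discrete segmentation. The tiny example and all names are illustrative:

```python
import numpy as np

def smooth_and_argmax(P, L, lam=1.0, conf=1.0):
    """P: (num_classes, num_pixels) probability maps. With uniform
    confidence the shared system matrix is A = lam*L + conf*I; all class
    channels are solved at once, then labeled by per-pixel argmax."""
    n = P.shape[1]
    A = lam * L + conf * np.eye(n)
    X = np.linalg.solve(A, conf * P.T)  # every class channel in one solve
    return X.T.argmax(axis=0)

# A 4-pixel 1-D "image": chain Laplacian and two class-probability maps.
L = np.array([[ 1., -1.,  0.,  0.],
              [-1.,  2., -1.,  0.],
              [ 0., -1.,  2., -1.],
              [ 0.,  0., -1.,  1.]])
P = np.array([[0.9, 0.8, 0.4, 0.1],   # class 0
              [0.1, 0.2, 0.6, 0.9]])  # class 1
labels = smooth_and_argmax(P, L)      # -> array([0, 0, 1, 1])
```

The noisy pixel 2 (class-0 probability 0.4 against neighbors 0.8 and 0.1) is resolved by its smoothed context, which is the qualitative behavior the paper reports for its segmentation outputs.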

Please consult the supplement provided with the paper for further details, tables and the major mathematical derivations together with the research methodology implemented.

*Inserted Image: Neural networks [3.2]: Conditional random fields – linear chain CRF*

*Featured Image: Nonlinear Programming*