The long list of blogs I subscribe to seldom leaves me bored. This morning I could not even remember having subscribed to the ALGORITHMIA Blog, nor how well qualified its creators are.
But the Internet appears to be ever more intelligent, and their e-mail servers so well timed, that when I checked my mailbox today, a post from this Blog immediately caught my attention.
In it we can learn about the recent launch of a new AWS AMI (Amazon Machine Image), a virtual server image that can be created and deployed on AWS, Amazon's cloud computing platform. I already defined briefly what an image and a container are in the context of Cloud Computing in The Information Age, so I will not repeat it here. AWS promises to provide deep learning integration in cloud computing with security and quality.
A further link in that blog post led to the paper I will review here today. Continuing the series of reviews on deep learning, this one concerns the application of deep learning frameworks to texture networks. Texture is an important characteristic to analyse in image processing in general, and the paper applies it to artistic painting in particular. The authors train a generative feed-forward network with complex, expressive loss functions. Their convolutional network matches the quality of an earlier effort with deep neural networks that also started from a single texture example, but it generates multiple samples of the same texture, at arbitrary size, hundreds of times faster:
Gatys et al. recently demonstrated that deep networks can generate beautiful textures and stylized images from a single texture example. However, their methods requires a slow and memory consuming optimization process. We propose here an alternative approach that moves the computational burden to a learning stage. Given a single example of a texture, our approach trains compact feed-forward convolutional networks to generate multiple samples of the same texture of arbitrary size and to transfer artistic style from a given image to any other image. The resulting networks are remarkably light-weight and can generate textures of quality comparable to Gatys et al., but hundreds of times faster. More generally, our approach highlights the power and flexibility of generative feed-forward models trained with complex and expressive loss functions.
The group that produced this paper consisted of several Russian researchers from the Skolkovo Institute of Science & Technology and one researcher from the University of Oxford in the United Kingdom. They open with extensive reference to the main results and insights of the work that triggered their own research, a paper by Leon A. Gatys and colleagues from May 2015, also open access, on texture synthesis with convolutional neural networks:
Most of these proposed generative networks that produce images as output, using feed-forward calculations from a random seed; however, very impressive results were obtained by (Gatys et al., 2015a;b) by using networks descriptively, as image statistics. Their idea is to reduce image generation to the problem of sampling at random from the set of images that match a certain statistics. In texture synthesis (Gatys et al., 2015a), the reference statistics is extracted from a single example of a visual texture, and the goal is to generate further examples of that texture. In style transfer (Gatys et al., 2015b), the goal is to match simultaneously the visual style of a first image, captured using some low-level statistics, and the visual content of a second image, captured using higher-level statistics. In this manner, the style of an image can be replaced with the one of another without altering the overall semantic content of the image.
The shortcomings of the approach by Gatys et al. are nicely pointed out, and they lay the foundation and leitmotif for our authors' paper. That approach is an iterative optimization procedure that uses backpropagation to gradually change the pixel values of the image, and its high computational and memory requirements motivate the authors to try a feed-forward generation network instead:
Matching statistics work well in practice, is conceptually simple, and demonstrates that off-the-shelf neural networks trained for generic tasks such as image classification can be re-used for image generation. However, the approach of (Gatys et al., 2015a;b) has certain shortcomings too. Being based on an iterative optimization procedure, it requires backpropagation to gradually change the values of the pixels until the desired statistics is matched. This iterative procedure requires several seconds in order to generate a relatively small image using a high-end GPU, while scaling to large images is problematic because of high memory requirements. By contrast, feed-forward generation networks can be expected to be much more efficient because they require a single evaluation of the network and do not incur in the cost of backpropagation.
Notwithstanding the further challenges that might arise with this approach, which the authors outline in the last sections of the paper, the enthusiasm and energetic tone with which they describe the effort is noteworthy. They firmly believe this work to be deployable in other settings, such as video and mobile applications:
Our contribution is threefold. First, we show for the first time that a generative approach can produce textures of the quality and diversity comparable to the descriptive method. Second, we propose a generative method that is two orders of magnitude faster and one order of magnitude more memory efficient than the descriptive one. Using a single forward pass in networks that are remarkably compact make our approach suitable for video-related and possibly mobile applications. Third, we devise a new type of multi-scale generative architecture that is particularly suitable for the tasks we consider.
According to the authors, this approach has the nice advantage of combining a conceptually simple feed-forward architecture with a complex and expressive loss function. This is possible because the texture network, as they call it, is a fully convolutional network that can generate textures and images of arbitrary size, rather than merely computing descriptive statistics of images.
The related work section of the paper sets out the logical framework for analysing textures in images. Interestingly, the task is reframed from sampling an image from a distribution induced by an example image into an optimization problem, in which texture synthesis amounts to finding, by minimization, a pre-image of a certain statistical function:
In texture synthesis, the distribution is induced by an example texture instance $x_0$ (e.g. a polka dots image), such that we can write $x \sim p(x|x_0)$. In style transfer, the distribution is induced by an image $x_0$ representative of the visual style (e.g. an impressionist painting) and a second image $x_1$ representative of the visual content (e.g. a boat), such that $x \sim p(x|x_0, x_1)$. (Mahendran & Vedaldi, 2015; Gatys et al., 2015a;b) reduce this problem to the one of finding a pre-image of a certain image statistics $\Phi(x) \in \mathbb{R}^d$ and pose the latter as an optimization problem. In particular, in order to synthesize a texture from an example image $x_0$, the pre-image problem is:
$$\operatorname*{argmin}_{x \in \mathcal{X}} \|\Phi(x) - \Phi(x_0)\|_2^2. \tag{1}$$
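Importantly, the pre-image $x : \Phi(x) \approx \Phi(x_0)$ is usually not unique, and sampling pre-images achieves diversity. In practice, samples are extracted using a local optimization algorithm $A$ starting from a random initialization $z$. Therefore, the generated image is the output of the function
$$\operatorname*{localopt}_{x \in \mathcal{X}} \left( \|\Phi(x) - \Phi(x_0)\|_2^2;\ A, z \right), \quad z \sim \mathcal{N}(0, \Sigma). \tag{2}$$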
This results in a distribution p(x|x0) which is difficult to characterise, but is easy to sample and, for good statistics Φ, produces visually pleasing and diverse images. Both (Mahendran & Vedaldi, 2015) and (Gatys et al., 2015a;b) base their statistics on the response that x induces in deep neural network layers. Our approach reuses in particular the statistics based on correlations of convolutional maps proposed by (Gatys et al., 2015a;b).
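To make the pre-image idea concrete for myself, I wrote a minimal sketch in plain Python. Everything in it is an assumption of mine and not the authors' setup: the statistic `stats` (mean and mean of squares of a 1D signal) is a toy stand-in for deep network responses, the local optimizer is plain gradient descent, and the names `stats` and `sample_preimage` are hypothetical. It only illustrates that different random starts lead to different signals sharing the same statistics.

```python
import random

def stats(x):
    """Toy statistic Phi(x): mean and mean of squares (stand-in for deep features)."""
    n = len(x)
    return (sum(x) / n, sum(v * v for v in x) / n)

def sample_preimage(x0, seed, steps=1000, lr=0.05):
    """Gradient descent on ||Phi(x) - Phi(x0)||^2 from a random start z."""
    rng = random.Random(seed)
    target = stats(x0)
    x = [rng.gauss(0.0, 1.0) for _ in x0]  # random initialisation z
    for _ in range(steps):
        m1, m2 = stats(x)
        d1, d2 = m1 - target[0], m2 - target[1]
        # The gradient w.r.t. x_i is (2/n) * (d1 + 2 * d2 * x_i);
        # the 1/n factor is folded into the step size lr.
        x = [v - lr * 2.0 * (d1 + 2.0 * d2 * v) for v in x]
    return x

x0 = [0.0, 1.0] * 32             # a simple alternating "texture"
a = sample_preimage(x0, seed=1)  # two different random starts...
b = sample_preimage(x0, seed=2)  # ...give two different pre-images
print(stats(x0), stats(a), stats(b))
```

Both `a` and `b` end up with statistics very close to those of `x0`, yet the signals themselves differ, which is the diversity the quoted passage refers to.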
The proposed method of texture network
After further reasoning about the theoretical details surrounding the notion of a pre-image, the authors provide a detailed sketch of their proposed texture network method. I recommend that the reader work through that reasoning, as it involves advanced topics in probability and statistics, such as the ergodicity of the sample space and the shortcomings of inducing a maximum-entropy distribution over textures. I will only highlight here what are, in my view, the main points. In doing so I may skip some detail that the more advanced reader will spot; if so, I encourage those readers to come forward and post in the comment section of this Blog.
At a high-level (see Figure 2), our approach is to train a feed-forward generator network g which takes a noise sample z as input and produces a texture sample g(z) as output. For style transfer, we extend this texture network to take both a noise sample z and a content image y and then output a new image g(y, z) where the texture has been applied to y as a visual style. A separate generator network is trained for each texture or style and, once trained, it can synthesize an arbitrary number of images of arbitrary size in an efficient, feed-forward manner.
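The "arbitrary size" property follows from the generator being fully convolutional, and a tiny sketch helped me see why. This is only an illustration under my own assumptions: it is one-dimensional, the two kernels are hand-picked and not trained, ReLU is chosen for simplicity, and `conv1d` and `generator` are hypothetical names. It is not the multi-scale architecture of the paper.

```python
import random

def conv1d(signal, kernel):
    """Valid 1D convolution (no padding): output length is len(signal) - len(kernel) + 1."""
    k = len(kernel)
    return [sum(kernel[j] * signal[i + j] for j in range(k))
            for i in range(len(signal) - k + 1)]

def generator(z, kernels):
    """Apply convolution + ReLU layers; the same weights work for any input length."""
    x = z
    for kernel in kernels:
        x = [max(0.0, v) for v in conv1d(x, kernel)]
    return x

kernels = [[0.5, 1.0, 0.5], [1.0, -1.0, 1.0]]  # fixed, untrained weights
rng = random.Random(0)
small = generator([rng.gauss(0.0, 1.0) for _ in range(16)], kernels)
large = generator([rng.gauss(0.0, 1.0) for _ in range(256)], kernels)
print(len(small), len(large))
```

The output size is decided by the size of the noise sample alone, since the weights never refer to a fixed input length.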
A key challenge in training the generator network g is to construct a loss function that can assess automatically the quality of the generated images. For example, the key idea of GAN is to learn such a loss along with the generator network. We show in Sect. 3.1 that a very powerful loss can be derived from pre-trained and fixed descriptor networks using the statistics introduced in (Gatys et al., 2015a;b). Given the loss, we then discuss the architecture of the generator network for texture synthesis (Sect. 3.2) and then generalize it to style transfer (Sect 3.3).
Our loss function is derived from (Gatys et al., 2015a;b) and compares image statistics extracted from a fixed pretrained descriptor CNN (usually one of the VGG CNN (Simonyan & Zisserman, 2014; Chatfield et al., 2014) which are pre-trained for image classification on the ImageNet ILSVRC 2012 data). The descriptor CNN is used to measure the mismatch between the prototype texture $x_0$ and the generated image $x$. Denote by $F^l_i(x)$ the i-th map (feature channel) computed by the l-th convolutional layer by the descriptor CNN applied to image $x$. The Gram matrix $G^l(x)$ is defined as the matrix of scalar (inner) products between such feature maps:
$$G^l_{ij}(x) = \langle F^l_i(x), F^l_j(x) \rangle. \tag{4}$$
Given that the network is convolutional, each inner product implicitly sums the products of the activations of feature i and j at all spatial locations, computing their (unnormalized) empirical correlation. Hence $G^l_{ij}(x)$ has the same general form as (3) and, being a orderless statistics of local stationary features, can be used as a texture descriptor.
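As a reading aid, here is how I would compute such a Gram matrix in plain Python. The feature values are made up, and in the paper the maps come from a pretrained VGG network, which this sketch does not attempt to reproduce; `gram_matrix` is a hypothetical helper name.

```python
def gram_matrix(features):
    """Unnormalised Gram matrix of a list of flattened feature maps.

    features[i][p] is the activation of feature channel i at spatial location p.
    """
    return [[sum(a * b for a, b in zip(fi, fj)) for fj in features]
            for fi in features]

# Three hypothetical feature channels over four spatial locations.
features = [
    [1.0, 2.0, 3.0, 4.0],
    [0.0, 1.0, 0.0, 1.0],
    [2.0, 0.0, 2.0, 0.0],
]
G = gram_matrix(features)
for row in G:
    print(row)
```

Entry (i, j) sums the products of channels i and j over all locations, so the result is a symmetric matrix whose size depends only on the number of channels, not on the image size.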
In practice, (Gatys et al., 2015a;b) use as texture descriptor the combination of several Gram matrices $G^l$, $l \in L_T$, where $L_T$ contains selected indices of convolutional layer in the descriptor CNN. This induces the following texture loss between images $x$ and $x_0$:
$$\mathcal{L}_T(x; x_0) = \sum_{l \in L_T} \| G^l(x) - G^l(x_0) \|_2^2. \tag{5}$$
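A minimal sketch of a texture loss of this kind, summing squared Gram differences over a set of layers, could look as follows. The layer names `conv1` and `conv2` and all feature values are hypothetical, and I assume the feature maps have already been extracted by some descriptor network.

```python
def gram_matrix(features):
    """Unnormalised Gram matrix of a list of flattened feature maps."""
    return [[sum(a * b for a, b in zip(fi, fj)) for fj in features]
            for fi in features]

def texture_loss(feats_x, feats_x0, texture_layers):
    """Sum over layers of the squared difference between Gram matrices."""
    loss = 0.0
    for layer in texture_layers:
        gx = gram_matrix(feats_x[layer])
        g0 = gram_matrix(feats_x0[layer])
        loss += sum((a - b) ** 2
                    for row_x, row_0 in zip(gx, g0)
                    for a, b in zip(row_x, row_0))
    return loss

# Hypothetical per-layer feature maps for a generated image x and a reference x0.
feats_x = {"conv1": [[1.0, 0.0], [0.0, 1.0]], "conv2": [[2.0, 1.0]]}
feats_x0 = {"conv1": [[1.0, 1.0], [0.0, 0.0]], "conv2": [[1.0, 2.0]]}
loss = texture_loss(feats_x, feats_x0, ["conv1", "conv2"])
print(loss)
```

In this toy case only `conv1` contributes: the `conv2` features differ between the two images, but their Gram matrices coincide.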
In addition to the texture loss (5), (Gatys et al., 2015b) propose to use as content loss the one introduced by (Mahendran & Vedaldi, 2015), which compares images based on the output $F^l_i(x)$ of certain convolutional layers $l \in L_C$ (without computing further statistics such as the Gram matrices). In formulas
$$\mathcal{L}_C(x; y) = \sum_{l \in L_C} \sum_{i=1}^{N_l} \| F^l_i(x) - F^l_i(y) \|_2^2, \tag{6}$$
where $N_l$ is the number of maps (feature channels) in layer l of the descriptor CNN. The key difference with the texture loss (5) is that the content loss compares feature activations at corresponding spatial locations, and therefore preserves spatial information. Thus this loss is suitable for content information, but not for texture information.
Analogously to (Gatys et al., 2015a), we use the texture loss (5) alone when training a generator network for texture synthesis, and we use a weighted combination of the texture loss (5) and the content loss (6) when training a generator network for stylization. In the latter case, the set $L_C$ does not include layers as shallow as the set $L_T$ as only the high-level content should be preserved.
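The difference between the two losses becomes clear in a small experiment of my own: rearranging the spatial locations of a feature map leaves its Gram matrix untouched, while a location-by-location comparison notices the change. The feature values are invented, only a single layer is used, and `content_loss` is a hypothetical helper.

```python
def gram_matrix(features):
    """Unnormalised Gram matrix of a list of flattened feature maps."""
    return [[sum(a * b for a, b in zip(fi, fj)) for fj in features]
            for fi in features]

def content_loss(feats_x, feats_y):
    """Squared difference of activations at corresponding spatial locations."""
    return sum((a - b) ** 2
               for fx, fy in zip(feats_x, feats_y)
               for a, b in zip(fx, fy))

feats = [[1.0, 2.0, 3.0, 4.0], [0.0, 1.0, 0.0, 1.0]]
# Reverse the spatial order of every channel in the same way.
shuffled = [list(reversed(f)) for f in feats]

same_texture = gram_matrix(feats) == gram_matrix(shuffled)
content_gap = content_loss(feats, shuffled)
print(same_texture, content_gap)
```

The Gram matrices are identical, so a texture loss built on them would be zero, whereas the content loss is clearly positive.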
(…) We found that training benefited significantly from inserting batch normalization layers (Ioffe & Szegedy, 2015) right after each convolutional layer and, most importantly, right before the concatenation layers, since this balances gradients travelling along different branches of the network.
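To see what "balancing" branches can mean, consider this simplified sketch of the normalisation step. It is my own illustration: it omits the learnable scale and shift of real batch normalization, works on a flat list of activations, and the two branches with very different magnitudes are invented.

```python
def batch_norm(values, eps=1e-5):
    """Normalise activations to roughly zero mean and unit variance."""
    n = len(values)
    mean = sum(values) / n
    var = sum((v - mean) ** 2 for v in values) / n
    return [(v - mean) / (var + eps) ** 0.5 for v in values]

# Two hypothetical branches whose activations differ by three orders of magnitude.
coarse = [100.0, 300.0, 200.0, 400.0]
fine = [0.1, 0.3, 0.2, 0.4]

# After normalisation both branches live on the same scale before concatenation.
concatenated = batch_norm(coarse) + batch_norm(fine)
print(concatenated)
```

Without the normalisation, the branch with the larger activations would dominate whatever follows the concatenation.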
Learning optimizes the objective (7) using stochastic gradient descent (SGD). At each iteration, SGD draws a mini-batch of noise vectors $z_k$, $k = 1, \dots, B$, performs forward evaluation of the generator network to obtain the corresponding images $x_k = g(z_k, \theta)$, performs forward evaluation of the descriptor network to obtain Gram matrices $G^l(x_k)$, $l \in L_T$, and finally computes the loss (5) (note that the corresponding terms $G^l(x_0)$ for the reference texture are constant). After that, the gradient of the texture loss with respect to the generator network parameters $\theta$ is computed using backpropagation, and the gradient is used to update the parameters. Note that LAPGAN (Denton et al., 2015) also performs multi-scale processing, but uses layer-wise training, whereas our generator is trained end-to-end.
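The training loop described above can be sketched in a few lines, under heavy simplifications that are mine and not the authors': the generator is a single linear map `theta` applied to noise, the descriptor network is replaced by the identity (the Gram matrix is taken directly on the generated channels and normalised by the number of locations), the target Gram matrix `G0` is invented, and the gradient is derived by hand instead of by a backpropagation library.

```python
import random

def matmul(a, b):
    return [[sum(p * q for p, q in zip(row, col)) for col in zip(*b)] for row in a]

def transpose(a):
    return [list(col) for col in zip(*a)]

def gram(x):
    """Gram matrix of channels x, normalised by the number of locations."""
    n = len(x[0])
    return [[sum(p * q for p, q in zip(xi, xj)) / n for xj in x] for xi in x]

def loss_and_grad(theta, z, g0):
    x = matmul(theta, z)                        # forward pass of the generator
    g = gram(x)                                 # statistics of the generated sample
    c = len(g)
    d = [[g[i][j] - g0[i][j] for j in range(c)] for i in range(c)]
    loss = sum(v * v for row in d for v in row)
    # Hand-derived gradient of the loss w.r.t. theta: (4 / n) * D x z^T.
    n = len(z[0])
    grad = [[4.0 * v / n for v in row]
            for row in matmul(matmul(d, x), transpose(z))]
    return loss, grad

def train(g0, n=64, batch=4, steps=200, lr=0.02, seed=0):
    rng = random.Random(seed)
    c = len(g0)
    theta = [[0.1 * rng.gauss(0.0, 1.0) for _ in range(c)] for _ in range(c)]
    history = []
    for _ in range(steps):
        total, grad_sum = 0.0, [[0.0] * c for _ in range(c)]
        for _ in range(batch):                  # mini-batch of noise samples z_k
            z = [[rng.gauss(0.0, 1.0) for _ in range(n)] for _ in range(c)]
            loss, grad = loss_and_grad(theta, z, g0)
            total += loss
            grad_sum = [[s + g for s, g in zip(rs, rg)]
                        for rs, rg in zip(grad_sum, grad)]
        # SGD update with the mini-batch average of the gradient.
        theta = [[t - lr * s / batch for t, s in zip(rt, rs)]
                 for rt, rs in zip(theta, grad_sum)]
        history.append(total / batch)
    return theta, history

# Hypothetical Gram matrix of a reference texture (constant during training).
G0 = [[1.25, 0.5, 0.0], [0.5, 1.25, 0.5], [0.0, 0.5, 1.0]]
theta, history = train(G0)
print(history[0], history[-1])
```

As in the quoted procedure, the reference statistics are computed once, fresh noise is drawn at every iteration, and only the generator parameters are updated.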
Discussion and conclusions
This review has already gone on a bit long. This was a compelling paper and, to me, a satisfying read. But short of doing a completely inappropriate and unethical copy-paste of the entire content, I highly recommend, as always, reading the full paper, including the references within. The authors themselves discuss and conclude the paper in this way:
We have presented a new deep learning approach for texture synthesis and image stylization. Remarkably, the approach is able to generate complex textures and images in a purely feed-forward way, while matching the texture synthesis capability of (Gatys et al., 2015a), which is based on multiple forward-backward iterations. In the same vein as (Goodfellow et al., 2014; Dziugaite et al., 2015; Li et al., 2015), the success of this approach highlights the suitability of feed-forward networks for complex data generation and for solving complex tasks in general.
The key to this success is the use of complex loss functions that involve different feed-forward architectures serving as “experts” assessing the performance of the feed-forward generator. While our method generally obtains very good result for texture synthesis, going forward we plan to investigate better stylization losses to achieve a stylization quality comparable to (Gatys et al., 2015b) even for those cases (e.g. Figure 3.top) where our current method achieves less impressive results.
I hope for the best in the future pursuits, described above, of the authors of this compelling paper.
featured image: An Open Source AWS AMI for Training Style Transfer Models