I will review today one other interesting paper from the last NIPS 2016 Conference. This time the paper I’ve chosen is about a relatively new unsupervised machine learning method implemented by a system of two neural networks called Generative Adversarial Networks (GAN). This is a technique being used with some success, specially in Computer Vision applications, but the authors of this paper claim possible applications within natural language processing and signal processing frameworks.

One aspect of this method that is somehow innovative is the interplay of two neural networks which are performing two opposed tasks in an adversarial zero-sum game setting. One would have thought this to be unproductive, but we should recall that this is only a methodological approach; this is intended to improve how training data in a machine learning framework helps the generation process of real test data in order for the respective algorithm to become better with its task at predicting an event or simulating some feature of reality – an obvious artificial intelligent goal.

The Microsoft Research and Cambridge University authors of this paper show that GANs are probabilistic models constructs but a special case of a more general family of general variational divergence estimation models. But their *f*-divergence model is claimed to be able to be used for training generative neural samplers, and that it is possible to use a wide set of such models with various choices of divergence functions which then allows for the analysis of the quality of the obtained generative models and the training richness and complexity.

### f-GAN: Training Generative Neural Samplers using Variational Divergence Minimization

Abstract

Generative neural samplers are probabilistic models that implement sampling using feedforward neural networks: they take a random input vector and produce a sample from a probability distribution defined by the network weights. These models are expressive and allow efficient computation of samples and derivatives, but cannot be used for computing likelihoods or for marginalization. The generative-adversarial training method allows to train such models through the use of an auxiliary discriminative neural network. We show that the generative-adversarial approach is a special case of an existing more general variational divergence estimation approach. We show that any f-divergence can be used for training generative neural samplers. We discuss the benefits of various choices of divergence functions on training complexity and the quality of the obtained generative models.

### Introduction

In the introduction we get a fully qualified description of what are GANs, and how the authors show that current machine learing use of GANs isn’t yet exploring the full potential of its generalization to other probabilistic models, specially in experimental settings for generating natural images:

Given a generative model Q from a class Q of possible models we are generally interested in performing one or multiple of the following operations:

Sampling. Produce a sample from Q. By inspecting samples or calculating a function on a set of samples we can obtain important insight into the distribution or solve decision problems.

Estimation. Given a set of iid samples {x1, x2, . . . , xn} from an unknown true distribution P, find Q ∈ Q that best describes the true distribution.

Point-wise likelihood evaluation. Given a sample x, evaluate the likelihood Q(x).

Generative-adversarial networks (GAN) in the form proposed by [10] are an expressive class of generative models that allow exact sampling and approximate estimation. The model used in GAN is simply a feedforward neural network which receives as input a vector of random numbers, sampled, for example, from a uniform distribution. This random input is passed through each layer in the network and the final layer produces the desired output, for example, an image. Clearly, sampling from a GAN model is efficient because only one forward pass through the network is needed to produce one exact sample.

(…)

In the original GAN paper the authors show that it is possible to estimate neural samplers by approximate minimization of the symmetric Jensen-Shannon divergence,

where DKL denotes the Kullback-Leibler divergence. The key technique used in the GAN training is that of introducing a second “discriminator” neural networks which is optimized simultaneously. Because DJS(PkQ) is a proper divergence measure between distributions this implies that the true distribution P can be approximated well in case there are sufficient training samples and the model classQis rich enough to represent P.

(…)

In this work we show that the principle of GANs is more general and we can extend the variational divergence estimation framework proposed by Nguyen et al. [25] to recover the GAN training objective and generalize it to arbitrary f-divergences. More concretely, we make the following contributions over the state-of-the-art:

We derive the GAN training objectives for all f-divergences and provide as example additional divergence functions, including the Kullback-Leibler and Pearson divergences.

We simplify the saddle-point optimization procedure of Goodfellow et al. [10] and provide a theoretical justification.

We provide experimental insight into which divergence function is suitable for estimating generative neural samplers for natural images.

### Details of the Method

In what follows I will briefly describe the details of the methodological approach in this paper. The authors begin by defining *f-*divergence as follows:

where the generator function f : R+ → R is a convex, lower-semicontinuous function satisfying f(1) = 0. Different choices of f recover popular divergences as special cases in (2).

After this definition comes the significant step of this paper: showing that *f*-divergence is a special case of a more general family of models called *variational divergence minimization *methods:

Nguyen et al. [25] derive a general variational method to estimate f-divergences given only samples from P and Q. We will extend their method from merely estimating a divergence for a fixed model to estimating model parameters. We call this new method variational divergence minimization (VDM) and show that the generative-adversarial training is a special case of this more general VDM framework. For completeness, we first provide a self-contained derivation of Nguyen et al’s divergence estimation procedure. Every convex, lower-semicontinuous function f has a convex conjugate function fˆ∗ , also known as Fenchel conjugate [14]. This function is defined as:

The function f ∗ is again convex and lower-semicontinuous and the pair (f, fˆ∗ ) is dual to another in the sense that fˆˆ∗∗ = f. Therefore, we can also represent f as f(u) = sup[t∈domfˆ∗] {tu − fˆ∗ (t)}.

Nguyen et al. leverage the above variational representation of f in the definition of the f-divergence to obtain a lower bound on the divergence,

where T is an arbitrary class of functions T : X → R. The above derivation yields a lower bound for two reasons: first, because of Jensen’s inequality when swapping the integration and supremum operations. Second, the class of functions T may contain only a subset of all possible functions. By taking the variation of the lower bound in (4) w.r.t. T, we find that under mild conditions on f [25], the bound is tight for

where f’ denotes the first order derivative of f. This condition can serve as a guiding principle for choosing f and designing the class of functions T . For example, the popular reverse Kullback-Leibler divergence corresponds to f(u) = − log(u) resulting in T ∗ (x) = −q(x)/p(x), see Table 5. We list common f-divergences in Table 5 and provide their Fenchel conjugates f ∗ and the domains domf ∗ in Table 6. We provide plots of the generator functions and their conjugates in the supplementary materials.

And now it is detailed the Variational Divergence Minimization (VDM), where for this paper it is used the common minibatch sampling method and the expectation is approximated with sampling without replacement:

We can learn a generative model Qθ by finding a saddle-point of the following f-GAN objective function, where we minimize with respect to θ and maximize with respect to ω,

To optimize (6) on a given finite training data set, we approximate the expectations using minibatch samples. To approximate Ex∼P [·] we sample B instances without replacement from the training set. To approximate Ex∼Qθ [·] we sample B instances from the current generative model Qθ.

### Algorithms for Variational Divergence Minimization (VDM)

This section will surely delight the data engineer and software geek in you… and me 😉 :

We now discuss numerical methods to find saddle points of the objective (6). To this end, we distinguish two methods; first, the alternating method originally proposed by Goodfellow et al. [10], and second, a more direct single-step optimization procedure. In our variational framework, the alternating gradient method can be described as a double-loop method; the internal loop tightens the lower bound on the divergence, whereas the outer loop improves the generator model. While the motivation for this method is plausible, in practice the choice taking a single step in the inner loop is popular. Goodfellow et al. [10] provide a local convergence guarantee.

(…)

Motivated by the success of the alternating gradient method with a single inner step, we propose a simpler algorithm shown in Algorithm 1. The algorithm differs from the original one in that there is no inner loop and the gradients with respect to ω and θ are computed in a single back-propagation

Good, but further down:

Goodfellow et al. [10] noticed that training GAN can be significantly sped up by maximizing Ex∼Qθ [log Dω(x)] instead of minimizing Ex∼Qθ [log (1 − Dω(x))] for updating the generator. In the more general f-GAN Algorithm (1) this means that we replace line 4 with the update

thereby maximizing the generator output. This is not only intuitively correct but we can show that the stationary point is preserved by this change using the same argument as in [10]; we found this useful also for other divergences.

Larsen and Sønderby recommended monitoring real and fake statistics, which are defined as the true positive and true negative rates of the variational function viewing it as a binary classifier. Since our output activation gf are all monotone, we can derive similar statistics for any f-divergence by only changing the decision threshold. Due to the link between the density ratio and the variational function (5), the threshold lies at f’ (1) (see Table 6). That is, we can interpret the output of the variational function as classifying the input x as a true sample if the variational function Tω(x) is larger than f’ (1), and classifying it as a sample from the generator otherwise.

### Related work and final remarks

Following the theoretical and methodological description, the authors of this interesting paper go on and present their implementation of this results to the popular MNIST and LSUN datasets

. I for now skip this section and go on to list some of the related work done in similar former research papers and conclude with some final remarks:

We now discuss how our approach relates to existing work. Building generative models of real world distributions is a fundamental goal of machine learning and much related work exists. We only discuss work that applies to neural network models:

Mixture density networks [2] are neural networks which directly regress the parameters of a finite parametric mixture model. When combined with a recurrent neural network this yields impressive generative models of handwritten text [11].

NADE [19] and RNADE [33]perform a factorization of the output using a predefined and somewhat arbitrary ordering of output dimensions. The resulting model samples one variable at a time conditioning on the entire history of past variables. These models provide tractable likelihood evaluations and compelling results but it is unclear how to select the factorization order in many applications .

Diffusion probabilistic models[29] define a target distribution as a result of a learned diffusion process which starts at a trivial known distribution. The learned model provides exact samples and approximate log-likelihood evaluations.

Noise contrastive estimation(NCE) [13] is a method that estimates the parameters of unnormalized probabilistic models by performing non-linear logistic regression to discriminate the data from artificially generated noise. NCE can be viewed as a special case of GAN where the discriminator is constrained to a specific form that depends on the model (logistic regression classifier) and the generator (kept fixed) is providing the artificially generated noise (see supplementary material).

Variational auto-encoders(VAE) [18, 28] are pairs of probabilistic encoder and decoder models which map a sample to a latent representation and back, trained using a variational Bayesian learning objective. The advantage of VAE is in the encoder model which allows efficient inference from observation to latent representation and overall they are a compelling alternative to f-GANs and recent work has studied combinations of the two approaches [23]

Discussion

Generative neural samplers offer a powerful way to represent complex distributions without limiting factorizing assumptions. However, while the purely generative neural samplers as used in this paper are interesting their use is limited because after training they cannot be conditioned on observed data and thus are unable to provide inferences. We believe that in the future the true benefits of neural samplers for representing uncertainty will be found in discriminative models and our presented methods extend readily to this case by providing additional inputs to both the generator and variational function as in the conditional GAN model [8].

Acknowledgements. We thank Ferenc Huszar for discussions on the generative-adversarial approach.

This is the kind of paper where it is highly recommended to parse through the referenced list of papers attentively. In this review, inside the quoted boxes there are numerous pointers to this list. I do not hyperlinked those, but it is supposed to be interpreted as a referenced paper that the careful and diligent reader should check and maybe read also.

Tomorrow will be another day and another post!

*featured image: from this paper – Figure 1: The two terms in the saddle objective (7) are plotted as a function of the variational function Vω(x).*