A team of researchers from Facebook AI research released an interesting paper about sequence to sequence learning with convolutional neural networks (CNNs). CNNs has been mainly used in computer vision implementations, being a state-of-the-art stack for the the researche and development in object recognition or image recognition. Less often have CNNs been implemented for machine translation or speech recognition. The paper we review here for The Intelligence of Information is one such study about implementation of CNNs to machine translation, especially the case of sequence to sequence learning of parts of speech in machine translation that are usually performed with recurrent neural networks (RNNs), namely Long-short term memory (LSTMs).

Indeed from the abstract we already learn that the authors claim to outperform the accuracy of deep LSTMs in *WMT’14 English-German and WMT’14 English-French translation at an order of magnitude faster speed, both on GPU and CPU.*

The success of RNNs in machine translation tasks, speech recognition and text summarization have been attributed to an encoding-decoding scheme of the input sequence with a series of bi-directional recurrent neural networks that generates a variable length output with another set of decoders RNNs. This bi-directional stack is integrated with an interface via a soft-attention mechanism. Such scheme have ouptperformed traditional phrase-based models by large margins.

Convolutional neural networks could also be implemented for sequence modelling given that there are some advantages in so doing. But they are less common.

### Convolutional Sequence to Sequence Learning

Compared to recurrent layers, convolutions create representations for fixed size contexts, however, the effective context size of the network can easily be made larger by stacking several layers on top of each other. This allows to precisely control the maximum length of dependencies to be modeled. Convolutional networks do not depend on the computations of the previous time step and therefore allow parallelization over every element in a sequence. This contrasts with RNNs which maintain a hidden state of the entire past that prevents parallel computation within a sequence.

And the hierarchical representaion of multi-layered convolutional neural networks, whereby the layers of different levels interact with each other over the input sequence, provide a way for this hierarchical structure to capture long-range dependencies compared with the chain structure of recurrent neural networks, that is, the feature representation capturing relationships within a window of *n* words can be obtained by a less than linear computational complexity *O(n/k)*, whereas RNNs performs with linear computational complexity *O(n)*.

Hierarchical structure provides a shorter path to capture long-range dependencies compared to the chain structure modeled by recurrent networks, e.g. we can obtain a feature representation capturing relationships within a window of n words applying only O(n/k) convolutional operations for a kernel of width k, compared to linear number O(n) for recurrent neural networks. Inputs to a convolutional network are fed through a constant number of kernels and non-linearities, whereas recurrent networks apply up to n operations and non-linearities to the first word and only a single set of operations to the last word. Fixing the number of nonlinearities applied to the inputs also eases learning.

The source-code and the models for this research can be accessed in this GitHub repository. One important aspects worth to mention in this paper concerns the use of gated linear units which eases the gradient propagation (with backprop this is an interesting methodological setup) and the fact that each decoder layer is equiped with a separate attention module.

In this paper we propose an architecture for sequence to sequence modeling that is entirely convolutional. Our model is equipped with gated linear units (Dauphin et al., 2016) and residual connections (He et al., 2015a). We also use attention in every decoder layer and demonstrate that each attention layer only adds a negligible amount of overhead. The combination of these choices enables us to tackle largescale problems (§3).

The paragraph below describes the preformance of this setup. The units printed (BLEU) affords a further scrutiny by the interested and more technical readership. Noteworthy is also how this model can translate unseen sentences at an order of magnitude faster speed:

We evaluate our approach on several large datasets for machine translation as well as summarization and compare to the current best architectures reported in the literature. On WMT’16 English-Romanian translation we achieve a new state of the art, outperforming the previous best result by 1.8 BLEU. On WMT’14 English-German we outperform the strong LSTM setup of Wu et al. (2016) by 0.5 BLEU and on WMT’14 English-French we outperform the likelihood trained system of Wu et al. (2016) by 1.5 BLEU. Furthermore, our model can translate unseen sentences at an order of magnitude faster speed than Wu et al. (2016) on GPU and CPU hardware (§4, §5).

How did the former recurrent neural network sequence to sequence modeling worked?:

Sequence to sequence modeling has been synonymous with recurrent neural network based encoder-decoder architectures (Sutskever et al., 2014; Bahdanau et al., 2014). The encoder RNN processes an input sequence x = (x1, . . . , xm) of m elements and returns state representations z = (z1. . . . , zm). The decoder RNN takes z and generates the output sequence y = (y1, . . . , yn) left to right, one element at a time. To generate output yi+1, the decoder computes a new hidden state hi+1 based on the previous state hi , an embedding gi of the previous target language word yi, as well as a conditional input ci derived from the encoder output z. Based on this generic formulation, various encoder-decoder architectures have been proposed, which differ mainly in the conditional input and the type of RNN.

The models with attention compute a weighted sum over the (*z1,…., zms)* representations* *at each time step.

The weights of the sum are referred to as attention scores and allow the network to focus on different parts of the input sequence as it generates the output sequences. Attention scores are computed by essentially comparing each encoder state zj to a combination of the previous decoder state hi and the last prediction yi; the result is normalized to be a distribution over input elements.

Popular choices for recurrent networks in encoder-decoder models are long short term memory networks (LSTM; Hochreiter & Schmidhuber, 1997) and gated recurrent units (GRU; Cho et al., 2014). Both extend Elman RNNs (Elman, 1990) with a gating mechanism that allows the memorization of information from previous time steps in order to model long-term dependencies. Most recent approachesalso rely on bi-directional encoders to build representations of both past and future contexts (Bahdanau et al., 2014; Zhou et al., 2016; Wu et al., 2016). Models with many layers often rely on shortcut or residual connections (He et al., 2015a; Zhou et al., 2016; Wu et al., 2016).

### The Convolutional approach to sequence to sequence modeling

The convolutional approach relies in a fully CNN architecture. This substitutes the RNNs in computing the intermediate encoder states** z **and decoder states **h. **The architecture of the CNN used in this research is equiped with input elements is a distributional space and a strong sense of position by the introduction of a vector of position embeddings:

First, we embed input elements x = (x1, . . . , xm) in distributional space as w = (w1, . .. , wm), where wj ∈ Rˆf is a column in an embedding matrix D ∈ R V ×f. We alsoequip our model with a sense of order by embedding the absolute position of input elements p = (p1, . . . , pm) where pj ∈ Rˆf. Both are combined to obtain input element representations e = (w1 + p1, . . . , wm + pm). We proceed similarly for output elements that were already generated by the decoder network to yield output element representations that are being fed back into the decoder network g = (g1, . . ., gn). Position embeddings are useful in our architecture since they give our model a sense of which portion of the sequence in the input or output it is currently dealing with (§5.4).

The block structure of this convolutional neural network i a simple one. It computes intermediate states based on a fixed number of input elements. Each block contains a one-dimensional convolution followed by a non-linearity. The non-linearites chosen in this research were the so-called Gated Linear Units (GLUs). They impplement a simple gating mechanism over the output of the convolution:

I will have to finish this post now. Unfortunately some technical problems appeared unexpectedely. I encourage the readers to fully disclose the rest if the paper, which is recommended for the exoeriemtsl details and the overall results of the comparsison between the CNN and RNN approach to sequence to sequence learning and modeling.