I had the misfortune of a serious technical problem with my best PC equipment, and this prevents me from serving these blogs and their readers with the appropriate quality. This interruption will last until the end of this week and may extend into next week, in spite of my best efforts to get back to normal conditions as soon as possible.

I apologize in advance and promise to get back to business as usual with renewed motivation…

]]>

Indeed, from the abstract we already learn that the authors claim to outperform the accuracy of deep LSTMs on *WMT’14 English-German and WMT’14 English-French translation at an order of magnitude faster speed, both on GPU and CPU.*

The success of RNNs in machine translation, speech recognition and text summarization has been attributed to an encoder-decoder scheme: a series of bi-directional recurrent neural networks encodes the input sequence, and another set of decoder RNNs generates a variable-length output. The two stacks are interfaced via a soft-attention mechanism. Such schemes have outperformed traditional phrase-based models by large margins.

Convolutional neural networks can also be used for sequence modelling, and doing so has some advantages, but they are less common.

Compared to recurrent layers, convolutions create representations for fixed size contexts, however, the effective context size of the network can easily be made larger by stacking several layers on top of each other. This allows to precisely control the maximum length of dependencies to be modeled. Convolutional networks do not depend on the computations of the previous time step and therefore allow parallelization over every element in a sequence. This contrasts with RNNs which maintain a hidden state of the entire past that prevents parallel computation within a sequence.

The hierarchical representation built by multi-layered convolutional neural networks, in which layers at different levels interact over the input sequence, also offers a shorter path for capturing long-range dependencies than the chain structure of recurrent neural networks. That is, a feature representation capturing relationships within a window of *n* words can be obtained with *O(n/k)* convolutional operations for kernels of width *k*, whereas an RNN needs a linear number of steps, *O(n)*.

Hierarchical structure provides a shorter path to capture long-range dependencies compared to the chain structure modeled by recurrent networks, e.g. we can obtain a feature representation capturing relationships within a window of n words applying only O(n/k) convolutional operations for a kernel of width k, compared to linear number O(n) for recurrent neural networks. Inputs to a convolutional network are fed through a constant number of kernels and non-linearities, whereas recurrent networks apply up to n operations and non-linearities to the first word and only a single set of operations to the last word. Fixing the number of nonlinearities applied to the inputs also eases learning.
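The O(n/k) claim can be checked with a little arithmetic. The sketch below assumes stride-1 convolutions without dilation and a kernel width of at least 2 (these assumptions, and the function names, are illustrative and not taken from the paper): each extra layer widens the window seen by one output by k − 1 positions, so the depth needed to cover n words grows like n/k.

```python
import math


def receptive_field(num_layers: int, kernel_width: int) -> int:
    """Input positions seen by one output of a stack of stride-1 convolutions."""
    return 1 + num_layers * (kernel_width - 1)


def layers_needed(window: int, kernel_width: int) -> int:
    """Smallest number of stacked layers whose receptive field covers `window` words."""
    return max(0, math.ceil((window - 1) / (kernel_width - 1)))


# Covering a window of 25 words with kernels of width 5.
conv_depth = layers_needed(window=25, kernel_width=5)
covered = receptive_field(conv_depth, kernel_width=5)

# A recurrent network needs one step per word to relate the first word to the last.
rnn_steps = 25
```

With these toy numbers the convolutional stack needs far fewer sequential operations than the recurrent chain, which is the shorter path the quote refers to.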

The source code and the models for this research can be accessed in this GitHub repository. Two aspects of this paper are worth mentioning: the use of gated linear units, which ease gradient propagation (an interesting methodological setup when training with backprop), and the fact that each decoder layer is equipped with a separate attention module.

In this paper we propose an architecture for sequence to sequence modeling that is entirely convolutional. Our model is equipped with gated linear units (Dauphin et al., 2016) and residual connections (He et al., 2015a). We also use attention in every decoder layer and demonstrate that each attention layer only adds a negligible amount of overhead. The combination of these choices enables us to tackle large-scale problems (§3).

The paragraph below describes the performance of this setup. The metric reported (BLEU) deserves further scrutiny by the interested and more technical readership. Also noteworthy is how this model can translate unseen sentences an order of magnitude faster:

We evaluate our approach on several large datasets for machine translation as well as summarization and compare to the current best architectures reported in the literature. On WMT’16 English-Romanian translation we achieve a new state of the art, outperforming the previous best result by 1.8 BLEU. On WMT’14 English-German we outperform the strong LSTM setup of Wu et al. (2016) by 0.5 BLEU and on WMT’14 English-French we outperform the likelihood trained system of Wu et al. (2016) by 1.5 BLEU. Furthermore, our model can translate unseen sentences at an order of magnitude faster speed than Wu et al. (2016) on GPU and CPU hardware (§4, §5).

How did the earlier recurrent neural network approach to sequence to sequence modeling work?

Sequence to sequence modeling has been synonymous with recurrent neural network based encoder-decoder architectures (Sutskever et al., 2014; Bahdanau et al., 2014). The encoder RNN processes an input sequence x = (x1, . . . , xm) of m elements and returns state representations z = (z1, . . . , zm). The decoder RNN takes z and generates the output sequence y = (y1, . . . , yn) left to right, one element at a time. To generate output yi+1, the decoder computes a new hidden state hi+1 based on the previous state hi, an embedding gi of the previous target language word yi, as well as a conditional input ci derived from the encoder output z. Based on this generic formulation, various encoder-decoder architectures have been proposed, which differ mainly in the conditional input and the type of RNN.

Models with attention compute the conditional input ci as a weighted sum over the encoder representations *(z1, …, zm)* at each time step.

The weights of the sum are referred to as attention scores and allow the network to focus on different parts of the input sequence as it generates the output sequences. Attention scores are computed by essentially comparing each encoder state zj to a combination of the previous decoder state hi and the last prediction yi; the result is normalized to be a distribution over input elements.
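A minimal sketch of this weighted sum, assuming a plain dot product as the comparison between a decoder-side query and each encoder state (the scoring function, the toy vectors and the names `attend` and `query` are assumptions for illustration; the text does not fix them):

```python
import math


def softmax(scores):
    """Normalize raw scores into a distribution over input elements."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def attend(query, encoder_states):
    """Return the attention scores and the weighted sum of the encoder states."""
    weights = softmax([dot(query, z) for z in encoder_states])
    dim = len(encoder_states[0])
    context = [sum(w * z[d] for w, z in zip(weights, encoder_states))
               for d in range(dim)]
    return weights, context


# Three toy encoder states z1..z3 and one decoder-side query.
z = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
query = [2.0, 0.0]  # stands in for the combination of h_i and the last prediction
weights, context = attend(query, z)
```

The encoder states that align best with the query receive the largest weights, which is how the network focuses on different parts of the input as it generates each output element.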

Popular choices for recurrent networks in encoder-decoder models are long short term memory networks (LSTM; Hochreiter & Schmidhuber, 1997) and gated recurrent units (GRU; Cho et al., 2014). Both extend Elman RNNs (Elman, 1990) with a gating mechanism that allows the memorization of information from previous time steps in order to model long-term dependencies. Most recent approaches also rely on bi-directional encoders to build representations of both past and future contexts (Bahdanau et al., 2014; Zhou et al., 2016; Wu et al., 2016). Models with many layers often rely on shortcut or residual connections (He et al., 2015a; Zhou et al., 2016; Wu et al., 2016).

The convolutional approach relies on a fully convolutional architecture, in which CNNs replace the RNNs in computing the intermediate encoder states **z** and decoder states **h**. The CNN used in this research embeds input elements in a distributional space and gains a strong sense of position through the introduction of position embeddings:

First, we embed input elements x = (x1, . . . , xm) in distributional space as w = (w1, . . . , wm), where wj ∈ R^f is a column in an embedding matrix D ∈ R^(V×f). We also equip our model with a sense of order by embedding the absolute position of input elements p = (p1, . . . , pm) where pj ∈ R^f. Both are combined to obtain input element representations e = (w1 + p1, . . . , wm + pm). We proceed similarly for output elements that were already generated by the decoder network to yield output element representations that are being fed back into the decoder network g = (g1, . . ., gn). Position embeddings are useful in our architecture since they give our model a sense of which portion of the sequence in the input or output it is currently dealing with (§5.4).
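The combination e = (w1 + p1, …, wm + pm) can be sketched in a few lines. The vocabulary size, embedding width, maximum length and random tables below are hypothetical values chosen only for illustration; in the real model both tables would be learned.

```python
import random

random.seed(0)
V, f, max_len = 10, 4, 8  # hypothetical vocabulary size, embedding width, max length

# Word embedding table D (one row per word here) and position embedding table P.
D = [[random.gauss(0.0, 0.1) for _ in range(f)] for _ in range(V)]
P = [[random.gauss(0.0, 0.1) for _ in range(f)] for _ in range(max_len)]


def embed(token_ids):
    """Return e_j = w_j + p_j for every position j of the input."""
    return [[w + p for w, p in zip(D[tok], P[j])]
            for j, tok in enumerate(token_ids)]


e_ab = embed([3, 7])  # word 3 at position 0, word 7 at position 1
e_ba = embed([7, 3])  # same words, swapped order
```

Word 3 gets a different representation in `e_ab` and `e_ba` because its position differs, which is the sense of order the quote describes.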

The block structure of this convolutional neural network is a simple one. It computes intermediate states based on a fixed number of input elements. Each block contains a one-dimensional convolution followed by a non-linearity. The non-linearities chosen in this research were the so-called Gated Linear Units (GLUs). They implement a simple gating mechanism over the output of the convolution.
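A minimal sketch of the gating a GLU applies, assuming the convolution output for one position is a flat list whose first half A carries the content and whose second half B drives the gates, so the result is A multiplied elementwise by sigmoid(B). The function names and the toy input are illustrative.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def glu(y):
    """Split y = [A B] in half and return A * sigmoid(B), elementwise."""
    if len(y) % 2:
        raise ValueError("GLU input needs an even number of channels")
    half = len(y) // 2
    a, b = y[:half], y[half:]
    return [ai * sigmoid(bi) for ai, bi in zip(a, b)]


# A = [1.0, -2.0], B = [0.0, 10.0]: the first gate is half open, the second almost fully open.
out = glu([1.0, -2.0, 0.0, 10.0])
```

Because the gate only scales A, the path through A stays linear, which is one way to see why the gradient can propagate more easily than through a saturating non-linearity.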

I will have to finish this post now. Unfortunately some technical problems appeared unexpectedly. I encourage readers to go through the rest of the paper in full; it is recommended for the experimental details and the overall results of the comparison between the CNN and RNN approaches to sequence to sequence learning and modeling.

]]>


Today I would like to post on a more technical, purely engineering topic. The heart of the matter in Artificial Intelligence (AI) is more practical and empirical than theoretical, even though the conceptual framework is undoubtedly important. To get a good grasp of the real work involved in setting up all the apparatus for a machine learning/deep learning and AI model or project, we need to get our hands dirty, so to speak.

The video below may offer the right frame of mind for this goal. It features a talk by research scientist Chris Fregly from PipelineIO, a machine learning and AI start-up from San Francisco, US. Chris starts by presenting the GitHub repository called fluxcapacitor/pipeline. The talk is from January 2017, and although in software development even less than six months can be a long time, enough for work to go out of date, I think this presentation preserves its relevance. It manages to put together a plethora of software developments such as Kubernetes orchestration tools, Docker containers, Apache SparkML, TensorFlow and the Jupyter Notebook, all in one bundled development stack. Quite impressive…

The YouTube video description of the talk is also worth reading through. I share it here:

In this completely demo-based talk, Chris Fregly from PipelineIO will demo the latest 100% open source research in high-scale, fault-tolerant model serving using Tensorflow, Spark ML, Jupyter Notebook, Docker, Kubernetes, and NetflixOSS Microservices.

This talk will discuss the trade-offs of mutable vs. immutable model deployments, on-the-fly JVM byte-code generation, global request batching, miroservice circuit breakers, and dynamic cluster scaling – all from within a Jupyter notebook.

Chris Fregly is a Research Scientist at PipelineIO – a Machine Learning and Artificial Intelligence Startup in San Francisco.

Chris is an Apache Spark Contributor, Netflix Open Source Committer, Founder of the Advanced Spark and TensorFlow Meetup, Author of the upcoming book, Advanced Spark, and Creator of the upcoming O’Reilly video series, Deploying and Scaling Distributed TensorFlow in Production.

Previously, Chris was a Distributed Systems Engineer at Netflix, Data Solutions Engineer at Databricks, and a Founding Member of the IBM Spark Technology Center in San Francisco.

This talk was part of a meetup of the Advanced Spark and TensorFlow Meetup and anticipates the StartupML Conference to be held in San Francisco later in August. These are cutting-edge software developments around data engineering, big data pipelines, machine learning and Artificial Intelligence compute engines. Then there is the hardware part of all this multitude of developments, which has also recently witnessed some important milestones. Chris’ talk manages to give us a picture of the relationship between these software and hardware developments within the SparkML and TensorFlow AI model stacks.

One other interesting aspect of the talk is how the Jupyter Notebook, which originated in the Python software ecosystem, is seamlessly integrated with the various pipelines, providing flexible environments that contribute to a more productive setup overall.

My final word goes to the part of the talk where Chris gave his view on model deployments and rollbacks, and on the Graph Transform tool he has been working on. It seemed to be the heart of the matter in this talk. The way modern AI and machine learning/deep learning models are being deployed deserves greater attention. Of special interest is the interplay between mutable and immutable deployments, with the rollback option always available as an on *button*. Docker containers and images are playing an increasingly crucial role in this development, so knowing as much as possible about them is a plus. On the Graph Transform tool, Chris recommends simplification procedures, both to make sense of what can easily become a deeply convoluted graph display and to optimize the serving runtime environment.

*featured image: Continuously Train & Deploy Spark ML and Tensorflow AI Models from Jupyter Notebook to Production*

]]>


Mehmet co-authored a short paper published on arXiv in April this year. It is a fitting short paper review about deep learning architectures for The Intelligence of Information‘s start of the week. It investigates how high network connectivity might increase the accuracy of learning in deep neural networks. For this the team of researchers used two metrics to quantify what is called spectral ergodicity (an advanced mathematical concept that readers of this Blog should look into further): one somewhat new, the Thirumalai-Mountain (TM) metric; the other already widely known and used by the research community in statistics, machine learning and mathematical optimization, the Kullback-Leibler (KL) divergence.

This research effort was developed within a computational framework that has given rise to new software, developed by Mehmet. Followers and readers are encouraged to check it out further and fork it here in its GitHub repository.

Abstract

Using random matrix ensembles, mimicking weight matrices from deep and recurrent neural networks, we investigate how increasing connectivity leads to higher accuracy in learning with a related measure on eigenvalue spectra. For this purpose, we quantify spectral ergodicity based on the Thirumalai-Mountain (TM) metric and Kullbach-Leibler (KL) divergence. As a case study, different size circular random matrix ensembles, i.e., circular unitary ensemble (CUE), circular orthogonal ensemble (COE), and circular symplectic ensemble (CSE), are generated. Eigenvalue spectra are computed along with the approach to spectral ergodicity with increasing connectivity. As a result, it is argued that success of deep learning architectures attributed to spectral ergodicity conceptually, as this property prominently decreases with increasing connectivity in surrogate weight matrices.

Random matrices are a development of probability theory and mathematical physics that has successfully found applications in numerous fields. From finance to signal processing, and from macroeconomics to neuroscience, random matrices are important for the mathematical representation of complex statistical properties of physical systems:

Characterising statistical properties of different random matrix ensembles plays a critical role in understanding the nature of physical models they represent. For example in neuroscience, neuronal dynamics can be encoded as a synaptic connectivity matrix in different network architectures [2–4] and as well as weight matrix of deep learning architectures [5, 6], possibly with dropout [7]. Similarly, transition matrices in stochastic materials simulations in discrete space [8, 9].

The eigenvalue spectrum, another piece of advanced mathematical jargon worth learning more about, provides important information regarding both the structure and the dynamics of physical systems. For example, the learning rate in a recurrent neural network appears to be influenced by the spectral radius of its weight matrices, and the spectral radius is the largest absolute value in the eigenvalue spectrum. Although the English of Mehmet et al. is not that good (my own constructive criticism), the following two paragraphs outline the essential message and purpose of the paper:

Eigenvalue spectrum entails information regarding both structure and dynamics. For example, spectral radius of weight matrices in recurrent neural networks influence the learning rate, i.e., training [10]. Ergodic properties are not much investigated in this context. While, spectral ergodicity is prominently used in characterizing quantum systems undergoes so called analogy to chaotic motion in the energy spectra, as there is no quantum trajectories in classical mechanics sense [11].

The concept of ergodicity appears in statistical mechanics as time averages of a physical dynamics is equal to its ensemble average [9, 12]. The definition is not uniform in the literature [9]. For example, Markov chain transition matrix is called ergodic, if all eigenvalues are below one, implying any state can be reachable from any other. Here, spectral ergodicity implies eigenvalue spectra averaged of ensemble of matrices and spectra obtained using a single matrix, a realisation from an ensemble, as it is or via an averaging procedure, i.e., spectral average generates the same spectra within statistical accuracy.

Following this reasoning the context of neural networks is one where these concepts bear fruit. Indeed in that context, spectral ergodicity may have at least two implications: one from considering the ensemble of weight matrices in different layers for feed-forward architectures, and another from considering the connectivity architecture for recurrent networks. Quantifying a measure of spectral ergodicity becomes central. The authors provide the following one:

A measure of spectral ergodicity is defined as follows for a finite ensemble, inpired from Thirumalai-Mountain (TM) metric [13, 14], for M random matrices of size NxN,

with spectral spacing of bk, where k = 1, …, K, spectral density of ρj (bk), where j = 1, …, M and ensemble average spectral density ¯ρ(bk),

The essential idea of such a measure is to capture fluctuations between individual eigenvalue spectrum against the ensemble averaged one empirically. Note that spectral spacing bk plays a role of time compare to original TM metric in the naive formulation.
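The idea of measuring fluctuations of individual spectra against the ensemble average can be sketched as follows. The code assumes that the measure at each bin b_k is the summed squared deviation of every ρj(b_k) from the ensemble average, scaled by 1/(M·N); that normalization, the function names and the toy densities are assumptions for illustration, so the paper should be consulted for the exact definition.

```python
def ensemble_average(densities):
    """Average spectral density over the M matrices, bin by bin."""
    m = len(densities)
    bins = len(densities[0])
    return [sum(rho[k] for rho in densities) / m for k in range(bins)]


def omega_se(densities, n):
    """TM-inspired measure per bin: squared deviations of each rho_j(b_k)
    from the ensemble average, scaled by 1 / (M * N) (assumed normalization)."""
    m = len(densities)
    rho_bar = ensemble_average(densities)
    return [sum((rho[k] - rho_bar[k]) ** 2 for rho in densities) / (m * n)
            for k in range(len(rho_bar))]


# M = 2 toy spectral densities over K = 3 bins, for matrices of size N = 4.
densities = [[0.2, 0.3, 0.5], [0.4, 0.3, 0.3]]
omega = omega_se(densities, n=4)
```

Bins where every matrix agrees with the ensemble average contribute zero, so small values across all bins indicate that a single matrix already reproduces the ensemble spectrum.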

However, the approach to spectral ergodicity should be defined differently, as a function of *matrix size*, in line with the asymptotic expectation that spectral and ensemble averages yield the same statistics. Enter the Kullback-Leibler (KL) divergence, where a distance is defined between two distributions on consecutive ensemble sizes, Nk > Nk−1:

since KL is not-symetric, we sum KL in both directions to quantify approach to spectral ergodicity,

An interpretation of Dse(Nk) would be that, how increasing connectivity size influence approach to spectral ergodicity. This would indirectly measure information content as well, due to connection of KL to mutual information, i.e., relative entropy.
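Summing KL in both directions can be sketched as below. The code assumes the two per-bin measures are first normalized to sum to one, that every bin is strictly positive, and that the natural logarithm is used; these choices and the names `d_se`, `omega_n64` and `omega_n128` are illustrative assumptions, not taken from the paper.

```python
import math


def normalize(values):
    """Scale non-negative values so that they sum to one."""
    total = sum(values)
    return [v / total for v in values]


def kl(p, q):
    """KL(p || q) for two discrete distributions on the same bins."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def d_se(omega_smaller, omega_larger):
    """Symmetrized KL between the measures at two consecutive sizes."""
    p, q = normalize(omega_larger), normalize(omega_smaller)
    return kl(p, q) + kl(q, p)


# Hypothetical per-bin measures for two consecutive matrix sizes.
omega_n64 = [0.1, 0.2, 0.7]
omega_n128 = [0.3, 0.3, 0.4]
distance = d_se(omega_n64, omega_n128)
```

The sum is symmetric in its two arguments and is zero only when the two normalized measures coincide, which makes it usable as a distance-like quantity between consecutive sizes.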

That is, the paper approaches deep learning architectures through random matrices that stand in for the weights of feed-forward architectures and for the connectivity of recurrent networks, instead of through a specific architecture and learning algorithm. The advantages of the approach are outlined brilliantly (better English this time…):

Using random matrices brings a distinct advantage of generalizing the results for any architecture using a simple toy system and being able to enforce certain restrictions on the spectral radius. For example applying a constraint on the spectral radius to unity, this simulates prevention of learning algorithm in getting into numerical problems in training the network [10]. Circular random matrix ensembles serves this purpose well, where their eigenvalues lies on the unit circle on the complex plane.
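That the eigenvalues of a circular ensemble lie on the unit circle can be checked numerically. The sketch below is a minimal NumPy illustration for the circular unitary ensemble, using the QR decomposition of a complex Gaussian matrix with a phase correction taken from the diagonal of R (the recipe usually attributed to Mezzadri). It is not the authors' parallel code; the function name `sample_cue`, the matrix size and the seed are assumptions.

```python
import numpy as np


def sample_cue(n, rng):
    """Draw an n x n unitary matrix from the circular unitary ensemble."""
    z = (rng.standard_normal((n, n)) + 1j * rng.standard_normal((n, n))) / np.sqrt(2.0)
    q, r = np.linalg.qr(z)
    d = np.diag(r)
    # Rescale each column of Q by the phase of the matching diagonal entry of R.
    return q * (d / np.abs(d))


rng = np.random.default_rng(42)
u = sample_cue(8, rng)
eigenvalues = np.linalg.eigvals(u)
spectral_radius = np.max(np.abs(eigenvalues))
```

Every eigenvalue has modulus one up to floating point error, so the spectral radius is pinned to unity by construction, which is the constraint the quote describes.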

The Neuroscience research connection is also outlined:

A typical weight matrix for a layer, W, in deep learning architectures, feed-forward multi-layer neural networks, are not rectangular and has no self-recurrent connections. We took the assumption that eigenvalue spectrum of W^T W is directly comparable to behaviour of W in learning and effect of diagonal elements as recurrent connections are small for our purpose. While, an other interpretation of such random matrix ensembles appear in synaptic matrices for brain dynamics and memory [4].

Another aspect of interest in this paper and its computational development is the parallel code that generates circular ensembles. Using circular ensembles circumvents the need for spectral unfolding, since it avoids the varying mean eigenvalue density that Gaussian ensembles exhibit. *Reproducibility* is achieved by preserving what the authors call *random seeds*:

Implementation of parallel code generating circular ensembles [18], using Mezzadri’s approach [19, 20], is utilized. Using circular ensembles circumvent the need for spectral unfolding [11] too, namely eliminating varying mean eigenvalue density for semicircle law for Gaussian ensembles [11]. Random seeds are preserved for each chunk from an ensemble, so that the results are reproducable both in parallel and serial runs [18]. Three different circular ensembles are generated.

The paper then gives a detailed mathematical formulation of the Circular Unitary Ensemble (CUE) and the Circular Orthogonal Ensemble (COE), which I skip here. But again, this is an advanced mathematical physics tool set of interest.

This short but interesting paper turned out to be the perfect way to start the week here in this Blog. It provides a different approach to deep learning architectures, one where the connectivity of the network and spectral ergodicity play a role in determining the accuracy of learning by the neural network. This goes against the data-centric algorithmic norm, taking risks in pursuit of insight:

Success of deep learning architectures are attributed to availibilty of large amount of data and being able to train multiple layers at the same time [5, 6]. However, how this is possible from theoretical point of view is not well established. We introduce quantification of spectral ergodicity for random matrices as a surrogate to weight matrices in deep learning architectures and argue that spectral ergodicity conceptually might leverage our understanding how these architectures perform learning in high accuracy. From biological standpoint, our results also show that spectral ergodicity would play an important role in understanding synaptic matrices.

*featured image: Periodic and Ergodic Spectral Problems*

]]>

This time the culprit was Andrew Rowan and his talk, shared in the video below. It is another opportunity to learn and understand a bit more about Bayesian Deep Learning. The talk presents a new probabilistic programming framework called Edward and also shows that Dropout, a technique commonly used as regularization against overfitting in deep neural networks, might also be viewed as a Bayesian approximation method in deep neural network settings.

The Edward probabilistic programming framework was designed to extend the common Python and TensorFlow libraries for deep neural network workloads with probabilistic tasks such as inference, including variational inference as implemented for several benchmark data sets.

One of the first important takeaways from this talk is the importance probabilistic programming might have for Artificial Intelligence (AI) safety, especially when AI is applied in fields such as medicine or finance.

Then comes the explanation of the modern revival of Bayesian Deep Learning and its links with Monte Carlo estimators for doing variational inference in deep neural nets; from there the connection to the development of probabilistic programming and automated inference follows seamlessly.

The Edward framework is actually TensorFlow with added features such as random variables and inference algorithms. Being TensorFlow based, it is a compute engine built on computational graphs. The talk then gives an intuitive understanding of its workflow: build a model of an inference problem, infer the model given data, and then perform a *criticism* of the model given the data, which Andrew Rowan specified as Posterior Predictive Checks that test whether the model reproduces features of the data.

Further, Edward allows for the implementation of scalable black box variational inference techniques through Monte Carlo sampling. This comes at the cost of noisy gradient estimates, whose variance Edward reduces while automating the whole process.

But how is Dropout in non-probabilistic neural nets connected to variational inference in Bayesian neural nets? By reparametrizing and factorizing the variational distribution over the weights of a neural network into a non-Gaussian (Bernoulli) sampling distribution, so that the variational objective takes the form of a Dropout objective function. In fact, the equivalence between a dropout objective and the approximate ELBO (with Monte Carlo estimators) was demonstrated earlier. As such, Dropout is a kind of Bayesian approximation.
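In practice this view suggests keeping the Bernoulli masks switched on at prediction time and averaging several stochastic passes, often called MC Dropout. The sketch below is a toy illustration with a single linear unit in plain Python; the weights, keep probability, number of passes and function names are assumptions, and it does not use the Edward API.

```python
import random
import statistics


def dropout_forward(x, weights, keep_prob, rng):
    """One stochastic pass of a linear unit with a Bernoulli mask on its inputs."""
    total = 0.0
    for xi, wi in zip(x, weights):
        if rng.random() < keep_prob:       # keep this input with probability keep_prob
            total += wi * xi / keep_prob   # inverted-dropout scaling keeps the mean unchanged
    return total


def mc_dropout_predict(x, weights, keep_prob=0.8, passes=2000, seed=0):
    """Average many stochastic passes; the spread acts as an uncertainty estimate."""
    rng = random.Random(seed)
    samples = [dropout_forward(x, weights, keep_prob, rng) for _ in range(passes)]
    return statistics.mean(samples), statistics.stdev(samples)


mean, spread = mc_dropout_predict([1.0, 2.0, 3.0], [0.5, -0.25, 1.0])
```

The mean approximates the deterministic output of the unit, while the spread across passes gives a rough measure of predictive uncertainty that a single deterministic pass cannot provide.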

MC Dropout proved competitive for convolutional neural networks, recurrent neural networks and reinforcement learning in experiments with the CIFAR-10 dataset and other datasets.

The most probable future lines of research in this field were outlined by Dr. Andrew Rowan as: better variational posterior approximations (normalizing flows in PyMC3, hierarchical variational models, etc.) and lower variance ELBO estimators (less noisy Monte Carlo estimators).

*featured image: Andrew Rowan – Bayesian Deep Learning with Edward (and a trick using Dropout)*

]]>


I also recommend that the reader fork the GitHub repository Tensorflow-based Recommendation systems, where a detailed description of this development is available along with the whole code base:

Tensorflow-based Recommendation systems

Factorization models are very popular in recommendation systems because they can be used to discover latent features underlying the interactions between two different kinds of entities. There are many variations of factorization algorithms (SVD, SVD++, factorization machine, …). When implementing them or developing new ones, you probably spend a lot of time on the following areas rather than modeling:

- Derivative calculation
- Variant SGD algorithm exploration
- Multi-thread acceleration
- Vectorization acceleration

Tensorflow is a general computation framework using data flow graphs although deep learning is the most important application of it. With Tensorflow, derivative calculation can be done by auto differentiation, which means that you only need to write the inference part. It provides variant fancy SGD learning algorithms, CPU/GPU acceleration, and distributed training in a computer cluster. Since Tensorflow has some embedding modules for word2vec-like application, it is supposed to be a good platform for factorization models as well, even in production. Please note that embedding in deep learning is equivalent to factorization in shallow learning!
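The remark that embeddings amount to factorization can be made concrete with a tiny matrix factorization trained by plain SGD. The sketch below is pure Python with hand-written gradients, which is exactly the bookkeeping a framework with automatic differentiation would remove. The ratings, rank, learning rate and function names are hypothetical and are not taken from the repository.

```python
import random


def factorize(ratings, n_users, n_items, rank=2, lr=0.05, reg=0.01, epochs=500, seed=0):
    """Fit user and item embeddings whose dot product approximates each rating."""
    rng = random.Random(seed)
    users = [[rng.gauss(0, 0.1) for _ in range(rank)] for _ in range(n_users)]
    items = [[rng.gauss(0, 0.1) for _ in range(rank)] for _ in range(n_items)]
    for _ in range(epochs):
        for u, i, r in ratings:
            pred = sum(a * b for a, b in zip(users[u], items[i]))
            err = r - pred
            for d in range(rank):
                pu, qi = users[u][d], items[i][d]
                # Hand-derived gradient step of the regularized squared error.
                users[u][d] += lr * (err * qi - reg * pu)
                items[i][d] += lr * (err * pu - reg * qi)
    return users, items


def predict(users, items, u, i):
    return sum(a * b for a, b in zip(users[u], items[i]))


# (user, item, rating) triples for a toy 3 x 3 interaction matrix.
ratings = [(0, 0, 5.0), (0, 1, 1.0), (1, 0, 4.0), (1, 2, 2.0), (2, 1, 1.0), (2, 2, 5.0)]
users, items = factorize(ratings, n_users=3, n_items=3)
rmse = (sum((r - predict(users, items, u, i)) ** 2 for u, i, r in ratings) / len(ratings)) ** 0.5
```

Each row of `users` and `items` plays the role of an embedding vector; unobserved pairs such as user 0 and item 2 can then be scored with `predict`.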

Description

This talk will demonstrate how to harness a deep-learning framework such as Tensorflow, together with the usual suspects such as Pandas and Numpy, to implement recommendation models for news and classified ads.

Abstract

Recommender systems are used across the digital industry to model users’ preferences and increase engagement. Popularised by the seminal Netflix prize, collaborative filtering techniques such as matrix factorisation are still widely used, with modern variants using a mix of meta-data and interaction data in order to deal with new users and items. We will demonstrate how to implement a variety of models using Tensorflow, from simple bi-linear models expressed as shallow neural nets to the latest deep incarnations of Amazon DSSTNE and Youtube neural networks. We will also use TensorBoard and particularly the embedding projector to visualise the latent space for items and metadata.

The final part of the talk was a question and answer session in which Guillaume addressed some pertinent questions. The comparison between TensorFlow and the more *Pythonesque* PyTorch came up on several occasions, with the speaker finally giving his own opinion: he regards TensorFlow as the more robust tool set for the distributed compute workloads of recommender systems with matrix factorization, although its graph declaration is static, whereas PyTorch builds its graph dynamically, which allows much more flexible interplay with the rest of the Python software ecosystem.

*featured image: Guillaume Allain – Recommender systems with Tensorflow*

]]>
