Success with deep learning architectures, surrogate random matrices and spectral ergodicity

Every now and then I wonder how good my social media connections really are. It has often crossed my mind that social media is more noise than signal, with few of the genuinely good signals of the best we can be and do as human beings. There is a lot on social media that falls short of what human beings should be and do. But then, not so often yet indisputably, you happen to find real value and good work there. That was the case recently with one of my LinkedIn connections: Mehmet Suzen, a physicist and data scientist.

Mehmet co-authored a short paper posted on arXiv in April this year, and it makes a fitting short paper review about deep learning architectures for The Intelligence of Information's start of the week. The paper investigates how higher network connectivity might increase learning accuracy in deep neural networks. For this, the team of researchers used two metrics to quantify what is called spectral ergodicity (an advanced mathematical concept that readers of this blog are encouraged to look into further): one somewhat new, the Thirumalai-Mountain (TM) metric, and one already widely known and used by the research community in statistics, machine learning and mathematical optimization, the Kullback-Leibler (KL) divergence.

This research effort was developed within a computational framework that has given rise to new software, written by Mehmet. Followers and readers are encouraged to check it out further and fork it in its GitHub repository.

Spectral Ergodicity in Deep Learning Architectures via Surrogate Random Matrices

Abstract

Using random matrix ensembles, mimicking weight matrices from deep and recurrent neural networks, we investigate how increasing connectivity leads to higher accuracy in learning with a related measure on eigenvalue spectra. For this purpose, we quantify spectral ergodicity based on the Thirumalai-Mountain (TM) metric and the Kullback-Leibler (KL) divergence. As a case study, circular random matrix ensembles of different sizes, i.e., the circular unitary ensemble (CUE), circular orthogonal ensemble (COE), and circular symplectic ensemble (CSE), are generated. Eigenvalue spectra are computed along with the approach to spectral ergodicity with increasing connectivity. As a result, it is argued that the success of deep learning architectures can be attributed to spectral ergodicity conceptually, as this property prominently decreases with increasing connectivity in surrogate weight matrices.

Random matrices are a development of probability theory and mathematical physics that has successfully found applications in numerous fields. From finance to signal processing, and from macroeconomics to neuroscience, random matrices are important for the mathematical representation of the complex statistical properties of physical systems:

Characterising statistical properties of different random matrix ensembles plays a critical role in understanding the nature of the physical models they represent. For example, in neuroscience, neuronal dynamics can be encoded as a synaptic connectivity matrix in different network architectures [2–4], as well as in the weight matrices of deep learning architectures [5, 6], possibly with dropout [7]. Similarly, transition matrices appear in stochastic materials simulations in discrete space [8, 9].

The eigenvalue spectrum, another piece of advanced mathematical jargon worth learning more about, provides important information regarding both the structure and the dynamics of physical systems. For example, the learning rate in a recurrent neural network appears to be influenced by the spectral radius of its weight matrices, and the spectral radius is determined by the eigenvalue spectrum. In spite of the somewhat rough English by Mehmet et al. (my own constructive criticism), here are the two paragraphs that outline the essential message and purpose of the paper:

Eigenvalue spectrum entails information regarding both structure and dynamics. For example, the spectral radius of weight matrices in recurrent neural networks influences the learning rate, i.e., training [10]. Ergodic properties are not much investigated in this context, while spectral ergodicity is prominently used in characterizing quantum systems whose energy spectra show the so-called analogy to chaotic motion, as there are no quantum trajectories in the classical mechanics sense [11].

The concept of ergodicity appears in statistical mechanics as the statement that the time average of a physical dynamics equals its ensemble average [9, 12]. The definition is not uniform in the literature [9]. For example, a Markov chain transition matrix is called ergodic if all eigenvalues are below one, implying any state is reachable from any other. Here, spectral ergodicity means that the eigenvalue spectrum averaged over an ensemble of matrices, and the spectrum obtained from a single matrix, a realisation from the ensemble, as it is or via an averaging procedure, i.e., a spectral average, generate the same spectra within statistical accuracy.
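As a concrete illustration of the Markov-chain sense of ergodicity mentioned in the quote, here is a minimal sketch of my own (not from the paper). A slightly more precise statement is that a row-stochastic matrix always has a leading eigenvalue of exactly one, and it is the subdominant eigenvalues having modulus below one that makes the chain forget its starting state and visit every state:

```python
import numpy as np

# Hypothetical 2-state transition matrix (rows sum to 1) of an ergodic chain.
P = np.array([[0.9, 0.1],
              [0.4, 0.6]])

# Leading eigenvalue is exactly 1; the subdominant eigenvalue has modulus < 1.
print("eigenvalue moduli:", np.sort(np.abs(np.linalg.eigvals(P))))

# Ergodicity in practice: any starting distribution converges to the same
# stationary distribution under repeated application of P.
pi = np.array([1.0, 0.0])
for _ in range(200):
    pi = pi @ P
print("stationary distribution (approx.):", pi)   # ~ [0.8, 0.2]
```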

Following this reasoning, the context of neural networks is one where these concepts bear fruit. Indeed, in that context spectral ergodicity may have at least two implications: one from considering the ensemble of weight matrices across different layers in feed-forward architectures, and another from considering the connectivity architecture of recurrent networks. Quantifying a measure of spectral ergodicity therefore becomes central. The authors provide the following one:

A measure of spectral ergodicity is defined as follows for a finite ensemble, inspired by the Thirumalai-Mountain (TM) metric [13, 14], for M random matrices of size N×N,

$$\Omega_{N}(b_{k}) \;=\; \frac{1}{M} \sum_{j=1}^{M} \left[ \rho_{j}(b_{k}) - \bar{\rho}(b_{k}) \right]^{2},$$

with spectral spacing $b_{k}$, where $k = 1, \dots, K$, spectral density $\rho_{j}(b_{k})$, where $j = 1, \dots, M$, and ensemble-averaged spectral density $\bar{\rho}(b_{k})$,

$$\bar{\rho}(b_{k}) \;=\; \frac{1}{M} \sum_{j=1}^{M} \rho_{j}(b_{k}).$$

The essential idea of such a measure is to capture, empirically, the fluctuations of each individual eigenvalue spectrum against the ensemble-averaged one. Note that the spectral spacing $b_{k}$ plays the role that time plays in the original TM metric, in this naive formulation.
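To make this concrete, here is a minimal numerical sketch of my own (not the authors' code; the uniform eigenphase ensemble and the bin grid are purely illustrative assumptions): it estimates each matrix's spectral density $\rho_{j}(b_{k})$ on a common grid of bins $b_{k}$, forms the ensemble average $\bar{\rho}(b_{k})$, and returns the fluctuation measure $\Omega_{N}(b_{k})$.

```python
import numpy as np

def spectral_densities(spectra, bins):
    """Estimate rho_j(b_k) for each spectrum j by histogramming on a common bin grid b_k."""
    return np.array([np.histogram(s, bins=bins, density=True)[0] for s in spectra])

def tm_spectral_ergodicity(spectra, bins):
    """Omega_N(b_k) = (1/M) * sum_j [rho_j(b_k) - rho_bar(b_k)]^2 (TM-inspired measure)."""
    rho = spectral_densities(spectra, bins)        # shape (M, K)
    rho_bar = rho.mean(axis=0)                     # ensemble-averaged density rho_bar(b_k)
    return np.mean((rho - rho_bar) ** 2, axis=0)   # fluctuations per bin b_k

# Illustrative ensemble: M surrogate spectra of N eigenphases drawn uniformly on [0, 2*pi),
# standing in for eigenvalues on the unit circle as in the circular ensembles.
rng = np.random.default_rng(42)
M, N = 50, 64
spectra = rng.uniform(0.0, 2 * np.pi, size=(M, N))
bins = np.linspace(0.0, 2 * np.pi, 33)             # K = 32 spectral bins b_k

print(tm_spectral_ergodicity(spectra, bins).round(4))
```

Histogram binning is just one way to estimate the densities here; any consistent density estimate on a shared grid would serve the same purpose.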

However, ergodicity, in the case of spectral ergodicity, should be defined differently, as a function of matrix size, in line with the asymptotic expectation that spectral and ensemble averages yield the same statistics. Enter the Kullback-Leibler (KL) divergence, with which a distance is defined between the two distributions at consecutive ensemble sizes, $N_{k} > N_{k-1}$:

since KL is not symmetric, we sum KL in both directions to quantify the approach to spectral ergodicity,

$$D_{\mathrm{se}}(N_{k}) \;=\; D_{\mathrm{KL}}\!\left( \Omega_{N_{k}} \,\|\, \Omega_{N_{k-1}} \right) \;+\; D_{\mathrm{KL}}\!\left( \Omega_{N_{k-1}} \,\|\, \Omega_{N_{k}} \right).$$

One interpretation of $D_{\mathrm{se}}(N_{k})$ is that it measures how increasing connectivity size influences the approach to spectral ergodicity. It would indirectly measure information content as well, owing to the connection of the KL divergence, i.e., relative entropy, to mutual information.
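A minimal sketch of this symmetrized-KL step, again my own illustration rather than the paper's implementation (the $\Omega$ curves below are placeholder values): the $\Omega_{N_{k}}$ curves at two consecutive matrix sizes are normalized into distributions and the KL divergence is summed in both directions.

```python
import numpy as np

def kl_divergence(p, q, eps=1e-12):
    """Discrete KL divergence D_KL(p || q), with a small epsilon for numerical safety."""
    p = np.asarray(p, dtype=float) + eps
    q = np.asarray(q, dtype=float) + eps
    p, q = p / p.sum(), q / q.sum()
    return float(np.sum(p * np.log(p / q)))

def d_se(omega_curr, omega_prev):
    """Symmetrized KL between Omega curves at consecutive sizes N_k > N_{k-1}."""
    return kl_divergence(omega_curr, omega_prev) + kl_divergence(omega_prev, omega_curr)

# Placeholder Omega curves at two consecutive matrix sizes.
omega_prev = np.array([0.08, 0.05, 0.06, 0.07])
omega_curr = np.array([0.03, 0.02, 0.02, 0.03])
print("D_se:", d_se(omega_curr, omega_prev))
```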

That is, the approach to deep learning architectures proposed in this paper starts from random matrices standing in for the weight matrices of feed-forward architectures and for the connectivity ensembles of recurrent networks, rather than from a specific architecture and learning algorithm. The advantages of the approach are outlined brilliantly (better English this time…):

Using random matrices brings the distinct advantage of generalizing the results to any architecture using a simple toy system and being able to enforce certain restrictions on the spectral radius. For example, constraining the spectral radius to unity simulates preventing the learning algorithm from running into numerical problems when training the network [10]. Circular random matrix ensembles serve this purpose well, as their eigenvalues lie on the unit circle in the complex plane.
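To see how such a circular-ensemble member can be drawn, here is a minimal sketch of my own of the standard QR-based construction of a Haar-distributed unitary matrix (a CUE member), in the spirit of Mezzadri's approach cited in the paper; it is not the authors' parallel implementation [18]. Its eigenvalues sit on the unit circle, so the spectral radius is fixed at unity by construction.

```python
import numpy as np

def sample_cue(n, rng):
    """Draw an n x n Haar-distributed unitary matrix (CUE member) via QR decomposition."""
    z = (rng.standard_normal((n, n)) + 1j * rng.standard_normal((n, n))) / np.sqrt(2.0)
    q, r = np.linalg.qr(z)
    # Multiply each column by the phase of R's diagonal so that Q is Haar-distributed,
    # not merely unitary.
    d = np.diagonal(r)
    return q * (d / np.abs(d))

rng = np.random.default_rng(0)
u = sample_cue(8, rng)
eigs = np.linalg.eigvals(u)
print("max deviation of |eigenvalue| from 1:", np.max(np.abs(np.abs(eigs) - 1.0)))
```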

The neuroscience research connection is also outlined:

A typical weight matrix for a layer, W, in deep learning architectures, i.e., feed-forward multi-layer neural networks, is generally not square and has no self-recurrent connections. We take the assumption that the eigenvalue spectrum of $W^{T}W$ is directly comparable to the behaviour of W in learning, and that the effect of diagonal elements as recurrent connections is small for our purpose. Another interpretation of such random matrix ensembles appears in synaptic matrices for brain dynamics and memory [4].
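To illustrate the $W^{T}W$ assumption in the quote, here is a small sketch of my own (the layer shape is an arbitrary illustrative choice): a non-square layer weight matrix W has no eigenvalue spectrum of its own, but $W^{T}W$ is square and symmetric, and its nonzero eigenvalues, the squared singular values of W, are what one would study instead.

```python
import numpy as np

rng = np.random.default_rng(1)
n_out, n_in = 64, 128                   # illustrative layer shape: W maps n_in inputs to n_out outputs
W = rng.standard_normal((n_out, n_in)) / np.sqrt(n_in)

gram = W.T @ W                          # square, symmetric surrogate for the non-square W
eig_gram = np.linalg.eigvalsh(gram)     # ascending eigenvalues of W^T W
sing_sq = np.sort(np.linalg.svd(W, compute_uv=False) ** 2)

# The nonzero eigenvalues of W^T W coincide with the squared singular values of W.
print(np.allclose(eig_gram[-n_out:], sing_sq))
```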

The other aspect of interest in this paper and its computational development concerns the parallel code generating the circular ensembles, which circumvents the need for spectral unfolding, namely correcting for the varying mean eigenvalue density that Gaussian ensembles, with their semicircle law, would exhibit. Reproducibility is achieved through careful handling of the random seeds:

An implementation of parallel code generating circular ensembles [18], using Mezzadri's approach [19, 20], is utilized. Using circular ensembles circumvents the need for spectral unfolding [11] too, namely eliminating the varying mean eigenvalue density of the semicircle law for Gaussian ensembles [11]. Random seeds are preserved for each chunk from an ensemble, so that the results are reproducible in both parallel and serial runs [18]. Three different circular ensembles are generated.
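The seeding idea in the quote can be sketched as follows; this is my own minimal illustration, not the implementation cited as [18]. By assigning each chunk of the ensemble its own fixed seed, the same matrices are produced whether the chunks are generated serially or handed to parallel workers in an arbitrary order.

```python
import numpy as np

def generate_chunk(chunk_id, chunk_size, n, base_seed=1234):
    """Generate one chunk of an ensemble with its own fixed seed, so its content
    does not depend on which worker produces it or in what order."""
    rng = np.random.default_rng(base_seed + chunk_id)
    return rng.standard_normal((chunk_size, n, n))

# "Serial" run: chunks generated in order.
serial = [generate_chunk(i, chunk_size=4, n=8) for i in range(3)]

# Simulated "parallel" run: the same chunks produced in a different order, then reassembled.
parallel = {i: generate_chunk(i, chunk_size=4, n=8) for i in [2, 0, 1]}

# Reproducibility: each chunk is identical regardless of execution order.
print(all(np.array_equal(serial[i], parallel[i]) for i in range(3)))
```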

The paper then follows with a detailed mathematical formulation of the circular unitary ensemble (CUE) and the circular orthogonal ensemble (COE), which I skip here. But again, this is an advanced mathematical physics toolset worth knowing about.

[Figure from the paper: the approach to spectral ergodicity for the generated circular ensembles.]

This short but interesting paper turned out to be the perfect way to start the week here on this blog. It offers a different approach to deep learning architectures, one where the connectivity of the network and its spectral ergodicity play a role in determining the learning accuracy of the neural network. This departs from the data-centric, algorithmic norm, taking risks in pursuit of insight:

The success of deep learning architectures is attributed to the availability of large amounts of data and to being able to train multiple layers at the same time [5, 6]. However, how this is possible from a theoretical point of view is not well established. We introduce a quantification of spectral ergodicity for random matrices as a surrogate for weight matrices in deep learning architectures and argue that spectral ergodicity might conceptually leverage our understanding of how these architectures learn with high accuracy. From a biological standpoint, our results also show that spectral ergodicity would play an important role in understanding synaptic matrices.

featured image: Periodic and Ergodic Spectral Problems
