The passing year of 2016 was a year awash with many, too many to count, interesting papers within the fields of Deep Learning and Artificial Intelligence. This is a sign of a healthy research community, a period of unprecedented creative efforts and a probable breakthrough period in the technologies around those subjects, through to becoming mainstream implementations within science & technology research as well as in business settings.

Less often we encounter a good critical survey of the best efforts or the better part of that research. A good selection of the most relevant such research is indeed needed, for the increasing importance of deep learning and artificial intelligence should also raise some concerns as to its proper understanding and how and where these technologies will be better used, always in view of them being useful and potential enhancements of our human systems.

The paper I would like to share, describe (tentatively explain…) and review here today is, I think, part of those kind of research efforts. The chosen paper might be one of many obfuscated in the noise of other, also undoubtedly interesting and important efforts, because it joins together the deep leaning relevant topic of Reinforcement Learning with two other important developments: restricted Boltzmann machine and a statistical sampling method in quantum physis called quantum annealing.

The paper refers to an introductory paper published early in 2016 that also may have gone unnoticed by many analysts and pundits, precisely titled *Quantum Boltzmann Machine, *which introduced the topic developed in here that appears to gracefully join deep reinforcement learning and its techniques and the framework of quantum physis. This is hardly something to go unnoticed by the interested audience, as these are fields of science & technology getting a lot of attention for their promise to be deeply disruptive and providing breakthrough innovations transformative of our daily lives.

In fact there has been recently some speculation (a qualified one, though…) that the fusion of these cutting-edge fields might lead us to what is called a period of quantum supremacy within a broader Quantum Computation paradigm – that is quantum computers would perform better than classical computers with less error both from a software and a hardware perspective. The relevant issues are to do with the computational cost still involved with chaotic quantum systems, and with how quantum circuits in a quantum computer deal with noise and errors. These recent developments point in a direction where supercomputing of qubits coupled with cross entropy methods might lead to successful approximations of practical circuit fidelity with the theoretical limit of circuit fidelity. Advancements in the manufacturing of superconducting qubits opens up this quantum supremacy possibility, as it is rightly mentioned in the first paragraph of the paper here reviewed:

In view of recent advancements in the manufacturing of superconducting qubits [1, 2] and systems of qubits with second-order interactions, an important question for quantum computation is the existence of quantum supremacy (e.g., in [3]), that is, whether near-term quantum devices can outperform classical computers in some—in fact, any—computational task.

The Canadian authors of the paper propose to fuse the methods of simulated quantum annealing (SQA) with a design of their own of a reinforcement learning algorithm where the interplay between visible and hidden nodes is flexibly tuned within a deep network of layers. They propose that the first and last layers of the deep network to be visible. Further, this is done within a layout where the Ising spin Hamiltonian (the energy model of the nodes of the network) are nothing else than the qubits of a deep Boltzmann machine (DBM). The authors claim (and empirically demonstrate) superior performance of this setting compared with both the Restricted Boltzmann Machine (RBM) and the General Boltzman Machine (GBM) setting. The framework then implemented for the training network as a Quantum Boltzman Machine in the presence of a significant transverse filed is shown to further improve the reinforcement learning methods using DBMs.

### Reinforcement Learning Using Quantum Boltzmann Machines

Abstract.

We investigate whether quantum annealers with select chip layouts can outperform classical computers in reinforcement learning tasks. We associate a transverse field Ising spin Hamiltonian with a layout of qubits similar to that of a deep Boltzmann machine (DBM) and use simulated quantum annealing (SQA) to numerically simulate quantum sampling from this system. We design a reinforcement learning algorithm in which the set of visible nodes representing the states and actions of an optimal policy are the first and last layers of the deep network. In absence of a transverse field, our simulations show that DBMs train more effectively than restricted Boltzmann machines (RBM) with the same number of weights. Since sampling from Boltzmann distributions of a DBM is not classically feasible, this is evidence of advantage of a non-Turing sampling oracle. We then develop a framework for training the network as a quantum Boltzmann machine (QBM) in the presence of a significant transverse field for reinforcement learning. This further improves the reinforcement learning method using DBMs.

.

### Introduction

From the introduction first paragraph, following the continued remarks highlighted above, the proposal in this paper is outlined:

(…) With this motivation, we consider reinforcement learning as the computational task of interest, and design a method of reinforcement learning using a layout of quantum bits similar to that of a deep Boltzmann machine (DBM) (see Fig. 1b for a graphical representation). We use simulated quantum annealing (SQA) to demonstrate the advantage of reinforcement learning using quantum Boltzmann machines over its classical counterpart, in small problem instances.

A nice overview of reinforcement learning and its applications then follows. I thought the lengthy two paragraphs worth to display here, as it is good simplified, intelligible conceptual explanation of the topic, with the important distinctions from other machine learning frameworks, such as supervised learning:

Reinforcement learning ([4], known also as neuro-dynamic programming [5]) is an area of optimal control theory at the intersection of approximate dynamic programming and machine learning. It has been used successfully for many applications, in fields such as engineering [6, 7], sociology [8, 9], and economics [10, 11].

It is important to differentiate between reinforcement learning and common streams of research in machine learning. For instance, in supervised learning, the learning is facilitated by training samples provided by a source external to the agent and the computer. In reinforcement learning, the training samples are provided only by the interaction of the agent itself with the environment. For example, in a motion planning problem in an uncharted territory, it is desired that the agent learns in the fastest possible way to correctly navigate with the least number of blind decisions required to be taken. This is known as the dilemma ofexploration versus exploitation; that is, neither exploration nor exploitation can be pursued exclusively without facing a penalty or failing at the task. The goal is hence not just to design an algorithm that eventually converges to an optimal policy, but for it to be able to generate good policies early in the learning process. We refer the reader to [4, Ch. 1.1] for a thorough introduction to the use cases and problem scenarios addressed by reinforcement learning.

The mathematical conceptual framework for reinforcement learning is a sophisticated one. In mathematics there are often situations where the underlying initial assumptions and underpinnings of the concepts are easy to follow in the beginning, but they rapidly turn much more complex and hard to understand, and we need to have some capacity to somehow rearrange our intuitive perspectives and cognitive tendencies to a proper following. This is the case with reinforcement learning, compounded here by the same issue as to the mathematics of quantum physics, that is even more sophisticated:

The core idea in reinforcement learning is defining an operator on the Banach space of real-valued functions on the set of states of a system such that a fixed point of the operator carries information about an optimal policy of actions for a finite or infinite number of decision epochs. A numerical method for computing this fixed point is to explore this function space by travelling in a direction that minimizes the distance between two consecutive applications of the contraction mapping operator [5].

This optimization task, called learning in the context of reinforcement learning, can be performed by locally parametrizing the above function space using a set of auxiliary variables, and applying a gradient method to these variables. One approach for such a parametrization, due to [12], is to use the weights of a restricted Boltzmann machine (RBM) (see Fig. 1a) as the parameters, and the free energy of the RBM as an approximator for the elements in the function space. The descent direction is then calculated in terms of the expected values of the nodes of the RBM.

From here the authors claim their belief as to the superior performance of DBMs compared with RBMs (in this special case of a quantum system instead of a neural network operating in a classical regime, or at the boundary…):

It follows from the universal approximation theorem [13] that RBMs can approximate any joint distribution over binary variables [14, 15]. However, in the context of reinforcement learning, RBMs are not necessarily the best choice for approximating Q-functions relating to Markov decision processes because [14] and [15] show that RBMs may require an exponential number of hidden variables with respect to the number of visible variables in order to approximate the desired joint distribution. On the other hand, DBMs have the potential to model higher order dependencies than RBMs, and are more robust than deep belief networks [16].

One may, therefore, consider replacing the RBM with other graphical models and investigating the performance of the models in the learning process. Except in the case of RBMs, calculating statistical data from the nodes of a graphical model amounts to sampling from a Boltzmann distribution, creating a bottleneck in the learning procedure if performed classically [17].

This work is an investigative approach, we should remind ourselves. The authors are notwithstanding optimistic about the prospects for this line of research, as the feasibility of a future operative quantum computer may become a reality sooner than is currently believed by the most skeptics:

As we explain in what follows, DBMs are good candidates for reinforcement learning tasks. Moreover, an important advantage of a DBM layout for a quantum annealing system is that the proximity and couplings of the qubits in the layout are similar to those of a sequence of bipartite blocks in D-Wave devices [18], and it is therefore feasible that such layouts could be manufactured in the near future. This is why, instead of attempting to embed a Boltzmann machine structure on an existing quantum annealing system as in [19, 20, 21, 22], we work under the assumption that the network itself is the native connectivity graph of a near-future quantum annealer, and, using numerical simulations, we attempt to understand its applicability to reinforcement learning.

(…)

Quantum Monte Carlo (QMC) numerical simulations have been found to be useful in simulating time dependant quantum systems. Simulated quantum annealing (SQA) [23, 24], one of the many flavours of QMC methods, is based on the Suzuki–Trotter expansion of the path integral representation of the Hamiltonian of Ising spin models in the presence of a transverse field driver Hamiltonian. Even though the efficiency of SQA for finding the ground state of an Ising model is topologically obstructed [25], we consider the samples generated by SQA to be good approximations of the Boltzmann distribution of the quantum Hamiltonian [26]. Experimental studies have shown similarities in the behaviour of SQA and that of quantum annealing [27, 28] and its physical realization [29, 30] by D-Wave Systems.

Following this it is briefly described the reasoning supporting the work on the Quantum Boltzmann Machine paper, which happens to be based on a graphical model in which the energy operator of the quantum Hamiltonian in the presence of a transverse field is such a device:

The classical counterpart of SQA is conventional simulated annealing (SA), which is based on thermal annealing. This algorithm can be used to create Boltzmann distributions from the Ising spin model only in the absence of a transverse field. It should, therefore, be possible to use SA or SQA to approximate the Boltzmann distribution of a classical Boltzmann machine. However, unlike SA, it is possible to use SQA not only to approximate the Boltzmann distribution of a classical Boltzmann machine, but also that of a graphical model in which the energy operator is a quantum Hamiltonian in the presence of a transverse field. These graphical models, called quantum Boltzmann machines (QBM), were first introduced in [31].

We use SQA simulations to demonstrate evidence that a quantum annealing device that approximates the distribution of a DBM or a QBM may improve the learning process compared to a reinforcement learning method that uses classical RBM techniques. Other studies have shown that SQA is more efficient than thermal SA [23, 24]. Therefore, our method in conjunction with SQA can also be viewed as a quantum-inspired approach for reinforcement learning.

What distinguishes our work from current trends in quantum machine learning is that (i) we consider the use of quantum annealing in reinforcement learning applications rather than frequently studied classification or recognition problems; (ii) using SQA-based numerical simulations, we assume that the connectivity graph of a DBM directly maps to the native layout of a feasible quantum annealer; and (iii) the results of our experiments using SQA to simulate the sampling of an entangled system of spins suggest that using quantum annealers in reinforcement learning tasks has an advantage over thermal sampling.

### Discussion and Results

After this long but comprehensive introduction we now present briefly the main results of this paper together with a discussion of challenges and the possible future lines for future developments.

Maze traversal is a problem typically used to develop and benchmark reinforcement learning algorithms [32]. A maze is structured as a two-dimensional grid of r rows and c columns in which a decision-making agent is free to move up, down, left, right, or stand still. During this maze traversal, the agent encounters obstacles (e.g., walls), rewards (e.g., goals), and penalties (negative rewards, e.g., a pit). Each cell of the maze can contain either a deterministic or stochastic reward, a wall, a pit, or a neutral value.

The goal of the reinforcement learning algorithm in the maze traversal problem is for the agent to learn the optimal action to take in each cell of the maze by maximizing the total reward, that is, finding a route across the maze that avoids walls and pits while favouring rewards. This problem can be modelled as a Markov decision process (MDP)…

(…)

The RBM reinforcement learning algorithm is due to Sallans and Hinton [12]. This algorithm uses the update rules (6) and (7) to update the weights of an RBM and (4) to calculate the expected values of random variables associated with the hidden nodes (referred to as activations of the hidden nodes in machine learning terminology). As explained in Sec. 5.6, the main advantage of RBM is that it has explicit formulas for the hidden node activations given the values of the visible nodes. Moreover, only for RBMs can the entropy portion of the free energy (26) be written in terms of the activations of the hidden nodes. More-complicated network architectures do not possess this property, so there is a need for a Boltzmann distribution sampler.

(…)

Since we are interested in the dependencies between states and actions, we consider a DBM architecture that has a layer of states connected to the first layer of hidden nodes, followed by multiple hidden layers, and a layer of actions connected to the final layer of hidden nodes (see Fig. 1). We demonstrate the advantages of this deep architecture trained using SQA and the derivation in Sec. 5.6 of the temporal-difference gradient method for reinforcement learning using general Boltzmann machines (GBM).

For Tr independent training runs of the same reinforcement learning algorithm, Ts training samples are used for reinforcement learning. The fidelity measure at the i-th training sample is defined by:

where A(s, i, l) denotes the action assigned at the l-th run and i-th training sample to the state s. In our experiments, each algorithm is run 1440 times, and for each run of an algorithm, Ts = 500 training samples are generated.

Fig. 3a and 3b show the fidelity of the generated policies obtained from various reinforcement learning experiments on two clear 3 × 5 mazes. In Fig. 3a, the maze includes one reward, one wall, and one pit, and in Fig. 3b, the maze includes two stochastic rewards in addition. In these experiments, the training samples are generated by sweeping over the maze. Each sweep iterates over the maze elements in the same order. This explains the periodic behaviour of the fidelity curves (cf. Fig. 3c).

The curves labelled ‘QBM-RL’ represent the fidelity of reinforcement learning using QBMs. Sampling from the QBM is performed using SQA. All other experiments use classical Boltzmann machines as their graphical model. In the experiment labelled ‘RBM-RL’, the graphical model is an RBM, trained classically using formula (4). The remaining curve is labelled ‘DBM-RL’ for classical reinforcement learning using a DBM. In these experiments, sampling from configurations of the DBM is performed with SQA (with Γf = 0.01). The fidelity results of DBM-RL coincide closely with those of sampling configurations of the DBM using simulated annealing;

(…)

To demonstrate the performance of RBM-RL and DBM-RL with respect to scaling, we define another measure called average fidelity, av`, where we take the average fidelity over the last ` training samples of the fidelity measure. Given Ts total training samples and fid(i) as defined above, we write

(…)

Again the point of the improved performance of DBM-RL algorithm over RBM-RL, which is even further compounded by sampling in the presence of transverse field:

The fidelity curves in Fig. 3 show that the DBM-RL algorithm outperforms the RBM-RL algorithm with respect to the number of training samples. Therefore, we expect that in conjunction with a high-performance sampler of Boltzmann distributions (e.g., a quantum or a quantum-inspired oracle taken as such), the DBM-RL algorithm improves the performance of reinforcement learning. The QBM-RL algorithm improves upon the DBM-RL results even further by taking advantage of sampling in presence of a significant transverse field.

In each experiment, the fidelity curves from DBM-RL produced using SQA with Γf = 0.01 match the ones produced using SA. This is consistent with our expectation that using SQA with Γ → 0 produces samples from the same distribution as SA, namely, the Boltzmann distribution of the classical Ising Hamiltonian with no transverse field.

note: Γ → Gamma Function

### Methods

SQA methods are a class of quantum-inspired algorithms that perform discrete optimization by classically simulating the quantum tunnelling phenomena (see [34, p. 422] for an introduction). The algorithm used in this paper is a single spin-flip version of quantum Monte Carlo numerical simulation based on the Suzuki–Trotter formula, and uses the Metropolis acceptance probabilities. The SQA algorithm simulates the quantum annealing phenomena of an Ising spin model with a transverse field, that is,

where σˆz and σˆx represent the Pauli z- and x-matrices, respectively, and time t ranges from 0 to 1.

(…)

In our implementations, we have used a linear transverse field schedule for the SQA algorithm as in [35] and [36]. Based on the Suzuki–Trotter formula, the key idea of this algorithm is to approximate the partition function of the Ising model with a transverse field as a partition function of a classical Hamiltonian denoted by Heff, corresponding to a classical Ising model of one dimension higher. More precisely,

where r is the number of replicas, Jˆ+ = 1/2 β log coth ( Γβ / r) , and σik represent spins of the classical system of one dimension higher.

In Algorithm 1, we recall the steps of the classical reinforcement learning algorithm using an RBM with a graphical model similar to that shown in Fig. 1a. We set the initial Boltzmann machine weights using Gaussian zero-mean values with a standard deviation of 1.00, as is common practice for implementing Boltzmann machines [37]. Consequently, this initializes an approximation of a Q-function and a policy π given by

We associate a classical spin variable σh to each hidden node h. Then the activations of the hidden nodes are calculated via

where σ(·) is the sigmoid function. Here s and a are encodings of state s and action a, respectively, using the state and action nodes of the Boltzmann machine. In our experiments, all Boltzmann machines have as many state nodes as |S| and as many action nodes as |A|.

According to lines 4 and 5 of Algorithm 2, the samples from the SA or SQA algorithm are used to approximate the free energy of the classical DBM at points (s1, a1) and (s2, a2) using

The last algorithm is QBM-RL, presented in Algorithm 3. The initialization is performed as in Algorithms 1 and 2. However, according to lines 4 and 5, the samples from the SQA algorithm are used to approximate the free energy of a QBM at points (s1, a1) and (s2, a2) by computing the free energy corresponding to an effective classical Ising spin model of one dimension higher representing the quantum Ising spin model of the QBM,

In this case, <Heffs,a> is approximated by the average energy of the entire system of one dimension higher and P(c|s, a) is approximated by the normalized frequency of the configuration c of the entire system of one dimension higher (hence, there are only 150 sample points for each input instance in this case).

### Conclusion

The paper follows from this with a supplementary material featuring the details of the Markov Decision Process used in this paper, the value iteration functions, the Q-functions and the temporal-difference gradient descent. It is especially important the part about reinforcement learning with clamped Boltzmann machines. As always I recommend the interested reader to check through by reading the whole paper. For now a leave it here with a concluding remark: the methodology and conceptual framework presented in this paper may be of an investigative and speculative nature (well grounded on empirical algorithmic efforts, though…), as of yet, but the solid background and wider scope of the research might be useful and point in a direction of further improvement both for deep reinforcement learning and to the achievement of quantum supremacy.

The runtime and computational resources needed for DBM-RL and QBM-RL in comparison with RBM-RL are not investigated here. We expect that in view of [15] the size of the RBMs needed to solve larger maze problems will grow exponentially. It is, therefore, an interesting research path to pursue the extrapolation of the asymptotic complexity and size of the DBM-RL and QBM-RL algorithms in the quest for quantum supremacy.

*featured image: Figure 1. (a) The general RBM layout used in RBM-based reinforcement learning: the visible layer on the left consists of state and action nodes, and is connected to the hidden layer, forming a complete bipartite graph. (b) The general DBM layout used in DBM-based reinforcement learning: the visible nodes on the left represent states and the visible nodes on the right represent actions. The training procedure captures the correlations between states and actions in the weights of the edges between the nodes.*