Data efficiency in deep reinforcement learning: Neural Episodic Control

We are past middle of march 2017. The Arxiv collection of research papers keeps its tremendous pace of releases of open free to read research papers. It is increasing in quality and openness, a good development for the folks, like me, that like to read quality scientific papers. I am form a country that likes to regard itself as back of the pack when it comes to reading skills. There have been important improvements here; Portugal does even manage to have a new capacity to provide researchers good enough to write papers deserving to be in Arxiv, or in more select, even elitist scientific peer review publications. But with Arxiv open and free to read policy, in places with historical literacy difficulties, it is a matter of doing the proper homework, so to speak, and incentivize everyone to spare a little time and read.

Today I will review a paper I regard one of those worth a read in this already full March 2017. It is interesting read about developments in deep reinforcement learning (DRL), specifically improvements with the data efficiency of the technique, plagued as it is with inefficiency in data requirements; to put it in another way, to achieve human performance in computational tasks, DRL requires orders of magnitude more data than the human brain, performing similar tasks.

The team of researchers from the by now well-known London’s DeepMind, certainly known by the The Information Age readers, produced this new DRL agent, they called Neural Episodic Control, which is claimed to solve the efficiency issues in those algorithms. It combines a slow gradual changing state representations with a rapidly update rule for estimates of the value function. It is demonstrated that it manages to outperform other state-of-the-art, general purpose deep reinforcement learning agents. We will further check it here now, with some comments and the possible fair description of the paper.

The experimental setup and the empirical methods were based and supported around the application of the DRL algorithms within game environments, such as the Atari 2600 game environment. I confess that when I review scientific papers that rely on virtual environments for their empirical methodology, I feel somewhat skeptical from the outset. But I recognize that the skepticism may be misplaced here. The reputation and acceptance of DeepMind research teams, of AlphaGo quality, is just one of several other arguments against misplaced skepticism; on the other hand, games are today not necessarily only entertainment but also a way to build and simulate/model real world environments or ecosystems, gaining serious attention from a more formal audience that spans academic research communities to business applications. We are dealing here with a very specific interplay between artificial intelligence/cognition and  human cognition, which is also a kind of simulation of reality to begin with… We aren’t going to bother to go deeper in philosophical issues about what is reality and its relationship with human cognition here, for obvious reasons.

Neural Episodic Control


Deep reinforcement learning methods attain super-human performance in a wide range of environments. Such methods are grossly inefficient, often taking orders of magnitudes more data than humans to achieve reasonable performance. We propose Neural Episodic Control: a deep reinforcement learning agent that is able to rapidly assimilate new experiences and act upon them. Our agent uses a semi-tabular representation of the value function: a buffer of past experience containing slowly changing state representations and rapidly updated estimates of the value function. We show across a wide range of environments that our agent learns significantly faster than other state-of-the-art, general purpose deep reinforcement learning agents.

DeepMind’s new algorithm adds ‘memory’ to AI – WIRED UK


In the Atari 2600 set of environments, deep reinforcement learning techniques called deep Q-networks, were able to complete the tasks required, but took more than 200 hours of gameplay to achieve scores similar to what a human player achieves with just 2 hours. This is evidence of an efficiency bottleneck with these techniques. Form the outset the authors disclose what they think are the reasons for this gap, and it serves well to outline the solution approach that was later implemented:

1. Stochastic gradient descent optimisation requires the use of small learning rates. Due to the global approximation nature of neural networks, high learning rates cause catastrophic interference (McCloskey & Cohen, 1989). Low learning rates mean that experience can only be incorporated into a neural network slowly.

2. Environments with a sparse reward signal can be difficult for a neural network to model as there may be very few instances where the reward is non-zero. This can be viewed as a form of class imbalance where low-reward samples outnumber high-reward samples by an unknown number. Consequently, the neural network disproportionately underperforms at predicting larger rewards, making it difficult for an agent to take the most rewarding actions.

3. Reward signal propagation by value-bootstrapping techniques, such as Q-learning, results in reward information being propagated one step at a time through the history of previous interactions with the environment. This can be fairly efficient if updates happen in reverse order in which the transitions occur. However, in order to train on uncorrelated minibatches DQN-style, algorithms train on randomly selected transitions, and, in order to further stabilise training, require the use of a slowly updating target network further slowing down reward propagation.

Notwithstanding that this papers focuses on addressing those three problems, the authors recognize other advances described in other papers that also make important contributions to improve the data efficiency bottleneck of deep reinforcement learning:

In this work we shall focus on addressing the three concerns listed above; we must note, however, that other recent advances in exploration (Osband et al., 2016), hierarchical reinforcement learning (Vezhnevets et al., 2016) and transfer learning (Rusu et al., 2016; Fernando et al., 2017) also make substantial contributions to improving data efficiency in deep reinforcement learning over baseline agents.

Therefore, Neural Episodic Control (NEC) is introduced. The crucial point of this proposal is the ability of NEC to rapidly stick with successful strategies as soon as they are experienced. It avoids the latency in the stochastic gradient descent optimization algorithm that DQN and A3C fall into:

In this paper we propose Neural Episodic Control (NEC), a method which tackles the limitations of deep reinforcement learning listed above and demonstrates dramatic improvements on the speed of learning for a wide range of environments. Critically, our agent is able to rapidly latch onto highly successful strategies as soon as they are experienced, instead of waiting for many steps of optimisation (e.g., stochastic gradient descent) as is the case with DQN (Mnih et al., 2015) and A3C (Mnih et al., 2016).

Interestingly this new DRL algorithm is inspired by the workings of the human brain. Specifically it got inspiration by how the Hippocampus, a region of the human brain involved in the encoding of memory (with also the function as a relay of information to other parts of the brain), and how that region in turn is involved in the whole decision-making process of the human mind. But it also got influence from the one-shot learning methods  and the way neural networks remembers rare events:

Our work is in part inspired by the hypothesised role of the Hippocampus in decision-making (Lengyel & Dayan, 2007; Blundell et al., 2016) and also by recent work on one-shot learning (Vinyals et al., 2016) and learning to remember rare events with neural networks (Kaiser et al., 2016). Our agent uses a semi-tabular representation of its experience of the environment possessing several of the features of episodic memory such as long-term memory, sequentiality, and context-based lookups. (…)

I was reading the next piece of the paper and thinking about how modern sophisticated database systems work, with the semi-tabular representation append-only property of the memory in NEC, being somewhat reminiscent of those database systems. Further this representation binds slow-changing keys to fast updating values, and uses a context-based look-up on the keys to efficiently retrieve the useful values in the action selection task of the agent:

The semi-tabular representation is an append-only memory that binds slow-changing keys to fast updating values and uses a context-based lookup on the keys to retrieve useful values during action selection by the agent. Thus the agent’s memory operates in much the same way that traditional table-based RL methods map from state and action to value estimates.

A unique aspect of the memory in contrast to other neural memory architectures for reinforcement learning (explained in more detail in Section 3) is that the values retrieved from the memory can be updated much faster than the rest of the deep neural network. This helps alleviate the typically slow weight updates of stochastic gradient descent applied to the whole network and is reminiscent of work on fast weights (…)


Rewarding indeed.


Neural Episodic Control

After briefly describing the efforts done within the Deep Reinforcement Learning (DRL) literature to date, the authors shift to the details of the Neural Episodic Control (NEC) and the experimental setup of choice in this paper. We also now turn to review it, to later close with the concluding remarks. One interesting aspect of this research on DLR is the link  with computer vision research accomplished by a convolutional neural network. Indeed a convolutional neural network is one of the main components of NEC:

Our agent consists of three components: a convolutional neural network that processes pixel images s, a set of memory modules (one per action), and a final network that converts read-outs from the action memories into Q(s, a) values. For the convolutional neural network we use the same architecture as DQN (Mnih et al., 2015).

Then the NEC is fully described. I also found of interest the relatively simple (from an intuitive accessible undergraduate Computer Science perspective, instead of a more advanced cryptic approach ) computational structure and data types used:

For each action a ∈ A, NEC has a simple memory module Ma = (Ka, Va), where Ka and Va are dynamically sized arrays of vectors, each containing the same number of vectors. The memory module acts as an arbitrary association from keys to corresponding values, much like the dictionary data type found in programs. Thus we refer to this kind of memory module as a differentiable neural dictionary (DND). There are two operations possible on a DND: lookup and write, as depicted in Figure 1. Performing a lookup on a DND maps a key h to an output value o:


where vi is the ith element of the array Va and


where hi is the ith element of the array Ka and k(x, y) is a kernel between vectors x and y, e.g., Gaussian or inverse kernels.

Thus the output of a lookup in a DND is a weighted sum of the values in the memory, whose weights are given by normalised kernels between the lookup key and the corresponding key in memory.

The scalability is achieved by a combination of selection of top k-nearest neighbours with the use of an approximate nearest neighbours algorithm to perform the lookups, based on a previous kd-trees algorithm.

After a DND is queried, a new key-value pair is written into the memory. The key written corresponds to the key that was looked up. The associated value is application-specific (below we specify the update for the NEC agent). Writes to a DND are append-only: keys and values are written to the memory by appending them onto the end of the arrays Ka and Va respectively. If a key already exists in the memory, then its corresponding value is updated, rather than being duplicated.

Next it is presented the NEC algorithm.

The values in the DND, in the case of an NEC agent, are the Q values corresponding to the state that originally resulted in the corresponding key-value pair to be written to the memory. Thus this architecture produces an estimate of Q(s, a) for a single given action a. The architecture is replicated once for each action a the agent can take, with the convolutional part of the network shared among each separate DND Ma. The NEC agent acts by taking the action with the highest Q-value estimate at each time step. In practice, we use ε-greedy policy during training with a low ε:


(…) As an NEC agent acts, it continually adds new key-value pairs to its memory. Keys are appended to the memory of the corresponding action, taking the value of the query key h encoded by the convolutional neural network. (…)

When a state-action value is already present in a DND (i.e the exact same key h is already in Ka), the corresponding value present in Va, Qi , is updated in the same way as the classic tabular Q-learning algorithm:


As I refered to earlier this paper had as its empirical and experimental setups the environments of games within the Atari 2600 ecosystem (with a long list of other such games/environments that appears at the end of the paper in an appendix). We now turn to a brief description of these setups, with the main results that were obtained:

We investigated whether neural episodic control allows for more data efficient learning in practice in complex domains. As a problem domain we chose the Atari Learning Environment(ALE; Bellemare et al., 2013). We tested our method on the 57 Atari games used by Schaul et al. (2015a), which form an interesting set of tasks as they contain diverse challenges such as sparse rewards and vastly different magnitudes of scores across games. Most common algorithms applied in these domains, such as variants of DQN and A3C, require in the thousands of hours of in-game time, i.e. they are data inefficient.



We also compare to DQN with Prioritised Replay, which improves data efficiency by replaying more salient transitions more frequently. We did not directly compare to DRQN (Hausknecht & Stone, 2015) nor FRMQN (Oh et al., 2016) as results were not available for all Atari games. Note that in the case of DRQN, reported performance is lower than that of Prioritised Replay.

The use of the RMSProp algorithm of Geoffrey Hinton’s Coursera course Neural Networks for Machine Learning reference is quite interesting. The capacity of NEC in the memory storage per action is also impressive, but this might be surpassed in the future, if Moore’s Law still applies:

In terms of hyperparameters for NEC, we chose the same convolutional architecture as DQN, and store up to 5 × 10ˆ5 memories per action. We used the RMSProp algorithm (Tieleman & Hinton, 2012) for gradient descent training. We apply the same preprocessing steps as (Mnih et al., 2015), including repeating each action four times. For the N-step Q estimates we picked a horizon of N = 100. Our replay buffer stores the only last 10ˆ5 states (as opposed to 106 for DQN) observed and their N-step Q estimates. We do one replay update for every 16 observed frames with a minibatch of size 32. We set the number of nearest neighbours p = 50 in all our experiments. For the kernel function we chose a function that interpolates between the mean for short distances and weighted inverse distance for large distances, more precisely:


Intuitively, when all neighbours are far away we want to avoid putting all weight onto one data point. A Gaussian kernel, for example, would exponentially suppress all neighbours except for the closest one. The kernel we chose has the advantage of having heavy tails. This makes the algorithm more robust and we found it to be less sensitive to kernel hyperparameters.

The two tables below summarize the main results of the different algorithms when applied to a set of diverse games and environments. Table 1 shows the results for the data efficiency calculations, demonstrating the superior performance of NEC. Table 2 shows that NEC also outperform compared with other algorithm, MFEC, which isn’t as able to focus on small but relevant details of the pixels in an image:

Data efficiency results are summarised in Table 1. In the small data regime (less than 20 million frames) NEC clearly outperforms all other algorithms. The difference is especially pronounced before 5 million frames have been observed. Only at 40 million frames does DQN with Prioritised Replay outperform NEC on average; note that this corresponds to 185 hours of gameplay.


NEC also outperforms MFEC on average (see Table 2). In contrast with MFEC, NEC uses the reward signal to learn an embedding adequate for value interpolation. This difference is especially significant in games where a few pixels determine the value of each action. The simpler version of MFEC uses an approximation to L2 distances in pixel-space by means of random projections, and cannot focus on the small but most relevant details. Another version of MFEC calculated distances on the latent representation of a variational autoencoder (Kingma & Welling, 2013) trained to model frames. This latent representation does not depend on rewards and will be subject to irrelevant details like, for example, the display of the current score.

Related Work

The related work around the subject of this paper involves research on neural network architectures. Recurrent Neural Networks (RNNs) feature prominently in this literature, but also LSTM (Long Short term memory)and DNC (Differential Neural Computers), which are special kinds of  RNNs:

Recurrent neural network representations of memory (LSTMs and DNCs) are trained by truncated backpropagation through time, and are subject to the same slow learning of non-recurrent neural networks.

Some of these models have been adapted to their use in RL agents (LSTMs; Bakker et al., 2003; Hausknecht & Stone, 2015), DNCs (Graves et al., 2016), memory networks (Oh et al., 2016). However, the contents of these memories is typically reset at the beginning of every episode. This is appropriate when the goal of the memory is tracking previous observations in order to maximise rewards in partially observable or non-Markovian environments. Therefore, these implementations can be thought of as a type of working memory, and solve a different problem than the one addressed in this work.

These previous research works were also  attempts at incorporating some form of memory layer in deep neural networks, and with the reinforcement unsupervised learning paradigm:

Other deep RL methods keep a history of previous experience. Indeed, DQN itself has an elementary form of memory: the replay buffer central to its stable training can be viewed as a memory that is frequently replayed to distil the contents into DQN’s value network. Kumaran et al. (2016) suggest that training on replayed experiences from the replay buffer in DQN is similar to the replay of experiences from episodic memory during sleep in animals. DQN’s replay buffer differs from most other work on memory for deep reinforcement learning in its sheer scale: it is common for DQN’s replay buffer to hold millions of (s, a, r, s0) tuples.

Blundell et al. (2016, MFEC) recently used local regression for Q-function estimation using the mean of the k-nearest neighbours, except in the case of an exact match of the query point, in which case the stored value was returned. They also propose the use of the latent variable obtained from a variational autoencoder (Rezende et al., 2014) as an embedding space, but showed random projections often obtained better results.

In contrast with the ideas presented here, none of the local regression work aforementioned uses the reward signal to learn an embedding space of covariates in which to perform the local-regression. We learn this embedding space using temporal-difference learning; a crucial difference, as we showed in the experimental comparison to MFEC.


Concluding Remarks

Concluding this by now long paper review – this is one of those in which something lesser would be almost impossible – here we read two/three paragraphs from the discussion section of the paper. The significance in the improvement compared with existing state-of-the-art deep reinforcement learning frameworks hardly is an overstatement. The data efficiency is the real milestone:

We have proposed Neural Episodic Control (NEC): a deep reinforcement learning agent that learns significantly faster than other baseline agents on a wide range of Atari 2600 games. At the core of NEC is a memory structure: a Differentiable Neural Dictionary (DND), one for each potential action. NEC inserts recent state representations paired with corresponding value functions into the appropriate DND.


Our experiments show that NEC requires an order of magnitude fewer interactions with the environment than agents previously proposed for data efficiency, such as Prioritised Replay (Schaul et al., 2015b) and Retrace(λ) (Munos et al., 2016). We speculate that NEC learns faster through a combination of three features of the agent: the memory architecture (DND), the use of N-step Q estimates, and a state representation provided by a convolutional neural network.

The memory feature is also worthy value added in this effort:

The memory architecture, DND, rapidly integrates recent experience—state representations and corresponding value estimates—allowing this information to be rapidly integrated into future behaviour. Such memories persist across many episodes, and we use a fast approximate nearest neighbour algorithm (kd-trees) to ensure that such memories can be efficiently accessed.

In spite of this – a non-parametric approach for deep reinforcement learning proves promising to be valuable when data efficiency is paramount -, other proposals like the Prioritized Replay were shown to be of superior performance with the number of training trials greater than 40 million. Further improvements would be welcomed by the authors, with the performance of the algorithm and with the implementation of more complex visual settings, like complex 3D environments or real world tasks requiring data efficiency:

Our work suggests that non-parametric methods are a promising addition to the deep reinforcement learning toolbox, especially where data efficiency is paramount. In our experiments we saw that at the beginning of learning NEC outperforms other agents in terms of learning speed. We saw that later in learning Prioritised Replay has higher performance than NEC. We leave it to future work to further improve NEC so that its long term final performance is significantly superior to parametric agents. Another avenue of further research would be to apply the method discussed in this paper to a wider range of tasks such as visually more complex 3D worlds or real world tasks where data efficiency is of great importance due to the high cost of acquiring data.

featured image: Figure 1. Illustration of operations on a Differentiable Neural Dictionary.


Leave a Reply

Fill in your details below or click an icon to log in: Logo

You are commenting using your account. Log Out /  Change )

Google+ photo

You are commenting using your Google+ account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )


Connecting to %s