OpenAI recently posted one of those Blog posts that you read in a life time. Sure, not to overstate much of the significance of the post, I thought it to be a post that you happen to encounter in online surfing just once in a while, a big while indeed. To compound my assertion is the now mostly majority consensus of opinion about how Artificial Intelligence algorithms will definitely impact the economies around the world, and with special significance the jobs markets.

The post I refer to is called Evolution Strategies as a Scalable Alternative to Reinforcement Learning. It was written by a group of distinguished researchers of machine learning/ deep learning, people like ANDREJ KARPATHY, TIM SALIMANS, JONATHAN HO,PETER CHEN, ILYA SUTSKEVER, JOHN SCHULMAN, et. al. . I will review it here today with a sign of compromise to the The Information Age followers and readers that posts such as these should be the bread and butter of this blog. But the audience will be well-informed to know that this kind of quality is hard to come by; the day-to-day of a researcher is not different from the common worker and only from now and then we happen to fortunately have these kind of encounters of first degree. Anyway, the encounters are becoming more frequent as the science&technology behind them continues to improve, and there aren’t signs of it to suddenly coming to a halt.

Reinforcement Learning is one of the most interesting and successful technique/algorithm within the complex, deep toolbox of Machine Learning. It manages to overcome the known limitations of conventional supervised learning algorithms in an elegant way. But it has also some performance limitations; the complex algorithmic setting is a barrier to many settings where a simpler, cheaper framework would be advisable, it does not always scale well, for instance when distributed systems are involved and the computation burden of the widely used backpropagation of gradients technique is also sometimes a problem.

The new approach presented in the post tries to overcome all these issues. If it is labelled here as a new technique, it is rather surprising to discover that it is based on decades old – maybe even centuries old – ideas. Evolutionary ideas started in the XIXth Century in Darwin’s Britain, but the unfolding of its fundamental character crossed time and history to transcend the biological/life sciences to enter the field of computer science in a somewhat shy way in the XXth Century, and are now within a scientific&technological space of global proportion. Recently the field of Evolutionary Computation has been gaining ground and aroused greater attention from the scientific computing/computer science communities, again in a nice gentle low-profile way, but little by little it is gaining ground. The post I review here today is another good staple evidence of that fact.

It should be noted at this stage that evolutionary computation and its main results are not to be linearly equated with the classical evolutionary ideas of Biology, in spite of both being related with the first being inspired by the latter. 150 plus years of science&technology puts a clear difference and a wedge between the completely different contexts where the ideas were hypothesised and developed.

### Evolution Strategies as a Scalable Alternative to Reinforcement Learning

Our finding continues the modern trend of achieving strong results with decades-old ideas. For example, in 2012, the “AlexNet” paper showed how to design, scale and train convolutional neural networks (CNNs) to achieve extremely strong results on image recognition tasks, at a time when most researchers thought that CNNs were not a promising approach to computer vision. Similarly, in 2013, the Deep Q-Learning paper showed how to combine Q-Learning with CNNs to successfully solve Atari games, reinvigorating RL as a research field with exciting experimental (rather than theoretical) results. Likewise, our work demonstrates that ES achieves strong performance on RL benchmarks, dispelling the common belief that ES methods are impossible to apply to high dimensional problems.

In what follows the authors briefly describe the main ideas/points of relevance about reinforcement learning. This serves the purpose to later introduce properly their proposal of Evolution Strategies (ES) that is claimed to be an alternative or replacement of reinforcement learning (RL), where such action reveals itself to be appropriate:

Reinforcement Learning

Let’s briefly look at how RL works. Suppose we are given some environment (e.g. a game) that we’d like to train an agent on. To describe the behavior of the agent, we define a policy function (the brain of the agent), which computes how the agent should act in any given situation. In practice, the policy is usually a neural network that takes the current state of the game as an input and calculates the probability of taking any of the allowed actions. A typical policy function might have about 1,000,000 parameters, so our task comes down to finding the precise setting of these parameters such that the policy plays well (i.e. wins a lot of games).

The training process for the policy works as follows. Starting from a random initialization, we let the agent interact with the environment for a while and collect episodes of interaction (e.g. each episode is one game of Pong). We thus obtain a complete recording of what happened: what sequence of states we encountered, what actions we took in each state, and what the reward was at each step. As an example, below is a diagram of three episodes that each took 10 time steps in a hypothetical environment. Each rectangle is a state, and rectangles are colored green if the reward was positive (e.g. we just got the ball past our opponent) and red if the reward was negative (e.g. we missed the ball):

This diagram suggests a recipe for how we can improve the policy; whatever we happened to do leading up to the green states was good, and whatever we happened to do in the states leading up to the red areas was bad. We can then use backpropagation to compute a small update on the network’s parameters that would make the green actions more likely in those states in the future, and the red actions less likely in those states in the future. We expect that the updated policy works a bit better as a result. We then iterate the process: collect another batch of episodes, do another update, etc.

The reinforcement learning algorithm work by sequentially injecting a noise signal to the action space of the agent, that by a discrete time-step procedure is finding actions which are successful, actions that aren’t successful, in the end manages to update its actions accordingly to achieve better and better learning. This is a stochastically (random uncertainty) based process. Evolution Strategies is a different approach to learning (the random nature of the process is maintained, though), as the authors follow suit to explain in the next sections:

Evolution Strategies

On “Evolution”. Before we dive into the ES approach, it is important to note that despite the word “evolution”, ES has very little to do with biological evolution. Early versions of these techniques may have been inspired by biological evolution and the approach can, on an abstract level, be seen as sampling a population of individuals and allowing the successful individuals to dictate the distribution of future generations. However, the mathematical details are so heavily abstracted away from biological evolution that it is best to think of ES as simply a class of black-box stochastic optimization techniques.

Interestingly the black-box optimization approach ignores completely the environment and the action space of a supposed agent. All that matters is a function f(w) that is being blindly evaluated. Blindly efficiently accomplishing a more and more better fitted task, well understood.

Black-box optimization.In ES, we forget entirely that there is an agent, an environment, that there are neural networks involved, or that interactions take place over time, etc. The whole setup is that 1,000,000 numbers (which happen to describe the parameters of the policy network) go in, 1 number comes out (the total reward), and we want to find the best setting of the 1,000,000 numbers. Mathematically, we would say that we are optimizing a function`f(w)`

with respect to the input vector`w`

(the parameters / weights of the network), but we make no assumptions about the structure of`f`

, except that we can evaluate it (hence “black box”).

Notice in the below following paragraph how the random noise nature of evolution strategies (ES) is somewhat different from the former approach of conventional reinforcement learning (RL). Here the Gaussian noise is introduced in the parameter set space only, not within an entire action space embedded in a complex environment. It is as if the environment is completely abstracted away. I was reading this part of the post and thinking about analogies with some ideas in modern quantum cosmology, namely the research about the early universe… (this is only a personal apart):

The ES algorithm.Intuitively, the optimization is a “guess and check” process, where we start with some random parameters and then repeatedly 1) tweak the guess a bit randomly, and 2) move our guess slightly towards whatever tweaks worked better. Concretely, at each step we take a parameter vector`w`

and generate a population of, say, 100 slightly different parameter vectors`w1 ... w100`

by jittering`w`

with gaussian noise. We then evaluate each one of the 100 candidates independently by running the corresponding policy network in the environment for a while, and add up all the rewards in each case. The updated parameter vector then becomes the weighted sum of the 100 vectors, where each weight is proportional to the total reward (i.e. we want the more successful candidates to have a higher weight). Mathematically, you’ll notice that this is also equivalent to estimating the gradient of the expected reward in the parameter space using finite differences, except we only do it along 100 random directions. Yet another way to see it is that we’re still doing RL (Policy Gradients, or REINFORCE specifically), where the agent’s actions are to emit entire parameter vectors using a gaussian policy.

Below is presented the details of a code sample from an Evolution Strategies (ES) algorithm proper:

Code sample.To make the core algorithm concrete and to highlight its simplicity, here is a short example of optimizing a quadratic function using ES (or see this longer version with more comments):`# simple example: minimize a quadratic around some solution point import numpy as np solution = np.array([0.5, 0.1, -0.3]) def f(w): return -np.sum((w - solution)**2) npop = 50 # population size sigma = 0.1 # noise standard deviation alpha = 0.001 # learning rate w = np.random.randn(3) # initial guess for i in range(300): N = np.random.randn(npop, 3) R = np.zeros(npop) for j in range(npop): w_try = w + sigma*N[j] R[j] = f(w_try) A = (R - np.mean(R)) / np.std(R) w = w + alpha/(npop*sigma) * np.dot(N.T, A)`

Injecting noise in the parameters.Notice that the objective is identical to the one that RL optimizes: the expected reward. However, RL injects noise in the action space and uses backpropagation to compute the parameter updates, while ES injects noise directly in the parameter space. Another way to describe this is that RL is a “guess and check” on actions, while ES is a “guess and check” on parameters. Since we’re injecting noise in the parameters, it is possible to use deterministic policies (and we do, in our experiments). It is also possible to add noise in both actions and parameters to potentially combine the two approaches.

After this description of ES, the authors provide us with a list of the technical trade-offs between this new approach and the common conventional RL algorithms. It makes for rewarding reading for the more technical inclined audience, with some details that won’t be easy to digest for the common reader:

Tradeoffs between ES and RL

ES enjoys multiple advantages over RL algorithms (some of them are a little technical):

No need for backpropagation.ES only requires the forward pass of the policy and does not require backpropagation (or value function estimation), which makes the code shorter and between 2-3 times faster in practice. On memory-constrained systems, it is also not necessary to keep a record of the episodes for a later update. There is also no need to worry about exploding gradients in RNNs. Lastly, we can explore a much larger function class of policies, including networks that are not differentiable (such as in binary networks), or ones that include complex modules (e.g. pathfinding, or various optimization layers).

Highly parallelizable.ES only requires workers to communicate a few scalars between each other, while in RL it is necessary to synchronize entire parameter vectors (which can be millions of numbers). Intuitively, this is because we control the random seeds on each worker, so each worker can locally reconstruct the perturbations of the other workers. Thus, all that we need to communicate between workers is the reward of each perturbation. As a result, we observed linear speedups in our experiments as we added on the order of thousands of CPU cores to the optimization.

Higher robustness.Several hyperparameters that are difficult to set in RL implementations are side-stepped in ES. For example, RL is not “scale-free”, so one can achieve very different learning outcomes (including a complete failure) with different settings of the frame-skip hyperparameter in Atari. As we show in our work, ES works about equally well with any frame-skip.

Structured exploration.Some RL algorithms (especially policy gradients) initialize with random policies, which often manifests as random jitter on spot for a long time. This effect is mitigated in Q-Learning due to epsilon-greedy policies, where the max operation can cause the agents to perform some consistent action for a while (e.g. holding down a left arrow). This is more likely to do something in a game than if the agent jitters on spot, as is the case with policy gradients. Similar to Q-learning, ES does not suffer from these problems because we can use deterministic policies and achieve consistent exploration.

Credit assignment over long time scales.By studying both ES and RL gradient estimators mathematically we can see that ES is an attractive choice especially when the number of time steps in an episode is long, where actions have long-lasting effects, or if no good value function estimates are available.

Some important shortcomings of ES are also disclosed. This is mostly to do with the implementation of the algorithm in practice, not the overall theoretical robustness of the idea in general. This bodes well for the interested engineer, which might feel positively challenged in trying to accomplish a computing feat:

Conversely, we also found some challenges to applying ES in practice. One core problem is that in order for ES to work, adding noise in parameters must lead to different outcomes to obtain some gradient signal. As we elaborate on in our paper, we found that the use of virtual batchnorm can help alleviate this problem, but further work on effectively parameterizing neural networks to have variable behaviors as a function of noise is necessary. As an example of a related difficulty, we found that in Montezuma’s Revenge, one is very unlikely to get the key in the first level with a random network, while this is occasionally possible with random actions.

To finalize a review of a wonderful blog post, I review the competitiveness of ES compared with reinforcement learning (RL). The important advantages identified by the authors are mainly about data efficiency and the computational cost from the time perspective. A note on the related research in this field and the authors’ own concluding remarks are also worth mentioning with all the links provided worth to inspect and read through:

## ES is competitive with RL

We compared the performance of ES and RL on two standard RL benchmarks: MuJoCo control tasks and Atari game playing. Each MuJoCo task (see examples below) contains a physically-simulated articulated figure, where the policy receives the positions of all joints and has to output the torques to apply at each joint in order to move forward. Below are some example agents trained on three MuJoCo control tasks, where the objective is to move forward:

(…)

We usually compare the performance of algorithms by looking at their efficiency of learning from data; as a function of how many states we’ve seen, what is our average reward? Here are the example learning curves that we obtain, in comparison to RL (the TRPO algorithm in this case):

Data efficiency comparison.The comparisons above show that ES (orange) can reach a comparable performance to TRPO (blue), although it doesn’t quite match or surpass it in all cases. Moreover, by scanning horizontally we can see that ES is less efficient, but no worse than about a factor of 10 (note the x-axis is in log scale).

The efficiency cost may well be a good one to tolerate.

Wall clock comparison.Instead of looking at the raw number of states seen, one can argue that the most important metric to look at is the wall clock time: how long (in number of seconds) does it take to solve a given problem? This quantity ultimately dictates the achievable speed of iteration for a researcher. Since ES requires negligible communication between workers, we were able to solve one of the hardest MuJoCo tasks (a 3D humanoid) using 1,440 CPUs across 80 machines in only 10 minutes. As a comparison, in a typical setting 32 A3C workers on one machine would solve this task in about 10 hours. It is also possible that the performance of RL could also improve with more algorithmic and engineering effort, but we found that naively scaling A3C in a standard cloud CPU setting is challenging due to high communication bandwidth requirements.

**(…)**

## Related Work

ES is an algorithm from the neuroevolution literature, which has a long history in AI and a complete literature review is beyond the scope of this post. However, we encourage an interested reader to look at Wikipedia, Scholarpedia, and Jürgen Schmidhuber’s review article (Section 6.6). The work that most closely informed our approach is Natural Evolution Strategies by Wierstra et al. 2014. Compared to this work and much of the work it has inspired, our focus is specifically on scaling these algorithms to large-scale, distributed settings, finding components that make the algorithms work better with deep neural networks (e.g. virtual batch norm), and evaluating them on modern RL benchmarks.

### Conclusion

Our work suggests that neuroevolution approaches can be competitive with reinforcement learning methods on modern agent-environment benchmarks, while offering significant benefits related to code complexity and ease of scaling to large-scale distributed settings. We also expect that more exciting work can be done by revisiting other ideas from this line of work, such as indirect encoding methods, or evolving the network structure in addition to the parameters.

This section was also supplemented with a final note about supervised learning (SL), where the authors claim that their results aren’t competing with SL in terms of performance with classification tasks using the usual computation of the gradient of the loss function with backpropagation. Only in RL settings does the competitive edge of Evolution Strategies is actually manifested. I do recommend the interested reader to also click-through the GitHub repository of this deeply interesting machine learning framework readily made available by the authors of this blog post in OpenAI:

Note on supervised learning.It is also important to note that supervised learning problems (e.g. image classification, speech recognition, or most other tasks in the industry), where one can compute the exact gradient of the loss function with backpropagation, are not directly impacted by these findings. For example, in our preliminary experiments we found that using ES to estimate the gradient on the MNIST digit recognition task can be as much as 1,000 times slower than using backpropagation. It is only in RL settings, where one has to estimate the gradient of the expected reward by sampling, where ES becomes competitive.

As a postscript I must confess to feel real thrill and with a breathtaking sensation. That good feeling of hard things learnt but positively transformative. Formative as much!

*featured image: IEEE CIS VIDEO COMPETITION | COLORFUL GECKOS EXPLAIN EVOLUTIONARY COMPUTATION AND WIN FIRST PRIZE*