Reinforcement learning: introductory links

I have to admit that The Information Age assumes a tone of knowledge and expertise about the subjects usually covered here that may appear a bit too much bold, beyond the real expertise of who is writing. This is also a personal blog, so sometimes it is important to the reader to distinguish when the writer is writing with the confidence of knowing full well what he is writing about, from when he is wanting to get deeper in a subject and the writing provides a good way to put working memory aligned with  cumulative knowledge. I disclosed from the outset that I am a graduate engineer in physics, so my easiness with most of the concepts I cover here should be obvious, even if I am not the best of breed of the pack. Nevertheless, and precisely because of some limitations that I do not bother to recognise, I am optimistic of when I am confident in what I am writing I will not incur in too grave a mistake as to disapoint the follower and reader that happen to enjoy the topics I chose to cover here. Blogging is also networking, and in this activity we do not care or presume to be authoritative with a hard nose; and what we like most is learning and feeling progress in that learning, with ourselves and with our fellow bloggers of similar subjects.

After this long introductory paragraph, serving as a sort of justification for the essence matter of this post, let us dive in a couple of introductory links on the important Machine Leaning topic which is Reinforcement Learning. This is perhaps the ultimate topic in most machine learning scheduled courses apparatus, with good reason as it is probably the cutting edge issue in the field that may in the end reveal to be the essential killer app in the quest to achieve general artificial intelligence. Like this former post from this blog presented. But, following the idea from the first paragraph, I felt the need to properly understand the reinforcement learning topic, for myself and for the wider readership.

Therefore this post will try to provide assurance that Reinforcement Leaning will be reinforced learned in our minds with two links that expose the idea in more or less detail. The first is a nice short, but quite dense and hard for the newbie, article from the now usual suspect O’Reilly Data newsletter entitled accordingly Reinforcement learning explained. I say that Junling Hu’s article is a hard one but that is only my judgement; certainly the author’s intention was one of providing an easy accessible explanation of the topic, with some examples, figures and implementations to real life settings that would be as intuitive to understand as possible. And to a large extent that was achieved. But as with many other data science and machine learning topics, there is also a need to recognise that the difficulty of some concepts do not go well with intuitive explanations; sometimes it is better to go with the grain of the difficulty and insist, review, repeat and persist until you know a bit more of what it is involved in the particular concept. The concept of Reinforcement Learning (RL) is one such concept (at least to me, but that may not be the case for other people reading this, be they someone with some knowledge of machine learning or not…). Junling Hu’s post gives us a nice mental picture of the implementation of RL in the context of Robotics, and I will reproduce here some parts that I deemed worth to… reinforce:

Reinforcement learning, in a simplistic definition, is learning best actions based on reward or punishment.


There are three basic concepts in reinforcement learning: state, action, and reward. The state describes the current situation. For a robot that is learning to walk, the state is the position of its two legs. For a Go program, the state is the positions of all the pieces on the board.


Action is what an agent can do in each state. Given the state, or positions of its two legs, a robot can take steps within a certain distance. There are typically finite (or a fixed range of) actions an agent can take. For example, a robot stride can only be, say, 0.01 meter to 1 meter. The Go program can only put down its piece in one of 19 x 19 (that is 361) positions.


When a robot takes an action in a state, it receives a reward. Here the term “reward” is an abstract concept that describes feedback from the environment. A reward can be positive or negative. When the reward is positive, it is corresponding to our normal meaning of reward. When the reward is negative, it is corresponding to what we usually call “punishment.”


Each of these concepts seem simple and straightforward: Once we know the state, we choose an action that (hopefully) leads to positive reward. But the reality is more complex.


Consider this example: a robot learns to go through a maze. When the robot takes one step to the right, it reaches an open location, but when it takes one step to the left, it also reaches an open location. After going left for three steps, the robot hits a wall. Looking back, taking the left step at location 1 is a bad idea (bad action). How would the robot use the reward at each location (state) to learn how to get through the maze (which is the ultimate goal)?


“Real” reinforcement learning, or the version used as a machine learning method today, concerns itself with the long-term reward, not just the immediate reward.


The long-term reward is learned when an agent interacts with an environment through many trials and errors. The robot that is running through the maze remembers every wall it hits. In the end, it remembers the previous actions that lead to dead ends. It also remembers the path (that is, a sequence of actions) that leads it successfully through the maze. The essential goal of reinforcement learning is learning a sequence of actions that lead to a long-term reward. An agent learns that sequence by interacting with the environment and observing the rewards in every state.


How does an agent know what the desired long-term payoff should be? The secret lies in a Q-table (or Q function). It’s a lookup table for rewards associated with every state-action pair. Each cell in this table records a value called a Q-value. It is a representation of the long-term reward an agent would receive when taking this action at this particular state, followed by taking the best path possible afterward.

How does the agent learn about this long-term Q-value reward? It turns out, the agent does not need to solve a complex mathematical function. There is a simple procedure to learn all the Q-values called Q-learning. Reinforcement learning is essentially learning about Q-values while taking actions.


What is stated above appear to make the concept of RL straightforward and easy, but it is not quite that so. The inquisitive/knowledgeable mind will want to ask for more detail, for  a more satisfactory explanation. Is this just a method? Or is there something else involved? The Q-valued method explained below is telling about that being a method, yes, but an advanced one as it is forward-looking into the future actions, somehow taking into account a memory of future actions. The mathematical rule is simple enough though:

In this updating method, Q carries memory from the past and takes into account all future steps. Note that we use the maximized Q-value for the new state, assuming we always follow the optimal path afterward. As the agent visits all the states and tries different actions, it eventually learns the optimal Q-values for all possible state-action pairs. Then it can derive the action in every state that is optimal for the long term.


A simple example can be seen with the following maze robot:




The robot starts from the lower left corner of the maze. Each location (state) is indicated by a number. There are four action choices (left, right, up, down), but in certain states, action choices are limited. For example, in state 1 (initial state), the robot has only two action choices: up or right. In state 4, it has three action choices: left, right, or up. When the robot hits a wall, it receives reward -1. When it reaches an open location, it receives reward 0. When it reaches the exit, it receives reward 100. However, note that this one-time reward is very different from Q-values. In fact, we have


Q(4, left) = 0.8 x 0 + 0.2 (0+0.9 Q(1,right)) and

Q(4, right) = 0.8 x 0 + 0.2 (0+0.9 Q(5,up))


Where the learning rate is 0.2 and discount rate is 0.9. The best action in state 1 is right, and the best action in state 5 is up. Q(1,right) and Q(5,up) have different values because it takes more steps from state 1 than state 5 to reach the exit. Since we discount future rewards, we discount the added steps to reach the goal. Thus Q(5,up) has a higher value than Q(1,right). For this reason, Q(4,right) has a higher value than Q(4, left). Thus, the best action in state 4 is going right.


Q-learning requires the agent try many times to visit all possible state-action pairs. Only then does the agent have a complete picture of the world. Q-values represent the optimal values when taking the best sequence of actions. This sequence of actions is also called “policy.”


The basic take home idea: RL is all about repeated action in and around a particular feature to learn up and until a complete picture of what is needed to learn is clear and really learned. Plus the realization that the universe our agent (or ourselves for that matter…) operates in is a predictable stable universe and not a random universe:

A fundamental question we face is: is it possible for an agent to learn all the Q-values when it explores the actions possible in a given environment? In other words, is such learning feasible? The answer is yes if we assume the world responds to actions. In other words, the state changes based on an action. This assumption is called the Markov Decision Process (MDP). It assumes that the next state depends on the previous state and the action taken. Based on this assumption, all possible states can eventually be visited, and the long-term value (Q-value) of every state-action pair can be determined.

Imagine that we live in a random universe where our actions have no impact on what happens next, then reinforcement learning (Q-learning) would break down. After many times of trying, we’d have to throw in the towel. Fortunately, our universe is much more predictable. When a Go player puts down a piece on the board, the board position is clear in the next round. Our agent interacts with the environment and shapes it through its actions. The exact impact of our agent’s action on the state is typically straightforward. The new state is immediately observable. The robot can tell where it ends up.

The techniques of Reinforcement Learning

Next in the artice it is described some important techniques required in order for RL to successful perform its magic:

The performance of Q-learning depends on visiting all state-action pairs in order to learn the correct Q-values. This can be easily achieved with a small number of states. In the real world, however, the number of states can be very large, particularly when there are multiple agents in the system. For example, in a maze game, a robot has at most 1,000 states (locations); this grows to 1,000,000 when it is in a game against another robot, where the state represents the joint location of two robots (1,000 x 1,000).


When the state space is large, it is not efficient to wait until we visit all state-actions pairs. A faster way to learn is called the Monte Carlo method. In statistics, the Monte Carlo method derives an average through repeated sampling. In reinforcement learning, the Monte Carlo method is used to derive Q-values after repeatedly seeing the same state-action pair. It sets the Q-value, Q(s,a), as the average reward after many visits to the same state-action pair (s, a). This method removes the need for using a learning rate or a discount rate. It depends only on large numbers of simulations. Due to its simplicity, this method has become very popular. It has been used by AlphaGo after playing many games against itself to learn about the best moves.


Another way to reduce the number of states is by using a neural network, where the inputs are states and outputs are actions, or Q-values associated with each action. A deep neural network has the power to dramatically simplify the representation of states through hidden layers. In this Nature paper on deep reinforcement learning used with Atari games, the whole game board is mapped by a convolutional neural network to determine Q-values.



RL Course by David Silver – Lecture 1

Not wanting to keep copy interesting sections from Junling Hu’s article from O’Reilly Data Science, I just recommend the full readership and a visit to the links the article provides for further knowledge. And I pass to the second post (link) that complements the first one. It is a video lecture from Dr. David Silver featuring the first Lecture delivered as a course from the DeepMind youtube channel. It is a nice introduction to the topic and a link for the reader to check the whole video list from DeepMind (this video is already a bit old from May 2015…). DeepMind is a London-based venture AI firm, reputed for the AlphaGo achievement:

featured image: The computer that mastered Go

Leave a Reply

Fill in your details below or click an icon to log in: Logo

You are commenting using your account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s