NIPS 2016, the conference of the Neural Information Processing Systems Foundation, was held in Barcelona and featured a long and extremely varied list of papers. As always, the conference is a staple event for the information technology, computer science and machine learning communities around the world, and hosting an event of this caliber is a privilege for any city.
Yesterday, while writing a post here about the important deep learning Python library TensorFlow, I came across the GitHub profile of an undergraduate student doing research in these areas. There I found a link to an arXiv paper titled Value Iteration Networks, which caught my attention precisely because it was one of the featured papers at last year's NIPS conference. It immediately occurred to me that I should review the paper for The Information Age, and indeed I found it to be one of the most interesting papers presented there. I plan to parse through the list, pick out further interesting and relevant papers, and review them in the following days, in a series justifiably called Interesting papers from NIPS 2016.
The paper is the result of a collaboration between researchers from the Electrical Engineering and Computer Science departments of UC Berkeley and the University of Washington. It is an important contribution on the still long path toward Artificial General Intelligence, for it brings together important work and experimental settings from Convolutional Neural Networks (CNNs) with no less important efforts in Reinforcement Learning (RL). It adds imitation learning (IL), another interesting apparatus, to this mix. The result, presented at the outset in the abstract, is a fully differentiable neural network (NN) with a planning module embedded within.
This kind of research points in a promising direction for future robotics developments, and may prove essential if robots capable of ever more complex, human-like tasks are to become a reality.
We introduce the value iteration network (VIN): a fully differentiable neural network with a ‘planning module’ embedded within. VINs can learn to plan, and are suitable for predicting outcomes that involve planning-based reasoning, such as policies for reinforcement learning. Key to our approach is a novel differentiable approximation of the value-iteration algorithm, which can be represented as a convolutional neural network, and trained end-to-end using standard backpropagation. We evaluate VIN based policies on discrete and continuous path-planning domains, and on a natural-language based search task. We show that by learning an explicit planning computation, VIN policies generalize better to new, unseen domains.
The quest to one day achieve Artificial General Intelligence is a work in progress across different domains, one that will somehow have to integrate different approaches to the problem, or otherwise find the right balance within an embedded system as to what the proper framework should be for the particular task or challenge at hand. There are approaches such as Google DeepMind's effort toward one-shot learning, where learning occurs from minimal data input, and then there is the approach of this paper, where planning a goal-oriented task is the priority. Here the trade-off is one of a sequential decision-making RL policy versus one-step decisions:
The sequential nature of decision making in RL is inherently different than the one-step decisions in supervised learning, and in general requires some form of planning. However, most recent deep RL works [19, 16, 10] employed NN architectures that are very similar to the standard networks used in supervised learning tasks, which typically consist of CNNs for feature extraction, and fully connected layers that map the features to a probability distribution over actions. Such networks are inherently reactive, and in particular, lack explicit planning computation. The success of reactive policies in sequential problems is due to the learning algorithm, which essentially trains a reactive policy to select actions that have good long-term consequences in its training domain.
To understand why planning can nevertheless be an important ingredient in a policy, consider the grid-world navigation task depicted in Figure 1 (left), in which the agent can observe a map of its domain, and is required to navigate between some obstacles to a target position. One hopes that after training a policy to solve several instances of this problem with different obstacle configurations, the policy would generalize to solve a different, unseen domain, as in Figure 1 (right). However, as we show in our experiments, while standard CNN-based networks can be easily trained to solve a set of such maps, they do not generalize well to new tasks outside this set, because they do not understand the goal-directed nature of the behavior. This observation suggests that the computation learned by reactive policies is different from planning, which is required to solve a new task.
The authors present their Value Iteration Network (VIN), an NN-based policy that can learn to plan its way around obstacles in order to reach a goal. The key innovation is a planning module embedded within a feed-forward classification (supervised) network: the network can learn a planning computation by itself, and since it is fully differentiable, it can be trained using standard backpropagation algorithms:
The key to our approach is an observation that the classic value-iteration (VI) planning algorithm [2, 3] may be represented by a specific type of CNN. By embedding such a VI network module inside a standard feed-forward classification network, we obtain a NN model that can learn a planning computation. The VI block is differentiable, and the whole network can be trained using standard backpropagation. This makes our policy simple to train using standard RL and IL algorithms, and straightforward to integrate with NNs for perception and control.
To further appreciate what is innovative about the approach: the paper frames the VIN as a model-free RL method. Rather than requiring an explicit dynamics model of the environment, as model-based RL does, the VIN is model agnostic, and the authors claim this also leads to better generalization to new, unseen instances of a task:
Our approach is different from model-based RL [22, 5], which requires system identification to map the observations to a dynamics model, which is then solved for a policy. In many applications, including robotic manipulation and locomotion, accurate system identification is difficult, and modelling errors can severely degrade the policy performance. In such domains, a model-free approach is often preferred. Since a VIN is just a NN policy, it can be trained model free, without requiring explicit system identification. In addition, the effects of modelling errors in VINs can be mitigated by training the network end-to-end.
We demonstrate that VINs can be effectively used within standard RL and IL algorithms in various problems, among which are problems that require visual perception, continuous control, and also natural language based decision making in the WebNav challenge. After training, the policy learns to map an observation to a planning computation relevant for the task, and generate action predictions based on the resulting plan. As we demonstrate, this leads to policies that generalize better to new, unseen, instances of the task.
Model details and background
Next, the paper presents the model for Value Iteration Networks (VINs) and its related background.
Value Iteration: A standard model for sequential decision making and planning is the Markov decision process (MDP) [2, 3].
Formally, the value V^π(s) of a state under policy π is the expected discounted sum of rewards when starting from that state and executing policy π: V^π(s) ≐ E^π[ ∑_{t=0}^{∞} γ^t r(s_t, a_t) | s_0 = s ], where γ ∈ (0, 1) is a discount factor, and E^π denotes an expectation over trajectories of states and actions (s_0, a_0, s_1, a_1, …), in which actions are selected according to π, and states evolve according to the transition kernel P(s′|s, a). The optimal value function V*(s) ≐ max_π V^π(s) is the maximal long-term return possible from a state. A policy π* is said to be optimal if V^{π*}(s) = V*(s) ∀s. A popular algorithm for calculating V* and π* is value iteration (VI):

Q_n(s, a) = R(s, a) + γ ∑_{s′} P(s′|s, a) V_n(s′),   V_{n+1}(s) = max_a Q_n(s, a).   (1)
It is well known that the value function V_n in VI converges as n → ∞ to V*, from which an optimal policy may be derived as π*(s) = arg max_a Q_∞(s, a).
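To make the VI recursion above concrete, here is a minimal sketch (not the paper's code) of classic value iteration on a tiny deterministic grid world with obstacles, assuming an illustrative reward of -1 per step plus a +10 bonus for entering the goal, and a discount γ = 0.9:

```python
# Classic value iteration on a 4x4 grid world (illustrative sketch).
import numpy as np

ROWS, COLS = 4, 4
GOAL = (3, 3)
OBSTACLES = {(1, 1), (2, 1)}
ACTIONS = [(-1, 0), (1, 0), (0, -1), (0, 1)]  # up, down, left, right
GAMMA = 0.9

def step(s, a):
    """Deterministic transition: move unless blocked, otherwise stay put."""
    r, c = s[0] + a[0], s[1] + a[1]
    if 0 <= r < ROWS and 0 <= c < COLS and (r, c) not in OBSTACLES:
        return (r, c)
    return s

def value_iteration(n_iters=50):
    V = np.zeros((ROWS, COLS))
    for _ in range(n_iters):
        V_new = np.zeros_like(V)
        for r in range(ROWS):
            for c in range(COLS):
                if (r, c) == GOAL:
                    continue  # absorbing goal state: value stays 0
                # Q(s, a) = r(s, a) + gamma * V(s'); take the max over actions.
                q = [-1 + (10 if step((r, c), a) == GOAL else 0)
                     + GAMMA * V[step((r, c), a)] for a in ACTIONS]
                V_new[r, c] = max(q)
        V = V_new
    return V

V = value_iteration()
# Values grow toward the goal; e.g. the cell next to the goal converges to
# -1 + 10 = 9, and a greedy policy over Q recovers the shortest path.
```

The greedy action at any cell is then just arg max over the four Q-values, exactly the π*(s) = arg max_a Q(s, a) rule from the quote above.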
Reinforcement Learning and Imitation Learning:
In MDPs where the state space is very large or continuous, or when the MDP transitions or rewards are not known in advance, planning algorithms cannot be applied. In these cases, a policy can be learned from either expert supervision – IL, or by trial and error – RL. While the learning algorithms in both cases are different, the policy representations – which are the focus of this work – are similar. In addition, most state-of-the-art algorithms such as [21, 19, 23, 16] are agnostic to the explicit policy representation, and only require it to be differentiable, for performing gradient descent on some algorithm-specific loss function. Therefore, in the rest of this paper we do not commit to a specific learning algorithm, and only consider the policy.
Let φ(s) denote an observation for state s. The policy is specified as a parametrized function πθ(a|φ(s)) mapping observations to a probability over actions, where θ are the policy parameters. For example, the policy could be represented as a neural network, with θ denoting the network weights. The goal is to tune the parameters such that the policy behaves well in the sense that πθ(a|φ(s)) ≈ π*(a|φ(s)), where π* is the optimal policy for the MDP, as defined in Section 2.
In IL (Imitation Learning), a dataset of N state observations and corresponding optimal actions is generated by an expert. Learning a policy then becomes an instance of supervised learning [21, 10]. In RL (Reinforcement Learning), the optimal action is not available, but instead, the agent can act in the world and observe the rewards and state transitions its actions effect. RL algorithms such as in [24, 19, 23, 16] use these observations to improve the value of the policy.
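The IL case really is plain supervised learning. As a minimal sketch (with a toy dataset and names of my own invention, not the paper's code), a softmax policy πθ(a|φ(s)) can be fit to expert (observation, action) pairs by gradient descent on the cross-entropy loss:

```python
# Imitation learning as supervised classification (illustrative sketch).
import numpy as np

rng = np.random.default_rng(0)

# Toy expert data: 2-D observations; the "expert" picks action 1 when
# x[0] <= x[1], action 0 otherwise.
N, D, A = 200, 2, 2
X = rng.normal(size=(N, D))
y = (X[:, 0] <= X[:, 1]).astype(int)  # expert action labels

theta = np.zeros((D, A))  # the policy parameters

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)  # for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

# Gradient descent on the cross-entropy between pi_theta and expert actions.
for _ in range(500):
    probs = softmax(X @ theta)                # pi_theta(a | phi(s))
    grad = X.T @ (probs - np.eye(A)[y]) / N   # d(loss) / d(theta)
    theta -= 1.0 * grad

accuracy = (softmax(X @ theta).argmax(axis=1) == y).mean()
```

In RL the expert labels y are unavailable; the same differentiable policy would instead be updated from observed rewards, which is why the paper can stay agnostic about the learning algorithm and focus on the policy representation.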
The Value Iteration Network Model
As stated earlier, the authors claim that their model is able to learn by itself to solve for a policy representation that embeds an explicit planning module. The planning is done on some unknown MDP M¯ related to the true task MDP M; the model is able to discover a useful M¯ by itself and add its solution as an element of the policy:
Let M denote the MDP of the domain for which we design our policy π. We assume that there is some unknown MDP M¯ such that the optimal plan in M¯ contains useful information about the optimal policy in the original task M. However, we emphasize that we do not assume to know M¯ in advance. Our idea is to equip the policy with the ability to learn and solve M¯ , and to add the solution of M¯ as an element in the policy π. We hypothesize that this will lead to a policy that automatically learns a useful M¯ to plan on.
For example, in the grid-world domain described above, we can let M¯ have the same state and action spaces as the true grid-world M. The reward function fR can map an image of the domain to a high reward at the goal, and negative reward near an obstacle, while fP can encode deterministic movements in the grid-world that do not depend on the observation. While these rewards and transitions are not necessarily the true rewards and transitions in the task, an optimal plan in M¯ will still follow a trajectory that avoids obstacles and reaches the goal, similarly to the optimal plan in M.
(…) Our approach is based on two important observations. The first is that the vector of values V¯*(s) ∀s encodes all the information about the optimal plan in M¯. Thus, adding the vector V¯* as additional features to the policy π is sufficient for extracting information about the optimal plan in M¯.
In NN terminology, this is a form of attention, in the sense that for a given label prediction (action), only a subset of the input features (value function) is relevant. Attention is known to improve learning performance by reducing the effective number of network parameters during learning. Therefore, the second element in our network is an attention module that outputs a vector of (attention modulated) values ψ(s). Finally, the vector ψ(s) is added as additional features to a reactive policy πre(a|φ(s), ψ(s)). The full network architecture is depicted in Figure 2 (left).
Returning to our grid-world example, at a particular state s, the reactive policy only needs to query the values of the states neighboring s in order to select the correct action. Thus, the attention module in this case could return a ψ(s) vector with a subset of V¯* for these neighboring states.
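For the grid-world case, that attention step is nothing more than slicing out the patch of values around the agent. A minimal sketch (names like `attend` and `V_bar` are illustrative, not from the paper's code):

```python
# Attention over the value map: psi(s) = values of states neighboring s.
import numpy as np

def attend(V_bar, pos):
    """Return the attention vector psi(s): the 3x3 patch of V_bar around pos."""
    r, c = pos
    # Pad the border with -inf so off-grid "neighbors" carry no usable value.
    padded = np.pad(V_bar, 1, constant_values=-np.inf)
    return padded[r:r + 3, c:c + 3].ravel()

V_bar = np.arange(16, dtype=float).reshape(4, 4)  # a dummy value map
psi = attend(V_bar, (1, 2))  # 9 values; psi[4] is the agent's own cell
```

The resulting ψ(s) vector is then concatenated with the observation features φ(s) and fed to the reactive policy πre.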
Let θ denote all the parameters of the policy, namely, the parameters of fR, fP , and πre, and note that ψ(s) is in fact a function of φ(s). Therefore, the policy can be written in the form πθ(a|φ(s)), similarly to the standard policy form (cf. Section 2). If we could back-propagate through this function, then potentially we could train the policy using standard RL and IL algorithms, just like any other standard policy representation. While it is easy to design functions fR and fP that are differentiable (and we provide several examples in our experiments), back-propagating the gradient through the planning algorithm is not trivial. In the following, we propose a novel interpretation of an approximate VI algorithm as a particular form of a CNN. This allows us to conveniently treat the planning module as just another NN, and by back-propagating through it, we can train the whole policy end-to-end.
The VI Module
We now introduce the VI module – a NN that encodes a differentiable planning computation. Our starting point is the VI algorithm (1). Our main observation is that each iteration of VI may be seen as passing the previous value function Vn and reward function R through a convolution layer and max-pooling layer. In this analogy, each channel in the convolution layer corresponds to the Q-function for a specific action, and convolution kernel weights correspond to the discounted transition probabilities. Thus by recurrently applying a convolution layer K times, K iterations of VI are effectively performed.
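The correspondence can be sketched in a few lines of numpy (an illustrative simplification, not the paper's Theano implementation): for a deterministic 4-action grid MDP, each action's "convolution kernel" just shifts the value map by one cell, and the max over action channels plays the role of the max-pooling layer. With stochastic transitions, the probabilities would appear as kernel weights instead.

```python
# One VI iteration as "convolution" (per-action channels) + channel-wise max.
import numpy as np

GAMMA = 0.9

def neighbor_values(V, dr, dc):
    """out[r, c] = V[r + dr, c + dc], or -inf off the grid."""
    padded = np.pad(V, 1, constant_values=-np.inf)
    return padded[1 + dr:1 + dr + V.shape[0], 1 + dc:1 + dc + V.shape[1]]

def vi_iteration(V, R):
    # Q has one channel per action, like the conv-layer channels in the paper.
    Q = np.stack([R + GAMMA * neighbor_values(V, dr, dc)
                  for dr, dc in [(-1, 0), (1, 0), (0, -1), (0, 1)]])
    return Q.max(axis=0)  # the "max-pooling" step over action channels

# Recurrently applying vi_iteration K times performs K iterations of VI.
```

Because this is just convolutions and max operations, gradients flow through the K unrolled iterations, which is exactly what makes the planning module trainable by backpropagation.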
The VI block is simply a NN architecture that has the capability of performing an approximate VI computation. Nevertheless, representing VI in this form makes learning the MDP parameters and reward function natural – by backpropagating through the network, similarly to a standard CNN.
Value Iteration Networks
We now have all the ingredients for a differentiable planning-based policy, which we term a value iteration network (VIN). The VIN is based on the general planning-based policy defined above, with the VI module as the planning algorithm. In order to implement a VIN, one has to specify the state and action spaces for the planning module S¯ and A¯, the reward and transition functions fR and fP , and the attention function; we refer to this as the VIN design. For some tasks, as we show in our experiments, it is relatively straightforward to select a suitable design, while other tasks may require more thought. However, we emphasize an important point: the reward, transitions, and attention can be defined by parametric functions, and trained with the whole policy. Thus, a rough design can be specified, and then fine-tuned by end-to-end training.
Once a VIN design is chosen, implementing the VIN is straightforward, as it is simply a form of a CNN. The networks in our experiments all required only several lines of Theano code.
Experimental remarks and conclusion
After this detailed presentation of the model, the authors go on to give a detailed explanation of the experimental setup.
In this section we evaluate VINs as policy representations on various domains. Our goal in these experiments is to investigate the following questions:
- Can VINs effectively learn a planning computation using standard RL and IL algorithms?
- Does the planning computation learned by VINs make them better than reactive policies at generalizing to new domains?
An additional goal in our experiments is to point out several ideas for designing VIN policies for various tasks. While this is not an exhaustive list that fits all domains, we hope that it will motivate creative designs in future work.
What follows is quite an interesting explanation of the implementation details in applications such as grid-world domains, Mars rover navigation and the WebNav challenge. It makes for nice reading and I recommend it to everyone. For now, we close our review by noting that the appendices in the paper and the open-source code are also recommended reading, especially for the more technically inclined. We end with the conclusions:
Conclusion and Outlook
The introduction of powerful and scalable RL methods has opened up a range of new problems for deep learning. However, few recent works investigate policy architectures that are specifically tailored for planning under uncertainty, and current RL theory and benchmarks rarely investigate the generalization properties of a trained policy [24, 19, 6]. This work takes a step in this direction, by exploring better generalizing policy representations.
Our VIN policies learn an approximate planning computation relevant for solving the task, and we have shown that such a computation leads to better generalization in a diverse set of tasks, ranging from simple gridworlds that are amenable to value iteration, to continuous control, and even to navigation of Wikipedia links. In future work we intend to learn different planning computations, based on simulation, or optimal linear control, and combine them with reactive policies, to potentially develop RL solutions for task and motion planning.
featured image: Value Iteration Networks talk slides