The paper I will comment and review today here for this blog is quite special. It is about a topic I cherish – Photonics and Optical Engineering once was my career path… -, but it is double dose of reinforced engagement. It meshes Photonic Engineering with the techniques and conceptual framework of Reinforcement Learning. Reinforcement Learning is one of the most important and significant fields of development within the broader Machine Learning and Computer Science subjects. Applied, as it is wonderfully described in this paper (that tastes more like a scientific and technological essay, and how close to my inclinations that could be…) to the topic of light transport simulation, it showed capacity to improve this photonic engineering challenge, possibly providing a path to further enhancement to rendering of images in Computer Graphics.
Ken Dahm and Alexander Keller – ArXiv Jan 2017
Form the outset we learn that the mathematical and conceptual framework of Reinforcement Learning and light transport simulation are related by the same integral equations. But then the need for precise and clean sampling in light transport simulation turns out to become critical. There are several options to do sampling for computer simulation and modelling, but the authors of this paper introduce an importance sampling approach with reinforcement learning and demonstrate that it is able by itself to learn and improve over time the correct detection of the light source for imaging retrieval purposes. From the abstract:
We show that the equations of reinforcement learning and light transport simulation are related integral equations. Based on this correspondence, a scheme to learn importance while sampling path space is derived. The new approach is demonstrated in a consistent light transport simulation algorithm that uses reinforcement learning to progressively learn where light comes from. As using this information for importance sampling includes information about visibility, too, the number of light transport paths with non-zero contribution is dramatically increased, resulting in much less noisy images within a fixed time budget.
Correct light transport simulation is a current challenge in photonics engineering given the fact that light sources and light receptors normally don’t know where precisely the light path is. There are many instances of noise and interference affecting this path (the old great physicist Issac Newton was obsessed with this issue, and eventually managed to discover some laws about this… risking becoming blind in the process 😄 ). But doing this with a computer is even more challenging and noisy by the obvious restrictions of computers in representing a correct picture of the path space. Knowing a better and improved light transport simulation reveals to be critical to many photonic engineering tasks.
Pointed out in the first paragraphs of the paper is one application of light transport simulation: computer graphics and computational synthesis of images that cannot be distinguished from real photographs:
One application of light transport simulation is the computational synthesis of images that cannot be distinguished from real photographs. In such simulation algorithms , light transport is modeled by a Fredholm integral equation of the second kind and pixel colors are determined by estimating functionals of the solution of the Fredholm integral equation. The estimators are averages of sampled light transport paths that connect light sources and camera sensors.
So we understand here that precisely identifying correct non-zero information paths is the crucial goal of Computer Graphics. Zero information paths are cumbersome noise:
Compared to reality, where photons and their trajectories are abundant, a computer may only consider a tiny fraction of path space, which is one of the dominant reasons for noisy images. It is therefore crucial to efficiently select light transport paths that have an important contribution to the image. While a lot of research in computer graphics has been focussing on importance sampling [16, 4, 3, 1, 20], for long there has not been a simple and efficient online method that can substantially reduce the number of light transport paths with zero contribution.
The majority of zero contributions are caused by unsuitable local importance sampling using only a factor instead of the complete integrand (see Fig. 1) or by trying to connect vertices of light transport path segments that are occluded, for example shooting shadow rays to light sources or connecting path segments starting both from the light sources and the camera sensors. An example for this inefficiency has been investigated early on in computer graphics [26, 27]: The visible part of the synthetic scene shown in Fig. 7 is lit through a door. By closing the door more and more the problem can be made arbitrarily more difficult to solve.
As such the researchers thought successfully in applying reinforcement learning algorithms to the task of precisely connecting light sources with camera sensors. They show that light transport simulation and reinforcement learning can be modelled by the same integral equation, complementary with previous machine learning algorithmic attempts at this task. And the importance sampling used is able to generalize to any light transport algorithm:
We therefore propose a method that is based on reinforcement learning  and allows one to sample light transport paths that are much more likely to connect lights and sensors. Complementary to first approaches of applying machine learning to image synthesis , in Sec. 2 we show that light transport and reinforcement learning can be modeled by the same integral equation. As a consequence, importance in light transport can be learned using any light transport algorithm.
Further significant in this research effort is the fact that the importance sampling scheme is automatic, following successful removal of all hyper-parameters from the path space; and there is also a remarkable computational efficiency outcome, as the algorithm is able to track distributions of samples over time without preprocessing:
Deriving a relationship between reinforcement learning and light transport simulation, we succeed to remove all hyper-parameters, yielding an automatic importance sampling scheme as introduced in Sec. 3. Our approach allows for controlling the memory footprint, for suitable representations of importance does not require preprocessing, and can be applied during image synthesis and/or across frames, because it is able to track distributions over time. A second parallel between temporal difference learning and next event estimation is pointed out in Sec. 4.
As demonstrated in Sec. 5 and shown in Fig. 8, already a simple implementation can dramatically improve light transport simulation. The efficiency of the scheme is based on two facts: Instead of shooting towards the light sources, we are guiding light transport paths to where the light comes from, which effectively shortens path length, and we learn importance from a smoothed approximation instead from high variance path space samples.
Q-Learning and Path Tracing
In this section I would briefly sketch the main concepts in the paper about the reinforcement learning algorithm proposed, like the details about the Q-learning adopted by the researchers. This will involve some advanced mathematical concepts and equations, so the reader not inclined to follow can skip it and go through the rest of this comment until its conclusion. But the more knowledgeable audience is invited to post comments or to point errors or misunderstandings. I already received some useful feedbacks about some errors I’ve made before, and I found those to be useful and rewarding. One of the nice things about blogging: the open spirit to accept mistakes and learn from them.
The setting of reinforcement learning  is depicted in Fig. 2: An agent takes an action thereby transitioning to the resulting next state and receiving a reward. In order to maximize the reward, the agent has to learn which action to choose in what state. This process very much resembles how humans learn.
Q-learning  is a model free reinforcement learning technique. Given a set of states S and a set of actions A, it determines a function Q(s,a) that for any s ∈ S values taking the action a ∈ A. Thus given a state s, the action a with the highest value may be selected next and
may be updated by a fraction of α ∈ [0,1], where r(s,a) is the reward for taking the action resulting in a transition to a state s 0 . In addition, the maximum Q-value of possible actions in s 0 is considered and discounted by a factor of γ ∈ [0,1).
The following twist in this conceptual setting is especially interesting:
Instead of taking into account only the best valued action,
averages all possible actions in s 0 and weighs their values Q(s 0 ,a 0 ) by a transition kernel π(s 0 ,a 0 ). This is especially interesting, as later it will turn out that always selecting the ”best” action does not perform as well as considering all options (see Fig. 7). For a continuous space A of actions, we then have
This is the conceptual framework for the reinforcement learning (RL). But now there is a need to integrate with the conceptual framework for the specific optical properties of the system, namely, the equation for radiance, which is modelled by the former referenced Fredholm integral equation of the second kind:
Le is the source radiance and the integral accounts for all radiance that is incident over the hemisphere ∫ˆ+(x) aligned by the surface normal in x and transported into direction ω. The hitpoint function h(x,ωi) traces a ray from x into direction ωi and returns the first surface point intersected. The radiance from this point is attenuated by the bidirectional scattering distribution function fs , where the cosine term of the angle θi between surface normal and ωi accounts for only the fraction that is perpendicular to the surface.
A comparison of Eqn. 2 for α = 0 and Eqn. 3 reveals structural similarities of the formulation of reinforcement learning and the light transport integral equation, respectively, which lend themselves to matching terms: Interpreting the state s as a location x ∈ R 3 and an action a as tracing a ray from location x into direction ω resulting in the point y := h(x,ω) corresponding to the state s 0 , the reward term r(s,a) can be linked to the emitted radiance Le(y,−ω) = Le(h(x,ω),−ω) as observed from x. Similarly, the integral operator can be applied to the value Q, yielding
In order to synthesize images, we need to compute functionals of the radiance equation 3, i.e. project the radiance onto the image plane. For the purpose of this article, we start with a simple forward path tracer [21, 12]: From a virtual camera, rays are traced through the pixels of the screen. Upon their first intersection with the scene geometry, the light transport path is continued into a scattering direction determined according to the optical surface properties. Scattering and ray tracing are repeated until a light source is hit. The contribution of this complete light transport path is added to the pixel pierced by the initial ray of this light transport path when started at the camera.
We therefore progressively approximate Eqn. 4 using reinforcement learning: Once a direction has been selected and a ray has been traced by the path tracer,updated using a learning rate α. The probability density function resulting from normalizing Q in turn is used for importance sampling a direction to continue the path. As consequence more and more light transport paths are sampled that contribute to the image. Computing a global solution to Q in a preprocess would not allow for focussing computations on light transport paths that contribute to the image.
Method and Algorithm
Next the authors described in some more detail the methods they used in implementing their experimental design as well as the algorithm deployed.
The method has been implemented in an importance driven forward path tracer as shown in Alg. 1: Only two routines for updating Q and selecting a scattering direction proportional to Q need to be added. Normalizing the Qi in a point y then results in a probability density that is used for importance sampling during scattering by inverting the cumulative distribution function. In order to guarantee ergodicity, meaning that every light transport path remains possible, all Qi(y) are initialized with a positive value, for example a uniform probability density or proportional to a factor of the integrand (see Fig. 1).
The parameters exposed by our implementation are the discretization, i.e. number of voxels, and the learning rate α. The cumulative distribution functions are built-in parallel every accumulated frame.
It is desirable to craft consistent rendering algorithms , because then all renderer introduced artifacts, like for example noise, are guaranteed to vanish over time. This requires the Qi(y) to converge, which may be accomplished by a vanishing learning rate α. In reinforcement learning, a typical approach is to count the number of visits to each pair of state s and action a and using
The method resembles the one used to make progressive photon mapping consistent , where consistency has been achieved by decreasing the search radius around a query point every time a photon hits sufficiently close. Similarly, the learning rate may also depend on the total number of visits to a state s alone, or even may be chosen to vanish independently of state and action. Again, such approaches have been explored in consistent photon mapping .
Temporal Difference Learning and Next Event Estimation
As I have been referring when doing these kind of posts I always recommend the reader to check further the paper for more details. The posts here should be read also, obviously, otherwise I wouldn’t bother, but I will never be able to fully sketch out all the detailed thinking of the authors. That wouldn’t be desirable anyway, but the blog posts do serve the noble purpose of showing with as much critical, qualified and pedagogical writing as possible good pieces of research and development in the issues it cares about.
In this section the author further outline their approach to computer graphics and photonic engineering by a RL method, this time with some tricks needed to avoid inevitable shortcomings, like the smaller light sources, the restrictions in approximating the quality of Q or the finer required resolution of Q in order to reliably guide light rays to hit a light source.
Already in  the contribution of light sources has been “learned”: A probability per light source has been determined by the number of successful shadow rays divided by the total number of shadow rays shot. This idea has been refined subsequently [14, 32, 2, 31].
For reinforcement learning, the state space may be chosen as a regular grid over the scene, where in each grid cell c for each light source l a value Vc,l is stored that is initialized with zero. Whenever a sample on a light source l is visible to a point x to be illuminated in the cell c upon next event estimation, its value
is updated using the norm of the contribution Cl(x). Building a cumulative distribution function from all values Vc,l within a cell c, light may be selected by importance sampling.
Results, discussion and conclusion
And for The Information Age that is it for now regarding this comment (review). As mentioned at the beginning this was an especially rewarding paper to read and parse through for me, obviously having been involved with photonic engineering in professional past (I must say that wasn’t an extremely successful period of my life, but that does not matter here, and does not matter at all for the relevance of this post and my enhanced admiration with this subject). Now I finish with the authors’ say about their results and what they think this contribution might be for the future of the filed of computer graphics, computational physics and photonic engineering:
For the same budget of light transport paths, the superiority over path tracing with importance sampling according to the reflection properties is obvious. A comparison with the Metropolis algorithm for importance sampling [27, 10] reveals much more uniform noise lacking the typical splotchy structure inherent with the local space exploration of Metropolis samplers. Note, however, that the new reinforcement learning importance sampling scheme could as well be combined with Metropolis sampling. Finally, updating Q by Eqn. 1, i.e. the ”best possible action” strategy is inferior to using the weighted average of all possible next actions according to Eqn. 5. In light transport simulation this is not surprising, as the deviation of the integrand from its estimated maximum very often is much larger than from a piecewise constant approximation.
Shooting towards where the radiance comes from naturally shortens the average path length as can be seen in Fig. 7e. As a consequence the new algorithm for the same budget of light transport paths is around 20% faster than without reinforcement learning. The big gain in quality is due to the dramatic increase of non-zero contribution light transport paths (see Fig. 8), even under complex lighting.
The idea of guiding light transport paths has been explored before [16, 8, 4, 20, 1, 28]. However, key to our approach is that by using a representation of Q in Eqn. 5 instead of solving the equation by recursion, i.e. a Neumann series, Q can be learned much faster and in fact during sampling light transport paths without any preprocess.
While the memory requirements for storing our data structure for Q are small, the data structure is not adaptive. An alternative is the adaptive hierarchical approximation to Q as used in . Yet, another variant would be learning parameters for lobes to guide light transport paths . In principle any data structure that has been used in graphics to approximate irradiance or radiance is a candidate. Which data structure and what parameters are best, may be depending on the scene to be rendered. For example, using discretized hemispheres limits the resolution with respect to solid angle. If the resolution is chosen too fine, learning is slow, if the resolution is to coarse, convergence is slow.
Identifying Q-learning and light transport, heuristics have been replaced by physically based functions, and the only parameters that the user may control are the learning rate and the discretization of Q.
The combination of reinforcement learning and deep neural networks [19, 7, 17, 18] is an obvious avenue of future research: Representing the radiance on hemispheres already has been successfully explored  and the interesting question is how well Q can be represented by neural networks.
Featured Image: Fig. 7 Comparison at 1024 paths per pixel (the room behind the door is shown in Fig. 4)… Learning Light Transport the Reinforced Way