The paper I’ve chosen to review to start the week is quite an interesting one. It is called Probabilistic Neural Programs, and it is a promising attempt at bridging the gap between the state of the art in deep neural networks and current developments in the emerging computing paradigm of Probabilistic Programming.
As the authors point out at the outset, current deep learning frameworks make suboptimal trade-offs between the expressivity of the computational model and the amount of data required to train it. Furthermore, deep learning is a setting of continuous approximation, while the evidence suggests that discrete inference algorithms often outperform continuous approximations.
With all this in mind, the researchers present probabilistic programming as a paradigm that supports the specification of both computational models and inference algorithms for different tasks, including discrete search and continuous approximations. (I will not claim in this review to have independent evidence for this, or to agree with every conclusion in the paper; further research in this area, or a complete peer-reviewed mathematical proof of the claim, may still be needed.) Their probabilistic neural programs form a flexible framework that permits specification of the appropriate computational model and inference algorithm while at the same time enabling the use of deep neural networks. They accomplish this difficult goal by taking a graph-based framework for specifying neural networks and augmenting it with an operator for weighted nondeterministic choice.
Thus, a program sketch describes both the decisions to be made and the architecture of the neural network used to score these decisions. Importantly, the computation graph interacts with nondeterminism: the scores produced by the neural network determine the weights of nondeterministic choices, while the choices determine the network’s architecture. As with probabilistic programs, various inference algorithms can be applied to a sketch. Furthermore, a sketch’s neural network parameters can be estimated using stochastic gradient descent from either input/output examples or full execution traces.
The description of probabilistic neural programs
The programming language used in this paper to build probabilistic neural programs is Scala. In Figure 1 above (also reproduced in the paper), we see the program sketch defining a multilayer perceptron. It is this classical machine learning/artificial intelligence algorithm that serves here as a probabilistic neural program, but where the parameters and intermediate values are nodes in a computation graph. These nodes are tensor-valued and can be combined with operations such as matrix-vector multiplication and the hyperbolic tangent.
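To make this concrete, here is a minimal, self-contained Scala sketch of the idea — my own illustration, not the authors' library, with plain arrays standing in for tensor-valued graph nodes — showing a two-layer perceptron built from exactly the operations mentioned above:

```scala
// A toy stand-in for the computation-graph view in Figure 1 (illustrative only):
// parameters (w1, w2) and intermediate values are tensor-valued quantities
// combined with matrix-vector multiplication and the hyperbolic tangent.
object MlpSketch {
  type Vec = Array[Double]
  type Mat = Array[Array[Double]]

  // Matrix-vector multiplication, one of the tensor operations named in the text.
  def matVec(w: Mat, x: Vec): Vec =
    w.map(row => row.zip(x).map { case (a, b) => a * b }.sum)

  // Elementwise hyperbolic tangent nonlinearity.
  def tanh(x: Vec): Vec = x.map(math.tanh)

  // A two-layer perceptron scoring function: w2 * tanh(w1 * x).
  def mlp(w1: Mat, w2: Mat, x: Vec): Vec = matVec(w2, tanh(matVec(w1, x)))
}
```

In the actual framework these would be deferred computation-graph nodes whose values are produced by a forward pass; here they are evaluated eagerly for simplicity.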
Figure 1 (right) shows how to use the choose function to create a nondeterministic choice. This function nondeterministically selects a value from a list of options. The score of each option is given by the value of a computation graph node that has the same number of elements as the list. Evaluating this function with a tensor yields a program sketch object that represents a function from neural network parameters to a probability distribution over values. The log probability of a value is proportional to the sum of the scores of the choices made in the execution that produced it. Performing (approximate) inference over this object – in this case, using beam search – produces an explicit representation of the distribution. Multiple nondeterministic choices can be combined to produce more complex sketches; this capability can be used to define complex computational models, including general-purpose models such as Turing machines. The library also has functions for conditioning on observations.
The paragraph above is a good description of the Scala framework used in this short paper. Of particular note is the assertion that their approach can define computational models as complex as Turing machines, which are general-purpose models of computation.
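The choose mechanism quoted above can be sketched in a few lines of Scala. This is a toy rendering of my own with hypothetical helper names (choose, combine, distribution), not the paper's actual API: each option is paired with a score from a graph node, scores add across the choices of an execution, and exponentiating and normalizing the execution scores yields the explicit distribution over values:

```scala
// Illustrative sketch of weighted nondeterministic choice (not the paper's API).
object ChooseSketch {
  // choose pairs each option with its score; in the real framework the
  // scores come from a computation graph node of matching size.
  def choose[A](options: Seq[A], scores: Seq[Double]): Seq[(A, Double)] =
    options.zip(scores)

  // Combining two choices: an execution's score is the sum of its choices' scores,
  // so a value's log probability is proportional to that sum.
  def combine[A, B](a: Seq[(A, Double)], b: Seq[(B, Double)]): Seq[((A, B), Double)] =
    for ((x, sx) <- a; (y, sy) <- b) yield ((x, y), sx + sy)

  // Exact inference for this tiny case: normalize exponentiated scores
  // into an explicit probability distribution over values.
  def distribution[A](execs: Seq[(A, Double)]): Map[A, Double] = {
    val z = execs.map { case (_, s) => math.exp(s) }.sum
    execs.map { case (v, s) => v -> math.exp(s) / z }.toMap
  }
}
```

For example, `distribution(choose(Seq("a", "b"), Seq(0.0, math.log(3.0))))` assigns probability 0.75 to `"b"`. The paper performs approximate inference (beam search) over such objects instead of the exact enumeration shown here.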
Further, the chosen inference algorithms draw on a recent, intense trend in structured prediction: combining greedy inference or beam search with non-factoring models. This trend in computational data engineering tries to improve predictive performance with a minimum of input cost and lower latency in the queuing process.
Although various inference algorithms may be applied to a program sketch, in this work we use a simple beam search over executions. This approach accords with the recent trend in structured prediction to combine greedy inference or beam search with powerful non-factoring models [2, 10, 4]. The beam search maintains a queue of partial program executions, each of which is associated with a score. Each step of the search continues each execution until it encounters a call to choose, which adds zero or more executions to the queue for the next search step. The lowest scoring executions are discarded to maintain a fixed beam width. As an execution proceeds, it may generate new computation graph nodes; the search maintains a single computation graph shared by all executions to which these nodes are added. The search simultaneously performs the forward pass over these nodes as necessary to compute scores for future choices.
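The search procedure quoted above can be condensed into a short Scala sketch. This is my own minimal version with a hypothetical Execution class, not the authors' implementation, and it omits the shared computation graph: each step expands every partial execution at its next choose point and prunes the queue back to a fixed beam width:

```scala
// Minimal beam search over executions (illustrative, not the paper's code).
object BeamSketch {
  // A partial execution: the choices made so far and their accumulated score.
  case class Execution(choices: List[Int], score: Double)

  // steps(i) holds the scored options offered by the i-th call to choose.
  def beamSearch(steps: Seq[Seq[Double]], width: Int): Seq[Execution] = {
    var beam = Seq(Execution(Nil, 0.0))
    for (options <- steps) {
      // Continue each execution with every option at this choose point.
      val expanded = for {
        ex <- beam
        (s, i) <- options.zipWithIndex
      } yield Execution(ex.choices :+ i, ex.score + s)
      // Discard the lowest-scoring executions to maintain a fixed beam width.
      beam = expanded.sortBy(-_.score).take(width)
    }
    beam
  }
}
```

In the real framework, continuing an execution may also create new computation-graph nodes, and the forward pass over the single shared graph is interleaved with the search; the pruning logic, however, is essentially the above.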
The neural network parameters are trained to maximize the loglikelihood of correct program executions using stochastic gradient descent. Each training example consists of a pair of program sketches, representing an unconditional and conditional distribution. The gradient computation is similar to that of a loglinear model with neural network factors. It first performs inference on both the conditional and unconditional distributions to estimate the expected counts associated with each nondeterministic choice. These counts are then backpropagated through the computation graph to update the network parameters.
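For intuition, in the simplest loglinear case with a single nondeterministic choice, the gradient described above reduces to the familiar observed-minus-expected update. The sketch below is an illustrative reduction of my own, not the paper's code; in the full framework the analogous difference of expected counts is backpropagated through the shared computation graph:

```scala
// One SGD step for a single scored choice (illustrative loglinear reduction).
object GradientSketch {
  // Softmax turns per-option scores into model probabilities.
  def softmax(scores: Seq[Double]): Seq[Double] = {
    val exps = scores.map(math.exp)
    val z = exps.sum
    exps.map(_ / z)
  }

  // Gradient of the log-likelihood of the observed option is
  // (indicator of observed) - (model probability): raise the observed
  // option's score, lower the others in proportion to their probability.
  def sgdStep(scores: Array[Double], observed: Int, lr: Double): Array[Double] = {
    val p = softmax(scores.toSeq)
    scores.indices.map { i =>
      val indicator = if (i == observed) 1.0 else 0.0
      scores(i) + lr * (indicator - p(i))
    }.toArray
  }
}
```

Here the "conditional distribution" places all mass on the observed choice and the "unconditional distribution" is the model's softmax, so the difference of their expected counts is exactly the indicator-minus-probability term.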
The authors' overall approach is one of optimism about what a probabilistic neural program can achieve, driven by an implementation on a biologically inspired food web (predator-prey) model, well known from game-theoretical research. And the results appear to vindicate exactly that optimism:
We consider three models for learning to make the choices for both organism and eat: a non-neural (LOGLINEAR) model, as well as two probabilistic neural models (2-LAYER PNP and MAXPOOL PNP). All three learn models for both organism and eat using outputs from a computer vision system trained to detect organism, text, and arrow relations between them. A set of hand-engineered features is heuristically created from the outputs of this vision system. LOGLINEAR and 2-LAYER PNP use only these features, and the difference is simply in the greater expressivity of a two-layer neural network. However, one of the major strengths of neural models is their ability to learn latent feature representations automatically, and our third model also uses the direct outputs of the vision system not made into features. The architecture of MAXPOOL PNP reflects this and contains additional input layers that maxpool over detected relationships between objects and confidence scores. The expectation is that our neural network modeling of nondeterminism will learn better latent representations than the manually defined features.
This interesting paper might in the future serve as a reference study on the possibility of successfully implementing probabilistic neural programs, where the inference algorithms and the flexible specification of computational models work simultaneously with deep learning frameworks. By providing encouraging results from an implementation on a well-known food web dataset of biology and game-theory research fame, the authors hope their intuitions will prove correct and be further vindicated:
We have presented probabilistic neural programs, a framework for program induction that permits flexible specification of computational models and inference algorithms while simultaneously enabling the use of deep learning. A program sketch describes a collection of nondeterministic decisions to be made during execution, along with the neural architecture to be used for scoring these decisions. The network parameters of a sketch can be trained from data using stochastic gradient descent. We demonstrate that probabilistic neural programs improve accuracy on a diagram question answering task which can be formulated as learning to execute program sketches in a domain-specific computational model.
I conclude by encouraging all readers to look through the reference list of this paper, which offers further substantive reading on its proposal.
featured image: “Probabilistic ILP” – Tutorial