The first post for this week is about hierarchical Bayesian modelling and probabilistic programming. It features a talk by Jonathan Sedar delivered at the PyData London 2016 conference, held in London in the spring of this year.
The author starts by setting up the talk in an interesting way. He regards these talks as a way to learn and improve, and he does not claim to be an expert. He does, however, use the framework he presents in his start-up, Applied AI, with a tested implementation in the insurance sector, for instance.
PyMC3 and PySTAN are two of the leading frameworks for Bayesian inference in Python, offering concise model specification, MCMC sampling, and a growing number of built-in conveniences for model validation, verification and prediction.
PyMC3 is an iteration upon the prior PyMC2, and comprises a comprehensive package of symbolic statistical modelling syntax and very efficient gradient-based samplers using the Theano library of deep-learning fame for gradient computation. Of particular interest is that it includes the No-U-Turn Sampler (NUTS) developed recently by Hoffman & Gelman in 2014, which is only otherwise available in STAN.
PySTAN is a wrapper around STAN, a major open-source framework for Bayesian inference developed by Gelman, Carpenter, Hoffman and many others. STAN also has HMC and NUTS samplers, and recently, Variational Inference, which is a very efficient way to approximate the joint probability distribution. Models are specified in a custom syntax and compiled to C++.
The real-world implementation Jonathan speaks about concerns road traffic and vehicle insurance, with a specific look at the recent Volkswagen emissions scandal:
The Real-World Problem & Dataset
I’m currently quite interested in road traffic and vehicle insurance, so I’ve dug into the UK VCA Vehicle Type Approval to find their Car Fuel and Emissions Information for August 2015. The raw dataset is available for direct download and is small but varied enough for our use here: roughly 2500 cars and 10 features, including hierarchies of car parent-manufacturer – manufacturer – model.
I will investigate the car emissions data from the point-of-view of the Volkswagen Emissions Scandal which seems to have meaningfully damaged their sales. Perhaps we can find unusual results in the emissions data for Volkswagen.
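To give a feel for why the manufacturer hierarchy matters, here is a minimal sketch of partial pooling, the core idea behind a hierarchical model. It is not the model from the talk and it does not use PyMC3 or STAN: it applies the closed-form normal-normal result with the spreads treated as known, where a real hierarchical model would estimate them by MCMC. The emissions figures, the manufacturer names, and the `SIGMA` and `TAU` values are all made up for illustration.

```python
from statistics import mean

# Made-up CO2 readings (g/km) per manufacturer; NOT taken from the VCA dataset.
emissions = {
    "maker_a": [120.0, 130.0, 125.0, 135.0, 128.0, 122.0],
    "maker_b": [150.0, 160.0],
    "maker_c": [100.0],
}

SIGMA = 15.0  # assumed within-manufacturer standard deviation
TAU = 10.0    # assumed between-manufacturer standard deviation


def partial_pool(groups, sigma, tau):
    """Return the shrunken (posterior mean) estimate for each group.

    Normal-normal model with sigma and tau treated as known; the
    population mean is plugged in as the mean of the group means.
    """
    mu = mean(mean(values) for values in groups.values())
    prior_precision = 1.0 / tau ** 2
    pooled = {}
    for name, values in groups.items():
        # More observations means more weight on the group's own mean.
        data_precision = len(values) / sigma ** 2
        pooled[name] = (
            data_precision * mean(values) + prior_precision * mu
        ) / (data_precision + prior_precision)
    return pooled


estimates = partial_pool(emissions, SIGMA, TAU)
for name, value in estimates.items():
    print(f"{name}: raw mean {mean(emissions[name]):.1f}, pooled {value:.1f}")
```

The manufacturer with a single car is pulled most strongly towards the overall mean, while the one with six cars barely moves. That is the behaviour one would want when comparing manufacturers with very different numbers of models on record.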
I thought this would be a good way to start the week, and I hope it proves fruitful to the readers of The Information Age.