I must confess that I have been following with interest Xi’an’s Og, a blog on statistics and machine learning. It isn’t always an easy read and it is certainly directed at a highly specialized audience, but the quality and range of the posts and the dedication of the author are worthy of attention. I just sometimes need to forget how unusual the name of the blog is and go after the content instead.
I decided to share one such post today. It is about empirical Bayes methods, reference priors, EM algorithms, entropy and maximum likelihood estimation (MLE). These are all important topics for both the machine learning practitioner and the theoretical researcher. The post is a comment on and critical appraisal of a research paper called Empirical Bayes Methods, Reference Priors, Cross Entropy and the EM Algorithm by a group of German researchers from the Zuse Institute Berlin.
These kinds of posts are important because they knowledgeably point to possible directions for improving the methodology of the paper being analyzed. This is an attitude that I acknowledge I should aim for when I do my research paper reviews here on this blog. This isn’t to diminish the value of more descriptive reviews, but it should be made clear from the outset what the tone of a review is and how it should be interpreted.
From the very first paragraph we witness the deep theoretical knowledge of the blogger, who is a PhD-level statistics researcher. I post all three paragraphs below, and it is indeed worth it. After them is the link to the research paper mentioned, with its abstract:
Klebanov and co-authors from Berlin arXived this paper a few weeks ago and it took me a quiet evening in Darjeeling to read it. It starts with the premises that led Robbins to introduce empirical Bayes in 1956 (although the paper does not appear in the references), where repeated experiments with different parameters are run. Except that it turns non-parametric in estimating the prior. And to avoid resorting to the non-parametric MLE, which is the empirical distribution, it adds a smoothness penalty function to the picture. (Warning: I am not a big fan of non-parametric MLE!) The idea seems to have been Good’s, who acknowledged using the entropy as penalty is missing in terms of reparameterisation invariance. Hence the authors suggest instead to use as penalty function on the prior a joint relative entropy on both the parameter and the prior, which amounts to the average of the Kullback-Leibler divergence between the sampling distribution and the predictive based on the prior. Which is then independent of the parameterisation. And of the dominating measure. This is the only tangible connection with reference priors found in the paper.
The authors then introduce a non-parametric EM algorithm, where the unknown prior becomes the “parameter” and the M step means optimising an entropy in terms of this prior. With an infinite amount of data, the true prior (meaning the overall distribution of the genuine parameters in this repeated experiment framework) is a fixed point of the algorithm. However, it seems that the only way it can be implemented is via discretisation of the parameter space, which opens a whole Pandora box of issues, from discretisation size to dimensionality problems. And to motivating the approach by regularisation arguments, since the final product remains an atomic distribution.
While the alternative of estimating the marginal density of the data by kernels and then aiming at the closest entropy prior is discussed, I find it surprising that the paper does not consider the rather natural option of setting a prior on the prior, e.g. via Dirichlet processes.
When estimating a probability density within the empirical Bayes framework, the non-parametric maximum likelihood estimate (NPMLE) usually tends to overfit the data. This issue is usually taken care of by regularization – a penalization term is subtracted from the marginal log-likelihood before the maximization step, so that the estimate favors smooth solutions, resulting in the so-called maximum penalized likelihood estimation (MPLE). The majority of penalizations currently in use are rather arbitrary brute-force solutions, which lack invariance under transformation of the parameters (reparametrization) and measurements. This contradicts the principle that, if the underlying model has several equivalent formulations, the methods of inductive inference should lead to consistent results. Motivated by this principle and using an information theoretic point of view, we suggest an entropy-based penalization term that guarantees this kind of invariance. The resulting density estimate can be seen as a generalization of reference priors. Using the reference prior as a hyperprior, on the other hand, is argued to be a poor choice for regularization.
We also present an insightful connection between the NPMLE, the cross entropy and the principle of minimum discrimination information suggesting another method of inference that contains the doubly-smoothed maximum likelihood estimation as a special case.
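To make the discretisation point from the quoted post more concrete, here is a minimal sketch, not taken from the paper, of an EM iteration for a prior restricted to a fixed grid. It assumes a toy normal-means model (each observation is its own parameter plus unit-variance Gaussian noise) and computes the plain, unpenalised NPMLE on the grid; the entropy-based penalty proposed by the authors is deliberately left out. The function names, the grid and the simulated data are all hypothetical choices made for illustration.

```python
import numpy as np


def gaussian_likelihood(x, grid, sigma=1.0):
    """Return the matrix L[i, j] = N(x_i | grid_j, sigma^2)."""
    z = (x[:, None] - grid[None, :]) / sigma
    return np.exp(-0.5 * z**2) / (sigma * np.sqrt(2.0 * np.pi))


def em_grid_prior(x, grid, n_iter=200, sigma=1.0):
    """EM for the weights of a prior supported on a fixed grid.

    This is the unpenalised NPMLE restricted to the grid (an assumption
    of this sketch), not the penalised estimator of the paper.
    """
    lik = gaussian_likelihood(x, grid, sigma)
    w = np.full(grid.size, 1.0 / grid.size)  # start from uniform weights
    for _ in range(n_iter):
        joint = lik * w  # shape (n_obs, n_grid)
        # E step: posterior probability of each grid point for each x_i
        post = joint / joint.sum(axis=1, keepdims=True)
        # M step: the new prior weights are the averaged posteriors
        w = post.mean(axis=0)
    return w


def marginal_loglik(x, grid, w, sigma=1.0):
    """Marginal log-likelihood of the data under the gridded prior."""
    return np.log(gaussian_likelihood(x, grid, sigma) @ w).sum()


# Simulated repeated experiments: parameters drawn from a two-point prior.
rng = np.random.default_rng(0)
theta = rng.choice([-2.0, 2.0], size=500)
x = theta + rng.normal(size=500)

grid = np.linspace(-6.0, 6.0, 61)
w = em_grid_prior(x, grid)
```

Printing `w` against `grid` gives the estimated prior as a set of point masses, which is the atomic distribution the post refers to. The grid resolution and range are choices the user has to make, and the number of grid points grows quickly with the dimension of the parameter, which is where the discretisation concerns come from.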
featured image: Zuse Institute Berlin