The reader of this blog is now acquainted with some blogs that I follow and sometimes repost here. One of these is from a French Statistician and Data Science guru that posts regularly in the unusually named blog Xi’ an’s OG. Here it is a brief biographical description of the author of this Blog, which sports a somewhat mysterious identity style of Blogging. Something that pays some tribute to creative ways of presenting an identity in the Internet of perils witnessed and uncertain.
Anyway the last post from Xi’ an’s OG was about the flurry of activity with publication of papers about the important Machine Learning technique of Stochastic Gradient Descent (SGD is different from the standard Gradient Descent in that it economize by subampling within a very large training dataset) . This activity confirms a belief about optimization or sampling techniques in Machine Learning as the main mathematical/statistical support for its development, or that practitioners and researchers continue to insist on these techniques with further commitment even if results appear to lead to not that better outcomes or discoveries. These algorithms still have some convergence and other efficiency issues on implementation, but recurrently everyone comes back here for further insight. The brief comments that the author posted triggered the will for this repost, where the highly technical content subject should be further researched by the interested:
- Stochastic Gradient Descent as Approximate Bayesian inference, by Mandt, Hoffman, and Blei, where this technique is used as a type of variational Bayes method, where the minimum Kullback-Leibler distance to the true posterior can be achieved. Rephrasing the [scalable] MCMC algorithm of Welling and Teh (2011) as such an approximation.
2. Further and stronger analogy between sampling and optimization: Langevin Monte Carlo and gradient descent, by Arnak Dalalyan, which establishes a convergence of the uncorrected Langevin algorithm to the right target distribution in the sense of the Wasserstein distance. (Uncorrected in the sense that there is no Metropolis step, meaning this is a Euler approximation.) With an extension to the noisy version, when the gradient is approximated eg by subsampling. The connection with stochastic gradient descent is thus tenuous, but Arnak explains the somewhat disappointing rate of convergence as being in agreement with optimisation rates.
3. Stein variational adaptive importance sampling, by Jun Han and Qiang Liu, which relates to our population Monte Carlo algorithm, but as a non-parametric version, using RKHS to represent the transforms of the particles at each iteration. The sampling method follows two threads of particles, one that is used to estimate the transform by a stochastic gradient update, and another one that is used for estimation purposes as in a regular population Monte Carlo approach. Deconstructing into those threads allows for conditional independence that makes convergence easier to establish. (A problem we also hit when working on the AMIS algorithm.)