Latent Dirichlet Allocation (LDA) is a widely used probabilistic generative model in machine learning, particularly in natural language processing, and it has enjoyed widespread popularity in those communities. However, it can become limiting when human input enters a system as a factor: the model becomes harder to generalize and requires detailed, often unrealistic assumptions about the data-generating process. Human-generated data is typically high-dimensional, and these models often end up stuck with wrong assumptions.
The paper I review here proposes a way around these limitations of LDA. It introduces an approach to topic modeling in text classification settings that the authors call Correlation Explanation (CorEx). They claim performance comparable to LDA with less human intervention. To get there they devised a neat strategy within an information-theoretic framework that anchors words: these words are introduced as domain-knowledge anchors in a way similar to the information bottleneck method. As a result, the coherence of the topics improves and the anchored topics become better predictors in the document classification task.
This type of research is useful given today's volume of scientific literature and text data. Extracting knowledge from huge, challenging and ever-growing collections of documents is an important and rich research topic, and topic modeling is a popular methodology for extracting information from unstructured text.
In the paper the authors describe the two families of topic modeling methods most used by researchers, one of which includes LDA:
Two methodologies largely dominate topic modeling: matrix factorization, such as Latent Semantic Indexing (LSI) (Deerwester et al., 1990; Landauer et al., 1998), and probabilistic generative models, such as Latent Dirichlet Allocation (LDA) (Blei et al., 2003). Generative models, and LDA in particular, have eclipsed topic modeling research and applications. LDA specifies a document generation process: it is assumed that for each document a topic is randomly chosen from a specified distribution, and then a word is randomly chosen according to a distribution specified by the chosen topic. The document-topic and topic-word distributions that generate the document are unknown, but can be inferred using Bayesian inference.
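To make the generative story in the quoted passage concrete, here is a minimal sketch in Python. It is my own illustration, not code from the paper: the two toy topics, the four-word vocabulary and the helper names `sample_dirichlet` and `generate_document` are all assumptions made for the example.

```python
import random


def sample_dirichlet(alpha, rng):
    """Draw one sample from a Dirichlet distribution via normalized Gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]


def generate_document(topic_word, doc_alpha, n_words, rng):
    """Generate one document (a list of word ids) following the LDA generative story."""
    theta = sample_dirichlet(doc_alpha, rng)  # document-topic distribution
    n_topics = len(topic_word)
    vocab_ids = range(len(topic_word[0]))
    words = []
    for _ in range(n_words):
        z = rng.choices(range(n_topics), weights=theta)[0]    # choose a topic
        w = rng.choices(vocab_ids, weights=topic_word[z])[0]  # choose a word from that topic
        words.append(w)
    return theta, words


rng = random.Random(0)
# Hypothetical toy setup: 2 topics over a 4-word vocabulary.
topic_word = [
    [0.45, 0.45, 0.05, 0.05],  # topic 0 favors words 0 and 1
    [0.05, 0.05, 0.45, 0.45],  # topic 1 favors words 2 and 3
]
theta, doc = generate_document(topic_word, doc_alpha=[0.5, 0.5], n_words=10, rng=rng)
print(theta, doc)
```

Every distribution here has to be specified up front (the Dirichlet prior, the number of topics, the topic-word tables), which is exactly the kind of modeling assumption the paper wants to avoid.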
However, LDA requires distributions and parameters to be specified up front, which is a problem especially when domain knowledge is a factor. This is undesirable when one wishes to uncover word and document relationships with minimal human intervention. This motivates the paper's proposal: using Correlation Explanation (CorEx) for topic modeling, an information-theoretic approach that does not assume any generative model of the data. This might be a better way to support human judgement in settings where minimizing assumptions (subjectivity) matters most and where underrepresented topics might otherwise be missed:
The information-theoretic framework behind CorEx also naturally allows for flexible incorporation of word-level domain knowledge. Topic models are often susceptible to portraying only dominant themes of documents. Injecting a topic model, such as CorEx, with domain knowledge can help guide it towards otherwise underrepresented topics that are pertinent to the user. This can be useful, for example, if we wish to learn to automatically diagnose patients from medical notes written by their doctor. By incorporating word level domain knowledge, we might encourage our topic model to recognize a rare disease that would otherwise be missed. Alternately, if we have documents that relate to some natural disaster, we may want to focus our attention on topics that could guide relief workers to distribute aid more effectively.
This paper follows methods tried in other similar works. Namely, it inserts metadata into the model to perform text classification with a machine learning pipeline. In particular, it inserts the metadata manually, drawing inspiration from earlier research on anchor words in the context of non-negative matrix factorization, a technique widely used in recommender systems:
Although the original algorithm proposed by Arora et. al and subsequent improvements to the algorithm find these anchor words automatically (Arora et al., 2013; Lee and Mimno, 2014), recent adaptations allow manual insertion of anchor words and other metadata (Nguyen et al., 2014; Nguyen et al., 2015). Our work is similar to the latter, where we treat anchor words as fuzzy logic markers and embed them into the topic model in a semi-supervised fashion. In this sense, our work is closest to Halpern et al., who have also made use of domain expertise and semi-supervised anchored words in devising topic models (2014; 2015).
Of note in this paragraph is the way anchor words are embedded as fuzzy logic markers in the topic model in a semi-supervised fashion. Related LDA-based work also allowed word-level information to be specified, but this paper incorporates that information naturally, without the involved and careful construction of new assumptions that LDA requires, which makes the method more lightweight and flexible:
There is an adjacent line of work that has focused on incorporating word-level information into LDA-based models. Andrezejewski and Zhu have presented two flavors of such models. One allows specification of Must-Link and Cannot-Link relationships between words that help partition otherwise muddled topics (Andrzejewski et al., 2009). The other model makes use of “z-labels,” words that are known to pertain to a specific topics and that are restricted to appearing in some subset of all the possible topics (Andrzejewski and Zhu, 2009). Similarly, Jagarlamudi et. al proposed SeededLDA, a model that seeds words into given topics and guides, but does not force, these topics towards these integrated words (2012). While we also seek to guide our model towards topics containing user-provided words, our model naturally extends to incorporating such information, while the LDA-based models require involved and careful construction of new assumptions. Thus, our framework is more lightweight and flexible than LDA-based models.
I will briefly describe the method more formally before ending with the concluding remarks. One of the authors of the paper provides the software as open source, accessible from his GitHub profile here.
The comparison of their method with another efficient method, the latent tree approach, is worth mentioning. A later part of the paper gives a detailed comparison of the two approaches, CorEx and the latent tree, but I will not cover it in this review. As a remark on the better overall performance of these two methods compared with the hierarchical Dirichlet process and the Chinese restaurant process, here is the relevant paragraph, which also points to the paper's new formulation exploiting sparsity:
Mathematically, CorEx topic models most closely resemble topic models based on latent tree reconstruction (Chen et al., 2015). In Chen et. al.’s analysis, their own latent tree approach and CorEx both report significantly better perplexity than hierarchical topic models based on the hierarchical Dirichlet process and the Chinese restaurant process but they showed that their method was much faster than CorEx. We revisit this comparison after introducing our new formulation exploiting sparsity in Sec. 3.3. CorEx has also been investigated as a way to find “surprising” documents (Hodas et al., 2015).
The total correlation, or multivariate mutual information, of a group of random variables X_G is expressed as
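For reference, the standard definition of total correlation can be written in LaTeX as follows. The equation tags (1) and (2) are chosen to match the "Eqn. 1" and "Eqn. 2" mentioned in the quoted text; this is the textbook definition, not a copy of the paper's typesetting.

```latex
\begin{align}
TC(X_G) &= \sum_{i \in G} H(X_i) - H(X_G) \tag{1} \\
        &= D_{KL}\!\left( p(x_G) \,\Big\|\, \prod_{i \in G} p(x_i) \right) \tag{2}
\end{align}
```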
Here D_KL stands for the Kullback-Leibler divergence.
We see that Eqn. 1 does not quantify “correlation” in the modern sense of the word, and so it can be helpful to conceptualize total correlation as a measure of total dependence. Indeed, Eqn. 2 shows that total correlation can be expressed using the Kullback-Leibler Divergence and, therefore, it is zero if and only if the joint distribution of X_G factorizes, or, in other words, there is no dependence between the random variables.
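The "zero if and only if the joint distribution factorizes" property is easy to check numerically. The sketch below is my own illustration using empirical counts on toy binary data; the function names are assumptions, and it computes total correlation as the sum of marginal entropies minus the joint entropy.

```python
import math
from collections import Counter


def entropy(samples):
    """Empirical Shannon entropy (in bits) of a list of hashable outcomes."""
    n = len(samples)
    return -sum((c / n) * math.log2(c / n) for c in Counter(samples).values())


def total_correlation(rows):
    """Sum of marginal entropies minus the joint entropy, from empirical counts."""
    columns = list(zip(*rows))
    return sum(entropy(col) for col in columns) - entropy([tuple(r) for r in rows])


# Two independent fair bits: all four combinations appear equally often.
independent = [(0, 0), (0, 1), (1, 0), (1, 1)]
# Two perfectly dependent bits: the second always copies the first.
dependent = [(0, 0), (1, 1), (0, 0), (1, 1)]

print(total_correlation(independent))  # 0.0
print(total_correlation(dependent))    # 1.0
```

The independent pair carries no shared information, so its total correlation is zero, while the copied bit contributes one full bit of dependence.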
In the context of topic modeling, X_G represents a group of words and Y represents a topic. Since we are always interested in grouping multiple sets of words into multiple topics, we will denote the latent topics as Y_1, ..., Y_m and their corresponding groups of words as X_{G_j} for j = 1, ..., m respectively. The CorEx topic model seeks to maximally explain the dependencies of words in documents through latent topics by maximizing TC(X; Y_1, ..., Y_m). Instead, we maximize the following lower bound on this expression:
This optimization is subject to the constraint that the groups, G_j, do not overlap and the conditional distribution is normalized. The solution to this objective can be efficiently approximated, despite the search occurring over an exponentially large probability space (Ver Steeg and Galstyan, 2014).
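As far as I can reconstruct it from the constraints described around it (non-overlapping groups G_j and a normalized conditional distribution), the lower bound is a sum of per-topic total correlations. The LaTeX below is my reading of the paper's objective, not a verbatim copy, so check it against the original:

```latex
\begin{equation}
\max_{G_j,\; p(y_j \mid x_{G_j})} \; \sum_{j=1}^{m} TC(X_{G_j}; Y_j),
\qquad
TC(X_G; Y) \equiv TC(X_G) - TC(X_G \mid Y)
\end{equation}
```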
Next the authors detail their method. Of note is the way they connect it to the information bottleneck: the trade-off parameter of the bottleneck is replaced by a set of indicator variables α_{i,j}, which mark whether word X_i belongs to topic Y_j:
Note that the constraint on non-overlapping groups now becomes a constraint on α. Comparing the objective to Eqn. 6, we see that we have exactly the same compression term for each latent factor, I(X : Y_j), but the relevance variables now correspond to Z ≡ X_i. Inspired by the success of the bottleneck, we suggest that if we want to learn representations that are more relevant to specific keywords, we can simply anchor a word X_i to topic Y_j, by constraining our optimization so that α_{i,j} = β_{i,j}, where β_{i,j} ≥ 1 controls the anchor strength.
This schema is a natural extension of the CorEx objective and it is flexible, allowing for multiple words to be anchored to one topic, for one word to be anchored to multiple topics, or for any combination of these anchoring strategies. Furthermore, it combines supervised and unsupervised learning by allowing us to leave some topics without anchors.
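The anchoring rule itself is small enough to sketch. The snippet below is a hypothetical illustration, not the authors' code: `build_anchor_constraints`, the toy vocabulary and the single shared anchor strength `beta` are all assumptions. It only records which α entries would be pinned to β and leaves every other entry to the optimizer.

```python
def build_anchor_constraints(vocab, n_topics, anchors, beta=2.0):
    """Return {(word_index, topic_index): beta} for every anchored (word, topic) pair.

    `anchors` maps a topic index to the list of words anchored to that topic.
    Pairs missing from the returned dict are unconstrained.
    """
    if beta < 1:
        raise ValueError("anchor strength beta must be >= 1")
    index = {word: i for i, word in enumerate(vocab)}
    fixed = {}
    for topic, words in anchors.items():
        if not 0 <= topic < n_topics:
            raise ValueError(f"unknown topic {topic}")
        for word in words:
            fixed[(index[word], topic)] = beta
    return fixed


vocab = ["flood", "water", "shelter", "food", "tent"]
# Topic 0 gets two anchors, "water" is anchored to two topics, topic 2 gets none.
anchors = {0: ["flood", "water"], 1: ["water", "shelter"]}
fixed = build_anchor_constraints(vocab, n_topics=3, anchors=anchors, beta=2.0)
print(fixed)
```

The example covers the three cases the quote mentions: several words anchored to one topic, one word anchored to several topics, and a topic left without anchors so that it is learned in an unsupervised way.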
After this the authors present their sparsity-based formulation, which works on a binary bag-of-words representation of the documents. They then describe the data sources and the evaluation of the first implementations. One evaluation measure is topic purity: because the lexicon labels combine expert knowledge with crowd-sourcing, purity serves as a measure of semantic topic consistency:
The numerical optimization for CorEx involves iteratively updating a fixed point equation until convergence. Similar to the EM algorithm, we start with a random soft labeling for each document and each latent factor at time t = 0, p_{t=0}(y_j | x^ℓ). Next we update the marginal distributions p_t(x_i, y_j) and the α^t_{i,j} using the original CorEx procedure. Note that since all variables are binary, the marginal distribution is just a two by two table of probabilities and can be estimated efficiently. The time-consuming part of training is the subsequent update.
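To illustrate why the marginal is "just a two by two table", here is a minimal sketch of estimating it for one binary word and one binary topic from soft labels. This is my own reading of that single step, not the full CorEx update, and the function name and toy numbers are assumptions.

```python
def marginal_table(word_column, soft_labels):
    """Estimate the 2x2 table p(x_i, y_j) for one binary word and one binary topic.

    word_column[l] is 0 or 1 (word absent or present in document l);
    soft_labels[l] is the current estimate of p(y_j = 1 | document l).
    Returns table[x][y].
    """
    n = len(word_column)
    table = [[0.0, 0.0], [0.0, 0.0]]
    for x, p_y1 in zip(word_column, soft_labels):
        table[x][1] += p_y1 / n          # mass assigned to y_j = 1
        table[x][0] += (1.0 - p_y1) / n  # mass assigned to y_j = 0
    return table


word_column = [1, 1, 0, 0]          # the word appears in the first two documents
soft_labels = [0.9, 0.8, 0.1, 0.2]  # the topic is mostly "on" for those same documents
table = marginal_table(word_column, soft_labels)
print(table)
```

Each document contributes its soft label mass to one row of the table, so a single pass over the documents is enough.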
For example, given a topic list with k words, the purity of a list with words all of the same label is 1, while that of a list with words all different labels is 1/k. Since the HA/DR lexicon labels are the result of expert knowledge and crowd-sourcing, the purity provides us with a measure of semantic topic consistency similar to word intrusion tests (Chang et al., 2009; Lau et al., 2014).
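The quote gives only the two endpoints of the purity measure. A simple definition consistent with both, and the one I assume in the sketch below, is the fraction of a topic's words that carry the most common label; the paper's exact formula may differ.

```python
from collections import Counter


def topic_purity(labels):
    """Fraction of a topic's words that share the most common label (assumed definition)."""
    if not labels:
        raise ValueError("need at least one label")
    return Counter(labels).most_common(1)[0][1] / len(labels)


print(topic_purity(["water", "water", "water", "water"]))    # 1.0
print(topic_purity(["water", "food", "shelter", "health"]))  # 0.25
```

With k = 4 words the two calls reproduce the endpoints from the quote: 1 when all labels agree and 1/k when they all differ.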
Finally, we evaluate the models in terms of document classification, where the feature set of each document is its topic distribution. The classification is carried out using multiclass logistic regression as implemented by the Scikit-Learn library (Pedregosa et al., 2011), where one binary regression is trained for each label and the label with the highest probability of appearing is selected.
While more sophisticated machine learning algorithms may produce better predictive scores, their complex frameworks have the potential to obfuscate differences between topic models. We also leverage the interpretability of logistic regression in our analysis of anchored CorEx. We perform all document classification tasks using a 60/40 split for training and testing.
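A minimal sketch of this evaluation setup with scikit-learn might look as follows. The data is synthetic (random stand-ins for topic distributions, invented for the example), and wrapping `LogisticRegression` in `OneVsRestClassifier` is my assumption about how "one binary regression per label" could be reproduced, not necessarily the authors' exact configuration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.multiclass import OneVsRestClassifier

rng = np.random.default_rng(0)

# Synthetic stand-in for topic features: 300 documents, 5 topics, 3 labels.
n_docs, n_topics, n_labels = 300, 5, 3
labels = rng.integers(0, n_labels, size=n_docs)
features = rng.random((n_docs, n_topics)) * 0.3
features[np.arange(n_docs), labels] += 0.7  # topic k is strongly active for label k

# 60/40 split for training and testing, as described in the quote.
X_train, X_test, y_train, y_test = train_test_split(
    features, labels, train_size=0.6, random_state=0, stratify=labels
)

# One binary logistic regression per label; the label with the highest score wins.
clf = OneVsRestClassifier(LogisticRegression(max_iter=1000))
clf.fit(X_train, y_train)
accuracy = clf.score(X_test, y_test)
print(f"test accuracy: {accuracy:.2f}")
```

Because the classifier is linear, the learned coefficients per label can be read directly as "which topics predict this label", which is the interpretability the authors rely on.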
Anchor words were chosen by their information gain with respect to each label:
In analyzing anchored CorEx, we wish to systematically test the effect of anchor words given the domain-specific lexicons. To do so, we follow the approach used by Jagarlamudi et. al: for each label in a data set, we find the words that have the highest mutual information, or information gain, with the label (2012).
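The selection rule can be sketched in a few lines: compute the mutual information between each word's presence and a label, then keep the top-scoring words. The toy documents, vocabulary and function names below are my own assumptions for illustration.

```python
import math


def mutual_information(word_present, has_label):
    """Mutual information (in bits) between two binary sequences of equal length."""
    n = len(word_present)
    mi = 0.0
    for x in (0, 1):
        for y in (0, 1):
            p_xy = sum(1 for a, b in zip(word_present, has_label) if a == x and b == y) / n
            p_x = sum(1 for a in word_present if a == x) / n
            p_y = sum(1 for b in has_label if b == y) / n
            if p_xy > 0:
                mi += p_xy * math.log2(p_xy / (p_x * p_y))
    return mi


def top_anchor_candidates(doc_word, vocab, has_label, k):
    """Rank words by mutual information with a label and return the top k."""
    scores = {
        word: mutual_information([row[i] for row in doc_word], has_label)
        for i, word in enumerate(vocab)
    }
    return sorted(vocab, key=lambda w: scores[w], reverse=True)[:k]


vocab = ["flood", "the", "rain"]
# Rows are documents, columns are binary word occurrences.
doc_word = [
    [1, 1, 1],
    [1, 1, 0],
    [0, 1, 0],
    [0, 1, 1],
]
has_label = [1, 1, 0, 0]  # the first two documents carry the label
print(top_anchor_candidates(doc_word, vocab, has_label, k=1))  # ['flood']
```

In this toy corpus "flood" tracks the label exactly, "the" appears everywhere and "rain" is independent of the label, so only "flood" is a useful anchor candidate.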
Figure 3 of the paper shows the cross-section results for the anchoring metrics, with error bars, together with the F1 score (which combines precision and recall) measured before anchoring.
Discussion and conclusion
The final discussion section of the paper is composed of several paragraphs that I will not copy in full here, for obvious reasons. But it makes for a good wrap-up read and provides nice insights into future work in this field, especially on the more diverse set of word-anchoring strategies that the authors believe this approach might allow. There is also a note of humility about the shortcomings of this approach compared with some LDA methods for topic modeling. As for the potential of CorEx to find structure in documents in a new way, while helping experts guide topic models with minimal intervention and shedding light on possibly overlooked themes, the authors make their case, pending further improvements:
In this paper, we have introduced an information-theoretic topic model, CorEx, that does not rely on any of the generative assumptions of LDA-based topic models. CorEx is competitive with LDA in terms of producing semantically coherent topics that aid document classification. We also derived a flexible method for anchoring word-level domain knowledge in the CorEx topic model through the information bottleneck. Anchored CorEx guides the topic model towards themes that do not naturally emerge, and often produces more coherent and predictive topics.
(…) However, the flexibility of anchoring words through the information bottleneck lends itself to many possible creative anchoring strategies that could guide the topic model in different ways. Different goals may call for different anchoring strategies, and future work will explore the effect of alternate strategies.
While we have demonstrated several advantages of the CorEx topic model to LDA, it does have some shortcomings. Most notably, CorEx relies on binary count data, rather than the standard count data that is used as input into LDA and other topic models. Our sparse implementation also requires that each word appears in only one topic. These are not fundamental limitations of the theory, but a matter of computational efficiency. In future work, we hope to remove these restrictions while preserving the speed of the sparse CorEx topic modeling algorithm. As we have demonstrated, this information-theoretic approach has rich potential for finding structure in documents in a new way, and helping domain experts guide topic models with minimal intervention to capture otherwise eclipsed themes.
Featured Image: Topic Modeling for Learning Analytics Researchers LAK15 Tutorial