Today I return to the wonderful blog maintained by OpenAI. The post I re-share here for the followers of The Information Age describes another achievement by the OpenAI research team. It comments on a paper about unsupervised sentiment analysis of reviews using a multiplicative LSTM (long short-term memory) network, and it is a demanding but rewarding read. Sentiment analysis is an active research topic for recurrent neural networks trained for prediction. The paper’s approach is a form of representation learning (confirming last week’s post about the relevance of this line of research in machine/deep learning): the network learns a representation of sentiment while being trained only to predict the next character in a sequence, rather than taking a whole-sequence approach:
This achievement also appears to further support unsupervised learning as a more data-efficient approach to machine learning/data science than the more common supervised setting. While requiring far fewer labeled examples, this “sentiment neuron” representation produces accurate analyses when compared with previous methods on well-tested datasets:
A linear model using this representation achieves state-of-the-art sentiment analysis accuracy on a small but extensively studied dataset, the Stanford Sentiment Treebank (we get 91.8% accuracy versus the previous best of 90.2%), and can match the performance of previous supervised systems using 30-100x fewer labeled examples. Our representation also contains a distinct “sentiment neuron” which contains almost all of the sentiment signal.
I highly recommend reading the full paper: Learning to Generate Reviews and Discovering Sentiment. It is well written and full of references to the earlier work that served as inspiration or background to this effort. I will briefly review it below, with some figures/tables from it and a concluding remarks section. For now I continue with the blog post and its easy-to-understand description of the methodology used in the research. The surprising finding that a large-scale neural network can learn the concept of sentiment simply by predicting the next character in a text sequence makes a perfect first impression of this effort:
We were very surprised that our model learned an interpretable feature, and that simply predicting the next character in Amazon reviews resulted in discovering the concept of sentiment. We believe the phenomenon is not specific to our model, but is instead a general property of certain large neural networks that are trained to predict the next step or dimension in their inputs.
The large-scale nature of this effort is illustrated by how long training took: one month. And that was with the high-performance computation of four NVIDIA Pascal GPUs, state-of-the-art hardware for fast parallel computation:
We first trained a multiplicative LSTM with 4,096 units on a corpus of 82 million Amazon reviews to predict the next character in a chunk of text. Training took one month across four NVIDIA Pascal GPUs, with our model processing 12,500 characters per second.
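As a rough sanity check on those figures, here is my own back-of-the-envelope arithmetic. It assumes a 30-day month and a constant throughput, neither of which is stated in the post:

```python
# Back-of-the-envelope check of the training figures quoted above.
# Assumptions (mine, not the post's): a 30-day month and constant throughput.
CHARS_PER_SECOND = 12_500
SECONDS_PER_DAY = 24 * 60 * 60
ASSUMED_DAYS = 30

chars_processed = CHARS_PER_SECOND * SECONDS_PER_DAY * ASSUMED_DAYS
print(f"{chars_processed:,} characters")  # 32,400,000,000 characters
```

That is about 32 billion characters, which is in line with the 128 × 256 × 1,000,000 ≈ 32.8 billion bytes implied by the batch settings the paper reports, and a little short of the 38 billion bytes in the full corpus.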
These 4,096 units (which are just a vector of floats) can be regarded as a feature vector representing the string read by the model. After training the mLSTM, we turned the model into a sentiment classifier by taking a linear combination of these units, learning the weights of the combination via the available supervised data.
While training the linear model with L1 regularization, we noticed it used surprisingly few of the learned units. Digging in, we realized there actually existed a single “sentiment neuron” that’s highly predictive of the sentiment value.
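To make this step more concrete, here is a minimal sketch of fitting an L1-regularized linear classifier on top of fixed features and then looking at which units it actually uses. Everything in it is an assumption for illustration: the features are synthetic (64 random units with the label planted in unit 17) rather than real mLSTM states, and the optimizer is plain proximal gradient descent, which may well differ from what the authors used.

```python
import numpy as np

def soft_threshold(w, t):
    """Proximal operator of the L1 norm: shrink each weight toward zero by t."""
    return np.sign(w) * np.maximum(np.abs(w) - t, 0.0)

def fit_l1_logistic(X, y, l1=0.05, lr=0.1, steps=500):
    """L1-regularized logistic regression via proximal gradient descent."""
    n, d = X.shape
    w, b = np.zeros(d), 0.0
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-(X @ w + b)))      # predicted P(positive)
        grad_w = X.T @ (p - y) / n                   # gradient of the mean log loss
        grad_b = np.mean(p - y)
        w = soft_threshold(w - lr * grad_w, lr * l1) # gradient step, then L1 shrinkage
        b -= lr * grad_b                             # the bias is not regularized
    return w, b

# Synthetic stand-in for the learned feature vectors (hypothetical data).
rng = np.random.default_rng(0)
n_samples, n_units, planted_unit = 400, 64, 17
y = rng.integers(0, 2, size=n_samples).astype(float)
X = rng.normal(size=(n_samples, n_units))
X[:, planted_unit] += 3.0 * (2.0 * y - 1.0)          # one unit carries the label

w, b = fit_l1_logistic(X, y)
top_unit = int(np.argmax(np.abs(w)))                 # unit with the largest weight
n_nonzero = int(np.count_nonzero(w))                 # how many units the model uses
accuracy = float(np.mean(((X @ w + b) > 0) == (y == 1)))
print(top_unit, n_nonzero, round(accuracy, 3))
```

The L1 penalty drives the weights of uninformative units to exactly zero, so inspecting the few surviving weights is a cheap way to spot a unit that carries most of the signal, which is the kind of observation the authors describe.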
Just like with similar models, our model can be used to generate text. Unlike those models, we have a direct dial to control the sentiment of the resulting text: we simply overwrite the value of the sentiment neuron.
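The mechanics of that “dial” can be sketched with a toy model. The code below is not the authors’ model: it uses a small, untrained tanh RNN with random weights (so the sampled tokens are meaningless), and the names `generate`, `clamp_unit` and `clamp_value` are my own. It only shows where the overwrite happens in the sampling loop.

```python
import numpy as np

def generate(params, seed_id, n_steps, rng, clamp_unit=None, clamp_value=0.0):
    """Sample token ids from a toy tanh RNN, optionally clamping one hidden unit."""
    W_xh, W_hh, W_hy = params
    h = np.zeros(W_hh.shape[0])
    token, tokens, states = seed_id, [], []
    for _ in range(n_steps):
        x = np.zeros(W_xh.shape[1])
        x[token] = 1.0                          # one-hot encoding of the last token
        h = np.tanh(W_xh @ x + W_hh @ h)        # ordinary recurrent update
        if clamp_unit is not None:
            h[clamp_unit] = clamp_value         # the "dial": overwrite a single unit
        logits = W_hy @ h                       # the clamped state drives the output...
        probs = np.exp(logits - logits.max())
        probs /= probs.sum()
        token = int(rng.choice(len(probs), p=probs))
        tokens.append(token)
        states.append(h.copy())                 # ...and is carried into the next step
    return tokens, np.array(states)

rng = np.random.default_rng(0)
vocab, hidden = 27, 16                          # toy sizes, chosen arbitrarily
params = (rng.normal(scale=0.5, size=(hidden, vocab)),
          rng.normal(scale=0.5, size=(hidden, hidden)),
          rng.normal(scale=0.5, size=(vocab, hidden)))
tokens, states = generate(params, seed_id=0, n_steps=20, rng=rng,
                          clamp_unit=3, clamp_value=1.0)
```

Because the overwritten value feeds both the next-character distribution and the next recurrent update, fixing one unit biases everything generated afterwards. In a trained model whose unit tracks sentiment, that is what would steer the text towards positive or negative reviews.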
The sentiment neuron adjusting its value on a character-by-character basis.
It’s interesting to note that the system also makes large updates after the completion of sentences and phrases. For example, in “And about 99.8 percent of that got lost in the film”, there’s a negative update after “lost” and a larger update at the sentence’s end, even though “in the film” has no sentiment content on its own.
The last two sections of the blog post teach us about the importance of unsupervised learning in the quest to make machine learning pipelines more efficient, using fewer labeled examples from a large dataset:
Labeled data are the fuel for today’s machine learning. Collecting data is easy, but scalably labeling that data is hard. It’s only feasible to generate labels for important problems where the reward is worth the effort, like machine translation, speech recognition, or self-driving.
Machine learning researchers have long dreamed of developing unsupervised learning algorithms to learn a good representation of a dataset, which can then be used to solve tasks using only a few labeled examples. Our research implies that simply training large unsupervised next-step-prediction models on large amounts of data may be a good approach to use when creating systems with good representation learning capabilities.
Reflecting on possible next steps, the authors acknowledge that the phenomenon behind this achievement remains somewhat obscure. Nevertheless, the contribution of this effort to our understanding of representation learning, machine learning training regimes and how to handle large-scale datasets efficiently is probably unmatched by any other research group in the Artificial Intelligence research cluster of galaxies.
Our results are a promising step towards general unsupervised representation learning. We found the results by exploring whether we could learn good quality representations as a side effect of language modeling, and scaled up an existing model on a carefully chosen dataset. Yet the underlying phenomena remain more mysterious than clear.
- These results were not as strong for datasets of long documents. We suspect our character-level model struggles to remember information over hundreds to thousands of time steps. We think it’s worth trying hierarchical models that can adapt the timescales at which they operate. Further scaling up these models may further improve representation fidelity and performance on sentiment analysis and similar tasks.
- The model struggles the more the input text diverges from review data. It’s worth verifying that broadening the corpus of text samples results in an equally informative representation that also applies to broader domains.
- Our results suggest that there exist settings where very large next-step-prediction models learn excellent unsupervised representations. Training a large neural network to predict the next frame in a large collection of videos may result in unsupervised representations for object, scene, and action classifiers.
Some highlights from the paper and the concluding remarks
In this last section I share some highlights from the original paper, followed by its concluding remarks. These give a more formal description of the issues involved than the OpenAI blog post, which does a wonderful job of making a hard-to-understand, highly technical research field in Artificial Intelligence easier to comprehend.
Much previous work on language modeling has evaluated on relatively small but competitive datasets such as Penn Treebank (Marcus et al., 1993) and Hutter Prize Wikipedia (Hutter, 2006). As discussed in Jozefowicz et al. (2016) performance on these datasets is primarily dominated by regularization. Since we are interested in high-quality sentiment representations, we chose the Amazon product review dataset introduced in McAuley et al. (2015) as a training corpus. In de-duplicated form, this dataset contains over 82 million product reviews from May 1996 to July 2014 amounting to over 38 billion training bytes. Due to the size of the dataset, we first split it into 1000 shards containing equal numbers of reviews and set aside 1 shard for validation and 1 shard for test.
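The split described at the end of that passage could be sketched as follows. This is only my illustration: the round-robin assignment and the choice of which two shards to hold out are assumptions, since the paper only gives the shard counts.

```python
from collections import Counter

N_SHARDS = 1000
# Which shards are held out is an arbitrary choice here; the paper does not say.
VALID_SHARD, TEST_SHARD = 998, 999

def assign_split(review_index: int) -> str:
    """Map a review's position in the de-duplicated corpus to train/valid/test."""
    shard = review_index % N_SHARDS   # round-robin keeps shard sizes equal (within one review)
    if shard == VALID_SHARD:
        return "valid"
    if shard == TEST_SHARD:
        return "test"
    return "train"

# Small stand-in corpus: 50,000 review indices instead of 82 million.
counts = Counter(assign_split(i) for i in range(50_000))
print(counts)
```

With 82 million reviews, each held-out shard would contain roughly 82,000 reviews, about 0.1% of the data apiece.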
Many potential recurrent architectures and hyperparameter settings were considered in preliminary experiments on the dataset. Given the size of the dataset, searching the wide space of possible configurations is quite costly. To help alleviate this, we evaluated the generative performance of smaller candidate models after a single pass through the dataset. The model chosen for the large-scale experiment is a single layer multiplicative LSTM (Krause et al., 2016) with 4096 units. We observed multiplicative LSTMs to converge faster than normal LSTMs for the hyperparameter settings that were explored both in terms of data and wall-clock time. The model was trained for a single epoch on mini-batches of 128 subsequences of length 256 for a total of 1 million weight updates. States were initialized to zero at the beginning of each shard and persisted across updates to simulate full-backpropagation and allow for the forward propagation of information outside of a given subsequence.
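For readers who have not met the multiplicative LSTM before, here is a minimal forward-pass sketch of one step, following the equations as I recall them from Krause et al. (2016). Several things are my own simplifications: biases are omitted, the input is a one-hot byte, the weight names are made up, and the demo uses 8 units instead of 4,096. It also shows the state being carried from one subsequence to the next instead of being reset, as the passage above describes.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def mlstm_step(x, h_prev, c_prev, W):
    """One forward step of a multiplicative LSTM (biases omitted for brevity)."""
    m = (W["mx"] @ x) * (W["mh"] @ h_prev)    # input-dependent multiplicative state
    i = sigmoid(W["ix"] @ x + W["im"] @ m)    # input gate
    f = sigmoid(W["fx"] @ x + W["fm"] @ m)    # forget gate
    o = sigmoid(W["ox"] @ x + W["om"] @ m)    # output gate
    u = np.tanh(W["hx"] @ x + W["hm"] @ m)    # candidate update
    c = f * c_prev + i * u                    # new cell state
    h = o * np.tanh(c)                        # new hidden state
    return h, c

rng = np.random.default_rng(0)
n_in, n_hidden = 256, 8                       # 256 byte values; tiny hidden size for the demo
names = ["mx", "mh", "ix", "im", "fx", "fm", "ox", "om", "hx", "hm"]
W = {name: rng.normal(scale=0.1,
                      size=(n_hidden, n_in if name.endswith("x") else n_hidden))
     for name in names}

h = c = np.zeros(n_hidden)                    # zero state at the start of a shard
for chunk in (b"This product is great. ", b"Would buy again."):
    for byte in chunk:
        x = np.zeros(n_in)
        x[byte] = 1.0                         # one-hot encoding of the current byte
        h, c = mlstm_step(x, h, c, W)
    # h and c are deliberately not reset here: the state persists into the next chunk

features = h                                  # final hidden state, usable as a feature vector
print(features.shape)
```

The difference from a standard LSTM is the intermediate state `m`: the previous hidden state reaches the gates only through a product with the current input, so each input character effectively selects its own recurrent transition.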
Although the focus of our analysis has been on the properties of our model’s representation, it is trained as a generative model and we are also interested in its generative capabilities. Hu et al. (2017) and Dong et al. (2017) both designed conditional generative models to disentangle the content of text from various attributes like sentiment or tense. We were curious whether a similar result could be achieved using the sentiment unit. (…)
Discussion and Future Work
It is an open question why our model recovers the concept of sentiment in such a precise, disentangled, interpretable, and manipulable way. It is possible that sentiment as a conditioning feature has strong predictive capability for language modelling. This is likely since sentiment is such an important component of a review. Previous work analysing LSTM language models showed the existence of interpretable units that indicate position within a line or presence inside a quotation (Karpathy et al., 2015). In many ways, the sentiment unit in this model is just a scaled up example of the same phenomena.
Our work highlights the sensitivity of learned representations to the data distribution they are trained on. The results make clear that it is unrealistic to expect a model trained on a corpus of books, where the two most common genres are Romance and Fantasy, to learn an encoding which preserves the exact sentiment of a review. Likewise, it is unrealistic to expect a model trained on Amazon product reviews to represent the precise semantic content of a caption of an image or a video.
There are several promising directions for future work highlighted by our results. The observed performance plateau, even on relatively similar domains, suggests improving the representation model both in terms of architecture and size. Since our model operates at the byte-level, hierarchical/multi-timescale extensions could improve the quality of representations for longer documents. The sensitivity of learned representations to their training domain could be addressed by training on a wider mix of datasets with better coverage of target tasks. Finally, our work encourages further research into language modelling as it demonstrates that the standard language modelling objective with no modifications is sufficient to learn high-quality representations.
The research paths that could spring from this paper are quite revealing. The authors show humility in admitting that they do not fully understand how a large-scale neural network manages to learn the concept of sentiment by itself, only by predicting the next character in a large body of text. But, as noted in the paragraphs above, there is also a performance plateau that needs further work. Also highlighted is the sensitivity of the learned representations to the data distribution they are trained on. One other interesting outcome of this research is the interplay that might exist between machine learning (ML) pipelines for natural language processing and other ML frameworks used for computer vision or speech recognition. Certainly there will be more along these lines of research in the very near future.
featured image: Figure 4. Visualizing the value of the sentiment cell as it processes six randomly selected high contrast IMDB reviews. Red indicates negative sentiment while green indicates positive sentiment. Best seen in color.