Transfer Learning is the new frontier. TensorFlow might help implement transfer learning

Sebastian Ruder is a PhD student in Natural Language Processing (NLP). He also blogs regularly about deep learning and machine learning on his personal blog. I had recently been researching Transfer Learning, having become acquainted with the idea that it is the next frontier or challenge for the deep learning community to tackle in order to improve on what is already the state of the art. Then I found Sebastian's recent post on Transfer Learning and needed to look no further.

A comprehensive and compelling blog post may be all that is needed to fully understand a subject, and this is one such case. So you, dear reader and follower of The Information Age, are again invited to join in this endeavour and lend your attention to the main highlights of the post, which I will now walk through. It is a long post, so I will only highlight what I thought was most relevant. If someone spots other points worth mentioning that I did not, please feel free to comment or e-mail me a suggestion.

Following the blog post, I also found an interesting YouTube video about how Transfer Learning might be implemented with TensorFlow. Below are the video and link; I will comment briefly on it later.

This post begins with some paragraphs outlining the motivation for its very existence: why write a comprehensive piece about Transfer Learning, and why now. This gives us the necessary historical perspective and background:

In recent years, we have become increasingly good at training deep neural networks to learn a very accurate mapping from inputs to outputs, whether they are images, sentences, label predictions, etc. from large amounts of labeled data.

What our models still frightfully lack is the ability to generalize to conditions that are different from the ones encountered during training. When is this necessary? Every time you apply your model not to a carefully constructed dataset but to the real world. The real world is messy and contains an infinite number of novel scenarios, many of which your model has not encountered during training and for which it is in turn ill-prepared to make predictions. The ability to transfer knowledge to new conditions is generally known as transfer learning and is what we will discuss in the rest of this post.

Over the course of this blog post, I will first contrast transfer learning with machine learning’s most pervasive and successful paradigm, supervised learning. I will then outline reasons why transfer learning warrants our attention. Subsequently, I will give a more technical definition and detail different transfer learning scenarios. I will then provide examples of applications of transfer learning before delving into practical methods that can be used to transfer knowledge. Finally, I will give an overview of related directions and provide an outlook into the future.

Figure 1: The traditional supervised learning setup in ML

What is Transfer Learning?

In the classic supervised learning scenario of machine learning, if we intend to train a model for some task and domain A, we assume that we are provided with labeled data for the same task and domain. We can see this clearly in Figure 1, where the task and domain of the training and test data of our model A are the same.

We can now train a model A on this dataset and expect it to perform well on unseen data of the same task and domain. On another occasion, when given data for some other task or domain B, we require again labeled data of the same task or domain that we can use to train a new model B so that we can expect it to perform well on this data.

The traditional supervised learning paradigm breaks down when we do not have sufficient labeled data for the task or domain we care about to train a reliable model.

If we want to train a model to detect pedestrians on night-time images, we could apply a model that has been trained on a similar domain, e.g. on day-time images. In practice, however, we often experience a deterioration or collapse in performance as the model has inherited the bias of its training data and does not know how to generalize to the new domain.

(…)

Figure 2: The transfer learning setup

Transfer learning allows us to deal with these scenarios by leveraging the already existing labeled data of some related task or domain. We try to store this knowledge gained in solving the source task in the source domain and apply it to our problem of interest as can be seen in Figure 2.

In practice, we seek to transfer as much knowledge as we can from the source setting to our target task or domain. This knowledge can take on various forms depending on the data: it can pertain to how objects are composed to allow us to more easily identify novel objects; it can be with regard to the general words people use to express their opinions, etc.

OK. But why is this important now, in the context of developments in deep learning and deep neural networks? I think it isn't difficult to conceptualize that, given how important the transfer of knowledge has been in educational contexts in general, its transposition to a computational context (in spite of the obvious limitations and complexities) might be helpful in the quest for generalization and Artificial General Intelligence (AGI). Indeed, that is what Andrew Ng predicts will happen in the future with Artificial Intelligence: the adoption and improvement of better and better transfer learning models. The author of the post seems to be of a similar opinion, though only up to a point:

Figure 4: Drivers of ML industrial success according to Andrew Ng

In particular, he sketched out a chart on a whiteboard that I’ve sought to replicate as faithfully as possible in Figure 4 above (sorry about the unlabelled axes). According to Andrew Ng, transfer learning will become a key driver of Machine Learning success in industry.

There is a clear optimism in this prediction. Note that what Ng and the chart above show is a prediction about commercial and industrial implementations of machine learning. The link with broader AGI might be a different story, requiring further examination of diverse scenarios and assumptions. Sebastian also provides a more realistic view in the next paragraph, where he contrasts transfer learning with unsupervised learning, regarded by some as a more promising route to AGI:

(…)

It is less clear, however, why transfer learning which has been around for decades and is currently little utilized in industry, will see the explosive growth predicted by Ng. Even more so as transfer learning currently receives relatively little visibility compared to other areas of machine learning such as unsupervised learning and reinforcement learning, which have come to enjoy increasing popularity: Unsupervised learning — the key ingredient on the quest to General AI according to Yann LeCun as can be seen in Figure 5 — has seen a resurgence of interest, driven in particular by Generative Adversarial Networks.

Figure 5: Transfer Learning is conspicuously absent as an ingredient from Yann LeCun's cake

What makes transfer learning different? In the following, we will look at the factors that — in our opinion — motivate Ng’s prognosis and outline the reasons why just now is the time to pay attention to transfer learning.

Following these lines of reasoning to their most probable logical outcome, we can now fully understand why transfer learning is becoming a focus of attention within the communities concerned:

The current use of machine learning in industry is characterised by a dichotomy:
On the one hand, over the course of the last years, we have obtained the ability to train more and more accurate models. We are now at the stage that for many tasks, state-of-the-art models have reached a level where their performance is so good that it is no longer a hindrance for users. How good? The newest residual networks [1] on ImageNet achieve superhuman performance at recognising objects; Google’s Smart Reply [2] automatically handles 10% of all mobile responses; speech recognition error has consistently dropped and is more accurate than typing [3]; we can automatically identify skin cancer as well as dermatologists; Google’s NMT system [4] is used in production for more than 10 language pairs; Baidu can generate realistic sounding speech in real-time; the list goes on and on. This level of maturity has allowed the large-scale deployment of these models to millions of users and has enabled widespread adoption.

(…)

At the same time, when applying a machine learning model in the wild, it is faced with a myriad of conditions which the model has never seen before and does not know how to deal with; each client and every user has their own preferences, possesses or generates data that is different from the data used for training; a model is asked to perform many tasks that are related to but not the same as the task it was trained for. In all of these situations, our current state-of-the-art models, despite exhibiting human-level or even super-human performance on the task and domain they were trained on, suffer a significant loss in performance or even break down completely.

Transfer learning can help us deal with these novel scenarios and is necessary for production-scale use of machine learning that goes beyond tasks and domains where labeled data is plentiful. So far, we have applied our models to the tasks and domains that — while impactful — are the low-hanging fruits in terms of data availability. To also serve the long tail of the distribution, we must learn to transfer the knowledge we have acquired to new tasks and domains.

To be able to do this, we need to understand the concepts that transfer learning involves. For this reason, we will give a more technical definition in the following section.

These are only the introductory sections of this comprehensive, long post. The rest of the post turns to more advanced technical material, which I will only briefly sketch. The part on applications is of particularly obvious interest, no less than the formal definition and the proper methodology of transfer learning. Definitely a machine learning topic to stay tuned to:

A Definition of Transfer Learning

For this definition, we will closely follow the excellent survey by Pan and Yang (2010) [6] with binary document classification as a running example.
Transfer learning involves the concepts of a domain and a task. A domain D consists of a feature space 𝒳 and a marginal probability distribution P(X) over the feature space, where X = {x_1, ⋯, x_n} ∈ 𝒳. For document classification with a bag-of-words representation, 𝒳 is the space of all document representations, x_i is the binary feature of the i-th word and X is a particular document.

Given a domain, D = {𝒳, P(X)}, a task T consists of a label space 𝒴 and a conditional probability distribution P(Y|X) that is typically learned from the training data consisting of pairs x_i ∈ 𝒳 and y_i ∈ 𝒴. In our document classification example, 𝒴 is the set of all labels, i.e. True and False, and y_i is either True or False.

Given a source domain D_S, a corresponding source task T_S, as well as a target domain D_T and a target task T_T, the objective of transfer learning now is to enable us to learn the target conditional probability distribution P(Y_T|X_T) in D_T with the information gained from D_S and T_S, where D_S ≠ D_T or T_S ≠ T_T. In most cases, a limited number of labeled target examples, which is exponentially smaller than the number of labeled source examples, is assumed to be available.

As both the domain and the task are defined as tuples, these inequalities give rise to four transfer learning scenarios, which we will discuss below.

(…)
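To make the notation above concrete, here is a minimal sketch of my own (in Python, not taken from Sebastian's post) that encodes a domain as a pair (feature space, marginal distribution) and a task as a pair (label space, conditional distribution), and checks which components differ between source and target. The string descriptions are purely illustrative placeholders; the four possible differences correspond to the four scenarios mentioned above.

from dataclasses import dataclass

@dataclass(frozen=True)
class Domain:
    feature_space: str     # e.g. "bag-of-words over an English vocabulary"
    marginal_dist: str     # e.g. "news articles" vs "product reviews"

@dataclass(frozen=True)
class Task:
    label_space: str       # e.g. "{True, False}"
    conditional_dist: str  # e.g. "topic labels" vs "sentiment labels"

def transfer_differences(d_s: Domain, d_t: Domain, t_s: Task, t_t: Task):
    """List which components differ between the source and target settings."""
    diffs = []
    if d_s.feature_space != d_t.feature_space:
        diffs.append("X_S != X_T (e.g. documents written in two different languages)")
    if d_s.marginal_dist != d_t.marginal_dist:
        diffs.append("P(X_S) != P(X_T) (e.g. documents discussing different topics)")
    if t_s.label_space != t_t.label_space:
        diffs.append("Y_S != Y_T (the two tasks use different label sets)")
    if t_s.conditional_dist != t_t.conditional_dist:
        diffs.append("P(Y_S|X_S) != P(Y_T|X_T) (e.g. different class balance in source and target)")
    return diffs

Traditional supervised learning corresponds to the case where this list is empty; transfer learning is needed as soon as at least one component differs.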

I skip the detailed section on the scenarios and jump to the applications section:

Learning from simulations

One particular application of transfer learning that I’m very excited about and that I assume we’ll see more of in the future is learning from simulations. For many machine learning applications that rely on hardware for interaction, gathering data and training a model in the real world is either expensive, time-consuming, or simply too dangerous. It is thus advisable to gather data in some other, less risky way.

(…)

Figure 7: Udacity's self-driving car simulator (source: TechCrunch)

Learning from simulations has the benefit of making data gathering easy as objects can be easily bounded and analyzed, while simultaneously enabling fast training, as learning can be parallelized across multiple instances. Consequently, it is a prerequisite for large-scale machine learning projects that need to interact with the real world, such as self-driving cars (Figure 6). According to Zhaoyin Jia, Google's self-driving car tech lead, “Simulation is essential if you really want to do a self-driving car”. Udacity has open-sourced the simulator it uses for teaching its self-driving car engineer nanodegree, which can be seen in Figure 7, and OpenAI's Universe will potentially also allow training a self-driving car using GTA 5 or other video games.

Another area where learning from simulations is key is robotics: Training models on a real robot is too slow and robots are expensive to train. Learning from a simulation and transferring the knowledge to the real-world robot alleviates this problem and has recently been garnering additional interest [8]. An example of a data manipulation task in the real world and in a simulation can be seen in Figure 8.

(…)

Figure 9: Facebook AI Research's CommAI-env (Mikolov et al., 2015)

Finally, another direction where simulation will be an integral part is on the path towards general AI. Training an agent to achieve general artificial intelligence directly in the real world is too costly and hinders learning initially through unnecessary complexity. Rather, learning may be more successful if it is based on a simulated environment such as CommAI-env [9] that is visible in Figure 9.

Adapting to new domains

While learning from simulations is a particular instance of domain adaptation, it is worth outlining some other examples of domain adaptation.

Domain adaptation is a common requirement in vision as often the data where labeled information is easily accessible and the data that we actually care about are different, whether this pertains to identifying bikes as in Figure 10 or some other objects in the wild. Even if the training and the test data look the same, the training data may still contain a bias that is imperceptible to humans but which the model will exploit to overfit on the training data [10].

Another common domain adaptation scenario pertains to adapting to different text types: Standard NLP tools such as part-of-speech taggers or parsers are typically trained on news data such as the Wall Street Journal, which has historically been used to evaluate these models. Models trained on news data, however, have difficulty coping with more novel text forms such as social media messages and the challenges they present.

Finally, while the above challenges deal with general text or image types, problems are amplified if we look at domains that pertain to individual or groups of users: Consider the case of automatic speech recognition (ASR). Speech is poised to become the next big platform, with 50% of all our searches predicted to be performed by voice by 2020. Most ASR systems are evaluated traditionally on the Switchboard dataset, which comprises 500 speakers. Most people with a standard accent are thus fortunate, while immigrants, people with non-standard accents, people with a speech impediment, or children have trouble being understood. Now more than ever do we need systems that are able to adapt to individual users and minorities to ensure that everyone’s voice is heard.

The paragraph above describes a particularly relevant application of transfer learning models to political and/or social integration issues: a truly inclusive, socially sensitive application of sophisticated technology, and a lesson to the common naysayers and Luddites still lingering in our societies.

The rest of the post is full of further substantive topics and issues about transfer learning. For obvious ethical reasons (I will not turn this into a stupefying copy-paste waste of time) and because of the limited space of this blog, I will finish the review here. I strongly encourage everyone to read the post through to its conclusion, as it will of course be an instructive and rewarding way to spend your time.

I turn now to the video I mentioned earlier about the TensorFlow take on Transfer Learning.

The video above is a nice overview of a TensorFlow implementation of Transfer Learning. It is by Syed Ahmed, who is a Computer Vision Developer at Ahold USA.

One interesting issue that pops up in the first minutes of this talk is the fact that we have at our disposal phenomenal and powerful open source tools to do interesting things. Even if we do not have a dataset readily available, nowadays we are actually able to produce data for our purposes… We already knew about this on this blog, from when we reviewed some papers on data augmentation. Syed mentions a paper from Andrew Ng about a fast data processing idea applied to image classification.
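As an aside, here is a minimal sketch of the kind of data generation via image augmentation alluded to above, using tf.keras preprocessing layers. This is my own illustration, not code from the talk, and it assumes a recent TensorFlow 2.x release where these layer names are available.

import tensorflow as tf

# A small pipeline of random transformations applied on the fly to image tensors.
augment = tf.keras.Sequential([
    tf.keras.layers.RandomFlip("horizontal"),
    tf.keras.layers.RandomRotation(0.1),
    tf.keras.layers.RandomZoom(0.1),
])

def augment_dataset(dataset: tf.data.Dataset) -> tf.data.Dataset:
    """Map the augmentation pipeline over (image, label) pairs of a tf.data.Dataset."""
    return dataset.map(
        lambda image, label: (augment(image, training=True), label),
        num_parallel_calls=tf.data.AUTOTUNE,
    )

Each training epoch then sees slightly different versions of the same images, which effectively multiplies the training data without collecting anything new.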

From here we just need to appreciate how close the techniques of transfer learning are to the implementation strategies of Convolutional Neural Networks (ConvNets). Indeed, the TensorFlow project presents transfer learning as a technique to learn in order to understand it properly; it has been a featured technique of the project from the outset.
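To make that closeness visible, here is a minimal sketch of the typical ConvNet transfer learning recipe in TensorFlow/Keras. It is my own illustration of the general approach, not the exact code from the talk; the number of classes and the dataset names in the comments are hypothetical. A feature extractor pretrained on ImageNet is frozen and a small new classification head is trained on the target task.

import tensorflow as tf

NUM_CLASSES = 5  # hypothetical number of classes in the target task

# Reuse convolutional features learned on ImageNet (the source domain and task).
base_model = tf.keras.applications.MobileNetV2(
    input_shape=(224, 224, 3), include_top=False, weights="imagenet"
)
base_model.trainable = False  # freeze the transferred knowledge

model = tf.keras.Sequential([
    base_model,
    tf.keras.layers.GlobalAveragePooling2D(),
    tf.keras.layers.Dropout(0.2),
    tf.keras.layers.Dense(NUM_CLASSES, activation="softmax"),  # new task-specific head
])

model.compile(
    optimizer=tf.keras.optimizers.Adam(1e-3),
    loss="sparse_categorical_crossentropy",
    metrics=["accuracy"],
)

# model.fit(target_train_ds, validation_data=target_val_ds, epochs=5)
# Optionally, unfreeze the top layers of base_model afterwards and continue
# training with a much smaller learning rate (fine-tuning).

Freezing the base model keeps the knowledge transferred from the source task intact while only the small head adapts to the scarce target labels, which is exactly the setting described in Sebastian's definition above.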

The final thing worth mentioning from this talk is the list of points Syed Ahmed makes about when to use the transfer learning technique and when not to. Good advice for not wasting time and effort.

featured image: Boosting based Transfer Learning

