One of the important recent developments around deep neural networks was DeepMind's WaveNet. It may have passed unnoticed or unrecognized by many, amidst the cacophony of hype and other obviously worthy developments. But The Information Age thought it appropriate to do it further justice by highlighting it in this post.
London's DeepMind is one of the most interesting and significant initiatives advancing deep learning research and development, and seeking to apply it to business and fundamental research alike. WaveNet is one of many recent breakthroughs from this excellent institution. These days we should not be shy or afraid to praise, highlight and publicize excellence in R&D and in major science-and-technology developments, even when we hear negative voices or the incensed noise of ignorance and arrogance around us…
WaveNet is an important and significant achievement because it changes a former paradigm in the practice of signal processing, specifically sound processing. It replaces how things were done before with a much improved procedure, which is what all genuine achievements should do.
DeepMind also has an excellent blog, where the latest significant outcomes from the research lab are presented. WaveNet was one such development that deserved a proper blog post, and that is the post I would like to highlight again here today. It is full of links, sound clips and images, visually appealing, and friendly to the reader who wants to parse through it and check further what it is all about. This is R&D and science-and-technology marketing of the highest degree, in my modest opinion. But the links provided do more than just marketing; this is also serious, un-hyped scholarship in some form or another.

I sometimes have a hard time, with my patience tested to the limit, listening to any voice that disapproves of or insists only upon the dangers of the Internet and the highly connected world we live in today. It is true that those dangers exist and shouldn't be ignored; but I wonder whether it has ever occurred to the Luddites of the twenty-first century that danger is universally a feature of life in any era or period of human history. Every era and period of history has had its own dangers and problems, and our age of high connectivity is no different from any other in this regard.
I prefer to highlight the advantages of our digital/internet age, and posts such as this one from DeepMind are good examples of those advantages. With it we get several years' worth of advanced signal processing, audio processing and computer science research, all wrapped up in a blog post that in the end even gives us a small (I hope you noticed my intended irony here…) link to the paper, which The Information Age also reproduces below, with its abstract and some of its images, for the record and for the enjoyment of all followers:
Allowing people to converse with machines is a long-standing dream of human-computer interaction. The ability of computers to understand natural speech has been revolutionised in the last few years by the application of deep neural networks (e.g., Google Voice Search). However, generating speech with computers — a process usually referred to as speech synthesis or text-to-speech (TTS) — is still largely based on so-called concatenative TTS, where a very large database of short speech fragments are recorded from a single speaker and then recombined to form complete utterances. This makes it difficult to modify the voice (for example switching to a different speaker, or altering the emphasis or emotion of their speech) without recording a whole new database.
This has led to a great demand for parametric TTS, where all the information required to generate the data is stored in the parameters of the model, and the contents and characteristics of the speech can be controlled via the inputs to the model. So far, however, parametric TTS has tended to sound less natural than concatenative. Existing parametric models typically generate audio signals by passing their outputs through signal processing algorithms known as vocoders.
WaveNet changes this paradigm by directly modelling the raw waveform of the audio signal, one sample at a time. As well as yielding more natural-sounding speech, using raw waveforms means that WaveNet can model any kind of audio, including music.
I hope the reader has by now, from the very first paragraphs of the post, noticed what WaveNet accomplishes. Of special interest is that WaveNet turns parametric modelling of data into something that is not a mathematical black box of abstract parameters crunching vast amounts of data and outputting information, but something closer to natural output as perceived by natural brains: even if it is still an artificial manipulation of information, it moves in the direction of how real biological systems deal with data in natural settings. And it gets more interesting, because this work is supported and inspired by earlier work on vision processing; that is, we get a harmonious mix of techniques from other signal processing research with the one-dimensional sound processing of WaveNet:
Researchers usually avoid modelling raw audio because it ticks so quickly: typically 16,000 samples per second or more, with important structure at many time-scales. Building a completely autoregressive model, in which the prediction for every one of those samples is influenced by all previous ones (in statistics-speak, each predictive distribution is conditioned on all previous observations), is clearly a challenging task.
However, our PixelRNN and PixelCNN models, published earlier this year, showed that it was possible to generate complex natural images not only one pixel at a time, but one colour-channel at a time, requiring thousands of predictions per image. This inspired us to adapt our two-dimensional PixelNets to a one-dimensional WaveNet.
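That "conditioned on all previous observations" phrasing is just the chain rule of probability: the joint probability of a waveform factorises into one conditional distribution per sample. A toy illustration, with made-up per-sample probabilities of my own rather than anything from the paper:

```python
import numpy as np

# Chain rule: p(x_1, ..., x_T) = product over t of p(x_t | x_1 ... x_{t-1}).
# Hypothetical conditional probabilities for a 4-sample signal:
conditionals = [0.9, 0.8, 0.7, 0.6]   # p(x_t | all previous samples)

joint = np.prod(conditionals)
print(round(joint, 4))   # 0.3024
```

At 16,000 samples per second, one second of audio means 16,000 such sequential factors, which is exactly why the quote calls this "clearly a challenging task".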
There is the usual caveat about the computationally expensive requirements of deep convolutional neural networks (CNNs), though:
The above animation shows how a WaveNet is structured. It is a fully convolutional neural network, where the convolutional layers have various dilation factors that allow its receptive field to grow exponentially with depth and cover thousands of timesteps.
At training time, the input sequences are real waveforms recorded from human speakers. After training, we can sample the network to generate synthetic utterances. At each step during sampling a value is drawn from the probability distribution computed by the network. This value is then fed back into the input and a new prediction for the next step is made. Building up samples one step at a time like this is computationally expensive, but we have found it essential for generating complex, realistic-sounding audio.
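The two ideas in that quoted passage, the exponentially growing receptive field and the sample-by-sample feedback loop, can be sketched in a few lines. The kernel size of 2, the doubling dilations and the 256-level quantisation follow the WaveNet paper, but the depth and `toy_model` below are illustrative stand-ins of my own, not DeepMind's code:

```python
import numpy as np

rng = np.random.default_rng(42)
NUM_LEVELS = 256   # the paper quantises each audio sample to 256 values

# Receptive field of stacked dilated causal convolutions with kernel
# size 2 and dilations doubling each layer, as described in the quote:
dilations = [2 ** i for i in range(10)]   # 1, 2, 4, ..., 512
receptive_field = 1 + sum(dilations)      # (kernel_size - 1) * d, summed, plus 1
print(receptive_field)                    # 1024 timesteps from only 10 layers

def toy_model(history):
    """Stand-in for the trained network: a probability distribution
    over the next sample's quantised value, given the history."""
    logits = rng.standard_normal(NUM_LEVELS)
    return np.exp(logits) / np.exp(logits).sum()

# The sampling loop from the quote: draw a value, feed it back in.
samples = []
for _ in range(100):
    probs = toy_model(samples)                       # distribution from the net
    samples.append(rng.choice(NUM_LEVELS, p=probs))  # draw, then feed back

print(len(samples))   # 100 strictly sequential (hence slow) predictions
```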
Improving the state of the art
We trained WaveNet using some of Google’s TTS datasets so we could evaluate its performance. The following (above) figure shows the quality of WaveNets on a scale from 1 to 5, compared with Google’s current best TTS systems (parametric and concatenative), and with human speech using Mean Opinion Scores (MOS). MOS are a standard measure for subjective sound quality tests, and were obtained in blind tests with human subjects (from over 500 ratings on 100 test sentences). As we can see, WaveNets reduce the gap between the state of the art and human-level performance by over 50% for both US English and Mandarin Chinese.
For both Chinese and English, Google’s current TTS systems are considered among the best worldwide, so improving on both with a single model is a major achievement.
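The "reduces the gap ... by over 50%" arithmetic is easy to reproduce with made-up MOS values; the real figures are in DeepMind's chart and in the paper:

```python
# Hypothetical MOS values on the 1-to-5 scale (illustrative only):
best_previous = 3.9   # best prior TTS system
wavenet = 4.3         # WaveNet
human = 4.5           # natural human speech

old_gap = human - best_previous        # ~0.6
new_gap = human - wavenet              # ~0.2
reduction = 1 - new_gap / old_gap
print(f"gap reduced by {reduction:.0%}")   # gap reduced by 67%
```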
Following in the post there are sound clips where the reader can hear this nice WaveNet achievement for themselves.
Another somewhat novel aspect of WaveNet is the network's ability to perform a kind of transfer learning. To exercise this, the CNN needs to be told which text it should reproduce. The sound samples in the post make it possible to check this, and also to confirm that the model improves when trained on inputs from many speakers rather than on a single sample from the speaker it is meant to reproduce:
Knowing What to Say
In order to use WaveNet to turn text into speech, we have to tell it what the text is. We do this by transforming the text into a sequence of linguistic and phonetic features (which contain information about the current phoneme, syllable, word, etc.) and by feeding it into WaveNet. This means the network’s predictions are conditioned not only on the previous audio samples, but also on the text we want it to say.
If we train the network without the text sequence, it still generates speech, but now it has to make up what to say. As you can hear from the samples below, this results in a kind of babbling, where real words are interspersed with made-up word-like sounds:
As you can hear from these samples, a single WaveNet is able to learn the characteristics of many different voices, male and female. To make sure it knew which voice to use for any given utterance, we conditioned the network on the identity of the speaker. Interestingly, we found that training on many speakers made it better at modelling a single speaker than training on that speaker alone, suggesting a form of transfer learning.
By changing the speaker identity, we can use WaveNet to say the same thing in different voices:
Similarly, we could provide additional inputs to the model, such as emotions or accents, to make the speech even more diverse and interesting.
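Both kinds of conditioning described above, local conditioning on the linguistic features of the text and global conditioning on the speaker's identity, amount to giving the network extra inputs alongside the previous audio samples. A minimal sketch, with weights, shapes and names that are illustrative assumptions of mine rather than the paper's:

```python
import numpy as np

rng = np.random.default_rng(0)
HIDDEN = 32

# Hypothetical projection matrices standing in for learned weights:
W_audio   = rng.standard_normal((HIDDEN, HIDDEN))
W_text    = rng.standard_normal((HIDDEN, 16))   # linguistic features (local)
W_speaker = rng.standard_normal((HIDDEN, 4))    # one-hot speaker id (global)

def conditioned_activation(audio_h, text_feat, speaker_onehot):
    """An activation that depends on the previous audio AND on the
    text to be said AND on which speaker should say it."""
    return np.tanh(W_audio @ audio_h
                   + W_text @ text_feat
                   + W_speaker @ speaker_onehot)

h = conditioned_activation(
    rng.standard_normal(HIDDEN),   # summary of previous audio samples
    rng.standard_normal(16),       # phoneme / syllable / word features
    np.eye(4)[2],                  # "speaker #2" of four speakers
)
print(h.shape)   # (32,)
```

Leaving out the text input corresponds to the "babbling" experiment above, and swapping the one-hot vector corresponds to saying the same thing in a different voice.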
When making music, WaveNet is just amazing, reproducing piano sounds of a sophistication on a par with those of highly skilled musicians:
Since WaveNets can be used to model any audio signal, we thought it would also be fun to try to generate music. Unlike the TTS experiments, we didn’t condition the networks on an input sequence telling it what to play (such as a musical score); instead, we simply let it generate whatever it wanted to. When we trained it on a dataset of classical piano music, it produced fascinating samples like the ones below:
WaveNets open up a lot of possibilities for TTS, music generation and audio modelling in general. The fact that directly generating timestep per timestep with deep neural networks works at all for 16kHz audio is really surprising, let alone that it outperforms state-of-the-art TTS systems. We are excited to see what we can do with them next.
I close this deeply rewarding post with the link to the paper and its abstract. The paper's long list of authors offers a glimpse of the amount of work required to achieve this outcome. That is not always indicative, but in this case it really is a team effort of the best sort, and the output speaks, and sounds, for itself. It rewards the wider audio and signal processing communities, and the authors themselves.
This paper introduces WaveNet, a deep neural network for generating raw audio waveforms. The model is fully probabilistic and autoregressive, with the predictive distribution for each audio sample conditioned on all previous ones; nonetheless we show that it can be efficiently trained on data with tens of thousands of samples per second of audio. When applied to text-to-speech, it yields state-of-the-art performance, with human listeners rating it as significantly more natural sounding than the best parametric and concatenative systems for both English and Mandarin. A single WaveNet can capture the characteristics of many different speakers with equal fidelity, and can switch between them by conditioning on the speaker identity. When trained to model music, we find that it generates novel and often highly realistic musical fragments. We also show that it can be employed as a discriminative model, returning promising results for phoneme recognition.
featured image: WaveNet by Google DeepMind | Two Minute Papers