Rocks the Piano
Google’s DeepMind brought us artificial intelligence systems that can play Atari classics and the complex game of Go as well as — no, better than — humans.
Now, the artificial intelligence research firm is at it again. This time, its machines are getting really good at sounding like humans.
In a blog post Thursday, DeepMind unveiled WaveNet, an artificial intelligence system that the company says cuts the gap between existing text-to-speech technologies and human speech by more than 50 percent. WaveNet learns from raw audio files and then generates digital sound waves that resemble those produced by the human voice, an entirely different approach from the one existing systems take.
Someday, man and machine will routinely strike up conversations with each other. We’re not there yet, but natural language processing is a scorching hot area of AI research — Amazon, Apple, Google and Microsoft are all in pursuit of savvy digital assistants that can verbally help us interact with our devices.
Right now, computers are pretty good listeners, because deep learning algorithms have taken speech recognition to a new level. But computers still aren’t very good speakers. Most text-to-speech systems are still based on concatenative TTS — basically, cobbling words together from a massive database of sound fragments.
Other systems form a voice electronically, based on rules about how letter combinations are pronounced. Both approaches yield rather robot-y sounding voices. WaveNet is different.
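To make the concatenative idea concrete, here is a toy sketch in Python. It is not how any production system works: the tiny `FRAGMENTS` table, its made-up sample values and the `concatenative_tts` function are all hypothetical, and real systems stitch together far smaller units of sound than whole words.

```python
# Toy illustration of concatenative synthesis: each word maps to a
# prerecorded fragment (here, a short list of made-up sample values).
FRAGMENTS = {
    "hello": [0.0, 0.3, 0.5, 0.2],
    "world": [0.1, -0.4, -0.2, 0.0],
}

def concatenative_tts(text, fragments=FRAGMENTS):
    """Join stored fragments end to end, one per word."""
    waveform = []
    for word in text.lower().split():
        if word not in fragments:
            # The database can only "say" what was recorded in advance.
            raise KeyError(f"no recorded fragment for {word!r}")
        waveform.extend(fragments[word])
    return waveform

audio = concatenative_tts("Hello world")
print(len(audio))  # 8 samples: two fragments of 4 samples each
```

The hard seams where one fragment ends and the next begins are one plausible reason stitched-together voices can sound choppy.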
Flexing Those Computing Muscles
WaveNet is an artificial neural network that, at least on paper, resembles the architecture of the human brain. Data inputs flow through layers of interconnected nodes, the “neurons,” to produce an output. This allows computers to process mountains of data and recognize patterns that would perhaps take humans a lifetime to uncover.
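For readers who want to see what "inputs flowing through layers of nodes" looks like, here is a minimal sketch in Python. It is a generic two-layer network with hand-picked, hypothetical weights, not WaveNet's actual architecture; the names `layer`, `forward` and `NETWORK` are made up for illustration.

```python
import math

def layer(inputs, weights, biases):
    """One layer: each node takes a weighted sum of its inputs, then tanh."""
    return [
        math.tanh(sum(w * x for w, x in zip(node_weights, inputs)) + b)
        for node_weights, b in zip(weights, biases)
    ]

def forward(inputs, layers):
    """Pass the data through each layer in turn."""
    for weights, biases in layers:
        inputs = layer(inputs, weights, biases)
    return inputs

# Hypothetical weights: 2 inputs -> 3 hidden nodes -> 1 output node.
NETWORK = [
    ([[0.5, -0.2], [0.1, 0.4], [-0.3, 0.8]], [0.0, 0.1, -0.1]),
    ([[1.0, -1.0, 0.5]], [0.0]),
]

output = forward([1.0, 2.0], NETWORK)
print(output)  # a single number between -1 and 1
```

In a real network the weights are not chosen by hand. They are adjusted automatically during training until the outputs match the examples the network is shown.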
To model speech, WaveNet was fed real waveforms of English and Mandarin speech. These waveforms are loaded with data points, roughly 16,000 samples for every second of audio, and WaveNet digests them all.
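A quick back-of-the-envelope calculation shows how fast those data points pile up. The sketch below assumes a flat rate of 16,000 samples per second, the rough figure quoted above; the `samples_in` helper is a hypothetical name used only for this illustration.

```python
SAMPLE_RATE_HZ = 16_000  # samples per second (assumed, per the rough figure above)

def samples_in(seconds, sample_rate=SAMPLE_RATE_HZ):
    """Number of data points in a clip of the given length."""
    return int(seconds * sample_rate)

print(samples_in(1))   # 16000 data points in one second of speech
print(samples_in(60))  # 960000 data points in one minute
```

At that rate, a single minute of speech already contains close to a million data points.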