How were the fundamental improvements in Voice Recognition achieved?

  • #1
berkeman
Mentor

Summary:

There has been a fundamental improvement in Voice Recognition over the past couple of decades. What was the key to these breakthroughs?

Main Question or Discussion Point

I remember about 20 years ago, a colleague had to start using the Dragon Voice Recognition software on his EE design PC because he had developed really bad carpal tunnel pain, and he had to train the software to recognize his voice and a limited set of phrases. That was the state of the art not too long ago.

But now, Voice Recognition software has advanced to the point where multiple speakers can speak at normal speed, and devices like cellphones, Alexa, and Siri can usually get the interpretation correct. What has led to this huge step? I remember studying Voice Recognition algorithms back in the early Dragon days, and marveling at how complex things like phoneme recognition were. Were the big advances due mostly to increased computing power? Or some other adaptive learning techniques?

This article seems to address part of my question, but I'm still not understanding the fundamental leap that got us from Dragon to Alexa... Thanks for your insights.



Answers and Replies

  • #2
phinds
Science Advisor
Insights Author
Gold Member
2019 Award
I would not be surprised if the details of the algorithms are highly proprietary.
 
  • #3
I used to be an enthusiastic user of Google Voice Search (they called it Google 411). It was the perfect machine learning platform, because people with many different accents and backgrounds would ask similar questions, and it would get immediate feedback from the users on whether the recognition was correct. If the user replied "yes, connect the call", it was successful. If not, the user would retry.

From that, I assumed that it was just a neural net they were training, or perhaps multiple cooperating nets: one working on words, the other on semantics. "Please give me the number for pizza in Conway, South Carolina." The question has a predictable structure. It can be expected to have object words ("pizza"), location words ("Conway"), and filler words ("please give me"). So the word guesses could be reinforced by the semantic structure.
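To make that idea concrete, here is a toy sketch (purely illustrative, not Google's actual method) of how a semantic template for a 411-style request could rescore competing word guesses. The word lists, scores, and bonus weights are all made up:

```python
import re

# Toy illustration: rescore acoustic guesses using the expected
# "<business> in <city>" structure of a directory-assistance request.
KNOWN_BUSINESSES = {"pizza", "pharmacy", "florist"}
KNOWN_CITIES = {"conway", "columbia", "charleston"}

def rescore(hypotheses):
    """hypotheses: list of (transcript, acoustic_score) pairs."""
    best = None
    for text, score in hypotheses:
        words = set(re.findall(r"[a-z']+", text.lower()))
        # Semantic bonus: reward transcripts that fill the expected slots
        bonus = 0.2 * bool(words & KNOWN_BUSINESSES) + 0.2 * bool(words & KNOWN_CITIES)
        total = score + bonus
        if best is None or total > best[1]:
            best = (text, total)
    return best

# The acoustically preferred (but wrong) guess loses after semantic rescoring
print(rescore([("please give me the number for piece of in Conway", 0.55),
               ("please give me the number for pizza in Conway", 0.50)]))
```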

But after several years, they abruptly discontinued the service. A press statement said that they had enough data.

"Breakthroughs" in machine learning are usually of the nature that they learn faster. But if you have enough data, enough money, enough time, even pre-breakthrough learning projects can succeed.
 
  • #4
Summary:: There has been a fundamental improvement in Voice Recognition over the past couple of decades. What was the key to these breakthroughs?

Were the big advances due mostly to increased computing power? Or some other adaptive learning techniques?
Sure, computing power is one of the key factors. But R&D in the area of recurrent neural networks probably had an even greater influence on recent successes in speech recognition applications. The LSTM architecture in particular deserves to be highlighted:

As of 2016, major technology companies including Google, Apple, and Microsoft were using LSTMs as fundamental components in new products.[12] For example, Google used LSTM for speech recognition on the smartphone,[13][14] for the smart assistant Allo[15] and for Google Translate.[16][17] Apple uses LSTM for the "Quicktype" function on the iPhone[18][19] and for Siri.[20] Amazon uses LSTM for Amazon Alexa.[21]
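For a sense of what an LSTM-based recognizer looks like at the code level, here is a minimal PyTorch sketch of a framewise acoustic model: spectrogram frames in, per-frame phoneme scores out. The layer sizes and the 40-phoneme output are illustrative assumptions, not any vendor's actual configuration:

```python
import torch
import torch.nn as nn

class SpeechLSTM(nn.Module):
    """Minimal framewise acoustic model: spectrogram frames -> phoneme scores."""
    def __init__(self, n_features=80, n_hidden=256, n_phonemes=40):
        super().__init__()
        # Bidirectional LSTM reads the utterance in both directions
        self.lstm = nn.LSTM(n_features, n_hidden, num_layers=2,
                            bidirectional=True, batch_first=True)
        self.classifier = nn.Linear(2 * n_hidden, n_phonemes)

    def forward(self, x):            # x: (batch, time, n_features)
        out, _ = self.lstm(x)        # out: (batch, time, 2 * n_hidden)
        return self.classifier(out)  # per-frame phoneme logits

# One 3-second utterance at 100 frames/s with 80 filterbank features per frame
frames = torch.randn(1, 300, 80)
logits = SpeechLSTM()(frames)        # shape: (1, 300, 40)
```

A real recognizer would train something like this with a CTC or cross-entropy loss on labeled utterances and combine it with a language model for decoding.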
 
  • #5
256bits
Gold Member
Machine learning, yes, that has jumped ahead.
But what is the machine learning operating on - audio frequencies, phonetics, phrases, whole words?
There still has to be some electronics between the spoken word and the AI, and what is it that they are picking out to tell the difference between, say, the words "one" and "two", with very little time delay it would seem?
A fast Fourier transform, or something different, to analyze the sound?

(PS. I just thought of this - has anybody done any tinkering with animal sounds - a cat's meow, a dog's bark, a pig's grunt, or any of the other everyday sounds we hear - a door slam, a car engine, a police siren?)
 
  • #6
256bits
Gold Member
I would not be surprised if the details of the algorithms are highly proprietary.
A closely guarded secret.
Have they filed any patents? Others would then be able to find out what they are really up to.
 
  • #7
harborsparrow
Gold Member
It is a combination of many factors, including cloud computing (in the case of smart speakers at home, for example, the recognition takes place back in the cloud rather than locally), machine learning, Bayesian training, specific heuristics, and last but not least, crowdsourcing (used by Google's language learning algorithms). Quite a few details of how Google does it can be learned from their tech talks at conventions.

Note also that before voice recognition could become ubiquitous, language reading, recognition, and understanding by software had to improve. It all takes a lot of processing power, and it must be fast, which is why smart speakers and phones encourage you to "train" your recognizer for your particular voice.
 
  • #9
It is a combination of many factors, including cloud computing (in the case of smart speakers at home, for example, the recognition takes place back in the cloud rather than locally) ... It all takes a lot of processing power, and it must be fast, which is why smart speakers and phones encourage you to "train" your recognizer for your particular voice.
The Google assistant app recognizes "Hey Google" locally. It has a Settings interface to train the necessary local recognition in case the default doesn't work well enough. The app listens constantly into a buffer a few seconds long, and sends the few seconds preceding "Hey Google" along with the speech that follows, up to a pause long enough to mark the end of the message. If I don't say "Hey Google", the app doesn't send anything, but what it does with whatever I do send is not open to my scrutiny.

I prefer the behavior of the Google browser page, where Google listens to nothing until I press the mic button. But so far, Google refuses to provide an option for that behavior in its phone app, so if I want that behavior from Google on my phone, I go to the Google page from a browser.
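As an illustration of that buffering behavior, here is a hypothetical sketch (not Google's code; the frame sizes, pause threshold, and callback names are invented) of a wake-word gate that keeps a rolling local buffer and uploads only after the hotword fires:

```python
from collections import deque

FRAME_MS = 20                                  # process audio in 20 ms frames
FRAMES_PER_SEC = 1000 // FRAME_MS
PRE_ROLL = deque(maxlen=3 * FRAMES_PER_SEC)    # last ~3 s of audio, kept on-device

def run(audio_frames, is_wake_word, is_silence, send_to_cloud):
    """Hypothetical wake-word gating loop.

    audio_frames  -- iterable of raw audio frames
    is_wake_word  -- local detector returning True when the hotword fires
    is_silence    -- detector used to find the end of the spoken request
    send_to_cloud -- callback that receives the captured utterance
    """
    capturing, utterance, silent_frames = False, [], 0
    for frame in audio_frames:
        PRE_ROLL.append(frame)                 # rolling buffer; nothing uploaded yet
        if not capturing:
            if is_wake_word(frame):
                capturing = True
                utterance = list(PRE_ROLL)     # include the audio just before the hotword
        else:
            utterance.append(frame)
            silent_frames = silent_frames + 1 if is_silence(frame) else 0
            if silent_frames > FRAMES_PER_SEC: # ~1 s pause ends the request
                send_to_cloud(utterance)
                capturing, utterance, silent_frames = False, [], 0
```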
 
  • #10
atyy
Science Advisor
Sure, computing power is one of the key factors. But R&D in the area of recurrent neural networks probably had an even greater influence on recent successes in speech recognition applications. The LSTM architecture in particular deserves to be highlighted:
Yes, it was deep learning. The only minor point of disagreement is whether it was more deep learning than computing power, since deep learning and LSTMs are old. One additional factor is more data for training.

The technology before deep learning was hidden Markov models, and the first deep learning successes in speech recognition used hybrid deep learning/HMM architectures. I believe the current algorithms are pure deep learning algorithms.
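To illustrate the HMM side of those hybrid systems, here is a toy Viterbi decode over a three-state, left-to-right phone model. In a hybrid system the per-frame state scores would come from a neural network rather than Gaussian mixtures; all the numbers below are made up for illustration:

```python
import numpy as np

# Left-to-right phone HMM: each state can stay or move to the next state
log_trans = np.log(np.array([[0.7, 0.3, 0.0],
                             [0.0, 0.7, 0.3],
                             [0.0, 0.0, 1.0]]) + 1e-12)

# Per-frame state likelihoods (in a hybrid system, neural-net posteriors)
log_obs = np.log(np.array([[0.8, 0.1, 0.1],
                           [0.6, 0.3, 0.1],
                           [0.1, 0.7, 0.2],
                           [0.1, 0.2, 0.7]]))

def viterbi(log_obs, log_trans):
    """Return the most likely state sequence through the HMM."""
    T, S = log_obs.shape
    score = np.full((T, S), -np.inf)
    back = np.zeros((T, S), dtype=int)
    score[0] = log_obs[0] + np.log([1.0, 1e-12, 1e-12])   # start in state 0
    for t in range(1, T):
        for s in range(S):
            prev = score[t - 1] + log_trans[:, s]
            back[t, s] = np.argmax(prev)
            score[t, s] = prev[back[t, s]] + log_obs[t, s]
    path = [int(np.argmax(score[-1]))]
    for t in range(T - 1, 0, -1):                         # backtrack
        path.append(int(back[t, path[-1]]))
    return path[::-1]

print(viterbi(log_obs, log_trans))   # -> [0, 0, 1, 2]
```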
 
  • #11
Also very important for voice recognition technology (including MyTalk, which won the Computerworld Smithsonian Award for the first commercially successful voice recognition consumer product), and for many other technologies we now take for granted, were the many innovations of General Magic, a company which was the subject of a critically acclaimed 2018 documentary.
 
  • #12
Yes, it was deep learning. The only minor point of disagreement is whether it was more deep learning than computing power, since deep learning and LSTMs are old. One additional factor is more data for training.
You are right that neural networks are old. LSTM itself is also old, though not as old; some updates to the architecture were made in 2000, and its modified version, the GRU, was introduced only in 2014. Another point is that today's neural networks can be composed of many, many hidden layers and therefore perform better. That was not possible decades ago, as training such networks would have taken ages. Increased computing power indeed helped to overcome this issue, but the success also came together (or in parallel) with advances in deep learning (for example, ReLU began to replace the sigmoid as the activation unit only after 2009, which had a big impact on overall performance, afaik).
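A quick numerical illustration of why that switch mattered (a toy comparison, not tied to any particular speech model): the sigmoid's gradient is at most 0.25 and vanishes for large inputs, which shrinks gradients layer by layer in deep networks, while ReLU passes gradients through unchanged for active units.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_grad(x):
    s = sigmoid(x)
    return s * (1 - s)            # at most 0.25, near zero for large |x|

def relu_grad(x):
    return (x > 0).astype(float)  # exactly 1 for active units

x = np.array([-4.0, -1.0, 0.5, 3.0])
print(sigmoid_grad(x))            # ~[0.018 0.197 0.235 0.045]
print(relu_grad(x))               # [0. 0. 1. 1.]
```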

Maybe I was too quick to say which of the factors had the greatest influence. I think we could agree that it was a combination of several factors, especially computing power and advances in deep learning. Regarding the other factor, the amount of data for training: it is true that in general there is much more labeled data today than before, so deep learning models can generally be trained with smaller generalization error. But in the case of speech recognition, I am not sure whether this was the key factor. I believe that many audio datasets including transcriptions were available decades ago. For example, the TIMIT dataset was created back in 1986 and is still used today as a benchmark when speech recognition solutions are compared. Just my opinion.
 
  • #13
283
143
Machine learning, yes, that has jumped ahead.
But what is the machine learning operating on - audio frequencies, phonetics, phrases, whole words?
There still has to be some electronics between the spoken word and the AI, and what is it that they are picking out to tell the difference between, say, the words "one" and "two", with very little time delay it would seem?
A fast Fourier transform, or something different, to analyze the sound?
If I understand it right, the current methods are end-to-end deep learning methods, so there is no extra audio processing (DSP, for example) needed, except perhaps a Fourier transform at the beginning to construct a spectrogram from the audio sequence; see this paper:
https://arxiv.org/abs/1303.5778

I don't know exactly how it is implemented in practice for real-time applications, but some version of the FFT, maybe the STFT, is probably involved. Neural networks are pretty fast at calculating predictions (once they are trained), so there should not be an issue there. Just my opinion...
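For what that front end might look like in code, here is a minimal spectrogram sketch using SciPy's STFT. The 16 kHz sample rate, 25 ms window, and 10 ms hop are common choices but assumptions here, and the chirp signal just stands in for real speech:

```python
import numpy as np
from scipy import signal

# Synthetic 1-second "word": a chirped tone at an assumed 16 kHz sample rate
sr = 16000
t = np.linspace(0, 1, sr, endpoint=False)
audio = signal.chirp(t, f0=200, f1=1200, t1=1.0)

# Short-time Fourier transform: 25 ms windows every 10 ms
f, times, Zxx = signal.stft(audio, fs=sr, nperseg=400, noverlap=240)
spectrogram = np.log1p(np.abs(Zxx))   # log-magnitude frames fed to the network

print(spectrogram.shape)              # (frequency bins, time frames)
```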
 
