OK Google, how do voice assistants work?

Posted January 1, 2021

Voice assistants can perform various actions after hearing a wake word (such as “Alexa” or “OK Google”) followed by a command. They can turn on lights, play music, check the weather forecast, place online shopping orders, make restaurant reservations and more.

Technology behind voice assistants

Most people have heard of the most popular voice assistants, such as Amazon’s Alexa and Apple’s Siri, but they might not know about natural language processing and speech recognition software, the technology behind them. Speech recognition software analyzes the user’s speech using the following basic process:

1. Filter the words that the user says;
2. Digitize the user’s speech into a format that the machine can read;
3. Analyze the user’s speech for meaning;
4. Decide what the user needs based on previous input and algorithms.

The previous input and algorithms in the last step come from training: building an effective voice assistant that can understand and fulfill user commands requires large amounts of audio training data.
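As a rough illustration, the four steps above can be sketched in Python. This is a minimal sketch under stated assumptions, not how any particular assistant is built: the function names, the noise floor and the keyword table are made up for the example, the recognizer is a stub standing in for a trained speech-to-text model, and the keyword table stands in for what a real system would learn from training data.

```python
import re
from typing import Callable, Dict, List


def filter_audio(samples: List[float], noise_floor: float = 0.02) -> List[float]:
    """Step 1 (filter): silence samples quieter than an assumed noise floor."""
    return [s if abs(s) >= noise_floor else 0.0 for s in samples]


def digitize(samples: List[float]) -> List[int]:
    """Step 2 (digitize): quantize floats in [-1.0, 1.0] to 16-bit integers."""
    return [max(-32768, min(32767, round(s * 32767))) for s in samples]


def analyze(pcm: List[int], recognizer: Callable[[List[int]], str]) -> List[str]:
    """Step 3 (analyze): get a transcript and normalize it into plain words."""
    transcript = recognizer(pcm)
    return re.sub(r"[^\w\s]", "", transcript.lower()).split()


def decide(words: List[str], learned_rules: Dict[str, str]) -> str:
    """Step 4 (decide): map the words to an action, or ask the user to rephrase."""
    for word in words:
        if word in learned_rules:
            return learned_rules[word]
    return "ask_for_clarification"


def handle_request(samples, recognizer, learned_rules) -> str:
    """Run the four steps in order on one chunk of audio."""
    pcm = digitize(filter_audio(samples))
    return decide(analyze(pcm, recognizer), learned_rules)


# Hypothetical stand-ins for a trained recognizer and a trained decision model.
def fake_recognizer(pcm: List[int]) -> str:
    return "Will it rain tomorrow?"


rules = {"rain": "get_weather_forecast", "lights": "turn_on_lights"}
action = handle_request([0.01, 0.4, -0.3, 0.005], fake_recognizer, rules)
print(action)  # get_weather_forecast
```

In a production system, steps 3 and 4 would be statistical models rather than a stub and a lookup table, which is why the quality of the training data matters so much.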

Of course, machine learning researchers can only use raw audio data to train their algorithms if it has been cleaned and labeled. For voice assistants, the audio training dataset should include a large volume of accurate language data, so that the algorithm can understand and respond to human speech in different environments and contexts. Languages with many varieties raise the bar further. For example, the Chinese language has 130 spoken dialects and 30 written languages. Covering that kind of variation creates a huge demand for cutting-edge tech solutions and processed training datasets.

The audio training dataset should also include different variations of the same request. For example, users who want to know whether it will rain tomorrow might ask different questions such as:

Is it going to rain tomorrow?
What is the chance of rain tomorrow?
What is the weather forecast for tomorrow?
Should I carry an umbrella tomorrow?

An effective voice assistant should be trained to understand that these are all different ways of asking the same question: will it rain tomorrow?
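One simple way to picture this is a classifier that maps each utterance to the intent of the most similar labeled example. The sketch below is only an illustration: the intent names, the non-weather examples and the word-overlap (Jaccard) similarity measure are assumptions for the example, and real assistants typically rely on trained language models rather than raw word overlap.

```python
import re

# Labeled examples: (utterance, intent). The intent names are hypothetical.
TRAINING_EXAMPLES = [
    ("is it going to rain tomorrow", "rain_forecast"),
    ("what is the chance of rain tomorrow", "rain_forecast"),
    ("what is the weather forecast for tomorrow", "rain_forecast"),
    ("should i carry an umbrella tomorrow", "rain_forecast"),
    ("play some music", "play_music"),
    ("put on my favorite song", "play_music"),
    ("turn on the lights", "lights_on"),
    ("switch the lamp on", "lights_on"),
]


def tokenize(text):
    """Lowercase the text and return its set of words."""
    return set(re.findall(r"[a-z']+", text.lower()))


def jaccard(a, b):
    """Share of words the two sets have in common (0.0 to 1.0)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def classify(utterance):
    """Return the intent of the most similar training example, or "unknown"."""
    tokens = tokenize(utterance)
    best_intent, best_score = "unknown", 0.0
    for example, intent in TRAINING_EXAMPLES:
        score = jaccard(tokens, tokenize(example))
        if score > best_score:
            best_intent, best_score = intent, score
    return best_intent


print(classify("Will I need an umbrella tomorrow?"))  # rain_forecast
print(classify("Turn the lights on please"))          # lights_on
```

The more variations of a request the training set contains, the more likely a new phrasing lands close to one of them, which is the practical reason for collecting many versions of the same question.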

Limitations of voice assistants

Voice assistants rely heavily on natural language processing, so they inherit its limitations. Natural language processing is often associated with chat or text interfaces, but it is just as important for audio language technologies such as voice assistants and the speech systems used in mobile phones and contact centers. At its current stage, natural language processing struggles with figurative language such as metaphors and similes. Human speech is also rarely linear: we sometimes forget what we were talking about, ask tangential questions or inquire about multiple things at once, all of which is tough for a machine to follow algorithmically.

In addition, voice involves challenges that text does not, such as background noise and accents. The algorithm must overcome these as well to deliver a good user experience.

Another problem with voice assistants is that they are often biased toward the majority. The answers that Alexa or Siri give closely match the needs of whoever produced most of the training data, but they might not be helpful for users from minority groups. This is unfortunate, since voice assistants first became popular for their role in streamlining people’s lives.
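One way to make this kind of bias visible is to compare recognition accuracy across groups of speakers. The sketch below computes word error rate (WER), a standard speech recognition metric, separately for each group. The group labels and transcripts are invented purely for illustration and are not real measurements.

```python
from collections import defaultdict


def word_errors(reference, hypothesis):
    """Return (word-level edit distance, number of reference words)."""
    ref, hyp = reference.split(), hypothesis.split()
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # word missing from hypothesis
                            curr[j - 1] + 1,      # extra word in hypothesis
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1], len(ref)


def wer_by_group(samples):
    """samples: iterable of (group, what was said, what the system heard)."""
    errors, words = defaultdict(int), defaultdict(int)
    for group, reference, hypothesis in samples:
        e, n = word_errors(reference, hypothesis)
        errors[group] += e
        words[group] += n
    return {g: errors[g] / words[g] for g in errors if words[g]}


# Invented example data, not real results.
samples = [
    ("majority_accent", "turn on the lights", "turn on the lights"),
    ("majority_accent", "play some music", "play some music"),
    ("minority_accent", "turn on the lights", "turn of the light"),
    ("minority_accent", "play some music", "play sum music"),
]
for group, wer in sorted(wer_by_group(samples).items()):
    print(f"{group}: {wer:.2f}")
```

A large gap between groups suggests that the training data underrepresents some speakers.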

The future of voice assistants

Better access to audio training data could lead to machine learning applications that we cannot even imagine today. To improve voice assistants even further, it’s important to make sure that they serve everyone, including minorities and niche demographics.

TELUS International had the opportunity to play a part in the development of this kind of service. We collected speech samples from foreigners living in Japan who speak Japanese as non-native speakers. A car manufacturer used this data to improve the navigation voice assistant in its vehicles.

People with speech impairments are another group that voice assistants at their current stage fail to serve. Fortunately, several emerging tech companies are working to make voice assistants more accessible. For example, Danny Weissberg created an app called Voiceitt after his grandmother suffered a stroke and lost her ability to speak. Voiceitt is focused on delivering speech recognition technology that understands non-standard speech.

The Voiceitt app works by having the person with a speech impairment create a personal dictionary, which is then translated into standard speech to control other voice-enabled devices. To create the dictionary, the user composes and then reads out everyday phrases like “I’m hungry” or “Turn on the lights.” The software records the speech and gradually learns the user’s particular pronunciation. Once trained, the app acts like an instant translator: the user speaks a phrase, and the app reads or types it out in standard speech for voice assistants to process.
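The personal dictionary idea can be pictured with a small sketch. This is not Voiceitt’s actual implementation, which is not described here. It is a toy model under heavy assumptions: text strings stand in for recorded audio, the class and method names are hypothetical, and Python’s standard difflib string matching stands in for a trained acoustic model.

```python
import difflib


class PersonalDictionary:
    """Toy model: maps a user's own renderings of phrases to standard phrases."""

    def __init__(self, cutoff=0.6):
        self.cutoff = cutoff  # minimum similarity (0 to 1) to accept a match
        self.entries = {}     # user's rendering -> standard phrase

    def train(self, user_rendering, standard_phrase):
        """Record how this user says a phrase (text stands in for audio)."""
        self.entries[user_rendering.lower()] = standard_phrase

    def translate(self, heard):
        """Return the standard phrase closest to what was heard, or None."""
        matches = difflib.get_close_matches(
            heard.lower(), list(self.entries), n=1, cutoff=self.cutoff
        )
        return self.entries[matches[0]] if matches else None


# Invented example renderings, for illustration only.
dictionary = PersonalDictionary()
dictionary.train("ahm ungee", "I'm hungry")
dictionary.train("tuh on uh ights", "Turn on the lights")

print(dictionary.translate("am ungee"))  # I'm hungry
print(dictionary.translate("zzzz"))      # None
```

The translated phrase could then be passed on to a standard voice assistant, which is the "instant translator" role described above.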

With the continued growth of the voice recognition industry and the ongoing advancement of the technology, companies that keep inclusivity in mind will be poised to attract the most users. If you’re looking to improve your voice recognition software, contact our AI Data Solutions team today.
