Despite the occasional awkward response or failure to understand some accents, voice recognition has made quite a bit of progress, with the recent versions of Apple’s Siri, Amazon’s Alexa and Ford’s SYNC infotainment system showing remarkable improvements.
So, how does our car really decipher the things we say and respond accordingly?
1. It’s not really the sound that matters. It’s the sound wave. A sound is created by tiny changes in air pressure, and it enters our ears as one continuous wave. But computers aren’t like people: they need a way to ‘hear’ spoken words and turn them into text. So when sound enters one of our devices, its computer measures the sound wave at one point in time, stores that value, measures again, and repeats this thousands of times per second.
The result: the sound you made is now digitized into numbers the computer can work with. The process isn’t perfectly precise, though, and smart devices do make mistakes. If the computer misses part of the wave, what it measures may not match what was actually said.
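The measure-store-repeat idea above can be sketched in a few lines of code. This is a toy illustration, not any real device’s pipeline: a pure 440 Hz tone stands in for speech, and the 8 kHz sampling rate is an arbitrary assumption.

```python
import math

SAMPLE_RATE = 8000   # measurements per second (assumed for illustration)
FREQ = 440.0         # a pure A4 tone standing in for a voice

def sample_wave(duration_s, rate=SAMPLE_RATE):
    """Measure the wave's amplitude at evenly spaced points in time,
    the way an analog-to-digital converter does."""
    n = int(duration_s * rate)
    return [math.sin(2 * math.pi * FREQ * t / rate) for t in range(n)]

samples = sample_wave(0.01)   # 10 ms of sound
print(len(samples))           # 80 numbers: the sound, digitized
```

Each number in `samples` is one measurement of the wave; together they are the digitized sound the rest of the pipeline works on.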
2. The sound of a word vs. the sound of something else. Once a sound is recorded digitally, the computer uses algorithms to figure out which sounds it should pay attention to. To determine whether chunks of digitized sound are actually words, rather than noise from a car engine or a radio, the computer applies a series of mathematical operations to separate speech from everything else.
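One very simple version of this separation can be sketched as follows. This is a toy energy-based detector, an assumption for illustration only; real systems use far more sophisticated spectral methods.

```python
# Split the digitized samples into short frames and keep only frames
# whose average energy crosses a threshold -- quiet background hum is
# dropped, louder speech-like bursts are kept. The frame length and
# threshold values are illustrative assumptions.

def frame_energy(frame):
    return sum(s * s for s in frame) / len(frame)

def find_speech_frames(samples, frame_len=160, threshold=0.01):
    frames = [samples[i:i + frame_len] for i in range(0, len(samples), frame_len)]
    return [i for i, f in enumerate(frames) if f and frame_energy(f) > threshold]

quiet = [0.001] * 160        # near-silence, e.g. engine hum
loud = [0.5, -0.5] * 80      # a louder, speech-like burst
print(find_speech_frames(quiet + loud))   # [1] -- only the loud frame survives
```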
3. Same word, different accents. Voice recognition works by breaking speech into small segments called phonemes. English alone has about 40 different phonemes. The computer is trained to recognize what each speech segment looks like digitally, but the segments are not always the same: sounds vary with accent and with their position in a word, and some words sound identical despite different spellings (e.g. ‘to’ vs. ‘two’ vs. ‘too’). Based on a dictionary word list and contextual relationships, the computer can make an educated guess about what is being said. So, if your friend Mary is in your contact list, the command ‘call Mary’ is linked to ‘Mary’ and not ‘merry’.
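The ‘Mary’ vs. ‘merry’ example can be sketched as a lookup plus a context check. Everything here is a simplified assumption: the phoneme spellings are loosely based on dictionary conventions, and the contact names are made up.

```python
# Homophones map to the same phoneme sequence, so the phonemes alone
# can't pick a word -- context (here, a contact list) breaks the tie.

HOMOPHONES = {
    ("M", "EH", "R", "IY"): ["merry", "Mary", "marry"],
    ("T", "UW"): ["to", "two", "too"],
}

CONTACTS = {"Mary", "Sanjay", "Wei"}

def resolve(phonemes, command_is_call):
    candidates = HOMOPHONES.get(tuple(phonemes), [])
    if command_is_call:
        # In a "call ..." command, prefer a word that is a known contact.
        for word in candidates:
            if word in CONTACTS:
                return word
    return candidates[0] if candidates else None

print(resolve(["M", "EH", "R", "IY"], command_is_call=True))   # Mary
print(resolve(["M", "EH", "R", "IY"], command_is_call=False))  # merry
```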
Mark Porter, Supervisor, Asia Pacific Infotainment Systems at Ford Motor Company says, “with enhanced voice recognition, you can talk to SYNC 3 with simple real-world voice commands and the system responds naturally to your voice. It’s even been fine-tuned to deal with the Australian accent, and in China, it can understand a string of Chinese characters written by hand on its graphical interface.”
4. Predicting what the next word in a sentence might be. A single speech stream can match many different word combinations, simply because lots of phonemes sound similar to one another when spoken quickly. Sometimes the result is a wacky sequence of words that doesn’t really make sense. To avoid this, the computer applies models based on how people actually talk to figure out how likely one word is to follow another.
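The idea of scoring how likely one word is to follow another can be sketched with a tiny bigram model. The word-pair counts below are made up for illustration; real systems learn them from enormous amounts of recorded speech and text.

```python
# Score two similar-sounding transcriptions and keep the one whose
# word-to-word transitions are more common in real language.

BIGRAM_COUNTS = {
    ("recognize", "speech"): 50,
    ("wreck", "a"): 5,
    ("a", "nice"): 2,
    ("nice", "beach"): 1,
}

def score(words):
    """Product of bigram counts -- higher means a more plausible sentence."""
    total = 1
    for prev, nxt in zip(words, words[1:]):
        total *= BIGRAM_COUNTS.get((prev, nxt), 1)  # unseen pairs count as 1
    return total

hyp_a = ["recognize", "speech"]        # score: 50
hyp_b = ["wreck", "a", "nice", "beach"]  # score: 5 * 2 * 1 = 10
best = max([hyp_a, hyp_b], key=score)
print(best)   # ['recognize', 'speech']
```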
5. Presenting the best result as quickly as possible. Once all the calculations are done and the guesses are made, the computer finally presents its best result, whether on a screen, from a pre-set menu or through a vocal response. The latest voice recognition technology achieves incredibly fast response times and is more intuitive than ever before. For instance, a SYNC 3 user can command their car to ‘Tune to <frequency> FM’ in one step, while other systems still require you to say ‘Radio’, then point you to another list and prompt you again for the frequency of the station you want.
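The difference between a one-shot command and a multi-prompt dialogue can be sketched as simple pattern matching on the recognized text. This is a hypothetical illustration; the pattern and response strings are assumptions, not SYNC 3’s actual interface.

```python
import re

# Match the whole recognized utterance against a command pattern so
# "Tune to 101.5 FM" works in a single step, with no Radio -> list ->
# frequency back-and-forth.

TUNE_PATTERN = re.compile(r"tune to (\d+(?:\.\d+)?) fm", re.IGNORECASE)

def handle(utterance):
    m = TUNE_PATTERN.fullmatch(utterance.strip())
    if m:
        return f"Tuning radio to {m.group(1)} FM"
    return "Sorry, I didn't catch that"

print(handle("Tune to 101.5 FM"))   # Tuning radio to 101.5 FM
```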
With faster and more accurate technology available, voice-activated commands are making our lives better in a myriad of ways. Although at times it may seem like they are just out to annoy us with bizarre answers, consider the tedious calculations and complex transformations that happen behind the scenes to recognize a single word, let alone an entire sentence.
For gadgets to be even remotely able to decipher what we say and then piece together a semi-coherent response is amazing, especially since some humans are still trying to master this skill. Think about it.