Using DSP technology to optimise speech recognition performance
By Rishi Nag, Senior DSP Software Engineer, NCT
Audio DesignLine (05/03/2006 1:46 PM EDT)
Introduction
It might be that in 50 years' time, we'll have a family android who will converse with us about the weather or even Manchester United's mid-season performance. If so, an important component of this icon of the future will be its ability to recognise speech the same way humans do. For the moment, though, `speech recognition` is an important emerging technology that is playing a key role in automotive telematics, mobile phone technology, conferencing systems and similar telecom applications. This article discusses some of the obstacles that such systems need to overcome in order to move towards a human level of performance, and how DSP noise reduction can help optimise their performance.
The Main Principles of Speech Recognition
Speech recognition is the process of converting a talker's sampled speech into the sequence of words representing what the talker has said. The basic building block of speech is the phoneme. There is one phoneme for every basic sound in the language. For example, the word `cat` is constructed from three phonemes: `k`, `a` and `t`. A speech recognition engine must first construct the sequence of phonemes in the speech before it can produce the sequence of words. This is typically carried out in a number of distinct stages.
First, each short segment of speech is analysed and its important acoustic characteristics are placed into a feature vector. The feature vector is compared to a database of feature vectors for the various phonemes, in order to find the closest match. This process is repeated for each short segment of speech to produce a sequence of phonemes.
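To make this first stage concrete, here is a deliberately minimal sketch in Python (assuming NumPy): each frame is summarised as a vector of log band energies and matched to the nearest stored phoneme template. Real engines use MFCC-style features and statistical acoustic models, typically hidden Markov models, rather than single-template nearest-neighbour matching.

```python
import numpy as np

def feature_vectors(samples, rate=16000, frame_ms=25, hop_ms=10, n_bands=13):
    """Slice the waveform into short frames and summarise each frame's
    spectrum as a vector of log band energies (a crude stand-in for the
    MFCC-style features real engines use)."""
    frame = int(rate * frame_ms / 1000)   # e.g. 400 samples per 25 ms frame
    hop = int(rate * hop_ms / 1000)       # 10 ms between frame starts
    window = np.hanning(frame)
    vectors = []
    for start in range(0, len(samples) - frame, hop):
        spectrum = np.abs(np.fft.rfft(samples[start:start + frame] * window))
        bands = np.array_split(spectrum, n_bands)   # crude filterbank
        vectors.append(np.log([float(b.sum()) + 1e-10 for b in bands]))
    return np.array(vectors)

def nearest_phoneme(vector, templates):
    """Return the phoneme label whose stored template vector is closest
    (Euclidean distance) to this frame's feature vector."""
    return min(templates, key=lambda p: np.linalg.norm(vector - templates[p]))

# `templates` might map 'k', 'a', 't', ... to vectors averaged from
# labelled training frames; classifying each input frame against them
# yields the phoneme sequence passed to the next stage.
```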
The next stage involves use of a pronunciation dictionary to create a number of possible word sequences. A pronunciation dictionary contains a list of words and the sequence of phonemes corresponding to the pronunciation of each word. Using this dictionary in reverse, the phoneme sequences are put together to make known words. A single sequence of phonemes can, however, correspond to a number of different word sequences that have the same pronunciation. For example, `car key` is pronounced the same as `khaki`. Consequently, this stage will result in a number of alternative word sequences.
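A toy version of this reverse dictionary lookup might look like the following; the mini-lexicon and phone labels are invented for illustration (real systems use large pronunciation lexicons):

```python
# A toy reverse pronunciation dictionary. The phone labels and entries
# are invented for illustration; real systems use large lexicons.
PRONUNCIATIONS = {
    "car":   ("k", "aa"),            # non-rhotic pronunciation, as in the
    "key":   ("k", "iy"),            # article's UK English example
    "khaki": ("k", "aa", "k", "iy"),
    "cat":   ("k", "a", "t"),
}

def words_matching(phonemes):
    """Every word whose dictionary pronunciation equals this phone sequence."""
    return [w for w, p in PRONUNCIATIONS.items() if p == tuple(phonemes)]

def word_sequences(phonemes):
    """Split the phoneme stream into all possible sequences of known words,
    so homophones like 'car key' and 'khaki' both survive to the next stage."""
    if not phonemes:
        return [[]]
    results = []
    for cut in range(1, len(phonemes) + 1):
        for word in words_matching(phonemes[:cut]):
            for rest in word_sequences(phonemes[cut:]):
                results.append([word] + rest)
    return results

print(word_sequences(("k", "aa", "k", "iy")))  # [['car', 'key'], ['khaki']]
```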
A language model then examines the context, and possibly the grammar, of the suggested strings of words to narrow the possibilities down to a word sequence that makes sense, the recognised word sequence.
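A minimal sketch of this narrowing step, using a toy bigram model with invented probabilities; in a real engine the language model is trained on a large corpus and integrated into the recognition search rather than applied as a final filter:

```python
import math

# Toy bigram probabilities with "<s>" as the start-of-sentence marker.
# The numbers are invented; a real model is estimated from a large corpus.
BIGRAMS = {
    ("<s>", "car"): 0.20,
    ("car", "key"): 0.10,
    ("<s>", "khaki"): 0.01,
}
FLOOR = 1e-6  # back-off probability for unseen word pairs

def lm_score(words):
    """Log-probability of a word sequence under the toy bigram model."""
    return sum(math.log(BIGRAMS.get((prev, word), FLOOR))
               for prev, word in zip(["<s>"] + words, words))

candidates = [["car", "key"], ["khaki"]]
print(max(candidates, key=lm_score))  # ['car', 'key'] under these numbers
```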
To summarise, a typical speech recognition engine breaks down the speech into a sequence of feature vectors, capturing the important acoustic characteristics of the speech. The feature vectors are converted into a sequence of phonemes, which are built up into suggested sequences of words. These word sequences are then narrowed down to the recognised sentence.
Overcoming Background Noise
One of the major obstacles to achieving high-performance speech recognition is `noise`. In an in-car situation, this noise comes from a number of sources: the road, the engine, the radio, the wind and maybe even the passengers. On a mobile phone, this noise might be background music, traffic, wind or passers-by talking.
Noise is a problem because it corrupts the acoustic characteristics extracted from the speech to make the sequence of feature vectors, introducing errors into the feature vectors and hence into their corresponding phonemes.
Early attempts to apply noise reduction software techniques to enhance speech recognition performance had limited success since, in most cases, these noise reduction technologies had been developed to improve human-to-human communication systems. With such technology, there is always some misidentification of `noise` and `speech`. Noise that is misidentified as speech will be transmitted, leading to speech-like artefacts that can sound like a babbling brook, which is very disturbing for a human listener. On the other hand, speech that is misidentified as noise will be removed, potentially causing the speech to sound distorted.
Achieving the optimal performance from a noise reduction technology involves a trade-off between introducing watery artefacts and causing speech distortion. In human-to-human communication, watery artefacts are usually more unacceptable than losing small parts of speech, particularly since the brain, to some extent, tends to fill in the missing bits of speech to make sense of the output. On the other hand, in a speech recognition system, even a small amount of speech distortion can result in words being unrecognisable, while watery artefacts are often ignored. Consequently it is usually necessary to design noise reduction technology specifically for enhancing the performance of speech recognition systems.
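Classic spectral subtraction makes this trade-off explicit, and serves as an illustration here (the article does not name the specific algorithms involved): the oversubtraction factor and spectral floor below directly set the balance between residual watery artefacts and speech distortion.

```python
import numpy as np

def spectral_subtract(noisy, noise_psd, alpha=2.0, floor=0.05,
                      frame=512, hop=256):
    """Frame-by-frame spectral subtraction with 50% overlap-add.

    `noise_psd` is the average noise power spectrum (frame // 2 + 1 bins),
    estimated from a stretch of noise-only signal. alpha > 1
    (oversubtraction) removes more noise at the cost of speech distortion;
    the spectral `floor` limits attenuation, trading distortion back for
    residual watery artefacts. Real products adapt both per band;
    this is a fixed-parameter sketch."""
    window = np.hanning(frame)
    out = np.zeros(len(noisy))
    for start in range(0, len(noisy) - frame, hop):
        spec = np.fft.rfft(noisy[start:start + frame] * window)
        power = np.abs(spec) ** 2
        clean = np.maximum(power - alpha * noise_psd, floor * power)
        spec *= np.sqrt(clean / (power + 1e-12))    # keep the noisy phase
        out[start:start + frame] += np.fft.irfft(spec)
    return out
```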
Another interesting aspect is that in normal human-to-human conversation, say over a hands-free in-car phone, we talk over each other only about 6% of the time; during this time, echo cancellation technology removes one of the voices to avoid echo and enhance clarity. For a speech recognition system, there is often competing background noise and `echo` 100% of the time, whether from the radio, the speech recognition system itself or even a nearby passenger.
Due to these different operational requirements, noise and echo cancellation solutions for enhancing speech recognition need to be optimised differently and solutions aimed at human listeners are often non-ideal. The quality of such solutions is dependent on how cleverly they minimise the distortion to the speech while reducing background noise and echo.
In the Real World
Consider an in-car telematics or navigation system that takes voice instructions; it may be represented by the diagram in Figure 1.
Figure 1: Voice instruction system for automotive environment
Figure 2 below shows a block diagram of the speech recognition aspects of this system used together with noise and echo reduction technology.
Figure 2: Combination noise and echo reduction for speech recognition
The voice of the driver is picked up by the communications microphone and is first processed by the block labelled RNF (Referenced Noise Filter), a type of echo canceller. This block has a direct feed from the music system, so that this background noise can be reduced. If the vehicle is an emergency services vehicle, this feed might well come from the siren.
The RNF block may also have a feed from the Automatic Speech Recognition (ASR) module itself, so that if the ASR system is talking to the driver, the driver can speak over the automated voice and still be understood. The RNF technology ensures that the ASR hears the driver's voice, but not the sound of its own voice coming out of the loudspeakers. This is an important aspect of interactive speech recognition systems: the ability to choose an item from a menu without waiting until the system has listed all the possibilities. This is called “barge-in”.
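The article does not disclose how the RNF is implemented; one common way to build this kind of referenced echo canceller is a normalised LMS (NLMS) adaptive filter, sketched below. The filter learns the acoustic path from the loudspeaker reference (music, siren or ASR prompt) to the microphone and subtracts its echo estimate, so the driver's speech survives in the residual.

```python
import numpy as np

def rnf_nlms(mic, ref, taps=256, mu=0.5, eps=1e-8):
    """Normalised LMS adaptive filter acting as a referenced echo canceller.

    `ref` is the direct feed (music, siren or the ASR prompt); the filter
    learns the loudspeaker-to-microphone path and subtracts its echo
    estimate, leaving the driver's speech in the error signal."""
    w = np.zeros(taps)        # adaptive estimate of the echo path
    buf = np.zeros(taps)      # most recent reference samples, newest first
    out = np.zeros(len(mic))
    for n in range(len(mic)):
        buf = np.roll(buf, 1)
        buf[0] = ref[n]
        err = mic[n] - w @ buf                    # speech + unmodelled noise
        w += mu * err * buf / (buf @ buf + eps)   # normalised LMS update
        out[n] = err
    return out
```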
Once the ‘echo’ has been removed, the signal is processed by the VRE (Voice Recognition Enhancer) block. This technology can provide in the region of 6-18 dB of noise reduction with minimal damage to the speech element of the signal. From here the speech is fed into the ASR module.
Because the VRE technology will allow voice-like signals to pass through, it is possible that the voices of passengers, or even music from their portable sound systems, might still corrupt the quality of the speech entering the ASR module. In some cases, a second microphone can be used to pick up the unwanted sound, so that the ENR (Enhanced Noise Reduction) block can remove this noise from the system, ensuring that the cleanest possible speech enters the ASR module. Such dual-microphone noise reduction technology has particular applicability in cellular phone and headset applications, where the communications microphone picks up significantly more speech than the second microphone, while the noise at both is fairly well correlated.
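The ENR algorithm itself is proprietary; as one plausible illustration of the dual-microphone idea, the sketch below uses the second microphone's spectrum as a running noise estimate and applies a subtraction-style gain to the primary signal, under the article's assumption that the primary mic carries most of the speech:

```python
import numpy as np

def dual_mic_reduce(primary, reference, frame=512, hop=256, floor=0.1):
    """Two-microphone noise reduction sketch: the reference microphone,
    assumed to carry mostly noise, provides a per-frame noise power
    estimate, which attenuates the noisy bins of the primary (speech)
    microphone via a subtraction-style gain."""
    window = np.hanning(frame)
    out = np.zeros(len(primary))
    for start in range(0, len(primary) - frame, hop):
        spec = np.fft.rfft(primary[start:start + frame] * window)
        noise = np.abs(np.fft.rfft(reference[start:start + frame] * window)) ** 2
        power = np.abs(spec) ** 2
        gain = np.maximum(1.0 - noise / (power + 1e-12), floor)
        out[start:start + frame] += np.fft.irfft(spec * gain)
    return out
```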
The Difference Between Success and Failure
Speech recognition experts talk about ‘hit rates’. A `hit` is simply a word that has been successfully recognised. In a stationary car, the hit rate might be 100%, but as the car accelerates forward, engine noise and wind noise will start to impact the system performance. Without noise reduction, it is possible that the system would become unworkable. Indeed, trying to voice-dial a ten-digit telephone number may repeatedly fail even at a 90% hit rate: if each digit is recognised independently, all ten come out right only about 35% of the time.
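The arithmetic behind that claim, assuming for simplicity that each digit is recognised independently:

```python
# Probability that an entire ten-digit number is recognised correctly,
# assuming each digit is an independent recognition with hit rate p.
for p in (0.90, 0.95, 0.99):
    print(f"hit rate {p:.0%}: ten-digit success rate {p ** 10:.0%}")
# hit rate 90%: ten-digit success rate 35%
# hit rate 95%: ten-digit success rate 60%
# hit rate 99%: ten-digit success rate 90%
```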
Figure 3: Noise level vs. recognition rate
So even a 10% improvement in hit rate from the noise reduction software can be the difference between an unsuccessful system and a successful one.
Adopting Noise Reduction for Speech Recognition
The various products mentioned above (VRE, RNF and ENR) are highly refined algorithms that run on DSPs (digital signal processors). This technology is available for a range of processors from manufacturers such as Texas Instruments and Analog Devices. Providers of such noise and echo reduction technology will often supply evaluation platforms that enable engineers to try out the technology quickly and assess its impact on their systems.
At NCT, we can provide our noise and echo reduction code in an eXpressDSP-compliant format for Texas Instruments DSP devices. We can also provide PC executables for fast evaluation.