Speech Recognition: Principles and Applications
Table of contents
Abstract
Overview of the Characteristics of Automatic Speech Recognition Systems
Number of Words
Use of Grammar
Continuous vs. Discrete Speech
Speaker Dependency
Early Approaches to Automatic Speech Recognition
Acoustic-Phonetic Approach
Statistical Pattern Recognition Approach
Modern Approach to Automatic Speech Recognition
Hidden Markov Models
Training of an Automatic Speech Recognition System Based on HMMs
Sub-Word Units
Applications of Automatic Speech Recognition Systems
Automated Call-Type Recognition
Data Entry
Future Applications Using Automatic Speech Recognition Systems
Conclusion
References
Abstract
Despite advances in technology, giving a computer system the ability to understand human speech is far harder than many people assume. Since the early 1950s, scientists have tried to build a perfect automatic speech recognition system, but none has succeeded. Computers can now recognise a large number of words, yet a computer that understands everything said to it, without any conditions being imposed, does not exist. Because the range of applications is enormous, a great deal of money and time is spent on improving speech recognition systems.
SPEECH RECOGNITION: PRINCIPLES AND APPLICATIONS
Nowadays, computer systems play a major role in our lives. They are used everywhere: in homes, offices, restaurants, gas stations, and so on. Nonetheless, for some people, the computer is still a machine they will never know how to use. Communicating with a computer is done using a keyboard or a mouse, devices many people are not comfortable using. Speech recognition addresses this problem and removes the barrier between humans and computers: using a computer would be as easy as talking with a friend.
Unfortunately, scientists have discovered that implementing a perfect speech recognition system is no easy task. This report will present the principles and the major approaches to speech recognition systems along with some of their applications.
Overview of the Characteristics of Automatic Speech Recognition Systems
How can we evaluate a speech recognition system? Describing it as good or bad is not enough, since such a system may perform outstandingly in one application and poorly in another. In fact, speech recognition systems are designed according to the application. Some of the characteristics that vary from one system to another are presented below.
Number of Words
The major characteristic of a speech recognition system is the number of words it can recognise. How many words are enough for the performance of a speech recognition system to be acceptable? The answer depends on the application (6, p.98). Some applications, like automated call-type recognition, require only a few words; others, like data entry, require thousands. However, increasing the number of words, or vocabulary, of a speech recognition system increases its complexity and decreases its performance, because the probability of error is higher (6, p.98). Systems with large vocabularies are also slower, since more time is needed to search for a word in a large vocabulary. Moreover, a larger vocabulary alone is not enough, because the system still cannot differentiate words that sound alike, such as ‘to’ and ‘two’ or ‘right’ and ‘write’ (6, p.98).
Use of Grammar
Using grammar, it becomes possible to differentiate words like ‘to’ and ‘two’ or ‘right’ and ‘write’. Grammar also speeds up a speech recognition system by narrowing the range of the search (6, p.98), and it improves performance by eliminating inappropriate word sequences. However, grammar does not allow random dictation, which is a problem for some applications (6, p.98).
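The role of grammar can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the grammar is reduced to a hypothetical set of permitted word pairs (`ALLOWED_PAIRS`), and the recogniser is assumed to hand over, for each spoken word, a list of candidates that sound alike. Neither the word pairs nor the function name come from the cited sources.

```python
from itertools import product

# Hypothetical toy grammar: pairs of words allowed to follow one another.
# "<s>" marks the start of the utterance.
ALLOWED_PAIRS = {
    ("<s>", "turn"), ("turn", "right"), ("turn", "left"),
    ("<s>", "write"), ("write", "two"), ("two", "letters"),
    ("write", "to"), ("to", "john"),
}

def valid_sequences(candidates, allowed_pairs=ALLOWED_PAIRS):
    """Keep only the word sequences that the grammar permits.

    candidates: one list of sound-alike words per spoken word.
    """
    results = []
    for words in product(*candidates):
        path = ("<s>",) + words
        # Every adjacent pair of words must be permitted by the grammar.
        if all(pair in allowed_pairs for pair in zip(path, path[1:])):
            results.append(list(words))
    return results

print(valid_sequences([["turn"], ["right", "write"]]))
print(valid_sequences([["right", "write"], ["to", "two"], ["letters"]]))
```

In both calls only one sequence survives ("turn right" and "write two letters"), which is how a grammar separates words that the acoustics alone cannot tell apart. The same restriction is what rules out random dictation: a sequence the grammar does not list is never returned.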
Continuous vs. Discrete Speech
When speaking to each other, we do not pause between words; in other words, we use continuous speech. Speech recognition systems, however, have difficulty dealing with continuous speech (6, p.98). The easy way out is to use discrete speech, where the speaker pauses between words (6, p.100). With discrete speech input, the silent gap between words is used to determine word boundaries, whereas with continuous speech the system must separate words using an algorithm that is not a hundred per cent accurate. Still, for a small vocabulary combined with a grammar, continuous speech recognition systems are available; they are reliable and do not require great computational power (6, p.100). For a large vocabulary, however, continuous speech recognition systems are very difficult to achieve, require huge computational power, and are slow. In fact, processing a speech sample can take three to ten times as long as a person takes to say it (6, p.100).
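As a rough illustration of how silent gaps can mark word boundaries in discrete speech, the following Python sketch splits a list of per-frame energy values wherever the signal stays quiet for long enough. The energy representation, the `threshold` and the `min_gap` values are all assumptions made for this example; they are not taken from the cited sources.

```python
def split_on_silence(energies, threshold=0.1, min_gap=3):
    """Return (start, end) frame ranges of words; end is exclusive.

    A word ends once at least `min_gap` consecutive frames fall at or
    below `threshold`. Shorter dips are treated as part of the word.
    """
    words = []
    start = None      # first frame of the word currently being tracked
    last_loud = None  # index of the most recent frame above the threshold
    for i, energy in enumerate(energies):
        if energy > threshold:
            if start is None:
                start = i
            last_loud = i
        elif start is not None and i - last_loud >= min_gap:
            words.append((start, last_loud + 1))
            start = None
    if start is not None:  # the recording ended inside a word
        words.append((start, last_loud + 1))
    return words

frames = [0, 0, 0.5, 0.6, 0.4, 0, 0, 0, 0, 0.7, 0.8, 0, 0.6, 0, 0]
print(split_on_silence(frames))  # [(2, 5), (9, 13)]
```

The single quiet frame inside the second word does not split it, because it is shorter than `min_gap`. Continuous speech offers no such gaps, which is why its word boundaries have to be inferred by a less reliable algorithm.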
Speaker Dependency
Speech recognition system designers must consider another important issue: whether their systems are speaker-dependent or speaker-independent. Each person pronounces a word differently. Although humans easily recognise the word ‘car’ whether an American or an Englishman says it, this is not the case for speech recognition systems. Speaker dependency is determined by the application: some applications require speaker-dependent systems (as in data entry), others require speaker-independent systems (as in automated call-type recognition) (6, p.100). Speaker dependency greatly affects the training of an automatic speech recognition system (4, p.42).
Early Approaches to Automatic Speech Recognition
When scientists first dreamed of a machine capable of understanding spoken language, computers and very fast integrated circuits were not available. Nevertheless, they managed to establish the fundamental principles of speech recognition systems. Several approaches were used, each with its advantages and disadvantages. Two of these approaches are discussed below.
Acoustic-Phonetic Approach
The theory behind the acoustic-phonetic approach is acoustic phonetics. This theory assumes that spoken language is divided into phonetic units that are finite in number and distinctive. These phonetic units are distinguished by properties that are apparent in the speech signal (7, pp.42-43). Speech is recognised as follows. First, the speech is divided into segments. Then, according to its acoustic properties, each segment is assigned an appropriate phonetic unit. Finally, the resulting sequence of units is used to form a valid word (7, p.43).
Figure 1: Phonetic sequence for a speech sample (7, p.43).
As an example, consider the sequence of phonetic units matched with a sample of speech, illustrated in figure 1. The symbol ‘SIL’ indicates silence, whereas the vertical position of a phonetic unit indicates how well it matches the corresponding segment of speech (the higher the position, the better the match). After searching, we can match the phonetic sequence SIL-AO-L-AX-B-AW-T with the expression ‘all about’. Note that the chosen phonemes are not all first choices in the phonetic sequence: some are second choices (B and AX) and one is a third choice (L). Matching a phonetic sequence with a word or a group of words is therefore not straightforward (7, p.43). In fact, this is the main disadvantage of this approach.
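The matching step can be sketched in Python as a lookup in a table of ranked candidates. Only the ranks of the units in SIL-AO-L-AX-B-AW-T follow the description above; the competing units in `LATTICE` (AA, W, OW, IH, P, D) are invented for illustration and are not taken from figure 1, and the function name is hypothetical.

```python
# One list per speech segment, candidate phonetic units ordered best first.
# The competing units are invented; only the ranks of L (third),
# AX and B (second) mirror the example in the text.
LATTICE = [
    ["SIL"],
    ["AO", "AA"],
    ["W", "OW", "L"],
    ["IH", "AX"],
    ["P", "B"],
    ["AW", "AA"],
    ["T", "D"],
]

ALL_ABOUT = ["SIL", "AO", "L", "AX", "B", "AW", "T"]

def match_ranks(lattice, units):
    """Return the rank (1 = best) of each unit in its segment.

    Returns None if the lengths differ or a unit is not a candidate.
    """
    if len(lattice) != len(units):
        return None
    ranks = []
    for candidates, unit in zip(lattice, units):
        if unit not in candidates:
            return None
        ranks.append(candidates.index(unit) + 1)
    return ranks

print(match_ranks(LATTICE, ALL_ABOUT))  # [1, 1, 3, 2, 2, 1, 1]
```

Taking the first choice in every segment would give SIL-AO-W-IH-P-AW-T, which corresponds to no expression here. The search therefore has to consider lower-ranked units as well, and this is what makes the approach difficult.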
Statistical Pattern Recognition Approach
In statistical pattern recognition, speech patterns are fed directly into the system and compared with the patterns supplied to the system during training (7, p.43). Unlike in the acoustic-phonetic approach, the speech is neither segmented nor checked for its properties. If enough patterns are supplied to the speech recognition system during training, it will perform better than the acoustic-phonetic approach. In general, the statistical pattern recognition approach is used more than the acoustic-phonetic approach because it is simpler to use, invariant to different speech vocabularies, and more accurate (7, p.44).
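A minimal Python sketch of this idea is given below. It assumes, purely for illustration, that each speech pattern has already been reduced to a short fixed-length list of numbers and that patterns are compared by Euclidean distance. Real systems compare sequences of varying length, and the sources cited here do not specify the comparison used.

```python
import math

def train(examples):
    """Store the labelled training patterns as (word, feature_vector) pairs."""
    return list(examples)

def recognise(templates, pattern):
    """Return the word whose stored pattern lies closest to the input."""
    word, _ = min(templates, key=lambda t: math.dist(t[1], pattern))
    return word

# Invented feature values, for illustration only.
templates = train([
    ("yes", [0.9, 0.1, 0.3]),
    ("no",  [0.2, 0.8, 0.5]),
    ("yes", [0.8, 0.2, 0.4]),
])
print(recognise(templates, [0.85, 0.15, 0.35]))  # yes
print(recognise(templates, [0.10, 0.90, 0.60]))  # no
```

No segmentation or property analysis takes place: the input is simply compared with what was stored. Adding more training patterns per word gives the comparison more to work with, which is why performance improves with the amount of training data.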
Modern Approach to Automatic Speech Recognition
With the availability of computers and high-speed microprocessors, more research was done using the huge computational power available to tackle the speech recognition problem. A complete solution has still not been found. Nevertheless, scientists were able to implement new approaches that proved much more efficient than earlier methods: speech recognition systems can now recognise more words, and with more accuracy (3, p.115). Some of these approaches are presented below.
Hidden Markov Models (HMMs)
Speech is divided into phonemes. Unfortunately, these phonemes do not remain the same; they change according to the surrounding phonemes (4, p.44). HMMs are a tool for representing these changes mathematically.
A Markov model consists of a number of states linked together, with each state corresponding to a unique output. Each link between two states is characterised by a probability called the transition probability (4, p.44). Whether the model moves from one state to another or remains in the same state is a function of the corresponding transition probability (2, p.50). A classic example illustrating Markov models is the following: consider a three-state weather system with state one being rainy, state two cloudy, and state three sunny. Such a system is shown in figure 2 (with the transition probabilities added for the explanation below). From the diagram, if the current day is sunny, the probability of tomorrow being cloudy is 0.1, of tomorrow being rainy is 0.1, and of tomorrow being sunny is 0.8 (2, p.50).
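The weather model can be written down as a table of transition probabilities, as in the Python sketch below. Only the 'sunny' row is taken from the text above; the 'rainy' and 'cloudy' rows are placeholder values assumed for illustration, since figure 2 is not reproduced here.

```python
# Probability of tomorrow's weather (inner key) given today's (outer key).
TRANSITIONS = {
    "rainy":  {"rainy": 0.4, "cloudy": 0.3, "sunny": 0.3},  # assumed values
    "cloudy": {"rainy": 0.2, "cloudy": 0.6, "sunny": 0.2},  # assumed values
    "sunny":  {"rainy": 0.1, "cloudy": 0.1, "sunny": 0.8},  # from the text
}

def sequence_probability(today, following_days):
    """Probability of observing `following_days` in order, given `today`."""
    probability = 1.0
    current = today
    for next_day in following_days:
        # Each step depends only on the state of the previous day.
        probability *= TRANSITIONS[current][next_day]
        current = next_day
    return probability

print(sequence_probability("sunny", ["sunny", "sunny", "cloudy"]))
```

Given a sunny day today, the probability that the next three days are sunny, sunny, and cloudy is 0.8 × 0.8 × 0.1 = 0.064. Each row of the table sums to one, because some state must follow the current one.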
Figure 2: Three-state Markov model of the weather (2, p.51).
This example is an observable Markov model, since we can check which state we are currently in (2, p.50). Speech recognition systems, however, use hidden Markov models, since the state that produced a speech fragment cannot be observed directly by the speech recognition system (2, p.50). In a hidden Markov model, a state can produce many outputs; therefore, a probability distribution over all possible outputs is associated with each state. A diagram of a three-state HMM is shown in figure 3 (4, p.44). The figure shows that each state has five possible outputs (A, B, C, D, and E), each occurring with a probability given by b1(s), b2(s), or b3(s). HMMs are doubly probabilistic, since both the transition from one state to another and the output generated at that state are probabilistic (4, p.44). Consequently, given a sequence of outputs from an HMM, we cannot retrace with certainty the sequence of states that the HMM passed through to produce it (4, p.44). Looking at figure 3, an output sequence of A-B-C, for example, can be produced by any sequence of three states; however, each sequence of states has its own probability of occurrence. In speech recognition, each word is represented by a sequence of states (1, p.53); it is therefore essential to find the most probable sequence of states for any sequence of outputs. In fact, finding this sequence is equivalent to solving the speech recognition problem.
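The probability attached to one particular sequence of states can be written out explicitly. The notation is partly assumed: b_j(s) is the output distribution of state j as in figure 3, while π_i (the probability of starting in state i) and a_ij (the transition probability from state i to state j) are symbols introduced here for illustration. For a state sequence q_1, ..., q_T and an output sequence o_1, ..., o_T:

```latex
P(o_1,\dots,o_T,\; q_1,\dots,q_T)
  \;=\; \pi_{q_1}\, b_{q_1}(o_1) \prod_{t=2}^{T} a_{q_{t-1} q_t}\, b_{q_t}(o_t)
```

For the output sequence A-B-C and the state sequence 1-2-3, this gives π_1 · b_1(A) · a_12 · b_2(B) · a_23 · b_3(C). Each of the 3 × 3 × 3 = 27 possible three-state sequences yields its own value, and the recogniser looks for the sequence with the largest one.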
Figure 3: Three-state hidden Markov model (4, p.44).
The sequence of states is chosen according to its probability. However, checking the probabilities of all possible sequences can be very time consuming, especially for the HMMs used in speech recognition, which are much more complicated than the three-state example in figure 3. This problem was solved using an algorithm that exploits the fact that the probability of being in a certain state depends on the previous state (4, p.44).
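The text does not name the algorithm. A well-known algorithm with exactly this property is the Viterbi algorithm, and the Python sketch below assumes that it is the kind of method meant. All the probabilities in the example model are invented for illustration; they are not the values of figure 3.

```python
def viterbi(outputs, states, start_p, trans_p, emit_p):
    """Return (probability, path) of the most probable state sequence."""
    # best[s] holds the best (probability, path) among sequences ending in s.
    best = {s: (start_p[s] * emit_p[s][outputs[0]], [s]) for s in states}
    for symbol in outputs[1:]:
        new_best = {}
        for s in states:
            # Only the best way of reaching each previous state matters,
            # so complete sequences never need to be enumerated.
            prob, path = max(
                ((best[p][0] * trans_p[p][s], best[p][1]) for p in states),
                key=lambda item: item[0],
            )
            new_best[s] = (prob * emit_p[s][symbol], path + [s])
        best = new_best
    return max(best.values(), key=lambda item: item[0])

# Hypothetical three-state model with outputs A to E (invented values).
STATES = [1, 2, 3]
START = {1: 0.6, 2: 0.3, 3: 0.1}
TRANS = {
    1: {1: 0.5, 2: 0.4, 3: 0.1},
    2: {1: 0.1, 2: 0.5, 3: 0.4},
    3: {1: 0.1, 2: 0.1, 3: 0.8},
}
EMIT = {
    1: {"A": 0.60, "B": 0.2, "C": 0.1, "D": 0.05, "E": 0.05},
    2: {"A": 0.10, "B": 0.6, "C": 0.2, "D": 0.05, "E": 0.05},
    3: {"A": 0.05, "B": 0.1, "C": 0.6, "D": 0.15, "E": 0.10},
}

probability, path = viterbi(["A", "B", "C"], STATES, START, TRANS, EMIT)
print(path, probability)
```

With these invented values, the most probable explanation of the outputs A-B-C is the state sequence 1-2-3. The work grows with the number of outputs multiplied by the square of the number of states, instead of with the number of possible state sequences, which grows exponentially with the length of the output.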
Training of an Automatic Speech Recognition System Based on HMMs
As mentioned earlier, the major components of an HMM system are the transition probabilities between states and the output probability distribution of each state. For a speech recognition system to perform well, these probabilities must be adapted to factors such as the language, the possible number of speakers, and so on (3, p.115). Determining these probabilities is part of what is known as training the speech recognition system.
This training process depends on whether we are dealing with a speaker-dependent or a speaker-independent speech recognition system. In the first case, speech samples are taken from the user and the probabilities are determined accordingly. In the second case, speech samples are collected from many speakers, together with the text of what was said. Training is then much more complicated, since the spectrogram (a measure of frequency vs. time) of the same word depends on the speaker. The training process also involves building a dictionary that holds the vocabulary, along with a grammar of permitted word sequences (4, p.42).
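To give a feel for what determining the probabilities means, the Python sketch below estimates transition probabilities by counting. It rests on a strong simplifying assumption that does not hold in practice: each training sample is taken to arrive already labelled with its sequence of states. In a real HMM the states are hidden, so actual training procedures are more involved, and the sources cited here do not describe them in detail.

```python
from collections import Counter, defaultdict

def estimate_transitions(state_sequences):
    """Estimate transition probabilities as relative frequencies.

    state_sequences: list of state sequences, one per training sample
    (assumed to be known, which is a simplification).
    """
    counts = defaultdict(Counter)
    for sequence in state_sequences:
        for previous, following in zip(sequence, sequence[1:]):
            counts[previous][following] += 1
    return {
        previous: {state: n / sum(c.values()) for state, n in c.items()}
        for previous, c in counts.items()
    }

# Two invented training samples of the same word.
print(estimate_transitions([[1, 1, 2, 3], [1, 2, 2, 3]]))
```

Here state 1 is followed by itself once and by state 2 twice, so the estimated probabilities are 1/3 and 2/3. A speaker-dependent system would count over samples from a single user; a speaker-independent system would pool samples from many speakers, whose differing pronunciations make the estimates harder to obtain.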
Sub-Word Units
In HMMs, each word is represented by a sequence of states (1, p.53). A word is recognised from the sequence of states that is most probably associated with a sequence of outputs. Therefore, the unit for such HMMs is the word. Many scientists believe that using sub-words instead of words may improve the quality of speech recognition (1, p.50).
To implement sub-word HMMs, a system of sub-word units must be selected. The simplest sub-word units are phones. Phones seem to be the right choice of unit for an HMM, since they are few in number and easily trained, but the performance of such an HMM is poor because a phone is affected by the surrounding phones (1, p.53). Another choice of sub-word unit is the syllable. Like phones, syllables are affected by surrounding syllables, but there are far more of them (around 20,000 in English), which makes them hard to train (1, p.53). A newer sub-word unit, known as the triphone, seems to be the most successful. Triphones solve the problem of influence between sub-word units and their surroundings by modelling each phone according to its left and right neighbours (1, p.53). As an example, the ‘t’ in ‘still’ is modelled by the s-t-i triphone (1, p.53). The immediate problem is the large number of triphones, since each phone is combined with all possible left and right neighbours. This problem can be reduced by exploiting the fact that some triphones are very similar, since many neighbouring phones affect a phone in the same way (1, pp.53-54). For example, the effect on the ‘t’ in ‘still’ is similar to that in ‘steal’ (1, pp.53-54). Even though such approximations affect the performance of the recognition system, it remains within acceptable standards (1, p.54).
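The construction of triphones, and the sharing of similar ones, can be sketched in Python as follows. The phone symbols, the use of 'sil' to pad the word edges, and the `CONTEXT_CLASS` table that treats the vowels of 'still' and 'steal' alike are assumptions made for this example only.

```python
def to_triphones(phones):
    """Label each phone with its left and right neighbours (left-phone-right).

    Word edges are padded with 'sil' (silence), an assumed convention.
    """
    padded = ["sil"] + list(phones) + ["sil"]
    return [
        f"{padded[i - 1]}-{padded[i]}-{padded[i + 1]}"
        for i in range(1, len(padded) - 1)
    ]

# Hypothetical grouping: both vowels are treated as the same context 'V'.
CONTEXT_CLASS = {"i": "V", "iy": "V"}

def tied(triphone, classes=CONTEXT_CLASS):
    """Map a triphone to the shared model used for all similar contexts."""
    left, phone, right = triphone.split("-")
    return f"{classes.get(left, left)}-{phone}-{classes.get(right, right)}"

print(to_triphones(["s", "t", "i", "l"]))   # 'still'
print(tied("s-t-i"), tied("s-t-iy"))        # 'still' and 'steal' share a model
```

The second phone of 'still' becomes the triphone s-t-i, as in the example above. With an assumed inventory of 40 phones there would be up to 40 × 40 × 40 = 64,000 distinct triphones, which is why grouping similar contexts is needed to keep training manageable.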
Applications of Automatic Speech Recognition Systems
With all the time and money spent on research into speech recognition systems, one may wonder about the applications of speech recognition. This part presents some of the currently available applications, along with some future applications, of automatic speech recognition systems.
Automated Call-Type Recognition
An interesting and relatively simple application of speech recognition systems is automated call-type recognition. At pay phones, operators are needed to determine the type of call the caller wants to make (7, p.490). Speech recognition may be used instead of operators. Five types of calls are available: ‘collect’, ‘calling card’, ‘operator’ for operator-assisted calls, ‘third number’ for third-party billing calls, and ‘person’ for person-to-person calls (7, p.490). For this application, the speech recognition system must be speaker-independent and capable of spotting and recognising the five key words mentioned above in a speech sample (2, p.52). The difficulty in this application is the high level of background noise, since pay phones are usually located in public places; however, this problem can be solved using appropriate speech recognition systems (low-level speakers, etc.) (2, p.52).
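The spotting of the five key words can be illustrated with a short Python sketch. It assumes that the recogniser has already turned the caller's speech into text, whereas real word spotting works on the speech signal itself; the function name and the rule of rejecting ambiguous requests are choices made for this example.

```python
import re

# The five call types listed above.
CALL_TYPES = ["collect", "calling card", "operator", "third number", "person"]

def spot_call_type(transcript):
    """Return the single call type mentioned in the transcript, else None."""
    # Normalise to lower-case words separated by single spaces.
    words = " ".join(re.findall(r"[a-z]+", transcript.lower()))
    found = [k for k in CALL_TYPES if f" {k} " in f" {words} "]
    # Reject the request if no key word, or more than one, was spotted.
    return found[0] if len(found) == 1 else None

print(spot_call_type("I would like to make a collect call, please"))  # collect
print(spot_call_type("Um, calling card"))                             # calling card
print(spot_call_type("Hello?"))                                       # None
```

Everything the caller says apart from the key word is ignored, which is what keeps the vocabulary of such a system small.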
Data Entry
Entering data using speech recognition is very practical while performing a manual task (6, p.102). A speech recognition system for this application is highly complex and structured, since it must contain a large vocabulary. For data entry, both speaker-dependent and speaker-independent speech recognition systems are available, even though speaker-independent systems perform better than speaker-dependent systems. They are also available for discrete or continuous speech (6, p.102). Data entry applications remain few, since the performance of speech recognition systems in this field is still limited.
Future Applications Using Automatic Speech Recognition Systems
With the increasing performance of automatic speech recognition systems, companies are more interested in integrating speech recognition systems into their products. Car manufacturers are interested in replacing all the levers, knobs, and buttons with a speech recognition system capable of doing everything from raising the temperature to locking the doors and turning on the radio (5, p.49). In this way, the electronic content of the car is increased whereas the mechanical content is reduced. This makes the car easier to design and build, and therefore less costly (5, p.49). Others are thinking of applying speech recognition systems to kitchen appliances such as dishwashers, ovens, and refrigerators. Air-conditioners might some day be voice controlled (5, p.49).
Conclusion
The gradual but inevitable development of speech recognition systems will surely lead to a system that will one day compare to the perfect speech recognition device, the human being. New methods and algorithms are researched every day to improve the performance of speech recognition systems. Will we reach a stage where keyboards, buttons, and all input devices become obsolete? Time will tell.