There are three kinds of speech-recognition
systems:
The first system has a large vocabulary, but it can only
understand one user. This system is mainly used in the
computer programs people can buy.
The second system can understand all users, but it only has a limited
vocabulary. This is the system that telephone information services use.
The third system uses a very small vocabulary, only a few hundred
words, and it is used in difficult circumstances, such as in a car or
in a crowded room. These are situations where the user isn't directly
in front of a microphone. It could be used for voice-controlled (car)
stereos, mobile phones, and so on.
Older speech-recognition programs use a technique named
'template matching'.
The computer uses a complete acoustic library: every word has to be
spoken and recorded before it can be recognized. When a word is
spoken, the computer searches the recorded words for the pattern
that most closely resembles the word to be recognized. This technique
works, but only if the words are spoken with pauses between them
(discrete speech) and only with a small vocabulary.
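The search for the closest recorded pattern can be sketched in a few
lines. This is only an illustration under assumptions: the text does
not say how patterns are compared, so the sketch uses dynamic time
warping (DTW), a common choice for template matching because it
tolerates words spoken faster or slower than the recording. The
names dtw_distance, recognize and TEMPLATES, and all the numbers,
are made up for the example.

```python
def dtw_distance(a, b):
    """Dynamic time warping distance between two 1-D feature sequences."""
    inf = float("inf")
    # cost[i][j] = cheapest alignment of the first i frames of a
    # with the first j frames of b.
    cost = [[inf] * (len(b) + 1) for _ in range(len(a) + 1)]
    cost[0][0] = 0.0
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            step = abs(a[i - 1] - b[j - 1])
            cost[i][j] = step + min(
                cost[i - 1][j],      # frame of a repeated against b
                cost[i][j - 1],      # frame of b repeated against a
                cost[i - 1][j - 1],  # both sequences advance
            )
    return cost[len(a)][len(b)]


def recognize(spoken, templates):
    """Return the word whose recorded template is closest to the input."""
    return min(templates, key=lambda word: dtw_distance(spoken, templates[word]))


# Hypothetical acoustic library: one made-up feature value per time frame.
TEMPLATES = {
    "yes": [1.0, 3.0, 4.0, 3.0, 1.0],
    "no": [5.0, 5.0, 2.0, 1.0],
    "stop": [2.0, 6.0, 6.0, 2.0, 0.0],
}

# Same shape as the "yes" template, but spoken more slowly.
spoken = [1.0, 1.0, 3.0, 4.0, 4.0, 3.0, 1.0]
print(recognize(spoken, TEMPLATES))  # prints "yes"
```

Every input has to be compared with every template, which is one
reason this approach only suits a small vocabulary.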
For continuous speech, we need another technique.
Modern programs first build a phonetic profile of the
user.
When that user speaks, the computer converts the speech into 'vectors',
which are compared with the phonetic entries for words in its
database.
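As a rough illustration of comparing vectors with a database, the
sketch below labels each incoming vector with the nearest stored
reference vector. It is a simplification under assumptions: the
phoneme labels, the three-dimensional values and the names
PHONEME_VECTORS, nearest_phoneme and decode are invented for the
example, and real programs use far more dimensions and statistical
models rather than a plain nearest-neighbour search.

```python
import math

# Hypothetical database: one made-up reference vector per phoneme.
PHONEME_VECTORS = {
    "AH": (0.9, 0.1, 0.2),
    "S": (0.1, 0.8, 0.7),
    "T": (0.2, 0.2, 0.9),
}


def nearest_phoneme(vector):
    """Return the phoneme whose reference vector is closest (Euclidean)."""
    return min(
        PHONEME_VECTORS,
        key=lambda phoneme: math.dist(vector, PHONEME_VECTORS[phoneme]),
    )


def decode(frames):
    """Label every frame, then merge consecutive identical labels."""
    merged = []
    for frame in frames:
        label = nearest_phoneme(frame)
        if not merged or merged[-1] != label:
            merged.append(label)
    return merged


# Four made-up speech frames: two "S"-like, one "AH"-like, one "T"-like.
frames = [
    (0.15, 0.75, 0.65),
    (0.10, 0.85, 0.70),
    (0.85, 0.15, 0.25),
    (0.25, 0.20, 0.95),
]
print(decode(frames))  # prints ['S', 'AH', 'T']
```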
Results improve if the computer uses a 'language model': a set of
simple rules used to deduce what was most likely said.
For example: the computer knows that in most cases a verb follows
'I won't'. The computer can also consider the subject of the
text and deduce which words will probably be used. This type of
speech recognition has an error rate of only about 5 percent.
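One simple way to picture a language model is a table of how often
each word follows another. The sketch below is an assumption, not a
description of any particular program: it counts word pairs (bigrams)
in a tiny made-up corpus and uses the counts to choose between words
that sound alike. CORPUS, train_bigrams and pick_candidate are
hypothetical names.

```python
from collections import Counter, defaultdict

# Tiny made-up training text; a real model needs far more data.
CORPUS = [
    "i won't go there",
    "i won't go home",
    "i won't know",
    "they go home",
    "the snow fell",
]


def train_bigrams(sentences):
    """Count how often each word follows each other word."""
    counts = defaultdict(Counter)
    for sentence in sentences:
        words = sentence.split()
        for previous, following in zip(words, words[1:]):
            counts[previous][following] += 1
    return counts


def pick_candidate(counts, previous_word, candidates):
    """Choose the candidate seen most often after previous_word."""
    return max(candidates, key=lambda word: counts[previous_word][word])


counts = train_bigrams(CORPUS)

# Suppose the acoustics alone cannot decide between these three words.
print(pick_candidate(counts, "won't", ["snow", "go", "no"]))  # prints "go"
```

Here the verb 'go' wins because it is the only candidate that was
seen after 'won't' in the training text.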