Speech Systems

There are three kinds of speech-recognition systems:
The first has a large vocabulary but understands only one user. It is used mainly in the computer programs sold to consumers.
The second understands all users but has only a limited vocabulary. Telephone information services use this kind of system.
The third uses a very small vocabulary, only a few hundred words, and is meant for difficult conditions such as a car or a crowded room, where the user is not directly in front of a microphone. It could be used for voice-controlled (car) stereos, mobile phones, and so on.

Older speech-recognition programs use a technique called 'template matching'.
The computer relies on a complete acoustic library: every word has to be spoken and recorded before it can be recognized. When a word is spoken, the computer searches the recorded words for the pattern that most closely resembles it. This technique works, but only if the words are spoken with pauses between them (discrete speech) and only with a small vocabulary.
For fluent, continuous speech, another technique is needed.
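
The template-matching idea used by older programs can be illustrated with a small sketch. The text does not say how two recordings are compared, so this example assumes dynamic time warping, a classic way to compare a slow and a fast utterance of the same word. The one-number-per-frame features and the names TEMPLATES, dtw_distance and recognize are made up for illustration only.

```python
import math

# Hypothetical acoustic library: one recorded template per word.
# Real systems store a multi-dimensional feature vector per frame;
# a single number per frame keeps the sketch short.
TEMPLATES = {
    "yes": [1.0, 3.0, 4.0, 3.0, 1.0],
    "no": [4.0, 2.0, 1.0, 1.0, 0.5],
}


def dtw_distance(a, b):
    """Dynamic time warping distance between two feature sequences."""
    n, m = len(a), len(b)
    # cost[i][j] = cheapest alignment of the first i frames of a
    # with the first j frames of b.
    cost = [[math.inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            frame_diff = abs(a[i - 1] - b[j - 1])
            cost[i][j] = frame_diff + min(
                cost[i - 1][j],      # frame of a repeats a frame of b
                cost[i][j - 1],      # frame of b repeats a frame of a
                cost[i - 1][j - 1],  # both sequences advance
            )
    return cost[n][m]


def recognize(utterance, templates):
    """Return the word whose recorded template is closest to the utterance."""
    return min(templates, key=lambda word: dtw_distance(utterance, templates[word]))


# A slower, slightly noisy "yes" still matches the "yes" template.
spoken = [1.0, 1.1, 3.0, 4.1, 4.0, 3.0, 1.0]
print(recognize(spoken, TEMPLATES))
```

Every spoken word is compared against every template, so the cost grows with the size of the vocabulary, and the comparison needs to know where each word starts and ends. That fits the limits described above: a small vocabulary and speech with pauses.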

Modern programs first build a phonetic profile of the user.
When that user speaks, the computer converts the speech into 'vectors', which are compared with the phonetic words in its database.
Results improve if the computer uses a 'language model': a set of simple rules used to deduce what was most likely said.
For example, the computer knows that 'I won't' is followed by a verb in most cases. The computer can also take the subject of the text into account and deduce which words will probably be used. This type of speech recognition has an error rate of only about 5 percent.
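
How a language model can tip the balance between two similar-sounding words can be shown with a minimal sketch. It assumes the simplest possible model, counts of word pairs (bigrams) taken from sample sentences, and a made-up way of combining those counts with acoustic scores. The names train_bigrams and pick_word, the tiny corpus and the scores are all hypothetical.

```python
from collections import Counter


def train_bigrams(sentences):
    """Count how often each word follows another in the sample sentences."""
    counts = Counter()
    for sentence in sentences:
        words = sentence.lower().split()
        counts.update(zip(words, words[1:]))
    return counts


def pick_word(previous, candidates, bigrams):
    """Choose among acoustically similar candidates.

    candidates maps each word to an acoustic score (higher is better).
    The score is multiplied by 1 + the bigram count, an illustrative
    weighting chosen so that unseen word pairs are not ruled out entirely.
    """
    return max(
        candidates,
        key=lambda word: candidates[word] * (1 + bigrams[(previous, word)]),
    )


corpus = ["i won't go home", "i won't go there", "we won't know"]
bigrams = train_bigrams(corpus)

# Acoustically, "goat" scores slightly higher than "go", but the
# model has seen "won't go" before and never "won't goat".
print(pick_word("won't", {"go": 0.45, "goat": 0.55}, bigrams))
```

Here the acoustic comparison alone would choose 'goat', while the word-pair counts favor the verb 'go', in line with the 'I won't' example above.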

