Although hearing is perhaps not regarded as being as important as vision, the human
auditory system plays an important role in structured thinking. How so? Even though the
brain processes only about a million bits of auditory data per second, as opposed to the
50 billion bits of data processed per second from each eye, people usually receive
communications through the ears. Thus, it is the meaningful language in audio
messages that compels people to think about what is being said (natural
language processing) as well as how the construction of the message relates to what
is meant. That is why it is easier to follow a news broadcast by listening to it
without the picture than by watching it with the sound turned off (Kurzweil 265).
The main focus in AI when it comes to sound processing is to make a computer that can
recognize what a person says to it. Researchers pursue this, rather than making
a computer recognize the sound of a car or a ringing telephone, because 1)
speech usually carries something meaningful and 2) a computer capable
of automated speech recognition (ASR)
would be the next step in the man-machine interface (MMI).
As with vision, sound is actually analyzed in the brain, which is precisely what makes
it difficult to study, because the brain is not understood
very well. That is why speech researchers are more concerned with getting a computer
simply to recognize speech than with getting a computer to mimic how people recognize
speech (i.e., the top-down approach).
Computers of today can store many hours of sound digitally. However, strict
voice-pattern-to-voice-pattern matching is not accurate enough for a computer to realize
that the voice it received and the voice it stored in memory come from the same person
saying the same thing. This is not the fault of the computer per se; people tend to
speak a little differently each time they talk. The example in the box below
illustrates this point:
To make ASR systems smarter, researchers are trying to develop pattern-recognition programs
that can recognize the similar patterns in the same speech spoken at different times (like
the sample above), in one person saying different things, and in different people saying
the same thing (speaker independence).
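As a rough illustration of why strict matching fails and a more tolerant comparison helps, the sketch below compares two made-up loudness contours of the same word, the second spoken with slightly different timing. The data, the function names, and the choice of dynamic time warping (a classic alignment technique that this article does not name) are all assumptions for illustration, not a description of how any particular ASR system works.

```python
def strict_distance(a, b):
    """Position-by-position comparison: sample i of a is compared only with sample i of b."""
    if len(a) != len(b):
        raise ValueError("strict matching needs patterns of equal length")
    return sum(abs(x - y) for x, y in zip(a, b))


def dtw_distance(a, b):
    """Dynamic time warping: the cheapest alignment when samples may be stretched in time."""
    inf = float("inf")
    n, m = len(a), len(b)
    # cost[i][j] = cheapest alignment of the first i samples of a with the first j of b
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            step = abs(a[i - 1] - b[j - 1])
            # A sample may match a new sample, or be held while the other pattern advances.
            cost[i][j] = step + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return cost[n][m]


# Hypothetical loudness contours: the same word, spoken with different timing.
first_utterance = [0, 1, 3, 5, 3, 1, 0, 0]
second_utterance = [0, 0, 1, 1, 3, 5, 3, 1]

print("strict:", strict_distance(first_utterance, second_utterance))
print("warped:", dtw_distance(first_utterance, second_utterance))
```

The warped distance comes out far smaller than the strict one, because the comparison is allowed to line up the rise and fall of the two contours even though they occur at different moments. Telling different speakers apart, or recognizing the same word from different speakers, needs considerably more than this toy measure.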
ASR research has spawned various voice-activated programs today, though it often
requires extensive training to get a computer to recognize a particular voice.
Eventually, coupled with natural
language processing and other intelligent thought capabilities, a computer may one day
be able to carry out commands that come from conversation-like phrases.
An aspect of sound-processing research that has made faster progress than ASR is speech
synthesis. Drawing on knowledge from ASR, such as phonemes,
speech synthesizers have become relatively successful in generating understandable words
and sentences. However, making a computer speak naturally, without the stoic voice
reminiscent of old science-fiction robots, remains a greater challenge.
The importance of ASR and speech synthesis as a branch of AI lies in the development
of pattern-recognition programs that understand the bits of data that compose the
message. Some of the early technologies from this field have found their way into
the applications market, but they still need to be refined before
a computer can communicate as intelligently and naturally as people do. Perhaps in
the future, a person will be able to carry on an engaging conversation
with a personal computer that combines the power of ASR, language processing,
inferencing through a knowledge base, some other intelligent component that lets the
computer be an active conversationalist and steer the discussion in a certain direction,
and a speech synthesizer that makes the computer sound like a person.