One aspect of speech technology is the conversion of
written text (on the computer) to speech. Most people still think of
metallic voices when they hear about it. Indeed, the text-to-speech
technology of past years wasn't that great.
But anyone who has listened to a recent text-to-speech program
will tell you that computer-generated voices are quite clear now.
They're still a little robotic, but understandable.
One way to manage this is to record words (or parts of words)
and paste them together into continuous speech. This method
works great for purposes that use a small, fixed vocabulary, such as
telephone information services. The method isn't suitable for
diverse texts.
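
The record-and-paste approach can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the RECORDINGS table, its words, and the speak function are hypothetical, and short lists of numbers stand in for real recorded audio samples.

```python
# Hypothetical store of pre-recorded snippets. Each list of numbers is a
# placeholder for the audio samples of one recorded word.
RECORDINGS = {
    "the": [1, 2],
    "train": [3, 4, 5],
    "leaves": [6, 7],
    "at": [8],
    "nine": [9, 10],
}


def speak(sentence, recordings=RECORDINGS):
    """Paste the recordings of each word together into one sample list."""
    samples = []
    for word in sentence.lower().split():
        if word not in recordings:
            # A fixed vocabulary cannot handle words that were never recorded.
            raise KeyError(f"no recording for {word!r}")
        samples.extend(recordings[word])
    return samples


print(speak("The train leaves at nine"))
```

A word outside the table raises an error, which is exactly why this method only suits a small, fixed vocabulary.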
For texts with a very large vocabulary, the computer must generate the
speech itself.
You could record every word in the English language and
then play the words in sequence to generate speech, but this sounds very
robotic, and the sound files would occupy a huge amount of
disk space.
If the computer itself generates the speech, no sound files are
needed.
Another technique is to record every transition between the different sounds in a language. For example: for 'sh' you would record the transition between the s and the h. You can then generate fluid speech by pasting those transitions together as needed. Suppose there are 50 different sounds in a language; then there would be 50 × 50 = 2500 transitions in that language.
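
The transition count, and the way a word breaks down into transitions, can be checked with a small Python sketch. The sound labels ("s0" to "s49", and "k", "a", "t") and both function names are made up for illustration. The sketch assumes every sound can follow every other sound, including itself, which is what the 50 × 50 figure implies.

```python
from itertools import product


def all_transitions(sounds):
    """Every ordered pair of sounds, including a sound followed by itself."""
    return [a + "-" + b for a, b in product(sounds, repeat=2)]


def transitions_for(sound_sequence):
    """Split a sequence of sounds into the transitions needed to speak it."""
    return [a + "-" + b for a, b in zip(sound_sequence, sound_sequence[1:])]


# Fifty placeholder sound labels, not real phonetic symbols.
sounds = [f"s{i}" for i in range(50)]
print(len(all_transitions(sounds)))      # 2500

# A three-sound word needs two recorded transitions.
print(transitions_for(["k", "a", "t"]))  # ['k-a', 'a-t']
```

A sequence of n sounds needs n - 1 transitions, so the recordings to paste together are found by walking through the sequence one pair at a time.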