1 September 2000
Voice Technologies: The Outlook and Beyond
"Hello Mary. Power up please."
"Good Morning Tom. Startup sequence initiated."
The dark dashboard flickers to life, flashing green,
yellow and red. The five blank view screens gyrate for a few seconds and then
stabilize with views around the shuttle. A soft hum fills the cabin as the air
thrusters charge to full capacity.
"Tom, startup sequence complete. Do you want manual or
"Fine. Where shall we go today?"
"Well, I need to get to the office by nine-thirty, but I
would like to first stop by the flower shop in town. It's Sarah's birthday today."
"Yes, I know. I'm glad I didn't have to remind you. Would
you like to go over your schedule, or would you like to listen to the news, or would
you just prefer some music?"
"The news please, but keep it at level two."
"Tom, Air Control indicates traffic into the city is heavy
and we can expect a thirty-minute delay. Would you like me to plot a more scenic route?"
"That will be fine, just as long as I get to the office in
"Yes, Mary, bigger than the both of us."
The wing-tipped door closes on the egg-shaped craft, making
a pressurized sound as air escapes from within. Tom presses the vertical booster
button on the control stick and the shuttle begins to rise. He holds the button
until the craft reaches a height of 100 feet. The heads-up display shows Tom is
locked onto Mary's course and needs only to follow the imaginary polygons ahead of
him. Tom gently pushes forward on the control stick, and with a hushed hum they're off.
What you've just witnessed, though fantasy, may some day
become reality. The craft itself has yet to be conceived, but the conversation
between Tom and Mary, though a form of data exchange between man and machine, may
not be too far off. And though Mary may take offense to this comment, her
personality and intelligence would be programmed by some programmer or manufacturer
to meet a user's specifications. Mary's ability to speak and understand the
spoken word would be just a small part of her overall system. However, this paper
is about voice technologies and where they may lead, not artificial intelligence,
and therefore this paper focuses mainly on the topic at hand: voice technologies.
In my scenario with Tom and Mary, three voice
technologies, among many others, were employed. One, speech recognition: where Tom initiates
the conversation with Mary by speaking to her. Mary recognizes his speech patterns
and acknowledges him with a greeting. Two, speech synthesis: where Mary is able to
form words and complete sentences, as well as questions which require a response
from Tom. And lastly, continuous speech recognition: where Mary is able to understand
complete sentences, not just words and phrases. She is able to decipher instructions
and commands from what Tom says, and she acts upon them accordingly.
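The simplest of the three, recognizing isolated command words, is often illustrated with template matching. Below is a minimal sketch in Python of dynamic time warping (DTW), a classic template-matching technique from early recognizers; the one-dimensional "feature" values and the two-word vocabulary are invented stand-ins for real acoustic features.

```python
# A toy isolated-word recognizer built on dynamic time warping (DTW),
# a classic template-matching technique from early speech recognition.
# The 1-D "feature" values and the two-word vocabulary are invented
# stand-ins for real acoustic features such as cepstral coefficients.

def dtw_distance(a, b):
    """Cost of the best time-alignment between feature sequences a and b."""
    INF = float("inf")
    n, m = len(a), len(b)
    cost = [[INF] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(a[i - 1] - b[j - 1])              # local distance
            cost[i][j] = d + min(cost[i - 1][j],      # stretch a
                                 cost[i][j - 1],      # stretch b
                                 cost[i - 1][j - 1])  # step both
    return cost[n][m]

def recognize(utterance, templates):
    """Return the stored word whose template best matches the utterance."""
    return min(templates, key=lambda word: dtw_distance(utterance, templates[word]))

# Hypothetical stored templates, one per command word.
templates = {
    "open":  [0.1, 0.9, 0.8, 0.2],
    "close": [0.9, 0.1, 0.2, 0.8],
}
# A new utterance: "open", spoken a little more slowly.
spoken = [0.1, 0.5, 0.9, 0.85, 0.2]
print(recognize(spoken, templates))  # prints "open"
```

Because DTW stretches and compresses time, the slower utterance still matches its template. Continuous speech understanding, of course, demands far more than this.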
Understandably, programming and implementing such an advanced
system is well beyond our current capabilities, but with vendors and researchers
like Kurzweil, IBM, MIT's Laboratory for Computer Science and the Center for Speech
Technology Research (CSTR), the road may be long, but it will be sure. In the past,
processor speeds and storage space were limiting factors on voice technologies, but
with processors closing in on 1000 MHz and thirty-gigabyte storage devices already
on the shelves, this author is entirely convinced conversations like Tom and Mary's
will be an everyday occurrence. The falling cost of producing these devices has also
spurred the use and research of voice technologies. Granted, a great deal of time,
money and resources will have to be spent before we reach Tom and Mary's level, but
I foresee the day when the voice will replace the mouse. And believe it or not, that
day has arrived.
Kurzweil, a leading researcher and vendor in the field, is
currently marketing several products for Microsoft Windows that could satisfy many
small and big business needs through voice technologies. One product by Kurzweil
called Voice for Windows converts a user's voice into text. Voice commands allow a
user to open, close, edit and save text documents, spreadsheets, or even databases.
Voice for Windows comes in two flavors, one with an active vocabulary of 30,000 words,
and another with an active vocabulary of 60,000 words.
Kurzweil also offers a nice little application called
Voicepad for Windows. This product is a cut-down version of Voice for Windows with
only a 17,000 word active vocabulary, but Voicepad is a straightforward application
that converts speech to text. It's something like Notepad, but Voicepad uses voice
instead of a keyboard.
I personally like Kurzweil's Talk Back for Windows. This
program is a speech synthesis application that will read back text documents such
as e-mail messages and read-me files. This application relieves the reading burden
from the user, and if you've ever read a long read-me file, you'll have an appreciation
for what I mean. As of this writing, I've yet to receive a reply from Kurzweil to
my email requesting a price list for these products.
IBM is probably the biggest player in the voice playpen,
and its product, ViaVoice98 ($224), takes dictation into Windows applications much
like Kurzweil's products, but it comes standard with a 64,000-word base vocabulary
expandable by another 64,000 words. In July, IBM also released a Chinese version of
ViaVoice98, and if you've ever used a Chinese or a Japanese keyboard to put just a
single character onto the screen, you'll understand why such a product has been
quickly accepted by the Chinese people.
Two other products that IBM has introduced into the realm of voice technologies are
SpeechML and Home Page Reader for Windows ($149). The latter is an application that
translates Web content to audio. It combines ViaVoice98 Outloud (a text-to-speech
application from IBM) and Netscape Navigator. For a person to use this product he
or she must have a PC with at least a Pentium 150 MHz CPU, MS Windows 95 or better,
32 MB of RAM and 17 MB of hard drive space. In addition, a user can send and receive
email through the browser. Navigation is accomplished through a specially designed
keypad. Home Page Reader is currently only in English and Japanese, but additional
language versions are forthcoming.
The World Wide Web Consortium is currently reviewing a
request to use IBM's SpeechML as the standard for the new speech markup language.
SpeechML will allow Web designers to incorporate interactive speech into their Web
sites using new XML extensions. SpeechML will allow visitors to the site to listen
to all or a portion of the site, and also interact with the site using their voices.
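As a rough illustration of the idea, here is a hypothetical snippet assembled with Python's standard XML library. The tag names (speechml, say, listen, choice) are my own invention, since SpeechML's actual schema is not described in this paper; the point is only that interactive speech could be expressed as XML extensions.

```python
# A guess at what interactive speech markup might look like, assembled
# with Python's standard XML library. The tag names (speechml, say,
# listen, choice) are hypothetical; SpeechML's real schema is not
# described in this paper.
import xml.etree.ElementTree as ET

page = ET.Element("speechml")

# Something for the site to speak aloud to the visitor...
say = ET.SubElement(page, "say")
say.text = "Welcome. Would you like headlines or weather?"

# ...and the spoken replies the site will listen for.
listen = ET.SubElement(page, "listen")
ET.SubElement(listen, "choice", {"value": "headlines"})
ET.SubElement(listen, "choice", {"value": "weather"})

markup = ET.tostring(page, encoding="unicode")
print(markup)
```

A speech-enabled browser could read the "say" element aloud and match the visitor's reply against the listed choices, which is essentially the interaction the paragraph above describes.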
Now, not getting too far ahead of myself, although these
products by Kurzweil and IBM promise to move us several steps closer to Tom and
Mary's conversation, they still lack the ability to recognize speech without
training the application first. And the best any of these products promise is an
accuracy rate of 90 percent. This may be better than most typists, but in the case
of Tom and Mary, I think a zero-error rate is in order. Similarly, products offered
by other vendors like Creative Technologies and Dragon Systems require users to sit
for long hours in front of a computer before the software can understand them. These
vendors' products are similarly priced and require Pentium-class PCs with large hard drives
and a lot of system memory.
But don't despair. There are researchers at this moment trying to remove these
obstacles. For example, researchers at MIT's Laboratory for Computer Science are
developing systems that allow computers to be controlled with voice commands without
all that training. They call these "speech understanding systems." Their Jupiter
project is currently in use, and if a person calls 888-573-8255, he or she can get
information about the weather by simply speaking into the telephone. The system
uses speech recognition for voice commands and speech synthesis to prompt callers
and give them the results of their inquiries. Note however, this is not continuous
speech recognition. Other projects the MIT teams are currently working on include a
mapping application and an airline information system. MIT researchers are primarily
focusing on telephone response systems because they believe these services will reach
more users, as there are more telephones in homes than computers.
Also, for the past twenty years, a research unit at the
University of Edinburgh has been researching voice technologies. The Center for
Speech Technology Research (CSTR) is researching ways to make computers understand
continuous speech without having to go through any training requirements. CSTR is
also developing humanlike speech synthesis and researching voice verification for
security systems. In the field of speech recognition, CSTR has been working with
the likes of Plessey and Marconi on a project called Alvey. Alvey, an integrated
speech demonstrator, uses knowledge-based and statistical techniques to convert
speech to text.
CSTR has been a pioneer in the field of speech synthesis
as well. They have been developing a way to generate a natural sounding voice
through a method called "diphone synthesis." Combining stored speech samples with
the desired output creates smooth speech and controls the pitch and duration of the
sounds. CSTR released their Festival Speech Synthesis System and is downloadable
from the Internet for free. And lastly, CSTR has been researching voice verification
by comparing a speaker's voice with stored samples. According to CSTR, their system
is able to distinguish the difference between a recorded voice and a live voice.
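The core idea behind diphone concatenation can be sketched in a few lines of Python. This toy joins stored waveform snippets with a short linear crossfade at each splice point so the joins do not click; the "diphones" here are made-up number ramps rather than recorded speech, and the pitch and duration control used by systems like Festival is omitted.

```python
# A toy sketch of diphone concatenation. Stored waveform snippets that
# span the transition between two sounds are joined with a short linear
# crossfade so the splice points do not click. The "diphones" below are
# made-up number ramps, not real recorded speech.

def crossfade_join(chunks, overlap=4):
    """Concatenate waveform chunks, blending `overlap` samples at each joint."""
    out = list(chunks[0])
    for chunk in chunks[1:]:
        tail, head = out[-overlap:], list(chunk[:overlap])
        # Fade the old chunk out while fading the new chunk in.
        blended = [t * (1 - k / overlap) + h * (k / overlap)
                   for k, (t, h) in enumerate(zip(tail, head))]
        out = out[:-overlap] + blended + list(chunk[overlap:])
    return out

# Hypothetical stored diphones for the word "hi":
# silence-to-/h/, /h/-to-/ai/, and /ai/-to-silence.
diphones = {
    "sil-h":  [0.0, 0.1, 0.2, 0.3, 0.4, 0.5],
    "h-ai":   [0.5, 0.6, 0.7, 0.7, 0.6, 0.5],
    "ai-sil": [0.5, 0.4, 0.3, 0.2, 0.1, 0.0],
}
wave = crossfade_join([diphones["sil-h"], diphones["h-ai"], diphones["ai-sil"]])
```

Because each stored snippet already contains the transition between two sounds, the joins fall in the stable middle of each sound, which is what makes the concatenated result sound smooth.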
The immediate future for voice technologies is promising.
Although each application discussed in this paper has some limitations, I believe two
or three versions down the road these obstacles will be overcome. In the interim,
the current offerings do allow users to be more productive in their offices and at
their homes. And in the case of physically impaired individuals, these new products
have opened avenues to information and other opportunities once thought forever
closed. Voice applications of the future are only going to improve because the
technology can only get better. Tom and Mary may have to wait for us to catch up,
but it's only a matter of time before humans and machines interact as Tom and Mary
do. Maybe not during my lifetime, but someday it will happen.