my take on the web

1 September 2000

Voice Technologies: The Outlook and Beyond

     "Hello Mary. Power up please."
     "Good Morning Tom. Startup sequence initiated."
     The dark dashboard flickers to life, flashing green, yellow and red. The five blank view screens gyrate for a few seconds and then stabilize with views around the shuttle. A soft hum fills the cabin as the air thrusters charge to full capacity.
     "Tom, startup sequence complete. Do you want manual or auto control?"
     "Fine. Where shall we go today?"
     "Well, I need to get to the office by nine-thirty, but I would like to first stop by the flower shop in town. It's Sarah's birthday today you know."
     "Yes, I know. I'm glad I didn't have to remind you. Would you like to go over your schedule, or would you like to listen to the news, or would you just prefer some music?"
     "The news please, but keep it at level two."
     "Tom, Air Control indicates traffic into the city is heavy and we can expect a thirty-minute delay. Would you like to plot for a more scenic route?"
     "That will be fine, just as long as I get to the office in time."
     "Big meeting?"
     "Yes, Mary, bigger than the both of us."
     The wing-tipped door closes on the egg-shaped craft, making a pressurized sound as air escapes from within. Tom presses the vertical booster button on the control stick and the shuttle begins to rise. He holds the button until the craft reaches a height of 100 feet. The heads-up display shows Tom is locked onto Mary's course and need only follow the imaginary polygons along his path. Tom gently pushes forward on the control stick and with a hushed hum they're off.

     What you've just witnessed, though fantasy, may someday become reality. The craft itself has yet to be conceived, but the conversation between Tom and Mary, a form of data exchange between man and machine, may not be too far off. And though Mary may take offense at this comment, her personality and intelligence would be programmed by some programmer or manufacturer to meet a user's specifications. Mary's ability to speak and understand the spoken word would be just a small part of her overall system. However, this paper is about voice technologies and where they may lead, not artificial intelligence, so it keeps to the topic at hand: voice technologies.
     In my scenario with Tom and Mary, three voice technologies were employed, among many other technologies. One, speech recognition: Tom initiates the conversation by speaking to Mary, and she recognizes his speech patterns and acknowledges him with a greeting. Two, speech synthesis: Mary is able to form words and complete sentences, as well as questions that require a response from Tom. And lastly, continuous speech recognition: Mary is able to understand complete sentences, not just words and phrases. She deciphers instructions and commands from what Tom says and acts upon them accordingly.
     Understandably, programming and implementing such an advanced system is well beyond our current capabilities, but with vendors and researchers like Kurzweil, IBM, MIT's Laboratory for Computer Science and the Center for Speech Technology Research (CSTR) at work, the road may be long, but the destination is sure. In the past, processor speeds and storage space were limiting factors for voice technologies, but with processors closing in on 1000 MHz and storage devices already at thirty gigabytes, this author is entirely convinced conversations like Tom and Mary's will become an everyday occurrence. The falling cost of producing these devices has also spurred the use and research of voice technologies. Granted, a lot of time, money and resources will have to be spent before we reach Tom and Mary's level, but I foresee the day when the voice will replace the mouse. And believe it or not, that day has arrived.
     Kurzweil, a leading researcher and vendor from England, is currently marketing several products for Microsoft Windows that could satisfy the needs of many small and large businesses through voice technologies. One Kurzweil product, Voice for Windows, converts a user's voice into text. Voice commands allow a user to open, close, edit and save text documents, spreadsheets, or even databases. Voice for Windows comes in two flavors: one with an active vocabulary of 30,000 words, and another with an active vocabulary of 60,000 words.
     Kurzweil also offers a nice little application called Voicepad for Windows. This product is a cut-down version of Voice for Windows with only a 17,000-word active vocabulary, but it is a straightforward application that converts speech to text. It's something like Notepad, except that Voicepad uses voice instead of a keyboard.
     I personally like Kurzweil's Talk Back for Windows. This program is a speech synthesis application that will read back text documents such as e-mail messages and read-me files. This application relieves the user of the reading burden, and if you've ever read a long read-me file, you'll appreciate what I mean. As of this writing, I have yet to receive a reply from Kurzweil to my e-mail requesting a price list for these products.
     IBM is probably the biggest player in the voice playpen. Its product, ViaVoice98 ($224), takes dictation into Windows applications as Kurzweil's products do, but comes standard with a 64,000-word base vocabulary that is expandable by another 64,000 words. In July, IBM also released a Chinese version of ViaVoice98, and if you've ever used a Chinese or a Japanese keyboard to put just a single kanji character onto the screen, you'll understand why such a product has been quickly accepted by the Chinese people.
     Two other products that IBM has introduced into the realm of voice technologies are SpeechML and Home Page Reader for Windows ($149). The latter is an application that translates Web content to audio. It combines ViaVoice98 Outloud (a text-to-speech application from IBM) and Netscape Navigator. To use this product, a person must have a PC with at least a Pentium 150 MHz CPU, MS Windows 95 or later, 32 MB of RAM and 17 MB of hard drive space. In addition, a user can send and receive e-mail through the browser. Navigation is accomplished through a specially designed keypad. Home Page Reader is currently available only in English and Japanese, but additional language versions are forthcoming.
     The World Wide Web Consortium is currently reviewing a request to use IBM's SpeechML as the standard for the new speech markup language. SpeechML will allow Web designers to incorporate interactive speech into their Web sites using new XML extensions. SpeechML will allow visitors to the site to listen to all or a portion of the site, and also interact with the site using their voices.
     Now, not to get too far ahead of myself: although these products by Kurzweil and IBM promise to move us several steps closer to Tom and Mary's conversation, they still cannot recognize speech unless the user trains the application first. And the best any of these products promises is an accuracy rate of 90 percent. This may be better than most typists, but in the case of Tom and Mary, I think a zero-error rate is in order. Similarly, products offered by other vendors like Creative Technologies and Dragon Systems require users to sit for long hours in front of a computer before it can understand them. These vendors' products are similarly priced and require Pentium-class PCs with large hard drives and a lot of system memory.
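     To put the 90 percent figure in perspective, here is a small back-of-the-envelope sketch. It assumes the quoted accuracy is a per-word rate, which the vendors may or may not mean, and the 500-word document and the function name expected_errors are my own choices for illustration.

```python
def expected_errors(word_count: int, accuracy: float) -> int:
    """Expected number of misrecognized words, assuming accuracy is per word."""
    if not 0.0 <= accuracy <= 1.0:
        raise ValueError("accuracy must be between 0 and 1")
    return round(word_count * (1.0 - accuracy))


# A 500-word dictated document (an assumed length) at several accuracy rates.
for accuracy in (0.90, 0.95, 0.99):
    errors = expected_errors(500, accuracy)
    print(f"{accuracy:.0%} accurate: about {errors} errors in 500 words")
```

     At 90 percent, that is roughly fifty corrections in a single page or two of dictation, which is why a spoken command to a vehicle would need to do far better.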
     But don't despair. There are researchers at this moment trying to remove these obstacles. For example, researchers at MIT's Laboratory for Computer Science are developing systems that allow computers to be controlled with voice commands without all that training. They call them "speech understanding systems." Their Jupiter project is currently in use, and a person who calls 888-573-8255 can get information about the weather by simply speaking into the telephone. The system uses speech recognition for voice commands and speech synthesis to prompt callers and give them the results of their inquiries. Note, however, that this is not continuous speech recognition. Other projects the MIT teams are currently working on include a mapping application and an airline information system. MIT researchers are primarily focusing on telephone response systems because they believe these services will reach more users, as there are more telephones in homes than computers.
     Also, for the past twenty years, a research unit at the University of Edinburgh has been researching voice technologies. The Center for Speech Technology Research (CSTR) is researching ways to make computers understand continuous speech without any training requirements. CSTR is also developing humanlike speech synthesis and researching voice verification for security systems. In the field of speech recognition, CSTR has been working with the likes of Plessey and Marconi on a project called Alvey. Alvey, an integrated speech demonstrator, uses knowledge-based and statistical techniques to convert speech to text.
     CSTR has been a pioneer in the field of speech synthesis as well. They have been developing a way to generate a natural-sounding voice through a method called "diphone synthesis." Stored speech samples are combined to match the desired output, which creates smooth speech while allowing control over the pitch and duration of the sounds. CSTR has released its Festival Speech Synthesis System, which is downloadable from the Internet for free. And lastly, CSTR has been researching voice verification by comparing a speaker's voice with stored samples. According to CSTR, their system is able to distinguish between a recorded voice and a live voice.
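     The diphone idea can be illustrated with a short sketch. A diphone is a stored recording that runs from the middle of one sound to the middle of the next, so a synthesizer first has to work out which of these units an utterance needs. The code below shows only that lookup step under my own assumptions; the function name to_diphones, the "_" silence marker and the phoneme spelling are hypothetical, and this is not how Festival itself is implemented.

```python
from typing import List, Sequence

SILENCE = "_"  # hypothetical marker for the pause before and after an utterance


def to_diphones(phonemes: Sequence[str]) -> List[str]:
    """Return the names of the diphone units needed to speak the phonemes."""
    # Pad with silence so the first and last sounds also get a transition unit.
    padded = [SILENCE, *phonemes, SILENCE]
    # Each diphone covers the transition between two neighboring sounds.
    return [f"{left}-{right}" for left, right in zip(padded, padded[1:])]


# Informal, made-up phoneme spelling of "hello" (not a standard phone set).
print(to_diphones(["h", "e", "l", "ou"]))
# prints ['_-h', 'h-e', 'e-l', 'l-ou', 'ou-_']
```

     A real system would then fetch the recording for each unit, join the recordings end to end, and adjust their pitch and duration, which is the part that makes the result sound smooth.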
     The immediate future for voice technologies is promising. Although each application discussed in this paper has some limitations, I believe these obstacles will be overcome two or three versions down the road. In the interim, the current offerings do allow users to be more productive in their offices and at their homes. And in the case of physically impaired individuals, these new products have opened avenues to information and other opportunities once thought forever closed. Voice applications are only going to improve as the underlying technology gets better. Tom and Mary may have to wait for us to catch up, but it's only a matter of time before human and machine interact as they do. Maybe not during my lifetime, but someday it will happen.

 Works Cited:
          CNN. "Voice Recognition Removes Computer Obstacles for Chinese." CNN. July 1, 1999.
Online: Available: http://cnn.com/TECH/computing/9907/01/t_t/voice.recognition/index.html. Aug 30, 1999.

          CNN. "IBM Offer Speech Extension to XML." CNN. Feb 19, 1999.
Online: Available: http://cnn.com/TECH/computing/9902/19/speechm1.idg/index.html. Aug 30, 1999.

          CNN. "IBM's Talking Browser Brings Net to Visually Impaired." CNN. Feb 4, 1999.
Online: Available: http://cnn.com/TECH/computing/9902/04/ibmtalks.idg/index.html. Aug 31, 1999.

          CSTR. "Introduction to CSTR." University of Edinburgh. 1999.
Online: Available: http://www.cstr.ac.uk/intro.html. Aug 28, 1999.

          IBM. "ViaVoice98 - The Next Generation of ViaVoice is Here." IBM. 1999.
Online: Available: http://www.ibm.com. Aug 30, 1999.

          Kurzweil. "Voice Technology Solutions." 1999.
Online: Available: http://www.stafforditec.demon.co.uk. Aug 28, 1999.

          PC Magazine Online. "Computer Will Be More Human - Voice Recognition." ZDNet. June 22, 1999.
Online: Available: http://www.zdnet.com/pcmag/features/future/human03.html. Sept 1, 1999.