Common sense and speech recognition

Posted on December 2, 2005

One of the big problems with speech recognition systems is that they can be quite stupid. Most existing speech recognition systems either use tightly constrained word networks or are just a form of keyword spotting. If our goal is to design a system that can freely transcribe speech, we still have a long way to go.

One of the main problems with attempting to freely transcribe speech is that a lot of words sound like a lot of other words (some even sound exactly the same, such as brake vs. break), so even if you have a clear idea of the sounds (or phonemes), it can be difficult to determine the word. What we need is a common sense database. But first, on to existing real-world speech recognition.

Existing Real-world Approaches

A word-network-based system is designed to recognise only a fixed, and generally quite small, number of phrase types. As an example, Voice Commander, a Pocket PC application, can recognise only these commands:

  • Call contact at home/work/mobile
  • Show contact
  • Digit Dial (which then only recognises digits)
  • Start application
  • What can I say (which gives this list)
  • Goodbye

Limiting the choices the recognition network can make greatly reduces the effort involved in deciding what each word was.
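To make the idea concrete, here is a minimal sketch of how a closed command list simplifies the decision. It is not how Voice Commander works internally (a real word network constrains the acoustic search itself); it only assumes we already have a noisy text guess and snaps it to the nearest allowed phrase. The command list and the `match_command` name are made up for illustration.

```python
import difflib

# Hypothetical command list, loosely modelled on the commands listed above.
COMMANDS = [
    "call contact at home",
    "call contact at work",
    "call contact at mobile",
    "show contact",
    "digit dial",
    "start application",
    "what can i say",
    "goodbye",
]

def match_command(hypothesis, cutoff=0.6):
    """Return the allowed command closest to a noisy text guess, or None."""
    matches = difflib.get_close_matches(
        hypothesis.lower(), COMMANDS, n=1, cutoff=cutoff
    )
    return matches[0] if matches else None

# A slightly misheard phrase is pulled back onto a legal command.
print(match_command("show contract"))
```

Because there are only a handful of legal outputs, a near miss is easy to correct, and anything that is not close to a known command can simply be rejected.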

The other approach to speech recognition, generally employed in automated telephone systems, is keyword spotting. A typical keyword spotting session will start with the automated system prompting you with something like “Please state your problem” (we’ll assume we’re in computer technical support here), and the keyword spotting system will attempt to find any predefined keywords in your reply. As an example, if you then said “I am having problems with my Windows machine resetting”, it might spot “Windows” and “resetting” and decide to transfer you to the software support area.
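The routing step of such a system can be sketched in a few lines. This is only an illustration working on text; a real keyword spotter searches the audio directly. The routing table, the department names, and the `spot_keywords` and `route` functions are assumptions invented for this example.

```python
import re
from collections import Counter

# Hypothetical keyword-to-department table for a tech support line.
ROUTES = {
    "windows": "software support",
    "resetting": "software support",
    "monitor": "hardware support",
    "keyboard": "hardware support",
}

def spot_keywords(utterance):
    """Return the predefined keywords found in the utterance, in order."""
    words = re.findall(r"[a-z']+", utterance.lower())
    return [w for w in words if w in ROUTES]

def route(utterance, default="operator"):
    """Pick the department with the most keyword votes, else a fallback."""
    votes = Counter(ROUTES[w] for w in spot_keywords(utterance))
    return votes.most_common(1)[0][0] if votes else default

reply = "I am having problems with my Windows machine resetting"
print(spot_keywords(reply), "->", route(reply))
```

Everything outside the keyword table is ignored, which is why this approach is robust but also why it cannot transcribe anything.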

Common Sense

As mentioned earlier, one of the big problems with automatic speech transcription is deciding whether the user said “put the eggs in the basket” or “put the eggs in the casket”. An even harder example might be “put the key in the lock” vs “put the key in the loch”. To decide what was said the automatic system needs to have some idea of what makes sense.

To this end, and as reported in this story in Technology Research News, researchers at MIT have developed the Open Mind Common Sense database. This database is basically just a big list of ‘facts’ in sentence form that have been submitted by users on the web (hence the quotes around ‘facts’). Anybody who registers can submit their own facts by answering various questions, and can also query the database. Additionally, the data is available for download as a zip file for research purposes (although it appears to be a bit out of date).

Anyway, let’s say we needed to decide between “put the eggs in the basket” and “put the eggs in the casket”. A query for “egg basket” gives 9 examples (like “one type of basket holds easter eggs”), but a query for “egg casket” doesn’t return anything. So the automated system chooses “basket” as the word.
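The same decision can be sketched with a toy stand-in for the database. This does not use the real Open Mind query interface; it assumes a small in-memory list of sentences and counts how many mention both words. Only the first sentence is quoted from the database, the rest are made up for illustration, as are the `count_hits` and `pick_word` names.

```python
# Toy fact list. Only the first entry comes from Open Mind (quoted above);
# the others are invented placeholders.
FACTS = [
    "one type of basket holds easter eggs",
    "you can carry eggs in a basket",
    "a casket is used at a funeral",
    "a key opens a lock",
]

def count_hits(query, facts=FACTS):
    """Count facts in which every query term is a prefix of some word."""
    terms = query.lower().split()
    hits = 0
    for fact in facts:
        words = fact.lower().split()
        if all(any(w.startswith(t) for w in words) for t in terms):
            hits += 1
    return hits

def pick_word(context, candidates, facts=FACTS):
    """Choose the candidate that co-occurs with the context word most often.

    On a tie, the first candidate in the list wins.
    """
    return max(candidates, key=lambda c: count_hits(f"{context} {c}", facts))

print(pick_word("egg", ["basket", "casket"]))
print(pick_word("key", ["lock", "loch"]))
```

Prefix matching is a crude way to let “egg” match “eggs”. A real system would also have to weigh these counts against the acoustic evidence rather than trusting them outright.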

But even if you don’t care about speech recognition, it is just fun to query the database and come up with lots of useful facts like “You can type on a computer keyboard”, “You can use a computer in a ranch house” and “Computers are used to use up desk space”. Have fun!

» Filed Under Blogger Posts, research, speech
