An introduction to audio-visual speech recognition

Posted on April 30, 2007

This is from an introduction to my latest paper, and I thought it might be useful to put up here. Feel free to leave any comments on this below.

Audio-visual Speech Recognition

Automatic speech recognition is a very mature area of research, and one that is increasingly becoming part of our day-to-day lives. While many systems that recognise speech from an audio signal have shown promise on well-defined tasks like dictation or call-centre navigation in reasonably controlled environments, automatic speech recognition has certainly not yet reached the stage where a user can seamlessly interact with an automatic speech interface [1]. One of the major stumbling blocks to speech becoming an alternative human-computer interface is the lack of robustness of present systems to channel or environmental noise, which can degrade performance by many orders of magnitude [2].

However, speech does not consist of the audio modality alone, and studies of human production and perception of speech have shown that the visual movement of the speaker’s face and lips is an important factor in human communication. Hiding or modifying one of the modalities independently of the other has been shown to cause errors in human speech perception [3, 4].

Fortunately, many of the sources of audio degradation can be considered to have little effect on the visual signal; a group of people talking out of view of the camera is one example. A similar assumption can be made in the other direction about many sources of video degradation, such as face movement or minor occlusions, which leave the audio signal largely untouched. By taking advantage of visual speech in combination with traditional audio speech, automatic speech recognition systems can increase their robustness to degradation in both modalities.

The method of combining these two complementary sources of information is still a major area of ongoing research in audio-visual speech recognition (AVSR). Early AVSR systems could generally be divided into two main groups, early or late integration, based on whether the two modalities were combined before or after classification/scoring. Late integration had the advantage that the reliability of each modality’s classifier could be weighted easily before combination, but was difficult to use on anything but isolated word recognition due to the problem of aligning and fusing two possibly significantly different speech transcriptions. This was not a problem with early integration, where features were combined before being passed to a single classifier, but in that case it was very difficult to model the reliability of each modality.
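To make the distinction concrete, here is a minimal sketch of the two schemes for an isolated-word task. It is only an illustration and not the code from the paper: the function names, the dictionary of per-word log-likelihoods, and the convention that the two stream weights sum to one are all assumptions of mine, and the visual features are assumed to have already been interpolated to the audio frame rate.

```python
def early_integration(audio_frames, visual_frames):
    """Early integration: join the per-frame feature vectors so that a
    single classifier sees one combined observation per frame.

    Assumes both streams have the same number of frames (hypothetical
    pre-processing step, e.g. interpolating the video features).
    """
    if len(audio_frames) != len(visual_frames):
        raise ValueError("both streams need the same number of frames")
    return [list(a) + list(v) for a, v in zip(audio_frames, visual_frames)]


def late_integration(audio_scores, visual_scores, audio_weight):
    """Late integration: each modality has its own classifier, and the
    per-word log-likelihoods are combined after scoring.

    audio_scores / visual_scores map each candidate word to a
    log-likelihood. audio_weight in [0, 1] expresses how much the audio
    classifier is trusted; the visual weight is taken as 1 - audio_weight
    (an assumed convention).
    """
    if not 0.0 <= audio_weight <= 1.0:
        raise ValueError("audio_weight must lie in [0, 1]")
    visual_weight = 1.0 - audio_weight
    combined = {
        word: audio_weight * audio_scores[word]
        + visual_weight * visual_scores[word]
        for word in audio_scores
    }
    best_word = max(combined, key=combined.get)
    return best_word, combined
```

Note how the late scheme only works cleanly because both classifiers score the same fixed list of candidate words. With continuous speech, each classifier would produce its own transcription and there would be no shared set of hypotheses to weight, which is the alignment problem described above.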

To allow a compromise between these two extremes, middle integration schemes were developed that allow classifier scores to be combined in a weighted manner within the structure of the classifier itself. The simplest of the middle integration methods, and the subject of this paper, is the synchronous multi-stream HMM (MSHMM) [1]. There are more complicated middle integration designs, primarily intended to allow modelling of the asynchronous nature of audio-visual speech, such as asynchronous [5], product [1] or coupled HMMs [6]. These designs can be significantly more complicated to train and test, however, and the small performance increase may not be worth the cost, especially in embedded environments where processing power or memory might be limited.
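As a rough illustration of what combining scores "within the structure of the classifier" means for a synchronous MSHMM, the sketch below weights the per-stream emission log-likelihoods inside each HMM state and then runs ordinary Viterbi decoding over a single state sequence shared by both streams. This is a simplified sketch under my own assumptions, not the system from the paper: each stream is modelled with a single diagonal-covariance Gaussian per state (real systems typically use mixtures), the stream weights are assumed to sum to one, and all function and parameter names are hypothetical.

```python
import math


def diag_gaussian_logpdf(x, mean, var):
    """Log-density of a diagonal-covariance Gaussian."""
    return sum(
        -0.5 * (math.log(2.0 * math.pi * v) + (xi - m) ** 2 / v)
        for xi, m, v in zip(x, mean, var)
    )


def mshmm_emission_logprob(audio_obs, visual_obs, state, audio_weight):
    """Stream-weighted emission score for one state:
    audio_weight * log b_audio + (1 - audio_weight) * log b_visual.

    `state` is a dict {"audio": (mean, var), "visual": (mean, var)}.
    """
    log_a = diag_gaussian_logpdf(audio_obs, *state["audio"])
    log_v = diag_gaussian_logpdf(visual_obs, *state["visual"])
    return audio_weight * log_a + (1.0 - audio_weight) * log_v


def viterbi(audio_seq, visual_seq, states, log_trans, log_init, audio_weight):
    """Viterbi decoding where both streams share one state sequence,
    which is what makes the model synchronous."""
    n = len(states)
    score = [
        log_init[j]
        + mshmm_emission_logprob(audio_seq[0], visual_seq[0], states[j], audio_weight)
        for j in range(n)
    ]
    backpointers = []
    for a, v in zip(audio_seq[1:], visual_seq[1:]):
        prev, score, pointers = score, [], []
        for j in range(n):
            # Best predecessor state for state j at this frame.
            i_best = max(range(n), key=lambda i: prev[i] + log_trans[i][j])
            pointers.append(i_best)
            score.append(
                prev[i_best]
                + log_trans[i_best][j]
                + mshmm_emission_logprob(a, v, states[j], audio_weight)
            )
        backpointers.append(pointers)
    # Trace the best path back from the best final state.
    last = max(range(n), key=lambda j: score[j])
    path = [last]
    for pointers in reversed(backpointers):
        path.append(pointers[path[-1]])
    path.reverse()
    return path, score[last]
```

Setting audio_weight to 1 or 0 reduces this to an audio-only or video-only HMM, so the weight acts as a dial for how much each modality is trusted. The observations themselves are untouched; only the weighting of their scores changes.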

References

[1] G. Potamianos, C. Neti, G. Gravier, A. Garg, and A. Senior, “Recent advances in the automatic recognition of audiovisual speech,” Proceedings of the IEEE, vol. 91, no. 9, pp. 1306–1326, 2003.

[2] Y. Gong, “Speech recognition in noisy environments: A survey,” Speech Communication, vol. 16, no. 3, pp. 261–291, 1995.

[3] H. McGurk and J. MacDonald, “Hearing lips and seeing voices,” Nature, vol. 264, no. 5588, pp. 746–748, Dec. 1976.

[4] S. M. Thomas and T. R. Jordan, “Contributions of oral and extraoral facial movement to visual and audiovisual speech perception,” Journal of Experimental Psychology: Human Perception and Performance, vol. 30, no. 5, pp. 873–888, 2004.

[5] S. Bengio, “Multimodal speech processing using asynchronous hidden Markov models,” Information Fusion, vol. 5, no. 2, pp. 81–9, June 2004.

[6] A. Nefian, L. Liang, X. Pi, L. Xiaoxiang, C. Mao, and K. Murphy, “A coupled HMM for audio-visual speech recognition,” in Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP ’02), vol. 2, 2002, pp. 2013–2016.

» Filed Under audio-visual, research, speech
