<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>the blog of david dean &#187; audio-visual</title>
	<atom:link href="http://www.davidbdean.com/category/audio-visual/feed/" rel="self" type="application/rss+xml" />
	<link>http://www.davidbdean.com</link>
	<description>currently not blogging much at all</description>
	<lastBuildDate>Sat, 21 Jun 2008 15:30:40 +0000</lastBuildDate>
	<generator>http://wordpress.org/?v=2.8.4</generator>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
			<item>
		<title>An introduction to audio-visual speech recognition</title>
		<link>http://www.davidbdean.com/2007/04/30/an-introduction-to-audio-visual-speech-recognition/</link>
		<comments>http://www.davidbdean.com/2007/04/30/an-introduction-to-audio-visual-speech-recognition/#comments</comments>
		<pubDate>Mon, 30 Apr 2007 04:30:18 +0000</pubDate>
		<dc:creator>David Dean</dc:creator>
				<category><![CDATA[audio-visual]]></category>
		<category><![CDATA[research]]></category>
		<category><![CDATA[speech]]></category>

		<guid isPermaLink="false">http://www.davidbdean.com/2007/04/30/an-introduction-to-audio-visual-speech-recognition/</guid>
		<description><![CDATA[This is from an introduction to my latest paper, and I thought it might be useful to put up here. Feel free to leave any comments on this below.
Audio-visual Speech Recognition
Automatic speech recognition is a very mature area of research, and one that is increasingly becoming involved in our day-to-day lives. While many systems that [...]]]></description>
			<content:encoded><![CDATA[<p>This is from an introduction to my latest paper, and I thought it might be useful to put up here. Feel free to leave any comments on this below.</p>
<h3>Audio-visual Speech Recognition</h3>
<p>Automatic speech recognition is a very mature area of research, and one that is increasingly becoming involved in our day-to-day lives. While many systems that can recognise speech from an audio signal have shown promise when performing well-defined tasks like dictation or call-centre navigation in reasonably controlled environments, automatic speech recognition has certainly not yet reached the stage where a user can seamlessly interact with an automatic speech interface [<a href="#1">1</a>]. One of the major stumbling blocks to speech becoming an alternative human-computer interface is the lack of robustness of present systems to channel or environmental noise, which can degrade performance by many orders of magnitude [<a href="#2">2</a>].</p>
<p>However, speech does not consist of the audio modality alone, and studies of human production and perception of speech have shown that the visual movement of the speaker&#8217;s face and lips is an important factor in human communication. Hiding or modifying one of the modalities independently of the other has been shown to cause errors in human speech perception [<a href="#3">3</a>, <a href="#4">4</a>]. </p>
<p>Fortunately, many of the sources of audio degradation can be considered to have little effect on the visual signal, for example, a group of people talking out of view of the camera. A similar assumption can also be made about many sources of video degradation, such as face movement or minor occlusions, which have little effect on the audio signal. By taking advantage of visual speech in combination with traditional audio speech, automatic speech recognition systems can increase their robustness to degradation in both modalities.</p>
<p>The chosen method of combining these two orthogonal sources of information is still a major area of ongoing research in audio-visual speech recognition (AVSR). Early AVSR systems could generally be divided into two main groups, early or late integration, based on whether the two modalities were combined before or after classification/scoring. Late integration had the advantage that the reliability of each modality&#8217;s classifier could be weighted easily before combination, but was difficult to use on anything but isolated word recognition due to the problem of aligning and fusing two possibly significantly different speech transcriptions. This was not a problem with early integration, where features are combined before using a single classifier, but, on the other hand, it would be very difficult to model the reliability of each modality. </p>
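<p>As a rough illustration of the difference, here is a minimal sketch of the two approaches for isolated word recognition. The function names, the per-word log-likelihood scores and the single <code>audio_weight</code> parameter are all assumptions made for this example, and are not taken from any particular AVSR system.</p>
<pre>
```python
# Illustrative sketch only: all names and scores below are hypothetical.

def early_integration(audio_features, video_features):
    """Early integration: join the feature vectors before any classification."""
    return list(audio_features) + list(video_features)


def late_integration(audio_scores, video_scores, audio_weight):
    """Late integration: fuse per-word classifier scores after classification.

    Both score dictionaries map a candidate word to a log-likelihood.
    audio_weight (assumed to lie between 0 and 1) expresses how much the
    audio classifier is trusted; the video classifier gets the remainder.
    """
    video_weight = 1.0 - audio_weight
    fused = {
        word: audio_weight * audio_scores[word] + video_weight * video_scores[word]
        for word in audio_scores
    }
    # Return the word with the highest fused score.
    return max(fused, key=fused.get)


# Made-up scores: the audio classifier slightly prefers "no",
# while the video classifier clearly prefers "yes".
audio_scores = {"yes": -12.0, "no": -10.0}
video_scores = {"yes": -4.0, "no": -9.0}

print(late_integration(audio_scores, video_scores, audio_weight=0.9))
print(late_integration(audio_scores, video_scores, audio_weight=0.3))
```
</pre>
<p>With these made-up scores, trusting the audio classifier heavily selects &#8216;no&#8217;, while shifting the weight towards video selects &#8216;yes&#8217;. The early integration function has no equivalent control, because the single classifier only ever sees the joined feature vector.</p>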
<p>To allow a compromise between these two extremes, middle integration schemes were developed that allow classifier scores to be combined in a weighted manner within the structure of the classifier itself. The simplest of the middle integration methods, and the subject of this paper, is the synchronous multi-stream HMM [<a href="#1">1</a>] (MSHMM). There are more complicated middle integration designs, primarily intended to allow modelling of the asynchronous nature of audio-visual speech, such as asynchronous [<a href="#5">5</a>], product [<a href="#1">1</a>] or coupled HMMs [<a href="#6">6</a>]. These designs can be significantly more complicated to train and test, however, and the small performance increase may not be worth it, especially in embedded environments where processing power or memory might be limited.</p>
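<p>The weighted combination inside a synchronous MSHMM can be sketched for a single state and a single frame as follows. This is a simplified illustration and not the implementation from the paper: it assumes one-dimensional observations, a single Gaussian per stream, and stream weights that sum to one (a common convention). The state layout and all names are hypothetical.</p>
<pre>
```python
import math


def gaussian_log_pdf(x, mean, var):
    """Log density of a one-dimensional Gaussian."""
    return -0.5 * (math.log(2.0 * math.pi * var) + (x - mean) ** 2 / var)


def mshmm_state_log_likelihood(audio_obs, video_obs, state, audio_weight):
    """Emission score of one state for one synchronous audio-visual frame.

    Each stream is scored by its own Gaussian, and the two log-likelihoods
    are combined as a weighted sum. This is the log-domain form of raising
    each stream likelihood to its weight and multiplying the results.
    `state` is a hypothetical dict holding a (mean, variance) pair per stream.
    """
    video_weight = 1.0 - audio_weight  # assumption: weights sum to one
    audio_ll = gaussian_log_pdf(audio_obs, *state["audio"])
    video_ll = gaussian_log_pdf(video_obs, *state["video"])
    return audio_weight * audio_ll + video_weight * video_ll


# Two made-up states, and a frame where the audio points towards state_b
# while the video points towards state_a.
state_a = {"audio": (0.0, 1.0), "video": (0.0, 1.0)}
state_b = {"audio": (2.0, 1.0), "video": (2.0, 1.0)}
audio_obs, video_obs = 1.8, 0.1

for audio_weight in (0.8, 0.2):
    score_a = mshmm_state_log_likelihood(audio_obs, video_obs, state_a, audio_weight)
    score_b = mshmm_state_log_likelihood(audio_obs, video_obs, state_b, audio_weight)
    print(audio_weight, round(score_a, 3), round(score_b, 3))
```
</pre>
<p>Because the weighting happens at the state level, the usual HMM decoding over states and time is unchanged. Both streams share one state sequence, which is what makes this design synchronous.</p>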
<h3>References</h3>
<p><a name="1">[1]</a> G. Potamianos, C. Neti, G. Gravier, A. Garg, and A. Senior, “<a href="http://scholar.google.com/scholar?q=Recent+advances+in+the+automatic+recognition+of+audiovisual+speech">Recent advances in the automatic recognition of audiovisual speech</a>,” <em>Proceedings of the IEEE</em>, vol. 91, no. 9, pp. 1306–1326, 2003.</p>
<p><a name="2">[2]</a> Y. Gong, “<a href="http://scholar.google.com/scholar?q=Speech+recognition+in+noisy+environments%3A+A+survey">Speech recognition in noisy environments: A survey</a>,” <em>Speech Communication</em>, vol. 16, no. 3, pp. 261–291, 1995.</p>
<p><a name="3">[3]</a> H. McGurk and J. MacDonald, “<a href="http://scholar.google.com/scholar?hl=en&#038;lr=&#038;q=Hearing+lips+and+seeing+voices&#038;btnG=Search">Hearing lips and seeing voices</a>,” <em>Nature</em>, vol. 264, no. 5588, pp. 746–748, Dec. 1976.</p>
<p><a name="4">[4]</a> S. M. Thomas and T. R. Jordan, “<a href="http://scholar.google.com/scholar?hl=en&#038;lr=&#038;q=Contributions+of+oral+and+extraoral+facial+movement+to+visual+and+audiovisual+speech+perception&#038;btnG=Search">Contributions of oral and extraoral facial movement to visual and audiovisual speech perception</a>,” <em>Journal of Experimental Psychology: Human Perception and Performance</em>, vol. 30, no. 5, pp. 873–888, 2004.</p>
<p><a name="5">[5]</a> S. Bengio, “<a href="http://scholar.google.com/scholar?q=Multimodal+speech+processing+using+asynchronous+hidden+markov+models">Multimodal speech processing using asynchronous hidden Markov models</a>,” <em>Information Fusion</em>, vol. 5, no. 2, pp. 81–89, June 2004.</p>
<p><a name="6">[6]</a> A. Nefian, L. Liang, X. Pi, L. Xiaoxiang, C. Mao, and K. Murphy, “<a href="http://scholar.google.com/scholar?q=A+coupled+hmm+for+audio-visual+speech+recognition">A coupled HMM for audio-visual speech recognition</a>,” in <em>Acoustics, Speech, and Signal Processing, 2002. Proceedings. (ICASSP ’02). IEEE International Conference on</em>, vol. 2, 2002, pp. 2013–2016.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.davidbdean.com/2007/04/30/an-introduction-to-audio-visual-speech-recognition/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Audio-visual speech and the McGurk effect</title>
		<link>http://www.davidbdean.com/2007/04/23/audio-visual-speech-and-the-mcgurk-effect/</link>
		<comments>http://www.davidbdean.com/2007/04/23/audio-visual-speech-and-the-mcgurk-effect/#comments</comments>
		<pubDate>Mon, 23 Apr 2007 03:49:55 +0000</pubDate>
		<dc:creator>David Dean</dc:creator>
				<category><![CDATA[audio-visual]]></category>
		<category><![CDATA[research]]></category>
		<category><![CDATA[speech]]></category>

		<guid isPermaLink="false">http://www.davidbdean.com/2007/04/23/audio-visual-speech-and-the-mcgurk-effect/</guid>
		<description><![CDATA[It may not be immediately obvious to most, but speech is fundamentally a multimodal interaction. (Multimodal is the fancy-pants way of saying that the interaction occurs through more than one mode or channel of communication &#8211; audio, visual, gestural, etc.).
While we can communicate very well with audio alone, such as during a telephone call, our [...]]]></description>
			<content:encoded><![CDATA[<p>It may not be immediately obvious to most, but speech is fundamentally a <a href="http://en.wikipedia.org/wiki/Multimodal_interaction">multimodal interaction</a>. (Multimodal is the <a href="http://en.wikipedia.org/wiki/Jargon">fancy-pants way</a> of saying that the interaction occurs through more than one mode or channel of communication &#8211; audio, visual, gestural, etc.).</p>
<p>While we can communicate very well with audio alone, such as during a telephone call, our brains make use of many visual cues when we talk face-to-face. As well as broader visual cues such as gestures and facial expressions, it may come as a surprise to learn that the actual motion of the lips plays a very important part in the comprehension of human speech.</p>
<p>A useful demonstration of the impact of the visual modality on speech is the McGurk effect, first published by <a href="http://scholar.google.com/scholar?hl=en&#038;lr=&#038;q=mcgurk+Hearing+lips+and+seeing+voices&#038;btnG=Search">McGurk and MacDonald in 1976</a>. Rather than explain it in too much detail right now, go watch the video below from an <a href="http://www.hackszine.com/blog/archive/2007/02/hear_with_your_eyes_the_mcgurk.html">episode of the Hackszine video podcast</a>.</p>
<p><object width="425" height="350"><param name="movie" value="http://www.youtube.com/v/T4fUi0eG1X4"></param><param name="wmode" value="transparent"></param><embed src="http://www.youtube.com/v/T4fUi0eG1X4" type="application/x-shockwave-flash" wmode="transparent" width="425" height="350"></embed></object></p>
<p>The basic and original McGurk effect was demonstrated by dubbing a video of a person saying &#8216;gah&#8217; with audio of them saying &#8216;bah&#8217;. If you watch the dubbed video, they appear to be saying &#8216;dah&#8217;, but the audio alone clearly says &#8216;bah&#8217;. This shows that even though you may not realise it, the visual lip movements are having an effect on your perception of speech. The Hackszine video extends the McGurk effect to cover bad dubbing in general, but I would only consider the McGurk effect to cover cases where said bad dubbing appears to make the person say something that is in neither the video nor the dubbed audio.</p>
<p>Finally, this (I think) Japanese talk show appears to be <em>very</em> interested in the McGurk effect. It makes for fairly amusing watching.</p>
<p><object width="425" height="350"><param name="movie" value="http://www.youtube.com/v/eD4x_6HBi7E"></param><param name="wmode" value="transparent"></param><embed src="http://www.youtube.com/v/eD4x_6HBi7E" type="application/x-shockwave-flash" wmode="transparent" width="425" height="350"></embed></object></p>
<p>More information:</p>
<ul>
<li><a href="http://en.wikipedia.org/wiki/Mcgurk_effect">McGurk Effect</a> at Wikipedia</li>
<li><a href="http://homepage.ntu.edu.tw/~karchung/Phonetics%20II%20page%20seventeen.htm">Hearing with your eyes: The McGurk Effect</a></li>
<li>McGurk, Harry; and MacDonald, John (1976); &#8220;<a href="http://scholar.google.com/scholar?hl=en&#038;lr=&#038;q=mcgurk+Hearing+lips+and+seeing+voices&#038;btnG=Search">Hearing lips and seeing voices</a>,&#8221; Nature, Vol 264(5588), pp. 746–748</li>
</ul>
]]></content:encoded>
			<wfw:commentRss>http://www.davidbdean.com/2007/04/23/audio-visual-speech-and-the-mcgurk-effect/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>VidTIMIT dataset freely available online</title>
		<link>http://www.davidbdean.com/2006/09/20/vidtimit-dataset-freely-available-online/</link>
		<comments>http://www.davidbdean.com/2006/09/20/vidtimit-dataset-freely-available-online/#comments</comments>
		<pubDate>Wed, 20 Sep 2006 12:20:42 +0000</pubDate>
		<dc:creator>David Dean</dc:creator>
				<category><![CDATA[audio-visual]]></category>
		<category><![CDATA[research]]></category>
		<category><![CDATA[speech]]></category>

		<guid isPermaLink="false">http://www.cebidae.com/2006/09/20/vidtimit-dataset-freely-available-online/</guid>
		<description><![CDATA[Conrad Sanderson has released the VidTIMIT audio-visual speech dataset so that it is freely available online.
The dataset is comprised of video and corresponding audio recordings of 43 people, reciting short sentences. It can be useful for research on topics such as automatic lip reading, multi-view face recognition, multi-modal speech recognition and person identification.
Link.
]]></description>
			<content:encoded><![CDATA[<p><a href="http://users.rsise.anu.edu.au/~conrad/cs/">Conrad Sanderson</a> has released the VidTIMIT audio-visual speech dataset so that it is freely available online.</p>
<blockquote><p>The dataset is comprised of video and corresponding audio recordings of 43 people, reciting short sentences. It can be useful for research on topics such as automatic lip reading, multi-view face recognition, multi-modal speech recognition and person identification.</p></blockquote>
<p><a href="http://users.rsise.anu.edu.au/~conrad/vidtimit/index.html">Link</a>.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.davidbdean.com/2006/09/20/vidtimit-dataset-freely-available-online/feed/</wfw:commentRss>
		<slash:comments>2</slash:comments>
		</item>
	</channel>
</rss>
