
Hue-based Automatic Lipreading

Duifhuis, P. (2003) Hue-based Automatic Lipreading. Master's Thesis / Essay, Artificial Intelligence.

AI_Ma_2003_PDuifhuis.CV.pdf - Published Version



While they might not even notice it, humans use their eyes when understanding speech. Especially when the quality of the sound deteriorates, the visual counterpart can contribute considerably to the intelligibility of speech. Artificial speech recognizers have great difficulty discerning speech from varying background noise. We can learn from humans that incorporating visual information in the recognition process can be a fruitful approach to this problem. The field of artificial audio-visual speech recognition is indeed a popular and growing one, with much territory still to explore. An overview of audio-visual speech recognition today is given, as well as an investigation into where visual speech processing can really contribute to speech recognition. Three different methods are discerned, namely:

• Detecting whether there is a speaker at all.
• Knowing when someone is speaking or silent.
• Distinguishing similar-sounding phonemes.

A system was created with the purpose of exploring the problems and possibilities of audio-visual speech recognition in 'real-life' situations, without the help of artificial circumstances to facilitate recognition. This system estimates a set of features that can be used for distinguishing similar phonemes and for estimating whether a speaker is silent or not. Although it has not been implemented, the system could very well be expanded to detect whether there is a speaker at all. It was found that detecting the whereabouts of a mouth in a video frame, given the precondition that the image contains a face at a certain distance, can be done in a simple and computationally cheap manner. This method is based primarily on the selection of pixels with a certain hue, and to a lesser degree on saturation and brightness. The extraction of features such as the region of interest, the height and width of the outer contour, and the height of the inner contour of the mouth renders varying results.
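The hue-primary, saturation/brightness-secondary pixel selection described above can be sketched as follows. This is an illustrative reconstruction, not the thesis's actual implementation: the function names and all threshold values (the reddish hue band and the saturation and brightness minima) are assumptions for demonstration.

```python
import numpy as np

def rgb_to_hsv(img):
    """Vectorized RGB -> HSV for an (H, W, 3) float array in [0, 1].
    Returns hue in [0, 1), saturation and value in [0, 1]."""
    r, g, b = img[..., 0], img[..., 1], img[..., 2]
    v = img.max(axis=-1)
    c = v - img.min(axis=-1)                  # chroma
    s = np.where(v > 0, c / np.maximum(v, 1e-12), 0.0)
    safe_c = np.where(c > 0, c, 1.0)          # avoid division by zero
    h = np.where(v == r, ((g - b) / safe_c) % 6,
        np.where(v == g, (b - r) / safe_c + 2,
                         (r - g) / safe_c + 4)) / 6.0
    return np.where(c > 0, h, 0.0), s, v

def lip_mask(img, hue_lo=0.95, hue_hi=0.05, sat_min=0.25, val_min=0.2):
    """Select pixels whose hue falls in a reddish band (wrapping through 0),
    with looser saturation/brightness constraints. Thresholds are
    illustrative, not the thesis's calibrated parameters."""
    h, s, v = rgb_to_hsv(img)
    in_hue = (h >= hue_lo) | (h <= hue_hi)    # hue band wraps around red
    return in_hue & (s >= sat_min) & (v >= val_min)
```

The hue band is deliberately primary: a reddish pixel passes even at modest saturation, while a bright but green background pixel is rejected outright.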
Some subjects give very good results, whereas others give poor results. The main problems lie in articulation and the differences between speakers. In continuous speech, visemes are heavily influenced by surrounding visemes, and it is therefore hard to discern them accurately. Furthermore, due to the differences between speakers it is hard to create a single system that works well for all subjects. Speakers articulate differently, and although lips have a similar hue, the distribution of colour across faces differs as well. With regard to the methods of improving auditory speech recognition, discriminating between phonemes will most probably be very difficult with this system. Although the system can reliably predict whether a mouth is open or closed, other viseme-related features such as 'rounded' or 'spread' are hard to categorize. Next to unclear articulation, this is because in continuous speech visemes are heavily influenced by surrounding visemes. It is estimated that detecting whether a speaker is silent or speaking can only be done in situations where the speaker closes his mouth for a longer period of time. To conclude, a crude method has been implemented that can be used for further research. Not only can the lip detection be refined; this system also invites the development of a module that classifies the estimated features. Aside from speech recognition, the method for detecting areas of a certain colour may prove successful in many more applications.
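Once lip pixels are selected, features such as the region of interest and the width and height of the outer contour can be estimated from the mask. The sketch below uses a simple bounding-box approximation and is an assumption for illustration; the thesis's own feature extraction is not specified in this abstract and may differ.

```python
import numpy as np

def mouth_features(mask):
    """Estimate the region of interest and the outer-contour width and
    height from a boolean lip mask, using its bounding box. Assumes the
    mask contains only the mouth region; returns None for an empty mask."""
    rows = np.any(mask, axis=1)   # rows containing any lip pixel
    cols = np.any(mask, axis=0)   # columns containing any lip pixel
    if not rows.any():
        return None
    top, bottom = np.where(rows)[0][[0, -1]]
    left, right = np.where(cols)[0][[0, -1]]
    return {
        "roi": (int(top), int(left), int(bottom), int(right)),
        "outer_width": int(right - left + 1),
        "outer_height": int(bottom - top + 1),
    }
```

A classifier module, as the conclusion suggests, could then map sequences of these per-frame features to viseme categories or to a speaking/silent decision.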

Item Type: Thesis (Master's Thesis / Essay)
Degree programme: Artificial Intelligence
Thesis type: Master's Thesis / Essay
Language: English
Date Deposited: 15 Feb 2018 07:30
Last Modified: 15 Feb 2018 07:30
