Interplay between audio and visual scene analysis
Tom Huang
(University of Illinois at Urbana-Champaign)

My background is in image processing and computer vision; recently, however, I have been involved in a number of projects that integrate audio/speech processing with image analysis. One major area is video analysis (including the sound track) for indexing, retrieval, and summarization. A critical component is the detection of events such as applause and cheering in sports video by fusing cues from the image sequence and the audio. Since such events are often mixed with other sound sources (e.g., the commentator's voice), it would be of great help if audio separation techniques could be used to decompose the audio. In the other direction, visual cues may help audio separation: for example, if two speakers appear in the video, facial motion tracking may help separate their speech. More generally, machine learning techniques have proven very useful in computer vision, and some of these may be useful in audio scene analysis as well.
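As a minimal illustration of the kind of audio-visual fusion described above (not the coupled-HMM method of the cited paper), the sketch below performs a simple weighted late fusion of per-frame audio and visual classifier scores and flags segments where the fused score exceeds a threshold. All score values, the fusion weight, and the 0.5 threshold are hypothetical.

```python
# Hypothetical sketch: late fusion of per-frame audio and visual event
# scores (e.g., "applause/cheering" detectors) followed by thresholding.

def fuse_scores(audio_scores, visual_scores, w_audio=0.6):
    """Weighted late fusion of two modality score streams (same length)."""
    assert len(audio_scores) == len(visual_scores)
    return [w_audio * a + (1.0 - w_audio) * v
            for a, v in zip(audio_scores, visual_scores)]

def detect_events(fused, threshold=0.5):
    """Return (start, end) frame-index ranges where fused score >= threshold."""
    events, start = [], None
    for i, score in enumerate(fused):
        if score >= threshold and start is None:
            start = i                      # event onset
        elif score < threshold and start is not None:
            events.append((start, i))      # event offset
            start = None
    if start is not None:                  # event runs to the last frame
        events.append((start, len(fused)))
    return events

# Toy example: a burst of applause around frames 2-4.
audio  = [0.1, 0.2, 0.9, 0.8, 0.7, 0.1]
visual = [0.2, 0.1, 0.6, 0.9, 0.5, 0.2]
print(detect_events(fuse_scores(audio, visual)))  # -> [(2, 5)]
```

A real system would replace the toy score streams with classifier outputs and could weight the modalities by their estimated reliability, e.g., down-weighting audio when the commentator's voice dominates.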

Relevant material:

Ziyou Xiong, Regunathan Radhakrishnan, Ajay Divakaran, and Thomas S. Huang, "Audio-Visual Sports Highlights Extraction Using Coupled Hidden Markov Models", submitted to Pattern Analysis and Applications, Special Issue on Video-Based Event Detection. http://www.ifp.uiuc.edu/~zxiong/ZYPapers/PAACoupledHMM.pdf
Nebojsa Jojic, Nemanja Petrovic, and Thomas Huang, "Scene Generative Models for Adaptive Video Fast Forward", International Conference on Image Processing (ICIP), Barcelona, Spain, 2003. http://www.ifp.uiuc.edu/~zxiong/GoodPapers/Nemanja_ICIP03.pdf
Nebojsa Jojic and Brendan Frey, "Learning Flexible Sprites in Video Layers", IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2001. http://www.ifp.uiuc.edu/~zxiong/GoodPapers/Nebojsa_lay-cvpr01.pdf