Part-based Lipreading for Audio-Visual Speech Recognition

Lipreading is an important component of audio-visual speech recognition. However, the lips are usually modeled as a whole, which ignores the fact that each part of the lips captures different characteristics of the mouth, so a single overall model cannot fit every part well. Moreover, features extracted from the whole lip vary considerably across speakers, so training databases typically need to contain as many speakers as possible. In this paper, a part-based lipreading (PBL) method is proposed to address both the mismatch between an overall lip model and the separate parts of the lips and the excessive dependence of models on the speakers in the training set. PBL models the lips part by part and predicts jointly: it applies a uniform partition strategy to the convolutional features and generates several part-level sub-results that are combined for the final prediction. Experiments are performed on a large publicly available dataset (LRW) and on a subset of it (p-LRW, 65 words) chosen to simulate the progressive instructions encountered in robot working scenarios. The word accuracy of PBL reaches 82.8% on LRW and 88.9% on p-LRW. Finally, an end-to-end audio-visual speech recognition system using PBL is built and achieves 98.3% word accuracy on LRW.
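The abstract describes a uniform partition of the convolutional feature map into parts, each producing a part-level sub-result that is fused into the joint prediction. Below is a minimal PyTorch sketch of that idea; the number of parts, the horizontal-stripe partition, and fusion by averaging logits are assumptions for illustration, not details taken from the paper.

```python
import torch
import torch.nn as nn

class PartBasedHead(nn.Module):
    """Sketch of a part-based classification head: the feature map is
    uniformly split into horizontal stripes, each stripe is pooled and
    classified separately, and the part-level logits are averaged to
    form the joint prediction. Part count and averaging are assumed."""

    def __init__(self, in_channels: int, num_classes: int, num_parts: int = 4):
        super().__init__()
        self.num_parts = num_parts
        # Pool each of the num_parts stripes to a single feature vector.
        self.pool = nn.AdaptiveAvgPool2d((num_parts, 1))
        # One independent classifier per lip part.
        self.classifiers = nn.ModuleList(
            nn.Linear(in_channels, num_classes) for _ in range(num_parts)
        )

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        # feats: (batch, channels, height, width) from any 2D backbone.
        pooled = self.pool(feats).squeeze(-1)        # (batch, channels, num_parts)
        part_logits = [
            clf(pooled[:, :, i]) for i, clf in enumerate(self.classifiers)
        ]                                            # num_parts x (batch, num_classes)
        return torch.stack(part_logits).mean(dim=0)  # joint prediction

# Usage: 500 LRW word classes from a hypothetical backbone's feature map.
head = PartBasedHead(in_channels=512, num_classes=500, num_parts=4)
feats = torch.randn(2, 512, 7, 7)                    # dummy feature map
logits = head(feats)                                 # (2, 500)
```

The per-part classifiers let each stripe specialize in its own region of the mouth, while the averaged logits keep the final decision joint across all parts, matching the "models partly, predicts jointly" framing in the abstract.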


Datasets

LRW
Results from the Paper


Task                             Dataset  Model  Metric          Value  Global Rank
Audio-Visual Speech Recognition  LRW      PBL    Top-1 Accuracy  98.3   #3

Methods


No methods listed for this paper.