Multi-Task Learning for Audio Visual Active Speaker Detection
This report describes the approach underlying our submission to the active speaker detection task (task B-2) of the ActivityNet Challenge 2019. We introduce a new audio-visual model which builds upon a 3D-ResNet18 visual model pretrained for lipreading and a VGG-M acoustic model pretrained for audio-to-video synchronization. The model is trained with two losses in a multi-task learning fashion: a contrastive loss to enforce matching between audio and video features for active speakers, and a regular cross-entropy loss to obtain speaker/non-speaker labels. This model obtains 84.0% mAP on the validation set of AVA-ActiveSpeaker. Experimental results showcase the ability of the pretrained embeddings to transfer across tasks and data formats, as well as the advantage of the proposed multi-task learning strategy.
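The two-loss objective described above can be sketched as follows. This is a minimal PyTorch illustration, not the authors' implementation: the function name, the margin-based form of the contrastive loss, and the `margin`/`alpha` hyperparameters are assumptions for the sake of the example.

```python
import torch
import torch.nn.functional as F

def multitask_loss(audio_emb, video_emb, logits, labels, margin=1.0, alpha=0.5):
    """Sketch of the two-loss multi-task objective (illustrative only).

    audio_emb, video_emb: (B, D) embeddings from the acoustic (VGG-M)
        and visual (3D-ResNet18) streams.
    logits: (B, 2) speaker/non-speaker scores from a classification head.
    labels: (B,) long tensor, 1 = active speaker, 0 = not speaking.
    margin, alpha: hypothetical hyperparameters, not taken from the paper.
    """
    # Contrastive loss: pull audio and video embeddings together for
    # active speakers, push them at least `margin` apart otherwise.
    dist = F.pairwise_distance(audio_emb, video_emb)
    pos = labels.float() * dist.pow(2)
    neg = (1 - labels.float()) * F.relu(margin - dist).pow(2)
    contrastive = 0.5 * (pos + neg).mean()

    # Regular cross-entropy loss on the speaker/non-speaker prediction.
    ce = F.cross_entropy(logits, labels)

    return alpha * contrastive + (1 - alpha) * ce
```

The contrastive term ties the two streams together (audio and video features of an active speaker should agree), while the cross-entropy term directly supervises the final label; weighting them is the multi-task aspect.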
Results from the Paper
Ranked #17 on Audio-Visual Active Speaker Detection on AVA-ActiveSpeaker (using extra training data)
Task | Dataset | Model | Metric Name | Metric Value | Global Rank | Uses Extra Training Data
---|---|---|---|---|---|---
Audio-Visual Active Speaker Detection | AVA-ActiveSpeaker | 3D-ResNet-GRU | validation mean average precision | 84.0% | #17 | Yes