Multi-Task Multi-Modal Self-Supervised Learning for Facial Expression Recognition
Human communication is multi-modal; e.g., face-to-face interaction involves auditory signals (speech) and visual signals (face movements and hand gestures). Hence, it is essential to exploit multiple modalities when designing machine learning-based facial expression recognition systems. In addition, given the ever-growing quantities of video data that capture human facial expressions, such systems should utilize raw unlabeled videos without requiring expensive annotations. Therefore, in this work, we employ a multi-task multi-modal self-supervised learning method for facial expression recognition from in-the-wild video data. Our model combines three self-supervised objective functions: first, a multi-modal contrastive loss that pulls diverse data modalities of the same video together in the representation space; second, a multi-modal clustering loss that preserves the semantic structure of input data in the representation space; and finally, a multi-modal data reconstruction loss. We conduct a comprehensive study of this multi-modal multi-task self-supervised learning method on three facial expression recognition benchmarks. To that end, we examine the performance of learning through different combinations of self-supervised tasks on the facial expression recognition downstream task. Our model ConCluGen outperforms several multi-modal self-supervised and fully supervised baselines on the CMU-MOSEI dataset. Our results generally show that multi-modal self-supervision tasks offer large performance gains for challenging tasks such as facial expression recognition, while also reducing the amount of manual annotation required. We release our pre-trained models as well as source code publicly.
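To make the three-objective design concrete, the following is a minimal NumPy sketch of how a contrastive, a clustering, and a reconstruction term could be combined into one multi-task loss. It is an illustration under stated assumptions, not the authors' implementation: the function names, the symmetric InfoNCE form of the contrastive term, the k-means-style nearest-centroid form of the clustering term, the mean-squared-error reconstruction term, the averaging used to fuse modalities, and the equal default weights are all hypothetical choices made here for clarity.

```python
import numpy as np


def l2_normalize(x, eps=1e-12):
    """Scale each row to unit length so dot products are cosine similarities."""
    return x / (np.linalg.norm(x, axis=1, keepdims=True) + eps)


def log_softmax(logits):
    """Numerically stable row-wise log-softmax."""
    shifted = logits - logits.max(axis=1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))


def contrastive_loss(z_a, z_b, temperature=0.1):
    """Symmetric InfoNCE (assumed form): row i of z_a and row i of z_b come
    from the same video and are treated as the positive pair."""
    z_a, z_b = l2_normalize(z_a), l2_normalize(z_b)
    logits = z_a @ z_b.T / temperature
    idx = np.arange(len(z_a))
    loss_ab = -log_softmax(logits)[idx, idx].mean()
    loss_ba = -log_softmax(logits.T)[idx, idx].mean()
    return 0.5 * (loss_ab + loss_ba)


def clustering_loss(z, centroids):
    """k-means-style stand-in (assumed form): mean squared distance from each
    embedding to its nearest centroid."""
    sq_dist = ((z[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=-1)
    return sq_dist.min(axis=1).mean()


def reconstruction_loss(x, x_hat):
    """Mean squared error between an input and its reconstruction."""
    return ((x - x_hat) ** 2).mean()


def multitask_loss(z_a, z_b, centroids, x, x_hat, weights=(1.0, 1.0, 1.0)):
    """Weighted sum of the three terms; the weights are hypothetical."""
    z_fused = 0.5 * (z_a + z_b)  # simple average as an assumed fusion step
    terms = (
        contrastive_loss(z_a, z_b),
        clustering_loss(z_fused, centroids),
        reconstruction_loss(x, x_hat),
    )
    total = sum(w * t for w, t in zip(weights, terms))
    return total, terms


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    z_video = rng.normal(size=(8, 16))   # hypothetical video embeddings
    z_audio = z_video + 0.1 * rng.normal(size=(8, 16))  # aligned audio embeddings
    centroids = rng.normal(size=(3, 16))
    x = rng.normal(size=(8, 32))         # hypothetical input features
    x_hat = x + 0.05 * rng.normal(size=(8, 32))
    total, (l_con, l_clu, l_rec) = multitask_loss(z_video, z_audio, centroids, x, x_hat)
    print(f"contrastive={l_con:.4f} clustering={l_clu:.4f} "
          f"reconstruction={l_rec:.4f} total={total:.4f}")
```

Setting one of the weights to zero drops the corresponding task, which mirrors the kind of task-combination study the abstract describes; the actual combinations and weightings used in the paper are not specified here.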
Results from the Paper
Ranked #1 on Emotion Recognition in Conversation on MELD (Weighted Accuracy metric)
Task | Dataset | Model | Metric Name | Metric Value | Global Rank |
---|---|---|---|---|---|
Emotion Classification | CMU-MOSEI | ConCluGen | Accuracy | 66.48 | # 4 |
Emotion Classification | CMU-MOSEI | ConCluGen | Weighted Accuracy | 66.48 | # 1 |
Facial Expression Recognition | CMU-MOSEI | ConCluGen | Weighted Accuracy | 66.48 | # 1 |
Emotion Recognition in Conversation | MELD | ConCluGen | Accuracy | 60.03 | # 15 |
Emotion Recognition in Conversation | MELD | ConCluGen | Weighted Accuracy | 60.03 | # 1 |
Facial Expression Recognition | MELD | ConCluGen | Weighted Accuracy | 60.03 | # 1 |