Former-DFER: Dynamic Facial Expression Recognition Transformer

This paper proposes a dynamic facial expression recognition transformer (Former-DFER) for the in-the-wild scenario. The proposed Former-DFER mainly consists of a convolutional spatial transformer (CS-Former) and a temporal transformer (T-Former). The CS-Former consists of five convolution blocks and N spatial encoders, and is designed to guide the network to learn occlusion- and pose-robust facial features from the spatial perspective. The T-Former consists of M temporal encoders and is designed to allow the network to learn contextual facial features from the temporal perspective. Heatmaps of the learned facial features demonstrate that the proposed Former-DFER is capable of handling issues such as occlusion, non-frontal pose, and head motion, and the visualization of the feature distribution shows that the proposed method learns more discriminative facial features. Moreover, our Former-DFER also achieves state-of-the-art results on the DFEW and AFEW benchmarks.
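
The two-stage pipeline described in the abstract (per-frame CS-Former features followed by a T-Former over the frame sequence) can be sketched roughly as follows. This is a minimal PyTorch illustration under assumed settings, not the authors' implementation: the conv-stem layout, feature dimension, number of heads, N, and M are all assumptions for illustration only.

```python
# Illustrative sketch of the Former-DFER pipeline (NOT the official code).
# Layer widths, N (spatial encoders), and M (temporal encoders) are assumptions.
import torch
import torch.nn as nn


class CSFormer(nn.Module):
    """Convolutional spatial transformer: conv blocks + N spatial encoders."""

    def __init__(self, dim=512, n_spatial=2, heads=8):
        super().__init__()
        # Five convolution blocks (placeholder stem, downsampling each time).
        chans = [3, 64, 128, 256, 512, dim]
        self.conv_blocks = nn.Sequential(*[
            nn.Sequential(
                nn.Conv2d(chans[i], chans[i + 1], 3, stride=2, padding=1),
                nn.BatchNorm2d(chans[i + 1]),
                nn.ReLU(inplace=True),
            )
            for i in range(5)
        ])
        enc = nn.TransformerEncoderLayer(d_model=dim, nhead=heads, batch_first=True)
        self.spatial_encoders = nn.TransformerEncoder(enc, num_layers=n_spatial)

    def forward(self, x):                       # x: (B*T, 3, H, W)
        f = self.conv_blocks(x)                 # (B*T, dim, h, w)
        tokens = f.flatten(2).transpose(1, 2)   # (B*T, h*w, dim) spatial tokens
        tokens = self.spatial_encoders(tokens)
        return tokens.mean(dim=1)               # one feature vector per frame


class FormerDFER(nn.Module):
    """CS-Former per frame, then a T-Former over the frame sequence."""

    def __init__(self, dim=512, n_spatial=2, m_temporal=3, heads=8, num_classes=7):
        super().__init__()
        self.cs_former = CSFormer(dim, n_spatial, heads)
        enc = nn.TransformerEncoderLayer(d_model=dim, nhead=heads, batch_first=True)
        self.t_former = nn.TransformerEncoder(enc, num_layers=m_temporal)
        self.classifier = nn.Linear(dim, num_classes)

    def forward(self, clip):                    # clip: (B, T, 3, H, W)
        b, t = clip.shape[:2]
        frames = clip.flatten(0, 1)             # (B*T, 3, H, W)
        frame_feats = self.cs_former(frames).view(b, t, -1)
        temporal = self.t_former(frame_feats)   # contextual features over time
        return self.classifier(temporal.mean(dim=1))


# Example: two 16-frame 112x112 clips, 7 expression classes.
model = FormerDFER()
logits = model(torch.randn(2, 16, 3, 112, 112))
print(logits.shape)  # torch.Size([2, 7])
```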

Benchmark results for Former-DFER on the Dynamic Facial Expression Recognition task:

Dataset   Metric  Value   Global Rank
DFEW      WAR     65.70   #15
DFEW      UAR     53.69   #14
FERV39k   WAR     46.85   #8
FERV39k   UAR     37.20   #6
MAFW      WAR     43.27   #12
MAFW      UAR     31.16   #11
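
The two metrics in the table, WAR (weighted average recall, i.e. overall accuracy) and UAR (unweighted average recall, the mean of per-class recalls), can be computed as in the following minimal sketch. The toy labels are illustrative only, not drawn from any of the datasets above.

```python
# Sketch of how WAR and UAR are typically computed for expression recognition.
import numpy as np


def war_uar(y_true, y_pred, num_classes):
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    war = float(np.mean(y_true == y_pred))            # overall (weighted) accuracy
    recalls = []
    for c in range(num_classes):
        mask = y_true == c
        if mask.any():                                # skip classes absent from y_true
            recalls.append(np.mean(y_pred[mask] == c))  # per-class recall
    uar = float(np.mean(recalls))                     # unweighted mean of recalls
    return war, uar


# Toy example with 3 classes.
y_true = [0, 0, 0, 1, 1, 2]
y_pred = [0, 0, 1, 1, 0, 2]
print(war_uar(y_true, y_pred, num_classes=3))  # (0.666..., 0.722...)
```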
