Former-DFER: Dynamic Facial Expression Recognition Transformer
This paper proposes a dynamic facial expression recognition transformer (Former-DFER) for the in-the-wild scenario. The proposed Former-DFER consists mainly of a convolutional spatial transformer (CS-Former) and a temporal transformer (T-Former). The CS-Former consists of five convolution blocks and N spatial encoders and is designed to guide the network to learn occlusion-robust and pose-robust facial features from the spatial perspective. The T-Former consists of M temporal encoders and is designed to allow the network to learn contextual facial features from the temporal perspective. Heatmaps of the learned facial features demonstrate that Former-DFER can handle issues such as occlusion, non-frontal pose, and head motion, and a visualization of the feature distribution shows that the method learns more discriminative facial features. Moreover, Former-DFER achieves state-of-the-art results on the DFEW and AFEW benchmarks.
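The two-stage layout described in the abstract can be pictured with a minimal PyTorch sketch. This is not the authors' implementation: the class name `FormerDFERSketch`, the channel widths, embedding size, head count, default depths (standing in for N and M), mean pooling, the 7-class output, and the omission of positional embeddings are all assumptions made for illustration. Only the ordering (five convolution blocks, then spatial encoders, then temporal encoders) comes from the abstract.

```python
import torch
import torch.nn as nn


class FormerDFERSketch(nn.Module):
    """Illustrative only: conv blocks -> spatial encoders -> temporal encoders."""

    def __init__(self, num_classes=7, dim=64, n_spatial=2, m_temporal=2, heads=4):
        super().__init__()
        # Five convolution blocks (count from the abstract; widths are assumed).
        chans = [3, 16, 32, 32, 64, dim]
        self.conv = nn.Sequential(*[
            nn.Sequential(
                nn.Conv2d(chans[i], chans[i + 1], 3, stride=2, padding=1),
                nn.BatchNorm2d(chans[i + 1]),
                nn.ReLU(inplace=True),
            )
            for i in range(5)
        ])
        # N spatial encoders attend over the positions of each frame's feature map.
        spatial_layer = nn.TransformerEncoderLayer(dim, heads, dim * 2, batch_first=True)
        self.spatial = nn.TransformerEncoder(spatial_layer, n_spatial)
        # M temporal encoders attend over the per-frame features of a clip.
        temporal_layer = nn.TransformerEncoderLayer(dim, heads, dim * 2, batch_first=True)
        self.temporal = nn.TransformerEncoder(temporal_layer, m_temporal)
        self.head = nn.Linear(dim, num_classes)

    def forward(self, clip):
        # clip: (batch, frames, channels, height, width)
        b, t, c, h, w = clip.shape
        x = self.conv(clip.reshape(b * t, c, h, w))  # (b*t, dim, h', w')
        x = x.flatten(2).transpose(1, 2)             # (b*t, h'*w', dim) spatial tokens
        x = self.spatial(x).mean(dim=1)              # (b*t, dim), pooled per frame (assumed)
        x = self.temporal(x.reshape(b, t, -1))       # (b, t, dim) temporal tokens
        return self.head(x.mean(dim=1))              # (b, num_classes) logits
```

With a clip tensor of shape `(2, 8, 3, 64, 64)`, the sketch returns logits of shape `(2, 7)`. Folding the frame axis into the batch axis for the spatial stage is one simple way to apply the same spatial encoders to every frame before the temporal stage sees the sequence.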
Results from the Paper
Task | Dataset | Model | Metric Name | Metric Value | Global Rank
---|---|---|---|---|---
Dynamic Facial Expression Recognition | DFEW | Former-DFER | WAR | 65.70 | # 15
Dynamic Facial Expression Recognition | DFEW | Former-DFER | UAR | 53.69 | # 14
Dynamic Facial Expression Recognition | FERV39k | Former-DFER | WAR | 46.85 | # 8
Dynamic Facial Expression Recognition | FERV39k | Former-DFER | UAR | 37.20 | # 6
Dynamic Facial Expression Recognition | MAFW | Former-DFER | WAR | 43.27 | # 12
Dynamic Facial Expression Recognition | MAFW | Former-DFER | UAR | 31.16 | # 11
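
The table reports two metrics per dataset. Under their usual definitions, WAR (weighted average recall) is per-class recall weighted by class frequency, which equals overall accuracy, and UAR (unweighted average recall) is the plain mean of per-class recalls. The sketch below computes both in pure Python. The function name `war_uar` and the toy labels are hypothetical, and this is not the evaluation script used for the numbers above.

```python
from collections import Counter


def war_uar(y_true, y_pred):
    """Return (WAR, UAR) in percent for two equal-length label sequences."""
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("y_true and y_pred must be non-empty and equal in length")
    support = Counter(y_true)  # number of samples per true class
    correct = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    # WAR: correct predictions over all samples (overall accuracy).
    war = 100.0 * sum(correct.values()) / len(y_true)
    # UAR: recall computed per class, then averaged with equal class weight.
    uar = 100.0 * sum(correct[c] / n for c, n in support.items()) / len(support)
    return war, uar


# Toy imbalanced example: 8 "happy" clips and 2 "sad" clips.
y_true = ["happy"] * 8 + ["sad"] * 2
y_pred = ["happy"] * 8 + ["happy", "sad"]
war, uar = war_uar(y_true, y_pred)
print(f"WAR={war:.2f} UAR={uar:.2f}")
```

In the toy example the majority class is predicted perfectly while half of the minority class is missed, so WAR comes out higher than UAR. The same pattern appears in every row pair of the table, which is expected when class frequencies are imbalanced.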