Multiscale Multimodal Transformer for Multimodal Action Recognition

While action recognition has been an active research area for several years, most existing approaches rely on the video modality alone, whereas humans efficiently process video and audio cues simultaneously. This limits recent models to applications in which the actions are visually well-defined. At the same time, audio and video can both be perceived hierarchically: in audio classification, for example, the hierarchy runs from the audio signal at each sampling time point, to audio activities, to the overall category. In this work, we develop a multiscale multimodal Transformer (MMT) that employs hierarchical representation learning. Specifically, MMT is composed of a novel multiscale audio Transformer (MAT) and a multiscale video Transformer. Furthermore, we propose a set of multimodal supervised contrastive objectives, the audio-video contrastive loss (AVC) and the intra-modal contrastive loss (IMC), that specifically align the two modalities for robust multimodal representation fusion. MMT surpasses previous state-of-the-art approaches by 7.3%, 1.6% and 2.1% in top-1 accuracy on Kinetics-Sounds, Epic-Kitchens-100 and VGGSound, respectively, without external training data. Moreover, our MAT significantly outperforms AST by 22.2%, 4.4% and 4.7% on the three public benchmark datasets and is 3x more efficient in terms of FLOPs. Through extensive ablation studies and visualizations, we demonstrate that the proposed MMT effectively captures semantically more separable feature representations from a combination of video and audio signals.
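The abstract names a supervised audio-video contrastive loss (AVC) but does not give its formula. As a rough illustration only, the sketch below shows one common way such an objective can be written: a supervised contrastive loss in which each audio embedding is pulled toward the video embeddings that share its class label. The function names, the temperature value, and the symmetric averaging are assumptions for illustration, not the paper's definition.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (zero vectors are left unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]


def cross_modal_supcon(anchors, candidates, labels, temperature=0.1):
    """Supervised contrastive loss from one modality to another.

    Hypothetical helper: for anchor i, every candidate j with the same
    label counts as a positive; all candidates form the denominator.
    """
    a = [l2_normalize(v) for v in anchors]
    c = [l2_normalize(v) for v in candidates]
    n = len(labels)
    total = 0.0
    for i in range(n):
        # Cosine similarity scaled by the temperature.
        logits = [sum(p * q for p, q in zip(a[i], c[j])) / temperature
                  for j in range(n)]
        # Numerically stable log-sum-exp over all candidates.
        m = max(logits)
        log_denom = m + math.log(sum(math.exp(z - m) for z in logits))
        # The paired clip (j == i) is always a positive, so this is non-empty.
        positives = [j for j in range(n) if labels[j] == labels[i]]
        total += -sum(logits[j] - log_denom for j in positives) / len(positives)
    return total / n


def avc_loss_sketch(audio, video, labels, temperature=0.1):
    """Symmetric audio-to-video and video-to-audio loss (an assumption)."""
    a2v = cross_modal_supcon(audio, video, labels, temperature)
    v2a = cross_modal_supcon(video, audio, labels, temperature)
    return 0.5 * (a2v + v2a)
```

With two clips of different classes, embeddings that agree across modalities should yield a much smaller loss than embeddings that are swapped, which is the alignment behaviour the abstract describes.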


Results from the Paper


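| Task | Dataset | Model | Metric Name | Metric Value | Global Rank |
|---|---|---|---|---|---|
| Action Recognition | EPIC-KITCHENS-100 | MMT | Action@1 | 47.8 | #10 |
| Action Recognition | EPIC-KITCHENS-100 | MMT | Verb@1 | 70.1 | #10 |
| Action Recognition | EPIC-KITCHENS-100 | MMT | Noun@1 | 61.0 | #7 |
| Audio Classification | VGGSound | MMT (Video) | Top 1 Accuracy | 56.1 | #14 |
| Audio Classification | VGGSound | MMT (Video) | Top 5 Accuracy | 77.9 | #7 |
| Audio Classification | VGGSound | MMT (Audio-Visual) | Top 1 Accuracy | 66.2 | #4 |
| Audio Classification | VGGSound | MMT (Audio-Visual) | Top 5 Accuracy | 85.7 | #1 |
| Multi-modal Classification | VGG-Sound | MMT | Top-1 Accuracy | 66.2 | #1 |
| Multi-modal Classification | VGG-Sound | MMT | Top-5 Accuracy | 85.7 | #1 |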
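The results above are reported as top-1 and top-5 accuracy. For readers unfamiliar with the metric, the following minimal sketch shows how top-k accuracy is typically computed from per-class scores: a sample counts as correct when its true class is among the k highest-scoring classes. The function name and the toy inputs are illustrative assumptions and are not taken from the paper's evaluation code.

```python
def top_k_accuracy(scores, targets, k=1):
    """Percentage of samples whose true class is among the k best scores.

    scores:  list of per-class score lists, one list per sample
    targets: list of true class indices, one per sample
    """
    hits = 0
    for row, target in zip(scores, targets):
        # Class indices ordered from highest to lowest score, cut at k.
        top_k = sorted(range(len(row)), key=lambda c: row[c], reverse=True)[:k]
        hits += target in top_k
    return 100.0 * hits / len(targets)


# Toy example with three samples and three classes.
toy_scores = [
    [0.1, 0.7, 0.2],
    [0.5, 0.3, 0.2],
    [0.2, 0.3, 0.5],
]
toy_targets = [1, 1, 0]
top1 = top_k_accuracy(toy_scores, toy_targets, k=1)
top2 = top_k_accuracy(toy_scores, toy_targets, k=2)
```

Because the top-k set grows with k, top-5 accuracy can never be lower than top-1 accuracy on the same predictions, which matches the pattern in the table.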

Methods