Motion-Appearance Co-Memory Networks for Video Question Answering

CVPR 2018  ·  Jiyang Gao, Runzhou Ge, Kan Chen, Ram Nevatia

Video Question Answering (QA) is an important task for understanding temporal structure in video. We observe three attributes unique to video QA compared with image QA: (1) it deals with long sequences of images that are richer in both the quantity and the variety of information they carry; (2) motion and appearance information are usually correlated and can provide useful attention cues to each other; (3) different questions require different numbers of frames to infer the answer. Based on these observations, we propose a motion-appearance co-memory network for video QA. Our network builds on concepts from the Dynamic Memory Network (DMN) and introduces new mechanisms for video QA. Specifically, there are three salient aspects: (1) a co-memory attention mechanism that uses cues from both motion and appearance to generate attention; (2) a temporal conv-deconv network that generates multi-level contextual facts; (3) a dynamic fact ensemble method that constructs temporal representations dynamically for different questions. We evaluate our method on the TGIF-QA dataset, where it significantly outperforms the state of the art on all four tasks.
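To make the conv-deconv component concrete, below is a minimal PyTorch sketch, not the authors' released code: strided 1D convolutions summarize a sequence of frame features at progressively coarser time scales, and a transposed convolution per level projects each scale back to the original length, yielding one set of contextual facts per level. The class name TemporalConvDeconv and all dimensions, kernel sizes, and activations are illustrative assumptions rather than the paper's actual hyperparameters.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TemporalConvDeconv(nn.Module):
    """Sketch of multi-level contextual facts from frame features.

    Hypothetical parameters: `dim` is the feature channel size,
    `levels` the number of temporal scales.
    """
    def __init__(self, dim=512, levels=3):
        super().__init__()
        # each conv halves the temporal resolution
        self.convs = nn.ModuleList(
            nn.Conv1d(dim, dim, kernel_size=3, stride=2, padding=1)
            for _ in range(levels))
        # one deconv per level, upsampling by 2**(i+1) back to full length
        self.deconvs = nn.ModuleList(
            nn.ConvTranspose1d(dim, dim, kernel_size=2 ** (i + 2),
                               stride=2 ** (i + 1), padding=2 ** i)
            for i in range(levels))

    def forward(self, feats):                  # feats: (batch, dim, T),
        facts, x = [], feats                   # T divisible by 2**levels
        for conv, deconv in zip(self.convs, self.deconvs):
            x = F.relu(conv(x))                # coarsen the time axis
            facts.append(F.relu(deconv(x)))    # restore length T at this scale
        return facts                           # list of (batch, dim, T) facts

# usage: facts = TemporalConvDeconv()(torch.randn(2, 512, 32))
```

In the paper's design, motion and appearance streams each produce such multi-level facts, which the co-memory attention then reads with cues exchanged between the two streams.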


Results from Other Papers


Task                             Dataset    Model   Metric    Value  Rank
Visual Question Answering (VQA)  MSRVTT-QA  Co-Mem  Accuracy  0.32   #28
Visual Question Answering (VQA)  MSVD-QA    Co-Mem  Accuracy  0.317  #31
