MovieChat: From Dense Token to Sparse Memory for Long Video Understanding

Integrating video foundation models with large language models to build video understanding systems has recently made it possible to overcome the limitations of specific pre-defined vision tasks. Yet existing systems can only handle videos with very few frames. For long videos, computational complexity, memory cost, and long-term temporal connections pose additional challenges. Drawing on the Atkinson-Shiffrin memory model, we propose MovieChat, which uses Transformer tokens as the carriers of memory in combination with a specially designed memory mechanism to overcome these challenges. MovieChat achieves state-of-the-art performance in long video understanding. We also release the MovieChat-1K benchmark, with 1K long videos and 14K manual annotations, to validate the effectiveness of our method.
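
The abstract does not spell out the memory mechanism, so the following is only a minimal sketch of the "dense token to sparse memory" idea under stated assumptions. It assumes a fixed-size short-term buffer of per-frame feature vectors that, once full, is condensed into long-term memory by repeatedly averaging the two most similar adjacent frames. The names `SparseMemory`, `consolidate`, `short_capacity`, and `keep` are hypothetical, and the merging rule is an illustrative choice, not something the text above confirms.

```python
import math
from typing import List

Vector = List[float]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity between two feature vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def consolidate(frames: List[Vector], target_len: int) -> List[Vector]:
    """Greedily average the most similar adjacent pair until target_len frames remain.

    Assumption: this merging rule is illustrative only.
    """
    frames = [list(f) for f in frames]  # copy so the caller's data is untouched
    while len(frames) > max(target_len, 1):
        sims = [cosine(frames[i], frames[i + 1]) for i in range(len(frames) - 1)]
        i = max(range(len(sims)), key=sims.__getitem__)
        merged = [(x + y) / 2 for x, y in zip(frames[i], frames[i + 1])]
        frames[i:i + 2] = [merged]
    return frames


class SparseMemory:
    """Hypothetical two-stage memory: a dense short-term buffer and a sparse long-term store."""

    def __init__(self, short_capacity: int, keep: int) -> None:
        self.short_capacity = short_capacity  # frames held before consolidation
        self.keep = keep                      # frames kept per consolidation
        self.short_term: List[Vector] = []
        self.long_term: List[Vector] = []

    def add_frame(self, feature: Vector) -> None:
        self.short_term.append(list(feature))
        if len(self.short_term) >= self.short_capacity:
            # Condense the full buffer and move the result into long-term memory.
            self.long_term.extend(consolidate(self.short_term, self.keep))
            self.short_term = []


if __name__ == "__main__":
    memory = SparseMemory(short_capacity=4, keep=2)
    for frame in ([1.0, 0.0], [1.0, 0.1], [0.0, 1.0], [0.1, 1.0]):
        memory.add_frame(frame)
    print(len(memory.long_term), "long-term entries from 4 frames")
```

In this sketch, memory grows by `keep` entries per `short_capacity` input frames, which is one way a long video could be represented with far fewer tokens than a dense per-frame encoding.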

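| Task | Dataset | Model | Metric Name | Metric Value | Global Rank |
|---|---|---|---|---|---|
| Video Question Answering | ActivityNet-QA | MovieChat | Accuracy | 45.7 | #15 |
| Video Question Answering | ActivityNet-QA | MovieChat | Confidence Score | 3.1 | #7 |
| Zero-Shot Video Question Answer | ActivityNet-QA | MovieChat | Confidence Score | 3.1 | #12 |
| Zero-Shot Video Question Answer | ActivityNet-QA | MovieChat | Accuracy | 45.7 | #12 |
| zero-shot long video breakpoint-mode question answering | MovieChat-1K | Video-ChatGPT [maaz2023video] | Score | 2.45 | #2 |
| zero-shot long video breakpoint-mode question answering | MovieChat-1K | Video-ChatGPT [maaz2023video] | 1:1 Accuracy | 48 | #1 |
| zero-shot long video global-model question answering | MovieChat-1K | Video-ChatGPT [maaz2023video] | 1:1 Accuracy | 47.6 | #3 |
| zero-shot long video global-model question answering | MovieChat-1K | Video-ChatGPT [maaz2023video] | Score | 2.55 | #4 |
| zero-shot long video breakpoint-mode question answering | MovieChat-1K | Video LLaMA [zhang2023video] | Score | 2.04 | #4 |
| zero-shot long video breakpoint-mode question answering | MovieChat-1K | Video LLaMA [zhang2023video] | 1:1 Accuracy | 39.1 | #3 |
| zero-shot long video global-model question answering | MovieChat-1K | Video LLaMA [zhang2023video] | 1:1 Accuracy | 51.7 | #2 |
| zero-shot long video global-model question answering | MovieChat-1K | Video LLaMA [zhang2023video] | Score | 2.67 | #3 |
| zero-shot long video breakpoint-mode question answering | MovieChat-1K | Video Chat [li2023videochat] | Score | 2.29 | #3 |
| zero-shot long video breakpoint-mode question answering | MovieChat-1K | Video Chat [li2023videochat] | 1:1 Accuracy | 46.1 | #2 |
| zero-shot long video global-model question answering | MovieChat-1K | Video Chat [li2023videochat] | Score | 3 | #2 |
| zero-shot long video breakpoint-mode question answering | MovieChat-1K | MovieChat | Score | 2.57 | #1 |
| zero-shot long video global-model question answering | MovieChat-1K | MovieChat | 1:1 Accuracy | 62.3 | #1 |
| zero-shot long video global-model question answering | MovieChat-1K | MovieChat | Score | 3.23 | #1 |
| zero-shot long video breakpoint-model question answering | MovieChat-1K | MovieChat | 1:1 Accuracy | 0.483 | #1 |
| zero-shot long video global-mode question answering | MovieChat-1K | Video Chat [li2023videochat] | 1:1 Accuracy | 57.8 | #1 |
| Zero-Shot Video Question Answer | MSRVTT-QA | MovieChat | Accuracy | 52.7 | #15 |
| Zero-Shot Video Question Answer | MSRVTT-QA | MovieChat | Confidence Score | 2.6 | #17 |
| Zero-Shot Video Question Answer | MSVD-QA | MovieChat | Accuracy | 75.2 | #4 |
| Zero-Shot Video Question Answer | MSVD-QA | MovieChat | Confidence Score | 2.9 | #14 |
| Question Answering | NExT-QA (Open-ended VideoQA) | MovieChat | Accuracy | 49.9 | #4 |
| Question Answering | NExT-QA (Open-ended VideoQA) | MovieChat | Confidence Score | 2.7 | #4 |
| Video-based Generative Performance Benchmarking (Consistency) | VideoInstruct | MovieChat | gpt-score | 2.42 | #8 |
| Video-based Generative Performance Benchmarking (Detail Orientation) | VideoInstruct | MovieChat | gpt-score | 2.93 | #5 |
| Video-based Generative Performance Benchmarking (Correctness of Information) | VideoInstruct | MovieChat | gpt-score | 2.76 | #7 |
| Video-based Generative Performance Benchmarking (Contextual Understanding) | VideoInstruct | MovieChat | gpt-score | 3.01 | #8 |
| Video-based Generative Performance Benchmarking (Temporal Understanding) | VideoInstruct | MovieChat | gpt-score | 2.24 | #8 |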
