TASK	DATASET	MODEL	METRIC NAME	METRIC VALUE	GLOBAL RANK
Video Question Answering	ActivityNet-QA	MovieChat	Accuracy	45.7	# 15
Video Question Answering	ActivityNet-QA	MovieChat	Confidence Score	3.1	# 7
Zero-Shot Video Question Answer	ActivityNet-QA	MovieChat	Confidence Score	3.1	# 12
Zero-Shot Video Question Answer	ActivityNet-QA	MovieChat	Accuracy	45.7	# 12
zero-shot long video breakpoint-mode question answering	MovieChat-1K	Video-ChatGPT [maaz2023video]	Score	2.45	# 2
zero-shot long video breakpoint-mode question answering	MovieChat-1K	Video-ChatGPT [maaz2023video]	1:1 Accuracy	48	# 1
zero-shot long video global-model question answering	MovieChat-1K	Video-ChatGPT [maaz2023video]	1:1 Accuracy	47.6	# 3
zero-shot long video global-model question answering	MovieChat-1K	Video-ChatGPT [maaz2023video]	Score	2.55	# 4
zero-shot long video breakpoint-mode question answering	MovieChat-1K	Video LLaMA [zhang2023video]	Score	2.04	# 4
zero-shot long video breakpoint-mode question answering	MovieChat-1K	Video LLaMA [zhang2023video]	1:1 Accuracy	39.1	# 3
zero-shot long video global-model question answering	MovieChat-1K	Video LLaMA [zhang2023video]	1:1 Accuracy	51.7	# 2
zero-shot long video global-model question answering	MovieChat-1K	Video LLaMA [zhang2023video]	Score	2.67	# 3
zero-shot long video breakpoint-mode question answering	MovieChat-1K	Video Chat [li2023videochat]	Score	2.29	# 3
zero-shot long video breakpoint-mode question answering	MovieChat-1K	Video Chat [li2023videochat]	1:1 Accuracy	46.1	# 2
zero-shot long video global-model question answering	MovieChat-1K	Video Chat [li2023videochat]	Score	3	# 2
zero-shot long video breakpoint-mode question answering	MovieChat-1K	MovieChat	Score	2.57	# 1
zero-shot long video global-model question answering	MovieChat-1K	MovieChat	1:1 Accuracy	62.3	# 1
zero-shot long video global-model question answering	MovieChat-1K	MovieChat	Score	3.23	# 1
zero-shot long video breakpoint-model question answering	MovieChat-1K	MovieChat	1:1 Accuracy	0.483	# 1
zero-shot long video global-mode question answering	MovieChat-1K	Video Chat [li2023videochat]	1:1 Accuracy	57.8	# 1
Zero-Shot Video Question Answer	MSRVTT-QA	MovieChat	Accuracy	52.7	# 15
Zero-Shot Video Question Answer	MSRVTT-QA	MovieChat	Confidence Score	2.6	# 17
Zero-Shot Video Question Answer	MSVD-QA	MovieChat	Accuracy	75.2	# 4
Zero-Shot Video Question Answer	MSVD-QA	MovieChat	Confidence Score	2.9	# 14
Question Answering	NExT-QA (Open-ended VideoQA)	MovieChat	Accuracy	49.9	# 4
Question Answering	NExT-QA (Open-ended VideoQA)	MovieChat	Confidence Score	2.7	# 4
Video-based Generative Performance Benchmarking (Consistency)	VideoInstruct	MovieChat	gpt-score	2.42	# 8
Video-based Generative Performance Benchmarking (Detail Orientation)	VideoInstruct	MovieChat	gpt-score	2.93	# 5
Video-based Generative Performance Benchmarking (Correctness of Information)	VideoInstruct	MovieChat	gpt-score	2.76	# 7
Video-based Generative Performance Benchmarking (Contextual Understanding)	VideoInstruct	MovieChat	gpt-score	3.01	# 8
Video-based Generative Performance Benchmarking (Temporal Understanding)	VideoInstruct	MovieChat	gpt-score	2.24	# 8

MovieChat: From Dense Token to Sparse Memory for Long Video Understanding

31 Jul 2023 · Enxin Song, Wenhao Chai, Guanhong Wang, Yucheng Zhang, Haoyang Zhou, Feiyang Wu, Haozhe Chi, Xun Guo, Tian Ye, Yanting Zhang, Yan Lu, Jenq-Neng Hwang, Gaoang Wang

Integrating video foundation models with large language models to build a video understanding system can overcome the limitations of specific pre-defined vision tasks. Yet existing systems can only handle videos with very few frames. For long videos, computational complexity, memory cost, and long-term temporal connections pose additional challenges. Drawing on the Atkinson-Shiffrin memory model, we propose MovieChat, which uses tokens in Transformers as the carriers of memory in combination with a specially designed memory mechanism, to overcome these challenges. MovieChat achieves state-of-the-art performance in long video understanding. We also release the MovieChat-1K benchmark, with 1K long videos and 14K manual annotations, to validate the effectiveness of our method.
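The abstract does not spell out the memory mechanism, so the following is only an illustrative sketch of the dense-to-sparse idea suggested by the title, not the authors' implementation. It assumes a bounded short-term buffer of per-frame feature vectors that, once full, is consolidated into long-term memory by repeatedly averaging the most similar adjacent pair. The names `SparseMemory`, `consolidate`, `short_capacity`, and `consolidated_len`, as well as the cosine-similarity merge rule, are hypothetical choices made for illustration.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def consolidate(frames, target_len):
    """Shrink a list of frame vectors to target_len entries by repeatedly
    averaging the most similar adjacent pair (an assumed merge rule)."""
    frames = [list(f) for f in frames]
    while len(frames) > max(target_len, 1):
        sims = [cosine(frames[i], frames[i + 1]) for i in range(len(frames) - 1)]
        i = max(range(len(sims)), key=sims.__getitem__)
        merged = [(x + y) / 2 for x, y in zip(frames[i], frames[i + 1])]
        frames[i:i + 2] = [merged]  # replace the pair with its average
    return frames

class SparseMemory:
    """Hypothetical two-stage memory: a bounded short-term buffer that is
    consolidated into long-term memory whenever it fills up."""

    def __init__(self, short_capacity=4, consolidated_len=2):
        self.short_capacity = short_capacity
        self.consolidated_len = consolidated_len
        self.short_term = []
        self.long_term = []

    def add_frame(self, frame):
        self.short_term.append(list(frame))
        if len(self.short_term) >= self.short_capacity:
            # Dense short-term tokens become a sparser long-term summary.
            self.long_term.extend(
                consolidate(self.short_term, self.consolidated_len)
            )
            self.short_term = []

memory = SparseMemory(short_capacity=4, consolidated_len=2)
for frame in [[1.0, 0.0], [1.0, 0.1], [0.0, 1.0], [0.1, 1.0], [1.0, 1.0]]:
    memory.add_frame(frame)
print(len(memory.long_term), len(memory.short_term))
```

In this toy setup, long-term memory grows by at most `consolidated_len` entries for every `short_capacity` frames seen, which is the kind of bounded growth that would keep memory cost manageable on long videos.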

Code

rese1f/MovieChat official

408

Tasks

Question Answering

Video-based Generative Performance Benchmarking

Video-based Generative Performance Benchmarking (Consistency)

Video-based Generative Performance Benchmarking (Contextual Understanding)

Video-based Generative Performance Benchmarking (Correctness of Information)

Video-based Generative Performance Benchmarking (Detail Orientation)

Video-based Generative Performance Benchmarking (Temporal Understanding)

Video Question Answering

Video Understanding

zero-shot long video breakpoint-model question answering

zero-shot long video breakpoint-mode question answering

zero-shot long video global-model question answering

zero-shot long video global-mode question answering

zero-shot long video question answering

Zero-Shot Video Question Answer

Datasets

ActivityNet-QA

NExT-QA

MSRVTT-QA

MSVD-QA

VideoInstruct

NExT-QA (Open-ended VideoQA)

Results from the Paper

Ranked #1 on zero-shot long video global-mode question answering on MovieChat-1K
