TASK	DATASET	MODEL	METRIC NAME	METRIC VALUE	GLOBAL RANK
Action Recognition	Charades-Ego	LaViLa (Zero-shot, TimeSformer-L)	mAP	28.9	# 5
Action Recognition	Charades-Ego	LaViLa (Finetuned, TimeSformer-L)	mAP	36.1	# 1
Egocentric Activity Recognition	EGTEA	LaViLa (Finetuned, TimeSformer-L)	Average Accuracy	81.75	# 1
Egocentric Activity Recognition	EGTEA	LaViLa (Finetuned, TimeSformer-L)	Mean class accuracy	76	# 1
Action Recognition	EPIC-KITCHENS-100	LaViLa (TimeSformer-L)	Action@1	51	# 4
Action Recognition	EPIC-KITCHENS-100	LaViLa (TimeSformer-L)	Verb@1	72	# 4
Action Recognition	EPIC-KITCHENS-100	LaViLa (TimeSformer-L)	Noun@1	62.9	# 5
Multi-Instance Retrieval	EPIC-KITCHENS-100	LaViLa (Finetuned, TimeSformer-L)	mAP(V2T)	54.7	# 3
Multi-Instance Retrieval	EPIC-KITCHENS-100	LaViLa (Finetuned, TimeSformer-L)	mAP(T2V)	47.1	# 2
Multi-Instance Retrieval	EPIC-KITCHENS-100	LaViLa (Finetuned, TimeSformer-L)	mAP (Avg)	50.9	# 2
Multi-Instance Retrieval	EPIC-KITCHENS-100	LaViLa (Finetuned, TimeSformer-L)	nDCG (V2T)	68.1	# 2
Multi-Instance Retrieval	EPIC-KITCHENS-100	LaViLa (Finetuned, TimeSformer-L)	nDCG (T2V)	64.9	# 2
Multi-Instance Retrieval	EPIC-KITCHENS-100	LaViLa (Finetuned, TimeSformer-L)	nDCG (Avg)	66.5	# 2
Multi-Instance Retrieval	EPIC-KITCHENS-100	LaViLa (Zero-shot, TimeSformer-L)	mAP(V2T)	40	# 7
Multi-Instance Retrieval	EPIC-KITCHENS-100	LaViLa (Zero-shot, TimeSformer-L)	mAP(T2V)	32.2	# 7
Multi-Instance Retrieval	EPIC-KITCHENS-100	LaViLa (Zero-shot, TimeSformer-L)	nDCG (V2T)	36.1	# 7
Multi-Instance Retrieval	EPIC-KITCHENS-100	LaViLa (Zero-shot, TimeSformer-L)	nDCG (T2V)	33.2	# 7
Multi-Instance Retrieval	EPIC-KITCHENS-100	LaViLa (Zero-shot)	mAP (Avg)	36.1	# 10
Multi-Instance Retrieval	EPIC-KITCHENS-100	LaViLa (Zero-shot)	nDCG (Avg)	34.6	# 10
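
The EGTEA rows above report both "Average Accuracy" and "Mean class accuracy". The sketch below shows how two such metrics can diverge on imbalanced data. It assumes the usual definitions (overall fraction of correct predictions versus the unweighted mean of per-class recall). The benchmark's exact evaluation protocol is not stated on this page, and the function names and toy labels are hypothetical.

```python
from collections import defaultdict


def average_accuracy(y_true, y_pred):
    """Fraction of all samples that were predicted correctly."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def mean_class_accuracy(y_true, y_pred):
    """Unweighted mean of per-class recall, so rare classes count equally."""
    hits = defaultdict(int)
    totals = defaultdict(int)
    for t, p in zip(y_true, y_pred):
        totals[t] += 1
        hits[t] += int(t == p)
    return sum(hits[c] / totals[c] for c in totals) / len(totals)


# Toy, made-up labels: the frequent class is always right, one rare class is missed.
y_true = ["cut", "cut", "cut", "cut", "pour", "wash"]
y_pred = ["cut", "cut", "cut", "cut", "cut", "wash"]

print(average_accuracy(y_true, y_pred))
print(mean_class_accuracy(y_true, y_pred))
```

On this toy data the overall accuracy is 5/6 while the class-averaged figure is 2/3, which is why a model can score higher on the first metric than on the second.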

Badge	Markdown
	`[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/learning-video-representations-from-large/action-recognition-on-charades-ego)](https://paperswithcode.com/sota/action-recognition-on-charades-ego?p=learning-video-representations-from-large)`
	`[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/learning-video-representations-from-large/egocentric-activity-recognition-on-egtea-1)](https://paperswithcode.com/sota/egocentric-activity-recognition-on-egtea-1?p=learning-video-representations-from-large)`
	`[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/learning-video-representations-from-large/multi-instance-retrieval-on-epic-kitchens-100)](https://paperswithcode.com/sota/multi-instance-retrieval-on-epic-kitchens-100?p=learning-video-representations-from-large)`
	`[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/learning-video-representations-from-large/action-recognition-on-epic-kitchens-100)](https://paperswithcode.com/sota/action-recognition-on-epic-kitchens-100?p=learning-video-representations-from-large)`

Learning Video Representations from Large Language Models

CVPR 2023 · Yue Zhao, Ishan Misra, Philipp Krähenbühl, Rohit Girdhar

We introduce LaViLa, a new approach to learning video-language representations by leveraging Large Language Models (LLMs). We repurpose pre-trained LLMs to be conditioned on visual input, and finetune them to create automatic video narrators. Our auto-generated narrations offer a number of advantages, including dense coverage of long videos, better temporal synchronization of the visual information and text, and much higher diversity of text. The video-text embedding learned contrastively with these additional auto-generated narrations outperforms the previous state-of-the-art on multiple first-person and third-person video tasks, both in zero-shot and finetuned setups. Most notably, LaViLa obtains an absolute gain of 10.1% on EGTEA classification and 5.9% on the Epic-Kitchens-100 multi-instance retrieval benchmark. Furthermore, LaViLa trained with only half the narrations from the Ego4D dataset outperforms baseline models trained on the full set, and shows positive scaling behavior on increasing pre-training data and model size.
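
The abstract states that the video-text embedding is "learned contrastively" but does not spell out the objective. As a rough illustration only, the sketch below assumes a CLIP-style symmetric InfoNCE loss over a batch of paired video and text embeddings. The function names, the temperature value of 0.07, and the toy vectors are assumptions for illustration and are not taken from the paper or its code.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def cross_entropy_diagonal(logits):
    """Mean cross-entropy where the correct column for row i is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        m = max(row)  # subtract the max for numerical stability
        log_sum_exp = m + math.log(sum(math.exp(x - m) for x in row))
        total += log_sum_exp - row[i]
    return total / len(logits)


def symmetric_contrastive_loss(video_emb, text_emb, temperature=0.07):
    """Average of video-to-text and text-to-video InfoNCE losses (assumed form)."""
    v = [l2_normalize(x) for x in video_emb]
    t = [l2_normalize(x) for x in text_emb]
    # logits[i][j] = cosine similarity of video i and text j, scaled by temperature
    logits = [[sum(a * b for a, b in zip(vi, tj)) / temperature for tj in t] for vi in v]
    logits_t = [list(col) for col in zip(*logits)]  # transpose for text-to-video
    return 0.5 * (cross_entropy_diagonal(logits) + cross_entropy_diagonal(logits_t))


# Toy embeddings: aligned pairs should give a much lower loss than swapped pairs.
video = [[1.0, 0.0], [0.0, 1.0]]
text_matched = [[1.0, 0.0], [0.0, 1.0]]
text_swapped = [[0.0, 1.0], [1.0, 0.0]]

print(symmetric_contrastive_loss(video, text_matched))
print(symmetric_contrastive_loss(video, text_swapped))
```

Under this assumed objective, extra narrations would simply enlarge the pool of (video, text) pairs fed to the loss.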

Code

facebookresearch/lavila official

ceezh/llovi

Tasks

Action Classification

Action Recognition

Egocentric Activity Recognition

Multi-Instance Retrieval

Datasets

UCF101

HMDB51

WebText

HowTo100M

EPIC-KITCHENS-100

EGTEA

Charades-Ego

Results from the Paper

Ranked #1 on Action Recognition on Charades-Ego
