Learning Video Representations from Large Language Models

We introduce LaViLa, a new approach to learning video-language representations by leveraging Large Language Models (LLMs). We repurpose pre-trained LLMs to be conditioned on visual input, and finetune them to create automatic video narrators. Our auto-generated narrations offer a number of advantages, including dense coverage of long videos, better temporal synchronization of the visual information and text, and much higher diversity of text. The video-text embedding learned contrastively with these additional auto-generated narrations outperforms the previous state-of-the-art on multiple first-person and third-person video tasks, both in zero-shot and finetuned setups. Most notably, LaViLa obtains an absolute gain of 10.1% on EGTEA classification and 5.9% on the Epic-Kitchens-100 multi-instance retrieval benchmark. Furthermore, LaViLa trained with only half the narrations from the Ego4D dataset outperforms baseline models trained on the full set, and shows positive scaling behavior with increasing pre-training data and model size.
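The abstract says the video-text embedding is learned contrastively. As a rough illustration of what such an objective can look like, here is a minimal sketch of a symmetric InfoNCE-style loss over a batch of paired video and text embeddings. It is not taken from the paper: the function names, the temperature of 0.07, and the toy embeddings are all assumptions made for the example, and the actual LaViLa training objective may differ.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def diagonal_cross_entropy(logits):
    """Mean cross-entropy where row i's correct class is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        row_max = max(row)  # subtract the max for numerical stability
        log_z = row_max + math.log(sum(math.exp(x - row_max) for x in row))
        total += log_z - row[i]
    return total / len(logits)


def symmetric_info_nce(video_embs, text_embs, temperature=0.07):
    """Hypothetical helper: average of video-to-text and text-to-video losses.

    video_embs[i] and text_embs[i] are assumed to describe the same clip;
    every other pairing in the batch is treated as a negative.
    The temperature value is an assumption, not a value from the paper.
    """
    v = [l2_normalize(e) for e in video_embs]
    t = [l2_normalize(e) for e in text_embs]
    sim = [[sum(a * b for a, b in zip(vi, tj)) / temperature for tj in t] for vi in v]
    sim_transposed = [list(col) for col in zip(*sim)]
    return 0.5 * (diagonal_cross_entropy(sim) + diagonal_cross_entropy(sim_transposed))


# Toy batch: two clips whose embeddings either line up with their text or are swapped.
video = [[1.0, 0.0], [0.0, 1.0]]
text_matched = [[1.0, 0.0], [0.0, 1.0]]
text_swapped = [[0.0, 1.0], [1.0, 0.0]]

loss_matched = symmetric_info_nce(video, text_matched)
loss_swapped = symmetric_info_nce(video, text_swapped)
```

In this toy batch the correctly paired embeddings give a loss close to zero, while the swapped pairing gives a much larger one, which is the behavior a contrastive objective is meant to reward.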

CVPR 2023

Results from the Paper


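| Task | Dataset | Model | Metric Name | Metric Value | Global Rank |
|---|---|---|---|---|---|
| Action Recognition | Charades-Ego | LaViLa (Zero-shot, TimeSformer-L) | mAP | 28.9 | # 5 |
| Action Recognition | Charades-Ego | LaViLa (Finetuned, TimeSformer-L) | mAP | 36.1 | # 1 |
| Egocentric Activity Recognition | EGTEA | LaViLa (Finetuned, TimeSformer-L) | Average Accuracy | 81.75 | # 1 |
| Egocentric Activity Recognition | EGTEA | LaViLa (Finetuned, TimeSformer-L) | Mean class accuracy | 76 | # 1 |
| Action Recognition | EPIC-KITCHENS-100 | LaViLa (TimeSformer-L) | Action@1 | 51 | # 4 |
| Action Recognition | EPIC-KITCHENS-100 | LaViLa (TimeSformer-L) | Verb@1 | 72 | # 4 |
| Action Recognition | EPIC-KITCHENS-100 | LaViLa (TimeSformer-L) | Noun@1 | 62.9 | # 5 |
| Multi-Instance Retrieval | EPIC-KITCHENS-100 | LaViLa (Finetuned, TimeSformer-L) | mAP (V2T) | 54.7 | # 3 |
| Multi-Instance Retrieval | EPIC-KITCHENS-100 | LaViLa (Finetuned, TimeSformer-L) | mAP (T2V) | 47.1 | # 2 |
| Multi-Instance Retrieval | EPIC-KITCHENS-100 | LaViLa (Finetuned, TimeSformer-L) | mAP (Avg) | 50.9 | # 2 |
| Multi-Instance Retrieval | EPIC-KITCHENS-100 | LaViLa (Finetuned, TimeSformer-L) | nDCG (V2T) | 68.1 | # 2 |
| Multi-Instance Retrieval | EPIC-KITCHENS-100 | LaViLa (Finetuned, TimeSformer-L) | nDCG (T2V) | 64.9 | # 2 |
| Multi-Instance Retrieval | EPIC-KITCHENS-100 | LaViLa (Finetuned, TimeSformer-L) | nDCG (Avg) | 66.5 | # 2 |
| Multi-Instance Retrieval | EPIC-KITCHENS-100 | LaViLa (Zero-shot, TimeSformer-L) | mAP (V2T) | 40 | # 7 |
| Multi-Instance Retrieval | EPIC-KITCHENS-100 | LaViLa (Zero-shot, TimeSformer-L) | mAP (T2V) | 32.2 | # 7 |
| Multi-Instance Retrieval | EPIC-KITCHENS-100 | LaViLa (Zero-shot, TimeSformer-L) | nDCG (V2T) | 36.1 | # 7 |
| Multi-Instance Retrieval | EPIC-KITCHENS-100 | LaViLa (Zero-shot, TimeSformer-L) | nDCG (T2V) | 33.2 | # 7 |
| Multi-Instance Retrieval | EPIC-KITCHENS-100 | LaViLa (Zero-shot) | mAP (Avg) | 36.1 | # 10 |
| Multi-Instance Retrieval | EPIC-KITCHENS-100 | LaViLa (Zero-shot) | nDCG (Avg) | 34.6 | # 10 |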
