Zero-Shot Video Question Answer

34 papers with code • 12 benchmarks • 11 datasets

This task presents zero-shot question-answering results on the TGIF-QA dataset for LLM-powered video conversational models.
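
Benchmarks here typically pair each video with an open-ended question and score the model's free-form answer. Below is a minimal sketch of such an evaluation harness; the `model.answer` interface is hypothetical, and real protocols often use an LLM judge rather than the exact match shown here.

```python
# Minimal zero-shot VideoQA accuracy harness. `model.answer` is a
# hypothetical interface; many benchmarks score free-form answers with
# an LLM judge instead of the exact match used here.
def exact_match(prediction: str, answer: str) -> bool:
    return prediction.strip().lower() == answer.strip().lower()

def evaluate(model, samples) -> float:
    """samples: iterable of (video_path, question, answer) triples."""
    correct = total = 0
    for video_path, question, answer in samples:
        prediction = model.answer(video_path, question)  # hypothetical API
        correct += exact_match(prediction, answer)
        total += 1
    return correct / max(total, 1)
```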

Most implemented papers

Flamingo: a Visual Language Model for Few-Shot Learning

mlfoundations/open_flamingo NeurIPS 2022

Building models that can be rapidly adapted to novel tasks using only a handful of annotated examples is an open challenge for multimodal machine learning research.
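The linked repo exposes a factory for building an OpenFlamingo model. The sketch below follows the project README, but the checkpoint names and argument values are snapshots that may differ in current releases, so treat them as placeholders.

```python
# Build an OpenFlamingo model (pip install open-flamingo). Checkpoint
# names and argument values follow the repo README at one point in time
# and may change between releases.
from open_flamingo import create_model_and_transforms

model, image_processor, tokenizer = create_model_and_transforms(
    clip_vision_encoder_path="ViT-L-14",
    clip_vision_encoder_pretrained="openai",
    lang_encoder_path="anas-awadalla/mpt-1b-redpajama-200b",
    tokenizer_path="anas-awadalla/mpt-1b-redpajama-200b",
    cross_attn_every_n_layers=1,
)
```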

Mistral 7B

mistralai/mistral-src 10 Oct 2023

We introduce Mistral 7B v0.1, a 7-billion-parameter language model engineered for superior performance and efficiency.
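Mistral 7B is a common language backbone for the video conversational models on this leaderboard. It loads with the standard Hugging Face transformers API; the model id below is the official v0.1 release, and `device_map="auto"` additionally requires the accelerate package.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-v0.1")
model = AutoModelForCausalLM.from_pretrained(
    "mistralai/Mistral-7B-v0.1", torch_dtype="auto", device_map="auto"
)

prompt = "Q: What is the person in the video doing?\nA:"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```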

Video-LLaVA: Learning United Visual Representation by Alignment Before Projection

PKU-YuanGroup/Video-LLaVA 16 Nov 2023

In this work, we unify visual representation into the language feature space to advance the foundational LLM towards a unified LVLM.
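A toy sketch of the "alignment before projection" idea: because image and video features are pre-aligned by a shared encoder (LanguageBind in the paper), a single projector can map both modalities into the LLM's embedding space. Dimensions and module names below are illustrative, not the repo's code.

```python
import torch
import torch.nn as nn

class SharedProjector(nn.Module):
    """One projection for both modalities; assumes the vision encoder
    has already aligned image and video features (dims are made up)."""
    def __init__(self, vis_dim: int = 1024, llm_dim: int = 4096):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(vis_dim, llm_dim), nn.GELU(), nn.Linear(llm_dim, llm_dim)
        )

    def forward(self, vis_tokens: torch.Tensor) -> torch.Tensor:
        # vis_tokens: (batch, num_tokens, vis_dim), from an image or a video
        return self.proj(vis_tokens)

projector = SharedProjector()
image_tokens = torch.randn(1, 256, 1024)      # one image
video_tokens = torch.randn(1, 8 * 256, 1024)  # eight frames
print(projector(image_tokens).shape, projector(video_tokens).shape)
```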

LLaMA-Adapter V2: Parameter-Efficient Visual Instruction Model

zrrskywalker/llama-adapter 28 Apr 2023

This strategy effectively alleviates the interference between the two tasks of image-text alignment and instruction following and achieves strong multi-modal reasoning with only a small-scale image-text and instruction dataset.
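A rough sketch of the parameter-efficient side of this recipe: freeze the backbone and re-enable only small parameter groups such as biases and norms. The real repo also learns adaption prompts, gating factors, and a visual projection, which are omitted here.

```python
import torch.nn as nn

def unfreeze_small_groups(model: nn.Module) -> int:
    """Freeze everything, then re-enable only bias and normalization
    parameters; returns the resulting trainable-parameter count."""
    for p in model.parameters():
        p.requires_grad = False
    trainable = 0
    for name, p in model.named_parameters():
        if "bias" in name or "norm" in name.lower():
            p.requires_grad = True
            trainable += p.numel()
    return trainable
```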

TGIF-QA: Toward Spatio-Temporal Reasoning in Visual Question Answering

ahjeongseo/MASN-pytorch CVPR 2017

In this paper, we focus on extending VQA to the video domain and contribute to the literature in three important ways.

MVB: A Large-Scale Dataset for Baggage Re-Identification and Merged Siamese Networks

wuyuejinxia/prcv2019-mvb-renet 26 Jul 2019

Second, all baggage images are captured by a specially designed multi-view camera system to handle pose variation and occlusion, in order to obtain 3D information of the baggage surface as completely as possible.

Zero-Shot Video Question Answering via Frozen Bidirectional Language Models

antoyang/FrozenBiLM 16 Jun 2022

Manual annotation of questions and answers for videos, however, is tedious and prohibits scalability.
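FrozenBiLM sidesteps that annotation cost by casting zero-shot VideoQA as masked-token prediction with a frozen bidirectional LM. The sketch below shows only the text-side answer scoring with a generic masked LM; the actual model additionally conditions on video through lightweight trained adapters.

```python
# Score candidate answers at a [MASK] position with a frozen masked LM.
# This sketches the answer-selection step only; FrozenBiLM's video
# conditioning via adapters is omitted.
import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForMaskedLM.from_pretrained("bert-base-uncased").eval()

question = "Question: what is the man riding? Answer: [MASK]."
candidates = ["horse", "bicycle", "car"]

inputs = tokenizer(question, return_tensors="pt")
mask_pos = (inputs.input_ids == tokenizer.mask_token_id).nonzero()[0, 1]
with torch.no_grad():
    logits = model(**inputs).logits[0, mask_pos]
ids = [tokenizer.convert_tokens_to_ids(c) for c in candidates]
scores = torch.stack([logits[i] for i in ids])
print(candidates[scores.argmax().item()])
```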

InternVideo: General Video Foundation Models via Generative and Discriminative Learning

opengvlab/internvideo 6 Dec 2022

Specifically, InternVideo efficiently explores masked video modeling and video-language contrastive learning as the pretraining objectives, and selectively coordinates video representations of these two complementary frameworks in a learnable manner to boost various video applications.
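A toy illustration of coordinating the two streams "in a learnable manner": fuse the masked-modeling and contrastive video embeddings with a single learned gate. This captures the idea only and is not the repo's actual coordination module.

```python
import torch
import torch.nn as nn

class LearnableFusion(nn.Module):
    """Combine generative (masked-modeling) and contrastive video
    embeddings with a learned gate -- an illustrative stand-in for
    InternVideo's learnable coordination."""
    def __init__(self):
        super().__init__()
        self.alpha = nn.Parameter(torch.zeros(1))  # learnable mixing weight

    def forward(self, gen_feat: torch.Tensor, con_feat: torch.Tensor) -> torch.Tensor:
        w = torch.sigmoid(self.alpha)
        return w * gen_feat + (1.0 - w) * con_feat

fusion = LearnableFusion()
out = fusion(torch.randn(2, 768), torch.randn(2, 768))
print(out.shape)  # torch.Size([2, 768])
```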

Chat-UniVi: Unified Visual Representation Empowers Large Language Models with Image and Video Understanding

pku-yuangroup/chat-univi 14 Nov 2023

Large language models have demonstrated impressive universal capabilities across a wide range of open-ended tasks and have extended their utility to encompass multimodal conversations.

LLaMA-VID: An Image is Worth 2 Tokens in Large Language Models

dvlab-research/llama-vid 28 Nov 2023

Current VLMs, while proficient in tasks like image captioning and visual question answering, face computational burdens when processing long videos due to the excessive visual tokens.
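The title is literal: LLaMA-VID represents each frame with just two tokens, a context token (the text query attending over the frame's patch features) and a content token (pooled patch features). Below is a hedged sketch of that per-frame compression with made-up dimensions, not the repo's implementation.

```python
import torch
import torch.nn as nn

class TwoTokenFrame(nn.Module):
    """Compress one frame to two tokens: a query-conditioned context
    token plus a pooled content token (dims are illustrative)."""
    def __init__(self, dim: int = 1024):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads=8, batch_first=True)

    def forward(self, patch_feats: torch.Tensor, query_feats: torch.Tensor):
        # patch_feats: (B, num_patches, dim); query_feats: (B, num_text, dim)
        # Context token: the pooled text query attends over the frame's patches.
        pooled_query = query_feats.mean(dim=1, keepdim=True)       # (B, 1, dim)
        context, _ = self.attn(pooled_query, patch_feats, patch_feats)
        # Content token: average-pool the patch features.
        content = patch_feats.mean(dim=1, keepdim=True)            # (B, 1, dim)
        return torch.cat([context, content], dim=1)                # (B, 2, dim)

frame = TwoTokenFrame()
tokens = frame(torch.randn(1, 256, 1024), torch.randn(1, 16, 1024))
print(tokens.shape)  # torch.Size([1, 2, 1024])
```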