Audio-Visual Question Answering (AVQA)

9 papers with code • 1 benchmark • 1 dataset

Benchmarks

These leaderboards track progress in Audio-Visual Question Answering (AVQA).

Dataset	Best Model
AVQA	ONE-PEACE

Datasets

AVQA

Most implemented papers

ONE-PEACE: Exploring One General Representation Model Toward Unlimited Modalities

OFA-Sys/ONE-PEACE • 18 May 2023

In this work, we explore a scalable way for building a general representation model toward unlimited modalities.

Hierarchical Conditional Relation Networks for Video Question Answering

thaolmk54/hcrn-videoqa • CVPR 2020

Video question answering (VideoQA) is challenging as it requires modeling capacity to distill dynamic visual artifacts and distant relations and to associate them with linguistic concepts.

Learning to Answer Questions in Dynamic Audio-Visual Scenarios

GeWu-Lab/MUSIC-AVQA • CVPR 2022

In this paper, we focus on the Audio-Visual Question Answering (AVQA) task, which aims to answer questions regarding different visual objects, sounds, and their associations in videos.

VALOR: Vision-Audio-Language Omni-Perception Pretraining Model and Dataset

TXH-mercury/VALOR • 17 Apr 2023

Different from widely-studied vision-language pretraining models, VALOR jointly models relationships of vision, audio and language in an end-to-end manner.

Target-Aware Spatio-Temporal Reasoning via Answering Questions in Dynamics Audio-Visual Scenarios

Bravo5542/TJSTG • 21 May 2023

Recent works rely on elaborate target-agnostic parsing of audio-visual scenes for spatial grounding while mistreating audio and video as separate entities for temporal grounding.

Progressive Spatio-temporal Perception for Audio-Visual Question Answering

gewu-lab/pstp-net • 10 Aug 2023

Such naturally multi-modal videos are composed of rich and complex dynamic audio-visual components, most of which could be unrelated to the given question or could even interfere with answering about the content of interest.

Object-aware Adaptive-Positivity Learning for Audio-Visual Question Answering

zhangbin-ai/apl • 20 Dec 2023

These selected pairs are constrained to have larger similarity values than the mismatched pairs.

CAT: Enhancing Multimodal Large Language Model to Answer Questions in Dynamic Audio-Visual Scenarios

rikeilong/bay-cat • 7 Mar 2024

This paper focuses on the challenge of answering questions in scenarios that are composed of rich and complex dynamic audio-visual components.

Answering Diverse Questions via Text Attached with Key Audio-Visual Clues

rikeilong/mcd-foravqa • 11 Mar 2024

Audio-visual question answering (AVQA) requires reference to video content and auditory information, followed by correlating the question to predict the most precise answer.
