Audio-Visual Question Answering (AVQA)
9 papers with code • 1 benchmark • 1 dataset
Most implemented papers
ONE-PEACE: Exploring One General Representation Model Toward Unlimited Modalities
In this work, we explore a scalable way for building a general representation model toward unlimited modalities.
Hierarchical Conditional Relation Networks for Video Question Answering
Video question answering (VideoQA) is challenging as it requires modeling capacity to distill dynamic visual artifacts and distant relations and to associate them with linguistic concepts.
Learning to Answer Questions in Dynamic Audio-Visual Scenarios
In this paper, we focus on the Audio-Visual Question Answering (AVQA) task, which aims to answer questions regarding different visual objects, sounds, and their associations in videos.
VALOR: Vision-Audio-Language Omni-Perception Pretraining Model and Dataset
Different from widely-studied vision-language pretraining models, VALOR jointly models relationships of vision, audio and language in an end-to-end manner.
Target-Aware Spatio-Temporal Reasoning via Answering Questions in Dynamics Audio-Visual Scenarios
Recent works rely on elaborate target-agnostic parsing of audio-visual scenes for spatial grounding while mistreating audio and video as separate entities for temporal grounding.
Progressive Spatio-temporal Perception for Audio-Visual Question Answering
Such naturally multi-modal videos are composed of rich and complex dynamic audio-visual components, most of which may be unrelated to the given question or may even interfere with answering questions about the content of interest.
Object-aware Adaptive-Positivity Learning for Audio-Visual Question Answering
These selected pairs are constrained to have larger similarity values than the mismatched pairs.
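The similarity constraint described in this snippet can be illustrated with a generic hinge-style margin penalty. This is only a minimal sketch, not the paper's actual loss: the function name `pairwise_margin_loss`, the `margin` value, and the all-pairs comparison are assumptions made for illustration.

```python
def pairwise_margin_loss(selected_sims, mismatched_sims, margin=0.2):
    """Average hinge penalty over all (selected, mismatched) combinations.

    The penalty is zero only when every selected pair's similarity exceeds
    every mismatched pair's similarity by at least `margin` (hypothetical value).
    """
    total = 0.0
    count = 0
    for pos in selected_sims:
        for neg in mismatched_sims:
            # Penalize whenever the gap (pos - neg) falls short of the margin.
            total += max(0.0, margin - (pos - neg))
            count += 1
    return total / count if count else 0.0
```

Under these assumptions, the loss is zero when the selected pairs are already well separated from the mismatched ones, and it grows as a mismatched pair approaches or overtakes a selected pair.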
CAT: Enhancing Multimodal Large Language Model to Answer Questions in Dynamic Audio-Visual Scenarios
This paper focuses on the challenge of answering questions in scenarios that are composed of rich and complex dynamic audio-visual components.
Answering Diverse Questions via Text Attached with Key Audio-Visual Clues
Audio-visual question answering (AVQA) requires reference to video content and auditory information, followed by correlating the question to predict the most precise answer.