Audio-Visual Event Localization
9 papers with code • 1 benchmark • 1 dataset
Most implemented papers
Audio-Visual Event Localization in Unconstrained Videos
In this paper, we introduce a novel problem of audio-visual event localization in unconstrained videos.
Dual-modality seq2seq network for audio-visual event localization
Audio-visual event localization requires one to identify the event which is both visible and audible in a video (either at a frame or video level).
Positive Sample Propagation along the Audio-Visual Event Line
To encourage the network to extract highly correlated features for positive samples, a new audio-visual pair similarity loss is proposed.
MM-Pyramid: Multimodal Pyramid Attentional Network for Audio-Visual Event Localization and Video Parsing
Recognizing and localizing events in videos is a fundamental task for video understanding.
Cross-Modal Background Suppression for Audio-Visual Event Localization
Audiovisual Event (AVE) localization requires the model to jointly localize an event by observing audio and visual information.
ActionFormer: Localizing Moments of Actions with Transformers
Self-attention based Transformer models have demonstrated impressive results for image classification and object detection, and more recently for video understanding.
Leveraging the Video-level Semantic Consistency of Event for Audio-visual Event Localization
In contrast to existing methods, we propose a novel video-level semantic consistency guidance network for the AVE localization task.
Dense-Localizing Audio-Visual Events in Untrimmed Videos: A Large-Scale Benchmark and Baseline
To better adapt to real-life applications, in this paper we focus on the task of dense-localizing audio-visual events, which aims to jointly localize and recognize all audio-visual events occurring in an untrimmed video.
UniAV: Unified Audio-Visual Perception for Multi-Task Video Localization
Video localization tasks aim to temporally locate specific instances in videos, including temporal action localization (TAL), sound event detection (SED) and audio-visual event localization (AVEL).