AudioCaps

Introduced by Kim et al. in AudioCaps: Generating Captions for Audios in The Wild

AudioCaps is a dataset of audio clips paired with natural-language descriptions of the sound events they contain. It was introduced for the task of audio captioning, and its clips are sourced from the AudioSet dataset. Annotators were given the audio tracks together with category hints, plus video hints when needed.

Source: Audio Retrieval with Natural Language Queries

Benchmarks

Task	Dataset Variant	Best Model
Audio Generation	AudioCaps	Audiobox
Text to Audio/Video Retrieval	AudioCaps	CE-Visual + CE-Audio
Audio/Video to Text Retrieval	AudioCaps	CE-Visual + VGGSound
Audio Captioning	AudioCaps	EnCLAP-large
Text to Audio Retrieval	AudioCaps	InternVideo2-6B
Audio to Text Retrieval	AudioCaps	ONE-PEACE
Zero-shot Text to Audio Retrieval	AudioCaps	InternVideo2-6B
Zero-shot Audio Captioning	AudioCaps	ZerAuCap
Target Sound Extraction	AudioCaps	CLAPSep