AUTOMATED AUDIO CAPTIONING BY FINE-TUNING BART WITH AUDIOSET TAGS

Automated audio captioning is the multimodal task of describing environmental audio recordings with fluent natural language. Most current methods utilize pre-trained analysis models to extract relevant semantic content from the audio input. However, prior information on language modeling is rarely introduced, and corresponding architectures are limited in capacity due to data scarcity. In this paper, we present a method leveraging the linguistic information contained in BART, a large-scale conditional language model with general-purpose pre-training. Caption generation is conditioned on sequences of textual AudioSet tags. This input is enriched with temporally aligned audio embeddings that allow the model to improve its sound event recognition. The full BART architecture is fine-tuned with only a few additional parameters. Experimental results demonstrate that, beyond the scaling properties of the architecture, language-only pre-training improves text quality in the multimodal setting of audio captioning. The best model achieves state-of-the-art performance on AudioCaps with 46.5 SPIDEr.
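The conditioning scheme in the abstract lends itself to a compact implementation. The following is a minimal sketch, assuming Hugging Face Transformers and 2048-dimensional audio frame embeddings (e.g., from PANNs); the projection layer `audio_proj`, the token-level alignment of audio frames to tag tokens, and the embedding-addition scheme are illustrative assumptions, not the authors' exact implementation.

```python
import torch
import torch.nn as nn
from transformers import BartTokenizer, BartForConditionalGeneration

tokenizer = BartTokenizer.from_pretrained("facebook/bart-base")
model = BartForConditionalGeneration.from_pretrained("facebook/bart-base")

# Small projection from audio-embedding space into BART's hidden size:
# the only newly initialized parameters besides BART itself (assumption).
audio_proj = nn.Linear(2048, model.config.d_model)

# Encoder input: the AudioSet tags predicted for the clip, as plain text.
tags = "Speech, Dog, Bark"
enc = tokenizer(tags, return_tensors="pt")

# Placeholder audio embeddings, one per input token; real usage would align
# audio frames to tag tokens temporally (simplified here).
num_tokens = enc["input_ids"].shape[1]
audio_emb = torch.randn(1, num_tokens, 2048)

# Enrich the tag token embeddings with projected audio features.
token_emb = model.get_input_embeddings()(enc["input_ids"])
inputs_embeds = token_emb + audio_proj(audio_emb)

# Fine-tune against the reference caption; BART shifts the labels internally
# to build the decoder inputs.
labels = tokenizer("A dog barks while a person speaks",
                   return_tensors="pt").input_ids
out = model(inputs_embeds=inputs_embeds,
            attention_mask=enc["attention_mask"],
            labels=labels)
out.loss.backward()  # gradients flow through all of BART plus audio_proj
```

In this setup the pre-trained language model stays fully trainable, so the audio information only has to be injected additively at the encoder input rather than through a new cross-modal architecture.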


Datasets

AudioCaps, AudioSet


Results from the Paper


Task              Dataset    Model                  Metric  Value  Global Rank
Audio captioning  AudioCaps  BART + YAMNet + PANNs  CIDEr   0.753  # 8
Audio captioning  AudioCaps  BART + YAMNet + PANNs  SPIDEr  0.465  # 7
Audio captioning  AudioCaps  BART + YAMNet + PANNs  SPICE   0.176  # 7
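Note that SPIDEr is defined as the arithmetic mean of SPICE and CIDEr, which the values above are consistent with: (0.176 + 0.753) / 2 ≈ 0.465.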

Methods

BART, YAMNet, PANNs