AUTOMATED AUDIO CAPTIONING BY FINE-TUNING BART WITH AUDIOSET TAGS

Automated audio captioning is the multimodal task of describing environmental audio recordings with fluent natural language. Most current methods utilize pre-trained analysis models to extract relevant semantic content from the audio input. However, prior information on language modeling is rarely introduced, and corresponding architectures are limited in capacity due to data scarcity. In this paper, we present a method leveraging the linguistic information contained in BART, a large-scale conditional language model with general-purpose pre-training. Caption generation is conditioned on sequences of textual AudioSet tags. This input is enriched with temporally aligned audio embeddings that allow the model to improve its sound event recognition. The full BART architecture is fine-tuned with only a few additional parameters. Experimental results demonstrate that, beyond the scaling properties of the architecture, language-only pre-training improves text quality in the multimodal setting of audio captioning. The best model achieves state-of-the-art performance on AudioCaps with 46.5 SPIDEr.
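The conditioning scheme in the abstract lends itself to a compact implementation. The following is a minimal sketch, assuming Hugging Face Transformers and 2048-dimensional audio frame embeddings (e.g., from PANNs); the projection layer `audio_proj`, the token-level alignment of audio frames to tag tokens, and the embedding-addition scheme are illustrative assumptions, not the authors' exact implementation.

```python
import torch
import torch.nn as nn
from transformers import BartTokenizer, BartForConditionalGeneration

tokenizer = BartTokenizer.from_pretrained("facebook/bart-base")
model = BartForConditionalGeneration.from_pretrained("facebook/bart-base")

# Small projection from audio-embedding space into BART's hidden size:
# the only newly initialized parameters besides BART itself (assumption).
audio_proj = nn.Linear(2048, model.config.d_model)

# Encoder input: the AudioSet tags predicted for the clip, as plain text.
tags = "Speech, Dog, Bark"
enc = tokenizer(tags, return_tensors="pt")

# Placeholder audio embeddings, one per input token; real usage would align
# audio frames to tag tokens temporally (simplified here).
num_tokens = enc["input_ids"].shape[1]
audio_emb = torch.randn(1, num_tokens, 2048)

# Enrich the tag token embeddings with projected audio features.
token_emb = model.get_input_embeddings()(enc["input_ids"])
inputs_embeds = token_emb + audio_proj(audio_emb)

# Fine-tune against the reference caption; BART shifts the labels internally
# to build the decoder inputs.
labels = tokenizer("A dog barks while a person speaks",
                   return_tensors="pt").input_ids
out = model(inputs_embeds=inputs_embeds,
            attention_mask=enc["attention_mask"],
            labels=labels)
out.loss.backward()  # gradients flow through all of BART plus audio_proj
```

In this setup the pre-trained language model stays fully trainable, so the audio information only has to be injected additively at the encoder input rather than through a new cross-modal architecture.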


Datasets

AudioCaps, AudioSet


Results from the Paper


Task              Dataset    Model                  Metric  Value  Global Rank
Audio captioning  AudioCaps  BART + YAMNet + PANNs  CIDEr   0.753  # 8
Audio captioning  AudioCaps  BART + YAMNet + PANNs  SPIDEr  0.465  # 7
Audio captioning  AudioCaps  BART + YAMNet + PANNs  SPICE   0.176  # 7
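Note that SPIDEr is defined as the arithmetic mean of SPICE and CIDEr, which the values above are consistent with: (0.176 + 0.753) / 2 ≈ 0.465.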

Methods

BART, YAMNet, PANNs