Zero-shot audio captioning with audio-language model guidance and audio context keywords

14 Nov 2023 · Leonard Salewski, Stefan Fauth, A. Sophia Koepke, Zeynep Akata

Zero-shot audio captioning aims to automatically generate descriptive textual captions for audio content without prior training for this task. Unlike speech recognition, which transcribes spoken language into text, audio captioning is typically concerned with ambient sounds or sounds produced by a human performing an action. Inspired by zero-shot image captioning methods, we propose ZerAuCap, a novel framework for summarising such general audio signals in a text caption without requiring task-specific training. In particular, our framework exploits a pre-trained large language model (LLM) to generate the text, guided by a pre-trained audio-language model so that the resulting captions describe the audio content. Additionally, we use audio context keywords that prompt the language model to generate text that is broadly relevant to sounds. Our proposed framework achieves state-of-the-art results in zero-shot audio captioning on the AudioCaps and Clotho datasets. Our code is available at https://github.com/ExplainableML/ZerAuCap.
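To make the two ingredients of the abstract concrete, here is a minimal Python sketch of (a) selecting audio context keywords with an audio-language model and (b) using that model to guide an LLM's caption generation. This is not the authors' code: the released framework guides decoding at the token level, while this sketch approximates the idea by sampling candidate captions and re-ranking them. The `audio_text_model` interface (a CLAP-style `embed_text`), the prompt wording, and the `alpha` weighting are assumptions for illustration.

```python
# Sketch of ZerAuCap-style zero-shot audio captioning (illustrative, not the
# authors' implementation). Assumes an audio-language model exposing a
# hypothetical `embed_text(texts) -> (N, D)` method, plus a precomputed
# audio embedding `audio_emb` of shape (1, D) in the same joint space.
import torch
import torch.nn.functional as F
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("gpt2")
lm = AutoModelForCausalLM.from_pretrained("gpt2")


def select_keywords(audio_emb, audio_text_model, vocabulary, k=3):
    """Pick the k vocabulary words whose text embeddings best match the audio."""
    word_emb = audio_text_model.embed_text(vocabulary)  # hypothetical API
    sim = F.cosine_similarity(audio_emb, word_emb)
    return [vocabulary[i] for i in sim.topk(k).indices]


def guided_caption(audio_emb, audio_text_model, keywords,
                   num_candidates=8, alpha=0.5):
    """Sample candidate captions from the LLM, then re-rank them by a
    weighted sum of audio-text similarity and LM likelihood."""
    # Audio context keywords steer the LLM towards sound-related text.
    prompt = f"Objects: {', '.join(keywords)}. This is a sound of"
    inputs = tokenizer(prompt, return_tensors="pt")
    out = lm.generate(
        **inputs,
        do_sample=True,
        top_p=0.9,
        max_new_tokens=20,
        num_return_sequences=num_candidates,
        return_dict_in_generate=True,
        output_scores=True,
        pad_token_id=tokenizer.eos_token_id,
    )
    prompt_len = inputs["input_ids"].shape[1]
    captions = [tokenizer.decode(seq[prompt_len:], skip_special_tokens=True)
                for seq in out.sequences]
    # Audio grounding: similarity between the audio embedding and each
    # candidate caption in the audio-language model's joint space.
    text_emb = audio_text_model.embed_text(captions)  # hypothetical API
    audio_score = F.cosine_similarity(audio_emb, text_emb)
    # Fluency: mean log-probability of the sampled continuation tokens.
    lm_score = lm.compute_transition_scores(
        out.sequences, out.scores, normalize_logits=True).mean(dim=-1)
    combined = alpha * audio_score + (1 - alpha) * lm_score
    return captions[int(combined.argmax())]
```

Re-ranking whole candidates is the simplest way to combine the two signals; token-level guidance, as in the released code, instead mixes the audio-text similarity into the next-token distribution at every decoding step, which lets the audio influence the caption before it is fully formed.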


Results from the Paper


Task: Zero-shot Audio Captioning. Each value is shown with its global rank on the benchmark in parentheses.

| Dataset | Model | BLEU-4 | METEOR | ROUGE-L | CIDEr | SPICE | SPIDEr |
|---|---|---|---|---|---|---|---|
| AudioCaps | ZerAuCap | 6.8 (#2) | 12.3 (#1) | 33.1 (#1) | 28.1 (#1) | 8.6 (#1) | 18.3 (#1) |
| AudioCaps | No audio (baseline) | 0.0 (#3) | 4.1 (#3) | 17.8 (#2) | 0.1 (#3) | 0.0 (#2) | 0.0 (#2) |
| Clotho | ZerAuCap | 2.9 (#1) | 9.4 (#1) | 25.4 (#1) | 14.0 (#1) | 5.3 (#1) | 9.7 (#1) |
| Clotho | No audio (baseline) | 0.0 (#2) | 3.8 (#2) | 16.6 (#2) | 0.2 (#2) | 0.1 (#2) | 0.2 (#2) |
