Zero-shot Audio Classification
6 papers with code • 2 benchmarks • 2 datasets
Most implemented papers
LanguageBind: Extending Video-Language Pretraining to N-modality by Language-based Semantic Alignment
We thus propose a dataset with Video, Infrared, Depth, Audio, and their corresponding Language, which we name VIDAL-10M.
WavCaps: A ChatGPT-Assisted Weakly-Labelled Audio Captioning Dataset for Audio-Language Multimodal Research
To address this data scarcity issue, we introduce WavCaps, the first large-scale weakly-labelled audio captioning dataset, comprising approximately 400k audio clips with paired captions.
Sound-Guided Semantic Image Manipulation
Our audio encoder is trained to produce a latent representation from an audio input, which is forced to be aligned with image and text representations in the multi-modal embedding space.
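Once audio, image, and text representations share one embedding space, zero-shot audio classification reduces to a nearest-neighbor search between an audio embedding and the text embeddings of candidate labels. The sketch below illustrates that comparison step only; the encoders and the 512-dimensional embeddings are hypothetical placeholders, not the architecture of any particular paper listed here.

```python
# Minimal sketch of zero-shot classification in a shared audio-text embedding
# space. Random tensors stand in for real encoder outputs.
import torch
import torch.nn.functional as F

def zero_shot_classify(audio_embedding: torch.Tensor,
                       text_embeddings: torch.Tensor,
                       labels: list) -> str:
    """Return the label whose text embedding is closest to the audio embedding."""
    audio = F.normalize(audio_embedding, dim=-1)   # (d,)
    texts = F.normalize(text_embeddings, dim=-1)   # (num_labels, d)
    similarities = texts @ audio                   # cosine similarity per label
    return labels[similarities.argmax().item()]

labels = ["dog bark", "rain", "siren"]
audio_emb = torch.randn(512)              # placeholder for an audio encoder output
text_embs = torch.randn(len(labels), 512) # placeholder for text encoder outputs
print(zero_shot_classify(audio_emb, text_embs, labels))
```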
Connecting the Dots between Audio and Text without Parallel Data through Visual Knowledge Transfer
In a difficult zero-shot setting with no paired audio-text data, our model demonstrates state-of-the-art zero-shot performance on the ESC50 and US8K audio classification tasks, and even surpasses the supervised state of the art for Clotho caption retrieval (with audio queries) by 2.2% R@1.
ImageBind: One Embedding Space To Bind Them All
We show that not all combinations of paired data are necessary to train such a joint embedding; image-paired data alone is sufficient to bind the modalities together.
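The binding idea can be sketched in a few lines: audio and text are each aligned to images with a contrastive loss during training, so audio-text similarity emerges at test time even though those two modalities were never paired. The linear encoders, random feature tensors, and simple InfoNCE-style loss below are illustrative assumptions, not the actual ImageBind encoders or training recipe.

```python
# Sketch of "binding" modalities through images with only image-paired data.
import torch
import torch.nn.functional as F

d = 128
image_enc = torch.nn.Linear(512, d)  # placeholder encoders
audio_enc = torch.nn.Linear(256, d)
text_enc = torch.nn.Linear(300, d)

def contrastive_loss(a, b, temperature=0.07):
    """InfoNCE-style loss pulling matched pairs together in the shared space."""
    a, b = F.normalize(a, dim=-1), F.normalize(b, dim=-1)
    logits = a @ b.T / temperature
    targets = torch.arange(len(a))
    return F.cross_entropy(logits, targets)

# Training sees only image-paired batches: (image, audio) and (image, text).
img1, aud = torch.randn(32, 512), torch.randn(32, 256)
img2, txt = torch.randn(32, 512), torch.randn(32, 300)
loss = contrastive_loss(image_enc(img1), audio_enc(aud)) \
     + contrastive_loss(image_enc(img2), text_enc(txt))
loss.backward()

# At test time audio and text can be compared directly, because both were
# aligned to the image embedding space during training.
score = F.normalize(audio_enc(aud), dim=-1) @ F.normalize(text_enc(txt), dim=-1).T
```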
Investigating the Emergent Audio Classification Ability of ASR Foundation Models
Text and vision foundation models can perform many tasks in a zero-shot setting, a desirable property that enables these systems to be applied in general and low-resource settings.