no code implementations • 30 May 2023 • Jianyuan Sun, Xubo Liu, Xinhao Mei, Volkan Kılıç, Mark D. Plumbley, Wenwu Wang
Experimental results show that LHDFF outperforms existing audio captioning models.
no code implementations • 5 Dec 2022 • Xinhao Mei, Xubo Liu, Jianyuan Sun, Mark D. Plumbley, Wenwu Wang
Captions generated by existing models are generally faithful to the content of audio clips, however, these machine-generated captions are often deterministic (e. g., generating a fixed caption for a given audio clip), simple (e. g., using common words and simple grammar), and generic (e. g., generating the same caption for similar audio clips).
1 code implementation • 28 Oct 2022 • Xubo Liu, Qiushi Huang, Xinhao Mei, Haohe Liu, Qiuqiang Kong, Jianyuan Sun, Shengchen Li, Tom Ko, Yu Zhang, Lilian H. Tang, Mark D. Plumbley, Volkan Kılıç, Wenwu Wang
Audio captioning aims to generate text descriptions of audio clips.
no code implementations • 10 Oct 2022 • Jianyuan Sun, Xubo Liu, Xinhao Mei, Mark D. Plumbley, Volkan Kilic, Wenwu Wang
Moreover, in LHDFF, a new PANNs encoder is proposed called Residual PANNs (RPANNs) by fusing the low-dimensional feature from the intermediate convolution layer output and the high-dimensional feature from the final layer output of PANNs.
1 code implementation • 29 Mar 2022 • Xinhao Mei, Xubo Liu, Jianyuan Sun, Mark D. Plumbley, Wenwu Wang
We present an extensive evaluation of popular metric learning objectives on the AudioCaps and Clotho datasets.
no code implementations • 7 Mar 2022 • Jianyuan Sun, Xubo Liu, Xinhao Mei, Jinzheng Zhao, Mark D. Plumbley, Volkan Kılıç, Wenwu Wang
In this paper, we propose a novel approach for ASC using deep neural decision forest (DNDF).
no code implementations • 6 Mar 2022 • Xubo Liu, Xinhao Mei, Qiushi Huang, Jianyuan Sun, Jinzheng Zhao, Haohe Liu, Mark D. Plumbley, Volkan Kılıç, Wenwu Wang
BERT is a pre-trained language model that has been extensively used in Natural Language Processing (NLP) tasks.
no code implementations • 13 Oct 2021 • Xinhao Mei, Xubo Liu, Jianyuan Sun, Mark D. Plumbley, Wenwu Wang
As different people may describe an audio clip from different aspects using distinct words and grammars, we argue that an audio captioning system should have the ability to generate diverse captions for a fixed audio clip and across similar audio clips.
no code implementations • 22 Jul 2015 • Jianyuan Sun, Guoqiang Zhong, Junyu Dong, Yajuan Cai
Random forests are a type of ensemble method which makes predictions by combining the results of several independent trees.