Search Results for author: Yanmin Qian

Found 37 papers, 10 papers with code

GSTalker: Real-time Audio-Driven Talking Face Generation via Deformable Gaussian Splatting

no code implementations • 29 Apr 2024 • Bo Chen, Shoukang Hu, Qi Chen, Chenpeng Du, Ran Yi, Yanmin Qian, Xie Chen

We present GStalker, a 3D audio-driven talking face generation model with Gaussian Splatting for both fast training (40 minutes) and real-time rendering (125 FPS) with a 3$\sim$5 minute video for training material, in comparison with previous 2D and 3D NeRF-based modeling frameworks which require hours of training and seconds of rendering per frame.

Talking Face Generation

Paper
Add Code

CoVoMix: Advancing Zero-Shot Speech Generation for Human-like Multi-talker Conversations

no code implementations • 10 Apr 2024 • Leying Zhang, Yao Qian, Long Zhou, Shujie Liu, Dongmei Wang, Xiaofei Wang, Midia Yousefi, Yanmin Qian, Jinyu Li, Lei He, Sheng Zhao, Michael Zeng

CoVoMix is capable of first converting dialogue text into multiple streams of discrete tokens, with each token stream representing semantic information for individual talkers.

Dialogue Generation

Paper
Add Code

Improving Design of Input Condition Invariant Speech Enhancement

1 code implementation • 25 Jan 2024 • Wangyou Zhang, Jee-weon Jung, Shinji Watanabe, Yanmin Qian

In this paper we propose novel architectures to improve the input condition invariant SE model so that performance in simulated conditions remains competitive while real condition degradation is much mitigated.

Speech Enhancement

7,962

Paper
Code

Prompt-driven Target Speech Diarization

no code implementations • 23 Oct 2023 • Yidi Jiang, Zhengyang Chen, Ruijie Tao, Liqun Deng, Yanmin Qian, Haizhou Li

We introduce a novel task named `target speech diarization', which seeks to determine `when target event occurred' within an audio signal.

Action Detection Activity Detection

Paper
Add Code

One-Shot Sensitivity-Aware Mixed Sparsity Pruning for Large Language Models

1 code implementation • 14 Oct 2023 • Hang Shao, Bei Liu, Bo Xiao, Ke Zeng, Guanglu Wan, Yanmin Qian

Various Large Language Models~(LLMs) from the Generative Pretrained Transformer(GPT) family have achieved outstanding performances in a wide range of text generation tasks.

Quantization Text Generation

Paper
Code

Toward Universal Speech Enhancement for Diverse Input Conditions

no code implementations • 29 Sep 2023 • Wangyou Zhang, Kohei Saijo, Zhong-Qiu Wang, Shinji Watanabe, Yanmin Qian

Currently, there is no universal SE approach that can effectively handle diverse input conditions with a single model.

Denoising Speech Enhancement

Paper
Add Code

Diffusion Conditional Expectation Model for Efficient and Robust Target Speech Extraction

no code implementations • 25 Sep 2023 • Leying Zhang, Yao Qian, Linfeng Yu, Heming Wang, Xinkai Wang, Hemin Yang, Long Zhou, Shujie Liu, Yanmin Qian, Michael Zeng

Additionally, we introduce Regenerate-DCEM (R-DCEM) that can regenerate and optimize speech quality based on pre-processed speech from a discriminative model.

Speech Extraction

Paper
Add Code

Leveraging In-the-Wild Data for Effective Self-Supervised Pretraining in Speaker Recognition

1 code implementation • 21 Sep 2023 • Shuai Wang, Qibing Bai, Qi Liu, Jianwei Yu, Zhengyang Chen, Bing Han, Yanmin Qian, Haizhou Li

Current speaker recognition systems primarily rely on supervised approaches, constrained by the scale of labeled datasets.

Speaker Recognition

566

Paper
Code

InstructME: An Instruction Guided Music Edit And Remix Framework with Latent Diffusion Models

no code implementations • 28 Aug 2023 • Bing Han, Junyu Dai, Weituo Hao, Xinyan He, Dong Guo, Jitong Chen, Yuxuan Wang, Yanmin Qian, Xuchen Song

We tested InstructME in instrument-editing, remixing, and multi-round editing.

Paper
Add Code

Exploring the Integration of Speech Separation and Recognition with Self-Supervised Learning Representation

no code implementations • 23 Jul 2023 • Yoshiki Masuyama, Xuankai Chang, Wangyou Zhang, Samuele Cornell, Zhong-Qiu Wang, Nobutaka Ono, Yanmin Qian, Shinji Watanabe

In detail, we explore multi-channel separation methods, mask-based beamforming and complex spectral mapping, as well as the best features to use in the ASR back-end model.

Automatic Speech Recognition Automatic Speech Recognition (ASR) +4

Paper
Add Code

Exploring Binary Classification Loss For Speaker Verification

1 code implementation • 17 Jul 2023 • Bing Han, Zhengyang Chen, Yanmin Qian

The mismatch between close-set training and open-set testing usually leads to significant performance degradation for speaker verification task.

Binary Classification Classification +2

Paper
Code

Adapting Multi-Lingual ASR Models for Handling Multiple Talkers

no code implementations • 30 May 2023 • Chenda Li, Yao Qian, Zhuo Chen, Naoyuki Kanda, Dongmei Wang, Takuya Yoshioka, Yanmin Qian, Michael Zeng

State-of-the-art large-scale universal speech models (USMs) show a decent automatic speech recognition (ASR) performance across multiple domains and languages.

Automatic Speech Recognition Automatic Speech Recognition (ASR) +1

Paper
Add Code

Weakly-Supervised Speech Pre-training: A Case Study on Target Speech Recognition

no code implementations • 25 May 2023 • Wangyou Zhang, Yanmin Qian

Self-supervised learning (SSL) based speech pre-training has attracted much attention for its capability of extracting rich representations learned from massive unlabeled data.

Denoising Self-Supervised Learning +2

Paper
Add Code

ComSL: A Composite Speech-Language Model for End-to-End Speech-to-Text Translation

1 code implementation • NeurIPS 2023 • Chenyang Le, Yao Qian, Long Zhou, Shujie Liu, Yanmin Qian, Michael Zeng, Xuedong Huang

Joint speech-language training is challenging due to the large demand for training data and GPU consumption, as well as the modality gap between speech and language.

Language Modelling Multi-Task Learning +2

Paper
Code

Whisper-KDQ: A Lightweight Whisper via Guided Knowledge Distillation and Quantization for Efficient ASR

no code implementations • 18 May 2023 • Hang Shao, Wei Wang, Bei Liu, Xun Gong, Haoyu Wang, Yanmin Qian

Due to the rapid development of computing hardware resources and the dramatic growth of data, pre-trained models in speech recognition, such as Whisper, have significantly improved the performance of speech recognition tasks.

Knowledge Distillation Quantization +2

Paper
Add Code

Code-Switching Text Generation and Injection in Mandarin-English ASR

no code implementations • 20 Mar 2023 • Haibin Yu, Yuxuan Hu, Yao Qian, Ma Jin, Linquan Liu, Shujie Liu, Yu Shi, Yanmin Qian, Edward Lin, Michael Zeng

Code-switching speech refers to a means of expression by mixing two or more languages within a single utterance.

Automatic Speech Recognition Automatic Speech Recognition (ASR) +2

Paper
Add Code

Target Sound Extraction with Variable Cross-modality Clues

1 code implementation • 15 Mar 2023 • Chenda Li, Yao Qian, Zhuo Chen, Dongmei Wang, Takuya Yoshioka, Shujie Liu, Yanmin Qian, Michael Zeng

Automatic target sound extraction (TSE) is a machine learning approach to mimic the human auditory perception capability of attending to a sound source of interest from a mixture of sources.

AudioCaps Target Sound Extraction

Paper
Code

LongFNT: Long-form Speech Recognition with Factorized Neural Transducer

no code implementations • 17 Nov 2022 • Xun Gong, Yu Wu, Jinyu Li, Shujie Liu, Rui Zhao, Xie Chen, Yanmin Qian

This motivates us to leverage the factorized neural transducer structure, containing a real language model, the vocabulary predictor.

Automatic Speech Recognition Automatic Speech Recognition (ASR) +3

Paper
Add Code

The Conversational Short-phrase Speaker Diarization (CSSD) Task: Dataset, Evaluation Metric and Baselines

1 code implementation • 17 Aug 2022 • Gaofeng Cheng, Yifan Chen, Runyan Yang, Qingxuan Li, Zehui Yang, Lingxuan Ye, Pengyuan Zhang, Qingqing Zhang, Lei Xie, Yanmin Qian, Kong Aik Lee, Yonghong Yan

In the metric aspect, we design the new conversational DER (CDER) evaluation metric, which calculates the SD accuracy at the utterance level.

Machine Translation speaker-diarization +1

Paper
Code

ESPnet-SE++: Speech Enhancement for Robust Speech Recognition, Translation, and Understanding

1 code implementation • 19 Jul 2022 • Yen-Ju Lu, Xuankai Chang, Chenda Li, Wangyou Zhang, Samuele Cornell, Zhaoheng Ni, Yoshiki Masuyama, Brian Yan, Robin Scheibler, Zhong-Qiu Wang, Yu Tsao, Yanmin Qian, Shinji Watanabe

To showcase such integration, we performed experiments on carefully designed synthetic datasets for noisy-reverberant multi-channel ST and SLU tasks, which can be used as benchmark corpora for future research.

Automatic Speech Recognition Automatic Speech Recognition (ASR) +5

7,962

Paper
Code

End-to-End Multi-speaker ASR with Independent Vector Analysis

no code implementations • 1 Apr 2022 • Robin Scheibler, Wangyou Zhang, Xuankai Chang, Shinji Watanabe, Yanmin Qian

We develop an end-to-end system for multi-channel, multi-speaker automatic speech recognition.

Automatic Speech Recognition Automatic Speech Recognition (ASR) +1

Paper
Add Code

SkiM: Skipping Memory LSTM for Low-Latency Real-Time Continuous Speech Separation

no code implementations • 26 Jan 2022 • Chenda Li, Lei Yang, Weiqin Wang, Yanmin Qian

We adopt the time-domain speech separation method and the recently proposed Graph-PIT to build a super low-latency online speech separation model, which is very important for the real application.

Speech Separation

Paper
Add Code

Separating Long-Form Speech with Group-Wise Permutation Invariant Training

no code implementations • 27 Oct 2021 • Wangyou Zhang, Zhuo Chen, Naoyuki Kanda, Shujie Liu, Jinyu Li, Sefik Emre Eskimez, Takuya Yoshioka, Xiong Xiao, Zhong Meng, Yanmin Qian, Furu Wei

Multi-talker conversational speech processing has drawn many interests for various applications such as meeting transcription.

Speech Separation

Paper
Add Code

Closing the Gap Between Time-Domain Multi-Channel Speech Enhancement on Real and Simulation Conditions

no code implementations • 27 Oct 2021 • Wangyou Zhang, Jing Shi, Chenda Li, Shinji Watanabe, Yanmin Qian

The deep learning based time-domain models, e. g. Conv-TasNet, have shown great potential in both single-channel and multi-channel speech enhancement.

Speech Enhancement speech-recognition +1

Paper
Add Code

WavLM: Large-Scale Self-Supervised Pre-Training for Full Stack Speech Processing

5 code implementations • 26 Oct 2021 • Sanyuan Chen, Chengyi Wang, Zhengyang Chen, Yu Wu, Shujie Liu, Zhuo Chen, Jinyu Li, Naoyuki Kanda, Takuya Yoshioka, Xiong Xiao, Long Zhou, Shuo Ren, Yanmin Qian, Yao Qian, Jian Wu, Michael Zeng, Xiangzhan Yu, Furu Wei

Self-supervised learning (SSL) achieves great success in speech recognition, while limited exploration has been attempted for other speech processing tasks.

Denoising Self-Supervised Learning +3

18,626

Paper
Code

End-to-End Dereverberation, Beamforming, and Speech Recognition with Improved Numerical Stability and Advanced Frontend

no code implementations • 23 Feb 2021 • Wangyou Zhang, Christoph Boeddeker, Shinji Watanabe, Tomohiro Nakatani, Marc Delcroix, Keisuke Kinoshita, Tsubasa Ochiai, Naoyuki Kamo, Reinhold Haeb-Umbach, Yanmin Qian

Recently, the end-to-end approach has been successfully applied to multi-speaker speech separation and recognition in both single-channel and multichannel conditions.

Action Detection Activity Detection +4

Paper
Add Code

Dual-Path Modeling for Long Recording Speech Separation in Meetings

no code implementations • 23 Feb 2021 • Chenda Li, Zhuo Chen, Yi Luo, Cong Han, Tianyan Zhou, Keisuke Kinoshita, Marc Delcroix, Shinji Watanabe, Yanmin Qian

A transformer-based dual-path system is proposed, which integrates transform layers for global modeling.

Speech Separation

Paper
Add Code

Data Augmentation for End-to-end Code-switching Speech Recognition

no code implementations • 4 Nov 2020 • Chenpeng Du, Hao Li, Yizhou Lu, Lan Wang, Yanmin Qian

Training a code-switching end-to-end automatic speech recognition (ASR) model normally requires a large amount of data, while code-switching data is often limited.

Automatic Speech Recognition Automatic Speech Recognition (ASR) +4

Paper
Add Code

Future Vector Enhanced LSTM Language Model for LVCSR

no code implementations • 31 Jul 2020 • Qi Liu, Yanmin Qian, Kai Yu

For the speech recognition rescoring, although the proposed LSTM LM obtains very slight gains, the new model seems obtain the great complementary with the conventional LSTM LM.

Language Modelling speech-recognition +1

Paper
Add Code

End-to-End Multi-speaker Speech Recognition with Transformer

no code implementations • 10 Feb 2020 • Xuankai Chang, Wangyou Zhang, Yanmin Qian, Jonathan Le Roux, Shinji Watanabe

Recently, fully recurrent neural network (RNN) based end-to-end models have been proven to be effective for multi-speaker speech recognition in both the single-channel and multi-channel scenarios.

Decoder speech-recognition +1

Paper
Add Code

MIMO-SPEECH: End-to-End Multi-Channel Multi-Speaker Speech Recognition

no code implementations • 15 Oct 2019 • Xuankai Chang, Wangyou Zhang, Yanmin Qian, Jonathan Le Roux, Shinji Watanabe

In this work, we propose a novel neural sequence-to-sequence (seq2seq) architecture, MIMO-Speech, which extends the original seq2seq to deal with multi-channel input and multi-channel output so that it can fully model multi-channel multi-speaker speech separation and recognition.

speech-recognition Speech Recognition +1

Paper
Add Code

Margin Matters: Towards More Discriminative Deep Neural Network Embeddings for Speaker Recognition

no code implementations • 18 Jun 2019 • Xu Xiang, Shuai Wang, Houjun Huang, Yanmin Qian, Kai Yu

The proposed approach can achieve the state-of-the-art performance, with 25% ~ 30% equal error rate (EER) reduction on both tasks when compared to strong baselines using cross entropy loss with softmax, obtaining 2. 238% EER on VoxCeleb1 test set and 2. 761% EER on SITW core-core test set, respectively.

Speaker Recognition

Paper
Add Code

End-to-End Monaural Multi-speaker ASR System without Pretraining

no code implementations • 5 Nov 2018 • Xuankai Chang, Yanmin Qian, Kai Yu, Shinji Watanabe

The experiments demonstrate that the proposed methods can improve the performance of the end-to-end model in separating the overlapping speech and recognizing the separated streams.

Automatic Speech Recognition Automatic Speech Recognition (ASR) +2

Paper
Add Code

Sequence Discriminative Training for Deep Learning based Acoustic Keyword Spotting

no code implementations • 2 Aug 2018 • Zhehuai Chen, Yanmin Qian, Kai Yu

The few studies on sequence discriminative training for KWS are limited for fixed vocabulary or LVCSR based methods and have not been compared to the state-of-the-art deep learning based KWS approaches.

Keyword Spotting speech-recognition +1

Paper
Add Code

Single-Channel Multi-talker Speech Recognition with Permutation Invariant Training

no code implementations • 19 Jul 2017 • Yanmin Qian, Xuankai Chang, Dong Yu

Although great progresses have been made in automatic speech recognition (ASR), significant performance degradation is still observed when recognizing multi-talker mixed speech.

Automatic Speech Recognition Automatic Speech Recognition (ASR) +2

Paper
Add Code

Recognizing Multi-talker Speech with Permutation Invariant Training

no code implementations • 22 Mar 2017 • Dong Yu, Xuankai Chang, Yanmin Qian

Our technique is based on permutation invariant training (PIT) for automatic speech recognition (ASR).

Automatic Speech Recognition Automatic Speech Recognition (ASR) +1

Paper
Add Code

Very Deep Convolutional Neural Networks for Robust Speech Recognition

2 code implementations • 2 Oct 2016 • Yanmin Qian, Philip C. Woodland

On the Aurora 4 task, the very deep CNN achieves a WER of 8. 81%, further 7. 99% with auxiliary feature joint training, and 7. 09% with LSTM-RNN joint decoding.

Robust Speech Recognition speech-recognition

Paper
Code

Cannot find the paper you are looking for? You can Submit a new open access paper.