An Exploration of Mimic Architectures for Residual Network Based Spectral Mapping

25 Sep 2018  ·  Peter Plantinga, Deblin Bagchi, Eric Fosler-Lussier ·

Spectral mapping uses a deep neural network (DNN) to map directly from noisy speech to clean speech. Our previous study found that the performance of spectral mapping improves greatly when using helpful cues from an acoustic model trained on clean speech. The mapper network learns to mimic the input favored by the spectral classifier and cleans the features accordingly. In this study, we explore two new innovations: we replace a DNN-based spectral mapper with a residual network that is more attuned to the goal of predicting clean speech. We also examine how integrating long term context in the mimic criterion (via wide-residual biLSTM networks) affects the performance of spectral mapping compared to DNNs. Our goal is to derive a model that can be used as a preprocessor for any recognition system; the features derived from our model are passed through the standard Kaldi ASR pipeline and achieve a WER of 9.3%, which is the lowest recorded word error rate for CHiME-2 dataset using only feature adaptation.

PDF Abstract


  Add Datasets introduced or used in this paper