Self-Distillation Network with Ensemble Prototypes: Learning Robust Speaker Representations without Supervision

Training speaker-discriminative and robust speaker verification systems without explicit speaker labels remains a persistent challenge. Previous studies have noted a substantial performance gap between self-supervised and fully supervised approaches. In this paper, we propose an effective Self-Distillation network with Ensemble Prototypes (SDEP) to facilitate self-supervised speaker representation learning. It assigns the representations of augmented views of an utterance to the same prototypes as the representation of the original view, thereby enabling effective knowledge transfer between the views. A range of experiments on the VoxCeleb datasets demonstrates the superiority of the SDEP framework for self-supervised speaker verification. SDEP achieves a new state of the art on the VoxCeleb1 speaker verification benchmark (i.e., equal error rates of 1.94%, 1.99%, and 3.77% on the Vox1-O, Vox1-E, and Vox1-H trials, respectively) without using any speaker labels in the training phase.
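To make the prototype-assignment idea concrete, here is a minimal sketch of a self-distillation objective of this kind. This is not the authors' implementation: all names, shapes, and hyperparameters (e.g., the temperatures `t_s` and `t_t`, the prototype count) are assumptions for illustration. A teacher branch scores the original view against a shared set of prototypes, and a student branch is trained so its augmented views receive the same prototype assignments.

```python
import torch
import torch.nn.functional as F

def sdep_style_loss(student_emb, teacher_emb, prototypes, t_s=0.1, t_t=0.04):
    """Illustrative self-distillation loss over shared prototypes.

    student_emb: (B, D) embeddings of augmented views (gradients flow here).
    teacher_emb: (B, D) embeddings of the original views (treated as targets).
    prototypes:  (K, D) learnable prototype vectors shared by both branches.
    """
    proto = F.normalize(prototypes, dim=-1)
    # Cosine similarity of each embedding to every prototype, temperature-scaled.
    s_logits = F.normalize(student_emb, dim=-1) @ proto.T / t_s
    t_logits = F.normalize(teacher_emb, dim=-1) @ proto.T / t_t
    # The teacher's (sharper, stop-gradient) assignments serve as soft targets.
    targets = F.softmax(t_logits, dim=-1).detach()
    # Cross-entropy pulls the augmented views toward the original view's prototypes.
    return -(targets * F.log_softmax(s_logits, dim=-1)).sum(dim=-1).mean()

# Usage example with made-up sizes: 8 utterances, 256-dim embeddings, 1024 prototypes.
student = torch.randn(8, 256, requires_grad=True)
teacher = torch.randn(8, 256)
protos = torch.randn(1024, 256, requires_grad=True)
loss = sdep_style_loss(student, teacher, protos)
loss.backward()
```

In this sketch, the asymmetric temperatures and the stop-gradient on the teacher branch are common choices in self-distillation setups to avoid collapse; the paper's exact mechanism (including how the ensemble of prototypes is formed) may differ.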
