1 code implementation • ICML 2020 • Xiao Shi Huang, Felipe Perez, Jimmy Ba, Maksims Volkovs
As Transformer models become larger and more expensive to train, recent research has focused on understanding and improving optimization in these models.
1 code implementation • 7 Jun 2022 • Sajad Norouzi, Rasa Hosseinzadeh, Felipe Perez, Maksims Volkovs
The student is optimized to predict the output of the teacher after multiple decoding steps while the teacher follows the student via a slow-moving average.
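The student–teacher loop described in this snippet can be made concrete with a short sketch. Below is a minimal PyTorch illustration of that setup, assuming a generic iterative NAR decoder that maps token IDs to vocabulary logits; all names (`distill_step`, `ema_update`, the toy model) are illustrative, not the paper's actual DiMS implementation, and the greedy re-prediction loop is a simplification of real mask-predict decoding.

```python
import copy

import torch
import torch.nn as nn
import torch.nn.functional as F


def ema_update(teacher: nn.Module, student: nn.Module, decay: float = 0.999) -> None:
    """Teacher follows the student via a slow-moving (exponential) average."""
    with torch.no_grad():
        for t_p, s_p in zip(teacher.parameters(), student.parameters()):
            t_p.mul_(decay).add_(s_p, alpha=1.0 - decay)


def distill_step(student: nn.Module, teacher: nn.Module,
                 tokens: torch.Tensor, num_steps: int = 4) -> torch.Tensor:
    """Train the student to predict, in a single pass, what the teacher
    produces after `num_steps` iterative decoding steps."""
    # Teacher: several refinement steps, no gradients.
    with torch.no_grad():
        refined = tokens
        for _ in range(num_steps):
            target_logits = teacher(refined)        # one refinement step
            refined = target_logits.argmax(dim=-1)  # greedy re-prediction
    # Student: one forward pass, matched to the multi-step teacher output.
    student_logits = student(tokens)
    return F.kl_div(
        F.log_softmax(student_logits, dim=-1),
        F.softmax(target_logits, dim=-1),
        reduction="batchmean",
    )


if __name__ == "__main__":
    # Toy stand-in for an iterative NAR decoder: token IDs in, logits out.
    vocab = 100
    student = nn.Sequential(nn.Embedding(vocab, 32), nn.Linear(32, vocab))
    teacher = copy.deepcopy(student)  # teacher starts as a copy of the student
    tokens = torch.randint(0, vocab, (2, 7))
    loss = distill_step(student, teacher, tokens)
    loss.backward()
    ema_update(teacher, student)      # teacher slowly trails the student
```

In a full training loop, `distill_step` and an optimizer step would run per batch, with `ema_update` applied after each step so the teacher lags the student by a smoothed average.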
no code implementations • ICLR 2022 • Xiao Shi Huang, Felipe Perez, Maksims Volkovs
Empirically, we show that CMLMC achieves state-of-the-art non-autoregressive (NAR) performance when trained on raw data without distillation, and approaches autoregressive (AR) performance on multiple datasets.