no code implementations • 17 Jun 2023 • Zhenxun Zhuang
An algorithm is said to be adaptive to a certain parameter of the problem if it does not need a priori knowledge of that parameter yet performs competitively with algorithms that do know it.
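To make the notion concrete, here is a hedged illustration not taken from this abstract (constants and logarithmic factors are suppressed): for an $L$-smooth convex objective with stochastic gradients of variance $\sigma^2$, an algorithm adaptive to the noise level $\sigma$ attains, without being told $\sigma$, a rate comparable to the optimally tuned one,

$$\mathbb{E}\bigl[f(\bar{x}_T) - f^\star\bigr] \;\le\; O\!\left(\frac{L}{T} + \frac{\sigma}{\sqrt{T}}\right),$$

so in the noiseless case ($\sigma = 0$) it automatically recovers the faster deterministic rate.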
no code implementations • 23 Aug 2022 • Michael Crawshaw, Mingrui Liu, Francesco Orabona, Wei Zhang, Zhenxun Zhuang
We also compare these algorithms with popular optimizers on a set of deep learning tasks, observing that we can match the performance of Adam while beating the others.
1 code implementation • 10 May 2022 • Mingrui Liu, Zhenxun Zhuang, Yunwen Lei, Chunyang Liao
Gradient clipping is usually employed to address this issue in the single-machine setting, but exploring this technique in the distributed setting is still in its infancy: it remains unclear whether the gradient clipping scheme can take advantage of multiple machines to achieve parallel speedup.
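For reference, single-machine gradient clipping rescales a stochastic gradient whose norm exceeds a threshold before the update is applied. The sketch below is illustrative only (the function names, learning rate, and clipping threshold are made up), not the paper's distributed algorithm.

```python
# Minimal sketch of gradient clipping for one SGD step (single machine).
import numpy as np

def clip_by_norm(grad, max_norm):
    """Scale the gradient so its L2 norm is at most max_norm."""
    norm = np.linalg.norm(grad)
    if norm > max_norm:
        grad = grad * (max_norm / norm)
    return grad

def clipped_sgd_step(w, grad, lr=0.1, max_norm=1.0):
    """One SGD update using the clipped stochastic gradient."""
    return w - lr * clip_by_norm(grad, max_norm)

# Example: a large noisy gradient is rescaled before the update.
w = np.zeros(3)
g = np.array([10.0, -8.0, 6.0])   # unclipped norm is about 14.1
w = clipped_sgd_step(w, g, lr=0.1, max_norm=1.0)
```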
1 code implementation • 31 Jan 2022 • Zhenxun Zhuang, Mingrui Liu, Ashok Cutkosky, Francesco Orabona
First, we show how to re-interpret AdamW as an approximation of a proximal gradient method, which takes advantage of the closed-form proximal mapping of the regularizer instead of only utilizing its gradient information as in Adam-$\ell_2$.
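A minimal sketch of the contrast described here, assuming generic Adam hyperparameters (the variable names and constants are not the paper's): Adam-$\ell_2$ folds the decay term into the gradient before the moment estimates, while AdamW applies the decay directly to the weights after the Adam step, in the spirit of a proximal update on $\frac{\lambda}{2}\|w\|^2$.

```python
# Illustrative comparison of Adam-l2 vs. AdamW-style decoupled weight decay.
import numpy as np

def adam_update(w, g, m, v, t, lr=1e-3, b1=0.9, b2=0.999, eps=1e-8,
                wd=0.01, decoupled=False):
    if not decoupled:
        g = g + wd * w                       # Adam-l2: decay enters the gradient
    m = b1 * m + (1 - b1) * g                # first-moment estimate
    v = b2 * v + (1 - b2) * g * g            # second-moment estimate
    m_hat = m / (1 - b1 ** t)                # bias corrections
    v_hat = v / (1 - b2 ** t)
    w = w - lr * m_hat / (np.sqrt(v_hat) + eps)
    if decoupled:
        w = w - lr * wd * w                  # AdamW: decay applied to w directly,
                                             # like a proximal step on (wd/2)*||w||^2
    return w, m, v
```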
2 code implementations • 12 Feb 2020 • Xiaoyu Li, Zhenxun Zhuang, Francesco Orabona
Moreover, we show the surprising property that these two strategies are \emph{adaptive} to the noise level in the stochastic gradients of PL functions.
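For context, a function $f$ with minimum value $f^\star$ satisfies the Polyak-Łojasiewicz (PL) condition with parameter $\mu > 0$ if

$$\frac{1}{2}\,\|\nabla f(x)\|^2 \;\ge\; \mu\,\bigl(f(x) - f^\star\bigr) \quad \text{for all } x,$$

which guarantees that every stationary point is a global minimizer without requiring convexity.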
no code implementations • 22 Oct 2019 • Zhenxun Zhuang, Yunlong Wang, Kezi Yu, Songtao Lu
The online meta-learning framework is designed for the continual lifelong learning setting.
1 code implementation • 25 Jan 2019 • Zhenxun Zhuang, Ashok Cutkosky, Francesco Orabona
Stochastic Gradient Descent (SGD) has played a central role in machine learning.