no code implementations • 7 Dec 2023 • Vimal Thilak, Chen Huang, Omid Saremi, Laurent Dinh, Hanlin Goh, Preetum Nakkiran, Joshua M. Susskind, Etai Littwin
In this paper, we introduce LiDAR (Linear Discriminant Analysis Rank), a metric designed to measure the quality of representations within joint embedding (JE) architectures.
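For intuition, here is a minimal sketch of an LDA-based effective-rank computation in the spirit of LiDAR; the grouping of embeddings into per-sample "classes" of augmented views and the entropy-based effective rank are illustrative assumptions, not the paper's exact recipe.

```python
# Hedged sketch: LDA-based effective rank of representations (LiDAR-like metric).
import torch

def lda_effective_rank(embeddings: torch.Tensor, eps: float = 1e-4) -> float:
    """embeddings: (num_classes, views_per_class, dim); each class = one clean sample + its views."""
    n_cls, n_views, d = embeddings.shape
    class_means = embeddings.mean(dim=1)                     # (n_cls, d)
    grand_mean = class_means.mean(dim=0, keepdim=True)       # (1, d)

    # Between-class and within-class scatter matrices.
    centered_means = class_means - grand_mean
    s_b = centered_means.T @ centered_means / n_cls
    within = (embeddings - class_means.unsqueeze(1)).reshape(-1, d)
    s_w = within.T @ within / (n_cls * n_views) + eps * torch.eye(d)

    # Whiten S_b by S_w and take the eigenvalue spectrum of the LDA matrix.
    evals_w, evecs_w = torch.linalg.eigh(s_w)
    s_w_inv_sqrt = evecs_w @ torch.diag(evals_w.clamp(min=eps).rsqrt()) @ evecs_w.T
    sigma = s_w_inv_sqrt @ s_b @ s_w_inv_sqrt
    evals = torch.linalg.eigvalsh(sigma).clamp(min=0)

    # Effective rank: exponential of the entropy of the normalized spectrum.
    p = evals / evals.sum().clamp(min=eps)
    entropy = -(p * (p + eps).log()).sum()
    return float(entropy.exp())

# Example: random embeddings for 100 "classes" with 8 augmented views each.
print(lda_effective_rank(torch.randn(100, 8, 64)))
```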
1 code implementation • 31 Oct 2023 • Noam Razin, Hattie Zhou, Omid Saremi, Vimal Thilak, Arwen Bradley, Preetum Nakkiran, Joshua Susskind, Etai Littwin
Pretrained language models are commonly aligned with human preferences and downstream tasks via reinforcement finetuning (RFT), which refers to maximizing a (possibly learned) reward function using policy gradient algorithms.
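As a rough illustration of the reward-maximization step, here is a hedged sketch of one REINFORCE-style RFT update; the model, reward_fn, and the HuggingFace-style generate/logits interface are assumptions for illustration, not details from the paper (padding/EOS handling is omitted for brevity).

```python
# Hedged sketch: one vanilla policy-gradient (REINFORCE) finetuning step.
import torch

def rft_step(model, input_ids, reward_fn, optimizer, max_new_tokens=32):
    # Sample a continuation from the current policy.
    model.eval()
    with torch.no_grad():
        gen = model.generate(input_ids, do_sample=True, max_new_tokens=max_new_tokens)

    # Score the full sequence with the (possibly learned) reward function.
    reward = reward_fn(gen)                                   # (batch,)

    # Recompute log-probs of the sampled tokens under the policy.
    model.train()
    logits = model(gen).logits[:, :-1, :]                     # predict gen[:, 1:]
    log_probs = torch.log_softmax(logits, dim=-1)
    token_logp = log_probs.gather(-1, gen[:, 1:].unsqueeze(-1)).squeeze(-1)
    response_logp = token_logp[:, input_ids.shape[1] - 1:].sum(dim=-1)

    # REINFORCE objective: maximize E[reward * log pi(response | prompt)].
    loss = -(reward.detach() * response_logp).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```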
no code implementations • 24 Oct 2023 • Hattie Zhou, Arwen Bradley, Etai Littwin, Noam Razin, Omid Saremi, Josh Susskind, Samy Bengio, Preetum Nakkiran
Large language models exhibit surprising emergent generalization properties, yet also struggle on many simple reasoning tasks such as arithmetic and parity.
1 code implementation • 15 Oct 2023 • Enric Boix-Adsera, Omid Saremi, Emmanuel Abbe, Samy Bengio, Etai Littwin, Joshua Susskind
We investigate the capabilities of transformer models on relational reasoning tasks.
no code implementations • 13 Oct 2023 • Samira Abnar, Omid Saremi, Laurent Dinh, Shantel Wilson, Miguel Angel Bautista, Chen Huang, Vimal Thilak, Etai Littwin, Jiatao Gu, Josh Susskind, Samy Bengio
We investigate how the use of a mechanism for adaptive and modular computation in transformers facilitates the learning of tasks that demand generalization over the number of sequential computation steps (i.e., the depth of the computation graph).
no code implementations • 3 Aug 2023 • Greg Yang, Etai Littwin
Going beyond stochastic gradient descent (SGD), what new phenomena emerge in wide neural networks trained by adaptive optimizers like Adam?
no code implementations • NeurIPS 2023 • Enric Boix-Adsera, Etai Littwin, Emmanuel Abbe, Samy Bengio, Joshua Susskind
Our experiments support the theory and also show that the phenomenon can occur in practice without the simplifying assumptions.
no code implementations • 22 May 2023 • Enric Boix-Adsera, Etai Littwin
We study when the neural tangent kernel (NTK) approximation is valid for training a model with the square loss.
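For concreteness, here is a small sketch of the empirical NTK that this approximation linearizes around: the Gram matrix of per-example parameter gradients of a scalar-output network. The toy MLP and data are illustrative only.

```python
# Hedged sketch: empirical NTK, K[i, j] = <grad_theta f(x_i), grad_theta f(x_j)>.
import torch
import torch.nn as nn

def empirical_ntk(model: nn.Module, xs: torch.Tensor) -> torch.Tensor:
    grads = []
    for x in xs:
        out = model(x.unsqueeze(0)).squeeze()
        g = torch.autograd.grad(out, model.parameters())
        grads.append(torch.cat([p.reshape(-1) for p in g]))
    J = torch.stack(grads)          # (n, num_params) Jacobian of f at the data
    return J @ J.T                  # (n, n) empirical NTK

model = nn.Sequential(nn.Linear(10, 512), nn.ReLU(), nn.Linear(512, 1))
K = empirical_ntk(model, torch.randn(8, 10))
print(K.shape)  # torch.Size([8, 8])
```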
1 code implementation • 11 Mar 2023 • Shuangfei Zhai, Tatiana Likhomanenko, Etai Littwin, Dan Busbridge, Jason Ramapuram, Yizhe Zhang, Jiatao Gu, Josh Susskind
We show that $\sigma$Reparam provides stability and robustness with respect to the choice of hyperparameters, going so far as enabling training of (a) a Vision Transformer to competitive performance without warmup, weight decay, layer normalization, or adaptive optimizers; (b) deep architectures in machine translation; and (c) speech recognition models to competitive performance without warmup and adaptive optimizers.
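A rough sketch of the reparameterization idea follows; the details here (spectral norm estimated by one power-iteration step per forward pass, a learnable scalar gamma initialized to 1, and the initialization scheme) are assumptions for illustration rather than the paper's exact specification.

```python
# Hedged sketch: a spectrally reparameterized linear layer, W_hat = (gamma / sigma(W)) * W.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SigmaReparamLinear(nn.Module):
    def __init__(self, in_features: int, out_features: int, bias: bool = True):
        super().__init__()
        self.weight = nn.Parameter(torch.empty(out_features, in_features))
        nn.init.xavier_uniform_(self.weight)
        self.bias = nn.Parameter(torch.zeros(out_features)) if bias else None
        self.gamma = nn.Parameter(torch.ones(1))               # learnable scale
        self.register_buffer("u", F.normalize(torch.randn(out_features), dim=0))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        with torch.no_grad():                                   # one power-iteration step
            v = F.normalize(self.weight.T @ self.u, dim=0)
            self.u = F.normalize(self.weight @ v, dim=0)
        sigma = torch.dot(self.u, self.weight @ v)              # spectral norm estimate
        w_hat = (self.gamma / sigma) * self.weight
        return F.linear(x, w_hat, self.bias)

layer = SigmaReparamLinear(64, 64)
print(layer(torch.randn(2, 64)).shape)  # torch.Size([2, 64])
```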
no code implementations • 10 Jun 2022 • Vimal Thilak, Etai Littwin, Shuangfei Zhai, Omid Saremi, Roni Paiss, Joshua Susskind
While common and easily reproduced in more general settings, the Slingshot Mechanism does not follow from any optimization theory we are aware of, and can be easily overlooked without an in-depth examination.
no code implementations • ICLR 2022 • Ruixiang Zhang, Shuangfei Zhai, Etai Littwin, Josh Susskind
We show that the low-rank approximation of NFKs derived from unsupervised generative models and supervised learning models gives rise to high-quality compact representations of data, achieving competitive results on a variety of machine learning tasks.
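A hedged sketch of how a low-rank kernel approximation can yield compact per-example representations: a rank-r truncated eigendecomposition of the kernel matrix. The random PSD matrix below merely stands in for a kernel evaluated on a dataset, and this construction is an illustrative choice, not the paper's exact procedure.

```python
# Hedged sketch: kernel matrix -> rank-r per-example features via truncated eigendecomposition.
import torch

def kernel_to_features(K: torch.Tensor, rank: int) -> torch.Tensor:
    evals, evecs = torch.linalg.eigh(K)                        # ascending eigenvalues
    top = torch.argsort(evals, descending=True)[:rank]
    return evecs[:, top] * evals[top].clamp(min=0).sqrt()      # (n, rank) features

A = torch.randn(100, 30)
K = A @ A.T                                                    # stand-in PSD kernel on 100 examples
feats = kernel_to_features(K, rank=8)
print(feats.shape)  # torch.Size([100, 8])
```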
no code implementations • 2 Jul 2021 • Shih-Yu Sun, Vimal Thilak, Etai Littwin, Omid Saremi, Joshua M. Susskind
Deep linear networks trained with gradient descent yield low-rank solutions, as is typically studied in the context of matrix factorization.
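A small illustrative experiment of this implicit bias, with arbitrary sizes, initialization scale, and step counts chosen for the sketch: fit a product of three square matrices to a rank-2 target with gradient descent, then inspect the singular values of the learned product, whose trailing values are expected to be much smaller than the leading two.

```python
# Hedged sketch: implicit low-rank bias of a 3-layer deep linear network under GD.
import torch

d, r = 16, 2
target = torch.randn(d, r) @ torch.randn(r, d)                  # rank-2 target matrix
layers = [(0.1 * torch.randn(d, d)).requires_grad_() for _ in range(3)]
opt = torch.optim.SGD(layers, lr=5e-3)

for step in range(3000):
    W = layers[2] @ layers[1] @ layers[0]                       # end-to-end linear map
    loss = 0.5 * ((W - target) ** 2).sum()
    opt.zero_grad()
    loss.backward()
    opt.step()

svals = torch.linalg.svdvals(layers[2] @ layers[1] @ layers[0])
print(svals)   # expected: ~2 dominant singular values, the rest much smaller
```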
no code implementations • 1 Jul 2021 • Etai Littwin, Omid Saremi, Shuangfei Zhai, Vimal Thilak, Hanlin Goh, Joshua M. Susskind, Greg Yang
We analyze the learning dynamics of infinitely wide neural networks with a finite-sized bottleneck.
no code implementations • 8 May 2021 • Greg Yang, Etai Littwin
To facilitate this proof, we develop a graphical notation for Tensor Programs.
no code implementations • NeurIPS 2020 • Etai Littwin, Ben Myara, Sima Sabah, Joshua Susskind, Shuangfei Zhai, Oren Golan
Modern neural network performance typically improves as model size increases.
1 code implementation • NeurIPS 2020 • Etai Littwin, Tomer Galanti, Lior Wolf, Greg Yang
{\em Hypernetworks} are architectures that produce the weights of a task-specific {\em primary network}.
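A minimal sketch of the idea: a generator MLP maps a task embedding to the flattened weights of a small primary network, which is then applied functionally to the input. All sizes are illustrative assumptions.

```python
# Hedged sketch: a hypernetwork generating the weights of a one-hidden-layer primary network.
import torch
import torch.nn as nn
import torch.nn.functional as F

class HyperNet(nn.Module):
    def __init__(self, task_dim=8, in_dim=4, hidden=16, out_dim=1):
        super().__init__()
        self.shapes = [(hidden, in_dim), (hidden,), (out_dim, hidden), (out_dim,)]
        n_primary = sum(torch.Size(s).numel() for s in self.shapes)
        self.generator = nn.Sequential(nn.Linear(task_dim, 64), nn.ReLU(),
                                       nn.Linear(64, n_primary))

    def forward(self, task_emb: torch.Tensor, x: torch.Tensor) -> torch.Tensor:
        flat = self.generator(task_emb)                  # all primary weights at once
        params, i = [], 0
        for s in self.shapes:
            n = torch.Size(s).numel()
            params.append(flat[i:i + n].reshape(s))
            i += n
        w1, b1, w2, b2 = params
        h = F.relu(F.linear(x, w1, b1))                  # primary network forward pass
        return F.linear(h, w2, b2)

hyper = HyperNet()
print(hyper(torch.randn(8), torch.randn(5, 4)).shape)    # torch.Size([5, 1])
```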
no code implementations • 28 Jan 2020 • Etai Littwin, Tomer Galanti, Lior Wolf
We derive finite width and depth corrections for the Neural Tangent Kernel (NTK) of ResNets and DenseNets.
no code implementations • ICML 2019 Workshop on Identifying and Understanding Deep Learning Phenomena • Etai Littwin, Lior Wolf
The Hessian of neural networks can be decomposed into a sum of two matrices: (i) the positive semidefinite generalized Gauss-Newton matrix G, and (ii) the matrix H, which can contain negative eigenvalues.
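For a scalar-output network under the squared loss, this decomposition can be checked directly: the loss Hessian equals the rank-one Gauss-Newton term plus the residual times the network's own second derivatives. The tiny two-layer network below is an illustrative assumption.

```python
# Hedged sketch: verify Hessian = G + H for loss = 0.5 * (f(theta) - y)^2.
import torch
from torch.autograd.functional import hessian, jacobian

x, y = torch.randn(4), torch.tensor(1.0)

def f(theta):                                   # scalar-output 2-layer net, widths 4 -> 3 -> 1
    w1, w2 = theta[:12].reshape(3, 4), theta[12:15]
    return torch.dot(w2, torch.tanh(w1 @ x))

def loss(theta):
    return 0.5 * (f(theta) - y) ** 2

theta0 = torch.randn(15)
grad_f = jacobian(f, theta0)                    # (15,)
G = torch.outer(grad_f, grad_f)                 # Gauss-Newton term (PSD, rank 1 here)
H = (f(theta0) - y) * hessian(f, theta0)        # curvature of the network itself
full = hessian(loss, theta0)
print(torch.allclose(full, G + H, atol=1e-4))   # True: the two terms recover the Hessian
```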
no code implementations • 25 Sep 2019 • Etai Littwin, Lior Wolf
A critical part of the training process of neural networks takes place in the very first gradient steps post initialization.
no code implementations • NeurIPS 2018 • Etai Littwin, Lior Wolf
Normalization techniques play an important role in supporting efficient and often more effective training of deep neural networks.
no code implementations • 8 Nov 2016 • Etai Littwin, Lior Wolf
Deep Residual Networks offer a clear performance advantage over conventional networks of the same depth and remain trainable at extreme depths.
no code implementations • CVPR 2016 • Etai Littwin, Lior Wolf
Deep learning techniques are renowned for supporting effective transfer learning.
no code implementations • CVPR 2015 • Etai Littwin, Hadar Averbuch-Elor, Daniel Cohen-Or
In this paper, we introduce a spherical embedding technique to position a given set of silhouettes of an object as observed from a set of cameras arbitrarily positioned around the object.