no code implementations • 25 May 2024 • Minhak Song, Kwangjun Ahn, Chulhee Yun
This suggests that the observed alignment between the gradient and the dominant subspace is spurious.
1 code implementation • 2 Oct 2023 • Kwangjun Ahn, Xiang Cheng, Minhak Song, Chulhee Yun, Ali Jadbabaie, Suvrit Sra
Transformer training is notoriously difficult, requiring a careful design of optimizers and use of various heuristics.