1 code implementation • 17 May 2024 • Lucius Bushnaq, Stefan Heimersheim, Nicholas Goldowsky-Dill, Dan Braun, Jake Mendel, Kaarel Hänni, Avery Griffin, Jörn Stöhler, Magdalena Wache, Marius Hobbhahn
We present a novel interpretability method that transforms a network's activations into a new basis - the Local Interaction Basis (LIB).
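The core move of a change-of-basis method like this can be illustrated with a minimal sketch. This is not the LIB construction itself (which derives the basis from local interactions between layers); it only shows, with a hypothetical random rotation `M`, how transforming activations and compensating in the next layer's weights leaves the computation unchanged:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 4
x = rng.normal(size=d)            # activations at some layer
W_next = rng.normal(size=(3, d))  # weights reading from that layer

# Hypothetical invertible basis transform M. LIB would derive a specific
# basis from the network's local interactions; here M is just a random
# orthogonal matrix standing in for any invertible change of basis.
M = np.linalg.qr(rng.normal(size=(d, d)))[0]

x_new = M @ x                       # activations in the new basis
W_new = W_next @ np.linalg.inv(M)   # compensate in the next layer

# The downstream computation is unchanged by the change of basis.
assert np.allclose(W_next @ x, W_new @ x_new)
```

Any such invertible transform preserves the function exactly; the substance of the method lies in which basis is chosen.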
no code implementations • 17 May 2024 • Lucius Bushnaq, Jake Mendel, Stefan Heimersheim, Dan Braun, Nicholas Goldowsky-Dill, Kaarel Hänni, Cindy Wu, Marius Hobbhahn
We propose that representing a neural network in a way that is invariant to reparameterizations exploiting these degeneracies is likely to make the representation more interpretable, and we provide evidence that such a representation tends to have sparser interactions.
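A standard example of such a reparameterization degeneracy is neuron rescaling in ReLU networks: since ReLU(a·z) = a·ReLU(z) for a > 0, scaling a hidden neuron's incoming weights by a and its outgoing weights by 1/a leaves the network's function exactly unchanged. A minimal sketch (the weight shapes and factors here are illustrative, not from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)
relu = lambda z: np.maximum(z, 0.0)

W1 = rng.normal(size=(5, 3))   # input -> hidden weights
W2 = rng.normal(size=(2, 5))   # hidden -> output weights
x = rng.normal(size=3)

# Rescale each hidden neuron by a positive factor a_i and compensate
# in the next layer. Because ReLU is positively homogeneous, the
# network computes the same function after this reparameterization.
a = rng.uniform(0.5, 2.0, size=5)
W1_re = a[:, None] * W1
W2_re = W2 / a[None, :]

y = W2 @ relu(W1 @ x)
y_re = W2_re @ relu(W1_re @ x)
assert np.allclose(y, y_re)
```

A representation that is invariant to this family of rescalings cannot distinguish the two parameter settings above, which is exactly the property being proposed.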
no code implementations • 10 Oct 2023 • Zhongtian Chen, Edmund Lau, Jake Mendel, Susan Wei, Daniel Murfet
We investigate phase transitions in a Toy Model of Superposition (TMS) using Singular Learning Theory (SLT).
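The Toy Model of Superposition (Elhage et al., 2022) referenced here is a small autoencoder that compresses n sparse features into m < n dimensions and reconstructs them through a tied-weight ReLU readout. A minimal sketch of the model's forward pass and loss, assuming the standard TMS setup (the dimensions, sparsity level, and initialization are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
n, m = 6, 2                     # 6 sparse features, 2 hidden dimensions
W = rng.normal(size=(m, n)) * 0.1
b = np.zeros(n)

def tms_forward(x, W, b):
    # Tied-weight autoencoder: features are linearly compressed to
    # m dimensions, then reconstructed via ReLU(W^T W x + b).
    return np.maximum(W.T @ (W @ x) + b, 0.0)

# Sparse input: each feature is active independently with low probability.
x = rng.uniform(size=n) * (rng.uniform(size=n) < 0.1)
x_hat = tms_forward(x, W, b)
loss = np.sum((x - x_hat) ** 2)  # reconstruction error to be minimized
```

The SLT analysis in the paper concerns how the learned weight configurations of this model change between phases as training progresses; the sketch above shows only the model being analyzed, not the analysis itself.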