no code implementations • 24 Apr 2024 • Senthooran Rajamanoharan, Arthur Conmy, Lewis Smith, Tom Lieberum, Vikrant Varma, János Kramár, Rohin Shah, Neel Nanda
Recent work has found that sparse autoencoders (SAEs) are an effective technique for unsupervised discovery of interpretable features in language model (LM) activations, by finding sparse, linear reconstructions of those activations.
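As a rough illustration of the core idea only (not the paper's architecture or hyperparameters), a minimal SAE can be sketched in PyTorch; the dimensions, L1 coefficient, and random activation batch below are illustrative assumptions.

```python
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    """Minimal SAE: reconstructs activations as a sparse, non-negative
    combination of learned dictionary directions."""
    def __init__(self, d_model: int, d_hidden: int):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_hidden)
        self.decoder = nn.Linear(d_hidden, d_model)

    def forward(self, acts: torch.Tensor):
        features = torch.relu(self.encoder(acts))  # sparse feature activations
        recon = self.decoder(features)             # linear reconstruction
        return recon, features

# Illustrative training step on a batch of activations (assumed shapes/values).
sae = SparseAutoencoder(d_model=512, d_hidden=4096)
acts = torch.randn(64, 512)                        # stand-in for real LM activations
recon, features = sae(acts)
l1_coeff = 1e-3                                    # sparsity penalty strength (assumed)
loss = (recon - acts).pow(2).mean() + l1_coeff * features.abs().sum(dim=-1).mean()
loss.backward()
```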
no code implementations • 23 Apr 2024 • Stefan Heimersheim, Neel Nanda
Activation patching is a popular mechanistic interpretability technique, but has many subtleties regarding how it is applied and how one may interpret the results.
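A minimal sketch of the basic activation-patching recipe, using a toy PyTorch module with forward hooks in place of a real language model; the model, patched site, and inputs are illustrative assumptions, not the paper's setup.

```python
import torch
import torch.nn as nn

# Toy model standing in for an LM; the hidden layer's output is the "activation site".
model = nn.Sequential(nn.Linear(8, 16), nn.ReLU(), nn.Linear(16, 1))
clean_input = torch.randn(1, 8)
corrupted_input = torch.randn(1, 8)

# 1) Cache the activation at the chosen site on the clean run.
cache = {}
def save_hook(module, inp, out):
    cache["act"] = out.detach()

handle = model[1].register_forward_hook(save_hook)
model(clean_input)
handle.remove()

# 2) On the corrupted run, patch in the cached clean activation.
def patch_hook(module, inp, out):
    return cache["act"]

handle = model[1].register_forward_hook(patch_hook)
patched_out = model(corrupted_input)
handle.remove()

# 3) Compare against the unpatched corrupted run to measure the site's effect.
corrupted_out = model(corrupted_input)
print("effect of patching this site:", (patched_out - corrupted_out).item())
```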
no code implementations • 1 Mar 2024 • János Kramár, Tom Lieberum, Rohin Shah, Neel Nanda
We investigate Attribution Patching (AtP), a fast gradient-based approximation to Activation Patching, and find two classes of AtP failure modes that lead to significant false negatives.
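A minimal sketch of the first-order AtP estimate (the activation difference dotted with the gradient of the metric on the corrupted run), again on a toy module; the model, metric, and inputs are illustrative assumptions rather than the paper's experimental setup.

```python
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(8, 16), nn.ReLU(), nn.Linear(16, 1))
clean_input, corrupt_input = torch.randn(1, 8), torch.randn(1, 8)

def run_with_cached_act(x):
    """Run the model, keeping the intermediate activation in the autograd graph."""
    act = model[1](model[0](x))
    out = model[2](act)
    return out, act

# Activations from the clean and corrupted runs.
_, clean_act = run_with_cached_act(clean_input)
corrupt_out, corrupt_act = run_with_cached_act(corrupt_input)

# Gradient of the metric (here just the output) w.r.t. the activation, corrupted run.
grad = torch.autograd.grad(corrupt_out.sum(), corrupt_act)[0]

# First-order (AtP) estimate of patching the clean activation into the corrupted run.
atp_estimate = (grad * (clean_act.detach() - corrupt_act.detach())).sum()
print("AtP estimate of the patching effect:", atp_estimate.item())
```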
1 code implementation • 23 Feb 2024 • Cody Rushing, Neel Nanda
Prior interpretability research studying narrow distributions has preliminarily identified self-repair, a phenomenon where, if components of large language models are ablated, later components change their behavior to compensate.
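A rough sketch of the kind of ablation involved, mean-ablating a toy component via a forward hook; the model and batch are illustrative assumptions, and a real self-repair analysis would compare this total effect against the component's direct contribution to the output.

```python
import torch
import torch.nn as nn

# Toy model; the "component" we ablate is the hidden layer's output.
model = nn.Sequential(nn.Linear(8, 16), nn.ReLU(), nn.Linear(16, 1))
batch = torch.randn(32, 8)

# Mean-ablation: replace the component's output with its mean over the batch.
with torch.no_grad():
    mean_act = model[1](model[0](batch)).mean(dim=0, keepdim=True)

def ablate_hook(module, inp, out):
    return mean_act.expand_as(out)

handle = model[1].register_forward_hook(ablate_hook)
ablated_out = model(batch)
handle.remove()

clean_out = model(batch)
# In a full LM, self-repair shows up when downstream components shift so the final
# output changes by less than the ablated component's direct contribution would predict.
print("output change under ablation:", (ablated_out - clean_out).abs().mean().item())
```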
no code implementations • 11 Feb 2024 • Bilal Chughtai, Alan Cooney, Neel Nanda
How do transformer-based large language models (LLMs) store and retrieve knowledge?
1 code implementation • 22 Jan 2024 • Wes Gurnee, Theo Horsley, Zifan Carl Guo, Tara Rezaei Kheirkhah, Qinyi Sun, Will Hathaway, Neel Nanda, Dimitris Bertsimas
In other words, are neural mechanisms universal across different models?
1 code implementation • 28 Nov 2023 • Aleksandar Makelov, Georg Lange, Neel Nanda
We demonstrate this phenomenon in a distilled mathematical example and in two real-world domains (the indirect object identification task and factual recall), and present evidence for its prevalence in practice.
1 code implementation • 1 Nov 2023 • Lucia Quirke, Lovis Heindrich, Wes Gurnee, Neel Nanda
We show that this neuron exists within a broader contextual n-gram circuit: we find late layer neurons which recognize and continue n-grams common in German text, but which only activate if the German neuron is active.
1 code implementation • 23 Oct 2023 • Curt Tigges, Oskar John Hollinsworth, Atticus Geiger, Neel Nanda
Sentiment is a pervasive feature in natural language text, yet it is an open question how sentiment is represented within Large Language Models (LLMs).
1 code implementation • 6 Oct 2023 • Callum McDougall, Arthur Conmy, Cody Rushing, Thomas McGrath, Neel Nanda
We show that self-repair is implemented by several mechanisms, one of which is copy suppression, which explains 39% of the behavior in a narrow task.
no code implementations • 27 Sep 2023 • Fred Zhang, Neel Nanda
Mechanistic interpretability seeks to understand the internal mechanisms of machine learning models, where localization -- identifying the important model components -- is a key step.
1 code implementation • 2 Sep 2023 • Neel Nanda, Andrew Lee, Martin Wattenberg
How do sequence models represent their decision-making process?
no code implementations • 18 Jul 2023 • Tom Lieberum, Matthew Rahtz, János Kramár, Neel Nanda, Geoffrey Irving, Rohin Shah, Vladimir Mikulik
\emph{Circuit analysis} is a promising technique for understanding the internal mechanisms of language models.
1 code implementation • 31 May 2023 • Alex Foote, Neel Nanda, Esben Kran, Ioannis Konstas, Shay Cohen, Fazl Barez
Conventional methods require examination of examples with strong neuron activation and manual identification of patterns to decipher the concepts a neuron responds to.
2 code implementations • 2 May 2023 • Wes Gurnee, Neel Nanda, Matthew Pauly, Katherine Harvey, Dmitrii Troitskii, Dimitris Bertsimas
Despite rapid adoption and deployment of large language models (LLMs), the internal computations of these models remain opaque and poorly understood.
no code implementations • 22 Apr 2023 • Alex Foote, Neel Nanda, Esben Kran, Ioannis Konstas, Fazl Barez
Understanding the function of individual neurons within language models is essential for mechanistic interpretability research.
1 code implementation • 6 Feb 2023 • Bilal Chughtai, Lawrence Chan, Neel Nanda
Universality is a key hypothesis in mechanistic interpretability -- that different models learn similar features and circuits when trained on similar tasks.
1 code implementation • 12 Jan 2023 • Neel Nanda, Lawrence Chan, Tom Lieberum, Jess Smith, Jacob Steinhardt
Based on this understanding, we define progress measures that allow us to study the dynamics of training and split training into three continuous phases: memorization, circuit formation, and cleanup.
no code implementations • 24 Sep 2022 • Catherine Olsson, Nelson Elhage, Neel Nanda, Nicholas Joseph, Nova DasSarma, Tom Henighan, Ben Mann, Amanda Askell, Yuntao Bai, Anna Chen, Tom Conerly, Dawn Drain, Deep Ganguli, Zac Hatfield-Dodds, Danny Hernandez, Scott Johnston, Andy Jones, Jackson Kernion, Liane Lovitt, Kamal Ndousse, Dario Amodei, Tom Brown, Jack Clark, Jared Kaplan, Sam McCandlish, Chris Olah
In this work, we present preliminary and indirect evidence for a hypothesis that induction heads might constitute the mechanism for the majority of all "in-context learning" in large transformer models (i.e., decreasing loss at increasing token indices).
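A hedged sketch of one way to quantify this notion of in-context learning, as the difference in per-token loss between a late and an early position; the specific indices and the stand-in loss tensor below are illustrative assumptions.

```python
import torch

def in_context_learning_score(per_token_loss: torch.Tensor,
                              early_index: int = 50,
                              late_index: int = 500) -> float:
    """Loss at a late token position minus loss at an early position, averaged over
    the batch; a more negative score means more in-context learning."""
    return (per_token_loss[:, late_index] - per_token_loss[:, early_index]).mean().item()

# Stand-in per-token losses for 8 sequences of length 512, with loss falling by position.
fake_losses = torch.rand(8, 512) + torch.linspace(1.0, 0.2, 512)
print(in_context_learning_score(fake_losses, early_index=50, late_index=500))
```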
3 code implementations • 12 Apr 2022 • Yuntao Bai, Andy Jones, Kamal Ndousse, Amanda Askell, Anna Chen, Nova DasSarma, Dawn Drain, Stanislav Fort, Deep Ganguli, Tom Henighan, Nicholas Joseph, Saurav Kadavath, Jackson Kernion, Tom Conerly, Sheer El-Showk, Nelson Elhage, Zac Hatfield-Dodds, Danny Hernandez, Tristan Hume, Scott Johnston, Shauna Kravec, Liane Lovitt, Neel Nanda, Catherine Olsson, Dario Amodei, Tom Brown, Jack Clark, Sam McCandlish, Chris Olah, Ben Mann, Jared Kaplan
We apply preference modeling and reinforcement learning from human feedback (RLHF) to finetune language models to act as helpful and harmless assistants.
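A minimal sketch of the standard pairwise preference-modeling loss that underlies this kind of reward-model training (the reward model is trained to score the human-preferred response above the rejected one); the reward values below are illustrative, and the paper's exact training setup may differ.

```python
import torch
import torch.nn.functional as F

def preference_loss(chosen_rewards: torch.Tensor, rejected_rewards: torch.Tensor) -> torch.Tensor:
    """Pairwise preference loss: -log sigmoid(r_chosen - r_rejected), averaged over pairs."""
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()

# Illustrative rewards produced by a reward model for a batch of comparison pairs.
chosen = torch.tensor([1.2, 0.3, 2.0])
rejected = torch.tensor([0.5, 0.8, 1.1])
print(preference_loss(chosen, rejected))
```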
no code implementations • 4 Oct 2021 • Neel Nanda, Jonathan Uesato, Sven Gowal
Collecting annotations from human raters often results in a trade-off between the quantity of labels one wishes to gather and the quality of these labels.
no code implementations • 17 Feb 2021 • Michael K. Cohen, Marcus Hutter, Neel Nanda
If we run an imitator, we probably want events to unfold similarly to the way they would have if the demonstrator had been acting the whole time.