no code implementations • 24 Apr 2024 • Senthooran Rajamanoharan, Arthur Conmy, Lewis Smith, Tom Lieberum, Vikrant Varma, János Kramár, Rohin Shah, Neel Nanda
Recent work has found that sparse autoencoders (SAEs) are an effective technique for unsupervised discovery of interpretable features in language model (LM) activations, by finding sparse, linear reconstructions of those activations.
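As a rough illustration of the core idea only (not the paper's architecture or hyperparameters), a minimal SAE can be sketched in PyTorch; the dimensions, L1 coefficient, and random activation batch below are illustrative assumptions.

```python
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    """Minimal SAE: reconstructs activations as a sparse, non-negative
    combination of learned dictionary directions."""
    def __init__(self, d_model: int, d_hidden: int):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_hidden)
        self.decoder = nn.Linear(d_hidden, d_model)

    def forward(self, acts: torch.Tensor):
        features = torch.relu(self.encoder(acts))  # sparse feature activations
        recon = self.decoder(features)             # linear reconstruction
        return recon, features

# Illustrative training step on a batch of activations (assumed shapes/values).
sae = SparseAutoencoder(d_model=512, d_hidden=4096)
acts = torch.randn(64, 512)                        # stand-in for real LM activations
recon, features = sae(acts)
l1_coeff = 1e-3                                    # sparsity penalty strength (assumed)
loss = (recon - acts).pow(2).mean() + l1_coeff * features.abs().sum(dim=-1).mean()
loss.backward()
```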
no code implementations • 23 Apr 2024 • Stefan Heimersheim, Neel Nanda
Activation patching is a popular mechanistic interpretability technique, but has many subtleties regarding how it is applied and how one may interpret the results.
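A minimal sketch of the basic activation-patching recipe, using a toy PyTorch module with forward hooks in place of a real language model; the model, patched site, and inputs are illustrative assumptions, not the paper's setup.

```python
import torch
import torch.nn as nn

# Toy model standing in for an LM; the hidden layer's output is the "activation site".
model = nn.Sequential(nn.Linear(8, 16), nn.ReLU(), nn.Linear(16, 1))
clean_input = torch.randn(1, 8)
corrupted_input = torch.randn(1, 8)

# 1) Cache the activation at the chosen site on the clean run.
cache = {}
def save_hook(module, inp, out):
    cache["act"] = out.detach()

handle = model[1].register_forward_hook(save_hook)
model(clean_input)
handle.remove()

# 2) On the corrupted run, patch in the cached clean activation.
def patch_hook(module, inp, out):
    return cache["act"]

handle = model[1].register_forward_hook(patch_hook)
patched_out = model(corrupted_input)
handle.remove()

# 3) Compare against the unpatched corrupted run to measure the site's effect.
corrupted_out = model(corrupted_input)
print("effect of patching this site:", (patched_out - corrupted_out).item())
```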
no code implementations • 1 Mar 2024 • János Kramár, Tom Lieberum, Rohin Shah, Neel Nanda
We investigate Attribution Patching (AtP), a fast gradient-based approximation to Activation Patching, and find two classes of AtP failure modes that lead to significant false negatives.
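A minimal sketch of the first-order AtP estimate (the activation difference dotted with the gradient of the metric on the corrupted run), again on a toy module; the model, metric, and inputs are illustrative assumptions rather than the paper's experimental setup.

```python
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(8, 16), nn.ReLU(), nn.Linear(16, 1))
clean_input, corrupt_input = torch.randn(1, 8), torch.randn(1, 8)

def run_with_cached_act(x):
    """Run the model, keeping the intermediate activation in the autograd graph."""
    act = model[1](model[0](x))
    out = model[2](act)
    return out, act

# Activations from the clean and corrupted runs.
_, clean_act = run_with_cached_act(clean_input)
corrupt_out, corrupt_act = run_with_cached_act(corrupt_input)

# Gradient of the metric (here just the output) w.r.t. the activation, corrupted run.
grad = torch.autograd.grad(corrupt_out.sum(), corrupt_act)[0]

# First-order (AtP) estimate of patching the clean activation into the corrupted run.
atp_estimate = (grad * (clean_act.detach() - corrupt_act.detach())).sum()
print("AtP estimate of the patching effect:", atp_estimate.item())
```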
1 code implementation • 23 Feb 2024 • Cody Rushing, Neel Nanda
Prior interpretability research studying narrow distributions has preliminarily identified self-repair, a phenomenon where, if components of large language models are ablated, later components change their behavior to compensate.
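A rough sketch of the kind of ablation involved, mean-ablating a toy component via a forward hook; the model and batch are illustrative assumptions, and a real self-repair analysis would compare this total effect against the component's direct contribution to the output.

```python
import torch
import torch.nn as nn

# Toy model; the "component" we ablate is the hidden layer's output.
model = nn.Sequential(nn.Linear(8, 16), nn.ReLU(), nn.Linear(16, 1))
batch = torch.randn(32, 8)

# Mean-ablation: replace the component's output with its mean over the batch.
with torch.no_grad():
    mean_act = model[1](model[0](batch)).mean(dim=0, keepdim=True)

def ablate_hook(module, inp, out):
    return mean_act.expand_as(out)

handle = model[1].register_forward_hook(ablate_hook)
ablated_out = model(batch)
handle.remove()

clean_out = model(batch)
# In a full LM, self-repair shows up when downstream components shift so the final
# output changes by less than the ablated component's direct contribution would predict.
print("output change under ablation:", (ablated_out - clean_out).abs().mean().item())
```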
no code implementations • 11 Feb 2024 • Bilal Chughtai, Alan Cooney, Neel Nanda
How do transformer-based large language models (LLMs) store and retrieve knowledge?
1 code implementation • 22 Jan 2024 • Wes Gurnee, Theo Horsley, Zifan Carl Guo, Tara Rezaei Kheirkhah, Qinyi Sun, Will Hathaway, Neel Nanda, Dimitris Bertsimas
In other words, are neural mechanisms universal across different models?
1 code implementation • 28 Nov 2023 • Aleksandar Makelov, Georg Lange, Neel Nanda
We demonstrate this phenomenon in a distilled mathematical example and in two real-world domains (the indirect object identification task and factual recall), and present evidence for its prevalence in practice.
1 code implementation • 1 Nov 2023 • Lucia Quirke, Lovis Heindrich, Wes Gurnee, Neel Nanda
We show that this neuron exists within a broader contextual n-gram circuit: we find late layer neurons which recognize and continue n-grams common in German text, but which only activate if the German neuron is active.
1 code implementation • 23 Oct 2023 • Curt Tigges, Oskar John Hollinsworth, Atticus Geiger, Neel Nanda
Sentiment is a pervasive feature in natural language text, yet it is an open question how sentiment is represented within Large Language Models (LLMs).
1 code implementation • 6 Oct 2023 • Callum McDougall, Arthur Conmy, Cody Rushing, Thomas McGrath, Neel Nanda
We show that self-repair is implemented by several mechanisms, one of which is copy suppression, which explains 39% of the behavior in a narrow task.
no code implementations • 27 Sep 2023 • Fred Zhang, Neel Nanda
Mechanistic interpretability seeks to understand the internal mechanisms of machine learning models, where localization -- identifying the important model components -- is a key step.
1 code implementation • 2 Sep 2023 • Neel Nanda, Andrew Lee, Martin Wattenberg
How do sequence models represent their decision-making process?
no code implementations • 18 Jul 2023 • Tom Lieberum, Matthew Rahtz, János Kramár, Neel Nanda, Geoffrey Irving, Rohin Shah, Vladimir Mikulik
\emph{Circuit analysis} is a promising technique for understanding the internal mechanisms of language models.
1 code implementation • 31 May 2023 • Alex Foote, Neel Nanda, Esben Kran, Ioannis Konstas, Shay Cohen, Fazl Barez
Conventional methods require examination of examples with strong neuron activation and manual identification of patterns to decipher the concepts a neuron responds to.
2 code implementations • 2 May 2023 • Wes Gurnee, Neel Nanda, Matthew Pauly, Katherine Harvey, Dmitrii Troitskii, Dimitris Bertsimas
Despite rapid adoption and deployment of large language models (LLMs), the internal computations of these models remain opaque and poorly understood.
no code implementations • 22 Apr 2023 • Alex Foote, Neel Nanda, Esben Kran, Ioannis Konstas, Fazl Barez
Understanding the function of individual neurons within language models is essential for mechanistic interpretability research.
1 code implementation • 6 Feb 2023 • Bilal Chughtai, Lawrence Chan, Neel Nanda
Universality is a key hypothesis in mechanistic interpretability -- that different models learn similar features and circuits when trained on similar tasks.
1 code implementation • 12 Jan 2023 • Neel Nanda, Lawrence Chan, Tom Lieberum, Jess Smith, Jacob Steinhardt
Based on this understanding, we define progress measures that allow us to study the dynamics of training and split training into three continuous phases: memorization, circuit formation, and cleanup.
no code implementations • 24 Sep 2022 • Catherine Olsson, Nelson Elhage, Neel Nanda, Nicholas Joseph, Nova DasSarma, Tom Henighan, Ben Mann, Amanda Askell, Yuntao Bai, Anna Chen, Tom Conerly, Dawn Drain, Deep Ganguli, Zac Hatfield-Dodds, Danny Hernandez, Scott Johnston, Andy Jones, Jackson Kernion, Liane Lovitt, Kamal Ndousse, Dario Amodei, Tom Brown, Jack Clark, Jared Kaplan, Sam McCandlish, Chris Olah
In this work, we present preliminary and indirect evidence for a hypothesis that induction heads might constitute the mechanism for the majority of all "in-context learning" in large transformer models (i.e., decreasing loss at increasing token indices).
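A hedged sketch of one way to quantify this notion of in-context learning, as the difference in per-token loss between a late and an early position; the specific indices and the stand-in loss tensor below are illustrative assumptions.

```python
import torch

def in_context_learning_score(per_token_loss: torch.Tensor,
                              early_index: int = 50,
                              late_index: int = 500) -> float:
    """Loss at a late token position minus loss at an early position, averaged over
    the batch; a more negative score means more in-context learning."""
    return (per_token_loss[:, late_index] - per_token_loss[:, early_index]).mean().item()

# Stand-in per-token losses for 8 sequences of length 512, with loss falling by position.
fake_losses = torch.rand(8, 512) + torch.linspace(1.0, 0.2, 512)
print(in_context_learning_score(fake_losses, early_index=50, late_index=500))
```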
3 code implementations • 12 Apr 2022 • Yuntao Bai, Andy Jones, Kamal Ndousse, Amanda Askell, Anna Chen, Nova DasSarma, Dawn Drain, Stanislav Fort, Deep Ganguli, Tom Henighan, Nicholas Joseph, Saurav Kadavath, Jackson Kernion, Tom Conerly, Sheer El-Showk, Nelson Elhage, Zac Hatfield-Dodds, Danny Hernandez, Tristan Hume, Scott Johnston, Shauna Kravec, Liane Lovitt, Neel Nanda, Catherine Olsson, Dario Amodei, Tom Brown, Jack Clark, Sam McCandlish, Chris Olah, Ben Mann, Jared Kaplan
We apply preference modeling and reinforcement learning from human feedback (RLHF) to finetune language models to act as helpful and harmless assistants.
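A minimal sketch of the standard pairwise preference-modeling loss that underlies this kind of reward-model training (the reward model is trained to score the human-preferred response above the rejected one); the reward values below are illustrative, and the paper's exact training setup may differ.

```python
import torch
import torch.nn.functional as F

def preference_loss(chosen_rewards: torch.Tensor, rejected_rewards: torch.Tensor) -> torch.Tensor:
    """Pairwise preference loss: -log sigmoid(r_chosen - r_rejected), averaged over pairs."""
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()

# Illustrative rewards produced by a reward model for a batch of comparison pairs.
chosen = torch.tensor([1.2, 0.3, 2.0])
rejected = torch.tensor([0.5, 0.8, 1.1])
print(preference_loss(chosen, rejected))
```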
no code implementations • 4 Oct 2021 • Neel Nanda, Jonathan Uesato, Sven Gowal
Collecting annotations from human raters often results in a trade-off between the quantity of labels one wishes to gather and the quality of these labels.
no code implementations • 17 Feb 2021 • Michael K. Cohen, Marcus Hutter, Neel Nanda
If we run an imitator, we probably want events to unfold similarly to the way they would have if the demonstrator had been acting the whole time.