Search Results for author: Rohin Shah

Found 25 papers, 8 papers with code

Inferring Reward Functions from Demonstrators with Unknown Biases

no code implementations • ICLR 2019 • Rohin Shah, Noah Gundotra, Pieter Abbeel, Anca Dragan

Our goal is to infer reward functions from demonstrations.

Improving Dictionary Learning with Gated Sparse Autoencoders

no code implementations • 24 Apr 2024 • Senthooran Rajamanoharan, Arthur Conmy, Lewis Smith, Tom Lieberum, Vikrant Varma, János Kramár, Rohin Shah, Neel Nanda

Recent work has found that sparse autoencoders (SAEs) are an effective technique for unsupervised discovery of interpretable features in language models' (LMs) activations, by finding sparse, linear reconstructions of LM activations.

Dictionary Learning

Evaluating Frontier Models for Dangerous Capabilities

no code implementations • 20 Mar 2024 • Mary Phuong, Matthew Aitchison, Elliot Catt, Sarah Cogan, Alexandre Kaskasoli, Victoria Krakovna, David Lindner, Matthew Rahtz, Yannis Assael, Sarah Hodkinson, Heidi Howard, Tom Lieberum, Ramana Kumar, Maria Abi Raad, Albert Webson, Lewis Ho, Sharon Lin, Sebastian Farquhar, Marcus Hutter, Gregoire Deletang, Anian Ruoss, Seliem El-Sayed, Sasha Brown, Anca Dragan, Rohin Shah, Allan Dafoe, Toby Shevlane

To understand the risks posed by a new AI system, we must understand what it can and cannot do.

AtP*: An efficient and scalable method for localizing LLM behaviour to components

no code implementations • 1 Mar 2024 • János Kramár, Tom Lieberum, Rohin Shah, Neel Nanda

We investigate Attribution Patching (AtP), a fast gradient-based approximation to Activation Patching, and find two classes of failure modes of AtP which lead to significant false negatives.

Challenges with unsupervised LLM knowledge discovery

no code implementations • 15 Dec 2023 • Sebastian Farquhar, Vikrant Varma, Zachary Kenton, Johannes Gasteiger, Vladimir Mikulik, Rohin Shah

We show that existing unsupervised methods on large language model (LLM) activations do not discover knowledge; instead they seem to discover whatever feature of the activations is most prominent.

Language Modelling, Large Language Model

BEDD: The MineRL BASALT Evaluation and Demonstrations Dataset for Training and Benchmarking Agents that Solve Fuzzy Tasks

1 code implementation • NeurIPS 2023 • Stephanie Milani, Anssi Kanervisto, Karolis Ramanauskas, Sander Schulhoff, Brandon Houghton, Rohin Shah

Given the completion of two years of BASALT competitions, we offer to the community a formalized benchmark through the BASALT Evaluation and Demonstrations Dataset (BEDD), which serves as a resource for algorithm development and performance assessment.

Benchmarking

Explaining grokking through circuit efficiency

no code implementations • 5 Sep 2023 • Vikrant Varma, Rohin Shah, Zachary Kenton, János Kramár, Ramana Kumar

One of the most surprising puzzles in neural network generalisation is grokking: a network with perfect training accuracy but poor generalisation will, upon further training, transition to perfect generalisation.

Does Circuit Analysis Interpretability Scale? Evidence from Multiple Choice Capabilities in Chinchilla

no code implementations • 18 Jul 2023 • Tom Lieberum, Matthew Rahtz, János Kramár, Neel Nanda, Geoffrey Irving, Rohin Shah, Vladimir Mikulik

Circuit analysis is a promising technique for understanding the internal mechanisms of language models.

Multiple-choice Question Answering

Towards Solving Fuzzy Tasks with Human Feedback: A Retrospective of the MineRL BASALT 2022 Competition

no code implementations • 23 Mar 2023 • Stephanie Milani, Anssi Kanervisto, Karolis Ramanauskas, Sander Schulhoff, Brandon Houghton, Sharada Mohanty, Byron Galbraith, Ke Chen, Yan Song, Tianze Zhou, Bingquan Yu, He Liu, Kai Guan, Yujing Hu, Tangjie Lv, Federico Malato, Florian Leopold, Amogh Raut, Ville Hautamäki, Andrew Melnik, Shu Ishida, João F. Henriques, Robert Klassert, Walter Laurito, Ellen Novoseller, Vinicius G. Goecks, Nicholas Waytowich, David Watkins, Josh Miller, Rohin Shah

To facilitate research in the direction of fine-tuning foundation models from human feedback, we held the MineRL BASALT Competition on Fine-Tuning from Human Feedback at NeurIPS 2022.

SIRL: Similarity-based Implicit Representation Learning

no code implementations • 2 Jan 2023 • Andreea Bobu, Yi Liu, Rohin Shah, Daniel S. Brown, Anca D. Dragan

This, in turn, is what enables the robot to disambiguate between what needs to go into the representation versus what is spurious, as well as what aspects of behavior can be compressed together versus not.

Contrastive Learning, Data Augmentation (+1 more)

Goal Misgeneralization: Why Correct Specifications Aren't Enough For Correct Goals

no code implementations • 4 Oct 2022 • Rohin Shah, Vikrant Varma, Ramana Kumar, Mary Phuong, Victoria Krakovna, Jonathan Uesato, Zac Kenton

However, an AI system may pursue an undesired goal even when the specification is correct, in the case of goal misgeneralization.

An Empirical Investigation of Representation Learning for Imitation

2 code implementations • 16 May 2022 • Xin Chen, Sam Toyer, Cody Wild, Scott Emmons, Ian Fischer, Kuang-Huei Lee, Neel Alex, Steven H Wang, Ping Luo, Stuart Russell, Pieter Abbeel, Rohin Shah

We propose a modular framework for constructing representation learning algorithms, then use our framework to evaluate the utility of representation learning for imitation across several environment suites.

Image Classification, Imitation Learning (+1 more)

Retrospective on the 2021 BASALT Competition on Learning from Human Feedback

no code implementations • 14 Apr 2022 • Rohin Shah, Steven H. Wang, Cody Wild, Stephanie Milani, Anssi Kanervisto, Vinicius G. Goecks, Nicholas Waytowich, David Watkins-Valls, Bharat Prakash, Edmund Mills, Divyansh Garg, Alexander Fries, Alexandra Souly, Chan Jun Shern, Daniel del Castillo, Tom Lieberum

The goal of the competition was to promote research towards agents that use learning from human feedback (LfHF) techniques to solve open-world tasks.

The MineRL BASALT Competition on Learning from Human Feedback

no code implementations • 5 Jul 2021 • Rohin Shah, Cody Wild, Steven H. Wang, Neel Alex, Brandon Houghton, William Guss, Sharada Mohanty, Anssi Kanervisto, Stephanie Milani, Nicholay Topin, Pieter Abbeel, Stuart Russell, Anca Dragan

Rather than training AI systems using a predefined reward function or using a labeled dataset with a predefined set of categories, we instead train the AI system using a learning signal derived from some form of human feedback, which can evolve over time as the understanding of the task changes, or as the capabilities of the AI system improve.

Imitation Learning

Learning What To Do by Simulating the Past

1 code implementation • ICLR 2021 • David Lindner, Rohin Shah, Pieter Abbeel, Anca Dragan

Since reward functions are hard to specify, recent work has focused on learning policies from human feedback.

Combining Reward Information from Multiple Sources

no code implementations • 22 Mar 2021 • Dmitrii Krasheninnikov, Rohin Shah, Herke van Hoof

We study this problem in the setting with two conflicting reward functions learned from different sources.

Informativeness

Choice Set Misspecification in Reward Inference

no code implementations • 19 Jan 2021 • Rachel Freedman, Rohin Shah, Anca Dragan

A promising alternative to manually specifying reward functions is to enable robots to infer them from human feedback, like demonstrations or corrections.

Evaluating the Robustness of Collaborative Agents

no code implementations • 14 Jan 2021 • Paul Knott, Micah Carroll, Sam Devlin, Kamil Ciosek, Katja Hofmann, A. D. Dragan, Rohin Shah

We apply this methodology to build a suite of unit tests for the Overcooked-AI environment, and use this test suite to evaluate three proposals for improving robustness.

Benefits of Assistance over Reward Learning

no code implementations • 1 Jan 2021 • Rohin Shah, Pedro Freire, Neel Alex, Rachel Freedman, Dmitrii Krasheninnikov, Lawrence Chan, Michael D Dennis, Pieter Abbeel, Anca Dragan, Stuart Russell

By merging reward learning and control, assistive agents can reason about the impact of control actions on reward learning, leading to several advantages over agents based on reward learning.

The MAGICAL Benchmark for Robust Imitation

1 code implementation • NeurIPS 2020 • Sam Toyer, Rohin Shah, Andrew Critch, Stuart Russell

This rewards precise reproduction of demonstrations in one particular environment, but provides little information about how robustly an algorithm can generalise the demonstrator's intent to substantially different deployment settings.

Imitation Learning

Optimal Policies Tend to Seek Power

1 code implementation • NeurIPS 2021 • Alexander Matt Turner, Logan Smith, Rohin Shah, Andrew Critch, Prasad Tadepalli

Some researchers speculate that intelligent reinforcement learning (RL) agents would be incentivized to seek resources and power in pursuit of their objectives.

Reinforcement Learning (RL)

On the Utility of Learning about Humans for Human-AI Coordination

2 code implementations • NeurIPS 2019 • Micah Carroll, Rohin Shah, Mark K. Ho, Thomas L. Griffiths, Sanjit A. Seshia, Pieter Abbeel, Anca Dragan

While we would like agents that can coordinate with humans, current algorithms such as self-play and population-based training create agents that can coordinate with themselves.

On the Feasibility of Learning, Rather than Assuming, Human Biases for Reward Inference

no code implementations • 23 Jun 2019 • Rohin Shah, Noah Gundotra, Pieter Abbeel, Anca D. Dragan

But in the era of deep learning, a natural suggestion researchers make is to avoid mathematical models of human behavior that are fraught with specific assumptions, and instead use a purely data-driven approach.

Preferences Implicit in the State of the World

1 code implementation • ICLR 2019 • Rohin Shah, Dmitrii Krasheninnikov, Jordan Alexander, Pieter Abbeel, Anca Dragan

We find that information from the initial state can be used to infer both side effects that should be avoided as well as preferences for how the environment should be organized.

Reinforcement Learning (RL)

Active Inverse Reward Design

1 code implementation • 9 Sep 2018 • Sören Mindermann, Rohin Shah, Adam Gleave, Dylan Hadfield-Menell

We propose structuring this process as a series of queries asking the user to compare between different reward functions.

Informativeness
