no code implementations • ICLR 2019 • Rohin Shah, Noah Gundotra, Pieter Abbeel, Anca Dragan
Our goal is to infer reward functions from demonstrations.
no code implementations • 24 Apr 2024 • Senthooran Rajamanoharan, Arthur Conmy, Lewis Smith, Tom Lieberum, Vikrant Varma, János Kramár, Rohin Shah, Neel Nanda
Recent work has found that sparse autoencoders (SAEs) are an effective technique for unsupervised discovery of interpretable features in language models' (LMs) activations, by finding sparse, linear reconstructions of LM activations.
no code implementations • 20 Mar 2024 • Mary Phuong, Matthew Aitchison, Elliot Catt, Sarah Cogan, Alexandre Kaskasoli, Victoria Krakovna, David Lindner, Matthew Rahtz, Yannis Assael, Sarah Hodkinson, Heidi Howard, Tom Lieberum, Ramana Kumar, Maria Abi Raad, Albert Webson, Lewis Ho, Sharon Lin, Sebastian Farquhar, Marcus Hutter, Gregoire Deletang, Anian Ruoss, Seliem El-Sayed, Sasha Brown, Anca Dragan, Rohin Shah, Allan Dafoe, Toby Shevlane
To understand the risks posed by a new AI system, we must understand what it can and cannot do.
no code implementations • 1 Mar 2024 • János Kramár, Tom Lieberum, Rohin Shah, Neel Nanda
We investigate Attribution Patching (AtP), a fast gradient-based approximation to Activation Patching and find two classes of failure modes of AtP which lead to significant false negatives.
no code implementations • 15 Dec 2023 • Sebastian Farquhar, Vikrant Varma, Zachary Kenton, Johannes Gasteiger, Vladimir Mikulik, Rohin Shah
We show that existing unsupervised methods on large language model (LLM) activations do not discover knowledge -- instead they seem to discover whatever feature of the activations is most prominent.
1 code implementation • NeurIPS 2023 • Stephanie Milani, Anssi Kanervisto, Karolis Ramanauskas, Sander Schulhoff, Brandon Houghton, Rohin Shah
Given the completion of two years of BASALT competitions, we offer to the community a formalized benchmark through the BASALT Evaluation and Demonstrations Dataset (BEDD), which serves as a resource for algorithm development and performance assessment.
no code implementations • 5 Sep 2023 • Vikrant Varma, Rohin Shah, Zachary Kenton, János Kramár, Ramana Kumar
One of the most surprising puzzles in neural network generalisation is grokking: a network with perfect training accuracy but poor generalisation will, upon further training, transition to perfect generalisation.
no code implementations • 18 Jul 2023 • Tom Lieberum, Matthew Rahtz, János Kramár, Neel Nanda, Geoffrey Irving, Rohin Shah, Vladimir Mikulik
\emph{Circuit analysis} is a promising technique for understanding the internal mechanisms of language models.
no code implementations • 23 Mar 2023 • Stephanie Milani, Anssi Kanervisto, Karolis Ramanauskas, Sander Schulhoff, Brandon Houghton, Sharada Mohanty, Byron Galbraith, Ke Chen, Yan Song, Tianze Zhou, Bingquan Yu, He Liu, Kai Guan, Yujing Hu, Tangjie Lv, Federico Malato, Florian Leopold, Amogh Raut, Ville Hautamäki, Andrew Melnik, Shu Ishida, João F. Henriques, Robert Klassert, Walter Laurito, Ellen Novoseller, Vinicius G. Goecks, Nicholas Waytowich, David Watkins, Josh Miller, Rohin Shah
To facilitate research in the direction of fine-tuning foundation models from human feedback, we held the MineRL BASALT Competition on Fine-Tuning from Human Feedback at NeurIPS 2022.
no code implementations • 2 Jan 2023 • Andreea Bobu, Yi Liu, Rohin Shah, Daniel S. Brown, Anca D. Dragan
This, in turn, is what enables the robot to disambiguate between what needs to go into the representation versus what is spurious, as well as what aspects of behavior can be compressed together versus not.
no code implementations • 4 Oct 2022 • Rohin Shah, Vikrant Varma, Ramana Kumar, Mary Phuong, Victoria Krakovna, Jonathan Uesato, Zac Kenton
However, an AI system may pursue an undesired goal even when the specification is correct, in the case of goal misgeneralization.
2 code implementations • 16 May 2022 • Xin Chen, Sam Toyer, Cody Wild, Scott Emmons, Ian Fischer, Kuang-Huei Lee, Neel Alex, Steven H Wang, Ping Luo, Stuart Russell, Pieter Abbeel, Rohin Shah
We propose a modular framework for constructing representation learning algorithms, then use our framework to evaluate the utility of representation learning for imitation across several environment suites.
no code implementations • 14 Apr 2022 • Rohin Shah, Steven H. Wang, Cody Wild, Stephanie Milani, Anssi Kanervisto, Vinicius G. Goecks, Nicholas Waytowich, David Watkins-Valls, Bharat Prakash, Edmund Mills, Divyansh Garg, Alexander Fries, Alexandra Souly, Chan Jun Shern, Daniel del Castillo, Tom Lieberum
The goal of the competition was to promote research towards agents that use learning from human feedback (LfHF) techniques to solve open-world tasks.
no code implementations • 5 Jul 2021 • Rohin Shah, Cody Wild, Steven H. Wang, Neel Alex, Brandon Houghton, William Guss, Sharada Mohanty, Anssi Kanervisto, Stephanie Milani, Nicholay Topin, Pieter Abbeel, Stuart Russell, Anca Dragan
Rather than training AI systems using a predefined reward function or using a labeled dataset with a predefined set of categories, we instead train the AI system using a learning signal derived from some form of human feedback, which can evolve over time as the understanding of the task changes, or as the capabilities of the AI system improve.
1 code implementation • ICLR 2021 • David Lindner, Rohin Shah, Pieter Abbeel, Anca Dragan
Since reward functions are hard to specify, recent work has focused on learning policies from human feedback.
no code implementations • 22 Mar 2021 • Dmitrii Krasheninnikov, Rohin Shah, Herke van Hoof
We study this problem in the setting with two conflicting reward functions learned from different sources.
no code implementations • 19 Jan 2021 • Rachel Freedman, Rohin Shah, Anca Dragan
A promising alternative to manually specifying reward functions is to enable robots to infer them from human feedback, like demonstrations or corrections.
no code implementations • 14 Jan 2021 • Paul Knott, Micah Carroll, Sam Devlin, Kamil Ciosek, Katja Hofmann, A. D. Dragan, Rohin Shah
We apply this methodology to build a suite of unit tests for the Overcooked-AI environment, and use this test suite to evaluate three proposals for improving robustness.
no code implementations • 1 Jan 2021 • Rohin Shah, Pedro Freire, Neel Alex, Rachel Freedman, Dmitrii Krasheninnikov, Lawrence Chan, Michael D Dennis, Pieter Abbeel, Anca Dragan, Stuart Russell
By merging reward learning and control, assistive agents can reason about the impact of control actions on reward learning, leading to several advantages over agents based on reward learning.
1 code implementation • NeurIPS 2020 • Sam Toyer, Rohin Shah, Andrew Critch, Stuart Russell
This rewards precise reproduction of demonstrations in one particular environment, but provides little information about how robustly an algorithm can generalise the demonstrator's intent to substantially different deployment settings.
1 code implementation • NeurIPS 2021 • Alexander Matt Turner, Logan Smith, Rohin Shah, Andrew Critch, Prasad Tadepalli
Some researchers speculate that intelligent reinforcement learning (RL) agents would be incentivized to seek resources and power in pursuit of their objectives.
2 code implementations • NeurIPS 2019 • Micah Carroll, Rohin Shah, Mark K. Ho, Thomas L. Griffiths, Sanjit A. Seshia, Pieter Abbeel, Anca Dragan
While we would like agents that can coordinate with humans, current algorithms such as self-play and population-based training create agents that can coordinate with themselves.
no code implementations • 23 Jun 2019 • Rohin Shah, Noah Gundotra, Pieter Abbeel, Anca D. Dragan
But in the era of deep learning, a natural suggestion researchers make is to avoid mathematical models of human behavior that are fraught with specific assumptions, and instead use a purely data-driven approach.
1 code implementation • ICLR 2019 • Rohin Shah, Dmitrii Krasheninnikov, Jordan Alexander, Pieter Abbeel, Anca Dragan
We find that information from the initial state can be used to infer both side effects that should be avoided as well as preferences for how the environment should be organized.
1 code implementation • 9 Sep 2018 • Sören Mindermann, Rohin Shah, Adam Gleave, Dylan Hadfield-Menell
We propose structuring this process as a series of queries asking the user to compare between different reward functions.