no code implementations • ICLR 2019 • Rohin Shah, Noah Gundotra, Pieter Abbeel, Anca Dragan
Our goal is to infer reward functions from demonstrations.
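A minimal sketch of reward inference from demonstrations, assuming a Boltzmann-rational demonstrator choosing among a finite set of trajectories summarized by feature vectors (an illustrative toy, not the paper's method):

```python
import numpy as np

# Toy reward inference from demonstrations: the demonstrator is assumed to
# choose trajectory tau with probability proportional to exp(theta . phi(tau))
# over a finite candidate set. We recover theta by maximizing the
# log-likelihood of the demonstrated choices with gradient ascent.

rng = np.random.default_rng(0)
features = rng.normal(size=(10, 3))   # phi(tau) for 10 candidate trajectories
true_theta = np.array([1.0, -0.5, 2.0])

def choice_probs(theta):
    logits = features @ theta
    logits -= logits.max()            # numerical stability
    p = np.exp(logits)
    return p / p.sum()

# Sample demonstrations from the Boltzmann demonstrator.
demos = rng.choice(len(features), size=200, p=choice_probs(true_theta))

theta = np.zeros(3)
for _ in range(500):                  # gradient ascent on log-likelihood
    p = choice_probs(theta)
    # grad log P(demo) = phi(demo) - E_p[phi]
    grad = features[demos].mean(axis=0) - p @ features
    theta += 0.5 * grad

print("recovered theta:", theta)      # approximately recovers true_theta
```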
no code implementations • 2 May 2024 • Jerry Zhi-Yang He, Sashrika Pandey, Mariah L. Schrum, Anca Dragan
Proper usage of the context enables the LLM to generate personalized responses, whereas inappropriate contextual influence can lead to stereotypical and potentially harmful generations (e.g., associating "female" with "housekeeper").
no code implementations • 20 Mar 2024 • Mary Phuong, Matthew Aitchison, Elliot Catt, Sarah Cogan, Alexandre Kaskasoli, Victoria Krakovna, David Lindner, Matthew Rahtz, Yannis Assael, Sarah Hodkinson, Heidi Howard, Tom Lieberum, Ramana Kumar, Maria Abi Raad, Albert Webson, Lewis Ho, Sharon Lin, Sebastian Farquhar, Marcus Hutter, Gregoire Deletang, Anian Ruoss, Seliem El-Sayed, Sasha Brown, Anca Dragan, Rohin Shah, Allan Dafoe, Toby Shevlane
To understand the risks posed by a new AI system, we must understand what it can and cannot do.
no code implementations • 9 Mar 2024 • Evan Ellis, Gaurav R. Ghosal, Stuart J. Russell, Anca Dragan, Erdem Biyik
Preference-based reward learning is a popular technique for teaching robots and autonomous systems how a human user wants them to perform a task.
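A minimal sketch of the standard setup, assuming the commonly used Bradley-Terry preference model with a linear reward over segment features (illustrative only):

```python
import numpy as np

# Preference-based reward learning sketch: the human prefers segment A over
# segment B with probability sigma(R(A) - R(B)), where R is a linear reward
# over segment features. Fitting the reward weights reduces to logistic
# regression on feature differences.

rng = np.random.default_rng(0)
true_w = np.array([2.0, -1.0])

def segment_features(n):
    return rng.normal(size=(n, 2))    # stand-in for summed per-step features

A, B = segment_features(500), segment_features(500)
p_prefer_A = 1 / (1 + np.exp(-(A - B) @ true_w))
labels = rng.random(500) < p_prefer_A  # simulated human preference labels

w = np.zeros(2)
for _ in range(1000):
    p = 1 / (1 + np.exp(-(A - B) @ w))
    grad = (A - B).T @ (labels - p) / len(labels)  # gradient of log-likelihood
    w += 0.5 * grad

print("learned reward weights:", w)   # approximately recovers true_w
```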
1 code implementation • 5 Mar 2024 • Cassidy Laidlaw, Shivam Singhal, Anca Dragan
Thus, we propose regularizing based on the occupancy measure (OM) divergence between policies, rather than the action distribution (AD) divergence, to prevent reward hacking.
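A toy illustration of the motivation (not the paper's algorithm): in a small chain MDP, a tiny per-state change in the action distribution can compound into a large shift in which states get visited, so the OM divergence can be much larger than the AD divergence.

```python
import numpy as np

# Chain MDP: action 0 stays, action 1 moves right. The two policies differ
# only slightly in their per-state action distributions.
n_states, horizon = 20, 19
pi_safe = np.tile([0.9, 0.1], (n_states, 1))
pi_hack = np.tile([0.8, 0.2], (n_states, 1))

def occupancy(pi):
    d = np.zeros(n_states); d[0] = 1.0
    occ = d.copy()
    for _ in range(horizon):
        nxt = np.zeros(n_states)
        for s in range(n_states):
            nxt[s] += d[s] * pi[s, 0]                         # stay
            nxt[min(s + 1, n_states - 1)] += d[s] * pi[s, 1]  # move right
        d = nxt
        occ += d
    return occ / occ.sum()            # normalized state occupancy measure

ad_tv = 0.5 * np.abs(pi_safe - pi_hack).sum(axis=1).max()   # worst per-state AD distance
om_tv = 0.5 * np.abs(occupancy(pi_safe) - occupancy(pi_hack)).sum()
print(f"AD total-variation distance (per state): {ad_tv:.3f}")
print(f"OM total-variation distance:             {om_tv:.3f}")
```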
no code implementations • 27 Feb 2024 • Leon Lang, Davis Foote, Stuart Russell, Anca Dragan, Erik Jenner, Scott Emmons
Past analyses of reinforcement learning from human feedback (RLHF) assume that the human fully observes the environment.
1 code implementation • 13 Dec 2023 • Cassidy Laidlaw, Banghua Zhu, Stuart Russell, Anca Dragan
Our goal is to explain why deep RL algorithms often perform well in practice, despite using random exploration and much more expressive function classes like neural networks.
no code implementations • 9 Nov 2023 • Joey Hong, Sergey Levine, Anca Dragan
LLMs trained with supervised fine-tuning or "single-step" RL, as in standard RLHF, may struggle with tasks that require such goal-directed behavior, since they are not trained to optimize for overall conversational outcomes across multiple turns of interaction.
no code implementations • 31 Oct 2023 • Joey Hong, Anca Dragan, Sergey Levine
Theoretically, we show that standard offline RL algorithms conditioned on observation histories suffer from poor sample complexity, in accordance with the above intuition.
no code implementations • 26 Oct 2023 • Yoshua Bengio, Geoffrey Hinton, Andrew Yao, Dawn Song, Pieter Abbeel, Yuval Noah Harari, Ya-Qin Zhang, Lan Xue, Shai Shalev-Shwartz, Gillian Hadfield, Jeff Clune, Tegan Maharaj, Frank Hutter, Atılım Güneş Baydin, Sheila McIlraith, Qiqi Gao, Ashwin Acharya, David Krueger, Anca Dragan, Philip Torr, Stuart Russell, Daniel Kahneman, Jan Brauner, Sören Mindermann
In this short consensus paper, we outline risks from upcoming, advanced AI systems.
1 code implementation • 3 Oct 2023 • W. Bradley Knox, Stephane Hatgis-Kessell, Sigurdur Orn Adalgeirsson, Serena Booth, Anca Dragan, Peter Stone, Scott Niekum
Most recent work assumes that human preferences over trajectory segments are generated based only upon the reward accrued within those segments, i.e., their partial return.
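A minimal sketch of that assumed preference model, with a Boltzmann choice over partial returns (illustrative only):

```python
import numpy as np

# The "partial return" preference model: the probability of preferring one
# segment over another depends only on the reward accrued within each segment.

def partial_return(rewards, gamma=1.0):
    return sum(r * gamma**t for t, r in enumerate(rewards))

def p_prefer(seg1, seg2, beta=1.0):
    # Boltzmann (logistic) choice over the two partial returns.
    d = beta * (partial_return(seg1) - partial_return(seg2))
    return 1 / (1 + np.exp(-d))

print(p_prefer([1, 0, 1], [0, 0, 1]))  # higher accrued reward -> preferred more often
```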
no code implementations • 31 Jul 2023 • Jessy Lin, Yuqing Du, Olivia Watkins, Danijar Hafner, Pieter Abbeel, Dan Klein, Anca Dragan
To interact with humans in the world, agents need to understand the diverse types of language that people use, relate them to the visual world, and act based on them.
no code implementations • 27 Jul 2023 • Stephen Casper, Xander Davies, Claudia Shi, Thomas Krendl Gilbert, Jérémy Scheurer, Javier Rando, Rachel Freedman, Tomasz Korbak, David Lindner, Pedro Freire, Tony Wang, Samuel Marks, Charbel-Raphaël Segerie, Micah Carroll, Andi Peng, Phillip Christoffersen, Mehul Damani, Stewart Slocum, Usman Anwar, Anand Siththaranjan, Max Nadeau, Eric J. Michaud, Jacob Pfau, Dmitrii Krasheninnikov, Xin Chen, Lauro Langosco, Peter Hase, Erdem Biyik, Anca Dragan, David Krueger, Dorsa Sadigh, Dylan Hadfield-Menell
Reinforcement learning from human feedback (RLHF) is a technique for training AI systems to align with human goals.
no code implementations • 30 Jun 2023 • Vivek Myers, Andre He, Kuan Fang, Homer Walke, Philippe Hansen-Estruch, Ching-An Cheng, Mihai Jalobeanu, Andrey Kolobov, Anca Dragan, Sergey Levine
Our method achieves robust performance in the real world by learning an embedding from the labeled data that aligns language not to the goal image, but rather to the desired change between the start and goal images that the instruction corresponds to.
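A rough sketch of that alignment objective, with placeholder encoders and dimensions (the actual architecture and loss in the paper may differ): an InfoNCE-style contrastive loss pulls each instruction embedding toward the embedding of its matching start-to-goal visual change.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Linear layers stand in for pretrained image/text encoders; dims are made up.
img_encoder = nn.Linear(512, 128)
txt_encoder = nn.Linear(768, 128)

def alignment_loss(start_emb, goal_emb, instr_emb):
    # Align language to the CHANGE between start and goal, not the goal alone.
    change = F.normalize(img_encoder(goal_emb - start_emb), dim=-1)
    lang = F.normalize(txt_encoder(instr_emb), dim=-1)
    logits = lang @ change.T / 0.07          # temperature-scaled similarities
    targets = torch.arange(len(logits))      # i-th instruction matches i-th change
    return F.cross_entropy(logits, targets)

# Dummy batch: 8 (start, goal, instruction) triples with precomputed features.
start = torch.randn(8, 512); goal = torch.randn(8, 512); instr = torch.randn(8, 768)
print(alignment_loss(start, goal, instr))
```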
no code implementations • 14 Jun 2023 • Minae Kwon, Hengyuan Hu, Vivek Myers, Siddharth Karamcheti, Anca Dragan, Dorsa Sadigh
We additionally illustrate our approach with a robot on 2 carefully designed surfaces.
1 code implementation • NeurIPS 2023 • Cassidy Laidlaw, Stuart Russell, Anca Dragan
Using BRIDGE, we find that prior bounds do not correlate well with when deep RL succeeds vs. fails, but discover a surprising property that does.
1 code implementation • 8 Mar 2023 • Erik Jones, Anca Dragan, Aditi Raghunathan, Jacob Steinhardt
Auditing large language models for unexpected behaviors is critical to preempt catastrophic deployments, yet remains challenging.
no code implementations • 2 Jan 2023 • Ran Tian, Masayoshi Tomizuka, Anca Dragan, Andrea Bajcsy
Interestingly, robot actions influence what this experience is, and therefore influence how people's internal models change.
no code implementations • 9 Dec 2022 • Joey Hong, Kush Bhatia, Anca Dragan
This raises the question: how accurate do these models need to be in order for the reward inference to be accurate?
no code implementations • 30 Nov 2022 • David Zhang, Micah Carroll, Andreea Bobu, Anca Dragan
One of the most successful paradigms for reward learning uses human feedback in the form of comparisons.
1 code implementation • 20 Nov 2022 • Micah Carroll, Orr Paradise, Jessy Lin, Raluca Georgescu, Mingfei Sun, David Bignell, Stephanie Milani, Katja Hofmann, Matthew Hausknecht, Anca Dragan, Sam Devlin
Randomly masking and predicting word tokens has been a successful approach in pre-training language models for a variety of downstream tasks.
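A minimal sketch of the mask-and-predict step itself (the entry above applies the same idea to trajectory sequences in sequential decision problems, with task-specific masking schemes; this toy shows only the corruption step):

```python
import numpy as np

MASK_ID = 0
rng = np.random.default_rng(0)

def mask_tokens(tokens, mask_prob=0.15):
    # Replace a random subset of tokens with a mask id; the model is then
    # trained to reconstruct the original tokens at the masked positions.
    tokens = np.asarray(tokens)
    mask = rng.random(tokens.shape) < mask_prob
    corrupted = np.where(mask, MASK_ID, tokens)
    return corrupted, mask

tokens = rng.integers(1, 100, size=20)
corrupted, mask = mask_tokens(tokens)
print("original: ", tokens)
print("corrupted:", corrupted)
print("targets:  ", tokens[mask])    # prediction targets for the model
```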
no code implementations • 3 Nov 2022 • Mesut Yang, Micah Carroll, Anca Dragan
We show that using optimal behavior as a prior for human models makes these models vastly more data-efficient and able to generalize to new environments.
no code implementations • 28 Apr 2022 • Micah Carroll, Jessy Lin, Orr Paradise, Raluca Georgescu, Mingfei Sun, David Bignell, Stephanie Milani, Katja Hofmann, Matthew Hausknecht, Anca Dragan, Sam Devlin
Randomly masking and predicting word tokens has been a successful approach in pre-training language models for a variety of downstream tasks.
no code implementations • 25 Apr 2022 • Micah Carroll, Anca Dragan, Stuart Russell, Dylan Hadfield-Menell
These steps involve two challenging ingredients. Estimation requires anticipating how hypothetical algorithms would influence user preferences if deployed; we do this by using historical user interaction data to train a predictive user model that implicitly captures their preference dynamics. Evaluation and optimization additionally require metrics to assess whether such influences are manipulative or otherwise unwanted; for this we use the notion of "safe shifts", which define a trust region within which behavior is safe. For instance, the natural way in which users would shift without interference from the system could be deemed "safe". A rough sketch of this check appears below.
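A highly simplified sketch of the "safe shifts" check; the user-model dynamics, thresholds, and names here are all hypothetical stand-ins for a learned predictive user model:

```python
import numpy as np

rng = np.random.default_rng(0)

def simulate_shift(prefs, policy_influence, steps=50):
    # Stand-in for rolling out a learned predictive user model under a policy.
    for _ in range(steps):
        prefs = prefs + 0.01 * rng.normal(size=prefs.shape) + policy_influence
    return prefs

initial = np.array([0.5, 0.5])
natural = simulate_shift(initial, policy_influence=0.0)   # drift with no system interference
candidate = simulate_shift(initial, policy_influence=np.array([0.02, -0.02]))

shift = np.linalg.norm(candidate - natural)
TRUST_RADIUS = 0.1                                        # hypothetical trust-region size
verdict = "safe" if shift <= TRUST_RADIUS else "flagged as potentially manipulative"
print(f"shift from natural drift: {shift:.3f} -> {verdict}")
```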
1 code implementation • ICLR 2022 • Cassidy Laidlaw, Anca Dragan
However, these models fail when humans exhibit systematic suboptimality, i.e., when their deviations from optimal behavior are not independent, but instead consistent over time.
1 code implementation • ACL 2022 • Jessy Lin, Daniel Fried, Dan Klein, Anca Dragan
In classic instruction following, language like "I'd like the JetBlue flight" maps to actions (e.g., selecting that flight).
no code implementations • 12 Nov 2021 • Lawrence Chan, Andrew Critch, Anca Dragan
More importantly, we show that an irrational human, when correctly modelled, can communicate more information about the reward than a perfectly rational human can.
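A toy illustration of why this can hold, assuming Boltzmann noise with a known rationality coefficient (not the paper's full analysis): a perfectly rational agent's choices only reveal which option is better, while a noisy agent's choice frequencies also reveal by how much.

```python
import numpy as np

rng = np.random.default_rng(0)
reward_gap = 0.7                       # true R(a) - R(b); what the learner wants
beta = 1.0                             # known rationality coefficient

# Simulated choices from a Boltzmann-noisy human choosing between a and b.
choices = rng.random(1000) < 1 / (1 + np.exp(-beta * reward_gap))

# Under a perfect-rationality model, any majority of `a` choices is consistent
# with ANY positive gap: the learner only learns gap > 0.
print("rational model learns: gap > 0")

# Under the correct Boltzmann model, inverting the empirical choice frequency
# recovers the gap's magnitude as well.
p_hat = choices.mean()
gap_estimate = np.log(p_hat / (1 - p_hat)) / beta
print(f"Boltzmann model learns: gap ~= {gap_estimate:.2f} (true {reward_gap})")
```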
1 code implementation • 4 Nov 2021 • Kimin Lee, Laura Smith, Anca Dragan, Pieter Abbeel
However, it is difficult to quantify the progress in preference-based RL due to the lack of a commonly adopted benchmark.
no code implementations • 5 Jul 2021 • Rohin Shah, Cody Wild, Steven H. Wang, Neel Alex, Brandon Houghton, William Guss, Sharada Mohanty, Anssi Kanervisto, Stephanie Milani, Nicholay Topin, Pieter Abbeel, Stuart Russell, Anca Dragan
Rather than training AI systems using a predefined reward function or a labeled dataset with a predefined set of categories, we train the AI system using a learning signal derived from some form of human feedback, which can evolve over time as the understanding of the task changes or as the capabilities of the AI system improve.
1 code implementation • ICLR 2021 • David Lindner, Rohin Shah, Pieter Abbeel, Anca Dragan
Since reward functions are hard to specify, recent work has focused on learning policies from human feedback.
no code implementations • 19 Jan 2021 • Rachel Freedman, Rohin Shah, Anca Dragan
A promising alternative to manually specifying reward functions is to enable robots to infer them from human feedback, like demonstrations or corrections.
no code implementations • 1 Jan 2021 • Lawrence Chan, Andrew Critch, Anca Dragan
Surprisingly, we find that if we give the learner access to the correct model of the demonstrator's irrationality, these irrationalities can actually help reward inference.
no code implementations • ICLR 2021 • Jensen Gao, Siddharth Reddy, Glen Berseth, Nicholas Hardy, Nikhilesh Natraj, Karunesh Ganguly, Anca Dragan, Sergey Levine
In the typing domain, we leverage backspaces as implicit feedback that the interface did not perform the desired action.
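A minimal sketch of mining that signal; the event schema and labels here are hypothetical, showing only the idea that an interface action the user immediately deletes is treated as not the intended one:

```python
def label_with_backspaces(events):
    """events: list of dicts like {"action": ..., "key": "a" | "<backspace>"}."""
    labels = []
    for prev, nxt in zip(events, events[1:]):
        if prev["key"] != "<backspace>":
            # Implicit feedback: reward 0 if the user deleted the interface's
            # output on the very next keystroke, else 1.
            labels.append((prev["action"], 0.0 if nxt["key"] == "<backspace>" else 1.0))
    return labels

events = [
    {"action": "type_h", "key": "h"},
    {"action": "type_x", "key": "x"},
    {"action": "delete", "key": "<backspace>"},
    {"action": "type_i", "key": "i"},
]
print(label_with_backspaces(events))
# [('type_h', 1.0), ('type_x', 0.0)]
```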
no code implementations • 1 Jan 2021 • Rohin Shah, Pedro Freire, Neel Alex, Rachel Freedman, Dmitrii Krasheninnikov, Lawrence Chan, Michael D Dennis, Pieter Abbeel, Anca Dragan, Stuart Russell
By merging reward learning and control, assistive agents can reason about the impact of control actions on reward learning, leading to several advantages over agents based on reward learning.
1 code implementation • NeurIPS 2020 • Yuqing Du, Stas Tiomkin, Emre Kiciman, Daniel Polani, Pieter Abbeel, Anca Dragan
One difficulty in using artificial agents for human-assistive applications lies in the challenge of accurately assisting with a person's goal(s).
2 code implementations • NeurIPS 2019 • Micah Carroll, Rohin Shah, Mark K. Ho, Thomas L. Griffiths, Sanjit A. Seshia, Pieter Abbeel, Anca Dragan
While we would like agents that can coordinate with humans, current algorithms such as self-play and population-based training create agents that can coordinate with themselves.
no code implementations • ICLR 2019 • Kelvin Xu, Ellis Ratner, Anca Dragan, Sergey Levine, Chelsea Finn
A significant challenge for the practical application of reinforcement learning to real-world problems is the need to specify an oracle reward function that correctly defines a task.
1 code implementation • ICLR 2019 • Rohin Shah, Dmitrii Krasheninnikov, Jordan Alexander, Pieter Abbeel, Anca Dragan
We find that information from the initial state can be used to infer both side effects that should be avoided as well as preferences for how the environment should be organized.
1 code implementation • 24 Jan 2019 • Lawrence Chan, Dylan Hadfield-Menell, Siddhartha Srinivasa, Anca Dragan
Learning preferences implicit in the choices humans make is a well studied problem in both economics and computer science.
no code implementations • 4 Jan 2019 • Gokul Swamy, Jens Schulz, Rohan Choudhury, Dylan Hadfield-Menell, Anca Dragan
Fundamental to robotics is the debate between model-based and model-free learning: should the robot build an explicit model of the world, or learn a policy directly?
no code implementations • 31 May 2018 • Kelvin Xu, Ellis Ratner, Anca Dragan, Sergey Levine, Chelsea Finn
A significant challenge for the practical application of reinforcement learning in the real world is the need to specify an oracle reward function that correctly defines a task.
1 code implementation • NeurIPS 2017 • Dylan Hadfield-Menell, Smitha Milli, Pieter Abbeel, Stuart Russell, Anca Dragan
When designing the reward, we might think of some specific training scenarios, and make sure that the reward will lead to the right behavior in those scenarios.
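A toy sketch of the resulting inference problem, treating the designer's proxy reward as evidence about the true reward given the training environment. The feature values and the simplified Boltzmann-style likelihood here are illustrative, not the paper's exact formulation:

```python
import numpy as np

# Candidate trajectories in the TRAINING environment, as feature counts over
# [dirt_cleaned, lava_crossed]; crucially, lava never appears in training.
train_phi = np.array([[3.0, 0.0],   # clean lots of dirt
                      [1.0, 0.0]])  # clean little dirt

proxy_w = np.array([1.0, 0.0])      # designer only specified "dirt is good"
beta = 2.0

def proxy_likelihood(true_w):
    # Simplified likelihood: how plausible is it that a designer with true
    # reward true_w would pick a proxy whose optimal training trajectory is
    # `best`? (Boltzmann choice over training trajectories.)
    best = train_phi[np.argmax(train_phi @ proxy_w)]
    logits = beta * (train_phi @ true_w)
    return np.exp(beta * true_w @ best) / np.exp(logits).sum()

# Hypotheses about the true reward: lava may be neutral or very bad.
hypotheses = [np.array([1.0, 0.0]), np.array([1.0, -10.0])]
post = np.array([proxy_likelihood(w) for w in hypotheses])
post /= post.sum()
print(dict(zip(["lava neutral", "lava bad"], post.round(3))))
# Both hypotheses explain the proxy equally well, since lava never appeared in
# training -> the agent should stay uncertain (and cautious) about lava.
```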
1 code implementation • 28 May 2017 • Smitha Milli, Dylan Hadfield-Menell, Anca Dragan, Stuart Russell
We show that when a human is not perfectly rational then a robot that tries to infer and act according to the human's underlying preferences can always perform better than a robot that simply follows the human's literal order.
1 code implementation • ACL 2017 • Jacob Andreas, Anca Dragan, Dan Klein
Several approaches have recently been proposed for learning decentralized deep multiagent policies that coordinate via a differentiable communication channel.
2 code implementations • 27 Mar 2017 • Michael Laskey, Jonathan Lee, Roy Fox, Anca Dragan, Ken Goldberg
One approach to Imitation Learning is Behavior Cloning, in which a robot observes a supervisor and infers a control policy.
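A minimal Behavior Cloning sketch with a linear supervisor (DART, the approach from this line of work, additionally injects optimized noise into the supervisor's demonstrations so the learner sees recovery behavior; that step is omitted here):

```python
import numpy as np

rng = np.random.default_rng(0)
obs = rng.normal(size=(500, 4))
K_supervisor = rng.normal(size=(4, 2))
actions = obs @ K_supervisor + 0.01 * rng.normal(size=(500, 2))  # supervisor demos

# Behavior Cloning = supervised regression from observations to actions.
K_policy, *_ = np.linalg.lstsq(obs, actions, rcond=None)
print("max weight error:", np.abs(K_policy - K_supervisor).max())
```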
no code implementations • 24 Nov 2016 • Dylan Hadfield-Menell, Anca Dragan, Pieter Abbeel, Stuart Russell
We analyze a simple game between a human H and a robot R, where H can press R's off switch but R can disable the off switch.
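A toy expected-utility computation consistent with the excerpt (a numerical sketch, not the paper's general analysis): when R is uncertain about the human's utility U for its proposed action, deferring to a rational H who vetoes bad actions weakly dominates acting unilaterally.

```python
import numpy as np

rng = np.random.default_rng(0)
U = rng.normal(loc=0.2, scale=1.0, size=100_000)   # R's belief over H's utility

act_now = U.mean()                 # disable the off switch and act: get E[U]
defer = np.maximum(U, 0).mean()    # leave the switch on; rational H vetoes when U < 0
print(f"E[act] = {act_now:.3f}, E[defer] = {defer:.3f}")  # defer >= act
```

The gap between the two values comes entirely from R's uncertainty about U, which is what gives R a positive incentive to keep its off switch enabled.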
no code implementations • 4 Oct 2016 • Michael Laskey, Caleb Chuck, Jonathan Lee, Jeffrey Mahler, Sanjay Krishnan, Kevin Jamieson, Anca Dragan, Ken Goldberg
Although policies learned with robot-centric (RC) sampling can be superior to those learned with human-centric (HC) sampling for standard learning models such as linear SVMs, policies learned with HC sampling may be comparable when using highly expressive learning models such as deep learning and hyper-parametric decision trees, which have little model error.
2 code implementations • NeurIPS 2016 • Dylan Hadfield-Menell, Anca Dragan, Pieter Abbeel, Stuart Russell
For an autonomous system to be helpful to humans and to pose no unwarranted risks, it needs to align its values with those of the humans in its environment in such a way that its actions contribute to the maximization of value for the humans.