no code implementations • 14 Dec 2023 • Collin Burns, Pavel Izmailov, Jan Hendrik Kirchner, Bowen Baker, Leo Gao, Leopold Aschenbrenner, Yining Chen, Adrien Ecoffet, Manas Joglekar, Jan Leike, Ilya Sutskever, Jeff Wu
Widely used alignment techniques, such as reinforcement learning from human feedback (RLHF), rely on the ability of humans to supervise model behavior: for example, to evaluate whether a model faithfully followed instructions or generated safe outputs.
1 code implementation • 7 Dec 2022 • Collin Burns, Haotian Ye, Dan Klein, Jacob Steinhardt
Existing techniques for training language models can be misaligned with the truth: if we train models with imitation learning, they may reproduce errors that humans make; if we train them to generate text that humans rate highly, they may output errors that human evaluators can't detect.
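This entry corresponds to the contrast-consistent search (CCS) approach, which instead probes a model's internal activations for truth-like structure without supervision. A minimal sketch of the CCS objective, assuming a probe has already mapped the hidden states of a statement and its negation to probabilities `p_pos` and `p_neg` (the probe and its training loop are omitted):

```python
import torch

def ccs_loss(p_pos: torch.Tensor, p_neg: torch.Tensor) -> torch.Tensor:
    """p_pos, p_neg: probe probabilities on a contrast pair (statement, negation)."""
    # Consistency: the probe should satisfy P(true | x+) = 1 - P(true | x-).
    consistency = (p_pos - (1.0 - p_neg)) ** 2
    # Confidence: rules out the degenerate solution p_pos = p_neg = 0.5.
    confidence = torch.minimum(p_pos, p_neg) ** 2
    return (consistency + confidence).mean()
```

The consistency term requires a statement and its negation to receive complementary probabilities, while the confidence term excludes the trivial probe that outputs 0.5 everywhere.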
3 code implementations • 20 May 2021 • Dan Hendrycks, Steven Basart, Saurav Kadavath, Mantas Mazeika, Akul Arora, Ethan Guo, Collin Burns, Samir Puranik, Horace He, Dawn Song, Jacob Steinhardt
Recent models such as GPT-Neo can pass approximately 20% of the test cases of introductory problems, indicating that machine learning models are now beginning to learn how to code.
Ranked #8 on Code Generation on APPS
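The pass-rate metric above is functional correctness: run each generated program against held-out input/output pairs and count the fraction it passes. A hedged sketch of such a checker (names are illustrative; this is not the official APPS harness):

```python
import subprocess

def test_case_pass_rate(program: str, tests: list[tuple[str, str]], timeout: float = 4.0) -> float:
    """program: path to a generated Python file; tests: (stdin, expected stdout) pairs."""
    passed = 0
    for stdin_data, expected in tests:
        try:
            result = subprocess.run(
                ["python", program],
                input=stdin_data,
                capture_output=True,
                text=True,
                timeout=timeout,
            )
            # A test case passes if the program exits cleanly with the expected output.
            if result.returncode == 0 and result.stdout.strip() == expected.strip():
                passed += 1
        except subprocess.TimeoutExpired:
            continue  # a non-terminating program fails the test case
    return passed / len(tests) if tests else 0.0
```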
2 code implementations • 10 Mar 2021 • Dan Hendrycks, Collin Burns, Anya Chen, Spencer Ball
We address this bottleneck within the legal domain by introducing the Contract Understanding Atticus Dataset (CUAD), a new dataset for legal contract review.
1 code implementation • CVPR 2021 • Collin Burns, Jacob Steinhardt
Feature alignment is an approach to improving robustness to distribution shift that matches the distribution of feature activations between the training distribution and test distribution.
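As one concrete, simplified instantiation of this idea, feature statistics computed on training data can be matched against those computed on test data; the sketch below penalizes gaps in the first two moments, though the paper's exact alignment procedure may differ:

```python
import torch

def moment_matching_penalty(train_feats: torch.Tensor, test_feats: torch.Tensor) -> torch.Tensor:
    """train_feats, test_feats: (batch, dim) activations from the same layer."""
    # Penalize gaps between the first two moments of the activation distributions.
    mean_gap = (train_feats.mean(dim=0) - test_feats.mean(dim=0)).pow(2).sum()
    var_gap = (train_feats.var(dim=0) - test_feats.var(dim=0)).pow(2).sum()
    return mean_gap + var_gap
```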
4 code implementations • 5 Mar 2021 • Dan Hendrycks, Collin Burns, Saurav Kadavath, Akul Arora, Steven Basart, Eric Tang, Dawn Song, Jacob Steinhardt
To facilitate future research and increase accuracy on MATH, we also contribute a large auxiliary pretraining dataset which helps teach models the fundamentals of mathematics.
Ranked #96 on Math Word Problem Solving on MATH
12 code implementations • 7 Sep 2020 • Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, Jacob Steinhardt
By comprehensively evaluating the breadth and depth of a model's academic and professional understanding, our test can be used to analyze models across many tasks and to identify important shortcomings.
Ranked #60 on Multi-task Language Understanding on MMLU
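MMLU-style evaluation is typically multiple choice: the model's answer is the option letter assigned the highest log-probability as a continuation of the question prompt, and accuracy is averaged across tasks. A hedged sketch, where `choice_logprob` is an assumed helper rather than a real API:

```python
CHOICES = ("A", "B", "C", "D")

def answer_question(prompt: str, choice_logprob) -> str:
    """prompt: few-shot examples plus the question, ending in 'Answer:'.
    choice_logprob(prompt, continuation) -> float is assumed, not a real API."""
    scores = {c: choice_logprob(prompt, " " + c) for c in CHOICES}
    return max(scores, key=scores.get)
```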
2 code implementations • 5 Aug 2020 • Dan Hendrycks, Collin Burns, Steven Basart, Andrew Critch, Jerry Li, Dawn Song, Jacob Steinhardt
We show how to assess a language model's knowledge of basic concepts of morality.
Ranked #1 on Average on hendrycks2020ethics
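A hedged sketch of the simplest form this assessment can take, binary accuracy on scenarios labeled morally acceptable or not (the dataset format here is assumed for illustration):

```python
def binary_accuracy(predict, scenarios: list[str], labels: list[int]) -> float:
    """predict: callable mapping a scenario to 0 (acceptable) or 1 (unacceptable)."""
    correct = sum(int(predict(s) == y) for s, y in zip(scenarios, labels))
    return correct / len(labels)
```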
no code implementations • 7 Jul 2020 • Alexandr Andoni, Collin Burns, Yi Li, Sepideh Mahabadi, David P. Woodruff
We show that for both problems, in dimensions $d=1, 2$, one can obtain streaming algorithms with space polynomially smaller than $\frac{1}{\lambda\epsilon}$, the complexity of SGD for strongly convex functions such as the bias-regularized SVM, which is known to be tight in general even for $d=1$.
1 code implementation • 29 Mar 2019 • Collin Burns, Jesse Thomason, Wesley Tansey
In science and medicine, model interpretations may be reported as discoveries of natural phenomena or used to guide patient treatments.