Search Results for author: Simon Lermen

Found 4 papers, 1 paper with code

Exploring the Robustness of Model-Graded Evaluations and Automated Interpretability

no code implementations • 26 Nov 2023 • Simon Lermen, Ondřej Kvapil

There has been increasing interest in evaluations of language models for a variety of risks and characteristics.

LoRA Fine-tuning Efficiently Undoes Safety Training in Llama 2-Chat 70B

no code implementations • 31 Oct 2023 • Simon Lermen, Charlie Rogers-Smith, Jeffrey Ladish

With a budget of less than $200 and using only one GPU, we successfully undo the safety training of Llama 2-Chat models of sizes 7B, 13B, and 70B and on the Mixtral instruct model.

BadLlama: cheaply removing safety fine-tuning from Llama 2-Chat 13B

no code implementations • 31 Oct 2023 • Pranav Gade, Simon Lermen, Charlie Rogers-Smith, Jeffrey Ladish

Llama 2-Chat is a collection of large language models that Meta developed and released to the public.

Evaluating Shutdown Avoidance of Language Models in Textual Scenarios

1 code implementation • 3 Jul 2023 • Teun van der Weij, Simon Lermen, Leon Lang

Recently, there has been an increase in interest in evaluating large language models for emergent and dangerous capabilities.
