Search Results for author: Simon Lermen

Found 4 papers, 1 paper with code

Exploring the Robustness of Model-Graded Evaluations and Automated Interpretability

no code implementations • 26 Nov 2023 • Simon Lermen, Ondřej Kvapil

There has been increasing interest in evaluations of language models for a variety of risks and characteristics.

Natural Language Understanding

LoRA Fine-tuning Efficiently Undoes Safety Training in Llama 2-Chat 70B

no code implementations • 31 Oct 2023 • Simon Lermen, Charlie Rogers-Smith, Jeffrey Ladish

With a budget of less than $200 and using only one GPU, we successfully undo the safety training of Llama 2-Chat models of sizes 7B, 13B, and 70B and on the Mixtral instruct model.

BadLlama: cheaply removing safety fine-tuning from Llama 2-Chat 13B

no code implementations • 31 Oct 2023 • Pranav Gade, Simon Lermen, Charlie Rogers-Smith, Jeffrey Ladish

Llama 2-Chat is a collection of large language models that Meta developed and released to the public.

Evaluating Shutdown Avoidance of Language Models in Textual Scenarios

1 code implementation • 3 Jul 2023 • Teun van der Weij, Simon Lermen, Leon Lang

Recently, there has been an increase in interest in evaluating large language models for emergent and dangerous capabilities.
