Search Results for author: Arjun Arunasalam

Rethinking How to Evaluate Language Model Jailbreak

We evaluate our metrics on a benchmark dataset produced from three malicious intent datasets and three jailbreak systems.

In this paper, we measure their ability to refute popular S&P misconceptions that the general public holds.