The HH Red Teaming dataset comprises two distinct types of data, each serving a unique purpose.
Details about the data collection process and crowdworker population can be found in the paper¹.
Red Teaming Data:
- `transcript`: A text transcript of a conversation between a human adversary (red team member) and an AI assistant.
- `min_harmlessness_score_transcript`: A real-valued score indicating the harmlessness of the AI assistant (lower scores imply more harm).
- `num_params`: The number of parameters in the language model powering the AI assistant.
- `model_type`: The type of model powering the AI assistant.
- `rating`: The red team member's rating of how successful they were at breaking the AI assistant (Likert scale; higher ratings indicate more success)¹².

Please note that the data may contain content that could be offensive or upsetting, including discussions of abuse, violence, and other sensitive topics. Researchers should engage with the data responsibly and in accordance with their own risk tolerance¹.
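For illustration, the sketch below loads the red teaming data with the Hugging Face `datasets` library and inspects the fields listed above. It assumes the data is published under the `red-team-attempts` directory of the `Anthropic/hh-rlhf` repository³; adjust the `data_dir` if the repository layout differs.

```python
# Minimal sketch: load the red teaming transcripts and inspect one record.
# Assumes the red teaming data lives under "red-team-attempts" in the
# Anthropic/hh-rlhf repository on Hugging Face.
from datasets import load_dataset

red_team = load_dataset(
    "Anthropic/hh-rlhf",
    data_dir="red-team-attempts",
    split="train",
)

example = red_team[0]
print(example["transcript"][:500])                   # conversation text (truncated)
print(example["min_harmlessness_score_transcript"])  # lower scores imply more harm
print(example["num_params"], example["model_type"])  # metadata about the assistant's model
print(example["rating"])                             # red team member's success rating
```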
References:
1. GitHub - anthropics/hh-rlhf: Human preference data for "Training a ...". https://github.com/anthropics/hh-rlhf
2. Trelis/hh-rlhf-dpo · Datasets at Hugging Face. https://huggingface.co/datasets/Trelis/hh-rlhf-dpo
3. Anthropic/hh-rlhf at main - Hugging Face. https://huggingface.co/datasets/Anthropic/hh-rlhf/tree/main