The HH Red Teaming dataset comprises two distinct types of data, each serving a unique purpose.
Details about the data collection process and crowdworker population can be found in the paper¹.
Red Teaming Data:
- `transcript`: A text transcript of a conversation between a human adversary (red team member) and an AI assistant.
- `min_harmlessness_score_transcript`: A real-valued score indicating the harmlessness of the AI assistant (lower scores imply more harm).
- `num_params`: The number of parameters in the language model powering the AI assistant.
- `model_type`: The type of model powering the AI assistant.
- `rating`: The red team member's rating of how successful they were at breaking the AI assistant (Likert scale; higher ratings indicate more success)¹².

Please note that the data may contain content that could be offensive or upsetting, including discussions of abuse, violence, and other sensitive topics. Researchers should engage with the data responsibly and in accordance with their own risk tolerance¹.
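For illustration, the sketch below loads the red teaming data with the Hugging Face `datasets` library and inspects the fields listed above. It assumes the data is published under the `red-team-attempts` directory of the `Anthropic/hh-rlhf` repository³; adjust the `data_dir` if the repository layout differs.

```python
# Minimal sketch: load the red teaming transcripts and inspect one record.
# Assumes the red teaming data lives under "red-team-attempts" in the
# Anthropic/hh-rlhf repository on Hugging Face.
from datasets import load_dataset

red_team = load_dataset(
    "Anthropic/hh-rlhf",
    data_dir="red-team-attempts",
    split="train",
)

example = red_team[0]
print(example["transcript"][:500])                   # conversation text (truncated)
print(example["min_harmlessness_score_transcript"])  # lower scores imply more harm
print(example["num_params"], example["model_type"])  # metadata about the assistant's model
print(example["rating"])                             # red team member's success rating
```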
References:
1. GitHub - anthropics/hh-rlhf: Human preference data for "Training a ...". https://github.com/anthropics/hh-rlhf
2. Trelis/hh-rlhf-dpo · Datasets at Hugging Face. https://huggingface.co/datasets/Trelis/hh-rlhf-dpo
3. Anthropic/hh-rlhf at main - Hugging Face. https://huggingface.co/datasets/Anthropic/hh-rlhf/tree/main