Model Card and Evaluations for Claude Models
This report includes the model card [1] for Claude models, focusing on Claude 2, along with the results of a range of safety, alignment, and capabilities evaluations. We have been iterating on the training and evaluation of Claude-type models since our first work on Reinforcement Learning from Human Feedback (RLHF) [2]; the newest Claude 2 model represents a continuous evolution from those early and less capable “helpful and harmless” language assistants. This report is not intended to be a scientific paper, since most aspects of training and evaluating these models have been documented in our research papers. These include papers on preference modeling [3], reinforcement learning from human feedback for helpful and harmless models [2], red teaming language models [4], measuring representation of subjective global values in language models [5], honesty (i.e., exploring language models’ ability to recognize what they know) [6], evaluating language models with language model-generated tests [7], moral self-correction [8], and Constitutional AI [9]. We also discussed Claude’s specific constitution in a recent blog post [10]. Our work using human evaluations to test model safety is most thoroughly documented in our paper “Red-Teaming Language Models to Reduce Harms” [4], while our recent work on automated safety evaluation is “Discovering Language Model Behaviors with Model-Written Evaluations” [7]. This report is also not comprehensive; we expect to release new findings as we continue our research and evaluations of frontier models. However, we hope it provides useful insight into Claude 2’s capabilities and limitations.
Task | Dataset | Model | Metric Name | Metric Value | Global Rank
---|---|---|---|---|---
Common Sense Reasoning | ARC (Challenge) | Claude Instant 1.1 (few-shot, k=5) | Accuracy (%) | 85.7 | # 11
Common Sense Reasoning | ARC (Challenge) | Claude 1.3 (few-shot, k=5) | Accuracy (%) | 90 | # 4
Common Sense Reasoning | ARC (Challenge) | Claude 2 (few-shot, k=5) | Accuracy (%) | 91 | # 3
Arithmetic Reasoning | GSM8K | Claude Instant 1.1 (0-shot chain-of-thought) | Accuracy (%) | 80.9 | # 60
Arithmetic Reasoning | GSM8K | Claude 1.3 (0-shot chain-of-thought) | Accuracy (%) | 85.2 | # 40
Arithmetic Reasoning | GSM8K | Claude 2 (0-shot chain-of-thought) | Accuracy (%) | 88 | # 29
Code Generation | HumanEval | Claude Instant 1.1 | Pass@1 (%) | 52.8 | # 41
Code Generation | HumanEval | Claude 1.3 | Pass@1 (%) | 56 | # 38
Code Generation | HumanEval | Claude 2 | Pass@1 (%) | 71.2 | # 22
Multi-task Language Understanding | MMLU | Claude Instant 1.1 (5-shot) | Average (%) | 73.4 | # 22
Multi-task Language Understanding | MMLU | Claude 1.3 (5-shot) | Average (%) | 77 | # 14
Multi-task Language Understanding | MMLU | Claude 2 (5-shot) | Average (%) | 78.5 | # 11
Question Answering | QuALITY | Claude Instant 1.1 (5-shot) | Accuracy (%) | 80.5 | # 4
Question Answering | QuALITY | Claude 1.3 (5-shot) | Accuracy (%) | 84.1 | # 1
Question Answering | QuALITY | Claude 2 (5-shot) | Accuracy (%) | 83.2 | # 2
Bug Fixing | SWE-bench | Claude 2 | Resolved, unassisted (%) | 1.96 | # 1
Bug Fixing | SWE-bench | Claude 2 | Resolved, assisted (%) | 4.80 | # 1
Question Answering | TriviaQA | Claude Instant 1.1 (few-shot, k=5) | EM (%) | 78.9 | # 9
Question Answering | TriviaQA | Claude 1.3 (few-shot, k=5) | EM (%) | 86.7 | # 3
Question Answering | TriviaQA | Claude 2 (few-shot, k=5) | EM (%) | 87.5 | # 1
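The HumanEval rows above report Pass@1, the fraction of problems for which a generated completion passes the benchmark's unit tests. For reference, the sketch below shows the standard unbiased pass@k estimator from the original HumanEval paper (Chen et al., 2021), which is how such numbers are commonly computed; it is not a description of our exact evaluation harness, and the function and variable names are illustrative.

```python
import numpy as np

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased estimator of pass@k (Chen et al., 2021).

    n: total completions sampled per problem
    c: number of those completions that pass all unit tests
    k: budget of completions considered
    """
    if n - c < k:
        # Every size-k subset must contain at least one passing completion.
        return 1.0
    # 1 minus the probability that all k drawn completions fail:
    # prod_{i = n-c+1}^{n} (1 - k / i)
    return 1.0 - np.prod(1.0 - k / np.arange(n - c + 1, n + 1))

# Example: 200 samples per problem, 105 passing -> pass@1 = 105/200 = 0.525
print(round(pass_at_k(n=200, c=105, k=1), 3))
```

For k=1 the estimator reduces to c/n, the plain fraction of passing samples, so reporting Pass@1 with many samples per problem simply averages that fraction across problems.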
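The TriviaQA rows report EM (exact match). As a rough illustration only, and not our evaluation harness, EM for open-domain QA is typically computed by normalizing both the prediction and each reference alias (lowercasing, stripping punctuation and English articles, collapsing whitespace) before comparing; the helper names below are hypothetical.

```python
import re
import string

def normalize(text: str) -> str:
    """SQuAD-style answer normalization: lowercase, drop punctuation,
    drop English articles, and collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())

def exact_match(prediction: str, aliases: list[str]) -> bool:
    """EM: the normalized prediction must equal some normalized alias."""
    pred = normalize(prediction)
    return any(pred == normalize(alias) for alias in aliases)

# Example: counts as a match despite casing, punctuation, and the article.
print(exact_match("The Eiffel Tower.", ["Eiffel Tower"]))  # True
```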