Model Card and Evaluations for Claude Models
This report includes the model card [1] for Claude models, focusing on Claude 2, along with the results of a range of safety, alignment, and capabilities evaluations. We have been iterating on the training and evaluation of Claude-type models since our first work on Reinforcement Learning from Human Feedback (RLHF) [2]; the newest Claude 2 model represents a continuous evolution from those early and less capable “helpful and harmless” language assistants. This report is not intended to be a scientific paper, since most aspects of training and evaluating these models have been documented in our research papers. These include papers on preference modeling [3], reinforcement learning from human feedback for helpful and harmless models [2], red teaming language models [4], measuring representation of subjective global values in language models [5], honesty (i.e., exploring language models’ ability to recognize what they know) [6], evaluating language models with language model-generated tests [7], moral self-correction [8], and Constitutional AI [9]. We also discussed Claude’s specific constitution in a recent blog post [10]. Our work using human evaluations to test model safety is most thoroughly documented in our paper “Red-Teaming Language Models to Reduce Harms” [4], while our recent work on automated safety evaluation is “Discovering Language Model Behaviors with Model-Written Evaluations” [7]. This report is also not comprehensive; we expect to release new findings as we continue our research and evaluations of frontier models. However, we hope it provides useful insight into Claude 2’s capabilities and limitations.
Task | Dataset | Model | Metric Name | Metric Value | Global Rank
---|---|---|---|---|---
Common Sense Reasoning | ARC (Challenge) | Claude Instant 1.1 (few-shot, k=5) | Accuracy (%) | 85.7 | # 11
Common Sense Reasoning | ARC (Challenge) | Claude 1.3 (few-shot, k=5) | Accuracy (%) | 90 | # 4
Common Sense Reasoning | ARC (Challenge) | Claude 2 (few-shot, k=5) | Accuracy (%) | 91 | # 3
Arithmetic Reasoning | GSM8K | Claude Instant 1.1 (0-shot chain-of-thought) | Accuracy (%) | 80.9 | # 60
Arithmetic Reasoning | GSM8K | Claude 1.3 (0-shot chain-of-thought) | Accuracy (%) | 85.2 | # 40
Arithmetic Reasoning | GSM8K | Claude 2 (0-shot chain-of-thought) | Accuracy (%) | 88 | # 29
Code Generation | HumanEval | Claude Instant 1.1 | Pass@1 (%) | 52.8 | # 41
Code Generation | HumanEval | Claude 1.3 | Pass@1 (%) | 56 | # 38
Code Generation | HumanEval | Claude 2 | Pass@1 (%) | 71.2 | # 22
Multi-task Language Understanding | MMLU | Claude Instant 1.1 (5-shot) | Average (%) | 73.4 | # 22
Multi-task Language Understanding | MMLU | Claude 1.3 (5-shot) | Average (%) | 77 | # 14
Multi-task Language Understanding | MMLU | Claude 2 (5-shot) | Average (%) | 78.5 | # 11
Question Answering | QuALITY | Claude Instant 1.1 (5-shot) | Accuracy (%) | 80.5 | # 4
Question Answering | QuALITY | Claude 1.3 (5-shot) | Accuracy (%) | 84.1 | # 1
Question Answering | QuALITY | Claude 2 (5-shot) | Accuracy (%) | 83.2 | # 2
Bug Fixing | SWE-bench | Claude 2 | Resolved, unassisted (%) | 1.96 | # 1
Bug Fixing | SWE-bench | Claude 2 | Resolved, assisted (%) | 4.80 | # 1
Question Answering | TriviaQA | Claude Instant 1.1 (few-shot, k=5) | EM (%) | 78.9 | # 9
Question Answering | TriviaQA | Claude 1.3 (few-shot, k=5) | EM (%) | 86.7 | # 3
Question Answering | TriviaQA | Claude 2 (few-shot, k=5) | EM (%) | 87.5 | # 1
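The HumanEval rows above report Pass@1, the fraction of problems for which a generated completion passes the benchmark's unit tests. For reference, the sketch below shows the standard unbiased pass@k estimator from the original HumanEval paper (Chen et al., 2021), which is how such numbers are commonly computed; it is not a description of our exact evaluation harness, and the function and variable names are illustrative.

```python
import numpy as np

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased estimator of pass@k (Chen et al., 2021).

    n: total completions sampled per problem
    c: number of those completions that pass all unit tests
    k: budget of completions considered
    """
    if n - c < k:
        # Every size-k subset must contain at least one passing completion.
        return 1.0
    # 1 minus the probability that all k drawn completions fail:
    # prod_{i = n-c+1}^{n} (1 - k / i)
    return 1.0 - np.prod(1.0 - k / np.arange(n - c + 1, n + 1))

# Example: 200 samples per problem, 105 passing -> pass@1 = 105/200 = 0.525
print(round(pass_at_k(n=200, c=105, k=1), 3))
```

For k=1 the estimator reduces to c/n, the plain fraction of passing samples, so reporting Pass@1 with many samples per problem simply averages that fraction across problems.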
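The TriviaQA rows report EM (exact match). As a rough illustration only, and not our evaluation harness, EM for open-domain QA is typically computed by normalizing both the prediction and each reference alias (lowercasing, stripping punctuation and English articles, collapsing whitespace) before comparing; the helper names below are hypothetical.

```python
import re
import string

def normalize(text: str) -> str:
    """SQuAD-style answer normalization: lowercase, drop punctuation,
    drop English articles, and collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())

def exact_match(prediction: str, aliases: list[str]) -> bool:
    """EM: the normalized prediction must equal some normalized alias."""
    pred = normalize(prediction)
    return any(pred == normalize(alias) for alias in aliases)

# Example: counts as a match despite casing, punctuation, and the article.
print(exact_match("The Eiffel Tower.", ["Eiffel Tower"]))  # True
```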