TASK	DATASET	MODEL	METRIC NAME	METRIC VALUE	GLOBAL RANK
Arithmetic Reasoning	GSM8K	Phi-GSM+V 1.3B+1.3B (verify48@1)	Accuracy	81.5	# 59
Arithmetic Reasoning	GSM8K	Phi-GSM+V 1.3B+1.3B (verify48@1)	Parameters (Billion)	2.6	# 7
Arithmetic Reasoning	GSM8K	Phi-GSM 2.7B (fine-tuned)	Accuracy	74.3	# 81
Arithmetic Reasoning	GSM8K	Phi-GSM 2.7B (fine-tuned)	Parameters (Billion)	2.7	# 8

Badge	Markdown
	`[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/tinygsm-achieving-80-on-gsm8k-with-small/arithmetic-reasoning-on-gsm8k)](https://paperswithcode.com/sota/arithmetic-reasoning-on-gsm8k?p=tinygsm-achieving-80-on-gsm8k-with-small)`

TinyGSM: achieving >80% on GSM8k with small language models

14 Dec 2023 · Bingbin Liu, Sebastien Bubeck, Ronen Eldan, Janardhan Kulkarni, Yuanzhi Li, Anh Nguyen, Rachel Ward, Yi Zhang

Small-scale models offer various computational advantages, yet the extent to which size is critical for problem-solving abilities remains an open question. Specifically for solving grade school math, the smallest model size so far required to break the 80% barrier on the GSM8K benchmark is 34B. Our work studies how high-quality datasets may be the key for small language models to acquire mathematical reasoning. We introduce TinyGSM, a synthetic dataset of 12.3M grade school math problems paired with Python solutions, generated fully by GPT-3.5. After finetuning on TinyGSM, we find that a duo of a 1.3B generation model and a 1.3B verifier model can achieve 81.5% accuracy, outperforming existing models that are orders of magnitude larger. This also rivals the performance of the GPT-3.5 "teacher" model (77.4%), from which our model's training data is generated. Our approach is simple and has two key components: 1) the high-quality dataset TinyGSM, and 2) the use of a verifier, which selects the final outputs from multiple candidate generations.
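The verifier step described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: `select_with_verifier`, `run_solution`, `toy_score`, and the convention that each candidate defines a `solution()` function are all hypothetical names chosen for illustration, and `toy_score` is a hand-written stand-in for a learned verifier model. The "verify48@1" label in the results table suggests 48 candidates per question, but that reading is an assumption; the sketch uses two.

```python
from typing import Callable, Sequence, Tuple


def select_with_verifier(
    question: str,
    candidates: Sequence[str],
    score_fn: Callable[[str, str], float],
) -> Tuple[str, float]:
    """Return the candidate that score_fn rates highest, with its score."""
    if not candidates:
        raise ValueError("need at least one candidate")
    scored = [(score_fn(question, c), c) for c in candidates]
    # max() keeps the first candidate on ties.
    best_score, best = max(scored, key=lambda pair: pair[0])
    return best, best_score


def run_solution(code: str) -> object:
    """Execute a candidate Python solution and return solution()'s result.

    Assumes the candidate defines a zero-argument function `solution`.
    exec() runs arbitrary code, so only use this on trusted input.
    """
    namespace: dict = {}
    exec(code, namespace)
    return namespace["solution"]()


def toy_score(question: str, code: str) -> float:
    """Hypothetical stand-in for a learned verifier.

    Rates a candidate 1.0 if it uses addition for a "more" question,
    else 0.0. A real verifier would be a trained model.
    """
    return 1.0 if "more" in question and "+" in code else 0.0


question = "Ann has 3 apples and buys 4 more. How many apples does she have?"
candidates = [
    "def solution():\n    return 3 * 4\n",
    "def solution():\n    apples = 3\n    bought = 4\n    return apples + bought\n",
]

best, score = select_with_verifier(question, candidates, toy_score)
answer = run_solution(best)
print(answer)
```

Separating the scoring function from the selection logic keeps the sketch independent of any particular verifier: a trained model's scoring call could be passed as `score_fn` without changing `select_with_verifier`.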

Code

No code implementations yet.

Tasks

Arithmetic Reasoning

GSM8K

Math

Mathematical Reasoning

Datasets

GSM8K

MATH

SVAMP

Results from the Paper

Ranked #59 on Arithmetic Reasoning on GSM8K

Methods

Adam • Attention Dropout • BPE • Cosine Annealing • Dense Connections • Dropout • Fixed Factorized Attention • GELU • GPT-3 • Layer Normalization • Linear Layer • Linear Warmup With Cosine Annealing • Multi-Head Attention • Residual Connection • Scaled Dot-Product Attention • Softmax • Strided Attention • Weight Decay
