PECC: Problem Extraction and Coding Challenges

29 Apr 2024 · Patrick Haller, Jonas Golde, Alan Akbik

Recent advancements in large language models (LLMs) have showcased their exceptional abilities across various tasks, such as code generation, problem-solving, and reasoning. Existing benchmarks evaluate these tasks in isolation, yet the extent to which LLMs can understand prose-style tasks, identify the underlying problems, and then generate appropriate code solutions remains unexplored. Addressing this gap, we introduce PECC, a novel benchmark of 2,396 problems derived from Advent of Code (AoC) challenges and Project Euler. Unlike conventional benchmarks, PECC requires LLMs to interpret narrative-embedded problems, extract the requirements, and generate executable code. A key feature of the dataset is the complexity added by natural language prompting in chat-based evaluation, mirroring the ambiguity of real-world instructions. Results show that model performance varies between narrative and neutral problem formulations, and that the math-based Euler subset is especially challenging: GPT-3.5-Turbo passes 50% of the AoC challenges but only 8% of the Euler problems. By probing the limits of LLMs' capabilities, our benchmark provides a framework to monitor and assess the subsequent progress of LLMs as universal problem solvers.
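To make the evaluation setup concrete, the sketch below outlines a PECC-style generate-and-execute loop: the model receives a narrative problem in a chat prompt, its generated program is run against the challenge's input, and the problem counts toward Pass@3 if any of three samples produces the expected answer. This is an illustrative sketch only, assuming an OpenAI-compatible chat client and a problem record with `narrative`, `stdin`, and `expected` fields; these names and helpers are hypothetical, not the authors' harness.

```python
# Illustrative sketch of a PECC-style generate-and-execute loop (not the authors' harness).
# Assumes an OpenAI-compatible chat client and problems stored as dicts with
# "narrative", "stdin", and "expected" fields; all names here are hypothetical.
import subprocess
import tempfile


def generate_solution(client, narrative: str, model: str) -> str:
    """Ask the model to turn a prose-style problem into a Python program.

    Code-block extraction from the chat reply is omitted for brevity.
    """
    response = client.chat.completions.create(
        model=model,
        messages=[
            {"role": "system", "content": "Solve the problem with a Python program "
                                          "that reads stdin and prints the answer."},
            {"role": "user", "content": narrative},
        ],
    )
    return response.choices[0].message.content


def passes(code: str, stdin: str, expected: str, timeout: int = 30) -> bool:
    """Execute the generated program and compare its output to the expected answer."""
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(code)
        path = f.name
    try:
        result = subprocess.run(
            ["python", path], input=stdin, capture_output=True, text=True, timeout=timeout
        )
        return result.stdout.strip() == expected.strip()
    except subprocess.TimeoutExpired:
        return False


def solved_within_three(client, problem: dict, model: str) -> bool:
    """Pass@3: the problem counts as solved if any of three generated programs passes."""
    return any(
        passes(generate_solution(client, problem["narrative"], model),
               problem["stdin"], problem["expected"])
        for _ in range(3)
    )
```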


Datasets


Introduced in the Paper:

PECC

Used in the Paper:

HumanEval, APPS

Results from the Paper


Task             Dataset  Model                     Metric Name  Metric Value  Global Rank
Code Generation  PECC     Claude 3 Haiku            Pass@3       27.67         # 1
Code Generation  PECC     GPT-3.5 Turbo             Pass@3       23.75         # 2
Code Generation  PECC     codechat-bison            Pass@3       11.39         # 3
Code Generation  PECC     chat-bison                Pass@3        8.48         # 4
Code Generation  PECC     Mixtral-8x7B-Instruct     Pass@3        8.35         # 5
Code Generation  PECC     Phi-3-mini-128k-instruct  Pass@3        7.18         # 6
Code Generation  PECC     WizardLM-2-7B             Pass@3        3.72         # 7
Code Generation  PECC     Llama-3-8B-Instruct       Pass@3        3.1          # 8
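Pass@3 in the table above is the fraction of problems solved within three samples. One common way to compute pass@k scores is the unbiased estimator introduced with HumanEval (Chen et al., 2021); whether PECC uses this estimator or simply checks if any of three samples passes is an assumption here, so the snippet below is shown for reference only.

```python
# Unbiased pass@k estimator from the HumanEval paper (Chen et al., 2021).
# Whether PECC computes Pass@3 this way or as "any of 3 samples passes"
# is an assumption; this snippet is for reference only.
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """n: samples drawn per problem, c: samples that pass, k: evaluation budget."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


# Example: 10 samples per problem, 2 of which pass, evaluated at k = 3.
print(round(pass_at_k(10, 2, 3) * 100, 2))  # ≈ 53.33
```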
