PECC: Problem Extraction and Coding Challenges

29 Apr 2024 · Patrick Haller, Jonas Golde, Alan Akbik

Recent advancements in large language models (LLMs) have showcased their exceptional abilities across various tasks, such as code generation, problem-solving, and reasoning. Existing benchmarks evaluate these tasks in isolation, yet the extent to which LLMs can understand prose-style tasks, identify the underlying problems, and then generate appropriate code solutions remains unexplored. Addressing this gap, we introduce PECC, a novel benchmark of 2,396 problems derived from Advent of Code (AoC) challenges and Project Euler. Unlike conventional benchmarks, PECC requires LLMs to interpret narrative-embedded problems, extract the requirements, and generate executable code. A key feature of the dataset is the complexity added by natural language prompting in chat-based evaluation, mirroring the ambiguity of real-world instructions. Results show that model performance varies between narrative and neutral problem formulations, and that the math-based Euler subset is especially challenging: GPT-3.5-Turbo passes 50% of the AoC challenges but only 8% of the Euler problems. By probing the limits of LLMs' capabilities, our benchmark provides a framework to monitor and assess the subsequent progress of LLMs as universal problem solvers.
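To make the evaluation setup concrete, the sketch below outlines a PECC-style generate-and-execute loop: the model receives a narrative problem in a chat prompt, its generated program is run against the challenge's input, and the problem counts toward Pass@3 if any of three samples produces the expected answer. This is an illustrative sketch only, assuming an OpenAI-compatible chat client and a problem record with `narrative`, `stdin`, and `expected` fields; these names and helpers are hypothetical, not the authors' harness.

```python
# Illustrative sketch of a PECC-style generate-and-execute loop (not the authors' harness).
# Assumes an OpenAI-compatible chat client and problems stored as dicts with
# "narrative", "stdin", and "expected" fields; all names here are hypothetical.
import subprocess
import tempfile


def generate_solution(client, narrative: str, model: str) -> str:
    """Ask the model to turn a prose-style problem into a Python program.

    Code-block extraction from the chat reply is omitted for brevity.
    """
    response = client.chat.completions.create(
        model=model,
        messages=[
            {"role": "system", "content": "Solve the problem with a Python program "
                                          "that reads stdin and prints the answer."},
            {"role": "user", "content": narrative},
        ],
    )
    return response.choices[0].message.content


def passes(code: str, stdin: str, expected: str, timeout: int = 30) -> bool:
    """Execute the generated program and compare its output to the expected answer."""
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(code)
        path = f.name
    try:
        result = subprocess.run(
            ["python", path], input=stdin, capture_output=True, text=True, timeout=timeout
        )
        return result.stdout.strip() == expected.strip()
    except subprocess.TimeoutExpired:
        return False


def solved_within_three(client, problem: dict, model: str) -> bool:
    """Pass@3: the problem counts as solved if any of three generated programs passes."""
    return any(
        passes(generate_solution(client, problem["narrative"], model),
               problem["stdin"], problem["expected"])
        for _ in range(3)
    )
```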


Datasets


Introduced in the Paper:

PECC

Used in the Paper:

HumanEval, APPS

Results from the Paper


Task             Dataset  Model                     Metric Name  Metric Value  Global Rank
Code Generation  PECC     Claude 3 Haiku            Pass@3       27.67         # 1
Code Generation  PECC     GPT-3.5 Turbo             Pass@3       23.75         # 2
Code Generation  PECC     codechat-bison            Pass@3       11.39         # 3
Code Generation  PECC     chat-bison                Pass@3        8.48         # 4
Code Generation  PECC     Mixtral-8x7B-Instruct     Pass@3        8.35         # 5
Code Generation  PECC     Phi-3-mini-128k-instruct  Pass@3        7.18         # 6
Code Generation  PECC     WizardLM-2-7B             Pass@3        3.72         # 7
Code Generation  PECC     Llama-3-8B-Instruct       Pass@3        3.1          # 8
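Pass@3 in the table above is the fraction of problems solved within three samples. One common way to compute pass@k scores is the unbiased estimator introduced with HumanEval (Chen et al., 2021); whether PECC uses this estimator or simply checks if any of three samples passes is an assumption here, so the snippet below is shown for reference only.

```python
# Unbiased pass@k estimator from the HumanEval paper (Chen et al., 2021).
# Whether PECC computes Pass@3 this way or as "any of 3 samples passes"
# is an assumption; this snippet is for reference only.
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """n: samples drawn per problem, c: samples that pass, k: evaluation budget."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


# Example: 10 samples per problem, 2 of which pass, evaluated at k = 3.
print(round(pass_at_k(10, 2, 3) * 100, 2))  # ≈ 53.33
```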
