NPHardEval4V is a dynamic benchmark designed to evaluate the reasoning capabilities of Multimodal Large Language Models (MLLMs). Let me provide you with more details:
- Purpose and Gap Addressed:
  - The benchmark aims to address a gap in evaluating the pure reasoning abilities of MLLMs.
  - It disentangles the effects of confounding factors, such as image recognition and instruction following, from overall model performance.
  - By isolating reasoning ability, NPHardEval4V helps researchers understand current limitations and guide further development in this area.
- Construction and Features:
  - NPHardEval4V is built by converting the textual question descriptions of the existing NPHardEval dataset into image representations (see the first sketch after this list).
  - Unlike traditional benchmarks that are static, NPHardEval4V is dynamic: its questions are refreshed monthly to prevent overfitting and to keep evaluation authentic and fine-grained.
  - The benchmark evaluates MLLMs across three complexity classes: polynomial-time, NP-complete, and NP-hard problems.
  - It assesses performance along three dimensions (scored as in the second sketch below):
    - Recognition accuracy (RA): how well the model perceives the visual input.
    - Instruction-following effective rate (ER): how reliably the model follows the task instructions and answer format.
    - Reasoning accuracy (AA): the model's pure reasoning ability.
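To make the construction and the monthly-refresh idea concrete, here is a minimal Python sketch: it generates a small TSP instance seeded by the current month (so the questions change each month) and renders the textual question as an image, mirroring the text-to-image conversion described above. The function names and rendering details are my own illustration, not the paper's actual pipeline:

```python
import random
from datetime import date
from PIL import Image, ImageDraw  # Pillow

def monthly_tsp_instance(n_cities: int = 5) -> list[tuple[int, int]]:
    """Generate a fresh TSP instance seeded by the current month,
    imitating how a dynamic benchmark could refresh its questions."""
    today = date.today()
    rng = random.Random(f"{today.year}-{today.month}")  # new seed every month
    return [(rng.randint(0, 100), rng.randint(0, 100)) for _ in range(n_cities)]

def render_question_as_image(cities, path: str = "question.png") -> None:
    """Render the textual question as an image, the same text-to-image
    idea NPHardEval4V applies to the NPHardEval questions."""
    img = Image.new("RGB", (460, 120), "white")
    draw = ImageDraw.Draw(img)
    lines = [
        "Travelling Salesman Problem (NP-hard):",
        "Find the shortest tour visiting every city exactly once.",
        f"City coordinates: {cities}",
    ]
    for i, line in enumerate(lines):
        draw.text((10, 10 + 24 * i), line, fill="black")
    img.save(path)

render_question_as_image(monthly_tsp_instance())
```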
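And here is a second sketch of how the three dimensions could be aggregated into scores. The `InstanceResult` fields and the unweighted averaging are assumptions for illustration; NPHardEval4V's exact metric definitions may differ:

```python
from dataclasses import dataclass

@dataclass
class InstanceResult:          # hypothetical per-question record
    recognized: bool           # model described the image content correctly
    followed_format: bool      # answer matched the required output format
    answer_correct: bool       # final answer actually solved the problem

def score(results: list[InstanceResult]) -> dict[str, float]:
    """Aggregate per-instance flags into benchmark-style scores.
    Unweighted averages are an assumption, shown only to illustrate
    how RA, ER, and AA capture three separate failure modes."""
    n = len(results)
    return {
        "RA": sum(r.recognized for r in results) / n,       # recognition accuracy
        "ER": sum(r.followed_format for r in results) / n,  # instruction-following rate
        "AA": sum(r.answer_correct for r in results) / n,   # reasoning accuracy
    }

print(score([InstanceResult(True, True, False), InstanceResult(True, False, False)]))
```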
- Findings and Impact:
  - Reasoning ability differs significantly across models.
  - MLLMs show relatively weak reasoning performance compared to text-only Large Language Models (LLMs).
  - Comparing different prompting styles (visual-only, text-only, and combined) shows that the mix of modalities in the prompt has a marked effect on model performance (a prompt-construction sketch follows this list).
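To illustrate those three prompting styles, here is a small sketch that assembles a visual-only, text-only, or combined prompt as a generic chat message. The message schema is illustrative and not tied to any particular MLLM API:

```python
import base64

def build_prompt(style: str, question_text: str, image_path: str) -> list[dict]:
    """Assemble a user message for one of the three prompting styles:
    'visual', 'text', or 'combined'. Field names are generic placeholders,
    not the schema of any specific provider."""
    content = []
    if style in ("visual", "combined"):
        with open(image_path, "rb") as f:
            b64 = base64.b64encode(f.read()).decode()
        content.append({"type": "image", "data": b64})  # the rendered question
    if style in ("text", "combined"):
        content.append({"type": "text", "text": question_text})  # original wording
    return [{"role": "user", "content": content}]
```

For example, `build_prompt("combined", question_text, "question.png")` would pair the rendered image with its textual description in a single user turn, which is the kind of variation the benchmark uses to probe how multimodal inputs affect performance.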
In summary, NPHardEval4V provides a valuable resource for assessing reasoning abilities in MLLMs and contributes to advancing research in this domain. 🌟