NPHardEval4V is a dynamic reasoning benchmark designed to evaluate the reasoning capabilities of Multimodal Large Language Models (MLLMs).

  1. Purpose and Gap Addressed:

     • The benchmark addresses existing gaps in evaluating the pure reasoning abilities of MLLMs.
     • It disentangles the effects of factors such as image recognition and instruction following from overall model performance.
     • By focusing solely on reasoning ability, NPHardEval4V helps researchers understand and guide further development in this area.

  2. Construction and Features:

     • NPHardEval4V is built by converting textual question descriptions from the existing NPHardEval dataset into image representations.
     • Unlike traditional benchmarks that focus on static evaluation, NPHardEval4V is dynamic: it is updated monthly to prevent overfitting and to ensure authentic, fine-grained model evaluation.
     • The benchmark evaluates MLLMs across three problem classes: polynomial-time, NP-complete, and NP-hard problems.
     • It assesses performance along three dimensions:
       ◦ Recognition (RA): ability to understand image and video modalities.
       ◦ Instruction-following (ER): how well the model follows instructions.
       ◦ Reasoning (AA): pure reasoning ability.
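As a rough illustration of how the three dimensions could be separated, the sketch below aggregates per-question outcomes into the three scores. This is a minimal sketch under assumed conventions: the field names (`recognized`, `followed`, `correct`) and the choice to measure reasoning only on questions where recognition and instruction-following both succeeded are illustrative, not the benchmark's actual formulas.

```python
def score_model(results):
    """Aggregate per-question results into three evaluation dimensions.

    Each result is a dict of booleans (field names are illustrative):
      recognized -- did the model correctly read the image/video input?
      followed   -- did its answer respect the required output format?
      correct    -- was the final answer right?
    """
    n = len(results)
    ra = sum(r["recognized"] for r in results) / n  # Recognition (RA)
    er = sum(r["followed"] for r in results) / n    # Instruction-following (ER)
    # Reasoning (AA): correctness restricted to questions where recognition
    # and instruction-following both succeeded, isolating pure reasoning.
    eligible = [r for r in results if r["recognized"] and r["followed"]]
    aa = sum(r["correct"] for r in eligible) / len(eligible) if eligible else 0.0
    return {"RA": ra, "ER": er, "AA": aa}
```

Conditioning the reasoning score on successful recognition and instruction-following is one simple way to keep perception errors from being misread as reasoning failures.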
  3. Findings and Impact:

     • Significant discrepancies in reasoning ability exist across different models.
     • MLLMs perform relatively weakly in reasoning compared to Large Language Models (LLMs).
     • Comparing prompting styles (visual, text, and combined) reveals that multimodal inputs affect model performance to varying degrees.
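The three prompting styles compared above can be sketched as alternative ways of assembling the same question into a model input. The message schema below (role/content dicts with `type` fields) is a generic chat-style format assumed for illustration; the benchmark's actual prompt construction may differ.

```python
def build_prompt(question_text, image_ref, style):
    """Assemble a prompt for one question in one of three styles:
    'text' (question text only), 'visual' (rendered image only),
    or 'combined' (both). The message schema here is an assumption."""
    if style == "text":
        content = [{"type": "text", "text": question_text}]
    elif style == "visual":
        content = [{"type": "image", "image": image_ref}]
    elif style == "combined":
        content = [{"type": "image", "image": image_ref},
                   {"type": "text", "text": question_text}]
    else:
        raise ValueError(f"unknown prompting style: {style!r}")
    return [{"role": "user", "content": content}]
```

Holding the question fixed while varying only the style lets a per-style score comparison attribute performance differences to the input modality rather than to the question set.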

In summary, NPHardEval4V provides a valuable resource for assessing the reasoning abilities of MLLMs and contributes to advancing research in this domain.
