GEM (A General Evaluation Benchmark on Multi-modal Tasks)

Introduced by Su et al. in GEM: A General Evaluation Benchmark for Multimodal Tasks

GEM (A General Evaluation Benchmark on Multi-modal Tasks) is a benchmark dataset for evaluating cross-modal pre-trained models on both understanding and generation tasks. Unlike GLUE, SuperGLUE, XGLUE, and XTREME, which focus mainly on natural language tasks, GEM is a large-scale vision-language benchmark.

Here are the key features of GEM:

  1. Multimodal Focus: GEM covers both the vision and language domains. It consists of two main components:
     • GEM-I: image-language tasks.
     • GEM-V: video-language tasks.

  2. Large-Scale Dataset: The authors describe GEM as the largest vision-language dataset that covers image-language and video-language tasks at the same time.

  3. Multilingual Labeling: The dataset is labeled in multiple languages, which makes it suitable for multilingual multimodal research.

  4. Baseline Models: The creators of GEM provide two baseline models for the benchmark.

The goal of GEM is to advance multimodal research by providing an evaluation benchmark that spans the vision and language modalities. Researchers can use it to assess their models across different tasks and languages¹².

(1) GEM: A General Evaluation Benchmark for Multimodal Tasks. https://arxiv.org/abs/2106.09889 (DOI: https://doi.org/10.48550/arXiv.2106.09889).
(2) GEM Submission Instructions - GitHub Pages. https://microsoft.github.io/GEM/.
(3) GEM: A General Evaluation Benchmark for Multimodal Tasks. https://www.microsoft.com/en-us/research/publication/gem-a-general-evaluation-benchmark-for-multimodal-tasks/.

License


  • Unknown
