GEM (General Evaluation benchmark for Multimodal tasks) is a benchmark dataset designed to evaluate the performance of cross-modal pre-trained models on both understanding and generation tasks. Unlike existing benchmarks such as GLUE, SuperGLUE, XGLUE, and XTREME, which primarily focus on natural language tasks, GEM is a large-scale vision-language benchmark.
Here are the key features of GEM:
Two Subsets: GEM is composed of GEM-I, which covers image-language tasks, and GEM-V, which covers video-language tasks.
Large-Scale Dataset: GEM is one of the largest vision-language datasets available, covering image-language and video-language tasks at the same time.
Multilingual Labeling: The dataset is labeled in multiple languages, making it versatile for multilingual multimodal research.
Baseline Models: The creators of GEM provide two baseline models to facilitate research and development in this area.
The goal of GEM is to advance the field of multimodal research by providing a comprehensive evaluation benchmark that spans vision and language modalities. Researchers can use this dataset to assess the capabilities of their models across different tasks and languages¹².
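For illustration only, here is a minimal sketch of how one might score a model on a text-to-image retrieval task of the kind GEM evaluates, using a plain Recall@K metric over a query-by-candidate similarity matrix. This is not the official GEM evaluation code; the function name, array shapes, and the assumption that query i matches candidate i are all assumptions made for the example.

```python
import numpy as np

def recall_at_k(similarity, k=5):
    """Recall@K for text-to-image retrieval.

    similarity: (num_queries, num_candidates) matrix of model scores,
    where the ground-truth candidate for query i is assumed to be index i.
    Returns the fraction of queries whose correct candidate ranks in the top K.
    """
    # Indices of the K highest-scoring candidates for each query.
    top_k = np.argsort(-similarity, axis=1)[:, :k]
    # Ground-truth index for each query (assumed diagonal correspondence).
    correct = np.arange(similarity.shape[0])[:, None]
    return float(np.mean(np.any(top_k == correct, axis=1)))

if __name__ == "__main__":
    # Random scores for 100 queries over 100 candidates, as a smoke test.
    rng = np.random.default_rng(0)
    scores = rng.standard_normal((100, 100))
    print(recall_at_k(scores, k=5))
```

In practice, the similarity matrix would come from the model under evaluation (e.g., cosine similarity between text and image embeddings), computed separately for each language split in the benchmark.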
(1) GEM: A General Evaluation Benchmark for Multimodal Tasks. https://arxiv.org/abs/2106.09889 (https://doi.org/10.48550/arXiv.2106.09889).
(2) GEM Submission Instructions. https://microsoft.github.io/GEM/.
(3) GEM: A General Evaluation Benchmark for Multimodal Tasks. Microsoft Research. https://www.microsoft.com/en-us/research/publication/gem-a-general-evaluation-benchmark-for-multimodal-tasks/.