Unified-IO: A Unified Model for Vision, Language, and Multi-Modal Tasks

17 Jun 2022 · Jiasen Lu, Christopher Clark, Rowan Zellers, Roozbeh Mottaghi, Aniruddha Kembhavi

We propose Unified-IO, a model that performs a large variety of AI tasks, ranging from classical computer vision tasks, including pose estimation, object detection, depth estimation and image generation, and vision-and-language tasks such as region captioning and referring expression, to natural language processing tasks such as question answering and paraphrasing. Developing a single unified model for such a large variety of tasks poses unique challenges due to the heterogeneous inputs and outputs pertaining to each task, including RGB images, per-pixel maps, binary masks, bounding boxes, and language. We achieve this unification by homogenizing every supported input and output into a sequence of discrete vocabulary tokens. This common representation across all tasks allows us to train a single transformer-based architecture jointly on over 90 diverse datasets in the vision and language fields. Unified-IO is the first model capable of performing all 7 tasks on the GRIT benchmark, and it produces strong results across 16 diverse benchmarks such as NYUv2-Depth, ImageNet, VQA2.0, OK-VQA, Swig, VizWizGround, BoolQ, and SciTail, with no task-specific fine-tuning. Code and demos for Unified-IO are available at: https://unified-io.allenai.org.
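
To make the idea of homogenizing heterogeneous outputs into discrete tokens concrete, the following is a minimal sketch of how a bounding box might be turned into vocabulary tokens and appended to a text prompt. It is an illustration only, not the authors' implementation: the uniform binning scheme, the bin count `NUM_BINS`, the `<loc_N>` token naming, and all function names are assumptions made for this example.

```python
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # (x0, y0, x1, y1) in pixels

NUM_BINS = 1000  # assumed number of location bins; chosen for illustration


def quantize(value: float, size: float, num_bins: int = NUM_BINS) -> int:
    """Map a coordinate in [0, size] to an integer bin in [0, num_bins - 1]."""
    frac = min(max(value / size, 0.0), 1.0)  # clamp to the image extent
    return min(int(frac * num_bins), num_bins - 1)


def box_to_tokens(box: Box, width: float, height: float,
                  num_bins: int = NUM_BINS) -> List[str]:
    """Encode a box as four discrete location tokens (hypothetical naming)."""
    x0, y0, x1, y1 = box
    bins = [
        quantize(x0, width, num_bins),
        quantize(y0, height, num_bins),
        quantize(x1, width, num_bins),
        quantize(y1, height, num_bins),
    ]
    return [f"<loc_{b}>" for b in bins]


def tokens_to_box(tokens: List[str], width: float, height: float,
                  num_bins: int = NUM_BINS) -> Box:
    """Decode location tokens back to pixel coordinates using bin centres."""
    bins = [int(t[len("<loc_"):-1]) for t in tokens]
    x0, y0, x1, y1 = [(b + 0.5) / num_bins for b in bins]
    return (x0 * width, y0 * height, x1 * width, y1 * height)


def encode_example(prompt: str, box: Box, width: float, height: float) -> List[str]:
    """Build one token sequence mixing whitespace-split text and box tokens."""
    return prompt.lower().split() + box_to_tokens(box, width, height)


if __name__ == "__main__":
    print(encode_example("Locate the dog", (320, 240, 640, 480), 640, 480))
```

Once text and boxes share one token sequence, a single sequence-to-sequence model can in principle read and emit either. Dense modalities such as images and per-pixel maps need a different discretization step, which this sketch does not cover.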


Results from the Paper


Task                                Dataset  Model          Metric Name                Metric Value  Global Rank
Surface Normal Estimation           GRIT     Unified-IO XL  Normal (ablation)          45.0          #2
Surface Normal Estimation           GRIT     Unified-IO XL  Normal (test)              44.3          #2
Keypoint Estimation                 GRIT     Unified-IO XL  Keypoint (ablation)        68.1          #2
Keypoint Estimation                 GRIT     Unified-IO XL  Keypoint (test)            67.7          #2
Object Segmentation                 GRIT     Unified-IO XL  Segmentation (ablation)    56.3          #1
Object Segmentation                 GRIT     Unified-IO XL  Segmentation (test)        56.5          #1
Referring Expression Comprehension  GRIT     Unified-IO XL  Refexp (ablation)          78.6          #1
Referring Expression Comprehension  GRIT     Unified-IO XL  Refexp (test)              78.9          #1
Visual Question Answering (VQA)     GRIT     Unified-IO XL  VQA (ablation)             74.5          #1
Visual Question Answering (VQA)     GRIT     Unified-IO XL  VQA (test)                 74.5          #1
Object Localization                 GRIT     Unified-IO XL  Localization (ablation)    67.0          #1
Object Localization                 GRIT     Unified-IO XL  Localization (test)        67.1          #1
Object Categorization               GRIT     Unified-IO XL  Categorization (ablation)  61.7          #1
Object Categorization               GRIT     Unified-IO XL  Categorization (test)      60.8          #1
