SOHO (“Seeing Out of tHe bOx”) takes a whole image as input and learns vision-language representations in an end-to-end manner. Because SOHO requires no bounding-box annotations, its inference is about 10 times faster than that of region-based approaches. Textual features are obtained from standard text embeddings, while a trainable CNN extracts visual representations. SOHO learns to extract comprehensive yet compact image features through a visual dictionary (VD) that facilitates cross-modal understanding. The VD is designed to represent consistent visual abstractions of similar semantics; it is updated on the fly and is used in the proposed pre-training task Masked Visual Modeling (MVM).
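The visual-dictionary idea above can be sketched as nearest-neighbor quantization of CNN features with a moving-average refresh of the matched entries. This is a minimal illustrative sketch, not SOHO's actual implementation: the function name, the Euclidean-distance assignment, and the `momentum` parameter are assumptions for illustration.

```python
import numpy as np

def vd_lookup_and_update(features, dictionary, counts, momentum=0.99):
    """Map each visual feature to its nearest dictionary embedding and
    refresh the hit embeddings with a moving average (illustrative sketch).

    features:   (N, D) CNN feature vectors, one per spatial location
    dictionary: (K, D) visual-dictionary embeddings (updated in place)
    counts:     (K,)   per-entry usage counts (updated in place)
    """
    # Nearest-neighbor assignment by squared Euclidean distance.
    dists = ((features[:, None, :] - dictionary[None, :, :]) ** 2).sum(-1)
    indices = dists.argmin(axis=1)  # (N,) dictionary index per feature

    # Moving-average update of each dictionary entry that was matched.
    for k in np.unique(indices):
        mean_feat = features[indices == k].mean(axis=0)
        dictionary[k] = momentum * dictionary[k] + (1 - momentum) * mean_feat
        counts[k] += (indices == k).sum()

    # Quantized output: each input feature replaced by its VD embedding.
    return dictionary[indices], indices
```

The returned indices are what a task like MVM could mask and predict, since they give each image region a discrete "visual word" label.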
Source: Seeing Out of tHe bOx: End-to-End Pre-training for Vision-Language Representation Learning
| Task | Papers | Share |
|---|---|---|
| BIG-bench Machine Learning | 1 | 25.00% |
| Retrieval | 1 | 25.00% |
| Visual Entailment | 1 | 25.00% |
| Visual Reasoning | 1 | 25.00% |