Vision and Language Pre-Trained Models

In the ALIGN method, visual and language representations are jointly trained from noisy image alt-text data. The image and text encoders are learned via a contrastive loss (formulated as a normalized softmax) that pulls the embeddings of matched image-text pairs together and pushes those of non-matched pairs apart. Through this loss, the model learns to align the visual and language representations of image-text pairs, and the resulting representations can be transferred to vision-only or vision-language tasks. Without any fine-tuning, ALIGN enables zero-shot visual classification and cross-modal search, including image-to-text search, text-to-image search, and even search with joint image+text queries.

Source: Scaling Up Visual and Vision-Language Representation Learning With Noisy Text Supervision
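A minimal sketch of the normalized-softmax (InfoNCE-style) image-text contrastive loss described above, written in PyTorch. The embedding size, temperature value, and the retrieval example at the bottom are illustrative assumptions, not the exact ALIGN configuration.

```python
import torch
import torch.nn.functional as F


def contrastive_loss(image_emb: torch.Tensor,
                     text_emb: torch.Tensor,
                     temperature: float = 0.07) -> torch.Tensor:
    """Symmetric image-to-text and text-to-image contrastive loss.

    image_emb, text_emb: (batch, dim) embeddings from the image and text
    encoders; row i of each tensor comes from the same image-text pair.
    """
    # L2-normalize so dot products are cosine similarities.
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)

    # (batch, batch) similarity matrix; diagonal entries are matched pairs.
    logits = image_emb @ text_emb.t() / temperature
    targets = torch.arange(logits.size(0), device=logits.device)

    # Normalized softmax over each row (image-to-text) and each column
    # (text-to-image): matched pairs are pulled together, non-matched apart.
    loss_i2t = F.cross_entropy(logits, targets)
    loss_t2i = F.cross_entropy(logits.t(), targets)
    return (loss_i2t + loss_t2i) / 2


if __name__ == "__main__":
    batch, dim = 8, 256  # illustrative sizes, not ALIGN's actual dimensions
    image_emb = torch.randn(batch, dim)
    text_emb = torch.randn(batch, dim)
    print("loss:", contrastive_loss(image_emb, text_emb).item())

    # Cross-modal search sketch: rank candidate texts for the first image
    # by cosine similarity in the shared embedding space.
    sims = F.normalize(image_emb, dim=-1) @ F.normalize(text_emb, dim=-1).t()
    print("best text for image 0:", sims[0].argmax().item())
```

The same similarity ranking supports zero-shot classification (scoring an image against text embeddings of class names) and text-to-image search by transposing the similarity matrix.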

Tasks


Task Papers Share
Language Modelling 48 6.33%
Large Language Model 23 3.03%
Retrieval 21 2.77%
Image Generation 18 2.37%
Semantic Segmentation 17 2.24%
GPT-4 17 2.24%
Question Answering 15 1.98%
Domain Adaptation 15 1.98%
Decision Making 14 1.85%
