General Image Descriptors for Open World Image Retrieval using ViT CLIP
The Google Universal Image Embedding (GUIE) Challenge is one of the first competitions in multi-domain image representations in the wild, covering a wide distribution of objects: landmarks, artwork, food, etc. This is a fundamental computer vision problem with notable applications in image retrieval, search engines and e-commerce. In this work, we explain our 4th place solution to the GUIE Challenge, and our "bag of tricks" to fine-tune zero-shot Vision Transformers (ViT) pre-trained using CLIP.
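The solution details are in the paper, but the core retrieval setup it targets can be sketched in a few lines: images are mapped to embeddings by a (fine-tuned) CLIP ViT encoder, and retrieval is nearest-neighbour search under cosine similarity. The sketch below is illustrative only, not the competition pipeline; a random matrix stands in for real CLIP image embeddings, and the dimension, gallery size, and function names are assumptions.

```python
import numpy as np

def l2_normalize(x, axis=-1, eps=1e-12):
    # Project embeddings onto the unit hypersphere so dot product equals cosine similarity.
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)

def retrieve_top_k(query_emb, gallery_embs, k=5):
    # Cosine-similarity nearest-neighbour search over a gallery of embeddings.
    q = l2_normalize(query_emb)
    g = l2_normalize(gallery_embs)
    sims = g @ q                       # similarity of the query to every gallery item
    order = np.argsort(-sims)[:k]      # indices of the k most similar items
    return order, sims[order]

# Stand-in for pre-computed CLIP ViT image embeddings
# (64-dim here for brevity; real CLIP ViT embeddings are e.g. 512- or 768-dim).
rng = np.random.default_rng(0)
gallery = rng.normal(size=(100, 64))
query = gallery[42] + 0.01 * rng.normal(size=64)  # near-duplicate of gallery item 42

idx, scores = retrieve_top_k(query, gallery, k=3)
```

Because both sides are L2-normalized, ranking by dot product is equivalent to ranking by cosine similarity, which is the standard metric for CLIP-style embedding spaces.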
Methods
Adam, Attention Dropout, BPE, CLIP, Cosine Annealing, Dense Connections, Discriminative Fine-Tuning, Dropout, GELU, GPT, Layer Normalization, Linear Layer, Linear Warmup With Cosine Annealing, Multi-Head Attention, Residual Connection, Scaled Dot-Product Attention, Softmax, Weight Decay