1 code implementation • 19 Feb 2024 • Christian Schlarmann, Naman Deep Singh, Francesco Croce, Matthias Hein
The CLIP model, or one of its variants, is used as a frozen vision encoder in many vision-language models (VLMs), e.g., LLaVA and OpenFlamingo.