PreFLMR: Scaling Up Fine-Grained Late-Interaction Multi-modal Retrievers

13 Feb 2024 · Weizhe Lin, Jingbiao Mei, Jinghong Chen, Bill Byrne

Large Multimodal Models (LMMs) excel in natural language and visual understanding but are challenged by exacting tasks such as Knowledge-based Visual Question Answering (KB-VQA) which involve the retrieval of relevant information from document collections to use in shaping answers to questions. We present an extensive training and evaluation framework, M2KR, for KB-VQA. M2KR contains a collection of vision and language tasks which we have incorporated into a single suite of benchmark tasks for training and evaluating general-purpose multi-modal retrievers. We use M2KR to develop PreFLMR, a pre-trained version of the recently developed Fine-grained Late-interaction Multi-modal Retriever (FLMR) approach to KB-VQA, and we report new state-of-the-art results across a range of tasks. We also present investigations into the scaling behaviors of PreFLMR intended to be useful in future developments in general-purpose multi-modal retrievers.
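
For readers unfamiliar with late interaction, the sketch below illustrates the general ColBERT-style scoring idea that FLMR-family retrievers build on: every query token (textual or visual) is compared against every document token, and each query token keeps only its best match. The function name, embedding shapes, and toy data are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch of late-interaction (MaxSim) relevance scoring.
# Shapes and names are assumptions for illustration only.
import numpy as np

def late_interaction_score(query_tokens: np.ndarray, doc_tokens: np.ndarray) -> float:
    """Score one document against one query.

    query_tokens: (Nq, d) query token embeddings; for a multi-modal query,
                  visual and text token embeddings are concatenated.
    doc_tokens:   (Nd, d) document token embeddings.
    """
    # Normalise so dot products are cosine similarities.
    q = query_tokens / np.linalg.norm(query_tokens, axis=1, keepdims=True)
    d = doc_tokens / np.linalg.norm(doc_tokens, axis=1, keepdims=True)
    sim = q @ d.T  # (Nq, Nd) token-level similarity matrix
    # MaxSim: each query token takes its best-matching document token,
    # and the per-token maxima are summed into a single relevance score.
    return float(sim.max(axis=1).sum())

# Toy usage: rank two candidate documents for one query.
rng = np.random.default_rng(0)
query = rng.normal(size=(8, 128))
docs = [rng.normal(size=(32, 128)), rng.normal(size=(40, 128))]
scores = [late_interaction_score(query, doc) for doc in docs]
print(sorted(range(len(docs)), key=lambda i: -scores[i]))
```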


Datasets


Introduced in the Paper:

M2KR

Used in the Paper:

MS MARCO, WIT, IGLUE, InfoSeek, OVEN

Results from the Paper


 Ranked #1 on Retrieval on InfoSeek (using extra training data)

Task                            | Dataset  | Model               | Metric Name | Metric Value | Global Rank
Retrieval                       | InfoSeek | PreFLMR             | Recall@5    | 62.1         | #1
Visual Question Answering (VQA) | InfoSeek | RA-VQAv2 w/ PreFLMR | Accuracy    | 30.65        | #1
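
The Recall@5 figure above counts a query as successful when at least one relevant document appears among the top five retrieved. A minimal, hypothetical helper for computing such a Recall@K metric (not taken from the paper's evaluation code):

```python
# Illustrative Recall@K computation; names and data are hypothetical.
from typing import Sequence, Set

def recall_at_k(retrieved: Sequence[Sequence[str]],
                gold: Sequence[Set[str]],
                k: int = 5) -> float:
    """Fraction of queries whose top-k retrieved docs contain a gold doc."""
    hits = sum(1 for ranked, positives in zip(retrieved, gold)
               if positives & set(ranked[:k]))
    return hits / len(gold)

# Toy example: 1 of 2 queries has a gold passage in its top-5.
print(recall_at_k([["d3", "d7", "d1", "d9", "d2"],
                   ["d4", "d5", "d6", "d8", "d0"]],
                  [{"d1"}, {"d42"}], k=5))  # -> 0.5
```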

Methods


No methods listed for this paper.