DPR-ANN Dataset | Papers With Code

We provide the [code](https://github.com/IntelLabs/DPR-dataset-generator/tree/main) to generate base and query vector datasets for benchmarking and evaluating similarity search on high-dimensional vectors produced by large language models. Using the dense passage retriever (DPR) [[1]](#1), we encode text snippets from the C4 dataset [[2]](#2) into 768-dimensional vectors:

- context DPR embeddings for the base set and
- question DPR embeddings for the query set.

The metric for similarity search is inner product [[1]](#1).
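
As a rough illustration of what inner-product similarity search means here, the following minimal sketch runs an exact (brute-force) top-k search with NumPy. It uses random 768-dimensional vectors as stand-ins for the DPR embeddings; the function name `top_k_inner_product` and the set sizes are assumptions made for this example and are not part of the generator code.

```python
import numpy as np


def top_k_inner_product(base, queries, k):
    """Return, per query, the indices of the k base vectors with the
    largest inner product, ordered from highest to lowest score."""
    scores = queries @ base.T  # shape: (n_queries, n_base)
    # Select the k best candidates per row (unordered), then sort them.
    top = np.argpartition(-scores, k - 1, axis=1)[:, :k]
    rows = np.arange(scores.shape[0])[:, None]
    order = np.argsort(-scores[rows, top], axis=1)
    return top[rows, order]


# Random placeholder data (assumption): real base and query sets would
# come from the context and question DPR encoders, respectively.
rng = np.random.default_rng(0)
dim = 768
base = rng.standard_normal((1000, dim)).astype(np.float32)
queries = rng.standard_normal((5, dim)).astype(np.float32)

neighbors = top_k_inner_product(base, queries, k=10)
print(neighbors.shape)
```

An exact search like this is typically used to compute ground-truth neighbors against which approximate methods are scored; it scales linearly with the number of base vectors, so it is only practical for small sets or offline ground-truth generation.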

The number of base and query embedding vectors is configurable.

See the [main repository](https://github.com/IntelLabs/DPR-dataset-generator/tree/main) for details on how to generate
the specific DPR10M instance introduced in [[3]](#3).

<a id="1">[1]</a>
Karpukhin, V.; Oguz, B.; Min, S.; Lewis, P.; Wu, L.; Edunov, S.; Chen, D.; Yih, W.: Dense Passage
Retrieval for Open-Domain Question Answering. In: Proceedings of the 2020 Conference on Empirical
Methods in Natural Language Processing (EMNLP). 6769–6781. (2020)

<a id="2">[2]</a>
Raffel, C.; Shazeer, N.; Roberts, A.; Lee, K.; Narang, S.; Matena, M.; Zhou, Y.; Li, W.; Liu,
P.J.: Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer.
In: The Journal of Machine Learning Research 21, 140:1–140:67. (2020)

<a id="3">[3]</a>
Aguerrebere, C.; Bhati, I.; Hildebrand, M.; Tepper, M.; Willke, T.: Similarity search in the blink of an eye with compressed
indices. In: Proceedings of the VLDB Endowment 16, 11. (2023)
