We provide the code to generate base and query vector datasets for similarity search benchmarking and evaluation on high-dimensional vectors stemming from large language models. With the dense passage retriever (DPR) [1], we encode text snippets from the C4 dataset [2] to generate 768-dimensional vectors:
The metric for similarity search is inner product [1].
The number of base and query embedding vectors is parametrizable.
See the main repository for details on how to generate the DPR10M specific instance introduced in [3].
[1] Karpukhin, V.; Oguz, B.; Min, S.; Lewis, P.; Wu, L.; Edunov, S.; Chen, D.; Yih, W..: Dense Passage Retrieval for Open-Domain Question Answering. In: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP). 6769–6781. (2020)
[2] Raffel, C.; Shazeer, N.; Roberts, A.; Lee, K.; Narang, S.; Matena, M.; Zhou, Y.; Li, W.; Liu, P.J.: Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer. In: The Journal of Machine Learning Research 21,140:1–140:67.(2020)
[3] Aguerrebere, C.; Bhati I.; Hildebrand M.; Tepper M.; Willke T.:Similarity search in the blink of an eye with compressed indices. In: Proceedings of the VLDB Endowment, 16, 11 (2023)
Paper | Code | Results | Date | Stars |
---|