Multi-EuP: The Multilingual European Parliament Dataset for Analysis of Bias in Information Retrieval

3 Nov 2023  ·  Jinrui Yang, Timothy Baldwin, Trevor Cohn

We present Multi-EuP, a new multilingual benchmark dataset comprising 22K multilingual documents collected from the European Parliament and spanning 24 languages. The dataset is designed for investigating fairness in multilingual information retrieval (IR), supporting analysis of both language bias and demographic bias in ranking. It provides an authentic multilingual corpus, with topics translated into all 24 languages and cross-lingual relevance judgments. It also offers rich demographic information associated with its documents, facilitating the study of demographic bias. We report the effectiveness of Multi-EuP for benchmarking both monolingual and multilingual IR, and we conduct a preliminary experiment on language bias caused by the choice of tokenization strategy.

Task:          Text Reranking
Dataset:       Multi-EuP: The Multilingual European Parliament Dataset for Analysis of Bias in Information Retrieval
Model:         BM25_whitespace_tokenizer
Metric Name:   MRR@100
Metric Value:  14.18
Global Rank:   #1
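
For readers unfamiliar with the metric reported for the BM25 baseline, the following is a minimal sketch of how MRR@k (mean reciprocal rank truncated at rank k) is commonly computed. It is not taken from the paper: the function name `mrr_at_k`, the dictionary-based input layout, and the toy document IDs are all assumptions made for illustration. Leaderboards often report the value multiplied by 100; the listing does not state whether that applies here.

```python
from typing import Dict, List, Set


def mrr_at_k(
    rankings: Dict[str, List[str]],
    relevant: Dict[str, Set[str]],
    k: int = 100,
) -> float:
    """Mean reciprocal rank over all queries, looking only at the top k results.

    rankings: query ID -> document IDs ordered from best to worst (hypothetical layout).
    relevant: query ID -> set of document IDs judged relevant (hypothetical layout).
    """
    if not rankings:
        return 0.0
    total = 0.0
    for query_id, ranked_docs in rankings.items():
        judged_relevant = relevant.get(query_id, set())
        # Only the first relevant document within the top k contributes.
        for rank, doc_id in enumerate(ranked_docs[:k], start=1):
            if doc_id in judged_relevant:
                total += 1.0 / rank
                break
        # A query with no relevant document in the top k contributes 0.
    return total / len(rankings)


# Toy data, invented for illustration only.
toy_rankings = {"q1": ["d3", "d1", "d2"], "q2": ["d5", "d4"], "q3": ["d9"]}
toy_relevant = {"q1": {"d1"}, "q2": {"d5"}, "q3": {"d7"}}
toy_score = mrr_at_k(toy_rankings, toy_relevant, k=100)
```

In the toy data, the first relevant document appears at rank 2 for `q1`, at rank 1 for `q2`, and not at all for `q3`, so the score is (1/2 + 1 + 0) / 3 = 0.5.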
