no code implementations • 5 Apr 2024 • João Coelho, Bruno Martins, João Magalhães, Jamie Callan, Chenyan Xiong
This study investigates the existence of positional biases in Transformer-based models for text representation learning, particularly in the context of web document retrieval.
no code implementations • 6 Feb 2024 • Harshit Mehrotra, Jamie Callan, Zhen Fan
The ClueWeb22 dataset containing nearly 10 billion documents was released in 2022 to support academic and industry research.
1 code implementation • 11 May 2023 • Zhengbao Jiang, Frank F. Xu, Luyu Gao, Zhiqing Sun, Qian Liu, Jane Dwivedi-Yu, Yiming Yang, Jamie Callan, Graham Neubig
In this work, we provide a generalized view of active retrieval augmented generation: methods that actively decide when and what to retrieve over the course of generation.
2 code implementations • 20 Dec 2022 • Luyu Gao, Xueguang Ma, Jimmy Lin, Jamie Callan
Given a query, HyDE first zero-shot instructs an instruction-following language model (e.g., InstructGPT) to generate a hypothetical document.
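The HyDE idea described above can be sketched in a few lines: encode the LM-generated hypothetical answer instead of the raw query, then retrieve the nearest real document. The `toy_encode` function below is a hypothetical stand-in for a real dense encoder (e.g., Contriever), used only to make the sketch self-contained; it is not part of the paper.

```python
import numpy as np

def toy_encode(text: str, dim: int = 64) -> np.ndarray:
    """Hypothetical stand-in for a real dense encoder: deterministic
    bag-of-hashed-words embedding, L2-normalized."""
    vec = np.zeros(dim)
    for tok in text.lower().split():
        vec[hash(tok) % dim] += 1.0
    norm = np.linalg.norm(vec)
    return vec / norm if norm > 0 else vec

def hyde_retrieve(query: str, corpus: list, generate) -> str:
    """HyDE-style retrieval sketch: embed an LM-generated *hypothetical*
    document rather than the query itself, then return the nearest real
    document by inner product."""
    hypothetical = generate(query)                    # e.g., an InstructGPT call
    q_vec = toy_encode(hypothetical)
    doc_vecs = np.stack([toy_encode(d) for d in corpus])
    return corpus[int(np.argmax(doc_vecs @ q_vec))]
```

Because the hypothetical document lives in document space, its embedding tends to land closer to real relevant documents than a short query embedding would.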
1 code implementation • 5 Dec 2022 • Zhengbao Jiang, Luyu Gao, Jun Araki, Haibo Ding, Zhiruo Wang, Jamie Callan, Graham Neubig
Systems for knowledge-intensive tasks such as open-domain question answering (QA) usually consist of two stages: efficient retrieval of relevant documents from a large corpus and detailed reading of the selected documents to generate answers.
Ranked #1 on Passage Retrieval on Natural Questions
no code implementations • 29 Nov 2022 • Arnold Overwijk, Chenyan Xiong, Xiao Liu, Cameron VandenBerg, Jamie Callan
ClueWeb22, the newest iteration of the ClueWeb line of datasets, provides 10 billion web pages, each accompanied by rich affiliated information.
2 code implementations • 18 Nov 2022 • Luyu Gao, Aman Madaan, Shuyan Zhou, Uri Alon, Pengfei Liu, Yiming Yang, Jamie Callan, Graham Neubig
Much of this success can be attributed to prompting methods such as "chain-of-thought", which employ LLMs for both understanding the problem description by decomposing it into steps, as well as solving each step of the problem.
Ranked #18 on Arithmetic Reasoning on GSM8K
1 code implementation • 9 May 2022 • Luyu Gao, Jamie Callan
In this paper, we propose instead to model full query-to-document interaction, leveraging the attention operation and a modular Transformer re-ranker framework.
1 code implementation • 11 Mar 2022 • Luyu Gao, Xueguang Ma, Jimmy Lin, Jamie Callan
In this paper, we present Tevatron, a dense retrieval toolkit optimized for efficiency, flexibility, and code simplicity.
2 code implementations • 30 Aug 2021 • HongChien Yu, Chenyan Xiong, Jamie Callan
This paper proposes ANCE-PRF, a new query encoder that uses pseudo relevance feedback (PRF) to improve query representations for dense retrieval.
1 code implementation • ACL 2022 • Luyu Gao, Jamie Callan
Recent research demonstrates the effectiveness of using fine-tuned language models (LMs) for dense retrieval.
1 code implementation • EMNLP 2021 • Luyu Gao, Jamie Callan
Pre-trained Transformer language models (LM) have become go-to text representation encoders.
1 code implementation • NAACL 2021 • Luyu Gao, Zhuyun Dai, Jamie Callan
Classical information retrieval systems such as BM25 rely on exact lexical match and carry out search efficiently with an inverted index.
1 code implementation • 21 Jan 2021 • Luyu Gao, Zhuyun Dai, Jamie Callan
Pre-trained deep language models (LMs) have advanced the state-of-the-art of text retrieval.
no code implementations • 21 Jan 2021 • Luís Borges, Bruno Martins, Jamie Callan
Our work aimed at experimentally assessing the benefits of model ensembling within the context of neural methods for passage reranking.
1 code implementation • 20 Jan 2021 • HongChien Yu, Zhuyun Dai, Jamie Callan
Most research on pseudo relevance feedback (PRF) has been done in vector space and probabilistic retrieval models.
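The classical vector-space form of pseudo relevance feedback mentioned above can be illustrated with a Rocchio-style update: the query vector is moved toward the centroid of the top-ranked documents, which are assumed relevant. This is a generic textbook sketch, not the specific method of the paper; `alpha` and `beta` are the usual illustrative interpolation weights.

```python
import numpy as np

def rocchio_prf(query_vec, feedback_doc_vecs, alpha=1.0, beta=0.75):
    """Vector-space pseudo relevance feedback (Rocchio sketch):
    interpolate the original query with the centroid of the
    top-ranked (assumed-relevant) documents."""
    centroid = np.mean(np.asarray(feedback_doc_vecs, dtype=float), axis=0)
    return alpha * np.asarray(query_vec, dtype=float) + beta * centroid
```

The expanded query typically pulls in terms that co-occur with the original query terms in the feedback documents.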
5 code implementations • ACL (RepL4NLP) 2021 • Luyu Gao, Yunyi Zhang, Jiawei Han, Jamie Callan
Contrastive learning has been applied successfully to learn vector representations of text.
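The standard contrastive setup for text representations uses in-batch negatives with an InfoNCE loss: each query's aligned document is its positive, and every other document in the batch is a negative. The sketch below shows that generic loss (the paper's contribution concerns scaling the batch size, which this sketch does not implement); the `temperature` value is an illustrative choice.

```python
import numpy as np

def in_batch_infonce(q_embs, d_embs, temperature=0.05):
    """Generic in-batch-negative InfoNCE loss sketch for text pairs:
    row i of q_embs is paired with row i of d_embs; all other rows of
    d_embs act as negatives."""
    scores = np.asarray(q_embs) @ np.asarray(d_embs).T / temperature  # [B, B]
    scores -= scores.max(axis=1, keepdims=True)                       # numerical stability
    log_probs = scores - np.log(np.exp(scores).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_probs))          # NLL of the aligned (diagonal) pairs
```

A well-trained encoder makes the diagonal scores dominate each row, driving the loss toward zero.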
no code implementations • Findings of the Association for Computational Linguistics 2020 • Vaibhav Kumar, Jamie Callan
Given an input question, it uses a BERT-based classifier (trained with weak supervision) to de-contextualize the input by selecting relevant terms from the dialog history.
no code implementations • 19 Aug 2020 • Shuo Zhang, Krisztian Balog, Jamie Callan
Category systems are central components of knowledge bases, as they provide a hierarchical grouping of semantically related concepts and entities.
no code implementations • 18 Aug 2020 • Vaibhav Kumar, Vikas Raunak, Jamie Callan
Teaching machines to ask clarifying questions in response to a natural language query is of immense utility in practical natural language processing systems.
no code implementations • 21 Jul 2020 • Luyu Gao, Zhuyun Dai, Jamie Callan
Deep language models such as BERT pre-trained on large corpora have given a huge performance boost to state-of-the-art information retrieval ranking systems.
1 code implementation • 23 May 2020 • Shuo Zhang, Zhuyun Dai, Krisztian Balog, Jamie Callan
We propose to generate natural language summaries as answers to describe the complex information contained in a table.
no code implementations • 29 Apr 2020 • Luyu Gao, Zhuyun Dai, Tongfei Chen, Zhen Fan, Benjamin Van Durme, Jamie Callan
This paper presents CLEAR, a retrieval model that seeks to complement classical lexical exact-match models such as BM25 with semantic matching signals from a neural embedding matching model.
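At scoring time, a lexical-plus-semantic hybrid of the kind described above can be reduced to interpolating the two score lists. The sketch below shows only that fusion step; the fixed weight `lam` is an illustrative knob, whereas CLEAR itself trains the embedding model on the lexical model's residual errors rather than using a hand-set weight.

```python
import numpy as np

def hybrid_rank(lexical_scores, semantic_scores, lam=0.5):
    """Hybrid ranking sketch: interpolate an exact-match score
    (e.g., BM25) with a neural embedding match score, then return
    document indices ordered best-first."""
    fused = ((1 - lam) * np.asarray(lexical_scores, dtype=float)
             + lam * np.asarray(semantic_scores, dtype=float))
    return np.argsort(-fused)
```

Setting `lam=0` recovers the pure lexical ranking and `lam=1` the pure semantic one, which makes the complementarity of the two signals easy to probe.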
no code implementations • EMNLP 2020 • Luyu Gao, Zhuyun Dai, Jamie Callan
Recent innovations in Transformer-based ranking models have advanced the state-of-the-art in information retrieval.
1 code implementation • 30 Mar 2020 • Jeffrey Dalton, Chenyan Xiong, Jamie Callan
A common theme through the runs is the use of BERT-based neural reranking methods.
2 code implementations • 23 Oct 2019 • Zhuyun Dai, Jamie Callan
When applied to passages, DeepCT-Index produces term weights that can be stored in an ordinary inverted index for passage retrieval.
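The indexing trick described above can be sketched as follows: context-aware term weights (assumed here to be precomputed in [0, 1] by the model) are quantized to integers and stored where an ordinary inverted index keeps term frequencies, so unmodified BM25-style machinery can consume them. The scaling factor is an illustrative choice.

```python
from collections import defaultdict

def index_term_weights(passage_weights, scale=100):
    """DeepCT-Index-style indexing sketch: quantize predicted term
    weights into pseudo term frequencies and store them in a plain
    inverted index (term -> list of (doc_id, pseudo_tf))."""
    inverted = defaultdict(list)
    for doc_id, weights in enumerate(passage_weights):
        for term, w in weights.items():
            pseudo_tf = round(w * scale)
            if pseudo_tf > 0:                 # near-zero weights drop out of the index
                inverted[term].append((doc_id, pseudo_tf))
    return dict(inverted)
```

Because low-weight terms quantize to zero, uninformative words are pruned from the postings lists entirely, which also shrinks the index.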
1 code implementation • 22 May 2019 • Zhuyun Dai, Jamie Callan
Neural networks provide new possibilities to automatically learn complex language patterns and query-document relations.
Ranked #5 on Ad-Hoc Information Retrieval on TREC Robust04
no code implementations • 27 Sep 2018 • Mary Arpita Pyreddy, Varshini Ramaseshan, Narendra Nath Joshi, Zhuyun Dai, Chenyan Xiong, Jamie Callan, Zhiyuan Liu
This paper studies the consistency of the kernel-based neural ranking model K-NRM, a recent state-of-the-art neural IR model, which is important for reproducible research and deployment in the industry.
no code implementations • 3 May 2018 • Chenyan Xiong, Zhengzhong Liu, Jamie Callan, Tie-Yan Liu
The salience model also improves ad hoc search accuracy, providing effective ranking features by modeling the salience of query entities in candidate documents.
no code implementations • WSDM 2018 • Zhuyun Dai, Chenyan Xiong, Jamie Callan, Zhiyuan Liu
This paper presents Conv-KNRM, a Convolutional Kernel-based Neural Ranking Model that models n-gram soft matches for ad-hoc search.
1 code implementation • 20 Jun 2017 • Chenyan Xiong, Zhuyun Dai, Jamie Callan, Zhiyuan Liu, Russell Power
Given a query and a set of documents, K-NRM uses a translation matrix that models word-level similarities via word embeddings, a new kernel-pooling technique that uses kernels to extract multi-level soft match features, and a learning-to-rank layer that combines those features into the final ranking score.
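The kernel-pooling step described above can be sketched directly: each RBF kernel, centered at a similarity level `mu`, counts soft matches at that level across the document, and the counts are log-summed over query terms to yield one feature per kernel. The kernel width `sigma` below is an illustrative value, not necessarily the paper's setting.

```python
import numpy as np

def kernel_pooling(translation_matrix, mus, sigma=0.1):
    """K-NRM-style kernel pooling sketch: translation_matrix holds
    query-term x document-term cosine similarities; each RBF kernel
    produces one soft-match feature."""
    M = np.asarray(translation_matrix, dtype=float)        # [|q|, |d|]
    feats = []
    for mu in mus:
        soft_tf = np.exp(-((M - mu) ** 2) / (2 * sigma ** 2)).sum(axis=1)
        feats.append(np.log(np.clip(soft_tf, 1e-10, None)).sum())
    return np.array(feats)                                 # one feature per kernel
```

A learning-to-rank layer then combines these per-kernel features into the final ranking score.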
no code implementations • 20 Jun 2017 • Chenyan Xiong, Jamie Callan, Tie-Yan Liu
This paper presents a word-entity duet framework for utilizing knowledge bases in ad-hoc retrieval.
no code implementations • 1 Jul 2013 • Bhavana Dalvi, William W. Cohen, Jamie Callan
In multiclass semi-supervised learning (SSL), it is sometimes the case that the number of classes present in the data is not known, and hence no labeled examples are provided for some classes.
no code implementations • 1 Jul 2013 • Bhavana Dalvi, William W. Cohen, Jamie Callan
We describe an open-domain information extraction method for extracting concept-instance pairs from an HTML corpus.