1 code implementation • 4 Apr 2024 • Ryo Kamoi, Sarkar Snigdha Sarathi Das, Renze Lou, Jihyun Janice Ahn, Yilun Zhao, Xiaoxin Lu, Nan Zhang, Yusen Zhang, Ranran Haoran Zhang, Sujeeth Reddy Vummanthala, Salika Dave, Shaobo Qin, Arman Cohan, Wenpeng Yin, Rui Zhang
This work introduces ReaLMistake, the first error detection benchmark consisting of objective, realistic, and diverse errors made by LLMs.
no code implementations • 16 Nov 2023 • Yilun Zhao, Yitao Long, Hongjun Liu, Linyong Nan, Lyuhao Chen, Ryo Kamoi, Yixin Liu, Xiangru Tang, Rui Zhang, Arman Cohan
This paper introduces DocMath-Eval, a comprehensive benchmark specifically designed to evaluate the numerical reasoning and problem-solving capabilities of LLMs in the context of understanding and analyzing financial documents containing both text and tables.
1 code implementation • 14 Nov 2023 • Yusen Zhang, Nan Zhang, Yixin Liu, Alexander Fabbri, Junru Liu, Ryo Kamoi, Xiaoxin Lu, Caiming Xiong, Jieyu Zhao, Dragomir Radev, Kathleen McKeown, Rui Zhang
However, current work in summarization metrics and Large Language Models (LLMs) evaluation has not explored fair abstractive summarization.
1 code implementation • 2 Mar 2023 • Ryo Kamoi, Tanya Goyal, Juan Diego Rodriguez, Greg Durrett
Textual entailment models are increasingly applied in settings like fact-checking, presupposition verification in question answering, or summary evaluation.
1 code implementation • 13 Oct 2022 • Ryo Kamoi, Tanya Goyal, Greg Durrett
Despite recent progress in abstractive summarization, models often generate summaries with factual errors.
no code implementations • 1 Mar 2020 • Ryo Kamoi, Kei Kobayashi
This suggests that the presumed reason the Mahalanobis confidence score works so well is mistaken, and that the score makes use of different information from ODIN, another popular OoD detection method based on prediction confidence.
no code implementations • 15 Nov 2019 • Ryo Kamoi, Kei Kobayashi
This paper focuses on the relationship between the choice of a prior distribution and the likelihoods assigned to out-of-distribution inputs.