Agreement is overrated: A plea for correlation to assess human evaluation reliability

WS 2019  ·  Jacopo Amidei, Paul Piwek, Alistair Willis ·

Inter-Annotator Agreement (IAA) is used as a means of assessing the quality of NLG evaluation data, in particular, its reliability. According to existing scales of IAA interpretation {--} see, for example, Lommel et al. (2014), Liu et al. (2016), Sedoc et al. (2018) and Amidei et al. (2018a) {--} most data collected for NLG evaluation fail the reliability test. We confirmed this trend by analysing papers published over the last 10 years in NLG-specific conferences (in total 135 papers that included some sort of human evaluation study). Following Sampson and Babarczy (2008), Lommel et al. (2014), Joshi et al. (2016) and Amidei et al. (2018b), such phenomena can be explained in terms of irreducible human language variability. Using three case studies, we show the limits of considering IAA as the only criterion for checking evaluation reliability. Given human language variability, we propose that for human evaluation of NLG, correlation coefficients and agreement coefficients should be used together to obtain a better assessment of the evaluation data reliability. This is illustrated using the three case studies.

PDF Abstract
No code implementations yet. Submit your code now

Datasets


  Add Datasets introduced or used in this paper

Results from the Paper


  Submit results from this paper to get state-of-the-art GitHub badges and help the community compare results to other papers.

Methods


No methods listed for this paper. Add relevant methods here