A Dutch Dataset for Cross-lingual Multilabel Toxicity Detection

RANLP (BUCC) 2021 · Ben Burtenshaw, Mike Kestemont

Multi-label toxicity detection is an active research area, with many research groups, companies, and individuals engaging with it through shared tasks and dedicated venues. This paper describes a cross-lingual approach to annotating a newly developed Dutch-language dataset for multi-label text classification, using a model trained on English data. We present an ensemble of a Transformer model and an LSTM that uses multilingual embeddings. The combination of multilingual embeddings and the Transformer model improves performance in a cross-lingual setting.
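
The abstract does not state how the outputs of the two models are combined. As a minimal sketch, assuming the ensemble is a weighted average of per-label probabilities followed by a fixed threshold, multi-label prediction could look like the following. The label names, the `ensemble_multilabel` function, the `weight` and `threshold` parameters, and the example scores are all hypothetical and are not taken from the paper.

```python
from typing import Dict, List

# Hypothetical label set; the paper's actual toxicity labels may differ.
LABELS = ["toxic", "insult", "threat"]


def ensemble_multilabel(
    transformer_probs: Dict[str, float],
    lstm_probs: Dict[str, float],
    weight: float = 0.5,
    threshold: float = 0.5,
) -> List[str]:
    """Combine per-label probabilities from two models and return predicted labels.

    Assumption: a weighted average of the two models' probabilities.
    `weight` is the share given to the Transformer; the LSTM gets the rest.
    Each label is decided independently, which is what makes this multi-label.
    """
    predicted = []
    for label in LABELS:
        combined = weight * transformer_probs[label] + (1.0 - weight) * lstm_probs[label]
        if combined >= threshold:
            predicted.append(label)
    return predicted


# Made-up scores for a single Dutch comment, for illustration only.
transformer_scores = {"toxic": 0.9, "insult": 0.4, "threat": 0.1}
lstm_scores = {"toxic": 0.7, "insult": 0.7, "threat": 0.2}
labels = ensemble_multilabel(transformer_scores, lstm_scores)
```

Because each label is thresholded on its own, a text can receive several labels or none. Other combination schemes, such as stacking a classifier on top of both models, would be equally consistent with the abstract.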
