DNLP@FinTOC’20: Table of Contents Detection in Financial Documents

Title Detection and Table of Contents Generation are important components in detecting document structure. In particular, these two elements serve to provide the skeleton of the document, providing users with an understanding of organization, as well as the relevance of information, and where to find information within the document. Here, we show that using tesseract with Levenstein distance, a feature set inspired by Alk et al., we were able to correctly classify the title to an F1 measure 0.73 and 0.87, and the table-of-contents to a harmonic mean of 0.36 and 0.39, in English and French respectively. Our methodology works with both PDF and scanned documents, giving it a wide range of applicability within the document engineering and storage domains.

PDF Abstract
No code implementations yet. Submit your code now

Tasks


Datasets


  Add Datasets introduced or used in this paper

Results from the Paper


  Submit results from this paper to get state-of-the-art GitHub badges and help the community compare results to other papers.

Methods


No methods listed for this paper. Add relevant methods here