Document Enhancement System Using Auto-encoders

NeurIPS Workshop Document Intelligence 2019 · Mehrdad J. Gangeh, Sunil R. Tiyyagura, Sridhar V. Dasaratha, Hamid Motahari, Nigel P. Duffy

The conversion of scanned documents to digital form is performed using Optical Character Recognition (OCR) software. This work focuses on improving the quality of scanned documents in order to improve OCR output. We create an end-to-end document enhancement pipeline that takes in a set of noisy documents and produces clean ones. Denoising auto-encoders based on deep neural networks are trained to improve OCR quality. We train a blind model that works across different noise levels in scanned text documents. Results are shown for the removal of blurring and watermark noise from noisy scanned documents.
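
A denoising auto-encoder is trained on (noisy, clean) image pairs, and a blind model is one that is not told the noise level of its input. The sketch below illustrates one way such pairs could be prepared, by degrading a clean page with a randomly sampled blur radius and watermark opacity. It is a hypothetical illustration, not the authors' pipeline: the abstract does not say how the training data was obtained, and the function names (`box_blur`, `add_watermark`, `make_training_pair`), the box filter, the alpha blending, and the sampling ranges are all assumptions made for the example.

```python
import random

def box_blur(img, radius):
    """Blur a 2D grayscale image (list of rows, values in [0, 1]) with a box filter.

    Assumption: a box filter stands in for whatever blur a real scanner produces.
    """
    h, w = len(img), len(img[0])
    out = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            total, count = 0.0, 0
            # Average over the neighbours that fall inside the image.
            for dy in range(-radius, radius + 1):
                for dx in range(-radius, radius + 1):
                    yy, xx = y + dy, x + dx
                    if 0 <= yy < h and 0 <= xx < w:
                        total += img[yy][xx]
                        count += 1
            out[y][x] = total / count
    return out

def add_watermark(img, mark, alpha):
    """Alpha-blend a watermark image of the same size onto the page."""
    return [[(1 - alpha) * p + alpha * m for p, m in zip(prow, mrow)]
            for prow, mrow in zip(img, mark)]

def make_training_pair(clean, mark, rng):
    """Return (noisy, clean) with a randomly sampled degradation level.

    Sampling the level per example means the model never sees it as an
    input, which is the sense of "blind" assumed here.
    """
    radius = rng.randint(0, 2)      # hypothetical range of blur strengths
    alpha = rng.uniform(0.0, 0.5)   # hypothetical range of watermark opacity
    noisy = add_watermark(box_blur(clean, radius), mark, alpha)
    return noisy, clean

# Toy 8x8 "page": white background (1.0) with one dark stroke (0.0).
clean = [[1.0] * 8 for _ in range(8)]
for x in range(2, 6):
    clean[3][x] = 0.0

# Toy watermark: a grey diagonal line on a white background.
mark = [[0.5 if x == y else 1.0 for x in range(8)] for y in range(8)]

rng = random.Random(0)  # fixed seed so the sketch is reproducible
pairs = [make_training_pair(clean, mark, rng) for _ in range(4)]
```

Each element of `pairs` would serve as one (input, target) example for the auto-encoder, with the clean page as the reconstruction target.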
