TASK	DATASET	MODEL	METRIC NAME	METRIC VALUE	GLOBAL RANK
Visual Question Answering (VQA)	AI2D	DUBLIN	EM	51.11	# 4
Visual Question Answering (VQA)	DeepForm	DUBLIN	F1	62.23	# 1
Visual Question Answering (VQA)	DocVQA test	DUBLIN	ANLS	0.782	# 23
Visual Question Answering (VQA)	DocVQA test	DUBLIN (variable resolution)	ANLS	0.803	# 21
Visual Question Answering (VQA)	InfographicVQA	DUBLIN (variable resolution)	ANLS	42.6	# 16
Visual Question Answering (VQA)	InfographicVQA	DUBLIN	ANLS	36.82	# 20
Visual Question Answering (VQA)	WebSRC	DUBLIN	EM	77.75	# 1
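
The DocVQA and InfographicVQA rows above report ANLS (Average Normalized Levenshtein Similarity). As a rough guide to reading those numbers, the sketch below assumes the commonly used definition: for each question, take the best similarity against any reference answer, count it as zero when the normalized edit distance reaches a threshold of 0.5, and average over questions. The function names `levenshtein` and `anls`, and the lowercase and whitespace normalization, are illustrative assumptions rather than the paper's or any benchmark's official scoring code. Under this definition ANLS lies in [0, 1], so the InfographicVQA values in the table appear to be reported on a 0-100 scale.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance where insertion, deletion and substitution each cost 1."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def anls(predictions, references, threshold=0.5):
    """Average, over questions, of the best thresholded similarity.

    predictions: list of predicted answer strings.
    references:  list of lists of acceptable answers, one list per question.
    threshold:   assumed cutoff; distances at or above it score 0.
    """
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have equal length")
    if not predictions:
        return 0.0

    def normalize(text):
        # Assumed normalization: lowercase and collapse whitespace.
        return " ".join(text.lower().split())

    total = 0.0
    for prediction, golds in zip(predictions, references):
        pred = normalize(prediction)
        best = 0.0
        for gold in golds:
            ref = normalize(gold)
            longest = max(len(pred), len(ref))
            distance = levenshtein(pred, ref) / longest if longest else 0.0
            score = 1.0 - distance if distance < threshold else 0.0
            best = max(best, score)
        total += best
    return total / len(predictions)
```

A call such as `anls(["Dublin"], [["dublin", "Dublin, Ireland"]])` scores one question against two acceptable answers; multiplying the result by 100 gives the percentage-style figure.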

DUBLIN -- Document Understanding By Language-Image Network

23 May 2023 · Kriti Aggarwal, Aditi Khandelwal, Kumar Tanmay, Owais Mohammed Khan, Qiang Liu, Monojit Choudhury, Hardik Hansrajbhai Chauhan, Subhojit Som, Vishrav Chaudhary, Saurabh Tiwary

Visual document understanding is a complex task that involves analyzing both the text and the visual elements in document images. Existing models often rely on manual feature engineering or domain-specific pipelines, which limit their generalization ability across different document types and languages. In this paper, we propose DUBLIN, which is pretrained on web pages using three novel objectives: Masked Document Text Generation Task, Bounding Box Task, and Rendered Question Answering Task, that leverage both the spatial and semantic information in the document images. Our model achieves competitive or state-of-the-art results on several benchmarks, such as Web-Based Structural Reading Comprehension, Document Visual Question Answering, Key Information Extraction, Diagram Understanding, and Table Question Answering. In particular, we show that DUBLIN is the first pixel-based model to achieve an EM of 77.75 and F1 of 84.25 on the WebSRC dataset. We also show that our model outperforms the current pixel-based SOTA models on DocVQA, InfographicsVQA, OCR-VQA and AI2D datasets by 4.6%, 6.5%, 2.6% and 21%, respectively. We also achieve competitive performance on RVL-CDIP document classification. Moreover, we create new baselines for text-based datasets by rendering them as document images to promote research in this direction.

Code

No code implementations yet.

Tasks

Document Classification

Document Understanding

Feature Engineering

Key Information Extraction

Optical Character Recognition (OCR)

Question Answering

Reading Comprehension

Text Generation

Visual Question Answering

Visual Question Answering (VQA)

Datasets

SQuAD

Natural Questions

WikiSQL

FUNSD

DocVQA

RVL-CDIP

TabFact

ChartQA

TextCaps

AI2D

InfographicVQA

VisualMRC

Screen2Words

Results from the Paper

Ranked #1 on Visual Question Answering (VQA) on DeepForm

Methods

No methods listed for this paper.
