TASK	DATASET	MODEL	METRIC NAME	METRIC VALUE	GLOBAL RANK
Visual Question Answering (VQA)	A-OKVQA	PaLI-X-VPD	MC Accuracy	80.4	# 2
Visual Question Answering (VQA)	A-OKVQA	PaLI-X-VPD	DA VQA Score	68.2	# 2
Visual Question Answering (VQA)	GQA test-dev	PaLI-X-VPD	Accuracy	67.3	# 2
Meme Classification	Hateful Memes	PaLI-X-VPD	ROC-AUC	0.892	# 1
Visual Question Answering (VQA)	OK-VQA	PaLI-X-VPD	Accuracy	66.8	# 1
Object Counting	TallyQA-Complex	PaLI-X-VPD	Accuracy	76.6	# 2
Object Counting	TallyQA-Simple	PaLI-X-VPD	Accuracy	86.2	# 2

Visual Program Distillation: Distilling Tools and Programmatic Reasoning into Vision-Language Models

5 Dec 2023 · Yushi Hu, Otilia Stretcu, Chun-Ta Lu, Krishnamurthy Viswanathan, Kenji Hata, Enming Luo, Ranjay Krishna, Ariel Fuxman

Solving complex visual tasks such as "Who invented the musical instrument on the right?" involves a composition of skills: understanding space, recognizing instruments, and also retrieving prior knowledge. Recent work shows promise by decomposing such tasks using a large language model (LLM) into an executable program that invokes specialized vision models. However, generated programs are error-prone: they omit necessary steps, include spurious ones, and are unable to recover when the specialized models give incorrect outputs. Moreover, they require loading multiple models, incurring high latency and computation costs. We propose Visual Program Distillation (VPD), an instruction tuning framework that produces a vision-language model (VLM) capable of solving complex visual tasks with a single forward pass. VPD distills the reasoning ability of LLMs by using them to sample multiple candidate programs, which are then executed and verified to identify a correct one. It translates each correct program into a language description of the reasoning steps, which are then distilled into a VLM. Extensive experiments show that VPD improves the VLM's ability to count, understand spatial relations, and reason compositionally. Our VPD-trained PaLI-X outperforms all prior VLMs, achieving state-of-the-art performance across complex vision tasks, including MMBench, OK-VQA, A-OKVQA, TallyQA, POPE, and Hateful Memes. An evaluation with human annotators also confirms that VPD improves model response factuality and consistency. Finally, experiments on content moderation demonstrate that VPD is also helpful for adaptation to real-world applications with limited data.
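The sample, execute, verify, and translate steps described in the abstract can be illustrated with a small toy sketch. This is not the authors' implementation: the names `Program`, `TrainingExample`, `select_verified_program`, `trace_to_rationale`, and `build_example` are hypothetical, candidate programs are stood in for by plain Python functions over a dictionary rather than LLM-generated code calling vision models, and verification is assumed to mean matching a known label.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple

# Hypothetical stand-in for a generated program: it takes a toy "image"
# record and returns (answer, trace), where trace lists the executed steps.
Program = Callable[[dict], Tuple[str, List[str]]]


@dataclass
class TrainingExample:
    """One distillation example: question, step-by-step rationale, answer."""
    question: str
    rationale: str
    answer: str


def select_verified_program(
    programs: List[Program], image: dict, label: str
) -> Optional[Tuple[str, List[str]]]:
    """Run each candidate and return the first result whose answer matches the label."""
    for program in programs:
        try:
            answer, trace = program(image)
        except Exception:
            continue  # a candidate that crashes is discarded
        if answer == label:  # assumption: verification = agreement with the label
            return answer, trace
    return None


def trace_to_rationale(trace: List[str]) -> str:
    """Turn an execution trace into a numbered natural-language rationale."""
    return " ".join(f"Step {i}: {step}." for i, step in enumerate(trace, start=1))


def build_example(
    question: str, programs: List[Program], image: dict, label: str
) -> Optional[TrainingExample]:
    """Produce a training example, or None if no candidate program is verified."""
    result = select_verified_program(programs, image, label)
    if result is None:
        return None
    answer, trace = result
    return TrainingExample(question, trace_to_rationale(trace), answer)


# Three toy candidates: one wrong, one that crashes, one correct.
def count_all_objects(image: dict) -> Tuple[str, List[str]]:
    n = len(image["objects"])
    return str(n), [f"detect objects, found {n}"]


def broken(image: dict) -> Tuple[str, List[str]]:
    return image["missing_key"], []  # raises KeyError


def count_cats(image: dict) -> Tuple[str, List[str]]:
    cats = [o for o in image["objects"] if o == "cat"]
    return str(len(cats)), ["detect objects", f"keep those labeled cat, found {len(cats)}"]


if __name__ == "__main__":
    example = build_example(
        "How many cats are there?",
        [count_all_objects, broken, count_cats],
        {"objects": ["cat", "dog", "cat"]},
        "2",
    )
    print(example)
```

In this sketch, only the rationale and answer of a verified candidate would be kept as a fine-tuning target; questions for which no candidate is verified yield `None` and would need separate handling.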

Code

No code implementations yet.

Tasks

Language Modelling

Large Language Model

Meme Classification

Object Counting

Visual Question Answering (VQA)

Datasets

Visual Genome

GQA

OK-VQA

TextVQA

Hateful Memes

A-OKVQA

MMBench

TallyQA

Results from the Paper

Ranked #1 on Meme Classification on Hateful Memes

Methods

No methods listed for this paper.
