TASK	DATASET	MODEL	METRIC NAME	METRIC VALUE	GLOBAL RANK
Visual Question Answering (VQA)	GQA	RelViT	Accuracy	65.54	# 2
Zero-Shot Human-Object Interaction Detection	HICO	RelViT	mAP (NonRare)	42.04	# 1
Zero-Shot Human-Object Interaction Detection	HICO	RelViT	mAP (Rare)	28.36	# 1
Human-Object Interaction Detection	HICO	RelViT	mAP	43.98	# 4

Badge	Markdown
	`[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/relvit-concept-guided-vision-transformer-for-1/zero-shot-human-object-interaction-detection-1)](https://paperswithcode.com/sota/zero-shot-human-object-interaction-detection-1?p=relvit-concept-guided-vision-transformer-for-1)`
	`[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/relvit-concept-guided-vision-transformer-for-1/visual-question-answering-on-gqa)](https://paperswithcode.com/sota/visual-question-answering-on-gqa?p=relvit-concept-guided-vision-transformer-for-1)`
	`[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/relvit-concept-guided-vision-transformer-for-1/human-object-interaction-detection-on-hico-1)](https://paperswithcode.com/sota/human-object-interaction-detection-on-hico-1?p=relvit-concept-guided-vision-transformer-for-1)`

RelViT: Concept-guided Vision Transformer for Visual Relational Reasoning

ICLR 2022 · Xiaojian Ma, Weili Nie, Zhiding Yu, Huaizu Jiang, Chaowei Xiao, Yuke Zhu, Song-Chun Zhu, Anima Anandkumar

Reasoning about visual relationships is central to how humans interpret the visual world. This task remains challenging for current deep learning algorithms since it requires addressing three key technical problems jointly: 1) identifying object entities and their properties, 2) inferring semantic relations between pairs of entities, and 3) generalizing to novel object-relation combinations, i.e., systematic generalization. In this work, we use vision transformers (ViTs) as our base model for visual reasoning and make better use of concepts defined as object entities and their relations to improve the reasoning ability of ViTs. Specifically, we introduce a novel concept-feature dictionary to allow flexible image feature retrieval at training time with concept keys. This dictionary enables two new concept-guided auxiliary tasks: 1) a global task for promoting relational reasoning, and 2) a local task for facilitating semantic object-centric correspondence learning. To examine the systematic generalization of visual reasoning models, we introduce systematic splits for the standard HICO and GQA benchmarks. We show that the resulting model, Concept-guided Vision Transformer (or RelViT for short), significantly outperforms prior approaches on HICO and GQA by 16% and 13% in the original split, and by 43% and 18% in the systematic split. Our ablation analyses also reveal our model's compatibility with multiple ViT variants and robustness to hyper-parameters.
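To make the idea of a concept-feature dictionary more concrete, the following is a minimal sketch of a store that keeps image features under concept keys and returns one on request. It is not the authors' implementation: the class name `ConceptFeatureDictionary`, the bounded first-in-first-out queue per concept, the uniform random sampling, and the use of plain Python lists as stand-ins for feature vectors are all assumptions made for illustration.

```python
import random
from collections import defaultdict, deque


class ConceptFeatureDictionary:
    """Hypothetical concept-keyed store of image features (illustrative only)."""

    def __init__(self, capacity_per_concept=4, seed=0):
        # Assumption: each concept key owns a bounded FIFO queue, so the
        # oldest feature is dropped once the queue is full.
        self.capacity = capacity_per_concept
        self.queues = defaultdict(lambda: deque(maxlen=self.capacity))
        self.rng = random.Random(seed)

    def add(self, concepts, feature):
        # Store the same image feature under every concept the image carries.
        for concept in concepts:
            self.queues[concept].append(feature)

    def retrieve(self, concept):
        # Return one stored feature for the concept, or None if nothing has
        # been stored yet. Uniform sampling is an assumption of this sketch.
        queue = self.queues.get(concept)
        if not queue:
            return None
        return self.rng.choice(list(queue))


# Usage: features from two earlier images are stored, then one is retrieved
# by concept key for an auxiliary training objective.
store = ConceptFeatureDictionary(capacity_per_concept=2)
store.add(["ride", "bicycle"], [0.1, 0.2])
store.add(["ride", "horse"], [0.3, 0.4])
retrieved = store.retrieve("ride")
```

In a training loop, a retrieved feature could serve as the target that the current image's feature is pulled toward when both images share a concept. How that objective is defined for the global and local tasks is left to the paper and the official repository.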

Code

NVlabs/RelViT official

Tasks

Human-Object Interaction Detection

Object

Retrieval

Systematic Generalization

Visual Question Answering (VQA)

Visual Reasoning

Zero-Shot Human-Object Interaction Detection

Datasets

GQA

HICO

Results from the Paper

Ranked #1 on Zero-Shot Human-Object Interaction Detection on HICO

Methods

Layer Normalization • Linear Layer • Multi-Head Attention • Position-Wise Feed-Forward Layer • Residual Connection • Scaled Dot-Product Attention • Transformer • Vision Transformer
