Affordance Grounding from Demonstration Video to Target Image

Humans excel at learning from expert demonstrations and solving their own problems. To equip intelligent robots and assistants, such as AR glasses, with this ability, it is essential to ground human hand interactions (i.e., affordances) from demonstration videos and apply them to a target image, such as a user's AR glasses view. This video-to-image affordance grounding task is challenging because (1) it requires predicting fine-grained affordances, and (2) the limited training data inadequately covers video-image discrepancies, which degrades grounding. To tackle these challenges, we propose the Affordance Transformer (Afformer), which has a fine-grained transformer-based decoder that gradually refines affordance grounding. Moreover, we introduce Mask Affordance Hand (MaskAHand), a self-supervised pre-training technique that synthesizes video-image data and simulates context changes, enhancing affordance grounding across video-image discrepancies. Afformer with MaskAHand pre-training achieves state-of-the-art performance on multiple benchmarks, including a substantial 37% improvement on the OPRA dataset. Code is available at https://github.com/showlab/afformer.
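
As a concrete illustration of the coarse-to-fine decoding idea described above, the sketch below shows a transformer decoder whose queries come from target-image features at progressively finer resolutions and attend to demonstration-video features, carrying each coarser heatmap forward as a residual prior. The module name CoarseToFineDecoder, the resolution schedule, and all feature shapes are illustrative assumptions, not the actual Afformer implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class CoarseToFineDecoder(nn.Module):
    """Progressively refines an affordance heatmap over the target image,
    conditioned on demonstration-video features via cross-attention.
    (Hypothetical sketch, not the released Afformer code.)"""

    def __init__(self, dim=256, num_heads=8, resolutions=(7, 14, 28)):
        super().__init__()
        self.resolutions = resolutions
        # One cross-attention refinement stage per output resolution.
        self.stages = nn.ModuleList(
            nn.TransformerDecoderLayer(d_model=dim, nhead=num_heads, batch_first=True)
            for _ in resolutions
        )
        self.to_heatmap = nn.Linear(dim, 1)

    def forward(self, image_feats, video_feats):
        # image_feats: (B, C, H, W) target-image features from the encoder
        # video_feats: (B, T, C) pooled demonstration-video features
        B = image_feats.shape[0]
        heatmap = None
        for res, stage in zip(self.resolutions, self.stages):
            # Resample target-image features to the current resolution (coarse -> fine).
            feats = F.interpolate(image_feats, size=(res, res),
                                  mode="bilinear", align_corners=False)
            queries = feats.flatten(2).transpose(1, 2)            # (B, res*res, C)
            refined = stage(queries, video_feats)                 # attend to the video
            logits = self.to_heatmap(refined).view(B, 1, res, res)
            if heatmap is not None:
                # Carry the coarser prediction forward as a residual prior.
                logits = logits + F.interpolate(heatmap, size=(res, res),
                                                mode="bilinear", align_corners=False)
            heatmap = logits
        return torch.sigmoid(heatmap)                             # (B, 1, 28, 28)


# Example usage with random features (shapes are illustrative assumptions).
decoder = CoarseToFineDecoder()
img = torch.randn(2, 256, 7, 7)      # encoder features of the target image
vid = torch.randn(2, 16, 256)        # 16 pooled demonstration-video tokens
print(decoder(img, vid).shape)       # torch.Size([2, 1, 28, 28])
```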

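The next sketch gives one plausible reading of the MaskAHand pre-training data synthesis mentioned in the abstract: hand regions are masked out of a video frame to create a pseudo target image, the frame is randomly cropped and rescaled to simulate video-image context changes, and the masked hand region supplies the grounding label. The function name, the use of precomputed hand boxes, and the Gaussian-smoothed label are illustrative assumptions, not the released MaskAHand code.

```python
import random
import numpy as np
import cv2


def synthesize_pretraining_pair(frame, hand_boxes, out_size=28):
    """frame: HxWx3 uint8 video frame; hand_boxes: list of (x1, y1, x2, y2)
    from an off-the-shelf hand detector. (Hypothetical sketch.)"""
    target = frame.copy()
    heatmap = np.zeros(frame.shape[:2], dtype=np.float32)
    for x1, y1, x2, y2 in hand_boxes:
        target[y1:y2, x1:x2] = 0                      # mask out the hand region
        heatmap[y1:y2, x1:x2] = 1.0                   # pseudo affordance label
    # Simulate context change between demonstration video and target image
    # with a random zoom/crop of the masked frame.
    h, w = target.shape[:2]
    scale = random.uniform(0.7, 1.0)
    ch, cw = int(h * scale), int(w * scale)
    y0, x0 = random.randint(0, h - ch), random.randint(0, w - cw)
    target = cv2.resize(target[y0:y0 + ch, x0:x0 + cw], (w, h))
    heatmap = cv2.resize(heatmap[y0:y0 + ch, x0:x0 + cw], (w, h))
    # Blur and downsample the label into a soft grounding heatmap.
    heatmap = cv2.GaussianBlur(heatmap, (0, 0), sigmaX=9)
    heatmap = cv2.resize(heatmap, (out_size, out_size))
    if heatmap.max() > 0:
        heatmap /= heatmap.max()
    return target, heatmap                            # pseudo target image + label
```
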
| Task | Dataset | Model | Metric | Value | Global Rank |
|---|---|---|---|---|---|
| Video-to-image Affordance Grounding | EPIC-Hotspot | Afformer | KLD | 0.97 | #1 |
| Video-to-image Affordance Grounding | EPIC-Hotspot | Afformer | SIM | 0.56 | #1 |
| Video-to-image Affordance Grounding | EPIC-Hotspot | Afformer | AUC-J | 0.88 | #1 |
| Video-to-image Affordance Grounding | OPRA | Afformer (ResNet-50-FPN encoder) | KLD | 1.55 | #2 |
| Video-to-image Affordance Grounding | OPRA | Afformer (ResNet-50-FPN encoder) | Top-1 Action Accuracy | 52.14 | #2 |
| Video-to-image Affordance Grounding | OPRA | Afformer (ViTDet-B encoder) | KLD | 1.51 | #1 |
| Video-to-image Affordance Grounding | OPRA | Afformer (ViTDet-B encoder) | Top-1 Action Accuracy | 52.27 | #1 |
| Video-to-image Affordance Grounding | OPRA (28x28) | Afformer | KLD | 1.05 | #1 |
| Video-to-image Affordance Grounding | OPRA (28x28) | Afformer | SIM | 0.53 | #1 |
| Video-to-image Affordance Grounding | OPRA (28x28) | Afformer | AUC-J | 0.89 | #1 |
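
KLD, SIM, and AUC-J in the table are the standard saliency-style metrics computed between predicted and ground-truth affordance heatmaps (KLD lower is better; SIM and AUC-J higher are better). The NumPy sketch below shows KLD and SIM under common normalization conventions; the benchmark's official evaluation code may differ in details such as epsilon values and normalization.

```python
import numpy as np

EPS = 1e-12


def _normalize(h):
    # Treat a non-negative heatmap as a probability distribution.
    h = h.astype(np.float64)
    h = h - h.min()
    return h / (h.sum() + EPS)


def kld(pred, gt):
    # KL divergence of the prediction from the ground truth (lower is better).
    p, q = _normalize(gt), _normalize(pred)
    return float(np.sum(p * np.log(p / (q + EPS) + EPS)))


def sim(pred, gt):
    # Histogram intersection / similarity in [0, 1] (higher is better).
    p, q = _normalize(gt), _normalize(pred)
    return float(np.sum(np.minimum(p, q)))
```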
