Learning Visual Affordance Grounding from Demonstration Videos

12 Aug 2021  ·  Hongchen Luo, Wei Zhai, Jing Zhang, Yang Cao, Dacheng Tao

Visual affordance grounding aims to segment all possible regions of an image or video where a person could interact with an object, which benefits many applications such as robot grasping and action recognition. However, existing methods mainly rely on the appearance features of objects to segment each region of the image, which faces two problems: (i) there are multiple possible interaction regions on an object; and (ii) the same object region can support multiple possible human interactions. To address these problems, we propose a Hand-aided Affordance Grounding Network (HAG-Net) that leverages the cues provided by the position and action of the hand in demonstration videos to eliminate these ambiguities and better locate the interaction regions on the object. Specifically, HAG-Net has a dual-branch structure to process the demonstration video and the object image. For the video branch, we introduce hand-aided attention to enhance the region around the hand in each video frame and then use an LSTM network to aggregate the action features. For the object branch, we introduce a semantic enhancement module (SEM) that makes the network focus on different parts of the object according to the action class, and we use a distillation loss to align the output features of the object branch with those of the video branch, transferring knowledge from the video branch to the object branch. Quantitative and qualitative evaluations on two challenging datasets show that our method achieves state-of-the-art results for affordance grounding. The source code will be made available to the public.
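
The abstract describes the architecture only at a high level. The minimal PyTorch sketch below illustrates one plausible arrangement of the named pieces; the backbones, the mask-based hand-aided attention, the SEM as an action-conditioned channel gate, and the MSE distillation term are all assumptions for illustration, not the paper's exact implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class HandAidedAttention(nn.Module):
    """Re-weights frame features so regions near the hand are enhanced.
    Assumes the hand cue is a per-frame spatial mask (e.g. a Gaussian around
    the detected hand box); the paper's exact weighting may differ."""
    def forward(self, feat, hand_mask):
        # feat: (B, C, H, W), hand_mask: (B, 1, H, W) in [0, 1]
        return feat * (1.0 + hand_mask)

class VideoBranch(nn.Module):
    """Per-frame CNN features -> hand-aided attention -> LSTM over time."""
    def __init__(self, feat_dim=256):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Conv2d(3, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(64, feat_dim, 3, stride=2, padding=1), nn.ReLU(),
        )
        self.attn = HandAidedAttention()
        self.lstm = nn.LSTM(feat_dim, feat_dim, batch_first=True)

    def forward(self, frames, hand_masks):
        # frames: (B, T, 3, H, W), hand_masks: (B, T, 1, H, W)
        B, T = frames.shape[:2]
        feats = []
        for t in range(T):
            f = self.backbone(frames[:, t])                        # (B, C, h, w)
            m = F.interpolate(hand_masks[:, t], size=f.shape[-2:])
            f = self.attn(f, m)
            feats.append(f.mean(dim=(2, 3)))                       # global pooling
        seq = torch.stack(feats, dim=1)                            # (B, T, C)
        _, (h, _) = self.lstm(seq)
        return h[-1]                                               # aggregated action feature

class ObjectBranch(nn.Module):
    """Static-image branch with an action-conditioned semantic enhancement
    module (SEM); predicts the affordance heatmap."""
    def __init__(self, num_actions=7, feat_dim=256):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Conv2d(3, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(64, feat_dim, 3, stride=2, padding=1), nn.ReLU(),
        )
        self.action_embed = nn.Embedding(num_actions, feat_dim)    # SEM: action class -> channel weights
        self.head = nn.Conv2d(feat_dim, 1, 1)

    def forward(self, image, action):
        f = self.backbone(image)                                   # (B, C, h, w)
        w = torch.sigmoid(self.action_embed(action))[..., None, None]
        f = f * w                                                  # focus on action-relevant channels
        return f.mean(dim=(2, 3)), self.head(f)                    # (feature, affordance heatmap)

def distillation_loss(obj_feat, vid_feat):
    """Align object-branch features with the (detached) video-branch features."""
    return F.mse_loss(obj_feat, vid_feat.detach())
```

At inference time only the object branch would be needed, since the video branch serves as a teacher that shapes the object-branch features during training.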

Benchmark results

Task: Video-to-image Affordance Grounding

Dataset         Model                  Metric   Value   Global Rank
EPIC-Hotspot    HAG-Net (+Hand Box)    KLD      1.21    #2
EPIC-Hotspot    HAG-Net (+Hand Box)    SIM      0.41    #2
EPIC-Hotspot    HAG-Net (+Hand Box)    AUC-J    0.80    #2
OPRA (28x28)    HAG-Net (+Hand Box)    KLD      1.41    #3
OPRA (28x28)    HAG-Net (+Hand Box)    SIM      0.37    #3
OPRA (28x28)    HAG-Net (+Hand Box)    AUC-J    0.81    #3
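
KLD, SIM, and AUC-J are standard saliency-style metrics for comparing a predicted affordance heatmap against a ground-truth heatmap (KLD: lower is better; SIM and AUC-J: higher is better). The NumPy sketch below shows the usual definitions of KLD and SIM over normalized heatmaps; the benchmark's exact epsilon handling and its AUC-J protocol may differ.

```python
import numpy as np

def _normalize(x, eps=1e-12):
    """Normalize a heatmap so it sums to 1 (a spatial probability distribution)."""
    x = np.asarray(x, dtype=np.float64)
    return x / (x.sum() + eps)

def kld(pred, gt, eps=1e-12):
    """KL divergence of the prediction from the ground truth (lower is better)."""
    p, q = _normalize(gt), _normalize(pred)
    return float(np.sum(p * np.log(eps + p / (q + eps))))

def sim(pred, gt):
    """Histogram intersection / similarity (higher is better, max 1.0)."""
    return float(np.minimum(_normalize(pred), _normalize(gt)).sum())
```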
