AttT2M: Text-Driven Human Motion Generation with Multi-Perspective Attention Mechanism
Generating 3D human motion from textual descriptions has been a research focus in recent years. It requires the generated motion to be diverse, natural, and consistent with the textual description. Due to the complex spatio-temporal nature of human motion and the difficulty of learning the cross-modal relationship between text and motion, text-driven motion generation remains a challenging problem. To address these issues, we propose **AttT2M**, a two-stage method with a multi-perspective attention mechanism: **body-part attention** and **global-local motion-text attention**. The former addresses the motion-embedding perspective: it introduces a body-part spatio-temporal encoder into the VQ-VAE to learn a more expressive discrete latent space. The latter addresses the cross-modal perspective and is used to learn the sentence-level and word-level motion-text cross-modal relationship. The text-driven motion is finally generated with a generative transformer. Extensive experiments conducted on HumanML3D and KIT-ML demonstrate that our method outperforms current state-of-the-art works in both qualitative and quantitative evaluation, and achieves fine-grained synthesis and action2motion. Our code is available at https://github.com/ZcyMonkey/AttT2M
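To make the two attention perspectives concrete, below is a minimal PyTorch sketch. It is an illustrative assumption, not the authors' implementation: the joint-to-part grouping, layer names (`BodyPartAttention`, `GlobalLocalTextAttention`), and dimensions are all hypothetical. It shows (1) per-frame self-attention over pooled body-part tokens, as a stand-in for the body-part spatio-temporal encoder that feeds the VQ-VAE, and (2) motion tokens cross-attending to a sentence-level (global) embedding and word-level (local) embeddings.

```python
import torch
import torch.nn as nn

# Hypothetical grouping of 22 SMPL-style joints into 5 body parts.
BODY_PARTS = {
    "torso":     [0, 3, 6, 9, 12, 15],
    "left_arm":  [13, 16, 18, 20],
    "right_arm": [14, 17, 19, 21],
    "left_leg":  [1, 4, 7, 10],
    "right_leg": [2, 5, 8, 11],
}

class BodyPartAttention(nn.Module):
    """Motion-embedding perspective (sketch): pool each body part into one
    token per frame, let parts attend to each other, then pool to a
    per-frame feature that a VQ-VAE encoder could consume."""
    def __init__(self, joint_dim=3, d_model=128, n_heads=4):
        super().__init__()
        self.proj = nn.ModuleDict({
            name: nn.Linear(len(idx) * joint_dim, d_model)
            for name, idx in BODY_PARTS.items()
        })
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)

    def forward(self, motion):  # motion: (B, T, J, 3)
        B, T = motion.shape[:2]
        tokens = torch.stack([
            self.proj[name](motion[:, :, idx].flatten(2))  # (B, T, d)
            for name, idx in BODY_PARTS.items()
        ], dim=2)                                          # (B, T, P, d)
        x = tokens.view(B * T, len(BODY_PARTS), -1)        # parts as tokens
        out, _ = self.attn(x, x, x)                        # part-to-part attention
        return out.view(B, T, len(BODY_PARTS), -1).mean(2) # (B, T, d)

class GlobalLocalTextAttention(nn.Module):
    """Cross-modal perspective (sketch): motion tokens attend both to the
    sentence-level (global) text feature and to word-level (local) ones."""
    def __init__(self, d_model=128, n_heads=4):
        super().__init__()
        self.global_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.local_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)

    def forward(self, motion_tokens, sent_emb, word_embs):
        # motion_tokens: (B, N, d); sent_emb: (B, 1, d); word_embs: (B, W, d)
        g, _ = self.global_attn(motion_tokens, sent_emb, sent_emb)
        l, _ = self.local_attn(motion_tokens, word_embs, word_embs)
        return motion_tokens + g + l
```

In the sketch, both modules output features of the same width so they could sit, respectively, in front of a VQ-VAE encoder and inside a generative transformer's layers; how the paper actually fuses the global and local branches (here a simple residual sum) is an assumption.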
| Task | Dataset | Model | Metric Name | Metric Value | Global Rank |
|---|---|---|---|---|---|
| Motion Synthesis | HumanML3D | AttT2M | FID | 0.112 | # 10 |
| Motion Synthesis | HumanML3D | AttT2M | Diversity | 9.700 | # 7 |
| Motion Synthesis | HumanML3D | AttT2M | Multimodality | 2.452 | # 5 |
| Motion Synthesis | HumanML3D | AttT2M | R Precision Top3 | 0.786 | # 9 |
| Motion Synthesis | KIT Motion-Language | AttT2M | FID | 0.870 | # 16 |
| Motion Synthesis | KIT Motion-Language | AttT2M | R Precision Top3 | 0.751 | # 7 |
| Motion Synthesis | KIT Motion-Language | AttT2M | Diversity | 10.96 | # 4 |
| Motion Synthesis | KIT Motion-Language | AttT2M | Multimodality | 2.281 | # 5 |