Optimizing Network Structure for 3D Human Pose Estimation

A human pose is naturally represented as a graph in which the joints are the nodes and the bones are the edges, so it is natural to apply a Graph Convolutional Network (GCN) to estimate 3D poses from 2D poses. In this work, we propose a generic formulation of which both the GCN and the Fully Connected Network (FCN) are special cases. From this formulation, we discover that the GCN has limited representation power when used for estimating 3D poses. We overcome this limitation by introducing a Locally Connected Network (LCN), which is naturally implemented by the generic formulation and notably improves representation capability over the GCN. In addition, since every joint is connected only to a few joints in its neighborhood, the LCN has strong generalization power. Experiments on public datasets show that it: (1) outperforms the state of the art; (2) is less data-hungry than alternative models; (3) generalizes well to unseen actions and datasets.
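
The abstract contrasts the LCN with a GCN (one weight matrix shared by all joints) and an FCN (every joint connected to every joint). The sketch below is an illustration of that idea, not the authors' code: it assumes PyTorch, a hypothetical 17-joint skeleton, and an illustrative class name LocallyConnectedLayer, where each target joint j keeps its own weight matrix for every joint k in its neighborhood N(j), so weights are neither shared (GCN) nor dense (FCN).

```python
# Minimal sketch of a locally connected layer for 2D-to-3D pose lifting.
# Assumptions (not from the paper): PyTorch, a 17-joint skeleton, and the
# class/parameter names used here.
import torch
import torch.nn as nn


class LocallyConnectedLayer(nn.Module):
    def __init__(self, num_joints, in_features, out_features, adjacency):
        super().__init__()
        # adjacency: (num_joints, num_joints) 0/1 mask; entry (j, k) is 1 if
        # joint k lies in the neighborhood N(j) of target joint j.
        self.register_buffer("mask", adjacency.float())
        # One weight matrix per (target joint j, source joint k) pair,
        # unlike a GCN, which shares a single matrix across all joints.
        self.weight = nn.Parameter(
            torch.randn(num_joints, num_joints, in_features, out_features) * 0.01
        )
        self.bias = nn.Parameter(torch.zeros(num_joints, out_features))

    def forward(self, x):
        # x: (batch, num_joints, in_features)
        # y_j = sum over k in N(j) of x_k @ W_{jk} + b_j
        w = self.weight * self.mask[:, :, None, None]   # zero out non-neighbors
        return torch.einsum("bkf,jkfo->bjo", x, w) + self.bias


# Toy usage: lift 2D joint coordinates into a 64-dim feature per joint.
num_joints = 17
adj = torch.eye(num_joints)          # placeholder adjacency; a real skeleton
adj[0, 1] = adj[1, 0] = 1.0          # would connect joints along the bones
layer = LocallyConnectedLayer(num_joints, in_features=2, out_features=64, adjacency=adj)
out = layer(torch.randn(8, num_joints, 2))   # shape (8, 17, 64)
```

Restricting the mask to each joint's skeletal neighborhood gives the sparsity that the abstract credits for generalization; replacing the per-pair weights with a single shared matrix would recover a GCN-style layer, while an all-ones mask with per-pair weights approaches an FCN.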


Results from the Paper


Task                      Dataset    Model         Metric Name                   Metric Value  Rank
3D Human Pose Estimation  Human3.6M  ONS LCN       Average MPJPE (mm)            52.7          # 197
                                                   Using 2D ground-truth joints  No            # 2
                                                   Multi-View or Monocular       Monocular     # 1
3D Human Pose Estimation  Human3.6M  ONS LCN (GT)  Average MPJPE (mm)            36.3          # 58
                                                   Using 2D ground-truth joints  Yes           # 2
                                                   Multi-View or Monocular       Monocular     # 1
