Towards Good Practices for Deep 3D Hand Pose Estimation

Cairong Zhang; Guijin Wang; Hengkai Guo; Xinghao Chen

arxiv: 1707.07248 · v1 · pith:HL7EM2BK · submitted 2017-07-23 · 💻 cs.CV

Hengkai Guo, Guijin Wang, Xinghao Chen, Cairong Zhang

classification 💻 cs.CV

keywords hand, pose, estimation, convnet, performance, datasets, deep, good

3D hand pose estimation from a single depth image is an important and challenging problem for human-computer interaction. Recently, deep convolutional networks (ConvNets) with sophisticated designs have been employed to address it, but their improvement over traditional random-forest-based methods is not significant. To exploit good practices and improve the performance of hand pose estimation, we propose a tree-structured Region Ensemble Network (REN) for direct 3D coordinate regression. It first partitions the last convolutional outputs of the ConvNet into several grid regions. The results from separate fully-connected (FC) regressors on each region are then integrated by another FC layer to perform the estimation. By exploiting several training strategies, including data augmentation and the smooth $L_1$ loss, the proposed REN can significantly improve the performance of the ConvNet in localizing hand joints. The experimental results demonstrate that our approach achieves the best performance among state-of-the-art algorithms on three public hand pose datasets. We also evaluate our method on fingertip detection and human pose datasets and obtain state-of-the-art accuracy.
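The region-ensemble idea and the smooth $L_1$ loss described in the abstract can be sketched in NumPy as follows. This is a minimal, forward-only illustration under stated assumptions, not the authors' implementation: the 2x2 non-overlapping grid, the hidden width, the random weights, the joint count of 14, and the names `split_into_grid`, `RegionEnsembleSketch`, and `smooth_l1` are all hypothetical, and the loss uses the common threshold of 1.

```python
import numpy as np


def split_into_grid(features, rows=2, cols=2):
    """Split a (C, H, W) feature map into rows * cols spatial regions.

    Assumption: regions are non-overlapping; np.array_split tolerates
    H or W values that do not divide evenly.
    """
    regions = []
    for band in np.array_split(features, rows, axis=1):      # split along H
        regions.extend(np.array_split(band, cols, axis=2))   # split along W
    return regions


def smooth_l1(pred, target):
    """Summed smooth L1: 0.5 * d^2 where |d| < 1, else |d| - 0.5."""
    diff = np.abs(pred - target)
    return float(np.where(diff < 1.0, 0.5 * diff ** 2, diff - 0.5).sum())


class RegionEnsembleSketch:
    """Hypothetical forward pass: per-region FC branches fused by one FC layer."""

    def __init__(self, feat_shape, num_joints, hidden=64, rows=2, cols=2, seed=0):
        rng = np.random.default_rng(seed)
        self.rows, self.cols, self.num_joints = rows, cols, num_joints
        # Regions may differ in size, so each branch is sized separately.
        sizes = [r.size for r in split_into_grid(np.zeros(feat_shape), rows, cols)]
        self.branches = [
            (rng.normal(0.0, 0.01, (hidden, s)), np.zeros(hidden)) for s in sizes
        ]
        # Fusion FC maps the concatenated branch outputs to (x, y, z) per joint.
        self.fuse_w = rng.normal(0.0, 0.01, (3 * num_joints, hidden * len(sizes)))
        self.fuse_b = np.zeros(3 * num_joints)

    def forward(self, features):
        regions = split_into_grid(features, self.rows, self.cols)
        branch_out = [
            np.maximum(w @ region.ravel() + b, 0.0)  # per-region FC + ReLU
            for region, (w, b) in zip(regions, self.branches)
        ]
        fused = np.concatenate(branch_out)
        return (self.fuse_w @ fused + self.fuse_b).reshape(self.num_joints, 3)


# Hypothetical usage: random features stand in for the ConvNet's last conv output.
features = np.random.default_rng(1).normal(size=(8, 12, 12))
model = RegionEnsembleSketch(features.shape, num_joints=14)
pred = model.forward(features)
loss = smooth_l1(pred, np.zeros_like(pred))
print(pred.shape)  # (14, 3)
```

Giving each region its own FC branch lets each regressor specialize in one part of the feature map before the fusion layer combines their outputs. The smooth $L_1$ loss is quadratic for small errors and linear for large ones, so badly mislocalized joints contribute bounded gradients during training.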

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

LiCamPose: Combining Multi-View LiDAR and RGB Cameras for Robust Single-timestamp 3D Human Pose Estimation
cs.CV 2023-12 unverdicted novelty 5.0

LiCamPose combines multi-view RGB and LiDAR inputs via volumetric fusion, pretrains on synthetic data, and applies unsupervised adaptation to achieve robust single-frame 3D human pose estimation on multiple datasets.