3D human pose estimation in video with temporal convolutions and semi-supervised training

Dario Pavllo; Christoph Feichtenhofer; David Grangier; Michael Auli

arxiv: 1811.11742 · v2 · pith:WSHRRQP2new · submitted 2018-11-28 · 💻 cs.CV

classification 💻 cs.CV

keywords: video, keypoints, model, semi-supervised, back-projection, convolutions, data, error

In this work, we demonstrate that 3D poses in video can be effectively estimated with a fully convolutional model based on dilated temporal convolutions over 2D keypoints. We also introduce back-projection, a simple and effective semi-supervised training method that leverages unlabeled video data. We start with predicted 2D keypoints for unlabeled video, then estimate 3D poses and finally back-project to the input 2D keypoints. In the supervised setting, our fully-convolutional model outperforms the previous best result from the literature by 6 mm mean per-joint position error on Human3.6M, corresponding to an error reduction of 11%, and the model also shows significant improvements on HumanEva-I. Moreover, experiments with back-projection show that it comfortably outperforms previous state-of-the-art results in semi-supervised settings where labeled data is scarce. Code and models are available at https://github.com/facebookresearch/VideoPose3D
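The two quantities named in the abstract, mean per-joint position error (MPJPE) and the back-projection consistency between a predicted 3D pose and its input 2D keypoints, can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the plain pinhole camera with hypothetical `focal` and `center` parameters, and the use of a mean 2D distance as the reprojection loss are all assumptions made for illustration. The released VideoPose3D code should be consulted for the actual camera model and loss terms.

```python
import math

def mpjpe(pred, target):
    """Mean per-joint position error: the mean Euclidean distance
    between corresponding joints of two poses (any dimensionality)."""
    assert len(pred) == len(target), "poses must have the same joint count"
    return sum(math.dist(p, t) for p, t in zip(pred, target)) / len(pred)

def project(pose_3d, focal=1.0, center=(0.0, 0.0)):
    """Project camera-space 3D joints to 2D with a simple pinhole model.
    ASSUMPTION: no lens distortion; focal and center are hypothetical."""
    return [(focal * x / z + center[0], focal * y / z + center[1])
            for x, y, z in pose_3d]

def back_projection_loss(pred_3d, input_2d, focal=1.0, center=(0.0, 0.0)):
    """Reproject the predicted 3D pose and compare it with the 2D
    keypoints it was estimated from. ASSUMPTION: mean 2D joint distance."""
    return mpjpe(project(pred_3d, focal, center), input_2d)

# A 3D pose that is perfectly consistent with its 2D keypoints has zero loss.
pose = [(0.0, 0.0, 2.0), (1.0, 2.0, 4.0)]
keypoints = project(pose)
print(back_projection_loss(pose, keypoints))  # 0.0
```

Because the loss only needs 2D keypoints, which can be predicted for any video, a term of this kind can be computed on unlabeled footage, which is the property the semi-supervised setting relies on.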

Forward citations

Cited by 3 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Velocity and stroke rate reconstruction of canoe sprint team boats based on panned and zoomed video recordings
cs.CV · 2026-02 · conditional · novelty 6.0

A YOLOv8 and homography-based system reconstructs canoe boat velocity with MAPE 0.011 and stroke rate with MAPE 0.009 from video, matching GPS closely.

Dual-stream Spatio-Temporal GCN-Transformer Network for 3D Human Pose Estimation
cs.CV · 2026-04 · unverdicted · novelty 5.0

MixTGFormer reports state-of-the-art 3D pose estimation errors of 37.6 mm on Human3.6M and 15.7 mm on MPI-INF-3DHP by using parallel GCN-Transformer streams with SE layers for local-global feature fusion.

A Unified Deep Framework for Joint 3D Pose Estimation and Action Recognition from a Single RGB Camera
cs.CV · 2019-07 · unverdicted · novelty 3.0

A multitask framework lifts 2D keypoints to 3D poses via a two-stream network then applies ENAS to model spatio-temporal pose evolution for action recognition on Human3.6M, MSR Action3D and SBU datasets.