A Unified Deep Framework for Joint 3D Pose Estimation and Action Recognition from a Single RGB Camera
Pith reviewed 2026-05-24 21:02 UTC · model grok-4.3
The pith
A two-stage deep framework lifts 2D keypoints to 3D poses then applies an ENAS-searched network to recognize actions from RGB video.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The central claim is that joint 3D pose estimation and action recognition can be performed effectively from single RGB sequences by first lifting detected 2D keypoints to 3D poses via a two-stream network and then feeding the resulting 3D pose sequences into an ENAS-optimized spatio-temporal model that operates on an image-based intermediate representation.
What carries the argument
The argument rests on the ENAS algorithm, which searches for a network architecture that models the spatio-temporal evolution of the estimated 3D poses through an image-based intermediate representation. This search stage follows the two-stream 2D-to-3D lifting stage.
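To make the idea of an image-based intermediate representation concrete, here is a minimal sketch of one way a 3D pose sequence could be turned into an image. The encoding is an assumption for illustration only: the summary does not describe the paper's actual scheme, and the function name `poses_to_image` is hypothetical. Each joint becomes a row, each frame a column, and the x, y, z coordinates are min-max scaled into the R, G, B channels.

```python
from typing import List, Tuple

Joint = Tuple[float, float, float]   # (x, y, z) of one joint
Pose = List[Joint]                   # all joints in one frame
Pixel = Tuple[int, int, int]         # (R, G, B), each in 0..255


def poses_to_image(sequence: List[Pose]) -> List[List[Pixel]]:
    """Encode a pose sequence as an RGB image: rows are joints, columns are frames.

    Assumed encoding (not taken from the paper): each coordinate axis is
    min-max normalised over the whole sequence and scaled to 0..255, so
    x, y, z become the R, G, B channels.
    """
    if not sequence or not sequence[0]:
        raise ValueError("sequence must contain at least one non-empty pose")
    num_joints = len(sequence[0])
    if any(len(pose) != num_joints for pose in sequence):
        raise ValueError("every pose must have the same number of joints")

    # Per-axis minimum and range over all frames and joints.
    lows, spans = [], []
    for axis in range(3):
        values = [joint[axis] for pose in sequence for joint in pose]
        low, high = min(values), max(values)
        lows.append(low)
        spans.append(high - low)

    def to_channel(value: float, axis: int) -> int:
        if spans[axis] == 0:  # constant axis: map to 0 to avoid dividing by zero
            return 0
        return round(255 * (value - lows[axis]) / spans[axis])

    return [
        [tuple(to_channel(pose[j][axis], axis) for axis in range(3)) for pose in sequence]
        for j in range(num_joints)
    ]
```

An image built this way has a fixed height (the number of joints) and a width equal to the number of frames, so a convolutional network can treat motion over time as a spatial pattern.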
If this is right
- The method achieves effective performance on Human3.6M for 3D pose estimation and on MSR Action3D and SBU Kinect Interaction for action recognition.
- Training and inference require only a low computational budget.
- The two-stage design supports real-time 2D keypoint detection followed by 3D lifting and action modeling.
Where Pith is reading between the lines
- The pipeline could be extended to process longer video streams or multiple people if the 3D lifting stage scales.
- Replacing the separate stages with a single end-to-end differentiable network might further reduce error accumulation between pose estimation and action recognition.
- The image-based intermediate representation of 3D poses might allow reuse of existing 2D image classifiers for the action task.
Load-bearing premise
Errors introduced when lifting 2D keypoints to 3D poses will not substantially reduce the accuracy of the downstream action recognition step.
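The size of the lifting error behind this premise can be quantified before looking at action recognition at all. The sketch below computes a mean per-joint position error, a metric commonly used for 3D pose estimation. The summary does not say which metric the paper reports, so the choice of metric and the function name `mean_per_joint_error` are assumptions.

```python
import math
from typing import List, Tuple

Joint = Tuple[float, float, float]   # (x, y, z) of one joint
Pose = List[Joint]                   # all joints in one frame


def mean_per_joint_error(predicted: List[Pose], reference: List[Pose]) -> float:
    """Average Euclidean distance between matching joints over all frames."""
    if not predicted or len(predicted) != len(reference):
        raise ValueError("need two non-empty sequences of equal length")
    total, count = 0.0, 0
    for pred_pose, ref_pose in zip(predicted, reference):
        if len(pred_pose) != len(ref_pose):
            raise ValueError("poses must have the same number of joints")
        for pred_joint, ref_joint in zip(pred_pose, ref_pose):
            total += math.dist(pred_joint, ref_joint)  # Euclidean distance in 3D
            count += 1
    if count == 0:
        raise ValueError("poses must contain at least one joint")
    return total / count
```

The result is in the same unit as the input coordinates, so a reader can judge directly whether the error is small relative to the motions that distinguish one action from another.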
What would settle it
Measure action-recognition accuracy on the same action sequences twice: once with the 3D poses estimated by the lifting network and once with ground-truth 3D poses. A large drop on the estimated poses would falsify the premise that lifting errors do not substantially degrade recognition.
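The comparison described above can be sketched in a few lines. This is an illustrative harness, not the paper's evaluation code: `classify` stands in for any trained action classifier, and the names `accuracy` and `lifting_accuracy_gap` are hypothetical.

```python
from typing import Callable, Dict, Sequence, TypeVar

PoseSequence = TypeVar("PoseSequence")


def accuracy(
    classify: Callable[[PoseSequence], str],
    sequences: Sequence[PoseSequence],
    labels: Sequence[str],
) -> float:
    """Fraction of sequences whose predicted label matches the true label."""
    if not sequences or len(sequences) != len(labels):
        raise ValueError("need non-empty sequences and labels of equal length")
    correct = sum(classify(seq) == label for seq, label in zip(sequences, labels))
    return correct / len(labels)


def lifting_accuracy_gap(
    classify: Callable[[PoseSequence], str],
    ground_truth_sequences: Sequence[PoseSequence],
    estimated_sequences: Sequence[PoseSequence],
    labels: Sequence[str],
) -> Dict[str, float]:
    """Compare accuracy on ground-truth poses with accuracy on estimated poses.

    Both inputs must describe the same clips in the same order, so that any
    difference in accuracy comes from the pose source alone.
    """
    acc_gt = accuracy(classify, ground_truth_sequences, labels)
    acc_est = accuracy(classify, estimated_sequences, labels)
    return {"ground_truth": acc_gt, "estimated": acc_est, "gap": acc_gt - acc_est}
```

A gap near zero would support the premise; a large positive gap would indicate that lifting errors carry through to the recognition stage.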
Original abstract
We present a deep learning-based multitask framework for joint 3D human pose estimation and action recognition from RGB video sequences. Our approach proceeds along two stages. In the first, we run a real-time 2D pose detector to determine the precise pixel location of important keypoints of the body. A two-stream neural network is then designed and trained to map detected 2D keypoints into 3D poses. In the second, we deploy the Efficient Neural Architecture Search (ENAS) algorithm to find an optimal network architecture that is used for modeling the spatio-temporal evolution of the estimated 3D poses via an image-based intermediate representation and performing action recognition. Experiments on Human3.6M, MSR Action3D and SBU Kinect Interaction datasets verify the effectiveness of the proposed method on the targeted tasks. Moreover, we show that our method requires a low computational budget for training and inference.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces a multitask deep learning framework consisting of a 2D pose detector, a two-stream neural network for lifting 2D keypoints to 3D poses, and an ENAS-optimized spatio-temporal model for action recognition from the estimated 3D poses represented as images. It claims that experiments on the Human3.6M, MSR Action3D, and SBU Kinect Interaction datasets demonstrate the effectiveness of the approach for both 3D pose estimation and action recognition, while requiring low computational resources for training and inference.
Significance. If the central claims hold after addressing the gaps below, the work would contribute a staged but unified pipeline that combines real-time 2D detection, 2D-to-3D lifting, and neural-architecture-search-driven spatio-temporal modeling, potentially enabling efficient joint pose-and-action systems on public datasets.
major comments (1)
- [Experiments] The experimental section provides no ablation that measures action-recognition accuracy on ground-truth 3D poses versus the 3D poses produced by the two-stream lifting network. Because the pipeline is strictly staged (2D detector → lifting → ENAS model), this comparison is required to substantiate the claim that the method is effective for both tasks and that lifting errors do not substantially degrade downstream recognition on MSR Action3D and SBU Kinect Interaction.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback. We address the single major comment below.
Referee: [Experiments] The experimental section provides no ablation that measures action-recognition accuracy on ground-truth 3D poses versus the 3D poses produced by the two-stream lifting network. Because the pipeline is strictly staged (2D detector → lifting → ENAS model), this comparison is required to substantiate the claim that the method is effective for both tasks and that lifting errors do not substantially degrade downstream recognition on MSR Action3D and SBU Kinect Interaction.
Authors: We agree that the requested ablation would strengthen the experimental section by quantifying the effect of lifting errors on action recognition. MSR Action3D and SBU provide Kinect-derived 3D joint positions that can serve as ground-truth 3D input to the ENAS model. In the revised manuscript we will add this comparison (estimated 3D poses vs. Kinect 3D poses) for the action-recognition task on both datasets and report the resulting accuracy difference.
Revision: yes
Circularity Check
No significant circularity; pipeline relies on external datasets and prior algorithms
The paper presents a staged pipeline (2D detector to two-stream 3D lifting network to ENAS-optimized spatio-temporal model) whose performance claims are evaluated directly on external public benchmarks (Human3.6M, MSR Action3D, SBU). No derivation, equation, or central claim reduces by construction to a quantity fitted or defined within the paper itself; ENAS and the 2D detector are cited from prior independent work, and no self-citation forms a load-bearing uniqueness argument. The method is therefore self-contained against external data.
Axiom & Free-Parameter Ledger
axioms (2)
- Domain assumption: A real-time 2D pose detector supplies sufficiently accurate keypoints as input to the lifting network.
- Domain assumption: ENAS can discover an architecture whose performance on the image-based 3D-pose representation is representative of the joint task.