
arxiv: 1907.06968 · v1 · pith:YI7MBICUnew · submitted 2019-07-16 · 💻 cs.CV

A Unified Deep Framework for Joint 3D Pose Estimation and Action Recognition from a Single RGB Camera

Pith reviewed 2026-05-24 21:02 UTC · model grok-4.3

classification 💻 cs.CV
keywords 3D pose estimation · action recognition · deep learning · neural architecture search · RGB video · multitask learning · human activity recognition

The pith

A two-stage deep framework lifts 2D keypoints to 3D poses then applies an ENAS-searched network to recognize actions from RGB video.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper sets out a multitask pipeline that first detects 2D body keypoints in real time and maps them to 3D poses with a two-stream neural network. In the second stage it uses the ENAS algorithm to discover an architecture that converts sequences of these 3D poses into an image-based representation and classifies the performed action. Experiments on Human3.6M, MSR Action3D and SBU Kinect Interaction datasets are presented to show that the combined system works on both tasks while keeping training and inference costs modest. A reader would care because separate pose and action pipelines are common; unifying them in one low-budget flow could simplify deployment for video-based monitoring or interaction systems.

Core claim

The central claim is that joint 3D pose estimation and action recognition can be performed effectively from single RGB sequences by first lifting detected 2D keypoints to 3D poses via a two-stream network and then feeding the resulting 3D pose sequences into an ENAS-optimized spatio-temporal model that operates on an image-based intermediate representation.

What carries the argument

The argument rests on the ENAS algorithm: after the two-stream 2D-to-3D lifting stage, ENAS searches for an optimal network that models the spatio-temporal evolution of the estimated 3D poses through an image-based intermediate representation.
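
To make the image-based intermediate representation concrete, here is a minimal sketch of one way a 3D pose sequence could be turned into an image. This summary does not specify the paper's encoding, so everything below is an assumption: the function name `poses_to_image`, the per-sequence min-max normalization, and the layout (joints as rows, frames as columns, x/y/z as the three color channels).

```python
import numpy as np

def poses_to_image(poses: np.ndarray) -> np.ndarray:
    """Encode a pose sequence of shape (frames, joints, 3) as a uint8 image.

    Hypothetical encoding (not taken from the paper): rows are joints,
    columns are frames, and (x, y, z) become the three color channels.
    """
    if poses.ndim != 3 or poses.shape[2] != 3:
        raise ValueError("expected shape (frames, joints, 3)")
    lo = poses.min(axis=(0, 1), keepdims=True)  # per-axis minimum over the sequence
    hi = poses.max(axis=(0, 1), keepdims=True)  # per-axis maximum over the sequence
    span = np.where(hi > lo, hi - lo, 1.0)      # avoid division by zero on flat axes
    scaled = (poses - lo) / span                # each axis now lies in [0, 1]
    image = np.rint(scaled * 255.0).astype(np.uint8)
    return image.transpose(1, 0, 2)             # (joints, frames, 3)

rng = np.random.default_rng(0)
sequence = rng.normal(size=(40, 17, 3))  # 40 frames, 17 joints (illustrative sizes)
image = poses_to_image(sequence)
print(image.shape, image.dtype)
```

An array of this shape can be fed to an ordinary 2D convolutional classifier, which is the property the recognition stage relies on.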

If this is right

  • The method is effective on Human3.6M for 3D pose estimation and on MSR Action3D and SBU Kinect Interaction for action recognition.
  • Training and inference require only a low computational budget.
  • The two-stage design supports real-time 2D keypoint detection followed by 3D lifting and action modeling.
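
3D pose accuracy on Human3.6M is commonly reported as mean per-joint position error (MPJPE). This summary does not state which metric the paper uses, so treating MPJPE as the metric is an assumption, and the function name `mpjpe` is illustrative. A minimal sketch:

```python
import numpy as np

def mpjpe(pred: np.ndarray, target: np.ndarray) -> float:
    """Mean Euclidean distance between predicted and reference joints.

    Both arrays have shape (frames, joints, 3) and share the same units.
    """
    if pred.shape != target.shape:
        raise ValueError("pred and target must have the same shape")
    per_joint = np.linalg.norm(pred - target, axis=-1)  # shape (frames, joints)
    return float(per_joint.mean())

# Toy data: every joint is displaced by (3, 4, 0), so each error is 5.
target = np.zeros((2, 3, 3))
pred = target.copy()
pred[..., 0] = 3.0
pred[..., 1] = 4.0
print(mpjpe(pred, target))
```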

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The pipeline could be extended to process longer video streams or multiple people if the 3D lifting stage scales.
  • Replacing the separate stages with a single end-to-end differentiable network might further reduce error accumulation between pose estimation and action recognition.
  • The image-based intermediate representation of 3D poses might allow reuse of existing 2D image classifiers for the action task.

Load-bearing premise

Errors introduced when lifting 2D keypoints to 3D poses will not substantially reduce the accuracy of the downstream action recognition step.

What would settle it

Measure action recognition accuracy on the same sequences twice: once with the network's estimated 3D poses as input and once with ground-truth 3D poses. A large drop on the estimated poses would falsify the claim that lifting errors do not degrade recognition.
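
The comparison can be sketched as a small accuracy-gap computation. The helper names `accuracy` and `lifting_gap` are hypothetical, and the labels below are toy values chosen for illustration, not results from the paper.

```python
import numpy as np

def accuracy(labels, predictions) -> float:
    """Fraction of sequences whose predicted action matches the label."""
    labels = np.asarray(labels)
    predictions = np.asarray(predictions)
    if labels.shape != predictions.shape:
        raise ValueError("labels and predictions must have the same shape")
    return float((labels == predictions).mean())

def lifting_gap(labels, preds_from_gt, preds_from_estimated) -> float:
    """Accuracy with ground-truth 3D poses minus accuracy with estimated poses."""
    return accuracy(labels, preds_from_gt) - accuracy(labels, preds_from_estimated)

# Toy example: the same classifier run on two versions of the same 8 sequences.
labels    = [0, 1, 2, 1, 0, 2, 1, 0]
preds_gt  = [0, 1, 2, 1, 0, 2, 1, 1]  # 7 of 8 correct
preds_est = [0, 1, 2, 0, 0, 1, 1, 1]  # 5 of 8 correct
gap = lifting_gap(labels, preds_gt, preds_est)
print(f"accuracy gap: {gap:.3f}")
```

A gap near zero would support the load-bearing premise; a large positive gap would indicate that lifting errors propagate into the recognition stage.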

Figures

Figures reproduced from arXiv: 1907.06968 by Alain Crouzil, Houssam Salmane, Huy Hieu Pham, Louahdi Khoudour, Pablo Zegers, Sergio A Velastin.

Figure 1: Overview of the proposed method. In the estimation stage, we first run OpenPose
Figure 2: Diagram of the proposed two-stream network for training our 3D pose estimator.
Figure 3: Intermediate image-based representations for the recognition stage.
Figure 4: Illustration of the proposed approach for 3D pose-based action recognition.
Figure 5: Visualization of 3D output of the estimation stage with some samples on the test
Figure 6: Diagram of the top performing normal cell (a) and reduction cell (b) discovered by ENAS [13] on the AS1 subset [18]. They were then used to construct the final network architecture (c). We refer interested readers to [13] to better understand this procedure.

4.4 Computational efficiency evaluation. On a single GeForce GTX 1080Ti GPU with 11GB memory, the runtime of OpenPose [12] is less than 0.1s per f…
Original abstract

We present a deep learning-based multitask framework for joint 3D human pose estimation and action recognition from RGB video sequences. Our approach proceeds along two stages. In the first, we run a real-time 2D pose detector to determine the precise pixel location of important keypoints of the body. A two-stream neural network is then designed and trained to map detected 2D keypoints into 3D poses. In the second, we deploy the Efficient Neural Architecture Search (ENAS) algorithm to find an optimal network architecture that is used for modeling the spatio-temporal evolution of the estimated 3D poses via an image-based intermediate representation and performing action recognition. Experiments on Human3.6M, MSR Action3D and SBU Kinect Interaction datasets verify the effectiveness of the proposed method on the targeted tasks. Moreover, we show that our method requires a low computational budget for training and inference.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 0 minor

Summary. The paper introduces a multitask deep learning framework consisting of a 2D pose detector, a two-stream neural network for lifting 2D keypoints to 3D poses, and an ENAS-optimized spatio-temporal model for action recognition from the estimated 3D poses represented as images. It claims that experiments on the Human3.6M, MSR Action3D, and SBU Kinect Interaction datasets demonstrate the effectiveness of the approach for both 3D pose estimation and action recognition, while requiring low computational resources for training and inference.

Significance. If the central claims hold after addressing the gaps below, the work would contribute a staged but unified pipeline that combines real-time 2D detection, 2D-to-3D lifting, and neural-architecture-search-driven spatio-temporal modeling, potentially enabling efficient joint pose-and-action systems on public datasets.

major comments (1)
  1. [Experiments] The experimental section provides no ablation that measures action-recognition accuracy on ground-truth 3D poses versus the 3D poses produced by the two-stream lifting network. Because the pipeline is strictly staged (2D detector → lifting → ENAS model), this comparison is required to substantiate the claim that the method is effective for both tasks and that lifting errors do not substantially degrade downstream recognition on MSR Action3D and SBU Kinect Interaction.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive feedback. We address the single major comment below.

Point-by-point responses
  1. Referee: [Experiments] The experimental section provides no ablation that measures action-recognition accuracy on ground-truth 3D poses versus the 3D poses produced by the two-stream lifting network. Because the pipeline is strictly staged (2D detector → lifting → ENAS model), this comparison is required to substantiate the claim that the method is effective for both tasks and that lifting errors do not substantially degrade downstream recognition on MSR Action3D and SBU Kinect Interaction.

    Authors: We agree that the requested ablation would strengthen the experimental section by quantifying the effect of lifting errors on action recognition. MSR Action3D and SBU provide Kinect-derived 3D joint positions that can serve as ground-truth 3D input to the ENAS model. In the revised manuscript we will add this comparison (estimated 3D poses vs. Kinect 3D poses) for the action-recognition task on both datasets and report the resulting accuracy difference. revision: yes

Circularity Check

0 steps flagged

No significant circularity; pipeline relies on external datasets and prior algorithms

full rationale

The paper presents a staged pipeline (2D detector to two-stream 3D lifting network to ENAS-optimized spatio-temporal model) whose performance claims are evaluated directly on external public benchmarks (Human3.6M, MSR Action3D, SBU). No derivation, equation, or central claim reduces by construction to a quantity fitted or defined within the paper itself; ENAS and the 2D detector are cited from prior independent work, and no self-citation forms a load-bearing uniqueness argument. The method is therefore self-contained against external data.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

Only the abstract is available, preventing exhaustive enumeration. The approach rests on standard deep-learning assumptions about labeled video datasets and the transferability of 2D-to-3D lifting to action recognition.

axioms (2)
  • domain assumption A real-time 2D pose detector supplies sufficiently accurate keypoints as input to the lifting network.
    Invoked in the first stage description; accuracy of this detector is presupposed.
  • domain assumption ENAS can discover an architecture whose performance on the image-based 3D-pose representation is representative of the joint task.
    Central to the second stage; no independent verification of the search objective is supplied.



Reference graph

Works this paper leans on

75 extracted references · 75 canonical work pages · 6 internal anchors

  1. [1]

    D. Weinland, R. Ronfard, and E. Boyer. A Survey of Vision-based Methods for Action Representation, Segmentation and Recognition. Computer Vision and Image Understanding (CVIU), 115(2):224–241, 2011

  2. [2]

    D. Lowe. Distinctive Image Features from Scale-invariant Keypoints. International Journal of Computer Vision (IJCV), 60(2):91–110, 2004

  3. [3]

    I. Laptev, M. Marszalek, C. Schmid, and B. Rozenfeld. Learning Realistic Human Actions from Movies. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 1–8, 2008

  4. [4]

    P. Dollár, V. Rabaud, G. Cottrell, and S. Belongie. Behavior Recognition via Sparse Spatio-temporal Features. In IEEE International Workshop on Visual Surveillance and Performance Evaluation of Tracking and Surveillance (VS-PETS), pages 65–72, 2005

  5. [5]

    M. Ye and R. Yang. Real-time Simultaneous Pose and Shape Estimation for Articulated Objects using a Single Depth Camera. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 2345–2352, 2014

  6. [6]

    J. Wang, Z. Liu, Y. Wu, and J. Yuan. Mining Actionlet Ensemble for Action Recognition with Depth Cameras. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 1290–1297, 2012

  7. [7]

    L. Xia, C. Chen, and J. K. Aggarwal. View-Invariant Human Action Recognition using Histograms of 3D Joints. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 20–27, 2012. A PREPRINT: HIEU PHAM ET AL. 2019

  8. [8]

    R. Chaudhry, F. Ofli, G. Kurillo, R. Bajcsy, and R. Vidal. Bio-inspired Dynamic 3D Discriminative Skeletal Features for Human Action Recognition. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 471–478, 2013

  9. [9]

    R. Vemulapalli, F. Arrate, and R. Chellappa. Human Action Recognition by Representing 3D Skeletons as Points in a Lie Group. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 588–595, 2014

  10. [10]

    W. Ding, K. Liu, X. Fu, and F. Cheng. Profile HMMs for Skeleton-based Human Action Recognition. Signal Processing: Image Communication, 42:109–119, 2016

  11. [11]

    Z. Zhang. Microsoft Kinect Sensor and Its Effect. IEEE Multimedia, 19(2):4–10, 2012

  12. [12]

    Z. Cao, T. Simon, S. Wei, and Y. Sheikh. Realtime Multi-person 2D Pose Estimation using Part Affinity Fields. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 7291–7299, 2017

  13. [13]

    H. Pham, M. Guan, B. Zoph, Q. Le, and J. Dean. Efficient Neural Architecture Search via Parameters Sharing. In International Conference on Machine Learning (ICML), pages 4095–4104, 2018

  14. [14]

    G. Johansson. Visual Motion Perception. Scientific American, 232(6):76–89, 1975

  15. [15]

    J. Gu, X. Ding, S. Wang, and Y. Wu. Action and Gait Recognition from Recovered 3D Human Joints. IEEE Transactions on Systems, Man, and Cybernetics, 40(4):1021–1033, 2010

  16. [16]

    A. Newell, K. Yang, and J. Deng. Stacked Hourglass Networks for Human Pose Estimation. In European Conference on Computer Vision (ECCV), pages 483–499, 2016

  17. [17]

    C. Ionescu, D. Papava, V. Olaru, and C. Sminchisescu. Human3.6M: Large Scale Datasets and Predictive Methods for 3D Human Sensing in Natural Environments. IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 36(7):1325–1339, 2014

  18. [18]

    W. Li, Z. Zhang, and Z. Liu. Action Recognition Based on a Bag of 3D Points. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 9–14, 2010

  19. [19]

    K. Yun, J. Honorio, D. Chattopadhyay, T. L. Berg, and D. Samaras. Two-person Interaction Detection using Body-pose Features and Multiple Instance Learning. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 28–35, 2012

  20. [20]

    S. Nikolaos, B. Bogdan, I. Bogdan, and A. K. Ioannis. 3D Human Pose Estimation: A Review of the Literature and Analysis of Covariates. Computer Vision and Image Understanding (CVIU), 152:1–20, 2016

  21. [21]

    L. Presti and M. La Cascia. 3D Skeleton-based Human Action Classification: A Survey. Pattern Recognition, 53:130–147, 2016

  22. [22]

    C. Sminchisescu. 3D Human Motion Analysis in Monocular Video Techniques and Challenges. In IEEE International Conference on Video and Signal Based Surveillance (ICVSBS), pages 76–76, 2006

  23. [23]

    V. Ramakrishna, T. Kanade, and Y. Sheikh. Reconstructing 3D Human Pose from 2D Image Landmarks. In European Conference on Computer Vision (ECCV), pages 573–586, 2012

  24. [24]

    S. Li and A. B. Chan. 3D Human Pose Estimation from Monocular Images with Deep Convolutional Neural Network. In Asian Conference on Computer Vision (ACCV), pages 332–347, 2014

  25. [25]

    B. Tekin, A. Rozantsev, V. Lepetit, and P. Fua. Direct Prediction of 3D Body Poses from Motion Compensated Sequences. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 991–1000, 2016

  26. [26]

    G. Pavlakos, X. Zhou, K. G. Derpanis, and K. Daniilidis. Coarse-to-fine Volumetric Prediction for Single-image 3D Human Pose. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 7025–7034, 2017

  27. [27]

    D. Pavllo, C. Feichtenhofer, D. Grangier, and M. Auli. 3D Human Pose Estimation in Video with Temporal Convolutions and Semi-supervised Training. arXiv preprint arXiv:1811.11742, 2018

  28. [28]

    D. Mehta, S. Sridhar, O. Sotnychenko, H. Rhodin, M. Shafiei, H. P. Seidel, W. Xu, D. Casas, and C. Theobalt. VNect: Real-time 3D Human Pose Estimation with a Single RGB Camera. ACM Transactions on Graphics (TOG), 36(4):44, 2017

  29. [29]

    I. Katircioglu, B. Tekin, M. Salzmann, V. Lepetit, and P. Fua. Learning Latent Representations of 3D Human Pose with Deep Neural Networks. International Journal of Computer Vision (IJCV), 126(12):1326–1341, 2018

  30. [30]

    Y. Fisher and K. Vladlen. Multi-scale Context Aggregation by Dilated Convolutions. arXiv preprint arXiv:1511.07122, 2015

  31. [31]

    K. He, X. Zhang, S. Ren, and J. Sun. Deep Residual Learning for Image Recognition. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 770–778, 2016

  32. [32]

    H. Sepp and S. Jürgen. Long Short-Term Memory. Neural Computation, 9:1735–1780, 1997

  33. [33]

    J. Martinez, R. Hossain, J. Romero, and J. Little. A Simple Yet Effective Baseline for 3D Human Pose Estimation. In IEEE International Conference on Computer Vision (ICCV), pages 2640–2649, 2017

  34. [34]

    F. Lv and R. Nevatia. Recognition and Segmentation of 3D Human Action Using HMM and Multi-class AdaBoost. In European Conference on Computer Vision (ECCV), pages 359–372, 2006

  35. [35]

    L. Han, X. Wu, W. Liang, G. Hou, and Y. Jia. Discriminative Human Action Recognition in the Learned Hierarchical Manifold Space. Image and Vision Computing (IVC), 28, 2010

  36. [36]

    J. Liu, A. Shahroudy, D. Xu, and G. Wang. Spatio-temporal LSTM with Trust Gates for 3D Human Action Recognition. In European Conference on Computer Vision (ECCV), pages 816–833, 2016

  37. [37]

    Y. Du, W. Wang, and L. Wang. Hierarchical Recurrent Neural Network for Skeleton based Action Recognition. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 1110–1118, 2015

  38. [38]

    A. Shahroudy, J. Liu, T. T. Ng, and G. Wang. NTU RGB+D: A Large Scale Dataset for 3D Human Activity Analysis. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 1010–1019, 2016

  39. [39]

    T. N. Sainath, O. Vinyals, A. Senior, and H. Sak. Convolutional, Long Short-Term Memory, Fully Connected Deep Neural Networks. In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 4580–4584, 2015

  40. [40]

    G. Chéron, I. Laptev, and C. Schmid. P-CNN: Pose-based CNN Features for Action Recognition. In IEEE International Conference on Computer Vision (ICCV), 2015

  41. [41]

    B. Yao and L. Fei-Fei. Modeling Mutual Context of Object and Human Pose in Human-object Interaction Activities. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 17–24, 2010

  42. [42]

    B. X. Nie, C. Xiong, and S. Zhu. Joint Action Recognition and Pose Estimation from Video. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 1293–1301, 2015

  43. [43]

    D. C. Luvizon, D. Picard, and H. Tabia. 2D/3D Pose Estimation and Action Recognition using Multitask Deep Learning. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 5137–5146, 2018

  44. [44]

    P. J. Huber. Robust Estimation of a Location Parameter. In Breakthroughs in Statistics, pages 492–518. Springer, 1992

  45. [45]

    S. Christian, I. Sergey, and V. Vincent. Inception-v4, Inception-ResNet and the Impact of Residual Connections on Learning. In AAAI Conference on Artificial Intelligence (AAAI), 2016

  46. [46]

    H. Gao, L. Zhuang, M. Laurens van der, and Q. W. Kilian. Densely Connected Convolutional Networks. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 2261–2269, 2017

  47. [47]

    Z. Barret and V. L. Quoc. Neural Architecture Search with Reinforcement Learning. arXiv preprint arXiv:1611.01578, 2017

  48. [48]

    S. Ioffe and C. Szegedy. Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift. In International Conference on Machine Learning (ICML), 2015

  49. [49]

    G. E. Hinton, N. Srivastava, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov. Improving Neural Networks by Preventing Co-adaptation of Feature Detectors. arXiv preprint arXiv:1207.0580, 2012

  50. [50]

    G. Klambauer, T. Unterthiner, A. Mayr, and S. Hochreiter. Self-Normalizing Neural Networks. In Advances in Neural Information Processing Systems (NIPS), pages 971–980, 2017

  51. [51]

    H. Pham, L. Khoudour, A. Crouzil, P. Zegers, and S. A. Velastin. Exploiting Deep Residual Networks for Human Action Recognition from Skeletal Data. Computer Vision and Image Understanding (CVIU), 170:51–66, 2018

  52. [52]

    H. Pham, L. Khoudour, A. Crouzil, P. Zegers, and S. A. Velastin. Skeletal Movement to Color Map: A Novel Representation for 3D Action Recognition with Inception Residual Networks. IEEE International Conference on Image Processing (ICIP), pages 3483–3487, 2018

  53. [53]

    H. Pham, H. Salmane, L. Khoudour, A. Crouzil, P. Zegers, and S. A. Velastin. Spatio-Temporal Image Representation of 3D Skeletal Movements for View-Invariant Action Recognition with Deep Convolutional Neural Networks. Sensors, 19(8), 2019

  54. [54]

    K. He, X. Zhang, S. Ren, and J. Sun. Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification. In IEEE International Conference on Computer Vision (ICCV), pages 1026–1034, 2015

  55. [55]

    D. Kingma and J. Ba. Adam: A Method for Stochastic Optimization. arXiv preprint arXiv:1412.6980, 2014

  56. [56]

    N. Yurii. A Method for Solving a Convex Programming Problem with Convergence Rate O(1/K^2). Soviet Mathematics Doklady, pages 372–367, 1983

  57. [57]

    L. Ilya and H. Frank. SGDR: Stochastic Gradient Descent with Warm Restarts. arXiv preprint arXiv:1608.03983, 2016

  58. [58]

    Y. Du, Y. Wong, Y. Liu, F. Han, Y. Gui, Z. Wang, M. Kankanhalli, and W. Geng. Marker-less 3D Human Motion Capture with Monocular Image Sequence and Height-maps. In European Conference on Computer Vision (ECCV), pages 20–36, 2016

  59. [59]

    S. Park, J. Hwang, and N. Kwak. 3D Human Pose Estimation using Convolutional Neural Networks with 2D Pose Information. In European Conference on Computer Vision (ECCV), pages 156–169, 2016

  60. [60]

    X. Zhou, M. Zhu, S. Leonardos, K. G. Derpanis, and K. Daniilidis. Sparseness Meets Deepness: 3D Human Pose Estimation from Monocular Video. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 4966–4975, 2016

  61. [61]

    Z. Xingyi, S. Xiao, Z. Wei, L. Shuang, and W. Yichen. Deep Kinematic Pose Regression. In European Conference on Computer Vision (ECCV), pages 186–201, 2016

  62. [62]

    D. Mehta, H. Rhodin, D. Casas, P. Fua, O. Sotnychenko, W. Xu, and C. Theobalt. Monocular 3D Human Pose Estimation in the Wild using Improved CNN Supervision. In International Conference on 3D Vision (3DV), pages 506–516, 2017

  63. [63]

    L. Shuang, S. Xiao, and W. Yichen. Compositional Human Pose Regression. Computer Vision and Image Understanding, 176–177:1–8, 2018

  64. [64]

    C. Chen, K. Liu, and N. Kehtarnavaz. Real-time Human Action Recognition based on Depth Motion Maps. Journal of Real-time Image Processing, 12, 2016

  65. [65]

    P. Wang, C. Yuan, W. Hu, B. Li, and Y. Zhang. Graph Based Skeleton Motion Representation and Similarity Measurement for Action Recognition. In British Machine Vision Conference (BMVC), 2016

  66. [66]

    J. Weng, C. Weng, and J. Yuan. Spatio-Temporal Naive-Bayes Nearest-Neighbor (ST-NBNN) for Skeleton-Based Action Recognition. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017

  67. [67]

    H. Xu, E. Chen, C. Liang, L. Qi, and L. Guan. Spatio-temporal Pyramid Model based on Depth Maps for Action Recognition. In IEEE International Workshop on Multimedia Signal Processing (MMSP), pages 1–6, 2015

  68. [68]

    I. Lee, D. Kim, S. Kang, and S. Lee. Ensemble Deep Learning for Skeleton-based Action Recognition using Temporal Sliding LSTM Networks. In IEEE International Conference on Computer Vision (ICCV), pages 1012–1020, 2017

  69. [69]

    S. Song, C. Lan, J. Xing, W. Zeng, and J. Liu. An End-to-End Spatio-Temporal Attention Model for Human Action Recognition from Skeleton Data. In AAAI Conference on Artificial Intelligence (AAAI), 2017

  70. [70]

    J. Weng, C. Weng, J. Yuan, and Z. Liu. Discriminative Spatio-Temporal Pattern Discovery for 3D Action Recognition. IEEE Transactions on Circuits and Systems for Video Technology (TCCVT), 29(4):1077–1089, 2019

  71. [71]

    Q. Ke, M. Bennamoun, S. An, F. Sohel, and F. Boussaid. A New Representation of Skeleton Sequences for 3D Action Recognition. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 4570–4579, 2017

  72. [72]

    T. Yusuf and K. Piotr. CNN-based Action Recognition and Supervised Domain Adaptation on 3D Body Skeletons via Kernel Feature Maps. In British Machine Vision Conference (BMVC), page 158, 2018

  73. [73]

    H. Wang and L. Wang. Modeling Temporal Dynamics and Spatial Configurations of Actions Using Two-Stream Recurrent Neural Networks. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 3633–3642, 2017

  74. [74]

    J. Liu, G. Wang, L. Duan, K. Abdiyeva, and A. C. Kot. Skeleton-Based Human Action Recognition With Global Context-Aware Attention LSTM Networks. IEEE Transactions on Image Processing (TIP), 27(4):1586–1599, 2018

  75. [75]

    P. Zhang, C. Lan, J. Xing, W. Zeng, J. Xue, and N. Zheng. View Adaptive Neural Networks for High Performance Skeleton-based Human Action Recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), (1):1–1, 2019