A Unified Deep Framework for Joint 3D Pose Estimation and Action Recognition from a Single RGB Camera

Alain Crouzil; Houssam Salmane; Huy Hieu Pham; Louahdi Khoudour; Pablo Zegers; Sergio A Velastin

arxiv: 1907.06968 · v1 · pith:YI7MBICUnew · submitted 2019-07-16 · 💻 cs.CV

Huy Hieu Pham, Houssam Salmane, Louahdi Khoudour, Alain Crouzil, Pablo Zegers, Sergio A Velastin

Pith reviewed 2026-05-24 21:02 UTC · model grok-4.3

classification 💻 cs.CV

keywords: 3D pose estimation · action recognition · deep learning · neural architecture search · RGB video · multitask learning · human activity recognition

The pith

A two-stage deep framework lifts 2D keypoints to 3D poses then applies an ENAS-searched network to recognize actions from RGB video.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper sets out a multitask pipeline that first detects 2D body keypoints in real time and maps them to 3D poses with a two-stream neural network. In the second stage it uses the ENAS algorithm to discover an architecture that converts sequences of these 3D poses into an image-based representation and classifies the performed action. Experiments on Human3.6M, MSR Action3D and SBU Kinect Interaction datasets are presented to show that the combined system works on both tasks while keeping training and inference costs modest. A reader would care because separate pose and action pipelines are common; unifying them in one low-budget flow could simplify deployment for video-based monitoring or interaction systems.

Core claim

The central claim is that joint 3D pose estimation and action recognition can be performed effectively from single RGB sequences by first lifting detected 2D keypoints to 3D poses via a two-stream network and then feeding the resulting 3D pose sequences into an ENAS-optimized spatio-temporal model that operates on an image-based intermediate representation.
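
The staged flow in this claim can be sketched as plain function composition. This is a minimal sketch, not the authors' code: `run_pipeline` and its three stage callables are hypothetical names, and the toy stages in the usage lines are stand-ins chosen only to make the data flow visible.

```python
from typing import Callable, List, Sequence, Tuple

Point2D = Tuple[float, float]
Point3D = Tuple[float, float, float]


def run_pipeline(
    frames: Sequence[object],
    detect_2d: Callable[[object], List[Point2D]],
    lift_to_3d: Callable[[List[Point2D]], List[Point3D]],
    classify: Callable[[List[List[Point3D]]], str],
) -> Tuple[List[List[Point3D]], str]:
    """Run the stages strictly in sequence: 2D detection, 3D lifting, classification.

    All three callables are hypothetical placeholders for trained models.
    """
    poses_3d = [lift_to_3d(detect_2d(frame)) for frame in frames]
    # The classifier only ever sees estimated poses, so lifting errors reach it.
    action = classify(poses_3d)
    return poses_3d, action


# Toy stand-ins (assumptions for illustration only).
def toy_detect(frame):
    return [(float(frame), 0.0)]  # one "keypoint" per frame


def toy_lift(keypoints):
    return [(x, y, 1.0) for x, y in keypoints]  # append a constant depth


def toy_classify(poses):
    moving = poses[-1][0][0] > poses[0][0][0]
    return "moving" if moving else "static"


poses, action = run_pipeline([0, 1, 2], toy_detect, toy_lift, toy_classify)
```

Because both outputs are returned together, a caller gets the per-frame 3D poses and the sequence-level action label from one call, which is the sense in which the staged design is still a single pipeline.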

What carries the argument

ENAS algorithm used to search for an optimal network that models the spatio-temporal evolution of estimated 3D poses through an image-based intermediate representation, following the two-stream 2D-to-3D lifting stage.
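
One common way to build an image-based representation from skeleton data is to store the (x, y, z) coordinates of each joint as the (R, G, B) values of a pixel, with joints along one axis and frames along the other. The sketch below assumes that encoding; this summary does not give the paper's exact mapping, and `poses_to_image` is a hypothetical name.

```python
from typing import List, Tuple

Point3D = Tuple[float, float, float]
Pixel = Tuple[int, ...]


def poses_to_image(sequence: List[List[Point3D]]) -> List[List[Pixel]]:
    """Encode a pose sequence (frames x joints x 3) as an RGB grid (joints x frames).

    Assumed encoding: x, y, z are min-max scaled over the whole sequence to
    0..255 and stored as the R, G, B channels of one pixel.
    """
    values = [c for frame in sequence for joint in frame for c in joint]
    lo, hi = min(values), max(values)
    span = (hi - lo) or 1.0  # avoid division by zero for a constant sequence

    def scale(c: float) -> int:
        return round(255 * (c - lo) / span)

    n_joints = len(sequence[0])
    # Row j holds joint j over time; column t holds frame t.
    return [
        [tuple(scale(c) for c in frame[j]) for frame in sequence]
        for j in range(n_joints)
    ]


# Two frames, two joints: the image has 2 rows (joints) and 2 columns (frames).
seq = [
    [(0.0, 0.0, 0.0), (1.0, 1.0, 1.0)],
    [(0.0, 0.2, 1.0), (1.0, 0.0, 0.0)],
]
image = poses_to_image(seq)
```

With an encoding of this kind, the temporal evolution of a joint becomes a horizontal stripe of colour, which is what lets a convolutional architecture (searched or hand-designed) treat action recognition as image classification.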

If this is right

The method achieves effective performance on Human3.6M for 3D pose estimation and on MSR Action3D and SBU Kinect Interaction for action recognition.
Training and inference require only a low computational budget.
The two-stage design supports real-time 2D keypoint detection followed by 3D lifting and action modeling.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The pipeline could be extended to process longer video streams or multiple people if the 3D lifting stage scales.
Replacing the separate stages with a single end-to-end differentiable network might further reduce error accumulation between pose estimation and action recognition.
The image-based intermediate representation of 3D poses might allow reuse of existing 2D image classifiers for the action task.

Load-bearing premise

Errors introduced when lifting 2D keypoints to 3D poses will not substantially reduce the accuracy of the downstream action recognition step.
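
Lifting error on benchmarks such as Human3.6M is commonly summarized as the mean per-joint position error (MPJPE), the average Euclidean distance between predicted and reference joints. The sketch below assumes that metric; this summary does not say which error measure the paper reports, and `mpjpe` is an illustrative helper rather than anything taken from the paper.

```python
import math
from typing import List, Tuple

Point3D = Tuple[float, float, float]


def mpjpe(predicted: List[Point3D], reference: List[Point3D]) -> float:
    """Mean Euclidean distance between matching joints of two poses."""
    if len(predicted) != len(reference) or not predicted:
        raise ValueError("poses must be non-empty and have the same number of joints")
    total = sum(math.dist(p, r) for p, r in zip(predicted, reference))
    return total / len(predicted)


reference_pose = [(0.0, 0.0, 0.0), (0.0, 1.0, 0.0)]
estimated_pose = [(3.0, 4.0, 0.0), (0.0, 1.0, 0.0)]  # first joint is off by 5 units
error = mpjpe(estimated_pose, reference_pose)
```

A per-pose number like this quantifies the first stage in isolation; whether a given error level matters for the second stage is exactly what the premise leaves untested.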

What would settle it

Measure action recognition accuracy on the same 3D pose sequences once with the network's estimated 3D poses and once with ground-truth 3D poses; a large drop on the estimated poses would falsify the claim that lifting errors do not degrade recognition.
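
The check described here reduces to evaluating one fixed classifier on two versions of the same test sequences. A minimal sketch, assuming a hypothetical `classify` callable and paired inputs; the toy classifier and numbers below are invented purely to show the bookkeeping and carry no experimental meaning.

```python
from typing import Callable, Dict, Sequence


def accuracy(
    classify: Callable[[object], str],
    inputs: Sequence[object],
    labels: Sequence[str],
) -> float:
    """Fraction of inputs whose predicted label matches the reference label."""
    correct = sum(1 for x, y in zip(inputs, labels) if classify(x) == y)
    return correct / len(labels)


def lifting_ablation(classify, ground_truth_inputs, estimated_inputs, labels) -> Dict[str, float]:
    """Compare the same classifier on ground-truth versus estimated 3D poses."""
    acc_gt = accuracy(classify, ground_truth_inputs, labels)
    acc_est = accuracy(classify, estimated_inputs, labels)
    return {"ground_truth": acc_gt, "estimated": acc_est, "drop": acc_gt - acc_est}


# Toy data (invented): each "sequence" is one number and the label depends on its sign.
def toy_classify(x: float) -> str:
    return "wave" if x > 0 else "kick"


labels = ["wave", "wave", "kick", "kick"]
gt_inputs = [1.0, 2.0, -1.0, -2.0]
est_inputs = [1.1, -0.1, -0.9, -2.2]  # one sample flipped by simulated lifting noise
report = lifting_ablation(toy_classify, gt_inputs, est_inputs, labels)
```

Holding the classifier and the test split fixed is the design choice that matters: any difference in the `drop` field can then be attributed to the input poses rather than to retraining or data selection.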

Figures

Figures reproduced from arXiv: 1907.06968 by Alain Crouzil, Houssam Salmane, Huy Hieu Pham, Louahdi Khoudour, Pablo Zegers, Sergio A Velastin.

**Figure 2.** Diagram of the proposed two-stream network for training our 3D pose estimator.

**Figure 3.** Immediate image-based representations for the recognition stage.

**Figure 4.** Illustration of the proposed approach for 3D pose-based action recognition.

**Figure 5.** Visualization of 3D output of the estimation stage with some samples on the test

**Figure 6.** Diagram of the top performing normal cell (a) and reduction cell (b) discovered by ENAS [13] on AS1 subset [18]. They were then used to construct the final network architecture (c). We recommend the interested readers to [13] to better understand this procedure.

4.4 Computational efficiency evaluation. On a single GeForce GTX 1080Ti GPU with 11GB memory, the runtime of OpenPose [12] is less than 0.1s per f…

Original abstract

We present a deep learning-based multitask framework for joint 3D human pose estimation and action recognition from RGB video sequences. Our approach proceeds along two stages. In the first, we run a real-time 2D pose detector to determine the precise pixel location of important keypoints of the body. A two-stream neural network is then designed and trained to map detected 2D keypoints into 3D poses. In the second, we deploy the Efficient Neural Architecture Search (ENAS) algorithm to find an optimal network architecture that is used for modeling the spatio-temporal evolution of the estimated 3D poses via an image-based intermediate representation and performing action recognition. Experiments on Human3.6M, MSR Action3D and SBU Kinect Interaction datasets verify the effectiveness of the proposed method on the targeted tasks. Moreover, we show that our method requires a low computational budget for training and inference.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

This paper chains a standard 2D detector, two-stream lifting network, and ENAS into a joint pipeline for 3D pose and action from RGB, but offers no ablation showing the lifting stage actually supports the action numbers.

The core fact is that nothing here is mathematically new. The method runs a published real-time 2D pose detector, feeds the output into a two-stream 2D-to-3D network, then hands the resulting 3D poses to an ENAS-searched spatio-temporal model. Each piece already existed; the paper simply wires them together for the combined task on Human3.6M, MSR Action3D, and SBU data. That is the entire contribution in terms of novelty.

What the work does reasonably is keep the whole thing lightweight and describe a complete end-to-end flow that could be dropped into an application needing both outputs from one camera. The low-compute claim is at least plausible given the choice of ENAS and the staged design.

The soft spot is exactly the one the stress-test note flags. Because the pipeline is strictly sequential, any error in the lifting stage can reach the action classifier. The abstract asserts that experiments verify effectiveness, yet the description supplies no numbers, no baselines, and no comparison of action accuracy on ground-truth 3D poses versus the estimated ones. Without that check it is impossible to tell whether the reported action results depend on accurate 3D geometry or would look similar with noisier input.

The citation pattern is fine; the authors correctly credit the prior detectors and the ENAS paper. The evaluation, however, stays at the level of “we ran it and it worked” rather than controlled measurement of the joint claim.

This paper is for groups that build practical video systems and want an example of how to glue pose lifting to action recognition while controlling compute. A reader already working on ENAS or two-stream lifting might skim the architecture choices for ideas, but the work does not reorganize any subfield.

I would send it to peer review. The joint-task framing is sensible and the components are reproducible, so referees can ask for the missing ablations and quantitative tables. The paper is coherent on its own terms and does not contain internal contradictions, so it meets the threshold for serious referee time even if the current evidence is thin.

Referee Report

1 major / 0 minor

Summary. The paper introduces a multitask deep learning framework consisting of a 2D pose detector, a two-stream neural network for lifting 2D keypoints to 3D poses, and an ENAS-optimized spatio-temporal model for action recognition from the estimated 3D poses represented as images. It claims that experiments on the Human3.6M, MSR Action3D, and SBU Kinect Interaction datasets demonstrate the effectiveness of the approach for both 3D pose estimation and action recognition, while requiring low computational resources for training and inference.

Significance. If the central claims hold after addressing the gaps below, the work would contribute a staged but unified pipeline that combines real-time 2D detection, 2D-to-3D lifting, and neural-architecture-search-driven spatio-temporal modeling, potentially enabling efficient joint pose-and-action systems on public datasets.

major comments (1)

[Experiments] The experimental section provides no ablation that measures action-recognition accuracy on ground-truth 3D poses versus the 3D poses produced by the two-stream lifting network. Because the pipeline is strictly staged (2D detector → lifting → ENAS model), this comparison is required to substantiate the claim that the method is effective for both tasks and that lifting errors do not substantially degrade downstream recognition on MSR Action3D and SBU Kinect Interaction.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive feedback. We address the single major comment below.

Referee: [Experiments] The experimental section provides no ablation that measures action-recognition accuracy on ground-truth 3D poses versus the 3D poses produced by the two-stream lifting network. Because the pipeline is strictly staged (2D detector → lifting → ENAS model), this comparison is required to substantiate the claim that the method is effective for both tasks and that lifting errors do not substantially degrade downstream recognition on MSR Action3D and SBU Kinect Interaction.

Authors: We agree that the requested ablation would strengthen the experimental section by quantifying the effect of lifting errors on action recognition. MSR Action3D and SBU provide Kinect-derived 3D joint positions that can serve as ground-truth 3D input to the ENAS model. In the revised manuscript we will add this comparison (estimated 3D poses vs. Kinect 3D poses) for the action-recognition task on both datasets and report the resulting accuracy difference.

Revision: yes

Circularity Check

0 steps flagged

No significant circularity; pipeline relies on external datasets and prior algorithms

The paper presents a staged pipeline (2D detector to two-stream 3D lifting network to ENAS-optimized spatio-temporal model) whose performance claims are evaluated directly on external public benchmarks (Human3.6M, MSR Action3D, SBU). No derivation, equation, or central claim reduces by construction to a quantity fitted or defined within the paper itself; ENAS and the 2D detector are cited from prior independent work, and no self-citation forms a load-bearing uniqueness argument. The method is therefore self-contained against external data.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

Only the abstract is available, preventing exhaustive enumeration. The approach rests on standard deep-learning assumptions about labeled video datasets and the transferability of 2D-to-3D lifting to action recognition.

axioms (2)

Domain assumption: A real-time 2D pose detector supplies sufficiently accurate keypoints as input to the lifting network.
Invoked in the first stage description; accuracy of this detector is presupposed.
Domain assumption: ENAS can discover an architecture whose performance on the image-based 3D-pose representation is representative of the joint task.
Central to the second stage; no independent verification of the search objective is supplied.

Reference graph

Works this paper leans on

75 extracted references · 75 canonical work pages · 6 internal anchors

[1] D. Weinland, R. Ronfard, and E. Boyer. A Survey of Vision-based Methods for Action Representation, Segmentation and Recognition. Computer Vision and Image Understanding (CVIU), 115(2):224–241, 2011.
[2] D. Lowe. Distinctive Image Features from Scale-invariant Keypoints. International Journal of Computer Vision (IJCV), 60(2):91–110, 2004.
[3] I. Laptev, M. Marszalek, C. Schmid, and B. Rozenfeld. Learning Realistic Human Actions from Movies. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 1–8, 2008.
[4] P. Dollár, V. Rabaud, G. Cottrell, and S. Belongie. Behavior Recognition via Sparse Spatio-temporal Features. In IEEE International Workshop on Visual Surveillance and Performance Evaluation of Tracking and Surveillance (VS-PETS), pages 65–72, 2005.
[5] M. Ye and R. Yang. Real-time Simultaneous Pose and Shape Estimation for Articulated Objects using a Single Depth Camera. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 2345–2352, 2014.
[6] J. Wang, Z. Liu, Y. Wu, and J. Yuan. Mining Actionlet Ensemble for Action Recognition with Depth Cameras. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 1290–1297, 2012.
[7] L. Xia, C. Chen, and J. K. Aggarwal. View-Invariant Human Action Recognition using Histograms of 3D Joints. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 20–27, 2012.
[8] R. Chaudhry, F. Ofli, G. Kurillo, R. Bajcsy, and R. Vidal. Bio-inspired Dynamic 3D Discriminative Skeletal Features for Human Action Recognition. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 471–478, 2013.
[9] R. Vemulapalli, F. Arrate, and R. Chellappa. Human Action Recognition by Representing 3D Skeletons as Points in a Lie Group. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 588–595, 2014.
[10] W. Ding, K. Liu, X. Fu, and F. Cheng. Profile HMMs for Skeleton-based Human Action Recognition. Signal Processing: Image Communication, 42:109–119, 2016.
[11] Z. Zhang. Microsoft Kinect Sensor and Its Effect. IEEE Multimedia, 19(2):4–10, 2012.
[12] Z. Cao, T. Simon, S. Wei, and Y. Sheikh. Realtime Multi-person 2D Pose Estimation using Part Affinity Fields. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 7291–7299, 2017.
[13] H. Pham, M. Guan, B. Zoph, Q. Le, and J. Dean. Efficient Neural Architecture Search via Parameters Sharing. In International Conference on Machine Learning (ICML), pages 4095–4104, 2018.
[14] G. Johansson. Visual Motion Perception. Scientific American, 232(6):76–89, 1975.
[15] J. Gu, X. Ding, S. Wang, and Y. Wu. Action and Gait Recognition from Recovered 3D Human Joints. IEEE Transactions on Systems, Man, and Cybernetics, 40(4):1021–1033, 2010.
[16] A. Newell, K. Yang, and J. Deng. Stacked Hourglass Networks for Human Pose Estimation. In European Conference on Computer Vision (ECCV), pages 483–499, 2016.
[17] C. Ionescu, D. Papava, V. Olaru, and C. Sminchisescu. Human3.6M: Large Scale Datasets and Predictive Methods for 3D Human Sensing in Natural Environments. IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 36(7):1325–1339, 2014.
[18] W. Li, Z. Zhang, and Z. Liu. Action Recognition Based on a Bag of 3D Points. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 9–14, 2010.
[19] K. Yun, J. Honorio, D. Chattopadhyay, T. L. Berg, and D. Samaras. Two-person Interaction Detection using Body-pose Features and Multiple Instance Learning. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 28–35, 2012.
[20] S. Nikolaos, B. Bogdan, I. Bogdan, and A. K. Ioannis. 3D Human Pose Estimation: A Review of the Literature and Analysis of Covariates. Computer Vision and Image Understanding (CVIU), 152:1–20, 2016.
[21] L. Presti and M. La Cascia. 3D Skeleton-based Human Action Classification: A Survey. Pattern Recognition, 53:130–147, 2016.
[22] C. Sminchisescu. 3D Human Motion Analysis in Monocular Video Techniques and Challenges. In IEEE International Conference on Video and Signal Based Surveillance (ICVSBS), pages 76–76, 2006.
[23] V. Ramakrishna, T. Kanade, and Y. Sheikh. Reconstructing 3D Human Pose from 2D Image Landmarks. In European Conference on Computer Vision (ECCV), pages 573–586, 2012.
[24] S. Li and A. B. Chan. 3D Human Pose Estimation from Monocular Images with Deep Convolutional Neural Network. In Asian Conference on Computer Vision (ACCV), pages 332–347, 2014.
[25] B. Tekin, A. Rozantsev, V. Lepetit, and P. Fua. Direct Prediction of 3D Body Poses from Motion Compensated Sequences. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 991–1000, 2016.
[26] G. Pavlakos, X. Zhou, K. G. Derpanis, and K. Daniilidis. Coarse-to-fine Volumetric Prediction for Single-image 3D Human Pose. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 7025–7034, 2017.
[27] D. Pavllo, C. Feichtenhofer, D. Grangier, and M. Auli. 3D Human Pose Estimation in Video with Temporal Convolutions and Semi-supervised Training. arXiv preprint arXiv:1811.11742, 2018.
[28] D. Mehta, S. Sridhar, O. Sotnychenko, H. Rhodin, M. Shafiei, H. P. Seidel, W. Xu, D. Casas, and C. Theobalt. VNect: Real-time 3D Human Pose Estimation with a Single RGB Camera. ACM Transactions on Graphics (TOG), 36(4):44, 2017.
[29] I. Katircioglu, B. Tekin, M. Salzmann, V. Lepetit, and P. Fua. Learning Latent Representations of 3D Human Pose with Deep Neural Networks. International Journal of Computer Vision (IJCV), 126(12):1326–1341, 2018.
[30] Y. Fisher and K. Vladlen. Multi-scale Context Aggregation by Dilated Convolutions. arXiv preprint arXiv:1511.07122, 2015.
[31] K. He, X. Zhang, S. Ren, and J. Sun. Deep Residual Learning for Image Recognition. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 770–778, 2016.
[32] H. Sepp and S. Jürgen. Long Short-Term Memory. Neural Computation, 9:1735–1780, 1997.
[33] J. Martinez, R. Hossain, J. Romero, and J. Little. A Simple Yet Effective Baseline for 3D Human Pose Estimation. In IEEE International Conference on Computer Vision (ICCV), pages 2640–2649, 2017.
[34] F. Lv and R. Nevatia. Recognition and Segmentation of 3D Human Action Using HMM and Multi-class AdaBoost. In European Conference on Computer Vision (ECCV), pages 359–372, 2006.
[35] L. Han, X. Wu, W. Liang, G. Hou, and Y. Jia. Discriminative Human Action Recognition in the Learned Hierarchical Manifold Space. Image and Vision Computing (IVC), 28, 2010.
[36] J. Liu, A. Shahroudy, D. Xu, and G. Wang. Spatio-temporal LSTM with Trust Gates for 3D Human Action Recognition. In European Conference on Computer Vision (ECCV), pages 816–833, 2016.
[37] Y. Du, W. Wang, and L. Wang. Hierarchical Recurrent Neural Network for Skeleton based Action Recognition. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 1110–1118, 2015.
[38] A. Shahroudy, J. Liu, T. T. Ng, and G. Wang. NTU RGB+D: A Large Scale Dataset for 3D Human Activity Analysis. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 1010–1019, 2016.
[39] T. N. Sainath, O. Vinyals, A. Senior, and H. Sak. Convolutional, Long Short-Term Memory, Fully Connected Deep Neural Networks. In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 4580–4584, 2015.
[40] G. Chéron, I. Laptev, and C. Schmid. P-CNN: Pose-based CNN Features for Action Recognition. In IEEE International Conference on Computer Vision (ICCV), 2015.
[41] B. Yao and L. Fei-Fei. Modeling Mutual Context of Object and Human Pose in Human-object Interaction Activities. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 17–24, 2010.
[42] B. X. Nie, C. Xiong, and S. Zhu. Joint Action Recognition and Pose Estimation from Video. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 1293–1301, 2015.
[43] D. C. Luvizon, D. Picard, and H. Tabia. 2D/3D Pose Estimation and Action Recognition using Multitask Deep Learning. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 5137–5146, 2018.
[44] P. J. Huber. Robust Estimation of a Location Parameter. In Breakthroughs in Statistics, pages 492–518. Springer, 1992.
[45] S. Christian, I. Sergey, and V. Vincent. Inception-v4, Inception-ResNet and the Impact of Residual Connections on Learning. In AAAI Conference on Artificial Intelligence (AAAI), 2016.
[46] H. Gao, L. Zhuang, M. Laurens van der, and Q. W. Kilian. Densely Connected Convolutional Networks. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 2261–2269, 2017.
[47] Z. Barret and V. L. Quoc. Neural Architecture Search with Reinforcement Learning. arXiv preprint arXiv:1611.01578, 2017.
[48] S. Ioffe and C. Szegedy. Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift. In International Conference on Machine Learning (ICML), 2015.
[49] G. E. Hinton, N. Srivastava, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov. Improving Neural Networks by Preventing Co-adaptation of Feature Detectors. arXiv preprint arXiv:1207.0580, 2012.
[50] G. Klambauer, T. Unterthiner, A. Mayr, and S. Hochreiter. Self-Normalizing Neural Networks. In Advances in Neural Information Processing Systems (NIPS), pages 971–980, 2017.
[51] H. Pham, L. Khoudour, A. Crouzil, P. Zegers, and S. A. Velastin. Exploiting Deep Residual Networks for Human Action Recognition from Skeletal Data. Computer Vision and Image Understanding (CVIU), 170:51–66, 2018.
[52] H. Pham, L. Khoudour, A. Crouzil, P. Zegers, and S. A. Velastin. Skeletal Movement to Color Map: A Novel Representation for 3D Action Recognition with Inception Residual Networks. IEEE International Conference on Image Processing (ICIP), pages 3483–3487, 2018.
[53] H. Pham, H. Salmane, L. Khoudour, A. Crouzil, P. Zegers, and S. A. Velastin. Spatio-Temporal Image Representation of 3D Skeletal Movements for View-Invariant Action Recognition with Deep Convolutional Neural Networks. Sensors, 19(8), 2019.
[54] K. He, X. Zhang, S. Ren, and J. Sun. Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification. In IEEE International Conference on Computer Vision (ICCV), pages 1026–1034, 2015.
[55] D. Kingma and J. Ba. Adam: A Method for Stochastic Optimization. arXiv preprint arXiv:1412.6980, 2014.
[56] N. Yurii. A Method for Solving a Convex Programming Problem with Convergence Rate O(1/K^2). Soviet Mathematics Doklady, pages 372–376, 1983.
[57] L. Ilya and H. Frank. SGDR: Stochastic Gradient Descent with Warm Restarts. arXiv preprint arXiv:1608.03983, 2016.
[58] Y. Du, Y. Wong, Y. Liu, F. Han, Y. Gui, Z. Wang, M. Kankanhalli, and W. Geng. Marker-less 3D Human Motion Capture with Monocular Image Sequence and Height-maps. In European Conference on Computer Vision (ECCV), pages 20–36, 2016.
[59] S. Park, J. Hwang, and N. Kwak. 3D Human Pose Estimation using Convolutional Neural Networks with 2D Pose Information. In European Conference on Computer Vision (ECCV), pages 156–169, 2016.
[60] X. Zhou, M. Zhu, S. Leonardos, K. G. Derpanis, and K. Daniilidis. Sparseness Meets Deepness: 3D Human Pose Estimation from Monocular Video. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 4966–4975, 2016.
[61] Z. Xingyi, S. Xiao, Z. Wei, L. Shuang, and W. Yichen. Deep Kinematic Pose Regression. In European Conference on Computer Vision (ECCV), pages 186–201, 2016.
[62] D. Mehta, H. Rhodin, D. Casas, P. Fua, O. Sotnychenko, W. Xu, and C. Theobalt. Monocular 3D Human Pose Estimation in the Wild using Improved CNN Supervision. In International Conference on 3D Vision (3DV), pages 506–516, 2017.
[63] L. Shuang, S. Xiao, and W. Yichen. Compositional Human Pose Regression. Computer Vision and Image Understanding, 176–177:1–8, 2018.
[64] C. Chen, K. Liu, and N. Kehtarnavaz. Real-time Human Action Recognition based on Depth Motion Maps. Journal of Real-time Image Processing, 12, 2016.
[65] P. Wang, C. Yuan, W. Hu, B. Li, and Y. Zhang. Graph Based Skeleton Motion Representation and Similarity Measurement for Action Recognition. In British Machine Vision Conference (BMVC), 2016.
[66] J. Weng, C. Weng, and J. Yuan. Spatio-Temporal Naive-Bayes Nearest-Neighbor (ST-NBNN) for Skeleton-Based Action Recognition. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017.
[67] H. Xu, E. Chen, C. Liang, L. Qi, and L. Guan. Spatio-temporal Pyramid Model based on Depth Maps for Action Recognition. In IEEE International Workshop on Multimedia Signal Processing (MMSP), pages 1–6, 2015.
[68] I. Lee, D. Kim, S. Kang, and S. Lee. Ensemble Deep Learning for Skeleton-based Action Recognition using Temporal Sliding LSTM Networks. In IEEE International Conference on Computer Vision (ICCV), pages 1012–1020, 2017.
[69] S. Song, C. Lan, J. Xing, W. Zeng, and J. Liu. An End-to-End Spatio-Temporal Attention Model for Human Action Recognition from Skeleton Data. In AAAI Conference on Artificial Intelligence (AAAI), 2017.
[70] J. Weng, C. Weng, J. Yuan, and Z. Liu. Discriminative Spatio-Temporal Pattern Discovery for 3D Action Recognition. IEEE Transactions on Circuits and Systems for Video Technology (TCCVT), 29(4):1077–1089, 2019.
[71] Q. Ke, M. Bennamoun, S. An, F. Sohel, and F. Boussaid. A New Representation of Skeleton Sequences for 3D Action Recognition. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 4570–4579, 2017.
[72] T. Yusuf and K. Piotr. CNN-based Action Recognition and Supervised Domain Adaptation on 3D Body Skeletons via Kernel Feature Maps. In British Machine Vision Conference (BMVC), page 158, 2018.
[73] H. Wang and L. Wang. Modeling Temporal Dynamics and Spatial Configurations of Actions Using Two-Stream Recurrent Neural Networks. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 3633–3642, 2017.
[74] J. Liu, G. Wang, L. Duan, K. Abdiyeva, and A. C. Kot. Skeleton-Based Human Action Recognition With Global Context-Aware Attention LSTM Networks. IEEE Transactions on Image Processing (TIP), 27(4):1586–1599, 2018.
[75] P. Zhang, C. Lan, J. Xing, W. Zeng, J. Xue, and N. Zheng. View Adaptive Neural Networks for High Performance Skeleton-based Human Action Recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), (1):1–1, 2019.

[1] [1]

Weinland, R

D. Weinland, R. Ronfard, and E. Boyer. A Survey of Vision-based Methods for Action Representation, Segmentation and Recognition. Computer Vision and Image Under- standing (CVIU), 115(2):224–241, 2011

work page 2011

[2] [2]

D. Lowe. Distinctive Image Features from Scale-invariant Keypoints. International Journal of Computer Vision (IJCV), 60(2):91–110, 2004

work page 2004

[3] [3]

Laptev, M

I. Laptev, M. Marszalek, C. Schmid, and B. Rozenfeld. Learning Realistic Human Ac- tions from Movies. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 1–8, 2008

work page 2008

[4] [4]

Dollár, V

P. Dollár, V . Rabaud, G. Cottrell, and S. Belongie. Behavior Recognition via Sparse Spatio-temporal Features. In IEEE International Workshop on Visual Surveillance and Performance Evaluation of Tracking and Surveillance (VS-PETS), pages 65–72, 2005

work page 2005

[5] [5]

Ye and R

M. Ye and R. Yang. Real-time Simultaneous Pose and Shape Estimation for Articulated Objects using a Single Depth Camera. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 2345–2352, 2014

work page 2014

[6] [6]

J. Wang, Z. Liu, Y . Wu, and J. Yuan. Mining Actionlet Ensemble for Action Recog- nition with Depth Cameras. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 1290–1297, 2012

work page 2012

[7] [7]

L. Xia, C. Chen, and J. K. Aggarwal. View-Invariant Human Action Recognition us- ing Histograms of 3D Joints. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 20–27, 2012. A PREPRINT: HIEU PHAM ET AL. 2019 11

work page 2012

[8] [8]

Chaudhry, F

R. Chaudhry, F. Oﬂi, G. Kurillo, R. Bajcsy, and R. Vidal. Bio-inspired Dynamic 3D Discriminative Skeletal Features for Human Action Recognition. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 471–478, 2013

work page 2013

[9] [9]

Vemulapalli, F

R. Vemulapalli, F. Arrate, and R. Chellappa. Human Action Recognition by Represent- ing 3D Skeletons as Points in a Lie Group. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 588–595, 2014

work page 2014

[10] [10]

W. Ding, K. Liu, X. Fu, and F. Cheng. Proﬁle HMMs for Skeleton-based Human Action Recognition. Signal Processing: Image Communication, 42:109–119, 2016

[11] Z. Zhang. Microsoft Kinect Sensor and Its Effect. IEEE Multimedia, 19(2):4–10, 2012

[12] Z. Cao, T. Simon, S. Wei, and Y. Sheikh. Realtime Multi-person 2D Pose Estimation using Part Affinity Fields. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 7291–7299, 2017

[13] H. Pham, M. Guan, B. Zoph, Q. Le, and J. Dean. Efficient Neural Architecture Search via Parameters Sharing. In International Conference on Machine Learning (ICML), pages 4095–4104, 2018

[14] G. Johansson. Visual Motion Perception. Scientific American, 232(6):76–89, 1975

[15] J. Gu, X. Ding, S. Wang, and Y. Wu. Action and Gait Recognition from Recovered 3D Human Joints. IEEE Transactions on Systems, Man, and Cybernetics, 40(4):1021–1033, 2010

[16] A. Newell, K. Yang, and J. Deng. Stacked Hourglass Networks for Human Pose Estimation. In European Conference on Computer Vision (ECCV), pages 483–499, 2016

[17] C. Ionescu, D. Papava, V. Olaru, and C. Sminchisescu. Human3.6M: Large Scale Datasets and Predictive Methods for 3D Human Sensing in Natural Environments. IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 36(7):1325–1339, 2014

[18] W. Li, Z. Zhang, and Z. Liu. Action Recognition Based on a Bag of 3D Points. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 9–14, 2010

[19] K. Yun, J. Honorio, D. Chattopadhyay, T. L. Berg, and D. Samaras. Two-person Interaction Detection using Body-pose Features and Multiple Instance Learning. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 28–35, 2012

[20] S. Nikolaos, B. Bogdan, I. Bogdan, and A. K. Ioannis. 3D Human Pose Estimation: A Review of the Literature and Analysis of Covariates. Computer Vision and Image Understanding (CVIU), 152:1–20, 2016

[21] L. Presti and M. La Cascia. 3D Skeleton-based Human Action Classification: A Survey. Pattern Recognition, 53:130–147, 2016

[22] C. Sminchisescu. 3D Human Motion Analysis in Monocular Video Techniques and Challenges. In IEEE International Conference on Video and Signal Based Surveillance (ICVSBS), pages 76–76, 2006.

[23] V. Ramakrishna, T. Kanade, and Y. Sheikh. Reconstructing 3D Human Pose from 2D Image Landmarks. In European Conference on Computer Vision (ECCV), pages 573–586, 2012

[24] S. Li and A. B. Chan. 3D Human Pose Estimation from Monocular Images with Deep Convolutional Neural Network. In Asian Conference on Computer Vision (ACCV), pages 332–347, 2014

[25] B. Tekin, A. Rozantsev, V. Lepetit, and P. Fua. Direct Prediction of 3D Body Poses from Motion Compensated Sequences. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 991–1000, 2016

[26] G. Pavlakos, X. Zhou, K. G. Derpanis, and K. Daniilidis. Coarse-to-fine Volumetric Prediction for Single-image 3D Human Pose. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 7025–7034, 2017

[27] D. Pavllo, C. Feichtenhofer, D. Grangier, and M. Auli. 3D Human Pose Estimation in Video with Temporal Convolutions and Semi-supervised Training. arXiv preprint arXiv:1811.11742, 2018

[28] D. Mehta, S. Sridhar, O. Sotnychenko, H. Rhodin, M. Shafiei, H. P. Seidel, W. Xu, D. Casas, and C. Theobalt. VNect: Real-time 3D Human Pose Estimation with a Single RGB Camera. ACM Transactions on Graphics (TOG), 36(4):44, 2017

[29] I. Katircioglu, B. Tekin, M. Salzmann, V. Lepetit, and P. Fua. Learning Latent Representations of 3D Human Pose with Deep Neural Networks. International Journal of Computer Vision (IJCV), 126(12):1326–1341, 2018

[30] Y. Fisher and K. Vladlen. Multi-scale Context Aggregation by Dilated Convolutions. arXiv preprint arXiv:1511.07122, 2015

[31] K. He, X. Zhang, S. Ren, and J. Sun. Deep Residual Learning for Image Recognition. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 770–778, 2016

[32] H. Sepp and S. Jürgen. Long Short-Term Memory. Neural Computation, 9:1735–1780, 1997

[33] J. Martinez, R. Hossain, J. Romero, and J. Little. A Simple Yet Effective Baseline for 3D Human Pose Estimation. In IEEE International Conference on Computer Vision (ICCV), pages 2640–2649, 2017

[34] F. Lv and R. Nevatia. Recognition and Segmentation of 3D Human Action Using HMM and Multi-class AdaBoost. In European Conference on Computer Vision (ECCV), pages 359–372, 2006

[35] L. Han, X. Wu, W. Liang, G. Hou, and Y. Jia. Discriminative Human Action Recognition in the Learned Hierarchical Manifold Space. Image and Vision Computing (IVC), 28, 2010

[36] J. Liu, A. Shahroudy, D. Xu, and G. Wang. Spatio-temporal LSTM with Trust Gates for 3D Human Action Recognition. In European Conference on Computer Vision (ECCV), pages 816–833, 2016.

[37] Y. Du, W. Wang, and L. Wang. Hierarchical Recurrent Neural Network for Skeleton based Action Recognition. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 1110–1118, 2015

[38] A. Shahroudy, J. Liu, T. T. Ng, and G. Wang. NTU RGB+D: A Large Scale Dataset for 3D Human Activity Analysis. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 1010–1019, 2016

[39] T. N. Sainath, O. Vinyals, A. Senior, and H. Sak. Convolutional, Long Short-Term Memory, Fully Connected Deep Neural Networks. In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 4580–4584, 2015

[40] G. Chéron, I. Laptev, and C. Schmid. P-CNN: Pose-based CNN Features for Action Recognition. In IEEE International Conference on Computer Vision (ICCV), 2015

[41] B. Yao and L. Fei-Fei. Modeling Mutual Context of Object and Human Pose in Human-object Interaction Activities. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 17–24, 2010

[42] B. X. Nie, C. Xiong, and S. Zhu. Joint Action Recognition and Pose Estimation from Video. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 1293–1301, 2015

[43] D. C. Luvizon, D. Picard, and H. Tabia. 2D/3D Pose Estimation and Action Recognition using Multitask Deep Learning. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 5137–5146, 2018

[44] P. J. Huber. Robust Estimation of a Location Parameter. In Breakthroughs in Statistics, pages 492–518. Springer, 1992

[45] S. Christian, I. Sergey, and V. Vincent. Inception-v4, Inception-ResNet and the Impact of Residual Connections on Learning. In AAAI Conference on Artificial Intelligence (AAAI), 2016

[46] H. Gao, L. Zhuang, M. Laurens van der, and Q. W. Kilian. Densely Connected Convolutional Networks. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 2261–2269, 2017

[47] Z. Barret and V. L. Quoc. Neural Architecture Search with Reinforcement Learning. arXiv preprint arXiv:1611.01578, 2017

[48] S. Ioffe and C. Szegedy. Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift. In International Conference on Machine Learning (ICML), 2015

[49] G. E. Hinton, N. Srivastava, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov. Improving Neural Networks by Preventing Co-adaptation of Feature Detectors. arXiv preprint arXiv:1207.0580, 2012

[50] G. Klambauer, T. Unterthiner, A. Mayr, and S. Hochreiter. Self-Normalizing Neural Networks. In Advances in Neural Information Processing Systems (NIPS), pages 971–980, 2017.

[51] H. Pham, L. Khoudour, A. Crouzil, P. Zegers, and S. A. Velastin. Exploiting Deep Residual Networks for Human Action Recognition from Skeletal Data. Computer Vision and Image Understanding (CVIU), 170:51–66, 2018

[52] H. Pham, L. Khoudour, A. Crouzil, P. Zegers, and S. A. Velastin. Skeletal Movement to Color Map: A Novel Representation for 3D Action Recognition with Inception Residual Networks. IEEE International Conference on Image Processing (ICIP), pages 3483–3487, 2018

[53] H. Pham, H. Salmane, L. Khoudour, A. Crouzil, P. Zegers, and S. A. Velastin. Spatio-Temporal Image Representation of 3D Skeletal Movements for View-Invariant Action Recognition with Deep Convolutional Neural Networks. Sensors, 19(8), 2019

[54] K. He, X. Zhang, S. Ren, and J. Sun. Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification. In IEEE International Conference on Computer Vision (ICCV), pages 1026–1034, 2015

[55] D. Kingma and J. Ba. Adam: A Method for Stochastic Optimization. arXiv preprint arXiv:1412.6980, 2014

[56] N. Yurii. A Method for Solving a Convex Programming Problem with Convergence Rate O(1/K^2). Soviet Mathematics Doklady, pages 372–367, 1983

[57] L. Ilya and H. Frank. SGDR: Stochastic Gradient Descent with Warm Restarts. arXiv preprint arXiv:1608.03983, 2016

[58] Y. Du, Y. Wong, Y. Liu, F. Han, Y. Gui, Z. Wang, M. Kankanhalli, and W. Geng. Marker-less 3D Human Motion Capture with Monocular Image Sequence and Height-maps. In European Conference on Computer Vision (ECCV), pages 20–36, 2016

[59] S. Park, J. Hwang, and N. Kwak. 3D Human Pose Estimation using Convolutional Neural Networks with 2D Pose Information. In European Conference on Computer Vision (ECCV), pages 156–169, 2016

[60] X. Zhou, M. Zhu, S. Leonardos, K. G. Derpanis, and K. Daniilidis. Sparseness Meets Deepness: 3D Human Pose Estimation from Monocular Video. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 4966–4975, 2016

[61] Z. Xingyi, S. Xiao, Z. Wei, L. Shuang, and W. Yichen. Deep Kinematic Pose Regression. In European Conference on Computer Vision (ECCV), pages 186–201, 2016

[62] D. Mehta, H. Rhodin, D. Casas, P. Fua, O. Sotnychenko, W. Xu, and C. Theobalt. Monocular 3D Human Pose Estimation in the Wild using Improved CNN Supervision. In International Conference on 3D Vision (3DV), pages 506–516, 2017

[63] L. Shuang, S. Xiao, and W. Yichen. Compositional Human Pose Regression. Computer Vision and Image Understanding, 176–177:1–8, 2018

[64] C. Chen, K. Liu, and N. Kehtarnavaz. Real-time Human Action Recognition based on Depth Motion Maps. Journal of Real-time Image Processing, 12, 2016.

[65] P. Wang, C. Yuan, W. Hu, B. Li, and Y. Zhang. Graph Based Skeleton Motion Representation and Similarity Measurement for Action Recognition. In British Machine Vision Conference (BMVC), 2016

[66] J. Weng, C. Weng, and J. Yuan. Spatio-Temporal Naive-Bayes Nearest-Neighbor (ST-NBNN) for Skeleton-Based Action Recognition. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017

[67] H. Xu, E. Chen, C. Liang, L. Qi, and L. Guan. Spatio-temporal Pyramid Model based on Depth Maps for Action Recognition. In IEEE International Workshop on Multimedia Signal Processing (MMSP), pages 1–6, 2015

[68] I. Lee, D. Kim, S. Kang, and S. Lee. Ensemble Deep Learning for Skeleton-based Action Recognition using Temporal Sliding LSTM Networks. In IEEE International Conference on Computer Vision (ICCV), pages 1012–1020, 2017

[69] S. Song, C. Lan, J. Xing, W. Zeng, and J. Liu. An End-to-End Spatio-Temporal Attention Model for Human Action Recognition from Skeleton Data. In AAAI Conference on Artificial Intelligence (AAAI), 2017

[70] J. Weng, C. Weng, J. Yuan, and Z. Liu. Discriminative Spatio-Temporal Pattern Discovery for 3D Action Recognition. IEEE Transactions on Circuits and Systems for Video Technology (TCCVT), 29(4):1077–1089, 2019

[71] Q. Ke, M. Bennamoun, S. An, F. Sohel, and F. Boussaid. A New Representation of Skeleton Sequences for 3D Action Recognition. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 4570–4579, 2017

[72] T. Yusuf and K. Piotr. CNN-based Action Recognition and Supervised Domain Adaptation on 3D Body Skeletons via Kernel Feature Maps. In British Machine Vision Conference (BMVC), page 158, 2018

[73] H. Wang and L. Wang. Modeling Temporal Dynamics and Spatial Configurations of Actions Using Two-Stream Recurrent Neural Networks. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 3633–3642, 2017

[74] J. Liu, G. Wang, L. Duan, K. Abdiyeva, and A. C. Kot. Skeleton-Based Human Action Recognition With Global Context-Aware Attention LSTM Networks. IEEE Transactions on Image Processing (TIP), 27(4):1586–1599, 2018

[75] P. Zhang, C. Lan, J. Xing, W. Zeng, J. Xue, and N. Zheng. View Adaptive Neural Networks for High Performance Skeleton-based Human Action Recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), (1):1–1, 2019
