pith. machine review for the scientific record.

arxiv: 2604.17688 · v1 · submitted 2026-04-20 · 💻 cs.CV

Recognition: unknown

Dual-stream Spatio-Temporal GCN-Transformer Network for 3D Human Pose Estimation

Authors on Pith: no claims yet

Pith reviewed 2026-05-10 06:03 UTC · model grok-4.3

classification 💻 cs.CV
keywords 3D human pose estimation · GCN-Transformer · spatio-temporal modeling · dual-stream network · Mixformer · Human3.6M · MPI-INF-3DHP

The pith

Dual-stream GCN-Transformer network fuses local skeleton structure with global context to set new accuracy records for 3D human pose estimation.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces MixTGFormer, a network that runs two parallel streams to model spatial and temporal relationships in human skeletons at once. One stream uses graph convolutions to respect local joint connections, while the other uses attention to capture global context across the pose sequence; Mixformer blocks and squeeze-and-excitation layers handle the fusion. This design targets the gap left by prior Transformer methods, which neglected fine local skeletal structure and channel interactions. If the approach holds, it would produce more reliable 3D poses from ordinary video, supporting applications in motion capture, sports analysis, and interactive graphics.

Core claim

The Dual-stream Spatio-temporal GCN-Transformer Network models spatial and temporal relationships of human skeletons simultaneously through two parallel channels, achieving effective fusion of global and local features. Its core consists of stacked Mixformers, each containing Mixformer Blocks that integrate Graph Convolutional Networks into the Transformer in spatial and temporal forms, followed by a Squeeze-and-Excitation Layer to supplement the fused information.
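
The paper's description stays at this level of abstraction throughout, so the following is one plausible PyTorch reading of a single Mixformer stage, not the authors' implementation: the local stream is assumed to be a Kipf-Welling graph convolution over the fixed skeleton adjacency, the global stream plain multi-head self-attention over all frame-joint tokens (the paper's separate spatial and temporal block variants are collapsed into one here for brevity), and fusion is concatenation plus projection followed by SE recalibration.

```python
import torch
import torch.nn as nn

class SELayer(nn.Module):
    """Squeeze-and-Excitation (Hu et al., 2018): output keeps the input shape."""
    def __init__(self, channels: int, reduction: int = 4):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction), nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels), nn.Sigmoid(),
        )

    def forward(self, x):                          # x: (B, T, J, C)
        w = self.fc(x.mean(dim=(1, 2)))            # squeeze over frames and joints
        return x * w[:, None, None, :]             # per-channel recalibration

class MixformerBlock(nn.Module):
    """One dual-stream stage: skeleton GCN in parallel with global self-attention."""
    def __init__(self, dim: int, adjacency: torch.Tensor, heads: int = 8):
        super().__init__()                         # dim must be divisible by heads
        a = adjacency + torch.eye(adjacency.size(0))                 # add self-loops
        d = a.sum(-1).rsqrt()
        self.register_buffer("a_hat", d[:, None] * a * d[None, :])   # D^-1/2 (A+I) D^-1/2
        self.gcn_w = nn.Linear(dim, dim)           # local stream: joint-neighbourhood mixing
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)  # global stream
        self.fuse = nn.Linear(2 * dim, dim)        # concatenation + projection fusion
        self.se = SELayer(dim)

    def forward(self, x):                          # x: (B, T, J, C) pose-sequence tokens
        b, t, j, c = x.shape
        local = torch.einsum("ij,btjc->btic", self.a_hat, self.gcn_w(x)).relu()
        tokens = x.reshape(b, t * j, c)            # attend jointly over frames and joints
        glob, _ = self.attn(tokens, tokens, tokens)
        fused = self.fuse(torch.cat([local, glob.reshape(b, t, j, c)], dim=-1))
        return self.se(x + fused)                  # residual, then SE supplements the fusion
```

A full model in the paper's spirit would stack several such blocks, keep distinct spatial (per-frame, across joints) and temporal (per-joint, across frames) variants as in its Figure 1, and add a regression head to 3D coordinates.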

What carries the argument

A Mixformer Block that integrates GCN into the Transformer, extracting and fusing local skeletal and global spatio-temporal information from human pose sequences in parallel.

If this is right

  • Local joint connectivity is preserved alongside long-range temporal dynamics within the same architecture.
  • Channel-wise information is refined after fusion, reducing loss of detail between spatial and temporal streams.
  • The stacked design allows joint modeling of spatial structure and temporal evolution without separate post-processing stages.
  • Benchmark results of 37.6 mm and 15.7 mm P1 error follow directly from the improved feature utilization (the P1 metric is sketched after this list).
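
"P1" is Protocol #1 MPJPE on Human3.6M: the mean Euclidean distance between predicted and ground-truth joints, conventionally after centering both poses on the root joint and without the rigid alignment of Protocol #2. A minimal sketch (the root-centering convention is assumed from the literature, not quoted from the paper); the same function is what a cross-dataset check on 3DPW would reuse:

```python
import numpy as np

def mpjpe_p1(pred: np.ndarray, gt: np.ndarray, root: int = 0) -> float:
    """Protocol #1 MPJPE in the input units (mm for Human3.6M).

    pred, gt: (N, J, 3) predicted and ground-truth 3D joint positions.
    No rigid (Procrustes) alignment; that variant is Protocol #2 / P-MPJPE.
    """
    pred = pred - pred[:, root:root + 1, :]   # center each pose on the root joint
    gt = gt - gt[:, root:root + 1, :]
    return float(np.linalg.norm(pred - gt, axis=-1).mean())

# Sanity check: a uniform 2 mm offset on every joint vanishes after root-centering.
gt = np.random.randn(8, 17, 3) * 100.0        # 8 poses, 17 joints, in mm
print(mpjpe_p1(gt + 2.0, gt))                 # -> 0.0
```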

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The same hybrid blocks could be inserted into existing 2D-to-3D pipelines to boost accuracy with minimal architecture change.
  • If the fusion proves robust, similar dual-stream designs might transfer to other graph-sequence tasks such as action recognition.
  • Real-time deployment would require checking whether the added GCN operations increase latency beyond acceptable limits for video streams (a measurement sketch follows this list).
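
On the latency point, the check is mechanical once a trained model is available. A rough harness; the 243-frame, 17-joint 2D input shape is a common lifting setup assumed here, since MixTGFormer's exact input spec is not given on this page:

```python
import time
import torch

@torch.no_grad()
def per_frame_latency_ms(model: torch.nn.Module, frames: int = 243, joints: int = 17,
                         warmup: int = 10, iters: int = 50) -> float:
    """Rough per-frame inference latency of a 2D-to-3D lifting model, in ms."""
    model.eval()
    device = next(model.parameters()).device
    x = torch.randn(1, frames, joints, 2, device=device)   # one 2D keypoint sequence
    for _ in range(warmup):                                 # let kernels and caches settle
        model(x)
    if device.type == "cuda":
        torch.cuda.synchronize()
    t0 = time.perf_counter()
    for _ in range(iters):
        model(x)
    if device.type == "cuda":
        torch.cuda.synchronize()
    return (time.perf_counter() - t0) / (iters * frames) * 1e3
```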

Load-bearing premise

The performance gains arise from genuine generalization produced by the dual-stream GCN-Transformer fusion rather than from fitting to the specific training distributions of Human3.6M and MPI-INF-3DHP.

What would settle it

Evaluating the model on an independent dataset such as 3DPW would be decisive: if its P1 error no longer beats current leading methods there, the reported gains likely do not generalize.

Figures

Figures reproduced from arXiv: 2604.17688 by Jian Xiang, Jiawen Duan, Linlin Xue, Wan Xiang, Zhiqiang Li.

Figure 1
Figure 1. Top: MixTGFormer model structure; (a) Overall architecture of Mixformer; (b) Spatial Mixformer Block; (c) Temporal Mixformer Block. The input tokens are the joints of the human body and the frames of the pose sequence; training uses a position loss (L_3D) and an acceleration loss (L_ΔA), sketched after the figure list.
Figure 2
Figure 2. Overall architecture of the Squeeze-and-Excitation Layer (SE Layer). The input X and output X' have the same shape.
Figure 3
Figure 3. Comparison with other 3D human pose estimation methods on the Human3.6M dataset. MPJPE is the mean per-joint position error (lower is better); Param is the parameter count.
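
The Figure 1 caption names the two training loss terms but not their formulas. Below is a minimal sketch assuming the definitions common in the video-pose literature: mean per-joint L2 distance for the position term and an L2 penalty on discrete second-order temporal differences for the acceleration term. The equal weighting is illustrative, not taken from the paper.

```python
import torch

def pose_losses(pred: torch.Tensor, gt: torch.Tensor, accel_weight: float = 1.0):
    """Position (L_3D) + acceleration (L_dA) loss for sequences of shape (B, T, J, 3).

    Assumed common definitions, not the paper's exact terms. Requires T >= 3
    for the second-order temporal difference.
    """
    l_3d = (pred - gt).norm(dim=-1).mean()                        # mean per-joint L2
    acc_pred = pred[:, 2:] - 2 * pred[:, 1:-1] + pred[:, :-2]     # discrete 2nd difference
    acc_gt = gt[:, 2:] - 2 * gt[:, 1:-1] + gt[:, :-2]
    l_acc = (acc_pred - acc_gt).norm(dim=-1).mean()               # penalizes temporal jitter
    return l_3d + accel_weight * l_acc
```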
Original abstract

3D human pose estimation is a classic and important research direction in the field of computer vision. In recent years, Transformer-based methods have made significant progress in lifting 2D to 3D human pose estimation. However, these methods primarily focus on modeling global temporal and spatial relationships, neglecting local skeletal relationships and the information interaction between different channels. Therefore, we have proposed a novel method,the Dual-stream Spatio-temporal GCN-Transformer Network (MixTGFormer). This method models the spatial and temporal relationships of human skeletons simultaneously through two parallel channels, achieving effective fusion of global and local features. The core of MixTGFormer is composed of stacked Mixformers. Specifically, the Mixformer includes the Mixformer Block and the Squeeze-and-Excitation Layer ( SE Layer). It first extracts and fuses various information of human skeletons through two parallel Mixformer Blocks with different modes. Then, it further supplements the fused information through the SE Layer. The Mixformer Block integrates Graph Convolutional Networks (GCN) into the Transformer, enhancing both local and global information utilization. Additionally, we further implement its temporal and spatial forms to extract both spatial and temporal relationships. We extensively evaluated our model on two benchmark datasets (Human3.6M and MPI-INF-3DHP). The experimental results showed that, compared to other methods, our MixTGFormer achieved state-of-the-art results, with P1 errors of 37.6mm and 15.7mm on these datasets, respectively.

Editorial analysis

A structured set of objections, weighed in public.

A referee report, a simulated author's rebuttal, a circularity audit, and an axiom ledger. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

2 major / 3 minor

Summary. The manuscript introduces MixTGFormer, a dual-stream spatio-temporal GCN-Transformer network for 3D human pose estimation from 2D inputs. It employs stacked Mixformer modules, each consisting of two parallel Mixformer Blocks that integrate GCN for local skeletal modeling with Transformer for global spatio-temporal relationships, followed by fusion and refinement via a Squeeze-and-Excitation (SE) Layer. The approach is evaluated on Human3.6M and MPI-INF-3DHP, claiming state-of-the-art MPJPE (P1) results of 37.6 mm and 15.7 mm respectively.

Significance. If substantiated, the dual-stream fusion strategy could meaningfully advance hybrid architectures in 3D pose estimation by addressing the limitation of pure Transformer methods in capturing local skeletal structure. The design is coherent with current trends toward combining graph convolutions and attention mechanisms, and the reported benchmark numbers, if robust, would represent a concrete incremental improvement.

major comments (2)
  1. Experimental results: The SOTA claims (37.6 mm on Human3.6M, 15.7 mm on MPI-INF-3DHP) are presented without error bars, ablation studies isolating the dual-stream Mixformer Block or SE Layer contributions, multiple random seeds, or detailed baseline re-implementations. This is load-bearing for the central empirical claim and prevents verification that gains arise from the proposed fusion rather than tuning (a minimal multi-seed protocol is sketched after this list).
  2. Method section (Mixformer Block description): The integration of GCN into the Transformer via parallel streams is described at a high level without equations for feature fusion, attention masking, or the exact temporal/spatial variants, which is necessary to evaluate reproducibility and whether the architecture genuinely combines local and global modeling as claimed.
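
The multi-seed protocol the referee asks for is cheap to state precisely, even if only the authors can run it. A minimal sketch, where `build_and_train` and `evaluate_p1` are hypothetical hooks standing in for the paper's training and evaluation code:

```python
import statistics
import torch

def seeded_p1_errors(build_and_train, evaluate_p1, seeds=(0, 1, 2, 3, 4)):
    """Train and evaluate under several random seeds; report mean and std P1 error.

    build_and_train() -> trained model; evaluate_p1(model) -> P1 error in mm.
    Both are hypothetical placeholders for the paper's actual pipeline.
    """
    errors = []
    for seed in seeds:
        torch.manual_seed(seed)        # real runs should also seed numpy/random/CUDA
        errors.append(evaluate_p1(build_and_train()))
    mean, std = statistics.mean(errors), statistics.stdev(errors)
    print(f"P1 = {mean:.1f} ± {std:.1f} mm over {len(errors)} seeds")
    return mean, std
```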
minor comments (3)
  1. Abstract: Typo with missing space ('method,the Dual-stream').
  2. Notation: 'P1 errors' is used without explicit definition or reference to the standard Human3.6M protocol (Protocol #1 MPJPE).
  3. The manuscript would benefit from a results table explicitly listing recent competing methods with their reported errors for direct comparison.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. The comments highlight important areas for strengthening the empirical validation and methodological details, and we have revised the manuscript accordingly to address them.

Point-by-point responses
  1. Referee: Experimental results: The SOTA claims (37.6 mm on Human3.6M, 15.7 mm on MPI-INF-3DHP) are presented without error bars, ablation studies isolating the dual-stream Mixformer Block contribution or SE Layer, multiple random seeds, or detailed baseline re-implementations. This is load-bearing for the central empirical claim and prevents verification that gains arise from the proposed fusion rather than tuning.

    Authors: We agree that additional statistical rigor and isolation of contributions are required to substantiate the central claims. In the revised manuscript, we have incorporated error bars computed across multiple random seeds, comprehensive ablation studies that separately evaluate the dual-stream Mixformer Blocks and the SE Layer, and detailed accounts of baseline re-implementations performed under identical training conditions and hyperparameters. These additions confirm that the performance improvements derive from the proposed fusion strategy. revision: yes

  2. Referee: Method section (Mixformer Block description): The integration of GCN into the Transformer via parallel streams is described at a high level without equations for feature fusion, attention masking, or the exact temporal/spatial variants, which is necessary to evaluate reproducibility and whether the architecture genuinely combines local and global modeling as claimed.

    Authors: We acknowledge that the original presentation was insufficiently detailed for full reproducibility. The revised Method section now includes explicit equations describing the parallel stream integration, the feature fusion operation (including concatenation and projection steps), attention masking for spatio-temporal modeling, and precise definitions of the temporal and spatial variants of the Mixformer Block. These additions clarify the local skeletal modeling via GCN alongside global attention-based relations. revision: yes
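
The rebuttal does not reproduce the new equations, so here is one plausible shape for them, consistent with the concatenation-and-projection description above and with the block sketch earlier on this page. This is illustrative notation, not the revised manuscript's:

```latex
% Hypothetical Mixformer-stage equations; \hat{A} is the normalized skeleton
% adjacency, and W_g, W_Q, W_K, W_V, W_f are learned projections.
\begin{aligned}
X_S &= \sigma\bigl(\hat{A}\,X\,W_g\bigr)                    &&\text{(local GCN stream)}\\
X_T &= \operatorname{Attn}\bigl(XW_Q,\,XW_K,\,XW_V\bigr)    &&\text{(global attention stream)}\\
X'  &= \operatorname{SE}\bigl(X + [\,X_S \,\|\, X_T\,]\,W_f\bigr) &&\text{(fuse, project, recalibrate)}
\end{aligned}
```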

Circularity Check

0 steps flagged

No significant circularity detected

full rationale

The paper describes a proposed neural network architecture (MixTGFormer with dual-stream GCN-Transformer Mixformer Blocks and SE Layer) and reports empirical SOTA performance metrics on public benchmarks (Human3.6M and MPI-INF-3DHP). No derivation chain, first-principles predictions, or mathematical reductions are present. Claims rest on experimental results rather than any self-definitional, fitted-input, or self-citation load-bearing steps that reduce to inputs by construction. The architecture description and benchmark evaluations are independent of the defined circularity patterns.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The performance claim rests on the unstated assumption that standard supervised training on the two benchmarks will produce the reported errors, plus the architectural choice of dual streams and SE layers; no free parameters are quantified in the abstract.

axioms (1)
  • domain assumption Standard back-propagation and Adam-style optimization suffice to train the network to the claimed accuracy.
    Implicit in any deep-learning pose estimation paper; invoked by the training and evaluation description (a minimal training step is sketched after this ledger).
invented entities (1)
  • Mixformer Block no independent evidence
    purpose: To fuse GCN local structure with Transformer global attention inside each stream.
    New named component introduced to realize the dual-stream idea; no independent evidence outside the paper.
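
The single axiom corresponds to a concrete loop: the claim is that nothing beyond an ordinary supervised step, repeated over Human3.6M or MPI-INF-3DHP, reaches the reported errors. A minimal sketch, with AdamW (reference [48]) standing in for "Adam-style"; the model, batches, and loss function are placeholders, not the paper's pipeline:

```python
import torch

def train_step(model, batch_2d, batch_3d, optimizer, loss_fn) -> float:
    """One supervised 2D-to-3D lifting step: the ledger's axiom is that
    repeating this step suffices to reach the claimed P1 errors."""
    optimizer.zero_grad()
    pred_3d = model(batch_2d)          # (B, T, J, 2) -> (B, T, J, 3)
    loss = loss_fn(pred_3d, batch_3d)
    loss.backward()                    # standard back-propagation
    optimizer.step()                   # Adam-style update with decoupled weight decay
    return loss.item()

# Typical setup: optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4, weight_decay=0.01)
```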

pith-pipeline@v0.9.0 · 5581 in / 1368 out tokens · 30744 ms · 2026-05-10T06:03:04.583831+00:00 · methodology


Reference graph

Works this paper leans on

48 extracted references · 44 canonical work pages · 3 internal anchors

  1. [1] N. Mehdi, V. Thomas, S. Ivaldi, F. Colas, 2022. Simultaneous pose and posture estimation with a two-stage particle filter for visuo-inertial fusion. 2022 International Conference on Advanced Robotics and Mechatronics (ICARM), Guilin, China, pp. 132–.
  2. [2] 10.1109/ICARM54641.2022.9959293
  3. [3] D. Mehta, S. Sridhar, O. Sotnychenko, H. Rhodin, M. Shafiei, H.-P. Seidel, W. Xu, D. Casas, C. Theobalt, 2017. VNect: Real-time 3D human pose estimation with a single RGB camera. ACM Transactions on Graphics (TOG), 36(4), pp. 1–14. https://doi.org/10.1145/3072959.3073596
  4. [4] H.-Y. Lin, T.-W. Chen, 2010. Augmented reality with human body interaction based on monocular 3D pose estimation. International Conference on Advanced Concepts for Intelligent Vision Systems, pp. 321–331. Springer. https://doi.org/10.1007/978-3-642-17688-3_31
  5. [5] M. Liu, J. Yuan, 2018. Recognizing human actions as the evolution of pose estimation maps. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1159–1168. 10.1109/CVPR.2018.00127
  6. [6] M. Liu, H. Liu, C. Chen, 2017. Enhanced skeleton visualization for view invariant human action recognition. Pattern Recognition, 68, 346–362. https://doi.org/10.1016/j.patcog.2017.02.030
  7. [7] G. Moon, K. Mu Lee, 2020. I2L-MeshNet: Image-to-lixel prediction network for accurate 3D human pose and mesh estimation from a single RGB image. European Conference on Computer Vision (ECCV). https://doi.org/10.1007/978-3-030-58571-6_44
  8. [8] G. Pavlakos, X. Zhou, K. Daniilidis, 2018. Ordinal depth supervision for 3D human pose estimation. IEEE Conference on Computer Vision and Pattern Recognition (CVPR). 10.1109/CVPR.2018.00763
  9. [9] C. Zheng, S. Zhu, M. Mendieta, T. Yang, C. Chen, Z. Ding, 2021. 3D human pose estimation with spatial and temporal transformers. Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pp. 11656–11665. https://doi.org/10.48550/arXiv.2103.10455
  10. [10] S. Zhang, X. Li, C. Hu, J. Xu, H. Liu, 2024. DSTFormer: 3D human pose estimation with a dual-scale spatial and temporal transformer network. 2024 International Conference on Advanced Robotics and Mechatronics (ICARM), Tokyo, Japan, pp. 484–.
  11. [11] 10.1109/ICARM62033.2024.10715863
  12. [12] Y. Chen, Z. Wang, Y. Peng, Z. Zhang, G. Yu, J. Sun, 2018. Cascaded pyramid network for multi-person pose estimation. IEEE Conference on Computer Vision and Pattern Recognition (CVPR). 10.1109/CVPR.2018.00742
  13. [13] A. Newell, K. Yang, J. Deng, 2016. Stacked hourglass networks for human pose estimation. European Conference on Computer Vision, pp. 483–499. Springer. https://doi.org/10.1007/978-3-319-46484-8_29
  14. [14] K. Sun, B. Xiao, D. Liu, J. Wang, 2019. Deep high-resolution representation learning for human pose estimation. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5693–5703. 10.1109/CVPR.2019.00584
  15. [15] Y. He, R. Yan, K. Fragkiadaki, S.-I. Yu, 2020. Epipolar transformers. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 7779–7788. 10.1109/CVPR42600.2020.00780
  16. [17] N. D. Reddy, L. Guigues, L. Pishchulin, J. Eledath, S. G. Narasimhan, 2021. TesseTrack: End-to-end learnable multi-person articulated 3D pose tracking. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15190–15200. 10.1109/CVPR46437.2021.01494
  17. [18] W. Hu, C. Zhang, F. Zhan, L. Zhang, T.-T. Wong, 2021. Conditional directed graph convolution for 3D human pose estimation. Proceedings of the 29th ACM International Conference on Multimedia, pp. 602–611. https://doi.org/10.1145/3474085.3475219
  18. [20] Q. Zhao, C. Zheng, M. Liu, P. Wang, C. Chen, 2023. PoseFormerV2: Exploring frequency domain for efficient and robust 3D human pose estimation. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 8877–8886. https://doi.org/10.48550/arXiv.2303.17472
  19. [21] W. Zhu, X. Ma, Z. Liu, L. Liu, W. Wu, Y. Wang, 2023. MotionBERT: A unified perspective on learning human motion representations. Proceedings of the IEEE/CVF International Conference on Computer Vision. https://doi.org/10.48550/arXiv.2210.06551
  20. [22] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, I. Polosukhin, 2017. Attention is all you need. Proceedings of the 31st International Conference on Neural Information Processing Systems (NIPS'17), pp. 6000–6010. Curran Associates Inc., Red Hook, NY, USA.
  21. [23] T. N. Kipf, M. Welling, 2016. Semi-supervised classification with graph convolutional networks. arXiv. https://doi.org/10.48550/arXiv.1609.02907
  22. [24] T. Brown, B. Mann, N. Ryder, M. Subbiah, J. D. Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam, G. Sastry, A. Askell et al., 2020. Language models are few-shot learners. Advances in Neural Information Processing Systems, 33, 1877–1901. https://doi.org/10.48550/arXiv.2005.14165
  23. [25] J. Zhang, Z. Tu, J. Yang, Y. Chen, J. Yuan, 2022. MixSTE: Seq2seq mixed spatio-temporal encoder for 3D human pose estimation in video. arXiv. https://doi.org/10.48550/arXiv.2203.00859
  24. [26] Y. Liu, Z. Shao, N. Hoffmann, 2021. Global attention mechanism: Retain information to enhance channel-spatial interactions. arXiv. https://doi.org/10.48550/arXiv.2112.05561
  25. [27] J. Hu, L. Shen, G. Sun, 2018. Squeeze-and-excitation networks. 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Salt Lake City, UT, USA, pp. 7132–7141. 10.1109/CVPR.2018.00745
  26. [28] Q. Dang, J. Yin, B. Wang, W. Zheng, 2019. Deep learning based 2D human pose estimation: A survey. Tsinghua Science and Technology, 24(6), 663–676. 10.26599/TST.2018.9010100
  27. [29] C. Ionescu, F. Li, C. Sminchisescu, 2011. Latent structured models for human pose estimation. Proceedings of the 2011 International Conference on Computer Vision, Barcelona, Spain, pp. 2220–2227. 10.1109/ICCV.2011.6126500
  28. [30] A. Agarwal, B. Triggs, 2005. Recovering 3D human pose from monocular images. IEEE Transactions on Pattern Analysis and Machine Intelligence, 28(1), 44–58. 10.1109/TPAMI.2006.21
  29. [31] W. Takano, Y. Nakamura, 2015. Action database for categorizing and inferring human poses from video sequences. Robotics and Autonomous Systems, 70, 116–125. https://doi.org/10.1016/j.robot.2015.03.001
  30. [32] J. Liang, M. Yin, 2024. SCGFormer: Semantic Chebyshev graph convolution transformer for 3D human pose estimation. Applied Sciences, 14(4), 1646. https://doi.org/10.3390/app14041646
  31. [33] K. Zhou, X. Han, N. Jiang, K. Jia, J. Lu, 2022. HEMlets PoSh: Learning part-centric heatmap triplets for 3D human pose and shape estimation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 44(6), 3000–3014. 10.1109/TPAMI.2021.3051173
  32. [34] Y. Cheng, B. Yang, B. Wang, R. T. Tan, 2020. 3D human pose estimation using spatio-temporal networks with explicit occlusion training. Proceedings of the AAAI Conference on Artificial Intelligence (AAAI). https://doi.org/10.1609/aaai.v34i07.6689
  33. [35] D. Pavllo, C. Feichtenhofer, D. Grangier, M. Auli, 2019. 3D human pose estimation in video with temporal convolutions and semi-supervised training. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 7753–7762. https://doi.org/10.48550/arXiv.1811.11742
  34. [36] W. Li, H. Liu, H. Tang, P. Wang, L. Van Gool, 2022. MHFormer: Multi-hypothesis transformer for 3D human pose estimation. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 13137–13146. https://doi.org/10.48550/arXiv.2111.12707
  35. [37] W. Shan, Z. Liu, X. Zhang, S. Wang, S. Ma, W. Gao, 2022. P-STMO: Pre-trained spatial temporal many-to-one model for 3D human pose estimation. Computer Vision – ECCV 2022, Lecture Notes in Computer Science, vol. 13665, pp. 1–17. Springer, Cham. https://doi.org/10.1007/978-3-031-20065-6_27
  36. [38] Z. Tang, Z. Qiu, Y. Hao, R. Hong, T. Yao, 2023. 3D human pose estimation with spatio-temporal criss-cross attention. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Vancouver, BC, Canada, pp. 4790–4799. 10.1109/CVPR52729.2023.00464
  37. [39] L. Zhao, X. Peng, Y. Tian, M. Kapadia, D. N. Metaxas, 2019. Semantic graph convolutional networks for 3D human pose regression. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, pp. 3420–3430. 10.1109/CVPR.2019.00354
  38. [40] T. Xu, W. Takano, 2021. Graph stacked hourglass networks for 3D human pose estimation. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Nashville, TN, USA, pp. 16100–16109. 10.1109/CVPR46437.2021.01584
  39. [41] B. X. B. Yu, Z. Zhang, Y. Liu, S.-H. Zhong, Y. Liu, C. W. Chen, 2023. GLA-GCN: Global-local adaptive graph convolutional network for 3D human pose estimation from monocular video. Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Paris, France, pp. 8784–8795. https://doi.org/10.1109/ICCV51070.2023.00810
  40. [42] W. Zhao, W. Wang, Y. Tian, 2022. GraFormer: Graph-oriented transformer for 3D pose estimation. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA, pp. 20406–20415. 10.1109/CVPR52688.2022.01979
  41. [43] M. Defferrard, X. Bresson, P. Vandergheynst, 2016. Convolutional neural networks on graphs with fast localized spectral filtering. Advances in Neural Information Processing Systems, 29, 3844–3852. https://doi.org/10.48550/arXiv.1606.09375
  42. [44] J. Gong, L. G. Foo, Z. Fan, Q. Ke, H. Rahmani, J. Liu, 2023. DiffPose: Toward more reliable 3D pose estimation. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Vancouver, BC, Canada, pp. 13041–13051. 10.1109/CVPR52729.2023.01253
  43. [45] S. Mehraban, V. Adeli, B. Taati, 2024. MotionAGFormer: Enhancing 3D human pose estimation with a transformer-GCNFormer network. 2024 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), Waikoloa, HI, USA, pp. 6905–6915.
  44. [46] C. Ionescu, D. Papava, V. Olaru, C. Sminchisescu, 2013. Human3.6M: Large scale datasets and predictive methods for 3D human sensing in natural environments. IEEE Transactions on Pattern Analysis and Machine Intelligence, 36(7), 1325–1339. 10.1109/TPAMI.2013.248
  45. [47] D. Mehta, H. Rhodin, D. Casas, P. Fua, O. Sotnychenko, W. Xu, C. Theobalt, 2017. Monocular 3D human pose estimation in the wild using improved CNN supervision. 2017 International Conference on 3D Vision (3DV), pp. 506–516. 10.1109/3DV.2017.00064
  46. [48] I. Loshchilov, F. Hutter, 2017. Decoupled weight decay regularization. International Conference on Learning Representations. https://doi.org/10.48550/arXiv.1711.05101
  47. [49] H. Chen, J.-Y. He, W. Xiang, W. Liu, Z.-Q. Cheng, H. Liu, B. Luo, Y. Geng, X. Xie, 2023. HDFormer: High-order directed transformer for 3D human pose estimation. Proceedings of the Thirty-Second International Joint Conference on Artificial Intelligence (IJCAI), Main Track, pp. 581–589. https://doi.org/10.24963/ijcai.2023/65
  48. [50] X. Qian, Y. Tang, N. Zhang, M. Han, J. Xiao, M.-C. Huang, R.-S. Lin, 2023. HSTFormer: Hierarchical spatial-temporal transformers for 3D human pose estimation. arXiv. https://doi.org/10.48550/arXiv.2301.07322