arxiv: 2602.23231 · v2 · pith:PTROZZCPnew · submitted 2026-02-26 · 💻 cs.CV

Skarimva: Skeleton-based Action Recognition is a Multi-view Application

Pith reviewed 2026-05-22 10:59 UTC · model grok-4.3

classification 💻 cs.CV
keywords skeleton-based action recognition · multi-view triangulation · 3D skeleton estimation · human action recognition · computer vision · pose estimation · multi-camera systems

The pith

Using multiple camera views to triangulate more accurate 3D skeletons improves state-of-the-art action recognition models significantly.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper demonstrates that skeleton-based action recognition models are held back by the quality of their input data rather than by shortcomings in the learning algorithms themselves. Triangulating 3D joint positions from several synchronized camera views produces cleaner skeletons that raise recognition accuracy on standard benchmarks. A reader would care because the result points to a simple, hardware-level lever that delivers gains without requiring new model architectures. The authors therefore recommend treating multi-view capture as the normal setup for this task.

Core claim

By making use of multiple camera views to triangulate more accurate 3D skeletons, the performance of state-of-the-art action recognition models can be improved significantly. This suggests that the quality of the input data is currently a limiting factor for the performance of these models.

What carries the argument

Multi-view triangulation of 3D skeletons from 2D pose detections across cameras.
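
To make the mechanism concrete, here is a minimal sketch of triangulating one joint from several views. It is not the authors' pipeline: it assumes each 2D detection has already been back-projected into a viewing ray using known camera calibration, and it uses a simple least-squares ray intersection instead of the DLT-based procedure the paper appears to use. The function names (`triangulate_rays`, `solve3`, `det3`) and the camera layout are hypothetical.

```python
import math


def det3(m):
    """Determinant of a 3x3 matrix given as nested lists."""
    return (m[0][0] * (m[1][1] * m[2][2] - m[1][2] * m[2][1])
            - m[0][1] * (m[1][0] * m[2][2] - m[1][2] * m[2][0])
            + m[0][2] * (m[1][0] * m[2][1] - m[1][1] * m[2][0]))


def solve3(A, b):
    """Solve the 3x3 system A x = b with Cramer's rule."""
    d = det3(A)
    if abs(d) < 1e-12:
        raise ValueError("rays are (nearly) parallel; more distinct views needed")
    x = []
    for k in range(3):
        # Replace column k of A with b.
        Ak = [[b[i] if j == k else A[i][j] for j in range(3)] for i in range(3)]
        x.append(det3(Ak) / d)
    return x


def triangulate_rays(origins, directions):
    """Return the 3D point with the smallest summed squared distance to all rays.

    origins:    camera centres, one (x, y, z) per view (assumed known).
    directions: ray directions through the detected 2D joint, one per view.
    """
    A = [[0.0] * 3 for _ in range(3)]
    b = [0.0] * 3
    for o, d in zip(origins, directions):
        norm = math.sqrt(sum(c * c for c in d))
        d = [c / norm for c in d]
        for i in range(3):
            for j in range(3):
                # M = I - d d^T projects onto the plane orthogonal to the ray.
                m = (1.0 if i == j else 0.0) - d[i] * d[j]
                A[i][j] += m
                b[i] += m * o[j]
    return solve3(A, b)


# Hypothetical layout: three cameras whose rays all pass through one joint.
cameras = [(0.0, 0.0, 0.0), (5.0, 0.0, 0.0), (0.0, 5.0, 0.0)]
rays = [(1.0, 2.0, 3.0), (-4.0, 2.0, 3.0), (1.0, -3.0, 3.0)]
print(triangulate_rays(cameras, rays))
```

With noisy detections the rays no longer meet exactly, and the least-squares point averages the disagreement across views, which is the intuition behind the claimed accuracy gain over a single view.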

If this is right

  • Existing models achieve higher accuracy once supplied with triangulated skeletons.
  • The cost-benefit ratio of adding cameras is favorable for most practical deployments.
  • Future work in skeleton-based recognition should adopt multi-view capture as the default configuration.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Emphasizing input quality may reduce pressure to develop ever-larger neural architectures for this task.
  • Multi-view triangulation could also improve robustness when subjects are partially occluded.
  • The same triangulation principle may transfer to other 3D reconstruction problems that currently rely on single-view estimates.

Load-bearing premise

The observed gains in recognition accuracy are produced by the higher geometric accuracy of the triangulated skeletons rather than by differences in training procedures or dataset composition between the single-view and multi-view conditions.

What would settle it

Run the identical action recognition model on single-view skeletons versus multi-view triangulated skeletons taken from the exact same video sequences while holding all training and evaluation steps fixed, then measure whether accuracy differs.
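
One way to carry out such a paired comparison is an exact McNemar test on the per-sequence predictions of the two conditions. The sketch below is an illustration under stated assumptions, not a procedure taken from the paper: `mcnemar_exact` is a hypothetical helper, and it assumes both prediction lists cover the same test sequences in the same order.

```python
from math import comb


def mcnemar_exact(labels, preds_single, preds_multi):
    """Two-sided exact McNemar test on paired predictions.

    Returns (only_single_correct, only_multi_correct, p_value).
    Only sequences where exactly one condition is correct carry information.
    """
    only_single = sum(1 for y, s, m in zip(labels, preds_single, preds_multi)
                      if s == y and m != y)
    only_multi = sum(1 for y, s, m in zip(labels, preds_single, preds_multi)
                     if s != y and m == y)
    n = only_single + only_multi
    if n == 0:
        # The two conditions never disagree in correctness.
        return only_single, only_multi, 1.0
    k = min(only_single, only_multi)
    # Under the null hypothesis each discordant pair is a fair coin flip.
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return only_single, only_multi, min(1.0, 2.0 * tail)
```

A small p-value together with `only_multi_correct` exceeding `only_single_correct` would indicate that the accuracy difference is unlikely to be an artifact of which test sequences happened to be sampled.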

Figures

Figures reproduced from arXiv: 2602.23231 by Alexander Poeppel, Daniel Bermuth, Wolfgang Reif.

Figure 1: Example of a kick other person action with the new multi-view whole-body skeletons.

Excerpt from II. Related Work: Most research in skeleton-based action recognition has focused on developing new model architectures, while only few works have investigated the influence of input skeleton quality so far. The original NTU-RGBD dataset [26], [18] was created with three non-calibrated Kinect RGB-D cameras, from which each ca…
Figure 2: Extrinsic calibration by overlapping skeletons from…
Figure 3: Confusion matrix of the ensembled ProtoGCN model on NTU-RGBD-60-xsub.
Figure 4: Confusion matrix of the ensembled ProtoGCN model on NTU-RGBD-120-xsub.

Original abstract

Human action recognition plays an important role when developing intelligent interactions between humans and machines. While there is a lot of active research on improving the machine learning algorithms for skeleton-based action recognition, not much attention has been given to the quality of the input skeleton data itself. This work demonstrates that by making use of multiple camera views to triangulate more accurate 3D skeletons, the performance of state-of-the-art action recognition models can be improved significantly. This suggests that the quality of the input data is currently a limiting factor for the performance of these models. Based on these results, it is argued that the cost-benefit ratio of using multiple cameras is very favorable in most practical use-cases, therefore future research in skeleton-based action recognition should consider multi-view applications as the standard setup.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The manuscript claims that the quality of input 3D skeleton data is a limiting factor for skeleton-based action recognition models. It argues that triangulating skeletons from multiple camera views produces measurably more accurate 3D poses than single-view methods, and that feeding these higher-quality skeletons into existing state-of-the-art recognition architectures yields significant performance gains. The authors conclude that the cost-benefit ratio favors multi-view capture and recommend it as the new standard setup for the field.

Significance. If the performance gains can be shown to arise specifically from improved triangulation accuracy under controlled conditions, the result would reorient research priorities in skeleton-based action recognition toward data acquisition rather than solely toward model architecture. It would also provide a concrete, low-cost intervention that could be adopted immediately by practitioners.

major comments (3)
  1. [§4.1, §4.2] §4.1 and §4.2: the experimental design does not demonstrate that single-view and multi-view conditions differ only in skeleton precision. The manuscript must explicitly state whether the identical raw video sequences, camera calibrations, subject pose distributions, and action class balances were used for both conditions, or whether multi-view recordings were collected separately and may therefore differ in lighting, subject behavior, or 2D detector performance.
  2. [Table 2] Table 2 (or equivalent results table): the reported accuracy improvements lack error bars, statistical significance tests, or ablation isolating the triangulation step from other multi-view effects (e.g., better 2D keypoint detection due to redundant views). Without these controls the attribution of gains to 3D reconstruction accuracy remains unverified.
  3. [§3.2] §3.2: the claim that multi-view triangulation is 'parameter-free' relative to single-view lifting is not supported by the description of the triangulation procedure; any choice of camera selection, outlier rejection threshold, or bundle-adjustment iterations introduces hyperparameters that must be reported and held constant across baselines.
minor comments (2)
  1. [Abstract] The abstract states that performance 'improves significantly' yet supplies no numerical deltas or dataset names; this quantitative summary should appear in the abstract itself.
  2. [§2, §3] Notation for 3D joint coordinates is introduced inconsistently between §2 and §3; a single, explicit definition (e.g., J ∈ ℝ^{3×K}) should be used throughout.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We address each major comment below and indicate the revisions we will make to strengthen the manuscript.

point-by-point responses
  1. Referee: [§4.1, §4.2] §4.1 and §4.2: the experimental design does not demonstrate that single-view and multi-view conditions differ only in skeleton precision. The manuscript must explicitly state whether the identical raw video sequences, camera calibrations, subject pose distributions, and action class balances were used for both conditions, or whether multi-view recordings were collected separately and may therefore differ in lighting, subject behavior, or 2D detector performance.

    Authors: We agree that explicit clarification is required. The experiments in §4.1 and §4.2 were performed on the identical multi-view video sequences. For the single-view condition we selected one camera from the multi-view capture and applied single-view lifting to that view alone; the multi-view condition triangulated using all views of the same sequences. Consequently, raw video, calibrations, subject poses, action distributions, lighting, and 2D detector outputs are identical by construction. We will add a clear statement to this effect in the revised §4.1 and §4.2.
    revision: yes

  2. Referee: [Table 2] Table 2 (or equivalent results table): the reported accuracy improvements lack error bars, statistical significance tests, or ablation isolating the triangulation step from other multi-view effects (e.g., better 2D keypoint detection due to redundant views). Without these controls the attribution of gains to 3D reconstruction accuracy remains unverified.

    Authors: We accept this criticism. In the revision we will augment Table 2 with error bars (standard deviation across runs) and report paired statistical significance tests. Our current single-view baselines already employ the identical 2D detector as the multi-view pipeline, so the primary difference is the 3D reconstruction method. We will add a short ablation that compares triangulation with and without view-redundancy fusion at the 2D stage to further isolate the contribution of improved 3D accuracy.
    revision: partial

  3. Referee: [§3.2] §3.2: the claim that multi-view triangulation is 'parameter-free' relative to single-view lifting is not supported by the description of the triangulation procedure; any choice of camera selection, outlier rejection threshold, or bundle-adjustment iterations introduces hyperparameters that must be reported and held constant across baselines.

    Authors: The referee is correct that the wording in §3.2 is imprecise. While the core triangulation algorithm (DLT) contains fewer learned parameters than neural lifting methods, we did apply fixed outlier-rejection thresholds and a fixed number of bundle-adjustment iterations. We will revise §3.2 to remove the 'parameter-free' phrasing, describe the exact procedure, and list all hyperparameters together with the statement that they were held constant for all reported comparisons.
    revision: yes
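
The error bars promised in the second response could be produced along the following lines. This is a sketch only: `summarize_runs` is a hypothetical helper, it assumes each condition was trained with the same set of random seeds so that runs can be paired, and the numbers in the usage lines are placeholders, not results from the paper.

```python
from statistics import mean, stdev


def summarize_runs(acc_single, acc_multi):
    """Summarise per-seed accuracies for the two input conditions.

    acc_single[i] and acc_multi[i] must come from the same seed, so that
    the per-seed gain is a paired quantity.
    Returns a dict of (mean, sample standard deviation) tuples.
    """
    if len(acc_single) != len(acc_multi):
        raise ValueError("both conditions need the same seeds")
    if len(acc_single) < 2:
        raise ValueError("at least two runs are needed for a standard deviation")
    gains = [m - s for s, m in zip(acc_single, acc_multi)]
    return {
        "single": (mean(acc_single), stdev(acc_single)),
        "multi": (mean(acc_multi), stdev(acc_multi)),
        "paired_gain": (mean(gains), stdev(gains)),
    }


# Placeholder accuracies for three seeds; NOT results from the paper.
summary = summarize_runs([0.50, 0.60, 0.70], [0.60, 0.70, 0.80])
for name, (avg, sd) in summary.items():
    print(f"{name}: {avg:.3f} +/- {sd:.3f}")
```

Reporting the paired gain separately matters because seed-to-seed variance that affects both conditions equally cancels out in the difference, which can make a real improvement visible even when the two error bars overlap.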

Circularity Check

0 steps flagged

No circularity: empirical demonstration only

full rationale

The paper advances an empirical claim that multi-view triangulation yields more accurate 3D skeletons and thereby improves downstream action-recognition accuracy. No equations, derivations, fitted parameters, or self-referential definitions appear in the abstract or described content. The central argument rests on experimental comparison rather than reducing by construction to its own inputs, self-citations, or ansatzes. This is a standard empirical study whose load-bearing steps are external benchmarks and controlled measurements, not internal redefinitions.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the unstated premise that triangulation from multiple views produces measurably superior 3D skeletons and that any accuracy gain transfers directly to downstream action classifiers without additional confounding variables.

axioms (1)
  • domain assumption: Triangulation from multiple calibrated cameras yields higher-accuracy 3D joint positions than single-view estimation.
    Invoked implicitly when the abstract states that multi-view produces 'more accurate 3D skeletons'.

Reference graph

Works this paper leans on

39 extracted references · 39 canonical work pages

  1. [1]

    K. C. Alowonou and J.-H. Han. MSA-GCN: Exploiting Multi-Scale Temporal Dynamics With Adaptive Graph Convolution for Skeleton-Based Action Recognition. IEEE Access, 2024

  2. [2]

    D. Bermuth, A. Poeppel, and W. Reif. Voxelkeypointfusion: Generalizable multi-view multi-person pose estimation. arXiv preprint arXiv:2410.18723, 2024

  3. [3]

    D. Bermuth, A. Poeppel, and W. Reif. RapidPoseTriangulation: Multi-view Multi-person Whole-body Human Pose Triangulation in a Millisecond. arXiv preprint arXiv:2503.21692, 2025

  4. [4]

    D. Bermuth, A. Poeppel, and W. Reif. Tutabo-1: towards real-time capable AI-based safety systems for human-robot collaboration. In 2025 IEEE International Conference on Advanced Robotics (ICAR). Institute of Electrical and Electronics Engineers (IEEE), 2025

  5. [5]

    L. Cao, S. Huai, and J. Gai. Reenvisioning Skeleton-based Action Recognition Through the Lens of NLP. In ICASSP 2025 - 2025 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 1–5. IEEE, 2025

  6. [6]

    Y. Chen, Z. Zhang, C. Yuan, B. Li, Y. Deng, and W. Hu. Channel-wise topology refinement graph convolution for skeleton-based action recognition. In Proceedings of the IEEE/CVF international conference on computer vision, pages 13359–13368, 2021

  7. [7]

    H.-g. Chi, M. H. Ha, S. Chi, S. W. Lee, Q. Huang, and K. Ramani. Infogcn: Representation learning for human skeleton-based action recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 20186–20196, 2022

  8. [8]

    J. Do and M. Kim. Skateformer: skeletal-temporal transformer for human action recognition. In European Conference on Computer Vision, pages 401–420. Springer, 2024

  9. [9]

    H. Duan, J. Wang, K. Chen, and D. Lin. DG-STGCN: dynamic spatial-temporal modeling for skeleton-based action recognition. arXiv preprint arXiv:2210.05895, 2022

  10. [10]

    H. Duan, J. Wang, K. Chen, and D. Lin. Pyskl: Towards good practices for skeleton action recognition. In Proceedings of the 30th ACM International Conference on Multimedia, pages 7351–7354, 2022

  11. [11]

    H. Duan, Y. Zhao, K. Chen, D. Lin, and B. Dai. Revisiting skeleton-based action recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 2969–2978, 2022

  12. [12]

    H. Joo, H. Liu, L. Tan, L. Gui, B. Nabbe, I. Matthews, T. Kanade, S. Nobuhara, and Y. Sheikh. Panoptic studio: A massively multiview system for social motion capture. In Proceedings of the IEEE international conference on computer vision, pages 3334–3342, 2015

  13. [13]

    J. Lee, M. Lee, D. Lee, and S. Lee. Hierarchically decomposed graph convolutional networks for skeleton-based action recognition. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 10444–10453, 2023

  14. [14]

    T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick. Microsoft COCO: Common objects in context. In European conference on computer vision, pages 740–755. Springer, 2014

  15. [15]

    D. Liu, Y. Hu, K. Hua, Y. Lu, Z. Zhang, X. Ma, Z. Zhong, and P. Chen. TDSN-GCN: Transformerify Overall Structure Decaying Static Graph Embedding NAS-guided GCN for Skeleton Action Recognition. IEEE Transactions on Circuits and Systems for Video Technology, 2025

  16. [16]

    H. Liu, Y. Liu, M. Ren, H. Wang, Y. Wang, and Z. Sun. Revealing key details to see differences: A novel prototypical perspective for skeleton-based action recognition. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 29248–29257, 2025

  17. [17]

    H. Liu, Y. Liu, C. Wang, Y. Wang, and Z. Sun. SkeletonAgent: An Agentic Interaction Framework for Skeleton-based Action Recognition. arXiv preprint arXiv:2511.22433, 2025

  18. [18]

    J. Liu, A. Shahroudy, M. Perez, G. Wang, L.-Y. Duan, and A. C. Kot. Ntu rgb+ d 120: A large-scale benchmark for 3d human activity understanding. IEEE transactions on pattern analysis and machine intelligence, 42(10):2684–2701, 2019

  19. [19]

    Z. Liu, H. Zhang, Z. Chen, Z. Wang, and W. Ouyang. Disentangling and unifying graph convolutions for skeleton-based action recognition. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 143–152, 2020

  20. [20]

    R. Memmesheimer, S. Häring, N. Theisen, and D. Paulus. Skeleton-dml: Deep metric learning for skeleton-based one-shot action recognition. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pages 3702–3710, 2022

  21. [21]

    R. Memmesheimer, N. Theisen, and D. Paulus. Sl-dml: Signal level deep metric learning for multimodal one-shot action recognition. In 2020 25th International conference on pattern recognition (ICPR), pages 4573–4580. IEEE, 2021

  22. [22]

    W. Myung, N. Su, J.-H. Xue, and G. Wang. Degcn: Deformable graph convolutional networks for skeleton-based action recognition. IEEE Transactions on Image Processing, 33:2477–2490, 2024

  23. [23]

    Q. Pan and X. Xie. Language-guided temporal primitive modeling for skeleton-based action recognition. Neurocomputing, 613:128636, 2025

  24. [24]

    H. Qu, Y. Cai, and J. Liu. Llms are good action recognizers. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 18395–18406, 2024

  25. [25]

    A. Sabater, L. Santos, J. Santos-Victor, A. Bernardino, L. Montesano, and A. C. Murillo. One-shot action recognition in challenging therapy scenarios. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 2777–2785, 2021

  26. [26]

    A. Shahroudy, J. Liu, T.-T. Ng, and G. Wang. Ntu rgb+ d: A large scale dataset for 3d human activity analysis. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 1010–1019, 2016

  27. [27]

    N. Trivedi, A. Thatipelli, and R. K. Sarvadevabhatla. NTU-X: an enhanced large-scale dataset for improving pose-based recognition of subtle human actions. In Proceedings of the Twelfth Indian Conference on Computer Vision, Graphics and Image Processing, pages 1–9, 2021

  28. [28]

    L. Wang and P. Koniusz. 3Mformer: Multi-order Multi-mode Transformer for Skeletal Action Recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 5620–5631, 2023

  29. [29]

    X. Wang, X. Xu, and Y. Mu. Neural koopman pooling: Control-inspired temporal dynamics encoding for skeleton-based action recognition. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 10597–10607, 2023

  30. [30]

    Y. Wang, Y. Wu, W. He, X. Guo, F. Zhu, L. Bai, R. Zhao, J. Wu, T. He, W. Ouyang, et al. Hulk: A universal knowledge translator for human-centric tasks. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2025

  31. [31]

    L. Xiang and Z. Wang. Joint mixing data augmentation for skeleton-based action recognition. ACM Transactions on Multimedia Computing, Communications and Applications, 21(4):1–24, 2025

  32. [32]

    H. Xu, Y. Gao, Z. Hui, J. Li, and X. Gao. Language Knowledge-Assisted Representation Learning for Skeleton-Based Action Recognition. arXiv preprint arXiv:2305.12398, 2023

  33. [33]

    S. Yang, J. Liu, S. Lu, E. M. Hwa, and A. C. Kot. One-shot action recognition via multi-scale spatial-temporal skeleton matching. IEEE Transactions on Pattern Analysis and Machine Intelligence, 46(7):5149–5156, 2024

  34. [34]

    J. Zhang, L. Lin, and J. Liu. Shap-mix: Shapley value guided mixing for long-tailed skeleton based action recognition. arXiv preprint arXiv:2407.12312, 2024

  35. [35]

    Z. Zhang, W. Cai, Q. Liu, and Y. Wang. SkeletonX: Data-Efficient Skeleton-based Action Recognition via Cross-sample Feature Aggregation. arXiv preprint arXiv:2504.11749, 2025

  36. [36]

    L. Zhou and X. Jiao. Multi-modal and multi-part with skeletons and texts for action recognition. Expert Systems with Applications, page 126646, 2025

  37. [37]

    Y. Zhou, T. Xu, C. Wu, X. Wu, and J. Kittler. Adaptive hyper-graph convolution network for skeleton-based human action recognition with virtual connections. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 12648–12658, 2025

  38. [38]

    Y. Zhou, X. Yan, Z.-Q. Cheng, Y. Yan, Q. Dai, and X.-S. Hua. Blockgcn: Redefine topology awareness for skeleton-based action recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 2049–2058, 2024

  39. [39]

    W. Zhu, X. Ma, Z. Liu, L. Liu, W. Wu, and Y. Wang. MotionBERT: A Unified Perspective on Learning Human Motion Representations. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 15085–15099, 2023.

Appendix A. Calibration Details

As mentioned in the main text, the calibration process can be split into three steps:

  • Estimat...