Skarimva: Skeleton-based Action Recognition is a Multi-view Application
Pith reviewed 2026-05-22 10:59 UTC · model grok-4.3
The pith
Using multiple camera views to triangulate more accurate 3D skeletons improves state-of-the-art action recognition models significantly.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
By making use of multiple camera views to triangulate more accurate 3D skeletons, the performance of state-of-the-art action recognition models can be improved significantly. This suggests that the quality of the input data is currently a limiting factor for the performance of these models.
What carries the argument
Multi-view triangulation of 3D skeletons from 2D pose detections across cameras.
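As a rough illustration of what triangulating a joint from several 2D detections involves, the sketch below uses the direct linear transform (DLT), which the simulated rebuttal names as the core algorithm. The function names, the two toy cameras, and the test joint are assumptions made for this example and are not taken from the paper.

```python
import numpy as np

def triangulate_dlt(projections, points_2d):
    """Triangulate one 3D point from two or more calibrated views via DLT.

    projections: list of 3x4 camera projection matrices (assumed known).
    points_2d:   list of (u, v) detections of the same joint, one per view.
    """
    rows = []
    for P, (u, v) in zip(projections, points_2d):
        P = np.asarray(P, dtype=float)
        # Each view contributes two linear constraints on the homogeneous point.
        rows.append(u * P[2] - P[0])
        rows.append(v * P[2] - P[1])
    A = np.stack(rows)
    # The least-squares solution is the right singular vector that belongs
    # to the smallest singular value of A.
    _, _, vt = np.linalg.svd(A)
    X = vt[-1]
    return X[:3] / X[3]

def project(P, X):
    """Project a 3D point into a view and return its (u, v) coordinates."""
    x = np.asarray(P, dtype=float) @ np.append(X, 1.0)
    return x[:2] / x[2]

# Two hypothetical cameras with identity intrinsics; the second is shifted in x.
P1 = np.hstack([np.eye(3), np.zeros((3, 1))])
P2 = np.hstack([np.eye(3), np.array([[-1.0], [0.0], [0.0]])])
joint = np.array([0.2, -0.1, 4.0])
estimate = triangulate_dlt([P1, P2], [project(P1, joint), project(P2, joint)])
```

With noise-free detections the estimate recovers the joint position; with real 2D detector noise, additional views add constraints to the same linear system, which is the intuition behind the accuracy argument.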
If this is right
- Existing models achieve higher accuracy once supplied with triangulated skeletons.
- The cost-benefit ratio of adding cameras is favorable for most practical deployments.
- Future work in skeleton-based recognition should adopt multi-view capture as the default configuration.
Where Pith is reading between the lines
- Emphasizing input quality may reduce pressure to develop ever-larger neural architectures for this task.
- Multi-view triangulation could also improve robustness when subjects are partially occluded.
- The same triangulation principle may transfer to other 3D reconstruction problems that currently rely on single-view estimates.
Load-bearing premise
The observed gains in recognition accuracy are produced by the higher geometric accuracy of the triangulated skeletons rather than by differences in training procedures or dataset composition between the single-view and multi-view conditions.
What would settle it
Run the identical action recognition model on single-view skeletons versus multi-view triangulated skeletons taken from the exact same video sequences while holding all training and evaluation steps fixed, then measure whether accuracy differs.
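One way such a paired comparison could be scored is an exact McNemar test on per-clip correctness, since both conditions are evaluated on the same sequences. This is a minimal sketch: the choice of test, the function name, and the ten correctness flags are all assumptions for illustration and do not come from the paper.

```python
from math import comb

def mcnemar_exact(correct_a, correct_b):
    """Two-sided exact McNemar test on paired per-sample correctness flags."""
    # Only samples on which the two conditions disagree carry information.
    only_a = sum(1 for a, b in zip(correct_a, correct_b) if a and not b)
    only_b = sum(1 for a, b in zip(correct_a, correct_b) if b and not a)
    n = only_a + only_b
    if n == 0:
        return only_a, only_b, 1.0
    k = min(only_a, only_b)
    # Under the null hypothesis the disagreements split 50/50 (binomial, p = 0.5).
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return only_a, only_b, min(1.0, 2 * tail)

# Hypothetical flags for ten test clips (True = the predicted class was right).
single_view = [True, False, False, True, False, True, False, False, True, False]
multi_view = [True, True, True, True, False, True, True, True, True, True]
only_single, only_multi, p_value = mcnemar_exact(single_view, multi_view)
```

Here `only_single` and `only_multi` count the clips that exactly one condition classified correctly; a small `p_value` would indicate that the difference is unlikely to be a chance split of the disagreements.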
Original abstract
Human action recognition plays an important role when developing intelligent interactions between humans and machines. While there is a lot of active research on improving the machine learning algorithms for skeleton-based action recognition, not much attention has been given to the quality of the input skeleton data itself. This work demonstrates that by making use of multiple camera views to triangulate more accurate 3D skeletons, the performance of state-of-the-art action recognition models can be improved significantly. This suggests that the quality of the input data is currently a limiting factor for the performance of these models. Based on these results, it is argued that the cost-benefit ratio of using multiple cameras is very favorable in most practical use-cases, therefore future research in skeleton-based action recognition should consider multi-view applications as the standard setup.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript claims that the quality of input 3D skeleton data is a limiting factor for skeleton-based action recognition models. It argues that triangulating skeletons from multiple camera views produces measurably more accurate 3D poses than single-view methods, and that feeding these higher-quality skeletons into existing state-of-the-art recognition architectures yields significant performance gains. The authors conclude that the cost-benefit ratio favors multi-view capture and recommend it as the new standard setup for the field.
Significance. If the performance gains can be shown to arise specifically from improved triangulation accuracy under controlled conditions, the result would reorient research priorities in skeleton-based action recognition toward data acquisition rather than solely toward model architecture. It would also provide a concrete, low-cost intervention that could be adopted immediately by practitioners.
major comments (3)
- [§4.1, §4.2] §4.1 and §4.2: the experimental design does not demonstrate that single-view and multi-view conditions differ only in skeleton precision. The manuscript must explicitly state whether the identical raw video sequences, camera calibrations, subject pose distributions, and action class balances were used for both conditions, or whether multi-view recordings were collected separately and may therefore differ in lighting, subject behavior, or 2D detector performance.
- [Table 2] Table 2 (or equivalent results table): the reported accuracy improvements lack error bars, statistical significance tests, or ablation isolating the triangulation step from other multi-view effects (e.g., better 2D keypoint detection due to redundant views). Without these controls the attribution of gains to 3D reconstruction accuracy remains unverified.
- [§3.2] §3.2: the claim that multi-view triangulation is 'parameter-free' relative to single-view lifting is not supported by the description of the triangulation procedure; any choice of camera selection, outlier rejection threshold, or bundle-adjustment iterations introduces hyperparameters that must be reported and held constant across baselines.
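The error bars and paired significance test requested in the Table 2 comment could be computed along the following lines. This is a sketch under stated assumptions: the accuracies are invented placeholders rather than results from the paper, the helper names are hypothetical, and the pairing assumes each seed was trained under both conditions.

```python
from math import sqrt
from statistics import mean, stdev

def summarize(accuracies):
    """Return the mean and sample standard deviation across training runs."""
    return mean(accuracies), stdev(accuracies)

def paired_t_statistic(a, b):
    """t statistic for paired samples (same seeds under both conditions)."""
    diffs = [y - x for x, y in zip(a, b)]
    return mean(diffs) / (stdev(diffs) / sqrt(len(diffs)))

# Hypothetical top-1 accuracies (%) for five seeds; not values from the paper.
single_view = [80.0, 81.0, 79.0, 80.0, 80.0]
multi_view = [83.0, 85.0, 82.0, 84.0, 83.0]

single_mean, single_std = summarize(single_view)
multi_mean, multi_std = summarize(multi_view)
t_stat = paired_t_statistic(single_view, multi_view)
```

The resulting `t_stat` would be compared against a Student t distribution with `len(single_view) - 1` degrees of freedom to obtain a p-value; the standard deviations supply the error bars.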
minor comments (2)
- [Abstract] The abstract states that performance 'improves significantly' yet supplies no numerical deltas or dataset names; this quantitative summary should appear in the abstract itself.
- [§2, §3] Notation for 3D joint coordinates is introduced inconsistently between §2 and §3; a single, explicit definition (e.g., J ∈ ℝ^{3×K}) should be used throughout.
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed feedback. We address each major comment below and indicate the revisions we will make to strengthen the manuscript.
- Referee: [§4.1, §4.2] §4.1 and §4.2: the experimental design does not demonstrate that single-view and multi-view conditions differ only in skeleton precision. The manuscript must explicitly state whether the identical raw video sequences, camera calibrations, subject pose distributions, and action class balances were used for both conditions, or whether multi-view recordings were collected separately and may therefore differ in lighting, subject behavior, or 2D detector performance.
  Authors: We agree that explicit clarification is required. The experiments in §4.1 and §4.2 were performed on the identical multi-view video sequences. For the single-view condition we selected one camera from the multi-view capture and applied single-view lifting to that view alone; the multi-view condition triangulated using all views of the same sequences. Consequently, raw video, calibrations, subject poses, action distributions, lighting, and 2D detector outputs are identical by construction. We will add a clear statement to this effect in the revised §4.1 and §4.2.
  Revision: yes
- Referee: [Table 2] Table 2 (or equivalent results table): the reported accuracy improvements lack error bars, statistical significance tests, or ablation isolating the triangulation step from other multi-view effects (e.g., better 2D keypoint detection due to redundant views). Without these controls the attribution of gains to 3D reconstruction accuracy remains unverified.
  Authors: We accept this criticism. In the revision we will augment Table 2 with error bars (standard deviation across runs) and report paired statistical significance tests. Our current single-view baselines already employ the identical 2D detector as the multi-view pipeline, so the primary difference is the 3D reconstruction method. We will add a short ablation that compares triangulation with and without view-redundancy fusion at the 2D stage to further isolate the contribution of improved 3D accuracy.
  Revision: partial
- Referee: [§3.2] §3.2: the claim that multi-view triangulation is 'parameter-free' relative to single-view lifting is not supported by the description of the triangulation procedure; any choice of camera selection, outlier rejection threshold, or bundle-adjustment iterations introduces hyperparameters that must be reported and held constant across baselines.
  Authors: The referee is correct that the wording in §3.2 is imprecise. While the core triangulation algorithm (DLT) contains fewer learned parameters than neural lifting methods, we did apply fixed outlier-rejection thresholds and a fixed number of bundle-adjustment iterations. We will revise §3.2 to remove the 'parameter-free' phrasing, describe the exact procedure, and list all hyperparameters together with the statement that they were held constant for all reported comparisons.
  Revision: yes
Circularity Check
No circularity: empirical demonstration only
The paper advances an empirical claim that multi-view triangulation yields more accurate 3D skeletons and thereby improves downstream action-recognition accuracy. No equations, derivations, fitted parameters, or self-referential definitions appear in the abstract or described content. The central argument rests on experimental comparison rather than reducing by construction to its own inputs, self-citations, or ansatzes. This is a standard empirical study whose load-bearing steps are external benchmarks and controlled measurements, not internal redefinitions.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Triangulation from multiple calibrated cameras yields higher-accuracy 3D joint positions than single-view estimation.
Lean theorems connected to this paper
- IndisputableMonolith/Foundation/AlexanderDuality.lean, alexander_duality_circle_linking (tag: unclear)
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Passage: "by making use of multiple camera views to triangulate more accurate 3D skeletons, the performance of state-of-the-art action recognition models can be improved significantly"
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.