Skarimva: Skeleton-based Action Recognition is a Multi-view Application

Alexander Poeppel; Daniel Bermuth; Wolfgang Reif

REVIEW 3 major objections 2 minor 39 references

Using multiple camera views to triangulate more accurate 3D skeletons improves state-of-the-art action recognition models significantly.

Reviewed by Pith at T0; open to challenge. T0 means a machine referee read the full paper against a public rubric. the ladder, T0–T4 →

Challenge this review Re-run · record.json Download PDF Read on arXiv ↗

T0 review · grok-4.3

2026-05-22 10:59 UTC pith:PTROZZCP

load-bearing objection Multi-view triangulation lifts performance on existing action recognition models, but the experiments may not isolate that effect from other data collection differences. the 3 major comments →

arxiv 2602.23231 v2 pith:PTROZZCP submitted 2026-02-26 cs.CV

Skarimva: Skeleton-based Action Recognition is a Multi-view Application

Daniel Bermuth , Alexander Poeppel , Wolfgang Reif This is my paper

classification cs.CV

keywords skeleton-based action recognitionmulti-view triangulation3D skeleton estimationhuman action recognitioncomputer visionpose estimationmulti-camera systems

verification ladder T0 review T1 audit T2 compute T3 formal T4 reserved

The pith

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper demonstrates that skeleton-based action recognition models are held back by the quality of their input data rather than by shortcomings in the learning algorithms themselves. Triangulating 3D joint positions from several synchronized camera views produces cleaner skeletons that raise recognition accuracy on standard benchmarks. A reader would care because the result points to a simple, hardware-level lever that delivers gains without requiring new model architectures. The authors therefore recommend treating multi-view capture as the normal setup for this task.

Core claim

By making use of multiple camera views to triangulate more accurate 3D skeletons, the performance of state-of-the-art action recognition models can be improved significantly. This suggests that the quality of the input data is currently a limiting factor for the performance of these models.

What carries the argument

Multi-view triangulation of 3D skeletons from 2D pose detections across cameras.

Load-bearing premise

The observed gains in recognition accuracy are produced by the higher geometric accuracy of the triangulated skeletons rather than by differences in training procedures or dataset composition between the single-view and multi-view conditions.

What would settle it

Run the identical action recognition model on single-view skeletons versus multi-view triangulated skeletons taken from the exact same video sequences while holding all training and evaluation steps fixed, then measure whether accuracy differs.

Watch this falsifier — get emailed when new claim-graph text bears on it.

If this is right

Existing models achieve higher accuracy once supplied with triangulated skeletons.
The cost-benefit ratio of adding cameras is favorable for most practical deployments.
Future work in skeleton-based recognition should adopt multi-view capture as the default configuration.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Emphasizing input quality may reduce pressure to develop ever-larger neural architectures for this task.
Multi-view triangulation could also improve robustness when subjects are partially occluded.
The same triangulation principle may transfer to other 3D reconstruction problems that currently rely on single-view estimates.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit.

Desk Editor's Note

Multi-view triangulation lifts performance on existing action recognition models, but the experiments may not isolate that effect from other data collection differences.

read the letter

The main takeaway is that triangulating 3D skeletons from multiple camera views improves results on current state-of-the-art action recognition models, which suggests input data quality has been holding things back more than the algorithms themselves. The authors apply this to show gains without proposing a new model or derivation, and they make a practical case that adding cameras is usually worth it for robotics or monitoring setups. This shifts the conversation toward treating multi-view capture as the default rather than chasing single-view tweaks. That focus on the input side is a useful reminder when so much work stays inside the model architecture. The cost-benefit argument for most real applications lands reasonably well if the gains are real. The paper engages honestly with the existing literature on skeleton methods and positions its contribution as an empirical demonstration rather than a theoretical advance. The soft spot is the experimental controls. If the single-view and multi-view recordings differ in raw 2D detection quality, subject behavior, or other factors beyond just the number of views used for triangulation, then the performance delta cannot be attributed cleanly to better 3D skeletons. The abstract claims significant improvement but does not show the numbers or describe how the conditions were matched, so the strength of the claim depends on details that need verification. This paper is for researchers building practical human-machine systems who already use skeleton data and want to improve results without rewriting their models. A reader interested in sensor setup and real-world deployment will get the most out of it. It shows clear thinking on the problem and has a testable claim with practical implications, so it deserves a serious referee. I would send it to peer review and ask specifically for the quantitative results, baselines, and evidence that the conditions differed only in the triangulation step.

Referee Report

3 major / 2 minor

Summary. The manuscript claims that the quality of input 3D skeleton data is a limiting factor for skeleton-based action recognition models. It argues that triangulating skeletons from multiple camera views produces measurably more accurate 3D poses than single-view methods, and that feeding these higher-quality skeletons into existing state-of-the-art recognition architectures yields significant performance gains. The authors conclude that the cost-benefit ratio favors multi-view capture and recommend it as the new standard setup for the field.

Significance. If the performance gains can be shown to arise specifically from improved triangulation accuracy under controlled conditions, the result would reorient research priorities in skeleton-based action recognition toward data acquisition rather than solely toward model architecture. It would also provide a concrete, low-cost intervention that could be adopted immediately by practitioners.

major comments (3)

[§4.1, §4.2] §4.1 and §4.2: the experimental design does not demonstrate that single-view and multi-view conditions differ only in skeleton precision. The manuscript must explicitly state whether the identical raw video sequences, camera calibrations, subject pose distributions, and action class balances were used for both conditions, or whether multi-view recordings were collected separately and may therefore differ in lighting, subject behavior, or 2D detector performance.
[Table 2] Table 2 (or equivalent results table): the reported accuracy improvements lack error bars, statistical significance tests, or ablation isolating the triangulation step from other multi-view effects (e.g., better 2D keypoint detection due to redundant views). Without these controls the attribution of gains to 3D reconstruction accuracy remains unverified.
[§3.2] §3.2: the claim that multi-view triangulation is 'parameter-free' relative to single-view lifting is not supported by the description of the triangulation procedure; any choice of camera selection, outlier rejection threshold, or bundle-adjustment iterations introduces hyperparameters that must be reported and held constant across baselines.

minor comments (2)

[Abstract] The abstract states that performance 'improves significantly' yet supplies no numerical deltas or dataset names; this quantitative summary should appear in the abstract itself.
[§2, §3] Notation for 3D joint coordinates is introduced inconsistently between §2 and §3; a single, explicit definition (e.g., J ∈ ℝ^{3×K}) should be used throughout.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We address each major comment below and indicate the revisions we will make to strengthen the manuscript.

read point-by-point responses

Referee: [§4.1, §4.2] §4.1 and §4.2: the experimental design does not demonstrate that single-view and multi-view conditions differ only in skeleton precision. The manuscript must explicitly state whether the identical raw video sequences, camera calibrations, subject pose distributions, and action class balances were used for both conditions, or whether multi-view recordings were collected separately and may therefore differ in lighting, subject behavior, or 2D detector performance.

Authors: We agree that explicit clarification is required. The experiments in §4.1 and §4.2 were performed on the identical multi-view video sequences. For the single-view condition we selected one camera from the multi-view capture and applied single-view lifting to that view alone; the multi-view condition triangulated using all views of the same sequences. Consequently, raw video, calibrations, subject poses, action distributions, lighting, and 2D detector outputs are identical by construction. We will add a clear statement to this effect in the revised §4.1 and §4.2. revision: yes
Referee: [Table 2] Table 2 (or equivalent results table): the reported accuracy improvements lack error bars, statistical significance tests, or ablation isolating the triangulation step from other multi-view effects (e.g., better 2D keypoint detection due to redundant views). Without these controls the attribution of gains to 3D reconstruction accuracy remains unverified.

Authors: We accept this criticism. In the revision we will augment Table 2 with error bars (standard deviation across runs) and report paired statistical significance tests. Our current single-view baselines already employ the identical 2D detector as the multi-view pipeline, so the primary difference is the 3D reconstruction method. We will add a short ablation that compares triangulation with and without view-redundancy fusion at the 2D stage to further isolate the contribution of improved 3D accuracy. revision: partial
Referee: [§3.2] §3.2: the claim that multi-view triangulation is 'parameter-free' relative to single-view lifting is not supported by the description of the triangulation procedure; any choice of camera selection, outlier rejection threshold, or bundle-adjustment iterations introduces hyperparameters that must be reported and held constant across baselines.

Authors: The referee is correct that the wording in §3.2 is imprecise. While the core triangulation algorithm (DLT) contains fewer learned parameters than neural lifting methods, we did apply fixed outlier-rejection thresholds and a fixed number of bundle-adjustment iterations. We will revise §3.2 to remove the 'parameter-free' phrasing, describe the exact procedure, and list all hyperparameters together with the statement that they were held constant for all reported comparisons. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical demonstration only

full rationale

The paper advances an empirical claim that multi-view triangulation yields more accurate 3D skeletons and thereby improves downstream action-recognition accuracy. No equations, derivations, fitted parameters, or self-referential definitions appear in the abstract or described content. The central argument rests on experimental comparison rather than reducing by construction to its own inputs, self-citations, or ansatzes. This is a standard empirical study whose load-bearing steps are external benchmarks and controlled measurements, not internal redefinitions.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claim rests on the unstated premise that triangulation from multiple views produces measurably superior 3D skeletons and that any accuracy gain transfers directly to downstream action classifiers without additional confounding variables.

axioms (1)

domain assumption Triangulation from multiple calibrated cameras yields higher-accuracy 3D joint positions than single-view estimation.
Invoked implicitly when the abstract states that multi-view produces 'more accurate 3D skeletons'.

pith-pipeline@v0.9.0 · 5662 in / 1138 out tokens · 31859 ms · 2026-05-22T10:59:28.469880+00:00 · methodology

0 comments

read the original abstract

Human action recognition plays an important role when developing intelligent interactions between humans and machines. While there is a lot of active research on improving the machine learning algorithms for skeleton-based action recognition, not much attention has been given to the quality of the input skeleton data itself. This work demonstrates that by making use of multiple camera views to triangulate more accurate 3D~skeletons, the performance of state-of-the-art action recognition models can be improved significantly. This suggests that the quality of the input data is currently a limiting factor for the performance of these models. Based on these results, it is argued that the cost-benefit ratio of using multiple cameras is very favorable in most practical use-cases, therefore future research in skeleton-based action recognition should consider multi-view applications as the standard setup.

Figures

Figures reproduced from arXiv: 2602.23231 by Alexander Poeppel, Daniel Bermuth, Wolfgang Reif.

**Figure 1.** Figure 1: Example of a kick other person action with the new multi-view whole-body skeletons. II. RELATED WORK Most research in skeleton-based action recognition has focused on developing new model architectures, while only few works have investigated the influence of input skeleton quality so far. The original NTU-RGBD dataset [26], [18] was created with three non-calibrated Kinect RGB-D cameras, from which each ca… view at source ↗

**Figure 2.** Figure 2: Extrinsic calibration by overlapping skeletons from [PITH_FULL_IMAGE:figures/full_fig_p006_2.png] view at source ↗

**Figure 3.** Figure 3: Confusion matrix of the ensembled ProtoGCN model on NTU-RGBD-60-xsub [PITH_FULL_IMAGE:figures/full_fig_p008_3.png] view at source ↗

**Figure 4.** Figure 4: Confusion matrix of the ensembled ProtoGCN model on NTU-RGBD-120-xsub [PITH_FULL_IMAGE:figures/full_fig_p009_4.png] view at source ↗

discussion (0)

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/AlexanderDuality.lean alexander_duality_circle_linking unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

by making use of multiple camera views to triangulate more accurate 3D skeletons, the performance of state-of-the-art action recognition models can be improved significantly

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

39 extracted references · 39 canonical work pages

[1]

K. C. Alowonou and J.-H. Han. MSA-GCN: Exploiting Multi-Scale Temporal Dynamics With Adaptive Graph Convolution for Skeleton- Based Action Recognition.IEEE Access, 2024

work page 2024
[2]

Bermuth, A

D. Bermuth, A. Poeppel, and W. Reif. V oxelkeypointfusion: Gen- eralizable multi-view multi-person pose estimation.arXiv preprint arXiv:2410.18723, 2024

work page arXiv 2024
[3]

Bermuth, A

D. Bermuth, A. Poeppel, and W. Reif. RapidPoseTriangulation: Multi-view Multi-person Whole-body Human Pose Triangulation in a Millisecond.arXiv preprint arXiv:2503.21692, 2025

work page arXiv 2025
[4]

Bermuth, A

D. Bermuth, A. Poeppel, and W. Reif. Tutabo-1: towards real-time capable AI-based safety systems for human-robot collaboration. In 2025 IEEE International Conference on Advanced Robotics (ICAR). Institute of Electrical and Electronics Engineers (IEEE), 2025

work page 2025
[5]

L. Cao, S. Huai, and J. Gai. Reenvisioning Skeleton-based Action Recognition Through the Lens of NLP. InICASSP 2025-2025 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 1–5. IEEE, 2025

work page 2025
[6]

Y . Chen, Z. Zhang, C. Yuan, B. Li, Y . Deng, and W. Hu. Channel- wise topology refinement graph convolution for skeleton-based action recognition. InProceedings of the IEEE/CVF international conference on computer vision, pages 13359–13368, 2021

work page 2021
[7]

H.-g. Chi, M. H. Ha, S. Chi, S. W. Lee, Q. Huang, and K. Ramani. Infogcn: Representation learning for human skeleton-based action recognition. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 20186–20196, 2022

work page 2022
[8]

Do and M

J. Do and M. Kim. Skateformer: skeletal-temporal transformer for human action recognition. InEuropean Conference on Computer Vision, pages 401–420. Springer, 2024

work page 2024
[9]

H. Duan, J. Wang, K. Chen, and D. Lin. DG-STGCN: dynamic spatial-temporal modeling for skeleton-based action recognition.arXiv preprint arXiv:2210.05895, 2022

work page arXiv 2022
[10]

H. Duan, J. Wang, K. Chen, and D. Lin. Pyskl: Towards good practices for skeleton action recognition. InProceedings of the 30th ACM International Conference on Multimedia, pages 7351–7354, 2022

work page 2022
[11]

H. Duan, Y . Zhao, K. Chen, D. Lin, and B. Dai. Revisiting skeleton- based action recognition. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 2969–2978, 2022

work page 2022
[12]

H. Joo, H. Liu, L. Tan, L. Gui, B. Nabbe, I. Matthews, T. Kanade, S. Nobuhara, and Y . Sheikh. Panoptic studio: A massively multiview system for social motion capture. InProceedings of the IEEE international conference on computer vision, pages 3334–3342, 2015

work page 2015
[13]

J. Lee, M. Lee, D. Lee, and S. Lee. Hierarchically decomposed graph convolutional networks for skeleton-based action recognition. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 10444–10453, 2023

work page 2023
[14]

T.-Y . Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Doll´ar, and C. L. Zitnick. Microsoft COCO: Common objects in context. InEuropean conference on computer vision, pages 740–755. Springer, 2014

work page 2014
[15]

D. Liu, Y . Hu, K. Hua, Y . Lu, Z. Zhang, X. Ma, Z. Zhong, and P. Chen. TDSN-GCN: Transformerify Overall Structure Decaying Static Graph Embedding NAS-guided GCN for Skeleton Action Recognition.IEEE Transactions on Circuits and Systems for Video Technology, 2025

work page 2025
[16]

H. Liu, Y . Liu, M. Ren, H. Wang, Y . Wang, and Z. Sun. Revealing key details to see differences: A novel prototypical perspective for skeleton-based action recognition. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 29248–29257, 2025

work page 2025
[17]

H. Liu, Y . Liu, C. Wang, Y . Wang, and Z. Sun. SkeletonAgent: An Agentic Interaction Framework for Skeleton-based Action Recognition. arXiv preprint arXiv:2511.22433, 2025

work page arXiv 2025
[18]

J. Liu, A. Shahroudy, M. Perez, G. Wang, L.-Y . Duan, and A. C. Kot. Ntu rgb+ d 120: A large-scale benchmark for 3d human activity understanding.IEEE transactions on pattern analysis and machine intelligence, 42(10):2684–2701, 2019

work page 2019
[19]

Z. Liu, H. Zhang, Z. Chen, Z. Wang, and W. Ouyang. Disentangling and unifying graph convolutions for skeleton-based action recognition. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 143–152, 2020

work page 2020
[20]

Memmesheimer, S

R. Memmesheimer, S. H¨aring, N. Theisen, and D. Paulus. Skeleton-dml: Deep metric learning for skeleton-based one-shot action recognition. InProceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pages 3702–3710, 2022

work page 2022
[21]

Memmesheimer, N

R. Memmesheimer, N. Theisen, and D. Paulus. Sl-dml: Signal level deep metric learning for multimodal one-shot action recognition. In 2020 25th International conference on pattern recognition (ICPR), pages 4573–4580. IEEE, 2021

work page 2020
[22]

Myung, N

W. Myung, N. Su, J.-H. Xue, and G. Wang. Degcn: Deformable graph convolutional networks for skeleton-based action recognition.IEEE Transactions on Image Processing, 33:2477–2490, 2024

work page 2024
[23]

Pan and X

Q. Pan and X. Xie. Language-guided temporal primitive modeling for skeleton-based action recognition.Neurocomputing, 613:128636, 2025

work page 2025
[24]

H. Qu, Y . Cai, and J. Liu. Llms are good action recognizers. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 18395–18406, 2024

work page 2024
[25]

Sabater, L

A. Sabater, L. Santos, J. Santos-Victor, A. Bernardino, L. Montesano, and A. C. Murillo. One-shot action recognition in challenging therapy scenarios. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 2777–2785, 2021

work page 2021
[26]

Shahroudy, J

A. Shahroudy, J. Liu, T.-T. Ng, and G. Wang. Ntu rgb+ d: A large scale dataset for 3d human activity analysis. InProceedings of the IEEE conference on computer vision and pattern recognition, pages 1010–1019, 2016

work page 2016
[27]

Trivedi, A

N. Trivedi, A. Thatipelli, and R. K. Sarvadevabhatla. NTU-X: an enhanced large-scale dataset for improving pose-based recognition of subtle human actions. InProceedings of the Twelfth Indian Conference on Computer Vision, Graphics and Image Processing, pages 1–9, 2021

work page 2021
[28]

Wang and P

L. Wang and P. Koniusz. 3Mformer: Multi-order Multi-mode Transformer for Skeletal Action Recognition. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 5620–5631, 2023

work page 2023
[29]

X. Wang, X. Xu, and Y . Mu. Neural koopman pooling: Control-inspired temporal dynamics encoding for skeleton-based action recognition. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 10597–10607, 2023

work page 2023
[30]

Y . Wang, Y . Wu, W. He, X. Guo, F. Zhu, L. Bai, R. Zhao, J. Wu, T. He, W. Ouyang, et al. Hulk: A universal knowledge translator for human-centric tasks.IEEE Transactions on Pattern Analysis and Machine Intelligence, 2025

work page 2025
[31]

Xiang and Z

L. Xiang and Z. Wang. Joint mixing data augmentation for skeleton- based action recognition.ACM Transactions on Multimedia Computing, Communications and Applications, 21(4):1–24, 2025

work page 2025
[32]

H. Xu, Y . Gao, Z. Hui, J. Li, and X. Gao. Language Knowledge- Assisted Representation Learning for Skeleton-Based Action Recogni- tion.arXiv preprint arXiv:2305.12398, 2023

work page arXiv 2023
[33]

S. Yang, J. Liu, S. Lu, E. M. Hwa, and A. C. Kot. One-shot action recognition via multi-scale spatial-temporal skeleton matching.IEEE Transactions on Pattern Analysis and Machine Intelligence, 46(7):5149– 5156, 2024

work page 2024
[34]

Zhang, L

J. Zhang, L. Lin, and J. Liu. Shap-mix: Shapley value guided mixing for long-tailed skeleton based action recognition.arXiv preprint arXiv:2407.12312, 2024

work page arXiv 2024
[35]

Zhang, W

Z. Zhang, W. Cai, Q. Liu, and Y . Wang. SkeletonX: Data-Efficient Skeleton-based Action Recognition via Cross-sample Feature Aggrega- tion.arXiv preprint arXiv:2504.11749, 2025

work page arXiv 2025
[36]

Zhou and X

L. Zhou and X. Jiao. Multi-modal and multi-part with skeletons and texts for action recognition.Expert Systems with Applications, page 126646, 2025

work page 2025
[37]

Y . Zhou, T. Xu, C. Wu, X. Wu, and J. Kittler. Adaptive hyper-graph convolution network for skeleton-based human action recognition with virtual connections. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 12648–12658, 2025

work page 2025
[38]

Y . Zhou, X. Yan, Z.-Q. Cheng, Y . Yan, Q. Dai, and X.-S. Hua. Blockgcn: Redefine topology awareness for skeleton-based action recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 2049–2058, 2024

work page 2049
[39]

W. Zhu, X. Ma, Z. Liu, L. Liu, W. Wu, and Y . Wang. MotionBERT: A Unified Perspective on Learning Human Motion Representations. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 15085–15099, 2023. APPENDIX A. Calibration Details Like mentioned in the main text, the calibration process can be split into three steps: •Estimat...

work page 2023

[1] [1]

K. C. Alowonou and J.-H. Han. MSA-GCN: Exploiting Multi-Scale Temporal Dynamics With Adaptive Graph Convolution for Skeleton- Based Action Recognition.IEEE Access, 2024

work page 2024

[2] [2]

Bermuth, A

D. Bermuth, A. Poeppel, and W. Reif. V oxelkeypointfusion: Gen- eralizable multi-view multi-person pose estimation.arXiv preprint arXiv:2410.18723, 2024

work page arXiv 2024

[3] [3]

Bermuth, A

D. Bermuth, A. Poeppel, and W. Reif. RapidPoseTriangulation: Multi-view Multi-person Whole-body Human Pose Triangulation in a Millisecond.arXiv preprint arXiv:2503.21692, 2025

work page arXiv 2025

[4] [4]

Bermuth, A

D. Bermuth, A. Poeppel, and W. Reif. Tutabo-1: towards real-time capable AI-based safety systems for human-robot collaboration. In 2025 IEEE International Conference on Advanced Robotics (ICAR). Institute of Electrical and Electronics Engineers (IEEE), 2025

work page 2025

[5] [5]

L. Cao, S. Huai, and J. Gai. Reenvisioning Skeleton-based Action Recognition Through the Lens of NLP. InICASSP 2025-2025 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 1–5. IEEE, 2025

work page 2025

[6] [6]

Y . Chen, Z. Zhang, C. Yuan, B. Li, Y . Deng, and W. Hu. Channel- wise topology refinement graph convolution for skeleton-based action recognition. InProceedings of the IEEE/CVF international conference on computer vision, pages 13359–13368, 2021

work page 2021

[7] [7]

H.-g. Chi, M. H. Ha, S. Chi, S. W. Lee, Q. Huang, and K. Ramani. Infogcn: Representation learning for human skeleton-based action recognition. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 20186–20196, 2022

work page 2022

[8] [8]

Do and M

J. Do and M. Kim. Skateformer: skeletal-temporal transformer for human action recognition. InEuropean Conference on Computer Vision, pages 401–420. Springer, 2024

work page 2024

[9] [9]

H. Duan, J. Wang, K. Chen, and D. Lin. DG-STGCN: dynamic spatial-temporal modeling for skeleton-based action recognition.arXiv preprint arXiv:2210.05895, 2022

work page arXiv 2022

[10] [10]

H. Duan, J. Wang, K. Chen, and D. Lin. Pyskl: Towards good practices for skeleton action recognition. InProceedings of the 30th ACM International Conference on Multimedia, pages 7351–7354, 2022

work page 2022

[11] [11]

H. Duan, Y . Zhao, K. Chen, D. Lin, and B. Dai. Revisiting skeleton- based action recognition. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 2969–2978, 2022

work page 2022

[12] [12]

H. Joo, H. Liu, L. Tan, L. Gui, B. Nabbe, I. Matthews, T. Kanade, S. Nobuhara, and Y . Sheikh. Panoptic studio: A massively multiview system for social motion capture. InProceedings of the IEEE international conference on computer vision, pages 3334–3342, 2015

work page 2015

[13] [13]

J. Lee, M. Lee, D. Lee, and S. Lee. Hierarchically decomposed graph convolutional networks for skeleton-based action recognition. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 10444–10453, 2023

work page 2023

[14] [14]

T.-Y . Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Doll´ar, and C. L. Zitnick. Microsoft COCO: Common objects in context. InEuropean conference on computer vision, pages 740–755. Springer, 2014

work page 2014

[15] [15]

D. Liu, Y . Hu, K. Hua, Y . Lu, Z. Zhang, X. Ma, Z. Zhong, and P. Chen. TDSN-GCN: Transformerify Overall Structure Decaying Static Graph Embedding NAS-guided GCN for Skeleton Action Recognition.IEEE Transactions on Circuits and Systems for Video Technology, 2025

work page 2025

[16] [16]

H. Liu, Y . Liu, M. Ren, H. Wang, Y . Wang, and Z. Sun. Revealing key details to see differences: A novel prototypical perspective for skeleton-based action recognition. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 29248–29257, 2025

work page 2025

[17] [17]

H. Liu, Y . Liu, C. Wang, Y . Wang, and Z. Sun. SkeletonAgent: An Agentic Interaction Framework for Skeleton-based Action Recognition. arXiv preprint arXiv:2511.22433, 2025

work page arXiv 2025

[18] [18]

J. Liu, A. Shahroudy, M. Perez, G. Wang, L.-Y . Duan, and A. C. Kot. Ntu rgb+ d 120: A large-scale benchmark for 3d human activity understanding.IEEE transactions on pattern analysis and machine intelligence, 42(10):2684–2701, 2019

work page 2019

[19] [19]

Z. Liu, H. Zhang, Z. Chen, Z. Wang, and W. Ouyang. Disentangling and unifying graph convolutions for skeleton-based action recognition. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 143–152, 2020

work page 2020

[20] [20]

Memmesheimer, S

R. Memmesheimer, S. H¨aring, N. Theisen, and D. Paulus. Skeleton-dml: Deep metric learning for skeleton-based one-shot action recognition. InProceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pages 3702–3710, 2022

work page 2022

[21] [21]

Memmesheimer, N

R. Memmesheimer, N. Theisen, and D. Paulus. Sl-dml: Signal level deep metric learning for multimodal one-shot action recognition. In 2020 25th International conference on pattern recognition (ICPR), pages 4573–4580. IEEE, 2021

work page 2020

[22] [22]

Myung, N

W. Myung, N. Su, J.-H. Xue, and G. Wang. Degcn: Deformable graph convolutional networks for skeleton-based action recognition.IEEE Transactions on Image Processing, 33:2477–2490, 2024

work page 2024

[23] [23]

Pan and X

Q. Pan and X. Xie. Language-guided temporal primitive modeling for skeleton-based action recognition.Neurocomputing, 613:128636, 2025

work page 2025

[24] [24]

H. Qu, Y . Cai, and J. Liu. Llms are good action recognizers. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 18395–18406, 2024

work page 2024

[25] [25]

Sabater, L

A. Sabater, L. Santos, J. Santos-Victor, A. Bernardino, L. Montesano, and A. C. Murillo. One-shot action recognition in challenging therapy scenarios. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 2777–2785, 2021

work page 2021

[26] [26]

Shahroudy, J

A. Shahroudy, J. Liu, T.-T. Ng, and G. Wang. Ntu rgb+ d: A large scale dataset for 3d human activity analysis. InProceedings of the IEEE conference on computer vision and pattern recognition, pages 1010–1019, 2016

work page 2016

[27] [27]

Trivedi, A

N. Trivedi, A. Thatipelli, and R. K. Sarvadevabhatla. NTU-X: an enhanced large-scale dataset for improving pose-based recognition of subtle human actions. InProceedings of the Twelfth Indian Conference on Computer Vision, Graphics and Image Processing, pages 1–9, 2021

work page 2021

[28] [28]

Wang and P

L. Wang and P. Koniusz. 3Mformer: Multi-order Multi-mode Transformer for Skeletal Action Recognition. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 5620–5631, 2023

work page 2023

[29] [29]

X. Wang, X. Xu, and Y . Mu. Neural koopman pooling: Control-inspired temporal dynamics encoding for skeleton-based action recognition. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 10597–10607, 2023

work page 2023

[30] [30]

Y . Wang, Y . Wu, W. He, X. Guo, F. Zhu, L. Bai, R. Zhao, J. Wu, T. He, W. Ouyang, et al. Hulk: A universal knowledge translator for human-centric tasks.IEEE Transactions on Pattern Analysis and Machine Intelligence, 2025

work page 2025

[31] [31]

Xiang and Z

L. Xiang and Z. Wang. Joint mixing data augmentation for skeleton- based action recognition.ACM Transactions on Multimedia Computing, Communications and Applications, 21(4):1–24, 2025

work page 2025

[32] [32]

H. Xu, Y . Gao, Z. Hui, J. Li, and X. Gao. Language Knowledge- Assisted Representation Learning for Skeleton-Based Action Recogni- tion.arXiv preprint arXiv:2305.12398, 2023

work page arXiv 2023

[33] [33]

S. Yang, J. Liu, S. Lu, E. M. Hwa, and A. C. Kot. One-shot action recognition via multi-scale spatial-temporal skeleton matching.IEEE Transactions on Pattern Analysis and Machine Intelligence, 46(7):5149– 5156, 2024

work page 2024

[34] [34]

Zhang, L

J. Zhang, L. Lin, and J. Liu. Shap-mix: Shapley value guided mixing for long-tailed skeleton based action recognition.arXiv preprint arXiv:2407.12312, 2024

work page arXiv 2024

[35] [35]

Zhang, W

Z. Zhang, W. Cai, Q. Liu, and Y . Wang. SkeletonX: Data-Efficient Skeleton-based Action Recognition via Cross-sample Feature Aggrega- tion.arXiv preprint arXiv:2504.11749, 2025

work page arXiv 2025

[36] [36]

Zhou and X

L. Zhou and X. Jiao. Multi-modal and multi-part with skeletons and texts for action recognition.Expert Systems with Applications, page 126646, 2025

work page 2025

[37] [37]

Y . Zhou, T. Xu, C. Wu, X. Wu, and J. Kittler. Adaptive hyper-graph convolution network for skeleton-based human action recognition with virtual connections. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 12648–12658, 2025

work page 2025

[38] [38]

Y . Zhou, X. Yan, Z.-Q. Cheng, Y . Yan, Q. Dai, and X.-S. Hua. Blockgcn: Redefine topology awareness for skeleton-based action recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 2049–2058, 2024

work page 2049

[39] [39]

W. Zhu, X. Ma, Z. Liu, L. Liu, W. Wu, and Y . Wang. MotionBERT: A Unified Perspective on Learning Human Motion Representations. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 15085–15099, 2023. APPENDIX A. Calibration Details Like mentioned in the main text, the calibration process can be split into three steps: •Estimat...

work page 2023