pith. machine review for the scientific record.

arxiv: 2604.17013 · v1 · submitted 2026-04-18 · 💻 cs.CV

Recognition: unknown

Towards Universal Skeleton-Based Action Recognition

Authors on Pith: no claims yet

Pith reviewed 2026-05-10 07:23 UTC · model grok-4.3

classification 💻 cs.CV
keywords skeleton-based action recognition · heterogeneous data · open-vocabulary recognition · Transformer model · motion-text alignment · human-robot interaction · contrastive learning · unified representation

The pith

A Transformer model with multi-grained text alignment recognizes actions from heterogeneous skeleton sources using an integrated open-vocabulary dataset.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper addresses the fact that skeleton data for action recognition comes from varied human sources and robot structures, creating heterogeneity that prior models ignore by assuming uniform inputs. It builds a large Heterogeneous Open-Vocabulary Skeleton dataset by combining and cleaning existing large-scale collections, then introduces a Transformer architecture with unified skeleton representation, a two-stream motion encoder, and contrastive alignment between motion features and text at global, stream-specific, and fine-grained levels. This setup aims to produce representations that work across different joint layouts and coordinate systems while supporting actions described by arbitrary vocabulary. A sympathetic reader would care because real-world human-robot interaction requires understanding actions without retraining for each new sensor or robot body, and without being limited to a fixed list of action labels.

Core claim

The paper claims that its Transformer-based model, built around unified skeleton representation, a motion encoder that processes multi-modal embeddings in two streams, and multi-grained motion-text alignment via contrastive learning, enables effective and generalizable skeleton-based action recognition on heterogeneous data with open vocabularies, as shown through extensive experiments on popular benchmarks containing mixed skeleton sources.

What carries the argument

The multi-grained motion-text alignment that performs contrastive learning at global instance, stream-specific, and fine-grained levels to map the two-stream Transformer motion representations into a shared semantic space with text embeddings.
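
The three alignment levels can be sketched as a weighted sum of InfoNCE-style contrastive terms. A minimal numpy sketch, assuming one positive text per motion sample; the function names, temperature, and loss weights here are illustrative, not the paper's exact formulation:

```python
import numpy as np

def logsumexp(x, axis):
    # numerically stable log-sum-exp, keeping the reduced axis
    m = x.max(axis=axis, keepdims=True)
    return m + np.log(np.exp(x - m).sum(axis=axis, keepdims=True))

def info_nce(motion, text, temperature=0.07):
    """Symmetric InfoNCE between motion and text embeddings.

    motion, text: (N, D) arrays; row i of each is a positive pair.
    """
    m = motion / np.linalg.norm(motion, axis=1, keepdims=True)
    t = text / np.linalg.norm(text, axis=1, keepdims=True)
    logits = m @ t.T / temperature                 # (N, N) cosine similarities
    # cross-entropy with the diagonal as the positive class, both directions
    log_p_mt = logits - logsumexp(logits, axis=1)  # motion -> text
    log_p_tm = logits - logsumexp(logits, axis=0)  # text -> motion
    n = len(motion)
    return -0.5 * (np.trace(log_p_mt) + np.trace(log_p_tm)) / n

def multi_grained_loss(global_m, stream_ms, part_ms,
                       text, stream_texts, part_texts,
                       weights=(1.0, 0.5, 0.5)):
    """Global, stream-specific, and fine-grained alignment terms combined."""
    w_g, w_s, w_f = weights
    loss = w_g * info_nce(global_m, text)
    loss += w_s * sum(info_nce(s, st) for s, st in zip(stream_ms, stream_texts))
    loss += w_f * sum(info_nce(p, pt) for p, pt in zip(part_ms, part_texts))
    return loss
```

Aligned motion-text pairs should score a lower loss than randomly paired embeddings, which is what the contrastive objective trains toward.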

If this is right

  • The same model parameters can process skeleton inputs from multiple human capture systems and humanoid robots without retraining.
  • Action labels described in natural language but absent from training data become recognizable through the text alignment pathway.
  • Performance gains appear on standard benchmarks when those benchmarks are presented as mixed heterogeneous collections rather than isolated uniform ones.
  • The approach directly supports scenarios where both human and robot skeletons must be interpreted under one recognition system.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The framework could support transfer of learned motion patterns from human datasets directly to robot control loops without additional skeleton conversion steps.
  • Real-time applications might benefit if the two-stream encoder is adapted for streaming input rather than fixed clips.
  • Extending the alignment to include additional modalities such as depth or audio could further reduce reliance on any single skeleton format.

Load-bearing premise

Combining existing skeleton datasets after refinement produces a single benchmark whose differences in joint definitions, coordinate systems, and action semantics are consistent enough for the model to learn general features instead of dataset-specific patterns.
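
One way to probe this premise (a diagnostic of our own, not a procedure from the paper) is to compare simple motion statistics across sources before and after normalization: if source-specific signatures persist, their empirical distributions stay far apart. A hypothetical numpy sketch using a two-sample Kolmogorov-Smirnov statistic on per-joint speeds:

```python
import numpy as np

def ks_statistic(a, b):
    """Two-sample Kolmogorov-Smirnov statistic: max gap between empirical CDFs."""
    a, b = np.sort(a), np.sort(b)
    grid = np.concatenate([a, b])
    cdf_a = np.searchsorted(a, grid, side="right") / len(a)
    cdf_b = np.searchsorted(b, grid, side="right") / len(b)
    return np.abs(cdf_a - cdf_b).max()

def joint_speeds(seq):
    """Flattened per-frame joint speeds for a (T, J, 3) sequence;
    a crude source-signature statistic."""
    return np.linalg.norm(np.diff(seq, axis=0), axis=-1).ravel()
```

If normalizing each source (e.g. by its own scale) drives the KS statistic toward zero while the raw statistic is large, the residual signature was a coordinate-scale artifact rather than genuine motion content.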

What would settle it

A test set of newly captured skeletons with joint layouts or coordinate systems markedly different from the integrated training data, where the model shows no improvement over separate per-dataset baselines, would falsify the generalization claim.

Figures

Figures reproduced from arXiv: 2604.17013 by Hongsong Wang, Jidong Kuang, Jie Gui.

Figure 1. Comparison of heterogeneous skeletons from various sources
Figure 2. Characteristics of skeleton structure and sample distribution
Figure 3. Overview of the proposed architecture for heterogeneous skeleton-based Action Recognition with Open Vocabularies
Figure 5. Radar Chart of Per-Class Accuracy Improvements
Figure 6. Sensitivity analysis of the calibration factor
Original abstract

With the development of robotics, skeleton-based action recognition has become increasingly important, as human-robot interaction requires understanding the actions of humans and humanoid robots. Due to different sources of human skeletons and structures of humanoid robots, skeleton data naturally exhibit heterogeneity. However, previous works overlook the data heterogeneity of skeletons and solely construct models using homogeneous skeletons. Moreover, open-vocabulary action recognition is also essential for real-world applications. To this end, this work studies the challenging problem of heterogeneous skeleton-based action recognition with open vocabularies. We construct a large-scale Heterogeneous Open-Vocabulary (HOV) Skeleton dataset by integrating and refining multiple representative large-scale skeleton-based action datasets. To address universal skeleton-based action recognition, we propose a Transformer-based model that comprises three key components: unified skeleton representation, motion encoder for skeletons, and multi-grained motion-text alignment. The motion encoder feeds multi-modal skeleton embeddings into a two-stream Transformer-based encoder to learn spatio-temporal action representations, which are then mapped to a semantic space to align with text embeddings. Multi-grained motion-text alignment incorporates contrastive learning at three levels: global instance alignment, stream-specific alignment, and fine-grained alignment. Extensive experiments on popular benchmarks with heterogeneous skeleton data demonstrate both the effectiveness and the generalization ability of the proposed method. Code is available at https://github.com/jidongkuang/Universal-Skeleton.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper addresses heterogeneous skeleton-based action recognition with open vocabularies by constructing the HOV dataset through integration of multiple existing large-scale skeleton datasets and proposing a Transformer-based architecture. The model includes a unified skeleton representation, a two-stream motion encoder that processes multi-modal skeleton embeddings for spatio-temporal features, and multi-grained motion-text alignment via contrastive losses at global instance, stream-specific, and fine-grained levels. Experiments on popular benchmarks with heterogeneous data are claimed to show effectiveness and generalization ability, with code released.

Significance. If the results hold, the work has moderate significance for advancing skeleton-based action recognition beyond homogeneous assumptions toward real-world applications like human-robot interaction. The open-vocabulary setting and multi-grained alignment are timely contributions. Explicit credit is due for releasing code at the provided GitHub link, which supports reproducibility.

major comments (2)
  1. [§3.1] Unified Skeleton Representation: The description of integrating datasets with differing joint cardinalities, coordinate systems (camera vs. world), and action vocabularies does not provide a concrete canonicalization procedure or quantitative validation that residual source-specific signatures are removed. This mapping is load-bearing for the generalization claim, as incomplete alignment would allow the two-stream Transformer and contrastive losses to exploit dataset artifacts rather than learn universal features.
  2. [§4] Experiments: All reported results use the integrated HOV data; no ablation or cross-dataset transfer experiments are described that isolate whether performance gains stem from the proposed components versus dataset-specific statistics. This directly affects the central claim of generalization to heterogeneous skeletons.
minor comments (2)
  1. [Abstract] The abstract and §2 could more clearly distinguish the proposed multi-grained alignment from standard contrastive learning baselines in skeleton-text models.
  2. [§3.3] Notation for the three alignment losses (global, stream-specific, fine-grained) should be introduced with explicit equations in §3.3 for clarity.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. We address each major comment point by point below and outline the revisions we will make to strengthen the presentation of the unified skeleton representation and the experimental validation of generalization.

Point-by-point responses
  1. Referee: [§3.1] Unified Skeleton Representation: The description of integrating datasets with differing joint cardinalities, coordinate systems (camera vs. world), and action vocabularies does not provide a concrete canonicalization procedure or quantitative validation that residual source-specific signatures are removed. This mapping is load-bearing for the generalization claim, as incomplete alignment would allow the two-stream Transformer and contrastive losses to exploit dataset artifacts rather than learn universal features.

    Authors: We agree that the current description in §3.1 would benefit from greater explicitness. In the revised manuscript we will expand this section with a concrete canonicalization procedure: all skeletons are mapped to a common 25-joint topology by retaining overlapping joints and zero-padding or linearly interpolating missing ones; coordinate systems are aligned to a shared world frame via affine transformations derived from available camera parameters (or relative normalization when parameters are absent); action labels are unified through a manually curated semantic ontology that merges synonymous classes across sources. We will also add quantitative validation consisting of (i) pre- and post-canonicalization distribution comparisons (e.g., Kolmogorov-Smirnov statistics on joint-angle and velocity histograms) demonstrating substantial reduction of source-specific signatures and (ii) an ablation that trains the model on non-canonicalized data and reports the resulting drop in cross-source performance. These additions will directly substantiate that the learned features are universal rather than artifact-driven. revision: yes
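
The canonicalization the rebuttal describes (retain overlapping joints, pad missing ones, normalize coordinates) could look roughly like the following sketch; the 25-joint target and the joint_map are assumptions for illustration, not the paper's actual mapping tables:

```python
import numpy as np

NUM_CANONICAL_JOINTS = 25  # common target topology assumed in the rebuttal

def canonicalize(seq, joint_map, src_root=0):
    """Map a (T, J_src, 3) skeleton sequence onto the canonical joint layout.

    joint_map[src_idx] = canonical_idx for joints present in this source;
    canonical joints with no source counterpart stay zero-padded. Sequences
    are root-centered first so differing world/camera frames become comparable.
    """
    seq = seq - seq[:, src_root:src_root + 1]  # root-center every frame
    out = np.zeros((seq.shape[0], NUM_CANONICAL_JOINTS, 3))
    for src, canon in joint_map.items():
        out[:, canon] = seq[:, src]
    return out
```

A 20-joint source sequence, for example, fills 20 canonical slots and leaves the remaining 5 at zero, so downstream encoders always see a fixed layout regardless of origin.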

  2. Referee: [§4] Experiments: All reported results use the integrated HOV data; no ablation or cross-dataset transfer experiments are described that isolate whether performance gains stem from the proposed components versus dataset-specific statistics. This directly affects the central claim of generalization to heterogeneous skeletons.

    Authors: We acknowledge that while the manuscript reports results on the integrated HOV dataset together with evaluations on established heterogeneous benchmarks, dedicated cross-dataset transfer ablations that train on one source subset and test on another are not presented in sufficient detail. In the revision we will insert a new subsection under §4 that includes (i) leave-one-source-out transfer experiments, (ii) component-wise ablations (removing the unified representation, the two-stream encoder, or individual contrastive losses) under these transfer protocols, and (iii) statistical comparisons against a baseline that receives only dataset-specific statistics. These experiments will isolate the contribution of each proposed component and thereby reinforce the generalization claims. revision: yes
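
The leave-one-source-out protocol the authors promise is a standard split; a minimal sketch (source names hypothetical):

```python
def leave_one_source_out(samples):
    """Yield (held_out_source, train, test) splits from (source, sample) pairs.

    Training on all-but-one source and testing on the held-out one isolates
    whether learned features transfer across skeleton sources rather than
    memorizing per-dataset statistics.
    """
    for held_out in sorted({src for src, _ in samples}):
        train = [x for src, x in samples if src != held_out]
        test = [x for src, x in samples if src == held_out]
        yield held_out, train, test
```

Running the full model and each ablated variant under every such split is what would separate genuine cross-source generalization from dataset-specific gains.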

Circularity Check

0 steps flagged

No significant circularity in derivation chain

full rationale

The paper proposes an empirical method: integrating existing skeleton datasets into HOV, then applying a two-stream Transformer with multi-grained contrastive alignment. No equations or formal derivations are presented that reduce claimed performance or generalization to fitted parameters, self-definitions, or self-citation chains. The central claims rest on experimental results on benchmarks rather than a load-bearing mathematical step that is equivalent to its inputs by construction. Dataset integration and architectural choices are independent preprocessing and modeling decisions, not self-referential.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the validity of the merged dataset and the assumption that the proposed alignment strategy generalizes across skeleton heterogeneity; no explicit free parameters or invented physical entities are described.

axioms (1)
  • domain assumption Skeleton data from different sources can be unified into a common representation without loss of critical motion information.
    Invoked by the unified skeleton representation component of the model.

pith-pipeline@v0.9.0 · 5535 in / 1261 out tokens · 81784 ms · 2026-05-10T07:23:18.846574+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

88 extracted references · 5 canonical work pages

  1. [1]

    Maskclr: Attention-guided contrastive learning for robust action representation learning

    Mohamed Abdelfattah, Mariam Hassan, and Alexandre Alahi. Maskclr: Attention-guided contrastive learning for robust action representation learning. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 18678–18687, 2024. 2

  2. [2]

    Skeleton- aware networks for deep motion retargeting.ACM Transac- tions on Graphics, 39(4):62–1, 2020

    Kfir Aberman, Peizhuo Li, Dani Lischinski, Olga Sorkine- Hornung, Daniel Cohen-Or, and Baoquan Chen. Skeleton- aware networks for deep motion retargeting.ACM Transac- tions on Graphics, 39(4):62–1, 2020. 2

  3. [3]

    A cross- dataset study for text-based 3d human motion retrieval

    L ´eore Bensabath, Mathis Petrovich, and Gul Varol. A cross- dataset study for text-based 3d human motion retrieval. In Proceedings of the IEEE/CVF Conference on Computer Vi- sion and Pattern Recognition, pages 1932–1940, 2024. 3

  4. [4]

    An empirical study and analysis of generalized zero- shot learning for object recognition in the wild

    Wei-Lun Chao, Soravit Changpinyo, Boqing Gong, and Fei Sha. An empirical study and analysis of generalized zero- shot learning for object recognition in the wild. InCom- puter Vision–ECCV 2016: 14th European Conference, Am- sterdam, The Netherlands, October 11-14, 2016, Proceed- ings, Part II 14, pages 52–68. Springer, 2016. 15

  5. [5]

    Skeleton- based action recognition with non-linear dependency model- ing and hilbert-schmidt independence criterion

    Haipeng Chen, Yuheng Yang, and Yingda Lyu. Skeleton- based action recognition with non-linear dependency model- ing and hilbert-schmidt independence criterion. InProceed- ings of the AAAI Conference on Artificial Intelligence, pages 2043–2051, 2025. 15

  6. [6]

    Channel-wise topology refinement graph convolution for skeleton-based action recognition

    Yuxin Chen, Ziqi Zhang, Chunfeng Yuan, Bing Li, Ying Deng, and Weiming Hu. Channel-wise topology refinement graph convolution for skeleton-based action recognition. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 13359–13368, 2021. 3

  7. [7]

    Hi- erarchically self-supervised transformer for human skeleton representation learning

    Yuxiao Chen, Long Zhao, Jianbo Yuan, Yu Tian, Zhaoyang Xia, Shijie Geng, Ligong Han, and Dimitris N Metaxas. Hi- erarchically self-supervised transformer for human skeleton representation learning. InEuropean Conference on Com- puter Vision, pages 185–202. Springer, 2022. 2

  8. [8]

    Fine-grained side information guided dual-prompts for zero-shot skeleton action recognition

    Yang Chen, Jingcai Guo, Tian He, Xiaocheng Lu, and Ling Wang. Fine-grained side information guided dual-prompts for zero-shot skeleton action recognition. InProceedings of the 32nd ACM International Conference on Multimedia, pages 778–786, 2024. 2, 16

  9. [9]

    Neu- ron: Learning context-aware evolving representations for zero-shot skeleton action recognition

    Yang Chen, Jingcai Guo, Song Guo, and Dacheng Tao. Neu- ron: Learning context-aware evolving representations for zero-shot skeleton action recognition. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 8721–8730, 2025. 2, 16

  10. [10]

    Skeleton-based action recognition with shift graph convolutional network

    Ke Cheng, Yifan Zhang, Xiangyu He, Weihan Chen, Jian Cheng, and Hanqing Lu. Skeleton-based action recognition with shift graph convolutional network. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 183–192, 2020. 3

  11. [11]

    Hierarchical transformer: Unsupervised representation learning for skeleton-based hu- man action recognition

    Yi-Bin Cheng, Xipeng Chen, Junhong Chen, Pengxu Wei, Dongyu Zhang, and Liang Lin. Hierarchical transformer: Unsupervised representation learning for skeleton-based hu- man action recognition. In2021 IEEE International Con- ference on Multimedia and Expo (ICME), pages 1–6. IEEE,

  12. [12]

    Bridging the skeleton- text modality gap: Diffusion-powered modality alignment for zero-shot skeleton-based action recognition

    Jeonghyeok Do and Munchurl Kim. Bridging the skeleton- text modality gap: Diffusion-powered modality alignment for zero-shot skeleton-based action recognition. InProceed- ings of the IEEE/CVF International Conference on Com- puter Vision, pages 12757–12768, 2025. 2, 16

  13. [13]

    Representation learning of temporal dynamics for skeleton-based action recognition

    Yong Du, Yun Fu, and Liang Wang. Representation learning of temporal dynamics for skeleton-based action recognition. IEEE Transactions on Image Processing, 25(7):3010–3022,

  14. [14]

    Revisiting skeleton-based action recognition

    Haodong Duan, Yue Zhao, Kai Chen, Dahua Lin, and Bo Dai. Revisiting skeleton-based action recognition. InPro- ceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 2969–2978, 2022. 3, 13

  15. [15]

    Skeletr: Towards skeleton-based action recognition in the wild

    Haodong Duan, Mingze Xu, Bing Shuai, Davide Mod- olo, Zhuowen Tu, Joseph Tighe, and Alessandro Bergamo. Skeletr: Towards skeleton-based action recognition in the wild. InProceedings of the IEEE/CVF international con- ference on computer vision, pages 13634–13644, 2023. 3

  16. [16]

    Unified pose sequence modeling

    Lin Geng Foo, Tianjiao Li, Hossein Rahmani, Qiuhong Ke, and Jun Liu. Unified pose sequence modeling. InProceed- ings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 13019–13030, 2023. 3

  17. [17]

    Hyperbolic self-paced learning for self-supervised skeleton-based action representations

    Luca Franco, Paolo Mandica, Bharti Munjal, and Fabio Galasso. Hyperbolic self-paced learning for self-supervised skeleton-based action representations. InInternational Con- ference on Learning Representations, 2023. 15

  18. [18]

    De- vise: A deep visual-semantic embedding model.Advances in Neural Information Processing Systems, 26, 2013

    Andrea Frome, Greg S Corrado, Jon Shlens, Samy Bengio, Jeff Dean, Marc’Aurelio Ranzato, and Tomas Mikolov. De- vise: A deep visual-semantic embedding model.Advances in Neural Information Processing Systems, 26, 2013. 16

  19. [19]

    Efficient spatio-temporal con- trastive learning for skeleton-based 3-d action recognition

    Xuehao Gao, Yang Yang, Yimeng Zhang, Maosen Li, Jin- Gang Yu, and Shaoyi Du. Efficient spatio-temporal con- trastive learning for skeleton-based 3-d action recognition. IEEE Transactions on Multimedia, 25:405–417, 2021. 8

  20. [20]

    Rethinking masked data reconstruction pretraining for strong 3d action representation learning

    Tao Gong, Qi Chu, Bin Liu, and Nenghai Yu. Rethinking masked data reconstruction pretraining for strong 3d action representation learning. InProceedings of the AAAI Confer- ence on Artificial Intelligence, pages 3149–3157, 2025. 2, 15

  21. [21]

    Dmmg: dual min-max games for self-supervised skeleton-based action recognition.IEEE Transactions on Im- age Processing, 33:395–407, 2023

    Shannan Guan, Xin Yu, Wei Huang, Gengfa Fang, and Haiyan Lu. Dmmg: dual min-max games for self-supervised skeleton-based action recognition.IEEE Transactions on Im- age Processing, 33:395–407, 2023. 2

  22. [22]

    Ac- tion2motion: Conditioned generation of 3d human motions

    Chuan Guo, Xinxin Zuo, Sen Wang, Shihao Zou, Qingyao Sun, Annan Deng, Minglun Gong, and Li Cheng. Ac- tion2motion: Conditioned generation of 3d human motions. InProceedings of the 28th ACM International Conference on Multimedia, pages 2021–2029, 2020. 13

  23. [23]

    Generating diverse and natural 3d 9 human motions from text

    Chuan Guo, Shihao Zou, Xinxin Zuo, Sen Wang, Wei Ji, Xingyu Li, and Li Cheng. Generating diverse and natural 3d 9 human motions from text. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 5152–5161, 2022. 3, 13

  24. [24]

    Contrastive learning from ex- tremely augmented skeleton sequences for self-supervised action recognition

    Tianyu Guo, Hong Liu, Zhan Chen, Mengyuan Liu, Tao Wang, and Runwei Ding. Contrastive learning from ex- tremely augmented skeleton sequences for self-supervised action recognition. InProceedings of the AAAI conference on artificial intelligence, pages 762–770, 2022. 2, 15

  25. [25]

    Syntactically guided generative embeddings for zero-shot skeleton action recognition

    Pranay Gupta, Divyanshu Sharma, and Ravi Kiran Sarvadev- abhatla. Syntactically guided generative embeddings for zero-shot skeleton action recognition. In2021 IEEE Interna- tional Conference on Image Processing (ICIP), pages 439–

  26. [26]

    Part aware contrastive learning for self-supervised action recognition

    Yilei Hua, Wenhan Wu, Ce Zheng, Aidong Lu, Mengyuan Liu, Chen Chen, and Shiqian Wu. Part aware contrastive learning for self-supervised action recognition. InProceed- ings of the Thirty-Second International Joint Conference on Artificial Intelligence, pages 855–863, 2023. 2

  27. [27]

    Graph contrastive learn- ing for skeleton-based action recognition.arXiv preprint arXiv:2301.10900, 2023

    Xiaohu Huang, Hao Zhou, Jian Wang, Haocheng Feng, Junyu Han, Errui Ding, Jingdong Wang, Xinggang Wang, Wenyu Liu, and Bin Feng. Graph contrastive learn- ing for skeleton-based action recognition.arXiv preprint arXiv:2301.10900, 2023. 15

  28. [28]

    Learning robust visual-semantic embed- dings

    Yao-Hung Hubert Tsai, Liang-Kang Huang, and Ruslan Salakhutdinov. Learning robust visual-semantic embed- dings. InProceedings of the IEEE International Conference on Computer Vision, pages 3571–3580, 2017. 16

  29. [29]

    Skeleton based zero shot action recognition in joint pose-language semantic space.arXiv preprint arXiv:1911.11344, 2019

    Bhavan Jasani and Afshaan Mazagonwalla. Skeleton based zero shot action recognition in joint pose-language semantic space.arXiv preprint arXiv:1911.11344, 2019. 7, 16

  30. [30]

    Global-local motion transformer for unsupervised skeleton-based action learning

    Boeun Kim, Hyung Jin Chang, Jungho Kim, and Jin Young Choi. Global-local motion transformer for unsupervised skeleton-based action learning. InEuropean conference on computer vision, pages 209–225. Springer, 2022. 8

  31. [31]

    Hierarchically decomposed graph convolutional net- works for skeleton-based action recognition

    Jungho Lee, Minhyeok Lee, Dogyoon Lee, and Sangyoun Lee. Hierarchically decomposed graph convolutional net- works for skeleton-based action recognition. InProceedings of the IEEE/CVF international conference on computer vi- sion, pages 10444–10453, 2023. 15

  32. [32]

    3d human action rep- resentation learning via cross-view consistency pursuit

    Linguo Li, Minsi Wang, Bingbing Ni, Hang Wang, Jiancheng Yang, and Wenjun Zhang. 3d human action rep- resentation learning via cross-view consistency pursuit. In Proceedings of the IEEE/CVF conference on computer vi- sion and pattern recognition, pages 4741–4750, 2021. 8

  33. [33]

    Multi-semantic fusion model for generalized zero-shot skeleton-based action recognition

    Ming-Zhe Li, Zhen Jia, Zhang Zhang, Zhanyu Ma, and Liang Wang. Multi-semantic fusion model for generalized zero-shot skeleton-based action recognition. InInternational Conference on Image and Graphics, pages 68–80. Springer,

  34. [34]

    Sa-dvae: Improv- ing zero-shot skeleton-based action recognition by disentan- gled variational autoencoders

    Sheng-Wei Li, Zi-Xiang Wei, Wei-Jie Chen, Yi-Hsin Yu, Chih-Yuan Yang, and Jane Yung-jen Hsu. Sa-dvae: Improv- ing zero-shot skeleton-based action recognition by disentan- gled variational autoencoders. InEuropean Conference on Computer Vision, pages 447–462. Springer, 2024. 2, 16

  35. [35]

    Ms2l: Multi-task self-supervised learning for skeleton based action recognition

    Lilang Lin, Sijie Song, Wenhan Yang, and Jiaying Liu. Ms2l: Multi-task self-supervised learning for skeleton based action recognition. InProceedings of the 28th ACM international conference on multimedia, pages 2490–2498, 2020. 8

  36. [36]

    Actionlet- dependent contrastive learning for unsupervised skeleton- based action recognition

    Lilang Lin, Jiahang Zhang, and Jiaying Liu. Actionlet- dependent contrastive learning for unsupervised skeleton- based action recognition. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 2363–2372, 2023. 2, 15

  37. [37]

    Idempotent unsupervised representation learning for skeleton-based action recognition

    Lilang Lin, Lehong Wu, Jiahang Zhang, and Jiaying Liu. Idempotent unsupervised representation learning for skeleton-based action recognition. InEuropean Conference on Computer Vision, pages 75–92. Springer, 2024. 2

  38. [38]

    Revealing key details to see differ- ences: A novel prototypical perspective for skeleton-based action recognition

    Hongda Liu, Yunfan Liu, Min Ren, Hao Wang, Yunlong Wang, and Zhenan Sun. Revealing key details to see differ- ences: A novel prototypical perspective for skeleton-based action recognition. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 29248–29257,

  39. [39]

    Ntu rgb+ d 120: A large- scale benchmark for 3d human activity understanding.IEEE Transactions on Pattern Analysis and Machine Intelligence, 42(10):2684–2701, 2019

    Jun Liu, Amir Shahroudy, Mauricio Perez, Gang Wang, Ling-Yu Duan, and Alex C Kot. Ntu rgb+ d 120: A large- scale benchmark for 3d human activity understanding.IEEE Transactions on Pattern Analysis and Machine Intelligence, 42(10):2684–2701, 2019. 3, 13

  40. [40]

    Smpl: A skinned multi- person linear model.ACM Transactions on Graphics, 34(6),

    Matthew Loper, Naureen Mahmood, Javier Romero, Gerard Pons-Moll, and Michael J Black. Smpl: A skinned multi- person linear model.ACM Transactions on Graphics, 34(6),

  41. [41]

    Amass: Archive of motion capture as surface shapes

    Naureen Mahmood, Nima Ghorbani, Nikolaus F Troje, Ger- ard Pons-Moll, and Michael J Black. Amass: Archive of motion capture as surface shapes. InProceedings of the IEEE/CVF international conference on computer vision, pages 5442–5451, 2019. 13

  42. [42]

    Cmd: Self-supervised 3d action representation learning with cross-modal mutual distillation

    Yunyao Mao, Wengang Zhou, Zhenbo Lu, Jiajun Deng, and Houqiang Li. Cmd: Self-supervised 3d action representation learning with cross-modal mutual distillation. InEuropean Conference on Computer Vision, pages 734–752. Springer,

  43. [43]

    Unsupervised 3d human pose representation with viewpoint and pose disen- tanglement

    Qiang Nie, Ziwei Liu, and Yunhui Liu. Unsupervised 3d human pose representation with viewpoint and pose disen- tanglement. InEuropean Conference on Computer Vision, pages 102–118. Springer, 2020. 2

  44. [44]

    Babel: Bodies, action and behavior with english la- bels

    Abhinanda R Punnakkal, Arjun Chandrasekaran, Nikos Athanasiou, Alejandra Quiros-Ramirez, and Michael J Black. Babel: Bodies, action and behavior with english la- bels. InProceedings of the IEEE/CVF Conference on Com- puter Vision and Pattern Recognition, pages 722–731, 2021. 4, 13

  45. [45]

    Llms are good action recognizers

    Haoxuan Qu, Yujun Cai, and Jun Liu. Llms are good action recognizers. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 18395– 18406, 2024. 3

  46. [46]

    Generalized zero-and few-shot learning via aligned variational autoencoders

    Edgar Schonfeld, Sayna Ebrahimi, Samarth Sinha, Trevor Darrell, and Zeynep Akata. Generalized zero-and few-shot learning via aligned variational autoencoders. InProceed- ings of the IEEE/CVF conference on computer vision and pattern recognition, pages 8247–8255, 2019. 7, 16

  47. [47]

    Ntu rgb+ d: A large scale dataset for 3d human activity anal- ysis

    Amir Shahroudy, Jun Liu, Tian-Tsong Ng, and Gang Wang. Ntu rgb+ d: A large scale dataset for 3d human activity anal- ysis. InProceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1010–1019, 2016. 3, 13 10

  48. [48]

    Ski models: Skeleton induced vision- language embeddings for understanding activities of daily living

    Arkaprava Sinha, Dominick Reilly, Francois Bremond, Pu Wang, and Srijan Das. Ski models: Skeleton induced vision- language embeddings for understanding activities of daily living. InProceedings of the AAAI Conference on Artificial Intelligence, pages 6931–6939, 2025. 2

  49. [49]

    Predict & cluster: Unsupervised skeleton based action recognition

    Kun Su, Xiulong Liu, and Eli Shlizerman. Predict & cluster: Unsupervised skeleton based action recognition. InProceed- ings of the IEEE/CVF conference on computer vision and pattern recognition, pages 9631–9640, 2020. 8

  50. [50]

    Self-supervised 3d skeleton action representation learning with motion con- sistency and continuity

    Yukun Su, Guosheng Lin, and Qingyao Wu. Self-supervised 3d skeleton action representation learning with motion con- sistency and continuity. InProceedings of the IEEE/CVF international conference on computer vision, pages 13328– 13338, 2021. 2

  51. [51]

    Uni- fied multi-modal unsupervised representation learning for skeleton-based action understanding

    Shengkai Sun, Daizong Liu, Jianfeng Dong, Xiaoye Qu, Junyu Gao, Xun Yang, Xun Wang, and Meng Wang. Uni- fied multi-modal unsupervised representation learning for skeleton-based action understanding. InProceedings of the 31st ACM International Conference on Multimedia, pages 2973–2984, 2023. 2, 14, 15, 16

  52. [52]

    Hongsong Wang and Liang Wang. Modeling temporal dynamics and spatial configurations of actions using two-stream recurrent neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 499–508, 2017.

  53. [53]

    Hongsong Wang and Liang Wang. Beyond joints: Learning representations from primitive geometries for skeleton-based action recognition and detection. IEEE Transactions on Image Processing, 27(9):4382–4394, 2018.

  54. [54]

    Hongsong Wang, Xiaoyan Ma, Jidong Kuang, and Jie Gui. Heterogeneous skeleton-based action representation learning. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 19154–19164, 2025.

  55. [55]

    Jiang Wang, Xiaohan Nie, Yin Xia, Ying Wu, and Song-Chun Zhu. Cross-view action modeling, learning and recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2649–2656, 2014.

  56. [56]

    Peng Wang, Jun Wen, Chenyang Si, Yuntao Qian, and Liang Wang. Contrast-reconstruction representation learning for self-supervised skeleton-based action recognition. IEEE Transactions on Image Processing, 31:6224–6238, 2022.

  57. [57]

    Jiangning Wei, Lixiong Qin, Bo Yu, Tianjian Zou, Chuhan Yan, Dandan Xiao, Yang Yu, Lan Yang, Ke Li, and Jun Liu. VA-AR: Learning velocity-aware action representations with mixture of window attention. In Proceedings of the AAAI Conference on Artificial Intelligence, pages 8286–8294, 2025.

  58. [58]

    Yuhang Wen, Zixuan Tang, Yunsheng Pang, Beichen Ding, and Mengyuan Liu. Interactive spatiotemporal token attention network for skeleton-based general interactive action recognition. In 2023 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 7886–7892. IEEE, 2023.

  59. [59]

    Yuhang Wen, Mengyuan Liu, Songtao Wu, and Beichen Ding. CHASE: Learning convex hull adaptive shift for skeleton-based multi-entity action recognition. Advances in Neural Information Processing Systems, 37:9388–9420,

  60. [60]

    Michael Wray, Diane Larlus, Gabriela Csurka, and Dima Damen. Fine-grained action retrieval through multiple parts-of-speech embeddings. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 450–459,

  61. [61]

    Cong Wu, Xiao-Jun Wu, Josef Kittler, Tianyang Xu, Sara Ahmed, Muhammad Awais, and Zhenhua Feng. SCD-Net: Spatiotemporal clues disentanglement network for self-supervised skeleton-based action recognition. In Proceedings of the AAAI Conference on Artificial Intelligence, pages 5949–5957, 2024.

  62. [62]

    Wenhan Wu, Zhishuai Guo, Chen Chen, Hongfei Xue, and Aidong Lu. Frequency-semantic enhanced variational autoencoder for zero-shot skeleton-based action recognition. arXiv preprint arXiv:2506.22179, 2025.

  63. [63]

    Wangmeng Xiang, Chao Li, Yuxuan Zhou, Biao Wang, and Lei Zhang. Generative action description prompts for skeleton-based action recognition. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 10276–10285, 2023.

  64. [64]

    Jianyang Xie, Yanda Meng, Yitian Zhao, Anh Nguyen, Xiaoyun Yang, and Yalin Zheng. Dynamic semantic-based spatial graph convolution network for skeleton-based human action recognition. In Proceedings of the AAAI Conference on Artificial Intelligence, pages 6225–6233, 2024.

  65. [65]

    Haojun Xu, Yan Gao, Jie Li, and Xinbo Gao. An information compensation framework for zero-shot skeleton-based action recognition. IEEE Transactions on Multimedia, 2025.

  66. [66]

    Shihao Xu, Haocong Rao, Xiping Hu, Jun Cheng, and Bin Hu. Prototypical contrast and reverse prediction: Unsupervised skeleton based action recognition. IEEE Transactions on Multimedia, 25:624–634, 2021.

  67. [67]

    Ziwei Xu, Xudong Shen, Yongkang Wong, and Mohan S Kankanhalli. Unsupervised motion representation learning with capsule autoencoders. Advances in Neural Information Processing Systems, 34:3205–3217, 2021.

  68. [68]

    Sijie Yan, Yuanjun Xiong, and Dahua Lin. Spatial temporal graph convolutional networks for skeleton-based action recognition. In Proceedings of the AAAI Conference on Artificial Intelligence, 2018.

  69. [69]

    Di Yang, Yaohui Wang, Antitza Dantcheva, Lorenzo Garattoni, Gianpiero Francesca, and François Brémond. UNIK: A unified framework for real-world skeleton-based action recognition. arXiv preprint arXiv:2107.08580, 2021.

  70. [70]

    Siyuan Yang, Jun Liu, Shijian Lu, Meng Hwa Er, and Alex C Kot. Skeleton cloud colorization for unsupervised 3D action representation learning. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 13423–13433, 2021.

  71. [71]

    Siyuan Yang, Jun Liu, Shijian Lu, Er Meng Hwa, Yongjian Hu, and Alex C Kot. Self-supervised 3D action representation learning with skeleton cloud colorization. IEEE Transactions on Pattern Analysis and Machine Intelligence, 46(1):509–524, 2023.

  72. [72]

    Yang Yang, Guangjun Liu, and Xuehao Gao. Motion guided attention learning for self-supervised 3D human action recognition. IEEE Transactions on Circuits and Systems for Video Technology, 32(12):8623–8634, 2022.

  73. [73]

    Qinyang Zeng, Chengju Liu, Ming Liu, and Qijun Chen. Contrastive 3D human skeleton action representation learning via CrossMoCo with spatiotemporal occlusion mask data augmentation. IEEE Transactions on Multimedia, 25:1564–1574, 2023.

  74. [74]

    Haoyuan Zhang, Yonghong Hou, Wenjing Zhang, and Wanqing Li. Contrastive positive mining for unsupervised 3D action representation learning. In European Conference on Computer Vision, pages 36–51. Springer, 2022.

  75. [75]

    Jiahang Zhang, Lilang Lin, and Jiaying Liu. Hierarchical consistent contrastive learning for skeleton-based action recognition with growing augmentations. In Proceedings of the AAAI Conference on Artificial Intelligence, pages 3427–3435, 2023.

  76. [76]

    Jiaxu Zhang, Junwu Weng, Di Kang, Fang Zhao, Shaoli Huang, Xuefei Zhe, Linchao Bao, Ying Shan, Jue Wang, and Zhigang Tu. Skinned motion retargeting with residual perception of motion semantics & geometry. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 13864–13872, 2023.

  77. [77]

    Nenggan Zheng, Jun Wen, Risheng Liu, Liangqu Long, Jianhua Dai, and Zhefeng Gong. Unsupervised representation learning with long-term dynamics for skeleton based action recognition. In Proceedings of the AAAI Conference on Artificial Intelligence, 2018.

  78. [78]

    Yaolin Zheng, Hongbo Huang, Xiuying Wang, Xiaoxu Yan, and Longfei Xu. Spatio-temporal fusion for human action recognition via joint trajectory graph. In Proceedings of the AAAI Conference on Artificial Intelligence, pages 7579–7587,

  79. [79]

    Huanyu Zhou, Qingjie Liu, and Yunhong Wang. Learning discriminative representations for skeleton based action recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10608–10617, 2023.

  80. [80]

    Yujie Zhou, Haodong Duan, Anyi Rao, Bing Su, and Jiaqi Wang. Self-supervised action representation learning from partial spatio-temporal skeleton sequences. In Proceedings of the AAAI Conference on Artificial Intelligence, pages 3825–3833, 2023.
