Generative Data Augmentation for Skeleton Action Recognition
Pith reviewed 2026-05-10 11:25 UTC · model grok-4.3
The pith
A conditional generative pipeline using a Transformer encoder-decoder synthesizes realistic skeleton sequences to augment training data and raise action recognition accuracy.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The central claim is that a Transformer-based encoder-decoder architecture, paired with a generative refinement module and dropout, learns to generate high-fidelity and diverse skeleton sequences conditioned on action labels. When these sequences augment the original training sets, multiple skeleton-based action recognition models achieve higher accuracy on HumanAct12 and the refined NTU-RGBD dataset, with gains observed in both few-shot and full-data regimes.
What carries the argument
The Transformer-based encoder-decoder architecture, together with a generative refinement module and a dropout mechanism. The combination is meant to balance fidelity and diversity when sampling action-conditioned skeleton sequences.
If this is right
- Multiple existing skeleton-based recognition models record higher accuracy after training on the augmented data.
- Performance gains occur in both few-shot and full training data regimes.
- The generated sequences transfer effectively across different recognition architectures.
- Effective synthesis remains possible even when the original labeled set is small.
Where Pith is reading between the lines
- The same conditioning approach could be tested on generating sequences for rare actions that appear infrequently in current collections.
- If the refinement module successfully controls artifact levels, the pipeline might apply to other pose-based tasks such as motion prediction.
- Measuring how well generated sequences cover the space of natural pose variations could clarify why recognition improves.
Load-bearing premise
The synthesized skeleton sequences must stay close enough to the real data distribution to add useful variety without introducing artifacts that lower downstream recognition accuracy.
What would settle it
Training the same recognition models on the union of real and generated sequences and observing no accuracy gain or an accuracy drop on held-out real test sets would show the augmentation does not work.
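The test described above can be sketched as a small harness. This is a minimal illustration, not the paper's evaluation code: `augmentation_gain`, `nearest_centroid_accuracy`, and the `(features, label)` sample format are hypothetical names and choices made for this example, and the nearest-centroid classifier is only a toy stand-in for a real skeleton recognition model.

```python
from typing import Callable, Dict, List, Sequence, Tuple

# Hypothetical sample format: (flattened feature vector, action label).
Sample = Tuple[Sequence[float], int]
EvalFn = Callable[[List[Sample], List[Sample]], float]


def nearest_centroid_accuracy(train: List[Sample], test: List[Sample]) -> float:
    """Toy stand-in for a recognizer: classify by the nearest class centroid."""
    sums: Dict[int, List[float]] = {}
    counts: Dict[int, int] = {}
    for features, label in train:
        acc = sums.setdefault(label, [0.0] * len(features))
        for i, value in enumerate(features):
            acc[i] += value
        counts[label] = counts.get(label, 0) + 1
    centroids = {
        label: [v / counts[label] for v in total] for label, total in sums.items()
    }

    def predict(features: Sequence[float]) -> int:
        return min(
            centroids,
            key=lambda label: sum(
                (a - b) ** 2 for a, b in zip(features, centroids[label])
            ),
        )

    return sum(predict(x) == y for x, y in test) / len(test)


def augmentation_gain(
    train_and_eval: EvalFn,
    real_train: List[Sample],
    generated: List[Sample],
    real_test: List[Sample],
) -> float:
    """Accuracy on held-out real data with augmentation minus accuracy without.

    A value at or below zero is the outcome that would count against the claim.
    """
    baseline = train_and_eval(list(real_train), list(real_test))
    augmented = train_and_eval(list(real_train) + list(generated), list(real_test))
    return augmented - baseline
```

In practice `train_and_eval` would wrap a full training run of one recognition model, and the comparison would be repeated over several seeds, since a single run cannot separate a real gain from training noise.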
Original abstract
Skeleton-based human action recognition is a powerful approach for understanding human behaviour from pose data, but collecting large-scale, diverse, and well-annotated 3D skeleton datasets is both expensive and labor-intensive. To address this challenge, we propose a conditional generative pipeline for data augmentation in skeleton action recognition. Our method learns the distribution of real skeleton sequences under the constraint of action labels, enabling the synthesis of diverse and high-fidelity data. Even with limited training samples, it can effectively generate skeleton sequences and achieve competitive recognition performance in low-data scenarios, demonstrating strong generalisation in downstream tasks. Specifically, we introduce a Transformer-based encoder-decoder architecture, combined with a generative refinement module and a dropout mechanism, to balance fidelity and diversity during sampling. Experiments on HumanAct12 and the refined NTU-RGBD (NTU-VIBE) dataset show that our approach consistently improves the accuracy of multiple skeleton-based action recognition models, validating its effectiveness in both few-shot and full-data settings. The source code can be found here.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes a conditional generative pipeline for data augmentation in skeleton action recognition. It employs a Transformer-based encoder-decoder architecture combined with a generative refinement module and dropout mechanism to learn the distribution of real skeleton sequences conditioned on action labels, enabling synthesis of diverse high-fidelity sequences. Experiments on HumanAct12 and refined NTU-RGBD (NTU-VIBE) datasets are reported to show consistent accuracy gains for multiple skeleton-based recognition models in both few-shot and full-data regimes.
Significance. If the generated sequences demonstrably preserve real-data statistics while adding useful diversity, the approach could provide a practical solution to data scarcity in 3D skeleton action recognition, with particular value in low-data settings. The public release of source code supports reproducibility and is a strength.
major comments (2)
- [Abstract and Experiments] Abstract and Experiments section: The claim that the method produces 'high-fidelity' and 'diverse' data is supported solely by downstream accuracy improvements on HumanAct12 and NTU-VIBE; no quantitative fidelity metrics (e.g., per-joint position/velocity errors, Fréchet distance on pose embeddings, or action-conditioned distribution distances) or diversity measures are provided. This is load-bearing because accuracy gains alone cannot distinguish genuine augmentation from regularization effects or label-consistent artifacts.
- [Method and Experiments] Method and Experiments sections: No ablation studies isolate the contribution of the refinement module, the dropout schedule, or the Transformer encoder-decoder itself versus simpler baselines. Without these controls, it is unclear whether the reported gains are attributable to the generative component or to other design choices.
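The per-joint position and velocity errors and the diversity measure named in the first major comment could be computed along the following lines. This is a sketch under stated assumptions, not a metric definition taken from the paper: it assumes each generated sequence is paired with a real sequence of the same length and joint count, represents a sequence as a list of frames of `(x, y, z)` joints, and measures diversity in raw joint-coordinate space. All function names are hypothetical.

```python
import math
from itertools import combinations
from typing import List, Sequence, Tuple

Joint = Tuple[float, float, float]
Frame = Sequence[Joint]
SkeletonSequence = Sequence[Frame]  # assumed layout: frames x joints x (x, y, z)


def mean_per_joint_position_error(gen: SkeletonSequence, real: SkeletonSequence) -> float:
    """Average Euclidean distance between corresponding joints over all frames."""
    total, count = 0.0, 0
    for frame_gen, frame_real in zip(gen, real):
        for joint_gen, joint_real in zip(frame_gen, frame_real):
            total += math.dist(joint_gen, joint_real)
            count += 1
    return total / count


def velocities(seq: SkeletonSequence) -> List[List[Joint]]:
    """Frame-to-frame displacement of every joint (one fewer frame than seq)."""
    return [
        [tuple(b - a for a, b in zip(j0, j1)) for j0, j1 in zip(f0, f1)]
        for f0, f1 in zip(seq, seq[1:])
    ]


def mean_per_joint_velocity_error(gen: SkeletonSequence, real: SkeletonSequence) -> float:
    """Position error applied to the frame-to-frame displacements."""
    return mean_per_joint_position_error(velocities(gen), velocities(real))


def mean_pairwise_distance(seqs: Sequence[SkeletonSequence]) -> float:
    """Diversity proxy: average Euclidean distance between flattened sequences."""
    flat = [[c for frame in s for joint in frame for c in joint] for s in seqs]
    pairs = list(combinations(flat, 2))
    return sum(math.dist(a, b) for a, b in pairs) / len(pairs)
```

Because a label-conditioned generator does not produce samples paired with specific real sequences, the paired errors above only make sense under an explicit matching rule (for example, nearest real neighbour within the same action class); the choice of rule would need to be reported alongside the numbers.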
minor comments (2)
- [Abstract] Abstract: The source-code link is given only as 'here' without a concrete URL; replace with the actual repository address.
- [Notation] Notation and terminology: Ensure consistent distinction between 'skeleton sequences', 'pose data', and 'joint angles' across the manuscript to avoid reader confusion.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback. We agree that strengthening the direct evaluation of fidelity, diversity, and component contributions will improve the manuscript. Below we address each major comment and describe the planned revisions.
Point-by-point responses
- Referee: [Abstract and Experiments] Abstract and Experiments section: The claim that the method produces 'high-fidelity' and 'diverse' data is supported solely by downstream accuracy improvements on HumanAct12 and NTU-VIBE; no quantitative fidelity metrics (e.g., per-joint position/velocity errors, Fréchet distance on pose embeddings, or action-conditioned distribution distances) or diversity measures are provided. This is load-bearing because accuracy gains alone cannot distinguish genuine augmentation from regularization effects or label-consistent artifacts.
  Authors: We acknowledge that reliance on downstream accuracy alone leaves open the possibility of regularization or artifact effects. In the revised manuscript we will add quantitative fidelity and diversity metrics, including: (i) mean per-joint position and velocity errors between generated and real sequences, (ii) Fréchet distance on 3D pose embeddings extracted from a pre-trained action recognition model, and (iii) diversity statistics such as average pairwise Euclidean distance in the latent space and entropy of generated action-class distributions. These will be reported for both few-shot and full-data regimes on HumanAct12 and NTU-VIBE to demonstrate that the observed gains arise from high-fidelity, diverse samples rather than spurious effects.
  Revision: yes
- Referee: [Method and Experiments] Method and Experiments sections: No ablation studies isolate the contribution of the refinement module, the dropout schedule, or the Transformer encoder-decoder itself versus simpler baselines. Without these controls, it is unclear whether the reported gains are attributable to the generative component or to other design choices.
  Authors: We agree that isolating the contribution of each design element is necessary. The revised version will include a dedicated ablation section with the following experiments: (1) full model versus model without the generative refinement module, (2) full model versus variants with fixed or removed dropout schedules, (3) Transformer encoder-decoder replaced by LSTM and MLP baselines while keeping the rest of the pipeline identical, and (4) comparison against non-generative augmentation baselines (e.g., random jitter, interpolation). All ablations will report both recognition accuracy and the new fidelity/diversity metrics on the same datasets and splits.
  Revision: yes
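Two of the metrics promised in the first response, the Fréchet distance on pose embeddings and the entropy of the action-class distribution, can be illustrated with the sketch below. It rests on a labelled simplification: both embedding sets are modelled as Gaussians with diagonal covariance, in which case the squared Fréchet distance reduces to the squared difference of the means plus the squared difference of the per-dimension standard deviations. A full implementation would use complete covariance matrices and a matrix square root. The function names are hypothetical, and the embeddings are assumed to be plain lists of floats already extracted by some pre-trained recognizer.

```python
import math
from collections import Counter
from typing import Hashable, List, Sequence, Tuple


def mean_and_variance(vectors: Sequence[Sequence[float]]) -> Tuple[List[float], List[float]]:
    """Per-dimension mean and (population) variance of a set of embeddings."""
    n, dim = len(vectors), len(vectors[0])
    mean = [sum(v[i] for v in vectors) / n for i in range(dim)]
    var = [sum((v[i] - mean[i]) ** 2 for v in vectors) / n for i in range(dim)]
    return mean, var


def frechet_distance_sq_diag(
    real_emb: Sequence[Sequence[float]], gen_emb: Sequence[Sequence[float]]
) -> float:
    """Squared Fréchet distance between two diagonal-covariance Gaussians.

    Assumption for this sketch: off-diagonal covariance terms are ignored.
    """
    mu_r, var_r = mean_and_variance(real_emb)
    mu_g, var_g = mean_and_variance(gen_emb)
    mean_term = sum((a - b) ** 2 for a, b in zip(mu_r, mu_g))
    cov_term = sum((math.sqrt(a) - math.sqrt(b)) ** 2 for a, b in zip(var_r, var_g))
    return mean_term + cov_term


def label_entropy(labels: Sequence[Hashable]) -> float:
    """Shannon entropy (in nats) of the empirical action-class distribution."""
    n = len(labels)
    return -sum((c / n) * math.log(c / n) for c in Counter(labels).values())
```

A low Fréchet distance indicates that generated embeddings match the real ones in mean and spread, while a label entropy near the logarithm of the number of classes indicates that the generated set covers the action classes evenly. Neither number alone separates fidelity from diversity, which is why the response lists them together.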
Circularity Check
No circularity: generative model trained independently on real data; downstream gains are empirical
Full rationale
The paper trains a conditional Transformer encoder-decoder plus refinement module on real labeled skeleton sequences to learn their distribution and synthesize new ones. This generative step is defined and optimized separately from the downstream skeleton action recognition task. The reported accuracy improvements on HumanAct12 and NTU-VIBE are measured on held-out test sets after augmentation and are not algebraically or statistically forced by the training objective of the generator. No self-citations, uniqueness theorems, or fitted parameters are invoked as load-bearing premises for the central claim. The derivation chain remains self-contained against external benchmarks.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: real skeleton sequences follow a learnable conditional distribution given action labels.