arxiv: 2604.14933 · v1 · submitted 2026-04-16 · 💻 cs.CV

Generative Data Augmentation for Skeleton Action Recognition

Pith reviewed 2026-05-10 11:25 UTC · model grok-4.3

classification 💻 cs.CV
keywords skeleton action recognition · data augmentation · generative models · transformer · few-shot learning · human pose · conditional generation

The pith

A conditional generative pipeline using a Transformer encoder-decoder synthesizes realistic skeleton sequences to augment training data and raise action recognition accuracy.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper aims to address the high cost of collecting large annotated 3D skeleton datasets by learning the distribution of real pose sequences conditioned on action labels. It introduces a generative method that produces additional diverse yet faithful sequences to supplement limited training data. Experiments demonstrate that this augmentation consistently lifts performance for several existing recognition models on HumanAct12 and NTU-VIBE, including in few-shot settings. A sympathetic reader would care because skeleton data is expensive to gather at scale, so effective synthesis could make high-accuracy systems feasible with far less manual labeling effort.

Core claim

The central claim is that a Transformer-based encoder-decoder architecture, paired with a generative refinement module and a dropout mechanism, learns to generate high-fidelity and diverse skeleton sequences conditioned on action labels. When these sequences augment the original training sets, multiple skeleton-based action recognition models achieve higher accuracy on HumanAct12 and the refined NTU-RGBD (NTU-VIBE) dataset, with gains observed in both few-shot and full-data regimes.

What carries the argument

The Transformer-based encoder-decoder architecture together with a generative refinement module and dropout mechanism that balances fidelity and diversity while sampling action-conditioned skeleton sequences.

If this is right

  • Multiple existing skeleton-based recognition models record higher accuracy after training on the augmented data.
  • Performance gains occur in both few-shot and full training data regimes.
  • The generated sequences transfer effectively across different recognition architectures.
  • Effective synthesis remains possible even when the original labeled set is small.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same conditioning approach could be tested on generating sequences for rare actions that appear infrequently in current collections.
  • If the refinement module successfully controls artifact levels, the pipeline might apply to other pose-based tasks such as motion prediction.
  • Measuring how well generated sequences cover the space of natural pose variations could clarify why recognition improves.
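One way to make the coverage idea in the last bullet concrete is a nearest-neighbour check: count how many real samples have at least one generated sample close by. The sketch below is illustrative only. The function name `coverage`, the feature representation (fixed-length vectors, for example embeddings from a recognition model), and the `radius` threshold are all assumptions, not something the paper defines.

```python
import math
from typing import Sequence


def coverage(
    real: Sequence[Sequence[float]],
    generated: Sequence[Sequence[float]],
    radius: float,
) -> float:
    """Fraction of real feature vectors with a generated neighbour within `radius`.

    Hypothetical helper: the feature space and the radius are choices the
    experimenter has to make; neither is specified by the paper.
    """
    covered = sum(
        1 for r in real if any(math.dist(r, g) <= radius for g in generated)
    )
    return covered / len(real)


# Toy usage: one of the two real points has a generated neighbour within 1.0.
real_features = [(0.0, 0.0), (10.0, 0.0)]
generated_features = [(0.5, 0.0)]
print(coverage(real_features, generated_features, radius=1.0))
```

A low score would suggest the generator misses parts of the real distribution. A high score alone says nothing about artifacts, so it would complement a fidelity measure rather than replace one.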

Load-bearing premise

The synthesized skeleton sequences must stay close enough to the real data distribution to add useful variety without introducing artifacts that lower downstream recognition accuracy.

What would settle it

Training the same recognition models on the union of real and generated sequences and observing no accuracy gain or an accuracy drop on held-out real test sets would show the augmentation does not work.
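The test described above can be sketched as a small harness: train the same model once on real data and once on the union of real and generated data, then compare accuracy on the same held-out real test set. Everything named here (`augmentation_gain`, `train_nearest_centroid`, the flattened-feature sample format) is a hypothetical stand-in. The nearest-centroid classifier is a toy substitute for the recognition models used in the paper.

```python
from typing import Callable, Dict, List, Sequence, Tuple

Sample = Tuple[List[float], int]  # (flattened skeleton features, action label)
Model = Callable[[List[float]], int]


def accuracy(model: Model, test_set: Sequence[Sample]) -> float:
    """Fraction of held-out samples whose label the model predicts correctly."""
    return sum(1 for x, y in test_set if model(x) == y) / len(test_set)


def augmentation_gain(
    train_fn: Callable[[Sequence[Sample]], Model],
    real_train: Sequence[Sample],
    generated: Sequence[Sample],
    real_test: Sequence[Sample],
) -> float:
    """Accuracy(real + generated) minus accuracy(real only), on real test data."""
    baseline = accuracy(train_fn(real_train), real_test)
    augmented = accuracy(train_fn(list(real_train) + list(generated)), real_test)
    return augmented - baseline


def train_nearest_centroid(data: Sequence[Sample]) -> Model:
    """Toy recognition model (an assumption): classify by nearest class mean."""
    sums: Dict[int, List[float]] = {}
    counts: Dict[int, int] = {}
    for x, y in data:
        acc = sums.setdefault(y, [0.0] * len(x))
        for i, v in enumerate(x):
            acc[i] += v
        counts[y] = counts.get(y, 0) + 1
    centroids = {y: [v / counts[y] for v in s] for y, s in sums.items()}

    def predict(x: List[float]) -> int:
        return min(
            centroids,
            key=lambda y: sum((a - b) ** 2 for a, b in zip(x, centroids[y])),
        )

    return predict
```

A gain at or below zero, repeated across several random seeds and recognition architectures, would be the negative outcome described above. A single run is not enough to separate a real effect from training noise.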

Figures

Figures reproduced from arXiv: 2604.14933 by Andrew Gilbert, Anthony Adeyemi-Ejeye, Wanqing Li, Xu Dong.

Figure 1: Overview of our approach. With only a small set of labelled skeleton sequences, the model generates diverse and high-fidelity samples. When …
Figure 2: Overview of our proposed network. (Top) Conditional Skeleton Diffusion Module. The encoder processes a skeleton feature sequence together with the noise step t and the corresponding action label, producing a conditional representation. A Transformer-based decoder then reconstructs the clean skeleton sequence from the noise-corrupted input, guided by this representation. In addition, a lightweight classific…
Figure 3: t-SNE visualisations comparing real and synthetic skeleton …
Figure 4: Visualisation of our generated skeleton sequences conditioned on action labels. Our results demonstrate that our method generates diverse motion …
Original abstract

Skeleton-based human action recognition is a powerful approach for understanding human behaviour from pose data, but collecting large-scale, diverse, and well-annotated 3D skeleton datasets is both expensive and labor-intensive. To address this challenge, we propose a conditional generative pipeline for data augmentation in skeleton action recognition. Our method learns the distribution of real skeleton sequences under the constraint of action labels, enabling the synthesis of diverse and high-fidelity data. Even with limited training samples, it can effectively generate skeleton sequences and achieve competitive recognition performance in low-data scenarios, demonstrating strong generalisation in downstream tasks. Specifically, we introduce a Transformer-based encoder-decoder architecture, combined with a generative refinement module and a dropout mechanism, to balance fidelity and diversity during sampling. Experiments on HumanAct12 and the refined NTU-RGBD (NTU-VIBE) dataset show that our approach consistently improves the accuracy of multiple skeleton-based action recognition models, validating its effectiveness in both few-shot and full-data settings. The source code can be found at here.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper proposes a conditional generative pipeline for data augmentation in skeleton action recognition. It employs a Transformer-based encoder-decoder architecture combined with a generative refinement module and dropout mechanism to learn the distribution of real skeleton sequences conditioned on action labels, enabling synthesis of diverse high-fidelity sequences. Experiments on HumanAct12 and refined NTU-RGBD (NTU-VIBE) datasets are reported to show consistent accuracy gains for multiple skeleton-based recognition models in both few-shot and full-data regimes.

Significance. If the generated sequences demonstrably preserve real-data statistics while adding useful diversity, the approach could provide a practical solution to data scarcity in 3D skeleton action recognition, with particular value in low-data settings. The public release of source code supports reproducibility and is a strength.

major comments (2)
  1. [Abstract and Experiments] Abstract and Experiments section: The claim that the method produces 'high-fidelity' and 'diverse' data is supported solely by downstream accuracy improvements on HumanAct12 and NTU-VIBE; no quantitative fidelity metrics (e.g., per-joint position/velocity errors, Fréchet distance on pose embeddings, or action-conditioned distribution distances) or diversity measures are provided. This is load-bearing because accuracy gains alone cannot distinguish genuine augmentation from regularization effects or label-consistent artifacts.
  2. [Method and Experiments] Method and Experiments sections: No ablation studies isolate the contribution of the refinement module, the dropout schedule, or the Transformer encoder-decoder itself versus simpler baselines. Without these controls, it is unclear whether the reported gains are attributable to the generative component or to other design choices.
minor comments (2)
  1. [Abstract] Abstract: The source-code link is given only as 'here' without a concrete URL; replace with the actual repository address.
  2. [Notation] Notation and terminology: Ensure consistent distinction between 'skeleton sequences', 'pose data', and 'joint angles' across the manuscript to avoid reader confusion.
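Some of the metrics named in major comment 1 are simple enough to sketch directly. The block below shows mean per-joint position error, a velocity error based on frame-to-frame differences, an average pairwise distance as a diversity proxy, and label entropy. It assumes sequences are stored as frames of (x, y, z) joints and that compared sequences have the same length and joint count. How generated sequences would be paired with real ones is not specified in the source and is left to the caller. The Fréchet distance is omitted because it depends on a particular embedding model.

```python
import math
from collections import Counter
from typing import List, Sequence, Tuple

Joint = Tuple[float, float, float]
Sequence3D = List[List[Joint]]  # frames -> joints -> (x, y, z)


def mpjpe(a: Sequence3D, b: Sequence3D) -> float:
    """Mean per-joint position error between two sequences of equal shape."""
    dists = [math.dist(ja, jb) for fa, fb in zip(a, b) for ja, jb in zip(fa, fb)]
    return sum(dists) / len(dists)


def velocities(seq: Sequence3D) -> Sequence3D:
    """Frame-to-frame joint displacement (first-order finite difference)."""
    return [
        [tuple(c1 - c0 for c0, c1 in zip(j0, j1)) for j0, j1 in zip(f0, f1)]
        for f0, f1 in zip(seq, seq[1:])
    ]


def mean_velocity_error(a: Sequence3D, b: Sequence3D) -> float:
    """MPJPE applied to joint velocities instead of positions."""
    return mpjpe(velocities(a), velocities(b))


def mean_pairwise_distance(seqs: Sequence[Sequence3D]) -> float:
    """Diversity proxy: average MPJPE over all unordered pairs of sequences."""
    pairs = [(i, j) for i in range(len(seqs)) for j in range(i + 1, len(seqs))]
    return sum(mpjpe(seqs[i], seqs[j]) for i, j in pairs) / len(pairs)


def label_entropy(labels: Sequence[int]) -> float:
    """Shannon entropy, in bits, of the empirical label distribution."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())
```

The position and velocity errors speak to fidelity, while the pairwise distance and entropy speak to diversity. Reporting both kinds side by side is what lets a reader tell useful variety apart from noise.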

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We agree that strengthening the direct evaluation of fidelity, diversity, and component contributions will improve the manuscript. Below we address each major comment and describe the planned revisions.

Point-by-point responses
  1. Referee: [Abstract and Experiments] Abstract and Experiments section: The claim that the method produces 'high-fidelity' and 'diverse' data is supported solely by downstream accuracy improvements on HumanAct12 and NTU-VIBE; no quantitative fidelity metrics (e.g., per-joint position/velocity errors, Fréchet distance on pose embeddings, or action-conditioned distribution distances) or diversity measures are provided. This is load-bearing because accuracy gains alone cannot distinguish genuine augmentation from regularization effects or label-consistent artifacts.

    Authors: We acknowledge that reliance on downstream accuracy alone leaves open the possibility of regularization or artifact effects. In the revised manuscript we will add quantitative fidelity and diversity metrics, including: (i) mean per-joint position and velocity errors between generated and real sequences, (ii) Fréchet distance on 3D pose embeddings extracted from a pre-trained action recognition model, and (iii) diversity statistics such as average pairwise Euclidean distance in the latent space and entropy of generated action-class distributions. These will be reported for both few-shot and full-data regimes on HumanAct12 and NTU-VIBE to demonstrate that the observed gains arise from high-fidelity, diverse samples rather than spurious effects. revision: yes

  2. Referee: [Method and Experiments] Method and Experiments sections: No ablation studies isolate the contribution of the refinement module, the dropout schedule, or the Transformer encoder-decoder itself versus simpler baselines. Without these controls, it is unclear whether the reported gains are attributable to the generative component or to other design choices.

    Authors: We agree that isolating the contribution of each design element is necessary. The revised version will include a dedicated ablation section with the following experiments: (1) full model versus model without the generative refinement module, (2) full model versus variants with fixed or removed dropout schedules, (3) Transformer encoder-decoder replaced by LSTM and MLP baselines while keeping the rest of the pipeline identical, and (4) comparison against non-generative augmentation baselines (e.g., random jitter, interpolation). All ablations will report both recognition accuracy and the new fidelity/diversity metrics on the same datasets and splits. revision: yes
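The non-generative baselines named in ablation (4), random jitter and interpolation, could look roughly like the sketch below. The function names, the default noise scale `sigma`, and the blend weight `alpha` are illustrative assumptions. The rebuttal does not specify how these baselines would be implemented.

```python
import random
from typing import List, Optional, Tuple

Joint = Tuple[float, float, float]
Sequence3D = List[List[Joint]]  # frames -> joints -> (x, y, z)


def jitter(
    seq: Sequence3D, sigma: float = 0.01, rng: Optional[random.Random] = None
) -> Sequence3D:
    """Add independent Gaussian noise to every joint coordinate."""
    rng = rng or random.Random()
    return [
        [tuple(c + rng.gauss(0.0, sigma) for c in joint) for joint in frame]
        for frame in seq
    ]


def interpolate(a: Sequence3D, b: Sequence3D, alpha: float = 0.5) -> Sequence3D:
    """Linearly blend two sequences of equal shape, coordinate by coordinate.

    Intended for two sequences that share an action label, so the blended
    sequence can plausibly keep that label.
    """
    return [
        [
            tuple((1.0 - alpha) * ca + alpha * cb for ca, cb in zip(ja, jb))
            for ja, jb in zip(fa, fb)
        ]
        for fa, fb in zip(a, b)
    ]


# Toy usage: blend two single-frame, single-joint sequences halfway.
print(interpolate([[(0.0, 0.0, 0.0)]], [[(2.0, 4.0, 6.0)]], alpha=0.5))
```

Baselines like these are cheap to run, so if they match the gains of the generative pipeline, the case for the learned generator weakens. That is why the referee asks for them as controls.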

Circularity Check

0 steps flagged

No circularity: generative model trained independently on real data; downstream gains are empirical

full rationale

The paper trains a conditional Transformer encoder-decoder plus refinement module on real labeled skeleton sequences to learn their distribution and synthesize new ones. This generative step is defined and optimized separately from the downstream skeleton action recognition task. The reported accuracy improvements on HumanAct12 and NTU-VIBE are measured on held-out test sets after augmentation and are not algebraically or statistically forced by the training objective of the generator. No self-citations, uniqueness theorems, or fitted parameters are invoked as load-bearing premises for the central claim. The derivation chain remains self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The approach rests on standard generative modeling assumptions for sequence data; no free parameters or invented entities are explicitly introduced in the abstract.

axioms (1)
  • domain assumption: Real skeleton sequences follow a learnable conditional distribution given action labels.
    This underpins the entire generative synthesis process described.

pith-pipeline@v0.9.0 · 5479 in / 1128 out tokens · 35092 ms · 2026-05-10T11:25:10.324891+00:00 · methodology

