arxiv: 2605.09231 · v2 · submitted 2026-05-10 · 💻 cs.CV · stat.ML

An Elastic Shape Variational Autoencoder for Skeleton Pose Trajectories

Arafat Rahman, Shashwat Kumar, Laura E. Barnes, Anuj Srivastava

Pith reviewed 2026-05-13 06:00 UTC · model grok-4.3

keywords: variational autoencoder · skeletal trajectories · Kendall shape manifold · elastic shape analysis · action recognition · gait analysis · generative models · Riemannian geometry

The pith

ES-VAE uses the transported square-root velocity field on Kendall's shape manifold to isolate intrinsic skeletal dynamics, and it outperforms standard VAEs and sequence-model baselines on gait prediction and action recognition.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces ES-VAE to model sequences of human skeleton poses by focusing on their intrinsic shapes rather than external factors like viewpoint or execution speed. It builds the variational autoencoder around the transported square-root velocity field (TSRVF) representation on Kendall's shape manifold, which by construction removes rigid translations, rotations, scaling, and rate variability. The encoder applies the Riemannian logarithm map and the decoder applies the exponential map to respect the manifold geometry. Experiments on gait cycle data for clinical mobility scoring and on the NTU RGB+D dataset for action recognition show consistent gains over regular VAEs and baselines such as temporal convolutions, transformers, and graph networks. A sympathetic reader would care because the approach demonstrates how embedding data in the right geometric space lets generative models allocate capacity to the features that actually matter for prediction tasks.
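To make the nuisance removal concrete, here is a minimal NumPy sketch of the spatial part only: centering and scaling a pose to a pre-shape, then rotating it onto a reference with the best proper rotation. It is an illustration under stated assumptions, not the authors' implementation. The function names, the random 25-joint pose, and the chosen rotation, scale, and offset are all hypothetical, and temporal rate variability is not handled here.

```python
import numpy as np

def to_preshape(X):
    """Center a (joints, 3) pose and scale it to unit Frobenius norm."""
    Xc = X - X.mean(axis=0, keepdims=True)  # removes translation
    return Xc / np.linalg.norm(Xc)          # removes global scale

def align_rotation(X, ref):
    """Rotate pre-shape X onto pre-shape ref with the best proper rotation."""
    U, _, Vt = np.linalg.svd(X.T @ ref)
    D = np.eye(3)
    D[2, 2] = np.sign(np.linalg.det(U @ Vt))  # exclude reflections
    return X @ (U @ D @ Vt)

def shape_distance(X, Y):
    """Geodesic distance (radians) between the shapes of two poses."""
    P = to_preshape(X)
    Q = align_rotation(to_preshape(Y), P)
    return float(np.arccos(np.clip(np.sum(P * Q), -1.0, 1.0)))

# Hypothetical data: a random pose and a moved, resized, rotated copy of it.
rng = np.random.default_rng(0)
pose = rng.normal(size=(25, 3))
a = 0.7
Rz = np.array([[np.cos(a), -np.sin(a), 0.0],
               [np.sin(a),  np.cos(a), 0.0],
               [0.0,        0.0,       1.0]])
copy = 2.5 * pose @ Rz + np.array([1.0, -3.0, 0.5])
other = rng.normal(size=(25, 3))

d_same = shape_distance(pose, copy)    # numerically close to zero
d_other = shape_distance(pose, other)  # clearly positive
```

The transformed copy lands at essentially zero distance from the original, while an unrelated pose does not. That is the sense in which a model fed these coordinates never has to spend capacity on viewpoint or subject size.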

Core claim

The paper claims that mapping skeletal trajectories to the TSRVF representation on Kendall's shape manifold, then encoding them with the Riemannian logarithm map and decoding with the exponential map, produces a generative model whose latent space captures underlying shape dynamics without wasting capacity on nuisance factors, yielding improved performance on clinical gait analysis and action recognition compared with standard VAEs and other sequence models.
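The encode-with-log, decode-with-exp pattern can be sketched on the pre-shape sphere, where both maps have closed forms. This is a simplified illustration, not the paper's model: it ignores the rotation quotient and any neural network, and the reference pose, the noise level, and the function names are assumptions introduced here.

```python
import numpy as np

def to_preshape(X):
    """Center a (joints, 3) pose and scale it to unit Frobenius norm."""
    Xc = X - X.mean(axis=0, keepdims=True)
    return Xc / np.linalg.norm(Xc)

def sphere_log(p, x):
    """Tangent vector at p pointing along the geodesic from p to x."""
    cos_t = np.clip(np.sum(p * x), -1.0, 1.0)
    theta = np.arccos(cos_t)
    if theta < 1e-12:
        return np.zeros_like(p)
    return (theta / np.sin(theta)) * (x - cos_t * p)

def sphere_exp(p, v):
    """Point reached from p by following the geodesic with velocity v."""
    n = np.linalg.norm(v)
    if n < 1e-12:
        return p.copy()
    return np.cos(n) * p + np.sin(n) * (v / n)

rng = np.random.default_rng(1)
ref = to_preshape(rng.normal(size=(25, 3)))            # hypothetical reference pose
x = to_preshape(ref + 0.1 * rng.normal(size=(25, 3)))  # hypothetical nearby pose

v = sphere_log(ref, x)       # flat coordinates a Euclidean encoder could consume
x_back = sphere_exp(ref, v)  # decoder side: tangent vector mapped back to the sphere
geodesic = float(np.arccos(np.clip(np.sum(ref * x), -1.0, 1.0)))
```

In this sketch the tangent vector `v` is orthogonal to `ref`, its norm equals the geodesic distance, and the exponential map recovers `x`. A learned encoder and decoder would sit between the two maps.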

What carries the argument

The transported square-root velocity field (TSRVF) representation on Kendall's shape manifold, which removes rigid motions and temporal rate variability, with the VAE encoder using the Riemannian logarithm map and the decoder using the corresponding exponential map.
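For readers unfamiliar with the TSRVF, the following sketch computes a discrete version on a unit sphere: the velocity at each time step is divided by the square root of its norm and then parallel-transported to a fixed reference point. It rests on assumptions not taken from the paper: a finite-difference velocity, a toy curve on the 2-sphere in place of skeleton data, and no rotation alignment or time-warping optimization. All names are hypothetical.

```python
import numpy as np

def sphere_log(p, x):
    """Tangent vector at p pointing along the geodesic from p to x."""
    cos_t = np.clip(np.sum(p * x), -1.0, 1.0)
    theta = np.arccos(cos_t)
    if theta < 1e-12:
        return np.zeros_like(p)
    return (theta / np.sin(theta)) * (x - cos_t * p)

def transport(v, p, c):
    """Parallel-transport tangent vector v from p to c along their geodesic.

    Closed form for the unit sphere; undefined when p and c are antipodal.
    """
    return v - (np.sum(v * c) / (1.0 + np.sum(p * c))) * (p + c)

def tsrvf(traj, c, dt):
    """Discrete TSRVF of a trajectory of unit vectors, expressed at c."""
    out = []
    for a, b in zip(traj[:-1], traj[1:]):
        vel = sphere_log(a, b) / dt  # finite-difference velocity at a
        speed = np.linalg.norm(vel)
        srv = vel / np.sqrt(speed) if speed > 1e-12 else np.zeros_like(vel)
        out.append(transport(srv, a, c))
    return np.stack(out)

# Toy trajectory on the 2-sphere (a stand-in for a pose sequence).
t = np.linspace(0.0, 1.0, 30)
curve = np.stack([np.cos(t), np.sin(t), 0.5 * t], axis=1)
traj = curve / np.linalg.norm(curve, axis=1, keepdims=True)
c = traj[0]  # reference point, chosen arbitrarily here
dt = t[1] - t[0]

q = tsrvf(traj, c, dt)
speeds = np.array([np.linalg.norm(sphere_log(a, b)) / dt
                   for a, b in zip(traj[:-1], traj[1:])])
```

Every row of `q` lies in the single tangent space at `c`, so the whole trajectory becomes an ordinary vector-valued curve. The square-root scaling means the squared norm of each row equals the local speed, which is the property elastic methods rely on when comparing curves under re-timing.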

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same manifold preprocessing could be inserted into other generative models such as diffusion models or GANs for pose synthesis tasks.
The approach suggests a route for improving generalization when skeletal data comes from cameras with unknown or varying viewpoints.
Similar elastic representations might apply to non-human trajectory data such as animal locomotion or robotic joint paths.
The resulting latent space may be more directly interpretable in terms of shape variations than latent spaces from Euclidean VAEs.

Load-bearing premise

That stripping away rotations, scales, translations, and speed variations through the TSRVF representation leaves intact all information needed for the downstream clinical and recognition tasks.

What would settle it

Retraining the same model architecture on the NTU RGB+D dataset after replacing the TSRVF step with raw joint coordinates and observing whether action recognition accuracy drops below the reported ES-VAE level.

Figures

Figures reproduced from arXiv: 2605.09231 by Anuj Srivastava, Arafat Rahman, Laura E. Barnes, Shashwat Kumar.

**Figure 2.** Comparison of submanifold learning methods on synthetic data on …

**Figure 3.** ES-VAE modes of variation: z1 (short stride and stiffer limbs), z2 (left arm variability), z3 (left arm and right knee variability), z4 (right arm variability), z5 (subtle right elbow variability). Black: mean; red/blue: ±3 traversals. … that participants who score higher on z1 and z3 (shorter stride, stiffer limbs) also score lower on POMA. z4 correlates positively with LesionLeft, which shows that it encod…

**Figure 4.** Correlation of demographic and clinical variables with the first five ES-VAE latent dimen…

**Figure 5.** Mean gait cycles from registered trajectories. Stroke patients (red) exhibit hemiplegic …

**Figure 6.** Left: Scatter plot of z1 vs. z2 showing separation between stroke (red) and healthy (green) cohorts. Right: Correlation heatmap of the first five latent dimensions, confirming their independence.

**Figure 7.** Boxplots of the first five latent dimensions by clinical group.

**Figure 8.** Correlation of demographic and clinical variables with the first five Tangent PCA dimen…

**Figure 9.** Tangent PCA modes of variation: PC1 (stride/stiffness), PC2 (bilateral arm variability), …

Original abstract

Deep generative models provide flexible frameworks for modeling complex, structured data such as images, videos, 3D objects, and text. However, when applied to sequences of human skeletons, standard variational autoencoders (VAEs) often allocate substantial capacity to nuisance factors (such as camera orientation, subject scale, viewpoint, and execution speed) rather than the intrinsic geometry of shapes and their motion. We propose the Elastic Shape - Variational Autoencoder (ES-VAE), a geometry-aware generative model for skeletal trajectories that leverages the transported square-root velocity field (TSRVF) representation on Kendall's shape manifold. This representation inherently removes rigid translations, rotations, and global scaling of shapes, and temporal rate variability of sequences, isolating the underlying shape dynamics. The ES-VAE encoder maps skeletal sequences to a low-dimensional latent space incorporating the Riemannian logarithm map, while the decoder reconstructs sequences using the corresponding exponential map. We demonstrate the effectiveness of ES-VAE on two datasets. First, we analyze skeletal gait cycles to predict clinical mobility scores and classify subjects into healthy and post-stroke groups. Second, we evaluate action recognition on the NTU RGB+D dataset. Across both settings, ES-VAE consistently outperforms standard VAEs and a range of sequence modeling baselines, including temporal convolutional networks, transformers, and graph convolutional networks. More broadly, ES-VAE provides a principled framework for learning generative models of longitudinal data on pose shape manifolds, offering improved latent representation and downstream performance compared to existing deep learning approaches.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 0 minor

Summary. The manuscript proposes the Elastic Shape Variational Autoencoder (ES-VAE), a geometry-aware generative model for skeletal pose trajectories. It employs the transported square-root velocity field (TSRVF) representation on Kendall's shape manifold to remove rigid transformations, scaling, and temporal rate variability, thereby isolating intrinsic shape dynamics. The encoder maps sequences to a latent space via the Riemannian logarithm map, and the decoder reconstructs via the exponential map. The model is evaluated on skeletal gait cycles for clinical mobility score prediction and healthy vs. post-stroke classification, as well as on the NTU RGB+D dataset for action recognition, with claims of consistent outperformance over standard VAEs and sequence baselines including TCNs, transformers, and GCNs.

Significance. If the empirical results hold and the Riemannian operations are shown to be well-defined on the data, this work could advance generative modeling of manifold-valued longitudinal pose data by providing a principled way to factor out nuisance variables. The use of established elastic shape analysis tools (TSRVF and Kendall manifold) within a VAE framework offers potential for more interpretable latent representations and improved downstream performance in clinical gait analysis and action recognition tasks.

major comments (2)

The abstract asserts consistent outperformance on two datasets but supplies no quantitative results, error bars, statistical tests, or implementation details, leaving the central claim without verifiable support in the provided text.
The model construction relies on the Riemannian logarithm map in the encoder and exponential map in the decoder after TSRVF representation on Kendall's shape manifold. Kendall shape space (quotient of pre-shapes by SO(3)) has a cut locus; the log map is not globally defined or single-valued, and small perturbations near antipodal configurations can produce discontinuous jumps in the tangent-space coordinates. If gait cycles or NTU actions contain such poses, the latent-space mapping becomes unstable or ill-defined, so any reported gains over Euclidean VAEs or sequence baselines cannot be attributed to the geometry-aware construction. The manuscript does not discuss domain restrictions, cut-locus handling, or validation that the maps remain continuous on the observed data.
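The discontinuity the second comment describes is easy to reproduce on an ordinary unit sphere. The sketch below is only an illustration of the mechanism: it uses the 2-sphere rather than Kendall shape space, where the cut locus sits elsewhere, and the points and the value of `eps` are chosen by hand.

```python
import numpy as np

def sphere_log(p, x):
    """Tangent vector at p pointing along the geodesic from p to x."""
    cos_t = np.clip(np.sum(p * x), -1.0, 1.0)
    theta = np.arccos(cos_t)
    if theta < 1e-12:
        return np.zeros_like(p)
    return (theta / np.sin(theta)) * (x - cos_t * p)

north = np.array([0.0, 0.0, 1.0])
eps = 1e-3  # how far the two test points sit from the south pole

# Two points on opposite sides of the south pole, very close to each other.
q_plus = np.array([np.sin(eps), 0.0, -np.cos(eps)])
q_minus = np.array([-np.sin(eps), 0.0, -np.cos(eps)])

gap_on_sphere = np.linalg.norm(q_plus - q_minus)
gap_in_tangent = np.linalg.norm(sphere_log(north, q_plus)
                                - sphere_log(north, q_minus))
```

The two inputs differ by about `2 * eps`, yet their log-map images point in opposite directions and differ by nearly `2 * pi`. An encoder that receives such coordinates sees two almost identical poses as maximally different, which is the instability the referee asks the authors to rule out.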

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive and detailed comments on our manuscript. We address each major comment below and outline the revisions we will make to improve the paper.

Referee: The abstract asserts consistent outperformance on two datasets but supplies no quantitative results, error bars, statistical tests, or implementation details, leaving the central claim without verifiable support in the provided text.

Authors: The full manuscript includes comprehensive quantitative results with error bars, statistical tests, and implementation details in the experimental sections and supplementary material. To directly address the referee's concern about the abstract, we will revise it to incorporate a small number of key quantitative highlights (e.g., accuracy gains on gait classification and action recognition) while preserving conciseness. This change will make the central claims more verifiable from the abstract itself.

Revision: yes

Referee: The model construction relies on the Riemannian logarithm map in the encoder and exponential map in the decoder after TSRVF representation on Kendall's shape manifold. Kendall shape space (quotient of pre-shapes by SO(3)) has a cut locus; the log map is not globally defined or single-valued, and small perturbations near antipodal configurations can produce discontinuous jumps in the tangent-space coordinates. If gait cycles or NTU actions contain such poses, the latent-space mapping becomes unstable or ill-defined, so any reported gains over Euclidean VAEs or sequence baselines cannot be attributed to the geometry-aware construction. The manuscript does not discuss domain restrictions, cut-locus handling, or validation that the maps remain continuous on the observed data.

Authors: We acknowledge that the current manuscript does not explicitly discuss the cut locus of Kendall shape space or provide validation for the continuity of the log map. In the revised version we will add a dedicated subsection in the methods that (i) recalls the cut-locus issue, (ii) describes the preprocessing steps (TSRVF alignment and reference-shape selection) used to keep observed pre-shapes sufficiently far from antipodal configurations, and (iii) reports empirical checks (e.g., distributions of geodesic distances to the reference shape) confirming that the log map remained continuous and single-valued on all sequences from both the gait and NTU datasets. These additions will clarify that the reported performance gains can be attributed to the geometry-aware construction.

Revision: yes
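An empirical check of the kind promised in item (iii) could look like the sketch below. It is a hypothetical diagnostic, not something reported in the paper. It assumes the inputs are pre-shapes that have already been rotation-aligned to the reference, it runs on synthetic poses, and it treats the default `cut_distance` of pi/2 and the `margin` as analyst-chosen assumptions to be confirmed for the space actually used.

```python
import numpy as np

def to_preshape(X):
    """Center a (joints, 3) pose and scale it to unit Frobenius norm."""
    Xc = X - X.mean(axis=0, keepdims=True)
    return Xc / np.linalg.norm(Xc)

def distances_to_reference(preshapes, ref):
    """Geodesic distance (radians) of each (joints, 3) pre-shape to ref."""
    inner = np.einsum("nij,ij->n", preshapes, ref)
    return np.arccos(np.clip(inner, -1.0, 1.0))

def cut_locus_report(dists, cut_distance=np.pi / 2, margin=0.1):
    """Summarize how close the data come to an assumed cut distance."""
    return {
        "max": float(dists.max()),
        "q50": float(np.quantile(dists, 0.5)),
        "q99": float(np.quantile(dists, 0.99)),
        "n_near_cut": int(np.sum(dists > cut_distance - margin)),
    }

# Synthetic stand-in for a dataset: small perturbations of one reference pose.
rng = np.random.default_rng(2)
ref = to_preshape(rng.normal(size=(25, 3)))
samples = np.stack([to_preshape(ref + 0.05 * rng.normal(size=(25, 3)))
                    for _ in range(200)])

report = cut_locus_report(distances_to_reference(samples, ref))
```

A non-zero `n_near_cut` on real gait or NTU sequences would flag poses whose tangent coordinates deserve a closer look before any performance gain is credited to the geometry.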

Circularity Check

0 steps flagged

No significant circularity in derivation chain

The ES-VAE construction applies the established TSRVF representation on Kendall's shape manifold (imported from prior shape-analysis literature) together with standard Riemannian logarithm and exponential maps in the encoder/decoder. These operations are not defined in terms of the target performance metrics or downstream clinical/action-recognition tasks; the model equations therefore remain independent of the reported outperformance numbers. Evaluations occur on external datasets (gait cycles, NTU RGB+D) against non-self-referential baselines, with no fitted parameters renamed as predictions, no load-bearing self-citation chains, and no self-definitional loops. The derivation is self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

Abstract-only review limits visibility into specific parameters or assumptions; the central claim rests on the domain property of TSRVF and standard VAE training.

axioms (1)

Domain assumption: the TSRVF representation on Kendall's shape manifold removes rigid translations, rotations, global scaling, and temporal rate variability.
Stated directly in the abstract as an inherent property leveraged by the model.

pith-pipeline@v0.9.0 · 5575 in / 1180 out tokens · 69243 ms · 2026-05-13T06:00:34.529582+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/AlexanderDuality.lean · alexander_duality_circle_linking (D=3 forcing) · unclear
Paper passage: Kendall shape space Σ^k_m = S^k_m / SO(m) ... geodesic distance d_Σ (2); Exp_ν(w) (3); Log_ν(X) (4); TSRVF q(t) (5); Riemannian ELBO with squared geodesic loss (10)-(12)

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel (J-cost uniqueness) · unclear
Paper passage: We represent skeleton trajectories as stochastic processes in Kendall shape space ... transported square-root velocity field (TSRVF) ... Riemannian VAE encoder/decoder

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
