Autoencoder Architectures for Athlete Performance Scoring from Wearable Telemetry

Jan Zubalewicz; Krzysztof Siwek; Mateusz Kubita

arxiv: 2606.28145 · v1 · pith:VJE7PZA4new · submitted 2026-06-26 · 💻 cs.LG

Autoencoder Architectures for Athlete Performance Scoring from Wearable Telemetry

Mateusz Kubita , Jan Zubalewicz , Krzysztof Siwek This is my paper

Pith reviewed 2026-06-29 04:50 UTC · model grok-4.3

classification 💻 cs.LG

keywords autoencoderwearable telemetryperformance scoringdimensionality reductionunsupervised learningrunner profileslatent scorecomposite criterion

0 comments

The pith

Deep autoencoders achieve the lowest reconstruction error and highest composite interpretability score when reducing nine wearable sensor features to a single latent performance indicator for runners.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper evaluates several dimensionality reduction techniques to turn high-dimensional running telemetry into one scalar performance score without any labeled data. It compares three autoencoder variants, PCA, and a variational autoencoder using reconstruction accuracy together with several measures of how well the latent score ranks runners in physiologically sensible ways. A composite ranking that blends these metrics selects the deep autoencoder as best overall. Widening the hidden layers makes deeper PCA variants nearly as competitive, pointing to model capacity as the real constraint rather than the single-dimensional output. Across all models, running pace, aerobic decoupling, and average heart rate consistently drive the latent score.

Core claim

Deep autoencoder achieved the lowest reconstruction error and the highest composite score. Once the PCA hidden layers were widened, the deeper variants became closely competitive with Deep AE on the composite criterion, indicating that the limiting factor was hidden layer capacity rather than the one dimensional bottleneck. Running pace, aerobic decoupling, and average heart rate emerged as the dominant latent score drivers across all models and resampling runs, consistent with established physiology.

What carries the argument

The composite selection criterion that aggregates reconstruction MSE, Spearman and Kendall correlations, mutual information, and permutation importance via modified Borda count to choose among unsupervised models.

If this is right

The deep autoencoder provides the best balance of accurate reconstruction and interpretable latent scores.
Hidden layer capacity, not the bottleneck dimension, limits performance in these models.
Running pace, aerobic decoupling, and average heart rate are the primary physiological drivers of the derived performance score.
Bootstrap validation confirms the stability of the feature rankings and model selection.
A simple two-feature linear baseline underperforms the neural models on the composite criterion.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

If the latent score proves stable over time, it could serve as a real-time training load monitor without needing race results.
The same approach might generalize to other endurance sports where multiple sensors produce unlabeled data streams.
Widening layers in other reduction methods could close the gap further if computational resources allow.
The dominance of pace and heart rate metrics suggests the latent score largely rediscovers known training principles rather than uncovering new ones.

Load-bearing premise

The composite selection criterion selects models that produce a physiologically meaningful latent performance score even without any ground-truth labels.

What would settle it

Observing that the top-ranked model on the composite criterion produces latent scores that show no correlation with actual changes in runner fitness over a training season would falsify the claim that the method yields meaningful performance indicators.

Figures

Figures reproduced from arXiv: 2606.28145 by Jan Zubalewicz, Krzysztof Siwek, Mateusz Kubita.

**Figure 3.** Figure 3: Spearman correlation matrix of latent scores produced by the five [PITH_FULL_IMAGE:figures/full_fig_p004_3.png] view at source ↗

**Figure 2.** Figure 2: Reconstruction MSE across the five models on the held out test [PITH_FULL_IMAGE:figures/full_fig_p004_2.png] view at source ↗

read the original abstract

Wearable devices produce large, high dimensional training logs for everyday runners, and interpretation rather than data collection is now the limiting step. This paper evaluates five dimensionality reduction models, three autoencoder variants, PCA, and a Variational Autoencoder, on their ability to compress nine sensor runner profiles into a single scalar performance indicator, the latent score. Because the setting is fully unsupervised, model quality is assessed along two complementary axes: reconstruction error (Mean Squared Error) and latent score interpretability, measured via Spearman and Kendall rank correlations, Mutual Information, and Permutation Importance. These are combined into a composite selection criterion that prevents selecting models on reconstruction accuracy alone. Feature rankings from the four metrics are aggregated via a modified Borda count, and their stability is confirmed by bootstrap validation. A two feature linear baseline is included to anchor the comparison. Deep autoencoder achieved the lowest reconstruction error and the highest composite score. Once the PCA hidden layers were widened, the deeper variants became closely competitive with Deep AE on the composite criterion, indicating that the limiting factor was hidden layer capacity rather than the one dimensional bottleneck. Running pace, aerobic decoupling, and average heart rate emerged as the dominant latent score drivers across all models and resampling runs, consistent with established physiology.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

Standard autoencoders and PCA on runner telemetry with a Borda-based composite metric; the latent score stays internally consistent but untested against real performance outcomes.

read the letter

This paper takes routine autoencoder and PCA methods, applies them to nine sensor channels from runner wearables, and uses a composite rule to pick the model that best compresses the data into a one-dimensional latent score. The composite combines MSE with rank correlations, mutual information, and permutation importance, then aggregates via modified Borda count plus bootstrap checks. A two-feature linear baseline is included for reference.

The work is clear on its unsupervised constraints and reports that pace, aerobic decoupling, and average heart rate dominate the latent score across models, which lines up with basic physiology. Widening the PCA layers makes it competitive with the deep autoencoder, so the paper correctly identifies capacity rather than the bottleneck as the main limit.

The soft spot is exactly the one the stress-test note flags: every axis in the composite is computed from the same unlabeled features and model internals. Nothing is checked against external athletic outcomes such as race times, training response, or coach ratings. The claim that the resulting score is a performance indicator therefore rests on the untested premise that internal coherence equals physiological validity.

This is an incremental application paper aimed at people who build or evaluate consumer training tools. It is honest about its methods and does not overclaim novelty, but the lack of any external anchor keeps the practical payoff speculative. I would send it to peer review so referees can ask for a small validation experiment against held-out performance data; the core protocol is reproducible enough to make that request straightforward.

Referee Report

1 major / 2 minor

Summary. The manuscript evaluates five dimensionality reduction models (three autoencoder variants, PCA, and VAE) for compressing nine-dimensional wearable sensor data from runners into a one-dimensional latent performance score. Quality is assessed via reconstruction MSE together with interpretability metrics (Spearman and Kendall rank correlations, mutual information, permutation importance) that are aggregated by a modified Borda count; bootstrap resampling confirms stability of the ranking. A two-feature linear baseline is included. The deep autoencoder is reported to achieve the lowest MSE and highest composite score; widening the PCA layers renders deeper variants competitive, leading to the conclusion that hidden-layer capacity rather than the 1-D bottleneck is the limiting factor. Running pace, aerobic decoupling, and average heart rate are identified as the dominant drivers across models.

Significance. If the composite internal-consistency criterion reliably identifies models whose latent scores track athletic performance, the work would offer a practical unsupervised tool for wearable telemetry analysis. The bootstrap validation protocol and the explicit linear baseline are methodological strengths that improve reproducibility. However, because all four interpretability axes are computed from the same unlabeled sensor features and model internals, the significance is currently limited to internal compression quality rather than demonstrated physiological validity.

major comments (1)

[Abstract] Abstract and evaluation protocol: the central claim that the latent score constitutes a physiologically meaningful performance indicator rests on the modified Borda aggregation of reconstruction MSE, Spearman/Kendall correlations, mutual information, and permutation importance. All four axes are derived solely from the unlabeled telemetry and model internals; none are validated against external performance outcomes (e.g., race times, training logs, or physiological markers). This makes the superiority of the deep autoencoder an internal-consistency result rather than evidence that the 1-D latent variable tracks actual athletic performance.

minor comments (2)

[Methods] The exact functional form of the modified Borda count and the weighting of the four interpretability metrics should be stated with an equation in the methods section.
[Results] The phrase 'PCA hidden layers' is unclear because PCA is a linear method; clarify whether this refers to an autoencoder-style implementation of PCA or to the number of retained components.

Simulated Author's Rebuttal

1 responses · 0 unresolved

We thank the referee for the detailed review and constructive feedback. We agree that the evaluation protocol relies exclusively on internal consistency metrics computed from unlabeled telemetry and model internals, without external validation against ground-truth performance outcomes. This limits the strength of claims regarding physiological validity of the latent score. We will revise the manuscript accordingly to clarify the scope of our results.

read point-by-point responses

Referee: [Abstract] Abstract and evaluation protocol: the central claim that the latent score constitutes a physiologically meaningful performance indicator rests on the modified Borda aggregation of reconstruction MSE, Spearman/Kendall correlations, mutual information, and permutation importance. All four axes are derived solely from the unlabeled telemetry and model internals; none are validated against external performance outcomes (e.g., race times, training logs, or physiological markers). This makes the superiority of the deep autoencoder an internal-consistency result rather than evidence that the 1-D latent variable tracks actual athletic performance.

Authors: We acknowledge the validity of this observation. The manuscript is framed as an unsupervised dimensionality-reduction study, and the composite criterion (reconstruction MSE + interpretability metrics aggregated by modified Borda count) is explicitly internal. The abstract and introduction do use the term “performance indicator,” which can be read as implying external validity that is not demonstrated. We will revise the abstract, introduction, and conclusion to state that the latent score is a data-driven scalar that (i) minimizes reconstruction error while (ii) aligning with physiologically plausible feature rankings according to the four internal metrics, and (iii) that external validation against race times or training logs remains future work. We will also add an explicit limitations paragraph on the absence of labeled performance outcomes. revision: yes

Circularity Check

0 steps flagged

No significant circularity detected

full rationale

The paper evaluates five dimensionality reduction models using independent external metrics (reconstruction MSE, Spearman/Kendall rank correlations, mutual information, permutation importance) that are aggregated via modified Borda count into a composite criterion. These metrics are computed from the data and model outputs without reducing to a self-referential definition or fitted parameter; the claim that Deep AE achieves the highest composite score follows directly from applying these standard, non-derived criteria rather than any equation or self-citation that forces the result by construction. A linear baseline is included for anchoring, and no load-bearing step invokes uniqueness theorems or ansatzes from prior self-work.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 1 invented entities

The central claim rests on the domain assumption that a one-dimensional latent variable derived from unsupervised compression can serve as a physiologically meaningful performance indicator, plus the modeling choice that the composite of reconstruction and interpretability metrics is the right way to rank models without labels.

axioms (2)

domain assumption Unsupervised compression of sensor streams yields a latent variable whose value correlates with established physiological performance markers
Invoked when the paper treats the latent score as a performance indicator and validates it via correlations with pace, heart rate, and aerobic decoupling.
ad hoc to paper A modified Borda count aggregation of four interpretability metrics plus reconstruction error produces a stable model ranking
The composite selection criterion is introduced without external justification or comparison to alternative aggregation methods.

invented entities (1)

latent score no independent evidence
purpose: Single scalar performance indicator obtained from the bottleneck of the dimensionality reduction model
The paper introduces this derived quantity as the target output; no independent physiological validation or falsifiable prediction outside the model is provided.

pith-pipeline@v0.9.1-grok · 5754 in / 1663 out tokens · 32406 ms · 2026-06-29T04:50:17.520266+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

20 extracted references · 10 canonical work pages

[1]

Effective injury forecasting in soccer with GPS training data and machine learning,

A. Rossi, L. Pappalardo, P. Cintia, F. M. Iaia, J. Fernández, and D. Medina, “Effective injury forecasting in soccer with GPS training data and machine learning,”PLOS ONE, vol. 13, no. 7, p. e0201264,
[2]

DOI: https://doi.org/10.1371/journal.pone.0201264

work page doi:10.1371/journal.pone.0201264
[3]

Rank aggregation methods for the web,

C. Dwork, R. Kumar, M. Naor, and D. Sivakumar, “Rank aggregation methods for the web,” inProc. 10th Int. Conf. World Wide Web, 2001, pp. 613–622. DOI: https://doi.org/10.1145/371920.372165

work page doi:10.1145/371920.372165 2001
[4]

Reducing the dimensionality of data with neural networks,

G. E. Hinton and R. R. Salakhutdinov, “Reducing the dimensionality of data with neural networks,”Science, vol. 313, no. 5786, pp. 504–507,
[5]

DOI: https://doi.org/10.1126/science.1127647

work page doi:10.1126/science.1127647
[6]

Auto-encoding variational Bayes,

D. P. Kingma and M. Welling, “Auto-encoding variational Bayes,” in Proc. Int. Conf. Learn. Representations (ICLR), 2014. arXiv: https:// arxiv.org/abs/1312.6114

Pith/arXiv arXiv 2014
[7]

A unified approach to interpreting model predictions,

S. M. Lundberg and S.-I. Lee, “A unified approach to interpreting model predictions,” inAdvances in Neural Information Processing Systems 30, 2017, pp. 4765–4774

2017
[8]

High-intensity interval training, solu- tions to the programming puzzle: Part II: anaerobic energy, neuromus- cular load and practical applications,

M. Buchheit and P. B. Laursen, “High-intensity interval training, solu- tions to the programming puzzle: Part II: anaerobic energy, neuromus- cular load and practical applications,”Sports Medicine, vol. 43, no. 10, pp. 927–954, 2013. DOI: https://doi.org/10.1007/s40279-013-0066-5

work page doi:10.1007/s40279-013-0066-5 2013
[9]

Autoencoders, unsupervised learning, and deep architectures,

P. Baldi, “Autoencoders, unsupervised learning, and deep architectures,” Proc. ICML Workshop on Unsupervised and Transfer Learning, 2012, pp. 37–49

2012
[10]

Random forests,

L. Breiman, “Random forests,”Machine Learning, vol. 45, no. 1, pp. 5– 32, 2001. DOI: https://doi.org/10.1023/A:1010933404324

work page doi:10.1023/a:1010933404324 2001
[11]

Permutation importance: a corrected feature importance measure,

A. Altmann, L. Tolo¸ si, O. Sander, and T. Lengauer, “Permutation importance: a corrected feature importance measure,”Bioinformatics, vol. 26, no. 10, pp. 1340–1347, 2010. DOI: https://doi.org/10.1093/ bioinformatics/btq134

2010
[12]

Smartwatch telemetry validation for physiological assessment in ambulatory environments,

J. Alemany et al., “Smartwatch telemetry validation for physiological assessment in ambulatory environments,”IEEE Sensors Journal, vol. 21, no. 14, pp. 15890–15901, 2021. DOI: https://doi.org/10.1109/JSEN. 2021.3072114

work page doi:10.1109/jsen 2021
[13]

Monitoring training status with heart rate measures: do all roads lead to Rome?

M. Buchheit, “Monitoring training status with heart rate measures: do all roads lead to Rome?”Frontiers in Physiology, vol. 5, p. 73, 2014. DOI: https://doi.org/10.3389/fphys.2014.00073

work page doi:10.3389/fphys.2014.00073 2014
[14]

Molnar,Interpretable Machine Learning: A Guide for Making Black Box Models Explainable, 2nd ed., 2022

C. Molnar,Interpretable Machine Learning: A Guide for Making Black Box Models Explainable, 2nd ed., 2022. URL: https://christophm.github. io/interpretable-ml-book/

2022
[15]

Heart rate variability in elite triathletes: is variation in variability the key to effective training? A case comparison,

D. J. Plews, P. B. Laursen, A. E. Kilding, and M. Buchheit, “Heart rate variability in elite triathletes: is variation in variability the key to effective training? A case comparison,”European Journal of Applied Physiology, vol. 112, no. 11, pp. 3729–3741, 2012. DOI: https://doi. org/10.1007/s00421-011-2317-6

work page doi:10.1007/s00421-011-2317-6 2012
[16]

Efron and R

B. Efron and R. J. Tibshirani,An Introduction to the Bootstrap. New York, NY: Chapman and Hall/CRC, 1993. DOI: https://doi.org/10.1201/ 9780429246593

1993
[17]

The training-injury prevention paradox: should athletes be training smarter and harder?

T. J. Gabbett, “The training-injury prevention paradox: should athletes be training smarter and harder?”British Journal of Sports Medicine, vol. 50, no. 5, pp. 273–280, 2016. DOI: https://doi.org/10.1136/ bjsports-2015-095788

2016
[18]

Hochreiter and J

S. Hochreiter and J. Schmidhuber, “Long short-term memory,”Neural Computation, vol. 9, no. 8, pp. 1735–1780, 1997. DOI: https://doi.org/ 10.1162/neco.1997.9.8.1735

work page doi:10.1162/neco.1997.9.8.1735 1997
[19]

Machine and deep learning for sport-specific movement recognition: a systematic review of model development and performance,

E. E. Cust, A. J. Sweeting, K. Ball, and S. Robertson, “Machine and deep learning for sport-specific movement recognition: a systematic review of model development and performance,”Journal of Sports Sciences, vol. 37, no. 5, pp. 568–600, 2019. DOI: https://doi.org/10. 1080/02640414.2018.1521769

arXiv 2019
[20]

2009.The Elements of Statistical Learning: Data Mining, Inference, and Prediction(2 ed.)

T. Hastie, R. Tibshirani, and J. Friedman,The Elements of Statistical Learning: Data Mining, Inference, and Prediction, 2nd ed. New York, NY: Springer, 2009. DOI: https://doi.org/10.1007/978-0-387-84858-7

work page doi:10.1007/978-0-387-84858-7 2009

[1] [1]

Effective injury forecasting in soccer with GPS training data and machine learning,

A. Rossi, L. Pappalardo, P. Cintia, F. M. Iaia, J. Fernández, and D. Medina, “Effective injury forecasting in soccer with GPS training data and machine learning,”PLOS ONE, vol. 13, no. 7, p. e0201264,

[2] [2]

DOI: https://doi.org/10.1371/journal.pone.0201264

work page doi:10.1371/journal.pone.0201264

[3] [3]

Rank aggregation methods for the web,

C. Dwork, R. Kumar, M. Naor, and D. Sivakumar, “Rank aggregation methods for the web,” inProc. 10th Int. Conf. World Wide Web, 2001, pp. 613–622. DOI: https://doi.org/10.1145/371920.372165

work page doi:10.1145/371920.372165 2001

[4] [4]

Reducing the dimensionality of data with neural networks,

G. E. Hinton and R. R. Salakhutdinov, “Reducing the dimensionality of data with neural networks,”Science, vol. 313, no. 5786, pp. 504–507,

[5] [5]

DOI: https://doi.org/10.1126/science.1127647

work page doi:10.1126/science.1127647

[6] [6]

Auto-encoding variational Bayes,

D. P. Kingma and M. Welling, “Auto-encoding variational Bayes,” in Proc. Int. Conf. Learn. Representations (ICLR), 2014. arXiv: https:// arxiv.org/abs/1312.6114

Pith/arXiv arXiv 2014

[7] [7]

A unified approach to interpreting model predictions,

S. M. Lundberg and S.-I. Lee, “A unified approach to interpreting model predictions,” inAdvances in Neural Information Processing Systems 30, 2017, pp. 4765–4774

2017

[8] [8]

High-intensity interval training, solu- tions to the programming puzzle: Part II: anaerobic energy, neuromus- cular load and practical applications,

M. Buchheit and P. B. Laursen, “High-intensity interval training, solu- tions to the programming puzzle: Part II: anaerobic energy, neuromus- cular load and practical applications,”Sports Medicine, vol. 43, no. 10, pp. 927–954, 2013. DOI: https://doi.org/10.1007/s40279-013-0066-5

work page doi:10.1007/s40279-013-0066-5 2013

[9] [9]

Autoencoders, unsupervised learning, and deep architectures,

P. Baldi, “Autoencoders, unsupervised learning, and deep architectures,” Proc. ICML Workshop on Unsupervised and Transfer Learning, 2012, pp. 37–49

2012

[10] [10]

Random forests,

L. Breiman, “Random forests,”Machine Learning, vol. 45, no. 1, pp. 5– 32, 2001. DOI: https://doi.org/10.1023/A:1010933404324

work page doi:10.1023/a:1010933404324 2001

[11] [11]

Permutation importance: a corrected feature importance measure,

A. Altmann, L. Tolo¸ si, O. Sander, and T. Lengauer, “Permutation importance: a corrected feature importance measure,”Bioinformatics, vol. 26, no. 10, pp. 1340–1347, 2010. DOI: https://doi.org/10.1093/ bioinformatics/btq134

2010

[12] [12]

Smartwatch telemetry validation for physiological assessment in ambulatory environments,

J. Alemany et al., “Smartwatch telemetry validation for physiological assessment in ambulatory environments,”IEEE Sensors Journal, vol. 21, no. 14, pp. 15890–15901, 2021. DOI: https://doi.org/10.1109/JSEN. 2021.3072114

work page doi:10.1109/jsen 2021

[13] [13]

Monitoring training status with heart rate measures: do all roads lead to Rome?

M. Buchheit, “Monitoring training status with heart rate measures: do all roads lead to Rome?”Frontiers in Physiology, vol. 5, p. 73, 2014. DOI: https://doi.org/10.3389/fphys.2014.00073

work page doi:10.3389/fphys.2014.00073 2014

[14] [14]

Molnar,Interpretable Machine Learning: A Guide for Making Black Box Models Explainable, 2nd ed., 2022

C. Molnar,Interpretable Machine Learning: A Guide for Making Black Box Models Explainable, 2nd ed., 2022. URL: https://christophm.github. io/interpretable-ml-book/

2022

[15] [15]

Heart rate variability in elite triathletes: is variation in variability the key to effective training? A case comparison,

D. J. Plews, P. B. Laursen, A. E. Kilding, and M. Buchheit, “Heart rate variability in elite triathletes: is variation in variability the key to effective training? A case comparison,”European Journal of Applied Physiology, vol. 112, no. 11, pp. 3729–3741, 2012. DOI: https://doi. org/10.1007/s00421-011-2317-6

work page doi:10.1007/s00421-011-2317-6 2012

[16] [16]

Efron and R

B. Efron and R. J. Tibshirani,An Introduction to the Bootstrap. New York, NY: Chapman and Hall/CRC, 1993. DOI: https://doi.org/10.1201/ 9780429246593

1993

[17] [17]

The training-injury prevention paradox: should athletes be training smarter and harder?

T. J. Gabbett, “The training-injury prevention paradox: should athletes be training smarter and harder?”British Journal of Sports Medicine, vol. 50, no. 5, pp. 273–280, 2016. DOI: https://doi.org/10.1136/ bjsports-2015-095788

2016

[18] [18]

Hochreiter and J

S. Hochreiter and J. Schmidhuber, “Long short-term memory,”Neural Computation, vol. 9, no. 8, pp. 1735–1780, 1997. DOI: https://doi.org/ 10.1162/neco.1997.9.8.1735

work page doi:10.1162/neco.1997.9.8.1735 1997

[19] [19]

Machine and deep learning for sport-specific movement recognition: a systematic review of model development and performance,

E. E. Cust, A. J. Sweeting, K. Ball, and S. Robertson, “Machine and deep learning for sport-specific movement recognition: a systematic review of model development and performance,”Journal of Sports Sciences, vol. 37, no. 5, pp. 568–600, 2019. DOI: https://doi.org/10. 1080/02640414.2018.1521769

arXiv 2019

[20] [20]

2009.The Elements of Statistical Learning: Data Mining, Inference, and Prediction(2 ed.)

T. Hastie, R. Tibshirani, and J. Friedman,The Elements of Statistical Learning: Data Mining, Inference, and Prediction, 2nd ed. New York, NY: Springer, 2009. DOI: https://doi.org/10.1007/978-0-387-84858-7

work page doi:10.1007/978-0-387-84858-7 2009