arxiv: 2605.13520 · v1 · submitted 2026-05-13 · ❄️ cond-mat.stat-mech · cs.LG

Recognition: unknown

Beyond Explained Variance: A Cautionary Tale of PCA

Gionni Marchetti

Authors on Pith no claims yet

Pith reviewed 2026-05-14 18:25 UTC · model grok-4.3

classification ❄️ cond-mat.stat-mech cs.LG

keywords PCAt-SNEpersistent homologymanifold learningdata visualizationring structurefossil teeth

0 comments

The pith

PCA scatterplots can falsely suggest clusters in data that actually form a simple ring with no clusters.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper examines how principal component analysis fails to reveal the true structure of high-dimensional data lying on a nonlinear manifold. Using a dataset of fossil teeth from the early mammal Kuehneotherium, the standard PCA plot appears to show distinct clusters in one region, but t-SNE embeddings combined with persistent homology instead show the points arranged in a ring-like shape with intrinsic dimension one and no real clusters. The authors introduce a generative model of points sampled uniformly from a unit circle, under which pairwise cosine distances follow an arcsine distribution that qualitatively matches the U-shaped pattern observed in the data.

Core claim

For the Kuehneotherium fossil teeth measurements, the PCA scatterplot reported in prior work displays apparent clustering where the second principal component is negative, yet t-SNE and persistent homology analysis show the data points form a ring with no evident clustering and one-dimensional intrinsic geometry. A probabilistic model in which points are drawn uniformly from a unit circle produces an arcsine distribution for pairwise cosine distances, in qualitative agreement with the U-shaped distribution present in the actual data.

What carries the argument

The generative probabilistic-geometric model of uniform sampling from a unit circle, which produces the arcsine law for cosine distances and thereby supports the ring topology identified by t-SNE and persistent homology over PCA clustering.

If this is right

Methods that rely primarily on explained variance for visualization can distort the apparent geometry of data on nonlinear manifolds.
Combining t-SNE with persistent homology can detect one-dimensional ring structures that PCA scatterplots obscure as clusters.
Matching observed distance histograms to the arcsine distribution supplies an independent check on whether a circular manifold model is plausible.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Analyses in paleontology or materials science that use PCA for low-dimensional projections may need routine checks with topology-sensitive tools to avoid mistaking rings for clusters.
If many biological or physical datasets prove to lie on such low-dimensional circles, distance-distribution diagnostics could become a standard preprocessing step before choosing a visualization method.
The qualitative agreement between model and data could be turned into a quantitative test by deriving exact moments or goodness-of-fit statistics for the arcsine law on finite samples.

Load-bearing premise

That uniform sampling from a unit circle is the correct description of how the data were generated and that the qualitative match between the model's arcsine distance distribution and the observed U-shape provides independent confirmation rather than post-hoc fitting.

What would settle it

A new measurement or reanalysis of the same teeth data that shows either persistent homology confirming clusters aligned with the PCA result or a distance distribution that deviates substantially from the arcsine shape while still forming a ring in t-SNE.

Figures

Figures reproduced from arXiv: 2605.13520 by Gionni Marchetti.

**Figure 3.** Figure 3: , while the corresponding estimated parameters are reported in Table I. For completeness, we also report the Akaike Information Criterion (AIC) [10, 47] and the Bayesian Information Criterion (BIC) [10, 48] for the two models. We obtain AIC ≈ 6011 and BIC ≈ 6042 for the GMM, whereas AIC ≈ 5127 and BIC ≈ 5159 for the BMM. Accordingly, the Beta mixture model provides the preferred fit to the empirical data,… view at source ↗

read the original abstract

We address shortcomings of principal component analysis (PCA) for visualizing high-dimensional data lying on a nonlinear low-dimensional manifold via two-dimensional scatterplots, focusing on a fossil teeth dataset from the early mammalian insectivore Kuehneotherium. While the PCA scatterplot reported by Jolliffe and Cadima (Philosophical Transactions of the Royal Society A, 2016) shows clustering in the region where PC2 < 0, our analysis based on t-SNE and persistent homology (PH) reveals a ring-like structure with no evident clustering and intrinsic dimensionality equal to one. We further propose a generative probabilistic-geometric model in which the data are sampled uniformly from a unit circle. Under this model, pairwise cosine distances follow an arcsine distribution, in qualitative agreement with the observed U-shaped distribution, thereby independently supporting the analysis based on tt t-SNE and persistent homology.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

The paper gives a useful example of PCA distortion on manifold data but the circle model does not independently support the ring claim.

read the letter

The main thing to know is that this paper re-analyzes the Kuehneotherium teeth measurements and finds a ring-like structure of intrinsic dimension one instead of the clusters shown in the earlier PCA scatterplot from Jolliffe and Cadima. t-SNE and persistent homology are used to reach that conclusion, followed by a simple generative model of uniform points on a unit circle whose pairwise cosine distances produce an arcsine distribution that matches the observed U-shaped histogram in a qualitative way.

Referee Report

2 major / 1 minor

Summary. The paper argues that PCA applied to a high-dimensional fossil teeth dataset from Kuehneotherium produces misleading scatterplots that suggest clustering (particularly for PC2 < 0), whereas t-SNE and persistent homology instead reveal a ring-like structure of intrinsic dimension 1 with no evident clusters. The authors introduce a generative model in which points are sampled uniformly from a unit circle; under this model the distribution of pairwise cosine distances is arcsine, which they report is in qualitative agreement with the observed U-shaped histogram and thereby provides independent support for the ring topology.

Significance. If the ring structure and its generative description can be placed on firmer quantitative footing, the work supplies a concrete cautionary example of PCA's limitations for nonlinear manifolds in paleontological data analysis and illustrates the value of combining manifold-learning visualizations with simple probabilistic-geometric models.

major comments (2)

[Abstract] Abstract and generative-model section: the assertion that the arcsine distribution supplies 'independent support' for the ring structure is circular. The model is introduced specifically to reproduce the observed U-shaped distance histogram; once uniform sampling on the circle is assumed, the arcsine law for the cosine of the angular difference follows immediately by transformation of variables and therefore cannot corroborate the topology inferred from t-SNE/PH.
[Abstract] Abstract and results section on distance distributions: no quantitative goodness-of-fit statistic (Kolmogorov-Smirnov distance, chi-squared test, or bootstrap confidence bands on the histogram) is reported for the arcsine law versus the empirical pairwise-cosine distribution. Qualitative visual agreement alone leaves the support for the generative model suggestive rather than conclusive.

minor comments (1)

[Abstract] Abstract contains the typographical error 'tt t-SNE' (should read 't-SNE').

Simulated Author's Rebuttal

2 responses · 0 unresolved

We are grateful to the referee for their detailed and insightful comments, which have helped us improve the clarity and rigor of our manuscript. We respond to each major comment below.

read point-by-point responses

Referee: [Abstract] Abstract and generative-model section: the assertion that the arcsine distribution supplies 'independent support' for the ring structure is circular. The model is introduced specifically to reproduce the observed U-shaped distance histogram; once uniform sampling on the circle is assumed, the arcsine law for the cosine of the angular difference follows immediately by transformation of variables and therefore cannot corroborate the topology inferred from t-SNE/PH.

Authors: We appreciate this observation and agree that the term 'independent support' could be misinterpreted. The ring structure is primarily inferred from the t-SNE visualization and persistent homology analysis, which indicate a one-dimensional manifold without clusters. The generative model of uniform sampling on a unit circle is then introduced as a minimal probabilistic model that reproduces the characteristic U-shaped histogram of pairwise cosine distances observed in the data. While the arcsine distribution follows directly from the model assumptions, the fact that this simple model aligns with the empirical distance distribution provides a parsimonious geometric explanation consistent with the manifold-learning results. We will revise the abstract and the generative-model section to replace 'independently supporting' with language emphasizing consistency and explanatory power, avoiding any implication of statistical independence. revision: partial
Referee: [Abstract] Abstract and results section on distance distributions: no quantitative goodness-of-fit statistic (Kolmogorov-Smirnov distance, chi-squared test, or bootstrap confidence bands on the histogram) is reported for the arcsine law versus the empirical pairwise-cosine distribution. Qualitative visual agreement alone leaves the support for the generative model suggestive rather than conclusive.

Authors: We concur that quantitative measures would strengthen the presentation. In the revised version, we will compute and report a Kolmogorov-Smirnov statistic for the fit between the empirical distribution of pairwise cosine distances and the theoretical arcsine distribution. Additionally, we will include bootstrap-derived confidence bands on the empirical histogram to provide a visual and quantitative assessment of the agreement. These additions will make the support for the generative model more conclusive. revision: yes

Circularity Check

1 steps flagged

Arcsine agreement follows by construction from the circle model chosen to match the observed ring, supplying no independent support

specific steps

fitted input called prediction [Abstract]
"We further propose a generative probabilistic-geometric model in which the data are sampled uniformly from a unit circle. Under this model, pairwise cosine distances follow an arcsine distribution, in qualitative agreement with the observed U-shaped distribution, thereby independently supporting the analysis based on t-SNE and persistent homology."

The arcsine distribution for the dot product of two independent uniform points on the circle follows immediately from the angular difference being uniform on [0, π] via standard change of variables. Once the model is selected to reproduce the ring structure and U-shape seen in the data, the agreement is mathematically forced and cannot furnish independent corroboration of the data-generating process or of the t-SNE/PH topology.

full rationale

The paper's topology conclusion rests on t-SNE and persistent homology. The generative model is then posited as uniform sampling on the unit circle specifically to account for the ring and the U-shaped distances. The arcsine law for cosine distances is a standard transformation of variables once the model is fixed, so the reported qualitative agreement is guaranteed rather than corroborative. This matches the fitted-input-called-prediction pattern and is presented as independent support, creating partial circularity in the derivation chain even though the core manifold inference itself is not reduced.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The load-bearing elements are the assumption that the data-generating process is uniform sampling on a circle and the interpretation that qualitative distance-distribution agreement constitutes independent confirmation.

axioms (1)

domain assumption The observed data points can be modeled as uniformly sampled from a unit circle in some embedding space.
This is the generative model proposed to explain the U-shaped pairwise distance distribution.

pith-pipeline@v0.9.0 · 5441 in / 1396 out tokens · 49037 ms · 2026-05-14T18:25:17.811249+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

57 extracted references · 57 canonical work pages · 1 internal anchor

[1]

To this end, various heuristics exist

Estimating Intrinsic Dimensionality using PCA Within PCA, the methods for estimating the intrin- sic (or effective) dimension, which, in this context, cor- responds to deciding how many principal components (PCs) to keep. To this end, various heuristics exist. In this work, we shall consider the following ones: Kaiser criterion (also known as the Kaiser-G...

work page 2004
[2]

Pearson, Philosophical Magazine Series 12, 559 (1901)

K. Pearson, Philosophical Magazine Series 12, 559 (1901)

work page 1901
[3]

Hotelling, Journal of Educational Psychology24, 498 (1933)

H. Hotelling, Journal of Educational Psychology24, 498 (1933)

work page 1933
[4]

I. T. Jolliffe,Principal Component Analysis, Springer Series in Statistics (Springer, New York, NY, 2002), 2nd ed., ISBN 978-0-387-95442-4, springer Science+Business Media New York; eBook ISBN: 978-0-387-22440-4; Softcover ISBN: 978-1- 4419-2999-0; Published in Springer Book Archive

work page 2002
[5]

A Tutorial on Principal Component Analysis

J. Shlens,A tutorial on principal component analysis(2014), 1404.1100, URLhttps://arxiv.org/abs/1404.1100

work page internal anchor Pith review Pith/arXiv arXiv 2014
[6]

Greenacre, P

M. Greenacre, P. J. F. Groenen, T. Hastie, A. I. D’Enza, A. I. Markos, and E. Tuzhilina, Nature Reviews Methods Primers 2(2022)

work page 2022
[7]

Scheidgen, L

M. Scheidgen, L. Himanen, A. Ladines, D. Sikter, M. Nakhaee, ´A. Fekete, T.-C. Chang, A. Golparvar, J. Mar´ ıquez, S. Brockhauser, et al., Journal of Open Source Software8, 5388 (2023), URLhttps://doi.org/10.21105/joss.05388

work page doi:10.21105/joss.05388 2023
[8]

M. K. Horton, P. Huck, R. X. Yang, J. M. Munro, S. Dwaraknath, A. M. Ganose, R. S. Kingsbury, M. Wen, J. X. Shen, T. S. Mathis, et al., Nature Materials24, 1522 (2025), ISSN 1476-4660, URLhttps://doi.org/10.1038/s41563-025-02272-0

work page doi:10.1038/s41563-025-02272-0 2025
[9]

H. M. Berman, T. Battistuz, T. N. Bhat, W. Bluhm, P. E. Bourne, K. Burkhardt, L. Iype, S. Jain, P. Fagan, J. Marvin, et al., Nucleic Acids Research28, 235 (2000), the worldwide repository of experimentally determined macromolecular structures, URLhttps://www.rcsb.org/

work page 2000
[10]

Varadi, D

M. Varadi, D. Bertoni, P. Magana, U. Paramval, I. Pidruchna, M. Radhakrishnan, M. Tsenkov, S. Nair, M. Mirdita, J. Yeo, et al., Nucleic Acids Research52, D368 (2024)

work page 2024
[11]

Hastie, R

T. Hastie, R. Tibshirani, and J. Friedman,The Elements of Statistical Learning, Springer Series in Statistics (Springer New York Inc., New York, NY, USA, 2017), 12th ed

work page 2017
[12]

M. P. Deisenroth, A. A. Faisal, and C. S. Ong,Mathematics for Machine Learning(Cambridge University Press, 2020)

work page 2020
[13]

P. G. Gill, M. A. Purnell, N. Crumpton, K. R. Brown, N. J. Gostling, M. Stampanoni, and E. J. Rayfield, Nature512, 303 (2014), ISSN 1476-4687, URLhttps://doi.org/10.1038/nature13622

work page doi:10.1038/nature13622 2014
[15]

van der Maaten and G

L. van der Maaten and G. Hinton, Journal of Machine Learning Research9, 2579 (2008)

work page 2008
[16]

Kobak and P

D. Kobak and P. Berens, Nature Communications10, 5416 (2019), ISSN 2041-1723, URLhttps://doi.org/10.1038/ s41467-019-13056-x

work page 2019
[17]

G. E. Carlsson, Bulletin of the American Mathematical Society46, 255 (2009)

work page 2009
[18]

Otter, M

N. Otter, M. A. Porter, U. Tillmann, P. Grindrod, and H. A. Harrington, EPJ Data Science6, 17 (2017), ISSN 2193-1127, URLhttps://doi.org/10.1140/epjds/s13688-017-0109-5

work page doi:10.1140/epjds/s13688-017-0109-5 2017
[19]

Wasserman, Annual Review of Statistics and Its Application5, 501 (2018)

L. Wasserman, Annual Review of Statistics and Its Application5, 501 (2018)

work page 2018
[20]

Munch, Journal of Learning Analytics4, 47–61 (2017), URLhttps://learning-analytics.info/index.php/JLA/ article/view/5196

E. Munch, Journal of Learning Analytics4, 47–61 (2017), URLhttps://learning-analytics.info/index.php/JLA/ article/view/5196

work page 2017
[21]

Chazal and B

F. Chazal and B. Michel, Frontiers in Artificial Intelligence4(2021)

work page 2021
[23]

Damrich, P

S. Damrich, P. Berens, and D. Kobak,Persistent homology for high-dimensional data based on spectral methods(2024), 2311.03087, URLhttps://arxiv.org/abs/2311.03087

work page arXiv 2024
[24]

L´ evy, Compositio Mathematica7, 283 (1939)

P. L´ evy, Compositio Mathematica7, 283 (1939)

work page 1939
[25]

Strang, The American Mathematical Monthly100, 848 (1993)

G. Strang, The American Mathematical Monthly100, 848 (1993)

work page 1993
[26]

G. W. Stewart, SIAM Review35, 551 (1993)

work page 1993
[27]

Shinn, Proceedings of the National Academy of Sciences120, e2311420120 (2023)

M. Shinn, Proceedings of the National Academy of Sciences120, e2311420120 (2023)

work page 2023
[28]

Pedregosa, G

F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, et al., Journal of Machine Learning Research12, 2825 (2011), URLhttp://jmlr.org/papers/v12/ pedregosa11a.html

work page 2011
[29]

H. F. Kaiser, Educational and Psychological Measurement20, 141 (1960), https://doi.org/10.1177/001316446002000116, URLhttps://doi.org/10.1177/001316446002000116

work page doi:10.1177/001316446002000116 1960
[30]

J. B. Tenenbaum, V. de Silva, and J. C. Langford, Science290, 2319 (2000)

work page 2000
[31]

G´ eron,Hands-On Machine Learning with Scikit-Learn and TensorFlow: Concepts, Tools, and Techniques to Build Intelligent Systems(O’ReillY, U.S.A, 2019)

A. G´ eron,Hands-On Machine Learning with Scikit-Learn and TensorFlow: Concepts, Tools, and Techniques to Build Intelligent Systems(O’ReillY, U.S.A, 2019)

work page 2019
[32]

M. A. Kramer, AIChE Journal37, 233 (1991)

work page 1991
[33]

Recanatesi, S

S. Recanatesi, S. Bradde, V. Balasubramanian, N. A. Steinmetz, and E. Shea-Brown, Patterns3, 100555 (2022), ISSN 2666-3899, URLhttps://www.sciencedirect.com/science/article/pii/S266638992200160X

work page 2022
[34]

Gavish and D

M. Gavish and D. L. Donoho, IEEE Transactions on Information Theory60, 5040 (2014)

work page 2014
[36]

V. A. Marˇ cenko and L. A. Pastur, Mathematics of the USSR-Sbornik1, 457 (1967)

work page 1967
[37]

de Bodt, A

C. de Bodt, A. Diaz-Papkovich, M. Bleher, K. Bunte, C. Coupette, S. Damrich, E. F. Sanmartin, F. A. Hamprecht, E. ´Agnes Horv´ at, D. Kohli, et al.,Low-dimensional embeddings of high-dimensional data(2025), 2508.15929, URLhttps: //arxiv.org/abs/2508.15929. 6

work page arXiv 2025
[38]

P. G. Poliˇ car, M. Straˇ zar, and B. Zupan, Journal of Statistical Software109, 1–30 (2024), URLhttps://www.jstatsoft. org/index.php/jss/article/view/v109i03

work page 2024
[39]

Villani,Optimal Transport: Old and New(Springer, Berlin, Heidelberg, 2008)

C. Villani,Optimal Transport: Old and New(Springer, Berlin, Heidelberg, 2008)

work page 2008
[40]

Peyr´ e and M

G. Peyr´ e and M. Cuturi, arXiv preprint arXiv:1803.00567 (2018)

work page arXiv 2018
[41]

giotto-tda:

G. Tauzin, U. Lupo, L. Tunstall, J. B. P´ erez, M. Caorsi, A. Medina-Mardones, A. Dassatti, and K. Hess,giotto-tda: A topological data analysis toolkit for machine learning and data exploration(2020), 2004.02551

work page arXiv 2020
[42]

K. P. Murphy,Machine learning - a probabilistic perspective(MIT Press, Cambridge, Massachusetts, 2012)

work page 2012
[43]

K. Zeng, C. E. P. De Jes´ us, A. J. Fox, and M. D. Graham, Machine Learning: Science and Technology5, 025053 (2024), URLhttps://doi.org/10.1088/2632-2153/ad4ba5

work page doi:10.1088/2632-2153/ad4ba5 2024
[44]

K. V. Mardia and P. E. Jupp,Directional Statistics, Wiley Series in Probability and Statistics (Wiley, 2000), ISBN 978-0471953333

work page 2000
[45]

Casella and R

G. Casella and R. L. Berger,Statistical Inference(Duxbury, 2002), 2nd ed

work page 2002
[46]

K. V. Bury,Statistical Distributions in Engineering(Cambridge University Press, 1999)

work page 1999
[47]

Wes McKinney, inProceedings of the 9th Python in Science Conference, edited by St´ efan van der Walt and Jarrod Millman (2010), pp. 56 – 61

work page 2010
[48]

Akaike, 2nd International Symposium on Information Theory pp

H. Akaike, 2nd International Symposium on Information Theory pp. 267–281 (1973)

work page 1973
[49]

Schwarz, The Annals of Statistics6, 461 (1978)

G. Schwarz, The Annals of Statistics6, 461 (1978)

work page 1978
[50]

Zanolli, F

C. Zanolli, F. Bouchet, J. Fortuny, F. Bernardini, C. Tuniz, and D. M. Alba, Journal of Human Evolution177, 103326 (2023), ISSN 0047-2484, URLhttps://www.sciencedirect.com/science/article/pii/S0047248423000039

work page 2023
[51]

P. G. Gill, Ph.D. thesis, University of Bristol (2004)

work page 2004
[52]

P. G. Gill, M. A. Purnell, N. Crumpton, K. R. Brown, N. J. Gostling, M. Stampanoni, and E. J. Rayfield, Nature512, 303 (2014)

work page 2014
[53]

P. G. Gill,Personal correspondence(2025)

work page 2025
[54]

D. F. Andrews, Biometrics28, 125 (1972)

work page 1972
[55]

Garc´ ıa-Osorio and C

C. Garc´ ıa-Osorio and C. Fyfe, Journal of Universal Computer Science11, 1806 (2005)

work page 2005
[56]

McKinney, inProceedings of the 9th Python in Science Conference, edited by S

W. McKinney, inProceedings of the 9th Python in Science Conference, edited by S. van der Walt and J. Millman (2010), pp. 56–61

work page 2010
[57]

I. T. Jolliffe and J. Cadima, Philosophical Transactions of the Royal Society A: Mathematical, Physical and Engineering Sciences374, 20150202 (2016)

work page 2016
[58]

Damrich, O

S. Damrich, O. Bobrowski, and P. Skraba, arXiv preprint arXiv:2305.15640 (2023)

work page arXiv 2023
[59]

R. Vershynin,High-Dimensional Probability: An Introduction with Applications in Data Science, Cambridge Series in Statistical and Probabilistic Mathematics (Cambridge University Press, 2018)

work page 2018
[60]

S. L. Brunton and J. N. Kutz,Data-Driven Science and Engineering: Machine Learning, Dynamical Systems, and Control (Cambridge University Press, 2019)

work page 2019