arxiv: 2603.20727 · v2 · submitted 2026-03-21 · 📊 stat.ME · stat.AP

Compositional regression using principal nested spheres

Pith reviewed 2026-05-15 07:29 UTC · model grok-4.3

classification 📊 stat.ME stat.AP
keywords: compositional regression, principal nested spheres, manifold-valued data, simplex geometry, sphere embedding, cylindrical coordinates, environmental exposure

The pith

Compositional regression succeeds by embedding simplex data on a sphere, reducing it via principal nested spheres to a cylinder with one circular and several linear scores, regressing there, and mapping predictions back.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Compositional responses lie on a simplex, and ordinary Euclidean regression handles them poorly because the sum-to-one and non-negativity constraints give the sample space a nonlinear geometry. The method first embeds the data in the positive orthant of a sphere. Principal nested spheres then produce an intermediate cylindrical space whose leading coordinate is circular and whose remaining coordinates are Euclidean. Regression is performed in this cylinder using standard techniques. The fitted values are transformed back to the simplex, which preserves the compositional constraint. A simulation study reports good performance for the procedure, and an application to environmental chemical exposure data illustrates its interpretability.
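The embedding and back-mapping steps can be sketched as follows. This is a minimal illustration, assuming the usual square-root transformation (the summary does not name the map it uses). The function names `simplex_to_sphere` and `sphere_to_simplex` are hypothetical, and the PNS and regression steps are omitted.

```python
import numpy as np

def simplex_to_sphere(p):
    """Map compositions (rows summing to 1) to the positive orthant of the
    unit sphere. Assumption: the square-root transformation is the embedding."""
    p = np.asarray(p, dtype=float)
    return np.sqrt(p)

def sphere_to_simplex(z):
    """Map points back to the simplex by renormalising and squaring.
    Squared coordinates of a unit vector are non-negative and sum to 1."""
    z = np.asarray(z, dtype=float)
    z = z / np.linalg.norm(z, axis=-1, keepdims=True)
    return z ** 2

# Two illustrative three-part compositions (made-up numbers).
comp = np.array([[0.2, 0.3, 0.5],
                 [0.6, 0.1, 0.3]])
z = simplex_to_sphere(comp)
back = sphere_to_simplex(z)
```

Because the back-map squares a unit vector, any fitted point on the sphere returns a valid composition, which is the property the summary relies on when it says predictions stay on the simplex.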

Core claim

By embedding compositional data into the positive orthant of the sphere and applying principal nested spheres, one obtains a cylindrical intermediate space with a leading circular score and Euclidean higher-order scores. Regression proceeds directly in this space, after which the estimates are mapped back to the original simplex.

What carries the argument

Principal Nested Spheres (PNS) applied to sphere-embedded compositional responses, producing a cylindrical space with one circular coordinate and the rest Euclidean for regression.

If this is right

  • Standard regression models can be used on compositional responses without directly violating the sum-to-one constraint.
  • The leading circular score isolates the dominant nonlinear variation while higher-order Euclidean scores permit linear adjustments.
  • Back-mapping of fitted values guarantees that all predictions remain valid compositions.
  • The framework extends the idea of intermediate-space regression to other manifold-valued response types.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • If the cylindrical reduction works reliably, analogous intermediate mappings could simplify regression on other curved manifolds such as positive definite matrices or tree spaces.
  • The circular leading score may correspond to interpretable physical or chemical cycles in applications, suggesting targeted validation against domain knowledge.
  • Allowing multiple circular scores in the intermediate space could handle compositional data with several independent nonlinear features.

Load-bearing premise

The sphere embedding together with the principal nested spheres reduction must preserve the main nonlinear relationships present in the original simplex data so that regression performed in the cylinder remains meaningful once mapped back.

What would settle it

Generate synthetic compositional responses from a known regression function on the simplex, embed and reduce them with principal nested spheres, fit the model in the cylinder, map the predictions back, and verify whether the recovered relationship matches the generating function within sampling variability.
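One hedged way to run such a check for three-part compositions is sketched below. It is not the authors' procedure: the generating function, the Dirichlet noise, and the sample sizes are illustrative assumptions, and a great-circle fit from an eigendecomposition stands in for full PNS (which would also consider small circles). Ordinary least squares is used for both scores, which is only reasonable here because the circular score spans a narrow arc.

```python
import numpy as np

rng = np.random.default_rng(1)

def true_composition(t):
    """Hypothetical generating function on the 3-part simplex."""
    t = np.asarray(t, dtype=float)
    return np.column_stack([(1 + t) / 4, (2 - t) / 4, np.full_like(t, 0.25)])

# 1. Simulate noisy compositions around the known curve (assumed Dirichlet noise).
n, kappa = 500, 200.0
t = rng.uniform(0.0, 1.0, size=n)
Y = np.vstack([rng.dirichlet(kappa * p) for p in true_composition(t)])

# 2. Square-root embedding into the positive orthant of the unit sphere.
Z = np.sqrt(Y)

# 3. Great-circle fit (stand-in for PNS): the circle's normal is the
#    eigenvector of Z^T Z with the smallest eigenvalue.
_, eigvecs = np.linalg.eigh(Z.T @ Z)
v = eigvecs[:, 0]
m = Z.mean(axis=0)
e1 = m - (m @ v) * v          # mean direction projected onto the circle's plane
e1 /= np.linalg.norm(e1)
e2 = np.cross(v, e1)          # completes an orthonormal basis (e1, e2, v)

def to_cylinder(Z):
    """Return (circular score, Euclidean residual score)."""
    theta = np.arctan2(Z @ e2, Z @ e1)          # angle along the fitted circle
    xi = np.arcsin(np.clip(Z @ v, -1.0, 1.0))   # signed distance from the circle
    return theta, xi

def from_cylinder(theta, xi):
    """Inverse of to_cylinder; always returns unit vectors."""
    on_circle = np.outer(np.cos(theta), e1) + np.outer(np.sin(theta), e2)
    return np.cos(xi)[:, None] * on_circle + np.outer(np.sin(xi), v)

theta, xi = to_cylinder(Z)

# 4. Regress each score on the covariate.
X = np.column_stack([np.ones(n), t])
b_theta, *_ = np.linalg.lstsq(X, theta, rcond=None)
b_xi, *_ = np.linalg.lstsq(X, xi, rcond=None)

# 5. Predict on a grid, map back to the simplex, compare with the truth.
t_new = np.linspace(0.0, 1.0, 11)
X_new = np.column_stack([np.ones_like(t_new), t_new])
P_hat = from_cylinder(X_new @ b_theta, X_new @ b_xi) ** 2
max_err = np.abs(P_hat - true_composition(t_new)).max()
```

A full version of the check would repeat this over many simulated data sets and compare `max_err` (or another discrepancy measure) with its sampling variability rather than inspecting a single run.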

Figures

Figures reproduced from arXiv: 2603.20727 by Florence George, Ian L. Dryden, Mymuna Monem, Natalia Soares Quinete.

Figure 1: The great circle fit (left, in green) and the small circle fit (right) for the 3D geochemical
Figure 2: A ternary diagram with fitted PNS subspheres. The first principal component from the
Figure 3: Simulated compositional data with five components. The ‘PNS score 1’ and ‘PNS all’
Figure 4: The number of chemicals detected in total over the matrices they represent (water, dust,
Figure 5: A map of the zip codes in Region A and Region B. Region A locations are mainly
Figure 6: The environmental chemical data with each row of the water, dust, food and soil vector
Figure 7: Above: the empirical cdf of the first two PNS scores for Region A (black) and Region
Figure 8: The PNS biplot for the environmental chemical data. (Above) Paths showing the
Original abstract

Regression with compositional responses is challenging due to the nonlinear geometry of the simplex and the limitations of Euclidean methods. We propose a regression framework for manifold-valued data based on mappings to statistically tractable intermediate spaces. For compositional data, responses are embedded in the positive orthant of the sphere and analysed using Principal Nested Spheres (PNS), yielding a cylindrical intermediate space with a circular leading score and Euclidean higher-order scores. Regression is performed in this intermediate space and fitted values are mapped back to the simplex. A simulation study demonstrates good performance of PNS-based regression. An application to environmental chemical exposure data illustrates the interpretability and practical utility of the method.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 1 minor

Summary. The paper proposes a regression framework for compositional responses by embedding them in the positive orthant of the sphere, applying Principal Nested Spheres (PNS) to obtain a cylindrical intermediate space (circular leading score plus Euclidean higher-order scores), performing regression in that space, and mapping fitted values back to the simplex. It reports a simulation study with good performance and an application to environmental chemical exposure data to illustrate interpretability.

Significance. If the PNS-derived cylindrical space adequately preserves the essential nonlinear structure of the original simplex data, the method supplies a geometrically motivated alternative to direct Euclidean regression on compositional data. The simulation study and real-data example constitute concrete evidence of performance and utility, which strengthens the contribution for a methodological statistics paper.

major comments (1)
  1. The abstract states that regression is performed in the cylindrical intermediate space, but does not specify the model used for the circular leading score (e.g., circular regression, projected linear model, or other). This detail is load-bearing for reproducibility and for evaluating whether the back-mapping step preserves regression properties; it should be stated explicitly with the relevant equations in the methods section.
minor comments (1)
  1. The abstract mentions 'good performance' in the simulation without naming the metrics or baseline comparators; adding one sentence on these would improve clarity for readers.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the positive evaluation and the recommendation of minor revision. The single major comment is addressed below; we agree that explicit specification of the regression model for the circular component will improve clarity and reproducibility.

Point-by-point responses
  1. Referee: The abstract states that regression is performed in the cylindrical intermediate space, but does not specify the model used for the circular leading score (e.g., circular regression, projected linear model, or other). This detail is load-bearing for reproducibility and for evaluating whether the back-mapping step preserves regression properties; it should be stated explicitly with the relevant equations in the methods section.

    Authors: We agree that the specific regression model for the circular leading score should be stated explicitly. The current manuscript describes regression in the cylindrical space (circular score plus Euclidean scores) in Section 3 but does not isolate the circular component with equations. In the revised version we will (i) update the abstract to read 'using circular regression for the leading score and linear regression for the higher-order scores' and (ii) add the explicit model equations (including the link function and any projection step) to the methods section so that the back-mapping properties can be directly assessed.

    Revision: yes
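To make the referee's point concrete, the sketch below illustrates one of the options named in the comment, a projected linear model for a circular response. It is an assumption for illustration only, not the model the paper uses: the cosine and sine of the angle are regressed by least squares and the fitted pair is converted back to an angle with `arctan2`. The helper names, the two-group design, and the von Mises noise are all hypothetical.

```python
import numpy as np

def fit_projected_circular(X, theta):
    """Least-squares fit of (cos theta, sin theta) on the design matrix X."""
    Y = np.column_stack([np.cos(theta), np.sin(theta)])
    coef, *_ = np.linalg.lstsq(X, Y, rcond=None)
    return coef  # shape (n_columns_of_X, 2)

def predict_angle(X, coef):
    """Project the fitted (cos, sin) pair back onto the circle."""
    fitted = X @ coef
    return np.arctan2(fitted[:, 1], fitted[:, 0])

def angular_distance(a, b):
    """Shortest distance between two angles, in radians."""
    return np.abs(np.angle(np.exp(1j * (a - b))))

# Simulated circular scores for two groups whose mean directions sit on
# either side of the +/- pi branch cut (illustrative values).
rng = np.random.default_rng(0)
n = 400
group = rng.integers(0, 2, size=n)
true_mu = np.where(group == 0, 3.0, -3.0)
theta = rng.vonmises(true_mu, 50.0)

X = np.column_stack([np.ones(n), group])
coef = fit_projected_circular(X, theta)

# Predicted mean direction for each group.
pred = predict_angle(np.array([[1.0, 0.0], [1.0, 1.0]]), coef)
```

With a group indicator as the only covariate, the prediction for each group reduces to that group's circular mean direction, so the wrap-around at ±π is handled correctly where an ordinary linear regression on the raw angles would not be. Whether this or another circular regression model is used changes how fitted scores behave under the back-map, which is why the referee asks for the equations.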

Circularity Check

0 steps flagged

No significant circularity detected

Full rationale

The proposed framework maps compositional responses to the positive orthant of the sphere, applies established Principal Nested Spheres (PNS) to produce a cylindrical intermediate space (circular leading score plus Euclidean coordinates), performs ordinary regression in that space, and back-maps fitted values to the simplex. This sequence relies on prior geometric constructions for PNS and standard regression techniques; no equation reduces a fitted parameter to a prediction by construction, no central claim is justified solely by self-citation, and no ansatz is smuggled in. The simulation study and environmental application provide external checks rather than tautological confirmation. The derivation chain remains self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The approach rests on standard manifold geometry for the simplex and the properties of principal nested spheres; no new free parameters or invented entities are introduced in the abstract.

axioms (2)
  • Domain assumption: Compositional data on the simplex can be isometrically embedded into the positive orthant of the sphere.
    Standard transformation used for compositional data to enable spherical analysis.
  • Domain assumption: Principal nested spheres yield a cylindrical space whose leading circular score and higher Euclidean scores are suitable for linear regression.
    Core modeling choice that enables the intermediate-space regression step.
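The first axiom can be made explicit with a short derivation. The following assumes the embedding is the square-root map, which the ledger does not state. Under that assumption the image lies on the unit sphere, and the map is an isometry only up to a constant factor and only with respect to the Fisher-Rao (Fisher information) metric on the simplex, not the Euclidean or Aitchison geometry.

```latex
% Assumed embedding: x = (x_1, ..., x_d) in the simplex, z_i = \sqrt{x_i}.
\[
  z_i = \sqrt{x_i}, \qquad
  \sum_{i=1}^{d} z_i^{2} = \sum_{i=1}^{d} x_i = 1, \qquad z_i \ge 0 ,
\]
% so z lies in the positive orthant of the unit sphere S^{d-1}.
% Differentiating gives dz_i = dx_i / (2\sqrt{x_i}), hence
\[
  \sum_{i=1}^{d} dz_i^{2}
  \;=\; \frac{1}{4} \sum_{i=1}^{d} \frac{dx_i^{2}}{x_i} ,
\]
% i.e. the round metric on the sphere equals one quarter of the
% Fisher-Rao metric on the simplex.
```

Read this way, "isometrically embedded" is accurate for the Fisher-Rao geometry after rescaling distances by a factor of two.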

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

  • IndisputableMonolith/Foundation/AlexanderDuality.lean alexander_duality_circle_linking echoes
    ECHOES: this paper passage has the same mathematical shape or conceptual pattern as the Recognition theorem, but is not a direct formal dependency.

    responses are embedded in the positive orthant of the sphere and analysed using Principal Nested Spheres (PNS), yielding a cylindrical intermediate space with a circular leading score and Euclidean higher-order scores

What do these tags mean?

  • matches: The paper's claim is directly supported by a theorem in the formal canon.
  • supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
  • extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
  • uses: The paper appears to rely on the theorem as machinery.
  • contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
  • unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

20 extracted references · 20 canonical work pages

  1. [1]

    Aitchison, J. (1986). The Statistical Analysis of Compositional Data. Chapman and Hall, London

  2. [2]

    Aitchison, J. and Bacon-Shone, J. (1984). Log contrast models for experiments with mixtures. Biometrika, 71(2):323–330

  3. [3]

    Bates, D. (2005). Fitting linear mixed models in R. R News, 5(1):27–30

  4. [4]

    Benjamini, Y. and Hochberg, Y. (1995). Controlling the false discovery rate: a practical and powerful approach to multiple testing. J. Roy. Statist. Soc. Ser. B, 57(1):289–300

  5. [5]

    Dryden, I. L. (2025). shapes package. R Foundation for Statistical Computing, Vienna, Austria. Contributed package, Version 1.2.8

  6. [6]

    Dryden, I. L. and Mardia, K. V. (2016). Statistical Shape Analysis, with Applications in R, 2nd edition. Wiley, Chichester

  7. [7]

    Fletcher, P. T. (2013). Geodesic regression and the theory of least squares on Riemannian manifolds. Int. J. Comput. Vis., 105(2):171–185

  8. [8]

    Gabriel, K. R. (1971). The biplot graphic display of matrices with application to principal component analysis. Biometrika, 58:453–467

  9. [9]

    Jung, S., Dryden, I. L., and Marron, J. S. (2012). Analysis of principal nested spheres. Biometrika, 99(3):551–568

  10. [10]

    Kenward, M. G. and Roger, J. H. (1997). Small sample inference for fixed effects from restricted maximum likelihood. Biometrics, 53:983–997

  11. [11]

    Lee, H., Hingee, K. L., Scealy, J. L., Wood, A. T. A., Grunsky, E., and Marron, J. S. (2025). Principal subsimplex analysis. arXiv 2504.09853

  12. [12]

    Li, B., Yoon, C., and Ahn, J. (2023). Reproducing kernels and new approaches in compositional data analysis. Journal of Machine Learning Research, 24(327):1–34

  13. [13]

    Mardia, K. V. and Jupp, P. E. (2000). Directional statistics. Wiley Series in Probability and Statistics. John Wiley & Sons Ltd., Chichester

  14. [14]

    Marron, J. S. and Dryden, I. L. (2021). Object Oriented Data Analysis. CRC Press/Chapman and Hall, Boca Raton

  15. [15]

    Monem, M., Dryden, I. L., and George, F. (2025). Principal nested spheres for high-dimensional data. arXiv 2511.08398

  16. [16]

    Ogunbiyi, O. D., Cappelini, L. T. D., Monem, M., Mejias, E., George, F., Gardinali, P., Bagner, D. M., and Quinete, N. (2024). Innovative non-targeted screening approach using high-resolution mass spectrometry for the screening of organic chemicals and identification of specific tracers of soil and dust exposure in children. Journal of Hazardous Material...

  17. [17]

    Pennec, X. (2006). Intrinsic statistics on Riemannian manifolds: Basic tools for geometric measurements. Journal of Mathematical Imaging and Vision, 25(1):127–154

  18. [18]

    Scealy, J. L. and Welsh, A. H. (2011). Regression for compositional data by using distributions defined on the hypersphere. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 73(3):351–375

  19. [19]

    Srivastava, A., Jermyn, I., and Joshi, S. (2007). Riemannian analysis of probability density functions with applications in vision. In 2007 IEEE Conference on Computer Vision and Pattern Recognition, pages 1–8

  20. [20]

    Srivastava, A. and Klassen, E. P. (2016). Functional and Shape Data Analysis. Springer, New York.
    van den Boogaart, K. G. and Tolosana-Delgado, R. (2013). Analyzing Compositional Data with R. Springer, Heidelberg.
    van den Boogaart, K. G., Tolosana-Delgado, R., and Bren, M. (2024). compositions: Compositional Data Analysis. R package version 2.0-8.