
arxiv: 2502.09986 · v5 · submitted 2025-02-14 · 📊 stat.ME · stat.ML

Statistical description and dimension reduction of continuous time categorical trajectories with multivariate functional principal components

Pith reviewed 2026-05-23 03:18 UTC · model grok-4.3

classification 📊 stat.ME stat.ML
keywords categorical trajectories · functional principal components · binary indicators · dimension reduction · multivariate functional data · consistency · continuity in probability

The pith

Encoding each state of a categorical trajectory as a binary indicator function turns the analysis into a multivariate functional principal components problem.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper converts each categorical trajectory into a collection of binary 0-1 indicator functions, one per possible state. This step recasts description, comparison, and dimension reduction as a multivariate functional principal components problem that loses no information and accommodates states observed simultaneously. Under the assumption that the indicators are continuous in probability, the mean functions (the marginal state probabilities) and the covariance functions are continuous even though individual paths jump, and the covariances measure how far the joint state probabilities depart from independence. The binary trajectories qualify as random elements of the Hilbert space of square-integrable functions, and consistent estimators of the mean and covariance functions exist under weak regularity conditions.

Core claim

By associating each state of a categorical trajectory with a binary random indicator function, the statistical description problem becomes a multivariate functional principal components analysis. The sample paths are piecewise constant with finitely many jumps, yet under continuity in probability the means and the multivariate covariance functions are continuous and admit interpretations in terms of departure from independence of the joint probabilities. The binary trajectories can be viewed as random elements in the Hilbert space of square integrable functions, and consistent estimators of the mean trajectories and covariance functions exist under weak regularity assumptions.
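
The encoding step can be sketched in a few lines. This is a minimal illustration, not the authors' code: the function name `indicators_on_grid`, the input layout (jump times plus the set of active states after each jump), and the evaluation grid are all assumptions made for the example.

```python
import numpy as np

def indicators_on_grid(jump_times, states, n_states, grid):
    """Evaluate the 0-1 indicator functions of one trajectory on a time grid.

    jump_times: increasing times at which the trajectory changes
                (the first entry is the start time).
    states:     for each jump time, the set of states active from then on;
                a set is used so that simultaneous states are allowed.
    Returns an integer array of shape (n_states, len(grid)).
    """
    jump_times = np.asarray(jump_times, dtype=float)
    grid = np.asarray(grid, dtype=float)
    # Right-continuous paths: the segment active at t starts at the last jump <= t.
    seg = np.searchsorted(jump_times, grid, side="right") - 1
    out = np.zeros((n_states, grid.size), dtype=int)
    for k, idx in enumerate(seg):
        if idx >= 0:  # grid points before the first jump stay at 0
            for j in states[idx]:
                out[j, k] = 1
    return out

# Hypothetical trajectory: state 0, then state 1, then states 0 and 2 together.
grid = np.linspace(0.0, 1.0, 5)
x = indicators_on_grid([0.0, 0.3, 0.6], [{0}, {1}, {0, 2}], 3, grid)
```

Each row of `x` is one indicator function sampled on the grid; the last segment shows two rows equal to 1 at once, which is how simultaneous states fit into the same representation.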

What carries the argument

Multivariate functional principal components analysis applied to the vector of binary indicator functions, one per categorical state.
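
For readers who want to see the mechanics, here is a rough sketch of a discretized multivariate FPCA on indicator paths. It assumes a uniform grid on [0, 1], equal weights for all states, and a plain Riemann-sum approximation of the inner product; the function name `mfpca` and the simulated data are hypothetical, and the paper's own implementation may differ.

```python
import numpy as np

def mfpca(X, h, n_comp=2):
    """Discretized multivariate FPCA.

    X: array of shape (n, D, T) with n paths, D states, T grid points.
    h: grid spacing, used as the quadrature weight of the L2 inner product.
    Returns eigenvalues, eigenfunctions of shape (n_comp, D, T), and scores.
    """
    n, D, T = X.shape
    Z = X.reshape(n, D * T).astype(float)   # stack the D indicators of each path
    Zc = Z - Z.mean(axis=0)                 # centre by the empirical mean
    C = Zc.T @ Zc / n                       # empirical covariance on the grid
    evals, evecs = np.linalg.eigh(h * C)    # discretized covariance operator
    order = np.argsort(evals)[::-1][:n_comp]
    lam = evals[order]
    phi = evecs[:, order] / np.sqrt(h)      # rescale so that h * phi'phi = 1
    scores = h * (Zc @ phi)                 # <X_i - mean, phi_k> by Riemann sum
    return lam, phi.T.reshape(n_comp, D, T), scores

# Simulated example (an assumption, not the paper's data): two exclusive
# states with a single switch at a uniformly distributed time.
rng = np.random.default_rng(0)
n, T = 200, 50
h = 1.0 / T
grid = (np.arange(T) + 0.5) * h
tau = rng.uniform(size=n)
first = grid[None, :] < tau[:, None]
X = np.stack([first, ~first], axis=1).astype(float)
lam, phi, scores = mfpca(X, h, n_comp=2)
```

With this scaling the empirical variance of each score column equals the corresponding eigenvalue, so `lam / lam.sum()` over all components would give the proportion of variance explained.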

If this is right

  • Dimension reduction to a small number of principal components becomes available while retaining full information from the original categorical sequences.
  • Multiple states observed at the same instant are represented without loss by the vector of indicators, and their co-occurrence is captured by the cross-covariance structure.
  • The principal components supply direct visual and numerical summaries of typical variation across trajectories.
  • The empirical mean and covariance functions are consistent as the number of observed paths grows, under the stated regularity conditions.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same indicator construction may apply to other piecewise-constant processes with finitely many jumps, such as certain point processes or regime-switching models.
  • One could test whether the leading principal components recover known transition patterns in longitudinal categorical data.
  • Rates of convergence for the estimators could be derived to guide sample-size requirements in applied settings.

Load-bearing premise

The 0-1 indicator trajectories must be continuous in probability so that their means and covariances remain continuous functions even though the observed paths jump.
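
A short derivation shows why this premise is enough for the mean functions. It is not quoted from the paper; it follows from the 0-1 coding alone. The notation is assumed for illustration: $X_j(t)$ is the indicator of state $j$ at time $t$ and $p_j(t) = P(X_j(t) = 1)$.

```latex
\begin{align*}
|p_j(t) - p_j(s)|
  &= \bigl| \mathbb{E}[X_j(t) - X_j(s)] \bigr| \\
  &\le \mathbb{E}\,|X_j(t) - X_j(s)| \\
  &= P\bigl(X_j(t) \neq X_j(s)\bigr) \xrightarrow[s \to t]{} 0 .
\end{align*}
```

The last equality holds because $|X_j(t) - X_j(s)|$ is itself 0-1 valued, and the limit is exactly continuity in probability, since $P(|X_j(t) - X_j(s)| > \varepsilon) = P(X_j(t) \neq X_j(s))$ for any $\varepsilon \in (0, 1)$.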

What would settle it

A collection of categorical trajectories satisfying continuity in probability for which the estimated covariance function fails to converge to any continuous limit as the number of observed paths grows.

Figures

Figures reproduced from arXiv: 2502.09986 by Caroline Peltier, Hervé Cardot.

Figure 1. TDS bandplot for …
Figure 2. Three gustometer-controlled stimuli (S06, S07 and S04) extracted from the open data …
Figure 3. Empirical probabilities $\hat{p}_j(t)$, $t \in [0, 1]$. The curves are drawn only for the states $j$ whose average probability of occurrence is larger than 5%, that is to say $\int_0^1 \hat{p}_j(t)\,dt \geq 0.05$.
Figure 4. Proportion of total variance captured by the principal components in …
Figure 5. Estimated principal component scores. Different gray levels are used to distinguish …
Figure 6. MFPCA. First component. Variations around the probability of occurrence related …
Figure 7. MFPCA. Second component. Variations around the probability of occurrence related …
Figure 8. Continuous time correspondence analysis. Evolution over time of the first encoding …
Figure 9. Continuous time correspondence analysis. Evolution over time of the first encoding …
Figure 10. CFDA. Individual scores on the first two components.
Figure 11. Continuous time correspondence analysis. Map of the values of the first two encoding …
Figure 12. Proportion of total variance captured by the principal components considering equal …
Figure 13. Estimated principal component scores with weights (10). Different colours are used …
Figure 14. MFPCA with unequal weights (10). First component. Variations around the prob…
Figure 15. MFPCA with unequal weights (10). Second component. Variations around the …
Figure 16. TCATA experiment. Empirical probabilities of occurrence …
Figure 17. TCATA experiment. Mean number $\sum_{j=1}^{D} \hat{p}_j(t)$ of selected states over time.
Figure 18. Proportion of total variance captured by the principal components considering equal …
Figure 19. Estimated MFPCA scores with TCATA data. Different levels of gray are used to …
Figure 20. MFPCA of TCATA data. First component. Variations around the probability …
Figure 21. MFPCA of TCATA data. Second component. Variations around the probability …
Figure 22. Distribution of the correlation coefficient between original and noisy first two princi…
Original abstract

Getting tools that allow simple representations and comparisons of a set of categorical trajectories is of major interest for statisticians. Without loosing any information, we associate to each state a binary random indicator function, taking values in $\{0,1\}$, and turn the problem of statistical description of the categorical trajectories into a multivariate functional principal components analysis. This viewpoint encompasses experimental frameworks where two or more states can be observed simultaneously. The sample paths being piecewise constant, with a finite number of jumps, this a rare case in functional data analysis in which the trajectories are not supposed to be continuous and can be observed exhaustively. Under the weak hypothesis assuming only continuity in probability of the $0-1$ trajectories, the means and the (multivariate) covariance function are continuous and have interpretations in terms of departure from independence of the joint probabilities. Considering a functional data point of view, we show that the binary trajectories, which are right-continuous functions with left-hand limits, can be seen as random elements in the Hilbert space of square integrable functions. The multivariate functional principal components are simple to interpret and we show that we can get consistent estimators of the mean trajectories and the covariance functions under weak regularity assumptions. The ability of the approach to represent categorical trajectories in a small dimension space is illustrated on a data set of sensory perceptions, considering different gustometer-controlled stimuli experiments.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 3 minor

Summary. The manuscript proposes to represent continuous-time categorical trajectories by associating each state with a binary 0-1 indicator function, thereby recasting the problem as one of multivariate functional principal component analysis. Under the assumption that the indicator processes are continuous in probability, the authors claim that the mean functions equal the marginal probability trajectories and the cross-covariance functions equal the joint probabilities minus the product of the marginals (hence continuous and interpretable as departures from independence). They further assert that consistent estimators of the means and covariance functions exist under weak regularity assumptions, that the cadlag sample paths remain square-integrable elements of the relevant Hilbert space, and that the resulting MFPCA provides a low-dimensional representation, which is illustrated on sensory-perception data from gustometer experiments.
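
In symbols, the identities summarized above can be written as follows. The notation is assumed here for illustration and may differ from the paper's: $X_j(t)$ is the indicator of state $j$ at time $t$, $p_j$ its marginal probability, and $\gamma_{j\ell}$ the cross-covariance between states $j$ and $\ell$.

```latex
\begin{align*}
\mu_j(t) &= \mathbb{E}[X_j(t)] = P\bigl(X_j(t) = 1\bigr) = p_j(t), \\
\gamma_{j\ell}(s,t) &= \operatorname{Cov}\bigl(X_j(s), X_\ell(t)\bigr)
  = P\bigl(X_j(s) = 1,\, X_\ell(t) = 1\bigr) - p_j(s)\, p_\ell(t).
\end{align*}
```

So $\gamma_{j\ell}(s,t) = 0$ exactly when the events $\{X_j(s) = 1\}$ and $\{X_\ell(t) = 1\}$ are independent, which is the sense in which the covariance measures departure from independence.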

Significance. If the consistency claims hold, the work supplies a direct route for applying standard functional-data tools to categorical trajectories while preserving the possibility of simultaneous states and without requiring continuous sample paths. The probabilistic interpretation of the covariance operator as a measure of departure from independence is a clear interpretive strength, and the reliance on only continuity in probability plus square-integrability is notably weak. The empirical example on real sensory data demonstrates practical utility.

major comments (2)
  1. [Abstract] Abstract and the section stating the consistency result: the manuscript asserts existence of consistent estimators for the mean trajectories and the multivariate covariance function but supplies neither the explicit form of the estimators, a derivation of consistency, error bounds, nor any simulation evidence. Because this consistency statement is the central theoretical claim, the absence of supporting detail is load-bearing.
  2. [Theoretical framework] Section introducing the Hilbert-space embedding: while the cadlag paths are correctly noted to be square-integrable, the manuscript does not discuss whether the continuity-in-probability assumption is automatically satisfied for mutually exclusive categorical states or how it might be verified from data; this assumption is invoked to guarantee that the mean and covariance objects remain continuous and lie in the Hilbert space.
minor comments (3)
  1. [Abstract] Abstract: 'Without loosing any information' should read 'Without losing any information'.
  2. [Abstract] Abstract: 'this a rare case' should read 'this is a rare case'.
  3. The manuscript would benefit from a short simulation study (even a small one) that checks finite-sample behavior of the proposed estimators under the stated continuity-in-probability regime.
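
A simulation of the kind requested in minor comment 3 could start from something like the sketch below. It assumes the estimators are the plain empirical mean and cross-covariance with a 1/n normalization, evaluated on a uniform grid; the function name `empirical_mean_cov` and the simulated switching model are hypothetical and not taken from the paper.

```python
import numpy as np

def empirical_mean_cov(X):
    """Empirical mean and cross-covariance of indicator paths on a grid.

    X has shape (n, D, T): n paths, D states, T grid points.
    Returns p_hat of shape (D, T) and cov_hat of shape (D, D, T, T), where
    cov_hat[j, l, s, t] estimates Cov(X_j(t_s), X_l(t_t)).
    """
    X = np.asarray(X, dtype=float)
    n = X.shape[0]
    p_hat = X.mean(axis=0)                              # marginal probabilities
    joint = np.einsum("ijs,ilt->jlst", X, X) / n        # joint probabilities
    cov_hat = joint - np.einsum("js,lt->jlst", p_hat, p_hat)
    return p_hat, cov_hat

# Assumed model: two exclusive states and one switch at tau ~ U(0, 1),
# so the true probability of being in state 0 at time t is 1 - t.
rng = np.random.default_rng(0)
T = 50
grid = (np.arange(T) + 0.5) / T
errors = {}
for n in (100, 5000):
    tau = rng.uniform(size=n)
    first = grid[None, :] < tau[:, None]
    X = np.stack([first, ~first], axis=1)
    p_hat, cov_hat = empirical_mean_cov(X)
    # Largest gap on the grid between estimated and true mean of state 0.
    errors[n] = np.abs(p_hat[0] - (1.0 - grid)).max()
```

Comparing `errors[100]` with `errors[5000]` gives a first impression of how the mean estimator behaves as the number of paths grows; a fuller study would repeat this over many replications and also track the covariance error.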

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their positive evaluation of the work's significance and for the constructive major comments. We address each point below, indicating planned revisions where appropriate.

Point-by-point responses
  1. Referee: [Abstract] Abstract and the section stating the consistency result: the manuscript asserts existence of consistent estimators for the mean trajectories and the multivariate covariance function but supplies neither the explicit form of the estimators, a derivation of consistency, error bounds, nor any simulation evidence. Because this consistency statement is the central theoretical claim, the absence of supporting detail is load-bearing.

    Authors: The estimators are the standard empirical mean functions and empirical cross-covariance functions of the observed multivariate indicator trajectories. Consistency in probability follows from the square-integrability of the cadlag paths together with continuity in probability, via standard uniform integrability arguments for functional data. We agree that the manuscript would benefit from greater explicitness on this central claim. In revision we will state the estimators explicitly, sketch the consistency argument, and add a small simulation study (in supplementary material) demonstrating finite-sample performance under the stated assumptions. revision: yes

  2. Referee: [Theoretical framework] Section introducing the Hilbert-space embedding: while the cadlag paths are correctly noted to be square-integrable, the manuscript does not discuss whether the continuity-in-probability assumption is automatically satisfied for mutually exclusive categorical states or how it might be verified from data; this assumption is invoked to guarantee that the mean and covariance objects remain continuous and lie in the Hilbert space.

    Authors: Continuity in probability is not an automatic consequence of mutual exclusivity; it follows from the cadlag property together with the finite number of jumps per trajectory (which implies that the probability of a discontinuity at any fixed t is zero). When states may co-occur, the same argument applies componentwise to each indicator process. For data-based verification we suggest inspecting the empirical marginal probability curves for visible jumps or estimating the probability of state changes in small time windows around observed times. We will insert a short clarifying paragraph on these points in the revised theoretical framework section. revision: yes
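
The window-based check suggested in this response could look roughly like the following. This is one possible reading of the suggestion, not a procedure from the paper: the function name `change_frequency`, the grid-based windows, and both simulated examples are assumptions.

```python
import numpy as np

def change_frequency(X, w):
    """Fraction of paths whose indicator changes across a window of 2*w grid steps.

    X has shape (n, D, T) and w must be at least 1. The result has shape
    (D, T - 2*w): entry [j, k] compares X_j at grid points k and k + 2*w.
    Under continuity in probability these frequencies should shrink
    everywhere as the window narrows.
    """
    X = np.asarray(X)
    changed = X[:, :, 2 * w:] != X[:, :, :-2 * w]
    return changed.mean(axis=0)

rng = np.random.default_rng(1)
n, T = 500, 100
grid = (np.arange(T) + 0.5) / T

# Switch time spread uniformly over [0, 1]: no time point concentrates jumps.
tau = rng.uniform(size=n)
first = grid[None, :] < tau[:, None]
X_random = np.stack([first, ~first], axis=1)

# Every path switches at exactly t = 0.5: continuity in probability fails there.
first_fixed = np.broadcast_to(grid < 0.5, (n, T))
X_fixed = np.stack([first_fixed, ~first_fixed], axis=1)

freq_random = change_frequency(X_random, w=1)
freq_fixed = change_frequency(X_fixed, w=1)
```

In the fixed-time example the change frequency reaches 1 in the windows that straddle 0.5, however narrow they are, while in the random-time example it stays small at every grid point.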

Circularity Check

0 steps flagged

No significant circularity; derivation self-contained in standard FDA

full rationale

The paper maps categorical trajectories to 0-1 indicator processes and applies multivariate FPCA. The central results (continuity of means/covariances under continuity-in-probability, consistency of estimators) follow directly from the stated weak hypothesis and standard Hilbert-space arguments for cadlag square-integrable paths; they do not reduce to fitted parameters or self-citations. No load-bearing self-citation chains, no ansatz smuggled via prior work, and no renaming of known results as new derivations. The approach rests on external, independently verifiable FDA theory rather than internal redefinition.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claim rests on the continuity-in-probability assumption for the indicator processes and on the standard Hilbert-space embedding of square-integrable functions; no free parameters or invented entities are introduced in the abstract.

axioms (2)
  • domain assumption: The 0-1 indicator trajectories are continuous in probability.
    Invoked to ensure means and covariances are continuous and to place the paths in the Hilbert space.
  • domain assumption: The trajectories are right-continuous with left limits and piecewise constant with finitely many jumps.
    Stated as the observation model for categorical trajectories.

pith-pipeline@v0.9.0 · 5774 in / 1453 out tokens · 21155 ms · 2026-05-23T03:18:55.312209+00:00 · methodology

