arxiv: 2606.09153 · v1 · pith:ZD5WMC7Enew · submitted 2026-06-08 · 🧮 math.ST · stat.ME · stat.TH

The Asymptotic Distribution of Sample Canonical Directions in Gaussian Spiked High-dimensional CCA

Pith reviewed 2026-06-27 14:54 UTC · model grok-4.3

classification 🧮 math.ST · stat.ME · stat.TH
keywords high-dimensional CCA · spiked model · canonical directions · asymptotic distribution · central limit theorem · resolvent functionals · directional recovery · outlier eigenvalues

The pith

In spiked high-dimensional Gaussian CCA, squared alignment between sample and population canonical directions converges to an explicit deterministic limit with fluctuations obeying a central limit theorem.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper studies the directional recovery properties of sample canonical directions in a finite-rank spiked CCA model where the two data dimensions grow proportionally to sample size under Gaussian assumptions. Sample directions fail to be consistent for their population counterparts even when the associated canonical correlations separate from the bulk. The authors derive a deterministic limit for the squared alignment that quantifies retained population directional information and prove a central limit theorem for the fluctuations around this limit, with variance given by limits of resolvent trace functionals. They further construct consistent plug-in estimators for the limit and variance by inverting the deterministic outlier eigenvalue map.
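A minimal simulation sketch of the quantity under study. It assumes a rank-one model with identity within-block covariances and a population canonical pair along the first coordinate axes. These choices, the function name `squared_alignment`, and the Cholesky-based whitening are illustrative assumptions, not the authors' setup, and the paper's normalization of the sample direction may differ. The function draws Gaussian data, extracts the leading sample canonical weight vector for the first block, and returns its squared cosine with the population direction.

```python
import numpy as np

def squared_alignment(n, p, q, rho, rng):
    """One draw of (top sample squared canonical correlation, squared alignment).

    Assumed model: Cov(X) = I_p, Cov(Y) = I_q, and Corr(X[:, 0], Y[:, 0]) = rho,
    so the population canonical direction for X is the first basis vector.
    """
    X = rng.standard_normal((n, p))
    Y = rng.standard_normal((n, q))
    # Correlate the first coordinates while keeping unit variance.
    Y[:, 0] = rho * X[:, 0] + np.sqrt(1.0 - rho**2) * Y[:, 0]

    Sxx, Syy, Sxy = X.T @ X / n, Y.T @ Y / n, X.T @ Y / n

    # Whiten the first block with a Cholesky factor: Sxx = L L^T.
    L = np.linalg.cholesky(Sxx)
    A = np.linalg.solve(L, Sxy)            # L^{-1} Sxy
    M = A @ np.linalg.solve(Syy, A.T)      # L^{-1} Sxy Syy^{-1} Syx L^{-T}
    M = (M + M.T) / 2.0                    # remove rounding asymmetry
    evals, evecs = np.linalg.eigh(M)       # eigenvalues in ascending order
    top_eval, u = evals[-1], evecs[:, -1]

    # Map back to a canonical weight vector for X and normalise it.
    w = np.linalg.solve(L.T, u)
    w /= np.linalg.norm(w)
    return top_eval, w[0] ** 2

# Proportional regime: p/n and q/n held fixed while n grows.
rng = np.random.default_rng(0)
for n in (200, 800):
    draws = [squared_alignment(n, n // 4, n // 8, 0.9, rng)[1] for _ in range(10)]
    print(n, float(np.mean(draws)))
```

In the proportional regime the printed averages would be expected to settle at a value below 1, which is the inconsistency phenomenon the paper quantifies; the sketch does not compute the theoretical limit itself.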

Core claim

For each simple population spike, the squared alignment between a sample canonical direction and its population counterpart admits a deterministic first-order limit that measures retained directional information at the population level. Fluctuations of this alignment around the limit obey a central limit theorem whose asymptotic variance is expressed through deterministic limits of resolvent trace functionals. Plug-in estimators for both the limiting mean and the asymptotic variance are obtained by inverting the deterministic outlier eigenvalue map and are shown to be consistent.
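The inversion step can be sketched numerically. This page does not reproduce the outlier eigenvalue map, so the code below assumes the closed form γ(r) = r(1 − c1 + c1/r)(1 − c2 + c2/r), with c1 = p/n and c2 = q/n, which appears in earlier work on finite-rank Gaussian CCA. The map, the notation, and all function names are assumptions to be checked against the paper. Given an observed outlier l above the bulk edge, bisection recovers a spike estimate r̂ = γ⁻¹(l).

```python
def outlier_map(r, c1, c2):
    """Assumed deterministic limit of a sample squared canonical correlation
    generated by a population spike r (not taken from this page)."""
    return r * (1.0 - c1 + c1 / r) * (1.0 - c2 + c2 / r)

def critical_spike(c1, c2):
    """Spike level above which the assumed map is increasing."""
    return (c1 * c2 / ((1.0 - c1) * (1.0 - c2))) ** 0.5

def invert_outlier_map(l, c1, c2, tol=1e-12):
    """Solve outlier_map(r) = l for r in (critical_spike, 1] by bisection."""
    if not (0.0 < c1 < 1.0 and 0.0 < c2 < 1.0 and c1 + c2 < 1.0):
        raise ValueError("need 0 < c1, c2 and c1 + c2 < 1")
    lo, hi = critical_spike(c1, c2), 1.0
    if not (outlier_map(lo, c1, c2) < l <= outlier_map(hi, c1, c2)):
        raise ValueError("l is not above the bulk edge of the assumed map")
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if outlier_map(mid, c1, c2) < l:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

# Round trip on hypothetical values: map a spike forward, then invert.
c1, c2 = 0.2, 0.1
l = outlier_map(0.7, c1, c2)
r_hat = invert_outlier_map(l, c1, c2)
print(l, r_hat)
```

Bisection is used because it needs only monotonicity of the map on the supercritical range; a closed-form root of the resulting quadratic would also work under the same assumed form.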

What carries the argument

Deterministic first-order limit of the squared alignment between sample and population canonical directions, together with the associated central limit theorem derived from resolvent trace functionals.

If this is right

  • The limiting alignment supplies an explicit quantitative measure of population directional information retained by each sample direction.
  • The central limit theorem supplies asymptotic normality that can be used for inference on directional recovery quality.
  • Consistent plug-in estimators allow computation of both the limit and its variance directly from observed data without knowledge of population parameters.
  • The same inversion technique that produces the estimators also yields computable expressions for the resolvent-based variance.
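As a sketch of how the asymptotic normality might be used, the code below turns plug-in values into an approximate fluctuation band for the squared alignment. It relies on the standardization quoted in the figure captions on this page, Z = √n (alignment² − 1/(1 + d(r))) / (σ(r)/(1 + d(r))²), and treats `d_hat` and `sigma_hat` as given inputs, since the formulas for d(·) and σ(·) are not reproduced here. Function names and the numerical inputs are hypothetical.

```python
from math import sqrt
from statistics import NormalDist

def standardize(alignment_sq, d_hat, sigma_hat, n):
    """Standardized statistic in the form quoted in the figure captions."""
    limit = 1.0 / (1.0 + d_hat)
    scale = sigma_hat / (1.0 + d_hat) ** 2
    return sqrt(n) * (alignment_sq - limit) / scale

def alignment_band(d_hat, sigma_hat, n, level=0.95):
    """Approximate central band for the squared alignment under the CLT."""
    limit = 1.0 / (1.0 + d_hat)
    half_width = (NormalDist().inv_cdf(0.5 + level / 2.0)
                  * sigma_hat / ((1.0 + d_hat) ** 2 * sqrt(n)))
    return limit - half_width, limit + half_width

# Hypothetical plug-in values, for illustration only.
print(alignment_band(d_hat=1.0, sigma_hat=2.0, n=400))
```

The band narrows at rate 1/√n, so it describes where a sample squared alignment is expected to fall around its deterministic limit, not an interval that shrinks toward perfect recovery.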

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The explicit form of the alignment limit suggests a simple correction factor that could be applied to improve estimation of population directions from the sample ones.
  • Because the variance depends on resolvent traces, the same machinery may extend to other linear statistics of the sample canonical vectors.
  • The results indicate that directional recovery quality can be assessed and reported routinely in applied CCA analyses once the plug-in estimators are implemented.

Load-bearing premise

The observations come from a Gaussian population with finite-rank spiked structure and the two block dimensions grow proportionally with sample size.

What would settle it

Empirical squared alignments computed on data generated from the model that deviate systematically from the predicted deterministic limit as dimensions and sample size increase.
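One way to operationalize this check, sketched under the assumption that Monte Carlo draws of the squared alignment and a predicted limit are already available: compare the empirical mean with the prediction relative to its Monte Carlo standard error, and repeat at several sample sizes. The function names and the toy numbers are hypothetical.

```python
from math import sqrt
from statistics import NormalDist, fmean, stdev

def deviation_z_score(alignments, predicted_limit):
    """(empirical mean - predicted limit) in units of Monte Carlo standard error."""
    standard_error = stdev(alignments) / sqrt(len(alignments))
    return (fmean(alignments) - predicted_limit) / standard_error

def two_sided_p_value(z):
    """Normal-approximation p-value for the deviation score."""
    return 2.0 * (1.0 - NormalDist().cdf(abs(z)))

def gaps_by_sample_size(draws_by_n, predicted_limit):
    """Absolute gap between empirical mean and prediction, keyed by n."""
    return {n: abs(fmean(v) - predicted_limit) for n, v in sorted(draws_by_n.items())}

# Toy draws, for illustration only.
toy = {400: [0.47, 0.55, 0.52, 0.50], 4000: [0.50, 0.51, 0.49, 0.50]}
print(gaps_by_sample_size(toy, predicted_limit=0.5))
print(two_sided_p_value(deviation_z_score(toy[4000], 0.5)))
```

A finite-sample bias can make the score at a single n look significant even when the limit is correct, so the informative pattern is whether the gap shrinks or persists as n grows with the dimension ratios held fixed.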

Figures

Figures reproduced from arXiv: 2606.09153 by Jiang Hu, Zhangni Pu, Zhangxiao Zhuo.

Figure 1: Sample squared canonical correlations for the limestone grassland community data. The dashed horizontal line indicates the upper edge $d_+ = 0.5236$. Only the first eigenvalue is identified as a sample spiked eigenvalue.
We next apply the plug-in estimation procedure stated in Proposition 3.1 to this separated direction. For the sample spiked eigenvalue $l_1 = 0.8293$, we define $\hat r_1 := \gamma^{-1}(l_1)$, $\hat\mu_1$ := 1/(1 + d(…
Figure 2: Scatter plot of the first pair of sample canonical variates for the limestone grassland community data.
Figure 3: Rank-one setting: boxplots of $\langle u_1, \nu_1\rangle^2$ over 5000 repetitions for $n \in \{400, 800, \ldots, 4000\}$, illustrating the first-order limit in Theorem 2.4.
We next investigate the fluctuation result in Theorem 2.5. Still in the rank-one setting, we fix a large sample size $n = 4000$ and standardize $\langle u_1, \nu_1\rangle^2$ according to Theorem 2.5:
$$Z_1 = \sqrt{n}\,\frac{\langle u_1, \nu_1\rangle^2 - 1/(1 + d(r_1))}{\sigma(r_1)/(1 + d(r_1))^2}.$$
Here and below, $\sigma(r_i)$ …
Figure 4: Rank-one setting: histogram of the standardized statistic $Z_1$ at $n = 4000$ over 5000 repetitions, overlaid with the $N(0, 1)$ density, supporting the Gaussian approximation in Theorem 2.5.
We then turn to the rank-three setting. For each spike $r_i$ ($i = 1, 2, 3$), we compute $\langle u_i, \nu_i\rangle^2$ over repeated Monte Carlo simulations and plot the running empirical mean against the number of repetitions.
Figure 5: Rank-three setting: convergence trajectories of the running means of $\langle u_i, \nu_i\rangle^2$ ($i = 1, 2, 3$) over 5000 repetitions, illustrating the first-order limits and the dependence on signal strength.
Finally, we assess the Gaussian fluctuation in the rank-three setting. For $n \in \{2000, 4000, 6000\}$, we form the standardized statistics
$$Z_i = \sqrt{n}\,\frac{\langle u_i, \nu_i\rangle^2 - 1/(1 + d(r_i))}{\sigma(r_i)/(1 + d(r_i))^2}, \quad i = 1, 2, 3,$$
and rep…
Figure 6: Rank-three setting: histograms of the standardized statistics $Z_i$ ($i = 1, 2, 3$) at $n = 2000$ over 5000 repetitions, overlaid with the $N(0, 1)$ density.
[Q-Q plot panels; axes: "Standard Normal Quantiles" and "Quantiles of Input Sample"; panel titles: (a) $r_1 = 0.86$, $n = 4000$; (b) $r_2 = 0.81$, $n = 4000$; …]
Figure 7: Rank-three setting: Q-Q plots of the standardized statistics $Z_i$ ($i = 1, 2, 3$) at $n = 4000$ against the standard normal distribution.
[Empirical distribution function panels; axes: "Standardized Statistic $Z_1$" / "Standardized Statistic $Z_2$" and "Cumulative Probability"; panel titles: (a) $r_1 = 0.86$, $n = 6000$; (b) $r_2 = 0.81$, n = 60…]
Figure 8: Rank-three setting: empirical distribution functions of the standardized statistics $Z_i$ ($i = 1, 2, 3$) at $n = 6000$, compared with the standard normal distribution function.
Original abstract

This paper studies the asymptotic behavior of sample canonical directions in a finite-rank spiked high-dimensional canonical correlation analysis model under a Gaussian population assumption. Under the asymptotic regime in which the dimensions of the two data blocks grow proportionally with the sample size, sample canonical directions are generally not consistent estimators of their population counterparts, even when the corresponding sample canonical correlations separate from the bulk spectrum. To quantify directional recovery, we investigate the squared alignment between a sample canonical direction and its associated population direction. For each simple population spike, we first establish a deterministic first-order limit for this squared alignment, which gives an explicit measure of the population-level directional information retained by the sample direction. We then prove a central limit theorem for its fluctuations around the deterministic limit, with an explicit asymptotic variance expressed through deterministic limits of resolvent trace functionals. To make the theoretical quantities computable from data, we further construct plug-in estimators for both the limiting mean and the asymptotic variance by inverting the deterministic outlier eigenvalue map, and prove their consistency. Numerical simulations and a real-data illustration support the theoretical results and demonstrate how the proposed estimators assess the recovery quality of sample canonical directions.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

0 major / 4 minor

Summary. The paper studies the asymptotic behavior of sample canonical directions in a finite-rank spiked high-dimensional CCA model under Gaussian population assumptions and proportional growth of dimensions with sample size. It derives a deterministic first-order limit for the squared alignment between sample and population directions for each simple spike, proves a CLT for fluctuations around this limit with asymptotic variance expressed via resolvent trace functionals, and constructs consistent plug-in estimators for the limit and variance by inverting the deterministic outlier eigenvalue map.

Significance. If the derivations hold, the work supplies explicit, computable measures of directional recovery quality in high-dimensional CCA, where sample directions are typically inconsistent. The deterministic-equivalent approach and resolvent-based variance, together with the consistency proof for the plug-in estimators, provide a practical tool for assessing retained population information; this extends standard RMT techniques to CCA and is supported by simulations and real-data examples.

minor comments (4)
  1. The introduction would benefit from an early, explicit statement of the main theorems (including the precise form of the deterministic limit and the CLT variance expression) to orient the reader before the technical sections.
  2. Notation for the two data-block dimensions and the spike strengths should be introduced with a single consolidated table or display equation near the model definition to avoid repeated cross-references.
  3. In the simulation section, the number of Monte Carlo replications and the precise parameter values used to generate the population covariance blocks should be stated explicitly so that the reported alignment histograms can be reproduced.
  4. The real-data illustration would be strengthened by reporting the estimated spike strengths and the resulting plug-in estimates of alignment and variance alongside the raw canonical correlations.

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for the supportive review, accurate summary of our contributions, and recommendation for minor revision. The report correctly identifies the key results on the deterministic limit and CLT for squared alignments of sample canonical directions, as well as the plug-in estimators.

Circularity Check

0 steps flagged

No significant circularity

full rationale

The derivation establishes deterministic first-order limits and a CLT for squared alignment of sample canonical directions under the Gaussian finite-rank spiked model with proportional growth, using resolvent trace functionals for the variance. Plug-in estimators are obtained by inverting the outlier eigenvalue map with separate consistency proofs. These steps follow standard non-circular RMT techniques and do not reduce any claimed limit or prediction to a fitted input or self-citation by construction. The central results remain independent of the data-driven estimators.

Axiom & Free-Parameter Ledger

0 free parameters · 3 axioms · 0 invented entities

Only the abstract is available, limiting visibility into all modeling choices; the listed items are the explicit assumptions stated.

axioms (3)
  • domain assumption: Gaussian population assumption for the two data blocks
    Explicitly stated as the model assumption in the abstract.
  • domain assumption: Finite-rank spiked structure in the population covariance
    The model is defined as finite-rank spiked high-dimensional CCA.
  • domain assumption: Proportional growth regime where dimensions grow linearly with sample size
    The asymptotic regime is specified in the abstract.
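The three assumptions can be made concrete with a small data generator. This is a sketch under simplifying choices that are not from the paper: identity within-block covariances, spikes placed on the leading diagonal of the cross-covariance, and dimensions set as p = c1·n, q = c2·n. All names are hypothetical. The helper `population_canonical_correlations` confirms that the constructed covariance has exactly the requested finite-rank structure.

```python
import numpy as np

def spiked_cca_covariance(p, q, rhos):
    """Joint covariance with identity blocks and rank-len(rhos) cross-covariance."""
    rhos = np.asarray(rhos, dtype=float)
    k = rhos.size
    if k > min(p, q) or np.any(np.abs(rhos) >= 1.0):
        raise ValueError("need len(rhos) <= min(p, q) and |rho| < 1")
    Sxy = np.zeros((p, q))
    idx = np.arange(k)
    Sxy[idx, idx] = rhos
    return np.block([[np.eye(p), Sxy], [Sxy.T, np.eye(q)]])

def population_canonical_correlations(Sigma, p):
    """Singular values of Sxx^{-1/2} Sxy Syy^{-1/2}, via Cholesky whitening."""
    Sxx, Sxy, Syy = Sigma[:p, :p], Sigma[:p, p:], Sigma[p:, p:]
    Lx, Ly = np.linalg.cholesky(Sxx), np.linalg.cholesky(Syy)
    W = np.linalg.solve(Lx, Sxy)           # Lx^{-1} Sxy
    W = np.linalg.solve(Ly, W.T).T         # Lx^{-1} Sxy Ly^{-T}
    return np.linalg.svd(W, compute_uv=False)

def sample_blocks(n, c1, c2, rhos, rng):
    """Draw n Gaussian observations with p = c1 * n and q = c2 * n."""
    p, q = int(round(c1 * n)), int(round(c2 * n))
    L = np.linalg.cholesky(spiked_cca_covariance(p, q, rhos))
    Z = rng.standard_normal((n, p + q)) @ L.T
    return Z[:, :p], Z[:, p:]

Sigma = spiked_cca_covariance(6, 4, [0.9, 0.8])
print(population_canonical_correlations(Sigma, 6))
X, Y = sample_blocks(200, 0.1, 0.05, [0.9], np.random.default_rng(0))
print(X.shape, Y.shape)
```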

pith-pipeline@v0.9.1-grok · 5733 in / 1409 out tokens · 20796 ms · 2026-06-27T14:54:36.659281+00:00 · methodology

