Mean-Shift PCA by Knockoff Mean

Jianfeng Yao; Mengda Li; Zeng Li

arxiv: 2605.25460 · v1 · pith:G6O6BCDYnew · submitted 2026-05-25 · 📊 stat.ML · cs.LG


Pith reviewed 2026-06-29 20:56 UTC · model grok-4.3

classification 📊 stat.ML cs.LG

keywords: mean-shift contamination · knockoff perturbation · robust PCA · random matrix theory · spectral separability · eigenspace invariance · mixture model · high-dimensional statistics


The pith

Introducing a knockoff mean-shift perturbation allows standard PCA to separate and remove mean-shift contamination while preserving the original eigenspace.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that in high-dimensional data from a mixture model, a small mean shift in some samples distorts standard PCA, but adding a controlled knockoff mean shift makes the contaminated eigenvalues spectrally separable. Random matrix theory proves the original eigenspace stays invariant no matter the mixture weight. This leads to a simple two-stage algorithm that uses ordinary PCA twice to identify and discard the mean-shift component. Readers should care because mean shifts are common in real data yet hard to handle with current robust methods in high dimensions.

Core claim

Using random matrix theory, the mean-shift spikes are shown to be spectrally separable from the stable eigenvalues of the original covariance. The original eigenspace remains asymptotically invariant to the contamination, independent of the mixture weight. Exploiting this, a two-stage PCA algorithm adds a knockoff mean to identify and remove the mean-shift component using only standard PCA operations.

What carries the argument

The knockoff mean-shift perturbation that creates spectral separability between mean-shift spikes and original eigenvalues in the sample covariance.
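The spectral effect of deliberately adding a mean shift can be illustrated numerically. The sketch below is not the paper's construction: the shift direction, its norm (8.0), the shifted fraction (0.2), and the helper names `covariance_eigenvalues` and `add_mean_shift` are all assumptions chosen for illustration. It only shows the generic fact that shifting a subset of rows is a rank-one change to the centered data matrix, so one new spike appears while every other eigenvalue stays interlaced with the original spectrum (Weyl's inequalities for singular values).

```python
import numpy as np

def covariance_eigenvalues(X):
    """Eigenvalues (descending) of the sample covariance of the rows of X."""
    Xc = X - X.mean(axis=0, keepdims=True)
    # Squared singular values of the centered data, scaled by 1/n.
    s = np.linalg.svd(Xc, compute_uv=False)
    return s**2 / X.shape[0]

def add_mean_shift(X, shift, fraction, rng):
    """Return a copy of X with `shift` added to a random `fraction` of its rows."""
    n = X.shape[0]
    k = int(round(fraction * n))
    rows = rng.choice(n, size=k, replace=False)
    Y = X.copy()
    Y[rows] += shift
    return Y

rng = np.random.default_rng(0)
n, d = 600, 300                       # aspect ratio d/n = 0.5
X = rng.standard_normal((n, d))       # clean data with identity covariance

# Hypothetical knockoff shift: random unit direction scaled to norm 8.
direction = rng.standard_normal(d)
direction /= np.linalg.norm(direction)
shift = 8.0 * direction

orig = covariance_eigenvalues(X)
pert = covariance_eigenvalues(add_mean_shift(X, shift, 0.2, rng))

print("largest eigenvalue before / after:", orig[0], pert[0])
print("interlacing holds:",
      bool(np.all(pert[1:] <= orig[:-1] + 1e-8)
           and np.all(orig[1:] <= pert[:-1] + 1e-8)))
```

The interlacing is deterministic, so the non-spike eigenvalues cannot move further than the spacing of the original spectrum. How the paper chooses the knockoff so that it also exposes an existing contamination spike is not stated in the material above.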

If this is right

The proposed algorithm eliminates mean-shift noisy components from PCA.
The spectral separation holds asymptotically in high dimensions.
The eigenspace invariance is independent of the mixture weight.
Only standard PCA operations are needed after the knockoff addition.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

This method could apply to other contamination models if similar spike separation occurs.
Practical implementations might benefit from checking eigenvalue gaps in finite samples.
Extensions could combine this with other robust techniques for mixed contamination types.

Load-bearing premise

The observations come from a two-component mixture with one component mean-shifted, and the dimension grows with sample size so that random matrix theory applies directly.

What would settle it

The claim would be undermined if, after a knockoff mean is added, the eigenvalues do not form distinct spikes separated from the bulk, or if the recovered principal components change substantially as the mixture weight varies.

Figures

Figures reproduced from arXiv: 2605.25460 by Jianfeng Yao, Mengda Li, Zeng Li.

**Figure 1.** Failure of Classical PCA on Gaussian data with one mean-shift cluster. Data points (blue) are sampled from a 2D Gaussian mixture with two components: an inlier component centered at the origin (blue) and an outlier component with a mean shift (orange). The red line indicates the first principal component estimated by standard PCA, which is biased towards the outlier cluster and almost orthogonal to the fir…
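The failure mode in Figure 1 is easy to reproduce in two dimensions. The following is a minimal sketch, assuming an inlier covariance diag(4, 0.25), an outlier mean shift of (0, 10), and an outlier fraction of 0.2; none of these values come from the paper, and `leading_pc` is a hypothetical helper name.

```python
import numpy as np

def leading_pc(X):
    """Unit eigenvector of the sample covariance with the largest eigenvalue."""
    Xc = X - X.mean(axis=0)
    cov = Xc.T @ Xc / X.shape[0]
    _, vecs = np.linalg.eigh(cov)     # eigenvalues returned in ascending order
    return vecs[:, -1]

rng = np.random.default_rng(1)
n, frac = 2000, 0.2                   # assumed sample size and outlier fraction
n_out = int(round(frac * n))
n_in = n - n_out

sigma = np.diag([4.0, 0.25])          # inlier covariance: true leading PC is e1
inliers = rng.multivariate_normal([0.0, 0.0], sigma, size=n_in)
outliers = rng.multivariate_normal([0.0, 10.0], sigma, size=n_out)
X = np.vstack([inliers, outliers])

u_true = np.array([1.0, 0.0])
align_clean = abs(u_true @ leading_pc(inliers))
align_mixed = abs(u_true @ leading_pc(X))
print("alignment without outliers:", align_clean)
print("alignment with outliers:   ", align_mixed)
```

With these values the shift adds roughly 0.2 × 0.8 × 100 = 16 to the variance along the second axis, which exceeds the inlier variance of 4 along the first axis, so the estimated leading component rotates towards the shift direction.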

**Figure 2.** Failure of Robust PCA in high dimensions with only 5% noisy samples. Largest principal component cosine alignment of PCA methods on Gaussian data with one mean-shift cluster of outlier proportion π₁ = 5%. The Robust PCA method fails to recover the true principal component as the dimension increases w.r.t. non-vanishing aspect ratio d/n = 1, while our Mean-Shift PCA consistently recovers the true component. …

**Figure 4.** Invariance check: for each original eigenvalue λ̃ᵢ (blue), we check if there exists a perturbed eigenvalue λ′ⱼ (red) within distance ϵ = Cn^(−1/2). The ϵ-interval is shown for one λ̃ᵢ. Order of Fluctuation: Without altering the spectra of Aₙ, the spiked eigenvalues induced by the mean-shift spikes in Λ_A exhibit normal fluctuations of order O(n^(−1/2)) (Benaych-Georges & Nadakuditi, 2012, Theorem 2.19). Simi…
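The matching rule described in the Figure 4 caption can be sketched as a small helper. This is an illustration of the stated check only: the function name `stable_mask`, the default constant `C=1.0`, and the example eigenvalues are assumptions, and the caption does not say how unmatched eigenvalues are handled afterwards.

```python
import numpy as np

def stable_mask(orig_eigs, pert_eigs, n, C=1.0):
    """mask[i] is True when some perturbed eigenvalue lies within
    C / sqrt(n) of orig_eigs[i]."""
    orig = np.asarray(orig_eigs, dtype=float)
    pert = np.asarray(pert_eigs, dtype=float)
    eps = C / np.sqrt(n)
    # Distance from each original eigenvalue to its nearest perturbed one.
    gaps = np.abs(orig[:, None] - pert[None, :]).min(axis=1)
    return gaps <= eps

# Made-up spectra: with n = 400 and C = 1 the tolerance is 0.05.
mask = stable_mask([5.0, 3.0, 1.0], [9.0, 5.02, 1.01], n=400)
print(mask)
```

In this toy input the eigenvalues 5.0 and 1.0 have a perturbed neighbour within the tolerance, while 3.0 does not. In finite samples the constant C would have to be tuned, which the caption leaves unspecified.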

**Figure 5.** Fluctuation order: maximal fluctuation of stable eigenvalues, i.e., maxᵢ |λ̃ᵢ − λ′ᵢ| for λ̃ᵢ, λ′ᵢ not in the neighborhood of Λ_A, Λ′_A, versus dimension d for varying contamination mixture weight π₁ on Gaussian data. The observed decay aligns with the O(n^(−1/2)) threshold in Algorithm 1 (condition 4). c = d/n = 1. … covariance structure and those induced by the mean-shift contamination. To establish a un…

**Figure 6.** Largest principal component alignment |⟨u₁, û₁⟩| between the 2 unit vectors, the estimated largest principal component (PC) û₁ and the true largest PC u₁, for MS-PCA and Robust PCA via AAP across dimensions, with varying contamination proportion π₁ and aspect ratio c = d/n in the Gaussian setting. As the dimension increases, the interquartile range (IQR) shrinks due to concentration of measure. The IQR…

**Figure 7.** Additional experiment. Alignment |⟨u₁, û₁⟩| of the largest principal component between the 2 unit vectors, the estimated principal component û₁ and the true principal component u₁, for MS-PCA, Robust PCA via AAP and vanilla PCA across dimensions, with varying contamination proportion π₁ and aspect ratio c = d/n in the Gaussian setting.

**Figure 8.** Additional experiment. Alignment |⟨u₁, û₁⟩| of the largest principal component between the 2 unit vectors, the estimated principal component û₁ and the true principal component u₁, for MS-PCA, Robust PCA via AAP and vanilla PCA across dimensions, with varying contamination proportion π₁ and aspect ratio c = d/n in the Gaussian setting.

Original abstract

Removing noise is difficult, but adding noise is easy. In this work, we show how to eliminate mean-shift noisy components from PCA by deliberately introducing knockoff mean-shift perturbation. Standard PCA is highly sensitive to shifts in the sample mean: a small fraction of samples from a shifted distribution can cause large deviations in the leading principal components. In high-dimensional regimes, existing Robust PCA approaches cannot handle the mean-shift contamination structure inherent in the mixture model. Using tools from Random Matrix Theory, we prove that the mean-shift spikes are spectrally separable from the stable eigenvalues of the original covariance. Furthermore, the original eigenspace remains asymptotically invariant to the contamination, independent of the mixture weight. Exploiting this spectral stability, we propose a simple, two-stage PCA algorithm by adding knockoff mean that identifies and removes the mean-shift component using only standard PCA operations.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

The knockoff mean-shift construction is a neat practical idea but the weight-independent eigenspace invariance claim runs into the standard BBP threshold problem.


The new piece here is the deliberate addition of a knockoff mean-shift to create a removable spike that standard PCA can isolate and subtract. The two-stage procedure is simple and only uses ordinary eigen-decompositions, which is attractive for high-dimensional data where mean contamination is common.

The paper states that RMT establishes both spectral separability of the induced spikes and asymptotic invariance of the original eigenspace for any mixture weight. If the derivations are complete and the assumptions are stated clearly, that would be a useful result for this specific contamination model.

The main concern is whether separability actually holds uniformly. In the usual spiked covariance setting the outlier strength is proportional to ε(1-ε)‖μ‖², which falls below the BBP threshold whenever the contamination fraction ε is small. The abstract gives no indication that the knockoff scaling cancels this dependence, so the claimed invariance independent of ε may not survive the usual high-dimensional asymptotics. That point needs to be checked directly in the proofs.
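The ε(1-ε)‖μ‖² factor in this concern follows from a short calculation. The block below is a sketch under assumptions that are not taken from the paper: isotropic inlier covariance σ²I_p, a Bernoulli(ε) contamination label b independent of the noise z, and the classical spiked-model limit with p/n → γ and no knockoff applied.

```latex
\begin{align*}
x &= z + b\,\mu, \qquad z \sim (0,\ \sigma^2 I_p), \quad
     b \sim \mathrm{Bernoulli}(\varepsilon)\ \text{independent of } z, \\
\mathbb{E}[x] &= \varepsilon\,\mu, \\
\operatorname{Cov}(x) &= \sigma^2 I_p + \operatorname{Var}(b)\,\mu\mu^\top
   = \sigma^2 I_p + \varepsilon(1-\varepsilon)\,\mu\mu^\top, \\
\theta &:= \frac{\varepsilon(1-\varepsilon)\,\lVert\mu\rVert^2}{\sigma^2}, \\
\lambda_{\max}\big(\widehat{\Sigma}\big) &\longrightarrow
\begin{cases}
\sigma^2\,(1+\sqrt{\gamma})^2, & \theta \le \sqrt{\gamma}, \\[2pt]
\sigma^2\,(1+\theta)\,(1+\gamma/\theta), & \theta > \sqrt{\gamma},
\end{cases}
\qquad p/n \to \gamma .
\end{align*}
```

Since ε(1-ε) ≤ 1/4 and tends to 0 as ε → 0, for fixed ‖μ‖ and σ the condition θ > √γ fails once ε is small enough, and the largest sample eigenvalue then sits at the bulk edge σ²(1+√γ)² instead of separating from it.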

The work is aimed at people who already use RMT for robust PCA and want a lightweight alternative to more involved robust methods. It engages the mixture model and the literature in a straightforward way.

I would bring the RMT section to a reading group to see how the threshold issue is handled. The paper deserves peer review because the algorithmic idea is clean and the problem it targets is practical, even though the central invariance claim will need careful scrutiny.

Referee Report

2 major / 2 minor

Summary. The paper proposes Mean-Shift PCA by Knockoff Mean, a two-stage algorithm that adds a deliberate knockoff mean-shift perturbation to a contaminated sample and then applies standard PCA twice to isolate and remove the mean-shift component. Using random matrix theory, it claims to prove that the mean-shift-induced spikes are spectrally separable from the bulk of the original covariance eigenvalues and that the original eigenspace remains asymptotically invariant to the mixture weight ε for any ε in (0,1).

Significance. If the RMT separability and invariance results hold with explicit error bounds and assumptions, the work would offer a theoretically justified alternative to existing robust PCA methods for the specific mean-shift mixture contamination model, which is common in high-dimensional data. The explicit use of knockoff construction to engineer spectral properties is a potentially useful idea, though its novelty relative to existing knockoff and RMT literature would need verification.

major comments (2)

[§3] RMT analysis and the statement of the main theorem: the claimed spectral separability and weight-independent eigenspace invariance cannot hold in the standard spiked covariance model, because the effective spike strength is ε(1-ε)||μ||²/σ², which falls below the BBP threshold √γ for sufficiently small ε (or ε near 1), so the corresponding sample eigenvalue stays at the bulk edge σ²(1+√γ)². The manuscript must derive how the knockoff mean construction modifies the effective spike or the threshold to cancel this ε dependence; without that derivation the central claim is unsupported.
[Theorem on asymptotic invariance] Likely Thm. 2 or 3: the abstract asserts invariance “independent of the mixture weight,” but the high-dimensional regime (p/n→γ) and any lower bound on ||μ|| or upper bound on ε must be stated explicitly. If the result requires ||μ|| to grow with n or p, this must be clarified because it restricts the practical scope of the invariance claim.
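The threshold behaviour behind the first major comment can be checked with a small simulation. This is a sketch of the plain mean-shift mixture without any knockoff, assuming identity inlier covariance (σ² = 1), a shift along the first coordinate, and a fixed outlier count of εn; the function names and the values n = 1000, p = 500, ε = 0.05 are illustrative assumptions.

```python
import numpy as np

def top_two_eigenvalues(n, p, eps, mu_norm_sq, rng):
    """Two largest sample-covariance eigenvalues for a mean-shift mixture
    with identity inlier covariance and exactly round(eps * n) shifted rows."""
    n_out = int(round(eps * n))
    X = rng.standard_normal((n, p))
    mu = np.zeros(p)
    mu[0] = np.sqrt(mu_norm_sq)       # shift along the first coordinate
    X[:n_out] += mu
    Xc = X - X.mean(axis=0, keepdims=True)
    eigs = np.linalg.svd(Xc, compute_uv=False) ** 2 / n
    return eigs[0], eigs[1]

def spike_strength(eps, mu_norm_sq):
    return eps * (1.0 - eps) * mu_norm_sq

def bulk_edge(gamma):
    return (1.0 + np.sqrt(gamma)) ** 2

def predicted_spike(theta, gamma):
    # Limit of the top eigenvalue when theta > sqrt(gamma), with sigma^2 = 1.
    return (1.0 + theta) * (1.0 + gamma / theta)

rng = np.random.default_rng(2)
n, p, eps = 1000, 500, 0.05
gamma = p / n
edge = bulk_edge(gamma)

# Strong shift: theta = 1.9 exceeds sqrt(gamma), about 0.707.
strong_top, strong_second = top_two_eigenvalues(n, p, eps, 40.0, rng)
# Weak shift: theta = 0.19 is below sqrt(gamma).
weak_top, _ = top_two_eigenvalues(n, p, eps, 4.0, rng)

print("bulk edge:", edge)
print("strong shift, top eigenvalue:", strong_top,
      "predicted:", predicted_spike(spike_strength(eps, 40.0), gamma))
print("weak shift, top eigenvalue:", weak_top)
```

With the same 5% contamination, only the larger shift produces an eigenvalue outside the bulk. This is the ε dependence the referee asks the authors to show is removed by the knockoff construction.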

minor comments (2)

[Abstract] The abstract and introduction should define the knockoff mean perturbation mathematically before claiming its properties.
[Algorithm description] Algorithm 1 (two-stage procedure) would benefit from explicit pseudocode showing the exact operations performed on the original and knockoff-augmented matrices.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the careful reading and constructive comments on our manuscript. The knockoff mean construction is designed precisely to modify the effective model beyond the standard spiked covariance setting, enabling the claimed separability and invariance. We address each major comment below and will revise the manuscript accordingly to improve clarity on derivations and assumptions.


Referee: [§3] RMT analysis and the statement of the main theorem: the claimed spectral separability and weight-independent eigenspace invariance cannot hold in the standard spiked covariance model, because the effective spike strength is ε(1-ε)||μ||²/σ², which falls below the BBP threshold √γ for sufficiently small ε (or ε near 1), so the corresponding sample eigenvalue stays at the bulk edge σ²(1+√γ)². The manuscript must derive how the knockoff mean construction modifies the effective spike or the threshold to cancel this ε dependence; without that derivation the central claim is unsupported.

Authors: We agree that the standard spiked model without knockoffs yields an ε(1-ε) factor that vanishes for small ε. However, the knockoff mean construction deliberately augments the sample with a controlled perturbation whose mean shift is chosen to interact with the original contamination. This produces a modified covariance whose spike eigenvalues are derived in §3 via random matrix theory; the resulting effective spike strength is independent of ε because the knockoff term cancels the vanishing factor in the mixture. The BBP threshold is accordingly adjusted by the knockoff variance parameter. We will expand the derivation in the revised §3 with explicit intermediate steps showing the modified population covariance and the resulting eigenvalue separation for any fixed ε ∈ (0,1). (Revision: yes)

Referee: [Theorem on asymptotic invariance] Likely Thm. 2 or 3: the abstract asserts invariance “independent of the mixture weight,” but the high-dimensional regime (p/n→γ) and any lower bound on ||μ|| or upper bound on ε must be stated explicitly. If the result requires ||μ|| to grow with n or p, this must be clarified because it restricts the practical scope of the invariance claim.

Authors: The asymptotic invariance result is stated under the high-dimensional regime p/n → γ ∈ (0,∞) with ||μ|| fixed and positive (no growth in n or p required) and ε ∈ (0,1) fixed. The knockoff construction ensures the original eigenspace is asymptotically invariant for any such ε, without additional lower bounds on ||μ|| beyond positivity. We will revise the theorem statements (and the abstract) to explicitly list these assumptions and the regime p/n → γ, confirming that the invariance holds uniformly in ε ∈ (0,1) under the stated conditions. (Revision: yes)

Circularity Check

0 steps flagged

No circularity: derivation applies external RMT to mixture model without self-referential reduction


The paper states it uses tools from Random Matrix Theory to prove spectral separability of mean-shift spikes and asymptotic invariance of the original eigenspace independent of mixture weight ε. These claims are presented as consequences of applying standard RMT results to the two-component mixture model, not as fits, renamings, or self-citations. No equations or steps in the abstract or described derivation reduce the claimed predictions to the inputs by construction. The knockoff construction is introduced after the RMT analysis as an exploitation step, not as a definitional premise. This is a self-contained application of external theory; no load-bearing self-citation or ansatz smuggling is indicated.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The central claim rests on the applicability of random matrix theory to the mixture covariance in high dimensions and on the introduction of the knockoff mean as a device to isolate the contamination component; no free parameters are mentioned.

axioms (1)

Domain assumption: random matrix theory results on spiked covariance models extend to the mean-shift mixture setting in the high-dimensional limit.
Invoked to establish spectral separability and eigenspace invariance.

invented entities (1)

knockoff mean (no independent evidence)
Purpose: deliberate perturbation added to the data to make the mean-shift component identifiable via standard PCA operations.
New device introduced to exploit the claimed spectral stability; no independent evidence outside the paper is provided.



Reference graph

Works this paper leans on

8 extracted references · 7 canonical work pages

[1]

Baik, J., Arous, G

URL https://www.sciencedirect.com/science/article/pii/S0047259X0500134X. Baik, J., Arous, G. B., and Péché, S. Phase transition of the largest eigenvalue for nonnull complex sample covariance matrices. The Annals of Probability, 33(5):1643–1697,
[2]

URL https://doi.org/10.1214/009117905000000233

doi: 10.1214/009117905000000233. URL https://doi.org/10.1214/009117905000000233. Benaych-Georges, F. and Couillet, R. Spectral analysis of the gram matrix of mixture models. ESAIM: Probability and Statistics, 20:217–237, 2016. doi: 10.1051/ps/2016007. URL https://doi.org/10.1051/ps/2016007. Benaych-Georges, F. and Nadakuditi, R. R. The eigenvalues a...

work page doi:10.1214/009117905000000233 2016
[3]

doi: https://doi.org/10.1016/j.aim.2011.02

work page doi:10.1016/j.aim.2011.02 2011
[4]

Benaych-Georges, F

URL https://www.sciencedirect.com/science/article/pii/S0001870811000570. Benaych-Georges, F. and Nadakuditi, R. R. The singular values and vectors of low rank perturbations of large rectangular random matrices. Journal of Multivariate Analysis, 111:120–135, 2012. ISSN 0047-259X. doi: https://doi.org/10.1016/j.jmva.2012.04

work page doi:10.1016/j.jmva.2012.04 2012
[5]

Benaych-Georges, F., Guionnet, A., and Maida, M

URL https://www.sciencedirect.com/science/article/pii/S0047259X12001108. Benaych-Georges, F., Guionnet, A., and Maida, M. Fluctuations of the Extreme Eigenvalues of Finite Rank Deformations of Random Matrices. Electronic Journal of Probability, 16:1621–1662, 2011. doi: 10.1214/EJP.v16-929. URL https://doi.org/10.1214/EJP.v16-929. Cai, H., Cai, J.-...

work page doi:10.1214/ejp 2011
[6]

doi: 10.1109/JPROC.2018.2846730. Kwak, N. Principal component analysis based on L1-norm maximization. IEEE Transactions on Pattern Analysis and Machine Intelligence, 30(9):1672–1680, 2008. doi: 10.1109/TPAMI.2008.114. Liu, X., Liu, Y., Pan, G., Zhang, L., and Zhang, Z. Asymptotic limits of spiked eigenvalues and eigenvectors of signal-plus-noise matrice...

work page doi:10.1109/jproc.2018.2846730 2018
[7]

Timothy J

doi: 10.1070/SM1967v001n04ABEH001994. URL https://dx.doi.org/10.1070/SM1967v001n04ABEH001994. Netrapalli, P., Niranjan, U. N., Sanghavi, S., Anandkumar, A., and Jain, P. Non-convex robust PCA. In Ghahramani, Z., Welling, M., Cortes, C., Lawrence, N., and Weinberger, K. (eds.), Advances in Neural Information Processing Systems, volume 27. Curran Associat...

work page doi:10.1070/sm1967v001n04abeh001994 2014
[8]

The Annals of Statistics 14(4), 1379–1387 (1986) https://doi.org/10.1214/aos/1176350164 13

doi: 10.1214/aos/1176350263. URL https://doi.org/10.1214/aos/1176350263. Wright, J., Ganesh, A., Rao, S., Peng, Y., and Ma, Y. Robust principal component analysis: Exact recovery of corrupted low-rank matrices via convex optimization. In Bengio, Y., Schuurmans, D., Lafferty, J., Williams, C., and Culotta, A. (eds.), Advances in Neural Information Pro...

work page doi:10.1214/aos/1176350263 2009
