Missing-Data-Induced Phase Transitions in Spectral PLS for Multimodal Learning

Anders Gjølbye; Emma Kargaard; Ida Kargaard; Lars Kai Hansen; Lina Skerath

arxiv: 2601.21294 · v2 · pith:XZFERWROnew · submitted 2026-01-29 · 💻 cs.LG · stat.ML

Anders Gjølbye, Ida Kargaard, Emma Kargaard, Lina Skerath, Lars Kai Hansen

Pith reviewed 2026-05-21 14:20 UTC · model grok-4.3

classification 💻 cs.LG stat.ML

keywords phase transition · missing data · partial least squares · multimodal learning · spiked random matrix · replica symmetric analysis · singular vector alignment

The pith

Missing data in paired multimodal views attenuates the effective signal in PLS-SVD and triggers a sharp BBP-type phase transition in recovery of shared directions.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper examines Partial Least Squares via singular vectors of the cross-covariance when both data views contain independent missing entries. After normalization, the masked matrix behaves as a rectangular spiked random matrix whose spike strength is reduced by the square root of the retention probability ρ. Replica-symmetric analysis then yields an explicit critical threshold: below it the leading singular vectors are asymptotically orthogonal to the latent directions; above it they achieve a positive overlap whose closed form depends on the aspect ratio and the attenuated signal. This predicts when spectral PLS will recover shared structure and when it will fail in high-dimensional multimodal settings with missingness.

Core claim

In a proportional high-dimensional spiked model with entry-wise missing-completely-at-random masking, the normalized masked cross-covariance is equivalent to a spiked rectangular random matrix whose signal strength is multiplied by √ρ. The replica-symmetric equations therefore locate a BBP-type transition at a missingness-adjusted critical value; below the threshold the leading singular vectors carry no asymptotic information about the shared latent factors, while above it the overlap with those factors is given by explicit formulas. The same adjusted threshold is conjectured to govern each component when multiple separated spikes are present.
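
The attenuation claim can be checked against the two formulas quoted further down this page (the threshold and the overlap condition in the Lean section). The short derivation below is a sketch: the symbol \theta_{\mathrm{eff}} is notation introduced here for illustration, and \alpha_x, \alpha_y are taken as the aspect-ratio parameters exactly as they appear in those quoted formulas, whose precise definitions this page does not give.

```latex
% Effective (attenuated) signal strength; \theta_{\mathrm{eff}} is illustrative notation.
\theta_{\mathrm{eff}} = \sqrt{\rho}\,\theta
\quad\Longrightarrow\quad
\alpha_x \alpha_y \rho^{2} \theta^{4} = \alpha_x \alpha_y\, \theta_{\mathrm{eff}}^{4}.

% The overlap condition is therefore a threshold on \theta_{\mathrm{eff}}:
\alpha_x \alpha_y \rho^{2} \theta^{4} > 1
\;\Longleftrightarrow\;
\theta_{\mathrm{eff}} > (\alpha_x \alpha_y)^{-1/4}
\;\Longleftrightarrow\;
\theta > \theta_{\mathrm{crit}} = \frac{1}{(\alpha_x \alpha_y)^{1/4}\sqrt{\rho}}.

% Worked example: \rho = 1/4 gives \theta_{\mathrm{crit}}(\rho) = 2\,\theta_{\mathrm{crit}}(1).
```

Read this way, missingness leaves the form of the transition unchanged and only rescales the signal axis: retaining a quarter of the joint entries doubles the signal strength needed for recovery.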

What carries the argument

Normalized masked cross-covariance treated as a spiked rectangular random matrix whose effective spike is scaled by √ρ, analyzed via replica-symmetric equations that locate the BBP transition and compute asymptotic overlaps.

If this is right

Leading singular vectors of masked PLS-SVD are asymptotically uninformative below the missingness-adjusted BBP threshold.
Above the threshold the overlap with latent shared directions follows explicit asymptotic formulas that depend on aspect ratio and √ρ.
The same missingness-adjusted threshold governs recovery componentwise when latent spikes are well separated.
Simulations and semi-synthetic multimodal experiments confirm the predicted phase diagram across aspect ratios, signal strengths, and missingness levels.
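
The consequences listed above can be tabulated from the threshold quoted later on this page, θ_crit = 1/((α_x α_y)^{1/4} √ρ). The following is a minimal sketch, not the authors' code: the function names `critical_theta` and `is_recoverable` are hypothetical, and α_x, α_y are treated as given inputs because their definitions are not stated here.

```python
import math


def critical_theta(alpha_x: float, alpha_y: float, rho: float) -> float:
    """Missingness-adjusted threshold 1 / ((alpha_x * alpha_y)**(1/4) * sqrt(rho)).

    alpha_x, alpha_y: aspect-ratio parameters as used in the quoted formula
    (their exact definition is an assumption left to the paper).
    rho: joint retention probability, in (0, 1].
    """
    if alpha_x <= 0 or alpha_y <= 0:
        raise ValueError("aspect-ratio parameters must be positive")
    if not 0.0 < rho <= 1.0:
        raise ValueError("rho must lie in (0, 1]")
    return 1.0 / ((alpha_x * alpha_y) ** 0.25 * math.sqrt(rho))


def is_recoverable(theta: float, alpha_x: float, alpha_y: float, rho: float) -> bool:
    """True when theta lies strictly above the adjusted threshold."""
    return theta > critical_theta(alpha_x, alpha_y, rho)


if __name__ == "__main__":
    # Illustrative values only; they are not taken from the paper's experiments.
    for rho in (1.0, 0.64, 0.25):
        print(f"rho={rho:.2f}  theta_crit={critical_theta(4.0, 4.0, rho):.4f}")
```

Sweeping `rho` and `theta` through `is_recoverable` on a grid gives a predicted phase diagram that can be laid over empirical overlaps.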

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

Practitioners could estimate the joint retention probability ρ from observed data and rescale expectations or thresholds before running PLS-SVD.
The framework may extend to other spectral multimodal methods whose statistics reduce to cross-covariance matrices.
Testing whether the transition persists under correlated or structured missingness would clarify the robustness of the independent MCAR assumption.
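
The first suggestion above could look roughly like the sketch below. It rests on several assumptions that are not from the paper: missing entries are encoded as NaN, the function name `estimate_retention` is hypothetical, and the product of the two per-view retention rates is used as a stand-in for the joint retention probability ρ, which is only reasonable if the two masks are independent and the paper defines ρ that way.

```python
import numpy as np


def estimate_retention(X: np.ndarray, Y: np.ndarray) -> tuple[float, float, float]:
    """Estimate per-view retention rates and their product from NaN patterns.

    X, Y: paired views with the same number of rows; NaN marks a missing entry
    (an encoding assumed for this sketch).
    Returns (keep_x, keep_y, keep_x * keep_y).
    """
    if X.shape[0] != Y.shape[0]:
        raise ValueError("views must have the same number of samples")
    keep_x = float(np.mean(~np.isnan(X)))  # fraction of observed entries in view X
    keep_y = float(np.mean(~np.isnan(Y)))  # fraction of observed entries in view Y
    # Under independent masks, a given X entry and a given Y entry are both
    # observed with probability keep_x * keep_y.
    return keep_x, keep_y, keep_x * keep_y


if __name__ == "__main__":
    X = np.array([[1.0, np.nan], [3.0, 4.0]])
    Y = np.array([[np.nan, np.nan], [1.0, 2.0]])
    print(estimate_retention(X, Y))
```

Comparing row-wise or column-wise retention rates against the global rate is a quick first check on whether the entry-wise MCAR assumption is plausible for a given dataset.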

Load-bearing premise

The analysis assumes independent entry-wise missing-completely-at-random masking inside a proportional high-dimensional spiked model.

What would settle it

Compute the empirical overlap between the leading singular vectors and the ground-truth latent directions while sweeping signal strength across the predicted missingness-adjusted threshold; the overlap should remain near zero below the threshold and rise according to the closed-form formula above it.
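
A minimal simulation of this test might look like the sketch below. Everything about the data model is an assumption made for illustration (a rank-1 shared latent factor, unit-variance Gaussian noise, zero-filled missing entries, a 1/n normalization, and the hypothetical function name `masked_pls_overlap`), so the signal parameter `theta` here need not match the paper's parametrization and the empirical transition need not sit exactly at the paper's threshold.

```python
import numpy as np


def masked_pls_overlap(n, p, q, theta, keep_x, keep_y, seed=0):
    """Empirical overlap of masked PLS-SVD singular vectors with the planted directions.

    Assumed model (illustrative): X = sqrt(theta) * z u^T + noise and
    Y = sqrt(theta) * z v^T + noise, with independent Bernoulli masks.
    Returns (|<u_hat, u>|, |<v_hat, v>|).
    """
    rng = np.random.default_rng(seed)
    u = rng.standard_normal(p)
    u /= np.linalg.norm(u)
    v = rng.standard_normal(q)
    v /= np.linalg.norm(v)
    z = rng.standard_normal(n)  # shared latent factor

    X = np.sqrt(theta) * np.outer(z, u) + rng.standard_normal((n, p))
    Y = np.sqrt(theta) * np.outer(z, v) + rng.standard_normal((n, q))

    # Independent entry-wise MCAR masks; missing entries are zero-filled.
    Xm = np.where(rng.random((n, p)) < keep_x, X, 0.0)
    Ym = np.where(rng.random((n, q)) < keep_y, Y, 0.0)

    C = Xm.T @ Ym / n  # masked empirical cross-covariance (p x q)
    U, _, Vt = np.linalg.svd(C, full_matrices=False)
    return abs(float(U[:, 0] @ u)), abs(float(Vt[0] @ v))


if __name__ == "__main__":
    # Sweep the signal strength at fixed missingness and print the overlaps.
    for theta in np.linspace(0.0, 3.0, 7):
        ov_u, ov_v = masked_pls_overlap(500, 50, 50, theta, 0.8, 0.8, seed=1)
        print(f"theta={theta:.2f}  overlap_x={ov_u:.3f}  overlap_y={ov_v:.3f}")
```

Averaging over several seeds and increasing `n`, `p`, and `q` proportionally would sharpen the empirical curve before comparing it with the closed-form prediction.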

Original abstract

Partial Least Squares (PLS) learns shared structure from paired data via the top singular vectors of the empirical cross-covariance (PLS-SVD), but multimodal datasets often have missing entries in both views. We study PLS-SVD under independent entry-wise missing-completely-at-random masking in a proportional high-dimensional spiked model. After appropriate normalization, the masked cross-covariance behaves like a spiked rectangular random matrix whose effective signal strength is attenuated by $\sqrt{\rho}$, where $\rho$ is the joint entry retention probability. The replica-symmetric analysis predicts a sharp BBP-type phase transition: below a critical signal-to-noise threshold the leading singular vectors are asymptotically uninformative, while above it they achieve nontrivial alignment with the latent shared directions, with closed-form asymptotic overlap formulas. We also state a finite-rank extension as a conjecture, predicting that the same missingness-adjusted threshold applies componentwise when the latent spikes are separated. Simulations and semi-synthetic multimodal experiments agree with the predicted phase diagram and recovery curves across aspect ratios, signal strengths, and missingness levels.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The paper shows that missingness scales down the PLS cross-covariance spike by √ρ, and it gives a closed-form BBP threshold and overlap formulas that the simulations match.

The main takeaway is that independent missing entries in paired multimodal data reduce the effective signal strength in the cross-covariance by √ρ, where ρ is the joint retention probability. This raises the critical signal strength of the BBP-type phase transition by a factor of 1/√ρ: the leading singular vectors stay uninformative below the adjusted threshold and gain nontrivial overlap with the true shared directions above it. The paper supplies explicit formulas for both the adjusted threshold and the asymptotic overlaps above the transition, plus a conjecture for the finite-rank case in which the same rule applies componentwise if the spikes are sufficiently separated.

Referee Report

2 major / 2 minor

Summary. The paper studies PLS-SVD under independent entry-wise MCAR masking in a proportional high-dimensional spiked model for multimodal data. After normalization, the masked cross-covariance is claimed to behave as a spiked rectangular random matrix with effective signal strength attenuated by √ρ (ρ = joint retention probability). Replica-symmetric analysis predicts a sharp BBP-type phase transition with closed-form asymptotic overlap formulas for the leading singular vectors; a finite-rank extension is stated as a conjecture. Simulations and semi-synthetic experiments are reported to match the predicted phase diagram and recovery curves.

Significance. If the central claims hold, the work extends the BBP transition to missing-data settings in spectral multimodal learning and supplies explicit formulas for recovery thresholds and overlaps. This has potential practical value for determining when PLS can recover shared latent structure under incomplete views. Credit is given for the clean normalization that isolates the √ρ attenuation, the explicit phase-transition prediction, and the reported agreement between theory and simulations across aspect ratios, signal strengths, and missingness levels.

major comments (2)

[Replica-symmetric analysis] The closed-form overlap formulas and the phase-transition threshold (theoretical development) rest on the replica-symmetric ansatz, which is an approximation. The manuscript supplies neither full derivations nor error bounds, so support for the exact formulas remains incomplete despite the simulation agreement.
[Finite-rank extension] The componentwise conjecture for separated spikes (abstract and theoretical sections) is load-bearing for the multimodal claim with multiple latent directions, yet it is presented without derivation, additional analysis, or targeted simulations beyond the rank-1 case.

minor comments (2)

Clarify early in the manuscript the precise definition of the joint retention probability ρ and its relation to the per-view missing rates.
The abstract mentions 'semi-synthetic multimodal experiments' but provides limited detail on how the real data were masked or preprocessed; a short methods paragraph would improve reproducibility.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the careful reading and constructive comments on our manuscript. We appreciate the positive assessment of the significance and the recognition of the clean normalization and simulation agreement. We address each major comment below and indicate the revisions we plan to make.

Referee: [Replica-symmetric analysis] The closed-form overlap formulas and the phase-transition threshold (theoretical development) rest on the replica-symmetric ansatz, which is an approximation. The manuscript supplies neither full derivations nor error bounds, so support for the exact formulas remains incomplete despite the simulation agreement.

Authors: We agree that the analysis relies on the replica-symmetric ansatz, which is a standard but non-rigorous tool in high-dimensional random-matrix and statistical-physics analyses of spiked models. The manuscript derives the phase-transition threshold and overlap formulas under this ansatz, and the predictions are validated by extensive numerical simulations that match across aspect ratios, signal strengths, and missingness levels. To address the concern, we will add a concise outline of the key replica calculation steps in an appendix and cite related works in which similar replica-symmetric predictions have been confirmed. We maintain that the strong simulation agreement provides substantial support for the formulas in this setting.
revision: partial

Referee: [Finite-rank extension] The componentwise conjecture for separated spikes (abstract and theoretical sections) is load-bearing for the multimodal claim with multiple latent directions, yet it is presented without derivation, additional analysis, or targeted simulations beyond the rank-1 case.

Authors: The finite-rank extension is explicitly presented as a conjecture in the abstract and theoretical sections, motivated by the decoupling of well-separated spikes in the asymptotic regime. While a complete derivation is not included, the conjecture follows from extending the rank-1 replica analysis via perturbation or block-diagonal arguments. To strengthen this part, we will add targeted simulations for low finite ranks (such as rank 2 and 3) with separated spikes, verifying that the missingness-adjusted threshold applies componentwise as predicted. This will provide additional empirical support for the multimodal applicability.
revision: partial

Circularity Check

0 steps flagged

No significant circularity; derivation is self-contained

The paper models the masked cross-covariance after normalization as a spiked rectangular random matrix with effective signal strength attenuated by √ρ under independent MCAR masking in a proportional high-dimensional regime. It then applies standard replica-symmetric analysis to obtain the BBP-type phase transition threshold and closed-form asymptotic overlap formulas. These steps follow directly from the explicit modeling assumptions and established random-matrix techniques without any reduction to fitted quantities from the same data, without self-citations as load-bearing premises, and without redefining inputs in terms of outputs. Simulations are presented only as verification, not as part of the derivation chain. The central claims therefore remain independent of the target results.

Axiom & Free-Parameter Ledger

0 free parameters · 3 axioms · 0 invented entities

The derivation rests on the proportional high-dimensional limit, the MCAR missingness model, and the replica-symmetric ansatz; no free parameters or new postulated entities are introduced.

axioms (3)

[domain assumption] Proportional high-dimensional limit in which dimensions and sample size grow proportionally. Standard setting for spiked random-matrix models, invoked throughout the analysis.
[domain assumption] Independent entry-wise missing-completely-at-random masking. Defines the masked cross-covariance whose normalization yields the attenuated spiked matrix.
[ad hoc to paper] Replica-symmetric ansatz. Used to obtain the closed-form phase-transition threshold and overlap formulas.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear

After appropriate normalization, the masked cross-covariance behaves like a spiked rectangular random matrix whose effective signal strength is attenuated by √ρ... replica-symmetric analysis predicts a sharp BBP-type phase transition... θ_crit = 1/((α_x α_y)^{1/4} √ρ)

IndisputableMonolith/Foundation/AlphaCoordinateFixation.lean · J_uniquely_calibrated_via_higher_derivative · unclear

Theorem 3.2... r²_x = (α_x α_y ρ² θ⁴ − 1)/(α_y ρ θ² (α_x ρ θ² + 1)) for α_x α_y ρ² θ⁴ > 1
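
The overlap formula quoted in this passage is easy to evaluate directly. The sketch below transcribes it as written here, returning zero at or below the threshold in line with the claim that the leading singular vectors are asymptotically orthogonal to the latent directions there. The function name `squared_overlap_x` is hypothetical, and α_x, α_y are taken as given because this page does not define them.

```python
def squared_overlap_x(theta: float, alpha_x: float, alpha_y: float, rho: float) -> float:
    """Predicted squared overlap r_x^2 for the x-view singular vector.

    Transcribes the quoted formula
        r_x^2 = (a_x a_y rho^2 theta^4 - 1) / (a_y rho theta^2 (a_x rho theta^2 + 1))
    valid when a_x a_y rho^2 theta^4 > 1, and returns 0.0 otherwise.
    """
    if alpha_x <= 0 or alpha_y <= 0:
        raise ValueError("aspect-ratio parameters must be positive")
    if not 0.0 < rho <= 1.0:
        raise ValueError("rho must lie in (0, 1]")
    s = rho * theta ** 2  # attenuated squared signal, rho * theta^2
    gap = alpha_x * alpha_y * s ** 2 - 1.0
    if gap <= 0.0:
        return 0.0  # at or below the transition: no asymptotic alignment
    return gap / (alpha_y * s * (alpha_x * s + 1.0))


if __name__ == "__main__":
    # Illustrative parameter values, not taken from the paper.
    for theta in (0.5, 1.0, 1.5, 2.0, 4.0):
        print(f"theta={theta:.1f}  r_x^2={squared_overlap_x(theta, 1.0, 1.0, 0.5):.4f}")
```

The expression stays below 1 for every finite signal strength, since the numerator is always smaller than the denominator α_x α_y ρ² θ⁴ + α_y ρ θ², and it approaches 1 as θ grows.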

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.