Understanding Self-Supervised Learning via Latent Distribution Matching

Fabian A Mikulasch; Friedemann Zenke

arxiv: 2605.03517 · v3 · pith:75ARSMLEnew · submitted 2026-05-05 · 💻 cs.LG · stat.ML

Pith reviewed 2026-05-21 00:06 UTC · model grok-4.3

keywords: self-supervised learning, latent distribution matching, identifiability, contrastive learning, non-contrastive learning, independent component analysis, predictive SSL, entropy maximization

The pith

Self-supervised learning works by matching representation distributions to a latent model while maximizing entropy to prevent collapse.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper proposes viewing self-supervised learning as latent distribution matching. Representations are trained both to fit the probability structure of an assumed latent model and to spread out evenly so they do not all collapse to the same point. The same principle accounts for contrastive methods, non-contrastive methods, predictive methods, and stop-gradient techniques, and it also recovers classical independent component analysis as a special case. From this perspective the authors derive a new sampling-free Bayesian filter with a Kalman predictor for time series and prove that predictive versions of the approach recover unique latent factors even when the predictor is nonlinear.

Core claim

We cast SSL as latent distribution matching (LDM): learning representations that maximize their log-probability under an assumed latent model (alignment), while maximizing latent entropy to prevent collapse (uniformity). This view unifies independent component analysis with contrastive, non-contrastive, and predictive SSL methods, including stop gradient approaches. We further prove that predictive LDM yields identifiable latent representations under mild assumptions, even with nonlinear predictors.
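
One standard way to make "distribution matching" precise is the Kullback-Leibler divergence. Reading the objective this way is an assumption made here for illustration, with symbols borrowed from the Figure 1 caption: R is the distribution of encoded data, Pθ is the latent model, and H_R is taken to be the entropy of R.

```latex
\begin{aligned}
D_{\mathrm{KL}}\!\left(R \,\|\, P_\theta\right)
  &= \mathbb{E}_{R(z,z')}\!\left[\log R(z,z')\right]
   - \mathbb{E}_{R(z,z')}\!\left[\log P_\theta(z,z')\right] \\
  &= -\,H_R \;-\; \mathbb{E}_{R(z,z')}\!\left[\log P_\theta(z,z')\right],
\qquad H_R := -\,\mathbb{E}_{R(z,z')}\!\left[\log R(z,z')\right]
\end{aligned}
```

Under this reading, minimizing the divergence is the same as maximizing the expected log-probability under the latent model (alignment) plus the latent entropy (uniformity).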

What carries the argument

Latent distribution matching, the joint objective of maximizing log-probability under an assumed latent model for alignment and maximizing latent entropy for uniformity, which unifies multiple SSL families and supports identifiability proofs.
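
As a rough illustration of the two terms, the sketch below computes an alignment term and a uniformity term for paired embeddings in plain Python. It assumes an isotropic Gaussian latent model for the pair, so that log-probability reduces to a negative squared distance up to constants, and it uses a pairwise Gaussian-kernel term in the style of Wang & Isola (2020) as a stand-in for the entropy term. Both choices, the parameters `t` and `weight`, and all function names are illustrative assumptions, not the paper's implementation.

```python
import math

def alignment(z, z_pos):
    """Mean negative squared distance between paired embeddings.

    Under an assumed isotropic Gaussian latent model this equals the mean
    log-probability of z_pos given z, up to scale and an additive constant.
    """
    total = 0.0
    for a, b in zip(z, z_pos):
        total += sum((ai - bi) ** 2 for ai, bi in zip(a, b))
    return -total / len(z)

def uniformity(z, t=2.0):
    """Negative log of the mean Gaussian kernel over all distinct pairs.

    Larger values mean the embeddings are more spread out. This is a
    kernel-based surrogate for latent entropy, not an exact entropy.
    """
    vals = []
    for i in range(len(z)):
        for j in range(i + 1, len(z)):
            d2 = sum((a - b) ** 2 for a, b in zip(z[i], z[j]))
            vals.append(math.exp(-t * d2))
    return -math.log(sum(vals) / len(vals))

def ldm_style_objective(z, z_pos, weight=1.0):
    """Quantity to maximize: alignment plus weighted uniformity."""
    return alignment(z, z_pos) + weight * uniformity(z)

# Collapsed embeddings align perfectly but score worst on uniformity.
collapsed = [[0.0, 0.0]] * 4
spread = [[1.0, 0.0], [-1.0, 0.0], [0.0, 1.0], [0.0, -1.0]]
print(uniformity(collapsed), uniformity(spread))
```

The toy comparison shows why the second term matters: a fully collapsed representation maximizes alignment trivially, and only the uniformity term distinguishes it from a spread-out one.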

If this is right

Contrastive, non-contrastive, predictive, and stop-gradient SSL methods all arise as instances of the same latent distribution matching objective.
A nonlinear sampling-free Bayesian filtering model equipped with a Kalman-based predictor can be derived directly from the LDM view for high-dimensional time series.
Predictive LDM produces identifiable latent representations under mild assumptions even when the predictor itself is nonlinear.
The assumptions implicit in existing SSL algorithms become explicit once they are rewritten as particular choices of latent model and entropy term.
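
For orientation on what a Kalman-based predictor computes, here is the textbook scalar linear Kalman filter in plain Python. It is not the paper's nonlinear model: the dynamics coefficient `a`, the noise variances `q` and `r`, and the function names are illustrative assumptions. The point is only that the predict and update steps propagate a mean and a variance in closed form, with no sampling.

```python
def kalman_predict(mean, var, a=1.0, q=0.1):
    """Propagate the Gaussian belief through z_t = a * z_{t-1} + noise(q)."""
    return a * mean, a * a * var + q

def kalman_update(pred_mean, pred_var, obs, r=0.5):
    """Condition the predicted belief on obs = z_t + noise(r)."""
    gain = pred_var / (pred_var + r)  # Kalman gain, between 0 and 1
    mean = pred_mean + gain * (obs - pred_mean)
    var = (1.0 - gain) * pred_var
    return mean, var

def kalman_filter(observations, mean=0.0, var=1.0, a=1.0, q=0.1, r=0.5):
    """Run predict/update over a sequence and return the list of (mean, var)."""
    beliefs = []
    for obs in observations:
        mean, var = kalman_predict(mean, var, a, q)
        mean, var = kalman_update(mean, var, obs, r)
        beliefs.append((mean, var))
    return beliefs

for m, v in kalman_filter([1.0, 1.2, 0.9, 1.1]):
    print(round(m, 3), round(v, 3))
```

In a predictive setup, the predicted mean and variance would parameterize the Gaussian under which the next latent is scored; how the paper couples this to a deep encoder is not reproduced here.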

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

Different choices of the assumed latent model could systematically generate new SSL algorithms for domains where current methods are weak.
Techniques from classical independent component analysis might be imported into modern SSL pipelines once both are expressed inside the same LDM framework.
The identifiability guarantee could be checked empirically by generating synthetic data with known factors, training nonlinear predictive LDM, and measuring how uniquely the factors are recovered.
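
The empirical check in the last item needs a score for "how uniquely the factors are recovered". A common choice in the nonlinear ICA literature is a permutation-matched correlation, often called the mean correlation coefficient. The sketch below is a plain-Python version for a handful of factors; the brute-force search over permutations is an assumption made for brevity and only scales to small dimensions, and the function names are hypothetical.

```python
import itertools
import math

def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def mean_correlation_coefficient(true_factors, recovered):
    """Best mean |correlation| over all one-to-one factor matchings.

    Both arguments are lists of columns (one sequence per factor). A score
    near 1 means each true factor is recovered up to permutation, sign,
    and scale.
    """
    d = len(true_factors)
    corr = [[abs(pearson(t, r)) for r in recovered] for t in true_factors]
    best = 0.0
    for perm in itertools.permutations(range(d)):
        score = sum(corr[i][perm[i]] for i in range(d)) / d
        best = max(best, score)
    return best

# Toy check: recovered factors are the true ones, swapped, flipped, and rescaled.
s1 = [0.1, -1.3, 0.7, 2.0, -0.4, 0.9]
s2 = [1.5, 0.2, -0.8, 0.3, -1.1, 0.6]
recovered = [[-2.0 * v for v in s2], [0.5 * v for v in s1]]
print(mean_correlation_coefficient([s1, s2], recovered))
```

A score that stays well below 1 across training runs on data with known factors would be evidence against recovery; note that correlation only detects linear relations per factor, so it is a coarse proxy.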

Load-bearing premise

The claim that predictive latent distribution matching recovers identifiable representations rests on mild assumptions whose precise content is not stated in the central argument.

What would settle it

Training a predictive LDM model with a nonlinear predictor on data whose ground-truth latent factors are known and observing that the recovered representations are not unique up to permutation would falsify the identifiability result.

Figures

Figures reproduced from arXiv: 2605.03517 by Fabian A Mikulasch, Friedemann Zenke.

**Figure 1.** We formulate SSL as a distribution matching problem in which the transformed data distribution R(z, z′) is matched to the latent model Pθ(z, z′). The transformation is deterministic R(z|x) = δ(z − f(x)), where f(x) is a deep network. The model likelihood log Pθ and latent entropy HR correspond to alignment and uniformity terms in the loss function (Wang & Isola, 2020). Among SSL approaches, latent predict…

**Figure 2.** Source recovery with LDM in linear ICA. A Linear ICA assumes that the data distribution has independent factors, that can be recovered by aligning them with the correct underlying independent distribution (Cardoso, 2002). B Distributions of pixel intensities in natural images are non-Gaussian (Hyvärinen & Oja, 1999). In contrast, mixed images are closer to Gaussian, as expected from the central limit th…

**Figure 3.** Comparison of learned image representations on CIFAR-10. A The eigenspectrum of the learned representations generally decays more slowly for parametric entropy estimators, both on the plane (solid) and the sphere (dashed). Whether or not MI was maximized (+ MI) had little impact on the spectrum. The observed cutoff at low double digits is consistent with previous estimates of intrinsic dimensionality of CI…

**Figure 4.** Predictive distribution matching in latent space using a nonlinear Bayesian filtering model with Kalman-based predictor. A Example frames of synthetic dataset of a high dimensional noisy observable with linear latent dynamics. The red line denotes ground truth position. See Appendix, Fig. A3 for a more nonlinear task. B We use a Kalman filter backbone for the predictor Pθ(zt|z:t) with hidden states ht and …

**Figure 5.** System identification through predictive LDM. A Forcing prediction errors into a Gaussian form leads to local linearization of the relation between true and recovered latent variables. B Schematic of nonlinear prediction task. Trajectory noise in the true latent space is Gaussian to enable identification. C Visualizations of the actual (left) and recovered latent space before (middle) and after (right) …
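
The Figure 2 caption notes that mixtures look closer to Gaussian than their sources, as the central limit theorem suggests. A small stdlib-only sketch of that effect follows, using excess kurtosis as the non-Gaussianity measure. The Laplace sources, the number of sources, the sample size, and the seed are all arbitrary assumptions for illustration and have nothing to do with the paper's image data.

```python
import math
import random

def excess_kurtosis(x):
    """Sample excess kurtosis; approximately 0 for Gaussian data."""
    n = len(x)
    mean = sum(x) / n
    m2 = sum((v - mean) ** 2 for v in x) / n
    m4 = sum((v - mean) ** 4 for v in x) / n
    return m4 / (m2 * m2) - 3.0

def laplace_sample(rng):
    """Draw from a unit-scale Laplace distribution (heavy-tailed)."""
    # The difference of two independent exponentials is Laplace distributed.
    return rng.expovariate(1.0) - rng.expovariate(1.0)

rng = random.Random(0)
n = 20000
sources = [[laplace_sample(rng) for _ in range(n)] for _ in range(4)]
# Equal-weight mixture of four independent sources; dividing by 2 = sqrt(4)
# keeps the variance equal to that of a single source.
mixture = [sum(col) / 2.0 for col in zip(*sources)]

k_source = excess_kurtosis(sources[0])
k_mixture = excess_kurtosis(mixture)
print(k_source, k_mixture)
```

The mixture's excess kurtosis should sit noticeably closer to zero than a single source's, which is the signal linear ICA exploits when it searches for the least Gaussian directions.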

Original abstract

Self-supervised learning (SSL) excels at finding general-purpose latent representations from complex data, yet lacks a unifying theoretical framework that explains the diverse existing methods and guides the design of new ones. We cast SSL as latent distribution matching (LDM): learning representations that maximize their log-probability under an assumed latent model (alignment), while maximizing latent entropy to prevent collapse (uniformity). This view unifies independent component analysis with contrastive, non-contrastive, and predictive SSL methods, including stop gradient approaches. Leveraging LDM, we derive a nonlinear, sampling-free Bayesian filtering model with a Kalman-based predictor for high-dimensional timeseries. We further prove that predictive LDM yields identifiable latent representations under mild assumptions, even with nonlinear predictors. Overall, LDM clarifies the assumptions behind established SSL methods and provides principled guidance for developing new approaches.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note

private letter to a colleague

LDM frames SSL as alignment plus uniformity to link ICA with contrastive and predictive methods, but the identifiability claim for nonlinear predictors needs the actual assumptions checked before it can carry the unification story.

The main takeaway is that this paper recasts self-supervised learning as latent distribution matching: representations are pushed to have high probability under an assumed latent model while their entropy is maximized to stop collapse. That single objective is used to recover ICA, contrastive losses, non-contrastive methods, and stop-gradient tricks as special cases. The authors also derive a sampling-free Bayesian filter for high-dimensional time series that uses a Kalman-style predictor, and they state an identifiability theorem for the predictive case even when the predictor is nonlinear.

Referee Report

2 major / 2 minor

Summary. The paper proposes casting self-supervised learning (SSL) as latent distribution matching (LDM): representations are learned to maximize log-probability under an assumed latent model (alignment) while maximizing latent entropy to avoid collapse (uniformity). This is claimed to unify independent component analysis with contrastive, non-contrastive, and predictive SSL methods (including stop-gradient variants). The authors derive a nonlinear sampling-free Bayesian filter with a Kalman-based predictor for high-dimensional time series and prove that predictive LDM produces identifiable latent representations under mild assumptions even when the predictor is nonlinear.

Significance. If the derivations and identifiability proof are correct, the LDM perspective would offer a coherent organizing principle that explains why diverse SSL objectives succeed and supplies concrete guidance for new methods, particularly in time-series settings. The identifiability result for nonlinear predictors would strengthen the unification claim by showing that predictive approaches recover unique latents without collapse.

major comments (2)

[§4] §4 (Predictive LDM and Identifiability): The central claim that predictive LDM yields identifiable representations 'under mild assumptions, even with nonlinear predictors' is load-bearing for the unification story, yet the proof does not explicitly enumerate the assumptions (e.g., latent prior factorization, almost-everywhere invertibility of the nonlinear predictor, or Markovian dynamics). Without these, it is impossible to verify whether the result extends classical nonlinear ICA or merely restates it under equivalent restrictions.
[§3.2] §3.2 (Derivation of the Kalman-based predictor): The manuscript asserts a 'nonlinear, sampling-free Bayesian filtering model' obtained from LDM, but the transition from the LDM objective to the specific Kalman update equations is not shown step-by-step; the reader cannot confirm that the predictor remains sampling-free once the latent model is nonlinear.

minor comments (2)

[§2] Notation for the latent entropy term is introduced without an explicit equation number; adding a displayed equation would improve readability when comparing alignment and uniformity objectives across methods.
[§2.3] The abstract states that LDM 'unifies ICA with ... stop gradient approaches,' but the main text does not include a side-by-side reduction of a canonical stop-gradient loss (e.g., SimSiam) to the LDM objective; a short table or paragraph would make the unification concrete.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed comments. These have helped us strengthen the clarity of the identifiability result and the derivation of the predictor. We address each major comment below.

Referee: [§4] §4 (Predictive LDM and Identifiability): The central claim that predictive LDM yields identifiable representations 'under mild assumptions, even with nonlinear predictors' is load-bearing for the unification story, yet the proof does not explicitly enumerate the assumptions (e.g., latent prior factorization, almost-everywhere invertibility of the nonlinear predictor, or Markovian dynamics). Without these, it is impossible to verify whether the result extends classical nonlinear ICA or merely restates it under equivalent restrictions.

Authors: We agree that the assumptions underlying the identifiability proof should be stated explicitly at the outset of §4 to allow readers to assess the result's scope and its relation to nonlinear ICA. The proof relies on three mild conditions: (i) the latent prior is factorized (independent components), (ii) the nonlinear predictor is invertible almost everywhere, and (iii) the latent dynamics are Markovian. These are standard in the nonlinear ICA literature and are implicitly used in the proof, but were not enumerated in a dedicated paragraph. In the revised manuscript we have inserted a new subsection titled 'Assumptions' at the beginning of §4 that lists these conditions, provides brief justification for each, and discusses how the result extends classical nonlinear ICA to the predictive SSL setting while preserving the unification claim.

revision: yes

Referee: [§3.2] §3.2 (Derivation of the Kalman-based predictor): The manuscript asserts a 'nonlinear, sampling-free Bayesian filtering model' obtained from LDM, but the transition from the LDM objective to the specific Kalman update equations is not shown step-by-step; the reader cannot confirm that the predictor remains sampling-free once the latent model is nonlinear.

Authors: We acknowledge that the main-text presentation of the derivation is high-level and that a fully expanded step-by-step transition from the LDM objective to the Kalman update equations would improve verifiability. In the revised manuscript we have added Appendix B containing the complete algebraic derivation. It proceeds from the LDM objective (maximizing alignment log-probability under the assumed latent model while enforcing uniformity) through the variational posterior update, shows that the resulting filter equations reduce to a Kalman-style predictor that operates on closed-form moments, and explicitly demonstrates that no sampling is required even when the predictor is nonlinear. This appendix confirms that the sampling-free property is preserved under the nonlinear latent model.

revision: yes

Circularity Check

0 steps flagged

No significant circularity detected; LDM framework and identifiability result presented as independent derivations.

The abstract and description define LDM explicitly as maximizing log-probability (alignment) plus entropy (uniformity), then apply it to unify ICA with SSL variants and derive a Kalman-based predictor. The identifiability proof for predictive LDM is stated as holding under mild assumptions even for nonlinear predictors, without any visible reduction of the result to a fitted parameter or self-referential definition. No load-bearing self-citation, ansatz smuggling, or renaming of known results appears in the provided text. The central claims retain independent mathematical content and are not forced by construction from the inputs.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

Only the abstract is available, so the ledger records the conceptual elements explicitly named; no numerical free parameters are mentioned.

axioms (1)

domain assumption: Mild assumptions suffice for identifiability of predictive LDM even with nonlinear predictors
The abstract states that the identifiability result holds under these mild assumptions.

invented entities (1)

Latent Distribution Matching (LDM): no independent evidence
purpose: Unifying framework for SSL methods via alignment and uniformity
LDM is introduced in the paper as the central conceptual device.

pith-pipeline@v0.9.0 · 5665 in / 1352 out tokens · 57962 ms · 2026-05-21T00:06:45.670576+00:00 · methodology
