Regularized Variational and Spectral Log-Density-Ratio Estimation in the Gaussian Location Model

Francis Bach (SIERRA)

arxiv: 2607.01895 · v1 · pith:EIUZYMRGnew · submitted 2026-07-02 · 💻 cs.LG · math.OC· math.ST· stat.ML· stat.TH

Regularized Variational and Spectral Log-Density-Ratio Estimation in the Gaussian Location Model

Francis Bach (SIERRA) This is my paper

Pith reviewed 2026-07-03 17:43 UTC · model grok-4.3

classification 💻 cs.LG math.OCmath.STstat.MLstat.TH

keywords log-density-ratio estimationGaussian location modelvariational estimatorspectral estimatorridge regularizationhigh-dimensional asymptoticsdensity ratio estimation

0 comments

The pith

In the Gaussian location model, ridge-regularized variational log-density-ratio estimation has lower asymptotic risk than the spectral alternative when observations greatly exceed dimension, but the spectral estimator is preferable with few

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper compares two ridge-regularized estimators for log-density ratios under the Gaussian location model q ~ N(0,I), p ~ N(Δ,I). It derives deterministic high-dimensional equivalents for their population risks as dimension and sample sizes grow with fixed ratios. The variational estimator is obtained by minimizing an empirical KL objective with L2 penalty, while the spectral estimator solves a continuum of ridge least-squares problems. Asymptotics show that the variational form wins for large observation counts because it exploits the model specification more fully, whereas the spectral form has lower variance when data is scarce. Experiments sweep aspect ratios and regularization to confirm the risk crossover.

Core claim

Under the analyzed high-dimensional regimes with fixed observation-to-dimension ratios, the well-specified variational estimator attains smaller population risk than the spectral estimator when the number of observations is large, while the spectral estimator is favored with fewer observations due to its covariance-based construction having lower variance; the limits are characterized via CGMT for the variational case and resolvent equivalents for the spectral case.

What carries the argument

High-dimensional deterministic asymptotic equivalents for the population risks of the two ridge-regularized estimators, obtained respectively from the convex-Gaussian-min-max theorem and from resolvents of weighted sums of two independent Gaussian sample covariances.

If this is right

For density-ratio tasks in high dimensions, the variational estimator should be selected when sample size is large relative to dimension.
The spectral estimator should be preferred in the small-sample regime to exploit its lower variance.
A nuclear-norm penalty can be added to either estimator to induce feature learning, though its analysis remains partial.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The risk comparison may extend to other well-specified parametric families beyond the Gaussian location model.
The derived asymptotic formulas could be used to optimize the regularization parameter as a function of the aspect ratio without cross-validation.
Similar variational-versus-spectral trade-offs may appear in other ratio estimation problems such as importance sampling or mutual-information estimation.

Load-bearing premise

The observations are drawn exactly from the Gaussian location model with identical covariance matrices for the two distributions.

What would settle it

A finite-sample simulation in the Gaussian location model that measures empirical risks for both estimators across a sweep of observation-to-dimension ratios and checks whether the risk ordering reverses at the ratio predicted by the asymptotic formulas.

Figures

Figures reproduced from arXiv: 2607.01895 by Francis Bach (SIERRA).

**Figure 2.** Figure 2: Second-order separating curves in the weak-signa [PITH_FULL_IMAGE:figures/full_fig_p023_2.png] view at source ↗

**Figure 3.** Figure 3: Regimes of feasibility for unregularized estimat [PITH_FULL_IMAGE:figures/full_fig_p024_3.png] view at source ↗

**Figure 4.** Figure 4: Comparison of two estimators for a given [PITH_FULL_IMAGE:figures/full_fig_p024_4.png] view at source ↗

**Figure 5.** Figure 5: Comparison of two estimators for a given [PITH_FULL_IMAGE:figures/full_fig_p025_5.png] view at source ↗

**Figure 6.** Figure 6: Comparison of estimators, limits, and empirical p [PITH_FULL_IMAGE:figures/full_fig_p027_6.png] view at source ↗

**Figure 7.** Figure 7: Comparison of estimators when varying αp (αq = 0.1 fixed, with αp swept over [0, 1]). Left: variational, right: spectral. 27 [PITH_FULL_IMAGE:figures/full_fig_p027_7.png] view at source ↗

**Figure 8.** Figure 8: Comparison of estimators when varying αq (αp = 0.05 fixed, with αq swept over [0, 1], only shown partially for the variational estimator since the performance is out of range). Left: variational, right: spectral. 8 Conclusion Summary. This paper compared ridge-regularized variational and spectral density-ratio estimation in a simple Gaussian location model under proportional high-dimensional asymptotics m … view at source ↗

**Figure 9.** Figure 9: Performance of the estimator with nuclear penalty [PITH_FULL_IMAGE:figures/full_fig_p060_9.png] view at source ↗

**Figure 10.** Figure 10: Comparison of two spectral estimators for a given [PITH_FULL_IMAGE:figures/full_fig_p060_10.png] view at source ↗

read the original abstract

We study ridge-regularized log-density-ratio estimation in the Gaussian location model with a common covariance matrix. By affine invariance, the model is written as q $\sim$ N(0, I), p $\sim$ N($\Delta$, I), with linear features, where $\Delta$ is a mean vector. The variational estimator is the empirical Kullback-Leibler (KL) log-normalized fit with a squared L2-penalty on its nonconstant coefficient, and the spectral estimator recently introduced in [1] replaces a single variational problem by a continuum of ridge-regularized least-squares problems. We derive high-dimensional deterministic asymptotic equivalents when the numbers of observations and dimension tend to infinity with fixed ratios. The regularized variational limit is characterized by a scalar entropy minimization problem derived from the convex-Gaussian-min-max theorem (CGMT), while the regularized spectral limit follows from deterministic equivalents for resolvents of weighted sums of two independent Gaussian sample covariance matrices. We use these formulas to compare population risks, with experiments focused on fixed-signal aspect-ratio sweeps and optimized regularization. Our conclusion is that with many observations, under the criteria and asymptotic regimes analyzed here, the well-specified variational estimator has the smaller risk, while with fewer observations, the spectral estimator is favored because its covariance-based construction has lower variance. We also study how a nuclear penalty can be used and partially analyzed to perform feature learning.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

Derives explicit high-dimensional limits for ridge-regularized variational and spectral density-ratio estimators in the Gaussian location model and compares their risks by observation regime.

read the letter

The paper works out high-dimensional asymptotic equivalents for the population risk of two ridge-regularized log-density-ratio estimators under the Gaussian location model with common covariance. It uses CGMT to reduce the variational estimator to a scalar entropy problem and resolvent deterministic equivalents for the spectral estimator, then compares the resulting risks after optimizing the regularization parameter.

What is new is the explicit limit formulas and the regime-dependent comparison: the variational estimator wins when the number of observations is large relative to dimension, while the spectral one is better when observations are scarce because its covariance construction has lower variance. The experiments sweep fixed-signal aspect ratios and report the optimized risks, which makes the conclusion concrete.

The work is scoped tightly to the exact model and the high-dimensional limit with fixed ratios, which is stated up front. That keeps the claims honest but also limits how far the rule travels without further checks. The nuclear-norm extension for feature learning is mentioned but only partially analyzed, so it reads as an aside rather than a developed result.

The math relies on standard tools for this model class and the risk comparison stays inside the same asymptotic regime used to derive the limits. No obvious internal contradiction appears from the abstract and stress-test note.

This is for readers who already work on high-dimensional density-ratio estimation or asymptotic analysis of regularized M-estimators. Someone who needs a practical rule for choosing between these two estimators on near-Gaussian data would find it useful. It is formally grounded enough to deserve a serious referee rather than a desk reject.

Referee Report

0 major / 2 minor

Summary. The manuscript derives high-dimensional asymptotic equivalents for the population risks of ridge-regularized variational and spectral log-density-ratio estimators in the Gaussian location model (q ~ N(0,I), p ~ N(Δ,I) with common covariance). The variational limit is characterized via the convex-Gaussian-min-max theorem (CGMT) as a scalar entropy minimization problem, while the spectral limit follows from deterministic equivalents for resolvents of weighted sums of two independent Gaussian sample covariance matrices. These formulas are used to compare risks, with the conclusion that the well-specified variational estimator has smaller risk with many observations while the spectral estimator is favored with fewer observations due to lower variance from its covariance-based construction. Experiments focus on fixed-signal aspect-ratio sweeps with optimized regularization, and a nuclear penalty is partially analyzed for feature learning.

Significance. If the derivations hold, the work provides precise asymptotic risk characterizations that enable direct comparison of the two estimators in the high-dimensional regime with fixed observation-to-dimension ratios. The explicit use of CGMT for the variational case and resolvent deterministic equivalents for the spectral case is a strength, as these are standard tools that yield exact limits under the stated model. The scoping of the conclusion to the analyzed criteria, regimes, and Gaussian location model is clearly stated upfront.

minor comments (2)

[Abstract] Abstract: the phrase 'optimized regularization' is used for the risk comparisons but the manuscript does not specify whether the regularization strength is chosen by oracle knowledge of the population risk or by a data-driven procedure; this affects how the comparison should be interpreted in practice.
[Model setup paragraph] The high-dimensional limit is taken with fixed ratios of observations to dimension, but the notation for the limiting ratios (e.g., n_p/d, n_q/d) is not introduced until later sections; adding it to the model-setup paragraph would improve readability.

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for the careful reading, positive assessment of the derivations and scoping, and recommendation of minor revision. No major comments appear in the provided report.

Circularity Check

0 steps flagged

No significant circularity identified

full rationale

The paper derives high-dimensional asymptotic equivalents for the variational estimator via the external convex-Gaussian-min-max theorem (CGMT) and for the spectral estimator via resolvent deterministic equivalents of weighted sample covariances. These are standard external mathematical tools. The subsequent population-risk comparison occurs inside the stated Gaussian location model and fixed-ratio asymptotic regime, but does not reduce any claim to a fitted parameter, self-definition, or self-citation chain by construction. Model assumptions are declared explicitly at the outset; no load-bearing step collapses to the paper's own inputs.

Axiom & Free-Parameter Ledger

1 free parameters · 2 axioms · 0 invented entities

The central claims rest on the exact Gaussian location model, the high-dimensional proportional limit, and the validity of the CGMT and resolvent deterministic equivalents; no new entities are postulated.

free parameters (1)

ridge regularization strength
The paper studies ridge-regularized estimators and optimizes the penalty parameter; this is a tunable hyperparameter whose value affects the reported risk comparison.

axioms (2)

domain assumption Data exactly follows the two-Gaussian location model with identical covariance matrices
Invoked in the model definition and used to obtain affine invariance.
domain assumption High-dimensional limit with n, d → ∞ at fixed aspect ratios
Required for the deterministic equivalents derived via CGMT and resolvent theory.

pith-pipeline@v0.9.1-grok · 5791 in / 1500 out tokens · 26608 ms · 2026-07-03T17:43:34.602771+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

63 extracted references · 4 canonical work pages · 2 internal anchors

[1]

A Spectral Framework for Closed-Form Relative Density Estimation

Francis Bach. A spectral framework for closed-form rela tive density estimation, 2026. arXiv:2605.10668. (cited on pages 1, 2, 3, 5, 6, 7, 26, 28, and 29) 59 Figure 9: Performance of the estimator with nuclear penalty , when varying α p (α q = 0. 1 ﬁxed, with α p swept over [0, 1]). Figure 10: Comparison of two spectral estimators for a given s = 1 when r...

work page internal anchor Pith review Pith/arXiv arXiv 2026
[2]

Direct importance estimation for covariate shif t adaptation

Masashi Sugiyama, Taiji Suzuki, Shinichi Nakajima, His ashi Kashima, Paul von B¨ unau, and Motoaki Kawanabe. Direct importance estimation for covariate shif t adaptation. Annals of the Institute of Statistical Mathematics , 60(4):699–746, 2008. (cited on pages 1 and 3)

2008
[3]

A least-squares approach to direct im- portance estimation

Takafumi Kanamori, Shohei Hido, and Masashi Sugiyama. A least-squares approach to direct im- portance estimation. Journal of Machine Learning Research , 10:1391–1445, 2009. (cited on pages 1 and 3)

2009
[4]

Density Ratio Estimation in Machine Learning

Masashi Sugiyama, Taiji Suzuki, and Takafumi Kanamori. Density Ratio Estimation in Machine Learning. Cambridge University Press, 2012. (cited on pages 1 and 3)

2012
[5]

Wainwright, and Michael I

XuanLong Nguyen, Martin J. Wainwright, and Michael I. Jo rdan. Estimating divergence functionals and the likelihood ratio by convex risk minimization. IEEE Transactions on Information Theory , 56(11):5847–5861, 2010. (cited on pages 1, 3, 4, 7, and 29)

2010
[6]

Donsker and S

Monroe D. Donsker and S. R. Srinivasa Varadhan. Asymptot ic evaluation of certain Markov process expectations for large time, I. Communications on Pure and Applied Mathematics , 28(1):1–47, 1975. (cited on pages 1, 3, 7, and 29)

1975
[7]

Inferences for case-control and semiparametr ic two-sample density ratio models

Jing Qin. Inferences for case-control and semiparametr ic two-sample density ratio models. Biometrika, 85(3):619–630, 1998. (cited on page 1)

1998
[8]

Cand` es and Pragya Sur

Emmanuel J. Cand` es and Pragya Sur. The phase transition for the existence of the maximum likelihood estimate in high-dimensional logistic regression. The Annals of Statistics , 48(1):27–42, 2020. (cited on pages 1 and 3)

2020
[9]

Fi nite-sample performance of the maximum likelihood estimator in logistic regression, 2024

Hugo Chardon, Matthieu Lerasle, and Jaouad Mourtada. Fi nite-sample performance of the maximum likelihood estimator in logistic regression, 2024. arXiv: 2411.02137. (cited on pages 1 and 3)

work page arXiv 2024
[10]

Hoerl and Robert W

Arthur E. Hoerl and Robert W. Kennard. Ridge regression : Biased estimation for nonorthogonal problems. Technometrics, 12(1):55–67, 1970. (cited on page 1)

1970
[11]

Smola, Arthur Gretton, Kar sten M

Jiayuan Huang, Alexander J. Smola, Arthur Gretton, Kar sten M. Borgwardt, and Bernhard Sch¨ olkopf. Correcting sample selection bias by unlabeled data. In Advances in Neural Information Processing Systems, 2007. (cited on page 3)

2007
[12]

Linking losses for density ratio and class-probability estimation

Aditya Krishna Menon and Cheng Soon Ong. Linking losses for density ratio and class-probability estimation. In International Conference on Machine Learning , 2016. (cited on page 3)

2016
[13]

Cand` es

Pragya Sur and Emmanuel J. Cand` es. A modern maximum-li kelihood theory for high-dimensional logistic regression. Proceedings of the National Academy of Sciences , 116(29):14516–14525, 2019. (cited on page 3)

2019
[14]

The i mpact of regularization on high-dimensional logistic regression

Fariborz Salehi, Ehsan Abbasi, and Babak Hassibi. The i mpact of regularization on high-dimensional logistic regression. In Advances in Neural Information Processing Systems , 2019. (cited on pages 3, 13, and 28)

2019
[15]

Silverstein

Zhidong Bai and Jack W. Silverstein. Spectral Analysis of Large Dimensional Random Matrices . Springer, 2nd edition, 2010. (cited on pages 3, 14, 28, and 29) 61

2010
[16]

Dete rministic equivalents for certain functionals of large random matrices

Walid Hachem, Philippe Loubaton, and Jamal Najim. Dete rministic equivalents for certain functionals of large random matrices. The Annals of Applied Probability , 17(3):875–930, 2007. (cited on pages 3 and 28)

2007
[17]

High-dimensional asy mptotics of prediction: Ridge regression and classiﬁcation

Edgar Dobriban and Stefan Wager. High-dimensional asy mptotics of prediction: Ridge regression and classiﬁcation. The Annals of Statistics , 46(1):247–279, 2018. (cited on page 3)

2018
[18]

The interpolation p hase transition in neural networks: Memo- rization and generalization under lazy training

Andrea Montanari and Yiqiao Zhong. The interpolation p hase transition in neural networks: Memo- rization and generalization under lazy training. The Annals of Statistics , 50(5):2816–2847, 2022. (cited on pages 3, 14, and 28)

2022
[19]

High-dimensional analysis of double des cent for linear regression with random projec- tions

Francis Bach. High-dimensional analysis of double des cent for linear regression with random projec- tions. SIAM Journal on Mathematics of Data Science , 6(1):26–50, 2024. (cited on pages 3, 14, 15, 16, 28, 42, and 43)

2024
[20]

Peter McCullagh and John A. Nelder. Generalized Linear Models . Chapman & Hall, 1989. (cited on page 4)

1989
[21]

Convex Optimization

Stephen Boyd and Lieven Vandenberghe. Convex Optimization . Cambridge University Press, 2004. (cited on page 4)

2004
[22]

Jorge Nocedal and Stephen J. Wright. Numerical Optimization. Springer, 2nd edition, 2006. (cited on pages 4 and 13)

2006
[23]

Minimization of ϕ -divergences on sets of signed measures

Michel Broniatowski and Amor Keziou. Minimization of ϕ -divergences on sets of signed measures. Studia Scientiarum Mathematicarum Hungarica , 43(4):403–442, 2006. (cited on page 5)

2006
[24]

Orthogonal Polynomials: Computation and Approximation

Walter Gautschi. Orthogonal Polynomials: Computation and Approximation . Oxford University Press,
[25]

(cited on pages 6 and 13)
[26]

Existence of direct density ratio estimators, 2025

Erika Banzato, Mathias Drton, Kian Saraf-Poor, and Hon gjian Shi. Existence of direct density ratio estimators, 2025. arXiv:2502.12738. (cited on page 9)

work page arXiv 2025
[27]

The implicit bias of gradient descent on separable data

Daniel Soudry, Elad Hoﬀer, Mor Shpigel Nacson, Suriya Gu nasekar, and Nathan Srebro. The implicit bias of gradient descent on separable data. Journal of Machine Learning Research , 19(70):1–57, 2018. (cited on page 9)

2018
[28]

Lee, Daniel Soudry, and Nath an Srebro

Suriya Gunasekar, Jason D. Lee, Daniel Soudry, and Nath an Srebro. Implicit bias of gradient descent on linear convolutional networks. In Advances in Neural Information Processing Systems , 2018. (cited on page 9)

2018
[29]

van der Vaart

Aad W. van der Vaart. Asymptotic Statistics . Cambridge University Press, 1998. (cited on pages 10, 19, and 20)

1998
[30]

High-Dimensional Probability: An Introduction with Applic ations in Data Science

Roman Vershynin. High-Dimensional Probability: An Introduction with Applic ations in Data Science . Cambridge University Press, 2018. (cited on page 10)

2018
[31]

The Gaussian min-max theorem in the Presence of Convexity

Christos Thrampoulidis, Samet Oymak, and Babak Hassib i. The Gaussian min-max theorem in the presence of convexity, 2014. arXiv:1408.4837. (cited on pages 12 and 37) 62

work page internal anchor Pith review Pith/arXiv arXiv 2014
[32]

Regularized linear regression: A precise analysis of the estimation error

Christos Thrampoulidis, Samet Oymak, and Babak Hassib i. Regularized linear regression: A precise analysis of the estimation error. In Conference on Learning Theory , 2015. (cited on page 12)

2015
[33]

Corless, Gaston H

Robert M. Corless, Gaston H. Gonnet, David E. G. Hare, Da vid J. Jeﬀrey, and Donald E. Knuth. On the Lambert W function. Advances in Computational Mathematics , 5:329–359, 1996. (cited on page 13)

1996
[34]

Trace Ideals and Their Applications

Barry Simon. Trace Ideals and Their Applications . American Mathematical Society, 2005. (cited on page 18)

2005
[35]

Richard P. Brent. Algorithms for Minimization Without Derivatives . Prentice Hall, 1973. (cited on page 26)

1973
[36]

Convex multi-task feature learning

Andreas Argyriou, Theodoros Evgeniou, and Massimilia no Pontil. Convex multi-task feature learning. Machine Learning, 73(3):243–272, 2008. (cited on pages 28, 45, and 47)

2008
[37]

Sinan G¨ unt¨ urk

Ingrid Daubechies, Ronald DeVore, Massimo Fornasier, and C. Sinan G¨ unt¨ urk. Iteratively reweighted least squares minimization for sparse recovery. Communications on Pure and Applied Mathematics , 63(1):1–38, 2010. (cited on page 29)

2010
[38]

Optimization with sparsity-inducing penalties

Francis Bach, Rodolphe Jenatton, Julien Mairal, and Gu illaume Obozinski. Optimization with sparsity-inducing penalties. Foundations and Trends in Machine Learning , 4(1):1–106, 2012. (cited on page 29)

2012
[39]

Ajanki, L´ aszl´ o Erd˝ os, and Torben H

Oskari H. Ajanki, L´ aszl´ o Erd˝ os, and Torben H. Kr¨ uger. Stability of the matrix Dyson equation and random matrices with correlations. Probability Theory and Related Fields , 173(1–2):293–373, 2019. (cited on pages 29, 49, 51, and 53)

2019
[40]

Mutual information neural esti mation

Mohamed Ishmael Belghazi, Aristide Baratin, Sai Rajes hwar, Sherjil Ozair, Yoshua Bengio, Aaron Courville, and Devon Hjelm. Mutual information neural esti mation. In International Conference on Machine Learning, 2018. (cited on page 29)

2018
[41]

Alemi, and George Tucker

Ben Poole, Sherjil Ozair, Aaron van den Oord, Alexander A. Alemi, and George Tucker. On variational bounds of mutual information. In International Conference on Machine Learning, 2019. (cited on page 29)

2019
[42]

Random features for larg e-scale kernel machines

Ali Rahimi and Benjamin Recht. Random features for larg e-scale kernel machines. In Advances in Neural Information Processing Systems , 2007. (cited on page 29)

2007
[43]

Christopher K. I. Williams and Matthias Seeger. Using t he Nystr¨ om method to speed up kernel machines. In Advances in Neural Information Processing Systems , 2001. (cited on page 29)

2001
[44]

A rando m matrix approach to neural networks

Cosme Louart, Zhenyu Liao, and Romain Couillet. A rando m matrix approach to neural networks. The Annals of Applied Probability , 28(2):1190–1248, 2018. (cited on page 29)

2018
[45]

The generalization erro r of random features regression: Precise asymptotics and the double descent curve

Song Mei and Andrea Montanari. The generalization erro r of random features regression: Precise asymptotics and the double descent curve. Communications on Pure and Applied Mathematics , 75(4):667–766, 2022. (cited on page 29)

2022
[46]

Hong Hu and Yue M. Lu. Universality laws for high-dimens ional learning with random features. IEEE Transactions on Information Theory , 69(3):1932–1964, 2023. (cited on page 29) 63

1932
[47]

Gener- alisation error in learning with random features and the hid den manifold model

Federica Gerace, Bruno Loureiro, Florent Krzakala, Ma rc M´ ezard, and Lenka Zdeborov´ a. Gener- alisation error in learning with random features and the hid den manifold model. In International Conference on Machine Learning , 2020. (cited on page 29)

2020
[48]

Learning curves of generic features mapsfor realistic datasets with a teacher-student model

Bruno Loureiro, C´ edric Gerbelot, Hugo Cui, Sebastian Goldt, Florent Krzakala, Marc M´ ezard, and Lenka Zdeborov´ a. Learning curves of generic features mapsfor realistic datasets with a teacher-student model. In Advances in Neural Information Processing Systems , 2021. (cited on page 29)

2021
[49]

Fluctuations, bias, variance & ensemble of learners: Exact asymptotics fo r convex losses in high-dimension

Bruno Loureiro, Cedric Gerbelot, Maria Reﬁnetti, Gabr iele Sicuro, and Florent Krzakala. Fluctuations, bias, variance & ensemble of learners: Exact asymptotics fo r convex losses in high-dimension. In International Conference on Machine Learning , 2022. (cited on page 29)

2022
[50]

Johnstone

Iain M. Johnstone. On the distribution of the largest ei genvalue in principal components analysis. The Annals of Statistics , 29(2):295–327, 2001. (cited on page 29)

2001
[51]

Ph ase transition of the largest eigenvalue for nonnull complex sample covariance matrices

Jinho Baik, G´ erard Ben Arous, and Sandrine P´ ech´ e. Ph ase transition of the largest eigenvalue for nonnull complex sample covariance matrices. The Annals of Probability , 33(5):1643–1697, 2005. (cited on page 29)

2005
[52]

Precise error analysis of regularized M- estimators in high dimensions

Christos Thrampoulidis, Ehsan Abbasi, and Babak Hassi bi. Precise error analysis of regularized M- estimators in high dimensions. IEEE Transactions on Information Theory , 64(8):5592–5628, 2018. (cited on page 39)

2018
[53]

On the numerical evaluation of Fred holm determinants

Folkmar Bornemann. On the numerical evaluation of Fred holm determinants. Mathematics of Com- putation, 79(270):871–915, 2010. (cited on page 44)

2010
[54]

Perturbation Theory for Linear Operators

Tosio Kato. Perturbation Theory for Linear Operators . Springer, 1995. (cited on page 46)

1995
[55]

Isotropic local laws for sample covariance and generalized Wigner matrices

Alex Bloemendal, L´ aszl´ o Erd˝ os, Antti Knowles, Horng-Tzer Yau, and Jun Yin. Isotropic local laws for sample covariance and generalized Wigner matrices. Electronic Journal of Probability , 19(33):1–53,
[56]

(cited on pages 49 and 53)
[57]

Charles M. Stein. Estimation of the mean of a multivaria te normal distribution. The Annals of Statistics, 9(6):1135–1151, 1981. (cited on page 49)

1981
[58]

Khorunzhy, Boris A

Alexei M. Khorunzhy, Boris A. Khoruzhenko, and Leonid A . Pastur. Asymptotic properties of large random matrices with independent entries. Journal of Mathematical Physics , 37(10):5033–5060, 1996. (cited on page 49)

1996
[59]

Ajanki, L´ aszl´ o Erd˝ os, and Torben H

Oskari H. Ajanki, L´ aszl´ o Erd˝ os, and Torben H. Kr¨ uger. Quadratic Vector Equations on Complex Upper Half-Plane . Memoirs of the American Mathematical Society. American Ma thematical Society,
[60]

(cited on pages 51 and 53)
[61]

Kr¨ uger

Johannes Alt, L´ aszl´ o Erd˝ os, and Torben H. Kr¨ uger. The Dyson equation with linear self-energy: Spectral bands, edges and cusps. Documenta Mathematica , 25:1421–1539, 2020. (cited on pages 51 and 53)

2020
[62]

Fixed point equations and nonlinear eig envalue problems in ordered Banach spaces

Herbert Amann. Fixed point equations and nonlinear eig envalue problems in ordered Banach spaces. SIAM Review , 18(4):620–709, 1976. (cited on page 59)

1976
[63]

Positive Deﬁnite Matrices

Rajendra Bhatia. Positive Deﬁnite Matrices . Princeton University Press, 2007. (cited on page 59) 64

2007

[1] [1]

A Spectral Framework for Closed-Form Relative Density Estimation

Francis Bach. A spectral framework for closed-form rela tive density estimation, 2026. arXiv:2605.10668. (cited on pages 1, 2, 3, 5, 6, 7, 26, 28, and 29) 59 Figure 9: Performance of the estimator with nuclear penalty , when varying α p (α q = 0. 1 ﬁxed, with α p swept over [0, 1]). Figure 10: Comparison of two spectral estimators for a given s = 1 when r...

work page internal anchor Pith review Pith/arXiv arXiv 2026

[2] [2]

Direct importance estimation for covariate shif t adaptation

Masashi Sugiyama, Taiji Suzuki, Shinichi Nakajima, His ashi Kashima, Paul von B¨ unau, and Motoaki Kawanabe. Direct importance estimation for covariate shif t adaptation. Annals of the Institute of Statistical Mathematics , 60(4):699–746, 2008. (cited on pages 1 and 3)

2008

[3] [3]

A least-squares approach to direct im- portance estimation

Takafumi Kanamori, Shohei Hido, and Masashi Sugiyama. A least-squares approach to direct im- portance estimation. Journal of Machine Learning Research , 10:1391–1445, 2009. (cited on pages 1 and 3)

2009

[4] [4]

Density Ratio Estimation in Machine Learning

Masashi Sugiyama, Taiji Suzuki, and Takafumi Kanamori. Density Ratio Estimation in Machine Learning. Cambridge University Press, 2012. (cited on pages 1 and 3)

2012

[5] [5]

Wainwright, and Michael I

XuanLong Nguyen, Martin J. Wainwright, and Michael I. Jo rdan. Estimating divergence functionals and the likelihood ratio by convex risk minimization. IEEE Transactions on Information Theory , 56(11):5847–5861, 2010. (cited on pages 1, 3, 4, 7, and 29)

2010

[6] [6]

Donsker and S

Monroe D. Donsker and S. R. Srinivasa Varadhan. Asymptot ic evaluation of certain Markov process expectations for large time, I. Communications on Pure and Applied Mathematics , 28(1):1–47, 1975. (cited on pages 1, 3, 7, and 29)

1975

[7] [7]

Inferences for case-control and semiparametr ic two-sample density ratio models

Jing Qin. Inferences for case-control and semiparametr ic two-sample density ratio models. Biometrika, 85(3):619–630, 1998. (cited on page 1)

1998

[8] [8]

Cand` es and Pragya Sur

Emmanuel J. Cand` es and Pragya Sur. The phase transition for the existence of the maximum likelihood estimate in high-dimensional logistic regression. The Annals of Statistics , 48(1):27–42, 2020. (cited on pages 1 and 3)

2020

[9] [9]

Fi nite-sample performance of the maximum likelihood estimator in logistic regression, 2024

Hugo Chardon, Matthieu Lerasle, and Jaouad Mourtada. Fi nite-sample performance of the maximum likelihood estimator in logistic regression, 2024. arXiv: 2411.02137. (cited on pages 1 and 3)

work page arXiv 2024

[10] [10]

Hoerl and Robert W

Arthur E. Hoerl and Robert W. Kennard. Ridge regression : Biased estimation for nonorthogonal problems. Technometrics, 12(1):55–67, 1970. (cited on page 1)

1970

[11] [11]

Smola, Arthur Gretton, Kar sten M

Jiayuan Huang, Alexander J. Smola, Arthur Gretton, Kar sten M. Borgwardt, and Bernhard Sch¨ olkopf. Correcting sample selection bias by unlabeled data. In Advances in Neural Information Processing Systems, 2007. (cited on page 3)

2007

[12] [12]

Linking losses for density ratio and class-probability estimation

Aditya Krishna Menon and Cheng Soon Ong. Linking losses for density ratio and class-probability estimation. In International Conference on Machine Learning , 2016. (cited on page 3)

2016

[13] [13]

Cand` es

Pragya Sur and Emmanuel J. Cand` es. A modern maximum-li kelihood theory for high-dimensional logistic regression. Proceedings of the National Academy of Sciences , 116(29):14516–14525, 2019. (cited on page 3)

2019

[14] [14]

The i mpact of regularization on high-dimensional logistic regression

Fariborz Salehi, Ehsan Abbasi, and Babak Hassibi. The i mpact of regularization on high-dimensional logistic regression. In Advances in Neural Information Processing Systems , 2019. (cited on pages 3, 13, and 28)

2019

[15] [15]

Silverstein

Zhidong Bai and Jack W. Silverstein. Spectral Analysis of Large Dimensional Random Matrices . Springer, 2nd edition, 2010. (cited on pages 3, 14, 28, and 29) 61

2010

[16] [16]

Dete rministic equivalents for certain functionals of large random matrices

Walid Hachem, Philippe Loubaton, and Jamal Najim. Dete rministic equivalents for certain functionals of large random matrices. The Annals of Applied Probability , 17(3):875–930, 2007. (cited on pages 3 and 28)

2007

[17] [17]

High-dimensional asy mptotics of prediction: Ridge regression and classiﬁcation

Edgar Dobriban and Stefan Wager. High-dimensional asy mptotics of prediction: Ridge regression and classiﬁcation. The Annals of Statistics , 46(1):247–279, 2018. (cited on page 3)

2018

[18] [18]

The interpolation p hase transition in neural networks: Memo- rization and generalization under lazy training

Andrea Montanari and Yiqiao Zhong. The interpolation p hase transition in neural networks: Memo- rization and generalization under lazy training. The Annals of Statistics , 50(5):2816–2847, 2022. (cited on pages 3, 14, and 28)

2022

[19] [19]

High-dimensional analysis of double des cent for linear regression with random projec- tions

Francis Bach. High-dimensional analysis of double des cent for linear regression with random projec- tions. SIAM Journal on Mathematics of Data Science , 6(1):26–50, 2024. (cited on pages 3, 14, 15, 16, 28, 42, and 43)

2024

[20] [20]

Peter McCullagh and John A. Nelder. Generalized Linear Models . Chapman & Hall, 1989. (cited on page 4)

1989

[21] [21]

Convex Optimization

Stephen Boyd and Lieven Vandenberghe. Convex Optimization . Cambridge University Press, 2004. (cited on page 4)

2004

[22] [22]

Jorge Nocedal and Stephen J. Wright. Numerical Optimization. Springer, 2nd edition, 2006. (cited on pages 4 and 13)

2006

[23] [23]

Minimization of ϕ -divergences on sets of signed measures

Michel Broniatowski and Amor Keziou. Minimization of ϕ -divergences on sets of signed measures. Studia Scientiarum Mathematicarum Hungarica , 43(4):403–442, 2006. (cited on page 5)

2006

[24] [24]

Orthogonal Polynomials: Computation and Approximation

Walter Gautschi. Orthogonal Polynomials: Computation and Approximation . Oxford University Press,

[25] [25]

(cited on pages 6 and 13)

[26] [26]

Existence of direct density ratio estimators, 2025

Erika Banzato, Mathias Drton, Kian Saraf-Poor, and Hon gjian Shi. Existence of direct density ratio estimators, 2025. arXiv:2502.12738. (cited on page 9)

work page arXiv 2025

[27] [27]

The implicit bias of gradient descent on separable data

Daniel Soudry, Elad Hoﬀer, Mor Shpigel Nacson, Suriya Gu nasekar, and Nathan Srebro. The implicit bias of gradient descent on separable data. Journal of Machine Learning Research , 19(70):1–57, 2018. (cited on page 9)

2018

[28] [28]

Lee, Daniel Soudry, and Nath an Srebro

Suriya Gunasekar, Jason D. Lee, Daniel Soudry, and Nath an Srebro. Implicit bias of gradient descent on linear convolutional networks. In Advances in Neural Information Processing Systems , 2018. (cited on page 9)

2018

[29] [29]

van der Vaart

Aad W. van der Vaart. Asymptotic Statistics . Cambridge University Press, 1998. (cited on pages 10, 19, and 20)

1998

[30] [30]

High-Dimensional Probability: An Introduction with Applic ations in Data Science

Roman Vershynin. High-Dimensional Probability: An Introduction with Applic ations in Data Science . Cambridge University Press, 2018. (cited on page 10)

2018

[31] [31]

The Gaussian min-max theorem in the Presence of Convexity

Christos Thrampoulidis, Samet Oymak, and Babak Hassib i. The Gaussian min-max theorem in the presence of convexity, 2014. arXiv:1408.4837. (cited on pages 12 and 37) 62

work page internal anchor Pith review Pith/arXiv arXiv 2014

[32] [32]

Regularized linear regression: A precise analysis of the estimation error

Christos Thrampoulidis, Samet Oymak, and Babak Hassib i. Regularized linear regression: A precise analysis of the estimation error. In Conference on Learning Theory , 2015. (cited on page 12)

2015

[33] [33]

Corless, Gaston H

Robert M. Corless, Gaston H. Gonnet, David E. G. Hare, Da vid J. Jeﬀrey, and Donald E. Knuth. On the Lambert W function. Advances in Computational Mathematics , 5:329–359, 1996. (cited on page 13)

1996

[34] [34]

Trace Ideals and Their Applications

Barry Simon. Trace Ideals and Their Applications . American Mathematical Society, 2005. (cited on page 18)

2005

[35] [35]

Richard P. Brent. Algorithms for Minimization Without Derivatives . Prentice Hall, 1973. (cited on page 26)

1973

[36] [36]

Convex multi-task feature learning

Andreas Argyriou, Theodoros Evgeniou, and Massimilia no Pontil. Convex multi-task feature learning. Machine Learning, 73(3):243–272, 2008. (cited on pages 28, 45, and 47)

2008

[37] [37]

Sinan G¨ unt¨ urk

Ingrid Daubechies, Ronald DeVore, Massimo Fornasier, and C. Sinan G¨ unt¨ urk. Iteratively reweighted least squares minimization for sparse recovery. Communications on Pure and Applied Mathematics , 63(1):1–38, 2010. (cited on page 29)

2010

[38] [38]

Optimization with sparsity-inducing penalties

Francis Bach, Rodolphe Jenatton, Julien Mairal, and Gu illaume Obozinski. Optimization with sparsity-inducing penalties. Foundations and Trends in Machine Learning , 4(1):1–106, 2012. (cited on page 29)

2012

[39] [39]

Ajanki, L´ aszl´ o Erd˝ os, and Torben H

Oskari H. Ajanki, L´ aszl´ o Erd˝ os, and Torben H. Kr¨ uger. Stability of the matrix Dyson equation and random matrices with correlations. Probability Theory and Related Fields , 173(1–2):293–373, 2019. (cited on pages 29, 49, 51, and 53)

2019

[40] [40]

Mutual information neural esti mation

Mohamed Ishmael Belghazi, Aristide Baratin, Sai Rajes hwar, Sherjil Ozair, Yoshua Bengio, Aaron Courville, and Devon Hjelm. Mutual information neural esti mation. In International Conference on Machine Learning, 2018. (cited on page 29)

2018

[41] [41]

Alemi, and George Tucker

Ben Poole, Sherjil Ozair, Aaron van den Oord, Alexander A. Alemi, and George Tucker. On variational bounds of mutual information. In International Conference on Machine Learning, 2019. (cited on page 29)

2019

[42] [42]

Random features for larg e-scale kernel machines

Ali Rahimi and Benjamin Recht. Random features for larg e-scale kernel machines. In Advances in Neural Information Processing Systems , 2007. (cited on page 29)

2007

[43] [43]

Christopher K. I. Williams and Matthias Seeger. Using t he Nystr¨ om method to speed up kernel machines. In Advances in Neural Information Processing Systems , 2001. (cited on page 29)

2001

[44] [44]

A rando m matrix approach to neural networks

Cosme Louart, Zhenyu Liao, and Romain Couillet. A rando m matrix approach to neural networks. The Annals of Applied Probability , 28(2):1190–1248, 2018. (cited on page 29)

2018

[45] [45]

The generalization erro r of random features regression: Precise asymptotics and the double descent curve

Song Mei and Andrea Montanari. The generalization erro r of random features regression: Precise asymptotics and the double descent curve. Communications on Pure and Applied Mathematics , 75(4):667–766, 2022. (cited on page 29)

2022

[46] [46]

Hong Hu and Yue M. Lu. Universality laws for high-dimens ional learning with random features. IEEE Transactions on Information Theory , 69(3):1932–1964, 2023. (cited on page 29) 63

1932

[47] [47]

Gener- alisation error in learning with random features and the hid den manifold model

Federica Gerace, Bruno Loureiro, Florent Krzakala, Ma rc M´ ezard, and Lenka Zdeborov´ a. Gener- alisation error in learning with random features and the hid den manifold model. In International Conference on Machine Learning , 2020. (cited on page 29)

2020

[48] [48]

Learning curves of generic features mapsfor realistic datasets with a teacher-student model

Bruno Loureiro, C´ edric Gerbelot, Hugo Cui, Sebastian Goldt, Florent Krzakala, Marc M´ ezard, and Lenka Zdeborov´ a. Learning curves of generic features mapsfor realistic datasets with a teacher-student model. In Advances in Neural Information Processing Systems , 2021. (cited on page 29)

2021

[49] [49]

Fluctuations, bias, variance & ensemble of learners: Exact asymptotics fo r convex losses in high-dimension

Bruno Loureiro, Cedric Gerbelot, Maria Reﬁnetti, Gabr iele Sicuro, and Florent Krzakala. Fluctuations, bias, variance & ensemble of learners: Exact asymptotics fo r convex losses in high-dimension. In International Conference on Machine Learning , 2022. (cited on page 29)

2022

[50] [50]

Johnstone

Iain M. Johnstone. On the distribution of the largest ei genvalue in principal components analysis. The Annals of Statistics , 29(2):295–327, 2001. (cited on page 29)

2001

[51] [51]

Ph ase transition of the largest eigenvalue for nonnull complex sample covariance matrices

Jinho Baik, G´ erard Ben Arous, and Sandrine P´ ech´ e. Ph ase transition of the largest eigenvalue for nonnull complex sample covariance matrices. The Annals of Probability , 33(5):1643–1697, 2005. (cited on page 29)

2005

[52] [52]

Precise error analysis of regularized M- estimators in high dimensions

Christos Thrampoulidis, Ehsan Abbasi, and Babak Hassi bi. Precise error analysis of regularized M- estimators in high dimensions. IEEE Transactions on Information Theory , 64(8):5592–5628, 2018. (cited on page 39)

2018

[53] [53]

On the numerical evaluation of Fred holm determinants

Folkmar Bornemann. On the numerical evaluation of Fred holm determinants. Mathematics of Com- putation, 79(270):871–915, 2010. (cited on page 44)

2010

[54] [54]

Perturbation Theory for Linear Operators

Tosio Kato. Perturbation Theory for Linear Operators . Springer, 1995. (cited on page 46)

1995

[55] [55]

Isotropic local laws for sample covariance and generalized Wigner matrices

Alex Bloemendal, L´ aszl´ o Erd˝ os, Antti Knowles, Horng-Tzer Yau, and Jun Yin. Isotropic local laws for sample covariance and generalized Wigner matrices. Electronic Journal of Probability , 19(33):1–53,

[56] [56]

(cited on pages 49 and 53)

[57] [57]

Charles M. Stein. Estimation of the mean of a multivaria te normal distribution. The Annals of Statistics, 9(6):1135–1151, 1981. (cited on page 49)

1981

[58] [58]

Khorunzhy, Boris A

Alexei M. Khorunzhy, Boris A. Khoruzhenko, and Leonid A . Pastur. Asymptotic properties of large random matrices with independent entries. Journal of Mathematical Physics , 37(10):5033–5060, 1996. (cited on page 49)

1996

[59] [59]

Ajanki, L´ aszl´ o Erd˝ os, and Torben H

Oskari H. Ajanki, L´ aszl´ o Erd˝ os, and Torben H. Kr¨ uger. Quadratic Vector Equations on Complex Upper Half-Plane . Memoirs of the American Mathematical Society. American Ma thematical Society,

[60] [60]

(cited on pages 51 and 53)

[61] [61]

Kr¨ uger

Johannes Alt, L´ aszl´ o Erd˝ os, and Torben H. Kr¨ uger. The Dyson equation with linear self-energy: Spectral bands, edges and cusps. Documenta Mathematica , 25:1421–1539, 2020. (cited on pages 51 and 53)

2020

[62] [62]

Fixed point equations and nonlinear eig envalue problems in ordered Banach spaces

Herbert Amann. Fixed point equations and nonlinear eig envalue problems in ordered Banach spaces. SIAM Review , 18(4):620–709, 1976. (cited on page 59)

1976

[63] [63]

Positive Deﬁnite Matrices

Rajendra Bhatia. Positive Deﬁnite Matrices . Princeton University Press, 2007. (cited on page 59) 64

2007