High-dimensional ridge regression with random features for non-identically distributed data with a variance profile

Issa-Mbenard Dabo; Jérémie Bigot

arXiv: 2504.03035 · v2 · pith:KJCKM6OSnew · submitted 2025-04-03 · stat.ML · cs.LG · math.PR · math.ST · stat.ME · stat.TH


Pith reviewed 2026-05-22 21:19 UTC · model grok-4.3

classification: stat.ML · cs.LG · math.PR · math.ST · stat.ME · stat.TH

keywords: ridge regression, random features, high-dimensional asymptotics, variance profile, non-identically distributed data, generalization risk, double descent

The pith

Asymptotic equivalents for training and test risks of random-feature ridge regression are derived under row-dependent variance profiles.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper moves random feature ridge regression analysis beyond the standard model where all samples share the same covariance matrix. It introduces a variance-profile model in which each training and test vector has its own diagonal covariance matrix whose entries can vary by row and by feature. In the proportional regime where the number of samples n, ambient dimension p, and random-feature dimension m all grow large together, the authors obtain two families of asymptotic equivalents for the training and test risks. One family follows from a linear-plus-chaos approximation combined with traffic-probability arguments; the other is deterministic and follows from operator-valued free probability via an amalgamation-over-the-diagonal construction. These equivalents remain accurate in simulations and show that heterogeneous variance profiles can alter generalization curves and induce double-descent behavior at small ridge values.

Core claim

For data generated under the variance-profile model with row-dependent diagonal covariances Σ_i = diag(γ_{i1}^2, …, γ_{ip}^2) for training and analogous matrices for test points, the training and test risks of ridge regression on m random features admit explicit asymptotic equivalents when n, p, and m diverge proportionally. The equivalents are obtained first by combining the linear-plus-chaos approximation with traffic-probability arguments and second by a deterministic operator-valued free-probability calculation that uses an amalgamation-over-the-diagonal argument; both sets are shown to be sharp in numerical experiments.
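For orientation, the objects in this claim can be written out as below. Only the covariance structure $\Sigma_i$ comes from the abstract. The activation $\sigma$, the weight matrix $W$, the $1/n$ normalization, and the squared-error form of the risks are standard choices assumed here for illustration, and the paper's exact conventions may differ.

```latex
\begin{aligned}
x_i &= \Sigma_i^{1/2} x_i', \qquad
\Sigma_i = \operatorname{diag}(\gamma_{i1}^2,\ldots,\gamma_{ip}^2), \qquad
\text{so that } x_{ij} = \gamma_{ij}\, x_{ij}', \\
F &= \sigma(XW) \in \mathbb{R}^{n\times m}, \qquad
X = [x_1,\ldots,x_n]^\top \in \mathbb{R}^{n\times p}, \quad
W \in \mathbb{R}^{p\times m} \ \text{(assumed random weights)}, \\
\hat\theta_\lambda &= \arg\min_{\theta\in\mathbb{R}^m}
\tfrac{1}{n}\lVert y - F\theta\rVert_2^2 + \lambda\lVert\theta\rVert_2^2
= \Bigl(\tfrac{1}{n}F^\top F + \lambda I_m\Bigr)^{-1}\tfrac{1}{n}F^\top y, \\
R_{\mathrm{train}} &= \tfrac{1}{n}\lVert y - F\hat\theta_\lambda\rVert_2^2, \qquad
R_{\mathrm{test}} = \tfrac{1}{\tilde n}\lVert \tilde y - \tilde F\hat\theta_\lambda\rVert_2^2,
\quad \tilde F = \sigma(\tilde X W).
\end{aligned}
```

Here $\tilde X$ has rows $\tilde x_i = \widetilde{\Sigma}_i^{1/2}\tilde x_i'$, and $\tilde n$ (a symbol introduced here) is the number of test points. The homogeneous model is recovered when every $\gamma_{ij}$ depends on $j$ only.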

What carries the argument

The variance-profile model with row-dependent diagonal covariance matrices Σ_i and Σ̃_i, which replaces the single shared covariance assumption and enables the extension of risk asymptotics to non-identically distributed data.

If this is right

The derived equivalents accurately track empirical risks across a range of heterogeneous variance profiles, including mixture profiles motivated by MNIST.
Heterogeneous variances can change the location and height of the double-descent peak in the test risk when the ridge parameter is small.
The two independent derivation routes (linear-plus-chaos plus traffic probabilities, and operator-valued free probability) produce matching expressions, confirming the robustness of the asymptotic formulas.
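The double-descent statement above can be probed empirically with a small sweep over the number of random features m at a near-zero ridge. The sketch below is an assumption-laden illustration, not the paper's experiment: the two-level row profile, the ReLU activation, the linear teacher with Gaussian noise, and the function name `risk_curve_over_m` are all hypothetical choices. The heterogeneous and homogeneous profiles are matched to have the same average variance so that the curves are comparable.

```python
import numpy as np

def risk_curve_over_m(n=200, p=100, ms=(50, 100, 150, 200, 250, 400, 800),
                      lam=1e-6, hetero=True, reps=10, noise=0.5, seed=0):
    """Average empirical test risk of random-feature ridge for each m in ms."""
    rng = np.random.default_rng(seed)
    # Assumed profile: first half of the rows has variance 0.2, second half 1.8
    # (average variance 1); the homogeneous case uses variance 1 everywhere.
    lo, hi = (np.sqrt(0.2), np.sqrt(1.8)) if hetero else (1.0, 1.0)

    def sample(k):
        gamma = np.where(np.arange(k) < k // 2, lo, hi)[:, None]
        return gamma * rng.standard_normal((k, p))  # x_ij = gamma_ij * z_ij

    curve = {}
    for m in ms:
        errs = []
        for _ in range(reps):
            X, Xt = sample(n), sample(n)
            beta = rng.standard_normal(p) / np.sqrt(p)    # assumed linear teacher
            y = X @ beta + noise * rng.standard_normal(n)
            yt = Xt @ beta + noise * rng.standard_normal(n)
            W = rng.standard_normal((p, m)) / np.sqrt(p)  # assumed Gaussian weights
            F, Ft = np.maximum(X @ W, 0.0), np.maximum(Xt @ W, 0.0)  # ReLU features
            # Ridge solution: (F^T F / n + lam I)^{-1} F^T y / n
            theta = np.linalg.solve(F.T @ F / n + lam * np.eye(m), F.T @ y / n)
            errs.append(np.mean((Ft @ theta - yt) ** 2))
        curve[m] = float(np.mean(errs))
    return curve
```

Comparing `risk_curve_over_m(hetero=True)` with `risk_curve_over_m(hetero=False)` gives two empirical curves over m. Any difference in the peak near m ≈ n would be a finite-size observation under these assumed settings, not a check of the paper's formulas.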

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The same variance-profile approach could be applied to other kernel or neural-network estimators whose risk expressions are currently known only under homogeneous sampling.
In practice, one could estimate the row-wise variance profiles from data and plug them into the asymptotic formulas to obtain quick risk predictions without retraining.
The amalgamation-over-the-diagonal technique may extend to models with additional block or low-rank structure in the covariance profiles.

Load-bearing premise

The covariates are generated exactly according to the variance-profile model with row-dependent diagonal covariance matrices for both training and test sets.

What would settle it

Run Monte Carlo simulations with a fixed heterogeneous variance profile, increase n, p, and m while keeping their ratios constant, and check whether the empirical training and test risks converge to the formulas predicted by either the linear-plus-chaos or the free-probability equivalents.
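A minimal sketch of such a Monte Carlo check is given below. It covers only the empirical side: the asymptotic equivalents themselves are not stated in the abstract, so they are not implemented here. The profile in `two_block_profile`, the tanh activation, the linear teacher, the noise level, and the size ratios are hypothetical choices made for illustration.

```python
import numpy as np

def sample_profile_data(gamma, rng):
    """Draw X with x_ij = gamma_ij * z_ij, where z_ij are iid N(0, 1)."""
    return gamma * rng.standard_normal(gamma.shape)

def two_block_profile(n, p, low=0.5, high=1.5):
    """Assumed profile: rows alternate between two scales, columns drift mildly."""
    row_scale = np.where(np.arange(n) % 2 == 0, low, high)[:, None]
    col_scale = np.linspace(0.8, 1.2, p)[None, :]
    return row_scale * col_scale

def rf_ridge_risks(n, p, m, lam, rng, noise=0.1):
    """Return (train risk, test risk) for one draw of data and random weights."""
    X = sample_profile_data(two_block_profile(n, p), rng)
    X_test = sample_profile_data(two_block_profile(n, p), rng)
    beta = rng.standard_normal(p) / np.sqrt(p)    # assumed linear teacher
    y = X @ beta + noise * rng.standard_normal(n)
    y_test = X_test @ beta + noise * rng.standard_normal(n)
    W = rng.standard_normal((p, m)) / np.sqrt(p)  # assumed Gaussian weights
    F, F_test = np.tanh(X @ W), np.tanh(X_test @ W)
    # Ridge solution: (F^T F / n + lam I)^{-1} F^T y / n
    theta = np.linalg.solve(F.T @ F / n + lam * np.eye(m), F.T @ y / n)
    train = np.mean((F @ theta - y) ** 2)
    test = np.mean((F_test @ theta - y_test) ** 2)
    return train, test

def monte_carlo(sizes, ratio_p=0.5, ratio_m=1.5, lam=0.1, reps=20, seed=0):
    """For each n, keep p/n and m/n fixed and summarize risks over reps draws."""
    rng = np.random.default_rng(seed)
    out = []
    for n in sizes:
        p, m = int(ratio_p * n), int(ratio_m * n)
        risks = np.array([rf_ridge_risks(n, p, m, lam, rng) for _ in range(reps)])
        out.append((n, risks.mean(axis=0), risks.std(axis=0)))
    return out
```

Calling `monte_carlo([100, 200, 400])` returns, for each n, the mean and standard deviation of the (train, test) risks. If the asymptotic picture holds, the means should stabilize and the spread should shrink as n grows; the stabilized values would then be compared with whichever equivalent is being tested.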

Original abstract

Random feature ridge regression is often analyzed in the high-dimensional regime under the homogeneous sampling model $x_i=\Sigma^{1/2}x_i'$, where the vectors $x_i'$ have iid entries and the same covariance matrix $\Sigma$ is shared by all samples. In this paper, we move beyond this setting and study non-identically distributed data through a variance-profile model in which the training and test covariates have row-dependent diagonal covariance matrices $\Sigma_i=\diag(\gamma_{i1}^2,\ldots,\gamma_{ip}^2)$ and $\widetilde{\Sigma}_i=\diag(\tilde\gamma_{i1}^2,\ldots,\tilde\gamma_{ip}^2)$. Our main contribution is the derivation of asymptotic equivalents for the training and test risks of ridge regression with random features when $n$, $p$, and $m$ grow proportionally. The first set of equivalents is obtained by combining the linear-plus-chaos approximation with traffic-probability arguments, whereas the second set is deterministic and follows from operator-valued free probability through an amalgamation-over-the-diagonal argument. These equivalents are sharp in numerical experiments. They also reveal how heterogeneous variance profiles, including mixture-type profiles inspired by MNIST, can modify generalization and exhibit double-descent behavior when the ridge parameter is small.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

This paper derives asymptotic equivalents for random feature ridge risks under a row-dependent variance-profile model using chaos approximations and free probability amalgamation.

The main takeaway is that the authors have extended the usual homogeneous analysis of random feature ridge regression to a variance-profile setting where each training and test vector has its own diagonal covariance. They give two sets of asymptotic equivalents for training and test risk in the proportional regime and check that the formulas track the numerics, including on mixture profiles drawn from MNIST-like data. The work also shows how the heterogeneity can shift the location and height of the double-descent peak when the ridge parameter is small. That is the concrete advance over the homogeneous literature cited in the abstract.

The two derivation routes (one via linear-plus-chaos plus traffic probabilities, the other via operator-valued free probability with amalgamation over the diagonal) are direct but non-routine extensions of existing tools, and the paper presents both. The numerical match is reported as sharp, which is the main empirical support offered. No load-bearing internal contradiction appears in the stated claims or modeling assumptions.

The central limitation is that the abstract (and the stress-test note) does not supply explicit error bounds or the full derivation steps, so the sharpness claim rests on the cited techniques plus the reported experiments. That is a moderate rather than fatal gap for a theory paper; it mainly means a referee would want to see the complete proofs and data-generation details.

The paper is aimed at readers already working in high-dimensional statistics and random feature methods who want to move past the iid assumption. Anyone who has followed the homogeneous analyses will see immediately what changes and what stays the same. It is the kind of targeted technical extension that deserves a serious referee rather than a desk reject, even if the final verdict after review might be that the results are incremental. I would send it to peer review.

Referee Report

3 major / 1 minor

Summary. The manuscript derives asymptotic equivalents for the training and test risks of ridge regression with random features under a variance-profile model for non-identically distributed data, where training and test covariates have row-dependent diagonal covariances Σ_i and Σ̃_i. The equivalents are obtained by combining the linear-plus-chaos approximation with traffic-probability arguments and via deterministic expressions from operator-valued free probability using an amalgamation-over-the-diagonal argument, in the proportional regime where n, p, m grow together. Numerical experiments indicate the equivalents are sharp and illustrate how heterogeneous profiles (including MNIST-inspired mixtures) affect generalization and double-descent behavior for small ridge parameters.

Significance. If the derivations hold, the work provides a technically grounded extension of random matrix theory tools (linear-plus-chaos plus operator-valued free probability) to heterogeneous diagonal covariance settings beyond the standard homogeneous Σ model. This is a clear strength, as the approach directly builds on established techniques without introducing circularity. The numerical validation on synthetic and mixture profiles supports applicability to real data heterogeneity, and the explicit revelation of modified double-descent phenomena offers falsifiable predictions for generalization under variance profiles.

major comments (3)

[Abstract and main derivation sections] The central claim is the derivation of asymptotic equivalents, yet the manuscript provides only high-level descriptions of the linear-plus-chaos plus traffic-probability steps and the amalgamation-over-the-diagonal argument without the full expansion of the resulting equations or the explicit form of the deterministic equivalents (e.g., the resolvent expressions or fixed-point equations). This is load-bearing for the main contribution.
[Numerical experiments section] The claim that 'these equivalents are sharp in numerical experiments' is made without accompanying error bounds, convergence rates, or quantitative measures of approximation error (e.g., relative deviation as n,p,m → ∞). This undermines assessment of the regime's validity for the stated proportional growth.
[Numerical experiments section] Data-generation procedures for the variance profiles γ_ij and ~γ_ij (including how the MNIST mixture profiles are constructed) are not specified in sufficient detail to reproduce the reported double-descent curves or to verify the heterogeneous effects.

minor comments (1)

Notation for the random feature matrix and the ridge parameter could be clarified with explicit definitions early in the text to aid readability.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for their careful reading of the manuscript and for the constructive comments. We address each major comment below and indicate the changes we will make in the revised version.

Referee: [Abstract and main derivation sections] The central claim is the derivation of asymptotic equivalents, yet the manuscript provides only high-level descriptions of the linear-plus-chaos plus traffic-probability steps and the amalgamation-over-the-diagonal argument without the full expansion of the resulting equations or the explicit form of the deterministic equivalents (e.g., the resolvent expressions or fixed-point equations). This is load-bearing for the main contribution.

Authors: We agree that the current presentation of the derivations is somewhat high-level. In the revised manuscript we will expand the relevant sections to include the full expansion of the linear-plus-chaos plus traffic-probability steps and the explicit resolvent and fixed-point equations arising from the amalgamation-over-the-diagonal argument. These details will be placed in the main text where space permits or moved to a dedicated appendix so that the deterministic equivalents are stated completely. revision: yes
Referee: [Numerical experiments section] The claim that 'these equivalents are sharp in numerical experiments' is made without accompanying error bounds, convergence rates, or quantitative measures of approximation error (e.g., relative deviation as n,p,m → ∞). This undermines assessment of the regime's validity for the stated proportional growth.

Authors: We accept that quantitative measures of approximation quality are currently missing. In the revision we will add plots and tables that report the relative deviation between the empirical risks and the asymptotic equivalents across increasing values of n, p, m (with n/p and m/p fixed at the reported ratios). Where feasible we will also include empirical convergence rates and simple error bounds derived from the concentration arguments already used in the proofs. revision: yes
Referee: [Numerical experiments section] Data-generation procedures for the variance profiles γ_ij and ~γ_ij (including how the MNIST mixture profiles are constructed) are not specified in sufficient detail to reproduce the reported double-descent curves or to verify the heterogeneous effects.

Authors: We agree that the data-generation details are insufficient for reproducibility. The revised manuscript will contain an expanded subsection that fully specifies the construction of the diagonal variance profiles γ_ij and ~γ_ij, including the exact parameter values, the mixture weights, and the procedure used to generate the MNIST-inspired heterogeneous profiles. revision: yes

Circularity Check

0 steps flagged

No significant circularity; derivation applies external RMT tools to variance-profile model

The central derivation combines linear-plus-chaos approximation with traffic-probability arguments and operator-valued free probability (amalgamation-over-the-diagonal) to obtain asymptotic equivalents for ridge RF risks under the row-dependent variance-profile model. These are standard, externally established techniques from random matrix theory and free probability, not reductions of the target risks to quantities defined or fitted from the same data. No self-definitional steps, fitted-input predictions, load-bearing self-citations, or smuggled ansatzes are present in the described chain. The result remains independent of its inputs.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

Abstract provides only the high-level model and regime; no explicit free parameters, invented entities, or additional axioms are stated beyond the proportional growth assumption.

axioms (1)

domain assumption: n, p, and m grow proportionally.
Required for the asymptotic equivalents; stated directly in the abstract.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear

Our main contribution is the derivation of asymptotic equivalents for the training and test risks of ridge regression with random features when n, p, and m grow proportionally... using the linear-plus-chaos approximation with traffic-probability arguments... operator-valued free probability through an amalgamation-over-the-diagonal argument.

IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · unclear

variance-profile model... Σ_i = diag(γ_i1², …, γ_ip²)

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.