arxiv: 2601.18546 · v2 · submitted 2026-01-26 · 💻 cs.LG

Information Hidden in Gradients of Regression with Target Noise

Pith reviewed 2026-05-16 10:49 UTC · model grok-4.3

classification 💻 cs.LG
keywords: gradient covariance, Hessian approximation, linear regression, target noise injection, second-order optimization, data covariance

The pith

Calibrating target noise variance to batch size lets gradient covariance recover the data covariance matrix in linear regression.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper demonstrates that gradients alone contain second-order information about the data. By adding Gaussian noise to the regression targets such that the total noise variance matches the batch size, the covariance of the resulting gradients approximates the Hessian of the loss, which for linear regression equals the input covariance matrix. This approximation holds even when the model parameters are far from the optimal values. The result provides a practical way to access curvature information without computing second derivatives directly.

Core claim

Injecting Gaussian noise into the targets so that the total target noise variance equals the batch size ensures that the empirical gradient covariance closely approximates the Hessian, equaling the data covariance Σ for linear regression, with non-asymptotic operator-norm guarantees under sub-Gaussian inputs.
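The claim can be checked numerically in a few lines. The sketch below is a minimal illustration, not the paper's code: it assumes Gaussian inputs, a batch loss normalized as (1/(2b)) Σ (xᵢᵀw − yᵢ)², and a hypothetical helper `gradient_covariance`; the specific `Sigma`, `beta`, `w`, and batch size are made up for the example.

```python
import numpy as np


def gradient_covariance(Sigma, w, beta, batch_size, total_noise_var,
                        num_batches=4000, seed=0):
    """Empirical covariance of mini-batch least-squares gradients.

    Assumptions (illustrative only): inputs x ~ N(0, Sigma), targets
    y = x^T beta + eps with Var(eps) = total_noise_var, and batch loss
    (1 / (2 b)) * sum_i (x_i^T w - y_i)^2.
    """
    rng = np.random.default_rng(seed)
    d = Sigma.shape[0]
    chol = np.linalg.cholesky(Sigma)
    grads = np.empty((num_batches, d))
    for k in range(num_batches):
        X = rng.standard_normal((batch_size, d)) @ chol.T  # rows ~ N(0, Sigma)
        eps = np.sqrt(total_noise_var) * rng.standard_normal(batch_size)
        y = X @ beta + eps
        grads[k] = X.T @ (X @ w - y) / batch_size  # gradient of the batch loss
    return np.cov(grads, rowvar=False)


Sigma = np.diag([0.5, 1.0, 1.5, 2.0, 2.5])
beta = np.ones(5)
w = np.zeros(5)  # deliberately not at the optimum
b = 256

# Calibration rule: total target noise variance equals the batch size.
G = gradient_covariance(Sigma, w, beta, batch_size=b, total_noise_var=b)
rel_err = np.linalg.norm(G - Sigma, ord=2) / np.linalg.norm(Sigma, ord=2)
print(f"relative operator-norm error: {rel_err:.3f}")
```

The error reported here mixes sampling error from the finite number of batches with a residual term that depends on how far `w` is from `beta`, so it should shrink as `num_batches` and `b` grow.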

What carries the argument

The variance calibration rule: Gaussian noise is injected into the targets so that the total target noise variance equals the batch size, which lets the empirical covariance of the gradients estimate the Hessian.

If this is right

  • Preconditioning optimization steps using the recovered covariance for faster convergence.
  • Estimating adversarial risk from gradients only.
  • Enabling gradient-only training in distributed or black-box settings.
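To make the preconditioning item concrete, here is a minimal sketch of gradient descent preconditioned by a covariance estimate. Everything in it is an assumption for illustration: `preconditioned_gd` is a hypothetical helper, `Sigma_hat` is a stand-in for a covariance recovered from gradients (here simply a 10% mis-scaled copy of the true matrix), and the gradient is the population gradient Σ(w − β) rather than a noisy mini-batch one.

```python
import numpy as np


def preconditioned_gd(grad_fn, w0, precond, lr=1.0, steps=10, ridge=1e-8):
    """Iterate w <- w - lr * (precond + ridge * I)^{-1} grad_fn(w)."""
    d = w0.shape[0]
    P = precond + ridge * np.eye(d)  # small ridge keeps the solve well-posed
    w = w0.astype(float).copy()
    for _ in range(steps):
        w = w - lr * np.linalg.solve(P, grad_fn(w))
    return w


# Ill-conditioned toy problem (condition number 100).
Sigma = np.diag([100.0, 1.0])
beta = np.array([1.0, -1.0])


def grad_fn(w):
    # Population gradient of the least-squares loss.
    return Sigma @ (w - beta)


# Hypothetical stand-in for a covariance estimate recovered from gradients.
Sigma_hat = 1.1 * Sigma

w_pre = preconditioned_gd(grad_fn, np.zeros(2), Sigma_hat)
w_plain = preconditioned_gd(grad_fn, np.zeros(2), np.eye(2), lr=1.0 / 100.0)

err_pre = np.linalg.norm(w_pre - beta)
err_plain = np.linalg.norm(w_plain - beta)
print(f"preconditioned error: {err_pre:.2e}, plain GD error: {err_plain:.2e}")
```

Plain gradient descent has to use a step size limited by the largest eigenvalue, so the low-curvature direction barely moves in ten steps; the preconditioned update rescales each direction and does not have that bottleneck.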

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Similar calibration might allow Hessian recovery in non-linear models by local linearization.
  • The method could reduce communication costs in federated learning by transmitting only gradients.
  • Testing the calibration on real-world datasets with varying batch sizes would confirm robustness.

Load-bearing premise

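The inputs must have sub-Gaussian tails, and the total target noise variance (existing noise plus injected noise) must be calibrated to the batch size for the approximation to hold. The abstract adds that variance $\mathcal{O}(n)$ suffices to recover $\Sigma$ up to scale, so the exact match matters mainly for the scale of the estimate.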

What would settle it

Measure the operator-norm difference between the empirical gradient covariance and the data covariance on synthetic data with sub-Gaussian inputs, once with the total target noise variance equal to the batch size and once without that calibration.
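One way to run this check is sketched below. It is an illustrative setup, not the paper's experiment: Gaussian inputs stand in for sub-Gaussian ones, `relative_recovery_error` is a hypothetical helper, `delta` denotes the offset w − β, and the uncalibrated case simply uses unit noise variance.

```python
import numpy as np


def relative_recovery_error(Sigma, delta, b, total_noise_var,
                            num_batches=4000, seed=0):
    """Relative operator-norm gap between gradient covariance and Sigma.

    delta is the offset w - beta; inputs are drawn as x ~ N(0, Sigma)
    (an assumption made for this sketch).
    """
    rng = np.random.default_rng(seed)
    d = Sigma.shape[0]
    chol = np.linalg.cholesky(Sigma)
    X = rng.standard_normal((num_batches, b, d)) @ chol.T
    eps = np.sqrt(total_noise_var) * rng.standard_normal((num_batches, b))
    residual = X @ delta - eps  # x_i^T (w - beta) - eps_i
    grads = np.einsum("nbd,nb->nd", X, residual) / b  # one gradient per batch
    G = np.cov(grads, rowvar=False)
    return np.linalg.norm(G - Sigma, ord=2) / np.linalg.norm(Sigma, ord=2)


Sigma = np.diag([0.5, 1.0, 1.5, 2.0, 2.5])
delta = 0.5 * np.ones(5)
b = 64

err_calibrated = relative_recovery_error(Sigma, delta, b, total_noise_var=b)
err_uncalibrated = relative_recovery_error(Sigma, delta, b, total_noise_var=1.0)
print(f"calibrated:   {err_calibrated:.3f}")
print(f"uncalibrated: {err_uncalibrated:.3f}")
```

In this toy setting the uncalibrated covariance is off mostly in scale; a stricter test of the Ω(1) failure would also compare the two matrices after normalizing their scale.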

Original abstract

Second-order information -- such as curvature or data covariance -- is critical for optimisation, diagnostics, and robustness. However, in many modern settings, only the gradients are observable. We show that the gradients alone can reveal the Hessian, equalling the data covariance $\Sigma$ for the linear regression. Our key insight is a simple variance calibration: injecting Gaussian noise so that the total target noise variance equals the batch size ensures that the empirical gradient covariance closely approximates the Hessian, even when evaluated far from the optimum. We provide non-asymptotic operator-norm guarantees under sub-Gaussian inputs. We also show that without such calibration, recovery can fail by an $\Omega(1)$ factor. The proposed method is practical (a "set target-noise variance to $n$" rule) and robust (variance $\mathcal{O}(n)$ suffices to recover $\Sigma$ up to scale). Applications include preconditioning for faster optimisation, adversarial risk estimation, and gradient-only training, for example, in distributed systems. We support our theoretical results with experiments on synthetic and real data.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 2 minor

Summary. The manuscript claims that for linear regression, the covariance of stochastic gradients computed on targets corrupted by calibrated Gaussian noise recovers the input covariance matrix Σ (which equals the Hessian), even far from the optimum. The key step is setting the total target noise variance exactly equal to the batch size b, so that the stochastic component of each gradient has covariance precisely Σ while the bias term drops out of the covariance. Non-asymptotic operator-norm concentration bounds are stated under sub-Gaussian input assumptions, a failure by an Ω(1) factor is shown without the calibration, and a practical rule of setting the noise variance to n (with robustness for O(n) variance) is proposed. Experiments on synthetic and real data are included to support the theory.

Significance. If the stated non-asymptotic bounds hold, the result supplies a concrete, gradient-only route to second-order information that is useful for preconditioning, distributed optimization, and adversarial-risk estimation. The explicit calibration rule, the demonstration that the bias term drops out of the covariance, and the explicit counter-example without calibration are strengths that distinguish the work from heuristic gradient-covariance methods. The sub-Gaussian assumption and batch-size scaling are standard and the approach appears parameter-free once the rule is applied.

major comments (1)
  1. [Theoretical analysis] The central non-asymptotic operator-norm guarantee (stated in the abstract and presumably proved in the main theoretical section) relies on the exact equality between total target noise variance and batch size; the manuscript should explicitly state the dimension dependence and failure-probability terms in the bound so that the practical regime (high-d, moderate b) can be assessed.
minor comments (2)
  1. [Abstract and notation] Notation: the abstract alternates between batch size b and n; a single consistent symbol should be used throughout the calibration rule and all theorems.
  2. [Experiments] Experiments: the real-data section should report the input dimension, chosen batch size, and the precise noise variance used so that the O(n) robustness claim can be directly verified.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the positive assessment and the constructive comment on the theoretical bound. We address the point below.

Point-by-point responses
  1. Referee: [Theoretical analysis] The central non-asymptotic operator-norm guarantee (stated in the abstract and presumably proved in the main theoretical section) relies on the exact equality between total target noise variance and batch size; the manuscript should explicitly state the dimension dependence and failure-probability terms in the bound so that the practical regime (high-d, moderate b) can be assessed.

    Authors: We agree that making the dimension dependence and failure-probability terms explicit will help readers evaluate applicability in high-dimensional settings. The non-asymptotic bound (Theorem 1) takes the form ||G - Σ||_op ≤ C √((d + log(1/δ))/b) with probability at least 1-δ, where C depends on the sub-Gaussian parameter. In the revision we will state this full dependence explicitly in the abstract and restate it clearly in the theorem, together with a short discussion of the regime b ≪ d.

    Revision: yes
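To get a feel for the regimes the referee asks about, the rate term in the bound quoted in the rebuttal can be evaluated directly. The sketch below is only arithmetic on that quoted form: the constant C is unknown and left out, `rate_term` is a hypothetical helper name, and the (d, b) pairs are arbitrary examples.

```python
import math


def rate_term(d, b, delta):
    """sqrt((d + log(1/delta)) / b), i.e. the quoted bound without the constant C."""
    if not (0.0 < delta < 1.0):
        raise ValueError("delta must lie in (0, 1)")
    if d <= 0 or b <= 0:
        raise ValueError("d and b must be positive")
    return math.sqrt((d + math.log(1.0 / delta)) / b)


# Arbitrary example regimes: low-d / large-b versus high-d / moderate-b.
for d, b in [(10, 1000), (100, 1000), (1000, 100)]:
    print(f"d={d:5d}  b={b:5d}  rate={rate_term(d, b, delta=0.05):.3f}")
```

A rate term above 1 means the quoted form gives no useful guarantee at that (d, b) unless C is small, which is the situation the b ≪ d discussion would need to address.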

Circularity Check

0 steps flagged

No significant circularity in the derivation chain

Full rationale

The paper derives the result via explicit noise injection: total target variance is set exactly to batch size b, making the stochastic component of each gradient have covariance precisely equal to the data covariance Σ (which equals the Hessian for linear regression). The deterministic bias term Σ(w − β) cancels in the covariance, yielding the approximation even far from the optimum. This is a direct algebraic identity followed by non-asymptotic operator-norm concentration under sub-Gaussian inputs; the paper explicitly shows an Ω(1) error without the calibration. No parameter is fitted and then renamed as a prediction, no self-citation is load-bearing for the central step, and no ansatz or uniqueness theorem is smuggled in. The derivation is therefore self-contained against external benchmarks.
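The algebraic step described above can be written out as follows. This is a sketch under stated assumptions rather than the paper's own derivation: the batch loss is taken to be normalized by 1/(2b), Σ denotes E[x xᵀ], the noise εᵢ is zero-mean with variance σ² and independent of the inputs, and the symbols g, Σ̂_b, and σ² are notation introduced here.

```latex
\begin{aligned}
g(w) &= \frac{1}{b}\sum_{i=1}^{b} x_i\,\bigl(x_i^\top w - y_i\bigr),
\qquad y_i = x_i^\top \beta + \varepsilon_i,
\quad \operatorname{Var}(\varepsilon_i) = \sigma^2, \\[4pt]
g(w) &= \widehat{\Sigma}_b\,(w - \beta) \;-\; \frac{1}{b}\sum_{i=1}^{b} x_i\,\varepsilon_i,
\qquad \widehat{\Sigma}_b = \frac{1}{b}\sum_{i=1}^{b} x_i x_i^\top, \\[4pt]
\mathbb{E}\bigl[g(w)\bigr] &= \Sigma\,(w - \beta), \\[4pt]
\operatorname{Cov}\!\Bigl(\frac{1}{b}\sum_{i=1}^{b} x_i\,\varepsilon_i\Bigr)
&= \frac{\sigma^2}{b}\,\Sigma
\;\overset{\sigma^2 = b}{=}\; \Sigma .
\end{aligned}
```

The mean Σ(w − β) is removed when the covariance is taken, and the cross term between the two parts of g(w) vanishes because the noise is zero-mean and independent of the inputs. The fluctuation of Σ̂_b(w − β) around its mean still contributes a term of order ‖w − β‖²/b (assuming finite fourth moments), which this sketch does not bound.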

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the sub-Gaussian input assumption for the non-asymptotic bounds and on the ability to inject noise with exact variance equal to batch size; no free parameters are fitted and no new entities are introduced.

axioms (1)
  • Domain assumption: inputs are sub-Gaussian.
    Invoked to obtain non-asymptotic operator-norm guarantees on the gradient-covariance estimator.
