Information Hidden in Gradients of Regression with Target Noise
Pith reviewed 2026-05-16 10:49 UTC · model grok-4.3
The pith
Calibrating target noise variance to batch size lets gradient covariance recover the data covariance matrix in linear regression.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Injecting Gaussian noise into the targets so that the total target noise variance equals the batch size ensures that the empirical gradient covariance closely approximates the Hessian, equaling the data covariance Σ for linear regression, with non-asymptotic operator-norm guarantees under sub-Gaussian inputs.
What carries the argument
the variance calibration rule that sets injected Gaussian noise variance to equal the batch size, allowing the empirical covariance of gradients to estimate the Hessian.
If this is right
- Preconditioning optimization steps using the recovered covariance for faster convergence.
- Estimating adversarial risk from gradients only.
- Enabling gradient-only training in distributed or black-box settings.
Where Pith is reading between the lines
- Similar calibration might allow Hessian recovery in non-linear models by local linearization.
- The method could reduce communication costs in federated learning by transmitting only gradients.
- Testing the calibration on real-world datasets with varying batch sizes would confirm robustness.
Load-bearing premise
The inputs must have sub-Gaussian tails and the added noise variance must be set exactly equal to the batch size for the approximation to hold.
What would settle it
Compute the difference between gradient covariance and data covariance on synthetic data with sub-Gaussian inputs when noise variance equals batch size versus when it does not.
read the original abstract
Second-order information -- such as curvature or data covariance -- is critical for optimisation, diagnostics, and robustness. However, in many modern settings, only the gradients are observable. We show that the gradients alone can reveal the Hessian, equalling the data covariance $\Sigma$ for the linear regression. Our key insight is a simple variance calibration: injecting Gaussian noise so that the total target noise variance equals the batch size ensures that the empirical gradient covariance closely approximates the Hessian, even when evaluated far from the optimum. We provide non-asymptotic operator-norm guarantees under sub-Gaussian inputs. We also show that without such calibration, recovery can fail by an $\Omega(1)$ factor. The proposed method is practical (a "set target-noise variance to $n$" rule) and robust (variance $\mathcal{O}(n)$ suffices to recover $\Sigma$ up to scale). Applications include preconditioning for faster optimisation, adversarial risk estimation, and gradient-only training, for example, in distributed systems. We support our theoretical results with experiments on synthetic and real data.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript claims that for linear regression, the covariance of stochastic gradients with respect to targets corrupted by calibrated Gaussian noise recovers the input covariance matrix Σ (which equals the Hessian) even far from the optimum. The key step is setting the total target noise variance exactly equal to the batch size b so that the stochastic component of each gradient has covariance precisely Σ while the bias term vanishes in the covariance; non-asymptotic operator-norm concentration bounds are stated under sub-Gaussian input assumptions, an Ω(1) scaling failure is shown without the calibration, and a practical rule of setting noise variance to n (with robustness for O(n) variance) is proposed. Experiments on synthetic and real data are included to support the theory.
Significance. If the stated non-asymptotic bounds hold, the result supplies a concrete, gradient-only route to second-order information that is useful for preconditioning, distributed optimization, and adversarial-risk estimation. The explicit calibration rule, the demonstration that the bias term drops out of the covariance, and the explicit counter-example without calibration are strengths that distinguish the work from heuristic gradient-covariance methods. The sub-Gaussian assumption and batch-size scaling are standard and the approach appears parameter-free once the rule is applied.
major comments (1)
- [Theoretical analysis] The central non-asymptotic operator-norm guarantee (stated in the abstract and presumably proved in the main theoretical section) relies on the exact equality between total target noise variance and batch size; the manuscript should explicitly state the dimension dependence and failure-probability terms in the bound so that the practical regime (high-d, moderate b) can be assessed.
minor comments (2)
- [Abstract and notation] Notation: the abstract alternates between batch size b and n; a single consistent symbol should be used throughout the calibration rule and all theorems.
- [Experiments] Experiments: the real-data section should report the input dimension, chosen batch size, and the precise noise variance used so that the O(n) robustness claim can be directly verified.
Simulated Author's Rebuttal
We thank the referee for the positive assessment and the constructive comment on the theoretical bound. We address the point below.
read point-by-point responses
-
Referee: [Theoretical analysis] The central non-asymptotic operator-norm guarantee (stated in the abstract and presumably proved in the main theoretical section) relies on the exact equality between total target noise variance and batch size; the manuscript should explicitly state the dimension dependence and failure-probability terms in the bound so that the practical regime (high-d, moderate b) can be assessed.
Authors: We agree that making the dimension dependence and failure-probability terms explicit will help readers evaluate applicability in high-dimensional settings. The non-asymptotic bound (Theorem 1) takes the form ||G - Σ||_op ≤ C √((d + log(1/δ))/b) with probability at least 1-δ, where C depends on the sub-Gaussian parameter. In the revision we will state this full dependence explicitly in the abstract and restate it clearly in the theorem, together with a short discussion of the regime b ≪ d. revision: yes
Circularity Check
No significant circularity in the derivation chain
full rationale
The paper derives the result via explicit noise injection: total target variance is set exactly to batch size b, making the stochastic component of each gradient have covariance precisely equal to the data covariance Σ (which equals the Hessian for linear regression). The deterministic bias term Σ(w − β) cancels in the covariance, yielding the approximation even far from the optimum. This is a direct algebraic identity followed by non-asymptotic operator-norm concentration under sub-Gaussian inputs; the paper explicitly shows an Ω(1) error without the calibration. No parameter is fitted and then renamed as a prediction, no self-citation is load-bearing for the central step, and no ansatz or uniqueness theorem is smuggled in. The derivation is therefore self-contained against external benchmarks.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption Inputs are sub-Gaussian
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.