Linear Regression with Unknown Truncation Beyond Gaussian Features

Alexandros Kouridakis; Alkis Kalavasis; Anay Mehrotra; Constantine Caramanis

arxiv: 2602.12534 · v2 · pith:UGGZRSDFnew · submitted 2026-02-13 · 📊 stat.ML · cs.DS· cs.LG· math.ST· stat.TH

Linear Regression with Unknown Truncation Beyond Gaussian Features

Alexandros Kouridakis , Anay Mehrotra , Alkis Kalavasis , Constantine Caramanis This is my paper

Pith reviewed 2026-05-25 07:27 UTC · model grok-4.3

classification 📊 stat.ML cs.DScs.LGmath.STstat.TH

keywords truncated linear regressionunknown survival setsub-Gaussian featurespositive-only learningunions of intervalsPAC learningcensored regression

0 comments

The pith

An algorithm recovers the linear regressor from samples truncated by an unknown survival set in polynomial time under only a sub-Gaussian assumption on the features.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Truncated linear regression seeks to estimate an unknown d-dimensional vector w* from pairs (x, y) that are observed only when y lies inside an unknown survival set S*. Earlier methods either assumed S* is given in advance or required Gaussian features and incurred runtime exponential in d or in 1/ε. The paper supplies the first algorithm that achieves ε-accurate recovery of w* in poly(d/ε) time while needing merely that the feature vectors are sub-Gaussian. The method rests on a new subroutine that learns a union of a bounded number of intervals from positive examples alone, provided those examples satisfy an auxiliary smoothness condition. This subroutine is presented as a contribution that may be useful beyond the regression setting.

Core claim

The paper gives a polynomial-time algorithm for truncated linear regression with unknown survival set S* that recovers the regressor w* to additive error ε whenever the feature vectors are sub-Gaussian and the positive examples obey a smoothness condition sufficient to learn unions of intervals from positive data only.

What carries the argument

A subroutine that learns unions of a bounded number of intervals from positive examples alone under a smoothness condition on the distribution of positive data.

If this is right

Estimation of w* becomes possible even when the truncation boundaries are completely unknown.
The algorithm applies to the larger class of sub-Gaussian feature distributions rather than requiring exact Gaussianity.
Runtime scales polynomially with dimension d and inverse accuracy 1/ε.
The positive-only interval-learning subroutine can be invoked in other settings that supply only positive examples.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same positive-only learning technique could be applied to other censored or truncated statistical problems where the censoring mechanism must be recovered from observed data alone.
If real-world truncated data often satisfy the required smoothness, the result immediately yields practical estimators for survival analysis and selection-bias settings.
Extending the approach to feature distributions with only polynomial-moment tails would require new concentration arguments but would further broaden applicability.

Load-bearing premise

The positive examples must satisfy a smoothness condition that makes it possible to learn the unknown survival set as a union of intervals using only positive data.

What would settle it

An explicit family of sub-Gaussian feature distributions and smooth positive distributions on which no algorithm (or this specific algorithm) recovers w* to error ε in poly(d/ε) time would falsify the claim.

Figures

Figures reproduced from arXiv: 2602.12534 by Alexandros Kouridakis, Alkis Kalavasis, Anay Mehrotra, Constantine Caramanis.

**Figure 2.** Figure 2: Example of Algorithm 3 in action, for learning a union of 𝑘 = 2 intervals with error 𝜀 = 0.2. The points above represent samples on the real line. The red diamonds are positive samples drawn from the distribution D★ + , while the black dots are samples from the distribution D which is smooth w.r.t. the target D★. The algorithm considers the intervals defined by consecutive red points, and discards the 𝑘−1 … view at source ↗

**Figure 3.** Figure 3: Illustrations for Example A.2. A multi-fiber spectroscope from the APOGEE survey, where the requirement of a minimum distance between fibers creates spatially dependent selection effects. Image credit: [SDS22c]. B Further Consequences of Main Result (Theorem 3.1) In this section, we present some additional consequences of Theorem 3.1 in special cases where the covariates are Gaussian (Appendix B.1) and whe… view at source ↗

read the original abstract

In truncated linear regression, samples $(x,y)$ are shown only when the outcome $y$ falls inside a certain survival set $S^\star$ and the goal is to estimate the unknown $d$-dimensional regressor $w^\star$. This problem has a long history of study in Statistics and Machine Learning going back to the works of (Galton, 1897; Tobin, 1958) and more recently in, e.g., (Daskalakis et al., 2019; 2021; Lee et al., 2023; 2024). Despite this long history, however, most prior works are limited to the special case where $S^\star$ is precisely known. The more practically relevant case, where $S^\star$ is unknown and must be learned from data, remains open: indeed, here the only available algorithms require strong assumptions on the distribution of the feature vectors (e.g., Gaussianity) and, even then, have a $d^{\mathrm{poly} (1/\varepsilon)}$ run time for achieving $\varepsilon$ accuracy. In this work, we give the first algorithm for truncated linear regression with unknown survival set that runs in $\mathrm{poly} (d/\varepsilon)$ time, by only requiring that the feature vectors are sub-Gaussian. Our algorithm relies on a novel subroutine for efficiently learning unions of a bounded number of intervals using access to positive examples (without any negative examples) under a certain smoothness condition. This learning guarantee adds to the line of works on positive-only PAC learning and may be of independent interest.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

Claims first poly(d/ε) algorithm for unknown-truncation linear regression under sub-Gaussian features via new positive-only interval learner, but key smoothness derivation needs checking.

read the letter

The paper's main result is an algorithm that recovers the regressor in poly(d/ε) time for truncated linear regression when the survival set is unknown, assuming only sub-Gaussian features. This stands out because earlier methods needed Gaussianity and ran in d to some power of 1/ε. The abstract positions it clearly as the first such result. It does a good job framing the problem as practically relevant, going back to Galton and Tobin, and introducing a subroutine for learning bounded unions of intervals from positive examples alone under a smoothness condition. That subroutine might have uses in other positive-unlabeled or censored settings, adding to PAC learning literature. The potential issue is whether that smoothness condition actually follows from sub-Gaussian features and the linear model without extra assumptions. The abstract does not show the derivation, and if the paper inserts something like log-concavity on the noise or bounded density, then the claim of 'only sub-Gaussian' does not fully hold up. The stress-test concern lands until the full proof is examined. Since the full text is available but the provided info is abstract-heavy, the central claim's soundness is hard to judge without the details. The math and citations look standard for this area, with proper references to Daskalakis et al. and Lee et al. No obvious circularity in the abstract. The positive-only subroutine is presented as new and independent. This is for people working on robust regression or learning from censored data. It could be worth a reading group if the proofs check out, as the idea of reducing to interval learning is clever. A serious editor should send it to referees to verify the reduction and runtime claims, even if revisions are needed on the assumptions.

Referee Report

2 major / 0 minor

Summary. The manuscript claims to give the first poly(d/ε)-time algorithm for truncated linear regression with unknown survival set S*, under the assumption that feature vectors are sub-Gaussian (relaxing prior Gaussianity requirements), by reducing to a novel subroutine that learns a bounded union of intervals from positive examples only, under an additional smoothness condition on the observed positive distribution.

Significance. If the central claims hold, the result would be significant: it improves runtime from d^{poly(1/ε)} to poly(d/ε) while weakening the feature distribution assumption, and the positive-only union-of-intervals learner may be of independent interest for other PAC learning settings.

major comments (2)

[Abstract] Abstract: the claim that the algorithm requires 'only' sub-Gaussian features is undercut by the reliance on a 'certain smoothness condition' for the positive-only subroutine; the manuscript must explicitly state whether this smoothness is derived from sub-Gaussianity of x and the linear model, or is an additional assumption on the conditional distribution induced by unknown truncation.
[Abstract] The reduction from truncated regression to the union-of-intervals learner is load-bearing for the poly(d/ε) runtime claim, yet the abstract supplies no indication that the required smoothness on positive examples follows automatically from sub-Gaussian features; if the derivation inserts an extra assumption (e.g., log-concavity or bounded density on the projection), the stated hypothesis class is not covered.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the careful reading and constructive comments on the abstract. We address each point below and will revise the abstract for clarity.

read point-by-point responses

Referee: [Abstract] Abstract: the claim that the algorithm requires 'only' sub-Gaussian features is undercut by the reliance on a 'certain smoothness condition' for the positive-only subroutine; the manuscript must explicitly state whether this smoothness is derived from sub-Gaussianity of x and the linear model, or is an additional assumption on the conditional distribution induced by unknown truncation.

Authors: We agree the abstract should be explicit on this point. In the full paper, the required smoothness on the positive distribution is derived directly from sub-Gaussianity of the features together with the linear model and the definition of the unknown truncation (see the analysis leading to the positive-only learner in Sections 3 and 4). No additional assumption on the conditional distribution is introduced. We will revise the abstract to state this derivation explicitly. revision: yes
Referee: [Abstract] The reduction from truncated regression to the union-of-intervals learner is load-bearing for the poly(d/ε) runtime claim, yet the abstract supplies no indication that the required smoothness on positive examples follows automatically from sub-Gaussian features; if the derivation inserts an extra assumption (e.g., log-concavity or bounded density on the projection), the stated hypothesis class is not covered.

Authors: The reduction does not insert extra assumptions. The smoothness condition on positive examples is shown to follow automatically from sub-Gaussian features, the linear model, and the truncation (via the explicit reduction and the properties of the induced marginal in the positive-only subroutine). The hypothesis class remains exactly the sub-Gaussian features case claimed. We will update the abstract to indicate that the smoothness follows from the stated assumptions. revision: yes

Circularity Check

0 steps flagged

No significant circularity; derivation self-contained with novel subroutine

full rationale

The abstract describes a poly(d/ε) algorithm for truncated linear regression under sub-Gaussian features that relies on a novel positive-only subroutine for learning bounded unions of intervals under an explicit smoothness condition on positive examples. No equations, fitted parameters renamed as predictions, or self-citation chains are present in the provided text that would reduce the central claim to its inputs by construction. The subroutine is stated to be of independent interest and is not defined in terms of the regression estimator. The smoothness condition is presented as an enabling assumption rather than derived from the model within the visible material, but this does not constitute circularity under the specified criteria. The paper's derivation chain is therefore self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

Ledger populated from abstract claims only; sub-Gaussianity and smoothness are domain assumptions required for the runtime guarantee. No free parameters or invented entities are mentioned.

axioms (2)

domain assumption Feature vectors are sub-Gaussian
Explicitly required for the poly(d/ε) runtime in the abstract.
domain assumption Positive examples satisfy a smoothness condition enabling efficient learning of bounded unions of intervals
Stated as necessary for the novel subroutine in the abstract.

pith-pipeline@v0.9.0 · 5844 in / 1211 out tokens · 20471 ms · 2026-05-25T07:27:10.996528+00:00 · methodology

discussion (0)

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/AbsoluteFloorClosure.lean reality_from_one_distinction unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

Our algorithm relies on a novel subroutine for efficiently learning unions of a bounded number of intervals using access to positive examples (without any negative examples) under a certain smoothness condition.
IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

Theorem 3.3 (Positive PAC Learning unions of intervals under Smoothness)

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

15 extracted references · 15 canonical work pages

[1]

R., et al., 2017, @doi [ ] 10.3847/1538-3881/aa784d , https://ui.adsabs.harvard.edu/abs/2017AJ....154...94M 154, 94

url: https://openreview.net/forum?id=PxcWJqO3qj (cit. on pp. 1, 3–5, 7). [Mad83] Gangadharrao S. Maddala.Limited-Dependent and Qualitative Variables in Econometrics. Econo- metric Society Monographs. Cambridge University Press, 1983.isbn: 9780521338257. url: https://books.google.com/books?id=-Ji1ZaUg7gcC (cit. on pp. 3, 7). [Mal22] Karl Gunnar Malmquist. ...

work page doi:10.3847/1538-3881/aa784d 1983
[2]

TheApachePointObservatoryGalacticEvolutionExperiment(APOGEE)Spectrographs

issn: 0001-0782.doi: 10.1145/1968.1972. url: https://doi.org/10.1145/1968.1972 (cit. on p. 7). [Ver18] Roman Vershynin.High-Dimensional Probability: An Introduction with Applications in Data Science. Cambridge Series in Statistical and Probabilistic Mathematics. Cambridge University Press,2018. isbn:9781108415194. url: https://books.google.com/books?id=ND...

work page doi:10.1145/1968.1972 1968
[3]

Surveys often partition targets into cohorts with different observation priorities, leading to truncation of some cohorts

Sampling-dependent biasesarise from target truncation strategies. Surveys often partition targets into cohorts with different observation priorities, leading to truncation of some cohorts. For example, APOGEE stratifies targets by apparent brightness (H-band magnitude) and preferentially targets certain strata. Importantly, cohort definitions can depend o...

work page
[4]

exclusion zone

Physical limitationsof instruments impose unavoidable constraints. Beyond flux limits, atmospheric absorption can prevent observation of certain objects, and instrument design can induce truncation over sky position. For instance, APOGEE uses multi-fiber spectrographs that route light from many targets through optical fibers to a separate spectrographforh...

work page
[5]

parent catalog

Cascading effectspropagate biases from upstream catalogs to downstream surveys. Many surveys begin with a “parent catalog” and inherit its selection criteria. For example, galaxy surveys that start from the Uppsala General Catalogue inherit criteria that omit small angular-size galaxies unless they exceed brightness thresholds [FCWG+95]. These inherited f...

work page
[6]

On the one hand, they consider truncation jointly in𝑥 and 𝑦, which is more general than the model we study, where truncation happens on𝑦 independent of𝑥

work page
[7]

On the other hand, they require the distribution of𝑥 to be Gaussian, where as we do not require any such parameteric restriction. That said, since we allowD to be non-Gaussian, we can capture an important special case of their model where the (high-dimensional) survival set is of the form𝑆★ 𝑋 × 𝑆★ 𝑌 for 𝑆★ 𝑋 ⊆ R𝑑 and 𝑆★ 𝑌 ⊆ R. Indeed, in this case, we can...

work page
[8]

closeness

Hence, bothΛ and 𝑀 in Assumption 4 can be substituted by𝑂(𝜎 + ∥𝜇∥2). The reason why we choose to explicitly defineΛ and 𝑀 is because (i) they allow us to more concisely present many of our results and calculations in the proofs, and (ii) some of our results do not require the full power of sub-Gaussianity, but continue to hold under only the moment bounds...

work page
[9]

Note that𝑃 depends only on𝑆 (but not𝑤 )

𝑃 be a distribution over(𝑥, 𝑦) ∈ R𝑑 × R such that the marginal of𝑥 is Dobs and the marginal of 𝑦 given 𝑥 is N (⟨𝑤 ★, 𝑥⟩ , 1, 𝑆★ ∩ 𝑆). Note that𝑃 depends only on𝑆 (but not𝑤 )

work page
[10]

Note that𝑄 depends both on𝑆 and 𝑤

𝑄 be a distribution over(𝑥, 𝑧) ∈ R𝑑 × Rsuch that the marginal of𝑥 is Dobs and the marginal of 𝑧 given 𝑥 is N (⟨𝑤 , 𝑥⟩ , 1, 𝑆). Note that𝑄 depends both on𝑆 and 𝑤. If we could sample from𝑃 and 𝑄, we would have an exact sampler for the desired gradient and we would be done. Our sampler samples from two other distributions𝑃′, 𝑄′ that serve as proxies to 𝑃, 𝑄....

work page
[11]

We prove that the total variation distance between𝑃′ and 𝑃 is small by using the fact that𝑆 is a good approximation of𝑆★

𝑃′ be the distribution on( ˆ𝑥, ˆ𝑦), which corresponds to sampling from the true model with parameters 𝑤 ★, 𝑆★. We prove that the total variation distance between𝑃′ and 𝑃 is small by using the fact that𝑆 is a good approximation of𝑆★

work page
[12]

Weprovethat 𝑄′ iscloseintotalvariationdistanceto 𝑄,byrelying on the guarantees of Lemma D.12

𝑄′bethedistributionon (e𝑥,e𝑧),whichcorrespondstodrawingfirst 𝑥 ∼ Dobs andthendrawing the outcome from the Gaussian with mean⟨𝑤 , 𝑥⟩ and variance 1 conditional on𝑆, using the algorithmofLemmaD.12. Weprovethat 𝑄′ iscloseintotalvariationdistanceto 𝑄,byrelying on the guarantees of Lemma D.12. These two results (see Lemma D.27) allow us to control the bias of ...

work page
[13]

Strong Convexity: 𝑓 is 𝜅-strongly convex overP

work page
[14]

Bounded Gradient Second Moment: E h 𝑔(𝑤 (𝑡)) 2 2 𝑤 (𝑡−1) i ≤ 𝜇2

work page
[15]

Bounded Gradient Bias: E 𝑔(𝑤 (𝑡)) 𝑤 (𝑡−1) − ∇ℒ𝑆(𝑤 (𝑡)) 2 ≤ 𝛽 Then, the average iterate¯𝑤 B 1 𝑇 P 𝑡∈[𝑇] 𝑤 (𝑡) satisfies E [ 𝑓 ( ¯𝑤 )] − 𝑓 (𝑤 min) ≤ 𝜇2 𝜅𝑇 (1 + log 𝑇) + 𝛽𝐷 where 𝑤 min B arg min𝑤 ∈P 𝑓 (𝑤 ). Using this convergence guarantee and our above analysis, it is straightforward to show that PSGD converges to the minimizer of the perturbed log-likeliho...

work page

[1] [1]

R., et al., 2017, @doi [ ] 10.3847/1538-3881/aa784d , https://ui.adsabs.harvard.edu/abs/2017AJ....154...94M 154, 94

url: https://openreview.net/forum?id=PxcWJqO3qj (cit. on pp. 1, 3–5, 7). [Mad83] Gangadharrao S. Maddala.Limited-Dependent and Qualitative Variables in Econometrics. Econo- metric Society Monographs. Cambridge University Press, 1983.isbn: 9780521338257. url: https://books.google.com/books?id=-Ji1ZaUg7gcC (cit. on pp. 3, 7). [Mal22] Karl Gunnar Malmquist. ...

work page doi:10.3847/1538-3881/aa784d 1983

[2] [2]

TheApachePointObservatoryGalacticEvolutionExperiment(APOGEE)Spectrographs

issn: 0001-0782.doi: 10.1145/1968.1972. url: https://doi.org/10.1145/1968.1972 (cit. on p. 7). [Ver18] Roman Vershynin.High-Dimensional Probability: An Introduction with Applications in Data Science. Cambridge Series in Statistical and Probabilistic Mathematics. Cambridge University Press,2018. isbn:9781108415194. url: https://books.google.com/books?id=ND...

work page doi:10.1145/1968.1972 1968

[3] [3]

Surveys often partition targets into cohorts with different observation priorities, leading to truncation of some cohorts

Sampling-dependent biasesarise from target truncation strategies. Surveys often partition targets into cohorts with different observation priorities, leading to truncation of some cohorts. For example, APOGEE stratifies targets by apparent brightness (H-band magnitude) and preferentially targets certain strata. Importantly, cohort definitions can depend o...

work page

[4] [4]

exclusion zone

Physical limitationsof instruments impose unavoidable constraints. Beyond flux limits, atmospheric absorption can prevent observation of certain objects, and instrument design can induce truncation over sky position. For instance, APOGEE uses multi-fiber spectrographs that route light from many targets through optical fibers to a separate spectrographforh...

work page

[5] [5]

parent catalog

Cascading effectspropagate biases from upstream catalogs to downstream surveys. Many surveys begin with a “parent catalog” and inherit its selection criteria. For example, galaxy surveys that start from the Uppsala General Catalogue inherit criteria that omit small angular-size galaxies unless they exceed brightness thresholds [FCWG+95]. These inherited f...

work page

[6] [6]

On the one hand, they consider truncation jointly in𝑥 and 𝑦, which is more general than the model we study, where truncation happens on𝑦 independent of𝑥

work page

[7] [7]

On the other hand, they require the distribution of𝑥 to be Gaussian, where as we do not require any such parameteric restriction. That said, since we allowD to be non-Gaussian, we can capture an important special case of their model where the (high-dimensional) survival set is of the form𝑆★ 𝑋 × 𝑆★ 𝑌 for 𝑆★ 𝑋 ⊆ R𝑑 and 𝑆★ 𝑌 ⊆ R. Indeed, in this case, we can...

work page

[8] [8]

closeness

Hence, bothΛ and 𝑀 in Assumption 4 can be substituted by𝑂(𝜎 + ∥𝜇∥2). The reason why we choose to explicitly defineΛ and 𝑀 is because (i) they allow us to more concisely present many of our results and calculations in the proofs, and (ii) some of our results do not require the full power of sub-Gaussianity, but continue to hold under only the moment bounds...

work page

[9] [9]

Note that𝑃 depends only on𝑆 (but not𝑤 )

𝑃 be a distribution over(𝑥, 𝑦) ∈ R𝑑 × R such that the marginal of𝑥 is Dobs and the marginal of 𝑦 given 𝑥 is N (⟨𝑤 ★, 𝑥⟩ , 1, 𝑆★ ∩ 𝑆). Note that𝑃 depends only on𝑆 (but not𝑤 )

work page

[10] [10]

Note that𝑄 depends both on𝑆 and 𝑤

𝑄 be a distribution over(𝑥, 𝑧) ∈ R𝑑 × Rsuch that the marginal of𝑥 is Dobs and the marginal of 𝑧 given 𝑥 is N (⟨𝑤 , 𝑥⟩ , 1, 𝑆). Note that𝑄 depends both on𝑆 and 𝑤. If we could sample from𝑃 and 𝑄, we would have an exact sampler for the desired gradient and we would be done. Our sampler samples from two other distributions𝑃′, 𝑄′ that serve as proxies to 𝑃, 𝑄....

work page

[11] [11]

We prove that the total variation distance between𝑃′ and 𝑃 is small by using the fact that𝑆 is a good approximation of𝑆★

𝑃′ be the distribution on( ˆ𝑥, ˆ𝑦), which corresponds to sampling from the true model with parameters 𝑤 ★, 𝑆★. We prove that the total variation distance between𝑃′ and 𝑃 is small by using the fact that𝑆 is a good approximation of𝑆★

work page

[12] [12]

Weprovethat 𝑄′ iscloseintotalvariationdistanceto 𝑄,byrelying on the guarantees of Lemma D.12

𝑄′bethedistributionon (e𝑥,e𝑧),whichcorrespondstodrawingfirst 𝑥 ∼ Dobs andthendrawing the outcome from the Gaussian with mean⟨𝑤 , 𝑥⟩ and variance 1 conditional on𝑆, using the algorithmofLemmaD.12. Weprovethat 𝑄′ iscloseintotalvariationdistanceto 𝑄,byrelying on the guarantees of Lemma D.12. These two results (see Lemma D.27) allow us to control the bias of ...

work page

[13] [13]

Strong Convexity: 𝑓 is 𝜅-strongly convex overP

work page

[14] [14]

Bounded Gradient Second Moment: E h 𝑔(𝑤 (𝑡)) 2 2 𝑤 (𝑡−1) i ≤ 𝜇2

work page

[15] [15]

Bounded Gradient Bias: E 𝑔(𝑤 (𝑡)) 𝑤 (𝑡−1) − ∇ℒ𝑆(𝑤 (𝑡)) 2 ≤ 𝛽 Then, the average iterate¯𝑤 B 1 𝑇 P 𝑡∈[𝑇] 𝑤 (𝑡) satisfies E [ 𝑓 ( ¯𝑤 )] − 𝑓 (𝑤 min) ≤ 𝜇2 𝜅𝑇 (1 + log 𝑇) + 𝛽𝐷 where 𝑤 min B arg min𝑤 ∈P 𝑓 (𝑤 ). Using this convergence guarantee and our above analysis, it is straightforward to show that PSGD converges to the minimizer of the perturbed log-likeliho...

work page