Linear Regression with Unknown Truncation Beyond Gaussian Features
Pith reviewed 2026-05-25 07:27 UTC · model grok-4.3
The pith
An algorithm recovers the linear regressor from samples truncated by an unknown survival set in polynomial time under only a sub-Gaussian assumption on the features.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The paper gives a polynomial-time algorithm for truncated linear regression with unknown survival set S* that recovers the regressor w* to additive error ε whenever the feature vectors are sub-Gaussian and the positive examples obey a smoothness condition sufficient to learn unions of intervals from positive data only.
What carries the argument
A subroutine that learns unions of a bounded number of intervals from positive examples alone under a smoothness condition on the distribution of positive data.
If this is right
- Estimation of w* becomes possible even when the truncation boundaries are completely unknown.
- The algorithm applies to the larger class of sub-Gaussian feature distributions rather than requiring exact Gaussianity.
- Runtime scales polynomially with dimension d and inverse accuracy 1/ε.
- The positive-only interval-learning subroutine can be invoked in other settings that supply only positive examples.
Where Pith is reading between the lines
- The same positive-only learning technique could be applied to other censored or truncated statistical problems where the censoring mechanism must be recovered from observed data alone.
- If real-world truncated data often satisfy the required smoothness, the result immediately yields practical estimators for survival analysis and selection-bias settings.
- Extending the approach to feature distributions with only polynomial-moment tails would require new concentration arguments but would further broaden applicability.
Load-bearing premise
The positive examples must satisfy a smoothness condition that makes it possible to learn the unknown survival set as a union of intervals using only positive data.
What would settle it
An explicit family of sub-Gaussian feature distributions and smooth positive distributions on which no algorithm (or this specific algorithm) recovers w* to error ε in poly(d/ε) time would falsify the claim.
Figures
read the original abstract
In truncated linear regression, samples $(x,y)$ are shown only when the outcome $y$ falls inside a certain survival set $S^\star$ and the goal is to estimate the unknown $d$-dimensional regressor $w^\star$. This problem has a long history of study in Statistics and Machine Learning going back to the works of (Galton, 1897; Tobin, 1958) and more recently in, e.g., (Daskalakis et al., 2019; 2021; Lee et al., 2023; 2024). Despite this long history, however, most prior works are limited to the special case where $S^\star$ is precisely known. The more practically relevant case, where $S^\star$ is unknown and must be learned from data, remains open: indeed, here the only available algorithms require strong assumptions on the distribution of the feature vectors (e.g., Gaussianity) and, even then, have a $d^{\mathrm{poly} (1/\varepsilon)}$ run time for achieving $\varepsilon$ accuracy. In this work, we give the first algorithm for truncated linear regression with unknown survival set that runs in $\mathrm{poly} (d/\varepsilon)$ time, by only requiring that the feature vectors are sub-Gaussian. Our algorithm relies on a novel subroutine for efficiently learning unions of a bounded number of intervals using access to positive examples (without any negative examples) under a certain smoothness condition. This learning guarantee adds to the line of works on positive-only PAC learning and may be of independent interest.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript claims to give the first poly(d/ε)-time algorithm for truncated linear regression with unknown survival set S*, under the assumption that feature vectors are sub-Gaussian (relaxing prior Gaussianity requirements), by reducing to a novel subroutine that learns a bounded union of intervals from positive examples only, under an additional smoothness condition on the observed positive distribution.
Significance. If the central claims hold, the result would be significant: it improves runtime from d^{poly(1/ε)} to poly(d/ε) while weakening the feature distribution assumption, and the positive-only union-of-intervals learner may be of independent interest for other PAC learning settings.
major comments (2)
- [Abstract] Abstract: the claim that the algorithm requires 'only' sub-Gaussian features is undercut by the reliance on a 'certain smoothness condition' for the positive-only subroutine; the manuscript must explicitly state whether this smoothness is derived from sub-Gaussianity of x and the linear model, or is an additional assumption on the conditional distribution induced by unknown truncation.
- [Abstract] The reduction from truncated regression to the union-of-intervals learner is load-bearing for the poly(d/ε) runtime claim, yet the abstract supplies no indication that the required smoothness on positive examples follows automatically from sub-Gaussian features; if the derivation inserts an extra assumption (e.g., log-concavity or bounded density on the projection), the stated hypothesis class is not covered.
Simulated Author's Rebuttal
We thank the referee for the careful reading and constructive comments on the abstract. We address each point below and will revise the abstract for clarity.
read point-by-point responses
-
Referee: [Abstract] Abstract: the claim that the algorithm requires 'only' sub-Gaussian features is undercut by the reliance on a 'certain smoothness condition' for the positive-only subroutine; the manuscript must explicitly state whether this smoothness is derived from sub-Gaussianity of x and the linear model, or is an additional assumption on the conditional distribution induced by unknown truncation.
Authors: We agree the abstract should be explicit on this point. In the full paper, the required smoothness on the positive distribution is derived directly from sub-Gaussianity of the features together with the linear model and the definition of the unknown truncation (see the analysis leading to the positive-only learner in Sections 3 and 4). No additional assumption on the conditional distribution is introduced. We will revise the abstract to state this derivation explicitly. revision: yes
-
Referee: [Abstract] The reduction from truncated regression to the union-of-intervals learner is load-bearing for the poly(d/ε) runtime claim, yet the abstract supplies no indication that the required smoothness on positive examples follows automatically from sub-Gaussian features; if the derivation inserts an extra assumption (e.g., log-concavity or bounded density on the projection), the stated hypothesis class is not covered.
Authors: The reduction does not insert extra assumptions. The smoothness condition on positive examples is shown to follow automatically from sub-Gaussian features, the linear model, and the truncation (via the explicit reduction and the properties of the induced marginal in the positive-only subroutine). The hypothesis class remains exactly the sub-Gaussian features case claimed. We will update the abstract to indicate that the smoothness follows from the stated assumptions. revision: yes
Circularity Check
No significant circularity; derivation self-contained with novel subroutine
full rationale
The abstract describes a poly(d/ε) algorithm for truncated linear regression under sub-Gaussian features that relies on a novel positive-only subroutine for learning bounded unions of intervals under an explicit smoothness condition on positive examples. No equations, fitted parameters renamed as predictions, or self-citation chains are present in the provided text that would reduce the central claim to its inputs by construction. The subroutine is stated to be of independent interest and is not defined in terms of the regression estimator. The smoothness condition is presented as an enabling assumption rather than derived from the model within the visible material, but this does not constitute circularity under the specified criteria. The paper's derivation chain is therefore self-contained against external benchmarks.
Axiom & Free-Parameter Ledger
axioms (2)
- domain assumption Feature vectors are sub-Gaussian
- domain assumption Positive examples satisfy a smoothness condition enabling efficient learning of bounded unions of intervals
Lean theorems connected to this paper
-
IndisputableMonolith/Foundation/AbsoluteFloorClosure.leanreality_from_one_distinction unclear?
unclearRelation between the paper passage and the cited Recognition theorem.
Our algorithm relies on a novel subroutine for efficiently learning unions of a bounded number of intervals using access to positive examples (without any negative examples) under a certain smoothness condition.
-
IndisputableMonolith/Cost/FunctionalEquation.leanwashburn_uniqueness_aczel unclear?
unclearRelation between the paper passage and the cited Recognition theorem.
Theorem 3.3 (Positive PAC Learning unions of intervals under Smoothness)
What do these tags mean?
- matches
- The paper's claim is directly supported by a theorem in the formal canon.
- supports
- The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends
- The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses
- The paper appears to rely on the theorem as machinery.
- contradicts
- The paper's claim conflicts with a theorem or certificate in the canon.
- unclear
- Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Reference graph
Works this paper leans on
-
[1]
url: https://openreview.net/forum?id=PxcWJqO3qj (cit. on pp. 1, 3–5, 7). [Mad83] Gangadharrao S. Maddala.Limited-Dependent and Qualitative Variables in Econometrics. Econo- metric Society Monographs. Cambridge University Press, 1983.isbn: 9780521338257. url: https://books.google.com/books?id=-Ji1ZaUg7gcC (cit. on pp. 3, 7). [Mal22] Karl Gunnar Malmquist. ...
-
[2]
TheApachePointObservatoryGalacticEvolutionExperiment(APOGEE)Spectrographs
issn: 0001-0782.doi: 10.1145/1968.1972. url: https://doi.org/10.1145/1968.1972 (cit. on p. 7). [Ver18] Roman Vershynin.High-Dimensional Probability: An Introduction with Applications in Data Science. Cambridge Series in Statistical and Probabilistic Mathematics. Cambridge University Press,2018. isbn:9781108415194. url: https://books.google.com/books?id=ND...
-
[3]
Sampling-dependent biasesarise from target truncation strategies. Surveys often partition targets into cohorts with different observation priorities, leading to truncation of some cohorts. For example, APOGEE stratifies targets by apparent brightness (H-band magnitude) and preferentially targets certain strata. Importantly, cohort definitions can depend o...
-
[4]
Physical limitationsof instruments impose unavoidable constraints. Beyond flux limits, atmospheric absorption can prevent observation of certain objects, and instrument design can induce truncation over sky position. For instance, APOGEE uses multi-fiber spectrographs that route light from many targets through optical fibers to a separate spectrographforh...
-
[5]
Cascading effectspropagate biases from upstream catalogs to downstream surveys. Many surveys begin with a “parent catalog” and inherit its selection criteria. For example, galaxy surveys that start from the Uppsala General Catalogue inherit criteria that omit small angular-size galaxies unless they exceed brightness thresholds [FCWG+95]. These inherited f...
-
[6]
On the one hand, they consider truncation jointly in𝑥 and 𝑦, which is more general than the model we study, where truncation happens on𝑦 independent of𝑥
-
[7]
On the other hand, they require the distribution of𝑥 to be Gaussian, where as we do not require any such parameteric restriction. That said, since we allowD to be non-Gaussian, we can capture an important special case of their model where the (high-dimensional) survival set is of the form𝑆★ 𝑋 × 𝑆★ 𝑌 for 𝑆★ 𝑋 ⊆ R𝑑 and 𝑆★ 𝑌 ⊆ R. Indeed, in this case, we can...
-
[8]
Hence, bothΛ and 𝑀 in Assumption 4 can be substituted by𝑂(𝜎 + ∥𝜇∥2). The reason why we choose to explicitly defineΛ and 𝑀 is because (i) they allow us to more concisely present many of our results and calculations in the proofs, and (ii) some of our results do not require the full power of sub-Gaussianity, but continue to hold under only the moment bounds...
-
[9]
Note that𝑃 depends only on𝑆 (but not𝑤 )
𝑃 be a distribution over(𝑥, 𝑦) ∈ R𝑑 × R such that the marginal of𝑥 is Dobs and the marginal of 𝑦 given 𝑥 is N (⟨𝑤 ★, 𝑥⟩ , 1, 𝑆★ ∩ 𝑆). Note that𝑃 depends only on𝑆 (but not𝑤 )
-
[10]
Note that𝑄 depends both on𝑆 and 𝑤
𝑄 be a distribution over(𝑥, 𝑧) ∈ R𝑑 × Rsuch that the marginal of𝑥 is Dobs and the marginal of 𝑧 given 𝑥 is N (⟨𝑤 , 𝑥⟩ , 1, 𝑆). Note that𝑄 depends both on𝑆 and 𝑤. If we could sample from𝑃 and 𝑄, we would have an exact sampler for the desired gradient and we would be done. Our sampler samples from two other distributions𝑃′, 𝑄′ that serve as proxies to 𝑃, 𝑄....
-
[11]
𝑃′ be the distribution on( ˆ𝑥, ˆ𝑦), which corresponds to sampling from the true model with parameters 𝑤 ★, 𝑆★. We prove that the total variation distance between𝑃′ and 𝑃 is small by using the fact that𝑆 is a good approximation of𝑆★
-
[12]
Weprovethat 𝑄′ iscloseintotalvariationdistanceto 𝑄,byrelying on the guarantees of Lemma D.12
𝑄′bethedistributionon (e𝑥,e𝑧),whichcorrespondstodrawingfirst 𝑥 ∼ Dobs andthendrawing the outcome from the Gaussian with mean⟨𝑤 , 𝑥⟩ and variance 1 conditional on𝑆, using the algorithmofLemmaD.12. Weprovethat 𝑄′ iscloseintotalvariationdistanceto 𝑄,byrelying on the guarantees of Lemma D.12. These two results (see Lemma D.27) allow us to control the bias of ...
-
[13]
Strong Convexity: 𝑓 is 𝜅-strongly convex overP
-
[14]
Bounded Gradient Second Moment: E h 𝑔(𝑤 (𝑡)) 2 2 𝑤 (𝑡−1) i ≤ 𝜇2
-
[15]
Bounded Gradient Bias: E 𝑔(𝑤 (𝑡)) 𝑤 (𝑡−1) − ∇ℒ𝑆(𝑤 (𝑡)) 2 ≤ 𝛽 Then, the average iterate¯𝑤 B 1 𝑇 P 𝑡∈[𝑇] 𝑤 (𝑡) satisfies E [ 𝑓 ( ¯𝑤 )] − 𝑓 (𝑤 min) ≤ 𝜇2 𝜅𝑇 (1 + log 𝑇) + 𝛽𝐷 where 𝑤 min B arg min𝑤 ∈P 𝑓 (𝑤 ). Using this convergence guarantee and our above analysis, it is straightforward to show that PSGD converges to the minimizer of the perturbed log-likeliho...
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.