pith. sign in

arxiv: 2602.12534 · v2 · pith:UGGZRSDFnew · submitted 2026-02-13 · 📊 stat.ML · cs.DS· cs.LG· math.ST· stat.TH

Linear Regression with Unknown Truncation Beyond Gaussian Features

Pith reviewed 2026-05-25 07:27 UTC · model grok-4.3

classification 📊 stat.ML cs.DScs.LGmath.STstat.TH
keywords truncated linear regressionunknown survival setsub-Gaussian featurespositive-only learningunions of intervalsPAC learningcensored regression
0
0 comments X

The pith

An algorithm recovers the linear regressor from samples truncated by an unknown survival set in polynomial time under only a sub-Gaussian assumption on the features.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Truncated linear regression seeks to estimate an unknown d-dimensional vector w* from pairs (x, y) that are observed only when y lies inside an unknown survival set S*. Earlier methods either assumed S* is given in advance or required Gaussian features and incurred runtime exponential in d or in 1/ε. The paper supplies the first algorithm that achieves ε-accurate recovery of w* in poly(d/ε) time while needing merely that the feature vectors are sub-Gaussian. The method rests on a new subroutine that learns a union of a bounded number of intervals from positive examples alone, provided those examples satisfy an auxiliary smoothness condition. This subroutine is presented as a contribution that may be useful beyond the regression setting.

Core claim

The paper gives a polynomial-time algorithm for truncated linear regression with unknown survival set S* that recovers the regressor w* to additive error ε whenever the feature vectors are sub-Gaussian and the positive examples obey a smoothness condition sufficient to learn unions of intervals from positive data only.

What carries the argument

A subroutine that learns unions of a bounded number of intervals from positive examples alone under a smoothness condition on the distribution of positive data.

If this is right

  • Estimation of w* becomes possible even when the truncation boundaries are completely unknown.
  • The algorithm applies to the larger class of sub-Gaussian feature distributions rather than requiring exact Gaussianity.
  • Runtime scales polynomially with dimension d and inverse accuracy 1/ε.
  • The positive-only interval-learning subroutine can be invoked in other settings that supply only positive examples.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same positive-only learning technique could be applied to other censored or truncated statistical problems where the censoring mechanism must be recovered from observed data alone.
  • If real-world truncated data often satisfy the required smoothness, the result immediately yields practical estimators for survival analysis and selection-bias settings.
  • Extending the approach to feature distributions with only polynomial-moment tails would require new concentration arguments but would further broaden applicability.

Load-bearing premise

The positive examples must satisfy a smoothness condition that makes it possible to learn the unknown survival set as a union of intervals using only positive data.

What would settle it

An explicit family of sub-Gaussian feature distributions and smooth positive distributions on which no algorithm (or this specific algorithm) recovers w* to error ε in poly(d/ε) time would falsify the claim.

Figures

Figures reproduced from arXiv: 2602.12534 by Alexandros Kouridakis, Alkis Kalavasis, Anay Mehrotra, Constantine Caramanis.

Figure 1
Figure 1. Figure 1: Example of positive-only learning under smoothness. Here, we wish to PAC learn the [PITH_FULL_IMAGE:figures/full_fig_p011_1.png] view at source ↗
Figure 2
Figure 2. Figure 2: Example of Algorithm 3 in action, for learning a union of 𝑘 = 2 intervals with error 𝜀 = 0.2. The points above represent samples on the real line. The red diamonds are positive samples drawn from the distribution D★ + , while the black dots are samples from the distribution D which is smooth w.r.t. the target D★. The algorithm considers the intervals defined by consecutive red points, and discards the 𝑘−1 … view at source ↗
Figure 3
Figure 3. Figure 3: Illustrations for Example A.2. A multi-fiber spectroscope from the APOGEE survey, where the requirement of a minimum distance between fibers creates spatially dependent selection effects. Image credit: [SDS22c]. B Further Consequences of Main Result (Theorem 3.1) In this section, we present some additional consequences of Theorem 3.1 in special cases where the covariates are Gaussian (Appendix B.1) and whe… view at source ↗
read the original abstract

In truncated linear regression, samples $(x,y)$ are shown only when the outcome $y$ falls inside a certain survival set $S^\star$ and the goal is to estimate the unknown $d$-dimensional regressor $w^\star$. This problem has a long history of study in Statistics and Machine Learning going back to the works of (Galton, 1897; Tobin, 1958) and more recently in, e.g., (Daskalakis et al., 2019; 2021; Lee et al., 2023; 2024). Despite this long history, however, most prior works are limited to the special case where $S^\star$ is precisely known. The more practically relevant case, where $S^\star$ is unknown and must be learned from data, remains open: indeed, here the only available algorithms require strong assumptions on the distribution of the feature vectors (e.g., Gaussianity) and, even then, have a $d^{\mathrm{poly} (1/\varepsilon)}$ run time for achieving $\varepsilon$ accuracy. In this work, we give the first algorithm for truncated linear regression with unknown survival set that runs in $\mathrm{poly} (d/\varepsilon)$ time, by only requiring that the feature vectors are sub-Gaussian. Our algorithm relies on a novel subroutine for efficiently learning unions of a bounded number of intervals using access to positive examples (without any negative examples) under a certain smoothness condition. This learning guarantee adds to the line of works on positive-only PAC learning and may be of independent interest.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 0 minor

Summary. The manuscript claims to give the first poly(d/ε)-time algorithm for truncated linear regression with unknown survival set S*, under the assumption that feature vectors are sub-Gaussian (relaxing prior Gaussianity requirements), by reducing to a novel subroutine that learns a bounded union of intervals from positive examples only, under an additional smoothness condition on the observed positive distribution.

Significance. If the central claims hold, the result would be significant: it improves runtime from d^{poly(1/ε)} to poly(d/ε) while weakening the feature distribution assumption, and the positive-only union-of-intervals learner may be of independent interest for other PAC learning settings.

major comments (2)
  1. [Abstract] Abstract: the claim that the algorithm requires 'only' sub-Gaussian features is undercut by the reliance on a 'certain smoothness condition' for the positive-only subroutine; the manuscript must explicitly state whether this smoothness is derived from sub-Gaussianity of x and the linear model, or is an additional assumption on the conditional distribution induced by unknown truncation.
  2. [Abstract] The reduction from truncated regression to the union-of-intervals learner is load-bearing for the poly(d/ε) runtime claim, yet the abstract supplies no indication that the required smoothness on positive examples follows automatically from sub-Gaussian features; if the derivation inserts an extra assumption (e.g., log-concavity or bounded density on the projection), the stated hypothesis class is not covered.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the careful reading and constructive comments on the abstract. We address each point below and will revise the abstract for clarity.

read point-by-point responses
  1. Referee: [Abstract] Abstract: the claim that the algorithm requires 'only' sub-Gaussian features is undercut by the reliance on a 'certain smoothness condition' for the positive-only subroutine; the manuscript must explicitly state whether this smoothness is derived from sub-Gaussianity of x and the linear model, or is an additional assumption on the conditional distribution induced by unknown truncation.

    Authors: We agree the abstract should be explicit on this point. In the full paper, the required smoothness on the positive distribution is derived directly from sub-Gaussianity of the features together with the linear model and the definition of the unknown truncation (see the analysis leading to the positive-only learner in Sections 3 and 4). No additional assumption on the conditional distribution is introduced. We will revise the abstract to state this derivation explicitly. revision: yes

  2. Referee: [Abstract] The reduction from truncated regression to the union-of-intervals learner is load-bearing for the poly(d/ε) runtime claim, yet the abstract supplies no indication that the required smoothness on positive examples follows automatically from sub-Gaussian features; if the derivation inserts an extra assumption (e.g., log-concavity or bounded density on the projection), the stated hypothesis class is not covered.

    Authors: The reduction does not insert extra assumptions. The smoothness condition on positive examples is shown to follow automatically from sub-Gaussian features, the linear model, and the truncation (via the explicit reduction and the properties of the induced marginal in the positive-only subroutine). The hypothesis class remains exactly the sub-Gaussian features case claimed. We will update the abstract to indicate that the smoothness follows from the stated assumptions. revision: yes

Circularity Check

0 steps flagged

No significant circularity; derivation self-contained with novel subroutine

full rationale

The abstract describes a poly(d/ε) algorithm for truncated linear regression under sub-Gaussian features that relies on a novel positive-only subroutine for learning bounded unions of intervals under an explicit smoothness condition on positive examples. No equations, fitted parameters renamed as predictions, or self-citation chains are present in the provided text that would reduce the central claim to its inputs by construction. The subroutine is stated to be of independent interest and is not defined in terms of the regression estimator. The smoothness condition is presented as an enabling assumption rather than derived from the model within the visible material, but this does not constitute circularity under the specified criteria. The paper's derivation chain is therefore self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

Ledger populated from abstract claims only; sub-Gaussianity and smoothness are domain assumptions required for the runtime guarantee. No free parameters or invented entities are mentioned.

axioms (2)
  • domain assumption Feature vectors are sub-Gaussian
    Explicitly required for the poly(d/ε) runtime in the abstract.
  • domain assumption Positive examples satisfy a smoothness condition enabling efficient learning of bounded unions of intervals
    Stated as necessary for the novel subroutine in the abstract.

pith-pipeline@v0.9.0 · 5844 in / 1211 out tokens · 20471 ms · 2026-05-25T07:27:10.996528+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

15 extracted references · 15 canonical work pages

  1. [1]

    R., Schiavon, R

    url: https://openreview.net/forum?id=PxcWJqO3qj (cit. on pp. 1, 3–5, 7). [Mad83] Gangadharrao S. Maddala.Limited-Dependent and Qualitative Variables in Econometrics. Econo- metric Society Monographs. Cambridge University Press, 1983.isbn: 9780521338257. url: https://books.google.com/books?id=-Ji1ZaUg7gcC (cit. on pp. 3, 7). [Mal22] Karl Gunnar Malmquist. ...

  2. [2]

    TheApachePointObservatoryGalacticEvolutionExperiment(APOGEE)Spectrographs

    issn: 0001-0782.doi: 10.1145/1968.1972. url: https://doi.org/10.1145/1968.1972 (cit. on p. 7). [Ver18] Roman Vershynin.High-Dimensional Probability: An Introduction with Applications in Data Science. Cambridge Series in Statistical and Probabilistic Mathematics. Cambridge University Press,2018. isbn:9781108415194. url: https://books.google.com/books?id=ND...

  3. [3]

    Surveys often partition targets into cohorts with different observation priorities, leading to truncation of some cohorts

    Sampling-dependent biasesarise from target truncation strategies. Surveys often partition targets into cohorts with different observation priorities, leading to truncation of some cohorts. For example, APOGEE stratifies targets by apparent brightness (H-band magnitude) and preferentially targets certain strata. Importantly, cohort definitions can depend o...

  4. [4]

    exclusion zone

    Physical limitationsof instruments impose unavoidable constraints. Beyond flux limits, atmospheric absorption can prevent observation of certain objects, and instrument design can induce truncation over sky position. For instance, APOGEE uses multi-fiber spectrographs that route light from many targets through optical fibers to a separate spectrographforh...

  5. [5]

    parent catalog

    Cascading effectspropagate biases from upstream catalogs to downstream surveys. Many surveys begin with a “parent catalog” and inherit its selection criteria. For example, galaxy surveys that start from the Uppsala General Catalogue inherit criteria that omit small angular-size galaxies unless they exceed brightness thresholds [FCWG+95]. These inherited f...

  6. [6]

    On the one hand, they consider truncation jointly in𝑥 and 𝑦, which is more general than the model we study, where truncation happens on𝑦 independent of𝑥

  7. [7]

    On the other hand, they require the distribution of𝑥 to be Gaussian, where as we do not require any such parameteric restriction. That said, since we allowD to be non-Gaussian, we can capture an important special case of their model where the (high-dimensional) survival set is of the form𝑆★ 𝑋 × 𝑆★ 𝑌 for 𝑆★ 𝑋 ⊆ R𝑑 and 𝑆★ 𝑌 ⊆ R. Indeed, in this case, we can...

  8. [8]

    closeness

    Hence, bothΛ and 𝑀 in Assumption 4 can be substituted by𝑂(𝜎 + ∥𝜇∥2). The reason why we choose to explicitly defineΛ and 𝑀 is because (i) they allow us to more concisely present many of our results and calculations in the proofs, and (ii) some of our results do not require the full power of sub-Gaussianity, but continue to hold under only the moment bounds...

  9. [9]

    Note that𝑃 depends only on𝑆 (but not𝑤 )

    𝑃 be a distribution over(𝑥, 𝑦) ∈ R𝑑 × R such that the marginal of𝑥 is Dobs and the marginal of 𝑦 given 𝑥 is N (⟨𝑤 ★, 𝑥⟩ , 1, 𝑆★ ∩ 𝑆). Note that𝑃 depends only on𝑆 (but not𝑤 )

  10. [10]

    Note that𝑄 depends both on𝑆 and 𝑤

    𝑄 be a distribution over(𝑥, 𝑧) ∈ R𝑑 × Rsuch that the marginal of𝑥 is Dobs and the marginal of 𝑧 given 𝑥 is N (⟨𝑤 , 𝑥⟩ , 1, 𝑆). Note that𝑄 depends both on𝑆 and 𝑤. If we could sample from𝑃 and 𝑄, we would have an exact sampler for the desired gradient and we would be done. Our sampler samples from two other distributions𝑃′, 𝑄′ that serve as proxies to 𝑃, 𝑄....

  11. [11]

    We prove that the total variation distance between𝑃′ and 𝑃 is small by using the fact that𝑆 is a good approximation of𝑆★

    𝑃′ be the distribution on( ˆ𝑥, ˆ𝑦), which corresponds to sampling from the true model with parameters 𝑤 ★, 𝑆★. We prove that the total variation distance between𝑃′ and 𝑃 is small by using the fact that𝑆 is a good approximation of𝑆★

  12. [12]

    Weprovethat 𝑄′ iscloseintotalvariationdistanceto 𝑄,byrelying on the guarantees of Lemma D.12

    𝑄′bethedistributionon (e𝑥,e𝑧),whichcorrespondstodrawingfirst 𝑥 ∼ Dobs andthendrawing the outcome from the Gaussian with mean⟨𝑤 , 𝑥⟩ and variance 1 conditional on𝑆, using the algorithmofLemmaD.12. Weprovethat 𝑄′ iscloseintotalvariationdistanceto 𝑄,byrelying on the guarantees of Lemma D.12. These two results (see Lemma D.27) allow us to control the bias of ...

  13. [13]

    Strong Convexity: 𝑓 is 𝜅-strongly convex overP

  14. [14]

    Bounded Gradient Second Moment: E h 𝑔(𝑤 (𝑡)) 2 2 𝑤 (𝑡−1) i ≤ 𝜇2

  15. [15]

    Bounded Gradient Bias: E 𝑔(𝑤 (𝑡)) 𝑤 (𝑡−1) − ∇ℒ𝑆(𝑤 (𝑡)) 2 ≤ 𝛽 Then, the average iterate¯𝑤 B 1 𝑇 P 𝑡∈[𝑇] 𝑤 (𝑡) satisfies E [ 𝑓 ( ¯𝑤 )] − 𝑓 (𝑤 min) ≤ 𝜇2 𝜅𝑇 (1 + log 𝑇) + 𝛽𝐷 where 𝑤 min B arg min𝑤 ∈P 𝑓 (𝑤 ). Using this convergence guarantee and our above analysis, it is straightforward to show that PSGD converges to the minimizer of the perturbed log-likeliho...