pith. sign in

arxiv: 2410.01656 · v2 · submitted 2024-10-02 · 🧮 math.ST · cs.DS· cs.LG· stat.CO· stat.ML· stat.TH

Efficient Statistics With Unknown Truncation, Polynomial Time Algorithms, Beyond Gaussians

Pith reviewed 2026-05-23 20:10 UTC · model grok-4.3

classification 🧮 math.ST cs.DScs.LGstat.COstat.MLstat.TH
keywords truncated estimationexponential familiesunknown truncationGaussian estimationlinear regressionpositive unlabeled learningpolynomial approximation
0
0 comments X

The pith

A d^{poly(ℓ/ε)} time algorithm estimates parameters of exponential families from samples truncated by any unknown set approximable by degree-ℓ polynomials.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper studies how to recover distributional parameters when data is observed only inside an unknown region S. It supplies a runtime that depends on dimension d to the power of a polynomial in ℓ/ε for any qualifying exponential family whenever S admits an ε-approximation by degree-ℓ polynomials. This immediately yields the first such procedure for fully general Gaussians and the first procedure for linear regression whose features are Gaussian yet whose responses are truncated by an unknown S. A second, fully polynomial-time result is given when S is restricted to a halfspace or axis-aligned box. The work also supplies an auxiliary reduction from positive-unlabeled learning to positive-negative learning that tolerates covariate shift.

Core claim

For any exponential family obeying the required structural conditions and any unknown truncation set S that is ε-close to a degree-ℓ polynomial, the parameters can be recovered to accuracy ε in time d^{poly(ℓ/ε)}; the same framework yields the first algorithms for arbitrary-covariance Gaussians under unknown truncation and for linear regression with Gaussian features under unknown truncation, while a separate poly(d/ε)-time algorithm exists when S is a halfspace or axis-aligned rectangle.

What carries the argument

A reduction that converts the truncated estimation task into a robust positive-unlabeled learning problem, then solves the latter via polynomial approximation of the truncation set.

If this is right

  • The first algorithm exists that recovers the mean and covariance of an arbitrary Gaussian from samples observed only inside an unknown set S.
  • The first algorithm exists for linear regression when the feature vectors are Gaussian and the responses are observed only inside an unknown truncation set.
  • Polynomial-time recovery is possible for all Gaussians and a broader class of exponential families when the truncation set is a halfspace or an axis-aligned rectangle.
  • A general reduction maps positive-unlabeled learning under covariate shift to ordinary positive-negative learning.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • If the structural assumptions on the exponential family can be verified for additional families, the same polynomial-approximation approach would immediately give truncated estimators for those families as well.
  • The reduction from positive-unlabeled to positive-negative learning may apply to other missing-data or selection-bias settings beyond truncation.
  • Testing whether a concrete truncation set admits a low-degree polynomial approximation could become a practical preprocessing step for deciding which algorithm to run.

Load-bearing premise

The exponential family must obey unspecified structural conditions and the unknown truncation set S must admit an ε-approximation by a degree-ℓ polynomial.

What would settle it

An explicit exponential family obeying the structural conditions together with a truncation set S that cannot be approximated by any low-degree polynomial for which no d^{poly(ℓ/ε)}-time algorithm exists.

Figures

Figures reproduced from arXiv: 2410.01656 by Anay Mehrotra, Jane H. Lee, Manolis Zampetakis.

Figure 1
Figure 1. Figure 1: Illustration that θ ⋆ can be far from θPMLE even though the gradient at θ ⋆ is small (Lemma B.4). To ensure θ ⋆ is close to θPMLE, we carefully select the domain Ω. The blue line is the objective of Perturbed MLE LS(·). This shows that for any S that is ε-close to S ⋆ in the following sense E (S△S ⋆ ; θ ⋆ ) ≤ ε · E (S ⋆ ; θ ⋆ ) the gradient at θ ⋆ has a small L2 norm: ∥∇LS(θ)|θ ⋆ ∥2 ≤ O(ε) · e poly(1/α) . … view at source ↗
Figure 2
Figure 2. Figure 2: Illustration of the properties of θ0 (see Lemma 3.7): θ0 is at most poly(1/α) far from θ ⋆ ( [PITH_FULL_IMAGE:figures/full_fig_p014_2.png] view at source ↗
Figure 3
Figure 3. Figure 3: Illustration of the marginal distributions constructed in the reduction in [PITH_FULL_IMAGE:figures/full_fig_p015_3.png] view at source ↗
Figure 4
Figure 4. Figure 4: An illustration of the χ 2 -Bridge promised to exist in Informal Assumption 3 (or its formal version Assumption 3.5). This bridge always exists for Gaussians and product exponential distri￾butions after pre-processing described in Section 9. In this example, the two distributions E(θ1) and E(θ2) are far from each other in χ 2 -divergence (χ 2 (E(θ1)∥E(θ2)), χ 2 (E(θ2)∥E(θ1)) ≥ 1086) and the bridge distribu… view at source ↗
Figure 5
Figure 5. Figure 5: A construction appearing in the proof of [PITH_FULL_IMAGE:figures/full_fig_p044_5.png] view at source ↗
read the original abstract

We study the estimation of distributional parameters when samples are shown only if they fall in some unknown set $S \subseteq \mathbb{R}^d$. Kontonis, Tzamos, and Zampetakis (FOCS'19) gave a $d^{\mathrm{poly}(1/\varepsilon)}$ time algorithm for finding $\varepsilon$-accurate parameters for the special case of Gaussian distributions with diagonal covariance matrix. Recently, Diakonikolas, Kane, Pittas, and Zarifis (COLT'24) showed that this exponential dependence on $1/\varepsilon$ is necessary even when $S$ belongs to some well-behaved classes. These works leave the following open problems which we address in this work: Can we estimate the parameters of any Gaussian or even extend beyond Gaussians? Can we design $\mathrm{poly}(d/\varepsilon)$ time algorithms when $S$ is a simple set such as a halfspace? We make progress on both of these questions by providing the following results: 1. Toward the first question, we give a $d^{\mathrm{poly}(\ell/\varepsilon)}$ time algorithm for any exponential family that satisfies some structural assumptions and any unknown set $S$ that is $\varepsilon$-approximable by degree-$\ell$ polynomials. This result has two important applications: 1a) The first algorithm for estimating arbitrary Gaussian distributions from samples truncated to an unknown $S$; and 1b) The first algorithm for linear regression with unknown truncation and Gaussian features. 2. To address the second question, we provide an algorithm with runtime $\mathrm{poly}(d/\varepsilon)$ that works for a set of exponential families (containing all Gaussians) when $S$ is a halfspace or an axis-aligned rectangle. Along the way, we develop tools that may be of independent interest, including, a reduction from PAC learning with positive and unlabeled samples to PAC learning with positive and negative samples that is robust to certain covariate shifts.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

0 major / 3 minor

Summary. The paper claims a d^{poly(ℓ/ε)} time algorithm for parameter estimation in exponential families satisfying structural assumptions, when samples are truncated to an unknown set S that is ε-approximable by degree-ℓ polynomials. This yields the first algorithms for arbitrary (non-diagonal) Gaussians under unknown truncation and for linear regression with unknown truncation and Gaussian features. It also gives a poly(d/ε) time algorithm for a class of exponential families (including Gaussians) when S is a halfspace or axis-aligned rectangle, plus a reduction from positive-unlabeled to positive-negative PAC learning that is robust to covariate shift.

Significance. If the central claims hold, the work is significant: it resolves open questions left by Kontonis et al. (FOCS'19) and Diakonikolas et al. (COLT'24) by extending efficient truncated estimation beyond diagonal Gaussians, supplies the first algorithms for arbitrary Gaussians and truncated linear regression, and provides a poly(d/ε) runtime for simple truncation sets. The PU-to-PN reduction robust to covariate shift is a tool of independent interest. Credit is due for the explicit algorithmic constructions and the reduction.

minor comments (3)
  1. [Abstract and §1] Abstract and introduction repeatedly refer to 'some structural assumptions' on the exponential family without a concise enumerated list; move or add an explicit bullet list of the assumptions (with forward reference to the theorem statement) to improve readability.
  2. [Theorem statements in §3 and §4] The runtime statements (e.g., d^{poly(ℓ/ε)}) should explicitly list dependence on all parameters including the failure probability δ and any constants hidden in the poly notation.
  3. [§2 (Preliminaries)] Notation for the polynomial approximability of S (ε-approximable by degree-ℓ polynomials) is introduced without a formal definition or reference to the precise norm or measure used; add a short definition paragraph or equation.

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for their positive assessment of the paper, recognition of its significance in resolving open questions from prior work, and recommendation of minor revision. No specific major comments were raised in the report.

Circularity Check

0 steps flagged

No significant circularity in derivation chain

full rationale

The paper presents novel algorithmic results for estimating parameters of exponential families under unknown truncation sets that satisfy polynomial approximability or simplicity conditions (halfspaces, rectangles). These are framed as new constructions with explicit runtimes, including a reduction from positive-unlabeled to positive-negative PAC learning robust to covariate shift. Prior citations (e.g., Kontonis et al. FOCS'19 for the diagonal-Gaussian special case) provide context and contrast rather than load-bearing justification for the new claims; the central results do not reduce by definition or construction to fitted inputs, self-defined quantities, or unverified self-citations. The derivation chain is self-contained as independent algorithmic development.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract provides no explicit free parameters, axioms, or invented entities; all claims rest on unspecified structural assumptions for the exponential family and polynomial approximability of S.

pith-pipeline@v0.9.0 · 5929 in / 1081 out tokens · 21377 ms · 2026-05-23T20:10:23.732508+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

20 extracted references · 20 canonical work pages

  1. [1]

    The Reverse Isoperimetric Problem for Gaussian Measure

    ISBN : 9780486612720. URL: https://books.google.com/books?id=MtU8uP7XMvoC (cit. on p. 40). [Bal93] Keith Ball. “The Reverse Isoperimetric Problem for Gaussian Measure”. In: Discrete Comput. Geom. 10.4 (Dec. 1993), pp. 411–420.ISSN : 0179-5376. DOI: 10.1007/BF02573986. URL: https://doi.org/10.1007/BF02573986 (cit. on p. 16). [Ber46] Joseph Berkson. “Limita...

  2. [2]

    Learning From Satisfy- ing Assignments Under Continuous Distributions

    Orlando, FL, USA: Association for Computing Machinery, 2023, pp. 1699–1712. ISBN : 9781450399135. DOI: 10.1145/3564246.3585177 (cit. on pp. 7, 20). [CDS20] Clément L. Canonne, Anindya De, and Rocco A. Servedio. “Learning From Satisfy- ing Assignments Under Continuous Distributions”. In: Proceedings of the 2020 ACM- SIAM Symposium on Discrete Algorithms (S...

  3. [3]

    New Results for Learning Noisy Parities and Halfspaces

    URL: https : / / proceedings . neurips . cc / paper _ files / paper / 2021 / file / 0ed8861dc36bee580d100f91283d0559-Paper.pdf (cit. on pp. 1, 2, 19, 20). [Efr22] Bradley Efron. Exponential Families in Theory and Practice . Institute of Mathematical Statistics Textbooks. Cambridge University Press, 2022. DOI: https://doi.org/10. 1017/9781108773157 (cit. o...

  4. [4]

    An Investigation of the Consequences of Partial Aggregation of Micro-Economic Data

    URL: https://api.semanticscholar.org/CorpusID:220363551 (cit. on p. 20). [FW72] Edgar L. Feige and Harold W. Watts. “An Investigation of the Consequences of Partial Aggregation of Micro-Economic Data”. In:Econometrica 40.2 (1972), pp. 343–360.ISSN : 00129682, 14680262. URL: http://www.jstor.org/stable/1909411 (cit. on p. 19). [FYC23] Jiaojiao Fan, Bo Yuan...

  5. [5]

    Imbens and D.B

    URL: http://www.nber.org/chapters/c10489 (cit. on p. 19). [IR15] Guido W. Imbens and Donald B. Rubin. Causal Inference for Statistics, Social, and Biomed- ical Sciences: An Introduction. Cambridge University Press, 2015.DOI: 10.1017/CBO9781139025751 (cit. on pp. 1, 19). [IZD20] Andrew Ilyas, Emmanouil Zampetakis, and Constantinos Daskalakis. “A Theoreti- ...

  6. [6]

    Learning Geometric Concepts via Gaussian Surface Area

    DOI: 10.1137/060649057 . URL: https://doi.org/10.1137/060649057 (cit. on pp. 9, 15, 16, 35). [KOS08] Adam R. Klivans, Ryan O’Donnell, and Rocco A. Servedio. “Learning Geometric Concepts via Gaussian Surface Area”. In: 2008 49th Annual IEEE Symposium on Foun- dations of Computer Science . 2008, pp. 541–550. DOI: 10.1109/FOCS.2008.64 (cit. on p. 16). [KSST1...

  7. [7]

    A Computationally Efficient Method for Learning Exponential Family Distributions

    DOI: 10.1007/s00477-011-0458-8 (cit. on p. 1). [SB14] Shai Shalev-Shwartz and Shai Ben-David. Understanding Machine Learning: From The- ory to Algorithms . Cambridge University Press, 2014. DOI: https : / / doi . org / 10 . 1017/CBO9781107298019 (cit. on pp. 9, 16, 36). [SSW21] Abhin Shah, Devavrat Shah, and Gregory Wornell. “A Computationally Efficient M...

  8. [8]

    [DGTZ18]) Σ−1/2 (bµ S − µ S) 2 ≤ ε and (1 − ε)ΣS ⪯bΣS ⪯ (1 + ε)ΣS

    (Relation between (bµ ,bΣ) and (µ S, ΣS); Lemma 5 of Daskalakis et al. [DGTZ18]) Σ−1/2 (bµ S − µ S) 2 ≤ ε and (1 − ε)ΣS ⪯bΣS ⪯ (1 + ε)ΣS

  9. [9]

    [DGTZ18]) Σ−1/2 (bµ S − µ ) ≤ O log1/2 1 α , bΣS ⪰ Ω α2 · Σ, and Σ−1/2bΣSΣ−1/2 − I F ≤ O log 1 α

    (Relation between (bµ ,bΣ) and (µ , Σ); Corollary 1 of Daskalakis et al. [DGTZ18]) Σ−1/2 (bµ S − µ ) ≤ O log1/2 1 α , bΣS ⪰ Ω α2 · Σ, and Σ−1/2bΣSΣ−1/2 − I F ≤ O log 1 α

  10. [10]

    more direct

    (Further relations between (bµ ,bΣ) and (µ , Σ); Proposition 1 of Daskalakis et al. [DGTZ18]) I − Σ1/2bΣ−1Σ1/2 F ≤ O log (1/α) α2 and I −bΣ1/2 S Σ−1bΣ1/2 S F ≤ O log (1/α) α2 , Ω α2 · I ⪯ Σ−1/2bΣSΣ−1/2 ⪯ O 1 α2 · I and Ω α2 · I ⪯bΣ−1/2 S ΣbΣ−1/2 S ⪯ O 1 α2 · I , bΣ−1 S Σ1/2 (bµ S − µ ) 2 ≤ O log (1/α) α2 and Σ−1bΣ1/2 S (bµ S − µ ) 2 ≤ O log (1/α) α2 . B P...

  11. [11]

    Case A (t ≤ −1.05): Since H(·) ≥ 0 and t ≤ 0, 2H2(t) − 3tH (t) ≥ 0 and, hence, t2 − 1 + 2H2(t) − 3tH (t) ≥ t2 − 1 (t≤−1.05) > 1 10

  12. [12]

    Further, on t ≥ − 1.05, H(t) ≥ e−1.052/2 √ 2π(1−Φ(−1.05)) =: c(1.05)

    Case B ( t ∈ (−1.05, −2/3]): Since H(·) ≥ 0 and t ≤ − 2/3, 2 H2(t) − 3tH (t) ≥ 2H2(t) + 2H(t). Further, on t ≥ − 1.05, H(t) ≥ e−1.052/2 √ 2π(1−Φ(−1.05)) =: c(1.05). Hence, 2 H2(t) − 3tH (t) ≥ 2H2(t) + 2H(t) ≥ 2c(1.05)2 + 2c(1.05) ≥ 0.68. Further, t2 − 1 ≥ − 5/9 ≥ − 0.56. It follows that t2 − 1 + 2H2(t) − 3tH (t) ≥ 0.1. 79

  13. [13]

    Further, on t ≥ − 2/3, H(t) ≥ e−(2/3)2/2 √ 2π(1−Φ(−2/3)) =: c(2/3)

    Case C ( t ∈ (−2/3, −2/5]): Since H(·) ≥ 0 and t ≤ − 2/5, 2 H2(t) − 3tH (t) ≥ 2H2(t) + 6H(t)/5. Further, on t ≥ − 2/3, H(t) ≥ e−(2/3)2/2 √ 2π(1−Φ(−2/3)) =: c(2/3). Hence, 2 H2(t) − 3tH (t) ≥ 2H2(t) + 6H(t)/5 ≥ 2c(2/3)2 + (6/5)c(2/3) ≥ 0.878. Further, t2 − 1 ≥ − 0.85. It follows that t2 − 1 + 2H2(t) − 3tH (t) ≥ 0.02

  14. [14]

    Further, on t ≥ − 2/5, H(t) ≥ e−(2/5)2/2 √ 2π(1−Φ(−2/5)) =: c(2/5)

    Case D (t ∈ (−2/5, −1/5]): Since H(·) ≥ 0 and t ≤ 1/5, 2H2(t) − 3tH (t) ≥ 2H2(t) + 3H(t)/5. Further, on t ≥ − 2/5, H(t) ≥ e−(2/5)2/2 √ 2π(1−Φ(−2/5)) =: c(2/5). Hence, 2 H2(t) − 3tH (t) ≥ 2H2(t) + 3H(t)/5 ≥ 2c(2/5)2 + (3/5)c(2/5) ≥ 0.968. Further, t2 − 1 ≥ − 0.96. It follows that t2 − 1 + 2H2(t) − 3tH (t) ≥ 0.008

  15. [15]

    Further, on t ≥ − 1/5, H(t) ≥ e−(1/5)2/2 √ 2π(1−Φ(−1/5)) =: c(1/5)

    Case E ( t ∈ (−1/5, −1/20]): Since H(·) ≥ 0 and t ≤ 1/20, 2 H2(t) − 3tH (t) ≥ 2H2(t) + 3H(t)/20. Further, on t ≥ − 1/5, H(t) ≥ e−(1/5)2/2 √ 2π(1−Φ(−1/5)) =: c(1/5). Hence, 2 H2(t) − 3tH (t) ≥ 2H2(t) + 3H(t)/20 ≥ 2c(1/5)2 + (3/20)c(1/5) ≥ 1.01. Further, t2 − 1 ≥ − 1. It follows that t2 − 1 + 2H2(t) − 3tH (t) ≥ 0.01

  16. [16]

    Further, on t ≥ − 1/20, H(t) ≥ e−(1/20)2/2 √ 2π(1−Φ(−1/20)) =: c(1/20)

    Case F ( t ∈ (−1/20, 0]): Since H(·) ≥ 0 and t ≤ 0, 2 H2(t) − 3tH (t) ≥ 2H2(t). Further, on t ≥ − 1/20, H(t) ≥ e−(1/20)2/2 √ 2π(1−Φ(−1/20)) =: c(1/20). Hence, 2H2(t) − 3tH (t) ≥ 2H2(t) ≥ 2c(1/20)2 ≥ 1.17. Further, t2 − 1 ≥ −1. It follows that t2 − 1 + 2H2(t) − 3tH (t) ≥ 0.17. B.6 Proof of Theorem 8.1: Sample Complexity of Estimation with Unknown Trunca- t...

  17. [17]

    (Strong Convexity) L (·) is Ω λ · k−k strongly convex

  18. [18]

    Moreover, given a sampler for E(·), v can be efficiently computed

    (Stochastic gradient and their Second moments) The random variable v = t(x) − t(z) where x ∼ E(θ⋆, S⋆) and z ∼ E(eθ) is an unbiased estimate ofL (·) and satisfies E h ∥v∥2 2 | θ i ≤ DΛ. Moreover, given a sampler for E(·), v can be efficiently computed

  19. [19]

    (Starting point) Any point θ ∈ Θ can be used as a starting point and satisfies ∥θ −eθ∥2 ≤ D

  20. [20]

    The following result follows

    (Projection to Θ) Finally, we inherit the projection oracle to the domain Θ which contains the optimizereθ from Assumption 3.3. The following result follows. Proposition C.1. Suppose Assumptions 1 and 2 and Assumptions 3.1, 3.2 and 3.4 hold. Supposediam(Θ) ≤ D. Fix any ε, δ ∈ (0, 1). There is an algorithm that, given n independent samples from E(θ⋆, S⋆) f...