
arxiv: 2505.06452 · v3 · submitted 2025-05-09 · 🧮 math.ST · stat.TH

Semiparametric semi-supervised learning for general targets under distribution shift and decaying overlap

Pith reviewed 2026-05-22 16:58 UTC · model grok-4.3

classification 🧮 math.ST stat.TH
keywords semi-supervised learning · semiparametric estimation · missing at random · distribution shift · double robustness · augmented inverse probability weighting · decaying overlap

The pith

Augmented inverse probability weighting estimators maintain double robustness and semiparametric efficiency for general targets in semi-supervised learning under vanishing overlap and distribution shift.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper develops a semiparametric framework for semi-supervised settings in which outcome labels are missing at random and the probability of observing a label shrinks as the total sample size grows. The framework covers a range of targets, including means, regression coefficients, quantiles, and causal effects, while permitting distributional shift between the labeled and unlabeled portions of the data. It extends augmented inverse probability weighting (AIPW) estimators so that they retain double robustness and efficiency, but replaces the usual root-n convergence with adjusted rates that reflect how quickly overlap vanishes. This matters because the setup removes common restrictive assumptions, such as missing completely at random labeling or strict positivity, allowing more realistic use of abundant unlabeled covariates when labels remain costly.

Core claim

We introduce the D2S3 framework for estimation, inference, and efficiency benchmarking in semi-supervised learning under missing-at-random labels and vanishing overlap. Augmented inverse probability weighting estimators are shown to preserve double robustness, asymptotic normality, and semiparametric efficiency even with high-dimensional nuisance estimation and distributional shift. Classical root-n rates no longer hold; instead, corrected asymptotic rates are derived that explicitly incorporate the decay in overlap probability.
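
To make the estimator concrete, here is a minimal Python sketch of an AIPW estimate of an outcome mean when labels are missing at random. It is an illustration under stated assumptions, not the paper's implementation: the data-generating process, the helper names `aipw_mean` and `decaying_logistic`, the decay exponent `a`, and the use of the true propensity are all hypothetical choices, and the outcome model is fitted without the cross-fitting one would normally use.

```python
import numpy as np

def aipw_mean(y, r, mu_hat, pi_hat):
    """Return the AIPW estimate of E[Y] and a plug-in standard error."""
    y_obs = np.where(r == 1, y, 0.0)              # unlabeled outcomes are never used
    phi = mu_hat + r / pi_hat * (y_obs - mu_hat)  # per-unit AIPW contribution
    return phi.mean(), phi.std(ddof=1) / np.sqrt(len(phi))

def decaying_logistic(x, m, a=0.3):
    """Hypothetical propensity whose intercept falls like -a * log(m)."""
    return 1.0 / (1.0 + np.exp(a * np.log(m) - x))

rng = np.random.default_rng(0)
m = 20_000                                 # total sample size (assumed)
x = rng.normal(size=m)
y = 1.0 + 2.0 * x + rng.normal(size=m)     # true mean of Y is 1.0
pi = decaying_logistic(x, m)               # true labeling probabilities
r = rng.binomial(1, pi)                    # 1 = label observed

# Outcome regression: linear fit on the labeled units, predicted for everyone.
X = np.column_stack([np.ones(m), x])
beta, *_ = np.linalg.lstsq(X[r == 1], y[r == 1], rcond=None)
mu_hat = X @ beta

est, se = aipw_mean(y, r, mu_hat, pi)
naive = y[r == 1].mean()                   # labeled-only mean, for contrast
print(f"AIPW: {est:.3f} (se {se:.3f}), labeled-only mean: {naive:.3f}")
```

In this design the labeled-only mean is biased upward because units with larger x are more likely to be labeled, while the AIPW estimate reweights the labeled residuals back toward the full covariate distribution.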

What carries the argument

Augmented inverse probability weighting estimators, corrected for the decay rate of the labeling probability, which carry double robustness and efficiency through the regime of vanishing overlap.

If this is right

  • Double robustness holds for smooth targets such as means, linear coefficients, quantiles, and causal effects.
  • Asymptotic normality is retained with rates adjusted downward from root-n according to the speed of overlap decay.
  • Semiparametric efficiency is achieved under high-dimensional nuisance functions and distributional shift.
  • The framework applies directly to settings with scarce labeled data such as IoT and public health studies.
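
The double-robustness bullet can be illustrated for the simplest target, a mean. The sketch below is a toy check under assumed conditions, not a reproduction of the paper's experiments: the data-generating process, the fixed (non-decaying) logistic propensity, and the deliberately wrong nuisance models are hypothetical, so it speaks only to the robustness property and not to the corrected rates.

```python
import numpy as np

rng = np.random.default_rng(1)
m = 50_000
x = rng.normal(size=m)
y = 1.0 + 2.0 * x + rng.normal(size=m)        # true mean of Y is 1.0
pi_true = 1.0 / (1.0 + np.exp(2.0 - x))       # MAR: labeling depends on x only
r = rng.binomial(1, pi_true)

def aipw(mu_hat, pi_hat):
    # Outcomes enter only where r == 1; elsewhere the correction term is zero.
    return np.mean(mu_hat + r / pi_hat * (y - mu_hat))

X = np.column_stack([np.ones(m), x])
beta, *_ = np.linalg.lstsq(X[r == 1], y[r == 1], rcond=None)
mu_good = X @ beta                            # correctly specified linear model
mu_bad = np.zeros(m)                          # deliberately wrong outcome model
pi_bad = np.full(m, r.mean())                 # deliberately wrong: ignores x

est_pi_only = aipw(mu_bad, pi_true)           # propensity right, outcome wrong
est_mu_only = aipw(mu_good, pi_bad)           # outcome right, propensity wrong
est_neither = aipw(mu_bad, pi_bad)            # both wrong
print(est_pi_only, est_mu_only, est_neither)
```

With both nuisance models wrong, the estimate collapses to the labeled-only mean, which is biased in this design; with either one correct, the estimate stays close to the true value of 1.0.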

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Accurate estimation of the decay rate itself becomes a key practical requirement for reliable inference.
  • The same rate-adjustment logic may extend to other estimators besides augmented inverse probability weighting.
  • Applications in limited-experiment causal inference could benefit from treating the labeled sample as the experimental arm with decaying selection probability.

Load-bearing premise

The rate at which overlap between labeled and unlabeled samples decays is known or can be estimated at a rate fast enough for the corrected asymptotic results to apply.

What would settle it

A simulation in which the overlap decays at a known rate faster than the modeled one, and in which the estimator's distribution deviates from the predicted corrected asymptotic normality or efficiency bound.
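
A check of this kind could be set up as a small Monte Carlo study. The following is a rough sketch, assuming a hypothetical data-generating process, a known propensity of the form sigmoid(x - a·log m), and a mean target; the function name `coverage` and the exponent `a` are illustrative and do not come from the paper. It reports how often nominal 95% AIPW intervals contain the true mean.

```python
import numpy as np

def coverage(m, a, reps=300, seed=0):
    """Share of nominal 95% AIPW intervals that contain the true mean (1.0)."""
    rng = np.random.default_rng(seed)
    hits = 0
    for _ in range(reps):
        x = rng.normal(size=m)
        y = 1.0 + 2.0 * x + rng.normal(size=m)
        pi = 1.0 / (1.0 + np.exp(a * np.log(m) - x))   # known, decaying in m
        r = rng.binomial(1, pi)
        X = np.column_stack([np.ones(m), x])
        beta, *_ = np.linalg.lstsq(X[r == 1], y[r == 1], rcond=None)
        mu_hat = X @ beta
        # Unlabeled outcomes are multiplied by r == 0 and so never contribute.
        phi = mu_hat + r / pi * (y - mu_hat)
        est = phi.mean()
        se = phi.std(ddof=1) / np.sqrt(m)
        hits += abs(est - 1.0) <= 1.96 * se
    return hits / reps

cov_mild = coverage(m=5000, a=0.3)   # mild decay; raise `a` to stress the setup
print(f"empirical coverage: {cov_mild:.3f}")
```

Raising `a` makes labels scarcer at a given `m`, which is the direction in which one would look for departures from the predicted normal approximation.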

Figures

Figures reproduced from arXiv: 2505.06452 by Jing Lei, Kathryn Roeder, Lorenzo Testa, Qi Xu.

Figure 1: Multivariate outcome mean simulation results. The simulation is run 1000 times under the decaying logistic scenario with n = 100 labeled samples and N = 1000 unlabeled samples; the outcome model μ̂ is estimated using linear regression, and the propensity score model π̂ is estimated via a constant estimator. The left panel displays boxplots summarizing the RMSE distribution over different seeds of the considered …

Figure 2: Linear regression coefficients simulation results. The simulation is run 1000 times under the decaying logistic scenario with n = 100 labeled samples and N = 1000 unlabeled samples; the outcome model μ̂ is estimated using linear regression, and the propensity score model π̂ is estimated via a constant estimator. The left panel displays boxplots summarizing the RMSE distribution over different seeds of the consid…

Figure 3: BLE-RSSI application results. Left panel: estimated propensity scores across observations, along with the proportion of labeled data n/(n + N), shown as a vertical dashed line. The heterogeneity in the estimated propensity scores may suggest a missing-at-random (MAR) labeling mechanism. Right panel: joint bivariate distribution over the floor bivariate grid of the observed position of the device Y (red) co…

Figure 4: NHEFS application results. Left panel: estimated propensity scores across observations, for various missingness indicators (very active, moderately active, marginal). The heterogeneity in the estimated propensity scores may suggest a missing-at-random (MAR) labeling mechanism. Right panel: estimated means from AIPW (blue) and the naive estimator based solely on labeled data (red). Confidence intervals are …
Original abstract

In modern scientific applications, large volumes of covariate data are readily available, while outcome labels are costly, sparse, and often subject to distribution shift. This asymmetry has spurred interest in semi-supervised (SS) learning, but most existing approaches rely on strong assumptions -- such as missing completely at random (MCAR) labeling or strict positivity -- that put substantial limitations on their practical usefulness. In this work, we introduce a general semiparametric framework for estimation, inference, and efficiency benchmarking in SS settings where labels are missing at random (MAR) and the overlap may vanish as sample size increases. Our framework, that we label D2S3, accommodates a wide range of smooth statistical targets -- including means, linear regression coefficients, quantiles, and causal effects -- and remains valid under high-dimensional nuisance estimation and distributional shift between labeled and unlabeled samples. We extend the theoretical guarantees of augmented inverse probability weighting estimators to preserve double robustness, asymptotic normality, and semiparametric efficiency under this challenging D2S3 regime. A key insight is that classical root-n convergence fails under vanishing overlap; we instead provide corrected asymptotic rates that capture the impact of the decay in overlap. We validate our theory through simulations and demonstrate practical utility in real-world applications on the internet of things and public health where labeled data are scarce.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces the D2S3 framework for semiparametric semi-supervised estimation under MAR labeling, vanishing overlap, and distribution shift between labeled and unlabeled samples. It extends augmented inverse probability weighting (AIPW) estimators to general smooth targets (means, linear regression coefficients, quantiles, causal effects) while preserving double robustness, deriving corrected asymptotic rates to restore asymptotic normality and semiparametric efficiency when classical root-n convergence fails due to decaying overlap, and accommodating high-dimensional nuisance estimation.

Significance. If the corrected rates and associated conditions are rigorously derived and the overlap decay is either known or estimable at a sufficient rate, the work would meaningfully extend classical AIPW theory to more realistic semi-supervised regimes with vanishing positivity and distribution shift. The generality across targets and the explicit handling of non-root-n rates are potential strengths for enabling valid inference where standard semiparametric results do not apply.

major comments (2)
  1. [Abstract] Abstract: The central claim that AIPW estimators preserve asymptotic normality and semiparametric efficiency under the D2S3 regime via 'corrected asymptotic rates' that depend on the overlap decay sequence is load-bearing for all downstream guarantees, yet the abstract provides no explicit form for these rates, no conditions on the decay (e.g., deterministic vs. random, monotonicity, or relative to sample size n), and no nuisance rate requirements that incorporate the decay. If the true decay is faster or irregular, the bias-variance balance in the expansion would fail, invalidating normality and efficiency.
  2. [Theoretical results] Theoretical development: The extension of double robustness requires that the overlap decay rate be known a priori or estimated without disturbing the asymptotic expansion; the manuscript must state the precise conditions under which the post-hoc adjustments to the AIPW expansion remain valid, including how estimation error in the propensity or overlap interacts with the vanishing overlap term.
minor comments (2)
  1. [Abstract] The acronym D2S3 is used without spelling out 'Distribution shift, Decaying overlap, Semi-Supervised' on first use in the abstract.
  2. [Simulations] Simulation section: Provide more detail on how the overlap decay sequence is generated and whether it is treated as known or estimated in the reported experiments.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the thoughtful and constructive report. The comments highlight important aspects of clarity in presenting the D2S3 framework's rates and conditions. We address each major comment below and will revise the manuscript to strengthen the exposition while preserving the core contributions.

Point-by-point responses
  1. Referee: [Abstract] Abstract: The central claim that AIPW estimators preserve asymptotic normality and semiparametric efficiency under the D2S3 regime via 'corrected asymptotic rates' that depend on the overlap decay sequence is load-bearing for all downstream guarantees, yet the abstract provides no explicit form for these rates, no conditions on the decay (e.g., deterministic vs. random, monotonicity, or relative to sample size n), and no nuisance rate requirements that incorporate the decay. If the true decay is faster or irregular, the bias-variance balance in the expansion would fail, invalidating normality and efficiency.

    Authors: We agree that the abstract would benefit from greater explicitness on this central claim. The corrected asymptotic rate is derived in Theorem 1 as n^{-1/2} pi_n^{-1/2} (where pi_n denotes the overlap decay sequence), under the assumption that pi_n is deterministic, non-increasing, and known up to estimation at a rate that does not dominate the leading term. Nuisance estimation must satisfy ||hat{eta} - eta|| = o_p((n pi_n)^{-1/4}) to preserve the expansion and double robustness. We will revise the abstract to briefly state the rate form, the deterministic monotonicity assumption on pi_n, and the adjusted nuisance rate requirement. This change improves transparency without affecting the manuscript's theoretical results. revision: yes

  2. Referee: [Theoretical results] Theoretical development: The extension of double robustness requires that the overlap decay rate be known a priori or estimated without disturbing the asymptotic expansion; the manuscript must state the precise conditions under which the post-hoc adjustments to the AIPW expansion remain valid, including how estimation error in the propensity or overlap interacts with the vanishing overlap term.

    Authors: The manuscript states these conditions in Assumption 3 and the proof of Theorem 2: the sequence {pi_n} may be treated as known for the rate correction, while estimation of pi_n (discussed in Section 4.2) is required to converge at o_p(pi_n) so that it does not alter the leading asymptotic term. Propensity estimation error interacts with the vanishing overlap through the product bound o_p((n pi_n)^{-1/2}), which is ensured by the high-dimensional nuisance rates already imposed. We will add an explicit remark in the revision clarifying this interaction and the precise conditions under which the post-hoc adjustments remain valid, thereby making the preservation of double robustness fully transparent. revision: yes
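
Taking the rate forms quoted in the responses above at face value, a quick calculation shows how much slower the corrected rate can be than root-n. The values of `n` and `pi_n` below are hypothetical and chosen only for illustration; the variable names are not from the paper.

```python
n = 1_000_000          # hypothetical sample size
pi_n = 1e-3            # hypothetical overlap level at that sample size

effective_n = n * pi_n                 # roughly 1000 "effective" observations
classical_rate = n ** -0.5             # root-n rate that no longer applies
corrected_rate = effective_n ** -0.5   # n^{-1/2} pi_n^{-1/2}
nuisance_rate = effective_n ** -0.25   # threshold in o_p((n pi_n)^{-1/4})

print(f"{classical_rate:.4f} {corrected_rate:.4f} {nuisance_rate:.4f}")
# prints: 0.0010 0.0316 0.1778
```

Under these assumed numbers the corrected rate is about thirty times slower than the classical one, and the nuisance estimators would need errors shrinking faster than roughly 0.18 in the stated norm.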

Circularity Check

0 steps flagged

Derivation of corrected rates under D2S3 regime is self-contained and independent of inputs

full rationale

The paper extends classical AIPW double robustness and efficiency results to the MAR + vanishing overlap + shift setting by deriving explicit corrected asymptotic expansions that depend on the overlap decay sequence. These rates follow directly from the stated modeling assumptions (MAR, known or estimable decay, high-dimensional nuisance rates) rather than from any parameter fitted to the target functional on the same data. No equation reduces to a tautology, no prediction is a renamed fit, and no load-bearing step collapses to a self-citation whose content is itself unverified. The framework therefore remains non-circular.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

Framework rests on standard MAR assumption and controlled decay of overlap; no free parameters or invented entities are mentioned in the abstract.

axioms (2)
  • domain assumption Labels are missing at random (MAR)
    Explicitly stated as the labeling mechanism the framework accommodates.
  • domain assumption Overlap between labeled and unlabeled samples may vanish as sample size grows
    Central challenging regime for which corrected asymptotic rates are derived.

pith-pipeline@v0.9.0 · 5769 in / 1118 out tokens · 60525 ms · 2026-05-22T16:58:44.530470+00:00 · methodology



Forward citations

Cited by 2 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Transporting treatment effects by calibrating large-scale observational outcomes

    stat.ME 2026-05 unverdicted novelty 6.0

    Proposes a calibration-based estimator for transported average treatment effects that is consistent under correct specification and achieves semiparametric efficiency with large observational data.

  2. Transporting treatment effects by calibrating large-scale observational outcomes

    stat.ME 2026-05 unverdicted novelty 6.0

    A calibration procedure yields a weighted transported average treatment effect with asymptotically valid and efficient inference when experimental data grows slower than observational data, even without positivity or ...
