pith. sign in

arxiv: 2502.10605 · v3 · submitted 2025-02-14 · 📊 stat.ML · cs.CY· cs.LG· econ.EM· stat.ME

Batch-Adaptive Causal Annotations

Pith reviewed 2026-05-23 02:42 UTC · model grok-4.3

classification 📊 stat.ML cs.CYcs.LGecon.EMstat.ME
keywords causal inferencemissing outcomesbatch samplingdoubly robust estimatoraverage treatment effectoptimal allocationannotation efficiency
0
0 comments X

The pith

A closed-form solution for optimal batch sampling probabilities minimizes the asymptotic variance of a doubly robust estimator for causal effects with missing outcomes.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper derives an analytical expression for the probabilities of selecting data points to receive costly outcome labels, chosen specifically to reduce the variance of treatment effect estimates. This addresses settings where only a subset of outcomes can be observed due to budget limits on annotation or follow-up. The approach relies on the asymptotic properties of doubly robust estimators and is tested on both simulated data and a real dataset from homelessness outreach services. It reports that the optimized selection recovers the accuracy of random sampling while using substantially fewer labels.

Core claim

We derive a closed-form solution for the optimal batch sampling probability by minimizing the asymptotic variance of a doubly robust estimator for causal inference with missing outcomes. Motivated by partners in street outreach, the framework extends to costly annotations of unstructured data such as text or images. Across simulated and real-world datasets, including one of outreach interventions in homelessness services, the approach achieves substantially lower mean-squared error and recovers the AIPW estimate with fewer labels than existing baselines, matching confidence intervals obtained with 361 random samples using only 90 optimized samples.

What carries the argument

Closed-form optimal batch sampling probability obtained by minimizing the asymptotic variance of the doubly robust estimator.

If this is right

  • The optimal probabilities yield lower mean-squared error for average treatment effect estimates at any fixed labeling budget.
  • The same derivation applies when annotations involve unstructured inputs such as text or images in addition to structured records.
  • In the homelessness services dataset the method recovers the target estimate while using only about one quarter the labels required by random selection.
  • The savings arise directly from reduced variance rather than from changes to the underlying estimator.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same variance-minimization idea could be applied to other missing-data estimators beyond doubly robust forms.
  • If the closed-form expression depends on correctly specified models for the outcome regression and propensity score, misspecification would affect the realized efficiency gain.
  • Sequential updating of the sampling probabilities across multiple annotation rounds might produce further reductions in required labels.
  • The approach could extend to settings with multiple treatment arms or heterogeneous effects if the variance expression can be generalized accordingly.

Load-bearing premise

The asymptotic variance of the doubly robust estimator can be expressed in closed form and minimized analytically over the choice of batch sampling probabilities.

What would settle it

Empirical results in which the derived sampling probabilities produce higher mean-squared error than uniform random sampling for the same number of labels would contradict the optimality claim.

Figures

Figures reproduced from arXiv: 2502.10605 by Angela Zhou, Ezinne Nwankwo, Lauri Goldkind.

Figure 1
Figure 1. Figure 1: Illustration of cross-fitting (K folds within batches) Algorithm 1 Batch Adaptive Causal Estimation With Complex Embedded Outcomes Input: Data D = {(Xi , Zi)} n i=1, sampling budget Bz for z ∈ {0, 1} Step 1: Partition D into 2 batches and K folds D (k) 1 , D (k) 2 for k = 1, . . . , K Step 2: On Batch 1, sample R1 ∼ Bern(B). Estimate nuisances within each k-fold µˆ (k) z , σˆ 2(k) z , eˆ (k) z . Step 3: On… view at source ↗
Figure 2
Figure 2. Figure 2: Synthetic and Semi-Synthetic (Retail Hero) Data. Mean squared error (top) and 95% confidence interval width on the log scale (bottom) averaged over 20 and 100 (for simulated data) trials across budget percentages of the data. Left and Center: Random forest prediction on tabular data. Right: including LLM predictions on text and serialized features. purchase from social media posts Y˜ and then using them as… view at source ↗
Figure 3
Figure 3. Figure 3: ), such as the number of previous outreach engagements, and (right, [PITH_FULL_IMAGE:figures/full_fig_p012_3.png] view at source ↗
Figure 4
Figure 4. Figure 4: Mean squared error between estimated ATE and true ATE averaged over 100 trials across [PITH_FULL_IMAGE:figures/full_fig_p034_4.png] view at source ↗
Figure 5
Figure 5. Figure 5: Average confidence interval width averaged over 100 trials across varying budgets. [PITH_FULL_IMAGE:figures/full_fig_p035_5.png] view at source ↗
Figure 6
Figure 6. Figure 6: Boxplots of ATE estimates compared to skyline [PITH_FULL_IMAGE:figures/full_fig_p035_6.png] view at source ↗
Figure 7
Figure 7. Figure 7: Distribution of street outreach engagements for client population. [PITH_FULL_IMAGE:figures/full_fig_p036_7.png] view at source ↗
Figure 8
Figure 8. Figure 8: Street outreach data with post-treatment summaries only. Mean squared error and 95% confidence interval width averaged over 20 trials across budget percentages of the data. This plot makes use of tabular data and the best-performing random forest outcome model (left) and text-encoded outcomes using LLMs (right). 37 [PITH_FULL_IMAGE:figures/full_fig_p037_8.png] view at source ↗
Figure 9
Figure 9. Figure 9: Street outreach data with pre- and post-treatment summaries. Mean squared error and 95% confidence interval width averaged over 20 trials across budget percentages of the data. This plot makes use of tabular data and the best-performing random forest outcome model (left) and text-encoded outcomes using LLMs (right). In [PITH_FULL_IMAGE:figures/full_fig_p038_9.png] view at source ↗
Figure 10
Figure 10. Figure 10: RetailHero Data. Budget saved due to batch adaptive annotation. The reduction in annotation sample size needed to achieve the same confidence interval width with batch adaptive annotation on tabular data (left) and on tabular data + complex embedded outcomes (right) compared to random sampling [PITH_FULL_IMAGE:figures/full_fig_p039_10.png] view at source ↗
Figure 11
Figure 11. Figure 11: Street Outreach Data. Budget saved due to batch adaptive annotation. The reduction in annotation sample size needed to achieve the same confidence interval width with batch adaptive annotation on tabular data (left) and on tabular data + complex embedded outcomes (right) compared to random sampling. 39 [PITH_FULL_IMAGE:figures/full_fig_p039_11.png] view at source ↗
read the original abstract

Estimating the causal effects of interventions is crucial to policy and decision-making, yet outcome data are often missing or subject to non-standard measurement error. While ground-truth outcomes can sometimes be obtained through costly data annotation or follow-up, budget constraints typically allow only a fraction of the dataset to be labeled. We address this challenge by optimizing which data points should be sampled for outcome information in order to improve efficiency in average treatment effect estimation with missing outcomes. We derive a closed-form solution for the optimal batch sampling probability by minimizing the asymptotic variance of a doubly robust estimator for causal inference with missing outcomes. Motivated by our street outreach partners, we extend the framework to costly annotations of unstructured data, such as text or images in healthcare and social services. Across simulated and real-world datasets, including one of outreach interventions in homelessness services, our approach achieves substantially lower mean-squared error and recovers the AIPW estimate with fewer labels than existing baselines. In practice, we show that our method can match confidence intervals obtained with 361 random samples using only 90 optimized samples - saving 75% of the labeling budget.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript claims to derive a closed-form expression for optimal batch sampling probabilities that minimizes the asymptotic variance of a doubly robust (AIPW-style) estimator for the average treatment effect when outcomes are missing or costly to annotate. It extends the framework to unstructured data (text/images) and reports empirical gains in MSE and label efficiency, including a 75% reduction in labeling budget to match full-sample confidence intervals on simulated data and a real homelessness-services outreach dataset.

Significance. If the closed-form derivation is valid under explicitly stated and verifiable assumptions on the nuisance functions, the approach would supply a practical, low-overhead method for adaptive annotation in causal studies with missing outcomes, directly addressing budget constraints in policy-relevant applications such as healthcare and social services. The reported label savings are quantitatively large and would be a notable contribution if supported by transparent implementation details and error bars.

major comments (2)
  1. [Derivation of optimal batch sampling probability (methods section)] The central claim of a closed-form solution for optimal sampling probabilities rests on minimizing the asymptotic variance of the doubly robust estimator. The manuscript must explicitly state the functional forms or fixed-vs-estimated status assumed for the outcome regression and propensity score (and any other nuisances) that permit analytic differentiation and closed-form solution rather than numerical optimization; without these, the derivation reduces to a standard semiparametric variance expression that generally does not admit a closed form.
  2. [Experiments and results (simulation and real-data sections)] Empirical results report a 75% label saving (361 random vs. 90 optimized samples) but provide no standard errors, confidence intervals on the MSE or coverage metrics, or details on how the baselines (including AIPW) were implemented with respect to nuisance estimation. This undermines assessment of whether the gains are statistically reliable or sensitive to implementation choices.
minor comments (2)
  1. [Methods] Notation for the sampling probabilities p_i and their appearance in the influence function or variance expression should be introduced with an explicit equation before the minimization step.
  2. [Abstract and §1] The abstract and introduction should clarify whether the nuisance functions are estimated once on the unlabeled data or re-estimated after each batch, as this affects both the validity of the closed-form claim and practical deployment.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments. We address each major point below and will revise the manuscript to improve clarity and transparency.

read point-by-point responses
  1. Referee: [Derivation of optimal batch sampling probability (methods section)] The central claim of a closed-form solution for optimal sampling probabilities rests on minimizing the asymptotic variance of the doubly robust estimator. The manuscript must explicitly state the functional forms or fixed-vs-estimated status assumed for the outcome regression and propensity score (and any other nuisances) that permit analytic differentiation and closed-form solution rather than numerical optimization; without these, the derivation reduces to a standard semiparametric variance expression that generally does not admit a closed form.

    Authors: We will revise the methods section to explicitly state that the outcome regression and propensity score are treated as fixed (or estimated on an independent sample), which permits analytic differentiation of the asymptotic variance expression with respect to the batch sampling probabilities. This yields the closed-form optimum under the stated semiparametric model; we will also include the explicit variance formula being minimized to make the derivation fully verifiable. revision: yes

  2. Referee: [Experiments and results (simulation and real-data sections)] Empirical results report a 75% label saving (361 random vs. 90 optimized samples) but provide no standard errors, confidence intervals on the MSE or coverage metrics, or details on how the baselines (including AIPW) were implemented with respect to nuisance estimation. This undermines assessment of whether the gains are statistically reliable or sensitive to implementation choices.

    Authors: We will add standard errors and confidence intervals for all reported MSE and coverage metrics, obtained via repeated simulations or bootstrap resampling. We will also expand the implementation details in the main text and appendix to specify the exact nuisance estimators and cross-fitting procedures used for our method and all baselines, including AIPW, ensuring full reproducibility. revision: yes

Circularity Check

0 steps flagged

Derivation of closed-form optimal sampling probabilities is a standard analytic minimization with no reduction to inputs

full rationale

The central derivation minimizes the asymptotic variance of the doubly robust estimator with respect to batch sampling probabilities to obtain a closed-form optimum. This follows directly from standard semiparametric efficiency calculations in causal inference and does not rely on self-definition, fitted parameters relabeled as predictions, or load-bearing self-citations. The variance expression is taken from established theory rather than constructed from the paper's own outputs or evaluation data, making the result self-contained and independent.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claim rests on standard causal inference assumptions plus the existence of an analytically minimizable variance expression; no free parameters or invented entities are introduced in the abstract.

axioms (2)
  • domain assumption Outcomes are missing at random conditional on observed covariates and treatment assignment.
    Required for consistency of the doubly robust estimator with missing outcomes.
  • domain assumption The asymptotic variance of the doubly robust estimator admits a closed-form expression that can be minimized with respect to sampling probabilities.
    This is the mathematical premise that enables the claimed closed-form solution.

pith-pipeline@v0.9.0 · 5731 in / 1331 out tokens · 39154 ms · 2026-05-23T02:42:44.554735+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Auditing LLMs for Algorithmic Fairness in Casenote-Augmented Tabular Prediction

    cs.CY 2026-04 unverdicted novelty 5.0

    Fine-tuned LLMs augmented with casenote summaries improve accuracy and reduce multi-class error disparities in housing placement prediction compared to tabular baselines.

Reference graph

Works this paper leans on

16 extracted references · 16 canonical work pages · cited by 1 Pith paper · 2 internal anchors

  1. [1]

    Double machine learning for sample selection models

    Michela Bia, Martin Huber, and Luk¨ as Laff¨ ers. Double machine learning for sample selection models. arXiv prepint arXiv:2012.00745,

  2. [2]

    Proximal causal inference with text data.arXiv preprint arXiv:2401.06687,

    Jacob M Chen, Rohit Bhattacharya, and Katherine A Keith. Proximal causal inference with text data.arXiv preprint arXiv:2401.06687,

  3. [3]

    Double debiased machine learning nonparametric inference with continuous treatments.arXiv preprint arXiv:2004.03036,

    Kyle Colangelo and Ying-Ying Lee. Double debiased machine learning nonparametric inference with continuous treatments.arXiv preprint arXiv:2004.03036,

  4. [4]

    End-to-end causal effect estimation from unstructured natural language data

    Nikita Dhawan, Leonardo Cotta, Karen Ullrich, Rahul Krishnan, and Chris J Maddison. End-to-end causal effect estimation from unstructured natural language data. InThe Thirty-eighth Annual Conference on Neural Information Processing Systems. Maria Dimakopoulou, Zhimei Ren, and Zhengyuan Zhou. Online multi-armed bandits with adaptive inference.Advances in N...

  5. [5]

    Zijun Gao, Yanjun Han, Zhimei Ren, and Zhengqing Zhou

    URL https://proceedings.neurips.cc/paper_files/paper/ 2023/file/d862f7f5445255090de13b825b880d59-Paper-Conference.pdf. Zijun Gao, Yanjun Han, Zhimei Ren, and Zhengqing Zhou. Batched multi-armed bandits problem. Advances in Neural Information Processing Systems, 32,

  6. [6]

    Causal direction of data collection matters: Implications of causal and anticausal learning for nlp.arXiv preprint arXiv:2110.03618,

    Zhijing Jin, Julius von K¨ ugelgen, Jingwei Ni, Tejas Vaidhya, Ayush Kaushal, Mrinmaya Sachan, and Bernhard Schoelkopf. Causal direction of data collection matters: Implications of causal and anticausal learning for nlp.arXiv preprint arXiv:2110.03618,

  7. [7]

    On Causal and Anticausal Learning

    Bernhard Sch¨ olkopf, Dominik Janzing, Jonas Peters, Eleni Sgouritsa, Kun Zhang, and Joris Mooij. On causal and anticausal learning.arXiv preprint arXiv:1206.6471,

  8. [8]

    Adaptive neyman allocation.arXiv preprint arXiv:2309.08808,

    Jinglong Zhao. Adaptive neyman allocation.arXiv preprint arXiv:2309.08808,

  9. [9]

    Active Statistical Inference

    Tijana Zrnic and Emmanuel J Cand` es. Active statistical inference.arXiv preprint arXiv:2403.03208,

  10. [10]

    anti-causal learning

    studies the long-term intervention effects of job training on lifetime earnings, by using only short-term outcomes (surrogates) such as yearly earnings. In this regime, the ground truth cannot be obtained at the time of analysis. In this paper, we focus a different regime where obtaining the ground truth from expert data annotators is feasible but budget-...

  11. [11]

    So the full variance term is Var[ψz −ψ z′] =E 1 ez(X)·π(z, X) ·[Y−µ z(X)] 2 +E 1 ez′(X)·π(z ′, X) ·[Y−µ z′(X)] 2 +E (µz(X)−µ z′(X)) 2 −E[µ z(X)−µ z′(X)] 2 =E 1 ez(X)·π(z, X) ·[Y−µ z(X)] 2 +E 1 ez′(X)·π(z ′, X) ·[Y−µ z′(X)] 2 + Var µz(X)−µ z′(X) Rewriting the bound from Hahn (1998), we get V≥E 1 ez(X)·π(z, X) ·[Y−µ z(X)] 2 +E 1 ez′(X)·π(z ′, X) ·[Y−µ z′(X)...

  12. [12]

    3]; it follows readily from following their proof of Thm

    The objective function arises from the asymptotic variance expression in [Colangelo and Lee, 2020, Thm. 3]; it follows readily from following their proof of Thm. 3 with our analysis of the asymptotic variance as in Proposition

  13. [13]

    F Additional Lemmas F.1 Results appearing in other works, stated for completeness

    Putting these results from Step 1 and Step 2 together, along with the fact that nt,k n → 1 K , gives the theorem. F Additional Lemmas F.1 Results appearing in other works, stated for completeness. Lemma 1(Conditional convergence implies unconditional convergence, from [Chernozhukov et al., 2018]).Lemma 6.1. (Conditional Convergence implies unconditional) ...

  14. [14]

    (b) Let {Am} be a sequence of positive constants

    In particular, this occurs if E[∥X m∥q /ϵq m |Y m]→ P r 0for some q≥ 1, by Markov’s inequality. (b) Let {Am} be a sequence of positive constants. If ∥Xm∥ = OP (Am) condi- tional on Ym, namely, that for any ℓm → ∞, Pr (∥Xm∥> ℓ mAm |Y m)→ P r 0, then ∥Xm∥ = OP (Am) unconditionally, namely, that for anyℓ m → ∞,Pr (∥X m∥> ℓ mAm)→0. Lemma 2(Chebyshev’s inequal...

  15. [15]

    (4) 31 In the below, we drop thezargument. By the triangle inequality, boundedness of 1/ˆe(X)≤ν e, and ofσ 2(X)≤B σ2: p ˆσ2(X)/ˆe2(X)− p σ2(X)/e 2(X) 2 = p ˆσ2(X)/ˆe2(X)± p σ2(X)/ˆe2(X)− p σ2(X)/e 2(X) 2 ≤ν e p ˆσ2(X)− p σ2(X) 2 +B σ2 1 e(X) − 1 ˆe(X) 2 For the second term: Bσ2 1 e(X) − 1 ˆe(X) 2 ≤B σ2 1 e(X) − 1 ˆe(X) 2 ≤B σ2νe ∥e(X)−ˆe(X)∥ 2 since 1/e(X...

  16. [16]

    ground-truth

    We sample each covariate X∈R 5 from a standard normal distribution, X∼ N (0, I5). Treatment Z is drawn with logistic probability γz(X) = (1 +e X2+X3+0.5). We defineσ 2 z(X) as follows: σ2 1(X) := max[1.3 + 0.4sin(X1),0] σ2 0(X) := max[3.5 + 0.3cos(X3),0]. Finally, the outcome models are defined as: Y(0) = 5 +X 1 −2X 2 +ϵ 0 Y(1) =Y(0) +θ 0 +ϵ 1, where ϵ0 ∼...