Improving Survey Inference in Two-phase Designs Using Bayesian Machine Learning

Abigail Greenleaf; Anyu Zhu; Lauren Kennedy; Qixuan Chen; Xinru Wang

arXiv: 2306.04119 · v2 · submitted 2023-06-07 · 📊 stat.ME

Xinru Wang, Anyu Zhu, Lauren Kennedy, Abigail Greenleaf, Qixuan Chen

Pith reviewed 2026-05-24 08:44 UTC · model grok-4.3

classification 📊 stat.ME

keywords two-phase sampling · multiple imputation · Bayesian trees · survey inference · complex survey design · population means · weighted estimators · Rubin's rules

The pith

Bayesian tree-based imputation yields more accurate population mean estimates from two-phase samples than traditional weighting.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

In two-phase sampling, a large phase-one survey with complex design feeds into a smaller phase-two subsample whose analysis often relies on highly variable weights. This paper develops a Bayesian tree-based multiple imputation procedure that uses phase-one data to fill in phase-two values while embedding the strata and cluster structure of the parent survey. Simulations across various settings show the resulting estimates have less bias, lower root mean squared error, and narrower intervals whose coverage stays close to the nominal 95 percent. Rubin's rules for combining imputations produce valid variance estimates, and the method is demonstrated on real cellphone survey data about COVID-19 vaccination drawn from a Ugandan population-based HIV assessment.

Core claim

The authors establish that a Bayesian tree-based multiple imputation approach, which incorporates the strata and clusters from the parent complex survey design, produces superior estimates of population means from the phase II subsample. Through extensive simulations, this method achieves smaller bias, lower root mean squared error, and narrower 95% confidence intervals with coverage rates nearer the nominal level than conventional weighted estimators. The paper further shows that Rubin's variance estimation remains valid under this imputation scheme and applies the technique to data from a subcohort of the 2020 Uganda Population-based HIV Impact Assessment Survey.
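For reference, Rubin's rules combine the M completed-data estimates and their variances into one point estimate, a total variance, and a degrees-of-freedom value. The sketch below is a minimal, generic illustration in Python. The function name `rubin_combine` and the toy inputs are assumptions made for illustration and are not taken from the paper.

```python
from statistics import mean, variance


def rubin_combine(estimates, variances):
    """Combine M completed-data results with Rubin's rules.

    estimates: point estimates, one per imputed data set.
    variances: the matching within-imputation variance estimates.
    Returns (pooled estimate, total variance, degrees of freedom).
    """
    m = len(estimates)
    if m < 2 or m != len(variances):
        raise ValueError("need at least two imputations with matching variances")

    q_bar = mean(estimates)        # pooled point estimate
    w = mean(variances)            # average within-imputation variance
    b = variance(estimates)        # between-imputation variance (M - 1 denominator)
    total = w + (1 + 1 / m) * b    # total variance

    # Classical degrees of freedom; infinite when the imputations all agree.
    if b == 0:
        df = float("inf")
    else:
        df = (m - 1) * (1 + w / ((1 + 1 / m) * b)) ** 2
    return q_bar, total, df


# Toy inputs (hypothetical numbers, for illustration only).
pooled, total_var, df = rubin_combine(
    [0.50, 0.52, 0.48, 0.51, 0.49],
    [0.0004, 0.0004, 0.0004, 0.0004, 0.0004],
)
print(pooled, total_var, df)
```

An interval would then take the pooled estimate plus or minus a t quantile with `df` degrees of freedom times the square root of the total variance. The quantile lookup is omitted here to keep the sketch within the standard library.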

What carries the argument

Bayesian tree-based multiple imputation models that integrate the survey design features of strata and clusters.
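To make the multiple-imputation workflow concrete, the sketch below fills in the phase-two outcome for phase-one units that were not subsampled, then computes a weighted mean on each completed data set. It is not the authors' model: a within-stratum approximate Bayesian bootstrap is swapped in as a simple stand-in for draws from a Bayesian tree model, and the use of phase-one weights for the completed-data mean is an assumption. The record layout (`stratum`, `weight`, `y`) and the function name are hypothetical.

```python
import random


def abb_completed_means(records, m, seed=0):
    """Return m completed-data weighted means.

    records: list of dicts with keys
        'stratum' - stratum label from the parent survey,
        'weight'  - phase-one survey weight (assumed known for every unit),
        'y'       - phase-two outcome, or None if the unit was not subsampled.
    Stand-in imputer: approximate Bayesian bootstrap within strata,
    NOT a Bayesian tree model.
    """
    rng = random.Random(seed)

    # Observed outcomes (donors) grouped by stratum.
    donors = {}
    for r in records:
        if r["y"] is not None:
            donors.setdefault(r["stratum"], []).append(r["y"])
    for r in records:
        if r["y"] is None and r["stratum"] not in donors:
            raise ValueError(f"no observed outcomes in stratum {r['stratum']!r}")

    total_weight = sum(r["weight"] for r in records)
    means = []
    for _ in range(m):
        # Step 1: resample the donor pool with replacement in each stratum,
        # which propagates uncertainty about the donor distribution.
        pools = {s: rng.choices(ys, k=len(ys)) for s, ys in donors.items()}
        # Step 2: impute each missing outcome by a draw from the resampled pool.
        weighted_sum = 0.0
        for r in records:
            y = r["y"] if r["y"] is not None else rng.choice(pools[r["stratum"]])
            weighted_sum += r["weight"] * y
        means.append(weighted_sum / total_weight)
    return means


# Hypothetical toy data: two strata, some units without phase-two outcomes.
toy = [
    {"stratum": "A", "weight": 2.0, "y": 1},
    {"stratum": "A", "weight": 2.0, "y": 0},
    {"stratum": "A", "weight": 2.0, "y": None},
    {"stratum": "B", "weight": 5.0, "y": 1},
    {"stratum": "B", "weight": 5.0, "y": None},
]
print(abb_completed_means(toy, m=5))
```

The sketch returns only the completed-data point estimates. Combining them with Rubin's rules would also require a design-based variance estimate for each completed data set, which is not shown.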

If this is right

Population mean estimates exhibit reduced bias relative to weighted analysis.
Root mean squared error decreases compared to traditional methods.
Confidence intervals become narrower while preserving appropriate coverage.
Rubin's rules yield valid inference for the imputed estimates.
The procedure applies directly to real-world two-phase public health surveys such as the Uganda COVID-19 vaccination study.
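The first three items refer to standard simulation metrics. As a rough sketch of how such metrics are usually computed from simulation replicates, the Python function below takes per-replicate estimates and interval bounds together with the known true mean. The function name and the toy numbers are hypothetical and are not taken from the paper's simulation study.

```python
from math import sqrt
from statistics import mean


def summarize_replicates(estimates, lowers, uppers, truth):
    """Summarize simulation replicates against a known true value.

    Returns a dict with bias, RMSE, empirical interval coverage,
    and average interval width.
    """
    if not (len(estimates) == len(lowers) == len(uppers)) or not estimates:
        raise ValueError("inputs must be non-empty and of equal length")

    errors = [est - truth for est in estimates]
    covered = [lo <= truth <= hi for lo, hi in zip(lowers, uppers)]
    return {
        "bias": mean(errors),                          # average error
        "rmse": sqrt(mean(e * e for e in errors)),     # root mean squared error
        "coverage": sum(covered) / len(covered),       # share of intervals containing truth
        "avg_width": mean(hi - lo for lo, hi in zip(lowers, uppers)),
    }


# Hypothetical replicates, for illustration only.
summary = summarize_replicates(
    estimates=[1.0, 1.2, 0.8, 1.1],
    lowers=[0.8, 1.05, 0.6, 0.9],
    uppers=[1.2, 1.35, 1.0, 1.3],
    truth=1.0,
)
print(summary)
```

Running the same function on the replicates from each competing estimator gives a like-for-like comparison. A method with narrower intervals is preferable only if its coverage stays near the nominal level.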

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The approach could allow smaller phase-two samples to achieve the same precision, reducing data collection costs.
Extensions to other estimands, such as regression coefficients or proportions, may show similar gains.
In designs with high missingness or complex interactions, tree flexibility might offer advantages over parametric imputation.
Comparative studies with other machine learning imputers could identify when trees are optimal.

Load-bearing premise

The Bayesian tree models accurately represent the conditional distributions of the variables given the survey design information, so that the imputations do not add systematic error beyond what weighting would produce.

What would settle it

New simulations under the same two-phase design but with response variables whose relationships to covariates are not captured by trees, showing that the imputation method produces larger bias or coverage rates farther from 95 percent than the weighted estimator.

Original abstract

The two-phase sampling design is a cost-effective strategy widely used in public health research. Analyzing the Phase II sample often involves creating subsample-specific weights. However, these weights can be highly variable, leading to unstable weighted analyses. Alternatively, the rich data collected during the first phase can be leveraged to improve survey inference for the Phase II sample. In this paper, we propose a Bayesian tree-based multiple imputation (MI) approach for estimating population means using the Phase II sample, where the parent survey was conducted using a complex survey design. The design features of the parent survey, such as strata and clusters, are incorporated into the tree-based imputation models. Through simulations, we demonstrate that the tree-based MI method outperforms traditional weighted estimators, yielding smaller bias, lower root mean squared error, and narrower 95% confidence intervals, with coverage rates closer to the nominal level. Furthermore, we show that Rubin's variance estimation method provides valid statistical inference for population mean estimation in our setting. We illustrate the application of the proposed tree-based MI method using data from a cellphone survey on COVID-19 vaccination in Uganda, which represents a subcohort sample drawn from the 2020 Uganda Population-based HIV Impact Assessment Survey.
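For context, the traditional weighted analysis that the abstract contrasts with MI can be illustrated by a ratio-type (Hájek) weighted mean over the Phase II sample. The sketch below is a generic Python illustration. The function name and the toy weights are assumptions, and the paper's actual weighted estimators may be constructed differently.

```python
def weighted_mean(values, weights):
    """Hajek-type weighted mean: sum(w * y) / sum(w)."""
    if len(values) != len(weights) or not values:
        raise ValueError("values and weights must be non-empty and of equal length")
    total_weight = sum(weights)
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive number")
    return sum(w * y for w, y in zip(weights, values)) / total_weight


# Hypothetical Phase II outcomes (1 = vaccinated) and subsample-specific weights.
outcomes = [1, 0, 1, 1]
weights = [10, 30, 10, 50]

print(weighted_mean(outcomes, weights))             # weighted estimate
print(weighted_mean(outcomes, [1] * len(outcomes))) # unweighted comparison
```

In this toy example a single unit carries half of the total weight, so its outcome strongly influences the estimate. That sensitivity is the kind of instability the abstract attributes to highly variable weights.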

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

Bayesian tree MI that folds in strata and clusters beats weighting on the reported simulation metrics for two-phase means, but the gains rest on the trees capturing the relationships without new bias.

The paper proposes Bayesian tree-based multiple imputation for phase II samples in two-phase surveys, with the parent survey's strata and clusters built into the imputation models. The simulations show smaller bias, lower RMSE, narrower intervals, and closer-to-nominal coverage than standard weighted estimators, and Rubin's rules remain valid for variance estimation. The Uganda cellphone survey example illustrates the application on real subcohort data from a larger HIV impact assessment. What is new is the targeted use of trees for this exact setting, rather than parametric models or approaches that ignore the design features. The simulations give direct head-to-head numbers, and the real-data case adds some practical grounding. The central assumption is that the tree models recover the relevant relationships while respecting the sampling structure; if that holds, the approach is a reasonable way to stabilize estimates without relying on highly variable weights. The soft spot is that performance will likely drop if the phase I variables are only weakly predictive or if the trees miss important structure not tested in the simulations. The abstract also says little about the simulation design, so the full paper needs to show that the setups are realistic for typical public health two-phase studies. The paper is aimed at survey statisticians working on complex designs in epidemiology. It offers a concrete method plus simulation evidence, so it deserves peer review rather than a desk reject.

Referee Report

0 major / 3 minor

Summary. The paper proposes a Bayesian tree-based multiple imputation (MI) method for estimating population means from Phase II samples in two-phase designs, where the parent survey uses complex sampling (strata and clusters). The approach incorporates these design features into the imputation models. Simulations are used to claim that the method yields smaller bias, lower RMSE, narrower 95% CIs, and coverage closer to nominal levels compared to traditional weighted estimators; Rubin's variance estimator is asserted to be valid. The method is illustrated on a cellphone survey subsample from the 2020 Uganda Population-based HIV Impact Assessment Survey.

Significance. If the simulation results hold under realistic conditions that match the paper's design assumptions, the work could offer a practical alternative to unstable weighting in public-health two-phase surveys by leveraging Phase I data. The explicit incorporation of strata/clusters into tree-based imputations and the check on Rubin's rules are strengths for applied survey statistics.

minor comments (3)

The abstract states that simulations demonstrate outperformance, but the manuscript should include a dedicated section (e.g., §4 or §5) with explicit details on simulation design, sample sizes, number of replicates, how strata/clusters were generated, and any sensitivity checks to allow readers to assess whether the reported gains are robust.
Clarify in the methods section how the Bayesian tree models are specified to respect the complex survey design (e.g., whether cluster-level random effects or design weights are used as predictors) so that the claim of 'properly incorporating' the design is fully reproducible.
In the application section, report the effective sample sizes or weight variability for the weighted estimator to provide context for why the MI approach yields narrower intervals.
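The quantities requested in the last comment are commonly summarized with Kish's approximate effective sample size and the coefficient of variation of the weights. The Python sketch below shows one conventional way to compute both. The function name and the toy weights are hypothetical, and the paper may report weight variability differently.

```python
from statistics import mean, pstdev


def weight_diagnostics(weights):
    """Return Kish's approximate effective sample size and the weight CV.

    n_eff = (sum of weights)^2 / (sum of squared weights)
    cv    = population standard deviation of weights / mean weight
    The two are linked by n_eff = n / (1 + cv^2).
    """
    if not weights or any(w <= 0 for w in weights):
        raise ValueError("weights must be non-empty and strictly positive")
    n_eff = sum(weights) ** 2 / sum(w * w for w in weights)
    cv = pstdev(weights) / mean(weights)
    return n_eff, cv


# Equal weights: no loss of precision from weighting.
print(weight_diagnostics([1, 1, 1, 1]))

# One dominant weight (hypothetical): the effective sample size shrinks sharply.
print(weight_diagnostics([1, 1, 1, 9]))
```

A small effective sample size relative to the nominal Phase II sample size would help explain why weighted intervals are wide, which is the context the comment asks the authors to supply.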

Simulated Authors' Rebuttal

0 responses · 0 unresolved

We thank the referee for their thoughtful summary of our work and for recommending minor revision. We are encouraged by the recognition of the method's potential as a practical alternative in public-health surveys and the noted strengths in incorporating strata/clusters and checking Rubin's rules. No specific major comments were raised in the report.

Circularity Check

0 steps flagged

No significant circularity identified

The paper proposes a Bayesian tree-based MI method for two-phase surveys, incorporates design features into imputation models, and validates performance via independent simulations against weighted estimators. No equations, derivations, fitted-input predictions, or self-citation chains are present that reduce the reported bias/RMSE/coverage improvements to the method definition by construction. The evidence is simulation-driven and externally falsifiable.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Based solely on the abstract; no explicit free parameters, axioms, or invented entities are described in sufficient detail to enumerate.

pith-pipeline@v0.9.0 · 5748 in / 1118 out tokens · 25143 ms · 2026-05-24T08:44:01.243947+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · unclear

Relation between the paper passage and the cited Recognition theorem.

We find the tree-based MI method outperforms weighting methods with smaller bias, reduced root mean squared error, and narrower 95% confidence intervals...

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.