arxiv: 2604.07636 · v1 · submitted 2026-04-08 · 📊 stat.ME

Sample-split REGression (SREG): A robust estimator for high-dimensional survey data

Yonghyun Kwon, Shu Yang, Jae Kwang Kim

Pith reviewed 2026-05-10 16:58 UTC · model grok-4.3

classification 📊 stat.ME

keywords: survey sampling · regression estimation · high-dimensional data · sample splitting · cross-fitting · GREG estimator · asymptotic normality · bias reduction


The pith

Sample splitting removes double-use bias from regression-assisted survey estimators in high dimensions.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

When the number of auxiliary variables grows with the sample size, the usual GREG estimator re-uses sampled outcomes for both model fitting and residual correction, which can produce non-negligible bias under informative sampling even when the working model is correct. The SREG estimator splits the sample into K folds, fits the regression on all but one fold, and applies the out-of-fold prediction to form each unit's residual. This construction makes the estimator first-order equivalent to an oracle difference estimator that knows the true regression function, without any requirement that the fitted coefficients be root-n consistent. Asymptotic normality follows, and a variance estimator built from the cross-fitted residuals is shown to be consistent. The required conditional fluctuation condition is verified for simple random, stratified, and rejective sampling.
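For reference, the estimators contrasted above can be written side by side. The two SREG lines follow the paper's Section 3.2 as quoted in the Figure 1 caption on this page. The oracle difference estimator is given in its standard textbook form, with m the true regression function and x_i the auxiliary vector of unit i; that notation is an assumption here and may differ from the paper's.

```latex
\begin{align*}
\hat T_{\mathrm{diff}}
  &= \sum_{i \in U_N} m(x_i)
   + \sum_{i \in A} \frac{1}{\pi_i}\bigl\{ y_i - m(x_i) \bigr\}, \\
\hat T^{(-k)}_{k,\mathrm{reg}}
  &= \sum_{i \in U_k} \hat m^{(-k)}_i
   + \sum_{i \in A_k} \frac{1}{\pi_i}\bigl\{ y_i - \hat m^{(-k)}_i \bigr\}, \\
\hat T_{\mathrm{SREG}}
  &= \sum_{k=1}^{K} \hat T^{(-k)}_{k,\mathrm{reg}} .
\end{align*}
```

The only difference between the oracle and fold-wise forms is that the unknown m(x_i) is replaced by a prediction fitted without unit i's own fold, so the outcome y_i never enters the fit that produces its own residual.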

Core claim

The sample-split regression estimator constructed with K-fold cross-fitting is first-order equivalent to the oracle difference estimator under a weak prediction-norm consistency requirement on the cross-fitted predictions, without requiring root-n consistent estimation of regression coefficients; asymptotic normality is established and a variance estimator based on cross-fitted residuals is proved consistent, with the key conditional fluctuation assumption verified for simple random, stratified, and rejective sampling.

What carries the argument

The SREG estimator that pairs each unit's residual with an out-of-fold prediction obtained by K-fold cross-fitting of the working regression model.
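A minimal sketch of this construction may help fix ideas. It assumes a ridge working model with an unpenalised intercept, an unweighted fit, and random balanced folds over the population; none of these choices is confirmed by this page, and the names `ridge_fit`, `ridge_predict`, and `sreg_total` are hypothetical.

```python
import numpy as np

def ridge_fit(X, y, lam=1.0):
    """Ridge coefficients with an unpenalised intercept (assumed working model)."""
    Xc = np.column_stack([np.ones(len(X)), X])
    penalty = lam * np.eye(Xc.shape[1])
    penalty[0, 0] = 0.0  # do not shrink the intercept
    return np.linalg.solve(Xc.T @ Xc + penalty, Xc.T @ y)

def ridge_predict(beta, X):
    return np.column_stack([np.ones(len(X)), X]) @ beta

def sreg_total(X_pop, y_samp, samp_idx, pi_samp, K=3, lam=1.0, seed=0):
    """K-fold sample-split regression estimate of the population total.

    X_pop    : (N, p) auxiliary variables for every population unit
    y_samp   : (n,) outcomes of the sampled units
    samp_idx : (n,) population indices of the sampled units
    pi_samp  : (n,) first-order inclusion probabilities of the sampled units
    Returns the estimate and the cross-fitted residuals.
    """
    rng = np.random.default_rng(seed)
    N = X_pop.shape[0]
    fold_pop = rng.permutation(N) % K      # balanced random folds U_1, ..., U_K
    fold_samp = fold_pop[samp_idx]         # fold of each sampled unit (A_k)
    resid = np.empty(len(samp_idx))
    total = 0.0
    for k in range(K):
        train = fold_samp != k             # sampled units outside fold k
        in_k = fold_samp == k              # sampled units inside fold k
        beta = ridge_fit(X_pop[samp_idx[train]], y_samp[train], lam)
        m_pop_k = ridge_predict(beta, X_pop[fold_pop == k])
        resid[in_k] = y_samp[in_k] - ridge_predict(beta, X_pop[samp_idx[in_k]])
        # fold-wise estimator: predicted total on U_k plus weighted residuals on A_k
        total += m_pop_k.sum() + np.sum(resid[in_k] / pi_samp[in_k])
    return total, resid

# Illustrative use under simple random sampling (pi_i = n / N).
rng = np.random.default_rng(1)
N, n, p = 500, 100, 30
X_pop = rng.normal(size=(N, p))
y_pop = X_pop[:, 0] + rng.normal(size=N)
samp_idx = rng.choice(N, size=n, replace=False)
t_hat, resid = sreg_total(X_pop, y_pop[samp_idx], samp_idx, np.full(n, n / N), K=5)
print(t_hat, y_pop.sum())
```

Each residual is computed from coefficients that were fitted without that unit's fold, which is the separation the estimator relies on.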

If this is right

The estimator remains first-order equivalent to the oracle difference estimator under only weak prediction-norm consistency.
Asymptotic normality holds without root-n consistency of the regression coefficients.
A variance estimator computed from the cross-fitted residuals is consistent.
The required assumptions are satisfied for simple random, stratified, and rejective sampling designs.
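As a rough illustration of how cross-fitted residuals could feed a variance estimate, the sketch below plugs them into the standard variance formula for an estimated total under simple random sampling without replacement. The paper's exact variance estimator is not reproduced on this page, so this is an assumption about its general shape; the function names are hypothetical.

```python
import math
import statistics

def srs_variance_of_total(residuals, N):
    """Variance estimate N^2 (1 - n/N) s_e^2 / n for a total under SRS without replacement.

    residuals : cross-fitted residuals y_i - m_hat_i^(-k) of the n sampled units
    N         : population size
    """
    n = len(residuals)
    s2 = statistics.variance(residuals)  # sample variance with divisor n - 1
    return N ** 2 * (1.0 - n / N) * s2 / n

def wald_interval(total_hat, var_hat, z=1.96):
    """Normal-approximation interval, motivated by the asymptotic normality result."""
    half_width = z * math.sqrt(var_hat)
    return total_hat - half_width, total_hat + half_width

# Toy numbers, purely illustrative.
resid = [1.0, -1.0, 1.0, -1.0]
var_hat = srs_variance_of_total(resid, N=8)
lower, upper = wald_interval(100.0, var_hat)
print(var_hat, lower, upper)
```

For stratified or rejective designs the finite-population correction and weights would change; only the idea of replacing in-sample residuals with out-of-fold ones carries over.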

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Machine-learning predictors could be substituted for linear regression inside the same cross-fitting structure to handle even richer auxiliary information.
The same splitting device may reduce bias in other model-assisted survey estimators that currently suffer from double use of outcomes.
Survey practitioners could include larger auxiliary data sets with less risk of overfitting-induced bias, provided the design and prediction conditions hold.
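The first extension can be sketched by making the learner a plug-in argument of the cross-fitting loop. The k-nearest-neighbour regressor below is only a self-contained stand-in for a machine-learning predictor; it is not taken from the paper, and `cross_fit_predictions` and `knn_fit_predict` are hypothetical names.

```python
import numpy as np

def knn_fit_predict(X_train, y_train, X_test, k=5):
    """Average the outcomes of the k nearest training points (stand-in learner)."""
    d = ((X_test[:, None, :] - X_train[None, :, :]) ** 2).sum(axis=2)
    k = min(k, len(X_train))
    nearest = np.argsort(d, axis=1)[:, :k]
    return y_train[nearest].mean(axis=1)

def cross_fit_predictions(X, y, folds, fit_predict):
    """Out-of-fold predictions: each unit is predicted by a fit that excludes its fold."""
    preds = np.empty(len(y), dtype=float)
    for k in np.unique(folds):
        test = folds == k
        preds[test] = fit_predict(X[~test], y[~test], X[test])
    return preds

X = np.arange(6.0).reshape(-1, 1)
y = X[:, 0] ** 2
folds = np.array([0, 1, 2, 0, 1, 2])

one_nn = lambda Xa, ya, Xb: knn_fit_predict(Xa, ya, Xb, k=1)
in_sample = knn_fit_predict(X, y, X, k=1)                  # reuses each unit's own outcome
out_of_fold = cross_fit_predictions(X, y, folds, one_nn)   # never sees the unit's own fold
print(y - in_sample)
print(y - out_of_fold)
```

With a flexible learner such as 1-nearest-neighbour, the in-sample residuals collapse to zero because every unit is its own nearest neighbour, which is an extreme form of the double-use problem; the out-of-fold residuals do not.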

Load-bearing premise

The sampling design must satisfy a conditional fluctuation condition and the cross-fitted predictions must obey weak prediction-norm consistency.

What would settle it

A simulation in which the cross-fitted predictions satisfy prediction-norm consistency and the design satisfies the conditional fluctuation condition, yet the SREG estimator differs from the oracle difference estimator by more than o_p(n^{-1/2}), would falsify the first-order equivalence claim.
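A check of this kind could start from a small Monte Carlo harness that computes SREG and the oracle difference estimator on the same samples. The sketch below assumes simple random sampling, a linear truth, and an ordinary least-squares working fit; these are illustrative choices, not the paper's simulation design, and the function names are hypothetical.

```python
import numpy as np

def ols_fit(X, y):
    A = np.column_stack([np.ones(len(X)), X])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    return beta

def predict(beta, X):
    return np.column_stack([np.ones(len(X)), X]) @ beta

def one_replicate(rng, N=2000, n=200, p=20, K=5, sigma=1.0):
    """Return (SREG estimate, oracle difference estimate, true total) for one sample."""
    X = rng.normal(size=(N, p))
    m = X @ (np.ones(p) / np.sqrt(p))      # true regression function, known only to the oracle
    y = m + sigma * rng.normal(size=N)
    samp = rng.choice(N, size=n, replace=False)   # SRS without replacement
    pi = n / N
    oracle = m.sum() + np.sum((y[samp] - m[samp]) / pi)
    fold = rng.permutation(N) % K
    sreg = 0.0
    for k in range(K):
        train = samp[fold[samp] != k]      # sampled units outside fold k
        test = samp[fold[samp] == k]       # sampled units inside fold k
        beta = ols_fit(X[train], y[train])
        sreg += predict(beta, X[fold == k]).sum()
        sreg += np.sum((y[test] - predict(beta, X[test])) / pi)
    return sreg, oracle, y.sum()

rng = np.random.default_rng(0)
n, N = 200, 2000
gaps = []
for _ in range(100):
    sreg, oracle, truth = one_replicate(rng, N=N, n=n)
    gaps.append(np.sqrt(n) * (sreg - oracle) / N)  # scaled gap on the mean scale
print("mean |scaled gap|:", np.mean(np.abs(gaps)))
```

Repeating the run for increasing n, and for learners that do or do not predict well, would show whether the scaled gap shrinks as the equivalence claim requires. An informative design would be needed to reproduce the GREG bias itself.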

Figures

Figures reproduced from arXiv: 2604.07636 by Jae Kwang Kim, Shu Yang, Yonghyun Kwon.

**Figure 1.** Illustration of the SREG estimator for K = 3.

3.2 Fold-wise regression estimators and aggregation. For each fold k, define the fold-wise sample-split regression estimator

$$\hat T^{(-k)}_{k,\mathrm{reg}} := \sum_{i \in U_k} \hat m^{(-k)}_i + \sum_{i \in A_k} \frac{1}{\pi_i}\left\{ y_i - \hat m^{(-k)}_i \right\}. \qquad (6)$$

The proposed K-fold sample-split regression estimator of the overall population total $T := \sum_{i \in U_N} y_i$ is

$$\hat T_{\mathrm{SREG}} := \sum_{k=1}^{K} \hat T^{(-k)}_{k,\mathrm{reg}}. \qquad (7)$$

**Figure 2.** RMSE (solid line) and bias (dashed line) for different values of …

Original abstract

Model-assisted regression estimation is fundamental in survey sampling for incorporating auxiliary information. However, when the auxiliary dimension grows with the sample size, the standard Generalized regression (GREG) estimator can exhibit non-negligible bias under informative sampling, even when the working model is correctly specified. This failure stems from the double use of sampled outcomes simultaneously for fitting the regression and for forming the residual correction. We propose a sample-split REGression (SREG) estimator based on K-fold cross-fitting that eliminates this bias by pairing each unit's residual with an out-of-fold prediction. The resulting estimator is first-order equivalent to the oracle difference estimator under a weak prediction-norm consistency requirement, without requiring root-n consistent estimation of regression coefficients. We establish asymptotic normality and prove consistency of a variance estimator based on cross-fitted residuals. The key conditional fluctuation assumption is verified for simple random, stratified, and rejective sampling. Simulations demonstrate that SREG effectively removes high-dimensional bias while maintaining competitive efficiency.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

SREG fixes double-use bias in high-dim GREG via cross-fitting under weak assumptions.


The main thing to know is that SREG uses K-fold cross-fitting on the GREG estimator to remove the bias caused by double-using sampled outcomes for both regression fitting and residual correction in high-dimensional settings. This construction is new because it achieves first-order equivalence to the oracle difference estimator under a weak prediction-norm consistency requirement, without needing root-n consistent coefficient estimates. The authors establish asymptotic normality and prove that a variance estimator based on the cross-fitted residuals is consistent. They also verify the key conditional fluctuation assumption for simple random, stratified, and rejective sampling.

The paper handles the problem well by targeting a real failure mode of standard GREG when auxiliary dimensions grow with sample size, and by keeping the approach within the survey sampling framework without extra parametric assumptions. The simulations show effective bias removal while maintaining efficiency.

The soft spots are that everything rests on the prediction consistency and the fluctuation assumption, which the paper checks for basic designs but might require more work for other sampling methods or when predictions are not accurate enough. The simulation details aren't fully clear from the abstract, so it's hard to see how they chose the high-dimensional setups or handled data exclusion. If the working model is badly misspecified, the gains could be limited. There is also little on computational cost for large surveys.

This paper is for survey statisticians and researchers focused on model-assisted estimation with high-dimensional data. Readers who work on practical survey estimation problems would find the estimator and its properties useful. It shows honest engagement with the literature and a solid methodological advance, so it deserves serious referee time. I would recommend sending it to peer review.

Referee Report

2 major / 3 minor

Summary. The manuscript proposes the sample-split REGression (SREG) estimator for high-dimensional survey data. It uses K-fold cross-fitting to pair each unit's residual with an out-of-fold prediction, thereby avoiding the double-use bias that arises in standard GREG when auxiliary dimension grows with sample size. The central claim is that SREG is first-order equivalent to the oracle difference estimator under a weak prediction-norm consistency requirement on the cross-fitted predictions, without requiring root-n consistent estimation of the regression coefficients. Asymptotic normality is established and a variance estimator based on cross-fitted residuals is shown to be consistent. The key conditional fluctuation assumption is verified for simple random, stratified, and rejective sampling. Simulations illustrate effective removal of high-dimensional bias while retaining competitive efficiency.

Significance. If the stated asymptotic results hold, the work is significant for survey sampling methodology because it supplies a practical, bias-robust method for incorporating high-dimensional auxiliary information under informative sampling. The relaxation of the usual root-n consistency requirement on the regression fit is a genuine strength, as is the explicit verification of the conditional fluctuation assumption for three standard sampling designs. The cross-fitting construction directly addresses the double-use problem identified in the abstract and supplies a variance estimator whose consistency is proved under the same weak conditions.

major comments (2)

[§3.1, Theorem 1] The first-order equivalence to the oracle difference estimator is conditioned on the weak prediction-norm consistency requirement together with the conditional fluctuation assumption; the manuscript should state explicitly whether this rate condition is verified empirically in the simulations of §5 or remains purely theoretical.
[§4.2] The consistency proof for the variance estimator based on cross-fitted residuals must account for the dependence induced by the K-fold partitioning; it is unclear from the stated conditions whether the additional covariance terms vanish at the required rate.

minor comments (3)

[Abstract and §1] The abstract and introduction should specify the precise growth rate allowed for the auxiliary dimension p_n relative to n under which the bias of standard GREG becomes non-negligible.
[§5] Table 1 and Figure 2: Add standard-error bars or confidence intervals to the reported bias and MSE values so that the efficiency comparisons across sampling designs are statistically interpretable.
[§2] Notation for the out-of-fold predictor should be introduced once in §2 and used consistently thereafter; the current alternation between \hat{m}^{(-k)} and m_{-k} is distracting.

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for the careful review and positive recommendation. The comments help clarify the presentation of the theoretical results. We address each major comment below and will incorporate the necessary clarifications in the revised manuscript.


Referee: [§3.1, Theorem 1] The first-order equivalence to the oracle difference estimator is conditioned on the weak prediction-norm consistency requirement together with the conditional fluctuation assumption; the manuscript should state explicitly whether this rate condition is verified empirically in the simulations of §5 or remains purely theoretical.

Authors: We appreciate this request for clarification. The weak prediction-norm consistency condition in Theorem 1 is a theoretical requirement that ensures the first-order equivalence to the oracle difference estimator; it is not tied to a specific rate beyond o_p(1) under the stated assumptions. The simulations in §5 demonstrate the finite-sample bias reduction and efficiency of SREG but do not include an explicit empirical verification of the prediction-norm rate. In the revision we will add an explicit statement in §3.1 and §5 noting that the rate condition remains theoretical while the simulation results are consistent with the asymptotic claims under the designs considered.

Revision: yes

Referee: [§4.2] The consistency proof for the variance estimator based on cross-fitted residuals must account for the dependence induced by the K-fold partitioning; it is unclear from the stated conditions whether the additional covariance terms vanish at the required rate.

Authors: Thank you for pointing out this aspect of the variance proof. Section 4.2 establishes consistency of the cross-fitted residual variance estimator under the same weak conditions used for the point estimator, including the conditional fluctuation assumption. The dependence induced by the fixed K-fold partitioning is handled by bounding the relevant cross terms via the out-of-fold construction and the fact that the prediction errors are controlled in prediction norm; these terms are shown to be o_p(1) and do not affect the leading variance term. To improve readability we will expand the proof sketch in the revision to explicitly display the bounding argument for the additional covariance terms arising from the partitioning.

Revision: yes

Circularity Check

0 steps flagged

No significant circularity; derivation self-contained under stated assumptions


The SREG estimator is explicitly built with K-fold cross-fitting to separate regression fitting from residual correction, directly addressing the double-use bias of standard GREG. The first-order equivalence to the oracle difference estimator is conditioned on an external weak prediction-norm consistency requirement (not derived internally) plus the conditional fluctuation assumption, which the paper verifies for simple random, stratified, and rejective sampling. Asymptotic normality and variance consistency are proved separately. No equation reduces the claimed result to a tautology, fitted input renamed as prediction, or load-bearing self-citation chain; the central claims remain independent of the paper's own fitted quantities.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claim rests on the prediction-norm consistency of the cross-fitted regressor and the conditional fluctuation assumption for the sampling design; these are domain assumptions rather than derived results.

axioms (2)

Domain assumption: conditional fluctuation assumption on the sampling design. Invoked to establish asymptotic normality; stated to hold for simple random, stratified, and rejective sampling.
Domain assumption: weak prediction-norm consistency of the cross-fitted predictor. Required for first-order equivalence to the oracle difference estimator; it does not require root-n consistency of the coefficients.

