On the use of cross-fitting in causal machine learning with correlated units

Hasan Laith; Nima S. Hejazi; Salvador V. Balkus

arxiv: 2601.10899 · v2 · submitted 2026-01-15 · 📊 stat.ME

On the use of cross-fitting in causal machine learning with correlated units

Salvador V. Balkus , Hasan Laith , Nima S. Hejazi This is my paper

Pith reviewed 2026-05-16 13:21 UTC · model grok-4.3

classification 📊 stat.ME

keywords cross-fittingcausal machine learningcorrelated unitsnuisance estimationbias eliminationdoubly robustdependence structuresstatistical inference

0 comments

The pith

Standard cross-fitting eliminates bias in causal machine learning even when data units are correlated.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

In causal machine learning, nuisance models are typically fitted and evaluated on separate data folds through cross-fitting to remove bias from complex predictors. The paper establishes that this procedure works without modification when units are correlated, as in spatial or time-series data. Treating units as independent during fold assignment still cancels the primary bias terms. Simulations across different correlation patterns demonstrate equivalent or better performance in bias and precision compared to methods that explicitly separate correlated units.

Core claim

Performing cross-fitting as if study units were independent still eliminates key bias terms even when units may be correlated. This holds because the cross-fitting structure ensures that the dependence between units does not prevent the remainder terms from vanishing under standard convergence conditions for the nuisance estimators.

What carries the argument

Cross-fitting of nuisance estimators in causal machine learning, which partitions data into folds to ensure independence between fitting and evaluation, thereby orthogonalizing the estimator and removing first-order bias terms despite unit dependence.

If this is right

Causal estimators retain their asymptotic properties without needing custom cross-fitting for dependence.
Implementation of causal machine learning becomes simpler for clustered, spatial, and time-series data.
Simulation results indicate no loss in bias reduction or precision when ignoring correlations in fold assignment.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Practitioners could default to standard cross-fitting routines in software for dependent data settings to save development time.
Further work might examine the impact on variance estimation under strong dependence structures.
Analogous results could apply to other orthogonalized estimators in statistics beyond causal inference.

Load-bearing premise

The nuisance estimators converge at rates that remain valid even in the presence of dependence between study units.

What would settle it

An empirical example or simulation where the bias of standard cross-fitting exceeds that of correlation-adjusted cross-fitting under a specific dependence structure, such as strong spatial autocorrelation.

read the original abstract

In causal machine learning, the fitting and evaluation of nuisance models are often performed on separate partitions, or folds, of the observed data. This technique, called cross-fitting, eliminates bias introduced by the use of black-box predictive algorithms. When study units may be correlated, such as in spatial, clustered, or time-series data, investigators often design bespoke forms of cross-fitting to minimize correlation between folds. We prove that, perhaps contrary to popular belief, this is typically unnecessary: performing cross fitting as if study units were independent still eliminates key bias terms even when units may be correlated. In simulation experiments with various correlation structures, we show that causal machine learning estimators achieve the same or improved bias and precision under cross-fitting that ignores correlation compared to techniques striving to eliminate correlation between folds.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

Standard cross-fitting still removes the main bias terms even with correlated units, and the simulations show it performs at least as well as correlation-aware folds.

read the letter

The paper's core result is that ordinary cross-fitting, done as if units were independent, eliminates the key bias terms in causal machine learning estimators even when observations are dependent. This runs against the common practice of building special folds to break correlation between training and test sets in spatial, clustered, or time-series settings. The authors back the claim with a proof and a set of simulations across several correlation structures, where the standard approach matches or beats the bespoke versions on bias and precision.

Referee Report

2 major / 2 minor

Summary. The paper claims that in causal machine learning with potentially correlated units (spatial, clustered, or time-series data), standard cross-fitting—performed as if units were independent—still eliminates key bias terms arising from nuisance function estimation, contrary to the common practice of designing bespoke folds to minimize inter-fold correlation. This is supported by a theoretical argument and simulation experiments across various correlation structures demonstrating that estimators achieve the same or improved bias and precision compared to correlation-aware cross-fitting variants.

Significance. If the central theoretical result holds under appropriate conditions, the paper would have substantial practical significance by simplifying cross-fitting procedures in dependent-data settings and reducing the need for custom fold designs. The simulations provide useful empirical validation, but the result's impact hinges on whether the bias-elimination argument extends beyond i.i.d. assumptions without additional restrictions.

major comments (2)

[Proof section (post-abstract)] The central theoretical claim (detailed in the proof section following the abstract) that cross-fitting eliminates key bias terms relies on nuisance estimators satisfying o_p(n^{-1/4}) convergence rates that continue to hold under dependence. However, the manuscript does not specify the precise weak-dependence or mixing conditions under which these rates are preserved; without them, the remainder-term argument does not go through for strong positive dependence (e.g., AR(1) lag-1 correlation >0.8), as the effective sample size per fold shrinks.
[Simulation experiments] Simulation experiments (Section on numerical studies): the reported bias and precision improvements under standard cross-fitting are presented without explicit details on how nuisance estimators were tuned or whether post-hoc fold choices were made; this makes it difficult to assess whether the results are robust or sensitive to the specific correlation structures tested.

minor comments (2)

[Methods] Notation for the cross-fitting procedure could be clarified with an explicit diagram or pseudocode showing how folds are constructed under the 'as-if-independent' approach versus correlation-minimizing alternatives.
[Introduction] The abstract states that 'standard cross-fitting still eliminates key bias terms,' but the precise bias terms (e.g., which remainder in the influence function) should be named explicitly in the introduction for reader accessibility.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments, which help clarify the scope of our theoretical results and improve the reproducibility of our simulations. We address each major comment below and have made revisions to the manuscript to incorporate the suggestions.

read point-by-point responses

Referee: [Proof section (post-abstract)] The central theoretical claim (detailed in the proof section following the abstract) that cross-fitting eliminates key bias terms relies on nuisance estimators satisfying o_p(n^{-1/4}) convergence rates that continue to hold under dependence. However, the manuscript does not specify the precise weak-dependence or mixing conditions under which these rates are preserved; without them, the remainder-term argument does not go through for strong positive dependence (e.g., AR(1) lag-1 correlation >0.8), as the effective sample size per fold shrinks.

Authors: We agree that the proof relies on the nuisance estimators achieving the stated o_p(n^{-1/4}) rates under dependence, and that these rates require appropriate weak-dependence conditions to hold. In the revised manuscript, we will add an explicit statement of the assumed conditions (alpha-mixing with summable coefficients ensuring the rates are preserved) immediately following the main theorem. We note that for very strong dependence (e.g., AR(1) correlations exceeding 0.8), the effective sample size reduction is a practical concern that our simulations already explore; the bias-elimination property continues to hold asymptotically under the stated mixing conditions, though finite-sample performance may vary. revision: yes
Referee: [Simulation experiments] Simulation experiments (Section on numerical studies): the reported bias and precision improvements under standard cross-fitting are presented without explicit details on how nuisance estimators were tuned or whether post-hoc fold choices were made; this makes it difficult to assess whether the results are robust or sensitive to the specific correlation structures tested.

Authors: We thank the referee for highlighting the need for greater transparency. In the revised version, we will expand the numerical studies section to include full details on nuisance estimator tuning (hyperparameters selected via cross-validation on the training folds only, with fixed random seeds for reproducibility) and explicitly state that no post-hoc fold adjustments or optimizations were performed; all fold partitions were fixed in advance according to the data-generating process without reference to the final estimator performance. These additions will allow readers to assess robustness directly. revision: yes

Circularity Check

0 steps flagged

Theoretical proof of bias elimination is self-contained; no reduction to fitted inputs or self-citation chains

full rationale

The paper's central result is a mathematical proof that standard cross-fitting eliminates key bias terms even under unit dependence, relying on nuisance estimator convergence rates that are assumed to hold. This derivation does not reduce by construction to any fitted parameter, renamed empirical pattern, or load-bearing self-citation; the provided abstract and claim structure treat the proof as independent of the simulation validation, which serves only as corroboration rather than the source of the result. No quoted step equates a prediction to its own input or imports uniqueness via author overlap.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The result rests on standard causal assumptions (consistency, positivity, no unmeasured confounding) and regularity conditions for nuisance estimators that are not detailed in the abstract.

axioms (1)

domain assumption Standard causal identification assumptions including consistency and positivity
Implicit in any causal machine learning claim; required for the bias terms to be well-defined.

pith-pipeline@v0.9.0 · 5433 in / 1104 out tokens · 43654 ms · 2026-05-16T13:21:12.842957+00:00 · methodology

discussion (0)

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

Theorem 1 and Lemmas 1-2: unbiasedness and variance bound o(1/r_n²) when number of correlated pairs ≤ n²/r_n²

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.