On the use of cross-fitting in causal machine learning with correlated units
Pith reviewed 2026-05-16 13:21 UTC · model grok-4.3
The pith
Standard cross-fitting, performed as if units were independent, still eliminates key bias terms in causal machine learning when study units are correlated.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Performing cross-fitting as if study units were independent still eliminates key bias terms even when units may be correlated. This holds because the cross-fitting structure ensures that the dependence between units does not prevent the remainder terms from vanishing under standard convergence conditions for the nuisance estimators.
What carries the argument
Cross-fitting of nuisance estimators in causal machine learning: the data are partitioned into folds so that each nuisance model is evaluated only on units it was not fit on. Combined with an orthogonalized estimator, this removes first-order bias terms despite unit dependence.
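A minimal sketch of as-if-independent cross-fitting, assuming a partially linear model Y = theta*D + g(X) + noise and least-squares nuisance fits on a cubic basis. The model, the function names `cross_fit_plr` and `cubic_basis`, and the simulated clustered data are illustrative assumptions, not the paper's estimator or experiments.

```python
import numpy as np

def cubic_basis(x):
    # Hypothetical nuisance feature map: intercept plus powers of a scalar covariate.
    return np.column_stack([np.ones_like(x), x, x**2, x**3])

def cross_fit_plr(y, d, x, n_folds=5, seed=0):
    """Cross-fitted residual-on-residual estimate of theta."""
    rng = np.random.default_rng(seed)
    n = len(y)
    # Folds are assigned uniformly at random, ignoring any correlation structure.
    folds = rng.permutation(n) % n_folds
    y_res = np.empty(n)
    d_res = np.empty(n)
    for k in range(n_folds):
        train, test = folds != k, folds == k
        b_train, b_test = cubic_basis(x[train]), cubic_basis(x[test])
        # Nuisance models are fit on the other folds only.
        coef_y, *_ = np.linalg.lstsq(b_train, y[train], rcond=None)
        coef_d, *_ = np.linalg.lstsq(b_train, d[train], rcond=None)
        # Residuals are evaluated on the held-out fold.
        y_res[test] = y[test] - b_test @ coef_y
        d_res[test] = d[test] - b_test @ coef_d
    return float(d_res @ y_res / (d_res @ d_res))

# Illustrative clustered data: units in a cluster share an outcome noise term.
rng = np.random.default_rng(1)
n_clusters, cluster_size = 100, 20
cluster_ids = np.repeat(np.arange(n_clusters), cluster_size)
x = rng.normal(size=n_clusters * cluster_size)
d = np.sin(x) + rng.normal(size=x.size)
shared_noise = rng.normal(size=n_clusters)[cluster_ids]
y = 1.5 * d + x**2 + shared_noise + rng.normal(size=x.size)  # true theta = 1.5

theta_hat = cross_fit_plr(y, d, x)
```

In this sketch the fold labels never consult `cluster_ids`, so units from the same cluster routinely land in both the fitting and the evaluation folds.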
If this is right
- Causal estimators retain their asymptotic properties without needing custom cross-fitting for dependence.
- Implementation of causal machine learning becomes simpler for clustered, spatial, and time-series data.
- Simulation results indicate no loss in bias reduction or precision when ignoring correlations in fold assignment.
Where Pith is reading between the lines
- Practitioners could default to standard cross-fitting routines in software for dependent data settings to save development time.
- Further work might examine the impact on variance estimation under strong dependence structures.
- Analogous results could apply to other orthogonalized estimators in statistics beyond causal inference.
Load-bearing premise
The nuisance estimators converge at rates that remain valid even in the presence of dependence between study units.
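One common way such a rate condition enters, sketched for a doubly robust estimator with two nuisance functions: the second-order remainder is bounded by a product of nuisance errors. The symbols $\hat\mu$, $\hat\pi$, $\mu_0$, $\pi_0$, and $R_n$ are assumptions for illustration and may not match the paper's notation.

```latex
|R_n| \;\lesssim\; \|\hat\mu - \mu_0\|_{L_2}\,\|\hat\pi - \pi_0\|_{L_2},
\qquad
\|\hat\mu - \mu_0\|_{L_2} = o_p(n^{-1/4}),\;\;
\|\hat\pi - \pi_0\|_{L_2} = o_p(n^{-1/4})
\;\Longrightarrow\;
R_n = o_p(n^{-1/2}).
```

Under this reading, the premise is that dependence between units does not slow either nuisance error below the $n^{-1/4}$ threshold.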
What would settle it
An empirical example or simulation where the bias of standard cross-fitting exceeds that of correlation-adjusted cross-fitting under a specific dependence structure, such as strong spatial autocorrelation.
Original abstract
In causal machine learning, the fitting and evaluation of nuisance models are often performed on separate partitions, or folds, of the observed data. This technique, called cross-fitting, eliminates bias introduced by the use of black-box predictive algorithms. When study units may be correlated, such as in spatial, clustered, or time-series data, investigators often design bespoke forms of cross-fitting to minimize correlation between folds. We prove that, perhaps contrary to popular belief, this is typically unnecessary: performing cross-fitting as if study units were independent still eliminates key bias terms even when units may be correlated. In simulation experiments with various correlation structures, we show that causal machine learning estimators achieve the same or improved bias and precision under cross-fitting that ignores correlation compared to techniques striving to eliminate correlation between folds.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims that in causal machine learning with potentially correlated units (spatial, clustered, or time-series data), standard cross-fitting—performed as if units were independent—still eliminates key bias terms arising from nuisance function estimation, contrary to the common practice of designing bespoke folds to minimize inter-fold correlation. This is supported by a theoretical argument and simulation experiments across various correlation structures demonstrating that estimators achieve the same or improved bias and precision compared to correlation-aware cross-fitting variants.
Significance. If the central theoretical result holds under appropriate conditions, the paper would have substantial practical significance by simplifying cross-fitting procedures in dependent-data settings and reducing the need for custom fold designs. The simulations provide useful empirical validation, but the result's impact hinges on whether the bias-elimination argument extends beyond i.i.d. assumptions without additional restrictions.
major comments (2)
- [Proof section (post-abstract)] The central theoretical claim (detailed in the proof section following the abstract) that cross-fitting eliminates key bias terms relies on nuisance estimators satisfying o_p(n^{-1/4}) convergence rates that continue to hold under dependence. However, the manuscript does not specify the precise weak-dependence or mixing conditions under which these rates are preserved; without them, the remainder-term argument does not go through for strong positive dependence (e.g., AR(1) lag-1 correlation >0.8), as the effective sample size per fold shrinks.
- [Simulation experiments] Simulation experiments (Section on numerical studies): the reported bias and precision improvements under standard cross-fitting are presented without explicit details on how nuisance estimators were tuned or whether post-hoc fold choices were made; this makes it difficult to assess whether the results are robust or sensitive to the specific correlation structures tested.
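The effective-sample-size concern in the first major comment can be made concrete with a standard large-sample heuristic for the mean of a stationary AR(1) series, n_eff ≈ n(1 − ρ)/(1 + ρ). This is a rough sketch for intuition only. The function name is hypothetical, and the heuristic applies to a sample mean, which is an assumption here and not the quantity analyzed in the paper.

```python
def ar1_effective_sample_size(n, rho):
    """Approximate effective sample size for the mean of a stationary AR(1) series.

    Assumes lag-1 correlation rho with |rho| < 1 and large n.
    """
    if not -1.0 < rho < 1.0:
        raise ValueError("rho must lie strictly between -1 and 1")
    return n * (1.0 - rho) / (1.0 + rho)

# With lag-1 correlation 0.8, the heuristic shrinks n by a factor of 9.
n_eff = ar1_effective_sample_size(900, 0.8)
```

The same factor applies to any fold taken as a contiguous block of the series, which is why the referee flags strong positive dependence as the case where rate conditions need an explicit statement.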
minor comments (2)
- [Methods] Notation for the cross-fitting procedure could be clarified with an explicit diagram or pseudocode showing how folds are constructed under the 'as-if-independent' approach versus correlation-minimizing alternatives.
- [Introduction] The abstract states that 'standard cross-fitting still eliminates key bias terms,' but the precise bias terms (e.g., which remainder in the influence function) should be named explicitly in the introduction for reader accessibility.
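The contrast the first minor comment asks for could be sketched as follows. `random_folds` assigns units to folds as if they were independent, while `cluster_folds` keeps each cluster inside a single fold. The latter is only one possible correlation-minimizing alternative, chosen here as an assumption. Both function names and the helper `n_split_clusters` are hypothetical.

```python
import numpy as np

def random_folds(n, n_folds, seed=0):
    # As-if-independent assignment: balanced fold labels in random order.
    rng = np.random.default_rng(seed)
    return rng.permutation(n) % n_folds

def cluster_folds(cluster_ids, n_folds, seed=0):
    # Correlation-aware assignment: every unit inherits its cluster's fold.
    rng = np.random.default_rng(seed)
    clusters, inverse = np.unique(cluster_ids, return_inverse=True)
    fold_of_cluster = rng.permutation(len(clusters)) % n_folds
    return fold_of_cluster[inverse]

def n_split_clusters(cluster_ids, folds):
    # Number of clusters whose units are spread over more than one fold.
    return sum(
        len(np.unique(folds[cluster_ids == c])) > 1
        for c in np.unique(cluster_ids)
    )

cluster_ids = np.repeat(np.arange(50), 10)  # 50 clusters of 10 units each
standard = random_folds(len(cluster_ids), n_folds=5)
blocked = cluster_folds(cluster_ids, n_folds=5)
```

Here `n_split_clusters(cluster_ids, blocked)` is zero by construction, whereas the standard assignment splits clusters across folds. The paper's claim is that the standard assignment still removes the key bias terms.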
Simulated Author's Rebuttal
We thank the referee for the constructive comments, which help clarify the scope of our theoretical results and improve the reproducibility of our simulations. We address each major comment below and have made revisions to the manuscript to incorporate the suggestions.
Point-by-point responses
- Referee: [Proof section (post-abstract)] The central theoretical claim (detailed in the proof section following the abstract) that cross-fitting eliminates key bias terms relies on nuisance estimators satisfying o_p(n^{-1/4}) convergence rates that continue to hold under dependence. However, the manuscript does not specify the precise weak-dependence or mixing conditions under which these rates are preserved; without them, the remainder-term argument does not go through for strong positive dependence (e.g., AR(1) lag-1 correlation >0.8), as the effective sample size per fold shrinks.
Authors: We agree that the proof relies on the nuisance estimators achieving the stated o_p(n^{-1/4}) rates under dependence, and that these rates require appropriate weak-dependence conditions to hold. In the revised manuscript, we will add an explicit statement of the assumed conditions (alpha-mixing with summable coefficients ensuring the rates are preserved) immediately following the main theorem. We note that for very strong dependence (e.g., AR(1) correlations exceeding 0.8), the effective sample size reduction is a practical concern that our simulations already explore; the bias-elimination property continues to hold asymptotically under the stated mixing conditions, though finite-sample performance may vary. revision: yes
- Referee: [Simulation experiments] Simulation experiments (Section on numerical studies): the reported bias and precision improvements under standard cross-fitting are presented without explicit details on how nuisance estimators were tuned or whether post-hoc fold choices were made; this makes it difficult to assess whether the results are robust or sensitive to the specific correlation structures tested.
Authors: We thank the referee for highlighting the need for greater transparency. In the revised version, we will expand the numerical studies section to include full details on nuisance estimator tuning (hyperparameters selected via cross-validation on the training folds only, with fixed random seeds for reproducibility) and explicitly state that no post-hoc fold adjustments or optimizations were performed; all fold partitions were fixed in advance according to the data-generating process without reference to the final estimator performance. These additions will allow readers to assess robustness directly. revision: yes
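The protocol described in this response (folds fixed in advance from a seed, hyperparameters chosen using the training folds only) might look roughly like the sketch below. Ridge regression, the penalty grid, the inner three-fold split, and all function names are assumptions made for illustration. The response does not say which learners or grids were used.

```python
import numpy as np

def ridge_fit(x, y, lam):
    # Closed-form ridge coefficients.
    return np.linalg.solve(x.T @ x + lam * np.eye(x.shape[1]), x.T @ y)

def select_lambda(x, y, lambdas, n_inner, rng):
    # Inner cross-validation that sees the training folds only.
    inner = rng.permutation(len(y)) % n_inner
    errors = []
    for lam in lambdas:
        sse = 0.0
        for j in range(n_inner):
            fit, val = inner != j, inner == j
            beta = ridge_fit(x[fit], y[fit], lam)
            sse += float(np.sum((y[val] - x[val] @ beta) ** 2))
        errors.append(sse)
    return lambdas[int(np.argmin(errors))]

def cross_fit_predictions(x, y, lambdas=(0.01, 1.0, 100.0),
                          n_folds=5, n_inner=3, seed=0):
    rng = np.random.default_rng(seed)
    # Outer folds are fixed before any model is fit and never revised.
    folds = rng.permutation(len(y)) % n_folds
    preds = np.empty(len(y))
    chosen = []
    for k in range(n_folds):
        train, test = folds != k, folds == k
        lam = select_lambda(x[train], y[train], lambdas, n_inner, rng)
        preds[test] = x[test] @ ridge_fit(x[train], y[train], lam)
        chosen.append(lam)
    return preds, folds, chosen
```

Because every random draw comes from one seeded generator, rerunning with the same seed reproduces the folds, the selected penalties, and the held-out predictions exactly.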
Circularity Check
Theoretical proof of bias elimination is self-contained; no reduction to fitted inputs or self-citation chains
Full rationale
The paper's central result is a mathematical proof that standard cross-fitting eliminates key bias terms even under unit dependence, relying on nuisance estimator convergence rates that are assumed to hold. This derivation does not reduce by construction to any fitted parameter, renamed empirical pattern, or load-bearing self-citation; the provided abstract and claim structure treat the proof as independent of the simulation validation, which serves only as corroboration rather than the source of the result. No quoted step equates a prediction to its own input or imports uniqueness via author overlap.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Standard causal identification assumptions, including consistency and positivity
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel
  Tag: unclear (relation between the paper passage and the cited Recognition theorem)
  Theorem 1 and Lemmas 1-2: unbiasedness and variance bound o(1/r_n²) when number of correlated pairs ≤ n²/r_n²
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.