arXiv: 2604.07580 · v1 · submitted 2026-04-08 · math.ST · stat.TH

Data Reuse and the Long Shadow of Error: Splitting, Subsampling, and Prospectively Managing Inferential Errors

Pith reviewed 2026-05-10 17:01 UTC · model grok-4.3

classification: math.ST · stat.TH
keywords: data reuse · subsampling · multiple testing · type I error · asymptotically linear statistics · expected variance ratio · data splitting · inferential errors

The pith

Subsampling by individual researchers controls dependence from data reuse to bound Type I error variance near the independent baseline.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

When multiple uncoordinated investigators test hypotheses on overlapping parts of one dataset, the shared data creates correlated test statistics that inflate the variance of the total Type I error count. The paper establishes that subsampling at the individual level, with each statistic using a fraction r of the data, is enough to keep this inflation small: the Expected Variance Ratio stays bounded, and the abstract reports an O(1/r²) rate for subsampling against O(1/r) for data splitting. The argument rests on an asymptotic decomposition of the covariance between test statistics into a pure overlap factor times an association factor, valid for the broad class of asymptotically linear statistics. A sympathetic reader cares because the method requires no central coordination or post-analysis corrections, yet still lets many researchers extract reliable conclusions from the same valuable data.

Core claim

For asymptotically linear test statistics, the covariance matrix decomposes as the product of a data-overlap term and a test-statistic association term. This decomposition shows that limiting overlap is sufficient to control dependence, which is formalized by the Expected Variance Ratio (EVR), a comparison of the variance of the Type I error count with the independent case. Under the global null, the variance of the joint rejection region admits a closed form in terms of pairwise correlations. Mean-variance portfolio theory is used to define EVR, and concentration inequalities then show that individual subsampling yields an EVR close to 1 while preserving power. Data splitting is asymptotically optimal among rules that ensure exact independence.
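
The decomposition can be written out for the leading linear terms. This is a sketch under standard assumptions, not the paper's exact statement: the data are i.i.d., the index sets are fixed or chosen independently of the data, and the symbols D_i, n_i, and ψ_i are notation assumed here for illustration.

```latex
% Assumed notation (not taken from the paper): X_1,\dots,X_N i.i.d.;
% D_i = index set used by investigator i, n_i = |D_i|;
% \psi_i = influence function of statistic i, with E[\psi_i(X)] = 0.
T_i = \frac{1}{\sqrt{n_i}} \sum_{k \in D_i} \psi_i(X_k) + o_p(1)

% Cross terms with k \neq l vanish by independence and mean zero,
% so only the shared indices contribute:
\operatorname{Cov}\!\left(
  \frac{1}{\sqrt{n_i}} \sum_{k \in D_i} \psi_i(X_k),\;
  \frac{1}{\sqrt{n_j}} \sum_{l \in D_j} \psi_j(X_l)
\right)
= \underbrace{\frac{|D_i \cap D_j|}{\sqrt{n_i n_j}}}_{\text{data overlap}}
  \cdot
  \underbrace{\mathbb{E}\!\left[\psi_i(X)\,\psi_j(X)\right]}_{\text{association}}

% Equal sizes n_i = n_j = rN: overlap factor = |D_i \cap D_j| / (rN).
% Disjoint splits give 0; two independent uniform subsamples give r in expectation.
```

The association factor is fixed by the hypotheses being tested, so the overlap factor is the only part an investigator can shrink through the choice of data.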

What carries the argument

The Expected Variance Ratio (EVR), defined as the ratio of the expected variance of the Type I error count under data-induced dependence to the variance that would hold under independence; it is controlled by restricting data overlap through per-investigator subsampling.
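
A minimal numerical sketch of the quantity behind this ratio, assuming two-sided z-tests at level α whose statistics are exactly bivariate normal with a common correlation ρ. The function names `pair_reject_cov` and `variance_ratio` are hypothetical, and treating the pairwise term as the covariance of two rejection indicators is an assumption about the setup, not the paper's definition. The paper's EVR also takes an expectation over the sampling scheme, which this fixed-ρ sketch omits.

```python
from statistics import NormalDist

STD = NormalDist()  # standard normal

def pair_reject_cov(rho, alpha=0.05, steps=4000, width=8.0):
    """Covariance of two two-sided rejection indicators for standard
    bivariate normal z-statistics with correlation rho (|rho| < 1)."""
    c = STD.inv_cdf(1 - alpha / 2)          # two-sided critical value
    s = (1 - rho * rho) ** 0.5              # conditional sd of Z2 given Z1
    h = width / steps
    total = 0.0
    for k in range(steps):                  # midpoint rule on [c, c + width]
        z = c + (k + 0.5) * h
        # P(|Z2| <= c | Z1 = z), using Z2 = rho * z + s * W with W ~ N(0, 1)
        accept = STD.cdf((c - rho * z) / s) - STD.cdf((-c - rho * z) / s)
        total += STD.pdf(z) * (1 - accept) * h
    joint = 2 * total                       # the region z < -c contributes equally
    return joint - alpha * alpha            # P(both reject) - alpha^2

def variance_ratio(rho, n_tests, alpha=0.05):
    """Var(error count) under equicorrelation rho, divided by the
    independent baseline n_tests * alpha * (1 - alpha)."""
    return 1 + (n_tests - 1) * pair_reject_cov(rho, alpha) / (alpha * (1 - alpha))
```

Calling `variance_ratio(rho, n_tests)` for a few values of `rho` shows how quickly the ratio leaves 1 as correlation grows, which is the inflation that restricting overlap is meant to prevent.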

If this is right

  • Data splitting achieves exact independence, and is asymptotically optimal among rules that do, but attains only the O(1/r) rate.
  • Subsampling performed by each investigator separately keeps EVR near 1 while supporting multiple tests at adequate power.
  • Subsampling with bounded EVR attains an O(1/r²) rate against splitting's O(1/r), so its advantage grows as r shrinks.
  • The procedure requires only minimal coordination among investigators.
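
The contrast between the two schemes in the list above can be made concrete with a small sketch. It assumes each investigator needs a fraction r of N records and that subsamples are drawn uniformly without replacement and independently of one another. The helper names are hypothetical, and the sketch illustrates the overlap mechanics only, not the paper's rate results.

```python
import random

def max_disjoint_splits(r):
    """Number of investigators that can each receive a disjoint fraction r."""
    return int(1 / r + 1e-9)  # small epsilon guards against float rounding

def mean_overlap_fraction(n_total, r, trials=2000, seed=0):
    """Average share of one subsample that also appears in a second,
    independently drawn subsample of the same size."""
    rng = random.Random(seed)
    n = round(r * n_total)
    shared = 0
    for _ in range(trials):
        first = set(rng.sample(range(n_total), n))
        second = rng.sample(range(n_total), n)
        shared += sum(1 for k in second if k in first)
    return shared / (trials * n)
```

Splitting caps the number of investigators at about 1/r but leaves no shared records. Independent subsampling places no cap, and the expected shared share between any two investigators is r (the hypergeometric mean), so the pairwise overlap shrinks as r does.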

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Large public datasets could safely host far more independent analyses if subsampling protocols became routine practice.
  • The scaling advantage suggests that finite-sample or non-linear extensions would be high-value follow-up work.
  • Data repositories might usefully distribute pre-subsampled versions or simple overlap-control guidelines to users.

Load-bearing premise

The test statistics belong to the class of asymptotically linear statistics so that their joint distribution is asymptotically normal and the covariance factors cleanly into an overlap term and an association term.

What would settle it

Simulate many overlapping subsamples of size fraction r from a fixed large dataset, run the same family of null hypotheses on each subsample, and check whether the empirical variance of the false-positive count scales as O(1/r²) or remains larger.
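
A minimal sketch of such a simulation, under assumptions chosen for simplicity rather than taken from the paper: the data are i.i.d. standard normal with known unit variance, every investigator runs the same two-sided z-test of a zero mean (so the association factor is 1, the least favorable case), and subsamples are drawn uniformly without replacement. The function name and defaults are hypothetical.

```python
import random
from statistics import NormalDist, pvariance

def false_positive_variance_ratio(n_total, r, n_tests, reps=1000,
                                  alpha=0.05, seed=0):
    """Empirical variance of the false-positive count across repeated
    datasets, divided by the independent baseline n_tests*alpha*(1-alpha)."""
    rng = random.Random(seed)
    crit = NormalDist().inv_cdf(1 - alpha / 2)
    n = max(1, round(r * n_total))           # records per investigator
    counts = []
    for _ in range(reps):
        # Fresh dataset generated under the global null
        data = [rng.gauss(0.0, 1.0) for _ in range(n_total)]
        rejections = 0
        for _ in range(n_tests):
            sub = rng.sample(data, n)         # one investigator's subsample
            z = sum(sub) / n ** 0.5           # z-statistic, known variance 1
            if abs(z) > crit:
                rejections += 1
        counts.append(rejections)
    baseline = n_tests * alpha * (1 - alpha)
    return pvariance(counts) / baseline

if __name__ == "__main__":
    for frac in (1.0, 0.5, 0.25, 0.1):
        print(frac, round(false_positive_variance_ratio(100, frac, 10), 2))
```

With r = 1 every investigator sees the same records and the ratio approaches the number of tests; smaller r pulls it back toward 1. Checking the paper's rate claims would additionally require varying r and the number of tests together, which this sketch leaves to the reader.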

Figures

Figures reproduced from arXiv: 2604.07580 by Jordan Rodu, Maria E. Currie, Mike Baiocchi, Reid Dale.

Figure 1: Subquadratic growth of R(ρ, c_{α/2}) in ρ.
Figure 2: Standard deviation vs. FWER for E, assuming Σ_ij = ρ_ij = ρ.
Original abstract

When multiple investigators analyze a common dataset, the data reuse induces dependence across testing procedures, affecting the distribution of errors. Existing techniques of managing dependent tests require either cross-study coordination or post-hoc correction. These methods do not apply to the current practice of uncoordinated groups of researchers independently evaluating hypotheses on a shared dataset. We investigate the use of subsampling techniques implemented at the level of individual investigators to remedy dependence with minimal coordination. To this end, we establish the asymptotic joint normality of test statistics for the class of asymptotically linear test statistics, decomposing the covariance matrix as the product of a data overlap term and a test statistic association term. This decomposition shows that controlling data overlap is sufficient to control dependence, which we formalize through the notion of Expected Variance Ratio. This enables the closed form derivation of the variance of the joint rejection region under the global null as a function of pairwise correlations of test statistics. We adopt mean-variance portfolio theory to measure risk, defining the Expected Variance Ratio (EVR) as the ratio of the expected variance of the Type I error count to the independent baseline. We show that data splitting is asymptotically optimal among rules that ensure exact independence. We then use concentration inequalities to establish that subsampling techniques implementable by individual investigators can ensure an EVR close to $1$. Finally, we show that such subsampling techniques are able to simultaneously perform a number of tests while ensuring sufficient power and that the bounded EVR is $O\left(\frac{1}{r^2}\right)$ compared to data splitting's $O\left(\frac{1}{r}\right)$, where $r$ is the per-statistic fraction of data required.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper claims that for asymptotically linear test statistics, individual investigators can use subsampling on a shared dataset to control dependence from data reuse without coordination. It establishes asymptotic joint normality, decomposes the covariance matrix into a data-overlap term times a test-statistic association term, defines the Expected Variance Ratio (EVR) via mean-variance portfolio theory applied to the variance of the Type I error count under the global null, shows data splitting is asymptotically optimal among exact-independence rules, and uses concentration inequalities to prove that subsampling yields EVR close to 1 with rate O(1/r²) versus O(1/r) for splitting, where r is the per-statistic data fraction.

Significance. If the derivations hold, the work supplies a concrete, low-coordination method for managing error dependence in the common setting of multiple independent analyses of the same data, with quantifiable efficiency gains over splitting. The EVR metric and the explicit rate comparison are useful for practice; the application of portfolio theory to error-count variance and the focus on uncoordinated investigators address a genuine gap.

major comments (2)
  1. [Abstract] (covariance decomposition and EVR derivation): the separation of covariance into data-overlap and association terms, the closed-form variance of the joint rejection region, and all subsequent EVR bounds are derived only for asymptotically linear statistics under joint asymptotic normality. This assumption is load-bearing for the central claim that 'controlling data overlap is sufficient to control dependence' and for the uncoordinated-investigator guarantee; the manuscript should state the scope explicitly and indicate whether the EVR control extends (even approximately) outside this class.
  2. [Abstract] (rate comparison): the O(1/r²) bound for subsampling versus O(1/r) for splitting is obtained via concentration inequalities on the error-count variance. The precise application of these inequalities to the portfolio-optimized variance expression, including any constants or higher-order terms that might affect the claimed improvement, needs to be shown in detail so that the rate advantage is verifiable.
minor comments (2)
  1. The definition of r (per-statistic fraction of data) and its role in both the splitting and subsampling regimes should be stated once in a single location with consistent notation.
  2. A brief remark on how the EVR behaves under local alternatives or when the global null is false would help readers assess power implications alongside the Type I error control.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback, which helps clarify the scope and strengthen the presentation of our results. We address each major comment below and will incorporate revisions accordingly.

  1. Referee: [Abstract] (covariance decomposition and EVR derivation): the separation of covariance into data-overlap and association terms, the closed-form variance of the joint rejection region, and all subsequent EVR bounds are derived only for asymptotically linear statistics under joint asymptotic normality. This assumption is load-bearing for the central claim that 'controlling data overlap is sufficient to control dependence' and for the uncoordinated-investigator guarantee; the manuscript should state the scope explicitly and indicate whether the EVR control extends (even approximately) outside this class.

    Authors: We agree that the central results, including the covariance decomposition, closed-form variance of the rejection region, and EVR bounds, are derived under the assumption of asymptotically linear test statistics with joint asymptotic normality. This class is explicitly referenced in the abstract and Section 2, but we will revise the abstract and introduction to state the scope more prominently at the outset. We will also add a remark noting that the exact decomposition and EVR guarantees do not necessarily extend outside this class (e.g., to non-linear statistics), though controlling data overlap may still reduce dependence heuristically in practice; we do not claim formal guarantees beyond the stated assumptions.

    Revision: yes

  2. Referee: [Abstract] (rate comparison): the O(1/r²) bound for subsampling versus O(1/r) for splitting is obtained via concentration inequalities on the error-count variance. The precise application of these inequalities to the portfolio-optimized variance expression, including any constants or higher-order terms that might affect the claimed improvement, needs to be shown in detail so that the rate advantage is verifiable.

    Authors: We will expand the supplementary material (or add an appendix section) with a detailed step-by-step derivation of how the concentration inequalities are applied to the portfolio-optimized variance of the Type I error count. This will include the explicit constants, the form of the variance expression under the global null, and any higher-order terms, allowing readers to verify the O(1/r²) rate for subsampling versus O(1/r) for splitting.

    Revision: yes

Circularity Check

0 steps flagged

No significant circularity; derivations are self-contained under standard asymptotic theory

The paper derives asymptotic joint normality and the covariance decomposition (overlap term times association term) for the class of asymptotically linear statistics directly from standard theory, then defines EVR via portfolio optimization on the resulting variance of the error count and applies concentration inequalities to bound subsampling performance. No equation or claim reduces by construction to a fitted parameter, self-definition, or load-bearing self-citation; all steps are conditional on the stated class and produce independent content from the inputs. The O(1/r²) versus O(1/r) comparison follows from the concentration analysis rather than being presupposed.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The claims rest on standard asymptotic statistical theory and concentration inequalities; no new free parameters or invented entities are introduced beyond the defined EVR measure.

axioms (1)
  • domain assumption: asymptotic joint normality of test statistics for the class of asymptotically linear test statistics
    Invoked to decompose the covariance matrix as product of data overlap and test statistic association terms.
