Data Reuse and the Long Shadow of Error: Splitting, Subsampling, and Prospectively Managing Inferential Errors
Pith reviewed 2026-05-10 17:01 UTC · model grok-4.3
The pith
Subsampling by individual researchers limits the dependence that data reuse induces, keeping the variance of the Type I error count near its independent baseline.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
For asymptotically linear test statistics, the covariance matrix of the test statistics decomposes as the product of a data-overlap term and a test-statistic association term. This decomposition shows that limiting data overlap is sufficient to control dependence, which is formalized by the Expected Variance Ratio (EVR), the ratio of the expected variance of the Type I error count to its value in the independent case. Under the global null, the variance of the joint rejection region admits a closed form in terms of pairwise correlations of the test statistics. Mean-variance portfolio theory is used to define EVR, and concentration inequalities then show that subsampling by individual investigators yields an EVR close to 1 while preserving power. Data splitting is asymptotically a
What carries the argument
The Expected Variance Ratio (EVR), defined as the ratio of the expected variance of the Type I error count under data-induced dependence to the variance that would hold under independence; it is controlled by restricting data overlap through per-investigator subsampling.
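The ratio described above can be written out as a rough sketch. The notation here is assumed for illustration and may differ from the paper's: $C$ tests are each run at exact level $\alpha$, $I_i$ indicates that test $i$ rejects, $K = \sum_i I_i$ is the Type I error count under the global null, and $S$ is the (random) subsampling rule. Under independence, $K$ is binomial, so its variance is $C\alpha(1-\alpha)$.

```latex
\operatorname{Var}(K \mid S)
  = \sum_{i=1}^{C} \operatorname{Var}(I_i) + \sum_{i \neq j} \operatorname{Cov}(I_i, I_j \mid S)
  = C\alpha(1-\alpha) + \sum_{i \neq j} \bigl( P(I_i = 1,\, I_j = 1 \mid S) - \alpha^2 \bigr),
\qquad
\mathrm{EVR} = \frac{\mathbb{E}_S\!\left[ \operatorname{Var}(K \mid S) \right]}{C\alpha(1-\alpha)} .
```

In this form an EVR of 1 corresponds to every pairwise covariance term vanishing, which is why bounding the pairwise joint rejection probabilities is enough to bound the ratio.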
If this is right
- Data splitting achieves exact independence, and is asymptotically optimal among rules that do, but only at the O(1/r) rate.
- Subsampling performed by each investigator separately keeps EVR near 1 while supporting multiple tests at adequate power.
- Subsampling with bounded EVR attains O(1/r²) against data splitting's O(1/r), so its advantage grows as the per-statistic data fraction r shrinks.
- The procedure requires only minimal coordination among investigators.
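To make the "minimal coordination" point concrete, here is a small sketch of investigators subsampling on their own, assuming each one simply draws a fraction r of the rows uniformly without replacement using a private seed. The function names `draw_subsample` and `overlap_ratio` and all parameter values are hypothetical and not taken from the paper. Under this assumption the overlap between two subsamples, relative to subsample size, should sit close to r.

```python
import random
from itertools import combinations


def draw_subsample(n_total, fraction, seed):
    """Draw round(fraction * n_total) distinct row indices using a private seed."""
    rng = random.Random(seed)
    size = round(fraction * n_total)
    return set(rng.sample(range(n_total), size))


def overlap_ratio(rows_a, rows_b):
    """Shared rows divided by the geometric mean of the two subsample sizes."""
    return len(rows_a & rows_b) / (len(rows_a) * len(rows_b)) ** 0.5


# Illustrative values only: 5 investigators, each using 10% of 10,000 rows.
n_total, fraction = 10_000, 0.1
subsamples = [draw_subsample(n_total, fraction, seed) for seed in range(5)]
ratios = [overlap_ratio(a, b) for a, b in combinations(subsamples, 2)]
print(f"mean overlap ratio: {sum(ratios) / len(ratios):.3f} (fraction r = {fraction})")
```

No investigator needs to know which rows the others drew; the only shared input is the fraction r.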
Where Pith is reading between the lines
- Large public datasets could safely host far more independent analyses if subsampling protocols became routine practice.
- The scaling advantage suggests that finite-sample or non-linear extensions would be high-value follow-up work.
- Data repositories might usefully distribute pre-subsampled versions or simple overlap-control guidelines to users.
Load-bearing premise
The test statistics belong to the class of asymptotically linear statistics so that their joint distribution is asymptotically normal and the covariance factors cleanly into an overlap term and an association term.
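The factorization can be checked by hand in the simplest case, sketched here under assumed notation (the symbols $D_i$, $n_i$, and $\psi_i$ are illustrative). Suppose the data $X_1, \dots, X_N$ are i.i.d., investigator $i$ uses the rows in $D_i$ with $|D_i| = n_i$, and the statistic is exactly linear with mean-zero influence function $\psi_i$. Rows outside the intersection contribute nothing to the covariance because distinct rows are independent.

```latex
T_i = \frac{1}{\sqrt{n_i}} \sum_{k \in D_i} \psi_i(X_k),
\qquad
\operatorname{Cov}(T_i, T_j)
  = \frac{1}{\sqrt{n_i n_j}} \sum_{k \in D_i \cap D_j} \operatorname{Cov}\bigl(\psi_i(X_k), \psi_j(X_k)\bigr)
  = \underbrace{\frac{|D_i \cap D_j|}{\sqrt{n_i n_j}}}_{\text{data overlap}}
    \cdot
    \underbrace{\operatorname{Cov}\bigl(\psi_i(X), \psi_j(X)\bigr)}_{\text{association}} .
```

With equal subsample sizes $n_i = n_j = rN$, the overlap factor reduces to $|D_i \cap D_j| / (rN)$, so shrinking the overlap shrinks the covariance whatever the association term is.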
What would settle it
Draw many overlapping subsamples, each containing a fraction r of a fixed large dataset, test the same family of null hypotheses on each subsample, and check whether the empirical variance of the false-positive count scales as O(1/r²) or remains larger.
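A check of this kind could be sketched as follows. This is a simplified illustration, not the paper's experiment, and it makes several assumptions: the data are standard normal with known unit variance, every investigator runs the same two-sided z-test of a zero mean (so the association term is at its worst case of 1), and fresh data are drawn in each replicate so the variance can be estimated. The function name `simulate_evr` and all parameter values are hypothetical.

```python
import random
from statistics import NormalDist, pvariance


def simulate_evr(n_total, n_tests, fraction, alpha=0.05, n_reps=1000, seed=0):
    """Estimate Var(false-positive count) divided by the independent baseline."""
    rng = random.Random(seed)
    crit = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided critical value
    m = max(1, round(fraction * n_total))       # rows per investigator
    counts = []
    for _ in range(n_reps):
        # Global null: every observation has mean zero and unit variance.
        data = [rng.gauss(0.0, 1.0) for _ in range(n_total)]
        rejections = 0
        for _ in range(n_tests):
            rows = rng.sample(range(n_total), m)     # independent subsample
            z = sum(data[k] for k in rows) / m ** 0.5
            if abs(z) > crit:
                rejections += 1
        counts.append(rejections)
    baseline = n_tests * alpha * (1 - alpha)    # binomial variance
    return pvariance(counts) / baseline


evr_subsampled = simulate_evr(n_total=400, n_tests=10, fraction=0.1)
evr_full_reuse = simulate_evr(n_total=400, n_tests=10, fraction=1.0)
print(f"variance ratio with r = 0.1: {evr_subsampled:.2f}")
print(f"variance ratio with full reuse: {evr_full_reuse:.2f}")
```

With full reuse all ten tests see identical data, so the count is either 0 or 10 and the ratio should be far above 1; with r = 0.1 it should land near 1, up to Monte Carlo noise. Repeating the run over a grid of r values is one way to examine the claimed scaling.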
Original abstract
When multiple investigators analyze a common dataset, the data reuse induces dependence across testing procedures, affecting the distribution of errors. Existing techniques of managing dependent tests require either cross-study coordination or post-hoc correction. These methods do not apply to the current practice of uncoordinated groups of researchers independently evaluating hypotheses on a shared dataset. We investigate the use of subsampling techniques implemented at the level of individual investigators to remedy dependence with minimal coordination. To this end, we establish the asymptotic joint normality of test statistics for the class of asymptotically linear test statistics, decomposing the covariance matrix as the product of a data overlap term and a test statistic association term. This decomposition shows that controlling data overlap is sufficient to control dependence, which we formalize through the notion of Expected Variance Ratio. This enables the closed form derivation of the variance of the joint rejection region under the global null as a function of pairwise correlations of test statistics. We adopt mean-variance portfolio theory to measure risk, defining the Expected Variance Ratio (EVR) as the ratio of the expected variance of the Type I error count to the independent baseline. We show that data splitting is asymptotically optimal among rules that ensure exact independence. We then use concentration inequalities to establish that subsampling techniques implementable by individual investigators can ensure an EVR close to $1$. Finally, we show that such subsampling techniques are able to simultaneously perform a number of tests while ensuring sufficient power and that the bounded EVR is $O\left(\frac{1}{r^2}\right)$ compared to data splitting's $O\left(\frac{1}{r}\right)$, where $r$ is the per-statistic fraction of data required.
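The dependence of the error-count variance on pairwise correlations can be illustrated numerically. The sketch below is not the paper's closed-form expression; it assumes two-sided z-tests whose statistics are exactly bivariate normal with a common correlation `rho`, and evaluates the joint rejection probability by midpoint integration. The names `joint_rejection_prob` and `variance_ratio` are hypothetical.

```python
from statistics import NormalDist

STD_NORMAL = NormalDist()


def joint_rejection_prob(rho, alpha=0.05, steps=4000, upper=8.0):
    """P(|Z1| > c and |Z2| > c) for standard bivariate normals with correlation rho."""
    if not -1.0 < rho < 1.0:
        raise ValueError("rho must lie strictly between -1 and 1")
    crit = STD_NORMAL.inv_cdf(1 - alpha / 2)
    cond_sd = (1 - rho * rho) ** 0.5            # sd of Z2 given Z1
    width = (upper - crit) / steps
    total = 0.0
    for k in range(steps):
        z1 = crit + (k + 0.5) * width           # midpoint in the upper tail
        mean2 = rho * z1                        # mean of Z2 given Z1 = z1
        accept2 = (STD_NORMAL.cdf((crit - mean2) / cond_sd)
                   - STD_NORMAL.cdf((-crit - mean2) / cond_sd))
        total += STD_NORMAL.pdf(z1) * (1 - accept2) * width
    return 2 * total                            # lower tail follows by symmetry


def variance_ratio(n_tests, rho, alpha=0.05):
    """Var(error count) over the binomial baseline, for equicorrelated tests."""
    pair_cov = joint_rejection_prob(rho, alpha) - alpha ** 2
    baseline = n_tests * alpha * (1 - alpha)
    return (baseline + n_tests * (n_tests - 1) * pair_cov) / baseline


for rho in (0.0, 0.1, 0.5, 0.9):
    print(f"rho = {rho:.1f}: variance ratio = {variance_ratio(10, rho):.3f}")
```

For two-sided tests the ratio grows slowly at small correlations and steeply at large ones, which is consistent with the abstract's point that keeping pairwise correlations small keeps the variance near the independent baseline.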
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims that for asymptotically linear test statistics, individual investigators can use subsampling on a shared dataset to control dependence from data reuse without coordination. It establishes asymptotic joint normality, decomposes the covariance matrix into a data-overlap term times a test-statistic association term, defines the Expected Variance Ratio (EVR) via mean-variance portfolio theory applied to the variance of the Type I error count under the global null, shows data splitting is asymptotically optimal among exact-independence rules, and uses concentration inequalities to prove that subsampling yields EVR close to 1 with rate O(1/r²) versus O(1/r) for splitting, where r is the per-statistic data fraction.
Significance. If the derivations hold, the work supplies a concrete, low-coordination method for managing error dependence in the common setting of multiple independent analyses of the same data, with quantifiable efficiency gains over splitting. The EVR metric and the explicit rate comparison are useful for practice; the application of portfolio theory to error-count variance and the focus on uncoordinated investigators address a genuine gap.
major comments (2)
- [Abstract] Abstract (covariance decomposition and EVR derivation): the separation of covariance into data-overlap and association terms, the closed-form variance of the joint rejection region, and all subsequent EVR bounds are derived only for asymptotically linear statistics under joint asymptotic normality. This assumption is load-bearing for the central claim that 'controlling data overlap is sufficient to control dependence' and for the uncoordinated-investigator guarantee; the manuscript should state the scope explicitly and indicate whether the EVR control extends (even approximately) outside this class.
- [Abstract] Abstract (rate comparison): the O(1/r²) bound for subsampling versus O(1/r) for splitting is obtained via concentration inequalities on the error-count variance. The precise application of these inequalities to the portfolio-optimized variance expression, including any constants or higher-order terms that might affect the claimed improvement, needs to be shown in detail so that the rate advantage is verifiable.
minor comments (2)
- The definition of r (per-statistic fraction of data) and its role in both the splitting and subsampling regimes should be stated once in a single location with consistent notation.
- A brief remark on how the EVR behaves under local alternatives or when the global null is false would help readers assess power implications alongside the Type I error control.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback, which helps clarify the scope and strengthen the presentation of our results. We address each major comment below and will incorporate revisions accordingly.
- Referee: [Abstract] Abstract (covariance decomposition and EVR derivation): the separation of covariance into data-overlap and association terms, the closed-form variance of the joint rejection region, and all subsequent EVR bounds are derived only for asymptotically linear statistics under joint asymptotic normality. This assumption is load-bearing for the central claim that 'controlling data overlap is sufficient to control dependence' and for the uncoordinated-investigator guarantee; the manuscript should state the scope explicitly and indicate whether the EVR control extends (even approximately) outside this class.
  Authors: We agree that the central results, including the covariance decomposition, closed-form variance of the rejection region, and EVR bounds, are derived under the assumption of asymptotically linear test statistics with joint asymptotic normality. This class is explicitly referenced in the abstract and Section 2, but we will revise the abstract and introduction to state the scope more prominently at the outset. We will also add a remark noting that the exact decomposition and EVR guarantees do not necessarily extend outside this class (e.g., to non-linear statistics), though controlling data overlap may still reduce dependence heuristically in practice; we do not claim formal guarantees beyond the stated assumptions.
  Revision: yes
- Referee: [Abstract] Abstract (rate comparison): the O(1/r²) bound for subsampling versus O(1/r) for splitting is obtained via concentration inequalities on the error-count variance. The precise application of these inequalities to the portfolio-optimized variance expression, including any constants or higher-order terms that might affect the claimed improvement, needs to be shown in detail so that the rate advantage is verifiable.
  Authors: We will expand the supplementary material (or add an appendix section) with a detailed step-by-step derivation of how the concentration inequalities are applied to the portfolio-optimized variance of the Type I error count. This will include the explicit constants, the form of the variance expression under the global null, and any higher-order terms, allowing readers to verify the O(1/r²) rate for subsampling versus O(1/r) for splitting.
  Revision: yes
Circularity Check
No significant circularity; derivations are self-contained under standard asymptotic theory
The paper derives asymptotic joint normality and the covariance decomposition (overlap term times association term) for the class of asymptotically linear statistics directly from standard theory, then defines EVR via portfolio optimization on the resulting variance of the error count and applies concentration inequalities to bound subsampling performance. No equation or claim reduces by construction to a fitted parameter, self-definition, or load-bearing self-citation; all steps are conditional on the stated class and produce independent content from the inputs. The O(1/r²) versus O(1/r) comparison follows from the concentration analysis rather than being presupposed.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: asymptotic joint normality of test statistics for the class of asymptotically linear test statistics.
A Proofs of Technical Results
A.1 Proof of Theorem 1
Proof. The idea of the proof is simple: show that the linear approximants of $S_i^{(N)}$ of the form $c \sum_j \psi_i(X_j)$ have the desired convergence to $N(0, \Sigma)$ and in...
yields
$$\mathbb{E}_S[V(E)] \le (C + 2\gamma)\,\alpha(1-\alpha) + C(C-1)\,R(\rho_0, c_{\alpha/2}) \tag{99}$$
A.5 Proof of Proposition 7
Proof. This is the proof of Proposition 7. Let $S$ be egalitarian. For any finite sample $N$, the pairwise correlation between $T_i$ and $T_j$ is bounded above by
$$|\rho_{ij}| \le \frac{\omega_{ij}(N)}{\sqrt{r(N)^2}} \tag{100}$$
$$= \frac{\omega_{ij}(N)}{r(N)} \tag{101}$$
$$= \frac{N \omega_{ij}}{N r(N)} \tag{102}$$
$$= \frac{|D_i \cap D_j|}{N r(N)}, \tag{103}$$
so that $\sum_{i \neq j \le C} P(|\rho_{ij}| \ge$ ...