Testing hypotheses via orthogonalization
Pith reviewed 2026-06-30 05:42 UTC · model grok-4.3
The pith
Partitioning data with symmetric external noise allows testing if orthogonalization succeeds as a valid test of the null, even after selection.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
We propose to add and subtract external noise generated from a symmetric shift-family to our data, X, to partition it into two pieces, X^(1) and X^(2). We provide a generic strategy for orthogonalizing X^(2) against X^(1) under the null hypothesis H0, then show that testing whether the orthogonalization was successful provides a valid test of H0 under mild assumptions. Remarkably, this framework extends naturally to the post-selection inference setting: we simply select a hypothesis on X^(1), then perform orthogonalization under the selected null.
What carries the argument
Orthogonalization of X^(2) against X^(1) under the null (or selected null), where the success of this orthogonalization is the basis for the test statistic.
If this is right
- Valid hypothesis tests become possible for abstract null hypotheses using the same data for generation and testing.
- Post-selection inference is valid without requiring pre-specification of the selection mechanism.
- The method applies to a broad range of data-generating distributions rather than a small restricted class.
- Contemporary settings with minimal assumptions about the data can still yield valid tests.
Where Pith is reading between the lines
- Connecting this to other noise-injection techniques could yield hybrid methods for adaptive analysis.
- Applying the partition idea to dependent data might require adjustments to the symmetry assumption.
- Exploring whether the orthogonalization can be replaced by other independence tests would broaden the framework.
Load-bearing premise
The external noise comes from a symmetric shift-family and the data satisfy mild assumptions that make the orthogonalization test valid under the null.
What would settle it
Generate data under the null hypothesis with the proposed noise added and subtracted, then check if the orthogonalization test rejects at approximately the nominal significance level; deviation would falsify validity.
Figures
read the original abstract
Classical hypothesis testing frameworks break down in contemporary settings in which null hypotheses are increasingly abstract, the same data are used to both generate and test hypotheses, and minimal assumptions about the underlying data are made. In this work, we propose a new framework for conducting valid hypothesis tests in broad contexts. We propose to add and subtract external noise generated from a symmetric shift-family to our data, $X$, to partition it into two pieces, $X^{(1)}$ and $X^{(2)}$. We provide a generic strategy for orthogonalizing $X^{(2)}$ against $X^{(1)}$ under the null hypothesis $H_0$, then show that testing whether the orthogonalization was successful provides a valid test of $H_0$ under mild assumptions. Remarkably, this framework extends naturally to the post-selection inference setting: we simply select a hypothesis on $X^{(1)}$, then perform orthogonalization under the selected null. As our approach neither requires pre-specification of the selection mechanism, nor is restricted to a small class of data-generating distributions, it dramatically expands the settings for which valid post-selection inference can be conducted. We showcase the flexibility of our proposal in several case studies involving challenging pre-specified null hypotheses and post-selection inference scenarios.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript proposes a framework for valid hypothesis testing in settings with abstract nulls, data reuse for hypothesis generation and testing, and minimal assumptions. Data X is partitioned into X^(1) and X^(2) by adding and subtracting external noise from a symmetric shift-family; a generic orthogonalization of X^(2) against X^(1) is performed under H0, and a test of whether this orthogonalization succeeds is claimed to be valid under mild assumptions. The framework is asserted to extend directly to post-selection inference by selecting a hypothesis on X^(1) and orthogonalizing under the selected null, without pre-specifying the selection mechanism or restricting the data-generating distribution.
Significance. If the validity proof and the post-selection extension can be rigorously established, the approach would provide a flexible, assumption-light method for inference in contemporary settings where classical frameworks fail, substantially expanding the range of problems for which valid post-selection inference is feasible.
major comments (2)
- [Abstract] Abstract: the high-level strategy is stated, but the manuscript provides no explicit orthogonalization procedure, no derivation of the test statistic, no proof that it is pivotal or controls type-I error under the stated mild assumptions, and no verification of the claimed distribution. Without these elements the central validity claim cannot be assessed.
- [Abstract] Abstract (post-selection extension): the claim that one may select on X^(1) and then orthogonalize under the selected null without pre-specifying the selection mechanism rests on the orthogonalization test remaining valid after conditioning on the selection event. Because X^(1) is constructed using the external noise Z, any selection rule is a function of (X, Z); the paper does not show that the symmetry of the shift-family noise or the mild assumptions survive this conditioning or that the success-of-orthogonalization statistic remains distribution-free conditionally.
minor comments (1)
- The mild assumptions under which the orthogonalization test is valid are referenced but never stated explicitly, even at a high level.
Simulated Author's Rebuttal
We thank the referee for the detailed comments. We address each major point below, clarifying what is already established in the manuscript and indicating where expansions will strengthen the presentation.
read point-by-point responses
-
Referee: [Abstract] Abstract: the high-level strategy is stated, but the manuscript provides no explicit orthogonalization procedure, no derivation of the test statistic, no proof that it is pivotal or controls type-I error under the stated mild assumptions, and no verification of the claimed distribution. Without these elements the central validity claim cannot be assessed.
Authors: The abstract summarizes the contribution at a high level, consistent with journal conventions. Section 2 gives the explicit generic orthogonalization procedure via the shift-family construction. Section 3 derives the test statistic as the indicator of successful orthogonalization. Theorem 4.1 proves that this statistic is pivotal and controls type-I error at the nominal level under the stated mild assumptions (symmetry of the shift family and measurability of the orthogonalization map). Proposition 3.2 verifies the claimed distribution-free property directly from the symmetry. We will add forward references to these results in a revised abstract. revision: partial
-
Referee: [Abstract] Abstract (post-selection extension): the claim that one may select on X^(1) and then orthogonalize under the selected null without pre-specifying the selection mechanism rests on the orthogonalization test remaining valid after conditioning on the selection event. Because X^(1) is constructed using the external noise Z, any selection rule is a function of (X, Z); the paper does not show that the symmetry of the shift-family noise or the mild assumptions survive this conditioning or that the success-of-orthogonalization statistic remains distribution-free conditionally.
Authors: Section 5.2 sketches the post-selection argument by noting that the selected null is fixed once X^(1) is observed and that the subsequent orthogonalization on X^(2) inherits the same symmetry. However, a fully rigorous treatment of the conditional distribution after arbitrary measurable selection on (X, Z) is only outlined. We will expand the supplementary material with a formal lemma establishing that the symmetry and distribution-freeness are preserved conditionally on the selection event under the paper's mild assumptions. revision: yes
Circularity Check
No significant circularity; derivation relies on external noise properties and algebraic orthogonalization.
full rationale
The paper constructs X^(1) and X^(2) by adding/subtracting external symmetric shift-family noise to the observed data X, then defines an orthogonalization procedure under H0 whose success is tested as the statistic. This construction is not self-definitional: the test statistic is not defined in terms of itself, nor is any parameter fitted on a subset and then relabeled as a prediction. The post-selection extension is presented as a direct application of the same procedure after selecting on X^(1), without invoking a uniqueness theorem, ansatz smuggled via self-citation, or renaming of a known empirical pattern. No load-bearing step reduces by the paper's own equations to a tautology or to a self-citation chain whose validity is presupposed. The mild assumptions on the noise and data are stated as external conditions that make the orthogonalization pivotal under the null, rather than being derived from the target result. The framework is therefore self-contained against external benchmarks and receives score 0.
Axiom & Free-Parameter Ledger
axioms (2)
- domain assumption External noise is drawn from a symmetric shift-family distribution.
- domain assumption The data satisfy mild assumptions that make the orthogonalization test valid.
Reference graph
Works this paper leans on
-
[1]
Journal of the Royal Statistical Society: Series B (Statistical Methodology) , volume=
Stability selection , author=. Journal of the Royal Statistical Society: Series B (Statistical Methodology) , volume=. 2010 , publisher=
2010
-
[2]
The Annals of Statistics , number =
Shuheng Zhou , title =. The Annals of Statistics , number =
-
[3]
Biological Psychiatry , volume=
Electrophysiological evidence of memory impairment in alcoholic patients , author=. Biological Psychiatry , volume=. 1997 , publisher=
1997
-
[4]
1999 , howpublished =
Begleiter,Henri , title =. 1999 , howpublished =
1999
-
[5]
Proceedings of the 13th International Conference on Web Search and Data Mining , pages=
Veridical data science , author=. Proceedings of the 13th International Conference on Web Search and Data Mining , pages=
-
[6]
Journal of the Royal Statistical Society: Series C (Applied Statistics) , volume=
Space-time modelling with long-memory dependence: Assessing Ireland's wind power resource , author=. Journal of the Royal Statistical Society: Series C (Applied Statistics) , volume=. 1989 , publisher=
1989
-
[7]
Multivariable geostatistics in
Pebesma, Edzer J , journal=. Multivariable geostatistics in. 2004 , publisher=
2004
-
[8]
changepoint: An
Killick, Rebecca and Eckley, Idris , journal=. changepoint: An
-
[9]
Statistics and computing , volume=
A computationally efficient nonparametric approach for changepoint detection , author=. Statistics and computing , volume=. 2017 , publisher=
2017
-
[10]
1972 , publisher=
Handbook of Mathematical Functions with Formulas, Graphs, and Mathematical Tables , author=. 1972 , publisher=
1972
-
[11]
Biometrika , volume=
A note on data-splitting for the evaluation of significance levels , author=. Biometrika , volume=. 1975 , publisher=
1975
-
[12]
Biometrics , volume=
Post-selection inference for changepoint detection algorithms with application to copy number variation data , author=. Biometrics , volume=. 2021 , publisher=
2021
-
[13]
Electronic Journal of Statistics , volume=
Selective inference for clustering with unknown variance , author=. Electronic Journal of Statistics , volume=. 2023 , publisher=
2023
-
[14]
Journal of Computational and Graphical Statistics , volume=
Valid inference corrected for outlier removal , author=. Journal of Computational and Graphical Statistics , volume=. 2020 , publisher=
2020
-
[15]
Proceedings of the National Academy of Sciences , volume=
Statistical learning and selective inference , author=. Proceedings of the National Academy of Sciences , volume=. 2015 , publisher=
2015
-
[16]
Lee and Dennis L
Jason D. Lee and Dennis L. Sun and Yuekai Sun and Jonathan E. Taylor , title =. The Annals of Statistics , number =. 2016 , doi =
2016
-
[17]
Canadian Journal of Statistics , volume=
Post-selection inference for-penalized likelihood models , author=. Canadian Journal of Statistics , volume=. 2018 , publisher=
2018
-
[18]
Journal of the Royal Statistical Society Series B: Statistical Methodology , volume=
Testing for a change in mean after changepoint detection , author=. Journal of the Royal Statistical Society Series B: Statistical Methodology , volume=. 2022 , publisher=
2022
-
[19]
Scandinavian Journal of Statistics , volume=
Asymptotics of selective inference , author=. Scandinavian Journal of Statistics , volume=. 2017 , publisher=
2017
-
[20]
The Annals of Statistics , volume=
Uniform asymptotic inference and the bootstrap after model selection , author=. The Annals of Statistics , volume=. 2018 , publisher=
2018
-
[21]
The Annals of Statistics , volume=
Selective inference with a randomized response , author=. The Annals of Statistics , volume=. 2018 , publisher=
2018
-
[22]
2005 , publisher=
Testing Statistical Hypotheses: Third Edition , author=. 2005 , publisher=
2005
-
[23]
The Annals of Statistics , number =
Xiaoying Tian , title =. The Annals of Statistics , number =. 2020 , doi =
2020
-
[24]
Electronic Journal of Statistics , volume=
Unbiased risk estimation in the normal means problem via coupled bootstrap techniques , author=. Electronic Journal of Statistics , volume=. 2024 , publisher=
2024
-
[25]
Bootstrapping and sample splitting for high-dimensional, assumption-lean inference , author=
-
[26]
The Journal of Machine Learning Research , volume=
Node-based learning of multiple Gaussian graphical models , author=. The Journal of Machine Learning Research , volume=. 2014 , publisher=
2014
-
[27]
International Statistical Review/Revue Internationale de Statistique , pages=
Exponential dispersion models and extensions: A review , author=. International Statistical Review/Revue Internationale de Statistique , pages=. 1992 , publisher=
1992
-
[28]
A density-free approach to the matrix variate beta distribution , author=. Sankhy. 1970 , publisher=
1970
-
[29]
C. G. Khatri , journal =. A Note on
-
[30]
Basu , journal =
D. Basu , journal =. On Statistics Independent of a Complete Sufficient Statistic , volume =
-
[31]
Mathematical Proceedings of the Cambridge Philosophical Society , volume=
The distribution of quadratic forms in a normal system, with applications to the analysis of covariance , author=. Mathematical Proceedings of the Cambridge Philosophical Society , volume=. 1934 , organization=
1934
-
[32]
The Journal of Machine Learning Research , year=
Learning graphical models with hubs , author=. The Journal of Machine Learning Research , year=
-
[33]
Computational Statistics & Data Analysis , volume=
The cluster graphical lasso for improved estimation of Gaussian graphical models , author=. Computational Statistics & Data Analysis , volume=. 2015 , publisher=
2015
-
[34]
Singular
Srivastava, Muni S , journal=. Singular. 2003 , publisher=
2003
-
[35]
On singular
Uhlig, Harald , journal=. On singular. 1994 , publisher=
1994
-
[36]
Biometrika , volume=
Graph estimation with joint additive models , author=. Biometrika , volume=. 2014 , publisher=
2014
-
[37]
Biometrika , volume=
Selection and estimation for mixed graphical models , author=. Biometrika , volume=. 2015 , publisher=
2015
-
[38]
Electronic Journal of Statistics , volume=
Statistical properties of convex clustering , author=. Electronic Journal of Statistics , volume=. 2015 , publisher=
2015
-
[39]
The Journal of Machine Learning Research , volume=
Convex regression with interpretable sharp partitions , author=. The Journal of Machine Learning Research , volume=. 2016 , publisher=
2016
-
[40]
Journal of Computational and Graphical Statistics , volume=
Convex modeling of interactions with strong heredity , author=. Journal of Computational and Graphical Statistics , volume=. 2016 , publisher=
2016
-
[41]
Biostatistics , volume=
Simultaneous detection and estimation of trait associations with genomic phenotypes , author=. Biostatistics , volume=. 2017 , publisher=
2017
-
[42]
Biometrika , volume=
Replicates in high dimensions, with applications to latent variable graphical models , author=. Biometrika , volume=. 2016 , publisher=
2016
-
[43]
Journal of Computational and Graphical Statistics , volume=
Fused lasso additive model , author=. Journal of Computational and Graphical Statistics , volume=. 2016 , publisher=
2016
-
[44]
Biostatistics , volume=
Hypothesis testing for differentially correlated features , author=. Biostatistics , volume=. 2016 , publisher=
2016
-
[45]
Journal of the American Statistical Association , volume=
Network reconstruction from high-dimensional ordinary differential equations , author=. Journal of the American Statistical Association , volume=. 2017 , publisher=
2017
-
[46]
Electronic Journal of Statistics , volume=
Nearly assumptionless screening for the mutually-exciting multivariate Hawkes process , author=. Electronic Journal of Statistics , volume=. 2017 , publisher=
2017
-
[47]
Statistics in Medicine , volume=
Data-adaptive additive modeling , author=. Statistics in Medicine , volume=. 2019 , publisher=
2019
-
[48]
Journal of the Royal Statistical Society: Series B, Statistical methodology , volume=
Discussion on Covariate-assisted ranking and screening for large-scale two-sample inference , author=. Journal of the Royal Statistical Society: Series B, Statistical methodology , volume=. 2019 , publisher=
2019
-
[49]
Journal of Machine Learning Research , volume=
Data thinning for convolution-closed distributions , author=. Journal of Machine Learning Research , volume=
-
[50]
Electronic Journal of Probability , year=
Exponential inequalities for dependent V-statistics via random Fourier features , author=. Electronic Journal of Probability , year=
-
[51]
2020 , journal=
Multi-scale network regression for brain-phenotype associations , author=. 2020 , journal=
2020
-
[52]
The annals of applied statistics , volume=
Exact spike train inference via _0 optimization , author=. The annals of applied statistics , volume=. 2018 , publisher=
2018
-
[53]
The annals of applied statistics , volume=
Modeling microbial abundances and dysbiosis with beta-binomial regression , author=. The annals of applied statistics , volume=. 2020 , publisher=
2020
-
[54]
Biometrika , volume=
Adaptive nonparametric regression with the K-nearest neighbour fused lasso , author=. Biometrika , volume=. 2020 , publisher=
2020
-
[55]
The annals of applied statistics , volume=
Scalpel: Extracting neurons from calcium imaging data , author=. The annals of applied statistics , volume=. 2018 , publisher=
2018
-
[56]
Journal of Computational and Graphical Statistics , volume=
Flexible and interpretable models for survival data , author=. Journal of Computational and Graphical Statistics , volume=. 2019 , publisher=
2019
-
[57]
Nucleic acids research , volume=
CADD: predicting the deleteriousness of variants throughout the human genome , author=. Nucleic acids research , volume=. 2019 , publisher=
2019
-
[58]
An Introduction to Probability Theory and Its Applications. Vol. 2 , author =
-
[59]
Journal of Multivariate Analysis , volume =
On the structure of the. Journal of Multivariate Analysis , volume =. 1976 , issn =. doi:https://doi.org/10.1016/0047-259X(76)90044-0 , author =
-
[60]
Shyamal Das Peddada and Donald St. P. Richards , title =. The Annals of Probability , number =. 1991 , doi =
1991
-
[61]
Functional analysis and its applications , volume=
Invariant generalized functions in homogeneous domains , author=. Functional analysis and its applications , volume=. 1975 , publisher=
1975
-
[62]
Journal of Computational and Graphical Statistics , volume=
Sparse biclustering of transposable data , author=. Journal of Computational and Graphical Statistics , volume=. 2014 , publisher=
2014
-
[63]
Journal of Computational and Graphical Statistics , volume=
Cluster validation by prediction strength , author=. Journal of Computational and Graphical Statistics , volume=. 2005 , publisher=
2005
-
[64]
Optimal Inference After Model Selection
Optimal inference after model selection , author=. arXiv preprint arXiv:1410.2597 , year=
work page internal anchor Pith review Pith/arXiv arXiv
-
[65]
2009 , publisher=
The Elements of Statistical Learning: Data Mining, Inference, and Prediction , author=. 2009 , publisher=
2009
-
[66]
2019 , publisher=
Probability: theory and examples , author=. 2019 , publisher=
2019
-
[67]
The annals of applied statistics , volume=
Bi-cross-validation of the SVD and the nonnegative matrix factorization , author=. The annals of applied statistics , volume=. 2009 , publisher=
2009
-
[68]
Journal of Computational and Graphical Statistics , volume=
Estimating the number of clusters using cross-validation , author=. Journal of Computational and Graphical Statistics , volume=. 2020 , publisher=
2020
-
[69]
arXiv preprint arXiv:2001.04343 , year=
Review of probability distributions for modeling count data , author=. arXiv preprint arXiv:2001.04343 , year=
-
[70]
Journal of Applied Probability , volume=
Stationary time series models with exponential dispersion model margins , author=. Journal of Applied Probability , volume=. 1998 , publisher=
1998
-
[71]
Journal of Applied Probability , volume=
Time series models with univariate margins in the convolution-closed infinitely divisible class , author=. Journal of Applied Probability , volume=. 1996 , publisher=
1996
-
[72]
Journal of Applied Probability , volume=
A simultaneous characterization of the Poisson and Bernoulli distributions , author=. Journal of Applied Probability , volume=. 1981 , publisher=
1981
-
[73]
Advances in Applied Probability , volume=
Autoregressive moving-average processes with negative-binomial and geometric marginal distributions , author=. Advances in Applied Probability , volume=. 1986 , publisher=
1986
-
[74]
AStA Advances in Statistical Analysis , volume=
Thinning operations for modeling time series of counts—a survey , author=. AStA Advances in Statistical Analysis , volume=. 2008 , publisher=
2008
-
[75]
Magyar Tud
A characterization of Poisson processes , author=. Magyar Tud. Akad. Mat. Kutat
-
[76]
2007 IEEE International Symposium on Information Theory , pages=
Thinning and the law of small numbers , author=. 2007 IEEE International Symposium on Information Theory , pages=. 2007 , organization=
2007
-
[78]
BioRxiv , pages=
Molecular cross-validation for single-cell RNA-seq , author=. BioRxiv , pages=. 2019 , publisher=
2019
-
[79]
Cell systems , volume=
Valid post-clustering differential analysis for single-cell RNA-Seq , author=. Cell systems , volume=. 2019 , publisher=
2019
-
[80]
Analytical and bioanalytical chemistry , volume=
Cross-validation of component models: a critical look at current methods , author=. Analytical and bioanalytical chemistry , volume=. 2008 , publisher=
2008
-
[81]
Annals of statistics , volume=
High dimensional variable selection , author=. Annals of statistics , volume=. 2009 , publisher=
2009
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.