Minimum Specification Perturbation: Robustness as Distance-to-Falsification in Causal Inference
Pith reviewed 2026-05-09 17:57 UTC · model grok-4.3
The pith
Minimum Specification Perturbation measures the fewest analyst decision changes needed to make a causal claim's confidence interval contain zero.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
MSP is the minimum number of analyst decisions that must be altered to reach a specification whose confidence interval contains zero. It is small under the null, grows with effect strength, and supplies distance-to-falsification information that dispersion-based summaries cannot provide. Fragility Index and MSP assess orthogonal vulnerabilities, because sensitivity to data points need not match sensitivity to specification choices. On the LaLonde benchmark MSP equals one: a single decision change suffices to make the confidence interval contain zero. Exact permutation calibration is available under randomization, computation is tractable for additive structures, and the problem is NP-hard in general.
What carries the argument
Minimum Specification Perturbation (MSP), the smallest number of changes across a countable set of analyst decisions to produce a specification whose confidence interval contains zero.
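The definition can be written out in symbols; the notation below is ours (hypothetical), not the paper's, but each piece is fixed by the definition above:

```latex
% s^* = the analyst's reported specification,
% S = D_1 \times \cdots \times D_K = all combinations of the K discrete decisions,
% CI_\alpha(s; \text{data}) = the confidence interval induced by specification s.
\mathrm{MSP}(\text{data})
  = \min_{s \in S} \; d_H(s, s^\ast)
  \quad \text{s.t.} \quad 0 \in \mathrm{CI}_\alpha(s; \text{data}),
\qquad
d_H(s, s^\ast) = \sum_{k=1}^{K} \mathbf{1}\{s_k \neq s^\ast_k\}.
```

Here $d_H$ is the Hamming distance counting how many decisions differ from the reported specification.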
Load-bearing premise
Analyst decisions can be represented as discrete, independent, countable changes whose effects on the confidence interval admit an exhaustive or efficient search that does not itself bias the minimum count.
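When the decision space is small enough, this premise licenses a brute-force search. A minimal sketch, with all names hypothetical — `ci` stands in for whatever interval estimator a given specification induces:

```python
from itertools import product

def msp(reported, decision_space, ci):
    """Minimum Specification Perturbation by exhaustive search.

    reported       -- tuple of the analyst's chosen options, one per decision
    decision_space -- list of option lists, one per decision
    ci             -- function mapping a specification tuple to (lo, hi)
    """
    best = None
    for spec in product(*decision_space):
        lo, hi = ci(spec)
        if lo <= 0.0 <= hi:  # CI contains zero: a falsifying specification
            dist = sum(a != b for a, b in zip(spec, reported))
            if best is None or dist < best:
                best = dist
    return best  # None if no specification's CI contains zero

# Toy decision space: two binary decisions, each flip shifts the interval.
space = [[0, 1], [0, 1]]
toy_ci = lambda s: (1.5 - s[0] - s[1], 2.5 - s[0] - s[1])
print(msp((0, 0), space, toy_ci))  # → 2
```

This enumeration is exponential in the number of decisions, which is consistent with the paper's NP-hardness claim; the additive-structure cases it describes as tractable would replace the full product search.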
What would settle it
A Monte Carlo experiment in which MSP stays constant or decreases as the true effect size increases would falsify the claimed growth property.
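Such an experiment could be sketched as follows. The model is our own stylized stand-in, not the paper's design: each of `k` binary decisions, when flipped, pulls the point estimate toward zero by a fixed `shift`, and the 95% CI half-width comes from simulated sampling error.

```python
import math
import random
import statistics

def simulate_msp(tau, n=200, k=6, shift=0.3, reps=200, seed=0):
    """Average MSP across Monte Carlo replications for a true effect tau.

    Stylized model (our construction): flipping a decision moves the
    estimate toward zero by `shift`; MSP is the fewest flips m such that
    the shifted CI [est - m*shift - half, est - m*shift + half] covers 0.
    """
    rng = random.Random(seed)
    totals = []
    for _ in range(reps):
        draws = [tau + rng.gauss(0, 1) for _ in range(n)]
        est = statistics.fmean(draws)
        half = 1.96 * statistics.stdev(draws) / math.sqrt(n)
        # fewest flips m with |est| - m*shift <= half, capped at k
        m = max(0, math.ceil((abs(est) - half) / shift))
        totals.append(min(m, k))
    return statistics.fmean(totals)

# The growth property predicts MSP near zero under the null (tau = 0)
# and increasing average MSP as the true effect strengthens.
print(simulate_msp(0.0), simulate_msp(0.5), simulate_msp(1.0))
```

If the averages failed to increase in `tau` under a realistic data-generating process, the growth claim would be in trouble; in this caricature they increase by construction, so the test bites only when the decision-to-estimate mapping is estimated rather than assumed.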
Original abstract
Empirical causal claims depend on many analyst decisions, from selecting covariates to choosing estimators. Existing robustness tools summarize how results vary across these choices, but, to the best of our knowledge, do not answer: How many analyst decisions must change to reach a specification (a set of choices) whose confidence interval (CI) contains zero? We introduce Minimum Specification Perturbation (MSP), the smallest such number of changes. MSP is small under the null, grows with effect strength, and captures distance-to-falsification information that dispersion-based summaries cannot report; when making decisions under weak effects, an MSP-based rule yields lower false-positive rates than dispersion-based rules. We show that Fragility Index and MSP measure orthogonal vulnerabilities: fragility to influential observations need not imply fragility to specification choices. On the LaLonde benchmark, MSP = 1 implies that one decision change makes the CI contain zero. We further provide exact permutation calibration under randomization and characterize computation, showing tractable cases under additive structure and NP-hardness in general.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces Minimum Specification Perturbation (MSP) as the smallest number of changes to analyst decisions (covariate selection, estimator choice, etc.) needed to produce a specification whose confidence interval contains zero. It claims MSP is small under the null, increases with effect strength, supplies distance-to-falsification information orthogonal to dispersion-based summaries and to the Fragility Index, and yields lower false-positive rates than dispersion rules when effects are weak. The paper supplies an exact permutation calibration under randomization, characterizes computation (tractable under additive structure, NP-hard in general), and reports MSP = 1 on the LaLonde benchmark.
Significance. If the central claims survive scrutiny, MSP would supply a directly interpretable, decision-count metric for robustness that existing tools do not provide. The orthogonality result and the reported FPR advantage under weak effects would be useful additions to the causal-inference robustness literature.
major comments (3)
- [Abstract and computation/calibration section] The data-dependent search for the argmin perturbation count risks downward bias in MSP because the optimization selects specifications on the basis of data-driven CI checks. The abstract states that exact permutation calibration is provided under randomization, but it is unclear whether this calibration fully corrects for the selection effect induced by the search itself (rather than merely for the randomization distribution of a fixed specification). This issue is load-bearing for the claims that MSP is small under the null and that an MSP-based rule achieves lower FPR than dispersion-based rules.
- [Abstract and results on orthogonality] The orthogonality claim between MSP and the Fragility Index is asserted but not demonstrated with a formal argument or explicit counter-example showing that fragility to influential observations need not imply fragility to specification choices (or vice versa). Without such a demonstration, the statement that the two measures capture orthogonal vulnerabilities remains unverified.
- [LaLonde benchmark paragraph] The LaLonde benchmark result that MSP = 1 is reported without accompanying detail on the size of the decision space searched, the enumeration procedure, or whether the search was exhaustive versus heuristic. Because the paper notes NP-hardness in general, the concrete value MSP = 1 cannot be assessed for robustness to search procedure without these specifics.
minor comments (2)
- [Abstract] The abstract uses the phrase 'exact permutation calibration' without defining the permutation scheme or the null distribution being calibrated; a brief parenthetical or footnote would improve readability.
- [Introduction] Notation for the set of analyst decisions and the perturbation operator is introduced without an explicit mathematical definition in the opening paragraphs; adding a short displayed equation would aid clarity.
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed comments. We address each major point below, indicating revisions that will be incorporated to improve clarity and rigor.
Point-by-point responses
Referee: The data-dependent search for the argmin perturbation count risks downward bias in MSP because the optimization selects specifications on the basis of data-driven CI checks. The abstract states that exact permutation calibration is provided under randomization, but it is unclear whether this calibration fully corrects for the selection effect induced by the search itself (rather than merely for the randomization distribution of a fixed specification). This issue is load-bearing for the claims that MSP is small under the null and that an MSP-based rule achieves lower FPR than dispersion-based rules.
Authors: We appreciate the referee's identification of this important distinction. The permutation calibration described in the manuscript applies the complete MSP procedure—including the data-dependent search over the specification space—to each randomly permuted treatment assignment. Consequently, the resulting null distribution of MSP already incorporates the selection effect from the optimization step. We will revise the calibration section to state this explicitly, include pseudocode illustrating the full-procedure permutation, and add a brief simulation confirming exact control. These changes directly support the claims about MSP under the null and the reported FPR advantage. revision: yes
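As we read the rebuttal, the calibration reruns the entire data-dependent search on each permuted assignment. A sketch under that reading, with a hypothetical `toy_msp` standing in for the real optimization:

```python
import random

def permutation_pvalue(y, t, msp_fn, n_perm=999, seed=0):
    """Permutation calibration that reruns the FULL MSP search per draw.

    Because msp_fn includes the data-dependent optimization, the null
    distribution it produces already reflects the search's selection effect.
    """
    rng = random.Random(seed)
    observed = msp_fn(y, t)
    hits = 0
    for _ in range(n_perm):
        t_perm = t[:]
        rng.shuffle(t_perm)           # re-randomize treatment labels
        if msp_fn(y, t_perm) >= observed:  # large MSP = evidence of an effect
            hits += 1
    return (hits + 1) / (n_perm + 1)  # add-one rule of Phipson & Smyth [23]

def toy_msp(y, t):
    # Stand-in for the real search: a monotone proxy (|difference in means|);
    # in practice this would be the full minimum-perturbation optimization.
    g1 = [yi for yi, ti in zip(y, t) if ti]
    g0 = [yi for yi, ti in zip(y, t) if not ti]
    return abs(sum(g1) / len(g1) - sum(g0) / len(g0))

y = [0.0] * 10 + [3.0] * 10   # strong separation between arms
t = [0] * 10 + [1] * 10
print(permutation_pvalue(y, t, toy_msp))  # small p under a strong effect
```

The key design point, matching the authors' reply, is that `msp_fn` is called inside the permutation loop, so the calibration targets the distribution of the entire procedure rather than of a fixed specification.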
Referee: The orthogonality claim between MSP and the Fragility Index is asserted but not demonstrated with a formal argument or explicit counter-example showing that fragility to influential observations need not imply fragility to specification choices (or vice versa). Without such a demonstration, the statement that the two measures capture orthogonal vulnerabilities remains unverified.
Authors: We agree that an explicit demonstration would strengthen the orthogonality statement. While the manuscript notes that MSP and the Fragility Index target distinct vulnerabilities (specification choices versus observation influence), we will add a short subsection containing a formal argument based on the respective definitions under randomization and a concrete numerical counter-example. One example will show a dataset with low Fragility Index yet MSP = 1; the reverse case will also be presented. This material will be placed in the section discussing relationships to existing robustness measures. revision: yes
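The promised counter-example could take roughly this shape (our construction, not the authors'): a pooled two-site trial whose effect lives entirely in one site. The pooled result is hard to overturn by flipping individual outcomes (large Fragility Index), yet a single sample-restriction decision yields a null result (MSP = 1). The helpers below are hypothetical; the p-value uses a normal-approximation two-proportion test.

```python
import math

def two_prop_p(s1, n1, s0, n0):
    """Two-sided p-value for a two-proportion z-test (normal approximation)."""
    p1, p0 = s1 / n1, s0 / n0
    p = (s1 + s0) / (n1 + n0)
    se = math.sqrt(p * (1 - p) * (1 / n1 + 1 / n0))
    if se == 0:
        return 1.0
    z = (p1 - p0) / se
    return math.erfc(abs(z) / math.sqrt(2))  # = 2 * (1 - Phi(|z|))

def fragility_index(s1, n1, s0, n0, alpha=0.05):
    """Walsh-style FI: fewest treated-arm event flips to lose significance."""
    flips = 0
    while two_prop_p(s1, n1, s0, n0) < alpha and s1 > 0:
        s1 -= 1   # flip one treated success to a failure
        flips += 1
    return flips

site_a = (28, 30, 4, 30)    # treated successes/n, control successes/n
site_b = (10, 30, 10, 30)   # no effect in this site
pooled = (38, 60, 14, 60)   # the reported specification pools both sites

# Large FI (robust to outcome flips) but MSP = 1: the single decision
# "restrict the sample to site B" makes the CI contain zero.
print(fragility_index(*pooled), two_prop_p(*site_b))
```

The divergence is the point: observation-level robustness (FI) and specification-level robustness (MSP) answer different questions, so neither bounds the other.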
Referee: The LaLonde benchmark result that MSP = 1 is reported without accompanying detail on the size of the decision space searched, the enumeration procedure, or whether the search was exhaustive versus heuristic. Because the paper notes NP-hardness in general, the concrete value MSP = 1 cannot be assessed for robustness to search procedure without these specifics.
Authors: We accept this observation and will expand the LaLonde paragraph accordingly. The revision will report the exact size of the decision space (number of covariate subsets and estimator variants), confirm that an exhaustive enumeration was feasible and performed for this benchmark, and note that the instance falls within the additive-structure cases shown to be tractable in the computation section. These details will allow readers to evaluate the reported MSP = 1 in context. revision: yes
Circularity Check
No significant circularity; MSP defined directly as min decision changes to falsify CI
full rationale
The paper defines MSP explicitly as the smallest number of analyst decision changes required to reach a specification whose CI contains zero. This is a direct combinatorial distance measure, not a fitted parameter, not renamed from a known result, and not justified by self-citation chains. Claims that MSP is small under the null and grows with effect strength follow from the definition plus external benchmark evaluation (LaLonde) and permutation calibration; no equation reduces the output to the input by construction. Computation is separately characterized as NP-hard in general with additive-structure tractability, without circularity in the derivation. The central claim remains independent of the search procedure's data dependence, which is noted but not load-bearing for the definition itself.
Reference graph
Works this paper leans on
- [1] Susan Athey and Guido Imbens. A measure of robustness to misspecification. American Economic Review, 105(5):476–480, 2015.
- [2] Barry Becker and Ronny Kohavi. Adult. UCI Machine Learning Repository, 1996. DOI: https://doi.org/10.24432/C5XW20.
- [3] Tamara Broderick, Ryan Giordano, and Rachael Meager. An automatic finite-sample robustness metric: When can dropping a little data make a big difference? arXiv preprint arXiv:2011.14999, 2020.
- [4] Nicholas Carlini and David Wagner. Towards evaluating the robustness of neural networks. In 2017 IEEE Symposium on Security and Privacy (SP), pages 39–57. IEEE, 2017.
- [5] Carlos Cinelli and Chad Hazlett. Making sense of sensitivity: Extending omitted variable bias. Journal of the Royal Statistical Society Series B: Statistical Methodology, 82(1):39–67, 2020.
- [6] Rajeev H. Dehejia and Sadek Wahba. Causal effects in nonexperimental studies: Reevaluating the evaluation of training programs. Journal of the American Statistical Association, 94(448):1053–1062, 1999.
- [7] Javid Ebrahimi, Anyi Rao, Daniel Lowd, and Dejing Dou. HotFlip: White-box adversarial examples for text classification. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 31–36, 2018.
- [8] Michael R. Garey and David S. Johnson. Computers and Intractability, volume 29. W. H. Freeman, New York, 2002.
- [9] Andrew Gelman and Eric Loken. The garden of forking paths: Why multiple comparisons can be a problem, even when there is no "fishing expedition" or "p-hacking" and the research hypothesis was posited ahead of time. Department of Statistics, Columbia University, 348(1-17):3, 2013.
- [10] Ian J. Goodfellow, Jonathon Shlens, and Christian Szegedy. Explaining and harnessing adversarial examples. arXiv preprint arXiv:1412.6572, 2014.
- [11] Riccardo Guidotti. Counterfactual explanations and how to find them: Literature review and benchmarking. Data Mining and Knowledge Discovery, 38(5):2770–2824, 2024.
- [12] Gemma Hammerton and Marcus R. Munafò. Causal inference with observational data: The need for triangulation of evidence. Psychological Medicine, 51(4):563–578, 2021.
- [13] Frank R. Hampel. A general qualitative definition of robustness. The Annals of Mathematical Statistics, 42(6):1887–1896, 1971.
- [14] Guido Imbens and Yiqing Xu. LaLonde (1986) after nearly four decades: Lessons learned. arXiv preprint arXiv:2406.00827, 2024.
- [15] Guido W. Imbens. Sensitivity to exogeneity assumptions in program evaluation. American Economic Review, 93(2):126–132, 2003.
- [16] Amir-Hossein Karimi, Bernhard Schölkopf, and Isabel Valera. Algorithmic recourse: From counterfactual explanations to interventions. In Proceedings of the 2021 ACM Conference on Fairness, Accountability, and Transparency, pages 353–362, 2021.
- [17] Guang-He Lee, Yang Yuan, Shiyu Chang, and Tommi Jaakkola. Tight certificates of adversarial robustness for randomly smoothed classifiers. Advances in Neural Information Processing Systems, 32, 2019.
- [18] Aleksander Madry, Aleksandar Makelov, Ludwig Schmidt, Dimitris Tsipras, and Adrian Vladu. Towards deep learning models resistant to adversarial attacks. arXiv preprint arXiv:1706.06083, 2017.
- [19] Matthew A. Masten and Alexandre Poirier. Inference on breakdown frontiers. Quantitative Economics, 11(1):41–111, 2020.
- [20] Emily Oster. Unobservable selection and coefficient stability: Theory and evidence. Journal of Business & Economic Statistics, 37(2):187–204, 2019.
- [21] Chirag J. Patel, Belinda Burford, and John P. A. Ioannidis. Assessment of vibration of effects due to model specification can demonstrate the instability of observational associations. Journal of Clinical Epidemiology, 68(9):1046–1058, 2015.
- [22] Martin Pawelczyk, Chirag Agarwal, Shalmali Joshi, Sohini Upadhyay, and Himabindu Lakkaraju. Exploring counterfactual explanations through the lens of adversarial examples: A theoretical and empirical analysis. In International Conference on Artificial Intelligence and Statistics, pages 4574–4594. PMLR, 2022.
- [23] Belinda Phipson and Gordon K. Smyth. Permutation p-values should never be zero: Calculating exact p-values when permutations are randomly drawn. arXiv preprint arXiv:1603.05766, 2016.
- [24] Ashesh Rambachan and Jonathan Roth. A more credible approach to parallel trends. Review of Economic Studies, 90(5):2555–2591, 2023.
- [25] Paul R. Rosenbaum. Observational studies. In Observational Studies, pages 1–17. Springer, 2002.
- [26] Raphael Silberzahn, Eric L. Uhlmann, Daniel P. Martin, Pasquale Anselmi, Frederik Aust, Eli Awtrey, Štěpán Bahník, Feng Bai, Colin Bannard, Evelina Bonnier, et al. Many analysts, one data set: Making transparent how variations in analytic choices affect results. Advances in Methods and Practices in Psychological Science, 1(3):337–356, 2018.
- [27] Joseph P. Simmons, Leif D. Nelson, and Uri Simonsohn. False-positive psychology: Undisclosed flexibility in data collection and analysis allows presenting anything as significant. Psychological Science, 22(11):1359–1366, 2011.
- [28] Uri Simonsohn, Joseph P. Simmons, and Leif D. Nelson. Specification curve analysis. Nature Human Behaviour, 4(11):1208–1214, 2020.
- [29] Sara Steegen, Francis Tuerlinckx, Andrew Gelman, and Wolf Vanpaemel. Increasing transparency through a multiverse analysis. Perspectives on Psychological Science, 11(5):702–712, 2016.
- [30] Berk Ustun, Alexander Spangher, and Yang Liu. Actionable recourse in linear classification. In Proceedings of the Conference on Fairness, Accountability, and Transparency, pages 10–19, 2019.
- [31] Tyler J. VanderWeele and Peng Ding. Sensitivity analysis in observational research: Introducing the E-value. Annals of Internal Medicine, 167(4):268–274, 2017.
- [32] Sandra Wachter, Brent Mittelstadt, and Chris Russell. Counterfactual explanations without opening the black box: Automated decisions and the GDPR. Harv. JL & Tech., 31:841, 2017.
- [33] Michael Walsh, Sadeesh K. Srinathan, Daniel F. McAuley, Marko Mrkobrada, Oren Levine, Christine Ribic, Amber O. Molnar, Neil D. Dattani, Andrew Burke, Gordon Guyatt, et al. The statistical significance of randomized controlled trial results is frequently fragile: A case for a fragility index. Journal of Clinical Epidemiology, 67(6):622–628, 2014.