pith. machine review for the scientific record.

arxiv: 2605.01579 · v1 · submitted 2026-05-02 · 📊 stat.ME · cs.LG

Recognition: unknown

Minimum Specification Perturbation: Robustness as Distance-to-Falsification in Causal Inference


Pith reviewed 2026-05-09 17:57 UTC · model grok-4.3

classification 📊 stat.ME cs.LG
keywords: minimum specification perturbation · causal inference · robustness · fragility index · specification sensitivity · distance to falsification · analyst decisions · confidence interval

The pith

Minimum Specification Perturbation measures the fewest analyst decision changes needed to make a causal claim's confidence interval contain zero.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper defines Minimum Specification Perturbation (MSP) as the smallest number of changes to modeling and estimation choices that would produce a confidence interval including zero. The metric is meant to reveal how close a result sits to being falsified by a reasonable alternative specification. Unlike measures that summarize variation across specifications, MSP directly quantifies the distance to a null-finding specification. It behaves as a robustness measure should: small when no true effect exists, larger when the effect is stronger. When effects are weak, decisions guided by MSP achieve lower false-positive rates than those using dispersion summaries of results.

Core claim

MSP is the minimum number of analyst decisions that must be altered to reach a specification whose confidence interval contains zero. It is small under the null, grows with effect strength, and supplies distance-to-falsification information that dispersion-based summaries cannot. Fragility Index and MSP assess orthogonal vulnerabilities, because sensitivity to individual data points need not match sensitivity to specification choices. On the LaLonde benchmark, MSP equals one: a single decision change suffices to make the confidence interval contain zero. Exact permutation calibration is available under randomization, computation is tractable for additive structures, and the problem is NP-hard in general.

What carries the argument

Minimum Specification Perturbation (MSP), the smallest number of changes across a countable set of analyst decisions to produce a specification whose confidence interval contains zero.

Load-bearing premise

Analyst decisions can be represented as a countable set of discrete, independent changes whose effects on the confidence interval admit an exhaustive or efficient search that does not itself bias the minimum count.
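
The following minimal sketch makes the presupposed search concrete. Everything here is illustrative: the `msp` name, the dict encoding of decisions, and the `ci_contains_zero` callback are assumptions of this review, not the paper's implementation, and the paper itself warns that this search is NP-hard in general.

```python
from itertools import combinations, product

def msp(base_spec, decision_options, ci_contains_zero):
    """Exhaustive MSP search (illustrative sketch, not the paper's code).

    base_spec        -- dict: each analyst decision -> the option actually chosen
    decision_options -- dict: each decision -> list of admissible options
    ci_contains_zero -- callable taking a spec dict, refitting the analysis,
                        and returning True if the resulting CI contains zero

    Returns the fewest decisions that must change to reach a specification
    whose CI contains zero, or None if no enumerated specification does.
    """
    decisions = list(base_spec)
    for k in range(len(decisions) + 1):                # 0 changes, then 1, ...
        for subset in combinations(decisions, k):      # which decisions to flip
            alternatives = [[o for o in decision_options[d] if o != base_spec[d]]
                            for d in subset]
            for alts in product(*alternatives):        # every joint reassignment
                spec = {**base_spec, **dict(zip(subset, alts))}
                if ci_contains_zero(spec):
                    return k                           # first hit is the minimum
    return None
```

Because k starts at zero, the base specification itself is checked first, so MSP = 0 exactly when the original confidence interval already contains zero; the worst case enumerates the full product space, which is why the paper's NP-hardness result matters.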

What would settle it

A Monte Carlo experiment in which MSP stays constant or decreases as the true effect size increases would falsify the claimed growth property.
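
A toy version of that experiment, runnable against the `msp` sketch above, might look like the following. The data-generating process, the trim/level decision space, and all names are assumptions of this review, not the paper's design.

```python
import numpy as np
from scipy import stats

def toy_ci_contains_zero(spec, y, t):
    """Normal-approximation CI for a difference in means under a toy spec:
    spec['trim'] symmetrically trims extreme outcomes, spec['level'] sets the level."""
    lo, hi = np.quantile(y, [spec['trim'], 1 - spec['trim']])
    keep = (y >= lo) & (y <= hi)
    a, b = y[keep & (t == 1)], y[keep & (t == 0)]
    diff = a.mean() - b.mean()
    se = np.sqrt(a.var(ddof=1) / len(a) + b.var(ddof=1) / len(b))
    z = stats.norm.ppf(0.5 + spec['level'] / 2)
    return diff - z * se <= 0 <= diff + z * se

# MSP should be 0 (or tiny) at tau = 0 and grow as tau increases; a run where
# it stays flat or shrinks with tau would falsify the growth property.
# None means no specification in this toy space overturns the result.
for tau in [0.0, 0.3, 0.8]:
    rng = np.random.default_rng(0)
    t = rng.integers(0, 2, size=400)
    y = tau * t + rng.normal(size=400)
    base = {'trim': 0.0, 'level': 0.95}
    opts = {'trim': [0.0, 0.05, 0.10], 'level': [0.95, 0.99]}
    print(tau, msp(base, opts, lambda s: toy_ci_contains_zero(s, y, t)))
```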

Figures

Figures reproduced from arXiv:2605.01579 by Hoang Dang, Luan Pham, Minh Nguyen.

Figure 1. LaLonde illustration. Fixing the outcome scale to raw earnings, the remaining …
Figure 2. ROC curves for MSP and threshold-based baseline summaries in the binary deci…
Original abstract

Empirical causal claims depend on many analyst decisions, from selecting covariates to choosing estimators. Existing robustness tools summarize how results vary across these choices, but, to the best of our knowledge, do not answer: how many analyst decisions must change to reach a specification (a set of choices) whose confidence interval (CI) contains zero? We introduce Minimum Specification Perturbation (MSP), the smallest such number of changes. MSP is small under the null, grows with effect strength, and captures distance-to-falsification information that dispersion-based summaries cannot report; when making decisions under weak effects, an MSP-based rule yields lower false-positive rates than dispersion-based rules. We show that Fragility Index and MSP measure orthogonal vulnerabilities: fragility to influential observations need not imply fragility to specification choices. On the LaLonde benchmark, MSP = 1 implies that one decision change makes the CI contain zero. We further provide exact permutation calibration under randomization and characterize computation, showing tractable cases under additive structure and NP-hardness in general.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The manuscript introduces Minimum Specification Perturbation (MSP) as the smallest number of changes to analyst decisions (covariate selection, estimator choice, etc.) needed to produce a specification whose confidence interval contains zero. It claims MSP is small under the null, increases with effect strength, supplies distance-to-falsification information orthogonal to dispersion-based summaries and to the Fragility Index, and yields lower false-positive rates than dispersion rules when effects are weak. The paper supplies an exact permutation calibration under randomization, characterizes computation (tractable under additive structure, NP-hard in general), and reports MSP = 1 on the LaLonde benchmark.

Significance. If the central claims survive scrutiny, MSP would supply a directly interpretable, decision-count metric for robustness that existing tools do not provide. The orthogonality result and the reported FPR advantage under weak effects would be useful additions to the causal-inference robustness literature.

major comments (3)
  1. [Abstract and computation/calibration section] The data-dependent search for the argmin perturbation count risks downward bias in MSP because the optimization selects specifications on the basis of data-driven CI checks. The abstract states that exact permutation calibration is provided under randomization, but it is unclear whether this calibration fully corrects for the selection effect induced by the search itself (rather than merely for the randomization distribution of a fixed specification). This issue is load-bearing for the claims that MSP is small under the null and that an MSP-based rule achieves lower FPR than dispersion-based rules.
  2. [Abstract and results on orthogonality] The orthogonality claim between MSP and the Fragility Index is asserted but not demonstrated with a formal argument or explicit counter-example showing that fragility to influential observations need not imply fragility to specification choices (or vice versa). Without such a demonstration, the statement that the two measures capture orthogonal vulnerabilities remains unverified.
  3. [LaLonde benchmark paragraph] The LaLonde benchmark result that MSP = 1 is reported without accompanying detail on the size of the decision space searched, the enumeration procedure, or whether the search was exhaustive versus heuristic. Because the paper notes NP-hardness in general, the concrete value MSP = 1 cannot be assessed for robustness to search procedure without these specifics.
minor comments (2)
  1. [Abstract] The abstract uses the phrase 'exact permutation calibration' without defining the permutation scheme or the null distribution being calibrated; a brief parenthetical or footnote would improve readability.
  2. [Introduction] Notation for the set of analyst decisions and the perturbation operator is introduced without an explicit mathematical definition in the opening paragraphs; adding a short displayed equation would aid clarity.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed comments. We address each major point below, indicating revisions that will be incorporated to improve clarity and rigor.

Point-by-point responses
  1. Referee: The data-dependent search for the argmin perturbation count risks downward bias in MSP because the optimization selects specifications on the basis of data-driven CI checks. The abstract states that exact permutation calibration is provided under randomization, but it is unclear whether this calibration fully corrects for the selection effect induced by the search itself (rather than merely for the randomization distribution of a fixed specification). This issue is load-bearing for the claims that MSP is small under the null and that an MSP-based rule achieves lower FPR than dispersion-based rules.

    Authors: We appreciate the referee's identification of this important distinction. The permutation calibration described in the manuscript applies the complete MSP procedure—including the data-dependent search over the specification space—to each randomly permuted treatment assignment. Consequently, the resulting null distribution of MSP already incorporates the selection effect from the optimization step. We will revise the calibration section to state this explicitly, include pseudocode illustrating the full-procedure permutation, and add a brief simulation confirming exact control. These changes directly support the claims about MSP under the null and the reported FPR advantage. revision: yes
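
On one reading of this response, the full-procedure calibration could be sketched as below, reusing the `msp` and `toy_ci_contains_zero` sketches above. The function name, the ≥ tail direction, and the treatment of "no falsifying spec" as infinite MSP are this review's assumptions; the +1 correction is Phipson and Smyth's never-zero convention (reference [23] below).

```python
import numpy as np

def permutation_calibrate(y, t, base, opts, ci_fn, n_perm=999, seed=1):
    """Full-procedure permutation null for MSP: rerun the ENTIRE search,
    including its data-dependent optimization, on each permuted treatment
    vector, so the null distribution inherits the search's selection effect."""
    as_num = lambda m: float('inf') if m is None else m  # inf = never falsified
    rng = np.random.default_rng(seed)
    observed = as_num(msp(base, opts, lambda s: ci_fn(s, y, t)))
    null = []
    for _ in range(n_perm):
        t_perm = rng.permutation(t)            # re-randomize under the sharp null
        null.append(as_num(msp(base, opts, lambda s: ci_fn(s, y, t_perm))))
    # How often does a null dataset look at least as robust as the real one?
    # The +1 keeps the permutation p-value away from zero (Phipson & Smyth).
    p = (1 + sum(m >= observed for m in null)) / (n_perm + 1)
    return observed, p
```

Under this scheme the observed MSP is compared against MSPs produced by the same selection-prone search on null data, which is exactly the correction the referee asked about.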

  2. Referee: The orthogonality claim between MSP and the Fragility Index is asserted but not demonstrated with a formal argument or explicit counter-example showing that fragility to influential observations need not imply fragility to specification choices (or vice versa). Without such a demonstration, the statement that the two measures capture orthogonal vulnerabilities remains unverified.

    Authors: We agree that an explicit demonstration would strengthen the orthogonality statement. While the manuscript notes that MSP and the Fragility Index target distinct vulnerabilities (specification choices versus observation influence), we will add a short subsection containing a formal argument based on the respective definitions under randomization and a concrete numerical counter-example. One example will show a dataset with low Fragility Index yet MSP = 1; the reverse case will also be presented. This material will be placed in the section discussing relationships to existing robustness measures. revision: yes
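
For context on the other axis of the comparison, the Fragility Index can be computed directly from a 2×2 trial table. The sketch below is one common reading of Walsh et al.'s definition (reference [33] below), flipping outcomes in the treated arm; it is not the authors' promised counter-example.

```python
from scipy.stats import fisher_exact

def fragility_index(events_t, n_t, events_c, n_c, alpha=0.05):
    """Fewest treated-arm outcome flips (non-event -> event) before Fisher's
    exact test loses significance. It perturbs observations, where MSP
    perturbs specifications -- the two axes the rebuttal calls orthogonal."""
    for flips in range(n_t - events_t + 1):
        e = events_t + flips
        _, p = fisher_exact([[e, n_t - e], [events_c, n_c - events_c]])
        if p >= alpha:
            return flips
    return None  # significance persists for every flip in this direction
```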

  3. Referee: The LaLonde benchmark result that MSP = 1 is reported without accompanying detail on the size of the decision space searched, the enumeration procedure, or whether the search was exhaustive versus heuristic. Because the paper notes NP-hardness in general, the concrete value MSP = 1 cannot be assessed for robustness to search procedure without these specifics.

    Authors: We accept this observation and will expand the LaLonde paragraph accordingly. The revision will report the exact size of the decision space (number of covariate subsets and estimator variants), confirm that an exhaustive enumeration was feasible and performed for this benchmark, and note that the instance falls within the additive-structure cases shown to be tractable in the computation section. These details will allow readers to evaluate the reported MSP = 1 in context. revision: yes
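
As a sense check on feasibility, a decision space of the size such a revision would report is easy to bound; the counts below are hypothetical, since the abstract does not state them.

```python
# Hypothetical sizing, not the paper's numbers: with 8 binary
# covariate-inclusion decisions and 3 estimator variants, the space holds
n_specs = 2 ** 8 * 3
print(n_specs)  # 768 specifications -- trivially exhaustible, so a reported
                # MSP = 1 would be exact rather than a heuristic upper bound
```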

Circularity Check

0 steps flagged

No significant circularity; MSP defined directly as min decision changes to falsify CI

full rationale

The paper defines MSP explicitly as the smallest number of analyst decision changes required to reach a specification whose CI contains zero. This is a direct combinatorial distance measure, not a fitted parameter, not renamed from a known result, and not justified by self-citation chains. Claims that MSP is small under the null and grows with effect strength follow from the definition plus external benchmark evaluation (LaLonde) and permutation calibration; no equation reduces the output to the input by construction. Computation is separately characterized as NP-hard in general with additive-structure tractability, without circularity in the derivation. The central claim remains independent of the search procedure's data dependence, which is noted but not load-bearing for the definition itself.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review yields no explicit free parameters, axioms, or invented entities beyond the definition of MSP itself.

pith-pipeline@v0.9.0 · 5486 in / 1016 out tokens · 27315 ms · 2026-05-09T17:57:50.646939+00:00 · methodology


Reference graph

Works this paper leans on

33 extracted references · 6 canonical work pages · 2 internal anchors

1. [1] Susan Athey and Guido Imbens. A measure of robustness to misspecification. American Economic Review, 105(5):476–480, 2015.
2. [2] Barry Becker and Ronny Kohavi. Adult. UCI Machine Learning Repository, 1996. DOI: https://doi.org/10.24432/C5XW20.
3. [3] Tamara Broderick, Ryan Giordano, and Rachael Meager. An automatic finite-sample robustness metric: when can dropping a little data make a big difference? arXiv preprint arXiv:2011.14999, 2020.
4. [4] Nicholas Carlini and David Wagner. Towards evaluating the robustness of neural networks. In 2017 IEEE Symposium on Security and Privacy (SP), pages 39–57. IEEE, 2017.
5. [5] Carlos Cinelli and Chad Hazlett. Making sense of sensitivity: Extending omitted variable bias. Journal of the Royal Statistical Society Series B: Statistical Methodology, 82(1):39–67, 2020.
6. [6] Rajeev H Dehejia and Sadek Wahba. Causal effects in nonexperimental studies: Reevaluating the evaluation of training programs. Journal of the American Statistical Association, 94(448):1053–1062, 1999.
7. [7] Javid Ebrahimi, Anyi Rao, Daniel Lowd, and Dejing Dou. HotFlip: White-box adversarial examples for text classification. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 31–36, 2018.
8. [8] Michael R Garey and David S Johnson. Computers and Intractability, volume 29. W. H. Freeman, New York, 2002.
9. [9] Andrew Gelman and Eric Loken. The garden of forking paths: Why multiple comparisons can be a problem, even when there is no "fishing expedition" or "p-hacking" and the research hypothesis was posited ahead of time. Department of Statistics, Columbia University, 348(1-17):3, 2013.
10. [10] Ian J Goodfellow, Jonathon Shlens, and Christian Szegedy. Explaining and harnessing adversarial examples. arXiv preprint arXiv:1412.6572, 2014.
11. [11] Riccardo Guidotti. Counterfactual explanations and how to find them: literature review and benchmarking. Data Mining and Knowledge Discovery, 38(5):2770–2824, 2024.
12. [12] Gemma Hammerton and Marcus R Munafò. Causal inference with observational data: the need for triangulation of evidence. Psychological Medicine, 51(4):563–578, 2021.
13. [13] Frank R Hampel. A general qualitative definition of robustness. The Annals of Mathematical Statistics, 42(6):1887–1896, 1971.
14. [14] Guido Imbens and Yiqing Xu. LaLonde (1986) after nearly four decades: Lessons learned. arXiv preprint arXiv:2406.00827, 2024.
15. [15] Guido W Imbens. Sensitivity to exogeneity assumptions in program evaluation. American Economic Review, 93(2):126–132, 2003.
16. [16] Amir-Hossein Karimi, Bernhard Schölkopf, and Isabel Valera. Algorithmic recourse: from counterfactual explanations to interventions. In Proceedings of the 2021 ACM Conference on Fairness, Accountability, and Transparency, pages 353–362, 2021.
17. [17] Guang-He Lee, Yang Yuan, Shiyu Chang, and Tommi Jaakkola. Tight certificates of adversarial robustness for randomly smoothed classifiers. Advances in Neural Information Processing Systems, 32, 2019.
18. [18] Aleksander Madry, Aleksandar Makelov, Ludwig Schmidt, Dimitris Tsipras, and Adrian Vladu. Towards deep learning models resistant to adversarial attacks. arXiv preprint arXiv:1706.06083, 2017.
19. [19] Matthew A Masten and Alexandre Poirier. Inference on breakdown frontiers. Quantitative Economics, 11(1):41–111, 2020.
20. [20] Emily Oster. Unobservable selection and coefficient stability: Theory and evidence. Journal of Business & Economic Statistics, 37(2):187–204, 2019.
21. [21] Chirag J Patel, Belinda Burford, and John PA Ioannidis. Assessment of vibration of effects due to model specification can demonstrate the instability of observational associations. Journal of Clinical Epidemiology, 68(9):1046–1058, 2015.
22. [22] Martin Pawelczyk, Chirag Agarwal, Shalmali Joshi, Sohini Upadhyay, and Himabindu Lakkaraju. Exploring counterfactual explanations through the lens of adversarial examples: A theoretical and empirical analysis. In International Conference on Artificial Intelligence and Statistics, pages 4574–4594. PMLR, 2022.
23. [23] Belinda Phipson and Gordon K Smyth. Permutation p-values should never be zero: calculating exact p-values when permutations are randomly drawn. arXiv preprint arXiv:1603.05766, 2016.
24. [24] Ashesh Rambachan and Jonathan Roth. A more credible approach to parallel trends. Review of Economic Studies, 90(5):2555–2591, 2023.
25. [25] Paul R Rosenbaum. Observational studies. In Observational Studies, pages 1–17. Springer, 2002.
26. [26] Raphael Silberzahn, Eric L Uhlmann, Daniel P Martin, Pasquale Anselmi, Frederik Aust, Eli Awtrey, Štěpán Bahník, Feng Bai, Colin Bannard, Evelina Bonnier, et al. Many analysts, one data set: Making transparent how variations in analytic choices affect results. Advances in Methods and Practices in Psychological Science, 1(3):337–356, 2018.
27. [27] Joseph P Simmons, Leif D Nelson, and Uri Simonsohn. False-positive psychology: Undisclosed flexibility in data collection and analysis allows presenting anything as significant. Psychological Science, 22(11):1359–1366, 2011.
28. [28] Uri Simonsohn, Joseph P Simmons, and Leif D Nelson. Specification curve analysis. Nature Human Behaviour, 4(11):1208–1214, 2020.
29. [29] Sara Steegen, Francis Tuerlinckx, Andrew Gelman, and Wolf Vanpaemel. Increasing transparency through a multiverse analysis. Perspectives on Psychological Science, 11(5):702–712, 2016.
30. [30] Berk Ustun, Alexander Spangher, and Yang Liu. Actionable recourse in linear classification. In Proceedings of the Conference on Fairness, Accountability, and Transparency, pages 10–19, 2019.
31. [31] Tyler J VanderWeele and Peng Ding. Sensitivity analysis in observational research: introducing the E-value. Annals of Internal Medicine, 167(4):268–274, 2017.
32. [32] Sandra Wachter, Brent Mittelstadt, and Chris Russell. Counterfactual explanations without opening the black box: Automated decisions and the GDPR. Harvard Journal of Law & Technology, 31:841, 2017.
33. [33] Michael Walsh, Sadeesh K Srinathan, Daniel F McAuley, Marko Mrkobrada, Oren Levine, Christine Ribic, Amber O Molnar, Neil D Dattani, Andrew Burke, Gordon Guyatt, et al. The statistical significance of randomized controlled trial results is frequently fragile: a case for a fragility index. Journal of Clinical Epidemiology, 67(6):622–628, 2014.