pith. machine review for the scientific record.

arxiv: 2605.01579 · v1 · submitted 2026-05-02 · 📊 stat.ME · cs.LG

Recognition: unknown

Minimum Specification Perturbation: Robustness as Distance-to-Falsification in Causal Inference


Pith reviewed 2026-05-09 17:57 UTC · model grok-4.3

classification 📊 stat.ME cs.LG
keywords: minimum specification perturbation · causal inference · robustness · fragility index · specification sensitivity · distance to falsification · analyst decisions · confidence interval

The pith

Minimum Specification Perturbation measures the fewest analyst decision changes needed to make a causal claim's confidence interval contain zero.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper defines Minimum Specification Perturbation (MSP) as the smallest number of changes to modeling and estimation choices that would produce a confidence interval including zero. The metric is meant to reveal how close a result sits to being falsified by a reasonable alternative specification. Unlike measures that summarize variation across specifications, MSP directly quantifies the distance to a null-finding specification. It behaves as a robustness measure should: small when no true effect exists, larger when the effect is stronger. When effects are weak, decisions guided by MSP achieve lower false-positive rates than those using dispersion summaries of results.

Core claim

MSP is the minimum number of analyst decisions that must be altered to reach a specification whose confidence interval contains zero. It is small under the null, grows with effect strength, and supplies distance-to-falsification information that dispersion-based summaries cannot. Fragility Index and MSP assess orthogonal vulnerabilities, because sensitivity to individual data points need not match sensitivity to specification choices. On the LaLonde benchmark, MSP equals one: a single decision change suffices to make the confidence interval contain zero. Exact permutation calibration is available under randomization, computation is tractable for additive structures, and the problem is NP-hard in general.

What carries the argument

Minimum Specification Perturbation (MSP), the smallest number of changes across a countable set of analyst decisions to produce a specification whose confidence interval contains zero.

Load-bearing premise

Analyst decisions can be represented as a countable set of discrete, independent changes whose effects on the confidence interval admit an exhaustive or efficient search that does not itself bias the minimum count.
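
The following minimal sketch makes the presupposed search concrete. Everything here is illustrative: the `msp` name, the dict encoding of decisions, and the `ci_contains_zero` callback are assumptions of this review, not the paper's implementation, and the paper itself warns that this search is NP-hard in general.

```python
from itertools import combinations, product

def msp(base_spec, decision_options, ci_contains_zero):
    """Exhaustive MSP search (illustrative sketch, not the paper's code).

    base_spec        -- dict: each analyst decision -> the option actually chosen
    decision_options -- dict: each decision -> list of admissible options
    ci_contains_zero -- callable taking a spec dict, refitting the analysis,
                        and returning True if the resulting CI contains zero

    Returns the fewest decisions that must change to reach a specification
    whose CI contains zero, or None if no enumerated specification does.
    """
    decisions = list(base_spec)
    for k in range(len(decisions) + 1):                # 0 changes, then 1, ...
        for subset in combinations(decisions, k):      # which decisions to flip
            alternatives = [[o for o in decision_options[d] if o != base_spec[d]]
                            for d in subset]
            for alts in product(*alternatives):        # every joint reassignment
                spec = {**base_spec, **dict(zip(subset, alts))}
                if ci_contains_zero(spec):
                    return k                           # first hit is the minimum
    return None
```

Because k starts at zero, the base specification itself is checked first, so MSP = 0 exactly when the original confidence interval already contains zero; the worst case enumerates the full product space, which is why the paper's NP-hardness result matters.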

What would settle it

A Monte Carlo experiment in which MSP stays constant or decreases as the true effect size increases would falsify the claimed growth property.
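
A toy version of that experiment, runnable against the `msp` sketch above, might look like the following. The data-generating process, the trim/level decision space, and all names are assumptions of this review, not the paper's design.

```python
import numpy as np
from scipy import stats

def toy_ci_contains_zero(spec, y, t):
    """Normal-approximation CI for a difference in means under a toy spec:
    spec['trim'] symmetrically trims extreme outcomes, spec['level'] sets the level."""
    lo, hi = np.quantile(y, [spec['trim'], 1 - spec['trim']])
    keep = (y >= lo) & (y <= hi)
    a, b = y[keep & (t == 1)], y[keep & (t == 0)]
    diff = a.mean() - b.mean()
    se = np.sqrt(a.var(ddof=1) / len(a) + b.var(ddof=1) / len(b))
    z = stats.norm.ppf(0.5 + spec['level'] / 2)
    return diff - z * se <= 0 <= diff + z * se

# MSP should be 0 (or tiny) at tau = 0 and grow as tau increases; a run where
# it stays flat or shrinks with tau would falsify the growth property.
# None means no specification in this toy space overturns the result.
for tau in [0.0, 0.3, 0.8]:
    rng = np.random.default_rng(0)
    t = rng.integers(0, 2, size=400)
    y = tau * t + rng.normal(size=400)
    base = {'trim': 0.0, 'level': 0.95}
    opts = {'trim': [0.0, 0.05, 0.10], 'level': [0.95, 0.99]}
    print(tau, msp(base, opts, lambda s: toy_ci_contains_zero(s, y, t)))
```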

Figures

Figures reproduced from arXiv:2605.01579 by Hoang Dang, Luan Pham, Minh Nguyen.

Figure 1. LaLonde illustration. Fixing the outcome scale to raw earnings, the remaining …
Figure 2. ROC curves for MSP and threshold-based baseline summaries in the binary deci…
Original abstract

Empirical causal claims depend on many analyst decisions, from selecting covariates to choosing estimators. Existing robustness tools summarize how results vary across these choices, but, to the best of our knowledge, do not answer: how many analyst decisions must change to reach a specification (a set of choices) whose confidence interval (CI) contains zero? We introduce Minimum Specification Perturbation (MSP), the smallest such number of changes. MSP is small under the null, grows with effect strength, and captures distance-to-falsification information that dispersion-based summaries cannot report; when making decisions under weak effects, an MSP-based rule yields lower false-positive rates than dispersion-based rules. We show that Fragility Index and MSP measure orthogonal vulnerabilities: fragility to influential observations need not imply fragility to specification choices. On the LaLonde benchmark, MSP = 1 implies that one decision change makes the CI contain zero. We further provide exact permutation calibration under randomization and characterize computation, showing tractable cases under additive structure and NP-hardness in general.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The manuscript introduces Minimum Specification Perturbation (MSP) as the smallest number of changes to analyst decisions (covariate selection, estimator choice, etc.) needed to produce a specification whose confidence interval contains zero. It claims MSP is small under the null, increases with effect strength, supplies distance-to-falsification information orthogonal to dispersion-based summaries and to the Fragility Index, and yields lower false-positive rates than dispersion rules when effects are weak. The paper supplies an exact permutation calibration under randomization, characterizes computation (tractable under additive structure, NP-hard in general), and reports MSP = 1 on the LaLonde benchmark.

Significance. If the central claims survive scrutiny, MSP would supply a directly interpretable, decision-count metric for robustness that existing tools do not provide. The orthogonality result and the reported FPR advantage under weak effects would be useful additions to the causal-inference robustness literature.

major comments (3)
  1. [Abstract and computation/calibration section] The data-dependent search for the argmin perturbation count risks downward bias in MSP because the optimization selects specifications on the basis of data-driven CI checks. The abstract states that exact permutation calibration is provided under randomization, but it is unclear whether this calibration fully corrects for the selection effect induced by the search itself (rather than merely for the randomization distribution of a fixed specification). This issue is load-bearing for the claims that MSP is small under the null and that an MSP-based rule achieves lower FPR than dispersion-based rules.
  2. [Abstract and results on orthogonality] The orthogonality claim between MSP and the Fragility Index is asserted but not demonstrated with a formal argument or explicit counter-example showing that fragility to influential observations need not imply fragility to specification choices (or vice versa). Without such a demonstration, the statement that the two measures capture orthogonal vulnerabilities remains unverified.
  3. [LaLonde benchmark paragraph] The LaLonde benchmark result that MSP = 1 is reported without accompanying detail on the size of the decision space searched, the enumeration procedure, or whether the search was exhaustive versus heuristic. Because the paper notes NP-hardness in general, the concrete value MSP = 1 cannot be assessed for robustness to search procedure without these specifics.
minor comments (2)
  1. [Abstract] The abstract uses the phrase 'exact permutation calibration' without defining the permutation scheme or the null distribution being calibrated; a brief parenthetical or footnote would improve readability.
  2. [Introduction] Notation for the set of analyst decisions and the perturbation operator is introduced without an explicit mathematical definition in the opening paragraphs; adding a short displayed equation would aid clarity.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed comments. We address each major point below, indicating revisions that will be incorporated to improve clarity and rigor.

Point-by-point responses
  1. Referee: The data-dependent search for the argmin perturbation count risks downward bias in MSP because the optimization selects specifications on the basis of data-driven CI checks. The abstract states that exact permutation calibration is provided under randomization, but it is unclear whether this calibration fully corrects for the selection effect induced by the search itself (rather than merely for the randomization distribution of a fixed specification). This issue is load-bearing for the claims that MSP is small under the null and that an MSP-based rule achieves lower FPR than dispersion-based rules.

    Authors: We appreciate the referee's identification of this important distinction. The permutation calibration described in the manuscript applies the complete MSP procedure—including the data-dependent search over the specification space—to each randomly permuted treatment assignment. Consequently, the resulting null distribution of MSP already incorporates the selection effect from the optimization step. We will revise the calibration section to state this explicitly, include pseudocode illustrating the full-procedure permutation, and add a brief simulation confirming exact control. These changes directly support the claims about MSP under the null and the reported FPR advantage. revision: yes
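
On one reading of this response, the full-procedure calibration could be sketched as below, reusing the `msp` and `toy_ci_contains_zero` sketches above. The function name, the ≥ tail direction, and the treatment of "no falsifying spec" as infinite MSP are this review's assumptions; the +1 correction is Phipson and Smyth's never-zero convention (reference [23] below).

```python
import numpy as np

def permutation_calibrate(y, t, base, opts, ci_fn, n_perm=999, seed=1):
    """Full-procedure permutation null for MSP: rerun the ENTIRE search,
    including its data-dependent optimization, on each permuted treatment
    vector, so the null distribution inherits the search's selection effect."""
    as_num = lambda m: float('inf') if m is None else m  # inf = never falsified
    rng = np.random.default_rng(seed)
    observed = as_num(msp(base, opts, lambda s: ci_fn(s, y, t)))
    null = []
    for _ in range(n_perm):
        t_perm = rng.permutation(t)            # re-randomize under the sharp null
        null.append(as_num(msp(base, opts, lambda s: ci_fn(s, y, t_perm))))
    # How often does a null dataset look at least as robust as the real one?
    # The +1 keeps the permutation p-value away from zero (Phipson & Smyth).
    p = (1 + sum(m >= observed for m in null)) / (n_perm + 1)
    return observed, p
```

Under this scheme the observed MSP is compared against MSPs produced by the same selection-prone search on null data, which is exactly the correction the referee asked about.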

  2. Referee: The orthogonality claim between MSP and the Fragility Index is asserted but not demonstrated with a formal argument or explicit counter-example showing that fragility to influential observations need not imply fragility to specification choices (or vice versa). Without such a demonstration, the statement that the two measures capture orthogonal vulnerabilities remains unverified.

    Authors: We agree that an explicit demonstration would strengthen the orthogonality statement. While the manuscript notes that MSP and the Fragility Index target distinct vulnerabilities (specification choices versus observation influence), we will add a short subsection containing a formal argument based on the respective definitions under randomization and a concrete numerical counter-example. One example will show a dataset with low Fragility Index yet MSP = 1; the reverse case will also be presented. This material will be placed in the section discussing relationships to existing robustness measures. revision: yes
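
For context on the other axis of the comparison, the Fragility Index can be computed directly from a 2×2 trial table. The sketch below is one common reading of Walsh et al.'s definition (reference [33] below), flipping outcomes in the treated arm; it is not the authors' promised counter-example.

```python
from scipy.stats import fisher_exact

def fragility_index(events_t, n_t, events_c, n_c, alpha=0.05):
    """Fewest treated-arm outcome flips (non-event -> event) before Fisher's
    exact test loses significance. It perturbs observations, where MSP
    perturbs specifications -- the two axes the rebuttal calls orthogonal."""
    for flips in range(n_t - events_t + 1):
        e = events_t + flips
        _, p = fisher_exact([[e, n_t - e], [events_c, n_c - events_c]])
        if p >= alpha:
            return flips
    return None  # significance persists for every flip in this direction
```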

  3. Referee: The LaLonde benchmark result that MSP = 1 is reported without accompanying detail on the size of the decision space searched, the enumeration procedure, or whether the search was exhaustive versus heuristic. Because the paper notes NP-hardness in general, the concrete value MSP = 1 cannot be assessed for robustness to search procedure without these specifics.

    Authors: We accept this observation and will expand the LaLonde paragraph accordingly. The revision will report the exact size of the decision space (number of covariate subsets and estimator variants), confirm that an exhaustive enumeration was feasible and performed for this benchmark, and note that the instance falls within the additive-structure cases shown to be tractable in the computation section. These details will allow readers to evaluate the reported MSP = 1 in context. revision: yes
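
As a sense check on feasibility, a decision space of the size such a revision would report is easy to bound; the counts below are hypothetical, since the abstract does not state them.

```python
# Hypothetical sizing, not the paper's numbers: with 8 binary
# covariate-inclusion decisions and 3 estimator variants, the space holds
n_specs = 2 ** 8 * 3
print(n_specs)  # 768 specifications -- trivially exhaustible, so a reported
                # MSP = 1 would be exact rather than a heuristic upper bound
```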

Circularity Check

0 steps flagged

No significant circularity; MSP defined directly as min decision changes to falsify CI

full rationale

The paper defines MSP explicitly as the smallest number of analyst decision changes required to reach a specification whose CI contains zero. This is a direct combinatorial distance measure, not a fitted parameter, not renamed from a known result, and not justified by self-citation chains. Claims that MSP is small under the null and grows with effect strength follow from the definition plus external benchmark evaluation (LaLonde) and permutation calibration; no equation reduces the output to the input by construction. Computation is separately characterized as NP-hard in general with additive-structure tractability, without circularity in the derivation. The central claim remains independent of the search procedure's data dependence, which is noted but not load-bearing for the definition itself.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review yields no explicit free parameters, axioms, or invented entities beyond the definition of MSP itself.

pith-pipeline@v0.9.0 · 5486 in / 1016 out tokens · 27315 ms · 2026-05-09T17:57:50.646939+00:00 · methodology


Reference graph

Works this paper leans on

33 extracted references · 6 canonical work pages · 2 internal anchors

1. [1] Susan Athey and Guido Imbens. A measure of robustness to misspecification. American Economic Review, 105(5):476–480, 2015.
2. [2] Barry Becker and Ronny Kohavi. Adult. UCI Machine Learning Repository, 1996. DOI: https://doi.org/10.24432/C5XW20.
3. [3] Tamara Broderick, Ryan Giordano, and Rachael Meager. An automatic finite-sample robustness metric: when can dropping a little data make a big difference? arXiv preprint arXiv:2011.14999, 2020.
4. [4] Nicholas Carlini and David Wagner. Towards evaluating the robustness of neural networks. In 2017 IEEE Symposium on Security and Privacy (SP), pages 39–57. IEEE, 2017.
5. [5] Carlos Cinelli and Chad Hazlett. Making sense of sensitivity: Extending omitted variable bias. Journal of the Royal Statistical Society Series B: Statistical Methodology, 82(1):39–67, 2020.
6. [6] Rajeev H Dehejia and Sadek Wahba. Causal effects in nonexperimental studies: Reevaluating the evaluation of training programs. Journal of the American Statistical Association, 94(448):1053–1062, 1999.
7. [7] Javid Ebrahimi, Anyi Rao, Daniel Lowd, and Dejing Dou. HotFlip: White-box adversarial examples for text classification. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 31–36, 2018.
8. [8] Michael R Garey and David S Johnson. Computers and Intractability, volume 29. W. H. Freeman, New York, 2002.
9. [9] Andrew Gelman and Eric Loken. The garden of forking paths: Why multiple comparisons can be a problem, even when there is no "fishing expedition" or "p-hacking" and the research hypothesis was posited ahead of time. Department of Statistics, Columbia University, 348(1-17):3, 2013.
10. [10] Ian J Goodfellow, Jonathon Shlens, and Christian Szegedy. Explaining and harnessing adversarial examples. arXiv preprint arXiv:1412.6572, 2014.
11. [11] Riccardo Guidotti. Counterfactual explanations and how to find them: literature review and benchmarking. Data Mining and Knowledge Discovery, 38(5):2770–2824, 2024.
12. [12] Gemma Hammerton and Marcus R Munafò. Causal inference with observational data: the need for triangulation of evidence. Psychological Medicine, 51(4):563–578, 2021.
13. [13] Frank R Hampel. A general qualitative definition of robustness. The Annals of Mathematical Statistics, 42(6):1887–1896, 1971.
14. [14] Guido Imbens and Yiqing Xu. LaLonde (1986) after nearly four decades: Lessons learned. arXiv preprint arXiv:2406.00827, 2024.
15. [15] Guido W Imbens. Sensitivity to exogeneity assumptions in program evaluation. American Economic Review, 93(2):126–132, 2003.
16. [16] Amir-Hossein Karimi, Bernhard Schölkopf, and Isabel Valera. Algorithmic recourse: from counterfactual explanations to interventions. In Proceedings of the 2021 ACM Conference on Fairness, Accountability, and Transparency, pages 353–362, 2021.
17. [17] Guang-He Lee, Yang Yuan, Shiyu Chang, and Tommi Jaakkola. Tight certificates of adversarial robustness for randomly smoothed classifiers. Advances in Neural Information Processing Systems, 32, 2019.
18. [18] Aleksander Madry, Aleksandar Makelov, Ludwig Schmidt, Dimitris Tsipras, and Adrian Vladu. Towards deep learning models resistant to adversarial attacks. arXiv preprint arXiv:1706.06083, 2017.
19. [19] Matthew A Masten and Alexandre Poirier. Inference on breakdown frontiers. Quantitative Economics, 11(1):41–111, 2020.
20. [20] Emily Oster. Unobservable selection and coefficient stability: Theory and evidence. Journal of Business & Economic Statistics, 37(2):187–204, 2019.
21. [21] Chirag J Patel, Belinda Burford, and John PA Ioannidis. Assessment of vibration of effects due to model specification can demonstrate the instability of observational associations. Journal of Clinical Epidemiology, 68(9):1046–1058, 2015.
22. [22] Martin Pawelczyk, Chirag Agarwal, Shalmali Joshi, Sohini Upadhyay, and Himabindu Lakkaraju. Exploring counterfactual explanations through the lens of adversarial examples: A theoretical and empirical analysis. In International Conference on Artificial Intelligence and Statistics, pages 4574–4594. PMLR, 2022.
23. [23] Belinda Phipson and Gordon K Smyth. Permutation p-values should never be zero: calculating exact p-values when permutations are randomly drawn. arXiv preprint arXiv:1603.05766, 2016.
24. [24] Ashesh Rambachan and Jonathan Roth. A more credible approach to parallel trends. Review of Economic Studies, 90(5):2555–2591, 2023.
25. [25] Paul R Rosenbaum. Observational studies. In Observational Studies, pages 1–17. Springer, 2002.
26. [26] Raphael Silberzahn, Eric L Uhlmann, Daniel P Martin, Pasquale Anselmi, Frederik Aust, Eli Awtrey, Štěpán Bahník, Feng Bai, Colin Bannard, Evelina Bonnier, et al. Many analysts, one data set: Making transparent how variations in analytic choices affect results. Advances in Methods and Practices in Psychological Science, 1(3):337–356, 2018.
27. [27] Joseph P Simmons, Leif D Nelson, and Uri Simonsohn. False-positive psychology: Undisclosed flexibility in data collection and analysis allows presenting anything as significant. Psychological Science, 22(11):1359–1366, 2011.
28. [28] Uri Simonsohn, Joseph P Simmons, and Leif D Nelson. Specification curve analysis. Nature Human Behaviour, 4(11):1208–1214, 2020.
29. [29] Sara Steegen, Francis Tuerlinckx, Andrew Gelman, and Wolf Vanpaemel. Increasing transparency through a multiverse analysis. Perspectives on Psychological Science, 11(5):702–712, 2016.
30. [30] Berk Ustun, Alexander Spangher, and Yang Liu. Actionable recourse in linear classification. In Proceedings of the Conference on Fairness, Accountability, and Transparency, pages 10–19, 2019.
31. [31] Tyler J VanderWeele and Peng Ding. Sensitivity analysis in observational research: introducing the E-value. Annals of Internal Medicine, 167(4):268–274, 2017.
32. [32] Sandra Wachter, Brent Mittelstadt, and Chris Russell. Counterfactual explanations without opening the black box: Automated decisions and the GDPR. Harvard Journal of Law & Technology, 31:841, 2017.
33. [33] Michael Walsh, Sadeesh K Srinathan, Daniel F McAuley, Marko Mrkobrada, Oren Levine, Christine Ribic, Amber O Molnar, Neil D Dattani, Andrew Burke, Gordon Guyatt, et al. The statistical significance of randomized controlled trial results is frequently fragile: a case for a fragility index. Journal of Clinical Epidemiology, 67(6):622–628, 2014.