MutDafny: A Mutation-Based Approach to Assess Dafny Specifications
Pith reviewed 2026-05-17 21:02 UTC · model grok-4.3
The pith
MutDafny applies mutations to Dafny programs to expose weaknesses in their formal specifications.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The central claim is that mutation testing, when the oracle is supplied by formal specifications rather than test cases, can systematically surface weaknesses in Dafny specifications: if a program variant still verifies after a mutation, the specification does not constrain the behavior tightly enough to reject that change.
What carries the argument
MutDafny, a tool that applies a set of 40 mutation operators to Dafny source and re-runs the verifier to detect cases where the specification accepts the mutated program.
If this is right
- Developers can use undetected mutants as concrete prompts to strengthen specifications in their Dafny code.
- The rate of one weak specification per 241 lines provides a baseline for how often real-world Dafny contracts need review.
- The same mutation set can be applied repeatedly during development to catch specification drift as code evolves.
- Five concrete weak specifications were identified that would have allowed incorrect behavior to go undetected.
Where Pith is reading between the lines
- The technique could be ported to other verification-aware languages that already have mature verifiers, such as those supporting contracts or dependent types.
- Automated reporting of undetected mutants might be integrated into continuous verification pipelines to flag specification gaps early.
- Patterns among the undetected mutants could reveal common specification oversights, such as missing invariants on data structures, that recur across projects.
Load-bearing premise
Verification succeeding on a mutant reliably signals a weakness in the specification rather than the mutation being semantically invalid or the verifier being incomplete.
What would settle it
A manual review of all undetected mutants from the 794-program corpus that finds every one either invalidates the program semantics or is explainable by known verifier incompleteness, with no genuine specification weaknesses remaining.
Figures
read the original abstract
In verification-aware languages, such as Dafny, despite their critical role, specifications are as prone to error as implementations. Flaws in specifications can result in formally verified programs that deviate from the intended behavior. In this paper, we explore the use of mutation testing to reveal weaknesses in formal specifications written in Dafny. We present MutDafny, a tool that increases the reliability of Dafny specifications by automatically signaling potential weaknesses. Using a mutation testing approach, we introduce faults (mutations) into the code and rely on formal specifications for detecting them. If a program with a mutant verifies, this may indicate a weakness in the specification. We extensively analyze mutation operators from popular tools, identifying the ones applicable to Dafny. In addition, we synthesize new operators tailored for the language from bugfix commits in publicly available Dafny projects on GitHub. Drawing from both, we equipped our tool with a total of 40 mutation operators. We evaluate MutDafny's effectiveness and efficiency on a dataset of 794 real-world Dafny programs, and manually analyze a subset of the resulting undetected mutants, identifying five weak real-world specifications (on average, one at every 241 lines of code) that would benefit from strengthening.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper presents MutDafny, a mutation-testing tool to assess the strength of Dafny specifications. It selects and synthesizes 40 mutation operators (drawing from existing tools and GitHub bug-fix commits), applies the approach to 794 real-world Dafny programs, and reports that manual review of undetected mutants surfaced five weak specifications (one per 241 LOC on average).
Significance. If the central interpretation holds, the work supplies a concrete, scalable method for exposing specification weaknesses in verified Dafny code—an important practical contribution because specifications are themselves error-prone. The evaluation on a sizable external corpus of real programs and the explicit identification of five weak specifications constitute tangible evidence of utility.
major comments (2)
- [Evaluation / manual analysis of undetected mutants] Evaluation (undetected-mutant analysis): the headline result of five weak specifications rests on the interpretation that a mutant that still verifies signals a specification weakness. The manuscript must explicitly describe, for each of the five cases, how semantic equivalence of the mutant was ruled out and how potential incompleteness of the Dafny verifier was addressed; without this, the count cannot be taken as secure evidence of specification weakness rather than an artifact of mutation validity or verifier limits.
- [Abstract and Evaluation section] Abstract and §4 (evaluation setup): the effectiveness claim is only moderately supported because the abstract and main evaluation text omit the total number of mutants generated, the precise mutant-selection criteria, and any statistical measures (e.g., mutation score distribution or detection-rate confidence intervals). These quantities are load-bearing for the claim that the tool reliably surfaces weak specifications.
minor comments (2)
- [Mutation operator selection] The description of how the 40 operators were chosen and synthesized from GitHub commits could be expanded with a short table or enumeration to improve reproducibility.
- [Throughout] Minor typographical inconsistencies appear in the operator list and in the reported average (241 LOC); a consistency pass would help.
Simulated Author's Rebuttal
We thank the referee for the constructive comments. We address each major point below and describe the revisions we will incorporate.
read point-by-point responses
-
Referee: [Evaluation / manual analysis of undetected mutants] Evaluation (undetected-mutant analysis): the headline result of five weak specifications rests on the interpretation that a mutant that still verifies signals a specification weakness. The manuscript must explicitly describe, for each of the five cases, how semantic equivalence of the mutant was ruled out and how potential incompleteness of the Dafny verifier was addressed; without this, the count cannot be taken as secure evidence of specification weakness rather than an artifact of mutation validity or verifier limits.
Authors: We agree that the current description of the manual analysis is insufficient to fully support the claim. In the revised manuscript we will add a new subsection under the evaluation that details, for each of the five cases, the concrete steps used to establish that the mutant was not semantically equivalent (e.g., by exhibiting a concrete input that produces different observable behavior or by showing that the mutant violates an auxiliary assertion that the original satisfies) and the measures taken to address possible verifier incompleteness (e.g., running with all verification options enabled, increasing resource limits, and cross-checking a subset of cases with an alternative solver configuration). revision: yes
-
Referee: [Abstract and Evaluation section] Abstract and §4 (evaluation setup): the effectiveness claim is only moderately supported because the abstract and main evaluation text omit the total number of mutants generated, the precise mutant-selection criteria, and any statistical measures (e.g., mutation score distribution or detection-rate confidence intervals). These quantities are load-bearing for the claim that the tool reliably surfaces weak specifications.
Authors: We accept that these quantitative details should be stated explicitly. We will revise the abstract to report the total number of mutants generated and the criteria used to select and synthesize the 40 operators. In Section 4 we will add the distribution of per-program mutation scores together with any available statistical summaries (including detection-rate confidence intervals where the sample size permits). These additions will be placed in the evaluation setup and results subsections. revision: yes
Circularity Check
No circularity: results grounded in external data and manual analysis
full rationale
The paper's core claims derive from applying 40 mutation operators (sourced from external popular tools plus synthesis from independent GitHub Dafny bugfix commits) to a dataset of 794 real-world programs, followed by manual inspection of undetected mutants. No equations, fitted parameters, self-definitional constructs, or load-bearing self-citations appear in the derivation. The count of five weak specifications is an empirical observation from external programs rather than a quantity that reduces to internal definitions or prior author results by construction. This matches the default expectation for an empirical tool-evaluation paper with independent benchmarks.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption If a mutated program still verifies under the original specification, then the specification is weak or incomplete for the intended behavior.
Reference graph
Works this paper leans on
-
[1]
https://github.com/avito-tech/go-mutesting Last accessed Nov
2016.Go-mutesting (fork). https://github.com/avito-tech/go-mutesting Last accessed Nov. 2025
work page 2016
-
[2]
https://github.com/zimmski/go- mutesting Last accessed Nov
2016.Go-mutesting (original). https://github.com/zimmski/go- mutesting Last accessed Nov. 2025
work page 2016
-
[3]
https://mutation-testing.org Last accessed Nov
2011.Major. https://mutation-testing.org Last accessed Nov. 2025
work page 2011
-
[4]
https://github.com/jeffoffutt/muJava Last accessed Nov
2005.MuJava. https://github.com/jeffoffutt/muJava Last accessed Nov. 2025
work page 2005
-
[5]
https://github.com/mull-project/mull Last accessed Nov
2018.Mull. https://github.com/mull-project/mull Last accessed Nov. 2025
work page 2018
-
[6]
https://github.com/boxed/mutmut Last accessed Nov
2018.Mutmut. https://github.com/boxed/mutmut Last accessed Nov. 2025
work page 2018
-
[7]
http://pitest.org Last accessed Nov
2016.PIT. http://pitest.org Last accessed Nov. 2025
work page 2016
-
[8]
https://stryker-mutator.io/docs/stryker-js/introduct ion/ Last accessed Nov
2017.StrykerJS. https://stryker-mutator.io/docs/stryker-js/introduct ion/ Last accessed Nov. 2025
work page 2017
-
[9]
https://stryker-mutator.io/docs/stryker-net/introd uction/ Last accessed Nov
2021.Stryker.NET. https://stryker-mutator.io/docs/stryker-net/introd uction/ Last accessed Nov. 2025
work page 2021
-
[10]
Tailored Mutants Fit Bugs Better
Miltiadis Allamanis, Earl T. Barr, René Just, and Charles Sutton. 2016. Tailored Mutants Fit Bugs Better. arXiv:1611.02516 [cs.SE] https: //arxiv.org/abs/1611.02516
work page internal anchor Pith review Pith/arXiv arXiv 2016
-
[11]
Lennart Beringer, Adam Petcher, Q Ye Katherine, and Andrew W Appel
-
[12]
Verified Correctness and Security of OpenSSL HMAC.. InUSENIX Security Symposium. 207–221
-
[13]
David Bingham Brown, Michael Vaughn, Ben Liblit, and Thomas Reps
-
[14]
The care and feeding of wild-caught mutants. InProceedings of the 2017 11th Joint Meeting on Foundations of Software Engineering (Paderborn, Germany)(ESEC/FSE 2017). Association for Computing Machinery, New York, NY, USA, 511–522. doi:10.1145/3106237.3106280
-
[15]
Timothy A. Budd and Dana Angluin. 1982. Two notions of correctness and their relation to testing.Acta Inf.18, 1 (March 1982), 31–45. doi:10 .1007/BF00625279
work page 1982
-
[16]
Franck Cassez. 2021. Verification of the incremental Merkle tree al- gorithm with Dafny. InInternational Symposium on Formal Methods. Springer, 445–462
work page 2021
-
[17]
Aleks Chakarov, Jaco Geldenhuys, Matthew Heck, Michael Hicks, Sam Huang, Georges-Axel Jaloyan, Anjali Joshi, K Rustan M Leino, Mikael Mayer, Sean McLaughlin, et al. 2025. Formally Verified Cloud-Scale Au- thorization. In2025 IEEE/ACM 47th International Conference on Software Engineering (ICSE). IEEE Computer Society, 703–703
work page 2025
-
[18]
Yoonsik Cheon and Gary T. Leavens. 2002. A Simple and Practical Approach to Unit Testing: The JML and JUnit Way. InECOOP 2002 — Object-Oriented Programming, Boris Magnusson (Ed.). Springer Berlin Heidelberg, Berlin, Heidelberg, 231–255
work page 2002
-
[19]
Henry Coles, Thomas Laurent, Christopher Henard, Mike Papadakis, and Anthony Ventresque. 2016. PIT: a practical mutation testing tool for Java (demo). InProceedings of the 25th International Symposium on Software Testing and Analysis(Saarbrücken, Germany)(ISSTA 2016). Association for Computing Machinery, New York, NY, USA, 449–452. doi:10.1145/2931037.2948707
-
[20]
Alex Denisov and Stanislav Pankevich. 2018. Mull It Over: Mutation Testing Based on LLVM. In2018 IEEE International Conference on Soft- ware Testing, Verification and Validation Workshops (ICSTW). 25–31. doi:10.1109/ICSTW.2018.00024
-
[21]
Anna Derezińska and Konrad Hałas. 2014. Analysis of Mutation Op- erators for the Python Language. InProceedings of the Ninth Inter- national Conference on Dependability and Complex Systems DepCoS- RELCOMEX. June 30 – July 4, 2014, Brunów, Poland, Wojciech Zamojski, Jacek Mazurkiewicz, Jarosław Sugier, Tomasz Walkowiak, and Janusz Kacprzyk (Eds.). Springer...
work page 2014
-
[22]
Madeline Endres, Sarah Fakhoury, Saikat Chakraborty, and Shuvendu K. Lahiri. 2024. Can Large Language Models Transform Natural Language Intent into Formal Method Postconditions?Proc. ACM Softw. Eng.1, FSE, Article 84 (July 2024), 24 pages. doi:10.1145/3660791
-
[23]
Pedro Fonseca, Kaiyuan Zhang, Xi Wang, and Arvind Krishnamurthy
-
[24]
InProceedings of the Twelfth European Conference on Computer Systems(Belgrade, Serbia)(EuroSys ’17)
An Empirical Study on the Correctness of Formally Verified Distributed Systems. InProceedings of the Twelfth European Conference on Computer Systems(Belgrade, Serbia)(EuroSys ’17). Association for Computing Machinery, New York, NY, USA, 328–343. doi:10.1145/30 64176.3064183
-
[25]
Eli Goldweber, Weixin Yu, Seyed Armin Vakil Ghahani, and Manos Kapritsos. 2024. IronSpec: Increasing the Reliability of Formal Specifica- tions. In18th USENIX Symposium on Operating Systems Design and Im- plementation (OSDI 24). USENIX Association, Santa Clara, CA, 875–891. https://www.usenix.org/conference/osdi24/presentation/goldweber
work page 2024
-
[26]
Zhang, Marinos Kintis, and Mike Papadakis
Rahul Gopinath, Jie M. Zhang, Marinos Kintis, and Mike Papadakis
-
[27]
doi:10.1002/stvr.1831 arXiv:https://onlinelibrary.wiley.com/doi/pdf/10.1002/stvr.1831
Mutation analysis and its industrial applications.Software Testing, Verification and Reliability32, 8 (2022), e1831. doi:10.1002/stvr.1831 arXiv:https://onlinelibrary.wiley.com/doi/pdf/10.1002/stvr.1831
-
[28]
Mahdi Houshmand and Samad Paydar. 2017. TCE+: An Extension of the TCE Method for Detecting Equivalent Mutants in Java Programs. InFundamentals of Software Engineering, Mehdi Dastani and Marjan Sirjani (Eds.). Springer International Publishing, Cham, 164–179
work page 2017
-
[30]
Yue Jia and Mark Harman. 2011. An Analysis and Survey of the Devel- opment of Mutation Testing.IEEE Transactions on Software Engineering 37, 5 (2011), 649–678. doi:10.1109/TSE.2010.62
-
[31]
René Just. 2014. The major mutation framework: efficient and scalable mutation analysis for Java. InProceedings of the 2014 International Symposium on Software Testing and Analysis(San Jose, CA, USA)(ISSTA 2014). Association for Computing Machinery, New York, NY, USA, 433–436. doi:10.1145/2610384.2628053
-
[32]
René Just, Franz Schweiggert, and Gregory M. Kapfhammer. 2011. MA- JOR: An efficient and extensible tool for mutation analysis in a Java com- piler. In2011 26th IEEE/ACM International Conference on Automated Soft- ware Engineering (ASE 2011). 612–615. doi:10.1109/ASE.2011.6100138
-
[33]
Marinos Kintis, Mike Papadakis, Yue Jia, Nicos Malevris, Yves Le Traon, and Mark Harman. 2018. Detecting Trivial Mutant Equivalences via Compiler Optimisations.IEEE Transactions on Software Engineering44, 4 (2018), 308–333. doi:10.1109/TSE.2017.2684805
-
[34]
Gerwin Klein, Kevin Elphinstone, Gernot Heiser, June Andronick, David Cock, Philip Derrin, Dhammika Elkaduwe, Kai Engelhardt, Rafal Kolan- ski, Michael Norrish, et al . 2009. seL4: Formal verification of an OS kernel. InProceedings of the ACM SIGOPS 22nd symposium on Operating systems principles. ACM, 207–220
work page 2009
-
[35]
Kaufman, Ryan Featherman, Hannah Potter, Ardi Madadi, and René Just
Benjamin Kushigian, Samuel J. Kaufman, Ryan Featherman, Hannah Potter, Ardi Madadi, and René Just. 2024. Equivalent Mutants in the Wild: Identifying and Efficiently Suppressing Equivalent Mutants for Java Programs. InProceedings of the 33rd ACM SIGSOFT International Symposium on Software Testing and Analysis(Vienna, Austria)(ISSTA 2024). Association for C...
-
[36]
Benjamin Kushigian, Amit Rawat, and René Just. 2019. Medusa: Mu- tant Equivalence Detection Using Satisfiability Analysis. In2019 IEEE International Conference on Software Testing, Verification and Validation Workshops (ICSTW). 77–82. doi:10.1109/ICSTW.2019.00035
-
[37]
Shuvendu K. Lahiri. 2024. Evaluating LLM-driven User-Intent Formal- ization for Verification-Aware Languages. doi:10.34727/2024/ISBN.978- 3-85448-065-5_19
-
[38]
Y. Le Traon, B. Baudry, and J.-M. Jezequel. 2006. Design by Contract to Improve Software Vigilance.IEEE Transactions on Software Engineering 32, 8 (2006), 571–586. doi:10.1109/TSE.2006.79
- [39]
-
[40]
Xavier Leroy. 2009. A formally verified compiler back-end.Journal of Automated Reasoning43, 4 (2009), 363
work page 2009
-
[41]
Chloe Loughridge, Qinyi Sun, Seth Ahrenbach, Federico Cassano, Chuyue Sun, Ying Sheng, Anish Mudide, Md Rakib Hossain Misu, Nada Amin, and Max Tegmark. 2024. DafnyBench: A Benchmark for Formal Software Verification. (2024). arXiv:2406.08467 [cs.SE] https://arxiv.org/abs/2406.08467
-
[42]
Yu-seung Ma and Jeff Offutt. 2014. Description of Class Mutation Mutation Operators for Java. (2014). https://cs.gmu.edu/~offutt/mujav a/mutopsClass.pdf
work page 2014
-
[43]
Yu-seung Ma and Jeff Offutt. Update 2016. Description of muJava’s Method-level Mutation Operators. (Update 2016). https://cs.gmu.edu /~offutt/mujava/mutopsMethod.pdf
work page 2016
-
[44]
Yu-Seung Ma, Jeff Offutt, and Yong Kwon. 2005. MuJava: An automated class mutation system.Softw. Test., Verif. Reliab.15 (06 2005), 97–133. doi:10.1002/stvr.308 MutDafny: A Mutation-Based Approach to Assess Dafny Specifications
-
[45]
Yu-Seung Ma, Jeff Offutt, and Yong-Rae Kwon. 2006. MuJava: a mutation system for java. InProceedings of the 28th International Conference on Software Engineering(Shanghai, China)(ICSE ’06). Association for Computing Machinery, New York, NY, USA, 827–830. doi:10.1145/11 34285.1134425
work page doi:10.1145/11 2006
-
[46]
Dor Ma’ayan, Shahar Maoz, and Roey Rozi. 2022. Validating the correct- ness of reactive systems specifications through systematic exploration. InProceedings of the 25th International Conference on Model Driven Engineering Languages and Systems(Montreal, Quebec, Canada)(MOD- ELS ’22). Association for Computing Machinery, New York, NY, USA, 132–142. doi:10....
-
[47]
Lech Madeyski, Wojciech Orzeszyna, Richard Torkar, and Mariusz Jozala. 2014. Overcoming the Equivalent Mutant Problem: A Systematic Literature Review and a Comparative Experiment of Second Order Mutation.IEEE Trans. Softw. Eng.40, 1 (Jan. 2014), 23–42. doi:10.1109/ TSE.2013.44
work page 2014
-
[48]
Phil McMinn, Chris J. Wright, Colton J. McCurdy, and Gregory M. Kapfhammer. 2019. Automatic Detection and Removal of Ineffective Mutants for the Mutation Analysis of Relational Database Schemas. IEEE Transactions on Software Engineering45, 5 (2019), 427–463. doi:10 .1109/TSE.2017.2786286
-
[49]
Lopes, Iris Ma, and James No- ble
Md Rakib Hossain Misu, Cristina V. Lopes, Iris Ma, and James No- ble. 2024. Towards AI-Assisted Synthesis of Verified Dafny Methods. arXiv:2402.00247 [cs.AI] https://arxiv.org/abs/2402.00247
-
[50]
Myers, Corey Sandler, and Tom Badgett
Glenford J. Myers, Corey Sandler, and Tom Badgett. 2011.The Art of Software Testing(3rd ed.). Wiley Publishing
work page 2011
-
[51]
Mike Papadakis, Yue Jia, Mark Harman, and Yves Le Traon. 2015. Trivial Compiler Equivalence: A Large Scale Empirical Study of a Simple, Fast and Effective Equivalent Mutant Detection Technique. In2015 IEEE/ACM 37th IEEE International Conference on Software Engineering, Vol. 1. 936–946. doi:10.1109/ICSE.2015.103
-
[52]
Sánchez, Pedro Delgado-Pérez, Inmaculada Medina-Bulo, and Sergio Segura
Ana B. Sánchez, Pedro Delgado-Pérez, Inmaculada Medina-Bulo, and Sergio Segura. 2022. Mutation testing in the wild: findings from GitHub. Empirical Softw. Engg.27, 6 (Nov. 2022), 35 pages. doi:10.1007/s10664- 022-10177-8
-
[53]
Ana B. Sánchez, José A. Parejo, Sergio Segura, Amador Durán, and Mike Papadakis. 2024. Mutation Testing in Practice: Insights From Open- Source Software Developers.IEEE Transactions on Software Engineering 50, 5 (2024), 1130–1143. doi:10.1109/TSE.2024.3377378
-
[54]
Roy Patrick Tan and Stephen H. Edwards. 2004. Experiences evaluating the effectiveness of JML-JUnit testing.SIGSOFT Softw. Eng. Notes29, 5 (Sept. 2004), 1–4. doi:10.1145/1022494.1022545
-
[55]
Michele Tufano, Cody Watson, Gabriele Bavota, Massimiliano Di Penta, Martin White, and Denys Poshyvanyk. 2019. Learning How to Mutate Source Code from Bug-Fixes . In2019 IEEE International Conference on Software Maintenance and Evolution (ICSME). IEEE Computer Society, Los Alamitos, CA, USA, 301–312. doi:10.1109/ICSME.2019.00046
-
[56]
Kaiyuan Wang, Allison Sullivan, and Sarfraz Khurshid. 2018. MuAlloy: a mutation testing framework for alloy. InProceedings of the 40th Inter- national Conference on Software Engineering: Companion Proceedings (Gothenburg, Sweden)(ICSE ’18). Association for Computing Machin- ery, New York, NY, USA, 29–32. doi:10.1145/3183440.3183488
-
[57]
2012.Experimentation in software engi- neering
Claes Wohlin, Per Runeson, Martin Höst, Magnus C Ohlsson, Björn Regnell, Anders Wesslén, et al. 2012.Experimentation in software engi- neering. Vol. 236. Springer
work page 2012
- [58]
-
[59]
Xiangjuan Yao, Mark Harman, and Yue Jia. 2014. A study of equivalent and stubborn mutation operators using human analysis of equivalence. InProceedings of the 36th International Conference on Software Engi- neering(Hyderabad, India)(ICSE 2014). Association for Computing Machinery, New York, NY, USA, 919–930. doi:10.1145/2568225.2568265
-
[60]
R.K. Yin. 2009.Case Study Research: Design and Methods. SAGE Publi- cations
work page 2009
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.