MutDafny: A Mutation-Based Approach to Assess Dafny Specifications

Alexandra Mendes; Isabel Amaral; Jos\'e Campos

arxiv: 2511.15403 · v2 · submitted 2025-11-19 · 💻 cs.SE · cs.PL

MutDafny: A Mutation-Based Approach to Assess Dafny Specifications

Isabel Amaral , Alexandra Mendes , Jos\'e Campos This is my paper

Pith reviewed 2026-05-17 21:02 UTC · model grok-4.3

classification 💻 cs.SE cs.PL

keywords mutation testingformal specificationsDafnyprogram verificationsoftware qualityspecification weaknessesfault injection

0 comments

The pith

MutDafny applies mutations to Dafny programs to expose weaknesses in their formal specifications.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces a mutation testing method for Dafny that inserts faults into verified code and checks whether the existing specifications still hold. When a mutant verifies successfully, the specification has failed to rule out that particular deviation from intended behavior. The authors assembled 40 operators by adapting ones from other tools and creating new ones drawn from bug-fix patterns in public Dafny repositories. They ran the approach on 794 real-world programs and manually inspected undetected mutants, locating five weak specifications that averaged one per 241 lines of code. A sympathetic reader would see this as a practical way to make formally verified software more trustworthy by catching under-specified contracts before deployment.

Core claim

The central claim is that mutation testing, when the oracle is supplied by formal specifications rather than test cases, can systematically surface weaknesses in Dafny specifications: if a program variant still verifies after a mutation, the specification does not constrain the behavior tightly enough to reject that change.

What carries the argument

MutDafny, a tool that applies a set of 40 mutation operators to Dafny source and re-runs the verifier to detect cases where the specification accepts the mutated program.

If this is right

Developers can use undetected mutants as concrete prompts to strengthen specifications in their Dafny code.
The rate of one weak specification per 241 lines provides a baseline for how often real-world Dafny contracts need review.
The same mutation set can be applied repeatedly during development to catch specification drift as code evolves.
Five concrete weak specifications were identified that would have allowed incorrect behavior to go undetected.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The technique could be ported to other verification-aware languages that already have mature verifiers, such as those supporting contracts or dependent types.
Automated reporting of undetected mutants might be integrated into continuous verification pipelines to flag specification gaps early.
Patterns among the undetected mutants could reveal common specification oversights, such as missing invariants on data structures, that recur across projects.

Load-bearing premise

Verification succeeding on a mutant reliably signals a weakness in the specification rather than the mutation being semantically invalid or the verifier being incomplete.

What would settle it

A manual review of all undetected mutants from the 794-program corpus that finds every one either invalidates the program semantics or is explainable by known verifier incompleteness, with no genuine specification weaknesses remaining.

Figures

Figures reproduced from arXiv: 2511.15403 by Alexandra Mendes, Isabel Amaral, Jos\'e Campos.

**Figure 1.** Figure 1: Mutation testing pipeline with MutDafny. MutDafny’s components are depicted by the filled boxes, the rest corresponding to Dafny. [PITH_FULL_IMAGE:figures/full_fig_p009_1.png] view at source ↗

**Figure 2.** Figure 2: Mutation score distribution. we record the total runtime of each scan/mutation process and the time spent in each of Dafny’s workflow stages and executing the plugin. This data allows us to compute the mutant generation, mutation analysis, and total runtime. Additionally, we compute the mutation score [26], i.e., the ratio of killed mutants to the total number of mutants, excluding invalid mutants, e.g., … view at source ↗

**Figure 3.** Figure 3: Plugin, mutant generation, mutation analysis, and total [PITH_FULL_IMAGE:figures/full_fig_p011_3.png] view at source ↗

read the original abstract

In verification-aware languages, such as Dafny, despite their critical role, specifications are as prone to error as implementations. Flaws in specifications can result in formally verified programs that deviate from the intended behavior. In this paper, we explore the use of mutation testing to reveal weaknesses in formal specifications written in Dafny. We present MutDafny, a tool that increases the reliability of Dafny specifications by automatically signaling potential weaknesses. Using a mutation testing approach, we introduce faults (mutations) into the code and rely on formal specifications for detecting them. If a program with a mutant verifies, this may indicate a weakness in the specification. We extensively analyze mutation operators from popular tools, identifying the ones applicable to Dafny. In addition, we synthesize new operators tailored for the language from bugfix commits in publicly available Dafny projects on GitHub. Drawing from both, we equipped our tool with a total of 40 mutation operators. We evaluate MutDafny's effectiveness and efficiency on a dataset of 794 real-world Dafny programs, and manually analyze a subset of the resulting undetected mutants, identifying five weak real-world specifications (on average, one at every 241 lines of code) that would benefit from strengthening.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper presents MutDafny, a mutation-testing tool to assess the strength of Dafny specifications. It selects and synthesizes 40 mutation operators (drawing from existing tools and GitHub bug-fix commits), applies the approach to 794 real-world Dafny programs, and reports that manual review of undetected mutants surfaced five weak specifications (one per 241 LOC on average).

Significance. If the central interpretation holds, the work supplies a concrete, scalable method for exposing specification weaknesses in verified Dafny code—an important practical contribution because specifications are themselves error-prone. The evaluation on a sizable external corpus of real programs and the explicit identification of five weak specifications constitute tangible evidence of utility.

major comments (2)

[Evaluation / manual analysis of undetected mutants] Evaluation (undetected-mutant analysis): the headline result of five weak specifications rests on the interpretation that a mutant that still verifies signals a specification weakness. The manuscript must explicitly describe, for each of the five cases, how semantic equivalence of the mutant was ruled out and how potential incompleteness of the Dafny verifier was addressed; without this, the count cannot be taken as secure evidence of specification weakness rather than an artifact of mutation validity or verifier limits.
[Abstract and Evaluation section] Abstract and §4 (evaluation setup): the effectiveness claim is only moderately supported because the abstract and main evaluation text omit the total number of mutants generated, the precise mutant-selection criteria, and any statistical measures (e.g., mutation score distribution or detection-rate confidence intervals). These quantities are load-bearing for the claim that the tool reliably surfaces weak specifications.

minor comments (2)

[Mutation operator selection] The description of how the 40 operators were chosen and synthesized from GitHub commits could be expanded with a short table or enumeration to improve reproducibility.
[Throughout] Minor typographical inconsistencies appear in the operator list and in the reported average (241 LOC); a consistency pass would help.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments. We address each major point below and describe the revisions we will incorporate.

read point-by-point responses

Referee: [Evaluation / manual analysis of undetected mutants] Evaluation (undetected-mutant analysis): the headline result of five weak specifications rests on the interpretation that a mutant that still verifies signals a specification weakness. The manuscript must explicitly describe, for each of the five cases, how semantic equivalence of the mutant was ruled out and how potential incompleteness of the Dafny verifier was addressed; without this, the count cannot be taken as secure evidence of specification weakness rather than an artifact of mutation validity or verifier limits.

Authors: We agree that the current description of the manual analysis is insufficient to fully support the claim. In the revised manuscript we will add a new subsection under the evaluation that details, for each of the five cases, the concrete steps used to establish that the mutant was not semantically equivalent (e.g., by exhibiting a concrete input that produces different observable behavior or by showing that the mutant violates an auxiliary assertion that the original satisfies) and the measures taken to address possible verifier incompleteness (e.g., running with all verification options enabled, increasing resource limits, and cross-checking a subset of cases with an alternative solver configuration). revision: yes
Referee: [Abstract and Evaluation section] Abstract and §4 (evaluation setup): the effectiveness claim is only moderately supported because the abstract and main evaluation text omit the total number of mutants generated, the precise mutant-selection criteria, and any statistical measures (e.g., mutation score distribution or detection-rate confidence intervals). These quantities are load-bearing for the claim that the tool reliably surfaces weak specifications.

Authors: We accept that these quantitative details should be stated explicitly. We will revise the abstract to report the total number of mutants generated and the criteria used to select and synthesize the 40 operators. In Section 4 we will add the distribution of per-program mutation scores together with any available statistical summaries (including detection-rate confidence intervals where the sample size permits). These additions will be placed in the evaluation setup and results subsections. revision: yes

Circularity Check

0 steps flagged

No circularity: results grounded in external data and manual analysis

full rationale

The paper's core claims derive from applying 40 mutation operators (sourced from external popular tools plus synthesis from independent GitHub Dafny bugfix commits) to a dataset of 794 real-world programs, followed by manual inspection of undetected mutants. No equations, fitted parameters, self-definitional constructs, or load-bearing self-citations appear in the derivation. The count of five weak specifications is an empirical observation from external programs rather than a quantity that reduces to internal definitions or prior author results by construction. This matches the default expectation for an empirical tool-evaluation paper with independent benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The work rests on the standard premise of mutation testing without introducing fitted numerical parameters or new postulated entities; the only notable assumption is the interpretive link between undetected mutants and specification weakness.

axioms (1)

domain assumption If a mutated program still verifies under the original specification, then the specification is weak or incomplete for the intended behavior.
This interpretive step is invoked when the paper concludes that undetected mutants indicate specifications that would benefit from strengthening.

pith-pipeline@v0.9.0 · 5518 in / 1409 out tokens · 74081 ms · 2026-05-17T21:02:45.717670+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

59 extracted references · 59 canonical work pages · 1 internal anchor

[1]

https://github.com/avito-tech/go-mutesting Last accessed Nov

2016.Go-mutesting (fork). https://github.com/avito-tech/go-mutesting Last accessed Nov. 2025

work page 2016
[2]

https://github.com/zimmski/go- mutesting Last accessed Nov

2016.Go-mutesting (original). https://github.com/zimmski/go- mutesting Last accessed Nov. 2025

work page 2016
[3]

https://mutation-testing.org Last accessed Nov

2011.Major. https://mutation-testing.org Last accessed Nov. 2025

work page 2011
[4]

https://github.com/jeffoffutt/muJava Last accessed Nov

2005.MuJava. https://github.com/jeffoffutt/muJava Last accessed Nov. 2025

work page 2005
[5]

https://github.com/mull-project/mull Last accessed Nov

2018.Mull. https://github.com/mull-project/mull Last accessed Nov. 2025

work page 2018
[6]

https://github.com/boxed/mutmut Last accessed Nov

2018.Mutmut. https://github.com/boxed/mutmut Last accessed Nov. 2025

work page 2018
[7]

http://pitest.org Last accessed Nov

2016.PIT. http://pitest.org Last accessed Nov. 2025

work page 2016
[8]

https://stryker-mutator.io/docs/stryker-js/introduct ion/ Last accessed Nov

2017.StrykerJS. https://stryker-mutator.io/docs/stryker-js/introduct ion/ Last accessed Nov. 2025

work page 2017
[9]

https://stryker-mutator.io/docs/stryker-net/introd uction/ Last accessed Nov

2021.Stryker.NET. https://stryker-mutator.io/docs/stryker-net/introd uction/ Last accessed Nov. 2025

work page 2021
[10]

Tailored Mutants Fit Bugs Better

Miltiadis Allamanis, Earl T. Barr, René Just, and Charles Sutton. 2016. Tailored Mutants Fit Bugs Better. arXiv:1611.02516 [cs.SE] https: //arxiv.org/abs/1611.02516

work page internal anchor Pith review Pith/arXiv arXiv 2016
[11]

Lennart Beringer, Adam Petcher, Q Ye Katherine, and Andrew W Appel

work page
[12]

InUSENIX Security Symposium

Verified Correctness and Security of OpenSSL HMAC.. InUSENIX Security Symposium. 207–221

work page
[13]

David Bingham Brown, Michael Vaughn, Ben Liblit, and Thomas Reps

work page
[14]

InProceedings of the 2017 11th Joint Meeting on Foundations of Software Engineering (Paderborn, Germany)(ESEC/FSE 2017)

The care and feeding of wild-caught mutants. InProceedings of the 2017 11th Joint Meeting on Foundations of Software Engineering (Paderborn, Germany)(ESEC/FSE 2017). Association for Computing Machinery, New York, NY, USA, 511–522. doi:10.1145/3106237.3106280

work page doi:10.1145/3106237.3106280 2017
[15]

Budd and Dana Angluin

Timothy A. Budd and Dana Angluin. 1982. Two notions of correctness and their relation to testing.Acta Inf.18, 1 (March 1982), 31–45. doi:10 .1007/BF00625279

work page 1982
[16]

Franck Cassez. 2021. Verification of the incremental Merkle tree al- gorithm with Dafny. InInternational Symposium on Formal Methods. Springer, 445–462

work page 2021
[17]

Aleks Chakarov, Jaco Geldenhuys, Matthew Heck, Michael Hicks, Sam Huang, Georges-Axel Jaloyan, Anjali Joshi, K Rustan M Leino, Mikael Mayer, Sean McLaughlin, et al. 2025. Formally Verified Cloud-Scale Au- thorization. In2025 IEEE/ACM 47th International Conference on Software Engineering (ICSE). IEEE Computer Society, 703–703

work page 2025
[18]

Yoonsik Cheon and Gary T. Leavens. 2002. A Simple and Practical Approach to Unit Testing: The JML and JUnit Way. InECOOP 2002 — Object-Oriented Programming, Boris Magnusson (Ed.). Springer Berlin Heidelberg, Berlin, Heidelberg, 231–255

work page 2002
[19]

Henry Coles, Thomas Laurent, Christopher Henard, Mike Papadakis, and Anthony Ventresque. 2016. PIT: a practical mutation testing tool for Java (demo). InProceedings of the 25th International Symposium on Software Testing and Analysis(Saarbrücken, Germany)(ISSTA 2016). Association for Computing Machinery, New York, NY, USA, 449–452. doi:10.1145/2931037.2948707

work page doi:10.1145/2931037.2948707 2016
[20]

Alex Denisov and Stanislav Pankevich. 2018. Mull It Over: Mutation Testing Based on LLVM. In2018 IEEE International Conference on Soft- ware Testing, Verification and Validation Workshops (ICSTW). 25–31. doi:10.1109/ICSTW.2018.00024

work page doi:10.1109/icstw.2018.00024 2018
[21]

Anna Derezińska and Konrad Hałas. 2014. Analysis of Mutation Op- erators for the Python Language. InProceedings of the Ninth Inter- national Conference on Dependability and Complex Systems DepCoS- RELCOMEX. June 30 – July 4, 2014, Brunów, Poland, Wojciech Zamojski, Jacek Mazurkiewicz, Jarosław Sugier, Tomasz Walkowiak, and Janusz Kacprzyk (Eds.). Springer...

work page 2014
[22]

Madeline Endres, Sarah Fakhoury, Saikat Chakraborty, and Shuvendu K. Lahiri. 2024. Can Large Language Models Transform Natural Language Intent into Formal Method Postconditions?Proc. ACM Softw. Eng.1, FSE, Article 84 (July 2024), 24 pages. doi:10.1145/3660791

work page doi:10.1145/3660791 2024
[23]

Pedro Fonseca, Kaiyuan Zhang, Xi Wang, and Arvind Krishnamurthy

work page
[24]

InProceedings of the Twelfth European Conference on Computer Systems(Belgrade, Serbia)(EuroSys ’17)

An Empirical Study on the Correctness of Formally Verified Distributed Systems. InProceedings of the Twelfth European Conference on Computer Systems(Belgrade, Serbia)(EuroSys ’17). Association for Computing Machinery, New York, NY, USA, 328–343. doi:10.1145/30 64176.3064183

work page doi:10.1145/30
[25]

Eli Goldweber, Weixin Yu, Seyed Armin Vakil Ghahani, and Manos Kapritsos. 2024. IronSpec: Increasing the Reliability of Formal Specifica- tions. In18th USENIX Symposium on Operating Systems Design and Im- plementation (OSDI 24). USENIX Association, Santa Clara, CA, 875–891. https://www.usenix.org/conference/osdi24/presentation/goldweber

work page 2024
[26]

Zhang, Marinos Kintis, and Mike Papadakis

Rahul Gopinath, Jie M. Zhang, Marinos Kintis, and Mike Papadakis

work page
[27]

doi:10.1002/stvr.1831 arXiv:https://onlinelibrary.wiley.com/doi/pdf/10.1002/stvr.1831

Mutation analysis and its industrial applications.Software Testing, Verification and Reliability32, 8 (2022), e1831. doi:10.1002/stvr.1831 arXiv:https://onlinelibrary.wiley.com/doi/pdf/10.1002/stvr.1831

work page doi:10.1002/stvr.1831 2022
[28]

Mahdi Houshmand and Samad Paydar. 2017. TCE+: An Extension of the TCE Method for Detecting Equivalent Mutants in Java Programs. InFundamentals of Software Engineering, Mehdi Dastani and Marjan Sirjani (Eds.). Springer International Publishing, Cham, 164–179

work page 2017
[30]

Yue Jia and Mark Harman. 2011. An Analysis and Survey of the Devel- opment of Mutation Testing.IEEE Transactions on Software Engineering 37, 5 (2011), 649–678. doi:10.1109/TSE.2010.62

work page doi:10.1109/tse.2010.62 2011
[31]

René Just. 2014. The major mutation framework: efficient and scalable mutation analysis for Java. InProceedings of the 2014 International Symposium on Software Testing and Analysis(San Jose, CA, USA)(ISSTA 2014). Association for Computing Machinery, New York, NY, USA, 433–436. doi:10.1145/2610384.2628053

work page doi:10.1145/2610384.2628053 2014
[32]

Kapfhammer

René Just, Franz Schweiggert, and Gregory M. Kapfhammer. 2011. MA- JOR: An efficient and extensible tool for mutation analysis in a Java com- piler. In2011 26th IEEE/ACM International Conference on Automated Soft- ware Engineering (ASE 2011). 612–615. doi:10.1109/ASE.2011.6100138

work page doi:10.1109/ase.2011.6100138 2011
[33]

Marinos Kintis, Mike Papadakis, Yue Jia, Nicos Malevris, Yves Le Traon, and Mark Harman. 2018. Detecting Trivial Mutant Equivalences via Compiler Optimisations.IEEE Transactions on Software Engineering44, 4 (2018), 308–333. doi:10.1109/TSE.2017.2684805

work page doi:10.1109/tse.2017.2684805 2018
[34]

Gerwin Klein, Kevin Elphinstone, Gernot Heiser, June Andronick, David Cock, Philip Derrin, Dhammika Elkaduwe, Kai Engelhardt, Rafal Kolan- ski, Michael Norrish, et al . 2009. seL4: Formal verification of an OS kernel. InProceedings of the ACM SIGOPS 22nd symposium on Operating systems principles. ACM, 207–220

work page 2009
[35]

Kaufman, Ryan Featherman, Hannah Potter, Ardi Madadi, and René Just

Benjamin Kushigian, Samuel J. Kaufman, Ryan Featherman, Hannah Potter, Ardi Madadi, and René Just. 2024. Equivalent Mutants in the Wild: Identifying and Efficiently Suppressing Equivalent Mutants for Java Programs. InProceedings of the 33rd ACM SIGSOFT International Symposium on Software Testing and Analysis(Vienna, Austria)(ISSTA 2024). Association for C...

work page doi:10.1145/3650212.3680310 2024
[36]

Benjamin Kushigian, Amit Rawat, and René Just. 2019. Medusa: Mu- tant Equivalence Detection Using Satisfiability Analysis. In2019 IEEE International Conference on Software Testing, Verification and Validation Workshops (ICSTW). 77–82. doi:10.1109/ICSTW.2019.00035

work page doi:10.1109/icstw.2019.00035 2019
[37]

Shuvendu K. Lahiri. 2024. Evaluating LLM-driven User-Intent Formal- ization for Verification-Aware Languages. doi:10.34727/2024/ISBN.978- 3-85448-065-5_19

work page doi:10.34727/2024/isbn.978- 2024
[38]

Le Traon, B

Y. Le Traon, B. Baudry, and J.-M. Jezequel. 2006. Design by Contract to Improve Software Vigilance.IEEE Transactions on Software Engineering 32, 8 (2006), 571–586. doi:10.1109/TSE.2006.79

work page doi:10.1109/tse.2006.79 2006
[39]

Rustan M

K. Rustan M. Leino. 2010. Dafny: An Automatic Program Verifier for Functional Correctness. InLogic for Programming, Artificial Intelligence, and Reasoning, Edmund M. Clarke and Andrei Voronkov (Eds.). Springer Berlin Heidelberg, Berlin, Heidelberg, 348–370

work page 2010
[40]

Xavier Leroy. 2009. A formally verified compiler back-end.Journal of Automated Reasoning43, 4 (2009), 363

work page 2009
[41]

Chloe Loughridge, Qinyi Sun, Seth Ahrenbach, Federico Cassano, Chuyue Sun, Ying Sheng, Anish Mudide, Md Rakib Hossain Misu, Nada Amin, and Max Tegmark. 2024. DafnyBench: A Benchmark for Formal Software Verification. (2024). arXiv:2406.08467 [cs.SE] https://arxiv.org/abs/2406.08467

work page arXiv 2024
[42]

Yu-seung Ma and Jeff Offutt. 2014. Description of Class Mutation Mutation Operators for Java. (2014). https://cs.gmu.edu/~offutt/mujav a/mutopsClass.pdf

work page 2014
[43]

Update 2016

Yu-seung Ma and Jeff Offutt. Update 2016. Description of muJava’s Method-level Mutation Operators. (Update 2016). https://cs.gmu.edu /~offutt/mujava/mutopsMethod.pdf

work page 2016
[44]

Yu-Seung Ma, Jeff Offutt, and Yong Kwon. 2005. MuJava: An automated class mutation system.Softw. Test., Verif. Reliab.15 (06 2005), 97–133. doi:10.1002/stvr.308 MutDafny: A Mutation-Based Approach to Assess Dafny Specifications

work page doi:10.1002/stvr.308 2005
[45]

Yu-Seung Ma, Jeff Offutt, and Yong-Rae Kwon. 2006. MuJava: a mutation system for java. InProceedings of the 28th International Conference on Software Engineering(Shanghai, China)(ICSE ’06). Association for Computing Machinery, New York, NY, USA, 827–830. doi:10.1145/11 34285.1134425

work page doi:10.1145/11 2006
[46]

Dor Ma’ayan, Shahar Maoz, and Roey Rozi. 2022. Validating the correct- ness of reactive systems specifications through systematic exploration. InProceedings of the 25th International Conference on Model Driven Engineering Languages and Systems(Montreal, Quebec, Canada)(MOD- ELS ’22). Association for Computing Machinery, New York, NY, USA, 132–142. doi:10....

work page doi:10.1145/3550355.3552425 2022
[47]

Lech Madeyski, Wojciech Orzeszyna, Richard Torkar, and Mariusz Jozala. 2014. Overcoming the Equivalent Mutant Problem: A Systematic Literature Review and a Comparative Experiment of Second Order Mutation.IEEE Trans. Softw. Eng.40, 1 (Jan. 2014), 23–42. doi:10.1109/ TSE.2013.44

work page 2014
[48]

Wright, Colton J

Phil McMinn, Chris J. Wright, Colton J. McCurdy, and Gregory M. Kapfhammer. 2019. Automatic Detection and Removal of Ineffective Mutants for the Mutation Analysis of Relational Database Schemas. IEEE Transactions on Software Engineering45, 5 (2019), 427–463. doi:10 .1109/TSE.2017.2786286

work page arXiv 2019
[49]

Lopes, Iris Ma, and James No- ble

Md Rakib Hossain Misu, Cristina V. Lopes, Iris Ma, and James No- ble. 2024. Towards AI-Assisted Synthesis of Verified Dafny Methods. arXiv:2402.00247 [cs.AI] https://arxiv.org/abs/2402.00247

work page arXiv 2024
[50]

Myers, Corey Sandler, and Tom Badgett

Glenford J. Myers, Corey Sandler, and Tom Badgett. 2011.The Art of Software Testing(3rd ed.). Wiley Publishing

work page 2011
[51]

Mike Papadakis, Yue Jia, Mark Harman, and Yves Le Traon. 2015. Trivial Compiler Equivalence: A Large Scale Empirical Study of a Simple, Fast and Effective Equivalent Mutant Detection Technique. In2015 IEEE/ACM 37th IEEE International Conference on Software Engineering, Vol. 1. 936–946. doi:10.1109/ICSE.2015.103

work page doi:10.1109/icse.2015.103 2015
[52]

Sánchez, Pedro Delgado-Pérez, Inmaculada Medina-Bulo, and Sergio Segura

Ana B. Sánchez, Pedro Delgado-Pérez, Inmaculada Medina-Bulo, and Sergio Segura. 2022. Mutation testing in the wild: findings from GitHub. Empirical Softw. Engg.27, 6 (Nov. 2022), 35 pages. doi:10.1007/s10664- 022-10177-8

work page doi:10.1007/s10664- 2022
[53]

Sánchez, José A

Ana B. Sánchez, José A. Parejo, Sergio Segura, Amador Durán, and Mike Papadakis. 2024. Mutation Testing in Practice: Insights From Open- Source Software Developers.IEEE Transactions on Software Engineering 50, 5 (2024), 1130–1143. doi:10.1109/TSE.2024.3377378

work page doi:10.1109/tse.2024.3377378 2024
[54]

Roy Patrick Tan and Stephen H. Edwards. 2004. Experiences evaluating the effectiveness of JML-JUnit testing.SIGSOFT Softw. Eng. Notes29, 5 (Sept. 2004), 1–4. doi:10.1145/1022494.1022545

work page doi:10.1145/1022494.1022545 2004
[55]

Michele Tufano, Cody Watson, Gabriele Bavota, Massimiliano Di Penta, Martin White, and Denys Poshyvanyk. 2019. Learning How to Mutate Source Code from Bug-Fixes . In2019 IEEE International Conference on Software Maintenance and Evolution (ICSME). IEEE Computer Society, Los Alamitos, CA, USA, 301–312. doi:10.1109/ICSME.2019.00046

work page doi:10.1109/icsme.2019.00046 2019
[56]

Kaiyuan Wang, Allison Sullivan, and Sarfraz Khurshid. 2018. MuAlloy: a mutation testing framework for alloy. InProceedings of the 40th Inter- national Conference on Software Engineering: Companion Proceedings (Gothenburg, Sweden)(ICSE ’18). Association for Computing Machin- ery, New York, NY, USA, 29–32. doi:10.1145/3183440.3183488

work page doi:10.1145/3183440.3183488 2018
[57]

2012.Experimentation in software engi- neering

Claes Wohlin, Per Runeson, Martin Höst, Magnus C Ohlsson, Björn Regnell, Anders Wesslén, et al. 2012.Experimentation in software engi- neering. Vol. 236. Springer

work page 2012
[58]

Woodward

M.R. Woodward. 1993. Mutation testing—its origin and evolution. Information and Software Technology35, 3 (1993), 163–169. doi:10.101 6/0950-5849(93)90053-6

work page 1993
[59]

Xiangjuan Yao, Mark Harman, and Yue Jia. 2014. A study of equivalent and stubborn mutation operators using human analysis of equivalence. InProceedings of the 36th International Conference on Software Engi- neering(Hyderabad, India)(ICSE 2014). Association for Computing Machinery, New York, NY, USA, 919–930. doi:10.1145/2568225.2568265

work page doi:10.1145/2568225.2568265 2014
[60]

R.K. Yin. 2009.Case Study Research: Design and Methods. SAGE Publi- cations

work page 2009

[1] [1]

https://github.com/avito-tech/go-mutesting Last accessed Nov

2016.Go-mutesting (fork). https://github.com/avito-tech/go-mutesting Last accessed Nov. 2025

work page 2016

[2] [2]

https://github.com/zimmski/go- mutesting Last accessed Nov

2016.Go-mutesting (original). https://github.com/zimmski/go- mutesting Last accessed Nov. 2025

work page 2016

[3] [3]

https://mutation-testing.org Last accessed Nov

2011.Major. https://mutation-testing.org Last accessed Nov. 2025

work page 2011

[4] [4]

https://github.com/jeffoffutt/muJava Last accessed Nov

2005.MuJava. https://github.com/jeffoffutt/muJava Last accessed Nov. 2025

work page 2005

[5] [5]

https://github.com/mull-project/mull Last accessed Nov

2018.Mull. https://github.com/mull-project/mull Last accessed Nov. 2025

work page 2018

[6] [6]

https://github.com/boxed/mutmut Last accessed Nov

2018.Mutmut. https://github.com/boxed/mutmut Last accessed Nov. 2025

work page 2018

[7] [7]

http://pitest.org Last accessed Nov

2016.PIT. http://pitest.org Last accessed Nov. 2025

work page 2016

[8] [8]

https://stryker-mutator.io/docs/stryker-js/introduct ion/ Last accessed Nov

2017.StrykerJS. https://stryker-mutator.io/docs/stryker-js/introduct ion/ Last accessed Nov. 2025

work page 2017

[9] [9]

https://stryker-mutator.io/docs/stryker-net/introd uction/ Last accessed Nov

2021.Stryker.NET. https://stryker-mutator.io/docs/stryker-net/introd uction/ Last accessed Nov. 2025

work page 2021

[10] [10]

Tailored Mutants Fit Bugs Better

Miltiadis Allamanis, Earl T. Barr, René Just, and Charles Sutton. 2016. Tailored Mutants Fit Bugs Better. arXiv:1611.02516 [cs.SE] https: //arxiv.org/abs/1611.02516

work page internal anchor Pith review Pith/arXiv arXiv 2016

[11] [11]

Lennart Beringer, Adam Petcher, Q Ye Katherine, and Andrew W Appel

work page

[12] [12]

InUSENIX Security Symposium

Verified Correctness and Security of OpenSSL HMAC.. InUSENIX Security Symposium. 207–221

work page

[13] [13]

David Bingham Brown, Michael Vaughn, Ben Liblit, and Thomas Reps

work page

[14] [14]

InProceedings of the 2017 11th Joint Meeting on Foundations of Software Engineering (Paderborn, Germany)(ESEC/FSE 2017)

The care and feeding of wild-caught mutants. InProceedings of the 2017 11th Joint Meeting on Foundations of Software Engineering (Paderborn, Germany)(ESEC/FSE 2017). Association for Computing Machinery, New York, NY, USA, 511–522. doi:10.1145/3106237.3106280

work page doi:10.1145/3106237.3106280 2017

[15] [15]

Budd and Dana Angluin

Timothy A. Budd and Dana Angluin. 1982. Two notions of correctness and their relation to testing.Acta Inf.18, 1 (March 1982), 31–45. doi:10 .1007/BF00625279

work page 1982

[16] [16]

Franck Cassez. 2021. Verification of the incremental Merkle tree al- gorithm with Dafny. InInternational Symposium on Formal Methods. Springer, 445–462

work page 2021

[17] [17]

Aleks Chakarov, Jaco Geldenhuys, Matthew Heck, Michael Hicks, Sam Huang, Georges-Axel Jaloyan, Anjali Joshi, K Rustan M Leino, Mikael Mayer, Sean McLaughlin, et al. 2025. Formally Verified Cloud-Scale Au- thorization. In2025 IEEE/ACM 47th International Conference on Software Engineering (ICSE). IEEE Computer Society, 703–703

work page 2025

[18] [18]

Yoonsik Cheon and Gary T. Leavens. 2002. A Simple and Practical Approach to Unit Testing: The JML and JUnit Way. InECOOP 2002 — Object-Oriented Programming, Boris Magnusson (Ed.). Springer Berlin Heidelberg, Berlin, Heidelberg, 231–255

work page 2002

[19] [19]

Henry Coles, Thomas Laurent, Christopher Henard, Mike Papadakis, and Anthony Ventresque. 2016. PIT: a practical mutation testing tool for Java (demo). InProceedings of the 25th International Symposium on Software Testing and Analysis(Saarbrücken, Germany)(ISSTA 2016). Association for Computing Machinery, New York, NY, USA, 449–452. doi:10.1145/2931037.2948707

work page doi:10.1145/2931037.2948707 2016

[20] [20]

Alex Denisov and Stanislav Pankevich. 2018. Mull It Over: Mutation Testing Based on LLVM. In2018 IEEE International Conference on Soft- ware Testing, Verification and Validation Workshops (ICSTW). 25–31. doi:10.1109/ICSTW.2018.00024

work page doi:10.1109/icstw.2018.00024 2018

[21] [21]

Anna Derezińska and Konrad Hałas. 2014. Analysis of Mutation Op- erators for the Python Language. InProceedings of the Ninth Inter- national Conference on Dependability and Complex Systems DepCoS- RELCOMEX. June 30 – July 4, 2014, Brunów, Poland, Wojciech Zamojski, Jacek Mazurkiewicz, Jarosław Sugier, Tomasz Walkowiak, and Janusz Kacprzyk (Eds.). Springer...

work page 2014

[22] [22]

Madeline Endres, Sarah Fakhoury, Saikat Chakraborty, and Shuvendu K. Lahiri. 2024. Can Large Language Models Transform Natural Language Intent into Formal Method Postconditions?Proc. ACM Softw. Eng.1, FSE, Article 84 (July 2024), 24 pages. doi:10.1145/3660791

work page doi:10.1145/3660791 2024

[23] [23]

Pedro Fonseca, Kaiyuan Zhang, Xi Wang, and Arvind Krishnamurthy

work page

[24] [24]

InProceedings of the Twelfth European Conference on Computer Systems(Belgrade, Serbia)(EuroSys ’17)

An Empirical Study on the Correctness of Formally Verified Distributed Systems. InProceedings of the Twelfth European Conference on Computer Systems(Belgrade, Serbia)(EuroSys ’17). Association for Computing Machinery, New York, NY, USA, 328–343. doi:10.1145/30 64176.3064183

work page doi:10.1145/30

[25] [25]

Eli Goldweber, Weixin Yu, Seyed Armin Vakil Ghahani, and Manos Kapritsos. 2024. IronSpec: Increasing the Reliability of Formal Specifica- tions. In18th USENIX Symposium on Operating Systems Design and Im- plementation (OSDI 24). USENIX Association, Santa Clara, CA, 875–891. https://www.usenix.org/conference/osdi24/presentation/goldweber

work page 2024

[26] [26]

Zhang, Marinos Kintis, and Mike Papadakis

Rahul Gopinath, Jie M. Zhang, Marinos Kintis, and Mike Papadakis

work page

[27] [27]

doi:10.1002/stvr.1831 arXiv:https://onlinelibrary.wiley.com/doi/pdf/10.1002/stvr.1831

Mutation analysis and its industrial applications.Software Testing, Verification and Reliability32, 8 (2022), e1831. doi:10.1002/stvr.1831 arXiv:https://onlinelibrary.wiley.com/doi/pdf/10.1002/stvr.1831

work page doi:10.1002/stvr.1831 2022

[28] [28]

Mahdi Houshmand and Samad Paydar. 2017. TCE+: An Extension of the TCE Method for Detecting Equivalent Mutants in Java Programs. InFundamentals of Software Engineering, Mehdi Dastani and Marjan Sirjani (Eds.). Springer International Publishing, Cham, 164–179

work page 2017

[29] [30]

Yue Jia and Mark Harman. 2011. An Analysis and Survey of the Devel- opment of Mutation Testing.IEEE Transactions on Software Engineering 37, 5 (2011), 649–678. doi:10.1109/TSE.2010.62

work page doi:10.1109/tse.2010.62 2011

[30] [31]

René Just. 2014. The major mutation framework: efficient and scalable mutation analysis for Java. InProceedings of the 2014 International Symposium on Software Testing and Analysis(San Jose, CA, USA)(ISSTA 2014). Association for Computing Machinery, New York, NY, USA, 433–436. doi:10.1145/2610384.2628053

work page doi:10.1145/2610384.2628053 2014

[31] [32]

Kapfhammer

René Just, Franz Schweiggert, and Gregory M. Kapfhammer. 2011. MA- JOR: An efficient and extensible tool for mutation analysis in a Java com- piler. In2011 26th IEEE/ACM International Conference on Automated Soft- ware Engineering (ASE 2011). 612–615. doi:10.1109/ASE.2011.6100138

work page doi:10.1109/ase.2011.6100138 2011

[32] [33]

Marinos Kintis, Mike Papadakis, Yue Jia, Nicos Malevris, Yves Le Traon, and Mark Harman. 2018. Detecting Trivial Mutant Equivalences via Compiler Optimisations.IEEE Transactions on Software Engineering44, 4 (2018), 308–333. doi:10.1109/TSE.2017.2684805

work page doi:10.1109/tse.2017.2684805 2018

[33] [34]

Gerwin Klein, Kevin Elphinstone, Gernot Heiser, June Andronick, David Cock, Philip Derrin, Dhammika Elkaduwe, Kai Engelhardt, Rafal Kolan- ski, Michael Norrish, et al . 2009. seL4: Formal verification of an OS kernel. InProceedings of the ACM SIGOPS 22nd symposium on Operating systems principles. ACM, 207–220

work page 2009

[34] [35]

Kaufman, Ryan Featherman, Hannah Potter, Ardi Madadi, and René Just

Benjamin Kushigian, Samuel J. Kaufman, Ryan Featherman, Hannah Potter, Ardi Madadi, and René Just. 2024. Equivalent Mutants in the Wild: Identifying and Efficiently Suppressing Equivalent Mutants for Java Programs. InProceedings of the 33rd ACM SIGSOFT International Symposium on Software Testing and Analysis(Vienna, Austria)(ISSTA 2024). Association for C...

work page doi:10.1145/3650212.3680310 2024

[35] [36]

Benjamin Kushigian, Amit Rawat, and René Just. 2019. Medusa: Mu- tant Equivalence Detection Using Satisfiability Analysis. In2019 IEEE International Conference on Software Testing, Verification and Validation Workshops (ICSTW). 77–82. doi:10.1109/ICSTW.2019.00035

work page doi:10.1109/icstw.2019.00035 2019

[36] [37]

Shuvendu K. Lahiri. 2024. Evaluating LLM-driven User-Intent Formal- ization for Verification-Aware Languages. doi:10.34727/2024/ISBN.978- 3-85448-065-5_19

work page doi:10.34727/2024/isbn.978- 2024

[37] [38]

Le Traon, B

Y. Le Traon, B. Baudry, and J.-M. Jezequel. 2006. Design by Contract to Improve Software Vigilance.IEEE Transactions on Software Engineering 32, 8 (2006), 571–586. doi:10.1109/TSE.2006.79

work page doi:10.1109/tse.2006.79 2006

[38] [39]

Rustan M

K. Rustan M. Leino. 2010. Dafny: An Automatic Program Verifier for Functional Correctness. InLogic for Programming, Artificial Intelligence, and Reasoning, Edmund M. Clarke and Andrei Voronkov (Eds.). Springer Berlin Heidelberg, Berlin, Heidelberg, 348–370

work page 2010

[39] [40]

Xavier Leroy. 2009. A formally verified compiler back-end.Journal of Automated Reasoning43, 4 (2009), 363

work page 2009

[40] [41]

Chloe Loughridge, Qinyi Sun, Seth Ahrenbach, Federico Cassano, Chuyue Sun, Ying Sheng, Anish Mudide, Md Rakib Hossain Misu, Nada Amin, and Max Tegmark. 2024. DafnyBench: A Benchmark for Formal Software Verification. (2024). arXiv:2406.08467 [cs.SE] https://arxiv.org/abs/2406.08467

work page arXiv 2024

[41] [42]

Yu-seung Ma and Jeff Offutt. 2014. Description of Class Mutation Mutation Operators for Java. (2014). https://cs.gmu.edu/~offutt/mujav a/mutopsClass.pdf

work page 2014

[42] [43]

Update 2016

Yu-seung Ma and Jeff Offutt. Update 2016. Description of muJava’s Method-level Mutation Operators. (Update 2016). https://cs.gmu.edu /~offutt/mujava/mutopsMethod.pdf

work page 2016

[43] [44]

Yu-Seung Ma, Jeff Offutt, and Yong Kwon. 2005. MuJava: An automated class mutation system.Softw. Test., Verif. Reliab.15 (06 2005), 97–133. doi:10.1002/stvr.308 MutDafny: A Mutation-Based Approach to Assess Dafny Specifications

work page doi:10.1002/stvr.308 2005

[44] [45]

Yu-Seung Ma, Jeff Offutt, and Yong-Rae Kwon. 2006. MuJava: a mutation system for java. InProceedings of the 28th International Conference on Software Engineering(Shanghai, China)(ICSE ’06). Association for Computing Machinery, New York, NY, USA, 827–830. doi:10.1145/11 34285.1134425

work page doi:10.1145/11 2006

[45] [46]

Dor Ma’ayan, Shahar Maoz, and Roey Rozi. 2022. Validating the correct- ness of reactive systems specifications through systematic exploration. InProceedings of the 25th International Conference on Model Driven Engineering Languages and Systems(Montreal, Quebec, Canada)(MOD- ELS ’22). Association for Computing Machinery, New York, NY, USA, 132–142. doi:10....

work page doi:10.1145/3550355.3552425 2022

[46] [47]

Lech Madeyski, Wojciech Orzeszyna, Richard Torkar, and Mariusz Jozala. 2014. Overcoming the Equivalent Mutant Problem: A Systematic Literature Review and a Comparative Experiment of Second Order Mutation.IEEE Trans. Softw. Eng.40, 1 (Jan. 2014), 23–42. doi:10.1109/ TSE.2013.44

work page 2014

[47] [48]

Wright, Colton J

Phil McMinn, Chris J. Wright, Colton J. McCurdy, and Gregory M. Kapfhammer. 2019. Automatic Detection and Removal of Ineffective Mutants for the Mutation Analysis of Relational Database Schemas. IEEE Transactions on Software Engineering45, 5 (2019), 427–463. doi:10 .1109/TSE.2017.2786286

work page arXiv 2019

[48] [49]

Lopes, Iris Ma, and James No- ble

Md Rakib Hossain Misu, Cristina V. Lopes, Iris Ma, and James No- ble. 2024. Towards AI-Assisted Synthesis of Verified Dafny Methods. arXiv:2402.00247 [cs.AI] https://arxiv.org/abs/2402.00247

work page arXiv 2024

[49] [50]

Myers, Corey Sandler, and Tom Badgett

Glenford J. Myers, Corey Sandler, and Tom Badgett. 2011.The Art of Software Testing(3rd ed.). Wiley Publishing

work page 2011

[50] [51]

Mike Papadakis, Yue Jia, Mark Harman, and Yves Le Traon. 2015. Trivial Compiler Equivalence: A Large Scale Empirical Study of a Simple, Fast and Effective Equivalent Mutant Detection Technique. In2015 IEEE/ACM 37th IEEE International Conference on Software Engineering, Vol. 1. 936–946. doi:10.1109/ICSE.2015.103

work page doi:10.1109/icse.2015.103 2015

[51] [52]

Sánchez, Pedro Delgado-Pérez, Inmaculada Medina-Bulo, and Sergio Segura

Ana B. Sánchez, Pedro Delgado-Pérez, Inmaculada Medina-Bulo, and Sergio Segura. 2022. Mutation testing in the wild: findings from GitHub. Empirical Softw. Engg.27, 6 (Nov. 2022), 35 pages. doi:10.1007/s10664- 022-10177-8

work page doi:10.1007/s10664- 2022

[52] [53]

Sánchez, José A

Ana B. Sánchez, José A. Parejo, Sergio Segura, Amador Durán, and Mike Papadakis. 2024. Mutation Testing in Practice: Insights From Open- Source Software Developers.IEEE Transactions on Software Engineering 50, 5 (2024), 1130–1143. doi:10.1109/TSE.2024.3377378

work page doi:10.1109/tse.2024.3377378 2024

[53] [54]

Roy Patrick Tan and Stephen H. Edwards. 2004. Experiences evaluating the effectiveness of JML-JUnit testing.SIGSOFT Softw. Eng. Notes29, 5 (Sept. 2004), 1–4. doi:10.1145/1022494.1022545

work page doi:10.1145/1022494.1022545 2004

[54] [55]

Michele Tufano, Cody Watson, Gabriele Bavota, Massimiliano Di Penta, Martin White, and Denys Poshyvanyk. 2019. Learning How to Mutate Source Code from Bug-Fixes . In2019 IEEE International Conference on Software Maintenance and Evolution (ICSME). IEEE Computer Society, Los Alamitos, CA, USA, 301–312. doi:10.1109/ICSME.2019.00046

work page doi:10.1109/icsme.2019.00046 2019

[55] [56]

Kaiyuan Wang, Allison Sullivan, and Sarfraz Khurshid. 2018. MuAlloy: a mutation testing framework for alloy. InProceedings of the 40th Inter- national Conference on Software Engineering: Companion Proceedings (Gothenburg, Sweden)(ICSE ’18). Association for Computing Machin- ery, New York, NY, USA, 29–32. doi:10.1145/3183440.3183488

work page doi:10.1145/3183440.3183488 2018

[56] [57]

2012.Experimentation in software engi- neering

Claes Wohlin, Per Runeson, Martin Höst, Magnus C Ohlsson, Björn Regnell, Anders Wesslén, et al. 2012.Experimentation in software engi- neering. Vol. 236. Springer

work page 2012

[57] [58]

Woodward

M.R. Woodward. 1993. Mutation testing—its origin and evolution. Information and Software Technology35, 3 (1993), 163–169. doi:10.101 6/0950-5849(93)90053-6

work page 1993

[58] [59]

Xiangjuan Yao, Mark Harman, and Yue Jia. 2014. A study of equivalent and stubborn mutation operators using human analysis of equivalence. InProceedings of the 36th International Conference on Software Engi- neering(Hyderabad, India)(ICSE 2014). Association for Computing Machinery, New York, NY, USA, 919–930. doi:10.1145/2568225.2568265

work page doi:10.1145/2568225.2568265 2014

[59] [60]

R.K. Yin. 2009.Case Study Research: Design and Methods. SAGE Publi- cations

work page 2009