pith. sign in

arxiv: 2511.15403 · v2 · submitted 2025-11-19 · 💻 cs.SE · cs.PL

MutDafny: A Mutation-Based Approach to Assess Dafny Specifications

Pith reviewed 2026-05-17 21:02 UTC · model grok-4.3

classification 💻 cs.SE cs.PL
keywords mutation testingformal specificationsDafnyprogram verificationsoftware qualityspecification weaknessesfault injection
0
0 comments X

The pith

MutDafny applies mutations to Dafny programs to expose weaknesses in their formal specifications.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces a mutation testing method for Dafny that inserts faults into verified code and checks whether the existing specifications still hold. When a mutant verifies successfully, the specification has failed to rule out that particular deviation from intended behavior. The authors assembled 40 operators by adapting ones from other tools and creating new ones drawn from bug-fix patterns in public Dafny repositories. They ran the approach on 794 real-world programs and manually inspected undetected mutants, locating five weak specifications that averaged one per 241 lines of code. A sympathetic reader would see this as a practical way to make formally verified software more trustworthy by catching under-specified contracts before deployment.

Core claim

The central claim is that mutation testing, when the oracle is supplied by formal specifications rather than test cases, can systematically surface weaknesses in Dafny specifications: if a program variant still verifies after a mutation, the specification does not constrain the behavior tightly enough to reject that change.

What carries the argument

MutDafny, a tool that applies a set of 40 mutation operators to Dafny source and re-runs the verifier to detect cases where the specification accepts the mutated program.

If this is right

  • Developers can use undetected mutants as concrete prompts to strengthen specifications in their Dafny code.
  • The rate of one weak specification per 241 lines provides a baseline for how often real-world Dafny contracts need review.
  • The same mutation set can be applied repeatedly during development to catch specification drift as code evolves.
  • Five concrete weak specifications were identified that would have allowed incorrect behavior to go undetected.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The technique could be ported to other verification-aware languages that already have mature verifiers, such as those supporting contracts or dependent types.
  • Automated reporting of undetected mutants might be integrated into continuous verification pipelines to flag specification gaps early.
  • Patterns among the undetected mutants could reveal common specification oversights, such as missing invariants on data structures, that recur across projects.

Load-bearing premise

Verification succeeding on a mutant reliably signals a weakness in the specification rather than the mutation being semantically invalid or the verifier being incomplete.

What would settle it

A manual review of all undetected mutants from the 794-program corpus that finds every one either invalidates the program semantics or is explainable by known verifier incompleteness, with no genuine specification weaknesses remaining.

Figures

Figures reproduced from arXiv: 2511.15403 by Alexandra Mendes, Isabel Amaral, Jos\'e Campos.

Figure 1
Figure 1. Figure 1: Mutation testing pipeline with MutDafny. MutDafny’s components are depicted by the filled boxes, the rest corresponding to Dafny. [PITH_FULL_IMAGE:figures/full_fig_p009_1.png] view at source ↗
Figure 2
Figure 2. Figure 2: Mutation score distribution. we record the total runtime of each scan/mutation process and the time spent in each of Dafny’s workflow stages and executing the plugin. This data allows us to compute the mutant generation, mutation analysis, and total runtime. Additionally, we compute the mutation score [26], i.e., the ratio of killed mutants to the total number of mutants, ex￾cluding invalid mutants, e.g., … view at source ↗
Figure 3
Figure 3. Figure 3: Plugin, mutant generation, mutation analysis, and total [PITH_FULL_IMAGE:figures/full_fig_p011_3.png] view at source ↗
read the original abstract

In verification-aware languages, such as Dafny, despite their critical role, specifications are as prone to error as implementations. Flaws in specifications can result in formally verified programs that deviate from the intended behavior. In this paper, we explore the use of mutation testing to reveal weaknesses in formal specifications written in Dafny. We present MutDafny, a tool that increases the reliability of Dafny specifications by automatically signaling potential weaknesses. Using a mutation testing approach, we introduce faults (mutations) into the code and rely on formal specifications for detecting them. If a program with a mutant verifies, this may indicate a weakness in the specification. We extensively analyze mutation operators from popular tools, identifying the ones applicable to Dafny. In addition, we synthesize new operators tailored for the language from bugfix commits in publicly available Dafny projects on GitHub. Drawing from both, we equipped our tool with a total of 40 mutation operators. We evaluate MutDafny's effectiveness and efficiency on a dataset of 794 real-world Dafny programs, and manually analyze a subset of the resulting undetected mutants, identifying five weak real-world specifications (on average, one at every 241 lines of code) that would benefit from strengthening.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper presents MutDafny, a mutation-testing tool to assess the strength of Dafny specifications. It selects and synthesizes 40 mutation operators (drawing from existing tools and GitHub bug-fix commits), applies the approach to 794 real-world Dafny programs, and reports that manual review of undetected mutants surfaced five weak specifications (one per 241 LOC on average).

Significance. If the central interpretation holds, the work supplies a concrete, scalable method for exposing specification weaknesses in verified Dafny code—an important practical contribution because specifications are themselves error-prone. The evaluation on a sizable external corpus of real programs and the explicit identification of five weak specifications constitute tangible evidence of utility.

major comments (2)
  1. [Evaluation / manual analysis of undetected mutants] Evaluation (undetected-mutant analysis): the headline result of five weak specifications rests on the interpretation that a mutant that still verifies signals a specification weakness. The manuscript must explicitly describe, for each of the five cases, how semantic equivalence of the mutant was ruled out and how potential incompleteness of the Dafny verifier was addressed; without this, the count cannot be taken as secure evidence of specification weakness rather than an artifact of mutation validity or verifier limits.
  2. [Abstract and Evaluation section] Abstract and §4 (evaluation setup): the effectiveness claim is only moderately supported because the abstract and main evaluation text omit the total number of mutants generated, the precise mutant-selection criteria, and any statistical measures (e.g., mutation score distribution or detection-rate confidence intervals). These quantities are load-bearing for the claim that the tool reliably surfaces weak specifications.
minor comments (2)
  1. [Mutation operator selection] The description of how the 40 operators were chosen and synthesized from GitHub commits could be expanded with a short table or enumeration to improve reproducibility.
  2. [Throughout] Minor typographical inconsistencies appear in the operator list and in the reported average (241 LOC); a consistency pass would help.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments. We address each major point below and describe the revisions we will incorporate.

read point-by-point responses
  1. Referee: [Evaluation / manual analysis of undetected mutants] Evaluation (undetected-mutant analysis): the headline result of five weak specifications rests on the interpretation that a mutant that still verifies signals a specification weakness. The manuscript must explicitly describe, for each of the five cases, how semantic equivalence of the mutant was ruled out and how potential incompleteness of the Dafny verifier was addressed; without this, the count cannot be taken as secure evidence of specification weakness rather than an artifact of mutation validity or verifier limits.

    Authors: We agree that the current description of the manual analysis is insufficient to fully support the claim. In the revised manuscript we will add a new subsection under the evaluation that details, for each of the five cases, the concrete steps used to establish that the mutant was not semantically equivalent (e.g., by exhibiting a concrete input that produces different observable behavior or by showing that the mutant violates an auxiliary assertion that the original satisfies) and the measures taken to address possible verifier incompleteness (e.g., running with all verification options enabled, increasing resource limits, and cross-checking a subset of cases with an alternative solver configuration). revision: yes

  2. Referee: [Abstract and Evaluation section] Abstract and §4 (evaluation setup): the effectiveness claim is only moderately supported because the abstract and main evaluation text omit the total number of mutants generated, the precise mutant-selection criteria, and any statistical measures (e.g., mutation score distribution or detection-rate confidence intervals). These quantities are load-bearing for the claim that the tool reliably surfaces weak specifications.

    Authors: We accept that these quantitative details should be stated explicitly. We will revise the abstract to report the total number of mutants generated and the criteria used to select and synthesize the 40 operators. In Section 4 we will add the distribution of per-program mutation scores together with any available statistical summaries (including detection-rate confidence intervals where the sample size permits). These additions will be placed in the evaluation setup and results subsections. revision: yes

Circularity Check

0 steps flagged

No circularity: results grounded in external data and manual analysis

full rationale

The paper's core claims derive from applying 40 mutation operators (sourced from external popular tools plus synthesis from independent GitHub Dafny bugfix commits) to a dataset of 794 real-world programs, followed by manual inspection of undetected mutants. No equations, fitted parameters, self-definitional constructs, or load-bearing self-citations appear in the derivation. The count of five weak specifications is an empirical observation from external programs rather than a quantity that reduces to internal definitions or prior author results by construction. This matches the default expectation for an empirical tool-evaluation paper with independent benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The work rests on the standard premise of mutation testing without introducing fitted numerical parameters or new postulated entities; the only notable assumption is the interpretive link between undetected mutants and specification weakness.

axioms (1)
  • domain assumption If a mutated program still verifies under the original specification, then the specification is weak or incomplete for the intended behavior.
    This interpretive step is invoked when the paper concludes that undetected mutants indicate specifications that would benefit from strengthening.

pith-pipeline@v0.9.0 · 5518 in / 1409 out tokens · 74081 ms · 2026-05-17T21:02:45.717670+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

59 extracted references · 59 canonical work pages · 1 internal anchor

  1. [1]

    https://github.com/avito-tech/go-mutesting Last accessed Nov

    2016.Go-mutesting (fork). https://github.com/avito-tech/go-mutesting Last accessed Nov. 2025

  2. [2]

    https://github.com/zimmski/go- mutesting Last accessed Nov

    2016.Go-mutesting (original). https://github.com/zimmski/go- mutesting Last accessed Nov. 2025

  3. [3]

    https://mutation-testing.org Last accessed Nov

    2011.Major. https://mutation-testing.org Last accessed Nov. 2025

  4. [4]

    https://github.com/jeffoffutt/muJava Last accessed Nov

    2005.MuJava. https://github.com/jeffoffutt/muJava Last accessed Nov. 2025

  5. [5]

    https://github.com/mull-project/mull Last accessed Nov

    2018.Mull. https://github.com/mull-project/mull Last accessed Nov. 2025

  6. [6]

    https://github.com/boxed/mutmut Last accessed Nov

    2018.Mutmut. https://github.com/boxed/mutmut Last accessed Nov. 2025

  7. [7]

    http://pitest.org Last accessed Nov

    2016.PIT. http://pitest.org Last accessed Nov. 2025

  8. [8]

    https://stryker-mutator.io/docs/stryker-js/introduct ion/ Last accessed Nov

    2017.StrykerJS. https://stryker-mutator.io/docs/stryker-js/introduct ion/ Last accessed Nov. 2025

  9. [9]

    https://stryker-mutator.io/docs/stryker-net/introd uction/ Last accessed Nov

    2021.Stryker.NET. https://stryker-mutator.io/docs/stryker-net/introd uction/ Last accessed Nov. 2025

  10. [10]

    Tailored Mutants Fit Bugs Better

    Miltiadis Allamanis, Earl T. Barr, René Just, and Charles Sutton. 2016. Tailored Mutants Fit Bugs Better. arXiv:1611.02516 [cs.SE] https: //arxiv.org/abs/1611.02516

  11. [11]

    Lennart Beringer, Adam Petcher, Q Ye Katherine, and Andrew W Appel

  12. [12]

    InUSENIX Security Symposium

    Verified Correctness and Security of OpenSSL HMAC.. InUSENIX Security Symposium. 207–221

  13. [13]

    David Bingham Brown, Michael Vaughn, Ben Liblit, and Thomas Reps

  14. [14]

    InProceedings of the 2017 11th Joint Meeting on Foundations of Software Engineering (Paderborn, Germany)(ESEC/FSE 2017)

    The care and feeding of wild-caught mutants. InProceedings of the 2017 11th Joint Meeting on Foundations of Software Engineering (Paderborn, Germany)(ESEC/FSE 2017). Association for Computing Machinery, New York, NY, USA, 511–522. doi:10.1145/3106237.3106280

  15. [15]

    Budd and Dana Angluin

    Timothy A. Budd and Dana Angluin. 1982. Two notions of correctness and their relation to testing.Acta Inf.18, 1 (March 1982), 31–45. doi:10 .1007/BF00625279

  16. [16]

    Franck Cassez. 2021. Verification of the incremental Merkle tree al- gorithm with Dafny. InInternational Symposium on Formal Methods. Springer, 445–462

  17. [17]

    Aleks Chakarov, Jaco Geldenhuys, Matthew Heck, Michael Hicks, Sam Huang, Georges-Axel Jaloyan, Anjali Joshi, K Rustan M Leino, Mikael Mayer, Sean McLaughlin, et al. 2025. Formally Verified Cloud-Scale Au- thorization. In2025 IEEE/ACM 47th International Conference on Software Engineering (ICSE). IEEE Computer Society, 703–703

  18. [18]

    Yoonsik Cheon and Gary T. Leavens. 2002. A Simple and Practical Approach to Unit Testing: The JML and JUnit Way. InECOOP 2002 — Object-Oriented Programming, Boris Magnusson (Ed.). Springer Berlin Heidelberg, Berlin, Heidelberg, 231–255

  19. [19]

    Henry Coles, Thomas Laurent, Christopher Henard, Mike Papadakis, and Anthony Ventresque. 2016. PIT: a practical mutation testing tool for Java (demo). InProceedings of the 25th International Symposium on Software Testing and Analysis(Saarbrücken, Germany)(ISSTA 2016). Association for Computing Machinery, New York, NY, USA, 449–452. doi:10.1145/2931037.2948707

  20. [20]

    Alex Denisov and Stanislav Pankevich. 2018. Mull It Over: Mutation Testing Based on LLVM. In2018 IEEE International Conference on Soft- ware Testing, Verification and Validation Workshops (ICSTW). 25–31. doi:10.1109/ICSTW.2018.00024

  21. [21]

    Anna Derezińska and Konrad Hałas. 2014. Analysis of Mutation Op- erators for the Python Language. InProceedings of the Ninth Inter- national Conference on Dependability and Complex Systems DepCoS- RELCOMEX. June 30 – July 4, 2014, Brunów, Poland, Wojciech Zamojski, Jacek Mazurkiewicz, Jarosław Sugier, Tomasz Walkowiak, and Janusz Kacprzyk (Eds.). Springer...

  22. [22]

    Madeline Endres, Sarah Fakhoury, Saikat Chakraborty, and Shuvendu K. Lahiri. 2024. Can Large Language Models Transform Natural Language Intent into Formal Method Postconditions?Proc. ACM Softw. Eng.1, FSE, Article 84 (July 2024), 24 pages. doi:10.1145/3660791

  23. [23]

    Pedro Fonseca, Kaiyuan Zhang, Xi Wang, and Arvind Krishnamurthy

  24. [24]

    InProceedings of the Twelfth European Conference on Computer Systems(Belgrade, Serbia)(EuroSys ’17)

    An Empirical Study on the Correctness of Formally Verified Distributed Systems. InProceedings of the Twelfth European Conference on Computer Systems(Belgrade, Serbia)(EuroSys ’17). Association for Computing Machinery, New York, NY, USA, 328–343. doi:10.1145/30 64176.3064183

  25. [25]

    Eli Goldweber, Weixin Yu, Seyed Armin Vakil Ghahani, and Manos Kapritsos. 2024. IronSpec: Increasing the Reliability of Formal Specifica- tions. In18th USENIX Symposium on Operating Systems Design and Im- plementation (OSDI 24). USENIX Association, Santa Clara, CA, 875–891. https://www.usenix.org/conference/osdi24/presentation/goldweber

  26. [26]

    Zhang, Marinos Kintis, and Mike Papadakis

    Rahul Gopinath, Jie M. Zhang, Marinos Kintis, and Mike Papadakis

  27. [27]

    doi:10.1002/stvr.1831 arXiv:https://onlinelibrary.wiley.com/doi/pdf/10.1002/stvr.1831

    Mutation analysis and its industrial applications.Software Testing, Verification and Reliability32, 8 (2022), e1831. doi:10.1002/stvr.1831 arXiv:https://onlinelibrary.wiley.com/doi/pdf/10.1002/stvr.1831

  28. [28]

    Mahdi Houshmand and Samad Paydar. 2017. TCE+: An Extension of the TCE Method for Detecting Equivalent Mutants in Java Programs. InFundamentals of Software Engineering, Mehdi Dastani and Marjan Sirjani (Eds.). Springer International Publishing, Cham, 164–179

  29. [30]

    Yue Jia and Mark Harman. 2011. An Analysis and Survey of the Devel- opment of Mutation Testing.IEEE Transactions on Software Engineering 37, 5 (2011), 649–678. doi:10.1109/TSE.2010.62

  30. [31]

    René Just. 2014. The major mutation framework: efficient and scalable mutation analysis for Java. InProceedings of the 2014 International Symposium on Software Testing and Analysis(San Jose, CA, USA)(ISSTA 2014). Association for Computing Machinery, New York, NY, USA, 433–436. doi:10.1145/2610384.2628053

  31. [32]

    Kapfhammer

    René Just, Franz Schweiggert, and Gregory M. Kapfhammer. 2011. MA- JOR: An efficient and extensible tool for mutation analysis in a Java com- piler. In2011 26th IEEE/ACM International Conference on Automated Soft- ware Engineering (ASE 2011). 612–615. doi:10.1109/ASE.2011.6100138

  32. [33]

    Marinos Kintis, Mike Papadakis, Yue Jia, Nicos Malevris, Yves Le Traon, and Mark Harman. 2018. Detecting Trivial Mutant Equivalences via Compiler Optimisations.IEEE Transactions on Software Engineering44, 4 (2018), 308–333. doi:10.1109/TSE.2017.2684805

  33. [34]

    Gerwin Klein, Kevin Elphinstone, Gernot Heiser, June Andronick, David Cock, Philip Derrin, Dhammika Elkaduwe, Kai Engelhardt, Rafal Kolan- ski, Michael Norrish, et al . 2009. seL4: Formal verification of an OS kernel. InProceedings of the ACM SIGOPS 22nd symposium on Operating systems principles. ACM, 207–220

  34. [35]

    Kaufman, Ryan Featherman, Hannah Potter, Ardi Madadi, and René Just

    Benjamin Kushigian, Samuel J. Kaufman, Ryan Featherman, Hannah Potter, Ardi Madadi, and René Just. 2024. Equivalent Mutants in the Wild: Identifying and Efficiently Suppressing Equivalent Mutants for Java Programs. InProceedings of the 33rd ACM SIGSOFT International Symposium on Software Testing and Analysis(Vienna, Austria)(ISSTA 2024). Association for C...

  35. [36]

    Benjamin Kushigian, Amit Rawat, and René Just. 2019. Medusa: Mu- tant Equivalence Detection Using Satisfiability Analysis. In2019 IEEE International Conference on Software Testing, Verification and Validation Workshops (ICSTW). 77–82. doi:10.1109/ICSTW.2019.00035

  36. [37]

    Shuvendu K. Lahiri. 2024. Evaluating LLM-driven User-Intent Formal- ization for Verification-Aware Languages. doi:10.34727/2024/ISBN.978- 3-85448-065-5_19

  37. [38]

    Le Traon, B

    Y. Le Traon, B. Baudry, and J.-M. Jezequel. 2006. Design by Contract to Improve Software Vigilance.IEEE Transactions on Software Engineering 32, 8 (2006), 571–586. doi:10.1109/TSE.2006.79

  38. [39]

    Rustan M

    K. Rustan M. Leino. 2010. Dafny: An Automatic Program Verifier for Functional Correctness. InLogic for Programming, Artificial Intelligence, and Reasoning, Edmund M. Clarke and Andrei Voronkov (Eds.). Springer Berlin Heidelberg, Berlin, Heidelberg, 348–370

  39. [40]

    Xavier Leroy. 2009. A formally verified compiler back-end.Journal of Automated Reasoning43, 4 (2009), 363

  40. [41]

    Chloe Loughridge, Qinyi Sun, Seth Ahrenbach, Federico Cassano, Chuyue Sun, Ying Sheng, Anish Mudide, Md Rakib Hossain Misu, Nada Amin, and Max Tegmark. 2024. DafnyBench: A Benchmark for Formal Software Verification. (2024). arXiv:2406.08467 [cs.SE] https://arxiv.org/abs/2406.08467

  41. [42]

    Yu-seung Ma and Jeff Offutt. 2014. Description of Class Mutation Mutation Operators for Java. (2014). https://cs.gmu.edu/~offutt/mujav a/mutopsClass.pdf

  42. [43]

    Update 2016

    Yu-seung Ma and Jeff Offutt. Update 2016. Description of muJava’s Method-level Mutation Operators. (Update 2016). https://cs.gmu.edu /~offutt/mujava/mutopsMethod.pdf

  43. [44]

    Yu-Seung Ma, Jeff Offutt, and Yong Kwon. 2005. MuJava: An automated class mutation system.Softw. Test., Verif. Reliab.15 (06 2005), 97–133. doi:10.1002/stvr.308 MutDafny: A Mutation-Based Approach to Assess Dafny Specifications

  44. [45]

    Yu-Seung Ma, Jeff Offutt, and Yong-Rae Kwon. 2006. MuJava: a mutation system for java. InProceedings of the 28th International Conference on Software Engineering(Shanghai, China)(ICSE ’06). Association for Computing Machinery, New York, NY, USA, 827–830. doi:10.1145/11 34285.1134425

  45. [46]

    Dor Ma’ayan, Shahar Maoz, and Roey Rozi. 2022. Validating the correct- ness of reactive systems specifications through systematic exploration. InProceedings of the 25th International Conference on Model Driven Engineering Languages and Systems(Montreal, Quebec, Canada)(MOD- ELS ’22). Association for Computing Machinery, New York, NY, USA, 132–142. doi:10....

  46. [47]

    Lech Madeyski, Wojciech Orzeszyna, Richard Torkar, and Mariusz Jozala. 2014. Overcoming the Equivalent Mutant Problem: A Systematic Literature Review and a Comparative Experiment of Second Order Mutation.IEEE Trans. Softw. Eng.40, 1 (Jan. 2014), 23–42. doi:10.1109/ TSE.2013.44

  47. [48]

    Wright, Colton J

    Phil McMinn, Chris J. Wright, Colton J. McCurdy, and Gregory M. Kapfhammer. 2019. Automatic Detection and Removal of Ineffective Mutants for the Mutation Analysis of Relational Database Schemas. IEEE Transactions on Software Engineering45, 5 (2019), 427–463. doi:10 .1109/TSE.2017.2786286

  48. [49]

    Lopes, Iris Ma, and James No- ble

    Md Rakib Hossain Misu, Cristina V. Lopes, Iris Ma, and James No- ble. 2024. Towards AI-Assisted Synthesis of Verified Dafny Methods. arXiv:2402.00247 [cs.AI] https://arxiv.org/abs/2402.00247

  49. [50]

    Myers, Corey Sandler, and Tom Badgett

    Glenford J. Myers, Corey Sandler, and Tom Badgett. 2011.The Art of Software Testing(3rd ed.). Wiley Publishing

  50. [51]

    Mike Papadakis, Yue Jia, Mark Harman, and Yves Le Traon. 2015. Trivial Compiler Equivalence: A Large Scale Empirical Study of a Simple, Fast and Effective Equivalent Mutant Detection Technique. In2015 IEEE/ACM 37th IEEE International Conference on Software Engineering, Vol. 1. 936–946. doi:10.1109/ICSE.2015.103

  51. [52]

    Sánchez, Pedro Delgado-Pérez, Inmaculada Medina-Bulo, and Sergio Segura

    Ana B. Sánchez, Pedro Delgado-Pérez, Inmaculada Medina-Bulo, and Sergio Segura. 2022. Mutation testing in the wild: findings from GitHub. Empirical Softw. Engg.27, 6 (Nov. 2022), 35 pages. doi:10.1007/s10664- 022-10177-8

  52. [53]

    Sánchez, José A

    Ana B. Sánchez, José A. Parejo, Sergio Segura, Amador Durán, and Mike Papadakis. 2024. Mutation Testing in Practice: Insights From Open- Source Software Developers.IEEE Transactions on Software Engineering 50, 5 (2024), 1130–1143. doi:10.1109/TSE.2024.3377378

  53. [54]

    Roy Patrick Tan and Stephen H. Edwards. 2004. Experiences evaluating the effectiveness of JML-JUnit testing.SIGSOFT Softw. Eng. Notes29, 5 (Sept. 2004), 1–4. doi:10.1145/1022494.1022545

  54. [55]

    Michele Tufano, Cody Watson, Gabriele Bavota, Massimiliano Di Penta, Martin White, and Denys Poshyvanyk. 2019. Learning How to Mutate Source Code from Bug-Fixes . In2019 IEEE International Conference on Software Maintenance and Evolution (ICSME). IEEE Computer Society, Los Alamitos, CA, USA, 301–312. doi:10.1109/ICSME.2019.00046

  55. [56]

    Kaiyuan Wang, Allison Sullivan, and Sarfraz Khurshid. 2018. MuAlloy: a mutation testing framework for alloy. InProceedings of the 40th Inter- national Conference on Software Engineering: Companion Proceedings (Gothenburg, Sweden)(ICSE ’18). Association for Computing Machin- ery, New York, NY, USA, 29–32. doi:10.1145/3183440.3183488

  56. [57]

    2012.Experimentation in software engi- neering

    Claes Wohlin, Per Runeson, Martin Höst, Magnus C Ohlsson, Björn Regnell, Anders Wesslén, et al. 2012.Experimentation in software engi- neering. Vol. 236. Springer

  57. [58]

    Woodward

    M.R. Woodward. 1993. Mutation testing—its origin and evolution. Information and Software Technology35, 3 (1993), 163–169. doi:10.101 6/0950-5849(93)90053-6

  58. [59]

    Xiangjuan Yao, Mark Harman, and Yue Jia. 2014. A study of equivalent and stubborn mutation operators using human analysis of equivalence. InProceedings of the 36th International Conference on Software Engi- neering(Hyderabad, India)(ICSE 2014). Association for Computing Machinery, New York, NY, USA, 919–930. doi:10.1145/2568225.2568265

  59. [60]

    R.K. Yin. 2009.Case Study Research: Design and Methods. SAGE Publi- cations