Recognition: unknown
Crash-free Deductive Verifiers
Pith reviewed 2026-05-10 02:27 UTC · model grok-4.3
The pith
Fuzz testing can uncover issues and improve the reliability of deductive verifiers.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Deductive verifiers need complementary means beyond their own verification to guarantee they run reliably out of the box, produce meaningful error messages, and return correct results. Fuzz testing supplies such a means by generating random inputs to surface hidden problems, as implemented and evaluated in the AValAnCHE prototype for VerCors and shown to work on additional deductive verifiers.
What carries the argument
The AValAnCHE fuzzing prototype that generates random inputs for deductive verifiers like VerCors to detect crashes and incorrect behaviors.
If this is right
- Deductive verifiers become more dependable for non-expert users by surviving random inputs without crashing.
- Previously unknown issues in tools such as VerCors can be found automatically through fuzzing.
- The same fuzzing technique applies successfully to other deductive verifiers.
- Robustness can be increased without attempting full formal verification of the verifier software.
Where Pith is reading between the lines
- Fuzzing could become a routine part of the development workflow for formal verification tools.
- Early and repeated fuzzing might prevent many user-visible bugs from ever reaching release.
- The method could be combined with existing test suites to cover a wider set of failure modes.
Load-bearing premise
That the defects exposed by random fuzz inputs are representative of problems real users would encounter and that fixing them improves usability without introducing new problems.
What would settle it
A follow-up study in which real users of a fuzzed-and-fixed verifier still report crashes or wrong results that the fuzzer never triggered.
Figures
read the original abstract
As deductive verifiers mature, their potential user base is growing from the initial core developers to other users. To convince external users of the suitability of verifiers, these tools must run reliably out of the box, give meaningful error messages and display correct results. Yet deductive verifiers are large and complex software systems and their own full verification is often out of reach. We therefore need complementary means to provide such guarantees. This paper advocates the use of fuzzing as a practical way to improve the quality and robustness of deductive verifiers. We outline how fuzz testing can be applied to deductive verifiers, and demonstrate the idea with the prototype tool AValAnCHE, which is integrated with the VerCors verifier. We report on our experiments in which AValAnCHE uncovered several issues in VerCors and demonstrate that the approach also works for other deductive verifiers
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims that deductive verifiers, as they mature and attract broader users, require robustness guarantees beyond full self-verification, which is often infeasible; it advocates fuzzing as a practical complementary technique, outlines its application, and demonstrates it via the AValAnCHE prototype integrated with VerCors, which uncovered several issues, with the approach also working for other verifiers.
Significance. If the fuzz-generated inputs and discovered issues prove representative of real-user verification tasks, the work offers a pragmatic, low-overhead method to improve verifier reliability and usability, supporting wider adoption. The empirical prototype and cross-verifier demonstration provide concrete evidence of feasibility.
major comments (2)
- [Experiments] Experiments section: the abstract and description report that AValAnCHE uncovered issues but provide no details on fuzzing strategy (e.g., grammar or mutation rules for annotated programs), coverage metrics, input volume, or independent confirmation that the crashes are not spurious. This directly limits support for the claim that fuzzing yields practically relevant robustness gains.
- [§3] §3 (AValAnCHE integration): the generated inputs are described as random but it is not shown that they exercise the same parser, type-checker, and proof-obligation paths as typical user programs containing JML/ACSL annotations; without this, the fixed issues may not translate to improved experience for the intended user base.
minor comments (2)
- [Abstract] The abstract could quantify the issues found (number, severity, categories) rather than stating 'several issues'.
- Clarify the scope of 'other deductive verifiers' tested and whether the same fuzzing harness was reused without modification.
Simulated Author's Rebuttal
We thank the referee for the constructive comments and the positive assessment of the significance of our work. We will revise the manuscript to address the concerns raised about the level of detail in the experiments and the representativeness of the generated inputs.
read point-by-point responses
-
Referee: [Experiments] Experiments section: the abstract and description report that AValAnCHE uncovered issues but provide no details on fuzzing strategy (e.g., grammar or mutation rules for annotated programs), coverage metrics, input volume, or independent confirmation that the crashes are not spurious. This directly limits support for the claim that fuzzing yields practically relevant robustness gains.
Authors: The current manuscript indeed provides only high-level information on the experiments, focusing on the fact that issues were uncovered rather than detailed methodology. This is a valid point, and we will revise the Experiments section to include specifics on the fuzzing strategy, including the grammar and mutation rules used for generating annotated programs, the volume of inputs tested, and coverage metrics where available. We will also describe how we confirmed the crashes were not spurious by reproducing them and communicating with the tool developers. These additions will better support the claims of practical relevance. revision: yes
-
Referee: [§3] §3 (AValAnCHE integration): the generated inputs are described as random but it is not shown that they exercise the same parser, type-checker, and proof-obligation paths as typical user programs containing JML/ACSL annotations; without this, the fixed issues may not translate to improved experience for the intended user base.
Authors: We recognize the importance of demonstrating that the fuzzed inputs are representative of real user scenarios. While the generation is random within the constraints of valid JML/ACSL syntax, the manuscript does not explicitly show path coverage equivalence. In the revised version, we will enhance §3 to explain the design of the input generator, provide examples of generated programs that align with common user annotation patterns, and discuss how the approach exercises relevant parser and type-checker components. For proof obligations, we will clarify that the robustness issues found relate to handling of valid inputs that reach those stages. revision: yes
Circularity Check
No circularity: empirical advocacy and prototype demonstration
full rationale
The paper is an empirical software-engineering study that advocates fuzzing as a complementary technique for improving deductive verifiers, outlines its application, presents the external prototype AValAnCHE integrated with VerCors, and reports experimental results on discovered issues. No equations, derivations, fitted parameters, or self-referential definitions appear in the provided text. The central claims rest on the existence and behavior of the introduced prototype and the reported runs rather than on any chain that reduces by construction to the paper's own inputs or prior self-citations. The work is therefore self-contained against external benchmarks.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption Deductive verifiers are large and complex software systems and their own full verification is often out of reach.
invented entities (1)
-
AValAnCHE
no independent evidence
Reference graph
Works this paper leans on
-
[1]
In: Gurfinkel, A., Ganesh, V
Armborst, L., Bos, P., van den Haak, L.B., Huisman, M., Rubbens, R., Şakar, Ö., Tasche, P.: The VerCors Verifier: A Progress Report. In: Gurfinkel, A., Ganesh, V. (eds.) Computer Aided Verification. pp. 3–18. Springer Nature Switzerland (2024).https://doi.org/10.1007/ 978-3-031-65630-9_1
2024
-
[2]
Bliudze, S., van den Bos, P., Huisman, M., Rubbens, R., Safina, L.: Jav- aBIP meets VerCors: Towards the safety of concurrent software systems in Java. In: Lambers, L., Uchitel, S. (eds.) Proceedings of the 26th Inter- national Conference on Fundamental Approaches to Software Engineering FASE,LectureNotesinComputerScience,vol.13991,pp.143–150.Springer (20...
-
[3]
CodeIntelligence: Jazzer: Coverage-guided, in-process fuzzing for the JVM, https://github.com/CodeIntelligenceTesting/jazzer
-
[4]
The github recent bugs dataset for evaluating llm-based debugging applications,
Donaldson, A.F., Sheth, D., Tristan, J.B., Usher, A.: Randomised Testing of the Compiler for a Verification-Aware Programming Language. In: 2024 IEEE Conference on Software Testing, Verification and Validation (ICST). pp. 407–418. IEEE (2024).https://doi.org/10.1109/ICST60714.2024. 00044
-
[5]
Logical Methods in Computer ScienceVolume 11, Issue 1, 998 (2015).https://doi.org/10.2168/ LMCS-11(1:2)2015
Haack, C., Huisman, M., Hurlin, C., Amighi, A.: Permission-Based Separa- tion Logic for Multithreaded Java Programs. Logical Methods in Computer ScienceVolume 11, Issue 1, 998 (2015).https://doi.org/10.2168/ LMCS-11(1:2)2015
2015
-
[6]
Hatch, W., Darragh, P., Porncharoenwase, S., Watson, G., Eide, E.: Gen- erating Conforming Programs with Xsmith. In: Proceedings of the 22nd ACM SIGPLAN International Conference on Generative Programming: Concepts and Experiences. pp. 86–99. ACM (2023).https://doi.org/10. 1145/3624007.3624056
-
[7]
Hodován, R., Kiss, A., Gyimóthy, T.: Grammarinator: A grammar-based open source fuzzer. In: Proceedings of the 9th ACM SIGSOFT International Workshop on Automating TEST Case Design, Selection, and Evaluation. pp. 45–48. ACM (2018).https://doi.org/10.1145/3278186.3278193
-
[8]
Irfan, A., Porncharoenwase, S., Rakamarić, Z., Rungta, N., Torlak, E.: Test- ing Dafny (experience paper). In: Proceedings of the 31st ACM SIGSOFT International Symposium on Software Testing and Analysis. pp. 556–567. ACM (2022).https://doi.org/10.1145/3533767.3534382
-
[9]
In: Bobaru, M.G., Havelund, K., Holzmann, G.J., Joshi, R
Jacobs, B., Smans, J., Philippaerts, P., Vogels, F., Penninckx, W., Piessens, F.: Verifast: A powerful, sound, predictable, fast verifier for C and java. In: Bobaru, M.G., Havelund, K., Holzmann, G.J., Joshi, R. (eds.) Pro- ceedings of the Third International Symposium of NASA Formal Meth- ods NFM. pp. 41–55. LNCS, Springer (2011).https://doi.org/10.1007/...
2011
-
[10]
In: Lie, D., Mannan, M., Backes, M., Wang, X
Klees, G., Ruef, A., Cooper, B., Wei, S., Hicks, M.: Evaluating fuzz test- ing. In: Lie, D., Mannan, M., Backes, M., Wang, X. (eds.) Proceedings of the 2018 ACM SIGSAC Conference on Computer and Communications Security, CCS. pp. 2123–2138. ACM (2018).https://doi.org/10.1145/ 3243734.3243804
-
[11]
Leino, K.R.M.: Dafny: An automatic program verifier for functional correct- ness. In: Clarke, E.M., Voronkov, A. (eds.) 16th International Conference on Logic for Programming, Artificial Intelligence, and Reasoning LPAR, Revised Selected Papers. pp. 348–370. Lecture Notes in Computer Science, Springer (2010).https://doi.org/10.1007/978-3-642-17511-4_20
-
[12]
MultiPL-E: A Scalable and Polyglot Approach to Benchmarking Neural Code Gen- eration
Manès, V.J., Han, H., Han, C., Cha, S.K., Egele, M., Schwartz, E.J., Woo, M.: The Art, Science, and Engineering of Fuzzing: A Survey. IEEE Trans. Software Eng.47(11), 2312–2331 (2021).https://doi.org/10.1109/TSE. 2019.2946563
work page doi:10.1109/tse 2021
-
[13]
Miller, Lars Fredriksen, and Bryan So
Miller, B.P., Fredriksen, L., So, B.: An empirical study of the reliability of UNIX utilities. Communications of the ACM33(12), 32–44 (1990).https: //doi.org/10.1145/96267.96279
-
[14]
In: Ramakrish- nan, C.R., Rehof, J
de Moura, L.M., Bjørner, N.S.: Z3: an efficient SMT solver. In: Ramakrish- nan, C.R., Rehof, J. (eds.) Proceedings of the 14th International Conference on Tools and Algorithms for the Construction and Analysis of Systems, TACAS. pp. 337–340. LICS, Springer (2008).https://doi.org/10.1007/ 978-3-540-78800-3_24
2008
-
[15]
In: Jobstmann, B., Leino, K.R.M
Müller, P., Schwerhoff, M., Summers, A.J.: Viper: A verification infras- tructure for permission-based reasoning. In: Jobstmann, B., Leino, K.R.M. (eds.) Proceedings of the 17th International Conference on Verification, Model Checking, and Abstract Interpretation VMCAI. pp. 41–62. Lecture Notes in Computer Science, Springer (2016).https://doi.org/10.1007/...
2016
-
[16]
Thesis, University of Twente, Enschede (June 2025), https://purl.utwente.nl/essays/106440
Nauta, W.: AValAnCHE: Improving robustness of the VerCors verification toolset using fuzzing. Thesis, University of Twente, Enschede (June 2025), https://purl.utwente.nl/essays/106440
2025
-
[17]
O’Hearn, P.W.: Resources, concurrency, and local reasoning. Theoretical Computer Science375(1), 271–307 (2007).https://doi.org/10.1016/j. tcs.2006.12.035
work page doi:10.1016/j 2007
-
[18]
Software: Practice and Experience25(7), 789–810 (1995).https://doi
Parr, T.J., Quong, R.W.: ANTLR: A predicated-LL(k) parser generator. Software: Practice and Experience25(7), 789–810 (1995).https://doi. org/10.1002/spe.4380250705
-
[19]
BIT12(3), 366–375 (Sep 1972).https://doi.org/10.1007/BF01932308
Purdom, P.: A sentence generator for testing parsers. BIT12(3), 366–375 (Sep 1972).https://doi.org/10.1007/BF01932308
-
[20]
In: Kosma- tov, N., Kovács, L
Rubbens, R., van den Bos, P., Huisman, M.: Veymont: Choreography-based generation of correct concurrent programs with shared memory. In: Kosma- tov, N., Kovács, L. (eds.) Proceedings of the 19th International Confer- ence on Integrated Formal MethodsIFM, Lecture Notes in Computer Sci- ence, vol. 15234, pp. 217–236. Springer (2024).https://doi.org/10.1007/...
2024
-
[21]
Master’s thesis, Imperial College London (2023),https://www
Usher, A.: Fuzz-d: Random Program Generation for Testing Dafny. Master’s thesis, Imperial College London (2023),https://www. imperial.ac.uk/media/imperial-college/faculty-of-engineering/ computing/public/distinguished-projects/2223-ug-projects/ fuzz-d-Random-Program-Generation-for-Testing-Dafny.pdf
2023
-
[22]
Yang, X., Chen, Y., Eide, E., Regehr, J.: Finding and understanding bugs in C compilers. In: Hall, M.W., Padua, D.A. (eds.) Proceedings of the 32nd ACM SIGPLAN Conference on Programming Language Design and Imple- mentation PLDI. pp. 283–294. ACM (2011).https://doi.org/10.1145/ 1993498.1993532 10 W. Nauta et al. A Overview of errors found with AValAnCHE Th...
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.