arxiv: 2604.19448 · v1 · submitted 2026-04-21 · 💻 cs.SE

Recognition: unknown

Crash-free Deductive Verifiers

Wander Nauta , Marcus Gerhold , Marieke Huisman

Authors on Pith no claims yet

Pith reviewed 2026-05-10 02:27 UTC · model grok-4.3

classification 💻 cs.SE

keywords fuzzingdeductive verifierssoftware testingVerCorsAValAnCHErobustnessformal methods

0 comments

The pith

Fuzz testing can uncover issues and improve the reliability of deductive verifiers.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Deductive verifiers are complex tools whose full self-verification is often impractical, yet they must run reliably for growing numbers of external users. The paper proposes fuzzing as a complementary practical technique that generates random inputs to expose crashes, incorrect outputs, and other failures. This is demonstrated by the AValAnCHE prototype integrated with VerCors, which identified multiple issues, and the approach is shown to transfer to other deductive verifiers. A sympathetic reader would care because it offers an accessible way to strengthen these tools for real-world use without requiring complete formal proofs of the verifier itself.

Core claim

Deductive verifiers need complementary means beyond their own verification to guarantee they run reliably out of the box, produce meaningful error messages, and return correct results. Fuzz testing supplies such a means by generating random inputs to surface hidden problems, as implemented and evaluated in the AValAnCHE prototype for VerCors and shown to work on additional deductive verifiers.

What carries the argument

The AValAnCHE fuzzing prototype that generates random inputs for deductive verifiers like VerCors to detect crashes and incorrect behaviors.

If this is right

Deductive verifiers become more dependable for non-expert users by surviving random inputs without crashing.
Previously unknown issues in tools such as VerCors can be found automatically through fuzzing.
The same fuzzing technique applies successfully to other deductive verifiers.
Robustness can be increased without attempting full formal verification of the verifier software.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Fuzzing could become a routine part of the development workflow for formal verification tools.
Early and repeated fuzzing might prevent many user-visible bugs from ever reaching release.
The method could be combined with existing test suites to cover a wider set of failure modes.

Load-bearing premise

That the defects exposed by random fuzz inputs are representative of problems real users would encounter and that fixing them improves usability without introducing new problems.

What would settle it

A follow-up study in which real users of a fuzzed-and-fixed verifier still report crashes or wrong results that the fuzzer never triggered.

Figures

Figures reproduced from arXiv: 2604.19448 by Marcus Gerhold, Marieke Huisman, Wander Nauta.

**Figure 1.** Figure 1: Performance of the different approaches compared. The direct approach (coverage-guided fuzzing) did not find any bugs. It achieves low and barely growing coverage, indicating that this approach is not able to find interesting inputs. The number of instrumented coverage counters also does not grow, thus a large part of the VerCors codebase is not exercised by the generated inputs. Manual inspection of the g… view at source ↗

read the original abstract

As deductive verifiers mature, their potential user base is growing from the initial core developers to other users. To convince external users of the suitability of verifiers, these tools must run reliably out of the box, give meaningful error messages and display correct results. Yet deductive verifiers are large and complex software systems and their own full verification is often out of reach. We therefore need complementary means to provide such guarantees. This paper advocates the use of fuzzing as a practical way to improve the quality and robustness of deductive verifiers. We outline how fuzz testing can be applied to deductive verifiers, and demonstrate the idea with the prototype tool AValAnCHE, which is integrated with the VerCors verifier. We report on our experiments in which AValAnCHE uncovered several issues in VerCors and demonstrate that the approach also works for other deductive verifiers

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper claims that deductive verifiers, as they mature and attract broader users, require robustness guarantees beyond full self-verification, which is often infeasible; it advocates fuzzing as a practical complementary technique, outlines its application, and demonstrates it via the AValAnCHE prototype integrated with VerCors, which uncovered several issues, with the approach also working for other verifiers.

Significance. If the fuzz-generated inputs and discovered issues prove representative of real-user verification tasks, the work offers a pragmatic, low-overhead method to improve verifier reliability and usability, supporting wider adoption. The empirical prototype and cross-verifier demonstration provide concrete evidence of feasibility.

major comments (2)

[Experiments] Experiments section: the abstract and description report that AValAnCHE uncovered issues but provide no details on fuzzing strategy (e.g., grammar or mutation rules for annotated programs), coverage metrics, input volume, or independent confirmation that the crashes are not spurious. This directly limits support for the claim that fuzzing yields practically relevant robustness gains.
[§3] §3 (AValAnCHE integration): the generated inputs are described as random but it is not shown that they exercise the same parser, type-checker, and proof-obligation paths as typical user programs containing JML/ACSL annotations; without this, the fixed issues may not translate to improved experience for the intended user base.

minor comments (2)

[Abstract] The abstract could quantify the issues found (number, severity, categories) rather than stating 'several issues'.
Clarify the scope of 'other deductive verifiers' tested and whether the same fuzzing harness was reused without modification.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments and the positive assessment of the significance of our work. We will revise the manuscript to address the concerns raised about the level of detail in the experiments and the representativeness of the generated inputs.

read point-by-point responses

Referee: [Experiments] Experiments section: the abstract and description report that AValAnCHE uncovered issues but provide no details on fuzzing strategy (e.g., grammar or mutation rules for annotated programs), coverage metrics, input volume, or independent confirmation that the crashes are not spurious. This directly limits support for the claim that fuzzing yields practically relevant robustness gains.

Authors: The current manuscript indeed provides only high-level information on the experiments, focusing on the fact that issues were uncovered rather than detailed methodology. This is a valid point, and we will revise the Experiments section to include specifics on the fuzzing strategy, including the grammar and mutation rules used for generating annotated programs, the volume of inputs tested, and coverage metrics where available. We will also describe how we confirmed the crashes were not spurious by reproducing them and communicating with the tool developers. These additions will better support the claims of practical relevance. revision: yes
Referee: [§3] §3 (AValAnCHE integration): the generated inputs are described as random but it is not shown that they exercise the same parser, type-checker, and proof-obligation paths as typical user programs containing JML/ACSL annotations; without this, the fixed issues may not translate to improved experience for the intended user base.

Authors: We recognize the importance of demonstrating that the fuzzed inputs are representative of real user scenarios. While the generation is random within the constraints of valid JML/ACSL syntax, the manuscript does not explicitly show path coverage equivalence. In the revised version, we will enhance §3 to explain the design of the input generator, provide examples of generated programs that align with common user annotation patterns, and discuss how the approach exercises relevant parser and type-checker components. For proof obligations, we will clarify that the robustness issues found relate to handling of valid inputs that reach those stages. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical advocacy and prototype demonstration

full rationale

The paper is an empirical software-engineering study that advocates fuzzing as a complementary technique for improving deductive verifiers, outlines its application, presents the external prototype AValAnCHE integrated with VerCors, and reports experimental results on discovered issues. No equations, derivations, fitted parameters, or self-referential definitions appear in the provided text. The central claims rest on the existence and behavior of the introduced prototype and the reported runs rather than on any chain that reduces by construction to the paper's own inputs or prior self-citations. The work is therefore self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 1 invented entities

The central claim rests on the domain assumption that verifiers are too complex for full self-verification and introduces one new tool without independent evidence of its general effectiveness.

axioms (1)

domain assumption Deductive verifiers are large and complex software systems and their own full verification is often out of reach.
This premise directly motivates the need for complementary techniques such as fuzzing.

invented entities (1)

AValAnCHE no independent evidence
purpose: Prototype fuzz-testing tool integrated with VerCors to demonstrate the advocated approach.
Newly created for this work; no independent evidence of correctness or completeness is supplied.

pith-pipeline@v0.9.0 · 5436 in / 1177 out tokens · 38511 ms · 2026-05-10T02:27:26.797653+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

22 extracted references · 13 canonical work pages

[1]

In: Gurfinkel, A., Ganesh, V

Armborst, L., Bos, P., van den Haak, L.B., Huisman, M., Rubbens, R., Şakar, Ö., Tasche, P.: The VerCors Verifier: A Progress Report. In: Gurfinkel, A., Ganesh, V. (eds.) Computer Aided Verification. pp. 3–18. Springer Nature Switzerland (2024).https://doi.org/10.1007/ 978-3-031-65630-9_1

2024
[2]

In: Lambers, L., Uchitel, S

Bliudze, S., van den Bos, P., Huisman, M., Rubbens, R., Safina, L.: Jav- aBIP meets VerCors: Towards the safety of concurrent software systems in Java. In: Lambers, L., Uchitel, S. (eds.) Proceedings of the 26th Inter- national Conference on Fundamental Approaches to Software Engineering FASE,LectureNotesinComputerScience,vol.13991,pp.143–150.Springer (20...

work page doi:10.1007/978-3-031-30826-0_8 2023
[3]

CodeIntelligence: Jazzer: Coverage-guided, in-process fuzzing for the JVM, https://github.com/CodeIntelligenceTesting/jazzer
[4]

The github recent bugs dataset for evaluating llm-based debugging applications,

Donaldson, A.F., Sheth, D., Tristan, J.B., Usher, A.: Randomised Testing of the Compiler for a Verification-Aware Programming Language. In: 2024 IEEE Conference on Software Testing, Verification and Validation (ICST). pp. 407–418. IEEE (2024).https://doi.org/10.1109/ICST60714.2024. 00044

work page doi:10.1109/icst60714.2024 2024
[5]

Logical Methods in Computer ScienceVolume 11, Issue 1, 998 (2015).https://doi.org/10.2168/ LMCS-11(1:2)2015

Haack, C., Huisman, M., Hurlin, C., Amighi, A.: Permission-Based Separa- tion Logic for Multithreaded Java Programs. Logical Methods in Computer ScienceVolume 11, Issue 1, 998 (2015).https://doi.org/10.2168/ LMCS-11(1:2)2015

2015
[6]

In: Proceedings of the 22nd ACM SIGPLAN International Conference on Generative Programming: Concepts and Experiences

Hatch, W., Darragh, P., Porncharoenwase, S., Watson, G., Eide, E.: Gen- erating Conforming Programs with Xsmith. In: Proceedings of the 22nd ACM SIGPLAN International Conference on Generative Programming: Concepts and Experiences. pp. 86–99. ACM (2023).https://doi.org/10. 1145/3624007.3624056

work page arXiv 2023
[7]

In: Proceedings of the 9th ACM SIGSOFT International Workshop on Automating TEST Case Design, Selection, and Evaluation

Hodován, R., Kiss, A., Gyimóthy, T.: Grammarinator: A grammar-based open source fuzzer. In: Proceedings of the 9th ACM SIGSOFT International Workshop on Automating TEST Case Design, Selection, and Evaluation. pp. 45–48. ACM (2018).https://doi.org/10.1145/3278186.3278193

work page doi:10.1145/3278186.3278193 2018
[8]

Donaldson

Irfan, A., Porncharoenwase, S., Rakamarić, Z., Rungta, N., Torlak, E.: Test- ing Dafny (experience paper). In: Proceedings of the 31st ACM SIGSOFT International Symposium on Software Testing and Analysis. pp. 556–567. ACM (2022).https://doi.org/10.1145/3533767.3534382

work page doi:10.1145/3533767.3534382 2022
[9]

In: Bobaru, M.G., Havelund, K., Holzmann, G.J., Joshi, R

Jacobs, B., Smans, J., Philippaerts, P., Vogels, F., Penninckx, W., Piessens, F.: Verifast: A powerful, sound, predictable, fast verifier for C and java. In: Bobaru, M.G., Havelund, K., Holzmann, G.J., Joshi, R. (eds.) Pro- ceedings of the Third International Symposium of NASA Formal Meth- ods NFM. pp. 41–55. LNCS, Springer (2011).https://doi.org/10.1007/...

2011
[10]

In: Lie, D., Mannan, M., Backes, M., Wang, X

Klees, G., Ruef, A., Cooper, B., Wei, S., Hicks, M.: Evaluating fuzz test- ing. In: Lie, D., Mannan, M., Backes, M., Wang, X. (eds.) Proceedings of the 2018 ACM SIGSAC Conference on Computer and Communications Security, CCS. pp. 2123–2138. ACM (2018).https://doi.org/10.1145/ 3243734.3243804

work page arXiv 2018
[11]

Rustan M

Leino, K.R.M.: Dafny: An automatic program verifier for functional correct- ness. In: Clarke, E.M., Voronkov, A. (eds.) 16th International Conference on Logic for Programming, Artificial Intelligence, and Reasoning LPAR, Revised Selected Papers. pp. 348–370. Lecture Notes in Computer Science, Springer (2010).https://doi.org/10.1007/978-3-642-17511-4_20

work page doi:10.1007/978-3-642-17511-4_20 2010
[12]

MultiPL-E: A Scalable and Polyglot Approach to Benchmarking Neural Code Gen- eration

Manès, V.J., Han, H., Han, C., Cha, S.K., Egele, M., Schwartz, E.J., Woo, M.: The Art, Science, and Engineering of Fuzzing: A Survey. IEEE Trans. Software Eng.47(11), 2312–2331 (2021).https://doi.org/10.1109/TSE. 2019.2946563

work page doi:10.1109/tse 2021
[13]

Miller, Lars Fredriksen, and Bryan So

Miller, B.P., Fredriksen, L., So, B.: An empirical study of the reliability of UNIX utilities. Communications of the ACM33(12), 32–44 (1990).https: //doi.org/10.1145/96267.96279

work page doi:10.1145/96267.96279 1990
[14]

In: Ramakrish- nan, C.R., Rehof, J

de Moura, L.M., Bjørner, N.S.: Z3: an efficient SMT solver. In: Ramakrish- nan, C.R., Rehof, J. (eds.) Proceedings of the 14th International Conference on Tools and Algorithms for the Construction and Analysis of Systems, TACAS. pp. 337–340. LICS, Springer (2008).https://doi.org/10.1007/ 978-3-540-78800-3_24

2008
[15]

In: Jobstmann, B., Leino, K.R.M

Müller, P., Schwerhoff, M., Summers, A.J.: Viper: A verification infras- tructure for permission-based reasoning. In: Jobstmann, B., Leino, K.R.M. (eds.) Proceedings of the 17th International Conference on Verification, Model Checking, and Abstract Interpretation VMCAI. pp. 41–62. Lecture Notes in Computer Science, Springer (2016).https://doi.org/10.1007/...

2016
[16]

Thesis, University of Twente, Enschede (June 2025), https://purl.utwente.nl/essays/106440

Nauta, W.: AValAnCHE: Improving robustness of the VerCors verification toolset using fuzzing. Thesis, University of Twente, Enschede (June 2025), https://purl.utwente.nl/essays/106440

2025
[17]

Matthew Weinberg

O’Hearn, P.W.: Resources, concurrency, and local reasoning. Theoretical Computer Science375(1), 271–307 (2007).https://doi.org/10.1016/j. tcs.2006.12.035

work page doi:10.1016/j 2007
[18]

Software: Practice and Experience25(7), 789–810 (1995).https://doi

Parr, T.J., Quong, R.W.: ANTLR: A predicated-LL(k) parser generator. Software: Practice and Experience25(7), 789–810 (1995).https://doi. org/10.1002/spe.4380250705

work page doi:10.1002/spe.4380250705 1995
[19]

BIT12(3), 366–375 (Sep 1972).https://doi.org/10.1007/BF01932308

Purdom, P.: A sentence generator for testing parsers. BIT12(3), 366–375 (Sep 1972).https://doi.org/10.1007/BF01932308

work page doi:10.1007/bf01932308 1972
[20]

In: Kosma- tov, N., Kovács, L

Rubbens, R., van den Bos, P., Huisman, M.: Veymont: Choreography-based generation of correct concurrent programs with shared memory. In: Kosma- tov, N., Kovács, L. (eds.) Proceedings of the 19th International Confer- ence on Integrated Formal MethodsIFM, Lecture Notes in Computer Sci- ence, vol. 15234, pp. 217–236. Springer (2024).https://doi.org/10.1007/...

2024
[21]

Master’s thesis, Imperial College London (2023),https://www

Usher, A.: Fuzz-d: Random Program Generation for Testing Dafny. Master’s thesis, Imperial College London (2023),https://www. imperial.ac.uk/media/imperial-college/faculty-of-engineering/ computing/public/distinguished-projects/2223-ug-projects/ fuzz-d-Random-Program-Generation-for-Testing-Dafny.pdf

2023
[22]

mozzarella

Yang, X., Chen, Y., Eide, E., Regehr, J.: Finding and understanding bugs in C compilers. In: Hall, M.W., Padua, D.A. (eds.) Proceedings of the 32nd ACM SIGPLAN Conference on Programming Language Design and Imple- mentation PLDI. pp. 283–294. ACM (2011).https://doi.org/10.1145/ 1993498.1993532 10 W. Nauta et al. A Overview of errors found with AValAnCHE Th...

work page arXiv 2011