Formally Verified Linear-Time Invertible Lexing
Pith reviewed 2026-05-22 13:19 UTC · model grok-4.3
The pith
ZipLex verifies that lexing and printing are exact inverses while running in linear time under longest-match rules.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
ZipLex establishes that lexing a string to a token sequence and then printing that sequence always yields the original string, that the lexing step itself completes in time linear in the input length, and that the process follows longest-match semantics. These properties rest on an abstraction for token sequences that enables both efficient updates and proofs of invertibility, combined with verified zippers for navigation and a standalone imperative hash table for memoization.
What carries the argument
The new abstraction of token sequences that captures separability of tokens while supporting efficient manipulation and invertibility proofs.
If this is right
- Token sequences can be manipulated and then printed to recover source text exactly in verified compilers or pretty-printers.
- JSON processors and language lexers built on this framework avoid quadratic slowdowns on inputs that defeat flex-style methods.
- Linear-time behavior holds even when memoization is required to handle overlapping token choices.
- The same verified structures can be reused for other string-to-structure transformations that require round-tripping.
Where Pith is reading between the lines
- The separation of token sequences from concrete string positions could simplify proofs in other bidirectional language processing tools.
- Combining zippers with verified hash tables might transfer to linear-time verified parsing or pretty-printing beyond lexing.
- Extending the abstraction to handle context-sensitive tokens would test whether invertibility survives more complex grammar rules.
Load-bearing premise
The token sequence abstraction correctly models separability and invertibility without introducing unverified edge cases during memoization or zipper operations.
What would settle it
A concrete input string for which lexing followed by printing yields a string different from the original, or for which measured lexing time grows quadratically rather than linearly with input size.
Figures
read the original abstract
We present ZipLex, a verified framework for invertible linear-time lexical analysis following the longest match (maximal munch) semantics. Unlike past verified lexers that focus only on satisfying the semantics of regular expressions and the longest match property, ZipLex also guarantees that lexing and printing are mutual inverses. Thanks to verified memoization, it also ensures that the lexical analysis of a string is linear in the size of the string. Our design and implementation rely on two sets of ideas: (1) a new abstraction of token sequences that captures the separability of tokens in a sequence while supporting their efficient manipulation, and (2) a combination of verified data structures and optimizations, including Huet's zippers and memoization with a standalone verified imperative hash table. Our hash table offers competitive performance as shown by our evaluation. We implemented and verified ZipLex using the Stainless deductive verifier for Scala. Our evaluation demonstrates that ZipLex supports realistic applications such as JSON processing and lexers of programming languages, and behaves linearly even in cases that make flex-style approaches quadratic. ZipLex is two orders of magnitude faster than Verbatim++, showing that verified invertibility and linear-time algorithms can be developed without prohibitive cost. Compared to Coqlex, ZipLex also offers linear (instead of quadratic) time lexing, and is the first lexer that comes with invertibility proofs for printing token sequences.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper presents ZipLex, a formally verified framework for linear-time invertible lexical analysis under longest-match (maximal munch) semantics. It guarantees that lexing and printing are mutual inverses, achieves linear time via verified memoization using a standalone imperative hash table, and relies on Huet's zippers combined with a new token-sequence abstraction for separability and efficient manipulation. The implementation is verified in Stainless for Scala, with evaluation demonstrating practicality on JSON processing and programming-language lexers, linear behavior where flex-style approaches become quadratic, and performance two orders of magnitude faster than Verbatim++ while offering linear time unlike Coqlex.
Significance. If the central verification results hold, the work is significant for demonstrating that deductive verification can deliver practical, high-performance lexers with strong guarantees (invertibility plus linearity) that prior tools lack. The machine-checked proofs, reproducible code, and parameter-free derivations from the stated semantics are explicit strengths; the evaluation on realistic inputs further supports that verified invertibility need not incur prohibitive cost.
minor comments (3)
- [Section 3] The description of the token-sequence abstraction would benefit from an explicit small example showing how separability is preserved under zipper operations (e.g., around the definition of the new data structure).
- [Section 5] Table 1 (performance comparison) reports speedups but does not include raw timing data or input sizes for the quadratic cases; adding these would strengthen the linear-time claim.
- [Section 4] A few Scala-specific notations in the code listings (e.g., implicit parameters for the hash table) could be clarified for readers less familiar with Stainless.
Simulated Author's Rebuttal
We thank the referee for their positive review, detailed summary of our contributions, and recommendation to accept the paper. We appreciate the recognition that ZipLex provides the first verified lexer combining linear-time performance with provable invertibility between lexing and printing.
Circularity Check
No significant circularity in the derivation chain
full rationale
The paper establishes its claims of invertibility and linear-time lexing through direct formal verification in Stainless against explicitly stated semantics, using Huet's zippers, a token-sequence abstraction for separability, and a standalone verified imperative hash table for memoization. No equations, parameters, or derivations reduce the central properties to fitted inputs or self-citations by construction; the proofs are independent of the target results and the evaluation on JSON and language lexers supplies external confirmation. The derivation is therefore self-contained.
Axiom & Free-Parameter Ledger
axioms (2)
- domain assumption Longest-match (maximal munch) semantics for tokenization
- standard math Soundness of the Stainless deductive verifier for the Scala subset used
Lean theorems connected to this paper
-
IndisputableMonolith/Cost/FunctionalEquationwashburn_uniqueness_aczel unclear?
unclearRelation between the paper passage and the cited Recognition theorem.
We introduce a new predicate family, R-path predicates, to describe predicates where a relation applies to pairs of successive elements... sep(t1, t2) := ¬(∪ ri).prefixMatch(t1 ::: t2(0))
-
IndisputableMonolith/Foundation/RealityFromDistinctionreality_from_one_distinction unclear?
unclearRelation between the paper passage and the cited Recognition theorem.
ZipLex guarantees that lexing and printing are mutual inverses and that lexical analysis runs in linear time... using Huet’s zippers and memoization
What do these tags mean?
- matches
- The paper's claim is directly supported by a theorem in the formal canon.
- supports
- The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends
- The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses
- The paper appears to rely on the theorem as machinery.
- contradicts
- The paper's claim conflicts with a theorem or certificate in the canon.
- unclear
- Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Reference graph
Works this paper leans on
-
[1]
Jonathan Immanuel Brachthäuser, Tillmann Rendel, and Klaus Ostermann. 2016. Parsing with first-class derivatives. InProceedings of the 2016 ACM SIGPLAN International Conference on Object-Oriented Programming, Systems, Languages, and Applications(Amsterdam, Netherlands)(OOPSLA 2016). Association for Computing Machinery, New York, NY, USA, 588–606. doi:10.1...
-
[2]
Janusz A Brzozowski. 1964. Derivatives of regular expressions.Journal of the ACM (JACM)11, 4 (1964), 481–494
work page 1964
-
[3]
Mario Bucev, Samuel Chassot, Simon Felix, Filip Schramka, and Viktor Kunčak. 2025. Formally Verifiable Generated ASN.1/ACN Encoders and Decoders: A Case Study. InVerification, Model Checking, and Abstract Interpretation, Krishna Shankaranarayanan, Sriram Sankaranarayanan, and Ashutosh Trivedi (Eds.). Springer Nature Switzerland, Cham, 185–207
work page 2025
-
[4]
1978.Formalization and automatic derivation of code generators.Technical Report
Roderic Geoffrey Galton Cattell. 1978.Formalization and automatic derivation of code generators.Technical Report. CARNEGIE-MELLON UNIV PITTSBURGH PA DEPT OF COMPUTER SCIENCE
work page 1978
-
[5]
Samuel Chassot and Viktor Kunčak. 2024. Verifying a Realistic Mutable Hash Table. InAutomated Reasoning, Christoph Benzmüller, Marijn J.H. Heule, and Renate A. Schmidt (Eds.). Springer Nature Switzerland, Cham, 304–314. doi:10.1007/978-3-031-63498-7_18
-
[6]
Agnishom Chattopadhyay, Angela W. Li, and Konstantinos Mamouras. 2025. Verified and Efficient Matching of Regular Expressions with Lookaround. InProceedings of the 14th ACM SIGPLAN International Conference on Certified Programs and Proofs (CPP ’25). Association for Computing Machinery, New York, NY, USA, 198–213. doi:10.1145/3703595.3705884
-
[7]
Pierce Darragh and Michael D. Adams. 2020. Parsing with zippers (functional pearl).Proc. ACM Program. Lang.4, ICFP (2020), 108:1–108:28. doi:10.1145/3408990
-
[8]
2021.Efficient Parsing with Derivatives and Zippers
Romain Edelmann. 2021.Efficient Parsing with Derivatives and Zippers. Ph. D. Dissertation. EPFL. doi:10.5075/epfl-thesis-7357
-
[9]
Romain Edelmann, Jad Hamza, and Viktor Kunčak. 2020. Zippy LL(1) parsing with derivatives. InProceedings of the 41st ACM SIGPLAN Conference on Programming Language Design and Implementation(London, UK)(PLDI 2020). Association for Computing Machinery, New York, NY, USA, 1036–1051. doi:10.1145/3385412.3385992
-
[10]
Derek Egolf, Sam Lasser, and Kathleen Fisher. 2021. Verbatim: A Verified Lexer Generator. In2021 IEEE Security and Privacy Workshops (SPW). IEEE, San Francisco, CA, USA, 92–100. doi:10.1109/SPW53761.2021.00022
-
[11]
Derek Egolf, Sam Lasser, and Kathleen Fisher. 2022. Verbatim++: verified, optimized, and semantically rich lexing with derivatives. InProceedings of the 11th ACM SIGPLAN International Conference on Certified Programs and Proofs. ACM, Philadelphia PA USA, 27–39. doi:10.1145/3497775.3503694
-
[12]
Simon Guilloud and Clément Pit-Claudel. 2025. Verified and Optimized Implementation of Orthologic Proof Search. InComputer Aided Verification - 37th International Conference, CA V 2025, July 23-25, 2025, Proceedings, Part III (Lecture Notes in Computer Science, Vol. 15933), Ruzica Piskac and Zvonimir Rakamaric (Eds.). Springer, Zagreb, Croatia, 130–152. d...
-
[13]
Gérard Huet. 1997. The zipper.Journal of functional programming7, 5 (1997), 549–554
work page 1997
-
[14]
K. Rustan M. Leino. 2010. Dafny: An Automatic Program Verifier for Functional Correctness. InLogic for Programming, Artificial Intelligence, and Reasoning, Edmund M. Clarke and Andrei Voronkov (Eds.). Springer, Berlin, Heidelberg, 348–370. doi:10.1007/978-3-642-17511-4_20
-
[15]
2024.The CompCert C verified compiler: Documentation and user’s manual
Xavier Leroy. 2024.The CompCert C verified compiler: Documentation and user’s manual. report. Inria. https://inria.hal.science/hal-01091802 Pages: 1
work page 2024
-
[16]
Xavier Leroy, Sandrine Blazy, Daniel Kästner, Bernhard Schommer, Markus Pister, and Christian Ferdinand. 2016. CompCert - A Formally Verified Optimizing Compiler. InERTS 2016: Embedded Real Time Software and Systems, 8th European Congress. SEE, HAL, Toulouse, France, 0, 0. https://inria.hal.science/hal-01238879
work page 2016
-
[17]
Ravichandhran Madhavan, Sumith Kulal, and Viktor Kuncak. 2017. Contract-based resource verification for higher-order functions with memoization. SIGPLAN Not.52, 1 (Jan. 2017), 330–343. doi:10.1145/3093333.3009874
-
[18]
Matthew Might, David Darais, and Daniel Spiewak. 2011. Parsing with derivatives: a functional pearl. InProceeding of the 16th ACM SIGPLAN international conference on Functional Programming, ICFP 2011, September 19-21, 2011, Manuel M. T. Chakravarty, Zhenjiang Hu, and Olivier Danvy (Eds.). ACM, Tokyo, Japan, 189–195. doi:10.1145/2034773.2034801
-
[19]
Tobias Nipkow. 1998. Verified lexical analysis. InTheorem Proving in Higher Order Logics, Jim Grundy and Malcolm Newey (Eds.). Springer, Berlin, Heidelberg, 1–15. doi:10.1007/BFb0055126
-
[20]
Wendlasida Ouedraogo, Gabriel Scherer, and Lutz Strassburger. 2023. Coqlex Artifact Gitlab Repository. https://gitlab.inria.fr/wouedrao/coqlex/- /tree/master/Comparison/JSON?ref_type=heads
work page 2023
-
[21]
Wendlasida Ouedraogo, Gabriel Scherer, and Lutz Strassburger. 2023. Coqlex: Generating Formally Verified Lexers.The Art, Science, and Engineering of Programming8, 1 (2023), 3–1
work page 2023
-
[22]
Scott Owens, John Reppy, and Aaron Turon. 2009. Regular-expression derivatives re-examined.Journal of Functional Programming19, 2 (2009), 173–190
work page 2009
-
[23]
2011.Grammatical Framework: Programming with Multilingual Grammars
Aarne Ranta. 2011.Grammatical Framework: Programming with Multilingual Grammars. CSLI Publications, Stanford. ISBN-10: 1-57586-626-9 (Paper), 1-57586-627-7 (Cloth)
work page 2011
-
[24]
Tillmann Rendel and Klaus Ostermann. 2010. Invertible syntax descriptions: unifying parsing and pretty printing.SIGPLAN Not.45, 11 (Sept. 2010), 1–12. doi:10.1145/2088456.1863525
-
[25]
Thomas W. Reps. 1998. "Maximal-munch" Tokenization in Linear Time.ACM Trans. Program. Lang. Syst.20, 2 (1998), 259–273. doi:10.1145/276393. 276394 Manuscript submitted to ACM ZipLex: Verified Invertible Lexing with Memoized Derivatives and Zippers 25
-
[26]
Noé De Santo, Aurèle Barrière, and Clément Pit-Claudel. 2024. A Coq Mechanization of JavaScript Regular Expression Semantics.Proc. ACM Program. Lang.8, ICFP (2024), 1003–1031. doi:10.1145/3674666
-
[27]
Ian Erik Varatalu, Margus Veanes, and Juhan Ernits. 2025. RE#: High Performance Derivative-Based Regex Matching with Intersection, Complement, and Restricted Lookarounds.Proceedings of the ACM on Programming Languages9, POPL (2025), 1–32. Manuscript submitted to ACM
work page 2025
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.