pith. sign in

arxiv: 2510.18479 · v3 · pith:VZTJKT26new · submitted 2025-10-21 · 💻 cs.PL · cs.FL

Formally Verified Linear-Time Invertible Lexing

Pith reviewed 2026-05-22 13:19 UTC · model grok-4.3

classification 💻 cs.PL cs.FL
keywords verified lexerinvertible lexinglinear timelongest matchformal verificationtoken sequenceszippersmemoization
0
0 comments X

The pith

ZipLex verifies that lexing and printing are exact inverses while running in linear time under longest-match rules.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents ZipLex as a verified framework for lexical analysis that ensures any tokenized string can be printed back to recover the exact original input. It achieves linear-time performance through memoized operations on a new representation of token sequences that preserves their separability. This combination matters for building reliable language tools where round-tripping between source text and tokens must be lossless and efficient. Earlier approaches either skipped invertibility guarantees or accepted quadratic costs in some inputs. If the claims hold, verified lexers can support practical applications like JSON handling and full language front-ends without hidden performance traps.

Core claim

ZipLex establishes that lexing a string to a token sequence and then printing that sequence always yields the original string, that the lexing step itself completes in time linear in the input length, and that the process follows longest-match semantics. These properties rest on an abstraction for token sequences that enables both efficient updates and proofs of invertibility, combined with verified zippers for navigation and a standalone imperative hash table for memoization.

What carries the argument

The new abstraction of token sequences that captures separability of tokens while supporting efficient manipulation and invertibility proofs.

If this is right

  • Token sequences can be manipulated and then printed to recover source text exactly in verified compilers or pretty-printers.
  • JSON processors and language lexers built on this framework avoid quadratic slowdowns on inputs that defeat flex-style methods.
  • Linear-time behavior holds even when memoization is required to handle overlapping token choices.
  • The same verified structures can be reused for other string-to-structure transformations that require round-tripping.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The separation of token sequences from concrete string positions could simplify proofs in other bidirectional language processing tools.
  • Combining zippers with verified hash tables might transfer to linear-time verified parsing or pretty-printing beyond lexing.
  • Extending the abstraction to handle context-sensitive tokens would test whether invertibility survives more complex grammar rules.

Load-bearing premise

The token sequence abstraction correctly models separability and invertibility without introducing unverified edge cases during memoization or zipper operations.

What would settle it

A concrete input string for which lexing followed by printing yields a string different from the original, or for which measured lexing time grows quadratically rather than linearly with input size.

Figures

Figures reproduced from arXiv: 2510.18479 by Samuel Chassot, Viktor Kun\v{c}ak.

Figure 1
Figure 1. Figure 1: JSON sorting application algorithm. Grey boxes represent instances of PrintableTokens [PITH_FULL_IMAGE:figures/full_fig_p004_1.png] view at source ↗
Figure 2
Figure 2. Figure 2: Token semantic value for JSON integer literal along with the injective transformation function. [PITH_FULL_IMAGE:figures/full_fig_p006_2.png] view at source ↗
Figure 3
Figure 3. Figure 3: ADT definitions for tokens and rules of the lexer framework. Signature of the [PITH_FULL_IMAGE:figures/full_fig_p006_3.png] view at source ↗
Figure 5
Figure 5. Figure 5: Correctness theorem (in Stainless) for the [PITH_FULL_IMAGE:figures/full_fig_p007_5.png] view at source ↗
Figure 6
Figure 6. Figure 6: Definition of the new Injection construct. In the token transformation case, f is transform and witness is charsOf. For readability, in this section we use the syntactic sugar [𝑎-𝑧] and [0-9] to denote union of all characters in the given interval, and 𝑟 + for 𝑟 · 𝑟 ∗ . While not primitive in our DSL, they can be easily expressed as Scala functions that build the underlying expanded regular expressions. 3.… view at source ↗
Figure 7
Figure 7. Figure 7: PrintableTokens token sequence abstraction that maintains sep property on successive elements of the sequence. Also shown are operations append and slice that maintain the invariant. Formally, an R-path condition 𝑝𝑅 for a given relation 𝑅 is defined as follows for a sequence 𝑠: 𝑝𝑅 (𝑠) = ∀𝑖, 0 ≤ 𝑖 < |𝑠| − 1 =⇒ 𝑅(𝑠𝑖 , 𝑠𝑖+1) We call R-Path a sequence that satisfies an R-Path condition. One commonly known R-Pa… view at source ↗
Figure 8
Figure 8. Figure 8: Formal definition of Brzozowski’s derivatives where [PITH_FULL_IMAGE:figures/full_fig_p013_8.png] view at source ↗
Figure 9
Figure 9. Figure 9: ADT definitions for zipper-based representation of regular expressions and signatures of [PITH_FULL_IMAGE:figures/full_fig_p014_9.png] view at source ↗
Figure 10
Figure 10. Figure 10: Brzozowski derivatives on zippers, and matching function for zippers. [PITH_FULL_IMAGE:figures/full_fig_p015_10.png] view at source ↗
Figure 11
Figure 11. Figure 11: Matching of single-line comments of size [PITH_FULL_IMAGE:figures/full_fig_p019_11.png] view at source ↗
Figure 12
Figure 12. Figure 12: JSON lexer performance with or without memoization (left). Comparison of R-Path [PITH_FULL_IMAGE:figures/full_fig_p020_12.png] view at source ↗
Figure 13
Figure 13. Figure 13: Comparison between lexing and computing R-Path [PITH_FULL_IMAGE:figures/full_fig_p021_13.png] view at source ↗
Figure 14
Figure 14. Figure 14: JSON lexers performance comparison. x-axis shows the length of input files in number of characters; y-axis shows the time in [PITH_FULL_IMAGE:figures/full_fig_p022_14.png] view at source ↗
read the original abstract

We present ZipLex, a verified framework for invertible linear-time lexical analysis following the longest match (maximal munch) semantics. Unlike past verified lexers that focus only on satisfying the semantics of regular expressions and the longest match property, ZipLex also guarantees that lexing and printing are mutual inverses. Thanks to verified memoization, it also ensures that the lexical analysis of a string is linear in the size of the string. Our design and implementation rely on two sets of ideas: (1) a new abstraction of token sequences that captures the separability of tokens in a sequence while supporting their efficient manipulation, and (2) a combination of verified data structures and optimizations, including Huet's zippers and memoization with a standalone verified imperative hash table. Our hash table offers competitive performance as shown by our evaluation. We implemented and verified ZipLex using the Stainless deductive verifier for Scala. Our evaluation demonstrates that ZipLex supports realistic applications such as JSON processing and lexers of programming languages, and behaves linearly even in cases that make flex-style approaches quadratic. ZipLex is two orders of magnitude faster than Verbatim++, showing that verified invertibility and linear-time algorithms can be developed without prohibitive cost. Compared to Coqlex, ZipLex also offers linear (instead of quadratic) time lexing, and is the first lexer that comes with invertibility proofs for printing token sequences.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

0 major / 3 minor

Summary. The paper presents ZipLex, a formally verified framework for linear-time invertible lexical analysis under longest-match (maximal munch) semantics. It guarantees that lexing and printing are mutual inverses, achieves linear time via verified memoization using a standalone imperative hash table, and relies on Huet's zippers combined with a new token-sequence abstraction for separability and efficient manipulation. The implementation is verified in Stainless for Scala, with evaluation demonstrating practicality on JSON processing and programming-language lexers, linear behavior where flex-style approaches become quadratic, and performance two orders of magnitude faster than Verbatim++ while offering linear time unlike Coqlex.

Significance. If the central verification results hold, the work is significant for demonstrating that deductive verification can deliver practical, high-performance lexers with strong guarantees (invertibility plus linearity) that prior tools lack. The machine-checked proofs, reproducible code, and parameter-free derivations from the stated semantics are explicit strengths; the evaluation on realistic inputs further supports that verified invertibility need not incur prohibitive cost.

minor comments (3)
  1. [Section 3] The description of the token-sequence abstraction would benefit from an explicit small example showing how separability is preserved under zipper operations (e.g., around the definition of the new data structure).
  2. [Section 5] Table 1 (performance comparison) reports speedups but does not include raw timing data or input sizes for the quadratic cases; adding these would strengthen the linear-time claim.
  3. [Section 4] A few Scala-specific notations in the code listings (e.g., implicit parameters for the hash table) could be clarified for readers less familiar with Stainless.

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for their positive review, detailed summary of our contributions, and recommendation to accept the paper. We appreciate the recognition that ZipLex provides the first verified lexer combining linear-time performance with provable invertibility between lexing and printing.

Circularity Check

0 steps flagged

No significant circularity in the derivation chain

full rationale

The paper establishes its claims of invertibility and linear-time lexing through direct formal verification in Stainless against explicitly stated semantics, using Huet's zippers, a token-sequence abstraction for separability, and a standalone verified imperative hash table for memoization. No equations, parameters, or derivations reduce the central properties to fitted inputs or self-citations by construction; the proofs are independent of the target results and the evaluation on JSON and language lexers supplies external confirmation. The derivation is therefore self-contained.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The work rests on standard domain assumptions about longest-match lexing and the soundness of Stainless, plus the correctness of the newly introduced token-sequence abstraction; no free parameters or invented entities are introduced.

axioms (2)
  • domain assumption Longest-match (maximal munch) semantics for tokenization
    Invoked as the target semantics that both lexing and the invertibility property must satisfy.
  • standard math Soundness of the Stainless deductive verifier for the Scala subset used
    The entire verification effort depends on Stainless correctly discharging the proof obligations.

pith-pipeline@v0.9.0 · 5773 in / 1310 out tokens · 61733 ms · 2026-05-22T13:19:46.909175+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

27 extracted references · 27 canonical work pages

  1. [1]

    Jonathan Immanuel Brachthäuser, Tillmann Rendel, and Klaus Ostermann. 2016. Parsing with first-class derivatives. InProceedings of the 2016 ACM SIGPLAN International Conference on Object-Oriented Programming, Systems, Languages, and Applications(Amsterdam, Netherlands)(OOPSLA 2016). Association for Computing Machinery, New York, NY, USA, 588–606. doi:10.1...

  2. [2]

    Janusz A Brzozowski. 1964. Derivatives of regular expressions.Journal of the ACM (JACM)11, 4 (1964), 481–494

  3. [3]

    Mario Bucev, Samuel Chassot, Simon Felix, Filip Schramka, and Viktor Kunčak. 2025. Formally Verifiable Generated ASN.1/ACN Encoders and Decoders: A Case Study. InVerification, Model Checking, and Abstract Interpretation, Krishna Shankaranarayanan, Sriram Sankaranarayanan, and Ashutosh Trivedi (Eds.). Springer Nature Switzerland, Cham, 185–207

  4. [4]

    1978.Formalization and automatic derivation of code generators.Technical Report

    Roderic Geoffrey Galton Cattell. 1978.Formalization and automatic derivation of code generators.Technical Report. CARNEGIE-MELLON UNIV PITTSBURGH PA DEPT OF COMPUTER SCIENCE

  5. [5]

    Samuel Chassot and Viktor Kunčak. 2024. Verifying a Realistic Mutable Hash Table. InAutomated Reasoning, Christoph Benzmüller, Marijn J.H. Heule, and Renate A. Schmidt (Eds.). Springer Nature Switzerland, Cham, 304–314. doi:10.1007/978-3-031-63498-7_18

  6. [6]

    Li, and Konstantinos Mamouras

    Agnishom Chattopadhyay, Angela W. Li, and Konstantinos Mamouras. 2025. Verified and Efficient Matching of Regular Expressions with Lookaround. InProceedings of the 14th ACM SIGPLAN International Conference on Certified Programs and Proofs (CPP ’25). Association for Computing Machinery, New York, NY, USA, 198–213. doi:10.1145/3703595.3705884

  7. [7]

    Pierce Darragh and Michael D. Adams. 2020. Parsing with zippers (functional pearl).Proc. ACM Program. Lang.4, ICFP (2020), 108:1–108:28. doi:10.1145/3408990

  8. [8]

    2021.Efficient Parsing with Derivatives and Zippers

    Romain Edelmann. 2021.Efficient Parsing with Derivatives and Zippers. Ph. D. Dissertation. EPFL. doi:10.5075/epfl-thesis-7357

  9. [9]

    Romain Edelmann, Jad Hamza, and Viktor Kunčak. 2020. Zippy LL(1) parsing with derivatives. InProceedings of the 41st ACM SIGPLAN Conference on Programming Language Design and Implementation(London, UK)(PLDI 2020). Association for Computing Machinery, New York, NY, USA, 1036–1051. doi:10.1145/3385412.3385992

  10. [10]

    Derek Egolf, Sam Lasser, and Kathleen Fisher. 2021. Verbatim: A Verified Lexer Generator. In2021 IEEE Security and Privacy Workshops (SPW). IEEE, San Francisco, CA, USA, 92–100. doi:10.1109/SPW53761.2021.00022

  11. [11]

    Derek Egolf, Sam Lasser, and Kathleen Fisher. 2022. Verbatim++: verified, optimized, and semantically rich lexing with derivatives. InProceedings of the 11th ACM SIGPLAN International Conference on Certified Programs and Proofs. ACM, Philadelphia PA USA, 27–39. doi:10.1145/3497775.3503694

  12. [12]

    Simon Guilloud and Clément Pit-Claudel. 2025. Verified and Optimized Implementation of Orthologic Proof Search. InComputer Aided Verification - 37th International Conference, CA V 2025, July 23-25, 2025, Proceedings, Part III (Lecture Notes in Computer Science, Vol. 15933), Ruzica Piskac and Zvonimir Rakamaric (Eds.). Springer, Zagreb, Croatia, 130–152. d...

  13. [13]

    Gérard Huet. 1997. The zipper.Journal of functional programming7, 5 (1997), 549–554

  14. [14]

    In: Clarke, E.M., Voronkov, A

    K. Rustan M. Leino. 2010. Dafny: An Automatic Program Verifier for Functional Correctness. InLogic for Programming, Artificial Intelligence, and Reasoning, Edmund M. Clarke and Andrei Voronkov (Eds.). Springer, Berlin, Heidelberg, 348–370. doi:10.1007/978-3-642-17511-4_20

  15. [15]

    2024.The CompCert C verified compiler: Documentation and user’s manual

    Xavier Leroy. 2024.The CompCert C verified compiler: Documentation and user’s manual. report. Inria. https://inria.hal.science/hal-01091802 Pages: 1

  16. [16]

    Xavier Leroy, Sandrine Blazy, Daniel Kästner, Bernhard Schommer, Markus Pister, and Christian Ferdinand. 2016. CompCert - A Formally Verified Optimizing Compiler. InERTS 2016: Embedded Real Time Software and Systems, 8th European Congress. SEE, HAL, Toulouse, France, 0, 0. https://inria.hal.science/hal-01238879

  17. [17]

    Ravichandhran Madhavan, Sumith Kulal, and Viktor Kuncak. 2017. Contract-based resource verification for higher-order functions with memoization. SIGPLAN Not.52, 1 (Jan. 2017), 330–343. doi:10.1145/3093333.3009874

  18. [18]

    Matthew Might, David Darais, and Daniel Spiewak. 2011. Parsing with derivatives: a functional pearl. InProceeding of the 16th ACM SIGPLAN international conference on Functional Programming, ICFP 2011, September 19-21, 2011, Manuel M. T. Chakravarty, Zhenjiang Hu, and Olivier Danvy (Eds.). ACM, Tokyo, Japan, 189–195. doi:10.1145/2034773.2034801

  19. [19]

    Tobias Nipkow. 1998. Verified lexical analysis. InTheorem Proving in Higher Order Logics, Jim Grundy and Malcolm Newey (Eds.). Springer, Berlin, Heidelberg, 1–15. doi:10.1007/BFb0055126

  20. [20]

    Wendlasida Ouedraogo, Gabriel Scherer, and Lutz Strassburger. 2023. Coqlex Artifact Gitlab Repository. https://gitlab.inria.fr/wouedrao/coqlex/- /tree/master/Comparison/JSON?ref_type=heads

  21. [21]

    Wendlasida Ouedraogo, Gabriel Scherer, and Lutz Strassburger. 2023. Coqlex: Generating Formally Verified Lexers.The Art, Science, and Engineering of Programming8, 1 (2023), 3–1

  22. [22]

    Scott Owens, John Reppy, and Aaron Turon. 2009. Regular-expression derivatives re-examined.Journal of Functional Programming19, 2 (2009), 173–190

  23. [23]

    2011.Grammatical Framework: Programming with Multilingual Grammars

    Aarne Ranta. 2011.Grammatical Framework: Programming with Multilingual Grammars. CSLI Publications, Stanford. ISBN-10: 1-57586-626-9 (Paper), 1-57586-627-7 (Cloth)

  24. [24]

    Tillmann Rendel and Klaus Ostermann. 2010. Invertible syntax descriptions: unifying parsing and pretty printing.SIGPLAN Not.45, 11 (Sept. 2010), 1–12. doi:10.1145/2088456.1863525

  25. [25]

    Maximal-munch

    Thomas W. Reps. 1998. "Maximal-munch" Tokenization in Linear Time.ACM Trans. Program. Lang. Syst.20, 2 (1998), 259–273. doi:10.1145/276393. 276394 Manuscript submitted to ACM ZipLex: Verified Invertible Lexing with Memoized Derivatives and Zippers 25

  26. [26]

    Noé De Santo, Aurèle Barrière, and Clément Pit-Claudel. 2024. A Coq Mechanization of JavaScript Regular Expression Semantics.Proc. ACM Program. Lang.8, ICFP (2024), 1003–1031. doi:10.1145/3674666

  27. [27]

    Ian Erik Varatalu, Margus Veanes, and Juhan Ernits. 2025. RE#: High Performance Derivative-Based Regex Matching with Intersection, Complement, and Restricted Lookarounds.Proceedings of the ACM on Programming Languages9, POPL (2025), 1–32. Manuscript submitted to ACM