pith. sign in

arxiv: 2606.26678 · v1 · pith:HFZGEUH5new · submitted 2026-06-25 · 💻 cs.FL

Selective Memoization for Efficient Backtracking Regular Expression Matching

Pith reviewed 2026-06-26 02:11 UTC · model grok-4.3

classification 💻 cs.FL
keywords selective memoizationfeedback vertex setbacktracking regexThompson automatonGlushkov automatonlinear-time matchingregular expression matching
0
0 comments X

The pith

Computing a minimum feedback vertex set on a regex automaton lets selective memoization guarantee linear backtracking time.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces a selective memoization strategy for backtracking regular expression matchers that suffer exponential worst-case time. It identifies the smallest set of automaton states whose memoization breaks every cycle, thereby removing redundant recomputation paths without storing results for every state. This Minimum Feedback Node scheme is shown to relate to earlier full and partial memoization methods and is analyzed for both Thompson and Glushkov constructions. A reader would care because full memoization often costs too much memory while many practical engines still risk catastrophic slowdowns on certain inputs.

Core claim

The Minimum Feedback Node memoization scheme computes a minimum feedback vertex set of the automaton and memos only those nodes; this eliminates all exponential paths in backtracking matching, yields linear time, and maintains explicit relationships to prior memoization approaches under both Thompson and Glushkov automata.

What carries the argument

The Minimum Feedback Node (MFN) memoization scheme, which selects memoization points by computing a minimum feedback vertex set of the automaton to cut all cycles.

If this is right

  • Backtracking matchers using MFN memoization run in linear time on all inputs.
  • Memory usage stays below that of full memoization while still preventing catastrophic cases.
  • The same MFN construction works for both Thompson and Glushkov automata with the claimed relationships to earlier schemes preserved.
  • Selective memoization can be computed once per expression rather than per match.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The approach may extend to other nondeterministic automata used in string matching or parsing if their cycle structure is similar.
  • Practical implementations would need an efficient approximation algorithm for the minimum feedback vertex set, since the exact problem is NP-hard.
  • If the MFN set proves small in real-world regexes, the method could be integrated into existing engines without large memory overhead.

Load-bearing premise

That a minimum feedback vertex set on the automaton always produces a memoization set sufficient to remove every exponential recomputation path while preserving the linear-time bound.

What would settle it

A concrete regular expression whose Thompson or Glushkov automaton has an MFN set that still permits an exponential number of matching paths on some input string.

Figures

Figures reproduced from arXiv: 2606.26678 by Brink van der Merwe (Stellenbosch University), Iain le Roux (Stellenbosch University), Martin Berglund (Ume{\aa} University).

Figure 1
Figure 1. Figure 1: An illustration of a few aspects of the Thompson construction: on the left the construction of Th(A · A ′ ), given the subautomata Th(A) and Th(A ′ ). On the right the construction of Th(A ∗ ) from the subautomaton Th(A). All operations except a ∈ Σ and ε involve gluing subautomata together with epsilon transitions, and possibly new states (with Th(a) the unique minimal DFA for {a}). The exact set of state… view at source ↗
Figure 2
Figure 2. Figure 2: Comparison of MFN to IN and CN in terms of memoization entries for Thompson and Glushkov constructions. Entries along the diagonal indicate that both schemes give the same result, while the further an entry is into the upper triangle the larger the improvement offered by MFN. Although the memory improvements of MFN are the main focus of these experiments, matching time is also relevant. MFN removes IDA, bu… view at source ↗
Figure 3
Figure 3. Figure 3: Comparison of MFN to IN and CN in terms of matching steps for Thompson and Glushkov constructions. Entries along the diagonal correspond to no improvement, as can be seen a scattering of entries fall into the lower triangle, indicating improvements in the number of matching steps performed using MFN. Although MFN memoization is not a uniform improvement over all the mentioned memoization [PITH_FULL_IMAGE:… view at source ↗
Figure 4
Figure 4. Figure 4: Comparison of None to MFN in terms of matching steps for Thompson and Glushkov construc￾tions. MFN and CN Comparison Examples The regex that resulted in one of the worst relative matching time performances of MFN memoization compared to CN memoization for both the Thompson and Glushkov constructions was the following. Observe that this regular expression uses the common practical syntax; refer to the docum… view at source ↗
read the original abstract

Backtracking regular expression matchers are widely used due to their expressive power but may exhibit exponential worst-case matching time. Memoization provides a principled method for eliminating redundant computation and ensuring linear matching time, but full memoization is memory-intensive and impractical. We introduce the Minimum Feedback Node (MFN) memoization scheme, a selective memoization strategy based on computing a minimum feedback vertex set of an automaton. We establish relationships with existing memoization schemes and analyze their behaviour under both Thompson and Glushkov automaton constructions.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 0 minor

Summary. The manuscript introduces the Minimum Feedback Node (MFN) memoization scheme, a selective memoization strategy for backtracking regular expression matchers. MFN is obtained by computing a minimum feedback vertex set of the underlying automaton; the paper claims this eliminates all exponential backtracking paths while guaranteeing linear matching time. It further establishes relationships between MFN and prior memoization schemes and analyzes behavior under both Thompson and Glushkov NFA constructions.

Significance. If the MFN construction can be performed in polynomial time and the linear-time guarantee is formally established, the selective-memoization approach would supply a memory-efficient alternative to full memoization, with potential practical value for regex engines that must handle pathological inputs.

major comments (1)
  1. [Abstract] Abstract: the central claim that MFN memoization 'eliminates exponential backtracking paths while delivering linear matching time' depends on an efficient computation of the minimum feedback vertex set. Minimum feedback vertex set is NP-hard on general graphs, and neither Thompson nor Glushkov NFAs are known to admit polynomial-time exact solutions in the worst case. The manuscript supplies neither a structure-exploiting polynomial algorithm nor a proven-correct approximation that preserves the end-to-end linearity; this is load-bearing for the claimed complexity result.

Simulated Author's Rebuttal

1 responses · 0 unresolved

We thank the referee for the careful review and for identifying this critical point regarding the computational complexity underlying our claims. We address the comment below and outline the revisions we will undertake.

read point-by-point responses
  1. Referee: [Abstract] Abstract: the central claim that MFN memoization 'eliminates exponential backtracking paths while delivering linear matching time' depends on an efficient computation of the minimum feedback vertex set. Minimum feedback vertex set is NP-hard on general graphs, and neither Thompson nor Glushkov NFAs are known to admit polynomial-time exact solutions in the worst case. The manuscript supplies neither a structure-exploiting polynomial algorithm nor a proven-correct approximation that preserves the end-to-end linearity; this is load-bearing for the claimed complexity result.

    Authors: We agree that the end-to-end complexity claim is load-bearing and that the manuscript does not supply a polynomial-time algorithm (or approximation preserving linearity) for computing the minimum feedback vertex set on Thompson or Glushkov NFAs. The paper focuses on the correctness and linear-time properties of the matching phase once an MFN has been obtained, establishing that selective memoization on the MFN eliminates all exponential backtracking paths. We do not claim or prove that MFN computation itself is efficient in the worst case for these automata constructions. We will revise the abstract to qualify the linear-time guarantee as applying to the matching process given a precomputed MFN, and we will add a dedicated discussion section addressing the complexity of MFN computation, its NP-hardness in general, and whether the restricted structure of regex-derived NFAs admits efficient exact or approximate solutions. If no such efficient method can be established, the claims will be adjusted to separate the precomputation cost from the per-input matching time. revision: yes

Circularity Check

0 steps flagged

No significant circularity; MFN scheme defined via external min-FVS concept

full rationale

The paper defines the MFN memoization scheme directly in terms of an external graph-theoretic primitive (minimum feedback vertex set of an automaton) and states that it establishes relationships to existing memoization schemes. No equations, fitted parameters, self-citations, or ansatzes are present in the abstract that would reduce any claimed derivation or prediction to the paper's own inputs by construction. The central construction therefore remains independent of the target result and draws on standard external mathematics.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract supplies no explicit free parameters, axioms, or invented entities; the scheme relies on the standard concept of feedback vertex set whose computational properties are assumed but not detailed here.

pith-pipeline@v0.9.1-grok · 5621 in / 1016 out tokens · 22223 ms · 2026-06-26T02:11:16.475100+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

14 extracted references · 13 canonical work pages

  1. [1]

    Theoretical Computer Science 679, 69–82 (2017)

    Berglund, M., van der Merwe, B.: On the semantics of regular expression parsing in the wild. Theoretical Computer Science 679, 69–82 (2017). https://doi.org/10.1016/j.tcs.2016.09.006

  2. [2]

    Berry, G., Sethi, R.: From regular expressions to deterministic automata. Theor. Comput. Sci.48(3), 117–126 (1986). https://doi.org/10.1016/0304-3975(86)90088-5

  3. [3]

    In: Dumas, M., Pfahl, D., Apel, S., Russo, A

    Davis, J.C., IV , L.G.M., Coghlan, C.A., Servant, F., Lee, D.: Why aren’t regular expressions a lingua franca? An empirical study on the re-use and portability of regular expressions. In: Dumas, M., Pfahl, D., Apel, S., Russo, A. (eds.) Proceedings of the ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Sof...

  4. [4]

    Choquette - Choo, Hengrui Jia, Adelin Travers, Baiwu Zhang, David Lie, and Nicolas Papernot

    Davis, J.C., Servant, F., Lee, D.: Using Selective Memoization to Defeat Regular Expression Denial of Service (ReDoS). In: 42nd IEEE Symposium on Security and Privacy, SP 2021, San Francisco, CA, USA, 24-27 May 2021. pp. 1–17. IEEE (2021). https://doi.org/10.1109/SP40001.2021.00032

  5. [5]

    Technical Report 98-17, Università Ca’ Foscari di Venezia (1998)

    Giammarresi, D., Ponty, J.L., Wood, D.: Glushkov and Thompson Constructions: A Synthesis. Technical Report 98-17, Università Ca’ Foscari di Venezia (1998)

  6. [6]

    Giammarresi, D., Ponty, J., Wood, D., Ziadi, D.: A characterization of Thompson digraphs. Discret. Appl. Math. 134(1-3), 317–337 (2004). https://doi.org/10.1016/S0166-218X(03)00299-3

  7. [7]

    Russian Mathematical Surveys 16(5), 1 (oct 1961)

    Glushkov, V .M.: The abstract theory of automata. Russian Mathematical Surveys 16(5), 1 (oct 1961). https://doi.org/10.1070/RM1961v016n05ABEH004112

  8. [8]

    Reducibility among Combinatorial Problems

    Karp, R.M.: Reducibility among combinatorial problems. In: Miller, R.E., Thatcher, J.W. (eds.) Proceedings of a symposium on the Complexity of Computer Computations, held March 20-22, 1972, at the IBM Thomas J. Watson Research Center, Yorktown Heights, New York, USA. pp. 85–103. The IBM Research Symposia Series, Plenum Press, New York (1972). https://doi....

  9. [9]

    In: Maneth, S

    van der Merwe, B., Mouton, J., van Litsenborgh, S., Berglund, M.: Memoized regular expressions. In: Maneth, S. (ed.) Implementation and Application of Automata - 25th International Conference, CIAA 2021, Virtual Event, July 19-22, 2021, Proceedings. Lecture Notes in Computer Science, vol. 12803, pp. 39–52. Springer (2021). https://doi.org/10.1007/978-3-03...

  10. [10]

    In: Fazekas, S.Z

    Roodt, A., Watling, B.K.M., Bester, W., van der Merwe, B., Sung, S., Han, Y .S.: Benchmarking regular expression matching. In: Fazekas, S.Z. (ed.) Implementation and Application of Automata - 28th Interna- tional Conference, CIAA 2024, Akita, Japan, September 3-6, 2024, Proceedings. Lecture Notes in Computer Science, vol. 15015, pp. 316–331. Springer (202...

  11. [11]

    Shamir, A.: A linear time algorithm for finding minimum cutsets in reducible graphs. SIAM J. Comput. 8(4), 645–655 (1979). https://doi.org/10.1137/0208051

  12. [12]

    Thompson, K.: Regular expression search algorithm. Commun. ACM 11(6), 419–422 (1968). https://doi.org/10.1145/363347.363387

  13. [13]

    Theoretical Computer Science 88(2), 325–349 (1991)

    Weber, A., Seidl, H.: On the degree of ambiguity of finite automata. Theoretical Computer Science 88(2), 325–349 (1991). https://doi.org/10.1016/0304-3975(91)90381-B

  14. [14]

    In: Han, Y ., Salomaa, K

    Weideman, N., van der Merwe, B., Berglund, M., Watson, B.W.: Analyzing matching time behavior of backtracking regular expression matchers by using ambiguity of NFA. In: Han, Y ., Salomaa, K. (eds.) Im- plementation and Application of Automata - 21st International Conference, CIAA 2016, Seoul, South Korea, July 19-22, 2016, Proceedings. Lecture Notes in Co...