pith. sign in

arxiv: 2205.12114 · v5 · submitted 2022-05-24 · 💻 cs.LO · cs.FL

Towards Efficient Matching of Regexes with Backreferences using Register Set Automata (Technical Report)

Pith reviewed 2026-05-24 11:38 UTC · model grok-4.3

classification 💻 cs.LO cs.FL
keywords regex matchingbackreferencesregister set automatadeterministic automatadata wordsderivative constructionemptiness problem
0
0 comments X

The pith

Register set automata transform backreference regex matching into deterministic linear or quadratic time procedures.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces register set automata to handle regexes with backreferences efficiently without backtracking slowdowns. It shows how a large class of register automata can be converted to deterministic RSAs that support set-valued registers and specific operations. A derivative-based method converts suitable regexes with single-letter captures and backreferences into register automata. This yields matching algorithms whose time is linear in input length for finite alphabets and quadratic for infinite alphabets. The work also establishes that RSA emptiness is decidable and F_omega-complete while the model is incomparable to other data-word automata.

Core claim

Register set automata (RSAs) extend register automata so that registers hold sets of symbols and support adding the current input symbol, merging or clearing registers, and testing whether a register contains a value. A large class of register automata can be transformed into deterministic RSAs that serve as the basis for fast matching of regexes with single-letter capture groups and backreferences. A derivative-based algorithm converts a large class of such regexes into register automata, and the resulting matchers run in time linear in the input length for finite alphabets and quadratic for infinite alphabets.

What carries the argument

Register set automata (RSAs), an extension of register automata allowing registers to hold sets of symbols with operations for adding input values, merging or clearing registers, and testing membership.

If this is right

  • Deterministic matching without backtracking is possible for the supported family of backreference regexes.
  • Matching runs in time linear in input length when the alphabet is finite.
  • Matching runs in time quadratic in input length when the alphabet is infinite.
  • The emptiness problem for RSAs is decidable and complete for the F_omega class.
  • RSAs are incomparable in expressive power to other popular automata models over data words.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The single-letter restriction might be relaxed by encoding multi-letter captures through multiple single-letter registers if additional bookkeeping preserves the set semantics.
  • Integration into production regex engines could reduce exposure to ReDoS attacks on patterns that fit the supported class.
  • Determinism of the resulting matchers opens the possibility of efficient online or streaming implementations for data streams.
  • The decidability result for emptiness suggests RSAs could serve as a target for verification tasks involving properties of data words.

Load-bearing premise

The transformations and complexity results apply only to regexes restricted to single-letter capture groups and backreferences, assuming this restricted class remains practically relevant.

What would settle it

An input string together with a single-letter backreference regex whose RSA matcher either exceeds linear time on a finite alphabet or accepts a language different from the original regex.

read the original abstract

Matching regexes (regular expressions) is a common problem in many areas of computer science, with requirements on high speed and robust performance. Regexes with backreferences allow one to express certain patterns (even beyond regular) concisely, however, since the matching is usually done by backtracking, the matching speed can degrade to a degree that constitutes a service failure or a security threat. To facilitate high-speed matching of such regexes, we propose register set automata (RSAs), an extension of register automata where registers can contain sets of symbols (from a potentially infinite alphabet) and the following operations are supported: adding input values to registers, merging or clearing registers, and testing whether a register contains a value. We show that a large class of register automata can be transformed into deterministic RSAs, which can serve as a basis for fast matching of a family of regexes with single-letter capture groups and backreferences. We also give a derivative-based algorithm for transforming a large class of regexes with backreferences to register automata and show that the time complexity of matching is linear and quadratic to the length of the input for finite and infinite alphabets respectively. Our prototype implementation of a regex matcher shows that our approach can significantly improve the robustness of state-of-the-art regex matchers on regexes with backreferences. We also study the theoretical properties of the model and show that the emptiness problem for RSAs is decidable and complete for the $\mathbf{F}_\omega$ class and that RSAs are incomparable in expressive power to other popular automata models over data words.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

0 major / 3 minor

Summary. The paper introduces register set automata (RSAs) as an extension of register automata in which registers hold sets of symbols and support add, merge, clear, and membership-test operations. It shows that a large class of register automata can be converted to deterministic RSAs, gives a derivative-based construction that translates a restricted family of regexes with single-letter capture groups and backreferences into register automata, proves linear-time matching for finite alphabets and quadratic-time matching for infinite alphabets, reports a prototype that improves robustness over existing matchers, and establishes that RSA emptiness is F_ω-complete while RSAs are incomparable to several other data-word automata models.

Significance. If the claimed semantic equivalences hold, the work supplies a deterministic, automaton-based route to efficient matching for an explicitly delimited but practically motivated subclass of backreference regexes, directly addressing performance and denial-of-service concerns. The prototype evaluation and the F_ω-completeness result constitute concrete strengths; the explicit restriction to single-letter groups and the separate treatment of finite versus infinite alphabets allow the complexity claims to be assessed within a well-defined scope.

minor comments (3)
  1. [§2] §2 (RSA definition): the precise cost model for set-union and membership tests on potentially unbounded sets is not stated; a sentence clarifying whether these are assumed O(1) or depend on set cardinality would aid reproducibility of the quadratic bound.
  2. [§4] §4 (derivative construction): the algorithm is described at a high level; adding a small pseudocode fragment or an explicit inductive invariant would make the translation from regex to register automaton easier to verify.
  3. [Table 1] Table 1 (prototype benchmarks): the column headings and units for input length versus matching time are not fully aligned with the text description of the test suite; a caption clarifying the alphabet size for each row would remove ambiguity.

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for the positive summary of our work on register set automata and the recommendation for minor revision. No specific major comments appear in the provided report, so there are no individual points requiring point-by-point rebuttal or revision at this stage.

Circularity Check

0 steps flagged

No significant circularity in the derivation chain

full rationale

The paper introduces register set automata (RSAs) as a novel extension of register automata with explicit set operations on registers, then defines transformations to deterministic RSAs and a derivative-based conversion from a restricted class of backreference regexes directly from these new definitions. Complexity bounds (linear/quadratic matching time) and the F_ω-completeness of emptiness follow from the model semantics and standard automata constructions without any equations that reduce the claimed results to fitted parameters, prior self-citations, or renamed known results. The restricted scope (single-letter capture groups) is stated explicitly, and all load-bearing steps remain self-contained within the introduced formalism.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 1 invented entities

The central claims rest on the newly defined RSA operations and the existence of transformations for a 'large class' of inputs; no free parameters are introduced, and the model itself is the main invented construct.

axioms (1)
  • standard math Standard automata constructions and derivative rules for regular expressions extend to register automata with set registers.
    Invoked for the derivative-based algorithm and the register automata to RSA conversion.
invented entities (1)
  • Register Set Automata (RSAs) no independent evidence
    purpose: Provide deterministic automata for efficient backreference regex matching using set-valued registers.
    New model defined in the paper with specific operations on sets.

pith-pipeline@v0.9.0 · 5855 in / 1304 out tokens · 41689 ms · 2026-05-24T11:38:46.685473+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

44 extracted references · 44 canonical work pages

  1. [1]

    19 Google

    Springer International Publishing.doi:10.1007/978-3-030-63461-2_2. 19 Google. RE2, 2022. URL:https://github.com/google/re2. 20 Google. RE2 issue tracker: Feature request #101: Add backreference support, 2022. URL: https://github.com/google/re2/issues/101. 21 Radu Grigore, Dino Distefano, Rasmus Lerchedahl Petersen, and Nikos Tzevelekos. Runtime Verificatio...

  2. [2]

    doi:10.1007/978-3-642-36742-7_19

    Springer. doi:10.1007/978-3-642-36742-7_19. 22 Lukáš Holík, Ondřej Lengál, Olli Saarikivi, Lenka Turoňová, Margus Veanes, and Tomáš Vojnar. Succinct Determinisation of Counting Automata via Sphere Construction. In Anthony Widjaja Lin, editor,Programming Languages and Systems, Lecture Notes in Computer Science, pages 468–489, Cham, 2019. Springer Internati...

  3. [3]

    41 Sylvain Schmitz and Philippe Schnoebelen

    URL: https://cel.archives-ouvertes.fr/cel-00727025. 41 Sylvain Schmitz and Philippe Schnoebelen. The power of well-structured systems. In Pedro R. D’Argenio and Hernán C. Melgratti, editors,CONCUR 2013, Buenos Aires, Argentina, August 27-30, 2013. Proceedings, volume 8052 ofLecture Notes in Computer Science, pages 5–24. Springer, 2013. doi:10.1007/978-3-6...

  4. [4]

    t is enabled inM and

  5. [5]

    positive

    M ′ is the marking such that for everyp ∈P we have M ′(p) = Out(p) + ∑ {Maux(p′) | Transfer(p′) =p} where Maux =M − In. (1) That is, the successor markingM ′ is obtained by (i) removingIn tokens from inputs oft, (ii) transferring tokens according toTransfer, and (iii) addingOut tokens tot’s outputs. We say that a markingM is reachableif there is a (possib...

  6. [6]

    The Lossy Removalprotogadget, which simulates a (lossy) removal of one token from a placep is the RsA defined as LossyRm(p) = ({q1,q 2,q 3}, R, ∆′, {q1}, {q3}) where ∆′ contains the following three transitions (cf. Figure 9a): ∆′ =    q1 a | {rp}, ∅, {rin ↦→ {in}, rtmp ↦→ ∅} q2, q2 a | {rp}, {rin}, {rtmp ↦→ {rtmp, in}} q2, q2 a | ∅, ∅, {rp ↦→ {rtmp}} q3...

  7. [7]

    Intuitively, the protogadet empties out the register which represents the placep1

    The Move protogadget, which simulatesmoving all tokens from a placep1 to a placep2, is the followingRsA (also depicted in Figure 9b): Move(p1,p 2) = ({q1,q 2}, R, {q1 a | ∅, ∅, {rp1 ↦→ ∅, rp2 ↦→ {rp1 , rp2 }} q2}, {q1}, {q2}). Intuitively, the protogadet empties out the register which represents the placep1. Its previous value is assigned to register repr...

  8. [8]

    Intuitively, the protogadget adds the unique data value from the input tape into the register representing the placep

    The New Tokenprotogadget, which simulates adding a token to a placep, is defined as follows (depiction is in Figure 9c): NewToken(p) = ({q1,q 2}, R, {q1 a | ∅, R, {rp ↦→ {rp, in}} q2}, {q1}, {q2}). Intuitively, the protogadget adds the unique data value from the input tape into the register representing the placep. The uniqueness of the data value is ensur...

  9. [9]

    · AIn(pn) where every AIn(pi) is defined as AIn(pi) = LossyRm(pi)[In(pi)], i.e., it is a concatenation ofIn(pi) copies of LossyRm(pi)

    First, we transformIn into the RsA AIn = AIn(p1) ·... · AIn(pn) where every AIn(pi) is defined as AIn(pi) = LossyRm(pi)[In(pi)], i.e., it is a concatenation ofIn(pi) copies of LossyRm(pi)

  10. [10]

    · AOut(pn) with AOut(pi) defined as AOut(pi) = NewToken(pi)[Out(pi)], i.e., it is a concatenation ofOut(pi) copies of NewToken(pi)

    Second, from Out we create the RsA AOut = AOut(p1) ·... · AOut(pn) with AOut(pi) defined as AOut(pi) = NewToken(pi)[Out(pi)], i.e., it is a concatenation ofOut(pi) copies of NewToken(pi)

  11. [11]

    · ATransfer(pn) · Aunprime(p1) ·

    Third, from Transfer we obtain theRsA ATransfer = ATransfer(p1) ·... · ATransfer(pn) · Aunprime(p1) · ... · A unprime(pn) such that ATransfer(pi) = Move(rpi,r p′ j ) with pj = Transfer(pi) and Aunprime(pi) = Move(p′ i,p i). Intuitively, ATransfer first moves the contents of all registers according toTransfer to primed instances of the target registers (in ...

  12. [12]

    most precise

    Finally, we combine theRsAs created above into the single gadgetAt = AIn · ATransfer · AOut. The initial marking will be encoded by a gadget that puts one new data value in the register representing the placep1. For this, we construct theRsA AM0 = NewToken(p1) and rename its initial state toqinit. The last ingredient we need is to create a gadget that wil...

  13. [13]

    RsA is closed under union and intersection

  14. [14]

    RsA is not closed under complement. Proof. The proofs for closure under union and intersection are standard: for twoRsAs A1 = (Q1, R1, ∆1,I 1,F 1) and A2 = (Q2, R2, ∆2,I 2,F 2) with disjoint sets of states and registers, the RsA A∪ accepting the union of their languages is obtained asA∪ = (Q1 ∪ Q2, R1 ∪R2, ∆1 ∪∆2,I 1 ∪I2,F 1 ∪F2). Similarly, A\ accepting ...

  15. [15]

    each increment is matched with a decrement

    ∈ ∆′ iff s1 a | g∈ 1 , g /∈ 1 , up1 s′ 1 ∈ ∆1 and s2 a | g∈ 2 , g /∈ 2 , up2 s′ 2 ∈ ∆2. Correctness of the constructions is clear. For showing the non-closure under complement, consider the languageL¬∀repeat from Example 5, which can be accepted byRsA. Let us show that for the complement of the language, namely, the language L∀repeat = {w | ∀i∃j : i ̸=j ∧ ...

  16. [16]

    The first instruction is of the form(q1, ·, ·) for q1 being the initial state ofM

  17. [17]

    Each instruction of the form(·, ·,q i) is followed by an instruction of the form(qi, ·, ·)

  18. [18]

    All increments have different data values and all decrements have different data values

  19. [19]

    there is an increment not followed by a decrement with the same data value

    Between every two(·, ifzeroi, ·) instructions (or between the start and the first such an ifzeroi instruction), a. every (·, deci, ·) needs to be preceded by an(·, inci, ·) instruction with the same data value and b. every (·, inci, ·) needs to be followed by a(·, deci, ·) instruction with the same data value. Properties 1 and 2 can be easily expressed usi...

  20. [20]

    RsAn is closed under union

  21. [21]

    RsAn is not closed under intersection and complement. Proof. The proof of closure under union is the same as in the proof of Theorem 9 with the exception that the results uses only registersR1 (we assume |R1| = n): all references to registers r ∈ R2 are changed to references tof(r) where f : R2 → R1 is an injection. CVIT 2016 23:30 Register Set Automata (...

  22. [22]

    ∈ ∆′ iff s1 a | g∈ 1 , g /∈ 1 , up1 s′ 1 ∈ ∆1 and s2 a | g∈ 2 , g /∈ 2 , up2 s′ 2 ∈ ∆2, and F ′ = {(q1,q 2) |q1 ∈F1 ∨q2 ∈F2}. The construction of theDRsA A\ = (Q1 ×Q2, R1 ∪ R2, ∆′,I 1 ×I2,F ′ \ ) accepting the intersection of L(A1) and L(A2) is similar to the construction ofA∪, with the exception of F ′ \, which is obtained asF ′ \ =F1 ×F2. The complement ...

  23. [23]

    DRsAn is closed under complement

  24. [24]

    DRsAn is not closed under union and intersection. Proof. The closure under complement is trivial (complete theDRsA and swap final and non-final states; no new register is introduced). To show thatDRsAn is not closed under intersection, we use languagesLA n and LB n from the proof of Theorem 10. In particular, both these languages are inDRAn (the DRAn needs ...

  25. [26]

    We will show that ρ′ is accepting, and sow ∈ L(A′)

    ⊢⟨a2,d2⟩ t′ 2 · · · ⊢⟨an,dn⟩ t′n ((S′ n,c ′ n),f ′ n) be the run ofA′ on w (A′ is deterministic and complete, soρ′ is unique). We will show that ρ′ is accepting, and sow ∈ L(A′). Let us show that for all0 ≤i ≤n, the following conditions hold:

  26. [27]

    ∀r ∈ R: fi(r) ̸= ⊥ =⇒ fi(r) ∈f ′ i(r), and

  27. [28]

    We proceed by induction oni: i = 0: Since the (only) initial state ofA′ is the macrostate(I,c 0 = {ri ↦→ 0 |ri ∈ R}) (cf

    ∀r ∈ R: c′ i(r) =    0 iff f ′ i(r) = ∅, 1 iff |f ′ i(r)| = 1, ω iff |f ′ i(r)| ≥ 2. We proceed by induction oni: i = 0: Since the (only) initial state ofA′ is the macrostate(I,c 0 = {ri ↦→ 0 |ri ∈ R}) (cf. Line 25), andq0 ∈I, then condition 1 holds. Moreover,f0(r) = ⊥ for everyr ∈ R, so condition 2 holds trivially, and so does condition 3 (thec′ i of...

  28. [29]

    ∀r ∈ R: fj+1(r) ̸= ⊥ =⇒ fj+1(r) ∈f ′ j+1(r), and

  29. [30]

    ∀r ∈ R: c′ j+1(r) =    0 iff f ′ j+1(r) = ∅, 1 iff |f ′ j+1(r)| = 1, ω iff |f ′ j+1(r)| ≥ 2. Proof. 1. Trivial

  30. [31]

    Follows from the induction hypothesis and the definition of the update functionup′ on Lines 9–20

  31. [32]

    ◁ This concludes the first direction of the proof

    Follows from the induction hypothesis and the definition ofc′ on Line 15. ◁ This concludes the first direction of the proof. (⊇) Let w = ⟨a1,d 1⟩... ⟨an,d n⟩ ∈ L (A′). Then there is an accepting runρ′ of A′ on w, such that ρ′ : ((S′ 0,c ′ 0),f ′

  32. [33]

    ⊢⟨a1,d1⟩ t′ 1 ((S′ 1,c ′ 1),f ′

  33. [34]

    ⊢⟨a2,d2⟩ t′ 2 · · · ⊢⟨an,dn⟩ t′n ((S′ n,c ′ n),f ′ n) with (S′ n,c ′ n) ∈ F ′. We will construct in a backward manner a sequence of sets of A’s configurations (i.e., pairs containing a state and assignment to registers) U0,U 1,...,U n−1,U n, such that ∀1 ≤ i ≤ n: Ui ∈ Q × (R → D), that represents all accepting runs ofA over w, and show thatU0 contains a co...

  34. [35]

    for every registerr ∈ R, it holds thatfi(r) ∈f ′ i(r) ∪ {⊥},

  35. [36]

    Let us now show that for all0 ≤ i ≤ n, the setUi is nonempty

    there is a transitionti+1 : qi ai+1 | g=, g̸=, up qi+1 ∈ ∆ such that ∗ (qi+1,f i+1) ∈Ui+1 and ∗ (qi,f i) ⊢⟨ai+1,di+1⟩ ti+1 (qi+1,f i+1). Let us now show that for all0 ≤ i ≤ n, the setUi is nonempty. We proceed by backward induction. ∗ i =n (base case):ρ′ is accepting, so there is at least one state inS′ n \ F ′. ∗ i =j + 1 (induction hypothesis): we assum...

  36. [37]

    Non-complemented literals: if the literal is not complemented, we use Theorem 15(a) to obtain a DRsA accepting it

  37. [38]

    Complemented literals: first, we use Fact 1 to obtain theURA= 1 accepting the comple- mented language and then use Theorem 15(b) to convert it into aDRsA

  38. [39]

    ◀ C Other Examples ▶ Example 34

    Unions and intersections: we simply use Theorem 11 to obtain the resultingDRsA. ◀ C Other Examples ▶ Example 34. Consider the languageLdisjoint [31, Example 2.5] of wordsw =uv where u and v are non-empty and the data values inu and v are disjoint, i.e., Ldisjoint = { w | ∃u,v ∈ Σ × D: w =uv ∧u ̸=ϵ ∧v ̸=ϵ ∧ ∀i,j : D[ui] ̸= D[vj] } For instance, the languag...

  39. [40]

    Intuitively, theEmptyEq simulates a register emptiness test

    The EmptyEq protogadget (depicted in Figure 10a), which simulates the inhibitor arc leading from placep is an RsA=r defined as: EmptyEq(p) = ({q1,q 2,q 3}, R, {q1 a | ∅, ∅, ∅, {re ↦→ ∅} q2,q 2 a | ∅, ∅, {rp ↦→ {re}}, ∅ q3}, {q1}, {q3}). Intuitively, theEmptyEq simulates a register emptiness test. At first, the protogadget explicitly assigns the value of emp...

  40. [41]

    Intuitively, for each arc originating in the transition and ending in a particular place, a token is added to the register representing the destination

    The New Token(depicted in Figure 10b) which simulates adding a token to a placep, is an RsArm =∅ defined in the following way: NewToken(p) = ({q1,q 2}, R, {q1 a | ∅, R, ∅, {rp ↦→ {rp, in}}} q2}, {q1}, {q2}). Intuitively, for each arc originating in the transition and ending in a particular place, a token is added to the register representing the destinatio...

  41. [42]

    Intuitively, for each arc originating in a placep and terminating in transition, respective number of values has to be removed from the register representing placep

    The Non-lossy Remove Token(depicted in Figure 10c) which simulates removal of a token from a placep is an RsA=r defined in the following way: NonLossyRm(p) = ({q1,q 2,q 3}, R, {q1 a | {rp}, ∅, ∅, {rin ↦→ {in}, rs ↦→ {in}} q2, q2 a | {rp}, {rin}, ∅, {rtmp ↦→ {rtmp, inp}, rs ↦→ {rtmp, rin, in}} q2, q2 a | ∅, ∅, {rp ↦→ {rs}}, {rp ↦→ {rtmp}} q3}, {q1}, {q2}). ...

  42. [43]

    The Empty protogadget (depicted in Figure 11a), which simulates the inhibitor arc leading from the placep is an RsArm =∅ defined as: Empty(p) = ({q1,q 2}, R, {q1 a | ∅, ∅, {rp}, ∅, {rp ↦→ ∅}} q2}, {q1}, {q2}). S. Gulčíková and O. Lengál 23:37 q1 q2 a rp = ∅ (a) The Empty(p) protogadget. q1 q2 a {in /∈ r}r∈R rp ← rp ∪ {in} (b) The NewToken(p) protogadget. q...

  43. [44]

    Intuitively, for each arc originating in the transition and ending in a particular place, a token is added to the register representing the destination

    The New Tokenprotogadget (depicted in Figure 11b) which simulates adding a token to a placep, is anRsArm =∅ defined in the following way: NewToken(p) = ({q1,q 2}, R, {q1 a | ∅, R, ∅, ∅, {rp ↦→ {rp, in}}} q2}, {q1}, {q2}). Intuitively, for each arc originating in the transition and ending in a particular place, a token is added to the register representing ...

  44. [45]

    Intuitively, for each token removed from the source place, one value is removed from the register which represents that place

    The Remove Tokenprotogadget (depicted in Figure 11c) which simulates the removal of a token from a placep, is anRsArm =∅ defined in the following way: TokenRem(p) = ({q1,q 2}, R, {q1 a | ∅, {rp}, ∅, {rp}, {rp ↦→ {rp}}} q2}, {q1}, {q2}). Intuitively, for each token removed from the source place, one value is removed from the register which represents that p...