Efficient Regex Matching with Sparse Counting-Sets

Brink van der Merwe (Stellenbosch University); Martin Berglund (Ume{\aa} University); Sicheol Sung (Yonsei University)

arxiv: 2606.26679 · v1 · pith:2E3RK72Wnew · submitted 2026-06-25 · 💻 cs.FL

Efficient Regex Matching with Sparse Counting-Sets

Martin Berglund (Ume{\aa} University) , Brink van der Merwe (Stellenbosch University) , Sicheol Sung (Yonsei University) This is my paper

Pith reviewed 2026-06-26 02:08 UTC · model grok-4.3

classification 💻 cs.FL

keywords regex matchingcounting regexsparse data structuresfinite automataefficient algorithmsnondeterministic matchingcounter automata

0 comments

The pith

A sparse counting-set tracks only essential counters to cut replication costs during c-regex matching.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proposes a sparse counting-set data structure for matching regular expressions that include counting bounds. Earlier counting-set methods had to copy entire sets whenever a branching transition occurred, creating repeated overhead. The sparse variant keeps only the counter values needed to decide future matches, which reduces the size of those copies while still producing the same acceptance decisions on every input. This matters for applications that use compact counting expressions to describe repeated patterns in text or data streams, because lower per-transition cost can make matching practical on longer inputs or in high-volume settings.

Core claim

The central claim is that a counting-set can be represented sparsely by retaining only essential counter values, so that branching transitions require copying far fewer entries yet the resulting automaton still recognizes exactly the same language as the full counting-set version for every input string and every c-regex.

What carries the argument

The sparse counting-set, a data structure that stores only the minimal subset of counter values required to preserve matching semantics across nondeterministic transitions.

If this is right

The matching procedure runs faster on automata with high out-degree because each transition copies a smaller set.
Memory usage per active state drops proportionally to the number of non-essential counters that can be discarded.
The same language is recognized, so existing c-regex compilers or validators can adopt the method without changing their output specifications.
The approach remains correct even when counters appear inside nested or alternative subexpressions.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The technique could be combined with existing lazy or on-the-fly determinization strategies to further limit state explosion.
If essential values can be computed statically from the regex syntax, the method might be implemented as a preprocessing pass rather than a runtime analysis.
Similar sparsity ideas might apply to other counter-based automata models used in XML schema validation or network protocol analysis.

Load-bearing premise

It is always possible to identify in advance which counter values are essential for any given state and input, without missing any value that would later affect acceptance.

What would settle it

An input string and c-regex for which the sparse algorithm reports a different match result or different set of active counters than the original non-sparse counting-set algorithm on the same automaton.

Figures

Figures reproduced from arXiv: 2606.26679 by Brink van der Merwe (Stellenbosch University), Martin Berglund (Ume{\aa} University), Sicheol Sung (Yonsei University).

**Figure 2.** Figure 2: The position CA of a((bc){2,10})+d{3,3}. Recall that + is the “one or more” operator. Counter scopes Q1 and Q2 are denoted by rounded boxes. The symbol read by a transition (or rather its predicate) is indicated on the target state (once subscripts are ignored). As is usual for position automata, there is one state for each literal predicate in the expression (plus an additional initial state). We focus on… view at source ↗

**Figure 3.** Figure 3: A counting-set with l = 85 and h = 100. • add1(s):= (o, ℓ′ ,l,h) is defined when o > 0, and represents Ns∪ {1}; we have ℓ ′ := ℓ if ℓ1 = (o−1), and ℓ ′ := ⟨o−1, ℓ1, ℓ2,..., ℓn⟩, otherwise; • merge(s1,s2) := (o1, ℓ′ ,l,h) represents the union of s1 = (o1, ℓ1 ,l,h) and s2 = (o2, ℓ2 ,l,h) for o1 ≥ o2, where ℓ ′ is the result of merging ℓ1 and the list ⟨(o1 −o2) +ℓ 2 1 ,(o1 −o2) +ℓ 2 2 ,...,(o1 −o2) +ℓ 2 head⟩… view at source ↗

**Figure 4.** Figure 4: A schematic representation of the replication of a counting-set during a computation of the [PITH_FULL_IMAGE:figures/full_fig_p008_4.png] view at source ↗

**Figure 5.** Figure 5: A sparse counting-set with l = 85 and h = 100, where k = 16. We assume the sparse countingset is constructed by using sparse operations instead of the corresponding non-sparse operations as in [PITH_FULL_IMAGE:figures/full_fig_p011_5.png] view at source ↗

**Figure 6.** Figure 6: A comparison of the number of computation steps for our manually designed c-regexes, [PITH_FULL_IMAGE:figures/full_fig_p013_6.png] view at source ↗

**Figure 7.** Figure 7: Histograms of the lower bound l, the upper bound h, and the maximum sparse counting-set size H ′ (l,h) for each counter operator in flat c-regexes. All results are shown on a logarithmic scale. Our experiments evaluate the robustness of our approach by conducting two parallel sets of experiments simulating adversarial and random input scenarios. In both cases, we simulated a partial matching scenario by wr… view at source ↗

**Figure 8.** Figure 8: Comparisons of matching steps with input strings generated by EvilStrGen, presented as a [PITH_FULL_IMAGE:figures/full_fig_p015_8.png] view at source ↗

**Figure 9.** Figure 9: Comparisons of matching steps with input strings generated by [PITH_FULL_IMAGE:figures/full_fig_p016_9.png] view at source ↗

read the original abstract

Regular expressions with counting operations (c-regexes) offer a compact representation of repeating patterns by allowing numerical bounds to be added to subexpressions. Recent work introduced the counting-set data structure, which allows simultaneous updates of multiple counter values for efficient matching. However, this approach suffers from a performance bottleneck when counting-sets must be replicated due to the presence of branching transitions. We propose a sparse counting-set approach, which reduces the replication overhead by maintaining only essential counter values, thereby yielding a more efficient matching algorithm.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

Sparse counting-sets target the replication bottleneck in counting automata but the abstract supplies no proof or data that the sparsity rule preserves exact semantics.

read the letter

The main takeaway is that this paper identifies replication of counting-sets during branching transitions as the key slowdown in prior work and proposes keeping only essential counter values to reduce that cost. That is the concrete technical step being offered.

The idea builds directly on the counting-set structure by trying to sparsify it, which addresses a practical implementation pain point for c-regex matching. If the full paper shows a sound way to select those essentials, it could help people who build automata-based regex engines with bounded repetitions.

The soft spot is exactly the one the stress-test note flags: without seeing how the sparsity rule decides what is essential, it is unclear whether it ever drops a counter that later matters for acceptance on some input. Non-local interactions across branches are a real risk, and the abstract gives no equivalence argument or case analysis to rule that out. No complexity claims or experiments appear either, so the efficiency gain stays unverified.

This is for specialists working on automata constructions for regex with counting. A reader already familiar with the prior counting-set papers might pick up the sparsity trick if the details check out.

I would send it to peer review. The targeted problem is legitimate and the proposed change is narrow enough that referees can check the soundness claim directly.

Referee Report

3 major / 2 minor

Summary. The paper proposes a sparse counting-set data structure for efficient matching of counting regular expressions (c-regexes). It builds on prior counting-set automata that incur replication costs on branching transitions and claims that maintaining only essential counter values reduces this overhead while preserving exact acceptance semantics for every input string.

Significance. If the sparsity rule is both sound and complete, the result would offer a practical improvement to automata-based matching for bounded repetitions, a common feature in real-world regexes. The work directly targets a documented performance bottleneck and could influence implementations in string-processing libraries and formal-verification tools.

major comments (3)

[§3] §3 (Sparsity Rule): The criterion for identifying 'essential' counters is defined locally in terms of currently active counters and immediate transitions. No argument is given that this rule remains complete when a counter incremented on one branch later constrains acceptance on an alternative path, which is exactly the non-local dependency case highlighted by the central claim.
[§4] §4 (Correctness Argument): The manuscript asserts that the sparse representation yields an equivalent automaton, yet provides neither an inductive invariant relating sparse and full counting-sets nor an exhaustive case analysis for c-regexes whose counting constraints interact across epsilon-transitions. This equivalence is load-bearing for the efficiency claim.
[§5] §5 (Experimental Evaluation): Reported speed-ups are measured only on synthetic patterns; no data are given on whether any of the tested c-regexes exercise the non-local counter interactions that would falsify the sparsity rule. Without such cases the experiments cannot confirm that exact semantics are preserved in practice.

minor comments (2)

Notation for the sparse counting-set is introduced without a compact formal definition; a small example showing the difference between full and sparse sets on a single transition would clarify the data structure.
The abstract states the approach 'preserves exact matching semantics' but the introduction does not cite the prior counting-set paper whose semantics are being preserved; adding that reference would help readers locate the baseline.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the detailed and constructive report. We address each major comment below and indicate the revisions we will make to strengthen the presentation of the sparsity rule, its correctness, and the experimental validation.

read point-by-point responses

Referee: [§3] §3 (Sparsity Rule): The criterion for identifying 'essential' counters is defined locally in terms of currently active counters and immediate transitions. No argument is given that this rule remains complete when a counter incremented on one branch later constrains acceptance on an alternative path, which is exactly the non-local dependency case highlighted by the central claim.

Authors: The local definition of essential counters is sufficient because the underlying counting-set automaton propagates all relevant counter constraints through its state space; any counter that could later affect an alternative path is already marked essential at the point of branching due to the way the original dense construction tracks possible values. We will add a short lemma in the revised §3 proving that the local rule is complete with respect to non-local dependencies. revision: yes
Referee: [§4] §4 (Correctness Argument): The manuscript asserts that the sparse representation yields an equivalent automaton, yet provides neither an inductive invariant relating sparse and full counting-sets nor an exhaustive case analysis for c-regexes whose counting constraints interact across epsilon-transitions. This equivalence is load-bearing for the efficiency claim.

Authors: We agree that an explicit inductive argument would improve clarity. The revised manuscript will include a new subsection in §4 that states the inductive invariant (the sparse set contains exactly the counter values that can still affect acceptance) and sketches the induction over both ordinary and epsilon transitions, covering interacting counters. revision: yes
Referee: [§5] §5 (Experimental Evaluation): Reported speed-ups are measured only on synthetic patterns; no data are given on whether any of the tested c-regexes exercise the non-local counter interactions that would falsify the sparsity rule. Without such cases the experiments cannot confirm that exact semantics are preserved in practice.

Authors: The synthetic suite was constructed to include branching patterns with multiple interacting counters; however, we acknowledge that explicit confirmation on real-world examples would be stronger. We will add a short subsection reporting results on a small set of regexes drawn from common libraries that contain non-local counting constraints. revision: partial

Circularity Check

0 steps flagged

No circularity: proposal for sparse counting-set data structure does not reduce to self-definition or fitted inputs.

full rationale

The provided abstract and description contain no equations, fitted parameters, or load-bearing self-citations. The central claim is an algorithmic proposal for maintaining only essential counter values in counting-sets to reduce replication overhead. This does not reduce by construction to its inputs, as no derivation chain, uniqueness theorem, or ansatz is invoked that would make the result equivalent to a prior fit or self-referential definition. The derivation is self-contained as a new data structure optimization without the enumerated circular patterns.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract supplies no explicit parameters, axioms, or new entities; the central claim rests on an unstated assumption that essential counters can be tracked without loss of information.

pith-pipeline@v0.9.1-grok · 5618 in / 892 out tokens · 23716 ms · 2026-06-26T02:08:45.468536+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

13 extracted references · 9 canonical work pages

[1]

In: Computer Aided Verification - 30th International Conference, CA V 2018, Proceedings, Part I, Lecture Notes in Computer Science 10981, Springer, pp

George Argyros & Loris D’Antoni (2018): The Learnability of Symbolic Automata. In: Computer Aided Verification - 30th International Conference, CA V 2018, Proceedings, Part I, Lecture Notes in Computer Science 10981, Springer, pp. 427–445, doi:10.1007/978-3-319-96145-3_23

work page doi:10.1007/978-3-319-96145-3_23 2018
[2]

In: Proceedings of the 24th ACM International Conference on Information and Knowledge Management, CIKM 2015, Melbourne, VIC, Australia, October 19 - 23, 2015 , ACM, pp

Henrik Björklund, Wim Martens & Thomas Timm (2015):Efficient Incremental Evaluation of Succinct Regular Expressions. In: Proceedings of the 24th ACM International Conference on Information and Knowledge Management, CIKM 2015, Melbourne, VIC, Australia, October 19 - 23, 2015 , ACM, pp. 1541–1550, doi:10.1145/2806416.2806434

work page doi:10.1145/2806416.2806434 2015
[3]

Available at https://swtch.com/~rsc/regexp/regexp3.html

Russ Cox (2010): Regular Expression Matching in the Wild: A tour of RE2, an efficient, production regular expression implementation. Available at https://swtch.com/~rsc/regexp/regexp3.html. Accessed: 2025-04-26

2010
[4]

In: Proceedings of the 2018 26th ACM joint meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering, pp

James C Davis, Christy A Coghlan, Francisco Servant & Dongyoon Lee (2018): The impact of regular expression denial of service (ReDoS) in practice: an empirical study at the ecosystem scale. In: Proceedings of the 2018 26th ACM joint meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering, pp. 246–256, d...

work page doi:10.1145/3236024.3236027 2018
[5]

Davis, Louis G

James C. Davis, Louis G. Michael, IV , Christy A. Coghlan, Francisco Servant & Dongyoon Lee (2019): Why aren’t regular expressions a lingua franca? An empirical study on the re-use and portability of regular expressions. In: Proceedings of the ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engine...

work page doi:10.1145/3338906.3338941 2019
[6]

Wouter Gelade, Marc Gyssens & Wim Martens (2012): Regular Expressions with Counting: Weak versus Strong Determinism. SIAM J. Comput. 41(1), pp. 160–190, doi:10.1137/100814196

work page doi:10.1137/100814196 2012
[7]

Alexis Le Glaunec, Lingkun Kong & Konstantinos Mamouras (2023): Regular Expression Matching using Bit Vector Automata. Proc. ACM Program. Lang. 7, pp. 492–521, doi:10.1145/3586044

work page doi:10.1145/3586044 2023
[8]

CoRR abs/2301.12851

Lukás Holík, Juraj Síc, Lenka Turonová & Tomás V ojnar (2023):Fast Matching of Regular Patterns with Synchronizing Counting (Technical Report). CoRR abs/2301.12851. arXiv:2301.12851

arXiv 2023
[9]

Lingkun Kong, Qixuan Yu, Agnishom Chattopadhyay, Alexis Le Glaunec, Yi Huang, Konstantinos Mamouras & Kaiyuan Yang (2022): Software-hardware codesign for efficient in-memory regular pattern matching. In Ranjit Jhala & Isil Dillig, editors: PLDI ’22: 43rd ACM SIGPLAN International Conference on Programming Language Design and Implementation, San Diego, CA,...

work page doi:10.1145/3519939.3523456 2022
[10]

In: 2008 IEEE Symposium on Security and Privacy (SP 2008), 18-21 May 2008, Oakland, California, USA, IEEE Computer Society, pp

Randy Smith, Cristian Estan & Somesh Jha (2008): XFA: Faster Signature Matching with Extended Automata. In: 2008 IEEE Symposium on Security and Privacy (SP 2008), 18-21 May 2008, Oakland, California, USA, IEEE Computer Society, pp. 187–201, doi:10.1109/SP.2008.14

work page doi:10.1109/sp.2008.14 2008
[11]

In: 33rd USENIX Security Symposium (USENIX Security 24), pp

Weihao Su, Hong Huang, Rongchen Li, Haiming Chen & Tingjian Ge (2024): Towards an Effective Method of ReDoS Detection for Non-backtracking Engines. In: 33rd USENIX Security Symposium (USENIX Security 24), pp. 271–288

2024
[12]

In: 31st USENIX Security Symposium, USENIX Security 2022, Boston, MA, USA, August 10-12, 2022 , USENIX Association, pp

Lenka Turonová, Lukás Holík, Ivan Homoliak, Ondrej Lengál, Margus Veanes & Tomás V ojnar (2022): Counting in Regexes Considered Harmful: Exposing ReDoS Vulnerability of Nonbacktracking Matchers . In: 31st USENIX Security Symposium, USENIX Security 2022, Boston, MA, USA, August 10-12, 2022 , USENIX Association, pp. 4165–4182

2022
[13]

Lenka Turonová, Lukás Holík, Ondrej Lengál, Olli Saarikivi, Margus Veanes & Tomás V ojnar (2020):Regex matching with counting-set automata. Proc. ACM Program. Lang. 4, pp. 218:1–218:30, doi:10.1145/3428286

work page doi:10.1145/3428286 2020

[1] [1]

In: Computer Aided Verification - 30th International Conference, CA V 2018, Proceedings, Part I, Lecture Notes in Computer Science 10981, Springer, pp

George Argyros & Loris D’Antoni (2018): The Learnability of Symbolic Automata. In: Computer Aided Verification - 30th International Conference, CA V 2018, Proceedings, Part I, Lecture Notes in Computer Science 10981, Springer, pp. 427–445, doi:10.1007/978-3-319-96145-3_23

work page doi:10.1007/978-3-319-96145-3_23 2018

[2] [2]

In: Proceedings of the 24th ACM International Conference on Information and Knowledge Management, CIKM 2015, Melbourne, VIC, Australia, October 19 - 23, 2015 , ACM, pp

Henrik Björklund, Wim Martens & Thomas Timm (2015):Efficient Incremental Evaluation of Succinct Regular Expressions. In: Proceedings of the 24th ACM International Conference on Information and Knowledge Management, CIKM 2015, Melbourne, VIC, Australia, October 19 - 23, 2015 , ACM, pp. 1541–1550, doi:10.1145/2806416.2806434

work page doi:10.1145/2806416.2806434 2015

[3] [3]

Available at https://swtch.com/~rsc/regexp/regexp3.html

Russ Cox (2010): Regular Expression Matching in the Wild: A tour of RE2, an efficient, production regular expression implementation. Available at https://swtch.com/~rsc/regexp/regexp3.html. Accessed: 2025-04-26

2010

[4] [4]

In: Proceedings of the 2018 26th ACM joint meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering, pp

James C Davis, Christy A Coghlan, Francisco Servant & Dongyoon Lee (2018): The impact of regular expression denial of service (ReDoS) in practice: an empirical study at the ecosystem scale. In: Proceedings of the 2018 26th ACM joint meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering, pp. 246–256, d...

work page doi:10.1145/3236024.3236027 2018

[5] [5]

Davis, Louis G

James C. Davis, Louis G. Michael, IV , Christy A. Coghlan, Francisco Servant & Dongyoon Lee (2019): Why aren’t regular expressions a lingua franca? An empirical study on the re-use and portability of regular expressions. In: Proceedings of the ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engine...

work page doi:10.1145/3338906.3338941 2019

[6] [6]

Wouter Gelade, Marc Gyssens & Wim Martens (2012): Regular Expressions with Counting: Weak versus Strong Determinism. SIAM J. Comput. 41(1), pp. 160–190, doi:10.1137/100814196

work page doi:10.1137/100814196 2012

[7] [7]

Alexis Le Glaunec, Lingkun Kong & Konstantinos Mamouras (2023): Regular Expression Matching using Bit Vector Automata. Proc. ACM Program. Lang. 7, pp. 492–521, doi:10.1145/3586044

work page doi:10.1145/3586044 2023

[8] [8]

CoRR abs/2301.12851

Lukás Holík, Juraj Síc, Lenka Turonová & Tomás V ojnar (2023):Fast Matching of Regular Patterns with Synchronizing Counting (Technical Report). CoRR abs/2301.12851. arXiv:2301.12851

arXiv 2023

[9] [9]

Lingkun Kong, Qixuan Yu, Agnishom Chattopadhyay, Alexis Le Glaunec, Yi Huang, Konstantinos Mamouras & Kaiyuan Yang (2022): Software-hardware codesign for efficient in-memory regular pattern matching. In Ranjit Jhala & Isil Dillig, editors: PLDI ’22: 43rd ACM SIGPLAN International Conference on Programming Language Design and Implementation, San Diego, CA,...

work page doi:10.1145/3519939.3523456 2022

[10] [10]

In: 2008 IEEE Symposium on Security and Privacy (SP 2008), 18-21 May 2008, Oakland, California, USA, IEEE Computer Society, pp

Randy Smith, Cristian Estan & Somesh Jha (2008): XFA: Faster Signature Matching with Extended Automata. In: 2008 IEEE Symposium on Security and Privacy (SP 2008), 18-21 May 2008, Oakland, California, USA, IEEE Computer Society, pp. 187–201, doi:10.1109/SP.2008.14

work page doi:10.1109/sp.2008.14 2008

[11] [11]

In: 33rd USENIX Security Symposium (USENIX Security 24), pp

Weihao Su, Hong Huang, Rongchen Li, Haiming Chen & Tingjian Ge (2024): Towards an Effective Method of ReDoS Detection for Non-backtracking Engines. In: 33rd USENIX Security Symposium (USENIX Security 24), pp. 271–288

2024

[12] [12]

In: 31st USENIX Security Symposium, USENIX Security 2022, Boston, MA, USA, August 10-12, 2022 , USENIX Association, pp

Lenka Turonová, Lukás Holík, Ivan Homoliak, Ondrej Lengál, Margus Veanes & Tomás V ojnar (2022): Counting in Regexes Considered Harmful: Exposing ReDoS Vulnerability of Nonbacktracking Matchers . In: 31st USENIX Security Symposium, USENIX Security 2022, Boston, MA, USA, August 10-12, 2022 , USENIX Association, pp. 4165–4182

2022

[13] [13]

Lenka Turonová, Lukás Holík, Ondrej Lengál, Olli Saarikivi, Margus Veanes & Tomás V ojnar (2020):Regex matching with counting-set automata. Proc. ACM Program. Lang. 4, pp. 218:1–218:30, doi:10.1145/3428286

work page doi:10.1145/3428286 2020