pith. sign in

arxiv: 2606.31030 · v1 · pith:DKS2GWGDnew · submitted 2026-06-30 · 💻 cs.DS

Optimal-Time Contextual Pattern Matching in Compressed Space

Pith reviewed 2026-07-01 03:43 UTC · model grok-4.3

classification 💻 cs.DS
keywords contextual pattern matchingcompressed spacesymmetric CDAWGweighted ancestor queriesoptimal timedata structuresstring algorithmsrepetitive texts
0
0 comments X

The pith

Contextual pattern matching now runs in optimal O(m + occ) time using only the compressed space of a symmetric CDAWG.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows how to locate all distinct λ-length contexts around occurrences of a pattern P in a text T without expanding the text to full size. It achieves the classic optimal time bound O(m + occ) by reducing the problem to ancestor queries on a symmetric CDAWG and handling those queries with a new linear-space distance-sensitive weighted ancestor structure. The same structure also supports listing the occ answers with O(log log λ) delay after a short O(m) preprocessing step on the pattern. A reader cares because many large repetitive collections, such as genomes or versioned documents, fit comfortably in compressed space yet previously forced either slower queries or uncompressed indexes.

Core claim

The paper establishes that a symmetric CDAWG of T, together with an improved linear-space distance-sensitive weighted ancestor data structure, solves contextual pattern matching in optimal O(m + occ) time and permits enumeration of the occ solutions with O(log log λ) delay after O(m)-time preprocessing of P.

What carries the argument

The symmetric CDAWG (SCDAWG) of T, which encodes the text compactly and turns contextual matching into weighted ancestor queries answered by the new distance-sensitive structure.

If this is right

  • Any text collection that already stores a symmetric CDAWG can answer contextual queries without extra asymptotic space.
  • The occ distinct contexts around a pattern can be listed after only linear work in the pattern length.
  • The improved ancestor structure is presented as a general tool that the matching reduction relies on.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same SCDAWG-plus-ancestor combination may extend to other context-sensitive queries that reduce to path or ancestor navigation.
  • If the ancestor structure generalizes beyond CDAWGs, it could improve compressed solutions for related string problems such as approximate matching with contexts.

Load-bearing premise

The new linear-space weighted ancestor data structure delivers the query times required when run on the SCDAWG.

What would settle it

An input where the ancestor queries on the SCDAWG take more than O(1) or O(log log λ) time per output, causing the overall solution to exceed O(m + occ) or the stated enumeration delay.

Figures

Figures reproduced from arXiv: 2606.31030 by Francisco Olivares, Gonzalo Navarro.

Figure 1
Figure 1. Figure 1: Suffix tree for the string "alabaralalabarda". Edges leading to leaves show only the first symbol of their label, which continues spelling the text until the $ terminator. of nodes and edges is proportional to the number of right-maximal substrings in T, then by storing the input string T, STree(T) consumes O(n) space [PITH_FULL_IMAGE:figures/full_fig_p004_1.png] view at source ↗
Figure 2
Figure 2. Figure 2: CDAWG for the string "alabaralalabarda". Dashed edges represent leaves. The root is node 1 and the non-numbered node is the sink. Edges leading to the sink (the leaves) show only the first symbol of their label, which continues spelling the text until the $ terminator. For ease to draw, edges are not shown in sorted order of labels. The edges chosen to be “default” to build tree T in [PITH_FULL_IMAGE:figu… view at source ↗
Figure 3
Figure 3. Figure 3: SCDAWG for the string "alabaralalabarda". Black edges represent right-edges, and red edges represent left-edges. Dashed edges represent leaves. The root is node 1 and the non-numbered node is the sink. Right-edges leading to the sink show only the first symbol of their label, which continues spelling the text until the $ terminator, and left-edges leading to the sink show only the last symbol of their labe… view at source ↗
Figure 4
Figure 4. Figure 4: A possible enumerator tree T for the right-edges of the SCDAWG of the string "alabaralalabarda". The default edges chosen are those colored in brown in [PITH_FULL_IMAGE:figures/full_fig_p009_4.png] view at source ↗
read the original abstract

Contextual pattern matching is the task of, given a pattern $P[1,m]$, a context length $\lambda$, and a text $T[1,n]$, find all the $occ$ distinct contexts in which $P$ occurs in $T$, the context being the $\lambda$ symbols preceding and the $\lambda$ symbols following the occurrence; a text position where each context occurs must be output. While the problem can be solved in optimal time $O(m+occ)$ using $O(n)$-space precomputed data structures on $T$, this type of search is particularly relevant on large repetitive text collections, where $O(n)$ space can be prohibitive. We present the first optimal-time solution that runs in compressed space, namely that of a symmetric CDAWG (SCDAWG) of $T$. Further, we show how the set of $occ$ solutions can be enumerated with $O(\log\log\lambda)$ delay after $O(m)$-time preprocessing of $P$. To achieve this, we develop an improved linear-space distance-sensitive weighted ancestor data structure.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

0 major / 1 minor

Summary. The manuscript claims to give the first optimal-time O(m + occ) algorithm for contextual pattern matching (reporting distinct λ-contexts around occurrences of P in T, together with a witness position per context) that uses only the compressed space of a symmetric CDAWG of T. The solution reduces the problem to a sequence of weighted-ancestor queries on the SCDAWG; an improved linear-space distance-sensitive weighted-ancestor structure is introduced to realize the stated time bounds. In addition, the occ solutions can be enumerated with O(log log λ) delay after O(m) preprocessing of P.

Significance. If the claimed time and space bounds hold, the result would be a notable advance in compressed stringology: it matches the uncompressed optimal time for a problem that is natural on large repetitive collections while staying within the space of the SCDAWG. The new distance-sensitive weighted-ancestor structure is presented as a reusable primitive and could have independent value.

minor comments (1)
  1. The abstract states that the SCDAWG occupies 'compressed space' but does not explicitly recall its asymptotic size in terms of the number of right- and left-contexts; a one-sentence reminder would help readers who are not already familiar with CDAWGs.

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for their positive assessment of the manuscript and for recommending acceptance. We are pleased that the contribution is viewed as a notable advance in compressed stringology and that the new weighted-ancestor structure is recognized as potentially reusable.

Circularity Check

0 steps flagged

No significant circularity

full rationale

The paper describes an algorithmic construction reducing contextual pattern matching to weighted ancestor queries on an SCDAWG plus a new linear-space distance-sensitive data structure. No equations, fitted parameters, self-definitional reductions, or load-bearing self-citations appear in the abstract or described claims. The derivation is a standard algorithmic reduction to existing string data structures and one new primitive; it does not reduce any central claim to its own inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The result rests on standard properties of CDAWGs and weighted ancestor queries; no free parameters, ad-hoc axioms, or new invented entities are introduced in the abstract.

axioms (2)
  • domain assumption The symmetric CDAWG of T supports the required ancestor and context queries in the time bounds needed for optimality.
    Central to reducing contextual matching to SCDAWG navigation.
  • domain assumption An improved linear-space distance-sensitive weighted ancestor structure exists and can be built on the SCDAWG.
    Explicitly named as the technical enabler for the delay bounds.

pith-pipeline@v0.9.1-grok · 5708 in / 1359 out tokens · 44713 ms · 2026-07-01T03:43:31.469213+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

21 extracted references · 7 canonical work pages · 1 internal anchor

  1. [1]

    Computing matching statistics on wheeler dfas

    Abedin, P., Chubet, O., Gibney, D., Thankachan, S.V.: Contextual pattern match- ing in less space. In: 2023 Data Compression Conference (DCC). pp. 160–167 (2023). https://doi.org/10.1109/DCC55655.2023.00024

  2. [2]

    In: Proc

    Adamson, D., Gawrychowski, P., Manea, F.: Enumerating m-length walks in di- rected graphs with constant delay. In: Proc. 16th Latin American Symposium on Theoretical Informatics, Part I. pp. 35–50 (2024)

  3. [3]

    Predecessor search with distance-sensitive query time

    Belazzougui, D., Boldi, P., Vigna, S.: Predecessor search with distance-sensitive query time. CoRR 1209.5441 (2012)

  4. [4]

    In: Fici, G., Sciortino, M., Venturini, R

    Belazzougui, D., Cunial, F.: Fast label extraction in the cdawg. In: Fici, G., Sciortino, M., Venturini, R. (eds.) String Processing and Information Retrieval. pp. 161–175. Springer International Publishing, Cham (2017)

  5. [5]

    In: Proc

    Belazzougui, D., Kosolobov, D., Puglisi, S.J., Raman, R.: Weighted ancestors in suffix trees revisited. In: Proc. 32nd Annual Symposium on Combinatorial Pattern Matching (CPM). pp. 8:1–8:15 (2021). https://doi.org/10.4230/LIPIcs.CPM.2021. 8

  6. [6]

    Journal of the ACM 34(3), 578–595 (1987)

    Blumer, A., Blumer, J., Haussler, D., McConnell, R.M., Ehrenfeucht, A.: Complete inverted files for efficient text retrieval and analysis. Journal of the ACM 34(3), 578–595 (1987)

  7. [7]

    In: Proc

    Farach, M., Muthukrishnan, S.: Perfect hashing for strings: Formalization and al- gorithms. In: Proc. 7th Annual Symposium on Combinatorial Pattern Matching (CPM). pp. 130–140 (1996)

  8. [8]

    Journal of Computer and Systems Science 47(3), 424–436 (1993)

    Fredman, M., Willard, D.: Surpassing the information theoretic bound with fusion trees. Journal of Computer and Systems Science 47(3), 424–436 (1993)

  9. [9]

    Journal of the ACM 31(3), 538–544 (1984)

    Fredman, M.L., Komlós, J., Szemerédi, E.: Storing a sparse table with O(1) worst case access time. Journal of the ACM 31(3), 538–544 (1984)

  10. [10]

    In: Proc

    Gawrychowski, P., Lewenstein, M., Nicholson, P.: Weighted ancestors in suffix trees. In: Proc. 22nd Annual European Symposium on Algorithms (ESA). pp. 455–466 (2014). https://doi.org/10.1007/978-3-662-44777-2_38 Optimal-Time Contextual Pattern Matching in Compressed Space 13

  11. [11]

    In: Proc

    Inenaga, S., Hoshino, H., Shinohara, A., Takeda, M., Arikawa, S.: On-line con- struction of symmetric compact directed acyclic word graphs. In: Proc. 8th Sympo- sium on String Processing and Information Retrieval (SPIRE). pp. 96–110 (2001). https://doi.org/10.1109/SPIRE.2001.989743

  12. [12]

    Kopelowitz, T., Lewenstein, M.: Dynamic weighted ancestors. pp. 565–574 (2007)

  13. [13]

    Manber, U., Myers, E.W.: Suffix arrays: A new method for on-line string searches. SIAM J. Comput. 22(5), 935–948 (1993). https://doi.org/10.1137/0222058

  14. [14]

    Journal of the ACM 23(2), 262–272 (1976)

    McCreight, E.: A space-economical suffix tree construction algorithm. Journal of the ACM 23(2), 262–272 (1976)

  15. [15]

    SIAM Journal on Computing 31(3), 762–776 (2001)

    Munro, J.I., Raman, V.: Succinct representation of balanced parentheses and static trees. SIAM Journal on Computing 31(3), 762–776 (2001)

  16. [16]

    ACM Computing Surveys 54(2), article 26 (2021)

    Navarro, G.: Indexing highly repetitive string collections, part II: Compressed in- dexes. ACM Computing Surveys 54(2), article 26 (2021)

  17. [17]

    In: Proc

    Navarro, G.: Contextual pattern matching. In: Proc. String Processing and Infor- mation Retrieval. pp. 3–10 (2020)

  18. [18]

    CoRR 2010.07076 (2020), https:// arxiv.org/abs/2010.07076

    Navarro, G.: Contextual pattern matching. CoRR 2010.07076 (2020), https:// arxiv.org/abs/2010.07076

  19. [19]

    ACM Computing Surveys 54(2), article 29 (2021)

    Navarro, G.: Indexing highly repetitive string collections, part I: Repetitiveness measures. ACM Computing Surveys 54(2), article 29 (2021)

  20. [20]

    Algorithmica 14(3), 249–260 (1995)

    Ukkonen, E.: On-line construction of suffix trees. Algorithmica 14(3), 249–260 (1995)

  21. [21]

    In: Proc

    Weiner, P.: Linear pattern matching algorithm. In: Proc. 14th Annual IEEE Sym- posium on Switching and Automata Theory. pp. 1–11 (1973)