pith. sign in

arxiv: 2506.05638 · v5 · submitted 2025-06-05 · 💻 cs.FL · cs.DS· math.CO

Smallest Suffixient Sets: Effectiveness, Resilience, and Calculation

Pith reviewed 2026-05-19 10:59 UTC · model grok-4.3

classification 💻 cs.FL cs.DSmath.CO
keywords suffixient setsrepetitiveness measuresBurrows-Wheeler transformpattern matchingstring operationsepisturmian wordslexicographic parseFibonacci words
0
0 comments X

The pith

The size of the smallest suffixient set is at most linear in the number of Burrows-Wheeler runs and strictly smaller than the smallest lexicographic parse on some string families.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

A suffixient set captures the minimal information from a repetitive string that, together with random access to the text, suffices for several pattern-matching tasks. This paper measures the size χ of the smallest such set and shows that χ grows by at most two when a character is appended or prepended, which immediately yields simple linear-time online algorithms for computing it. The authors further locate χ among other repetitiveness measures by proving χ equals O(r) for the Burrows-Wheeler run count r, by exhibiting families where χ is asymptotically smaller than the smallest lexicographic parse v, and by proving tight upper bounds such as χ ≤ σ + 2 on episturmian words.

Core claim

The paper establishes that the smallest suffixient set size χ satisfies χ = O(r) where r is the number of runs in the Burrows-Wheeler transform, that there exist infinite string families with χ = o(v) for v the size of the smallest lexicographic parse, and that χ ≤ σ + 2 on episturmian words over an alphabet of size σ, with the exact value χ ≤ 4 and a complete characterization of the two possible smallest sets on Fibonacci words.

What carries the argument

The size χ of the smallest suffixient set, the minimal collection of positions whose suffixes allow reconstruction of all necessary matching information for the full string.

If this is right

  • Smallest suffixient sets can be maintained with simple linear-time online algorithms because appending or prepending a character increases χ by at most two.
  • Reversing a string may increase χ by up to O(n) yet the ratio between χ of a string and its reverse never exceeds two.
  • Arbitrary edit operations or rotations can increase χ by an additive Ω(√n) in the worst case.
  • χ is incomparable to nearly all repetitiveness measures based on explicit copy-paste mechanisms.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Because χ stays small on episturmian and Fibonacci words, suffixient sets may yield compact indexes for morphic or Sturmian sequences that arise in combinatorics on words.
  • The O(r) bound together with the linear-time construction suggests suffixient sets could be combined with BWT-based indexes to reduce space on highly repetitive data without sacrificing query support.
  • The Ω(√n) sensitivity to edits implies that dynamic maintenance of exact smallest suffixient sets will require sublinear update structures only when the number of edits remains small.
  • Empirical measurement of χ on DNA or versioned-text collections could reveal how tightly the theoretical O(r) relation holds in practice.

Load-bearing premise

The practical value of a suffixient set depends on the availability of random access to the underlying string so that the set can be used for pattern matching.

What would settle it

A concrete string family in which the size of the smallest suffixient set grows faster than linearly with the number of Burrows-Wheeler runs would refute the claimed O(r) bound.

Figures

Figures reproduced from arXiv: 2506.05638 by Cristian Urbina, Giuseppe Romana, Gonzalo Navarro, Hiroto Fujimaru.

Figure 1
Figure 1. Figure 1: Relations between relevant repetitiveness measures and how our results place χ among them. An arrow µ1 → µ2 means that µ1 = O(µ2) for all strings and, save for c → z, there is a string family where µ1 = o(µ2). The dotted arrows mark only this last condition, so they are not transitive. Measures in light gray nodes are known to be reachable; those in dark gray are accessible and searchable; and r is hatched… view at source ↗
read the original abstract

A suffixient set is a novel combinatorial object that captures the essential information of repetitive strings in a way that, provided with a random access mechanism, supports various forms of pattern matching. In this paper, we study the size $\chi$ of the smallest suffixient set as a repetitiveness measure. First, we study its sensitivity to various string operations. We show that $\chi$ cannot increase by more than 2 after appending or prepending a character to the string. As a consequence, we are able to give simple linear-time online algorithms to compute smallest suffixient sets. We also show that, although reversing the string can increase $\chi$ by an arbitrary $O(n)$ value, it always holds $\chi(T)/\chi(T^R)\le 2$. We also prove lower and upper bounds for the additive or multiplicative increase of $\chi$ after applying arbitrary edit operations, or rotating the text. In particular, we show that the additive increase can be as large as $\Omega(\sqrt{n})$ for all those operations. Secondly, we place $\chi$ in between known repetitiveness measures. In particular, we show $\chi = O(r)$ (where $r$ is the number of runs in the Burrows-Wheeler Transform of the string), that there are string families where $\chi=o(v)$ (where $v$ is the size of the smallext lexicographic parse of the string), and that $\chi$ is uncomparable to almost all reachable measures based on copy-paste mechanisms. In passing, we give precise bounds for $\chi$ for some relevant string families, for example $\chi \le \sigma+2$ on episturmian words over alphabets of size $\sigma$ (e.g., $\chi \le 4$ on Fibonacci strings, for which we precisely characterize the only two smallest suffixient sets).

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper defines a suffixient set as a combinatorial object capturing essential information in repetitive strings to support pattern matching given random access. It studies the minimal size χ of such sets as a repetitiveness measure. The authors prove sensitivity bounds under string operations, including that appending or prepending a character increases χ by at most 2 (enabling linear-time online algorithms), that χ(T)/χ(T^R) ≤ 2 despite possible O(n) increases on reversal, and that the additive increase under edits or rotations can be Ω(√n). They also relate χ to other measures by showing χ = O(r) (r = BWT runs), χ = o(v) for some families (v = smallest lexicographic parse size), incomparability to most copy-paste-based measures, and χ ≤ σ + 2 for episturmian words (with exact characterization of the two smallest suffixient sets on Fibonacci words).

Significance. If the bounds hold, the work introduces a new repetitiveness measure with favorable sensitivity and online computability properties that complement existing measures. The explicit linear-time algorithms, the O(r) relation to BWT runs, the o(v) separation from lexicographic parses, and the concrete bounds for episturmian and Fibonacci families are strengths that could inform compressed indexing and combinatorics on words.

major comments (2)
  1. [§3] §3 (sensitivity results): the central claim that χ increases by at most 2 on append/prepend rests on an explicit construction of a new suffixient set from the old one; the manuscript should include a self-contained verification that the constructed set remains suffixient under the random-access assumption.
  2. [§4.1] §4.1 (comparison to r): the proof that χ = O(r) is load-bearing for the positioning of the measure; the argument should state the multiplicative constant explicitly rather than leaving it implicit in the run-counting construction.
minor comments (2)
  1. [Abstract] Abstract: 'smallext' is a typographical error and should read 'smallest'.
  2. [Abstract] Abstract and §4.2: the statement that χ is 'uncomparable to almost all reachable measures based on copy-paste mechanisms' would benefit from a brief pointer to the specific families or theorem establishing the incomparability.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their careful reading, positive assessment, and constructive suggestions. We address the major comments point by point below and have incorporated revisions to improve clarity and rigor.

read point-by-point responses
  1. Referee: [§3] §3 (sensitivity results): the central claim that χ increases by at most 2 on append/prepend rests on an explicit construction of a new suffixient set from the old one; the manuscript should include a self-contained verification that the constructed set remains suffixient under the random-access assumption.

    Authors: We agree that a self-contained verification strengthens the presentation. In the revised manuscript we have added a dedicated paragraph immediately following the construction that verifies suffixiency directly from the definition: for any pattern P, if an occurrence exists in the extended string, the random-access oracle on the original positions (combined with the at-most-two new suffixient positions) suffices to report a match without external assumptions. The argument enumerates the possible cases for where the new occurrence starts or ends relative to the appended/prepended character. revision: yes

  2. Referee: [§4.1] §4.1 (comparison to r): the proof that χ = O(r) is load-bearing for the positioning of the measure; the argument should state the multiplicative constant explicitly rather than leaving it implicit in the run-counting construction.

    Authors: We thank the referee for highlighting this. The run-counting argument in the original proof yields χ ≤ 3r for all strings (with the factor 3 arising from at most two suffixient positions per BWT run plus a constant overhead for the first and last runs). We have revised §4.1 to state the bound explicitly as χ ≤ 3r and to include a short sentence deriving the constant from the per-run contribution. revision: yes

Circularity Check

0 steps flagged

No significant circularity in combinatorial bounds on χ

full rationale

The paper defines χ as the size of the smallest suffixient set and establishes worst-case bounds such as χ = O(r) (r = BWT runs), existence of families with χ = o(v) (v = lex-parse size), and χ ≤ σ+2 on episturmian words. These follow from explicit constructions and direct comparisons to independently defined measures r and v. Sensitivity results (e.g., χ increases by at most 2 on append/prepend, χ(T)/χ(T^R) ≤ 2) are proven via case analysis on string operations without reducing the target quantity to a fitted parameter or self-citation chain. All claims are self-contained combinatorial statements verifiable against external string families and run-counting arguments.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 1 invented entities

The central claims rest on the definition of suffixient sets as objects that capture essential repetitive information when random access is supplied. No numeric free parameters appear; the only invented entity is the suffixient set itself, introduced by definition with no external falsifiable prediction supplied in the abstract.

axioms (1)
  • domain assumption A suffixient set captures the essential information of repetitive strings in a way that, provided with a random access mechanism, supports various forms of pattern matching.
    This is the foundational definition of the new combinatorial object on which all subsequent size and sensitivity results depend.
invented entities (1)
  • suffixient set no independent evidence
    purpose: Combinatorial object that captures essential repetitive information for pattern matching under random access
    New object defined in the paper; no independent evidence or falsifiable prediction outside the definition is given.

pith-pipeline@v0.9.0 · 5886 in / 1538 out tokens · 76457 ms · 2026-05-19T10:59:09.352041+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 2 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Smallest suffixient set maintenance in near-real-time

    cs.DS 2026-04 unverdicted novelty 6.0

    The smallest suffixient set can be maintained online in polyloglog time per letter in either left-to-right or right-to-left direction via Weiner's suffix tree primitives.

  2. Compressing Suffix Trees by Path Decompositions

    cs.DS 2025-06 unverdicted novelty 6.0

    Introduces a suffix tree path decomposition technique that yields a suffix array sample of size at most r, improving the prior 2r bound for compressed indexes in external memory.

Reference graph

Works this paper leans on

30 extracted references · 30 canonical work pages · cited by 2 Pith papers

  1. [1]

    Information and Computation291, 104999 (2023)

    Akagi, T., Funakoshi, M., Inenaga, S.: Sensitivity of string compressors and repet- itiveness measures. Information and Computation291, 104999 (2023)

  2. [2]

    Proceedings of the Section of Sciences of the Koninklijke Nederlandse Akademie van Wetenschappen te Amsterdam49(7), 758–764 (1946)

    Bruijn, de, N.: A combinatorial problem. Proceedings of the Section of Sciences of the Koninklijke Nederlandse Akademie van Wetenschappen te Amsterdam49(7), 758–764 (1946)

  3. [3]

    Burrows, M., Wheeler, D.: A block sorting lossless data compression algorithm. Tech. Rep. 124, Digital Equipment Corporation (1994)

  4. [4]

    Suffixient arrays: a new efficient suffix array compression technique.arXiv preprint 2407.18753v2, 2025

    Cenzato, D., Depuydt, L., Gagie, T., Kim, S.H., Manzini, G., Olivares, F., Prezza, N.: Suffixient arrays: a new efficient suffix array compression technique. CoRR 2407.18753 (2025)

  5. [5]

    In: Proc

    Cenzato, D., Olivares, F., Prezza, N.: On computing the smallest suffixient set. In: Proc. 31st International Symposium on String Processing and Information Re- trieval (SPIRE 2024). Lecture Notes in Computer Science, vol. 14899, pp. 73–87. Springer (2024)

  6. [6]

    Suf- fixient sets.arXiv preprint 2312.01359v3, 2024

    Depuydt, L., Gagie, T., Langmead, B., Manzini, G., Prezza, N.: Suffixient sets. CoRR 2312.01359 (2023)

  7. [7]

    Information Processing Letters 55(4), 217–221 (1995)

    Droubay, X.: Palindromes in the Fibonacci word. Information Processing Letters 55(4), 217–221 (1995)

  8. [8]

    Theoretical Computer Science255(1), 539–553 (2001)

    Droubay, X., Justin, J., Pirillo, G.: Episturmian words and some constructions of de Luca and Rauzy. Theoretical Computer Science255(1), 539–553 (2001)

  9. [9]

    In: Proc

    Fici, G., Romana, G., Sciortino, M., Urbina, C.: On the impact of morphisms on BWT-runs. In: Proc. 34th Annual Symposium on Combinatorial Pattern Matching (CPM 2023). LIPIcs, vol. 259, pp. 10:1–10:18. Schloss Dagstuhl - Leibniz-Zentrum für Informatik (2023)

  10. [10]

    CoRR2504.17443 (2025)

    Fici, G., Romana, G., Sciortino, M., Urbina, C.: Morphisms and BWT-run sensi- tivity. CoRR2504.17443 (2025)

  11. [11]

    SIAM Review 24(2), 195–221 (1982)

    Fredricksen, H.: A survey of full length nonlinear shift register cycle algorithms. SIAM Review 24(2), 195–221 (1982)

  12. [12]

    Discrete Mathematics341(11), 2977– 2987 (2018)

    Gabric, D., Sawada, J., Williams, A., Wong, D.: A framework for constructing de Bruijn sequences via simple successor rules. Discrete Mathematics341(11), 2977– 2987 (2018)

  13. [13]

    In: Proc

    Gawrychowski, P., Kosche, M., Manea, F.: On the number of factors in the LZ- end factorization. In: Proc. 30th International Symposium on String Processing and Information Retrieval (SPIRE 2023). Lecture Notes in Computer Science, vol. 14240, pp. 253–259. Springer (2023)

  14. [14]

    In: Proc

    Giuliani, S., Inenaga, S., Lipták, Z., Prezza, N., Sciortino, M., Toffanello, A.: Novel results on the number of runs of the Burrows-Wheeler-Transform. In: Proc. 47th International Conference on Current Trends in Theory and Practice of Computer Science (SOFSEM 2021). Lecture Notes in Computer Science, vol. 12607, pp. 249–

  15. [15]

    Theory of Computing Systems 69(2), 19 (2025)

    Giuliani, S., Inenaga, S., Lipták, Z., Romana, G., Sciortino, M., Urbina, C.: Bit catastrophes for the Burrows-Wheeler transform. Theory of Computing Systems 69(2), 19 (2025)

  16. [16]

    RAIRO - Theoretical Informatics and Applications 43(3), 403–442 (2009)

    Glen, A., Justin, J.: Episturmian words: a survey. RAIRO - Theoretical Informatics and Applications 43(3), 403–442 (2009)

  17. [17]

    Communications of the ACM65(6), 91–98 (2022) 14 G

    Kempa, D., Kociumaka, T.: Resolution of the Burrows-Wheeler transform conjec- ture. Communications of the ACM65(6), 91–98 (2022) 14 G. Navarro, G. Romana, and C. Urbina

  18. [18]

    In: Proc

    Kempa, D., Prezza, N.: At the roots of dictionary compression: String attractors. In: Proc. 50th Annual ACM Symposium on the Theory of Computing (STOC 2018). pp. 827–840. ACM (2018)

  19. [19]

    Theo- retical Computer Science483, 115–133 (2013)

    Kreft, S., Navarro, G.: On compressing and indexing repetitive sequences. Theo- retical Computer Science483, 115–133 (2013)

  20. [20]

    IEEE Transactions on Information Theory 22(1), 75–81 (1976)

    Lempel, A., Ziv, J.: On the complexity of finite sequences. IEEE Transactions on Information Theory 22(1), 75–81 (1976)

  21. [21]

    Encyclopedia of Mathematics and its Applications, Cambridge University Press, New York, NY, USA (2002)

    Lothaire, M.: Algebraic Combinatorics on Words. Encyclopedia of Mathematics and its Applications, Cambridge University Press, New York, NY, USA (2002)

  22. [22]

    Information Pro- cessing Letters 12(4), 193–195 (1981)

    de Luca, A.: A combinatorial property of the Fibonacci words. Information Pro- cessing Letters 12(4), 193–195 (1981)

  23. [23]

    Theoretical Computer Science850, 236–248 (2021)

    Mantaci, S., Restivo, A., Romana, G., Rosone, G., Sciortino, M.: A combinatorial view on string attractors. Theoretical Computer Science850, 236–248 (2021)

  24. [24]

    ACM Computing Surveys54(2), article 29 (2021)

    Navarro, G.: Indexing highly repetitive string collections, part I: Repetitiveness measures. ACM Computing Surveys54(2), article 29 (2021)

  25. [25]

    ACM Computing Surveys54(2), article 26 (2021)

    Navarro, G.: Indexing highly repetitive string collections, part II: Compressed in- dexes. ACM Computing Surveys54(2), article 26 (2021)

  26. [26]

    CoRR 2004.02781 (2022)

    Navarro, G.: Indexing highly repetitive string collections. CoRR 2004.02781 (2022)

  27. [27]

    IEEE Transactions on Information Theory67(2), 1008–1026 (2021)

    Navarro,G.,Ochoa,C.,Prezza,N.:Ontheapproximationratiooforderedparsings. IEEE Transactions on Information Theory67(2), 1008–1026 (2021)

  28. [28]

    Acta Informatica 62(1), 14 (2025)

    Navarro, G., Olivares, F., Urbina, C.: Generalized straight-line programs. Acta Informatica 62(1), 14 (2025)

  29. [29]

    The- oretical Computer Science1043, 115259 (2025)

    Navarro, G., Urbina, C.: Repetitiveness measures based on string morphisms. The- oretical Computer Science1043, 115259 (2025)

  30. [30]

    Journal of the ACM29(4), 928–951 (1982)

    Storer, J.A., Szymanski, T.G.: Data compression via textual substitution. Journal of the ACM29(4), 928–951 (1982)