Smallest Suffixient Sets: Effectiveness, Resilience, and Calculation
Pith reviewed 2026-05-19 10:59 UTC · model grok-4.3
The pith
The size of the smallest suffixient set is at most linear in the number of Burrows-Wheeler runs and strictly smaller than the smallest lexicographic parse on some string families.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The paper establishes that the smallest suffixient set size χ satisfies χ = O(r) where r is the number of runs in the Burrows-Wheeler transform, that there exist infinite string families with χ = o(v) for v the size of the smallest lexicographic parse, and that χ ≤ σ + 2 on episturmian words over an alphabet of size σ, with the exact value χ ≤ 4 and a complete characterization of the two possible smallest sets on Fibonacci words.
What carries the argument
The size χ of the smallest suffixient set, the minimal collection of positions whose suffixes allow reconstruction of all necessary matching information for the full string.
If this is right
- Smallest suffixient sets can be maintained with simple linear-time online algorithms because appending or prepending a character increases χ by at most two.
- Reversing a string may increase χ by up to O(n) yet the ratio between χ of a string and its reverse never exceeds two.
- Arbitrary edit operations or rotations can increase χ by an additive Ω(√n) in the worst case.
- χ is incomparable to nearly all repetitiveness measures based on explicit copy-paste mechanisms.
Where Pith is reading between the lines
- Because χ stays small on episturmian and Fibonacci words, suffixient sets may yield compact indexes for morphic or Sturmian sequences that arise in combinatorics on words.
- The O(r) bound together with the linear-time construction suggests suffixient sets could be combined with BWT-based indexes to reduce space on highly repetitive data without sacrificing query support.
- The Ω(√n) sensitivity to edits implies that dynamic maintenance of exact smallest suffixient sets will require sublinear update structures only when the number of edits remains small.
- Empirical measurement of χ on DNA or versioned-text collections could reveal how tightly the theoretical O(r) relation holds in practice.
Load-bearing premise
The practical value of a suffixient set depends on the availability of random access to the underlying string so that the set can be used for pattern matching.
What would settle it
A concrete string family in which the size of the smallest suffixient set grows faster than linearly with the number of Burrows-Wheeler runs would refute the claimed O(r) bound.
Figures
read the original abstract
A suffixient set is a novel combinatorial object that captures the essential information of repetitive strings in a way that, provided with a random access mechanism, supports various forms of pattern matching. In this paper, we study the size $\chi$ of the smallest suffixient set as a repetitiveness measure. First, we study its sensitivity to various string operations. We show that $\chi$ cannot increase by more than 2 after appending or prepending a character to the string. As a consequence, we are able to give simple linear-time online algorithms to compute smallest suffixient sets. We also show that, although reversing the string can increase $\chi$ by an arbitrary $O(n)$ value, it always holds $\chi(T)/\chi(T^R)\le 2$. We also prove lower and upper bounds for the additive or multiplicative increase of $\chi$ after applying arbitrary edit operations, or rotating the text. In particular, we show that the additive increase can be as large as $\Omega(\sqrt{n})$ for all those operations. Secondly, we place $\chi$ in between known repetitiveness measures. In particular, we show $\chi = O(r)$ (where $r$ is the number of runs in the Burrows-Wheeler Transform of the string), that there are string families where $\chi=o(v)$ (where $v$ is the size of the smallext lexicographic parse of the string), and that $\chi$ is uncomparable to almost all reachable measures based on copy-paste mechanisms. In passing, we give precise bounds for $\chi$ for some relevant string families, for example $\chi \le \sigma+2$ on episturmian words over alphabets of size $\sigma$ (e.g., $\chi \le 4$ on Fibonacci strings, for which we precisely characterize the only two smallest suffixient sets).
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper defines a suffixient set as a combinatorial object capturing essential information in repetitive strings to support pattern matching given random access. It studies the minimal size χ of such sets as a repetitiveness measure. The authors prove sensitivity bounds under string operations, including that appending or prepending a character increases χ by at most 2 (enabling linear-time online algorithms), that χ(T)/χ(T^R) ≤ 2 despite possible O(n) increases on reversal, and that the additive increase under edits or rotations can be Ω(√n). They also relate χ to other measures by showing χ = O(r) (r = BWT runs), χ = o(v) for some families (v = smallest lexicographic parse size), incomparability to most copy-paste-based measures, and χ ≤ σ + 2 for episturmian words (with exact characterization of the two smallest suffixient sets on Fibonacci words).
Significance. If the bounds hold, the work introduces a new repetitiveness measure with favorable sensitivity and online computability properties that complement existing measures. The explicit linear-time algorithms, the O(r) relation to BWT runs, the o(v) separation from lexicographic parses, and the concrete bounds for episturmian and Fibonacci families are strengths that could inform compressed indexing and combinatorics on words.
major comments (2)
- [§3] §3 (sensitivity results): the central claim that χ increases by at most 2 on append/prepend rests on an explicit construction of a new suffixient set from the old one; the manuscript should include a self-contained verification that the constructed set remains suffixient under the random-access assumption.
- [§4.1] §4.1 (comparison to r): the proof that χ = O(r) is load-bearing for the positioning of the measure; the argument should state the multiplicative constant explicitly rather than leaving it implicit in the run-counting construction.
minor comments (2)
- [Abstract] Abstract: 'smallext' is a typographical error and should read 'smallest'.
- [Abstract] Abstract and §4.2: the statement that χ is 'uncomparable to almost all reachable measures based on copy-paste mechanisms' would benefit from a brief pointer to the specific families or theorem establishing the incomparability.
Simulated Author's Rebuttal
We thank the referee for their careful reading, positive assessment, and constructive suggestions. We address the major comments point by point below and have incorporated revisions to improve clarity and rigor.
read point-by-point responses
-
Referee: [§3] §3 (sensitivity results): the central claim that χ increases by at most 2 on append/prepend rests on an explicit construction of a new suffixient set from the old one; the manuscript should include a self-contained verification that the constructed set remains suffixient under the random-access assumption.
Authors: We agree that a self-contained verification strengthens the presentation. In the revised manuscript we have added a dedicated paragraph immediately following the construction that verifies suffixiency directly from the definition: for any pattern P, if an occurrence exists in the extended string, the random-access oracle on the original positions (combined with the at-most-two new suffixient positions) suffices to report a match without external assumptions. The argument enumerates the possible cases for where the new occurrence starts or ends relative to the appended/prepended character. revision: yes
-
Referee: [§4.1] §4.1 (comparison to r): the proof that χ = O(r) is load-bearing for the positioning of the measure; the argument should state the multiplicative constant explicitly rather than leaving it implicit in the run-counting construction.
Authors: We thank the referee for highlighting this. The run-counting argument in the original proof yields χ ≤ 3r for all strings (with the factor 3 arising from at most two suffixient positions per BWT run plus a constant overhead for the first and last runs). We have revised §4.1 to state the bound explicitly as χ ≤ 3r and to include a short sentence deriving the constant from the per-run contribution. revision: yes
Circularity Check
No significant circularity in combinatorial bounds on χ
full rationale
The paper defines χ as the size of the smallest suffixient set and establishes worst-case bounds such as χ = O(r) (r = BWT runs), existence of families with χ = o(v) (v = lex-parse size), and χ ≤ σ+2 on episturmian words. These follow from explicit constructions and direct comparisons to independently defined measures r and v. Sensitivity results (e.g., χ increases by at most 2 on append/prepend, χ(T)/χ(T^R) ≤ 2) are proven via case analysis on string operations without reducing the target quantity to a fitted parameter or self-citation chain. All claims are self-contained combinatorial statements verifiable against external string families and run-counting arguments.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption A suffixient set captures the essential information of repetitive strings in a way that, provided with a random access mechanism, supports various forms of pattern matching.
invented entities (1)
-
suffixient set
no independent evidence
Lean theorems connected to this paper
-
IndisputableMonolith/Foundation/RealityFromDistinction.leanreality_from_one_distinction unclear?
unclearRelation between the paper passage and the cited Recognition theorem.
We show that χ = O(r) ... χ ≤ σ+2 on episturmian words
-
IndisputableMonolith/Foundation/AlexanderDuality.leanalexander_duality_circle_linking unclear?
unclearRelation between the paper passage and the cited Recognition theorem.
χ is uncomparable to almost all reachable measures based on copy-paste mechanisms
What do these tags mean?
- matches
- The paper's claim is directly supported by a theorem in the formal canon.
- supports
- The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends
- The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses
- The paper appears to rely on the theorem as machinery.
- contradicts
- The paper's claim conflicts with a theorem or certificate in the canon.
- unclear
- Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Forward citations
Cited by 2 Pith papers
-
Smallest suffixient set maintenance in near-real-time
The smallest suffixient set can be maintained online in polyloglog time per letter in either left-to-right or right-to-left direction via Weiner's suffix tree primitives.
-
Compressing Suffix Trees by Path Decompositions
Introduces a suffix tree path decomposition technique that yields a suffix array sample of size at most r, improving the prior 2r bound for compressed indexes in external memory.
Reference graph
Works this paper leans on
-
[1]
Information and Computation291, 104999 (2023)
Akagi, T., Funakoshi, M., Inenaga, S.: Sensitivity of string compressors and repet- itiveness measures. Information and Computation291, 104999 (2023)
work page 2023
-
[2]
Bruijn, de, N.: A combinatorial problem. Proceedings of the Section of Sciences of the Koninklijke Nederlandse Akademie van Wetenschappen te Amsterdam49(7), 758–764 (1946)
work page 1946
-
[3]
Burrows, M., Wheeler, D.: A block sorting lossless data compression algorithm. Tech. Rep. 124, Digital Equipment Corporation (1994)
work page 1994
-
[4]
Cenzato, D., Depuydt, L., Gagie, T., Kim, S.H., Manzini, G., Olivares, F., Prezza, N.: Suffixient arrays: a new efficient suffix array compression technique. CoRR 2407.18753 (2025)
- [5]
-
[6]
Suf- fixient sets.arXiv preprint 2312.01359v3, 2024
Depuydt, L., Gagie, T., Langmead, B., Manzini, G., Prezza, N.: Suffixient sets. CoRR 2312.01359 (2023)
-
[7]
Information Processing Letters 55(4), 217–221 (1995)
Droubay, X.: Palindromes in the Fibonacci word. Information Processing Letters 55(4), 217–221 (1995)
work page 1995
-
[8]
Theoretical Computer Science255(1), 539–553 (2001)
Droubay, X., Justin, J., Pirillo, G.: Episturmian words and some constructions of de Luca and Rauzy. Theoretical Computer Science255(1), 539–553 (2001)
work page 2001
- [9]
-
[10]
Fici, G., Romana, G., Sciortino, M., Urbina, C.: Morphisms and BWT-run sensi- tivity. CoRR2504.17443 (2025)
-
[11]
SIAM Review 24(2), 195–221 (1982)
Fredricksen, H.: A survey of full length nonlinear shift register cycle algorithms. SIAM Review 24(2), 195–221 (1982)
work page 1982
-
[12]
Discrete Mathematics341(11), 2977– 2987 (2018)
Gabric, D., Sawada, J., Williams, A., Wong, D.: A framework for constructing de Bruijn sequences via simple successor rules. Discrete Mathematics341(11), 2977– 2987 (2018)
work page 2018
- [13]
-
[14]
Giuliani, S., Inenaga, S., Lipták, Z., Prezza, N., Sciortino, M., Toffanello, A.: Novel results on the number of runs of the Burrows-Wheeler-Transform. In: Proc. 47th International Conference on Current Trends in Theory and Practice of Computer Science (SOFSEM 2021). Lecture Notes in Computer Science, vol. 12607, pp. 249–
work page 2021
-
[15]
Theory of Computing Systems 69(2), 19 (2025)
Giuliani, S., Inenaga, S., Lipták, Z., Romana, G., Sciortino, M., Urbina, C.: Bit catastrophes for the Burrows-Wheeler transform. Theory of Computing Systems 69(2), 19 (2025)
work page 2025
-
[16]
RAIRO - Theoretical Informatics and Applications 43(3), 403–442 (2009)
Glen, A., Justin, J.: Episturmian words: a survey. RAIRO - Theoretical Informatics and Applications 43(3), 403–442 (2009)
work page 2009
-
[17]
Communications of the ACM65(6), 91–98 (2022) 14 G
Kempa, D., Kociumaka, T.: Resolution of the Burrows-Wheeler transform conjec- ture. Communications of the ACM65(6), 91–98 (2022) 14 G. Navarro, G. Romana, and C. Urbina
work page 2022
- [18]
-
[19]
Theo- retical Computer Science483, 115–133 (2013)
Kreft, S., Navarro, G.: On compressing and indexing repetitive sequences. Theo- retical Computer Science483, 115–133 (2013)
work page 2013
-
[20]
IEEE Transactions on Information Theory 22(1), 75–81 (1976)
Lempel, A., Ziv, J.: On the complexity of finite sequences. IEEE Transactions on Information Theory 22(1), 75–81 (1976)
work page 1976
-
[21]
Lothaire, M.: Algebraic Combinatorics on Words. Encyclopedia of Mathematics and its Applications, Cambridge University Press, New York, NY, USA (2002)
work page 2002
-
[22]
Information Pro- cessing Letters 12(4), 193–195 (1981)
de Luca, A.: A combinatorial property of the Fibonacci words. Information Pro- cessing Letters 12(4), 193–195 (1981)
work page 1981
-
[23]
Theoretical Computer Science850, 236–248 (2021)
Mantaci, S., Restivo, A., Romana, G., Rosone, G., Sciortino, M.: A combinatorial view on string attractors. Theoretical Computer Science850, 236–248 (2021)
work page 2021
-
[24]
ACM Computing Surveys54(2), article 29 (2021)
Navarro, G.: Indexing highly repetitive string collections, part I: Repetitiveness measures. ACM Computing Surveys54(2), article 29 (2021)
work page 2021
-
[25]
ACM Computing Surveys54(2), article 26 (2021)
Navarro, G.: Indexing highly repetitive string collections, part II: Compressed in- dexes. ACM Computing Surveys54(2), article 26 (2021)
work page 2021
-
[26]
Navarro, G.: Indexing highly repetitive string collections. CoRR 2004.02781 (2022)
-
[27]
IEEE Transactions on Information Theory67(2), 1008–1026 (2021)
Navarro,G.,Ochoa,C.,Prezza,N.:Ontheapproximationratiooforderedparsings. IEEE Transactions on Information Theory67(2), 1008–1026 (2021)
work page 2021
-
[28]
Acta Informatica 62(1), 14 (2025)
Navarro, G., Olivares, F., Urbina, C.: Generalized straight-line programs. Acta Informatica 62(1), 14 (2025)
work page 2025
-
[29]
The- oretical Computer Science1043, 115259 (2025)
Navarro, G., Urbina, C.: Repetitiveness measures based on string morphisms. The- oretical Computer Science1043, 115259 (2025)
work page 2025
-
[30]
Journal of the ACM29(4), 928–951 (1982)
Storer, J.A., Szymanski, T.G.: Data compression via textual substitution. Journal of the ACM29(4), 928–951 (1982)
work page 1982
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.