String Representation in Suffixient Set Size Space

Hideo Bannai; Hiroki Shibata

arxiv: 2604.04377 · v3 · submitted 2026-04-06 · 💻 cs.DS

String Representation in Suffixient Set Size Space

Hiroki Shibata , Hideo Bannai This is my paper

Pith reviewed 2026-05-10 20:04 UTC · model grok-4.3

classification 💻 cs.DS

keywords suffixient setrepetitiveness measuresubstring equation systemstring representationcompressed indexingdata structures

0 comments

The pith

Every string admits a string representation of size O(χ(w)) using substring equation systems

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper examines the repetitiveness measure χ as the size of the smallest suffixient set for a string. It focuses on the open reachability question: whether every string has some form of representation whose space is bounded by O(χ(w)). The authors introduce the substring equation system model and prove that every string admits such a system of that size. This yields the first affirmative construction. A reader cares because it means χ can now parameterize compact string representations and related data structures without coverage gaps.

Core claim

We answer this question affirmatively by presenting the first such representation scheme. Our construction is based on a new model, the substring equation system (SES), and we show that every string admits an SES of size O(χ(w)).

What carries the argument

The substring equation system (SES), a model that represents a string via equations relating its substrings, carrying the argument by achieving total size O(χ(w)) for any input string.

If this is right

String representations and indexing structures can be parameterized directly by χ(w).
The reachability problem for the χ measure is resolved in the affirmative.
Compressed data structures based on repetitiveness can now use χ as a space bound without exceptions.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The SES construction suggests a template that could be tested on related repetitiveness measures such as the number of runs.
Practical string algorithms in domains like text search may achieve better space bounds if SES can be implemented efficiently.

Load-bearing premise

The newly introduced substring equation system can serve as a valid string representation whose total size is O(χ(w)) for every string without hidden additive terms or additional assumptions not stated in the abstract.

What would settle it

A concrete string w where every substring equation system has size not O(χ(w)), or a counterexample exposing an unaccounted additive term in the construction.

Figures

Figures reproduced from arXiv: 2604.04377 by Hideo Bannai, Hiroki Shibata.

**Figure 2.** Figure 2: The trie and compacted trie of the set {x R | xc ∈ SRE(w), c ∈ Σ} for w = aabbaababa$. In the compacted trie, each edge is labeled by the length of the corresponding path string. Each node stores a set of indices, where each index is the position in the leftmost occurrence of a super-maximal right extension associated with that node of the character immediately preceding the last character. 5 Conclusion In… view at source ↗

**Figure 3.** Figure 3: An example of a substring equation system (SES) constructed from the reverse compacted trie [PITH_FULL_IMAGE:figures/full_fig_p009_3.png] view at source ↗

read the original abstract

Repetitiveness measures quantify how much repetitive structure a string contains and serve as parameters for compressed representations and indexing data structures. We study the measure $\chi$, defined as the size of the smallest suffixient set. Although $\chi$ has been studied extensively, its reachability, whether every string $w$ admits a string representation of size $O(\chi(w))$ words, has remained an important open problem. We answer this question affirmatively by presenting the first such representation scheme. Our construction is based on a new model, the substring equation system (SES), and we show that every string admits an SES of size $O(\chi(w))$.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

This paper closes the reachability question for χ by giving an explicit O(χ(w)) substring equation system built directly from a minimum suffixient set.

read the letter

The main thing to know is that Shibata and Bannai affirmatively settle whether every string admits a representation of size O(χ(w)). They introduce substring equation systems (SES) and prove that one of linear size always exists. The construction takes a minimum suffixient set S of size χ(w), assigns a variable to each position in S, and writes a collection of substring equations that capture the necessary overlaps and extensions. The proof shows the total number of variables and equations stays linear in |S| with no hidden n or alphabet factors, and the system reconstructs the original string uniquely in linear time. This directly answers the open problem left by prior work on repetitiveness measures. The argument is straightforward and avoids self-referential definitions or extra assumptions about periodicity. The paper does well on the core theoretical claim. The main soft spot is that it stops at existence and size; there is no analysis of how to compute the SES in practice or how it would plug into compressed indexes or data structures. That keeps the result clean but limits immediate applicability. Readers working on string algorithms and repetitiveness measures will find this useful for understanding the power of χ. It deserves a serious referee because the construction is new, the size bound is explicit, and the open problem it resolves is well-known in the area. I would send it to peer review.

Referee Report

0 major / 2 minor

Summary. The paper addresses the open problem of reachability for the repetitiveness measure χ(w), the size of the smallest suffixient set. It introduces the substring equation system (SES) model and constructs, for every string w, an SES of size O(χ(w)) derived directly from a minimum suffixient set S. Variables are assigned to positions in S, and equations encode the necessary overlaps and extensions; the resulting system has size linear in |S| and is solvable in linear time to recover w uniquely.

Significance. If the construction holds, the result is significant for stringology and compressed data structures: it affirmatively resolves whether χ admits an O(χ(w))-size representation, the first such scheme. Credit is due for the explicit, parameter-free derivation (linear in |S| with no hidden n- or alphabet-dependent factors) and the linear-time reconstruction guarantee, both of which strengthen the claim beyond the abstract.

minor comments (2)

[Abstract] The abstract states the O(χ(w)) bound but does not mention the linear-time solvability; adding one sentence would better convey the efficiency of the representation scheme.
[Section 3 (SES definition)] A short worked example (e.g., a periodic string of length 10–15) illustrating variable assignment and the resulting equations would help readers verify that the SES size is indeed linear in |S| and that reconstruction is unique.

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for the positive assessment of our manuscript, the accurate summary of our contributions, and the recommendation for minor revision. We are pleased that the explicit linear-size construction, parameter-free nature, and linear-time reconstruction were highlighted as strengthening the result.

Circularity Check

0 steps flagged

No significant circularity; explicit construction from prior measure

full rationale

The paper defines a new substring equation system (SES) model explicitly constructed from any suffixient set S of size χ(w). It proves by direct counting that the resulting SES has O(χ(w)) equations/variables and that the system uniquely determines w (solvable in linear time). No step reduces the claimed O(χ(w)) bound to a fitted parameter, self-definition, or unverified self-citation chain; the size analysis follows from the explicit encoding of overlaps/extensions already present in S. The derivation is therefore self-contained and does not collapse to its inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 1 invented entities

The central claim rests on the introduction of the SES model as a new framework for string representation.

invented entities (1)

substring equation system (SES) no independent evidence
purpose: New model enabling string representation of size O(χ(w))
Introduced in the paper as the basis for the claimed representation scheme.

pith-pipeline@v0.9.0 · 5397 in / 1098 out tokens · 99359 ms · 2026-05-10T20:04:06.545936+00:00 · methodology

discussion (0)

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/ArithmeticFromLogic.lean reality_from_one_distinction unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

We answer this question affirmatively by presenting the first such representation scheme. Our construction is based on a new model, the substring equation system (SES), and we show that every string admits an SES of size O(χ(w)).

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Smallest suffixient set maintenance in near-real-time
cs.DS 2026-04 unverdicted novelty 6.0

The smallest suffixient set can be maintained online in polyloglog time per letter in either left-to-right or right-to-left direction via Weiner's suffix tree primitives.

Reference graph

Works this paper leans on

16 extracted references · 16 canonical work pages · cited by 1 Pith paper

[1]

A separation ofγand b via thue-morse words

Hideo Bannai, Mitsuru Funakoshi, Tomohiro I, Dominik K¨ oppl, Takuya Mieno, and Takaaki Nishimoto. A separation ofγand b via thue-morse words. In Thierry Lecroq and H´ el` ene Touzet, editors,String 8 ba aa a 1 11107 9865432 bb a a b $ Figure 3: An example of a substring equation system (SES) constructed from the reverse compacted trie induced by the supe...

work page doi:10.1007/978-3-030-86692-1 2021
[2]

Michael Burrows and David J. Wheeler. A block-sorting lossless data compression algorithm. InTech- nical Report 124. Digital SRC Research Report, 1994

work page 1994
[3]

Suffixient arrays: a new efficient suffix array compression technique.arXiv preprint 2407.18753v2, 2025

Davide Cenzato, Lore Depuydt, Travis Gagie, Sung-Hwan Kim, Giovanni Manzini, Francisco Olivares, Nicola Prezza, and Lore Depuydt. Suffixient arrays: a new efficient suffix array compression technique. CoRR, abs/2407.18753, 2023. URL:https://doi.org/10.48550/arXiv.2407.18753,arXiv:2407. 18753,doi:10.48550/ARXIV.2407.18753

work page doi:10.48550/arxiv.2407.18753 2023
[4]

On computing the smallest suffixient set

Davide Cenzato, Francisco Olivares, and Nicola Prezza. On computing the smallest suffixient set. In Zsuzsanna Lipt´ ak, Edleno Silva de Moura, Karina Figueroa, and Ricardo Baeza-Yates, editors,String Processing and Information Retrieval - 31st International Symposium, SPIRE 2024, Puerto Vallarta, Mexico, September 23-25, 2024, Proceedings, volume 14899 of...

work page doi:10.1007/978-3-031-72200-4 2024
[5]

Suf- fixient sets.arXiv preprint 2312.01359v3, 2024

Lore Depuydt, Travis Gagie, Ben Langmead, Giovanni Manzini, and Nicola Prezza. Suffixient sets. CoRR, abs/2312.01359, 2023. URL:https://doi.org/10.48550/arXiv.2312.01359,arXiv:2312. 01359,doi:10.48550/ARXIV.2312.01359

work page doi:10.48550/arxiv.2312.01359 2023
[6]

Optimal suffix tree construction with large alphabets

Martin Farach. Optimal suffix tree construction with large alphabets. In38th Annual Symposium on Foundations of Computer Science, FOCS 1997, Miami Beach, Florida, USA, October 19-22, 1997, pages 137–143. IEEE Computer Society, 1997.doi:10.1109/SFCS.1997.646102

work page doi:10.1109/sfcs.1997.646102 1997
[7]

Universal reconstruction of a string.Theor

Pawel Gawrychowski, Tomasz Kociumaka, Jakub Radoszewski, Wojciech Rytter, and Tomasz Walen. Universal reconstruction of a string.Theor. Comput. Sci., 812:174–186, 2020. URL:https://doi.org/ 10.1016/j.tcs.2019.10.027,doi:10.1016/J.TCS.2019.10.027

work page doi:10.1016/j.tcs.2019.10.027 2020
[8]

At the roots of dictionary compression: String attractors

Dominik Kempa and Nicola Prezza. At the roots of dictionary compression: String attractors. In Ilias Diakonikolas, David Kempe, and Monika Henzinger, editors,Proceedings of the 50th Annual ACM SIGACT Symposium on Theory of Computing, STOC 2018, Los Angeles, CA, USA, June 25-29, 2018, pages 827–840. ACM, 2018.doi:10.1145/3188745.3188814. 9

work page doi:10.1145/3188745.3188814 2018
[9]

A linear-time algorithm for seeds computation.ACM Trans

Tomasz Kociumaka, Marcin Kubica, Jakub Radoszewski, Wojciech Rytter, and Tomasz Walen. A linear-time algorithm for seeds computation.ACM Trans. Algorithms, 16(2):27:1–27:23, 2020.doi: 10.1145/3386369

work page doi:10.1145/3386369 2020
[10]

Towards a definitive measure of repetitiveness

Tomasz Kociumaka, Gonzalo Navarro, and Nicola Prezza. Towards a definitive measure of repetitiveness. In Yoshiharu Kohayakawa and Fl´ avio Keidi Miyazawa, editors,LATIN 2020: Theoretical Informatics - 14th Latin American Symposium, S˜ ao Paulo, Brazil, January 5-8, 2021, Proceedings, volume 12118 of Lecture Notes in Computer Science, pages 207–219. Spring...

work page doi:10.1007/978-3-030-61792-9 2020
[11]

On repetitiveness measures of thue-morse words

Kanaru Kutsukake, Takuya Matsumoto, Yuto Nakashima, Shunsuke Inenaga, Hideo Bannai, and Masayuki Takeda. On repetitiveness measures of thue-morse words. In Christina Boucher and Sharma V. Thankachan, editors,String Processing and Information Retrieval - 27th International Symposium, SPIRE 2020, Orlando, FL, USA, October 13-15, 2020, Proceedings, Lecture N...

work page doi:10.1007/978-3-030-59212-7 2020
[12]

Indexing highly repetitive string collections, part I: repetitiveness measures.ACM Comput

Gonzalo Navarro. Indexing highly repetitive string collections, part I: repetitiveness measures.ACM Comput. Surv., 54(2):29:1–29:31, 2022.doi:10.1145/3434399

work page doi:10.1145/3434399 2022
[13]

On the approximation ratio of ordered parsings

Gonzalo Navarro, Carlos Ochoa, and Nicola Prezza. On the approximation ratio of ordered parsings. IEEE Trans. Inf. Theory, 67(2):1008–1026, 2021.doi:10.1109/TIT.2020.3042746

work page doi:10.1109/tit.2020.3042746 2021
[14]

Smallest suffixient sets as a repetitiveness measure

Gonzalo Navarro, Giuseppe Romana, and Cristian Urbina. Smallest suffixient sets as a repetitiveness measure. In Golnaz Badkobeh, Jakub Radoszewski, Nicola Tonellotto, and Ricardo Baeza-Yates, editors, String Processing and Information Retrieval - 32nd International Symposium, SPIRE 2025, London, UK, September 8-11, 2025, Proceedings, volume 16073 ofLectur...

work page doi:10.1007/978-3-032-05228-5 2025
[15]

Storer and Thomas G

James A. Storer and Thomas G. Szymanski. The macro model for data compression (extended abstract). In Richard J. Lipton, Walter A. Burkhard, Walter J. Savitch, Emily P. Friedman, and Alfred V. Aho, editors,Proceedings of the 10th Annual ACM Symposium on Theory of Computing, May 1-3, 1978, San Diego, California, USA, pages 30–39. ACM, 1978.doi:10.1145/8001...

work page doi:10.1145/800133.804329 1978
[16]

A universal algorithm for sequential data compression.IEEE Trans

Jacob Ziv and Abraham Lempel. A universal algorithm for sequential data compression.IEEE Trans. Inf. Theory, 23(3):337–343, 1977.doi:10.1109/TIT.1977.1055714. 10

work page doi:10.1109/tit.1977.1055714 1977

[1] [1]

A separation ofγand b via thue-morse words

Hideo Bannai, Mitsuru Funakoshi, Tomohiro I, Dominik K¨ oppl, Takuya Mieno, and Takaaki Nishimoto. A separation ofγand b via thue-morse words. In Thierry Lecroq and H´ el` ene Touzet, editors,String 8 ba aa a 1 11107 9865432 bb a a b $ Figure 3: An example of a substring equation system (SES) constructed from the reverse compacted trie induced by the supe...

work page doi:10.1007/978-3-030-86692-1 2021

[2] [2]

Michael Burrows and David J. Wheeler. A block-sorting lossless data compression algorithm. InTech- nical Report 124. Digital SRC Research Report, 1994

work page 1994

[3] [3]

Suffixient arrays: a new efficient suffix array compression technique.arXiv preprint 2407.18753v2, 2025

Davide Cenzato, Lore Depuydt, Travis Gagie, Sung-Hwan Kim, Giovanni Manzini, Francisco Olivares, Nicola Prezza, and Lore Depuydt. Suffixient arrays: a new efficient suffix array compression technique. CoRR, abs/2407.18753, 2023. URL:https://doi.org/10.48550/arXiv.2407.18753,arXiv:2407. 18753,doi:10.48550/ARXIV.2407.18753

work page doi:10.48550/arxiv.2407.18753 2023

[4] [4]

On computing the smallest suffixient set

Davide Cenzato, Francisco Olivares, and Nicola Prezza. On computing the smallest suffixient set. In Zsuzsanna Lipt´ ak, Edleno Silva de Moura, Karina Figueroa, and Ricardo Baeza-Yates, editors,String Processing and Information Retrieval - 31st International Symposium, SPIRE 2024, Puerto Vallarta, Mexico, September 23-25, 2024, Proceedings, volume 14899 of...

work page doi:10.1007/978-3-031-72200-4 2024

[5] [5]

Suf- fixient sets.arXiv preprint 2312.01359v3, 2024

Lore Depuydt, Travis Gagie, Ben Langmead, Giovanni Manzini, and Nicola Prezza. Suffixient sets. CoRR, abs/2312.01359, 2023. URL:https://doi.org/10.48550/arXiv.2312.01359,arXiv:2312. 01359,doi:10.48550/ARXIV.2312.01359

work page doi:10.48550/arxiv.2312.01359 2023

[6] [6]

Optimal suffix tree construction with large alphabets

Martin Farach. Optimal suffix tree construction with large alphabets. In38th Annual Symposium on Foundations of Computer Science, FOCS 1997, Miami Beach, Florida, USA, October 19-22, 1997, pages 137–143. IEEE Computer Society, 1997.doi:10.1109/SFCS.1997.646102

work page doi:10.1109/sfcs.1997.646102 1997

[7] [7]

Universal reconstruction of a string.Theor

Pawel Gawrychowski, Tomasz Kociumaka, Jakub Radoszewski, Wojciech Rytter, and Tomasz Walen. Universal reconstruction of a string.Theor. Comput. Sci., 812:174–186, 2020. URL:https://doi.org/ 10.1016/j.tcs.2019.10.027,doi:10.1016/J.TCS.2019.10.027

work page doi:10.1016/j.tcs.2019.10.027 2020

[8] [8]

At the roots of dictionary compression: String attractors

Dominik Kempa and Nicola Prezza. At the roots of dictionary compression: String attractors. In Ilias Diakonikolas, David Kempe, and Monika Henzinger, editors,Proceedings of the 50th Annual ACM SIGACT Symposium on Theory of Computing, STOC 2018, Los Angeles, CA, USA, June 25-29, 2018, pages 827–840. ACM, 2018.doi:10.1145/3188745.3188814. 9

work page doi:10.1145/3188745.3188814 2018

[9] [9]

A linear-time algorithm for seeds computation.ACM Trans

Tomasz Kociumaka, Marcin Kubica, Jakub Radoszewski, Wojciech Rytter, and Tomasz Walen. A linear-time algorithm for seeds computation.ACM Trans. Algorithms, 16(2):27:1–27:23, 2020.doi: 10.1145/3386369

work page doi:10.1145/3386369 2020

[10] [10]

Towards a definitive measure of repetitiveness

Tomasz Kociumaka, Gonzalo Navarro, and Nicola Prezza. Towards a definitive measure of repetitiveness. In Yoshiharu Kohayakawa and Fl´ avio Keidi Miyazawa, editors,LATIN 2020: Theoretical Informatics - 14th Latin American Symposium, S˜ ao Paulo, Brazil, January 5-8, 2021, Proceedings, volume 12118 of Lecture Notes in Computer Science, pages 207–219. Spring...

work page doi:10.1007/978-3-030-61792-9 2020

[11] [11]

On repetitiveness measures of thue-morse words

Kanaru Kutsukake, Takuya Matsumoto, Yuto Nakashima, Shunsuke Inenaga, Hideo Bannai, and Masayuki Takeda. On repetitiveness measures of thue-morse words. In Christina Boucher and Sharma V. Thankachan, editors,String Processing and Information Retrieval - 27th International Symposium, SPIRE 2020, Orlando, FL, USA, October 13-15, 2020, Proceedings, Lecture N...

work page doi:10.1007/978-3-030-59212-7 2020

[12] [12]

Indexing highly repetitive string collections, part I: repetitiveness measures.ACM Comput

Gonzalo Navarro. Indexing highly repetitive string collections, part I: repetitiveness measures.ACM Comput. Surv., 54(2):29:1–29:31, 2022.doi:10.1145/3434399

work page doi:10.1145/3434399 2022

[13] [13]

On the approximation ratio of ordered parsings

Gonzalo Navarro, Carlos Ochoa, and Nicola Prezza. On the approximation ratio of ordered parsings. IEEE Trans. Inf. Theory, 67(2):1008–1026, 2021.doi:10.1109/TIT.2020.3042746

work page doi:10.1109/tit.2020.3042746 2021

[14] [14]

Smallest suffixient sets as a repetitiveness measure

Gonzalo Navarro, Giuseppe Romana, and Cristian Urbina. Smallest suffixient sets as a repetitiveness measure. In Golnaz Badkobeh, Jakub Radoszewski, Nicola Tonellotto, and Ricardo Baeza-Yates, editors, String Processing and Information Retrieval - 32nd International Symposium, SPIRE 2025, London, UK, September 8-11, 2025, Proceedings, volume 16073 ofLectur...

work page doi:10.1007/978-3-032-05228-5 2025

[15] [15]

Storer and Thomas G

James A. Storer and Thomas G. Szymanski. The macro model for data compression (extended abstract). In Richard J. Lipton, Walter A. Burkhard, Walter J. Savitch, Emily P. Friedman, and Alfred V. Aho, editors,Proceedings of the 10th Annual ACM Symposium on Theory of Computing, May 1-3, 1978, San Diego, California, USA, pages 30–39. ACM, 1978.doi:10.1145/8001...

work page doi:10.1145/800133.804329 1978

[16] [16]

A universal algorithm for sequential data compression.IEEE Trans

Jacob Ziv and Abraham Lempel. A universal algorithm for sequential data compression.IEEE Trans. Inf. Theory, 23(3):337–343, 1977.doi:10.1109/TIT.1977.1055714. 10

work page doi:10.1109/tit.1977.1055714 1977