pith. sign in

arxiv: 2607.00204 · v1 · pith:MAVYBYL3new · submitted 2026-06-30 · 💻 cs.DS

Computing Smallest Suffixient Arrays in Sublinear Time

Pith reviewed 2026-07-02 16:46 UTC · model grok-4.3

classification 💻 cs.DS
keywords suffixient arrayBurrows-Wheeler transformsublinear timetext indexingpattern matchingrun-length compression
0
0 comments X

The pith

A smallest suffixient array for text T of length n can be computed in O(n log σ / √log n + min(r, r-bar) log^ε n) time for any ε > 0.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes a new algorithm for building the smallest suffixient array on a string T. When paired with an index that gives random access to T, this array supports multiple pattern-matching queries. The running time improves to sublinear whenever the alphabet size σ stays modest and the number of equal-letter runs in the Burrows-Wheeler transforms of T and its reverse is o(n / log^ε n). The result gives an asymptotic improvement over earlier linear-time constructions. The authors also derive several related algorithmic results that follow from the same techniques.

Core claim

We show how to compute a smallest suffixient array for T[1…n] in O(n log σ / √log n + min(r, r-bar) log^ε n) time for any ε > 0, where σ is the alphabet size of T and r and r-bar are the numbers of equal-letter runs of the BWTs of T and its reverse.

What carries the argument

The smallest suffixient array, a compact structure whose size is governed by the run-length properties of the forward and reverse Burrows-Wheeler transforms and that replaces a full suffix array for the supported queries.

If this is right

  • Pattern-matching queries become feasible on texts whose size makes prior linear-time construction impractical.
  • Preprocessing cost drops below linear whenever the input is repetitive enough that min(r, r-bar) is sublinear in n.
  • The same run-length information used for construction can be reused for other compressed indexes.
  • The method yields a family of connected algorithms whose complexity is also expressed in terms of BWT runs.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The technique may be adaptable to other run-length compressed structures such as the run-length FM-index.
  • On genomic or versioned data, where BWT runs are naturally small, the construction could become the dominant practical improvement.
  • It is open whether the √log n factor in the first term can be removed while retaining the same dependence on r.

Load-bearing premise

An auxiliary index already exists that supplies direct random access to every position of the input text T.

What would settle it

Construct the array on a concrete string whose BWT run counts are known, measure wall-clock time against the stated bound, and check whether the observed complexity deviates from O(n log σ / √log n + min(r, r-bar) log^ε n).

Figures

Figures reproduced from arXiv: 2607.00204 by Cristian Urbina, Francisco Olivares, Giuseppe Romana, Gonzalo Navarro, Hiroto Fujimaru, Jakub Radoszewski.

Figure 1
Figure 1. Figure 1: Processing the BWT run-break at position i of the set S. Substrings T[SA[i] . . SA[i] + |x|] and T[SA[i − 1] . . SA[i − 1] + |x|] are candidates to be super￾maximal extensions of T. We add the pairs (n − (SA[i] + |x|) + 1, |x| + 1) and (n − (SA[i − 1] + |x|) + 1, |x| + 1) to the multiset S, which represents the starting positions and length of the occurrences of those reversed candidates in T. assume that … view at source ↗
read the original abstract

A suffixient array is a novel data structure that, when combined with an index providing direct access on a text $T$, allows us to answer a variety of pattern matching queries. In this work, we show how to compute a smallest suffixient array for $T[1\dots n]$ in $O(\frac{n\log \sigma}{\sqrt{\log n}}+\min(r,\bar{r})\log^\epsilon n)$ time for any $\epsilon > 0$, where $\sigma$ is the alphabet size of $T$ and $r$ and $\bar{r}$ are the numbers of equal-letter runs of the Burrows-Wheeler transforms of $T$ and its reverse $\overline{T}$, respectively. This time complexity becomes sublinear when $\sigma$ is small enough and $\min(r,\bar{r})=o(\frac{n}{\log^\epsilon n})$, yielding an asymptotic improvement over state-of-the-art algorithms. We also present a series of connected algorithmic results.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper claims an algorithm to compute a smallest suffixient array for a text T[1..n] in O(n log σ / √log n + min(r, r-bar) log^ε n) time (any ε>0), where σ is the alphabet size and r, r-bar are the equal-letter run counts in the BWTs of T and its reverse. The bound is sublinear when σ is small and min(r, r-bar)=o(n/log^ε n), improving prior work; the structure supports pattern-matching queries when paired with a direct-access index on T. Additional connected algorithmic results are presented.

Significance. If the claimed time bound holds under a clearly stated model, the result would give the first sublinear construction for this data structure, with practical relevance for compressible texts (via BWT runs) and small alphabets. The connected results add value, but the headline improvement rests on the sublinear claim.

major comments (2)
  1. [Abstract] Abstract: the O(n log σ / √log n + min(r, r-bar) log^ε n) bound is presented as sublinear, yet the algorithm explicitly requires an auxiliary index that supplies direct (random) access to T[i] as a unit-cost primitive. No construction time or space bound for this index is supplied, and standard RAM-model constructions require Ω(n) time; this makes the sublinear claim dependent on an unaccounted-for assumption that is load-bearing for the central result.
  2. [Abstract, §1] Abstract and opening of §1: the model statement qualifies the result as holding “when combined with an index providing direct access,” but the manuscript supplies neither a construction for the index nor a reduction showing that its cost can be absorbed into the stated bound without losing sublinearity. This leaves the complexity claim incomplete in the standard word-RAM model.
minor comments (1)
  1. [Abstract] Notation for r-bar is introduced without an explicit definition in the abstract; a parenthetical reminder would improve readability.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the careful reading and for pointing out the need for greater precision in the model assumptions underlying our sublinear time bound. We respond to each major comment below.

read point-by-point responses
  1. Referee: [Abstract] Abstract: the O(n log σ / √log n + min(r, r-bar) log^ε n) bound is presented as sublinear, yet the algorithm explicitly requires an auxiliary index that supplies direct (random) access to T[i] as a unit-cost primitive. No construction time or space bound for this index is supplied, and standard RAM-model constructions require Ω(n) time; this makes the sublinear claim dependent on an unaccounted-for assumption that is load-bearing for the central result.

    Authors: We agree that the stated time bound applies to the construction of the smallest suffixient array under the assumption that a direct-access index on T is already available as a unit-cost primitive. The abstract and §1 explicitly qualify the result with the phrase “when combined with an index providing direct access.” The sublinearity therefore holds relative to this model; the algorithm itself does not construct the index. In application settings where T is already stored in a structure supporting random access (for example, as part of an existing FM-index or external-memory layout), the bound is as claimed. We acknowledge, however, that the manuscript does not supply a construction for the index or absorb its cost, so the claim is incomplete if interpreted as a fully self-contained word-RAM algorithm starting from the raw text. In revision we will add an explicit paragraph clarifying the model and noting that index construction would contribute an additional Ω(n) term outside the stated bound. revision: yes

  2. Referee: [Abstract, §1] Abstract and opening of §1: the model statement qualifies the result as holding “when combined with an index providing direct access,” but the manuscript supplies neither a construction for the index nor a reduction showing that its cost can be absorbed into the stated bound without losing sublinearity. This leaves the complexity claim incomplete in the standard word-RAM model.

    Authors: The manuscript intentionally presents the result in the model where the direct-access index is supplied; no construction or absorption argument is given because the contribution concerns the suffixient-array algorithm that exploits the index. We did not claim a reduction that preserves sublinearity from the plain text. The referee is correct that this leaves the overall complexity statement incomplete for the standard word-RAM model without the index. We will revise §1 to include a short discussion of representative index constructions (e.g., a plain array or a succinct structure) and their costs, making the conditional nature of the sublinear bound fully explicit. revision: yes

Circularity Check

0 steps flagged

No circularity; algorithmic construction is self-contained under stated model.

full rationale

The paper claims an O(n log σ / √log n + min(r, r-bar) log^ε n) algorithm for constructing a smallest suffixient array, explicitly conditioned on the existence of a separate direct-access index for T. No equations, parameters, or predictions appear that reduce by construction to their own inputs; the time bound is derived from standard string algorithms on run-length compressed BWTs rather than any self-definition or fitted renaming. No load-bearing self-citations or uniqueness theorems imported from prior author work are invoked to justify the central result. The derivation therefore stands as an independent algorithmic achievement within the RAM model that treats random access as a unit-cost primitive.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The paper relies on standard algorithmic assumptions without introducing new free parameters, invented entities, or ad-hoc axioms beyond the usual word-RAM model implicit in string algorithm analysis.

axioms (1)
  • standard math Standard word-RAM computational model with word size Θ(log n) and constant-time access to array elements.
    Implicit in all time-complexity statements for string algorithms on texts of length n.

pith-pipeline@v0.9.1-grok · 5708 in / 1211 out tokens · 23751 ms · 2026-07-02T16:46:00.450831+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

31 extracted references · 25 canonical work pages · 4 internal anchors

  1. [1]

    In: Indyk, P

    Babenko, M.A., Gawrychowski, P., Kociumaka, T., Starikovskaya, T.: Wavelet trees meet suffix trees. In: Indyk, P. (ed.) Proceedings of the Twenty-Sixth An- nual ACM-SIAM Symposium on Discrete Algorithms, SODA 2015, San Diego, CA, USA, January 4-6, 2015. pp. 572–591. SIAM (2015). https://doi.org/10.1137/ 1.9781611973730.39

  2. [2]

    In: Gawrychowski, P., Starikovskaya, T

    Belazzougui, D., Kosolobov, D., Puglisi, S.J., Raman, R.: Weighted Ancestors in Suffix Trees Revisited. In: Gawrychowski, P., Starikovskaya, T. (eds.) 32nd An- nual Symposium on Combinatorial Pattern Matching (CPM 2021). Leibniz In- ternational Proceedings in Informatics (LIPIcs), vol. 191, pp. 8:1–8:15. Schloss Dagstuhl – Leibniz-Zentrum für Informatik, ...

  3. [3]

    In: Gonnet, G.H., Viola, A

    Bender, M.A., Farach-Colton, M.: The LCA problem revisited. In: Gonnet, G.H., Viola, A. (eds.) LATIN 2000: Theoretical Informatics. pp. 88–94. Springer Berlin Heidelberg, Berlin, Heidelberg (2000). https://doi.org/10.1007/10719839_9

  4. [4]

    Blumer, A., Blumer, J., Haussler, D., McConnell, R.M., Ehrenfeucht, A.: Complete inverted files for efficient text retrieval and analysis. J. ACM34(3), 578–595 (1987). https://doi.org/10.1145/28869.28873

  5. [5]

    In: Bille, P., Prezza, N

    Bonizzoni, P., Gao, Y., Riccardi, B.: Constructing suffixient arrays revisited. In: Bille, P., Prezza, N. (eds.) 37th Annual Symposium on Combinatorial Pattern Matching, CPM 2026. LIPIcs, vol. 369, pp. 30:1–30:18. Schloss Dagstuhl - Leibniz- Zentrum für Informatik (2026). https://doi.org/10.4230/LIPICS.CPM.2026.30

  6. [6]

    Burrows, M., Wheeler, D.: A block sorting lossless data compression algorithm. Tech. Rep. 124, Digital Equipment Corporation (1994)

  7. [7]

    Cenzato, D., Depuydt, L., Gagie, T., Kim, S.H., Manzini, G., Olivares, F., Prezza, N.: Suffixient arrays: a new efficient suffix array compression technique (2025), https://arxiv.org/abs/2407.18753

  8. [8]

    In: Lipták, Z., de Moura, E.S., Figueroa, K., Baeza-Yates, R

    Cenzato, D., Olivares, F., Prezza, N.: On computing the smallest suffixient set. In: Lipták, Z., de Moura, E.S., Figueroa, K., Baeza-Yates, R. (eds.) Proceedings of 31st International Symposium on String Processing and Information Retrieval, SPIRE 2024. Lecture Notes in Computer Science, vol. 14899, pp. 73–87 (2024). https://doi.org/10.1007/978-3-031-72200-4_6

  9. [9]

    Testing Suffixient Sets

    Cenzato, D., Olivares, F., Prezza, N.: Testing suffixient sets (2025), https://arxiv. org/abs/2506.08225

  10. [10]

    MIT Press (2009), http://mitpress.mit.edu/books/ introduction-algorithms

    Cormen, T.H., Leiserson, C.E., Rivest, R.L., Stein, C.: Introduction to Algorithms, 3rd Edition. MIT Press (2009), http://mitpress.mit.edu/books/ introduction-algorithms

  11. [11]

    CoRRabs/2312.01359(2023)

    Depuydt, L., Gagie, T., Langmead, B., Manzini, G., Prezza, N.: Suffixient sets. CoRRabs/2312.01359(2023). https://doi.org/10.48550/ARXIV.2312.01359

  12. [12]

    Fujimaru, H., Navarro, G., Romana, G., Urbina, C.: Smallest suffixient sets: Effec- tiveness, resilience, and calculation (2025), https://arxiv.org/abs/2506.05638

  13. [13]

    In: Soto, J.A., Wiese, A

    Gagie, T., Goga, A., Jeż, A., Navarro, G.: Space-efficient conversions from SLPs. In: Soto, J.A., Wiese, A. (eds.) LATIN 2024: Theoretical Informatics - 16th Latin 14 H. Fujimaru et al. American Symposium, Part I. Lecture Notes in Computer Science, vol. 14578, pp. 146–161. Springer (2024). https://doi.org/10.1007/978-3-031-55598-5_10

  14. [14]

    Gagie, T., Navarro, G., Prezza, N.: Fully functional suffix trees and optimal text searching in BWT-runs bounded space. J. ACM67(1) (jan 2020). https://doi.org/ 10.1145/3375890

  15. [15]

    In: SOFSEM

    Giuliani, S., Inenaga, S., Lipták, Z., Prezza, N., Sciortino, M., Toffanello, A.: Novel results on the number of runs of the Burrows-Wheeler-transform. In: SOFSEM. LectureNotesinComputerScience,vol.12607,pp.249–262.Springer(2021).https: //doi.org/10.1007/978-3-030-67731-2_18

  16. [16]

    In: Proceedings of the Fourteenth Annual ACM-SIAM Symposium on Discrete Algo- rithms, January 12-14, 2003, Baltimore, Maryland, USA

    Grossi, R., Gupta, A., Vitter, J.S.: High-order entropy-compressed text indexes. In: Proceedings of the Fourteenth Annual ACM-SIAM Symposium on Discrete Algo- rithms, January 12-14, 2003, Baltimore, Maryland, USA. pp. 841–850. ACM/SIAM (2003), http://dl.acm.org/citation.cfm?id=644108.644250

  17. [17]

    Kärkkäinen, J., Sanders, P., Burkhardt, S.: Linear work suffix array construction. J. ACM53(6), 918–936 (Nov 2006). https://doi.org/10.1145/1217856.1217858

  18. [18]

    Kempa, D., Kociumaka, T.: String synchronizing sets: sublinear-time BWT con- structionandoptimalLCEdatastructure.In:Proceedingsofthe51stAnnualACM SIGACT Symposium on Theory of Computing. p. 756–767. STOC ’19, ACM (Jun 2019). https://doi.org/10.1145/3313276.3316368

  19. [19]

    Approximation Algorithms for Stochastic Minimum-Norm Combinatorial Optimization , booktitle =

    Kempa, D., Kociumaka, T.: Resolution of the Burrows-Wheeler Transform conjec- ture. In: Proc. 61st Annual IEEE Symposium on Foundations of Computer Science (FOCS). pp. 1014–1025 (2020). https://doi.org/10.1109/FOCS46700.2020.00097

  20. [20]

    In: Bansal, N., Nagarajan, V

    Kempa, D., Kociumaka, T.: Breaking the O(n)-barrier in the construction of com- pressed suffix arrays and suffix trees. In: Bansal, N., Nagarajan, V. (eds.) Proceed- ings of the 2023 ACM-SIAM Symposium on Discrete Algorithms, SODA 2023. pp. 5122–5202. SIAM (2023). https://doi.org/10.1137/1.9781611977554.ch187

  21. [21]

    Köppl, D., Kucherov, G.: Smallest suffixient set maintenance in near-real-time (2026), https://arxiv.org/abs/2604.27548, accepted to MFCS 2026

  22. [22]

    Nature526(7571), 68–74 (2015)

    The 1000 Genomes Project Consortium: A global reference for human genetic vari- ation. Nature526(7571), 68–74 (2015). https://doi.org/10.1038/nature15393

  23. [23]

    Current Opinion in Genetics & Development15(6), 589–594 (2005)

    Medini, D., Donati, C., Tettelin, H., Masignani, V., Rappuoli, R.: The microbial pan-genome. Current Opinion in Genetics & Development15(6), 589–594 (2005). https://doi.org/10.1016/j.gde.2005.09.006

  24. [24]

    Navarro, G.: Wavelet trees for all. J. Discrete Algorithms25, 2–20 (2014). https: //doi.org/10.1016/J.JDA.2013.07.004

  25. [25]

    ACM Comput

    Navarro, G.: Indexing Highly Repetitive String Collections, Part I: Repetitiveness Measures. ACM Comput. Surv.54(2), 29:1–29:31 (2022). https://doi.org/10.1145/ 3434399

  26. [26]

    ACM Comput

    Navarro, G.: Indexing Highly Repetitive String Collections, Part II: Compressed Indexes. ACM Comput. Surv.54(2), 26:1–26:31 (2022). https://doi.org/10.1145/ 3432999

  27. [27]

    In: Badkobeh, G., Radoszewski, J., Tonellotto, N., Baeza-Yates, R

    Navarro, G., Romana, G., Urbina, C.: Smallest suffixient sets as a repetitiveness measure. In: Badkobeh, G., Radoszewski, J., Tonellotto, N., Baeza-Yates, R. (eds.) Proc. 32nd International Symposium on String Processing and Information Re- trieval (SPIRE 2025). Lecture Notes in Computer Science, vol. 16073, pp. 217–232. Springer (2025). https://doi.org/1...

  28. [28]

    Prezza, N.: Compressed Computation for Text Indexing. Ph.D. thesis, University of Udine (2016)

  29. [29]

    Algorithmica65(3), 685–709 (2013)

    Raskhodnikova, S., Ron, D., Rubinfeld, R., Smith, A.D.: Sublinear algorithms for approximating string compressibility. Algorithmica65(3), 685–709 (2013). https: //doi.org/10.1007/S00453-012-9618-6 Computing Smallest Suffixient Arrays in Sublinear Time 15

  30. [30]

    Shibata, H., Bannai, H.: String representation in suffixient set size space (2026), https://arxiv.org/abs/2604.04377

  31. [31]

    Nature604(7906), 437–446 (2022)

    Wang, T., Antonacci-Fulton, L., Howe, K., Lawson, H.A., Lucas, J.K., Phillippy, A.M., Popejoy, A.B., Asri, M., Carson, C., Chaisson, M.J., et al.: The human pangenome project: a global resource to map genomic diversity. Nature604(7906), 437–446 (2022). https://doi.org/10.1038/s41586-022-04601-8 A Sublinear time adaptation of Cenzato et al.’s algorithm Cen...