Computing Smallest Suffixient Arrays in Sublinear Time
Pith reviewed 2026-07-02 16:46 UTC · model grok-4.3
The pith
A smallest suffixient array for text T of length n can be computed in O(n log σ / √log n + min(r, r-bar) log^ε n) time for any ε > 0.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
We show how to compute a smallest suffixient array for T[1…n] in O(n log σ / √log n + min(r, r-bar) log^ε n) time for any ε > 0, where σ is the alphabet size of T and r and r-bar are the numbers of equal-letter runs of the BWTs of T and its reverse.
What carries the argument
The smallest suffixient array, a compact structure whose size is governed by the run-length properties of the forward and reverse Burrows-Wheeler transforms and that replaces a full suffix array for the supported queries.
If this is right
- Pattern-matching queries become feasible on texts whose size makes prior linear-time construction impractical.
- Preprocessing cost drops below linear whenever the input is repetitive enough that min(r, r-bar) is sublinear in n.
- The same run-length information used for construction can be reused for other compressed indexes.
- The method yields a family of connected algorithms whose complexity is also expressed in terms of BWT runs.
Where Pith is reading between the lines
- The technique may be adaptable to other run-length compressed structures such as the run-length FM-index.
- On genomic or versioned data, where BWT runs are naturally small, the construction could become the dominant practical improvement.
- It is open whether the √log n factor in the first term can be removed while retaining the same dependence on r.
Load-bearing premise
An auxiliary index already exists that supplies direct random access to every position of the input text T.
What would settle it
Construct the array on a concrete string whose BWT run counts are known, measure wall-clock time against the stated bound, and check whether the observed complexity deviates from O(n log σ / √log n + min(r, r-bar) log^ε n).
Figures
read the original abstract
A suffixient array is a novel data structure that, when combined with an index providing direct access on a text $T$, allows us to answer a variety of pattern matching queries. In this work, we show how to compute a smallest suffixient array for $T[1\dots n]$ in $O(\frac{n\log \sigma}{\sqrt{\log n}}+\min(r,\bar{r})\log^\epsilon n)$ time for any $\epsilon > 0$, where $\sigma$ is the alphabet size of $T$ and $r$ and $\bar{r}$ are the numbers of equal-letter runs of the Burrows-Wheeler transforms of $T$ and its reverse $\overline{T}$, respectively. This time complexity becomes sublinear when $\sigma$ is small enough and $\min(r,\bar{r})=o(\frac{n}{\log^\epsilon n})$, yielding an asymptotic improvement over state-of-the-art algorithms. We also present a series of connected algorithmic results.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims an algorithm to compute a smallest suffixient array for a text T[1..n] in O(n log σ / √log n + min(r, r-bar) log^ε n) time (any ε>0), where σ is the alphabet size and r, r-bar are the equal-letter run counts in the BWTs of T and its reverse. The bound is sublinear when σ is small and min(r, r-bar)=o(n/log^ε n), improving prior work; the structure supports pattern-matching queries when paired with a direct-access index on T. Additional connected algorithmic results are presented.
Significance. If the claimed time bound holds under a clearly stated model, the result would give the first sublinear construction for this data structure, with practical relevance for compressible texts (via BWT runs) and small alphabets. The connected results add value, but the headline improvement rests on the sublinear claim.
major comments (2)
- [Abstract] Abstract: the O(n log σ / √log n + min(r, r-bar) log^ε n) bound is presented as sublinear, yet the algorithm explicitly requires an auxiliary index that supplies direct (random) access to T[i] as a unit-cost primitive. No construction time or space bound for this index is supplied, and standard RAM-model constructions require Ω(n) time; this makes the sublinear claim dependent on an unaccounted-for assumption that is load-bearing for the central result.
- [Abstract, §1] Abstract and opening of §1: the model statement qualifies the result as holding “when combined with an index providing direct access,” but the manuscript supplies neither a construction for the index nor a reduction showing that its cost can be absorbed into the stated bound without losing sublinearity. This leaves the complexity claim incomplete in the standard word-RAM model.
minor comments (1)
- [Abstract] Notation for r-bar is introduced without an explicit definition in the abstract; a parenthetical reminder would improve readability.
Simulated Author's Rebuttal
We thank the referee for the careful reading and for pointing out the need for greater precision in the model assumptions underlying our sublinear time bound. We respond to each major comment below.
read point-by-point responses
-
Referee: [Abstract] Abstract: the O(n log σ / √log n + min(r, r-bar) log^ε n) bound is presented as sublinear, yet the algorithm explicitly requires an auxiliary index that supplies direct (random) access to T[i] as a unit-cost primitive. No construction time or space bound for this index is supplied, and standard RAM-model constructions require Ω(n) time; this makes the sublinear claim dependent on an unaccounted-for assumption that is load-bearing for the central result.
Authors: We agree that the stated time bound applies to the construction of the smallest suffixient array under the assumption that a direct-access index on T is already available as a unit-cost primitive. The abstract and §1 explicitly qualify the result with the phrase “when combined with an index providing direct access.” The sublinearity therefore holds relative to this model; the algorithm itself does not construct the index. In application settings where T is already stored in a structure supporting random access (for example, as part of an existing FM-index or external-memory layout), the bound is as claimed. We acknowledge, however, that the manuscript does not supply a construction for the index or absorb its cost, so the claim is incomplete if interpreted as a fully self-contained word-RAM algorithm starting from the raw text. In revision we will add an explicit paragraph clarifying the model and noting that index construction would contribute an additional Ω(n) term outside the stated bound. revision: yes
-
Referee: [Abstract, §1] Abstract and opening of §1: the model statement qualifies the result as holding “when combined with an index providing direct access,” but the manuscript supplies neither a construction for the index nor a reduction showing that its cost can be absorbed into the stated bound without losing sublinearity. This leaves the complexity claim incomplete in the standard word-RAM model.
Authors: The manuscript intentionally presents the result in the model where the direct-access index is supplied; no construction or absorption argument is given because the contribution concerns the suffixient-array algorithm that exploits the index. We did not claim a reduction that preserves sublinearity from the plain text. The referee is correct that this leaves the overall complexity statement incomplete for the standard word-RAM model without the index. We will revise §1 to include a short discussion of representative index constructions (e.g., a plain array or a succinct structure) and their costs, making the conditional nature of the sublinear bound fully explicit. revision: yes
Circularity Check
No circularity; algorithmic construction is self-contained under stated model.
full rationale
The paper claims an O(n log σ / √log n + min(r, r-bar) log^ε n) algorithm for constructing a smallest suffixient array, explicitly conditioned on the existence of a separate direct-access index for T. No equations, parameters, or predictions appear that reduce by construction to their own inputs; the time bound is derived from standard string algorithms on run-length compressed BWTs rather than any self-definition or fitted renaming. No load-bearing self-citations or uniqueness theorems imported from prior author work are invoked to justify the central result. The derivation therefore stands as an independent algorithmic achievement within the RAM model that treats random access as a unit-cost primitive.
Axiom & Free-Parameter Ledger
axioms (1)
- standard math Standard word-RAM computational model with word size Θ(log n) and constant-time access to array elements.
Reference graph
Works this paper leans on
-
[1]
In: Indyk, P
Babenko, M.A., Gawrychowski, P., Kociumaka, T., Starikovskaya, T.: Wavelet trees meet suffix trees. In: Indyk, P. (ed.) Proceedings of the Twenty-Sixth An- nual ACM-SIAM Symposium on Discrete Algorithms, SODA 2015, San Diego, CA, USA, January 4-6, 2015. pp. 572–591. SIAM (2015). https://doi.org/10.1137/ 1.9781611973730.39
2015
-
[2]
In: Gawrychowski, P., Starikovskaya, T
Belazzougui, D., Kosolobov, D., Puglisi, S.J., Raman, R.: Weighted Ancestors in Suffix Trees Revisited. In: Gawrychowski, P., Starikovskaya, T. (eds.) 32nd An- nual Symposium on Combinatorial Pattern Matching (CPM 2021). Leibniz In- ternational Proceedings in Informatics (LIPIcs), vol. 191, pp. 8:1–8:15. Schloss Dagstuhl – Leibniz-Zentrum für Informatik, ...
-
[3]
Bender, M.A., Farach-Colton, M.: The LCA problem revisited. In: Gonnet, G.H., Viola, A. (eds.) LATIN 2000: Theoretical Informatics. pp. 88–94. Springer Berlin Heidelberg, Berlin, Heidelberg (2000). https://doi.org/10.1007/10719839_9
-
[4]
Blumer, A., Blumer, J., Haussler, D., McConnell, R.M., Ehrenfeucht, A.: Complete inverted files for efficient text retrieval and analysis. J. ACM34(3), 578–595 (1987). https://doi.org/10.1145/28869.28873
-
[5]
Bonizzoni, P., Gao, Y., Riccardi, B.: Constructing suffixient arrays revisited. In: Bille, P., Prezza, N. (eds.) 37th Annual Symposium on Combinatorial Pattern Matching, CPM 2026. LIPIcs, vol. 369, pp. 30:1–30:18. Schloss Dagstuhl - Leibniz- Zentrum für Informatik (2026). https://doi.org/10.4230/LIPICS.CPM.2026.30
-
[6]
Burrows, M., Wheeler, D.: A block sorting lossless data compression algorithm. Tech. Rep. 124, Digital Equipment Corporation (1994)
1994
- [7]
-
[8]
In: Lipták, Z., de Moura, E.S., Figueroa, K., Baeza-Yates, R
Cenzato, D., Olivares, F., Prezza, N.: On computing the smallest suffixient set. In: Lipták, Z., de Moura, E.S., Figueroa, K., Baeza-Yates, R. (eds.) Proceedings of 31st International Symposium on String Processing and Information Retrieval, SPIRE 2024. Lecture Notes in Computer Science, vol. 14899, pp. 73–87 (2024). https://doi.org/10.1007/978-3-031-72200-4_6
-
[9]
Cenzato, D., Olivares, F., Prezza, N.: Testing suffixient sets (2025), https://arxiv. org/abs/2506.08225
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[10]
MIT Press (2009), http://mitpress.mit.edu/books/ introduction-algorithms
Cormen, T.H., Leiserson, C.E., Rivest, R.L., Stein, C.: Introduction to Algorithms, 3rd Edition. MIT Press (2009), http://mitpress.mit.edu/books/ introduction-algorithms
2009
-
[11]
Depuydt, L., Gagie, T., Langmead, B., Manzini, G., Prezza, N.: Suffixient sets. CoRRabs/2312.01359(2023). https://doi.org/10.48550/ARXIV.2312.01359
-
[12]
Fujimaru, H., Navarro, G., Romana, G., Urbina, C.: Smallest suffixient sets: Effec- tiveness, resilience, and calculation (2025), https://arxiv.org/abs/2506.05638
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[13]
Gagie, T., Goga, A., Jeż, A., Navarro, G.: Space-efficient conversions from SLPs. In: Soto, J.A., Wiese, A. (eds.) LATIN 2024: Theoretical Informatics - 16th Latin 14 H. Fujimaru et al. American Symposium, Part I. Lecture Notes in Computer Science, vol. 14578, pp. 146–161. Springer (2024). https://doi.org/10.1007/978-3-031-55598-5_10
-
[14]
Gagie, T., Navarro, G., Prezza, N.: Fully functional suffix trees and optimal text searching in BWT-runs bounded space. J. ACM67(1) (jan 2020). https://doi.org/ 10.1145/3375890
-
[15]
Giuliani, S., Inenaga, S., Lipták, Z., Prezza, N., Sciortino, M., Toffanello, A.: Novel results on the number of runs of the Burrows-Wheeler-transform. In: SOFSEM. LectureNotesinComputerScience,vol.12607,pp.249–262.Springer(2021).https: //doi.org/10.1007/978-3-030-67731-2_18
-
[16]
Grossi, R., Gupta, A., Vitter, J.S.: High-order entropy-compressed text indexes. In: Proceedings of the Fourteenth Annual ACM-SIAM Symposium on Discrete Algo- rithms, January 12-14, 2003, Baltimore, Maryland, USA. pp. 841–850. ACM/SIAM (2003), http://dl.acm.org/citation.cfm?id=644108.644250
-
[17]
Kärkkäinen, J., Sanders, P., Burkhardt, S.: Linear work suffix array construction. J. ACM53(6), 918–936 (Nov 2006). https://doi.org/10.1145/1217856.1217858
-
[18]
Kempa, D., Kociumaka, T.: String synchronizing sets: sublinear-time BWT con- structionandoptimalLCEdatastructure.In:Proceedingsofthe51stAnnualACM SIGACT Symposium on Theory of Computing. p. 756–767. STOC ’19, ACM (Jun 2019). https://doi.org/10.1145/3313276.3316368
-
[19]
Approximation Algorithms for Stochastic Minimum-Norm Combinatorial Optimization , booktitle =
Kempa, D., Kociumaka, T.: Resolution of the Burrows-Wheeler Transform conjec- ture. In: Proc. 61st Annual IEEE Symposium on Foundations of Computer Science (FOCS). pp. 1014–1025 (2020). https://doi.org/10.1109/FOCS46700.2020.00097
-
[20]
Kempa, D., Kociumaka, T.: Breaking the O(n)-barrier in the construction of com- pressed suffix arrays and suffix trees. In: Bansal, N., Nagarajan, V. (eds.) Proceed- ings of the 2023 ACM-SIAM Symposium on Discrete Algorithms, SODA 2023. pp. 5122–5202. SIAM (2023). https://doi.org/10.1137/1.9781611977554.ch187
-
[21]
Köppl, D., Kucherov, G.: Smallest suffixient set maintenance in near-real-time (2026), https://arxiv.org/abs/2604.27548, accepted to MFCS 2026
work page internal anchor Pith review Pith/arXiv arXiv 2026
-
[22]
The 1000 Genomes Project Consortium: A global reference for human genetic vari- ation. Nature526(7571), 68–74 (2015). https://doi.org/10.1038/nature15393
-
[23]
Current Opinion in Genetics & Development15(6), 589–594 (2005)
Medini, D., Donati, C., Tettelin, H., Masignani, V., Rappuoli, R.: The microbial pan-genome. Current Opinion in Genetics & Development15(6), 589–594 (2005). https://doi.org/10.1016/j.gde.2005.09.006
-
[24]
Navarro, G.: Wavelet trees for all. J. Discrete Algorithms25, 2–20 (2014). https: //doi.org/10.1016/J.JDA.2013.07.004
-
[25]
ACM Comput
Navarro, G.: Indexing Highly Repetitive String Collections, Part I: Repetitiveness Measures. ACM Comput. Surv.54(2), 29:1–29:31 (2022). https://doi.org/10.1145/ 3434399
2022
-
[26]
ACM Comput
Navarro, G.: Indexing Highly Repetitive String Collections, Part II: Compressed Indexes. ACM Comput. Surv.54(2), 26:1–26:31 (2022). https://doi.org/10.1145/ 3432999
2022
-
[27]
In: Badkobeh, G., Radoszewski, J., Tonellotto, N., Baeza-Yates, R
Navarro, G., Romana, G., Urbina, C.: Smallest suffixient sets as a repetitiveness measure. In: Badkobeh, G., Radoszewski, J., Tonellotto, N., Baeza-Yates, R. (eds.) Proc. 32nd International Symposium on String Processing and Information Re- trieval (SPIRE 2025). Lecture Notes in Computer Science, vol. 16073, pp. 217–232. Springer (2025). https://doi.org/1...
-
[28]
Prezza, N.: Compressed Computation for Text Indexing. Ph.D. thesis, University of Udine (2016)
2016
-
[29]
Algorithmica65(3), 685–709 (2013)
Raskhodnikova, S., Ron, D., Rubinfeld, R., Smith, A.D.: Sublinear algorithms for approximating string compressibility. Algorithmica65(3), 685–709 (2013). https: //doi.org/10.1007/S00453-012-9618-6 Computing Smallest Suffixient Arrays in Sublinear Time 15
-
[30]
Shibata, H., Bannai, H.: String representation in suffixient set size space (2026), https://arxiv.org/abs/2604.04377
work page internal anchor Pith review Pith/arXiv arXiv 2026
-
[31]
Nature604(7906), 437–446 (2022)
Wang, T., Antonacci-Fulton, L., Howe, K., Lawson, H.A., Lucas, J.K., Phillippy, A.M., Popejoy, A.B., Asri, M., Carson, C., Chaisson, M.J., et al.: The human pangenome project: a global resource to map genomic diversity. Nature604(7906), 437–446 (2022). https://doi.org/10.1038/s41586-022-04601-8 A Sublinear time adaptation of Cenzato et al.’s algorithm Cen...
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.