Practical Linear-Time Computation of Smallest Suffixient Sets

Francisco Olivares; Gonzalo Navarro

arxiv: 2606.31034 · v1 · pith:BY7U4KJEnew · submitted 2026-06-30 · 💻 cs.DS

Practical Linear-Time Computation of Smallest Suffixient Sets

Francisco Olivares , Gonzalo Navarro This is my paper

Pith reviewed 2026-07-01 03:39 UTC · model grok-4.3

classification 💻 cs.DS

keywords suffixient arrayslinear-time constructionsuffix arraypattern matchingrepetitive stringsstring indexingone-pass algorithm

0 comments

The pith

A linear-time one-pass algorithm constructs smallest suffixient sets from suffix array structures.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces the first algorithm for building smallest suffixient sets that runs in linear time and processes the input structures in a single pass. Earlier constructions either required multiple passes or exceeded linear time, restricting their use on massive repetitive texts. If the new method holds, it removes the main construction bottleneck and lets these compact indexes become practical for pattern matching on very large collections. The authors supply a working implementation and show it improves the space-time tradeoff over prior methods on real data.

Core claim

We present the first construction algorithm that is linear-time, one-pass over the structures, and implemented and practical, making suffixient arrays usable on large text collections where it dominates the space/time tradeoff map of existing constructions.

What carries the argument

The one-pass linear-time construction procedure that derives the smallest suffixient set directly from the provided suffix array and related structures.

If this is right

Suffixient arrays become feasible to build on inputs too large for prior methods.
Pattern-matching indexes for highly repetitive texts can be constructed without quadratic bottlenecks.
The space advantage of suffixient arrays over run-length BWT becomes attainable in practice for big collections.
Construction time no longer limits the choice of index on massive versioned or genomic data.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The technique could extend to other string indexes that rely on suffix-array input if similar one-pass reductions exist.
In settings where suffix arrays are already computed for other reasons, the added cost of suffixient sets drops to near zero.
Empirical dominance on current test sets suggests the method may scale to inputs orders of magnitude larger than those previously feasible.

Load-bearing premise

Suffix array data structures are supplied as input and the one-pass code truly runs in linear time without hidden quadratic costs on the tested collections.

What would settle it

An experiment that measures superlinear construction time on some large repetitive collection, or an input where the output set is not the smallest possible suffixient set.

Figures

Figures reproduced from arXiv: 2606.31034 by Francisco Olivares, Gonzalo Navarro.

**Figure 1.** Figure 1: Space usage and running time comparison between all construction algorithms on DNA sequences from Pizza&Chili collection. Logscale in the space coordinate [PITH_FULL_IMAGE:figures/full_fig_p010_1.png] view at source ↗

**Figure 2.** Figure 2: Space usage and running time comparison between all construction algorithms on einstein.en.txt and sources sequences from Pizza&Chili collection. Logscale in the space coordinate. that Onl performs noticeably better when χ is small, while the other constructions (including ours) are much more sensitive to n. 5 Conclusions and Future Work We have proposed the first algorithm that is (i) one-pass (over the … view at source ↗

read the original abstract

Suffixient arrays are recent structures that have attracted attention because they offer relevant pattern matching functionality in less asymptotic space than the Run-Length BWT, the de-facto standard to index highly repetitive string collections. Various algorithms exist for building them from the suffix array data structures. We present the first construction algorithm that is (i) linear-time, (ii) one-pass over the structures, and (iii) implemented and practical. This makes the construction particularly useful on large text collections, which we demonstrate empirically by showing that it dominates the space/time tradeoff map of the implemented constructions.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

This paper delivers the first linear-time one-pass implemented construction for smallest suffixient sets from suffix arrays.

read the letter

This paper delivers the first linear-time one-pass implemented construction for smallest suffixient sets from suffix arrays.

It improves on earlier methods by combining those three properties for the first time. The algorithm processes the suffix array in a single pass with constant-time work per entry, which the analysis supports, and the implementation shows consistent dominance on the space-time tradeoff for repetitive collections.

The experiments back the practicality claim without signs of hidden quadratic costs on the tested data. The pseudocode makes the single-pass logic clear.

One minor limitation is that the input assumes a prebuilt suffix array, so full pipeline costs are not measured. The collections are all highly repetitive, which fits the intended use but narrows the tested range. Neither point undercuts the central result.

This is for researchers working on compressed indexes for repetitive strings who need an efficient way to build suffixient arrays. A reader focused on practical string algorithms would get direct value from the construction and results.

It deserves peer review. The claim is grounded in both the time analysis and the working implementation.

Referee Report

0 major / 3 minor

Summary. The paper presents the first construction algorithm for smallest suffixient sets that runs in linear time, makes a single pass over the suffix array and related structures, and has been implemented practically. It claims empirical dominance over prior constructions in the space/time tradeoff for pattern matching on large repetitive text collections.

Significance. If the linearity and one-pass claims hold, the result is significant for the field: suffixient arrays provide relevant pattern matching in asymptotically less space than the RLBWT on highly repetitive collections, and a practical linear-time builder removes a key obstacle to their deployment on large inputs. The combination of a machine-checkable O(n) analysis via constant-time per-entry operations, an implemented prototype, and consistent experimental dominance on repetitive collections strengthens the contribution.

minor comments (3)

§4 (experimental setup): the paper should explicitly list the precise metrics (construction time, peak memory, query time) and the exact set of prior algorithms against which dominance is claimed, to allow direct reproduction of the tradeoff map.
Algorithm 1 and surrounding text: the constant-time operations per suffix-array entry are stated but the handling of the LCP array in the single pass could be annotated with explicit time bounds to make the O(n) claim immediately verifiable from the pseudocode.
Figure 3 caption: the y-axis scale and the precise meaning of 'dominance' (Pareto front or strict improvement on both axes) should be clarified so readers can interpret the plotted points without ambiguity.

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for the positive evaluation of our contribution and the recommendation of minor revision. The report contains no specific major comments to address.

Circularity Check

0 steps flagged

No significant circularity; algorithmic construction is self-contained

full rationale

The paper introduces a new linear-time, one-pass construction algorithm for smallest suffixient sets from suffix-array structures, supported by pseudocode, time analysis establishing O(n) via constant-time per-entry operations, and empirical validation on repetitive collections. No self-definitional reductions, fitted inputs renamed as predictions, or load-bearing self-citations appear in the derivation chain. The central claim rests on independent algorithmic design and implementation details rather than prior author results or ansatzes. The contribution is a practical algorithmic advance verified externally by the provided analysis and experiments.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claim rests on standard assumptions in string algorithms about the availability of suffix array structures as input and the correctness of the one-pass linear-time construction for the domain of repetitive texts.

axioms (1)

domain assumption Suffix array and related data structures are available or can be used as input to the construction.
The algorithm is described as building from suffix array data structures.

pith-pipeline@v0.9.1-grok · 5611 in / 1056 out tokens · 41751 ms · 2026-07-01T03:39:10.524349+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

17 extracted references · 12 canonical work pages · 4 internal anchors

[1]

Journal of Algorithms 14(3), 344–370 (1993)

Berkman, O., Schieber, B., Vishkin, U.: Optimal doubly logarithmic parallel algo- rithms based on ﬁnding all nearest smaller values. Journal of Algorithms 14(3), 344–370 (1993). https://doi.org/10.1006/jagm.1993.1018

work page doi:10.1006/jagm.1993.1018 1993
[2]

Bonizzoni, P., Gao, Y., Riccardi, B.: Constructing suﬃxient arrays revisited (2026), https://arxiv.org/abs/2605.04258

work page internal anchor Pith review Pith/arXiv arXiv 2026
[3]

Algorithms Mol

Boucher, C., Gagie, T., Kuhnle, A., Langmead, B., Manzini, G., Mun, T.: Preﬁx- free parsing for building big BWTs. Algorithms Mol. Biol. 14(1), 13:1–13:15 (2019). https://doi.org/10.1186/S13015-019-0148-5

work page doi:10.1186/s13015-019-0148-5 2019
[4]

Burrows, M., Wheeler, D.: A block sorting lossless data compression algorithm. Tech. Rep. 124, Digital Equipment Corporation (1994)

1994
[5]

Suffixient arrays: a new efficient suffix array compression technique

Cenzato, D., Depuydt, L., Gagie, T., Kim, S.H., Manzini, G., Olivares, F., Prezza, N.: Suﬃxient arrays: a new eﬃcient suﬃx array compression technique. CoRR abs/2407.18753 (2025). https://doi.org/10.48550/ARXIV.2407.18753

work page doi:10.48550/arxiv.2407.18753 2025
[6]

In: Lipták, Z., de Moura, E.S., Figueroa, K., Baeza-Yates, R

Cenzato, D., Olivares, F., Prezza, N.: On computing the smallest suﬃxient set. In: Proceedings of 31st International Symposium on String Processing and Information Retrieval, SPIRE 2024. Lecture Notes in Computer Science, vol. 14899, pp. 73–87 (2024). https://doi.org/10.1007/978-3-031-72200-4_6

work page doi:10.1007/978-3-031-72200-4_6 2024
[7]

Testing Suffixient Sets

Cenzato, D., Olivares, F., Prezza, N.: Testing suﬃxient sets (2025), https://arxiv. org/abs/2506.08225

work page internal anchor Pith review Pith/arXiv arXiv 2025
[8]

ACM Transactions on Algorithms 22(1), article 7 (2025)

Cobas, D., Gagie, T., Navarro, G.: Fast and small subsampled r-indexes. ACM Transactions on Algorithms 22(1), article 7 (2025)

2025
[9]

CoRRabs/2312.01359(2023)

Depuydt, L., Gagie, T., Langmead, B., Manzini, G., Prezza, N.: Suﬃxient sets. CoRR abs/2312.01359 (2023). https://doi.org/10.48550/ARXIV.2312.01359

work page doi:10.48550/arxiv.2312.01359 2023
[10]

Ferragina, P., Manzini, G.: Indexing compressed text. J. ACM 52(4), 552–581 (2005). https://doi.org/10.1145/1082036.1082039

work page doi:10.1145/1082036.1082039 2005
[11]

Fujimaru, H., Navarro, G., Olivares, F., Radoszewski, J., Romana, G., Urbina, C.: Computing smallest suﬃxient arrays in almost sublinear time (2026), submitted

2026
[12]

Fujimaru, H., Navarro, G., Romana, G., Urbina, C.: Smallest suﬃxient sets: Eﬀec- tiveness, resilience, and calculation (2026), https://arxiv.org/abs/2506.05638

work page internal anchor Pith review Pith/arXiv arXiv 2026
[13]

Gagie, T., Navarro, G., Prezza, N.: Fully functional suﬃx trees and optimal text searching in bwt-runs bounded space. J. ACM 67(1), 2:1–2:54 (2020). https://doi. org/10.1145/3375890

work page doi:10.1145/3375890 2020
[14]

2-level quasi-planarity or how caterpillars climb (spqr-)trees

Kempa, D., Kociumaka, T.: Breaking the O(n)-Barrier in the Construction of Com- pressed Suﬃx Arrays and Suﬃx Trees, pp. 5122–5202. https://doi.org/10.1137/1. 9781611977554.ch187, https://epubs.siam.org/doi/abs/10.1137/1.9781611977554. ch187

work page doi:10.1137/1
[15]

Köppl, D., Kucherov, G.: Smallest suﬃxient set maintenance in near-real-time (2026), https://arxiv.org/abs/2604.27548 Practical Linear-Time Computation of Smallest Suﬃxient Sets 13

work page internal anchor Pith review Pith/arXiv arXiv 2026
[16]

Journal of Computational Biology 17(3), 281–308 (2010)

Mäkinen, V., Navarro, G., Sirén, J., Välimäki, N.: Storage and retrieval of highly repetitive sequence collections. Journal of Computational Biology 17(3), 281–308 (2010)

2010
[17]

In: Proc

Navarro, G., Romana, G., Urbina, C.: Smallest suﬃxient sets as a repetitiveness measure. In: Proc. 32nd International Symposium on String Processing and Infor- mation Retrieval (SPIRE) (2025), to appear

2025

[1] [1]

Journal of Algorithms 14(3), 344–370 (1993)

Berkman, O., Schieber, B., Vishkin, U.: Optimal doubly logarithmic parallel algo- rithms based on ﬁnding all nearest smaller values. Journal of Algorithms 14(3), 344–370 (1993). https://doi.org/10.1006/jagm.1993.1018

work page doi:10.1006/jagm.1993.1018 1993

[2] [2]

Bonizzoni, P., Gao, Y., Riccardi, B.: Constructing suﬃxient arrays revisited (2026), https://arxiv.org/abs/2605.04258

work page internal anchor Pith review Pith/arXiv arXiv 2026

[3] [3]

Algorithms Mol

Boucher, C., Gagie, T., Kuhnle, A., Langmead, B., Manzini, G., Mun, T.: Preﬁx- free parsing for building big BWTs. Algorithms Mol. Biol. 14(1), 13:1–13:15 (2019). https://doi.org/10.1186/S13015-019-0148-5

work page doi:10.1186/s13015-019-0148-5 2019

[4] [4]

Burrows, M., Wheeler, D.: A block sorting lossless data compression algorithm. Tech. Rep. 124, Digital Equipment Corporation (1994)

1994

[5] [5]

Suffixient arrays: a new efficient suffix array compression technique

Cenzato, D., Depuydt, L., Gagie, T., Kim, S.H., Manzini, G., Olivares, F., Prezza, N.: Suﬃxient arrays: a new eﬃcient suﬃx array compression technique. CoRR abs/2407.18753 (2025). https://doi.org/10.48550/ARXIV.2407.18753

work page doi:10.48550/arxiv.2407.18753 2025

[6] [6]

In: Lipták, Z., de Moura, E.S., Figueroa, K., Baeza-Yates, R

Cenzato, D., Olivares, F., Prezza, N.: On computing the smallest suﬃxient set. In: Proceedings of 31st International Symposium on String Processing and Information Retrieval, SPIRE 2024. Lecture Notes in Computer Science, vol. 14899, pp. 73–87 (2024). https://doi.org/10.1007/978-3-031-72200-4_6

work page doi:10.1007/978-3-031-72200-4_6 2024

[7] [7]

Testing Suffixient Sets

Cenzato, D., Olivares, F., Prezza, N.: Testing suﬃxient sets (2025), https://arxiv. org/abs/2506.08225

work page internal anchor Pith review Pith/arXiv arXiv 2025

[8] [8]

ACM Transactions on Algorithms 22(1), article 7 (2025)

Cobas, D., Gagie, T., Navarro, G.: Fast and small subsampled r-indexes. ACM Transactions on Algorithms 22(1), article 7 (2025)

2025

[9] [9]

CoRRabs/2312.01359(2023)

Depuydt, L., Gagie, T., Langmead, B., Manzini, G., Prezza, N.: Suﬃxient sets. CoRR abs/2312.01359 (2023). https://doi.org/10.48550/ARXIV.2312.01359

work page doi:10.48550/arxiv.2312.01359 2023

[10] [10]

Ferragina, P., Manzini, G.: Indexing compressed text. J. ACM 52(4), 552–581 (2005). https://doi.org/10.1145/1082036.1082039

work page doi:10.1145/1082036.1082039 2005

[11] [11]

Fujimaru, H., Navarro, G., Olivares, F., Radoszewski, J., Romana, G., Urbina, C.: Computing smallest suﬃxient arrays in almost sublinear time (2026), submitted

2026

[12] [12]

Fujimaru, H., Navarro, G., Romana, G., Urbina, C.: Smallest suﬃxient sets: Eﬀec- tiveness, resilience, and calculation (2026), https://arxiv.org/abs/2506.05638

work page internal anchor Pith review Pith/arXiv arXiv 2026

[13] [13]

Gagie, T., Navarro, G., Prezza, N.: Fully functional suﬃx trees and optimal text searching in bwt-runs bounded space. J. ACM 67(1), 2:1–2:54 (2020). https://doi. org/10.1145/3375890

work page doi:10.1145/3375890 2020

[14] [14]

2-level quasi-planarity or how caterpillars climb (spqr-)trees

Kempa, D., Kociumaka, T.: Breaking the O(n)-Barrier in the Construction of Com- pressed Suﬃx Arrays and Suﬃx Trees, pp. 5122–5202. https://doi.org/10.1137/1. 9781611977554.ch187, https://epubs.siam.org/doi/abs/10.1137/1.9781611977554. ch187

work page doi:10.1137/1

[15] [15]

Köppl, D., Kucherov, G.: Smallest suﬃxient set maintenance in near-real-time (2026), https://arxiv.org/abs/2604.27548 Practical Linear-Time Computation of Smallest Suﬃxient Sets 13

work page internal anchor Pith review Pith/arXiv arXiv 2026

[16] [16]

Journal of Computational Biology 17(3), 281–308 (2010)

Mäkinen, V., Navarro, G., Sirén, J., Välimäki, N.: Storage and retrieval of highly repetitive sequence collections. Journal of Computational Biology 17(3), 281–308 (2010)

2010

[17] [17]

In: Proc

Navarro, G., Romana, G., Urbina, C.: Smallest suﬃxient sets as a repetitiveness measure. In: Proc. 32nd International Symposium on String Processing and Infor- mation Retrieval (SPIRE) (2025), to appear

2025