pith. sign in

arxiv: 2606.31034 · v1 · pith:BY7U4KJEnew · submitted 2026-06-30 · 💻 cs.DS

Practical Linear-Time Computation of Smallest Suffixient Sets

Pith reviewed 2026-07-01 03:39 UTC · model grok-4.3

classification 💻 cs.DS
keywords suffixient arrayslinear-time constructionsuffix arraypattern matchingrepetitive stringsstring indexingone-pass algorithm
0
0 comments X

The pith

A linear-time one-pass algorithm constructs smallest suffixient sets from suffix array structures.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces the first algorithm for building smallest suffixient sets that runs in linear time and processes the input structures in a single pass. Earlier constructions either required multiple passes or exceeded linear time, restricting their use on massive repetitive texts. If the new method holds, it removes the main construction bottleneck and lets these compact indexes become practical for pattern matching on very large collections. The authors supply a working implementation and show it improves the space-time tradeoff over prior methods on real data.

Core claim

We present the first construction algorithm that is linear-time, one-pass over the structures, and implemented and practical, making suffixient arrays usable on large text collections where it dominates the space/time tradeoff map of existing constructions.

What carries the argument

The one-pass linear-time construction procedure that derives the smallest suffixient set directly from the provided suffix array and related structures.

If this is right

  • Suffixient arrays become feasible to build on inputs too large for prior methods.
  • Pattern-matching indexes for highly repetitive texts can be constructed without quadratic bottlenecks.
  • The space advantage of suffixient arrays over run-length BWT becomes attainable in practice for big collections.
  • Construction time no longer limits the choice of index on massive versioned or genomic data.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The technique could extend to other string indexes that rely on suffix-array input if similar one-pass reductions exist.
  • In settings where suffix arrays are already computed for other reasons, the added cost of suffixient sets drops to near zero.
  • Empirical dominance on current test sets suggests the method may scale to inputs orders of magnitude larger than those previously feasible.

Load-bearing premise

Suffix array data structures are supplied as input and the one-pass code truly runs in linear time without hidden quadratic costs on the tested collections.

What would settle it

An experiment that measures superlinear construction time on some large repetitive collection, or an input where the output set is not the smallest possible suffixient set.

Figures

Figures reproduced from arXiv: 2606.31034 by Francisco Olivares, Gonzalo Navarro.

Figure 1
Figure 1. Figure 1: Space usage and running time comparison between all construction algorithms on DNA sequences from Pizza&Chili collection. Logscale in the space coordinate [PITH_FULL_IMAGE:figures/full_fig_p010_1.png] view at source ↗
Figure 2
Figure 2. Figure 2: Space usage and running time comparison between all construction algorithms on einstein.en.txt and sources sequences from Pizza&Chili collection. Logscale in the space coordinate. that Onl performs noticeably better when χ is small, while the other construc￾tions (including ours) are much more sensitive to n. 5 Conclusions and Future Work We have proposed the first algorithm that is (i) one-pass (over the … view at source ↗
read the original abstract

Suffixient arrays are recent structures that have attracted attention because they offer relevant pattern matching functionality in less asymptotic space than the Run-Length BWT, the de-facto standard to index highly repetitive string collections. Various algorithms exist for building them from the suffix array data structures. We present the first construction algorithm that is (i) linear-time, (ii) one-pass over the structures, and (iii) implemented and practical. This makes the construction particularly useful on large text collections, which we demonstrate empirically by showing that it dominates the space/time tradeoff map of the implemented constructions.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

0 major / 3 minor

Summary. The paper presents the first construction algorithm for smallest suffixient sets that runs in linear time, makes a single pass over the suffix array and related structures, and has been implemented practically. It claims empirical dominance over prior constructions in the space/time tradeoff for pattern matching on large repetitive text collections.

Significance. If the linearity and one-pass claims hold, the result is significant for the field: suffixient arrays provide relevant pattern matching in asymptotically less space than the RLBWT on highly repetitive collections, and a practical linear-time builder removes a key obstacle to their deployment on large inputs. The combination of a machine-checkable O(n) analysis via constant-time per-entry operations, an implemented prototype, and consistent experimental dominance on repetitive collections strengthens the contribution.

minor comments (3)
  1. §4 (experimental setup): the paper should explicitly list the precise metrics (construction time, peak memory, query time) and the exact set of prior algorithms against which dominance is claimed, to allow direct reproduction of the tradeoff map.
  2. Algorithm 1 and surrounding text: the constant-time operations per suffix-array entry are stated but the handling of the LCP array in the single pass could be annotated with explicit time bounds to make the O(n) claim immediately verifiable from the pseudocode.
  3. Figure 3 caption: the y-axis scale and the precise meaning of 'dominance' (Pareto front or strict improvement on both axes) should be clarified so readers can interpret the plotted points without ambiguity.

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for the positive evaluation of our contribution and the recommendation of minor revision. The report contains no specific major comments to address.

Circularity Check

0 steps flagged

No significant circularity; algorithmic construction is self-contained

full rationale

The paper introduces a new linear-time, one-pass construction algorithm for smallest suffixient sets from suffix-array structures, supported by pseudocode, time analysis establishing O(n) via constant-time per-entry operations, and empirical validation on repetitive collections. No self-definitional reductions, fitted inputs renamed as predictions, or load-bearing self-citations appear in the derivation chain. The central claim rests on independent algorithmic design and implementation details rather than prior author results or ansatzes. The contribution is a practical algorithmic advance verified externally by the provided analysis and experiments.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claim rests on standard assumptions in string algorithms about the availability of suffix array structures as input and the correctness of the one-pass linear-time construction for the domain of repetitive texts.

axioms (1)
  • domain assumption Suffix array and related data structures are available or can be used as input to the construction.
    The algorithm is described as building from suffix array data structures.

pith-pipeline@v0.9.1-grok · 5611 in / 1056 out tokens · 41751 ms · 2026-07-01T03:39:10.524349+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

17 extracted references · 12 canonical work pages · 4 internal anchors

  1. [1]

    Journal of Algorithms 14(3), 344–370 (1993)

    Berkman, O., Schieber, B., Vishkin, U.: Optimal doubly logarithmic parallel algo- rithms based on finding all nearest smaller values. Journal of Algorithms 14(3), 344–370 (1993). https://doi.org/10.1006/jagm.1993.1018

  2. [2]

    Bonizzoni, P., Gao, Y., Riccardi, B.: Constructing suffixient arrays revisited (2026), https://arxiv.org/abs/2605.04258

  3. [3]

    Algorithms Mol

    Boucher, C., Gagie, T., Kuhnle, A., Langmead, B., Manzini, G., Mun, T.: Prefix- free parsing for building big BWTs. Algorithms Mol. Biol. 14(1), 13:1–13:15 (2019). https://doi.org/10.1186/S13015-019-0148-5

  4. [4]

    Burrows, M., Wheeler, D.: A block sorting lossless data compression algorithm. Tech. Rep. 124, Digital Equipment Corporation (1994)

  5. [5]

    Suffixient arrays: a new efficient suffix array compression technique

    Cenzato, D., Depuydt, L., Gagie, T., Kim, S.H., Manzini, G., Olivares, F., Prezza, N.: Suffixient arrays: a new efficient suffix array compression technique. CoRR abs/2407.18753 (2025). https://doi.org/10.48550/ARXIV.2407.18753

  6. [6]

    In: Lipták, Z., de Moura, E.S., Figueroa, K., Baeza-Yates, R

    Cenzato, D., Olivares, F., Prezza, N.: On computing the smallest suffixient set. In: Proceedings of 31st International Symposium on String Processing and Information Retrieval, SPIRE 2024. Lecture Notes in Computer Science, vol. 14899, pp. 73–87 (2024). https://doi.org/10.1007/978-3-031-72200-4_6

  7. [7]

    Testing Suffixient Sets

    Cenzato, D., Olivares, F., Prezza, N.: Testing suffixient sets (2025), https://arxiv. org/abs/2506.08225

  8. [8]

    ACM Transactions on Algorithms 22(1), article 7 (2025)

    Cobas, D., Gagie, T., Navarro, G.: Fast and small subsampled r-indexes. ACM Transactions on Algorithms 22(1), article 7 (2025)

  9. [9]

    CoRRabs/2312.01359(2023)

    Depuydt, L., Gagie, T., Langmead, B., Manzini, G., Prezza, N.: Suffixient sets. CoRR abs/2312.01359 (2023). https://doi.org/10.48550/ARXIV.2312.01359

  10. [10]

    Ferragina, P., Manzini, G.: Indexing compressed text. J. ACM 52(4), 552–581 (2005). https://doi.org/10.1145/1082036.1082039

  11. [11]

    Fujimaru, H., Navarro, G., Olivares, F., Radoszewski, J., Romana, G., Urbina, C.: Computing smallest suffixient arrays in almost sublinear time (2026), submitted

  12. [12]

    Fujimaru, H., Navarro, G., Romana, G., Urbina, C.: Smallest suffixient sets: Effec- tiveness, resilience, and calculation (2026), https://arxiv.org/abs/2506.05638

  13. [13]

    Gagie, T., Navarro, G., Prezza, N.: Fully functional suffix trees and optimal text searching in bwt-runs bounded space. J. ACM 67(1), 2:1–2:54 (2020). https://doi. org/10.1145/3375890

  14. [14]

    2-level quasi-planarity or how caterpillars climb (spqr-)trees

    Kempa, D., Kociumaka, T.: Breaking the O(n)-Barrier in the Construction of Com- pressed Suffix Arrays and Suffix Trees, pp. 5122–5202. https://doi.org/10.1137/1. 9781611977554.ch187, https://epubs.siam.org/doi/abs/10.1137/1.9781611977554. ch187

  15. [15]

    Köppl, D., Kucherov, G.: Smallest suffixient set maintenance in near-real-time (2026), https://arxiv.org/abs/2604.27548 Practical Linear-Time Computation of Smallest Suffixient Sets 13

  16. [16]

    Journal of Computational Biology 17(3), 281–308 (2010)

    Mäkinen, V., Navarro, G., Sirén, J., Välimäki, N.: Storage and retrieval of highly repetitive sequence collections. Journal of Computational Biology 17(3), 281–308 (2010)

  17. [17]

    In: Proc

    Navarro, G., Romana, G., Urbina, C.: Smallest suffixient sets as a repetitiveness measure. In: Proc. 32nd International Symposium on String Processing and Infor- mation Retrieval (SPIRE) (2025), to appear