pith. sign in

arxiv: 2506.14734 · v6 · submitted 2025-06-17 · 💻 cs.DS

Compressing Suffix Trees by Path Decompositions

Pith reviewed 2026-05-19 08:51 UTC · model grok-4.3

classification 💻 cs.DS
keywords suffix treepath decompositionsuffix array samplingBurrows-Wheeler transformI/O modelcompressed indexespattern matchingrepetitiveness measures
0
0 comments X

The pith

Sorting suffix tree leaves by priority functions decomposes the tree into at most r paths and supports a Suffix Array sample of size r.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that specific priority functions applied to the leaves of a suffix tree induce a path decomposition in which the number of distinct paths equals at most r, the number of runs in the Burrows-Wheeler transform of the string. Each such path corresponds to a suffix, and the decomposition directly determines which positions must be sampled in the Suffix Array. With a sample of size at most r the structure supports indexed pattern matching whose I/O cost is proportional to the number of paths traversed rather than to the full text length. This bound is strictly tighter than the 2r sample size required by earlier sampling methods that also target the I/O model.

Core claim

We introduce a new Suffix Array sampling technique based on particular path decompositions of the suffix tree. We prove that sorting the suffix tree leaves by specific priority functions induces a decomposition where the number of distinct paths is bounded by r. This allows us to solve indexed pattern matching efficiently in the I/O model using a Suffix Array sample of size at most r, strictly improving upon the 2r bound of Suffixient Arrays.

What carries the argument

The path decomposition of the suffix tree obtained by ordering leaves according to chosen priority functions; the decomposition groups edges into paths each representing a string suffix and limits their count to r.

If this is right

  • Indexed pattern matching queries run with I/O cost governed by a sample whose size is bounded by r rather than 2r.
  • The space overhead of the sampled Suffix Array becomes proportional to the repetitiveness measure r instead of twice that measure.
  • Cache misses during matching decrease when the text is stored on disk or in external memory because fewer sampled positions need to be fetched.
  • The same decomposition supplies a linear-time construction for the sampled structure once the priority ordering is fixed.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same priority ordering may yield compact representations for other tree-based string indexes that currently sample at twice the run count.
  • Dynamic maintenance of the decomposition could allow online updates to the sampled structure while preserving the r bound.
  • The approach suggests that viewing suffix trees through path covers rather than node or edge covers captures repetitiveness more tightly.

Load-bearing premise

Priority functions exist that order the suffix tree leaves so the induced paths number at most r.

What would settle it

A string whose suffix tree requires more than r distinct paths under every possible choice of priority functions on the leaves.

Figures

Figures reproduced from arXiv: 2506.14734 by Davide Cenzato, Giovanni Manzini, Nicola Prezza, Ragnar Groot Koerkamp, Ruben Becker, Sung-Hwan Kim, Travis Gagie.

Figure 1
Figure 1. Figure 1: Overview of our technique. We sort T ’s suffixes T [i, n] (equivalently, suffix tree leaves) by increasing π(i). In this example, π corresponds to the standard lexicographic order of the text’s suffixes (but π can be more general). This induces a suffix tree path decomposition (an edge-disjoint set of node-to-leaf paths covering all edges) obtained by always following the leftmost path. At this point, we a… view at source ↗
Figure 2
Figure 2. Figure 2: Example where 8 = χ = r + w − 1 = 5 + 4 − 1. On top: the suffix tree for the text T = BBAAAABABB$. Under the tree, BWT(T )[i] is shown, being aligned to the i-th lexicographically smallest suffix. On the bottom: the text with the suffixient set chosen by the described procedure and the Weiner link tree, in which each node corresponds to an internal node of the suffix tree. Strings indicate the root-to-node… view at source ↗
Figure 3
Figure 3. Figure 3: Consider this particular string T and take π = id to be the identity function. Then, LPFπ = LPF is the classic Longest Previous Factor array. In the third line, we put in boxes the irreducible LPF positions i such that LPF[i] > 0; these are the beginnings of our phrases (7 phrases in total). For example, consider such a position i = 9. Since LPF[9] = 6, the corresponding phrase is T [9, 9 + 6 − 1] = T [9, … view at source ↗
Figure 4
Figure 4. Figure 4: Example of the STPD obtained by taking π = IPA to be the colexicographic rank of the text’s prefixes. For a better visualization, we do not sort suffix tree leaves according to π as this would cause some edges to cross. Paths T [j, n] are colored according to the color of j ∈ st-colex−. To see how the paths are built, consider the Prefix Array PA = [11, 1, 2, 10, 9, 3, 5, 7, 4, 6, 8] and the generalized Lo… view at source ↗
Figure 5
Figure 5. Figure 5: Resources (time and working space) used by st-colex−, r-index, move-r, and the Suffix Array to find primary occurrences of a set of 105 patterns of length 30, 100, 1000, and 10000. Note that our index deviates from the usual Pareto curve (smaller space, larger query time) and always dominates by a wide margin all other solutions in both dimensions. 40 [PITH_FULL_IMAGE:figures/full_fig_p041_5.png] view at source ↗
Figure 6
Figure 6. Figure 6: Resources (time and working space) used by st-colex−, r-index, move-r, and the Suffix Array to locate all occurrences of a set of 105 patterns of length 30, 100, 1000, and 10000. Being based on computing the ϕ¯ function, our locate mechanism cannot escape the Pareto curve when occ is large (top left plot — about 300 occurrences per pattern on average). Nevertheless, our work shows for the first time that a… view at source ↗
Figure 7
Figure 7. Figure 7: Visualization of a smallest suffixient set for string AACGCGCGAA$. The corresponding suffixient array is sA = [11, 2, 9, 3, 7, 4] (assuming alphabet order $ < A < C < G). We show how pattern matching works with an example. Imagine the task of matching pattern P = CGCGA on T , and assume that P does occur in T . Since the alphabet has cardinality at least 2, then the empty string ϵ is right-maximal, hence t… view at source ↗
read the original abstract

The suffix tree is arguably the most fundamental data structure on strings: introduced by Weiner (SWAT 1973) and McCreight (JACM 1976), it allows solving a myriad of computational problems on strings in linear time. Motivated by its large space usage, subsequent research focused first on reducing its size by a constant factor via Suffix Arrays, and later on reaching space proportional to the size of the compressed string. Modern compressed indexes, such as the $r$-index (Gagie et al., SODA 2018), fit in space proportional to $r$, the number of runs in the Burrows-Wheeler transform (a strong and universal repetitiveness measure). These advances, however, came with a price: while modern compressed indexes boast optimal bounds in the RAM model, they are often orders of magnitude slower than uncompressed counterparts in practice due to catastrophic cache locality. This reality gap highlights that Big-O complexity in the RAM model has become a misleading predictor of real-world performance, leaving a critical question unanswered: can we design compressed indexes that are efficient in the I/O model of computation? We answer this in the affirmative by introducing a new Suffix Array sampling technique based on particular path decompositions of the suffix tree. We prove that sorting the suffix tree leaves by specific priority functions induces a decomposition where the number of distinct paths (each corresponding to a string suffix) is bounded by $r$. This allows us to solve indexed pattern matching efficiently in the I/O model using a Suffix Array sample of size at most $r$, strictly improving upon the (tight) $2r$ bound of Suffixient Arrays, another recent compressed Suffix Array sampling technique.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces a suffix array sampling technique based on path decompositions of the suffix tree. It claims that sorting the leaves of the suffix tree according to specific priority functions produces a decomposition into at most r distinct paths (each a suffix), where r is the number of runs in the Burrows-Wheeler transform. This yields an I/O-efficient indexed pattern matching structure whose suffix array sample has size at most r, strictly improving on the tight 2r bound achieved by Suffixient Arrays.

Significance. If the claimed r-bound holds, the result would be a notable improvement in the design of compressed indexes for the I/O model, directly addressing the practical cache-locality shortcomings of existing r-based structures while preserving the strong repetitiveness measure r. The path-decomposition viewpoint on suffix trees offers a fresh angle that could influence future work on I/O-efficient string algorithms.

major comments (2)
  1. [Section 4] Section 4 (Priority Functions and Induced Decomposition): The manuscript asserts that particular priority functions on suffix-tree leaves induce a path decomposition with at most r distinct paths, yet neither the explicit definition of these functions nor the combinatorial steps that map the sorted leaf order to BWT runs (ensuring the path count stays at most r rather than 2r) are supplied. This argument is load-bearing for the central claim of a strict improvement over Suffixient Arrays.
  2. [Theorem 5.1] Theorem 5.1 (or the main bounding theorem): The proof that the number of distinct paths is bounded by r is presented without an induction, counting argument, or explicit reduction to the run-length structure of the BWT. Without these steps it is impossible to confirm that the construction avoids the extra factor that appears in the 2r case.
minor comments (2)
  1. [Abstract] The abstract refers to 'specific priority functions' without even a one-sentence characterization; adding a high-level description would help readers assess the novelty immediately.
  2. [Figure 1] Figure 1 (path decomposition example): The diagram would be clearer if the priority values used for leaf ordering were annotated on the leaves or edges.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their thoughtful review and for recognizing the potential significance of our path-decomposition approach for I/O-efficient compressed indexes. We address each major comment below and will incorporate the requested clarifications into a revised manuscript.

read point-by-point responses
  1. Referee: [Section 4] Section 4 (Priority Functions and Induced Decomposition): The manuscript asserts that particular priority functions on suffix-tree leaves induce a path decomposition with at most r distinct paths, yet neither the explicit definition of these functions nor the combinatorial steps that map the sorted leaf order to BWT runs (ensuring the path count stays at most r rather than 2r) are supplied. This argument is load-bearing for the central claim of a strict improvement over Suffixient Arrays.

    Authors: We agree that the current draft of Section 4 would benefit from greater explicitness. In the revision we will supply the precise definitions of the two priority functions used to order the suffix-tree leaves. We will also add a self-contained combinatorial argument that shows how the resulting leaf order partitions the tree into paths whose number is bounded by the number of BWT runs r, rather than 2r, by directly relating consecutive leaves in the order to run boundaries in the BWT. revision: yes

  2. Referee: [Theorem 5.1] Theorem 5.1 (or the main bounding theorem): The proof that the number of distinct paths is bounded by r is presented without an induction, counting argument, or explicit reduction to the run-length structure of the BWT. Without these steps it is impossible to confirm that the construction avoids the extra factor that appears in the 2r case.

    Authors: We accept that the proof of Theorem 5.1 in the submitted version is too terse. We will expand it to include an explicit counting argument (or a short induction on the number of BWT runs) that reduces the number of distinct paths to r. The revised proof will highlight the structural property of our priority ordering that prevents the extra factor incurred by earlier 2r constructions. revision: yes

Circularity Check

0 steps flagged

No circularity: r-bound on path decompositions is a self-contained combinatorial proof

full rationale

The paper's central derivation is a proof that sorting suffix tree leaves by specific (explicitly constructed) priority functions induces a path decomposition whose distinct paths number at most r, the number of BWT runs. This is presented as an independent combinatorial result that directly yields an SA sample of size r for I/O-efficient matching, improving on the cited 2r bound of Suffixient Arrays. No step reduces by definition, by fitting, or by self-citation chain to the target quantity itself; the existence of the priority functions and the r-bound are established by explicit construction and argument rather than by renaming or tautological redefinition of inputs. The derivation is therefore self-contained and externally verifiable by combinatorial inspection.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claim rests on standard properties of suffix trees and the Burrows-Wheeler transform together with the existence of suitable priority functions; no free parameters, invented entities, or ad-hoc axioms are introduced in the abstract.

axioms (1)
  • standard math Standard combinatorial properties of suffix trees and the Burrows-Wheeler transform hold.
    Invoked implicitly when relating r to the number of distinct paths in the decomposition.

pith-pipeline@v0.9.0 · 5858 in / 1288 out tokens · 45933 ms · 2026-05-19T08:51:17.022287+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. More efficient PBWT prefix-array access via batching

    cs.DS 2026-05 unverdicted novelty 5.0

    Batching queries to accumulate r lg(h)/lg(r) substrings with average lg(r)/lg(h) haplotypes reported per substring enables O(r log h) bits and constant time per haplotype in PBWT prefix-array access.

Reference graph

Works this paper leans on

55 extracted references · 55 canonical work pages · cited by 1 Pith paper · 1 internal anchor

  1. [1]

    Pan-genomic matching statistics for targeted nanopore sequencing

    Omar Ahmed, Massimiliano Rossi, Sam Kovaka, Michael C Schatz, Travis Gagie, Christina Boucher, and Ben Langmead. Pan-genomic matching statistics for targeted nanopore sequencing. Iscience, 24(6), 2021

  2. [2]

    Spumoni 2: improved classification using a pangenome index of minimizer digests

    Omar Y Ahmed, Massimiliano Rossi, Travis Gagie, Christina Boucher, and Ben Langmead. Spumoni 2: improved classification using a pangenome index of minimizer digests. Genome Biology, 24(1):122, 2023

  3. [3]

    Monotone minimal perfect hashing: searching a sorted table with o(1) accesses

    Djamal Belazzougui, Paolo Boldi, Rasmus Pagh, and Sebastiano Vigna. Monotone minimal perfect hashing: searching a sorted table with o(1) accesses. In Proceedings of the Twentieth Annual ACM-SIAM Symposium on Discrete Algorithms, pages 785–794, 2009. URL: https://dl.acm.org/doi/10.5555/1496770.1496856

  4. [4]

    Puglisi, and Yasuo Tabei

    Djamal Belazzougui, Manuel C´ aceres, Travis Gagie, Pawel Gawrychowski, Juha K¨ arkk¨ ainen, Gonzalo Navarro, Alberto Ord´ o˜ nez Pereira, Simon J. Puglisi, and Yasuo Tabei. Block trees.J. Comput. Syst. Sci. , 117:1–22, 2021

  5. [5]

    Move-r: Optimizing the r-index

    Nico Bertram, Johannes Fischer, and Lukas Nalbach. Move-r: Optimizing the r-index. In Leo Liberti, editor, 22nd International Symposium on Experimental Algorithms, SEA 2024, July 23-26, 2024, Vienna, Austria , volume 301 of LIPIcs, pages 1:1–1:19. Schloss Dagstuhl - Leibniz-Zentrum f¨ ur Informatik, 2024

  6. [6]

    Landau, Rajeev Raman, Kunihiko Sadakane, Srinivasa Rao Satti, and Oren Weimann

    Philip Bille, Gad M. Landau, Rajeev Raman, Kunihiko Sadakane, Srinivasa Rao Satti, and Oren Weimann. Random access to grammar-compressed strings and trees. SIAM Journal on Computing , 44:513–539, 2015

  7. [7]

    Kings, Name Days, Lazy Servants and Magic

    Paolo Boldi and Sebastiano Vigna. Kings, Name Days, Lazy Servants and Magic. In Hiro Ito, Stefano Leonardi, Linda Pagli, and Giuseppe Prencipe, editors, 9th International Conference on Fun with Algorithms (FUN 2018) , volume 100 of Leibniz International Proceedings in Informatics (LIPIcs) , pages 10:1–10:13, Dagstuhl, Germany, 2018. Schloss Dagstuhl – Lei...

  8. [8]

    Michael Burrows and David J. Wheeler. A block sorting lossless data compression algorithm. Technical Report 124, Digital Equipment Corporation, 1994

  9. [9]

    Suffixient arrays: a new efficient suffix array compression technique.arXiv preprint 2407.18753v2, 2025

    Davide Cenzato, Lore Depuydt, Travis Gagie, Sung-Hwan Kim, Giovanni Manzini, Francisco Olivares, and Nicola Prezza. Suffixient arrays: a new efficient suffix array compression technique, 2025. URL: https://arxiv.org/abs/2407.18753

  10. [10]

    Filtering search: a new approach to query-answering

    Bernard Chazelle. Filtering search: a new approach to query-answering. SIAM Journal on Computing, 15(3):703– 724, 1986

  11. [11]

    Optimal-time dictionary-compressed indexes

    Anders Roy Christiansen, Mikko Berggren Ettienne, Tomasz Kociumaka, Gonzalo Navarro, and Nicola Prezza. Optimal-time dictionary-compressed indexes. ACM Trans. Algorithms, 17(1):8:1–8:39, 2021

  12. [12]

    Improved grammar-based compressed indexes

    Francisco Claude and Gonzalo Navarro. Improved grammar-based compressed indexes. In International Sympo- sium on String Processing and Information Retrieval , pages 180–192. Springer, 2012

  13. [13]

    Cleary and W

    John G. Cleary and W. J. Teahan. Unbounded length contexts for PPM. Comput. J., 40(2/3):67–75, 1997

  14. [14]

    µ- PBWT: a lightweight r-indexing of the PBWT for storing and querying UK biobank data

    Davide Cozzi, Massimiliano Rossi, Simone Rubinacci, Travis Gagie, Dominik K¨ oppl, Christina Boucher, and Paola Bonizzoni. µ- PBWT: a lightweight r-indexing of the PBWT for storing and querying UK biobank data. Bioinform., 39(9), 2023

  15. [15]

    Opportunistic data structures with applications

    Paolo Ferragina and Giovanni Manzini. Opportunistic data structures with applications. In Proceedings 41st Annual Symposium on Foundations of Computer Science , pages 390–398, 2000

  16. [16]

    Space-Efficient Preprocessing Schemes for Range Minimum Queries on Static Arrays

    Johannes Fischer and Volker Heun. Space-Efficient Preprocessing Schemes for Range Minimum Queries on Static Arrays. SIAM Journal on Computing , 40:465–492, 2011

  17. [17]

    KATKA: A kraken-like tool with k given at query time

    Travis Gagie, Sana Kashgouli, and Ben Langmead. KATKA: A kraken-like tool with k given at query time. In Diego Arroyuelo and Barbara Poblete, editors, String Processing and Information Retrieval - 29th International Symposium, SPIRE 2022, Concepci´ on, Chile, November 8-10, 2022, Proceedings, volume 13617 of Lecture Notes in Computer Science , pages 191–1...

  18. [18]

    Optimal-time text indexing in bwt-runs bounded space, 2018

    Travis Gagie, Gonzalo Navarro, and Nicola Prezza. Optimal-time text indexing in bwt-runs bounded space, 2018

  19. [19]

    Fully functional suffix trees and optimal text searching in bwt-runs bounded space

    Travis Gagie, Gonzalo Navarro, and Nicola Prezza. Fully functional suffix trees and optimal text searching in bwt-runs bounded space. J. ACM, 67(1), January 2020

  20. [20]

    New indices for text: Pat trees and pat arrays

    Gaston H Gonnet, Ricardo A Baeza-Yates, and Tim Snider. New indices for text: Pat trees and pat arrays. Information Retrieval: Data Structures & Algorithms , 66:82, 1992

  21. [21]

    Compressed suffix arrays and suffix trees with applications to text index- ing and string matching

    Roberto Grossi and Jeffrey Scott Vitter. Compressed suffix arrays and suffix trees with applications to text index- ing and string matching. In Proceedings of the thirty-second annual ACM Symposium on Theory of Computing (STOC), pages 397 – 406, 2000

  22. [22]

    Juha K¨ arkk¨ ainen, Giovanni Manzini, and Simon J. Puglisi. Permuted longest-common-prefix array. InProceedings of the 20th Annual Symposium on Combinatorial Pattern Matching (CPM) , pages 181–192, 2009

  23. [23]

    Resolution of the burrows-wheeler transform conjecture

    Dominik Kempa and Tomasz Kociumaka. Resolution of the burrows-wheeler transform conjecture. In 2020 IEEE 61st Annual Symposium on Foundations of Computer Science (FOCS) , pages 1002–1013. IEEE, 2020. 42

  24. [24]

    Collapsing the hierarchy of compressed data structures: Suffix arrays in optimal compressed space

    Dominik Kempa and Tomasz Kociumaka. Collapsing the hierarchy of compressed data structures: Suffix arrays in optimal compressed space. In 64th IEEE Annual Symposium on Foundations of Computer Science, FOCS 2023, Santa Cruz, CA, USA, November 6-9, 2023 , pages 1877–1886. IEEE, 2023

  25. [25]

    At the roots of dictionary compression: string attractors

    Dominik Kempa and Nicola Prezza. At the roots of dictionary compression: string attractors. In Proceedings of the 50th Annual ACM SIGACT Symposium on Theory of Computing , pages 827–840, 2018

  26. [26]

    Toward a definitive compressibility measure for repet- itive sequences

    Tomasz Kociumaka, Gonzalo Navarro, and Nicola Prezza. Toward a definitive compressibility measure for repet- itive sequences. IEEE Transactions on Information Theory , 69(4):2074–2092, 2022

  27. [27]

    On compressing and indexing repetitive sequences

    Sebastian Kreft and Gonzalo Navarro. On compressing and indexing repetitive sequences. Theoretical Computer Science, 483:115–133, 2013

  28. [28]

    On compressing and indexing repetitive sequences

    Sebastian Kreft and Gonzalo Navarro. On compressing and indexing repetitive sequences. Theor. Comput. Sci. , 483:115–133, 2013

  29. [29]

    Relative Lempel-Ziv compression of genomes for large- scale storage and retrieval

    Shanika Kuruppu, Simon J Puglisi, and Justin Zobel. Relative Lempel-Ziv compression of genomes for large- scale storage and retrieval. In International Symposium on String Processing and Information Retrieval , pages 201–206. Springer, 2010

  30. [30]

    On the complexity of finite sequences

    Abraham Lempel and Jacob Ziv. On the complexity of finite sequences. IEEE Trans. Inf. Theory , 22(1):75–81, 1976

  31. [31]

    Succinct suffix arrays based on run-length encoding

    Veli M¨ akinen and Gonzalo Navarro. Succinct suffix arrays based on run-length encoding. Nordic Journal of Computing, 12(1):40–66, 2005

  32. [32]

    Storage and retrieval of highly repetitive sequence collections

    Veli M¨ akinen, Gonzalo Navarro, Jouni Sir´ en, and Niko V¨ alim¨ aki. Storage and retrieval of highly repetitive sequence collections. Journal of Computational Biology , 17(3):281–308, 2010

  33. [33]

    In: Proceedings of the First Annual ACM-SIAM Symposium on Discrete Algo- rithms, 22-24 January 1990, San Francisco, California, USA

    Udi Manber and Gene Myers. Suffix arrays: A new method for on-line string searches. In David S. Johnson, editor, Proceedings of the First Annual ACM-SIAM Symposium on Discrete Algorithms, 22-24 January 1990, San Fran- cisco, California, USA , pages 319–327. SIAM, 1990. URL: http://dl.acm.org/citation.cfm?id=320176.320218

  34. [34]

    Suffix arrays: a new method for on-line string searches

    Udi Manber and Gene Myers. Suffix arrays: a new method for on-line string searches. SIAM Journal on Computing, 22(5):935–948, 1993

  35. [35]

    McCreight

    Edward M. McCreight. A space-economical suffix tree construction algorithm. Journal of the ACM , 23(2):262– 272, 1976

  36. [36]

    Wavelet trees for all

    Gonzalo Navarro. Wavelet trees for all. Journal of Discrete Algorithms , 25:2–20, 2014

  37. [37]

    Indexing Highly Repetitive String Collections, Part I: Repetitiveness Measures

    Gonzalo Navarro. Indexing Highly Repetitive String Collections, Part I: Repetitiveness Measures. ACM Com- puting Surveys, 54(2):29:1–29:31, 2022

  38. [38]

    Indexing Highly Repetitive String Collections, Part II: Compressed Indexes

    Gonzalo Navarro. Indexing Highly Repetitive String Collections, Part II: Compressed Indexes. ACM Computing Surveys, 54(2):26:1–26:31, 2022

  39. [39]

    Smallest suffixient sets as a repetitiveness measure,

    Gonzalo Navarro, Giuseppe Romana, and Cristian Urbina. Smallest suffixient sets as a repetitiveness measure,

  40. [40]

    URL: https://arxiv.org/abs/2506.05638

  41. [41]

    Optimal-time queries on BWT-runs compressed indexes

    Takaaki Nishimoto and Yasuo Tabei. Optimal-time queries on BWT-runs compressed indexes. In Nikhil Bansal, Emanuela Merelli, and James Worrell, editors, 48th International Colloquium on Automata, Languages, and Programming, ICALP 2021, July 12-16, 2021, Glasgow, Scotland (Virtual Conference) , volume 198 of LIPIcs, pages 101:1–101:15. Schloss Dagstuhl - Le...

  42. [42]

    Optimal substring equality queries with applications to sparse text indexing

    Nicola Prezza. Optimal substring equality queries with applications to sparse text indexing. ACM Trans. Algorithms, 17(1):7:1–7:23, 2021

  43. [43]

    Relative lempel-ziv compression of suffix arrays

    Simon J Puglisi and Bella Zhukova. Relative lempel-ziv compression of suffix arrays. In International Symposium on String Processing and Information Retrieval , pages 89–96. Springer, 2020

  44. [44]

    Finding maximal exact matches using the r-index

    Massimiliano Rossi, Marco Oliva, Paola Bonizzoni, Ben Langmead, Travis Gagie, and Christina Boucher. Finding maximal exact matches using the r-index. J. Comput. Biol. , 29(2):188–194, 2022

  45. [45]

    MONI: A pangenomic index for finding maximal exact matches

    Massimiliano Rossi, Marco Oliva, Ben Langmead, Travis Gagie, and Christina Boucher. MONI: A pangenomic index for finding maximal exact matches. J. Comput. Biol. , 29(2):169–187, 2022

  46. [46]

    Sigmoni: classification of nanopore signal with a compressed pangenome index

    Vikram S Shivakumar, Omar Y Ahmed, Sam Kovaka, Mohsen Zakeri, and Ben Langmead. Sigmoni: classification of nanopore signal with a compressed pangenome index. Bioinformatics, 40(Supplement 1):i287–i296, 2024

  47. [47]

    On-line construction of suffix trees

    Esko Ukkonen. On-line construction of suffix trees. Algorithmica, 14:249–260, 1995

  48. [48]

    Linear pattern matching algorithms

    Peter Weiner. Linear pattern matching algorithms. In 14th Annual Symposium on Switching and Automata Theory (SWAT 1973), pages 1–11. IEEE, 1973

  49. [49]

    Improved metagenomic analysis with kraken 2

    Derrick E Wood, Jennifer Lu, and Ben Langmead. Improved metagenomic analysis with kraken 2. Genome biology, 20:1–13, 2019

  50. [50]

    Kraken: ultrafast metagenomic sequence classification using exact alignments

    Derrick E Wood and Steven L Salzberg. Kraken: ultrafast metagenomic sequence classification using exact alignments. Genome biology, 15:1–12, 2014

  51. [51]

    Movi: a fast and cache-efficient full-text pangenome index

    Mohsen Zakeri, Nathaniel K Brown, Omar Y Ahmed, Travis Gagie, and Ben Langmead. Movi: a fast and cache-efficient full-text pangenome index. iScience, 27(12), 2024. 43 A Suffixient Arrays Like suffixient arrays [9], ours is a technique for sampling the Prefix Array while still maintaining search functionalities. The idea behind suffixient arrays is rather ...

  52. [52]

    BWT( S)[i] = S[SA[i] − 1], where S[0] := S[n]

  53. [53]

    If ISA[ i] < ISA[j] and S[i − 1] = S[j − 1], then ISA[i − 1] < ISA[j − 1], where ISA[0] := ISA[n]

  54. [54]

    BWT[ISA[ i]] = S[i − 1]

  55. [55]

    We proceed with the definition of Range Minimum and Maximum queries

    coBWT[ i] = S[PA[i] + 1] where S[n + 1] := S[1]. We proceed with the definition of Range Minimum and Maximum queries. Definition 25. Given a list L of n integers, the Range Minimum (Maximum) query on L with arguments ℓ, r ∈ [n], returns the index argmink∈[ℓ,r]L[k] (argmaxk∈[ℓ,r]L[k]) of the minimum (max- imum) element in L[ℓ, r]. There exists a data struc...