Compressing Suffix Trees by Path Decompositions

Davide Cenzato; Giovanni Manzini; Nicola Prezza; Ragnar Groot Koerkamp; Ruben Becker; Sung-Hwan Kim; Travis Gagie

arxiv: 2506.14734 · v6 · submitted 2025-06-17 · 💻 cs.DS

Compressing Suffix Trees by Path Decompositions

Ruben Becker , Davide Cenzato , Travis Gagie , Sung-Hwan Kim , Ragnar Groot Koerkamp , Giovanni Manzini , Nicola Prezza This is my paper

Pith reviewed 2026-05-19 08:51 UTC · model grok-4.3

classification 💻 cs.DS

keywords suffix treepath decompositionsuffix array samplingBurrows-Wheeler transformI/O modelcompressed indexespattern matchingrepetitiveness measures

0 comments

The pith

Sorting suffix tree leaves by priority functions decomposes the tree into at most r paths and supports a Suffix Array sample of size r.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that specific priority functions applied to the leaves of a suffix tree induce a path decomposition in which the number of distinct paths equals at most r, the number of runs in the Burrows-Wheeler transform of the string. Each such path corresponds to a suffix, and the decomposition directly determines which positions must be sampled in the Suffix Array. With a sample of size at most r the structure supports indexed pattern matching whose I/O cost is proportional to the number of paths traversed rather than to the full text length. This bound is strictly tighter than the 2r sample size required by earlier sampling methods that also target the I/O model.

Core claim

We introduce a new Suffix Array sampling technique based on particular path decompositions of the suffix tree. We prove that sorting the suffix tree leaves by specific priority functions induces a decomposition where the number of distinct paths is bounded by r. This allows us to solve indexed pattern matching efficiently in the I/O model using a Suffix Array sample of size at most r, strictly improving upon the 2r bound of Suffixient Arrays.

What carries the argument

The path decomposition of the suffix tree obtained by ordering leaves according to chosen priority functions; the decomposition groups edges into paths each representing a string suffix and limits their count to r.

If this is right

Indexed pattern matching queries run with I/O cost governed by a sample whose size is bounded by r rather than 2r.
The space overhead of the sampled Suffix Array becomes proportional to the repetitiveness measure r instead of twice that measure.
Cache misses during matching decrease when the text is stored on disk or in external memory because fewer sampled positions need to be fetched.
The same decomposition supplies a linear-time construction for the sampled structure once the priority ordering is fixed.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same priority ordering may yield compact representations for other tree-based string indexes that currently sample at twice the run count.
Dynamic maintenance of the decomposition could allow online updates to the sampled structure while preserving the r bound.
The approach suggests that viewing suffix trees through path covers rather than node or edge covers captures repetitiveness more tightly.

Load-bearing premise

Priority functions exist that order the suffix tree leaves so the induced paths number at most r.

What would settle it

A string whose suffix tree requires more than r distinct paths under every possible choice of priority functions on the leaves.

Figures

Figures reproduced from arXiv: 2506.14734 by Davide Cenzato, Giovanni Manzini, Nicola Prezza, Ragnar Groot Koerkamp, Ruben Becker, Sung-Hwan Kim, Travis Gagie.

**Figure 1.** Figure 1: Overview of our technique. We sort T ’s suffixes T [i, n] (equivalently, suffix tree leaves) by increasing π(i). In this example, π corresponds to the standard lexicographic order of the text’s suffixes (but π can be more general). This induces a suffix tree path decomposition (an edge-disjoint set of node-to-leaf paths covering all edges) obtained by always following the leftmost path. At this point, we a… view at source ↗

**Figure 2.** Figure 2: Example where 8 = χ = r + w − 1 = 5 + 4 − 1. On top: the suffix tree for the text T = BBAAAABABB$. Under the tree, BWT(T )[i] is shown, being aligned to the i-th lexicographically smallest suffix. On the bottom: the text with the suffixient set chosen by the described procedure and the Weiner link tree, in which each node corresponds to an internal node of the suffix tree. Strings indicate the root-to-node… view at source ↗

**Figure 3.** Figure 3: Consider this particular string T and take π = id to be the identity function. Then, LPFπ = LPF is the classic Longest Previous Factor array. In the third line, we put in boxes the irreducible LPF positions i such that LPF[i] > 0; these are the beginnings of our phrases (7 phrases in total). For example, consider such a position i = 9. Since LPF[9] = 6, the corresponding phrase is T [9, 9 + 6 − 1] = T [9, … view at source ↗

**Figure 4.** Figure 4: Example of the STPD obtained by taking π = IPA to be the colexicographic rank of the text’s prefixes. For a better visualization, we do not sort suffix tree leaves according to π as this would cause some edges to cross. Paths T [j, n] are colored according to the color of j ∈ st-colex−. To see how the paths are built, consider the Prefix Array PA = [11, 1, 2, 10, 9, 3, 5, 7, 4, 6, 8] and the generalized Lo… view at source ↗

**Figure 5.** Figure 5: Resources (time and working space) used by st-colex−, r-index, move-r, and the Suffix Array to find primary occurrences of a set of 105 patterns of length 30, 100, 1000, and 10000. Note that our index deviates from the usual Pareto curve (smaller space, larger query time) and always dominates by a wide margin all other solutions in both dimensions. 40 [PITH_FULL_IMAGE:figures/full_fig_p041_5.png] view at source ↗

**Figure 6.** Figure 6: Resources (time and working space) used by st-colex−, r-index, move-r, and the Suffix Array to locate all occurrences of a set of 105 patterns of length 30, 100, 1000, and 10000. Being based on computing the ϕ¯ function, our locate mechanism cannot escape the Pareto curve when occ is large (top left plot — about 300 occurrences per pattern on average). Nevertheless, our work shows for the first time that a… view at source ↗

**Figure 7.** Figure 7: Visualization of a smallest suffixient set for string AACGCGCGAA$. The corresponding suffixient array is sA = [11, 2, 9, 3, 7, 4] (assuming alphabet order $ < A < C < G). We show how pattern matching works with an example. Imagine the task of matching pattern P = CGCGA on T , and assume that P does occur in T . Since the alphabet has cardinality at least 2, then the empty string ϵ is right-maximal, hence t… view at source ↗

read the original abstract

The suffix tree is arguably the most fundamental data structure on strings: introduced by Weiner (SWAT 1973) and McCreight (JACM 1976), it allows solving a myriad of computational problems on strings in linear time. Motivated by its large space usage, subsequent research focused first on reducing its size by a constant factor via Suffix Arrays, and later on reaching space proportional to the size of the compressed string. Modern compressed indexes, such as the $r$-index (Gagie et al., SODA 2018), fit in space proportional to $r$, the number of runs in the Burrows-Wheeler transform (a strong and universal repetitiveness measure). These advances, however, came with a price: while modern compressed indexes boast optimal bounds in the RAM model, they are often orders of magnitude slower than uncompressed counterparts in practice due to catastrophic cache locality. This reality gap highlights that Big-O complexity in the RAM model has become a misleading predictor of real-world performance, leaving a critical question unanswered: can we design compressed indexes that are efficient in the I/O model of computation? We answer this in the affirmative by introducing a new Suffix Array sampling technique based on particular path decompositions of the suffix tree. We prove that sorting the suffix tree leaves by specific priority functions induces a decomposition where the number of distinct paths (each corresponding to a string suffix) is bounded by $r$. This allows us to solve indexed pattern matching efficiently in the I/O model using a Suffix Array sample of size at most $r$, strictly improving upon the (tight) $2r$ bound of Suffixient Arrays, another recent compressed Suffix Array sampling technique.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

The paper improves the sampling bound to r for I/O-efficient suffix arrays using path decompositions, but the proof details on priority functions are thin in the abstract.

read the letter

The paper's main contribution is a path-decomposition method that lets them sample the suffix array with only r entries for I/O-efficient matching, improving the previous 2r bound from Suffixient Arrays. If the proof works, this is a nice tightening that could help with practical performance on large strings. They do well at identifying the practical issue with current compressed indexes. The RAM model bounds are nice but don't help when cache and disk I/O become the bottleneck on large strings. Using the suffix tree to guide the sampling while staying within r is a useful direction that builds on the r-index and related structures. The soft spot is the lack of detail on the priority functions and the proof. The abstract asserts that sorting the leaves this way bounds the distinct paths by r, but without the actual functions or the combinatorial reasoning, it's tough to see how they get the improvement over 2r without some extra cost. This is the step the stress test flags, and it needs to be solid in the full text. If the paper provides the definitions and shows why the number of paths stays at r, that would address the concern directly. This work is for researchers in string data structures who are interested in making compressed indexes more practical in external memory. Readers following the r-index line of work would find it relevant and might want to see if the priority functions are efficient to use in code. I recommend sending it for peer review. The claim is specific enough that referees can evaluate the new argument properly, and the focus on I/O makes it worth the time even if some revisions are needed for clarity.

Referee Report

2 major / 2 minor

Summary. The paper introduces a suffix array sampling technique based on path decompositions of the suffix tree. It claims that sorting the leaves of the suffix tree according to specific priority functions produces a decomposition into at most r distinct paths (each a suffix), where r is the number of runs in the Burrows-Wheeler transform. This yields an I/O-efficient indexed pattern matching structure whose suffix array sample has size at most r, strictly improving on the tight 2r bound achieved by Suffixient Arrays.

Significance. If the claimed r-bound holds, the result would be a notable improvement in the design of compressed indexes for the I/O model, directly addressing the practical cache-locality shortcomings of existing r-based structures while preserving the strong repetitiveness measure r. The path-decomposition viewpoint on suffix trees offers a fresh angle that could influence future work on I/O-efficient string algorithms.

major comments (2)

[Section 4] Section 4 (Priority Functions and Induced Decomposition): The manuscript asserts that particular priority functions on suffix-tree leaves induce a path decomposition with at most r distinct paths, yet neither the explicit definition of these functions nor the combinatorial steps that map the sorted leaf order to BWT runs (ensuring the path count stays at most r rather than 2r) are supplied. This argument is load-bearing for the central claim of a strict improvement over Suffixient Arrays.
[Theorem 5.1] Theorem 5.1 (or the main bounding theorem): The proof that the number of distinct paths is bounded by r is presented without an induction, counting argument, or explicit reduction to the run-length structure of the BWT. Without these steps it is impossible to confirm that the construction avoids the extra factor that appears in the 2r case.

minor comments (2)

[Abstract] The abstract refers to 'specific priority functions' without even a one-sentence characterization; adding a high-level description would help readers assess the novelty immediately.
[Figure 1] Figure 1 (path decomposition example): The diagram would be clearer if the priority values used for leaf ordering were annotated on the leaves or edges.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their thoughtful review and for recognizing the potential significance of our path-decomposition approach for I/O-efficient compressed indexes. We address each major comment below and will incorporate the requested clarifications into a revised manuscript.

read point-by-point responses

Referee: [Section 4] Section 4 (Priority Functions and Induced Decomposition): The manuscript asserts that particular priority functions on suffix-tree leaves induce a path decomposition with at most r distinct paths, yet neither the explicit definition of these functions nor the combinatorial steps that map the sorted leaf order to BWT runs (ensuring the path count stays at most r rather than 2r) are supplied. This argument is load-bearing for the central claim of a strict improvement over Suffixient Arrays.

Authors: We agree that the current draft of Section 4 would benefit from greater explicitness. In the revision we will supply the precise definitions of the two priority functions used to order the suffix-tree leaves. We will also add a self-contained combinatorial argument that shows how the resulting leaf order partitions the tree into paths whose number is bounded by the number of BWT runs r, rather than 2r, by directly relating consecutive leaves in the order to run boundaries in the BWT. revision: yes
Referee: [Theorem 5.1] Theorem 5.1 (or the main bounding theorem): The proof that the number of distinct paths is bounded by r is presented without an induction, counting argument, or explicit reduction to the run-length structure of the BWT. Without these steps it is impossible to confirm that the construction avoids the extra factor that appears in the 2r case.

Authors: We accept that the proof of Theorem 5.1 in the submitted version is too terse. We will expand it to include an explicit counting argument (or a short induction on the number of BWT runs) that reduces the number of distinct paths to r. The revised proof will highlight the structural property of our priority ordering that prevents the extra factor incurred by earlier 2r constructions. revision: yes

Circularity Check

0 steps flagged

No circularity: r-bound on path decompositions is a self-contained combinatorial proof

full rationale

The paper's central derivation is a proof that sorting suffix tree leaves by specific (explicitly constructed) priority functions induces a path decomposition whose distinct paths number at most r, the number of BWT runs. This is presented as an independent combinatorial result that directly yields an SA sample of size r for I/O-efficient matching, improving on the cited 2r bound of Suffixient Arrays. No step reduces by definition, by fitting, or by self-citation chain to the target quantity itself; the existence of the priority functions and the r-bound are established by explicit construction and argument rather than by renaming or tautological redefinition of inputs. The derivation is therefore self-contained and externally verifiable by combinatorial inspection.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claim rests on standard properties of suffix trees and the Burrows-Wheeler transform together with the existence of suitable priority functions; no free parameters, invented entities, or ad-hoc axioms are introduced in the abstract.

axioms (1)

standard math Standard combinatorial properties of suffix trees and the Burrows-Wheeler transform hold.
Invoked implicitly when relating r to the number of distinct paths in the decomposition.

pith-pipeline@v0.9.0 · 5858 in / 1288 out tokens · 45933 ms · 2026-05-19T08:51:17.022287+00:00 · methodology

discussion (0)

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/RealityFromDistinction.lean reality_from_one_distinction unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

sorting the suffix tree leaves by specific priority functions induces a decomposition where the number of distinct paths is bounded by r... strictly improving upon the 2r bound of Suffixient Arrays
IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

the number of irreducible values in LPF^π is known to be at most r when π = ISA

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

More efficient PBWT prefix-array access via batching
cs.DS 2026-05 unverdicted novelty 5.0

Batching queries to accumulate r lg(h)/lg(r) substrings with average lg(r)/lg(h) haplotypes reported per substring enables O(r log h) bits and constant time per haplotype in PBWT prefix-array access.

Reference graph

Works this paper leans on

55 extracted references · 55 canonical work pages · cited by 1 Pith paper · 1 internal anchor

[1]

Pan-genomic matching statistics for targeted nanopore sequencing

Omar Ahmed, Massimiliano Rossi, Sam Kovaka, Michael C Schatz, Travis Gagie, Christina Boucher, and Ben Langmead. Pan-genomic matching statistics for targeted nanopore sequencing. Iscience, 24(6), 2021

work page 2021
[2]

Spumoni 2: improved classification using a pangenome index of minimizer digests

Omar Y Ahmed, Massimiliano Rossi, Travis Gagie, Christina Boucher, and Ben Langmead. Spumoni 2: improved classification using a pangenome index of minimizer digests. Genome Biology, 24(1):122, 2023

work page 2023
[3]

Monotone minimal perfect hashing: searching a sorted table with o(1) accesses

Djamal Belazzougui, Paolo Boldi, Rasmus Pagh, and Sebastiano Vigna. Monotone minimal perfect hashing: searching a sorted table with o(1) accesses. In Proceedings of the Twentieth Annual ACM-SIAM Symposium on Discrete Algorithms, pages 785–794, 2009. URL: https://dl.acm.org/doi/10.5555/1496770.1496856

work page doi:10.5555/1496770.1496856 2009
[4]

Puglisi, and Yasuo Tabei

Djamal Belazzougui, Manuel C´ aceres, Travis Gagie, Pawel Gawrychowski, Juha K¨ arkk¨ ainen, Gonzalo Navarro, Alberto Ord´ o˜ nez Pereira, Simon J. Puglisi, and Yasuo Tabei. Block trees.J. Comput. Syst. Sci. , 117:1–22, 2021

work page 2021
[5]

Move-r: Optimizing the r-index

Nico Bertram, Johannes Fischer, and Lukas Nalbach. Move-r: Optimizing the r-index. In Leo Liberti, editor, 22nd International Symposium on Experimental Algorithms, SEA 2024, July 23-26, 2024, Vienna, Austria , volume 301 of LIPIcs, pages 1:1–1:19. Schloss Dagstuhl - Leibniz-Zentrum f¨ ur Informatik, 2024

work page 2024
[6]

Landau, Rajeev Raman, Kunihiko Sadakane, Srinivasa Rao Satti, and Oren Weimann

Philip Bille, Gad M. Landau, Rajeev Raman, Kunihiko Sadakane, Srinivasa Rao Satti, and Oren Weimann. Random access to grammar-compressed strings and trees. SIAM Journal on Computing , 44:513–539, 2015

work page 2015
[7]

Kings, Name Days, Lazy Servants and Magic

Paolo Boldi and Sebastiano Vigna. Kings, Name Days, Lazy Servants and Magic. In Hiro Ito, Stefano Leonardi, Linda Pagli, and Giuseppe Prencipe, editors, 9th International Conference on Fun with Algorithms (FUN 2018) , volume 100 of Leibniz International Proceedings in Informatics (LIPIcs) , pages 10:1–10:13, Dagstuhl, Germany, 2018. Schloss Dagstuhl – Lei...

work page doi:10.4230/lipics.fun.2018.10 2018
[8]

Michael Burrows and David J. Wheeler. A block sorting lossless data compression algorithm. Technical Report 124, Digital Equipment Corporation, 1994

work page 1994
[9]

Suffixient arrays: a new efficient suffix array compression technique.arXiv preprint 2407.18753v2, 2025

Davide Cenzato, Lore Depuydt, Travis Gagie, Sung-Hwan Kim, Giovanni Manzini, Francisco Olivares, and Nicola Prezza. Suffixient arrays: a new efficient suffix array compression technique, 2025. URL: https://arxiv.org/abs/2407.18753

work page arXiv 2025
[10]

Filtering search: a new approach to query-answering

Bernard Chazelle. Filtering search: a new approach to query-answering. SIAM Journal on Computing, 15(3):703– 724, 1986

work page 1986
[11]

Optimal-time dictionary-compressed indexes

Anders Roy Christiansen, Mikko Berggren Ettienne, Tomasz Kociumaka, Gonzalo Navarro, and Nicola Prezza. Optimal-time dictionary-compressed indexes. ACM Trans. Algorithms, 17(1):8:1–8:39, 2021

work page 2021
[12]

Improved grammar-based compressed indexes

Francisco Claude and Gonzalo Navarro. Improved grammar-based compressed indexes. In International Sympo- sium on String Processing and Information Retrieval , pages 180–192. Springer, 2012

work page 2012
[13]

Cleary and W

John G. Cleary and W. J. Teahan. Unbounded length contexts for PPM. Comput. J., 40(2/3):67–75, 1997

work page 1997
[14]

µ- PBWT: a lightweight r-indexing of the PBWT for storing and querying UK biobank data

Davide Cozzi, Massimiliano Rossi, Simone Rubinacci, Travis Gagie, Dominik K¨ oppl, Christina Boucher, and Paola Bonizzoni. µ- PBWT: a lightweight r-indexing of the PBWT for storing and querying UK biobank data. Bioinform., 39(9), 2023

work page 2023
[15]

Opportunistic data structures with applications

Paolo Ferragina and Giovanni Manzini. Opportunistic data structures with applications. In Proceedings 41st Annual Symposium on Foundations of Computer Science , pages 390–398, 2000

work page 2000
[16]

Space-Efficient Preprocessing Schemes for Range Minimum Queries on Static Arrays

Johannes Fischer and Volker Heun. Space-Efficient Preprocessing Schemes for Range Minimum Queries on Static Arrays. SIAM Journal on Computing , 40:465–492, 2011

work page 2011
[17]

KATKA: A kraken-like tool with k given at query time

Travis Gagie, Sana Kashgouli, and Ben Langmead. KATKA: A kraken-like tool with k given at query time. In Diego Arroyuelo and Barbara Poblete, editors, String Processing and Information Retrieval - 29th International Symposium, SPIRE 2022, Concepci´ on, Chile, November 8-10, 2022, Proceedings, volume 13617 of Lecture Notes in Computer Science , pages 191–1...

work page 2022
[18]

Optimal-time text indexing in bwt-runs bounded space, 2018

Travis Gagie, Gonzalo Navarro, and Nicola Prezza. Optimal-time text indexing in bwt-runs bounded space, 2018

work page 2018
[19]

Fully functional suffix trees and optimal text searching in bwt-runs bounded space

Travis Gagie, Gonzalo Navarro, and Nicola Prezza. Fully functional suffix trees and optimal text searching in bwt-runs bounded space. J. ACM, 67(1), January 2020

work page 2020
[20]

New indices for text: Pat trees and pat arrays

Gaston H Gonnet, Ricardo A Baeza-Yates, and Tim Snider. New indices for text: Pat trees and pat arrays. Information Retrieval: Data Structures & Algorithms , 66:82, 1992

work page 1992
[21]

Compressed suffix arrays and suffix trees with applications to text index- ing and string matching

Roberto Grossi and Jeffrey Scott Vitter. Compressed suffix arrays and suffix trees with applications to text index- ing and string matching. In Proceedings of the thirty-second annual ACM Symposium on Theory of Computing (STOC), pages 397 – 406, 2000

work page 2000
[22]

Juha K¨ arkk¨ ainen, Giovanni Manzini, and Simon J. Puglisi. Permuted longest-common-prefix array. InProceedings of the 20th Annual Symposium on Combinatorial Pattern Matching (CPM) , pages 181–192, 2009

work page 2009
[23]

Resolution of the burrows-wheeler transform conjecture

Dominik Kempa and Tomasz Kociumaka. Resolution of the burrows-wheeler transform conjecture. In 2020 IEEE 61st Annual Symposium on Foundations of Computer Science (FOCS) , pages 1002–1013. IEEE, 2020. 42

work page 2020
[24]

Collapsing the hierarchy of compressed data structures: Suffix arrays in optimal compressed space

Dominik Kempa and Tomasz Kociumaka. Collapsing the hierarchy of compressed data structures: Suffix arrays in optimal compressed space. In 64th IEEE Annual Symposium on Foundations of Computer Science, FOCS 2023, Santa Cruz, CA, USA, November 6-9, 2023 , pages 1877–1886. IEEE, 2023

work page 2023
[25]

At the roots of dictionary compression: string attractors

Dominik Kempa and Nicola Prezza. At the roots of dictionary compression: string attractors. In Proceedings of the 50th Annual ACM SIGACT Symposium on Theory of Computing , pages 827–840, 2018

work page 2018
[26]

Toward a definitive compressibility measure for repet- itive sequences

Tomasz Kociumaka, Gonzalo Navarro, and Nicola Prezza. Toward a definitive compressibility measure for repet- itive sequences. IEEE Transactions on Information Theory , 69(4):2074–2092, 2022

work page 2074
[27]

On compressing and indexing repetitive sequences

Sebastian Kreft and Gonzalo Navarro. On compressing and indexing repetitive sequences. Theoretical Computer Science, 483:115–133, 2013

work page 2013
[28]

On compressing and indexing repetitive sequences

Sebastian Kreft and Gonzalo Navarro. On compressing and indexing repetitive sequences. Theor. Comput. Sci. , 483:115–133, 2013

work page 2013
[29]

Relative Lempel-Ziv compression of genomes for large- scale storage and retrieval

Shanika Kuruppu, Simon J Puglisi, and Justin Zobel. Relative Lempel-Ziv compression of genomes for large- scale storage and retrieval. In International Symposium on String Processing and Information Retrieval , pages 201–206. Springer, 2010

work page 2010
[30]

On the complexity of finite sequences

Abraham Lempel and Jacob Ziv. On the complexity of finite sequences. IEEE Trans. Inf. Theory , 22(1):75–81, 1976

work page 1976
[31]

Succinct suffix arrays based on run-length encoding

Veli M¨ akinen and Gonzalo Navarro. Succinct suffix arrays based on run-length encoding. Nordic Journal of Computing, 12(1):40–66, 2005

work page 2005
[32]

Storage and retrieval of highly repetitive sequence collections

Veli M¨ akinen, Gonzalo Navarro, Jouni Sir´ en, and Niko V¨ alim¨ aki. Storage and retrieval of highly repetitive sequence collections. Journal of Computational Biology , 17(3):281–308, 2010

work page 2010
[33]

In: Proceedings of the First Annual ACM-SIAM Symposium on Discrete Algo- rithms, 22-24 January 1990, San Francisco, California, USA

Udi Manber and Gene Myers. Suffix arrays: A new method for on-line string searches. In David S. Johnson, editor, Proceedings of the First Annual ACM-SIAM Symposium on Discrete Algorithms, 22-24 January 1990, San Fran- cisco, California, USA , pages 319–327. SIAM, 1990. URL: http://dl.acm.org/citation.cfm?id=320176.320218

work page arXiv 1990
[34]

Suffix arrays: a new method for on-line string searches

Udi Manber and Gene Myers. Suffix arrays: a new method for on-line string searches. SIAM Journal on Computing, 22(5):935–948, 1993

work page 1993
[35]

McCreight

Edward M. McCreight. A space-economical suffix tree construction algorithm. Journal of the ACM , 23(2):262– 272, 1976

work page 1976
[36]

Wavelet trees for all

Gonzalo Navarro. Wavelet trees for all. Journal of Discrete Algorithms , 25:2–20, 2014

work page 2014
[37]

Indexing Highly Repetitive String Collections, Part I: Repetitiveness Measures

Gonzalo Navarro. Indexing Highly Repetitive String Collections, Part I: Repetitiveness Measures. ACM Com- puting Surveys, 54(2):29:1–29:31, 2022

work page 2022
[38]

Indexing Highly Repetitive String Collections, Part II: Compressed Indexes

Gonzalo Navarro. Indexing Highly Repetitive String Collections, Part II: Compressed Indexes. ACM Computing Surveys, 54(2):26:1–26:31, 2022

work page 2022
[39]

Smallest suffixient sets as a repetitiveness measure,

Gonzalo Navarro, Giuseppe Romana, and Cristian Urbina. Smallest suffixient sets as a repetitiveness measure,

work page
[40]

URL: https://arxiv.org/abs/2506.05638

work page internal anchor Pith review Pith/arXiv arXiv
[41]

Optimal-time queries on BWT-runs compressed indexes

Takaaki Nishimoto and Yasuo Tabei. Optimal-time queries on BWT-runs compressed indexes. In Nikhil Bansal, Emanuela Merelli, and James Worrell, editors, 48th International Colloquium on Automata, Languages, and Programming, ICALP 2021, July 12-16, 2021, Glasgow, Scotland (Virtual Conference) , volume 198 of LIPIcs, pages 101:1–101:15. Schloss Dagstuhl - Le...

work page 2021
[42]

Optimal substring equality queries with applications to sparse text indexing

Nicola Prezza. Optimal substring equality queries with applications to sparse text indexing. ACM Trans. Algorithms, 17(1):7:1–7:23, 2021

work page 2021
[43]

Relative lempel-ziv compression of suffix arrays

Simon J Puglisi and Bella Zhukova. Relative lempel-ziv compression of suffix arrays. In International Symposium on String Processing and Information Retrieval , pages 89–96. Springer, 2020

work page 2020
[44]

Finding maximal exact matches using the r-index

Massimiliano Rossi, Marco Oliva, Paola Bonizzoni, Ben Langmead, Travis Gagie, and Christina Boucher. Finding maximal exact matches using the r-index. J. Comput. Biol. , 29(2):188–194, 2022

work page 2022
[45]

MONI: A pangenomic index for finding maximal exact matches

Massimiliano Rossi, Marco Oliva, Ben Langmead, Travis Gagie, and Christina Boucher. MONI: A pangenomic index for finding maximal exact matches. J. Comput. Biol. , 29(2):169–187, 2022

work page 2022
[46]

Sigmoni: classification of nanopore signal with a compressed pangenome index

Vikram S Shivakumar, Omar Y Ahmed, Sam Kovaka, Mohsen Zakeri, and Ben Langmead. Sigmoni: classification of nanopore signal with a compressed pangenome index. Bioinformatics, 40(Supplement 1):i287–i296, 2024

work page 2024
[47]

On-line construction of suffix trees

Esko Ukkonen. On-line construction of suffix trees. Algorithmica, 14:249–260, 1995

work page 1995
[48]

Linear pattern matching algorithms

Peter Weiner. Linear pattern matching algorithms. In 14th Annual Symposium on Switching and Automata Theory (SWAT 1973), pages 1–11. IEEE, 1973

work page 1973
[49]

Improved metagenomic analysis with kraken 2

Derrick E Wood, Jennifer Lu, and Ben Langmead. Improved metagenomic analysis with kraken 2. Genome biology, 20:1–13, 2019

work page 2019
[50]

Kraken: ultrafast metagenomic sequence classification using exact alignments

Derrick E Wood and Steven L Salzberg. Kraken: ultrafast metagenomic sequence classification using exact alignments. Genome biology, 15:1–12, 2014

work page 2014
[51]

Movi: a fast and cache-efficient full-text pangenome index

Mohsen Zakeri, Nathaniel K Brown, Omar Y Ahmed, Travis Gagie, and Ben Langmead. Movi: a fast and cache-efficient full-text pangenome index. iScience, 27(12), 2024. 43 A Suffixient Arrays Like suffixient arrays [9], ours is a technique for sampling the Prefix Array while still maintaining search functionalities. The idea behind suffixient arrays is rather ...

work page 2024
[52]

BWT( S)[i] = S[SA[i] − 1], where S[0] := S[n]

work page
[53]

If ISA[ i] < ISA[j] and S[i − 1] = S[j − 1], then ISA[i − 1] < ISA[j − 1], where ISA[0] := ISA[n]

work page
[54]

BWT[ISA[ i]] = S[i − 1]

work page
[55]

We proceed with the definition of Range Minimum and Maximum queries

coBWT[ i] = S[PA[i] + 1] where S[n + 1] := S[1]. We proceed with the definition of Range Minimum and Maximum queries. Definition 25. Given a list L of n integers, the Range Minimum (Maximum) query on L with arguments ℓ, r ∈ [n], returns the index argmink∈[ℓ,r]L[k] (argmaxk∈[ℓ,r]L[k]) of the minimum (max- imum) element in L[ℓ, r]. There exists a data struc...

work page

[1] [1]

Pan-genomic matching statistics for targeted nanopore sequencing

Omar Ahmed, Massimiliano Rossi, Sam Kovaka, Michael C Schatz, Travis Gagie, Christina Boucher, and Ben Langmead. Pan-genomic matching statistics for targeted nanopore sequencing. Iscience, 24(6), 2021

work page 2021

[2] [2]

Spumoni 2: improved classification using a pangenome index of minimizer digests

Omar Y Ahmed, Massimiliano Rossi, Travis Gagie, Christina Boucher, and Ben Langmead. Spumoni 2: improved classification using a pangenome index of minimizer digests. Genome Biology, 24(1):122, 2023

work page 2023

[3] [3]

Monotone minimal perfect hashing: searching a sorted table with o(1) accesses

Djamal Belazzougui, Paolo Boldi, Rasmus Pagh, and Sebastiano Vigna. Monotone minimal perfect hashing: searching a sorted table with o(1) accesses. In Proceedings of the Twentieth Annual ACM-SIAM Symposium on Discrete Algorithms, pages 785–794, 2009. URL: https://dl.acm.org/doi/10.5555/1496770.1496856

work page doi:10.5555/1496770.1496856 2009

[4] [4]

Puglisi, and Yasuo Tabei

Djamal Belazzougui, Manuel C´ aceres, Travis Gagie, Pawel Gawrychowski, Juha K¨ arkk¨ ainen, Gonzalo Navarro, Alberto Ord´ o˜ nez Pereira, Simon J. Puglisi, and Yasuo Tabei. Block trees.J. Comput. Syst. Sci. , 117:1–22, 2021

work page 2021

[5] [5]

Move-r: Optimizing the r-index

Nico Bertram, Johannes Fischer, and Lukas Nalbach. Move-r: Optimizing the r-index. In Leo Liberti, editor, 22nd International Symposium on Experimental Algorithms, SEA 2024, July 23-26, 2024, Vienna, Austria , volume 301 of LIPIcs, pages 1:1–1:19. Schloss Dagstuhl - Leibniz-Zentrum f¨ ur Informatik, 2024

work page 2024

[6] [6]

Landau, Rajeev Raman, Kunihiko Sadakane, Srinivasa Rao Satti, and Oren Weimann

Philip Bille, Gad M. Landau, Rajeev Raman, Kunihiko Sadakane, Srinivasa Rao Satti, and Oren Weimann. Random access to grammar-compressed strings and trees. SIAM Journal on Computing , 44:513–539, 2015

work page 2015

[7] [7]

Kings, Name Days, Lazy Servants and Magic

Paolo Boldi and Sebastiano Vigna. Kings, Name Days, Lazy Servants and Magic. In Hiro Ito, Stefano Leonardi, Linda Pagli, and Giuseppe Prencipe, editors, 9th International Conference on Fun with Algorithms (FUN 2018) , volume 100 of Leibniz International Proceedings in Informatics (LIPIcs) , pages 10:1–10:13, Dagstuhl, Germany, 2018. Schloss Dagstuhl – Lei...

work page doi:10.4230/lipics.fun.2018.10 2018

[8] [8]

Michael Burrows and David J. Wheeler. A block sorting lossless data compression algorithm. Technical Report 124, Digital Equipment Corporation, 1994

work page 1994

[9] [9]

Suffixient arrays: a new efficient suffix array compression technique.arXiv preprint 2407.18753v2, 2025

Davide Cenzato, Lore Depuydt, Travis Gagie, Sung-Hwan Kim, Giovanni Manzini, Francisco Olivares, and Nicola Prezza. Suffixient arrays: a new efficient suffix array compression technique, 2025. URL: https://arxiv.org/abs/2407.18753

work page arXiv 2025

[10] [10]

Filtering search: a new approach to query-answering

Bernard Chazelle. Filtering search: a new approach to query-answering. SIAM Journal on Computing, 15(3):703– 724, 1986

work page 1986

[11] [11]

Optimal-time dictionary-compressed indexes

Anders Roy Christiansen, Mikko Berggren Ettienne, Tomasz Kociumaka, Gonzalo Navarro, and Nicola Prezza. Optimal-time dictionary-compressed indexes. ACM Trans. Algorithms, 17(1):8:1–8:39, 2021

work page 2021

[12] [12]

Improved grammar-based compressed indexes

Francisco Claude and Gonzalo Navarro. Improved grammar-based compressed indexes. In International Sympo- sium on String Processing and Information Retrieval , pages 180–192. Springer, 2012

work page 2012

[13] [13]

Cleary and W

John G. Cleary and W. J. Teahan. Unbounded length contexts for PPM. Comput. J., 40(2/3):67–75, 1997

work page 1997

[14] [14]

µ- PBWT: a lightweight r-indexing of the PBWT for storing and querying UK biobank data

Davide Cozzi, Massimiliano Rossi, Simone Rubinacci, Travis Gagie, Dominik K¨ oppl, Christina Boucher, and Paola Bonizzoni. µ- PBWT: a lightweight r-indexing of the PBWT for storing and querying UK biobank data. Bioinform., 39(9), 2023

work page 2023

[15] [15]

Opportunistic data structures with applications

Paolo Ferragina and Giovanni Manzini. Opportunistic data structures with applications. In Proceedings 41st Annual Symposium on Foundations of Computer Science , pages 390–398, 2000

work page 2000

[16] [16]

Space-Efficient Preprocessing Schemes for Range Minimum Queries on Static Arrays

Johannes Fischer and Volker Heun. Space-Efficient Preprocessing Schemes for Range Minimum Queries on Static Arrays. SIAM Journal on Computing , 40:465–492, 2011

work page 2011

[17] [17]

KATKA: A kraken-like tool with k given at query time

Travis Gagie, Sana Kashgouli, and Ben Langmead. KATKA: A kraken-like tool with k given at query time. In Diego Arroyuelo and Barbara Poblete, editors, String Processing and Information Retrieval - 29th International Symposium, SPIRE 2022, Concepci´ on, Chile, November 8-10, 2022, Proceedings, volume 13617 of Lecture Notes in Computer Science , pages 191–1...

work page 2022

[18] [18]

Optimal-time text indexing in bwt-runs bounded space, 2018

Travis Gagie, Gonzalo Navarro, and Nicola Prezza. Optimal-time text indexing in bwt-runs bounded space, 2018

work page 2018

[19] [19]

Fully functional suffix trees and optimal text searching in bwt-runs bounded space

Travis Gagie, Gonzalo Navarro, and Nicola Prezza. Fully functional suffix trees and optimal text searching in bwt-runs bounded space. J. ACM, 67(1), January 2020

work page 2020

[20] [20]

New indices for text: Pat trees and pat arrays

Gaston H Gonnet, Ricardo A Baeza-Yates, and Tim Snider. New indices for text: Pat trees and pat arrays. Information Retrieval: Data Structures & Algorithms , 66:82, 1992

work page 1992

[21] [21]

Compressed suffix arrays and suffix trees with applications to text index- ing and string matching

Roberto Grossi and Jeffrey Scott Vitter. Compressed suffix arrays and suffix trees with applications to text index- ing and string matching. In Proceedings of the thirty-second annual ACM Symposium on Theory of Computing (STOC), pages 397 – 406, 2000

work page 2000

[22] [22]

Juha K¨ arkk¨ ainen, Giovanni Manzini, and Simon J. Puglisi. Permuted longest-common-prefix array. InProceedings of the 20th Annual Symposium on Combinatorial Pattern Matching (CPM) , pages 181–192, 2009

work page 2009

[23] [23]

Resolution of the burrows-wheeler transform conjecture

Dominik Kempa and Tomasz Kociumaka. Resolution of the burrows-wheeler transform conjecture. In 2020 IEEE 61st Annual Symposium on Foundations of Computer Science (FOCS) , pages 1002–1013. IEEE, 2020. 42

work page 2020

[24] [24]

Collapsing the hierarchy of compressed data structures: Suffix arrays in optimal compressed space

Dominik Kempa and Tomasz Kociumaka. Collapsing the hierarchy of compressed data structures: Suffix arrays in optimal compressed space. In 64th IEEE Annual Symposium on Foundations of Computer Science, FOCS 2023, Santa Cruz, CA, USA, November 6-9, 2023 , pages 1877–1886. IEEE, 2023

work page 2023

[25] [25]

At the roots of dictionary compression: string attractors

Dominik Kempa and Nicola Prezza. At the roots of dictionary compression: string attractors. In Proceedings of the 50th Annual ACM SIGACT Symposium on Theory of Computing , pages 827–840, 2018

work page 2018

[26] [26]

Toward a definitive compressibility measure for repet- itive sequences

Tomasz Kociumaka, Gonzalo Navarro, and Nicola Prezza. Toward a definitive compressibility measure for repet- itive sequences. IEEE Transactions on Information Theory , 69(4):2074–2092, 2022

work page 2074

[27] [27]

On compressing and indexing repetitive sequences

Sebastian Kreft and Gonzalo Navarro. On compressing and indexing repetitive sequences. Theoretical Computer Science, 483:115–133, 2013

work page 2013

[28] [28]

On compressing and indexing repetitive sequences

Sebastian Kreft and Gonzalo Navarro. On compressing and indexing repetitive sequences. Theor. Comput. Sci. , 483:115–133, 2013

work page 2013

[29] [29]

Relative Lempel-Ziv compression of genomes for large- scale storage and retrieval

Shanika Kuruppu, Simon J Puglisi, and Justin Zobel. Relative Lempel-Ziv compression of genomes for large- scale storage and retrieval. In International Symposium on String Processing and Information Retrieval , pages 201–206. Springer, 2010

work page 2010

[30] [30]

On the complexity of finite sequences

Abraham Lempel and Jacob Ziv. On the complexity of finite sequences. IEEE Trans. Inf. Theory , 22(1):75–81, 1976

work page 1976

[31] [31]

Succinct suffix arrays based on run-length encoding

Veli M¨ akinen and Gonzalo Navarro. Succinct suffix arrays based on run-length encoding. Nordic Journal of Computing, 12(1):40–66, 2005

work page 2005

[32] [32]

Storage and retrieval of highly repetitive sequence collections

Veli M¨ akinen, Gonzalo Navarro, Jouni Sir´ en, and Niko V¨ alim¨ aki. Storage and retrieval of highly repetitive sequence collections. Journal of Computational Biology , 17(3):281–308, 2010

work page 2010

[33] [33]

In: Proceedings of the First Annual ACM-SIAM Symposium on Discrete Algo- rithms, 22-24 January 1990, San Francisco, California, USA

Udi Manber and Gene Myers. Suffix arrays: A new method for on-line string searches. In David S. Johnson, editor, Proceedings of the First Annual ACM-SIAM Symposium on Discrete Algorithms, 22-24 January 1990, San Fran- cisco, California, USA , pages 319–327. SIAM, 1990. URL: http://dl.acm.org/citation.cfm?id=320176.320218

work page arXiv 1990

[34] [34]

Suffix arrays: a new method for on-line string searches

Udi Manber and Gene Myers. Suffix arrays: a new method for on-line string searches. SIAM Journal on Computing, 22(5):935–948, 1993

work page 1993

[35] [35]

McCreight

Edward M. McCreight. A space-economical suffix tree construction algorithm. Journal of the ACM , 23(2):262– 272, 1976

work page 1976

[36] [36]

Wavelet trees for all

Gonzalo Navarro. Wavelet trees for all. Journal of Discrete Algorithms , 25:2–20, 2014

work page 2014

[37] [37]

Indexing Highly Repetitive String Collections, Part I: Repetitiveness Measures

Gonzalo Navarro. Indexing Highly Repetitive String Collections, Part I: Repetitiveness Measures. ACM Com- puting Surveys, 54(2):29:1–29:31, 2022

work page 2022

[38] [38]

Indexing Highly Repetitive String Collections, Part II: Compressed Indexes

Gonzalo Navarro. Indexing Highly Repetitive String Collections, Part II: Compressed Indexes. ACM Computing Surveys, 54(2):26:1–26:31, 2022

work page 2022

[39] [39]

Smallest suffixient sets as a repetitiveness measure,

Gonzalo Navarro, Giuseppe Romana, and Cristian Urbina. Smallest suffixient sets as a repetitiveness measure,

work page

[40] [40]

URL: https://arxiv.org/abs/2506.05638

work page internal anchor Pith review Pith/arXiv arXiv

[41] [41]

Optimal-time queries on BWT-runs compressed indexes

Takaaki Nishimoto and Yasuo Tabei. Optimal-time queries on BWT-runs compressed indexes. In Nikhil Bansal, Emanuela Merelli, and James Worrell, editors, 48th International Colloquium on Automata, Languages, and Programming, ICALP 2021, July 12-16, 2021, Glasgow, Scotland (Virtual Conference) , volume 198 of LIPIcs, pages 101:1–101:15. Schloss Dagstuhl - Le...

work page 2021

[42] [42]

Optimal substring equality queries with applications to sparse text indexing

Nicola Prezza. Optimal substring equality queries with applications to sparse text indexing. ACM Trans. Algorithms, 17(1):7:1–7:23, 2021

work page 2021

[43] [43]

Relative lempel-ziv compression of suffix arrays

Simon J Puglisi and Bella Zhukova. Relative lempel-ziv compression of suffix arrays. In International Symposium on String Processing and Information Retrieval , pages 89–96. Springer, 2020

work page 2020

[44] [44]

Finding maximal exact matches using the r-index

Massimiliano Rossi, Marco Oliva, Paola Bonizzoni, Ben Langmead, Travis Gagie, and Christina Boucher. Finding maximal exact matches using the r-index. J. Comput. Biol. , 29(2):188–194, 2022

work page 2022

[45] [45]

MONI: A pangenomic index for finding maximal exact matches

Massimiliano Rossi, Marco Oliva, Ben Langmead, Travis Gagie, and Christina Boucher. MONI: A pangenomic index for finding maximal exact matches. J. Comput. Biol. , 29(2):169–187, 2022

work page 2022

[46] [46]

Sigmoni: classification of nanopore signal with a compressed pangenome index

Vikram S Shivakumar, Omar Y Ahmed, Sam Kovaka, Mohsen Zakeri, and Ben Langmead. Sigmoni: classification of nanopore signal with a compressed pangenome index. Bioinformatics, 40(Supplement 1):i287–i296, 2024

work page 2024

[47] [47]

On-line construction of suffix trees

Esko Ukkonen. On-line construction of suffix trees. Algorithmica, 14:249–260, 1995

work page 1995

[48] [48]

Linear pattern matching algorithms

Peter Weiner. Linear pattern matching algorithms. In 14th Annual Symposium on Switching and Automata Theory (SWAT 1973), pages 1–11. IEEE, 1973

work page 1973

[49] [49]

Improved metagenomic analysis with kraken 2

Derrick E Wood, Jennifer Lu, and Ben Langmead. Improved metagenomic analysis with kraken 2. Genome biology, 20:1–13, 2019

work page 2019

[50] [50]

Kraken: ultrafast metagenomic sequence classification using exact alignments

Derrick E Wood and Steven L Salzberg. Kraken: ultrafast metagenomic sequence classification using exact alignments. Genome biology, 15:1–12, 2014

work page 2014

[51] [51]

Movi: a fast and cache-efficient full-text pangenome index

Mohsen Zakeri, Nathaniel K Brown, Omar Y Ahmed, Travis Gagie, and Ben Langmead. Movi: a fast and cache-efficient full-text pangenome index. iScience, 27(12), 2024. 43 A Suffixient Arrays Like suffixient arrays [9], ours is a technique for sampling the Prefix Array while still maintaining search functionalities. The idea behind suffixient arrays is rather ...

work page 2024

[52] [52]

BWT( S)[i] = S[SA[i] − 1], where S[0] := S[n]

work page

[53] [53]

If ISA[ i] < ISA[j] and S[i − 1] = S[j − 1], then ISA[i − 1] < ISA[j − 1], where ISA[0] := ISA[n]

work page

[54] [54]

BWT[ISA[ i]] = S[i − 1]

work page

[55] [55]

We proceed with the definition of Range Minimum and Maximum queries

coBWT[ i] = S[PA[i] + 1] where S[n + 1] := S[1]. We proceed with the definition of Range Minimum and Maximum queries. Definition 25. Given a list L of n integers, the Range Minimum (Maximum) query on L with arguments ℓ, r ∈ [n], returns the index argmink∈[ℓ,r]L[k] (argmaxk∈[ℓ,r]L[k]) of the minimum (max- imum) element in L[ℓ, r]. There exists a data struc...

work page