Compressing Suffix Trees by Path Decompositions
Pith reviewed 2026-05-19 08:51 UTC · model grok-4.3
The pith
Sorting suffix tree leaves by priority functions decomposes the tree into at most r paths and supports a Suffix Array sample of size r.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
We introduce a new Suffix Array sampling technique based on particular path decompositions of the suffix tree. We prove that sorting the suffix tree leaves by specific priority functions induces a decomposition where the number of distinct paths is bounded by r. This allows us to solve indexed pattern matching efficiently in the I/O model using a Suffix Array sample of size at most r, strictly improving upon the 2r bound of Suffixient Arrays.
What carries the argument
The path decomposition of the suffix tree obtained by ordering leaves according to chosen priority functions; the decomposition groups edges into paths each representing a string suffix and limits their count to r.
If this is right
- Indexed pattern matching queries run with I/O cost governed by a sample whose size is bounded by r rather than 2r.
- The space overhead of the sampled Suffix Array becomes proportional to the repetitiveness measure r instead of twice that measure.
- Cache misses during matching decrease when the text is stored on disk or in external memory because fewer sampled positions need to be fetched.
- The same decomposition supplies a linear-time construction for the sampled structure once the priority ordering is fixed.
Where Pith is reading between the lines
- The same priority ordering may yield compact representations for other tree-based string indexes that currently sample at twice the run count.
- Dynamic maintenance of the decomposition could allow online updates to the sampled structure while preserving the r bound.
- The approach suggests that viewing suffix trees through path covers rather than node or edge covers captures repetitiveness more tightly.
Load-bearing premise
Priority functions exist that order the suffix tree leaves so the induced paths number at most r.
What would settle it
A string whose suffix tree requires more than r distinct paths under every possible choice of priority functions on the leaves.
Figures
read the original abstract
The suffix tree is arguably the most fundamental data structure on strings: introduced by Weiner (SWAT 1973) and McCreight (JACM 1976), it allows solving a myriad of computational problems on strings in linear time. Motivated by its large space usage, subsequent research focused first on reducing its size by a constant factor via Suffix Arrays, and later on reaching space proportional to the size of the compressed string. Modern compressed indexes, such as the $r$-index (Gagie et al., SODA 2018), fit in space proportional to $r$, the number of runs in the Burrows-Wheeler transform (a strong and universal repetitiveness measure). These advances, however, came with a price: while modern compressed indexes boast optimal bounds in the RAM model, they are often orders of magnitude slower than uncompressed counterparts in practice due to catastrophic cache locality. This reality gap highlights that Big-O complexity in the RAM model has become a misleading predictor of real-world performance, leaving a critical question unanswered: can we design compressed indexes that are efficient in the I/O model of computation? We answer this in the affirmative by introducing a new Suffix Array sampling technique based on particular path decompositions of the suffix tree. We prove that sorting the suffix tree leaves by specific priority functions induces a decomposition where the number of distinct paths (each corresponding to a string suffix) is bounded by $r$. This allows us to solve indexed pattern matching efficiently in the I/O model using a Suffix Array sample of size at most $r$, strictly improving upon the (tight) $2r$ bound of Suffixient Arrays, another recent compressed Suffix Array sampling technique.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces a suffix array sampling technique based on path decompositions of the suffix tree. It claims that sorting the leaves of the suffix tree according to specific priority functions produces a decomposition into at most r distinct paths (each a suffix), where r is the number of runs in the Burrows-Wheeler transform. This yields an I/O-efficient indexed pattern matching structure whose suffix array sample has size at most r, strictly improving on the tight 2r bound achieved by Suffixient Arrays.
Significance. If the claimed r-bound holds, the result would be a notable improvement in the design of compressed indexes for the I/O model, directly addressing the practical cache-locality shortcomings of existing r-based structures while preserving the strong repetitiveness measure r. The path-decomposition viewpoint on suffix trees offers a fresh angle that could influence future work on I/O-efficient string algorithms.
major comments (2)
- [Section 4] Section 4 (Priority Functions and Induced Decomposition): The manuscript asserts that particular priority functions on suffix-tree leaves induce a path decomposition with at most r distinct paths, yet neither the explicit definition of these functions nor the combinatorial steps that map the sorted leaf order to BWT runs (ensuring the path count stays at most r rather than 2r) are supplied. This argument is load-bearing for the central claim of a strict improvement over Suffixient Arrays.
- [Theorem 5.1] Theorem 5.1 (or the main bounding theorem): The proof that the number of distinct paths is bounded by r is presented without an induction, counting argument, or explicit reduction to the run-length structure of the BWT. Without these steps it is impossible to confirm that the construction avoids the extra factor that appears in the 2r case.
minor comments (2)
- [Abstract] The abstract refers to 'specific priority functions' without even a one-sentence characterization; adding a high-level description would help readers assess the novelty immediately.
- [Figure 1] Figure 1 (path decomposition example): The diagram would be clearer if the priority values used for leaf ordering were annotated on the leaves or edges.
Simulated Author's Rebuttal
We thank the referee for their thoughtful review and for recognizing the potential significance of our path-decomposition approach for I/O-efficient compressed indexes. We address each major comment below and will incorporate the requested clarifications into a revised manuscript.
read point-by-point responses
-
Referee: [Section 4] Section 4 (Priority Functions and Induced Decomposition): The manuscript asserts that particular priority functions on suffix-tree leaves induce a path decomposition with at most r distinct paths, yet neither the explicit definition of these functions nor the combinatorial steps that map the sorted leaf order to BWT runs (ensuring the path count stays at most r rather than 2r) are supplied. This argument is load-bearing for the central claim of a strict improvement over Suffixient Arrays.
Authors: We agree that the current draft of Section 4 would benefit from greater explicitness. In the revision we will supply the precise definitions of the two priority functions used to order the suffix-tree leaves. We will also add a self-contained combinatorial argument that shows how the resulting leaf order partitions the tree into paths whose number is bounded by the number of BWT runs r, rather than 2r, by directly relating consecutive leaves in the order to run boundaries in the BWT. revision: yes
-
Referee: [Theorem 5.1] Theorem 5.1 (or the main bounding theorem): The proof that the number of distinct paths is bounded by r is presented without an induction, counting argument, or explicit reduction to the run-length structure of the BWT. Without these steps it is impossible to confirm that the construction avoids the extra factor that appears in the 2r case.
Authors: We accept that the proof of Theorem 5.1 in the submitted version is too terse. We will expand it to include an explicit counting argument (or a short induction on the number of BWT runs) that reduces the number of distinct paths to r. The revised proof will highlight the structural property of our priority ordering that prevents the extra factor incurred by earlier 2r constructions. revision: yes
Circularity Check
No circularity: r-bound on path decompositions is a self-contained combinatorial proof
full rationale
The paper's central derivation is a proof that sorting suffix tree leaves by specific (explicitly constructed) priority functions induces a path decomposition whose distinct paths number at most r, the number of BWT runs. This is presented as an independent combinatorial result that directly yields an SA sample of size r for I/O-efficient matching, improving on the cited 2r bound of Suffixient Arrays. No step reduces by definition, by fitting, or by self-citation chain to the target quantity itself; the existence of the priority functions and the r-bound are established by explicit construction and argument rather than by renaming or tautological redefinition of inputs. The derivation is therefore self-contained and externally verifiable by combinatorial inspection.
Axiom & Free-Parameter Ledger
axioms (1)
- standard math Standard combinatorial properties of suffix trees and the Burrows-Wheeler transform hold.
Lean theorems connected to this paper
-
IndisputableMonolith/Foundation/RealityFromDistinction.leanreality_from_one_distinction unclear?
unclearRelation between the paper passage and the cited Recognition theorem.
sorting the suffix tree leaves by specific priority functions induces a decomposition where the number of distinct paths is bounded by r... strictly improving upon the 2r bound of Suffixient Arrays
-
IndisputableMonolith/Cost/FunctionalEquation.leanwashburn_uniqueness_aczel unclear?
unclearRelation between the paper passage and the cited Recognition theorem.
the number of irreducible values in LPF^π is known to be at most r when π = ISA
What do these tags mean?
- matches
- The paper's claim is directly supported by a theorem in the formal canon.
- supports
- The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends
- The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses
- The paper appears to rely on the theorem as machinery.
- contradicts
- The paper's claim conflicts with a theorem or certificate in the canon.
- unclear
- Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Forward citations
Cited by 1 Pith paper
-
More efficient PBWT prefix-array access via batching
Batching queries to accumulate r lg(h)/lg(r) substrings with average lg(r)/lg(h) haplotypes reported per substring enables O(r log h) bits and constant time per haplotype in PBWT prefix-array access.
Reference graph
Works this paper leans on
-
[1]
Pan-genomic matching statistics for targeted nanopore sequencing
Omar Ahmed, Massimiliano Rossi, Sam Kovaka, Michael C Schatz, Travis Gagie, Christina Boucher, and Ben Langmead. Pan-genomic matching statistics for targeted nanopore sequencing. Iscience, 24(6), 2021
work page 2021
-
[2]
Spumoni 2: improved classification using a pangenome index of minimizer digests
Omar Y Ahmed, Massimiliano Rossi, Travis Gagie, Christina Boucher, and Ben Langmead. Spumoni 2: improved classification using a pangenome index of minimizer digests. Genome Biology, 24(1):122, 2023
work page 2023
-
[3]
Monotone minimal perfect hashing: searching a sorted table with o(1) accesses
Djamal Belazzougui, Paolo Boldi, Rasmus Pagh, and Sebastiano Vigna. Monotone minimal perfect hashing: searching a sorted table with o(1) accesses. In Proceedings of the Twentieth Annual ACM-SIAM Symposium on Discrete Algorithms, pages 785–794, 2009. URL: https://dl.acm.org/doi/10.5555/1496770.1496856
-
[4]
Djamal Belazzougui, Manuel C´ aceres, Travis Gagie, Pawel Gawrychowski, Juha K¨ arkk¨ ainen, Gonzalo Navarro, Alberto Ord´ o˜ nez Pereira, Simon J. Puglisi, and Yasuo Tabei. Block trees.J. Comput. Syst. Sci. , 117:1–22, 2021
work page 2021
-
[5]
Move-r: Optimizing the r-index
Nico Bertram, Johannes Fischer, and Lukas Nalbach. Move-r: Optimizing the r-index. In Leo Liberti, editor, 22nd International Symposium on Experimental Algorithms, SEA 2024, July 23-26, 2024, Vienna, Austria , volume 301 of LIPIcs, pages 1:1–1:19. Schloss Dagstuhl - Leibniz-Zentrum f¨ ur Informatik, 2024
work page 2024
-
[6]
Landau, Rajeev Raman, Kunihiko Sadakane, Srinivasa Rao Satti, and Oren Weimann
Philip Bille, Gad M. Landau, Rajeev Raman, Kunihiko Sadakane, Srinivasa Rao Satti, and Oren Weimann. Random access to grammar-compressed strings and trees. SIAM Journal on Computing , 44:513–539, 2015
work page 2015
-
[7]
Kings, Name Days, Lazy Servants and Magic
Paolo Boldi and Sebastiano Vigna. Kings, Name Days, Lazy Servants and Magic. In Hiro Ito, Stefano Leonardi, Linda Pagli, and Giuseppe Prencipe, editors, 9th International Conference on Fun with Algorithms (FUN 2018) , volume 100 of Leibniz International Proceedings in Informatics (LIPIcs) , pages 10:1–10:13, Dagstuhl, Germany, 2018. Schloss Dagstuhl – Lei...
-
[8]
Michael Burrows and David J. Wheeler. A block sorting lossless data compression algorithm. Technical Report 124, Digital Equipment Corporation, 1994
work page 1994
-
[9]
Davide Cenzato, Lore Depuydt, Travis Gagie, Sung-Hwan Kim, Giovanni Manzini, Francisco Olivares, and Nicola Prezza. Suffixient arrays: a new efficient suffix array compression technique, 2025. URL: https://arxiv.org/abs/2407.18753
-
[10]
Filtering search: a new approach to query-answering
Bernard Chazelle. Filtering search: a new approach to query-answering. SIAM Journal on Computing, 15(3):703– 724, 1986
work page 1986
-
[11]
Optimal-time dictionary-compressed indexes
Anders Roy Christiansen, Mikko Berggren Ettienne, Tomasz Kociumaka, Gonzalo Navarro, and Nicola Prezza. Optimal-time dictionary-compressed indexes. ACM Trans. Algorithms, 17(1):8:1–8:39, 2021
work page 2021
-
[12]
Improved grammar-based compressed indexes
Francisco Claude and Gonzalo Navarro. Improved grammar-based compressed indexes. In International Sympo- sium on String Processing and Information Retrieval , pages 180–192. Springer, 2012
work page 2012
-
[13]
John G. Cleary and W. J. Teahan. Unbounded length contexts for PPM. Comput. J., 40(2/3):67–75, 1997
work page 1997
-
[14]
µ- PBWT: a lightweight r-indexing of the PBWT for storing and querying UK biobank data
Davide Cozzi, Massimiliano Rossi, Simone Rubinacci, Travis Gagie, Dominik K¨ oppl, Christina Boucher, and Paola Bonizzoni. µ- PBWT: a lightweight r-indexing of the PBWT for storing and querying UK biobank data. Bioinform., 39(9), 2023
work page 2023
-
[15]
Opportunistic data structures with applications
Paolo Ferragina and Giovanni Manzini. Opportunistic data structures with applications. In Proceedings 41st Annual Symposium on Foundations of Computer Science , pages 390–398, 2000
work page 2000
-
[16]
Space-Efficient Preprocessing Schemes for Range Minimum Queries on Static Arrays
Johannes Fischer and Volker Heun. Space-Efficient Preprocessing Schemes for Range Minimum Queries on Static Arrays. SIAM Journal on Computing , 40:465–492, 2011
work page 2011
-
[17]
KATKA: A kraken-like tool with k given at query time
Travis Gagie, Sana Kashgouli, and Ben Langmead. KATKA: A kraken-like tool with k given at query time. In Diego Arroyuelo and Barbara Poblete, editors, String Processing and Information Retrieval - 29th International Symposium, SPIRE 2022, Concepci´ on, Chile, November 8-10, 2022, Proceedings, volume 13617 of Lecture Notes in Computer Science , pages 191–1...
work page 2022
-
[18]
Optimal-time text indexing in bwt-runs bounded space, 2018
Travis Gagie, Gonzalo Navarro, and Nicola Prezza. Optimal-time text indexing in bwt-runs bounded space, 2018
work page 2018
-
[19]
Fully functional suffix trees and optimal text searching in bwt-runs bounded space
Travis Gagie, Gonzalo Navarro, and Nicola Prezza. Fully functional suffix trees and optimal text searching in bwt-runs bounded space. J. ACM, 67(1), January 2020
work page 2020
-
[20]
New indices for text: Pat trees and pat arrays
Gaston H Gonnet, Ricardo A Baeza-Yates, and Tim Snider. New indices for text: Pat trees and pat arrays. Information Retrieval: Data Structures & Algorithms , 66:82, 1992
work page 1992
-
[21]
Compressed suffix arrays and suffix trees with applications to text index- ing and string matching
Roberto Grossi and Jeffrey Scott Vitter. Compressed suffix arrays and suffix trees with applications to text index- ing and string matching. In Proceedings of the thirty-second annual ACM Symposium on Theory of Computing (STOC), pages 397 – 406, 2000
work page 2000
-
[22]
Juha K¨ arkk¨ ainen, Giovanni Manzini, and Simon J. Puglisi. Permuted longest-common-prefix array. InProceedings of the 20th Annual Symposium on Combinatorial Pattern Matching (CPM) , pages 181–192, 2009
work page 2009
-
[23]
Resolution of the burrows-wheeler transform conjecture
Dominik Kempa and Tomasz Kociumaka. Resolution of the burrows-wheeler transform conjecture. In 2020 IEEE 61st Annual Symposium on Foundations of Computer Science (FOCS) , pages 1002–1013. IEEE, 2020. 42
work page 2020
-
[24]
Collapsing the hierarchy of compressed data structures: Suffix arrays in optimal compressed space
Dominik Kempa and Tomasz Kociumaka. Collapsing the hierarchy of compressed data structures: Suffix arrays in optimal compressed space. In 64th IEEE Annual Symposium on Foundations of Computer Science, FOCS 2023, Santa Cruz, CA, USA, November 6-9, 2023 , pages 1877–1886. IEEE, 2023
work page 2023
-
[25]
At the roots of dictionary compression: string attractors
Dominik Kempa and Nicola Prezza. At the roots of dictionary compression: string attractors. In Proceedings of the 50th Annual ACM SIGACT Symposium on Theory of Computing , pages 827–840, 2018
work page 2018
-
[26]
Toward a definitive compressibility measure for repet- itive sequences
Tomasz Kociumaka, Gonzalo Navarro, and Nicola Prezza. Toward a definitive compressibility measure for repet- itive sequences. IEEE Transactions on Information Theory , 69(4):2074–2092, 2022
work page 2074
-
[27]
On compressing and indexing repetitive sequences
Sebastian Kreft and Gonzalo Navarro. On compressing and indexing repetitive sequences. Theoretical Computer Science, 483:115–133, 2013
work page 2013
-
[28]
On compressing and indexing repetitive sequences
Sebastian Kreft and Gonzalo Navarro. On compressing and indexing repetitive sequences. Theor. Comput. Sci. , 483:115–133, 2013
work page 2013
-
[29]
Relative Lempel-Ziv compression of genomes for large- scale storage and retrieval
Shanika Kuruppu, Simon J Puglisi, and Justin Zobel. Relative Lempel-Ziv compression of genomes for large- scale storage and retrieval. In International Symposium on String Processing and Information Retrieval , pages 201–206. Springer, 2010
work page 2010
-
[30]
On the complexity of finite sequences
Abraham Lempel and Jacob Ziv. On the complexity of finite sequences. IEEE Trans. Inf. Theory , 22(1):75–81, 1976
work page 1976
-
[31]
Succinct suffix arrays based on run-length encoding
Veli M¨ akinen and Gonzalo Navarro. Succinct suffix arrays based on run-length encoding. Nordic Journal of Computing, 12(1):40–66, 2005
work page 2005
-
[32]
Storage and retrieval of highly repetitive sequence collections
Veli M¨ akinen, Gonzalo Navarro, Jouni Sir´ en, and Niko V¨ alim¨ aki. Storage and retrieval of highly repetitive sequence collections. Journal of Computational Biology , 17(3):281–308, 2010
work page 2010
-
[33]
Udi Manber and Gene Myers. Suffix arrays: A new method for on-line string searches. In David S. Johnson, editor, Proceedings of the First Annual ACM-SIAM Symposium on Discrete Algorithms, 22-24 January 1990, San Fran- cisco, California, USA , pages 319–327. SIAM, 1990. URL: http://dl.acm.org/citation.cfm?id=320176.320218
-
[34]
Suffix arrays: a new method for on-line string searches
Udi Manber and Gene Myers. Suffix arrays: a new method for on-line string searches. SIAM Journal on Computing, 22(5):935–948, 1993
work page 1993
- [35]
-
[36]
Gonzalo Navarro. Wavelet trees for all. Journal of Discrete Algorithms , 25:2–20, 2014
work page 2014
-
[37]
Indexing Highly Repetitive String Collections, Part I: Repetitiveness Measures
Gonzalo Navarro. Indexing Highly Repetitive String Collections, Part I: Repetitiveness Measures. ACM Com- puting Surveys, 54(2):29:1–29:31, 2022
work page 2022
-
[38]
Indexing Highly Repetitive String Collections, Part II: Compressed Indexes
Gonzalo Navarro. Indexing Highly Repetitive String Collections, Part II: Compressed Indexes. ACM Computing Surveys, 54(2):26:1–26:31, 2022
work page 2022
-
[39]
Smallest suffixient sets as a repetitiveness measure,
Gonzalo Navarro, Giuseppe Romana, and Cristian Urbina. Smallest suffixient sets as a repetitiveness measure,
-
[40]
URL: https://arxiv.org/abs/2506.05638
work page internal anchor Pith review Pith/arXiv arXiv
-
[41]
Optimal-time queries on BWT-runs compressed indexes
Takaaki Nishimoto and Yasuo Tabei. Optimal-time queries on BWT-runs compressed indexes. In Nikhil Bansal, Emanuela Merelli, and James Worrell, editors, 48th International Colloquium on Automata, Languages, and Programming, ICALP 2021, July 12-16, 2021, Glasgow, Scotland (Virtual Conference) , volume 198 of LIPIcs, pages 101:1–101:15. Schloss Dagstuhl - Le...
work page 2021
-
[42]
Optimal substring equality queries with applications to sparse text indexing
Nicola Prezza. Optimal substring equality queries with applications to sparse text indexing. ACM Trans. Algorithms, 17(1):7:1–7:23, 2021
work page 2021
-
[43]
Relative lempel-ziv compression of suffix arrays
Simon J Puglisi and Bella Zhukova. Relative lempel-ziv compression of suffix arrays. In International Symposium on String Processing and Information Retrieval , pages 89–96. Springer, 2020
work page 2020
-
[44]
Finding maximal exact matches using the r-index
Massimiliano Rossi, Marco Oliva, Paola Bonizzoni, Ben Langmead, Travis Gagie, and Christina Boucher. Finding maximal exact matches using the r-index. J. Comput. Biol. , 29(2):188–194, 2022
work page 2022
-
[45]
MONI: A pangenomic index for finding maximal exact matches
Massimiliano Rossi, Marco Oliva, Ben Langmead, Travis Gagie, and Christina Boucher. MONI: A pangenomic index for finding maximal exact matches. J. Comput. Biol. , 29(2):169–187, 2022
work page 2022
-
[46]
Sigmoni: classification of nanopore signal with a compressed pangenome index
Vikram S Shivakumar, Omar Y Ahmed, Sam Kovaka, Mohsen Zakeri, and Ben Langmead. Sigmoni: classification of nanopore signal with a compressed pangenome index. Bioinformatics, 40(Supplement 1):i287–i296, 2024
work page 2024
-
[47]
On-line construction of suffix trees
Esko Ukkonen. On-line construction of suffix trees. Algorithmica, 14:249–260, 1995
work page 1995
-
[48]
Linear pattern matching algorithms
Peter Weiner. Linear pattern matching algorithms. In 14th Annual Symposium on Switching and Automata Theory (SWAT 1973), pages 1–11. IEEE, 1973
work page 1973
-
[49]
Improved metagenomic analysis with kraken 2
Derrick E Wood, Jennifer Lu, and Ben Langmead. Improved metagenomic analysis with kraken 2. Genome biology, 20:1–13, 2019
work page 2019
-
[50]
Kraken: ultrafast metagenomic sequence classification using exact alignments
Derrick E Wood and Steven L Salzberg. Kraken: ultrafast metagenomic sequence classification using exact alignments. Genome biology, 15:1–12, 2014
work page 2014
-
[51]
Movi: a fast and cache-efficient full-text pangenome index
Mohsen Zakeri, Nathaniel K Brown, Omar Y Ahmed, Travis Gagie, and Ben Langmead. Movi: a fast and cache-efficient full-text pangenome index. iScience, 27(12), 2024. 43 A Suffixient Arrays Like suffixient arrays [9], ours is a technique for sampling the Prefix Array while still maintaining search functionalities. The idea behind suffixient arrays is rather ...
work page 2024
-
[52]
BWT( S)[i] = S[SA[i] − 1], where S[0] := S[n]
-
[53]
If ISA[ i] < ISA[j] and S[i − 1] = S[j − 1], then ISA[i − 1] < ISA[j − 1], where ISA[0] := ISA[n]
-
[54]
BWT[ISA[ i]] = S[i − 1]
-
[55]
We proceed with the definition of Range Minimum and Maximum queries
coBWT[ i] = S[PA[i] + 1] where S[n + 1] := S[1]. We proceed with the definition of Range Minimum and Maximum queries. Definition 25. Given a list L of n integers, the Range Minimum (Maximum) query on L with arguments ℓ, r ∈ [n], returns the index argmink∈[ℓ,r]L[k] (argmaxk∈[ℓ,r]L[k]) of the minimum (max- imum) element in L[ℓ, r]. There exists a data struc...
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.