Parse indexing for choosing pseudo-MEMs
Pith reviewed 2026-05-20 12:46 UTC · model grok-4.3
The pith
Parse indexing lets us safely filter pseudo-MEMs while still guaranteeing all longest exact matches are retained.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
By building a parse index of the text, we can determine, for any set of candidate pseudo-MEMs produced by a Bloom filter on k-mers, exactly which ones must be kept so that the retained pseudo-MEMs are guaranteed to include every MEM whose length is maximal after any subsequent length threshold or top-t selection.
What carries the argument
The parse index, which records the run-length encoding of the text's LZ parse and thereby identifies which k-mer boundaries are critical for preserving longest matches under filtering.
If this is right
- Filtering short or low-count pseudo-MEMs can now be performed without the possibility of discarding a longest MEM.
- The same selection rule works for any choice of k, removing the need to tune that parameter.
- Searches restricted to the retained pseudo-MEMs are guaranteed to report all longest matches while still benefiting from the speed-up of the Bloom-filter pre-filter.
- The approach extends directly to any filtering rule based on length or rank that is applied after pseudo-MEM identification.
Where Pith is reading between the lines
- The same parse-index information could be reused to guide other string-matching or compression pipelines that need to retain longest exact matches under resource constraints.
- Because the method is parameter-free with respect to k, it may simplify integration into larger bioinformatics pipelines that already index repetitive genomes.
Load-bearing premise
The parse index contains enough information to identify precisely which pseudo-MEMs must be kept to guarantee that every longest MEM survives length or count filtering.
What would settle it
Construct a repetitive text, a pattern, and a filter (length threshold or top-t) such that the longest MEM in the pattern is longer than every match found inside the pseudo-MEMs selected by the parse-index method; the claim is false if this occurs.
Figures
read the original abstract
Brown et al.\ (2025) recently proposed a pre-processing step, called $k$-mer based breaking (KeBaB), to speed up searches for long maximal exact matches (MEMs) between patterns and an indexed repetitive text. They fix a parameter $k$ and build a Bloom filter for the distinct $k$-mers in the text. When given a pattern, they quickly separate the $k$-mers in it into those that probably occur in the text and those that certainly do not. They call the maximal substrings of the pattern consisting only of the former $k$-mers {\em pseudo-MEMs}. These pseudo-MEMs are guaranteed to contain all the MEMs of length at least $k$ of the pattern with respect to the text, and it is usually much faster to find the pseudo-MEMs and then find the MEMs in them than to find the MEMs in the pattern directly. KeBaB is particularly effective when we choose a threshold $L > k$ and discard the pseudo-MEMs of length less than $L$, or discard all but the $t$ longest pseudo-MEMs. These filtering steps are risky, however, since the pseudo-MEMs we keep may not contain the longest MEMs. In this paper we show how to use parse indexing to choose pseudo-MEMs safely, even when we filter them, with the added benefit that we need not choose $k$.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper builds on Brown et al. (2025)'s KeBaB preprocessing, which uses a Bloom filter on k-mers to identify pseudo-MEMs that are guaranteed to contain all MEMs of length at least k. It shows that parse indexing supplies the additional information needed to select a subset of these pseudo-MEMs that remains guaranteed to contain every longest MEM even after length-based (L > k) or cardinality-based (top-t) filtering, while removing the need to choose any fixed k.
Significance. If the central guarantee holds, the method removes a practical risk in KeBaB filtering and eliminates a free parameter, yielding a more robust and easier-to-deploy accelerator for MEM search on repetitive texts. This would be useful in large-scale string processing tasks such as genome alignment where both speed and completeness of long matches matter.
major comments (2)
- [§4] §4, main guarantee: the argument that the parse index identifies exactly which pseudo-MEMs must be retained to cover all longest MEMs after arbitrary filtering needs an explicit statement of the retained set (e.g., via a formal definition or pseudocode) and a proof that no longest MEM can be lost when the filter discards short or low-ranked pseudo-MEMs.
- [§3.1] §3.1, parse-index construction: it is unclear whether the parse must be built on the full text or can be built on the same k-mer Bloom-filtered view; if the former, the claimed advantage of avoiding k must be reconciled with the cost of the parse.
minor comments (2)
- [Abstract] The abstract and introduction should state the precise filtering operations (length threshold vs. top-t) for which the guarantee is proved.
- [§2] Notation for pseudo-MEM boundaries and how they relate to parse phrases should be introduced earlier and used consistently.
Simulated Author's Rebuttal
We thank the referee for the constructive review and the recommendation of minor revision. The comments help clarify the presentation of the main guarantee and the construction details. We respond to each major comment below.
read point-by-point responses
-
Referee: [§4] §4, main guarantee: the argument that the parse index identifies exactly which pseudo-MEMs must be retained to cover all longest MEMs after arbitrary filtering needs an explicit statement of the retained set (e.g., via a formal definition or pseudocode) and a proof that no longest MEM can be lost when the filter discards short or low-ranked pseudo-MEMs.
Authors: We agree that the guarantee would benefit from greater formality. In the revised manuscript we will add, in Section 4, an explicit definition of the retained set (the pseudo-MEMs whose parse intervals intersect the longest possible match positions) together with a short proof that length-based (L > k) or cardinality-based (top-t) filtering cannot eliminate any longest MEM when the parse index is used for selection. This addition makes the argument self-contained without changing the underlying result. revision: yes
-
Referee: [§3.1] §3.1, parse-index construction: it is unclear whether the parse must be built on the full text or can be built on the same k-mer Bloom-filtered view; if the former, the claimed advantage of avoiding k must be reconciled with the cost of the parse.
Authors: The parse index is constructed on the full text, because the hierarchical parse captures the repetition structure required for k-independent selection. The Bloom filter is applied only at query time to identify candidate k-mers and is independent of the parse. The advantage of avoiding a fixed k therefore holds at filtering time: once the (offline) parse is available, safe retention decisions no longer require choosing or knowing k. We will insert a clarifying paragraph in §3.1 that distinguishes the offline construction from the online, parameter-free filtering step and notes that the parse cost is incurred only once, comparable to other static indexes. revision: yes
Circularity Check
No significant circularity detected
full rationale
The paper's central claim is that parse indexing provides sufficient structural information to select a filtered subset of pseudo-MEMs guaranteed to contain all longest MEMs, while removing the need to choose k. This rests on the external KeBaB construction from Brown et al. (2025) combined with the independent properties of parse indexes as a data structure for repetitive texts. No self-definitional loops, fitted inputs renamed as predictions, or load-bearing self-citations appear in the derivation; the guarantee follows from the data-structure invariants rather than reducing to the paper's own inputs by construction.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption Parse indexing provides sufficient information to guarantee retention of pseudo-MEMs containing all longest MEMs after filtering.
Reference graph
Works this paper leans on
-
[1]
KeBaB: k-mer based breaking for finding long MEMs
Nathaniel K Brown et al. KeBaB: k-mer based breaking for finding long MEMs. InProc. 32nd Symposium on String Processing and Information Retrieval (SPIRE), 2025
work page 2025
-
[2]
Faster run-length compressed suffix arrays
Nathaniel K Brown, Travis Gagie, Giovanni Manzini, Gonzalo Navarro, and Marinella Sciortino. Faster run-length compressed suffix arrays. InFrom Strings to Graphs, and Back Again: A Festschrift for Roberto Grossi’s 60th Birthday. Schloss Dagstuhl–Leibniz- Zentrum f¨ ur Informatik, 2025
work page 2025
-
[3]
Faster algorithms for longest common substring.ACM Transactions on Algorithms, 22, 2026
Panagiotis Charalampopoulos, Tomasz Kociumaka, Jakub Radoszewski, and Solon P Pis- sis. Faster algorithms for longest common substring.ACM Transactions on Algorithms, 22, 2026
work page 2026
-
[4]
Fast and small subsampled r-indexes
Dustin Cobas, Travis Gagie, and Gonzalo Navarro. Fast and small subsampled r-indexes. ACM Transactions on Algorithms, 22, 2025
work page 2025
-
[5]
FM-indexing gram- mars induced by suffix sorting for long patterns
Jin-Jie Deng, Wing-Kai Hon, Dominik K¨ oppl, and Kunihiko Sadakane. FM-indexing gram- mars induced by suffix sorting for long patterns. InProc. Data Compression Conference (DCC), 2022
work page 2022
-
[6]
Li Fan, Pei Cao, Jussara Almeida, and Andrei Z Broder. Summary cache: a scalable wide-area web cache sharing protocol.IEEE/ACM Transactions on Networking, 8, 2000
work page 2000
-
[7]
Indexing compressed text.Journal of the ACM, 52, 2005
Paolo Ferragina and Giovanni Manzini. Indexing compressed text.Journal of the ACM, 52, 2005
work page 2005
-
[8]
Building a pangenome alignment index via recursive prefix-free parsing.iScience, 27, 2024
Eddie Ferro, Marco Oliva, Travis Gagie, and Christina Boucher. Building a pangenome alignment index via recursive prefix-free parsing.iScience, 27, 2024
work page 2024
-
[9]
How to find long maximal exact matches and ignore short ones
Travis Gagie. How to find long maximal exact matches and ignore short ones. InProc. 28th Conference on Developments in Language Theory (DLT), 2024. 6
work page 2024
-
[10]
Roberto Grossi and Jeffrey Scott Vitter. Compressed suffix arrays and suffix trees with applications to text indexing and string matching.SIAM Journal on Computing, 35, 2005
work page 2005
-
[11]
Pfp-fm: an accelerated FM-index.Algorithms for Molecular Biology, 19, 2024
Aaron Hong, Marco Oliva, Dominik K¨ oppl, Hideo Bannai, Christina Boucher, and Travis Gagie. Pfp-fm: an accelerated FM-index.Algorithms for Molecular Biology, 19, 2024
work page 2024
-
[12]
Heng Li. Exploring single-sample SNP and INDEL calling with whole-genome de novo assembly.Bioinformatics, 28, 2012
work page 2012
-
[13]
Aligning sequence reads, clone sequences and assembly contigs with BWA-MEM
Heng Li. Aligning sequence reads, clone sequences and assembly contigs with BWA-MEM. arXiv preprint 1303.3997, 2013
work page internal anchor Pith review Pith/arXiv arXiv 2013
-
[14]
BWT construction and search at the terabase scale.Bioinformatics, 40, 2024
Heng Li. BWT construction and search at the terabase scale.Bioinformatics, 40, 2024
work page 2024
-
[15]
Veli M¨ akinen, Gonzalo Navarro, Jouni Sir´ en, and Niko V¨ alim¨ aki. Storage and retrieval of highly repetitive sequence collections.Journal of Computational Biology, 17, 2010
work page 2010
-
[16]
Beyond Bloom: A tutorial on future feature-rich filters
Prashant Pandey, Mart´ ın Farach-Colton, Niv Dayan, and Huanchen Zhang. Beyond Bloom: A tutorial on future feature-rich filters. InCompanion to the Conference on Management of Data (SIGMOD), 2024
work page 2024
-
[17]
Andrew Tridgell and Paul Mackerras. The rsync algorithm. Technical report, Australian National University, 1996. 7
work page 1996
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.