Parse indexing for choosing pseudo-MEMs

Travis Gagie

arxiv: 2605.17574 · v2 · pith:UAS47XOBnew · submitted 2026-05-17 · 💻 cs.DS

Parse indexing for choosing pseudo-MEMs

Travis Gagie This is my paper

Pith reviewed 2026-05-20 12:46 UTC · model grok-4.3

classification 💻 cs.DS

keywords parse indexingpseudo-MEMsmaximal exact matchesKeBaBrepetitive textfilteringBloom filter

0 comments

The pith

Parse indexing lets us safely filter pseudo-MEMs while still guaranteeing all longest exact matches are retained.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows how parse indexing on a repetitive text supplies the extra information needed to pick a subset of pseudo-MEMs that is guaranteed to contain every longest MEM even after length or count filtering. This removes the risk that earlier k-mer-based breaking methods run when they discard short or low-ranked pseudo-MEMs, and it also removes the need to choose any fixed k in advance. A reader would care because the technique makes fast, reliable MEM search practical on large repetitive collections without introducing new tuning parameters or missing critical long matches.

Core claim

By building a parse index of the text, we can determine, for any set of candidate pseudo-MEMs produced by a Bloom filter on k-mers, exactly which ones must be kept so that the retained pseudo-MEMs are guaranteed to include every MEM whose length is maximal after any subsequent length threshold or top-t selection.

What carries the argument

The parse index, which records the run-length encoding of the text's LZ parse and thereby identifies which k-mer boundaries are critical for preserving longest matches under filtering.

If this is right

Filtering short or low-count pseudo-MEMs can now be performed without the possibility of discarding a longest MEM.
The same selection rule works for any choice of k, removing the need to tune that parameter.
Searches restricted to the retained pseudo-MEMs are guaranteed to report all longest matches while still benefiting from the speed-up of the Bloom-filter pre-filter.
The approach extends directly to any filtering rule based on length or rank that is applied after pseudo-MEM identification.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same parse-index information could be reused to guide other string-matching or compression pipelines that need to retain longest exact matches under resource constraints.
Because the method is parameter-free with respect to k, it may simplify integration into larger bioinformatics pipelines that already index repetitive genomes.

Load-bearing premise

The parse index contains enough information to identify precisely which pseudo-MEMs must be kept to guarantee that every longest MEM survives length or count filtering.

What would settle it

Construct a repetitive text, a pattern, and a filter (length threshold or top-t) such that the longest MEM in the pattern is longer than every match found inside the pseudo-MEMs selected by the parse-index method; the claim is false if this occurs.

Figures

Figures reproduced from arXiv: 2605.17574 by Travis Gagie.

read the original abstract

Brown et al.\ (2025) recently proposed a pre-processing step, called $k$-mer based breaking (KeBaB), to speed up searches for long maximal exact matches (MEMs) between patterns and an indexed repetitive text. They fix a parameter $k$ and build a Bloom filter for the distinct $k$-mers in the text. When given a pattern, they quickly separate the $k$-mers in it into those that probably occur in the text and those that certainly do not. They call the maximal substrings of the pattern consisting only of the former $k$-mers {\em pseudo-MEMs}. These pseudo-MEMs are guaranteed to contain all the MEMs of length at least $k$ of the pattern with respect to the text, and it is usually much faster to find the pseudo-MEMs and then find the MEMs in them than to find the MEMs in the pattern directly. KeBaB is particularly effective when we choose a threshold $L > k$ and discard the pseudo-MEMs of length less than $L$, or discard all but the $t$ longest pseudo-MEMs. These filtering steps are risky, however, since the pseudo-MEMs we keep may not contain the longest MEMs. In this paper we show how to use parse indexing to choose pseudo-MEMs safely, even when we filter them, with the added benefit that we need not choose $k$.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

Gagie shows parse indexing can pick filtered pseudo-MEMs safely without fixing k.

read the letter

The key point here is that parse indexing gives a way to choose pseudo-MEMs that remain safe even after you filter them by length or number, and you skip choosing the k parameter altogether. This is a direct follow-up to the KeBaB method in Brown et al. 2025. KeBaB uses a Bloom filter on k-mers to find candidate regions quickly, but then filtering those candidates risks dropping the actual longest MEMs. Gagie points out that a parse index already encodes the long repeats in a repetitive text, so you can use it to decide which pseudo-MEMs must be kept to cover the longest matches. What works well is the guarantee: the selected subset still contains every longest MEM after filtering. The paper avoids introducing new parameters and rests on the existing data structure properties rather than fitting anything. The soft spots are minor. The write-up is short and leans on familiarity with parse indexes and the prior KeBaB paper. It would benefit from a concrete small example showing the selection step, or a note on implementation cost. There is no experimental section, which is fine for a note like this but means the practical speedup is assumed from the earlier work. This paper is for people already working on MEM finding in repetitive strings, probably in bioinformatics or compressed data structures. A reader who knows the KeBaB paper will see the value immediately as a fix for its main drawback. I would recommend sending it for peer review. The central claim looks solid and addresses a real limitation in the prior approach.

Referee Report

2 major / 2 minor

Summary. The paper builds on Brown et al. (2025)'s KeBaB preprocessing, which uses a Bloom filter on k-mers to identify pseudo-MEMs that are guaranteed to contain all MEMs of length at least k. It shows that parse indexing supplies the additional information needed to select a subset of these pseudo-MEMs that remains guaranteed to contain every longest MEM even after length-based (L > k) or cardinality-based (top-t) filtering, while removing the need to choose any fixed k.

Significance. If the central guarantee holds, the method removes a practical risk in KeBaB filtering and eliminates a free parameter, yielding a more robust and easier-to-deploy accelerator for MEM search on repetitive texts. This would be useful in large-scale string processing tasks such as genome alignment where both speed and completeness of long matches matter.

major comments (2)

[§4] §4, main guarantee: the argument that the parse index identifies exactly which pseudo-MEMs must be retained to cover all longest MEMs after arbitrary filtering needs an explicit statement of the retained set (e.g., via a formal definition or pseudocode) and a proof that no longest MEM can be lost when the filter discards short or low-ranked pseudo-MEMs.
[§3.1] §3.1, parse-index construction: it is unclear whether the parse must be built on the full text or can be built on the same k-mer Bloom-filtered view; if the former, the claimed advantage of avoiding k must be reconciled with the cost of the parse.

minor comments (2)

[Abstract] The abstract and introduction should state the precise filtering operations (length threshold vs. top-t) for which the guarantee is proved.
[§2] Notation for pseudo-MEM boundaries and how they relate to parse phrases should be introduced earlier and used consistently.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive review and the recommendation of minor revision. The comments help clarify the presentation of the main guarantee and the construction details. We respond to each major comment below.

read point-by-point responses

Referee: [§4] §4, main guarantee: the argument that the parse index identifies exactly which pseudo-MEMs must be retained to cover all longest MEMs after arbitrary filtering needs an explicit statement of the retained set (e.g., via a formal definition or pseudocode) and a proof that no longest MEM can be lost when the filter discards short or low-ranked pseudo-MEMs.

Authors: We agree that the guarantee would benefit from greater formality. In the revised manuscript we will add, in Section 4, an explicit definition of the retained set (the pseudo-MEMs whose parse intervals intersect the longest possible match positions) together with a short proof that length-based (L > k) or cardinality-based (top-t) filtering cannot eliminate any longest MEM when the parse index is used for selection. This addition makes the argument self-contained without changing the underlying result. revision: yes
Referee: [§3.1] §3.1, parse-index construction: it is unclear whether the parse must be built on the full text or can be built on the same k-mer Bloom-filtered view; if the former, the claimed advantage of avoiding k must be reconciled with the cost of the parse.

Authors: The parse index is constructed on the full text, because the hierarchical parse captures the repetition structure required for k-independent selection. The Bloom filter is applied only at query time to identify candidate k-mers and is independent of the parse. The advantage of avoiding a fixed k therefore holds at filtering time: once the (offline) parse is available, safe retention decisions no longer require choosing or knowing k. We will insert a clarifying paragraph in §3.1 that distinguishes the offline construction from the online, parameter-free filtering step and notes that the parse cost is incurred only once, comparable to other static indexes. revision: yes

Circularity Check

0 steps flagged

No significant circularity detected

full rationale

The paper's central claim is that parse indexing provides sufficient structural information to select a filtered subset of pseudo-MEMs guaranteed to contain all longest MEMs, while removing the need to choose k. This rests on the external KeBaB construction from Brown et al. (2025) combined with the independent properties of parse indexes as a data structure for repetitive texts. No self-definitional loops, fitted inputs renamed as predictions, or load-bearing self-citations appear in the derivation; the guarantee follows from the data-structure invariants rather than reducing to the paper's own inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The proposal rests on the unstated property that parse indexing can safely guide filtering decisions for pseudo-MEMs; no free parameters or invented entities are mentioned in the abstract.

axioms (1)

domain assumption Parse indexing provides sufficient information to guarantee retention of pseudo-MEMs containing all longest MEMs after filtering.
This premise is required for the safety claim but is not justified in the provided abstract.

pith-pipeline@v0.9.0 · 5782 in / 1110 out tokens · 31303 ms · 2026-05-20T12:46:31.058382+00:00 · methodology

Review history (2 revisions) →

discussion (0)

Reference graph

Works this paper leans on

17 extracted references · 17 canonical work pages · 1 internal anchor

[1]

KeBaB: k-mer based breaking for finding long MEMs

Nathaniel K Brown et al. KeBaB: k-mer based breaking for finding long MEMs. InProc. 32nd Symposium on String Processing and Information Retrieval (SPIRE), 2025

work page 2025
[2]

Faster run-length compressed suffix arrays

Nathaniel K Brown, Travis Gagie, Giovanni Manzini, Gonzalo Navarro, and Marinella Sciortino. Faster run-length compressed suffix arrays. InFrom Strings to Graphs, and Back Again: A Festschrift for Roberto Grossi’s 60th Birthday. Schloss Dagstuhl–Leibniz- Zentrum f¨ ur Informatik, 2025

work page 2025
[3]

Faster algorithms for longest common substring.ACM Transactions on Algorithms, 22, 2026

Panagiotis Charalampopoulos, Tomasz Kociumaka, Jakub Radoszewski, and Solon P Pis- sis. Faster algorithms for longest common substring.ACM Transactions on Algorithms, 22, 2026

work page 2026
[4]

Fast and small subsampled r-indexes

Dustin Cobas, Travis Gagie, and Gonzalo Navarro. Fast and small subsampled r-indexes. ACM Transactions on Algorithms, 22, 2025

work page 2025
[5]

FM-indexing gram- mars induced by suffix sorting for long patterns

Jin-Jie Deng, Wing-Kai Hon, Dominik K¨ oppl, and Kunihiko Sadakane. FM-indexing gram- mars induced by suffix sorting for long patterns. InProc. Data Compression Conference (DCC), 2022

work page 2022
[6]

Summary cache: a scalable wide-area web cache sharing protocol.IEEE/ACM Transactions on Networking, 8, 2000

Li Fan, Pei Cao, Jussara Almeida, and Andrei Z Broder. Summary cache: a scalable wide-area web cache sharing protocol.IEEE/ACM Transactions on Networking, 8, 2000

work page 2000
[7]

Indexing compressed text.Journal of the ACM, 52, 2005

Paolo Ferragina and Giovanni Manzini. Indexing compressed text.Journal of the ACM, 52, 2005

work page 2005
[8]

Building a pangenome alignment index via recursive prefix-free parsing.iScience, 27, 2024

Eddie Ferro, Marco Oliva, Travis Gagie, and Christina Boucher. Building a pangenome alignment index via recursive prefix-free parsing.iScience, 27, 2024

work page 2024
[9]

How to find long maximal exact matches and ignore short ones

Travis Gagie. How to find long maximal exact matches and ignore short ones. InProc. 28th Conference on Developments in Language Theory (DLT), 2024. 6

work page 2024
[10]

Compressed suffix arrays and suffix trees with applications to text indexing and string matching.SIAM Journal on Computing, 35, 2005

Roberto Grossi and Jeffrey Scott Vitter. Compressed suffix arrays and suffix trees with applications to text indexing and string matching.SIAM Journal on Computing, 35, 2005

work page 2005
[11]

Pfp-fm: an accelerated FM-index.Algorithms for Molecular Biology, 19, 2024

Aaron Hong, Marco Oliva, Dominik K¨ oppl, Hideo Bannai, Christina Boucher, and Travis Gagie. Pfp-fm: an accelerated FM-index.Algorithms for Molecular Biology, 19, 2024

work page 2024
[12]

Exploring single-sample SNP and INDEL calling with whole-genome de novo assembly.Bioinformatics, 28, 2012

Heng Li. Exploring single-sample SNP and INDEL calling with whole-genome de novo assembly.Bioinformatics, 28, 2012

work page 2012
[13]

Aligning sequence reads, clone sequences and assembly contigs with BWA-MEM

Heng Li. Aligning sequence reads, clone sequences and assembly contigs with BWA-MEM. arXiv preprint 1303.3997, 2013

work page internal anchor Pith review Pith/arXiv arXiv 2013
[14]

BWT construction and search at the terabase scale.Bioinformatics, 40, 2024

Heng Li. BWT construction and search at the terabase scale.Bioinformatics, 40, 2024

work page 2024
[15]

Storage and retrieval of highly repetitive sequence collections.Journal of Computational Biology, 17, 2010

Veli M¨ akinen, Gonzalo Navarro, Jouni Sir´ en, and Niko V¨ alim¨ aki. Storage and retrieval of highly repetitive sequence collections.Journal of Computational Biology, 17, 2010

work page 2010
[16]

Beyond Bloom: A tutorial on future feature-rich filters

Prashant Pandey, Mart´ ın Farach-Colton, Niv Dayan, and Huanchen Zhang. Beyond Bloom: A tutorial on future feature-rich filters. InCompanion to the Conference on Management of Data (SIGMOD), 2024

work page 2024
[17]

The rsync algorithm

Andrew Tridgell and Paul Mackerras. The rsync algorithm. Technical report, Australian National University, 1996. 7

work page 1996

[1] [1]

KeBaB: k-mer based breaking for finding long MEMs

Nathaniel K Brown et al. KeBaB: k-mer based breaking for finding long MEMs. InProc. 32nd Symposium on String Processing and Information Retrieval (SPIRE), 2025

work page 2025

[2] [2]

Faster run-length compressed suffix arrays

Nathaniel K Brown, Travis Gagie, Giovanni Manzini, Gonzalo Navarro, and Marinella Sciortino. Faster run-length compressed suffix arrays. InFrom Strings to Graphs, and Back Again: A Festschrift for Roberto Grossi’s 60th Birthday. Schloss Dagstuhl–Leibniz- Zentrum f¨ ur Informatik, 2025

work page 2025

[3] [3]

Faster algorithms for longest common substring.ACM Transactions on Algorithms, 22, 2026

Panagiotis Charalampopoulos, Tomasz Kociumaka, Jakub Radoszewski, and Solon P Pis- sis. Faster algorithms for longest common substring.ACM Transactions on Algorithms, 22, 2026

work page 2026

[4] [4]

Fast and small subsampled r-indexes

Dustin Cobas, Travis Gagie, and Gonzalo Navarro. Fast and small subsampled r-indexes. ACM Transactions on Algorithms, 22, 2025

work page 2025

[5] [5]

FM-indexing gram- mars induced by suffix sorting for long patterns

Jin-Jie Deng, Wing-Kai Hon, Dominik K¨ oppl, and Kunihiko Sadakane. FM-indexing gram- mars induced by suffix sorting for long patterns. InProc. Data Compression Conference (DCC), 2022

work page 2022

[6] [6]

Summary cache: a scalable wide-area web cache sharing protocol.IEEE/ACM Transactions on Networking, 8, 2000

Li Fan, Pei Cao, Jussara Almeida, and Andrei Z Broder. Summary cache: a scalable wide-area web cache sharing protocol.IEEE/ACM Transactions on Networking, 8, 2000

work page 2000

[7] [7]

Indexing compressed text.Journal of the ACM, 52, 2005

Paolo Ferragina and Giovanni Manzini. Indexing compressed text.Journal of the ACM, 52, 2005

work page 2005

[8] [8]

Building a pangenome alignment index via recursive prefix-free parsing.iScience, 27, 2024

Eddie Ferro, Marco Oliva, Travis Gagie, and Christina Boucher. Building a pangenome alignment index via recursive prefix-free parsing.iScience, 27, 2024

work page 2024

[9] [9]

How to find long maximal exact matches and ignore short ones

Travis Gagie. How to find long maximal exact matches and ignore short ones. InProc. 28th Conference on Developments in Language Theory (DLT), 2024. 6

work page 2024

[10] [10]

Compressed suffix arrays and suffix trees with applications to text indexing and string matching.SIAM Journal on Computing, 35, 2005

Roberto Grossi and Jeffrey Scott Vitter. Compressed suffix arrays and suffix trees with applications to text indexing and string matching.SIAM Journal on Computing, 35, 2005

work page 2005

[11] [11]

Pfp-fm: an accelerated FM-index.Algorithms for Molecular Biology, 19, 2024

Aaron Hong, Marco Oliva, Dominik K¨ oppl, Hideo Bannai, Christina Boucher, and Travis Gagie. Pfp-fm: an accelerated FM-index.Algorithms for Molecular Biology, 19, 2024

work page 2024

[12] [12]

Exploring single-sample SNP and INDEL calling with whole-genome de novo assembly.Bioinformatics, 28, 2012

Heng Li. Exploring single-sample SNP and INDEL calling with whole-genome de novo assembly.Bioinformatics, 28, 2012

work page 2012

[13] [13]

Aligning sequence reads, clone sequences and assembly contigs with BWA-MEM

Heng Li. Aligning sequence reads, clone sequences and assembly contigs with BWA-MEM. arXiv preprint 1303.3997, 2013

work page internal anchor Pith review Pith/arXiv arXiv 2013

[14] [14]

BWT construction and search at the terabase scale.Bioinformatics, 40, 2024

Heng Li. BWT construction and search at the terabase scale.Bioinformatics, 40, 2024

work page 2024

[15] [15]

Storage and retrieval of highly repetitive sequence collections.Journal of Computational Biology, 17, 2010

Veli M¨ akinen, Gonzalo Navarro, Jouni Sir´ en, and Niko V¨ alim¨ aki. Storage and retrieval of highly repetitive sequence collections.Journal of Computational Biology, 17, 2010

work page 2010

[16] [16]

Beyond Bloom: A tutorial on future feature-rich filters

Prashant Pandey, Mart´ ın Farach-Colton, Niv Dayan, and Huanchen Zhang. Beyond Bloom: A tutorial on future feature-rich filters. InCompanion to the Conference on Management of Data (SIGMOD), 2024

work page 2024

[17] [17]

The rsync algorithm

Andrew Tridgell and Paul Mackerras. The rsync algorithm. Technical report, Australian National University, 1996. 7

work page 1996