pith. sign in

arxiv: 2605.17574 · v2 · pith:UAS47XOBnew · submitted 2026-05-17 · 💻 cs.DS

Parse indexing for choosing pseudo-MEMs

Pith reviewed 2026-05-20 12:46 UTC · model grok-4.3

classification 💻 cs.DS
keywords parse indexingpseudo-MEMsmaximal exact matchesKeBaBrepetitive textfilteringBloom filter
0
0 comments X

The pith

Parse indexing lets us safely filter pseudo-MEMs while still guaranteeing all longest exact matches are retained.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows how parse indexing on a repetitive text supplies the extra information needed to pick a subset of pseudo-MEMs that is guaranteed to contain every longest MEM even after length or count filtering. This removes the risk that earlier k-mer-based breaking methods run when they discard short or low-ranked pseudo-MEMs, and it also removes the need to choose any fixed k in advance. A reader would care because the technique makes fast, reliable MEM search practical on large repetitive collections without introducing new tuning parameters or missing critical long matches.

Core claim

By building a parse index of the text, we can determine, for any set of candidate pseudo-MEMs produced by a Bloom filter on k-mers, exactly which ones must be kept so that the retained pseudo-MEMs are guaranteed to include every MEM whose length is maximal after any subsequent length threshold or top-t selection.

What carries the argument

The parse index, which records the run-length encoding of the text's LZ parse and thereby identifies which k-mer boundaries are critical for preserving longest matches under filtering.

If this is right

  • Filtering short or low-count pseudo-MEMs can now be performed without the possibility of discarding a longest MEM.
  • The same selection rule works for any choice of k, removing the need to tune that parameter.
  • Searches restricted to the retained pseudo-MEMs are guaranteed to report all longest matches while still benefiting from the speed-up of the Bloom-filter pre-filter.
  • The approach extends directly to any filtering rule based on length or rank that is applied after pseudo-MEM identification.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same parse-index information could be reused to guide other string-matching or compression pipelines that need to retain longest exact matches under resource constraints.
  • Because the method is parameter-free with respect to k, it may simplify integration into larger bioinformatics pipelines that already index repetitive genomes.

Load-bearing premise

The parse index contains enough information to identify precisely which pseudo-MEMs must be kept to guarantee that every longest MEM survives length or count filtering.

What would settle it

Construct a repetitive text, a pattern, and a filter (length threshold or top-t) such that the longest MEM in the pattern is longer than every match found inside the pseudo-MEMs selected by the parse-index method; the claim is false if this occurs.

Figures

Figures reproduced from arXiv: 2605.17574 by Travis Gagie.

Figure 1
Figure 1. Figure 1: A summary of our approach combining KeBaB with parse indexing, guaranteed to [PITH_FULL_IMAGE:figures/full_fig_p005_1.png] view at source ↗
read the original abstract

Brown et al.\ (2025) recently proposed a pre-processing step, called $k$-mer based breaking (KeBaB), to speed up searches for long maximal exact matches (MEMs) between patterns and an indexed repetitive text. They fix a parameter $k$ and build a Bloom filter for the distinct $k$-mers in the text. When given a pattern, they quickly separate the $k$-mers in it into those that probably occur in the text and those that certainly do not. They call the maximal substrings of the pattern consisting only of the former $k$-mers {\em pseudo-MEMs}. These pseudo-MEMs are guaranteed to contain all the MEMs of length at least $k$ of the pattern with respect to the text, and it is usually much faster to find the pseudo-MEMs and then find the MEMs in them than to find the MEMs in the pattern directly. KeBaB is particularly effective when we choose a threshold $L > k$ and discard the pseudo-MEMs of length less than $L$, or discard all but the $t$ longest pseudo-MEMs. These filtering steps are risky, however, since the pseudo-MEMs we keep may not contain the longest MEMs. In this paper we show how to use parse indexing to choose pseudo-MEMs safely, even when we filter them, with the added benefit that we need not choose $k$.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper builds on Brown et al. (2025)'s KeBaB preprocessing, which uses a Bloom filter on k-mers to identify pseudo-MEMs that are guaranteed to contain all MEMs of length at least k. It shows that parse indexing supplies the additional information needed to select a subset of these pseudo-MEMs that remains guaranteed to contain every longest MEM even after length-based (L > k) or cardinality-based (top-t) filtering, while removing the need to choose any fixed k.

Significance. If the central guarantee holds, the method removes a practical risk in KeBaB filtering and eliminates a free parameter, yielding a more robust and easier-to-deploy accelerator for MEM search on repetitive texts. This would be useful in large-scale string processing tasks such as genome alignment where both speed and completeness of long matches matter.

major comments (2)
  1. [§4] §4, main guarantee: the argument that the parse index identifies exactly which pseudo-MEMs must be retained to cover all longest MEMs after arbitrary filtering needs an explicit statement of the retained set (e.g., via a formal definition or pseudocode) and a proof that no longest MEM can be lost when the filter discards short or low-ranked pseudo-MEMs.
  2. [§3.1] §3.1, parse-index construction: it is unclear whether the parse must be built on the full text or can be built on the same k-mer Bloom-filtered view; if the former, the claimed advantage of avoiding k must be reconciled with the cost of the parse.
minor comments (2)
  1. [Abstract] The abstract and introduction should state the precise filtering operations (length threshold vs. top-t) for which the guarantee is proved.
  2. [§2] Notation for pseudo-MEM boundaries and how they relate to parse phrases should be introduced earlier and used consistently.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive review and the recommendation of minor revision. The comments help clarify the presentation of the main guarantee and the construction details. We respond to each major comment below.

read point-by-point responses
  1. Referee: [§4] §4, main guarantee: the argument that the parse index identifies exactly which pseudo-MEMs must be retained to cover all longest MEMs after arbitrary filtering needs an explicit statement of the retained set (e.g., via a formal definition or pseudocode) and a proof that no longest MEM can be lost when the filter discards short or low-ranked pseudo-MEMs.

    Authors: We agree that the guarantee would benefit from greater formality. In the revised manuscript we will add, in Section 4, an explicit definition of the retained set (the pseudo-MEMs whose parse intervals intersect the longest possible match positions) together with a short proof that length-based (L > k) or cardinality-based (top-t) filtering cannot eliminate any longest MEM when the parse index is used for selection. This addition makes the argument self-contained without changing the underlying result. revision: yes

  2. Referee: [§3.1] §3.1, parse-index construction: it is unclear whether the parse must be built on the full text or can be built on the same k-mer Bloom-filtered view; if the former, the claimed advantage of avoiding k must be reconciled with the cost of the parse.

    Authors: The parse index is constructed on the full text, because the hierarchical parse captures the repetition structure required for k-independent selection. The Bloom filter is applied only at query time to identify candidate k-mers and is independent of the parse. The advantage of avoiding a fixed k therefore holds at filtering time: once the (offline) parse is available, safe retention decisions no longer require choosing or knowing k. We will insert a clarifying paragraph in §3.1 that distinguishes the offline construction from the online, parameter-free filtering step and notes that the parse cost is incurred only once, comparable to other static indexes. revision: yes

Circularity Check

0 steps flagged

No significant circularity detected

full rationale

The paper's central claim is that parse indexing provides sufficient structural information to select a filtered subset of pseudo-MEMs guaranteed to contain all longest MEMs, while removing the need to choose k. This rests on the external KeBaB construction from Brown et al. (2025) combined with the independent properties of parse indexes as a data structure for repetitive texts. No self-definitional loops, fitted inputs renamed as predictions, or load-bearing self-citations appear in the derivation; the guarantee follows from the data-structure invariants rather than reducing to the paper's own inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The proposal rests on the unstated property that parse indexing can safely guide filtering decisions for pseudo-MEMs; no free parameters or invented entities are mentioned in the abstract.

axioms (1)
  • domain assumption Parse indexing provides sufficient information to guarantee retention of pseudo-MEMs containing all longest MEMs after filtering.
    This premise is required for the safety claim but is not justified in the provided abstract.

pith-pipeline@v0.9.0 · 5782 in / 1110 out tokens · 31303 ms · 2026-05-20T12:46:31.058382+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

17 extracted references · 17 canonical work pages · 1 internal anchor

  1. [1]

    KeBaB: k-mer based breaking for finding long MEMs

    Nathaniel K Brown et al. KeBaB: k-mer based breaking for finding long MEMs. InProc. 32nd Symposium on String Processing and Information Retrieval (SPIRE), 2025

  2. [2]

    Faster run-length compressed suffix arrays

    Nathaniel K Brown, Travis Gagie, Giovanni Manzini, Gonzalo Navarro, and Marinella Sciortino. Faster run-length compressed suffix arrays. InFrom Strings to Graphs, and Back Again: A Festschrift for Roberto Grossi’s 60th Birthday. Schloss Dagstuhl–Leibniz- Zentrum f¨ ur Informatik, 2025

  3. [3]

    Faster algorithms for longest common substring.ACM Transactions on Algorithms, 22, 2026

    Panagiotis Charalampopoulos, Tomasz Kociumaka, Jakub Radoszewski, and Solon P Pis- sis. Faster algorithms for longest common substring.ACM Transactions on Algorithms, 22, 2026

  4. [4]

    Fast and small subsampled r-indexes

    Dustin Cobas, Travis Gagie, and Gonzalo Navarro. Fast and small subsampled r-indexes. ACM Transactions on Algorithms, 22, 2025

  5. [5]

    FM-indexing gram- mars induced by suffix sorting for long patterns

    Jin-Jie Deng, Wing-Kai Hon, Dominik K¨ oppl, and Kunihiko Sadakane. FM-indexing gram- mars induced by suffix sorting for long patterns. InProc. Data Compression Conference (DCC), 2022

  6. [6]

    Summary cache: a scalable wide-area web cache sharing protocol.IEEE/ACM Transactions on Networking, 8, 2000

    Li Fan, Pei Cao, Jussara Almeida, and Andrei Z Broder. Summary cache: a scalable wide-area web cache sharing protocol.IEEE/ACM Transactions on Networking, 8, 2000

  7. [7]

    Indexing compressed text.Journal of the ACM, 52, 2005

    Paolo Ferragina and Giovanni Manzini. Indexing compressed text.Journal of the ACM, 52, 2005

  8. [8]

    Building a pangenome alignment index via recursive prefix-free parsing.iScience, 27, 2024

    Eddie Ferro, Marco Oliva, Travis Gagie, and Christina Boucher. Building a pangenome alignment index via recursive prefix-free parsing.iScience, 27, 2024

  9. [9]

    How to find long maximal exact matches and ignore short ones

    Travis Gagie. How to find long maximal exact matches and ignore short ones. InProc. 28th Conference on Developments in Language Theory (DLT), 2024. 6

  10. [10]

    Compressed suffix arrays and suffix trees with applications to text indexing and string matching.SIAM Journal on Computing, 35, 2005

    Roberto Grossi and Jeffrey Scott Vitter. Compressed suffix arrays and suffix trees with applications to text indexing and string matching.SIAM Journal on Computing, 35, 2005

  11. [11]

    Pfp-fm: an accelerated FM-index.Algorithms for Molecular Biology, 19, 2024

    Aaron Hong, Marco Oliva, Dominik K¨ oppl, Hideo Bannai, Christina Boucher, and Travis Gagie. Pfp-fm: an accelerated FM-index.Algorithms for Molecular Biology, 19, 2024

  12. [12]

    Exploring single-sample SNP and INDEL calling with whole-genome de novo assembly.Bioinformatics, 28, 2012

    Heng Li. Exploring single-sample SNP and INDEL calling with whole-genome de novo assembly.Bioinformatics, 28, 2012

  13. [13]

    Aligning sequence reads, clone sequences and assembly contigs with BWA-MEM

    Heng Li. Aligning sequence reads, clone sequences and assembly contigs with BWA-MEM. arXiv preprint 1303.3997, 2013

  14. [14]

    BWT construction and search at the terabase scale.Bioinformatics, 40, 2024

    Heng Li. BWT construction and search at the terabase scale.Bioinformatics, 40, 2024

  15. [15]

    Storage and retrieval of highly repetitive sequence collections.Journal of Computational Biology, 17, 2010

    Veli M¨ akinen, Gonzalo Navarro, Jouni Sir´ en, and Niko V¨ alim¨ aki. Storage and retrieval of highly repetitive sequence collections.Journal of Computational Biology, 17, 2010

  16. [16]

    Beyond Bloom: A tutorial on future feature-rich filters

    Prashant Pandey, Mart´ ın Farach-Colton, Niv Dayan, and Huanchen Zhang. Beyond Bloom: A tutorial on future feature-rich filters. InCompanion to the Conference on Management of Data (SIGMOD), 2024

  17. [17]

    The rsync algorithm

    Andrew Tridgell and Paul Mackerras. The rsync algorithm. Technical report, Australian National University, 1996. 7