More efficient PBWT prefix-array access via batching

Travis Gagie

arxiv: 2605.15819 · v2 · pith:NLJ4OXORnew · submitted 2026-05-15 · 💻 cs.DS

More efficient PBWT prefix-array access via batching

Travis Gagie This is my paper

Pith reviewed 2026-05-19 19:21 UTC · model grok-4.3

classification 💻 cs.DS

keywords PBWThaplotype panelsprefix arraybatching queriestime-space tradeoffsconstant-time reportingmaximal exact matchesgenomic data structures

0 comments

The pith

Batching queries to the PBWT prefix array allows constant-time haplotype reporting using O(r log h) bits when batch sizes meet specified logarithmic thresholds.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper focuses on the second step of finding maximal exact matches with PBWT, where one must report the haplotypes that match a given query substring at the same position. It observes that collecting queries in batches until r log h over log r substrings have been processed, and provided each substring reports at least log r over log h haplotypes on average, makes it possible to store the necessary information in O(r log h) bits while answering each report in constant time. A reader working with large haplotype panels would care because this reduces memory use for the reporting structure compared with prior tradeoffs that either used more space or required logarithmic time factors. The observation is workload-dependent and holds only when queries can be grouped to satisfy the average-report condition.

Core claim

If queries are batched until r lg(h)/lg(r) substrings have been identified and an average of at least lg(r)/lg(h) haplotypes are reported per substring, then the reporting step can be performed with O(r log h) bits of space and constant time per haplotype.

What carries the argument

Batching of multiple query substrings until a threshold of r log(h)/log(r) substrings is reached, allowing amortized constant-time reporting of matching haplotypes.

If this is right

The space for the haplotype-reporting structure drops to O(r log h) bits.
Each haplotype can be output in constant time instead of O(log log h) or similar.
The new tradeoff sits between the two previously given bounds that used either more space or slower reporting.
The method applies directly to the second phase of SMEM computation on PBWT-compressed panels.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Streaming or multi-user query systems could naturally supply the required batches by buffering incoming requests.
The same batching idea might reduce space in other run-length encoded string indexes that report occurrence lists.
Empirical tests on large panels would show how often real query streams meet the average-report condition needed for constant time.

Load-bearing premise

That it is practical to accumulate queries into batches large enough to reach r lg(h)/lg(r) substrings while still obtaining at least lg(r)/lg(h) haplotypes reported per substring on average.

What would settle it

A workload trace from real haplotype queries in which the average number of reported haplotypes per substring falls below lg(r)/lg(h) even after batches reach the size r lg(h)/lg(r).

Figures

Figures reproduced from arXiv: 2605.15819 by Travis Gagie.

**Figure 1.** Figure 1: A toy haplotype panel (left) with h = ℓ = 8, its PBWT (center) and the matrix (right) whose columns are the prefix arrays. Inspired by the Burrows-Wheeler Transform [6] (BWT) and Ferragina and Manzini’s [10] FM-index based on it, Durbin [9] proposed the positional BWT (PBWT) on haplotype panels. To build the PBWT of a binary matrix M[0..h−1][0..ℓ−1] we stably sort the bits in each column M[0..h − 1][j] int… view at source ↗

**Figure 2.** Figure 2: The substrings in the panel (highlighted) for the SMEMs of Q1 = 10100111 and Q2 = 11100010, in the haplotype panel (left) and its PBWT (center), and the corresponding entries in the prefix arrays. corresponding bits in the PBWT and the entries in the prefix arrays are also highlighted. Notice the highlighting for each match continues one prefix array to the right after the match ends, because we can only b… view at source ↗

read the original abstract

The positional Burrows-Wheeler Transform (PBWT) is commonly used to store haplotype panels compactly in such a way that, given a query haplotype, we can quickly find the set maximal exact matches (SMEMs) between the query and the haplotypes in a panel. There are generally two steps in this process: first we find the maximal substrings of the query that occur in the same positions in haplotypes in the panel and then, for each such substring, report the haplotypes in the panel in which the substring occurs in the same position as in the query. Very recently, Bonizzoni, Gagie and Gao (2026) gave two time-space tradeoffs for the second step: they use either $O ((r + h) \log n)$ bits and $O (\log \log \min (h, \ell) + k)$ time to report $k$ haplotypes in the panel, or $O (r \log h + h \log n)$ bits and $O (k \log \log h)$ time, where $r$ is the number of runs in the panel's PBWT and $h$, $\ell$ and $n = h \ell$ are the panel's height, length and size, respectively. We observe here that if we can batch queries until we have found $r \lg (h) / \lg r$ such substrings and we report an average of at least $\lg (r) / \lg h$ haplotypes in the panel per substring, for example, then for the second step we can easily use $O (r \log h)$ bits and constant time to report each haplotype. Our approach is based on an algorithm for constructing the prefix arrays quickly from the PBWT, which may be of independent interest. For the sake of completeness, we give a similar result for the divergence arrays.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

This note adds a batching condition that turns one of the recent PBWT reporting tradeoffs into constant time per haplotype, but only when the workload meets specific size and density thresholds.

read the letter

The main thing to know is that this short note shows how batching queries until you hit r lg(h)/lg(r) substrings, with an average of at least lg(r)/lg(h) haplotypes reported per substring, lets the reporting step use O(r log h) bits and constant time. It builds on the two time-space options from Bonizzoni, Gagie and Gao in 2026 by adding this amortization observation, which was not in the earlier paper. The logic is straightforward once the batch premise is granted, and it keeps the presentation simple without new machinery. That is the useful part: it clarifies a practical route to constant-time reporting if the conditions can be met in an application that tolerates batching. The space bound stays modest and the time claim follows directly from amortizing over the collected matches. The soft spot is the workload assumption itself. The note states the thresholds clearly but gives no evidence or example that standard SMEM-finding runs on haplotype panels actually produce substrings at that rate or with that reporting density. Without that check, it is hard to judge whether the constant-time guarantee applies to real pipelines or only to specially arranged offline batches. Online single-query settings would also need extra handling to collect the batch without hurting latency. This is for people who already work with PBWT structures for genomic haplotype panels and want to squeeze the reporting phase. A reader focused on constant-time access under batch constraints would get direct value from the observation. The thinking is clear and the citation pattern is appropriate, so the paper deserves a serious referee even as a short note or communication. I would send it to peer review.

Referee Report

1 major / 0 minor

Summary. The paper observes that, for the haplotype-reporting step in PBWT-based SMEM finding, batching queries until r lg(h)/lg(r) substrings have been identified and an average of at least lg(r)/lg(h) haplotypes are reported per substring allows the use of an O(r log h)-bit structure that reports each haplotype in constant time. This is positioned as an improvement over the two time-space trade-offs recently given by Bonizzoni, Gagie and Gao (2026).

Significance. If the stated batch-size and density conditions hold for realistic workloads, the observation yields a cleaner constant-time reporting bound with space independent of h log n. The result is a modest but clean refinement of the recent trade-offs; its practical value hinges on whether standard SMEM workloads naturally produce batches meeting the r lg(h)/lg(r) threshold with sufficient matches per substring.

major comments (1)

The central claim rests on the workload-dependent premise that queries can be batched until r lg(h)/lg(r) substrings are collected while maintaining an average of lg(r)/lg(h) haplotypes per substring. No argument or reference is supplied showing that this threshold is reached in typical haplotype-panel queries (e.g., long-read or population-genetics workloads), nor how online single-query processing would be adapted without increasing per-query latency.

Simulated Author's Rebuttal

1 responses · 1 unresolved

We thank the referee for their careful review and for identifying the workload-dependent aspects of our observation. The manuscript presents a conditional data-structural refinement for the haplotype-reporting step in PBWT-based SMEM finding, available when batching conditions can be met. We address the major comment below and have made a partial revision to clarify the scope.

read point-by-point responses

Referee: The central claim rests on the workload-dependent premise that queries can be batched until r lg(h)/lg(r) substrings are collected while maintaining an average of at least lg(r)/lg(h) haplotypes per substring. No argument or reference is supplied showing that this threshold is reached in typical haplotype-panel queries (e.g., long-read or population-genetics workloads), nor how online single-query processing would be adapted without increasing per-query latency.

Authors: We agree that the benefit is conditional on the ability to accumulate a sufficient batch of substrings with the stated average haplotype density. The manuscript frames the result explicitly with an 'if' clause and positions it as an improvement available in batch settings rather than a claim that the thresholds hold universally. We do not supply empirical arguments or references for typical workloads because the contribution is the observation of the resulting O(r log h)-bit constant-time structure, not a workload analysis; such validation would require experimental study outside the scope of this short theoretical note. In high-throughput contexts where multiple queries are processed together, the conditions may hold naturally, but we leave confirmation to practitioners. For strictly online single-query processing, achieving the batch size would require buffering queries, which could increase per-query latency; the paper does not propose or analyze any such adaptation and instead highlights the improvement for batch scenarios. We have revised the manuscript to add a sentence in the introduction explicitly noting the batch-processing context and the conditional nature of the result. revision: partial

standing simulated objections not resolved

Empirical demonstration or reference that the batching thresholds (r lg(h)/lg(r) substrings with average lg(r)/lg(h) haplotypes per substring) are reached in standard long-read or population-genetics SMEM workloads

Circularity Check

0 steps flagged

Batching observation is independent and self-contained

full rationale

The paper's central claim is a conditional observation: under explicit workload assumptions about batching queries to reach r lg(h)/lg(r) substrings with average lg(r)/lg(h) haplotypes each, constant-time reporting becomes possible with O(r log h) bits. This does not reduce any derived quantity to a fitted input or prior result by construction. The citation to Bonizzoni, Gagie and Gao (2026) supplies base time-space tradeoffs but is not load-bearing for the new batching insight, which is presented as an 'if... then we can easily use' statement. No equations equate a prediction to its inputs, and the derivation remains externally falsifiable via workload measurements. This is a normal non-circular theoretical note.

Axiom & Free-Parameter Ledger

2 free parameters · 1 axioms · 0 invented entities

The efficiency result rests on the assumption that batching is feasible in the target workload and that the average number of reported haplotypes per substring meets the stated threshold; these are workload properties rather than quantities derived from the data structure itself.

free parameters (2)

batch threshold r lg(h)/lg(r)
Number of substrings that must be accumulated before reporting begins; chosen to enable the constant-time regime.
average haplotypes per substring lg(r)/lg(h)
Minimum average matches required per substring to amortize the reporting cost to constant time.

axioms (1)

standard math Standard logarithmic space and time bounds for run-length encoded structures in string algorithms
Invoked implicitly when stating the O(r log h) bit space and constant-time reporting.

pith-pipeline@v0.9.0 · 5827 in / 1384 out tokens · 53697 ms · 2026-05-19T19:21:01.355044+00:00 · methodology

More efficient PBWT prefix-array access via batching

Core claim

What carries the argument

If this is right

Where Pith is reading between the lines

Load-bearing premise

What would settle it

discussion (0)