Efficient average-case population recovery in the presence of insertions and deletions

Frank Ban; Rocco A. Servedio; Sandip Sinha; Xi Chen

arxiv: 1907.05964 · v1 · pith:IYKEES4Mnew · submitted 2019-07-12 · 💻 cs.DS · cs.IT· cs.LG· math.IT

Efficient average-case population recovery in the presence of insertions and deletions

Frank Ban , Xi Chen , Rocco A. Servedio , Sandip Sinha This is my paper

Pith reviewed 2026-05-24 22:03 UTC · model grok-4.3

classification 💻 cs.DS cs.ITcs.LGmath.IT

keywords population recoverytrace reconstructioninsertions and deletionsaverage casebinary stringsdistribution learningsupport recovery

0 comments

The pith

For most random s-subsets of binary strings with s up to exp(Θ(n^{1/3})), any distribution over the subset can be recovered to TV distance ε from indel traces in poly(n,s,1/ε) time using poly(s,1/ε,exp(log^{1/3}n)) samples.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes an efficient algorithm for recovering an unknown distribution D over s unknown binary strings from samples that are indel traces of draws from D. This holds whenever the support set is a typical random s-subset of {0,1}^n and s is at most exp(Θ(n^{1/3})). A reader would care because the sample complexity is only polynomial in s, in contrast to the doubly exponential dependence required in the worst case over all possible supports. The approach extends single-string average-case trace reconstruction methods to the multi-string population recovery setting.

Core claim

For any support size s ≤ exp(Θ(n^{1/3})), for a 1-o(1) fraction of all s-element support sets {x¹,…,xˢ} ⊂ {0,1}ⁿ, for every distribution D supported on {x¹,…,xˢ}, the algorithm efficiently recovers D up to total variation distance ε with high probability, given access to independent traces of independent draws from D. The algorithm runs in time poly(n,s,1/ε) and its sample complexity is poly(s,1/ε,exp(log^{1/3}n)).

What carries the argument

Separation properties that hold for a 1-o(1) fraction of random s-subsets and enable reduction to efficient single-string average-case trace reconstruction.

If this is right

Sample complexity depends only polynomially on s rather than doubly exponentially.
Recovery succeeds for every distribution D on the support, not merely the uniform distribution.
Running time remains polynomial in both n and s.
The guarantee applies with high probability over the random choice of the support set.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same random-support separation properties may extend the approach to other deletion-insertion-substitution channels.
In settings where source strings are generated randomly, such as certain storage or communication protocols, the polynomial dependence on s becomes practically relevant.
Adversarial choice of the support set is expected to force a return to doubly exponential sample complexity, highlighting the gap between average and worst case.

Load-bearing premise

The support set must be drawn uniformly at random from all possible s-subsets of {0,1}^n so that it satisfies the required separation properties with high probability.

What would settle it

Exhibit one concrete s-subset of size exp(Θ(n^{1/3})) together with a distribution D on it such that Ω(poly(s,exp(log^{1/3}n))) samples are required to recover D to constant TV distance from its indel traces.

read the original abstract

Several recent works have considered the \emph{trace reconstruction problem}, in which an unknown source string $x\in\{0,1\}^n$ is transmitted through a probabilistic channel which may randomly delete coordinates or insert random bits, resulting in a \emph{trace} of $x$. The goal is to reconstruct the original string~$x$ from independent traces of $x$. While the best algorithms known for worst-case strings use $\exp(O(n^{1/3}))$ traces \cite{DOS17,NazarovPeres17}, highly efficient algorithms are known \cite{PZ17,HPP18} for the \emph{average-case} version, in which $x$ is uniformly random. We consider a generalization of this average-case trace reconstruction problem, which we call \emph{average-case population recovery in the presence of insertions and deletions}. In this problem, there is an unknown distribution $\cal{D}$ over $s$ unknown source strings $x^1,\dots,x^s \in \{0,1\}^n$, and each sample is independently generated by drawing some $x^i$ from $\cal{D}$ and returning an independent trace of $x^i$. Building on \cite{PZ17} and \cite{HPP18}, we give an efficient algorithm for this problem. For any support size $s \leq \smash{\exp(\Theta(n^{1/3}))}$, for a $1-o(1)$ fraction of all $s$-element support sets $\{x^1,\dots,x^s\} \subset \{0,1\}^n$, for every distribution $\cal{D}$ supported on $\{x^1,\dots,x^s\}$, our algorithm efficiently recovers ${\cal D}$ up to total variation distance $\epsilon$ with high probability, given access to independent traces of independent draws from $\cal{D}$. The algorithm runs in time poly$(n,s,1/\epsilon)$ and its sample complexity is poly$(s,1/\epsilon,\exp(\log^{1/3}n)).$ This polynomial dependence on the support size $s$ is in sharp contrast with the \emph{worst-case} version (when $x^1,\dots,x^s$ may be any strings in $\{0,1\}^n$), in which the sample complexity of the most efficient known algorithm \cite{BCFSS19} is doubly exponential in $s$.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

Extends average-case single-string trace reconstruction to population recovery over s strings, keeping polynomial dependence on s for most random supports.

read the letter

The paper gives an algorithm that recovers any distribution D over an s-element support from indel traces, using poly(s, 1/eps, exp((log n)^{1/3})) samples, when the support is a typical random s-subset of {0,1}^n and s is at most exp(Theta(n^{1/3})). This is the main new piece: it lifts the average-case single-string results of PZ17 and HPP18 to the multi-string setting without paying exponential or doubly exponential in s, in contrast to the worst-case bound in BCFSS19. The reduction appears to work by exploiting separation properties that hold for most random supports, then applying the single-string methods to the population case. That is a clean extension and the sample bound matches what one would hope for from the priors. The time bound is also polynomial in n, s, and 1/eps, which is practical within the model. The central limitation is the explicit average-case scope over the support: success holds only for a 1-o(1) fraction of random s-subsets, because those are the ones where the required separation properties kick in. The paper does not claim the result for adversarial supports, so that is not a flaw but a stated boundary. Soundness rests on the two cited papers; if they are correct, the extension looks fine, and there is no evident circularity or post-hoc fitting. This is useful reading for people working on trace reconstruction, population recovery, or noisy sequence problems in theoretical CS. A reader already following the average-case trace reconstruction line will see the natural next step and the preserved efficiency. It deserves peer review because the improvement on s-dependence is concrete and the approach is scoped precisely enough to be checkable.

Referee Report

0 major / 2 minor

Summary. The paper presents an efficient algorithm for the average-case population recovery problem with insertions and deletions. For any support size s ≤ exp(Θ(n^{1/3})), for a 1-o(1) fraction of all s-element support sets {x¹,…,xˢ} ⊂ {0,1}ⁿ, and for every distribution D supported on that set, the algorithm recovers D up to total variation distance ε with high probability from independent traces. It runs in time poly(n,s,1/ε) with sample complexity poly(s,1/ε,exp(log^{1/3}n)). The result builds directly on the average-case trace reconstruction algorithms of PZ17 and HPP18, extending them from single-string reconstruction to distribution recovery over multiple strings.

Significance. If the result holds, the work demonstrates that average-case assumptions over random supports allow polynomial dependence on s for population recovery, in sharp contrast to the doubly exponential dependence required in the worst-case setting (BCFSS19). This strengthens the case that average-case trace reconstruction techniques can scale to multi-string settings while preserving near-optimal sample complexity in n. The explicit average-case guarantee over uniform s-subsets is a clear strength, as is the polynomial runtime in n and s.

minor comments (2)

[Abstract] Abstract: the notation exp(log^{1/3}n) for the sample complexity term could be clarified with an explicit statement of the exponent base (natural log or log base 2) to avoid ambiguity with prior trace reconstruction bounds.
The manuscript should include a brief comparison table or paragraph contrasting the new sample complexity with the single-string bounds from PZ17 and HPP18 to highlight the precise overhead introduced by moving to population recovery.

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for their positive summary, assessment of significance, and recommendation to accept the paper. We appreciate the recognition of the polynomial dependence on s under average-case assumptions and the explicit guarantee over uniform random supports.

Circularity Check

0 steps flagged

No significant circularity

full rationale

The paper's central result is an explicit extension of the average-case trace reconstruction algorithms from the externally cited works PZ17 and HPP18 (distinct authorship) to the population-recovery setting, with the key guarantee stated as holding for a 1-o(1) fraction of uniformly random s-subsets of {0,1}^n. No derivation step reduces a claimed prediction or uniqueness property to a fitted parameter, self-citation, or ansatz imported from the authors' own prior work; the sample and time bounds are presented as direct consequences of the cited external algorithms plus standard concentration arguments over the random support. The average-case premise is openly declared in the theorem statement rather than smuggled in, so the derivation chain remains self-contained against the external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The result rests on the average-case assumption that the support set is typical; no explicit free parameters or invented entities are named in the abstract. The proof technique inherits any unstated assumptions from PZ17 and HPP18.

axioms (1)

domain assumption The support set is drawn uniformly at random among all s-subsets of {0,1}^n (so that concentration properties hold with high probability).
Stated in the main theorem claim in the abstract; if false the polynomial dependence on s may not hold.

pith-pipeline@v0.9.0 · 5991 in / 1661 out tokens · 21709 ms · 2026-05-24T22:03:32.836579+00:00 · methodology

Efficient average-case population recovery in the presence of insertions and deletions

Core claim

What carries the argument

If this is right

Where Pith is reading between the lines

Load-bearing premise

What would settle it

discussion (0)