Efficient average-case population recovery in the presence of insertions and deletions
Pith reviewed 2026-05-24 22:03 UTC · model grok-4.3
The pith
For most random s-subsets of binary strings with s up to exp(Θ(n^{1/3})), any distribution over the subset can be recovered to TV distance ε from indel traces in poly(n,s,1/ε) time using poly(s,1/ε,exp(log^{1/3}n)) samples.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
For any support size s ≤ exp(Θ(n^{1/3})), for a 1-o(1) fraction of all s-element support sets {x¹,…,xˢ} ⊂ {0,1}ⁿ, for every distribution D supported on {x¹,…,xˢ}, the algorithm efficiently recovers D up to total variation distance ε with high probability, given access to independent traces of independent draws from D. The algorithm runs in time poly(n,s,1/ε) and its sample complexity is poly(s,1/ε,exp(log^{1/3}n)).
What carries the argument
Separation properties that hold for a 1-o(1) fraction of random s-subsets and enable reduction to efficient single-string average-case trace reconstruction.
If this is right
- Sample complexity depends only polynomially on s rather than doubly exponentially.
- Recovery succeeds for every distribution D on the support, not merely the uniform distribution.
- Running time remains polynomial in both n and s.
- The guarantee applies with high probability over the random choice of the support set.
Where Pith is reading between the lines
- The same random-support separation properties may extend the approach to other deletion-insertion-substitution channels.
- In settings where source strings are generated randomly, such as certain storage or communication protocols, the polynomial dependence on s becomes practically relevant.
- Adversarial choice of the support set is expected to force a return to doubly exponential sample complexity, highlighting the gap between average and worst case.
Load-bearing premise
The support set must be drawn uniformly at random from all possible s-subsets of {0,1}^n so that it satisfies the required separation properties with high probability.
What would settle it
Exhibit one concrete s-subset of size exp(Θ(n^{1/3})) together with a distribution D on it such that Ω(poly(s,exp(log^{1/3}n))) samples are required to recover D to constant TV distance from its indel traces.
read the original abstract
Several recent works have considered the \emph{trace reconstruction problem}, in which an unknown source string $x\in\{0,1\}^n$ is transmitted through a probabilistic channel which may randomly delete coordinates or insert random bits, resulting in a \emph{trace} of $x$. The goal is to reconstruct the original string~$x$ from independent traces of $x$. While the best algorithms known for worst-case strings use $\exp(O(n^{1/3}))$ traces \cite{DOS17,NazarovPeres17}, highly efficient algorithms are known \cite{PZ17,HPP18} for the \emph{average-case} version, in which $x$ is uniformly random. We consider a generalization of this average-case trace reconstruction problem, which we call \emph{average-case population recovery in the presence of insertions and deletions}. In this problem, there is an unknown distribution $\cal{D}$ over $s$ unknown source strings $x^1,\dots,x^s \in \{0,1\}^n$, and each sample is independently generated by drawing some $x^i$ from $\cal{D}$ and returning an independent trace of $x^i$. Building on \cite{PZ17} and \cite{HPP18}, we give an efficient algorithm for this problem. For any support size $s \leq \smash{\exp(\Theta(n^{1/3}))}$, for a $1-o(1)$ fraction of all $s$-element support sets $\{x^1,\dots,x^s\} \subset \{0,1\}^n$, for every distribution $\cal{D}$ supported on $\{x^1,\dots,x^s\}$, our algorithm efficiently recovers ${\cal D}$ up to total variation distance $\epsilon$ with high probability, given access to independent traces of independent draws from $\cal{D}$. The algorithm runs in time poly$(n,s,1/\epsilon)$ and its sample complexity is poly$(s,1/\epsilon,\exp(\log^{1/3}n)).$ This polynomial dependence on the support size $s$ is in sharp contrast with the \emph{worst-case} version (when $x^1,\dots,x^s$ may be any strings in $\{0,1\}^n$), in which the sample complexity of the most efficient known algorithm \cite{BCFSS19} is doubly exponential in $s$.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper presents an efficient algorithm for the average-case population recovery problem with insertions and deletions. For any support size s ≤ exp(Θ(n^{1/3})), for a 1-o(1) fraction of all s-element support sets {x¹,…,xˢ} ⊂ {0,1}ⁿ, and for every distribution D supported on that set, the algorithm recovers D up to total variation distance ε with high probability from independent traces. It runs in time poly(n,s,1/ε) with sample complexity poly(s,1/ε,exp(log^{1/3}n)). The result builds directly on the average-case trace reconstruction algorithms of PZ17 and HPP18, extending them from single-string reconstruction to distribution recovery over multiple strings.
Significance. If the result holds, the work demonstrates that average-case assumptions over random supports allow polynomial dependence on s for population recovery, in sharp contrast to the doubly exponential dependence required in the worst-case setting (BCFSS19). This strengthens the case that average-case trace reconstruction techniques can scale to multi-string settings while preserving near-optimal sample complexity in n. The explicit average-case guarantee over uniform s-subsets is a clear strength, as is the polynomial runtime in n and s.
minor comments (2)
- [Abstract] Abstract: the notation exp(log^{1/3}n) for the sample complexity term could be clarified with an explicit statement of the exponent base (natural log or log base 2) to avoid ambiguity with prior trace reconstruction bounds.
- The manuscript should include a brief comparison table or paragraph contrasting the new sample complexity with the single-string bounds from PZ17 and HPP18 to highlight the precise overhead introduced by moving to population recovery.
Simulated Author's Rebuttal
We thank the referee for their positive summary, assessment of significance, and recommendation to accept the paper. We appreciate the recognition of the polynomial dependence on s under average-case assumptions and the explicit guarantee over uniform random supports.
Circularity Check
No significant circularity
full rationale
The paper's central result is an explicit extension of the average-case trace reconstruction algorithms from the externally cited works PZ17 and HPP18 (distinct authorship) to the population-recovery setting, with the key guarantee stated as holding for a 1-o(1) fraction of uniformly random s-subsets of {0,1}^n. No derivation step reduces a claimed prediction or uniqueness property to a fitted parameter, self-citation, or ansatz imported from the authors' own prior work; the sample and time bounds are presented as direct consequences of the cited external algorithms plus standard concentration arguments over the random support. The average-case premise is openly declared in the theorem statement rather than smuggled in, so the derivation chain remains self-contained against the external benchmarks.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption The support set is drawn uniformly at random among all s-subsets of {0,1}^n (so that concentration properties hold with high probability).
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.