pith. sign in

arxiv: 2604.25294 · v1 · submitted 2026-04-28 · 💻 cs.IT · math.IT

Correcting One Deletion and One Substitution with a Constant Number of Reads

Pith reviewed 2026-05-07 15:08 UTC · model grok-4.3

classification 💻 cs.IT math.IT
keywords reconstruction codessingle deletionsingle substitutionconstant readsredundancyball intersectionerror correctioncoding theory
0
0 comments X

The pith

Codes allow unique reconstruction from 5 reads each with one deletion and one substitution using 3 log n +4 redundant bits.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper constructs codes where any two distinct codewords have single-deletion single-substitution balls that intersect in fewer than N positions. This property ensures that N noisy reads suffice to identify the original codeword uniquely. For N=5 the constructions use 3 log n +4 extra bits, with the number of extra bits falling as N rises to 9, 11, and finally 14 where only log n +3 bits are needed. A reader would care because these results show that a fixed handful of observations can nearly reach the redundancy of codes that assume far more reads or no read limit. The work therefore clarifies the trade-off between the number of available reads and the amount of added redundancy required.

Core claim

In this paper, we investigate the problem of designing (n, N; B)-reconstruction codes for N in {14,11,9,5}, where B is the single-deletion single-substitution ball function that maps a sequence to the set of all sequences obtainable via one deletion and one substitution. Such a code is defined by the requirement that the intersection size of any two distinct single-deletion single-substitution balls is strictly less than the given number of noisy reads N. When N=1 the problem reduces to single-deletion and single-substitution correcting codes, whose best known redundancy is (4+o(1)) log n. We show that this redundancy can be reduced to 3 log n +4 when N=5. As N increases further to 9 and 11,

What carries the argument

The (n,N;B)-reconstruction code, a subset of strings such that the single-deletion single-substitution balls of any two distinct codewords intersect in fewer than N strings; the ball intersection condition lets N noisy versions identify the original string.

If this is right

  • For N=5 the codes achieve 3 log n +4 bits of redundancy.
  • For N=9 the redundancy drops to 2 log n +12 log log n +O(1).
  • For N=11 the redundancy is log n +12 log log n +O(1).
  • For N=14 only log n +3 bits of redundancy are required, two bits above the best known codes for N=18.
  • Any code built for a given N also works for all larger constant values of N.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same ball-intersection approach can be examined for other fixed combinations of deletions, insertions and substitutions.
  • Closing the remaining two-bit gap between the N=14 and N=18 constructions may be possible by examining intermediate constant values of N.
  • The explicit dependence on hash functions and markers points to a general method for controlling overlaps in error balls under limited observations.

Load-bearing premise

Suitable combinatorial objects exist that enforce the ball-intersection size to be strictly less than N while using only the stated number of redundant bits.

What would settle it

Finding two strings at the claimed redundancy whose single-deletion single-substitution balls intersect in N or more sequences would show that the intersection condition fails for that N.

Figures

Figures reproduced from arXiv: 2604.25294 by Gennian Ge, Yubo Sun, Yuling Li.

Figure 1
Figure 1. Figure 1: Illustrations of ψ(x(0, ex)) and ψ(y(0, ey)) when ψ(x(0, ex))[n+1]\dx = ψ(y(0, ey))[n+1]\dy , where dx > dy. Each pair of bits connected by a solid line are of equal value, while those connected by a yellow undertone signify a connection resulted from deletions. Similarly, when x(dx, ex) = y(dy, ey) and dx < dy, we have ψ(x(0, ex))dx ⊕ ψ(x(0, ex))dx+1 = ψ(y(0, ey))dx , view at source ↗
read the original abstract

In this paper, we investigate the problem of designing $(n, N; \mathcal{B})$-reconstruction codes for $N\in \{14,11,9,5\}$, where $\mathcal{B}$ is the single-deletion single-substitution ball function that maps a sequence to the set of all sequences obtainable via one deletion and one substitution. Such a code is defined by the requirement that the intersection size of any two distinct single-deletion single-substitution balls is strictly less than the given number of noisy reads $N$. Note that for any $1\le N<N'$, an $(n, N; \mathcal{B})$-reconstruction code is also an $(n, N'; \mathcal{B})$-reconstruction code. It follows that the problem of designing $(n, N; \mathcal{B})$-reconstruction codes with less redundancy becomes more challenging as $N$ decreases, particularly because the problem for $N=1$ already reduces to the coding problem of single-deletion and single-substitution correcting codes. To the best of our knowledge, most existing results focus on the case where $N$ is a linear function of $n$, while only a limited number consider constant $N$. When $N=1$, the best known $(n, 1; \mathcal{B})$-reconstruction codes (single-deletion and single-substitution correcting codes) require $(4+o(1))\log n$ redundant bits. In this work, we show that this redundancy can be reduced to $3\log n+4$ when $N=5$. As $N$ increases further to $9$ and $11$, the redundancy can be improved to $2\log n+12\log\log n+O(1)$ and $\log n +12\log \log n+O(1)$, respectively. Finally, for $N=14$, we provide a reconstruction code with $\log n+3$ bits of redundancy, which is only two bits more than the best known $(n, 18; \mathcal{B})$-reconstruction codes.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper studies (n, N; B)-reconstruction codes where B is the ball of all strings obtained from a given string by one deletion and one substitution. It claims explicit constructions achieving redundancy 3 log n + 4 for N=5, 2 log n + 12 log log n + O(1) for N=9, log n + 12 log log n + O(1) for N=11, and log n + 3 for N=14. These are positioned as improvements over the (4+o(1)) log n redundancy known for the N=1 case (ordinary single-deletion single-substitution correcting codes).

Significance. If the constructions are valid, the results would be a useful contribution to the theory of reconstruction codes with constant reads. They quantify the redundancy-reads trade-off for this combined error model and show that moderate constant N already yields redundancy close to the information-theoretic floor (log n + O(1)). The explicit expressions for several N values and the comparison to the N=18 benchmark are concrete and could guide further work.

major comments (3)
  1. [Abstract / N=5 construction] Abstract and the constructions for N=5: the claimed redundancy of 3 log n + 4 rests on the existence of auxiliary hash or marker families that enforce |B(x) ∩ B(y)| < 5 while adding at most 3 log n + 4 bits. Because B is the combined deletion-substitution ball, its intersection structure differs from pure deletion or substitution balls; the paper must verify that standard existence results for hash families apply at these exact parameters without extra slack in the redundancy.
  2. [N=9 and N=11 constructions] Constructions for N=9 and N=11: the stated redundancies 2 log n + 12 log log n + O(1) and log n + 12 log log n + O(1) likewise depend on combinatorial objects (markers or hashes) that keep ball intersections below N. The manuscript should supply a concrete parameter check or self-contained existence proof showing that the O(1) and log-log terms suffice for the specific size and intersection properties of B.
  3. [N=14 construction] N=14 construction: the claim of log n + 3 redundancy (only two bits above the best known (n,18;B) code) is load-bearing for the overall narrative. The argument must confirm that the auxiliary objects used to enforce |B(x) ∩ B(y)| < 14 contribute no hidden n-dependent overhead beyond the stated +3 bits.
minor comments (2)
  1. A compact table listing N, achieved redundancy, and comparison to the N=1 and N=18 baselines would improve readability.
  2. [Introduction / Preliminaries] The notation for the ball B and the intersection condition |B(x) ∩ B(y)| < N should be introduced with a short example in the preliminaries.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the careful reading and constructive comments on our work. We address each major comment below with clarifications drawn directly from the manuscript's constructions. We will incorporate additional explicit parameter verifications and a short appendix in the revised version to strengthen the presentation without altering the claimed redundancies.

read point-by-point responses
  1. Referee: [Abstract / N=5 construction] Abstract and the constructions for N=5: the claimed redundancy of 3 log n + 4 rests on the existence of auxiliary hash or marker families that enforce |B(x) ∩ B(y)| < 5 while adding at most 3 log n + 4 bits. Because B is the combined deletion-substitution ball, its intersection structure differs from pure deletion or substitution balls; the paper must verify that standard existence results for hash families apply at these exact parameters without extra slack in the redundancy.

    Authors: We appreciate the referee pointing this out. Section 3 of the manuscript provides an explicit probabilistic construction of the hash families adapted to the deletion-substitution ball B, whose size is Θ(n²). The union bound is applied directly to pairs of codewords using the precise intersection probability for this B (accounting for the two error types and their positions), yielding exactly 3 log n + 4 bits with no additional slack. The constant 4 absorbs the small number of cases for synchronization. To make this verification fully self-contained, we will add a dedicated lemma with the parameter calculation in the revision. revision: partial

  2. Referee: [N=9 and N=11 constructions] Constructions for N=9 and N=11: the stated redundancies 2 log n + 12 log log n + O(1) and log n + 12 log log n + O(1) likewise depend on combinatorial objects (markers or hashes) that keep ball intersections below N. The manuscript should supply a concrete parameter check or self-contained existence proof showing that the O(1) and log-log terms suffice for the specific size and intersection properties of B.

    Authors: The constructions in Sections 4 and 5 build on efficient marker families whose redundancy is known to be 12 log log n + O(1) and that are adapted to enforce |B(x) ∩ B(y)| < N for these specific N. The log-log term comes from the marker length needed to handle O(n²) possible error locations in B, while the O(1) term is verified by direct counting of the remaining intersection cases after marker placement. We will add a self-contained existence proof (via the probabilistic method with explicit constants) and a parameter table in an appendix to confirm the bounds hold for B. revision: yes

  3. Referee: [N=14 construction] N=14 construction: the claim of log n + 3 redundancy (only two bits above the best known (n,18;B) code) is load-bearing for the overall narrative. The argument must confirm that the auxiliary objects used to enforce |B(x) ∩ B(y)| < 14 contribute no hidden n-dependent overhead beyond the stated +3 bits.

    Authors: Section 6 constructs the code by appending a fixed-length synchronization marker of 3 bits chosen to resolve all possible deletion-substitution patterns that could cause |B(x) ∩ B(y)| ≥ 14. Because the marker is constant-length (independent of n) and the main code contributes the log n term, there is no hidden n-dependent overhead. The choice of 3 bits is verified by exhaustive enumeration over the O(n) possible deletion positions and 4 substitution symbols, showing that N=14 is achieved. We will include an explicit verification table in the revision. revision: partial

Circularity Check

0 steps flagged

Constructions rely on standard combinatorial existence assumptions without self-referential reduction

full rationale

The paper's claims for reduced redundancy at N=5,9,11,14 are presented as explicit constructions that enforce |B(x) ∩ B(y)| < N via auxiliary objects such as hash families or markers. The abstract and described approach cite prior bounds for N=1 as a baseline and improve upon them, but no equations or steps reduce the stated redundancy figures to fitted parameters, self-definitions, or load-bearing self-citations. The ball-intersection condition is an external combinatorial requirement rather than an internal tautology. This is a normal non-circular finding for existence-based coding constructions.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claims rest on the standard definition of the single-deletion single-substitution ball and the combinatorial requirement that distinct balls intersect in fewer than N sequences; no free parameters or new entities are introduced in the abstract.

axioms (2)
  • domain assumption The single-deletion single-substitution ball B maps each sequence to the set of all sequences obtainable by one deletion and one substitution.
    This is the error model used to define the reconstruction codes.
  • domain assumption An (n, N; B)-reconstruction code requires that the intersection of any two distinct balls has size strictly less than N.
    This is the defining property that enables unique recovery from N reads.

pith-pipeline@v0.9.0 · 5685 in / 1486 out tokens · 78635 ms · 2026-05-07T15:08:56.550139+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

28 extracted references · 28 canonical work pages

  1. [1]

    On Levenshtein’s reconstruction problem under insertions, deletions, and substitutions,

    M. Abu-Sini and E. Yaakobi, “On Levenshtein’s reconstruction problem under insertions, deletions, and substitutions,”IEEE Transactions on Information Theory, vol. 67, no. 11, pp. 7132-7158, 2021

  2. [2]

    Coding for sequence reconstruction for single edits,

    K. Cai, H. M. Kiah, T. T. Nguyen, and E. Yaakobi, “Coding for sequence reconstruction for single edits,”IEEE Transactions on Information Theory, vol. 68, no. 1, pp. 66-79, 2022

  3. [3]

    Coding for racetrack memories,

    Y . M. Chee, H. M. Kiah, A. Vardy, V . K. Vu, and E. Yaakobi, “Coding for racetrack memories,”IEEE Transactions on Information Theory, vol. 64, no. 11, pp. 7094-7112, 2018

  4. [4]

    Correcting deletions with multiple reads,

    J. Chrisnata, H. M. Kiah, and E. Yaakobi, “Correcting deletions with multiple reads,”IEEE Transactions on Information Theory, vol. 68, no. 11, pp. 7141-7158, 2022

  5. [5]

    Next-generation digital information storage in DNA,

    G. M. Church, Y . Gao, and S. Kosuri, “Next-generation digital information storage in DNA,”Science, vol. 337, no. 6102, pp. 1628-1628, 2012

  6. [6]

    Sequence reconstruction over the deletion channel,

    R. Gabrys and E. Yaakobi, “Sequence reconstruction over the deletion channel,”IEEE Transactions on Information Theory, vol. 64, no. 4, pp. 2924-2931, 2018

  7. [7]

    Towards practical, high-capacity, low-maintenance information storage in synthesized DNA,

    N. Goldman, P. Bertone, S. Chen, C. Dessimoz, E. M. LeProust, B. Sipos, and E. Birney, “Towards practical, high-capacity, low-maintenance information storage in synthesized DNA,”Nature, vol. 494, pp. 77-80, 2013

  8. [8]

    Universal framework for parametric constrained coding,

    A. Kobovich, O. Leitersdorf, D. Bar-Lev, and E. Yaakobi, “Universal framework for parametric constrained coding,”IEEE Transactions on Information Theory, Early Access, doi: 10.1109/TIT.2026.3672710

  9. [9]

    Sequence reconstruction under channels with multiple bursts of insertions or deletions,

    Z. Lan, Y . Sun, W. Yu, and G. Ge, “Sequence reconstruction under channels with multiple bursts of insertions or deletions,”IEEE Transactions on Information Theory, vol. 72, no. 1, pp. 315-330, 2026

  10. [10]

    Binary codes capable of correcting deletions, insertions, and reversals,

    V . I. Levenshtein, “Binary codes capable of correcting deletions, insertions, and reversals,”Soviet Physics Doklady, vol. 10, no. 8, pp. 707-710, 1966

  11. [11]

    Efficient reconstruction of sequences,

    V . I. Levenshtein, “Efficient reconstruction of sequences,”IEEE Transactions on Information Theory, vol. 47, no. 1, pp. 2-22, 2001

  12. [12]

    Efficient reconstruction of sequences from their subsequences or supersequences,

    V . I. Levenshtein, “Efficient reconstruction of sequences from their subsequences or supersequences,”Journal of Combinatorial Theory, Series A, vol. 93, no. 2, pp. 310-332, 2001

  13. [13]

    Binary reconstruction codes for correcting one deletion and one substitution,

    Y . Li, Y . Sun, and G. Ge, “Binary reconstruction codes for correcting one deletion and one substitution,”arXiv:2505.04232, 2025

  14. [14]

    Random access in large-scale DNA data storage,

    L. Organick, S. Ang, Y .-J. Chen, R. Lopez, S. Yekhanin, K. Makarychev, M. Racz, G. Kamath, P. Gopalan, B. Nguyen, C. Takahashi, S. Newman, H.-Y . Parker, C. Rashtchian, K. Stewart, G. Gupta, R. Carlson, J. Mulligan, D. Carmean, G. Seelig, L. Ceze, and K. Strauss, “Random access in large-scale DNA data storage,”Nature Biotechnology, vol. 36, no. 3, pp. 24...

  15. [15]

    Sequence reconstruction problem for deletion channels: a complete asymptotic solution,

    V . L. P. Pham, K. Goyal, and H. M. Kiah, “Sequence reconstruction problem for deletion channels: a complete asymptotic solution,”Journal of Combinatorial Theory, Series A, vol. 211, p. 105980, 2025

  16. [16]

    Exact reconstruction from insertions in synchronization codes,

    F. Sala, R. Gabrys, C. Schoeny, and L. Dolecek, “Exact reconstruction from insertions in synchronization codes,”IEEE Transactions on Information Theory, vol. 63, no. 4, pp. 2428-2445, 2017

  17. [17]

    Single-deletion single-substitution correcting codes,

    I. Smagloy, L. Welter, A. Wachter-Zeh, and E. Yaakobi, “Single-deletion single-substitution correcting codes,”IEEE Transactions on Information Theory, vol. 69, no. 12, pp. 7659-7671, 2023

  18. [18]

    List-decodable codes for single-deletion single-substitution with list-size two,

    W. Song, K. Cai, and T. T. Nguyen, “List-decodable codes for single-deletion single-substitution with list-size two,” inProceeding of the International Symposium on Information Theory (ISIT), Espoo, Finland, pp. 1004-1009, 2022

  19. [19]

    Sequence reconstruction for the single-deletion single-substitution channel,

    W. Song, K. Cai, and T. Q. S. Quek, “Sequence reconstruction for the single-deletion single-substitution channel,”IEEE Journal on Selected Areas in Information Theory, vol. 6, pp. 232-247, 2025

  20. [20]

    Correcting two-deletion with a constant number of reads,

    Y . Sun and G. Ge, “Correcting two-deletion with a constant number of reads,”IEEE Transactions on Information Theory, vol. 69, no. 5, pp. 2969-2982, 2023

  21. [21]

    Binary codes for correcting two edits,

    Y . Sun and G. Ge, “Binary codes for correcting two edits,”IEEE Transactions on Information Theory, vol. 70, no. 10, pp. 6877-6898, 2024

  22. [22]

    Bounds and constructions ofℓ-read codes under the Hamming metric,

    Y . Sun and G. Ge, “Bounds and constructions ofℓ-read codes under the Hamming metric,”IEEE Transactions on Information Theory, vol. 71, no. 8, pp. 5868-5883, 2025

  23. [23]

    Survey for a decade of coding for DNA storage,

    O. Sabary, H. M. Kiah, P. H. Siegel, and E. Yaakobi, “Survey for a decade of coding for DNA storage,”IEEE Transactions on Molecular, Biological, and Multi-Scale Communications, vol. 10, no. 2, pp. 253-271, 2024

  24. [24]

    Balanced reconstruction codes for single edits,

    R. Wu and X. Zhang, “Balanced reconstruction codes for single edits,”Designs, Codes and Cryptography, vol. 92, pp. 2011-2029, 2024

  25. [25]

    Survey of sequence reconstruction problems and their applications in DNA-based storage,

    Y . Yang, “Survey of sequence reconstruction problems and their applications in DNA-based storage,”IEEE Journal on Selected Areas in Information Theory, vol. 6, pp. 352-366, 2025

  26. [26]

    DNA-based storage: trends and methods,

    S. Yazdi, H. M. Kiah, E. R. Garcia, J. Ma, H. Zhao, and O. Milenkovic, “DNA-based storage: trends and methods,”IEEE Transactions on Molecular, Biological, and Multi-Scale Communications, vol. 1, no. 3, pp. 230-248, 2015

  27. [27]

    Reconstruction of sequences distorted by two insertions,

    Z. Ye, X. Liu, X. Zhang, and G. Ge, “Reconstruction of sequences distorted by two insertions,”IEEE Transactions on Information Theory, vol. 69, no. 8, pp. 4977-4992, 2023

  28. [28]

    Sequence reconstruction over3-deletion channels,

    D. Zhang, G. Ge, and Y . Zhang, “Sequence reconstruction over3-deletion channels,” inProceeding of the International Symposium on Information Theory (ISIT), Athens, Greece, 2024, pp. 891-896