pith. machine review for the scientific record.
sign in

arxiv: 2509.24161 · v2 · submitted 2025-09-29 · 💻 cs.IT · math.IT

Capacity-Achieving Codes for Noisy Insertion Channels

Pith reviewed 2026-05-18 13:07 UTC · model grok-4.3

classification 💻 cs.IT math.IT
keywords noisy insertion channelcoding capacityerror-correcting codesDNA storageinsertion errorsasymptotic optimalitychannel model
0
0 comments X

The pith

The coding capacity of a noisy insertion channel is determined and achieved by asymptotically optimal codes.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper studies a channel relevant to DNA data storage where insertions of symbols that match or complement the original can occur infinitely often, along with at most one random symbol insertion due to noise. The authors find the exact coding capacity of this channel and build error-correcting codes whose rates approach this capacity for long sequences. A reader interested in reliable long-term data storage would see this as providing both the theoretical limit and practical constructions for handling the dominant errors in DNA sequences.

Core claim

We determine the coding capacity of the noisy channel and construct asymptotically optimal error-correcting codes achieving the coding capacity. The channel allows infinitely many insertions of complement or identical symbols and up to one insertion of a random symbol.

What carries the argument

The noisy insertion channel model that includes unlimited identical or complementary insertions and at most one random insertion, along with the capacity calculation and code constructions that achieve it.

If this is right

  • The capacity value sets the maximum rate for reliable communication over this channel.
  • Explicit codes can correct the described insertion errors while operating close to the capacity.
  • As block lengths increase, the codes' performance approaches the theoretical limit.
  • DNA storage systems can use these codes to improve data recovery reliability.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • If the model holds, similar capacity results might apply to related channels with different noise levels.
  • Testing these codes in actual DNA synthesis and sequencing experiments could validate the approach.
  • Extensions to handle multiple random insertions could build on this foundation.

Load-bearing premise

The specific error model with infinitely many complement or identical insertions plus at most one random insertion correctly represents the main errors in DNA storage.

What would settle it

A calculation or proof showing that the capacity is different from what is claimed, or an experiment where codes exceed the stated capacity without error, would disprove the result.

read the original abstract

DNA storage has emerged as a promising solution for large-scale and long-term data preservation. Among various error types, insertions are the most frequent errors occurring in DNA sequences, where the inserted symbol is often identical or complementary to the original, and in practical implementations, noise can further cause the inserted symbol to mutate into a random one, which creates significant challenges to reliable data recovery. In this paper, we investigate a new noisy insertion channel, where infinitely many insertions of symbols complement or identical to the original ones and up to one insertion of random symbol may occur. We determine the coding capacity of the noisy channel and construct asymptotically optimal error-correcting codes achieving the coding capacity.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 3 minor

Summary. The manuscript investigates a noisy insertion channel for DNA storage, in which the channel permits arbitrarily many insertions of symbols identical or complementary to the original and at most one insertion of a random symbol. It claims to derive the coding capacity of this channel via a mutual-information characterization and to construct asymptotically optimal error-correcting codes that achieve the capacity through a random-coding argument.

Significance. If the derivations hold, the work supplies an explicit capacity formula and a matching achievability result for a channel model that captures dominant insertion statistics in DNA storage. The combination of a converse bound and random-coding construction is a standard and rigorous approach in information theory; the result would strengthen the theoretical toolkit for synchronization-error channels.

major comments (1)
  1. §4 (Capacity Derivation): the mutual-information expression for capacity must explicitly account for the normalization of output length under unbounded insertions; without a precise statement of the rate definition (e.g., bits per input symbol versus per output symbol), the claimed capacity value cannot be verified as load-bearing.
minor comments (3)
  1. Abstract: the channel model description is terse; adding one sentence that states the resulting capacity expression would improve accessibility.
  2. Notation: the symbols for identical, complementary, and random insertions should be introduced once in §2 and used consistently thereafter.
  3. References: several foundational papers on deletion/insertion channels (e.g., works by Dobrushin and on DNA-specific models) appear to be missing.

Simulated Author's Rebuttal

1 responses · 0 unresolved

We thank the referee for the careful reading of our manuscript and for the constructive comment on the capacity derivation. We address the point below and will revise the manuscript accordingly to improve clarity.

read point-by-point responses
  1. Referee: §4 (Capacity Derivation): the mutual-information expression for capacity must explicitly account for the normalization of output length under unbounded insertions; without a precise statement of the rate definition (e.g., bits per input symbol versus per output symbol), the claimed capacity value cannot be verified as load-bearing.

    Authors: We agree that an explicit statement of the rate definition improves verifiability. In our derivation the coding capacity is defined in the standard information-theoretic sense as the supremum of achievable rates R where R = (1/n) log |C| (bits per input symbol) for block length n, such that there exist codes with vanishing error probability as n → ∞. The mutual-information characterization in Section 4 is obtained by maximizing I(X;Y) under the channel law that incorporates the (at most one) random insertion together with the insertions of identical or complementary symbols; the variable output length is handled asymptotically because the channel model ensures that the number of insertions is o(n) with high probability under the optimizing input distribution. Nevertheless, to address the referee’s concern directly we will add a dedicated paragraph at the beginning of Section 4 that (i) states the rate definition explicitly as bits per input symbol and (ii) explains how the normalization is preserved under the unbounded but probabilistically controlled insertions. revision: yes

Circularity Check

0 steps flagged

No significant circularity; derivation is self-contained

full rationale

The paper explicitly defines the noisy insertion channel via transition probabilities (arbitrarily many identical/complement insertions plus at most one random-symbol insertion), then derives capacity as the maximum mutual information over input distributions and proves achievability via random coding that matches the converse. These steps follow standard information-theoretic definitions and proof techniques without reducing the claimed capacity or code optimality to fitted parameters, self-citations, or ansatzes imported from prior work by the same authors. The result is therefore independent of its own outputs.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review; no explicit free parameters, axioms, or invented entities are stated. The channel model itself is the main modeling assumption.

pith-pipeline@v0.9.0 · 5632 in / 934 out tokens · 21342 ms · 2026-05-18T13:07:18.936805+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

40 extracted references · 40 canonical work pages

  1. [1]

    and Schwartz, M.: On the reverse-complement sequence- duplication system

    Ben-Tolila, E. and Schwartz, M.: On the reverse-complement sequence- duplication system. IEEE Trans. Inform. Theory 68, 7184–7197 (2022)

  2. [2]

    and Kosuri, S.: Next-generation digital information storage in DNA

    Church, G.M., Gao, Y. and Kosuri, S.: Next-generation digital information storage in DNA. Science 337, 1628 (2012)

  3. [3]

    and Bruck, J.: The capacity of sequence-duplication systems

    Farnoud, F., Schwartz, M. and Bruck, J.: The capacity of sequence-duplication systems. IEEE Trans. Inform. Theory 62, 811–824 (2016)

  4. [4]

    and Bruck, J.: Estimation of duplication history under a stochastic model for tandem repeats

    Farnoud, F., Schwartz, M. and Bruck, J.: Estimation of duplication history under a stochastic model for tandem repeats. BMC Bioinformatics 20, 1–11 (2019) 15

  5. [5]

    and Birney, E.: Towards practical, high-capacity, low-maintenance information storage in synthesized DNA

    Goldman, N., Bertone, P., Chen, S., Dessimoz, C., LeProust, E.M., Sipos, B. and Birney, E.: Towards practical, high-capacity, low-maintenance information storage in synthesized DNA. Nature 494, 77–80 (2013)

  6. [6]

    and Vorobyev, I.: Codes Correcting Long Duplica- tion Errors

    Goshkoder, D., Polyanskii, N. and Vorobyev, I.: Codes Correcting Long Duplica- tion Errors. IEEE Trans. Molecular, Biological, Multi-Scale Commun. 10, 272–288 (2024)

  7. [7]

    and Sala, F.: Codes correcting two deletions

    Gabrys, R. and Sala, F.: Codes correcting two deletions. IEEE Trans. Inform. Theory 6, 965-974 (2018)

  8. [8]

    and Ferreira, H.C.: On multiple insertion/deletion correcting codes

    Helberg, A.S. and Ferreira, H.C.: On multiple insertion/deletion correcting codes. IEEE Trans. Inform. Theory 48, 305-308 (2002)

  9. [9]

    and Bruck, J.: Capacity and expressiveness of genomic tandem duplication

    Jain, S., Farnoud, F. and Bruck, J.: Capacity and expressiveness of genomic tandem duplication. IEEE Trans. Inform. Theory 63, 6129–6138 (2017)

  10. [10]

    and Bruck, J.: Duplication-correcting codes for data storage in the DNA of living organisms

    Jain, S., Farnoud, F., Schwartz, M. and Bruck, J.: Duplication-correcting codes for data storage in the DNA of living organisms. IEEE Trans. Inform. Theory 63, 4996–5010 (2017)

  11. [11]

    and De Figueiredo, P.: DNA watermarking of infectious agents: Progress and prospects

    Jupiter, D.C., Ficht, T.A., Samuel, J., Qin, Q.M. and De Figueiredo, P.: DNA watermarking of infectious agents: Progress and prospects. PLoS pathogens, 6(6), p.e1000950. (2010)

  12. [12]

    IEEE Trans

    Kovaˇ cevi´ c, M.: Zero-error capacity of duplication channels. IEEE Trans. Com- mun. 67, 6735–6742 (2019)

  13. [13]

    Soviet Physics Doklady 10, 707–710 (1966)

    Levenshtein, V.I.: Binary codes capable of correcting deletions, insertions, and reversals. Soviet Physics Doklady 10, 707–710 (1966)

  14. [14]

    and Yaakobi, E.: Duplication-correcting codes

    Lenz, A., Wachter-Zeh, A. and Yaakobi, E.: Duplication-correcting codes. Des. Codes Cryptogr. 87, 277–298 (2019)

  15. [15]

    and Xing, C.: Explicit construction ofq-ary 2-deletion correcting codes with low redundancy

    Liu, S., Tjuawinata, I. and Xing, C.: Explicit construction ofq-ary 2-deletion correcting codes with low redundancy. IEEE Trans. Inform. Theory, 70, 4093-4101 (2024)

  16. [16]

    and Zhang, Y.:t-deletion-s-insertion-burst correcting codes

    Lu, Z. and Zhang, Y.:t-deletion-s-insertion-burst correcting codes. IEEE Trans. Inform. Theory 69, 6401–6413 (2023)

  17. [17]

    and Helbig, A.J.: Origin and evolution of tandem repeats in the mitochondrial DNA control region of shrikes

    Mundy, N.I. and Helbig, A.J.: Origin and evolution of tandem repeats in the mitochondrial DNA control region of shrikes. J. Molecular Evolution 59, 250–257 (2004)

  18. [18]

    In: Proceedings of the 2022 IEEE International Symposium on Information Theory (ISIT2022), Espoo, 16 Finland, pp

    Nguyen T.T., Cai K., Song W., Immink K.A.S.: Optimal single chromosome- inversion correcting codes for data storage in live DNA. In: Proceedings of the 2022 IEEE International Symposium on Information Theory (ISIT2022), Espoo, 16 Finland, pp. 1791–1796 (2022)

  19. [19]

    and Boratnik, B.: Replication slippage versus point mutation rates in short tandem repeats of the human genome, Mol

    Pumpernik, D., Oblak, B. and Boratnik, B.: Replication slippage versus point mutation rates in short tandem repeats of the human genome, Mol. Genet. Genomics 279, 53–61 (2008)

  20. [20]

    Reinsel, D., Rydning, J., and Gantz, J.: Worldwide global datasphere forecast, 2020–2024: The covid-19 data bump and the future of data growth. Int. Data Corp.(IDC) (2020)

  21. [21]

    and Nozaki, T.: An improvement of non-binary code correcting single b-burst of insertions or deletions

    Saeki, T. and Nozaki, T.: An improvement of non-binary code correcting single b-burst of insertions or deletions. In: Proc. Int. Symp. Inform. Theory Appl. (ISITA), 6–10 (2018)

  22. [22]

    and Yaakobi, E.: Codes correcting a burst of deletions or insertions

    Schoeny, C., Wachter-Zeh, A., Gabrys, R. and Yaakobi, E.: Codes correcting a burst of deletions or insertions. IEEE Trans. Inform. Theory 63, 1971–1985 (2017)

  23. [23]

    In 2017 51st Asilomar Conference on Signals, Systems, and Computers 511-515 (2017)

    Schoeny, C., Sala, F., Dolecek, L.: Novel combinatorial coding results for DNA sequencing and data storage. In 2017 51st Asilomar Conference on Signals, Systems, and Computers 511-515 (2017)

  24. [24]

    and Bruck, J.: Error Correction for DNA Storage

    Sima, J., Raviv, N., Schwartz, M. and Bruck, J.: Error Correction for DNA Storage. IEEE BITS Inform. Theory Mag. 3, 78–94 (2023)

  25. [25]

    and Bruck, J.: Two deletion correcting codes from indicator vectors

    Sima, J., Raviv, N. and Bruck, J.: Two deletion correcting codes from indicator vectors. IEEE trans. Inform. Theory 66, 2375-2391 (2019)

  26. [26]

    and Church, G.M.: CRISPR-Cas encoding of digital movie into the genomes of a population of living bacteria

    Shipman, S.L., Nivala, J., Macklis, J.D. and Church, G.M.: CRISPR-Cas encoding of digital movie into the genomes of a population of living bacteria. Nature 547, 345–349 (2017)

  27. [27]

    and Cai, K.: Non-binary two-deletion correcting codes and burst- deletion correcting codes

    Song, W. and Cai, K.: Non-binary two-deletion correcting codes and burst- deletion correcting codes. IEEE Trans. Inform. Theory 69, 6470-6484 (2023)

  28. [28]

    and Ge, G.: Asymptotically Optimal Codes for (t, s)- Burst Error

    Sun, Y., Lu, Z., Zhang, Y. and Ge, G.: Asymptotically Optimal Codes for (t, s)- Burst Error. IEEE Trans. Inform. Theory 71, 1570-1584 (2025)

  29. [29]

    and Farnoud, F.: Error-correcting codes for short tandem duplication and edit errors

    Tang, Y. and Farnoud, F.: Error-correcting codes for short tandem duplication and edit errors. IEEE Trans. Inform. Theory 68, 871–880 (2021)

  30. [30]

    and Farnoud, F.: Low-redundancy codes for correcting multiple short-duplication and edit errors

    Tang, Y., Wang, S., Lou, H., Gabrys, R. and Farnoud, F.: Low-redundancy codes for correcting multiple short-duplication and edit errors. IEEE Trans. Inform. Theory 69, 2940–2954 (2023)

  31. [31]

    and Farnoud, F.: Single-error detection and correction for duplication and substitution channels

    Tang, Y., Yehezkeally, Y., Schwartz, M. and Farnoud, F.: Single-error detection and correction for duplication and substitution channels. IEEE Trans. Inform. Theory 66, 6908–6919 (2020) 17

  32. [32]

    and Farnoud, F.: Error-correcting codes for noisy duplication channels

    Tang, Y. and Farnoud, F.: Error-correcting codes for noisy duplication channels. IEEE Trans. Inform. Theory 67, 3452-3463 (2021)

  33. [33]

    and Farnoud, F.: Error-correcting codes for short tandem duplications and at mostpsubstitutions

    Tang, Y., Lou, H. and Farnoud, F.: Error-correcting codes for short tandem duplications and at mostpsubstitutions. In IEEE International Symposium on Information Theory. (ISIT), 1835-1840 (2021)

  34. [34]

    IEEE Trans

    Tenengolts, G.,: Nonbinary codes correcting single deletion or insertion. IEEE Trans. Inform. Theory 30, 766–769 (1984)

  35. [35]

    and Varshamov, R.: A code that corrects single unsymmetric errors

    Tenengolts, G. and Varshamov, R.: A code that corrects single unsymmetric errors. Avtomatika Telemekhanika 26, 288–292 (1965)

  36. [36]

    and Farnoud, F.: Non-binary codes for correcting a burst of at mosttdeletions

    Wang, S., Tang, Y., Sima, J., Gabrys, R. and Farnoud, F.: Non-binary codes for correcting a burst of at mosttdeletions. IEEE Trans. Inform. Theory 70, 964-979 (2023)

  37. [37]

    and Schwartz, M.: On duplication-free codes for disjoint or equal-length errors

    Yu, W. and Schwartz, M.: On duplication-free codes for disjoint or equal-length errors. Des. Codes Cryptogr. 92, 2845–2861 (2024)

  38. [38]

    and Schwartz, M.: On the coding capacity of reverse-complement and palindromic duplication-correcting codes

    Yohananov, L. and Schwartz, M.: On the coding capacity of reverse-complement and palindromic duplication-correcting codes. Des. Codes Cryptogr. 93, 3283- 3302 (2025)

  39. [39]

    and Gulliver, T.A.: Construction of tandem duplication correcting codes

    Zeraatpisheh, M., Esmaeili, M. and Gulliver, T.A.: Construction of tandem duplication correcting codes. IET Commun. 13, 2217–2225 (2019)

  40. [40]

    and Gulliver, T.A.: Construction of duplication correcting codes

    Zeraatpisheh, M., Esmaeili, M. and Gulliver, T.A.: Construction of duplication correcting codes. IEEE Access 8, 96150–96161 (2020) 18