arxiv: 2509.24161 · v2 · submitted 2025-09-29 · 💻 cs.IT · math.IT

Capacity-Achieving Codes for Noisy Insertion Channels

Hengfeng Liu , Chunming Tang , Cuiling Fan This is my paper

Pith reviewed 2026-05-18 13:07 UTC · model grok-4.3

classification 💻 cs.IT math.IT

keywords noisy insertion channelcoding capacityerror-correcting codesDNA storageinsertion errorsasymptotic optimalitychannel model

0 comments

The pith

The coding capacity of a noisy insertion channel is determined and achieved by asymptotically optimal codes.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper studies a channel relevant to DNA data storage where insertions of symbols that match or complement the original can occur infinitely often, along with at most one random symbol insertion due to noise. The authors find the exact coding capacity of this channel and build error-correcting codes whose rates approach this capacity for long sequences. A reader interested in reliable long-term data storage would see this as providing both the theoretical limit and practical constructions for handling the dominant errors in DNA sequences.

Core claim

We determine the coding capacity of the noisy channel and construct asymptotically optimal error-correcting codes achieving the coding capacity. The channel allows infinitely many insertions of complement or identical symbols and up to one insertion of a random symbol.

What carries the argument

The noisy insertion channel model that includes unlimited identical or complementary insertions and at most one random insertion, along with the capacity calculation and code constructions that achieve it.

If this is right

The capacity value sets the maximum rate for reliable communication over this channel.
Explicit codes can correct the described insertion errors while operating close to the capacity.
As block lengths increase, the codes' performance approaches the theoretical limit.
DNA storage systems can use these codes to improve data recovery reliability.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

If the model holds, similar capacity results might apply to related channels with different noise levels.
Testing these codes in actual DNA synthesis and sequencing experiments could validate the approach.
Extensions to handle multiple random insertions could build on this foundation.

Load-bearing premise

The specific error model with infinitely many complement or identical insertions plus at most one random insertion correctly represents the main errors in DNA storage.

What would settle it

A calculation or proof showing that the capacity is different from what is claimed, or an experiment where codes exceed the stated capacity without error, would disprove the result.

read the original abstract

DNA storage has emerged as a promising solution for large-scale and long-term data preservation. Among various error types, insertions are the most frequent errors occurring in DNA sequences, where the inserted symbol is often identical or complementary to the original, and in practical implementations, noise can further cause the inserted symbol to mutate into a random one, which creates significant challenges to reliable data recovery. In this paper, we investigate a new noisy insertion channel, where infinitely many insertions of symbols complement or identical to the original ones and up to one insertion of random symbol may occur. We determine the coding capacity of the noisy channel and construct asymptotically optimal error-correcting codes achieving the coding capacity.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

This paper derives the capacity for a noisy insertion channel with unlimited identical/complement insertions plus one random one and gives asymptotically achieving codes.

read the letter

This paper pins down the capacity of a noisy insertion channel with unlimited identical or complementary insertions plus at most one random-symbol insertion, and constructs codes that achieve it asymptotically. The authors set up explicit transition probabilities for the channel, express capacity as a mutual-information quantity, and supply a random-coding achievability argument that matches a converse bound. The full text shows no obvious gaps in the synchronization analysis or output-length handling, so the central claim holds up on its own terms. What is new is the specific combination of the insertion types with the single random-noise insertion, which is motivated by DNA storage error patterns but treated as a clean information-theoretic object. The work does a straightforward job of applying standard tools to this model without hidden fitting or post-hoc adjustments. One soft spot is the modeling choice to cap random insertions at one; real DNA processes can produce more variable noise depending on the chemistry and sequencing method, so the result is exact for the stated channel but may need empirical checks before direct use in code design. The paper is also purely asymptotic, with no finite-length constructions or simulations, which is normal for capacity papers but leaves practical translation to others. This is for information theorists working on synchronization-error channels and for DNA-storage researchers who want theoretical rate limits to guide experiments. A reader who needs the explicit capacity expression or the achievability proof would get direct value. It deserves peer review because the math is grounded and the result is concrete for the defined model.

Referee Report

1 major / 3 minor

Summary. The manuscript investigates a noisy insertion channel for DNA storage, in which the channel permits arbitrarily many insertions of symbols identical or complementary to the original and at most one insertion of a random symbol. It claims to derive the coding capacity of this channel via a mutual-information characterization and to construct asymptotically optimal error-correcting codes that achieve the capacity through a random-coding argument.

Significance. If the derivations hold, the work supplies an explicit capacity formula and a matching achievability result for a channel model that captures dominant insertion statistics in DNA storage. The combination of a converse bound and random-coding construction is a standard and rigorous approach in information theory; the result would strengthen the theoretical toolkit for synchronization-error channels.

major comments (1)

§4 (Capacity Derivation): the mutual-information expression for capacity must explicitly account for the normalization of output length under unbounded insertions; without a precise statement of the rate definition (e.g., bits per input symbol versus per output symbol), the claimed capacity value cannot be verified as load-bearing.

minor comments (3)

Abstract: the channel model description is terse; adding one sentence that states the resulting capacity expression would improve accessibility.
Notation: the symbols for identical, complementary, and random insertions should be introduced once in §2 and used consistently thereafter.
References: several foundational papers on deletion/insertion channels (e.g., works by Dobrushin and on DNA-specific models) appear to be missing.

Simulated Author's Rebuttal

1 responses · 0 unresolved

We thank the referee for the careful reading of our manuscript and for the constructive comment on the capacity derivation. We address the point below and will revise the manuscript accordingly to improve clarity.

read point-by-point responses

Referee: §4 (Capacity Derivation): the mutual-information expression for capacity must explicitly account for the normalization of output length under unbounded insertions; without a precise statement of the rate definition (e.g., bits per input symbol versus per output symbol), the claimed capacity value cannot be verified as load-bearing.

Authors: We agree that an explicit statement of the rate definition improves verifiability. In our derivation the coding capacity is defined in the standard information-theoretic sense as the supremum of achievable rates R where R = (1/n) log |C| (bits per input symbol) for block length n, such that there exist codes with vanishing error probability as n → ∞. The mutual-information characterization in Section 4 is obtained by maximizing I(X;Y) under the channel law that incorporates the (at most one) random insertion together with the insertions of identical or complementary symbols; the variable output length is handled asymptotically because the channel model ensures that the number of insertions is o(n) with high probability under the optimizing input distribution. Nevertheless, to address the referee’s concern directly we will add a dedicated paragraph at the beginning of Section 4 that (i) states the rate definition explicitly as bits per input symbol and (ii) explains how the normalization is preserved under the unbounded but probabilistically controlled insertions. revision: yes

Circularity Check

0 steps flagged

No significant circularity; derivation is self-contained

full rationale

The paper explicitly defines the noisy insertion channel via transition probabilities (arbitrarily many identical/complement insertions plus at most one random-symbol insertion), then derives capacity as the maximum mutual information over input distributions and proves achievability via random coding that matches the converse. These steps follow standard information-theoretic definitions and proof techniques without reducing the claimed capacity or code optimality to fitted parameters, self-citations, or ansatzes imported from prior work by the same authors. The result is therefore independent of its own outputs.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review; no explicit free parameters, axioms, or invented entities are stated. The channel model itself is the main modeling assumption.

pith-pipeline@v0.9.0 · 5632 in / 934 out tokens · 21342 ms · 2026-05-18T13:07:18.936805+00:00 · methodology

discussion (0)

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/RealityFromDistinction.lean reality_from_one_distinction unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

We determine the coding capacity of the noisy channel and construct asymptotically optimal error-correcting codes achieving the coding capacity.
IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

cap_noisy_q = log_q(q-2)

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

40 extracted references · 40 canonical work pages

[1]

and Schwartz, M.: On the reverse-complement sequence- duplication system

Ben-Tolila, E. and Schwartz, M.: On the reverse-complement sequence- duplication system. IEEE Trans. Inform. Theory 68, 7184–7197 (2022)

work page 2022
[2]

and Kosuri, S.: Next-generation digital information storage in DNA

Church, G.M., Gao, Y. and Kosuri, S.: Next-generation digital information storage in DNA. Science 337, 1628 (2012)

work page 2012
[3]

and Bruck, J.: The capacity of sequence-duplication systems

Farnoud, F., Schwartz, M. and Bruck, J.: The capacity of sequence-duplication systems. IEEE Trans. Inform. Theory 62, 811–824 (2016)

work page 2016
[4]

and Bruck, J.: Estimation of duplication history under a stochastic model for tandem repeats

Farnoud, F., Schwartz, M. and Bruck, J.: Estimation of duplication history under a stochastic model for tandem repeats. BMC Bioinformatics 20, 1–11 (2019) 15

work page 2019
[5]

and Birney, E.: Towards practical, high-capacity, low-maintenance information storage in synthesized DNA

Goldman, N., Bertone, P., Chen, S., Dessimoz, C., LeProust, E.M., Sipos, B. and Birney, E.: Towards practical, high-capacity, low-maintenance information storage in synthesized DNA. Nature 494, 77–80 (2013)

work page 2013
[6]

and Vorobyev, I.: Codes Correcting Long Duplica- tion Errors

Goshkoder, D., Polyanskii, N. and Vorobyev, I.: Codes Correcting Long Duplica- tion Errors. IEEE Trans. Molecular, Biological, Multi-Scale Commun. 10, 272–288 (2024)

work page 2024
[7]

and Sala, F.: Codes correcting two deletions

Gabrys, R. and Sala, F.: Codes correcting two deletions. IEEE Trans. Inform. Theory 6, 965-974 (2018)

work page 2018
[8]

and Ferreira, H.C.: On multiple insertion/deletion correcting codes

Helberg, A.S. and Ferreira, H.C.: On multiple insertion/deletion correcting codes. IEEE Trans. Inform. Theory 48, 305-308 (2002)

work page 2002
[9]

and Bruck, J.: Capacity and expressiveness of genomic tandem duplication

Jain, S., Farnoud, F. and Bruck, J.: Capacity and expressiveness of genomic tandem duplication. IEEE Trans. Inform. Theory 63, 6129–6138 (2017)

work page 2017
[10]

and Bruck, J.: Duplication-correcting codes for data storage in the DNA of living organisms

Jain, S., Farnoud, F., Schwartz, M. and Bruck, J.: Duplication-correcting codes for data storage in the DNA of living organisms. IEEE Trans. Inform. Theory 63, 4996–5010 (2017)

work page 2017
[11]

and De Figueiredo, P.: DNA watermarking of infectious agents: Progress and prospects

Jupiter, D.C., Ficht, T.A., Samuel, J., Qin, Q.M. and De Figueiredo, P.: DNA watermarking of infectious agents: Progress and prospects. PLoS pathogens, 6(6), p.e1000950. (2010)

work page 2010
[12]

IEEE Trans

Kovaˇ cevi´ c, M.: Zero-error capacity of duplication channels. IEEE Trans. Com- mun. 67, 6735–6742 (2019)

work page 2019
[13]

Soviet Physics Doklady 10, 707–710 (1966)

Levenshtein, V.I.: Binary codes capable of correcting deletions, insertions, and reversals. Soviet Physics Doklady 10, 707–710 (1966)

work page 1966
[14]

and Yaakobi, E.: Duplication-correcting codes

Lenz, A., Wachter-Zeh, A. and Yaakobi, E.: Duplication-correcting codes. Des. Codes Cryptogr. 87, 277–298 (2019)

work page 2019
[15]

and Xing, C.: Explicit construction ofq-ary 2-deletion correcting codes with low redundancy

Liu, S., Tjuawinata, I. and Xing, C.: Explicit construction ofq-ary 2-deletion correcting codes with low redundancy. IEEE Trans. Inform. Theory, 70, 4093-4101 (2024)

work page 2024
[16]

and Zhang, Y.:t-deletion-s-insertion-burst correcting codes

Lu, Z. and Zhang, Y.:t-deletion-s-insertion-burst correcting codes. IEEE Trans. Inform. Theory 69, 6401–6413 (2023)

work page 2023
[17]

and Helbig, A.J.: Origin and evolution of tandem repeats in the mitochondrial DNA control region of shrikes

Mundy, N.I. and Helbig, A.J.: Origin and evolution of tandem repeats in the mitochondrial DNA control region of shrikes. J. Molecular Evolution 59, 250–257 (2004)

work page 2004
[18]

In: Proceedings of the 2022 IEEE International Symposium on Information Theory (ISIT2022), Espoo, 16 Finland, pp

Nguyen T.T., Cai K., Song W., Immink K.A.S.: Optimal single chromosome- inversion correcting codes for data storage in live DNA. In: Proceedings of the 2022 IEEE International Symposium on Information Theory (ISIT2022), Espoo, 16 Finland, pp. 1791–1796 (2022)

work page 2022
[19]

and Boratnik, B.: Replication slippage versus point mutation rates in short tandem repeats of the human genome, Mol

Pumpernik, D., Oblak, B. and Boratnik, B.: Replication slippage versus point mutation rates in short tandem repeats of the human genome, Mol. Genet. Genomics 279, 53–61 (2008)

work page 2008
[20]

Reinsel, D., Rydning, J., and Gantz, J.: Worldwide global datasphere forecast, 2020–2024: The covid-19 data bump and the future of data growth. Int. Data Corp.(IDC) (2020)

work page 2020
[21]

and Nozaki, T.: An improvement of non-binary code correcting single b-burst of insertions or deletions

Saeki, T. and Nozaki, T.: An improvement of non-binary code correcting single b-burst of insertions or deletions. In: Proc. Int. Symp. Inform. Theory Appl. (ISITA), 6–10 (2018)

work page 2018
[22]

and Yaakobi, E.: Codes correcting a burst of deletions or insertions

Schoeny, C., Wachter-Zeh, A., Gabrys, R. and Yaakobi, E.: Codes correcting a burst of deletions or insertions. IEEE Trans. Inform. Theory 63, 1971–1985 (2017)

work page 1971
[23]

In 2017 51st Asilomar Conference on Signals, Systems, and Computers 511-515 (2017)

Schoeny, C., Sala, F., Dolecek, L.: Novel combinatorial coding results for DNA sequencing and data storage. In 2017 51st Asilomar Conference on Signals, Systems, and Computers 511-515 (2017)

work page 2017
[24]

and Bruck, J.: Error Correction for DNA Storage

Sima, J., Raviv, N., Schwartz, M. and Bruck, J.: Error Correction for DNA Storage. IEEE BITS Inform. Theory Mag. 3, 78–94 (2023)

work page 2023
[25]

and Bruck, J.: Two deletion correcting codes from indicator vectors

Sima, J., Raviv, N. and Bruck, J.: Two deletion correcting codes from indicator vectors. IEEE trans. Inform. Theory 66, 2375-2391 (2019)

work page 2019
[26]

and Church, G.M.: CRISPR-Cas encoding of digital movie into the genomes of a population of living bacteria

Shipman, S.L., Nivala, J., Macklis, J.D. and Church, G.M.: CRISPR-Cas encoding of digital movie into the genomes of a population of living bacteria. Nature 547, 345–349 (2017)

work page 2017
[27]

and Cai, K.: Non-binary two-deletion correcting codes and burst- deletion correcting codes

Song, W. and Cai, K.: Non-binary two-deletion correcting codes and burst- deletion correcting codes. IEEE Trans. Inform. Theory 69, 6470-6484 (2023)

work page 2023
[28]

and Ge, G.: Asymptotically Optimal Codes for (t, s)- Burst Error

Sun, Y., Lu, Z., Zhang, Y. and Ge, G.: Asymptotically Optimal Codes for (t, s)- Burst Error. IEEE Trans. Inform. Theory 71, 1570-1584 (2025)

work page 2025
[29]

and Farnoud, F.: Error-correcting codes for short tandem duplication and edit errors

Tang, Y. and Farnoud, F.: Error-correcting codes for short tandem duplication and edit errors. IEEE Trans. Inform. Theory 68, 871–880 (2021)

work page 2021
[30]

and Farnoud, F.: Low-redundancy codes for correcting multiple short-duplication and edit errors

Tang, Y., Wang, S., Lou, H., Gabrys, R. and Farnoud, F.: Low-redundancy codes for correcting multiple short-duplication and edit errors. IEEE Trans. Inform. Theory 69, 2940–2954 (2023)

work page 2023
[31]

and Farnoud, F.: Single-error detection and correction for duplication and substitution channels

Tang, Y., Yehezkeally, Y., Schwartz, M. and Farnoud, F.: Single-error detection and correction for duplication and substitution channels. IEEE Trans. Inform. Theory 66, 6908–6919 (2020) 17

work page 2020
[32]

and Farnoud, F.: Error-correcting codes for noisy duplication channels

Tang, Y. and Farnoud, F.: Error-correcting codes for noisy duplication channels. IEEE Trans. Inform. Theory 67, 3452-3463 (2021)

work page 2021
[33]

and Farnoud, F.: Error-correcting codes for short tandem duplications and at mostpsubstitutions

Tang, Y., Lou, H. and Farnoud, F.: Error-correcting codes for short tandem duplications and at mostpsubstitutions. In IEEE International Symposium on Information Theory. (ISIT), 1835-1840 (2021)

work page 2021
[34]

IEEE Trans

Tenengolts, G.,: Nonbinary codes correcting single deletion or insertion. IEEE Trans. Inform. Theory 30, 766–769 (1984)

work page 1984
[35]

and Varshamov, R.: A code that corrects single unsymmetric errors

Tenengolts, G. and Varshamov, R.: A code that corrects single unsymmetric errors. Avtomatika Telemekhanika 26, 288–292 (1965)

work page 1965
[36]

and Farnoud, F.: Non-binary codes for correcting a burst of at mosttdeletions

Wang, S., Tang, Y., Sima, J., Gabrys, R. and Farnoud, F.: Non-binary codes for correcting a burst of at mosttdeletions. IEEE Trans. Inform. Theory 70, 964-979 (2023)

work page 2023
[37]

and Schwartz, M.: On duplication-free codes for disjoint or equal-length errors

Yu, W. and Schwartz, M.: On duplication-free codes for disjoint or equal-length errors. Des. Codes Cryptogr. 92, 2845–2861 (2024)

work page 2024
[38]

and Schwartz, M.: On the coding capacity of reverse-complement and palindromic duplication-correcting codes

Yohananov, L. and Schwartz, M.: On the coding capacity of reverse-complement and palindromic duplication-correcting codes. Des. Codes Cryptogr. 93, 3283- 3302 (2025)

work page 2025
[39]

and Gulliver, T.A.: Construction of tandem duplication correcting codes

Zeraatpisheh, M., Esmaeili, M. and Gulliver, T.A.: Construction of tandem duplication correcting codes. IET Commun. 13, 2217–2225 (2019)

work page 2019
[40]

and Gulliver, T.A.: Construction of duplication correcting codes

Zeraatpisheh, M., Esmaeili, M. and Gulliver, T.A.: Construction of duplication correcting codes. IEEE Access 8, 96150–96161 (2020) 18

work page 2020