
arxiv: 2604.25445 · v1 · submitted 2026-04-28 · 💻 cs.IT · math.IT

Central Limit Theorem for Mutation Systems

Pith reviewed 2026-05-07 14:53 UTC · model grok-4.3

classification 💻 cs.IT math.IT
keywords mutation systems · central limit theorem · DNA storage · k-tuple frequencies · martingale CLT · substitution matrix · asymptotic analysis · stochastic fluctuations

The pith

A central limit theorem characterizes fluctuations in k-tuple frequencies for mutation systems modeling DNA evolution.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proves a central limit theorem for the random deviations of empirical k-tuple frequencies from their limiting values in mutation systems. These systems describe how sequences over a finite alphabet change probabilistically through mutations such as substitutions, duplications, and deletions. This matters for understanding error accumulation in in-vivo DNA data storage, where sequences evolve continuously inside organisms. The approach uses the spectral properties of the k-substitution matrix to reduce the centered count vectors to a martingale difference sequence, applies the standard martingale central limit theorem, and gives an explicit form for the limiting covariance matrix.

Core claim

We study the asymptotic behavior of mutation systems and characterize the stochastic fluctuations around the limiting empirical k-tuple frequencies by establishing a Central Limit Theorem. Our approach leverages the spectral properties of the k-substitution matrix to project the centered count vectors onto a martingale difference sequence, verifying the classical martingale CLT conditions, and we explicitly derive the limiting covariance matrix.

What carries the argument

The k-substitution matrix and its spectral properties, which are used to project the centered count vectors onto a martingale difference sequence to which the martingale central limit theorem applies.

If this is right

  • The scaled fluctuations of the k-tuple count vectors converge in distribution to a multivariate normal with the derived covariance matrix.
  • This gives a precise description of variability in sequence composition for large systems.
  • The result extends the known convergence of empirical frequencies to include their rate of convergence and fluctuation statistics.
  • Applications to in-vivo DNA storage can use this to predict typical error patterns due to mutations.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Simulations of small mutation systems could test whether the empirical covariance matches the derived matrix for moderate sequence lengths.
  • The CLT could guide the choice of redundancy in error-correcting schemes for DNA storage by quantifying expected deviations.
  • Similar techniques might apply to other stochastic processes on sequences, such as those in evolutionary biology.

Load-bearing premise

The k-substitution matrix possesses spectral properties that allow the centered count vectors to be projected onto a martingale difference sequence for which the classical martingale central limit theorem conditions hold.

What would settle it

A simulation of a mutation system with a small alphabet and specific substitution probabilities where the normalized deviations of k-tuple frequencies fail to approach a Gaussian distribution with the predicted covariance.
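
A minimal sketch of what such a simulation could look like, under loudly stated assumptions: the model below is a hypothetical toy (binary alphabet, single-symbol tandem duplication with probability `p_dup`, uniform substitution otherwise, k = 1 only) and is not taken from the paper. The function names and parameters are illustrative. For this particular toy with `p_dup = 0.5`, the centered zero count is an exact martingale with per-step conditional variance 3/8, so the sample variance of the normalized deviations should sit near 0.375.

```python
import random
import statistics


def simulate_toy_mutation(n_steps, p_dup, rng):
    """Run a hypothetical binary duplication/substitution toy model.

    Returns (number of zeros, sequence length) after n_steps mutations.
    """
    seq = [0, 1]  # assumed initial sequence
    for _ in range(n_steps):
        i = rng.randrange(len(seq))        # uniformly random position
        if rng.random() < p_dup:
            seq.insert(i + 1, seq[i])      # duplicate one symbol in place
        else:
            seq[i] = rng.randrange(2)      # substitute with a uniform symbol
    return seq.count(0), len(seq)


def normalized_deviations(n_trials, n_steps, p_dup=0.5, seed=0):
    """Collect (zeros - length/2) / sqrt(n_steps) over independent runs."""
    rng = random.Random(seed)
    out = []
    for _ in range(n_trials):
        zeros, length = simulate_toy_mutation(n_steps, p_dup, rng)
        out.append((zeros - length / 2) / n_steps ** 0.5)
    return out


devs = normalized_deviations(n_trials=300, n_steps=1500)
sample_mean = statistics.fmean(devs)
sample_var = statistics.pvariance(devs)
print(sample_mean, sample_var)
```

A real test of the paper's claim would replace the toy dynamics with the paper's mutation system, track all k-tuple counts rather than a single symbol, and compare the full sample covariance matrix against the derived one.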

Original abstract

DNA-based storage has emerged as a promising alternative to traditional data storage methods, offering unmatched advantages in data density, longevity, and sustainability. Two main approaches have developed: in-vitro storage, where information is synthesized in controlled environments, and in-vivo storage, where data is embedded within an organism's DNA for enhanced confidentiality and protection. While in-vivo DNA storage provides unique advantages, it faces significant challenges from mutations, including duplications, deletions, and substitutions, which cause sequence evolution over time. Thus, in-vivo systems experience continuous sequence alterations that increase length and change composition, making error correction particularly challenging. We study the asymptotic behavior of mutation systems, which model the probabilistic evolution of sequences over a finite alphabet, and are central to the analysis of in-vivo DNA-based data storage. Building upon prior works that established the limit of empirical $k$-tuple frequencies, we characterize the stochastic fluctuations around these values by establishing a Central Limit Theorem (CLT). Our approach leverages the spectral properties of the $k$-substitution matrix to project the centered count vectors, allowing us to approximate the system via a martingale difference sequence, and then verifying the classical martingale CLT conditions. In addition, we explicitly derive the limiting covariance matrix.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 1 minor

Summary. The paper claims to establish a Central Limit Theorem characterizing the stochastic fluctuations of empirical k-tuple frequencies around their deterministic limits in mutation systems (probabilistic models of sequence evolution under substitutions, duplications, and deletions). The approach projects centered count vectors using spectral properties of the k-substitution matrix to obtain a martingale difference sequence, verifies the classical martingale CLT conditions, and derives an explicit limiting covariance matrix. This is motivated by applications to error correction in in-vivo DNA-based data storage.

Significance. If the central claim holds with the required error controls, the result supplies a precise Gaussian limit law and covariance for fluctuations in sequence composition, which would be useful for statistical analysis, capacity calculations, and coding design in DNA storage. The combination of spectral projection with martingale CLT is a natural extension of existing limit theorems for substitution systems, and the explicit covariance is a concrete contribution that could support downstream applications.

major comments (1)
  1. [Abstract and proof of the main CLT result] The abstract states that spectral properties 'allow us to approximate the system via a martingale difference sequence' before applying the martingale CLT. For the normalized centered count vector to converge to N(0, Σ) with the claimed covariance Σ, the difference between the actual process and the projected martingale sum must be o_p(√n). No explicit bound or argument controlling this remainder (e.g., via spectral gap decay rates or transient mode contributions) is indicated, which is load-bearing for both the CLT statement and the covariance derivation.
minor comments (1)
  1. [Abstract / Theorem statement] The abstract refers to 'the k-substitution matrix' without stating the precise assumptions (e.g., primitivity, spectral gap size, or eigenvalue separation) needed for the projection step; these should be listed explicitly in the statement of the main theorem.
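
The requirement in major comment 1 can be written out as a short Slutsky-type argument. The notation here is assumed for illustration and is not taken from the paper: $X_n$ is the k-tuple count vector, $\mu_n$ its centering, $M_n$ the projected martingale sum, and $R_n$ the remainder left over after projection.

```latex
\begin{align*}
X_n - \mu_n &= M_n + R_n, \\
\frac{M_n}{\sqrt{n}} &\xrightarrow{d} \mathcal{N}(0,\Sigma)
  && \text{(martingale CLT)}, \\
\frac{\lVert R_n \rVert}{\sqrt{n}} &\xrightarrow{p} 0
  && \text{(that is, } R_n = o_p(\sqrt{n})\text{)}, \\
\Longrightarrow\quad
\frac{X_n - \mu_n}{\sqrt{n}} &\xrightarrow{d} \mathcal{N}(0,\Sigma)
  && \text{(Slutsky's theorem)}.
\end{align*}
```

Without the third line, the first two do not determine the limit law of the count vector itself, which is why the referee treats the remainder bound as load-bearing.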

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the careful reading and for identifying the need for an explicit bound on the martingale approximation remainder. We address the concern below and will revise the manuscript to strengthen the presentation.

Point-by-point responses
  1. Referee: [Abstract and proof of the main CLT result] The abstract states that spectral properties 'allow us to approximate the system via a martingale difference sequence' before applying the martingale CLT. For the normalized centered count vector to converge to N(0, Σ) with the claimed covariance Σ, the difference between the actual process and the projected martingale sum must be o_p(√n). No explicit bound or argument controlling this remainder (e.g., via spectral gap decay rates or transient mode contributions) is indicated, which is load-bearing for both the CLT statement and the covariance derivation.

    Authors: We agree that an explicit control of the remainder is required for rigor. The k-substitution matrix is assumed to possess a spectral gap: its dominant eigenvalue is 1 (simple, with positive eigenvector), while the spectral radius ρ of the restriction to the complementary invariant subspace satisfies ρ < 1. Consequently, after projection, the contribution of the transient modes to the centered count vector is bounded in norm by Cρ^n (almost surely, for a random constant C depending on the initial sequence). This exponential decay is o_p(n^{-1/2}) for any fixed ρ < 1, and therefore also o_p(√n), which is the rate the referee asks for. We will insert a new lemma immediately after the projection step that (i) recalls the spectral decomposition, (ii) states the geometric bound on the complementary component, and (iii) verifies that the difference between the original centered process and the martingale sum is o_p(√n) in the Euclidean norm. This lemma will also confirm that the limiting covariance remains unaffected. The abstract will be updated to reference the spectral-gap control.

    Revision: yes
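
The geometric bound invoked in this response can be illustrated numerically. The sketch below uses a hypothetical 3×3 row-stochastic matrix as a stand-in; it is not the paper's k-substitution matrix, and it only shows the deterministic decay of transient modes, not the remainder of the random process. It power-iterates a starting vector, measures the distance to the stationary vector, estimates ρ from the ratio of successive distances, and checks that √n times the distance still shrinks.

```python
import math

# Hypothetical row-stochastic matrix; an assumed stand-in, NOT from the paper.
P = [[0.8, 0.1, 0.1],
     [0.2, 0.7, 0.1],
     [0.1, 0.3, 0.6]]


def step(mu, P):
    """One left-multiplication mu -> mu P of a row vector by P."""
    m = len(P)
    return [sum(mu[i] * P[i][j] for i in range(m)) for j in range(m)]


def iterate(mu, P, n):
    """Apply the matrix n times to the row vector mu."""
    for _ in range(n):
        mu = step(mu, P)
    return mu


def dist(u, v):
    """Euclidean distance between two vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


start = [1.0, 0.0, 0.0]
pi = iterate(start, P, 500)  # numerically converged stationary vector

steps = range(1, 41)
norms = [dist(iterate(start, P, n), pi) for n in steps]  # transient norms
rho_est = norms[-1] / norms[-2]                          # estimates rho
scaled = [math.sqrt(n) * d for n, d in zip(steps, norms)]
print(rho_est, scaled[0], scaled[-1])
```

For this stand-in matrix the non-dominant eigenvalues are 0.6 and 0.5, so the estimated ratio settles near 0.6 and the √n-scaled transient norm is already negligible after a few dozen steps.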

Circularity Check

0 steps flagged

No circularity; standard martingale CLT after spectral projection

Full rationale

The derivation builds on prior (unspecified) works only for the LLN limit of empirical k-tuple frequencies, then projects centered count vectors via spectral properties of the k-substitution matrix to obtain a martingale difference sequence and invokes the classical martingale CLT. The limiting covariance is stated to be derived explicitly. No quoted step reduces a claimed prediction or theorem to a fitted parameter, self-definition, or self-citation chain by construction. The approach is a direct application of existing probabilistic tools to the mutation model and remains self-contained.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

Abstract-only review; the central claim rests on unstated spectral assumptions of the substitution matrix and applicability of the classical martingale CLT.

axioms (1)
  • Domain assumption: the k-substitution matrix has spectral properties permitting projection of centered count vectors to a martingale difference sequence.
    Invoked to reduce the fluctuation process to one where standard martingale CLT applies

pith-pipeline@v0.9.0 · 5514 in / 1083 out tokens · 79273 ms · 2026-05-07T14:53:51.048024+00:00 · methodology

