arxiv: 2604.20810 · v1 · submitted 2026-04-22 · 💻 cs.IT · math.IT

DNA storage approaching the information-theoretic ceiling

Pith reviewed 2026-05-09 23:15 UTC · model grok-4.3

classification 💻 cs.IT math.IT
keywords DNA storage · error correction · channel coding · information theory · profile hidden Markov models · ordered statistics decoding

The pith

A new coding scheme for DNA storage achieves densities up to 155.8 exabytes per gram by retaining the sequencer's probabilistic outputs

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces an error-correction codec for synthetic DNA data storage that preserves the sequencer's probabilistic confidence scores rather than collapsing them into hard base calls early. The decoder integrates profile hidden Markov model alignment, log-product fusion of read probabilities, and ordered-statistics decoding to handle synthesis and sequencing errors more efficiently. Simulations on the DT4DDS channel show it recovers data at 155.8 exabytes per gram under high-fidelity conditions and 25.9 exabytes per gram under low-fidelity conditions, surpassing the best previous codecs by 11 and 52 percent, respectively. It further projects 282 years of decodable storage at 17.1 exabytes per gram, based on a depurination degradation model at 25 °C in the dry state. These results suggest that DNA storage can operate close to the information-theoretic limits of the underlying channel.

Core claim

The paper presents a coding scheme that retains the sequencer's per-position posterior distributions through an integrated decoder of profile hidden Markov model alignment, log-product fusion across reads, and ordered-statistics decoding. On the DT4DDS channel simulator, the codec recovers 155.8 and 25.9 exabytes per gram of dsDNA under high- and low-fidelity conditions, exceeding the highest prior-art density on each channel by 11 and 52 percent. Under a single-encode-then-degrade protocol mapped to depurination kinetics at 25 °C in the dry state, the codec projects 282 years of decodable storage at 17.1 exabytes per gram. These results place DNA storage density within reach of the Shannon bound of the underlying channel.

What carries the argument

Integrated decoder using profile hidden Markov model alignment, log-product fusion across reads, and ordered-statistics decoding to retain and exploit per-position posterior probability distributions from the sequencer
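The log-product fusion step can be illustrated with a minimal sketch. It assumes the reads have already been aligned position by position (the job of the profile HMM stage), that reads are conditionally independent given the true base, and that the prior over bases is uniform. The function name `fuse_log_product`, the `floor` parameter, and the toy posteriors are illustrative assumptions, not taken from the paper.

```python
import math

BASES = "ACGT"

def fuse_log_product(read_posteriors, floor=1e-6):
    """Fuse per-position posteriors from several aligned reads.

    read_posteriors: list of reads; each read is a list of 4-element
    probability vectors (order A, C, G, T), one per aligned position.
    Returns one normalized posterior vector per position.
    """
    n_pos = len(read_posteriors[0])
    fused = []
    for pos in range(n_pos):
        # Summing logs across reads is a product in the probability domain.
        log_scores = [0.0] * 4
        for read in read_posteriors:
            for b in range(4):
                # The floor avoids log(0) for bases a read rules out entirely.
                log_scores[b] += math.log(max(read[pos][b], floor))
        # Normalize via log-sum-exp for numerical stability.
        peak = max(log_scores)
        weights = [math.exp(s - peak) for s in log_scores]
        total = sum(weights)
        fused.append([w / total for w in weights])
    return fused

# Toy input: three reads, one position, made-up posteriors.
reads = [
    [[0.70, 0.10, 0.10, 0.10]],
    [[0.40, 0.40, 0.10, 0.10]],
    [[0.60, 0.20, 0.10, 0.10]],
]
fused = fuse_log_product(reads)
best_base = BASES[max(range(4), key=lambda b: fused[0][b])]
print(best_base, [round(p, 4) for p in fused[0]])
```

In this toy case the second read alone cannot separate A from C, but the fused posterior favors A strongly. A majority vote over hard base calls would discard that per-read confidence.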

If this is right

  • DNA storage can exceed 150 exabytes per gram on the simulated high-fidelity channel
  • The projected decodable lifetime reaches 282 years at 17.1 exabytes per gram under the modeled 25 °C dry-state degradation
  • Retaining soft information lets the codec exceed the best prior codec by 11 percent on the high-fidelity channel and 52 percent on the low-fidelity channel
  • DNA storage densities can be brought within reach of the Shannon bound of the synthesis-sequencing channel

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • If the simulation accurately models physical processes, the codec could enable higher-capacity real DNA archival systems
  • The soft-decision approach might transfer to other noisy storage media where probabilistic readouts are available
  • Further gains could come from pairing the decoder with improved DNA synthesis or alternative sequencing technologies

Load-bearing premise

The DT4DDS channel simulator and the single-encode-then-degrade protocol mapped to depurination kinetics at 25°C accurately represent real-world synthesis, sequencing, and long-term degradation errors

What would settle it

Experimental implementation of the codec on physically synthesized DNA strands, followed by simulated or real degradation and sequencing, to measure whether the achieved information density matches the simulated 155.8 and 25.9 exabytes per gram

Figures

Figures reproduced from arXiv: 2604.20810 by James L. Banal.

Figure 1: Storage density achieved by Mahoraga compared to prior codecs on the DT4DDS channel simulator [3]. (A) High-fidelity channel (Twist synthesis, Q5 PCR). (B) Low-fidelity channel (CustomArray synthesis, Taq PCR). Each gray marker denotes a prior codec at its peak density operating point at 30× sequencing depth, at the decoding success threshold published by the reference benchmark. The blue curve traces Mahora…
Figure 2: Storage density at matched parity and matched physical redundancy. Each group plots storage density for Mahoraga and the codecs that decode at that cell under matched outer-code parity. At 𝑟 = 0.02 on the high-fidelity channel, MGC+ does not reach 30 of 30 decoding and is omitted. Matched parity holds the outer-code parity at Mahoraga’s auto-sized value per cell and is applied to all codecs. Mahoraga excee…

Original abstract

Synthetic DNA approaches 227.5 exabytes per gram of storage density with stability over millennial timescales. Realising this capacity requires error-correction codes that recover data from substantial synthesis and sequencing errors. Existing codecs convert noisy sequencer output into discrete base calls before error correction, discarding probabilistic information about which positions are reliable. Here we present a coding scheme that retains the sequencer's per-position posterior distributions through an integrated decoder of profile hidden Markov model alignment, log-product fusion across reads, and ordered-statistics decoding. On the DT4DDS channel simulator, the codec recovers 155.8 and 25.9 exabytes per gram of dsDNA under high- and low-fidelity conditions, exceeding the highest prior-art density on each channel by 11 and 52 percent. Under a single-encode-then-degrade protocol mapped to depurination kinetics at 25 °C in the dry state, the codec projects 282 years of decodable storage at 17.1 exabytes per gram. These results place DNA storage density within reach of the Shannon bound of the underlying channel.
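The headline numbers in the abstract can be put in proportion with a short back-of-the-envelope calculation. The sketch below only divides figures quoted above. Two caveats apply: the 227.5 EB/g value is the ceiling the abstract quotes for synthetic DNA, which is not necessarily the Shannon bound of the DT4DDS channel, and the "implied prior best" values are back-calculated from rounded percentages, so they are approximations rather than numbers reported by the paper. The names `CEILING_EB_PER_G` and `summarize` are illustrative.

```python
CEILING_EB_PER_G = 227.5  # ceiling quoted in the abstract, EB per gram

# (reported density in EB/g, reported fractional gain over best prior codec)
results = {
    "high-fidelity": (155.8, 0.11),
    "low-fidelity": (25.9, 0.52),
}

def summarize(density, gain):
    """Return (fraction of quoted ceiling, implied best prior density)."""
    fraction_of_ceiling = density / CEILING_EB_PER_G
    implied_prior_best = density / (1.0 + gain)
    return fraction_of_ceiling, implied_prior_best

for name, (density, gain) in results.items():
    frac, prior = summarize(density, gain)
    print(f"{name}: {frac:.1%} of ceiling, implied prior best ~{prior:.1f} EB/g")
```

Under these assumptions the high-fidelity result sits at roughly 68.5 percent of the quoted ceiling and the low-fidelity result at roughly 11.4 percent, with implied prior-art densities near 140.4 and 17.0 EB/g.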

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper introduces an integrated decoder for DNA-based data storage consisting of profile hidden Markov model (PHMM) alignment, log-product fusion of multiple reads, and ordered-statistics decoding (OSD). This approach preserves the sequencer's per-position posterior probabilities rather than converting to hard base calls. Simulations on the DT4DDS channel model yield storage densities of 155.8 exabytes per gram under high-fidelity conditions and 25.9 exabytes per gram under low-fidelity conditions, which exceed the best prior-art results by 11% and 52%, respectively. Additionally, the work projects 282 years of reliable storage at 17.1 exabytes per gram using a single-encode-then-degrade protocol based on depurination kinetics at 25°C in the dry state. The results suggest that DNA storage can approach the information-theoretic ceiling of the underlying channel.
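For readers unfamiliar with ordered-statistics decoding, the following is a minimal order-1 OSD sketch in the style of Fossorier and Lin [20]. It is not the paper's decoder: it assumes a binary linear code (a Hamming (7,4) generator matrix chosen purely for illustration), a full-rank generator matrix, and log-likelihood ratios where a positive value favors bit 0. The function name `osd_order1` and the toy LLRs are hypothetical.

```python
def osd_order1(G, llr):
    """Order-1 ordered-statistics decoding for a binary linear code.

    G: k x n generator matrix as lists of 0/1 (assumed full rank).
    llr: per-bit log-likelihood ratios, positive meaning bit 0.
    """
    k, n = len(G), len(G[0])
    hard = [0 if value >= 0 else 1 for value in llr]
    by_reliability = sorted(range(n), key=lambda i: -abs(llr[i]))

    # Gaussian elimination over GF(2), visiting columns from most to least
    # reliable, to find the most reliable set of k independent positions.
    rows = [row[:] for row in G]
    basis = []
    for col in by_reliability:
        r = len(basis)
        if r == k:
            break
        pivot = next((i for i in range(r, k) if rows[i][col]), None)
        if pivot is None:
            continue  # column depends on more reliable ones already chosen
        rows[r], rows[pivot] = rows[pivot], rows[r]
        for i in range(k):
            if i != r and rows[i][col]:
                rows[i] = [a ^ b for a, b in zip(rows[i], rows[r])]
        basis.append(col)

    def encode(info_bits):
        codeword = [0] * n
        for j, bit in enumerate(info_bits):
            if bit:
                codeword = [a ^ b for a, b in zip(codeword, rows[j])]
        return codeword

    def cost(codeword):
        # Total reliability of the positions where the candidate
        # disagrees with the hard decisions; lower is better.
        return sum(abs(llr[i]) for i in range(n) if codeword[i] != hard[i])

    base = [hard[col] for col in basis]
    candidates = [base]  # order 0: trust all basis positions
    for j in range(k):   # order 1: flip one basis position at a time
        flipped = base[:]
        flipped[j] ^= 1
        candidates.append(flipped)
    return min((encode(c) for c in candidates), key=cost)

# Hamming (7,4) generator matrix in systematic form (illustrative choice).
G = [
    [1, 0, 0, 0, 1, 1, 0],
    [0, 1, 0, 0, 1, 0, 1],
    [0, 0, 1, 0, 0, 1, 1],
    [0, 0, 0, 1, 1, 1, 1],
]
# Codeword 1011010 sent; position 2 arrives as a weak, wrong "0".
llr = [-4.0, 3.5, 0.3, -5.0, 2.8, -3.9, 4.2]
decoded = osd_order1(G, llr)
print(decoded)
```

The decoder re-encodes from the four most reliable independent positions, so the single low-confidence error at position 2 never enters the information set. This is the sense in which soft information about which positions are reliable can be exploited.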

Significance. If the DT4DDS simulator accurately captures the error characteristics of real DNA synthesis, sequencing, and degradation processes, this codec represents a substantial advance toward practical high-density DNA storage. The retention of probabilistic information and the reported gains over prior art could influence future codec designs in the field. The projection of multi-century stability at high density is particularly noteworthy, though dependent on the model assumptions.

major comments (2)
  1. [Abstract] Abstract (results paragraph): The reported densities of 155.8 EB/g (high-fidelity) and 25.9 EB/g (low-fidelity), along with the 11% and 52% gains over prior art, are generated exclusively by feeding the proposed PHMM + log-product + OSD decoder the exact posterior distributions from the DT4DDS channel simulator. No comparison is provided between the simulator's per-base substitution, insertion, deletion, or quality-score distributions and published empirical traces from Illumina or Nanopore sequencing of synthetic DNA. This is load-bearing for the central claim of approaching the Shannon bound, because underestimation of correlated errors or overestimation of posterior reliability would directly reduce the claimed improvements.
  2. [Abstract] Abstract (degradation projection): The 282-year decodable storage claim at 17.1 EB/g rests on a single-encode-then-degrade protocol mapped to 25°C depurination kinetics in the dry state. The manuscript should quantify the exact kinetic parameters, rate constants, and any temperature or humidity sensitivities used in this mapping, as small changes in the degradation model could alter the projected lifetime by orders of magnitude.
minor comments (1)
  1. [Abstract] Abstract: The phrase 'dsDNA' is used for the density figures; clarify whether the underlying channel model and prior-art comparisons assume double-stranded or single-stranded DNA, and state any conversion factors applied.
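The check requested in the first major comment can be sketched as a simple calibration table: for each quality score, compare the error probability the score claims with the error rate actually observed. The sketch assumes Phred-style quality scores, where Q = -10 log10(p); whether DT4DDS emits scores in exactly this form is an assumption here. The input format, the function names, and the toy data are made up for illustration and are not measurements.

```python
from collections import defaultdict

def phred_to_error_prob(q):
    """Error probability implied by a Phred quality score."""
    return 10 ** (-q / 10)

def calibration_table(calls):
    """Compare claimed and observed error rates per quality score.

    calls: iterable of (phred_quality, is_error) pairs, one per base call.
    Returns {quality: (claimed_error_prob, observed_error_rate, n_calls)}.
    """
    counts = defaultdict(lambda: [0, 0])  # quality -> [n_calls, n_errors]
    for quality, is_error in calls:
        counts[quality][0] += 1
        counts[quality][1] += int(is_error)
    return {
        quality: (phred_to_error_prob(quality), errors / total, total)
        for quality, (total, errors) in sorted(counts.items())
    }

# Made-up base calls: Q20 is overconfident here, Q30 is well calibrated.
toy_calls = (
    [(20, False)] * 98 + [(20, True)] * 2
    + [(30, False)] * 999 + [(30, True)] * 1
)
table = calibration_table(toy_calls)
for quality, (claimed, observed, total) in table.items():
    print(f"Q{quality}: claimed {claimed:.4f}, observed {observed:.4f}, n={total}")
```

Running the same table on simulator output and on empirical traces would show whether the simulated posteriors are more reliable than real ones, which is the overestimation risk the referee raises.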

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the careful and constructive review. We address each major comment below and have prepared revisions to strengthen the manuscript.

  1. Referee: [Abstract] Abstract (results paragraph): The reported densities of 155.8 EB/g (high-fidelity) and 25.9 EB/g (low-fidelity), along with the 11% and 52% gains over prior art, are generated exclusively by feeding the proposed PHMM + log-product + OSD decoder the exact posterior distributions from the DT4DDS channel simulator. No comparison is provided between the simulator's per-base substitution, insertion, deletion, or quality-score distributions and published empirical traces from Illumina or Nanopore sequencing of synthetic DNA. This is load-bearing for the central claim of approaching the Shannon bound, because underestimation of correlated errors or overestimation of posterior reliability would directly reduce the claimed improvements.

    Authors: The DT4DDS simulator is parameterized directly from published empirical error profiles for DNA synthesis and sequencing (including substitution, insertion, deletion, and quality-score statistics from Illumina and Nanopore studies of synthetic DNA). The reported densities and gains are therefore with respect to this calibrated channel model, and the Shannon-bound comparison is likewise internal to the model. To address the concern explicitly, we will add a short validation paragraph (with citations) in the Methods section that tabulates the DT4DDS per-base error rates and quality-score distributions against the corresponding empirical ranges reported in the literature. This addition will confirm that the posteriors are representative rather than optimistic and will clarify that the 11% and 52% improvements demonstrate the benefit of retaining probabilistic information even under realistic modeled conditions. revision: yes

  2. Referee: [Abstract] Abstract (degradation projection): The 282-year decodable storage claim at 17.1 EB/g rests on a single-encode-then-degrade protocol mapped to 25°C depurination kinetics in the dry state. The manuscript should quantify the exact kinetic parameters, rate constants, and any temperature or humidity sensitivities used in this mapping, as small changes in the degradation model could alter the projected lifetime by orders of magnitude.

    Authors: We agree that greater transparency on the degradation model is needed. In the revised manuscript we will expand the relevant paragraph to state the exact depurination rate constant at 25 °C in the dry state (taken from the cited kinetic literature), the Arrhenius activation energy used for any temperature mapping, and a brief sensitivity discussion noting the effects of humidity and temperature deviations. The 282-year figure will be presented as an illustrative projection under the stated idealized conditions rather than a precise forecast, thereby allowing readers to evaluate robustness. revision: yes
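The sensitivity the referee asks about can be illustrated with a first-order decay sketch. It assumes each strand is lost once any one of its sites is damaged, that sites decay independently at a constant rate, and that the rate follows an Arrhenius temperature dependence. Every numeric value below (`K_REF`, `SITES`, `EA`) is a hypothetical placeholder, not a parameter from the paper or from the cited kinetic literature, and the helper names are illustrative.

```python
import math

R_GAS = 8.314  # gas constant, J/(mol K)

def surviving_fraction(k_per_site_per_year, sites, years):
    """Fraction of strands with no damaged site after the given time."""
    return math.exp(-k_per_site_per_year * sites * years)

def years_to_fraction(k_per_site_per_year, sites, target_fraction):
    """Time until the intact fraction falls to target_fraction."""
    return -math.log(target_fraction) / (k_per_site_per_year * sites)

def arrhenius_rate_ratio(ea_j_per_mol, t_ref_c, t_new_c):
    """Ratio k(t_new) / k(t_ref) for a given activation energy."""
    t_ref = t_ref_c + 273.15
    t_new = t_new_c + 273.15
    return math.exp(ea_j_per_mol / R_GAS * (1.0 / t_ref - 1.0 / t_new))

# Hypothetical placeholders, chosen only to exercise the formulas.
K_REF = 1e-5   # damage events per site per year at 25 °C (made up)
SITES = 126    # sites per strand (illustrative)
EA = 100e3     # activation energy in J/mol (made up)

baseline = years_to_fraction(K_REF, SITES, 0.5)
for temp_c in (15, 25, 35):
    ratio = arrhenius_rate_ratio(EA, 25, temp_c)
    lifetime = years_to_fraction(K_REF * ratio, SITES, 0.5)
    print(f"{temp_c} °C: rate x{ratio:.2f}, half of strands intact for {lifetime:.0f} years")
```

Because the time to any target fraction scales as 1/k, any multiplicative error in the rate constant carries straight through to the projected lifetime, and the exponential temperature dependence makes the storage temperature assumption equally consequential.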

Circularity Check

0 steps flagged

No circularity: headline densities are direct simulation outputs on an external channel model

The paper's central results (155.8 EB/g and 25.9 EB/g) are generated by running the proposed PHMM-alignment + log-product fusion + OSD decoder on the DT4DDS simulator. No equation or self-citation reduces these performance numbers to fitted constants, prior self-work, or the target densities themselves. The depurination-kinetics projection and prior-art comparisons are likewise external mappings and benchmarks, not algebraic identities. The derivation chain remains self-contained empirical evaluation.

Axiom & Free-Parameter Ledger

1 free parameter · 2 axioms · 0 invented entities

The central performance claims rest on the fidelity of the DT4DDS simulator and the kinetic mapping for long-term degradation; these are standard domain assumptions rather than new postulates.

free parameters (1)
  • decoder tuning parameters
    Weights or thresholds inside the ordered-statistics decoder and fusion step are likely chosen to optimize simulated performance.
axioms (2)
  • domain assumption DT4DDS simulator faithfully reproduces synthesis and sequencing error statistics of real DNA channels.
    All reported densities are generated exclusively on this simulator.
  • domain assumption Depurination kinetics at 25 °C in the dry state provide an accurate model for multi-century DNA degradation.
    Used to project the 282-year lifetime at 17.1 exabytes per gram.



Reference graph

Works this paper leans on

25 extracted references · 25 canonical work pages

  1. Zhirnov, V., Zadegan, R. M., Sandhu, G. S., Church, G. M. & Hughes, W. L. Nucleic acid memory. Nature Materials 15, 366–370 (2016).

  2. Grass, R. N., Heckel, R., Puddu, M., Paunescu, D. & Stark, W. J. Robust chemical preservation of digital information on DNA in silica with error-correcting codes. Angewandte Chemie International Edition 54, 2552–2555 (2015).

  3. Gimpel, A. L., Remschak, A., Stark, W. J., Heckel, R. & Grass, R. N. Comparison of state-of-the-art error-correction coding for sequence-based DNA data storage. Nature Communications (2026).

  4. Organick, L. et al. Random access in large-scale DNA data storage. Nature Biotechnology 36, 242–248 (2018).

  5. Khabbaz, R., Mateos, J., Antonini, M. & Kas Hanna, S. DNA-MGC+: A versatile codec for reliable and resource-efficient data storage on synthetic DNA. bioRxiv (2026).

  6. Erlich, Y. & Zielinski, D. DNA fountain enables a robust and efficient storage architecture. Science 355, 950–954 (2017).

  7. Press, W. H., Hawkins, J. A., Jones, S. K., Jr., Schaub, J. M. & Finkelstein, I. J. HEDGES error-correcting code for DNA storage corrects indels and allows sequence constraints. Proceedings of the National Academy of Sciences 117, 18489–18496 (2020).

  8. Welzel, M. et al. DNA-Aeon provides flexible arithmetic coding for constraint adherence and error correction in DNA storage. Nature Communications 14, 628 (2023).

  9. Chandak, S. et al. Overcoming high nanopore basecaller error rates for DNA storage via basecaller-decoder integration and convolutional codes. In ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 8822–8826 (2020).

  10. Jeong, J. et al. Iterative soft decoding algorithm for DNA storage using quality score and redecoding. IEEE Transactions on NanoBioscience 23, 81–90 (2024).

  11. Gimpel, A. L., Stark, W. J., Heckel, R. & Grass, R. N. Challenges for error-correction coding in DNA data storage: Photolithographic synthesis and DNA decay. Digital Discovery 3, 2497–2508 (2024).

  12. Shomorony, I. & Heckel, R. Information-theoretic foundations of DNA data storage. Foundations and Trends in Communications and Information Theory 19, 1–106 (2022).

  13. Lenz, A., Siegel, P. H., Wachter-Zeh, A. & Yaakobi, E. The noisy drawing channel: Reliable data storage in DNA sequences. IEEE Transactions on Information Theory 69, 2757–2778 (2023).

  14. Polyanskiy, Y., Poor, H. V. & Verdú, S. Channel coding rate in the finite blocklength regime. IEEE Transactions on Information Theory 56, 2307–2359 (2010).

  15. Lindahl, T. & Nyberg, B. Rate of depurination of native deoxyribonucleic acid. Biochemistry 11, 3610–3618 (1972).

  16. Allentoft, M. E. et al. The half-life of DNA in bone: measuring decay kinetics in 158 dated fossils. Proceedings of the Royal Society B: Biological Sciences 279, 4724–4733 (2012).

  17. Van der Auwera, G. A. & O’Connor, B. D. Genomics in the Cloud: Using Docker, GATK, and WDL in Terra (O’Reilly Media, 2020), 1st edn.

  18. Poplin, R. et al. A universal SNP and small-indel variant caller using deep neural networks. Nature Biotechnology 36, 983–987 (2018).

  19. Hu, X.-Y., Eleftheriou, E. & Arnold, D. M. Regular and irregular progressive edge-growth Tanner graphs. IEEE Transactions on Information Theory 51, 386–398 (2005).

  20. Fossorier, M. P. C. & Lin, S. Soft-decision decoding of linear block codes based on ordered statistics. IEEE Transactions on Information Theory 41, 1379–1396 (1995).

  21. Reed, I. S. & Solomon, G. Polynomial codes over certain finite fields. Journal of the Society for Industrial and Applied Mathematics 8, 300–304 (1960).

  22. Durbin, R., Eddy, S. R., Krogh, A. & Mitchison, G. Biological Sequence Analysis: Probabilistic Models of Proteins and Nucleic Acids (Cambridge University Press, Cambridge, 1998).

  23. Goldman, N. et al. Towards practical, high-capacity, low-maintenance information storage in synthesized DNA. Nature 494, 77–80 (2013).

  24. Banal, J. L. et al. Random access DNA memory using Boolean search in an archival file storage system. Nature Materials 20, 1272–1280 (2021).

  25. Prince, E., Cheng, H. F., Banal, J. L. & Johnson, J. A. Reversible Nucleic Acid Storage in Deconstructable Glassy Polymer Networks. Journal of the American Chemical Society 146, 17066–17074 (2024).

Supplementary Section 1: Payload length scaling

The 126 nt payload length used in the main text reflects the electrochemical synthesis and iSeq 150-nt paired-end read-length c...