DNA storage approaching the information-theoretic ceiling
Pith reviewed 2026-05-09 23:15 UTC · model grok-4.3
The pith
A new coding scheme for DNA storage achieves densities up to 155.8 exabytes per gram by retaining the sequencer's probabilistic outputs
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The paper presents a coding scheme that retains the sequencer's per-position posterior distributions through an integrated decoder of profile hidden Markov model alignment, log-product fusion across reads, and ordered-statistics decoding. On the DT4DDS channel simulator, the codec recovers 155.8 and 25.9 exabytes per gram of dsDNA under high- and low-fidelity conditions, exceeding the highest prior-art density on each channel by 11 and 52 percent. Under a single-encode-then-degrade protocol mapped to depurination kinetics at 25°C in the dry state, the codec projects 282 years of decodable storage at 17.1 exabytes per gram. These results place DNA storage density within reach of the Shannon bound of the underlying channel.
What carries the argument
Integrated decoder using profile hidden Markov model alignment, log-product fusion across reads, and ordered-statistics decoding to retain and exploit per-position posterior probability distributions from the sequencer
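To make the log-product fusion step concrete, here is a minimal sketch, assuming the reads have already been aligned to common coordinates (the role the profile HMM plays in the paper) and that reads are treated as conditionally independent with a uniform prior over bases. The function name `fuse_log_product`, the `eps` floor, and the toy posteriors are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def fuse_log_product(read_posteriors, eps=1e-12):
    """Fuse aligned per-position posteriors from several reads.

    read_posteriors has shape (n_reads, n_positions, 4); each last-axis row
    is a probability distribution over A, C, G, T. Alignment is assumed done.
    """
    p = np.asarray(read_posteriors, dtype=float)
    # Summing logs across reads is the log of the per-base product.
    log_sum = np.log(np.clip(p, eps, 1.0)).sum(axis=0)
    # Subtract the per-position maximum for numerical stability.
    log_sum -= log_sum.max(axis=-1, keepdims=True)
    fused = np.exp(log_sum)
    # Renormalise so each position is again a distribution.
    return fused / fused.sum(axis=-1, keepdims=True)

# Toy input (hypothetical values): two reads, two positions.
reads = np.array([
    [[0.70, 0.10, 0.10, 0.10], [0.40, 0.40, 0.10, 0.10]],
    [[0.60, 0.20, 0.10, 0.10], [0.10, 0.70, 0.10, 0.10]],
])
fused = fuse_log_product(reads)
```

In the toy input the first read cannot separate A from C at the second position, while the second read favours C; the fused distribution inherits that preference without either read being forced into a hard base call first.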
If this is right
- DNA storage can achieve over 150 exabytes per gram with high reliability on simulated channels
- Projected decodable lifetime reaches 282 years at 17.1 exabytes per gram under modeled dry-state degradation at 25°C
- Retaining soft information yields 11 percent (high-fidelity) and 52 percent (low-fidelity) higher density than the best prior-art codec on each simulated channel
- DNA storage densities can be brought within reach of the Shannon bound of the synthesis-sequencing channel
Where Pith is reading between the lines
- If the simulation accurately models physical processes, the codec could enable higher-capacity real DNA archival systems
- The soft-decision approach might transfer to other noisy storage media where probabilistic readouts are available
- Further gains could come from pairing the decoder with improved DNA synthesis or alternative sequencing technologies
Load-bearing premise
The DT4DDS channel simulator and the single-encode-then-degrade protocol mapped to depurination kinetics at 25°C accurately represent real-world synthesis, sequencing, and long-term degradation errors
What would settle it
Experimental implementation of the codec on physically synthesized DNA strands, followed by simulated or real degradation and sequencing, to measure whether the achieved information density matches the simulated 155.8 and 25.9 exabytes per gram
Original abstract
Synthetic DNA approaches 227.5 exabytes per gram of storage density with stability over millennial timescales. Realising this capacity requires error-correction codes that recover data from substantial synthesis and sequencing errors. Existing codecs convert noisy sequencer output into discrete base calls before error correction, discarding probabilistic information about which positions are reliable. Here we present a coding scheme that retains the sequencer's per-position posterior distributions through an integrated decoder of profile hidden Markov model alignment, log-product fusion across reads, and ordered-statistics decoding. On the DT4DDS channel simulator, the codec recovers 155.8 and 25.9 exabytes per gram of dsDNA under high- and low-fidelity conditions, exceeding the highest prior-art density on each channel by 11 and 52 percent. Under a single-encode-then-degrade protocol mapped to depurination kinetics at 25 °C in the dry state, the codec projects 282 years of decodable storage at 17.1 exabytes per gram. These results place DNA storage density within reach of the Shannon bound of the underlying channel.
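The headline numbers can be put side by side with a few lines of arithmetic. The sketch below back-computes the prior-art densities implied by the stated percentage gains and expresses each reported density as a fraction of the 227.5 exabytes per gram quoted in the abstract. Two caveats apply: the gains are rounded, so the implied prior-art values are approximate, and the 227.5 figure is the density the abstract says synthetic DNA approaches, which is not necessarily the Shannon bound of the DT4DDS channel. All variable names are illustrative.

```python
# Density the abstract says synthetic DNA approaches (EB per gram).
CEILING_EB_PER_G = 227.5

# Reported density (EB per gram) and stated gain over the best prior art.
reported = {
    "high-fidelity": (155.8, 0.11),
    "low-fidelity": (25.9, 0.52),
}

def implied_prior(density, gain):
    """Prior-art density implied by 'density exceeds prior art by gain'."""
    return density / (1.0 + gain)

summary = {}
for name, (density, gain) in reported.items():
    summary[name] = {
        "implied_prior_eb_per_g": implied_prior(density, gain),
        "fraction_of_ceiling": density / CEILING_EB_PER_G,
    }

for name, row in summary.items():
    print(name, round(row["implied_prior_eb_per_g"], 1),
          round(row["fraction_of_ceiling"], 3))
```

Under these inputs the high-fidelity result sits at roughly two thirds of the quoted figure and the low-fidelity result at a little over a tenth of it.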
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces an integrated decoder for DNA-based data storage consisting of profile hidden Markov model (PHMM) alignment, log-product fusion of multiple reads, and ordered-statistics decoding (OSD). This approach preserves the sequencer's per-position posterior probabilities rather than converting to hard base calls. Simulations on the DT4DDS channel model yield storage densities of 155.8 exabytes per gram under high-fidelity conditions and 25.9 exabytes per gram under low-fidelity conditions, which exceed the best prior-art results by 11% and 52%, respectively. Additionally, the work projects 282 years of reliable storage at 17.1 exabytes per gram using a single-encode-then-degrade protocol based on depurination kinetics at 25°C in the dry state. The results suggest that DNA storage can approach the information-theoretic ceiling of the underlying channel.
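For readers unfamiliar with ordered-statistics decoding, the following is a minimal order-0 sketch on a toy binary code. It is not the paper's decoder: the summary does not state the code, the alphabet, or the OSD order used, so the Hamming(7,4) generator matrix, the log-likelihood ratios, and the function name `osd_order0` are all stand-in assumptions. The idea shown is the generic one: rank positions by reliability, pick the most reliable linearly independent set, trust the hard decisions there, and re-encode.

```python
import numpy as np

def osd_order0(G, llr):
    """Order-0 OSD for a binary linear code.

    G is a (k, n) generator matrix over GF(2); llr holds n log-likelihood
    ratios, where a positive value means bit 0 is more likely.
    """
    G = np.array(G, dtype=np.uint8) % 2
    llr = np.asarray(llr, dtype=float)
    k, n = G.shape
    order = np.argsort(-np.abs(llr), kind="stable")  # most reliable first
    Gp = G[:, order].copy()
    hard = (llr[order] < 0).astype(np.uint8)

    # Gaussian elimination over GF(2); pivot columns form the most
    # reliable basis, and Gp ends up systematic on those columns.
    basis, row = [], 0
    for col in range(n):
        if row == k:
            break
        pivots = np.nonzero(Gp[row:, col])[0]
        if pivots.size == 0:
            continue  # column is dependent on earlier pivot columns
        p = row + int(pivots[0])
        if p != row:
            Gp[[row, p]] = Gp[[p, row]]
        for r in range(k):
            if r != row and Gp[r, col]:
                Gp[r] ^= Gp[row]
        basis.append(col)
        row += 1
    if row < k:
        raise ValueError("generator matrix is rank deficient")

    info = hard[basis].astype(int)             # trusted hard decisions
    cw_perm = (info @ Gp.astype(int)) % 2      # re-encode from the basis
    codeword = np.empty(n, dtype=np.uint8)
    codeword[order] = cw_perm                  # undo the reliability sort
    return codeword

# Toy Hamming(7,4) code (stand-in, not from the paper).
G = [[1, 0, 0, 0, 1, 1, 0],
     [0, 1, 0, 0, 1, 0, 1],
     [0, 0, 1, 0, 0, 1, 1],
     [0, 0, 0, 1, 1, 1, 1]]
# Sent codeword 1011010; position 1 arrives with the wrong sign but low confidence.
llr = [-2.1, -0.2, -1.8, -2.5, 1.9, -1.6, 2.2]
decoded = osd_order0(G, llr)
```

A hard-decision reading of these values would give 1111010, with an error at position 1. Because that position is the least reliable, it stays outside the chosen basis and the re-encoding step overwrites it. Higher OSD orders additionally test small flip patterns inside the basis, which this sketch omits.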
Significance. If the DT4DDS simulator accurately captures the error characteristics of real DNA synthesis, sequencing, and degradation processes, this codec represents a substantial advance toward practical high-density DNA storage. The retention of probabilistic information and the reported gains over prior art could influence future codec designs in the field. The projection of multi-century stability at high density is particularly noteworthy, though dependent on the model assumptions.
major comments (2)
- [Abstract] Abstract (results paragraph): The reported densities of 155.8 EB/g (high-fidelity) and 25.9 EB/g (low-fidelity), along with the 11% and 52% gains over prior art, are generated exclusively by feeding the proposed PHMM + log-product + OSD decoder the exact posterior distributions from the DT4DDS channel simulator. No comparison is provided between the simulator's per-base substitution, insertion, deletion, or quality-score distributions and published empirical traces from Illumina or Nanopore sequencing of synthetic DNA. This is load-bearing for the central claim of approaching the Shannon bound, because underestimation of correlated errors or overestimation of posterior reliability would directly reduce the claimed improvements.
- [Abstract] Abstract (degradation projection): The 282-year decodable storage claim at 17.1 EB/g rests on a single-encode-then-degrade protocol mapped to 25°C depurination kinetics in the dry state. The manuscript should quantify the exact kinetic parameters, rate constants, and any temperature or humidity sensitivities used in this mapping, as small changes in the degradation model could alter the projected lifetime by orders of magnitude.
minor comments (1)
- [Abstract] Abstract: The phrase 'dsDNA' is used for the density figures; clarify whether the underlying channel model and prior-art comparisons assume double-stranded or single-stranded DNA, and state any conversion factors applied.
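The strandedness question in the minor comment changes the headline number by a factor of about two, which a short unit conversion makes visible. The sketch below converts bits per repeating unit into exabytes per gram. The molar masses of 660 g/mol per base pair and 330 g/mol per nucleotide are common round-number approximations assumed here for illustration; the conversion factors actually used in the paper are not stated in the abstract.

```python
AVOGADRO = 6.02214076e23  # per mol, exact SI value

def density_eb_per_gram(bits_per_unit, grams_per_mol_unit):
    """Exabytes per gram (1 EB = 1e18 bytes) for a given information content
    and molar mass of the repeating unit (nucleotide or base pair)."""
    units_per_gram = AVOGADRO / grams_per_mol_unit
    return bits_per_unit * units_per_gram / 8 / 1e18

# Assumed round-number molar masses, not taken from the paper.
DS_G_PER_MOL_BP = 660.0   # double-stranded DNA, per base pair
SS_G_PER_MOL_NT = 330.0   # single-stranded DNA, per nucleotide

ds = density_eb_per_gram(2.0, DS_G_PER_MOL_BP)  # 2 bits per base pair
ss = density_eb_per_gram(2.0, SS_G_PER_MOL_NT)  # 2 bits per nucleotide
print(round(ds, 1), round(ss, 1))
```

With these assumed masses, 2 bits per base pair of dsDNA comes to about 228 EB per gram, close to but not identical to the 227.5 figure in the abstract, and quoting the same code per gram of ssDNA would double it.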
Simulated Author's Rebuttal
We thank the referee for the careful and constructive review. We address each major comment below and have prepared revisions to strengthen the manuscript.
Point-by-point responses
- Referee: [Abstract] Abstract (results paragraph): The reported densities of 155.8 EB/g (high-fidelity) and 25.9 EB/g (low-fidelity), along with the 11% and 52% gains over prior art, are generated exclusively by feeding the proposed PHMM + log-product + OSD decoder the exact posterior distributions from the DT4DDS channel simulator. No comparison is provided between the simulator's per-base substitution, insertion, deletion, or quality-score distributions and published empirical traces from Illumina or Nanopore sequencing of synthetic DNA. This is load-bearing for the central claim of approaching the Shannon bound, because underestimation of correlated errors or overestimation of posterior reliability would directly reduce the claimed improvements.
  Authors: The DT4DDS simulator is parameterized directly from published empirical error profiles for DNA synthesis and sequencing (including substitution, insertion, deletion, and quality-score statistics from Illumina and Nanopore studies of synthetic DNA). The reported densities and gains are therefore with respect to this calibrated channel model, and the Shannon-bound comparison is likewise internal to the model. To address the concern explicitly, we will add a short validation paragraph (with citations) in the Methods section that tabulates the DT4DDS per-base error rates and quality-score distributions against the corresponding empirical ranges reported in the literature. This addition will confirm that the posteriors are representative rather than optimistic and will clarify that the 11% and 52% improvements demonstrate the benefit of retaining probabilistic information even under realistic modeled conditions.
  Revision: yes
- Referee: [Abstract] Abstract (degradation projection): The 282-year decodable storage claim at 17.1 EB/g rests on a single-encode-then-degrade protocol mapped to 25°C depurination kinetics in the dry state. The manuscript should quantify the exact kinetic parameters, rate constants, and any temperature or humidity sensitivities used in this mapping, as small changes in the degradation model could alter the projected lifetime by orders of magnitude.
  Authors: We agree that greater transparency on the degradation model is needed. In the revised manuscript we will expand the relevant paragraph to state the exact depurination rate constant at 25 °C in the dry state (taken from the cited kinetic literature), the Arrhenius activation energy used for any temperature mapping, and a brief sensitivity discussion noting the effects of humidity and temperature deviations. The 282-year figure will be presented as an illustrative projection under the stated idealized conditions rather than a precise forecast, thereby allowing readers to evaluate robustness.
  Revision: yes
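The sensitivity the referee asks about can be illustrated with a first-order decay model and an Arrhenius temperature dependence. This is a generic sketch, not the paper's degradation mapping: the activation energy, the rate constant at 25 °C, and the intact-fraction threshold below are hypothetical placeholders chosen only to show how the projected lifetime moves with temperature, and the function names are illustrative.

```python
import math

R = 8.314462618  # J/(mol K), molar gas constant

def arrhenius_ratio(ea_j_per_mol, t_ref_k, t_k):
    """Rate constant at t_k divided by the rate constant at t_ref_k."""
    return math.exp(-ea_j_per_mol / R * (1.0 / t_k - 1.0 / t_ref_k))

def lifetime_years(k_per_year, intact_fraction_needed):
    """Time until the intact fraction exp(-k t) drops to the threshold."""
    return -math.log(intact_fraction_needed) / k_per_year

# Hypothetical inputs, NOT taken from the paper.
EA = 130e3        # J/mol, assumed activation energy
K_25C = 1e-3      # per year, assumed first-order loss rate at 25 C
THRESHOLD = 0.5   # assumed intact fraction the decoder can tolerate
T_REF = 298.15    # K, i.e. 25 C

lifetimes = {}
for temp_c in (15, 25, 35):
    k = K_25C * arrhenius_ratio(EA, T_REF, temp_c + 273.15)
    lifetimes[temp_c] = lifetime_years(k, THRESHOLD)
    print(temp_c, round(lifetimes[temp_c], 1))
```

With the assumed activation energy, a 10 K rise from 25 °C multiplies the rate constant by a factor of about 5.5 and shortens the projected lifetime by the same factor, so a lifetime quoted at a single temperature is informative only alongside the kinetic parameters behind it.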
Circularity Check
No circularity: headline densities are direct simulation outputs on an external channel model
Full rationale
The paper's central results (155.8 EB/g and 25.9 EB/g) are generated by running the proposed PHMM-alignment + log-product fusion + OSD decoder on the DT4DDS simulator. No equation or self-citation reduces these performance numbers to fitted constants, prior self-work, or the target densities themselves. The depurination-kinetics projection and prior-art comparisons are likewise external mappings and benchmarks, not algebraic identities. The derivation chain remains self-contained empirical evaluation.
Axiom & Free-Parameter Ledger
free parameters (1)
- decoder tuning parameters
axioms (2)
- domain assumption: DT4DDS simulator faithfully reproduces synthesis and sequencing error statistics of real DNA channels.
- domain assumption: Depurination kinetics at 25 °C in the dry state provide an accurate model for multi-century DNA degradation.
Reference graph
Works this paper leans on
- [1] Zhirnov, V., Zadegan, R. M., Sandhu, G. S., Church, G. M. & Hughes, W. L. Nucleic acid memory. Nature Materials 15, 366–370 (2016)
- [2] Grass, R. N., Heckel, R., Puddu, M., Paunescu, D. & Stark, W. J. Robust chemical preservation of digital information on DNA in silica with error-correcting codes. Angewandte Chemie International Edition 54, 2552–2555 (2015)
- [3] Gimpel, A. L., Remschak, A., Stark, W. J., Heckel, R. & Grass, R. N. Comparison of state-of-the-art error-correction coding for sequence-based DNA data storage. Nature Communications (2026)
- [4] Organick, L. et al. Random access in large-scale DNA data storage. Nature Biotechnology 36, 242–248 (2018)
- [5] Khabbaz, R., Mateos, J., Antonini, M. & Kas Hanna, S. DNA-MGC+: A versatile codec for reliable and resource-efficient data storage on synthetic DNA. bioRxiv (2026)
- [6] Erlich, Y. & Zielinski, D. DNA fountain enables a robust and efficient storage architecture. Science 355, 950–954 (2017)
- [7] Press, W. H., Hawkins, J. A., Jones, S. K., Jr., Schaub, J. M. & Finkelstein, I. J. HEDGES error-correcting code for DNA storage corrects indels and allows sequence constraints. Proceedings of the National Academy of Sciences 117, 18489–18496 (2020)
- [8] Welzel, M. et al. DNA-Aeon provides flexible arithmetic coding for constraint adherence and error correction in DNA storage. Nature Communications 14, 628 (2023)
- [9] Chandak, S. et al. Overcoming high nanopore basecaller error rates for DNA storage via basecaller-decoder integration and convolutional codes. In ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 8822–8826 (2020)
- [10] Jeong, J. et al. Iterative soft decoding algorithm for DNA storage using quality score and redecoding. IEEE Transactions on NanoBioscience 23, 81–90 (2024)
- [11] Gimpel, A. L., Stark, W. J., Heckel, R. & Grass, R. N. Challenges for error-correction coding in DNA data storage: Photolithographic synthesis and DNA decay. Digital Discovery 3, 2497–2508 (2024)
- [12] Shomorony, I. & Heckel, R. Information-theoretic foundations of DNA data storage. Foundations and Trends in Communications and Information Theory 19, 1–106 (2022)
- [13] Lenz, A., Siegel, P. H., Wachter-Zeh, A. & Yaakobi, E. The noisy drawing channel: Reliable data storage in DNA sequences. IEEE Transactions on Information Theory 69, 2757–2778 (2023)
- [14] Polyanskiy, Y., Poor, H. V. & Verdú, S. Channel coding rate in the finite blocklength regime. IEEE Transactions on Information Theory 56, 2307–2359 (2010)
- [15] Lindahl, T. & Nyberg, B. Rate of depurination of native deoxyribonucleic acid. Biochemistry 11, 3610–3618 (1972)
- [16] Allentoft, M. E. et al. The half-life of DNA in bone: measuring decay kinetics in 158 dated fossils. Proceedings of the Royal Society B: Biological Sciences 279, 4724–4733 (2012)
- [17] Van der Auwera, G. A. & O’Connor, B. D. Genomics in the Cloud: Using Docker, GATK, and WDL in Terra (O’Reilly Media, 2020), 1st edn
- [18] Poplin, R. et al. A universal SNP and small-indel variant caller using deep neural networks. Nature Biotechnology 36, 983–987 (2018)
- [19] Hu, X.-Y., Eleftheriou, E. & Arnold, D. M. Regular and irregular progressive edge-growth Tanner graphs. IEEE Transactions on Information Theory 51, 386–398 (2005)
- [20] Fossorier, M. P. C. & Lin, S. Soft-decision decoding of linear block codes based on ordered statistics. IEEE Transactions on Information Theory 41, 1379–1396 (1995)
- [21] Reed, I. S. & Solomon, G. Polynomial codes over certain finite fields. Journal of the Society for Industrial and Applied Mathematics 8, 300–304 (1960)
- [22] Durbin, R., Eddy, S. R., Krogh, A. & Mitchison, G. Biological Sequence Analysis: Probabilistic Models of Proteins and Nucleic Acids (Cambridge University Press, Cambridge, 1998)
- [23] Goldman, N. et al. Towards practical, high-capacity, low-maintenance information storage in synthesized DNA. Nature 494, 77–80 (2013)
- [24] Banal, J. L. et al. Random access DNA memory using Boolean search in an archival file storage system. Nature Materials 20, 1272–1280 (2021)
- [25] Prince, E., Cheng, H. F., Banal, J. L. & Johnson, J. A. Reversible Nucleic Acid Storage in Deconstructable Glassy Polymer Networks. Journal of the American Chemical Society 146, 17066–17074 (2024)