
arxiv: 2606.29975 · v1 · pith:Y7MFSICGnew · submitted 2026-06-29 · 💻 cs.LG · cond-mat.mtrl-sci

Atompack: A Storage and Distribution Layer for Read-Heavy Atomistic ML Training Datasets

Pith reviewed 2026-06-30 07:30 UTC · model grok-4.3

keywords atomistic machine learning · datasets · storage format · shuffled reads · training workloads · memory-mapped access · immutable index · molecular records

The pith

Atompack is a storage format for atomistic ML training datasets that, on a representative 64-atom workload, serves shuffled reads 96x faster than ASE LMDB while producing artifacts about 79% smaller.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces Atompack as a storage and distribution layer for atomistic machine learning datasets that consist of large immutable collections of molecules. These datasets are typically read repeatedly in random order during training epochs, a pattern that differs from mutable curation or ad-hoc queries. Atompack appends records during construction, commits an immutable index, and serves complete molecular records through a memory-mapped path. The benchmarks compare it with HDF5, LMDB, and ASE baselines; on a representative 64-atom workload, the paper reports shuffled reads 96x faster than ASE LMDB and artifacts about 79% smaller. A sympathetic reader would care because faster data loading and smaller artifacts could speed up training and simplify sharing of large scientific datasets.

Core claim

Atompack is an append-oriented storage format and distribution layer designed around the workload where training pipelines consume complete molecular records while the learning algorithm randomizes their order. It appends records efficiently during dataset construction, commits an immutable index, and serves records through a memory-mapped read path optimized for training. On a representative 64-atom workload, Atompack is 96x faster than ASE LMDB on shuffled training-style reads while producing artifacts about 79% smaller. The results indicate that serving complete molecule records, rather than field chunks or reconstructed objects, improves shuffled training throughput while keeping artifacts compact enough for public distribution.

What carries the argument

An append-oriented storage format that commits an immutable index after construction and serves complete molecule records via a memory-mapped read path.
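To make the append, commit, and memory-mapped read steps concrete, here is a minimal Python sketch of that general pattern. It is not Atompack's actual on-disk format or API: the magic bytes, the footer layout, and the names `write_pack` and `PackReader` are all hypothetical, and a record is treated as an opaque byte string.

```python
import mmap
import os
import struct
import tempfile

MAGIC = b"SKETCH01"  # hypothetical magic bytes, not Atompack's real header


def write_pack(path, records):
    """Append length-prefixed records, then commit an offset index and footer."""
    offsets = []
    with open(path, "wb") as f:
        f.write(MAGIC)
        for payload in records:
            offsets.append(f.tell())               # where this record starts
            f.write(struct.pack("<I", len(payload)))
            f.write(payload)
        index_start = f.tell()
        f.write(struct.pack(f"<{len(offsets)}Q", *offsets))
        # Footer: index position and record count. Written last, so a file
        # without a footer is an uncommitted (incomplete) artifact.
        f.write(struct.pack("<QQ", index_start, len(offsets)))


class PackReader:
    """Read-only view that returns one complete record per lookup."""

    def __init__(self, path):
        self._file = open(path, "rb")
        self._map = mmap.mmap(self._file.fileno(), 0, access=mmap.ACCESS_READ)
        if len(self._map) < 24 or self._map[:8] != MAGIC:
            raise ValueError("not a committed sketch pack file")
        index_start, count = struct.unpack_from("<QQ", self._map, len(self._map) - 16)
        self._offsets = struct.unpack_from(f"<{count}Q", self._map, index_start)

    def __len__(self):
        return len(self._offsets)

    def __getitem__(self, i):
        off = self._offsets[i]
        (size,) = struct.unpack_from("<I", self._map, off)
        return self._map[off + 4 : off + 4 + size]  # whole record as bytes

    def close(self):
        self._map.close()
        self._file.close()


if __name__ == "__main__":
    path = os.path.join(tempfile.mkdtemp(), "demo.pack")
    write_pack(path, [b"record-0", b"record-1", b"record-2"])
    reader = PackReader(path)
    print(len(reader), reader[2])
    reader.close()
```

The design point the sketch illustrates is that a random lookup costs one index read plus one contiguous slice of the mapping, with no per-field reassembly. Whether Atompack structures its index this way is not stated in the abstract.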

If this is right

  • Training pipelines achieve higher throughput on shuffled reads without altering the learning algorithm.
  • Dataset artifacts become compact enough for easier public distribution and staging across shared filesystems.
  • Construction remains efficient through append-oriented writes while the committed index supports repeated immutable reads.
  • The format outperforms array stores, key-value stores, and object-oriented databases specifically on complete-record shuffled access.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • This approach could encourage dataset publishers to prioritize ML training access patterns over general scientific query tools.
  • Similar workload-specific storage designs might apply to other ML domains with repeated random reads of complete records.
  • Integration into data loaders could reduce the data movement bottleneck in distributed atomistic training.

Load-bearing premise

The dominant workload consists of consuming complete molecular records in randomized order during training, making it distinct from mutable curation or ad-hoc inspection workloads.
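The access pattern this premise describes can be sketched in a few lines of Python. The function name `epoch_orders` and its parameters are illustrative assumptions, not anything taken from the paper: each epoch visits every record index exactly once, in a fresh random order.

```python
import random


def epoch_orders(num_records, num_epochs, seed=0):
    """Yield one random permutation of record indices per training epoch."""
    rng = random.Random(seed)           # seeded so the orders are reproducible
    for _ in range(num_epochs):
        order = list(range(num_records))
        rng.shuffle(order)              # every record once, in random order
        yield order


if __name__ == "__main__":
    for epoch, order in enumerate(epoch_orders(num_records=8, num_epochs=2)):
        print(epoch, order)
```

Under this pattern consecutive reads rarely touch neighboring records, which is why the cost of a single random complete-record lookup dominates, rather than sequential scan speed.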

What would settle it

A set of benchmarks on workloads requiring frequent record mutations or access to individual atomic fields that shows Atompack loses its speed and size advantages would falsify the central claim.

Original abstract

Atomistic machine learning datasets are increasingly used for training: large immutable snapshots are read repeatedly, shuffled across epochs, staged across clusters' storage systems, and republished as reusable scientific artifacts. This workload differs from interactive scientific curation, where mutable records and ad hoc inspection are often more important than random indexed throughput. We present Atompack, an append-oriented storage format and distribution layer designed around a simple workload: training pipelines usually consume complete molecular records, while the order of records is randomized by the learning algorithm. Atompack appends records efficiently during dataset construction, then commits an immutable index and serves records through a memory-mapped read path optimized for training. We compare Atompack with HDF5, LMDB, and ASE baselines representing array stores, key-value records, serialized records, and object-oriented databases. The benchmarks measure sequential reads, shuffled reads, shared-filesystem behavior, write throughput, and artifact size. On a representative 64-atom workload, Atompack is 96x faster than ASE LMDB on shuffled training-style reads while producing artifacts about 79% smaller. The results indicate that serving complete molecule records, rather than field chunks or reconstructed objects, improves shuffled training throughput while keeping artifacts compact enough for public distribution.
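As a reading aid, the two headline figures can be restated as ratios. This assumes "96x faster" is a throughput ratio and "79% smaller" is a reduction in artifact size relative to ASE LMDB; the symbols $S$ (artifact size) and $T$ (time to read a fixed number of records) are introduced here and do not come from the paper.

```latex
\begin{align*}
\frac{S_{\text{Atompack}}}{S_{\text{ASE LMDB}}} &\approx 1 - 0.79 = 0.21, \\
\frac{T_{\text{Atompack}}}{T_{\text{ASE LMDB}}} &\approx \frac{1}{96} \approx 0.0104 .
\end{align*}
```

For a purely hypothetical illustration, an ASE LMDB artifact of 100 GB would correspond to roughly 21 GB, and a shuffled pass that takes 96 s would take roughly 1 s. These round numbers are not measurements from the paper.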

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 0 minor

Summary. The paper presents Atompack, an append-oriented storage format and distribution layer for read-heavy atomistic ML training datasets. It scopes the design to immutable, append-only datasets where training pipelines consume complete molecular records in randomized order, contrasting this with mutable curation workloads. Atompack is benchmarked against HDF5, LMDB, and ASE baselines on sequential reads, shuffled reads, shared-filesystem behavior, write throughput, and artifact size, with the central claim that on a 64-atom workload it achieves 96x faster shuffled training-style reads than ASE LMDB while producing artifacts 79% smaller.

Significance. If the empirical results hold under the stated workload, Atompack addresses a practical systems need in atomistic ML by optimizing for complete-record shuffled reads rather than field chunks or object reconstruction, potentially enabling higher training throughput and more compact public dataset artifacts. The explicit workload scoping and multi-baseline comparison are strengths; the work ships concrete benchmark measurements rather than derivations.

major comments (1)
  1. [Abstract and benchmark results] The performance claims in the abstract (96x speedup vs. ASE LMDB on shuffled reads; 79% smaller artifacts on the 64-atom case) are load-bearing for the central contribution yet are presented without dataset sizes, hardware specifications, error bars, number of trials, or detailed methodology. This prevents verification of whether the measurements support the headline result and must be addressed for the empirical claims to be assessable.
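The kind of reporting this comment asks for (a trial count and a spread next to the headline number) could be produced by a harness along the following lines. This is a hedged sketch, not the paper's benchmark code: `time_shuffled_reads`, its parameters, and the in-memory stand-in reader are assumptions made for illustration.

```python
import random
import statistics
import time


def time_shuffled_reads(read_record, num_records, trials=5, seed=0):
    """Time full shuffled passes; report mean and stdev of records per second."""
    rng = random.Random(seed)
    rates = []
    for _ in range(trials):
        order = list(range(num_records))
        rng.shuffle(order)                      # fresh random order per trial
        start = time.perf_counter()
        for i in order:
            read_record(i)                      # one complete-record read
        elapsed = max(time.perf_counter() - start, 1e-12)
        rates.append(num_records / elapsed)
    return {
        "trials": trials,
        "mean_records_per_s": statistics.mean(rates),
        "stdev_records_per_s": statistics.stdev(rates) if trials > 1 else 0.0,
    }


if __name__ == "__main__":
    # Stand-in reader over an in-memory list; a real run would wrap each
    # storage backend under test and note hardware and dataset size alongside.
    data = [bytes(64) for _ in range(10_000)]
    print(time_shuffled_reads(lambda i: data[i], len(data), trials=3))
```

A real comparison would also need to control for page-cache state between trials, which this sketch does not attempt.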

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive feedback highlighting the need for greater transparency in the abstract's performance claims. We address this point below and will revise the manuscript accordingly.

Point-by-point responses
  1. Referee: [Abstract and benchmark results] The performance claims in the abstract (96x speedup vs. ASE LMDB on shuffled reads; 79% smaller artifacts on the 64-atom case) are load-bearing for the central contribution yet are presented without dataset sizes, hardware specifications, error bars, number of trials, or detailed methodology. This prevents verification of whether the measurements support the headline result and must be addressed for the empirical claims to be assessable.

    Authors: We agree that the abstract, as currently written, does not include sufficient parameters to allow immediate verification of the headline numbers. In the revised version we will expand the abstract to specify: (i) the dataset size (number of molecular records and atoms per record), (ii) the hardware platform and storage configuration used for the timing measurements, (iii) the number of independent trials, and (iv) a concise statement that error bars and full methodology appear in Section 4. The body of the paper already contains these details together with the raw timing data; the revision will simply surface the key parameters in the abstract itself so that the central claims become assessable on first reading.

    Revision: yes

Circularity Check

0 steps flagged

No significant circularity

Full rationale

The paper presents an empirical storage format and benchmark results for read-heavy atomistic ML datasets. Its central claims (96x speedup on shuffled reads, 79% smaller artifacts) are direct measurements against external baselines (HDF5, LMDB, ASE) under a scoped workload of immutable complete-record reads. No equations, fitted parameters, predictions, or self-citation chains appear in the provided text; the derivation chain is absent because the work is engineering and measurement rather than mathematical derivation. This matches the default expectation of no circularity for benchmark-driven papers.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review supplies no information on free parameters, axioms, or invented entities.

pith-pipeline@v0.9.1-grok · 5774 in / 993 out tokens · 47531 ms · 2026-06-30T07:30:24.575394+00:00 · methodology

