Atompack: A Storage and Distribution Layer for Read-Heavy Atomistic ML Training Datasets

Alexandre Duval; Ali Ramlaoui; Daniel T. Speckhard; Fragkiskos D. Malliaros; Sagar Pal; Victor Schmidt

arxiv: 2606.29975 · v1 · pith:Y7MFSICGnew · submitted 2026-06-29 · 💻 cs.LG · cond-mat.mtrl-sci

Pith reviewed 2026-06-30 07:30 UTC · model grok-4.3

keywords atomistic machine learning · datasets · storage format · shuffled reads · training workloads · memory-mapped access · immutable index · molecular records

The pith

Atompack is a storage format that delivers 96x faster shuffled reads for atomistic ML training datasets while producing 79% smaller artifacts than ASE LMDB.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces Atompack as a storage and distribution layer for atomistic machine learning datasets that consist of large immutable collections of molecules. These datasets are typically read repeatedly in random order during training epochs, a pattern that differs from mutable curation or ad-hoc queries. Atompack appends records during construction, commits an immutable index, and serves complete molecular records through a memory-mapped path. Benchmarks on a 64-atom workload show major gains in shuffled read speed and reduced file size compared to HDF5, LMDB, and ASE approaches. A sympathetic reader would care because faster data loading and smaller artifacts could speed up training and simplify sharing of large scientific datasets.

Core claim

Atompack is an append-oriented storage format and distribution layer designed around the workload where training pipelines consume complete molecular records while the learning algorithm randomizes their order. It appends records efficiently during dataset construction, commits an immutable index, and serves records through a memory-mapped read path optimized for training. On a representative 64-atom workload, Atompack is 96x faster than ASE LMDB on shuffled training-style reads while producing artifacts about 79% smaller. The results indicate that serving complete molecule records, rather than field chunks or reconstructed objects, improves shuffled training throughput while keeping artifacts compact enough for public distribution.

What carries the argument

An append-oriented storage format that commits an immutable index after construction and serves complete molecule records via a memory-mapped read path.
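To make the three stages concrete (append, commit an index, serve through a memory map), here is a minimal sketch. It is not the Atompack on-disk format, which the abstract does not describe. The record layout, the ".idx" sidecar file, and the names write_dataset and Reader are all assumptions made for illustration.

```python
import mmap
import os
import struct
import tempfile

import numpy as np

# Hypothetical layout for illustration only; NOT the actual Atompack format.
# Each record: uint32 atom count, positions (n_atoms x 3 float64), float64 energy.


def write_dataset(path, records):
    """Append complete records to one data file, then commit an offset index."""
    offsets = []
    with open(path, "wb") as f:
        for positions, energy in records:
            offsets.append(f.tell())  # byte offset where this record starts
            f.write(struct.pack("<I", positions.shape[0]))
            f.write(positions.astype("<f8").tobytes())
            f.write(struct.pack("<d", energy))
    # The index is written once, after all appends, and never modified again.
    np.asarray(offsets, dtype="<u8").tofile(path + ".idx")


class Reader:
    """Serve complete records by index from a read-only memory map."""

    def __init__(self, path):
        self._file = open(path, "rb")
        self._map = mmap.mmap(self._file.fileno(), 0, access=mmap.ACCESS_READ)
        self._offsets = np.fromfile(path + ".idx", dtype="<u8")

    def __len__(self):
        return len(self._offsets)

    def __getitem__(self, i):
        start = int(self._offsets[i])
        (n_atoms,) = struct.unpack_from("<I", self._map, start)
        # Copy out of the map so the returned array outlives close().
        positions = np.frombuffer(
            self._map, dtype="<f8", count=n_atoms * 3, offset=start + 4
        ).reshape(n_atoms, 3).copy()
        (energy,) = struct.unpack_from("<d", self._map, start + 4 + n_atoms * 24)
        return positions, energy

    def close(self):
        self._map.close()
        self._file.close()
```

In this sketch a random read costs one index lookup plus one contiguous slice of the map, which is the property the summary attributes to serving complete records rather than field chunks.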

If this is right

Training pipelines achieve higher throughput on shuffled reads without altering the learning algorithm.
Dataset artifacts become compact enough for easier public distribution and staging across shared filesystems.
Construction remains efficient through append-oriented writes while the committed index supports repeated immutable reads.
The format outperforms array stores, key-value stores, and object-oriented databases specifically on complete-record shuffled access.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

This approach could encourage dataset publishers to prioritize ML training access patterns over general scientific query tools.
Similar workload-specific storage designs might apply to other ML domains with repeated random reads of complete records.
Integration into data loaders could reduce the data movement bottleneck in distributed atomistic training.

Load-bearing premise

The dominant workload consists of consuming complete molecular records in randomized order during training, making it distinct from mutable curation or ad-hoc inspection workloads.
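The access pattern this premise describes can be sketched as follows. The function names and the seeding scheme are assumptions for illustration and are not taken from the paper.

```python
import numpy as np


def epoch_orders(num_records, num_epochs, seed=0):
    """Yield one fresh random permutation of record indices per epoch."""
    rng = np.random.default_rng(seed)
    for _ in range(num_epochs):
        yield rng.permutation(num_records)


def iterate_batches(order, batch_size):
    """Split one epoch's order into batches of whole-record indices."""
    for start in range(0, len(order), batch_size):
        yield order[start:start + batch_size]
```

Every index is visited exactly once per epoch and each visit requests a whole record, so the storage layer sees uniformly random reads with no field-level selection.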

What would settle it

A set of benchmarks on workloads requiring frequent record mutations or access to individual atomic fields that shows Atompack loses its speed and size advantages would falsify the central claim.

Original abstract

Atomistic machine learning datasets are increasingly used for training: large immutable snapshots are read repeatedly, shuffled across epochs, staged across clusters' storage systems, and republished as reusable scientific artifacts. This workload differs from interactive scientific curation, where mutable records and ad hoc inspection are often more important than random indexed throughput. We present Atompack, an append-oriented storage format and distribution layer designed around a simple workload: training pipelines usually consume complete molecular records, while the order of records is randomized by the learning algorithm. Atompack appends records efficiently during dataset construction, then commits an immutable index and serves records through a memory-mapped read path optimized for training. We compare Atompack with HDF5, LMDB, and ASE baselines representing array stores, key-value records, serialized records, and object-oriented databases. The benchmarks measure sequential reads, shuffled reads, shared-filesystem behavior, write throughput, and artifact size. On a representative 64-atom workload, Atompack is 96x faster than ASE LMDB on shuffled training-style reads while producing artifacts about 79% smaller. The results indicate that serving complete molecule records, rather than field chunks or reconstructed objects, improves shuffled training throughput while keeping artifacts compact enough for public distribution.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

Atompack gives a targeted storage format for immutable atomistic datasets that claims 96x faster shuffled complete-record reads than ASE LMDB and 79% smaller files, scoped explicitly to training consumption rather than curation.

The main takeaway is that this paper builds Atompack around the specific pattern of appending whole molecular records once and then serving them fast through memory maps for randomized training reads.

The design is new in its narrow focus on complete-record access instead of field-level chunks or reconstructed objects. They run head-to-head tests against HDF5, LMDB, and ASE variants on sequential reads, shuffled reads, shared-filesystem behavior, write speed, and final size. The 64-atom case shows clear gains under the stated conditions, and the authors are explicit that they are optimizing for immutable, append-only data consumed in training pipelines.

The work does a solid job separating the training workload from interactive curation needs and backing the claims with direct measurements on the relevant operations. The results line up with the design choices.

Soft spots are small and mostly about scope. The speedup rests on the assumption that full records in random order dominate, which they flag but which may not match every use case. Benchmark details such as exact hardware specs or dataset sizes are not fully visible from the abstract, though the stress test found no internal contradictions or hidden dependencies. No math or fitting issues arise since everything is empirical.

This paper is for people who build, store, and distribute large atomistic ML datasets and need practical throughput for training. A reader working on similar infrastructure would get concrete comparisons to try.

It deserves a serious referee because it solves a real, measurable bottleneck with a scoped but reproducible approach.

Referee Report

1 major / 0 minor

Summary. The paper presents Atompack, an append-oriented storage format and distribution layer for read-heavy atomistic ML training datasets. It scopes the design to immutable, append-only datasets where training pipelines consume complete molecular records in randomized order, contrasting this with mutable curation workloads. Atompack is benchmarked against HDF5, LMDB, and ASE baselines on sequential reads, shuffled reads, shared-filesystem behavior, write throughput, and artifact size, with the central claim that on a 64-atom workload it achieves 96x faster shuffled training-style reads than ASE LMDB while producing artifacts 79% smaller.

Significance. If the empirical results hold under the stated workload, Atompack addresses a practical systems need in atomistic ML by optimizing for complete-record shuffled reads rather than field chunks or object reconstruction, potentially enabling higher training throughput and more compact public dataset artifacts. The explicit workload scoping and multi-baseline comparison are strengths; the work ships concrete benchmark measurements rather than derivations.

major comments (1)

[Abstract and benchmark results] The performance claims in the abstract (96x speedup vs. ASE LMDB on shuffled reads; 79% smaller artifacts on the 64-atom case) are load-bearing for the central contribution yet are presented without dataset sizes, hardware specifications, error bars, number of trials, or detailed methodology. This prevents verification of whether the measurements support the headline result and must be addressed for the empirical claims to be assessable.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive feedback highlighting the need for greater transparency in the abstract's performance claims. We address this point below and will revise the manuscript accordingly.

Referee: [Abstract and benchmark results] The performance claims in the abstract (96x speedup vs. ASE LMDB on shuffled reads; 79% smaller artifacts on the 64-atom case) are load-bearing for the central contribution yet are presented without dataset sizes, hardware specifications, error bars, number of trials, or detailed methodology. This prevents verification of whether the measurements support the headline result and must be addressed for the empirical claims to be assessable.

Authors: We agree that the abstract, as currently written, does not include sufficient parameters to allow immediate verification of the headline numbers. In the revised version we will expand the abstract to specify: (i) the dataset size (number of molecular records and atoms per record), (ii) the hardware platform and storage configuration used for the timing measurements, (iii) the number of independent trials, and (iv) a concise statement that error bars and full methodology appear in Section 4. The body of the paper already contains these details together with the raw timing data; the revision will simply surface the key parameters in the abstract itself so that the central claims become assessable on first reading.

Revision: yes
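The parameters named in this exchange (dataset size, platform, number of trials, error bars) could be captured by a small timing harness like the one below. This is a sketch under stated assumptions, not the paper's benchmark code: time_shuffled_reads and the read_record callable are hypothetical names, and the platform string stands in for a fuller hardware and storage description.

```python
import platform
import statistics
import time

import numpy as np


def time_shuffled_reads(read_record, num_records, num_trials=5, seed=0):
    """Time full passes of shuffled single-record reads over several trials.

    read_record is any callable taking an integer index (hypothetical hook).
    """
    rng = np.random.default_rng(seed)
    throughputs = []
    for _ in range(num_trials):
        order = rng.permutation(num_records)  # new random order each trial
        start = time.perf_counter()
        for i in order:
            read_record(int(i))
        elapsed = max(time.perf_counter() - start, 1e-12)  # avoid divide by zero
        throughputs.append(num_records / elapsed)
    spread = statistics.stdev(throughputs) if num_trials > 1 else 0.0
    return {
        "platform": platform.platform(),
        "num_records": num_records,
        "num_trials": num_trials,
        "mean_records_per_s": statistics.mean(throughputs),
        "stdev_records_per_s": spread,
    }
```

Running the same harness against each backend with identical seeds keeps the read order comparable, and the reported standard deviation gives a first-order error bar. Cache state (cold or warm page cache) would still need to be controlled and reported separately.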

Circularity Check

0 steps flagged

No significant circularity

The paper presents an empirical storage format and benchmark results for read-heavy atomistic ML datasets. Its central claims (96x speedup on shuffled reads, 79% smaller artifacts) are direct measurements against external baselines (HDF5, LMDB, ASE) under a scoped workload of immutable complete-record reads. No equations, fitted parameters, predictions, or self-citation chains appear in the provided text; the derivation chain is absent because the work is engineering and measurement rather than mathematical derivation. This matches the default expectation of no circularity for benchmark-driven papers.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review supplies no information on free parameters, axioms, or invented entities.
