
arxiv: 2606.10195 · v1 · pith:QOIKKOWWnew · submitted 2026-06-08 · ❄️ cond-mat.mtrl-sci · physics.comp-ph

Graphlet Histogram Representation Database of Inorganic Crystals

Pith reviewed 2026-06-27 15:23 UTC · model grok-4.3

classification ❄️ cond-mat.mtrl-sci physics.comp-ph
keywords graphlet histograms · inorganic crystals · crystal structure representation · materials property prediction · Voronoi tessellation · Materials Project · data-efficient representations · Earth Mover's Distance

The pith

Graphlet histogram representations precomputed from crystal structures provide a data-efficient alternative to end-to-end learned features for materials property prediction.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces Graphlet-MP, a database containing graphlet histogram representations for 149,082 inorganic crystals drawn from the Materials Project. These representations are built directly from crystallographic information files by applying screened Voronoi tessellation to extract distributions at three hierarchical levels: individual atomic sites, bonded pairs, and bond-angle triplets. The authors argue that such precomputed, domain-knowledge-driven features remain useful when only limited experimental data exist, unlike models that require large density-functional-theory training sets. The work supplies both an Earth Mover's Distance metric for comparing crystals in this space and open-source code that lets users generate the same histograms for any new structure, experimental or computed.

Core claim

The central claim is that seventy-nine graphlet histogram distributions, extracted via screened Voronoi tessellation from CIF files, capture the local structural geometry of inorganic crystals at atomic, pair, and triplet orders, thereby furnishing an interpretable, precomputed representation that can support property prediction without requiring end-to-end training on large DFT databases.

What carries the argument

Graphlet histogram representations over three hierarchical graphlet orders (atomic sites, bonded pairs, bond-angle triplets) extracted via screened Voronoi tessellation from crystallographic information files.
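The three graphlet orders can be illustrated with a toy sketch. This is not the authors' implementation: a plain distance cutoff stands in for the screened Voronoi tessellation, periodic images are ignored, and the function names, the bin count, and the site-value range are all hypothetical choices made for illustration. It only shows how site, pair, and triplet quantities each become a normalised histogram.

```python
import itertools
import numpy as np


def normalized_hist(values, bins, value_range):
    """Histogram normalised to unit mass (all zeros if there are no values)."""
    counts, _ = np.histogram(values, bins=bins, range=value_range)
    total = counts.sum()
    return counts / total if total > 0 else counts.astype(float)


def graphlet_histograms(positions, site_values, cutoff, bins=5):
    """Toy site / pair / triplet histograms for a finite cluster of atoms."""
    positions = np.asarray(positions, dtype=float)
    n = len(positions)

    # ASSUMPTION: a distance cutoff replaces the screened Voronoi
    # tessellation, and no periodic images are considered.
    pairs = [(i, j) for i, j in itertools.combinations(range(n), 2)
             if np.linalg.norm(positions[i] - positions[j]) < cutoff]

    # Order 2 (bonded pairs): one bond length per neighbour pair.
    bond_lengths = [np.linalg.norm(positions[i] - positions[j])
                    for i, j in pairs]

    # Neighbour lists, needed to enumerate triplets around each centre.
    neighbours = {k: [] for k in range(n)}
    for i, j in pairs:
        neighbours[i].append(j)
        neighbours[j].append(i)

    # Order 3 (bond-angle triplets): angle a-centre-b for each neighbour pair.
    angles = []
    for centre, nbrs in neighbours.items():
        for a, b in itertools.combinations(nbrs, 2):
            u = positions[a] - positions[centre]
            v = positions[b] - positions[centre]
            cos_t = np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))
            angle = np.degrees(np.arccos(np.clip(cos_t, -1.0, 1.0)))
            angles.append(min(angle, 180.0))

    return {
        # Order 1 (atomic sites): a per-site scalar; the 0-4 range is a
        # hypothetical choice, not taken from the paper.
        "site": normalized_hist(site_values, bins, (0.0, 4.0)),
        "pair": normalized_hist(bond_lengths, bins, (0.0, cutoff)),
        "triplet": normalized_hist(angles, bins, (0.0, 180.0)),
    }


# Toy cluster: one central atom with four neighbours in a square plane.
positions = [(0, 0, 0), (1, 0, 0), (-1, 0, 0), (0, 1, 0), (0, -1, 0)]
site_values = [2.0, 1.0, 1.0, 1.0, 1.0]
hists = graphlet_histograms(positions, site_values, cutoff=1.2)
```

In this toy cluster the triplet histogram has weight only in the bins containing 90° and 180°, which is the kind of local geometric information a composition-only or connectivity-only descriptor would not carry.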

If this is right

  • Materials can be compared directly using the supplied Earth Mover's Distance metric in graphlet-histogram space.
  • The database and accompanying code allow generation of representations for experimentally determined structures without DFT calculations.
  • The same precomputed features can be used when only limited experimental property data are available.
  • The representation can be extended to additional materials or target properties by running the provided open-source tools.
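As a rough illustration of comparing materials in histogram space, the sketch below computes the Earth Mover's Distance between two one-dimensional histograms that share the same bins, using the standard cumulative-sum form. How the paper combines the per-distribution distances into a single crystal-to-crystal metric is not stated on this page, so the unweighted mean in `crystal_distance` is an assumption, and both function names are hypothetical.

```python
import numpy as np


def emd_1d(p, q, bin_width=1.0):
    """Earth Mover's Distance between two 1D histograms on identical bins.

    Both inputs are normalised to unit mass first. For equal-mass 1D
    histograms the EMD equals the summed absolute difference of the
    cumulative distributions, scaled by the bin width.
    """
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    if p.shape != q.shape:
        raise ValueError("histograms must share the same binning")
    p = p / p.sum()
    q = q / q.sum()
    return float(np.sum(np.abs(np.cumsum(p - q))) * bin_width)


def crystal_distance(hists_a, hists_b):
    """ASSUMPTION: unweighted mean of per-distribution EMDs over shared keys."""
    keys = sorted(set(hists_a) & set(hists_b))
    return float(np.mean([emd_1d(hists_a[k], hists_b[k]) for k in keys]))


# Moving all mass across two bins costs 2 (in units of bin width).
d_far = emd_1d([1, 0, 0], [0, 0, 1])
# Shifting half the mass by two bins costs 1.
d_near = emd_1d([1, 1, 0], [0, 1, 1])
# Identical histograms are at distance 0.
d_same = emd_1d([0.2, 0.8], [0.2, 0.8])
```

Unlike a bin-by-bin difference, this distance grows with how far mass has to move, so two histograms peaked in adjacent bins are closer than two peaked in distant bins.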

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • These representations could be applied to crystal structures obtained purely from experiment, bypassing computational databases entirely.
  • The three chosen graphlet orders may highlight which local geometric motifs most strongly influence particular properties.
  • The database could serve as a fixed feature set for transfer across different material classes or prediction tasks.

Load-bearing premise

The graphlet histograms extracted from CIF files via screened Voronoi tessellation capture the local structural geometry that determines material properties.

What would settle it

A test showing that machine-learning models trained on these graphlet histograms achieve no improvement over composition-only or connectivity-only baselines when predicting properties from scarce experimental data.
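The mechanics of such a test could look like the sketch below: cross-validated error of a simple model on baseline features versus baseline plus geometry features. Everything here is a placeholder. The data are synthetic and were constructed so that the geometry columns matter, so the outcome says nothing about Graphlet-MP itself; the ridge model, fold count, and function name are illustrative assumptions rather than anything proposed in the paper.

```python
import numpy as np


def ridge_cv_mae(X, y, n_folds=5, alpha=1.0, seed=0):
    """Mean absolute error of ridge regression under k-fold cross-validation."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(y)), n_folds)
    errors = []
    for k in range(n_folds):
        test = folds[k]
        train = np.concatenate([folds[m] for m in range(n_folds) if m != k])
        # Centre features and target using training data only.
        mu, y_mu = X[train].mean(axis=0), y[train].mean()
        Xc = X[train] - mu
        w = np.linalg.solve(Xc.T @ Xc + alpha * np.eye(X.shape[1]),
                            Xc.T @ (y[train] - y_mu))
        pred = (X[test] - mu) @ w + y_mu
        errors.append(np.mean(np.abs(pred - y[test])))
    return float(np.mean(errors))


# SYNTHETIC PLACEHOLDER DATA (not from the paper): a small sample in which
# the target depends on both "composition" and "geometry" columns.
rng = np.random.default_rng(0)
n = 60
X_comp = rng.normal(size=(n, 3))   # stand-in for composition-only features
X_geo = rng.normal(size=(n, 4))    # stand-in for graphlet-histogram features
y = (X_comp @ np.array([1.0, -0.5, 0.3])
     + X_geo @ np.array([2.0, -1.5, 1.0, 0.5])
     + 0.1 * rng.normal(size=n))

mae_comp = ridge_cv_mae(X_comp, y)
mae_both = ridge_cv_mae(np.hstack([X_comp, X_geo]), y)
```

With real data, the claim would be undermined if `mae_both` were no lower than `mae_comp` across scarce experimental datasets; here the gap exists by construction.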

Figures

Figures reproduced from arXiv: 2606.10195 by Aaditya Panigrahi, Eun-Ah Kim, Krishnanand Mallayya, Omri Lesser, Yanjun Liu.

Figure 1. Schematic representing the construction of the Property-Labelled Materials Fragments (PLMF). The crystal structure (a) is analysed for atomic neighbours via Voronoi tessellation (b). After property labelling, the resulting periodic graph (c) is decomposed into simple subgraphs (d). Source: Nature Communications 8:15679, DOI: 10.1038/ncomms15679.
Figure 2
Original abstract

Machine learning models for materials property prediction increasingly rely on representations learned end-to-end from large density-functional-theory databases, limiting their applicability when only scarce experimental data are available. Domain-knowledge-driven representations precomputed from crystal structures alone offer a data-efficient, interpretable alternative, but existing approaches capture at most composition or bonding connectivity and discard local structural geometry. Here, we present Graphlet-MP, a database of graphlet histogram representations for 149,082 inorganic crystals from the Materials Project (MP). Seventy-nine distributions describe each material over three hierarchical graphlet orders: atomic sites, bonded pairs, and bond-angle triplets, extracted via screened Voronoi tessellation from the crystallographic information file. We provide a complete technical specification of the representation, an Earth Mover's Distance metric for comparing materials in this space, and the full precomputed database. An accompanying open-source codebase enables users to generate graphlet histograms for arbitrary crystal structures, including experimentally determined ones, and to extend the database to new materials or target properties.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 0 minor

Summary. The manuscript presents Graphlet-MP, a database of graphlet histogram representations for 149,082 inorganic crystals from the Materials Project. Each material is described by 79 distributions over three hierarchical graphlet orders (atomic sites, bonded pairs, bond-angle triplets) extracted via screened Voronoi tessellation from CIF files. The work supplies a technical specification of the representation, an Earth Mover's Distance metric, the precomputed database, and open-source code to generate histograms for arbitrary (including experimental) structures.

Significance. If the representations can be shown to improve property prediction in low-data regimes, the database would constitute a concrete, interpretable, domain-knowledge-driven alternative to end-to-end learned embeddings from large DFT corpora. The release of the full precomputed set together with reproducible code is itself a tangible contribution that lowers the barrier for subsequent validation studies.

major comments (1)
  1. [Abstract] Abstract (and the paragraph describing the three hierarchical graphlet orders): the central claim that the graphlet histograms 'offer a data-efficient, interpretable alternative' for property prediction when experimental data are scarce is unsupported. The manuscript contains no regression or classification results on any target property, no ablation against composition-only or connectivity baselines, and no correlation or feature-importance analysis linking the chosen graphlet orders to known physical properties. Consequently the assumption that the screened-Voronoi graphlets encode geometry relevant to downstream tasks remains untested.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for their detailed review and for identifying the unsupported claim in the abstract. We agree that the manuscript, as a data-descriptor paper, provides no empirical validation of downstream property prediction performance. We address the point below and will revise accordingly.

Point-by-point responses
  1. Referee: [Abstract] Abstract (and the paragraph describing the three hierarchical graphlet orders): the central claim that the graphlet histograms 'offer a data-efficient, interpretable alternative' for property prediction when experimental data are scarce is unsupported. The manuscript contains no regression or classification results on any target property, no ablation against composition-only or connectivity baselines, and no correlation or feature-importance analysis linking the chosen graphlet orders to known physical properties. Consequently the assumption that the screened-Voronoi graphlets encode geometry relevant to downstream tasks remains untested.

    Authors: We agree with the referee. The manuscript is a database release that specifies the graphlet-histogram construction, the Earth Mover's Distance metric, the precomputed data for 149,082 MP entries, and the open code; it contains no regression, classification, or ablation experiments. The phrasing in the abstract was meant to motivate the representation's intended use case, but it is unsupported by results within this work. We will revise the abstract and the paragraph on the three graphlet orders to remove any claim that the histograms constitute a demonstrated alternative for property prediction. The revised text will focus strictly on the technical specification of the representation, its interpretability through explicit geometric graphlets, and the public release of the database and code.

    Revision: yes

Circularity Check

0 steps flagged

No circularity; database release with no derivations or fitted predictions

full rationale

The paper constructs and releases a database of graphlet histograms extracted from CIF files via screened Voronoi tessellation. No equations, parameter fits, predictions, or uniqueness theorems are presented. The central contribution is the precomputed data and open-source code for arbitrary structures; the utility claim for property prediction is stated but not derived or validated within the manuscript. No load-bearing steps reduce to self-definition, self-citation chains, or fitted inputs renamed as outputs. This is a standard data-release paper whose content is independent of any internal circular reduction.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Database release paper; no mathematical derivations, fitted parameters, or postulated entities appear in the abstract.



Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Charting the emergent low-dimensional manifold of quantum materials

    cond-mat.supr-con · 2026-06 · unverdicted · novelty 6.0

    Unsupervised manifold learning on ICSD data reveals a low-dimensional embedding that segregates superconductors and predicts critical temperatures across families.

Reference graph

Works this paper leans on

15 extracted references · 1 canonical work pages · cited by 1 Pith paper

  1. [1] K. Choudhary and B. DeCost, Atomistic Line Graph Neural Network for improved materials property predictions, npj Computational Materials 7, 185 (2021)

  2. [2] R. E. A. Goodall and A. A. Lee, Predicting materials properties without crystal structure: deep representation learning from stoichiometry, Nature Communications 11, 6280 (2020)

  3. [3] T. Xie and J. C. Grossman, Crystal Graph Convolutional Neural Networks for an Accurate and Interpretable Prediction of Material Properties, Physical Review Letters 120, 145301 (2018)

  4. [4] T. Sommer, R. Willa, J. Schmalian, and P. Friederich, 3DSC - a dataset of superconductors including crystal structures, Scientific Data 10, 816 (2023)

  5. [5] O. Lesser, Y. Liu, N. Maus, A. Panigrahi, K. Mallayya, A. Gong, A. Kabra, S. B. Lee, S. Chatterjee, A. Merino, K. Q. Weinberger, L. M. Schoop, J. R. Gardner, and E.-A. Kim, Electron affinity difference distributions guide the discovery of the superconductor PtPb3Bi (2026), arXiv:2510.07373 [cond-mat.supr-con]

  6. [6] There are ∼9,150 entries in 3DSC. Out of those, 6,463 are unique superconductors, and only 4,325 have unique representative structural information, as measured by the graphlet histogram earth mover distance metric, as of June 2026 [5]

  7. [7] A. Y.-T. Wang, S. K. Kauwe, R. J. Murdock, and T. D. Sparks, Compositionally restricted attention-based network for materials property predictions, npj Computational Materials 7, 77 (2021)

  8. [8] L. M. Ghiringhelli, J. Vybiral, S. V. Levchenko, C. Draxl, and M. Scheffler, Big Data of Materials Science: Critical Role of the Descriptor, Physical Review Letters 114, 105503 (2015)

  9. [9] B. Meredig, A. Agrawal, S. Kirklin, J. E. Saal, J. W. Doak, A. Thompson, K. Zhang, A. Choudhary, and C. Wolverton, Combinatorial screening for new materials in unconstrained composition space with machine learning, Physical Review B 89, 094104 (2014)

  10. [10] L. Ward, A. Agrawal, A. Choudhary, and C. Wolverton, A general-purpose machine learning framework for predicting properties of inorganic materials, npj Computational Materials 2, 16028 (2016)

  11. [11] O. Isayev, C. Oses, C. Toher, E. Gossett, S. Curtarolo, and A. Tropsha, Universal fragment descriptors for predicting properties of inorganic crystals, Nature Communications 8, 15679 (2017)
      [12] https://github.com/ai-materials-institute/GraphletDatabase

  12. [12] J. R. Rumble, T. J. Bruno, and M. J. Doa, eds., CRC Handbook of Chemistry and Physics: A Ready-Reference Book of Chemical and Physical Data, 101st ed. (CRC Press, Taylor & Francis Group, Boca Raton London New York, 2020)

  13. [13] L. Ward, R. Liu, A. Krishna, V. I. Hegde, A. Agrawal, A. Choudhary, and C. Wolverton, Including crystal structure attributes in machine learning models of formation energies via Voronoi tessellations, Physical Review B 96, 024104 (2017)

  14. [14] Y. Rubner, C. Tomasi, and L. J. Guibas, The Earth Mover's Distance as a Metric for Image Retrieval, International Journal of Computer Vision 40, 99 (2000)

  15. [15] A. Jain, S. P. Ong, G. Hautier, W. Chen, W. D. Richards, S. Dacek, S. Cholia, D. Gunter, D. Skinner, G. Ceder, and K. A. Persson, Commentary: The Materials Project: A materials genome approach to accelerating materials innovation, APL Materials 1, 011002 (2013)
      [17] https://doi.org/10.5281/zenodo.20532978