pith. machine review for the scientific record.

arxiv: 2511.19002 · v2 · submitted 2025-11-24 · ⚛️ physics.chem-ph

CHAOS -- A Consistent Large-scale Database for Sigma-Profiles and Other Molecular Descriptors

Pith reviewed 2026-05-17 05:22 UTC · model grok-4.3

classification ⚛️ physics.chem-ph
keywords sigma-profiles · quantum-chemical database · molecular descriptors · thermodynamic modeling · solvent selection · C-PCM · infrared spectra · NMR shielding

The pith

CHAOS supplies sigma-profiles for 53091 molecules from a single standardized quantum-chemical calculation.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents a new open database that computes sigma-profiles and several related quantum-chemical properties for more than fifty thousand molecules. Existing collections of these descriptors remain small and vary in quality, which limits their use in solvent selection, thermodynamic calculations, and molecular design. The authors apply one fixed computational protocol to every entry so that all values sit on the same internal scale. This produces a resource an order of magnitude larger than prior public sets and links each sigma-profile to geometries, spectra, heat capacities, and NMR tensors. The resulting collection supplies consistent input for both physics-based models and machine-learning methods across chemistry and chemical engineering.

Core claim

The central claim is that a uniform wB97X-D/def2-TZVP workflow applied to a curated set of 53091 molecules yields an internally consistent library of sigma-profiles together with gas-phase geometries, conductor-like polarizable continuum model data, infrared spectra, ideal-gas thermodynamic functions, and atomic NMR shielding tensors; the library is released openly and increases the number of publicly available sigma-profiles by more than a factor of ten.

What carries the argument

The standardized quantum-chemical workflow at the wB97X-D/def2-TZVP level that produces both sigma-profiles and the linked observables for every molecule.
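
As a rough illustration of what a sigma-profile is, the sketch below bins the screening charge densities of surface segments into an area-weighted histogram. The function name `sigma_profile`, the grid from -0.025 to 0.025 e/Å² and the 50 bins are assumptions chosen for illustration; they follow a common COSMO-type convention and are not taken from the CHAOS paper, which may bin or smooth its profiles differently.

```python
def sigma_profile(areas, sigmas, sigma_min=-0.025, sigma_max=0.025, n_bins=50):
    """Area-weighted histogram of screening charge densities.

    areas  : segment areas (assumed unit: Å^2)
    sigmas : screening charge density of each segment (assumed unit: e/Å^2)
    Grid limits and bin count are illustrative defaults, not CHAOS settings.
    """
    if len(areas) != len(sigmas):
        raise ValueError("areas and sigmas must have the same length")

    width = (sigma_max - sigma_min) / n_bins
    centers = [sigma_min + (k + 0.5) * width for k in range(n_bins)]
    profile = [0.0] * n_bins

    for area, sigma in zip(areas, sigmas):
        if not (sigma_min <= sigma <= sigma_max):
            continue  # segments outside the grid are ignored in this sketch
        # Clamp so that sigma == sigma_max falls into the last bin.
        k = min(int((sigma - sigma_min) / width), n_bins - 1)
        profile[k] += area

    return centers, profile


# Toy input: three segments with made-up areas and charge densities.
centers, profile = sigma_profile([1.0, 2.0, 3.0], [-0.0105, 0.0005, 0.0105])
print(sum(profile))  # total surface area captured by the grid
```

Because every entry is binned on the same grid, profiles from different molecules can be compared bin by bin, which is the practical meaning of the internal consistency the paper emphasizes.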

If this is right

  • Larger, consistent sigma-profile sets become available for solvent screening and thermodynamic property prediction.
  • Both physics-based and data-driven models can be trained or validated on an order-of-magnitude more entries than before.
  • Each sigma-profile is accompanied by geometries, spectra, heat capacities, and NMR data, enabling joint modeling of multiple properties.
  • Open release on Zenodo allows immediate reuse across chemical engineering and materials science.
  • The workflow can be reapplied to additional molecules without reintroducing inconsistency.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same computational protocol could be rerun on larger or more diverse molecular sets to test coverage limits.
  • Machine-learning models trained on CHAOS data might predict sigma-profiles for molecules outside the current mass and dipole range.
  • Cross-validation against experimental solvation data for a subset of entries would quantify practical accuracy for specific applications.
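
A minimal sketch of how one might flag a molecule as lying outside the coverage stated in the abstract (molecular masses up to 400 amu, dipole moments up to 15 D). The function name `within_reported_range` is hypothetical, and the check covers only these two scalar limits; it says nothing about element or functional-group coverage.

```python
# Upper limits quoted in the abstract of the CHAOS paper.
MAX_MASS_AMU = 400.0
MAX_DIPOLE_DEBYE = 15.0


def within_reported_range(mass_amu, dipole_debye):
    """Return True if both values fall inside the ranges stated in the abstract.

    This is a coarse screen only: passing it does not imply that chemically
    similar molecules are actually present in the database.
    """
    mass_ok = 0.0 < mass_amu <= MAX_MASS_AMU
    dipole_ok = 0.0 <= dipole_debye <= MAX_DIPOLE_DEBYE
    return mass_ok and dipole_ok


# Made-up example values for illustration.
print(within_reported_range(180.16, 2.5))
print(within_reported_range(520.0, 3.0))
```

A model queried on a molecule that fails this check would be extrapolating beyond the stated data range, so its predicted sigma-profile deserves extra scrutiny.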

Load-bearing premise

The chosen density-functional level and basis set produce sigma-profiles accurate enough for the downstream uses, and the chosen molecules adequately cover the chemical space needed for those uses.

What would settle it

Direct numerical comparison of the CHAOS sigma-profiles for a few hundred overlapping molecules against any existing smaller public library, or against experimental activity coefficients or solvation free energies derived from sigma-profiles.
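
One way such a comparison could be scored is sketched below. It assumes that both libraries provide profiles on the same sigma grid, and it uses an area-normalized L1 (total variation) distance. That metric and the name `normalized_l1_distance` are illustrative choices, not something the paper prescribes.

```python
def normalized_l1_distance(profile_a, profile_b):
    """Total variation distance between two sigma-profiles on the same grid.

    Each profile is first divided by its total area, so the result compares
    shapes only: 0.0 means identical shape, 1.0 means no overlap at all.
    """
    if len(profile_a) != len(profile_b):
        raise ValueError("profiles must be defined on the same sigma grid")

    total_a = sum(profile_a)
    total_b = sum(profile_b)
    if total_a <= 0 or total_b <= 0:
        raise ValueError("profiles must have positive total area")

    return 0.5 * sum(
        abs(a / total_a - b / total_b) for a, b in zip(profile_a, profile_b)
    )


# Toy profiles on a two-bin grid, for illustration only.
print(normalized_l1_distance([1.0, 3.0], [3.0, 1.0]))
```

Averaging this distance over the overlapping molecules would give a single number for agreement in shape. Differences in total surface area are removed by the normalization and would need a separate check.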

Figures

Figures reproduced from arXiv: 2511.19002 by Dominik Gond, Fabian Jirasek, Hans Hasse, Justus Arweiler, Thomas Specht.

Figure 1: Schematic overview of the CHAOS data generation workflow. Starting from MOL ...
Figure 2: Overview of the diversity of molecules included in the CHAOS database. Panels ...
Original abstract

Sigma-profiles obtained from quantum-chemical calculations are key molecular descriptors for solvent selection, thermodynamic modeling, and data-driven molecular design. However, existing sigma-profile libraries are limited in size and inconsistent in quality, which restricts their utility. In this work, we introduce CHAOS (Computed High-Accuracy Observables and Sigma Profiles), a large-scale and internally consistent database providing sigma-profiles for 53091 molecules, along with additional quantum-chemical observables including gas-phase geometries, single-point conductor-like polarizable continuum (C-PCM) data, infrared spectra, ideal-gas heat capacities and entropies, and atomic orbital nuclear magnetic resonance (NMR) shielding tensors. All data were generated using a standardized quantum-chemical workflow based on an wB97X-D/def2-TZVP level of theory. The CHAOS database covers molecules composed of a diverse set of elements, with molecular masses up to 400 amu and dipole moments up to 15 D, and is freely available on Zenodo under an open license. It extends the number of molecules for which sigma-profiles are publicly available by more than an order of magnitude and systematically links them to a broad range of other quantum-chemical molecular descriptors. CHAOS provides a comprehensive and consistent foundation for developing models of molecular and thermodynamic properties -- both physics-based and machine-learning approaches -- across chemistry, chemical engineering, and materials science, greatly extending the possibilities and the available quantum-chemical data basis.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 2 minor

Summary. The manuscript introduces the CHAOS database containing sigma-profiles and linked quantum-chemical descriptors (gas-phase geometries, C-PCM data, IR spectra, ideal-gas heat capacities/entropies, and NMR shielding tensors) for 53091 molecules. All entries were generated with a uniform wB97X-D/def2-TZVP + C-PCM protocol, the data are released openly on Zenodo, and the work claims to increase the number of publicly available sigma-profiles by more than an order of magnitude for applications in solvent selection, thermodynamic modeling, and machine learning.

Significance. A large, internally consistent, and openly licensed collection of sigma-profiles linked to multiple other descriptors would be a useful resource for COSMO-RS-based modeling and data-driven chemistry if the chosen level of theory yields descriptors of adequate fidelity. The scale and standardization are clear strengths.

major comments (1)
  1. Computational workflow description: The paper applies a uniform wB97X-D/def2-TZVP + C-PCM protocol to all 53091 molecules but reports no direct comparisons of the resulting sigma-profiles to experimental solvation data, to existing COSMO-RS libraries, or to higher-level references (e.g., MP2 or DLPNO-CCSD). This absence is load-bearing for the utility claims in thermodynamic modeling and solvent selection, because any systematic bias in screening-charge distributions would propagate into downstream predictions even while internal consistency is preserved.
minor comments (2)
  1. Abstract and title: The phrasing 'Computed High-Accuracy Observables' is used without accompanying justification or error metrics; a brief statement clarifying the intended meaning of 'high-accuracy' relative to the chosen functional/basis would improve clarity.
  2. Data coverage: The ranges for molecular mass (up to 400 amu) and dipole moment (up to 15 D) are stated, but a concise table or figure summarizing element distribution or functional-group coverage would help readers assess representativeness of chemical space.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the positive assessment of the CHAOS database and the recommendation for minor revision. We address the major comment below.

  1. Referee: Computational workflow description: The paper applies a uniform wB97X-D/def2-TZVP + C-PCM protocol to all 53091 molecules but reports no direct comparisons of the resulting sigma-profiles to experimental solvation data, to existing COSMO-RS libraries, or to higher-level references (e.g., MP2 or DLPNO-CCSD). This absence is load-bearing for the utility claims in thermodynamic modeling and solvent selection, because any systematic bias in screening-charge distributions would propagate into downstream predictions even while internal consistency is preserved.

    Authors: We agree that explicit comparisons to experimental solvation free energies, existing COSMO-RS libraries, or higher-level references would provide additional support for absolute accuracy claims. The primary objective of this work, however, is the release of a large, internally consistent dataset generated under a single, reproducible protocol; such consistency is itself valuable for relative predictions, machine-learning training, and screening applications where systematic errors are expected to cancel. The wB97X-D/def2-TZVP + C-PCM combination is a standard, computationally tractable level already employed in multiple COSMO-RS studies. To address the referee's concern directly, we will revise the manuscript to add a concise subsection discussing the rationale for the chosen level of theory, citing relevant literature benchmarks, and explicitly noting the distinction between internal consistency and absolute accuracy. This addition will clarify the scope without requiring new calculations.

    revision: yes

Circularity Check

0 steps flagged

No circularity: direct standardized computations with independent outputs


The manuscript presents a uniform wB97X-D/def2-TZVP + C-PCM workflow to generate sigma-profiles and linked quantum-chemical observables for 53091 molecules. No derivations, equations, fitted parameters, or predictions are defined inside the paper that later reappear as outputs. The central claim is simply that the chosen level of theory was applied consistently to produce the database; the resulting descriptors are independent of any result constructed within the work itself. Self-citations, if present for methodology, are not load-bearing for any derivation chain because none exists. This is a data-generation paper whose claims rest on external quantum-chemistry software and stated computational settings rather than internal self-reference.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The paper performs direct quantum-chemical calculations rather than introducing fitted parameters or new theoretical entities; the central assumption is the adequacy of the chosen DFT functional and basis set for the target descriptors.

axioms (1)
  • domain assumption: The wB97X-D/def2-TZVP level of theory yields sufficiently accurate sigma-profiles and related observables for the intended applications.
    This choice underpins the entire standardized workflow described in the abstract.



