CHAOS -- A Consistent Large-scale Database for Sigma-Profiles and Other Molecular Descriptors
Pith reviewed 2026-05-17 05:22 UTC · model grok-4.3
The pith
CHAOS supplies sigma-profiles for 53091 molecules from a single standardized quantum-chemical calculation.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The central claim is that a uniform wB97X-D/def2-TZVP workflow applied to a curated set of 53091 molecules yields an internally consistent library of sigma-profiles together with gas-phase geometries, conductor-like polarizable continuum model data, infrared spectra, ideal-gas thermodynamic functions, and atomic NMR shielding tensors; the library is released openly and increases the number of publicly available sigma-profiles by more than a factor of ten.
What carries the argument
The standardized quantum-chemical workflow at the wB97X-D/def2-TZVP level that produces both sigma-profiles and the linked observables for every molecule.
If this is right
- Larger, consistent sigma-profile sets become available for solvent screening and thermodynamic property prediction.
- Both physics-based and data-driven models can be trained or validated on an order of magnitude more entries than before.
- Each sigma-profile is accompanied by geometries, spectra, heat capacities, and NMR data, enabling joint modeling of multiple properties.
- Open release on Zenodo allows immediate reuse across chemical engineering and materials science.
- The workflow can be reapplied to additional molecules without reintroducing inconsistency.
Where Pith is reading between the lines
- The same computational protocol could be rerun on larger or more diverse molecular sets to test coverage limits.
- Machine-learning models trained on CHAOS data might predict sigma-profiles for molecules outside the current mass and dipole range.
- Cross-validation against experimental solvation data for a subset of entries would quantify practical accuracy for specific applications.
Load-bearing premise
The chosen density-functional level and basis set produce sigma-profiles accurate enough for the downstream uses, and the chosen molecules adequately cover the chemical space needed for those uses.
What would settle it
Direct numerical comparison of the CHAOS sigma-profiles for a few hundred overlapping molecules against an existing smaller public library, or comparison of activity coefficients or solvation free energies predicted from the CHAOS sigma-profiles against experimental values.
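One way such a library-to-library comparison could be set up is sketched below. It is a minimal illustration, not a procedure from the paper: it assumes both libraries store each sigma-profile as a list of surface areas on the same sigma grid, keyed by a shared molecule identifier, and the function names (`normalize`, `profile_distance`, `compare_libraries`) are hypothetical. The metric is half the summed absolute difference between area-normalized profiles, so 0 means identical shape and 1 means no overlap.

```python
from typing import Dict, List, Sequence


def normalize(profile: Sequence[float]) -> List[float]:
    """Scale a sigma-profile so its bins sum to 1 (removes total-area differences)."""
    total = sum(profile)
    if total <= 0:
        raise ValueError("profile must have positive total area")
    return [value / total for value in profile]


def profile_distance(p: Sequence[float], q: Sequence[float]) -> float:
    """Half the L1 distance between two area-normalized profiles on the same grid."""
    if len(p) != len(q):
        raise ValueError("profiles must share the same sigma grid")
    pn, qn = normalize(p), normalize(q)
    return 0.5 * sum(abs(a - b) for a, b in zip(pn, qn))


def compare_libraries(
    lib_a: Dict[str, Sequence[float]],
    lib_b: Dict[str, Sequence[float]],
) -> Dict[str, float]:
    """Distance for every molecule identifier present in both libraries."""
    shared = sorted(set(lib_a) & set(lib_b))
    return {key: profile_distance(lib_a[key], lib_b[key]) for key in shared}
```

Normalizing first is a design choice: it isolates differences in profile shape from differences in total cavity area, which may stem from different cavity definitions rather than from the level of theory. If total area matters for the application, the unnormalized profiles should be compared as well.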
Original abstract
Sigma-profiles obtained from quantum-chemical calculations are key molecular descriptors for solvent selection, thermodynamic modeling, and data-driven molecular design. However, existing sigma-profile libraries are limited in size and inconsistent in quality, which restricts their utility. In this work, we introduce CHAOS (Computed High-Accuracy Observables and Sigma Profiles), a large-scale and internally consistent database providing sigma-profiles for 53091 molecules, along with additional quantum-chemical observables including gas-phase geometries, single-point conductor-like polarizable continuum (C-PCM) data, infrared spectra, ideal-gas heat capacities and entropies, and atomic orbital nuclear magnetic resonance (NMR) shielding tensors. All data were generated using a standardized quantum-chemical workflow based on an ωB97X-D/def2-TZVP level of theory. The CHAOS database covers molecules composed of a diverse set of elements, with molecular masses up to 400 amu and dipole moments up to 15 D, and is freely available on Zenodo under an open license. It extends the number of molecules for which sigma-profiles are publicly available by more than an order of magnitude and systematically links them to a broad range of other quantum-chemical molecular descriptors. CHAOS provides a comprehensive and consistent foundation for developing models of molecular and thermodynamic properties -- both physics-based and machine-learning approaches -- across chemistry, chemical engineering, and materials science, greatly extending the possibilities and the available quantum-chemical data basis.
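For readers unfamiliar with the descriptor, a sigma-profile is an area-weighted histogram of the screening charge density on the molecular cavity surface. The sketch below shows that idea in its simplest form. It is an illustration under stated assumptions, not the CHAOS workflow: the grid of 51 points from -0.025 to 0.025 e/Å² is a convention common in the COSMO-SAC literature and is assumed here, the charge-averaging step used in practical implementations is omitted, and the function name and example segment values are hypothetical.

```python
from typing import List, Sequence


def sigma_profile(
    charges: Sequence[float],   # screening charge per surface segment, in e
    areas: Sequence[float],     # segment area, in square angstrom
    sigma_min: float = -0.025,  # assumed grid start, e per square angstrom
    sigma_max: float = 0.025,   # assumed grid end
    n_bins: int = 51,           # assumed number of grid points
) -> List[float]:
    """Sum segment areas into the grid point nearest to each segment's charge density."""
    if len(charges) != len(areas):
        raise ValueError("charges and areas must have the same length")
    step = (sigma_max - sigma_min) / (n_bins - 1)
    profile = [0.0] * n_bins
    for charge, area in zip(charges, areas):
        sigma = charge / area
        index = round((sigma - sigma_min) / step)
        index = min(max(index, 0), n_bins - 1)  # clamp out-of-range segments to the edges
        profile[index] += area
    return profile


# Hypothetical three-segment surface, for illustration only.
example = sigma_profile(charges=[0.002, -0.004, 0.0], areas=[2.0, 1.0, 0.5])
```

Because every segment's area lands in exactly one bin, the profile sums to the total cavity surface area, which is a quick sanity check on any implementation.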
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces the CHAOS database containing sigma-profiles and linked quantum-chemical descriptors (gas-phase geometries, C-PCM data, IR spectra, ideal-gas heat capacities/entropies, and NMR shielding tensors) for 53091 molecules. All entries were generated with a uniform wB97X-D/def2-TZVP + C-PCM protocol, the data are released openly on Zenodo, and the work claims to increase the number of publicly available sigma-profiles by more than an order of magnitude for applications in solvent selection, thermodynamic modeling, and machine learning.
Significance. A large, internally consistent, and openly licensed collection of sigma-profiles linked to multiple other descriptors would be a useful resource for COSMO-RS-based modeling and data-driven chemistry if the chosen level of theory yields descriptors of adequate fidelity. The scale and standardization are clear strengths.
major comments (1)
- Computational workflow description: The paper applies a uniform wB97X-D/def2-TZVP + C-PCM protocol to all 53091 molecules but reports no direct comparisons of the resulting sigma-profiles to experimental solvation data, to existing COSMO-RS libraries, or to higher-level references (e.g., MP2 or DLPNO-CCSD). This absence is load-bearing for the utility claims in thermodynamic modeling and solvent selection, because any systematic bias in screening-charge distributions would propagate into downstream predictions even while internal consistency is preserved.
minor comments (2)
- Abstract and title: The phrasing 'Computed High-Accuracy Observables' is used without accompanying justification or error metrics; a brief statement clarifying the intended meaning of 'high-accuracy' relative to the chosen functional/basis would improve clarity.
- Data coverage: The ranges for molecular mass (up to 400 amu) and dipole moment (up to 15 D) are stated, but a concise table or figure summarizing element distribution or functional-group coverage would help readers assess representativeness of chemical space.
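The coverage summary requested in the second minor comment could be produced with a few lines of code. The sketch below counts, for each element, how many molecules contain it. It assumes the molecular formulas are available as plain strings without parentheses or charges; how CHAOS actually stores composition is not stated here, and the function names and the example formulas are hypothetical.

```python
import re
from collections import Counter
from typing import Iterable, Set

# One element symbol (capital letter plus optional lowercase letter) and an optional count.
_TOKEN = re.compile(r"([A-Z][a-z]?)(\d*)")


def elements_in(formula: str) -> Set[str]:
    """Return the set of element symbols that appear in a simple molecular formula."""
    return {symbol for symbol, _count in _TOKEN.findall(formula)}


def element_coverage(formulas: Iterable[str]) -> Counter:
    """Count how many molecules contain each element (each molecule counted once per element)."""
    counts: Counter = Counter()
    for formula in formulas:
        counts.update(elements_in(formula))
    return counts


# Hypothetical input, for illustration only.
coverage = element_coverage(["C2H6O", "CH3Cl", "C6H6"])
```

Sorting the result with `coverage.most_common()` gives a table ordered from the most to the least represented element, which is the kind of summary a reader would need to judge representativeness.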
Simulated Author's Rebuttal
We thank the referee for the positive assessment of the CHAOS database and the recommendation for minor revision. We address the major comment below.
Point-by-point responses
- Referee: Computational workflow description: The paper applies a uniform wB97X-D/def2-TZVP + C-PCM protocol to all 53091 molecules but reports no direct comparisons of the resulting sigma-profiles to experimental solvation data, to existing COSMO-RS libraries, or to higher-level references (e.g., MP2 or DLPNO-CCSD). This absence is load-bearing for the utility claims in thermodynamic modeling and solvent selection, because any systematic bias in screening-charge distributions would propagate into downstream predictions even while internal consistency is preserved.
  Authors: We agree that explicit comparisons to experimental solvation free energies, existing COSMO-RS libraries, or higher-level references would provide additional support for absolute accuracy claims. The primary objective of this work, however, is the release of a large, internally consistent dataset generated under a single, reproducible protocol; such consistency is itself valuable for relative predictions, machine-learning training, and screening applications where systematic errors are expected to cancel. The wB97X-D/def2-TZVP + C-PCM combination is a standard, computationally tractable level already employed in multiple COSMO-RS studies. To address the referee's concern directly, we will revise the manuscript to add a concise subsection discussing the rationale for the chosen level of theory, citing relevant literature benchmarks, and explicitly noting the distinction between internal consistency and absolute accuracy. This addition will clarify the scope without requiring new calculations.
  Revision: yes
Circularity Check
No circularity: direct standardized computations with independent outputs
full rationale
The manuscript presents a uniform wB97X-D/def2-TZVP + C-PCM workflow to generate sigma-profiles and linked quantum-chemical observables for 53091 molecules. No derivations, equations, fitted parameters, or predictions are defined inside the paper that later reappear as outputs. The central claim is simply that the chosen level of theory was applied consistently to produce the database; the resulting descriptors are independent of any result constructed within the work itself. Self-citations, if present for methodology, are not load-bearing for any derivation chain because none exists. This is a data-generation paper whose claims rest on external quantum-chemistry software and stated computational settings rather than internal self-reference.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: The wB97X-D/def2-TZVP level of theory yields sufficiently accurate sigma-profiles and related observables for the intended applications.
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean, theorem washburn_uniqueness_aczel (tag: unclear)
  Relation between the paper passage and the cited Recognition theorem.
  Passage: "All data were generated using a standardized quantum-chemical workflow based on an ωB97X-D/def2-TZVP level of theory... single-point conductor-like polarizable continuum (C-PCM) calculation"
- IndisputableMonolith/Foundation/RealityFromDistinction.lean, theorem reality_from_one_distinction (tag: unclear)
  Relation between the paper passage and the cited Recognition theorem.
  Passage: "The CHAOS database... extends the number of molecules for which σ-profiles are publicly available by more than an order of magnitude"
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.