
arxiv: 2604.06043 · v1 · submitted 2026-04-07 · ⚛️ physics.chem-ph · cond-mat.mtrl-sci

The BOS-Lig Dataset: Accurate Ligand Charges from a Consensus Approach for 66,810 Experimentally Synthesized Ligands

Pith reviewed 2026-05-10 18:31 UTC · model grok-4.3

classification ⚛️ physics.chem-ph cond-mat.mtrl-sci
keywords ligand charges · transition metal complexes · charge assignment · heteroleptic complexes · Cambridge Structural Database · BOS-Lig dataset · topic modeling · hemilabile ligands

The pith

An iterative consensus workflow assigns reliable net charges to 66,810 ligands from 126,985 experimental transition metal complexes.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper builds the BOS-Lig dataset by pulling ligands from over 126,000 mononuclear transition metal complexes stored in the Cambridge Structural Database. It applies an iterative charge-balancing process that first settles charges in homoleptic complexes using complex charges and metal oxidation states, then spreads those assignments to heteroleptic cases through repeated cross-checks with multiple observations. This produces confident net-charge labels for 66,810 of the 94,581 unique ligand structures identified. A sympathetic reader cares because missing or inconsistent ligand charges have blocked reliable high-throughput computational screening of transition metal complexes for catalysis, redox chemistry, and photophysical applications. The work also classifies coordinating atoms, notes hemilabile variants, and connects many ligands to functional use areas via topic modeling of journal abstracts.

Core claim

By applying an iterative charge-balancing workflow that combines complex charges, metal oxidation states, and consensus across crystallographic observations, net charges can be confidently assigned to 66,810 ligands among 94,581 unique structures extracted from 126,985 mononuclear transition metal complexes. The process begins with homoleptic complexes and propagates assignments to heteroleptic environments, allowing inference even when direct charge data is absent. Each ligand receives additional labels for metal-coordinating atoms and hemilability, while 25,146 ligands are linked to application domains through topic modeling of associated abstracts.
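
The homoleptic-first propagation described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `assign_ligand_charges`, the dictionary layout, and the simple majority vote are all hypothetical, and the four example complexes are textbook cases rather than entries taken from the dataset. Unlike the workflow in the paper, this sketch never revisits a ligand once it has been assigned.

```python
from collections import Counter, defaultdict

def assign_ligand_charges(complexes, max_iter=20):
    """Infer ligand net charges by iterative charge balance (illustrative only).

    complexes: list of dicts with keys
      'charge'  - net charge of the complex
      'ox'      - metal oxidation state
      'ligands' - dict mapping ligand id -> number of copies in the complex
    """
    assigned = {}  # ligand id -> consensus charge

    for _ in range(max_iter):
        votes = defaultdict(Counter)  # ligand id -> charges inferred in this pass
        for c in complexes:
            unknown = [lig for lig in c["ligands"] if lig not in assigned]
            if len(unknown) != 1:
                continue  # nothing to solve, or more than one unknown ligand type
            lig = unknown[0]
            # Charge left for the unknown ligand after the metal and known ligands.
            residual = c["charge"] - c["ox"] - sum(
                assigned[other] * n
                for other, n in c["ligands"].items()
                if other != lig
            )
            n_copies = c["ligands"][lig]
            if residual % n_copies == 0:  # keep integer charges only
                votes[lig][residual // n_copies] += 1
        if not votes:
            break  # no new ligand could be solved in this pass
        for lig, counter in votes.items():
            # Assumed consensus rule: most frequent inferred charge wins.
            assigned[lig] = counter.most_common(1)[0][0]
    return assigned

example = [
    {"charge": -4, "ox": 2, "ligands": {"CN": 6}},             # [Fe(CN)6]4-
    {"charge": 2, "ox": 2, "ligands": {"bpy": 3}},             # [Fe(bpy)3]2+
    {"charge": 0, "ox": 2, "ligands": {"bpy": 2, "Cl": 2}},    # Ru(bpy)2Cl2
    {"charge": 0, "ox": 2, "ligands": {"Cl": 2, "PPh3": 2}},   # PtCl2(PPh3)2
]
print(assign_ligand_charges(example))
```

Because assignments are committed only at the end of each pass, the first pass can solve only complexes with a single ligand type, which reproduces the homoleptic-first ordering; the heteroleptic entries are solved in later passes as their other ligands become known.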

What carries the argument

The iterative charge-balancing workflow, which settles charges first in homoleptic complexes then propagates them via consensus with metal oxidation states to handle heteroleptic cases.

If this is right

  • Consistent charges become available for screening libraries of transition metal complexes in reactivity and photophysical applications.
  • The purity metric flags assignments likely to be unreliable, allowing users to filter the dataset for high-confidence subsets.
  • Classification of coordinating atoms and hemilability supports targeted searches for ligands with specific binding behavior.
  • Linking ligands to application topics through abstracts provides an experimentally grounded starting point for data-driven ligand selection.
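
To make the filtering idea in the second point concrete, here is a hedged sketch. The paper's definition of the purity metric is not reproduced on this page, so the code assumes a simple stand-in: the fraction of a ligand's observations that agree with its majority charge. The function names and the 0.9 threshold are hypothetical.

```python
from collections import Counter

def majority_purity(observed_charges):
    """Return (majority charge, fraction of observations agreeing with it).

    Assumed stand-in for a purity score; not the paper's definition.
    """
    counts = Counter(observed_charges)
    charge, n_agree = counts.most_common(1)[0]
    return charge, n_agree / len(observed_charges)

def high_confidence(observations, threshold=0.9):
    """Keep only ligands whose assumed purity meets the threshold."""
    kept = {}
    for ligand, charges in observations.items():
        charge, purity = majority_purity(charges)
        if purity >= threshold:
            kept[ligand] = charge
    return kept

# Hypothetical per-ligand charge observations across several complexes.
observations = {
    "ligand_a": [-1, -1, -1, 0],  # 3 of 4 agree: purity 0.75
    "ligand_b": [0, 0],           # all agree: purity 1.0
}
print(high_confidence(observations))
```

With these inputs only `ligand_b` passes the 0.9 cutoff; lowering the threshold to 0.75 would also keep `ligand_a` with charge -1.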

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The dataset could serve as training data for machine-learning models that predict charges or properties for ligands in yet-unseen complexes.
  • The propagation logic might be adapted to assign charges in polynuclear or supramolecular systems where multiple metals share ligands.
  • The hemilability flags could guide experimental design of switchable catalysts that respond to external stimuli.

Load-bearing premise

Consensus across multiple crystallographic observations and known metal oxidation states can correctly infer ligand net charges even when no direct charge measurement exists for that ligand.

What would settle it

A ligand assigned a particular charge in several complexes would be shown incorrect if independent experimental data, such as a measured pKa or redox potential for an isolated ligand or a well-characterized complex, consistently contradicted the assigned value while satisfying the overall complex charge.

Figures

Figures reproduced from arXiv: 2604.06043 by Aaron G. Garrison, Heather J. Kulik, Ilia Kevlishvili, Roland G. St. Michel, Ryan J. Jang.

Figure 1. Iterative workflow for complex-level charge inference. Mononuclear TMCs extracted from the CSD are decomposed into molecular components and tallied to identify the most frequently occurring species. Initially defined seed charges enable iterative charge solving across entries under the assumption of unit-cell neutrality. Each pass identifies solvable entries, assigns charges by difference, and updates the …
Figure 2. Count, charge and oxidation state distributions for TMCs. a) Funnel diagram illustrating the filtering of TMCs from the CSD. Starting from 254,989 unique complexes, structures were screened for hydrogen completeness, structural validity, and the presence of a consistent complex-level charge and oxidation state, yielding 126,985 high-confidence entries suitable for ligand charge analysis. b) Histogram of as…
Figure 4. Ligand charge workflow and results. a) Flowchart of the iterative procedure used to assign ligand charges. b) Distribution of the assigned ligand charges. c) Representative ligand examples illustrating typical and out-of-range assignments: a ligand with a charge of −2 (within the main distribution), alongside cases of rarely assigned charges (−5 and +4) outside of the main distribution. Atom colors: carbon…
Figure 5. Weighted ligand charge assignment. (Top) An example inconsistent ligand whose assigned charge transitions from −1 to 0 after the second iteration. Boxes denote the ligand’s charge in each iteration, with color indicating whether the ligand was solved (blue) or unsolved (red) at that stage. (Left) Scatter plot depicting the evolution of the number of solved and unsolved ligands across iterations. (Right) An example…
Figure 6. Comparison of iterative ligand-charge assignments with alternative charge-assignment schemes. a) Representative ligands where the iterative workflow disagrees with cell2mol. b) Representative ligands where the iterative workflow disagrees with the octet rule, showcasing a hypervalent phosphorus-containing ligand (double bonds to the central metal), and an N-heterocyclic carbene-containing ligand. Atom co…
Figure 7. Ligand coordination details. a) Distribution of number of ligand coordinating atoms across all binding modes in the dataset. b) Elemental composition of donor atoms, expressed as the percentage contribution of each donor element. We further classify the charge-assigned ligands as hemilabile using our recently introduced four-type hemilability categorization (ref. 63). Type 1 and 2 hemilability both involve changes…
Figure 8. Coverage and application breadth of ligands and transition-metal complexes with associated bibliographic text. a) Fraction of crystallographically characterized transition-metal complexes and unique ligands that could be classified, relative to those lacking text coverage. b) Distribution of application breadth, defined as the number of distinct application classes in which a given ligand appears. c) Uniqu…
Figure 9. Pairwise overlap of application areas for ligands. Each cell reports the number of ligands shared between two application areas and the percentage of ligands in the row category that also appear in the column category. Due to the difference in size of the two classes, the percentage will differ in the upper and lower triangle of the matrix. Diagonal entries correspond to all ligands associated with that ca…
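
The row-normalized overlap described in the Figure 9 caption is easy to misread, so a small sketch may help. It assumes each application area is a set of ligand identifiers; the function name `overlap_matrix`, the area names, and the identifiers are hypothetical and do not come from the dataset.

```python
def overlap_matrix(areas):
    """Pairwise overlap between application areas (illustrative only).

    areas: dict mapping area name -> set of ligand ids.
    Returns a dict keyed by (row, column) with
    (number of shared ligands, percent of the row area's ligands that are shared).
    """
    table = {}
    for row, row_ligands in areas.items():
        for col, col_ligands in areas.items():
            shared = len(row_ligands & col_ligands)
            percent = 100.0 * shared / len(row_ligands) if row_ligands else 0.0
            table[(row, col)] = (shared, percent)
    return table

# Hypothetical areas of different sizes.
areas = {
    "catalysis": {1, 2, 3, 4},
    "photophysics": {3, 4},
}
table = overlap_matrix(areas)
print(table[("catalysis", "photophysics")])  # shared count and percent of catalysis ligands
print(table[("photophysics", "catalysis")])  # same count, percent of photophysics ligands
```

The two off-diagonal cells share the same count (2) but report 50% and 100% respectively, because each percentage is normalized by the size of its own row category. That is the asymmetry between the upper and lower triangles noted in the caption.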
Original abstract

Understanding ligand properties is essential for computational high-throughput screening of transition metal complexes. However, ligand properties such as net charge and other information such as their application area are often absent or inconsistently recorded in crystallographic datasets. Here, we construct a ligand dataset from 126,985 mononuclear transition metal complexes curated from the Cambridge Structural Database. Using an iterative charge-balancing workflow that combines complex charges, metal oxidation states, and consensus across crystallographic observations, we confidently assign net charges to 66,810 ligands among 94,581 identified unique ligand structures to curate the Boston Open-Shell Ligand (BOS-Lig) dataset. The workflow assigns ligand charges in homoleptic complexes first and then iteratively propagates these assignments across heteroleptic environments, allowing charges to be inferred even when direct charge information is unavailable. We analyze cases where simple heuristics such as the octet rule would have failed and introduce a purity metric to identify when our charge assignments may be incorrect. Each ligand is also classified in terms of its metal coordinating atoms and whether there are multiple variants (i.e., hemilability). We then link complexes to their associated journal abstracts and apply a topic-modeling workflow to link 25,146 ligands with functional application areas spanning reactivity, redox chemistry, biological chemistry, and photophysical chemistry. Together, we provide an experimentally grounded dataset of ligand chemical space that connects charge and functional application as a foundation for computational screening and data-driven ligand design.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The manuscript describes the curation of the BOS-Lig dataset from 126,985 mononuclear transition metal complexes extracted from the Cambridge Structural Database. An iterative charge-balancing workflow first assigns charges in homoleptic complexes using reported complex charges and metal oxidation states, then propagates assignments via consensus to heteroleptic cases, resulting in confident net-charge assignments for 66,810 ligands out of 94,581 unique structures. The work also classifies ligands by coordinating atoms and hemilability, introduces a purity metric to flag potential errors, analyzes failures of simple heuristics such as the octet rule, and applies topic modeling to link 25,146 ligands to application areas (reactivity, redox, biological, and photophysical chemistry).

Significance. If the charge assignments prove reliable, the dataset would provide a large-scale, experimentally grounded resource for high-throughput computational screening and data-driven design of transition-metal ligands, directly addressing the frequent absence of charge information in crystallographic databases. The transparency around inference limits, the purity metric, and the linkage to functional applications strengthen its potential utility as a foundation for ligand-property modeling.

major comments (2)
  1. [Abstract and methods (iterative charge-balancing workflow)] Abstract and workflow description: the central claim that the iterative consensus approach 'confidently assign[s]' net charges to 66,810 ligands rests on propagation from homoleptic to heteroleptic environments, yet no quantitative validation (e.g., error rates on a held-out set of ligands with independently known charges, or agreement statistics against a gold-standard subset) is reported; the purity metric is defined but its correlation with actual accuracy is not demonstrated, leaving the 'accurate' descriptor unsupported by numerical evidence.
  2. [Methods (iterative propagation to heteroleptic environments)] Section describing heteroleptic propagation: the assumption that consensus across crystallographic observations reliably infers charges when direct information is unavailable is load-bearing for the dataset size and utility, but the manuscript provides no systematic quantification of inconsistency rates or failure modes specific to mixed-ligand complexes, which could undermine the weakest assumption identified in the review.
minor comments (1)
  1. [Results] The manuscript would benefit from explicit statements of the total number of unique ligands before and after filtering, and from a table summarizing the distribution of assigned charges and purity scores.

Simulated Author's Rebuttal

2 responses · 1 unresolved

We thank the referee for their constructive feedback on the BOS-Lig dataset manuscript. The comments highlight important aspects of validation for the charge assignment workflow. We address each major comment below and will incorporate additional quantitative analyses in the revised version to better support the reliability claims.

Point-by-point responses
  1. Referee: [Abstract and methods (iterative charge-balancing workflow)] Abstract and workflow description: the central claim that the iterative consensus approach 'confidently assign[s]' net charges to 66,810 ligands rests on propagation from homoleptic to heteroleptic environments, yet no quantitative validation (e.g., error rates on a held-out set of ligands with independently known charges, or agreement statistics against a gold-standard subset) is reported; the purity metric is defined but its correlation with actual accuracy is not demonstrated, leaving the 'accurate' descriptor unsupported by numerical evidence.

    Authors: We agree that the manuscript would benefit from stronger numerical support for the charge assignments. A comprehensive external gold-standard dataset of independently verified ligand charges does not exist in the literature, which precludes a traditional held-out validation with known error rates. However, we will revise the manuscript to include internal agreement statistics: the fraction of ligands assigned directly from homoleptic complexes versus those inferred via propagation, and the rate at which consensus resolves conflicts. We will also add an analysis demonstrating the purity metric's correlation with observed inconsistencies (e.g., higher conflict frequency for low-purity assignments). These changes will be placed in the Methods and Results sections, and the abstract wording will be adjusted for precision. revision: partial

  2. Referee: [Methods (iterative propagation to heteroleptic environments)] Section describing heteroleptic propagation: the assumption that consensus across crystallographic observations reliably infers charges when direct information is unavailable is load-bearing for the dataset size and utility, but the manuscript provides no systematic quantification of inconsistency rates or failure modes specific to mixed-ligand complexes, which could undermine the weakest assumption identified in the review.

    Authors: We acknowledge that systematic quantification of the propagation step is needed to substantiate the assumption. In the revised manuscript, we will add a new subsection to Methods that reports inconsistency rates during iterative propagation to heteroleptic complexes. This will include the percentage of cases resolved by majority consensus, the number of ligands discarded due to unresolved conflicts, and discussion of failure modes such as ligands exhibiting variable coordination modes or rare charge states across observations. Concrete examples of both successful and failed propagations will be provided to illustrate the process. revision: yes

standing simulated objections not resolved
  • Quantitative error rates on a held-out set of ligands with independently known charges, as no such external gold-standard dataset is available in the field.

Circularity Check

0 steps flagged

No significant circularity: external database curation with transparent inference rules

full rationale

The manuscript presents a data curation pipeline that ingests experimental entries from the Cambridge Structural Database, seeds charge assignments from homoleptic complexes whose net charges and metal oxidation states are directly reported, and then propagates those assignments via consensus to heteroleptic cases. No equations, fitted parameters, or predictions are introduced that reduce to the output by construction; the workflow is a deterministic rule-based procedure whose inputs remain external crystallographic records. No self-citations are used to justify uniqueness or to smuggle in ansatzes, and no known empirical patterns are merely renamed. The resulting BOS-Lig dataset is therefore self-contained against the external CSD benchmark, with explicit purity metrics and failure-case analysis provided to users.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claim rests on the validity of charge-balancing rules derived from complex charge and oxidation state data plus the accuracy of CSD records; no free parameters or new entities are introduced.

axioms (2)
  • domain assumption Net ligand charge can be reliably determined from overall complex charge and metal oxidation state
    This forms the foundation of the iterative charge-balancing workflow described in the abstract.
  • domain assumption Crystallographic observations in the Cambridge Structural Database provide accurate complex charges and oxidation states
    The workflow depends on this input data being correct and consistent.
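
The first axiom amounts to a charge balance over the complex. Written out, with symbols that are notation chosen here rather than the paper's ($Q$ for the net complex charge, $z_M$ for the metal oxidation state, and $n_i$, $q_i$ for the number of copies and net charge of ligand type $i$), it reads:

```latex
Q = z_M + \sum_i n_i q_i
\qquad\Longrightarrow\qquad
q_k = \frac{Q - z_M - \sum_{i \neq k} n_i q_i}{n_k}
```

For a homoleptic complex the sum over $i \neq k$ is empty, so $q_k = (Q - z_M)/n_k$. As a textbook check, [Fe(CN)6]4- has $Q = -4$, $z_M = +2$ and $n = 6$, which gives $q = (-4 - 2)/6 = -1$ per cyanide. In a heteroleptic complex the same expression determines one ligand's charge only once every other ligand charge in the sum is already known.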

pith-pipeline@v0.9.0 · 5592 in / 1406 out tokens · 80701 ms · 2026-05-10T18:31:54.650012+00:00 · methodology

