The BOS-Lig Dataset: Accurate Ligand Charges from a Consensus Approach for 66,810 Experimentally Synthesized Ligands
Pith reviewed 2026-05-10 18:31 UTC · model grok-4.3
The pith
An iterative consensus workflow assigns reliable net charges to 66,810 ligands from 126,985 experimental transition metal complexes.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
By applying an iterative charge-balancing workflow that combines complex charges, metal oxidation states, and consensus across crystallographic observations, net charges can be confidently assigned to 66,810 ligands among 94,581 unique structures extracted from 126,985 mononuclear transition metal complexes. The process begins with homoleptic complexes and propagates assignments to heteroleptic environments, allowing inference even when direct charge data is absent. Each ligand receives additional labels for metal-coordinating atoms and hemilability, while 25,146 ligands are linked to application domains through topic modeling of associated abstracts.
What carries the argument
The iterative charge-balancing workflow, which settles charges first in homoleptic complexes then propagates them via consensus with metal oxidation states to handle heteroleptic cases.
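The seed-then-propagate idea can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the record layout (`charge`, `ox_state`, `ligands`), the function name `assign_ligand_charges`, and the rule of solving only complexes with exactly one unassigned ligand type are all assumptions made for the example, and the three records are toy inputs rather than entries from the dataset.

```python
from collections import Counter, defaultdict

def assign_ligand_charges(complexes, max_rounds=10):
    """Hypothetical sketch: seed from homoleptic complexes, then propagate."""
    votes = defaultdict(Counter)  # ligand id -> tally of inferred charges

    # Seed step: a homoleptic complex has one ligand type, so the charge left
    # after removing the metal oxidation state is split evenly.
    for c in complexes:
        ligs = c["ligands"]
        if len(set(ligs)) == 1:
            total = c["charge"] - c["ox_state"]
            if total % len(ligs) == 0:
                votes[ligs[0]][total // len(ligs)] += 1

    # Consensus: take the most frequently observed charge per ligand.
    assigned = {lig: v.most_common(1)[0][0] for lig, v in votes.items()}

    # Propagation: in each round, solve complexes with exactly one
    # still-unassigned ligand type by charge balance.
    for _ in range(max_rounds):
        new_votes = defaultdict(Counter)
        for c in complexes:
            unknown = {l for l in c["ligands"] if l not in assigned}
            if len(unknown) != 1:
                continue
            lig = next(iter(unknown))
            n = c["ligands"].count(lig)
            known_sum = sum(assigned[l] for l in c["ligands"] if l != lig)
            remainder = c["charge"] - c["ox_state"] - known_sum
            if remainder % n == 0:
                new_votes[lig][remainder // n] += 1
        if not new_votes:
            break
        for lig, v in new_votes.items():
            assigned[lig] = v.most_common(1)[0][0]
    return assigned

# Toy records (illustrative only).
toy = [
    {"charge": 2, "ox_state": 2, "ligands": ["bpy", "bpy", "bpy"]},
    {"charge": -2, "ox_state": 2, "ligands": ["Cl", "Cl", "Cl", "Cl"]},
    {"charge": 1, "ox_state": 2, "ligands": ["bpy", "bpy", "acac"]},
]
charges = assign_ligand_charges(toy)
```

Here `bpy` and `Cl` are fixed by the two homoleptic records, and `acac` is then inferred from the heteroleptic one. A real pipeline would also need to handle conflicting votes across rounds and ligands that never become the sole unknown.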
If this is right
- Consistent charges become available for screening libraries of transition metal complexes in reactivity and photophysical applications.
- The purity metric flags assignments likely to be unreliable, allowing users to filter the dataset for high-confidence subsets.
- Classification of coordinating atoms and hemilability supports targeted searches for ligands with specific binding behavior.
- Linking ligands to application topics through abstracts provides an experimentally grounded starting point for data-driven ligand selection.
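Filtering on a purity score might look like the sketch below. The summary does not give the paper's definition of purity, so the code assumes one for illustration: the fraction of a ligand's observed charges that agree with its most common charge. The function names and the 0.9 threshold are likewise hypothetical.

```python
from collections import Counter

def purity(observed_charges):
    """Return (consensus charge, fraction of observations that agree with it).

    Assumed definition, used here only for illustration.
    """
    counts = Counter(observed_charges)
    consensus, n_agree = counts.most_common(1)[0]
    return consensus, n_agree / len(observed_charges)

def high_confidence(ligand_observations, threshold=0.9):
    """Keep only ligands whose purity meets the (hypothetical) threshold."""
    kept = {}
    for lig, observed in ligand_observations.items():
        consensus, score = purity(observed)
        if score >= threshold:
            kept[lig] = consensus
    return kept

# Toy observations: ligand "a" is unanimous, ligand "b" is contested.
subset = high_confidence({"a": [0, 0, 0], "b": [-1, -1, 0, -2]})
```

Under this assumed definition, ligand `b` has a purity of 0.5 and is dropped, while ligand `a` is kept with charge 0.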
Where Pith is reading between the lines
- The dataset could serve as training data for machine-learning models that predict charges or properties for ligands in yet-unseen complexes.
- The propagation logic might be adapted to assign charges in polynuclear or supramolecular systems where multiple metals share ligands.
- The hemilability flags could guide experimental design of switchable catalysts that respond to external stimuli.
Load-bearing premise
Consensus across multiple crystallographic observations and known metal oxidation states can correctly infer ligand net charges even when no direct charge measurement exists for that ligand.
What would settle it
A charge assigned to a ligand across several complexes would be shown to be incorrect if independent experimental data, such as a measured pKa or redox potential for the isolated ligand or for a well-characterized complex, consistently contradicted the assigned value while remaining consistent with the overall complex charge.
Original abstract
Understanding ligand properties is essential for computational high-throughput screening of transition metal complexes. However, ligand properties such as net charge and other information such as their application area are often absent or inconsistently recorded in crystallographic datasets. Here, we construct a ligand dataset from 126,985 mononuclear transition metal complexes curated from the Cambridge Structural Database. Using an iterative charge-balancing workflow that combines complex charges, metal oxidation states, and consensus across crystallographic observations, we confidently assign net charges to 66,810 ligands among 94,581 identified unique ligand structures to curate the Boston Open-Shell Ligand (BOS-Lig) dataset. The workflow assigns ligand charges in homoleptic complexes first and then iteratively propagates these assignments across heteroleptic environments, allowing charges to be inferred even when direct charge information is unavailable. We analyze cases where simple heuristics such as the octet rule would have failed and introduce a purity metric to identify when our charge assignments may be incorrect. Each ligand is also classified in terms of its metal coordinating atoms and whether there are multiple variants (i.e., hemilability). We then link complexes to their associated journal abstracts and apply a topic-modeling workflow to link 25,146 ligands with functional application areas spanning reactivity, redox chemistry, biological chemistry, and photophysical chemistry. Together, we provide an experimentally grounded dataset of ligand chemical space that connects charge and functional application as a foundation for computational screening and data-driven ligand design.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript describes the curation of the BOS-Lig dataset from 126,985 mononuclear transition metal complexes extracted from the Cambridge Structural Database. An iterative charge-balancing workflow first assigns charges in homoleptic complexes using reported complex charges and metal oxidation states, then propagates assignments via consensus to heteroleptic cases, resulting in confident net-charge assignments for 66,810 ligands out of 94,581 unique structures. The work also classifies ligands by coordinating atoms and hemilability, introduces a purity metric to flag potential errors, analyzes failures of simple heuristics such as the octet rule, and applies topic modeling to link 25,146 ligands to application areas (reactivity, redox, biological, and photophysical chemistry).
Significance. If the charge assignments prove reliable, the dataset would provide a large-scale, experimentally grounded resource for high-throughput computational screening and data-driven design of transition-metal ligands, directly addressing the frequent absence of charge information in crystallographic databases. The transparency around inference limits, the purity metric, and the linkage to functional applications strengthen its potential utility as a foundation for ligand-property modeling.
major comments (2)
- [Abstract and methods (iterative charge-balancing workflow)] Abstract and workflow description: the central claim that the iterative consensus approach 'confidently assign[s]' net charges to 66,810 ligands rests on propagation from homoleptic to heteroleptic environments, yet no quantitative validation (e.g., error rates on a held-out set of ligands with independently known charges, or agreement statistics against a gold-standard subset) is reported; the purity metric is defined but its correlation with actual accuracy is not demonstrated, leaving the 'accurate' descriptor unsupported by numerical evidence.
- [Methods (iterative propagation to heteroleptic environments)] Section describing heteroleptic propagation: the assumption that consensus across crystallographic observations reliably infers charges when direct information is unavailable is load-bearing for the dataset size and utility, but the manuscript provides no systematic quantification of inconsistency rates or failure modes specific to mixed-ligand complexes, which could undermine the weakest assumption identified in the review.
minor comments (1)
- [Results] The manuscript would benefit from explicit statements of the total number of unique ligands before and after filtering, and from a table summarizing the distribution of assigned charges and purity scores.
Simulated Author's Rebuttal
We thank the referee for their constructive feedback on the BOS-Lig dataset manuscript. The comments highlight important aspects of validation for the charge assignment workflow. We address each major comment below and will incorporate additional quantitative analyses in the revised version to better support the reliability claims.
- Referee: [Abstract and methods (iterative charge-balancing workflow)] Abstract and workflow description: the central claim that the iterative consensus approach 'confidently assign[s]' net charges to 66,810 ligands rests on propagation from homoleptic to heteroleptic environments, yet no quantitative validation (e.g., error rates on a held-out set of ligands with independently known charges, or agreement statistics against a gold-standard subset) is reported; the purity metric is defined but its correlation with actual accuracy is not demonstrated, leaving the 'accurate' descriptor unsupported by numerical evidence.
Authors: We agree that the manuscript would benefit from stronger numerical support for the charge assignments. A comprehensive external gold-standard dataset of independently verified ligand charges does not exist in the literature, which precludes a traditional held-out validation with known error rates. However, we will revise the manuscript to include internal agreement statistics: the fraction of ligands assigned directly from homoleptic complexes versus those inferred via propagation, and the rate at which consensus resolves conflicts. We will also add an analysis demonstrating the purity metric's correlation with observed inconsistencies (e.g., higher conflict frequency for low-purity assignments). These changes will be placed in the Methods and Results sections, and the abstract wording will be adjusted for precision.
Revision: partial
- Referee: [Methods (iterative propagation to heteroleptic environments)] Section describing heteroleptic propagation: the assumption that consensus across crystallographic observations reliably infers charges when direct information is unavailable is load-bearing for the dataset size and utility, but the manuscript provides no systematic quantification of inconsistency rates or failure modes specific to mixed-ligand complexes, which could undermine the weakest assumption identified in the review.
Authors: We acknowledge that systematic quantification of the propagation step is needed to substantiate the assumption. In the revised manuscript, we will add a new subsection to Methods that reports inconsistency rates during iterative propagation to heteroleptic complexes. This will include the percentage of cases resolved by majority consensus, the number of ligands discarded due to unresolved conflicts, and discussion of failure modes such as ligands exhibiting variable coordination modes or rare charge states across observations. Concrete examples of both successful and failed propagations will be provided to illustrate the process.
Revision: yes
- Quantitative error rates on a held-out set of ligands with independently known charges, as no such external gold-standard dataset is available in the field.
Circularity Check
No significant circularity: external database curation with transparent inference rules
full rationale
The manuscript presents a data curation pipeline that ingests experimental entries from the Cambridge Structural Database, seeds charge assignments from homoleptic complexes whose net charges and metal oxidation states are directly reported, and then propagates those assignments via consensus to heteroleptic cases. No equations, fitted parameters, or predictions are introduced that reduce to the output by construction; the workflow is a deterministic rule-based procedure whose inputs remain external crystallographic records. No self-citations are used to justify uniqueness or to smuggle in ansatzes, and no known empirical patterns are merely renamed. The resulting BOS-Lig dataset is therefore self-contained against the external CSD benchmark, with explicit purity metrics and failure-case analysis provided to users.
Axiom & Free-Parameter Ledger
axioms (2)
- domain assumption: Net ligand charge can be reliably determined from overall complex charge and metal oxidation state
- domain assumption: Crystallographic observations in the Cambridge Structural Database provide accurate complex charges and oxidation states
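The first assumption amounts to a charge-balance identity for a mononuclear complex. A sketch in LaTeX follows; the symbols are introduced here for illustration and are not taken from the paper: $q_{\text{complex}}$ is the net complex charge, $\mathrm{OS}_M$ the metal oxidation state, and $n_i$ the number of copies of ligand $L_i$ with net charge $q_{L_i}$.

```latex
% Charge balance for a mononuclear complex (notation assumed for illustration)
q_{\text{complex}} = \mathrm{OS}_M + \sum_i n_i \, q_{L_i}

% Homoleptic case: a single ligand type L with n copies
q_L = \frac{q_{\text{complex}} - \mathrm{OS}_M}{n}

% Heteroleptic case with one unassigned ligand type L_u
q_{L_u} = \frac{q_{\text{complex}} - \mathrm{OS}_M - \sum_{i \neq u} n_i \, q_{L_i}}{n_u}
```

The second assumption matters because every term on the right-hand side of the last two expressions is read from the crystallographic record, so an error in a reported complex charge or oxidation state passes directly into the inferred ligand charge.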