Designing compact training sets for data-driven molecular property prediction

Bowen Li; Srinivas Rangarajan

arxiv: 1906.10273 · v1 · pith:X4ZBJ7PJnew · submitted 2019-06-25 · ⚛️ physics.data-an · physics.comp-ph

Pith reviewed 2026-05-25 16:29 UTC · model grok-4.3

keywords: molecular property prediction · training set selection · D-optimality · diversity selection · epsilon-greedy · group additivity · kernel ridge regression · cheminformatics

The pith

Epsilon-greedy selection lets sparse group additive models match full-set accuracy with as little as 15 percent of the molecules.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper develops a selection procedure that mixes D-optimality from design-of-experiments criteria with diversity-maximizing subset selection inside an epsilon-greedy loop. For sparse generalized group additivity models this balance produces compact training sets whose accuracy is similar to that of five-fold cross-validation on the entire library, sometimes using only 15 percent of the molecules. Kernel ridge regression, by contrast, performs best when selections rely purely on diversity. The tests are performed on subsets drawn from the QM7, NIST and catalysis databases.

Core claim

The authors demonstrate that an epsilon-greedy combination of rigorous model-based design of experiments and cheminformatics-based diversity-maximizing subset selection can systematically reduce the number of molecules required to train sparse group additive models and kernel ridge regression, with the former class reaching the accuracy of five-fold cross validation on the full set while using fractions as small as 15 percent of the data.
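
For readers unfamiliar with the criterion, the block below gives the textbook form of D-optimality for a linear model together with the two rank-one identities that make a greedy search cheap. It is not quoted from the paper. The symbols are assumptions of this sketch: $X_S$ is the feature matrix of the selected molecules, $x$ is the feature vector of one candidate, $n$ is the target subset size, and $\lambda$ is a small ridge term added here only to keep $M_S$ invertible.

```latex
\begin{aligned}
S^\star &= \arg\max_{|S| = n} \det\left(M_S\right),
  \qquad M_S = X_S^\top X_S + \lambda I, \\
\det\left(M_S + x x^\top\right) &= \det\left(M_S\right)\left(1 + x^\top M_S^{-1} x\right), \\
\left(M_S + x x^\top\right)^{-1} &= M_S^{-1}
  - \frac{M_S^{-1} x \, x^\top M_S^{-1}}{1 + x^\top M_S^{-1} x}.
\end{aligned}
```

The second line (the matrix determinant lemma) shows that a greedy step only needs to rank candidates by $x^\top M_S^{-1} x$, and the third line (the Sherman-Morrison formula) updates the inverse after each pick without a fresh factorization.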

What carries the argument

The epsilon-greedy framework that interleaves D-optimality selection for exploitation with diversity-maximizing selection for exploration.

If this is right

Sparse group additive models reach equivalent accuracy with substantially fewer molecules than the full library.
Kernel ridge regression achieves its best results when training sets are chosen by diversity criteria alone.
The combined selection method works on subsets from QM7, NIST and catalysis libraries.
Systematic reduction of required training data is possible for these two model classes.
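
To make the kernel ridge regression point concrete, the sketch below trains a Gaussian-kernel ridge model on a diversity-only (farthest-point) subset. Everything specific here is an assumption for illustration: the kernel, the length scale `sigma`, the regularizer `lam`, the helper names, and the synthetic two-dimensional data stand in for descriptors and settings that this page does not specify.

```python
import numpy as np

def farthest_point_subset(X, n_select, start=0):
    """Greedy maximin selection: repeatedly add the row farthest from the chosen set.

    Chosen rows have distance 0 to the set, so they are not picked again
    unless the remaining rows are exact duplicates of chosen ones.
    """
    selected = [start]
    min_dist = np.linalg.norm(X - X[start], axis=1)
    for _ in range(n_select - 1):
        pick = int(np.argmax(min_dist))
        selected.append(pick)
        min_dist = np.minimum(min_dist, np.linalg.norm(X - X[pick], axis=1))
    return selected

def gaussian_kernel(A, B, sigma):
    """Pairwise Gaussian kernel exp(-||a - b||^2 / (2 sigma^2))."""
    sq = np.sum(A**2, axis=1)[:, None] + np.sum(B**2, axis=1)[None, :] - 2.0 * A @ B.T
    return np.exp(-np.maximum(sq, 0.0) / (2.0 * sigma**2))

def krr_fit_predict(X_train, y_train, X_test, sigma=1.0, lam=1e-3):
    """Closed-form kernel ridge regression: alpha = (K + lam I)^{-1} y."""
    K = gaussian_kernel(X_train, X_train, sigma)
    alpha = np.linalg.solve(K + lam * np.eye(len(X_train)), y_train)
    return gaussian_kernel(X_test, X_train, sigma) @ alpha

# Usage on a smooth synthetic target (not molecular data).
rng = np.random.default_rng(0)
X_lib = rng.uniform(-3.0, 3.0, size=(200, 2))
y_lib = np.sin(X_lib[:, 0]) + 0.5 * np.cos(X_lib[:, 1])
subset = farthest_point_subset(X_lib, 40)          # 20% of the library
pred = krr_fit_predict(X_lib[subset], y_lib[subset], X_lib)
mae = float(np.mean(np.abs(pred - y_lib)))
```

A kernel model interpolates between training points, so a subset that covers descriptor space evenly is a natural fit for it; that intuition is consistent with the reported preference, though the sketch itself proves nothing about the paper's datasets.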

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same selection rule could be applied to graph neural networks or other modern molecular models not tested here.
If the compact sets remain effective on new libraries, high-throughput screening could avoid many costly property calculations.
Active-learning loops might adopt the epsilon-greedy rule to decide when to acquire additional molecules.

Load-bearing premise

The compact sets chosen by the epsilon-greedy rule will generalize beyond the specific QM7, NIST and catalysis subsets examined in the experiments.

What would settle it

Run the selection procedure on an independent molecular library never used in the original tests and measure whether the resulting compact set produces cross-validation error within a few percent of the full-set result for the sparse additive model.
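
The check described above could be scripted roughly as follows. This is a sketch under assumptions, not the paper's protocol: plain ridge regression stands in for the sparse group additive model, the compact-set model is scored on the molecules it did not see, and `relative_gap`, `five_fold_cv_rmse`, and the synthetic data are hypothetical names and inputs introduced for this example.

```python
import numpy as np

def ridge_fit(X, y, lam=1e-3):
    """Ridge regression weights via the normal equations."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

def rmse(X, y, w):
    return float(np.sqrt(np.mean((X @ w - y) ** 2)))

def five_fold_cv_rmse(X, y, seed=0):
    """Mean held-out RMSE over five random folds of the full library."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(y)), 5)
    errs = []
    for k in range(5):
        test = folds[k]
        train = np.concatenate([folds[j] for j in range(5) if j != k])
        errs.append(rmse(X[test], y[test], ridge_fit(X[train], y[train])))
    return float(np.mean(errs))

def relative_gap(X, y, compact_idx):
    """(compact-set RMSE on unseen rows - full-set CV RMSE) / full-set CV RMSE."""
    unseen = np.ones(len(y), dtype=bool)
    unseen[compact_idx] = False
    w = ridge_fit(X[compact_idx], y[compact_idx])
    compact_err = rmse(X[unseen], y[unseen], w)
    full_err = five_fold_cv_rmse(X, y)
    return (compact_err - full_err) / full_err

# Usage on synthetic data; the first 45 rows play the role of a 15% compact set.
rng = np.random.default_rng(0)
X_new = rng.normal(size=(300, 5))
y_new = X_new @ np.array([1.0, -2.0, 0.5, 0.0, 3.0]) + 0.1 * rng.normal(size=300)
gap = relative_gap(X_new, y_new, compact_idx=np.arange(45))
```

On a real independent library, `compact_idx` would come from the selection procedure, and a `gap` of a few percent or less would count as agreement in the sense used above.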

Figures

Figures reproduced from arXiv: 1906.10273 by Bowen Li, Srinivas Rangarajan.

**Figure 2.** Comparison of the learning rates of different molecule selection strategies on the QM7 dataset for GA-based …

**Figure 3.** Comparison of the learning rate of different …

**Figure 4.** Comparison of the learning rates of different molecule selection strategies on the NIST chemistry webbook …

**Figure 5.** Comparison of the learning rates of different molecule selection strategies on the surface intermediates dataset …

**Figure 6.** Comparison of the learning rates of different molecule selection strategies on the QM7 dataset for kernel-based …

**Figure 7.** Stopping criterion for the epsilon greedy method ( …

**Figure 8.** Illustrations of linear substructures and correction terms of modified pathway fingerprints on propane and a …

**Figure 9.** Comparison of the average RMSE/MAE with different regularization values for five models during five-fold …

**Figure 10.** Comparison of learning rates for 10 executions of variance sampling strategy started with different initial …

Original abstract

In this paper, we consider the problem of designing a training set using the most informative molecules from a specified library to build data-driven molecular property models. Specifically, using (i) sparse generalized group additivity and (ii) kernel ridge regression as two representative classes of models, we propose a method combining rigorous model-based design of experiments and cheminformatics-based diversity-maximizing subset selection within the epsilon-greedy framework to systematically minimize the amount of data needed to train these models. We demonstrate the effectiveness of the algorithm on subsets of various databases, including QM7, NIST, and a catalysis dataset. For sparse group additive models, a balance between exploration (diversity-maximizing selection) and exploitation (D-optimality selection) leads to learning with a fraction (sometimes as little as 15%) of the data to achieve similar accuracy as five-fold cross validation on the entire set. On the other hand, kernel ridge regression prefers diversity-maximizing selections.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The epsilon-greedy mix of D-optimality and diversity selection lets sparse group-additive models reach full-set accuracy with as little as ~15% of the data on subsets of QM7, NIST, and a catalysis collection, while KRR favors pure diversity; the balance looks library-dependent.

The headline result is that an epsilon-greedy combination of model-based D-optimality and cheminformatics diversity selection produces compact training sets for sparse group-additive models. On the three libraries tested, this sometimes reaches the accuracy of five-fold cross-validation on the full set with only 15% of the molecules. Kernel ridge regression, by contrast, does better with the diversity-only selections. The paper therefore gives a concrete, algorithmic way to trade off exploitation and exploration inside a single selection loop rather than tuning a weighted objective by hand. That is the actual increment over prior active-learning and subset-selection work in the area.

The tests are run on standard public databases, which makes the numbers easy to check and compare against existing baselines. For anyone who regularly builds property models and pays for quantum calculations or experiments, the reported data reduction is the kind of practical outcome worth noting.

The main limitation is that every number comes from subsets drawn from QM7, NIST, and one catalysis collection. Both the group-additivity basis and the kernel are fitted to the molecular distributions inside each library, so the observed sweet spot between the two selection criteria could be an artifact of those particular distributions. The abstract gives no cross-library transfer test or out-of-distribution hold-out that would show the same epsilon value works elsewhere. Without those checks, the 15% figure should be treated as library-specific until more evidence appears.

The paper is aimed at people who already work on data-efficient molecular modeling or design-of-experiments methods in cheminformatics. A reader who needs to pick small training sets for additive or kernel models will find the direct comparison useful. It is solid enough on its own terms to go to referees; the algorithmic description and the public-database results are clear enough that a serious review can focus on whether the generalization claim holds up rather than on whether the work is coherent.

Referee Report

2 major / 1 minor

Summary. The manuscript proposes an epsilon-greedy algorithm that interleaves D-optimality (exploitation) with cheminformatics diversity maximization (exploration) to select compact training subsets for two model classes: sparse generalized group-additivity and kernel ridge regression. On subsets drawn from QM7, NIST, and a catalysis collection, the method is reported to produce group-additive models whose accuracy matches five-fold cross-validation on the full library while using as little as 15% of the data; kernel ridge regression is shown to favor pure diversity selection.

Significance. If the performance advantage generalizes, the approach supplies a concrete, model-aware procedure for reducing the data volume required to train accurate molecular-property predictors, which would lower the barrier to data-driven screening in chemistry and catalysis.

major comments (2)

[Abstract] Abstract: the central claim that an epsilon-greedy balance yields compact sets whose accuracy matches full-library 5-fold CV rests exclusively on three in-library test collections (QM7, NIST, catalysis). No cross-library or out-of-distribution hold-out experiments are described, leaving open whether the reported 15% fraction and optimal epsilon are artifacts of the particular molecular distributions rather than a general property of the algorithm.
[Abstract] Abstract (results paragraph): the manuscript provides no statistical comparison (error bars, paired t-tests, or bootstrap intervals) between the compact-set and full-set accuracies, so it is impossible to determine whether the observed parity is statistically meaningful or within the variability of the cross-validation procedure itself.
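
One way to supply the interval requested in the second comment is a paired bootstrap over per-molecule errors. The sketch below is an illustration only: the function name, the percentile method, and the assumption that both models are scored on the same test molecules are choices made here, not details taken from the manuscript.

```python
import numpy as np

def paired_bootstrap_ci(err_compact, err_full, n_boot=2000, alpha=0.05, seed=0):
    """Percentile interval for mean(err_compact - err_full).

    Both inputs are per-molecule absolute errors on the SAME test molecules,
    so each bootstrap draw resamples molecules and keeps the pairing intact.
    """
    diff = np.asarray(err_compact, dtype=float) - np.asarray(err_full, dtype=float)
    rng = np.random.default_rng(seed)
    draws = rng.integers(0, len(diff), size=(n_boot, len(diff)))
    means = diff[draws].mean(axis=1)
    lower, upper = np.quantile(means, [alpha / 2, 1 - alpha / 2])
    return float(diff.mean()), float(lower), float(upper)

# Usage with synthetic per-molecule errors (placeholders, not paper results).
rng = np.random.default_rng(0)
err_full_demo = np.abs(rng.normal(scale=1.0, size=200))
err_compact_demo = err_full_demo + rng.normal(scale=0.05, size=200)
mean_diff, ci_low, ci_high = paired_bootstrap_ci(err_compact_demo, err_full_demo)
```

If the resulting interval contains zero and is narrow relative to the error itself, the claimed parity between compact-set and full-set accuracy is at least not contradicted by sampling noise.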

minor comments (1)

[Abstract] The abstract does not define the precise form of the group-additivity basis or the kernel used, making it difficult for a reader to reproduce the D-optimality calculations without the full methods section.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments. We address each major point below and indicate where revisions will be made.

Referee: [Abstract] Abstract: the central claim that an epsilon-greedy balance yields compact sets whose accuracy matches full-library 5-fold CV rests exclusively on three in-library test collections (QM7, NIST, catalysis). No cross-library or out-of-distribution hold-out experiments are described, leaving open whether the reported 15% fraction and optimal epsilon are artifacts of the particular molecular distributions rather than a general property of the algorithm.

Authors: We agree that all reported experiments use in-library hold-outs and that no cross-library or out-of-distribution tests are presented. The manuscript's scope is the selection of compact subsets from a single given library; the algorithm itself does not assume a specific distribution. We will revise the abstract and discussion to explicitly state this scope and to avoid implying broader generalization beyond the tested libraries.
Revision: yes

Referee: [Abstract] Abstract (results paragraph): the manuscript provides no statistical comparison (error bars, paired t-tests, or bootstrap intervals) between the compact-set and full-set accuracies, so it is impossible to determine whether the observed parity is statistically meaningful or within the variability of the cross-validation procedure itself.

Authors: The referee correctly notes the absence of explicit statistical comparisons in the abstract. While the full manuscript reports results averaged over multiple random seeds, we acknowledge that formal error bars or hypothesis tests are not provided. We will add appropriate statistical measures (e.g., standard deviations across runs and, where feasible, paired comparisons) to the abstract and results sections.
Revision: yes

Circularity Check

0 steps flagged

No circularity detected; method is an empirical algorithmic procedure

The paper presents an algorithmic framework that combines D-optimality-based exploitation with diversity-maximizing exploration inside an epsilon-greedy loop to select compact training subsets for sparse group-additive and kernel ridge regression models. All reported performance numbers are obtained by direct comparison against five-fold cross-validation on the full QM7, NIST, and catalysis libraries; no equations, fitted parameters, or uniqueness theorems are invoked that reduce to the inputs by construction. The central result (that a balanced epsilon-greedy policy can reach comparable accuracy with as little as 15% of the data) is therefore an empirical observation rather than a self-referential derivation, and no load-bearing self-citations or ansatzes imported from prior work by the same authors appear in the described chain.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract provides no explicit free parameters, axioms, or invented entities; the method description does not introduce new physical quantities or unstated mathematical assumptions beyond standard DOE and cheminformatics practices.

[1] [1]

Redox ﬂow batteries: a review

Adam Z Weber, Matthew M Mench, Jeremy P Meyers, Philip N Ross, Jeffrey T Gostick, and Qinghua Liu. Redox ﬂow batteries: a review. Journal of Applied Electrochemistry, 41(10):1137, 2011

work page 2011

[2] [2]

Machine-learning-augmented chemisorption model for co2 electroreduction catalyst screening

Xianfeng Ma, Zheng Li, Luke EK Achenie, and Hongliang Xin. Machine-learning-augmented chemisorption model for co2 electroreduction catalyst screening. The journal of physical chemistry letters, 6(18):3528–3533, 2015. 11

work page 2015

[3] [3]

Identiﬁcation of potential photovoltaic absorbers based on ﬁrst-principles spectro- scopic screening of materials

Liping Yu and Alex Zunger. Identiﬁcation of potential photovoltaic absorbers based on ﬁrst-principles spectro- scopic screening of materials. Physical review letters, 108(6):068701, 2012

work page 2012

[4] [4]

Assessment and validation of machine learning methods for predicting molecular atomization energies

Katja Hansen, Grégoire Montavon, Franziska Biegler, Siamac Fazli, Matthias Rupp, Matthias Schefﬂer, O Anatole V on Lilienfeld, Alexandre Tkatchenko, and Klaus-Robert Mu?ller. Assessment and validation of machine learning methods for predicting molecular atomization energies. Journal of Chemical Theory and Computation, 9(8):3404–3419, 2013

work page 2013

[5] [5]

Machine learning predictions of molecular properties: Accurate many-body potentials and nonlocality in chemical space

Katja Hansen, Franziska Biegler, Raghunathan Ramakrishnan, Wiktor Pronobis, O Anatole V on Lilienfeld, Klaus- Robert Mu?ller, and Alexandre Tkatchenko. Machine learning predictions of molecular properties: Accurate many-body potentials and nonlocality in chemical space. The journal of physical chemistry letters, 6(12):2326– 2331, 2015

work page 2015

[6] [6]

Extended-connectivity ﬁngerprints

David Rogers and Mathew Hahn. Extended-connectivity ﬁngerprints. Journal of chemical information and modeling, 50(5):742–754, 2010

work page 2010

[7] [7]

Machine learning for quantum mechanics in a nutshell

Matthias Rupp. Machine learning for quantum mechanics in a nutshell. International Journal of Quantum Chemistry, 115(16):1058–1073, 2015

work page 2015

[8] [8]

Deep architectures and deep learning in chemoinformatics: the prediction of aqueous solubility for drug-like molecules

Alessandro Lusci, Gianluca Pollastri, and Pierre Baldi. Deep architectures and deep learning in chemoinformatics: the prediction of aqueous solubility for drug-like molecules. Journal of chemical information and modeling , 53(7):1563–1575, 2013

work page 2013

[9] [9]

Convolutional networks on graphs for learning molecular ﬁngerprints

David K Duvenaud, Dougal Maclaurin, Jorge Iparraguirre, Rafael Bombarell, Timothy Hirzel, Alán Aspuru-Guzik, and Ryan P Adams. Convolutional networks on graphs for learning molecular ﬁngerprints. In Advances in neural information processing systems, pages 2224–2232, 2015

work page 2015

[10] [10]

Molecular graph convolutions: moving beyond ﬁngerprints

Steven Kearnes, Kevin McCloskey, Marc Berndl, Vijay Pande, and Patrick Riley. Molecular graph convolutions: moving beyond ﬁngerprints. Journal of computer-aided molecular design, 30(8):595–608, 2016

work page 2016

[11] [11]

Active learning for computational chemoge- nomics

Daniel Reker, Petra Schneider, Gisbert Schneider, and JB Brown. Active learning for computational chemoge- nomics. Future medicinal chemistry, 9(4):381–402, 2017

work page 2017

[12] [12]

Feasibility of active machine learning for multiclass compound classiﬁcation

Tobias Lang, Florian Flachsenberg, Ulrike von Luxburg, and Matthias Rarey. Feasibility of active machine learning for multiclass compound classiﬁcation. Journal of chemical information and modeling, 56(1):12–20, 2016

work page 2016

[13] [13]

Active-learning strategies in computer-assisted drug discovery

Daniel Reker and Gisbert Schneider. Active-learning strategies in computer-assisted drug discovery. Drug discovery today, 20(4):458–465, 2015

work page 2015

[14] [14]

Active learning with support vector machine applied to gene expression data for cancer classiﬁcation

Ying Liu. Active learning with support vector machine applied to gene expression data for cancer classiﬁcation. Journal of chemical information and computer sciences, 44(6):1936–1941, 2004

work page 1936

[15] [15]

Prediction of Atomization Energy Using Graph Kernel and Active Learning

Yu-Hang Tang and Wibe A de Jong. Prediction of atomization energy using graph kernel and active learning. arXiv preprint arXiv:1810.07310, 2018

work page internal anchor Pith review Pith/arXiv arXiv 2018

[16] [16]

Machine learning of molecular properties: Locality and active learning

Konstantin Gubaev, Evgeny V Podryabinkin, and Alexander V Shapeev. Machine learning of molecular properties: Locality and active learning. The Journal of Chemical Physics, 148(24):241727, 2018

work page 2018

[17] [17]

Additivity rules for the estimation of molecular properties

Sidney W Benson and Jerry H Buss. Additivity rules for the estimation of molecular properties. Thermodynamic properties. The Journal of Chemical Physics, 29(3):546–572, 1958

work page 1958

[18] [18]

Additivity rules for the estimation of thermochemical properties

Sidney W Benson, FR Cruickshank, DM Golden, Gilbert R Haugen, HE O’Neal, AS Rodgers, Robert Shaw, and R Walsh. Additivity rules for the estimation of thermochemical properties. Chemical Reviews, 69(3):279–324, 1969

work page 1969

[19] [19]

Revised group additivity parameters for the enthalpies of formation of oxygen-containing organic compounds

HK Eigenmann, DM Golden, and SW Benson. Revised group additivity parameters for the enthalpies of formation of oxygen-containing organic compounds. The Journal of Physical Chemistry, 77(13):1687–1691, 1973

work page 1973

[20] [20]

Estimation of heats of formation of organic compounds by additivity methods

N Cohen and SW Benson. Estimation of heats of formation of organic compounds by additivity methods. Chemical Reviews, 93(7):2419–2438, 1993

work page 1993

[21] [21]

Thermochemistry of gas-phase and surface species via lasso-assisted subgraph selection

Geun Ho Gu, Petr Plechac, and Dionisios G Vlachos. Thermochemistry of gas-phase and surface species via lasso-assisted subgraph selection. Reaction Chemistry & Engineering, 3(4):454–466, 2018

work page 2018

[22] [22]

Concepts and applications of molecular similarity

Mark A Johnson and Gerald M Maggiora. Concepts and applications of molecular similarity. Wiley, 1990

work page 1990

[23] [23]

A decade of fragment-based drug design: strategic advances and lessons learned

Philip J Hajduk and Jonathan Greer. A decade of fragment-based drug design: strategic advances and lessons learned. Nature reviews Drug discovery, 6(3):211, 2007

work page 2007

[24] [24]

Computational methods in molecular diversity and combinatorial chemistry

Mark G Bures and Yvonne C Martin. Computational methods in molecular diversity and combinatorial chemistry. Current opinion in chemical biology, 2(3):376–380, 1998

work page 1998

[25] [25]

Molecular similarity and diversity in chemoinformatics: from theory to applications

Ana G Maldonado, JP Doucet, Michel Petitjean, and Bo-Tao Fan. Molecular similarity and diversity in chemoinformatics: from theory to applications. Molecular diversity, 10(1):39–79, 2006

work page 2006

[26] [26]

Identification of diverse database subsets using property-based and fragment-based molecular descriptions

Mark Ashton, John Barnard, Florence Casset, Michael Charlton, Geoffrey Downs, Dominique Gorse, John Holliday, Roger Lahana, and Peter Willett. Identification of diverse database subsets using property-based and fragment-based molecular descriptions. Quantitative Structure-Activity Relationships, 21(6):598–604, 2002

work page 2002

[27] [27]

Active learning with statistical models

David A Cohn, Zoubin Ghahramani, and Michael I Jordan. Active learning with statistical models. Journal of artificial intelligence research, 4:129–145, 1996

work page 1996

[28] [28]

Heterogeneous uncertainty sampling for supervised learning

David D Lewis and Jason Catlett. Heterogeneous uncertainty sampling for supervised learning. In Machine Learning Proceedings 1994, pages 148–156. Elsevier, 1994

work page 1994

[29] [29]

Support vector machine active learning with applications to text classification

Simon Tong and Daphne Koller. Support vector machine active learning with applications to text classification. Journal of machine learning research, 2(Nov):45–66, 2001

work page 2001

[30] [30]

Active learning via transductive experimental design

Kai Yu, Jinbo Bi, and Volker Tresp. Active learning via transductive experimental design. In Proceedings of the 23rd international conference on Machine learning, pages 1081–1088. ACM, 2006

work page 2006

[31] [31]

Query by committee

H Sebastian Seung, Manfred Opper, and Haim Sompolinsky. Query by committee. In Proceedings of the fifth annual workshop on Computational learning theory, pages 287–294. ACM, 1992

work page 1992

[32] [32]

Active learning by querying informative and representative examples

Sheng-Jun Huang, Rong Jin, and Zhi-Hua Zhou. Active learning by querying informative and representative examples. In Advances in neural information processing systems, pages 892–900, 2010

work page 2010

[33] [33]

Optimum experimental designs, with SAS, volume 34

Anthony Atkinson, Alexander Donev, and Randall Tobias. Optimum experimental designs, with SAS, volume 34. Oxford University Press, 2007

work page 2007

[34] [34]

Kirstine Smith. On the standard deviations of adjusted and interpolated values of an observed polynomial function and its constants and the guidance they give towards a proper choice of the distribution of observations. Biometrika, 12(1/2):1–85, 1918

work page 1918

[35] [35]

Applied regression analysis, volume 326

Norman R Draper and Harry Smith. Applied regression analysis, volume 326. John Wiley & Sons, 1998

work page 1998

[36] [36]

An algorithm for the construction of “D-optimal” experimental designs

Toby J Mitchell. An algorithm for the construction of “D-optimal” experimental designs. Technometrics, 16(2):203–210, 1974

work page 1974

[37] [37]

Adjustment of an inverse matrix corresponding to a change in one element of a given matrix

Jack Sherman and Winifred J Morrison. Adjustment of an inverse matrix corresponding to a change in one element of a given matrix. The Annals of Mathematical Statistics, 21(1):124–127, 1950

work page 1950

[38] [38]

Reinforcement learning: An introduction

Richard S Sutton and Andrew G Barto. Reinforcement learning: An introduction. MIT press, 2018

work page 2018

[39] [39]

Value-difference based exploration: adaptive control between epsilon-greedy and softmax

Michel Tokic and Günther Palm. Value-difference based exploration: adaptive control between epsilon-greedy and softmax. In Annual Conference on Artificial Intelligence, pages 335–346. Springer, 2011

work page 2011

[40] [40]

Classes of multiagent Q-learning dynamics with epsilon-greedy exploration

Michael Wunder, Michael L Littman, and Monica Babes. Classes of multiagent Q-learning dynamics with epsilon-greedy exploration. In Proceedings of the 27th International Conference on Machine Learning (ICML-10), pages 1167–1174. Citeseer, 2010

work page 2010

[41] [41]

Many molecular properties from one kernel in chemical space

Raghunathan Ramakrishnan and O Anatole von Lilienfeld. Many molecular properties from one kernel in chemical space. CHIMIA International Journal for Chemistry, 69(4):182–186, 2015

work page 2015

[42] [42]

970 million druglike small molecules for virtual screening in the chemical universe database GDB-13

Lorenz C Blum and Jean-Louis Reymond. 970 million druglike small molecules for virtual screening in the chemical universe database GDB-13. Journal of the American Chemical Society, 131(25):8732–8733, 2009

work page 2009

[43] [43]

Fast and accurate modeling of molecular atomization energies with machine learning

Matthias Rupp, Alexandre Tkatchenko, Klaus-Robert Müller, and O Anatole von Lilienfeld. Fast and accurate modeling of molecular atomization energies with machine learning. Physical review letters, 108(5):058301, 2012

work page 2012

[44] [44]

A big data framework to validate thermodynamic data for chemical species

Philipp Buerger, Jethro Akroyd, Jacob W Martin, and Markus Kraft. A big data framework to validate thermodynamic data for chemical species. Combustion and Flame, 176:584–591, 2017

work page 2017

[45] [45]

NIST chemistry webbook

P Linstrom and W Mallard. NIST chemistry webbook. NIST standard reference database, (69):20899, 2005

work page 2005

[46] [46]

Group additivity for thermochemical property estimation of lignin monomers on Pt(111)

Geun Ho Gu and Dionisios G Vlachos. Group additivity for thermochemical property estimation of lignin monomers on Pt(111). The Journal of Physical Chemistry C, 120(34):19234–19241, 2016

work page 2016