Developing an AI Course for Synthetic Chemistry Students

Zhiling Zheng

arXiv: 2511.18244 · v1 · submitted 2025-11-23 · 💻 cs.AI · cond-mat.mtrl-sci · physics.ed-ph

Pith reviewed 2026-05-17 06:11 UTC · model grok-4.3

keywords AI education · synthetic chemistry · machine learning · curriculum design · data-driven chemistry · web-based platform · student projects · reaction optimization

The pith

A web-based course called AI4CHEM teaches machine learning to synthetic chemistry students who have no coding experience by centering lessons on chemical problems.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents the design of AI4CHEM as an introductory course that introduces data-driven methods to students on the synthetic chemistry track. It replaces abstract algorithm teaching with chemistry-specific tasks such as molecular property prediction and reaction optimization. Students work through a zero-install web platform that supports immediate practice with machine learning workflows and active in-class exercises. Assessments include code-guided homework, literature reviews, and group projects where learners construct AI tools for actual experimental challenges. The approach aims to lower entry barriers so that experimental chemists can evaluate and apply AI in their own research without first becoming programmers.

Core claim

The paper claims that an introductory data-driven chemistry course built around chemical context, an accessible web platform for zero-install machine learning practice, and project-based assessments on real experimental problems enables synthetic chemistry students with no prior programming background to develop practical skills in molecular property prediction, reaction optimization, data mining, and the evaluation of AI tools.

What carries the argument

The AI4CHEM curriculum structure, which sequences chemistry examples and collaborative projects through a web-based platform to deliver machine learning workflow practice without requiring software installation or prior coding knowledge.

If this is right

Students gain confidence in using Python for chemistry tasks such as property prediction and reaction optimization.
Learners improve their ability to evaluate the suitability of AI tools for specific chemical research questions.
Collaborative projects result in students producing working AI-assisted workflows tied to real experimental data.
Open release of all course materials enables other programs to replicate or adapt the same beginner-accessible structure.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Similar course designs could be developed for experimental biology or materials science tracks that also lack coding prerequisites.
Widespread adoption might shift laboratory practice so that synthetic chemists routinely incorporate AI checks during reaction planning.
A natural next test would be to track whether students who complete the course later apply the skills in their own research publications.

Load-bearing premise

That combining a web-based platform, chemistry-specific examples, and project assessments will produce meaningful learning gains in AI skills for students who start with no coding experience.

What would settle it

A controlled pre- and post-course evaluation that finds no measurable increase in students' ability to build and apply AI workflows to experimental chemistry problems would show the approach does not deliver the claimed gains.

Figures

Figures reproduced from arXiv: 2511.18244 by Zhiling Zheng.

**Figure 3.** Graphical representation of learner (N = 13) responses to the end-of-course survey. (a) Pie chart showing which course components students felt helped them most in learning AI in chemistry. (b) Self-reported likelihood of using AI or machine learning in future research before and after the course. (c) Stacked bar chart of agreement with course outcome statements.

Original abstract

Artificial intelligence (AI) and data science are transforming chemical research, yet few formal courses are tailored to synthetic and experimental chemists, who often face steep entry barriers due to limited coding experience and lack of chemistry-specific examples. We present the design and implementation of AI4CHEM, an introductory data-driven chemistry course created for students on the synthetic chemistry track with no prior programming background. The curriculum emphasizes chemical context over abstract algorithms, using an accessible web-based platform to ensure zero-install machine learning (ML) workflow development practice and in-class active learning. Assessment combines code-guided homework, literature-based mini-reviews, and collaborative projects in which students build AI-assisted workflows for real experimental problems. Learning gains include increased confidence with Python, molecular property prediction, reaction optimization, and data mining, and improved skills in evaluating AI tools in chemistry. All course materials are openly available, offering a discipline-specific, beginner-accessible framework for integrating AI into synthetic chemistry training.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

This is a clear description of a new course design for teaching AI to non-coding chemistry students, but the effectiveness claims rest on design choices alone with no supporting outcome data.

The paper introduces AI4CHEM, a course built specifically for synthetic chemistry students who have no programming background. It uses a web-based platform so students can run ML workflows without installs, keeps the focus on chemistry examples like molecular property prediction and reaction optimization, and includes projects where students apply these tools to real experimental questions. All materials are made public, which is a practical step for others who might want to try something similar.

Referee Report

1 major / 1 minor

Summary. The manuscript describes the design and implementation of AI4CHEM, an introductory data-driven chemistry course for synthetic chemistry students with no prior programming background. The curriculum prioritizes chemical context over abstract algorithms, uses an accessible web-based platform for zero-install ML workflow practice and in-class active learning, and employs assessments consisting of code-guided homework, literature-based mini-reviews, and collaborative projects on real experimental problems. The abstract asserts specific learning gains in Python confidence, molecular property prediction, reaction optimization, data mining, and AI tool evaluation skills.

Significance. If the effectiveness claims are supported by appropriate evidence, the work would supply a practical, discipline-specific template for incorporating AI and data science into synthetic chemistry training. The open availability of all course materials is a clear strength that could aid adoption and iterative improvement by other instructors.

major comments (1)

[Abstract] The abstract asserts concrete learning gains ('increased confidence with Python, molecular property prediction, reaction optimization, and data mining, and improved skills in evaluating AI tools in chemistry'). The implementation and assessment sections supply no quantitative data, validated instruments, sample sizes, statistical tests, pre/post metrics, or control comparisons. This leaves the central claim of course effectiveness unsupported by presented evidence.

minor comments (1)

The description of the web-based platform and project-based assessments could include more concrete examples of the chemistry-specific workflows students developed to improve clarity for readers unfamiliar with the tools.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for their constructive review and for highlighting the need for alignment between claims and evidence. We address the single major comment below.

Referee: [Abstract] The abstract asserts concrete learning gains ('increased confidence with Python, molecular property prediction, reaction optimization, and data mining, and improved skills in evaluating AI tools in chemistry'). The implementation and assessment sections supply no quantitative data, validated instruments, sample sizes, statistical tests, pre/post metrics, or control comparisons. This leaves the central claim of course effectiveness unsupported by presented evidence.

Authors: We agree that the abstract currently makes specific assertions about learning gains that are not supported by quantitative data, validated instruments, sample sizes, statistical tests, pre/post metrics, or control comparisons in the manuscript. The paper is a description of course design, implementation, and open materials rather than a formal educational research study. We will revise the abstract to describe the intended learning outcomes and the assessment methods (code-guided homework, literature mini-reviews, and collaborative projects) without asserting measured gains. We will also add a brief note in the discussion section clarifying that formal evaluation of learning outcomes lies outside the scope of this work and could be addressed in future studies. This change will ensure the abstract accurately reflects the manuscript content.

Revision: yes

Circularity Check

0 steps flagged

No circularity: purely descriptive curriculum paper with no derivations or fitted claims

This is a descriptive account of course design and implementation with no mathematical derivations, equations, parameters, predictions, or self-referential reductions. The abstract and structure focus on platform choice, example selection, and project format; learning gains are stated as outcomes of the design rather than derived from any internal chain that could reduce to inputs by construction. No self-citation load-bearing steps, uniqueness theorems, or ansatz smuggling appear. The paper is self-contained as a curriculum report and exhibits none of the enumerated circularity patterns.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

This is a descriptive educational paper with no scientific modeling, so it introduces no free parameters, axioms, or invented entities.

pith-pipeline@v0.9.0 · 5459 in / 1161 out tokens · 30576 ms · 2026-05-17T06:11:26.395934+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/ArithmeticFromLogic.lean · reality_from_one_distinction · unclear

Paper passage: "We present the design and implementation of AI4CHEM, an introductory data-driven chemistry course... Assessment combines code-guided homework, literature-based mini-reviews, and collaborative projects..."

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear

Paper passage: "The curriculum emphasizes chemical context over abstract algorithms, using an accessible web-based platform..."

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

117 extracted references · 117 canonical work pages · 1 internal anchor

[1] Learning Transferable Visual Models From Natural Language Supervision

https://doi.org/10.1021/acs.jcim.0c00174. (40) Zheng, Z.; Zhang, O.; Borgs, C.; Chayes, J. T.; Yaghi, O. M. ChatGPT Chemistry Assistant for Text Mining and the Prediction of MOF Synthesis. J. Am. Chem. Soc. 2023, 145 (32), 18048–18062. https://doi.org/10.1021/jacs.3c05819. (41) Weininger, D. SMILES, a Chemical Language and Information System. 1. Introduc...

doi:10.1021/acs.jcim.0c00174 · 2023
[2] Describe the goals, structure, assessments, and expectations of CHEM 5080
[3] Navigate the Jupyter Book, Colab notebooks, and course resources
[4] Run code and Markdown cells in a notebook and switch between the two modes
[5] Use Python as a calculator for basic chemical math (e.g., moles, molar mass)
[6] Store values in variables, create simple lists and dictionaries, and access their elements

2 Pandas and Plotting for Chemical Data

[7] Explain what pandas is, define Series and DataFrame, and use standard naming conventions
[8] Read CSV files into a DataFrame, inspect column types, and perform basic cleaning (sorting, filtering, handling missing values)
[9] Select, filter, group, and summarize data from chemical datasets using pandas
[10] Create line, scatter, bar, histogram, box, violin, and heatmap plots with Matplotlib
[11] Combine pandas and plotting to explore real chemical data (e.g., Beer–Lambert law examples) and save publication-quality figures

3 SMILES and RDKit: Machine-Readable Molecules

[12] Interpret SMILES strings in terms of atoms, bonds, branches, rings, aromaticity, charges, and simple stereochemistry
[13] Use RDKit to parse SMILES, draw molecular structures, add hydrogens, and compute basic molecular properties
[14] Perform small structure edits in RDKit (e.g., atom substitution, neutralizing groups, adding a methyl group)
[15] Connect to PubChem to retrieve SMILES and related information, then round-trip between text, RDKit objects, and files

4 Chemical Structure Identifiers and Web Services

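As a rough illustration of the web-services topic named in the heading above, the sketch below assembles request URLs for PubChem's PUG REST interface, which encodes the input namespace, the identifier, the operation, and the output format as path segments. The helper name `pubchem_url` and the example compound are illustrative assumptions, not taken from the course materials, and no network request is made.

```python
from urllib.parse import quote

# Base address of PubChem's PUG REST service.
PUG_REST = "https://pubchem.ncbi.nlm.nih.gov/rest/pug"


def pubchem_url(namespace, identifier, operation="", output="JSON"):
    """Build /compound/<namespace>/<identifier>[/<operation>]/<output>.

    Hypothetical helper for illustration. The identifier is percent-encoded
    so that names containing spaces stay valid inside the URL path.
    """
    parts = [PUG_REST, "compound", namespace, quote(str(identifier), safe="")]
    if operation:
        parts.append(operation)
    parts.append(output)
    return "/".join(parts)


# Name -> CID lookup, returned as plain text.
name_to_cid = pubchem_url("name", "aspirin", "cids", "TXT")

# Selected properties for CID 2244 (aspirin), returned as JSON.
props = pubchem_url("cid", 2244, "property/IUPACName,InChIKey,MolecularFormula")

# 2D structure image for the same CID.
image = pubchem_url("cid", 2244, output="PNG")

print(name_to_cid)
print(props)
print(image)
```

Keeping URL construction separate from the request itself makes it easy to inspect a URL in a browser before automating lookups over a list of compounds.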
[16] Describe PubChem’s APIs as chemical data services and explain typical use cases
[17] Construct URLs that return JSON, text, or images for given identifiers (name, SMILES, CAS, CID)
[18] Resolve chemical names, SMILES, and CAS numbers to PubChem CIDs and retrieve IUPAC names, SMILES, InChIKeys, and selected properties
[19] Use the NCI Chemical Identifier Resolver (CIR) as a second query path and compare its responses to PubChem
[20] Write small helper functions with basic error handling and fallbacks to automate identifier resolution for a list of ligands

5 Regression and Classification with Chemical Data

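One small piece of the regression topic named above is scoring predictions against measured values. The following is a minimal sketch of three common regression metrics (MAE, RMSE, and R²) in plain Python; the function name and the numbers are hypothetical and chosen only so the arithmetic is easy to check by hand.

```python
import math


def regression_metrics(y_true, y_pred):
    """Return MAE, RMSE, and R^2 for paired true and predicted values."""
    n = len(y_true)
    residuals = [t - p for t, p in zip(y_true, y_pred)]
    mae = sum(abs(r) for r in residuals) / n
    ss_res = sum(r * r for r in residuals)        # residual sum of squares
    rmse = math.sqrt(ss_res / n)
    mean_true = sum(y_true) / n
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)  # total sum of squares
    r2 = 1.0 - ss_res / ss_tot
    return {"MAE": mae, "RMSE": rmse, "R2": r2}


# Hypothetical measured vs. predicted property values (illustration only).
measured = [1.0, 2.0, 3.0, 4.0]
predicted = [1.5, 2.0, 2.5, 4.0]

metrics = regression_metrics(measured, predicted)
print(metrics)
```

MAE and RMSE are in the units of the target, while R² is unitless, so reporting at least one of each kind tends to make model comparisons easier to interpret.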
[21] Distinguish between regression and classification problems by examining the type of target variable
[22] Load small chemistry datasets containing SMILES and simple descriptors or text features
[23] Create train, validation, and test splits and describe the role of each split in model development
[24] Fit basic models using linear regression and logistic regression
[25] Compute and interpret standard metrics including RMSE, MAE, R², accuracy, precision, recall, F1, and ROC-AUC to compare models

6 Cross-Validation, Model Selection, and Feature Importance

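To make the cross-validation topic in the heading above concrete, here is a minimal sketch of K-fold splitting written without any library. Every sample lands in exactly one held-out fold, and a model (here a deliberately trivial mean predictor) is refit on the remaining folds each time. The function name and the data are assumptions for illustration; in practice one would usually shuffle first or use a library splitter.

```python
import math


def k_fold_indices(n_samples, k):
    """Yield (train_idx, test_idx) pairs so each sample is held out once."""
    if not 2 <= k <= n_samples:
        raise ValueError("k must be between 2 and n_samples")
    # Spread any remainder over the first folds so sizes differ by at most 1.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test_idx = list(range(start, start + size))
        train_idx = list(range(0, start)) + list(range(start + size, n_samples))
        yield train_idx, test_idx
        start += size


# Hypothetical target values (e.g., log-solubilities), for illustration only.
y = [-0.5, -1.2, -2.0, -0.8, -3.1, -1.7, -2.4, -0.9, -1.1, -2.8]

fold_rmse = []
for train_idx, test_idx in k_fold_indices(len(y), k=5):
    # "Model": predict the training-set mean for every held-out sample.
    baseline = sum(y[i] for i in train_idx) / len(train_idx)
    mse = sum((y[i] - baseline) ** 2 for i in test_idx) / len(test_idx)
    fold_rmse.append(math.sqrt(mse))

cv_rmse = sum(fold_rmse) / len(fold_rmse)
print(f"per-fold RMSE: {fold_rmse}")
print(f"mean CV RMSE:  {cv_rmse:.3f}")
```

The spread of the per-fold scores is as informative as their mean: a large spread suggests that a single train/test split could have been misleading.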
[26] Use K-fold cross-validation to obtain fairer performance estimates than a single train/test split
[27] Explain the role of hyperparameters and tune them with tools such as GridSearchCV
[28] Perform basic exploratory data analysis by plotting descriptor distributions, pair plots, and correlations
[29] Apply cross-validation to compare models and hyperparameter settings, then choose a final model
[30] Interpret feature importance measures to explain model predictions on chemical properties

7 Decision Trees and Random Forests

[31] Describe the intuition behind decision trees for both regression and classification problems
[32] Interpret Gini impurity, entropy, and mean squared error as criteria for splitting nodes
[33] Grow and visualize a decision tree, examining nodes, depth, and leaf counts
[34] Control overfitting using hyperparameters
[35] Train random forest models for toxicity or property prediction and compare their performance to single trees
[36] Use tree-based feature importance and permutation importance to identify key molecular descriptors

8 Introduction to Neural Networks

• Explain the components of a multilayer perceptron (MLP).
• Build a small MLP for a toy dataset, then extend it to chemical tasks such as solubility or toxicity prediction.
• Train MLPRegressor and MLPClassifier models s...

[37] Represent molecules as graphs with atoms as nodes, bonds as edges, and appropriate node and edge features
[38] Build a basic MLP in PyTorch and use it as a stepping stone to graph neural networks (GNNs)
[39] Explain message passing and neighborhood aggregation in message-passing neural networks (MPNNs)
[40] Implement a tiny GNN in PyTorch to predict properties on toy molecular graphs
[41] Prepare molecular graphs from SMILES and run a simple GNN model, comparing its performance to descriptor-based models

10 Property and Reaction Prediction with Graph Neural Networks

[42] Set up message-passing neural networks (D-MPNNs) for both regression and classification tasks on a reaction dataset (e.g., C–H oxidation)
[43] Train single-task Chemprop models for properties such as solubility, pKa, melting point, and toxicity
[44] Train a reactivity classifier and an atom-level selectivity predictor for reaction outcomes
[45] Interpret Chemprop models using Shapley values (SHAP) at the feature and node levels

11 Dimension Reduction and Visualization

[46] Differentiate supervised from unsupervised learning in a chemistry context
[47] Explain the intuition and basic mathematics of Principal Component Analysis (PCA) and interpret loadings, scores, and explained variance
[48] Use t-SNE and UMAP to embed high-dimensional chemical features into 2D for visualization
[49] Compare descriptor-based and fingerprint-based representations in low-dimensional plots
[50] Use distance metrics and clustering outputs to explore structure–property relationships in a reaction dataset

12 Clustering and Self-Supervised Workflows

[51] Build clustering pipelines that include feature selection, scaling, clustering, and visualization
[52] Select suitable distance metrics for descriptors versus fingerprints and justify these choices
[53] Use K-means clustering and evaluate candidate values of k using elbow and silhouette analyses
[54] Explore alternative clustering methods such as agglomerative clustering and DBSCAN and compare their behavior
[55] Interpret clustering results in terms of chemical similarity, reactivity, or experimental outcomes

13 De Novo Molecule Generation with Variational Autoencoders

[56] Connect unsupervised learning concepts such as reconstruction and latent space to molecular generation tasks
[57] Explain encoder and decoder roles in a variational autoencoder (VAE) and why VAEs are useful for sampling
[58] Train a small SMILES-based VAE model on a molecular dataset
[59] Inspect latent-space organization and perform simple sampling or interpolation to generate new molecules
[60] Discuss the strengths and limitations of VAE-based generative models for molecular design

14 Bayesian Optimization for Synthesis Conditions

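To give a feel for the Bayesian optimization topic named in the heading above, the sketch below runs a small loop in one dimension: fit a Gaussian-process surrogate, score candidates with an upper-confidence-bound acquisition, measure the best candidate, and repeat. Everything here is an assumption made for illustration: the response surface `toy_yield`, the RBF length scale, the `kappa` value, and the hand-written linear solver stand in for what a real library would provide.

```python
import math


def rbf(a, b, length=0.2):
    """Squared-exponential kernel between two scalar inputs."""
    return math.exp(-0.5 * ((a - b) / length) ** 2)


def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [bi] for row, bi in zip(A, b)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[p] = M[p], M[c]
        for r in range(c + 1, n):
            f = M[r][c] / M[c][c]
            for j in range(c, n + 1):
                M[r][j] -= f * M[c][j]
    x = [0.0] * n
    for r in reversed(range(n)):
        x[r] = (M[r][n] - sum(M[r][j] * x[j] for j in range(r + 1, n))) / M[r][r]
    return x


def gp_posterior(X, y, x_new, noise=1e-6):
    """Zero-mean GP posterior mean and variance at x_new."""
    K = [[rbf(a, b) + (noise if i == j else 0.0) for j, b in enumerate(X)]
         for i, a in enumerate(X)]
    k_star = [rbf(a, x_new) for a in X]
    mean = sum(k * w for k, w in zip(k_star, solve(K, y)))
    var = rbf(x_new, x_new) - sum(k * v for k, v in zip(k_star, solve(K, k_star)))
    return mean, max(var, 0.0)  # clamp tiny negative values from round-off


def toy_yield(x):
    """Hypothetical yield fraction vs. scaled temperature, peaking at x = 0.7."""
    return math.exp(-((x - 0.7) / 0.15) ** 2)


def bayes_opt(n_iter=10, kappa=2.0):
    grid = [i / 50 for i in range(51)]        # candidate conditions
    X = [0.0, 1.0]                            # two initial "experiments"
    y = [toy_yield(x) for x in X]
    for _ in range(n_iter):
        def ucb(x):
            mean, var = gp_posterior(X, y, x)
            return mean + kappa * math.sqrt(var)
        x_next = max((x for x in grid if x not in X), key=ucb)
        X.append(x_next)                      # run the chosen experiment
        y.append(toy_yield(x_next))           # and record its outcome
    best = max(range(len(X)), key=lambda i: y[i])
    return X[best], y[best]


best_x, best_y = bayes_opt()
print(f"best condition {best_x:.2f} with yield {best_y:.3f}")
```

Larger `kappa` values weight the uncertainty term more heavily and push the loop toward exploration; smaller values make it exploit the current best region sooner.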
[61] Describe the motivation for Bayesian optimization (BO) in expensive experimental settings
[62] Define key components of BO: prior, surrogate model (GP, RF, small NN), and acquisition function (EI, UCB, PI, greedy)
[63] Implement a basic BO loop: fit surrogate, compute acquisition, pick the next point, update data, and repeat
[64] Visualize BO behavior in 1D or low-dimensional examples to build intuition about exploration–exploitation trade-offs
[65] Apply BO to a toy Suzuki coupling dataset to optimize yield over temperature, time, and concentration

15 Multi-Objective Bayesian Optimization

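Multi-objective optimization, the topic named in the heading above, rests on the idea of Pareto dominance: one experiment dominates another if it is at least as good in every objective and strictly better in at least one. The sketch below filters a set of hypothetical (yield %, purity %) results down to the non-dominated ones. The data and function names are assumptions for illustration, and both objectives are treated as "higher is better".

```python
def dominates(a, b):
    """True if a is no worse than b everywhere and strictly better somewhere."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))


def pareto_front(points):
    """Keep only the points that no other point dominates."""
    return [p for p in points if not any(dominates(q, p) for q in points)]


# Hypothetical experiments as (yield %, purity %) pairs.
experiments = [(62, 90), (75, 85), (70, 95), (60, 80), (75, 80)]

front = pareto_front(experiments)
print(front)  # [(75, 85), (70, 95)]
```

Neither point on the front is better than the other in both objectives, which is why choosing between them is a trade-off decision rather than an optimization step.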
[66] Extend single-objective BO concepts to multi-objective problems common in chemistry (e.g., yield, purity, and cost)
[67] Define Pareto dominance, Pareto front, scalarization, hypervolume, and expected hypervolume improvement
[68] Engineer features and targets for a multi-objective metal-organic framework (MOF) synthesis dataset
[69] Build simple surrogate models for each objective and use them within a multi-objective BO loop
[70] Analyze and visualize Pareto fronts to support decision-making in multi-criteria experimental design

16 Reinforcement Learning and Bandits for Experiment Design

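As a small illustration of the bandit topic named in the heading above, the sketch below simulates an ε-greedy agent choosing among a few candidate synthesis conditions, each with a fixed but unknown success probability. The probabilities, the function name, and the default settings are hypothetical; the point is only to show the explore-or-exploit decision and the incremental update of each arm's estimated value.

```python
import random


def epsilon_greedy(success_probs, n_rounds=500, epsilon=0.1, seed=0):
    """Simulate an epsilon-greedy agent on a Bernoulli multi-armed bandit."""
    rng = random.Random(seed)          # seeded so runs are repeatable
    n_arms = len(success_probs)
    counts = [0] * n_arms              # how often each arm was tried
    values = [0.0] * n_arms            # running mean reward per arm
    for _ in range(n_rounds):
        if rng.random() < epsilon:
            arm = rng.randrange(n_arms)                       # explore
        else:
            arm = max(range(n_arms), key=lambda a: values[a])  # exploit
        reward = 1.0 if rng.random() < success_probs[arm] else 0.0
        counts[arm] += 1
        # Incremental mean: new = old + (reward - old) / n
        values[arm] += (reward - values[arm]) / counts[arm]
    return counts, values


# Hypothetical success probabilities for three synthesis conditions.
counts, values = epsilon_greedy([0.2, 0.5, 0.8])
print("pulls per condition:", counts)
print("estimated success rates:", [round(v, 2) for v in values])
```

With `epsilon=0` the agent never explores and can lock onto whichever arm happened to succeed first, which is the failure mode that the other exploration strategies are designed to avoid.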
[71] Define agent, environment, state, action, reward, trajectory, policy, and value in reinforcement learning (RL)
[72] Implement tabular Q-learning in a simple gridworld with a chemistry-inspired reward structure
[73] Compare exploration strategies such as ε-greedy, optimistic initialization, UCB, and Thompson sampling in bandit problems
[74] Frame a chemistry example (e.g., MOF synthesis) as a multi-armed bandit and simulate different agents
[75] Explain how RL and bandit methods can inform closed-loop experiment selection

17 Positive–Unlabeled (PU) Learning

[76] Define semi-supervised learning and distinguish between LU (labeled + unlabeled) and PU (positive + unlabeled) settings
[77] Explain standard assumptions used in PU learning and when they are reasonable in chemical datasets
[78] Construct a simple PU workflow for a chemistry example where failures are unlabeled or rarely reported
[79] Estimate class priors and convert scores from an intermediate classifier into PU probabilities
[80] Propose evaluation strategies when true negatives are unavailable or scarce

18 Contrastive Learning and Data Augmentation

Showing first 80 references.

[1] [1]

Learning Transferable Visual Models From Natural Language Supervision

https://doi.org/10.1021/acs.jcim.0c00174. (40) Zheng, Z.; Zhang, O.; Borgs, C.; Chayes, J. T.; Yaghi, O. M. ChatGPT Chemistry Assistant for Text Mining and the Prediction o f MOF Synthesis. J. Am. Chem. Soc. 2023, 145 (32), 18048–18062. https://doi.org/10.1021/jacs.3c05819. (41) Weininger, D. SMILES, a Chemical Language and Information System. 1. Introduc...

work page internal anchor Pith review Pith/arXiv arXiv doi:10.1021/acs.jcim.0c00174 2023

[2] [2]

Describe the goals, structure, assessments, and expectations of CHEM 5080

work page

[3] [3]

Navigate the Jupyter Book, Colab notebooks, and course resources

work page

[4] [4]

Run code and Markdown cells in a notebook and switch between the two modes

work page

[5] [5]

Use Python as a calculator for basic chemical math (e.g., moles, molar mass)

work page

[6] [6]

2 Pandas and Plotting for Chemical Data

Store values in variables, create simple lists and dictionaries, and access their elements. 2 Pandas and Plotting for Chemical Data

work page

[7] [7]

Explain what pandas is, define Series and DataFrame, and use standard nam‑ ing conventions

work page

[8] [8]

Read CSV files into a DataFrame, inspect column types, and perform basic cleaning (sorting, filtering, handling missing values)

work page

[9] [9]

Select, filter, group, and summarize data from chemical datasets using pan‑ das

work page

[10] [10]

Create line, scatter, bar, histogram, box, violin, and heatmap plots with Mat‑ plotlib

work page

[11] [11]

3 SMILES and RDKit: Machine-Readable Molecules

Combine pandas and plotting to explore real chemical data (e.g., Beer–Lam‑ bert–law examples) and save publication‑quality figures. 3 SMILES and RDKit: Machine-Readable Molecules

work page

[12] [12]

Interpret SMILES strings in terms of atoms, bonds, branches, rings, aromatic‑ ity, charges, and simple stereochemistry

work page

[13] [13]

Use RDKit to parse SMILES, draw molecular structures, add hydrogens, and compute basic molecular properties

work page

[14] [14]

Perform small structure edits in RDKit (e.g., atom substitution, neutralizing groups, adding a methyl group)

work page

[15] [15]

4 Chemical Structure Identifiers and Web Services

Connect to PubChem to retrieve SMILES and related information, then round‑trip between text, RDKit objects, and files. 4 Chemical Structure Identifiers and Web Services

work page

[16] [16]

Describe PubChem’s APIs as chemical data services and explain typical use cases

work page

[17] [17]

Construct URLs that return JSON, text, or images for given identifiers (name, SMILES, CAS, CID)

work page

[18] [18]

Resolve chemical names, SMILES, and CAS numbers to PubChem CIDs and retrieve IUPAC names, SMILES, InChIKeys, and selected properties

work page

[19] [19]

Use the NCI Chemical Identifier Resolver (CIR) as a second query path and compare its responses to PubChem

work page

[20] [20]

14 5 Regression and Classification with Chemical Data

Write small helper functions with basic error handling and fallbacks to auto‑ mate identifier resolution for a list of ligands. 14 5 Regression and Classification with Chemical Data

work page

[21] [21]

Distinguish between regression and classification problems by examining the type of target variable

work page

[22] [22]

Load small chemistry datasets containing SMILES and simple descriptors or text features

work page

[23] [23]

Create train, validation, and test splits and describe the role of each split in model development

work page

[24] [24]

Fit basic regression model using linear regression and logistic regression

work page

[25] [25]

6 Cross-Validation, Model Selection, and Feature Im- portance

Compute and interpret standard metrics including RMSE, MAE, R2, accuracy, precision, recall, F1, and ROC‑AUC to compare models. 6 Cross-Validation, Model Selection, and Feature Im- portance

work page

[26] [26]

Use K‑fold cross‑validation to obtain fairer performance estimates than a single train/test split

work page

[27] [27]

Explain the role of hyperparameters and tune them with tools such as GridSearchCV

work page

[28] [28]

Perform basic exploratory data analysis by plotting descriptor distributions, pair plots, and correlations

work page

[29] [29]

Apply cross‑validation to compare models and hyperparameter settings, then choose a final model

work page

[30] [30]

7 Decision Trees and Random Forests

Interpret feature importance measures to explain model predictions on chemical properties. 7 Decision Trees and Random Forests

work page

[31] [31]

Describe the intuition behind decision trees for both regression and classifi‑ cation problems

work page

[32] [32]

Interpret Gini impurity, entropy, and mean squared error as criteria for split‑ ting nodes

work page

[33] [33]

Grow and visualize a decision tree, examining nodes, depth, and leaf counts

work page

[34] [34]

Control overfitting using hyperparameters

work page

[35] [35]

Train random forest models for toxicity or property prediction and compare their performance to single trees

work page

[36] [36]

8 Introduction to Neu- ral Networks • Explain the components of a multilayer perceptron (MLP)

Use tree‑based feature importance and permutation importance to identify key molecular descriptors. 8 Introduction to Neu- ral Networks • Explain the components of a multilayer perceptron (MLP). • Build a small MLP for a toy dataset, then extend it to chemical tasks such as solubility or toxicity prediction. • Train MLPRegressor and MLPClassifier models s...

work page

[37] [37]

Represent molecules as graphs with atoms as nodes, bonds as edges, and ap‑ propriate node and edge features

work page

[38] [38]

Build a basic MLP in PyTorch and use it as a stepping stone to graph neural networks (GNNs)

work page

[39] [39]

Explain message passing and neighborhood aggregation in message‑passing neural networks (MPNNs)

work page

[40] [40]

Implement a tiny GNN in PyTorch to predict properties on toy molecular graphs

work page

[41] [41]

10 Property and Reac- tion Prediction with Graph Neural Net- works

Prepare molecular graphs from SMILES and run a simple GNN model, com‑ paring its performance to descriptor‑based models. 10 Property and Reac- tion Prediction with Graph Neural Net- works

work page

[42] [42]

Set up message‑passing neural networks (D‑MPNNs) for both regression and classification tasks on a reaction dataset (e.g., C–H oxidation)

work page

[43] [43]

Train single‑task Chemprop models for properties such as solubility, pKa, melting point, and toxicity

work page

[44] [44]

Train a reactivity classifier and an atom‑level selectivity predictor for reac‑ tion outcomes

work page

[45] [45]

11 Dimension Reduc- tion and Visualiza- tion

Interpret Chemprop models using Shapley values (SHAP) at the feature and node levels. 11 Dimension Reduc- tion and Visualiza- tion

work page

[46] [46]

Differentiate supervised from unsupervised learning in a chemistry context

work page

[47] [47]

Explain the intuition and basic mathematics of Principal Component Analy‑ sis (PCA) and interpret loadings, scores, and explained variance

work page

[48] [48]

Use t‑SNE and UMAP to embed high‑dimensional chemical features into 2D for visualization

work page

[49] [49]

Compare descriptor‑based and fingerprint‑based representations in low‑di‑ mensional plots

work page

[50] [50]

15 12 Clustering and Self - Supervised Work- flows

Use distance metrics and clustering outputs to explore structure–property relationships in a reaction dataset. 15 12 Clustering and Self - Supervised Work- flows

work page

[51] [51]

Build clustering pipelines that include feature selection, scaling, clustering, and visualization

Select suitable distance metrics for descriptors versus fingerprints and justify these choices

Use K‑means clustering and evaluate candidate values of k using elbow and silhouette analyses

Explore alternative clustering methods such as agglomerative clustering and DBSCAN and compare their behavior

Interpret clustering results in terms of chemical similarity, reactivity, or experimental outcomes.

13 De Novo Molecule Generation with Variational Autoencoders

Connect unsupervised learning concepts such as reconstruction and latent space to molecular generation tasks

Explain encoder and decoder roles in a variational autoencoder (VAE) and why VAEs are useful for sampling

Train a small SMILES‑based VAE model on a molecular dataset

Inspect latent‑space organization and perform simple sampling or interpolation to generate new molecules

Discuss the strengths and limitations of VAE‑based generative models for molecular design.

14 Bayesian Optimization for Synthesis Conditions
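
The loop this lecture builds toward (fit a surrogate, compute an acquisition function, pick the next point, update the data, repeat) can be sketched in one dimension as follows. This is a minimal illustration under stated assumptions: the yield curve `toy_yield`, the kernel hyperparameters, the temperature grid, and the choice of a Gaussian-process surrogate with a UCB acquisition are all picked here for demonstration and are not taken from the course materials. The small linear solver only keeps the example free of external libraries.

```python
import math


def toy_yield(temp_c):
    """Hypothetical yield (%) peaking at 90 °C; stands in for running an experiment."""
    return 90.0 * math.exp(-((temp_c - 90.0) / 20.0) ** 2)


def rbf(a, b, length=15.0, variance=900.0):
    """Squared-exponential kernel between two temperatures."""
    return variance * math.exp(-0.5 * ((a - b) / length) ** 2)


def solve(matrix, rhs):
    """Solve matrix @ x = rhs by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    aug = [row[:] + [rhs[i]] for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(aug[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (aug[r][n] - tail) / aug[r][r]
    return x


def gp_posterior(x_obs, y_obs, x_new, noise=1.0):
    """Posterior mean and standard deviation of a GP surrogate at x_new."""
    mean_y = sum(y_obs) / len(y_obs)  # constant prior mean
    gram = [[rbf(a, b) + (noise if i == j else 0.0) for j, b in enumerate(x_obs)]
            for i, a in enumerate(x_obs)]
    k_star = [rbf(a, x_new) for a in x_obs]
    alpha = solve(gram, [y - mean_y for y in y_obs])
    beta = solve(gram, k_star)
    mu = mean_y + sum(k * a for k, a in zip(k_star, alpha))
    var = rbf(x_new, x_new) - sum(k * b for k, b in zip(k_star, beta))
    return mu, math.sqrt(max(var, 0.0))


def ucb(x, x_obs, y_obs, kappa=2.0):
    """Upper confidence bound acquisition: mean plus kappa standard deviations."""
    mu, sigma = gp_posterior(x_obs, y_obs, x)
    return mu + kappa * sigma


candidates = [40.0 + 5.0 * i for i in range(17)]  # 40, 45, ..., 120 °C
x_obs = [40.0, 120.0]                              # two initial experiments
y_obs = [toy_yield(x) for x in x_obs]

for _ in range(6):
    untested = [x for x in candidates if x not in x_obs]
    # Fit the surrogate and score every untested candidate with the acquisition.
    x_next = max(untested, key=lambda x: ucb(x, x_obs, y_obs))
    x_obs.append(x_next)             # "run" the chosen experiment
    y_obs.append(toy_yield(x_next))  # update the data, then repeat

best_y, best_x = max(zip(y_obs, x_obs))
print(f"best temperature {best_x} °C, yield {best_y:.1f} %")
```

Raising `kappa` pushes the loop toward unexplored temperatures, while setting it to zero reduces the acquisition to the greedy choice of the highest predicted yield.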

Describe the motivation for Bayesian optimization (BO) in expensive experimental settings

Define key components of BO: prior, surrogate model (GP, RF, small NN), and acquisition function (EI, UCB, PI, greedy)

Implement a basic BO loop: fit surrogate, compute acquisition, pick the next point, update data, and repeat

Visualize BO behavior in 1D or low‑dimensional examples to build intuition about exploration–exploitation trade‑offs

Apply BO to a toy Suzuki coupling dataset to optimize yield over temperature, time, and concentration.

15 Multi-Objective Bayesian Optimization
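
Two of the multi-objective concepts in this lecture, Pareto dominance and hypervolume, are easy to pin down in code. The sketch below assumes both objectives are maximized. The (yield, purity) values are invented, and the reference point of (0, 0) is an arbitrary choice for illustration.

```python
def dominates(a, b):
    """True if a is at least as good as b everywhere and strictly better somewhere."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))


def pareto_front(points):
    """Keep only the points that no other point dominates."""
    return [p for p in points if not any(dominates(q, p) for q in points)]


def hypervolume_2d(front, ref=(0.0, 0.0)):
    """Area dominated by a two-objective maximization front, measured from ref."""
    area, prev_y = 0.0, ref[1]
    for x, y in sorted(front, reverse=True):  # descending in the first objective
        if y > prev_y:
            area += (x - ref[0]) * (y - prev_y)
            prev_y = y
    return area


# Hypothetical (yield %, purity %) outcomes; a cost objective could be added by negating it.
results = [(62, 91), (75, 85), (80, 70), (70, 80), (55, 95), (78, 69)]

front = sorted(pareto_front(results))
print(front)                  # [(55, 95), (62, 91), (75, 85), (80, 70)]
print(hypervolume_2d(front))  # 7317.0
```

A new experiment improves the front only if it raises this hypervolume, which is the quantity that expected hypervolume improvement tries to predict before the experiment is run.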

Extend single‑objective BO concepts to multi‑objective problems common in chemistry (e.g., yield, purity, and cost)

Define Pareto dominance, Pareto front, scalarization, hypervolume, and expected hypervolume improvement

Engineer features and targets for a multi‑objective metal‑organic framework (MOF) synthesis dataset

Build simple surrogate models for each objective and use them within a multi‑objective BO loop

Analyze and visualize Pareto fronts to support decision‑making in multi‑criteria experimental design.

16 Reinforcement Learning and Bandits for Experiment Design
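
As a concrete picture of the bandit setting in this lecture, the sketch below simulates an ε-greedy agent choosing among three synthesis recipes with unknown success rates. The success probabilities, the number of rounds, and the function name `epsilon_greedy` are assumptions made for illustration. The other strategies the lecture compares (optimistic initialization, UCB, Thompson sampling) would change only how `arm` is chosen.

```python
import random


def epsilon_greedy(success_probs, n_rounds=5000, epsilon=0.1, seed=0):
    """Simulate an epsilon-greedy agent on a Bernoulli multi-armed bandit."""
    rng = random.Random(seed)
    n_arms = len(success_probs)
    counts = [0] * n_arms       # how often each arm was tried
    estimates = [0.0] * n_arms  # running mean reward per arm
    for _ in range(n_rounds):
        if rng.random() < epsilon:
            arm = rng.randrange(n_arms)  # explore: random recipe
        else:
            arm = max(range(n_arms), key=lambda a: estimates[a])  # exploit
        reward = 1.0 if rng.random() < success_probs[arm] else 0.0
        counts[arm] += 1
        estimates[arm] += (reward - estimates[arm]) / counts[arm]
    return counts, estimates


# Hypothetical chance that each of three MOF recipes gives a crystalline product.
recipe_success = [0.2, 0.5, 0.8]
counts, estimates = epsilon_greedy(recipe_success)
print(counts, [round(e, 2) for e in estimates])
```

With these settings most trials should end up on the third recipe, while the 10% exploration rate keeps the estimates for the other two from going stale.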

Define agent, environment, state, action, reward, trajectory, policy, and value in reinforcement learning (RL)

Implement tabular Q‑learning in a simple gridworld with a chemistry‑inspired reward structure

Compare exploration strategies such as ε‑greedy, optimistic initialization, UCB, and Thompson sampling in bandit problems

Frame a chemistry example (e.g., MOF synthesis) as a multi‑armed bandit and simulate different agents

Explain how RL and bandit methods can inform closed‑loop experiment selection.

17 Positive–Unlabeled (PU) Learning
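
One common way to turn classifier scores into PU probabilities is the Elkan–Noto correction, sketched below. The source does not say which estimator the course uses, so this choice is an assumption. It relies on the "selected completely at random" assumption, under which P(labeled | x) = c · P(positive | x) for a constant c. All scores and the labeled fraction are invented numbers.

```python
def estimate_label_frequency(scores_on_labeled_positives):
    """Estimate c = P(labeled | positive) as the mean score on held-out labeled positives."""
    return sum(scores_on_labeled_positives) / len(scores_on_labeled_positives)


def pu_probability(score, c):
    """Convert P(labeled | x) into P(positive | x) = score / c, clipped to [0, 1]."""
    return min(1.0, max(0.0, score / c))


# Hypothetical scores g(x) = P(labeled | x) from a classifier trained to
# separate labeled (reported successful) reactions from unlabeled ones.
held_out_positive_scores = [0.5, 0.4, 0.6, 0.5]
unlabeled_scores = [0.05, 0.25, 0.45, 0.10]
fraction_labeled = 0.2  # assumed share of labeled examples in the dataset

c = estimate_label_frequency(held_out_positive_scores)
class_prior = fraction_labeled / c  # estimated P(positive)
pu_probs = [pu_probability(s, c) for s in unlabeled_scores]

print(round(c, 3), round(class_prior, 3), [round(p, 3) for p in pu_probs])
```

Here an unlabeled reaction with a raw score of 0.45 is estimated to be positive with probability 0.9, which shows how strongly the correction can shift conclusions when only a fraction of the successes were ever labeled.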

Define semi‑supervised learning and distinguish between LU (labeled + unlabeled) and PU (positive + unlabeled) settings

Explain standard assumptions used in PU learning and when they are reasonable in chemical datasets

Construct a simple PU workflow for a chemistry example where failures are unlabeled or rarely reported

Estimate class priors and convert scores from an intermediate classifier into PU probabilities

Propose evaluation strategies when true negatives are unavailable or scarce.

18 Contrastive Learning and Data Augmentation
