Foundation Models for Discovery and Exploration in Chemical Space
Pith reviewed 2026-05-18 05:47 UTC · model grok-4.3
The pith
Molecular foundation models called MIST predict over 400 chemical properties and generalize to mapping molecular scents.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
MIST models use the Smirk tokenizer to comprehensively capture nuclear, electronic, and geometric information from molecules, enabling them to learn diverse representations across chemical space. Fine-tuned versions predict more than 400 structure-property relationships with performance at or above the state of the art on diverse benchmarks. The models address real-world challenges such as multiobjective electrolyte solvent screening, stereochemical reasoning for organometallics, and mixture property prediction. They also accurately predict scent profiles and form a hierarchical representation of olfactory space that is consistent with hyperbolic geometry. Hyperparameter-aware Bayesian neural scaling laws eliminate the need for hyperparameter sweeps at every scale, making compute-optimal training of large models feasible on a limited compute budget.
What carries the argument
The Smirk tokenizer, which encodes nuclear, electronic, and geometric information from molecular structures to support learning of broad representations in the MIST foundation models.
If this is right
- The models enable multiobjective screening of electrolyte solvents for desired performance criteria.
- Stereochemical reasoning tasks for organometallic compounds become tractable without custom model development.
- Properties of chemical mixtures can be estimated from component structures alone.
- New problems in chemical space can be solved without explicit training on those exact tasks.
- The models learn hierarchical representations that align with hyperbolic geometry in perceptual domains such as olfaction.
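The multiobjective screening named in the first point can be illustrated with a Pareto filter over predicted properties. This is a minimal sketch, not the paper's screening procedure: the function `pareto_front`, the solvent names, the two objectives, and all scores are hypothetical, and every objective is assumed to be "higher is better".

```python
from typing import Dict, List, Tuple


def pareto_front(candidates: Dict[str, Tuple[float, ...]]) -> List[str]:
    """Return the names of candidates that no other candidate dominates.

    Assumption: every objective is to be maximized.
    """
    def dominates(a: Tuple[float, ...], b: Tuple[float, ...]) -> bool:
        # a dominates b if it is at least as good everywhere and strictly better somewhere.
        return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))

    front = []
    for name, score in candidates.items():
        dominated = any(
            dominates(other, score)
            for other_name, other in candidates.items()
            if other_name != name
        )
        if not dominated:
            front.append(name)
    return sorted(front)


# Hypothetical model outputs: (predicted conductivity score, predicted stability score).
predicted = {
    "solvent_A": (0.9, 0.2),
    "solvent_B": (0.5, 0.5),
    "solvent_C": (0.4, 0.4),  # worse than solvent_B on both objectives
    "solvent_D": (0.1, 0.9),
}
front = pareto_front(predicted)
```

In this toy set, `solvent_C` is dropped because `solvent_B` is better on both objectives; the remaining three are trade-offs that a downstream criterion would have to rank.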
Where Pith is reading between the lines
- If the models truly capture general molecular features, they could extend to other sensory or biological domains that share underlying physical principles.
- The Bayesian scaling laws may let researchers train still larger models on modest compute budgets by removing the need for repeated hyperparameter searches.
- A single foundation model might eventually reduce reliance on many narrow, task-specific predictors in chemistry and materials science.
- The observed hyperbolic structure in olfactory space raises the possibility that similar geometries appear in other learned representations of molecular or biological data.
Load-bearing premise
The Smirk tokenizer captures all relevant nuclear, electronic, and geometric information from molecular structures without significant loss or bias.
What would settle it
Direct comparison of MIST scent-profile predictions against measured human sensory data for a large set of molecules outside any training distribution, or quantitative verification that the learned olfactory embeddings exhibit negative curvature consistent with hyperbolic geometry.
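One way such a quantitative check could look is the Gromov four-point condition, which measures how tree-like (hyperbolic) a finite metric space is. This is a sketch under assumptions: the text above does not say which curvature test the paper uses, the function `gromov_delta` is a hypothetical helper, and the two distance matrices are toy examples rather than olfactory embeddings. A value of 0 means the distances behave exactly like a tree; larger values mean less hyperbolic structure relative to the scale of the distances.

```python
from itertools import combinations
from typing import List


def gromov_delta(dist: List[List[float]]) -> float:
    """Gromov four-point delta of a finite metric space given as a distance matrix.

    For each quadruple, form the three pairwise-distance sums; delta is half the
    largest gap between the two biggest sums over all quadruples.
    """
    delta = 0.0
    for x, y, z, w in combinations(range(len(dist)), 4):
        sums = sorted([
            dist[x][y] + dist[z][w],
            dist[x][z] + dist[y][w],
            dist[x][w] + dist[y][z],
        ])
        delta = max(delta, (sums[2] - sums[1]) / 2.0)
    return delta


# Toy tree metric: a star with a center (index 0) and three leaves at distance 1.
star = [
    [0, 1, 1, 1],
    [1, 0, 2, 2],
    [1, 2, 0, 2],
    [1, 2, 2, 0],
]
# Toy non-tree metric: shortest-path distances on a 4-cycle with unit edges.
cycle = [
    [0, 1, 2, 1],
    [1, 0, 1, 2],
    [2, 1, 0, 1],
    [1, 2, 1, 0],
]
star_delta = gromov_delta(star)
cycle_delta = gromov_delta(cycle)
```

The loop visits every quadruple, so the cost grows as the fourth power of the number of points; for a large embedding set one would sample quadruples instead.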
Original abstract
Accurate prediction of atomistic, thermodynamic, and kinetic properties from molecular structures underpins materials innovation. Existing computational and experimental approaches lack the scalability required to navigate chemical space efficiently. Scientific foundation models trained on large unlabelled datasets offer a path towards navigating chemical space across application domains. Here, we develop MIST, a family of molecular foundation models with up to an order of magnitude more parameters and data than prior works. Trained using a novel tokenizer, Smirk, which comprehensively captures nuclear, electronic, and geometric information, MIST learns a diverse range of molecules. MIST models have been fine-tuned to predict more than 400 structure-property relationships and have been shown to match or exceed state-of-the-art performance across diverse benchmarks, from physiology to electrochemistry. We demonstrate the ability of these models to solve real-world problems across chemical space from multiobjective electrolyte solvent screening to stereochemical reasoning for organometallics and mixture property prediction. The clearest demonstration of a foundation model is its ability to solve problems that were neither explicit targets of training nor central to the intentions of its developers. We identify olfactory perception mapping as such a problem, and show that MIST accurately predicted scent profiles and learned a hierarchical representation of olfactory space consistent with hyperbolic geometry. We formulated hyperparameter aware Bayesian neural scaling laws which eliminate the need for hyperparameter sweeps at every scale, making training large compute-optimal models feasible on a limited compute budget. The methods and findings presented here represent a significant step towards accelerating materials discovery, design, and optimization using foundation models.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces MIST, a family of large molecular foundation models trained on extensive unlabeled data using a novel Smirk tokenizer asserted to comprehensively encode nuclear, electronic, and geometric features from molecular structures. The models are fine-tuned on more than 400 structure-property tasks and reported to match or exceed SOTA performance across benchmarks spanning physiology to electrochemistry. Applications are demonstrated in electrolyte solvent screening, stereochemical reasoning, and mixture property prediction. The central evidence for foundation-model behavior is zero-shot accurate prediction of scent profiles together with a learned hierarchical representation of olfactory space consistent with hyperbolic geometry. The work also formulates hyperparameter-aware Bayesian neural scaling laws to enable compute-optimal training without exhaustive sweeps.
Significance. If the performance and generalization claims are substantiated with detailed metrics and ablations, MIST would constitute a meaningful advance in scaling foundation models for chemical space navigation, with potential to accelerate materials discovery. The Bayesian scaling-law formulation that removes the need for per-scale hyperparameter sweeps is a concrete methodological strength that could be adopted more broadly. The zero-shot olfactory result, if shown to be robust rather than spurious, would provide a strong falsifiable test of emergent property capture. These elements, taken together, would support the paper's positioning as a step toward practical foundation-model use in chemistry.
major comments (3)
- [Abstract] Abstract: the claim that MIST models 'match or exceed state-of-the-art performance across diverse benchmarks' on >400 tasks is presented without any tabulated metrics, error bars, benchmark identifiers, or ablation results, rendering the central performance assertion impossible to evaluate from the given text.
- [Smirk tokenizer description] Description of the Smirk tokenizer: the assertion that it 'comprehensively captures nuclear, electronic, and geometric information' is load-bearing for the zero-shot olfactory generalization, yet no reconstruction-error statistics, mutual-information scores with electronic properties (partial charges, HOMO/LUMO), or ablation on 3D conformer recovery are supplied; without these, the risk that olfactory predictions rest on incomplete or biased encodings cannot be quantified.
- [Olfactory perception mapping] Olfactory perception results: the reported accurate scent-profile prediction and hyperbolic embedding hierarchy are presented as the clearest demonstration of foundation-model behavior, but the manuscript does not include controls or ablations showing that these outcomes survive removal of auxiliary electronic or geometric features; this omission directly affects the claim that the representation is comprehensive rather than correlational.
minor comments (2)
- [Abstract] Abstract: the statement 'up to an order of magnitude more parameters and data than prior works' would be strengthened by explicit numerical comparison to the largest previously published molecular foundation models.
- [Bayesian neural scaling laws] Scaling-laws section: the precise functional form of the hyperparameter-aware Bayesian neural scaling laws and the validation procedure against held-out empirical runs should be stated more explicitly to allow independent reproduction.
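The kind of held-out validation requested in the last comment can be sketched with a plain power law. Everything here is an assumption made for illustration: the paper's functional form is not given in this text, the form `loss = a * compute**(-b)` has no irreducible-loss term and no hyperparameter or Bayesian treatment, `fit_power_law` is a hypothetical helper, and the data are synthetic and noise-free.

```python
import math
from typing import List, Tuple


def fit_power_law(compute: List[float], loss: List[float]) -> Tuple[float, float]:
    """Least-squares fit of log(loss) = log(a) - b * log(compute); returns (a, b)."""
    xs = [math.log(c) for c in compute]
    ys = [math.log(v) for v in loss]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    slope = (
        sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
        / sum((x - mean_x) ** 2 for x in xs)
    )
    intercept = mean_y - slope * mean_x
    return math.exp(intercept), -slope


# Synthetic small-scale runs generated from loss = 10 * compute**-0.5 (illustrative only).
train_compute = [1e2, 1e4, 1e6]
train_loss = [10.0 * c ** -0.5 for c in train_compute]
a, b = fit_power_law(train_compute, train_loss)

# Extrapolate to a larger, held-out scale and compare with the "measured" value there.
held_out_compute = 1e8
predicted_loss = a * held_out_compute ** -b
actual_loss = 10.0 * held_out_compute ** -0.5
```

With real runs, the gap between `predicted_loss` and `actual_loss` at the held-out scale is the quantity a reader would need in order to judge the fitted law.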
Simulated Author's Rebuttal
We thank the referee for their constructive and detailed review. The comments highlight important areas where additional evidence and clarity will strengthen the manuscript. We address each major comment below and have revised the manuscript accordingly to provide the requested metrics, statistics, and controls.
- Referee: [Abstract] Abstract: the claim that MIST models 'match or exceed state-of-the-art performance across diverse benchmarks' on >400 tasks is presented without any tabulated metrics, error bars, benchmark identifiers, or ablation results, rendering the central performance assertion impossible to evaluate from the given text.
  Authors: We agree that the abstract would benefit from greater specificity to allow direct evaluation of the performance claims. In the revised manuscript we have expanded the abstract to reference key benchmark identifiers and representative metrics (with error bars) drawn from the main results and supplementary tables. Detailed tabulated results, including per-task metrics, error bars, and ablation summaries across the >400 tasks, are now explicitly signposted in the main text and supplementary information.
  Revision: yes
- Referee: [Smirk tokenizer description] Description of the Smirk tokenizer: the assertion that it 'comprehensively captures nuclear, electronic, and geometric information' is load-bearing for the zero-shot olfactory generalization, yet no reconstruction-error statistics, mutual-information scores with electronic properties (partial charges, HOMO/LUMO), or ablation on 3D conformer recovery are supplied; without these, the risk that olfactory predictions rest on incomplete or biased encodings cannot be quantified.
  Authors: We acknowledge that quantitative validation of the tokenizer's encoding capacity was not provided in the initial submission. In the revision we have added a dedicated supplementary section reporting (i) reconstruction-error statistics for molecular graphs and 3D structures, (ii) mutual-information scores between tokenizer-derived representations and electronic properties including partial charges and HOMO/LUMO energies, and (iii) results from an ablation study measuring 3D conformer recovery accuracy. These additions allow readers to assess the completeness of the encoding directly.
  Revision: yes
- Referee: [Olfactory perception mapping] Olfactory perception results: the reported accurate scent-profile prediction and hyperbolic embedding hierarchy are presented as the clearest demonstration of foundation-model behavior, but the manuscript does not include controls or ablations showing that these outcomes survive removal of auxiliary electronic or geometric features; this omission directly affects the claim that the representation is comprehensive rather than correlational.
  Authors: We agree that explicit controls are necessary to distinguish comprehensive representation from spurious correlations. We have performed the requested ablation experiments and report the results in the revised manuscript: zero-shot olfactory prediction accuracy and the hyperbolic geometry of the learned embedding space are re-evaluated after systematic removal or masking of electronic and geometric features. The outcomes remain consistent, supporting the claim that the integrated representation drives the observed foundation-model behavior.
  Revision: yes
Circularity Check
No significant circularity: empirical training and external benchmarks drive results
The paper's core claims rest on training MIST models with the Smirk tokenizer on large unlabeled molecular datasets, followed by fine-tuning to predict over 400 structure-property relationships and evaluation on diverse external benchmarks spanning physiology, electrochemistry, and zero-shot olfactory tasks. These outcomes are data-driven performance measurements rather than algebraic reductions, self-definitional mappings, or predictions forced by fitted inputs. The hyperparameter-aware Bayesian neural scaling laws are presented as a training optimization method to avoid exhaustive sweeps, but they do not reduce any reported predictions to the inputs by construction. No load-bearing self-citations, imported uniqueness theorems, or ansatzes smuggled via prior work are invoked to justify the central results; the olfactory hyperbolic geometry finding is an observed empirical pattern on held-out tasks. The derivation chain is therefore self-contained against external validation data.
Axiom & Free-Parameter Ledger
free parameters (2)
- model parameter count and training data volume
- hyperparameters in Bayesian neural scaling laws
axioms (1)
- Domain assumption: Large-scale pre-training on unlabeled molecular structures produces generalizable representations usable for downstream property prediction and zero-shot tasks.
invented entities (1)
- Smirk tokenizer: no independent evidence
Lean theorems connected to this paper
- IndisputableMonolith/Cost: Jcost functional equation and cosh identities (echoes)
  Echoes: this paper passage has the same mathematical shape or conceptual pattern as the Recognition theorem, but is not a direct formal dependency.
  Passage: "learned a hierarchical representation of olfactory space consistent with hyperbolic geometry"
- IndisputableMonolith/Foundation/AlphaCoordinateFixation: higher-derivative calibration of CostAlphaLog (refines)
  Refines: relation between the paper passage and the cited Recognition theorem.
  Passage: "Smirk ... comprehensively captures nuclear, electronic, and geometric information"
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Forward citations
Cited by 2 Pith papers
- fmxcoders: Factorized Masked Crosscoders for Cross-Layer Feature Discovery
  fmxcoders improve cross-layer feature recovery in transformers via factorized weights and layer masking, delivering 10-30 point probing F1 gains, 25-50% lower MSE, doubled functional coherence, and 3-13x more coherent...
- Energy-Aware Routing to Large Reasoning Models
  In the critical regime for energy provisioning to large reasoning models, performance is volatility-limited, motivating variance-aware routing policies based on training and inference compute scaling laws.