FARM: Enhancing Molecular Representations with Functional Group Awareness

Ge Liu; Heng Ji; Kuan-Hao Huang; Martin D. Burke; Thao Nguyen; Ying Diao

arXiv: 2410.02082 · v4 · submitted 2024-10-02 · cs.LG · q-bio.QM

Thao Nguyen, Kuan-Hao Huang, Ge Liu, Martin D. Burke, Ying Diao, Heng Ji

Pith reviewed 2026-05-23 19:46 UTC · model grok-4.3

keywords: functional groups · molecular representations · SMILES · graph neural networks · contrastive learning · MoleculeNet · drug discovery · molecular graphs

The pith

Adding functional group annotations to atoms in SMILES strings and graphs yields unified embeddings that reach state-of-the-art results on eight of thirteen MoleculeNet tasks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces FARM to bridge SMILES strings, natural language, and molecular graphs by injecting functional group membership directly into atomic tokens. This produces FG-enhanced SMILES for masked language modeling and FG graphs whose group connectivity is processed by a graph neural network. Contrastive learning then aligns the two resulting views into a single embedding space that carries both atom-level detail and higher-level topology. The resulting representations reach state-of-the-art results on 8 of 13 MoleculeNet tasks and generalize to a photostability dataset. A reader would care because chemically richer embeddings could speed up virtual screening and property prediction in drug and materials design.
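To make the idea of injecting functional group membership into atomic tokens concrete, here is a minimal sketch. The `<atom>_<group>` token format, the group names, and the helper `fg_enhance` are illustrative assumptions; this page does not specify FARM's actual vocabulary or annotation tool.

```python
# Hypothetical illustration: the "<atom>_<group>" format and the group
# labels are assumptions, not FARM's documented vocabulary.

def fg_enhance(tokens, fg_labels):
    """Attach a functional-group label to each atom token.

    tokens:    SMILES tokens in order (atoms, bonds, branch symbols).
    fg_labels: one slot per token; None for non-atom tokens such as "(" or "=".
    """
    if len(tokens) != len(fg_labels):
        raise ValueError("need exactly one label slot per token")
    enhanced = []
    for token, label in zip(tokens, fg_labels):
        # Non-atom tokens (bonds, parentheses) pass through unchanged.
        enhanced.append(token if label is None else f"{token}_{label}")
    return enhanced


# Acetic acid, CC(=O)O: a methyl carbon attached to a carboxylic acid group.
tokens = ["C", "C", "(", "=", "O", ")", "O"]
labels = ["alkyl", "carboxylic_acid", None, None,
          "carboxylic_acid", None, "carboxylic_acid"]
enhanced = fg_enhance(tokens, labels)
print(enhanced)
# The two carbons now map to different tokens, so the vocabulary grows.
print(len(set(tokens)), "->", len(set(enhanced)))
```

In this toy case the two carbon atoms, which share one token in plain SMILES, become distinct tokens once their group membership is attached. That is the sense in which the annotation expands the effective vocabulary.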

Core claim

FARM learns molecular representations from two complementary perspectives: masked language modeling on FG-enhanced SMILES that captures atom-level features enriched with functional context, and graph neural networks on FG graphs that model higher-level molecular topology through functional group connectivity. Contrastive learning aligns these views into a unified embedding space. When evaluated on the MoleculeNet benchmark this produces state-of-the-art performance on 8 out of 13 tasks and supports transfer to a photostability dataset for quantum mechanical properties.

What carries the argument

Functional group-enhanced SMILES and FG graphs aligned by contrastive learning between a masked language model and a graph neural network.
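The alignment step can be sketched with an InfoNCE-style loss. This is a common contrastive objective used here as a stand-in: the page does not state which contrastive loss FARM uses, and the function names, the temperature value, and the toy embeddings are assumptions.

```python
import math

def cosine(u, v):
    """Cosine similarity of two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

def info_nce(smiles_embs, graph_embs, temperature=0.1):
    """Mean cross-entropy of picking the matching graph view for each SMILES view.

    Row i of both inputs is assumed to describe the same molecule.
    Only the SMILES-to-graph direction is computed in this sketch.
    """
    n = len(smiles_embs)
    total = 0.0
    for i in range(n):
        logits = [cosine(smiles_embs[i], graph_embs[j]) / temperature
                  for j in range(n)]
        # Numerically stable log-sum-exp over all candidates.
        m = max(logits)
        log_denom = m + math.log(sum(math.exp(x - m) for x in logits))
        total += log_denom - logits[i]  # negative log-softmax of the positive
    return total / n

# Toy embeddings: matched views agree in the first call, are swapped in the second.
aligned = info_nce([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
shuffled = info_nce([[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]])
print(aligned < shuffled)
```

The loss is small when each molecule's SMILES embedding is closest to its own graph embedding and large when the pairing is scrambled, which is the pressure that pulls the two views into one space.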

If this is right

The same representations support stronger transfer learning across drug discovery and materials science tasks.
Atom-level functional context improves predictions for both small-molecule properties and quantum mechanical quantities.
The unified embedding space enables applications in pharmaceutical research and functional material design.
FG-enhanced tokenization expands the effective molecular vocabulary for Transformer models.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same annotation scheme could be tested on larger or more diverse chemical libraries to check whether the gains scale.
If functional groups prove decisive, similar label injection might improve other graph-language hybrid models beyond this architecture.
The contrastive alignment step could be replaced or augmented with different objectives to isolate which part drives the observed gains.

Load-bearing premise

Functional group annotations at the atomic level supply chemical knowledge that standard SMILES tokenization and atom-level graphs do not already capture, and the contrastive alignment between the two views produces a genuinely more informative embedding.

What would settle it

The premise would fail if an ablation that removes all functional group annotations, while keeping the rest of the architecture and training identical, produced the same or better MoleculeNet scores.

Figures

Figures reproduced from arXiv: 2410.02082 by Ge Liu, Heng Ji, Kuan-Hao Huang, Martin D. Burke, Thao Nguyen, Ying Diao.

**Figure 2.** (a) FARM’s molecular representation learning model architecture. (b) Functional group …

**Figure 3.** Visualization of the attention map of the BERT model trained with functional group …

**Figure 4.** (a) Visualization of functional group knowledge graph embedding space: Clusters of five …

**Figure 5.** (a) Example of naming a fused ring system in 4 steps: generate the core structure of the …

**Figure 6.** Number of functional groups associated with different chemical elements in the FG …

**Figure 7.** Loss curves for the masked language model (MLM) during training on two datasets: …

**Figure 8.** Link prediction model performance: Similar to word embedding analogies in NLP, replacing …

Original abstract

We introduce Functional Group-Aware Representations for Small Molecules (FARM), a novel foundation model designed to bridge the gap between SMILES, natural language, and molecular graphs. The key idea behind FARM is the incorporation of functional group (FG) annotations at the atomic level, enabling both FG-enhanced SMILES and FG graphs. In this representation, SMILES strings are enriched with functional group information that identifies the group membership of each atom, while the FG graph captures molecular structure by representing how functional groups are connected. This tokenization injects chemical knowledge into SMILES and expands the effective molecular vocabulary, making the representation more suitable for Transformer-based models and more aligned with natural language structure. FARM learns molecular representations from two complementary perspectives to jointly encode functional and structural information. Masked language modeling on FG-enhanced SMILES captures atom-level features enriched with functional context, while graph neural networks model higher-level molecular topology through functional group connectivity. Contrastive learning is then used to align these two views into a unified embedding space, ensuring that both atom-level detail and functional group structure are jointly represented. We evaluate FARM on the MoleculeNet benchmark and achieve state-of-the-art performance on 8 out of 13 tasks. We further validate its generalization ability on a photostability dataset for quantum mechanical properties. These results demonstrate that FARM improves molecular representation learning, supports strong transfer learning across drug discovery and materials science, and enables broad applications in pharmaceutical research and functional material design.
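The masked language modeling step described in the abstract can be sketched as follows. The `[MASK]` symbol, the 15% masking rate (the rate used by the original BERT), the `<atom>_<group>` token format, and the choice to mask only FG-labelled atom tokens are all assumptions for illustration; the abstract does not give these details.

```python
import random

MASK = "[MASK]"

def mask_tokens(tokens, mask_rate=0.15, seed=0):
    """Return (masked_tokens, targets); targets maps position -> original token.

    Assumption: only atom tokens carrying an FG label (marked by "_") are
    masking candidates, and at least one token is masked per sequence.
    """
    rng = random.Random(seed)
    candidates = [i for i, t in enumerate(tokens) if "_" in t]
    k = max(1, round(mask_rate * len(candidates))) if candidates else 0
    chosen = rng.sample(candidates, k)
    masked = list(tokens)
    targets = {}
    for i in chosen:
        targets[i] = tokens[i]   # what the model must recover
        masked[i] = MASK         # what the model actually sees
    return masked, targets

# Hypothetical FG-enhanced tokens for acetic acid, CC(=O)O.
tokens = ["C_alkyl", "C_carboxylic_acid", "(", "=",
          "O_carboxylic_acid", ")", "O_carboxylic_acid"]
masked, targets = mask_tokens(tokens)
print(masked)
print(targets)
```

Because each target token names both the element and its group, recovering a masked position requires the model to infer functional context from the surrounding tokens rather than the element alone.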

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

FARM adds FG annotations to SMILES, builds FG graphs, and aligns the two views contrastively, reaching SOTA on 8/13 MoleculeNet tasks with clean ablations but only modest gains.

The main takeaway is that FARM enriches SMILES tokens with functional group labels, builds a separate FG graph, runs masked language modeling on the first view and a GNN on the second, and then uses a contrastive loss to align the embeddings. It reports state-of-the-art results on 8 of 13 MoleculeNet tasks using the standard splits and metrics, plus a small extra check on photostability data. The ablations are present and the numbers include standard deviations, which makes the empirical claims easier to evaluate than in many similar papers. The training procedure is described without obvious circularity or mismatched protocols.

What is actually new is the specific combination of FG-annotated tokenization, the FG graph construction, and the three-stage alignment; none of those pieces is revolutionary by itself, but the package is put together coherently. The soft spots are that the gains over strong baselines are fairly small on several tasks, and that the paper omits some of the more recent multi-view and graph transformer models published after the baselines it cites. It is also not fully clear how much information the functional group labels supply that a good atom-level model would not already capture, even though the ablations point to some benefit.

This is a standard empirical molecular ML paper aimed at people who run property prediction benchmarks. Readers working on drug discovery or materials representations will find the numbers worth checking. The evaluation is reproducible enough and the claims are testable, so it deserves a serious referee rather than a desk reject.

Referee Report

0 major / 2 minor

Summary. The paper introduces FARM, a foundation model for small molecules that augments SMILES strings with functional group (FG) annotations at the atomic level and constructs corresponding FG graphs. It trains via masked language modeling on the FG-enhanced SMILES, graph neural networks on the FG graphs, and contrastive alignment between the two views to produce unified embeddings. The central empirical claim is state-of-the-art performance on 8 out of 13 MoleculeNet tasks together with generalization results on a photostability dataset for quantum-mechanical properties.

Significance. If the reported gains hold under the stated evaluation protocol, the explicit injection of functional-group information could supply chemically grounded features that standard SMILES tokenization or atom-level graphs do not fully capture, with downstream utility in drug discovery and materials design. The work is an empirical ML study that uses canonical MoleculeNet splits, reports standard deviations, and presents internally consistent ablations; the stress-test concern about information gain from FG annotations does not introduce circularity or inconsistency in the manuscript.

minor comments (2)

Abstract: the SOTA claim on 8/13 tasks would be clearer if the abstract briefly noted the use of standard MoleculeNet splits, the set of baselines, and the reporting of standard deviations (these details appear in the main text).
The definition and construction of the FG graph (how functional groups are identified and connected) could be illustrated with a small concrete example in the methods section to aid reproducibility.
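A small concrete example of the kind the second comment asks for might look like the sketch below. It assumes every atom has already been assigned to exactly one functional group, and that two groups are connected whenever a bond crosses between them. The group names, the atom assignment, and `build_fg_graph` are hypothetical; the paper's actual construction may differ, for instance for fused rings.

```python
def build_fg_graph(atom_to_group, bonds):
    """Collapse an atom-level graph into a functional-group graph.

    atom_to_group: dict atom index -> group id (each atom in exactly one group).
    bonds:         iterable of (atom_i, atom_j) pairs.
    Returns (nodes, edges): sorted group ids and sorted undirected group pairs.
    """
    nodes = sorted(set(atom_to_group.values()))
    edges = set()
    for i, j in bonds:
        gi, gj = atom_to_group[i], atom_to_group[j]
        if gi != gj:  # bonds inside one group do not create an edge
            edges.add(tuple(sorted((gi, gj))))
    return nodes, sorted(edges)


# Ethyl acetate, CC(=O)OCC, atoms indexed 0..5 in SMILES order.
atom_to_group = {0: "methyl", 1: "ester", 2: "ester",
                 3: "ester", 4: "ethyl", 5: "ethyl"}
bonds = [(0, 1), (1, 2), (1, 3), (3, 4), (4, 5)]
nodes, edges = build_fg_graph(atom_to_group, bonds)
print(nodes)   # ['ester', 'ethyl', 'methyl']
print(edges)   # [('ester', 'ethyl'), ('ester', 'methyl')]
```

Six atoms and five bonds reduce to three group nodes and two edges, which shows how the FG graph trades atom-level detail for a coarser view of topology.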

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for the positive assessment of FARM, the recognition of its empirical contributions on MoleculeNet, and the recommendation for minor revision. No major comments appear in the report.

Circularity Check

0 steps flagged

No significant circularity

This is an empirical machine-learning paper introducing a molecular representation model (FG-enhanced SMILES + FG graphs + MLM + GNN + contrastive alignment) and reporting benchmark results on MoleculeNet. No derivation chain, equations, or 'predictions' are present that reduce by construction to fitted parameters or self-defined quantities inside the paper. The central performance claims rest on standard training and evaluation protocols on public data splits; no self-citation load-bearing steps, uniqueness theorems, or ansatz smuggling appear in the provided text. The work is self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 1 invented entity

The central claim rests on the domain assumption that functional groups can be reliably annotated at the atomic level and that this annotation supplies information orthogonal to standard molecular graphs and SMILES. Standard deep-learning training assumptions (convergence of contrastive objectives, absence of data leakage) are also required but not stated explicitly.

free parameters (1)

loss weighting coefficients among MLM, GNN, and contrastive terms
Joint training of three objectives requires tunable scalars whose values are not reported in the abstract.
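If the three objectives were combined in a single joint loss, the weighting could be written as below. This is only a sketch: the symbols are introduced here for illustration, and the desk editor's letter describes a staged procedure, so a joint weighted sum may not match how the paper actually trains.

```latex
\mathcal{L}_{\text{total}}
  = \lambda_{\text{MLM}}\,\mathcal{L}_{\text{MLM}}
  + \lambda_{\text{GNN}}\,\mathcal{L}_{\text{GNN}}
  + \lambda_{\text{con}}\,\mathcal{L}_{\text{con}},
\qquad
\lambda_{\text{MLM}},\ \lambda_{\text{GNN}},\ \lambda_{\text{con}} \ge 0
```

Under this reading, the free parameters counted in the ledger are the three scalars $\lambda_{\text{MLM}}$, $\lambda_{\text{GNN}}$, and $\lambda_{\text{con}}$.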

axioms (1)

domain assumption: Functional groups can be accurately and unambiguously annotated at the atomic level for arbitrary small molecules
The key idea of FG-enhanced SMILES and FG graphs depends on this annotation step being reliable and chemically meaningful.

invented entities (1)

FG graph (no independent evidence)
purpose: Represent molecular topology at the level of functional-group connectivity rather than atom connectivity
The FG graph is introduced as a new structural view whose utility is demonstrated only through the reported benchmark gains.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
Paper passage: "FG-aware tokenization... masked language modeling on FG-enhanced SMILES... graph neural networks... contrastive learning to align these two views"

IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · unclear
Paper passage: "We evaluate FARM on the MoleculeNet benchmark and achieve state-of-the-art performance on 8 out of 13 tasks"

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

[1] [1]

Applications of quantitative structure-activity relationships (qsar) based virtual screening in drug design: A review

Patnala GR Achary. Applications of quantitative structure-activity relationships (qsar) based virtual screening in drug design: A review. Mini Reviews in Medicinal Chemistry, 20:1375–1388, 2020

work page 2020

[2] [2]

Closed-loop transfer enables artificial intelligence to yield chemical knowledge

Nicholas H Angello, David M Friday, Changhyun Hwang, Seungjoo Yi, Austin H Cheng, Tiara C Torres- Flores, Edward R Jira, Wesley Wang, Alán Aspuru-Guzik, Martin D Burke, et al. Closed-loop transfer enables artificial intelligence to yield chemical knowledge. Nature, 633(8029):351–358, 2024

work page 2024

[3] [3]

Drug–target interaction prediction: databases, web servers and computational models

Xing Chen, Chenggang Clarence Yan, Xiaotian Zhang, Xu Zhang, Feng Dai, Jian Yin, and Yongdong Zhang. Drug–target interaction prediction: databases, web servers and computational models. Briefings in bioinformatics, 17:696–712, 2016

work page 2016

[4] [4]

arXiv preprint arXiv:2406.14021(2024)

Yongqiang Chen, Quanming Yao, Juzheng Zhang, James Cheng, and Yatao Bian. Hight: Hierarchical graph tokenization for graph-language alignment. arXiv preprint arXiv:2406.14021, 2024

work page arXiv 2024

[5] [5]

On the art of compiling and using’drug-like’chemical fragment spaces.ChemMedChem, 3:1503, 2008

Jorg Degen, Christof Wegscheid-Gerlach, Andrea Zaliani, and Matthias Rarey. On the art of compiling and using’drug-like’chemical fragment spaces.ChemMedChem, 3:1503, 2008

work page 2008

[6] [6]

BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding

Jacob Devlin. Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805, 2018

work page internal anchor Pith review Pith/arXiv arXiv 2018

[7] [7]

Translation between molecules and natural language

Carl Edwards, Tuan Lai, Kevin Ros, Garrett Honke, Kyunghyun Cho, and Heng Ji. Translation between molecules and natural language. In Proc. The 2022 Conference on Empirical Methods in Natural Language Processing (EMNLP2022), 2022

work page 2022

[8] [8]

Synergpt: In-context learning for personalized drug synergy prediction and drug design

Carl Edwards, Aakanksha Naik, Tushar Khot, Martin Burke, Heng Ji, and Tom Hope. Synergpt: In-context learning for personalized drug synergy prediction and drug design. In Proc. 1st Conference on Language Modeling (COLM2024), 2024

work page 2024

[9] Carl Edwards, Qingyun Wang, Lawrence Zhao, and Heng Ji. L+m-24: Building a dataset for language + molecules acl 2024. In Proc. ACL 2024 Workshop on Language+Molecules, 2024

[10] Xiaomin Fang, Lihang Liu, Jieqiong Lei, Donglong He, Shanzhuo Zhang, Jingbo Zhou, Fan Wang, Hua Wu, and Haifeng Wang. Geometry-enhanced molecular representation learning for property prediction. Nature Machine Intelligence, 4(2):127–134, 2022

[11] Yin Fang, Qiang Zhang, Ningyu Zhang, Zhuo Chen, Xiang Zhuang, Xin Shao, Xiaohui Fan, and Huajun Chen. Knowledge graph-enhanced molecular contrastive learning with functional prompt. Nature Machine Intelligence, 5(5):542–553, 2023

[12] A Gaulton, L Bellis, J Chambers, M Davies, A Hersey, Y Light, S McGlinchey, R Akhtar, F Atkinson, AP Bento, et al. Chembl: A large-scale bioactivity database for chemical biology and drug discovery. Nucleic Acids Research. Database, page D1, 2011

[13] Shen Han, Haitao Fu, Yuyang Wu, Ganglan Zhao, Zhenyu Song, Feng Huang, Zhongfei Zhang, Shichao Liu, and Wen Zhang. Himgnn: a novel hierarchical molecular graph representation learning framework for property prediction. Briefings in Bioinformatics, 24(5):bbad305, 2023

[14] Weihua Hu, Bowen Liu, Joseph Gomes, Marinka Zitnik, Percy Liang, Vijay Pande, and Jure Leskovec. Strategies for pre-training graph neural networks. arXiv preprint arXiv:1905.12265, 2019

[15] John J Irwin and Brian K Shoichet. Zinc - a free database of commercially available compounds for virtual screening. Journal of chemical information and modeling, 45:177–182, 2005

[16] G Landrum. Rdkit: Open-source cheminformatics. https://www.rdkit.org, 2010. Accessed: 2024-09-19

[17] Biaoshun Li, Mujie Lin, Tiegen Chen, and Ling Wang. Fg-bert: a generalized and self-supervised functional group-based molecular representation learning framework for properties prediction. Briefings in Bioinformatics, 24(6):bbad398, 2023

[18] Shengchao Liu, Hanchen Wang, Weiyang Liu, Joan Lasenby, Hongyu Guo, and Jian Tang. Pre-training molecular graph representation with 3d geometry. arXiv preprint arXiv:2110.07728, 2021

[19] Thao Nguyen, Tiara Torres-Flores, Changhyun Hwang, Carl Edwards, Ying Diao, and Heng Ji. Glad: Synergizing molecular graphs and language descriptors for enhanced power conversion efficiency prediction in organic photovoltaic devices. In Proc. 33rd ACM International Conference on Information and Knowledge Management (CIKM 2024), 2024

[20] Thao Nguyen, Tiara Torres-Flores, Changhyun Hwang, Carl Edwards, Ying Diao, and Heng Ji. Glad: Synergizing molecular graphs and language descriptors for enhanced power conversion efficiency prediction in organic photovoltaic devices. 33rd ACM International Conference on Information and Knowledge Management, 2024

[21] Gabriel A Pinheiro, Juarez LF Da Silva, and Marcos G Quiles. Smiclr: contrastive learning on multiple molecular representations for semisupervised and unsupervised representation learning. Journal of Chemical Information and Modeling, 62(17):3948–3960, 2022

[22] Yu Rong, Yatao Bian, Tingyang Xu, Weiyang Xie, Ying Wei, Wenbing Huang, and Junzhou Huang. Self-supervised graph transformer on large-scale molecular data. Advances in neural information processing systems, 33:12559–12571, 2020

[23] Jie Shen and Christos A Nicolaou. Molecular property prediction: recent trends in the era of artificial intelligence. Drug Discovery Today: Technologies, 32:29–36, 2019

[24] Théo Trouillon, Johannes Welbl, Sebastian Riedel, Éric Gaussier, and Guillaume Bouchard. Complex embeddings for simple link prediction. In International conference on machine learning, pages 2071–2080. PMLR, 2016

[25] Petar Veličković, Guillem Cucurull, Arantxa Casanova, Adriana Romero, Pietro Lio, and Yoshua Bengio. Graph attention networks. arXiv preprint arXiv:1710.10903, 2017

[26] W Patrick Walters and Regina Barzilay. Applications of deep learning in molecule generation and molecular property prediction. Accounts of chemical research, 54:263–270, 2020

[27] Hongwei Wang, Weijiang Li, Xiaomeng Jin, Kyunghyun Cho, Heng Ji, Jiawei Han, and Martin Burke. Chemical-reaction-aware molecule representation learning. In Proc. The International Conference on Learning Representations (ICLR2022), 2022

[28] Yifei Wang, Shiyang Chen, Guobin Chen, Ethan Shurberg, Hang Liu, and Pengyu Hong. Motif-based graph representation learning with application to chemical molecules. In Informatics, volume 10, page 8. MDPI, 2023

[29] Yuyang Wang, Jianren Wang, Zhonglin Cao, and Amir Barati Farimani. Molecular contrastive learning of representations via graph neural networks. Nature Machine Intelligence, 4(3):279–287, 2022

[30] Ming Wen, Zhimin Zhang, Shaoyu Niu, Haozhi Sha, Ruihan Yang, Yonghuan Yun, and Hongmei Lu. Deep-learning-based drug–target interaction prediction. Journal of proteome research, 16:1401–1409, 2017

[31] Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, Pierric Cistac, Tim Rault, Rémi Louf, Morgan Funtowicz, et al. Transformers: State-of-the-art natural language processing. In Proceedings of the 2020 conference on empirical methods in natural language processing: system demonstrations, pages 38–45, 2020

[32] Fang Wu, Dragomir Radev, and Stan Z Li. Molformer: Motif-based transformer on 3d heterogeneous molecular graphs. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 37, pages 5312–5320, 2023

[33] Zhenqin Wu, Bharath Ramsundar, Evan N Feinberg, Joseph Gomes, Caleb Geniesse, Aneesh S Pappu, Karl Leswing, and Vijay Pande. Moleculenet: a benchmark for molecular machine learning. Chemical science, 9:513–530, 2018

[34] Jun Xia, Chengshuai Zhao, Bozhen Hu, Zhangyang Gao, Cheng Tan, Yue Liu, Siyuan Li, and Stan Z Li. Mole-bert: Rethinking pre-training graph neural networks for molecules. The Eleventh International Conference on Learning Representations, ICLR 2023, 2023

[35] Zhaoping Xiong, Dingyan Wang, Xiaohong Liu, Feisheng Zhong, Xiaozhe Wan, Xutong Li, Zhaojun Li, Xiaomin Luo, Kaixian Chen, Hualiang Jiang, et al. Pushing the boundaries of molecular representation for drug discovery with the graph attention mechanism. Journal of medicinal chemistry, 63:8749–8760, 2019

[36] Keqiang Yan, Xiner Li, Hongyi Ling, Kenna Ashen, Carl Edwards, Raymundo Arroyave, Marinka Zitnik, Heng Ji, Xiaofeng Qian, Xiaoning Qian, and Shuiwang Ji. Invariant tokenization for language model enabled crystal materials generation. In Proc. the Thirty-eighth Annual Conference on Neural Information Processing Systems (NeurIPS2024), 2024

[37] Nianzu Yang, Kaipeng Zeng, Qitian Wu, Xiaosong Jia, and Junchi Yan. Learning substructure invariance for out-of-distribution molecular representations. Advances in Neural Information Processing Systems, 35:12964–12978, 2022

[38] Zhaoning Yu and Hongyang Gao. Molecular representation learning via heterogeneous motif graph neural networks. In International Conference on Machine Learning, pages 25581–25594. PMLR, 2022

[39] Xuan Zang, Xianbing Zhao, and Buzhou Tang. Hierarchical molecular graph self-supervised learning for property prediction. Communications Chemistry, 6(1):34, 2023

[40] Shichang Zhang, Ziniu Hu, Arjun Subramonian, and Yizhou Sun. Motif-driven contrastive learning of graph representations. arXiv preprint arXiv:2012.12533, 2020

[41] Xuan Zhang, Limei Wang, Jacob Helwig, Youzhi Luo, Cong Fu, Yaochen Xie, Meng Liu, Yuchao Lin, Zhao Xu, Keqiang Yan, Keir Adams, Maurice Weiler, Xiner Li, Tianfan Fu, Yucheng Wang, Haiyang Yu, YuQing Xie, Xiang Fu, Alex Strasser, Shenglong Xu, Yi Liu, Yuanqi Du, Alexandra Saxton, Hongyi Ling, Hannah Lawrence, Hannes Stärk, Shurui Gui, Carl Edwards, Nichola...

[42] Xuan Zhang, Limei Wang, Jacob Helwig, Youzhi Luo, Cong Fu, Yaochen Xie, Meng Liu, Yuchao Lin, Zhao Xu, Keqiang Yan, et al. Artificial intelligence for science in quantum, atomistic, and continuum systems. arXiv preprint arXiv:2307.08423, 2023

[43] Zaixi Zhang, Qi Liu, Hao Wang, Chengqiang Lu, and Chee-Kong Lee. Motif-based graph self-supervised learning for molecular property prediction. Advances in Neural Information Processing Systems, 34:15870–15882, 2021

[44] Gengmo Zhou, Zhifeng Gao, Qiankun Ding, Hang Zheng, Hongteng Xu, Zhewei Wei, Linfeng Zhang, and Guolin Ke. Uni-mol: A universal 3d molecular representation learning framework. The Eleventh International Conference on Learning Representations, ICLR 2023, 2023

A Molecular Datasets

A.1 Training data

We collected a diverse dataset to train our FARM model ...