arxiv: 2605.16480 · v1 · pith:AJOP2LMOnew · submitted 2026-05-15 · 🧬 q-bio.BM · cs.AI

MoleCode unlocks structural intelligence in large language models

Pith reviewed 2026-05-19 21:34 UTC · model grok-4.3

classification 🧬 q-bio.BM cs.AI
keywords MoleCode · molecular representation · large language models · SMILES · graph structure · molecular topology · chemical reasoning · training-free method

The pith

MoleCode makes molecular topology directly readable, editable and auditable by LLMs instead of hidden in SMILES strings.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Molecules are graphs, yet LLMs usually receive them as compact SMILES sequences where bonds, branches and rings must be mentally reconstructed before any operation. MoleCode supplies a training-free alternative that encodes every component as typed entities carrying persistent identifiers and explicit relations through a Subgraph-Node-Edge grammar. The model can therefore read, edit and audit structure inside the prompt itself rather than recover it from syntax. Gains appear most clearly on unfamiliar molecules, topology-sensitive edits, larger structures and repetitive polymers, while reasoning traces shorten and become more chemically focused. The same grammar also covers polymers, Markush structures and mixed text-image scientific documents.

Core claim

MoleCode is an LLM-native, training-free, graph-explicit molecular language in which all molecular components are represented as typed entities with persistent identifiers and explicit relations. This makes molecular topology directly readable, editable and auditable within the language context, allowing an LLM to operate on structure rather than recover it from syntax. Across reasoning, editing, generation and analysis tasks the shift improves frontier models most when structural access limits performance, replaces long reconstruction traces with shorter chemically directed reasoning, and supports localized property-aligned edits that preserve similarity to starting compounds.

What carries the argument

A Subgraph-Node-Edge grammar supplies typed entities with persistent identifiers and explicit relations, so topology becomes first-class content inside LLM prompts.
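
To make the idea concrete, here is a minimal Python sketch of a graph-explicit encoding. It is not MoleCode's actual syntax, which this page does not reproduce: the `Node` and `Edge` classes, the `render` and `neighbors` helpers, and the `NODE`/`EDGE` line format are all hypothetical, chosen only to illustrate typed entities with persistent identifiers and explicit relations. The example encodes toluene (SMILES `Cc1ccccc1`), a ring with one branch, with hydrogens left implicit.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Node:
    """An atom with a persistent identifier (hypothetical format)."""
    id: str
    element: str


@dataclass(frozen=True)
class Edge:
    """A bond stated as an explicit relation between two node ids."""
    id: str
    src: str
    dst: str
    order: str  # e.g. "single", "double", "aromatic"


def toluene() -> tuple[list[Node], list[Edge]]:
    """Build toluene: a methyl carbon (n0) attached to a benzene ring (n1..n6)."""
    nodes = [Node(f"n{i}", "C") for i in range(7)]
    edges = [Edge("e0", "n0", "n1", "single")]
    # Ring bonds n1-n2, n2-n3, ..., n6-n1 are written out one by one,
    # instead of being implied by ring-closure digits as in SMILES.
    for k in range(6):
        a, b = 1 + k, 1 + (k + 1) % 6
        edges.append(Edge(f"e{k + 1}", f"n{a}", f"n{b}", "aromatic"))
    return nodes, edges


def render(nodes: list[Node], edges: list[Edge]) -> str:
    """Serialize the graph as one line per entity, suitable for a prompt."""
    lines = [f"NODE {n.id} element={n.element}" for n in nodes]
    lines += [f"EDGE {e.id} {e.src} -{e.order}- {e.dst}" for e in edges]
    return "\n".join(lines)


def neighbors(edges: list[Edge], node_id: str) -> list[str]:
    """Read connectivity directly from the edge list; no string parsing."""
    out = []
    for e in edges:
        if e.src == node_id:
            out.append(e.dst)
        elif e.dst == node_id:
            out.append(e.src)
    return sorted(out)


if __name__ == "__main__":
    nodes, edges = toluene()
    print(render(nodes, edges))
    print(neighbors(edges, "n1"))
```

In this toy format, a question such as "which atoms are bonded to the ring carbon carrying the methyl group" becomes a lookup over `EDGE` lines rather than a reconstruction of ring closures and branch parentheses.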

If this is right

  • Molecular optimization produces localized edits that preserve structural similarity while aligning with target properties.
  • Inference effort moves from lengthy implicit reconstruction to shorter, chemically directed reasoning over explicit atoms and bonds.
  • The grammar extends without change to polymers, Markush structures, mechanism-style transformations and documents that interleave text with chemical images.
  • Performance improvements concentrate on unfamiliar molecules, topology-sensitive operations, larger structures and repetitive polymers.
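
One way to read "localized" and "auditable" edits is that, once every atom and bond has a persistent identifier, an edit reduces to a set difference over entities. The sketch below works under that assumption. The dictionary layout, `diff_graph`, and `jaccard` are hypothetical and do not come from the paper, and the Jaccard score over entities is only a crude stand-in for the fingerprint-based similarity measures normally used in cheminformatics.

```python
def diff_graph(before: dict[str, tuple], after: dict[str, tuple]) -> dict[str, list[str]]:
    """Compare two graphs keyed by persistent entity id.

    Each value is a tuple describing the entity, e.g. ("C",) for an atom
    or ("n0", "n1", "single") for a bond.
    """
    added = sorted(k for k in after if k not in before)
    removed = sorted(k for k in before if k not in after)
    changed = sorted(k for k in before if k in after and before[k] != after[k])
    return {"added": added, "removed": removed, "changed": changed}


def jaccard(before: dict[str, tuple], after: dict[str, tuple]) -> float:
    """Fraction of (id, description) pairs shared by both graphs."""
    a, b = set(before.items()), set(after.items())
    return len(a & b) / len(a | b) if a | b else 1.0


# Ethanol (SMILES CCO): two carbons and an oxygen, hydrogens implicit.
ethanol = {
    "n0": ("C",), "n1": ("C",), "n2": ("O",),
    "e0": ("n0", "n1", "single"), "e1": ("n1", "n2", "single"),
}
# Edit: swap the oxygen for nitrogen, giving ethylamine (SMILES CCN).
ethylamine = dict(ethanol, n2=("N",))

report = diff_graph(ethanol, ethylamine)
similarity = jaccard(ethanol, ethylamine)
print(report, similarity)
```

Because identifiers persist across the edit, the report names exactly one changed entity (`n2`), which is the kind of trace an auditor could check without re-parsing either structure.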

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Explicit structural interfaces may prove useful for any relational scientific object whose topology is currently decoded from linear text.
  • The approach could be tested on protein or materials graphs to check whether the same reduction in reconstruction overhead appears.
  • Auditability of edits inside the prompt may lower error rates in multi-step chemical planning by making each structural change traceable.

Load-bearing premise

Frontier LLMs can immediately exploit the explicit Subgraph-Node-Edge grammar in ordinary prompts for better reasoning without training or fine-tuning, and observed gains come specifically from structural access rather than prompt length or other factors.

What would settle it

A controlled comparison of MoleCode versus length-matched SMILES prompts on identical tasks that finds no performance difference once prompt length and wording are equalized.
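
A minimal sketch of how such a comparison might be scored, assuming each task is run once under both prompt formats and the prompts have already been length-matched upstream. It applies an exact McNemar test to the paired pass/fail outcomes. The function name and the outcome lists are placeholders for illustration, not results or methods from the paper.

```python
from math import comb


def mcnemar_exact(a_correct: list[bool], b_correct: list[bool]) -> float:
    """Two-sided exact McNemar p-value for paired per-task outcomes."""
    if len(a_correct) != len(b_correct):
        raise ValueError("outcome lists must be paired task by task")
    # Only discordant pairs (one format right, the other wrong) carry signal.
    only_a = sum(x and not y for x, y in zip(a_correct, b_correct))
    only_b = sum(y and not x for x, y in zip(a_correct, b_correct))
    n = only_a + only_b
    if n == 0:
        return 1.0
    k = min(only_a, only_b)
    # Binomial(n, 0.5) tail probability, doubled for a two-sided test.
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


# Placeholder outcomes for ten hypothetical tasks (not data from the paper).
graph_prompt_ok = [True] * 8 + [True, False]
smiles_prompt_ok = [False] * 8 + [True, False]

p_value = mcnemar_exact(graph_prompt_ok, smiles_prompt_ok)
print(p_value)
```

A large p-value on real, length-matched runs would be the "no performance difference" outcome described above; a small one would indicate that the gain survives the length control.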

Original abstract

Molecules are graphs, but large language models (LLMs) are usually asked to reason about them through linear strings. The most popular molecular representation, SMILES, compresses atoms, bonds, branches and rings into a compact sequence in which topology is implicit, forcing LLMs to reconstruct molecular structure before performing the requested chemical operation. Here we introduce MoleCode, an LLM-native, training-free, graph-explicit molecular language in which all molecular components are represented as typed entities with persistent identifiers and explicit relations. MoleCode makes molecular topology directly readable, editable and auditable within the language context, allowing an LLM to operate on structure rather than recover it from syntax. Across molecular reasoning, editing, generation and analysis tasks, this representational shift improves frontier LLMs most strongly when structural access is limiting: unfamiliar molecules, topology-sensitive operations, larger structures and repetitive polymers. It also changes how inference is allocated, replacing long reasoning traces devoted to implicit structural reconstruction with shorter, more chemically directed reasoning over explicit atoms and bonds. In molecular optimization, this enables localized, property-aligned edits that preserve structural similarity to the starting compounds. The same Subgraph-Node-Edge grammar extends beyond small molecules to polymers, Markush structures, mechanism-style transformations and interleaved scientific documents, including research articles and patent disclosures in which chemical information is distributed across text and images. These results suggest that the interface between scientific objects and LLMs should not treat structure as something to be decoded from text. When the object of reasoning is relational, the structure itself should be part of the language.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces MoleCode, a training-free, graph-explicit molecular representation using a Subgraph-Node-Edge grammar that assigns persistent identifiers and explicit relations to atoms, bonds, subgraphs and other components. It claims this allows frontier LLMs to directly read, edit and reason over molecular topology instead of implicitly reconstructing structure from linear strings such as SMILES, yielding performance gains on reasoning, editing, generation and analysis tasks (especially for unfamiliar molecules, topology-sensitive operations and larger or polymeric structures) while also extending the grammar to Markush structures, mechanisms and interleaved scientific documents.

Significance. If the empirical claims are substantiated with controlled experiments, the work would offer a practical, immediately usable interface for injecting explicit relational structure into LLM prompts for chemistry and biology. The training-free character and the extension to polymers and document-scale chemical information are notable strengths that could influence how structural objects are represented in scientific LLM applications.

major comments (2)
  1. [Abstract] Abstract and Results: the manuscript asserts quantitative improvements on molecular tasks but supplies no metrics, baselines, error bars, or task definitions, preventing any assessment of effect size or reproducibility.
  2. [Results] The central claim that observed gains arise specifically from the explicit Subgraph-Node-Edge grammar (rather than longer or more detailed prompts) is load-bearing yet untested; no ablation that holds total token count and surface-level chemical detail fixed while varying only the node/edge identifiers and relations is described.
minor comments (2)
  1. Provide a concise formal grammar or BNF for the Subgraph-Node-Edge syntax together with a side-by-side comparison to SMILES for a small molecule containing a ring and a branch.
  2. Clarify how persistent identifiers are maintained across multi-turn editing sessions and how the representation scales to very large polymers without token explosion.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive review and the opportunity to clarify the empirical basis of our claims. We address each major comment below and outline targeted revisions to improve quantitative reporting and experimental controls.

Point-by-point responses
  1. Referee: [Abstract] Abstract and Results: the manuscript asserts quantitative improvements on molecular tasks but supplies no metrics, baselines, error bars, or task definitions, preventing any assessment of effect size or reproducibility.

    Authors: We agree that the abstract and results presentation would benefit from greater quantitative detail. The manuscript reports comparative performance on reasoning, editing, generation and analysis tasks against SMILES baselines, with gains most pronounced for topology-sensitive and larger structures. To address reproducibility concerns, the revised manuscript will add explicit task definitions (e.g., success rate for localized edits, accuracy on ring-counting and connectivity queries), report error bars from multiple independent runs, and include a summary table of effect sizes and baselines in both the abstract and results sections.
    revision: yes

  2. Referee: [Results] The central claim that observed gains arise specifically from the explicit Subgraph-Node-Edge grammar (rather than longer or more detailed prompts) is load-bearing yet untested; no ablation that holds total token count and surface-level chemical detail fixed while varying only the node/edge identifiers and relations is described.

    Authors: This is a fair and important point. While our experiments contrast MoleCode against standard SMILES prompting, they do not include a controlled ablation that equalizes token count and surface-level chemical description while isolating the effect of persistent node/edge identifiers and explicit relations. We will add such an ablation in the revision, using length-matched prompt variants that expand SMILES with equivalent textual detail, to quantify the incremental benefit attributable to the structured grammar.
    revision: yes

Circularity Check

0 steps flagged

No circularity: paper introduces new representational grammar with empirical claims only

Full rationale

The manuscript proposes MoleCode as a new Subgraph-Node-Edge grammar for molecular structures to make topology explicit in LLM prompts. No mathematical derivations, fitted parameters, predictions, or equations are present in the provided text. The central claims rest on empirical performance improvements across tasks rather than any reduction of outputs to inputs by construction. No self-citations, uniqueness theorems, or ansatzes are invoked to justify the grammar itself. The contribution is a proposed interface change whose benefits are asserted through task results, not derived from prior fitted quantities or self-referential definitions. This is a standard non-circular empirical proposal of a representational format.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The central claim rests on the domain assumption that LLMs can directly exploit explicit relational structure in context windows. MoleCode itself is the primary invented entity. No free parameters or mathematical axioms are specified in the abstract.

axioms (1)
  • domain assumption: Frontier LLMs can directly utilize explicit graph representations with typed entities and relations in their context for reasoning without additional training.
    The method is described as training-free and relies on the LLM operating on the provided explicit structure.
invented entities (1)
  • MoleCode (no independent evidence)
    purpose: An LLM-native molecular language using typed entities, persistent identifiers, and explicit relations to represent molecular graphs directly.
    New representation introduced to replace implicit string encodings like SMILES.

pith-pipeline@v0.9.0 · 5845 in / 1288 out tokens · 42717 ms · 2026-05-19T21:34:51.396221+00:00 · methodology


