Do LLMs Truly Generalize in the Molecular Domain? A Perturbation-Based Analysis

Changmeng Zheng; Jiatong Li; Qing Li; Shufei Zhang; Weida Wang; Xiao-Yong Wei; Yatao Bian

arxiv: 2607.01800 · v1 · pith:KXQCHSWVnew · submitted 2026-07-02 · 💻 cs.LG · cs.CL

Pith reviewed 2026-07-03 17:14 UTC · model grok-4.3

keywords: molecular LLMs · generalization · perturbation analysis · graph edit distance · in-context tuning · molecular discovery · structural sensitivity · trust region

The pith

Molecular LLMs show fragile generalization, with even single structural edits causing large performance drops on standard tasks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests whether LLMs applied to molecules can move beyond the immediate neighborhoods of their training examples. It builds a perturbation method that creates valid new molecular structures at controlled distances measured by graph edits and runs the models on these variants. Performance falls sharply with minimal changes, indicating the models rely on very local patterns rather than broader chemical rules. In-context examples drawn from similar molecules reduce but do not eliminate the sensitivity.

Core claim

Using a Molecular Perturbation framework that produces syntax-valid structural variants under controlled Graph Edit Distance, the work shows that molecular LLMs suffer substantial performance drops even from a single edit, exposing narrow local trust regions and high sensitivity to structural variation; In-Context Tuning that anchors on similar molecules partially widens these regions.

What carries the argument

Molecular Perturbation framework that generates syntax-valid structural variants under controlled Graph Edit Distance to probe manifold regularity.
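A minimal sketch of what a controlled-edit perturbation could look like, assuming a toy graph representation (a list of atoms plus a dictionary of bond orders) and a simplified valence table. The names `MAX_VALENCE`, `valence_ok`, `single_edits`, and `perturb` are hypothetical and do not come from the paper; the authors' actual procedure may differ. Only atom substitutions and bond-order changes are modeled, and each is counted as one edit.

```python
import random

# Simplified maximum valences; an assumption for illustration only.
MAX_VALENCE = {"C": 4, "N": 3, "O": 2, "S": 2, "F": 1, "Cl": 1}


def valence_ok(atoms, bonds):
    """Check that no atom exceeds its maximum valence (implicit hydrogens fill the rest)."""
    used = [0] * len(atoms)
    for (i, j), order in bonds.items():
        used[i] += order
        used[j] += order
    return all(used[k] <= MAX_VALENCE[a] for k, a in enumerate(atoms))


def single_edits(atoms, bonds):
    """Enumerate all valence-respecting graphs exactly one edit away.

    An edit is an atom substitution or a change of one bond order by 1.
    Bonds are never deleted, so the graph stays connected.
    """
    out = []
    for k, a in enumerate(atoms):
        for b in MAX_VALENCE:
            if b != a:
                new_atoms = atoms[:k] + [b] + atoms[k + 1:]
                if valence_ok(new_atoms, bonds):
                    out.append((new_atoms, dict(bonds)))
    for key, order in bonds.items():
        for new_order in (order - 1, order + 1):
            if 1 <= new_order <= 3:
                new_bonds = dict(bonds)
                new_bonds[key] = new_order
                if valence_ok(atoms, new_bonds):
                    out.append((list(atoms), new_bonds))
    return out


def perturb(atoms, bonds, n_edits, seed=0):
    """Apply n_edits random valid edits; the inputs are left unmodified."""
    rng = random.Random(seed)
    for _ in range(n_edits):
        candidates = single_edits(atoms, bonds)
        if not candidates:
            break
        atoms, bonds = rng.choice(candidates)
    return atoms, bonds


# Ethanol as a toy graph: C-C-O with single bonds.
atoms = ["C", "C", "O"]
bonds = {(0, 1): 1, (1, 2): 1}
variant = perturb(atoms, bonds, n_edits=1)
```

Because a later random edit can undo an earlier one, `n_edits` is only an upper bound on the true Graph Edit Distance in this sketch; an exact-distance generator would need to verify the distance of each variant after the fact.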

If this is right

A single graph edit produces substantial drops on common molecular prediction tasks.
In-Context Tuning anchors predictions on similar molecules and partially expands the local trust region.
Sequence-based representations confine LLMs to narrow neighborhoods despite the similarity principle in chemistry.
Stabilizing molecular LLMs against structural variation requires methods that explicitly use nearby structures.
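The anchoring idea behind In-Context Tuning can be sketched as a retrieval step followed by prompt assembly. This is an illustration under stated assumptions, not the paper's method: fingerprints are represented as plain sets of on-bit indices (in practice they would come from a cheminformatics toolkit), the labels are toy values, and the prompt template and the helper names `retrieve_anchors` and `build_prompt` are hypothetical. The tuning step itself is not shown.

```python
def tanimoto(fp_a, fp_b):
    """Tanimoto (Jaccard) similarity between two fingerprints given as sets of on-bits."""
    if not fp_a and not fp_b:
        return 1.0
    return len(fp_a & fp_b) / len(fp_a | fp_b)


def retrieve_anchors(query_fp, pool, k=2):
    """Return the k pool entries most similar to the query fingerprint."""
    ranked = sorted(pool, key=lambda e: tanimoto(query_fp, e["fp"]), reverse=True)
    return ranked[:k]


def build_prompt(query_smiles, anchors):
    """Put the retrieved (molecule, label) pairs ahead of the unlabeled query."""
    blocks = [f"Molecule: {a['smiles']}\nLabel: {a['label']}" for a in anchors]
    blocks.append(f"Molecule: {query_smiles}\nLabel:")
    return "\n\n".join(blocks)


# Toy pool: fingerprints and labels are made up for illustration.
pool = [
    {"smiles": "CCO", "fp": {1, 2, 3}, "label": 1},
    {"smiles": "CCN", "fp": {1, 2, 4}, "label": 1},
    {"smiles": "c1ccccc1", "fp": {7, 8, 9}, "label": 0},
]
anchors = retrieve_anchors({1, 2, 3, 5}, pool, k=2)
prompt = build_prompt("CCCO", anchors)
```

Ranking by similarity to the query is what ties the prediction to its structural neighborhood; a perturbed molecule would retrieve anchors close to its original, which is one plausible reading of how the trust region gets widened.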

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same controlled-edit test could be run on protein or materials sequences to check whether the narrow-trust-region pattern appears in other structured domains.
Hybrid architectures that combine token-based LLMs with explicit graph encoders might enlarge the effective trust region beyond what in-context examples alone achieve.
If the fragility scales with edit distance in a predictable way, training objectives could be modified to penalize sensitivity inside small GED neighborhoods.

Load-bearing premise

The Graph Edit Distance metric and syntax-valid perturbation procedure produce structural variants whose property changes represent real generalization failures rather than artifacts of the editing rules or chosen tasks.

What would settle it

Models that maintain stable high performance across multiple molecular tasks when tested on the single-edit perturbed molecules would contradict the reported narrow trust regions.

Figures

Figures reproduced from arXiv: 2607.01800 by Changmeng Zheng, Jiatong Li, Qing Li, Shufei Zhang, Weida Wang, Xiao-Yong Wei, Yatao Bian.

**Figure 1.** Illustration of Molecular Perturbation, including atom perturbations and bond perturbations. The integration of Large Language Models (LLMs) into chemical research has enabled models to generate and predict molecular structures with apparent fluency. By linearizing molecular graphs into sequential notations like SMILES (Weininger, 1988) and SELFIES (Krenn et al., 2020), these models leverage the stati…

**Figure 2.** Comparison of atom-only and bond-only perturbations across five graph edit distance …

**Figure 3.** Performance degradation landscapes (smoothed) with increasing perturbation level. Each …

**Figure 4.** Absolute task performance on the original train/test splits and perturbed test sets (GED …

**Figure 5.** Comparison of model pre-training strategy on extreme perturbation scenario (GED=5). To examine how architectural and pretraining choices shape robustness, we further group the evaluated models into three categories: Molecular LLMs (encoder–decoder models pretrained on chemical corpora, including MolT5 and BioT5), Scientific LLMs (Galactica-125M), and General LLMs (the Qwen3 series). Domain pretraining is…

**Figure 6.** Absolute performance trend of Galactica …

**Figure 7.** Histograms of Tanimoto similarities between original and perturbed molecules across GED …

Original abstract

Large Language Models (LLMs) have recently shown promise in molecular discovery, yet a gap remains between their probabilistic nature over discrete sequential tokens and the rigid topological constraints of chemical space. This raises the question of whether molecular LLMs can generalize beyond the local neighborhoods induced by their sequence-based representations. To systematically investigate this question, we introduce a Molecular Perturbation framework that generates syntax-valid structural variants of training molecules under controlled Graph Edit Distance (GED) to probe the manifold regularity of molecular LLMs. Our analysis shows that even a single edit can cause substantial performance drops on common molecular tasks, revealing a narrow local trust region and fragile sensitivity to structural changes. Since similar molecules tend to exhibit similar properties, In-Context Tuning (ICT), which anchors predictions on structurally similar molecules, offers a natural way to mitigate such fragility. Our experiments also examine whether ICT confers robustness under controlled structural perturbations, and the results suggest that it can partially expand the local trust region and offer a promising direction for stabilizing molecular LLMs against structural variation.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The GED perturbation test is a concrete diagnostic for local sensitivity in molecular LLMs, but the fragility claim depends on an unchecked assumption that single edits preserve task labels.

The paper's main contribution is a Molecular Perturbation framework that produces syntax-valid molecular variants at controlled Graph Edit Distance and measures how much LLM performance drops on standard tasks. The authors then check whether In-Context Tuning reduces that sensitivity.

What works is the controlled, reproducible way they generate the variants and the direct link to the similarity principle for the ICT mitigation. That gives a practical probe that goes beyond generic robustness tests.

The soft spot is the missing verification that the edits leave the ground-truth labels unchanged. The abstract invokes the similarity principle but supplies no check on whether a single edit systematically shifts solubility, binding, or whatever the task measures. If the labels move, the observed drops are consistent with correct behavior rather than narrow trust regions. Without that evidence the central claim is hard to interpret.

This is aimed at groups already building or evaluating molecular LLMs who need a low-cost way to surface local fragility. A reader working on robustness or in-context methods could extract the perturbation generator and the ICT experiment as a starting point.

It deserves a serious referee because the framework is new enough and the question is live, but the review should focus on whether the paper added any label-stability diagnostics or at least quantified how often the edits alter known properties.

Referee Report

1 major / 2 minor

Summary. The paper introduces a Molecular Perturbation framework that generates syntax-valid structural variants of molecules via controlled Graph Edit Distance (GED) edits. It reports that even single edits cause substantial performance drops on standard molecular tasks for LLMs, which the authors interpret as evidence of narrow local trust regions and fragile generalization. The work further tests In-Context Tuning (ICT) as a mitigation that partially expands the trust region.

Significance. If the perturbations preserve ground-truth task labels, the controlled GED analysis supplies a useful diagnostic for evaluating whether sequence-based LLMs respect the similarity principle in chemical space, and the ICT results point to a concrete inference-time remedy. The framework itself could serve as a reusable evaluation tool for molecular models.

major comments (1)

[Abstract and perturbation framework description] The central claim that observed accuracy drops demonstrate model fragility (rather than correct adaptation to changed properties) rests on the unverified premise that syntax-valid GED edits leave task labels (solubility, binding affinity, etc.) essentially unchanged. The abstract invokes the similarity principle but supplies no explicit check—such as property-value histograms or label-consistency statistics before versus after perturbation—that the editing rules preserve the relevant distributions. This assumption is load-bearing for both the fragility diagnosis and the ICT experiments.

minor comments (2)

[Methods] Define 'syntax-valid' more precisely and state whether additional chemical-validity filters (valence, ring strain, etc.) are applied beyond token-level syntax.
[Experiments] Report the exact dataset sizes, number of perturbations per molecule, and statistical tests used to establish that the performance drops are significant rather than within noise.
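One way the requested significance check could be carried out is a paired sign-flip permutation test on per-molecule scores before and after perturbation. The sketch below assumes that each original molecule has exactly one perturbed counterpart and a per-example score (for instance 1.0 for a correct prediction, 0.0 otherwise); the function name and the toy scores are hypothetical, and the paper may use a different test.

```python
import random


def paired_permutation_test(orig_scores, pert_scores, n_resamples=10000, seed=0):
    """Two-sided sign-flip permutation test on the mean paired difference.

    Under the null hypothesis that perturbation has no effect, the sign of
    each paired difference is equally likely to be positive or negative.
    """
    diffs = [o - p for o, p in zip(orig_scores, pert_scores)]
    observed = sum(diffs) / len(diffs)
    rng = random.Random(seed)
    extreme = 0
    for _ in range(n_resamples):
        flipped = sum(d if rng.random() < 0.5 else -d for d in diffs) / len(diffs)
        if abs(flipped) >= abs(observed):
            extreme += 1
    # Add-one correction keeps the estimated p-value strictly positive.
    return observed, (extreme + 1) / (n_resamples + 1)


# Toy scores: every molecule is predicted correctly before the edit and
# incorrectly after it.
drop, p_value = paired_permutation_test([1.0] * 12, [0.0] * 12)
```

Pairing matters here: comparing the two test sets as independent samples would ignore that each perturbed molecule is derived from a specific original, and would typically understate the evidence for a drop.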

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive feedback. The concern about verifying label preservation under GED perturbations is valid and will be addressed in revision.

Referee: The central claim that observed accuracy drops demonstrate model fragility (rather than correct adaptation to changed properties) rests on the unverified premise that syntax-valid GED edits leave task labels (solubility, binding affinity, etc.) essentially unchanged. The abstract invokes the similarity principle but supplies no explicit check—such as property-value histograms or label-consistency statistics before versus after perturbation—that the editing rules preserve the relevant distributions. This assumption is load-bearing for both the fragility diagnosis and the ICT experiments.

Authors: We agree that explicit verification would strengthen the paper. The framework is grounded in the standard similarity principle of cheminformatics, but the current version provides no direct statistical confirmation. In revision we will add (i) property-value histograms for key tasks (solubility, binding affinity) comparing originals to perturbed molecules at GED=1 and higher, and (ii) label-consistency statistics (fraction of cases where ground-truth label is unchanged or changes by less than a task-specific threshold). These will appear in the perturbation framework section and will be used to qualify both the fragility claims and ICT results. With this addition the interpretation that small structural edits should not produce large label shifts (hence performance drops indicate fragility) becomes empirically supported rather than assumed.

Revision: yes
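The label-consistency statistic promised in the rebuttal could be computed along the following lines. This is a sketch under assumptions: it presumes that trustworthy ground-truth labels exist for the perturbed molecules (from measurement or a validated oracle), the function name and the toy values are hypothetical, and the boundary case is treated as stable (a change of exactly `threshold` counts as consistent).

```python
def label_consistency(orig_labels, pert_labels, threshold=0.0):
    """Fraction of pairs whose ground-truth label moves by at most `threshold`.

    With threshold=0.0 this is the fraction of exactly unchanged labels,
    which also covers binary classification labels encoded as 0/1.
    """
    if len(orig_labels) != len(pert_labels):
        raise ValueError("label lists must be paired one-to-one")
    stable = sum(abs(o - p) <= threshold for o, p in zip(orig_labels, pert_labels))
    return stable / len(orig_labels)


# Toy property values for four molecules and their GED=1 variants.
orig = [0.5, 1.25, 2.0, 3.0]
pert = [0.5, 1.5, 3.5, 3.0]
exact = label_consistency(orig, pert)            # unchanged labels only
loose = label_consistency(orig, pert, 0.25)      # task-specific tolerance
```

A low value of this statistic at GED=1 would support the referee's alternative reading, namely that the performance drop reflects labels that really did change rather than model fragility.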

Circularity Check

0 steps flagged

No circularity; purely empirical perturbation study

The paper introduces a Molecular Perturbation framework and reports experimental performance drops under GED edits on molecular tasks, followed by ICT mitigation tests. No equations, parameter fits, or derivations are present that could reduce claims to inputs by construction. The similarity principle is invoked as standard domain knowledge rather than derived or self-cited in a load-bearing way. All central claims rest on direct measurements, so the work is self-contained with no reduction to fitted quantities or self-referential definitions.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only; no free parameters, axioms, or invented entities are stated or implied beyond standard assumptions of molecular graph validity and task evaluation.

Reference graph

Works this paper leans on

12 extracted references · 7 canonical work pages · 3 internal anchors

[1] Edwards Carl, Lai Tuan, Ros Kevin, Honke Garrett, Cho Kyunghyun, Ji Heng. Translation between Molecules and Natural Language // Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing. 2022.
[2] Edwards Carl, Zhai ChengXiang, Ji Heng. Text2mol: Cross-modal molecule retrieval with natural language queries // Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing. 2021.
[3] Gan Esther, Zhao Yiran, Cheng Liying, Yancan Mao, Goyal Anirudh, Kawaguchi Kenji, Kan Min-Yen, Shieh Michael. Reasoning robustness of llms to adversarial typographical errors // Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing. 2024.
[4] Kumar Pankaj, Mishra Subhankar. Robustness in large language models: A survey of mitigation strategies and evaluation metrics // arXiv preprint arXiv:2505.18658.
[5] Li Hao, Cao He, Feng Bin, Shao Yanjun, Tang Xiangru, Yan Zhiyuan, Yuan Li, Tian Yonghong, Li Yu. Beyond Chemical QA: Evaluating LLM’s Chemical Reasoning with Modular Chemical Operations // arXiv preprint arXiv:2505.21318.
[6] Li Jiatong, Li Junxian, Wang Weida, Liu Yunqing, Zheng Changmeng, Zhou Dongzhan, Wei Xiao-yong, Li Qing. Speak-to-Structure: Evaluating LLMs in Open-domain Natural Language-Driven Molecule Generation // arXiv preprint arXiv:2412.14642. 2024a. Li Jiatong, Liu Yunqing, Fan Wenqi, Wei Xiao-Yong, Liu Hui, Tang Jiliang, Li Qing. Empowering molecule discovery ...
[7] Liu Zequn, Zhang Wei, Xia Yingce, Wu Lijun, Xie Shufang, Qin Tao, Zhang Ming, Liu Tie-Yan. MolXPT: Wrapping Molecules with Text for Generative Pre-training // Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers). 2023a. 1606–1616. Liu Zhiyuan, Li Sihang, Luo Yanchen, Fei Hao, Cao Y...
[8] Pei Qizhi, Zhang Wei, Zhu Jinhua, Wu Kehan, Gao Kaiyuan, Wu Lijun, Xia Yingce, Yan Rui. BioT5: Enriching Cross-modal Integration in Biology with Chemical Knowledge and Natural Language Associations // Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing. 2023.
[9] Taylor Ross, Kardas Marcin, Cucurull Guillem, Scialom Thomas, Hartshorn Anthony, Saravia Elvis, Poulton Andrew, Kerkez Viktor, Stojnic Robert. Galactica: A large language model for science // arXiv preprint arXiv:2211.09085.
[10] Yang An, Li Anfeng, Yang Baosong, Zhang Beichen, Hui Binyuan, Zheng Bo, Yu Bowen, Gao Chang, Huang Chengen, Lv Chenxu, others. Qwen3 technical report // arXiv preprint arXiv:2505.09388.
[11] Yang Zeyu, Meng Zhao, Zheng Xiaochen, Wattenhofer Roger. Assessing adversarial robustness of large language models: An empirical study // arXiv preprint arXiv:2405.02764.
[12] trust region. The two distributions are highly consistent: carbon and oxygen, which together account for over 90% of all atoms in the raw data, remain the dominant elements after perturbation, with only a marginal −3.8% shift in carbon frequency. Other common organic atoms (N, S, P, Cl, F) see small absolute increases ...
