ChemVA: Advancing Large Language Models on Chemical Reaction Diagrams Understanding

Hao Yu; Huajun Chen; Jiangzhen Fu; Kehua Feng; Keyan Ding; Mingyang Rao; Zhihui Zhu

arxiv: 2605.17214 · v1 · pith:2XH3MDO2new · submitted 2026-05-17 · 💻 cs.AI · cs.CL · cs.CV

Mingyang Rao, Kehua Feng, Zhihui Zhu, Jiangzhen Fu, Hao Yu, Keyan Ding, Huajun Chen

Pith reviewed 2026-05-20 13:46 UTC · model grok-4.3

classification 💻 cs.AI cs.CL cs.CV

keywords chemical reaction diagrams, large language models, visual understanding, semantic alignment, molecular graph recognition, functional group detection, OCRD-Bench

The pith

The ChemVA framework bridges visual and semantic gaps so that LLMs can accurately read chemical reaction diagrams and reason about them.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Current large language models have trouble interpreting chemical reaction diagrams because their vision components cannot reliably track the exact connections between atoms in crowded molecular structures, and standard text representations like SMILES strings do not reliably trigger the models' stored chemical knowledge. The paper proposes the ChemVA framework to fix both problems at once. It first uses a Visual Anchor to locate functional groups at both coarse and fine scales, then converts those detected visual features into familiar entity names that better activate the model's reasoning capabilities. This matters if true because many chemistry tasks depend on reading diagrams rather than text alone, and closing the gap would let smaller open models perform closer to large proprietary systems on realistic scientific problems. The authors introduce a new benchmark called OCRD-Bench that tests the full pipeline from diagram recognition through to reaction reasoning and report strong gains on it.

Core claim

The central claim is that the Visual Anchor mechanism with hybrid-granularity detection grounds functional groups in reaction diagrams and that subsequent semantic alignment to entity names overcomes both the visual deficit in resolving topological connectivity and the semantic disconnect in activating chemical knowledge, yielding 92.0 percent structural recognition accuracy and an approximately 20 percentage point performance increase across nine different LLMs on the OCRD-Bench dataset.

What carries the argument

The Visual Anchor mechanism, which detects functional groups at multiple levels of detail and translates the resulting visual features into entity names for semantic alignment with the language model.
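
As a rough illustration of the second half of that idea, and not the paper's implementation, the sketch below turns a list of detected functional groups into a plain-language description that could be placed in an LLM prompt instead of a raw SMILES string. The `Detection` type, the `describe_molecule` function, and the output phrasing are all assumptions made for this example.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Detection:
    """One functional group reported by a hypothetical visual detector."""
    name: str                          # entity name, e.g. "Carboxyl"
    bbox: tuple[int, int, int, int]    # x1, y1, x2, y2 in image pixels


def describe_molecule(role: str, detections: list[Detection]) -> str:
    """Render detections as counted entity names, sorted alphabetically."""
    if not detections:
        return f"{role}: no functional groups detected"
    counts = Counter(d.name for d in detections)
    parts = [f"{n} x {name}" for name, n in sorted(counts.items())]
    return f"{role}: " + ", ".join(parts)
```

A real pipeline would presumably also carry connectivity between groups; this sketch only shows the name-level translation step.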

If this is right

Open-weight LLMs reach performance levels comparable to proprietary systems on complex chemical reasoning tasks.
Structural recognition accuracy on molecular diagrams reaches 92 percent when the visual and semantic components are aligned.
A single framework can address both poor diagram parsing and weak knowledge activation without separate fine-tuning for each.
The new OCRD-Bench dataset supports end-to-end evaluation from visual recognition through multi-step reasoning.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Similar visual-anchoring techniques could be adapted to help models read other types of scientific diagrams that contain dense connectivity, such as reaction networks in biology.
If the semantic alignment step generalizes, it might reduce reliance on string-based inputs like SMILES for chemistry-related queries.
The reported gains across nine models suggest the method could serve as a lightweight addition to existing multimodal pipelines rather than requiring full retraining.

Load-bearing premise

The hybrid-granularity detection step correctly identifies strict topological connections in dense molecular graphs without introducing errors that affect later reasoning steps.

What would settle it

Running ChemVA on a collection of dense reaction diagrams and finding either no accuracy improvement or a drop relative to baseline vision-language models on the same reasoning tasks would falsify the central claim.

Figures

Figures reproduced from arXiv: 2605.17214 by Hao Yu, Huajun Chen, Jiangzhen Fu, Kehua Feng, Keyan Ding, Mingyang Rao, Zhihui Zhu.

**Figure 2.** Overview of the ChemVA framework. Stage 1: Reaction Diagram Parsing. (a) Reaction Diagram Deconstruction: FG-VLM ...

**Figure 4.** Superiority of FG-VLM over RxnScribe. The chart ...

**Figure 3.** Effectiveness of Semantic Activation. The stacked bar ...

Original abstract

While Large Language Models (LLMs) have revolutionized scientific text processing, they exhibit a significant capability gap when interpreting chemical reaction diagrams. We identify two fundamental bottlenecks restricting current systems: a Visual Deficit, where generic vision encoders struggle to resolve the strict topological connectivity of dense molecular graphs, and a Semantic Disconnect, where standard linear strings, such as SMILES, fail to effectively activate the model's latent chemical reasoning. To bridge these gaps, we propose the Chemical Visual Activation (ChemVA) framework, which employs a Visual Anchor mechanism to ground functional groups via hybrid-granularity detection, followed by a semantic alignment approach that translates visual features into entity names to maximize knowledge activation in LLMs. We evaluate our approach on OCRD-Bench, a newly constructed dataset featuring dense visual-semantic contexts and comprehensive reaction coverage to evaluate the full spectrum from recognition to reasoning. Extensive experiments on OCRD-Bench demonstrate that ChemVA achieves 92.0% structural recognition accuracy. By bridging visual and semantic bottlenecks, our framework delivers a consistent performance gain of approximately 20 percentage points across 9 diverse LLMs, enabling open-weight models to rival proprietary SOTA systems in complex chemical reasoning tasks.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

ChemVA gives a targeted fix for diagram understanding in chemistry LLMs with a new benchmark, but the 20-point gains rest on thin validation that skips error breakdowns by graph complexity.

ChemVA adds a visual anchor that detects functional groups at hybrid granularity and then maps those features to entity names so LLMs can activate their chemical knowledge more reliably. That directly targets the visual deficit in dense molecular graphs and the semantic disconnect from plain SMILES strings. The new OCRD-Bench dataset, built for dense visual-semantic contexts and full reaction coverage, is a concrete step forward for testing the whole pipeline from recognition to reasoning. The reported 92% structural accuracy and roughly 20-point lifts across nine different LLMs are the kind of numbers that could matter for people trying to use open models on real chemistry tasks. Those results show the framework is at least moving the needle in practice.

The evaluation still leaves gaps. The abstract gives only aggregate figures with no stratified breakdown by diagram density, no ablation that isolates the anchor from prompting changes or generic vision encoders, and no direct comparison against ground-truth molecular graphs for connectivity errors. If misdetections on bonds or ring attachments are concentrated in the harder cases that drive the reasoning scores, the gains could be partly explained by that rather than by the claimed bridging of bottlenecks. The stress-test point about undetected topological errors is worth checking in the full methods and results.

This work is aimed at researchers who build or apply multimodal models to scientific diagrams, especially in chemistry. Anyone looking for a practical way to improve diagram-to-reasoning performance would get usable ideas from the framework and the benchmark. It has enough substance and addresses a real documented gap, so it deserves a serious referee who can press on the missing error analysis and ablations. I would send it to peer review rather than desk reject.

Referee Report

2 major / 2 minor

Summary. The paper introduces the ChemVA framework to improve LLMs' interpretation of chemical reaction diagrams. It identifies a Visual Deficit in generic vision encoders for resolving topological connectivity in dense molecular graphs and a Semantic Disconnect with linear representations like SMILES. The proposed solution uses a Visual Anchor with hybrid-granularity detection to ground functional groups, followed by semantic alignment to entity names for better knowledge activation. A new OCRD-Bench dataset is introduced for evaluating recognition to reasoning. Experiments report 92.0% structural recognition accuracy and consistent ~20 percentage point gains across 9 LLMs, allowing open-weight models to approach proprietary SOTA performance.

Significance. If the empirical gains hold under rigorous verification, the work could meaningfully advance multimodal chemical reasoning by providing a practical way to ground LLMs in visual molecular structures. The new OCRD-Bench dataset with dense visual-semantic contexts is a useful contribution for the community. The approach of combining hybrid detection with semantic translation has potential applicability beyond chemistry to other diagram-heavy scientific domains.

major comments (2)

[§4.2] §4 Experiments and §4.2 Results: The central claim of ~20pp gains across LLMs rests on the Visual Anchor correctly resolving strict topological connectivity (bonds, rings, attachments) without systematic errors in dense graphs. However, the reported 92.0% aggregate structural recognition accuracy on OCRD-Bench supplies no stratified error analysis by graph density or complexity, no direct comparison to ground-truth molecular graphs, and no ablation isolating the hybrid-granularity detection from generic vision encoders or prompting variations. If connectivity misdetections concentrate in the complex diagrams driving the reasoning tasks, they could inflate the observed LLM improvements.
[§3.2] §3.2 Visual Anchor mechanism: The description of hybrid-granularity detection does not include quantitative validation (e.g., precision/recall on bond detection or ring closure in high-density subgraphs) or error propagation analysis to downstream reasoning steps. This is load-bearing for the claim that the mechanism bridges the visual deficit without introducing undetected topological errors.
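
The stratified breakdown requested in the first comment could be computed along these lines. This is a minimal sketch under stated assumptions: each evaluation record is a `(num_atoms, is_correct)` pair, heavy-atom count stands in for "graph density", and the bin edges `(20, 40)` are arbitrary placeholders rather than values from the paper.

```python
from collections import defaultdict


def stratified_accuracy(records, edges=(20, 40)):
    """Group (num_atoms, is_correct) records into size bins and score each bin.

    `edges` are hypothetical bin boundaries on atom count: a record with
    num_atoms below edges[0] lands in bin 0, below edges[1] in bin 1, etc.
    """
    totals = defaultdict(lambda: [0, 0])  # bin index -> [correct, seen]
    for num_atoms, is_correct in records:
        bin_idx = sum(num_atoms >= e for e in edges)
        totals[bin_idx][0] += int(is_correct)
        totals[bin_idx][1] += 1
    return {b: c / n for b, (c, n) in sorted(totals.items())}
```

If accuracy falls sharply in the highest bin while the aggregate stays near 92%, that would support the referee's concern that errors concentrate in dense diagrams.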

minor comments (2)

[Abstract] The abstract and introduction would benefit from explicit dataset statistics (e.g., number of diagrams, average atoms/bonds per diagram, distribution of reaction types) to contextualize the 92% accuracy figure.
[§3.3] Notation for the semantic alignment step could be clarified with a short pseudocode or equation showing how visual features map to entity names.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the thoughtful and detailed comments, which help us strengthen the presentation of our work. We address each major comment below and will revise the manuscript to incorporate additional analyses where this improves rigor without altering the core claims.

Referee: [§4.2] §4 Experiments and §4.2 Results: The central claim of ~20pp gains across LLMs rests on the Visual Anchor correctly resolving strict topological connectivity (bonds, rings, attachments) without systematic errors in dense graphs. However, the reported 92.0% aggregate structural recognition accuracy on OCRD-Bench supplies no stratified error analysis by graph density or complexity, no direct comparison to ground-truth molecular graphs, and no ablation isolating the hybrid-granularity detection from generic vision encoders or prompting variations. If connectivity misdetections concentrate in the complex diagrams driving the reasoning tasks, they could inflate the observed LLM improvements.

Authors: We appreciate the referee's emphasis on verifying that the reported gains stem from reliable topological resolution rather than undetected errors in complex cases. The manuscript presents the 92.0% structural recognition accuracy as an aggregate metric on OCRD-Bench together with consistent end-to-end gains across nine LLMs. To directly address the concern, we will revise §4 to include a stratified breakdown of recognition accuracy by graph density and complexity, an explicit comparison of extracted structures against ground-truth molecular graphs, and an ablation isolating the hybrid-granularity detection from standard vision encoders and prompting variations. These additions will clarify the source of the improvements.
revision: yes

Referee: [§3.2] §3.2 Visual Anchor mechanism: The description of hybrid-granularity detection does not include quantitative validation (e.g., precision/recall on bond detection or ring closure in high-density subgraphs) or error propagation analysis to downstream reasoning steps. This is load-bearing for the claim that the mechanism bridges the visual deficit without introducing undetected topological errors.

Authors: We agree that quantitative validation of the hybrid-granularity detection would strengthen the mechanistic claims. The current §3.2 describes the design rationale for combining fine- and coarse-grained anchors to resolve connectivity. In the revision we will add precision and recall figures for bond detection and ring closure evaluated on high-density subgraphs drawn from OCRD-Bench, together with a concise error-propagation analysis tracing detection errors through semantic alignment to final reasoning accuracy. These results will be placed in §3.2 and the appendix.
revision: yes
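
The bond-level precision and recall discussed in this exchange could be computed roughly as follows. The sketch assumes that predicted and ground-truth graphs share atom indices (in practice an atom-matching step would be needed first) and that a bond is a triple `(atom_i, atom_j, order)`; the function name is hypothetical.

```python
def bond_precision_recall(predicted, reference):
    """Compare predicted bonds with ground-truth bonds.

    Each bond is (atom_i, atom_j, order). The order of the two atom
    indices inside a bond is ignored; the bond order must match exactly.
    """
    def canon(bonds):
        return {(min(i, j), max(i, j), order) for i, j, order in bonds}

    pred, ref = canon(predicted), canon(reference)
    hits = len(pred & ref)
    precision = hits / len(pred) if pred else 0.0
    recall = hits / len(ref) if ref else 0.0
    return precision, recall
```

Restricting both inputs to bonds inside a high-density subgraph would give the per-region figures the authors promise.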

Circularity Check

0 steps flagged

No circularity; empirical framework evaluated on new benchmark

The paper describes an empirical approach: it identifies visual and semantic bottlenecks in LLMs for chemical diagrams, proposes the ChemVA framework using a Visual Anchor for hybrid-granularity detection and semantic alignment to entity names, constructs OCRD-Bench, and reports 92% structural recognition plus ~20pp gains across LLMs. No equations, parameter fits presented as predictions, self-citations as load-bearing premises, or uniqueness theorems appear in the text. All central claims reduce to experimental measurements on the introduced dataset rather than reducing by construction to prior inputs or definitions.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract introduces no mathematical derivations, free parameters, or new physical entities; the framework is described purely in terms of existing vision and language components.

pith-pipeline@v0.9.0 · 5757 in / 1081 out tokens · 34026 ms · 2026-05-20T13:46:01.164380+00:00 · methodology

Reference graph

Works this paper leans on

59 extracted references · 59 canonical work pages · 3 internal anchors

[1] 2025. Benchmarking MLLMs on Topological Reasoning of Chemical Reaction Diagrams. OpenReview/arXiv Submission 1142 (2025).
[2] 2025. Evaluating the Accuracy and Educational Potential of Generative AI Models in Pharmacy Education: A Comparative Analysis of ChatGPT and Gemini Across Bloom’s Taxonomy. Pharmacy (2025).
[3] 2025. MolecularIQ: Characterizing Chemical Reasoning Capabilities Through Symbolic Verification on Molecular Graphs. Under Review at ICLR 2026 (2025).
[4] Nawaf Alampara, I. Mandal, P. Khetarpal, H. S. Grover, et al. 2024. MaCBench: A multimodal chemistry and materials science benchmark. In NeurIPS 2024 Workshop AI for Materials.
[5] Shuai Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Sibo Song, Kai Dang, Peng Wang, Shijie Wang, Jun Tang, et al. 2025. Qwen2.5-VL technical report. arXiv preprint arXiv:2502.13923 (2025).
[6] Daniil A Boiko, Robert MacKnight, Gabriel Gomes, et al. 2023. Autonomous chemical research with large language models. Nature 624, 7992 (2023), 570–578.
[7] Andres M Bran et al. 2025. Chemical reasoning in LLMs unlocks steerable synthesis planning and reaction mechanism elucidation. arXiv preprint arXiv:2503.08537 (2025).
[8] Kexin Chen, Yuyang Du, Junyou Li, Hanqun Cao, Menghao Guo, Xilin Dang, Lanqing Li, Jiezhong Qiu, Guangyong Chen, and Pheng Ann Heng. 2025. ChemMiner: A Large Language Model Agent System for Chemical Literature Data Mining. In Proceedings of the IEEE/CVF International Conference on Computer Vision. 7595–7603.
[9] Djork-Arné Clevert, Tuan Le, Robin Winter, and Floriane Montanari. 2021. Img2Mol - Accurate Molecular Structure Estimation from Images. Chemical Science 12, 42 (2021), 14174–14181.
[10] Y Diao et al. 2023. MacFrag: Segmenting large-scale molecules to obtain diverse fragments. Bioinformatics 39 (2023).
[11] Carl Edwards et al. 2022. Translation between Molecules and Natural Language. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing.
[12] Vincent Fan, Yujie Qian, Alex Wang, Amber Wang, Connor W Coley, and Regina Barzilay. 2024. OpenChemIE: An information extraction toolkit for chemistry literature. Journal of Chemical Information and Modeling 64, 14 (2024), 5521–5534.
[13] Fernand Gobet et al. 2001. Chunking mechanisms in human learning. Trends in Cognitive Sciences 5, 6 (2001), 236–243.
[14] Yu Gu and Zhi Liang. 2025. MolRAG: Unlocking the Power of LLMs for Molecular Property Prediction. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (ACL 2025).
[15] Taicheng Guo, Kehan Guo, Bozhao Nan, Zhenwen Liang, Zhichun Guo, Nitesh V. Chawla, Olaf Wiest, and Xiangliang Zhang. 2023. What can Large Language Models do in chemistry? A comprehensive benchmark on eight tasks. In Advances in Neural Information Processing Systems (NeurIPS), Vol. 36.
[16] Stephen R Heller, Alan McNaught, Igor Pletnev, Stephen Stein, and Dmitrii Tchekhovskoi. 2015. InChI, the IUPAC international chemical identifier. Journal of cheminformatics 7, 1 (2015), 23.
[17] Steven M Kearnes, Michael R Maser, Michael Wleklinski, Anton Kast, Abigail G Doyle, Spencer D Dreher, Joel M Hawkins, Klavs F Jensen, and Connor W Coley. 2021. The open reaction database. Journal of the American Chemical Society 143, 45 (2021), 18820–18826.
[18] Sunghwan Kim, Jie Chen, Tiejun Cheng, Asta Gindulyte, Jia He, Siqian He, Qingliang Li, Benjamin A Shoemaker, Paul A Thiessen, Bo Yu, et al. 2023. PubChem 2023 update. Nucleic Acids Research 51, D1 (2023), D1373–D1380.
[19] Greg Landrum et al. 2013. RDKit: Open-source cheminformatics. http://www.rdkit.org. Accessed: 2025-05-20.
[20] Junxian Li, Di Zhang, Xunzhi Wang, Zeying Hao, et al. 2025. Chemvlm: Exploring the power of multimodal large language models in chemistry area. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 39. 415–423.
[21] Junxian Li, Di Zhang, Dongzhan Zhou, et al. 2024. ChemVLM: Exploring the Power of Multimodal Large Language Models in Chemistry Area. arXiv preprint arXiv:2408.07246 (2024).
[22] Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C Lawrence Zitnick. 2014. Microsoft coco: Common objects in context. In European conference on computer vision. Springer, 740–755.
[23] Pan Lu, Swaroop Mishra, Tony Xia, Liang Qiu, Kai-Wei Chang, Song-Chun Zhu, Oyvind Tafjord, Peter Clark, and Ashwin Kalyan. 2022. Learn to Explain: Multimodal Reasoning via Thought Chains for Science Question Answering. In The 36th Conference on Neural Information Processing Systems (NeurIPS).
[24] Lucas Morin, Martin Danelljan, Miguel I Agea, et al. 2023. MolGrapher: Graph-based Visual Recognition of Chemical Structures. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV). 19552–19561.
[25] Martijn Oldenhof, Adam Arany, Yves Moreau, and Jaak Simm. 2021. Self-labeling of fully mediating representations by graph alignment. In Benelux Conference on Artificial Intelligence. Springer, 46–65.
[26] Yujie Qian, Jiang Guo, Regina Barzilay, and Connor Coley. 2023. RxnScribe: A Sequence Generation Model for Reaction Diagram Parsing. In Journal of Chemical Information and Modeling, Vol. 63. ACS Publications, 4030–4041.
[27] Yujie Qian, Jiang Guo, Zhengkai Tu, Zhening Li, Connor W. Coley, and Regina Barzilay. 2023. MolScribe: Robust Molecular Structure Recognition with Image-to-Graph Generation. Journal of Chemical Information and Modeling 63, 18 (2023), 5833–5844.
[28] Yujie Qian, Jiang Guo, Zhengkai Tu, Zhening Li, Connor W. Coley, and Regina Barzilay. 2024. RxnScribe: A Unified Framework for Chemical Reaction Diagram Parsing. arXiv preprint arXiv:2305.11845 (2024).
[29] K Rajan, H O Brinkhaus, M I Agea, A Zielesny, and C Steinbeck. 2023. DECIMER.ai: An open platform for automated optical chemical structure identification, segmentation and recognition in scientific publications. Nature communications 14, 1 (2023), 5045.
[30] LG Research, Sehyun Chun, Jiye Kim, Ahra Jo, Yeonsik Jo, Seungyul Oh, Seungjun Lee, Kwangrok Ryoo, Jongmin Lee, Seung Hwan Kim, et al. 2025. MolMole: Molecule Mining from Scientific Literature. arXiv preprint arXiv:2505.03777 (2025).
[31] Nicholas T Runcie et al. 2025. Can Reasoning Power Significantly Improve the Knowledge of Large Language Models for Chemistry? Journal of Chemical Information and Modeling (2025).
[32] Nicholas T. Runcie, Charlotte M. Deane, and Fergus Imrie. 2025. ChemIQ: A Benchmark for Chemical Reasoning and Molecular Comprehension. arXiv preprint arXiv:2505.07735 (2025).
[33] Christof Schütt, Kohulan Rajan, Achim Zielesny, and Christoph Steinbeck. 2020. DECIMER 1.0: Deep Learning for Chemical Image Recognition Using Transformers. Journal of Chemical Information and Modeling 60 (2020), 5359–5372. Also published in J. Cheminf. as separate work, please verify specific citation.
[34] Ayush Kumar Shah, Abhisek Dey, Leo Luo, Bryan Amador, Patrick Philippy, Ming Zhong, Siru Ouyang, David Mark Friday, David Bianchi, Nick Jackson, et al. Multimodal Search in Chemical Documents and Reactions. In Proceedings of the 48th International ACM SIGIR Conference on Research and Development in Information Retrieval. 4030–4034.
[35] Joshua Staker, Kyle Marshall, Robert Abel, and Carolyn M McQuaw. 2019. Molecular structure extraction from documents using deep learning. Journal of chemical information and modeling 59, 3 (2019), 1017–1029.
[36] Jingchao Wang, Haote Yang, Jiang Wu, Yifan He, Xingjian Wei, Yinfan Wang, Chengjin Liu, Lingli Ge, Lijun Wu, Bin Wang, et al. 2025. GTR-CoT: Graph Traversal as Visual Chain of Thought for Molecular Structure Recognition. arXiv preprint arXiv:2506.07553 (2025).
[37] Xiaoxuan Wang, Yanqiao Zhu, Zemand Liu, et al. 2024. SciBench: Evaluating College-Level Scientific Problem-Solving Abilities of Large Language Models. In International Conference on Machine Learning (ICML).
[38] Damian M Wilary and Jacqueline M Cole. 2021. ReactionDataExtractor: A tool for automated extraction of information from chemical reaction schemes. Journal of chemical information and modeling 61, 10 (2021), 4962–4974.
[39] Damian M Wilary and Jacqueline M Cole. 2023. ReactionDataExtractor 2.0: A deep learning approach for data extraction from chemical reaction schemes. Journal of Chemical Information and Modeling 63, 19 (2023), 6053–6067.
[40] Ruiling Xu, Yifan Zhang, Qingyun Wang, Carl Edwards, and Heng Ji. 2025. oMeBench: Towards Robust Benchmarking of LLMs in Organic Mechanism Elucidation and Reasoning. arXiv preprint arXiv:2510.07731.
[41] Zhaoning Yu, Xiangyang Xu, and Hongyang Gao. 2024. G2t-llm: Graph-to-tree text encoding for molecule generation with fine-tuned large language models. arXiv preprint arXiv:2410.02198 (2024).
[42] Xiang Yue, Yuansheng Ni, Kai Zhang, Tianyu Zheng, Ruoqi Liu, Ge Zhang, Samuel Stevens, Dongfu Jiang, Weiming Ren, Yuxuan Sun, et al. 2024. Mmmu: A massive multi-discipline multimodal understanding and reasoning benchmark for expert agi. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 9556–9567.
[43] Di Zhang, Wei Liu, et al. 2024. ChemLLM: A Chemical Large Language Model. arXiv preprint arXiv:2402.06852 (2024).
[44] Shuai Zhang, Wei Liu, et al. 2024. Igniting the Power of Large Language Models for Chemistry: A Systematic Survey. arXiv preprint arXiv:2401.14656 (2024).
[45] Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zi Lin, Zhuohan Li, Dacheng Li, Eric P Xing, et al. 2023. Judging llm-as-a-judge with mt-bench and chatbot arena. arXiv preprint arXiv:2306.05685 (2023).
[46] Jiaxi Zhuang, Kangning Li, Jue Hou, Mingjun Xu, Zhifeng Gao, and Hengxing Cai. 2025. Doc2SAR: A Synergistic Framework for High-Fidelity Extraction of Structure-Activity Relationships from Scientific Documents. arXiv preprint arXiv:2506.21625 (2025).

Preprint, 2026, Rao et al.

A1 Implementation Details: Prompts and Instruction Tuning
A1.1 Reaction Diagram ...

- In disconnected layouts (multiple independent reactions in one image), treat them as separate Reaction IDs.
### Output Format (STRICT) ###
Respond with a valid JSON list of objects. Do not include markdown code blocks (```json).
[ {"reaction_id": 1, "role": "Reactant", "bbox": [x1, y1, x2, y2]}, {"reaction_id": 1, "role": "Arrow", "bbox": [x1, y1, x2, y2...
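
A downstream consumer of this output format might validate and group the entries as follows. This is a sketch under assumptions: the function name `group_by_reaction` is hypothetical, and only the three keys visible in the excerpt (`reaction_id`, `role`, `bbox`) are checked, since the rest of the schema is truncated in the source.

```python
import json
from collections import defaultdict


def group_by_reaction(raw: str) -> dict[int, list[dict]]:
    """Parse the model's JSON list and group entries by reaction_id."""
    entries = json.loads(raw)
    if not isinstance(entries, list):
        raise ValueError("expected a JSON list of objects")
    grouped = defaultdict(list)
    for entry in entries:
        missing = {"reaction_id", "role", "bbox"} - entry.keys()
        if missing:
            raise ValueError(f"entry missing keys: {sorted(missing)}")
        if len(entry["bbox"]) != 4:
            raise ValueError("bbox must be [x1, y1, x2, y2]")
        grouped[entry["reaction_id"]].append(entry)
    return dict(grouped)
```

Grouping by `reaction_id` mirrors the prompt's rule that disconnected layouts are treated as separate reactions.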

Finally, determine the connectivity (Bonds) between all identified nodes.
User Prompt
### Task Description ###
Analyze the molecular image and generate a structured JSON representation containing supernodes, atoms, and bonds.
### Decomposition Constraints (CRITICAL) ###
1. Visual Priority (Top-Down):
- Rule: Prioritize the detection of Functional Group Pa...

Return the center coordinates [x, y] of these anchor atoms (normalized 0-1000).
### Output Format (STRICT) ###
Respond with a JSON object containing a single list of coordinates.
{ "anchors": [ [x1, y1], [x2, y2] ] }
### Input Image ###
{cropped_molecule_image}
Now, locate the anchors for the {target_label} at {target_bbox}:

A2 Data Construction Details ...

Priority Hierarchy Definition. We constructed a hierarchical dictionary where functional groups are ranked by heavy atom count, topological complexity, and semantic weight. High-complexity groups (e.g., Carboxyl −COOH, Amide −CONH2) are assigned higher matching priority than their constituents (e.g., Carbonyl C=O, Hydroxyl −OH).

Recursive Matching with Exclusivity. For each SMILES string, we perform recursive substructure matching using RDKit [19]. Crucially, we enforce an Atom-wise Exclusivity Constraint: once an atom is assigned to a high-priority "super-node" (e.g., the Carbon in −COOH), it is locked and explicitly excluded from subsequent scans. This prevents the redundant la...
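
The exclusivity constraint described above can be sketched without any chemistry toolkit by assuming the substructure matches have already been found (for example by an RDKit substructure search) and are passed in as `(group_name, priority, atom_indices)` tuples. The function name and the numeric priorities are hypothetical; the paper's actual ranking combines heavy atom count, topological complexity, and semantic weight.

```python
def assign_supernodes(matches):
    """Greedy assignment with an atom-wise exclusivity constraint.

    `matches` is a list of (group_name, priority, atom_indices) tuples.
    Higher priority is processed first; a match is skipped if any of its
    atoms is already locked by an earlier assignment.
    """
    locked = set()
    assigned = []
    # sorted() is stable, so equal priorities keep their input order.
    for name, priority, atoms in sorted(matches, key=lambda m: -m[1]):
        atoms = frozenset(atoms)
        if atoms & locked:
            continue  # at least one atom already belongs to a super-node
        locked |= atoms
        assigned.append((name, atoms))
    return assigned, locked
```

For acetic acid with atoms indexed C0, C1, O2, O3, a Carboxyl match on `{1, 2, 3}` locks those atoms, so the lower-priority Carbonyl and Hydroxyl matches on the same atoms are skipped.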

Residual Atom Handling. After the greedy matching process, any remaining atoms (typically satisfying the saturated alkane skeleton) are retained as atomic tokens. This hybrid-granularity approach ensures that the model captures both high-level functional semantics and low-level structural details. We then calculate the 2D bounding box and the precise anc...
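
A minimal sketch of these two steps, assuming atoms are indexed from 0, `locked` is the set of atoms already claimed by super-nodes, and `coords` holds each atom's 2D position. The helper names are hypothetical, and the box here is the tight axis-aligned extent with no padding, which may differ from what the paper uses.

```python
def residual_atoms(num_atoms, locked):
    """Atoms not claimed by any super-node stay as individual atomic tokens."""
    return [i for i in range(num_atoms) if i not in locked]


def bounding_box(coords, atom_indices):
    """Axis-aligned box (x1, y1, x2, y2) around the chosen atoms' positions."""
    xs = [coords[i][0] for i in atom_indices]
    ys = [coords[i][1] for i in atom_indices]
    return min(xs), min(ys), max(xs), max(ys)
```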

Carboxylic Anhydride
Hemiacetal/Hemiketal
Sulfo (Sulfonic Acid)
Halo

A3 OCRD-Bench Framework: Design and Metrics
To comprehensively evaluate Multimodal Large Language Models (MLLMs) on organic chemistry reasoning, we designed OCRD-Bench, a hierarchical benchmark covering 8 major reaction categories (see Figure A2). The evaluation is structured into three cognitive tiers, ranging from visual perception to deep mechan...

Identify the substrate (Benzene) and reagent (Chloroethane) in the image and encode them into SMILES

[1] 2025. Benchmarking MLLMs on Topological Reasoning of Chemical Reaction Diagrams. OpenReview/arXiv Submission 1142 (2025).

[2] 2025. Evaluating the Accuracy and Educational Potential of Generative AI Models in Pharmacy Education: A Comparative Analysis of ChatGPT and Gemini Across Bloom’s Taxonomy. Pharmacy (2025).

[3] 2025. MolecularIQ: Characterizing Chemical Reasoning Capabilities Through Symbolic Verification on Molecular Graphs. Under Review at ICLR 2026 (2025).

[4] Nawaf Alampara, I. Mandal, P. Khetarpal, H. S. Grover, et al. 2024. MaCBench: A multimodal chemistry and materials science benchmark. In NeurIPS 2024 Workshop AI for Materials.

[5] Shuai Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Sibo Song, Kai Dang, Peng Wang, Shijie Wang, Jun Tang, et al. 2025. Qwen2.5-vl technical report. arXiv preprint arXiv:2502.13923 (2025).

[6] Daniil A Boiko, Robert MacKnight, Gabriel Gomes, et al. 2023. Autonomous chemical research with large language models. Nature 624, 7992 (2023), 570–578.

[7] Andres M Bran et al. 2025. Chemical reasoning in LLMs unlocks steerable synthesis planning and reaction mechanism elucidation. arXiv preprint arXiv:2503.08537 (2025).

[8] Kexin Chen, Yuyang Du, Junyou Li, Hanqun Cao, Menghao Guo, Xilin Dang, Lanqing Li, Jiezhong Qiu, Guangyong Chen, and Pheng Ann Heng. 2025. ChemMiner: A Large Language Model Agent System for Chemical Literature Data Mining. In Proceedings of the IEEE/CVF International Conference on Computer Vision. 7595–7603.

[9] Djork-Arné Clevert, Tuan Le, Robin Winter, and Floriane Montanari. 2021. Img2Mol - Accurate Molecular Structure Estimation from Images. Chemical Science 12, 42 (2021), 14174–14181.

[10] Y Diao et al. 2023. MacFrag: Segmenting large-scale molecules to obtain diverse fragments. Bioinformatics 39 (2023).

[11] Carl Edwards et al. 2022. Translation between Molecules and Natural Language. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing.

[12] Vincent Fan, Yujie Qian, Alex Wang, Amber Wang, Connor W Coley, and Regina Barzilay. 2024. OpenChemIE: An information extraction toolkit for chemistry literature. Journal of Chemical Information and Modeling 64, 14 (2024), 5521–5534.

[13] Fernand Gobet et al. 2001. Chunking mechanisms in human learning. Trends in Cognitive Sciences 5, 6 (2001), 236–243.

[14] Yu Gu and Zhi Liang. 2025. MolRAG: Unlocking the Power of LLMs for Molecular Property Prediction. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (ACL 2025).

[15] Taicheng Guo, Kehan Guo, Bozhao Nan, Zhenwen Liang, Zhichun Guo, Nitesh V. Chawla, Olaf Wiest, and Xiangliang Zhang. 2023. What can Large Language Models do in chemistry? A comprehensive benchmark on eight tasks. In Advances in Neural Information Processing Systems (NeurIPS), Vol. 36.

[16] Stephen R Heller, Alan McNaught, Igor Pletnev, Stephen Stein, and Dmitrii Tchekhovskoi. 2015. InChI, the IUPAC international chemical identifier. Journal of Cheminformatics 7, 1 (2015), 23.

[17] Steven M Kearnes, Michael R Maser, Michael Wleklinski, Anton Kast, Abigail G Doyle, Spencer D Dreher, Joel M Hawkins, Klavs F Jensen, and Connor W Coley. 2021. The open reaction database. Journal of the American Chemical Society 143, 45 (2021), 18820–18826.

[18] Sunghwan Kim, Jie Chen, Tiejun Cheng, Asta Gindulyte, Jia He, Siqian He, Qingliang Li, Benjamin A Shoemaker, Paul A Thiessen, Bo Yu, et al. 2023. PubChem 2023 update. Nucleic Acids Research 51, D1 (2023), D1373–D1380.

[19] Greg Landrum et al. 2013. RDKit: Open-source cheminformatics. http://www.rdkit.org. Accessed: 2025-05-20.

[20] Junxian Li, Di Zhang, Xunzhi Wang, Zeying Hao, et al. 2025. Chemvlm: Exploring the power of multimodal large language models in chemistry area. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 39. 415–423.

[21] Junxian Li, Di Zhang, Dongzhan Zhou, et al. 2024. ChemVLM: Exploring the Power of Multimodal Large Language Models in Chemistry Area. arXiv preprint arXiv:2408.07246 (2024).

[22] Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C Lawrence Zitnick. 2014. Microsoft coco: Common objects in context. In European Conference on Computer Vision. Springer, 740–755.

[23] Pan Lu, Swaroop Mishra, Tony Xia, Liang Qiu, Kai-Wei Chang, Song-Chun Zhu, Oyvind Tafjord, Peter Clark, and Ashwin Kalyan. 2022. Learn to Explain: Multimodal Reasoning via Thought Chains for Science Question Answering. In The 36th Conference on Neural Information Processing Systems (NeurIPS).

[24] Lucas Morin, Martin Danelljan, Miguel I Agea, et al. 2023. MolGrapher: Graph-based Visual Recognition of Chemical Structures. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV). 19552–19561.

[25] Martijn Oldenhof, Adam Arany, Yves Moreau, and Jaak Simm. 2021. Self-labeling of fully mediating representations by graph alignment. In Benelux Conference on Artificial Intelligence. Springer, 46–65.

[26] Yujie Qian, Jiang Guo, Regina Barzilay, and Connor Coley. 2023. RxnScribe: A Sequence Generation Model for Reaction Diagram Parsing. In Journal of Chemical Information and Modeling, Vol. 63. ACS Publications, 4030–4041.

[27] Yujie Qian, Jiang Guo, Zhengkai Tu, Zhening Li, Connor W. Coley, and Regina Barzilay. 2023. MolScribe: Robust Molecular Structure Recognition with Image-to-Graph Generation. Journal of Chemical Information and Modeling 63, 18 (2023), 5833–5844.

[28] Yujie Qian, Jiang Guo, Zhengkai Tu, Zhening Li, Connor W. Coley, and Regina Barzilay. 2024. RxnScribe: A Unified Framework for Chemical Reaction Diagram Parsing. arXiv preprint arXiv:2305.11845 (2024).

[29] K Rajan, H O Brinkhaus, M I Agea, A Zielesny, and C Steinbeck. 2023. DECIMER.ai: An open platform for automated optical chemical structure identification, segmentation and recognition in scientific publications. Nature Communications 14, 1 (2023), 5045.

[30] LG Research, Sehyun Chun, Jiye Kim, Ahra Jo, Yeonsik Jo, Seungyul Oh, Seungjun Lee, Kwangrok Ryoo, Jongmin Lee, Seung Hwan Kim, et al. 2025. MolMole: Molecule Mining from Scientific Literature. arXiv preprint arXiv:2505.03777 (2025).

[31] Nicholas T Runcie et al. 2025. Can Reasoning Power Significantly Improve the Knowledge of Large Language Models for Chemistry? Journal of Chemical Information and Modeling (2025).

[32] Nicholas T. Runcie, Charlotte M. Deane, and Fergus Imrie. 2025. ChemIQ: A Benchmark for Chemical Reasoning and Molecular Comprehension. arXiv preprint arXiv:2505.07735 (2025).

[33] Christof Schütt, Kohulan Rajan, Achim Zielesny, and Christoph Steinbeck. 2020. DECIMER 1.0: Deep Learning for Chemical Image Recognition Using Transformers. Journal of Chemical Information and Modeling 60 (2020), 5359–5372. Also published in J. Cheminf. as separate work, please verify specific citation.

[34] Ayush Kumar Shah, Abhisek Dey, Leo Luo, Bryan Amador, Patrick Philippy, Ming Zhong, Siru Ouyang, David Mark Friday, David Bianchi, Nick Jackson, et al. 2025. Multimodal Search in Chemical Documents and Reactions. In Proceedings of the 48th International ACM SIGIR Conference on Research and Development in Information Retrieval. 4030–4034.

[35] Joshua Staker, Kyle Marshall, Robert Abel, and Carolyn M McQuaw. 2019. Molecular structure extraction from documents using deep learning. Journal of Chemical Information and Modeling 59, 3 (2019), 1017–1029.

[36] Jingchao Wang, Haote Yang, Jiang Wu, Yifan He, Xingjian Wei, Yinfan Wang, Chengjin Liu, Lingli Ge, Lijun Wu, Bin Wang, et al. 2025. GTR-CoT: Graph Traversal as Visual Chain of Thought for Molecular Structure Recognition. arXiv preprint arXiv:2506.07553 (2025).

[37] Xiaoxuan Wang, Yanqiao Zhu, Zemand Liu, et al. 2024. SciBench: Evaluating College-Level Scientific Problem-Solving Abilities of Large Language Models. In International Conference on Machine Learning (ICML).

[38] Damian M Wilary and Jacqueline M Cole. 2021. ReactionDataExtractor: A tool for automated extraction of information from chemical reaction schemes. Journal of Chemical Information and Modeling 61, 10 (2021), 4962–4974.

[39] Damian M Wilary and Jacqueline M Cole. 2023. ReactionDataExtractor 2.0: A deep learning approach for data extraction from chemical reaction schemes. Journal of Chemical Information and Modeling 63, 19 (2023), 6053–6067.

[40] Ruiling Xu, Yifan Zhang, Qingyun Wang, Carl Edwards, and Heng Ji. 2025. oMeBench: Towards Robust Benchmarking of LLMs in Organic Mechanism Elucidation and Reasoning. arXiv preprint arXiv:2510.07731.

[41] Zhaoning Yu, Xiangyang Xu, and Hongyang Gao. 2024. G2t-llm: Graph-to-tree text encoding for molecule generation with fine-tuned large language models. arXiv preprint arXiv:2410.02198 (2024).

[42] Xiang Yue, Yuansheng Ni, Kai Zhang, Tianyu Zheng, Ruoqi Liu, Ge Zhang, Samuel Stevens, Dongfu Jiang, Weiming Ren, Yuxuan Sun, et al. 2024. Mmmu: A massive multi-discipline multimodal understanding and reasoning benchmark for expert agi. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 9556–9567.

[43] Di Zhang, Wei Liu, et al. 2024. ChemLLM: A Chemical Large Language Model. arXiv preprint arXiv:2402.06852 (2024).

[44] Shuai Zhang, Wei Liu, et al. 2024. Igniting the Power of Large Language Models for Chemistry: A Systematic Survey. arXiv preprint arXiv:2401.14656 (2024).

[45] Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zi Lin, Zhuohan Li, Dacheng Li, Eric P Xing, et al. 2023. Judging llm-as-a-judge with mt-bench and chatbot arena. arXiv preprint arXiv:2306.05685 (2023).

[46] Jiaxi Zhuang, Kangning Li, Jue Hou, Mingjun Xu, Zhifeng Gao, and Hengxing Cai. 2025. Doc2SAR: A Synergistic Framework for High-Fidelity Extraction of Structure-Activity Relationships from Scientific Documents. arXiv preprint arXiv:2506.21625 (2025).

Preprint, 2026, Rao et al.

A1 Implementation Details: Prompts and Instruction Tuning

A1.1 Reaction Diagram ...

- In disconnected layouts (multiple independent reactions in one image), treat them as separate Reaction IDs.

### Output Format (STRICT) ###

Respond with a valid JSON list of objects. Do not include markdown code blocks (```json).

[ {"reaction_id": 1, "role": "Reactant", "bbox": [x1, y1, x2, y2]}, {"reaction_id": 1, "role": "Arrow", "bbox": [x1, y1, x2, y2...
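A response in the format requested above could be checked before further processing. The following is a minimal sketch, not the authors' pipeline: the function name `parse_reaction_boxes`, the sample coordinates, and the assumption that each `bbox` satisfies x1 <= x2 and y1 <= y2 are illustrative choices, and only the `Reactant` and `Arrow` roles visible in the prompt excerpt are used.

```python
import json
from collections import defaultdict


def parse_reaction_boxes(raw: str) -> dict[int, list[dict]]:
    """Parse a JSON list of detections and group them by reaction_id."""
    items = json.loads(raw)
    if not isinstance(items, list):
        raise ValueError("expected a JSON list of objects")
    grouped: dict[int, list[dict]] = defaultdict(list)
    for item in items:
        bbox = item["bbox"]
        # Assumed convention (not stated in the excerpt): x1 <= x2 and y1 <= y2.
        if len(bbox) != 4 or bbox[0] > bbox[2] or bbox[1] > bbox[3]:
            raise ValueError(f"malformed bbox: {bbox}")
        grouped[int(item["reaction_id"])].append(
            {"role": item["role"], "bbox": list(bbox)}
        )
    return dict(grouped)


# Hypothetical model output with two independent reactions in one image.
sample = (
    '[{"reaction_id": 1, "role": "Reactant", "bbox": [10, 20, 110, 90]},'
    ' {"reaction_id": 1, "role": "Arrow", "bbox": [120, 50, 200, 60]},'
    ' {"reaction_id": 2, "role": "Reactant", "bbox": [10, 220, 110, 290]}]'
)
reactions = parse_reaction_boxes(sample)
```

Grouping by `reaction_id` mirrors the prompt's rule that disconnected layouts are treated as separate reactions.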

Finally, determine the connectivity (Bonds) between all identified nodes.

User Prompt

### Task Description ###

Analyze the molecular image and generate a structured JSON representation containing supernodes, atoms, and bonds.

### Decomposition Constraints (CRITICAL) ###

1. Visual Priority (Top-Down):

- Rule: Prioritize the detection of Functional Group Pa...

Return the center coordinates [x, y] of these anchor atoms (normalized 0-1000).

### Output Format (STRICT) ###

Respond with a JSON object containing a single list of coordinates.

{ "anchors": [ [x1, y1], [x2, y2] ] }

### Input Image ###

{cropped_molecule_image}

Now, locate the anchors for the {target_label} at {target_bbox}:

A2 Data Construction Details ...

Priority Hierarchy Definition. We constructed a hierarchical dictionary where functional groups are ranked by heavy atom count, topological complexity, and semantic weight. High-complexity groups (e.g., Carboxyl −COOH, Amide −CONH2) are assigned higher matching priority than their constituents (e.g., Carbonyl C=O, Hydroxyl −OH).
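One way to picture such a ranking is a sort over a small table of groups. This is a sketch under stated assumptions: the heavy atom counts follow from the group formulas, but the `complexity` and `semantic_weight` scores are hypothetical placeholders, since the paper's actual values and tie-breaking order are not given here.

```python
from typing import NamedTuple


class FunctionalGroup(NamedTuple):
    name: str
    heavy_atoms: int      # number of non-hydrogen atoms in the group
    complexity: int       # hypothetical topological-complexity score
    semantic_weight: int  # hypothetical semantic-weight score


GROUPS = [
    FunctionalGroup("Hydroxyl", 1, 1, 1),  # -OH
    FunctionalGroup("Carbonyl", 2, 1, 1),  # C=O
    FunctionalGroup("Carboxyl", 3, 2, 2),  # -COOH
    FunctionalGroup("Amide", 3, 2, 2),     # -CONH2
]


def by_priority(groups):
    """Return groups ordered from highest to lowest matching priority."""
    return sorted(
        groups,
        key=lambda g: (g.heavy_atoms, g.complexity, g.semantic_weight),
        reverse=True,
    )


ranked = [g.name for g in by_priority(GROUPS)]
```

With this ordering, composite groups such as Carboxyl are tried before the Carbonyl and Hydroxyl fragments they contain.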

Recursive Matching with Exclusivity. For each SMILES string, we perform recursive substructure matching using RDKit [19]. Crucially, we enforce an Atom-wise Exclusivity Constraint: once an atom is assigned to a high-priority "super-node" (e.g., the Carbon in −COOH), it is locked and explicitly excluded from subsequent scans. This prevents the redundant la...
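The exclusivity constraint can be illustrated with a simple greedy pass. This is a sketch rather than the authors' code: it is iterative instead of recursive, the function name `assign_exclusive` is hypothetical, and the match tuples are written by hand in the shape of atom-index tuples that a substructure search would return, so the example runs without RDKit.

```python
def assign_exclusive(ranked_matches):
    """Greedily assign atoms to super-nodes, highest-priority group first.

    ranked_matches: list of (group_name, matches) in priority order, where
    each match is a tuple of atom indices.
    """
    locked = set()    # atoms already claimed by a super-node
    supernodes = []
    for name, matches in ranked_matches:
        for atoms in matches:
            # Accept a match only if none of its atoms is already locked.
            if locked.isdisjoint(atoms):
                supernodes.append((name, tuple(atoms)))
                locked.update(atoms)
    return supernodes, locked


# Acetic acid, CC(=O)O, with atom indices 0:C, 1:C, 2:O (carbonyl), 3:O (hydroxyl).
# Hand-written matches; the indices are assumptions for illustration.
ranked_matches = [
    ("Carboxyl", [(1, 2, 3)]),
    ("Carbonyl", [(1, 2)]),
    ("Hydroxyl", [(3,)]),
]
supernodes, locked = assign_exclusive(ranked_matches)
```

Because the Carboxyl match locks atoms 1, 2, and 3 first, the overlapping Carbonyl and Hydroxyl matches are rejected and only one super-node is produced.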

Residual Atom Handling. After the greedy matching process, any remaining atoms (typically those forming the saturated alkane skeleton) are retained as atomic tokens. This hybrid-granularity approach ensures that the model captures both high-level functional semantics and low-level structural details. We then calculate the 2D bounding box and the precise anc...
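A minimal sketch of the residual step and the bounding-box computation follows. The helper names, the atom indexing, and the 2D coordinates are hypothetical; the anchor computation is omitted because the source text is truncated at that point.

```python
def residual_atoms(num_atoms, locked):
    """Atoms not claimed by any super-node stay as individual atomic tokens."""
    return [i for i in range(num_atoms) if i not in locked]


def bounding_box(coords, atoms):
    """Axis-aligned box (x_min, y_min, x_max, y_max) around the given atoms."""
    xs = [coords[i][0] for i in atoms]
    ys = [coords[i][1] for i in atoms]
    return (min(xs), min(ys), max(xs), max(ys))


# Acetic acid, CC(=O)O: atoms 1-3 are assumed to be locked into a Carboxyl
# super-node, leaving the methyl carbon (index 0) as a residual atom.
coords = {0: (0.0, 0.0), 1: (1.5, 0.0), 2: (2.25, 1.3), 3: (2.25, -1.3)}
locked = {1, 2, 3}

leftover = residual_atoms(4, locked)
carboxyl_box = bounding_box(coords, sorted(locked))
```

In practice the coordinates would come from the rendered depiction of the molecule, so the box would be expressed in image space.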

Carboxylic Anhydride

Hemiacetal/Hemiketal

Sulfo (Sulfonic Acid)

Halo

A3 OCRD-Bench Framework: Design and Metrics

To comprehensively evaluate Multimodal Large Language Models (MLLMs) on organic chemistry reasoning, we designed OCRD-Bench, a hierarchical benchmark covering 8 major reaction categories (see Figure A2). The evaluation is structured into three cognitive tiers, ranging from visual perception to deep mechan...

Identify the substrate (Benzene) and reagent (Chloroethane) in the image and encode them into SMILES