ChemVA: Advancing Large Language Models on Chemical Reaction Diagrams Understanding
Pith reviewed 2026-05-20 13:46 UTC · model grok-4.3
The pith
The ChemVA framework bridges visual and semantic gaps so that LLMs can accurately read chemical reaction diagrams and reason about them.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The central claim has two parts. First, the Visual Anchor mechanism uses hybrid-granularity detection to ground functional groups in reaction diagrams, and a subsequent semantic alignment to entity names overcomes both the visual deficit in resolving topological connectivity and the semantic disconnect in activating chemical knowledge. Second, this yields 92.0 percent structural recognition accuracy and a gain of approximately 20 percentage points across nine different LLMs on the OCRD-Bench dataset.
What carries the argument
The Visual Anchor mechanism, which detects functional groups at multiple levels of detail, together with a semantic alignment step that translates the resulting visual features into entity names for the language model.
If this is right
- Open-weight LLMs reach performance levels comparable to proprietary systems on complex chemical reasoning tasks.
- Structural recognition accuracy on molecular diagrams reaches 92 percent when the visual and semantic components are aligned.
- A single framework can address both poor diagram parsing and weak knowledge activation without separate fine-tuning for each.
- The new OCRD-Bench dataset supports end-to-end evaluation from visual recognition through multi-step reasoning.
Where Pith is reading between the lines
- Similar visual-anchoring techniques could be adapted to help models read other types of scientific diagrams that contain dense connectivity, such as reaction networks in biology.
- If the semantic alignment step generalizes, it might reduce reliance on string-based inputs like SMILES for chemistry-related queries.
- The reported gains across nine models suggest the method could serve as a lightweight addition to existing multimodal pipelines rather than requiring full retraining.
Load-bearing premise
The hybrid-granularity detection step correctly identifies strict topological connections in dense molecular graphs without introducing errors that affect later reasoning steps.
What would settle it
Running ChemVA on a collection of dense reaction diagrams and finding either no accuracy improvement or a drop relative to baseline vision-language models on the same reasoning tasks would falsify the central claim.
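One way to make this test concrete is to score the same items with and without the framework and report the difference in percentage points. The sketch below assumes per-item correctness flags are already available; the function name and the toy flags are illustrative and are not taken from the paper.

```python
def accuracy_gain_pp(baseline_correct, framework_correct):
    """Return (baseline %, framework %, gain in percentage points)."""
    if len(baseline_correct) != len(framework_correct):
        raise ValueError("both runs must cover the same items")
    if not baseline_correct:
        raise ValueError("no items to score")
    n = len(baseline_correct)
    base = 100.0 * sum(baseline_correct) / n
    new = 100.0 * sum(framework_correct) / n
    return base, new, new - base


# Toy flags for 10 items (hypothetical, not results from the paper).
baseline = [1, 0, 0, 1, 0, 1, 0, 0, 1, 0]
with_framework = [1, 1, 0, 1, 1, 1, 0, 0, 1, 0]

base, new, gain = accuracy_gain_pp(baseline, with_framework)
print(f"baseline {base:.1f}%, with framework {new:.1f}%, gain {gain:+.1f} pp")
```

A gain near zero or below zero on dense diagrams would be the outcome that counts against the central claim.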
Original abstract
While Large Language Models (LLMs) have revolutionized scientific text processing, they exhibit a significant capability gap when interpreting chemical reaction diagrams. We identify two fundamental bottlenecks restricting current systems: a Visual Deficit, where generic vision encoders struggle to resolve the strict topological connectivity of dense molecular graphs, and a Semantic Disconnect, where standard linear strings, such as SMILES, fail to effectively activate the model's latent chemical reasoning. To bridge these gaps, we propose the Chemical Visual Activation (ChemVA) framework, which employs a Visual Anchor mechanism to ground functional groups via hybrid-granularity detection, followed by a semantic alignment approach that translates visual features into entity names to maximize knowledge activation in LLMs. We evaluate our approach on OCRD-Bench, a newly constructed dataset featuring dense visual-semantic contexts and comprehensive reaction coverage to evaluate the full spectrum from recognition to reasoning. Extensive experiments on OCRD-Bench demonstrate that ChemVA achieves 92.0% structural recognition accuracy. By bridging visual and semantic bottlenecks, our framework delivers a consistent performance gain of approximately 20 percentage points across 9 diverse LLMs, enabling open-weight models to rival proprietary SOTA systems in complex chemical reasoning tasks.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces the ChemVA framework to improve LLMs' interpretation of chemical reaction diagrams. It identifies a Visual Deficit in generic vision encoders for resolving topological connectivity in dense molecular graphs and a Semantic Disconnect with linear representations like SMILES. The proposed solution uses a Visual Anchor with hybrid-granularity detection to ground functional groups, followed by semantic alignment to entity names for better knowledge activation. A new OCRD-Bench dataset is introduced for evaluating the full range from recognition to reasoning. Experiments report 92.0% structural recognition accuracy and consistent gains of about 20 percentage points across 9 LLMs, allowing open-weight models to approach proprietary SOTA performance.
Significance. If the empirical gains hold under rigorous verification, the work could meaningfully advance multimodal chemical reasoning by providing a practical way to ground LLMs in visual molecular structures. The new OCRD-Bench dataset with dense visual-semantic contexts is a useful contribution for the community. The approach of combining hybrid detection with semantic translation has potential applicability beyond chemistry to other diagram-heavy scientific domains.
major comments (2)
- [§4.2] §4 Experiments and §4.2 Results: The central claim of ~20pp gains across LLMs rests on the Visual Anchor correctly resolving strict topological connectivity (bonds, rings, attachments) without systematic errors in dense graphs. However, the reported 92.0% aggregate structural recognition accuracy on OCRD-Bench supplies no stratified error analysis by graph density or complexity, no direct comparison to ground-truth molecular graphs, and no ablation isolating the hybrid-granularity detection from generic vision encoders or prompting variations. If connectivity misdetections concentrate in the complex diagrams driving the reasoning tasks, they could inflate the observed LLM improvements.
- [§3.2] §3.2 Visual Anchor mechanism: The description of hybrid-granularity detection does not include quantitative validation (e.g., precision/recall on bond detection or ring closure in high-density subgraphs) or error propagation analysis to downstream reasoning steps. This is load-bearing for the claim that the mechanism bridges the visual deficit without introducing undetected topological errors.
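The two analyses requested above, bond-level precision and recall and a breakdown by graph density, can be sketched together. This is a minimal illustration and not the paper's evaluation code: it assumes predicted and ground-truth bonds are given as pairs of atom indices that have already been aligned between the two graphs, and it uses the ground-truth bond count as a crude density proxy. The field names and the threshold are hypothetical.

```python
from collections import defaultdict


def bond_key(a, b):
    """Canonical key for an undirected bond between atom indices a and b."""
    return (a, b) if a <= b else (b, a)


def bond_precision_recall(pred_bonds, true_bonds):
    """Precision and recall of predicted bonds against ground-truth bonds."""
    pred = {bond_key(a, b) for a, b in pred_bonds}
    gold = {bond_key(a, b) for a, b in true_bonds}
    tp = len(pred & gold)
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(gold) if gold else 0.0
    return precision, recall


def stratified_scores(samples, dense_threshold):
    """Average precision and recall per stratum ('sparse' or 'dense')."""
    buckets = defaultdict(list)
    for s in samples:
        stratum = "dense" if len(s["true_bonds"]) >= dense_threshold else "sparse"
        buckets[stratum].append(bond_precision_recall(s["pred_bonds"], s["true_bonds"]))
    report = {}
    for stratum, scores in buckets.items():
        n = len(scores)
        report[stratum] = (sum(p for p, _ in scores) / n, sum(r for _, r in scores) / n)
    return report


# Toy data (hypothetical): one small molecule and one denser one.
samples = [
    {"pred_bonds": [(0, 1), (1, 2)], "true_bonds": [(1, 0), (1, 2)]},
    {"pred_bonds": [(0, 1), (1, 2), (2, 3), (0, 3)],
     "true_bonds": [(0, 1), (1, 2), (2, 3), (3, 4)]},
]
print(stratified_scores(samples, dense_threshold=3))
```

If errors concentrate in the dense stratum, the aggregate accuracy figure would hide exactly the cases the referee is concerned about.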
minor comments (2)
- [Abstract] The abstract and introduction would benefit from explicit dataset statistics (e.g., number of diagrams, average atoms/bonds per diagram, distribution of reaction types) to contextualize the 92% accuracy figure.
- [§3.3] Notation for the semantic alignment step could be clarified with a short pseudocode or equation showing how visual features map to entity names.
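As an illustration of the kind of pseudocode this comment asks for, the mapping from detected groups to entity names could look roughly like the sketch below. This is not the authors' notation or implementation: the `Detection` type, the label set, the lookup table, and the output phrasing are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class Detection:
    label: str   # detector class label (hypothetical label set)
    anchor: int  # index of the atom the group is anchored to


# Hypothetical lookup from detector labels to entity names.
ENTITY_NAMES = {
    "COOH": "carboxyl group",
    "OH": "hydroxyl group",
    "C=O": "carbonyl group",
}


def align(detections, scaffold="unknown scaffold"):
    """Turn detections into a short textual description for an LLM prompt."""
    parts = []
    for d in sorted(detections, key=lambda d: d.anchor):
        name = ENTITY_NAMES.get(d.label, d.label)  # fall back to the raw label
        parts.append(f"{name} at atom {d.anchor}")
    if not parts:
        return scaffold
    return f"{scaffold} bearing " + ", ".join(parts)


print(align([Detection("OH", 4), Detection("COOH", 1)], scaffold="benzene ring"))
```

The point of such a mapping is that the language model receives names it has seen in text rather than a linear string such as SMILES.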
Simulated Author's Rebuttal
We thank the referee for the thoughtful and detailed comments, which help us strengthen the presentation of our work. We address each major comment below and will revise the manuscript to incorporate additional analyses where this improves rigor without altering the core claims.
Point-by-point responses
- Referee: [§4.2] §4 Experiments and §4.2 Results: The central claim of ~20pp gains across LLMs rests on the Visual Anchor correctly resolving strict topological connectivity (bonds, rings, attachments) without systematic errors in dense graphs. However, the reported 92.0% aggregate structural recognition accuracy on OCRD-Bench supplies no stratified error analysis by graph density or complexity, no direct comparison to ground-truth molecular graphs, and no ablation isolating the hybrid-granularity detection from generic vision encoders or prompting variations. If connectivity misdetections concentrate in the complex diagrams driving the reasoning tasks, they could inflate the observed LLM improvements.
  Authors: We appreciate the referee's emphasis on verifying that the reported gains stem from reliable topological resolution rather than undetected errors in complex cases. The manuscript presents the 92.0% structural recognition accuracy as an aggregate metric on OCRD-Bench together with consistent end-to-end gains across nine LLMs. To directly address the concern, we will revise §4 to include a stratified breakdown of recognition accuracy by graph density and complexity, an explicit comparison of extracted structures against ground-truth molecular graphs, and an ablation isolating the hybrid-granularity detection from standard vision encoders and prompting variations. These additions will clarify the source of the improvements.
  Revision: yes
- Referee: [§3.2] §3.2 Visual Anchor mechanism: The description of hybrid-granularity detection does not include quantitative validation (e.g., precision/recall on bond detection or ring closure in high-density subgraphs) or error propagation analysis to downstream reasoning steps. This is load-bearing for the claim that the mechanism bridges the visual deficit without introducing undetected topological errors.
  Authors: We agree that quantitative validation of the hybrid-granularity detection would strengthen the mechanistic claims. The current §3.2 describes the design rationale for combining fine- and coarse-grained anchors to resolve connectivity. In the revision we will add precision and recall figures for bond detection and ring closure evaluated on high-density subgraphs drawn from OCRD-Bench, together with a concise error-propagation analysis tracing detection errors through semantic alignment to final reasoning accuracy. These results will be placed in §3.2 and the appendix.
  Revision: yes
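The error-propagation analysis promised in the rebuttal could take a simple form: compare downstream reasoning accuracy on items where structure recognition succeeded against items where it failed. The sketch below assumes each item carries two boolean outcomes; the record layout and the toy data are hypothetical and not drawn from the paper.

```python
def conditional_accuracy(records):
    """Reasoning accuracy conditioned on whether recognition was correct.

    records: iterable of (recognition_ok, reasoning_ok) pairs.
    Returns {True: accuracy or None, False: accuracy or None}.
    """
    counts = {True: [0, 0], False: [0, 0]}  # recognition_ok -> [n_correct, n_total]
    for recognition_ok, reasoning_ok in records:
        bucket = counts[bool(recognition_ok)]
        bucket[0] += int(bool(reasoning_ok))
        bucket[1] += 1
    return {k: (c / n if n else None) for k, (c, n) in counts.items()}


# Toy records (hypothetical): 4 items recognised correctly, 4 not.
records = (
    [(True, True)] * 3 + [(True, False)]
    + [(False, True)] + [(False, False)] * 3
)
print(conditional_accuracy(records))
```

A large gap between the two conditional accuracies would indicate that recognition errors propagate strongly into the reasoning stage.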
Circularity Check
No circularity; empirical framework evaluated on new benchmark
The paper describes an empirical approach: it identifies visual and semantic bottlenecks in LLMs for chemical diagrams, proposes the ChemVA framework using a Visual Anchor for hybrid-granularity detection and semantic alignment to entity names, constructs OCRD-Bench, and reports 92% structural recognition plus ~20pp gains across LLMs. No equations, parameter fits presented as predictions, self-citations as load-bearing premises, or uniqueness theorems appear in the text. All central claims reduce to experimental measurements on the introduced dataset rather than reducing by construction to prior inputs or definitions.
Reference graph
Works this paper leans on
- [1] 2025. Benchmarking MLLMs on Topological Reasoning of Chemical Reaction Diagrams. OpenReview/arXiv Submission 1142 (2025)
- [2] 2025. Evaluating the Accuracy and Educational Potential of Generative AI Models in Pharmacy Education: A Comparative Analysis of ChatGPT and Gemini Across Bloom’s Taxonomy. Pharmacy (2025)
- [3] 2025. MolecularIQ: Characterizing Chemical Reasoning Capabilities Through Symbolic Verification on Molecular Graphs. Under Review at ICLR 2026 (2025)
- [4]
- [5] Shuai Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Sibo Song, Kai Dang, Peng Wang, Shijie Wang, Jun Tang, et al. 2025. Qwen2.5-vl technical report. arXiv preprint arXiv:2502.13923 (2025)
- [6] Daniil A Boiko, Robert MacKnight, Gabriel Gomes, et al. 2023. Autonomous chemical research with large language models. Nature 624, 7992 (2023), 570–578
- [7]
- [8] Kexin Chen, Yuyang Du, Junyou Li, Hanqun Cao, Menghao Guo, Xilin Dang, Lanqing Li, Jiezhong Qiu, Guangyong Chen, and Pheng Ann Heng. 2025. ChemMiner: A Large Language Model Agent System for Chemical Literature Data Mining. In Proceedings of the IEEE/CVF International Conference on Computer Vision. 7595–7603
- [9] Djork-Arné Clevert, Tuan Le, Robin Winter, and Floriane Montanari. 2021. Img2Mol - Accurate Molecular Structure Estimation from Images. Chemical Science 12, 42 (2021), 14174–14181
- [10] Y Diao et al. 2023. MacFrag: Segmenting large-scale molecules to obtain diverse fragments. Bioinformatics 39 (2023)
- [11] Carl Edwards et al. 2022. Translation between Molecules and Natural Language. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing
- [12] Vincent Fan, Yujie Qian, Alex Wang, Amber Wang, Connor W Coley, and Regina Barzilay. 2024. OpenChemIE: An information extraction toolkit for chemistry literature. Journal of Chemical Information and Modeling 64, 14 (2024), 5521–5534
- [13] Fernand Gobet et al. 2001. Chunking mechanisms in human learning. Trends in Cognitive Sciences 5, 6 (2001), 236–243
- [14] Yu Gu and Zhi Liang. 2025. MolRAG: Unlocking the Power of LLMs for Molecular Property Prediction. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (ACL 2025)
[15]
Chawla, Olaf Wiest, and Xiangliang Zhang
Taicheng Guo, Kehan Guo, Bozhao Nan, Zhenwen Liang, Zhichun Guo, Nitesh V . Chawla, Olaf Wiest, and Xiangliang Zhang. 2023. What can Large Language Models do in chemistry? A comprehensive benchmark on eight tasks. InAdvances in Neural Information Processing Systems (NeurIPS), V ol. 36
work page 2023
-
[16]
Stephen R Heller, Alan McNaught, Igor Pletnev, Stephen Stein, and Dmitrii Tchekhovskoi. 2015. InChI, the IUPAC international chemical identifier.Journal of cheminformatics7, 1 (2015), 23
work page 2015
-
[17]
Steven M Kearnes, Michael R Maser, Michael Wleklinski, Anton Kast, Abigail G Doyle, Spencer D Dreher, Joel M Hawkins, Klavs F Jensen, and Connor W Coley
-
[18]
The open reaction database.Journal of the American Chemical Society143, 45 (2021), 18820–18826
work page 2021
-
[19]
Sunghwan Kim, Jie Chen, Tiejun Cheng, Asta Gindulyte, Jia He, Siqian He, Qingliang Li, Benjamin A Shoemaker, Paul A Thiessen, Bo Yu, et al . 2023. PubChem 2023 update.Nucleic Acids Research51, D1 (2023), D1373–D1380
work page 2023
-
[20]
Greg Landrum et al. 2013. RDKit: Open-source cheminformatics. http://www. rdkit.org. Accessed: 2025-05-20
work page 2013
-
[21]
Junxian Li, Di Zhang, Xunzhi Wang, Zeying Hao, et al. 2025. Chemvlm: Exploring the power of multimodal large language models in chemistry area. InProceedings of the AAAI Conference on Artificial Intelligence, V ol. 39. 415–423
work page 2025
- [22]
- [23] Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C Lawrence Zitnick. 2014. Microsoft coco: Common objects in context. In European conference on computer vision. Springer, 740–755
- [24] Pan Lu, Swaroop Mishra, Tony Xia, Liang Qiu, Kai-Wei Chang, Song-Chun Zhu, Oyvind Tafjord, Peter Clark, and Ashwin Kalyan. 2022. Learn to Explain: Multimodal Reasoning via Thought Chains for Science Question Answering. In The 36th Conference on Neural Information Processing Systems (NeurIPS)
- [25] Lucas Morin, Martin Danelljan, Miguel I Agea, et al. 2023. MolGrapher: Graph-based Visual Recognition of Chemical Structures. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV). 19552–19561
- [26] Martijn Oldenhof, Adam Arany, Yves Moreau, and Jaak Simm. 2021. Self-labeling of fully mediating representations by graph alignment. In Benelux Conference on Artificial Intelligence. Springer, 46–65
- [27] Yujie Qian, Jiang Guo, Regina Barzilay, and Connor Coley. 2023. RxnScribe: A Sequence Generation Model for Reaction Diagram Parsing. In Journal of Chemical Information and Modeling, Vol. 63. ACS Publications, 4030–4041
- [28] Yujie Qian, Jiang Guo, Zhengkai Tu, Zhening Li, Connor W. Coley, and Regina Barzilay. 2023. MolScribe: Robust Molecular Structure Recognition with Image-to-Graph Generation. Journal of Chemical Information and Modeling 63, 18 (2023), 5833–5844
- [29] Yujie Qian, Jiang Guo, Zhengkai Tu, Zhening Li, Connor W. Coley, and Regina Barzilay. 2024. RxnScribe: A Unified Framework for Chemical Reaction Diagram Parsing. arXiv preprint arXiv:2305.11845 (2024)
- [30] K Rajan, H O Brinkhaus, M I Agea, A Zielesny, and C Steinbeck. 2023. DECIMER.ai: An open platform for automated optical chemical structure identification, segmentation and recognition in scientific publications. Nature communications 14, 1 (2023), 5045
- [31] LG Research, Sehyun Chun, Jiye Kim, Ahra Jo, Yeonsik Jo, Seungyul Oh, Seungjun Lee, Kwangrok Ryoo, Jongmin Lee, Seung Hwan Kim, et al
- [32]
- [33] Nicholas T Runcie et al. 2025. Can Reasoning Power Significantly Improve the Knowledge of Large Language Models for Chemistry? Journal of Chemical Information and Modeling (2025)
- [34] Nicholas T. Runcie, Charlotte M. Deane, and Fergus Imrie. 2025. ChemIQ: A Benchmark for Chemical Reasoning and Molecular Comprehension. arXiv preprint arXiv:2505.07735 (2025)
- [35] Christof Schütt, Kohulan Rajan, Achim Zielesny, and Christoph Steinbeck. 2020. DECIMER 1.0: Deep Learning for Chemical Image Recognition Using Transformers. Journal of Chemical Information and Modeling 60 (2020), 5359–5372. Also published in J. Cheminf. as separate work, please verify specific citation
- [36] Ayush Kumar Shah, Abhisek Dey, Leo Luo, Bryan Amador, Patrick Philippy, Ming Zhong, Siru Ouyang, David Mark Friday, David Bianchi, Nick Jackson, et al
- [37] Multimodal Search in Chemical Documents and Reactions. In Proceedings of the 48th International ACM SIGIR Conference on Research and Development in Information Retrieval. 4030–4034
- [38] Joshua Staker, Kyle Marshall, Robert Abel, and Carolyn M McQuaw. 2019. Molecular structure extraction from documents using deep learning. Journal of chemical information and modeling 59, 3 (2019), 1017–1029
- [39]
- [40] Xiaoxuan Wang, Yanqiao Zhu, Zemand Liu, et al. 2024. SciBench: Evaluating College-Level Scientific Problem-Solving Abilities of Large Language Models. In International Conference on Machine Learning (ICML)
- [41] Damian M Wilary and Jacqueline M Cole. 2021. ReactionDataExtractor: A tool for automated extraction of information from chemical reaction schemes. Journal of chemical information and modeling 61, 10 (2021), 4962–4974
- [42] Damian M Wilary and Jacqueline M Cole. 2023. ReactionDataExtractor 2.0: A deep learning approach for data extraction from chemical reaction schemes. Journal of Chemical Information and Modeling 63, 19 (2023), 6053–6067
- [43] Ruiling Xu, Yifan Zhang, Qingyun Wang, Carl Edwards, and Heng Ji. 2025. oMeBench: Towards Robust Benchmarking of LLMs in Organic Mechanism Elucidation and Reasoning. arXiv preprint arXiv:2510.07731
- [44]
- [45] Xiang Yue, Yuansheng Ni, Kai Zhang, Tianyu Zheng, Ruoqi Liu, Ge Zhang, Samuel Stevens, Dongfu Jiang, Weiming Ren, Yuxuan Sun, et al. 2024. Mmmu: A massive multi-discipline multimodal understanding and reasoning benchmark for expert agi. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 9556–9567
- [46]
- [47]
- [48] Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zi Lin, Zhuohan Li, Dacheng Li, Eric P Xing, et al. 2023. Judging llm-as-a-judge with mt-bench and chatbot arena. arXiv preprint arXiv:2306.05685 (2023)
- [49] Jiaxi Zhuang, Kangning Li, Jue Hou, Mingjun Xu, Zhifeng Gao, and Hengxing Cai. 2025. Doc2SAR: A Synergistic Framework for High-Fidelity Extraction of Structure-Activity Relationships from Scientific Documents. arXiv preprint arXiv:2506.21625 (2025).
Preprint, 2026, Rao et al.
A1 Implementation Details: Prompts and Instruction Tuning
A1.1 Reaction Diagram ...
- [50] - In disconnected layouts (multiple independent reactions in one image), treat them as separate Reaction IDs. ### Output Format (STRICT) ### Respond with a valid JSON list of objects. Do not include markdown code blocks (“‘json). [ {"reaction_id": 1, "role": "Reactant", "bbox": [x1, y1, x2, y2]}, {"reaction_id": 1, "role": "Arrow", "bbox": [x1, y1, x2, y2...
- [51] Finally, determine the connectivity (Bonds) between all identified nodes. User Prompt ### Task Description ### Analyze the molecular image and generate a structured JSON representation containing supernodes, atoms, and bonds. ### Decomposition Constraints (CRITICAL) ### 1. Visual Priority (Top-Down): - Rule: Prioritize the detection of Functional Group Pa...
- [52] Return the center coordinates [x, y] of these anchor atoms (normalized 0-1000). ### Output Format (STRICT) ### Respond with a JSON object containing a single list of coordinates. { "anchors": [ [x1, y1], [x2, y2] ] } ### Input Image ### {cropped_molecule_image} Now, locate the anchors for the {target_label} at {target_bbox}: A2 Data Construction Details ...
- [53] Priority Hierarchy Definition. We constructed a hierarchical dictionary where functional groups are ranked by heavy atom count, topological complexity, and semantic weight. High-complexity groups (e.g., Carboxyl −COOH, Amide −CONH2) are assigned higher matching priority than their constituents (e.g., Carbonyl C=O, Hydroxyl −OH)
- [54] Recursive Matching with Exclusivity. For each SMILES string, we perform recursive substructure matching using RDKit [19]. Crucially, we enforce an Atom-wise Exclusivity Constraint: once an atom is assigned to a high-priority "super-node" (e.g., the Carbon in −COOH), it is locked and explicitly excluded from subsequent scans. This prevents the redundant la...
- [55] Residual Atom Handling. After the greedy matching process, any remaining atoms (typically satisfying the saturated alkane skeleton) are retained as atomic tokens. This hybrid-granularity approach ensures that the model captures both high-level functional semantics and low-level structural details. We then calculate the 2D bounding box and the precise anc...
- [56] Carboxylic Anhydride
- [57] Hemiacetal/Hemiketal
- [58] Sulfo (Sulfonic Acid)
- [59] Identify the substrate (Benzene) and reagent (Chloroethane) in the image and encode them into SMILES
Halo
A3 OCRD-Bench Framework: Design and Metrics
To comprehensively evaluate Multimodal Large Language Models (MLLMs) on organic chemistry reasoning, we designed OCRD-Bench, a hierarchical benchmark covering 8 major reaction categories (see Figure A2). The evaluation is structured into three cognitive tiers, ranging from visual perception to deep mechan...
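The priority-ordered matching with atom-wise exclusivity described in the data-construction fragments above can be sketched in plain Python. This is a simplified illustration and not the authors' code: it assumes the substructure matches for each functional group have already been computed (for example with a cheminformatics toolkit such as RDKit) and are passed in as tuples of atom indices. The function name, group names, and the hand-written matches for acetic acid are hypothetical.

```python
def assign_supernodes(num_atoms, matches_by_group, priority):
    """Greedy assignment of atoms to functional-group super-nodes.

    matches_by_group: {group_name: [tuple_of_atom_indices, ...]}
    priority: group names ordered from highest to lowest priority.
    Returns (supernodes, residual_atoms).
    """
    locked = set()      # atoms already claimed by a super-node
    supernodes = []
    for group in priority:
        for match in matches_by_group.get(group, []):
            # Exclusivity: skip a match that touches any locked atom.
            if locked.isdisjoint(match):
                supernodes.append((group, tuple(match)))
                locked.update(match)
    # Atoms not covered by any group stay as plain atomic tokens.
    residual = [i for i in range(num_atoms) if i not in locked]
    return supernodes, residual


# Acetic acid, CC(=O)O, with atoms indexed 0..3 (matches written by hand).
matches = {
    "Carboxyl": [(1, 2, 3)],
    "Carbonyl": [(1, 2)],
    "Hydroxyl": [(3,)],
}
order = ["Carboxyl", "Carbonyl", "Hydroxyl"]
print(assign_supernodes(4, matches, order))
```

Because the carboxyl match locks atoms 1 to 3 first, the lower-priority carbonyl and hydroxyl matches are skipped, and only the methyl carbon remains as a residual atom.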