pith. machine review for the scientific record.

arxiv: 2604.15994 · v2 · submitted 2026-04-17 · 💻 cs.AI

Recognition: unknown

ReactBench: A Benchmark for Topological Reasoning in MLLMs on Chemical Reaction Diagrams

Authors on Pith: no claims yet

Pith reviewed 2026-05-10 08:59 UTC · model grok-4.3

classification 💻 cs.AI
keywords reasoning · diagrams · structural · benchmark · mllms · tasks · across · chemical

The pith

The ReactBench benchmark shows that MLLMs suffer a performance drop of more than 30% on complex topological reasoning tasks relative to basic ones when evaluated on chemical reaction diagrams.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Multimodal large language models handle simple images and linear diagrams well but struggle when visuals contain complex connectivity such as branching paths, merging flows, and loops. Chemical reaction diagrams provide a natural mix of these structures, making them a good test of whether models can both recognize local parts and understand global connections. The ReactBench benchmark includes 1,618 expert-created questions organized into four difficulty levels that demand increasing amounts of structural understanding. When 17 different MLLMs were tested, accuracy fell sharply on tasks requiring full diagram topology compared with easier anchor-based questions. Additional checks confirmed that the weakness lies in the reasoning steps, not in failing to perceive the image elements. This points to a core limitation in how current models process connected information in scientific visuals.
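To make the reported gap concrete, here is a minimal sketch of how per-dimension accuracy and the anchor-versus-holistic gap could be computed from per-item results. The dimension names and record layout are assumptions for illustration (taken loosely from the figure captions), not the paper's released evaluation code.

```python
from collections import defaultdict

# Hypothetical per-item results for one model: each record names the task
# dimension a question belongs to and whether the model answered correctly.
# The dimension names are assumptions based on the figure captions; the paper
# may label or group the four dimensions differently.
results = [
    {"dimension": "element_localization", "correct": True},
    {"dimension": "connectivity_tracing", "correct": False},
    # ... one record per QA pair (1,618 in ReactBench) ...
]

def accuracy_by_dimension(records):
    """Mean accuracy per task dimension."""
    totals, hits = defaultdict(int), defaultdict(int)
    for r in records:
        totals[r["dimension"]] += 1
        hits[r["dimension"]] += int(r["correct"])
    return {d: hits[d] / totals[d] for d in totals}

def anchor_vs_holistic_gap(acc, anchor_dims, holistic_dims):
    """Difference between mean accuracy on anchor-based dimensions and on
    holistic structural-reasoning dimensions; a value above 0.30 would
    correspond to the >30% gap reported in the abstract."""
    anchor = sum(acc[d] for d in anchor_dims) / len(anchor_dims)
    holistic = sum(acc[d] for d in holistic_dims) / len(holistic_dims)
    return anchor - holistic
```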

Core claim

Extensive evaluation across 17 MLLMs reveals a significant performance gap exceeding 30% between anchor-based tasks and holistic structural reasoning tasks. Controlled ablations confirm this bottleneck lies in reasoning, not perception.

Load-bearing premise

That the expert-annotated QA pairs and chemical reaction diagrams accurately isolate topological reasoning from semantic comprehension or perceptual factors, and that the observed gap generalizes beyond the specific 1,618 pairs.

Figures

Figures reproduced from arXiv: 2604.15994 by Bin Feng, He Cao, Leqing Chen, Qiang Xu, Shengyuan Bai, Yuanyuan Liu, Yu Li, Yu Wang, Zijing Liu.

Figure 1. Comparing General Diagram and Chemical Reaction Diagram on Element Localization Tasks. GPT-4o performs well on general diagrams for identification tasks; for chemical reaction diagrams, however, the structural complexity of molecular diagrams and reaction pathways leads to errors.
Figure 2. Overview of ReactBench. ReactBench systematically evaluates topological reasoning on chemical reaction diagrams across four complexity-stratified dimensions.
Figure 3. Composition of ChemReaction.
Figure 4. ReactBench dataset construction and evaluation pipeline. (a) Data acquisition involves systematic collection of chemical reaction diagrams from peer-reviewed literature and patent databases, followed by preprocessing through manual curation and quality filtering protocols. (b) Annotation framework encompasses structured generation and expert validation of question-answer pairs, ensuring alignment with tar…
Figure 5. Qualitative analysis of recurring failure patterns in ReactBench evaluation.
Figure 6. Performance comparison of three models with…
Figure 7. Example of a chemical reaction diagram used for the text-only ablation study. The visual representation is translated into a structured JSON format that preserves the chemical semantics (SMILES, text).
Figure 8. Illustration of the visual context ablation. Detailed molecule images are masked and replaced with rectangle placeholders to isolate the topological structure.
Figure 9. The direct answer prompt in our ReactBench.
Figure 10. The prompt of Chain-of-Thought in our ReactBench.
Figure 11. The prompt of external knowledge in our ReactBench.
Figure 12. External Knowledge Example in our ReactBench. This JSON represents molecular structures and reaction relationships extracted from the image; the bboxes section defines detected elements with their IDs, bounding box coordinates, and a category_id, where 1 indicates molecules, 2 denotes text, 3 corresponds to compound identifiers (numerical labels without molecular structures), and 4 represents… (see the sketch after this figure list).
Figure 13. Examples of Element Localization Task in our ReactBench (e.g., "Which term classifies the structure of the reaction pathway? A) Single line B) Multiple line C) Tree D) Graph"; "Is there a circular reaction in the diagram, YES or NO?").
Figure 14. Examples of Topology Reasoning Task in our ReactBench.
Figure 15. Examples of Information Extraction Task in our ReactBench.
Figure 16. Examples of Connectivity Tracing Task in our ReactBench.
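Figures 7 and 12 describe a structured JSON rendering of each diagram used in the text-only and external-knowledge ablations. The sketch below shows one plausible shape of such a record, based only on the caption text (a bboxes list with IDs, coordinates, and category_id values 1 = molecule, 2 = text, 3 = compound identifier); the smiles field and the reaction-edge encoding are assumptions added for illustration.

```python
# Hypothetical example of the structured JSON described in Figures 7 and 12.
# Only the bboxes layout and the category_id meanings come from the captions;
# the "smiles" field and the "reactions" edge list are assumptions showing how
# chemical semantics and connectivity could be preserved without the image.
example_record = {
    "bboxes": [
        {"id": 0, "bbox": [829.4, 21.2, 204.0, 163.5], "category_id": 1,
         "smiles": "CCO"},                 # molecule drawing (SMILES assumed)
        {"id": 1, "bbox": [1050.0, 60.0, 80.0, 30.0], "category_id": 2,
         "text": "NaOH, reflux"},          # free text, e.g. reaction conditions
        {"id": 2, "bbox": [900.0, 200.0, 20.0, 20.0], "category_id": 3,
         "text": "14"},                    # compound identifier label
    ],
    # Assumed encoding of the reaction topology as directed edges between
    # element IDs; this is the connectivity the masked-image ablation isolates.
    "reactions": [
        {"reactants": [0], "products": [2], "conditions": [1]},
    ],
}
```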
read the original abstract

Multimodal Large Language Models (MLLMs) excel at recognizing individual visual elements and reasoning over simple linear diagrams. However, when faced with complex topological structures involving branching paths, converging flows, and cyclic dependencies, their reasoning capabilities degrade sharply, even on tasks as basic as counting endpoints. Existing benchmarks fail to probe this gap, focusing on semantic comprehension rather than structural reasoning. We introduce ReactBench, a benchmark that reveals fundamental limitations in structural reasoning through chemical reaction diagrams. These real-world scientific diagrams offer an ideal testbed because they naturally span diverse structures from linear chains to cyclic graphs, while requiring both precise local recognition and coherent global reasoning. Our benchmark comprises 1,618 expert-annotated QA pairs across four hierarchical task dimensions. Extensive evaluation across 17 MLLMs reveals a significant performance gap exceeding 30% between anchor-based tasks and holistic structural reasoning tasks. Controlled ablations confirm this bottleneck lies in reasoning, not perception. These findings expose a fundamental deficit in structural understanding and establish directions for advancing visual reasoning.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces ReactBench, a new benchmark of 1,618 expert-annotated QA pairs on chemical reaction diagrams, to probe topological reasoning limitations in MLLMs. It evaluates 17 models and reports a >30% performance gap between anchor-based tasks and holistic structural reasoning tasks, with controlled ablations claimed to show that the deficit is in reasoning rather than perception.

Significance. If the benchmark and ablations successfully isolate pure topological structure from chemical semantics and perceptual factors, the work would supply a useful diagnostic for MLLM limitations on complex scientific diagrams and could inform targeted improvements in structural visual reasoning.

major comments (2)
  1. [Abstract] Abstract and the description of the four task dimensions: the claim that the >30% gap reflects a 'fundamental deficit in structural understanding' and that ablations isolate reasoning from perception rests on the assumption that expert QA pairs cleanly separate topological features (branching, cycles, connectivity) from semantic cues (atom types, bond orders, reaction labels). No explicit verification is provided that the anchor-based vs. holistic split neutralizes these cues, leaving open the possibility that models exploit residual chemical meaning differently across task types.
  2. [Ablations] The ablation section (referenced in the abstract as 'controlled ablations'): details on the text-only, masked, or variant constructions are insufficient to confirm they remove semantic comprehension while preserving topological structure. Without quantitative results showing that performance drops are attributable solely to reasoning (e.g., error breakdowns by task dimension or inter-annotator agreement on isolation), the attribution to a reasoning bottleneck rather than perceptual or semantic confounds cannot be fully evaluated.
minor comments (2)
  1. [Results] The manuscript should include a table or appendix listing the exact distribution of the 1,618 pairs across the four hierarchical task dimensions and the 17 models evaluated, along with error bars or statistical significance tests for the reported 30% gap (a sketch of one such check follows these comments).
  2. [Benchmark Construction] Clarify the precise definition of 'anchor-based' versus 'holistic structural reasoning' tasks with one or two concrete QA examples per category to aid reproducibility.
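One way to meet the request in minor comment 1 would be a simple resampling check on the gap. The sketch below is an illustration over hypothetical per-item correctness arrays, not an analysis reported in the paper; it assumes per-item results are available in this form.

```python
import random

def bootstrap_gap_ci(anchor_correct, holistic_correct, n_boot=10_000, seed=0):
    """Percentile bootstrap 95% CI for the accuracy gap between anchor-based
    and holistic tasks for one model. Inputs are lists of 0/1 correctness
    values per question; the data layout is an assumption for illustration."""
    rng = random.Random(seed)
    gaps = []
    for _ in range(n_boot):
        # Resample each task group with replacement and recompute the gap.
        a = [rng.choice(anchor_correct) for _ in anchor_correct]
        h = [rng.choice(holistic_correct) for _ in holistic_correct]
        gaps.append(sum(a) / len(a) - sum(h) / len(h))
    gaps.sort()
    return gaps[int(0.025 * n_boot)], gaps[int(0.975 * n_boot)]
```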

Circularity Check

0 steps flagged

No circularity: empirical benchmark with direct evaluation

full rationale

The paper creates a new dataset of 1,618 expert-annotated QA pairs on chemical reaction diagrams and reports direct empirical results from evaluating 17 MLLMs plus controlled ablations. No mathematical derivations, parameter fitting, predictions from fitted inputs, or self-citation chains appear in the abstract or described structure. The performance gap and 'reasoning not perception' conclusion follow from the new test data itself rather than reducing to prior fitted quantities or self-referential definitions. This is a standard empirical benchmark paper with no load-bearing circular steps.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the domain assumption that chemical reaction diagrams form an ideal testbed for isolating topological reasoning and that the four hierarchical task dimensions validly measure it.

axioms (1)
  • domain assumption: Chemical reaction diagrams naturally span diverse topological structures, from linear to cyclic, and require both local recognition and global reasoning.
    Invoked in abstract to justify choice of testbed.

pith-pipeline@v0.9.0 · 5499 in / 1151 out tokens · 47152 ms · 2026-05-10T08:59:37.475940+00:00 · methodology

discussion (0)

