pith. machine review for the scientific record.

arxiv: 2604.15994 · v2 · submitted 2026-04-17 · 💻 cs.AI

Recognition: unknown

ReactBench: A Benchmark for Topological Reasoning in MLLMs on Chemical Reaction Diagrams

Authors on Pith: no claims yet

Pith reviewed 2026-05-10 08:59 UTC · model grok-4.3

classification 💻 cs.AI
keywords reasoning · diagrams · structural · benchmark · mllms · tasks · across · chemical

The pith

The ReactBench benchmark shows that MLLMs suffer a performance drop of more than 30% on complex topological reasoning tasks relative to basic ones when evaluated on chemical reaction diagrams.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Multimodal large language models handle simple images and linear diagrams well but struggle when visuals contain complex connectivity such as branching paths, merging flows, and loops. Chemical reaction diagrams provide a natural mix of these structures, making them a good test of whether models can both recognize local parts and understand global connections. The ReactBench benchmark includes 1,618 expert-created questions organized into four difficulty levels that demand increasing amounts of structural understanding. When 17 different MLLMs were tested, accuracy fell sharply on tasks requiring full diagram topology compared with easier anchor-based questions. Additional checks confirmed that the weakness lies in the reasoning steps, not in failing to perceive the image elements. This points to a core limitation in how current models process connected information in scientific visuals.
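To make the reported gap concrete, here is a minimal sketch of how per-dimension accuracy and the anchor-versus-holistic gap could be computed from per-item results. The dimension names and record layout are assumptions for illustration (taken loosely from the figure captions), not the paper's released evaluation code.

```python
from collections import defaultdict

# Hypothetical per-item results for one model: each record names the task
# dimension a question belongs to and whether the model answered correctly.
# The dimension names are assumptions based on the figure captions; the paper
# may label or group the four dimensions differently.
results = [
    {"dimension": "element_localization", "correct": True},
    {"dimension": "connectivity_tracing", "correct": False},
    # ... one record per QA pair (1,618 in ReactBench) ...
]

def accuracy_by_dimension(records):
    """Mean accuracy per task dimension."""
    totals, hits = defaultdict(int), defaultdict(int)
    for r in records:
        totals[r["dimension"]] += 1
        hits[r["dimension"]] += int(r["correct"])
    return {d: hits[d] / totals[d] for d in totals}

def anchor_vs_holistic_gap(acc, anchor_dims, holistic_dims):
    """Difference between mean accuracy on anchor-based dimensions and on
    holistic structural-reasoning dimensions; a value above 0.30 would
    correspond to the >30% gap reported in the abstract."""
    anchor = sum(acc[d] for d in anchor_dims) / len(anchor_dims)
    holistic = sum(acc[d] for d in holistic_dims) / len(holistic_dims)
    return anchor - holistic
```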

Core claim

Extensive evaluation across 17 MLLMs reveals a significant performance gap exceeding 30% between anchor-based tasks and holistic structural reasoning tasks. Controlled ablations confirm this bottleneck lies in reasoning, not perception.

Load-bearing premise

That the expert-annotated QA pairs and chemical reaction diagrams accurately isolate topological reasoning from semantic comprehension or perceptual factors, and that the observed gap generalizes beyond the specific 1,618 pairs.

Figures

Figures reproduced from arXiv: 2604.15994 by Bin Feng, He Cao, Leqing Chen, Qiang Xu, Shengyuan Bai, Yuanyuan Liu, Yu Li, Yu Wang, Zijing Liu.

Figure 1. Comparing General Diagram and Chemical Reaction Diagram on Element Localization Tasks. GPT-4o performs well on general diagrams for identification tasks; for chemical reaction diagrams, however, the structural complexity of molecular diagrams and reaction pathways leads to errors.
Figure 2. Overview of ReactBench. ReactBench systematically evaluates topological reasoning on chemical reaction diagrams across four complexity-stratified dimensions.
Figure 3. Composition of ChemReaction.
Figure 4. ReactBench dataset construction and evaluation pipeline. (a) Data acquisition involves systematic collection of chemical reaction diagrams from peer-reviewed literature and patent databases, followed by preprocessing through manual curation and quality filtering protocols. (b) Annotation framework encompasses structured generation and expert validation of question-answer pairs, ensuring alignment with tar…
Figure 5. Qualitative analysis of recurring failure patterns in ReactBench evaluation.
Figure 6. Performance comparison of three models with…
Figure 7. Example of a chemical reaction diagram used for the text-only ablation study. The visual representation is translated into a structured JSON format that preserves the chemical semantics (SMILES, text).
Figure 8. Illustration of the visual context ablation. Detailed molecule images are masked and replaced with rectangle placeholders to isolate the topological structure.
Figure 9. The direct answer prompt in our ReactBench.
Figure 10. The prompt of Chain-of-Thought in our ReactBench.
Figure 11. The prompt of external knowledge in our ReactBench.
Figure 12. External Knowledge Example in our ReactBench. This JSON represents molecular structures and reaction relationships extracted from the image; the bboxes section defines detected elements with their IDs, bounding box coordinates, and a category_id, where 1 indicates molecules, 2 denotes text, 3 corresponds to compound identifiers (numerical labels without molecular structures), and 4 represents… (see the sketch after this figure list).
Figure 13. Examples of Element Localization Task in our ReactBench (e.g., "Which term classifies the structure of the reaction pathway? A) Single line B) Multiple line C) Tree D) Graph"; "Is there a circular reaction in the diagram, YES or NO?").
Figure 14. Examples of Topology Reasoning Task in our ReactBench.
Figure 15. Examples of Information Extraction Task in our ReactBench.
Figure 16. Examples of Connectivity Tracing Task in our ReactBench.
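Figures 7 and 12 describe a structured JSON rendering of each diagram used in the text-only and external-knowledge ablations. The sketch below shows one plausible shape of such a record, based only on the caption text (a bboxes list with IDs, coordinates, and category_id values 1 = molecule, 2 = text, 3 = compound identifier); the smiles field and the reaction-edge encoding are assumptions added for illustration.

```python
# Hypothetical example of the structured JSON described in Figures 7 and 12.
# Only the bboxes layout and the category_id meanings come from the captions;
# the "smiles" field and the "reactions" edge list are assumptions showing how
# chemical semantics and connectivity could be preserved without the image.
example_record = {
    "bboxes": [
        {"id": 0, "bbox": [829.4, 21.2, 204.0, 163.5], "category_id": 1,
         "smiles": "CCO"},                 # molecule drawing (SMILES assumed)
        {"id": 1, "bbox": [1050.0, 60.0, 80.0, 30.0], "category_id": 2,
         "text": "NaOH, reflux"},          # free text, e.g. reaction conditions
        {"id": 2, "bbox": [900.0, 200.0, 20.0, 20.0], "category_id": 3,
         "text": "14"},                    # compound identifier label
    ],
    # Assumed encoding of the reaction topology as directed edges between
    # element IDs; this is the connectivity the masked-image ablation isolates.
    "reactions": [
        {"reactants": [0], "products": [2], "conditions": [1]},
    ],
}
```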
read the original abstract

Multimodal Large Language Models (MLLMs) excel at recognizing individual visual elements and reasoning over simple linear diagrams. However, when faced with complex topological structures involving branching paths, converging flows, and cyclic dependencies, their reasoning capabilities degrade sharply, even on tasks as basic as counting endpoints. Existing benchmarks fail to probe this gap, focusing on semantic comprehension rather than structural reasoning. We introduce ReactBench, a benchmark that reveals fundamental limitations in structural reasoning through chemical reaction diagrams. These real-world scientific diagrams offer an ideal testbed because they naturally span diverse structures from linear chains to cyclic graphs, while requiring both precise local recognition and coherent global reasoning. Our benchmark comprises 1,618 expert-annotated QA pairs across four hierarchical task dimensions. Extensive evaluation across 17 MLLMs reveals a significant performance gap exceeding 30% between anchor-based tasks and holistic structural reasoning tasks. Controlled ablations confirm this bottleneck lies in reasoning, not perception. These findings expose a fundamental deficit in structural understanding and establish directions for advancing visual reasoning.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces ReactBench, a new benchmark of 1,618 expert-annotated QA pairs on chemical reaction diagrams, to probe topological reasoning limitations in MLLMs. It evaluates 17 models and reports a >30% performance gap between anchor-based tasks and holistic structural reasoning tasks, with controlled ablations claimed to show that the deficit is in reasoning rather than perception.

Significance. If the benchmark and ablations successfully isolate pure topological structure from chemical semantics and perceptual factors, the work would supply a useful diagnostic for MLLM limitations on complex scientific diagrams and could inform targeted improvements in structural visual reasoning.

major comments (2)
  1. [Abstract] Abstract and the description of the four task dimensions: the claim that the >30% gap reflects a 'fundamental deficit in structural understanding' and that ablations isolate reasoning from perception rests on the assumption that expert QA pairs cleanly separate topological features (branching, cycles, connectivity) from semantic cues (atom types, bond orders, reaction labels). No explicit verification is provided that the anchor-based vs. holistic split neutralizes these cues, leaving open the possibility that models exploit residual chemical meaning differently across task types.
  2. [Ablations] The ablation section (referenced in the abstract as 'controlled ablations'): details on the text-only, masked, or variant constructions are insufficient to confirm they remove semantic comprehension while preserving topological structure. Without quantitative results showing that performance drops are attributable solely to reasoning (e.g., error breakdowns by task dimension or inter-annotator agreement on isolation), the attribution to a reasoning bottleneck rather than perceptual or semantic confounds cannot be fully evaluated.
minor comments (2)
  1. [Results] The manuscript should include a table or appendix listing the exact distribution of the 1,618 pairs across the four hierarchical task dimensions and the 17 models evaluated, along with error bars or statistical significance tests for the reported 30% gap (a sketch of one such check follows these comments).
  2. [Benchmark Construction] Clarify the precise definition of 'anchor-based' versus 'holistic structural reasoning' tasks with one or two concrete QA examples per category to aid reproducibility.
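One way to meet the request in minor comment 1 would be a simple resampling check on the gap. The sketch below is an illustration over hypothetical per-item correctness arrays, not an analysis reported in the paper; it assumes per-item results are available in this form.

```python
import random

def bootstrap_gap_ci(anchor_correct, holistic_correct, n_boot=10_000, seed=0):
    """Percentile bootstrap 95% CI for the accuracy gap between anchor-based
    and holistic tasks for one model. Inputs are lists of 0/1 correctness
    values per question; the data layout is an assumption for illustration."""
    rng = random.Random(seed)
    gaps = []
    for _ in range(n_boot):
        # Resample each task group with replacement and recompute the gap.
        a = [rng.choice(anchor_correct) for _ in anchor_correct]
        h = [rng.choice(holistic_correct) for _ in holistic_correct]
        gaps.append(sum(a) / len(a) - sum(h) / len(h))
    gaps.sort()
    return gaps[int(0.025 * n_boot)], gaps[int(0.975 * n_boot)]
```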

Circularity Check

0 steps flagged

No circularity: empirical benchmark with direct evaluation

full rationale

The paper creates a new dataset of 1,618 expert-annotated QA pairs on chemical reaction diagrams and reports direct empirical results from evaluating 17 MLLMs plus controlled ablations. No mathematical derivations, parameter fitting, predictions from fitted inputs, or self-citation chains appear in the abstract or described structure. The performance gap and 'reasoning not perception' conclusion follow from the new test data itself rather than reducing to prior fitted quantities or self-referential definitions. This is a standard empirical benchmark paper with no load-bearing circular steps.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the domain assumption that chemical reaction diagrams form an ideal testbed for isolating topological reasoning and that the four hierarchical task dimensions validly measure it.

axioms (1)
  • domain assumption: Chemical reaction diagrams naturally span diverse topological structures, from linear to cyclic, and require both local recognition and global reasoning.
    Invoked in abstract to justify choice of testbed.

pith-pipeline@v0.9.0 · 5499 in / 1151 out tokens · 47152 ms · 2026-05-10T08:59:37.475940+00:00 · methodology

discussion (0)

