ReactBench: A Benchmark for Topological Reasoning in MLLMs on Chemical Reaction Diagrams
Pith reviewed 2026-05-10 08:59 UTC · model grok-4.3
The pith
The ReactBench benchmark shows that MLLMs suffer a performance drop of more than 30% on holistic topological reasoning tasks relative to basic anchor-based tasks when evaluated on chemical reaction diagrams.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Extensive evaluation across 17 MLLMs reveals a significant performance gap exceeding 30% between anchor-based tasks and holistic structural reasoning tasks. Controlled ablations confirm this bottleneck lies in reasoning, not perception.
Load-bearing premise
That the expert-annotated QA pairs and chemical reaction diagrams accurately isolate topological reasoning from semantic comprehension or perceptual factors, and that the observed gap generalizes beyond the specific 1,618 pairs.
Original abstract
Multimodal Large Language Models (MLLMs) excel at recognizing individual visual elements and reasoning over simple linear diagrams. However, when faced with complex topological structures involving branching paths, converging flows, and cyclic dependencies, their reasoning capabilities degrade sharply, even on tasks as basic as counting endpoints. Existing benchmarks fail to probe this gap, focusing on semantic comprehension rather than structural reasoning. We introduce ReactBench, a benchmark that reveals fundamental limitations in structural reasoning through chemical reaction diagrams. These real-world scientific diagrams offer an ideal testbed because they naturally span diverse structures from linear chains to cyclic graphs, while requiring both precise local recognition and coherent global reasoning. Our benchmark comprises 1,618 expert-annotated QA pairs across four hierarchical task dimensions. Extensive evaluation across 17 MLLMs reveals a significant performance gap exceeding 30% between anchor-based tasks and holistic structural reasoning tasks. Controlled ablations confirm this bottleneck lies in reasoning, not perception. These findings expose a fundamental deficit in structural understanding and establish directions for advancing visual reasoning.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces ReactBench, a new benchmark of 1,618 expert-annotated QA pairs on chemical reaction diagrams, to probe topological reasoning limitations in MLLMs. It evaluates 17 models and reports a >30% performance gap between anchor-based tasks and holistic structural reasoning tasks, with controlled ablations claimed to show that the deficit is in reasoning rather than perception.
Significance. If the benchmark and ablations successfully isolate pure topological structure from chemical semantics and perceptual factors, the work would supply a useful diagnostic for MLLM limitations on complex scientific diagrams and could inform targeted improvements in structural visual reasoning.
major comments (2)
- [Abstract] Regarding the abstract and the description of the four task dimensions: the claim that the >30% gap reflects a 'fundamental deficit in structural understanding', and that the ablations isolate reasoning from perception, rests on the assumption that the expert QA pairs cleanly separate topological features (branching, cycles, connectivity) from semantic cues (atom types, bond orders, reaction labels). No explicit verification is provided that the anchor-based vs. holistic split neutralizes these cues, leaving open the possibility that models exploit residual chemical meaning differently across task types.
- [Ablations] The ablation section (referenced in the abstract as 'controlled ablations') gives too little detail on the text-only, masked, or other variant constructions to confirm that they remove semantic comprehension while preserving topological structure. Without quantitative results attributing the performance drops solely to reasoning (e.g., error breakdowns by task dimension or inter-annotator agreement on the isolation), the attribution to a reasoning bottleneck rather than perceptual or semantic confounds cannot be fully evaluated; a sketch of one such check follows below.
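To make the requested attribution check concrete: a minimal sketch of how the reasoning-vs-perception gap could be quantified from paired per-item results with a bootstrap interval. The record format, numbers, and helper names are illustrative assumptions, not the paper's evaluation code.

```python
import random

# Hypothetical per-item correctness (1 = correct, 0 = wrong), aligned by QA pair:
# "original" uses the full diagram; "masked" is the perception-controlled variant
# in which molecule drawings are replaced by rectangle placeholders (cf. Figure 8).
original = [1, 0, 1, 1, 0, 1, 0, 0, 1, 1]
masked   = [1, 0, 1, 0, 0, 1, 0, 0, 1, 0]

def accuracy(xs):
    return sum(xs) / len(xs)

def bootstrap_gap_ci(a, b, n_boot=10_000, seed=0):
    """95% CI for accuracy(a) - accuracy(b) via a paired bootstrap."""
    rng = random.Random(seed)
    n = len(a)
    gaps = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]
        gaps.append(accuracy([a[i] for i in idx]) - accuracy([b[i] for i in idx]))
    gaps.sort()
    return gaps[int(0.025 * n_boot)], gaps[int(0.975 * n_boot)]

gap = accuracy(original) - accuracy(masked)
low, high = bootstrap_gap_ci(original, masked)
# A CI straddling zero would mean masking the molecule images did not change
# performance: evidence for a reasoning bottleneck rather than a perceptual one.
print(f"gap = {gap:.2f}, 95% CI = [{low:.2f}, {high:.2f}]")
```

Reported per task dimension, an interval of this kind would also address the error-bar request in the minor comments below.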
minor comments (2)
- [Results] The manuscript should include a table or appendix listing the exact distribution of the 1,618 pairs across the four hierarchical task dimensions and the 17 models evaluated, along with error bars or statistical significance tests for the reported 30% gap.
- [Benchmark Construction] Clarify the precise definition of 'anchor-based' versus 'holistic structural reasoning' tasks with one or two concrete QA examples per category to aid reproducibility; a hypothetical illustration follows below.
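To illustrate the requested distinction (a hypothetical construction, not drawn from the benchmark itself): represent the reaction scheme as a directed graph, where an anchor-based question reads off one labeled element and a holistic question requires a property of the whole topology.

```python
# Hypothetical reaction scheme as a directed graph: edges point from
# reactants/intermediates to products. Node names are placeholders.
scheme = {
    "A": ["B", "C"],   # A branches into B and C
    "B": ["D"],
    "C": ["D"],        # B and C converge on D
    "D": ["E"],
    "E": [],           # E is a terminal product (endpoint)
}

# Anchor-based: a local lookup tied to one named element,
# e.g. "What does intermediate B convert into?"
anchor_answer = scheme["B"][0]    # -> "D"

# Holistic structural reasoning: a property of the whole graph,
# e.g. "How many terminal products (endpoints) does the scheme have?"
endpoints = [node for node, successors in scheme.items() if not successors]
holistic_answer = len(endpoints)  # -> 1

print(anchor_answer, holistic_answer)
```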
Circularity Check
No circularity: empirical benchmark with direct evaluation
full rationale
The paper creates a new dataset of 1,618 expert-annotated QA pairs on chemical reaction diagrams and reports direct empirical results from evaluating 17 MLLMs plus controlled ablations. No mathematical derivations, parameter fitting, predictions from fitted inputs, or self-citation chains appear in the abstract or described structure. The performance gap and 'reasoning not perception' conclusion follow from the new test data itself rather than reducing to prior fitted quantities or self-referential definitions. This is a standard empirical benchmark paper with no load-bearing circular steps.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Chemical reaction diagrams naturally span diverse topological structures from linear to cyclic and require both local recognition and global reasoning.
Reference graph
Works this paper leans on
- [1] BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models. arXiv:2301.12597, 2023.
- [2] DECIMER.ai: An Open Platform for Automated Optical Chemical Structure Identification, Segmentation and Recognition in Scientific Publications. Nature Communications, 14(1):5045, 2023.
- [3] Vision Transformer with Quadrangle Attention. arXiv:2303.15105, 2023.
Appendix excerpts
- Appendix A (Empirical Validation of OCSR Limitations): the paper posits that existing Optical Chemical Structure Recognition (OCSR) methods are insufficient for topological reasoning because they fail to capture the structural connectivity of reaction diagrams.
- Figure 8: Illustration of the visual context ablation. Detailed molecule images are masked and replaced with rectangle placeholders to isolate the topological structure.
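A minimal sketch of how such masking could be reproduced, assuming COCO-style [x, y, width, height] boxes with category_id 1 marking molecule drawings (the field names follow the annotation snippet below; the script is a reconstruction, not the authors' code):

```python
from PIL import Image, ImageDraw

# Bounding boxes in the recovered annotation format: [x, y, width, height];
# category_id 1 = molecule drawings (assumption based on the snippet below).
bboxes = [
    {"id": 0, "bbox": [829.38, 21.2, 204.02, 163.45], "category_id": 1},
]

def mask_molecules(image_path, annotations, out_path):
    """Replace each molecule drawing with a plain rectangle placeholder,
    leaving arrows and layout intact so only the topology remains visible."""
    img = Image.open(image_path).convert("RGB")
    draw = ImageDraw.Draw(img)
    for ann in annotations:
        if ann["category_id"] != 1:
            continue
        x, y, w, h = ann["bbox"]
        draw.rectangle([x, y, x + w, y + h], fill="white", outline="black")
    img.save(out_path)

# mask_molecules("scheme.png", bboxes, "scheme_masked.png")
```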
- Figure 9: The direct-answer prompt in ReactBench. The model must provide only the numerical answer (e.g., "80") without units, symbols, or additional text, returned as a strict JSON object: {"answer": "<numerical_value>", "explanation": "<step-by-step reasoning>"}.
- Figure 10: The chain-of-thought prompt in ReactBench. The model is instructed to begin with structural parsing of the reaction image, then emit the same strict JSON object.
- Figure 11: The external-knowledge prompt in ReactBench. Inputs are the reaction scheme image plus supplemental JSON data; the model must cross-validate information between the two modalities, explicitly explain how both contribute to the answer, and return {"answer": "<numerical_value>", "explanation": "<integration_steps>"}.
- Annotation format: COCO-style bounding boxes, e.g. {"bboxes": [{"id": 0, "bbox": [829.38, 21.2, 204.02, 163.45], "category_id": 1}, ...]}, with category_id 1 marking molecule drawings.
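Given the strict JSON output contract in these prompts, scoring reduces to extracting the "answer" field and comparing numbers. A sketch, assuming the model may wrap its JSON in extra text; the tolerance rules here are ours, not the paper's:

```python
import json
import re

def extract_answer(model_output: str):
    """Pull the "answer" field out of a (possibly chatty) model response
    that was instructed to emit {"answer": ..., "explanation": ...}."""
    match = re.search(r"\{.*\}", model_output, re.DOTALL)
    if not match:
        return None
    try:
        payload = json.loads(match.group(0))
    except json.JSONDecodeError:
        return None
    return payload.get("answer")

def is_correct(model_output: str, gold: str) -> bool:
    answer = extract_answer(model_output)
    if answer is None:
        return False
    try:
        # The prompts demand a bare number, e.g. "80", so compare numerically.
        return float(answer) == float(gold)
    except (TypeError, ValueError):
        return str(answer).strip() == gold.strip()

print(is_correct('{"answer": "80", "explanation": "..."}', "80"))  # True
```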
discussion (0)