Recognition: 2 theorem links · Lean Theorem
DISSECT: Diagnosing Where Vision Ends and Language Priors Begin in Scientific VLMs
Pith reviewed 2026-05-10 20:11 UTC · model grok-4.3
The pith
Vision-language models can describe scientific images correctly yet still fail when reasoning from them, revealing a perception-integration gap that hits open-source systems harder.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The central claim is that vision-language models exhibit a perception-integration gap: they successfully extract visual content from scientific diagrams yet lose that information during downstream reasoning. The DISSECT benchmark measures this by comparing performance across Vision+Text, Text-Only, Vision-Only, Human Oracle, and Model Oracle inputs. Results from 18 models indicate that chemistry allows less exploitation of language priors than biology, that open-source models answer more accurately from their own image descriptions (Model Oracle) than from raw images, and that closed-source models show no such difference, making integration the distinguishing frontier.
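Of the diagnostic gaps the claim invokes, only the integration gap is quoted explicitly later on this page (Acc(OM) − Acc(V+T)); a minimal statement of that one comparison, with the remaining gaps left as the paper defines them:

```latex
% Integration gap, as quoted in the linked passage below:
% positive when the model answers better from its own verbal description
% of the image (Model Oracle, OM) than from the raw image (Vision+Text, V+T).
\Delta_{\text{integration}} = \mathrm{Acc}(\mathrm{OM}) - \mathrm{Acc}(\mathrm{V{+}T})
```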
What carries the argument
The Model Oracle protocol: the VLM first produces a verbal description of the input image and then answers the question from that description alone, isolating integration effectiveness from raw visual perception.
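A minimal sketch of that two-step protocol, assuming a generic `ask_model(prompt, image=None)` wrapper around whatever VLM API is under evaluation; the function name and prompt wording here are illustrative, not the paper's exact templates:

```python
def model_oracle_answer(ask_model, image, question):
    """Two-step Model Oracle: verbalize the image, then answer from the text alone."""
    # Step 1: the VLM describes the image without answering the question.
    description = ask_model(
        prompt=f"Describe everything relevant in this image for the question below, "
               f"but do NOT answer it.\nQuestion: {question}",
        image=image,
    )
    # Step 2: the same VLM answers from its own description; the image is withheld,
    # so any change vs. Vision+Text is attributed to integration rather than perception.
    answer = ask_model(
        prompt=f"Image description:\n{description}\n\nQuestion: {question}\n"
               f"Answer with only the option letter or the number.",
        image=None,
    )
    return description, answer
```

Because the second call needs only text, the same wrapper can be pointed at any existing benchmark's question set after the fact, which is what makes the protocol post-hoc applicable.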
If this is right
- Chemistry questions provide a stricter test of genuine visual reasoning because they allow less exploitation of language priors than biology questions.
- Open-source models suffer a systematic integration bottleneck that the Model Oracle protocol exposes by letting them reason from their own image descriptions.
- Closed-source models already bridge perception and integration effectively, showing no performance lift from the Model Oracle.
- The Model Oracle can be run post-hoc on any existing VLM benchmark to diagnose integration failures without new data collection.
Where Pith is reading between the lines
- Improving open-source multimodal models may require targeted work on fusion mechanisms rather than simply adding more image-text pairs.
- The same diagnostic approach could be extended to other diagram-heavy domains such as physics or engineering to check whether the gap is domain-specific.
- If the gap persists across new model releases, it suggests that scaling alone will not close the separation between description and reasoning.
Load-bearing premise
The verbalization step in the Model Oracle does not itself introduce new perception or language biases that distort the measured difference between raw-image and description-based reasoning.
What would settle it
If open-source models show equal or lower accuracy when reasoning from their own verbalized descriptions than from the original raw images on the same questions, the claimed integration bottleneck would not hold.
read the original abstract
When asked to describe a molecular diagram, a Vision-Language Model correctly identifies "a benzene ring with an -OH group." When asked to reason about the same image, it answers incorrectly. The model can see but it cannot think about what it sees. We term this the perception-integration gap: a failure where visual information is successfully extracted but lost during downstream reasoning, invisible to single-configuration benchmarks that conflate perception with integration under one accuracy number. To systematically expose such failures, we introduce DISSECT, a 12,000-question diagnostic benchmark spanning Chemistry (7,000) and Biology (5,000). Every question is evaluated under five input modes (Vision+Text, Text-Only, Vision-Only, Human Oracle, and a novel Model Oracle in which the VLM first verbalizes the image and then reasons from its own description), yielding diagnostic gaps that decompose performance into language-prior exploitation, visual extraction, perception fidelity, and integration effectiveness. Evaluating 18 VLMs, we find that: (1) Chemistry exhibits substantially lower language-prior exploitability than Biology, confirming molecular visual content as a harder test of genuine visual reasoning; (2) Open-source models consistently score higher when reasoning from their own verbalized descriptions than from raw images, exposing a systematic integration bottleneck; and (3) Closed-source models show no such gap, indicating that bridging perception and integration is the frontier separating open-source from closed-source multimodal capability. The Model Oracle protocol is both model and benchmark agnostic, applicable post-hoc to any VLM evaluation to diagnose integration failures.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces DISSECT, a 12,000-question diagnostic benchmark (7,000 Chemistry, 5,000 Biology) that evaluates 18 VLMs across five input modes—Vision+Text, Text-Only, Vision-Only, Human Oracle, and Model Oracle—to decompose VLM performance into language-prior exploitation, visual extraction, perception fidelity, and integration effectiveness. The central claims are that Chemistry shows lower language-prior exploitability than Biology, open-source models exhibit a systematic integration bottleneck (higher accuracy on Model Oracle than raw images), and closed-source models show no such gap.
Significance. If the diagnostic gaps are robust, the work supplies a practical, model-agnostic protocol for isolating perception-integration failures in scientific VLMs. The five-mode design and post-hoc applicability of Model Oracle could become a standard tool for benchmark analysis, directly addressing why single-configuration accuracy numbers obscure distinct failure modes in domains like molecular reasoning.
major comments (2)
- [Methods, Model Oracle protocol] The headline finding that open-source models show higher accuracy on Model Oracle than raw images (while closed-source models do not) rests on the assumption that a model's own verbalized description supplies a comparable visual proxy. No control is described that measures verbalization quality or completeness (e.g., by comparing Model Oracle descriptions against Human Oracle descriptions or by scoring description fidelity separately). Without such a control, the observed gap could reflect differences in perception or text-modality advantage rather than downstream integration effectiveness. This directly affects the validity of claims (2) and (3).
- [Experimental setup and results] The manuscript reports aggregate results over 12,000 questions but provides insufficient detail on question construction (image sourcing, answer-key validation, controls for visual complexity), inter-rater reliability for the Human Oracle, or statistical testing of the reported gaps (e.g., confidence intervals or significance tests on mode differences within and across domains). These omissions are load-bearing for the Chemistry-vs-Biology and open-vs-closed comparisons.
minor comments (2)
- [Abstract] The abstract and introduction could more explicitly state the split between open-source and closed-source models evaluated (e.g., exact counts per category).
- [Figures] Figure captions for the five-mode schematic should include a brief legend clarifying how each mode isolates a specific diagnostic component.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback, which has helped us identify areas to strengthen the methodological rigor and transparency of the DISSECT benchmark. We address each major comment below and have revised the manuscript accordingly.
read point-by-point responses
-
Referee: [Methods, Model Oracle protocol] The headline finding that open-source models show higher accuracy on Model Oracle than raw images (while closed-source models do not) rests on the assumption that a model's own verbalized description supplies a comparable visual proxy. No control is described that measures verbalization quality or completeness (e.g., by comparing Model Oracle descriptions against Human Oracle descriptions or by scoring description fidelity separately). Without such a control, the observed gap could reflect differences in perception or text-modality advantage rather than downstream integration effectiveness. This directly affects the validity of claims (2) and (3).
Authors: We appreciate this observation and agree that quantifying verbalization quality would provide a stronger control for interpreting the Model Oracle results. The protocol is designed such that the Model Oracle uses the VLM's own extracted description as input for reasoning, isolating integration from raw visual input; the Human Oracle then serves as a high-fidelity text baseline. However, we did not include an explicit fidelity scoring of the model's verbalizations. In the revised manuscript, we will add a dedicated analysis: we sample 200 images per domain, have two independent domain experts rate Model Oracle and Human Oracle descriptions on completeness, accuracy, and relevance (using a 1-5 Likert scale with inter-rater agreement reported), and correlate these scores with the observed performance gaps. This will allow us to assess whether verbalization differences contribute to the open-source integration bottleneck and will be presented in a new subsection of the results. revision: yes
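A sketch of the proposed fidelity analysis, assuming two expert raters score each sampled description on the 1-5 scale, with quadratic-weighted Cohen's kappa for inter-rater agreement and a Spearman correlation between mean fidelity and the per-item Model Oracle vs. raw-image gap; the variable names and the choice of kappa variant are illustrative, not the authors' committed design:

```python
import numpy as np
from sklearn.metrics import cohen_kappa_score
from scipy.stats import spearmanr

def fidelity_analysis(rater_a, rater_b, oracle_correct, raw_correct):
    """rater_a, rater_b: 1-5 Likert scores per sampled description.
    oracle_correct, raw_correct: 0/1 correctness under Model Oracle and Vision+Text."""
    rater_a, rater_b = np.asarray(rater_a), np.asarray(rater_b)
    # Inter-rater agreement on the Likert fidelity scores.
    kappa = cohen_kappa_score(rater_a, rater_b, weights="quadratic")
    # Per-item gap between description-based and raw-image reasoning.
    gap = np.asarray(oracle_correct, float) - np.asarray(raw_correct, float)
    # Does higher-fidelity verbalization track a larger (or smaller) gap?
    rho, p = spearmanr((rater_a + rater_b) / 2, gap)
    return {"weighted_kappa": kappa, "spearman_rho": rho, "p_value": p}
```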
-
Referee: [Experimental setup and results] The manuscript reports aggregate results over 12,000 questions but provides insufficient detail on question construction (image sourcing, answer-key validation, controls for visual complexity), inter-rater reliability for the Human Oracle, or statistical testing of the reported gaps (e.g., confidence intervals or significance tests on mode differences within and across domains). These omissions are load-bearing for the Chemistry-vs-Biology and open-vs-closed comparisons.
Authors: We concur that greater detail is required for full reproducibility and to support the domain and model-type comparisons. In the revised manuscript, we will substantially expand the Methods section with: (i) image sourcing details (e.g., licensed diagrams from PubChem, BioRender, scientific textbooks, and papers, with explicit exclusion criteria for overly complex or ambiguous visuals); (ii) answer-key validation protocol (two-stage review by PhD-level chemists and biologists, with disagreement resolution); (iii) visual complexity controls (categorization into low/medium/high complexity based on element count and relation density, with balanced sampling); (iv) inter-rater reliability for Human Oracle (Fleiss' kappa and percentage agreement across three annotators per question); and (v) statistical analysis (bootstrapped 95% confidence intervals on all accuracy figures, plus paired McNemar tests and two-way ANOVA for mode differences, Chemistry vs. Biology, and open- vs. closed-source models, with p-values and effect sizes). These additions will be accompanied by supplementary tables. revision: yes
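A sketch of the paired statistics named in the response, assuming per-question 0/1 correctness vectors for two modes on the same items; the bootstrap resamples questions for a 95% CI on the accuracy difference, and the exact McNemar test reduces to a binomial test on the discordant pairs. Names and defaults are illustrative:

```python
import numpy as np
from scipy.stats import binomtest

def paired_mode_stats(correct_a, correct_b, n_boot=10_000, seed=0):
    """correct_a, correct_b: 0/1 arrays for the same questions under two input modes."""
    a, b = np.asarray(correct_a), np.asarray(correct_b)
    rng = np.random.default_rng(seed)
    n = len(a)
    # Bootstrap 95% CI on the accuracy difference, resampling questions with replacement.
    idx = rng.integers(0, n, size=(n_boot, n))
    diffs = a[idx].mean(axis=1) - b[idx].mean(axis=1)
    ci = np.percentile(diffs, [2.5, 97.5])
    # Exact McNemar test: only items where the two modes disagree carry information.
    n01 = int(np.sum((a == 1) & (b == 0)))  # mode A right, mode B wrong
    n10 = int(np.sum((a == 0) & (b == 1)))  # mode A wrong, mode B right
    p = binomtest(n01, n01 + n10, 0.5).pvalue if (n01 + n10) > 0 else 1.0
    return {"acc_diff": a.mean() - b.mean(), "ci95": ci, "mcnemar_p": p}
```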
Circularity Check
No circularity: pure empirical benchmark with direct score comparisons
full rationale
The paper introduces DISSECT as a diagnostic benchmark and reports accuracy differences across five input modes (Vision+Text, Text-Only, Vision-Only, Human Oracle, Model Oracle) for 18 VLMs on chemistry and biology questions. All claims rest on observed performance gaps rather than any derivation, equation, fitted parameter, or self-referential prediction. The Model Oracle is defined operationally (model verbalizes image then reasons from its description) and evaluated by direct comparison; no step reduces a result to its own inputs by construction. No uniqueness theorems, ansatzes, or load-bearing self-citations appear in the provided text. The evaluation is self-contained and falsifiable via external replication on the same benchmark.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Differences in performance across Vision+Text, Text-Only, Vision-Only, Human Oracle, and Model Oracle modes can be attributed specifically to language-prior exploitation, visual extraction, perception fidelity, and integration effectiveness.
Lean theorems connected to this paper
-
IndisputableMonolith/Foundation/AbsoluteFloorClosure.lean · reality_from_one_distinction · unclear
unclear: Relation between the paper passage and the cited Recognition theorem.
five input modes—Vision+Text, Text-Only, Vision-Only, Human Oracle, and a novel Model Oracle... Integration Gap: Acc(OM) − Acc(V+T)
-
IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
unclear: Relation between the paper passage and the cited Recognition theorem.
Chemistry exhibits substantially lower language-prior exploitability than Biology
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Reference graph
Works this paper leans on
-
[1]
Aishwarya Agrawal, Dhruv Batra, Devi Parikh, and Aniruddha Kembhavi. 2018. Don't just assume; look and answer: Overcoming priors for visual question answering. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 4971–4980
2018
-
[2]
Anthropic. 2025. Introducing Claude 4. Anthropic News (22 May 2025). https://www.anthropic.com/news/claude-4
2025
-
[3]
Shuai Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Sibo Song, Kai Dang, Peng Wang, Shijie Wang, Jun Tang, Humen Zhong, Yuanzhi Zhu, Mingkun Yang, Zhaohai Li, Jianqiang Wan, Pengfei Wang, Wei Ding, Zheren Fu, Yiheng Xu, Jiabo Ye, Xi Zhang, Tianbao Xie, Zesen Cheng, Hang Zhang, Zhibo Yang, Haiyang Xu, and Junyang Lin. 2025. Qwen2.5-VL Technical Rep...
arXiv 2025
-
[4]
Gheorghe Comanici, Eric Bieber, Mike Schaekermann, Ice Pasupat, Noveen Sachdeva, Inderjit Dhillon, Marcel Blistein, Ori Ram, Dan Zhang, Evan Rosen, et al. 2025. Gemini 2.5: Pushing the frontier with advanced reasoning, multi-modality, long context, and next generation agentic capabilities. arXiv preprint arXiv:2507.06261 (2025)
arXiv 2025
-
[5]
Yiming Cui, Xin Yao, Yuxuan Qin, Xin Li, Shijin Wang, and Guoping Hu. 2025. Evaluating large language models on multimodal chemistry olympiad exams. Communications Chemistry (2025)
2025
-
[6]
Rocktim Das, Simeon Hristov, Haonan Li, Dimitar Dimitrov, Ivan Koychev, and Preslav Nakov. 2024. Exams-v: A multi-discipline multilingual multimodal exam benchmark for evaluating vision language models. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 7768–7791
2024
-
[7]
Yash Goyal, Tejas Khot, Douglas Summers-Stay, Dhruv Batra, and Devi Parikh. 2017. Making the V in VQA matter: Elevating the role of image understanding in visual question answering. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 6904–6913
2017
-
[9]
Ziyu Guo, Renrui Zhang, Hao Chen, Jialin Gao, Dongzhi Jiang, Jiaze Wang, and Pheng-Ann Heng. 2025. Sciverse: Unveiling the knowledge comprehension and visual reasoning of lmms on multi-modal scientific problems. In Findings of the Association for Computational Linguistics: ACL 2025. 19683–19704
2025
- [10]
-
[11]
Bo Li, Yuanhan Zhang, Dong Guo, Renrui Zhang, Feng Li, Hao Zhang, Kaichen Zhang, Peiyuan Zhang, Yanwei Li, Ziwei Liu, and Chunyuan Li. 2025. LLaVA-OneVision: Easy Visual Task Transfer. Transactions on Machine Learning Research (2025). https://openreview.net/forum?id=zKv8qULV6n
2025
-
[12]
Junxian Li, Di Zhang, Xunzhi Wang, Zeying Hao, Jingdi Lei, Qian Tan, Cai Zhou, Wei Liu, Yaotian Yang, Xinrui Xiong, et al. 2025. Chemvlm: Exploring the power of multimodal large language models in chemistry area. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 39. 415–423
2025
-
[13]
Zhining Liu, Ziyi Chen, Hui Liu, Chen Luo, Xianfeng Tang, Suhang Wang, Joy Zeng, Zhenwei Dai, Zhan Shi, Tianxin Wei, et al. 2025. Seeing but not believing: Probing the disconnect between visual attention and answer correctness i...
-
[14]
Pan Lu, Hritik Bansal, Tony Xia, Jiacheng Liu, Chunyuan Li, Hannaneh Hajishirzi, Hao Cheng, Kai-Wei Chang, Michel Galley, and Jianfeng Gao. 2024. MathVista: Evaluating Mathematical Reasoning of Foundation Models in Visual Contexts. In The Twelfth International Conference on Learning Representations. https://openreview.net/forum?id=KUNzEQMWU7
2024
-
[15]
Tiange Luo, Ang Cao, Gunhee Lee, Justin Johnson, and Honglak Lee. 2025. Probing Visual Language Priors in VLMs. In International Conference on Machine Learning. PMLR, 41120–41156
2025
-
[16]
OpenAI. 2025. Introducing GPT-5. OpenAI (7 Aug. 2025). https://openai.com/gpt-5/
2025
- [17]
-
[18]
Xiang Yue, Yuansheng Ni, Kai Zhang, Tianyu Zheng, Ruoqi Liu, Ge Zhang, Samuel Stevens, Dongfu Jiang, Weiming Ren, Yuxuan Sun, et al. 2024. Mmmu: A massive multi-discipline multimodal understanding and reasoning benchmark for expert agi. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 9556–9567
2024
-
[19]
Renrui Zhang, Dongzhi Jiang, Yichi Zhang, Haokun Lin, Ziyu Guo, Pengshuo Qiu, Aojun Zhou, Pan Lu, Kai-Wei Chang, Yu Qiao, et al. 2024. Mathverse: Does your multi-modal llm truly see the diagrams in visual math problems?. In European Conference on Computer Vision. Springer, 169–186
2024
-
[20]
Jinguo Zhu, Weiyun Wang, Zhe Chen, Zhaoyang Liu, Shenglong Ye, Lixin Gu, Hao Tian, Yuchen Duan, Weijie Su, Jie Shao, et al. 2025. InternVL3: Exploring advanced training and test-time recipes for open-source multimodal models. arXiv preprint arXiv:2504.10479 (2025)
arXiv 2025
Prompt templates (Appendix A, recovered fragments)
Fragments of the prompt templates used across the five evaluation modes:
- Vision-Only prompt: read the question text embedded in the image and examine the scientific content in the image.
- Answer-format rules (shared across modes): if the question is multiple choice, respond with only the option letter (e.g., "A" or "B") and no explanation; if it requires a numerical answer, respond with only the number; do not reproduce or restate the question.
- Model Oracle description prompt: given the image and the question text, the model must not answer, only describe. It is directed to cover overall layout (type of scientific diagram, number and spatial arrangement of components); structures and shapes (for Chemistry: atoms, bonds, ring systems, functional groups, stereochemistry indicators, and charge symbols; for Biology: organelles, tissue types, organs, organisms, or structural components); text and labels (transcribe all visible text and note each label's position relative to the structure it annotates); arrows, lines, and flow (direction, start point, end point); colors and visual encoding (color coding, shading, hatching, highlighting); numerical and quantitative data (values, units, measurements, angles, coordinates); and spatial relationships between key components.
- Vision+Text reasoning prompt: first, carefully examine the image and identify all relevant visual information; then, reason through the problem using the visual information and scientific knowledge; finally, show the reasoning step by step and write "FINAL ANSWER:" followed by only the option letter or the numerical answer, committing to a single answer without hedging.