Encoded but Not Routed: Explaining the Table-Chart Gap in Scientific Claim Verification

Akiko Aizawa; Andre Greiner-Petter; Florian Boudin; Sunisth Kumar; Tim Schopf; Xanh Ho

arxiv: 2606.01679 · v1 · pith:IDBGC55Onew · submitted 2026-06-01 · 💻 cs.CL

Sunisth Kumar, Xanh Ho, Tim Schopf, Andre Greiner-Petter, Florian Boudin, Akiko Aizawa

Pith reviewed 2026-06-28 15:10 UTC · model grok-4.3

classification 💻 cs.CL

keywords table-chart gap · scientific claim verification · multimodal LLMs · linear probing · attention analysis · vision-language models · routing failure · information encoding

The pith

Chart information is encoded in vision-language models but does not reach the prediction token, unlike table information.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper examines why multimodal models verify scientific claims better from tables than from charts of identical data. Using layer-wise linear probing and attention analysis on three open-weight vision-language models, it demonstrates that chart data gets encoded in intermediate layers yet fails to arrive at the position used for the final prediction. This routing gap does not appear with tables and persists across tested conditions. The analysis identifies two distinct architectural patterns for the disconnect in different model families. The work therefore reframes the performance difference as a failure in routing encoded information rather than in extracting it from charts.

Core claim

The central claim is that chart information is encoded in the models' intermediate representations but does not reach the prediction position, a gap absent for tables that holds across all conditions. Attention analysis reveals this disconnect takes two architecturally distinct forms across model families. This reframes the table-chart gap as a failure of routing encoded visual information at prediction time rather than a failure of encoding itself.

What carries the argument

Layer-wise linear probing of intermediate representations combined with attention analysis to track whether encoded information reaches the prediction token.
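
The contrast between probing the prediction position and probing a pooled representation can be sketched on synthetic data. This is a minimal illustration for a single layer, not the paper's pipeline: the hidden states are random numbers, the difference-of-means probe stands in for whatever classifier the authors trained, and every function name here is hypothetical. Repeating the same procedure on each layer's hidden states would give a layer-wise curve.

```python
import random

def last_token(hidden):
    """Hidden vector at the final (prediction) position."""
    return hidden[-1]

def mean_pool(hidden):
    """Average of the hidden vectors over all sequence positions."""
    n = len(hidden)
    return [sum(col) / n for col in zip(*hidden)]

def fit_mean_difference_probe(features, labels):
    """Linear probe direction: mean(positive class) - mean(negative class)."""
    pos = [f for f, y in zip(features, labels) if y == 1]
    neg = [f for f, y in zip(features, labels) if y == 0]
    mu_pos = [sum(c) / len(pos) for c in zip(*pos)]
    mu_neg = [sum(c) / len(neg) for c in zip(*neg)]
    return [p - q for p, q in zip(mu_pos, mu_neg)]

def auroc(scores, labels):
    """Probability that a random positive outscores a random negative."""
    pos_scores = [s for s, y in zip(scores, labels) if y == 1]
    neg_scores = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos_scores for n in neg_scores)
    return wins / (len(pos_scores) * len(neg_scores))

def make_example(label, rng, seq_len=6, dim=4):
    """Synthetic sequence: the label signal sits in every position
    except the last one, which stays pure noise (an assumption made
    only to mimic the 'encoded but not routed' pattern)."""
    hidden = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(seq_len)]
    shift = 3.0 if label == 1 else -3.0
    for t in range(seq_len - 1):
        hidden[t][0] += shift
    return hidden

def probe_auroc(pool, train, test):
    """Fit the probe on pooled training features, score held-out examples."""
    direction = fit_mean_difference_probe(
        [pool(h) for h, _ in train], [y for _, y in train])
    scores = [sum(w * x for w, x in zip(direction, pool(h))) for h, _ in test]
    return auroc(scores, [y for _, y in test])

rng = random.Random(0)
data = [(make_example(i % 2, rng), i % 2) for i in range(400)]
train, test = data[:300], data[300:]
mean_pool_auroc = probe_auroc(mean_pool, train, test)
last_token_auroc = probe_auroc(last_token, train, test)
print(f"mean-pool AUROC:  {mean_pool_auroc:.2f}")
print(f"last-token AUROC: {last_token_auroc:.2f}")
```

In this toy setup the pooled probe separates the classes while the last-token probe cannot, which is the qualitative shape the paper reports for charts. The toy says nothing about why a real model would behave this way.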

If this is right

The performance advantage of tables over charts stems from successful routing of encoded data to the prediction step.
Different vision-language model architectures exhibit distinct mechanisms for the routing failure with charts.
Addressing the table-chart gap requires methods that ensure visual information influences the prediction position.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Modifying attention mechanisms or adding explicit routing layers could close the gap in chart-based verification.
The same encoding-without-routing pattern may appear in other multimodal tasks beyond scientific claim verification.
Probing methods like those used here could diagnose similar issues in new model releases.

Load-bearing premise

Linear probing at intermediate layers measures information available for downstream use, and attention patterns indicate whether information reaches the prediction token.
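
One way to make the attention half of this premise concrete is a normalized image-token attention ratio. The sketch below assumes a definition consistent with the truncated Figure 2 caption fragment (a value of 1.0 means image tokens receive attention in proportion to their count): attention mass on image tokens divided by the image tokens' share of the input. The exact formula, any averaging over heads, and the function name are assumptions, not taken from the paper.

```python
def normalized_image_attention(attn_row, image_positions):
    """Attention mass on image tokens divided by their share of the input.

    attn_row: attention weights from the prediction position to every
              input position (non-negative).
    image_positions: indices of the image tokens in the input.
    """
    total = sum(attn_row)
    image_mass = sum(attn_row[i] for i in image_positions) / total
    image_share = len(image_positions) / len(attn_row)
    return image_mass / image_share

# Eight input tokens, the first four of which are image tokens.
image_idx = [0, 1, 2, 3]

# Uniform attention: image tokens get exactly their proportional share.
uniform = [1 / 8] * 8
ratio_uniform = normalized_image_attention(uniform, image_idx)

# Attention concentrated on text tokens: image tokens are under-attended.
text_heavy = [0.05, 0.05, 0.05, 0.05, 0.2, 0.2, 0.2, 0.2]
ratio_text_heavy = normalized_image_attention(text_heavy, image_idx)

print(ratio_uniform, ratio_text_heavy)
```

A ratio below 1.0 says only that the prediction position puts less weight on image tokens than their count would suggest. As the referee report notes, it does not by itself show how much information flows through those attention edges.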

What would settle it

A model variant in which the prediction token is forced to attend to chart representations should close the performance gap with tables if the routing account is correct; if the gap persists under that intervention, the account is in doubt.

Figures

Figures reproduced from arXiv: 2606.01679 by Akiko Aizawa, Andre Greiner-Petter, Florian Boudin, Sunisth Kumar, Tim Schopf, Xanh Ho.

**Figure 2.** Probe AUROC by layer for Qwen2.5-VL-32B on SciTabAlign+. (a) Last-token: table rises sharply in late layers, chart variants remain near chance. (b) Mean-pool: chart signal remains decodable across all layers, but does not reach prediction position. Results for all models are shown in Appendix B.2. [Spilled text fragment:] … region. A value of â^(l) = 1.0 means the model attends to image tokens in proportion to their count in the i…

**Figure 3.** Probe accuracy vs. model inference accuracy on SciTabAlign+. The …

**Figure 4.** Image-token attention relative to the propor…

**Figure 8.** Coefficient of variation in image-token at…

**Figure 5.** Baseline prompt template for claim verifi…

**Figure 6.** Probe AUROC by layer for all three models on SciTabAlign+. Top row: last-token probing. Bottom row: …

**Figure 7.** Last-token probe AUROC heatmap across layers and formats for all models. Table evidence (top row) …

**Figure 9.** Chain-of-thought prompt template for claim …

**Figure 10.** Representative error analysis example for Qwen2.5-VL-7B.

**Figure 11.** Mean cosine similarity between table and chart mean-pooled representations across all layers for each …
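
The quantity named in the Figure 11 caption, cosine similarity between mean-pooled table and chart representations per layer, can be sketched as follows. This is an illustrative reading of the caption only: the toy vectors are made up, the function names are hypothetical, and the averaging over examples that the caption's "mean" implies is left out.

```python
import math

def mean_pool(hidden):
    """Average the token vectors of one layer into a single vector."""
    n = len(hidden)
    return [sum(col) / n for col in zip(*hidden)]

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def layerwise_similarity(table_layers, chart_layers):
    """One cosine similarity per layer between the pooled representations.

    Each argument is a list over layers; each layer is a list of token
    vectors. The two inputs may have different numbers of tokens because
    pooling removes the sequence dimension.
    """
    return [cosine(mean_pool(t), mean_pool(c))
            for t, c in zip(table_layers, chart_layers)]

# Toy hidden states for two layers (2-dimensional vectors).
table_layers = [
    [[1.0, 0.0], [1.0, 0.0]],   # layer 0: table tokens
    [[1.0, 1.0], [3.0, 3.0]],   # layer 1: table tokens
]
chart_layers = [
    [[0.0, 1.0], [0.0, 1.0], [0.0, 1.0]],  # layer 0: chart tokens
    [[2.0, 2.0]],                           # layer 1: chart tokens
]
sims = layerwise_similarity(table_layers, chart_layers)
print(sims)
```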

Original abstract

Multimodal LLMs are increasingly used to assist scientific peer review, where a core requirement is verifying whether claims in a paper are supported by its evidence. Prior work has shown that models perform substantially better at this task when the evidence is a table than when it is a chart of the same underlying data. This raises the question of whether models fail to extract information from charts, or do they extract it but fail to use it when forming their prediction? We study this question through layer-wise linear probing and attention analysis on three open-weight VLMs over table and chart evidence, representing the same underlying data. We find consistent evidence for the latter. Chart information is encoded in the models' intermediate representations but does not reach the prediction position, a gap that is absent for tables and holds across all conditions tested. Attention analysis further reveals that this disconnect takes two architecturally distinct forms across model families. These findings reframe the table-chart gap as a failure of how encoded visual information is routed at prediction time, rather than a failure of encoding itself.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note · private letter to a colleague

The paper uses layer-wise probes to argue the table-chart gap is a routing failure after encoding, but without causal interventions that interpretation stays suggestive rather than locked down.

The main point is that the authors find chart information reaches intermediate layers in three open VLMs but does not appear at the prediction position, while table information does. They back this with consistent layer-wise linear probing results and attention patterns that differ by model family.

The work is useful because it moves past just reporting the performance gap and tries to localize where the problem occurs. Testing the same underlying data in table and chart form across multiple conditions and models gives the pattern some weight. The attention analysis is a reasonable addition for showing architecturally distinct disconnects.

The soft spot is exactly the one in the stress-test note. Linear probes measure what is linearly decodable at a given layer, not whether the model’s own pathways carry that information forward to the output. Attention weights are likewise incomplete as a map of residual flow. Without representation edits, path ablations, or other interventions, the data are compatible with a true routing failure but also with information that is present yet ignored by the model’s computation. The paper does not appear to include those steps, so the reframing rests on correlational evidence.
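
The distinction between "linearly decodable" and "used by the model" can be shown with a deliberately artificial toy. Everything in it is an assumption made for illustration: a two-dimensional hidden state, a fixed readout that ignores the label-carrying dimension, and a hand-written probe. It is not a model of any VLM in the paper; it only shows why an intervention adds evidence that a probe cannot.

```python
import random

def model_output(hidden, readout=(0.0, 1.0)):
    """Toy model head: a fixed linear readout that ignores dimension 0."""
    return sum(w * x for w, x in zip(readout, hidden))

def probe_prediction(hidden):
    """Toy linear probe that reads dimension 0, where the label is stored."""
    return 1 if hidden[0] > 0 else 0

def ablate(hidden, dim):
    """Intervention: zero out one dimension of the hidden state."""
    return [0.0 if i == dim else x for i, x in enumerate(hidden)]

rng = random.Random(0)
labels = [i % 2 for i in range(200)]
# Dimension 0 encodes the label; dimension 1 is label-independent noise.
states = [[1.0 if y == 1 else -1.0, rng.gauss(0, 1)] for y in labels]

# The probe recovers the label from the hidden state.
probe_acc = sum(probe_prediction(h) == y
                for h, y in zip(states, labels)) / len(labels)

# The model's own readout never touches that information.
model_acc = sum((model_output(h) > 0) == (y == 1)
                for h, y in zip(states, labels)) / len(labels)

# Ablating the label-carrying dimension leaves every output unchanged,
# which is the causal evidence that the probe alone cannot supply.
unchanged = all(model_output(ablate(h, 0)) == model_output(h) for h in states)

print(probe_acc, model_acc, unchanged)
```

Here the probe succeeds, the model stays near chance, and the ablation shows the decodable feature has no effect on the output. In a real model the corresponding check would be an ablation or patching experiment of the kind the letter says the paper lacks.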

This is for readers working on multimodal interpretability or on VLMs for scientific tasks. Someone debugging why charts underperform tables would get concrete layer-wise numbers to think about. It is not a foundational result but adds a mechanistic hypothesis worth checking.

I would bring it to a reading group to discuss the probe setup and what would count as stronger evidence for routing. I would not cite it in my own work at this stage. It still deserves peer review because the question is well-posed and the methods are standard enough that referees can evaluate the strength of the claim directly.

Referee Report

2 major / 2 minor

Summary. The manuscript examines the table-chart performance gap in multimodal LLMs for scientific claim verification. Using layer-wise linear probing and attention analysis across three open-weight VLMs, it reports that chart data is encoded in intermediate-layer representations (high probe accuracy) but fails to reach the final prediction position, a disconnect absent for tables; attention patterns indicate two architecturally distinct routing failures. The central claim reframes the gap as a routing rather than encoding problem.

Significance. If the empirical patterns hold under causal scrutiny, the work supplies a mechanistic account of a documented multimodal failure mode and identifies a concrete target (routing at prediction time) for architectural or training interventions. The consistency of results across models and conditions is a strength; the absence of parameter fitting or self-referential definitions keeps the analysis non-circular.

major comments (2)

[§4] §4 (Probing Results): The reframing from 'encoding failure' to 'routing failure' rests on the inference that high intermediate-layer probe accuracy demonstrates information available for downstream use while low accuracy at the prediction token demonstrates a routing failure. Linear probes recover linearly separable information that the model's non-linear pathways may never employ; without representation editing, attention ablation, or path-specific knockouts, the observed gap is compatible with both interpretations. This assumption is load-bearing for the central claim.
[Attention Analysis] Attention Analysis subsection: The claim that attention patterns reveal 'architecturally distinct forms' of the disconnect across model families requires explicit quantification of how attention weights track residual-stream flow to the prediction token; current attention analysis does not exhaustively rule out alternative flow paths.

minor comments (2)

[Methods] Clarify the exact layer indices used for 'intermediate' vs. 'prediction position' probes and report the number of data points per condition to allow replication.
[Figure 3] Figure 3 caption should state whether error bars reflect standard deviation across seeds or across data splits.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive comments on our manuscript. We address each major comment below, indicating where revisions will be made to strengthen the presentation.

Referee: [§4] §4 (Probing Results): The reframing from 'encoding failure' to 'routing failure' rests on the inference that high intermediate-layer probe accuracy demonstrates information available for downstream use while low accuracy at the prediction token demonstrates a routing failure. Linear probes recover linearly separable information that the model's non-linear pathways may never employ; without representation editing, attention ablation, or path-specific knockouts, the observed gap is compatible with both interpretations. This assumption is load-bearing for the central claim.

Authors: We agree that linear probes demonstrate the presence of linearly extractable information but do not prove that the model employs this information via its non-linear pathways. Our central claim is grounded in the consistent contrast between tables (where probe accuracy remains high at the prediction token) and charts (where it drops), observed across three models and multiple conditions. This differential pattern under identical methods supports interpreting the gap as routing-related rather than a general artifact of probing. We will revise §4 to explicitly acknowledge the correlational nature of the evidence, add caveats on interpretation, and note that causal interventions (e.g., representation editing) would provide stronger confirmation. revision: yes
Referee: [Attention Analysis] Attention Analysis subsection: The claim that attention patterns reveal 'architecturally distinct forms' of the disconnect across model families requires explicit quantification of how attention weights track residual-stream flow to the prediction token; current attention analysis does not exhaustively rule out alternative flow paths.

Authors: We will expand the Attention Analysis subsection with explicit quantitative metrics linking attention weights to residual-stream contributions at the prediction token. Additional analyses will be included to evaluate and address potential alternative flow paths (e.g., via other tokens or cross-layer mechanisms), thereby better supporting the distinction across model families. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical probing results are independent of inputs

The paper conducts an empirical interpretability study on VLMs using layer-wise linear probing and attention analysis to compare table vs. chart evidence. The central claim (information encoded in intermediate layers but not reaching the prediction position for charts) is derived directly from measured probe accuracies and attention weights across models and conditions. No equations, fitted parameters, or self-citations are used to define the result in terms of itself; the observations stand as external measurements rather than tautological redefinitions. This is a standard non-circular empirical analysis.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claim rests on two standard interpretability assumptions rather than new free parameters or invented entities.

axioms (2)

domain assumption: Linear probing at intermediate layers reveals information that is encoded and potentially usable by later layers.
Invoked by the layer-wise probing component of the method.
domain assumption: Attention weights indicate whether encoded information is routed to the final prediction position.
Invoked by the attention analysis component of the method.

pith-pipeline@v0.9.1-grok · 5729 in / 1290 out tokens · 26393 ms · 2026-06-28T15:10:05.802102+00:00 · methodology

Reference graph

Works this paper leans on

26 extracted references · 13 canonical work pages · 2 internal anchors

[1] Liu, Junteng and Zeng, Weihao and Zhang, Xiwen and Wang, Yijun and Shan, Zifei and He, Junxian. On the Perception Bottleneck of VLMs for Chart Understanding. Findings of the Association for Computational Linguistics: EMNLP 2025. 2025. doi:10.18653/v1/2025.findings-emnlp.573

[2] Format Matters: The Robustness of Multimodal LLMs in Reviewing Evidence from Tables and Charts. Proceedings of the AAAI Conference on Artificial Intelligence. 2026. doi:10.1609/aaai.v40i37.40361

[3] Ho, Xanh and Kumar, Sunisth and Wu, Yun-Ang and Boudin, Florian and Takasu, Atsuhiro and Aizawa, Akiko. Table-Text Alignment: Explaining Claim Verification Against Tables in Scientific Papers. Findings of the Association for Computational Linguistics: EMNLP 2025. 2025. doi:10.18653/v1/2025.findings-emnlp.135

[4] Ravichander, Abhilasha and Belinkov, Yonatan and Hovy, Eduard. Probing the Probing Paradigm: Does Probing Accuracy Entail Task Relevance? Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume. 2021. doi:10.18653/v1/2021.eacl-main.295

[5] Orgad, Hadas and Toker, Michael and Gekhman, Zorik and Reichart, Roi and Szpektor, Idan and Kotek, Hadas and Belinkov, Yonatan. LLMs Know More Than They Show: On the Intrinsic Representation of LLM Hallucinations.

[6] Zorik Gekhman and Eyal Ben-David and Hadas Orgad and Eran Ofek and Yonatan Belinkov and Idan Szpektor and Jonathan Herzig and Roi Reichart. Inside-Out: Hidden Factual Knowledge in … 2025.

[7] Responses Fall Short of Understanding: Revealing the Gap between Internal Representations and Responses in Visual Document Understanding. 2026.

[8] Wadden, David and Lin, Shanchuan and Lo, Kyle and Wang, Lucy Lu and van Zuylen, Madeleine and Cohan, Arman and Hajishirzi, Hannaneh. Fact or Fiction: Verifying Scientific Claims. Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP). 2020. doi:10.18653/v1/2020.emnlp-main.609

[9] SciClaimEval: Cross-modal Claim Verification in Scientific Papers. Proceedings of the Fifteenth Language Resources and Evaluation Conference (LREC 2026). 2026.

[10] Zhang, Zhi and Yadav, Srishti and Han, Fengze and Shutova, Ekaterina. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). 2025.

[11] Skean, Oscar and Arefin, Md Rifat and Zhao, Dan and Patel, Niket and Naghiyev, Jalal and LeCun, Yann and Shwartz-Ziv, Ravid. Proceedings of the 42nd International Conference on Machine Learning. 2025.

[12] Belinkov, Yonatan. Probing Classifiers: Promises, Shortcomings, and Advances. Computational Linguistics, 48(1):207–219. 2022. doi:10.1162/coli_a_00422

[13] Thorne, James and Vlachos, Andreas and Christodoulopoulos, Christos and Mittal, Arpit. FEVER: a Large-scale Dataset for Fact Extraction and VERification. Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers). 2018. doi:10.18653/v1/N18-1074

[14] Lu, Xinyuan and Pan, Liangming and Liu, Qian and Nakov, Preslav and Kan, Min-Yen. SCITAB: A Challenging Benchmark for Compositional Reasoning and Claim Verification on Scientific Tables. Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing. 2023. doi:10.18653/v1/2023.emnlp-main.483

[15] Lal, Yash Kumar and Bandham, Manikanta and Hasan, Mohammad Saqib and Kashi, Apoorva and Koupaee, Mahnaz and Balasubramanian, Niranjan. MuSciClaims: Multimodal Scientific Claim Verification. Proceedings of the 14th International Joint Conference on Natural Language Processing and the 4th Conference of the Asia-Pacific Chapter of the Association for Co... 2025. doi:10.18653/v1/2025.ijcnlp-long.175

[16] Wang, Chengye and Shen, Yifei and Kuang, Zexi and Cohan, Arman and Zhao, Yilun. SciVer: Evaluating Foundation Models for Multimodal Scientific Claim Verification. Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 2025. doi:10.18653/v1/2025.acl-long.420

[17] Masry, Ahmed and Long, Do Xuan and Tan, Jia Qing and Joty, Shafiq and Hoque, Enamul. ChartQA: A Benchmark for Question Answering about Charts with Visual and Logical Reasoning. Findings of the Association for Computational Linguistics: ACL 2022. 2022. doi:10.18653/v1/2022.findings-acl.177

[18] Wang, Zirui and Xia, Mengzhou and He, Luxi and Chen, Howard and Liu, Yitao and Zhu, Richard and Liang, Kaiqu and Wu, Xindi and Liu, Haotian and Malladi, Sadhika and Chevalier, Alexis and Arora, Sanjeev and Chen, Danqi. CharXiv: Charting Gaps in Realistic Chart Understanding in Multimodal LLMs. doi:10.52202/079017-3609

[19] Dong, Lianghan and Crisan, Anamaria. Probing the Visualization Literacy of Vision Language Models: The Good, the Bad, and the Ugly.

[20] Seeing but Not Believing: Probing the Disconnect Between Visual Attention and Answer Correctness in VLMs. 2025.

[21] Esmaeilkhani, Parsa and Latecki, Longin Jan. Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV). 2026.

[22] Qwen2.5-VL Technical Report. 2025.

[23] InternVL3: Exploring Advanced Training and Test-Time Recipes for Open-Source Multimodal Models. 2025.

[24] Note on the sampling error of the difference between correlated proportions or percentages. Psychometrika. 1947. doi:10.1007/BF02295996

[25] Pedregosa, F. and Varoquaux, G. and Gramfort, A. and Michel, V. and Thirion, B. and Grisel, O. and Blondel, M. and Prettenhofer, P. and Weiss, R. and Dubourg, V. and Vanderplas, J. and Passos, A. and Cournapeau, D. and Brucher, M. and Perrot, M. and Duchesnay, E. Scikit-learn: Machine Learning in …

[26] Thomas Wolf, et al. Transformers: State-of-the-Art Natural Language Processing. Proceedings of the 2020 Conference on EMNLP: System Demonstrations. 2020.
