ChartFI: Benchmarking Faithfulness and Insightfulness of Chart Descriptions from Multimodal Large Language Models

Chao Liu; Chunran Hu; Fen Wang; Lexu Xie; Qiman Kang; Siming Chen; Zekai Shao; Zhixuan Zhang

arxiv: 2605.23694 · v1 · pith:INKBZ3E5new · submitted 2026-05-22 · 💻 cs.CL

Fen Wang, Zekai Shao, Qiman Kang, Chunran Hu, Zhixuan Zhang, Lexu Xie, Chao Liu, Siming Chen

Pith reviewed 2026-05-25 04:11 UTC · model grok-4.3

classification 💻 cs.CL

keywords: chart descriptions, multimodal large language models, benchmark, faithfulness evaluation, insightfulness, evaluation metrics, MLLM assessment

The pith

ChartFI-Bench introduces four dimensions and aligned metrics to evaluate how faithfully and insightfully MLLMs describe charts.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper constructs ChartFI-Bench to overcome limitations in prior datasets and metrics for assessing chart descriptions produced by multimodal large language models. It identifies four dimensions of quality—factual accuracy, salient feature emphasis, domain-informed guidance, and chart-text complementarity—and builds a set of 896 pairs featuring complex charts paired with rich descriptions. Four corresponding metrics then measure performance along these lines. Experiments apply the benchmark to mainstream MLLMs, confirming the framework works while exposing recurring shortcomings in the models. Readers should care because chart descriptions support accessibility and insight extraction, yet current evaluation methods do not reliably track those capabilities.

Core claim

Existing benchmarks rely on simple charts and shallow, fact-listing descriptions, so they cannot adequately test MLLM output. ChartFI-Bench provides 896 visually complex chart-description pairs built around the four quality dimensions, together with four aligned metrics (Faithfulness, Coverage, Informativeness, Acuity). Experiments with the benchmark show that the framework detects common model weaknesses.

What carries the argument

ChartFI-Bench dataset of 896 pairs together with the four metrics (Faithfulness, Coverage, Informativeness, Acuity) that operationalize the four quality dimensions.

If this is right

Existing MLLMs exhibit measurable shortfalls in factual accuracy, feature emphasis, domain guidance, and text-chart alignment when describing charts.
The four metrics enable systematic comparison of future models against the benchmark.
Improved chart descriptions can directly aid accessibility tools and cross-modal retrieval systems.
The benchmark construction process itself supplies a template for creating richer evaluation sets in related multimodal tasks.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The metrics could be used as reward signals during fine-tuning to push models toward higher-quality descriptions.
The same dimension-based approach might transfer to evaluating descriptions of other data visualizations such as graphs or diagrams.
Deployment of the benchmark in accessibility pipelines would let developers track progress on real-world usefulness rather than proxy tasks.

Load-bearing premise

The four dimensions fully characterize what makes a chart description high-quality.

What would settle it

Human experts rating the same descriptions on the same charts produce rankings that systematically disagree with scores from the four proposed metrics.
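One way to run such a check is a rank correlation between human ratings and a metric's scores for the same descriptions. The sketch below is illustrative only: the function names, the example ratings, and the example metric scores are hypothetical and do not come from the paper. It computes Spearman's rho using only the standard library.

```python
from statistics import mean


def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def spearman(xs, ys):
    """Spearman's rho: Pearson correlation computed on the ranks.

    Raises ZeroDivisionError if either input is constant.
    """
    rx, ry = average_ranks(xs), average_ranks(ys)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5


# Hypothetical data: expert ratings and one metric's scores
# for the same five descriptions.
human_ratings = [4.5, 3.0, 2.0, 5.0, 1.5]
metric_scores = [0.82, 0.60, 0.55, 0.91, 0.30]
rho = spearman(human_ratings, metric_scores)
```

A rho near 1 would mean the metric orders descriptions the way the experts do; values near 0 or below would be the systematic disagreement described above.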

Figures

Figures reproduced from arXiv: 2605.23694 by Chao Liu, Chunran Hu, Fen Wang, Lexu Xie, Qiman Kang, Siming Chen, Zekai Shao, Zhixuan Zhang.

**Figure 1.** Overview of the benchmark construction pipeline, consisting of three stages: dataset collection from academic papers, chart filtering with … [PITH_FULL_IMAGE:figures/full_fig_p004_1.png]

**Figure 2.** Statistics of the ChartFI-Bench: the left shows the number of … [PITH_FULL_IMAGE:figures/full_fig_p004_2.png]

**Figure 3.** Performance comparison and error analysis across methods on … [PITH_FULL_IMAGE:figures/full_fig_p005_3.png]

Original abstract

Chart descriptions are essential for accessibility, cross-modal retrieval, and assisting readers in extracting insights from complex visualizations. As multimodal large language models (MLLMs) are increasingly adopted for automated chart description generation, a critical question arises: how faithfully and insightfully do these models actually describe charts? Current benchmarks fall short on two fronts: existing datasets consist of simple, homogeneous charts paired with shallow, fact-enumerating descriptions; and prevailing metrics fail to capture the multi-faceted nature of description quality. To address these gaps, we present the Chart Faithfulness and Insightfulness Benchmark (ChartFI-Bench). We first summarize four dimensions that characterize high-quality chart descriptions: factual accuracy, salient feature emphasis, domain-informed guidance, and chart-text complementarity. Guided by these dimensions, we construct a high-quality benchmark comprising 896 chart-description pairs, which feature visually complex charts and semantically rich descriptions. Furthermore, we design four aligned evaluation metrics -- Faithfulness, Coverage, Informativeness, and Acuity -- to systematically assess the quality of descriptions across these dimensions. Experiments conducted on mainstream MLLMs demonstrate the effectiveness of the proposed framework and reveal common weaknesses among existing models.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 0 minor

Summary. The paper introduces ChartFI-Bench, a benchmark of 896 chart-description pairs featuring visually complex charts and semantically rich descriptions. It summarizes four dimensions claimed to characterize high-quality chart descriptions (factual accuracy, salient feature emphasis, domain-informed guidance, chart-text complementarity) that guide benchmark construction and the design of four aligned metrics (Faithfulness, Coverage, Informativeness, Acuity). Experiments on mainstream MLLMs are reported to demonstrate the framework's effectiveness and reveal common model weaknesses.

Significance. If the dimensions are justified and the metrics validated against human judgments, the benchmark could improve evaluation of MLLM chart descriptions beyond existing simple datasets and shallow metrics, supporting accessibility and insight extraction tasks. The scale and complexity of the pairs represent a concrete advance, but significance is reduced by the absence of validation for the guiding dimensions.

major comments (1)

[Abstract] The four dimensions are asserted to characterize high-quality chart descriptions and explicitly guide both benchmark construction and the four metrics, yet no derivation from prior visualization or accessibility literature, coverage argument, or validation (e.g., correlation with overall human quality ratings or inter-rater agreement) is supplied. This is load-bearing for the central claim that experiments demonstrate framework effectiveness.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the careful review and constructive comment. We address the concern regarding justification of the four dimensions below and commit to revisions that strengthen the manuscript.

Referee: [Abstract] The four dimensions are asserted to characterize high-quality chart descriptions and explicitly guide both benchmark construction and the four metrics, yet no derivation from prior visualization or accessibility literature, coverage argument, or validation (e.g., correlation with overall human quality ratings or inter-rater agreement) is supplied. This is load-bearing for the central claim that experiments demonstrate framework effectiveness.

Authors: We agree that the current presentation does not sufficiently derive or validate the four dimensions (factual accuracy, salient feature emphasis, domain-informed guidance, and chart-text complementarity). In the revised version we will add a new subsection (likely in Section 2 or 3) that explicitly grounds each dimension in prior visualization literature (e.g., Bertin’s visual variables, Tufte’s data-ink ratio and graphical excellence, and Cleveland & McGill’s perceptual rankings) as well as accessibility guidelines (W3C WCAG and chart-specific recommendations from the visualization accessibility community). We will also include a brief coverage argument showing how these dimensions collectively address gaps in existing chart-description evaluation. To address validation, we will conduct and report a small human study (n=30–50 raters) measuring correlation between the four dimension scores and overall quality ratings, plus inter-rater agreement (Cohen’s/Fleiss’ kappa). These additions will directly support the claim that the framework and experiments are effective.

Revision: yes
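For readers unfamiliar with the agreement statistic named in the response, Cohen's kappa for two raters can be sketched as follows. This is a minimal illustration, not the authors' study code: the function name, the category labels, and the example ratings are hypothetical. Fleiss' kappa, the multi-rater variant, is not shown.

```python
from collections import Counter


def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa: chance-corrected agreement between two raters.

    kappa = (p_o - p_e) / (1 - p_e), where p_o is the observed agreement
    and p_e is the agreement expected from each rater's label frequencies.
    """
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("both raters must label the same non-empty item set")
    n = len(rater_a)
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    count_a, count_b = Counter(rater_a), Counter(rater_b)
    expected = sum(count_a[c] * count_b[c] for c in count_a) / (n * n)
    if expected == 1.0:
        # Both raters used one identical label throughout; kappa is undefined.
        raise ValueError("kappa is undefined when expected agreement is 1")
    return (observed - expected) / (1 - expected)


# Hypothetical labels from two raters on four descriptions.
rater_a = ["good", "good", "bad", "bad"]
rater_b = ["good", "bad", "bad", "bad"]
kappa = cohens_kappa(rater_a, rater_b)
```

Here the raters agree on three of four items (p_o = 0.75) while their label frequencies alone predict p_e = 0.5, so kappa is 0.5.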

Circularity Check

0 steps flagged

No significant circularity; benchmark and metrics constructed independently

The paper summarizes four dimensions and uses them to guide benchmark construction and metric design, but this does not constitute circularity under the defined patterns. There are no equations, fitted parameters renamed as predictions, self-citations that are load-bearing for the central claim, uniqueness theorems imported from prior author work, or ansatzes smuggled via citation. The derivation chain consists of an asserted starting point (the dimensions) followed by independent construction of the dataset and metrics; no step reduces a result to its own inputs by construction. This is a standard benchmark paper whose claims rest on the new artifacts rather than self-referential reduction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

Only the abstract is available; the central claim rests on the domain assumption that the four listed dimensions fully capture description quality and that the constructed pairs are representative.

axioms (1)

domain assumption: The four dimensions (factual accuracy, salient feature emphasis, domain-informed guidance, chart-text complementarity) characterize high-quality chart descriptions.
The abstract states that these dimensions guided benchmark construction.

[1] [1]

https://www.statista.com/, 2026

Statista. https://www.statista.com/, 2026. Accessed: 2026-3-31. 2

work page 2026

[2] [2]

Ayres and J

P. Ayres and J. Sweller. The split-attention principle in multimedia learning. The Cambridge handbook of multimedia learning, 2:135–146, 2005. 3

work page 2005

[3] [3]

Y . Bai, Y . Ding, S. Lin, and W. Fan. Beyond description: A multi- modal agent framework for insightful chart summarization.arXiv preprint arXiv:2602.18731, 2026. 3

work page arXiv 2026

[4] [4]

Banerjee and A

S. Banerjee and A. Lavie. Meteor: An automatic metric for mt evaluation with improved correlation with human judgments. InProceedings of the acl workshop on intrinsic and extrinsic evaluation measures for machine translation and/or summarization, pp. 65–72, 2005. 1, 7

work page 2005

[5] [5]

Battle and A

L. Battle and A. Ottley. What do we mean when we say “insight”? a formal synthesis of existing theory.IEEE Transactions on Visualization and Computer Graphics, 30(9):6075–6088, 2023. 3

work page 2023

[6] [6]

H. P. Chan, Q. Zeng, and H. Ji. Interpretable automatic fine-grained incon- sistency detection in text summarization. InFindings of the Association for Computational Linguistics: ACL 2023, pp. 6433–6444, 2023. 3

work page 2023

[7] [7]

C. Chen, R. Zhang, E. Koh, S. Kim, S. Cohen, T. Yu et al. Figure captioning with reasoning and sequence-level training.arXiv preprint arXiv:1906.02850, 2019. 2

work page internal anchor Pith review Pith/arXiv arXiv 1906

[8] [8]

N. Chen, Y . Zhang, J. Xu, K. Ren, and Y . Yang. Viseval: A benchmark for data visualization in the era of large language models.IEEE Transactions on Visualization and Computer Graphics, 31(1):1301–1311, 2024. 1, 7, 8

work page 2024

[9] [9]

C. B. Clement, M. Bierbaum, K. P. O’Keeffe, and A. A. Alemi. On the use of arxiv as a dataset.arXiv preprint arXiv:1905.00075, 2019. 3

work page internal anchor Pith review Pith/arXiv arXiv 1905

[10] [10]

A. K. Das, M. Tarun, and K. Mueller. Charts-of-thought: Enhancing llm vi- sualization literacy through structured data extraction.IEEE Transactions on Visualization and Computer Graphics, 2025. 5

work page 2025

[11] [11]

H. Dong, J. Li, B. Wu, J. Wang, Y . Zhang, and H. Guo. Benchmarking and improving detail image caption.arXiv preprint arXiv:2405.19092,

work page arXiv

[12] [12]

Ellemose and N

J. Ellemose and N. Elmqvist. Eye of the beholder: Towards measuring vi- sualization complexity.IEEE Transactions on Visualization and Computer Graphics, 2025. 3

work page 2025

[13] [13]

X. Fu, Y . Wang, H. Dong, W. Cui, and H. Zhang. Visualization assessment: A machine learning approach. In2019 IEEE Visualization Conference (VIS), pp. 126–130. IEEE, 2019. 8

work page 2019

[14] [14]

Gemini 3 pro

Google. Gemini 3 pro. https://chatgpt.com/, 2026. Accessed: 2026- 3-31. 2, 4, 5, 7

work page 2026

[15] [15]

Y . Han, C. Zhang, X. Chen, X. Yang, Z. Wang, G. Yu et al. Chartllama: A multimodal llm for chart understanding and generation.arXiv preprint arXiv:2311.16483, 2023. 2

work page arXiv 2023

[16] [16]

Hessel, A

J. Hessel, A. Holtzman, M. Forbes, R. Le Bras, and Y . Choi. Clipscore: A reference-free evaluation metric for image captioning. InProceedings of the 2021 conference on empirical methods in natural language processing, pp. 7514–7528, 2021. 3

work page 2021

[17] [17]

Hoque and M

E. Hoque and M. S. Islam. Natural language generation for visualizations: State of the art, challenges and future directions. InComputer Graphics Forum, vol. 44, p. e15266. Wiley Online Library, 2025. 7

work page 2025

[18] [19]

T.-Y . Hsu, C. L. Giles, and T.-H. Huang. Scicap: Generating captions for scientific figures. InFindings of the Association for Computational Linguistics: EMNLP 2021, pp. 3258–3264, 2021. 2

work page 2021

[19] [20]

Hsu, C.-Y

T.-Y . Hsu, C.-Y . Huang, R. Rossi, S. Kim, C. Giles, and T.-H. Huang. Gpt-4 as an effective zero-shot evaluator for scientific figure captions. In Findings of the Association for Computational Linguistics: EMNLP 2023, pp. 5464–5474, 2023. 9

work page 2023

[20] [21]

Huang, H

K.-H. Huang, H. P. Chan, M. Fung, H. Qiu, M. Zhou, S. Joty et al. From pixels to insights: A survey on automatic chart understanding in the era of large foundation models.IEEE Transactions on Knowledge and Data Engineering, 37(5):2550–2568, 2024. 3

work page 2024

[21] [22]

Huang, H

K.-H. Huang, H. P. Chan, and H. Ji. Zero-shot faithful factual error correction. InProceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 5660–5676,

work page

[22] [23]

Huang, M

K.-H. Huang, M. Zhou, H. P. Chan, Y . Fung, Z. Wang, L. Zhang et al. Do lvlms understand charts? analyzing and correcting factual errors in chart captioning. InFindings of the Association for Computational Linguistics: ACL 2024, pp. 730–749, 2024. 9

work page 2024

[23] [24]

Kantharaj, R

S. Kantharaj, R. T. Leong, X. Lin, A. Masry, M. Thakkar, E. Hoque et al. Chart-to-text: A large-scale benchmark for chart summarization. pp. 4005–4023. Association for Computational Linguistics, 2022. doi: 10. 18653/v1/2022.acl-long.277 2

work page 2022

[24] [25]

D. H. Kim, V . Setlur, and M. Agrawala. Towards understanding how readers integrate charts and captions: A case study with line charts. In Proceedings of the 2021 CHI Conference on Human Factors in Computing Systems, pp. 1–11, 2021. 3, 7

work page 2021

[25] [26]

K. Kim, S. Lee, K.-H. Huang, H. P. Chan, M. Li, and H. Ji. Can llms produce faithful explanations for fact-checking? towards faithful explainable fact-checking via multi-agent debate. arXiv preprint arXiv:2402.07401,

[26] [27]

R. Kinney, C. Anastasiades, R. Authur, I. Beltagy, J. Bragg, A. Buraczynski et al. The semantic scholar open data platform. arXiv preprint arXiv:2301.10140, 2023. 3

[27] [28]

H.-K. Ko, H. Jeon, G. Park, D. H. Kim, N. W. Kim, J. Kim et al. Natural language dataset generation framework for visualizations powered by large language models. In Proceedings of the 2024 CHI Conference on Human Factors in Computing Systems, pp. 1–22, 2024. 2

[28] [29]

S. Krichene, F. Piccinno, F. Liu, and J. Eisenschlos. Faithful chart summarization with chats-pi. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 8705–8723, 2024. 3

[29] [30]

S. Latif, Z. Zhou, Y. Kim, F. Beck, and N. W. Kim. Kori: Interactive synthesis of text and charts in data documents. IEEE Transactions on Visualization and Computer Graphics, 28(1):184–194, 2021. 1, 3

[30] [32]

L. Li, Y. Wang, R. Xu, P. Wang, X. Feng, L. Kong et al. Multimodal arxiv: A dataset for improving scientific comprehension of large vision-language models. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 14369–14387,

[31] [33]

J. Lim, J. Ahn, and G. Kim. Chartcap: Mitigating hallucination of dense chart captioning. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 13171–13182, 2025. 1, 2, 3

[32] [34]

C.-Y. Lin. Rouge: A package for automatic evaluation of summaries. In Text summarization branches out, pp. 74–81, 2004. 7

[33] [35]

C. Liu, Y. Guo, and X. Yuan. Autotitle: An interactive title generator for visualizations. IEEE Transactions on Visualization and Computer Graphics, 30(8):5276–5288, 2023. 3

[34] [36]

F. Liu, X. Wang, W. Yao, J. Chen, K. Song, S. Cho et al. Mmc: Advancing multimodal chart understanding with large-scale instruction tuning. In Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), pp. 1287–1310, 2024. 2

[35] [37]

M. Liu, D. Chen, Y. Li, G. Fang, and Y. Shen. Chartthinker: A contextual chain-of-thought approach to optimized chart summarization. arXiv preprint arXiv:2403.11236, 2024. 2

[36] [38]

Z. Liu, C.-W. Xie, B. Wen, F. Yu, J. Chen, P. Li et al. Capability: A comprehensive visual caption benchmark for evaluating both correctness and thoroughness. arXiv preprint arXiv:2502.14914, 2025. 1, 3

[37] [39]

Y. Lu, L. Zhong, J. Yang, W. Li, P. Wei, Y. Wang et al. Domaincqa: Crafting knowledge-intensive qa from domain-specific charts. In Proceedings of the AAAI Conference on Artificial Intelligence, vol. 40, pp. 32347–32355, 2026. 7

[38] [40]

A. Lundgard and A. Satyanarayan. Accessible visualization via natural language descriptions: A four-level model of semantic content. IEEE Transactions on Visualization and Computer Graphics, 28(1):1073–1083,

[39] [41]

A. Mahinpei, Z. Kostic, and C. Tanner. Linecap: Line charts for data visualization captioning models. In 2022 IEEE Visualization and Visual Analytics (VIS), pp. 35–39. IEEE, 2022. 2

[40] [42]

F. Meng, W. Shao, Q. Lu, P. Gao, K. Zhang, Y. Qiao et al. Chartassisstant: A universal chart multimodal language model via chart-to-table pre-training and multitask instruction tuning. arXiv preprint arXiv:2401.02384,

[41] [43]

J. Niu, Z. Liu, Z. Gu, B. Wang, L. Ouyang, Z. Zhao et al. Mineru2.5: A decoupled vision-language model for efficient high-resolution document parsing, 2025. 3


[42] [44]

J. Obeid and E. Hoque. Chart-to-text: Generating natural language descriptions for charts by adapting the transformer model. In Proceedings of the 13th International Conference on Natural Language Generation, pp. 138–147, 2020. 1, 2, 3

[43] [45]

J. Obeid and E. Hoque. Chart-to-text: Generating natural language descriptions for charts by adapting the transformer model. arXiv preprint arXiv:2010.09142, 2020. 2

[44] [46]

Google. Gemini 3 pro. https://gemini.google.com/, 2026. Accessed: 2026-3-31. 2, 4, 7

[45] [47]

K. Papineni, S. Roukos, T. Ward, and W.-J. Zhu. Bleu: a method for automatic evaluation of machine translation. In Proceedings of the 40th annual meeting of the Association for Computational Linguistics, pp. 311–318, 2002. 1, 3, 7

[46] [48]

Pew Research Center. Pew research center. https://www.pewresearch.org/, 2026. Accessed: 2026-3-31. 2

[47] [49]

Qwen Team. Qwen3.5: Towards native multimodal agents, February 2026. 2, 5, 6, 7

[48] [50]

R. Rahman, R. Hasan, A. Al Farhad, M. T. R. Laskar, M. H. Ashmafee, and A. R. M. Kamal. Chartsumm: A comprehensive benchmark for automatic chart summarization of long and short summaries. In Canadian AI, 2023. 2

[49] [51]

M. A. Borkin, Z. Bylinskii, N. W. Kim, C. M. Bainbridge, C. S. Yeh, D. Borkin et al. Beyond memorability: Visualization recognition and recall. IEEE Transactions on Visualization and Computer Graphics, 22(1), 2016. 8

[50] [52]

T. Sellam, D. Das, and A. Parikh. Bleurt: Learning robust metrics for text generation. In Proceedings of the 58th annual meeting of the association for computational linguistics, pp. 7881–7892, 2020. 7

[51] [53]

Z. Shao, L. Shen, H. Li, Y. Shan, H. Qu, Y. Wang et al. Narrative player: Reviving data narratives with visuals. IEEE Transactions on Visualization and Computer Graphics, 31(10):6781–6795, 2025. 5

[52] [54]

L. Shen, E. Shen, Z. Tai, Y. Xu, J. Dong, and J. Wang. Visual data analysis with task-based recommendations. Data Science and Engineering, 7(4):354–369, 2022. 5

[53] [55]

Y. Shi, C. Zheng, Z. Yang, K. Xu, and N. Cao. Vistoryteller: Designing data stories with llm agent-based generation and interactive user control. In Proceedings of the 31st International Conference on Intelligent User Interfaces, pp. 1141–1156, 2026. 1, 2

[54] [56]

C. Stokes, V. Setlur, B. Cogley, A. Satyanarayan, and M. A. Hearst. Striking a balance: Reader takeaways and preferences when integrating text and charts. IEEE Transactions on Visualization and Computer Graphics, 29(1):1233–1243, 2022. 3

[55] [57]

N. Sultanum and A. Srinivasan. Datatales: Investigating the use of large language models for authoring data-driven articles. In 2023 IEEE Visualization and Visual Analytics (VIS), pp. 231–235. IEEE, 2023. 2

[56] [58]

J. Sweller. Implications of cognitive load theory for multimedia learning. The Cambridge handbook of multimedia learning, 3(2):19–30, 2005. 3


[57] [59]

B. Tang, A. Boggust, and A. Satyanarayan. Vistext: A benchmark for semantically rich chart captioning. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 7268–7298, 2023. 1, 2, 3

[58] [60]

R. Vedantam, C. Lawrence Zitnick, and D. Parikh. Cider: Consensus-based image description evaluation. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 4566–4575, 2015. 3

[59] [61]

A. Z. Wang, G. J. Quadri, M. Zhu, C. Tseng, and D. A. Szafir. Characterizing visualization perception with psychological phenomena: Uncovering the role of subitizing in data visualization. IEEE Transactions on Visualization and Computer Graphics, 2025. 3

[60] [62]

F. Wang, B. Wang, X. Shu, Z. Liu, Z. Shao, C. Liu et al. Chartinsighter: An approach for mitigating hallucination in time-series chart summary generation with a benchmark dataset. IEEE Transactions on Visualization and Computer Graphics, 2025. 1, 2, 3

[61] [63]

W. Wang, Z. Gao, L. Gu, H. Pu, L. Cui, X. Wei et al. Internvl3.5: Advancing open-source multimodal models in versatility, reasoning, and efficiency. arXiv preprint arXiv:2508.18265, 2025. 2, 5, 7

[62] [64]

Y. Wang, Z. Sun, H. Zhang, W. Cui, K. Xu, X. Ma et al. Datashot: Automatic generation of fact sheets from tabular data. IEEE Transactions on Visualization and Computer Graphics, 26(1):895–905, 2019. 5

[63] [65]

Z. Wang, M. Xia, L. He, H. Chen, Y. Liu, R. Zhu et al. Charxiv: Charting gaps in realistic chart understanding in multimodal llms. Advances in Neural Information Processing Systems, 37:113569–113697, 2024. 7

[64] [66]

S. Wiseman, S. M. Shieber, and A. M. Rush. Challenges in data-to-document generation. In Proceedings of the 2017 conference on empirical methods in natural language processing, pp. 2253–2263, 2017. 3

[65] [67]

R. Xia, H. Ye, X. Yan, Q. Liu, H. Zhou, Z. Chen et al. Chartx & chartvlm: A versatile benchmark and foundation model for complicated chart reasoning. IEEE Transactions on Image Processing, 2025. 1

[66] [68]

R. Ye. Chartdiff: A large-scale benchmark for comprehending pairs of charts.arXiv preprint arXiv:2603.28902, 2026. 2


[67] [69]

T. Zhang, V. Kishore, F. Wu, K. Q. Weinberger, and Y. Artzi. Bertscore: Evaluating text generation with bert. arXiv preprint arXiv:1904.09675,