ChartFI: Benchmarking Faithfulness and Insightfulness of Chart Descriptions from Multimodal Large Language Models

Chao Liu; Chunran Hu; Fen Wang; Lexu Xie; Qiman Kang; Siming Chen; Zekai Shao; Zhixuan Zhang

ChartFI-Bench evaluates multimodal models on chart descriptions using four quality dimensions and aligned metrics.

Reviewed by Pith at T0; open to challenge. T0 means a machine referee read the full paper against a public rubric.

T0 review · grok-4.3

2026-06-30 16:07 UTC pith:INKBZ3E5

Load-bearing objection: ChartFI-Bench supplies a new 896-pair dataset and four aligned metrics for MLLM chart descriptions, but the metrics are built directly from the authors' chosen dimensions with no external validation shown.

arXiv:2605.23694 v2 · pith:INKBZ3E5 · submitted 2026-05-22 · cs.CL

Fen Wang, Zekai Shao, Qiman Kang, Chunran Hu, Zhixuan Zhang, Lexu Xie, Chao Liu, Siming Chen

classification cs.CL

keywords: chart descriptions, multimodal large language models, benchmark, faithfulness, insightfulness, evaluation metrics, visualizations, accessibility

verification ladder: T0 review · T1 audit · T2 compute · T3 formal · T4 reserved

The pith

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper constructs ChartFI-Bench with 896 chart-description pairs that pair visually complex charts with semantically rich descriptions. It defines four dimensions of description quality and introduces four corresponding metrics to measure performance. Experiments apply this framework to mainstream multimodal large language models. The results indicate that current models exhibit common shortcomings in producing descriptions that meet the defined standards. Readers would care because chart descriptions support accessibility and help people extract meaning from data visualizations.

Core claim

The central claim is that existing benchmarks rely on simple charts and shallow descriptions while current metrics miss the multi-faceted nature of quality, so a new benchmark built around four dimensions—factual accuracy, salient feature emphasis, domain-informed guidance, and chart-text complementarity—plus four aligned metrics enables systematic assessment that reveals limitations in how multimodal large language models generate chart descriptions.

What carries the argument

The four dimensions of high-quality chart descriptions together with the four aligned metrics (Faithfulness, Coverage, Informativeness, Acuity) that assess them across those dimensions.

Load-bearing premise

The four dimensions and four metrics are sufficient and appropriate to characterize high-quality chart descriptions.

What would settle it

A new set of human raters scoring the same model outputs on overall usefulness finds no correlation with the four proposed metrics.
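A minimal sketch of how such a check might be run, under stated assumptions: compute a rank correlation between each metric and human usefulness ratings collected on the same model outputs. The scores and ratings below are invented placeholders, the names `average_ranks` and `spearman` are illustrative, and Spearman correlation is a choice made for this sketch rather than a procedure the paper or this review specifies.

```python
from statistics import mean


def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with the one at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1  # mean of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def spearman(xs, ys):
    """Spearman rank correlation: Pearson correlation of the average ranks."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length sequences with at least 2 items")
    rx, ry = average_ranks(xs), average_ranks(ys)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined when one input is constant")
    return cov / (var_x * var_y) ** 0.5


# Hypothetical per-output scores for six model descriptions (not from the paper).
metric_scores = {
    "Faithfulness": [0.91, 0.45, 0.78, 0.60, 0.33, 0.70],
    "Coverage": [0.80, 0.52, 0.66, 0.71, 0.40, 0.58],
    "Informativeness": [0.55, 0.61, 0.49, 0.58, 0.52, 0.60],
    "Acuity": [0.72, 0.30, 0.65, 0.50, 0.35, 0.44],
}
# Hypothetical human ratings of overall usefulness on a 1-5 scale.
human_usefulness = [5, 2, 4, 3, 1, 4]

for name, scores in metric_scores.items():
    print(f"{name}: rho = {spearman(scores, human_usefulness):+.2f}")
```

Correlations near zero for all four metrics would match the falsifier described above; a real study would also need enough rated outputs for the estimate to be stable, plus a significance test.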

If this is right

Models can be ranked and compared systematically on their ability to produce descriptions that are factually accurate, emphasize salient features, incorporate domain guidance, and complement the chart.
Development of future multimodal models can target the specific weaknesses identified in the experiments.
Automated description systems can be trained or fine-tuned to improve scores on the four metrics.
The benchmark supports evaluation for applications such as accessibility tools and cross-modal retrieval.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same dimensions could be adapted to evaluate text descriptions of other visual data such as scientific diagrams or maps.
Human users might show measurable gains in data interpretation speed or accuracy when given descriptions that score high on the new metrics.
The benchmark dataset could serve as a training resource to improve model performance on complex visualizations.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit.

Desk Editor's Note

ChartFI-Bench supplies a new 896-pair dataset and four aligned metrics for MLLM chart descriptions, but the metrics are built directly from the authors' chosen dimensions with no external validation shown.

The main takeaway is that this paper puts together ChartFI-Bench with 896 chart-description pairs that use more complex charts and richer text than earlier collections, plus four metrics (Faithfulness, Coverage, Informativeness, Acuity) matched to four quality dimensions.

The work does a clear job naming the limits of existing benchmarks—mostly simple charts paired with flat fact lists—and standard metrics that ignore insight or how text and chart work together. Defining the four dimensions first (factual accuracy, salient feature emphasis, domain-informed guidance, chart-text complementarity) and then building both the data and the metrics around them gives the effort a consistent structure.

The experiments on mainstream MLLMs are said to show the framework works and to surface common model weaknesses, which could matter for accessibility tools that already rely on these descriptions.

The soft spot is that the metrics are constructed to line up with the authors' taxonomy and the abstract supplies no numbers on inter-annotator agreement, human correlation, or head-to-head comparison with prior metrics. Without those anchors the scores risk measuring the chosen categories rather than independent quality.

This is for people who build or test MLLMs for visualization accessibility and data insight extraction. A reader who needs a concrete benchmark in this area would get direct use from the dataset and dimension list.

It deserves peer review because the problem is practical and the new artifacts are specific, even if the validation steps need more detail in revision.

Referee Report

1 major / 1 minor

Summary. The paper introduces ChartFI-Bench, a benchmark of 896 chart-description pairs featuring visually complex charts and semantically rich descriptions. It first summarizes four dimensions of high-quality chart descriptions (factual accuracy, salient feature emphasis, domain-informed guidance, chart-text complementarity), constructs the benchmark guided by these dimensions, defines four aligned metrics (Faithfulness, Coverage, Informativeness, Acuity), and reports experiments on mainstream MLLMs that claim to demonstrate the framework's effectiveness while revealing common model weaknesses.

Significance. If the metrics receive independent validation, the work could advance evaluation of MLLM chart descriptions for accessibility and insight tasks by moving beyond simple fact-enumeration datasets and single-aspect metrics. The scale of the benchmark and focus on multi-faceted quality represent concrete contributions, though the lack of reported human correlation or inter-rater data in the provided sections limits immediate applicability.

major comments (1)

[Metrics Definition and Validation] Metrics section (and abstract claim of framework effectiveness): The four metrics are constructed to align directly with the four author-chosen dimensions, yet no independent validation (human correlation studies, comparison against prior chart-description metrics on the same items, or inter-annotator agreement) is described. This is load-bearing for the central experimental claim, as it leaves open whether the metrics measure description quality or merely reproduce the taxonomy.
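As an illustration of one validation step named in this comment, inter-annotator agreement could be reported with a chance-corrected statistic such as Cohen's kappa. The sketch below is a generic implementation for two annotators; the labels are invented, the function name `cohens_kappa` is illustrative, and nothing here reflects an annotation protocol described by the authors.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labelling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty, equal-length label sequences")
    n = len(labels_a)
    # Observed agreement: fraction of items given the same label by both.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Expected agreement by chance, from each annotator's label frequencies.
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    expected = sum(count_a[c] * count_b[c] for c in count_a) / (n * n)
    if expected == 1.0:
        raise ValueError("kappa is undefined when both annotators use a single label")
    return (observed - expected) / (1 - expected)


# Hypothetical judgments: is each claim in a description supported by the chart?
annotator_1 = ["yes", "yes", "no", "yes", "no", "no", "yes", "yes"]
annotator_2 = ["yes", "no", "no", "yes", "no", "yes", "yes", "yes"]

print(f"kappa = {cohens_kappa(annotator_1, annotator_2):.2f}")
```

Kappa equals 1 for perfect agreement and 0 for agreement no better than chance, which is why it is often preferred over raw percent agreement when label frequencies are skewed.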

minor comments (1)

[Abstract] Abstract: the statement that experiments 'demonstrate the effectiveness' would be strengthened by including at least one quantitative result or comparison (e.g., score ranges or baseline deltas) rather than a qualitative summary.

Simulated Author's Rebuttal

1 response · 0 unresolved

Thank you for the opportunity to respond to the referee report. We address the single major comment below.

Referee: [Metrics Definition and Validation] Metrics section (and abstract claim of framework effectiveness): The four metrics are constructed to align directly with the four author-chosen dimensions, yet no independent validation (human correlation studies, comparison against prior chart-description metrics on the same items, or inter-annotator agreement) is described. This is load-bearing for the central experimental claim, as it leaves open whether the metrics measure description quality or merely reproduce the taxonomy.

Authors: We agree that the absence of independent validation (human correlation, inter-annotator agreement on the metrics, or head-to-head comparison with prior metrics) is a substantive limitation. The metrics were constructed by directly mapping each to one of the four dimensions we derived from the literature; no separate validation step was performed. The experimental section shows that the metrics produce differentiated scores across models that are consistent with qualitative inspection of outputs, but this does not constitute independent evidence that the metrics capture description quality rather than the taxonomy itself. We will revise the manuscript to (1) add an explicit limitations paragraph on this point, (2) moderate the abstract and conclusion language from "demonstrate the effectiveness" to "illustrate the utility," and (3) outline concrete directions for future human validation studies. This is a partial revision; a full empirical validation study lies outside the scope of the current submission.

Revision: partial

Circularity Check

0 steps flagged

No significant circularity; benchmark design is self-contained

The paper first summarizes four dimensions of chart description quality and then constructs a benchmark and four aligned metrics to evaluate descriptions across those dimensions. This constitutes an explicit design choice for the evaluation framework rather than any derivation, equation, or prediction that reduces to its own inputs by construction. No fitted parameters, self-citations as load-bearing premises, uniqueness theorems, or renamings of prior results appear in the abstract or described chain. The central experiments on MLLMs therefore rest on independent application of the defined metrics to model outputs, satisfying the criteria for a non-circular finding.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

Only the abstract is available; the four dimensions are presented as given without derivation or external validation shown.

axioms (1)

Domain assumption: Four dimensions (factual accuracy, salient feature emphasis, domain-informed guidance, chart-text complementarity) characterize high-quality chart descriptions.
The abstract states that these dimensions guided benchmark construction and metric design.

pith-pipeline@v0.9.1-grok · 5758 in / 1254 out tokens · 36321 ms · 2026-06-30T16:07:25.396361+00:00 · methodology

Original abstract

Chart descriptions are essential for accessibility, cross-modal retrieval, and assisting readers in extracting insights from complex visualizations. As multimodal large language models (MLLMs) are increasingly adopted for automated chart description generation, a critical question arises: how faithfully and insightfully do these models actually describe charts? Current benchmarks fall short on two fronts: existing datasets consist of simple, homogeneous charts paired with shallow, fact-enumerating descriptions; and prevailing metrics fail to capture the multi-faceted nature of description quality. To address these gaps, we present the Chart Faithfulness and Insightfulness Benchmark (ChartFI-Bench). We first summarize four dimensions that characterize high-quality chart descriptions: factual accuracy, salient feature emphasis, domain-informed guidance, and chart-text complementarity. Guided by these dimensions, we construct a high-quality benchmark comprising 896 chart-description pairs, which feature visually complex charts and semantically rich descriptions. Furthermore, we design four aligned evaluation metrics -- Faithfulness, Coverage, Informativeness, and Acuity -- to systematically assess the quality of descriptions across these dimensions. Experiments conducted on mainstream MLLMs demonstrate the effectiveness of the proposed framework and reveal common weaknesses among existing models.

Figures

Figures reproduced from arXiv: 2605.23694 by Chao Liu, Chunran Hu, Fen Wang, Lexu Xie, Qiman Kang, Siming Chen, Zekai Shao, Zhixuan Zhang.

**Figure 1.** Overview of the benchmark construction pipeline, consisting of three stages: dataset collection from academic papers, chart filtering with [...]

**Figure 2.** Statistics of the ChartFI-Bench: the left shows the number of [...]

**Figure 3.** Performance comparison and error analysis across methods on [...]

Reference graph

Works this paper leans on

67 extracted references · 67 canonical work pages · 5 internal anchors


[1] [1]

https://www.statista.com/, 2026

Statista. https://www.statista.com/, 2026. Accessed: 2026-3-31. 2

work page 2026

[2] [2]

Ayres and J

P. Ayres and J. Sweller. The split-attention principle in multimedia learning. The Cambridge handbook of multimedia learning, 2:135–146, 2005. 3

work page 2005

[3] [3]

Y . Bai, Y . Ding, S. Lin, and W. Fan. Beyond description: A multi- modal agent framework for insightful chart summarization.arXiv preprint arXiv:2602.18731, 2026. 3

work page arXiv 2026

[4] [4]

Banerjee and A

S. Banerjee and A. Lavie. Meteor: An automatic metric for mt evaluation with improved correlation with human judgments. InProceedings of the acl workshop on intrinsic and extrinsic evaluation measures for machine translation and/or summarization, pp. 65–72, 2005. 1, 7

work page 2005

[5] [5]

Battle and A

L. Battle and A. Ottley. What do we mean when we say “insight”? a formal synthesis of existing theory.IEEE Transactions on Visualization and Computer Graphics, 30(9):6075–6088, 2023. 3

work page 2023

[6] [6]

H. P. Chan, Q. Zeng, and H. Ji. Interpretable automatic fine-grained incon- sistency detection in text summarization. InFindings of the Association for Computational Linguistics: ACL 2023, pp. 6433–6444, 2023. 3

work page 2023

[7] [7]

C. Chen, R. Zhang, E. Koh, S. Kim, S. Cohen, T. Yu et al. Figure captioning with reasoning and sequence-level training.arXiv preprint arXiv:1906.02850, 2019. 2

work page internal anchor Pith review Pith/arXiv arXiv 1906

[8] [8]

N. Chen, Y . Zhang, J. Xu, K. Ren, and Y . Yang. Viseval: A benchmark for data visualization in the era of large language models.IEEE Transactions on Visualization and Computer Graphics, 31(1):1301–1311, 2024. 1, 7, 8

work page 2024

[9] [9]

C. B. Clement, M. Bierbaum, K. P. O’Keeffe, and A. A. Alemi. On the use of arxiv as a dataset.arXiv preprint arXiv:1905.00075, 2019. 3

work page internal anchor Pith review Pith/arXiv arXiv 1905

[10] [10]

A. K. Das, M. Tarun, and K. Mueller. Charts-of-thought: Enhancing llm vi- sualization literacy through structured data extraction.IEEE Transactions on Visualization and Computer Graphics, 2025. 5

work page 2025

[11] [11]

H. Dong, J. Li, B. Wu, J. Wang, Y . Zhang, and H. Guo. Benchmarking and improving detail image caption.arXiv preprint arXiv:2405.19092,

work page arXiv

[12] [12]

Ellemose and N

J. Ellemose and N. Elmqvist. Eye of the beholder: Towards measuring vi- sualization complexity.IEEE Transactions on Visualization and Computer Graphics, 2025. 3

work page 2025

[13] [13]

X. Fu, Y . Wang, H. Dong, W. Cui, and H. Zhang. Visualization assessment: A machine learning approach. In2019 IEEE Visualization Conference (VIS), pp. 126–130. IEEE, 2019. 8

work page 2019

[14] [14]

Gemini 3 pro

Google. Gemini 3 pro. https://chatgpt.com/, 2026. Accessed: 2026- 3-31. 2, 4, 5, 7

work page 2026

[15] [15]

Y . Han, C. Zhang, X. Chen, X. Yang, Z. Wang, G. Yu et al. Chartllama: A multimodal llm for chart understanding and generation.arXiv preprint arXiv:2311.16483, 2023. 2

work page Pith review arXiv 2023

[16]

J. Hessel, A. Holtzman, M. Forbes, R. Le Bras, and Y. Choi. Clipscore: A reference-free evaluation metric for image captioning. In Proceedings of the 2021 conference on empirical methods in natural language processing, pp. 7514–7528, 2021. 3

[17]

E. Hoque and M. S. Islam. Natural language generation for visualizations: State of the art, challenges and future directions. In Computer Graphics Forum, vol. 44, p. e15266. Wiley Online Library, 2025. 7

[18] [19]

T.-Y. Hsu, C. L. Giles, and T.-H. Huang. Scicap: Generating captions for scientific figures. In Findings of the Association for Computational Linguistics: EMNLP 2021, pp. 3258–3264, 2021. 2

[19] [20]

T.-Y. Hsu, C.-Y. Huang, R. Rossi, S. Kim, C. Giles, and T.-H. Huang. Gpt-4 as an effective zero-shot evaluator for scientific figure captions. In Findings of the Association for Computational Linguistics: EMNLP 2023, pp. 5464–5474, 2023. 9

[20] [21]

K.-H. Huang, H. P. Chan, M. Fung, H. Qiu, M. Zhou, S. Joty et al. From pixels to insights: A survey on automatic chart understanding in the era of large foundation models. IEEE Transactions on Knowledge and Data Engineering, 37(5):2550–2568, 2024. 3

[21] [22]

K.-H. Huang, H. P. Chan, and H. Ji. Zero-shot faithful factual error correction. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 5660–5676,

[22] [23]

K.-H. Huang, M. Zhou, H. P. Chan, Y. Fung, Z. Wang, L. Zhang et al. Do lvlms understand charts? analyzing and correcting factual errors in chart captioning. In Findings of the Association for Computational Linguistics: ACL 2024, pp. 730–749, 2024. 9

[23] [24]

S. Kantharaj, R. T. Leong, X. Lin, A. Masry, M. Thakkar, E. Hoque et al. Chart-to-text: A large-scale benchmark for chart summarization. pp. 4005–4023. Association for Computational Linguistics, 2022. doi: 10.18653/v1/2022.acl-long.277 2

[24] [25]

D. H. Kim, V. Setlur, and M. Agrawala. Towards understanding how readers integrate charts and captions: A case study with line charts. In Proceedings of the 2021 CHI Conference on Human Factors in Computing Systems, pp. 1–11, 2021. 3, 7

[25] [26]

K. Kim, S. Lee, K.-H. Huang, H. P. Chan, M. Li, and H. Ji. Can llms produce faithful explanations for fact-checking? towards faithful explainable fact-checking via multi-agent debate. arXiv preprint arXiv:2402.07401,

[26] [27]

R. Kinney, C. Anastasiades, R. Authur, I. Beltagy, J. Bragg, A. Buraczynski et al. The semantic scholar open data platform. arXiv preprint arXiv:2301.10140, 2023. 3

[27] [28]

H.-K. Ko, H. Jeon, G. Park, D. H. Kim, N. W. Kim, J. Kim et al. Natural language dataset generation framework for visualizations powered by large language models. In Proceedings of the 2024 CHI Conference on Human Factors in Computing Systems, pp. 1–22, 2024. 2

[28] [29]

S. Krichene, F. Piccinno, F. Liu, and J. Eisenschlos. Faithful chart summarization with chats-pi. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 8705–8723, 2024. 3

[29] [30]

S. Latif, Z. Zhou, Y. Kim, F. Beck, and N. W. Kim. Kori: Interactive synthesis of text and charts in data documents. IEEE Transactions on Visualization and Computer Graphics, 28(1):184–194, 2021. 1, 3

[30] [32]

L. Li, Y. Wang, R. Xu, P. Wang, X. Feng, L. Kong et al. Multimodal arxiv: A dataset for improving scientific comprehension of large vision-language models. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 14369–14387,

[31] [33]

J. Lim, J. Ahn, and G. Kim. Chartcap: Mitigating hallucination of dense chart captioning. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 13171–13182, 2025. 1, 2, 3

[32] [34]

C.-Y. Lin. Rouge: A package for automatic evaluation of summaries. In Text summarization branches out, pp. 74–81, 2004. 7

[33] [35]

C. Liu, Y. Guo, and X. Yuan. Autotitle: An interactive title generator for visualizations. IEEE Transactions on Visualization and Computer Graphics, 30(8):5276–5288, 2023. 3

[34] [36]

F. Liu, X. Wang, W. Yao, J. Chen, K. Song, S. Cho et al. Mmc: Advancing multimodal chart understanding with large-scale instruction tuning. In Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), pp. 1287–1310, 2024. 2

[35] [37]

M. Liu, D. Chen, Y. Li, G. Fang, and Y. Shen. Chartthinker: A contextual chain-of-thought approach to optimized chart summarization. arXiv preprint arXiv:2403.11236, 2024. 2

[36] [38]

Z. Liu, C.-W. Xie, B. Wen, F. Yu, J. Chen, P. Li et al. Capability: A comprehensive visual caption benchmark for evaluating both correctness and thoroughness. arXiv preprint arXiv:2502.14914, 2025. 1, 3

[37] [39]

Y. Lu, L. Zhong, J. Yang, W. Li, P. Wei, Y. Wang et al. Domaincqa: Crafting knowledge-intensive qa from domain-specific charts. In Proceedings of the AAAI Conference on Artificial Intelligence, vol. 40, pp. 32347–32355, 2026. 7

[38] [40]

A. Lundgard and A. Satyanarayan. Accessible visualization via natural language descriptions: A four-level model of semantic content. IEEE Transactions on Visualization and Computer Graphics, 28(1):1073–1083,

[39] [41]

A. Mahinpei, Z. Kostic, and C. Tanner. Linecap: Line charts for data visualization captioning models. In 2022 IEEE Visualization and Visual Analytics (VIS), pp. 35–39. IEEE, 2022. 2

[40] [42]

F. Meng, W. Shao, Q. Lu, P. Gao, K. Zhang, Y. Qiao et al. Chartassisstant: A universal chart multimodal language model via chart-to-table pre-training and multitask instruction tuning. arXiv preprint arXiv:2401.02384,

[41] [43]

J. Niu, Z. Liu, Z. Gu, B. Wang, L. Ouyang, Z. Zhao et al. Mineru2.5: A decoupled vision-language model for efficient high-resolution document parsing, 2025. 3


[42] [44]

J. Obeid and E. Hoque. Chart-to-text: Generating natural language descriptions for charts by adapting the transformer model. In Proceedings of the 13th International Conference on Natural Language Generation, pp. 138–147, 2020. 1, 2, 3

[43] [45]

J. Obeid and E. Hoque. Chart-to-text: Generating natural language descriptions for charts by adapting the transformer model. arXiv preprint arXiv:2010.09142, 2020. 2

[44] [46]

OpenAI. Gemini 3 pro. https://chatgpt.com/, 2026. Accessed: 2026-3-31. 2, 4, 7

[45] [47]

K. Papineni, S. Roukos, T. Ward, and W.-J. Zhu. Bleu: a method for automatic evaluation of machine translation. In Proceedings of the 40th annual meeting of the Association for Computational Linguistics, pp. 311–318, 2002. 1, 3, 7

[46] [48]

Pew Research Center. Pew research center. https://www.pewresearch.org/, 2026. Accessed: 2026-3-31. 2

[47] [49]

Qwen Team. Qwen3.5: Towards native multimodal agents, February 2026. 2, 5, 6, 7

[48] [50]

R. Rahman, R. Hasan, A. Al Farhad, M. T. R. Laskar, M. H. Ashmafee, and A. R. M. Kamal. Chartsumm: A comprehensive benchmark for automatic chart summarization of long and short summaries. In Canadian AI, 2023. 2

[49] [51]

M. A. Borkin, Z. Bylinskii, N. W. Kim, C. M. Bainbridge, C. S. Yeh, D. Borkin et al. Beyond memorability: Visualization recognition and recall. IEEE Transactions on Visualization and Computer Graphics, 22(1), 2016. 8

[50] [52]

T. Sellam, D. Das, and A. Parikh. Bleurt: Learning robust metrics for text generation. In Proceedings of the 58th annual meeting of the association for computational linguistics, pp. 7881–7892, 2020. 7

[51] [53]

Z. Shao, L. Shen, H. Li, Y. Shan, H. Qu, Y. Wang et al. Narrative player: Reviving data narratives with visuals. IEEE Transactions on Visualization and Computer Graphics, 31(10):6781–6795, 2025. 5

[52] [54]

L. Shen, E. Shen, Z. Tai, Y. Xu, J. Dong, and J. Wang. Visual data analysis with task-based recommendations. Data Science and Engineering, 7(4):354–369, 2022. 5

[53] [55]

Y. Shi, C. Zheng, Z. Yang, K. Xu, and N. Cao. Vistoryteller: Designing data stories with llm agent-based generation and interactive user control. In Proceedings of the 31st International Conference on Intelligent User Interfaces, pp. 1141–1156, 2026. 1, 2

[54] [56]

C. Stokes, V. Setlur, B. Cogley, A. Satyanarayan, and M. A. Hearst. Striking a balance: Reader takeaways and preferences when integrating text and charts. IEEE Transactions on Visualization and Computer Graphics, 29(1):1233–1243, 2022. 3

[55] [57]

N. Sultanum and A. Srinivasan. Datatales: Investigating the use of large language models for authoring data-driven articles. In 2023 IEEE Visualization and Visual Analytics (VIS), pp. 231–235. IEEE, 2023. 2

[56] [58]

J. Sweller. Implications of cognitive load theory for multimedia learning. The Cambridge handbook of multimedia learning, 3(2):19–30, 2005. 3


[57] [59]

B. Tang, A. Boggust, and A. Satyanarayan. Vistext: A benchmark for semantically rich chart captioning. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 7268–7298, 2023. 1, 2, 3

[58] [60]

R. Vedantam, C. Lawrence Zitnick, and D. Parikh. Cider: Consensus-based image description evaluation. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 4566–4575, 2015. 3

[59] [61]

A. Z. Wang, G. J. Quadri, M. Zhu, C. Tseng, and D. A. Szafir. Characterizing visualization perception with psychological phenomena: Uncovering the role of subitizing in data visualization. IEEE Transactions on Visualization and Computer Graphics, 2025. 3

[60] [62]

F. Wang, B. Wang, X. Shu, Z. Liu, Z. Shao, C. Liu et al. Chartinsighter: An approach for mitigating hallucination in time-series chart summary generation with a benchmark dataset. IEEE Transactions on Visualization and Computer Graphics, 2025. 1, 2, 3

[61] [63]

W. Wang, Z. Gao, L. Gu, H. Pu, L. Cui, X. Wei et al. Internvl3.5: Advancing open-source multimodal models in versatility, reasoning, and efficiency. arXiv preprint arXiv:2508.18265, 2025. 2, 5, 7

[62] [64]

Y. Wang, Z. Sun, H. Zhang, W. Cui, K. Xu, X. Ma et al. Datashot: Automatic generation of fact sheets from tabular data. IEEE Transactions on Visualization and Computer Graphics, 26(1):895–905, 2019. 5

[63] [65]

Z. Wang, M. Xia, L. He, H. Chen, Y. Liu, R. Zhu et al. Charxiv: Charting gaps in realistic chart understanding in multimodal llms. Advances in Neural Information Processing Systems, 37:113569–113697, 2024. 7

[64] [66]

S. Wiseman, S. M. Shieber, and A. M. Rush. Challenges in data-to-document generation. In Proceedings of the 2017 conference on empirical methods in natural language processing, pp. 2253–2263, 2017. 3

[65] [67]

R. Xia, H. Ye, X. Yan, Q. Liu, H. Zhou, Z. Chen et al. Chartx & chartvlm: A versatile benchmark and foundation model for complicated chart reasoning. IEEE Transactions on Image Processing, 2025. 1

[66] [68]

R. Ye. Chartdiff: A large-scale benchmark for comprehending pairs of charts. arXiv preprint arXiv:2603.28902, 2026. 2

[67] [69]

T. Zhang, V. Kishore, F. Wu, K. Q. Weinberger, and Y. Artzi. Bertscore: Evaluating text generation with bert. arXiv preprint arXiv:1904.09675,