ChartFI: Benchmarking Faithfulness and Insightfulness of Chart Descriptions from Multimodal Large Language Models
Pith reviewed 2026-05-25 04:11 UTC · model grok-4.3
The pith
ChartFI-Bench introduces four dimensions and aligned metrics to evaluate how faithfully and insightfully MLLMs describe charts.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Existing benchmarks pair simple charts with shallow, fact-listing descriptions, so they cannot adequately test MLLM output. ChartFI-Bench supplies 896 visually complex chart-description pairs built around four quality dimensions (factual accuracy, salient feature emphasis, domain-informed guidance, chart-text complementarity), defines four aligned metrics (Faithfulness, Coverage, Informativeness, Acuity), and reports experiments indicating that the framework detects common model weaknesses.
What carries the argument
ChartFI-Bench dataset of 896 pairs together with the four metrics (Faithfulness, Coverage, Informativeness, Acuity) that operationalize the four quality dimensions.
If this is right
- Existing MLLMs exhibit measurable shortfalls in factual accuracy, feature emphasis, domain guidance, and text-chart alignment when describing charts.
- The four metrics enable systematic comparison of future models against the benchmark.
- Improved chart descriptions can directly aid accessibility tools and cross-modal retrieval systems.
- The benchmark construction process itself supplies a template for creating richer evaluation sets in related multimodal tasks.
Where Pith is reading between the lines
- The metrics could be used as reward signals during fine-tuning to push models toward higher-quality descriptions.
- The same dimension-based approach might transfer to evaluating descriptions of other data visualizations such as graphs or diagrams.
- Deployment of the benchmark in accessibility pipelines would let developers track progress on real-world usefulness rather than proxy tasks.
Load-bearing premise
The four dimensions fully characterize what makes a chart description high-quality.
What would settle it
Human experts rating the same descriptions on the same charts produce rankings that systematically disagree with scores from the four proposed metrics.
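One way to make this test concrete is a rank correlation between human ratings and metric scores over the same descriptions. The sketch below computes Spearman's rho with the standard library only; the ratings and scores are hypothetical illustration values, not results from the paper, and the paper does not state which correlation statistic (if any) it uses.

```python
from statistics import mean


def average_ranks(values):
    """Rank values from 1..n; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with values[order[i]].
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def spearman(x, y):
    """Spearman's rho: Pearson correlation computed on the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical data: one human rating and one metric score per description.
human_ratings = [4.5, 3.0, 2.0, 5.0, 1.5]
metric_scores = [0.81, 0.64, 0.70, 0.92, 0.30]

rho = spearman(human_ratings, metric_scores)
print(f"Spearman rho = {rho:.2f}")
```

A rho near 1 would mean the metric orders descriptions much as the raters do; a value near zero or below, repeated across charts and raters, would be the systematic disagreement described above.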
Original abstract
Chart descriptions are essential for accessibility, cross-modal retrieval, and assisting readers in extracting insights from complex visualizations. As multimodal large language models (MLLMs) are increasingly adopted for automated chart description generation, a critical question arises: how faithfully and insightfully do these models actually describe charts? Current benchmarks fall short on two fronts: existing datasets consist of simple, homogeneous charts paired with shallow, fact-enumerating descriptions; and prevailing metrics fail to capture the multi-faceted nature of description quality. To address these gaps, we present the Chart Faithfulness and Insightfulness Benchmark (ChartFI-Bench). We first summarize four dimensions that characterize high-quality chart descriptions: factual accuracy, salient feature emphasis, domain-informed guidance, and chart-text complementarity. Guided by these dimensions, we construct a high-quality benchmark comprising 896 chart-description pairs, which feature visually complex charts and semantically rich descriptions. Furthermore, we design four aligned evaluation metrics -- Faithfulness, Coverage, Informativeness, and Acuity -- to systematically assess the quality of descriptions across these dimensions. Experiments conducted on mainstream MLLMs demonstrate the effectiveness of the proposed framework and reveal common weaknesses among existing models.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces ChartFI-Bench, a benchmark of 896 chart-description pairs featuring visually complex charts and semantically rich descriptions. It summarizes four dimensions claimed to characterize high-quality chart descriptions (factual accuracy, salient feature emphasis, domain-informed guidance, chart-text complementarity) that guide benchmark construction and the design of four aligned metrics (Faithfulness, Coverage, Informativeness, Acuity). Experiments on mainstream MLLMs are reported to demonstrate the framework's effectiveness and reveal common model weaknesses.
Significance. If the dimensions are justified and the metrics validated against human judgments, the benchmark could improve evaluation of MLLM chart descriptions beyond existing simple datasets and shallow metrics, supporting accessibility and insight extraction tasks. The scale and complexity of the pairs represent a concrete advance, but significance is reduced by the absence of validation for the guiding dimensions.
major comments (1)
- [Abstract] The four dimensions are asserted to characterize high-quality chart descriptions and explicitly guide both benchmark construction and the four metrics, yet no derivation from prior visualization or accessibility literature, coverage argument, or validation (e.g., correlation with overall human quality ratings or inter-rater agreement) is supplied. This is load-bearing for the central claim that experiments demonstrate framework effectiveness.
Simulated Author's Rebuttal
We thank the referee for the careful review and constructive comment. We address the concern regarding justification of the four dimensions below and commit to revisions that strengthen the manuscript.
- Referee: [Abstract] The four dimensions are asserted to characterize high-quality chart descriptions and explicitly guide both benchmark construction and the four metrics, yet no derivation from prior visualization or accessibility literature, coverage argument, or validation (e.g., correlation with overall human quality ratings or inter-rater agreement) is supplied. This is load-bearing for the central claim that experiments demonstrate framework effectiveness.
Authors: We agree that the current presentation does not sufficiently derive or validate the four dimensions (factual accuracy, salient feature emphasis, domain-informed guidance, and chart-text complementarity). In the revised version we will add a new subsection (likely in Section 2 or 3) that explicitly grounds each dimension in prior visualization literature (e.g., Bertin’s visual variables, Tufte’s data-ink ratio and graphical excellence, and Cleveland & McGill’s perceptual rankings) as well as accessibility guidelines (W3C WCAG and chart-specific recommendations from the visualization accessibility community). We will also include a brief coverage argument showing how these dimensions collectively address gaps in existing chart-description evaluation. To address validation, we will conduct and report a small human study (n=30–50 raters) measuring correlation between the four dimension scores and overall quality ratings, plus inter-rater agreement (Cohen’s or Fleiss’ kappa). These additions will directly support the claim that the framework and experiments are effective.
Revision: yes
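For readers unfamiliar with the agreement statistic named in the rebuttal, here is a minimal sketch of Cohen's kappa for two raters, using only the standard library. The labels and ratings are hypothetical; the function name `cohens_kappa` and the handling of the degenerate case are choices made for this sketch, not anything specified by the authors. Fleiss' kappa, the multi-rater variant, is not shown.

```python
from collections import Counter


def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa: (observed - expected agreement) / (1 - expected)."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("raters must label the same non-empty set of items")
    n = len(rater_a)
    # Fraction of items on which the two raters chose the same label.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Agreement expected by chance from each rater's label frequencies.
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / n**2
    if expected == 1.0:
        # Both raters used a single identical label throughout; kappa is
        # undefined here, and this sketch chooses to report 1.0.
        return 1.0
    return (observed - expected) / (1 - expected)


# Hypothetical labels from two raters judging eight chart descriptions.
rater_1 = ["good", "good", "poor", "poor", "good", "poor", "good", "poor"]
rater_2 = ["good", "poor", "poor", "poor", "good", "poor", "good", "good"]

kappa = cohens_kappa(rater_1, rater_2)
print(f"Cohen's kappa = {kappa:.2f}")
```

In this toy example the raters agree on 6 of 8 items (0.75) while chance agreement is 0.5, so kappa is 0.5. Raw agreement alone would overstate how consistently the raters apply the dimensions.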
Circularity Check
No significant circularity; benchmark and metrics constructed independently
The paper summarizes four dimensions and uses them to guide benchmark construction and metric design, but this does not constitute circularity under the defined patterns. There are no equations, fitted parameters renamed as predictions, self-citations that are load-bearing for the central claim, uniqueness theorems imported from prior author work, or ansatzes smuggled via citation. The derivation chain consists of an asserted starting point (the dimensions) followed by independent construction of the dataset and metrics; no step reduces a result to its own inputs by construction. This is a standard benchmark paper whose claims rest on the new artifacts rather than self-referential reduction.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: the four dimensions (factual accuracy, salient feature emphasis, domain-informed guidance, chart-text complementarity) characterize high-quality chart descriptions.