
arXiv: 2604.08959 · v1 · submitted 2026-04-10 · 💻 cs.HC

How Do LLMs See Charts? A Comparative Study on High-Level Visualization Comprehension in Humans and LLMs

Pith reviewed 2026-05-10 17:55 UTC · model grok-4.3

classification 💻 cs.HC
keywords LLM visualization comprehension · human-AI comparison · chart interpretation · qualitative study · high-level patterns · interpretative strategies · data visualization · prompt conditions

The pith

LLMs interpret charts by enumerating comparisons and numerical ranges in a fixed way, while humans synthesize data into trend-centered narratives.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper compares how humans and large language models (LLMs) extract high-level meaning from three chart types: line graphs, bar graphs, and scatterplots. It reports that LLMs apply the same structural breakdown across all three prompt conditions, listing data points and comparisons rather than shifting to broader patterns. Humans instead weave the numbers into stories focused on overall trends and connections. The authors conclude that LLMs comprehend charts through mechanisms distinct from human intuition. These findings matter for anyone designing visualizations that both people and AI systems are meant to read.

Core claim

LLMs exhibit a consistent interpretative strategy that remains unchanged across prompt constraints. Humans naturally synthesize data into trend-centric narratives, whereas LLMs persist with a structural enumeration of comparisons and numerical ranges. LLMs achieve visualization comprehension through mechanisms distinct from human intuition.

What carries the argument

Qualitative comparison of interpretative strategies across three visualization types and three prompt conditions, revealing fixed structural enumeration in LLMs versus narrative synthesis in humans.

If this is right

  • Visualization designers need to account for LLMs favoring explicit numerical comparisons over implicit trends when charts are meant for AI audiences.
  • Changing prompt wording will not shift LLMs toward human-like narrative reading of charts.
  • Tools that combine human and LLM chart analysis must bridge the structural versus narrative gap to avoid misaligned interpretations.
  • Opportunities exist to create new chart designs that explicitly support both human trend synthesis and LLM enumeration.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Training data for LLMs could be augmented with narrative summaries of charts to encourage more human-aligned comprehension.
  • Testing the same protocol on additional chart forms such as heat maps or network diagrams would show whether the enumeration pattern generalizes.
  • Design software could incorporate checks that flag when a visualization prioritizes LLM-friendly lists at the expense of human trend clarity.

Load-bearing premise

The three visualization types and three prompt conditions, together with the qualitative analysis, capture general high-level interpretative strategies for both humans and LLMs.

What would settle it

A follow-up experiment in which LLMs switch from enumeration to trend synthesis, whether on the same charts under new prompt wording or on additional visualization types, would falsify the claim of a consistent strategy.

Figures

Figures reproduced from arXiv: 2604.08959 by Daeun Jeong, Ghulam Jilani Quadri, Hyotaek Jeon, Hyunwook Lee, Joohee Kim, Minjeong Shin, Shinwook Seon, Sungahn Ko, Tapendra Pandey.

Figure 1. Study Design Dimensions: Chart Types (3), Data Types (2), Composition Types (2), LLMs (3), Prompt Constraints (3).
… should be used as interpreters in visualizations. Recent work has begun to investigate the capabilities of LLMs for low-level visualization tasks, such as basic chart reading and performing visual analytics directly on chart representations, while assessing their visualization literacy and analyt…
Figure 2. Bloom’s Taxonomy for visualization comprehension with descriptions and example tasks [BXF∗20].
Statistical quantities analysis characterizes the interpretative focus of humans and LLMs, revealing differences in how they process visual information. Utilizing a closed taxonomy, we analyzed the frequency and distribution of tasks (e.g., Trend, Comparison) to determine whether the descriptions are dominated…
Figure 3. Character Length (a) and Word (Token) Count (b) of descriptions generated for three prompt constraints (PC0, PC1, PC2).
Figure 4. Figure 4 illustrates how LLMs realign in response to tightening length constraints (PC0 → PC1 → PC2). We observe that the prioritization of statistical tasks varies significantly across chart types. For line charts, the most prominent shift occurs in the extraction of global patterns. As the constraint intensifies with PC2, the proportion of Trend tasks increases notably. This suggests that when space is limited,…
Figure 5. Distribution of Bloom’s Taxonomy cognitive categories across different chart types for humans and LLMs.
6.1. Comparative Distribution of Cognitive Strategies. Comparing the overall distribution of the categories reveals distinct differences in how humans and LLMs allocate cognitive resources during visualization comprehension. As shown in…
Figure 6. Task transition probability matrices for human and LLM descriptions. Cell values indicate the likelihood of transitioning from the row task to the column task.
… predominantly found in the LLM’s pre-training data. Standard accessibility guidelines [Ini99] and high-quality captioning datasets typically advocate for a top-down descriptive structure. This pattern also aligns with Shneiderman’s Visual Informat…
Figure 7. Representative visualization examples used in our study: (a) line charts with (a-1) single-class and (a-2) multi-class data; (b) (b-1) non-juxtaposed and (b-2) juxtaposed layouts with the same data; and (c-1) a scatterplot and (c-2) a bar chart for chart type comparison with the same data.
LLMs treat the chart merely as data, showing no qualitative change in their comprehension strategy regardless of the l…
Figure 8. Examples of visualizations used in our study: (a) Single-class bar chart utilizing 12 distinct categories. (b) Single-class line chart displaying Google stock price. (c) Multi-class juxtaposed scatterplot visualizing the relationship between clients' age and BMI across four regions.
Model    Chart Type     PC2 → PC1    PC1 → PC0    PC2 → PC0
Claude   Bar chart      0.9256       0.8867       0.8658
         Line chart     0.9348       0.8722       0.8453
         Scatterplot    0.9079       0.8770…
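
The caption of Figure 6 describes task transition probability matrices, where each cell gives the likelihood of moving from the row task to the column task. The following is a minimal sketch of how such a matrix could be computed from coded descriptions. It is not the authors' implementation: the function name `transition_matrix`, the "Range" label, and the sample sequences are hypothetical, and it assumes each description has already been coded as an ordered list of task labels.

```python
from collections import Counter

def transition_matrix(sequences, tasks):
    """Row-normalized first-order transition probabilities between task labels.

    sequences: list of ordered task-label lists, one per description (assumed input format).
    tasks: the closed set of task labels to use as rows and columns.
    """
    counts = {src: Counter() for src in tasks}
    for seq in sequences:
        # Count each adjacent (row task -> column task) pair.
        for src, dst in zip(seq, seq[1:]):
            counts[src][dst] += 1
    matrix = {}
    for src in tasks:
        total = sum(counts[src].values())
        # A task that never precedes another gets an all-zero row.
        matrix[src] = {
            dst: (counts[src][dst] / total if total else 0.0) for dst in tasks
        }
    return matrix

# Hypothetical coded descriptions; the labels are illustrative only.
TASKS = ["Trend", "Comparison", "Range"]
sequences = [
    ["Trend", "Comparison", "Range"],
    ["Trend", "Trend", "Comparison"],
    ["Comparison", "Range"],
]
matrix = transition_matrix(sequences, TASKS)
for src in TASKS:
    print(src, [round(matrix[src][dst], 2) for dst in TASKS])
```

Building one matrix from human descriptions and one from LLM descriptions would allow the cell-by-cell comparison that the caption describes.
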
Original abstract

Designers often create visualizations to achieve specific high-level analytical or communication goals. These goals require people to extract complex and interconnected data patterns. Prior perceptual studies of visualization effectiveness have focused on low-level tasks, such as estimating statistical quantities, and have recently explored high-level comprehension of visualization. Despite the growing use of Large Language Models (LLMs) as visualization interpreters, how their interpretations relate to human understanding or what reasoning processes underlie their responses remains insufficiently understood. In this work, we explore LLMs' visualization comprehension, examining the alignment between designers' communicative goals and what their audience sees in a visualization. We have conducted a qualitative study to investigate the gap between human interpretative strategies and the reasoning pathways of LLMs across three types of visualizations, line graphs, bar graphs, and scatterplots, to identify the high-level patterns generated by LLMs using three prompt conditions. Our analysis results indicate that LLMs exhibit a consistent interpretative strategy that remains unchanged across prompt constraints. Furthermore, we observe two distinct approaches: humans naturally synthesize data into trend-centric narratives, whereas LLMs persist with a structural enumeration of comparisons and numerical ranges. Lastly, we see LLMs achieve visualization comprehension through mechanisms distinct from human intuition, pointing to critical challenges and new opportunities for visualization design.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper reports a qualitative study comparing high-level visualization comprehension between humans and LLMs across line graphs, bar graphs, and scatterplots under three prompt conditions. It claims LLMs display a consistent interpretative strategy that does not change with prompt constraints, humans synthesize data into trend-centric narratives while LLMs rely on structural enumeration of comparisons and numerical ranges, and LLMs therefore comprehend visualizations through mechanisms distinct from human intuition.

Significance. If the observed patterns hold under broader testing, the work would be significant for visualization and HCI research by identifying concrete differences in reasoning pathways. This could inform visualization design practices that account for both human and LLM audiences and highlight limitations in using LLMs as chart interpreters. The qualitative framing provides an initial mapping of strategies, though the absence of quantitative validation or sampling justification reduces immediate applicability.

major comments (2)
  1. [Methods] Methods section: No information is provided on the number of human participants, their recruitment or demographics, the exact qualitative coding procedure, or inter-rater reliability metrics. These details are required to assess whether the reported divergence between trend-centric human narratives and LLM structural enumeration is reproducible and not an artifact of small or unrepresentative samples.
  2. [Results] Results and Discussion sections: The central claim that LLMs use mechanisms distinct from human intuition rests on responses to only three visualization types and three prompt conditions. Without quantitative metrics (e.g., frequency counts of narrative vs. enumeration codes), justification for stimulus representativeness, or tests of additional chart types, the consistency of LLM strategy and the human-LLM divergence cannot be shown to generalize beyond the specific data patterns tested.
minor comments (1)
  1. [Abstract] Abstract: Adding one or two concrete examples of a human trend narrative versus an LLM enumeration response would help readers immediately grasp the claimed distinction before reading the full analysis.
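
The second major comment asks for quantitative metrics such as frequency counts of narrative versus enumeration codes. A minimal sketch of that tally follows, assuming each response has already been assigned a single code. The function name `code_proportions`, the code labels, and the sample data are hypothetical and are not taken from the paper.

```python
from collections import Counter

def code_proportions(codes):
    """Return the share of each qualitative code in a list of coded responses."""
    counts = Counter(codes)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {code: n / total for code, n in counts.items()}

# Hypothetical coded responses, one code per response.
coded = {
    "human": ["narrative", "narrative", "enumeration", "narrative"],
    "llm": ["enumeration", "enumeration", "narrative", "enumeration"],
}
proportions = {group: code_proportions(codes) for group, codes in coded.items()}
for group, shares in proportions.items():
    print(group, shares)
```

Reporting these proportions per chart type and per prompt condition would show whether the enumeration pattern is uniform or concentrated in particular conditions.
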

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive and detailed feedback, which helps us strengthen the methodological transparency and scope of our qualitative study on LLM versus human visualization comprehension. We address each major comment below and will revise the manuscript accordingly.

Point-by-point responses
  1. Referee: [Methods] Methods section: No information is provided on the number of human participants, their recruitment or demographics, the exact qualitative coding procedure, or inter-rater reliability metrics. These details are required to assess whether the reported divergence between trend-centric human narratives and LLM structural enumeration is reproducible and not an artifact of small or unrepresentative samples.

    Authors: We acknowledge that these details were omitted from the submitted manuscript. In the revised version, we will expand the Methods section to include the number of human participants, recruitment method (via university participant pools and online forums), demographics (age, gender, visualization familiarity), the qualitative coding procedure (thematic analysis with iterative codebook development), and inter-rater reliability metrics (e.g., Cohen's kappa between independent coders). This will allow better evaluation of reproducibility.

    Revision: yes

  2. Referee: [Results] Results and Discussion sections: The central claim that LLMs use mechanisms distinct from human intuition rests on responses to only three visualization types and three prompt conditions. Without quantitative metrics (e.g., frequency counts of narrative vs. enumeration codes), justification for stimulus representativeness, or tests of additional chart types, the consistency of LLM strategy and the human-LLM divergence cannot be shown to generalize beyond the specific data patterns tested.

    Authors: We agree the study is scoped to three visualization types and prompt conditions as an initial qualitative exploration. In revision, we will add quantitative elements such as frequency counts and proportions of trend-centric narrative codes versus structural enumeration codes across all responses to better demonstrate consistency. We will also justify the representativeness of the chosen stimuli (common chart types for high-level tasks) and explicitly discuss limitations on generalizability, including the need for future work with additional chart types. The qualitative depth remains central, but these additions will better support the claims.

    Revision: partial
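
The first response names Cohen's kappa as an inter-rater reliability metric. As a hedged illustration of what that statistic measures, the sketch below computes it for two coders from first principles, as observed agreement corrected for chance agreement. The function name `cohens_kappa` and the sample labels are hypothetical, and returning 1.0 when chance agreement equals 1 is a convention chosen for this sketch, since kappa is undefined in that case.

```python
from collections import Counter

def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters who labeled the same items."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("Both raters must label the same non-empty set of items.")
    n = len(rater_a)
    # Observed agreement: fraction of items given the same label by both raters.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Expected agreement under independence, from each rater's label frequencies.
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    labels = freq_a.keys() | freq_b.keys()
    expected = sum(freq_a[label] * freq_b[label] for label in labels) / (n * n)
    if expected == 1.0:
        # Both raters used one identical label throughout; kappa is undefined.
        # Returning 1.0 is an assumption of this sketch.
        return 1.0
    return (observed - expected) / (1 - expected)

# Hypothetical labels from two independent coders.
coder_1 = ["Trend", "Trend", "Comparison", "Comparison"]
coder_2 = ["Trend", "Comparison", "Comparison", "Comparison"]
kappa = cohens_kappa(coder_1, coder_2)
print(kappa)
```

In this example the coders agree on 3 of 4 items (0.75) while chance agreement is 0.5, so kappa is (0.75 - 0.5) / (1 - 0.5) = 0.5.
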

Circularity Check

0 steps flagged

No circularity: empirical qualitative study with no derivations or self-referential steps

Full rationale

The paper is a qualitative comparative study of human and LLM responses to three visualization types under three prompt conditions. It contains no equations, no fitted parameters, no derivations, and no load-bearing self-citations that reduce claims to inputs by construction. All findings are presented as observed patterns from coded responses rather than predictions forced by prior definitions or ansatzes. The central claims rest on direct empirical observation and qualitative analysis, which are self-contained against external benchmarks and do not invoke uniqueness theorems or renamings of known results.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim depends on the representativeness of the chosen chart types and prompts plus the validity of the qualitative coding process, with no free parameters, new entities, or mathematical axioms introduced.

axioms (1)
  • Domain assumption: The selected visualizations and prompt conditions are representative enough to reveal general differences in high-level comprehension strategies.
    This underpins the study design and generalization from the observed patterns.

pith-pipeline@v0.9.0 · 5566 in / 1153 out tokens · 34426 ms · 2026-05-10T17:55:52.156998+00:00 · methodology

