Exploring MLLMs Perception of Network Visualization Principles

Chunyang Chen; Henry Förster; Jacob Miller; Johannes Zink; Ludwig Felder; Markus Wallinger; Stephen Kobourov; Timo Brand

arXiv: 2506.14611 · v2 · submitted 2025-06-17 · cs.HC

Pith reviewed 2026-05-19 09:07 UTC · model grok-4.3

keywords: multimodal large language models · network visualization · layout stress perception · human-AI comparison · visual perception · prompt engineering · graph layouts · HCI evaluation

The pith

Multimodal LLMs match trained human experts at judging stress in network layouts when given equivalent instructions.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper replicates a prior human study on perceiving stress in network visualizations by feeding the same images and instructions to three MLLMs. It finds that the models reach performance levels comparable to human experts and above those of untrained participants. The models explain their judgments using the same visual cues humans do, such as even node spacing and uniform edge lengths, rather than calculating exact stress values. This indicates the models can approximate perceptual evaluation of layout quality under controlled conditions.

Core claim

Providing MLLMs with the identical study information used for trained human participants produces accuracy in rating network layout stress that matches expert humans and exceeds untrained non-experts. The models rely on visual proxies instead of direct stress computation, and their generated explanations mirror those of human subjects.
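For readers unfamiliar with the quantity being judged, here is a minimal sketch of one common stress formulation: the sum over node pairs of w_ij (||x_i - x_j|| - d_ij)^2 with w_ij = d_ij^-2. It assumes an unweighted, connected graph and hop-count graph distances. The function names are illustrative, and the paper may use a different variant (for example, a scale-normalized one).

```python
import math
from collections import deque
from itertools import combinations


def shortest_path_lengths(adjacency):
    """All-pairs hop distances via one BFS per node (unweighted graph)."""
    dist = {}
    for source in adjacency:
        seen = {source: 0}
        queue = deque([source])
        while queue:
            u = queue.popleft()
            for v in adjacency[u]:
                if v not in seen:
                    seen[v] = seen[u] + 1
                    queue.append(v)
        dist[source] = seen
    return dist


def stress(adjacency, positions):
    """Sum over node pairs of (drawn distance - graph distance)^2 / graph distance^2.

    Assumption: the graph is connected, so every pair has a finite distance.
    """
    dist = shortest_path_lengths(adjacency)
    total = 0.0
    for i, j in combinations(adjacency, 2):
        d_ij = dist[i][j]
        drawn = math.dist(positions[i], positions[j])
        total += (drawn - d_ij) ** 2 / d_ij ** 2
    return total
```

A layout whose drawn distances track graph distances scores lower. Because this raw form depends on the drawing's scale, comparisons between layouts usually rescale each drawing first.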

What carries the argument

Replication of the human-subject experiment on stress perception in network layouts, using identical visual stimuli and textual instructions supplied to GPT-4o, Gemini-2.5, and Qwen2.5.

If this is right

MLLMs can serve as scalable substitutes for human subjects when evaluating visualization quality under the same protocol.
Deviating from the human-style prompt can produce performance that exceeds human experts in some cases.
Model explanations of layout quality track human reasoning patterns such as node distribution and edge uniformity.
The approach enables rapid testing of additional network visualization principles without new human recruitment.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The result suggests MLLMs may have acquired implicit models of visual aesthetics during training that parallel human perceptual heuristics.
Similar methods could be applied to other visualization metrics or chart types to test the breadth of this capability.
If confirmed across more models and tasks, it could reduce reliance on human participants for early-stage HCI experiments in visualization.
The finding opens questions about whether the models are truly perceiving or simply pattern-matching from training data on diagrams.

Load-bearing premise

The images and instructions create a perceptual task for the models that is equivalent to the training and experience given to the human participants.

What would settle it

Showing the MLLMs a set of network layouts with pre-computed stress values and checking whether their quality rankings align more closely with the actual stress metric or with the human expert rankings from the original study.
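The check described above can be sketched as a comparison of agreement rates. The sketch assumes a two-alternative task in which each rater picks the layout ("left" or "right") judged to have lower stress. The function names and the data layout are hypothetical, not taken from the paper.

```python
def lower_stress_side(stress_pairs):
    """Ground truth per stimulus: the side whose computed stress is lower."""
    return ["left" if left < right else "right" for left, right in stress_pairs]


def agreement_rate(choices_a, choices_b):
    """Fraction of stimuli on which two raters picked the same side."""
    if len(choices_a) != len(choices_b) or not choices_a:
        raise ValueError("need two non-empty, equally long choice lists")
    matches = sum(a == b for a, b in zip(choices_a, choices_b))
    return matches / len(choices_a)


def compare_alignment(model_choices, human_choices, stress_pairs):
    """Return (agreement with the stress metric, agreement with human experts)."""
    truth = lower_stress_side(stress_pairs)
    return (agreement_rate(model_choices, truth),
            agreement_rate(model_choices, human_choices))
```

If the model agrees with the human choices clearly more often than with the metric, that would point toward shared visual proxies rather than stress computation. A real analysis would also need uncertainty estimates on both rates.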

Figures

Figures reproduced from arXiv: 2506.14611 by Chunyang Chen, Henry Förster, Jacob Miller, Johannes Zink, Ludwig Felder, Markus Wallinger, Stephen Kobourov, Timo Brand.

**Figure 1.** An illustrative example from our MLLM experiment. Left: a pair of different network diagrams of the same network is shown to …

**Figure 2.** Specific instructions to the MLLM models on structuring the …

**Figure 3.** Overall accuracy in the Trained, Untrained, Expert and Tuned setting with respect to stress level difference.

**Figure 4.** (Top) 95% confidence intervals for the 10,000 iteration bootstrapped difference of means test. An interval containing zero indicates an …

**Figure 5.** (Top) Confidence intervals for the difference between …

**Figure 6.** t-SNE projection of the sentence embedding from MLLM re…

**Figure 8.** The diagram of the network on the left has a higher stress value of …

**Figure 9.** Overall accuracy in the Trained, Untrained, Expert and Tuned setting for GPT-4o, Gemini-2.5, and human subjects with respect to stress level difference. Every row represents the size of network seen, and every column a different setting. All trends tend to increase in accuracy as the stress difference gets larger. GPT-4o's are offset by 0.01 as they often overlap with Gemini-2.5 …

**Figure 10.** Example stimuli (pairs of network diagrams). In all examples, …
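The Figure 4 caption refers to a bootstrapped difference-of-means test with 10,000 iterations and 95% confidence intervals. A minimal sketch of that kind of test follows. It assumes per-trial correctness scores (1 or 0) for two groups and the percentile method for the interval; neither detail is confirmed by the text available here, and the function name is illustrative.

```python
import random
from statistics import mean


def bootstrap_mean_difference_ci(sample_a, sample_b, iterations=10_000,
                                 confidence=0.95, seed=0):
    """Percentile bootstrap interval for mean(sample_a) - mean(sample_b)."""
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    diffs = []
    for _ in range(iterations):
        # Resample each group with replacement, keeping its original size.
        resampled_a = rng.choices(sample_a, k=len(sample_a))
        resampled_b = rng.choices(sample_b, k=len(sample_b))
        diffs.append(mean(resampled_a) - mean(resampled_b))
    diffs.sort()
    tail = (1.0 - confidence) / 2.0
    lower = diffs[int(tail * iterations)]
    upper = diffs[min(iterations - 1, int((1.0 - tail) * iterations))]
    return lower, upper
```

An interval that includes zero means the data do not show a difference between the two group means at the chosen confidence level.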

Original abstract

In this paper, we test whether Multimodal Large Language Models (MLLMs) can match human-subject performance in tasks involving the perception of properties in network layouts. Specifically, we replicate a human-subject experiment about perceiving quality (namely stress) in network layouts using GPT-4o, Gemini-2.5 and Qwen2.5. Our experiments show that giving MLLMs the same study information as trained human participants yields performance comparable to that of human experts and exceeds that of untrained non-experts. Additionally, we show that prompt engineering that deviates from the human-subject experiment can lead to better-than-human performance in some settings. Interestingly, like human subjects, the MLLMs seem to rely on visual proxies rather than computing the actual value of stress, indicating some sense or facsimile of perception. Explanations from the models are similar to those used by the human participants (e.g., an even distribution of nodes and uniform edge lengths).

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note

private letter to a colleague

MLLMs reach expert-level ratings on network layout stress when given matching instructions and use similar visual proxies, but the paper provides little quantitative evidence to back the equivalence claim.

The main thing to know is that this paper replicates a prior human study on perceiving stress in network layouts and finds that MLLMs like GPT-4o, Gemini-2.5, and Qwen2.5 can match trained expert performance while using the same kinds of visual shortcuts humans do, such as checking node spread and edge uniformity. Prompt variants that move away from the original human instructions sometimes produce even stronger results. The parallel in reasoning is the most interesting observation here.

The work is straightforward in its approach: it takes an existing external human dataset, feeds the models the same study materials, and compares outcomes without introducing new fitted parameters or circular definitions. Testing several models and both faithful and altered prompts adds a bit of breadth that is useful for anyone thinking about AI in visualization evaluation.

The soft spot is the thin quantitative grounding. The abstract claims comparability but does not report metrics, statistical tests, sample sizes, or direct comparisons such as error distributions or rating correlations with the human data. The stress-test point about whether the prompts truly create an equivalent perceptual task is fair; without those side-by-side numbers it is difficult to tell how much of the match is functional versus surface-level. If the full paper supplies those details and they hold, the result becomes more convincing. If they remain absent, the central claim stays provisional.

This paper is for visualization and HCI researchers who want to explore whether models can stand in for some human-subject work on layout quality. A reader interested in automated assessment tools or cheaper evaluation pipelines would find the prompt experiments worth looking at. It is coherent and engages the existing literature directly enough to deserve a serious referee, though any review will almost certainly ask for the missing quantitative validation and clearer task-equivalence checks. I would send it to peer review with those requests.

Referee Report

2 major / 2 minor

Summary. The paper replicates a prior human-subject study on perceiving stress in network layouts using MLLMs (GPT-4o, Gemini-2.5, Qwen2.5). It claims that supplying the models with the same study information given to trained human participants produces performance comparable to human experts and superior to untrained non-experts. The authors further report that MLLMs rely on visual proxies (e.g., node distribution and edge uniformity) rather than direct stress computation, with explanations resembling those of human participants, and that non-standard prompt engineering can yield better-than-human results in some cases.

Significance. If the central equivalence claim is substantiated with quantitative evidence, the work would contribute to understanding MLLM capabilities in visualization perception tasks within HCI, potentially informing AI-assisted network layout evaluation and reducing reliance on human subjects for certain perceptual studies. The observation that models use human-like heuristics is a useful qualitative parallel, but the absence of direct metrics comparing model outputs to the original human dataset limits the strength of the contribution at present.

major comments (2)

[Abstract / Results] Abstract and Results sections: the claim of 'performance comparable to that of human experts' is stated without quantitative metrics, statistical tests, exact sample sizes, error distributions, confusion matrices, or correlation coefficients between MLLM ratings and the original human-subject data, leaving the central comparability assertion only weakly supported by the available text.
[Methods] Methods: the assumption that textual instructions plus images given to the MLLMs constitute an equivalent perceptual task to the training and practice trials provided to human experts is not verified; no quantitative comparison (e.g., rating distributions or agreement measures) is reported to confirm functional equivalence rather than superficial similarity.

minor comments (2)

[Methods] The paper would benefit from an appendix containing the exact prompts and image presentation protocol used for each MLLM to support reproducibility.
[Methods] Clarify how 'stress' was operationalized in the model prompts versus the original human study (e.g., rating scale, number of stimuli) to allow direct comparison.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their insightful comments, which have helped us improve the clarity and rigor of our work. We address each major comment below and indicate the revisions made to the manuscript.

Referee: Abstract and Results sections: the claim of 'performance comparable to that of human experts' is stated without quantitative metrics, statistical tests, exact sample sizes, error distributions, confusion matrices, or correlation coefficients between MLLM ratings and the original human-subject data.

Authors: We acknowledge that the original manuscript could benefit from more explicit quantitative evidence. The paper compares MLLM performance to the expert and non-expert levels reported in the original human study using the same task and scales. To strengthen this, we have revised the Results section to include exact sample sizes (number of network layouts evaluated per model), mean ratings with standard deviations, and direct comparisons to the published human means. We added Pearson correlation coefficients where aggregate data allowed, and statistical significance tests for differences from expert performance. A confusion matrix for binary high/low stress classification has also been included. We note that without access to the raw per-participant human data, item-level correlations are not possible, but the aggregate metrics support the comparability claim. revision: yes
Referee: Methods: the assumption that textual instructions plus images given to the MLLMs constitute an equivalent perceptual task to the training and practice trials provided to human experts is not verified; no quantitative comparison (e.g., rating distributions or agreement measures) is reported to confirm functional equivalence rather than superficial similarity.

Authors: This point highlights an important distinction. We have added to the Methods and Results sections quantitative comparisons of rating distributions between MLLMs and human experts, including histograms and statistical tests for distribution similarity (e.g., Kolmogorov-Smirnov test). We also report inter-rater agreement measures such as Cronbach's alpha or ICC between model outputs and human data where feasible. The revised text clarifies that while the MLLM task is not identical to human training (lacking practice trials), the similar performance and reasoning patterns (from explanations) suggest a functional parallel. We discuss this as a limitation in the paper. revision: yes
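The rebuttal names the Kolmogorov-Smirnov test as one way to compare rating distributions. As a hedged illustration, the two-sample KS statistic is the largest vertical gap between the two empirical CDFs. The sketch below computes only that statistic (no p-value), and the function name is illustrative rather than taken from the paper.

```python
from bisect import bisect_right


def ks_statistic(sample_a, sample_b):
    """Two-sample KS statistic: maximum gap between the empirical CDFs."""
    if not sample_a or not sample_b:
        raise ValueError("both samples must be non-empty")
    sorted_a, sorted_b = sorted(sample_a), sorted(sample_b)
    largest_gap = 0.0
    # The empirical CDFs only change at observed values, so checking
    # every observed value is enough to find the maximum gap.
    for value in set(sorted_a) | set(sorted_b):
        cdf_a = bisect_right(sorted_a, value) / len(sorted_a)
        cdf_b = bisect_right(sorted_b, value) / len(sorted_b)
        largest_gap = max(largest_gap, abs(cdf_a - cdf_b))
    return largest_gap
```

A value near 0 means the two samples are distributed similarly; a value of 1 means they do not overlap at all. In practice a library routine such as SciPy's `ks_2samp` would also supply the p-value.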

Circularity Check

0 steps flagged

No circularity: empirical replication against external human dataset

full rationale

The paper performs an empirical replication by supplying MLLMs with the same textual instructions and network layout images used in a prior human-subject study, then directly compares performance metrics (e.g., stress perception accuracy) to the published human results. No equations, fitted parameters, or self-defined quantities are introduced that reduce the reported outcomes to author choices by construction. The central claim rests on measurement against an independent external benchmark rather than any self-referential derivation, self-citation load-bearing premise, or renaming of known results. This matches the default case of a self-contained experimental comparison with no reduction to inputs.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the unverified modeling assumption that human training and MLLM prompting are functionally equivalent for this perceptual task; no free parameters or new entities are introduced.

axioms (1)

Domain assumption: Multimodal LLMs can receive and reason over static images of network layouts in a manner comparable to human visual perception when given textual instructions.
Invoked when the authors supply the same study information to the models as to human participants.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
Paper passage: "MLLMs seem to rely on visual proxies rather than computing the actual value of stress... explanations... similar to those used by the human participants (e.g., an even distribution of nodes and uniform edge lengths)"

IndisputableMonolith/Foundation/AlexanderDuality.lean · alexander_duality_circle_linking · unclear
Paper passage: "replicate a human-subject experiment about perceiving quality (namely stress) in network layouts"

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

95 extracted references · 95 canonical work pages

[1]

Stevie Bergman, Jennifer Chien, Mark Díaz, Seliem El-Sayed, Jaylen Pittman, Shakir Mohamed, and Kevin R

W. Agnew, A. S. Bergman, J. Chien, M. Díaz, S. El-Sayed, J. Pittman, S. Mohamed, and K. R. McKee. The illusion of artificial inclusion. In F. F. Mueller, P. Kyburz, J. R. Williamson, C. Sas, M. L. Wilson, P. O. T. Dugas, and I. Shklovski, eds., CHI 2024, pp. 286:1–286:12. ACM, 2024. doi: 10.1145/3613904.3642703 2, 9

work page doi:10.1145/3613904.3642703 2024
[2]

A. R. Ahmed, F. De Luca, S. Devkota, S. G. Kobourov, and M. Li. Multi- criteria scalable graph drawing via stochastic gradient descent, (SGD)2. IEEE Trans. Vis. Comput. Graph., 28(6):2388–2399, 2022. doi: 10.1109/ TVCG.2022.3155564 8

work page arXiv 2022
[3]

Arleo, S

A. Arleo, S. Miksch, and D. Archambault. Event-based dynamic graph drawing without the agonizing pain. Comput. Graph. Forum, 41(6):226– 244, 2022. doi: 10.1111/CGF.14615 3

work page doi:10.1111/cgf.14615 2022
[4]

Aubin Le Quéré, H

M. Aubin Le Quéré, H. Schroeder, C. Randazzo, J. Gao, Z. Epstein, S. T. Perrault, D. Mimno, L. Barkhuus, and H. Li. LLMs as research tools: Applications and evaluations in HCI data work. In F. F. Mueller, P. Kyburz, J. R. Williamson, and C. Sas, eds., CHI EA 2024, pp. 479:1–479:7. ACM,

work page 2024
[5]

doi: 10.1145/3613905.3636301 2

work page doi:10.1145/3613905.3636301
[6]

Bendeck and J

A. Bendeck and J. T. Stasko. An empirical evaluation of the GPT-4 multimodal language model on visualization literacy tasks. IEEE Trans. Vis. Comput. Graph., 31(1):1105–1115, 2025. doi: 10.1109/TVCG.2024. 3456155 1, 2, 9

work page doi:10.1109/tvcg.2024 2025
[8]

Brown, B

T. Brown, B. Mann, N. Ryder, M. Subbiah, J. D. Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam, G. Sastry, A. Askell, S. Agarwal, A. Herbert- V oss, G. Krueger, T. Henighan, R. Child, A. Ramesh, D. Ziegler, J. Wu, C. Winter, C. Hesse, M. Chen, E. Sigler, M. Litwin, S. Gray, B. Chess, J. Clark, C. Berner, S. McCandlish, A. Radford, I. Sutskever, and D. Amode...

work page 2020
[9]

L. Chen, J. Li, X. Dong, P. Zhang, Y . Zang, Z. Chen, H. Duan, J. Wang, Y . Qiao, D. Lin, and F. Zhao. Are we on the right way for evaluating large vision-language models? In NeurIPS 2024, pp. 27056–27087. Curran Associates, Inc., 2024. 4

work page 2024
[10]

Z. Chen, C. Zhang, Q. Wang, J. Troidl, S. Warchol, J. Beyer, N. Gehlen- borg, and H. Pfister. Beyond generating code: Evaluating GPT on a data visualization course. In EduVis 2023, pp. 16–21. IEEE, 2023. doi: 10. 1109/EduVis60792.2023.00009 2

work page arXiv 2023
[11]

Chimani, P

M. Chimani, P. Eades, P. Eades, S. Hong, W. Huang, K. Klein, M. Marner, R. T. Smith, and B. H. Thomas. People prefer less stress and fewer crossings. In C. Duncan and A. Symvonis, eds., GD 2014, vol. 8871 of LNCS, pp. 523–524. Springer, 2014. 3

work page 2014
[12]

De Luca, M

F. De Luca, M. I. Hossain, and S. G. Kobourov. Symmetry detection and classification in drawings of graphs. In D. Archambault and C. D. Tóth, eds., GD 2019, vol. 11904 of LNCS, pp. 499–513. Springer, 2019. doi: 10. 1007/978-3-030-35802-0_38 2

work page 2019
[13]

Di Bartolomeo, T

S. Di Bartolomeo, T. Crnovrsanin, D. Saffo, E. Puerta, C. Wilson, and C. Dunne. Evaluating graph layout algorithms: A systematic review of methods and best practices. Comput. Graph. Forum, 43(6), 2024. doi: 10. 1111/CGF.15073 3

work page 2024
[14]

Di Bartolomeo, G

S. Di Bartolomeo, G. Severi, V . Schetinger, and C. Dunne. Ask and you shall receive (a graph drawing): Testing ChatGPT’s potential to apply graph layout algorithms. In T. Höllt, W. Aigner, and B. Wang, eds., EuroVis 2023, pp. 79–83. Eurographics Association, 2023. doi: 10.2312/ EVS.20231047 2

work page 2023
[15]

Dragicevic

P. Dragicevic. Fair statistical communication in HCI. In J. Robertson and M. Kaptein, eds., Modern Statistical Methods for HCI, pp. 291–330. Springer, Cham, 2016. doi: 10.1007/978-3-319-26633-6_13 5

work page doi:10.1007/978-3-319-26633-6_13 2016
[16]

P. Duan, J. Warner, Y . Li, and B. Hartmann. Generating automatic feedback on UI mockups with large language models. In F. F. Mueller, P. Kyburz, J. R. Williamson, C. Sas, M. L. Wilson, P. O. T. Dugas, and I. Shklovski, eds., CHI 2024, pp. 6:1–6:20. ACM, 2024. doi: 10.1145/ 3613904.3642782 2

work page arXiv 2024
[17]

El-Kishky, A

A. El-Kishky, A. Wei, A. Saraiva, B. Minaiev, D. Selsam, D. Dohan, F. Song, H. Lightman, I. C. Gilaberte, J. Pachocki, J. Tworek, L. Kuhn, L. Kaiser, M. Chen, M. Schwarzer, M. Rohaninejad, N. McAleese, o3 con- tributors, O. Mürk, R. Garg, R. Shu, S. Sidor, V . Kosaraju, and W. Zhou. Competitive programming with large reasoning models. arXiv preprint, abs/...

work page doi:10.48550/arxiv.2502.06807 2025
[18]

Förster, F

H. Förster, F. Klesen, T. Dwyer, P. Eades, S. Hong, S. G. Kobourov, G. Liotta, K. Misue, F. Montecchiani, A. Pastukhov, and F. Schreiber. GraphTrials: Visual proofs of graph properties. In S. Felsner and K. Klein, eds., GD 2024, vol. 320 of LIPIcs, pp. 16:1–16:18. Schloss Dagstuhl,

work page 2024
[19]

doi: 10.4230/LIPICS.GD.2024.16 1, 3

work page doi:10.4230/lipics.gd.2024.16 2024
[20]

E. R. Gansner, Y . Hu, and S. C. North. A maxent-stress model for graph layout. IEEE Trans. Vis. Comput. Graph., 19(6):927–940, 2013. doi: 10. 1109/TVCG.2012.299 3

work page 2013
[21]

E. R. Gansner, Y . Koren, and S. C. North. Graph drawing by stress majorization. In J. Pach, ed., GD 2004, vol. 3383 of LNCS, pp. 239–250. Springer, 2004. doi: 10.1007/978-3-540-31843-9_25 1, 3

work page doi:10.1007/978-3-540-31843-9_25 2004
[22]

Z. Gao, C. Jiang, J. Zhang, X. Jiang, L. Li, P. Zhao, H. Yang, Y . Huang, and J. Li. Hierarchical graph learning for protein–protein interaction. Nature Communications, 14(1):1093, 2023. doi: 10.1038/s41467-023-36736-1 2

work page doi:10.1038/s41467-023-36736-1 2023
[23]

Giovannangeli, F

L. Giovannangeli, F. Lalanne, D. Auber, R. Giot, and R. Bourqui. Deep neural network for drawing networks, (DNN )2. In H. C. Purchase and I. Rutter, eds., GD 2021, vol. 12868 of LNCS, pp. 375–390. Springer, 2021. doi: 10.1007/978-3-030-92931-2_27 2

work page doi:10.1007/978-3-030-92931-2_27 2021
[24]

Grötschla, J

F. Grötschla, J. Mathys, R. Veres, and R. Wattenhofer. CoRe-GD: A hierarchical framework for scalable graph visualization with GNNs. In ICLR 2024. OpenReview.net, 2024. doi: 10.48550/ARXIV.2402.06706 2

work page doi:10.48550/arxiv.2402.06706 2024
[25]

Hämäläinen, M

P. Hämäläinen, M. Tavast, and A. Kunnari. Evaluating large language mod- els in generating synthetic HCI research data: a case study. In A. Schmidt, K. Väänänen, T. Goyal, P. O. Kristensson, A. Peters, S. Mueller, J. R. Williamson, and M. L. Wilson, eds., CHI 2023, pp. 433:1–433:19. ACM,

work page 2023
[26]

doi: 10.1145/3544548.3580688 2, 9

work page doi:10.1145/3544548.3580688
[27]

J. Hong, C. Seto, A. Fan, and R. Maciejewski. Do LLMs have visualization literacy? An evaluation on modified visualizations to test generalization in data interpretation. IEEE Trans. Vis. Comput. Graph., 2025. doi: 10. 1109/TVCG.2025.3536358 1, 2, 5

work page arXiv 2025
[28]

S. Hong, P. Eades, M. Torkel, Z. Wang, D. Chae, S. Hong, D. Langerenken, and H. Chafi. Multi-level graph drawing using infomap clustering. In D. Archambault and C. D. Tóth, eds., GD 2019, vol. 11904 of LNCS, pp. 139–146. Springer, 2019. doi: 10.1007/978-3-030-35802-0_11 3

work page doi:10.1007/978-3-030-35802-0_11 2019
[29]

Kamada and S

T. Kamada and S. Kawai. An algorithm for drawing general undirected graphs. Inf. Process. Lett., 31(1):7–15, 1989. doi: 10.1016/0020-0190(89) 90102-6 3

work page doi:10.1016/0020-0190(89 1989
[30]

Klammler, T

M. Klammler, T. Mchedlidze, and A. Pak. Aesthetic discrimination of graph layouts. In T. Biedl and A. Kerren, eds., GD 2018, vol. 11282 of LNCS, pp. 169–184. Springer, 2018. doi: 10.1007/978-3-030-04414-5_12 2

work page doi:10.1007/978-3-030-04414-5_12 2018
[31]

J. F. Kruiger, P. E. Rauber, R. M. Martins, A. Kerren, S. G. Kobourov, and A. C. Telea. Graph layouts by t-SNE. Comput. Graph. Forum, 36(3):283– 294, 2017. doi: 10.1111/CGF.13187 1, 3

work page doi:10.1111/cgf.13187 2017
[32]

J. B. Kruskal. Multidimensional scaling by optimizing goodness of fit to a nonmetric hypothesis. Psychometrika, 29(1):1–27, 1964. doi: 10. 1007/BF02289565 3

work page 1964
[34]

Lee, S.-H

S. Lee, S.-H. Kim, and B. C. Kwon. VLAT: Development of a Visu- alization Literacy Assessment Test. IEEE Trans. Vis. Comput. Graph., 23(1):551–560, 2017. doi: 10.1109/TVCG.2016.2598920 2

work page doi:10.1109/tvcg.2016.2598920 2017
[35]

Z. Li, H. Miao, V . Pascucci, and S. Liu. Visualization literacy of mul- timodal large language models: A comparative study. arXiv preprint, abs/2407.10996, 2024. doi: 10.48550/arXiv.2407.10996 2

work page doi:10.48550/arxiv.2407.10996 2024
[36]

Liew and K

A. Liew and K. Mueller. Using large language models to generate engaging captions for data visualizations. In NLVIZ 2022, 2022. 2

work page 2022
[37]

L. Y . Lo and H. Qu. How good (or bad) are LLMs at detecting misleading visualizations? IEEE Trans. Vis. Comput. Graph., 31(1):1116–1125, 2025. doi: 10.1109/TVCG.2024.3456333 2

work page doi:10.1109/tvcg.2024.3456333 2025
[38]

Y . Lu, D. Jiang, W. Chen, W. Y . Wang, Y . Choi, and B. Y . Lin. Wildvision: Evaluating vision-language models in the wild with human preferences. In NeurIPS 2024, pp. 48224–48255. Curran Associates, Inc., 2024. 4

work page 2024
[39]

M. R. Marner, R. T. Smith, B. H. Thomas, K. Klein, P. Eades, and S. Hong. GION: Interactively untangling large graphs on wall-sized displays. In C. A. Duncan and A. Symvonis, eds., GD 2014, vol. 8871 of LNCS, pp. 113–124. Springer, 2014. doi: 10.1007/978-3-662-45803-7_10 3

work page doi:10.1007/978-3-662-45803-7_10 2014
[40]

Masry, M

A. Masry, M. Thakkar, A. Bajaj, A. Kartha, E. Hoque, and S. Joty. Chart- Gemma: Visual instruction-tuning for chart reasoning in the wild. In COLING 2025, pp. 625–643. Assoc. f. Comput. Linguistics, 2025. 2

work page 2025
[41]

Mchedlidze, A

T. Mchedlidze, A. Pak, and M. Klammler. Aesthetic discrimination of graph layouts. J. Graph Algorithms Appl., 23(3):525–552, 2019. doi: 10. 7155/JGAA.00501 2

work page 2019
[42]

Miller, V

J. Miller, V . Huroyan, and S. G. Kobourov. Balancing between the local and global structures (LGS) in graph embedding. In M. A. Bekos and M. Chimani, eds., GD 2023, vol. 14465 of LNCS, pp. 263–279. Springer,

work page 2023
[43]

doi: 10.1007/978-3-031-49272-3_18 3

work page doi:10.1007/978-3-031-49272-3_18
[44]

G. J. Mooney, H. C. Purchase, M. Wybrow, S. G. Kobourov, and J. Miller. The perception of stress in graph drawings. In S. Felsner and K. Klein, eds., GD 2024, vol. 320 of LIPIcs, pp. 21:1–21:17. Schloss Dagstuhl,

work page 2024
[45]

doi: 10.4230/LIPICS.GD.2024.21 1, 2, 3, 4, 5, 6, 7, 8, 9

work page doi:10.4230/lipics.gd.2024.21 2024
[46]

Q. H. Nguyen, P. Eades, and S. Hong. On the faithfulness of graph visualizations. In S. Carpendale, W. Chen, and S. Hong, eds., PacificVis 2013, pp. 209–216. IEEE, 2013. doi: 10.1109/PACIFICVIS.2013.6596147 3

work page doi:10.1109/pacificvis.2013.6596147 2013
[47]

A. Noack. Energy models for graph clustering. J. Graph Algorithms Appl., 11(2):453–480, 2007. doi: 10.7155/JGAA.00154 1, 3

work page doi:10.7155/jgaa.00154 2007
[48]

OpenAI prompt engineering best practices, 2024

OpenAI. OpenAI prompt engineering best practices, 2024. Accessed: 2025-03-14. 4

work page 2024
[49]

R. Y . Pang, H. Schroeder, K. S. Smith, S. Barocas, Z. Xiao, E. Tseng, and D. Bragg. Understanding the LLM-ification of CHI: Unpacking the impact of LLMs at CHI through a systematic literature review. In P. Toups-Dugas, B. Lee, and M. Chetty, eds., CHI 2025. ACM, 2025. to appear. doi: 10. 48550/ARXIV.2501.12557 2

work page arXiv 2025
[50]

Pascual-Ferrá, N

P. Pascual-Ferrá, N. Alperstein, and D. J. Barnett. Social network analysis of COVID-19 public discourse on Twitter: implications for risk communi- cation. Disaster medicine and public health preparedness, 16(2):561–569,

work page
[51]

doi: 10.1017/dmp.2020.347 2

work page doi:10.1017/dmp.2020.347 2020
[52]

L. Podo, M. Ishmal, and M. Angelini. Vi(E)va LLM! A conceptual stack for evaluating and interpreting generative AI-based visualizations. arXiv preprint, abs/2402.02167, 2024. doi: 10.48550/ARXIV.2402.02167 2

work page doi:10.48550/arxiv.2402.02167 2024
[53]

M. Prpa, G. M. Troiano, M. Wood, and Y . Coady. Challenges and op- portunities of LLM-based synthetic personae and data in HCI. In F. F. Mueller, P. Kyburz, J. R. Williamson, and C. Sas, eds.,CHI EA 2024, pp. 461:1–461:5. ACM, 2024. doi: 10.1145/3613905.3636293 2

work page doi:10.1145/3613905.3636293 2024
[54]

H. C. Purchase. Metrics for graph drawing aesthetics. J. Vis. Lang. Comput., 13(5):501–516, 2002. doi: 10.1006/JVLC.2002.0232 3

work page doi:10.1006/jvlc.2002.0232 2002
[55]

H. C. Purchase, D. A. Carrington, and J. Allder. Empirical evaluation of aesthetics-based graph layout. Empir. Softw. Eng., 7(3):233–255, 2002. doi: 10.1023/A:1016344215610 3

work page doi:10.1023/a:1016344215610 2002
[56]

H. C. Purchase, R. F. Cohen, and M. I. James. Validating graph drawing aesthetics. In F. Brandenburg, ed., GD 1995, vol. 1027 of LNCS, pp. 435–446. Springer, 1995. doi: 10.1007/BFB0021827 8

work page doi:10.1007/bfb0021827 1995
[57]

Sentence- BERT : Sentence Embeddings using S iamese BERT -Networks

N. Reimers and I. Gurevych. Sentence-BERT: Sentence embeddings using Siamese BERT-networks. In K. Inui, J. Jiang, V . Ng, and X. Wan, eds., EMNLP-IJCNLP 2019, pp. 3980–3990. Association for Computational Linguistics, 2019. doi: 10.18653/V1/D19-1410 7

work page doi:10.18653/v1/d19-1410 2019
[58]

Reynolds and K

L. Reynolds and K. McDonell. Prompt programming for large language models: Beyond the few-shot paradigm. In Y . Kitamura, A. Quigley, K. Isbister, and T. Igarashi, eds., CHI EA 2021, pp. 314:1–314:7. ACM,

work page 2021
[59]

doi: 10.1145/3411763.3451760 4

work page doi:10.1145/3411763.3451760
[60]

J. W. Sammon. A nonlinear mapping for data structure analysis. IEEE Trans. Computers, 18(5):401–409, 1969. doi: 10.1109/T-C.1969.222678 3

work page doi:10.1109/t-c.1969.222678 1969
[61]

Schetinger, S

V . Schetinger, S. Di Bartolomeo, M. El-Assady, A. M. McNutt, M. Miller, J. P. A. Passos, and J. L. Adams. Doom or deliciousness: Challenges and opportunities for visualization in the age of generative models. Comput. Graph. Forum, 42(3):423–435, 2023. doi: 10.1111/CGF.14841 2

[62]

H. Shen, T. Li, T. J. Li, J. S. Park, and D. Yang. Shaping the emerging norms of using large language models in social computing research. In C. Fiesler, L. G. Terveen, M. Ames, S. R. Fussell, E. Gilbert, V. Liao, X. Ma, X. Page, M. Rouncefield, V. Singh, and P. J. Wisniewski, eds., CSCW 2023, pp. 569–571. ACM, 2023. doi: 10.1145/3584931.3606955 2

[63]

R. N. Shepard. The analysis of proximities: Multidimensional scaling with an unknown distance function. I. Psychometrika, 27(2):125–140, 1962. doi: 10.1007/BF02289630 3

[65]

P. Simonetto, D. Archambault, and S. G. Kobourov. Drawing dynamic graphs without timeslices. In F. Frati and K. Ma, eds., GD 2017, vol. 10692 of LNCS, pp. 394–409. Springer, 2017. doi: 10.1007/978-3-319-73915-1_31 3

[66]

J. Tang, F. Yang, J. Wu, Y. Wang, J. Zhou, X. Cai, L. Yu, and Y. Wu. A comparative study on fixed-order event sequence visualizations: Gantt, extended Gantt, and stringline charts. IEEE Trans. Vis. Comput. Graph., 30(12):7687–7701, 2024. doi: 10.1109/TVCG.2024.3358919 5

[67]

M. Taylor and P. Rodgers. Applying graphical design techniques to graph visualisation. In IV 2005, pp. 651–656. IEEE, 2005. doi: 10.1109/IV.2005.19 8

[68]

Y. Tian, W. Cui, D. Deng, X. Yi, Y. Yang, H. Zhang, and Y. Wu. ChartGPT: Leveraging LLMs to generate charts from abstract natural language. IEEE Trans. Vis. Comput. Graph., 31(3):1731–1745, 2025. doi: 10.1109/TVCG.2024.3368621 2

[69]

R. J. Tibshirani and B. Efron. An introduction to the bootstrap. Monographs on statistics and applied probability, 57(1):1–436, 1993. 5

[70]

W. S. Torgerson. Multidimensional scaling: I. Theory and method. Psychometrika, 17(4):401–419, 1952. doi: 10.1007/BF02288916 3

[71]

L. van der Maaten and G. Hinton. Visualizing data using t-SNE. Journal of Machine Learning Research, 9(86):2579–2605, 2008. 7

[72]

J. van Dijck. Seeing the forest for the trees: Visualizing platformization and its governance. New Media Soc., 23(9), 2021. doi: 10.1177/1461444820940293 2

[73]

S. van Wageningen, T. Mchedlidze, and A. C. Telea. An experimental evaluation of viewpoint-based 3D graph drawing. Comput. Graph. Forum, 43(3), 2024. doi: 10.1111/CGF.15077 3

[74]

P. Vázquez. Are LLMs ready for visualization? In PacificVis 2024, pp. 343–352. IEEE, 2024. doi: 10.1109/PACIFICVIS60374.2024.00049 2

[75]

A. Wang, J. Morgenstern, and J. P. Dickerson. Large language models that replace human participants can harmfully misportray and flatten identity groups. Nature Machine Intelligence, 7:400–411, 2025. doi: 10.1038/s42256-025-00986-z 9

[76]

H. W. Wang, J. Hoffswell, S. M. T. Thane, V. S. Bursztyn, and C. X. Bearfield. How aligned are human chart takeaways and LLM predictions? A case study on bar charts with varying layouts. IEEE Trans. Vis. Comput. Graph., 31(1):536–546, 2025. doi: 10.1109/TVCG.2024.3456378 2

[77]

L. Wang, S. Zhang, Y. Wang, E. Lim, and Y. Wang. LLM4Vis: Explainable visualization recommendation using ChatGPT. In M. Wang and I. Zitouni, eds., EMNLP 2023, pp. 675–692. Assoc. f. Comput. Linguistics, 2023. doi: 10.18653/V1/2023.EMNLP-INDUSTRY.64 2

[78]

X. Wang, K. Yen, Y. Hu, and H. Shen. DeepGD: A deep learning framework for graph drawing using GNN. IEEE Computer Graphics and Applications, 41(5):32–44, 2021. doi: 10.1109/MCG.2021.3093908 2

[79]

X. Wang, K. Yen, Y. Hu, and H. Shen. SmartGD: A GAN-based graph drawing framework for diverse aesthetic goals. IEEE Trans. Vis. Comput. Graph., 30(8):5666–5678, 2024. doi: 10.1109/TVCG.2023.3306356 2

[80]

Y. Wang, Z. Jin, Q. Wang, W. Cui, T. Ma, and H. Qu. DeepDrawing: A deep learning approach to graph drawing. IEEE Trans. Vis. Comput. Graph., 26(1):676–686, 2020. doi: 10.1109/TVCG.2019.2934798 2

[81]

J. Wei, X. Wang, D. Schuurmans, M. Bosma, B. Ichter, F. Xia, E. H. Chi, Q. V. Le, and D. Zhou. Chain-of-thought prompting elicits reasoning in large language models. In NeurIPS 2022, 2022. 4, 5

[82]

E. Welch and S. G. Kobourov. Measuring symmetry in drawings of graphs. Comput. Graph. Forum, 36(3):341–351, 2017. doi: 10.1111/CGF.13192 3



