Toward AI VIS Co-Scientists: A General and End-to-End Agent Harness for Solving Complex Data Visualization Tasks

Chaoli Wang; Haichao Miao; Kaiyuan Tang; Kuangshi Ai; Peer-Timo Bremer; Shusen Liu; Zhimin Li

arxiv: 2605.21825 · v1 · pith:JG72U7PZnew · submitted 2026-05-20 · 💻 cs.AI · cs.HC

Pith reviewed 2026-05-22 08:22 UTC · model grok-4.3

classification 💻 cs.AI cs.HC

keywords multi-agent systems · data visualization · scientific visualization · autonomous agents · visual analysis applications · end-to-end pipeline · SciVis contests

The pith

A collection of specialized AI agents can autonomously design and build functional single-page visualization applications from raw data and high-level task descriptions alone.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows how an end-to-end harness of AI agents works together to handle the full pipeline of scientific visualization: exploring data, planning views, setting up code environments, writing the implementation, checking the interface, and judging whether the result meets the stated goals. Each stage creates documents and instructions that the next agents use, allowing refinement without external fixes. If the approach holds, domain scientists could receive customized interactive tools for complex datasets directly from their questions rather than needing separate visualization experts. The system is demonstrated on IEEE SciVis Contest problems that include ambiguous requirements and varied data types across science and engineering fields.

Core claim

Given only the data and target tasks, the agent harness produces functional single-page VIS Apps with verified linked-view behavior that are highly customized to the domain experts' specified tasks and needs. The harness coordinates agents across exploratory analysis, planning, environment configuration, implementation, interface validation, and task evaluation, with each stage emitting artifacts that guide downstream agents and support iterative refinement.

What carries the argument

The argument rests on the end-to-end agentic harness: a coordinated collection of agents with specialized skills that generate document and instruction artifacts at each stage, enabling autonomous progression from data and task description to a validated visualization application.
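A minimal sketch of the artifact handoff may help make this concrete. It is not the authors' implementation: the six stage names are taken from the core claim above, while `Artifact`, `Workspace`, `stub_agent`, `run_harness`, and the `accept` callback are hypothetical names introduced only for illustration, and each agent is a stub standing in for an LLM-backed agent.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional

# Stage names follow the six stages listed in the core claim.
STAGES = [
    "exploratory_analysis",
    "planning",
    "environment_configuration",
    "implementation",
    "interface_validation",
    "task_evaluation",
]


@dataclass
class Artifact:
    stage: str    # stage that wrote the artifact
    round: int    # refinement round in which it was written
    content: str  # document or instruction text


@dataclass
class Workspace:
    """Hypothetical shared store: the two inputs plus every artifact so far."""
    data_path: str
    task_description: str
    artifacts: List[Artifact] = field(default_factory=list)

    def latest(self, stage: str) -> Optional[Artifact]:
        for artifact in reversed(self.artifacts):
            if artifact.stage == stage:
                return artifact
        return None


# An agent reads the workspace and returns the text of its artifact.
Agent = Callable[[Workspace, int], str]


def stub_agent(stage: str) -> Agent:
    """Placeholder for a real agent; it only records what it could read."""
    def run(ws: Workspace, round_no: int) -> str:
        upstream = [a.stage for a in ws.artifacts if a.round == round_no]
        return f"{stage} (round {round_no}); read: {', '.join(upstream) or 'inputs only'}"
    return run


def run_harness(
    ws: Workspace,
    agents: Dict[str, Agent],
    accept: Callable[[Artifact], bool],
    max_rounds: int = 3,
) -> Optional[int]:
    """Run all stages in order; repeat until the evaluation artifact is accepted.

    Returns the accepting round, or None if no round was accepted.
    """
    for round_no in range(1, max_rounds + 1):
        for stage in STAGES:
            content = agents[stage](ws, round_no)
            ws.artifacts.append(Artifact(stage, round_no, content))
        if accept(ws.latest("task_evaluation")):
            return round_no
    return None
```

Calling `run_harness(Workspace("data.nc", "compare runs"), {s: stub_agent(s) for s in STAGES}, accept=lambda a: a.round >= 2)` runs two full passes. The point of the design is that earlier artifacts stay in the workspace, so a later round can refine rather than restart.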

If this is right

Domain scientists receive ready-to-use interactive visualization tools tailored to their exact questions without writing code or consulting visualization specialists.
The same harness structure can process the ambiguous requirements and mixed data types that appear in real contest problems across multiple scientific fields.
Each agent stage outputs reusable artifacts that allow the system to iterate on the visualization design internally.
The harness forms a concrete building block for larger autonomous AI systems that execute long-horizon scientific workflows from high-level directions.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Extending the harness to output multi-page or web-deployable apps would require only adding new validation agents for layout and performance checks.
Combining this visualization harness with separate agents for experiment design or simulation could close the loop from hypothesis to visual result without human handoff.
The artifact-passing mechanism between agents suggests a general pattern for other complex creative tasks such as report generation or model prototyping.

Load-bearing premise

A team of AI agents can reliably divide and complete the full sequence of data exploration, planning, coding, validation, and evaluation for ambiguous scientific tasks without human intervention or later manual corrections.

What would settle it

Deploy the system on an unseen IEEE SciVis Contest dataset with new data modalities and observe whether it produces a working single-page app with correct linked views and full task completion with zero post-hoc code changes.
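The test above can be written down as an explicit checklist. The sketch below is one hedged way to encode it; `RunRecord` and `failures` are hypothetical names, and how each field would actually be measured (for example, by a browser-based check of linked views) is left open because the paper's procedure is not described here.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class RunRecord:
    """Hypothetical summary of one harness run on a held-out contest dataset."""
    app_loads: bool                # the single-page app opens without errors
    linked_views: Dict[str, bool]  # view pair -> selection propagated correctly
    tasks: Dict[str, bool]         # contest task -> judged complete
    posthoc_code_changes: int      # manual edits made after the run


def failures(run: RunRecord) -> List[str]:
    """List every criterion the run misses; an empty list means the claim held."""
    missed: List[str] = []
    if not run.app_loads:
        missed.append("app does not load")
    if not run.linked_views:
        missed.append("no linked views were checked")
    missed += [f"linked view broken: {k}" for k, ok in run.linked_views.items() if not ok]
    if not run.tasks:
        missed.append("no tasks were evaluated")
    missed += [f"task incomplete: {k}" for k, ok in run.tasks.items() if not ok]
    if run.posthoc_code_changes != 0:
        missed.append(f"{run.posthoc_code_changes} post-hoc code change(s)")
    return missed
```

Treating an empty `linked_views` or `tasks` mapping as a failure is a deliberate choice in this sketch: a run in which nothing was checked should not count as evidence for the claim.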

Figures

Figures reproduced from arXiv: 2605.21825 by Chaoli Wang, Haichao Miao, Kaiyuan Tang, Kuangshi Ai, Peer-Timo Bremer, Shusen Liu, Zhimin Li.

**Figure 1.** VIS co-scientist agent harness diagram and generated visualization applications (VIS Apps). Left: Given a dataset and task description, the main code agent orchestrates a multi-stage workflow: specialized subagents, evaluation procedures, and custom skills. The harness enables closed-loop design with browser-based validation. A hierarchical memory system stores insights and lessons learned across sessions.…

**Figure 2.** The key capabilities for an AI VIS co-scientist. We define a VIS co-scientist as a research assistant who can independently: 1) obtain domain understanding and translate tasks into concrete requirements; 2) carry out self-directed exploratory data analysis and analytical reasoning; 3) design and implement a complete, fully featured interactive visualization tool; and 4) reliably assess the quality an…

**Figure 3.** VIS co-scientist output for the 2025 materials-discovery case

**Figure 1.** VIS App screenshot for the 2021 SciVis Contest case.

**Figure 2.** VIS App screenshot for the 2023 SciVis Contest case.

**Figure 3.** VIS App screenshot for the 2024 SciVis Contest case.

**Figure 4.** VIS App screenshot for the 2025 SciVis Contest case.

**Figure 5.** Baseline coding-agent plots for the 2025 SciVis Contest materials-discovery case. Top left: correlation overview for Challenge 1, Task

**Figure 6.** Additional baseline coding-agent plots for the 2025 SciVis Contest materials-discovery case. Top left: composition–phase–property

Original abstract

The ability to inspect, interpret, and communicate complex data is crucial for virtually any scientific endeavor, but often requires significant expertise outside the core domain ranging from data management and analysis to visualization design and implementation. We present an end-to-end agentic harness that, based on only the data and a high level description of the tasks, independently designs custom visual analysis applications (VIS apps). This represents an important step towards a general AI co-scientist envisioned by many as an autonomous system that can autonomously execute long horizon tasks based on high-level directions. Our proposed VIS co-scientist is an essential component of this broader AI co-scientist vision: a harness that can autonomously analyze data and design visualization solutions using a collection of agents and specialized skills that coordinate exploratory analysis, plan, configure the environment, implement, validate the interface, and most importantly evaluate the overall task completion. Each stage produces document and instruction artifacts that guide downstream work and enable iterative refinement. We validate this approach on IEEE SciVis Contests spanning multiple science and engineering fields. These contests serve as ideal proving grounds because they encode real-world complexity: ambiguous requirements, diverse data modalities, design trade-offs, and task-driven validation. Given only the data and target tasks, our system autonomously produces functional single-page VIS Apps with verified linked-view behavior, highly customized to domain experts' specified tasks and needs.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

This paper describes a multi-agent harness for turning data and task descriptions into custom visualization apps, but the abstract gives no metrics or comparisons to back the autonomy claim.

The main takeaway is a system that uses specialized AI agents to handle the full pipeline from data exploration to building and validating single-page VIS apps with linked views. It takes only the raw data and high-level task goals as input and aims to produce functional outputs tailored to domain needs without ongoing human fixes. The work frames this as a concrete piece of the larger AI co-scientist idea for scientific workflows.

What is new is the end-to-end harness built specifically around visualization tasks, with clear stages for analysis, planning, environment configuration, implementation, interface validation, and overall evaluation. Each stage hands off documents and instructions to the next, allowing iterative refinement. Choosing IEEE SciVis Contests as the test cases is sensible because those problems include ambiguous requirements, mixed data types, and real design trade-offs across science and engineering fields.

The paper does a solid job laying out a practical architecture that could reduce the expertise needed for custom visualization in analysis pipelines. The description of verified linked-view behavior and customization to expert-specified tasks shows attention to what actually matters for usable outputs.

The soft spots are mainly around evidence. The abstract states that the approach was validated on the contests, yet it supplies no success rates, failure modes, performance numbers, or comparisons to simpler baselines or existing tools. Without those details it is difficult to assess how reliably the agent coordination works on vague or complex inputs. The assumption that a collection of agents can manage everything autonomously from exploratory analysis through final evaluation is plausible but rests on the unshown results. If the full paper includes concrete interaction logs, quantitative outcomes, or examples of how it handled specific contest tasks, that would address the gap.

This is the kind of work that would interest people building agent systems for domain-specific scientific tasks or visualization researchers looking for automation aids. A reader focused on practical AI applications in data analysis would find the setup worth examining. I would send it for peer review so the authors can add the missing quantitative support and let referees check the implementation details.

Referee Report

1 major / 0 minor

Summary. The paper presents an end-to-end agentic harness for autonomously designing custom single-page visual analysis applications (VIS apps) from input data and high-level task descriptions. A collection of specialized AI agents coordinates stages including exploratory analysis, planning, environment configuration, implementation, interface validation, and task evaluation, with each stage producing document and instruction artifacts to enable iterative refinement. The system is claimed to produce functional VIS apps with verified linked-view behavior, highly customized to domain experts' needs. Validation is described on IEEE SciVis Contests spanning multiple science and engineering fields, which involve ambiguous requirements, diverse data modalities, and design trade-offs.

Significance. If substantiated, the work would constitute a meaningful step toward general AI co-scientists for long-horizon scientific tasks, specifically by automating the expertise-intensive pipeline of data analysis and visualization design. The artifact-handoff architecture for multi-agent coordination is a plausible mechanism for managing ambiguity and trade-offs without human intervention. The choice of SciVis contests as a testbed is appropriate for demonstrating real-world applicability. Credit is due for framing the system as a general, end-to-end harness rather than a narrow tool.

major comments (1)

[Abstract] The statement "We validate this approach on IEEE SciVis Contests" supplies no performance metrics, success rates, failure modes, or comparison baselines. This omission leaves the central claim, that the harness autonomously produces functional, verified VIS apps on ambiguous tasks, without visible quantitative or qualitative supporting evidence.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for their positive evaluation of the work's significance and for the constructive comment on the abstract. We address the point below and will make the corresponding revision.

Referee: [Abstract] The statement "We validate this approach on IEEE SciVis Contests" supplies no performance metrics, success rates, failure modes, or comparison baselines. This omission leaves the central claim, that the harness autonomously produces functional, verified VIS apps on ambiguous tasks, without visible quantitative or qualitative supporting evidence.

Authors: We agree that the abstract would be strengthened by including a concise summary of the validation outcomes. The full manuscript reports quantitative results (task completion rates and verified linked-view success across multiple SciVis contest entries), qualitative case studies, and analysis of failure modes and design trade-offs in Sections 5 and 6. We will revise the abstract to incorporate key metrics and a brief statement of the evidence supporting the central claim, ensuring the validation is visible at the highest level of the paper.

Revision: yes

Circularity Check

0 steps flagged

No significant circularity; system description is self-contained

The paper describes an end-to-end multi-agent architecture for generating visualization applications from input data and task descriptions. No equations, fitted parameters, or mathematical derivations are present in the provided text. Claims rest on the implemented harness producing functional outputs, with validation on external SciVis contests rather than any self-referential reduction or self-citation chain that defines the result by construction. This is a standard systems paper whose central contribution is the working pipeline itself, independent of the circularity patterns listed.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The central claim rests on the untested premise that current AI agents possess sufficient reliability and coordination to autonomously manage the full visualization pipeline for complex, ambiguous scientific tasks.

axioms (1)

domain assumption: AI agents can autonomously perform and coordinate exploratory analysis, planning, environment configuration, implementation, validation, and task evaluation for visualization design.
This premise is invoked throughout the abstract as the basis for the harness producing functional apps from data and high-level descriptions alone.

invented entities (1)

VIS co-scientist harness (no independent evidence)
purpose: Autonomous system that designs and implements custom visualization applications.
Introduced as the proposed framework without external falsifiable predictions beyond the contest validation mentioned.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction
Relation between the paper passage and the cited Recognition theorem: unclear
Paper passage: "We present an end-to-end agentic harness that, based on only the data and a high level description of the tasks, independently designs custom visual analysis applications (VIS apps)."

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel
Relation between the paper passage and the cited Recognition theorem: unclear
Paper passage: "The main design loop consists of three roles: Planner, VIS Designer, and Evaluator... Each stage produces document and instruction artifacts that guide downstream work and enable iterative refinement."

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

