arxiv: 2606.00384 · v2 · pith:JRVRNJPGnew · submitted 2026-05-29 · 💻 cs.AI · cs.CL · cs.CV · cs.LG · stat.CO

VESTA: Visual Exploration with Statistical Tool Agents

Pith reviewed 2026-06-28 21:56 UTC · model grok-4.3

classification 💻 cs.AI · cs.CL · cs.CV · cs.LG · stat.CO
keywords VESTA · dynamic tool creation · vision-language models · statistical model fitting · DAWN benchmark · agentic systems · data visualization · astronomy modeling

The pith

VESTA lets vision-language models create and reuse their own diagnostic tools during statistical model fitting, outperforming fixed-tool agent systems especially on complex tasks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents VESTA as a way to improve automated fitting of quantitative models to data by giving vision-language models a growing library of tools they generate themselves. These tools handle data transformations, hypothesis-driven visualizations, and statistical tests, and they stay available in the model's context for later reuse instead of relying only on iterative critique. The authors test this against baselines on the new DAWN benchmark, which includes distribution fitting, time series modeling, and real astronomy problems such as initial mass functions and gravitational-wave signals. Dynamic tool creation produces larger gains than static expert tools or no tools at all, with the biggest differences appearing on harder and more specialized tasks. The generated tools also turn out more sophisticated than those from prior visual tool-creation methods, favoring outputs the model can inspect directly.

Core claim

VESTA demonstrates that endowing VLMs with the ability to dynamically select or write diagnostic tools for data exploration and model refinement leads to better performance than prior agentic pipelines that use only critique loops or fixed tool sets. The largest improvements occur on complex and domain-specific tasks in the DAWN benchmark. Dynamically generated tools cover more diagnostic categories per function and show a strong preference for visual outputs that support direct reasoning by the VLM critic.

What carries the argument

The dynamic tool creation and accumulation mechanism, in which the VLM writes or selects new functions for transformations, visualizations, and tests that persist in context for reuse across refinement steps.
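
The select-or-create mechanism described above can be pictured with a small sketch. Everything here is hypothetical illustration rather than VESTA's actual implementation: the names `Tool`, `ToolLibrary`, `get_or_create`, and `log_summary` are invented for this example, and simple keyword matching stands in for the VLM's own choice of tool.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional, Tuple
import math
import statistics


@dataclass
class Tool:
    """A diagnostic function plus the description kept in the model's context."""
    name: str
    description: str
    fn: Callable[[List[float]], dict]


@dataclass
class ToolLibrary:
    """Hypothetical accumulating toolkit: tools persist once they are added."""
    tools: Dict[str, Tool] = field(default_factory=dict)

    def add(self, tool: Tool) -> None:
        self.tools[tool.name] = tool

    def select(self, keyword: str) -> Optional[Tool]:
        # Keyword matching is a stand-in for the VLM picking an existing tool.
        for tool in self.tools.values():
            if keyword.lower() in tool.description.lower():
                return tool
        return None

    def context_summary(self) -> str:
        # Text that would be placed in the prompt so later steps can reuse tools.
        return "\n".join(f"{t.name}: {t.description}" for t in self.tools.values())


def log_summary(data: List[float]) -> dict:
    """Example transformation tool: summary statistics of log-transformed data."""
    logs = [math.log(x) for x in data if x > 0]
    return {"mean_log": statistics.fmean(logs), "sd_log": statistics.stdev(logs)}


def make_log_tool() -> Tool:
    # Stand-in for the step where the model writes a new tool.
    return Tool("log_summary", "log transform summary for heavy-tailed data", log_summary)


def get_or_create(library: ToolLibrary, keyword: str,
                  factory: Callable[[], Tool]) -> Tuple[Tool, bool]:
    """Reuse a matching tool if one exists, otherwise create and store one."""
    tool = library.select(keyword)
    created = tool is None
    if created:
        tool = factory()
        library.add(tool)
    return tool, created


library = ToolLibrary()
data = [1.0, 2.0, 4.0, 8.0]
tool_a, created_a = get_or_create(library, "log transform", make_log_tool)  # created
tool_b, created_b = get_or_create(library, "log transform", make_log_tool)  # reused
result = tool_a.fn(data)
print(library.context_summary())
```

The second call returns the stored tool instead of building a new one, which is the reuse behavior the pith attributes to the accumulating toolkit.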

If this is right

  • VESTA with dynamic tools outperforms no-tool and static-expert-tool baselines across the evaluated tasks.
  • The performance gap widens on complex and domain-specific problems such as astronomy modeling.
  • Dynamically generated tools are more sophisticated than those from existing visual tool-creation systems, spanning more diagnostic categories and favoring visual outputs.
  • Tools accumulate in context and remain available for later reuse during iterative refinement.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • If the tool-creation process scales, the same framework could be applied to other iterative scientific workflows that currently require extensive human-written diagnostics.
  • The preference for visual outputs suggests the method may integrate naturally with existing vision-language capabilities rather than requiring separate text-only pipelines.
  • Over repeated use on similar data types, the growing tool library might reduce the need for fresh tool invention on each new modeling problem.

Load-bearing premise

The specific modeling tasks in the DAWN benchmark, including the astronomy examples, represent the kinds of challenges where dynamic tools give a real advantage over static or no-tool approaches.

What would settle it

Running the same three toolkit configurations on a fresh collection of distribution-fitting or time-series tasks drawn from a different domain, such as particle physics or financial data, and finding that the dynamic-tool version loses its performance edge.

Figures

Figures reproduced from arXiv: 2606.00384 by Abhishek Divekar, Greg Durrett, Junyi Jessy Li, Kanishk Jain, Kyle Mahowald, Matthew Lease, Sebastian Joseph, Stella S. R. Offner, William Rudman.

Figure 1: Overview of VESTA. By effectively using and creating tools, VESTA produces a probabilistic PyMC program that models the input data.
… model definition and fitting to data using an efficient Markov Chain Monte Carlo (MCMC) method. Model-building with PyMC code is compositional, allowing for complex distributions to be constructed by combining simpler components. VESTA instantiates a loop of proposing models,…
Figure 3: DAWN’s Astro distribution fitting tasks. Example of Initial Mass Functions projected into log-log space. Distributions become visually distinct only when projected into log-log space.
4 The DAWN Benchmark. Distribution fitting and time series modeling are two key data science modeling tasks that appear consistently across scientific disciplines. We select these domains because they allow us to benchmark AI …
Figure 2: Sample inputs from both domains and all dataset splits in DAWN. Easy splits contain easily recognizable forms. Hard tasks contain mixtures of distinct forms, and Astro tasks reflect real-world astronomy challenges that require additional analysis beyond simple visualization to solve.
The Easy tasks in distribution fitting consist of identifying the family and associated parameters for a unimodal distrib…
Figure 4: [Top] Average Jensen-Shannon divergence (↓ better) between the ground-truth distribution and the probability density function of the proposed PyMC model on the Distribution Fitting task of DAWN. [Bottom] Average ELPD-LOO (↑ better) for the Time Series Modeling task of DAWN, computed via leave-one-out cross-validation. Error bars denote ±1 standard error of the mean.
Figure 5: Example of the output from a VESTA-generated tool. This tool composes multiple functions to analyze a heavy-tailed distribution. This multi-panel visualization output is fed back into VESTA to generate better hypotheses. Panel titles are enlarged for clarity and panel numbers are added manually.
Figure 6: Jensen–Shannon divergence on the Hard distribution fitting split (lower is better) comparing three accumulated toolkit conditions with Claude Sonnet 4.6.
… presents JS divergence scores across the three toolkit conditions on the Hard distribution fitting split. Overall, performance differences across conditions are modest, with all three variants achieving mean JS divergence between 0.106 and 0.124, and o…
Figure 7: Critique-stage prompt used by VESTA for time series modeling.
Figure 8: Critique-stage prompt used by VESTA for distribution fitting.
Figure 9: Tool-selection prompt used by the Generate-Tools stage of V…
Figure 10: Tool-creation prompt used by the Generate-Tools stage of V…
Figure 11: Prompt used by the Summarize stage of VESTA.
Figure 12: Proposal prompt used by the BoxLM baseline for time series modeling.
Figure 13: Proposal prompt used by the BoxLM baseline for distribution fitting.
Figure 14: Critic prompt used by the BoxLM baseline for both distribution fitting and time series…
Figure 15: Agent prompt used by the PyVision baseline for both distribution fitting and time series…
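
The Figure 4 and Figure 6 captions score distribution fits by Jensen-Shannon divergence between the ground-truth distribution and the proposed model's density. As a rough sketch of what such a score measures, the code below discretizes two densities on a shared grid and compares them. The grid, the base-2 logarithm (which bounds the score in [0, 1]), and the Gaussian example densities are assumptions made for illustration; the paper's exact computation is not given on this page.

```python
import math


def normal_pdf(x, mu, sigma):
    """Density of a normal distribution, used only to build example inputs."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))


def normalize(weights):
    """Turn non-negative grid values into a discrete probability vector."""
    total = sum(weights)
    return [w / total for w in weights]


def kl_divergence(p, q):
    """KL(p || q) in bits; terms with p_i == 0 contribute nothing."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)


def js_divergence(p, q):
    """Jensen-Shannon divergence: average KL of p and q to their mixture m."""
    m = [0.5 * (pi + qi) for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)


# Shared evaluation grid on [-10, 10] (an assumption for this sketch).
grid = [-10.0 + 0.01 * i for i in range(2001)]
truth = normalize([normal_pdf(x, 0.0, 1.0) for x in grid])
good_fit = normalize([normal_pdf(x, 0.1, 1.0) for x in grid])  # nearly right
poor_fit = normalize([normal_pdf(x, 3.0, 1.0) for x in grid])  # badly shifted

js_good = js_divergence(truth, good_fit)
js_poor = js_divergence(truth, poor_fit)
print(f"good fit: {js_good:.4f}  poor fit: {js_poor:.4f}")
```

A lower value means the proposed density sits closer to the ground truth, which matches the "lower is better" reading in the captions.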
Original abstract

Fitting quantitative models to data is a central step in scientific workflows, yet it remains one of the least automated. Recent agent-based systems leverage language and vision-language models (VLMs) to iteratively propose and refine statistical models, but these systems struggle on more challenging modeling tasks. To address these limitations, we introduce VESTA: Visual Exploration with Statistical Tool Agents, a framework that equips VLMs with a dynamically growing exploration toolkit to guide model refinement through data transformations, hypothesis-driven visualizations, and robust statistical tests. Unlike prior systems that rely on iterative critique alone, VESTA actively explores data before and during refinement by selecting or creating diagnostic tools, which accumulate in the model's context and can be reused later. We evaluate VESTA against established baselines in three toolkit configurations: no tools, static expert-written tools, and dynamic model-written tools. To support this evaluation, we introduce DAWN (Dataset for Automated Workflows and Numerical Modeling), a benchmark targeting distribution fitting and time series modeling with varying difficulty tiers, and culminating in real-world astronomy tasks including modeling initial mass functions and gravitational-wave chirp signals. We find that VESTA's dynamic tool creation outperforms prior agentic pipelines, with the largest gains on complex and domain-specific tasks. We further show that dynamically generated tools are substantially more sophisticated than those produced by existing visual tool-creation systems, covering more diagnostic categories per function and strongly preferring visual outputs that the VLM critic can reason over directly.
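
The abstract names data transformations among the diagnostic tools and initial mass functions among the tasks, and the Figure 3 caption notes that such distributions only become visually distinct in log-log space. The sketch below illustrates that idea under stated assumptions: it draws synthetic samples from a truncated power law, bins them logarithmically, and recovers the exponent as the slope of a straight line in log-log space. The exponent 2.35 is the classic Salpeter value; the mass range, bin count, sample size, and helper names are arbitrary choices for this example and do not come from DAWN.

```python
import math
import random


def sample_power_law(n, alpha, lo, hi, rng):
    """Inverse-CDF sampling from p(m) proportional to m**(-alpha) on [lo, hi], alpha != 1."""
    a = 1.0 - alpha
    lo_a, hi_a = lo ** a, hi ** a
    return [(lo_a + rng.random() * (hi_a - lo_a)) ** (1.0 / a) for _ in range(n)]


def log_binned_density(samples, lo, hi, n_bins):
    """Histogram with log-spaced bins; returns (log10 center, log10 density) pairs."""
    log_lo, log_hi = math.log10(lo), math.log10(hi)
    step = (log_hi - log_lo) / n_bins
    counts = [0] * n_bins
    for m in samples:
        idx = min(int((math.log10(m) - log_lo) / step), n_bins - 1)
        counts[idx] += 1
    points = []
    for i, c in enumerate(counts):
        if c == 0:
            continue  # empty bins have no defined log density
        left = 10 ** (log_lo + i * step)
        right = 10 ** (log_lo + (i + 1) * step)
        center = math.sqrt(left * right)  # geometric bin center
        density = c / (len(samples) * (right - left))
        points.append((math.log10(center), math.log10(density)))
    return points


def fit_slope(points):
    """Ordinary least-squares slope of y against x."""
    n = len(points)
    mx = sum(x for x, _ in points) / n
    my = sum(y for _, y in points) / n
    sxy = sum((x - mx) * (y - my) for x, y in points)
    sxx = sum((x - mx) ** 2 for x, _ in points)
    return sxy / sxx


ALPHA = 2.35              # Salpeter exponent, used here as the synthetic ground truth
LO, HI = 1.0, 100.0       # arbitrary mass range for the example
rng = random.Random(0)    # fixed seed so the sketch is repeatable

masses = sample_power_law(20000, ALPHA, LO, HI, rng)
points = log_binned_density(masses, LO, HI, n_bins=10)
slope = fit_slope(points)
print(f"fitted log-log slope: {slope:.2f} (generating exponent: -{ALPHA})")
```

On linear axes this sample is a spike near the lower bound with a long faint tail; after the transformation the power law is a straight line whose slope can be read off directly.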

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The manuscript introduces VESTA, a framework that augments vision-language models with a dynamically growing toolkit of statistical tools (data transformations, visualizations, and tests) for iterative model fitting and refinement. It introduces the DAWN benchmark covering distribution fitting, time series, and real-world astronomy tasks (initial mass functions, gravitational-wave signals), and reports results from three controlled configurations (no tools, static expert tools, dynamic model-written tools) showing that dynamic tool creation yields the largest gains on complex and domain-specific tasks while producing more sophisticated tools than prior visual tool-creation systems.

Significance. If the empirical results hold under full scrutiny of the methods and error analysis, the work would represent a meaningful step toward more adaptive agentic systems for scientific modeling. The introduction of a new benchmark with tiered difficulty and domain-specific astronomy examples, together with the explicit comparison of tool-creation strategies, provides a concrete testbed that future systems can build upon.

major comments (2)
  1. [Abstract (evaluation description)] The central empirical claim rests on the DAWN benchmark tasks being representative of the modeling challenges where static tools fail; however, the abstract provides no quantitative breakdown of task difficulty tiers or failure modes of baselines on the astronomy subset, making it difficult to assess whether the reported gains generalize beyond the chosen examples.
  2. [Abstract (tool sophistication result)] The claim that dynamically generated tools are 'substantially more sophisticated' is load-bearing for the contribution, yet the abstract does not specify the rubric or inter-rater protocol used to rate diagnostic categories and visual-output preference; without this, the comparison to existing visual tool-creation systems cannot be independently verified.
minor comments (1)
  1. [Abstract] The three toolkit configurations are described at a high level; adding a table that explicitly lists the tool inventory size, reuse frequency, and example tool signatures for each configuration would improve reproducibility.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments on the abstract. We address each major point below and will revise the abstract accordingly to improve clarity and verifiability while preserving its length.

  1. Referee: [Abstract (evaluation description)] The central empirical claim rests on the DAWN benchmark tasks being representative of the modeling challenges where static tools fail; however, the abstract provides no quantitative breakdown of task difficulty tiers or failure modes of baselines on the astronomy subset, making it difficult to assess whether the reported gains generalize beyond the chosen examples.

    Authors: We agree that the abstract would benefit from a concise reference to the benchmark structure. The revised abstract will note the tiered design of DAWN (distribution fitting, time series, and domain-specific astronomy tasks) and state that dynamic tool creation yields the largest gains on the astronomy subset. Full quantitative results, including baseline failure rates and error analysis by tier, are already provided in Section 4 and Appendix B; we will ensure the abstract points readers to these sections. revision: yes

  2. Referee: [Abstract (tool sophistication result)] The claim that dynamically generated tools are 'substantially more sophisticated' is load-bearing for the contribution, yet the abstract does not specify the rubric or inter-rater protocol used to rate diagnostic categories and visual-output preference; without this, the comparison to existing visual tool-creation systems cannot be independently verified.

    Authors: The rubric (counting diagnostic categories such as distribution shape, outliers, and correlations, plus preference for visual outputs) and inter-rater protocol are described in Section 5.2. We will revise the abstract to briefly indicate that tool sophistication was assessed by human raters using these criteria. This addition will allow the claim to be evaluated from the abstract while directing readers to the full protocol in the main text. revision: yes
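
The rubric named in this response reduces to two summary numbers: diagnostic categories covered per function, and the share of tools that produce visual output. A minimal sketch of that tally follows. The `ToolAnnotation` record, the tool names, and the three annotations are invented placeholders, not data or code from the paper; only the category labels are taken from the response above.

```python
from dataclasses import dataclass
from typing import FrozenSet, List


@dataclass(frozen=True)
class ToolAnnotation:
    """One rater's labels for one generated tool (hypothetical record layout)."""
    name: str
    categories: FrozenSet[str]
    visual_output: bool


def mean_categories_per_tool(annotations: List[ToolAnnotation]) -> float:
    """Average number of diagnostic categories covered by each tool."""
    return sum(len(a.categories) for a in annotations) / len(annotations)


def visual_fraction(annotations: List[ToolAnnotation]) -> float:
    """Share of tools whose output is a plot rather than text or numbers."""
    return sum(a.visual_output for a in annotations) / len(annotations)


# Placeholder annotations, made up purely to exercise the functions.
annotations = [
    ToolAnnotation("tool_a", frozenset({"distribution shape", "outliers"}), True),
    ToolAnnotation("tool_b", frozenset({"correlations"}), False),
    ToolAnnotation(
        "tool_c",
        frozenset({"distribution shape", "outliers", "correlations"}),
        True,
    ),
]

print(mean_categories_per_tool(annotations))
print(visual_fraction(annotations))
```

Comparing these two numbers between systems is one plausible way the "more sophisticated" claim could be checked once the labeled tools are available.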

Circularity Check

0 steps flagged

No significant circularity; empirical evaluation on new benchmark

The paper introduces VESTA as an empirical framework for agentic model fitting and evaluates it via controlled comparisons (no-tool, static tools, dynamic tools) on the newly introduced DAWN benchmark, including astronomy tasks. The abstract and described claims rest on direct performance reporting and tool-sophistication ratings rather than any derivation chain, equations, fitted parameters, or self-citation load-bearing premises. No load-bearing step reduces to its own inputs by construction, and the central results are externally falsifiable via the benchmark tasks.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract only; no information on free parameters, axioms, or invented entities is available.

pith-pipeline@v0.9.1-grok · 5833 in / 897 out tokens · 19160 ms · 2026-06-28T21:56:25.469991+00:00 · methodology

Reference graph

Works this paper leans on

96 extracted references · 11 canonical work pages · 1 internal anchor

  1. [1]

    Pymc: a modern, and comprehensive probabilistic programming framework in python.PeerJ Computer Science, 9:e1516, 2023

    Oriol Abril-Pla, Virgile Andreani, Colin Carroll, Larry Dong, Christopher J Fonnesbeck, Maxim Kochurov, Ravin Kumar, Junpeng Lao, Christian C Luhmann, Osvaldo A Martin, et al. Pymc: a modern, and comprehensive probabilistic programming framework in python.PeerJ Computer Science, 9:e1516, 2023

  2. [2]

    Evoskill: Automated skill discovery for multi-agent systems, 2026

    Salaheddin Alzubi, Noah Provenzano, Jaydon Bingham, Weiyuan Chen, and Tu Vu. Evoskill: Automated skill discovery for multi-agent systems, 2026. URL https://arxiv.org/abs/ 2603.02766

  3. [3]

    Speech signal modeling using multivariate distributions.EURASIP Journal on Audio Speech and Music Processing, 2015: 1–14, 12 2015

    Ali Aroudi, Hadi Veisi, Hossein Sameti, and Zahra Mafakheri. Speech signal modeling using multivariate distributions.EURASIP Journal on Audio Speech and Music Processing, 2015: 1–14, 12 2015. doi: 10.1186/s13636-015-0078-1

  4. [4]

    Covey, and Michael R

    Nate Bastian, Kevin R. Covey, and Michael R. Meyer. A universal stellar initial mass function? a critical look at variations.Annual Review of Astronomy and Astro- physics, 48(V olume 48, 2010):339–389, 2010. ISSN 1545-4282. doi: https://doi.org/ 10 10.1146/annurev-astro-082708-101642. URL https://www.annualreviews.org/content/ journals/10.1146/annurev-ast...

  5. [5]

    Automated reverse engineering of nonlinear dynamical systems

    Josh Bongard and Hod Lipson. Automated reverse engineering of nonlinear dynamical systems. Proceedings of the National Academy of Sciences of the United States of America, 104(24): 9943–9948, Jun 2007. doi: 10.1073/pnas.0609476104

  6. [6]

    Probabilistic grammars for equation discovery.CoRR, abs/2012.00428, 2020

    Jure Brence, Ljupco Todorovski, and Saso Dzeroski. Probabilistic grammars for equation discovery.CoRR, abs/2012.00428, 2020. URLhttps://arxiv.org/abs/2012.00428

  7. [7]

    Large language models as tool makers, 2024

    Tianle Cai, Xuezhi Wang, Tengyu Ma, Xinyun Chen, and Denny Zhou. Large language models as tool makers, 2024. URLhttps://arxiv.org/abs/2305.17126

  8. [8]

    Adaevolve: Adaptive llm driven zeroth-order optimization, 2026

    Mert Cemri, Shubham Agrawal, Akshat Gupta, Shu Liu, Audrey Cheng, Qiuyang Mang, Ashwin Naren, Lutfi Eren Erdogan, Koushik Sen, Matei Zaharia, Alex Dimakis, and Ion Stoica. Adaevolve: Adaptive llm driven zeroth-order optimization, 2026. URL https://arxiv.org/ abs/2602.20133

  9. [9]

    2003, Publications of the Astronomical Society of the Pacific, 115, 763, doi: 10.1086/376392

    Gilles Chabrier. Galactic stellar and substellar initial mass function.Publications of the Astronomical Society of the Pacific, 115(809):763–795, July 2003. ISSN 1538-3873. doi: 10.1086/376392. URLhttp://dx.doi.org/10.1086/376392

  10. [10]

    Mle-bench: Evaluating machine learning agents on machine learning engineering, 2025

    Jun Shern Chan, Neil Chowdhury, Oliver Jaffe, James Aung, Dane Sherburn, Evan Mays, Giulio Starace, Kevin Liu, Leon Maksin, Tejal Patwardhan, Lilian Weng, and Aleksander M ˛ adry. Mle-bench: Evaluating machine learning agents on machine learning engineering, 2025. URL https://arxiv.org/abs/2410.07095

  11. [11]

    Evoclaw: Evaluating ai agents on continuous software evolution, 2026

    Gangda Deng, Zhaoling Chen, Zhongming Yu, Haoyang Fan, Yuhong Liu, Yuxin Yang, Dhruv Parikh, Rajgopal Kannan, Le Cong, Mengdi Wang, Qian Zhang, Viktor Prasanna, Xiangru Tang, and Xingyao Wang. Evoclaw: Evaluating ai agents on continuous software evolution, 2026. URLhttps://arxiv.org/abs/2603.13428

  12. [12]

    Tenenbaum, and Zoubin Ghahramani

    David Duvenaud, James Robert Lloyd, Roger Grosse, Joshua B. Tenenbaum, and Zoubin Ghahramani. Structure discovery in nonparametric regression through compositional kernel search, 2013. URLhttps://arxiv.org/abs/1302.4922

  13. [13]

    Dabstep: Data agent benchmark for multi-step reasoning, 2025

    Alex Egg, Martin Iglesias Goyanes, Friso Kingma, Andreu Mora, Leandro von Werra, and Thomas Wolf. Dabstep: Data agent benchmark for multi-step reasoning, 2025. URL https: //arxiv.org/abs/2506.23719

  14. [14]

    Time-series fore- casting of seasonal items sales using machine learning – a comparative analysis.International Journal of Information Management Data Insights, 2(1):100058, 2022

    Yasaman Ensafi, Saman Hassanzadeh Amin, Guoqing Zhang, and Bharat Shah. Time-series fore- casting of seasonal items sales using machine learning – a comparative analysis.International Journal of Information Management Data Insights, 2(1):100058, 2022. ISSN 2667-0968. doi: 10.1016/j.jjimei.2022.100058. URL https://www.sciencedirect.com/science/article/ pii...

  15. [15]

    Li, Lyle Goodyear, Agam Bhatia, Louise Li, Aditi Bhaskar, Mohammed Zaman, and Noah D

    Kanishk Gandhi, Michael Y . Li, Lyle Goodyear, Agam Bhatia, Louise Li, Aditi Bhaskar, Mohammed Zaman, and Noah D. Goodman. Boxinggym: Benchmarking progress in automated experimental design and model discovery, 2025. URL https://arxiv.org/abs/2501.01540

  16. [16]

    Large language models are zero-shot time series forecasters, 2024

    Nate Gruver, Marc Finzi, Shikai Qiu, and Andrew Gordon Wilson. Large language models are zero-shot time series forecasters, 2024. URLhttps://arxiv.org/abs/2310.07820

  17. [17]

    Visual programming: Compositional visual reasoning without training, 2022

    Tanmay Gupta and Aniruddha Kembhavi. Visual programming: Compositional visual reasoning without training, 2022. URLhttps://arxiv.org/abs/2211.11559

  18. [18]

    Deepeyesv2: Toward agentic multimodal model, 2026

    Jack Hong, Chenxiao Zhao, ChengLin Zhu, Weiheng Lu, Guohai Xu, and Xing Yu. Deepeyesv2: Toward agentic multimodal model, 2026. URLhttps://arxiv.org/abs/2511.05271

  19. [19]

    Hollon, and Bryan Wang

    Xinhai Hou, Shaoyuan Xu, Manan Biyani, Moyan Li, Jia Liu, Todd C. Hollon, and Bryan Wang. Codev: Code with images for faithful visual reasoning via tool-aware policy optimization, 2026. URLhttps://arxiv.org/abs/2511.19661. 11

  20. [20]

    Visual sketchpad: Sketching as a visual chain of thought for multimodal language models, 2024

    Yushi Hu, Weijia Shi, Xingyu Fu, Dan Roth, Mari Ostendorf, Luke Zettlemoyer, Noah A Smith, and Ranjay Krishna. Visual sketchpad: Sketching as a visual chain of thought for multimodal language models, 2024. URLhttps://arxiv.org/abs/2406.09403

  21. [21]

    Toolace-dev: Self-improving tool learning via decomposition and evolution, 2025

    Xu Huang, Weiwen Liu, Xingshan Zeng, Yuefeng Huang, Xinlong Hao, Yuxian Wang, Yirong Zeng, Chuhan Wu, Yasheng Wang, Ruiming Tang, and Defu Lian. Toolace-dev: Self-improving tool learning via decomposition and evolution, 2025. URL https://arxiv.org/abs/2505. 07512

  22. [22]

    Jordan, Song Mei, Jason E Weston, Weijie J

    Wenlong Ji, Weizhe Yuan, Emily Getzen, Kyunghyun Cho, Michael I. Jordan, Song Mei, Jason E Weston, Weijie J. Su, Jing Xu, and Linjun Zhang. An overview of large language models for statisticians, 2025. URLhttps://arxiv.org/abs/2502.17814

  23. [23]

    Astro- visbench: A code benchmark for scientific computing and visualization in astronomy.arXiv preprint arXiv:2505.20538, 2025

    Sebastian Antony Joseph, Syed Murtaza Husain, Stella SR Offner, StÊphanie Juneau, Paul Torrey, Adam S Bolton, Juan P Farias, Niall Gaffney, Greg Durrett, and Junyi Jessy Li. Astro- visbench: A code benchmark for scientific computing and visualization in astronomy.arXiv preprint arXiv:2505.20538, 2025

  24. [24]

    Automated model discovery via multi-modal & multi-step pipeline, 2025

    Lee Jung-Mok, Nam Hyeon-Woo, Moon Ye-Bin, Junhyun Nam, and Tae-Hyun Oh. Automated model discovery via multi-modal & multi-step pipeline, 2025. URL https://arxiv.org/abs/ 2509.25946

  25. [25]

    P. Kroupa. On the variation of the initial mass function.Monthly Notices of the Royal Astronomical Society, 322(2):231–246, April 2001. ISSN 1365-2966. doi: 10.1046/j.1365-8711. 2001.04022.x. URLhttp://dx.doi.org/10.1046/j.1365-8711.2001.04022.x

  26. [26]

    Opensage: Self-programming agent generation engine, 2026

    Hongwei Li, Zhun Wang, Qinrun Dai, Yuzhou Nie, Jinjun Peng, Ruitong Liu, Jingyang Zhang, Kaijie Zhu, Jingxuan He, Lun Wang, Yangruibo Ding, Yueqi Chen, Wenbo Guo, and Dawn Song. Opensage: Self-programming agent generation engine, 2026. URL https: //arxiv.org/abs/2602.16891

  27. [27]

    Li, Emily B

    Michael Y . Li, Emily B. Fox, and Noah D. Goodman. Automated statistical model discovery with language models, 2024. URLhttps://arxiv.org/abs/2402.17879

  28. [28]

    Li, Vivek Vajipey, Noah D

    Michael Y . Li, Vivek Vajipey, Noah D. Goodman, and Emily B. Fox. Critical: Critic automation with language models, 2024. URLhttps://arxiv.org/abs/2411.06590

  29. [29]

    Tenenbaum, and Zoubin Ghahramani

    James Robert Lloyd, David Duvenaud, Roger Grosse, Joshua B. Tenenbaum, and Zoubin Ghahramani. Automatic construction and natural-language description of nonparametric regres- sion models, 2014. URLhttps://arxiv.org/abs/1402.4304

  30. [30]

    Beyond static tools: Test-time tool evolution for scientific reasoning, 2026

    Jiaxuan Lu, Ziyu Kong, Yemin Wang, Rong Fu, Haiyuan Wan, Cheng Yang, Wenjie Lou, Haoran Sun, Lilong Wang, Yankai Jiang, Xiaosong Wang, Xiao Sun, and Dongzhan Zhou. Beyond static tools: Test-time tool evolution for scientific reasoning, 2026. URL https: //arxiv.org/abs/2601.07641

  31. [31]

    Mixture cure model methodology in survival analysis: Some recent results for the one-sample case.Statistics Surveys, 18, 01 2024

    Ross Maller, Sidney Resnick, Soudabeh Shemehsavar, and Muzhi Zhao. Mixture cure model methodology in survival analysis: Some recent results for the one-sample case.Statistics Surveys, 18, 01 2024. doi: 10.1214/24-SS147

  32. [32]

    Vesta: In depth

    NASA Science. Vesta: In depth. https://science.nasa.gov/solar-system/asteroids/ 4-vesta/, . Accessed: May 2, 2026

  33. [33]

    Dawn mission overview

    NASA Science. Dawn mission overview. https://science.nasa.gov/mission/dawn/, . Accessed: May 2, 2026

  34. [34]

    Harnessing vision models for time series analysis: A survey, 2025

    Jingchao Ni, Ziming Zhao, ChengAo Shen, Hanghang Tong, Dongjin Song, Wei Cheng, Dongsheng Luo, and Haifeng Chen. Harnessing vision models for time series analysis: A survey, 2025. URLhttps://arxiv.org/abs/2502.08869

  35. [35]

    S. S. R. Offner, P. C. Clark, P. Hennebelle, N. Bastian, M. R. Bate, P. F. Hopkins, E. Moreaux, and A. P. Whitworth.The Origin and Universality of the Stellar Initial Mass Function. University of Arizona Press, 2014. ISBN 9780816531240. doi: 10.2458/azu_uapress_9780816531240-ch003. URLhttp://dx.doi.org/10.2458/azu_uapress_9780816531240-ch003. 12

  36. [36]

    Fung, Yujia Qin, Zhiyuan Liu, and Heng Ji

    Cheng Qian, Chi Han, Yi R. Fung, Yujia Qin, Zhiyuan Liu, and Heng Ji. Creator: Tool creation for disentangling abstract and concrete reasoning of large language models, 2024. URL https://arxiv.org/abs/2305.14318

  37. [37]

    Turner, and David Duvenaud

    James Requeima, John Bronskill, Dami Choi, Richard E. Turner, and David Duvenaud. Llm processes: Numerical predictive distributions conditioned on natural language, 2024. URL https://arxiv.org/abs/2405.12856

  38. [38]

    Forgotten polygons: Multimodal large language models are shape-blind

    William Rudman, Michal Golovanevsky, Amir Bar, Vedant Palit, Yann LeCun, Carsten Eickhoff, and Ritambhara Singh. Forgotten polygons: Multimodal large language models are shape-blind. In Wanxiang Che, Joyce Nabende, Ekaterina Shutova, and Mohammad Taher Pilehvar, editors, Findings of the Association for Computational Linguistics: ACL 2025, pages 11983–1199...

  39. [39]

    ApJ , year = 1955, month = jan, volume =

    Edwin E. Salpeter. The Luminosity Function and Stellar Evolution.apj, 121:161, January 1955. doi: 10.1086/145971

  40. [40]

    Towards execution-grounded automated ai research, 2026

    Chenglei Si, Zitong Yang, Yejin Choi, Emmanuel Candès, Diyi Yang, and Tatsunori Hashimoto. Towards execution-grounded automated ai research, 2026. URL https://arxiv.org/abs/ 2601.14525

  41. [41]

    Restgpt: Connecting large language models with real-world restful apis, 2023

    Yifan Song, Weimin Xiong, Dawei Zhu, Wenhao Wu, Han Qian, Mingbo Song, Hailiang Huang, Cheng Li, Ke Wang, Rong Yao, Ye Tian, and Sujian Li. Restgpt: Connecting large language models with real-world restful apis, 2023. URLhttps://arxiv.org/abs/2306.06624

  42. [42]

    A survey on large language model-based agents for statistics and data science.The American Statistician, page 1–14, October 2025

    Maojun Sun, Ruijian Han, Binyan Jiang, Houduo Qi, Defeng Sun, Yancheng Yuan, and Jian Huang. A survey on large language model-based agents for statistics and data science.The American Statistician, page 1–14, October 2025. ISSN 1537-2731. doi: 10.1080/00031305. 2025.2561140. URLhttp://dx.doi.org/10.1080/00031305.2025.2561140

  43. [43]

    Seagent: Self-evolving computer use agent with autonomous learning from experience,

    Zeyi Sun, Ziyu Liu, Yuhang Zang, Yuhang Cao, Xiaoyi Dong, Tong Wu, Dahua Lin, and Jiaqi Wang. Seagent: Self-evolving computer use agent with autonomous learning from experience,

  44. [44]

    URLhttps://arxiv.org/abs/2508.04700

  45. [45]

    Vipergpt: Visual inference via python execution for reasoning, 2023

    Dídac Surís, Sachit Menon, and Carl V ondrick. Vipergpt: Visual inference via python execution for reasoning, 2023. URLhttps://arxiv.org/abs/2303.08128

  46. [46]

    Eyes wide shut? exploring the visual shortcomings of multimodal llms, 2024

    Shengbang Tong, Zhuang Liu, Yuexiang Zhai, Yi Ma, Yann LeCun, and Saining Xie. Eyes wide shut? exploring the visual shortcomings of multimodal llms, 2024. URL https://arxiv. org/abs/2401.06209

  47. [47]

    V oyager: An open-ended embodied agent with large language models,

    Guanzhi Wang, Yuqi Xie, Yunfan Jiang, Ajay Mandlekar, Chaowei Xiao, Yuke Zhu, Linxi Fan, and Anima Anandkumar. V oyager: An open-ended embodied agent with large language models,

  48. [48]

    URLhttps://arxiv.org/abs/2305.16291

  49. [49]

    Transformers in time series: A survey, 2023

    Qingsong Wen, Tian Zhou, Chaoli Zhang, Weiqi Chen, Ziqing Ma, Junchi Yan, and Liang Sun. Transformers in time series: A survey, 2023. URLhttps://arxiv.org/abs/2202.07125

  50. [50]

    Visual chatgpt: Talking, drawing and editing with visual foundation models, 2023

    Chenfei Wu, Shengming Yin, Weizhen Qi, Xiaodong Wang, Zecheng Tang, and Nan Duan. Visual chatgpt: Talking, drawing and editing with visual foundation models, 2023. URL https://arxiv.org/abs/2303.04671

  51. [51]

    Llm agents making agent tools, 2025

    Georg Wölflein, Dyke Ferber, Daniel Truhn, Ognjen Arandjelovi´c, and Jakob Nikolas Kather. Llm agents making agent tools, 2025. URLhttps://arxiv.org/abs/2502.11705

  52. [52]

    Act wisely: Cultivating meta-cognitive tool use in agentic multimodal models, 2026

    Shilin Yan, Jintao Tong, Hongwei Xue, Xiaojun Tang, Yangyang Wang, Kunyu Shi, Guannan Zhang, Ruixuan Li, and Yixiong Zou. Act wisely: Cultivating meta-cognitive tool use in agentic multimodal models, 2026. URLhttps://arxiv.org/abs/2604.08545

  53. [53]

    Vismem: Latent vision memory unlocks potential of vision-language models, 2026

    Xinlei Yu, Chengming Xu, Guibin Zhang, Zhangquan Chen, Yudong Zhang, Yongbo He, Peng-Tao Jiang, Jiangning Zhang, Xiaobin Hu, and Shuicheng Yan. Vismem: Latent vision memory unlocks potential of vision-language models, 2026. URL https://arxiv.org/abs/ 2511.11007. 13

  54. [54]

    A transformer-based framework for multivariate time series representation learning, 2020

    George Zerveas, Srideepika Jayaraman, Dhaval Patel, Anuradha Bhamidipaty, and Carsten Eickhoff. A transformer-based framework for multivariate time series representation learning, 2020. URL https://arxiv.org/abs/2010.02803

  56. [56]

    Skywork-r1v4: Toward agentic multimodal intelligence through interleaved thinking with images and deepresearch, 2025

    Yifan Zhang, Liang Hu, Haofeng Sun, Peiyu Wang, Yichen Wei, Shukang Yin, Jiangbo Pei, Wei Shen, Peng Xia, Yi Peng, Tianyidan Xie, Eric Li, Yang Liu, Xuchen Song, and Yahui Zhou. Skywork-r1v4: Toward agentic multimodal intelligence through interleaved thinking with images and deepresearch, 2025. URL https://arxiv.org/abs/2512.02395

  57. [57]

    Vipact: Visual-perception enhancement via specialized vlm agent collaboration and tool-use, 2025

    Zhehao Zhang, Ryan Rossi, Tong Yu, Franck Dernoncourt, Ruiyi Zhang, Jiuxiang Gu, Sungchul Kim, Xiang Chen, Zichao Wang, and Nedim Lipka. Vipact: Visual-perception enhancement via specialized vlm agent collaboration and tool-use, 2025. URL https://arxiv.org/abs/2410.16400

  58. [58]

    Pyvision: Agentic vision with dynamic tooling, 2025

    Shitian Zhao, Haoquan Zhang, Shaoheng Lin, Ming Li, Qilong Wu, Kaipeng Zhang, and Chen Wei. Pyvision: Agentic vision with dynamic tooling, 2025. URL https://arxiv.org/abs/2507.07998

  59. [59]

    Time-vlm: Exploring multimodal vision-language models for augmented time series forecasting, 2025

    Siru Zhong, Weilin Ruan, Ming Jin, Huan Li, Qingsong Wen, and Yuxuan Liang. Time-vlm: Exploring multimodal vision-language models for augmented time series forecasting, 2025. URL https://arxiv.org/abs/2502.04395

  60. [60]

    Image-of-thought prompting for visual reasoning refinement in multimodal large language models, 2024

    Qiji Zhou, Ruochen Zhou, Zike Hu, Panzhong Lu, Siyang Gao, and Yue Zhang. Image-of-thought prompting for visual reasoning refinement in multimodal large language models, 2024. URL https://arxiv.org/abs/2405.13872

  61. [61]

    Reinforced visual perception with tools, 2025

    Zetong Zhou, Dongping Chen, Zixian Ma, Zhihan Hu, Mingyang Fu, Sinan Wang, Yao Wan, Zhou Zhao, and Ranjay Krishna. Reinforced visual perception with tools, 2025. URL https://arxiv.org/abs/2509.01656

  62. [62]

    Are large language models good statisticians?, 2024

    Yizhang Zhu, Shiyin Du, Boyan Li, Yuyu Luo, and Nan Tang. Are large language models good statisticians?, 2024. URL https://arxiv.org/abs/2406.07815

    A VESTA Details

    Algorithm 2 Visual Exploration Agents (Detailed)
    Require: Data D, iteration limit N, proposals per iteration p, metric R, registry E (initial state: generate_new_tool only)
    Ensure: M_best, θ_best ...

  63. [63]

    CalculateMoments: Computes the mean, variance, skewness, and excess kurtosis of the input data. Returns a JSON artifact with a plain-language interpretation to guide distribution selection, including symmetry hints (e.g., right-skewed data suggests Gamma, Lognormal, or Weibull families) and tail-weight hints (e.g., leptokurtic data suggests Student-t, Cau...
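The moment computation described for CalculateMoments can be sketched as follows. This is a minimal illustration, not the paper's implementation: the function names, the population (biased) moment formulas, and the skewness tolerance `tol` are assumptions made for the example.

```python
def calculate_moments(data):
    """Return mean, variance, skewness, and excess kurtosis (population forms)."""
    n = len(data)
    mean = sum(data) / n
    # Central moments of order 2, 3, and 4.
    m2 = sum((x - mean) ** 2 for x in data) / n
    m3 = sum((x - mean) ** 3 for x in data) / n
    m4 = sum((x - mean) ** 4 for x in data) / n
    return {
        "mean": mean,
        "variance": m2,
        "skewness": m3 / m2 ** 1.5,
        "excess_kurtosis": m4 / m2 ** 2 - 3.0,  # 0 for a normal distribution
    }


def symmetry_hint(skewness, tol=0.5):
    """Plain-language hint; the threshold `tol` is an illustrative assumption."""
    if skewness > tol:
        return "right-skewed: consider Gamma, Lognormal, or Weibull families"
    if skewness < -tol:
        return "left-skewed"
    return "roughly symmetric"


if __name__ == "__main__":
    stats = calculate_moments([1, 1, 1, 2, 10])
    print(stats, symmetry_hint(stats["skewness"]))
```

Returning a dictionary keeps the result easy to serialize as the kind of JSON artifact the description mentions.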

  64. [64]

    Histogram: Plots a histogram of the empirical data with the fitted distribution’s probability density function (PDF) overlaid. Handles both single distributions and mixtures by summing component PDFs weighted by their mixture weights. Provides an immediate visual check of whether the model captures the overall shape, modality, and spread of the data. When...

  65. [65]

    SegmentDistributionsAndCalculateMoments: Segments the data into a specified number of mixture components using a Gaussian Mixture Model (GMM), then computes per-component moments (mean, variance, skewness, kurtosis). Produces both a segmentation image with a total mixture overlay and a JSON summary of per-component statistics with distribution family hin...

  66. [66]

    QQPlot: Generates a Quantile-Quantile (Q-Q) plot comparing empirical data quantiles to theoretical quantiles from the currently fitted distribution. Linearity indicates a good fit; S-shaped curvature signals tail mismatch; one-sided curvature suggests skew; and sharp tail departures may indicate outliers or heavier tails than the model captures
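The numbers behind a Q-Q plot can be sketched without any plotting library. The example below assumes a normal reference distribution (via the standard library's `statistics.NormalDist`) and plotting positions (i - 0.5)/n; both choices, and the function names, are assumptions for illustration. The correlation of the Q-Q points is used here only as a rough numeric stand-in for "linearity".

```python
import math
import random
from statistics import NormalDist


def qq_points(data, dist=NormalDist()):
    """Return (theoretical, empirical) quantile pairs for a Q-Q plot."""
    emp = sorted(data)
    n = len(emp)
    # Plotting positions (i - 0.5) / n avoid infinite quantiles at 0 and 1.
    theo = [dist.inv_cdf((i - 0.5) / n) for i in range(1, n + 1)]
    return theo, emp


def qq_correlation(theo, emp):
    """Pearson correlation of the Q-Q points; values near 1 mean near-linear."""
    n = len(theo)
    mt, me = sum(theo) / n, sum(emp) / n
    cov = sum((t - mt) * (e - me) for t, e in zip(theo, emp))
    vt = sum((t - mt) ** 2 for t in theo)
    ve = sum((e - me) ** 2 for e in emp)
    return cov / math.sqrt(vt * ve)


if __name__ == "__main__":
    rng = random.Random(1)
    normal_data = NormalDist(0, 1).samples(500, seed=1)
    skewed_data = [rng.expovariate(1.0) for _ in range(500)]
    print(qq_correlation(*qq_points(normal_data)))  # close to 1
    print(qq_correlation(*qq_points(skewed_data)))  # noticeably lower
```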

  67. [67]

    PlotTailsTransform: Produces log-log and semi-log complementary CDF (CCDF) plots to diagnose tail behavior. A straight line on the log-log plot indicates power-law or Pareto-type heavy tails, while a straight line on the semi-log plot indicates exponential decay. Useful for distinguishing heavy-tailed from light-tailed distributions when the histogram alo...

  68. [68]

    ProbabilityPlot: Generates a probability plot comparing the empirical CDF to the fitted distribution’s theoretical CDF. A consistent horizontal shift indicates a mis-specified location parameter; a slope mismatch indicates a scale misfit; and systematic tail deviations suggest distributional misfit. Also reports a Kolmogorov-Smirnov (KS) statistic for qua...
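The Kolmogorov-Smirnov statistic mentioned above is the largest vertical gap between the empirical CDF and the fitted CDF. A minimal sketch follows; the function name and the use of `statistics.NormalDist` as the fitted model are assumptions for illustration.

```python
from statistics import NormalDist


def ks_statistic(data, cdf):
    """One-sample KS statistic: sup |ECDF(x) - cdf(x)| over the sample."""
    xs = sorted(data)
    n = len(xs)
    d = 0.0
    for i, x in enumerate(xs, start=1):
        f = cdf(x)
        # The ECDF jumps from (i - 1) / n to i / n at x, so check both sides.
        d = max(d, i / n - f, f - (i - 1) / n)
    return d


if __name__ == "__main__":
    sample = NormalDist(0, 1).samples(300, seed=0)
    well_specified = ks_statistic(sample, NormalDist(0, 1).cdf)
    shifted = ks_statistic(sample, NormalDist(2, 1).cdf)  # wrong location
    print(well_specified, shifted)
```

A mis-specified location parameter inflates the statistic, which matches the horizontal-shift symptom described for the plot.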

  69. [69]

    GetDominantPeriod: Extracts the dominant period from the time series using Fast Fourier Transform (FFT) analysis. Returns a plain-text summary of the detected period. Most useful when Periodic or PeriodicComplex kernels are under consideration and the period has not yet been numerically determined. The result is available in the subsequent feedback iteration
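A dominant-period estimate of this kind can be sketched with NumPy's real FFT. The function name, the sampling interval `dt`, and the choice to pick the single largest non-zero-frequency bin are assumptions for this example, not details confirmed by the source.

```python
import numpy as np


def get_dominant_period(y, dt=1.0):
    """Return the period of the strongest non-constant Fourier component."""
    y = np.asarray(y, dtype=float)
    y = y - y.mean()                        # remove the constant offset
    power = np.abs(np.fft.rfft(y)) ** 2     # power spectrum
    freqs = np.fft.rfftfreq(len(y), d=dt)   # frequencies in cycles per unit time
    k = 1 + int(np.argmax(power[1:]))       # skip the zero-frequency bin
    return float(1.0 / freqs[k])


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    t = np.arange(240)
    series = np.sin(2 * np.pi * t / 12) + 0.2 * rng.normal(size=t.size)
    print(f"Detected dominant period: {get_dominant_period(series):.2f} steps")
```

The frequency resolution is 1/(n·dt), so periods that do not divide the series length evenly are only recovered approximately.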

  70. [70]

    FitVsActuals: Produces a visual overlay of the Gaussian Process (GP) fit on the raw time series data. Essential for visually assessing whether the model adequately captures the underlying trend and seasonality while appropriately discounting noise. Falls back to a raw series plot if no model has been fitted yet

  71. [71]

    FitVsActualsWithResidualsDistribution: Generates a combined plot showing the GP fit overlaid on the observed time series alongside the distribution of residuals. Used to assess whether residuals resemble white noise; a broadly normal residual distribution is indicative of a well-specified model. Falls back to a raw series plot if no model has been fitted yet

  72. [72]

    ResidualsAutoCorrelationPlot: Produces an Autocorrelation Function (ACF) plot of the model residuals to check for temporal independence. Significant spikes above the confidence band indicate that the model is failing to capture some latent structure in the data. Falls back to a raw series plot if no model has been fitted yet.

    5. ResidualsAutoCorrelationSco...
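The residual autocorrelation check can be sketched numerically as below. The function names are hypothetical, and the confidence band ±1.96/√n is the usual large-sample approximation for white noise, assumed here rather than taken from the source.

```python
import math


def acf(residuals, max_lag):
    """Sample autocorrelation of the residuals for lags 0..max_lag."""
    n = len(residuals)
    mean = sum(residuals) / n
    dev = [r - mean for r in residuals]
    c0 = sum(d * d for d in dev)
    return [
        sum(dev[t] * dev[t + k] for t in range(n - k)) / c0
        for k in range(max_lag + 1)
    ]


def significant_lags(residuals, max_lag, z=1.96):
    """Lags whose autocorrelation falls outside the approximate 95% band."""
    band = z / math.sqrt(len(residuals))
    r = acf(residuals, max_lag)
    return [k for k in range(1, max_lag + 1) if abs(r[k]) > band]


if __name__ == "__main__":
    # Residuals with leftover period-10 structure the model failed to capture.
    leftover = [math.sin(2 * math.pi * t / 10) for t in range(200)]
    print(significant_lags(leftover, max_lag=12))
```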

  73. [73]

    diagnostic_fit_checks: Naming a concrete model family (Gaussian, gamma, lognormal, Pareto, Weibull, etc.) and trying it on the data. Each family encodes different assumptions. These tools typically allow a visual comparison of multiple model families at once. Occasionally, we observe single-use model fitting

  74. [74]

    information_criteria: Numerical scores that rank competing fits while penalizing model complexity. Beyond simply fitting and visualizing models, AIC and BIC provide quantitative fit metrics
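For concreteness, the standard definitions are AIC = 2k − 2 ln L and BIC = k ln n − 2 ln L, where k is the number of fitted parameters, n the sample size, and L the maximized likelihood. The sketch below compares a normal and an exponential fit using closed-form maximum likelihood estimates; the helper names and the choice of these two families are assumptions for illustration.

```python
import math
import random


def aic(log_lik, k):
    """Akaike information criterion (lower is better)."""
    return 2 * k - 2 * log_lik


def bic(log_lik, k, n):
    """Bayesian information criterion (lower is better)."""
    return k * math.log(n) - 2 * log_lik


def normal_loglik(data):
    """Maximized normal log-likelihood (MLE mean and variance, k = 2)."""
    n = len(data)
    mean = sum(data) / n
    var = sum((x - mean) ** 2 for x in data) / n
    return -0.5 * n * (math.log(2 * math.pi * var) + 1)


def exponential_loglik(data):
    """Maximized exponential log-likelihood (MLE rate = 1 / mean, k = 1)."""
    n = len(data)
    rate = n / sum(data)
    return n * (math.log(rate) - 1)


if __name__ == "__main__":
    rng = random.Random(0)
    data = [rng.expovariate(1.0) for _ in range(500)]
    n = len(data)
    print("normal      AIC:", aic(normal_loglik(data), 2), "BIC:", bic(normal_loglik(data), 2, n))
    print("exponential AIC:", aic(exponential_loglik(data), 1), "BIC:", bic(exponential_loglik(data), 1, n))
```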

  75. [75]

    mle_fitting: Maximum likelihood estimation: choosing the parameter values (µ, σ, shape, scale, ...) that maximize the probability of observing the data under the chosen family. This is how models are actually fit, distinct from which models we want to test in diagnostic_fit_checks. MLE gives you the canonical “best” parameters under a given famil...

  76. [76]

    mean_excess_plot: Plots the conditional expectation E[X − u | X > u] against threshold u. For the Generalized Pareto distribution this function is linear in u, so a straight line in the upper tail signals a GPD-like tail and tells you where the “extreme regime” begins. This is a tail diagnostic that complements diagnostic_fit_checks. These tests help VESTA deci...
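The empirical mean excess function behind such a plot is simply the average overshoot above each threshold. A minimal sketch, with a hypothetical function name and NaN returned when no observation exceeds the threshold (an assumption of this example):

```python
import math


def mean_excess(data, thresholds):
    """Empirical E[X - u | X > u] for each threshold u."""
    values = []
    for u in thresholds:
        excesses = [x - u for x in data if x > u]
        # No exceedances means the conditional mean is undefined.
        values.append(sum(excesses) / len(excesses) if excesses else math.nan)
    return values


if __name__ == "__main__":
    sample = [1, 2, 3, 4, 5]
    print(mean_excess(sample, thresholds=[2, 4, 5]))
```

Plotting these values against the thresholds gives the diagnostic: an approximately straight line in the upper range is the GPD-like signature described above.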

  77. [77]

    hill_estimator: Estimates the tail index α of a heavy-tailed distribution from the largest k order statistics, giving a concrete number for “how heavy” the tail is. A Hill plot (α̂ vs. k) lets you check stability and pick a sensible threshold. This refines a Pareto/power-law fit by pinning down its single most important parameter, and serves as a sanity che...
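The Hill estimator has a short closed form: with the sample sorted in decreasing order, α̂ is the reciprocal of the mean log-ratio of the k largest values to the (k+1)-th largest. The sketch below assumes strictly positive data; the function names are hypothetical.

```python
import math
import random


def hill_estimator(data, k):
    """Hill estimate of the tail index alpha from the k largest observations."""
    xs = sorted(data, reverse=True)
    if not 0 < k < len(xs):
        raise ValueError("k must satisfy 0 < k < len(data)")
    threshold = xs[k]  # the (k + 1)-th largest value
    mean_log_ratio = sum(math.log(xs[i] / threshold) for i in range(k)) / k
    return 1.0 / mean_log_ratio


def hill_curve(data, ks):
    """(k, alpha_hat) pairs, i.e. the points of a Hill plot."""
    return [(k, hill_estimator(data, k)) for k in ks]


if __name__ == "__main__":
    rng = random.Random(0)
    pareto_sample = [rng.paretovariate(2.0) for _ in range(5000)]  # true alpha = 2
    for k, alpha_hat in hill_curve(pareto_sample, [100, 250, 500, 1000]):
        print(k, round(alpha_hat, 3))
```

A stable stretch of the curve across k suggests a sensible threshold; strong drift indicates the tail is not well described by a single power law at that depth.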

  78. [78]

    shapiro_wilk: A formal hypothesis test for whether data come from a normal distribution. If Shapiro-Wilk rejects normality strongly, that rules out the normal family in diagnostic_fit_checks

  79. [79]

    box_cox: A parametric family of power transforms y = (x^λ − 1)/λ that searches for the λ making the transformed data closest to normal and can be useful when working with exotic, heavy-tailed distributions.

    F.2 Time Series

  80. [80]

    density_visualization: Overlays a histogram with a kernel density estimate (KDE) to give a non-parametric picture of the marginal distribution of a time series. This is typically

    Table 11: Analysis of functions in VESTA-generated tools that are not contained in the expert toolkit for Distribution Fitting.

    Function Easy Hard Astro All
    Diagnostic Fit Chec...
