Business Utility of Large Language Models as Exploratory Data Analysis Agents

Cezary Depta; Hubert Rutkowski; Jan Kanty Milczek; Jaros{\l}aw Kochanowicz; Patryk Miziu{\l}a; Rafa{\l} {\L}ab\k{e}dzki; Szymon Betlewski; Szymon Janowski

arxiv: 2606.00051 · v1 · pith:KQ3FCDEPnew · submitted 2026-05-08 · 💻 cs.CY · cs.AI

Business Utility of Large Language Models as Exploratory Data Analysis Agents

Rafa{\l} {\L}ab\k{e}dzki , Patryk Miziu{\l}a , Hubert Rutkowski , Szymon Betlewski , Cezary Depta , Szymon Janowski , Jaros{\l}aw Kochanowicz , Jan Kanty Milczek This is my paper

Pith reviewed 2026-06-30 23:23 UTC · model grok-4.3

classification 💻 cs.CY cs.AI

keywords large language modelsexploratory data analysisbusiness utilityrepeatabilityagent evaluationsupply chain simulation

0 comments

The pith

Most LLM configurations are not reliable enough for autonomous business exploratory data analysis even when average scores look acceptable.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper evaluates large language models as exploratory data analysis agents on a supply chain simulation task that requires identifying problematic supplier-product combinations from indirect operational traces. It tests fifteen configurations across four conditions varying data representation, prompt clarity, and signal strength, with five runs each, scoring outputs against ground truth via Jaccard index. A new business utility metric combines average performance with a variability adjustment to produce a single operational score. A sympathetic reader would care because inconsistent agent outputs could undermine trust in data-driven business decisions. The central result is that repeatability and condition sensitivity must be evaluated alongside mean quality, as most configurations lose substantial utility once variability is accounted for.

Core claim

The paper claims that most of the fifteen model-variant configurations are not reliable enough for autonomous EDA use, even when their average scores appear acceptable. GPT-5.4 with extra-high reasoning effort achieved the strongest overall profile with an experiment-averaged mean score of 0.8748 and business utility of 0.6952, while the next-best configurations lost substantially more utility after variability discounting. The findings indicate that average quality, repeatability, and condition sensitivity function as complementary dimensions of operational trustworthiness.

What carries the argument

The business utility metric, a risk-adjusted measure that summarises quality and repeatability in a single operational score by discounting mean performance for output variability across repeated runs and conditions.

If this is right

Business utility provides a single operational measure that penalises configurations whose outputs vary too much across runs or conditions.
Most tested configurations lose substantial utility once variability is discounted, even if raw mean scores appear high.
Evaluation of EDA agents must incorporate cross-condition significance tests in addition to mean scores and coefficients of variation.
Only configurations that maintain both high mean scores and low variation qualify as candidates for autonomous deployment.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Organisations considering LLM-based EDA tools may need to run repeated trials on their own data before granting autonomous access.
The simulation-based benchmark could be adapted to test other analytical tasks such as customer segmentation or inventory forecasting.
Model providers might prioritise training objectives that reduce output variance rather than maximising average benchmark scores alone.

Load-bearing premise

The agent-based supply chain simulation and its indirect operational traces constitute a valid proxy for the challenges of real-world business exploratory data analysis tasks.

What would settle it

Applying the top-performing configuration to actual company operational data and measuring whether its business utility score remains comparable to the 0.6952 level obtained in the simulation.

Figures

Figures reproduced from arXiv: 2606.00051 by Cezary Depta, Hubert Rutkowski, Jan Kanty Milczek, Jaros{\l}aw Kochanowicz, Patryk Miziu{\l}a, Rafa{\l} {\L}ab\k{e}dzki, Szymon Betlewski, Szymon Janowski.

read the original abstract

Large Language Models (LLMs) are increasingly used in analytical workflows, but their suitability as exploratory data analysis (EDA) agents in business settings remains uncertain. In practice, a deployable EDA agent must provide not only useful average performance but also sufficient repeatability to support trust in its outputs. We evaluate this requirement in a controlled, business-relevant benchmark built on an agent-based supply chain simulation. The task is to identify supplier-product combinations responsible for low quality and downstream sales loss by reasoning from indirect operational traces rather than from explicit labels. Fifteen model-variant configurations from eight model families were evaluated under four experimental conditions that varied data representation, prompt clarity, and signal strength, with five trajectories per condition. Outputs were scored against deterministic ground truth using the Jaccard index and assessed through a framework that combines mean score (ms), coefficient of variation (CV), exploratory cross-condition significance tests, and Business utility, a risk-adjusted metric that we propose to summarise quality and repeatability in a single operational measure. The results show that most configurations are not reliable enough for autonomous EDA use, even when their average scores appear acceptable. GPT-5.4 with extra-high reasoning effort achieved the strongest overall profile, with an experiment-averaged ms of 0.8748 and an experiment-averaged Business utility of 0.6952, while the next-best configurations lost substantially more utility after variability discounting. Our findings suggest that evaluation of EDA agents should treat average quality, repeatability, and condition sensitivity as complementary dimensions of operational trustworthiness.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

The paper adds a simulation benchmark and variability-adjusted utility metric for LLM EDA agents, but the proxy's realism is untested so the reliability claims rest on an assumption.

read the letter

The main thing here is that the authors built a supply chain simulation benchmark to test LLM agents on recovering supplier-product issues from indirect traces, and they propose a Business utility metric that discounts mean score by variability. Most models don't hold up well on repeatability even when averages look okay, with one GPT variant coming out ahead.

The new part is the specific benchmark setup and the risk-adjusted metric. They do a decent job running multiple conditions on data representation, prompt clarity, and signal strength, scoring with Jaccard, and combining mean and CV into one number. That addresses a real gap in how we evaluate these agents for business use, where consistency matters as much as average quality.

The weak point is whether the simulation is a good stand-in for real business EDA. The task involves reasoning from operational traces without explicit labels, which is plausible, but if the generated data has simpler patterns or lower noise than messy company datasets, the variability they measure will be too low and the utility scores too high. No comparison to actual EDA logs is mentioned, so that assumption sits on top of the results. Five trajectories per condition also feels thin for stable CV estimates.

This paper is aimed at people working on LLM tools for analytics or evaluating their reliability in applied settings. Someone looking for concrete experiments on repeatability would find it worth reading.

I would send it to peer review. The experimental approach is solid enough to merit feedback, even with the benchmark validation question.

Referee Report

3 major / 2 minor

Summary. The paper evaluates 15 LLM configurations across 8 families as EDA agents on a supply-chain simulation benchmark. The task requires recovering supplier-product combinations from indirect operational traces using Jaccard scoring. Four conditions vary data representation, prompt clarity, and signal strength, with five trajectories each. The authors propose a new 'Business utility' metric that combines mean score (ms) and coefficient of variation (CV) into a single risk-adjusted measure. They conclude that most configurations lack sufficient repeatability for autonomous use even when average scores appear acceptable, with GPT-5.4 (extra-high reasoning) strongest at experiment-averaged ms 0.8748 and Business utility 0.6952.

Significance. If the benchmark holds, the work usefully shifts attention from average performance alone to repeatability and condition sensitivity for business deployment of LLM agents. The multi-condition design and explicit proposal of a risk-adjusted metric are strengths that could inform evaluation practices. Credit is due for testing multiple model families and for framing trustworthiness in operational terms rather than purely academic metrics.

major comments (3)

[Abstract and Methods (metric definition)] The exact formula for Business utility (how ms is discounted or weighted by CV) is never stated, only described qualitatively as 'risk-adjusted'. This definition is load-bearing for all reported utility values (e.g., 0.6952) and for the central claim that variability causes substantial utility loss.
[Experimental design (trajectories per condition)] Only five trajectories per condition are used to compute CV. With n=5 the sampling variability of the CV estimator itself is large, directly affecting the stability of the Business utility scores and the conclusion that 'most configurations are not reliable enough'.
[Benchmark construction] No validation or sensitivity analysis is provided for the agent-based supply-chain simulation as a proxy for real enterprise EDA tasks. If the generated traces have lower noise or narrower ambiguity than actual business data, both the observed CVs and the utility metric will understate deployment risk, weakening the general claim about autonomous EDA suitability.

minor comments (2)

[Results] The abstract refers to 'exploratory cross-condition significance tests' but neither the test statistic, correction method, nor any p-values appear in the provided summary; these details should be added for reproducibility.
[Reproducibility] Raw trajectory data, exact Jaccard implementation, and prompt templates are not stated to be released; making them available would strengthen the contribution.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed comments. We address each major comment point by point below, indicating where revisions will be made to the manuscript.

read point-by-point responses

Referee: [Abstract and Methods (metric definition)] The exact formula for Business utility (how ms is discounted or weighted by CV) is never stated, only described qualitatively as 'risk-adjusted'. This definition is load-bearing for all reported utility values (e.g., 0.6952) and for the central claim that variability causes substantial utility loss.

Authors: We agree that the precise mathematical definition of Business utility was not explicitly stated, which is an oversight given its central role. We will revise the Methods section to include the exact formula combining mean score (ms) and coefficient of variation (CV) as a risk-adjusted measure. This will ensure the reported values such as 0.6952 are fully reproducible. revision: yes
Referee: [Experimental design (trajectories per condition)] Only five trajectories per condition are used to compute CV. With n=5 the sampling variability of the CV estimator itself is large, directly affecting the stability of the Business utility scores and the conclusion that 'most configurations are not reliable enough'.

Authors: We acknowledge that n=5 per condition introduces substantial sampling variability in the CV estimates, which affects the precision of the Business utility metric. We will add an explicit discussion of this limitation in a new Limitations subsection, including its implications for the reliability conclusions, while noting that the design prioritizes breadth across conditions and models. revision: partial
Referee: [Benchmark construction] No validation or sensitivity analysis is provided for the agent-based supply-chain simulation as a proxy for real enterprise EDA tasks. If the generated traces have lower noise or narrower ambiguity than actual business data, both the observed CVs and the utility metric will understate deployment risk, weakening the general claim about autonomous EDA suitability.

Authors: The simulation is presented as a controlled proxy to isolate factors such as data representation, prompt clarity, and signal strength. We will expand the manuscript with additional discussion of the simulation's assumptions, design rationale, and potential differences from real enterprise data that could affect risk estimates. Full validation against proprietary real-world datasets is not feasible within this study. revision: partial

Circularity Check

0 steps flagged

No significant circularity detected

full rationale

The paper's core claims rest on an empirical evaluation: LLMs are tested on a fixed agent-based supply chain simulation, outputs are scored deterministically via Jaccard index against ground truth, and a new composite metric (Business utility) is proposed to combine observed ms and CV. No equations, self-citations, or derivations are quoted that reduce any reported result to a fitted parameter or prior self-work by construction. The simulation and scoring procedure are externally specified and falsifiable; the metric is introduced as a summary statistic rather than a predictive derivation. This is a standard empirical benchmark study with no load-bearing circular steps.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 1 invented entities

Abstract-only review; no explicit free parameters, axioms, or invented entities beyond the proposed metric are detailed. The Jaccard index and five-trajectory sampling are used without further justification visible.

axioms (2)

domain assumption Jaccard index is suitable for scoring LLM-generated supplier-product sets against deterministic ground truth.
Used to produce mean score and CV in the evaluation framework.
domain assumption Five trajectories per condition suffice to estimate coefficient of variation for business utility.
Basis for repeatability assessment across the four experimental conditions.

invented entities (1)

Business utility metric no independent evidence
purpose: Single operational measure combining mean score and variability to assess deployability.
Newly proposed to summarise quality and repeatability; no external validation or independent evidence provided in abstract.

pith-pipeline@v0.9.1-grok · 5844 in / 1435 out tokens · 42784 ms · 2026-06-30T23:23:05.079871+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

32 extracted references · 31 canonical work pages · 5 internal anchors

[1]

Exploratory Text Data Analysis for Quality Hypothesis Generation

“Exploratory Text Data Analysis for Quality Hypothesis Generation. ” Quality Engineering 30 (4): 701–12. https://doi.org/10.1080/0898 2112.2018.1481216. Cai, Yifu, Xinyu Li, Mononito Goswami, Michał Wiliński, Gus Welter, and Artur Dubrawski

work page doi:10.1080/0898 2018
[2]

https://doi.org/10.48550/ARXIV.2505.13291

TimeSeriesGym: A Scalable Benchmark for (Time Series) Machine Learning Engineering Agents . https://doi.org/10.48550/ARXIV.2505.13291. Cao, Ruisheng, Fangyu Lei, Haoyuan Wu, et al

work page doi:10.48550/arxiv.2505.13291
[3]

Chan, Jun Shern, Neil Chowdhury, Oliver Jaffe, et al

Spider2-V: How Far Are Multimodal Agents from Automating Data Science and Engineering Workflows? https://doi.org/10.48550/arXiv.2407.10956. Chan, Jun Shern, Neil Chowdhury, Oliver Jaffe, et al

work page doi:10.48550/arxiv.2407.10956
[4]

MLE-bench: Evaluating Machine Learning Agents on Machine Learning Engineering

MLE-Bench: Evaluating Machine Learning Agents on Machine Learning Engineering . https://doi.org/10.48550/arXiv.2410.07095. Chittepu, Yaswanth, Raghavendra Addanki, Tung Mai, Anup Rao, and Branislav Kveton

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2410.07095
[5]

https://doi.org/10.48550/arXiv.2512.00672

ML- Tool-Bench: Tool-Augmented Planning for ML Tasks . https://doi.org/10.48550/arXiv.2512.00672. De Mast, Jeroen, and Albert Trip

work page doi:10.48550/arxiv.2512.00672
[6]

Exploratory Data Analysis in Quality-Improvement Projects

“Exploratory Data Analysis in Quality-Improvement Projects. ” Journal of Quality Technology 39 (4): 301–11. https://doi.org/10.1080/00224065.2007.11917697. Egg, Alex, Martin Iglesias Goyanes, Friso Kingma, Andreu Mora, Leandro von Werra, and Thomas Wolf

work page doi:10.1080/00224065.2007.11917697 2007
[7]

Zaletel, and Joel E

DABstep: Data Agent Benchmark for Multi-Step Reasoning . https://doi.org/10.48550/arXiv .2506.23719. 14 Gao, Yunfan, Yun Xiong, Xinyu Gao, et al

work page internal anchor Pith review doi:10.48550/arxiv
[8]

Retrieval-Augmented Generation for Large Language Models: A Survey

Retrieval-Augmented Generation for Large Language Models: A Survey . https://doi.org/10.48550/arXiv.2312.10997. Good, I. J

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2312.10997
[9]

The Philosophy of Exploratory Data Analysis

“The Philosophy of Exploratory Data Analysis. ” Philosophy of Science 50 (2): 283–95. https://doi.org/10.1086/289110. Hu, Xueyu, Ziyu Zhao, Shuang Wei, et al

work page doi:10.1086/289110
[10]

https://doi.org/10.48550/arXiv.2401.05507

InfiAgent-DABench: Evaluating Agents on Data Analysis Tasks. https://doi.org/10.48550/arXiv.2401.05507. Huang, Qian, Jian Vora, Percy Liang, and Jure Leskovec

work page doi:10.48550/arxiv.2401.05507
[11]

https://doi.org/10.48550/arXiv.2410.07331

DA-Code: Agent Data Science Code Generation Benchmark for Large Language Models . https://doi.org/10.48550/arXiv.2410.07331. Jansen, Jacqueline A., Artür Manukyan, Nour Al Khoury, and Altuna Akalin

work page doi:10.48550/arxiv.2410.07331
[12]

Leveraging Large Language Models for Data Analysis Automation

“Leveraging Large Language Models for Data Analysis Automation. ” PLOS ONE 20 (2): e0317084. https: //doi.org/10.1371/journal.pone.0317084. Jimenez, Carlos E., John Yang, Alexander Wettig, et al

work page doi:10.1371/journal.pone.0317084
[13]

SWE-bench: Can Language Models Resolve Real-World GitHub Issues?

SWE-Bench: Can Language Models Resolve Real-World GitHub Issues? https://doi.org/10.48550/arXiv.2310.06770. Jing, Liqiang, Zhehui Huang, Xiaoyang Wang, et al

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2310.06770
[14]

Kahneman, Daniel, and Amos Tversky

DSBench: How Far Are Data Science Agents to Becoming Data Science Experts? https://doi.org/10.48550/arXiv.2409.07703. Kahneman, Daniel, and Amos Tversky

work page doi:10.48550/arxiv.2409.07703
[15]

Prospect Theory: An Analysis of Decision under Risk.Econometrica 1979,47, 263–292

“Prospect Theory: An Analysis of Decision Under Risk. ” Econometrica 47 (2): 263–91. https://doi.org/10.2307/1914185. Łabędzki, Rafał, Katarzyna Mikołajczyk, Anna Biłyk, and Monika Trojanowska

work page doi:10.2307/1914185
[16]

Understanding Human-AI Collaboration: A Systematic Review of Challenges and Research Methods in Manage- ment

“Understanding Human-AI Collaboration: A Systematic Review of Challenges and Research Methods in Manage- ment. ” In HCI International 2025 Posters , edited by Constantine Stephanidis, Margherita Antona, Stavroula Ntoa, and Gavriel Salvendy, vol

2025
[17]

Springer

Communications in Computer and Information Science. Springer. https://doi.org/10.1007/978-3-031-94171-9_32 . Lai, Yuhang, Chengxi Li, Yiming Wang, et al

work page doi:10.1007/978-3-031-94171-9_32
[18]

https://doi.org/10.48550/arXiv.2211.11501

DS-1000: A Natural and Reliable Benchmark for Data Science Code Generation . https://doi.org/10.48550/arXiv.2211.11501. Lei, Fangyu, Jinxiang Meng, Yiming Huang, et al

work page doi:10.48550/arxiv.2211.11501
[19]

https://doi.org/10.48550/arXiv.2512.04324

DAComp: Benchmarking Data Agents Across the Full Data Intelligence Lifecycle . https://doi.org/10.48550/arXiv.2512.04324. Li, Hanyu, Haoyu Liu, Tingyu Zhu, et al

work page doi:10.48550/arxiv.2512.04324
[20]

https://doi.org/10.48550/arXiv.2505.18223

IDA-Bench: Evaluating LLMs on Interactive Guided Data Analysis . https://doi.org/10.48550/arXiv.2505.18223. Li Vigni, Mario, Caterina Durante, and Marina Cocchi

work page doi:10.48550/arxiv.2505.18223
[21]

https://doi.org/10.1016/B978-0-444-59528- 7.00003-X

Elsevier. https://doi.org/10.1016/B978-0-444-59528- 7.00003-X. Li, Xinyi, Sai Wang, Siqi Zeng, Yu Wu, and Yi Yang

work page doi:10.1016/b978-0-444-59528-
[22]

https://doi.org/10.48550/arXiv.2401.02982

BIBench: Benchmarking Data Analysis Knowledge of Large Language Models . https://doi.org/10.48550/arXiv.2401.02982. Liu, Xinyu, Shuyu Shen, Boyan Li, Nan Tang, and Yuyu Luo

work page doi:10.48550/arxiv.2401.02982
[23]

NL2SQL-BUGs: A Benchmark for Detecting Semantic Errors in NL2SQL Translation

“NL2SQL-BUGs: A Benchmark for Detecting Semantic Errors in NL2SQL Translation. ” Proceedings of the 31st ACM SIGKDD Conference on Knowledge Discovery and Data Mining , August, 5662–73. https://doi.org/10.1145/37 15 11896.3737427. Luo, Tianqi, Chuhan Huang, Leixian Shen, et al

work page doi:10.1145/37
[24]

https://doi.org/10.48550/ARXIV.2503.12880

nvBench 2.0: Resolving Ambiguity in Text-to- Visualization Through Stepwise Reasoning . https://doi.org/10.48550/ARXIV.2503.12880. Morgenthaler, Stephan

work page doi:10.48550/arxiv.2503.12880
[25]

Exploratory Data Analysis

“Exploratory Data Analysis. ” WIREs Computational Statistics 1 (1): 33–44. https://doi.org/10.1002/wics.2. Rahman, Mizanur, Md Tahmid Rahman Laskar, Shafiq Joty, and Enamul Hoque

work page doi:10.1002/wics.2
[26]

https: //doi.org/10.48550/arXiv.2507.19969

Text2Vis: A Challenging and Diverse Benchmark for Generating Multimodal Visualizations from Text . https: //doi.org/10.48550/arXiv.2507.19969. Tang, Xiangru, Yuliang Liu, Zefan Cai, et al

work page doi:10.48550/arxiv.2507.19969
[27]

https://doi.org/10.48550/arXiv.2311

ML-Bench: Evaluating Large Language Models and Agents for Machine Learning Tasks on Repository-Level Code . https://doi.org/10.48550/arXiv.2311. 09835. Tukey, John Wilder

work page doi:10.48550/arxiv.2311
[28]

Advances in Prospect Theory: Cumulative Representa- tion of Uncertainty

“Advances in Prospect Theory: Cumulative Representa- tion of Uncertainty. ” Journal of Risk and Uncertainty 5 (4): 297–323. https://doi.org/10.1007/BF 00122574. Wang, Ziting, Shize Zhang, Haitao Yuan, et al

work page doi:10.1007/bf
[29]

FDABench: A Benchmark for Data Agents on Analytical Queries over Heterogeneous Data

FDABench: A Benchmark for Data Agents on Analytical Queries over Heterogeneous Data . https://doi.org/10.48550/arXiv.2509.02473. Zhang, Dan, Sining Zhoubian, Min Cai, et al

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2509.02473
[30]

https://doi.org/10.48550/arXiv.2502.13897

DataSciBench: An LLM Agent Benchmark for Data Science. https://doi.org/10.48550/arXiv.2502.13897. Zhang, Yuge, Qiyang Jiang, Xingyu Han, Nan Chen, Yuqing Yang, and Kan Ren

work page doi:10.48550/arxiv.2502.13897
[31]

https://doi.org/10.48550/arXiv.2402.17168

Benchmarking Data Science Agents . https://doi.org/10.48550/arXiv.2402.17168. Zhu, Zhenghao, Yuanfeng Song, Xin Chen, et al

work page doi:10.48550/arxiv.2402.17168
[32]

https://doi.org/10.48550/arXiv.2511.2288

InsightEval: An Expert-Curated Benchmark for Assessing Insight Discovery in LLM-Driven Data Agents . https://doi.org/10.48550/arXiv.2511.2288

work page doi:10.48550/arxiv.2511.2288

[1] [1]

Exploratory Text Data Analysis for Quality Hypothesis Generation

“Exploratory Text Data Analysis for Quality Hypothesis Generation. ” Quality Engineering 30 (4): 701–12. https://doi.org/10.1080/0898 2112.2018.1481216. Cai, Yifu, Xinyu Li, Mononito Goswami, Michał Wiliński, Gus Welter, and Artur Dubrawski

work page doi:10.1080/0898 2018

[2] [2]

https://doi.org/10.48550/ARXIV.2505.13291

TimeSeriesGym: A Scalable Benchmark for (Time Series) Machine Learning Engineering Agents . https://doi.org/10.48550/ARXIV.2505.13291. Cao, Ruisheng, Fangyu Lei, Haoyuan Wu, et al

work page doi:10.48550/arxiv.2505.13291

[3] [3]

Chan, Jun Shern, Neil Chowdhury, Oliver Jaffe, et al

Spider2-V: How Far Are Multimodal Agents from Automating Data Science and Engineering Workflows? https://doi.org/10.48550/arXiv.2407.10956. Chan, Jun Shern, Neil Chowdhury, Oliver Jaffe, et al

work page doi:10.48550/arxiv.2407.10956

[4] [4]

MLE-bench: Evaluating Machine Learning Agents on Machine Learning Engineering

MLE-Bench: Evaluating Machine Learning Agents on Machine Learning Engineering . https://doi.org/10.48550/arXiv.2410.07095. Chittepu, Yaswanth, Raghavendra Addanki, Tung Mai, Anup Rao, and Branislav Kveton

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2410.07095

[5] [5]

https://doi.org/10.48550/arXiv.2512.00672

ML- Tool-Bench: Tool-Augmented Planning for ML Tasks . https://doi.org/10.48550/arXiv.2512.00672. De Mast, Jeroen, and Albert Trip

work page doi:10.48550/arxiv.2512.00672

[6] [6]

Exploratory Data Analysis in Quality-Improvement Projects

“Exploratory Data Analysis in Quality-Improvement Projects. ” Journal of Quality Technology 39 (4): 301–11. https://doi.org/10.1080/00224065.2007.11917697. Egg, Alex, Martin Iglesias Goyanes, Friso Kingma, Andreu Mora, Leandro von Werra, and Thomas Wolf

work page doi:10.1080/00224065.2007.11917697 2007

[7] [7]

Zaletel, and Joel E

DABstep: Data Agent Benchmark for Multi-Step Reasoning . https://doi.org/10.48550/arXiv .2506.23719. 14 Gao, Yunfan, Yun Xiong, Xinyu Gao, et al

work page internal anchor Pith review doi:10.48550/arxiv

[8] [8]

Retrieval-Augmented Generation for Large Language Models: A Survey

Retrieval-Augmented Generation for Large Language Models: A Survey . https://doi.org/10.48550/arXiv.2312.10997. Good, I. J

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2312.10997

[9] [9]

The Philosophy of Exploratory Data Analysis

“The Philosophy of Exploratory Data Analysis. ” Philosophy of Science 50 (2): 283–95. https://doi.org/10.1086/289110. Hu, Xueyu, Ziyu Zhao, Shuang Wei, et al

work page doi:10.1086/289110

[10] [10]

https://doi.org/10.48550/arXiv.2401.05507

InfiAgent-DABench: Evaluating Agents on Data Analysis Tasks. https://doi.org/10.48550/arXiv.2401.05507. Huang, Qian, Jian Vora, Percy Liang, and Jure Leskovec

work page doi:10.48550/arxiv.2401.05507

[11] [11]

https://doi.org/10.48550/arXiv.2410.07331

DA-Code: Agent Data Science Code Generation Benchmark for Large Language Models . https://doi.org/10.48550/arXiv.2410.07331. Jansen, Jacqueline A., Artür Manukyan, Nour Al Khoury, and Altuna Akalin

work page doi:10.48550/arxiv.2410.07331

[12] [12]

Leveraging Large Language Models for Data Analysis Automation

“Leveraging Large Language Models for Data Analysis Automation. ” PLOS ONE 20 (2): e0317084. https: //doi.org/10.1371/journal.pone.0317084. Jimenez, Carlos E., John Yang, Alexander Wettig, et al

work page doi:10.1371/journal.pone.0317084

[13] [13]

SWE-bench: Can Language Models Resolve Real-World GitHub Issues?

SWE-Bench: Can Language Models Resolve Real-World GitHub Issues? https://doi.org/10.48550/arXiv.2310.06770. Jing, Liqiang, Zhehui Huang, Xiaoyang Wang, et al

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2310.06770

[14] [14]

Kahneman, Daniel, and Amos Tversky

DSBench: How Far Are Data Science Agents to Becoming Data Science Experts? https://doi.org/10.48550/arXiv.2409.07703. Kahneman, Daniel, and Amos Tversky

work page doi:10.48550/arxiv.2409.07703

[15] [15]

Prospect Theory: An Analysis of Decision under Risk.Econometrica 1979,47, 263–292

“Prospect Theory: An Analysis of Decision Under Risk. ” Econometrica 47 (2): 263–91. https://doi.org/10.2307/1914185. Łabędzki, Rafał, Katarzyna Mikołajczyk, Anna Biłyk, and Monika Trojanowska

work page doi:10.2307/1914185

[16] [16]

Understanding Human-AI Collaboration: A Systematic Review of Challenges and Research Methods in Manage- ment

“Understanding Human-AI Collaboration: A Systematic Review of Challenges and Research Methods in Manage- ment. ” In HCI International 2025 Posters , edited by Constantine Stephanidis, Margherita Antona, Stavroula Ntoa, and Gavriel Salvendy, vol

2025

[17] [17]

Springer

Communications in Computer and Information Science. Springer. https://doi.org/10.1007/978-3-031-94171-9_32 . Lai, Yuhang, Chengxi Li, Yiming Wang, et al

work page doi:10.1007/978-3-031-94171-9_32

[18] [18]

https://doi.org/10.48550/arXiv.2211.11501

DS-1000: A Natural and Reliable Benchmark for Data Science Code Generation . https://doi.org/10.48550/arXiv.2211.11501. Lei, Fangyu, Jinxiang Meng, Yiming Huang, et al

work page doi:10.48550/arxiv.2211.11501

[19] [19]

https://doi.org/10.48550/arXiv.2512.04324

DAComp: Benchmarking Data Agents Across the Full Data Intelligence Lifecycle . https://doi.org/10.48550/arXiv.2512.04324. Li, Hanyu, Haoyu Liu, Tingyu Zhu, et al

work page doi:10.48550/arxiv.2512.04324

[20] [20]

https://doi.org/10.48550/arXiv.2505.18223

IDA-Bench: Evaluating LLMs on Interactive Guided Data Analysis . https://doi.org/10.48550/arXiv.2505.18223. Li Vigni, Mario, Caterina Durante, and Marina Cocchi

work page doi:10.48550/arxiv.2505.18223

[21] [21]

https://doi.org/10.1016/B978-0-444-59528- 7.00003-X

Elsevier. https://doi.org/10.1016/B978-0-444-59528- 7.00003-X. Li, Xinyi, Sai Wang, Siqi Zeng, Yu Wu, and Yi Yang

work page doi:10.1016/b978-0-444-59528-

[22] [22]

https://doi.org/10.48550/arXiv.2401.02982

BIBench: Benchmarking Data Analysis Knowledge of Large Language Models . https://doi.org/10.48550/arXiv.2401.02982. Liu, Xinyu, Shuyu Shen, Boyan Li, Nan Tang, and Yuyu Luo

work page doi:10.48550/arxiv.2401.02982

[23] [23]

NL2SQL-BUGs: A Benchmark for Detecting Semantic Errors in NL2SQL Translation

“NL2SQL-BUGs: A Benchmark for Detecting Semantic Errors in NL2SQL Translation. ” Proceedings of the 31st ACM SIGKDD Conference on Knowledge Discovery and Data Mining , August, 5662–73. https://doi.org/10.1145/37 15 11896.3737427. Luo, Tianqi, Chuhan Huang, Leixian Shen, et al

work page doi:10.1145/37

[24] [24]

https://doi.org/10.48550/ARXIV.2503.12880

nvBench 2.0: Resolving Ambiguity in Text-to- Visualization Through Stepwise Reasoning . https://doi.org/10.48550/ARXIV.2503.12880. Morgenthaler, Stephan

work page doi:10.48550/arxiv.2503.12880

[25] [25]

Exploratory Data Analysis

“Exploratory Data Analysis. ” WIREs Computational Statistics 1 (1): 33–44. https://doi.org/10.1002/wics.2. Rahman, Mizanur, Md Tahmid Rahman Laskar, Shafiq Joty, and Enamul Hoque

work page doi:10.1002/wics.2

[26] [26]

https: //doi.org/10.48550/arXiv.2507.19969

Text2Vis: A Challenging and Diverse Benchmark for Generating Multimodal Visualizations from Text . https: //doi.org/10.48550/arXiv.2507.19969. Tang, Xiangru, Yuliang Liu, Zefan Cai, et al

work page doi:10.48550/arxiv.2507.19969

[27] [27]

https://doi.org/10.48550/arXiv.2311

ML-Bench: Evaluating Large Language Models and Agents for Machine Learning Tasks on Repository-Level Code . https://doi.org/10.48550/arXiv.2311. 09835. Tukey, John Wilder

work page doi:10.48550/arxiv.2311

[28] [28]

Advances in Prospect Theory: Cumulative Representa- tion of Uncertainty

“Advances in Prospect Theory: Cumulative Representa- tion of Uncertainty. ” Journal of Risk and Uncertainty 5 (4): 297–323. https://doi.org/10.1007/BF 00122574. Wang, Ziting, Shize Zhang, Haitao Yuan, et al

work page doi:10.1007/bf

[29] [29]

FDABench: A Benchmark for Data Agents on Analytical Queries over Heterogeneous Data

FDABench: A Benchmark for Data Agents on Analytical Queries over Heterogeneous Data . https://doi.org/10.48550/arXiv.2509.02473. Zhang, Dan, Sining Zhoubian, Min Cai, et al

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2509.02473

[30] [30]

https://doi.org/10.48550/arXiv.2502.13897

DataSciBench: An LLM Agent Benchmark for Data Science. https://doi.org/10.48550/arXiv.2502.13897. Zhang, Yuge, Qiyang Jiang, Xingyu Han, Nan Chen, Yuqing Yang, and Kan Ren

work page doi:10.48550/arxiv.2502.13897

[31] [31]

https://doi.org/10.48550/arXiv.2402.17168

Benchmarking Data Science Agents . https://doi.org/10.48550/arXiv.2402.17168. Zhu, Zhenghao, Yuanfeng Song, Xin Chen, et al

work page doi:10.48550/arxiv.2402.17168

[32] [32]

https://doi.org/10.48550/arXiv.2511.2288

InsightEval: An Expert-Curated Benchmark for Assessing Insight Discovery in LLM-Driven Data Agents . https://doi.org/10.48550/arXiv.2511.2288

work page doi:10.48550/arxiv.2511.2288