pith. machine review for the scientific record.

arxiv: 2604.24198 · v1 · submitted 2026-04-27 · 💻 cs.CL · cs.AI · cs.CE · cs.LG · cs.MA

Recognition: unknown

Rewarding the Scientific Process: Process-Level Reward Modeling for Agentic Data Analysis

Authors on Pith: no claims yet

Pith reviewed 2026-05-08 03:43 UTC · model grok-4.3

classification 💻 cs.CL · cs.AI · cs.CE · cs.LG · cs.MA
keywords process reward models · data analysis agents · large language models · silent errors · agentic workflows · reinforcement learning · scientific data analysis

The pith

DataPRM improves AI data analysis agents by actively probing execution states to catch silent errors and applying ternary rewards that separate fixable mistakes from fatal ones.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Standard process reward models trained on static domains like math fail on dynamic data analysis because they overlook logical flaws that produce wrong answers without raising errors and penalize necessary exploration steps. The authors introduce DataPRM, which interacts with the environment to verify intermediate results and uses a reflection-aware three-category reward to guide agents more precisely. They build a pipeline that generates and annotates over 8,000 diverse trajectories to train this model. When used for selecting outputs or inside reinforcement learning loops, DataPRM raises performance on multiple agent benchmarks for scientific and tabular data tasks.

Core claim

DataPRM is an environment-aware generative process reward model that autonomously interacts with the execution environment to probe intermediate states and detect silent errors, while employing a reflection-aware ternary reward strategy that distinguishes correctable grounding errors from irrecoverable mistakes, thereby providing more effective supervision for agentic data analysis than general-domain PRMs.

What carries the argument

DataPRM, a generative process reward model that acts as an active verifier through environment probing and applies reflection-aware ternary rewards to separate fixable from fatal errors.
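
The abstract specifies this verifier only at a high level. As a minimal sketch of the two mechanisms together (the label names, the `StepOutcome` container, and the probe interface are our assumptions, not the authors' implementation):

```python
from dataclasses import dataclass
from enum import Enum
import pandas as pd

class StepReward(Enum):
    # Hypothetical ternary labels; the paper's own category names may differ.
    CORRECT = 1        # step advances the analysis correctly
    RECOVERABLE = 0    # visible grounding error the agent can observe and fix
    FATAL = -1         # silent error that corrupts downstream results

@dataclass
class StepOutcome:
    raised: bool       # did the interpreter throw an exception?
    namespace: dict    # intermediate variables after executing the step

def score_step(outcome: StepOutcome, probes) -> StepReward:
    """Environment-aware ternary scoring, reduced to its skeleton.

    `probes` are read-only checks the verifier itself runs against the
    execution state; each returns True when the state matches the step's
    stated intent. This is the "active verifier" idea in miniature.
    """
    if outcome.raised:
        # Exceptions are visible to the agent, so a reflection-aware
        # scheme can treat them as correctable rather than fatal.
        return StepReward.RECOVERABLE
    if all(probe(outcome.namespace) for probe in probes):
        return StepReward.CORRECT
    # Ran without error yet failed a probe: a silent error, exactly the
    # failure mode the paper says general-domain PRMs miss.
    return StepReward.FATAL

# Toy case: a step claims to have dropped missing rows from `df`,
# but the (buggy) code filled them instead, raising no exception.
ns = {"df": pd.DataFrame({"x": [1.0, None, 3.0]}).fillna(0)}
probes = [lambda ns: not ns["df"]["x"].isna().any(),  # passes, misleadingly
          lambda ns: len(ns["df"]) == 2]              # fails: rows remain
print(score_step(StepOutcome(raised=False, namespace=ns), probes))
# -> StepReward.FATAL
```

The key move is that the verifier executes checks of its own against interpreter state, so a step that runs cleanly can still be scored as fatal.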

If this is right

  • DataPRM raises policy LLM accuracy by 7.21 percent on ScienceAgentBench and 11.28 percent on DABStep when used for Best-of-N selection (the selection protocol is sketched after this list).
  • Inserting DataPRM into reinforcement learning training produces 78.73 percent on DABench and 64.84 percent on TableBench, exceeding outcome-reward baselines.
  • A 4B-parameter DataPRM already beats larger strong baselines and maintains gains across multiple test-time scaling methods.
  • The model generalizes to diverse data analysis environments without task-specific retraining.
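
For readers who have not met the protocol: Best-of-N samples N full trajectories from the policy and keeps the one the reward model prefers, so any gain is attributable to the scorer rather than the generator. A generic sketch, with an aggregation rule (mean of step scores) that is our choice, not necessarily the paper's:

```python
def best_of_n(policy, reward_model, task, n=8):
    """Select the highest-scoring of n sampled trajectories.

    Assumed interfaces (not from the paper): `policy(task)` samples one
    trajectory as a list of steps; `reward_model.score(traj, step)`
    returns a scalar process reward for that step in context.
    """
    def trajectory_score(traj):
        # Mean step score; min or last-step aggregation are common alternatives.
        return sum(reward_model.score(traj, s) for s in traj) / len(traj)

    candidates = [policy(task) for _ in range(n)]
    return max(candidates, key=trajectory_score)
```

The same scorer can instead shape step-level rewards inside an RL loop, which is how the DABench and TableBench numbers in the second bullet are obtained.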

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • Similar environment-probing reward models could be built for other agentic domains such as code generation or experimental design where silent failures are common.
  • Making reward models interactive rather than purely passive may become a standard requirement once agents operate in rich, stateful environments.
  • The scalable trajectory-generation pipeline could be reused to create process supervision data for additional scientific workflows beyond the four benchmarks tested.

Load-bearing premise

The reported gains on downstream tasks arise from the environment-aware probing and ternary reward design rather than from the size, quality, or diversity of the 8K training instances or from unstated differences in evaluation setups.

What would settle it

Train a control reward model on the identical 8K trajectories but without environment probing or ternary labels, then measure whether the downstream improvements on ScienceAgentBench and DABStep disappear under the same Best-of-N and RL protocols.
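
On the label side of that control, the cheapest version is to collapse the ternary annotations to binary before training, leaving trajectories, annotation text, and training code untouched. A sketch, under the assumption that each instance carries a "step_label" field with the three (illustrative) values below:

```python
def collapse_to_binary(instances):
    """Ablation control: erase only the ternary distinction.

    Assumes each training instance is a dict with a "step_label" in
    {"correct", "recoverable", "fatal"} (invented names). Merging the
    two error classes yields a conventional binary PRM target; any
    downstream gap that survives is then evidence for the ternary
    design itself rather than for the data.
    """
    return [{**inst,
             "step_label": "good" if inst["step_label"] == "correct" else "bad"}
            for inst in instances]
```

Removing environment probing is the harder half of the control: labels would have to be regenerated from step text alone, so that variant changes the annotations as well as the model.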

Figures

Figures reproduced from arXiv: 2604.24198 by Huajun Chen, Kewei Xu, Lun Du, Ningyu Zhang, Shuofei Qiao, Yuqi Zhu, Zhisong Qiu.

Figure 1. The Collaborative Pipeline Between Data Analysis Agent and Process Reward Model (PRM). The agent addresses data … view at source ↗
Figure 2. (a): General PRMs’ Best-of-N performance on subset of DABStep. (b): General PRMs’ scores on steps with grounding … view at source ↗
Figure 3. Overview of DataPRM Framework. (a): A diversity-driven trajectory generation strategy followed by knowledge … view at source ↗
Figure 4. Performance of DataPRM evaluated under two ex… view at source ↗
Figure 5. Experiment results on RL training and benchmarks. (a): The evaluation results on DABench and TableBench for … view at source ↗
read the original abstract

Process Reward Models (PRMs) have achieved remarkable success in augmenting the reasoning capabilities of Large Language Models (LLMs) within static domains such as mathematics. However, their potential in dynamic data analysis tasks remains underexplored. In this work, we first present an empirical study revealing that general-domain PRMs struggle to supervise data analysis agents. Specifically, they fail to detect silent errors (logical flaws that yield incorrect results without triggering interpreter exceptions) and erroneously penalize exploratory actions, mistaking necessary trial-and-error exploration for grounding failures. To bridge this gap, we introduce DataPRM, a novel environment-aware generative process reward model that (1) can serve as an active verifier, autonomously interacting with the environment to probe intermediate execution states and uncover silent errors, and (2) employs a reflection-aware ternary reward strategy that distinguishes between correctable grounding errors and irrecoverable mistakes. We design a scalable pipeline to construct over 8K high-quality training instances for DataPRM via diversity-driven trajectory generation and knowledge-augmented step-level annotation. Experimental results demonstrate that DataPRM improves downstream policy LLMs by 7.21% on ScienceAgentBench and 11.28% on DABStep using Best-of-N inference. Notably, with only 4B parameters, DataPRM outperforms strong baselines and exhibits robust generalizability across diverse Test-Time Scaling strategies. Furthermore, integrating DataPRM into Reinforcement Learning yields substantial gains over outcome-reward baselines, achieving 78.73% on DABench and 64.84% on TableBench, validating the effectiveness of process reward supervision. Code is available at https://github.com/zjunlp/DataMind.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper claims that existing general-domain process reward models (PRMs) fail to supervise data analysis agents because they cannot detect silent errors (logical flaws without interpreter exceptions) and incorrectly penalize necessary exploratory actions. It introduces DataPRM, an environment-aware generative PRM that autonomously probes execution states to uncover silent errors and uses a reflection-aware ternary reward strategy to distinguish correctable grounding errors from irrecoverable mistakes. A scalable pipeline generates over 8K training instances via diversity-driven trajectories and knowledge-augmented annotation. Experiments report that DataPRM boosts downstream policy LLMs by 7.21% on ScienceAgentBench and 11.28% on DABStep under Best-of-N inference, outperforms strong baselines despite having only 4B parameters, generalizes across test-time scaling methods, and when integrated into RL yields 78.73% on DABench and 64.84% on TableBench, outperforming outcome-reward baselines.

Significance. If the reported gains are causally attributable to the environment-aware probing and ternary reward design rather than data curation alone, the work meaningfully extends process-level supervision from static mathematical reasoning to dynamic, interactive agentic data analysis. It offers a concrete mechanism for active verification in environments with silent failures and provides empirical evidence that process rewards can improve both inference-time selection and RL training of scientific agents.

major comments (2)
  1. [Experiments] The central empirical claims (7.21% and 11.28% Best-of-N gains, plus RL improvements) rest on comparisons whose attribution to environment-aware probing and the ternary strategy is not isolated. No ablation is described that holds the 8K-instance dataset and training procedure fixed while removing autonomous probing (replacing it with static rewards) or switching from ternary to binary rewards; without such controls it remains possible that differences in trajectory diversity, annotation quality, or evaluation harness explain the deltas.
  2. [Method] The reflection-aware ternary reward strategy is introduced as distinguishing 'correctable grounding errors' from 'irrecoverable mistakes,' yet the manuscript provides no formal definition, decision procedure, or example annotation guidelines for assigning the three categories during step-level labeling. This makes it impossible to assess whether the strategy is reproducible or whether its benefit is separable from the knowledge-augmented annotation pipeline.
minor comments (2)
  1. [Abstract] Baseline models are referred to only as 'strong baselines' or 'general-domain PRMs' without explicit names, sizes, or training details in the abstract or early sections, complicating direct replication of the reported margins.
  2. [Abstract] The abstract states that DataPRM 'exhibits robust generalizability across diverse Test-Time Scaling strategies' but does not list the specific strategies tested or report per-strategy breakdowns.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We address the major comments point by point below and outline the revisions we will make to strengthen the empirical attribution and methodological clarity.

read point-by-point responses
  1. Referee: [Experiments] The central empirical claims (7.21% and 11.28% Best-of-N gains, plus RL improvements) rest on comparisons whose attribution to environment-aware probing and the ternary strategy is not isolated. No ablation is described that holds the 8K-instance dataset and training procedure fixed while removing autonomous probing (replacing it with static rewards) or switching from ternary to binary rewards; without such controls it remains possible that differences in trajectory diversity, annotation quality, or evaluation harness explain the deltas.

    Authors: We agree that the reported gains would be more convincingly attributed to the environment-aware probing and ternary reward design if controlled ablations were provided that hold the 8K training instances and overall training procedure fixed. The current experiments compare DataPRM against strong baselines (including general-domain PRMs), but do not isolate these two components via static-reward or binary-reward variants on the same data. In the revised manuscript we will add these ablations, reporting performance deltas when probing is replaced by static rewards and when the ternary scheme is replaced by binary rewards, thereby clarifying the specific contributions of each design choice. revision: yes

  2. Referee: [Method] The reflection-aware ternary reward strategy is introduced as distinguishing 'correctable grounding errors' from 'irrecoverable mistakes,' yet the manuscript provides no formal definition, decision procedure, or example annotation guidelines for assigning the three categories during step-level labeling. This makes it impossible to assess whether the strategy is reproducible or whether its benefit is separable from the knowledge-augmented annotation pipeline.

    Authors: We acknowledge that the manuscript does not supply formal definitions, a decision procedure, or annotation examples for the three ternary categories. This omission hinders reproducibility assessment and separation from the broader annotation pipeline. In the revision we will insert a dedicated subsection that (i) formally defines each category, (ii) provides an explicit decision tree or rubric for step-level labeling, and (iii) includes several concrete annotation examples drawn from the 8K dataset, thereby enabling readers to reproduce the labeling process and evaluate the ternary strategy independently of the knowledge-augmented pipeline. revision: yes
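
To make the referee's reproducibility point concrete, the promised rubric could be as small as one decision function. Everything below, including the field names, is our illustration of what such a rubric might specify, not the authors' guideline:

```python
def ternary_label(step: dict) -> str:
    """Illustrative step-labeling rubric (invented for this review).

    Assumed boolean fields on `step`:
      exploratory   - the step only inspects data (head, dtypes, schema)
      raised        - the interpreter threw a visible exception
      state_matches - probed intermediate state agrees with the step's intent
    """
    if step["exploratory"]:
        # Trial-and-error inspection is not an error; the paper's empirical
        # study faults general PRMs for penalizing exactly these steps.
        return "correct"
    if step["raised"]:
        # Visible failure the agent can observe and repair.
        return "recoverable"
    return "correct" if step["state_matches"] else "fatal"
```

A published rubric of this shape, plus a handful of annotated examples, would let readers audit inter-annotator consistency directly.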

Circularity Check

0 steps flagged

No significant circularity; empirical model training and benchmarking

full rationale

The paper contains no mathematical derivation, equations, or functional forms. It describes an empirical pipeline for generating 8K training instances via diversity-driven trajectories and knowledge-augmented annotation, followed by training DataPRM and evaluating downstream improvements on external benchmarks (ScienceAgentBench, DABStep, DABench, TableBench). Performance deltas are reported as experimental outcomes rather than predictions derived from fitted parameters or self-referential definitions. No self-citation load-bearing steps, uniqueness theorems, or ansatzes are invoked to justify core claims; the contribution reduces to standard supervised learning and ablation-style benchmarking, with no step whose conclusion follows from its inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The abstract relies on standard assumptions of LLM-based agent training and reward modeling without enumerating free parameters or new axioms; the new model itself is the primary addition.

axioms (1)
  • domain assumption Process-level reward signals can effectively guide LLM agents in dynamic environments when they include environment interaction and error-type differentiation.
    Implicit in the design of DataPRM and the claim that it outperforms general PRMs.
invented entities (1)
  • DataPRM · no independent evidence
    purpose: Environment-aware generative process reward model for data analysis agents
    New model introduced to address identified limitations of prior PRMs.

pith-pipeline@v0.9.0 · 5638 in / 1409 out tokens · 69290 ms · 2026-05-08T03:43:41.171135+00:00 · methodology

discussion (0)


Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. From Table to Cell: Attention for Better Reasoning with TABALIGN

    cs.AI · 2026-05 · unverdicted · novelty 7.0

    TABALIGN pairs a diffusion language model planner emitting binary cell masks with a trained attention verifier, raising average accuracy 15.76 points over strong baselines on eight table benchmarks while speeding exec...

Reference graph

Works this paper leans on

98 extracted references · 63 canonical work pages · cited by 1 Pith paper · 12 internal anchors
