
arxiv: 2605.22564 · v1 · pith:OSF2LBR6new · submitted 2026-05-21 · 💻 cs.CL · cs.LG · cs.SE

SynAE: A Framework for Measuring the Quality of Synthetic Data for Tool-Calling Agent Evaluations

Pith reviewed 2026-05-22 06:21 UTC · model grok-4.3

classification 💻 cs.CL · cs.LG · cs.SE
keywords synthetic data · tool-calling agents · evaluation framework · data validity · data fidelity · data diversity · multi-turn agents · agent benchmarks

The pith

SynAE shows that synthetic data for tool-calling agents needs checks across validity, fidelity, and diversity rather than any single metric.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces SynAE to quantify how closely synthetic datasets match real execution traces when testing multi-turn tool-calling agents. Real data is often unavailable because of sensitivity or sparsity, so practitioners turn to synthetic replacements, yet it is unclear how well those replacements preserve the properties needed for reliable evaluation. SynAE applies metrics in four categories covering instructions and responses, tool calls, final outputs, and downstream performance to measure validity, fidelity, and diversity. Experiments on existing benchmarks plus controlled synthetic data with injected failure modes reveal that different quality problems appear at different scales and that any one metric leaves important gaps undetected. This leads to the conclusion that a multi-axis evaluation is required to understand synthetic data quality for agent testing.

Core claim

SynAE assesses the validity, fidelity, and diversity of synthetic data for multi-turn tool-calling agents across four metric categories: task instructions and intermediate responses, tool calls, final outputs, and downstream evaluation. When tested on recent agent benchmarks and data produced by realistic controlled generation schemes that simulate common failure modes, SynAE detects fine-grained variations in these dimensions and demonstrates that no single metric fully characterizes synthetic data quality, motivating multi-axis evaluation instead.

What carries the argument

The argument rests on the SynAE evaluation framework, which applies separate metric sets for validity, fidelity, and diversity across four categories: task instructions and intermediate responses, tool calls, final outputs, and downstream results.
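To make the multi-axis idea concrete, here is a minimal sketch restricted to the tool-call category. It is not SynAE's implementation: the schema `TOOL_SCHEMA`, the three toy scores (schema-based validity rate, one minus total variation distance over tool names as fidelity, share of distinct calls as diversity), and the sample calls are all assumptions chosen for illustration.

```python
from collections import Counter

# Hypothetical tool schema (an assumption for illustration):
# tool name -> set of required argument names.
TOOL_SCHEMA = {
    "search_flights": {"origin", "destination"},
    "book_hotel": {"city", "nights"},
}


def validity_rate(calls):
    """Fraction of calls with a known tool name and all required arguments."""
    if not calls:
        return 0.0
    valid = sum(
        1
        for call in calls
        if call["name"] in TOOL_SCHEMA
        and TOOL_SCHEMA[call["name"]] <= set(call["args"])
    )
    return valid / len(calls)


def tool_fidelity(real_calls, synth_calls):
    """1 minus total variation distance between tool-name frequencies.

    Both inputs are assumed to be non-empty.
    """
    p = Counter(call["name"] for call in real_calls)
    q = Counter(call["name"] for call in synth_calls)
    n_p, n_q = sum(p.values()), sum(q.values())
    tvd = 0.5 * sum(abs(p[k] / n_p - q[k] / n_q) for k in set(p) | set(q))
    return 1.0 - tvd


def diversity(calls):
    """Share of distinct (name, arguments) combinations among the calls."""
    if not calls:
        return 0.0
    distinct = {
        (call["name"], tuple(sorted(call["args"].items()))) for call in calls
    }
    return len(distinct) / len(calls)


def profile(real_calls, synth_calls):
    """Collect the three toy scores into one multi-axis profile."""
    return {
        "validity": validity_rate(synth_calls),
        "fidelity": tool_fidelity(real_calls, synth_calls),
        "diversity": diversity(synth_calls),
    }


# Made-up "real" tool calls.
real = [
    {"name": "search_flights", "args": {"origin": "SFO", "destination": "JFK"}},
    {"name": "book_hotel", "args": {"city": "NYC", "nights": 2}},
    {"name": "search_flights", "args": {"origin": "LAX", "destination": "SEA"}},
    {"name": "book_hotel", "args": {"city": "Seattle", "nights": 1}},
]
# Toy oversampling: repeat the first two real calls.
oversampled = real[:2] * 2

print(profile(real, oversampled))
# {'validity': 1.0, 'fidelity': 1.0, 'diversity': 0.5}
```

In this toy case the oversampled set looks perfect on validity and on tool-name fidelity, and only the diversity score exposes the duplication, which is the kind of gap a single-metric check would miss.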

If this is right

  • Synthetic data generators can be adjusted to fix specific weaknesses in validity or diversity once SynAE identifies them.
  • Agent evaluations become more reliable when the test data passes checks on all four metric categories.
  • Different synthetic data methods can be compared directly by their scores on the same multi-axis set.
  • Practitioners can decide whether a given synthetic dataset is sufficient for pre-deployment testing by inspecting its profile across the axes.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same multi-axis approach could be adapted to judge synthetic data for other agent types such as web navigation or code generation agents.
  • Widespread adoption of SynAE-style checks might reduce the volume of real user data needed for testing and thereby improve privacy protections during agent development.
  • Automated pipelines could optimize new synthetic data generators to maximize scores on the full set of SynAE metrics.

Load-bearing premise

The controlled and realistic generation schemes used to create test synthetic data accurately represent the common failure modes that occur when practitioners generate synthetic data for tool-calling agent evaluations in production settings.

What would settle it

A collection of synthetic datasets in which one metric correlates perfectly with all other validity, fidelity, diversity, and downstream performance measures would falsify the claim that multiple axes are required.
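One way to read this falsification test operationally is sketched below. It assumes Pearson correlation as the notion of "correlates perfectly" (the page does not say which correlation is meant), and the metric names and per-dataset scores are invented for illustration.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length score lists.

    Assumes neither list is constant (otherwise the denominator is zero).
    """
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    norm_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    norm_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (norm_x * norm_y)


def sufficient_metrics(scores, tol=1e-9):
    """Return metrics whose |r| with every other metric is 1 within tol.

    `scores` maps a metric name to a list with one score per dataset.
    """
    names = list(scores)
    return [
        m
        for m in names
        if all(
            abs(pearson(scores[m], scores[other])) >= 1 - tol
            for other in names
            if other != m
        )
    ]


# Hypothetical scores for four synthetic datasets (made-up numbers).
scores = {
    "validity": [1.0, 0.9, 0.5, 0.5],
    "fidelity": [0.9, 0.9, 0.8, 0.4],
    "diversity": [0.2, 0.8, 0.6, 0.7],
}

print(sufficient_metrics(scores))  # [] -> no single metric stands in for the rest
```

An empty result is consistent with the multi-axis claim for that collection; a non-empty result on a broad collection of datasets would be the counterexample described above.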

Figures

Figures reproduced from arXiv: 2605.22564 by Aadyaa Maddi, Giulia Fanti, Shuaiqi Wang, Zinan Lin.

Figure 1: The SynAE framework evaluates the quality of synthetic data used in agent evaluations.
Figure 2: Agent trajectory from T1 [4] benchmark dataset, with notation for each component.
Figure 3: Fidelity of Blank Filling and Oversampling on the T1 dataset.
Figure 4: Diversity metrics for Blank Filling and Oversampling on the T1 dataset.
Figure 5: Validity of Invalidation on T1. As invalidation ratio v increases, Validity Rates for both tool calls and outputs decrease.
Figure 6: Fidelity vs. diversity for Blank Filling and …
Figure 7: Fidelity of In-Context Generation under T1 with fixed or randomized in-context examples.
Figure 8: Fidelity of Blank Filling and Oversampling on the BFCL dataset.
Figure 9: Fidelity of In-Context Generation under BFCL with fixed and randomly sampled in-context …
Original abstract

Today, tool-calling agents are commonly evaluated or tested on static datasets of execution traces, including input commands, agent responses, and associated tool calls. However, internal production datasets are often insufficient or unusable for testing; for example, they may contain sensitive or proprietary data, or they may be too sparse to support comprehensive testing (especially pre-deployment). In these settings, practitioners are increasingly replacing or augmenting real datasets with synthetic ones for evaluation purposes. A key challenge is quantifying the relation between these synthetic datasets and the real data. We introduce SynAE, an evaluation framework for assessing how well synthetic benchmarks for multi-turn, tool-calling agents replicate and augment the characteristics of real data trajectories. SynAE assesses the validity, fidelity, and diversity of synthetic data across four metric categories: (i) task instructions and intermediate responses, (ii) tool calls, (iii) final outputs, and (iv) downstream evaluation. We evaluate SynAE using recent agent benchmarks and test common synthetic data failure modes via realistic and controlled generation schemes. SynAE detects fine-grained variations in data validity, fidelity and diversity, and shows that no single metric is sufficient to fully characterize synthetic data quality, motivating a multi-axis evaluation of synthetic data for agent testing. A demo of SynAE is available at https://synae-2026-synae-demo.static.hf.space/index.html, with code at https://github.com/wsqwsq/SynAE.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 2 minor

Summary. The paper introduces SynAE, a framework for assessing synthetic data quality for multi-turn tool-calling agent evaluations. It defines metrics for validity, fidelity, and diversity across four categories—task instructions and intermediate responses, tool calls, final outputs, and downstream evaluation—and evaluates the framework on recent agent benchmarks by applying controlled generation schemes that inject common failure modes. The central result is that SynAE detects fine-grained variations in these dimensions and that no single metric suffices to characterize synthetic data quality, motivating multi-axis evaluation.

Significance. If the evaluation holds, SynAE offers a practical, multi-dimensional tool for practitioners who must rely on synthetic data when real execution traces are sparse or sensitive. The public demo and GitHub code are explicit strengths that support reproducibility and adoption. The work directly addresses a growing need in agent benchmarking and could influence how synthetic datasets are validated before deployment.

major comments (1)
  1. [§4] §4 (Evaluation and Experiments): The claim that the controlled generation schemes are 'realistic' and capture common failure modes is load-bearing for the recommendation of multi-axis evaluation in practice. The manuscript does not report a side-by-side statistical comparison (e.g., distribution of tool-call validity rates, response diversity scores) between the injected schemes and synthetic data produced by standard LLM pipelines with typical prompting and temperature settings. Without this, it remains unclear whether SynAE’s sensitivity generalizes beyond the experimental construction.
minor comments (2)
  1. [Abstract] Abstract: The phrase 'recent agent benchmarks' is used without naming the specific datasets or citations; adding the exact benchmark names would improve traceability.
  2. [Results] Figure 3 (or equivalent results figure): The color scale and legend for the multi-metric heatmaps are difficult to read at standard print size; increasing font size or adding a supplementary table of raw values would aid interpretation.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive feedback and positive overall assessment of SynAE. We agree that additional evidence supporting the realism of the controlled generation schemes would strengthen the practical implications of the multi-axis evaluation recommendation. We will incorporate the suggested comparison in the revised manuscript.

Point-by-point responses
  1. Referee: [§4] §4 (Evaluation and Experiments): The claim that the controlled generation schemes are 'realistic' and capture common failure modes is load-bearing for the recommendation of multi-axis evaluation in practice. The manuscript does not report a side-by-side statistical comparison (e.g., distribution of tool-call validity rates, response diversity scores) between the injected schemes and synthetic data produced by standard LLM pipelines with typical prompting and temperature settings. Without this, it remains unclear whether SynAE’s sensitivity generalizes beyond the experimental construction.

    Authors: We acknowledge the value of a direct statistical comparison to demonstrate that the injected failure modes align with those arising from standard LLM-based synthetic data generation. The schemes in the current manuscript were constructed from failure modes documented in prior agent evaluation literature and observed in our preliminary experiments with production-style traces. To address the concern, we will add a new analysis in the revised Section 4 (or an appendix) that generates parallel synthetic datasets using common LLM pipelines (e.g., zero-shot prompting with temperature 0.7 and 1.0) on the same underlying tasks. We will then report side-by-side distributions for key metrics such as tool-call validity rates and response diversity scores, allowing readers to assess how closely the controlled schemes match typical synthetic outputs. revision: yes
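A side-by-side distribution comparison of the kind promised here could be sketched as follows. The two-sample Kolmogorov-Smirnov statistic is an assumed choice (the rebuttal does not name a test), and the per-trajectory validity rates are made-up numbers, not results from the paper.

```python
from bisect import bisect_right


def ks_statistic(sample_a, sample_b):
    """Two-sample KS statistic: largest gap between the empirical CDFs."""
    a, b = sorted(sample_a), sorted(sample_b)

    def ecdf(sample, x):
        # Fraction of the sorted sample that is <= x.
        return bisect_right(sample, x) / len(sample)

    points = sorted(set(a) | set(b))
    return max(abs(ecdf(a, x) - ecdf(b, x)) for x in points)


# Hypothetical tool-call validity rates per trajectory (illustrative only).
controlled_scheme = [0.50, 0.55, 0.60, 0.65, 0.70]
llm_pipeline = [0.52, 0.58, 0.61, 0.66, 0.90]

gap = ks_statistic(controlled_scheme, llm_pipeline)
print(f"KS statistic: {gap:.2f}")
```

A statistic near 0 would suggest the controlled scheme produces a validity-rate distribution similar to the LLM pipeline's, while a value near 1 would suggest the injected failure modes are not representative; a significance threshold would still need to be chosen separately.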

Circularity Check

0 steps flagged

SynAE metrics defined independently; no reduction to inputs by construction

full rationale

The paper defines SynAE as a multi-category metric framework (task instructions/responses, tool calls, final outputs, downstream evaluation) for validity/fidelity/diversity and applies it empirically to synthetic data created with injected failure modes via controlled generation schemes on existing agent benchmarks. The central result—that fine-grained variations are detected and no single metric suffices—is an observation from these experiments rather than a quantity fitted from the test data or derived tautologically from the same definitions. No self-definitional steps, fitted-input predictions, or load-bearing self-citations appear in the derivation; the framework functions as an external assessment tool whose outputs are not forced by its own construction.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The paper introduces an evaluation framework rather than a first-principles derivation, so the ledger captures domain assumptions about what properties synthetic data must satisfy to be useful for agent testing.

axioms (2)
  • domain assumption: Real production datasets for tool-calling agents are often insufficient or unusable due to sensitivity or sparsity.
    This premise is stated directly in the abstract as the motivation for turning to synthetic data.
  • domain assumption: Validity, fidelity, and diversity are the appropriate high-level dimensions for characterizing synthetic data quality in multi-turn tool-calling scenarios.
    The entire SynAE framework is constructed around measuring these three properties across the four data categories.

pith-pipeline@v0.9.0 · 5805 in / 1620 out tokens · 74207 ms · 2026-05-22T06:21:23.373383+00:00 · methodology



Reference graph

Works this paper leans on

54 extracted references · 54 canonical work pages · 8 internal anchors

  1. [1]

    Evaluating the evaluations: A perspective on benchmarks

    Omar Alonso and Kenneth Church. Evaluating the evaluations: A perspective on benchmarks. In ACM SIGIR Forum, volume 58, pages 1–27. ACM New York, NY, USA, 2025

  2. [2]

    Demystifying evals for ai agents

    Anthropic. Demystifying evals for ai agents. https://www.anthropic.com/engineering/demystifying-evals-for-ai-agents, 2026

  3. [3]

    Synthetic data matters for machine learning innovation

    CapitalOne. Synthetic data matters for machine learning innovation. https://www.capitalone.com/tech/machine-learning/synthetic-data-research/, 2022

  4. [4]

    T1: A tool-oriented conversational dataset for multi-turn agentic planning

    Amartya Chakraborty, Paresh Dashore, Nadia Bathaee, Anmol Jain, Anirban Das, Shi-Xiong Zhang, Sambit Sahu, Milind Naphade, and Genta Indra Winata. T1: A tool-oriented conversational dataset for multi-turn agentic planning. arXiv preprint arXiv:2505.16986, 2025

  5. [5]

    What are specialized task ai agents? benefits, features & use cases explained

    Enkrypt AI. What are specialized task ai agents? benefits, features & use cases explained. Enkrypt AI Blog (Guest Post), March 2024

  6. [6]

    The vendi score: A diversity evaluation metric for machine learning

    Dan Friedman and Adji Bousso Dieng. The vendi score: A diversity evaluation metric for machine learning. arXiv preprint arXiv:2210.02410, 2022

  7. [7]

    What has been lost with synthetic evaluation?

    Alexander Gill, Abhilasha Ravichander, and Ana Marasović. What has been lost with synthetic evaluation? arXiv preprint arXiv:2505.22830, 2025

  8. [8]

    Evaluation gaps in machine learning practice

    Ben Hutchinson, Negar Rostamzadeh, Christina Greer, Katherine Heller, and Vinodkumar Prabhakaran. Evaluation gaps in machine learning practice. In Proceedings of the 2022 ACM conference on fairness, accountability, and transparency, pages 1859–1876, 2022

  9. [9]

    Quality matters: Evaluating synthetic data for tool-using llms

    Shadi Iskander, Sofia Tolmach, Ori Shapira, Nachshon Cohen, and Zohar Karnin. Quality matters: Evaluating synthetic data for tool-using llms. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pages 4958–4976, 2024

  10. [10]

    T2i-r1: Reinforcing image generation with collaborative semantic-level and token-level cot

    Dongzhi Jiang, Ziyu Guo, Renrui Zhang, Zhuofan Zong, Hao Li, Le Zhuo, Shilin Yan, Pheng-Ann Heng, and Hongsheng Li. T2i-r1: Reinforcing image generation with collaborative semantic-level and token-level cot. arXiv preprint arXiv:2505.00703, 2025

  11. [11]

    SWE-bench: Can Language Models Resolve Real-World GitHub Issues?

    Carlos E Jimenez, John Yang, Alexander Wettig, Shunyu Yao, Kexin Pei, Ofir Press, and Karthik Narasimhan. Swe-bench: Can language models resolve real-world github issues? arXiv preprint arXiv:2310.06770, 2023

  12. [12]

    Holistic agent leaderboard: The missing infrastructure for ai agent evaluation

    Sayash Kapoor, Benedikt Stroebl, Peter Kirgis, Nitya Nadgir, Zachary S Siegel, Boyi Wei, Tianci Xue, Ziru Chen, Felix Chen, Saiteja Utpala, et al. Holistic agent leaderboard: The missing infrastructure for ai agent evaluation. arXiv preprint arXiv:2510.11977, 2025

  13. [13]

    Acpbench: Reasoning about action, change, and planning

    Harsha Kokel, Michael Katz, Kavitha Srinivas, and Shirin Sohrabi. Acpbench: Reasoning about action, change, and planning. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 39, pages 26559–26568, 2025

  14. [14]

    Toolrm: Towards agentic tool-use reward modeling

    Renhao Li, Jianhong Tu, Yang Su, Yantao Liu, Fei Huang, Hamid Alinejad-Rokny, Derek F Wong, Junyang Lin, and Min Yang. Toolrm: Towards agentic tool-use reward modeling. arXiv preprint arXiv:2510.26167, 2025

  15. [15]

    Jiachang Liu, Dinghan Shen, Yizhe Zhang, William B Dolan, Lawrence Carin, and Weizhu Chen. What makes good in-context examples for gpt-3? In Proceedings of Deep Learning Inside Out (DeeLIO 2022): The 3rd workshop on knowledge extraction and integration for deep learning architectures, pages 100–114, 2022

  16. [16]

    AgentBench: Evaluating LLMs as Agents

    Xiao Liu, Hao Yu, Hanchen Zhang, Yifan Xu, Xuanyu Lei, Hanyu Lai, Yu Gu, Hangliang Ding, Kaiwen Men, Kejuan Yang, et al. Agentbench: Evaluating llms as agents. arXiv preprint arXiv:2308.03688, 2023

  17. [17]

    Agentrewardbench: Evaluating automatic evaluations of web agent trajectories

    Xing Han Lù, Amirhossein Kazemnejad, Nicholas Meade, Arkil Patel, Dongchan Shin, Alejandra Zambrano, Karolina Stańczak, Peter Shaw, Christopher J Pal, and Siva Reddy. Agentrewardbench: Evaluating automatic evaluations of web agent trajectories. arXiv preprint arXiv:2504.08942, 2025

  18. [18]

    Efficacy of synthetic data as a benchmark

    Gaurav Maheshwari, Dmitry Ivanov, and Kevin El Haddad. Efficacy of synthetic data as a benchmark. arXiv preprint arXiv:2409.11968, 2024

  19. [19]

    Grounding synthetic data evaluations of language models in unsupervised document corpora

    Michael Majurski and Cynthia Matuszek. Grounding synthetic data evaluations of language models in unsupervised document corpora. arXiv preprint arXiv:2505.08905, 2025

  20. [20]

    What are vertical ai agents?

    Amanda McGrath and Amanda Downie. What are vertical ai agents? IBM Think, n.d

  21. [21]

    Prompt genotyping: Quantifying the evaluation gap between synthetic benchmarks and real llm performance

    Sohum Mehta and Saaketh Bhojanam. Prompt genotyping: Quantifying the evaluation gap between synthetic benchmarks and real llm performance. In NeurIPS 2025 Workshop on Evaluating the Evolving LLM Lifecycle: Benchmarks, Emergent Abilities, and Scaling

  22. [22]

    Evaluation and benchmarking of llm agents: A survey

    Mahmoud Mohammadi, Yipeng Li, Jane Lo, and Wendy Yip. Evaluation and benchmarking of llm agents: A survey. In Proceedings of the 31st ACM SIGKDD Conference on Knowledge Discovery and Data Mining V. 2, pages 6129–6139, 2025

  23. [23]

    NVIDIA NeMo

    NVIDIA. NVIDIA NeMo. https://www.nvidia.com/en-us/ai-data-science/products/nemo/

  24. [24]

    Training language models to follow instructions with human feedback

    Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. Training language models to follow instructions with human feedback. Advances in neural information processing systems, 35:27730–27744, 2022

  25. [25]

    Autonomous evaluation and refinement of digital agents

    Jiayi Pan, Yichi Zhang, Nicholas Tomlin, Yifei Zhou, Sergey Levine, and Alane Suhr. Autonomous evaluation and refinement of digital agents. arXiv preprint arXiv:2404.06474, 2024

  26. [26]

    Measuring agents in production

    Melissa Z Pan, Negar Arabzadeh, Riccardo Cogo, Yuxuan Zhu, Alexander Xiong, Lakshya A Agrawal, Huanzhi Mao, Emma Shen, Sid Pallerla, Liana Patel, et al. Measuring agents in production. arXiv preprint arXiv:2512.04123, 2025

  27. [27]

    The berkeley function calling leaderboard (bfcl): From tool use to agentic evaluation of large language models

    Shishir G Patil, Huanzhi Mao, Fanjia Yan, Charlie Cheng-Jie Ji, Vishnu Suresh, Ion Stoica, and Joseph E Gonzalez. The berkeley function calling leaderboard (bfcl): From tool use to agentic evaluation of large language models. In Forty-second International Conference on Machine Learning

  28. [28]

    ToolLLM: Facilitating Large Language Models to Master 16000+ Real-world APIs

    Yujia Qin, Shihao Liang, Yining Ye, Kunlun Zhu, Lan Yan, Yaxi Lu, Yankai Lin, Xin Cong, Xiangru Tang, Bill Qian, et al. Toolllm: Facilitating large language models to master 16000+ real-world apis. arXiv preprint arXiv:2307.16789, 2023

  29. [29]

    Assisting in writing wikipedia-like articles from scratch with large language models

    Yijia Shao, Yucheng Jiang, Theodore Kanell, Peter Xu, Omar Khattab, and Monica Lam. Assisting in writing wikipedia-like articles from scratch with large language models. In Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), pages 6252–6278, 2024

  30. [30]

    Taskbench: Benchmarking large language models for task automation

    Yongliang Shen, Kaitao Song, Xu Tan, Wenqi Zhang, Kan Ren, Siyu Yuan, Weiming Lu, Dongsheng Li, and Yueting Zhuang. Taskbench: Benchmarking large language models for task automation. Advances in Neural Information Processing Systems, 37:4540–4574, 2024

  31. [31]

    Clio: Privacy-preserving insights into real-world ai use

    Alex Tamkin, Miles McCain, Kunal Handa, Esin Durmus, Liane Lovitt, Ankur Rathi, Saffron Huang, Alfred Mountfield, Jerry Hong, Stuart Ritchie, et al. Clio: Privacy-preserving insights into real-world ai use. arXiv preprint arXiv:2412.13678, 2024

  32. [32]

    ToolAlpaca: Generalized Tool Learning for Language Models with 3000 Simulated Cases

    Qiaoyu Tang, Ziliang Deng, Hongyu Lin, Xianpei Han, Qiao Liang, Boxi Cao, and Le Sun. Toolalpaca: Generalized tool learning for language models with 3000 simulated cases. arXiv preprint arXiv:2306.05301, 2023

  33. [33]

    B. Wang, W. Chen, H. Pei, C. Xie, M. Kang, C. Zhang, C. Xu, Z. Xiong, R. Dutta, R. Schaeffer, et al. DecodingTrust: A comprehensive assessment of trustworthiness in GPT models. In Conference on Neural Information Processing Systems, 2023

  34. [34]

    Struct-bench: A benchmark for differentially private structured text generation

    Shuaiqi Wang, Vikas Raunak, Arturs Backurs, Victor Reis, Pei Zhou, Sihao Chen, Longqi Yang, Zinan Lin, Sergey Yekhanin, and Giulia Fanti. Struct-bench: A benchmark for differentially private structured text generation. In The Thirty-ninth Annual Conference on Neural Information Processing Systems Datasets and Benchmarks Track

  35. [35]

    LiveBench: A Challenging, Contamination-Limited LLM Benchmark

    Colin White, Samuel Dooley, Manley Roberts, Arka Pal, Ben Feuer, Siddhartha Jain, Ravid Shwartz-Ziv, Neel Jain, Khalid Saifullah, Sreemanti Dey, et al. Livebench: A challenging, contamination-limited llm benchmark. arXiv preprint arXiv:2406.19314, 2025

  36. [36]

    Self-evolved diverse data sampling for efficient instruction tuning

    Shengguang Wu, Keming Lu, Benfeng Xu, Junyang Lin, Qi Su, and Chang Zhou. Self-evolved diverse data sampling for efficient instruction tuning. arXiv preprint arXiv:2311.08182, 2023

  37. [37]

    Osworld: Benchmarking multimodal agents for open-ended tasks in real computer environments

    Tianbao Xie, Danyang Zhang, Jixuan Chen, Xiaochuan Li, Siheng Zhao, Ruisheng Cao, Toh J Hua, Zhoujun Cheng, Dongchan Shin, Fangyu Lei, et al. Osworld: Benchmarking multimodal agents for open-ended tasks in real computer environments. Advances in Neural Information Processing Systems, 37:52040–52094, 2024

  38. [38]

    Probe-rewrite-evaluate: A workflow for reliable benchmarks and quantifying evaluation awareness

    Lang Xiong, Nishant Bhargava, Jianhang Hong, Jeremy Chang, Haihao Liu, Vasu Sharma, and Kevin Zhu. Probe-rewrite-evaluate: A workflow for reliable benchmarks and quantifying evaluation awareness. arXiv preprint arXiv:2509.00591, 2025

  39. [39]

    An illusion of progress? assessing the current state of web agents

    Tianci Xue, Weijian Qi, Tianneng Shi, Chan Hee Song, Boyu Gou, Dawn Song, Huan Sun, and Yu Su. An illusion of progress? assessing the current state of web agents. arXiv preprint arXiv:2504.01382, 2025

  40. [40]

    Swe-agent: Agent-computer interfaces enable automated software engineering

    John Yang, Carlos E Jimenez, Alexander Wettig, Kilian Lieret, Shunyu Yao, Karthik Narasimhan, and Ofir Press. Swe-agent: Agent-computer interfaces enable automated software engineering. Advances in Neural Information Processing Systems, 37:50528–50652, 2024

  41. [41]

    Intercode: Standardizing and benchmarking interactive coding with execution feedback

    John Yang, Akshara Prabhakar, Karthik Narasimhan, and Shunyu Yao. Intercode: Standardizing and benchmarking interactive coding with execution feedback. Advances in Neural Information Processing Systems, 36:23826–23854, 2023

  42. [42]

    $\tau$-bench: A Benchmark for Tool-Agent-User Interaction in Real-World Domains

    Shunyu Yao, Noah Shinn, Pedram Razavi, and Karthik Narasimhan. τ-bench: A benchmark for tool-agent-user interaction in real-world domains. arXiv preprint arXiv:2406.12045, 2024

  43. [43]

    React: Synergizing reasoning and acting in language models

    Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik R Narasimhan, and Yuan Cao. React: Synergizing reasoning and acting in language models. In The eleventh international conference on learning representations, 2022

  44. [44]

    Survey on Evaluation of LLM-based Agents

    Asaf Yehudai, Lilach Eden, Alan Li, Guy Uziel, Yilun Zhao, Roy Bar-Haim, Arman Cohan, and Michal Shmueli-Scheuer. Survey on evaluation of llm-based agents. arXiv preprint arXiv:2503.16416, 2025

  45. [45]

    Gecko: A simulation environment to ground agent tool calls with stateful feedback for refinement

    Zeyu Zhang, Guohao Li, Zhenchang Xing, Alexandros Apostolopoulos, Yu Lin Lee, and Liang Zheng. Gecko: A simulation environment to ground agent tool calls with stateful feedback for refinement

  46. [46]

    WebArena: A Realistic Web Environment for Building Autonomous Agents

    Shuyan Zhou, Frank F Xu, Hao Zhu, Xuhui Zhou, Robert Lo, Abishek Sridhar, Xianyi Cheng, Tianyue Ou, Yonatan Bisk, Daniel Fried, et al. Webarena: A realistic web environment for building autonomous agents. arXiv preprint arXiv:2307.13854, 2023

  47. [47]

    Dyval: Dynamic evaluation of large language models for reasoning tasks

    Kaijie Zhu, Jiaao Chen, Jindong Wang, Neil Zhenqiang Gong, Diyi Yang, and Xing Xie. Dyval: Dynamic evaluation of large language models for reasoning tasks. arXiv preprint arXiv:2309.17167, 2023

  48. [48]

    Establishing best practices for building rigorous agentic benchmarks

    Yuxuan Zhu, Tengjun Jin, Yada Pruksachatkun, Andy Zhang, Shu Liu, Sasha Cui, Sayash Kapoor, Shayne Longpre, Kevin Meng, Rebecca Weiss, et al. Establishing best practices for building rigorous agentic benchmarks. arXiv preprint arXiv:2507.02825, 2025

  49. [49]

    On many-shot in-context learning for long-context evaluation

    Kaijian Zou, Muhammad Khalifa, and Lu Wang. On many-shot in-context learning for long-context evaluation. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 25605–25639, 2025

  50. [50]

    The conversation MUST start with ’assistant:’ (not ’Assistant:’ or any variation)

  51. [51]

    Lines MUST alternate strictly between ’user:’ and ’assistant:’

  52. [52]

    Each line must follow the format: ’role: content’ where role is either ’user’ or ’assistant’

  53. [53]

    Output ONLY the completed conversation with no preamble, explanation, or extra text

  54. [54]

    The ferry is emp__

    Maintain the same number of conversation turns as the input. User Prompt: Example input for fill in the blanks: assistant: H_____ What ____ of attractions are you looking for? Are you interested in _______, a__, or something else? user: I’m interested in ___ and ____ attractions in __. ...