SynAE: A Framework for Measuring the Quality of Synthetic Data for Tool-Calling Agent Evaluations

Aadyaa Maddi; Giulia Fanti; Shuaiqi Wang; Zinan Lin

arxiv: 2605.22564 · v1 · pith:OSF2LBR6new · submitted 2026-05-21 · 💻 cs.CL · cs.LG · cs.SE


Shuaiqi Wang, Aadyaa Maddi, Zinan Lin, Giulia Fanti

Pith reviewed 2026-05-22 06:21 UTC · model grok-4.3

classification 💻 cs.CL cs.LG cs.SE

keywords: synthetic data, tool-calling agents, evaluation framework, data validity, data fidelity, data diversity, multi-turn agents, agent benchmarks


The pith

SynAE shows that synthetic data for tool-calling agents needs checks across validity, fidelity, and diversity rather than any single metric.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces SynAE to quantify how closely synthetic datasets match real execution traces when testing multi-turn tool-calling agents. Real data is often unavailable because of sensitivity or sparsity, so practitioners turn to synthetic replacements, yet it is unclear how well those replacements preserve the properties needed for reliable evaluation. SynAE applies metrics in four categories covering instructions and responses, tool calls, final outputs, and downstream performance to measure validity, fidelity, and diversity. Experiments on existing benchmarks plus controlled synthetic data with injected failure modes reveal that different quality problems appear at different scales and that any one metric leaves important gaps undetected. This leads to the conclusion that a multi-axis evaluation is required to understand synthetic data quality for agent testing.

Core claim

SynAE assesses the validity, fidelity, and diversity of synthetic data for multi-turn tool-calling agents across four metric categories: task instructions and intermediate responses, tool calls, final outputs, and downstream evaluation. When tested on recent agent benchmarks and data produced by realistic controlled generation schemes that simulate common failure modes, SynAE detects fine-grained variations in these dimensions and demonstrates that no single metric fully characterizes synthetic data quality, motivating multi-axis evaluation instead.
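As a rough illustration of why the three axes can disagree, the sketch below scores only the tool-call category with three toy metrics. The metric definitions (a schema-based validity rate, a total-variation fidelity score over tool names, and normalized entropy as diversity), the `TOOL_SCHEMAS` table, and every tool name are assumptions made for this example; they are not the metrics defined in the paper.

```python
import math
from collections import Counter

# Hypothetical tool schemas: tool name -> required argument names.
TOOL_SCHEMAS = {
    "search_flights": {"origin", "destination"},
    "book_hotel": {"city", "nights"},
    "get_weather": {"city"},
}

def validity_rate(calls):
    """Fraction of calls that name a known tool and supply all required arguments."""
    ok = sum(
        1 for name, args in calls
        if name in TOOL_SCHEMAS and TOOL_SCHEMAS[name] <= set(args)
    )
    return ok / len(calls)

def tool_distribution(calls):
    """Empirical distribution over tool names."""
    counts = Counter(name for name, _ in calls)
    total = sum(counts.values())
    return {name: c / total for name, c in counts.items()}

def fidelity(real_calls, synth_calls):
    """1 minus the total variation distance between tool-name distributions."""
    p, q = tool_distribution(real_calls), tool_distribution(synth_calls)
    tv = 0.5 * sum(abs(p.get(k, 0.0) - q.get(k, 0.0)) for k in set(p) | set(q))
    return 1.0 - tv

def diversity(calls):
    """Shannon entropy of tool names, normalized by log(number of known tools)."""
    probs = tool_distribution(calls).values()
    entropy = sum(p * math.log(1.0 / p) for p in probs)
    return entropy / math.log(len(TOOL_SCHEMAS))

# Placeholder data, invented for illustration only.
real = [
    ("search_flights", {"origin": "SFO", "destination": "JFK"}),
    ("book_hotel", {"city": "NYC", "nights": 2}),
    ("get_weather", {"city": "NYC"}),
    ("search_flights", {"origin": "LAX", "destination": "SEA"}),
]
# Every call is valid, but the data has collapsed onto a single tool.
synth_collapsed = [("search_flights", {"origin": "SFO", "destination": "JFK"})] * 4
# Tool mix matches the real data, but half the calls miss required arguments.
synth_invalid = [
    ("search_flights", {"origin": "SFO"}),
    ("book_hotel", {"city": "NYC", "nights": 1}),
    ("get_weather", {}),
    ("search_flights", {"origin": "LAX", "destination": "SEA"}),
]

for label, synth in [("collapsed", synth_collapsed), ("invalid", synth_invalid)]:
    print(label, validity_rate(synth), fidelity(real, synth), round(diversity(synth), 3))
```

In this toy setup the collapsed dataset is perfectly valid yet has zero diversity, while the invalid dataset matches the real tool mix exactly yet fails half of the schema checks, so neither score alone separates a usable dataset from a broken one.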

What carries the argument

The SynAE evaluation framework, which applies separate metric sets for validity, fidelity, and diversity within each of the four categories: instructions, tool calls, outputs, and downstream results.

If this is right

Synthetic data generators can be adjusted to fix specific weaknesses in validity or diversity once SynAE identifies them.
Agent evaluations become more reliable when the test data passes checks on all four metric categories.
Different synthetic data methods can be compared directly by their scores on the same multi-axis set.
Practitioners can decide whether a given synthetic dataset is sufficient for pre-deployment testing by inspecting its profile across the axes.
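One way to picture the "profile across the axes" mentioned above is a small grid of scores keyed by axis and category. The sketch below is a hypothetical data structure, not part of SynAE: the `Profile` class, the axis and category identifiers, the threshold, and all score values are invented for illustration.

```python
from dataclasses import dataclass

AXES = ("validity", "fidelity", "diversity")
CATEGORIES = ("instructions_responses", "tool_calls", "final_outputs", "downstream")

@dataclass(frozen=True)
class Profile:
    """Scores in [0, 1] keyed by (axis, category); unmeasured cells may be absent."""
    name: str
    scores: dict

    def failing_cells(self, threshold):
        """Cells whose score falls below the threshold, in sorted order."""
        return sorted(cell for cell, s in self.scores.items() if s < threshold)

    def passes(self, threshold):
        return not self.failing_cells(threshold)

    def mean_score(self):
        return sum(self.scores.values()) / len(self.scores)

# Made-up scores for two hypothetical generators (tool-call category only).
generator_a = Profile("generator_a", {
    ("validity", "tool_calls"): 0.99,
    ("fidelity", "tool_calls"): 0.98,
    ("diversity", "tool_calls"): 0.40,
})
generator_b = Profile("generator_b", {
    ("validity", "tool_calls"): 0.80,
    ("fidelity", "tool_calls"): 0.80,
    ("diversity", "tool_calls"): 0.80,
})

for profile in (generator_a, generator_b):
    print(profile.name, round(profile.mean_score(), 2), profile.failing_cells(0.7))
```

The two invented profiles have nearly identical mean scores, but only `generator_a` has a cell below the threshold, which is the kind of gap a single aggregate number would hide.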

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The same multi-axis approach could be adapted to judge synthetic data for other agent types such as web navigation or code generation agents.
Widespread adoption of SynAE-style checks might reduce the volume of real user data needed for testing and thereby improve privacy protections during agent development.
Automated pipelines could optimize new synthetic data generators to maximize scores on the full set of SynAE metrics.

Load-bearing premise

The controlled and realistic generation schemes used to create test synthetic data accurately represent the common failure modes that occur when practitioners generate synthetic data for tool-calling agent evaluations in production settings.

What would settle it

A collection of synthetic datasets in which one metric correlates perfectly with all other validity, fidelity, diversity, and downstream performance measures would falsify the claim that multiple axes are required.
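The falsification test described above can be sketched as a rank-correlation check across datasets. The code below is a minimal illustration, assuming each metric has been computed once per synthetic dataset and varies across datasets; the function names, the tolerance, and the score values are hypothetical and do not come from the paper.

```python
def ranks(values):
    """Average ranks (1-based), so tied values share the same rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = avg
        i = j + 1
    return result

def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the rank vectors.
    Assumes neither input is constant (otherwise the denominator is zero)."""
    rx, ry = ranks(x), ranks(y)
    n = len(x)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx)
    vy = sum((b - my) ** 2 for b in ry)
    return cov / (vx * vy) ** 0.5

def dominant_metrics(metrics, tol=1e-9):
    """Metrics whose ranking agrees (up to sign) with every other metric."""
    names = list(metrics)
    return sorted(
        m for m in names
        if all(abs(spearman(metrics[m], metrics[o])) >= 1.0 - tol
               for o in names if o != m)
    )

# Invented scores for five hypothetical synthetic datasets.
metrics = {
    "validity": [0.9, 0.8, 0.7, 0.6, 0.5],
    "fidelity": [0.95, 0.85, 0.75, 0.65, 0.55],
    "diversity": [0.3, 0.9, 0.5, 0.7, 0.4],
}
print(dominant_metrics(metrics))
```

A non-empty result on a broad collection of datasets would be evidence against the multi-axis claim; an empty result, as with these placeholder numbers, is consistent with it but does not prove it.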

Figures

Figures reproduced from arXiv: 2605.22564 by Aadyaa Maddi, Giulia Fanti, Shuaiqi Wang, Zinan Lin.

**Figure 1.** The SynAE framework evaluates the quality of synthetic data used in agent evaluations.

**Figure 2.** Agent trajectory from the T1 [4] benchmark dataset, with notation for each component. Notation and setup: consider a dataset $D = \{D_i\}_{i=1}^{m}$ of $m$ samples (or agent trajectories).

**Figure 3.** Fidelity of Blank Filling and Oversampling on the T1 dataset.

**Figure 4.** Diversity metrics for Blank Filling and Oversampling on the T1 dataset.

**Figure 5.** Validity of Invalidation on T1. As invalidation ratio v increases, Validity Rates for both tool calls and outputs decrease.

**Figure 6.** Fidelity vs. diversity for Blank Filling and …

**Figure 7.** Fidelity of In-Context Generation under T1 with fixed or randomized in-context examples.

**Figure 8.** Fidelity of Blank Filling and Oversampling on the BFCL dataset.

**Figure 9.** Fidelity of In-Context Generation under BFCL with fixed and randomly sampled in-context …

Original abstract

Today, tool-calling agents are commonly evaluated or tested on static datasets of execution traces, including input commands, agent responses, and associated tool calls. However, internal production datasets are often insufficient or unusable for testing; for example, they may contain sensitive or proprietary data, or they may be too sparse to support comprehensive testing (especially pre-deployment). In these settings, practitioners are increasingly replacing or augmenting real datasets with synthetic ones for evaluation purposes. A key challenge is quantifying the relation between these synthetic datasets and the real data. We introduce SynAE, an evaluation framework for assessing how well synthetic benchmarks for multi-turn, tool-calling agents replicate and augment the characteristics of real data trajectories. SynAE assesses the validity, fidelity, and diversity of synthetic data across four metric categories: (i) task instructions and intermediate responses, (ii) tool calls, (iii) final outputs, and (iv) downstream evaluation. We evaluate SynAE using recent agent benchmarks and test common synthetic data failure modes via realistic and controlled generation schemes. SynAE detects fine-grained variations in data validity, fidelity and diversity, and shows that no single metric is sufficient to fully characterize synthetic data quality, motivating a multi-axis evaluation of synthetic data for agent testing. A demo of SynAE is available at https://synae-2026-synae-demo.static.hf.space/index.html, with code at https://github.com/wsqwsq/SynAE.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

SynAE gives a concrete multi-metric framework for synthetic tool-calling data and shows single metrics fall short, but the controlled test cases may not match real production generation.


Hey, the main point here is that SynAE supplies a structured set of metrics to judge how well synthetic data matches real trajectories for multi-turn tool-calling agents, split across validity, fidelity, and diversity. The work breaks evaluation into four categories covering task instructions and intermediate responses, tool calls, final outputs, and downstream performance. They apply the framework to recent agent benchmarks and generate test synthetic sets through controlled schemes that inject different failure modes. The results indicate that no one metric catches all the differences, which supports their argument for using several axes at once. Releasing both the code and a live demo is a clear practical step that lets others inspect and reuse the approach without rebuilding it.

The experiments are the area that needs the most scrutiny. The paper depends on those realistic controlled generation schemes to represent common problems, yet it is not obvious how closely they track the actual synthetic data that practitioners produce with ordinary LLM prompts, temperature settings, or post-processing steps. If the injected issues are cleaner or narrower than typical production outputs, then the sensitivity SynAE demonstrates could be tied to the test construction rather than a general property of synthetic data. A direct comparison against data made the way teams actually make it would strengthen the case.

This is aimed at engineers and researchers who build or test tool-calling agents but lack enough real interaction logs because of privacy or sparsity. Anyone looking for a ready set of categories to measure synthetic data quality will find the breakdown and the multi-metric demonstration useful. The thinking is direct and the resources are open, so the paper deserves a serious referee to check the metric definitions and the experimental design. I would send it for peer review.

Referee Report

1 major / 2 minor

Summary. The paper introduces SynAE, a framework for assessing synthetic data quality for multi-turn tool-calling agent evaluations. It defines metrics for validity, fidelity, and diversity across four categories—task instructions and intermediate responses, tool calls, final outputs, and downstream evaluation—and evaluates the framework on recent agent benchmarks by applying controlled generation schemes that inject common failure modes. The central result is that SynAE detects fine-grained variations in these dimensions and that no single metric suffices to characterize synthetic data quality, motivating multi-axis evaluation.

Significance. If the evaluation holds, SynAE offers a practical, multi-dimensional tool for practitioners who must rely on synthetic data when real execution traces are sparse or sensitive. The public demo and GitHub code are explicit strengths that support reproducibility and adoption. The work directly addresses a growing need in agent benchmarking and could influence how synthetic datasets are validated before deployment.

major comments (1)

[§4] §4 (Evaluation and Experiments): The claim that the controlled generation schemes are 'realistic' and capture common failure modes is load-bearing for the recommendation of multi-axis evaluation in practice. The manuscript does not report a side-by-side statistical comparison (e.g., distribution of tool-call validity rates, response diversity scores) between the injected schemes and synthetic data produced by standard LLM pipelines with typical prompting and temperature settings. Without this, it remains unclear whether SynAE’s sensitivity generalizes beyond the experimental construction.

minor comments (2)

[Abstract] Abstract: The phrase 'recent agent benchmarks' is used without naming the specific datasets or citations; adding the exact benchmark names would improve traceability.
[Results] Figure 3 (or equivalent results figure): The color scale and legend for the multi-metric heatmaps are difficult to read at standard print size; increasing font size or adding a supplementary table of raw values would aid interpretation.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive feedback and positive overall assessment of SynAE. We agree that additional evidence supporting the realism of the controlled generation schemes would strengthen the practical implications of the multi-axis evaluation recommendation. We will incorporate the suggested comparison in the revised manuscript.


Referee: [§4] §4 (Evaluation and Experiments): The claim that the controlled generation schemes are 'realistic' and capture common failure modes is load-bearing for the recommendation of multi-axis evaluation in practice. The manuscript does not report a side-by-side statistical comparison (e.g., distribution of tool-call validity rates, response diversity scores) between the injected schemes and synthetic data produced by standard LLM pipelines with typical prompting and temperature settings. Without this, it remains unclear whether SynAE’s sensitivity generalizes beyond the experimental construction.

Authors: We acknowledge the value of a direct statistical comparison to demonstrate that the injected failure modes align with those arising from standard LLM-based synthetic data generation. The schemes in the current manuscript were constructed from failure modes documented in prior agent evaluation literature and observed in our preliminary experiments with production-style traces. To address the concern, we will add a new analysis in the revised Section 4 (or an appendix) that generates parallel synthetic datasets using common LLM pipelines (e.g., zero-shot prompting with temperature 0.7 and 1.0) on the same underlying tasks. We will then report side-by-side distributions for key metrics such as tool-call validity rates and response diversity scores, allowing readers to assess how closely the controlled schemes match typical synthetic outputs.

Revision: yes
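The side-by-side comparison proposed in this simulated exchange could be summarized with a two-sample distance between metric distributions. The sketch below uses the Kolmogorov-Smirnov statistic as one possible choice; the rebuttal does not name a specific test, and the per-trajectory validity rates shown are placeholder values rather than measurements.

```python
import bisect

def ks_statistic(sample_a, sample_b):
    """Two-sample Kolmogorov-Smirnov statistic: the largest gap between
    the empirical CDFs of the two samples (0 = identical, 1 = disjoint)."""
    a, b = sorted(sample_a), sorted(sample_b)

    def ecdf(sample, x):
        # Fraction of the sample that is <= x.
        return bisect.bisect_right(sample, x) / len(sample)

    points = sorted(set(a) | set(b))
    return max(abs(ecdf(a, x) - ecdf(b, x)) for x in points)

# Placeholder per-trajectory tool-call validity rates (not real results).
controlled_scheme = [0.2, 0.4, 0.6, 0.8]   # e.g. an injected-failure scheme
llm_pipeline = [0.7, 0.8, 0.9, 1.0]        # e.g. zero-shot prompting at some temperature

print(ks_statistic(controlled_scheme, llm_pipeline))
```

A small statistic would suggest that the controlled scheme produces metric distributions similar to those of the ordinary pipeline; a large one, as with these invented numbers, would suggest the injected failure modes are not representative for that metric.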

Circularity Check

0 steps flagged

SynAE metrics defined independently; no reduction to inputs by construction

full rationale

The paper defines SynAE as a multi-category metric framework (task instructions/responses, tool calls, final outputs, downstream evaluation) for validity/fidelity/diversity and applies it empirically to synthetic data created with injected failure modes via controlled generation schemes on existing agent benchmarks. The central result—that fine-grained variations are detected and no single metric suffices—is an observation from these experiments rather than a quantity fitted from the test data or derived tautologically from the same definitions. No self-definitional steps, fitted-input predictions, or load-bearing self-citations appear in the derivation; the framework functions as an external assessment tool whose outputs are not forced by its own construction.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The paper introduces an evaluation framework rather than a first-principles derivation, so the ledger captures domain assumptions about what properties synthetic data must satisfy to be useful for agent testing.

axioms (2)

domain assumption: Real production datasets for tool-calling agents are often insufficient or unusable due to sensitivity or sparsity.
This premise is stated directly in the abstract as the motivation for turning to synthetic data.
domain assumption: Validity, fidelity, and diversity are the appropriate high-level dimensions for characterizing synthetic data quality in multi-turn tool-calling scenarios.
The entire SynAE framework is constructed around measuring these three properties across the four data categories.

pith-pipeline@v0.9.0 · 5805 in / 1620 out tokens · 74207 ms · 2026-05-22T06:21:23.373383+00:00 · methodology


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
Paper passage: SynAE assesses the validity, fidelity, and diversity of synthetic data across four metric categories: (i) task instructions and intermediate responses, (ii) tool calls, (iii) final outputs, and (iv) downstream evaluation.

IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · unclear
Paper passage: We evaluate SynAE using recent agent benchmarks and test common synthetic data failure modes via realistic and controlled generation schemes.

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

54 extracted references · 54 canonical work pages · 8 internal anchors

[1] Omar Alonso and Kenneth Church. Evaluating the evaluations: A perspective on benchmarks. In ACM SIGIR Forum, volume 58, pages 1–27. ACM New York, NY, USA, 2025.
[2] Anthropic. Demystifying evals for ai agents. https://www.anthropic.com/engineering/demystifying-evals-for-ai-agents, 2026.
[3] CapitalOne. Synthetic data matters for machine learning innovation. https://www.capitalone.com/tech/machine-learning/synthetic-data-research/, 2022.
[4] Amartya Chakraborty, Paresh Dashore, Nadia Bathaee, Anmol Jain, Anirban Das, Shi-Xiong Zhang, Sambit Sahu, Milind Naphade, and Genta Indra Winata. T1: A tool-oriented conversational dataset for multi-turn agentic planning. arXiv preprint arXiv:2505.16986, 2025.
[5] Enkrypt AI. What are specialized task ai agents? benefits, features & use cases explained. Enkrypt AI Blog (Guest Post), March 2024.
[6] Dan Friedman and Adji Bousso Dieng. The vendi score: A diversity evaluation metric for machine learning. arXiv preprint arXiv:2210.02410, 2022.
[7] Alexander Gill, Abhilasha Ravichander, and Ana Marasović. What has been lost with synthetic evaluation? arXiv preprint arXiv:2505.22830, 2025.
[8] Ben Hutchinson, Negar Rostamzadeh, Christina Greer, Katherine Heller, and Vinodkumar Prabhakaran. Evaluation gaps in machine learning practice. In Proceedings of the 2022 ACM conference on fairness, accountability, and transparency, pages 1859–1876, 2022.
[9] Shadi Iskander, Sofia Tolmach, Ori Shapira, Nachshon Cohen, and Zohar Karnin. Quality matters: Evaluating synthetic data for tool-using llms. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pages 4958–4976, 2024.
[10] Dongzhi Jiang, Ziyu Guo, Renrui Zhang, Zhuofan Zong, Hao Li, Le Zhuo, Shilin Yan, Pheng-Ann Heng, and Hongsheng Li. T2i-r1: Reinforcing image generation with collaborative semantic-level and token-level cot. arXiv preprint arXiv:2505.00703, 2025.
[11] Carlos E Jimenez, John Yang, Alexander Wettig, Shunyu Yao, Kexin Pei, Ofir Press, and Karthik Narasimhan. SWE-bench: Can language models resolve real-world GitHub issues? arXiv preprint arXiv:2310.06770, 2023.
[12] Sayash Kapoor, Benedikt Stroebl, Peter Kirgis, Nitya Nadgir, Zachary S Siegel, Boyi Wei, Tianci Xue, Ziru Chen, Felix Chen, Saiteja Utpala, et al. Holistic agent leaderboard: The missing infrastructure for ai agent evaluation. arXiv preprint arXiv:2510.11977, 2025.
[13] Harsha Kokel, Michael Katz, Kavitha Srinivas, and Shirin Sohrabi. Acpbench: Reasoning about action, change, and planning. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 39, pages 26559–26568, 2025.
[14] Renhao Li, Jianhong Tu, Yang Su, Yantao Liu, Fei Huang, Hamid Alinejad-Rokny, Derek F Wong, Junyang Lin, and Min Yang. Toolrm: Towards agentic tool-use reward modeling. arXiv preprint arXiv:2510.26167, 2025.
[15] Jiachang Liu, Dinghan Shen, Yizhe Zhang, William B Dolan, Lawrence Carin, and Weizhu Chen. What makes good in-context examples for gpt-3? In Proceedings of Deep Learning Inside Out (DeeLIO 2022): The 3rd workshop on knowledge extraction and integration for deep learning architectures, pages 100–114, 2022.
[16] Xiao Liu, Hao Yu, Hanchen Zhang, Yifan Xu, Xuanyu Lei, Hanyu Lai, Yu Gu, Hangliang Ding, Kaiwen Men, Kejuan Yang, et al. AgentBench: Evaluating LLMs as agents. arXiv preprint arXiv:2308.03688, 2023.
[17] Xing Han Lù, Amirhossein Kazemnejad, Nicholas Meade, Arkil Patel, Dongchan Shin, Alejandra Zambrano, Karolina Stańczak, Peter Shaw, Christopher J Pal, and Siva Reddy. Agentrewardbench: Evaluating automatic evaluations of web agent trajectories. arXiv preprint arXiv:2504.08942, 2025.
[18] Gaurav Maheshwari, Dmitry Ivanov, and Kevin El Haddad. Efficacy of synthetic data as a benchmark. arXiv preprint arXiv:2409.11968, 2024.
[19] Michael Majurski and Cynthia Matuszek. Grounding synthetic data evaluations of language models in unsupervised document corpora. arXiv preprint arXiv:2505.08905, 2025.
[20] Amanda McGrath and Amanda Downie. What are vertical ai agents? IBM Think, n.d.
[21] Sohum Mehta and Saaketh Bhojanam. Prompt genotyping: Quantifying the evaluation gap between synthetic benchmarks and real llm performance. In NeurIPS 2025 Workshop on Evaluating the Evolving LLM Lifecycle: Benchmarks, Emergent Abilities, and Scaling.
[22] Mahmoud Mohammadi, Yipeng Li, Jane Lo, and Wendy Yip. Evaluation and benchmarking of llm agents: A survey. In Proceedings of the 31st ACM SIGKDD Conference on Knowledge Discovery and Data Mining V. 2, pages 6129–6139, 2025.
[23] NVIDIA. NVIDIA NeMo. https://www.nvidia.com/en-us/ai-data-science/products/nemo/
[24] Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. Training language models to follow instructions with human feedback. Advances in neural information processing systems, 35:27730–27744, 2022.
[25] Jiayi Pan, Yichi Zhang, Nicholas Tomlin, Yifei Zhou, Sergey Levine, and Alane Suhr. Autonomous evaluation and refinement of digital agents. arXiv preprint arXiv:2404.06474, 2024.
[26] Melissa Z Pan, Negar Arabzadeh, Riccardo Cogo, Yuxuan Zhu, Alexander Xiong, Lakshya A Agrawal, Huanzhi Mao, Emma Shen, Sid Pallerla, Liana Patel, et al. Measuring agents in production. arXiv preprint arXiv:2512.04123, 2025.
[27] Shishir G Patil, Huanzhi Mao, Fanjia Yan, Charlie Cheng-Jie Ji, Vishnu Suresh, Ion Stoica, and Joseph E Gonzalez. The berkeley function calling leaderboard (bfcl): From tool use to agentic evaluation of large language models. In Forty-second International Conference on Machine Learning.
[28] Yujia Qin, Shihao Liang, Yining Ye, Kunlun Zhu, Lan Yan, Yaxi Lu, Yankai Lin, Xin Cong, Xiangru Tang, Bill Qian, et al. ToolLLM: Facilitating large language models to master 16000+ real-world APIs. arXiv preprint arXiv:2307.16789, 2023.
[29] Yijia Shao, Yucheng Jiang, Theodore Kanell, Peter Xu, Omar Khattab, and Monica Lam. Assisting in writing wikipedia-like articles from scratch with large language models. In Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), pages 6252–6278, 2024.
[30] Yongliang Shen, Kaitao Song, Xu Tan, Wenqi Zhang, Kan Ren, Siyu Yuan, Weiming Lu, Dongsheng Li, and Yueting Zhuang. Taskbench: Benchmarking large language models for task automation. Advances in Neural Information Processing Systems, 37:4540–4574, 2024.
[31] Alex Tamkin, Miles McCain, Kunal Handa, Esin Durmus, Liane Lovitt, Ankur Rathi, Saffron Huang, Alfred Mountfield, Jerry Hong, Stuart Ritchie, et al. Clio: Privacy-preserving insights into real-world ai use. arXiv preprint arXiv:2412.13678, 2024.
[32] Qiaoyu Tang, Ziliang Deng, Hongyu Lin, Xianpei Han, Qiao Liang, Boxi Cao, and Le Sun. ToolAlpaca: Generalized tool learning for language models with 3000 simulated cases. arXiv preprint arXiv:2306.05301, 2023.
[33] B. Wang, W. Chen, H. Pei, C. Xie, M. Kang, C. Zhang, C. Xu, Z. Xiong, R. Dutta, R. Schaeffer, et al. DecodingTrust: A comprehensive assessment of trustworthiness in GPT models. In Conference on Neural Information Processing Systems, 2023.
[34] Shuaiqi Wang, Vikas Raunak, Arturs Backurs, Victor Reis, Pei Zhou, Sihao Chen, Longqi Yang, Zinan Lin, Sergey Yekhanin, and Giulia Fanti. Struct-bench: A benchmark for differentially private structured text generation. In The Thirty-ninth Annual Conference on Neural Information Processing Systems Datasets and Benchmarks Track.
[35] Colin White, Samuel Dooley, Manley Roberts, Arka Pal, Ben Feuer, Siddhartha Jain, Ravid Shwartz-Ziv, Neel Jain, Khalid Saifullah, Sreemanti Dey, et al. LiveBench: A challenging, contamination-limited LLM benchmark. arXiv preprint arXiv:2406.19314, 2025.
[36] Shengguang Wu, Keming Lu, Benfeng Xu, Junyang Lin, Qi Su, and Chang Zhou. Self-evolved diverse data sampling for efficient instruction tuning. arXiv preprint arXiv:2311.08182, 2023.
[37] Tianbao Xie, Danyang Zhang, Jixuan Chen, Xiaochuan Li, Siheng Zhao, Ruisheng Cao, Toh J Hua, Zhoujun Cheng, Dongchan Shin, Fangyu Lei, et al. Osworld: Benchmarking multimodal agents for open-ended tasks in real computer environments. Advances in Neural Information Processing Systems, 37:52040–52094, 2024.
[38] Lang Xiong, Nishant Bhargava, Jianhang Hong, Jeremy Chang, Haihao Liu, Vasu Sharma, and Kevin Zhu. Probe-rewrite-evaluate: A workflow for reliable benchmarks and quantifying evaluation awareness. arXiv preprint arXiv:2509.00591, 2025.
[39] Tianci Xue, Weijian Qi, Tianneng Shi, Chan Hee Song, Boyu Gou, Dawn Song, Huan Sun, and Yu Su. An illusion of progress? assessing the current state of web agents. arXiv preprint arXiv:2504.01382, 2025.
[40] John Yang, Carlos E Jimenez, Alexander Wettig, Kilian Lieret, Shunyu Yao, Karthik Narasimhan, and Ofir Press. Swe-agent: Agent-computer interfaces enable automated software engineering. Advances in Neural Information Processing Systems, 37:50528–50652, 2024.
[41] John Yang, Akshara Prabhakar, Karthik Narasimhan, and Shunyu Yao. Intercode: Standardizing and benchmarking interactive coding with execution feedback. Advances in Neural Information Processing Systems, 36:23826–23854, 2023.
[42] Shunyu Yao, Noah Shinn, Pedram Razavi, and Karthik Narasimhan. τ-bench: A benchmark for tool-agent-user interaction in real-world domains. arXiv preprint arXiv:2406.12045, 2024.
[43] Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik R Narasimhan, and Yuan Cao. React: Synergizing reasoning and acting in language models. In The eleventh international conference on learning representations, 2022.
[44] Asaf Yehudai, Lilach Eden, Alan Li, Guy Uziel, Yilun Zhao, Roy Bar-Haim, Arman Cohan, and Michal Shmueli-Scheuer. Survey on evaluation of LLM-based agents. arXiv preprint arXiv:2503.16416, 2025.
[45]

Gecko: A simulation environment to ground agent tool calls with stateful feedback for refinement

Zeyu Zhang, Guohao Li, Zhenchang Xing, Alexandros Apostolopoulos, Yu Lin Lee, and Liang Zheng. Gecko: A simulation environment to ground agent tool calls with stateful feedback for refinement

work page
[46]

WebArena: A Realistic Web Environment for Building Autonomous Agents

Shuyan Zhou, Frank F Xu, Hao Zhu, Xuhui Zhou, Robert Lo, Abishek Sridhar, Xianyi Cheng, Tianyue Ou, Yonatan Bisk, Daniel Fried, et al. Webarena: A realistic web environment for building autonomous agents.arXiv preprint arXiv:2307.13854, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[47]

Dyval: Dynamic evaluation of large language models for reasoning tasks.arXiv preprint arXiv:2309.17167, 2023

Kaijie Zhu, Jiaao Chen, Jindong Wang, Neil Zhenqiang Gong, Diyi Yang, and Xing Xie. Dyval: Dynamic evaluation of large language models for reasoning tasks.arXiv preprint arXiv:2309.17167, 2023

work page arXiv 2023
[48]

Establishing best practices for building rigorous agentic benchmarks.arXiv preprint arXiv:2507.02825, 2025

Yuxuan Zhu, Tengjun Jin, Yada Pruksachatkun, Andy Zhang, Shu Liu, Sasha Cui, Sayash Kapoor, Shayne Longpre, Kevin Meng, Rebecca Weiss, et al. Establishing best practices for building rigorous agentic benchmarks.arXiv preprint arXiv:2507.02825, 2025. 13

work page arXiv 2025
[49]

OR " vs

Kaijian Zou, Muhammad Khalifa, and Lu Wang. On many-shot in-context learning for long-context evaluation. InProceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 25605–25639, 2025. 14 A Related Work Robust benchmarks for interactive tool-use are necessary both for generalist agents (e.g., co...

work page 2025
[50]

The c o n v e r s a t i o n MUST start with ’ a ss ist an t : ’ ( not ’ A ssi st an t : ’ or any v ari at io n )

work page
[51]

Lines MUST al ter na te strictly between ’ user : ’ and ’ a ss ist an t : ’

work page
[52]

Each line must follow the format : ’ role : content ’ where role is either ’ user ’ or ’ assistant ’

work page
[53]

Output ONLY the c om pl ete d c o n v e r s a t i o n with no preamble , explanation , or extra text

work page
[54]

The ferry is emp__

Maintain the same number of c o n v e r s a t i o n turns as the input User Prompt: 1Example input for fill in the blanks : 2 3as sis ta nt : H_____ What ____ of a t t r a c t i o n s are you looking for ? Are you i n t e r e s t e d in _______ , a__ , or s om eth in g else ? 4user : I ’ m i n t e r e s t e d in ___ and ____ a t t r a c t i o n s in __ . ...

work page

[1] Omar Alonso and Kenneth Church. Evaluating the evaluations: A perspective on benchmarks. In ACM SIGIR Forum, volume 58, pages 1–27. ACM New York, NY, USA, 2025.

[2] Anthropic. Demystifying evals for ai agents. https://www.anthropic.com/engineering/demystifying-evals-for-ai-agents, 2026.

[3] CapitalOne. Synthetic data matters for machine learning innovation. https://www.capitalone.com/tech/machine-learning/synthetic-data-research/, 2022.

[4] Amartya Chakraborty, Paresh Dashore, Nadia Bathaee, Anmol Jain, Anirban Das, Shi-Xiong Zhang, Sambit Sahu, Milind Naphade, and Genta Indra Winata. T1: A tool-oriented conversational dataset for multi-turn agentic planning. arXiv preprint arXiv:2505.16986, 2025.

[5] Enkrypt AI. What are specialized task ai agents? benefits, features & use cases explained. Enkrypt AI Blog (Guest Post), March 2024.

[6] Dan Friedman and Adji Bousso Dieng. The vendi score: A diversity evaluation metric for machine learning. arXiv preprint arXiv:2210.02410, 2022.

[7] Alexander Gill, Abhilasha Ravichander, and Ana Marasović. What has been lost with synthetic evaluation? arXiv preprint arXiv:2505.22830, 2025.

[8] Ben Hutchinson, Negar Rostamzadeh, Christina Greer, Katherine Heller, and Vinodkumar Prabhakaran. Evaluation gaps in machine learning practice. In Proceedings of the 2022 ACM conference on fairness, accountability, and transparency, pages 1859–1876, 2022.

[9] Shadi Iskander, Sofia Tolmach, Ori Shapira, Nachshon Cohen, and Zohar Karnin. Quality matters: Evaluating synthetic data for tool-using llms. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pages 4958–4976, 2024.

[10] Dongzhi Jiang, Ziyu Guo, Renrui Zhang, Zhuofan Zong, Hao Li, Le Zhuo, Shilin Yan, Pheng-Ann Heng, and Hongsheng Li. T2i-r1: Reinforcing image generation with collaborative semantic-level and token-level cot. arXiv preprint arXiv:2505.00703, 2025.

[11] Carlos E Jimenez, John Yang, Alexander Wettig, Shunyu Yao, Kexin Pei, Ofir Press, and Karthik Narasimhan. Swe-bench: Can language models resolve real-world github issues? arXiv preprint arXiv:2310.06770, 2023.

[12] Sayash Kapoor, Benedikt Stroebl, Peter Kirgis, Nitya Nadgir, Zachary S Siegel, Boyi Wei, Tianci Xue, Ziru Chen, Felix Chen, Saiteja Utpala, et al. Holistic agent leaderboard: The missing infrastructure for ai agent evaluation. arXiv preprint arXiv:2510.11977, 2025.

[13] Harsha Kokel, Michael Katz, Kavitha Srinivas, and Shirin Sohrabi. Acpbench: Reasoning about action, change, and planning. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 39, pages 26559–26568, 2025.

[14] Renhao Li, Jianhong Tu, Yang Su, Yantao Liu, Fei Huang, Hamid Alinejad-Rokny, Derek F Wong, Junyang Lin, and Min Yang. Toolrm: Towards agentic tool-use reward modeling. arXiv preprint arXiv:2510.26167, 2025.

[15] Jiachang Liu, Dinghan Shen, Yizhe Zhang, William B Dolan, Lawrence Carin, and Weizhu Chen. What makes good in-context examples for gpt-3? In Proceedings of Deep Learning Inside Out (DeeLIO 2022): The 3rd workshop on knowledge extraction and integration for deep learning architectures, pages 100–114, 2022.

[16] Xiao Liu, Hao Yu, Hanchen Zhang, Yifan Xu, Xuanyu Lei, Hanyu Lai, Yu Gu, Hangliang Ding, Kaiwen Men, Kejuan Yang, et al. Agentbench: Evaluating llms as agents. arXiv preprint arXiv:2308.03688, 2023.

[17] Xing Han Lù, Amirhossein Kazemnejad, Nicholas Meade, Arkil Patel, Dongchan Shin, Alejandra Zambrano, Karolina Stańczak, Peter Shaw, Christopher J Pal, and Siva Reddy. Agentrewardbench: Evaluating automatic evaluations of web agent trajectories. arXiv preprint arXiv:2504.08942, 2025.

[18] Gaurav Maheshwari, Dmitry Ivanov, and Kevin El Haddad. Efficacy of synthetic data as a benchmark. arXiv preprint arXiv:2409.11968, 2024.

[19] Michael Majurski and Cynthia Matuszek. Grounding synthetic data evaluations of language models in unsupervised document corpora. arXiv preprint arXiv:2505.08905, 2025.

[20] Amanda McGrath and Amanda Downie. What are vertical ai agents? IBM Think, n.d.

[21] Sohum Mehta and Saaketh Bhojanam. Prompt genotyping: Quantifying the evaluation gap between synthetic benchmarks and real llm performance. In NeurIPS 2025 Workshop on Evaluating the Evolving LLM Lifecycle: Benchmarks, Emergent Abilities, and Scaling.

[22] Mahmoud Mohammadi, Yipeng Li, Jane Lo, and Wendy Yip. Evaluation and benchmarking of llm agents: A survey. In Proceedings of the 31st ACM SIGKDD Conference on Knowledge Discovery and Data Mining V. 2, pages 6129–6139, 2025.

[23] NVIDIA. NVIDIA NeMo. https://www.nvidia.com/en-us/ai-data-science/products/nemo/

[24] Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. Training language models to follow instructions with human feedback. Advances in neural information processing systems, 35:27730–27744, 2022.

[25] Jiayi Pan, Yichi Zhang, Nicholas Tomlin, Yifei Zhou, Sergey Levine, and Alane Suhr. Autonomous evaluation and refinement of digital agents. arXiv preprint arXiv:2404.06474, 2024.

[26] Melissa Z Pan, Negar Arabzadeh, Riccardo Cogo, Yuxuan Zhu, Alexander Xiong, Lakshya A Agrawal, Huanzhi Mao, Emma Shen, Sid Pallerla, Liana Patel, et al. Measuring agents in production. arXiv preprint arXiv:2512.04123, 2025.

[27] Shishir G Patil, Huanzhi Mao, Fanjia Yan, Charlie Cheng-Jie Ji, Vishnu Suresh, Ion Stoica, and Joseph E Gonzalez. The berkeley function calling leaderboard (bfcl): From tool use to agentic evaluation of large language models. In Forty-second International Conference on Machine Learning.

[28] Yujia Qin, Shihao Liang, Yining Ye, Kunlun Zhu, Lan Yan, Yaxi Lu, Yankai Lin, Xin Cong, Xiangru Tang, Bill Qian, et al. Toolllm: Facilitating large language models to master 16000+ real-world apis. arXiv preprint arXiv:2307.16789, 2023.

[29] Yijia Shao, Yucheng Jiang, Theodore Kanell, Peter Xu, Omar Khattab, and Monica Lam. Assisting in writing wikipedia-like articles from scratch with large language models. In Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), pages 6252–6278, 2024.

[30] Yongliang Shen, Kaitao Song, Xu Tan, Wenqi Zhang, Kan Ren, Siyu Yuan, Weiming Lu, Dongsheng Li, and Yueting Zhuang. Taskbench: Benchmarking large language models for task automation. Advances in Neural Information Processing Systems, 37:4540–4574, 2024.

[31] Alex Tamkin, Miles McCain, Kunal Handa, Esin Durmus, Liane Lovitt, Ankur Rathi, Saffron Huang, Alfred Mountfield, Jerry Hong, Stuart Ritchie, et al. Clio: Privacy-preserving insights into real-world ai use. arXiv preprint arXiv:2412.13678, 2024.

[32] Qiaoyu Tang, Ziliang Deng, Hongyu Lin, Xianpei Han, Qiao Liang, Boxi Cao, and Le Sun. Toolalpaca: Generalized tool learning for language models with 3000 simulated cases. arXiv preprint arXiv:2306.05301, 2023.

[33] B. Wang, W. Chen, H. Pei, C. Xie, M. Kang, C. Zhang, C. Xu, Z. Xiong, R. Dutta, R. Schaeffer, et al. DecodingTrust: A comprehensive assessment of trustworthiness in GPT models. In Conference on Neural Information Processing Systems, 2023.

[34] Shuaiqi Wang, Vikas Raunak, Arturs Backurs, Victor Reis, Pei Zhou, Sihao Chen, Longqi Yang, Zinan Lin, Sergey Yekhanin, and Giulia Fanti. Struct-bench: A benchmark for differentially private structured text generation. In The Thirty-ninth Annual Conference on Neural Information Processing Systems Datasets and Benchmarks Track.

[35] Colin White, Samuel Dooley, Manley Roberts, Arka Pal, Ben Feuer, Siddhartha Jain, Ravid Shwartz-Ziv, Neel Jain, Khalid Saifullah, Sreemanti Dey, et al. Livebench: A challenging, contamination-limited llm benchmark. arXiv preprint arXiv:2406.19314, 2025.

[36] Shengguang Wu, Keming Lu, Benfeng Xu, Junyang Lin, Qi Su, and Chang Zhou. Self-evolved diverse data sampling for efficient instruction tuning. arXiv preprint arXiv:2311.08182, 2023.

[37] Tianbao Xie, Danyang Zhang, Jixuan Chen, Xiaochuan Li, Siheng Zhao, Ruisheng Cao, Toh J Hua, Zhoujun Cheng, Dongchan Shin, Fangyu Lei, et al. Osworld: Benchmarking multimodal agents for open-ended tasks in real computer environments. Advances in Neural Information Processing Systems, 37:52040–52094, 2024.

[38] Lang Xiong, Nishant Bhargava, Jianhang Hong, Jeremy Chang, Haihao Liu, Vasu Sharma, and Kevin Zhu. Probe-rewrite-evaluate: A workflow for reliable benchmarks and quantifying evaluation awareness. arXiv preprint arXiv:2509.00591, 2025.

[39] Tianci Xue, Weijian Qi, Tianneng Shi, Chan Hee Song, Boyu Gou, Dawn Song, Huan Sun, and Yu Su. An illusion of progress? assessing the current state of web agents. arXiv preprint arXiv:2504.01382, 2025.

[40] John Yang, Carlos E Jimenez, Alexander Wettig, Kilian Lieret, Shunyu Yao, Karthik Narasimhan, and Ofir Press. Swe-agent: Agent-computer interfaces enable automated software engineering. Advances in Neural Information Processing Systems, 37:50528–50652, 2024.

[41] John Yang, Akshara Prabhakar, Karthik Narasimhan, and Shunyu Yao. Intercode: Standardizing and benchmarking interactive coding with execution feedback. Advances in Neural Information Processing Systems, 36:23826–23854, 2023.

[42] Shunyu Yao, Noah Shinn, Pedram Razavi, and Karthik Narasimhan. τ-bench: A benchmark for tool-agent-user interaction in real-world domains. arXiv preprint arXiv:2406.12045, 2024.

[43] Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik R Narasimhan, and Yuan Cao. React: Synergizing reasoning and acting in language models. In The eleventh international conference on learning representations, 2022.

[44] Asaf Yehudai, Lilach Eden, Alan Li, Guy Uziel, Yilun Zhao, Roy Bar-Haim, Arman Cohan, and Michal Shmueli-Scheuer. Survey on evaluation of llm-based agents. arXiv preprint arXiv:2503.16416, 2025.

[45] Zeyu Zhang, Guohao Li, Zhenchang Xing, Alexandros Apostolopoulos, Yu Lin Lee, and Liang Zheng. Gecko: A simulation environment to ground agent tool calls with stateful feedback for refinement.

[46] Shuyan Zhou, Frank F Xu, Hao Zhu, Xuhui Zhou, Robert Lo, Abishek Sridhar, Xianyi Cheng, Tianyue Ou, Yonatan Bisk, Daniel Fried, et al. Webarena: A realistic web environment for building autonomous agents. arXiv preprint arXiv:2307.13854, 2023.

[47] Kaijie Zhu, Jiaao Chen, Jindong Wang, Neil Zhenqiang Gong, Diyi Yang, and Xing Xie. Dyval: Dynamic evaluation of large language models for reasoning tasks. arXiv preprint arXiv:2309.17167, 2023.

[48] Yuxuan Zhu, Tengjun Jin, Yada Pruksachatkun, Andy Zhang, Shu Liu, Sasha Cui, Sayash Kapoor, Shayne Longpre, Kevin Meng, Rebecca Weiss, et al. Establishing best practices for building rigorous agentic benchmarks. arXiv preprint arXiv:2507.02825, 2025.

[49] Kaijian Zou, Muhammad Khalifa, and Lu Wang. On many-shot in-context learning for long-context evaluation. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 25605–25639, 2025.

A Related Work

Robust benchmarks for interactive tool-use are necessary both for generalist agents (e.g., co...

The conversation MUST start with 'assistant:' (not 'Assistant:' or any variation)

Lines MUST alternate strictly between 'user:' and 'assistant:'

Each line must follow the format: 'role: content' where role is either 'user' or 'assistant'

Output ONLY the completed conversation with no preamble, explanation, or extra text

Maintain the same number of conversation turns as the input

User Prompt:

Example input for fill in the blanks:

assistant: H_____ What ____ of attractions are you looking for? Are you interested in _______, a__, or something else?
user: I'm interested in ___ and ____ attractions in __. ...