pith. machine review for the scientific record.

arxiv: 2604.28139 · v2 · submitted 2026-04-30 · 💻 cs.SE · cs.AI

Recognition: unknown

Claw-Eval-Live: A Live Agent Benchmark for Evolving Real-World Workflows

Authors on Pith: no claims yet

Pith reviewed 2026-05-07 05:48 UTC · model grok-4.3

classification 💻 cs.SE cs.AI
keywords LLM agents · workflow automation · agent benchmarking · live evaluation · task execution · business workflows · agent reliability · execution traces

The pith

A live benchmark reveals that even the best LLM agents complete only two-thirds of evolving real-world workflow tasks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents Claw-Eval-Live as a benchmark that refreshes tasks from public workflow signals while keeping each release fixed and verifiable through execution traces and workspace artifacts. It tests 13 frontier models on 105 tasks drawn from controlled business services and local workspace repair, finding that the top performer succeeds on just 66.7 percent and none exceed 70 percent. Failures concentrate in HR, management, and multi-system workflows, while local workspace tasks prove somewhat easier yet still incomplete. The design records full audit logs and service states so that grading combines deterministic checks with targeted semantic judgment. This shows current agents cannot yet deliver reliable end-to-end automation across structured, changing demands.

Core claim

Claw-Eval-Live shows reliable workflow automation remains far from solved: the leading model passes only 66.7 percent of tasks and no model reaches 70 percent. Failures are structured by task family and execution surface, with HR, management, and multi-system business workflows as persistent bottlenecks and local workspace repair comparatively easier but unsaturated. Leaderboard rank alone is insufficient because models with similar pass rates can diverge in overall completion, and task-level discrimination concentrates in a middle band of tasks.

What carries the argument

Claw-Eval-Live, which separates a refreshable signal layer, updated from public workflow-demand signals such as the ClawHub Top-500 skills, from a reproducible, time-stamped release snapshot materialized with fixed fixtures, services, workspaces, and graders.
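
As a rough, hypothetical sketch of this two-layer design (the class and function names below are illustrative, not taken from the paper's released artifact), a refresh can be modeled as freezing ranked public signals into an immutable, time-stamped task set:

```python
# Hypothetical sketch of the signal-layer / release-snapshot split; names are
# illustrative and not from the paper's artifact.
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Callable, Optional

@dataclass(frozen=True)
class SignalRecord:
    """One entry from a public workflow-demand signal (e.g. a ranked skill)."""
    skill_id: str
    rank: int
    description: str

@dataclass(frozen=True)
class Task:
    """A controlled task with fixed fixtures, services, workspace, and grader."""
    task_id: str
    fixtures: dict        # frozen service/workspace state the task starts from
    grader_spec: str      # deterministic check or structured-judge specification

@dataclass(frozen=True)
class ReleaseSnapshot:
    """A time-stamped, reproducible release built from the refreshable signal layer."""
    release_tag: str
    created_at: datetime
    tasks: tuple

def materialize_release(signals: list,
                        build_task: Callable[[SignalRecord], Optional[Task]],
                        release_tag: str) -> ReleaseSnapshot:
    """Freeze the current signal layer into a fixed, verifiable task set.

    Signals that cannot be turned into a verifiable task (no observable
    post-execution artifact, unavailable external account, ...) are skipped,
    mirroring the exclusion criteria the rebuttal promises to document.
    """
    tasks = tuple(t for s in sorted(signals, key=lambda s: s.rank)
                  if (t := build_task(s)) is not None)
    return ReleaseSnapshot(release_tag=release_tag,
                           created_at=datetime.now(timezone.utc),
                           tasks=tasks)
```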

If this is right

  • Agents must make targeted gains on HR, management, and multi-system coordination tasks to approach reliable automation.
  • Evaluation should record execution traces, audit logs, and workspace artifacts rather than grading final responses alone.
  • Live benchmarks refreshed from external demand signals can track whether progress keeps pace with changing workflows.
  • Models with similar overall pass rates can still differ substantially in completion quality and error patterns.
  • Discrimination among models concentrates in a middle band of tasks, so benchmarks gain power by focusing there.
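
The last point can be made concrete with a small, assumed calculation over a models-by-tasks pass matrix; the paper's own threshold-aware discrimination analysis (Figure 5) may use a different statistic.

```python
# Illustrative only: flag the "middle band" of tasks whose cross-model pass
# fraction is neither near 0 nor near 1, since those are the tasks that can
# separate models. Thresholds here are assumed, not the paper's.
import numpy as np

def middle_band(pass_matrix: np.ndarray, low: float = 0.2, high: float = 0.8):
    """pass_matrix: boolean array of shape (n_models, n_tasks)."""
    frac = pass_matrix.mean(axis=0)              # per-task pass fraction over models
    return np.where((frac > low) & (frac < high))[0], frac

# Toy example: 4 models x 6 tasks.
rng = np.random.default_rng(0)
matrix = rng.random((4, 6)) < 0.6
band, frac = middle_band(matrix)
print("pass fractions:", np.round(frac, 2))
print("middle-band task indices:", band)
```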

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Adding dynamic service mutations between releases would test whether agents adapt when external systems change without notice.
  • The same live-refresh structure could be applied to other agent domains such as data pipelines or code maintenance to keep evaluations current.
  • Correlating failure modes with task-family statistics could guide targeted data collection for training on persistent bottlenecks.
  • Releasing the full execution traces publicly would let the community develop finer-grained diagnostics beyond binary pass rates.

Load-bearing premise

Tasks constructed from public workflow-demand signals and materialized with fixed fixtures, services, and workspaces accurately represent evolving real-world workflow demands, and the mixed deterministic checks plus structured LLM judging reliably identify successful execution.
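
A minimal sketch of that grading order follows, assuming a conjunctive pass rule in which every deterministic check must pass on trace and artifact evidence before a structured LLM judge is consulted for the remaining semantic dimensions; the names and signatures are hypothetical, not the paper's grader.

```python
# Minimal sketch, not the paper's grader: deterministic checks on execution
# traces and workspace artifacts come first; the structured LLM judge is
# consulted only for dimensions that need semantic judgment.
from typing import Callable, Iterable

def grade_task(trace: dict,
               workspace: dict,
               deterministic_checks: Iterable[Callable[[dict, dict], bool]],
               semantic_dims: Iterable[str],
               judge: Callable[[str, dict, dict], bool]) -> bool:
    """Return pass/fail for one task under a conjunctive pass rule (assumed)."""
    # Every deterministic check must pass on the recorded evidence.
    if not all(check(trace, workspace) for check in deterministic_checks):
        return False
    # Only the remaining semantic dimensions go to the structured LLM judge.
    return all(judge(dim, trace, workspace) for dim in semantic_dims)
```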

What would settle it

A new model that passes more than 90 percent of tasks across two successive releases would falsify the claim that reliable automation is far from solved; human raters judging the task set unrepresentative of common workplace demands would instead undercut the benchmark's relevance to that claim.

Figures

Figures reproduced from arXiv: 2604.28139 by Benyou Wang, Bowen Ye, Chenxin Li, Lei Li, Mingxin Huang, Rang Li, Shengyuan Liu, Shijue Huang, Yixuan Yuan, Yunlong Lin, Zhengyang Tang.

  • Figure 1: Overview of Claw-Eval-Live. The benchmark starts from a refreshable snapshot of public …
  • Figure 2: Signal-to-snapshot construction in Claw-Eval-Live. A refresh starts from public workflow …
  • Figure 3: Leaderboard metric landscape for Claw-Eval-Live. Each point is one model, with Pass Rate …
  • Figure 4: Pass-rate heatmap for 13 models over seven analysis groups derived from the released task …
  • Figure 5: Threshold-aware task discrimination in the public release. Under the public pass rule, tasks …
  • Figure 6: Overview of Claw-Eval-Live. Candidate tasks are piloted and screened, curated into a …
Original abstract

LLM agents are expected to complete end-to-end units of work across software tools, business services, and local workspaces. Yet many agent benchmarks freeze a curated task set at release time and grade mainly the final response, making it difficult to evaluate agents against evolving workflow demand or verify whether a task was executed. We introduce Claw-Eval-Live, a live benchmark for workflow agents that separates a refreshable signal layer, updated across releases from public workflow-demand signals, from a reproducible, time-stamped release snapshot. Each release is constructed from public workflow-demand signals, with ClawHub Top-500 skills used in the current release, and materialized as controlled tasks with fixed fixtures, services, workspaces, and graders. For grading, Claw-Eval-Live records execution traces, audit logs, service state, and post-run workspace artifacts, using deterministic checks when evidence is sufficient and structured LLM judging only for semantic dimensions. The release contains 105 tasks spanning controlled business services and local workspace repair, and evaluates 13 frontier models under a shared public pass rule. Experiments reveal that reliable workflow automation remains far from solved: the leading model passes only 66.7% of tasks and no model reaches 70%. Failures are structured by task family and execution surface, with HR, management, and multi-system business workflows as persistent bottlenecks and local workspace repair comparatively easier but unsaturated. Leaderboard rank alone is insufficient because models with similar pass rates can diverge in overall completion, and task-level discrimination concentrates in a middle band of tasks. Claw-Eval-Live suggests that workflow-agent evaluation should be grounded twice, in fresh external demand and in verifiable agent action.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The manuscript introduces Claw-Eval-Live, a live benchmark for LLM workflow agents that decouples a refreshable public signal layer (ClawHub Top-500 skills) from a time-stamped, reproducible snapshot. It materializes 105 tasks across controlled business services and local workspaces, records execution traces and artifacts, and applies mixed deterministic/LLM grading. Evaluation of 13 frontier models shows a maximum pass rate of 66.7%, with none exceeding 70%, and structured failures concentrated in HR, management, and multi-system workflows. The authors argue that leaderboard rank alone is insufficient and that evaluation must be grounded in fresh external demand and verifiable actions.

Significance. If the task construction and grading validity hold, the results would establish that reliable end-to-end workflow automation remains unsolved for frontier models, with clear bottlenecks in complex business processes. The live, trace-based design with mixed grading offers a concrete advance over static final-answer benchmarks and could steer the field toward more reproducible and adaptive evaluation practices.

major comments (3)
  1. [§3] §3 (Task Construction and Materialization): The central performance claims (66.7% ceiling, family-level bottlenecks) depend on the 105 tasks accurately proxying evolving real-world workflow demand. The manuscript freezes public signals into tasks with fixed fixtures, services, and workspaces for reproducibility, but provides no validation study, exclusion criteria, or comparison against live deployment variability, API drift, or cross-system noise. This assumption is load-bearing for interpreting the reported pass rates and the conclusion that automation is far from solved.
  2. [§4.2] §4.2 (Grading and Labeling): Grading combines deterministic checks with structured LLM judging for semantic dimensions, yet the manuscript reports no human agreement rates, judge calibration data, inter-rater reliability, or error analysis on the LLM component. Absent these, label noise could systematically inflate or deflate pass rates and distort the identified bottlenecks in HR and management families.
  3. [§5] §5 (Experimental Results): The key quantitative claims (leading model at 66.7%, no model above 70%, task-family discrimination) are presented without statistical error bars, confidence intervals, or sensitivity analysis over the 105-task set. This omission weakens assessment of whether observed differences between models or the middle-band task discrimination are robust.
minor comments (2)
  1. [Abstract / §3] The abstract and §3 refer to 'ClawHub Top-500 skills' without stating the precise selection, filtering, or refresh protocol; a short subsection or appendix table listing the skill categories and update cadence would improve reproducibility.
  2. [§5] The figure or table showing per-model completion metrics beyond raw pass rate (needed to support the claim that similar pass rates can diverge in overall completion) is referenced but not described in detail; ensure it is clearly captioned and cross-referenced.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We address each major comment below and outline revisions to improve the manuscript's rigor.

Point-by-point responses
  1. Referee: [§3] §3 (Task Construction and Materialization): The central performance claims (66.7% ceiling, family-level bottlenecks) depend on the 105 tasks accurately proxying evolving real-world workflow demand. The manuscript freezes public signals into tasks with fixed fixtures, services, and workspaces for reproducibility, but provides no validation study, exclusion criteria, or comparison against live deployment variability, API drift, or cross-system noise. This assumption is load-bearing for interpreting the reported pass rates and the conclusion that automation is far from solved.

    Authors: We appreciate the emphasis on validating task representativeness. The benchmark design sources tasks from the refreshable public ClawHub Top-500 signals and materializes them into fixed, time-stamped snapshots precisely to support reproducible evaluation of verifiable actions. Task selection followed the public skill ranking with instantiation rules that ensure deterministic fixtures and observable outcomes. While a dedicated live-deployment comparison study was not performed (as it would conflict with the reproducibility goal), we will revise §3 to add explicit exclusion criteria (e.g., skills lacking clear post-execution artifacts or requiring unavailable external accounts) and a new limitations paragraph discussing API drift and cross-system variability. This preserves the core contribution while clarifying the proxy nature of the 105-task set. revision: partial

  2. Referee: [§4.2] §4.2 (Grading and Labeling): Grading combines deterministic checks with structured LLM judging for semantic dimensions, yet the manuscript reports no human agreement rates, judge calibration data, inter-rater reliability, or error analysis on the LLM component. Absent these, label noise could systematically inflate or deflate pass rates and distort the identified bottlenecks in HR and management families.

    Authors: We agree that quantitative validation of the LLM judging component strengthens confidence in the labels. The manuscript already restricts LLM judging to semantic dimensions only after deterministic checks are exhausted, using fixed structured prompts. To directly address the concern, we will add to §4.2 a human agreement study on a stratified sample of 25 tasks (including HR and management families), reporting inter-rater reliability (Cohen's kappa) between two human annotators and the LLM judge, plus calibration notes and a brief error analysis of disagreements. These results will be included in the revised version. revision: yes

  3. Referee: [§5] §5 (Experimental Results): The key quantitative claims (leading model at 66.7%, no model above 70%, task-family discrimination) are presented without statistical error bars, confidence intervals, or sensitivity analysis over the 105-task set. This omission weakens assessment of whether observed differences between models or the middle-band task discrimination are robust.

    Authors: We thank the referee for noting the missing statistical support. Pass rates are reported as direct proportions on the fixed 105-task release. In revision we will augment §5 with bootstrap 95% confidence intervals for overall and per-family pass rates, plus a sensitivity analysis consisting of 1,000 random 80% subsamples of the task set to assess stability of model ordering and bottleneck identification. These additions will quantify the robustness of the 66.7% ceiling and family-level patterns. revision: yes
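
Two minimal sketches of the analyses promised in responses 2 and 3 follow; both use assumed parameters and placeholder data rather than anything reported in the paper.

For response 2, pairwise Cohen's kappa between the two human annotators and the LLM judge could be computed with scikit-learn; the binary pass/fail label vectors below are illustrative placeholders.

```python
# Pairwise Cohen's kappa between two human annotators and the LLM judge on a
# stratified task sample. The label vectors are placeholders for illustration.
from itertools import combinations
from sklearn.metrics import cohen_kappa_score

labels = {
    "human_1":   [1, 0, 1, 1, 0, 1, 0, 1],
    "human_2":   [1, 0, 1, 0, 0, 1, 0, 1],
    "llm_judge": [1, 0, 1, 1, 0, 1, 1, 1],
}

for a, b in combinations(labels, 2):
    print(f"kappa({a}, {b}) = {cohen_kappa_score(labels[a], labels[b]):.2f}")
```

For response 3, a percentile bootstrap over the 105 binary task outcomes yields a confidence interval for a model's pass rate; the outcome vector below merely reproduces the reported 66.7% headline (70 of 105), not the paper's per-task data, and the resample count is assumed.

```python
# Percentile bootstrap CI for a model's pass rate over a fixed task set.
import numpy as np

def bootstrap_pass_rate_ci(outcomes: np.ndarray, n_boot: int = 10_000,
                           alpha: float = 0.05, seed: int = 0):
    rng = np.random.default_rng(seed)
    resamples = rng.choice(outcomes, size=(n_boot, len(outcomes)), replace=True)
    means = resamples.mean(axis=1)
    lo, hi = np.quantile(means, [alpha / 2, 1 - alpha / 2])
    return outcomes.mean(), (lo, hi)

outcomes = np.array([1] * 70 + [0] * 35)   # 70/105 ≈ 66.7% passes
rate, (lo, hi) = bootstrap_pass_rate_ci(outcomes)
print(f"pass rate {rate:.3f}, 95% CI [{lo:.3f}, {hi:.3f}]")
```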

Circularity Check

0 steps flagged

No circularity: purely empirical benchmark from external signals

full rationale

The paper constructs and evaluates a live benchmark by materializing tasks from public ClawHub Top-500 skills into fixed, reproducible environments with deterministic checks and LLM judging where needed. Central results (leading model at 66.7% pass rate, family-level failure patterns) are direct empirical measurements on the 105-task release snapshot. No equations, fitted parameters, predictions derived from the same data, or self-referential derivations appear. The work relies on external public workflow-demand signals rather than any internal ansatz or uniqueness theorem. This is a standard empirical benchmark paper whose claims rest on observable execution traces and audit logs, not on any reduction to its own inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claim rests primarily on the domain assumption that public signals accurately proxy real workflow demand; no free parameters or invented entities are introduced in the abstract.

axioms (1)
  • domain assumption Public workflow-demand signals such as ClawHub Top-500 skills provide a representative and refreshable source for constructing realistic tasks.
    Invoked when each release is constructed from public signals and materialized as controlled tasks.

pith-pipeline@v0.9.0 · 5634 in / 1579 out tokens · 75617 ms · 2026-05-07T05:48:44.693184+00:00 · methodology

discussion (0)


Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. LITMUS: Benchmarking Behavioral Jailbreaks of LLM Agents in Real OS Environments

    cs.CR 2026-05 conditional novelty 8.0

    LITMUS is the first benchmark using semantic-physical dual verification and OS state rollback to measure behavioral jailbreaks in LLM agents, revealing that even strong models execute 40%+ of high-risk operations and ...

Reference graph

Works this paper leans on

54 extracted references · 24 canonical work pages · cited by 1 Pith paper · 13 internal anchors
