Beyond Benchmark Islands: Toward Representative Trustworthiness Evaluation for Agentic AI

Irwin King; Jinhu Qi; Minghao Zhao; Wentao Zhang; Yaoman Li; Yifan Li; Zijian Zhang

REVIEW · 1 major objection · 2 minor · 24 references

Interventions tuned on one model raise trustworthiness across 13 agentic systems from seven families on a 100-scenario suite.

Reviewed by Pith at T0; open to challenge. T0 means a machine referee read the full paper against a public rubric.


T0 review · grok-4.3

2026-05-22 10:15 UTC pith:GOAWQEKR

Load-bearing objection: The paper defines a five-property trustworthiness profile for agents and shows interventions from its HAAF framework transferring across 13 systems, but the shared 100-scenario set for design and testing undercuts the no-tuning claim.

arxiv 2603.14987 v2 pith:GOAWQEKR submitted 2026-03-16 cs.CL cs.DB

Beyond Benchmark Islands: Toward Representative Trustworthiness Evaluation for Agentic AI

Jinhu Qi, Yifan Li, Minghao Zhao, Wentao Zhang, Zijian Zhang, Yaoman Li, Irwin King


keywords agentic AI · trustworthiness evaluation · scenario manifold · HAAF · Trustworthy Optimization Factory · cross-model generalization · risk-weighted profile · five-property profile

verification ladder T0 review T1 audit T2 compute T3 formal T4 reserved

The pith

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper defines agent trustworthiness as a five-property profile covering reliability, robustness, safety, social-ethical alignment, and operational integrity. It introduces a framework that samples scenarios representatively and runs an optimization loop that turns failure diagnoses into fixes. Fixes derived from a single focal model transfer, without per-model or per-scenario tuning, to thirteen systems spanning the Llama, Mistral, Kimi, GLM, Qwen, GPT, and DeepSeek families. All thirteen systems show measurable gains and two reach a perfect risk-weighted profile. A sympathetic reader cares because current benchmarks test isolated tasks, while real agent deployments fail across interacting socio-technical dimensions.

Core claim

Agentic trustworthiness is operationalized as the five-property profile of Reliability, Robustness, Safety, Social-Ethical Alignment, and Operational Integrity. The Holographic Agent Assessment Framework evaluates this profile over a distribution-aware scenario manifold using static policy analysis, sandbox simulation, social-ethical checks, and iterative red-to-blue optimization. Interventions derived from a single focal model generalize without per-model or per-scenario retuning to thirteen systems drawn from seven families on a one-hundred-scenario suite, producing uniform improvement and perfect risk-weighted profiles for two systems.

What carries the argument

The Trustworthy Optimization Factory inside HAAF, which converts red-team scenario failures into reusable blue-team interventions that are then applied across models.

Load-bearing premise

The five chosen properties plus the sampled scenario distribution together capture every relevant failure mode an agent might exhibit in real deployment.

What would settle it

Deploy one of the improved agents in an untested real-world workflow and observe a failure mode that lies outside the five properties yet produces measurable harm.


If this is right

A single set of interventions can be reused across model families without retuning.
Scalar leaderboards miss property-level trade-offs that become visible once evaluation follows a scenario distribution.
Deployment readiness can be assessed by running the full manifold rather than isolated benchmarks.
Two of the thirteen systems reach a perfect risk-weighted profile under the defined metric.
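The point about scalar leaderboards can be illustrated with a minimal sketch. The scores below are hypothetical numbers chosen for illustration, not results from the paper, and the helper names `scalar_score` and `weakest_property` are assumptions introduced here. The sketch shows two agents that tie on a single averaged score while differing sharply on one property.

```python
from statistics import mean

# The five properties named in the paper's trustworthiness profile.
PROPERTIES = [
    "reliability",
    "robustness",
    "safety",
    "social_ethical_alignment",
    "operational_integrity",
]

# Hypothetical per-property scores in percent; illustrative only.
agent_a = dict(zip(PROPERTIES, [90, 90, 50, 80, 90]))
agent_b = dict(zip(PROPERTIES, [80, 80, 80, 80, 80]))


def scalar_score(profile):
    """Collapse a profile into one leaderboard-style number (plain mean)."""
    return mean(profile.values())


def weakest_property(profile):
    """Return the (name, score) pair of the lowest-scoring property."""
    name = min(profile, key=profile.get)
    return name, profile[name]


# Both agents tie on the scalar score, yet agent_a has a clear safety gap.
print(scalar_score(agent_a), scalar_score(agent_b))
print(weakest_property(agent_a))
```

A profile-level report keeps the safety gap of the first agent visible, whereas the averaged score hides it.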

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Organizations could maintain a shared library of interventions that are updated once and applied to any new model family.
Future agent evaluations may need to treat scenario sampling as a first-class research problem rather than a fixed benchmark.
If the five-property profile proves incomplete, the measured generalization gains would shrink once new failure classes are added.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit.

Desk Editor's Note

The paper defines a five-property trustworthiness profile for agents and shows interventions from its HAAF framework transferring across 13 systems, but the shared 100-scenario set for design and testing undercuts the no-tuning claim.

The main point to take away is that the authors have tried to operationalize trustworthiness for agentic AI with a five-property profile and built a framework around it to create interventions that appear to improve performance across many different models. They do something useful by moving away from single benchmarks toward sampling scenarios in a way that might better reflect real distributions, and the experiment with 13 systems from Llama to DeepSeek shows all improving under their approach. The Trustworthy Optimization Factory idea for turning diagnoses into fixes is a concrete contribution that could be picked up by others working on deployment readiness.

Where it gets softer is on the generalization claim. The interventions were developed using the same 100-scenario suite where the final results are reported, so it's not obvious that the no-per-model-tuning success would hold on a truly held-out set of scenarios. That could be an internal validity issue rather than just external representativeness. The paper also assumes the five properties and the sampling cover the important failure modes, but without more on how the manifold was built or exclusion rules, it's tough to assess if key risks are left out.

This work is for people in AI safety who want practical tools for evaluating agents beyond leaderboards. A reader interested in building better evaluation pipelines might find the HAAF structure helpful, even if they adapt parts of it. It deserves to go to peer review because the topic is important and the experiment provides some evidence, though the methods section will need close attention for reproducibility and to address the scenario overlap concern.

Referee Report

1 major / 2 minor

Summary. The manuscript introduces the Holographic Agent Assessment Framework (HAAF) to address fragmented evaluation of agentic AI trustworthiness. It defines trustworthiness operationally as a five-property profile (Reliability, Robustness, Safety, Social-Ethical Alignment, Operational Integrity) grounded in existing risk frameworks, and proposes a scenario manifold with distribution-aware sampling, static policy analysis, sandbox simulation, and the iterative Trustworthy Optimization Factory that converts red-team diagnoses into interventions. The central claim is that interventions designed from a single focal model generalize without per-model or per-scenario tuning to 13 systems across seven families on a 100-scenario suite, with all systems improving and two reaching perfect risk-weighted profiles.

Significance. If the transfer results are substantiated, the work would advance the field by shifting from isolated benchmarks to a distribution-aware, multi-property evaluation that can reveal trade-offs and support model-agnostic improvement pipelines. The public code release at https://github.com/TonyQJH/haaf-pilot is a clear strength that enables inspection of the scenario manifold and Factory process, supporting reproducibility.

major comments (1)

[§5] Cross-family transfer experiment: the reported generalization and improvements for all 13 systems are measured on the identical 100-scenario suite used to generate the interventions via red-team diagnoses and the Trustworthy Optimization Factory. This internal-validity concern is load-bearing for the no-per-scenario-tuning claim, as any distribution-aware sampling or diagnosis step could implicitly fit the manifold; a hold-out scenario subset or external test set is required to confirm that benefits are not artifacts of the evaluation distribution.

minor comments (2)

[Methods] The description of scenario construction, exclusion rules, and any statistical tests or error bars for the 100-scenario results should be expanded for verifiability.
[§3] Notation for the risk-weighted profile and property-level trade-offs could be clarified with an explicit equation or table to avoid ambiguity in how the five properties are aggregated.
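To make the second minor comment concrete, here is one way an explicit aggregation rule could be written down. This is a sketch under stated assumptions, not the paper's formula: the review does not reproduce how HAAF computes its risk-weighted profile, so the weighted pass rate per property, the function names `risk_weighted_profile` and `is_perfect`, and the example weights are all hypothetical.

```python
from collections import defaultdict


def risk_weighted_profile(results):
    """Aggregate scenario checks into one score per property.

    results: iterable of (property_name, risk_weight, passed) tuples.
    Assumed rule (hypothetical): score = sum of weights of passed checks
    divided by the sum of all weights for that property.
    """
    weighted_pass = defaultdict(float)
    total_weight = defaultdict(float)
    for prop, weight, passed in results:
        if weight <= 0:
            raise ValueError("risk weights must be positive")
        total_weight[prop] += weight
        if passed:
            weighted_pass[prop] += weight
    return {p: weighted_pass[p] / total_weight[p] for p in total_weight}


def is_perfect(profile, tol=1e-9):
    """A profile is 'perfect' here if every property scores 1.0."""
    return all(score >= 1.0 - tol for score in profile.values())


# Hypothetical checks: a high-risk safety pass, a low-risk safety failure,
# and a single reliability pass.
example = [
    ("safety", 3.0, True),
    ("safety", 1.0, False),
    ("reliability", 1.0, True),
]
profile = risk_weighted_profile(example)
print(profile, is_perfect(profile))
```

Stating the rule at this level of detail, in an equation or a table, would let readers check how a failure on a high-risk scenario moves a property score compared with a failure on a low-risk one.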

Simulated Authors' Rebuttal

1 response · 0 unresolved

We thank the referee for their detailed and constructive feedback on our work. Below we provide a point-by-point response to the major comment, outlining the revisions we will make to strengthen the manuscript.


Referee: [§5] Cross-family transfer experiment: the reported generalization and improvements for all 13 systems are measured on the identical 100-scenario suite used to generate the interventions via red-team diagnoses and the Trustworthy Optimization Factory. This internal-validity concern is load-bearing for the no-per-scenario-tuning claim, as any distribution-aware sampling or diagnosis step could implicitly fit the manifold; a hold-out scenario subset or external test set is required to confirm that benefits are not artifacts of the evaluation distribution.

Authors: We acknowledge this valid concern regarding potential distributional artifacts in our cross-family transfer results. Although the interventions are generated solely from diagnoses on the focal model and applied without any per-model or per-scenario tuning, the use of the same 100-scenario suite for both intervention design and evaluation does leave open the possibility of implicit fitting to the manifold. To rigorously address this, we will revise the paper to include a hold-out scenario subset. We plan to reserve a portion of the scenarios (not involved in sampling, diagnosis, or Factory optimization) as a test set and demonstrate that the improvements hold on these unseen scenarios for the 13 models. Updated results and discussion will be added to Section 5.

Revision: yes
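The hold-out plan described in this exchange could be set up along the following lines. This is a minimal sketch, not the authors' procedure: the stratum labels, the 20% hold-out fraction, the fixed seed, and the function name `split_holdout` are assumptions made for illustration. The idea is to fix the split before any diagnosis or Factory optimization runs, and to stratify so the held-out scenarios cover the same categories as the design set.

```python
import random
from collections import defaultdict


def split_holdout(scenarios, holdout_fraction=0.2, seed=0):
    """Split scenarios into design and hold-out sets, stratum by stratum.

    scenarios: list of (scenario_id, stratum) pairs.
    Returns (design_ids, holdout_ids); the two lists are disjoint.
    """
    by_stratum = defaultdict(list)
    for scenario_id, stratum in scenarios:
        by_stratum[stratum].append(scenario_id)

    rng = random.Random(seed)  # fixed seed so the split is reproducible
    design, holdout = [], []
    for stratum in sorted(by_stratum):
        ids = sorted(by_stratum[stratum])
        rng.shuffle(ids)
        # Hold out at least one scenario from any stratum with two or more.
        k = max(1, round(len(ids) * holdout_fraction)) if len(ids) > 1 else 0
        holdout.extend(ids[:k])
        design.extend(ids[k:])
    return design, holdout


# Hypothetical suite: 100 scenarios spread evenly over five strata.
suite = [(f"s{i:03d}", f"stratum_{i % 5}") for i in range(100)]
design_ids, holdout_ids = split_holdout(suite)
print(len(design_ids), len(holdout_ids))
```

Interventions would then be derived only from `design_ids`, with the transfer results reported separately on `holdout_ids`.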

Circularity Check

0 steps flagged

No circularity: definitions and empirical transfer results remain independent of input data by construction


The paper explicitly defines the five-property trustworthiness profile and introduces the HAAF framework plus Trustworthy Optimization Factory as operational tools grounded in existing AI risk frameworks. The central claim of cross-family generalization is presented as an empirical outcome from applying interventions (derived from one focal model) to 13 systems on the fixed 100-scenario suite, with no equations, parameter fitting, or self-citations shown that would force the reported improvements to equal the input definitions or scenario manifold by construction. The shared suite constitutes a methodological decision for measuring transfer across models rather than a self-referential reduction; the results could in principle have shown no improvement or negative transfer without violating any definitional step.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 1 invented entity

Only the abstract is available, so the ledger is necessarily incomplete; the five properties appear defined rather than derived, and the scenario manifold is postulated without stated external validation.

axioms (2)

Domain assumption: The five properties (Reliability, Robustness, Safety, Social-Ethical Alignment, Operational Integrity) together form a sufficient and non-redundant specification of agentic trustworthiness.
Invoked when the authors state they address the absence of a measurable specification by defining this profile grounded in current AI risk frameworks.

Domain assumption: The chosen scenario manifold and distribution-aware sampling produce a representative socio-technical distribution.
Central to the claim that HAAF measures the profile over a scenario manifold rather than disconnected benchmark instances.

invented entities (1)

Trustworthy Optimization Factory (no independent evidence)
Purpose: Iterative loop that converts red-team diagnoses into blue-team interventions transferable across models.
Introduced as the mechanism connecting static policy analysis, sandbox simulation, and alignment assessment; no independent falsifiable prediction outside the framework is stated.



Original abstract

Agentic AI systems increasingly act through tool-augmented, multi-step workflows whose failures (unsafe tool use, unauthorised actions, social harm) carry deployment-level consequences. Evaluation practice remains fragmented across isolated benchmark slices, and "trustworthiness" is frequently invoked but rarely defined operationally. We argue the central limitation is twofold: (i) the absence of a measurable specification of what agent trustworthiness means, and (ii) the lack of a principled notion of representativeness allowing assessment over a socio-technical scenario distribution rather than disconnected benchmark instances. We address (i) by defining agentic trustworthiness as a five-property profile (Reliability, Robustness, Safety, Social-Ethical Alignment, Operational Integrity) grounded in current AI risk frameworks, and (ii) with the Holographic Agent Assessment Framework (HAAF), which measures this profile over a scenario manifold through static policy analysis, sandbox simulation, social-ethical alignment assessment, and distribution-aware sampling, connected through an iterative Trustworthy Optimization Factory that converts red-team diagnoses into blue-team interventions. Our contributions are: (1) an operational five-property definition of agentic trustworthiness; (2) a distribution-aware scenario-sampling framework that surfaces property-level trade-offs invisible to scalar leaderboards; and (3) a cross-family transfer experiment in which interventions designed from a single focal model generalise -- without per-model or per-scenario tuning -- to 13 systems from seven model families (Llama, Mistral, Kimi, GLM, Qwen, GPT, DeepSeek) on a 100-scenario suite, where all 13 systems improve and two reach a perfect risk-weighted profile, establishing HAAF's Factory as a model-agnostic deployment-readiness pipeline. Code: https://github.com/TonyQJH/haaf-pilot



