Pith · machine review for the scientific record

arxiv: 2605.10365 · v1 · submitted 2026-05-11 · 💻 cs.AI

Recognition: 1 theorem link · Lean Theorem

Agent-ValueBench: A Comprehensive Benchmark for Evaluating Agent Values

Authors on Pith: no claims yet

Pith reviewed 2026-05-12 04:18 UTC · model grok-4.3

classification 💻 cs.AI
keywords: agent values · value alignment · AI agents · harnesses · skill steering · value systems · AI safety · benchmarks

The pith

Agents display values distinct from those of their base LLMs; these values form a cross-model Value Tide that harnesses and embedded skills can bend.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper sets out to demonstrate that autonomous agents hold value systems separate from those of the language models powering them, and that the shift to agentic execution creates new measurement problems not addressed by existing text-only value tests. It builds Agent-ValueBench to fill this gap, supplying executable environments, value-conflict tasks, and psychologist-curated golden trajectories that let evaluators score real agent behavior rather than static answers. Benchmark runs across models and harnesses reveal broad homogeneity in agent values alongside clear non-additive shifts when harnesses change or skills are embedded. These patterns matter because they point to where control over agent behavior actually resides once models are deployed: in harnesses and skills rather than in the models themselves.

Core claim

The central claim is that agent values diverge from those of the underlying LLM and manifest as a Value Tide of cross-model homogeneity. This tide bends non-additively when different harnesses are applied, and bends more decisively when skills are deliberately embedded. Together, these observations indicate that the practical lever for agent alignment is shifting from model-level and prompt-level methods toward harness alignment and skill steering.

What carries the argument

Agent-ValueBench: a benchmark of 394 executable environments across 16 domains that supplies 4,335 value-conflict tasks covering 28 value systems and 332 dimensions, each task equipped with two pole-aligned golden trajectories scored by a trajectory-level rubric judge.
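The trajectory-level judging described above can be sketched in a few lines. This is a minimal illustration, not the paper's actual rubric: checkpoint matching by exact action name and the symmetric [-1, 1] pole score are assumptions, and all action names are invented.

```python
# Hypothetical sketch of trajectory-level rubric judging: an agent trajectory
# is scored against checkpoints drawn from two pole-aligned golden
# trajectories, and per-checkpoint hits are aggregated into a value score
# in [-1, 1]. All names below are illustrative, not from the paper.

def rubric_score(trajectory, pole_a_checkpoints, pole_b_checkpoints):
    """Return a score in [-1, 1]: -1 means fully pole A, +1 fully pole B."""
    hits_a = sum(1 for cp in pole_a_checkpoints if cp in trajectory)
    hits_b = sum(1 for cp in pole_b_checkpoints if cp in trajectory)
    total = hits_a + hits_b
    if total == 0:
        return 0.0  # no checkpoint matched: the judge abstains
    return (hits_b - hits_a) / total

# A toy value-conflict task: honesty (pole A) vs. loyalty (pole B).
trajectory = ["open_email", "report_error", "notify_user"]
pole_a = ["report_error", "notify_user"]   # honesty-aligned checkpoints
pole_b = ["suppress_error"]                # loyalty-aligned checkpoints
print(rubric_score(trajectory, pole_a, pole_b))  # → -1.0
```

A real judge would match checkpoints semantically and weight them by rubric importance; the point is only that the score is anchored to the two golden trajectories rather than to a free-text verdict.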

Load-bearing premise

The end-to-end synthesis pipeline together with psychologist curation produces tasks and golden trajectories that measure the intended 28 value systems and 332 dimensions without systematic artifacts from generation or expert judgment.

What would settle it

Running the same 14 models on a fresh, independently authored collection of executable value-conflict tasks and finding that agent values match the base LLM values exactly while showing no measurable change under harness swaps or skill insertion would falsify the divergence and non-additive bending claims.

Original abstract

Autonomous agents have rapidly matured as task executors and seen widespread deployment via harnesses such as OpenClaw. Safety concerns have rightly drawn growing research attention, and beneath them lie the values silently steering agent behavior. Existing value benchmarks, however, remain confined to LLMs, leaving agent values largely uncharted. From intuitive, empirical, and theoretical vantage points, we show that an agent's values diverge from those of its underlying LLM, and the agentic modality further introduces dataset-, evaluation-, and system-level challenges absent from text-only protocols. We close this gap with Agent-ValueBench, the first benchmark dedicated to agent values. It features 394 executable environments across 16 domains, offering 4,335 value-conflict tasks that cover 28 value systems and 332 dimensions. Every instance is co-synthesized through our purpose-built end-to-end pipeline and curated per-instance by professional psychologists. Each task ships with two pole-aligned golden trajectories whose checkpoints anchor a trajectory-level rubric-based judge. Benchmarking 14 frontier proprietary and open-weights models across 4 mainstream harnesses, we uncover three concerted findings. Agent values first manifest as a Value Tide of cross-model homogeneity beneath interpretable counter-currents. This tide bends non-additively under harness pull, and yet more decisively under deliberate steering via embedded skills. Together these results signal that the agent-alignment lever is shifting from classical model alignment and prompt steering toward harness alignment and skill steering.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it: the pith above is the substance; this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper introduces Agent-ValueBench as the first dedicated benchmark for evaluating values in autonomous agents. It constructs 4,335 value-conflict tasks across 394 executable environments in 16 domains, covering 28 value systems and 332 dimensions. Tasks are generated via a purpose-built end-to-end synthesis pipeline and per-instance curated by professional psychologists, each accompanied by two pole-aligned golden trajectories for a trajectory-level rubric-based judge. Benchmarking 14 frontier models across 4 harnesses yields three findings: agent values exhibit a 'Value Tide' of cross-model homogeneity distinct from underlying LLMs; this homogeneity bends non-additively under harness influence; and it is further modulated by embedded skill steering. The work argues that these results shift the agent-alignment focus from model-level to harness- and skill-level interventions.

Significance. If the synthesis pipeline and curation produce artifact-free measures, the benchmark would represent a substantial advance in AI safety and alignment research by extending value evaluation beyond text-only LLM protocols to agentic settings. The empirical demonstration of value divergence, cross-model homogeneity, and non-additive modulation by harnesses and skills provides concrete evidence for new alignment levers and highlights dataset-, evaluation-, and system-level challenges unique to agents. The inclusion of executable environments, golden trajectories, and a rubric-based judge supports reproducibility and falsifiability, strengthening the contribution if methodological validity is established.

major comments (3)
  1. [Benchmark Construction] Benchmark Construction section: The end-to-end synthesis pipeline and per-instance psychologist curation are described at a high level but provide no details on validation procedures, inter-rater reliability statistics, or controls for systematic biases (e.g., embedding of LLM priors in task generation or selection effects in the 4,335 tasks). This is load-bearing for the central claims of value divergence and the Value Tide, as any artifact in task instantiation of the 28 value systems would render the homogeneity and steering results non-generalizable.
  2. [Experimental Setup and Results] Experimental Setup and Results section: The reported cross-model homogeneity and non-additive bending under harness pull and skill steering lack explicit statistical controls, significance testing, or ablation details on how dataset-, evaluation-, and system-level factors were isolated. Without these, it is unclear whether the Value Tide findings hold after accounting for potential confounds in the 14-model, 4-harness evaluation.
  3. [Evaluation Methodology] Evaluation Methodology subsection: The trajectory-level rubric-based judge, anchored by golden trajectories, is presented without reported validation against human judgments or sensitivity analysis to variations in harness implementation. This directly affects the reliability of the value measurements underlying all three main findings.
minor comments (2)
  1. [Abstract] Abstract: The novel term 'Value Tide' is used without a concise definition or forward reference, reducing immediate clarity for readers.
  2. [Results] Figure and Table captions: Several result visualizations would benefit from explicit legends distinguishing the 28 value systems and the four harnesses to aid interpretation of the homogeneity patterns.
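For the statistical controls requested in major comment 2, one plausible starting point is a one-way ANOVA over harness groups. The sketch below computes the F-statistic from scratch; the harness score lists are purely illustrative and not taken from the paper.

```python
# A minimal one-way ANOVA F-statistic over harness groups, as one shape the
# requested significance testing could take. Each inner list holds value
# scores for one harness; all numbers are hypothetical.

def one_way_anova_f(groups):
    """F = between-group mean square / within-group mean square."""
    n_total = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n_total
    ss_between = sum(
        len(g) * (sum(g) / len(g) - grand_mean) ** 2 for g in groups
    )
    ss_within = sum(
        sum((x - sum(g) / len(g)) ** 2 for x in g) for g in groups
    )
    df_between = len(groups) - 1
    df_within = n_total - len(groups)
    return (ss_between / df_between) / (ss_within / df_within)

harness_scores = [
    [0.61, 0.58, 0.64],  # harness 1
    [0.72, 0.70, 0.75],  # harness 2
    [0.60, 0.62, 0.59],  # harness 3
]
f_stat = one_way_anova_f(harness_scores)
print(round(f_stat, 2))  # a large F suggests harness-dependent value scores
```

A significant F (checked against the F-distribution with the stated degrees of freedom, e.g. via scipy.stats) would indicate harness-dependent value scores; testing non-additivity would then require an interaction term in a two-way design over models and harnesses.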

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for their constructive feedback and recommendation for major revision. The comments highlight important opportunities to increase methodological transparency. We address each major comment point-by-point below and will incorporate the requested details, statistical analyses, and validation results into the revised manuscript to strengthen the evidence for our findings on agent values.

Point-by-point responses
  1. Referee: [Benchmark Construction] Benchmark Construction section: The end-to-end synthesis pipeline and per-instance psychologist curation are described at a high level but provide no details on validation procedures, inter-rater reliability statistics, or controls for systematic biases (e.g., embedding of LLM priors in task generation or selection effects in the 4,335 tasks). This is load-bearing for the central claims of value divergence and the Value Tide, as any artifact in task instantiation of the 28 value systems would render the homogeneity and steering results non-generalizable.

    Authors: We agree that expanded details on validation and bias controls are necessary to support the benchmark's claims. The manuscript currently summarizes the pipeline and per-instance psychologist curation at a high level for conciseness. In the revision, we will substantially expand the Benchmark Construction section to report inter-rater reliability statistics (e.g., Fleiss' kappa across psychologists for value dimension assignments and task curation), explicit procedures for mitigating LLM priors (including multi-stage human-only review of generated tasks), and analysis of selection effects to confirm balanced coverage of the 28 value systems. Full curation protocols and bias-control documentation will be added to the supplementary materials. revision: yes

  2. Referee: [Experimental Setup and Results] Experimental Setup and Results section: The reported cross-model homogeneity and non-additive bending under harness pull and skill steering lack explicit statistical controls, significance testing, or ablation details on how dataset-, evaluation-, and system-level factors were isolated. Without these, it is unclear whether the Value Tide findings hold after accounting for potential confounds in the 14-model, 4-harness evaluation.

    Authors: We acknowledge that additional statistical rigor will better isolate the Value Tide from potential confounds. The current results demonstrate consistent patterns across 14 models and 4 harnesses, but we will revise the Experimental Setup and Results section to include formal significance testing (e.g., ANOVA with post-hoc tests and reported p-values/effect sizes for cross-model homogeneity), ablation studies that systematically vary dataset, evaluation, and system factors, and controls for confounds such as model scale or harness-specific artifacts. Updated figures and tables will show that the homogeneity and non-additive bending persist after these adjustments. revision: yes

  3. Referee: [Evaluation Methodology] Evaluation Methodology subsection: The trajectory-level rubric-based judge, anchored by golden trajectories, is presented without reported validation against human judgments or sensitivity analysis to variations in harness implementation. This directly affects the reliability of the value measurements underlying all three main findings.

    Authors: We appreciate this point on evaluation reliability. The golden trajectories anchor the rubric-based judge, but external validation was not detailed in the original submission. In the revised manuscript, we will augment the Evaluation Methodology subsection with a human validation study on a stratified subset of trajectories (reporting agreement metrics such as Cohen's kappa between the automated judge and professional evaluators) and a sensitivity analysis testing variations in harness implementations (e.g., alternative prompt templates and environment configurations). Full results and methodology for these checks will appear in the main text and appendix. revision: yes
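Responses 1 and 3 both turn on agreement statistics. A minimal sketch of the two proposed metrics, with hypothetical counts and labels throughout: Fleiss' kappa for agreement among several psychologists, and Cohen's kappa for agreement between the automated judge and a human evaluator.

```python
# Illustrative agreement statistics for the proposed validation studies.
# fleiss_kappa: agreement among n raters per subject (psychologist curation);
# cohens_kappa: agreement between two raters (rubric judge vs. human).
# All counts and labels below are hypothetical.

def fleiss_kappa(ratings):
    """ratings[i][j]: how many raters put subject i into category j."""
    n_subjects = len(ratings)
    n_raters = sum(ratings[0])  # assumed constant across subjects
    # Mean per-subject agreement P_bar.
    p_bar = sum(
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in ratings
    ) / n_subjects
    # Chance agreement P_e from the marginal category proportions.
    p_e = sum(
        (sum(row[j] for row in ratings) / (n_subjects * n_raters)) ** 2
        for j in range(len(ratings[0]))
    )
    return (p_bar - p_e) / (1 - p_e)

def cohens_kappa(labels_a, labels_b):
    """Chance-corrected agreement between two label sequences."""
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    expected = sum(
        (labels_a.count(c) / n) * (labels_b.count(c) / n)
        for c in set(labels_a) | set(labels_b)
    )
    return (observed - expected) / (1 - expected)

# Three psychologists assign four tasks to one of two value dimensions.
print(round(fleiss_kappa([[3, 0], [0, 3], [2, 1], [3, 0]]), 3))  # → 0.625
# Judge vs. human pole labels on six trajectories.
judge = ["pole_a", "pole_b", "pole_a", "pole_a", "pole_b", "pole_a"]
human = ["pole_a", "pole_b", "pole_a", "pole_b", "pole_b", "pole_a"]
print(round(cohens_kappa(judge, human), 3))  # → 0.667
```

In the actual study the categories would be the 28 value systems (or 332 dimensions) rather than two poles, and agreement would be reported per stratum of the stratified subset.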

Circularity Check

0 steps flagged

No circularity detected: the benchmark's construction and the reported observations do not presuppose the conclusions drawn from them.

Full rationale

The paper introduces Agent-ValueBench through an end-to-end synthesis pipeline and per-instance psychologist curation to generate 4,335 tasks covering 28 value systems, then reports direct empirical results from running 14 models across 4 harnesses. No equations, parameter fitting, derivations, or self-referential predictions appear in the presented claims. The Value Tide homogeneity, non-additive harness effects, and skill steering findings are observational outputs from the benchmark rather than quantities forced by construction or prior self-citations. The construction pipeline is described as purpose-built and externally curated, with no reduction of results to the generation process itself.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claims rest on domain assumptions about value measurability via trajectories and standard psychological value frameworks; no free parameters or invented entities are introduced in the abstract.

axioms (2)
  • domain assumption Agent values can be reliably measured and distinguished through choices in executable environments using golden trajectories and rubric-based judging.
    Invoked in the benchmark design and evaluation protocol described in the abstract.
  • domain assumption Professional psychologist curation ensures validity of value-conflict tasks across 28 systems and 332 dimensions.
    Stated as part of the co-synthesis and curation process.

pith-pipeline@v0.9.0 · 5570 in / 1363 out tokens · 32795 ms · 2026-05-12T04:18:10.344344+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
  • matches: The paper's claim is directly supported by a theorem in the formal canon.
  • supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
  • extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
  • uses: The paper appears to rely on the theorem as machinery.
  • contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
  • unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

203 extracted references · 203 canonical work pages · 21 internal anchors

  1. [1]

    Large Language Model Agent: A Survey on Methodology, Applications and Challenges

    Junyu Luo, Weizhi Zhang, Ye Yuan, Yusheng Zhao, Junwei Yang, Yiyang Gu, Bohan Wu, Binqi Chen, Ziyue Qiao, Qingqing Long, Rongcheng Tu, Xiao Luo, Wei Ju, Zhiping Xiao, Yifan Wang, Meng Xiao, Chenwu Liu, Jingyang Yuan, Shichang Zhang, Yiqiao Jin, Fan Zhang, Xian Wu, Hanqing Zhao, Dacheng Tao, Philip S. Yu, and Ming Zhang. Large language model agent: A surve...

  2. [2]

    The landscape of agentic reinforcement learning for llms: A survey.Trans

    Guibin Zhang, Hejia Geng, Xiaohang Yu, Zhenfei Yin, Zaibin Zhang, Zelin Tan, Heng Zhou, Zhong-Zhi Li, Xiangyuan Xue, Yijiang Li, Yifan Zhou, Yang Chen, Chen Zhang, Yutao Fan, Zihu Wang, Songtao Huang, Francisco Piedrahita Velez, Yue Liao, Hongru Wang, Mengyue Yang, Heng Ji, Jun Wang, Shuicheng Yan, Philip Torr, and Lei Bai. The landscape of agentic reinfo...

  3. [3]

    and Peng, Y

    Jinyuan Fang, Yanwen Peng, Xi Zhang, Yingxu Wang, Xinhao Yi, Guibin Zhang, Yi Xu, Bin Wu, Siwei Liu, Zihao Li, Zhaochun Ren, Nikos Aletras, Xi Wang, Han Zhou, and Zaiqiao Meng. A comprehensive survey of self-evolving AI agents: A new paradigm bridging foundation models andlifelongagenticsystems.CoRR,abs/2508.07407, 2025. doi: 10.48550/ARXIV.2508.07407. UR...

  4. [4]

    Narasimhan, and Yuan Cao

    Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik R. Narasimhan, and Yuan Cao. React: Synergizing reasoning and acting in language models. InThe Eleventh International Conference on Learning Representations, ICLR 2023, Kigali, Rwanda, May 1-5,

  5. [5]

    URLhttps://openreview.net/forum?id=WE_vluYUL-X

    OpenReview.net, 2023. URLhttps://openreview.net/forum?id=WE_vluYUL-X

  6. [6]

    Voyager: An open-ended embodied agent with large language models.Trans

    Guanzhi Wang, Yuqi Xie, Yunfan Jiang, Ajay Mandlekar, Chaowei Xiao, Yuke Zhu, Linxi Fan, and Anima Anandkumar. Voyager: An open-ended embodied agent with large language models.Trans. Mach. Learn. Res., 2024, 2024. URLhttps://openreview.net/forum?i d=ehfRiF0R3a

  7. [7]

    Meta context engineering via agentic skill evolution.arXiv preprint arXiv:2601.21557, 2026

    HaoranYe,XuningHe,VincentArak,HaonanDong,andGuojieSong. Metacontextengineering via agentic skill evolution.CoRR, abs/2601.21557, 2026. doi: 10.48550/ARXIV.2601.21557. URLhttps://doi.org/10.48550/arXiv.2601.21557

  8. [8]

    Openclaw

    OpenClaw. Openclaw. https://github.com/openclaw/openclaw, 2026. Open-source personal AI assistant, version 2026.3.8, accessed 2026-03-09

  9. [9]

    OpenClaw-RL: Train Any Agent Simply by Talking

    Yinjie Wang, Xuyang Chen, Xiaolong Jin, Mengdi Wang, and Ling Yang. Openclaw-rl: Train any agent simply by talking.CoRR, abs/2603.10165, 2026. doi: 10.48550/ARXIV.2603.10165. URLhttps://doi.org/10.48550/arXiv.2603.10165

  10. [10]

    Metaclaw: Just talk–an agent that meta-learns and evolves in the wild.arXiv preprint arXiv:2603.17187, 2026b

    Peng Xia, Jianwen Chen, Xinyu Yang, Haoqin Tu, Jiaqi Liu, Kaiwen Xiong, Siwei Han, Shi Qiu, Haonian Ji, Yuyin Zhou, Zeyu Zheng, Cihang Xie, and Huaxiu Yao. Metaclaw: Just talk - an agent that meta-learns and evolves in the wild.CoRR, abs/2603.17187, 2026. doi: 10.48550/ARXIV.2603.17187. URLhttps://doi.org/10.48550/arXiv.2603.17187

  11. [11]

    SkillClaw: Let Skills Evolve Collectively with Agentic Evolver

    Ziyu Ma, Shidong Yang, Yuxiang Ji, Xucong Wang, Yong Wang, Yiming Hu, Tongwen Huang, and Xiangxiang Chu. Skillclaw: Let skills evolve collectively with agentic evolver, 2026. URL https://arxiv.org/abs/2604.08377

  12. [12]

    Deceptionbench: A comprehensive benchmark for AI deception behaviors in real-world scenarios.CoRR, abs/2510.15501, 2025

    Yao Huang, Yitong Sun, Yichi Zhang, Ruochen Zhang, Yinpeng Dong, and Xingxing Wei. Deceptionbench: A comprehensive benchmark for AI deception behaviors in real-world scenarios.CoRR, abs/2510.15501, 2025. doi: 10.48550/ARXIV.2510.15501. URL https://doi.org/10.48550/arXiv.2510.15501

  13. [13]

    Agents of Chaos

    Natalie Shapira, Chris Wendler, Avery Yen, Gabriele Sarti, Koyena Pal, Olivia Floody, Adam Belfki, Alex Loftus, Aditya Ratan Jannali, Nikhil Prakash, Jasmine Cui, Giordano Rogers, Jannik Brinkmann, Can Rager, Amir Zur, Michael Ripa, Aruna Sankaranarayanan, David Atkinson, Rohit Gandikota, Jaden Fiotto-Kaufman, EunJeong Hwang, Hadas Orgad, P Sam Sahil, Neg...

  14. [14]

    Uncovering Security Threats and Architecting Defenses in Autonomous Agents,

    Zonghao Ying, Xiao Yang, Siyang Wu, Yumeng Song, Yang Qu, Hainan Li, Tianlin Li, Jiakai Wang, Aishan Liu, and Xianglong Liu. Uncovering security threats and architecting defenses in autonomous agents: A case study of openclaw.CoRR, abs/2603.12644, 2026. doi: 10.485 50/ARXIV.2603.12644. URLhttps://doi.org/10.48550/arXiv.2603.12644

  15. [15]

    Don’t let the claw grip your hand: A security analysis and defense framework for openclaw.arXivpreprintarXiv:2603.10387, 2026

    Zhengyang Shan, Jiayun Xin, Yue Zhang, and Minghui Xu. Don’t let the claw grip your hand: A security analysis and defense framework for openclaw.CoRR, abs/2603.10387, 2026. doi: 10.48550/ARXIV.2603.10387. URLhttps://doi.org/10.48550/arXiv.2603.10387. 12 Agent-ValueBench: A Comprehensive Benchmark for Evaluating Agent Values

  16. [16]

    An overview of the schwartz theory of basic values.Online readings in Psychology and Culture, 2(1), 2012

    Shalom H Schwartz. An overview of the schwartz theory of basic values.Online readings in Psychology and Culture, 2(1), 2012

  17. [17]

    Values and behavior: Strength and structure of relations

    Anat Bardi and Shalom H Schwartz. Values and behavior: Strength and structure of relations. Personality and social psychology bulletin, 29(10):1207–1220, 2003

  18. [18]

    Artificial Intelligence , Values and Alignment

    Iason Gabriel. Artificial intelligence, values, and alignment.Minds Mach., 30(3):411–437, September 2020. ISSN 0924-6495. doi: 10.1007/s11023-020-09539-2. URL https: //doi.org/10.1007/s11023-020-09539-2

  19. [19]

    The alignment problem from a deep learning perspective

    Richard Ngo, Lawrence Chan, and Sören Mindermann. The alignment problem from a deep learning perspective. InThe Twelfth International Conference on Learning Representations, ICLR 2024, Vienna, Austria, May 7-11, 2024. OpenReview.net, 2024. URLhttps://openreview .net/forum?id=fh8EYKFKns

  20. [20]

    Valuebench: Towards comprehensively evaluating value orientations and understanding of large language models

    Yuanyi Ren, Haoran Ye, Hanjun Fang, Xin Zhang, and Guojie Song. Valuebench: Towards comprehensively evaluating value orientations and understanding of large language models. In Lun-Wei Ku, Andre Martins, and Vivek Srikumar, editors,Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), ACL 2024, Ba...

  21. [21]

    Value compass benchmarks: A comprehensive, generative and self-evolving platform for LLMs’ value evaluation

    Jing Yao, Xiaoyuan Yi, Shitong Duan, Jindong Wang, Yuzhuo Bai, Muhua Huang, Yang Ou, Scarlett Li, Peng Zhang, Tun Lu, Zhicheng Dou, Maosong Sun, James Evans, and Xing Xie. Value compass benchmarks: A comprehensive, generative and self-evolving platform for LLMs’ value evaluation. In Pushkar Mishra, Smaranda Muresan, and Tao Yu, editors,Proceedings of the ...

  22. [22]

    Memory in the Age of AI Agents

    Yuyang Hu, Shichun Liu, Yanwei Yue, Guibin Zhang, Boyang Liu, Fangyi Zhu, Jiahang Lin, Honglin Guo, Shihan Dou, Zhiheng Xi, Senjie Jin, Jiejun Tan, Yanbin Yin, Jiongnan Liu, Zeyu Zhang, Zhongxiang Sun, Yutao Zhu, Hao Sun, Boci Peng, Zhenrong Cheng, Xuanbo Fan, Jiaxin Guo, Xinlei Yu, Zhenhong Zhou, Zewen Hu, Jiahao Huo, Junhao Wang, Yuwei Niu, Yu Wang, Zhe...

  23. [23]

    Siegel, Nitya Nadgir, and Arvind Narayanan

    Sayash Kapoor, Benedikt Stroebl, Zachary S. Siegel, Nitya Nadgir, and Arvind Narayanan. AI agents that matter.Trans. Mach. Learn. Res., 2025, 2025. URLhttps://openreview.net /forum?id=Zy4uFzMviZ

  24. [24]

    A survey on large language model based autonomous agents , volume =

    Lei Wang, Chen Ma, Xueyang Feng, Zeyu Zhang, Hao Yang, Jingsen Zhang, Zhiyuan Chen, Jiakai Tang, Xu Chen, Yankai Lin, Wayne Xin Zhao, Zhewei Wei, and Jirong Wen. A survey on large language model based autonomous agents.Frontiers Comput. Sci., 18(6):186345, 2024. doi: 10.1007/S11704-024-40231-1. URL https://doi.org/10.1007/s11704-024-402 31-1. 13 Agent-Val...

  25. [25]

    Mem-t: Densifying rewards for long-horizon memory agents, 2026

    YanweiYue,BociPeng,XuanboFan,JiaxinGuo,QiankunLi,andYanZhang. Mem-t: Densifying rewards for long-horizon memory agents, 2026. URLhttps://arxiv.org/abs/2601.230 14

  26. [26]

    Masrouter: Learningtoroutellmsformulti-agentsystems

    Yanwei Yue, Guibin Zhang, Boyang Liu, Guancheng Wan, Kun Wang, Dawei Cheng, and Yiyan Qi. Masrouter: Learningtoroutellmsformulti-agentsystems. InWanxiangChe,JoyceNabende, Ekaterina Shutova, and Mohammad Taher Pilehvar, editors,Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), ACL 2025, Vienna,...

  27. [27]

    Aurora: Breaking low-rank bottleneck of lora with nonlinear mapping

    Haonan Dong, Wenhao Zhu, Guojie Song, and Liang Wang. Aurora: Breaking low-rank bottleneck of lora with nonlinear mapping. In D. Belgrave, C. Zhang, H. Lin, R. Pascanu, P. Koniusz, M. Ghassemi, and N. Chen, editors,Advances in Neural Information Processing Systems, volume 38, pages 36929–36961. Curran Associates, Inc., 2025. URLhttps://proc eedings.neurip...

  28. [28]

    NeuReasoner: Towards Explainable, Controllable, and Unified Reasoning via Mixture-of-Neurons

    Haonan Dong, Kehan Jiang, Haoran Ye, Wenhao Zhu, Zhaolu Kang, and Guojie Song. Neurea- soner: Towards explainable, controllable, and unified reasoning via mixture-of-neurons, 2026. URLhttps://arxiv.org/abs/2604.02972

  29. [29]

    FoE: Forest of Errors Makes the First Solution the Best in Large Reasoning Models

    Kehan Jiang, Haonan Dong, Zhaolu Kang, Zhengzhou Zhu, and Guojie Song. Foe: Forest of errors makes the first solution the best in large reasoning models, 2026. URLhttps: //arxiv.org/abs/2604.02967

  30. [30]

    Hua Shen, Nicholas Clark, and Tanu Mitra. Mind the value-action gap: Do llms act in alignment with their values? In Christos Christodoulopoulos, Tanmoy Chakraborty, Carolyn Rose, and Violet Peng, editors,Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, EMNLP 2025, Suzhou, China, November 4-9, 2025, pages 3097–3118. A...

  31. [31]

    Diab, Daniel Fried, Atoosa Kasirzadeh, and Max Kleiman- Weiner

    Andy Liu, Kshitish Ghate, Mona T. Diab, Daniel Fried, Atoosa Kasirzadeh, and Max Kleiman- Weiner. Generative value conflicts reveal LLM priorities. InThe Fourteenth International Conference on Learning Representations, 2026. URLhttps://openreview.net/forum?i d=RXCRKAcv3B

  32. [32]

    Xu, Hao Zhu, Xuhui Zhou, Robert Lo, Abishek Sridhar, Xianyi Cheng, Tianyue Ou, Yonatan Bisk, Daniel Fried, Uri Alon, and Graham Neubig

    Shuyan Zhou, Frank F. Xu, Hao Zhu, Xuhui Zhou, Robert Lo, Abishek Sridhar, Xianyi Cheng, Tianyue Ou, Yonatan Bisk, Daniel Fried, Uri Alon, and Graham Neubig. Webarena: A realistic web environment for building autonomous agents. InThe Twelfth International Conference on Learning Representations, ICLR 2024, Vienna, Austria, May 7-11, 2024. OpenReview.net, 2...

  33. [33]

    Appworld: A controllable world of apps and people for benchmarking interactive coding agents

    Harsh Trivedi, Tushar Khot, Mareike Hartmann, Ruskin Manku, Vinty Dong, Edward Li, Shashank Gupta, Ashish Sabharwal, and Niranjan Balasubramanian. Appworld: A controllable world of apps and people for benchmarking interactive coding agents. In Lun-Wei Ku, Andre Martins, and Vivek Srikumar, editors,Proceedings of the 62nd Annual Meeting of the Association ...

  34. [34]

    $\tau$-bench: A Benchmark for Tool-Agent-User Interaction in Real-World Domains

    Shunyu Yao, Noah Shinn, Pedram Razavi, and Karthik Narasimhan.𝜏-bench: A benchmark for tool-agent-user interaction in real-world domains.CoRR, abs/2406.12045, 2024. doi: 10.48550/ARXIV.2406.12045. URLhttps://doi.org/10.48550/arXiv.2406.12045

  35. [35]

    Agentboard: An analytical evaluation board of multi- turn llm agents

    Chang Ma, Junlei Zhang, Zhihao Zhu, Cheng Yang, Yujiu Yang, Yaohui Jin, Zhenzhong Lan, Lingpeng Kong, and Junxian He. Agentboard: An analytical evaluation board of multi- turn llm agents. In A. Globerson, L. Mackey, D. Belgrave, A. Fan, U. Paquet, J. Tomczak, and C. Zhang, editors,Advances in Neural Information Processing Systems, volume 37, pages 74325–7...

  36. [36]

    Traject-bench:a trajectory-aware benchmark for evaluating agentic tool use, 2025

    Pengfei He, Zhenwei Dai, Bing He, Hui Liu, Xianfeng Tang, Hanqing Lu, Juanhui Li, Jiayuan Ding, Subhabrata Mukherjee, Suhang Wang, Yue Xing, Jiliang Tang, and Benoit Dumoulin. Traject-bench:a trajectory-aware benchmark for evaluating agentic tool use, 2025. URL https://arxiv.org/abs/2510.04550

  37. [37]

    Measuring human and AI values based on generative psychometrics with large language models

    Haoran Ye, Yuhang Xie, Yuanyi Ren, Hanjun Fang, Xin Zhang, and Guojie Song. Measuring human and AI values based on generative psychometrics with large language models. In Toby Walsh, JulieShah, andZicoKolter, editors,Thirty-NinthAAAIConferenceonArtificialIntelligence, Thirty-SeventhConferenceonInnovativeApplicationsofArtificialIntelligence,FifteenthSympos...

  38. [38]

    Generative psycho-lexical approach for constructing value systems in large language models

    Haoran Ye, Tianze Zhang, Yuhang Xie, Liyuan Zhang, Yuanyi Ren, Xin Zhang, and Guojie Song. Generative psycho-lexical approach for constructing value systems in large language models. In Wanxiang Che, Joyce Nabende, Ekaterina Shutova, and Mohammad Taher Pilehvar, editors, Proceedings of the 63rd Annual Meeting of the Association for Computational Linguisti...

  39. [39]

    ClawSafety: "Safe" LLMs, Unsafe Agents

    Bowen Wei, Yunbei Zhang, Jinhao Pan, Kai Mei, Xiao Wang, Jihun Hamm, Ziwei Zhu, and Yingqiang Ge. Clawsafety: "safe" llms, unsafe agents, 2026. URLhttps://arxiv.org/ab s/2604.01438

  40. [40]

    ClawsBench: Evaluating Capability and Safety of LLM Productivity Agents in Simulated Workspaces

    Xiangyi Li, Kyoung Whan Choe, Yimin Liu, Xiaokun Chen, Chujun Tao, Bingran You, Wenbo Chen, Zonglin Di, Jiankai Sun, Shenghan Zheng, Jiajun Bao, Yuanli Wang, Weixiang Yan, YiyuanLi, andHanchungLee. Clawsbench: Evaluatingcapabilityandsafetyofllmproductivity agents in simulated workspaces, 2026. URLhttps://arxiv.org/abs/2604.05172

  41. [41]

    Jimenez, Alexander Wettig, Kilian Lieret, Shunyu Yao, Karthik Narasimhan, and Ofir Press

    John Yang, Carlos Jimenez, Alexander Wettig, Kilian Lieret, Shunyu Yao, Karthik Narasimhan, and Ofir Press. Swe-agent: Agent-computer interfaces enable automated software engineering. In A. Globerson, L. Mackey, D. Belgrave, A. Fan, U. Paquet, J. Tomczak, and C. Zhang, editors, Advances in Neural Information Processing Systems, volume 37, pages 50528–5065...

  42. [42]

    Meta-Harness: End-to-End Optimization of Model Harnesses

    Yoonho Lee, Roshen Nair, Qizheng Zhang, Kangwook Lee, Omar Khattab, and Chelsea Finn. Meta-harness: End-to-end optimization of model harnesses, 2026. URLhttps://arxiv.or g/abs/2603.28052. 15 Agent-ValueBench: A Comprehensive Benchmark for Evaluating Agent Values

  43. [43]

    Natural-Language Agent Harnesses

    Linyue Pan, Lexiao Zou, Shuo Guo, Jingchen Ni, and Haitao Zheng. Natural-language agent harnesses.CoRR, abs/2603.25723, 2026. doi: 10.48550/ARXIV.2603.25723. URLhttps: //doi.org/10.48550/arXiv.2603.25723

  44. [44]

    Universals in the content and structure of values: Theoretical advances and empirical tests in 20 countries

    Shalom H Schwartz. Universals in the content and structure of values: Theoretical advances and empirical tests in 20 countries. InAdvances in experimental social psychology, volume 25, pages 1–65. Elsevier, 1992

  45. [45]

    Deontological and utilitarian inclinations in moral decision making: a process dissociation approach.Journal of personality and social psychology, 104(2):216, 2013

    Paul Conway and Bertram Gawronski. Deontological and utilitarian inclinations in moral decision making: a process dissociation approach.Journal of personality and social psychology, 104(2):216, 2013

  46. [46]

    Largelanguagemodelpsychomet- rics: A systematic review of evaluation, validation, and enhancement.CoRR, abs/2505.08245,

    HaoranYe,JingJin,YuhangXie,XinZhang,andGuojieSong. Largelanguagemodelpsychomet- rics: A systematic review of evaluation, validation, and enhancement.CoRR, abs/2505.08245,

  47. [47]
  48. [48]

    Xiaoshuai Song, Haofei Chang, Guanting Dong, Yutao Zhu, Zhicheng Dou, and Ji-Rong Wen. Envscaler: Scaling tool-interactive environments for LLM agent via programmatic synthesis. CoRR, abs/2601.05808, 2026. doi: 10.48550/ARXIV.2601.05808. URL https://doi.org/10.48550/arXiv.2601.05808

  49. [49]

    Yujia Qin, Shihao Liang, Yining Ye, Kunlun Zhu, Lan Yan, Yaxi Lu, Yankai Lin, Xin Cong, Xiangru Tang, Bill Qian, Sihan Zhao, Lauren Hong, Runchu Tian, Ruobing Xie, Jie Zhou, Mark Gerstein, Dahai Li, Zhiyuan Liu, and Maosong Sun. Toolllm: Facilitating large language models to master 16000+ real-world apis. In The Twelfth International Conference on Learning...

  50. [51]

    Weiwen Liu, Xu Huang, Xingshan Zeng, Xinlong Hao, Shuai Yu, Dexun Li, Shuai Wang, Weinan Gan, Zhengying Liu, Yuanqing Yu, Zezhong Wang, Yuxian Wang, Wu Ning, Yutai Hou, Bin Wang, Chuhan Wu, Xinzhi Wang, Yong Liu, Yasheng Wang, Duyu Tang, Dandan Tu, Lifeng Shang, Xin Jiang, Ruiming Tang, Defu Lian, Qun Liu, and Enhong Chen. Toolace: Winning the points of L...

  51. [52]

    Qiaoyu Tang, Ziliang Deng, Hongyu Lin, Xianpei Han, Qiao Liang, Boxi Cao, and Le Sun. Toolalpaca: Generalized tool learning for language models with 3000 simulated cases, 2023. URL https://arxiv.org/abs/2306.05301

  52. [53]

    Zhexin Zhang, Shiyao Cui, Yida Lu, Jingzhuo Zhou, Junxiao Yang, Hongning Wang, and Minlie Huang. Agent-safetybench: Evaluating the safety of LLM agents. CoRR, abs/2412.14470, 2024. doi: 10.48550/ARXIV.2412.14470. URL https://doi.org/10.48550/arXiv.2412.14470
  54. [55]

    Maksym Andriushchenko, Alexandra Souly, Mateusz Dziemian, Derek Duenas, Maxwell Lin, Justin Wang, Dan Hendrycks, Andy Zou, J. Zico Kolter, Matt Fredrikson, Yarin Gal, and Xander Davies. Agentharm: A benchmark for measuring harmfulness of LLM agents. In The Thirteenth International Conference on Learning Representations, ICLR 2025, Singapore, April 24-28, 2...

  55. [56]

    Guibin Zhang, Haonan Dong, Yuchen Zhang, Zhixun Li, Dingshuo Chen, Kai Wang, Tianlong Chen, Yuxuan Liang, Dawei Cheng, and Kun Wang. Gder: Safeguarding efficiency, balancing, and robustness via prototypical graph pruning. In A. Globerson, L. Mackey, D. Belgrave, A. Fan, U. Paquet, J. Tomczak, and C. Zhang, editors, Advances in Neural Information Processing Systems,...

  56. [57]

    Jie Yang, Kexin Zhang, Guibin Zhang, Philip S Yu, and Kaize Ding. Glocal information bottleneck for time series imputation. In D. Belgrave, C. Zhang, H. Lin, R. Pascanu, P. Koniusz, M. Ghassemi, and N. Chen, editors, Advances in Neural Information Processing Systems, volume 38, pages 104452–104484. Curran Associates, Inc., 2025. URL https://proceedings.neurips.cc/paper_files/paper/20...

  57. [58]

    Ruipeng Jia, Yunyi Yang, Yuxin Wu, Yongbo Gai, Siyuan Tao, Mengyu Zhou, Jianhe Lin, Xiaoxi Jiang, and Guanjun Jiang. Open rubric system: Scaling reinforcement learning with pairwise adaptive rubric. CoRR, abs/2602.14069, 2026. doi: 10.48550/ARXIV.2602.14069. URL https://doi.org/10.48550/arXiv.2602.14069

  58. [59]

    Lilach Sagiv and Sonia Roccas. How do values affect behavior? let me count the ways. Personality and Social Psychology Review, 25(4):295–316, 2021

  59. [60]

    Bas Verplanken and Rob W Holland. Motivated decision making: effects of activation and self-centrality of values on choices and behavior. Journal of personality and social psychology, 82(3):434, 2002

  60. [61]

    Rick Jacobs, Ditsa Kafry, and Sheldon Zedeck. Expectations of behaviorally anchored rating scales. Personnel psychology, 33(3):595–640, 1980

  61. [62]

    Samuel Messick. Validity of psychological assessment: Validation of inferences from persons’ responses and performances as scientific inquiry into score meaning. American psychologist, 50(9):741, 1995

  62. [63]

    Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zi Lin, Zhuohan Li, Dacheng Li, Eric P. Xing, Hao Zhang, Joseph E. Gonzalez, and Ion Stoica. Judging llm-as-a-judge with mt-bench and chatbot arena. In Alice Oh, Tristan Naumann, Amir Globerson, Kate Saenko, Moritz Hardt, and Sergey Levine, editors, Advances in Neural In...

  63. [64]

    Anthropic. Claude Haiku 4.5. https://www.anthropic.com/news/claude-haiku-4-5, October 2025. Accessed: 2026-04-30

  64. [65]

    Anthropic. Claude Sonnet 4.6. https://www.anthropic.com/news/claude-sonnet-4-6, February 2026. Accessed: 2026-04-30

  65. [66]

    Google. Gemini 3 Flash. https://blog.google/products-and-platforms/products/gemini/gemini-3-flash/, December 2025. Accessed: 2026-04-30

  66. [67]

    Google DeepMind. Gemini 3.1 Pro Model Card. https://deepmind.google/models/model-cards/gemini-3-1-pro/, February 2026. Accessed: 2026-04-30

  67. [68]

    OpenAI. GPT-5.4 Thinking System Card. https://openai.com/index/gpt-5-4-thinking-system-card/, March 2026. Accessed: 2026-04-30

  68. [69]

    OpenAI. Introducing GPT-5.4 mini and nano. https://openai.com/index/introducing-gpt-5-4-mini-and-nano/, March 2026. Accessed: 2026-04-30

  69. [70]

    xAI. Grok 4.20. https://docs.x.ai/developers/models, 2026. Accessed: 2026-04-30

  70. [71]

    DeepSeek-AI. Deepseek-v3.2: Pushing the frontier of open large language models. CoRR, abs/2512.02556, 2025. doi: 10.48550/ARXIV.2512.02556. URL https://doi.org/10.48550/arXiv.2512.02556

  71. [72]

    GLM-5 Team. GLM-5: from vibe coding to agentic engineering. CoRR, abs/2602.15763, 2026. doi: 10.48550/ARXIV.2602.15763. URL https://doi.org/10.48550/arXiv.2602.15763

  72. [73]

    Kimi Team. Kimi K2.5: visual agentic intelligence. CoRR, abs/2602.02276, 2026. doi: 10.48550/ARXIV.2602.02276. URL https://doi.org/10.48550/arXiv.2602.02276

  73. [74]

    Llama Team. The llama 3 herd of models. CoRR, abs/2407.21783, 2024. doi: 10.48550/ARXIV.2407.21783. URL https://doi.org/10.48550/arXiv.2407.21783

  74. [75]

    MiniMax. MiniMax M2.7: Early echoes of self-evolution. https://www.minimax.io/news/minimax-m27-en, March 2026. Accessed: 2026-04-30

  75. [76]

    Qwen Team. Qwen3 technical report. CoRR, abs/2505.09388, 2025. doi: 10.48550/ARXIV.2505.09388. URL https://doi.org/10.48550/arXiv.2505.09388

  76. [77]

    Qwen Team. Qwen3.5: Towards native multimodal agents, February 2026. URL https://qwen.ai/blog?id=qwen3.5

  77. [78]

    OpenAI. Introducing Codex. https://openai.com/index/introducing-codex/, May. Accessed: 2026-04-30
  79. [80]

    Anthropic. Claude Code by Anthropic. https://www.anthropic.com/product/claude-code, 2026. Accessed: 2026-04-30

  80. [81]

    Jifan Zhang, Henry Sleight, Andi Peng, John Schulman, and Esin Durmus. Stress-testing model specs reveals character differences among language models. CoRR, abs/2510.07686, 2025. doi: 10.48550/ARXIV.2510.07686. URL https://doi.org/10.48550/arXiv.2510.07686
