ClawEnvKit: Automatic Environment Generation for Claw-Like Agents
Pith reviewed 2026-05-10 04:21 UTC · model grok-4.3
The pith
ClawEnvKit automates generation of diverse verified environments for claw-like agents from natural language descriptions.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
ClawEnvKit is an autonomous pipeline that converts natural language descriptions into verified environments through a parser that extracts structured generation parameters, a generator that produces task specifications, tool interfaces, and scoring configurations, and a validator that enforces feasibility, diversity, structural validity, and internal consistency. This produces Auto-ClawEval, the first large-scale benchmark for claw-like agents with 1,040 environments in 24 categories. The benchmark matches or exceeds human-curated environments on coherence and clarity at 13,800x lower cost. Across 4 model families and 8 agent harness frameworks, harness engineering improves performance by up to 15.7 percentage points over a bare ReAct baseline.
What carries the argument
The three-module ClawEnvKit pipeline: a parser extracting parameters from natural language, a generator creating task specifications with tools and scoring, and a validator enforcing feasibility, diversity, structural validity, and internal consistency.
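To make the parser, generator, and validator split concrete, here is a minimal Python sketch of how three such stages could be chained. It is not the paper's implementation: every name (`GenerationParams`, `Environment`, `parse`, `generate`, `validate`, `build_environment`), every field, and each toy check is a hypothetical stand-in chosen for illustration, and a real parser or generator would presumably rely on a language model rather than string matching.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class GenerationParams:
    """Structured parameters extracted by the parser (hypothetical fields)."""
    category: str
    num_tools: int


@dataclass
class Environment:
    """One generated environment: task spec, tool interface, scoring config."""
    task_spec: str
    tools: list[str]
    scoring: dict[str, float]


def parse(description: str) -> GenerationParams:
    # Toy stand-in for the parser: keyword matching instead of a model call.
    category = "email" if "email" in description.lower() else "general"
    return GenerationParams(category=category, num_tools=2)


def generate(params: GenerationParams) -> Environment:
    # Toy stand-in for the generator: fills a fixed template.
    tools = [f"{params.category}_tool_{i}" for i in range(params.num_tools)]
    return Environment(
        task_spec=f"Complete a {params.category} task using the provided tools.",
        tools=tools,
        scoring={"completion": 1.0},
    )


def validate(env: Environment, seen_specs: set[str]) -> list[str]:
    """Return the list of problems found; an empty list means the check passes."""
    problems = []
    if not env.tools:
        problems.append("feasibility: no tools available")
    if env.task_spec in seen_specs:
        problems.append("diversity: duplicate task specification")
    if not env.task_spec.strip():
        problems.append("structural validity: empty task specification")
    if abs(sum(env.scoring.values()) - 1.0) > 1e-9:
        problems.append("internal consistency: scoring weights do not sum to 1")
    return problems


def build_environment(description: str, seen_specs: set[str]) -> Environment | None:
    """Run parse -> generate -> validate; return None if validation fails."""
    env = generate(parse(description))
    if validate(env, seen_specs):
        return None
    seen_specs.add(env.task_spec)
    return env
```

In this sketch the validator is the only gate between generation and the benchmark, which mirrors why the review treats its reliability as the premise everything else rests on.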
If this is right
- Harness engineering boosts agent performance by up to 15.7 percentage points over a bare ReAct baseline.
- Completion remains the primary axis of variation, with no model saturating the benchmark.
- Evaluation becomes feasible at scales previously impossible due to manual curation costs.
- Users can obtain verified environments on demand by describing desired capabilities in natural language.
- Training task distributions can adapt dynamically to an agent's identified weaknesses.
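The last point, adapting the task distribution to an agent's weaknesses, can be illustrated with a small sampling sketch. The abstract does not say how adaptation is done; weighting each category by its failure rate is an assumption made here purely for illustration, and the function names and category labels are hypothetical.

```python
import random


def weakness_weights(success_rate: dict[str, float]) -> dict[str, float]:
    """Weight each category by its failure rate, normalized to sum to 1."""
    failure = {cat: 1.0 - rate for cat, rate in success_rate.items()}
    total = sum(failure.values())
    if total == 0:
        # The agent succeeds everywhere: fall back to uniform sampling.
        return {cat: 1.0 / len(failure) for cat in failure}
    return {cat: f / total for cat, f in failure.items()}


def sample_categories(success_rate: dict[str, float], k: int, seed: int = 0) -> list[str]:
    """Draw k categories for the next batch, favoring weak categories."""
    weights = weakness_weights(success_rate)
    rng = random.Random(seed)
    categories = list(weights)
    return rng.choices(categories, weights=[weights[c] for c in categories], k=k)
```

Each sampled category would then be turned into a natural language request for a new environment, so the next batch of tasks leans toward what the agent currently fails at.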
Where Pith is reading between the lines
- The pipeline could be adapted to generate environments for agent types beyond claw-like designs.
- On-demand generation might enable continuous curricula where tasks evolve based on ongoing agent performance.
- Validator criteria could be iteratively improved by feeding back observed failures from deployed environments.
Load-bearing premise
The validator can reliably enforce feasibility, diversity, structural validity, and internal consistency for environments generated from arbitrary natural language descriptions without missing critical edge cases or introducing systematic biases.
What would settle it
A human review of generated environments that reveals frequent feasibility violations or consistency failures in specific categories, or agents achieving high scores through loopholes absent from human-designed tasks.
Original abstract
Constructing environments for training and evaluating claw-like agents remains a manual, human-intensive process that does not scale. We argue that what is needed is not just a dataset, but an automated pipeline capable of generating diverse, verified environments on demand. To this end, we introduce ClawEnvKit, an autonomous generation pipeline that instantiates this formalism from natural language descriptions. The pipeline comprises three modules: (1) a parser that extracts structured generation parameters from natural language input; (2) a generator that produces the task specification, tool interface, and scoring configuration; and (3) a validator that enforces feasibility, diversity, structural validity, and internal consistency across the generated environments. Using ClawEnvKit, we construct Auto-ClawEval, the first large-scale benchmark for claw-like agents, comprising 1,040 environments across 24 categories. Empirically, Auto-ClawEval matches or exceeds human-curated environments on coherence and clarity at 13,800x lower cost. Evaluated across 4 model families and 8 agent harness frameworks, we find that harness engineering boosts performance by up to 15.7 percentage points over a bare ReAct baseline, completion remains the primary axis of variation with no model saturating the benchmark, and automated generation enables evaluation at a scale previously infeasible. Beyond static benchmarking, ClawEnvKit enables live evaluation: users describe a desired capability in natural language and obtain a verified environment on demand, turning evaluation into a continuous, user-driven process. The same mechanism serves as an on-demand training environment generator, producing task distributions that adapt to an agent's current weaknesses rather than being bounded by existing user logs.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces ClawEnvKit, an automated pipeline consisting of a parser, generator, and validator to create environments for claw-like agents directly from natural language descriptions. Using this system, the authors construct Auto-ClawEval, a benchmark containing 1,040 environments across 24 categories. They claim that Auto-ClawEval matches or exceeds human-curated environments in coherence and clarity while achieving a 13,800x cost reduction. The paper also reports evaluations across 4 model families and 8 agent harness frameworks, finding that harness engineering improves performance by up to 15.7 points over a ReAct baseline, that completion remains the main source of variation, and that no model saturates the benchmark. Additional discussion covers on-demand live evaluation and adaptive training environment generation.
Significance. If the central empirical claims are substantiated, the work would be significant for scaling AI agent benchmarking and training, particularly in specialized domains. Automating the creation of large, diverse, verified environments from natural language at dramatically reduced cost could enable continuous and personalized evaluation that is currently infeasible. The reported scale of 1,040 environments and the finding that harness engineering yields substantial gains are potentially valuable contributions. The on-demand generation capability for both evaluation and training further strengthens the practical impact if the validator's reliability is demonstrated.
major comments (2)
- [Abstract] The headline claim that 'Auto-ClawEval matches or exceeds human-curated environments on coherence and clarity at 13,800x lower cost' supplies no details on the precise metrics or scoring rubrics for coherence and clarity, the human curation baseline and comparison protocol, the exact components included in the cost calculation, or any controls for confounding factors. This information is required to assess whether the data support the central empirical result.
- [Validator module] The validator is described as enforcing feasibility, diversity, structural validity, and internal consistency for environments generated from arbitrary natural language inputs, yet no quantitative evidence is provided on failure rates, performance on ambiguous or complex prompts, inter-rater agreement with humans, or ablation studies of missed edge cases. Because the benchmark construction and the 13,800x cost claim rest on the validator's robustness, this omission is load-bearing for the paper's main contribution.
minor comments (1)
- [Abstract] The specific model families and agent harness frameworks used in the evaluation are not named, which reduces the clarity and reproducibility of the reported performance variations.
Simulated Author's Rebuttal
We are grateful to the referee for the careful reading and valuable suggestions. Below we respond to each major comment, indicating the changes we plan to implement in the revised manuscript.
Point-by-point responses
- Referee: [Abstract] The headline claim that 'Auto-ClawEval matches or exceeds human-curated environments on coherence and clarity at 13,800x lower cost' supplies no details on the precise metrics or scoring rubrics for coherence and clarity, the human curation baseline and comparison protocol, the exact components included in the cost calculation, or any controls for confounding factors. This information is required to assess whether the data support the central empirical result.
  Authors: The referee is correct that the abstract does not provide these specifics. To address this, we will revise the abstract to include a brief description of the evaluation metrics (coherence and clarity rated on standardized scales), the human-curated baseline used for comparison, the protocol followed, the components of the cost calculation, and any controls applied. We will also ensure that the main text explicitly cross-references these elements. This revision will allow readers to better evaluate the central claim without altering the manuscript's core contributions.
  Revision: yes
- Referee: [Validator module] The validator is described as enforcing feasibility, diversity, structural validity, and internal consistency for environments generated from arbitrary natural language inputs, yet no quantitative evidence is provided on failure rates, performance on ambiguous or complex prompts, inter-rater agreement with humans, or ablation studies of missed edge cases. Because the benchmark construction and the 13,800x cost claim rest on the validator's robustness, this omission is load-bearing for the paper's main contribution.
  Authors: We agree that quantitative evidence regarding the validator's performance is essential to support the claims about benchmark construction and cost savings. The manuscript currently describes the validator's functionality but lacks the requested statistics. In the revision, we will incorporate quantitative results on failure rates, performance with ambiguous or complex inputs, agreement with human raters, and ablations of edge cases. These results will be placed in the Methods section to demonstrate the validator's robustness.
  Revision: yes
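As an illustration of what "agreement with human raters" could look like in practice, the sketch below computes Cohen's kappa between validator verdicts and human verdicts on the same environments. Neither the report nor the rebuttal names a statistic, so the choice of Cohen's kappa is an assumption, and the function name and the "pass"/"fail" labels are hypothetical.

```python
from collections import Counter


def cohens_kappa(rater_a: list[str], rater_b: list[str]) -> float:
    """Cohen's kappa for two raters labeling the same items."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("both raters must label the same non-empty item set")
    n = len(rater_a)
    # Observed agreement: fraction of items given the same label by both raters.
    observed = sum(x == y for x, y in zip(rater_a, rater_b)) / n
    # Chance agreement: product of each rater's marginal label frequencies.
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    labels = set(rater_a) | set(rater_b)
    expected = sum(counts_a[label] * counts_b[label] for label in labels) / (n * n)
    if expected == 1.0:
        # Both raters used a single identical label: agreement is trivially perfect.
        return 1.0
    return (observed - expected) / (1.0 - expected)
```

A kappa near 1 would indicate that the validator's pass/fail decisions track human judgment well beyond chance, while a value near 0 would mean the agreement is no better than what the label frequencies alone produce.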
Circularity Check
No circularity; pipeline and benchmark rely on external validation and comparison
Full rationale
The paper presents ClawEnvKit as a three-module pipeline (parser, generator, validator) that takes natural language descriptions as external input and produces environments. Auto-ClawEval is then constructed from this pipeline and compared empirically to separately human-curated environments on coherence, clarity, and cost. No equations, fitted parameters, or predictions are defined in terms of themselves; the validator is described as an independent enforcement mechanism rather than a self-referential step. No self-citations are invoked as load-bearing uniqueness theorems, and the central empirical claim rests on external human comparison rather than internal reduction to the generation process itself.
Forward citations
Cited by 1 Pith paper
- ClawForge: Generating Executable Interactive Benchmarks for Command-Line Agents
  ClawForge supplies a generator that turns scenario templates into reproducible command-line tasks testing state conflict handling, where the strongest frontier model scores only 45.3 percent strict accuracy.