pith · machine review for the scientific record

arxiv: 2604.18543 · v3 · submitted 2026-04-20 · 💻 cs.AI · cs.CL

Recognition: unknown

ClawEnvKit: Automatic Environment Generation for Claw-Like Agents

Authors on Pith: no claims yet

Pith reviewed 2026-05-10 04:21 UTC · model grok-4.3

classification 💻 cs.AI cs.CL
keywords environment generation · claw-like agents · automated benchmarking · natural language to tasks · agent evaluation · adaptive training · harness frameworks

The pith

ClawEnvKit automates generation of diverse verified environments for claw-like agents from natural language descriptions.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper argues that manual creation of environments for claw-like agents cannot scale with current demands for training and evaluation. ClawEnvKit solves this with a pipeline that parses natural language into structured parameters, generates full task specifications including tool interfaces and scoring, and validates the outputs for feasibility, diversity, and consistency. Using the system, the authors built Auto-ClawEval, a benchmark of 1,040 environments across 24 categories that matches human-curated quality in coherence and clarity but at 13,800 times lower cost. The same pipeline supports live evaluation where users request specific capabilities in natural language and receive ready environments instantly, plus adaptive generation of training tasks matched to an agent's current weaknesses.

Core claim

ClawEnvKit is an autonomous pipeline that converts natural language descriptions into verified environments through a parser that extracts structured generation parameters, a generator that produces task specifications, tool interfaces, and scoring configurations, and a validator that enforces feasibility, diversity, structural validity, and internal consistency. This produces Auto-ClawEval, the first large-scale benchmark for claw-like agents with 1,040 environments in 24 categories. The benchmark matches or exceeds human-curated environments on coherence and clarity at 13,800x lower cost. Across 4 model families and 8 agent harness frameworks, harness engineering improves performance by up to 15.7 percentage points over a bare ReAct baseline, completion remains the primary axis of variation, and no model saturates the benchmark.

What carries the argument

The three-module ClawEnvKit pipeline: a parser extracting parameters from natural language, a generator creating task specifications with tools and scoring, and a validator enforcing feasibility, diversity, structural validity, and internal consistency.
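
The abstract names these three modules but not their interfaces. A minimal Python sketch of how the stages might compose is below; every class, field, and function name is hypothetical, and the toy bodies stand in for whatever the paper actually implements.

```python
from dataclasses import dataclass, field

# All names below are hypothetical: the abstract names the parser, generator,
# and validator modules but not their concrete interfaces or schemas.

@dataclass
class GenerationParams:
    category: str                                   # task category inferred from the description
    tools: list[str] = field(default_factory=list)  # tool names mentioned in the description
    difficulty: str = "medium"

@dataclass
class Environment:
    task_spec: dict        # what the agent must accomplish
    tool_interface: dict   # tools exposed to the agent
    scoring_config: dict   # how completion is scored

def parse(description: str) -> GenerationParams:
    """Module 1 (toy stand-in): extract structured generation parameters."""
    words = description.lower().split()
    return GenerationParams(
        category=words[0] if words else "generic",
        tools=[w for w in words if w.endswith("_tool")],
    )

def generate(params: GenerationParams) -> Environment:
    """Module 2 (toy stand-in): produce task spec, tool interface, scoring config."""
    return Environment(
        task_spec={"category": params.category, "difficulty": params.difficulty},
        tool_interface={name: {"args": []} for name in params.tools},
        scoring_config={"metric": "completion", "max_score": 1.0},
    )

def validate(env: Environment) -> bool:
    """Module 3 (toy stand-in): structural validity plus a trivial scoring check."""
    structurally_valid = {"category", "difficulty"} <= env.task_spec.keys()
    scoring_ok = env.scoring_config.get("max_score", 0.0) > 0.0
    return structurally_valid and scoring_ok

def build_environment(description: str) -> Environment:
    """End-to-end pipeline: only environments that pass validation are emitted."""
    env = generate(parse(description))
    if not validate(env):
        raise ValueError("generated environment failed validation")
    return env

print(build_environment("email triage using a calendar_tool and a search_tool"))
```

The point of the sketch is the contract between stages: the parser owns interpretation of free-form text, the generator owns construction, and only environments that clear the validator are emitted. This is also the shape an on-demand request would take in the live-evaluation setting the paper describes.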

If this is right

  • Harness engineering boosts agent performance by up to 15.7 percentage points over a bare ReAct baseline.
  • Completion remains the primary axis of variation, with no model saturating the benchmark.
  • Evaluation becomes feasible at scales previously impossible due to manual curation costs.
  • Users can obtain verified environments on demand by describing desired capabilities in natural language.
  • Training task distributions can adapt dynamically to an agent's identified weaknesses.
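
The last point is a distribution-shaping claim. The abstract does not say how adaptation works; one simple reading, sketched below with hypothetical names and invented scores, is to oversample the categories where the agent currently scores lowest.

```python
import random

# Hypothetical sketch: the abstract says ClawEnvKit can produce "task
# distributions that adapt to an agent's current weaknesses" without saying
# how. One simple reading: request new environments for each category with
# probability inversely related to the agent's current score there.

def weakness_weighted_requests(per_category_score: dict[str, float],
                               n_requests: int,
                               temperature: float = 1.0) -> list[str]:
    """Sample task categories to generate next, favouring the weakest ones."""
    categories = list(per_category_score)
    # Lower score -> higher weight; temperature sharpens (<1) or flattens (>1) the bias.
    weights = [(1.0 - per_category_score[c]) ** (1.0 / temperature) for c in categories]
    return random.choices(categories, weights=weights, k=n_requests)

# Scores below are invented, purely to show the shape of the call.
scores = {"email_triage": 0.82, "calendar_conflicts": 0.35, "file_cleanup": 0.60}
print(weakness_weighted_requests(scores, n_requests=5))
```

With temperature below 1 the sampling concentrates harder on the weakest categories; above 1 it flattens toward uniform.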

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The pipeline could be adapted to generate environments for agent types beyond claw-like designs.
  • On-demand generation might enable continuous curricula where tasks evolve based on ongoing agent performance.
  • Validator criteria could be iteratively improved by feeding back observed failures from deployed environments.

Load-bearing premise

The validator can reliably enforce feasibility, diversity, structural validity, and internal consistency for environments generated from arbitrary natural language descriptions without missing critical edge cases or introducing systematic biases.
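
The abstract lists the four properties the validator enforces but not how each is checked. A minimal sketch of what such checks could look like over a dict-shaped environment follows; the layout and key names are assumptions, not the paper's schema, and each check is a deliberately simple stand-in.

```python
# Hypothetical sketch of the four properties the abstract attributes to the
# validator; none of these checks is taken from the paper.

REQUIRED_SECTIONS = {"task_spec", "tool_interface", "scoring_config"}

def structurally_valid(env: dict) -> bool:
    """Structural validity: every required top-level section is present."""
    return REQUIRED_SECTIONS <= env.keys()

def internally_consistent(env: dict) -> bool:
    """Internal consistency: tools the task references must exist in the interface."""
    referenced = set(env["task_spec"].get("required_tools", []))
    return referenced <= set(env["tool_interface"])

def feasible(env: dict) -> bool:
    """Feasibility: at least one tool is exposed and a positive score is attainable."""
    return bool(env["tool_interface"]) and env["scoring_config"].get("max_score", 0) > 0

def diverse(env: dict, accepted: list[dict]) -> bool:
    """Diversity: reject exact duplicates of already accepted task specs (toy check)."""
    return all(env["task_spec"] != other["task_spec"] for other in accepted)

def passes_validation(env: dict, accepted: list[dict]) -> bool:
    return (structurally_valid(env) and internally_consistent(env)
            and feasible(env) and diverse(env, accepted))

candidate = {
    "task_spec": {"required_tools": ["search_tool"], "goal": "find the duplicate invoice"},
    "tool_interface": {"search_tool": {"args": ["query"]}},
    "scoring_config": {"max_score": 1.0},
}
print(passes_validation(candidate, accepted=[]))  # True
```

Whatever the real implementation looks like, the premise is that checks of this kind catch enough of the failure modes to make the generated environments trustworthy at scale.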

What would settle it

A human review of generated environments that reveals frequent feasibility violations or consistency failures in specific categories, or agents achieving high scores through loopholes absent from human-designed tasks.
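
Concretely, such an audit reduces to comparing validator verdicts with human verdicts on a sample of generated environments. The sketch below, on invented data, shows the two statistics that would carry it: per-category failure rate under human review, and chance-corrected agreement (Cohen's kappa) between validator and human.

```python
from collections import defaultdict

# Hypothetical audit sketch, not from the paper: compare the validator's
# verdicts against human review on a sample of generated environments.

def per_category_failure_rate(samples: list[dict]) -> dict[str, float]:
    """Fraction of environments in each category that humans judged invalid."""
    totals, failures = defaultdict(int), defaultdict(int)
    for s in samples:
        totals[s["category"]] += 1
        failures[s["category"]] += 0 if s["human_ok"] else 1
    return {c: failures[c] / totals[c] for c in totals}

def cohens_kappa(validator_ok: list[bool], human_ok: list[bool]) -> float:
    """Chance-corrected agreement between validator and human verdicts."""
    n = len(validator_ok)
    observed = sum(v == h for v, h in zip(validator_ok, human_ok)) / n
    p_val, p_hum = sum(validator_ok) / n, sum(human_ok) / n
    expected = p_val * p_hum + (1 - p_val) * (1 - p_hum)
    return 1.0 if expected == 1 else (observed - expected) / (1 - expected)

# Invented verdicts, purely to show the shape of the computation.
samples = [
    {"category": "email_triage", "validator_ok": True,  "human_ok": True},
    {"category": "email_triage", "validator_ok": True,  "human_ok": False},
    {"category": "file_cleanup", "validator_ok": False, "human_ok": False},
    {"category": "file_cleanup", "validator_ok": True,  "human_ok": True},
]
print(per_category_failure_rate(samples))
print(cohens_kappa([s["validator_ok"] for s in samples],
                   [s["human_ok"] for s in samples]))
```

A low kappa, or failure rates concentrated in particular categories, would be exactly the kind of evidence that undercuts the load-bearing premise above.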

read the original abstract

Constructing environments for training and evaluating claw-like agents remains a manual, human-intensive process that does not scale. We argue that what is needed is not just a dataset, but an automated pipeline capable of generating diverse, verified environments on demand. To this end, we introduce ClawEnvKit, an autonomous generation pipeline that instantiates this formalism from natural language descriptions. The pipeline comprises three modules: (1) a parser that extracts structured generation parameters from natural language input; (2) a generator that produces the task specification, tool interface, and scoring configuration; and (3) a validator that enforces feasibility, diversity, structural validity, and internal consistency across the generated environments. Using ClawEnvKit, we construct Auto-ClawEval, the first large-scale benchmark for claw-like agents, comprising 1,040 environments across 24 categories. Empirically, Auto-ClawEval matches or exceeds human-curated environments on coherence and clarity at 13,800x lower cost. Evaluated across 4 model families and 8 agent harness frameworks, we find that harness engineering boosts performance by up to 15.7 percentage points over a bare ReAct baseline, completion remains the primary axis of variation with no model saturating the benchmark, and automated generation enables evaluation at a scale previously infeasible. Beyond static benchmarking, ClawEnvKit enables live evaluation: users describe a desired capability in natural language and obtain a verified environment on demand, turning evaluation into a continuous, user-driven process. The same mechanism serves as an on-demand training environment generator, producing task distributions that adapt to an agent's current weaknesses rather than being bounded by existing user logs.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The manuscript introduces ClawEnvKit, an automated pipeline consisting of a parser, generator, and validator to create environments for claw-like agents directly from natural language descriptions. Using this system, the authors construct Auto-ClawEval, a benchmark containing 1,040 environments across 24 categories. They claim that Auto-ClawEval matches or exceeds human-curated environments in coherence and clarity while achieving a 13,800x cost reduction. The paper also reports evaluations across 4 model families and 8 agent harness frameworks, finding that harness engineering improves performance by up to 15.7 points over a ReAct baseline, that completion remains the main source of variation, and that no model saturates the benchmark. Additional discussion covers on-demand live evaluation and adaptive training environment generation.

Significance. If the central empirical claims are substantiated, the work would be significant for scaling AI agent benchmarking and training, particularly in specialized domains. Automating the creation of large, diverse, verified environments from natural language at dramatically reduced cost could enable continuous and personalized evaluation that is currently infeasible. The reported scale of 1,040 environments and the finding that harness engineering yields substantial gains are potentially valuable contributions. The on-demand generation capability for both evaluation and training further strengthens the practical impact if the validator's reliability is demonstrated.

major comments (2)
  1. [Abstract] The headline claim that 'Auto-ClawEval matches or exceeds human-curated environments on coherence and clarity at 13,800x lower cost' supplies no details on the precise metrics or scoring rubrics for coherence and clarity, the human curation baseline and comparison protocol, the exact components included in the cost calculation, or any controls for confounding factors. This information is required to assess whether the data support the central empirical result.
  2. [Validator module] The validator is described as enforcing feasibility, diversity, structural validity, and internal consistency for environments generated from arbitrary natural language inputs, yet no quantitative evidence is provided on failure rates, performance on ambiguous or complex prompts, inter-rater agreement with humans, or ablation studies of missed edge cases. Because the benchmark construction and the 13,800x cost claim rest on the validator's robustness, this omission is load-bearing for the paper's main contribution.
minor comments (1)
  1. [Abstract] The specific model families and agent harness frameworks used in the evaluation are not named, which reduces the clarity and reproducibility of the reported performance variations.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We are grateful to the referee for the careful reading and valuable suggestions. Below we respond to each major comment, indicating the changes we plan to implement in the revised manuscript.

read point-by-point responses
  1. Referee: [Abstract] The headline claim that 'Auto-ClawEval matches or exceeds human-curated environments on coherence and clarity at 13,800x lower cost' supplies no details on the precise metrics or scoring rubrics for coherence and clarity, the human curation baseline and comparison protocol, the exact components included in the cost calculation, or any controls for confounding factors. This information is required to assess whether the data support the central empirical result.

    Authors: The referee is correct that the abstract does not provide these specifics. To address this, we will revise the abstract to include a brief description of the evaluation metrics (coherence and clarity rated on standardized scales), the human-curated baseline used for comparison, the protocol followed, the components of the cost calculation, and any controls applied. We will also ensure that the main text explicitly cross-references these elements. This revision will allow readers to better evaluate the central claim without altering the manuscript's core contributions. revision: yes

  2. Referee: [Validator module] The validator is described as enforcing feasibility, diversity, structural validity, and internal consistency for environments generated from arbitrary natural language inputs, yet no quantitative evidence is provided on failure rates, performance on ambiguous or complex prompts, inter-rater agreement with humans, or ablation studies of missed edge cases. Because the benchmark construction and the 13,800x cost claim rest on the validator's robustness, this omission is load-bearing for the paper's main contribution.

    Authors: We agree that quantitative evidence regarding the validator's performance is essential to support the claims about benchmark construction and cost savings. The manuscript currently describes the validator's functionality but lacks the requested statistics. In the revision, we will incorporate quantitative results on failure rates, performance with ambiguous or complex inputs, agreement with human raters, and ablations of edge cases. These results will be added to the Methods section to demonstrate the validator's robustness. revision: yes

Circularity Check

0 steps flagged

No circularity; pipeline and benchmark rely on external validation and comparison

full rationale

The paper presents ClawEnvKit as a three-module pipeline (parser, generator, validator) that takes natural language descriptions as external input and produces environments. Auto-ClawEval is then constructed from this pipeline and compared empirically to separately human-curated environments on coherence, clarity, and cost. No equations, fitted parameters, or predictions are defined in terms of themselves; the validator is described as an independent enforcement mechanism rather than a self-referential step. No self-citations are invoked as load-bearing uniqueness theorems, and the central empirical claim rests on external human comparison rather than internal reduction to the generation process itself.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review provides insufficient detail to enumerate specific free parameters or axioms; the main introduced element is the ClawEnvKit pipeline itself.

pith-pipeline@v0.9.0 · 5607 in / 1275 out tokens · 62077 ms · 2026-05-10T04:21:06.430100+00:00 · methodology

discussion (0)


Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. ClawForge: Generating Executable Interactive Benchmarks for Command-Line Agents

    cs.AI 2026-05 conditional novelty 7.0

    ClawForge supplies a generator that turns scenario templates into reproducible command-line tasks testing state conflict handling, where the strongest frontier model scores only 45.3 percent strict accuracy.

Reference graph

Works this paper leans on

114 extracted references · 32 canonical work pages · cited by 1 Pith paper · 14 internal anchors

  1. [1]

    CoPaw : Co personal agent workstation

    AgentScope Team . CoPaw : Co personal agent workstation. https://github.com/agentscope-ai/CoPaw, 2026. Accessed: 2026-04-05

  2. [2]

    Effective harnesses for long-running agents

    Anthropic . Effective harnesses for long-running agents. https://www.anthropic.com/engineering/effective-harnesses-for-long-running-agents, November 2025 a . Anthropic Engineering Blog. Accessed: 2026-04-08

  3. [3]

    Claude code: AI -powered coding assistant for developers

    Anthropic . Claude code: AI -powered coding assistant for developers. https://claude.com/product/claude-code, 2025 b . Accessed: 2026-04-05

  4. [4]

    Demystifying evals for ai agents

    Anthropic . Demystifying evals for ai agents. https://www.anthropic.com/engineering/demystifying-evals-for-ai-agents, January 2026. https://www.anthropic.com/engineering/demystifying-evals-for-ai-agents. Published January 9, 2026. Accessed: 2026-04-12

  5. [5]

    Quantifying infrastructure noise in agentic coding evals

    Anthropic. Quantifying infrastructure noise in agentic coding evals. https://www.anthropic.com/engineering/infrastructure-noise, 2026. Accessed: 2026-04-05

  6. [6]

    Introducing Claude Opus 4.6

    Anthropic . Introducing Claude Opus 4.6. https://www.anthropic.com/news/claude-opus-4-6, February 2026 a . Accessed: 2026-04-05

  7. [7]

    Introducing Claude Sonnet 4.6

    Anthropic . Introducing Claude Sonnet 4.6. https://www.anthropic.com/news/claude-sonnet-4-6, February 2026 b . Accessed: 2026-04-05

  8. [8]

    Cursor: The best way to code with AI

    Anysphere . Cursor: The best way to code with AI . https://cursor.com/, 2024. Accessed: 2026-04-05

  9. [9]

    Harness engineering

    Birgitta Böckeler. Harness engineering. https://martinfowler.com/articles/exploring-gen-ai/harness-engineering.html, February 2026. martinfowler.com. Accessed: 2026-04-08

  10. [10]

    I improved 15 llms at coding in one afternoon

    Can Bölük. I improved 15 llms at coding in one afternoon. only the harness changed. https://blog.can.ac/2026/02/12/the-harness-problem/, February 2026. Personal technical blog. Accessed: 2026-04-08

  11. [11]

    GUI-World: A Video Benchmark and Dataset for Multimodal GUI-oriented Understanding

    Dongping Chen, Yue Huang, Siyuan Wu, Jingyu Tang, Liuyi Chen, Yilin Bai, Zhigang He, Chenlong Wang, Huichi Zhou, Yiqiang Li, Tianshuo Zhou, Yue Yu, Chujie Gao, Qihui Zhang, Yi Gui, Zhen Li, Yao Wan, Pan Zhou, Jianfeng Gao, and Lichao Sun. Gui-world: A video benchmark and dataset for multimodal gui-oriented understanding, 2025. https://arxiv.org/abs/2406.10819

  12. [13]

    The BrowserGym Ecosystem for Web Agent Research

    Thibault Le Sellier De Chezelles, Maxime Gasse, Alexandre Drouin, Massimo Caccia, Léo Boisvert, Megh Thakkar, Tom Marty, Rim Assouel, Sahar Omidi Shayegan, Lawrence Keunho Jang, Xing Han Lù, Ori Yoran, Dehan Kong, Frank F. Xu, Siva Reddy, Quentin Cappart, Graham Neubig, Ruslan Salakhutdinov, Nicolas Chapados, and Alexandre Lacoste. The browsergym ecosyste...

  13. [14]

    Benchmark probing: Investigating data leakage in large language models

    Chunyuan Deng, Yilun Zhao, Xiangru Tang, Mark Gerstein, and Arman Cohan. Benchmark probing: Investigating data leakage in large language models. In NeurIPS 2023 workshop on backdoors in deep learning-The good, the bad, and the ugly, 2023

  14. [15]

    WorkArena: How Capable Are Web Agents at Solving Common Knowledge Work Tasks?

    Alexandre Drouin, Maxime Gasse, Massimo Caccia, Issam H. Laradji, Manuel Del Verme, Tom Marty, Léo Boisvert, Megh Thakkar, Quentin Cappart, David Vazquez, Nicolas Chapados, and Alexandre Lacoste. Workarena: How capable are web agents at solving common knowledge work tasks?, 2024

  15. [16]

    EnvBench: A benchmark for automated environment setup, 2025

    Aleksandra Eliseeva, Alexander Kovrigin, Ilia Kholkin, Egor Bogomolov, and Yaroslav Zharov. Envbench: A benchmark for automated environment setup, 2025. https://arxiv.org/abs/2503.14443

  16. [17]

    Endless Terminals: Scaling RL Environments for Terminal Agents

    Kanishk Gandhi, Shivam Garg, Noah D. Goodman, and Dimitris Papailiopoulos. Endless terminals: Scaling rl environments for terminal agents, 2026. https://arxiv.org/abs/2601.16443

  17. [18]

    GLM-5: from Vibe Coding to Agentic Engineering

    GLM-5-Team, :, Aohan Zeng, Xin Lv, Zhenyu Hou, Zhengxiao Du, Qinkai Zheng, Bin Chen, Da Yin, Chendi Ge, Chenghua Huang, Chengxing Xie, Chenzheng Zhu, Congfeng Yin, Cunxiang Wang, Gengzheng Pan, Hao Zeng, Haoke Zhang, Haoran Wang, Huilong Chen, Jiajie Zhang, Jian Jiao, Jiaqi Guo, Jingsen Wang, Jingzhao Du, Jinzhu Wu, Kedong Wang, Lei Li, Lin Fan, Lucen Zho...

  18. [19]

    R2e-gym: Procedural environments and hybrid verifiers for scaling open-weights swe agents,

    Naman Jain, Jaskirat Singh, Manish Shetty, Liang Zheng, Koushik Sen, and Ion Stoica. R2e-gym: Procedural environments and hybrid verifiers for scaling open-weights swe agents, 2025. https://arxiv.org/abs/2504.07164

  19. [20]

    ClawArena: Benchmarking AI Agents in Evolving Information Environments

    Haonian Ji, Kaiwen Xiong, Siwei Han, Peng Xia, Shi Qiu, Yiyang Zhou, Jiaqi Liu, Jinlong Li, Bingzhou Li, Zeyu Zheng, Cihang Xie, and Huaxiu Yao. Clawarena: Benchmarking ai agents in evolving information environments, 2026. https://arxiv.org/abs/2604.04202

  20. [21]

    VisualWebArena: Evaluating Multimodal Agents on Realistic Visual Web Tasks

    Jing Yu Koh, Robert Lo, Lawrence Jang, Vikram Duvvur, Ming Chong Lim, Po-Yu Huang, Graham Neubig, Shuyan Zhou, Ruslan Salakhutdinov, and Daniel Fried. Visualwebarena: Evaluating multimodal agents on realistic visual web tasks, 2024. https://arxiv.org/abs/2401.13649

  21. [22]

    LLMs Get Lost In Multi-Turn Conversation

    Philippe Laban, Hiroaki Hayashi, Yingbo Zhou, and Jennifer Neville. Llms get lost in multi-turn conversation, 2025. https://arxiv.org/abs/2505.06120

  22. [23]

    Meta-Harness: End-to-End Optimization of Model Harnesses

    Yoonho Lee, Roshen Nair, Qizheng Zhang, Kangwook Lee, Omar Khattab, and Chelsea Finn. Meta-harness: End-to-end optimization of model harnesses, 2026. https://arxiv.org/abs/2603.28052

  23. [24]

    Mitigating Lost in Multi-turn Conversation via Curriculum RL with Verifiable Accuracy and Abstention Rewards

    Ming Li. Verifiable accuracy and abstention rewards in curriculum rl to alleviate lost-in-conversation, 2025. https://arxiv.org/abs/2510.18731

  24. [26]

    ClawsBench: Evaluating Capability and Safety of LLM Productivity Agents in Simulated Workspaces

    Xiangyi Li, Kyoung Whan Choe, Yimin Liu, Xiaokun Chen, Chujun Tao, Bingran You, Wenbo Chen, Zonglin Di, Jiankai Sun, Shenghan Zheng, Jiajun Bao, Yuanli Wang, Weixiang Yan, Yiyuan Li, and Han chung Lee. Clawsbench: Evaluating capability and safety of llm productivity agents in simulated workspaces, 2026 b . https://arxiv.org/abs/2604.05172

  25. [29]

    MiniMax M2.5 : Built for real-world productivity

    MiniMax . MiniMax M2.5 : Built for real-world productivity. https://www.minimax.io/news/minimax-m25, February 2026 a . 230B MoE with 10B active parameters, trained with RL in 200K+ environments. Accessed: 2026-04-05

  26. [30]

    MiniMax M2.7 : Early echoes of self-evolution

    MiniMax . MiniMax M2.7 : Early echoes of self-evolution. https://www.minimax.io/news/minimax-m27-en, March 2026 b . First model to participate in its own recursive self-improvement via 100+ autonomous optimization cycles. Accessed: 2026-04-05

  27. [31]

    Ironclaw: A security-first open-source ai agent framework in rust

    Near AI . Ironclaw: A security-first open-source ai agent framework in rust. https://github.com/nearai/ironclaw, 2026. MIT/Apache-2.0 License, Accessed: 2026-04-04

  28. [32]

    Hermes agent: The self-improving AI agent

    Nous Research . Hermes agent: The self-improving AI agent. https://github.com/NousResearch/hermes-agent, 2026. 23k+ stars. Built-in learning loop with skill creation, memory search, and RL training via Atropos. Accessed: 2026-04-05

  29. [33]

    NemoClaw : Run OpenClaw more securely inside NVIDIA OpenShell with managed inference

    NVIDIA . NemoClaw : Run OpenClaw more securely inside NVIDIA OpenShell with managed inference. https://github.com/NVIDIA/NemoClaw, March 2026. Early preview released March 16, 2026. Part of NVIDIA Agent Toolkit. Accessed: 2026-04-05

  30. [34]

    Introducing GPT -5

    OpenAI . Introducing GPT -5. https://openai.com/index/introducing-gpt-5/, August 2025 a . Accessed: 2026-04-05

  31. [35]

    Codex: AI coding agent for software development

    OpenAI . Codex: AI coding agent for software development. https://openai.com/codex/, 2025 b . Accessed: 2026-04-05

  32. [36]

    Introducing GPT -5.4

    OpenAI . Introducing GPT -5.4. https://openai.com/index/introducing-gpt-5-4/, March 2026 a . Accessed: 2026-04-05

  33. [37]

    Harness engineering: leveraging codex in an agent-first world

    OpenAI . Harness engineering: leveraging codex in an agent-first world. https://openai.com/index/harness-engineering/, 2026 b . Accessed: 2026-04-08

  34. [38]

    WebCanvas: Benchmarking Web Agents in Online Environments

    Yichen Pan, Dehan Kong, Sida Zhou, Cheng Cui, Yifei Leng, Bing Jiang, Hangyu Liu, Yanyi Shang, Shuyan Zhou, Tongshuang Wu, and Zhengyang Wu. Webcanvas: Benchmarking web agents in online environments, 2024. https://arxiv.org/abs/2406.12373

  35. [39]

    Nanoclaw: A lightweight, secure ai agent framework with container isolation

    qwibitai . Nanoclaw: A lightweight, secure ai agent framework with container isolation. https://github.com/qwibitai/nanoclaw, 2026. Accessed: 2026-04-04

  36. [40]

    The contribution of latent human failures to the breakdown of complex systems

    James Reason. The contribution of latent human failures to the breakdown of complex systems. Philosophical Transactions of the Royal Society of London B, 327:475--484, 1990

  37. [41]

    Reflexion: Language Agents with Verbal Reinforcement Learning

    Noah Shinn, Federico Cassano, Edward Berman, Ashwin Gopinath, Karthik Narasimhan, and Shunyu Yao. Reflexion: Language agents with verbal reinforcement learning, 2023. https://arxiv.org/abs/2303.11366

  38. [42]

    PicoClaw : Tiny, fast, and deployable anywhere AI agent

    Sipeed . PicoClaw : Tiny, fast, and deployable anywhere AI agent. https://github.com/sipeed/picoclaw, February 2026. Ultra-lightweight Go-based personal AI assistant with <10MB memory footprint. Accessed: 2026-04-05

  39. [43]

    Openclaw: Your own personal ai assistant (open-source agent framework)

    Peter Steinberger. Openclaw: Your own personal ai assistant (open-source agent framework). https://github.com/openclaw/openclaw, 2025. MIT License, Accessed: 2026-04-04

  40. [45]

    Meta-gui: Towards multi-modal conversational agents on mobile gui

    Liangtai Sun, Xingyu Chen, Lu Chen, Tianle Dai, Zichen Zhu, and Kai Yu. Meta-gui: Towards multi-modal conversational agents on mobile gui. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pages 6699--6712, 2022

  41. [46]

    Reinforcement Learning: An Introduction

    Richard S. Sutton and Andrew G. Barto. Reinforcement Learning: An Introduction. MIT Press, Cambridge, MA, 1 edition, 1998

  42. [47]

    A comprehensive survey of continual learning: Theory, method and application

    Liyuan Wang, Xingxing Zhang, Hang Su, and Jun Zhu. A comprehensive survey of continual learning: Theory, method and application. IEEE Transactions on Pattern Analysis and Machine Intelligence, 46(8):5362--5383, 2024

  43. [48]

    The OpenHands software agent SDK : A composable and extensible foundation for production agents, 2025

    Xingyao Wang et al. The OpenHands software agent SDK : A composable and extensible foundation for production agents, 2025

  44. [52]

    Osworld: Benchmarking multimodal agents for open-ended tasks in real computer environments, 2024

    Tianbao Xie, Danyang Zhang, Jixuan Chen, Xiaochuan Li, Siheng Zhao, Ruisheng Cao, Toh Jing Hua, Zhoujun Cheng, Dongchan Shin, Fangyu Lei, Yitao Liu, Yiheng Xu, Shuyan Zhou, Silvio Savarese, Caiming Xiong, Victor Zhong, and Tao Yu. Osworld: Benchmarking multimodal agents for open-ended tasks in real computer environments, 2024

  45. [54]

    SWE-smith: Scaling Data for Software Engineering Agents

    John Yang, Kilian Lieret, Carlos E. Jimenez, Alexander Wettig, Kabir Khandpur, Yanzhe Zhang, Binyuan Hui, Ofir Press, Ludwig Schmidt, and Diyi Yang. Swe-smith: Scaling data for software engineering agents, 2025. https://arxiv.org/abs/2504.21798

  46. [55]

    ReAct: Synergizing Reasoning and Acting in Language Models

    Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik Narasimhan, and Yuan Cao. React: Synergizing reasoning and acting in language models, 2023. https://arxiv.org/abs/2210.03629

  47. [56]

    Claw-eval: End-to-end transparent benchmark for ai agents in the real world, 2026

    Bowen Ye, Rang Li, Qibin Yang, Zhihui Xie, Yuanxin Liu, Linli Yao, Hanglong Lyu, and Lei Li. Claw-eval: End-to-end transparent benchmark for ai agents in the real world, 2026. https://github.com/claw-eval/claw-eval

  48. [57]

    Assistantbench: Can web agents solve realistic and time-consuming tasks? arXiv preprint arXiv:2407.15711, 2024

    Ori Yoran, Samuel Joseph Amouyal, Chaitanya Malaviya, Ben Bogin, Ofir Press, and Jonathan Berant. Assistantbench: Can web agents solve realistic and time-consuming tasks?, 2024. https://arxiv.org/abs/2407.15711

  49. [58]

    ZeroClaw : Fast, small, and fully autonomous AI assistant infrastructure in Rust

    ZeroClaw Labs . ZeroClaw : Fast, small, and fully autonomous AI assistant infrastructure in Rust . https://github.com/zeroclaw-labs/zeroclaw, February 2026. Trait-driven Rust runtime with <5MB memory footprint. Accessed: 2026-04-05

  50. [60]

    Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena

    Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zi Lin, Zhuohan Li, Dacheng Li, Eric P. Xing, Hao Zhang, Joseph E. Gonzalez, and Ion Stoica. Judging llm-as-a-judge with mt-bench and chatbot arena, 2023. https://arxiv.org/abs/2306.05685

  51. [62]

    GLM -5-turbo: A foundation model optimized for the OpenClaw scenario

    Zhipu AI . GLM -5-turbo: A foundation model optimized for the OpenClaw scenario. https://docs.z.ai/guides/llm/glm-5-turbo, 2026. 200K context, optimized for tool invocation and long-chain agent execution. Accessed: 2026-04-05

  52. [64]

    GUI-World: A Video Benchmark and Dataset for Multimodal GUI-oriented Understanding

    GUI-World: A video benchmark and dataset for multimodal GUI-oriented understanding, 2025. https://arxiv.org/abs/2406.10819

  53. [65]

    Meta-GUI: Towards Multi-modal Conversational Agents on Mobile GUI

    Meta-GUI: Towards multi-modal conversational agents on mobile GUI. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, 2022

  54. [66]

    WebLINX: Real-World Website Navigation with Multi-Turn Dialogue

    Lù, X. H., Kasner, Z., and Reddy, S. WebLINX: Real-world website navigation with multi-turn dialogue. arXiv preprint arXiv:2402.05930, 2024

  55. [67]

    AgentStudio: A Toolkit for Building General Virtual Agents

    AgentStudio: A toolkit for building general virtual agents. arXiv preprint arXiv:2403.17918, 2024

  56. [68]

    Claw-Eval: End-to-End Transparent Benchmark for AI Agents in the Real World

    Claw-Eval: End-to-end transparent benchmark for AI agents in the real world, 2026. https://github.com/claw-eval/claw-eval

  57. [69]

    AgentBench: Evaluating LLMs as Agents

    AgentBench: Evaluating LLMs as agents. arXiv preprint arXiv:2308.03688, 2023

  58. [70]

    WebArena: A Realistic Web Environment for Building Autonomous Agents

    WebArena: A realistic web environment for building autonomous agents. arXiv preprint arXiv:2307.13854, 2023

  59. [71]

    SkillsBench: Benchmarking How Well Agent Skills Work Across Diverse Tasks

    SkillsBench: Benchmarking how well agent skills work across diverse tasks. arXiv preprint arXiv:2602.12670, 2026

  60. [72]

    SWE-smith: Scaling Data for Software Engineering Agents

    SWE-smith: Scaling data for software engineering agents, 2025. https://arxiv.org/abs/2504.21798

  61. [73]

    R2E-Gym: Procedural Environments and Hybrid Verifiers for Scaling Open-Weights SWE Agents

    R2E-Gym: Procedural environments and hybrid verifiers for scaling open-weights SWE agents, 2025. https://arxiv.org/abs/2504.07164

  62. [74]

    Procedural Environment Generation for Tool-Use Agents

    Sullivan, Michael and Hartmann, Mareike and Koller, Alexander. Procedural environment generation for tool-use agents. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, 2025. doi:10.18653/v1/2025.emnlp-main.936

  63. [75]

    MetaClaw: Just Talk, an Agent That Meta-Learns and Evolves in the Wild

    MetaClaw: Just talk, an agent that meta-learns and evolves in the wild. arXiv preprint arXiv:2603.17187, 2026

  64. [76]

    Agent World Model: Infinity Synthetic Environments for Agentic Reinforcement Learning

    Agent world model: Infinity synthetic environments for agentic reinforcement learning. arXiv preprint arXiv:2602.10090, 2026

  65. [77]

    OpenClaw-RL: Train Any Agent Simply by Talking

    OpenClaw-RL: Train any agent simply by talking. arXiv preprint arXiv:2603.10165, 2026

  66. [78]

    OpenClaw: Your Own Personal AI Assistant (Open-Source Agent Framework)

    Peter Steinberger. OpenClaw: Your own personal AI assistant (open-source agent framework). https://github.com/openclaw/openclaw, 2025

  67. [79]

    NanoClaw: A Lightweight, Secure AI Agent Framework with Container Isolation

    NanoClaw: A lightweight, secure AI agent framework with container isolation. https://github.com/qwibitai/nanoclaw, 2026

  68. [80]

    IronClaw: A Security-First Open-Source AI Agent Framework in Rust

    IronClaw: A security-first open-source AI agent framework in Rust. https://github.com/nearai/ironclaw, 2026

  69. [81]

    A Comprehensive Survey of Continual Learning: Theory, Method and Application

    A comprehensive survey of continual learning: Theory, method and application. IEEE Transactions on Pattern Analysis and Machine Intelligence, 46(8):5362--5383, 2024

  70. [82]

    ReAct: Synergizing Reasoning and Acting in Language Models

    ReAct: Synergizing reasoning and acting in language models, 2023. https://arxiv.org/abs/2210.03629

  71. [83]

    Reflexion: Language Agents with Verbal Reinforcement Learning

    Reflexion: Language agents with verbal reinforcement learning, 2023. https://arxiv.org/abs/2303.11366

  72. [84]

    SWE-bench: Can Language Models Resolve Real-World GitHub Issues?

    SWE-bench: Can language models resolve real-world GitHub issues?, 2024

  73. [85]

    Claude Code: AI-Powered Coding Assistant for Developers

    Claude Code: AI-powered coding assistant for developers. https://claude.com/product/claude-code, 2025

  74. [86]

    Codex: AI Coding Agent for Software Development

    Codex: AI coding agent for software development. https://openai.com/codex/, 2025

  75. [87]

    Cursor: The Best Way to Code with AI

    Cursor: The best way to code with AI. https://cursor.com/, 2024

  76. [88]

    Untitled web reference, February 2026

  77. [89]

    Untitled web reference, March 2026

  78. [90]

    Hermes Agent: The Self-Improving AI Agent

    Hermes Agent: The self-improving AI agent. https://github.com/NousResearch/hermes-agent, 2026

  79. [91]

    OSWorld: Benchmarking Multimodal Agents for Open-Ended Tasks in Real Computer Environments

    OSWorld: Benchmarking multimodal agents for open-ended tasks in real computer environments, 2024

  80. [92]

    WorkArena: How Capable Are Web Agents at Solving Common Knowledge Work Tasks?

    WorkArena: How capable are web agents at solving common knowledge work tasks?, 2024

Showing first 80 references.