arxiv: 2604.18543 · v3 · submitted 2026-04-20 · 💻 cs.AI · cs.CL

ClawEnvKit: Automatic Environment Generation for Claw-Like Agents

Xirui Li, Ming Li, Ion Stoica, Cho-Jui Hsieh, Tianyi Zhou

Pith reviewed 2026-05-10 04:21 UTC · model grok-4.3

keywords: environment generation, claw-like agents, automated benchmarking, natural language to tasks, agent evaluation, adaptive training, harness frameworks

The pith

ClawEnvKit automates generation of diverse verified environments for claw-like agents from natural language descriptions.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper argues that manual creation of environments for claw-like agents cannot scale with current demands for training and evaluation. ClawEnvKit solves this with a pipeline that parses natural language into structured parameters, generates full task specifications including tool interfaces and scoring, and validates the outputs for feasibility, diversity, and consistency. Using the system, the authors built Auto-ClawEval, a benchmark of 1,040 environments across 24 categories that matches human-curated quality in coherence and clarity but at 13,800 times lower cost. The same pipeline supports live evaluation where users request specific capabilities in natural language and receive ready environments instantly, plus adaptive generation of training tasks matched to an agent's current weaknesses.

Core claim

ClawEnvKit is an autonomous pipeline that converts natural language descriptions into verified environments through a parser that extracts structured generation parameters, a generator that produces task specifications, tool interfaces, and scoring configurations, and a validator that enforces feasibility, diversity, structural validity, and internal consistency. This produces Auto-ClawEval, the first large-scale benchmark for claw-like agents with 1,040 environments in 24 categories. The benchmark matches or exceeds human-curated environments on coherence and clarity at 13,800x lower cost. Across 4 model families and 8 agent harness frameworks, harness engineering improves performance by up to 15.7 percentage points over a bare ReAct baseline.

What carries the argument

The three-module ClawEnvKit pipeline: a parser extracting parameters from natural language, a generator creating task specifications with tools and scoring, and a validator enforcing feasibility, diversity, structural validity, and internal consistency.
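As a rough illustration of the parse, generate, validate flow described above, the following Python sketch wires three toy stages together. Everything in it is a hypothetical stand-in rather than the authors' implementation: the `GenerationParams` and `Environment` types, the `category: count` input format, and the specific checks are assumptions, since the abstract does not specify them.

```python
from dataclasses import dataclass


@dataclass
class GenerationParams:
    """Structured parameters a parser might extract (hypothetical fields)."""
    category: str
    num_tasks: int


@dataclass
class Environment:
    """One generated environment: task spec, tool interface, scoring config."""
    task: str
    tools: list[str]
    scoring: dict[str, float]


def parse(description: str) -> GenerationParams:
    # Toy parser: reads "category: count"; a missing count defaults to 1.
    category, _, count = description.partition(":")
    return GenerationParams(category.strip(), int(count.strip() or 1))


def generate(params: GenerationParams) -> list[Environment]:
    # Toy generator: emits one templated environment per requested task.
    return [
        Environment(
            task=f"{params.category} task {i}",
            tools=[f"{params.category}_tool"],
            scoring={"completion": 1.0},
        )
        for i in range(params.num_tasks)
    ]


def validate(envs: list[Environment]) -> list[Environment]:
    # Toy validator: keeps environments that are structurally valid
    # (non-empty task, at least one tool, scoring weights summing to 1)
    # and whose task text has not been seen before (a crude diversity check).
    seen: set[str] = set()
    kept: list[Environment] = []
    for env in envs:
        weights_ok = abs(sum(env.scoring.values()) - 1.0) < 1e-9
        if env.task and env.tools and weights_ok and env.task not in seen:
            seen.add(env.task)
            kept.append(env)
    return kept


def build_environments(description: str) -> list[Environment]:
    # Chain the three stages in the order the paper describes.
    return validate(generate(parse(description)))
```

Calling `build_environments("calendar: 3")` returns three distinct environments under these assumptions. The point of the sketch is only the shape of the pipeline: the validator sits last and filters whatever the generator emits, which is why its reliability matters so much for the claims above.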

If this is right

Harness engineering boosts agent performance by up to 15.7 percentage points over a bare ReAct baseline.
Completion remains the primary axis of variation, with no model saturating the benchmark.
Evaluation becomes feasible at scales previously impossible due to manual curation costs.
Users can obtain verified environments on demand by describing desired capabilities in natural language.
Training task distributions can adapt dynamically to an agent's identified weaknesses.
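The last point, adapting the task distribution to an agent's weaknesses, can be sketched as weighted sampling over categories. This is a minimal illustration under assumed inputs: the per-category success rates, the `floor` parameter, and both function names are hypothetical, and the paper may use a different weighting scheme.

```python
import random


def weakness_weights(success_rate: dict[str, float], floor: float = 0.05) -> dict[str, float]:
    """Turn per-category success rates into sampling probabilities.

    Categories where the agent fails more often receive more weight.
    `floor` (an assumption of this sketch) keeps mastered categories
    from dropping out of the distribution entirely.
    """
    raw = {cat: max(1.0 - rate, floor) for cat, rate in success_rate.items()}
    total = sum(raw.values())
    return {cat: weight / total for cat, weight in raw.items()}


def sample_categories(success_rate: dict[str, float], k: int, seed: int = 0) -> list[str]:
    """Draw k categories for the next batch of generated training tasks."""
    weights = weakness_weights(success_rate)
    categories = list(weights)
    rng = random.Random(seed)  # seeded so a batch can be reproduced
    return rng.choices(categories, weights=[weights[c] for c in categories], k=k)
```

Each sampled category would then be turned into a natural language request for the generation pipeline, so the training mix shifts as the measured success rates change.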

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The pipeline could be adapted to generate environments for agent types beyond claw-like designs.
On-demand generation might enable continuous curricula where tasks evolve based on ongoing agent performance.
Validator criteria could be iteratively improved by feeding back observed failures from deployed environments.

Load-bearing premise

The validator can reliably enforce feasibility, diversity, structural validity, and internal consistency for environments generated from arbitrary natural language descriptions without missing critical edge cases or introducing systematic biases.

What would settle it

A human review of generated environments that reveals frequent feasibility violations or consistency failures in specific categories, or agents achieving high scores through loopholes absent from human-designed tasks.
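A human audit of this kind reduces to a per-category tally. The sketch below assumes a hypothetical record format, `(category, passed)` pairs from reviewers, and computes the fraction of reviewed environments that failed in each category; the data in the usage note are illustrative, not results from the paper.

```python
from collections import defaultdict


def failure_rates(reviews: list[tuple[str, bool]]) -> dict[str, float]:
    """Map each category to the fraction of reviewed environments that failed.

    Each review is (category, passed), where passed is False when the
    reviewer found a feasibility or consistency violation.
    """
    total: dict[str, int] = defaultdict(int)
    failed: dict[str, int] = defaultdict(int)
    for category, passed in reviews:
        total[category] += 1
        if not passed:
            failed[category] += 1
    return {category: failed[category] / total[category] for category in total}
```

For example, `failure_rates([("email", True), ("email", False), ("files", True)])` gives a rate of 0.5 for `email` and 0.0 for `files`. Failure rates concentrated in a few categories would point to systematic validator gaps rather than random noise.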

Original abstract

Constructing environments for training and evaluating claw-like agents remains a manual, human-intensive process that does not scale. We argue that what is needed is not just a dataset, but an automated pipeline capable of generating diverse, verified environments on demand. To this end, we introduce ClawEnvKit, an autonomous generation pipeline that instantiates this formalism from natural language descriptions. The pipeline comprises three modules: (1) a parser that extracts structured generation parameters from natural language input; (2) a generator that produces the task specification, tool interface, and scoring configuration; and (3) a validator that enforces feasibility, diversity, structural validity, and internal consistency across the generated environments. Using ClawEnvKit, we construct Auto-ClawEval, the first large-scale benchmark for claw-like agents, comprising 1,040 environments across 24 categories. Empirically, Auto-ClawEval matches or exceeds human-curated environments on coherence and clarity at 13,800x lower cost. Evaluated across 4 model families and 8 agent harness frameworks, we find that harness engineering boosts performance by up to 15.7 percentage points over a bare ReAct baseline, completion remains the primary axis of variation with no model saturating the benchmark, and automated generation enables evaluation at a scale previously infeasible. Beyond static benchmarking, ClawEnvKit enables live evaluation: users describe a desired capability in natural language and obtain a verified environment on demand, turning evaluation into a continuous, user-driven process. The same mechanism serves as an on-demand training environment generator, producing task distributions that adapt to an agent's current weaknesses rather than being bounded by existing user logs.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

ClawEnvKit gives a workable NL-to-verified-environment pipeline and a 1,040-environment benchmark, but the validator's reliability and the 13,800x cost claim lack the quantitative backing needed to trust the main results.

The paper's central contribution is ClawEnvKit, a three-module pipeline that parses natural language into structured parameters, generates task specs and scoring, then validates for feasibility and consistency. They used it to produce Auto-ClawEval, a 1,040-environment benchmark spanning 24 categories, plus an on-demand mode for live evaluation and adaptive training. That combination of scale and live generation is new relative to the usual manual or semi-manual setups in agent work. The evaluations across four model families and eight harnesses also surface a clear signal that harness engineering can add up to 15.7 points over a plain ReAct baseline, and that completion remains the main source of variation with no model saturating the set. Those are useful, concrete observations for people running agent experiments.

The soft spot is the validator. The headline claim that the generated environments match or beat human-curated ones on coherence and clarity at 13,800 times lower cost depends entirely on the validator catching feasibility, diversity, structural validity, and internal consistency for arbitrary inputs. The abstract states the goals but supplies no failure rates, inter-rater numbers, ablation on edge cases, or controls for how cost was measured. Without those numbers the cost multiplier and the benchmark's soundness cannot be evaluated. The concern is not minor because the validator is load-bearing for both the static benchmark and the live-generation feature.

This paper is for researchers who need scalable, varied environments for tool-using or claw-like agents and who are willing to treat the current results as a starting point rather than a finished product. It shows clear thinking about the bottleneck and honest engagement with the practical constraints of agent evaluation. It deserves a serious referee because the pipeline idea is worth testing properly, but the authors would need to add detailed validator diagnostics and reproducible cost breakdowns before acceptance. I would send it to review with that expectation.

Referee Report

2 major / 1 minor

Summary. The manuscript introduces ClawEnvKit, an automated pipeline consisting of a parser, generator, and validator to create environments for claw-like agents directly from natural language descriptions. Using this system, the authors construct Auto-ClawEval, a benchmark containing 1,040 environments across 24 categories. They claim that Auto-ClawEval matches or exceeds human-curated environments in coherence and clarity while achieving a 13,800x cost reduction. The paper also reports evaluations across 4 model families and 8 agent harness frameworks, finding that harness engineering improves performance by up to 15.7 points over a ReAct baseline, that completion remains the main source of variation, and that no model saturates the benchmark. Additional discussion covers on-demand live evaluation and adaptive training environment generation.

Significance. If the central empirical claims are substantiated, the work would be significant for scaling AI agent benchmarking and training, particularly in specialized domains. Automating the creation of large, diverse, verified environments from natural language at dramatically reduced cost could enable continuous and personalized evaluation that is currently infeasible. The reported scale of 1,040 environments and the finding that harness engineering yields substantial gains are potentially valuable contributions. The on-demand generation capability for both evaluation and training further strengthens the practical impact if the validator's reliability is demonstrated.

major comments (2)

[Abstract] Abstract: The headline claim that 'Auto-ClawEval matches or exceeds human-curated environments on coherence and clarity at 13,800x lower cost' supplies no details on the precise metrics or scoring rubrics for coherence and clarity, the human curation baseline and comparison protocol, the exact components included in the cost calculation, or any controls for confounding factors. This information is required to assess whether the data support the central empirical result.
[Validator module] Validator module: The validator is described as enforcing feasibility, diversity, structural validity, and internal consistency for environments generated from arbitrary natural language inputs, yet no quantitative evidence is provided on failure rates, performance on ambiguous or complex prompts, inter-rater agreement with humans, or ablation studies of missed edge cases. Because the benchmark construction and the 13,800x cost claim rest on the validator's robustness, this omission is load-bearing for the paper's main contribution.
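One standard way to report the agreement with human raters requested above is Cohen's kappa, which corrects raw agreement for the agreement expected by chance. The sketch below assumes binary accept/reject labels from the validator and from one human reviewer on the same environments; the function name and the label encoding are illustrative choices, not something taken from the manuscript.

```python
def cohens_kappa(validator: list[bool], human: list[bool]) -> float:
    """Cohen's kappa for two raters giving binary accept/reject labels."""
    if len(validator) != len(human) or not validator:
        raise ValueError("label lists must be non-empty and of equal length")
    n = len(validator)
    # Observed agreement: fraction of items where both raters agree.
    observed = sum(v == h for v, h in zip(validator, human)) / n
    # Chance agreement from each rater's marginal accept rate.
    p_validator = sum(validator) / n
    p_human = sum(human) / n
    expected = p_validator * p_human + (1 - p_validator) * (1 - p_human)
    if expected == 1.0:
        # Both raters used one identical label throughout; kappa is formally
        # undefined here, and this sketch reports 1.0 by convention.
        return 1.0
    return (observed - expected) / (1 - expected)
```

A kappa near 1 would indicate that the validator's accept/reject decisions track human judgment closely, while a value near 0 would mean agreement no better than chance.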

minor comments (1)

[Abstract] Abstract: The specific model families and agent harness frameworks used in the evaluation are not named, which reduces the clarity and reproducibility of the reported performance variations.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We are grateful to the referee for the careful reading and valuable suggestions. Below we respond to each major comment, indicating the changes we plan to implement in the revised manuscript.

Referee: [Abstract] Abstract: The headline claim that 'Auto-ClawEval matches or exceeds human-curated environments on coherence and clarity at 13,800x lower cost' supplies no details on the precise metrics or scoring rubrics for coherence and clarity, the human curation baseline and comparison protocol, the exact components included in the cost calculation, or any controls for confounding factors. This information is required to assess whether the data support the central empirical result.

Authors: The referee is correct that the abstract does not provide these specifics. To address this, we will revise the abstract to include a brief description of the evaluation metrics (coherence and clarity rated on standardized scales), the human-curated baseline used for comparison, the protocol followed, the components of the cost calculation, and any controls applied. We will also ensure that the main text explicitly cross-references these elements. This revision will allow readers to better evaluate the central claim without altering the manuscript's core contributions.

revision: yes

Referee: [Validator module] Validator module: The validator is described as enforcing feasibility, diversity, structural validity, and internal consistency for environments generated from arbitrary natural language inputs, yet no quantitative evidence is provided on failure rates, performance on ambiguous or complex prompts, inter-rater agreement with humans, or ablation studies of missed edge cases. Because the benchmark construction and the 13,800x cost claim rest on the validator's robustness, this omission is load-bearing for the paper's main contribution.

Authors: We agree that quantitative evidence regarding the validator's performance is essential to support the claims about benchmark construction and cost savings. The manuscript currently describes the validator's functionality but lacks the requested statistics. In the revision, we will incorporate quantitative results on failure rates, performance with ambiguous or complex inputs, agreement with human raters, and ablations of edge cases. These results will go in the Methods section to demonstrate the validator's robustness.

revision: yes

Circularity Check

0 steps flagged

No circularity; pipeline and benchmark rely on external validation and comparison

The paper presents ClawEnvKit as a three-module pipeline (parser, generator, validator) that takes natural language descriptions as external input and produces environments. Auto-ClawEval is then constructed from this pipeline and compared empirically to separately human-curated environments on coherence, clarity, and cost. No equations, fitted parameters, or predictions are defined in terms of themselves; the validator is described as an independent enforcement mechanism rather than a self-referential step. No self-citations are invoked as load-bearing uniqueness theorems, and the central empirical claim rests on external human comparison rather than internal reduction to the generation process itself.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review provides insufficient detail to enumerate specific free parameters or axioms; the main introduced element is the ClawEnvKit pipeline itself.

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

ClawForge: Generating Executable Interactive Benchmarks for Command-Line Agents
cs.AI 2026-05 conditional novelty 7.0

ClawForge supplies a generator that turns scenario templates into reproducible command-line tasks testing state conflict handling, where the strongest frontier model scores only 45.3 percent strict accuracy.
