ClawForge: Generating Executable Interactive Benchmarks for Command-Line Agents

Cihang Xie; Fang Wu; Haonian Ji; Huaxiu Yao; Jiaqi Liu; Jike Zhong; Kaide Zeng; Kaiwen Xiong; Peng Xia; Yuxiang Lai

arxiv: 2605.14133 · v2 · pith:CZZMY6FBnew · submitted 2026-05-13 · 💻 cs.AI

Yuxiang Lai, Peng Xia, Haonian Ji, Kaiwen Xiong, Kaide Zeng, Jiaqi Liu, Fang Wu, Jike Zhong, Zeyu Zheng, Cihang Xie, Huaxiu Yao

Pith reviewed 2026-05-20 20:14 UTC · model grok-4.3

classification 💻 cs.AI

keywords: command-line agents · interactive benchmarks · state conflict · agent evaluation · executable workflows · benchmark generation · frontier models

The pith

ClawForge generates executable command-line benchmarks that test agents on workflows with pre-existing state conflicts.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces ClawForge, a generator-backed framework that compiles scenario templates, initialized states with conflicts, reference trajectories, and validators into reproducible tasks for command-line agents. It shifts evaluation from clean starting states and exact trajectory matches to persistent surfaces judged by normalized end-state accuracy and observable side effects. The resulting ClawForge-Bench contains 17 scenarios across six ability categories. When seven frontier models are tested, the strongest reaches only 45.3 percent strict accuracy, wrong-state replacement stays below 17 percent for all, and the largest performance spread arises from whether agents first inspect existing state. Partial-credit and step-efficiency analyses further show many failures are near-misses rather than immediate breakdowns.

Core claim

ClawForge compiles scenario templates, grounded slots, initialized state, reference trajectories, and validators into reproducible task specifications for command-line workflows under state conflict, then evaluates agents step by step over persistent workflow surfaces using normalized end state and observable side effects rather than exact trajectory matching, instantiated as the ClawForge-Bench with 17 scenarios in 6 ability categories.

What carries the argument

The ClawForge generator that turns scenario templates and initialized conflicting states into executable tasks evaluated by normalized end-state matching.
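To make the compilation step concrete, here is a minimal Python sketch of how a scenario template might be grounded into one task specification. It is an illustration under stated assumptions, not the paper's implementation: the `TaskSpec` fields are one reading of the tuple τ = (x, S0, C⋆, E, m) from the Figure 2 caption, and the template format, the `compile_task` helper, and the toy `set` command are all hypothetical.

```python
from dataclasses import dataclass
from string import Template
from typing import Callable, Dict, List

State = Dict[str, str]  # hypothetical state model: path -> file content


@dataclass(frozen=True)
class TaskSpec:
    instruction: str                           # x: rendered instruction (assumed mapping)
    initial_state: State                       # S0: initialized, possibly conflicting state
    reference_commands: List[str]              # C*: reference trajectory
    validators: List[Callable[[State], bool]]  # E: checks over the end state (assumed)
    metadata: Dict[str, str]                   # m: bookkeeping (assumed)


def compile_task(template: dict, slots: Dict[str, str]) -> TaskSpec:
    """Ground slot values into a scenario template and emit one task spec."""

    def fill(text: str) -> str:
        return Template(text).substitute(slots)

    expected = {fill(k): fill(v) for k, v in template["expected_state"].items()}
    # One validator per expected file; defaults bind k and v at definition time.
    validators = [
        (lambda state, k=k, v=v: state.get(k) == v) for k, v in expected.items()
    ]
    return TaskSpec(
        instruction=fill(template["instruction"]),
        initial_state={fill(k): fill(v) for k, v in template["initial_state"].items()},
        reference_commands=[fill(c) for c in template["reference_commands"]],
        validators=validators,
        metadata={"scenario": template["name"]},
    )


# Hypothetical scenario: the config file already holds a stale value.
TEMPLATE = {
    "name": "stale_config_repair",
    "instruction": "Set the release channel in $cfg to $channel.",
    "initial_state": {"$cfg": "channel=$stale"},
    "expected_state": {"$cfg": "channel=$channel"},
    "reference_commands": ["cat $cfg", "set $cfg channel=$channel"],
}

task = compile_task(TEMPLATE, {"cfg": "app.conf", "channel": "stable", "stale": "beta"})
print(task.instruction)
print(task.initial_state)
```

Because the slots are grounded at compile time, the same template can yield many reproducible task instances by varying the slot dictionary.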

If this is right

Frontier models reach at most 45.3 percent strict accuracy when tasks include pre-existing state conflicts.
Accuracy on wrong-state replacement scenarios remains below 17 percent across all tested models.
The largest performance gap (17 percent to 90 percent) occurs between agents that inspect existing state and those that do not.
Many failures appear as near-miss closures rather than early breakdowns.
Models display qualitatively different failure styles when operating under state conflict.
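The distinction between strict accuracy and near-miss failures can be sketched as follows. This is an assumption-laden illustration: the paper's partial-credit formula is not given on this page, so the sketch uses the fraction of passed validator checks, and both the 0.5 near-miss threshold and the run data are made up for the example.

```python
from typing import Dict, List

NEAR_MISS_THRESHOLD = 0.5  # assumption: illustrative cut-off, not from the paper


def score_run(checks: Dict[str, bool]) -> Dict[str, float]:
    """Score one run from its named validator outcomes."""
    passed = sum(checks.values())
    return {
        "strict": 1.0 if passed == len(checks) else 0.0,  # all checks must pass
        "partial": passed / len(checks),                  # fraction of checks passed
    }


def summarize(runs: List[Dict[str, bool]]) -> Dict[str, float]:
    """Aggregate strict accuracy, mean partial credit, and near-miss share."""
    scores = [score_run(run) for run in runs]
    failures = [s for s in scores if s["strict"] == 0.0]
    near_misses = [s for s in failures if s["partial"] >= NEAR_MISS_THRESHOLD]
    return {
        "strict_accuracy": sum(s["strict"] for s in scores) / len(scores),
        "mean_partial_credit": sum(s["partial"] for s in scores) / len(scores),
        "near_miss_share_of_failures": (
            len(near_misses) / len(failures) if failures else 0.0
        ),
    }


# Made-up runs: one full success, one near-miss closure, one early breakdown.
RUNS = [
    {"state_ok": True, "side_effect_ok": True, "workflow_closed": True},
    {"state_ok": True, "side_effect_ok": True, "workflow_closed": False},
    {"state_ok": False, "side_effect_ok": False, "workflow_closed": False},
]

summary = summarize(RUNS)
print(summary)
```

Under strict scoring the second and third runs look identical, while partial credit separates a run that failed only at closure from one that failed at every check.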

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The generator approach could be applied to create similar benchmarks for other persistent environments such as file-system or database agents.
Explicit training on state-inspection behaviors might narrow the observed performance gaps between models.
Extending the set of scenarios beyond 17 could expose additional ability categories that current agents lack.

Load-bearing premise

The 17 scenarios and their validators accurately represent the range of state conflicts that arise in realistic command-line workflows and that normalized end-state matching captures task success without missing important behavioral differences.

What would settle it

A new model achieving over 80 percent strict accuracy on the 17 scenarios while rarely inspecting existing state before acting would challenge the claim that state inspection drives the widest performance differences.
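Running this test requires some operational definition of "inspecting existing state before acting". The sketch below is one rough way to do it, assuming a trajectory is a list of shell command strings. The `READ_ONLY` and `MUTATING` sets are hypothetical and deliberately small; the heuristic ignores redirections, pipes, and flags such as `sed -i`, so it is a starting point rather than a faithful classifier.

```python
import shlex
from typing import List

# Assumption: coarse, hand-picked command classes for illustration only.
READ_ONLY = {"ls", "cat", "stat", "grep", "head", "find"}
MUTATING = {"rm", "mv", "cp", "touch", "mkdir", "tee", "sed"}


def inspects_before_acting(commands: List[str]) -> bool:
    """Return True if a read-only command precedes the first mutating one."""
    for command in commands:
        tokens = shlex.split(command)
        if not tokens:
            continue
        if tokens[0] in READ_ONLY:
            return True
        if tokens[0] in MUTATING:
            return False
    return False  # no recognized command at all counts as "did not inspect"


def inspection_rate(trajectories: List[List[str]]) -> float:
    """Fraction of trajectories that inspect state before mutating it."""
    if not trajectories:
        return 0.0
    return sum(inspects_before_acting(t) for t in trajectories) / len(trajectories)


# Made-up trajectories for demonstration.
TRAJECTORIES = [
    ["ls -la", "cat release.lock", "rm release.lock"],
    ["rm release.lock", "touch release.lock"],
]
print(inspection_rate(TRAJECTORIES))
```

A model with strict accuracy above 80 percent and a low `inspection_rate` on the same runs would be the counterexample described above.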

Figures

Figures reproduced from arXiv: 2605.14133 by Cihang Xie, Fang Wu, Haonian Ji, Huaxiu Yao, Jiaqi Liu, Jike Zhong, Kaide Zeng, Kaiwen Xiong, Peng Xia, Yuxiang Lai, Zeyu Zheng.

**Figure 1.** ClawForge-Bench benchmark coverage. Inner: 6 primary ability categories. Outer: 17 scenario families within each category.

**Figure 2.** Automated benchmark generation pipeline. Scenario templates are compiled into executable task specifications τ = (x, S0, C⋆, E, m) through slot grounding, state initialization, instruction rendering, reference command synthesis, and validator generation.

**Figure 3.** Interactive execution and evaluation loop. Agents emit commands step by step; the environment executes them, records state changes and effect traces, and merges everything into a normalized evaluation state Ŝ for result-first scoring.

**Figure 4.** Per-scenario strict accuracy. Duplicate-aware scenarios are near saturation, while state repair, release gating, and wrong-state replacement remain challenging across all models.

Original abstract

Interactive agent benchmarks face a tension between scalable construction and realistic workflow evaluation. Hand-authored tasks are expensive to extend and revise, while static prompt evaluation misses failures that only appear when agents operate over persistent state. Existing interactive benchmarks have advanced agent evaluation significantly, but most initialize tasks from clean state and do not systematically test how agents handle pre-existing partial, stale, or conflicting artifacts. We present ClawForge, a generator-backed benchmark framework for executable command-line workflows under state conflict. The framework compiles scenario templates, grounded slots, initialized state, reference trajectories, and validators into reproducible task specifications, and evaluates agents step by step over persistent workflow surfaces using normalized end state and observable side effects rather than exact trajectory matching. We instantiate this framework as the ClawForge-Bench (17 scenarios, 6 ability categories). Results across seven frontier models show that the best model reaches only 45.3% strict accuracy, wrong-state replacement remains below 17% for all models, and the widest model separation (17% to 90%) is driven by whether agents inspect existing state before acting. Partial-credit and step-efficiency analyses further reveal that many failures are near-miss closures rather than early breakdowns, and that models exhibit qualitatively different failure styles under state conflict.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

ClawForge gives a generator for CLI agent benchmarks that include pre-existing state conflicts and scores on end-state plus side effects instead of trajectories, but the 17 scenarios leave the headline numbers on shaky ground.

The main thing to know is that this paper builds a generator framework called ClawForge for creating executable command-line tasks that start with partial, stale, or conflicting state. It then evaluates agents on whether they reach the right final state and produce the right side effects, rather than checking every step against a reference path. The reported results show the strongest model at 45.3% strict accuracy, with models differing most on whether they inspect existing state before changing anything, and wrong-state replacement staying low across the board.
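The difference between checking a reference path and checking the final state can be shown with a toy in-memory environment. Everything here is hypothetical: the three-verb command language (`read`, `write`, `delete`), the rule that `.log` files are volatile and ignored during normalization, and the example trajectories. It is a sketch of the general idea, not of ClawForge's evaluator.

```python
from typing import Dict, List, Set, Tuple

State = Dict[str, str]
VOLATILE_SUFFIXES = (".log",)  # assumption: artifacts ignored by normalization


def execute(state: State, commands: List[str]) -> Tuple[State, List[Tuple[str, str]]]:
    """Run toy commands on a copy of the state and record an effect trace."""
    state = dict(state)
    effects: List[Tuple[str, str]] = []
    for command in commands:
        op, *args = command.split()
        if op == "write":
            path, value = args
            state[path] = value
        elif op == "delete":
            state.pop(args[0], None)
        elif op != "read":
            raise ValueError(f"unknown command: {command}")
        effects.append((op, args[0]))
    return state, effects


def normalize(state: State) -> State:
    """Drop volatile files so incidental artifacts do not affect scoring."""
    return {p: v for p, v in state.items() if not p.endswith(VOLATILE_SUFFIXES)}


def mutated_paths(effects: List[Tuple[str, str]]) -> Set[str]:
    """Observable side effects: non-volatile paths that were written or deleted."""
    return {p for op, p in effects if op != "read" and not p.endswith(VOLATILE_SUFFIXES)}


initial = {"report.txt": "draft", "report.tmp": "stale"}
reference = ["delete report.tmp", "write report.txt final"]
agent = ["read report.txt", "write report.txt final",
         "delete report.tmp", "write debug.log done"]

ref_state, ref_effects = execute(initial, reference)
agent_state, agent_effects = execute(initial, agent)

trajectory_match = agent == reference
end_state_match = normalize(agent_state) == normalize(ref_state)
side_effect_match = mutated_paths(agent_effects) == mutated_paths(ref_effects)
print(trajectory_match, end_state_match, side_effect_match)
```

The agent inspects first, reorders the two mutations, and leaves a log file behind, so exact trajectory matching rejects it even though the normalized end state and the set of mutated paths agree with the reference.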

Referee Report

2 major / 2 minor

Summary. The paper introduces ClawForge, a generator-backed framework that compiles scenario templates, grounded slots, initialized state, reference trajectories, and validators into reproducible executable CLI tasks. It instantiates the framework as ClawForge-Bench (17 scenarios across 6 ability categories) and evaluates seven frontier models using normalized end-state and side-effect matching rather than exact trajectory matching. Key reported results are a maximum strict accuracy of 45.3%, wrong-state replacement below 17% for all models, and the largest performance gap (17%–90%) attributable to whether agents inspect existing state before acting.

Significance. If the benchmark construction and validators are shown to be representative, the work supplies a needed evaluation tool for agent behavior under persistent, conflicting CLI state. The emphasis on reproducible generation, partial-credit analysis, and qualitative failure-style differences adds concrete data on near-miss versus early-breakdown failures that static or clean-state benchmarks miss.

major comments (2)

[§4 and §5.1] §4 (Benchmark Construction) and §5.1 (Evaluation Protocol): The central performance claims (45.3% max accuracy, <17% wrong-state replacement, inspection-driven separation) rest on the assumption that the 17 hand-curated scenarios plus their validators adequately sample realistic pre-existing partial/stale/conflicting state. No coverage analysis, sampling justification, or comparison against common conflict patterns (permissions, version skew, concurrent modification) is provided, making it impossible to determine whether the reported numbers generalize or are artifacts of the chosen templates.
[§5.2] §5.2 (Validator Definition): The normalized end-state matching rule is described as capturing task success via observable side effects, yet the manuscript supplies no sensitivity analysis showing how changes to the normalization function or validator thresholds would affect the strict-accuracy and wrong-state-replacement statistics. This is load-bearing because the headline model-separation result depends on these scoring decisions.
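A sensitivity analysis of the kind requested in the second major comment could start from something like the sketch below: rescore the same runs under several normalization functions and report how strict accuracy moves. The normalizers, the run data, and the per-key string comparison are all assumptions made for illustration; the paper's actual normalization function is not described on this page.

```python
from typing import Callable, Dict, List, Tuple

State = Dict[str, str]
Run = Tuple[State, State]  # (final state, expected state)

# Assumption: three candidate normalizers, from strictest to most lenient.
NORMALIZERS: Dict[str, Callable[[str], str]] = {
    "exact": lambda s: s,
    "strip": lambda s: s.strip(),
    "strip+casefold": lambda s: s.strip().casefold(),
}


def strict_accuracy(runs: List[Run], normalize: Callable[[str], str]) -> float:
    """Share of runs where every expected key matches after normalization."""
    hits = 0
    for final, expected in runs:
        if all(
            key in final and normalize(final[key]) == normalize(value)
            for key, value in expected.items()
        ):
            hits += 1
    return hits / len(runs)


# Made-up runs that differ only in whitespace, case, or content.
RUNS: List[Run] = [
    ({"status": "ok"}, {"status": "ok"}),
    ({"status": "ok\n"}, {"status": "ok"}),
    ({"status": "OK"}, {"status": "ok"}),
    ({"status": "bad"}, {"status": "ok"}),
]

sensitivity = {name: strict_accuracy(RUNS, fn) for name, fn in NORMALIZERS.items()}
for name, accuracy in sensitivity.items():
    print(f"{name:>15}: {accuracy:.2f}")
```

If model rankings stay the same across such variants, the separation result is robust to the scoring choice; if they flip, the normalization is doing load-bearing work.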

minor comments (2)

[§3] The abstract and §3 mention “normalized end state and observable side effects” but the precise normalization procedure and side-effect logging format are not stated until later sections; moving a concise definition to §3 would improve readability.
[Table 2] Table 2 (model results) reports strict accuracy and wrong-state replacement but does not include confidence intervals or statistical significance tests for the 17–90% separation; adding these would strengthen the quantitative claims.
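For the confidence intervals requested in the Table 2 comment, a Wilson score interval is one standard option for a binomial accuracy. The sketch below is generic and not tied to the paper's data: the counts in the example are hypothetical, since the number of task instances per model is not stated on this page.

```python
import math
from typing import Tuple


def wilson_interval(successes: int, trials: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a binomial proportion (z=1.96 gives about 95%)."""
    if trials <= 0:
        raise ValueError("trials must be positive")
    if not 0 <= successes <= trials:
        raise ValueError("successes must lie between 0 and trials")
    p = successes / trials
    z2 = z * z
    denom = 1.0 + z2 / trials
    center = (p + z2 / (2 * trials)) / denom
    half_width = z * math.sqrt(p * (1 - p) / trials + z2 / (4 * trials * trials)) / denom
    # Clamp to [0, 1] to absorb floating-point drift at the boundaries.
    return max(0.0, center - half_width), min(1.0, center + half_width)


# Hypothetical counts: 45 strict successes out of 100 task instances.
low, high = wilson_interval(45, 100)
print(f"95% interval: {low:.3f} to {high:.3f}")
```

Unlike the simpler normal approximation, the Wilson interval stays inside [0, 1] and behaves sensibly for accuracies near the extremes, which matters for scenarios where models score close to 0 or close to saturation.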

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive feedback on the benchmark's representativeness and the robustness of our evaluation protocol. We address each major comment below, indicating the revisions we plan to make.

Referee: [§4 and §5.1] §4 (Benchmark Construction) and §5.1 (Evaluation Protocol): The central performance claims (45.3% max accuracy, <17% wrong-state replacement, inspection-driven separation) rest on the assumption that the 17 hand-curated scenarios plus their validators adequately sample realistic pre-existing partial/stale/conflicting state. No coverage analysis, sampling justification, or comparison against common conflict patterns (permissions, version skew, concurrent modification) is provided, making it impossible to determine whether the reported numbers generalize or are artifacts of the chosen templates.

Authors: We agree that additional justification for the scenario selection would improve the manuscript. The scenarios in ClawForge-Bench were hand-curated to cover six distinct ability categories that reflect common sources of state conflict in CLI environments. In the revised manuscript, we will expand §4 to include a detailed rationale for scenario selection, with explicit mappings to real-world conflict patterns such as permission issues, version skew, and concurrent modifications drawn from common system administration and development workflows. We will also discuss the limitations of the current set and suggest how future extensions could incorporate broader sampling. This addresses the concern about generalizability while maintaining the focus on executable, reproducible tasks. revision: yes
Referee: [§5.2] §5.2 (Validator Definition): The normalized end-state matching rule is described as capturing task success via observable side effects, yet the manuscript supplies no sensitivity analysis showing how changes to the normalization function or validator thresholds would affect the strict-accuracy and wrong-state-replacement statistics. This is load-bearing because the headline model-separation result depends on these scoring decisions.

Authors: The normalized end-state matching is intended to evaluate functional correctness through observable outcomes rather than precise command sequences, which is appropriate for assessing agent performance under state conflict. We acknowledge the value of a sensitivity analysis. In the revision, we will incorporate a sensitivity study in §5.2 that varies the normalization parameters and validator thresholds to show their impact on the strict accuracy and wrong-state replacement metrics. This will provide evidence that the reported model separations are not overly sensitive to specific threshold choices. revision: yes

Circularity Check

0 steps flagged

No circularity in empirical benchmark construction and evaluation

The paper constructs ClawForge as a generator-backed framework that compiles scenario templates, grounded slots, initialized state, reference trajectories, and validators into executable tasks. It then instantiates this as ClawForge-Bench with 17 scenarios and evaluates seven frontier models directly on them, reporting empirical outcomes such as 45.3% strict accuracy, wrong-state replacement below 17%, and model separation driven by pre-action inspection. These results follow from running the agents step-by-step over persistent state and applying the explicitly defined normalized end-state and side-effect matching rules; no equations, fitted parameters, or self-referential definitions reduce the reported metrics to the inputs by construction. The work contains no load-bearing self-citations, uniqueness theorems, or ansatzes that would create circularity. This is a self-contained empirical benchmark paper whose central claims are externally falsifiable through the released scenarios and validators.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The framework rests on the assumption that command-line environments can be reproducibly initialized and that validators based on final state and side effects are sufficient to judge task completion.

axioms (1)

Domain assumption: Command-line tools produce deterministic outputs for a given initial state and sequence of commands.
Required to define reference trajectories and to treat end-state matching as a reliable success signal.
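The determinism assumption can be probed empirically by replaying the same commands from the same initial state and comparing a hash of the resulting files. The sketch below is one hypothetical way to do that. It uses small Python one-liners as stand-ins for command-line tools so that it runs on any platform, and the helper names (`snapshot`, `run_once`, `is_deterministic`) are invented for this example.

```python
import hashlib
import subprocess
import sys
import tempfile
from pathlib import Path
from typing import Dict, List


def snapshot(root: Path) -> str:
    """Hash relative paths and contents of all files under root."""
    digest = hashlib.sha256()
    for path in sorted(p for p in root.rglob("*") if p.is_file()):
        digest.update(path.relative_to(root).as_posix().encode())
        digest.update(b"\0")
        digest.update(path.read_bytes())
        digest.update(b"\0")
    return digest.hexdigest()


def run_once(initial_files: Dict[str, str], commands: List[List[str]]) -> str:
    """Initialize a fresh sandbox, run the commands, and return the state hash."""
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        for name, content in initial_files.items():
            (root / name).write_text(content)
        for argv in commands:
            subprocess.run(argv, cwd=root, check=True, capture_output=True)
        return snapshot(root)


def is_deterministic(initial_files: Dict[str, str], commands: List[List[str]],
                     repeats: int = 3) -> bool:
    """True if every replay from the same initial state yields the same hash."""
    return len({run_once(initial_files, commands) for _ in range(repeats)}) == 1


INITIAL = {"config.txt": "channel=beta"}
STABLE = [[sys.executable, "-c",
           "import pathlib; p = pathlib.Path('config.txt'); "
           "p.write_text(p.read_text().replace('beta', 'stable'))"]]
RANDOM = [[sys.executable, "-c",
           "import pathlib, uuid; pathlib.Path('id.txt').write_text(uuid.uuid4().hex)"]]

stable_ok = is_deterministic(INITIAL, STABLE)
random_ok = is_deterministic(INITIAL, RANDOM)
print(stable_ok, random_ok)
```

Commands that embed timestamps, random identifiers, or network responses would fail this check, which is where a normalization step or a volatile-field filter becomes necessary for end-state matching to remain a reliable signal.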

pith-pipeline@v0.9.0 · 5784 in / 1273 out tokens · 43865 ms · 2026-05-20T20:14:27.781613+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel
Relation between the paper passage and the cited Recognition theorem: unclear
Paper passage: "ClawForge compiles scenario templates, grounded slots, initialized state, reference trajectories, and validators into reproducible task specifications, and evaluates agents step by step over persistent workflow surfaces using normalized end state and observable side effects rather than exact trajectory matching."

IndisputableMonolith/Foundation/AbsoluteFloorClosure.lean · reality_from_one_distinction
Relation between the paper passage and the cited Recognition theorem: unclear
Paper passage: "We instantiate this framework as the ClawForge-Bench (17 scenarios, 6 ability categories)."

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

101 extracted references · 101 canonical work pages · 43 internal anchors

[1]

A-MEM: Agentic Memory for LLM Agents

A-mem: Agentic memory for llm agents , author=. arXiv preprint arXiv:2502.12110 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[2]

Zep: A Temporal Knowledge Graph Architecture for Agent Memory

Zep: a temporal knowledge graph architecture for agent memory , author=. arXiv preprint arXiv:2501.13956 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[3]

Proceedings of the AAAI conference on artificial intelligence , volume=

Memorybank: Enhancing large language models with long-term memory , author=. Proceedings of the AAAI conference on artificial intelligence , volume=

work page
[4]

Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing , pages=

Memory os of ai agent , author=. Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing , pages=

work page 2025
[5]

, author=

MemGPT: towards LLMs as operating systems. , author=. 2023 , publisher=

work page 2023
[6]

ClawEnvKit: Automatic Environment Generation for Claw-Like Agents

ClawEnvKit: Automatic Environment Generation for Claw-Like Agents , author=. arXiv preprint arXiv:2604.18543 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[7]

Claw-Eval: Towards Trustworthy Evaluation of Autonomous Agents

Claw-Eval: Toward Trustworthy Evaluation of Autonomous Agents , author=. arXiv preprint arXiv:2604.06132 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[8]

Terminal-Bench: Benchmarking Agents on Hard, Realistic Tasks in Command Line Interfaces

Terminal-bench: Benchmarking agents on hard, realistic tasks in command line interfaces , author=. arXiv preprint arXiv:2601.11868 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[9]

Nous Research , title=

work page
[10]

ClawsBench: Evaluating Capability and Safety of LLM Productivity Agents in Simulated Workspaces

ClawsBench: Evaluating Capability and Safety of LLM Productivity Agents in Simulated Workspaces , author=. arXiv preprint arXiv:2604.05172 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[11]

ClawArena: Benchmarking AI Agents in Evolving Information Environments

ClawArena: Benchmarking AI Agents in Evolving Information Environments , author=. arXiv preprint arXiv:2604.04202 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[12]

Advances in neural information processing systems , volume=

Judging llm-as-a-judge with mt-bench and chatbot arena , author=. Advances in neural information processing systems , volume=

work page
[13]

ACM transactions on intelligent systems and technology , volume=

A survey on evaluation of large language models , author=. ACM transactions on intelligent systems and technology , volume=. 2024 , publisher=

work page 2024
[14]

ToolLLM: Facilitating Large Language Models to Master 16000+ Real-world APIs

Toolllm: Facilitating large language models to master 16000+ real-world apis , author=. arXiv preprint arXiv:2307.16789 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[15]

Advances in neural information processing systems , volume=

Toolformer: Language models can teach themselves to use tools , author=. Advances in neural information processing systems , volume=

work page
[16]

The Twelfth International Conference on Learning Representations , year=

Gaia: a benchmark for general ai assistants , author=. The Twelfth International Conference on Learning Representations , year=

work page
[17]

Advances in Neural Information Processing Systems , volume=

Osworld: Benchmarking multimodal agents for open-ended tasks in real computer environments , author=. Advances in Neural Information Processing Systems , volume=

work page
[18]

AgentBench: Evaluating LLMs as Agents

Agentbench: Evaluating llms as agents , author=. arXiv preprint arXiv:2308.03688 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[19]

SWE-bench: Can Language Models Resolve Real-World GitHub Issues?

Swe-bench: Can language models resolve real-world github issues? , author=. arXiv preprint arXiv:2310.06770 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[20]

WebArena: A Realistic Web Environment for Building Autonomous Agents

Webarena: A realistic web environment for building autonomous agents , author=. arXiv preprint arXiv:2307.13854 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[21]

Proceedings of the National Academy of Sciences , volume=

Overcoming catastrophic forgetting in neural networks , author=. Proceedings of the National Academy of Sciences , volume=

work page
[22]

2026 , organization =

Liu, Jiaqi and Xia, Peng and Han, Siwei and Qiu, Shi and Zhang, Letian and Chen, Guiming and Tu, Haoqin and Yang, Xinyu and and Zhou, Jiawei and Zhu, Hongtu and Li, Yun and Zheng, Zeyu and Xie, Cihang and Ding, Mingyu and Yao, Huaxiu , title =. 2026 , organization =

work page 2026
[23]

Advances in Neural Information Processing Systems , year=

Gradient Episodic Memory for Continual Learning , author=. Advances in Neural Information Processing Systems , year=

work page
[24]

Efficient Lifelong Learning with

Chaudhry, Arslan and Ranzato, Marc'Aurelio and Rohrbach, Marcus and Elhoseiny, Mohamed , booktitle=. Efficient Lifelong Learning with

work page
[25]

International Conference on Machine Learning , year=

Continual Learning Through Synaptic Intelligence , author=. International Conference on Machine Learning , year=

work page
[26]

RL$^2$: Fast Reinforcement Learning via Slow Reinforcement Learning

RL ^2 : Fast Reinforcement Learning via Slow Reinforcement Learning , author=. arXiv preprint arXiv:1611.02779 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[27]

International Conference on Machine Learning , year=

Efficient Off-Policy Meta-Reinforcement Learning via Probabilistic Context Variables , author=. International Conference on Machine Learning , year=

work page
[28]

International Conference on Learning Representations , year=

ProMP: Proximal Meta-Policy Search , author=. International Conference on Learning Representations , year=

work page
[29]

Advances in Neural Information Processing Systems , year=

Online Structured Meta-learning , author=. Advances in Neural Information Processing Systems , year=

work page
[30]

Thinking Machines Lab TML , title=

work page
[31]

Kimi K2.5: Visual Agentic Intelligence

Kimi K2. 5: Visual Agentic Intelligence , author=. arXiv preprint arXiv:2602.02276 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[32]

Findings of the Association for Computational Linguistics: ACL 2025 , pages=

The lessons of developing process reward models in mathematical reasoning , author=. Findings of the Association for Computational Linguistics: ACL 2025 , pages=

work page 2025
[33]

The eleventh international conference on learning representations , year=

React: Synergizing reasoning and acting in language models , author=. The eleventh international conference on learning representations , year=

work page
[34]

MemEvolve: Meta-Evolution of Agent Memory Systems

MemEvolve: Meta-Evolution of Agent Memory Systems , author=. arXiv preprint arXiv:2512.18746 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[35]

MemRL: Self-Evolving Agents via Runtime Reinforcement Learning on Episodic Memory

MemRL: Self-Evolving Agents via Runtime Reinforcement Learning on Episodic Memory , author=. arXiv preprint arXiv:2601.03192 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[36]

Mem-{\alpha}: Learning Memory Construction via Reinforcement Learning

Mem- \ alpha \ : Learning memory construction via reinforcement learning , author=. arXiv preprint arXiv:2509.25911 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[37]

LoRA: Low-Rank Adaptation of Large Language Models

LoRA: Low-Rank Adaptation of Large Language Models , author=. arXiv preprint arXiv:2106.09685 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[38]

Agentic Reinforced Policy Optimization

Agentic reinforced policy optimization , author=. arXiv preprint arXiv:2507.19849 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[39]

arXiv preprint arXiv:2511.14460 , year=

Agent-r1: Training powerful llm agents with end-to-end reinforcement learning , author=. arXiv preprint arXiv:2511.14460 , year=

work page arXiv
[40]

Advances in neural information processing systems , volume=

Continuous meta-learning without tasks , author=. Advances in neural information processing systems , volume=

work page
[41]

Deep Online Learning via Meta-Learning: Continual Adaptation for Model-Based RL

Deep online learning via meta-learning: Continual adaptation for model-based RL , author=. arXiv preprint arXiv:1812.07671 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[42]

International conference on machine learning , pages=

Online meta-learning , author=. International conference on machine learning , pages=. 2019 , organization=

work page 2019
[43]

Proceedings of the 2024 conference on empirical methods in natural language processing , pages=

A survey on in-context learning , author=. Proceedings of the 2024 conference on empirical methods in natural language processing , pages=

work page 2024
[44]

Agent Workflow Memory

Agent workflow memory , author=. arXiv preprint arXiv:2409.07429 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[45]

G- memory: Tracing hierarchical memory for multi-agent systems, 2025

G-Memory: Tracing Hierarchical Memory for Multi-Agent Systems , author=. arXiv preprint arXiv:2506.07398 , year=

work page arXiv
[46]

Advances in neural information processing systems , volume=

Training language models to follow instructions with human feedback , author=. Advances in neural information processing systems , volume=

work page
[47]

A Survey of Self-Evolving Agents: What, When, How, and Where to Evolve on the Path to Artificial Super Intelligence

A survey of self-evolving agents: On path to artificial super intelligence , author=. arXiv preprint arXiv:2507.21046 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[48]

Agent0: Unleashing self-evolving agents from zero data via tool-integrated reasoning

Agent0: Unleashing self-evolving agents from zero data via tool-integrated reasoning , author=. arXiv preprint arXiv:2511.16043 , year=

work page arXiv
[49]

Agent0 -vl: Exploring self -evolving agent for tool -integrated vision -language reasoning

Agent0-VL: Exploring Self-Evolving Agent for Tool-Integrated Vision-Language Reasoning , author=. arXiv preprint arXiv:2511.19900 , year=

work page arXiv
[50]

Neural networks , volume=

Continual lifelong learning with neural networks: A review , author=. Neural networks , volume=. 2019 , publisher=

work page 2019
[51]

Proceedings of the IEEE/CVF conference on computer vision and pattern recognition , pages=

Learning to prompt for continual learning , author=. Proceedings of the IEEE/CVF conference on computer vision and pattern recognition , pages=

work page
[52]

IEEE transactions on pattern analysis and machine intelligence , volume=

A comprehensive survey of continual learning: Theory, method and application , author=. IEEE transactions on pattern analysis and machine intelligence , volume=. 2024 , publisher=

work page 2024
[53]

IEEE transactions on pattern analysis and machine intelligence , volume=

Meta-learning in neural networks: A survey , author=. IEEE transactions on pattern analysis and machine intelligence , volume=. 2021 , publisher=

work page 2021
[54]

On First-Order Meta-Learning Algorithms

On first-order meta-learning algorithms , author=. arXiv preprint arXiv:1803.02999 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[55]

DAPO: An Open-Source LLM Reinforcement Learning System at Scale

Dapo: An open-source llm reinforcement learning system at scale , author=. arXiv preprint arXiv:2503.14476 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[56]

WebGPT: Browser-assisted question-answering with human feedback

Webgpt: Browser-assisted question-answering with human feedback , author=. arXiv preprint arXiv:2112.09332 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[57]

International conference on machine learning , pages=

Model-agnostic meta-learning for fast adaptation of deep networks , author=. International conference on machine learning , pages=. 2017 , organization=

work page 2017
[58]

Advances in Neural Information Processing Systems , volume=

Meta-learning with an adaptive task scheduler , author=. Advances in Neural Information Processing Systems , volume=

work page
[59]

Group Sequence Policy Optimization

Group sequence policy optimization , author=. arXiv preprint arXiv:2507.18071 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[60]

MemSkill: Learning and Evolving Memory Skills for Self-Evolving Agents

MemSkill: Learning and Evolving Memory Skills for Self-Evolving Agents , author=. arXiv preprint arXiv:2602.02474 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[61]

SimpleMem: Efficient Lifelong Memory for LLM Agents

SimpleMem: Efficient Lifelong Memory for LLM Agents , author=. arXiv preprint arXiv:2601.02553 , year=

work page internal anchor Pith review arXiv
[62]

arXiv preprint arXiv:2311.10538 , year=

Testing language model agents safely in the wild , author=. arXiv preprint arXiv:2311.10538 , year=

work page arXiv
[63]

Agents in the Wild: Safety, Security, and Beyond , author=

work page
[64]

Zhang, J

AgenTracer: Who Is Inducing Failure in the LLM Agentic Systems? , author=. arXiv preprint arXiv:2509.03312 , year=

work page arXiv
[65]

SkillRL: Evolving Agents via Recursive Skill-Augmented Reinforcement Learning

SkillRL: Evolving Agents via Recursive Skill-Augmented Reinforcement Learning , author=. arXiv preprint arXiv:2602.08234 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[66]

Memory in the Age of AI Agents

Memory in the Age of AI Agents , author=. arXiv preprint arXiv:2512.13564 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[67] ReasoningBank: Scaling Agent Self-Evolving with Reasoning Memory. arXiv preprint arXiv:2509.25140.
[68] Agent KB: Leveraging cross-domain experience for agentic problem solving. arXiv preprint arXiv:2507.06229.
[69] CAMEL: Communicative agents for "mind" exploration of large language model society. Advances in Neural Information Processing Systems.
[70] Back to Basics: Revisiting REINFORCE-Style Optimization for Learning from Human Feedback in LLMs. Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers).
[71] Gemini 2.5: Pushing the Frontier with Advanced Reasoning, Multimodality, Long Context, and Next Generation Agentic Capabilities. arXiv preprint arXiv:2507.06261.
[72] EvolveR: Self-Evolving LLM Agents through an Experience-Driven Lifecycle. arXiv preprint arXiv:2510.16079.
[73] Memp: Exploring Agent Procedural Memory. arXiv preprint arXiv:2508.06433.
[74] DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning. arXiv preprint arXiv:2501.12948.
[75] Group-in-Group Policy Optimization for LLM Agent Training. arXiv preprint arXiv:2505.10978.
[76] OpenAI Computer-Using Agent. 2025.
[77] DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models. arXiv preprint arXiv:2402.03300.
[78] The Claude 3 Model Family: Opus, Sonnet, Haiku. 2024.
[79] Introducing the Gemini 2.5 Computer Use model. 2025.
[80] OpenAI Deep Research System Card. 2025.
