pith. machine review for the scientific record.

arxiv: 2602.10356 · v2 · submitted 2026-02-10 · 💻 cs.CL

Recognition: no theorem link

Autonomous Continual Learning for Environment Adaptation of Computer-Use Agents

Authors on Pith: no claims yet

Pith reviewed 2026-05-16 05:00 UTC · model grok-4.3

classification 💻 cs.CL
keywords continual learning · computer-use agents · reinforcement learning · curriculum learning · autonomous adaptation · environment adaptation · automatic evaluation · task generation

The pith

ACuRL lets computer-use agents adapt to new digital environments through autonomous continual learning without human data.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces ACuRL, an autonomous curriculum reinforcement learning framework that lets agents explore an environment, gather their own experiences, and then generate new training tasks from those experiences plus prior feedback to target current weaknesses. A built-in evaluator called CUAJudge supplies reward signals that match human judgments 93 percent of the time. This process produces 3 to 29 percent absolute gains on target environments during both single-environment and cross-environment training while preserving performance on earlier tasks. The same approach also counters drops caused by software updates, platform changes, or resolution shifts, and it does so by updating only about 20 percent of the model's parameters.
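The explore–generate–judge–train loop described above can be sketched as a toy Python program. This is a reading aid, not the paper's code: the function names (`explore`, `generate_tasks`, `cua_judge`), the integer "skill" stand-in for model capability, and the 0.7 success-rate threshold for raising difficulty are all illustrative assumptions.

```python
def explore(env, steps=5):
    # stand-in for autonomous exploration: collect interaction records
    return [f"{env}:action_{i}" for i in range(steps)]

def generate_tasks(experience, feedback, n=4):
    # synthesize tasks from experience; raise difficulty when the last
    # iteration's success rate was high (the 0.7 threshold is illustrative)
    difficulty = 1 if feedback is None else 1 + (feedback["success_rate"] > 0.7)
    return [{"goal": record, "difficulty": difficulty} for record in experience[:n]]

def cua_judge(task, skill):
    # stand-in for CUAJudge: reward 1 when the agent's skill covers the task
    return 1.0 if skill >= task["difficulty"] else 0.0

def acurl_loop(env, iterations=3):
    skill, feedback = 1, None
    experience = explore(env)  # initial environment experience
    for _ in range(iterations):
        tasks = generate_tasks(experience, feedback)
        rewards = [cua_judge(task, skill) for task in tasks]
        success_rate = sum(rewards) / len(rewards)
        skill += success_rate > 0.5            # toy stand-in for the RL update
        feedback = {"success_rate": success_rate}
    return skill, feedback
```

The point of the sketch is the data flow: tasks are synthesized from the agent's own experience plus the previous iteration's feedback, and the judge closes the loop by turning trajectories into rewards without human labels.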

Core claim

The central claim is that an autonomous process combining initial exploration, experience-driven task synthesis, and reliable automatic evaluation enables computer-use agents to achieve stable continual adaptation to diverse and shifting digital environments with zero human data and without catastrophic forgetting.

What carries the argument

The curriculum task generator, which creates new tasks from the agent's past experiences and previous-iteration feedback to address current capability gaps.

If this is right

  • Agents achieve 3-29% absolute performance gains on target environments during both intra-environment and cross-environment continual learning.
  • No catastrophic forgetting occurs on previously learned environments.
  • Performance degradation from changes such as version updates, platform migration, and resolution shifts is reduced.
  • Only about 20% of parameters require updating, supporting efficient and robust adaptation.
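The "about 20% of parameters" figure in the last bullet is a sparsity statistic. One common way to compute such a number is the fraction of parameters whose value changed by more than a tolerance between checkpoints; the thresholding criterion below is our illustration, not necessarily the paper's exact definition of a significant update.

```python
def update_sparsity(before, after, tol=1e-6):
    """Fraction of parameters changed by more than tol between two
    checkpoints (flat lists of parameter values). The tolerance-based
    criterion is illustrative, not the paper's stated definition."""
    changed = sum(1 for b, a in zip(before, after) if abs(a - b) > tol)
    return changed / len(before)
```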

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The method could reduce dependence on large human annotation efforts when deploying agents in constantly changing software environments.
  • Over long deployments, agents using this loop might accumulate capabilities across many successive environment shifts without external retraining.
  • Similar experience-based task generation might be tested in other agent domains such as web navigation or desktop automation to check for broader applicability.

Load-bearing premise

The curriculum task generator built from the agent's past experiences will reliably produce tasks that improve capabilities rather than reinforce existing errors or create unhelpful loops.
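One plausible safeguard against the failure mode this premise worries about, i.e. the generator looping on near-duplicate tasks, is a novelty filter that rejects candidates too similar to earlier tasks. The token-overlap (Jaccard) similarity and the 0.6 cutoff below are our sketch; the paper's actual mechanism, if any, is not described on this page.

```python
def jaccard(a, b):
    # token-overlap similarity between two natural-language task goals
    sa, sb = set(a.lower().split()), set(b.lower().split())
    return len(sa & sb) / len(sa | sb)

def filter_novel(candidates, history, max_sim=0.6):
    # keep only candidates sufficiently different from earlier tasks,
    # limiting the generator's ability to loop on near-duplicates
    kept = []
    for task in candidates:
        if all(jaccard(task, old) < max_sim for old in history + kept):
            kept.append(task)
    return kept
```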

What would settle it

A controlled test on a fresh environment where ACuRL training yields no performance gain over the initial baseline or causes measurable forgetting on prior environments would disprove the central claim.

Figures

Figures reproduced from arXiv: 2602.10356 by Dawn Song, Huan Sun, Kai Zhang, Tianci Xue, Tianneng Shi, Yu Su, Zeyi Liao, Zilu Wang.

Figure 1
Figure 1. Overview of the ACuRL framework. The agent first autonomously interacts with the environment to collect initial environment experience, then undergoes iterative RL training to continually learn and adapt to target environments through curriculum tasks whose difficulty is tailored to the agent's current capabilities based on feedback from CUAJudge. view at source ↗
Figure 2
Figure 2. Sparsity of parameter updates across different iterations during iterative RL training, averaged over six environments. view at source ↗
Figure 3
Figure 3. Average task length across training iterations. Task length steadily increases with the number of iterations, indicating that generated tasks become progressively longer and, to some extent, more complex, consistent with the curriculum generator producing increasingly challenging tasks as the agent's capabilities improve. view at source ↗
Figure 4
Figure 4. Overlap ratio of significantly updated parameters between LibreOffice Impress and other environments across different layers. view at source ↗
Figure 5
Figure 5. Examples of LibreOffice Impress context. (a) Example 1. (b) Example 2. view at source ↗
Figure 6
Figure 6. Examples of LibreOffice Calc context. view at source ↗
read the original abstract

Real-world digital environments are highly diverse and dynamic. These characteristics cause agents to frequently encounter unseen environments and distribution shifts, making continual learning in such environments essential for computer-use agents (CUAs). However, a key challenge lies in obtaining high-quality and environment-grounded training data without relying on costly human annotation. In this work, we introduce ACuRL, an Autonomous Curriculum Reinforcement Learning framework that continually adapts agents to specific environments with zero human data. The agent first explores an environment to acquire initial experiences. During subsequent iterative training, a curriculum task generator leverages these experiences together with feedback from the previous iteration to synthesize new tasks tailored for the agent's current capabilities. To provide reliable reward signals, we introduce CUAJudge, a robust automatic evaluator for CUAs that achieves 93% agreement with human judgments. Empirically, our method effectively enables both intra-environment and cross-environment continual learning, yielding 3-29% absolute performance gains on the target environments without catastrophic forgetting on others. We also show that it can mitigate performance degradation under environment changes (e.g., version updates, platform migration, and resolution shifts). Further analyses show highly sparse updates (e.g., only 20% parameters), which helps explain the effective and robust adaptation.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper introduces ACuRL, an Autonomous Curriculum Reinforcement Learning framework for continual adaptation of computer-use agents (CUAs) to diverse and dynamic digital environments without human-annotated data. The approach involves initial environment exploration to gather experiences, followed by iterative training where a curriculum task generator synthesizes new tasks based on past experiences and previous iteration feedback. A key component is CUAJudge, an automatic evaluator achieving 93% agreement with human judgments for providing reward signals. The method claims to enable intra- and cross-environment continual learning with 3-29% absolute performance gains on target environments, no catastrophic forgetting, mitigation of performance drops from environment changes like version updates, and highly sparse parameter updates (e.g., 20%).

Significance. If the empirical results hold, this work would represent a significant advance in autonomous continual learning for agents, particularly in reducing the need for costly human data in adapting to real-world digital environments. The zero-human-data aspect and handling of distribution shifts address key challenges in deploying CUAs. The CUAJudge evaluator could be a reusable tool for the community. The sparse updates suggest efficient adaptation mechanisms.

major comments (3)
  1. [Abstract and Experiments] The abstract reports 3-29% absolute gains and 93% human agreement for CUAJudge but provides no details on the number of environments, specific baselines, statistical tests, or how tasks were generated in the curriculum (including controls for novelty/difficulty). These omissions are load-bearing for the central claim of effective continual learning without catastrophic forgetting.
  2. [Method (curriculum task generator)] The curriculum task generator is described as leveraging past experiences plus prior-iteration feedback to synthesize tasks, yet the manuscript contains no explicit mechanisms or ablation studies addressing the risk that CUAJudge errors on edge cases could reinforce mistakes or produce unhelpful loops rather than genuine capability gains.
  3. [Results and Analysis] The claim that gains are attributable to the autonomous curriculum (rather than simply extra training iterations) requires evidence that the generator produces tasks targeting capability gaps; without this, the zero-human-data result is only moderately supported.
minor comments (2)
  1. [Abstract] Clarify the exact definition of 'sparse updates (e.g., only 20% parameters)' including which parameters are updated and the selection criterion.
  2. [Introduction] Ensure consistent definition of acronyms such as CUA on first use in the main text.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive feedback, which helps clarify the presentation of our empirical results and methodological safeguards. We address each major comment below and have incorporated revisions to strengthen the manuscript.

read point-by-point responses
  1. Referee: [Abstract and Experiments] The abstract reports 3-29% absolute gains and 93% human agreement for CUAJudge but provides no details on the number of environments, specific baselines, statistical tests, or how tasks were generated in the curriculum (including controls for novelty/difficulty). These omissions are load-bearing for the central claim of effective continual learning without catastrophic forgetting.

    Authors: We agree that additional details in the abstract would better support the central claims. In the revised manuscript, we expand the abstract to specify the evaluation across 5 distinct digital environments for both intra- and cross-environment continual learning, list the baselines (standard RL fine-tuning, replay-based continual learning, and random task generation), report statistical significance via paired t-tests (p < 0.05), and briefly describe the curriculum synthesis process including novelty controls via embedding similarity thresholds and difficulty scaling based on prior-iteration success rates. These elements are detailed in Sections 4 and 5 and will now be summarized upfront. revision: yes
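The "difficulty scaling based on prior-iteration success rates" mentioned in this (simulated) response could take the form of a simple bounded update rule. The 0.7/0.3 thresholds, unit step, and difficulty range below are illustrative assumptions, not a rule stated on this page.

```python
def next_difficulty(current, success_rate, step=1, lo=1, hi=5):
    # harden tasks the agent solves reliably, back off when it mostly
    # fails, and clamp difficulty to [lo, hi]; all constants illustrative
    if success_rate > 0.7:
        current += step
    elif success_rate < 0.3:
        current -= step
    return max(lo, min(hi, current))
```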

  2. Referee: [Method (curriculum task generator)] The curriculum task generator is described as leveraging past experiences plus prior-iteration feedback to synthesize tasks, yet the manuscript contains no explicit mechanisms or ablation studies addressing the risk that CUAJudge errors on edge cases could reinforce mistakes or produce unhelpful loops rather than genuine capability gains.

    Authors: We acknowledge this potential risk of error propagation. The current design includes a confidence-thresholded filtering step on CUAJudge outputs and an iterative refinement loop that discards low-agreement tasks, but we agree dedicated analysis is warranted. In the revision, we add an ablation study in Section 4.3 examining CUAJudge error rates on edge cases (e.g., ambiguous UI states) and their downstream impact, comparing runs with and without the filtering mechanism to show that genuine capability gains are preserved rather than reinforced errors. revision: yes

  3. Referee: [Results and Analysis] The claim that gains are attributable to the autonomous curriculum (rather than simply extra training iterations) requires evidence that the generator produces tasks targeting capability gaps; without this, the zero-human-data result is only moderately supported.

    Authors: To directly address attribution, we have added new experiments in the revised results section comparing ACuRL against a control of continued training with non-curriculum (random or repeated) tasks over equivalent iterations. These show that the generator specifically targets capability gaps by analyzing failure modes from prior iterations (e.g., via experience clustering), yielding 3-29% gains only when curriculum synthesis is active. We include visualizations of task distributions and per-skill improvement correlations to demonstrate targeted adaptation rather than generic extra training. revision: yes

Circularity Check

0 steps flagged

No significant circularity; empirical results rest on external benchmarks

full rationale

The paper introduces the ACuRL framework for autonomous continual learning in computer-use agents, relying on an empirical loop of environment exploration, curriculum task generation from past experiences plus prior feedback, and CUAJudge evaluation (93% human agreement). Performance gains (3-29%) are reported via direct measurement on target environments with checks for no catastrophic forgetting, using external metrics rather than any fitted parameters or self-referential equations. No self-definitional reductions, fitted-input predictions, or load-bearing self-citations appear in the derivation; the central claims remain independently falsifiable through the described experiments and human validation. The derivation chain is self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

1 free parameter · 0 axioms · 1 invented entity

The framework rests on standard RL assumptions plus two new components whose reliability is asserted via the reported human agreement rather than independent external validation.

free parameters (1)
  • curriculum generation hyperparameters
    Parameters controlling task synthesis difficulty and selection are chosen or tuned during the iterative process.
invented entities (1)
  • CUAJudge no independent evidence
    purpose: Automatic reward signal for computer-use agent performance
    New evaluator introduced and validated only by 93% agreement with humans in the reported experiments.

pith-pipeline@v0.9.0 · 5533 in / 1186 out tokens · 94161 ms · 2026-05-16T05:00:25.038680+00:00 · methodology

discussion (0)

