Recovering Policy-Induced Errors: Benchmarking and Trajectory Synthesis for Robust GUI Agents

Bin Yang; Hao Jiang; Hongtao Duan; Lu Jiang; Lulu Hu; Minying Zhang; Qihua Chen; Shurui Li; Tianpeng Bu; Xin Liu

arxiv: 2605.29447 · v1 · pith:UGNRJGDSnew · submitted 2026-05-28 · 💻 cs.CV · cs.CL



Pith reviewed 2026-06-29 08:15 UTC · model grok-4.3


keywords GUI agents · error recovery · trajectory synthesis · robustness benchmark · OSWorld · policy errors · fine-tuning data · long-horizon tasks


The pith

GUI agents recover from their own errors after training on 800k synthesized trajectories, reaching 47.4 percent success on OSWorld.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

GUI agents often fail when they make mistakes and cannot fix them, which prevents reliable use in practice. The paper creates GUI-RobustEval, a set of 1216 test cases that check recovery from many different error kinds, and RoTS, a way to make 800k examples of how to recover. Training smaller and larger models on this data improves results on both the new tests and standard benchmarks. If correct, this means agents can handle longer tasks better by learning to correct their own slips instead of stopping. Readers should care because real deployment requires agents that keep going after errors without outside help.

Core claim

The paper claims that its Robustness-driven Trajectory Synthesis (RoTS) framework, using a tree-based pipeline to discover error modes and create corresponding recovery steps, produces 800k high-quality trajectories. Fine-tuning models on this data results in RoTS-7B and RoTS-32B that outperform prior approaches on GUI-RobustEval and achieve 47.4 percent success rate with 33.8 percent All-Pass@4 on OSWorld, indicating that better error recovery enhances long-horizon performance.
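The two OSWorld figures quoted above measure different things, and a small sketch may make the gap between them easier to read. It assumes, since this page does not give the definition, that All-Pass@4 counts a task only when all four independent runs succeed, and that success rate is the plain fraction of passing runs. The function name `summarize_runs` and the sample records are hypothetical.

```python
from collections import defaultdict


def summarize_runs(results, k=4):
    """Return (success_rate, all_pass_at_k) for (task_id, passed) run records.

    Assumed definitions (not taken from the paper):
      - success_rate: fraction of all runs that passed
      - all_pass_at_k: fraction of tasks whose k runs all passed
    """
    runs = defaultdict(list)
    for task_id, passed in results:
        runs[task_id].append(bool(passed))
    if not runs:
        raise ValueError("no run records given")
    for task_id, outcomes in runs.items():
        if len(outcomes) != k:
            raise ValueError(
                f"task {task_id!r} has {len(outcomes)} runs, expected {k}"
            )

    total_runs = k * len(runs)
    success_rate = sum(sum(outcomes) for outcomes in runs.values()) / total_runs
    all_pass_at_k = sum(all(outcomes) for outcomes in runs.values()) / len(runs)
    return success_rate, all_pass_at_k


# Hypothetical run records: three tasks, four runs each.
example_runs = (
    [("task_a", True)] * 4
    + [("task_b", True), ("task_b", False), ("task_b", True), ("task_b", True)]
    + [("task_c", False)] * 4
)
success_rate, all_pass_at_4 = summarize_runs(example_runs, k=4)
print(f"success rate: {success_rate:.3f}, All-Pass@4: {all_pass_at_4:.3f}")
```

Under these assumed definitions All-Pass@4 can never exceed the success rate, which is consistent with the reported 33.8 percent sitting below 47.4 percent.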

What carries the argument

The RoTS tree-based pipeline that proactively discovers diverse error modes and synthesizes recovery steps.

If this is right

Both RoTS-7B and RoTS-32B show significant gains on GUI-RobustEval and traditional GUI benchmarks.
RoTS-32B reaches state-of-the-art on OSWorld with 47.4 percent success rate.
Improved long-horizon error recovery ability contributes to both robustness and overall performance.
GUI-RobustEval provides a systematic way to measure error recovery across a broad spectrum of error modes.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Similar synthesis pipelines might help build robust agents in other interactive environments like web browsing or robotics.
If the generated trajectories cover the main error modes, this could reduce reliance on expensive human demonstrations for training.
Extending the tree pipeline to more complex multi-step errors could further improve performance on very long tasks.

Load-bearing premise

The distribution of error modes in the tree-generated trajectories matches the errors that actual GUI agent policies encounter in real deployment.

What would settle it

A test showing that RoTS-trained agents fail to recover from error types that occur in real user interactions but are absent from the synthesized dataset would falsify the claim that the method produces generally robust agents.
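As a rough sketch of how such a test could be tallied, the snippet below splits recovery outcomes by whether the error type appears in the synthesized data. The record format, the function name `recovery_by_coverage`, and the sample outcomes are all hypothetical; only the error-type names are borrowed from the paper's figure captions.

```python
def recovery_by_coverage(records, synthesized_types):
    """Recovery rate for error types covered vs. not covered by synthesis.

    records: iterable of (error_type, recovered) pairs from evaluation runs.
    synthesized_types: set of error types present in the training data.
    Returns a dict with keys "covered" and "uncovered"; a value is None
    when no record falls into that group.
    """
    tally = {"covered": [0, 0], "uncovered": [0, 0]}  # [recovered, total]
    for error_type, recovered in records:
        key = "covered" if error_type in synthesized_types else "uncovered"
        tally[key][0] += int(bool(recovered))
        tally[key][1] += 1
    return {
        key: (hits / total if total else None)
        for key, (hits, total) in tally.items()
    }


# Hypothetical data for illustration only.
synthesized_types = {"Incorrect Parameter", "Incorrect UI Element"}
records = [
    ("Incorrect Parameter", True),
    ("Incorrect UI Element", True),
    ("Incorrect Parameter", False),
    ("Miss Necessary Step", False),
    ("Miss Necessary Step", True),
    ("Compositional Error", False),
]
rates = recovery_by_coverage(records, synthesized_types)
print(rates)
```

A large drop from the covered rate to the uncovered rate on real interaction logs would be the kind of evidence this section describes.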

Figures

Figures reproduced from arXiv: 2605.29447 by Bin Yang, Hao Jiang, Hongtao Duan, Lu Jiang, Lulu Hu, Minying Zhang, Qihua Chen, Shurui Li, Tianpeng Bu, Xin Liu.

**Figure 1.** Policy-induced errors exhibit diverse types and delayed error detectability. GUI agents struggle to identify and recover from such errors (upper part of panel (a)), while RoTS improves this by synthesizing reflection-related data that matches the policy-induced error distribution (lower part of panel (a)). Benefiting from this, RoTS achieves a lower accuracy drop on All-Pass@4 (panel (b)) compared with other methods.

**Figure 2.** Overview of our method. It includes (i) the pipeline for constructing our benchmark, GUI-RobustEval, and (ii) RoTS, the pipeline for synthesizing diverse error-recovery trajectories that cover the policy-induced error distribution. We also build a highly parallel infrastructure that supports high-throughput evaluation and data synthesis.

**Figure 3.** (a) Error type distribution of policy-induced errors and existing datasets. (b) Error-horizon distribution of policy-induced errors and existing datasets. (c) Error type percentage in GUI-RobustEval, colored by post-error success rate. (d) Post-error success rate of SOTA agents on GUI-RobustEval with respect to error depth.

… identify two gaps, i.e., coverage mismatch: training data concentrates on …

**Figure 5.** (a) With the dataset size fixed at 100k, increasing the number of expansion iterations from 0 to 32 improves the success rate from 15.8 to 21.4. This is because more iterations introduce a higher proportion of error-mode exploration and error-recovery trajectories into the dataset, enhancing the agent’s reflection capability. In addition, we scale the dataset size from 50k to 1000k and report the corresponding p…

**Figure 4.** The impact of different ratios of reflection data.

Expansion Rounds and Dataset Size. We investigate the scalability of RoTS with respect to the number of expansion iterations and the dataset size. As shown in …

**Figure 6.** (a) Error example of Incorrect Parameter. The action description beneath each state indicates the agent’s next move. In the final two steps, the agent fails to specify the correct output path in the terminal command. (b) Error example of Miss Necessary Step. The agent correctly navigates the export dialog but predicts an immediate save action (red text) before renaming the file to res.png.

B. The Infrastru…

**Figure 7.** (a) Case study of Incorrect UI Element. The agent’s predicted actions for the line spacing button target incorrect UI elements. (b) Case study of a Compositional Error. Initial perception failure leads to a chain of erroneous actions. The agent eventually loses track of the "Install extension" goal and attempts to open the VSIX as a regular file.

… our infrastructure is flexible and efficient to (1) seamless…

**Figure 8.** Overview of the online sampling system for our GUI-RobustEval and RoTS.

… browser, disable update notifications for both system and applications. The well-prepared Ubuntu and Windows systems are saved as the base snapshots for all the tasks. Second, we ask annotators to manually curate similar content (e.g., documents, code, and pictures) used in each task and set up the base snapshot to the similar initial…

**Figure 9.** Prompt template used for extracting key points (milestones) for reward modeling.

State-Transition Summarization Prompt Template. Below are two consecutive screenshots from a user’s attempt to complete a task. The first image shows the state BEFORE an action, and the second image shows the state AFTER the action. Please concisely describe the change or the action that occurred between these two screens. Inpu…

**Figure 10.** Prompt template used to summarize state transitions between consecutive screenshots.

**Figure 11.** Prompt template used for final task-success judgment in the reward model.

C. More Details for RoTS Dataset

C.1. Additional Details for Experience-Informed Recovery

C.1.1. TRAJECTORY EXPERIENCE FORMAT

Given instruction $u$ and trajectory $\tau = \{(o_t, a_t)\}_{t=1}^{T}$, the reward model outputs

$$(E_\tau, r_\tau) = R(u, \tau), \quad r_\tau \in \{0, 1\}, \tag{15}$$

where the trajectory experience is

$$E_\tau \triangleq (P_u, \Delta_\tau, \xi_\tau). \tag{16}$$

$P_u$ is a task-level proce…

**Figure 12.** Prompt template for the progress critic.

C.1.2. NEIGHBORING-BRANCH TRAJECTORIES IN THE SEARCH TREE

Let the search tree contain nodes as observations/states and directed edges as actions. For a node $o$, denote by $\mathrm{Out}(o)$ the set of outgoing edges:

$$\mathrm{Out}(o) \triangleq \{(o, a, o') \mid \text{taking action } a \text{ at } o \text{ leads to child node } o'\}. \tag{17}$$

Consider a failed trajectory $\tau^{\mathrm{fail}} = \{(o_i, a_i^{\mathrm{fail}})\}_{i=1}^{T}$ (root node at $i = 1$). Fo…
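The outgoing-edge set Out(o) from the excerpt above can be sketched as a small adjacency structure. Because the source text is cut off before it defines neighboring branches, the `sibling_edges` helper below is only an assumed illustration (edges at each step of a failed trajectory that take a different action), not the paper's definition. The class name, node labels, and action names are hypothetical.

```python
from collections import defaultdict


class SearchTree:
    """Nodes are observations/states; directed edges are actions."""

    def __init__(self):
        self._out = defaultdict(list)

    def add_edge(self, node, action, child):
        # Record that taking `action` at `node` leads to `child`.
        self._out[node].append((node, action, child))

    def out(self, node):
        # Out(o): all (o, a, o') triples leaving `node`.
        return list(self._out.get(node, []))


def sibling_edges(tree, failed_trajectory):
    """Assumed helper: for each step i (1-indexed, root at i = 1), list the
    outgoing edges at o_i whose action differs from the failed action."""
    return {
        i: [edge for edge in tree.out(node) if edge[1] != failed_action]
        for i, (node, failed_action) in enumerate(failed_trajectory, start=1)
    }


# Hypothetical tree and failed trajectory.
tree = SearchTree()
tree.add_edge("s0", "click_file", "s1")
tree.add_edge("s0", "click_edit", "s2")
tree.add_edge("s1", "click_save", "s3")
tree.add_edge("s1", "type_name", "s4")

failed = [("s0", "click_file"), ("s1", "click_save")]
print(sibling_edges(tree, failed))
```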

**Figure 13.** Prompt template for the step-level action critic that verifies whether an intended action is consistent with the observed state transition. The transition state is obtained by the state-transition summarization prompt for the reward model.

… where $i_k$ is a proposed error step index, $g_{i_k}$ is the corresponding recovery advice, and $p_{i_k} \in [0, 1]$ is the expansion priority. To balance exploiting high-priority candi…

**Figure 14.** Prompt used for the reflection identifier.

C.3. CoT Synthesis Procedures

Our trajectories are collected from multiple policy models that differ in action spaces and CoT styles. To enable joint training, we unify all trajectories to the AgentNet format (Wang et al., 2025). For each source policy, we define an action mapping from its native action space to AgentNet’s action space and normalize coordinate-ba…

**Figure 15.** Visualization of the FAR-Tree, illustrating policy-induced errors via parallel sampling and FDE and their subsequent recovery through EIR advice.

C.4. Rule-Based Data Deduplication

After obtaining $D_{\mathrm{agn}}$ and $D_{\mathrm{ref}}$, we design the following rule-based data deduplication methods to remove near-duplicate data to preserve diversity. Specifically, we balance the number of training instances across tasks. Within ea…

**Figure 16.** System prompt template used for training our agent.

… exceeds a threshold and we keep a single representative among duplicates.

**Figure 17.** Trajectory comparison of OpenCUA and the RoTS model on GUI-RobustEval.

**Figure 18.** Exploration and error recovery behavior of the RoTS model on OSWorld. Green text denotes correct actions, red text indicates failed attempts or erroneous actions, and blue text highlights steps involving error recovery.

… while for proprietary models, we use APIs from the corresponding provider. Thus, the cost comes from: the GPU server for self-deployed models, API calls and the Cloud Server for the environment de…

**Figure 19.** Over-reflection behavior of the RoTS model on OSWorld.

… perform EIR for 32 times under the following setup: (i) Reflector (w/o exp.), reflector without trajectory-derived experience; (ii) Reflector (w/ exp.), reflector with trajectory-derived experience; (iii) EIR w/o advice, full recovery but removing advice conditioning; and (iv) Full EIR, the complete method. We report the averaged accuracy of error awa…

Original abstract

While GUI agents have advanced rapidly, they often lack the robustness to recover from their own errors, hindering real-world deployment. To bridge this gap at both the evaluation and data levels, we introduce GUI-RobustEval and propose Robustness-driven Trajectory Synthesis (RoTS). GUI-RobustEval contains $1,216$ executable test cases that systematically measure error recovery capabilities across a broad and realistic spectrum of error modes. At the data level, RoTS is a scalable synthesis framework that creates $800k$ high-quality trajectories via a tree-based pipeline that proactively discovers diverse error modes and synthesizes corresponding recovery steps. Our two models, RoTS-7B and RoTS-32B, fine-tuned on our dataset, both demonstrate significant gains on GUI-RobustEval and traditional GUI benchmarks. Notably, RoTS-32B achieves state-of-the-art performance on OSWorld, with a $47.4\%$ success rate and a $33.8\%$ All-Pass@4 score, suggesting that improved long-horizon error recovery ability contributes to both robustness and overall performance. Our code is available at https://github.com/AlibabaResearch/RoTS.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The paper adds a GUI error-recovery benchmark and a tree-based synthesis method for 800k trajectories, with the 32B model claiming SOTA on OSWorld, but the gains rest on an unverified assumption that the synthetic errors match real policy failures.


The main takeaway is that this work targets error recovery in GUI agents with a new benchmark and a scalable way to generate recovery data, leading to reported gains on OSWorld.

They created GUI-RobustEval with 1,216 test cases meant to cover a range of realistic error modes. RoTS builds trajectories by exploring error paths in a tree structure and pairing them with recovery steps, producing 800k examples. The fine-tuned 7B and 32B models show improvements on the new benchmark and push OSWorld success to 47.4% with a 33.8% All-Pass@4 score. Releasing the code is a plus for anyone who wants to inspect or extend it.

The tree approach for discovering diverse errors proactively is a concrete step beyond waiting for failures in rollouts. The numbers on a standard benchmark like OSWorld give the empirical side some grounding.

The soft spot is the missing check on whether the synthesized error distribution actually lines up with what deployed agents produce. No overlap metric or comparison to baseline failure modes appears in the abstract, so the claim that better recovery drives the OSWorld lift could be overstated if the data over-represents benchmark-friendly cases. Details on error mode selection and statistical controls are also thin, which leaves the central attribution open to the concern in the stress-test note.

This is aimed at people working on GUI agents or agent robustness who need better evals and data for long-horizon reliability. A reader focused on practical deployment issues would get usable resources from the benchmark and pipeline. The combination of new evaluation tools and concrete results on an existing benchmark is enough to warrant sending it to referees for a closer look at the methods and data validation.

Referee Report

2 major / 1 minor

Summary. The manuscript introduces GUI-RobustEval, a benchmark of 1,216 executable test cases that systematically covers a spectrum of error modes for GUI agents, and RoTS, a tree-based trajectory synthesis pipeline that generates 800k recovery trajectories. Two models (RoTS-7B and RoTS-32B) are fine-tuned on the resulting data and evaluated on both the new benchmark and existing GUI agent suites; the abstract reports that RoTS-32B reaches 47.4% success and 33.8% All-Pass@4 on OSWorld, attributing the gains to improved long-horizon error recovery.

Significance. If the central attribution holds, the work supplies both a targeted evaluation protocol and a scalable data-generation method that could materially advance robustness in deployed GUI agents. The public release of code at the cited GitHub repository is a concrete reproducibility asset.

major comments (2)

[Abstract] Abstract: the claim that the reported OSWorld gains 'suggest that improved long-horizon error recovery ability contributes to both robustness and overall performance' is load-bearing, yet the manuscript provides no quantitative comparison (KL divergence, category overlap, or rollout statistics) between the error-mode distribution of the 800k tree-synthesized trajectories and the actual failures observed when baseline policies are rolled out on the same environments.
[GUI-RobustEval] GUI-RobustEval construction: the selection criteria and sampling procedure for the 1,216 test cases, including how the 'broad and realistic spectrum of error modes' was defined and validated against real deployment traces, are not described; without this, it is impossible to determine whether the benchmark isolates the recovery capability that the synthesis method is intended to improve.
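The first major comment asks for a quantitative comparison between the synthesized error-mode distribution and the failures seen in baseline rollouts. A minimal sketch of two such statistics, KL divergence and category overlap, is given below. The counts are invented for illustration, the function names are hypothetical, and the additive smoothing constant is an arbitrary choice to keep the logarithm finite; only the category names come from the paper's figure captions.

```python
import math
from collections import Counter


def to_distribution(labels, categories, smoothing=1e-6):
    """Turn a list of error-type labels into a smoothed probability dict."""
    counts = Counter(labels)
    total = len(labels) + smoothing * len(categories)
    return {c: (counts.get(c, 0) + smoothing) / total for c in categories}


def kl_divergence(p, q):
    """KL(p || q) in nats for two distributions over the same categories."""
    return sum(p[c] * math.log(p[c] / q[c]) for c in p)


def category_overlap(p, q):
    """Sum of per-category minima: 1.0 means identical, 0.0 means disjoint."""
    return sum(min(p[c], q[c]) for c in p)


categories = [
    "Incorrect Parameter",
    "Incorrect UI Element",
    "Miss Necessary Step",
    "Compositional Error",
]

# Hypothetical label counts, not taken from the paper.
synthesized = (
    ["Incorrect Parameter"] * 50
    + ["Incorrect UI Element"] * 30
    + ["Miss Necessary Step"] * 15
    + ["Compositional Error"] * 5
)
rollouts = (
    ["Incorrect Parameter"] * 20
    + ["Incorrect UI Element"] * 25
    + ["Miss Necessary Step"] * 25
    + ["Compositional Error"] * 30
)

p_rollout = to_distribution(rollouts, categories)
q_synth = to_distribution(synthesized, categories)
print("KL(rollout || synthesized):", kl_divergence(p_rollout, q_synth))
print("category overlap:", category_overlap(p_rollout, q_synth))
```

Putting the rollout distribution first in the KL term penalizes error types that real policies produce but the synthesized data rarely contains, which is the direction of mismatch the referee is worried about.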

minor comments (1)

[Abstract] The abstract states performance numbers and data scale but defers all methodological detail; a short methods paragraph or pointer to the relevant section would improve readability.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We address each major comment below and commit to revisions that strengthen the manuscript without overstating our claims.


Referee: [Abstract] Abstract: the claim that the reported OSWorld gains 'suggest that improved long-horizon error recovery ability contributes to both robustness and overall performance' is load-bearing, yet the manuscript provides no quantitative comparison (KL divergence, category overlap, or rollout statistics) between the error-mode distribution of the 800k tree-synthesized trajectories and the actual failures observed when baseline policies are rolled out on the same environments.

Authors: We agree the attribution would be stronger with a direct quantitative comparison of error-mode distributions. The current manuscript relies on performance gains on GUI-RobustEval (which targets recovery) and OSWorld to support the suggestion. In revision we will add category-overlap statistics and rollout failure analysis between the synthesized trajectories and baseline OSWorld rollouts; if the comparison is not feasible with existing logs we will qualify the abstract claim accordingly. revision: yes
Referee: [GUI-RobustEval] GUI-RobustEval construction: the selection criteria and sampling procedure for the 1,216 test cases, including how the 'broad and realistic spectrum of error modes' was defined and validated against real deployment traces, are not described; without this, it is impossible to determine whether the benchmark isolates the recovery capability that the synthesis method is intended to improve.

Authors: Section 3.1 describes the error-mode taxonomy and how the 1,216 cases were constructed to cover them, but the sampling procedure and any validation against external deployment traces are not stated with sufficient detail. We will expand Section 3.1 with explicit selection criteria, sampling method, and validation steps against real traces. revision: yes

Circularity Check

0 steps flagged

No significant circularity


The paper describes an empirical pipeline for trajectory synthesis via a tree-based method and evaluates resulting models on external benchmarks (GUI-RobustEval and OSWorld). No equations, fitted parameters, or self-citations are present that reduce any reported performance gain or attribution to a definitional equivalence or input by construction. The central claims rest on observed success rates from fine-tuning and external testing rather than any self-referential loop.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

Abstract-only review yields no explicit free parameters, axioms, or invented entities beyond standard supervised fine-tuning assumptions common to the field.

axioms (1)

domain assumption: Standard assumptions of supervised fine-tuning and benchmark validity in machine learning for agents.
The reported gains rest on typical ML training and evaluation practices.


