arxiv: 2604.17849 · v1 · submitted 2026-04-20 · 💻 cs.AI

Recognition: unknown

On the Reliability of Computer Use Agents

Gonzalo Gonzalez-Pumariega , Saaket Agashe , Jiachen Yang , Ang Li , Xin Eric Wang

Authors on Pith no claims yet

Pith reviewed 2026-05-10 04:34 UTC · model grok-4.3

classification 💻 cs.AI

keywords computer-use agentsreliabilitytask ambiguityagent variabilitystochastic executionrepeated evaluationdesktop automationweb navigation

0 comments

The pith

Computer-use agents succeed on a task once yet often fail on identical repeated executions.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper investigates why agents capable of web navigation, desktop automation, and software tasks show inconsistent performance even when the underlying task and model remain fixed. It isolates three contributing factors through repeated runs of the same tasks: randomness in how actions unfold, vagueness in the way tasks are worded, and shifts in the agent's own choices from one execution to the next. Paired statistical tests on these repeated executions reveal that both the wording of the task and the stability of the agent's responses strongly influence whether success repeats. The work concludes that single-trial evaluations miss these sources of failure and that agents would benefit from repeated-run testing, interactive clarification of ambiguous instructions, and preference for stable strategies.

Core claim

An agent that succeeds at a task once may still fail on subsequent executions of the exact same task because of stochastic elements during execution, ambiguity in the task description, and variability in the agent's behavior across runs. Analysis on repeated executions combined with statistical tests that track task-level changes shows that reliability is shaped by how clearly the task is specified and how much the agent's actions fluctuate from one run to another.

What carries the argument

Repeated executions of unchanged tasks paired with statistical tests that isolate effects from execution stochasticity, task ambiguity, and agent behavior variability.

If this is right

Evaluation protocols must include multiple executions of each task instead of single trials.
Task instructions should be designed to let agents ask clarifying questions during execution.
Agent development should prioritize methods that produce stable action sequences rather than variable ones.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Benchmarks that report only peak success may systematically overestimate practical usefulness in repeated-use settings.
Real-world automation deployments could reduce failures by adding explicit ambiguity-resolution steps before committing to actions.
Training objectives that penalize variance in action distributions across similar states might improve reliability without changing model size.

Load-bearing premise

The three examined factors account for most observed unreliability and the paired statistical tests on repeated runs can separate their distinct contributions.

What would settle it

Finding that success rates remain low and unchanged even after removing all execution randomness, using fully unambiguous task descriptions, and forcing identical agent actions across runs.

read the original abstract

Computer-use agents have rapidly improved on real-world tasks such as web navigation, desktop automation, and software interaction, in some cases surpassing human performance. Yet even when the task and model are unchanged, an agent that succeeds once may fail on a repeated execution of the same task. This raises a fundamental question: if an agent can succeed at a task once, what prevents it from doing so reliably? In this work, we study the sources of unreliability in computer-use agents through three factors: stochasticity during execution, ambiguity in task specification, and variability in agent behavior. We analyze these factors on OSWorld using repeated executions of the same task together with paired statistical tests that capture task-level changes across settings. Our analysis shows that reliability depends on both how tasks are specified and how agent behavior varies across executions. These findings suggest the need to evaluate agents under repeated execution, to allow agents to resolve task ambiguity through interaction, and to favor strategies that remain stable across runs.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper investigates sources of unreliability in computer-use agents by analyzing three factors—stochasticity during execution, ambiguity in task specification, and variability in agent behavior—on the OSWorld benchmark. Using repeated task executions and paired statistical tests that capture task-level changes across settings, it concludes that reliability depends on both task specification and variations in agent behavior across runs, recommending repeated evaluations, interactive ambiguity resolution, and stable strategies.

Significance. If the statistical isolation of factors holds and quantitative results are robust, this work addresses an important gap in agent evaluation by moving beyond single-run success metrics. It offers empirical support for practical improvements in benchmarking and design, grounded in a public benchmark and standard statistical approaches.

major comments (2)

[Experimental analysis section] The description of paired statistical tests (abstract and the section on experimental analysis) does not specify orthogonal controls, partial correlations, or variance decomposition to isolate stochasticity, task ambiguity, and agent-behavior variability. Without such details, observed reliability changes cannot be unambiguously attributed to individual factors, as stochasticity and agent variability are likely confounded.
[Results section] No quantitative results, effect sizes, p-values, data exclusion rules, or multiple-comparison corrections are reported in the abstract or summary of findings, preventing assessment of whether the evidence supports the claim that reliability depends on task specification and agent behavior variation.

minor comments (2)

[Abstract] The abstract would be strengthened by including one or two key quantitative findings (e.g., reliability drop percentages or test statistics) to convey the magnitude of the effects.
[Introduction] Clarify the operational definitions and measurement of 'variability in agent behavior' versus 'stochasticity during execution' to avoid potential overlap in interpretation.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We address each major comment below and will make targeted revisions to improve clarity.

read point-by-point responses

Referee: [Experimental analysis section] The description of paired statistical tests (abstract and the section on experimental analysis) does not specify orthogonal controls, partial correlations, or variance decomposition to isolate stochasticity, task ambiguity, and agent-behavior variability. Without such details, observed reliability changes cannot be unambiguously attributed to individual factors, as stochasticity and agent variability are likely confounded.

Authors: We appreciate the concern regarding factor isolation. Our design isolates factors through controlled pairwise comparisons: stochasticity is measured by repeating identical tasks with fixed specifications and agent settings; task ambiguity is isolated by comparing vague versus detailed specifications on the same agent and task; agent-behavior variability is assessed by comparing outcomes across distinct agent runs or configurations while holding task and specification fixed. Paired tests (e.g., McNemar or Wilcoxon signed-rank on per-task success) then quantify changes attributable to the manipulated factor. We did not apply partial correlations or variance decomposition, as the orthogonal experimental controls suffice to support our attributions. We will revise the experimental analysis section to explicitly document these controls, the test choices, and their assumptions. revision: yes
Referee: [Results section] No quantitative results, effect sizes, p-values, data exclusion rules, or multiple-comparison corrections are reported in the abstract or summary of findings, preventing assessment of whether the evidence supports the claim that reliability depends on task specification and agent behavior variation.

Authors: The full results section reports per-setting success rates, p-values from the paired tests, and effect sizes (Cohen's h for proportions). No runs were excluded. Multiple-comparison corrections (Bonferroni) were applied across the task set. To address the referee's point, we will revise the abstract and the summary of findings to include representative quantitative results, effect sizes, p-values, and explicit statements on data handling and corrections. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical study on external benchmark with independent statistical tests

full rationale

The paper is an empirical analysis of agent reliability on the public OSWorld benchmark. It uses repeated task executions and standard paired statistical tests to examine three factors (stochasticity, task ambiguity, agent behavior variability). No equations, parameter fits, derivations, or self-citations are presented that reduce any claim to a quantity defined or fitted from the same data by construction. The central findings rest on observed outcomes from independent runs rather than any self-referential or load-bearing reduction, satisfying the criteria for a self-contained non-circular result.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The analysis rests on the public OSWorld benchmark and standard statistical methods for comparing repeated runs; no new free parameters, invented entities, or ad-hoc axioms are introduced in the abstract.

axioms (1)

domain assumption Paired statistical tests can isolate task-level effects of stochasticity, ambiguity, and behavioral variability across repeated executions
Invoked to support the claim that reliability depends on task specification and agent behavior variation.

pith-pipeline@v0.9.0 · 5470 in / 1204 out tokens · 38367 ms · 2026-05-10T04:34:05.102931+00:00 · methodology

discussion (0)

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

The Agent Use of Agent Beings: Agent Cybernetics Is the Missing Science of Foundation Agents
cs.AI 2026-05 unverdicted novelty 4.0

Agent Cybernetics reframes foundation agent design by adapting classical cybernetics laws into three engineering desiderata for reliable, long-running, self-improving agents.

Reference graph

Works this paper leans on

65 extracted references · 1 canonical work pages · cited by 1 Pith paper · 1 internal anchor

[1]

URLhttps://arxiv.org/abs/2601.03267. E. Tabassi.Artificial Intelligence Risk Management Framework (AI RMF 1.0). Jan. 2023. doi: 10.6028/ nist.ai.100-1. URLhttp://dx.doi.org/10.6028/nist.ai.100-1. H. Wang, N. Chin, G. Gonzalez-Pumariega, X. Sun, N. Sunkara, M. A. Pace, J. Bohg, and S. Choudhury. Apricot: Active preference learning and constraint-aware task...

work page internal anchor Pith review Pith/arXiv arXiv doi:10.6028/nist.ai.100-1 2023
[2]

Be specific about what to click, type, or interact with
[3]

Include the sequence of actions needed
[4]

Be detailed enough that another agent could follow it Respond with a numbered list of steps:
[5]

[Action description]
[6]

Only provide the plan - do not execute any actions

[Action description] ... Only provide the plan - do not execute any actions. C.1.2. Reusing Plan Prompt Listing 2|Strategy Determinism Plan Addon [PLAN TO FOLLOW] PRE-GENERATED STEP-BY-STEP PLAN: Below is a detailed plan for completing the task:`{instruction}` You must follow this plan step-by-step to succeed at the task. {plan_text} IMPORTANT: Follow thi...
[7]

**Keeps the natural human tone** - should sound like a human said it
[8]

**Only adds missing details** that the evaluator checks for
[9]

**Specifies exact file names/formats/locations** if the evaluator expects them
[10]

**Does NOT become a step-by-step procedure** - stay conversational Additional criteria:
[11]

do not omit links)

The clarified instruction will replace the original instruction - make sure to retain essential details to avoid losing current context (e.g. do not omit links)
[12]

For tasks that have exact file checks, be very careful in accidentally adding or preventing information to the instruction that would cause it to no longer match the evaluator's criteria. Format your response as: <thoughts> [Think through what's ambiguous and what the evaluator actually checks for] </thoughts> <answer> [The minimally clarified instruction...
[13]

ORIGINAL (before) instruction and 3 trajectory runs
[14]

CLARIFIED (after) instruction and 3 trajectory runs
[15]

Success rate scores for all runs
[16]

**EVALUATOR FUNCTION**: The exact code that determines task success/failure Your goal is to determine: - How agent behavior changed between the original and clarified instructions - Whether the instruction clarification had a meaningful impact on success rate - What specific aspect of the clarification caused behavior changes Each trajectory contains: - *...
[17]

**Understanding Original Success Rate** - Analyze the 3'before'runs: what did successful runs do RIGHT according to the evaluator? - For failures: what specific evaluator requirement did they miss?
[18]

**Understanding Clarified Success Rate** - Analyze the 3'after'runs: what changed in agent behavior? - Do the new behaviors align with or conflict with evaluator requirements?
[19]

**Instruction Impact on Evaluator Alignment** - Did the clarification help agents meet evaluator requirements more precisely? - Did the clarification inadvertently steer agents away from evaluator requirements? - Are success rate changes due to better/worse evaluator alignment or just variance?
[20]

{instruction}

**Flag Assignment** Choose ONE flag that best describes this case: **Instructions Need Manual Correction:** - IMPOSSIBLE_TASK: Clarification made the task impossible to complete - TOO_TRIVIAL: Clarification gave away the solution, making task too easy - HARMFUL_CONSTRAINTS: Added unnecessary constraints that hurt success rate - REMOVED_HELPFUL_AMBIGUITY: ...
[21]

Be specific about what is wrong -- mention exact values, file names, or formats that are incorrect
[22]

Just point out the problem

Do NOT give step-by-step instructions. Just point out the problem
[23]

Do NOT reveal evaluator function names or code
[24]

Keep it to 2-4 sentences
[25]

If you can identify the specific mismatch between what the agent produced and what you expected, state it clearly. C.3.2. User Simulator: Context Listing 7|User Simulator Context The {context} variable contains: ## Task Configuration ```json {full task JSON config including instruction, evaluator spec, setup config} ``` ## Evaluator Function ```python {ev...
[27]

ANALYSIS APPROACH:

**Agent Reasoning**: The agent's internal thoughts and decision-making process at each step Your goal is to identify what worked in successful runs versus what went wrong in failed runs, then provide actionable feedback for future attempts. ANALYSIS APPROACH:
[28]

**Compare Trajectories Step-by-Step** - Compare both visual changes and agent reasoning between successful/failed runs - Identify where successful and failed runs diverge in actions AND thinking - Note different approaches taken by successful vs failed runs - Look for consistent patterns across successes and failures
[29]

**Key Success Factors** - What specific actions or strategies led to success? - What reasoning patterns did successful runs exhibit? - Were there critical steps that successful runs handled correctly? - What environmental awareness did successful runs demonstrate?
[30]

**Failure Analysis** - What mistakes did failed runs make in actions or reasoning? 21 On the Reliability of Computer Use Agents - Were there flawed reasoning patterns or incorrect decision making? - Did failed runs misunderstand the task or environment? - Were there missed opportunities or incorrect decisions?
[31]

**Actionable Insights** - What should the agent think about differently next time? - What should the agent do differently next time? - What should the agent avoid based on failed attempts? - Are there general principles or heuristics that emerge? OUTPUT FORMAT: <analysis> [Detailed step-by-step comparison of successful vs failed trajectories, analyzing bo...
[32]

[Specific actionable instruction based on successful patterns]
[33]

[Another positive action or strategy to follow]
[34]

(up to a max of 10 items and can stop earlier if not applicable) DON'T:

[Key behavior or approach to adopt] ... (up to a max of 10 items and can stop earlier if not applicable) DON'T:
[35]

[Specific mistake or behavior to avoid from failed runs]
[36]

[Common error pattern to prevent]
[37]

[Action or thinking pattern that leads to failure] ... (up to a max of 10 items and can stop earlier if not applicable) PLAN: [Step-by-step execution plan extracted from successful runs - concrete sequence of actions that leads to success] </feedback> C.4.2. Plan Refinement and Addon Prompt Listing 11|Plan Refinement Addon Prompt (PREVIOUS_FEEDBACK_SECTIO...
[38]

Identify which parts of the previous feedback were helpful vs harmful
[39]

Understand why the feedback led to the current outcomes
[40]

Refine the feedback based on new evidence 22 On the Reliability of Computer Use Agents C.4.3. Plan Extraction with Historical Success Listing 12|Plan Extraction with Historical Success You are analyzing failed agent trajectories against a successful historical reference to generate corrective feedback. SITUATION: All current attempts failed, but we have a...
[41]

What the successful run did right that current attempts are missing
[42]

What systematic errors current attempts are making
[43]

How to guide future attempts toward the successful pattern For each run, you will see:
[46]

**Success Pattern Analysis** - What made the historical run successful? - What key decisions, actions, or reasoning patterns led to success? - What environmental awareness did it demonstrate?
[47]

**Current Failure Analysis** - How do current failures differ from the successful pattern? - What systematic mistakes are being repeated? - Where do current attempts go wrong compared to the success?
[48]

**Corrective Strategy** - How can future attempts align with the successful pattern? - What specific changes in reasoning or actions are needed? - What should be avoided based on current failures? OUTPUT FORMAT: <analysis> [Detailed comparison of historical success vs current failures, identifying key differences and systematic error patterns] </analysis>...
[49]

[Key strategy from the successful run to adopt]
[50]

[Specific action or approach that led to success] 23 On the Reliability of Computer Use Agents
[51]

(up to a max of 10 items and can stop earlier if not applicable) DON'T:

[Reasoning pattern that worked in the successful case] ... (up to a max of 10 items and can stop earlier if not applicable) DON'T:
[52]

[Specific mistake in current attempts to avoid]
[53]

[Error pattern that differs from successful approach]
[54]

[Action or thinking that leads to failure vs success] ... (up to a max of 10 items and can stop earlier if not applicable) PLAN: [Step-by-step execution plan extracted from the historical successful run - concrete sequence of actions to replicate] </feedback> C.4.4. Plan Extraction from Failures Listing 13|Plan Extraction from Failures You are analyzing f...
[55]

Which failure showed the most promise/progress
[56]

What systematic errors are preventing success
[57]

What fundamental strategies might lead to success For each failed run, you will see:
[58]

**Visual Changes**: Fact captions describing what changed between screenshots at each step
[59]

**Agent Reasoning**: The agent's internal thoughts and decision-making process at each step ANALYSIS APPROACH:
[60]

**Comparative Failure Analysis** - Which attempt made the most progress before failing? - What different approaches were tried? - What common error patterns appear across attempts?
[61]

**Root Cause Identification** - What fundamental issues are causing failures? - Are failures due to misunderstanding, poor execution, or environment issues? - What capabilities seem to be missing? 24 On the Reliability of Computer Use Agents
[62]

**Diagnostic Strategy** - What should future attempts focus on first? - What basic principles or approaches might help? - What specific errors should be avoided? OUTPUT FORMAT: <analysis> [Detailed analysis of failure patterns, identifying the most promising approach and systematic issues preventing success] </analysis> <feedback> DO:
[63]

[Basic strategy that might lead to success]
[64]

[Fundamental approach to try based on most promising failure]
[65]

(up to a max of 10 items and can stop earlier if not applicable) DON'T:

[Key capability or awareness to develop] ... (up to a max of 10 items and can stop earlier if not applicable) DON'T:
[66]

[Common error pattern to avoid across all attempts]
[67]

[Systematic mistake that prevents success]
[68]

expected

[Flawed approach or reasoning to prevent] ... (up to a max of 10 items and can stop earlier if not applicable) </feedback> C.4.5. Planning Feedback Addon Listing 14|Planning Feedback Addon Task:`{instruction}` Based on analysis of previous attempts at this task, here's what you should know: {feedback_text} IMPORTANT: Use this feedback to guide your approa...

2026