
arxiv: 2605.08261 · v1 · submitted 2026-05-07 · 💻 cs.SE · cs.AI

Computer Use at the Edge of the Statistical Precipice

Pith reviewed 2026-05-12 01:02 UTC · model grok-4.3

classification 💻 cs.SE cs.AI
keywords computer use agents · benchmark evaluation · replay scripts · statistical aggregation · environment design · mobile applications · pass@k · confidence intervals

The pith

A 1MB blind replay script matches or beats frontier models on static computer-use benchmarks because its expected success rate equals the original agent's pass@k in deterministic settings.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that a simple script replaying a recorded sequence of actions without viewing the screen can match or outperform leading AI models on widely used benchmarks for agents that control computers. This occurs because those benchmarks use fixed, deterministic environments where the replay succeeds exactly on the trials the source model passed. The authors trace the problem to two sources: environments that are static, unsandboxed, or unreliably verified, plus evaluation methods that naively average results or apply pass@k without accounting for stateful interactions. They respond with PRISM, a set of five principles for building sound test environments, and an aggregation approach that combines Wilson score intervals with hierarchical bootstrap to produce confidence intervals that account for the nested structure of the benchmark. If these changes are adopted, evaluations would no longer be fooled by trivial recorders and would better predict performance in varied, real-world use.

Core claim

We show that a replay script that blindly executes a recorded action sequence without ever observing the screen has an expected success rate exactly equal to the source agent's pass@k in deterministic environments, and that this explains why such scripts outperform models on prominent static benchmarks. We identify non-principled environment design and evaluation methodology as root causes and propose PRISM principles together with a hierarchical bootstrap aggregation framework to address them.

What carries the argument

The replay-script equivalence that equates blind execution success to the original agent's pass@k rate under determinism, plus the PRISM principles for environment construction.
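
The equivalence can be illustrated with a small simulation. This is a minimal sketch under assumptions that are not taken from the paper: every task has the same per-attempt success probability `p`, the replay script stores the trace of any successful attempt among `k` recorded attempts, and a deterministic environment reproduces the recorded outcome on every replay. The function names are hypothetical.

```python
import random


def pass_at_k(p: float, k: int) -> float:
    """Probability that at least one of k independent attempts succeeds."""
    return 1.0 - (1.0 - p) ** k


def simulate_replay(p: float, k: int, n_tasks: int, seed: int = 0) -> float:
    """Fraction of tasks a blind replay solves in a deterministic environment."""
    rng = random.Random(seed)
    solved = 0
    for _ in range(n_tasks):
        # Record k attempts by the source agent; each succeeds with probability p.
        attempts = [rng.random() < p for _ in range(k)]
        # Assumption: if any attempt succeeded, its action trace is stored, and
        # determinism means replaying that trace succeeds again every time.
        if any(attempts):
            solved += 1
    return solved / n_tasks


if __name__ == "__main__":
    p, k = 0.3, 4
    print(f"analytic pass@{k}: {pass_at_k(p, k):.4f}")
    print(f"simulated replay : {simulate_replay(p, k, n_tasks=200_000):.4f}")
```

Under these assumptions the two printed numbers agree up to sampling noise, which is the sense in which a recorder inherits the source agent's pass@k without observing anything.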

If this is right

  • Current static benchmarks cannot distinguish competent agents from simple record-and-replay scripts.
  • Meaningful CUA evaluation requires environments that are sandboxed, integrity-checked, and support multifactorial variability.
  • Statistical reporting must use methods that respect the nested structure of tasks rather than naive aggregation or direct pass@k misuse.
  • A benchmark built on the PRISM principles can support evaluation across millions of verified configurations.
  • Principled design and rigorous methodology become prerequisites rather than optional refinements for valid research.
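
For reference, the Wilson score interval named in the summary above has a standard closed form for a binomial proportion. The sketch below implements that textbook formula; the function name and the default `z = 1.96` (roughly a 95% interval) are choices made here, and how the paper combines the interval with the bootstrap is not specified on this page.

```python
import math


def wilson_interval(successes: int, trials: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial success rate."""
    if trials <= 0:
        raise ValueError("trials must be positive")
    p_hat = successes / trials
    denom = 1.0 + z * z / trials
    # The center is pulled toward 0.5, which keeps the interval sensible
    # when p_hat is 0 or 1 and the sample is small.
    center = (p_hat + z * z / (2 * trials)) / denom
    half = (z / denom) * math.sqrt(
        p_hat * (1.0 - p_hat) / trials + z * z / (4 * trials * trials)
    )
    return max(0.0, center - half), min(1.0, center + half)


if __name__ == "__main__":
    # Zero successes out of ten still yields a non-degenerate upper bound.
    print(wilson_interval(0, 10))
    print(wilson_interval(7, 10))
```

Unlike the normal-approximation interval, this one does not collapse to zero width when an agent passes none or all of a small number of trials.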

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same replay vulnerability may appear in other agent benchmarks that rely on deterministic or low-variability tasks.
  • Shifting to PRISM-style environments could move research emphasis from model scaling toward better test-bed engineering.
  • The proposed statistical aggregation could be applied to other multi-trial agent evaluations outside computer use.
  • Direct comparisons on a PRISM-compliant benchmark would supply a concrete baseline for judging whether new models clear the replay threshold.

Load-bearing premise

That the prominent static benchmarks are representative of real interactive computer-use tasks and that environments remain deterministic enough for the replay equivalence to hold.

What would settle it

Running the same replay script on a benchmark whose environments include non-determinism or forced variability and observing whether its success rate still equals the source model's pass@k.
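
A toy version of that experiment can be sketched as follows. It is an illustration only and does not model DigiWorld or any real benchmark: the task is reduced to tapping a target that the environment places in one of `n_layouts` slots, and the replay always taps the slot recorded in the original trace. All names are hypothetical.

```python
import random


def replay_success_rate(n_layouts: int, n_episodes: int, seed: int = 0) -> float:
    """Success rate of a blind replay when the target slot varies per episode."""
    rng = random.Random(seed)
    recorded_tap = 0  # slot where the target sat when the trace was recorded
    hits = 0
    for _ in range(n_episodes):
        # The environment places the target uniformly in one of n_layouts slots.
        target_slot = rng.randrange(n_layouts)
        # The replay never looks at the screen, so it always taps the recorded slot.
        hits += recorded_tap == target_slot
    return hits / n_episodes


if __name__ == "__main__":
    for n_layouts in (1, 2, 10):
        rate = replay_success_rate(n_layouts, n_episodes=50_000)
        print(f"layouts={n_layouts:2d}  replay success={rate:.3f}")
```

With a single layout the environment is static and the replay always succeeds; as the number of layouts grows, its success rate falls toward `1 / n_layouts`, while an agent that reads the screen would be unaffected in this toy setting.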

Original abstract

Evaluating Computer Use Agents (CUAs) on interactive environments is fraught with methodological pitfalls that the field has yet to systematically address. We show that a 1MB replay script that blindly executes a recorded action sequence without ever observing the screen outperforms frontier models on prominent static benchmarks, and prove that its expected success rate is exactly equal to the source agent's pass@k in deterministic environments. We trace this and other failures to two root causes: non-principled environment design (static, unsandboxed, or unreliably verified environments) and non-principled evaluation methodology (naive aggregation and misuse of pass@k for stateful UI interactions). To address the first, we propose PRISM, five design principles for CUA environments (privileged verification, realistic environments, integrity-checked configurations, sandboxed execution, and multifactorial variability) and instantiate them in DigiWorld, a benchmark of 15 realistic sandboxed mobile applications able to evaluate agents in over 3.2 million verified unique configurations. To address the second, we develop an aggregation framework pairing Wilson score intervals with hierarchical bootstrap, producing confidence intervals that correctly account for the nested structure of CUA benchmarks, as we empirically demonstrate. All together, we show that principled environment design and rigorous evaluation methodology are not optional refinements but prerequisites for meaningful CUA research.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper claims that CUA evaluation on interactive environments suffers from non-principled design (static/unsandboxed environments) and methodology (naive pass@k aggregation). It shows a 1MB blind replay script outperforming frontier models on prominent static benchmarks, proves that the replay's expected success rate equals the source agent's pass@k exactly in deterministic environments, proposes the PRISM principles (privileged verification, realistic environments, integrity-checked configurations, sandboxed execution, multifactorial variability) instantiated in the DigiWorld benchmark (15 sandboxed mobile apps supporting >3.2M verified configurations), and introduces a Wilson score interval paired with hierarchical bootstrap for confidence intervals that respect the nested benchmark structure.

Significance. If the central claims hold, the work is significant for CUA research because it supplies a direct mathematical identity establishing when replay equivalence occurs, a large-scale verifiable benchmark, and a statistically grounded aggregation method. The parameter-free nature of the replay-pass@k identity (under the stated determinism) and the empirical scale of DigiWorld are particular strengths that could shift how the community designs and reports interactive agent evaluations.

major comments (2)
  1. [§3] §3 (replay experiment and outperformance results): The claim that the 1MB replay script outperforms frontier models on prominent static benchmarks is load-bearing for the methodological critique, yet the manuscript provides no explicit verification that those benchmarks satisfy the determinism precondition required by the proof (e.g., no repeated executions with fixed seeds, state checksums, or measured variance from UI timing or content). Without this, the reported success rates may diverge from the source pass@k, turning the outperformance into a possible benchmark artifact rather than a general result.
  2. [§5.1] §5.1 (Wilson-hierarchical bootstrap framework): The description of the aggregation method states that it 'correctly account[s] for the nested structure,' but the manuscript does not supply the explicit hierarchical model, variance components, or the precise bootstrap resampling procedure applied to the DigiWorld data. This detail is needed to confirm that the resulting intervals differ materially from naive aggregation and to allow reproduction.
minor comments (2)
  1. The acronym PRISM is expanded only after first use in the abstract; spelling it out on first appearance would improve readability.
  2. Figure captions for the DigiWorld configuration counts and bootstrap interval plots could include the exact number of trials and the nesting levels used, to make the statistical claims immediately verifiable from the figures alone.
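
The determinism check requested in major comment 1 could look roughly like the sketch below: run the same episode several times under a fixed seed, hash the final state, and compare digests. The environments here are stand-ins invented for illustration; a real check would hash whatever state the benchmark exposes (for example a database dump or a UI tree), which this page does not describe.

```python
import hashlib
import json
import random
from typing import Callable


def state_checksum(state: dict) -> str:
    """SHA-256 digest of a JSON-serializable state, independent of key order."""
    payload = json.dumps(state, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


def is_deterministic(run_episode: Callable[[int], dict], seed: int, repeats: int = 5) -> bool:
    """True if repeated runs under the same seed end in an identical state."""
    digests = {state_checksum(run_episode(seed)) for _ in range(repeats)}
    return len(digests) == 1


def seeded_env(seed: int) -> dict:
    """Hypothetical environment whose final state depends only on the seed."""
    rng = random.Random(seed)
    return {"screen": "home", "items": [rng.randrange(100) for _ in range(3)]}


_clock = random.Random()  # unseeded on purpose: stands in for UI timing jitter


def jittery_env(seed: int) -> dict:
    """Same environment plus a field that changes from run to run."""
    state = seeded_env(seed)
    state["load_ms"] = _clock.randrange(10**9)
    return state


if __name__ == "__main__":
    print("seeded env deterministic: ", is_deterministic(seeded_env, seed=42))
    print("jittery env deterministic:", is_deterministic(jittery_env, seed=42))
```

A benchmark that fails such a check does not meet the precondition of the replay identity, so replay success rates measured on it need not equal the source agent's pass@k.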

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We address each major comment below and will incorporate the requested clarifications and additions into the revised manuscript.

Point-by-point responses
  1. Referee: [§3] §3 (replay experiment and outperformance results): The claim that the 1MB replay script outperforms frontier models on prominent static benchmarks is load-bearing for the methodological critique, yet the manuscript provides no explicit verification that those benchmarks satisfy the determinism precondition required by the proof (e.g., no repeated executions with fixed seeds, state checksums, or measured variance from UI timing or content). Without this, the reported success rates may diverge from the source pass@k, turning the outperformance into a possible benchmark artifact rather than a general result.

    Authors: We agree that explicit verification of the determinism precondition is required to substantiate the outperformance results and to ensure they are not benchmark artifacts. The revised manuscript will add, in §3, repeated executions of the source agents and replay script under fixed seeds, state checksum comparisons across runs, and measured variance attributable to UI timing or content. These additions will confirm that the evaluated benchmarks satisfy the determinism assumption underlying the pass@k equivalence proof. revision: yes

  2. Referee: [§5.1] §5.1 (Wilson-hierarchical bootstrap framework): The description of the aggregation method states that it 'correctly account[s] for the nested structure,' but the manuscript does not supply the explicit hierarchical model, variance components, or the precise bootstrap resampling procedure applied to the DigiWorld data. This detail is needed to confirm that the resulting intervals differ materially from naive aggregation and to allow reproduction.

    Authors: We acknowledge that the current description in §5.1 lacks sufficient implementation detail for reproduction and for demonstrating the material difference from naive methods. The revised manuscript will expand this section to include the explicit hierarchical model (with nested random effects for applications, configurations, and trials), the full variance components decomposition, and the precise bootstrap resampling procedure (including pseudocode) applied to the DigiWorld data. We will also report empirical comparisons showing how the resulting intervals differ from naive aggregation. revision: yes
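
Since the resampling procedure is not given on this page, the following is only a generic sketch of a hierarchical (nested) bootstrap over the three levels named in the response: applications, configurations, and trials. The data layout, the macro-averaged statistic, and the percentile interval are assumptions made for illustration and should not be read as the authors' method.

```python
import random

# Assumed layout: app name -> config name -> list of 0/1 trial outcomes.
Data = dict[str, dict[str, list[int]]]


def _avg(values) -> float:
    values = list(values)
    return sum(values) / len(values)


def macro_success(apps: list[list[list[int]]]) -> float:
    """Mean over apps of the mean over configs of the mean over trials."""
    return _avg(_avg(_avg(trials) for trials in configs) for configs in apps)


def hierarchical_bootstrap_ci(
    data: Data, n_boot: int = 2000, alpha: float = 0.05, seed: int = 0
) -> tuple[float, float]:
    """Percentile CI obtained by resampling apps, then configs, then trials."""
    rng = random.Random(seed)
    app_names = list(data)
    stats = []
    for _ in range(n_boot):
        sampled_apps = []
        # Level 1: resample applications with replacement.
        for app in rng.choices(app_names, k=len(app_names)):
            config_names = list(data[app])
            sampled_configs = []
            # Level 2: resample configurations within the chosen application.
            for cfg in rng.choices(config_names, k=len(config_names)):
                trials = data[app][cfg]
                # Level 3: resample trials within the chosen configuration.
                sampled_configs.append(rng.choices(trials, k=len(trials)))
            sampled_apps.append(sampled_configs)
        stats.append(macro_success(sampled_apps))
    stats.sort()
    lo = stats[int((alpha / 2) * n_boot)]
    hi = stats[min(n_boot - 1, int((1 - alpha / 2) * n_boot))]
    return lo, hi


if __name__ == "__main__":
    toy: Data = {
        "notes": {"cfg_a": [1, 1, 0], "cfg_b": [1, 0, 0]},
        "mail": {"cfg_a": [0, 0, 0], "cfg_b": [1, 1, 1]},
        "maps": {"cfg_a": [1, 1, 1], "cfg_b": [1, 1, 0]},
    }
    print(hierarchical_bootstrap_ci(toy))
```

Resampling at the application level first lets between-application variation widen the interval, which a bootstrap over pooled trials would ignore; that is one common reading of "respecting the nested structure".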

Circularity Check

0 steps flagged

No significant circularity; core claims are mathematical identities and novel proposals

Full rationale

The paper's key result is a direct mathematical identity: in deterministic environments, the expected success rate of a blind replay of a recorded action sequence equals the source agent's pass@k by definition of fixed-sequence execution and the success probability metric. This follows immediately from the problem setup without fitting parameters, self-citation chains, or ansatzes. The PRISM principles, DigiWorld benchmark, and Wilson/bootstrap aggregation framework are newly introduced without reducing to prior inputs or self-referential definitions. No load-bearing steps invoke self-citations for uniqueness or smuggle assumptions via citation. The derivation chain is self-contained against external benchmarks and does not exhibit any of the enumerated circularity patterns.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 2 invented entities

No free parameters are fitted; the work relies on a mathematical identity under a domain assumption and standard statistical techniques without ad hoc adjustments.

axioms (1)
  • domain assumption: The environments in the critiqued benchmarks are deterministic.
    The equality between replay success rate and pass@k is proven only for deterministic environments.
invented entities (2)
  • PRISM (no independent evidence)
    purpose: Five design principles for CUA environments
    Newly proposed set of principles to address identified failures.
  • DigiWorld (no independent evidence)
    purpose: Benchmark of 15 sandboxed mobile applications with millions of configurations
    New instantiated benchmark applying the PRISM principles.


