OpenComputer: Verifiable Software Worlds for Computer-Use Agents
Pith reviewed 2026-05-20 06:24 UTC · model grok-4.3
The pith
App-specific hard-coded verifiers match human judgments more closely than LLM-as-judge methods when evaluating computer-use agents on fine-grained desktop tasks.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
OpenComputer constructs verifiable software worlds by integrating four components: app-specific state verifiers that expose structured inspection endpoints over real applications, a self-evolving verification layer that refines reliability through execution-grounded feedback, a task-generation pipeline that produces realistic and machine-checkable desktop tasks, and an evaluation harness that records complete trajectories while computing auditable partial-credit rewards. The resulting system spans 33 applications and 1,000 finalized tasks and demonstrates that its hard-coded verifiers align more closely with human adjudication than LLM-as-judge evaluation, particularly when success depends,
What carries the argument
App-specific state verifiers that expose structured inspection endpoints over real applications, supplying precise, machine-checkable criteria for task success.
If this is right
- Frontier agents make partial progress on tasks but rarely achieve full end-to-end completion.
- Open-source models exhibit sharp performance drops relative to their scores on benchmarks such as OSWorld-Verified.
- The evaluation harness enables computation of auditable partial-credit rewards based on recorded trajectories.
- The task-generation pipeline produces realistic desktop tasks that remain machine-checkable through the verifiers.
Where Pith is reading between the lines
- The same verifier approach could be adapted to mobile or web-based agent environments where application state is similarly inspectable.
- Accurate, execution-grounded rewards from the framework could support reinforcement-learning loops that directly optimize for verifiable task completion.
- Persistent gaps in end-to-end success point to a need for agent architectures that maintain long-horizon state awareness across application boundaries.
Load-bearing premise
The assumption that app-specific state verifiers can be implemented to deliver accurate and complete structured inspection of live applications without introducing systematic bias or coverage gaps relative to human judgment.
What would settle it
A controlled comparison on a new set of tasks in which independent human raters score agent trajectories and the agreement rate of the hard-coded verifiers is measured directly against the agreement rate of LLM judges.
Figures
read the original abstract
We present OpenComputer, a verifier-grounded framework for constructing verifiable software worlds for computer-use agents. OpenComputer integrates four components: (1) app-specific state verifiers that expose structured inspection endpoints over real applications, (2) a self-evolving verification layer that improves verifier reliability using execution-grounded feedback, (3) a task-generation pipeline that synthesizes realistic and machine-checkable desktop tasks, and (4) an evaluation harness that records full trajectories and computes auditable partial-credit rewards. In its current form, OpenComputer covers 33 desktop applications and 1,000 finalized tasks spanning browsers, office tools, creative software, development environments, file managers, and communication applications. Experiments show that OpenComputer's hard-coded verifiers align more closely with human adjudication than LLM-as-judge evaluation, especially when success depends on fine-grained application state. Frontier agents struggle with end-to-end completion despite partial progress, and open-source models exhibit sharp drops from their OSWorld-Verified scores, exposing a persistent gap in robust computer automation.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript presents OpenComputer, a verifier-grounded framework for computer-use agents. It integrates app-specific state verifiers exposing structured inspection endpoints over 33 real desktop applications, a self-evolving verification layer using execution-grounded feedback, a task-generation pipeline producing 1,000 machine-checkable tasks across browsers, office tools, creative software, and development environments, and an evaluation harness that records full trajectories with auditable partial-credit rewards. The central experimental claim is that these hard-coded verifiers align more closely with human adjudication than LLM-as-judge methods, particularly when success depends on fine-grained application state; the paper also reports that frontier agents struggle with end-to-end completion and open-source models exhibit sharp drops relative to OSWorld-Verified scores.
Significance. If the alignment and performance claims hold, the work would be a meaningful advance in agent evaluation by replacing unreliable LLM judges with verifiable, auditable metrics grounded in real application state. The emphasis on machine-checkable tasks, full-trajectory recording, and self-evolving verifiers provides concrete strengths for reproducibility and iterative improvement that are currently rare in computer-use agent benchmarks.
major comments (2)
- [Experiments] The central claim that hard-coded verifiers align more closely with human adjudication than LLM-as-judge (especially for fine-grained state) is load-bearing yet unsupported by any reported quantitative metrics such as agreement rates, Cohen's kappa, or confusion matrices; without these numbers and a description of the human adjudication protocol (number of annotators, inter-annotator agreement, adjudication criteria), the superiority cannot be assessed.
- [Verifier Architecture] The description of the 33 app-specific state verifiers asserts that they expose structured inspection endpoints whose outputs match human judgments on success, but no evidence or test is provided for completeness of coverage in dynamic UIs (e.g., transient unsaved buffers, modal focus changes, or network-induced rendering artifacts in browsers and creative software); if these endpoints rely on partial hooks such as accessibility trees or window properties, systematic omissions could artifactually inflate alignment scores.
minor comments (2)
- [Abstract] The abstract states that open-source models exhibit 'sharp drops' from OSWorld-Verified scores but does not name the specific models or report the exact score deltas.
- [Verification Layer] Notation for the self-evolving verification layer (e.g., how execution-grounded feedback is formalized) would benefit from a short pseudocode example or diagram.
Simulated Author's Rebuttal
We appreciate the referee's thorough review and valuable suggestions for improving the manuscript. The comments on the experimental validation of verifier alignment and the completeness of the verifier architecture are well-taken. We provide point-by-point responses below and indicate the revisions we plan to make.
read point-by-point responses
-
Referee: [Experiments] The central claim that hard-coded verifiers align more closely with human adjudication than LLM-as-judge (especially for fine-grained state) is load-bearing yet unsupported by any reported quantitative metrics such as agreement rates, Cohen's kappa, or confusion matrices; without these numbers and a description of the human adjudication protocol (number of annotators, inter-annotator agreement, adjudication criteria), the superiority cannot be assessed.
Authors: We agree that quantitative metrics are essential to substantiate the alignment claim. The current manuscript presents the alignment observation qualitatively. In the revised version, we will expand the Experiments section to report agreement rates between hard-coded verifiers and human labels, Cohen's kappa for both inter-annotator agreement and verifier-human agreement, and confusion matrices comparing verifiers against LLM-as-judge. We will also add a detailed description of the human adjudication protocol, including the number of annotators, inter-annotator agreement statistics, and the specific criteria used for success labeling. revision: yes
-
Referee: [Verifier Architecture] The description of the 33 app-specific state verifiers asserts that they expose structured inspection endpoints whose outputs match human judgments on success, but no evidence or test is provided for completeness of coverage in dynamic UIs (e.g., transient unsaved buffers, modal focus changes, or network-induced rendering artifacts in browsers and creative software); if these endpoints rely on partial hooks such as accessibility trees or window properties, systematic omissions could artifactually inflate alignment scores.
Authors: The verifiers combine accessibility tree queries with application-specific automation interfaces and direct state inspection (e.g., file system checks for saved documents and browser DOM queries) to capture task-relevant state. We acknowledge that dynamic UI elements can introduce coverage gaps. In the revision, we will add a dedicated subsection on verifier implementation that discusses handling of transient states such as unsaved buffers and modal dialogs, provides concrete examples across app categories, and explicitly notes remaining limitations in highly dynamic scenarios. This will allow readers to better assess potential impacts on alignment scores. revision: yes
Circularity Check
No significant circularity; empirical claims rest on experimental comparisons without derivations or self-referential reductions
full rationale
The paper describes a systems framework with four components and reports experimental results on verifier-human alignment across 33 applications and 1000 tasks. No equations, parameter fittings, or mathematical derivations are present in the provided text. Claims about superior alignment of hard-coded verifiers versus LLM-as-judge are grounded in direct empirical comparisons rather than reducing to fitted inputs, self-definitions, or load-bearing self-citations. The evaluation harness and task-generation pipeline are presented as constructed artifacts whose performance is measured externally, leaving the central results self-contained and non-circular.
Axiom & Free-Parameter Ledger
Lean theorems connected to this paper
-
IndisputableMonolith/Foundation/AbsoluteFloorClosure.leanreality_from_one_distinction unclear?
unclearRelation between the paper passage and the cited Recognition theorem.
OpenComputer integrates four components: (1) app-specific state verifiers that expose structured inspection endpoints over real applications, (2) a self-evolving verification layer that improves verifier reliability using execution-grounded feedback...
-
IndisputableMonolith/Foundation/ArithmeticFromLogic.leanLogicNat ≃ Nat recovery unclear?
unclearRelation between the paper passage and the cited Recognition theorem.
Hard-coded verifiers align more closely with human adjudication than LLM-as-judge evaluation, especially when success depends on fine-grained application state.
What do these tags mean?
- matches
- The paper's claim is directly supported by a theorem in the formal canon.
- supports
- The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends
- The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses
- The paper appears to rely on the theorem as machinery.
- contradicts
- The paper's claim conflicts with a theorem or certificate in the canon.
- unclear
- Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Reference graph
Works this paper leans on
-
[1]
Agent S2: A Compositional Generalist-Specialist Framework for Computer Use Agents
Saaket Agashe, Jiuzhou Han, Shuyu Gan, Jiachen Yang, Ang Li, and Xin Eric Wang. Agent s: An open agen- tic framework that uses computers like a human. InThe Thirteenth International Conference on Learning Representations. Saaket Agashe, Kyle Wong, Vincent Tu, Jiachen Yang, Ang Li, and Xin Eric Wang. Agent s2: A compositional generalist-specialist framewor...
work page internal anchor Pith review arXiv
-
[2]
Gym-Anything: Turn any Software into an Agent Environment
Pranjal Aggarwal, Graham Neubig, and Sean Welleck. Gym-anything: Turn any software into an agent environment. arXiv preprint arXiv:2604.06126,
work page internal anchor Pith review Pith/arXiv arXiv
-
[3]
Windows agent arena: Evaluating multi-modal os agents at scale.arXiv preprint arXiv:2409.08264,
Rogerio Bonatti, Dan Zhao, Francesco Bonacci, Dillon Dupont, Sara Abdali, Yinheng Li, Yadong Lu, Justin Wagle, Kazuhito Koishida, Arthur Bucker, et al. Windows agent arena: Evaluating multi-modal os agents at scale.arXiv preprint arXiv:2409.08264,
-
[4]
Yuan Cao, Dezhi Ran, Mengzhou Wu, Yuzhe Guo, Xin Chen, Ang Li, Gang Cao, Gong Zhi, Hao Yu, Linyi Li, et al. Gui-genesis: Automated synthesis of efficient environments with verifiable rewards for gui agent post-training. arXiv preprint arXiv:2602.14093,
-
[5]
Chaoqun Cui, Jing Huang, Shijing Wang, Liming Zheng, Qingchao Kong, and Zhixiong Zeng. Agentic reward modeling: Verifying gui agent via online proactive interaction.arXiv preprint arXiv:2602.00575,
-
[6]
Scuba: Salesforce computer use benchmark.arXiv preprint arXiv:2509.26506,
Yutong Dai, Krithika Ramakrishnan, Jing Gu, Matthew Fernandez, Yanqi Luo, Viraj Prabhu, Zhenyu Hu, Silvio Savarese, Caiming Xiong, Zeyuan Chen, et al. Scuba: Salesforce computer use benchmark.arXiv preprint arXiv:2509.26506,
-
[7]
WorkArena: How Capable Are Web Agents at Solving Common Knowledge Work Tasks?
Alexandre Drouin, Maxime Gasse, Massimo Caccia, Issam H Laradji, Manuel Del Verme, Tom Marty, L´eo Boisvert, Megh Thakkar, Quentin Cappart, David Vazquez, et al. Workarena: How capable are web agents at solving common knowledge work tasks?arXiv preprint arXiv:2403.07718,
work page internal anchor Pith review Pith/arXiv arXiv
-
[8]
Towards general agentic intelligence via environment scaling.arXiv preprint arXiv:2509.13311,
Runnan Fang, Shihao Cai, Baixuan Li, Jialong Wu, Guangyu Li, Wenbiao Yin, Xinyu Wang, Xiaobin Wang, Liangcai Su, Zhen Zhang, et al. Towards general agentic intelligence via environment scaling.arXiv preprint arXiv:2509.13311,
-
[9]
9 Yanheng He, Jiahe Jin, Shijie Xia, Jiadi Su, Runze Fan, Haoyang Zou, Xiangkun Hu, and Pengfei Liu. Pc agent: While you sleep, ai works–a cognitive journey into digital world.arXiv preprint arXiv:2412.17589,
-
[10]
GitHub repository. Accessed: 2026-05-02. Seungone Kim, Jay Shin, Joel Jang, Shayne Longpre, Hwaran Lee, Sangdoo Yun, Ryan Shin, Sungdong Kim, James Thorne, Minjoon Seo, et al. Prometheus: Inducing fine-grained evaluation capability in language models. In International Conference on Learning Representations, volume 2024, pages 29927–29962,
work page 2026
-
[11]
Dawei Li, Bohan Jiang, Liangjie Huang, Alimohammad Beigi, Chengshuai Zhao, Zhen Tan, Amrita Bhattacharjee, Yuxuan Jiang, Canyu Chen, Tianhao Wu, et al. From generation to judgment: Opportunities and challenges of llm-as-a-judge. InProceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 2757–2791, 2025a. Yuetai Li, Hus...
-
[12]
Dang Nguyen, Jian Chen, Yu Wang, Gang Wu, Namyong Park, Zhengmian Hu, Hanjia Lyu, Junda Wu, Ryan Aponte, Yu Xia, et al. Gui agents: A survey. InFindings of the Association for Computational Linguistics: ACL 2025, pages 22522–22538,
work page 2025
-
[13]
AndroidWorld: A Dynamic Benchmarking Environment for Autonomous Agents
Christopher Rawles, Sarah Clinckemaillie, Yifan Chang, Jonathan Waltz, Gabrielle Lau, Marybeth Fair, Alice Li, William Bishop, Wei Li, Folawiyo Campbell-Ajala, et al. Androidworld: A dynamic benchmarking environment for autonomous agents.arXiv preprint arXiv:2405.14573,
work page internal anchor Pith review Pith/arXiv arXiv
-
[14]
Coact-1: Computer-using agents with coding as actions.arXiv preprint arXiv:2508.03923, 2025a
Linxin Song, Yutong Dai, Viraj Prabhu, Jieyu Zhang, Taiwei Shi, Li Li, Junnan Li, Silvio Savarese, Zeyuan Chen, Jieyu Zhao, et al. Coact-1: Computer-using agents with coding as actions.arXiv preprint arXiv:2508.03923, 2025a. Yixiao Song, Katherine Thai, Chau Minh Pham, Yapei Chang, Mazin Nadaf, and Mohit Iyyer. Bearcubs: A benchmark for computer-using web...
-
[15]
Zhaoyang Wang, Canwen Xu, Boyi Liu, Yite Wang, Siwei Han, Zhewei Yao, Huaxiu Yao, and Yuxiong He. Agent world model: Infinity synthetic environments for agentic reinforcement learning.arXiv preprint arXiv:2602.10090,
-
[16]
Haiyang Xu, Xi Zhang, Haowei Liu, Junyang Wang, Zhaozai Zhu, Shengjie Zhou, Xuhao Hu, Feiyu Gao, Junjie Cao, Zihua Wang, et al. Mobile-agent-v3. 5: Multi-platform fundamental gui agents.arXiv preprint arXiv:2602.16855,
-
[17]
arXiv preprint arXiv:2412.09605 , year=
Yiheng Xu, Dunjie Lu, Zhennan Shen, Junli Wang, Zekun Wang, Yuchen Mao, Caiming Xiong, and Tao Yu. Agenttrek: Agent trajectory synthesis via guiding replay with web tutorials.arXiv preprint arXiv:2412.09605,
-
[18]
Taofeng Xue, Chong Peng, Mianqiu Huang, Linsen Guo, Tiancheng Han, Haozhe Wang, Jianing Wang, Xiaocheng Zhang, Xin Yang, Dengchang Zhao, et al. Evocua: Evolving computer use agents via learning from scalable synthetic experience.arXiv preprint arXiv:2601.15876,
-
[19]
Ziyun Zhang, Zezhou Wang, Xiaoyi Zhang, Zongyu Guo, Jiahao Li, Bin Li, and Yan Lu. Infiniteweb: Scalable web environment synthesis for gui agent training.arXiv preprint arXiv:2601.04126,
-
[20]
Immersion in the github universe: Scaling coding agents to mastery.arXiv preprint arXiv:2602.09892,
Jiale Zhao, Guoxin Chen, Fanzhe Meng, Minghao Li, Jie Chen, Hui Xu, Yongshuai Sun, Wayne Xin Zhao, Ruihua Song, Yuan Zhang, et al. Immersion in the github universe: Scaling coding agents to mastery.arXiv preprint arXiv:2602.09892,
-
[21]
WebArena: A Realistic Web Environment for Building Autonomous Agents
Shuyan Zhou, Frank F Xu, Hao Zhu, Xuhui Zhou, Robert Lo, Abishek Sridhar, Xianyi Cheng, Tianyue Ou, Yonatan Bisk, Daniel Fried, et al. Webarena: A realistic web environment for building autonomous agents.arXiv preprint arXiv:2307.13854,
work page internal anchor Pith review Pith/arXiv arXiv
-
[22]
Kaijie Zhu, Yuzhou Nie, Yijiang Li, Yiming Huang, Jialian Wu, Jiang Liu, Ximeng Sun, Zhenfei Yin, Lun Wang, Zicheng Liu, et al. Termigen: High-fidelity environment and robust trajectory synthesis for terminal agents. arXiv preprint arXiv:2602.07274,
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.