OpenComputer: Verifiable Software Worlds for Computer-Use Agents

Arman Cohan; Guo Gan; Jinbiao Wei; Kangqi Ni; Qianran Ma; Xiao Zhou; Yilun Zhao

arxiv: 2605.19769 · v1 · pith:B57JKSGJnew · submitted 2026-05-19 · 💻 cs.AI · cs.SE

Pith reviewed 2026-05-20 06:24 UTC · model grok-4.3

keywords: computer-use agents · verifiable evaluation · desktop automation · state verifiers · LLM-as-judge · agent benchmarks · task generation · software worlds

The pith

App-specific hard-coded verifiers match human judgments more closely than LLM-as-judge methods when evaluating computer-use agents on fine-grained desktop tasks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents OpenComputer as a framework for creating verifiable software environments that support reliable testing of AI agents performing real desktop work. It combines app-specific state verifiers that inspect live application internals, a feedback loop to refine those verifiers, a pipeline for generating checkable tasks, and a harness that logs full agent trajectories while awarding partial credit. Experiments across 33 applications and 1,000 tasks show the hard-coded verifiers align better with human decisions than large-language-model judges, especially when success hinges on precise internal states rather than surface outputs. This setup exposes that current frontier agents achieve only partial progress and rarely finish end-to-end tasks, while open-source models drop sharply from their scores on prior benchmarks. Reliable, auditable evaluation matters because it lets researchers measure and close the gap between partial automation and robust, repeatable computer use.

Core claim

OpenComputer constructs verifiable software worlds by integrating four components: app-specific state verifiers that expose structured inspection endpoints over real applications, a self-evolving verification layer that refines reliability through execution-grounded feedback, a task-generation pipeline that produces realistic and machine-checkable desktop tasks, and an evaluation harness that records complete trajectories while computing auditable partial-credit rewards. The resulting system spans 33 applications and 1,000 finalized tasks and demonstrates that its hard-coded verifiers align more closely with human adjudication than LLM-as-judge evaluation, particularly when success depends on fine-grained application state.

What carries the argument

App-specific state verifiers that expose structured inspection endpoints over real applications, supplying precise, machine-checkable criteria for task success.
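
As a rough illustration of what checking structured application state (rather than a screenshot) might look like, the sketch below verifies a spreadsheet task against a state snapshot. The `SpreadsheetState` type, its fields, and `verify_cell_sum` are hypothetical names introduced here; the paper's actual endpoint schema is not described in this summary.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class SpreadsheetState:
    """Hypothetical structured snapshot returned by an inspection endpoint."""
    cells: dict[str, float] = field(default_factory=dict)  # e.g. {"A1": 2.0}
    saved: bool = False  # whether the document has been written to disk


def verify_cell_sum(state: SpreadsheetState, inputs: list[str], target: str) -> dict[str, bool]:
    """Check that `target` holds the sum of `inputs` and that the file is saved.

    Returns one boolean per criterion so each check can be audited separately.
    """
    expected = sum(state.cells.get(ref, 0.0) for ref in inputs)
    actual = state.cells.get(target)
    return {
        "target_present": actual is not None,
        "sum_correct": actual is not None and abs(actual - expected) < 1e-9,
        "document_saved": state.saved,
    }


# Assumed example: the sum is right on screen, but the file was never saved.
state = SpreadsheetState(cells={"A1": 2.0, "A2": 3.0, "A3": 5.0}, saved=False)
result = verify_cell_sum(state, ["A1", "A2"], "A3")
print(result)
```

Returning a named boolean per criterion, instead of a single pass/fail flag, is a design choice made for this sketch so that a failure can be traced to the specific state that was wrong.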

If this is right

Frontier agents make partial progress on tasks but rarely achieve full end-to-end completion.
Open-source models exhibit sharp performance drops relative to their scores on benchmarks such as OSWorld-Verified.
The evaluation harness enables computation of auditable partial-credit rewards based on recorded trajectories.
The task-generation pipeline produces realistic desktop tasks that remain machine-checkable through the verifiers.
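
To make the difference between partial progress and end-to-end completion concrete, here is a minimal sketch of a partial-credit score over verifier checks. The check names, the equal default weights, and the `partial_credit` function are assumptions for illustration, not the harness's actual reward definition.

```python
from typing import Dict, Optional


def partial_credit(checks: Dict[str, bool],
                   weights: Optional[Dict[str, float]] = None) -> float:
    """Weighted fraction of verifier checks that passed, in [0, 1]."""
    if not checks:
        raise ValueError("at least one check is required")
    if weights is None:
        weights = {name: 1.0 for name in checks}  # assumption: equal weights
    total = sum(weights[name] for name in checks)
    earned = sum(weights[name] for name, passed in checks.items() if passed)
    return earned / total


# Hypothetical outcome of one trajectory: two of four criteria satisfied.
checks = {
    "file_created": True,
    "header_row_correct": True,
    "chart_inserted": False,
    "document_saved": False,
}
reward = partial_credit(checks)
full_success = all(checks.values())
print(reward, full_success)  # 0.5 False
```

Under this scheme an agent can earn a nonzero reward while still failing the task outright, which is the pattern the summary attributes to frontier agents.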

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same verifier approach could be adapted to mobile or web-based agent environments where application state is similarly inspectable.
Accurate, execution-grounded rewards from the framework could support reinforcement-learning loops that directly optimize for verifiable task completion.
Persistent gaps in end-to-end success point to a need for agent architectures that maintain long-horizon state awareness across application boundaries.

Load-bearing premise

The assumption that app-specific state verifiers can be implemented to deliver accurate and complete structured inspection of live applications without introducing systematic bias or coverage gaps relative to human judgment.

What would settle it

A controlled comparison on a new set of tasks in which independent human raters score agent trajectories and the agreement rate of the hard-coded verifiers is measured directly against the agreement rate of LLM judges.
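
The comparison described above reduces to measuring how often each evaluator matches the human label on the same trajectories. The sketch below shows that calculation; the three label lists are made-up toy values used only to exercise the function and are not results from the paper.

```python
from typing import Sequence


def agreement_rate(reference: Sequence[int], candidate: Sequence[int]) -> float:
    """Fraction of items on which `candidate` matches `reference`."""
    if len(reference) != len(candidate) or not reference:
        raise ValueError("label sequences must be non-empty and equal in length")
    matches = sum(r == c for r, c in zip(reference, candidate))
    return matches / len(reference)


# Toy labels (1 = task judged successful, 0 = failed). Illustrative only.
human     = [1, 0, 0, 1, 0, 1, 0, 0]
verifier  = [1, 0, 0, 1, 0, 0, 0, 0]
llm_judge = [1, 1, 0, 1, 1, 1, 0, 1]

verifier_agreement = agreement_rate(human, verifier)
judge_agreement = agreement_rate(human, llm_judge)
print(verifier_agreement, judge_agreement)  # 0.875 0.625
```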

Figures

Figures reproduced from arXiv: 2605.19769 by Arman Cohan, Guo Gan, Jinbiao Wei, Kangqi Ni, Qianran Ma, Xiao Zhou, Yilun Zhao.

**Figure 2.** Example application endpoint specification

**Figure 3.** Alignment with human adjudication on a 120-task comparison set

**Figure 4.** A dense spreadsheet-style interface where the visual output looks almost correct, but the state is wrong:

**Figure 5.** A terminal-heavy workflow where the decisive evidence lives in log lines, exit codes, and filesystem

read the original abstract

We present OpenComputer, a verifier-grounded framework for constructing verifiable software worlds for computer-use agents. OpenComputer integrates four components: (1) app-specific state verifiers that expose structured inspection endpoints over real applications, (2) a self-evolving verification layer that improves verifier reliability using execution-grounded feedback, (3) a task-generation pipeline that synthesizes realistic and machine-checkable desktop tasks, and (4) an evaluation harness that records full trajectories and computes auditable partial-credit rewards. In its current form, OpenComputer covers 33 desktop applications and 1,000 finalized tasks spanning browsers, office tools, creative software, development environments, file managers, and communication applications. Experiments show that OpenComputer's hard-coded verifiers align more closely with human adjudication than LLM-as-judge evaluation, especially when success depends on fine-grained application state. Frontier agents struggle with end-to-end completion despite partial progress, and open-source models exhibit sharp drops from their OSWorld-Verified scores, exposing a persistent gap in robust computer automation.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

OpenComputer builds app-specific verifiers for desktop agent tasks and claims better human alignment than LLM judges, but the abstract leaves the quantitative support thin.

Hi, the main thing to know is that OpenComputer puts together hard-coded verifiers for 33 desktop applications to give machine-checkable signals on agent success, and it reports that these line up closer to human judgments than LLM-as-judge methods, especially on fine-grained state changes. The work also includes a self-evolving layer that refines the verifiers from execution feedback, a pipeline for generating checkable tasks, and an evaluation harness that scores full trajectories with partial credit. They scale this to 1,000 tasks across browsers, office tools, creative software, and dev environments.

That integrated setup and the reported performance drops for open models relative to OSWorld-Verified are the concrete pieces that stand out. The partial-credit approach and move away from pure model judging address a practical pain point in training computer-use agents.

The soft spots sit mostly in the evidence presented. The abstract states the alignment advantage without numbers, breakdowns by task type, or details on how the human comparisons were run, so the size of the improvement stays unclear. The stress-test point about verifier completeness also lands: for dynamic interfaces the structured endpoints may miss transient states like unsaved buffers or rendering shifts, and if task selection favored cases the verifiers handle cleanly, the reported edge could shrink.

This is aimed at groups building or benchmarking GUI agents who need reproducible signals. A reader focused on evaluation methods would get usable ideas from the framework even if the current results need more backing. I would send it for peer review; the core components are developed enough that referees could usefully pressure the validation details and coverage claims.

Referee Report

2 major / 2 minor

Summary. The manuscript presents OpenComputer, a verifier-grounded framework for computer-use agents. It integrates app-specific state verifiers exposing structured inspection endpoints over 33 real desktop applications, a self-evolving verification layer using execution-grounded feedback, a task-generation pipeline producing 1,000 machine-checkable tasks across browsers, office tools, creative software, and development environments, and an evaluation harness that records full trajectories with auditable partial-credit rewards. The central experimental claim is that these hard-coded verifiers align more closely with human adjudication than LLM-as-judge methods, particularly when success depends on fine-grained application state; the paper also reports that frontier agents struggle with end-to-end completion and open-source models exhibit sharp drops relative to OSWorld-Verified scores.

Significance. If the alignment and performance claims hold, the work would be a meaningful advance in agent evaluation by replacing unreliable LLM judges with verifiable, auditable metrics grounded in real application state. The emphasis on machine-checkable tasks, full-trajectory recording, and self-evolving verifiers provides concrete strengths for reproducibility and iterative improvement that are currently rare in computer-use agent benchmarks.

major comments (2)

[Experiments] The central claim that hard-coded verifiers align more closely with human adjudication than LLM-as-judge (especially for fine-grained state) is load-bearing yet unsupported by any reported quantitative metrics such as agreement rates, Cohen's kappa, or confusion matrices; without these numbers and a description of the human adjudication protocol (number of annotators, inter-annotator agreement, adjudication criteria), the superiority cannot be assessed.
[Verifier Architecture] The description of the 33 app-specific state verifiers asserts that they expose structured inspection endpoints whose outputs match human judgments on success, but no evidence or test is provided for completeness of coverage in dynamic UIs (e.g., transient unsaved buffers, modal focus changes, or network-induced rendering artifacts in browsers and creative software); if these endpoints rely on partial hooks such as accessibility trees or window properties, systematic omissions could artifactually inflate alignment scores.
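
For reference, the statistics named in the first comment can be computed with a few lines of standard-library Python. This is a generic sketch of Cohen's kappa and a confusion count for two label sequences; the example labels are toy values, not data from the manuscript.

```python
from collections import Counter
from typing import Hashable, Sequence


def cohens_kappa(a: Sequence[Hashable], b: Sequence[Hashable]) -> float:
    """Cohen's kappa: (p_o - p_e) / (1 - p_e) for two raters on the same items."""
    n = len(a)
    if n == 0 or n != len(b):
        raise ValueError("label sequences must be non-empty and equal in length")
    p_o = sum(x == y for x, y in zip(a, b)) / n  # observed agreement
    count_a, count_b = Counter(a), Counter(b)
    # Chance agreement from each rater's marginal label frequencies.
    p_e = sum(count_a[c] * count_b[c] for c in set(a) | set(b)) / (n * n)
    if p_e == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (p_o - p_e) / (1 - p_e)


def confusion_counts(reference: Sequence[Hashable], predicted: Sequence[Hashable]) -> Counter:
    """Count (reference_label, predicted_label) pairs."""
    return Counter(zip(reference, predicted))


# Toy labels (1 = success, 0 = failure). Illustrative only.
human = [1, 1, 0, 0]
verifier = [1, 0, 0, 0]
kappa = cohens_kappa(human, verifier)
matrix = confusion_counts(human, verifier)
print(kappa)   # 0.5
print(matrix)
```

In the toy case the raw agreement is 0.75 but kappa is 0.5, because half of the agreement would be expected by chance given the label frequencies; that gap is why the referee asks for kappa alongside plain agreement rates.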

minor comments (2)

[Abstract] The abstract states that open-source models exhibit 'sharp drops' from OSWorld-Verified scores but does not name the specific models or report the exact score deltas.
[Verification Layer] Notation for the self-evolving verification layer (e.g., how execution-grounded feedback is formalized) would benefit from a short pseudocode example or diagram.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We appreciate the referee's thorough review and valuable suggestions for improving the manuscript. The comments on the experimental validation of verifier alignment and the completeness of the verifier architecture are well-taken. We provide point-by-point responses below and indicate the revisions we plan to make.

Referee: [Experiments] The central claim that hard-coded verifiers align more closely with human adjudication than LLM-as-judge (especially for fine-grained state) is load-bearing yet unsupported by any reported quantitative metrics such as agreement rates, Cohen's kappa, or confusion matrices; without these numbers and a description of the human adjudication protocol (number of annotators, inter-annotator agreement, adjudication criteria), the superiority cannot be assessed.

Authors: We agree that quantitative metrics are essential to substantiate the alignment claim. The current manuscript presents the alignment observation qualitatively. In the revised version, we will expand the Experiments section to report agreement rates between hard-coded verifiers and human labels, Cohen's kappa for both inter-annotator agreement and verifier-human agreement, and confusion matrices comparing verifiers against LLM-as-judge. We will also add a detailed description of the human adjudication protocol, including the number of annotators, inter-annotator agreement statistics, and the specific criteria used for success labeling.

Revision: yes

Referee: [Verifier Architecture] The description of the 33 app-specific state verifiers asserts that they expose structured inspection endpoints whose outputs match human judgments on success, but no evidence or test is provided for completeness of coverage in dynamic UIs (e.g., transient unsaved buffers, modal focus changes, or network-induced rendering artifacts in browsers and creative software); if these endpoints rely on partial hooks such as accessibility trees or window properties, systematic omissions could artifactually inflate alignment scores.

Authors: The verifiers combine accessibility tree queries with application-specific automation interfaces and direct state inspection (e.g., file system checks for saved documents and browser DOM queries) to capture task-relevant state. We acknowledge that dynamic UI elements can introduce coverage gaps. In the revision, we will add a dedicated subsection on verifier implementation that discusses handling of transient states such as unsaved buffers and modal dialogs, provides concrete examples across app categories, and explicitly notes remaining limitations in highly dynamic scenarios. This will allow readers to better assess potential impacts on alignment scores.

Revision: yes
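
A file system check of the kind mentioned in this response could look roughly like the following. It is a sketch under assumptions: the function name, the file name, and the expected text are hypothetical, and a temporary directory stands in for the agent's working directory.

```python
import tempfile
from pathlib import Path


def saved_document_contains(path: Path, expected_text: str) -> bool:
    """True only if the file exists on disk and contains `expected_text`.

    An edit that lives only in an unsaved editor buffer fails this check,
    because the check reads the file system, not the screen.
    """
    if not path.is_file():
        return False
    return expected_text in path.read_text(encoding="utf-8")


with tempfile.TemporaryDirectory() as tmp:
    doc = Path(tmp) / "notes.txt"
    before_save = saved_document_contains(doc, "meeting at 3pm")
    doc.write_text("meeting at 3pm\n", encoding="utf-8")  # stands in for the agent saving
    after_save = saved_document_contains(doc, "meeting at 3pm")

print(before_save, after_save)  # False True
```

A check like this covers the saved-document case but says nothing about transient state such as modal dialogs, which is the coverage gap the referee raises.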

Circularity Check

0 steps flagged

No significant circularity; empirical claims rest on experimental comparisons without derivations or self-referential reductions

The paper describes a systems framework with four components and reports experimental results on verifier-human alignment across 33 applications and 1,000 tasks. No equations, parameter fittings, or mathematical derivations are present in the provided text. Claims about superior alignment of hard-coded verifiers versus LLM-as-judge are grounded in direct empirical comparisons rather than reducing to fitted inputs, self-definitions, or load-bearing self-citations. The evaluation harness and task-generation pipeline are presented as constructed artifacts whose performance is measured externally, leaving the central results self-contained and non-circular.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract provides insufficient technical detail to identify concrete free parameters, axioms, or invented entities; the framework description implies domain assumptions about verifier accuracy but does not specify them.

pith-pipeline@v0.9.0 · 5723 in / 1079 out tokens · 42832 ms · 2026-05-20T06:24:02.932241+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/AbsoluteFloorClosure.lean · reality_from_one_distinction · unclear

Relation between the paper passage and the cited Recognition theorem: unclear.

Paper passage: OpenComputer integrates four components: (1) app-specific state verifiers that expose structured inspection endpoints over real applications, (2) a self-evolving verification layer that improves verifier reliability using execution-grounded feedback...

IndisputableMonolith/Foundation/ArithmeticFromLogic.lean · LogicNat ≃ Nat recovery · unclear

Relation between the paper passage and the cited Recognition theorem: unclear.

Paper passage: Hard-coded verifiers align more closely with human adjudication than LLM-as-judge evaluation, especially when success depends on fine-grained application state.

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

