pith. machine review for the scientific record.

arxiv: 2604.06126 · v1 · submitted 2026-04-07 · 💻 cs.LG · cs.AI

Recognition: no theorem link

Gym-Anything: Turn any Software into an Agent Environment

Pith reviewed 2026-05-10 19:29 UTC · model grok-4.3

classification 💻 cs.LG cs.AI
keywords Gym-Anything · computer-use agents · environment creation · CUA-World · long-horizon tasks · vision-language models · multi-agent systems · benchmark

The pith

Gym-Anything frames environment creation as a multi-agent process that turns any software into a scalable computer-use agent environment.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows how to remove the main bottleneck in building computer-use agents by automating the creation of interactive environments for complex software. A coding agent writes setup scripts, downloads real data, and generates evidence of correct configuration, while a separate audit agent checks that evidence against a fixed quality checklist. Running the pipeline on 200 applications selected for broad occupational coverage produces CUA-World, a collection of more than 10,000 long-horizon tasks that often require hundreds of steps and span domains such as medicine, engineering, and enterprise systems. Successful trajectories from the training portion can be distilled into a 2-billion-parameter vision-language model that beats larger models, and the same auditing step applied at test time raises the success rate of an existing model on the hardest tasks from 11.5 percent to 14.0 percent.

Core claim

Gym-Anything treats environment creation itself as a multi-agent task: a coding agent produces setup scripts, downloads realistic data, and supplies evidence that the software is ready, while an independent audit agent verifies the evidence against a quality checklist. Applied to 200 software applications chosen according to a taxonomy of economically valuable occupations, the method yields CUA-World containing over 10,000 long-horizon tasks, each equipped with train and test splits and realistic data. Distilling successful trajectories from the training split into a 2B vision-language model produces performance superior to models twice as large, and reusing the auditing principle at test time raises Gemini-3-Flash on CUA-World-Long from 11.5% to 14.0%.

What carries the argument

Gym-Anything, the multi-agent framework in which a coding agent generates setup scripts and evidence while an independent audit agent verifies the setup against a quality checklist.
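
The creation-audit loop described above can be illustrated with a minimal Python sketch. It is not the paper's implementation: `creation_audit_loop`, `AuditResult`, `toy_create`, `toy_audit`, and `CHECKLIST` are hypothetical names, and the two agents are replaced by toy functions so the control flow can be run without any model.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Set, Tuple


@dataclass
class AuditResult:
    passed: bool
    feedback: List[str] = field(default_factory=list)


def creation_audit_loop(
    create: Callable[[List[str]], Set[str]],
    audit: Callable[[Set[str], List[str]], AuditResult],
    checklist: List[str],
    max_rounds: int = 5,
) -> Tuple[bool, int]:
    """Alternate creation and audit until the audit passes or rounds run out."""
    feedback: List[str] = []
    for round_idx in range(1, max_rounds + 1):
        evidence = create(feedback)          # creation agent builds and documents the setup
        result = audit(evidence, checklist)  # audit agent sees only evidence and checklist
        if result.passed:
            return True, round_idx
        feedback = result.feedback           # unmet items are fed back to the creator
    return False, max_rounds


# Hypothetical checklist items, for illustration only.
CHECKLIST = ["app_launches", "data_loaded", "screenshot_saved"]


def toy_create(feedback: List[str]) -> Set[str]:
    # Toy stand-in: always shows the app launching, then addresses any feedback.
    return {"app_launches"} | set(feedback)


def toy_audit(evidence: Set[str], checklist: List[str]) -> AuditResult:
    # Toy stand-in: an item passes only if matching evidence was supplied.
    missing = [item for item in checklist if item not in evidence]
    return AuditResult(passed=not missing, feedback=missing)


print(creation_audit_loop(toy_create, toy_audit, CHECKLIST))  # (True, 2)
```

The auditor in this sketch never inspects the environment directly, only the evidence it is handed, which is exactly the property the load-bearing premise below depends on.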

If this is right

  • Distilling trajectories from the training split of CUA-World into a 2B vision-language model yields performance that exceeds models twice its size.
  • Applying the same audit principle at test time improves an existing model’s success rate on long-horizon tasks from 11.5 percent to 14.0 percent.
  • The resulting collection of more than 10,000 long-horizon tasks across 200 applications supplies realistic data and splits for training and evaluating agents on economically relevant work.
  • Release of the full code, infrastructure, and benchmark data allows other researchers to extend the same pipeline to additional software.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same multi-agent creation process could be run on additional software categories to enlarge coverage of professional workflows.
  • Combining the long-horizon tasks with existing short-horizon benchmarks might produce training mixtures that improve both quick responses and complex multi-step behavior.
  • The auditing step may transfer to other agent-training pipelines where verifying intermediate states is expensive.

Load-bearing premise

The independent audit agent, given only the checklist and the evidence produced by the coding agent, can correctly identify successful environment setups without missing configuration errors or accepting incomplete ones.

What would settle it

Manual inspection or automated functionality tests on a random sample of the generated environments would find that a substantial fraction fail to run correctly despite having passed the audit step.
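
As a rough illustration of how such a check could be quantified, the sketch below computes a Wilson score interval for the fraction of audit-approved environments that fail manual inspection. The function name `wilson_interval` and the sample figures (5 failures in 50 inspected environments) are illustrative assumptions, not results from the paper.

```python
import math
from typing import Tuple


def wilson_interval(failures: int, sample_size: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a binomial proportion (z=1.96 gives about 95%)."""
    if sample_size <= 0 or not 0 <= failures <= sample_size:
        raise ValueError("need 0 <= failures <= sample_size and sample_size > 0")
    p = failures / sample_size
    z2 = z * z
    denom = 1 + z2 / sample_size
    center = (p + z2 / (2 * sample_size)) / denom
    half = z * math.sqrt(p * (1 - p) / sample_size + z2 / (4 * sample_size**2)) / denom
    return max(0.0, center - half), min(1.0, center + half)


# Hypothetical outcome: 5 of 50 audit-approved environments fail manual inspection.
low, high = wilson_interval(failures=5, sample_size=50)
print(f"false-accept rate: {5 / 50:.2f}, 95% interval: [{low:.3f}, {high:.3f}]")
```

With a sample of this size the interval is wide, so a small manual audit can rule out a very high false-accept rate but cannot certify a very low one.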

Figures

Figures reproduced from arXiv: 2604.06126 by Graham Neubig, Pranjal Aggarwal, Sean Welleck.

Figure 1
Figure 1: Built with Gym-Anything, CUA-World covers all major occupation groups and industries, spanning 10K+ long-horizon tasks and environments across 200 software applications, dramatically expanding the scope of computer-use agent evaluation and training.
Figure 2
Figure 2: Overview of the Gym-Anything pipeline. Phase 1: We select ∼200 economically important software applications grounded in GDP data, balancing high economic impact with broad coverage across occupations, industries, and software categories. Phase 2: Each software is converted into an interactive environment via a creation-audit loop, in which a creation agent iteratively builds and verifies the environment, w…
Figure 3
Figure 3: GDP-grounded software selection pipeline. Starting from U.S. occupational data, we estimate per-software GDP, filter to sandboxable candidates, and apply tiered selection to yield 200 software applications. 2 Methodology In this section, we introduce the problem setup, the GDP-grounded software selection procedure, and the library abstraction that makes large-scale environment construction possible. In Sec…
Figure 4
Figure 4: The Gym-Anything creation-audit loop. A Creation Agent writes setup scripts and produces evidence documents (screenshots, logs, etc.) while an Audit Agent evaluates this evidence against quality checklists and returns feedback. Learnings accumulate in a shared memory M, which a Summarization Agent periodically condenses so that newer environments are created faster. operating systems. This specification is…
Figure 5
Figure 6
Figure 6: Scaling behavior on CUA-World. (a) Training data scaling on CUA-World-Test: varying the number of software (50, 100, 200) or the fraction of tasks (25%, 50%, 100%). Both axes improve with scale, following a roughly log-linear trend. (b) Test-time compute scaling on CUA-World-Long: pass rate as a function of average steps taken per task, where each point corresponds to a different maximum step budget (50, 1…
Figure 7
Figure 7: Generalization to seen (IID) vs. unseen (OOD) software. We train models on 25% (left) and 50% (right) of the 200 software applications, and evaluate on the training software applications (IID) and the held-out software applications (OOD). Each bar spans from the untrained baseline (bottom) to the model trained on all software (top). The solid portion shows the gain recovered by the model trained on the sub…
Figure 9
Figure 9: Properties of CUA-World-Long. (a) Distribution of average steps per task. The y-axis is broken to accommodate the spike at the 500-step cap. (b) Distribution of per-task average checklist scores for Gemini 3 Flash and Kimi-K 2.5. CUA-World-Long tasks require a minimum number of steps (>100) before the agent can complete them at all. Increasing compute beyond that continues to help, reaching 11.5% at ∼1,300…
Figure 8
Figure 8: Behavioral patterns in passed vs. failed trajectories across Gemini-3-Flash evaluations on CUA-World. See Appendix for the full set of 15 patterns. Trajectory Behavioral Patterns: Failed trajectories are dominated by retry loops, while successful ones verify their progress more often. To understand how agents behave on CUA-World, we analyze all trajectories from Gemini-3-Flash evaluated on CUA-World, usi…
Figure 10
Figure 10: Pass rate on CUA-World-Test by software category. (a) Visual complexity and (b) domain knowledge. See Appendix I for category definitions and assignment of software to categories. strongest models (Gemini 3 Flash, mean 44; Kimi-K 2.5, mean 35). Outside of the 0-10 bin, scores are spread fairly evenly across the range, indicating that CUA-World-Long contains tasks at every difficulty level rather than bei…
Figure 11
Figure 11: Step-weighted pattern intensity across all 15 discovered behavioral patterns. For each …
Figure 12
Figure 12: Pattern presence rate across all 15 discovered behavioral patterns. For each pattern, bars …
read the original abstract

Computer-use agents hold the promise of assisting in a wide range of digital economic activities. However, current research has largely focused on short-horizon tasks over a limited set of software with limited economic value, such as basic e-commerce and OS-configuration tasks. A key reason is that creating environments for complex software requires significant time and human effort, and therefore does not scale. To address this, we introduce Gym-Anything, a framework for converting any software into an interactive computer-use environment. We frame environment creation itself as a multi-agent task: a coding agent writes setup scripts, downloads real-world data, and configures the software, while producing evidence of correct setup. An independent audit agent then verifies evidence for the environment setup against a quality checklist. Using a taxonomy of economically valuable occupations grounded in U.S. GDP data, we apply this pipeline to 200 software applications with broad occupational coverage. The result is CUA-World, a collection of over 10K long-horizon tasks spanning domains from medical science and astronomy to engineering and enterprise systems, each configured with realistic data along with train and test splits. CUA-World also includes CUA-World-Long, a challenging long-horizon benchmark with tasks often requiring over 500 steps, far exceeding existing benchmarks. Distilling successful trajectories from the training split into a 2B vision-language model outperforms models 2× its size. We also apply the same auditing principle at test time: a separate VLM reviews completed trajectories and provides feedback on what remains, improving Gemini-3-Flash on CUA-World-Long from 11.5% to 14.0%. We release all code, infrastructure, and benchmark data to facilitate future research in realistic computer-use agents.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces Gym-Anything, a multi-agent framework (coding agent + independent audit agent) for automatically converting arbitrary software into interactive computer-use agent (CUA) environments. Applying the pipeline to 200 applications selected via a U.S. GDP-grounded occupational taxonomy yields CUA-World: a benchmark of >10K long-horizon tasks with realistic data, train/test splits, and a harder CUA-World-Long subset (tasks often >500 steps). The authors show that distilling successful trajectories into a 2B VLM outperforms models twice its size and that test-time VLM auditing lifts Gemini-3-Flash from 11.5% to 14.0% on CUA-World-Long. All code, infrastructure, and data are released.

Significance. If the generated environments are verifiably correct, this work would be significant for the field: it offers a scalable, low-human-effort method to produce diverse, economically relevant long-horizon CUA benchmarks far beyond current short-horizon, narrow-scope suites. The release of code and data, the distillation result, and the test-time auditing technique are concrete strengths that could accelerate reproducible research on realistic computer-use agents.

major comments (2)
  1. [§4 and §3.2] §4 (CUA-World construction) and §3.2 (audit agent): The claim that the 200 environments are correctly configured with realistic data and functional long-horizon tasks rests entirely on the automated audit agent applying a quality checklist to evidence produced by the coding agent. No human audit of a statistically meaningful sample is reported, nor are false-positive or false-negative rates for the audit agent quantified. This is load-bearing for the central benchmark-validity claim.
  2. [§5] §5 (experiments): Success rates (including the 11.5% → 14.0% lift and the 2B VLM outperforming larger models) are presented without an explicit definition of task success, how partial progress is scored on >500-step trajectories, or inter-rater reliability for the audit agent at test time. These details are required to interpret the numerical results.
minor comments (2)
  1. [§4.1] The occupational taxonomy and its mapping to the 200 applications could be described with greater precision (e.g., explicit list or selection criteria) to allow readers to assess domain coverage.
  2. [Figures in §5] Figure captions and axis labels in the experimental plots should explicitly state the evaluation metric and number of runs.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback, which highlights important aspects of benchmark validity and experimental reporting. We address each major comment below and will revise the manuscript accordingly to strengthen these sections.

read point-by-point responses
  1. Referee: [§4 and §3.2] §4 (CUA-World construction) and §3.2 (audit agent): The claim that the 200 environments are correctly configured with realistic data and functional long-horizon tasks rests entirely on the automated audit agent applying a quality checklist to evidence produced by the coding agent. No human audit of a statistically meaningful sample is reported, nor are false-positive or false-negative rates for the audit agent quantified. This is load-bearing for the central benchmark-validity claim.

    Authors: We agree that the automated audit is central to the validity claim and that quantifying its reliability strengthens the work. The audit agent operates independently on concrete evidence (setup logs, data files, and screenshots) against a fixed checklist, but we acknowledge the absence of human validation in the current manuscript. In revision, we will add a human audit study: two independent human reviewers will evaluate a random sample of 50 environments (25 accepted, 25 rejected by the audit agent) for correctness of setup, data realism, and task functionality. We will report inter-annotator agreement and estimate false-positive and false-negative rates of the audit agent. These results and methodology will be added to §4 and §3.2. revision: yes

  2. Referee: [§5] §5 (experiments): Success rates (including the 11.5% → 14.0% lift and the 2B VLM outperforming larger models) are presented without an explicit definition of task success, how partial progress is scored on >500-step trajectories, or inter-rater reliability for the audit agent at test time. These details are required to interpret the numerical results.

    Authors: We appreciate this point on clarity. Task success is defined as binary completion: the agent produces the expected final state or output as verified by the audit agent (e.g., correct file generated, query result matches target, or application state satisfies the goal condition). No partial credit is assigned; trajectories are scored as success only if the audit confirms goal achievement within the maximum step limit. For CUA-World-Long tasks (>500 steps), the same binary criterion applies, with the audit agent checking the terminal state rather than intermediate progress. For the test-time audit agent, we will add inter-rater reliability by running the audit on a sample of 100 trajectories with two independent VLMs and reporting agreement metrics (e.g., Cohen's kappa). These explicit definitions and metrics will be added to §5. revision: yes
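
The agreement metric named in the second response, Cohen's kappa, can be computed as in the following sketch. The function `cohens_kappa` and the eight example verdicts are illustrative assumptions. The labels stand for binary pass/fail judgments from two independent auditors on the same trajectories.

```python
from collections import Counter
from typing import Hashable, Sequence


def cohens_kappa(rater_a: Sequence[Hashable], rater_b: Sequence[Hashable]) -> float:
    """Cohen's kappa: agreement between two raters, corrected for chance agreement."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("both raters must label the same non-empty set of items")
    n = len(rater_a)
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    labels = set(counts_a) | set(counts_b)
    # Chance agreement: probability both raters pick the same label independently.
    expected = sum(counts_a[label] * counts_b[label] for label in labels) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used a single identical label throughout
    return (observed - expected) / (1 - expected)


# Hypothetical verdicts (1 = success, 0 = failure) from two auditors on 8 trajectories.
auditor_1 = [1, 1, 0, 0, 0, 0, 1, 0]
auditor_2 = [1, 0, 0, 0, 0, 0, 1, 0]
print(f"kappa = {cohens_kappa(auditor_1, auditor_2):.3f}")
```

Because success rates on CUA-World-Long are low, raw agreement between auditors would be high even if both simply marked most trajectories as failures. Kappa discounts that chance agreement, which makes it a more informative statistic in this setting.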

Circularity Check

0 steps flagged

No circularity: framework and empirical benchmark construction without derivations or self-referential predictions

full rationale

The manuscript presents Gym-Anything as an engineering framework that applies a multi-agent coding-plus-audit pipeline to 200 software packages, yielding the CUA-World benchmark and associated empirical results (trajectory distillation and test-time auditing). No equations, parameter fits, uniqueness theorems, or first-principles derivations appear in the provided text. Performance numbers are reported as direct observations on the constructed tasks rather than predictions that reduce to the inputs by construction. The audit agent's role is a methodological assumption whose validity is external to any internal derivation chain.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claim rests on the assumption that LLM-based coding and audit agents can produce and verify correct software environments at scale; no free parameters, new physical entities, or ad-hoc axioms beyond standard multi-agent reliability are introduced in the abstract.

axioms (1)
  • domain assumption: LLM agents can write correct setup scripts, and an independent LLM auditor can reliably verify them against a checklist
    Invoked in the description of the multi-agent pipeline for environment creation.

pith-pipeline@v0.9.0 · 5623 in / 1268 out tokens · 66782 ms · 2026-05-10T19:29:59.458329+00:00 · methodology

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Securing Computer-Use Agents: A Unified Architecture-Lifecycle Framework for Deployment-Grounded Reliability

    cs.CL 2026-05 unverdicted novelty 4.0

    The paper develops a unified framework that organizes computer-use agent reliability around perception-decision-execution layers and creation-deployment-operation-maintenance stages to map security and alignment inter...

Reference graph

Works this paper leans on

90 extracted references · 10 canonical work pages · cited by 1 Pith paper · 5 internal anchors

  1. [1]

    The simple macroeconomics of ai.SSRN Electronic Journal, 2024

    Daron Acemoglu. The simple macroeconomics of ai.SSRN Electronic Journal, 2024

  2. [2]

    Programming with pixels: Can computer-use agents do software engineering?arXiv preprint arXiv:2502.18525, 2025

    Pranjal Aggarwal and Sean Welleck. Programming with pixels: Can computer-use agents do software engineering?arXiv preprint arXiv:2502.18525, 2025

  3. [3]

    The claude model family, 2025

    Anthropic. The claude model family, 2025

  4. [4]

    Digirl: Training in-the-wild device-control agents with autonomous reinforcement learning, 2024

    Hao Bai, Yifei Zhou, Mert Cemri, Jiayi Pan, Alane Suhr, Sergey Levine, and Aviral Kumar. Digirl: Training in-the-wild device-control agents with autonomous reinforcement learning, 2024

  5. [5]

    Qwen2.5-VL Technical Report

    Shuai Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Sibo Song, Kai Dang, Peng Wang, Shijie Wang, Jun Tang, Humen Zhong, Yuanzhi Zhu, Mingkun Yang, Zhaohai Li, Jianqiang Wan, Pengfei Wang, Wei Ding, Zheren Fu, Yiheng Xu, Jiabo Ye, Xi Zhang, Tianbao Xie, Zesen Cheng, Hang Zhang, Zhibo Yang, Haiyang Xu, and Junyang Lin. Qwen2.5-VL technical report. a...

  6. [6]

    WorkArena++: Towards compositional planning and reasoning-based common knowledge work tasks

    Léo Boisvert, Megh Thakkar, Maxime Gasse, Massimo Caccia, Thibault Le Sellier De Chezelles, Quentin Cappart, Nicolas Chapados, Alexandre Lacoste, and Alexandre Drouin. WorkArena++: Towards compositional planning and reasoning-based common knowledge work tasks. In A. Globerson, L. Mackey, D. Belgrave, A. Fan, U. Paquet, J. Tomczak, and C. Zhang, editors, A...

  7. [7]

    Windows Agent Arena: Evaluating multi-modal OS agents at scale

    Rogerio Bonatti, Dan Zhao, Francesco Bonacci, Dillon Dupont, Sara Abdali, Yinheng Li, Yadong Lu, Justin Wagle, Kazuhito Koishida, Arthur Bucker, Lawrence Keunho Jang, and Zheng Hui. Windows Agent Arena: Evaluating multi-modal OS agents at scale. In Aarti Singh, Maryam Fazel, Daniel Hsu, Simon Lacoste-Julien, Felix Berkenkamp, Tegan Maharaj, Kiri Wagstaff,...

  8. [8]

    OpenAI Gym, 2016

    Greg Brockman, Vicki Cheung, Ludwig Pettersson, Jonas Schneider, John Schulman, Jie Tang, and Wojciech Zaremba. OpenAI Gym, 2016

  9. [9]

    Spider2-V: How far are multimodal agents from automat- ing data science and engineering workflows? In A

    Ruisheng Cao, Fangyu Lei, Haoyuan Wu, Jixuan Chen, Yeqiao Fu, Hongcheng Gao, Xinzhuang Xiong, Hanchong Zhang, Yuchen Mao, Wenjing Hu, Tianbao Xie, Hongshen Xu, Danyang Zhang, Sida Wang, Ruoxi Sun, Pengcheng Yin, Caiming Xiong, Ansong Ni, Qian Liu, Victor Zhong, Lu Chen, Kai Yu, and Tao Yu. Spider2-V: How far are multimodal agents from automat- ing data sc...

  10. [10]

    Gui-genesis: Automated synthesis of efficient environments with verifiable rewards for gui agent post-training, 2026

    Yuan Cao, Dezhi Ran, Mengzhou Wu, Yuzhe Guo, Xin Chen, Ang Li, Gang Cao, Gong Zhi, Hao Yu, Linyi Li, Wei Yang, and Tao Xie. Gui-genesis: Automated synthesis of efficient environments with verifiable rewards for gui agent post-training, 2026

  11. [11]

    Mind2Web: Towards a generalist agent for the web

    Xiang Deng, Yu Gu, Boyuan Zheng, Shijie Chen, Sam Stevens, Boshi Wang, Huan Sun, and Yu Su. Mind2Web: Towards a generalist agent for the web. In A. Oh, T. Naumann, A. Globerson, K. Saenko, M. Hardt, and S. Levine, editors,Advances in Neural Information Processing Systems, volume 36, pages 28091–28114. Curran Associates, Inc., 2023

  12. [12]

    Laradji, Manuel Del Verme, Tom Marty, David Vazquez, Nicolas Chapados, and Alexandre Lacoste

    Alexandre Drouin, Maxime Gasse, Massimo Caccia, Issam H. Laradji, Manuel Del Verme, Tom Marty, David Vazquez, Nicolas Chapados, and Alexandre Lacoste. WorkArena: How capable are web agents at solving common knowledge work tasks? In Ruslan Salakhutdinov, Zico Kolter, Katherine Heller, Adrian Weller, Nuria Oliver, Jonathan Scarlett, and Felix Berkenkamp, ed...

  13. [13]

    Score the steps, not just the goal: Vlm-based subgoal evaluation for robotic manipulation, 2025

    Ramy ElMallah, Krish Chhajer, and Chi-Guhn Lee. Score the steps, not just the goal: Vlm-based subgoal evaluation for robotic manipulation, 2025. 19

  14. [14]

    Gpts are gpts: An early look at the labor market impact potential of large language models, 2023

    Tyna Eloundou, Sam Manning, Pamela Mishkin, and Daniel Rock. Gpts are gpts: An early look at the labor market impact potential of large language models, 2023

  15. [15]

    Occupational, industry, and geographic exposure to artificial intelligence: A novel dataset and its potential uses.Strategic Management Journal, 42(12):2195–2217, 2021

    Edward Felten, Manav Raj, and Robert Seamans. Occupational, industry, and geographic exposure to artificial intelligence: A novel dataset and its potential uses.Strategic Management Journal, 42(12):2195–2217, 2021

  16. [16]

    Carl Benedikt Frey and Michael A. Osborne. The future of employment: How susceptible are jobs to computerisation?Technological Forecasting and Social Change, 114:254–280, 2017

  17. [17]

    Evilgenie: A reward hacking bench- mark, 2025

    Jonathan Gabor, Jayson Lynch, and Jonathan Rosenfeld. Evilgenie: A reward hacking bench- mark, 2025

  18. [18]

    AssistGUI: Task-oriented PC graphical user interface automation

    Difei Gao, Lei Ji, Zechen Bai, Mingyu Ouyang, Peiran Li, Dongxing Mao, Qinchen Wu, Weichen Zhang, Peiyi Wang, Xiangwu Guo, Hengxu Wang, Luowei Zhou, and Mike Zheng Shou. AssistGUI: Task-oriented PC graphical user interface automation. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 13289–13298, June 2024

  19. [19]

    Efficient agent training for computer use, 2026

    Yanheng He, Jiahe Jin, and Pengfei Liu. Efficient agent training for computer use, 2026

  20. [20]

    PC agent: While you sleep, AI works – a cognitive journey into digital world

    Yanheng He, Jiahe Jin, Shijie Xia, Jiadi Su, Runze Fan, Haoyang Zou, Xiangkun Hu, and Pengfei Liu. PC agent: While you sleep, AI works – a cognitive journey into digital world. arXiv preprint arXiv:2412.17589, 2024

  21. [21]

    Agentgen: Enhancing planning abilities for large language model based agent via environment and task generation, 2025

    Mengkang Hu, Pu Zhao, Can Xu, Qingfeng Sun, Jianguang Lou, Qingwei Lin, Ping Luo, and Saravan Rajmohan. Agentgen: Enhancing planning abilities for large language model based agent via environment and task generation, 2025

  22. [22]

    OmniACT: A dataset and benchmark for enabling multi- modal generalist autonomous agents for desktop and web

    Raghav Kapoor, Yash Parag Butala, Melisa Russak, Jing Yu Koh, Kiran Kamble, Waseem AlShikh, and Ruslan Salakhutdinov. OmniACT: A dataset and benchmark for enabling multi- modal generalist autonomous agents for desktop and web. InComputer Vision – ECCV 2024, pages 161–178. Springer Nature Switzerland, 2024

  23. [23]

    VisualWebArena: Evaluating multimodal agents on realistic visual web tasks

    Jing Yu Koh, Robert Lo, Lawrence Jang, Vikram Duvvur, Ming Lim, Po-Yu Huang, Graham Neubig, Shuyan Zhou, Russ Salakhutdinov, and Daniel Fried. VisualWebArena: Evaluating multimodal agents on realistic visual web tasks. In Lun-Wei Ku, Andre Martins, and Vivek Srikumar, editors,Proceedings of the 62nd Annual Meeting of the Association for Computa- tional Li...

  24. [24]

    On the effects of data scale on UI control agents

    Wei Li, William Bishop, Alice Li, Chris Rawles, Folawiyo Campbell-Ajala, Divya Tyamagundlu, and Oriana Riva. On the effects of data scale on UI control agents. In A. Globerson, L. Mackey, D. Belgrave, A. Fan, U. Paquet, J. Tomczak, and C. Zhang, editors,Advances in Neural Information Processing Systems, volume 37, pages 92130–92154. Curran Associates, Inc., 2024

  25. [25]

    Wildbench: Benchmarking llms with challenging tasks from real users in the wild, 2024

    Bill Yuchen Lin, Yuntian Deng, Khyathi Chandu, Faeze Brahman, Abhilasha Ravichander, Valentina Pyatkin, Nouha Dziri, Ronan Le Bras, and Yejin Choi. Wildbench: Benchmarking llms with challenging tasks from real users in the wild, 2024

  26. [26]

    Reinforcement Learning on Web Interfaces Using Workflow-Guided Exploration

    Evan Zheran Liu, Kelvin Guu, Panupong Pasupat, Tianlin Shi, and Percy Liang. Reinforcement learning on web interfaces using workflow-guided exploration. InInternational Conference on Learning Representations, 2018. ICLR 2018; arXiv:1802.08802

  27. [27]

    Liu, Kevin Lin, John Hewitt, Ashwin Paranjape, Michele Bevilacqua, Fabio Petroni, and Percy Liang

    Nelson F. Liu, Kevin Lin, John Hewitt, Ashwin Paranjape, Michele Bevilacqua, Fabio Petroni, and Percy Liang. Lost in the middle: How language models use long contexts.Transactions of the Association for Computational Linguistics, 12:157–173, 2024

  28. [28]

    Agentinstruct: Toward generative teaching with agentic flows, 2024

    Arindam Mitra, Luciano Del Corro, Guoqing Zheng, Shweti Mahajan, Dany Rouhana, Andres Codas, Yadong Lu, Wei ge Chen, Olga Vrousgos, Corby Rosset, Fillipe Silva, Hamed Khanpour, Yash Lara, and Ahmed Awadallah. Agentinstruct: Toward generative teaching with agentic flows, 2024. 20

  29. [29]

    Bagel: Bootstrapping agents by guiding exploration with language, 2024

    Shikhar Murty, Christopher Manning, Peter Shaw, Mandar Joshi, and Kenton Lee. Bagel: Bootstrapping agents by guiding exploration with language, 2024

  30. [30]

    Explorer: Scaling exploration-driven web trajectory synthesis for multimodal web agents, 2025

    Vardaan Pahuja, Yadong Lu, Corby Rosset, Boyu Gou, Arindam Mitra, Spencer Whitehead, Yu Su, and Ahmed Awadallah. Explorer: Scaling exploration-driven web trajectory synthesis for multimodal web agents, 2025

  31. [31]

    Training software engineering agents and verifiers with swe-gym, 2025

    Jiayi Pan, Xingyao Wang, Graham Neubig, Navdeep Jaitly, Heng Ji, Alane Suhr, and Yizhe Zhang. Training software engineering agents and verifiers with swe-gym, 2025

  32. [32]

    Au- tonomous evaluation and refinement of digital agents, 2024

    Jiayi Pan, Yichi Zhang, Nicholas Tomlin, Yifei Zhou, Sergey Levine, and Alane Suhr. Au- tonomous evaluation and refinement of digital agents, 2024

  33. [33]

    Kim, Patrick Chao, Samuel Miserendino, Gildas Chabot, David Li, Michael Sharman, Alexandra Barr, Amelia Glaese, and Jerry Tworek

    Tejal Patwardhan, Rachel Dias, Elizabeth Proehl, Grace Kim, Michele Wang, Olivia Watkins, Simón Posada Fishman, Marwan Aljubeh, Phoebe Thacker, Laurance Fauconnet, Natalie S. Kim, Patrick Chao, Samuel Miserendino, Gildas Chabot, David Li, Michael Sharman, Alexandra Barr, Amelia Glaese, and Jerry Tworek. Gdpval: Evaluating ai model performance on real-worl...

  34. [34]

    Peterson, Michael D

    Norman G. Peterson, Michael D. Mumford, Walter C. Borman, P. Richard Jeanneret, Edwin A. Fleishman, Kerry Y . Levin, Michael A. Campion, Melinda S. Mayfield, Frederick P. Morgeson, Kenneth Pearlman, Marilyn K. Gowing, Anita R. Lancaster, Marilyn B. Silver, and Donna M. Dye. Understanding work using the occupational information network (O*NET): Implication...

  35. [35]

    Webrl: Training llm web agents via self-evolving online curriculum reinforcement learning, 2025

    Zehan Qi, Xiao Liu, Iat Long Iong, Hanyu Lai, Xueqiao Sun, Wenyi Zhao, Yu Yang, Xinyue Yang, Jiadai Sun, Shuntian Yao, Tianjie Zhang, Wei Xu, Jie Tang, and Yuxiao Dong. Webrl: Training llm web agents via self-evolving online curriculum reinforcement learning, 2025

  36. [36]

    Ui-tars: Pioneering automated gui interaction with native agents, 2025

    Yujia Qin, Yining Ye, Junjie Fang, Haoming Wang, Shihao Liang, Shizuo Tian, Junda Zhang, Jiahao Li, Yunxin Li, Shijue Huang, Wanjun Zhong, Kuanye Li, Jiale Yang, Yu Miao, Woyu Lin, Longxiang Liu, Xu Jiang, Qianli Ma, Jingyu Li, Xiaojun Xiao, Kai Cai, Chuang Li, Yaowei Zheng, Chaolin Jin, Chen Li, Xiao Zhou, Minchao Wang, Haoli Chen, Zhaojian Li, Haihua Ya...

  37. [37]

    AndroidWorld: A Dynamic Benchmarking Environment for Autonomous Agents

    Christopher Rawles, Sarah Clinckemaillie, Yifan Chang, Jonathan Waltz, Gabrielle Lau, Mary- beth Fair, Alice Li, William Bishop, Wei Li, Folawiyo Campbell-Ajala, Daniel Toyama, Robert Berry, Divya Tyamagundlu, Timothy Lillicrap, and Oriana Riva. AndroidWorld: A dynamic benchmarking environment for autonomous agents. InThe Thirteenth International Conferen...

  38. [38]

    An- droidInTheWild: A large-scale dataset for Android device control

    Christopher Rawles, Alice Li, Daniel Rodriguez, Oriana Riva, and Timothy Lillicrap. An- droidInTheWild: A large-scale dataset for Android device control. In A. Oh, T. Naumann, A. Globerson, K. Saenko, M. Hardt, and S. Levine, editors,Advances in Neural Information Processing Systems, volume 36, pages 59708–59728. Curran Associates, Inc., 2023

  39. [39]

    The illusion of diminishing returns: Measuring long horizon execution in llms, 2026

    Akshit Sinha, Arvindh Arun, Shashwat Goel, Steffen Staab, and Jonas Geiping. The illusion of diminishing returns: Measuring long horizon execution in llms, 2026

  40. [40]

    OS-Genesis: Automating GUI agent trajectory construction via reverse task synthesis, 2025

    Qiushi Sun, Kanzhi Cheng, Zichen Ding, Chuanyang Jin, Yian Wang, Fangzhi Xu, Zhenyu Wu, Chengyou Jia, Liheng Chen, Zhoumianze Liu, Ben Kao, Guohao Li, Junxian He, Yu Qiao, and Zhiyong Wu. OS-Genesis: Automating GUI agent trajectory construction via reverse task synthesis, 2025

  41. [41]

    ScienceBoard: Evaluating Multimodal Autonomous Agents in Realistic Scientific Workflows

    Qiushi Sun, Zhoumianze Liu, Chang Ma, Zichen Ding, Fangzhi Xu, Zhangyue Yin, Haiteng Zhao, Zhenyu Wu, Kanzhi Cheng, Zhaoyang Liu, Jianing Wang, Qintong Li, Xiangru Tang, Tianbao Xie, Xiachong Feng, Xiang Li, Ben Kao, Wenhai Wang, Biqing Qi, Lingpeng Kong, and Zhiyong Wu. ScienceBoard: Evaluating multimodal autonomous agents in realistic scientific workflo...

  42. [42]

    SEAgent: Self-evolving computer use agent with autonomous learning from experience, 2025

    Zeyi Sun, Ziyu Liu, Yuhang Zang, Yuhang Cao, Xiaoyi Dong, Tong Wu, Dahua Lin, and Jiaqi Wang. SEAgent: Self-evolving computer use agent with autonomous learning from experience, 2025

  43. [43]

    Kimi Team, Tongtong Bai, Yifan Bai, Yiping Bao, S. H. Cai, Yuan Cao, Y. Charles, H. S. Che, Cheng Chen, Guanduo Chen, Huarong Chen, Jia Chen, Jiahao Chen, Jianlong Chen, Jun Chen, Kefan Chen, Liang Chen, Ruijue Chen, Xinhao Chen, Yanru Chen, Yanxu Chen, Yicun Chen, Yimin Chen, Yingjiang Chen, Yuankun Chen, Yujie Chen, Yutian Chen, Zhirong Chen, Ziwei Che...

  44. [44]

    Mark Towers, Ariel Kwiatkowski, Jordan Terry, John U. Balis, Gianluca De Cola, Tristan Deleu, Manuel Goulão, Andreas Kallinteris, Markus Krimmel, Arjun KG, Rodrigo Perez-Vicente, Andrea Pierré, Sander Schulhoff, Jun Jet Tai, Hannah Tan, and Omar G. Younis. Gymnasium: A standard interface for reinforcement learning environments. arXiv preprint arXiv:2407.17...

  45. [45]

    Bureau of Economic Analysis

    U.S. Bureau of Economic Analysis. National income and product accounts (NIPA). U.S. Department of Commerce, 2024. Interactive data tables, annual estimates. Accessed February 22, 2025

  46. [46]

    Bureau of Labor Statistics

    U.S. Bureau of Labor Statistics. Occupational employment and wage statistics (OEWS). U.S. Department of Labor, 2024. May 2024 estimates. Accessed February 2025

  47. [47]

    Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution

    Peng Wang, Shuai Bai, Sinan Tan, Shijie Wang, Zhihao Fan, Jinze Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Yang Fan, Kai Dang, Mengfei Du, Xuancheng Ren, Rui Men, Dayiheng Liu, Chang Zhou, Jingren Zhou, and Junyang Lin. Qwen2-VL: Enhancing vision-language model’s perception of the world at any resolution. arXiv preprint arXiv:2409.12191, 2024

  48. [48]

    Xinyuan Wang, Bowen Wang, Dunjie Lu, Junlin Yang, Tianbao Xie, Junli Wang, Jiaqi Deng, Xiaole Guo, Yiheng Xu, Chen Henry Wu, Zhennan Shen, Zhuokai Li, Ryan Li, Xiaochuan Li, Junda Chen, Boyuan Zheng, Peihang Li, Fangyu Lei, Ruisheng Cao, Yeqiao Fu, Dongchan Shin, Martin Shin, Jiarui Hu, Yuyan Wang, Jixuan Chen, Yuxiao Ye, Danyang Zhang, Dikang Du, Hao Hu,...

  49. [49]

    Self-instruct: Aligning language models with self-generated instructions, 2023

    Yizhong Wang, Yeganeh Kordi, Swaroop Mishra, Alisa Liu, Noah A. Smith, Daniel Khashabi, and Hannaneh Hajishirzi. Self-instruct: Aligning language models with self-generated instructions, 2023

  50. [50]

    Agent world model: Infinity synthetic environments for agentic reinforcement learning, 2026

    Zhaoyang Wang, Canwen Xu, Boyi Liu, Yite Wang, Siwei Han, Zhewei Yao, Huaxiu Yao, and Yuxiong He. Agent world model: Infinity synthetic environments for agentic reinforcement learning, 2026

  51. [51]

    How well does agent development reflect real-world work?, 2026

    Zora Zhiruo Wang, Sanidhya Vijayvargiya, Aspen Chen, Hanmo Zhang, Venu Arvind Arangarajan, Jett Chen, Valerie Chen, Diyi Yang, Daniel Fried, and Graham Neubig. How well does agent development reflect real-world work?, 2026

  52. [52]

    OS-ATLAS: A foundation action model for generalist GUI agents, 2024

    Zhiyong Wu, Zhenyu Wu, Fangzhi Xu, Yian Wang, Qiushi Sun, Chengyou Jia, Kanzhi Cheng, Zichen Ding, Liheng Chen, Paul Pu Liang, and Yu Qiao. OS-ATLAS: A foundation action model for generalist GUI agents, 2024

  53. [53]

    OSWorld: Benchmarking multimodal agents for open-ended tasks in real computer environments

    Tianbao Xie, Danyang Zhang, Jixuan Chen, Xiaochuan Li, Siheng Zhao, Ruisheng Cao, Toh Jing Hua, Zhoujun Cheng, Dongchan Shin, Fangyu Lei, Yitao Liu, Yiheng Xu, Shuyan Zhou, Silvio Savarese, Caiming Xiong, Victor Zhong, and Tao Yu. OSWorld: Benchmarking multimodal agents for open-ended tasks in real computer environments. In A. Globerson, L. Mackey, D. Bel...

  54. [54]

    WizardLM: Empowering large pre-trained language models to follow complex instructions, 2025

    Can Xu, Qingfeng Sun, Kai Zheng, Xiubo Geng, Pu Zhao, Jiazhan Feng, Chongyang Tao, Qingwei Lin, and Daxin Jiang. WizardLM: Empowering large pre-trained language models to follow complex instructions, 2025

  55. [55]

    TheAgentCompany: Benchmarking LLM agents on consequential real world tasks

    Frank F. Xu, Yufan Song, Boxuan Li, Yuxuan Tang, Kritanjali Jain, Mengxue Bao, Zora Z. Wang, Xuhui Zhou, Zhitong Guo, Murong Cao, Mingyang Yang, Hao Yang Lu, Amaad Martin, Zhe Su, Leander Maben, Raj Mehta, Wayne Chi, Lawrence Jang, Yiqing Xie, Shuyan Zhou, and Graham Neubig. TheAgentCompany: Benchmarking LLM agents on consequential real world tasks. In Adv...

  56. [56]

    AgentTrek: Agent trajectory synthesis via guiding replay with web tutorials

    Yiheng Xu, Dunjie Lu, Zhennan Shen, Junli Wang, Zekun Wang, Yuchen Mao, Caiming Xiong, and Tao Yu. AgentTrek: Agent trajectory synthesis via guiding replay with web tutorials. arXiv preprint arXiv:2412.09605, 2025

  57. [57]

    Stronger models are not always stronger teachers for instruction tuning

    Zhangchen Xu, Fengqing Jiang, Luyao Niu, Bill Yuchen Lin, and Radha Poovendran. Stronger models are not always stronger teachers for instruction tuning. In Luis Chiruzzo, Alan Ritter, and Lu Wang, editors, Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies ...

  58. [58]

    Qwen3 technical report, 2025

    An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, Chujie Zheng, Dayiheng Liu, Fan Zhou, Fei Huang, Feng Hu, Hao Ge, Haoran Wei, Huan Lin, Jialong Tang, Jian Yang, Jianhong Tu, Jianwei Zhang, Jianxin Yang, Jiaxi Yang, Jing Zhou, Jingren Zhou, Junyang Lin, Kai Dang, Keqin Bao, Kexin Yang, ...

  59. [59]

    Holodeck: Language guided generation of 3D embodied AI environments, 2024

    Yue Yang, Fan-Yun Sun, Luca Weihs, Eli VanderBilt, Alvaro Herrasti, Winson Han, Jiajun Wu, Nick Haber, Ranjay Krishna, Lingjie Liu, Chris Callison-Burch, Mark Yatskar, Aniruddha Kembhavi, and Christopher Clark. Holodeck: Language guided generation of 3D embodied AI environments, 2024

  60. [60]

    AutoEnv: Automated environments for measuring cross-environment agent learning, 2025

    Jiayi Zhang, Yiran Peng, Fanqi Kong, Cheng Yang, Yifan Wu, Zhaoyang Yu, Jinyu Xiang, Jianhao Ruan, Jinlin Wang, Maojia Song, HongZhang Liu, Xiangru Tang, Bang Liu, Chenglin Wu, and Yuyu Luo. AutoEnv: Automated environments for measuring cross-environment agent learning, 2025

  61. [61]

    Judging LLM-as-a-judge with MT-Bench and Chatbot Arena, 2023

    Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zi Lin, Zhuohan Li, Dacheng Li, Eric P. Xing, Hao Zhang, Joseph E. Gonzalez, and Ion Stoica. Judging LLM-as-a-judge with MT-Bench and Chatbot Arena, 2023

  62. [62]

    WebArena: A Realistic Web Environment for Building Autonomous Agents

    Shuyan Zhou, Frank F. Xu, Hao Zhu, Xuhui Zhou, Robert Lo, Abishek Sridhar, Xianyi Cheng, Tianyue Ou, Yonatan Bisk, Daniel Fried, Uri Alon, and Graham Neubig. WebArena: A realistic web environment for building autonomous agents. InThe Twelfth International Conference on Learning Representations, 2024. ICLR 2024; arXiv:2307.13854

  63. [63]

    Proposer-agent-evaluator (PAE): Autonomous skill discovery for foundation model internet agents, 2024

    Yifei Zhou, Qianlan Yang, Kaixiang Lin, Min Bai, Xiong Zhou, Yu-Xiong Wang, Sergey Levine, and Erran Li. Proposer-agent-evaluator (PAE): Autonomous skill discovery for foundation model internet agents, 2024

  64. [64]

    Training versatile coding agents in synthetic environments, 2026

    Yiqi Zhu, Apurva Gandhi, and Graham Neubig. Training versatile coding agents in synthetic environments, 2026

  65. [65]

    Agent-as-a-judge: Evaluate agents with agents, 2024

    Mingchen Zhuge, Changsheng Zhao, Dylan Ashley, Wenyi Wang, Dmitrii Khizbullin, Yunyang Xiong, Zechun Liu, Ernie Chang, Raghuraman Krishnamoorthi, Yuandong Tian, Yangyang Shi, Vikas Chandra, and Jürgen Schmidhuber. Agent-as-a-judge: Evaluate agents with agents, 2024.

    Appendix Table of Contents

    A GDP-Grounded Software Selection: Full Pipeline

    A.1 Phas...

  66. [66]

    Wage bill: For each SOC-2018 occupation, compute employment × mean_wage from BLS OEWS (May 2024)

  67. [67]

    What software categories does each occupation use?

    Labor compensation: Scale wage bills by the national ratio (Total Compensation / Total Wages) from BEA accounts. 3. Total GDP: Scale labor compensation by (National GDP / National Compensation). Output: us_gdp_by_occupation_USD.csv with columns: onetsoc, soc2018, occupation_title, employment, mean_wage, wage_bill, gdp_labor, gdp_total. A.2 Phase 2: Software Discovery C...
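The three Phase 1 steps (wage bill, labor compensation, total GDP) are simple scalings, and a minimal sketch may make the arithmetic concrete. The function name `gdp_by_occupation` and its parameters are assumptions made for illustration, not the authors' code. The output keys reuse the column names listed for us_gdp_by_occupation_USD.csv.

```python
from typing import Dict, Iterable, List, Tuple


def gdp_by_occupation(
    rows: Iterable[Tuple[str, float, float]],
    total_compensation: float,
    total_wages: float,
    national_gdp: float,
    national_compensation: float,
) -> List[Dict[str, float]]:
    """Attribute GDP to occupations by scaling each occupation's wage bill.

    rows holds (soc2018, employment, mean_wage) tuples, as one might read
    them from an OEWS table. The four national aggregates are assumed to
    come from BEA accounts; this helper is hypothetical.
    """
    # National ratios used in steps 2 and 3.
    compensation_ratio = total_compensation / total_wages
    gdp_ratio = national_gdp / national_compensation

    results = []
    for soc2018, employment, mean_wage in rows:
        wage_bill = employment * mean_wage          # step 1
        gdp_labor = wage_bill * compensation_ratio  # step 2
        gdp_total = gdp_labor * gdp_ratio           # step 3
        results.append(
            {
                "soc2018": soc2018,
                "employment": employment,
                "mean_wage": mean_wage,
                "wage_bill": wage_bill,
                "gdp_labor": gdp_labor,
                "gdp_total": gdp_total,
            }
        )
    return results
```

Because both ratios are national constants, the ranking of occupations by gdp_total is the same as the ranking by wage_bill; the scaling only changes the units in which each occupation's share is expressed.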

  68. [68]

    Returns {passed, score, feedback}

    Program: a Python function receives the trajectory (screenshots, action log), environment utilities (exec_capture, copy_from_env, query_vlm), and task metadata. Returns {passed, score, feedback}. Programmatic verifiers can also call a VLM internally (e.g., for checklist-based evaluation), combining the flexibility of code with visual grounding
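A programmatic verifier of this shape could look roughly like the sketch below. The utility names exec_capture and query_vlm come from the description above, but their call signatures, the way they are passed in (a dict of callables), and the metadata keys `expected_file` and `vlm_question` are all assumptions for illustration.

```python
from typing import Any, Callable, Dict, List


def verify_task(
    trajectory: Dict[str, List[Any]],
    env: Dict[str, Callable[..., str]],
    metadata: Dict[str, Any],
) -> Dict[str, Any]:
    """Hypothetical verifier returning {passed, score, feedback}.

    Assumed inputs:
      trajectory: {"screenshots": [...], "actions": [...]}
      env: {"exec_capture": cmd -> stdout, "query_vlm": (image, question) -> text}
      metadata: {"expected_file": path, "vlm_question": yes/no question}
    """
    checks = []

    # Code-based check: the expected output file exists in the environment.
    listing = env["exec_capture"](f"ls {metadata['expected_file']}")
    checks.append(("output file exists", metadata["expected_file"] in listing))

    # Code-based check: the agent actually acted.
    checks.append(("agent took actions", len(trajectory["actions"]) > 0))

    # VLM-based check on the final screenshot, if there is one.
    screenshots = trajectory["screenshots"]
    if screenshots:
        answer = env["query_vlm"](screenshots[-1], metadata["vlm_question"])
        vlm_ok = answer.strip().lower().startswith("yes")
    else:
        vlm_ok = False
    checks.append(("VLM confirms final state", vlm_ok))

    failed = [name for name, ok in checks if not ok]
    score = round(100 * (len(checks) - len(failed)) / len(checks))
    feedback = "all checks passed" if not failed else "failed: " + ", ".join(failed)
    return {"passed": not failed, "score": score, "feedback": feedback}
```

Mixing deterministic checks with one VLM question keeps most of the verdict reproducible while still covering state that is only visible on screen.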

  69. [69]

    envs/libreoffice_calc/env.json

    Image match: SSIM comparison between the final screenshot and a reference image, with a configurable threshold. 3. Multi: cascades program verification first, falling back to image match. Custom verification strategies can be added by writing a new Python file following the same interface. B.6 Episode Artifacts Each episode produces a structured artifact d...
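The "Multi" cascade described above can be sketched as follows. The text does not say what triggers the fallback, so this sketch assumes it happens when the program verifier raises an exception. The names `image_match` and `multi_verify` are hypothetical, and the SSIM value is supplied by a caller-provided function rather than computed here.

```python
from typing import Any, Callable, Dict

Verdict = Dict[str, Any]


def image_match(similarity: float, threshold: float) -> Verdict:
    """Turn a similarity value (assumed to be SSIM in [0, 1]) into a verdict."""
    passed = similarity >= threshold
    score = round(100 * max(0.0, min(1.0, similarity)))
    feedback = f"similarity {similarity:.3f} vs threshold {threshold:.3f}"
    return {"passed": passed, "score": score, "feedback": feedback}


def multi_verify(
    run_program: Callable[[], Verdict],
    compute_similarity: Callable[[], float],
    threshold: float = 0.9,
) -> Verdict:
    """Try program verification first; fall back to image match.

    Assumption: the fallback is used only when the program verifier
    cannot produce a verdict (it raises). A failing verdict is final.
    """
    try:
        return run_program()
    except Exception as exc:
        verdict = image_match(compute_similarity(), threshold)
        verdict["feedback"] = f"program verifier error ({exc}); " + verdict["feedback"]
        return verdict
```

Because every strategy returns the same {passed, score, feedback} shape, a caller does not need to know which stage of the cascade produced the verdict.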

  70. [70]

    What dataset is used (name, source URL, specific case/patient ID)

  71. [71]

    What claims does task.json metadata make about expected values (numerical thresholds, counts, measurements)

  72. [72]

    Which values are hardcoded in scripts vs dynamically computed at runtime

  73. [73]

    Which values in metadata could be verified via web search

  74. [74]

    {task_id}

    Whether the data is real (downloaded from a public dataset) or synthetic (generated by scripts) Mark all claims as verifiable_via_web=true and provide a suggested_search_query for each. Your goal is to find the correct value for every claim, not just confirm what task.json says. Analyze these files for task "{task_id}" and produce a structured JSON respon...

  75. [75]

    Items should be ordered from earliest to latest

    **task_completion** (5-8 items, points must sum to exactly 100): Each item represents a sub-goal or evidence of progress. Items should be ordered from earliest to latest. - CRITICAL: ONLY include items that are explicitly required by the task description. Do NOT add extra steps. - Assign more points to harder items - Each item must be visually verifiable ...

  76. [76]

    privileged_info_for_vlm

    **integrity** (3-4 items): Each item checks for cheating/shortcuts. Common checks: - Agent used the GUI, not terminal commands - Agent interacted with the actual application - Agent didn’t copy-paste expected answers - Results come from genuine software interaction Also produce a "privileged_info_for_vlm" field: a concise text with ONLY verified facts tha...

  77. [77]

    Screenshots. Timestamped screen captures showing: (i) the application running after boot, (ii) the correct starting state for each task, and (iii) the absence of blocking error dialogs

  78. [78]

    Structured verification data. A JSON file per task recording database queries, file-system checks, service health, and baseline counts: anything the audit agent needs to confirm that preconditions hold without launching the VM

  79. [79]

    What’s New in Ladybug

    Export-script output. Proof that the task’s export_result.sh runs without error and produces valid, parseable JSON with all expected fields. All artifacts are stored inside the environment directory under the following layout:

    examples/<env_name>/
    +– evidence_docs/
    |  +– <task_name>_screenshot.png
    |  +– <task_name>_evidence.json
    |  +– ... (one set per task)...

  80. [80]

    identify top talkers

    200722_tcp_anon.pcapng has only 2 IPv4 endpoints and 35 packets. The “identify top talkers” task is trivial: there are only 2 endpoints, so the agent has a 50% chance of guessing correctly without even looking. The Endpoints dialog shows 2 entries and the answer is immediately obvious without sorting

Showing first 80 references.