Gym-Anything: Turn any Software into an Agent Environment
Pith reviewed 2026-05-10 19:29 UTC · model grok-4.3
The pith
Gym-Anything frames environment creation as a multi-agent process that turns any software into a scalable computer-use agent environment.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Gym-Anything treats environment creation itself as a multi-agent task: a coding agent produces setup scripts, downloads realistic data, and supplies evidence that the software is ready, while an independent audit agent verifies the evidence against a quality checklist. Applied to 200 software applications chosen according to a taxonomy of economically valuable occupations, the method yields CUA-World, a collection of over 10,000 long-horizon tasks, each equipped with train and test splits and realistic data. Distilling successful trajectories from the training split into a 2B vision-language model produces performance superior to models twice as large, and reusing the auditing principle at test time raises Gemini-3-Flash's success rate on CUA-World-Long from 11.5% to 14.0%.
What carries the argument
Gym-Anything, the multi-agent framework in which a coding agent generates setup scripts and evidence while an independent audit agent verifies the setup against a quality checklist.
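The division of labor between the two agents can be sketched as a small loop. This is a minimal illustration, not the paper's implementation: the `Evidence` fields, the checklist items, the retry-with-feedback loop, and the toy coding agent are all assumptions introduced here to make the idea concrete.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple


@dataclass
class Evidence:
    """Artifacts the coding agent hands to the auditor (hypothetical shape)."""
    setup_log: str
    data_files: List[str] = field(default_factory=list)
    screenshots: List[str] = field(default_factory=list)


# A checklist maps an item name to a predicate over the evidence.
Checklist = Dict[str, Callable[[Evidence], bool]]


def audit(evidence: Evidence, checklist: Checklist) -> List[str]:
    """Return the names of checklist items that the evidence fails."""
    return [name for name, check in checklist.items() if not check(evidence)]


def build_environment(coding_agent: Callable[[List[str]], Evidence],
                      checklist: Checklist,
                      max_rounds: int = 3) -> Tuple[bool, int]:
    """Alternate setup attempts and audits until the audit passes or the budget ends.

    Feeding audit failures back to the coding agent is an assumption of this sketch.
    """
    failures: List[str] = []
    for round_idx in range(1, max_rounds + 1):
        evidence = coding_agent(failures)   # sees the previous audit's failures
        failures = audit(evidence, checklist)
        if not failures:
            return True, round_idx
    return False, max_rounds


def toy_coding_agent(failures: List[str]) -> Evidence:
    """Stand-in agent: forgets the data at first, adds it once the audit flags it."""
    files = ["patients.csv"] if "has_real_data" in failures else []
    return Evidence(setup_log="install ok", data_files=files, screenshots=["main.png"])


CHECKLIST: Checklist = {
    "install_succeeded": lambda e: "ok" in e.setup_log,
    "has_real_data": lambda e: len(e.data_files) > 0,
    "ui_visible": lambda e: len(e.screenshots) > 0,
}

accepted, rounds = build_environment(toy_coding_agent, CHECKLIST)
print(accepted, rounds)
```

The point of the sketch is the separation of roles: the auditor only ever sees the evidence and the checklist, never the coding agent's reasoning, which is the same property the load-bearing premise depends on.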
If this is right
- Distilling trajectories from the training split of CUA-World into a 2B vision-language model yields performance that exceeds models twice its size.
- Applying the same audit principle at test time improves an existing model’s success rate on long-horizon tasks from 11.5 percent to 14.0 percent.
- The resulting collection of more than 10,000 long-horizon tasks across 200 applications supplies realistic data and splits for training and evaluating agents on economically relevant work.
- Release of the full code, infrastructure, and benchmark data allows other researchers to extend the same pipeline to additional software.
Where Pith is reading between the lines
- The same multi-agent creation process could be run on additional software categories to enlarge coverage of professional workflows.
- Combining the long-horizon tasks with existing short-horizon benchmarks might produce training mixtures that improve both quick responses and complex multi-step behavior.
- The auditing step may transfer to other agent-training pipelines where verifying intermediate states is expensive.
Load-bearing premise
The independent audit agent, given only the checklist and the evidence produced by the coding agent, can correctly identify successful environment setups without missing configuration errors or accepting incomplete ones.
What would settle it
Manual inspection or automated functionality tests on a random sample of the generated environments would find that a substantial fraction fail to run correctly despite having passed the audit step.
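One way to read such a spot check, sketched under assumptions: if reviewers inspect n randomly sampled audited environments and find k that fail, a Wilson score interval bounds the underlying failure rate. The sample size and failure count below are invented for illustration and are not results from the paper.

```python
import math
from typing import Tuple


def wilson_interval(failures: int, n: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a binomial proportion (z=1.96 gives roughly 95%)."""
    if n <= 0 or not 0 <= failures <= n:
        raise ValueError("need n > 0 and 0 <= failures <= n")
    p = failures / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    # Clip to [0, 1] to absorb floating-point noise at the boundaries.
    return max(0.0, center - half), min(1.0, center + half)


# Hypothetical spot check: 5 of 50 audited environments fail on manual inspection.
low, high = wilson_interval(failures=5, n=50)
print(f"failure rate plausibly between {low:.3f} and {high:.3f}")
```

With a sample this small the interval stays wide, so "a substantial fraction" would need either a larger sample or a clearly elevated failure count to be established.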
Original abstract
Computer-use agents hold the promise of assisting in a wide range of digital economic activities. However, current research has largely focused on short-horizon tasks over a limited set of software with limited economic value, such as basic e-commerce and OS-configuration tasks. A key reason is that creating environments for complex software requires significant time and human effort, and therefore does not scale. To address this, we introduce Gym-Anything, a framework for converting any software into an interactive computer-use environment. We frame environment creation itself as a multi-agent task: a coding agent writes setup scripts, downloads real-world data, and configures the software, while producing evidence of correct setup. An independent audit agent then verifies evidence for the environment setup against a quality checklist. Using a taxonomy of economically valuable occupations grounded in U.S. GDP data, we apply this pipeline to 200 software applications with broad occupational coverage. The result is CUA-World, a collection of over 10K long-horizon tasks spanning domains from medical science and astronomy to engineering and enterprise systems, each configured with realistic data along with train and test splits. CUA-World also includes CUA-World-Long, a challenging long-horizon benchmark with tasks often requiring over 500 steps, far exceeding existing benchmarks. Distilling successful trajectories from the training split into a 2B vision-language model outperforms models 2$\times$ its size. We also apply the same auditing principle at test time: a separate VLM reviews completed trajectories and provides feedback on what remains, improving Gemini-3-Flash on CUA-World-Long from 11.5% to 14.0%. We release all code, infrastructure, and benchmark data to facilitate future research in realistic computer-use agents.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces Gym-Anything, a multi-agent framework (coding agent + independent audit agent) for automatically converting arbitrary software into interactive computer-use agent (CUA) environments. Applying the pipeline to 200 applications selected via a U.S. GDP-grounded occupational taxonomy yields CUA-World: a benchmark of >10K long-horizon tasks with realistic data, train/test splits, and a harder CUA-World-Long subset (tasks often >500 steps). The authors show that distilling successful trajectories into a 2B VLM outperforms models twice its size and that test-time VLM auditing lifts Gemini-3-Flash from 11.5% to 14.0% on CUA-World-Long. All code, infrastructure, and data are released.
Significance. If the generated environments are verifiably correct, this work would be significant for the field: it offers a scalable, low-human-effort method to produce diverse, economically relevant long-horizon CUA benchmarks far beyond current short-horizon, narrow-scope suites. The release of code and data, the distillation result, and the test-time auditing technique are concrete strengths that could accelerate reproducible research on realistic computer-use agents.
Major comments (2)
- [§4 and §3.2] §4 (CUA-World construction) and §3.2 (audit agent): The claim that the 200 environments are correctly configured with realistic data and functional long-horizon tasks rests entirely on the automated audit agent applying a quality checklist to evidence produced by the coding agent. No human audit of a statistically meaningful sample is reported, nor are false-positive or false-negative rates for the audit agent quantified. This is load-bearing for the central benchmark-validity claim.
- [§5] §5 (experiments): Success rates (including the 11.5% → 14.0% lift and the 2B VLM outperforming larger models) are presented without an explicit definition of task success, how partial progress is scored on >500-step trajectories, or inter-rater reliability for the audit agent at test time. These details are required to interpret the numerical results.
Minor comments (2)
- [§4.1] The occupational taxonomy and its mapping to the 200 applications could be described with greater precision (e.g., explicit list or selection criteria) to allow readers to assess domain coverage.
- [Figures in §5] Figure captions and axis labels in the experimental plots should explicitly state the evaluation metric and number of runs.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback, which highlights important aspects of benchmark validity and experimental reporting. We address each major comment below and will revise the manuscript accordingly to strengthen these sections.
- Referee: [§4 and §3.2] §4 (CUA-World construction) and §3.2 (audit agent): The claim that the 200 environments are correctly configured with realistic data and functional long-horizon tasks rests entirely on the automated audit agent applying a quality checklist to evidence produced by the coding agent. No human audit of a statistically meaningful sample is reported, nor are false-positive or false-negative rates for the audit agent quantified. This is load-bearing for the central benchmark-validity claim.
  Authors: We agree that the automated audit is central to the validity claim and that quantifying its reliability strengthens the work. The audit agent operates independently on concrete evidence (setup logs, data files, and screenshots) against a fixed checklist, but we acknowledge the absence of human validation in the current manuscript. In revision, we will add a human audit study: two independent human reviewers will evaluate a random sample of 50 environments (25 accepted, 25 rejected by the audit agent) for correctness of setup, data realism, and task functionality. We will report inter-annotator agreement and estimate false-positive and false-negative rates of the audit agent. These results and methodology will be added to §4 and §3.2.
  Revision: yes
- Referee: [§5] §5 (experiments): Success rates (including the 11.5% → 14.0% lift and the 2B VLM outperforming larger models) are presented without an explicit definition of task success, how partial progress is scored on >500-step trajectories, or inter-rater reliability for the audit agent at test time. These details are required to interpret the numerical results.
  Authors: We appreciate this point on clarity. Task success is defined as binary completion: the agent produces the expected final state or output as verified by the audit agent (e.g., correct file generated, query result matches target, or application state satisfies the goal condition). No partial credit is assigned; trajectories are scored as success only if the audit confirms goal achievement within the maximum step limit. For CUA-World-Long tasks (>500 steps), the same binary criterion applies, with the audit agent checking the terminal state rather than intermediate progress. For the test-time audit agent, we will add inter-rater reliability by running the audit on a sample of 100 trajectories with two independent VLMs and reporting agreement metrics (e.g., Cohen's kappa). These explicit definitions and metrics will be added to §5.
  Revision: yes
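The two reliability checks promised in the responses above can be sketched in a few lines. This is an illustration rather than the authors' analysis code: the function names, the example counts, and the `accept_rate` parameter are assumptions introduced here. One subtlety the sketch makes explicit is that a sample stratified by audit decision (25 accepted, 25 rejected) directly estimates how often each audit decision is wrong; turning those into false-positive and false-negative rates requires reweighting by the share of environments the audit agent accepts overall.

```python
from collections import Counter
from typing import Dict, Hashable, Sequence


def audit_error_rates(good_among_accepted: int, n_accepted: int,
                      good_among_rejected: int, n_rejected: int,
                      accept_rate: float) -> Dict[str, float]:
    """Estimate audit error rates from a sample stratified by audit decision.

    "Good" means human reviewers judged the environment correct. accept_rate is
    the fraction of all environments the audit accepted (a hypothetical input).
    """
    p_good_acc = good_among_accepted / n_accepted
    p_good_rej = good_among_rejected / n_rejected
    # Overall probability mass of truly bad and truly good environments.
    bad_mass = accept_rate * (1 - p_good_acc) + (1 - accept_rate) * (1 - p_good_rej)
    good_mass = accept_rate * p_good_acc + (1 - accept_rate) * p_good_rej
    nan = float("nan")
    return {
        # P(audit accepts | environment is bad)
        "false_positive_rate": accept_rate * (1 - p_good_acc) / bad_mass if bad_mass else nan,
        # P(audit rejects | environment is good)
        "false_negative_rate": (1 - accept_rate) * p_good_rej / good_mass if good_mass else nan,
    }


def cohens_kappa(rater_a: Sequence[Hashable], rater_b: Sequence[Hashable]) -> float:
    """Cohen's kappa: agreement between two raters, corrected for chance."""
    if len(rater_a) != len(rater_b) or len(rater_a) == 0:
        raise ValueError("raters must label the same non-empty set of items")
    n = len(rater_a)
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    expected = sum(counts_a[k] * counts_b[k] for k in set(counts_a) | set(counts_b)) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used one identical label throughout (a convention)
    return (observed - expected) / (1 - expected)


# Invented numbers: humans confirm 23 of 25 accepted and 5 of 25 rejected environments.
rates = audit_error_rates(23, 25, 5, 25, accept_rate=0.8)
# Invented binary success verdicts from two independent VLM auditors.
kappa = cohens_kappa([1, 1, 0, 0], [1, 0, 0, 0])
print(rates, kappa)
```

Because success is scored as binary, the kappa computation needs nothing beyond the two verdict lists; a partial-credit scheme would call for a weighted agreement statistic instead.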
Circularity Check
No circularity: framework and empirical benchmark construction without derivations or self-referential predictions
full rationale
The manuscript presents Gym-Anything as an engineering framework that applies a multi-agent coding-plus-audit pipeline to 200 software packages, yielding the CUA-World benchmark and associated empirical results (trajectory distillation and test-time auditing). No equations, parameter fits, uniqueness theorems, or first-principles derivations appear in the provided text. Performance numbers are reported as direct observations on the constructed tasks rather than predictions that reduce to the inputs by construction. The audit agent's role is a methodological assumption whose validity is external to any internal derivation chain.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: LLM agents can write correct setup scripts, and an independent LLM auditor can reliably verify them against a checklist.
Forward citations
Cited by 1 Pith paper
- Securing Computer-Use Agents: A Unified Architecture-Lifecycle Framework for Deployment-Grounded Reliability
  The paper develops a unified framework that organizes computer-use agent reliability around perception-decision-execution layers and creation-deployment-operation-maintenance stages to map security and alignment inter...
Reference graph
Works this paper leans on
-
[1]
The simple macroeconomics of ai.SSRN Electronic Journal, 2024
Daron Acemoglu. The simple macroeconomics of ai.SSRN Electronic Journal, 2024
2024
-
[2]
Pranjal Aggarwal and Sean Welleck. Programming with pixels: Can computer-use agents do software engineering?arXiv preprint arXiv:2502.18525, 2025
-
[3]
The claude model family, 2025
Anthropic. The claude model family, 2025
2025
-
[4]
Digirl: Training in-the-wild device-control agents with autonomous reinforcement learning, 2024
Hao Bai, Yifei Zhou, Mert Cemri, Jiayi Pan, Alane Suhr, Sergey Levine, and Aviral Kumar. Digirl: Training in-the-wild device-control agents with autonomous reinforcement learning, 2024
2024
-
[5]
Shuai Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Sibo Song, Kai Dang, Peng Wang, Shijie Wang, Jun Tang, Humen Zhong, Yuanzhi Zhu, Mingkun Yang, Zhaohai Li, Jianqiang Wan, Pengfei Wang, Wei Ding, Zheren Fu, Yiheng Xu, Jiabo Ye, Xi Zhang, Tianbao Xie, Zesen Cheng, Hang Zhang, Zhibo Yang, Haiyang Xu, and Junyang Lin. Qwen2.5-VL technical report. a...
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[6]
WorkArena++: Towards compositional planning and reasoning-based common knowledge work tasks
Léo Boisvert, Megh Thakkar, Maxime Gasse, Massimo Caccia, Thibault Le Sellier De Chezelles, Quentin Cappart, Nicolas Chapados, Alexandre Lacoste, and Alexandre Drouin. WorkArena++: Towards compositional planning and reasoning-based common knowledge work tasks. In A. Globerson, L. Mackey, D. Belgrave, A. Fan, U. Paquet, J. Tomczak, and C. Zhang, editors, A...
2024
-
[7]
Windows Agent Arena: Evaluating multi-modal OS agents at scale
Rogerio Bonatti, Dan Zhao, Francesco Bonacci, Dillon Dupont, Sara Abdali, Yinheng Li, Yadong Lu, Justin Wagle, Kazuhito Koishida, Arthur Bucker, Lawrence Keunho Jang, and Zheng Hui. Windows Agent Arena: Evaluating multi-modal OS agents at scale. In Aarti Singh, Maryam Fazel, Daniel Hsu, Simon Lacoste-Julien, Felix Berkenkamp, Tegan Maharaj, Kiri Wagstaff,...
2025
-
[8]
OpenAI Gym, 2016
Greg Brockman, Vicki Cheung, Ludwig Pettersson, Jonas Schneider, John Schulman, Jie Tang, and Wojciech Zaremba. OpenAI Gym, 2016
2016
-
[9]
Spider2-V: How far are multimodal agents from automat- ing data science and engineering workflows? In A
Ruisheng Cao, Fangyu Lei, Haoyuan Wu, Jixuan Chen, Yeqiao Fu, Hongcheng Gao, Xinzhuang Xiong, Hanchong Zhang, Yuchen Mao, Wenjing Hu, Tianbao Xie, Hongshen Xu, Danyang Zhang, Sida Wang, Ruoxi Sun, Pengcheng Yin, Caiming Xiong, Ansong Ni, Qian Liu, Victor Zhong, Lu Chen, Kai Yu, and Tao Yu. Spider2-V: How far are multimodal agents from automat- ing data sc...
2024
-
[10]
Gui-genesis: Automated synthesis of efficient environments with verifiable rewards for gui agent post-training, 2026
Yuan Cao, Dezhi Ran, Mengzhou Wu, Yuzhe Guo, Xin Chen, Ang Li, Gang Cao, Gong Zhi, Hao Yu, Linyi Li, Wei Yang, and Tao Xie. Gui-genesis: Automated synthesis of efficient environments with verifiable rewards for gui agent post-training, 2026
2026
-
[11]
Mind2Web: Towards a generalist agent for the web
Xiang Deng, Yu Gu, Boyuan Zheng, Shijie Chen, Sam Stevens, Boshi Wang, Huan Sun, and Yu Su. Mind2Web: Towards a generalist agent for the web. In A. Oh, T. Naumann, A. Globerson, K. Saenko, M. Hardt, and S. Levine, editors,Advances in Neural Information Processing Systems, volume 36, pages 28091–28114. Curran Associates, Inc., 2023
2023
-
[12]
Laradji, Manuel Del Verme, Tom Marty, David Vazquez, Nicolas Chapados, and Alexandre Lacoste
Alexandre Drouin, Maxime Gasse, Massimo Caccia, Issam H. Laradji, Manuel Del Verme, Tom Marty, David Vazquez, Nicolas Chapados, and Alexandre Lacoste. WorkArena: How capable are web agents at solving common knowledge work tasks? In Ruslan Salakhutdinov, Zico Kolter, Katherine Heller, Adrian Weller, Nuria Oliver, Jonathan Scarlett, and Felix Berkenkamp, ed...
2024
-
[13]
Score the steps, not just the goal: Vlm-based subgoal evaluation for robotic manipulation, 2025
Ramy ElMallah, Krish Chhajer, and Chi-Guhn Lee. Score the steps, not just the goal: Vlm-based subgoal evaluation for robotic manipulation, 2025. 19
2025
-
[14]
Gpts are gpts: An early look at the labor market impact potential of large language models, 2023
Tyna Eloundou, Sam Manning, Pamela Mishkin, and Daniel Rock. Gpts are gpts: An early look at the labor market impact potential of large language models, 2023
2023
-
[15]
Occupational, industry, and geographic exposure to artificial intelligence: A novel dataset and its potential uses.Strategic Management Journal, 42(12):2195–2217, 2021
Edward Felten, Manav Raj, and Robert Seamans. Occupational, industry, and geographic exposure to artificial intelligence: A novel dataset and its potential uses.Strategic Management Journal, 42(12):2195–2217, 2021
2021
-
[16]
Carl Benedikt Frey and Michael A. Osborne. The future of employment: How susceptible are jobs to computerisation?Technological Forecasting and Social Change, 114:254–280, 2017
2017
-
[17]
Evilgenie: A reward hacking bench- mark, 2025
Jonathan Gabor, Jayson Lynch, and Jonathan Rosenfeld. Evilgenie: A reward hacking bench- mark, 2025
2025
-
[18]
AssistGUI: Task-oriented PC graphical user interface automation
Difei Gao, Lei Ji, Zechen Bai, Mingyu Ouyang, Peiran Li, Dongxing Mao, Qinchen Wu, Weichen Zhang, Peiyi Wang, Xiangwu Guo, Hengxu Wang, Luowei Zhou, and Mike Zheng Shou. AssistGUI: Task-oriented PC graphical user interface automation. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 13289–13298, June 2024
2024
-
[19]
Efficient agent training for computer use, 2026
Yanheng He, Jiahe Jin, and Pengfei Liu. Efficient agent training for computer use, 2026
2026
-
[20]
PC agent: While you sleep, AI works – a cognitive journey into digital world
Yanheng He, Jiahe Jin, Shijie Xia, Jiadi Su, Runze Fan, Haoyang Zou, Xiangkun Hu, and Pengfei Liu. PC agent: While you sleep, AI works – a cognitive journey into digital world. arXiv preprint arXiv:2412.17589, 2024
-
[21]
Agentgen: Enhancing planning abilities for large language model based agent via environment and task generation, 2025
Mengkang Hu, Pu Zhao, Can Xu, Qingfeng Sun, Jianguang Lou, Qingwei Lin, Ping Luo, and Saravan Rajmohan. Agentgen: Enhancing planning abilities for large language model based agent via environment and task generation, 2025
2025
-
[22]
OmniACT: A dataset and benchmark for enabling multi- modal generalist autonomous agents for desktop and web
Raghav Kapoor, Yash Parag Butala, Melisa Russak, Jing Yu Koh, Kiran Kamble, Waseem AlShikh, and Ruslan Salakhutdinov. OmniACT: A dataset and benchmark for enabling multi- modal generalist autonomous agents for desktop and web. InComputer Vision – ECCV 2024, pages 161–178. Springer Nature Switzerland, 2024
2024
-
[23]
VisualWebArena: Evaluating multimodal agents on realistic visual web tasks
Jing Yu Koh, Robert Lo, Lawrence Jang, Vikram Duvvur, Ming Lim, Po-Yu Huang, Graham Neubig, Shuyan Zhou, Russ Salakhutdinov, and Daniel Fried. VisualWebArena: Evaluating multimodal agents on realistic visual web tasks. In Lun-Wei Ku, Andre Martins, and Vivek Srikumar, editors,Proceedings of the 62nd Annual Meeting of the Association for Computa- tional Li...
2024
-
[24]
On the effects of data scale on UI control agents
Wei Li, William Bishop, Alice Li, Chris Rawles, Folawiyo Campbell-Ajala, Divya Tyamagundlu, and Oriana Riva. On the effects of data scale on UI control agents. In A. Globerson, L. Mackey, D. Belgrave, A. Fan, U. Paquet, J. Tomczak, and C. Zhang, editors,Advances in Neural Information Processing Systems, volume 37, pages 92130–92154. Curran Associates, Inc., 2024
2024
-
[25]
Wildbench: Benchmarking llms with challenging tasks from real users in the wild, 2024
Bill Yuchen Lin, Yuntian Deng, Khyathi Chandu, Faeze Brahman, Abhilasha Ravichander, Valentina Pyatkin, Nouha Dziri, Ronan Le Bras, and Yejin Choi. Wildbench: Benchmarking llms with challenging tasks from real users in the wild, 2024
2024
-
[26]
Reinforcement Learning on Web Interfaces Using Workflow-Guided Exploration
Evan Zheran Liu, Kelvin Guu, Panupong Pasupat, Tianlin Shi, and Percy Liang. Reinforcement learning on web interfaces using workflow-guided exploration. InInternational Conference on Learning Representations, 2018. ICLR 2018; arXiv:1802.08802
work page Pith review arXiv 2018
-
[27]
Liu, Kevin Lin, John Hewitt, Ashwin Paranjape, Michele Bevilacqua, Fabio Petroni, and Percy Liang
Nelson F. Liu, Kevin Lin, John Hewitt, Ashwin Paranjape, Michele Bevilacqua, Fabio Petroni, and Percy Liang. Lost in the middle: How language models use long contexts.Transactions of the Association for Computational Linguistics, 12:157–173, 2024
2024
-
[28]
Agentinstruct: Toward generative teaching with agentic flows, 2024
Arindam Mitra, Luciano Del Corro, Guoqing Zheng, Shweti Mahajan, Dany Rouhana, Andres Codas, Yadong Lu, Wei ge Chen, Olga Vrousgos, Corby Rosset, Fillipe Silva, Hamed Khanpour, Yash Lara, and Ahmed Awadallah. Agentinstruct: Toward generative teaching with agentic flows, 2024. 20
2024
-
[29]
Bagel: Bootstrapping agents by guiding exploration with language, 2024
Shikhar Murty, Christopher Manning, Peter Shaw, Mandar Joshi, and Kenton Lee. Bagel: Bootstrapping agents by guiding exploration with language, 2024
2024
-
[30]
Explorer: Scaling exploration-driven web trajectory synthesis for multimodal web agents, 2025
Vardaan Pahuja, Yadong Lu, Corby Rosset, Boyu Gou, Arindam Mitra, Spencer Whitehead, Yu Su, and Ahmed Awadallah. Explorer: Scaling exploration-driven web trajectory synthesis for multimodal web agents, 2025
2025
-
[31]
Training software engineering agents and verifiers with swe-gym, 2025
Jiayi Pan, Xingyao Wang, Graham Neubig, Navdeep Jaitly, Heng Ji, Alane Suhr, and Yizhe Zhang. Training software engineering agents and verifiers with swe-gym, 2025
2025
-
[32]
Au- tonomous evaluation and refinement of digital agents, 2024
Jiayi Pan, Yichi Zhang, Nicholas Tomlin, Yifei Zhou, Sergey Levine, and Alane Suhr. Au- tonomous evaluation and refinement of digital agents, 2024
2024
-
[33]
Kim, Patrick Chao, Samuel Miserendino, Gildas Chabot, David Li, Michael Sharman, Alexandra Barr, Amelia Glaese, and Jerry Tworek
Tejal Patwardhan, Rachel Dias, Elizabeth Proehl, Grace Kim, Michele Wang, Olivia Watkins, Simón Posada Fishman, Marwan Aljubeh, Phoebe Thacker, Laurance Fauconnet, Natalie S. Kim, Patrick Chao, Samuel Miserendino, Gildas Chabot, David Li, Michael Sharman, Alexandra Barr, Amelia Glaese, and Jerry Tworek. Gdpval: Evaluating ai model performance on real-worl...
2025
-
[34]
Peterson, Michael D
Norman G. Peterson, Michael D. Mumford, Walter C. Borman, P. Richard Jeanneret, Edwin A. Fleishman, Kerry Y . Levin, Michael A. Campion, Melinda S. Mayfield, Frederick P. Morgeson, Kenneth Pearlman, Marilyn K. Gowing, Anita R. Lancaster, Marilyn B. Silver, and Donna M. Dye. Understanding work using the occupational information network (O*NET): Implication...
2001
-
[35]
Webrl: Training llm web agents via self-evolving online curriculum reinforcement learning, 2025
Zehan Qi, Xiao Liu, Iat Long Iong, Hanyu Lai, Xueqiao Sun, Wenyi Zhao, Yu Yang, Xinyue Yang, Jiadai Sun, Shuntian Yao, Tianjie Zhang, Wei Xu, Jie Tang, and Yuxiao Dong. Webrl: Training llm web agents via self-evolving online curriculum reinforcement learning, 2025
2025
-
[36]
Ui-tars: Pioneering automated gui interaction with native agents, 2025
Yujia Qin, Yining Ye, Junjie Fang, Haoming Wang, Shihao Liang, Shizuo Tian, Junda Zhang, Jiahao Li, Yunxin Li, Shijue Huang, Wanjun Zhong, Kuanye Li, Jiale Yang, Yu Miao, Woyu Lin, Longxiang Liu, Xu Jiang, Qianli Ma, Jingyu Li, Xiaojun Xiao, Kai Cai, Chuang Li, Yaowei Zheng, Chaolin Jin, Chen Li, Xiao Zhou, Minchao Wang, Haoli Chen, Zhaojian Li, Haihua Ya...
2025
-
[37]
AndroidWorld: A Dynamic Benchmarking Environment for Autonomous Agents
Christopher Rawles, Sarah Clinckemaillie, Yifan Chang, Jonathan Waltz, Gabrielle Lau, Mary- beth Fair, Alice Li, William Bishop, Wei Li, Folawiyo Campbell-Ajala, Daniel Toyama, Robert Berry, Divya Tyamagundlu, Timothy Lillicrap, and Oriana Riva. AndroidWorld: A dynamic benchmarking environment for autonomous agents. InThe Thirteenth International Conferen...
work page internal anchor Pith review arXiv 2025
-
[38]
An- droidInTheWild: A large-scale dataset for Android device control
Christopher Rawles, Alice Li, Daniel Rodriguez, Oriana Riva, and Timothy Lillicrap. An- droidInTheWild: A large-scale dataset for Android device control. In A. Oh, T. Naumann, A. Globerson, K. Saenko, M. Hardt, and S. Levine, editors,Advances in Neural Information Processing Systems, volume 36, pages 59708–59728. Curran Associates, Inc., 2023
2023
-
[39]
The illusion of diminishing returns: Measuring long horizon execution in llms, 2026
Akshit Sinha, Arvindh Arun, Shashwat Goel, Steffen Staab, and Jonas Geiping. The illusion of diminishing returns: Measuring long horizon execution in llms, 2026
2026
-
[40]
Os-genesis: Automating gui agent trajectory construction via reverse task synthesis, 2025
Qiushi Sun, Kanzhi Cheng, Zichen Ding, Chuanyang Jin, Yian Wang, Fangzhi Xu, Zhenyu Wu, Chengyou Jia, Liheng Chen, Zhoumianze Liu, Ben Kao, Guohao Li, Junxian He, Yu Qiao, and Zhiyong Wu. Os-genesis: Automating gui agent trajectory construction via reverse task synthesis, 2025
2025
-
[41]
ScienceBoard: Evaluating Multimodal Autonomous Agents in Realistic Scientific Workflows
Qiushi Sun, Zhoumianze Liu, Chang Ma, Zichen Ding, Fangzhi Xu, Zhangyue Yin, Haiteng Zhao, Zhenyu Wu, Kanzhi Cheng, Zhaoyang Liu, Jianing Wang, Qintong Li, Xiangru Tang, Tianbao Xie, Xiachong Feng, Xiang Li, Ben Kao, Wenhai Wang, Biqing Qi, Lingpeng Kong, and Zhiyong Wu. ScienceBoard: Evaluating multimodal autonomous agents in realistic scientific workflo...
work page internal anchor Pith review Pith/arXiv arXiv 2026
-
[42]
Seagent: Self-evolving computer use agent with autonomous learning from experience, 2025
Zeyi Sun, Ziyu Liu, Yuhang Zang, Yuhang Cao, Xiaoyi Dong, Tong Wu, Dahua Lin, and Jiaqi Wang. Seagent: Self-evolving computer use agent with autonomous learning from experience, 2025
2025
-
[43]
Kimi Team, Tongtong Bai, Yifan Bai, Yiping Bao, S. H. Cai, Yuan Cao, Y . Charles, H. S. Che, Cheng Chen, Guanduo Chen, Huarong Chen, Jia Chen, Jiahao Chen, Jianlong Chen, Jun Chen, Kefan Chen, Liang Chen, Ruijue Chen, Xinhao Chen, Yanru Chen, Yanxu Chen, Yicun Chen, Yimin Chen, Yingjiang Chen, Yuankun Chen, Yujie Chen, Yutian Chen, Zhirong Chen, Ziwei Che...
2026
-
[44]
Mark Towers, Ariel Kwiatkowski, Jordan Terry, John U. Balis, Gianluca De Cola, Tristan Deleu, Manuel Goulão, Andreas Kallinteris, Markus Krimmel, Arjun KG, Rodrigo Perez-Vicente, Andrea Pierré, Sander Schulhoff, Jun Jet Tai, Hannah Tan, and Omar G. Younis. Gymnasium: A standard interface for reinforcement learning environments.arXiv preprint arXiv:2407.17...
work page internal anchor Pith review arXiv 2025
-
[45]
Bureau of Economic Analysis
U.S. Bureau of Economic Analysis. National income and product accounts (NIPA). U.S. Department of Commerce, 2024. Interactive data tables, annual estimates. Accessed February 22 2025
2024
-
[46]
Bureau of Labor Statistics
U.S. Bureau of Labor Statistics. Occupational employment and wage statistics (OEWS). U.S. Department of Labor, 2024. May 2024 estimates. Accessed February 2025
2024
-
[47]
Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution
Peng Wang, Shuai Bai, Sinan Tan, Shijie Wang, Zhihao Fan, Jinze Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Yang Fan, Kai Dang, Mengfei Du, Xuancheng Ren, Rui Men, Dayiheng Liu, Chang Zhou, Jingren Zhou, and Junyang Lin. Qwen2-VL: Enhancing vision- language model’s perception of the world at any resolution.arXiv preprint arXiv:2409.12191, 2024
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[48]
Charles, Zhilin Yang, and Tao Yu
Xinyuan Wang, Bowen Wang, Dunjie Lu, Junlin Yang, Tianbao Xie, Junli Wang, Jiaqi Deng, Xiaole Guo, Yiheng Xu, Chen Henry Wu, Zhennan Shen, Zhuokai Li, Ryan Li, Xiaochuan Li, Junda Chen, Boyuan Zheng, Peihang Li, Fangyu Lei, Ruisheng Cao, Yeqiao Fu, Dongchan Shin, Martin Shin, Jiarui Hu, Yuyan Wang, Jixuan Chen, Yuxiao Ye, Danyang Zhang, Dikang Du, Hao Hu,...
2025
-
[49]
Smith, Daniel Khashabi, and Hannaneh Hajishirzi
Yizhong Wang, Yeganeh Kordi, Swaroop Mishra, Alisa Liu, Noah A. Smith, Daniel Khashabi, and Hannaneh Hajishirzi. Self-instruct: Aligning language models with self-generated instruc- tions, 2023
2023
-
[50]
Agent world model: Infinity synthetic environments for agentic reinforcement learning, 2026
Zhaoyang Wang, Canwen Xu, Boyi Liu, Yite Wang, Siwei Han, Zhewei Yao, Huaxiu Yao, and Yuxiong He. Agent world model: Infinity synthetic environments for agentic reinforcement learning, 2026
2026
-
[51]
How well does agent development reflect real-world work?, 2026
Zora Zhiruo Wang, Sanidhya Vijayvargiya, Aspen Chen, Hanmo Zhang, Venu Arvind Arangara- jan, Jett Chen, Valerie Chen, Diyi Yang, Daniel Fried, and Graham Neubig. How well does agent development reflect real-world work?, 2026
2026
-
[52]
Os-atlas: A foundation action model for generalist gui agents, 2024
Zhiyong Wu, Zhenyu Wu, Fangzhi Xu, Yian Wang, Qiushi Sun, Chengyou Jia, Kanzhi Cheng, Zichen Ding, Liheng Chen, Paul Pu Liang, and Yu Qiao. Os-atlas: A foundation action model for generalist gui agents, 2024
2024
-
[53]
OSWorld: Benchmarking multimodal agents for open-ended tasks in real computer environments
Tianbao Xie, Danyang Zhang, Jixuan Chen, Xiaochuan Li, Siheng Zhao, Ruisheng Cao, Toh Jing Hua, Zhoujun Cheng, Dongchan Shin, Fangyu Lei, Yitao Liu, Yiheng Xu, Shuyan Zhou, Silvio Savarese, Caiming Xiong, Victor Zhong, and Tao Yu. OSWorld: Benchmarking multimodal agents for open-ended tasks in real computer environments. In A. Globerson, L. Mackey, D. Bel...
2024
-
[54]
Wizardlm: Empowering large pre-trained language models to follow complex instructions, 2025
Can Xu, Qingfeng Sun, Kai Zheng, Xiubo Geng, Pu Zhao, Jiazhan Feng, Chongyang Tao, Qingwei Lin, and Daxin Jiang. Wizardlm: Empowering large pre-trained language models to follow complex instructions, 2025
2025
-
[55]
Xu, Yufan Song, Boxuan Li, Yuxuan Tang, Kritanjali Jain, Mengxue Bao, Zora Z
Frank F. Xu, Yufan Song, Boxuan Li, Yuxuan Tang, Kritanjali Jain, Mengxue Bao, Zora Z. Wang, Xuhui Zhou, Zhitong Guo, Murong Cao, Mingyang Yang, Hao Yang Lu, Amaad Martin, Zhe Su, Leander Maben, Raj Mehta, Wayne Chi, Lawrence Jang, Yiqing Xie, Shuyan Zhou, and Graham Neubig. TheAgentCompany: Benchmarking LLM agents on consequential real world tasks. InAdv...
2025
-
[56]
Yiheng Xu, Dunjie Lu, Zhennan Shen, Junli Wang, Zekun Wang, Yuchen Mao, Caiming Xiong, and Tao Yu. AgentTrek: Agent trajectory synthesis via guiding replay with web tutorials.arXiv preprint arXiv:2412.09605, 2025
-
[57]
Stronger models are not always stronger teachers for instruction tuning
Zhangchen Xu, Fengqing Jiang, Luyao Niu, Bill Yuchen Lin, and Radha Poovendran. Stronger models are not always stronger teachers for instruction tuning. In Luis Chiruzzo, Alan Ritter, and Lu Wang, editors,Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies ...
2025
-
[58]
Qwen3 technical report, 2025
An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, Chujie Zheng, Dayiheng Liu, Fan Zhou, Fei Huang, Feng Hu, Hao Ge, Haoran Wei, Huan Lin, Jialong Tang, Jian Yang, Jianhong Tu, Jianwei Zhang, Jianxin Yang, Jiaxi Yang, Jing Zhou, Jingren Zhou, Junyang Lin, Kai Dang, Keqin Bao, Kexin Yang, ...
2025
-
[59]
Holodeck: Language guided generation of 3d embodied ai environments, 2024
Yue Yang, Fan-Yun Sun, Luca Weihs, Eli VanderBilt, Alvaro Herrasti, Winson Han, Jiajun Wu, Nick Haber, Ranjay Krishna, Lingjie Liu, Chris Callison-Burch, Mark Yatskar, Aniruddha Kembhavi, and Christopher Clark. Holodeck: Language guided generation of 3d embodied ai environments, 2024
2024
-
[60]
Autoenv: Automated environments for measuring cross-environment agent learning, 2025
Jiayi Zhang, Yiran Peng, Fanqi Kong, Cheng Yang, Yifan Wu, Zhaoyang Yu, Jinyu Xiang, Jianhao Ruan, Jinlin Wang, Maojia Song, HongZhang Liu, Xiangru Tang, Bang Liu, Chenglin Wu, and Yuyu Luo. Autoenv: Automated environments for measuring cross-environment agent learning, 2025
2025
-
[61]
Xing, Hao Zhang, Joseph E
Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zi Lin, Zhuohan Li, Dacheng Li, Eric P. Xing, Hao Zhang, Joseph E. Gonzalez, and Ion Stoica. Judging llm-as-a-judge with mt-bench and chatbot arena, 2023
2023
-
[62]
WebArena: A Realistic Web Environment for Building Autonomous Agents
Shuyan Zhou, Frank F. Xu, Hao Zhu, Xuhui Zhou, Robert Lo, Abishek Sridhar, Xianyi Cheng, Tianyue Ou, Yonatan Bisk, Daniel Fried, Uri Alon, and Graham Neubig. WebArena: A realistic web environment for building autonomous agents. In The Twelfth International Conference on Learning Representations, 2024. ICLR 2024; arXiv:2307.13854
2024
-
[63]
Proposer-agent-evaluator (PAE): Autonomous skill discovery for foundation model internet agents, 2024
Yifei Zhou, Qianlan Yang, Kaixiang Lin, Min Bai, Xiong Zhou, Yu-Xiong Wang, Sergey Levine, and Erran Li. Proposer-agent-evaluator (PAE): Autonomous skill discovery for foundation model internet agents, 2024
2024
-
[64]
Training versatile coding agents in synthetic environments, 2026
Yiqi Zhu, Apurva Gandhi, and Graham Neubig. Training versatile coding agents in synthetic environments, 2026
2026
-
[65]
Agent-as-a-judge: Evaluate agents with agents, 2024
Mingchen Zhuge, Changsheng Zhao, Dylan Ashley, Wenyi Wang, Dmitrii Khizbullin, Yunyang Xiong, Zechun Liu, Ernie Chang, Raghuraman Krishnamoorthi, Yuandong Tian, Yangyang Shi, Vikas Chandra, and Jürgen Schmidhuber. Agent-as-a-judge: Evaluate agents with agents, 2024
2024
Appendix Table of Contents
A GDP-Grounded Software Selection: Full Pipeline
A.1 Phas...
-
[66]
Wage bill: For each SOC-2018 occupation, compute employment × mean_wage from BLS OEWS (May 2024)
-
[67]
What software categories does each occupation use?
Labor compensation: Scale wage bills by the national ratio Total Compensation / Total Wages from BEA accounts.
3. Total GDP: Scale labor compensation by National GDP / National Compensation.
Output: us_gdp_by_occupation_USD.csv with columns: onetsoc, soc2018, occupation_title, employment, mean_wage, wage_bill, gdp_labor, gdp_total.
A.2 Phase 2: Software Discovery C...
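The three Phase 1 scaling steps (wage bill, labor compensation, total GDP) can be sketched in Python as follows. This is a minimal illustration, not the authors' pipeline: the function name, the dataclass, and the numeric inputs are hypothetical, and the two ratios are placeholders rather than real BEA figures.

```python
from dataclasses import dataclass


@dataclass
class OccupationGDP:
    wage_bill: float   # employment * mean_wage
    gdp_labor: float   # wage bill scaled to total labor compensation
    gdp_total: float   # labor compensation scaled to GDP


def gdp_by_occupation(employment, mean_wage, comp_to_wage_ratio, gdp_to_comp_ratio):
    """Apply the three scaling steps to a single occupation.

    comp_to_wage_ratio: national Total Compensation / Total Wages (assumed input)
    gdp_to_comp_ratio:  National GDP / National Compensation (assumed input)
    """
    wage_bill = employment * mean_wage
    gdp_labor = wage_bill * comp_to_wage_ratio
    gdp_total = gdp_labor * gdp_to_comp_ratio
    return OccupationGDP(wage_bill, gdp_labor, gdp_total)


# Placeholder numbers chosen for easy arithmetic, not real BLS or BEA data.
row = gdp_by_occupation(employment=1000, mean_wage=50_000.0,
                        comp_to_wage_ratio=1.25, gdp_to_comp_ratio=2.0)
print(row)
```

Because both ratios are national constants, the steps preserve the relative ranking of occupations by wage bill; only the absolute scale changes.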
-
[68]
Returns {passed, score, feedback}
Program: a Python function receives the trajectory (screenshots, action log), environment utilities (exec_capture, copy_from_env, query_vlm), and task metadata. Returns {passed, score, feedback}. Programmatic verifiers can also call a VLM internally (e.g., for checklist-based evaluation), combining the flexibility of code with visual grounding
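A programmatic verifier of this shape might look like the sketch below. Only the utility name exec_capture and the returned keys come from the text. The call signature of exec_capture, the metadata keys result_path and expected_value, the 0 to 100 score scale, and the stand-in environment are all assumptions made for illustration.

```python
from typing import Any, Callable, Dict, List


def verify(trajectory: Dict[str, List[Any]],
           env: Dict[str, Callable[..., Any]],
           task: Dict[str, Any]) -> Dict[str, Any]:
    """Hypothetical programmatic verifier returning {passed, score, feedback}."""
    # Assumed signature: exec_capture(command: str) -> str (stdout of the command).
    output = env["exec_capture"](f"cat {task['result_path']}")
    expected = str(task["expected_value"])
    found = expected in output
    # Require at least one logged action so an empty episode cannot pass.
    used_actions = len(trajectory.get("action_log", [])) > 0
    passed = found and used_actions
    score = 100 if passed else (50 if found else 0)
    if passed:
        feedback = "expected value found"
    elif found:
        feedback = "expected value found, but the action log is empty"
    else:
        feedback = f"expected value {expected!r} not found"
    return {"passed": passed, "score": score, "feedback": feedback}


# Stand-in environment utility so the sketch runs without a VM.
fake_env = {"exec_capture": lambda command: "total=42\n"}
task = {"result_path": "/tmp/result.txt", "expected_value": "total=42"}
result = verify({"screenshots": [], "action_log": ["click(10, 20)"]}, fake_env, task)
print(result)
```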
-
[69]
envs/libreoffice_calc/env.json
Image match: SSIM comparison between the final screenshot and a reference image, with a configurable threshold.
3. Multi: cascades program verification first, falling back to image match.
Custom verification strategies can be added by writing a new Python file following the same interface.
B.6 Episode Artifacts
Each episode produces a structured artifact d...
-
[70]
What dataset is used (name, source URL, specific case/patient ID)
-
[71]
What claims does task.json metadata make about expected values (numerical thresholds, counts, measurements)
-
[72]
Which values are hardcoded in scripts vs dynamically computed at runtime
-
[73]
Which values in metadata could be verified via web search
-
[74]
{task_id}
Whether the data is real (downloaded from a public dataset) or synthetic (generated by scripts)
Mark all claims as verifiable_via_web=true and provide a suggested_search_query for each. Your goal is to find the correct value for every claim, not just confirm what task.json says.
Analyze these files for task "{task_id}" and produce a structured JSON respon...
-
[75]
Items should be ordered from earliest to latest
**task_completion** (5-8 items, points must sum to exactly 100): Each item represents a sub-goal or evidence of progress. Items should be ordered from earliest to latest.
- CRITICAL: ONLY include items that are explicitly required by the task description. Do NOT add extra steps.
- Assign more points to harder items
- Each item must be visually verifiable ...
-
[76]
privileged_info_for_vlm
**integrity** (3-4 items): Each item checks for cheating/shortcuts. Common checks:
- Agent used the GUI, not terminal commands
- Agent interacted with the actual application
- Agent didn’t copy-paste expected answers
- Results come from genuine software interaction
Also produce a "privileged_info_for_vlm" field: a concise text with ONLY verified facts tha...
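The numeric constraints on the checklist (5-8 task_completion items summing to exactly 100 points, 3-4 integrity items, a non-empty privileged_info_for_vlm field) are easy to check mechanically. The sketch below is one possible validator. The top-level keys come from the prompt text, while the per-item keys points and description are assumed names.

```python
from typing import Any, Dict, List


def validate_rubric(rubric: Dict[str, Any]) -> List[str]:
    """Return a list of constraint violations; an empty list means the rubric is valid."""
    problems = []
    completion = rubric.get("task_completion", [])
    integrity = rubric.get("integrity", [])
    if not 5 <= len(completion) <= 8:
        problems.append(f"task_completion has {len(completion)} items, expected 5-8")
    total = sum(item.get("points", 0) for item in completion)  # "points" is an assumed key
    if total != 100:
        problems.append(f"task_completion points sum to {total}, expected 100")
    if not 3 <= len(integrity) <= 4:
        problems.append(f"integrity has {len(integrity)} items, expected 3-4")
    if not rubric.get("privileged_info_for_vlm"):
        problems.append("privileged_info_for_vlm is missing or empty")
    return problems


# Hypothetical rubric: five equally weighted steps and three integrity checks.
example = {
    "task_completion": [{"description": f"step {i}", "points": 20} for i in range(1, 6)],
    "integrity": [{"description": d}
                  for d in ("used the GUI", "real application", "no copy-paste")],
    "privileged_info_for_vlm": "placeholder facts",
}
print(validate_rubric(example))
```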
-
[77]
Screenshots. Timestamped screen captures showing: (i) the application running after boot, (ii) the correct starting state for each task, and (iii) the absence of blocking error dialogs
-
[78]
Structured verification data. A JSON file per task recording database queries, file-system checks, service health, and baseline counts: anything the audit agent needs to confirm that preconditions hold without launching the VM
-
[79]
What’s New in Ladybug
Export-script output. Proof that the task’s export_result.sh runs without error and produces valid, parseable JSON with all expected fields.
All artifacts are stored inside the environment directory under the following layout:
examples/<env_name>/
+-- evidence_docs/
|   +-- <task_name>_screenshot.png
|   +-- <task_name>_evidence.json
|   +-- ... (one set per task)...
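An audit step could confirm that the per-task files named in this layout exist before reading them. The following sketch checks for the screenshot and the evidence JSON and verifies that the JSON parses and contains a set of expected fields. The function name and the field names service_health and baseline_count are hypothetical, and the demo builds a throwaway directory rather than a real environment.

```python
import json
import tempfile
from pathlib import Path
from typing import List


def missing_evidence(env_dir: Path, task_name: str, expected_fields: List[str]) -> List[str]:
    """List what is missing from evidence_docs/ for one task."""
    docs = env_dir / "evidence_docs"
    problems = []
    if not (docs / f"{task_name}_screenshot.png").is_file():
        problems.append("screenshot missing")
    evidence_path = docs / f"{task_name}_evidence.json"
    if not evidence_path.is_file():
        problems.append("evidence JSON missing")
        return problems
    try:
        evidence = json.loads(evidence_path.read_text())
    except json.JSONDecodeError:
        problems.append("evidence JSON is not parseable")
        return problems
    for field in expected_fields:
        if field not in evidence:
            problems.append(f"field {field!r} missing")
    return problems


# Build a temporary environment directory that mimics the layout.
with tempfile.TemporaryDirectory() as tmp:
    env_dir = Path(tmp) / "examples" / "demo_env"
    docs = env_dir / "evidence_docs"
    docs.mkdir(parents=True)
    (docs / "demo_task_screenshot.png").write_bytes(b"")
    (docs / "demo_task_evidence.json").write_text(json.dumps({"service_health": "ok"}))
    report = missing_evidence(env_dir, "demo_task", ["service_health", "baseline_count"])
print(report)
```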
2026
-
[80]
identify top talkers
200722_tcp_anon.pcapng has only 2 IPv4 endpoints and 35 packets. The “identify top talkers” task is trivial: there are only 2 endpoints, so the agent has a 50% chance of guessing correctly without even looking. The Endpoints dialog shows 2 entries and the answer is immediately obvious without sorting