TUA-Bench: A Benchmark for General-Purpose Terminal-Use Agents

Belinda Zeng; Fanny Yang; Feiyan Zhou; Luyuan Wang; Shoufa Chen; Xiaohui Zhang; Xuan Yang; Yuanfeng Ji; Yuren Cong; Zhiheng Liu

arxiv: 2606.28480 · v1 · pith:2ZP3DHFDnew · submitted 2026-06-26 · 💻 cs.SE · cs.AI


Pith reviewed 2026-06-30 01:16 UTC · model grok-4.3

keywords: terminal-use agents, computer-use benchmark, TUA-Bench, general-purpose agents, execution-based evaluation, document editing, scientific workflows, Claude Opus

The pith

TUA-Bench evaluates terminal agents on 120 general tasks and finds the top model at 65.8 percent.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces TUA-Bench to measure general-purpose terminal-use agents on tasks that extend past coding into everyday digital work. It supplies 120 manually designed tasks spanning document editing, email management, web information seeking, and PhD-level scientific workflows, each executed in a real terminal with a deterministic setup script. Performance is measured by an execution-based scoring protocol rather than subjective review. The evaluation shows the strongest agent reaches only 65.8 percent overall, with clear shortfalls across task families. This matters because it gives a concrete way to track progress toward agents that can handle diverse terminal environments reliably.

Core claim

TUA-Bench consists of 120 real-world tasks across five families for terminal-use agents. The tasks address routine activities such as document editing and email management together with scientific and engineering workflows co-designed with domain experts. Each task runs in a real terminal under a deterministic setup script and receives an execution-based score. The strongest frontier agent, Claude Code powered by Claude Opus 4.8 at maximum reasoning effort, reaches 65.8 percent overall performance with substantial gaps across both routine and expert tracks.
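
As a rough sanity check on the headline figure, the sketch below asks which whole-task counts would round to 65.8 percent out of 120. It assumes binary pass/fail scoring with equal task weights, which the abstract does not state; if tasks receive partial credit, the score need not map to a whole number of tasks.

```python
# Assumption: each task is pass/fail and all 120 tasks are weighted equally.
# The abstract does not say how per-task scores are aggregated.
TOTAL_TASKS = 120
REPORTED_PERCENT = 65.8


def consistent_pass_counts(total: int, reported: float) -> list[int]:
    """Return every pass count whose percentage rounds to the reported value."""
    return [
        passed
        for passed in range(total + 1)
        if round(100 * passed / total, 1) == reported
    ]


print(consistent_pass_counts(TOTAL_TASKS, REPORTED_PERCENT))  # [79]
```

Under that assumption the figure corresponds to 79 of 120 tasks, leaving 41 unsolved.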

What carries the argument

The TUA-Bench benchmark itself: 120 manually designed tasks, each with a deterministic terminal setup and an execution-based scoring protocol.
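
The ingredients named here (a deterministic setup, a run in a fresh environment, and a score computed from the resulting state) can be illustrated with a minimal sketch. Everything in it is hypothetical: the `Task` record, `run_task`, the toy document task, and the stand-in agents are illustrative names, not TUA-Bench's harness, and a temporary directory stands in for the real terminal environment.

```python
import tempfile
from dataclasses import dataclass
from pathlib import Path
from typing import Callable


@dataclass
class Task:
    name: str
    setup: Callable[[Path], None]   # must create the same files on every run
    check: Callable[[Path], float]  # inspects the final state, returns a score in [0, 1]


def run_task(task: Task, agent: Callable[[Path], None]) -> float:
    """Build a fresh working directory, let the agent act, then score the result."""
    with tempfile.TemporaryDirectory() as tmp:
        workdir = Path(tmp)
        task.setup(workdir)
        try:
            agent(workdir)
        except Exception:
            pass  # a crashed agent is still scored on whatever state it left behind
        return task.check(workdir)


def overall_percent(scores: list[float]) -> float:
    """Average per-task scores and express the result as a percentage."""
    return 100.0 * sum(scores) / len(scores)


# Hypothetical toy task: change a document's status line from "draft" to "final".
def setup_doc(workdir: Path) -> None:
    (workdir / "report.txt").write_text("Status: draft\nBody text.\n")


def check_doc(workdir: Path) -> float:
    lines = (workdir / "report.txt").read_text().splitlines()
    return 1.0 if lines == ["Status: final", "Body text."] else 0.0


def editing_agent(workdir: Path) -> None:
    path = workdir / "report.txt"
    path.write_text(path.read_text().replace("Status: draft", "Status: final"))


def idle_agent(workdir: Path) -> None:
    return None  # does nothing, so the check should fail


task = Task("finalize-report", setup_doc, check_doc)
scores = [run_task(task, editing_agent), run_task(task, idle_agent)]
print(scores, overall_percent(scores))
```

The checker looks only at the files left in the working directory, not at what the agent said it did, which is the property that makes this style of scoring independent of human judgment.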

If this is right

Terminal agents must close large performance gaps before they can be considered reliable for general use.
Routine tasks such as document editing and email management expose limitations not addressed by coding-only benchmarks.
Scientific workflows that require specialized software demand agent capabilities beyond standard shell commands.
Execution-based scoring offers an objective alternative to human judgment for terminal agent evaluation.
Future agent development can target measurable improvement on both routine and expert task families.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The benchmark could be extended with tasks that involve live state changes or multi-turn user interactions to test robustness further.
The observed gaps suggest current agents would benefit from training data that emphasizes non-programming terminal workflows.
Pairing TUA-Bench results with existing GUI benchmarks would allow direct comparison of terminal versus graphical computer-use performance.
A score of 65.8 percent implies that unsupervised deployment of these agents in varied digital environments remains premature.

Load-bearing premise

The 120 manually designed tasks and the execution-based scoring protocol together provide a representative and unbiased measure of general-purpose terminal-use agent capabilities.

What would settle it

A new agent that scores above 90 percent on the 120 tasks yet fails on similar but unseen terminal tasks outside the benchmark set would indicate the tasks do not generalize.

Original abstract

As large language models and harness frameworks continue to advance, agents operating in terminals are increasingly capable of performing a broader range of general computer-use tasks beyond coding. However, existing benchmarks do not adequately evaluate general-purpose terminal computer-use agents (TUAs): general computer-use benchmarks primarily target graphical user interfaces (GUIs), whereas terminal-based benchmarks largely emphasize technical and programming-centric workflows historically native to the shell. We introduce TUA-Bench, a general-purpose benchmark for terminal-use agents. TUA-Bench includes 120 real-world tasks across five task families, covering routine digital activities (including document editing, email management, and live-web information seeking) as well as scientific and engineering workflows co-designed with PhD-level domain experts that require specialized software. This breadth distinguishes TUA-Bench from prior shell-focused or domain-specific benchmarks. Each task is manually designed, runs in a real terminal with a deterministic setup script, and is evaluated by an execution-based scoring protocol. We find that the strongest frontier agent, Claude Code with Claude Opus 4.8 max reasoning effort, achieves 65.8% overall performance, with substantial gaps across both tracks. By providing a broad and realistic evaluation of terminal-use capabilities, TUA-Bench aims to accelerate the transition from narrow, task-specific assistants to general-purpose agents capable of operating reliably across diverse digital environments.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

TUA-Bench adds a terminal benchmark covering routine and expert tasks, but the 65.8% result depends on unvalidated manual task selection.

Here's the quick take: TUA-Bench introduces 120 tasks for terminal-use agents that span everyday digital work and specialized scientific workflows, with the strongest agent reaching 65.8%. The benchmark is the central new piece.

The paper does a reasonable job explaining the gap it targets. Prior terminal benchmarks lean heavily on programming, while general computer-use ones focus on GUIs. Adding document editing, email handling, web searches, and PhD-level tasks co-designed with domain experts gives it wider scope than those. The deterministic setup scripts and execution-based scoring are practical steps that support reproducibility.

The soft spot sits in task construction. All tasks are manually designed by the authors across five families. The description supplies no external checks against usage logs, coverage metrics, or agreement among multiple experts on whether the set is representative. Without that, the reported gaps and ceiling could trace back to curation choices as much as to agent limits. The abstract gives no numbers on inter-rater reliability or edge-case handling either.

This work is aimed at researchers building or testing general computer-use agents who want a terminal-specific evaluation set. A reader already working on agent harnesses could pull ideas from the task families, but broader adoption will require the community to judge the selection process.

I would send it to peer review. The execution protocol is a clear plus and the motivation is timely, but referees can press on validation and suggest concrete ways to test representativeness. That feedback would strengthen the benchmark before wider use.

Referee Report

2 major / 1 minor

Summary. The paper introduces TUA-Bench, a benchmark for general-purpose terminal-use agents (TUAs) consisting of 120 manually designed tasks across five families. Tasks span routine activities (document editing, email, web search) and expert co-designed scientific/engineering workflows, each with deterministic setup scripts and execution-based scoring. The central empirical result is that the strongest frontier agent (Claude Code with Claude Opus 4.8 at max reasoning effort) reaches 65.8% overall success, with substantial gaps across tracks; the benchmark is positioned as more general than prior GUI or shell-focused suites.

Significance. If the task set is representative, TUA-Bench fills a documented gap between GUI-centric computer-use benchmarks and narrow coding/shell benchmarks, supplying a reproducible, execution-scored evaluation that could guide development of broader terminal agents. The deterministic setups and execution-based protocol are explicit strengths that support reproducibility and reduce scoring ambiguity.

major comments (2)

[Abstract] The claim that the 65.8% result and the observed gaps constitute a 'broad and realistic evaluation' of general-purpose TUA capabilities rests on the representativeness of the 120 tasks. The abstract supplies no information on the task selection process, coverage metrics, inter-rater reliability for task design, or external validation against real-world terminal usage distributions or expert consensus. This leaves open the possibility that performance ceilings reflect curation choices rather than intrinsic agent limitations.
[Section 3] Task construction (benchmark design): the five task families and the PhD co-design are described, but the manuscript provides no quantitative validation (e.g., diversity statistics, overlap with usage logs, or comparison to existing terminal corpora) that would confirm the tasks are unbiased relative to the general-purpose claim. This directly affects whether the 65.8% ceiling can be interpreted as a frontier measurement.

minor comments (1)

[Abstract] The model identifier 'Claude Opus 4.8' is not standard; a brief clarification of the exact model version and reasoning-effort parameterization would improve precision.

Simulated Author's Rebuttal

2 responses · 1 unresolved

We thank the referee for highlighting the importance of justifying the representativeness of TUA-Bench tasks. We address each major comment below and will revise the manuscript to strengthen the description of task construction while being transparent about limitations.

Referee: [Abstract] The claim that the 65.8% result and the observed gaps constitute a 'broad and realistic evaluation' of general-purpose TUA capabilities rests on the representativeness of the 120 tasks. The abstract supplies no information on the task selection process, coverage metrics, inter-rater reliability for task design, or external validation against real-world terminal usage distributions or expert consensus. This leaves open the possibility that performance ceilings reflect curation choices rather than intrinsic agent limitations.

Authors: We agree that the abstract would benefit from more context on task selection. In the revision we will expand the abstract to state that the 120 tasks were manually designed to cover five families spanning routine digital activities and scientific workflows co-designed with PhD-level domain experts, with deterministic setups and execution-based scoring. Inter-rater reliability metrics were not computed because task design was an iterative collaborative process rather than independent rating. We will also note that external validation against usage distributions was not performed. These additions will make the scope and limitations of the 'broad and realistic' phrasing clearer without overstating the evidence.

revision: partial

Referee: [Section 3] Task construction (benchmark design): the five task families and the PhD co-design are described, but the manuscript provides no quantitative validation (e.g., diversity statistics, overlap with usage logs, or comparison to existing terminal corpora) that would confirm the tasks are unbiased relative to the general-purpose claim. This directly affects whether the 65.8% ceiling can be interpreted as a frontier measurement.

Authors: We will revise Section 3 to include quantitative diversity statistics, such as the breakdown of tasks by family, command categories, and estimated complexity. We will also add a limitations paragraph explicitly discussing the absence of direct comparison to public terminal usage logs or corpora. The task set was constructed through author expertise supplemented by PhD-level domain experts to target both everyday and specialized terminal activities; however, no suitable public corpora existed for quantitative overlap analysis. This revision will allow readers to evaluate the general-purpose claim more precisely while preserving the benchmark's contribution.

revision: partial

standing simulated objections not resolved

Direct quantitative comparison against real-world terminal usage distributions or existing corpora remains unaddressed, because no appropriate public datasets were identified for such validation.

Circularity Check

0 steps flagged

No circularity: empirical benchmark with direct measurements

The paper introduces TUA-Bench as a new set of 120 manually designed tasks with deterministic setup scripts and execution-based scoring. No equations, fitted parameters, or predictions are derived from prior quantities; reported performance (e.g., 65.8% for Claude Code) consists of direct empirical measurements on the introduced tasks. No self-citations serve as load-bearing premises for any result, and the construction does not reduce any claimed outcome to its own inputs by definition. The work is self-contained as a benchmark proposal.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Benchmark introduction paper containing no mathematical derivations, fitted parameters, or postulated entities.


