arxiv: 2606.28480 · v1 · pith:2ZP3DHFDnew · submitted 2026-06-26 · 💻 cs.SE · cs.AI

TUA-Bench: A Benchmark for General-Purpose Terminal-Use Agents

Pith reviewed 2026-06-30 01:16 UTC · model grok-4.3

classification 💻 cs.SE cs.AI
keywords terminal-use agents · computer-use benchmark · TUA-Bench · general-purpose agents · execution-based evaluation · document editing · scientific workflows · Claude Opus

The pith

TUA-Bench evaluates terminal-use agents on 120 general-purpose tasks and finds that the strongest agent reaches only 65.8 percent.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces TUA-Bench to measure general-purpose terminal-use agents on tasks that extend past coding into everyday digital work. It supplies 120 manually designed tasks spanning document editing, email management, web information seeking, and PhD-level scientific workflows, each executed in a real terminal with a deterministic setup script. Performance is measured by an execution-based scoring protocol rather than subjective review. The evaluation shows the strongest agent reaches only 65.8 percent overall, with clear shortfalls across task families. This matters because it gives a concrete way to track progress toward agents that can handle diverse terminal environments reliably.
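To give the headline number a concrete scale, the sketch below asks which task counts are consistent with 65.8 percent on 120 tasks. It rests on an assumption the summary does not confirm: that every task is scored strictly pass/fail and the overall score is the plain mean. If the scoring protocol awards partial credit, this reading does not apply.

```python
TOTAL_TASKS = 120         # task count reported in the paper
REPORTED_PERCENT = 65.8   # overall score reported in the paper

# Assumption (not confirmed by the source): binary pass/fail per task,
# overall score = passed / total, reported to one decimal place.
matching = [
    passed
    for passed in range(TOTAL_TASKS + 1)
    if round(100 * passed / TOTAL_TASKS, 1) == REPORTED_PERCENT
]
print(matching)  # → [79]
```

Under that assumption, the top agent would have solved 79 tasks and failed 41.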

Core claim

TUA-Bench consists of 120 real-world tasks across five families for terminal-use agents. The tasks address routine activities such as document editing and email management together with scientific and engineering workflows co-designed with domain experts. Each task runs in a real terminal under a deterministic setup script and receives an execution-based score. The strongest frontier agent, Claude Code powered by Claude Opus 4.8 at maximum reasoning effort, reaches 65.8 percent overall performance with substantial gaps across both routine and expert tracks.
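The combination of a deterministic setup script and an execution-based score can be illustrated with a minimal sketch. This is not the paper's harness: the function name `run_task`, the toy scripts, and the rule "pass if the checker exits with status 0" are all assumptions made for illustration, and a POSIX `sh` is assumed to be available.

```python
import subprocess
import tempfile


def run_task(setup_script: str, agent_commands: str, check_script: str) -> bool:
    """Hypothetical harness: set up, let the 'agent' act, then check the result."""
    with tempfile.TemporaryDirectory() as workdir:
        # Deterministic setup: every run starts from the same initial files.
        subprocess.run(["sh", "-c", setup_script], cwd=workdir, check=True)
        # Stand-in for the agent's terminal session; its failures are tolerated.
        subprocess.run(["sh", "-c", agent_commands], cwd=workdir)
        # Execution-based scoring: inspect the final state, not the transcript.
        verdict = subprocess.run(["sh", "-c", check_script], cwd=workdir)
        return verdict.returncode == 0


# Toy document-editing task: fix a typo in note.txt.
SETUP = "echo 'helo world' > note.txt"
CHECK = "grep -qx 'hello world' note.txt"

fixed = run_task(SETUP, "echo 'hello world' > note.txt", CHECK)
untouched = run_task(SETUP, "true", CHECK)
print(fixed, untouched)
```

The point of the design is that the checker looks only at the state the agent leaves behind, so two runs that reach the same final files receive the same score regardless of how the agent got there.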

What carries the argument

The TUA-Bench benchmark itself: 120 manually designed tasks, each with a deterministic terminal setup and an execution-based scoring protocol.

If this is right

  • Terminal agents must close large performance gaps before they can be considered reliable for general use.
  • Routine tasks such as document editing and email management expose limitations not addressed by coding-only benchmarks.
  • Scientific workflows that require specialized software demand agent capabilities beyond standard shell commands.
  • Execution-based scoring offers an objective alternative to human judgment for terminal agent evaluation.
  • Future agent development can target measurable improvement on both routine and expert task families.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The benchmark could be extended with tasks that involve live state changes or multi-turn user interactions to test robustness further.
  • The observed gaps suggest current agents would benefit from training data that emphasizes non-programming terminal workflows.
  • Pairing TUA-Bench results with existing GUI benchmarks would allow direct comparison of terminal versus graphical computer-use performance.
  • A score of 65.8 percent implies that unsupervised deployment of these agents in varied digital environments remains premature.

Load-bearing premise

The 120 manually designed tasks and the execution-based scoring protocol together provide a representative and unbiased measure of general-purpose terminal-use agent capabilities.

What would settle it

A new agent that scores above 90 percent on the 120 tasks yet fails on similar but unseen terminal tasks outside the benchmark set would indicate the tasks do not generalize.

Original abstract

As large language models and harness frameworks continue to advance, agents operating in terminals are increasingly capable of performing a broader range of general computer-use tasks beyond coding. However, existing benchmarks do not adequately evaluate general-purpose terminal computer-use agents (TUAs): general computer-use benchmarks primarily target graphical user interfaces (GUIs), whereas terminal-based benchmarks largely emphasize technical and programming-centric workflows historically native to the shell. We introduce TUA-Bench, a general-purpose benchmark for terminal-use agents. TUA-Bench includes 120 real-world tasks across five task families, covering routine digital activities (including document editing, email management, and live-web information seeking) as well as scientific and engineering workflows co-designed with PhD-level domain experts that require specialized software. This breadth distinguishes TUA-Bench from prior shell-focused or domain-specific benchmarks. Each task is manually designed, runs in a real terminal with a deterministic setup script, and is evaluated by an execution-based scoring protocol. We find that the strongest frontier agent, Claude Code with Claude Opus 4.8 max reasoning effort, achieves 65.8% overall performance, with substantial gaps across both tracks. By providing a broad and realistic evaluation of terminal-use capabilities, TUA-Bench aims to accelerate the transition from narrow, task-specific assistants to general-purpose agents capable of operating reliably across diverse digital environments.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper introduces TUA-Bench, a benchmark for general-purpose terminal-use agents (TUAs) consisting of 120 manually designed tasks across five families. Tasks span routine activities (document editing, email, web search) and expert co-designed scientific/engineering workflows, each with deterministic setup scripts and execution-based scoring. The central empirical result is that the strongest frontier agent (Claude Code with Claude Opus 4.8 at max reasoning effort) reaches 65.8% overall success, with substantial gaps across tracks; the benchmark is positioned as more general than prior GUI or shell-focused suites.

Significance. If the task set is representative, TUA-Bench fills a documented gap between GUI-centric computer-use benchmarks and narrow coding/shell benchmarks, supplying a reproducible, execution-scored evaluation that could guide development of broader terminal agents. The deterministic setups and execution-based protocol are explicit strengths that support reproducibility and reduce scoring ambiguity.

major comments (2)
  1. [Abstract] Abstract: the claim that the 65.8% result and observed gaps constitute a 'broad and realistic evaluation' of general-purpose TUA capabilities is load-bearing on the representativeness of the 120 tasks; the abstract supplies no information on task selection process, coverage metrics, inter-rater reliability for task design, or external validation against real-world terminal usage distributions or expert consensus, leaving open the possibility that performance ceilings reflect curation choices rather than intrinsic limitations.
  2. [Section 3] Task construction description (Section 3 / benchmark design): while the five task families and PhD co-design are described, the manuscript provides no quantitative validation (e.g., diversity statistics, overlap with usage logs, or comparison to existing terminal corpora) that would confirm the tasks are unbiased relative to the general-purpose claim; this directly affects whether the 65.8% ceiling can be interpreted as a frontier measurement.
minor comments (1)
  1. [Abstract] Abstract: the model identifier 'Claude Opus 4.8' is not standard; a brief clarification of the exact model/version and reasoning-effort parameterization would improve precision.

Simulated Author's Rebuttal

2 responses · 1 unresolved

We thank the referee for highlighting the importance of justifying the representativeness of TUA-Bench tasks. We address each major comment below and will revise the manuscript to strengthen the description of task construction while being transparent about limitations.

Point-by-point responses
  1. Referee: [Abstract] Abstract: the claim that the 65.8% result and observed gaps constitute a 'broad and realistic evaluation' of general-purpose TUA capabilities is load-bearing on the representativeness of the 120 tasks; the abstract supplies no information on task selection process, coverage metrics, inter-rater reliability for task design, or external validation against real-world terminal usage distributions or expert consensus, leaving open the possibility that performance ceilings reflect curation choices rather than intrinsic limitations.

    Authors: We agree that the abstract would benefit from more context on task selection. In the revision we will expand the abstract to state that the 120 tasks were manually designed to cover five families spanning routine digital activities and PhD-co-designed scientific workflows, with deterministic setups and execution-based scoring. Inter-rater reliability metrics were not computed because task design was an iterative collaborative process rather than independent rating. We will also note that external validation against usage distributions was not performed. These additions will make the scope and limitations of the 'broad and realistic' phrasing clearer without overstating the evidence.
    Revision: partial

  2. Referee: [Section 3] Task construction description (Section 3 / benchmark design): while the five task families and PhD co-design are described, the manuscript provides no quantitative validation (e.g., diversity statistics, overlap with usage logs, or comparison to existing terminal corpora) that would confirm the tasks are unbiased relative to the general-purpose claim; this directly affects whether the 65.8% ceiling can be interpreted as a frontier measurement.

    Authors: We will revise Section 3 to include quantitative diversity statistics, such as the breakdown of tasks by family, command categories, and estimated complexity. We will also add a limitations paragraph explicitly discussing the absence of direct comparison to public terminal usage logs or corpora. The task set was constructed through author expertise supplemented by PhD-level domain experts to target both everyday and specialized terminal activities; however, no suitable public corpora existed for quantitative overlap analysis. This revision will allow readers to evaluate the general-purpose claim more precisely while preserving the benchmark's contribution.
    Revision: partial
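One simple form the promised diversity statistics could take is a normalized entropy over the per-family task counts. The sketch below is an illustration only: the function name is hypothetical, and both count lists are placeholders chosen to sum to 120, not the paper's actual family breakdown, which this summary does not report.

```python
import math


def normalized_entropy(counts):
    """Shannon entropy of a count distribution, scaled to [0, 1]."""
    if len(counts) < 2:
        return 0.0
    total = sum(counts)
    probs = [c / total for c in counts if c > 0]
    entropy = -sum(p * math.log(p) for p in probs)
    # Divide by the maximum entropy for this many categories.
    return entropy / math.log(len(counts))


# Placeholder counts for five families (NOT the paper's real numbers).
balanced = normalized_entropy([24, 24, 24, 24, 24])
skewed = normalized_entropy([100, 5, 5, 5, 5])
print(round(balanced, 3), round(skewed, 3))
```

A value near 1 means tasks are spread evenly across families; a value well below 1 means one family dominates and the overall score mostly reflects that family.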

standing simulated objections not resolved
  • No direct quantitative overlap analysis or comparison against real-world terminal usage distributions or existing corpora was provided, because no appropriate public datasets were identified for such validation.

Circularity Check

0 steps flagged

No circularity: empirical benchmark with direct measurements

The paper introduces TUA-Bench as a new set of 120 manually designed tasks with deterministic setup scripts and execution-based scoring. No equations, fitted parameters, or predictions are derived from prior quantities; reported performance (e.g., 65.8% for Claude Code) consists of direct empirical measurements on the introduced tasks. No self-citations serve as load-bearing premises for any result, and the construction does not reduce any claimed outcome to its own inputs by definition. The work is self-contained as a benchmark proposal.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Benchmark introduction paper containing no mathematical derivations, fitted parameters, or postulated entities.

pith-pipeline@v0.9.1-grok · 5798 in / 934 out tokens · 32225 ms · 2026-06-30T01:16:51.018011+00:00 · methodology
