CLI-Anything: Towards Agent-Native Computer Use
Pith reviewed 2026-06-28 08:18 UTC · model grok-4.3
The pith
Applications can be transformed into command-line harnesses that let AI agents use structured commands and explicit states instead of visual clicks and screenshots.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
CLI-Anything holds that existing applications can be turned into command-line harnesses that preserve full functionality while providing structured commands, explicit state representations, and deterministic feedback. Agents can then operate through precise programmatic control rather than emulating human perceptual limits, which removes the visual-to-computational translation that limits GUI agents.
What carries the argument
Command-line harnesses that preserve application functionality while exposing machine-readable protocols optimized for AI interaction.
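The paper as reviewed here gives no protocol or worked example, so the following is only a minimal sketch of what "structured commands, explicit state representations, and deterministic feedback" could look like for a toy in-memory application. Every name in it (`new_state`, `run_command`, the `set-title`, `append-line`, and `get-state` commands, and the JSON reply shape) is a hypothetical chosen for illustration, not CLI-Anything's or CLI-Hub's actual interface.

```python
import copy
import json

# Hypothetical toy application: a document with a title and a list of lines.
# A real harness would wrap an existing application instead of a dict.
def new_state():
    return {"title": "", "lines": []}

def set_title(state, text):
    state["title"] = text

def append_line(state, text):
    state["lines"].append(text)

def get_state(state):
    pass  # read-only command: the reply already carries the state snapshot

# Command table: name -> (handler, number of arguments expected).
COMMANDS = {
    "set-title": (set_title, 1),
    "append-line": (append_line, 1),
    "get-state": (get_state, 0),
}

def run_command(state, argv):
    """Apply one structured command and return a machine-readable reply.

    The reply always has the same three keys, so an agent never has to
    infer success or failure from rendered output.
    """
    def reply(ok, error):
        return {"ok": ok, "error": error, "state": copy.deepcopy(state)}

    if not argv or argv[0] not in COMMANDS:
        return reply(False, "unknown command")
    handler, arity = COMMANDS[argv[0]]
    args = argv[1:]
    if len(args) != arity:
        return reply(False, f"{argv[0]} expects {arity} argument(s)")
    handler(state, *args)
    return reply(True, None)

if __name__ == "__main__":
    state = new_state()
    for argv in (["set-title", "Report"],
                 ["append-line", "first line"],
                 ["click", "10", "20"]):  # coordinate-style action is rejected
        print(json.dumps(run_command(state, argv), sort_keys=True))
```

The design choice worth noting is that every reply, including a rejected command, returns the full state snapshot, which is one plausible reading of "explicit state representations" and not a claim about how the authors implement it.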
If this is right
- Agents avoid brittle pixel-level interactions, timing dependencies, and coordinate-based actions that break with interface changes.
- Agents use structured data processing and programmatic control rather than emulating human perceptual limitations.
- Redesign centers on explicit state representations and deterministic execution instead of screen readers or click simulators.
- Existing software can be adapted without building new visual interpretation layers.
- Computer use paradigms shift to align directly with agent computational strengths.
Where Pith is reading between the lines
- Developers may need to plan for both human and agent users when maintaining or updating applications.
- Converted interfaces could combine with existing CLI tools to expand what agents can reliably accomplish.
- Direct comparisons of agent performance on the same tasks before and after conversion would test the approach on specific software.
- Widespread use might lower the need for agents to perform expensive visual processing at each step.
Load-bearing premise
It is feasible and practical to convert existing applications into CLI harnesses without losing functionality for human users or adding new limitations for agents.
What would settle it
An implementation on a complex application where the CLI harness version loses core features, requires visual fallbacks, or produces lower agent task success rates than the original GUI.
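One way to make this test concrete is to run the same task set through both interfaces and compare success rates. The sketch below assumes paired boolean outcomes per task; the function names and the result keys are hypothetical, and the values in the demo are placeholders invented purely to show the call, not results from the paper or from any benchmark.

```python
def success_rate(outcomes):
    """Fraction of tasks that succeeded, given a list of booleans."""
    if not outcomes:
        raise ValueError("no outcomes recorded")
    return sum(1 for ok in outcomes if ok) / len(outcomes)

def compare_interfaces(gui_outcomes, cli_outcomes):
    """Compare paired per-task outcomes for the GUI and the CLI harness.

    Index i in both lists must refer to the same task.
    """
    if len(gui_outcomes) != len(cli_outcomes):
        raise ValueError("outcome lists must cover the same tasks")
    gui_rate = success_rate(gui_outcomes)
    cli_rate = success_rate(cli_outcomes)
    # Tasks the agent solved through the GUI but failed through the harness.
    regressed = [i for i, (g, c) in enumerate(zip(gui_outcomes, cli_outcomes))
                 if g and not c]
    return {
        "gui_rate": gui_rate,
        "cli_rate": cli_rate,
        "cli_lower": cli_rate < gui_rate,
        "regressed_tasks": regressed,
    }

if __name__ == "__main__":
    # Placeholder outcomes, made up for illustration only.
    print(compare_interfaces([True, True, False, True],
                             [True, False, False, False]))
```

A `cli_lower` value of true on a complex application, or a non-empty `regressed_tasks` list that traces back to missing features, would count against the load-bearing premise in the sense described above.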
Original abstract
As large language models advance in reasoning and tool use capabilities, researchers increasingly seek to leverage them for computer use agents that can interact with existing software. The dominant approach develops GUI agents that control applications through visual interfaces: interpreting screenshots, locating UI elements, and executing mouse clicks to mimic human interaction. This GUI-centric paradigm fundamentally misaligns with agent capabilities. Current GUI agents struggle with brittle pixel-level interactions, timing dependencies, and coordinate-based actions that break with interface changes. They force agents to emulate human perceptual limitations rather than leverage their computational strengths in structured data processing and programmatic control. CLI-Anything argues for agent-native computer use design. Instead of forcing agents to navigate visual layouts, we create interfaces aligned with how agents naturally operate: through structured commands, explicit state representations, and deterministic feedback. We transform existing applications into command-line harnesses that preserve functionality while exposing machine-readable protocols optimized for AI-native interaction. This eliminates the lossy visual-to-computational translation that plagues GUI agents. Rather than building sophisticated screen readers and click simulators, we should redesign interaction paradigms around agent strengths: precise programmatic control and deterministic execution. We examine the methodology, architecture, evidence, and future directions for this agent-native transformation of computer use. We have built CLI-Hub as a comprehensive platform that operationalizes this agent-native computer use vision. The platform provides methodology, architecture, and infrastructure for this fundamental transformation of computer use.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper argues that GUI agents for computer use misalign with LLM strengths in structured reasoning, proposing instead to transform existing applications into command-line harnesses that expose machine-readable protocols and deterministic state. It introduces CLI-Hub as the platform that operationalizes this agent-native paradigm, claiming to eliminate lossy visual-to-computational translation while preserving full application functionality.
Significance. If the feasibility of lossless, limitation-free app-to-CLI transformation were demonstrated with concrete methodology and examples, the work could meaningfully redirect HCI and agent research toward programmatic interfaces. In its current form the contribution remains conceptual; the absence of any architecture, protocol, worked example, or evaluation means the significance is limited to a position statement rather than a substantiated result.
major comments (2)
- [Abstract] Abstract: the claim that 'we transform existing applications into command-line harnesses that preserve functionality while exposing machine-readable protocols' is presented as accomplished fact, yet the manuscript supplies neither the conversion methodology, protocol specification, architecture diagram, nor a single before/after example of any application.
- [Abstract] Abstract: the statement that the authors 'have built CLI-Hub as a comprehensive platform that operationalizes this agent-native computer use vision' and 'examine the methodology, architecture, evidence' is unsupported; no such sections, diagrams, or evidence appear, which is load-bearing for the central claim that visual-to-computational loss is eliminated.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback. The manuscript is a position paper advocating a paradigm shift toward agent-native interfaces, and we agree that the abstract's language overstates the work as a completed implementation rather than a proposal. We will revise it to clarify the scope while maintaining the core argument.
Point-by-point responses
- Referee: [Abstract] Abstract: the claim that 'we transform existing applications into command-line harnesses that preserve functionality while exposing machine-readable protocols' is presented as accomplished fact, yet the manuscript supplies neither the conversion methodology, protocol specification, architecture diagram, nor a single before/after example of any application.
  Authors: We accept this observation. The paper advances a conceptual argument for redesigning applications around programmatic protocols rather than providing a concrete conversion toolkit or worked examples. We will revise the abstract to use prospective language (e.g., 'we propose transforming...' and 'we outline a methodology...') to accurately reflect the position-paper nature of the contribution.
  Revision: yes
- Referee: [Abstract] Abstract: the statement that the authors 'have built CLI-Hub as a comprehensive platform that operationalizes this agent-native computer use vision' and 'examine the methodology, architecture, evidence' is unsupported; no such sections, diagrams, or evidence appear, which is load-bearing for the central claim that visual-to-computational loss is eliminated.
  Authors: The referee correctly notes the absence of detailed architecture, protocols, or empirical evidence. The manuscript introduces CLI-Hub at the level of vision and high-level principles. We will revise the abstract to state that we 'introduce the CLI-Hub concept' and 'discuss the methodology and architecture at a conceptual level,' removing unsupported claims of a built platform and examined evidence.
  Revision: yes
Circularity Check
No circularity; the conceptual argument lacks derivations or self-referential reductions.
Full rationale
The manuscript advances a position that existing applications can be transformed into CLI harnesses preserving functionality and eliminating visual-to-computational loss, operationalized via CLI-Hub. No equations, fitted parameters, predictions, or uniqueness theorems appear in the provided text. The central claim is an assertion of performed transformation rather than a derivation that reduces to its own inputs by construction. No self-citations or ansatzes are invoked in a load-bearing manner. The argument is therefore self-contained as a design proposal.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: agents possess computational strengths in structured data processing and programmatic control that exceed their performance on visual interfaces.
invented entities (1)
- CLI-Hub: no independent evidence
Forward citations
Cited by 2 Pith papers
- TUA-Bench: A Benchmark for General-Purpose Terminal-Use Agents
  TUA-Bench provides 120 manually designed terminal tasks across five families with execution-based scoring; the top agent reaches 65.8% success.
- PhoneBuddy: Training Open Models for Agentic Phone Use
  PhoneBuddy combines real-app and mock-app RL after shared SFT, raising real-phone task success from 36.67% to 45.33% and AndroidWorld from 60.3% to 83.2%.