
arxiv: 2606.03854 · v1 · pith:QRP44QLUnew · submitted 2026-06-02 · 💻 cs.HC

CLI-Anything: Towards Agent-Native Computer Use

Pith reviewed 2026-06-28 08:18 UTC · model grok-4.3

classification 💻 cs.HC
keywords CLI harnesses · agent-native interfaces · GUI agents · computer use agents · programmatic control · machine-readable protocols · AI tool use

The pith

Applications can be transformed into command-line harnesses that let AI agents use structured commands and explicit states instead of visual clicks and screenshots.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper argues that GUI agents for computer use force AI systems to mimic human visual perception, which leads to brittle interactions based on pixels, coordinates, and timing. It proposes instead converting existing applications into command-line harnesses that preserve all original functionality while exposing machine-readable protocols. This shift aligns interfaces with agent strengths in programmatic control and deterministic feedback. The authors describe the methodology and introduce CLI-Hub as a platform to support the change. A sympathetic reader would care because the approach would remove the lossy translation between visual interfaces and computational reasoning.

Core claim

CLI-Anything claims that existing applications can be turned into command-line harnesses that preserve full functionality while providing structured commands, explicit state representations, and deterministic feedback. Agents can then operate through precise programmatic control rather than emulating human perceptual limits, which removes the visual-to-computational translation that limits GUI agents.

What carries the argument

Command-line harnesses that preserve application functionality while exposing machine-readable protocols optimized for AI interaction.
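
To make "structured commands, explicit state, deterministic feedback" concrete, here is a minimal Python sketch of what such a harness might look like. It is not taken from the paper, and nothing here describes CLI-Hub. The `NoteApp` stand-in application, the `notes-harness` program name, the subcommands, and the JSON reply shape are all hypothetical choices made for illustration.

```python
import argparse
import json


class NoteApp:
    """Hypothetical stand-in application: an in-memory list of notes."""

    def __init__(self):
        self.notes = []

    def add(self, text):
        self.notes.append(text)
        return len(self.notes) - 1

    def delete(self, index):
        if not 0 <= index < len(self.notes):
            raise IndexError(f"no note at index {index}")
        return self.notes.pop(index)


def build_parser():
    # Structured commands: each action is a named subcommand with typed arguments.
    parser = argparse.ArgumentParser(prog="notes-harness")
    sub = parser.add_subparsers(dest="command", required=True)
    add = sub.add_parser("add")
    add.add_argument("text")
    delete = sub.add_parser("delete")
    delete.add_argument("index", type=int)
    sub.add_parser("state")
    return parser


def run(app, argv):
    """Execute one command; return a JSON reply with the result and the full state."""
    args = build_parser().parse_args(argv)
    try:
        if args.command == "add":
            result = {"index": app.add(args.text)}
        elif args.command == "delete":
            result = {"deleted": app.delete(args.index)}
        else:  # "state": no side effect, only report the state
            result = {}
        reply = {"ok": True, "result": result}
    except IndexError as exc:
        # Failures are reported as data rather than as a visual error dialog.
        reply = {"ok": False, "error": str(exc)}
    # Explicit state: every reply carries the application state after the command.
    reply["state"] = {"notes": list(app.notes)}
    # Deterministic feedback: sorted keys give byte-identical output for identical state.
    return json.dumps(reply, sort_keys=True)
```

Under these assumptions, `run(NoteApp(), ["add", "hello"])` returns `{"ok": true, "result": {"index": 0}, "state": {"notes": ["hello"]}}`. An agent reads the outcome and the resulting state from the reply itself instead of inferring them from a screenshot.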

If this is right

  • Agents avoid brittle pixel-level interactions, timing dependencies, and coordinate-based actions that break with interface changes.
  • Agents use structured data processing and programmatic control rather than emulating human perceptual limitations.
  • Redesign centers on explicit state representations and deterministic execution instead of screen readers or click simulators.
  • Existing software can be adapted without building new visual interpretation layers.
  • Computer use paradigms shift to align directly with agent computational strengths.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Developers may need to plan for both human and agent users when maintaining or updating applications.
  • Converted interfaces could combine with existing CLI tools to expand what agents can reliably accomplish.
  • Direct comparisons of agent performance on the same tasks before and after conversion would test the approach on specific software.
  • Widespread use might lower the need for agents to perform expensive visual processing at each step.

Load-bearing premise

It is feasible and practical to convert existing applications into CLI harnesses without losing functionality for human users or adding new limitations for agents.

What would settle it

An implementation on a complex application where the CLI harness version loses core features, requires visual fallbacks, or produces lower agent task success rates than the original GUI.
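
One way such a test could be scored is a paired comparison over the same task set. The sketch below is an assumption about how that bookkeeping might look, not a procedure from the paper; the function name `compare_outcomes` and the dictionary layout (task name mapped to a success flag) are hypothetical.

```python
def compare_outcomes(gui, cli):
    """Compare per-task success flags for the GUI and the CLI harness.

    Both arguments map a task name to True (success) or False (failure)
    and must cover exactly the same tasks.
    """
    if gui.keys() != cli.keys():
        raise ValueError("both conditions must cover the same tasks")
    if not gui:
        raise ValueError("at least one task is required")
    n = len(gui)
    return {
        "gui_rate": sum(gui.values()) / n,
        "cli_rate": sum(cli.values()) / n,
        # Tasks the harness solves that the GUI agent does not.
        "cli_only": sorted(t for t in gui if cli[t] and not gui[t]),
        # Tasks lost in conversion: the evidence that would count against the premise.
        "gui_only": sorted(t for t in gui if gui[t] and not cli[t]),
    }
```

A non-empty `gui_only` list, or a `cli_rate` below `gui_rate`, would be the kind of outcome described above. Listing the individual tasks matters because equal aggregate rates can hide features that were lost in conversion.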

Original abstract

As large language models advance in reasoning and tool use capabilities, researchers increasingly seek to leverage them for computer use agents that can interact with existing software. The dominant approach develops GUI agents that control applications through visual interfaces: interpreting screenshots, locating UI elements, and executing mouse clicks to mimic human interaction. This GUI-centric paradigm fundamentally misaligns with agent capabilities. Current GUI agents struggle with brittle pixel-level interactions, timing dependencies, and coordinate-based actions that break with interface changes. They force agents to emulate human perceptual limitations rather than leverage their computational strengths in structured data processing and programmatic control. CLI-Anything argues for agent-native computer use design. Instead of forcing agents to navigate visual layouts, we create interfaces aligned with how agents naturally operate: through structured commands, explicit state representations, and deterministic feedback. We transform existing applications into command-line harnesses that preserve functionality while exposing machine-readable protocols optimized for AI-native interaction. This eliminates the lossy visual-to-computational translation that plagues GUI agents. Rather than building sophisticated screen readers and click simulators, we should redesign interaction paradigms around agent strengths: precise programmatic control and deterministic execution. We examine the methodology, architecture, evidence, and future directions for this agent-native transformation of computer use. We have built CLI-Hub as a comprehensive platform that operationalizes this agent-native computer use vision. The platform provides methodology, architecture, and infrastructure for this fundamental transformation of computer use.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 0 minor

Summary. The paper argues that GUI agents for computer use misalign with LLM strengths in structured reasoning, proposing instead to transform existing applications into command-line harnesses that expose machine-readable protocols and deterministic state. It introduces CLI-Hub as the platform that operationalizes this agent-native paradigm, claiming to eliminate lossy visual-to-computational translation while preserving full application functionality.

Significance. If the feasibility of lossless, limitation-free app-to-CLI transformation were demonstrated with concrete methodology and examples, the work could meaningfully redirect HCI and agent research toward programmatic interfaces. In its current form the contribution remains conceptual; the absence of any architecture, protocol, worked example, or evaluation means the significance is limited to a position statement rather than a substantiated result.

major comments (2)
  1. [Abstract] The claim that 'we transform existing applications into command-line harnesses that preserve functionality while exposing machine-readable protocols' is presented as accomplished fact, yet the manuscript supplies no conversion methodology, protocol specification, or architecture diagram, and not a single before/after example of any application.
  2. [Abstract] The statements that the authors 'have built CLI-Hub as a comprehensive platform that operationalizes this agent-native computer use vision' and 'examine the methodology, architecture, evidence' are unsupported. No such sections, diagrams, or evidence appear, even though they are load-bearing for the central claim that visual-to-computational loss is eliminated.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. The manuscript is a position paper advocating a paradigm shift toward agent-native interfaces, and we agree that the abstract presents the work as a completed implementation rather than a proposal. We will revise it to clarify the scope while maintaining the core argument.

point-by-point responses
  1. Referee: [Abstract] The claim that 'we transform existing applications into command-line harnesses that preserve functionality while exposing machine-readable protocols' is presented as accomplished fact, yet the manuscript supplies no conversion methodology, protocol specification, or architecture diagram, and not a single before/after example of any application.

    Authors: We accept this observation. The paper advances a conceptual argument for redesigning applications around programmatic protocols rather than providing a concrete conversion toolkit or worked examples. We will revise the abstract to use prospective language (e.g., 'we propose transforming...' and 'we outline a methodology...') to accurately reflect the position-paper nature of the contribution. revision: yes

  2. Referee: [Abstract] The statements that the authors 'have built CLI-Hub as a comprehensive platform that operationalizes this agent-native computer use vision' and 'examine the methodology, architecture, evidence' are unsupported. No such sections, diagrams, or evidence appear, even though they are load-bearing for the central claim that visual-to-computational loss is eliminated.

    Authors: The referee correctly notes the absence of detailed architecture, protocols, or empirical evidence. The manuscript introduces CLI-Hub at the level of vision and high-level principles. We will revise the abstract to state that we 'introduce the CLI-Hub concept' and 'discuss the methodology and architecture at a conceptual level,' removing unsupported claims of a built platform and examined evidence. revision: yes

Circularity Check

0 steps flagged

No circularity; conceptual argument lacks derivations or self-referential reductions

full rationale

The manuscript advances a position that existing applications can be transformed into CLI harnesses preserving functionality and eliminating visual-to-computational loss, operationalized via CLI-Hub. No equations, fitted parameters, predictions, or uniqueness theorems appear in the provided text. The central claim is an assertion of performed transformation rather than a derivation that reduces to its own inputs by construction. No self-citations or ansatzes are invoked in a load-bearing manner. The argument is therefore self-contained as a design proposal.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The paper relies on domain assumptions about agent capabilities and introduces the CLI-Hub platform as a new entity without supporting evidence.

axioms (1)
  • domain assumption: Agents possess computational strengths in structured data processing and programmatic control that exceed their performance on visual interfaces.
    Invoked in the abstract to justify the misalignment of GUI agents.
invented entities (1)
  • CLI-Hub (no independent evidence)
    purpose: Platform providing methodology, architecture, and infrastructure for agent-native computer use.
    Introduced as the operationalization of the vision but without any description of implementation or validation.



Forward citations

Cited by 2 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. TUA-Bench: A Benchmark for General-Purpose Terminal-Use Agents

    cs.SE · 2026-06 · unverdicted · novelty 7.0

    TUA-Bench provides 120 manually designed terminal tasks across five families with execution-based scoring; the top agent reaches 65.8% success.

  2. PhoneBuddy: Training Open Models for Agentic Phone Use

    cs.CL · 2026-06 · unverdicted · novelty 6.0

    PhoneBuddy combines real-app and mock-app RL after shared SFT, raising real-phone task success from 36.67% to 45.33% and AndroidWorld from 60.3% to 83.2%.
