CLI-Anything: Towards Agent-Native Computer Use

Chao Huang; Tianyu Fan; Yuhao Yang

arxiv: 2606.03854 · v1 · pith:QRP44QLUnew · submitted 2026-06-02 · 💻 cs.HC

Pith reviewed 2026-06-28 08:18 UTC · model grok-4.3

classification 💻 cs.HC

keywords: CLI harnesses, agent-native interfaces, GUI agents, computer use agents, programmatic control, machine-readable protocols, AI tool use

The pith

Applications can be transformed into command-line harnesses that let AI agents use structured commands and explicit states instead of visual clicks and screenshots.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper argues that GUI agents for computer use force AI systems to mimic human visual perception, leading to brittle interactions based on pixels, coordinates, and timing. It proposes instead converting existing applications into command-line harnesses that preserve all original functionality while exposing machine-readable protocols. This shift aligns interfaces with agent strengths in programmatic control and deterministic feedback. The authors describe the methodology and introduce CLI-Hub as a platform to support the change. A sympathetic reader would care because it removes the lossy translation between visual interfaces and computational reasoning.

Core claim

CLI-Anything claims that existing applications can be turned into command-line harnesses that preserve full functionality while providing structured commands, explicit state representations, and deterministic feedback. Agents would then operate through precise programmatic control instead of emulating human perceptual limits, which removes the visual-to-computational translation that limits GUI agents.

What carries the argument

Command-line harnesses that preserve application functionality while exposing machine-readable protocols optimized for AI interaction.
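
The paper does not specify a protocol, so the following is only an illustrative sketch of what such a harness could look like, not the authors' design. It wraps a toy in-memory note list (a hypothetical stand-in for a real application) and answers every command with JSON that carries a status flag, the full explicit state, and an error message. The function name `handle`, the commands `add`, `remove`, and `state`, and the reply fields are all assumptions made for illustration.

```python
import json


def handle(argv: list[str], notes: list[str]) -> str:
    """Run one structured command against the toy note list and return JSON.

    Hypothetical protocol: every reply carries "ok", the full explicit
    "state", and an "error" string (None on success).
    """
    error = None
    if len(argv) == 2 and argv[0] == "add":
        notes.append(argv[1])
    elif len(argv) == 2 and argv[0] == "remove" and argv[1].isdigit():
        index = int(argv[1])
        if index < len(notes):
            notes.pop(index)
        else:
            error = f"no note at index {index}"
    elif argv == ["state"]:
        pass  # read-only: only report the current state
    else:
        error = f"unrecognized command: {argv!r}"
    reply = {"ok": error is None, "state": {"notes": list(notes)}, "error": error}
    # sort_keys keeps the serialized reply reproducible for identical state
    return json.dumps(reply, sort_keys=True)


if __name__ == "__main__":
    notes: list[str] = []
    print(handle(["add", "draft report"], notes))
    print(handle(["remove", "5"], notes))
```

In this sketch a failed command leaves the state unchanged and says so explicitly, so an agent never has to infer the outcome from a screenshot. Whether a real application's full functionality can be exposed this way is exactly the premise the paper leaves undemonstrated.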

If this is right

Agents avoid brittle pixel-level interactions, timing dependencies, and coordinate-based actions that break with interface changes.
Agents use structured data processing and programmatic control rather than emulating human perceptual limitations.
Redesign centers on explicit state representations and deterministic execution instead of screen readers or click simulators.
Existing software can be adapted without building new visual interpretation layers.
Computer use paradigms shift to align directly with agent computational strengths.
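
The first consequence in the list above can be illustrated with a toy model. This sketch is an assumption-laden illustration and does not come from the paper: two hypothetical layouts place the same buttons at different coordinates, a click recorded against the old layout lands on the wrong control after the redesign, and a command addressed by name is unaffected.

```python
from typing import Optional

Layout = dict[str, tuple[int, int]]

# Hypothetical layouts: button name -> (x, y) centre. The redesign swaps two buttons.
LAYOUT_V1: Layout = {"save": (100, 40), "delete": (200, 40)}
LAYOUT_V2: Layout = {"delete": (100, 40), "save": (200, 40)}


def click(layout: Layout, x: int, y: int) -> Optional[str]:
    """Return the name of the button whose centre is at (x, y), if any."""
    for name, position in layout.items():
        if position == (x, y):
            return name
    return None


def invoke(layout: Layout, name: str) -> Optional[str]:
    """Address a control by name; coordinates never enter the request."""
    return name if name in layout else None


recorded_click = LAYOUT_V1["save"]  # coordinates captured against the old layout

print(click(LAYOUT_V1, *recorded_click))  # save
print(click(LAYOUT_V2, *recorded_click))  # delete
print(invoke(LAYOUT_V2, "save"))          # save
```

The same recorded coordinates trigger `delete` once the layout changes, while the named command still resolves to `save`. Real GUI agents re-locate elements from screenshots rather than replaying fixed coordinates, so this toy overstates the failure mode; it only shows why a name-based protocol is insensitive to layout.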

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

Developers may need to plan for both human and agent users when maintaining or updating applications.
Converted interfaces could combine with existing CLI tools to expand what agents can reliably accomplish.
Direct comparisons of agent performance on the same tasks before and after conversion would test the approach on specific software.
Widespread use might lower the need for agents to perform expensive visual processing at each step.

Load-bearing premise

It is feasible and practical to convert existing applications into CLI harnesses without losing functionality for human users or adding new limitations for agents.

What would settle it

An implementation on a complex application where the CLI harness version loses core features, requires visual fallbacks, or produces lower agent task success rates than the original GUI.
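
The test described above reduces to three checks, which could be sketched as follows. This is a hypothetical evaluation helper, not a procedure from the paper; the feature names and outcome lists are placeholders invented for illustration and are not measurements.

```python
from dataclasses import dataclass


@dataclass
class Verdict:
    lost_features: set[str]   # GUI features with no harness equivalent
    visual_fallbacks: int     # harness runs that still needed a screenshot
    gui_success: float
    cli_success: float

    @property
    def premise_fails(self) -> bool:
        """True if any of the three failure conditions holds."""
        return (
            bool(self.lost_features)
            or self.visual_fallbacks > 0
            or self.cli_success < self.gui_success
        )


def success_rate(outcomes: list[bool]) -> float:
    """Fraction of tasks that succeeded; an empty list counts as 0.0."""
    return sum(outcomes) / len(outcomes) if outcomes else 0.0


def evaluate(
    gui_features: set[str],
    cli_features: set[str],
    gui_outcomes: list[bool],
    cli_outcomes: list[bool],
    visual_fallbacks: int,
) -> Verdict:
    """Compare the original GUI against its harness on the same task set."""
    return Verdict(
        lost_features=set(gui_features) - set(cli_features),
        visual_fallbacks=visual_fallbacks,
        gui_success=success_rate(gui_outcomes),
        cli_success=success_rate(cli_outcomes),
    )


# Placeholder inputs, not real results.
verdict = evaluate(
    gui_features={"open", "crop", "export"},
    cli_features={"open", "export"},
    gui_outcomes=[True, False, True, True],
    cli_outcomes=[True, True, True, False],
    visual_fallbacks=0,
)
print(sorted(verdict.lost_features), verdict.premise_fails)  # ['crop'] True
```

With these placeholder inputs the success rates tie, yet the premise still fails because one feature has no harness equivalent. A real evaluation would also need a shared task set and enough runs to separate a genuine difference in success rates from noise.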

Original abstract

As large language models advance in reasoning and tool use capabilities, researchers increasingly seek to leverage them for computer use agents that can interact with existing software. The dominant approach develops GUI agents that control applications through visual interfaces: interpreting screenshots, locating UI elements, and executing mouse clicks to mimic human interaction. This GUI-centric paradigm fundamentally misaligns with agent capabilities. Current GUI agents struggle with brittle pixel-level interactions, timing dependencies, and coordinate-based actions that break with interface changes. They force agents to emulate human perceptual limitations rather than leverage their computational strengths in structured data processing and programmatic control. CLI-Anything argues for agent-native computer use design. Instead of forcing agents to navigate visual layouts, we create interfaces aligned with how agents naturally operate: through structured commands, explicit state representations, and deterministic feedback. We transform existing applications into command-line harnesses that preserve functionality while exposing machine-readable protocols optimized for AI-native interaction. This eliminates the lossy visual-to-computational translation that plagues GUI agents. Rather than building sophisticated screen readers and click simulators, we should redesign interaction paradigms around agent strengths: precise programmatic control and deterministic execution. We examine the methodology, architecture, evidence, and future directions for this agent-native transformation of computer use. We have built CLI-Hub as a comprehensive platform that operationalizes this agent-native computer use vision. The platform provides methodology, architecture, and infrastructure for this fundamental transformation of computer use.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

This is a position paper arguing for CLI harnesses over GUI agents but supplies no methodology, examples, or evidence for the claimed lossless transformation.

The main point to take away is that the paper advocates shifting computer-use agents to command-line interfaces instead of visual ones, claiming this removes lossy translation steps, but the argument stays conceptual with nothing shown to support the core feasibility claim.

It does lay out the practical problems with GUI agents clearly enough: brittle pixel interactions, timing issues, and the mismatch with agents' strengths in structured commands. Naming the approach 'agent-native' and introducing CLI-Hub as an operationalizing platform gives the idea a label that might help frame future discussions.

The soft spots are substantial and central. The strongest claim is that existing applications can be converted into CLI harnesses while fully preserving functionality and adding no new limits for agents or humans. Yet the text offers no conversion method, no protocol details, no architecture, and no before-and-after example of any application. The platform is announced but not described in any operational way. Without that, the elimination of visual-to-computational loss remains an assumption rather than a result.

This is for readers already working on agent interfaces in HCI or AI tool-use who want to think through design alternatives. It might prompt useful discussion in a reading group about interface alignment, but it does not deliver new measurements, derivations, or reproducible work.

I would not recommend sending it for serious peer review in its current form. It reads as an extended position statement rather than a paper with the technical or empirical grounding needed for a full review process.

Referee Report

2 major / 0 minor

Summary. The paper argues that GUI agents for computer use misalign with LLM strengths in structured reasoning, proposing instead to transform existing applications into command-line harnesses that expose machine-readable protocols and deterministic state. It introduces CLI-Hub as the platform that operationalizes this agent-native paradigm, claiming to eliminate lossy visual-to-computational translation while preserving full application functionality.

Significance. If the feasibility of lossless, limitation-free app-to-CLI transformation were demonstrated with concrete methodology and examples, the work could meaningfully redirect HCI and agent research toward programmatic interfaces. In its current form the contribution remains conceptual; the absence of any architecture, protocol, worked example, or evaluation means the significance is limited to a position statement rather than a substantiated result.

major comments (2)

[Abstract] The claim that 'we transform existing applications into command-line harnesses that preserve functionality while exposing machine-readable protocols' is presented as accomplished fact, yet the manuscript supplies no conversion methodology, protocol specification, architecture diagram, or before/after example of any application.
[Abstract] The statement that the authors 'have built CLI-Hub as a comprehensive platform that operationalizes this agent-native computer use vision' and 'examine the methodology, architecture, evidence' is unsupported. No such sections, diagrams, or evidence appear, and their absence is load-bearing for the central claim that visual-to-computational loss is eliminated.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. The manuscript is a position paper advocating a paradigm shift toward agent-native interfaces, and we agree the abstract language overstates the work as completed implementation rather than proposal. We will revise accordingly to clarify scope while maintaining the core argument.

Referee: [Abstract] The claim that 'we transform existing applications into command-line harnesses that preserve functionality while exposing machine-readable protocols' is presented as accomplished fact, yet the manuscript supplies no conversion methodology, protocol specification, architecture diagram, or before/after example of any application.

Authors: We accept this observation. The paper advances a conceptual argument for redesigning applications around programmatic protocols rather than providing a concrete conversion toolkit or worked examples. We will revise the abstract to use prospective language (e.g., 'we propose transforming...' and 'we outline a methodology...') to accurately reflect the position-paper nature of the contribution.
revision: yes

Referee: [Abstract] The statement that the authors 'have built CLI-Hub as a comprehensive platform that operationalizes this agent-native computer use vision' and 'examine the methodology, architecture, evidence' is unsupported. No such sections, diagrams, or evidence appear, and their absence is load-bearing for the central claim that visual-to-computational loss is eliminated.

Authors: The referee correctly notes the absence of detailed architecture, protocols, or empirical evidence. The manuscript introduces CLI-Hub at the level of vision and high-level principles. We will revise the abstract to state that we 'introduce the CLI-Hub concept' and 'discuss the methodology and architecture at a conceptual level,' removing unsupported claims of a built platform and examined evidence.
revision: yes

Circularity Check

0 steps flagged

No circularity; conceptual argument lacks derivations or self-referential reductions

The manuscript advances a position that existing applications can be transformed into CLI harnesses preserving functionality and eliminating visual-to-computational loss, operationalized via CLI-Hub. No equations, fitted parameters, predictions, or uniqueness theorems appear in the provided text. The central claim is an assertion of performed transformation rather than a derivation that reduces to its own inputs by construction. No self-citations or ansatzes are invoked in a load-bearing manner. The argument is therefore self-contained as a design proposal.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The paper relies on domain assumptions about agent capabilities and introduces the CLI-Hub platform as a new entity without supporting evidence.

axioms (1)

domain assumption: Agents possess computational strengths in structured data processing and programmatic control that exceed their performance on visual interfaces.
Invoked in the abstract to justify the misalignment of GUI agents.

invented entities (1)

CLI-Hub (no independent evidence)
purpose: Platform providing methodology, architecture, and infrastructure for agent-native computer use.
Introduced as the operationalization of the vision, but without any description of implementation or validation.

pith-pipeline@v0.9.1-grok · 5782 in / 1184 out tokens · 30856 ms · 2026-06-28T08:18:12.655728+00:00 · methodology

Forward citations

Cited by 2 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

TUA-Bench: A Benchmark for General-Purpose Terminal-Use Agents
cs.SE · 2026-06 · unverdicted · novelty 7.0

TUA-Bench provides 120 manually designed terminal tasks across five families with execution-based scoring; the top agent reaches 65.8% success.

PhoneBuddy: Training Open Models for Agentic Phone Use
cs.CL · 2026-06 · unverdicted · novelty 6.0

PhoneBuddy combines real-app and mock-app RL after shared SFT, raising real-phone task success from 36.67% to 45.33% and AndroidWorld from 60.3% to 83.2%.
