DocOS: Towards Proactive Document-Guided Actions in GUI Agents

HaiFeng Wang; Jiahong Wu; Jingjing Liu; Kehai Chen; Yuhang Guo; Yunhong Wang; Zeming Liu; Zihao Cheng; Ziye Huang

arxiv: 2605.18048 · v1 · pith:ULIWWOAInew · submitted 2026-05-18 · 💻 cs.AI


Pith reviewed 2026-05-20 11:09 UTC · model grok-4.3


keywords: GUI agents, document-guided actions, proactive search, DocOS benchmark, long-tailed tasks, action grounding, web interaction


The pith

GUI agents can handle uncommon tasks by searching for and using online documentation to guide their actions.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

GUI agents automate clicks and interactions on screens but mostly rely on what they learned during training. This fails for rare or specific tasks that need step-by-step instructions not in their knowledge. The paper introduces Proactive Document-Guided Action, where agents actively search the web for documentation, understand the procedures, and apply them as precise GUI operations. They built the DocOS benchmark to measure how well agents do this in live browser settings. Tests show agents have difficulty finding useful docs and accurately translating instructions into actions, indicating that external documents are essential for agents that can learn and adapt on their own.

Core claim

The paper claims that progress in GUI agents for long-tailed tasks is limited by difficulties in locating relevant documentation during proactive searches and in grounding those instructions into accurate executable actions, and that document-guided interaction offers a key way to create self-evolving agents in changing environments.

What carries the argument

The Proactive Document-Guided Action paradigm, which allows agents to autonomously search for, comprehend, and execute instructions from online documentation, evaluated using the DocOS benchmark in fully interactive web environments.
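
The two phases named in the paper (Proactive Knowledge Retrieval, then Document-Grounded Execution) can be pictured as a simple control loop. The following is a minimal sketch only, not the authors' implementation: the `Document` and `Trajectory` types, the `search` and `ground` callables, the toy corpus, and the example URL are all hypothetical names introduced for illustration.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class Document:
    url: str
    steps: List[str]  # procedural instructions extracted from the page


@dataclass
class Trajectory:
    document: Optional[Document] = None
    actions: List[str] = field(default_factory=list)


def run_episode(
    task: str,
    search: Callable[[str], Optional[Document]],
    ground: Callable[[str], str],
) -> Trajectory:
    """Run one hypothetical episode: retrieve a document, then execute it."""
    traj = Trajectory()
    # Phase 1: Proactive Knowledge Retrieval.
    traj.document = search(task)
    if traj.document is None:
        # No document found, so there is nothing to ground (retrieval failure).
        return traj
    # Phase 2: Document-Grounded Execution, one GUI action per instruction.
    for step in traj.document.steps:
        traj.actions.append(ground(step))
    return traj


# Toy stand-ins (assumptions): a one-entry corpus and a string-based grounder.
def toy_search(task: str) -> Optional[Document]:
    corpus = {
        "enable dark mode": Document(
            url="https://example.com/docs/dark-mode",  # placeholder URL
            steps=["Open Settings", "Click Appearance", "Select Dark"],
        )
    }
    return corpus.get(task.lower())


def toy_ground(step: str) -> str:
    verb, _, target = step.partition(" ")
    return f"{verb.upper()}({target!r})"
```

In a real agent both callables would be model-driven and would operate on a live browser; the sketch only shows where each of the two reported bottlenecks would surface in the loop.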

If this is right

GUI agents will be able to manage tasks requiring specific procedural knowledge by accessing external sources.
Advances will require better methods for information retrieval and precise action mapping from text.
Self-evolving agents become possible in dynamic settings through continuous document use.
This shifts away from pure trial-and-error exploration toward informed decision making.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

This could enable agents to work with newly released software by finding updated manuals online.
Similar methods might help other AI systems that need to perform actions based on external knowledge.
Integrating this with search engines could create more robust interactive assistants.

Load-bearing premise

Relevant and accurate documentation is available online for long-tailed tasks, and agents can find it and convert the instructions into correct actions without errors or loss of context.

What would settle it

If agents fail to complete DocOS tasks even when provided with the exact relevant documentation in advance, this would show that the grounding step does not work reliably.
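
One way to read this test is as a comparison between two settings: the agent searches on its own, or the exact document is supplied in advance. The sketch below is an illustrative diagnostic under that assumption; the `Result` fields and the names `retrieval_gap` and `grounding_gap` are hypothetical and do not come from the paper.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Result:
    task_id: str
    passed_search: bool  # agent had to find the documentation itself
    passed_oracle: bool  # exact documentation was supplied in advance


def diagnose(results: List[Result]) -> Dict[str, float]:
    """Split the overall shortfall into a retrieval part and a grounding part."""
    n = len(results)
    if n == 0:
        raise ValueError("no results to diagnose")
    search_rate = sum(r.passed_search for r in results) / n
    oracle_rate = sum(r.passed_oracle for r in results) / n
    return {
        "pass_rate_search": search_rate,
        "pass_rate_oracle": oracle_rate,
        # Pass rate recovered by handing the agent the right document.
        "retrieval_gap": oracle_rate - search_rate,
        # Failures that remain even with the right document in hand.
        "grounding_gap": 1.0 - oracle_rate,
    }
```

A large `grounding_gap` would correspond to the outcome described above: agents failing even when retrieval is taken out of the picture.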

Figures

Figures reproduced from arXiv: 2605.18048 by HaiFeng Wang, Jiahong Wu, Jingjing Liu, Kehai Chen, Yuhang Guo, Yunhong Wang, Zeming Liu, Zihao Cheng, Ziye Huang.

**Figure 1.** An example of a Proactive Document-Guided Action task. The workflow is divided into two distinct phases: Proactive Knowledge Retrieval (top row) and Document-Grounded Execution (bottom row).

**Figure 2.** The data construction pipeline of DocOS consists of 3 stages. Stage 1: Task Construction involves defining task instructions, prerequisites, and difficulty levels (Easy, Medium, Hard) based on execution steps. Stage 2: Document Collection utilizes an automated crawler to retrieve and parse official documentation, extracting structured raw information (e.g., instructions, headers). Stage 3: Task Filtering i…

**Figure 3.** Pass rates of six different GUI agents across varying task steps under two settings: with (w/) and without (w/o) provided documents. The x-axis represents the number of steps required to complete the task, and the y-axis represents the pass rate.

**Figure 4.** A sample of Anki.

**Figure 5.** A sample of Blender.

**Figure 6.** A sample of Element.

**Figure 7.** A sample of Godot.

**Figure 8.** A sample of Grafana.

**Figure 9.** A sample of Idea.

**Figure 10.** A sample of Netbeans.

**Figure 11.** A sample of Notepad++.

**Figure 12.** A sample of Odoo.

**Figure 13.** A sample of Postman.

**Figure 14.** A sample of PyCharm.

**Figure 15.** A sample of VSCode.

**Figure 16.** A sample of Zotero.

**Figure 17.** A sample of Zulip.

**Figure 18.** An error case of imprecise localization.

**Figure 19.** An error case of non-official reference.

**Figure 20.** An error case of execute before retrieval.

**Figure 21.** An error case of action grounding failure.

**Figure 22.** An error case of context misidentification.

**Figure 23.** Prompt for GUI Content Classification. Prompt text (truncated in the source): "You are an assistant that classifies content based on specific criteria. Your task is to evaluate whether a given piece of content serves as a tutorial specifically related to graphical user interfaces (GUI), such as web applications, desktop applications, or operating systems. The content qualifies as a GUI-related tutorial…"
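
The classification prompt in Figure 23 suggests a filter that asks a language model whether crawled content is a GUI tutorial. A minimal sketch of such a filter follows, under loud assumptions: only the opening of the prompt is visible in the source, so the qualifying criteria are not reconstructed here; the YES/NO answer format, the `build_prompt` and `is_gui_tutorial` names, and the pluggable `llm` callable are all hypothetical.

```python
from typing import Callable

# Opening of the prompt as quoted in the Figure 23 caption. The remaining
# criteria are truncated in the source and are deliberately left out.
PROMPT_HEADER = (
    "You are an assistant that classifies content based on specific criteria. "
    "Your task is to evaluate whether a given piece of content serves as a "
    "tutorial specifically related to graphical user interfaces (GUI), such as "
    "web applications, desktop applications, or operating systems."
)


def build_prompt(content: str) -> str:
    # The YES/NO answer format is an assumption made for this sketch.
    return (
        f"{PROMPT_HEADER}\n\n"
        "Answer with exactly YES or NO.\n\n"
        f"Content:\n{content}"
    )


def is_gui_tutorial(content: str, llm: Callable[[str], str]) -> bool:
    """Return True if the model labels the content a GUI tutorial."""
    reply = llm(build_prompt(content)).strip().upper()
    if reply not in {"YES", "NO"}:
        # Fail loudly instead of guessing at a malformed reply.
        raise ValueError(f"unexpected reply: {reply!r}")
    return reply == "YES"
```

Keeping the model behind a plain callable makes the filter testable with a stub and independent of any particular model API.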

Original abstract

While Graphical User Interface (GUI) agents have shown promising performance in automated device interaction, they primarily depend on static parametric knowledge from pre-training or instruction tuning. This reliance fundamentally limits their ability to handle long-tailed tasks that require explicit procedural knowledge absent from model parameters, often forcing agents to resort to inefficient and brittle trial-and-error exploration. To mitigate this limitation, we introduce Proactive Document-Guided Action for GUI agents in dynamic, open-web environments, a novel paradigm that mirrors human problem-solving by enabling agents to autonomously search for relevant documentation to resolve long-tailed tasks. To evaluate agents' capability in this paradigm, we propose DocOS, a benchmark designed to assess document-guided problem solving in fully interactive environments. DocOS requires agents to autonomously navigate a web browser, locate relevant online documentation, comprehend procedural instructions, and faithfully ground them into executable GUI actions. Extensive experiments reveal that progress is strictly constrained by dual bottlenecks: agents struggle to reliably locate relevant information during proactive search and frequently fail to faithfully ground retrieved instructions into precise actions, pointing toward document-guided interaction as a crucial pathway for enabling self-evolving GUI agents in dynamic environments.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

Having GUI agents search for online docs on rare tasks is a reasonable extension, but the dual-bottleneck conclusion depends on docs actually being available and clear.


The main point here is that the authors want GUI agents to stop guessing on long-tailed tasks and instead go find relevant online documentation, then turn what they read into exact actions. They formalize this as Proactive Document-Guided Action and release DocOS, a benchmark that runs the whole loop inside a live browser: navigate, locate docs, comprehend the steps, and ground them without extra hints. This moves past the usual static instruction setups that dominate the area and tests something closer to how people actually solve unfamiliar interface problems.

Their reported runs show agents often miss the right documents during search and then make grounding mistakes even when the text is in front of them, which they take as evidence that document interaction is a necessary route for more capable agents. That framing is straightforward and the interactive benchmark is a clear addition to existing evaluation practice. The work stays empirical with no equations or formal claims, and the citations track standard GUI-agent references without obvious omissions.

The softer part is the load-bearing assumption that accurate, sufficiently detailed documentation exists online for the tasks in the benchmark and that agents can retrieve and apply it without major context loss. If many long-tailed cases simply lack usable guides, the observed failures could trace more to information scarcity than to search or grounding limits in the agents themselves. Without fuller tables on retrieval success, document quality checks, or error breakdowns, it is difficult to separate those factors cleanly.

This is aimed at groups working on practical web automation and agent generalization. Readers who care about external knowledge in dynamic environments will find the benchmark setup useful. It is grounded enough on motivation and evaluation design to merit peer review, though the results section will probably need tighter controls on the documentation premise.

Referee Report

1 major / 1 minor

Summary. The manuscript introduces Proactive Document-Guided Action as a paradigm for GUI agents operating in dynamic open-web environments. It proposes the DocOS benchmark, which requires agents to autonomously navigate browsers, locate relevant online documentation for long-tailed tasks, comprehend procedural instructions, and ground them into precise executable GUI actions. Experiments identify two primary bottlenecks—unreliable proactive search for information and failures in faithfully translating retrieved instructions into actions—and conclude that document-guided interaction is a crucial pathway for self-evolving GUI agents.

Significance. If the empirical findings hold under closer scrutiny, the work provides a concrete benchmark and diagnostic for limitations in current GUI agents that rely solely on parametric knowledge or trial-and-error. By framing document search and grounding as central challenges, it opens a research direction that could improve adaptability on tasks absent from training data, with potential value for reproducible evaluation in interactive agent settings.

major comments (1)

[Benchmark construction] Benchmark construction and task selection (likely §3): The claim that progress is 'strictly constrained by dual bottlenecks' in proactive search and instruction grounding is load-bearing on the premise that accurate, sufficiently detailed, and relevant online documentation exists for the chosen long-tailed tasks and can be directly mapped to executable actions. The manuscript does not appear to describe a verification step confirming document availability, completeness, or lack of ambiguity for each task; without this, search failures may reflect data absence rather than agent capability, weakening the interpretation that document-guided interaction is the key enabling pathway.

minor comments (1)

[Abstract] Abstract and experimental reporting: The abstract states that 'extensive experiments reveal' the bottlenecks but provides no quantitative metrics, baseline comparisons, or error breakdowns; adding a concise summary of key performance numbers and failure categorizations would improve clarity without altering the core contribution.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive feedback and positive assessment of the significance of DocOS. We address the single major comment below and have revised the manuscript to strengthen the benchmark description.


Referee: [Benchmark construction] Benchmark construction and task selection (likely §3): The claim that progress is 'strictly constrained by dual bottlenecks' in proactive search and instruction grounding is load-bearing on the premise that accurate, sufficiently detailed, and relevant online documentation exists for the chosen long-tailed tasks and can be directly mapped to executable actions. The manuscript does not appear to describe a verification step confirming document availability, completeness, or lack of ambiguity for each task; without this, search failures may reflect data absence rather than agent capability, weakening the interpretation that document-guided interaction is the key enabling pathway.

Authors: We agree that an explicit description of the verification process is necessary to support the interpretation of the dual bottlenecks. In the revised manuscript, we have added Section 3.2 (Task Curation and Documentation Verification) that details the following procedure: (1) We first identified long-tailed tasks from real-world usage logs and forums that are unlikely to be covered in model pre-training data. (2) For each candidate task, two authors independently searched for official or authoritative online documentation (e.g., vendor support pages, step-by-step tutorials). (3) We verified that the retrieved documents contain accurate, sufficiently detailed procedural instructions that can be unambiguously mapped to a finite sequence of GUI actions (clicks, typing, navigation) without requiring external knowledge or trial-and-error. Tasks lacking such documentation or containing irresolvable ambiguities were excluded from the final benchmark. This verification step ensures that observed agent failures in proactive search and instruction grounding are attributable to model limitations rather than missing or inadequate source material, thereby reinforcing our claim that document-guided interaction is a crucial pathway.

Revision: yes
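
The exclusion rule described in this simulated response can be sketched as a small filter over annotated candidate tasks. This is an illustration under stated assumptions, not the benchmark's actual tooling: the `Annotation` and `CandidateTask` types are hypothetical, and requiring unanimous agreement between the two annotators is an assumption, since the response does not say how disagreements were resolved.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Annotation:
    found_official_doc: bool  # annotator located official or authoritative docs
    steps_unambiguous: bool   # instructions map cleanly to a finite action sequence


@dataclass
class CandidateTask:
    task_id: str
    annotations: List[Annotation]  # one entry per independent annotator


def keep_task(task: CandidateTask, min_annotators: int = 2) -> bool:
    """Keep a task only if every annotator found usable documentation."""
    if len(task.annotations) < min_annotators:
        return False
    # Unanimity is an assumption of this sketch, not a detail from the source.
    return all(
        a.found_official_doc and a.steps_unambiguous for a in task.annotations
    )


def filter_tasks(tasks: List[CandidateTask]) -> List[CandidateTask]:
    return [t for t in tasks if keep_task(t)]
```

Under this rule a task drops out as soon as one annotator fails to find documentation or flags an ambiguity, which is the conservative reading of the procedure.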

Circularity Check

0 steps flagged

Empirical benchmark proposal with no derivations or self-referential reductions


The paper introduces the DocOS benchmark and Proactive Document-Guided Action paradigm for GUI agents, then reports experimental results on agent performance in search and grounding tasks. No equations, parameter fitting, or mathematical derivations are present. Conclusions about dual bottlenecks follow directly from observed empirical failures on the new benchmark tasks rather than reducing to any self-definition, fitted input renamed as prediction, or load-bearing self-citation chain. The work is self-contained as an empirical evaluation of a proposed interaction paradigm against external agent baselines.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on standard assumptions from the GUI agent literature plus the domain assumption that online documentation is available and usable for long-tailed tasks.

axioms (1)

domain assumption: GUI agents can improve on long-tailed tasks by retrieving and grounding external procedural documentation
Core premise of the proposed paradigm stated in the abstract.



Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction
Relation between the paper passage and the cited Recognition theorem: unclear
Paper passage: "We formalize the interaction ... as a Partially Observable Markov Decision Process (POMDP) ... Proactive Knowledge Retrieval ... Document-Grounded Execution"

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel
Relation between the paper passage and the cited Recognition theorem: unclear
Paper passage: "Extensive experiments reveal that progress is strictly constrained by dual bottlenecks: agents struggle to reliably locate relevant information during proactive search and frequently fail to faithfully ground retrieved instructions into precise actions"

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

75 extracted references · 75 canonical work pages · 5 internal anchors

[1]

European Conference on Computer Vision , pages=

Omniact: A dataset and benchmark for enabling multimodal generalist autonomous agents for desktop and web , author=. European Conference on Computer Vision , pages=. 2024 , organization=

work page 2024
[2]

Advances in Neural Information Processing Systems , volume=

Osworld: Benchmarking multimodal agents for open-ended tasks in real computer environments , author=. Advances in Neural Information Processing Systems , volume=

work page
[4]

arXiv preprint arXiv:2404.05955 , year=

Visualwebbench: How far have multimodal llms evolved in web page understanding and grounding? , author=. arXiv preprint arXiv:2404.05955 , year=

work page arXiv
[5]

NeurIPS , year=

VideoGUI: A Benchmark for GUI Automation from Instructional Videos , author=. NeurIPS , year=

work page
[7]

Proceedings of the 5th Annual International Conference on Mobile and Ubiquitous Systems: Computing, Networking, and Services , pages=

ScreenSpot: Multidimensional resource discovery for distributed applications in smart spaces , author=. Proceedings of the 5th Annual International Conference on Mobile and Ubiquitous Systems: Computing, Networking, and Services , pages=

work page
[8]

Findings of the Association for Computational Linguistics: ACL 2025 , pages=

Crab: Cross-environment agent benchmark for multimodal language model agents , author=. Findings of the Association for Computational Linguistics: ACL 2025 , pages=

work page 2025
[9]

WebShop: Towards Scalable Real-World Web Interaction with Grounded Language Agents , booktitle =

Shunyu Yao and Howard Chen and John Yang and Karthik Narasimhan , editor =. WebShop: Towards Scalable Real-World Web Interaction with Grounded Language Agents , booktitle =. 2022 , timestamp =

work page 2022
[10]

2023 , eprint=

Mind2Web: Towards a Generalist Agent for the Web , author=. 2023 , eprint=

work page 2023
[11]

CoRR , volume =

Luyuan Wang and Yongyu Deng and Yiwei Zha and Guodong Mao and Qinmin Wang and Tianchen Min and Wei Chen and Shoufa Chen , title =. CoRR , volume =. 2024 , doi =. 2406.08184 , timestamp =

work page arXiv 2024
[12]

2023 , eprint=

AppAgent: Multimodal Agents as Smartphone Users , author=. 2023 , eprint=

work page 2023
[14]

Findings of the Association for Computational Linguistics: ACL 2025 , pages=

Gui agents: A survey , author=. Findings of the Association for Computational Linguistics: ACL 2025 , pages=

work page 2025
[15]

National Science Review , volume=

A survey on multimodal large language models , author=. National Science Review , volume=. 2024 , publisher=

work page 2024
[16]

The Twelfth International Conference on Learning Representations , year=

WebArena: A Realistic Web Environment for Building Autonomous Agents , author=. The Twelfth International Conference on Learning Representations , year=

work page
[17]

Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing , pages=

Retrieval-augmented GUI Agents with Generative Guidelines , author=. Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing , pages=

work page 2025
[20]

Retrieval-Augmented Generation for Large Language Models: A Survey

Retrieval-augmented generation for large language models: A survey , author=. arXiv preprint arXiv:2312.10997 , volume=

work page internal anchor Pith review Pith/arXiv arXiv
[22]

2025 , eprint=

Qwen3-VL Technical Report , author=. 2025 , eprint=

work page 2025
[25]

Aguvis: Unified Pure Vision Agents for Autonomous GUI Interaction , author=

work page
[27]

The Thirty-ninth Annual Conference on Neural Information Processing Systems Datasets and Benchmarks Track , year=

Mind2Web 2: Evaluating Agentic Search with Agent-as-a-Judge , author=. The Thirty-ninth Annual Conference on Neural Information Processing Systems Datasets and Benchmarks Track , year=

work page
[28]

2025 , month=sep, day=

Introducing Claude Sonnet 4.5 , author=. 2025 , month=sep, day=

work page 2025
[29]

2024 , eprint=

OS-ATLAS: A Foundation Action Model for Generalist GUI Agents , author=. 2024 , eprint=

work page 2024
[30]

2025 , eprint=

Mano Technical Report , author=. 2025 , eprint=

work page 2025
[31]

2025 , eprint=

OpenCUA: Open Foundations for Computer-Use Agents , author=. 2025 , eprint=

work page 2025
[32]

2024 , eprint=

CogAgent: A Visual Language Model for GUI Agents , author=. 2024 , eprint=

work page 2024
[33]

2024 , eprint=

OS-Copilot: Towards Generalist Computer Agents with Self-Improvement , author=. 2024 , eprint=

work page 2024
[34]

2023 , eprint=

Android in the Wild: A Large-Scale Dataset for Android Device Control , author=. 2023 , eprint=

work page 2023
[35]

ScreenSpot-Pro:

Kaixin Li and Meng Ziyang and Hongzhan Lin and Ziyang Luo and Yuchen Tian and Jing Ma and Zhiyong Huang and Tat-Seng Chua , booktitle=. ScreenSpot-Pro:

work page
[36]

2024 , eprint=

Spider2-V: How Far Are Multimodal Agents From Automating Data Science and Engineering Workflows? , author=. 2024 , eprint=

work page 2024
[39]

2025 , eprint=

TongUI: Internet-Scale Trajectories from Multimodal Web Tutorials for Generalized GUI Agents , author=. 2025 , eprint=

work page 2025
[49]

2026 , //url=

Zhenyu Li and Xuefeng Bai and YUNFEI LONG and Kehai Chen and Yaoyin Zhang and Xuchen Wei and Juntao Li and Min Zhang , booktitle=. 2026 , //url=

work page 2026
[50]

Qwen3-vl technical report, 2025

Bai, S., Cai, Y., Chen, R., Chen, K., Chen, X., Cheng, Z., Deng, L., Ding, W., Gao, C., Ge, C., Ge, W., Guo, Z., Huang, Q., Huang, J., Huang, F., Hui, B., Jiang, S., Li, Z., Li, M., Li, M., Li, K., Lin, Z., Lin, J., Liu, X., Liu, J., Liu, C., Liu, Y., Liu, D., Liu, S., Lu, D., Luo, R., Lv, C., Men, R., Meng, L., Ren, X., Ren, X., Song, S., Sun, Y., Tang, ...

work page 2025
[51]

What limits virtual agent application? omnibench: A scalable multi-dimensional benchmark for essential virtual agent capabilities

Bu, W., Wu, Y., Yu, Q., Gao, M., Miao, B., Zhang, Z., Pan, K., Li, Y., Li, M., Ji, W., et al. What limits virtual agent application? omnibench: A scalable multi-dimensional benchmark for essential virtual agent capabilities. arXiv preprint arXiv:2506.08933, 2025

work page arXiv 2025
[52]

Spider2-v: How far are multimodal agents from automating data science and engineering workflows?, 2024

Cao, R., Lei, F., Wu, H., Chen, J., Fu, Y., Gao, H., Xiong, X., Zhang, H., Mao, Y., Hu, W., Xie, T., Xu, H., Zhang, D., Wang, S., Sun, R., Yin, P., Xiong, C., Ni, A., Liu, Q., Zhong, V., Chen, L., Yu, K., and Yu, T. Spider2-v: How far are multimodal agents from automating data science and engineering workflows?, 2024

work page 2024
[53]

Benchmarking LLM s for translating classical C hinese poetry: Evaluating adequacy, fluency, and elegance

Chen, A., Lou, L., Chen, K., Bai, X., Xiang, Y., Yang, M., Zhao, T., and Zhang, M. Benchmarking LLM s for translating classical C hinese poetry: Evaluating adequacy, fluency, and elegance. In Christodoulopoulos, C., Chakraborty, T., Rose, C., and Peng, V. (eds.), Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pp.\ ...

work page doi:10.18653/v1/2025.emnlp-main.1678 2025
[54]

Gui-world: A video benchmark and dataset for multimodal gui-oriented understanding

Chen, D., Huang, Y., Wu, S., Tang, J., Chen, L., Bai, Y., He, Z., Wang, C., Zhou, H., Li, Y., et al. Gui-world: A video benchmark and dataset for multimodal gui-oriented understanding. arXiv preprint arXiv:2406.10819, 2024

work page arXiv 2024
[55]

Mind2web: Towards a generalist agent for the web, 2023

Deng, X., Gu, Y., Zheng, B., Chen, S., Stevens, S., Wang, B., Sun, H., and Su, Y. Mind2web: Towards a generalist agent for the web, 2023

work page 2023
[56]

Mano technical report, 2025

Fu, T., Su, A., Zhao, C., Wang, H., Wu, M., Yu, Z., Hu, F., Shi, M., Dong, W., Wang, J., Chen, Y., Yu, R., Peng, S., Li, M., Huang, N., Wei, H., Yu, J., Xin, Y., Zhao, X., Gu, K., Jiang, P., Zhou, S., and Wang, S. Mano technical report, 2025

work page 2025
[57]

H., Gutierrez, B

Gou, B., Huang, Z., Ning, Y., Gu, Y., Lin, M., Yu, B., Kopanev, A., Qi, W., Shu, Y., Wu, J., Song, C. H., Gutierrez, B. J., Li, Y., Liao, Z., Moussa, H. N., ZHANG, T., Xie, J., Xue, T., Chen, S., Zheng, B., Zhang, K., Cai, Z., Rozgic, V., Ziyadi, M., Sun, H., and Su, Y. Mind2web 2: Evaluating agentic search with agent-as-a-judge. In The Thirty-ninth Annua...

work page 2025
[58]

Cogagent: A visual language model for gui agents, 2024

Hong, W., Wang, W., Lv, Q., Xu, J., Yu, W., Ji, J., Wang, Y., Wang, Z., Zhang, Y., Li, J., Xu, B., Dong, Y., Ding, M., and Tang, J. Cogagent: A visual language model for gui agents, 2024

work page 2024
[59]

Screenspot: Multidimensional resource discovery for distributed applications in smart spaces

Jurmu, M., Boring, S., and Riekki, J. Screenspot: Multidimensional resource discovery for distributed applications in smart spaces. In Proceedings of the 5th Annual International Conference on Mobile and Ubiquitous Systems: Computing, Networking, and Services, pp.\ 1--9, 2008

work page 2008
[60]

P., Russak, M., Koh, J

Kapoor, R., Butala, Y. P., Russak, M., Koh, J. Y., Kamble, K., AlShikh, W., and Salakhutdinov, R. Omniact: A dataset and benchmark for enabling multimodal generalist autonomous agents for desktop and web. In European Conference on Computer Vision, pp.\ 161--178. Springer, 2024

work page 2024
[61]

VisualWebArena: Evaluating Multimodal Agents on Realistic Visual Web Tasks

Koh, J. Y., Lo, R., Jang, L., Duvvur, V., Lim, M. C., Huang, P.-Y., Neubig, G., Zhou, S., Salakhutdinov, R., and Fried, D. Visualwebarena: Evaluating multimodal agents on realistic visual web tasks. arXiv preprint arXiv:2401.13649, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[62]

Screenspot-pro: GUI grounding for professional high-resolution computer use

Li, K., Ziyang, M., Lin, H., Luo, Z., Tian, Y., Ma, J., Huang, Z., and Chua, T.-S. Screenspot-pro: GUI grounding for professional high-resolution computer use. In Workshop on Reasoning and Planning for Large Language Models, 2025

work page 2025
[63]

End-to-end speech translation with adversarial training

Li, X., Kehai, C., Zhao, T., and Yang, M. End-to-end speech translation with adversarial training. In Wu, H., Cherry, C., Huang, L., He, Z., Liberman, M., Cross, J., and Liu, Y. (eds.), Proceedings of the First Workshop on Automatic Simultaneous Translation, pp.\ 10--14, Seattle, Washington, July 2020. Association for Computational Linguistics. doi:10.186...

work page doi:10.18653/v1/2020.autosimtrans-1.2 2020
[64]

XIFB ench: Evaluating large language models on multilingual instruction following

Li, Z., Bai, X., LONG, Y., Chen, K., Zhang, Y., Wei, X., Li, J., and Zhang, M. XIFB ench: Evaluating large language models on multilingual instruction following. In The Thirty-ninth Annual Conference on Neural Information Processing Systems Datasets and Benchmarks Track, 2026

work page 2026
[65]

Tool learning via inference-time scaling and cycle verifier

Liang, X., Xie, W., Li, J., Wang, W., Chen, Y., Chen, K., and Zhang, M. Tool learning via inference-time scaling and cycle verifier. In Che, W., Nabende, J., Shutova, E., and Pilehvar, M. T. (eds.), Findings of the Association for Computational Linguistics: ACL 2025, pp.\ 24658--24671, Vienna, Austria, July 2025. Association for Computational Linguistics....

work page doi:10.18653/v1/2025.findings-acl.1266 2025
[66]

Q., Li, L., Gao, D., Wu, Q., Yan, M., Yang, Z., Wang, L., and Shou, M

Lin, K. Q., Li, L., Gao, D., Wu, Q., Yan, M., Yang, Z., Wang, L., and Shou, M. Z. Videogui: A benchmark for gui automation from instructional videos. In NeurIPS, 2024

work page 2024
[67]

L., Sun, J., Wang, J., et al

Liu, X., Qin, B., Liang, D., Dong, G., Lai, H., Zhang, H., Zhao, H., Iong, I. L., Sun, J., Wang, J., et al. Autoglm: Autonomous foundation agents for guis. arXiv preprint arXiv:2411.00820, 2024

work page arXiv 2024
[68]

Towards conversational recommendation over multi-type dialogs

Liu, Z., Wang, H., Niu, Z.-Y., Wu, H., Che, W., and Liu, T. Towards conversational recommendation over multi-type dialogs. In Jurafsky, D., Chai, J., Schluter, N., and Tetreault, J. (eds.), Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pp.\ 1036--1049, Online, July 2020. Association for Computational Linguistics....

work page doi:10.18653/v1/2020.acl-main.98 2020
[69]

Liu, Z., Wang, H., Niu, Z.-Y., Wu, H., and Che, W. DuRecDial 2.0: A bilingual parallel corpus for conversational recommendation. In Moens, M.-F., Huang, X., Specia, L., and Yih, S. W.-t. (eds.), Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pp. 4335–4347, Online and Punta Cana, Dominican Republic, November ... doi:10.18653/v1/2021.emnlp-main.356

[70]

Liu, Z., Xu, J., Lei, Z., Wang, H., Niu, Z.-Y., and Wu, H. Where to go for the holidays: Towards mixed-type dialogs for clarification of user goals. In Muresan, S., Nakov, P., and Villavicencio, A. (eds.), Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 1024–1034, Dublin, Ireland, May ... doi:10.18653/v1/2022.acl-long.73

[71]

Liu, Z., Bai, X., Chen, K., Chen, X., Li, X., Xiang, Y., Liu, J., Li, H.-D., Wang, Y., Nie, L., and Zhang, M. A survey on the feedback mechanism of LLM-based AI agents. In Kwok, J. (ed.), Proceedings of the Thirty-Fourth International Joint Conference on Artificial Intelligence, IJCAI-25, pp. 10582–10592. International Joint Conferences on Artificial I... doi:10.24963/ijcai.2025/1175

[72]

Lu, Y., Yu, Q., Wang, H., Liu, Z., Su, W., Liu, Y., Guo, Y., Liang, M., Wang, Y., and Wang, H. TransBench: Breaking barriers for transferable graphical user interface agents in dynamic digital environments. In Che, W., Nabende, J., Shutova, E., and Pilehvar, M. T. (eds.), Findings of the Association for Computational Linguistics: ACL 2025, pp. 12464–... doi:10.18653/v1/2025.findings-acl.645

[73]

Nguyen, D., Chen, J., Wang, Y., Wu, G., Park, N., Hu, Z., Lyu, H., Wu, J., Aponte, R., Xia, Y., et al. GUI agents: A survey. In Findings of the Association for Computational Linguistics: ACL 2025, pp. 22522–22538, 2025.

[74]

Pan, Y., Kong, D., Zhou, S., Cui, C., Leng, Y., Jiang, B., Liu, H., Shang, Y., Zhou, S., Wu, T., et al. WebCanvas: Benchmarking web agents in online environments. arXiv preprint arXiv:2406.12373, 2024.

[75]

Qin, Y., Ye, Y., Fang, J., Wang, H., Liang, S., Tian, S., Zhang, J., Li, J., Li, Y., Huang, S., et al. UI-TARS: Pioneering automated GUI interaction with native agents. arXiv preprint arXiv:2501.12326, 2025.

[76]

Rawles, C., Li, A., Rodriguez, D., Riva, O., and Lillicrap, T. Android in the wild: A large-scale dataset for Android device control, 2023.

[77]

Shi, X., Liu, Z., Wang, C., Leng, H., Xue, K., Zhang, X., and Zhang, S. MidMed: Towards mixed-type dialogues for medical consultation. In Rogers, A., Boyd-Graber, J., and Okazaki, N. (eds.), Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 8145–8157, Toronto, Canada, July 2023. Assoc... doi:10.18653/v1/2023.acl-long.453

[78]

Wang, L., Deng, Y., Zha, Y., Mao, G., Wang, Q., Min, T., Chen, W., and Chen, S. MobileAgentBench: An efficient and user-friendly benchmark for mobile LLM agents. CoRR, abs/2406.08184, 2024a. doi:10.48550/ARXIV.2406.08184

[79]

Wang, S., Liu, W., Chen, J., Zhou, Y., Gan, W., Zeng, X., Che, Y., Yu, S., Hao, X., Shao, K., et al. GUI agents with foundation models: A comprehensive survey. arXiv preprint arXiv:2411.04890, 2024b.

[80]

Wang, X., Wang, B., Lu, D., Yang, J., Xie, T., Wang, J., Deng, J., Guo, X., Xu, Y., Wu, C. H., Shen, Z., Li, Z., Li, R., Li, X., Chen, J., Zheng, B., Li, P., Lei, F., Cao, R., Fu, Y., Shin, D., Shin, M., Hu, J., Wang, Y., Chen, J., Ye, Y., Zhang, D., Du, D., Hu, H., Chen, H., Zhou, Z., Yao, H., Chen, Z., Gu, Q., Wang, Y., Wang, H., Yang, D., Zhong, V., Su...

[81]

Wu, Z., Han, C., Ding, Z., Weng, Z., Liu, Z., Yao, S., Yu, T., and Kong, L. OS-Copilot: Towards generalist computer agents with self-improvement, 2024a.

[82]

Wu, Z., Wu, Z., Xu, F., Wang, Y., Sun, Q., Jia, C., Cheng, K., Ding, Z., Chen, L., Liang, P. P., and Qiao, Y. OS-ATLAS: A foundation action model for generalist GUI agents, 2024b.

[83]

Xie, B., Shao, R., Chen, G., Zhou, K., Li, Y., Liu, J., Zhang, M., and Nie, L. GUI-explorer: Autonomous exploration and mining of transition-aware knowledge for GUI agent. In Che, W., Nabende, J., Shutova, E., and Pilehvar, M. T. (eds.), Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. ... doi:10.18653/v1/2025.acl-long.282

[84]

Xie, T., Zhang, D., Chen, J., Li, X., Zhao, S., Cao, R., Hua, T. J., Cheng, Z., Shin, D., Lei, F., et al. OSWorld: Benchmarking multimodal agents for open-ended tasks in real computer environments. Advances in Neural Information Processing Systems, 37:52040–52094, 2024.

[85]

Xu, R., Ma, K., Yu, W., Zhang, H., Ho, J. C., Yang, C., and Yu, D. Retrieval-augmented GUI agents with generative guidelines. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pp. 17877–17886, 2025a.

[86]

Xu, T., Chen, L., Wu, D.-J., Chen, Y., Zhang, Z., Yao, X., Xie, Z., Chen, Y., Liu, S., Qian, B., et al. Crab: Cross-environment agent benchmark for multimodal language model agents. In Findings of the Association for Computational Linguistics: ACL 2025, pp. 21607–21647, 2025b.

[87]

Xu, Y., Wang, Z., Wang, J., Lu, D., Xie, T., Saha, A., Sahoo, D., Yu, T., and Xiong, C. Aguvis: Unified pure vision agents for autonomous GUI interaction. In Forty-second International Conference on Machine Learning.

[88]

Yan, H., Wang, J., Huang, X., Shen, Y., Meng, Z., Fan, Z., Tan, K., Gao, J., Shi, L., Yang, M., et al. Step-GUI technical report. arXiv preprint arXiv:2512.15431, 2025.

[89]

Yao, S., Chen, H., Yang, J., and Narasimhan, K. WebShop: Towards scalable real-world web interaction with grounded language agents. In Koyejo, S., Mohamed, S., Agarwal, A., Belgrave, D., Cho, K., and Oh, A. (eds.), Advances in Neural Information Processing Systems 35: Annual Conference on Neural Information Processing Systems 2022, NeurIPS 2022, New Orlea...

[90]

Ye, J., Zhang, X., Xu, H., Liu, H., Wang, J., Zhu, Z., Zheng, Z., Gao, F., Cao, J., Lu, Z., et al. Mobile-Agent-v3: Fundamental agents for GUI automation. arXiv preprint arXiv:2508.15144, 2025.

[91]

Yin, S., Fu, C., Zhao, S., Li, K., Sun, X., Xu, T., and Chen, E. A survey on multimodal large language models. National Science Review, 11(12):nwae403, 2024.

[92]

Zhang, B., Shang, Z., Gao, Z., Zhang, W., Xie, R., Ma, X., Yuan, T., Wu, X., Zhu, S.-C., and Li, Q. TongUI: Internet-scale trajectories from multimodal web tutorials for generalized GUI agents, 2025.

[93]

Zhang, C., Yang, Z., Liu, J., Han, Y., Chen, X., Huang, Z., Fu, B., and Yu, G. AppAgent: Multimodal agents as smartphone users, 2023.

[94]

Zhou, H., Zhang, X., Tong, P., Zhang, J., Chen, L., Kong, Q., Cai, C., Liu, C., Wang, Y., Zhou, J., et al. MAI-UI technical report: Real-world centric foundation GUI agents. arXiv preprint arXiv:2512.22047, 2025.

[95]

Zhou, S., Xu, F. F., Zhu, H., Zhou, X., Lo, R., Sridhar, A., Cheng, X., Ou, T., Bisk, Y., Fried, D., et al. WebArena: A realistic web environment for building autonomous agents. In The Twelfth International Conference on Learning Representations, 2024.
