
arxiv: 2605.18048 · v1 · pith:ULIWWOAInew · submitted 2026-05-18 · 💻 cs.AI

DocOS: Towards Proactive Document-Guided Actions in GUI Agents

Pith reviewed 2026-05-20 11:09 UTC · model grok-4.3

classification 💻 cs.AI
keywords GUI agents · document-guided actions · proactive search · DocOS benchmark · long-tailed tasks · action grounding · web interaction

The pith

GUI agents can handle uncommon tasks by searching for and using online documentation to guide their actions.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

GUI agents automate clicks and other on-screen interactions, but they mostly rely on what they learned during training. That fails for rare or specific tasks whose step-by-step instructions are not in the model's parameters. The paper introduces Proactive Document-Guided Action, in which agents actively search the web for documentation, work out the procedure it describes, and apply it as precise GUI operations. The authors built the DocOS benchmark to measure how well agents do this in live browser settings. Their tests show that agents have difficulty finding useful documentation and accurately translating instructions into actions; the paper takes this as evidence that external documents are a crucial pathway toward agents that can learn and adapt on their own.

Core claim

The paper claims that progress in GUI agents for long-tailed tasks is limited by difficulties in locating relevant documentation during proactive searches and in grounding those instructions into accurate executable actions, and that document-guided interaction offers a key way to create self-evolving agents in changing environments.

What carries the argument

The Proactive Document-Guided Action paradigm, which allows agents to autonomously search for, comprehend, and execute instructions from online documentation, evaluated using the DocOS benchmark in fully interactive web environments.
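
The two phases named in the paper's Figure 1 (Proactive Knowledge Retrieval, then Document-Grounded Execution) can be sketched as a small control loop. This is a minimal illustration, not the authors' implementation: `Action`, `run_task`, the three callables, and the outcome labels are all hypothetical names introduced here, and the stubs stand in for a real browser and model.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Action:
    """One GUI operation; the fields are illustrative, not the paper's schema."""
    kind: str          # e.g. "click", "type", "navigate"
    target: str        # hypothetical description of the UI element
    text: str = ""     # payload for "type" actions


def run_task(
    task: str,
    search_docs: Callable[[str], Optional[str]],
    ground: Callable[[str, str], List[Action]],
    execute: Callable[[Action], bool],
) -> str:
    """Run one task and report the phase at which it stopped."""
    # Phase 1: proactive knowledge retrieval.
    document = search_docs(task)
    if document is None:
        return "retrieval_failed"

    # Phase 2: document-grounded execution.
    actions = ground(task, document)
    if not actions:
        return "grounding_failed"
    for action in actions:
        if not execute(action):
            return "execution_failed"
    return "success"


# Stub components (assumptions): a dict lookup replaces web search,
# and fixed lambdas replace the model and the GUI.
docs = {"export deck": "1. Open File. 2. Click Export."}
outcome = run_task(
    "export deck",
    search_docs=docs.get,
    ground=lambda task, doc: [Action("click", "File"), Action("click", "Export")],
    execute=lambda action: True,
)
print(outcome)
```

Returning a label per phase, rather than a single pass/fail flag, is a design choice made here so that retrieval failures and grounding failures can be counted separately.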

If this is right

  • GUI agents will be able to manage tasks requiring specific procedural knowledge by accessing external sources.
  • Advances will require better methods for information retrieval and precise action mapping from text.
  • Self-evolving agents become possible in dynamic settings through continuous document use.
  • This shifts away from pure trial-and-error exploration toward informed decision making.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • This could enable agents to work with newly released software by finding updated manuals online.
  • Similar methods might help other AI systems that need to perform actions based on external knowledge.
  • Integrating this with search engines could create more robust interactive assistants.

Load-bearing premise

Relevant and accurate documentation is available online for long-tailed tasks, and agents can find it and convert the instructions into correct actions without errors or loss of context.

What would settle it

If agents fail to complete DocOS tasks even when provided with the exact relevant documentation in advance, this would show that the grounding step does not work reliably.
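
One way to read this test quantitatively is to compare pass rates between a setting where the relevant document is handed to the agent and a setting where the agent must find it. The sketch below assumes a hypothetical record format `(task_id, setting, passed)` and the setting names `oracle_doc` and `proactive`; the records are made up for illustration and are not results from the paper.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple

Record = Tuple[str, str, bool]  # (task_id, setting, passed); hypothetical format


def pass_rate_by_setting(records: Iterable[Record]) -> Dict[str, float]:
    """Return the fraction of passed runs for each evaluation setting."""
    totals: Dict[str, int] = defaultdict(int)
    passed_counts: Dict[str, int] = defaultdict(int)
    for _task_id, setting, passed in records:
        totals[setting] += 1
        passed_counts[setting] += int(passed)
    return {setting: passed_counts[setting] / totals[setting] for setting in totals}


# Made-up records for illustration only; not data from the paper.
records = [
    ("t1", "oracle_doc", True), ("t2", "oracle_doc", True),
    ("t3", "oracle_doc", True), ("t4", "oracle_doc", False),
    ("t1", "proactive", True), ("t2", "proactive", False),
    ("t3", "proactive", False), ("t4", "proactive", False),
]
rates = pass_rate_by_setting(records)

# Failures that remain even with the document in hand point at grounding.
grounding_shortfall = 1.0 - rates["oracle_doc"]
# The extra loss when the agent must search on its own points at retrieval.
retrieval_gap = rates["oracle_doc"] - rates["proactive"]
print(rates, grounding_shortfall, retrieval_gap)
```

Under this reading, a large `grounding_shortfall` would be the outcome the paragraph above describes: agents failing even when the exact documentation is supplied in advance.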

Figures

Figures reproduced from arXiv: 2605.18048 by HaiFeng Wang, Jiahong Wu, Jingjing Liu, Kehai Chen, Yuhang Guo, Yunhong Wang, Zeming Liu, Zihao Cheng, Ziye Huang.

Figure 1: An example of a Proactive Document-Guided Action task. The workflow is divided into two distinct phases: Proactive Knowledge Retrieval (top row) and Document-Grounded Execution (bottom row).
Figure 2: The data construction pipeline of DocOS consists of 3 stages. Stage 1: Task Construction involves defining task instructions, prerequisites, and difficulty levels (Easy, Medium, Hard) based on execution steps. Stage 2: Document Collection utilizes an automated crawler to retrieve and parse official documentation, extracting structured raw information (e.g., instructions, headers). Stage 3: Task Filtering i…
Figure 3: Pass rates of six different GUI agents across varying task steps under two settings: with (w/) and without (w/o) provided documents. The x-axis represents the number of steps required to complete the task, and the y-axis represents the pass rate.
Figure 4: A sample of Anki.
Figure 5: A sample of Blender.
Figure 6: A sample of Element.
Figure 7: A sample of Godot.
Figure 8: A sample of Grafana.
Figure 9: A sample of Idea.
Figure 10: A sample of Netbeans.
Figure 11: A sample of Notepad++.
Figure 12: A sample of Odoo.
Figure 13: A sample of Postman.
Figure 14: A sample of PyCharm.
Figure 15: A sample of VSCode.
Figure 16: A sample of Zotero.
Figure 17: A sample of Zulip.
Figure 18: An error case of imprecise localization.
Figure 19: An error case of Non-official Reference.
Figure 20: An error case of execute before retrieval.
Figure 21: An error case of action grounding failure.
Figure 22: An error case of context misidentification.
Figure 23: Prompt for GUI Content Classification. Excerpt of the prompt ("Prompt for Document Content Classification"): "You are an assistant that classifies content based on specific criteria. Your task is to evaluate whether a given piece of content serves as a tutorial specifically related to graphical user interfaces (GUI), such as web applications, desktop applications, or operating systems. The content qualifies as a GUI-related tutorial…"

Abstract

While Graphical User Interface (GUI) agents have shown promising performance in automated device interaction, they primarily depend on static parametric knowledge from pre-training or instruction tuning. This reliance fundamentally limits their ability to handle long-tailed tasks that require explicit procedural knowledge absent from model parameters, often forcing agents to resort to inefficient and brittle trial-and-error exploration. To mitigate this limitation, we introduce Proactive Document-Guided Action for GUI agents in dynamic, open-web environments, a novel paradigm that mirrors human problem-solving by enabling agents to autonomously search for relevant documentation to resolve long-tailed tasks. To evaluate agents' capability in this paradigm, we propose DocOS, a benchmark designed to assess document-guided problem solving in fully interactive environments. DocOS requires agents to autonomously navigate a web browser, locate relevant online documentation, comprehend procedural instructions, and faithfully ground them into executable GUI actions. Extensive experiments reveal that progress is strictly constrained by dual bottlenecks: agents struggle to reliably locate relevant information during proactive search and frequently fail to faithfully ground retrieved instructions into precise actions, pointing toward document-guided interaction as a crucial pathway for enabling self-evolving GUI agents in dynamic environments.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 1 minor

Summary. The manuscript introduces Proactive Document-Guided Action as a paradigm for GUI agents operating in dynamic open-web environments. It proposes the DocOS benchmark, which requires agents to autonomously navigate browsers, locate relevant online documentation for long-tailed tasks, comprehend procedural instructions, and ground them into precise executable GUI actions. Experiments identify two primary bottlenecks—unreliable proactive search for information and failures in faithfully translating retrieved instructions into actions—and conclude that document-guided interaction is a crucial pathway for self-evolving GUI agents.

Significance. If the empirical findings hold under closer scrutiny, the work provides a concrete benchmark and diagnostic for limitations in current GUI agents that rely solely on parametric knowledge or trial-and-error. By framing document search and grounding as central challenges, it opens a research direction that could improve adaptability on tasks absent from training data, with potential value for reproducible evaluation in interactive agent settings.

major comments (1)
  1. [Benchmark construction] Benchmark construction and task selection (likely §3): The claim that progress is 'strictly constrained by dual bottlenecks' in proactive search and instruction grounding is load-bearing on the premise that accurate, sufficiently detailed, and relevant online documentation exists for the chosen long-tailed tasks and can be directly mapped to executable actions. The manuscript does not appear to describe a verification step confirming document availability, completeness, or lack of ambiguity for each task; without this, search failures may reflect data absence rather than agent capability, weakening the interpretation that document-guided interaction is the key enabling pathway.
minor comments (1)
  1. [Abstract] Abstract and experimental reporting: The abstract states that 'extensive experiments reveal' the bottlenecks but provides no quantitative metrics, baseline comparisons, or error breakdowns; adding a concise summary of key performance numbers and failure categorizations would improve clarity without altering the core contribution.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive feedback and positive assessment of the significance of DocOS. We address the single major comment below and have revised the manuscript to strengthen the benchmark description.

  1. Referee: [Benchmark construction] Benchmark construction and task selection (likely §3): The claim that progress is 'strictly constrained by dual bottlenecks' in proactive search and instruction grounding is load-bearing on the premise that accurate, sufficiently detailed, and relevant online documentation exists for the chosen long-tailed tasks and can be directly mapped to executable actions. The manuscript does not appear to describe a verification step confirming document availability, completeness, or lack of ambiguity for each task; without this, search failures may reflect data absence rather than agent capability, weakening the interpretation that document-guided interaction is the key enabling pathway.

    Authors: We agree that an explicit description of the verification process is necessary to support the interpretation of the dual bottlenecks. In the revised manuscript, we have added Section 3.2 (Task Curation and Documentation Verification) that details the following procedure: (1) We first identified long-tailed tasks from real-world usage logs and forums that are unlikely to be covered in model pre-training data. (2) For each candidate task, two authors independently searched for official or authoritative online documentation (e.g., vendor support pages, step-by-step tutorials). (3) We verified that the retrieved documents contain accurate, sufficiently detailed procedural instructions that can be unambiguously mapped to a finite sequence of GUI actions (clicks, typing, navigation) without requiring external knowledge or trial-and-error. Tasks lacking such documentation or containing irresolvable ambiguities were excluded from the final benchmark. This verification step ensures that observed agent failures in proactive search and instruction grounding are attributable to model limitations rather than missing or inadequate source material, thereby reinforcing our claim that document-guided interaction is a crucial pathway. revision: yes
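
The verification procedure described in this response amounts to a filter over candidate tasks. A minimal sketch follows, under loudly stated assumptions: the `Review` record, its field names, and the rule that every reviewer must confirm both conditions are hypothetical choices made here. The response says only that two authors searched independently and that tasks lacking documentation or containing irresolvable ambiguities were excluded.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Review:
    """One reviewer's verdict on a candidate task (hypothetical schema)."""
    annotator: str
    found_official_doc: bool   # official or authoritative documentation located
    steps_unambiguous: bool    # instructions map cleanly to a finite action sequence


def keep_task(reviews: List[Review], required_reviewers: int = 2) -> bool:
    """Keep a task only if enough distinct reviewers all confirm both checks."""
    annotators = {review.annotator for review in reviews}
    if len(annotators) < required_reviewers:
        return False
    # Assumption: unanimous agreement is required; the source does not state this.
    return all(r.found_official_doc and r.steps_unambiguous for r in reviews)


def filter_tasks(candidates: Dict[str, List[Review]]) -> List[str]:
    """Return the names of candidate tasks that survive verification."""
    return [task for task, reviews in candidates.items() if keep_task(reviews)]


# Made-up candidates for illustration only.
candidates = {
    "task_a": [Review("r1", True, True), Review("r2", True, True)],
    "task_b": [Review("r1", True, True), Review("r2", False, True)],
    "task_c": [Review("r1", True, False), Review("r2", True, True)],
    "task_d": [Review("r1", True, True)],
}
print(filter_tasks(candidates))
```

A filter of this kind is what lets a failed search be attributed to the agent rather than to missing source material, which is the point the referee raised.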

Circularity Check

0 steps flagged

Empirical benchmark proposal with no derivations or self-referential reductions

full rationale

The paper introduces the DocOS benchmark and Proactive Document-Guided Action paradigm for GUI agents, then reports experimental results on agent performance in search and grounding tasks. No equations, parameter fitting, or mathematical derivations are present. Conclusions about dual bottlenecks follow directly from observed empirical failures on the new benchmark tasks rather than reducing to any self-definition, fitted input renamed as prediction, or load-bearing self-citation chain. The work is self-contained as an empirical evaluation of a proposed interaction paradigm against external agent baselines.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claim rests on standard assumptions from the GUI agent literature plus the domain assumption that online documentation is available and usable for long-tailed tasks.

axioms (1)
  • domain assumption GUI agents can improve on long-tailed tasks by retrieving and grounding external procedural documentation
    Core premise of the proposed paradigm stated in the abstract.

pith-pipeline@v0.9.0 · 5759 in / 1131 out tokens · 48720 ms · 2026-05-20T11:09:44.886053+00:00 · methodology



Reference graph

Works this paper leans on

75 extracted references · 75 canonical work pages · 5 internal anchors

  1. [1]

    European Conference on Computer Vision , pages=

    Omniact: A dataset and benchmark for enabling multimodal generalist autonomous agents for desktop and web , author=. European Conference on Computer Vision , pages=. 2024 , organization=

  2. [2]

    Advances in Neural Information Processing Systems , volume=

    Osworld: Benchmarking multimodal agents for open-ended tasks in real computer environments , author=. Advances in Neural Information Processing Systems , volume=

  3. [4]

    arXiv preprint arXiv:2404.05955 , year=

    Visualwebbench: How far have multimodal llms evolved in web page understanding and grounding? , author=. arXiv preprint arXiv:2404.05955 , year=

  4. [5]

    NeurIPS , year=

    VideoGUI: A Benchmark for GUI Automation from Instructional Videos , author=. NeurIPS , year=

  5. [7]

    Proceedings of the 5th Annual International Conference on Mobile and Ubiquitous Systems: Computing, Networking, and Services , pages=

    ScreenSpot: Multidimensional resource discovery for distributed applications in smart spaces , author=. Proceedings of the 5th Annual International Conference on Mobile and Ubiquitous Systems: Computing, Networking, and Services , pages=

  6. [8]

    Findings of the Association for Computational Linguistics: ACL 2025 , pages=

    Crab: Cross-environment agent benchmark for multimodal language model agents , author=. Findings of the Association for Computational Linguistics: ACL 2025 , pages=

  7. [9]

    WebShop: Towards Scalable Real-World Web Interaction with Grounded Language Agents , booktitle =

    Shunyu Yao and Howard Chen and John Yang and Karthik Narasimhan , editor =. WebShop: Towards Scalable Real-World Web Interaction with Grounded Language Agents , booktitle =. 2022 , timestamp =

  8. [10]

    2023 , eprint=

    Mind2Web: Towards a Generalist Agent for the Web , author=. 2023 , eprint=

  9. [11]

    CoRR , volume =

    Luyuan Wang and Yongyu Deng and Yiwei Zha and Guodong Mao and Qinmin Wang and Tianchen Min and Wei Chen and Shoufa Chen , title =. CoRR , volume =. 2024 , doi =. 2406.08184 , timestamp =

  10. [12]

    2023 , eprint=

    AppAgent: Multimodal Agents as Smartphone Users , author=. 2023 , eprint=

  11. [14]

    Findings of the Association for Computational Linguistics: ACL 2025 , pages=

    Gui agents: A survey , author=. Findings of the Association for Computational Linguistics: ACL 2025 , pages=

  12. [15]

    National Science Review , volume=

    A survey on multimodal large language models , author=. National Science Review , volume=. 2024 , publisher=

  13. [16]

    The Twelfth International Conference on Learning Representations , year=

    WebArena: A Realistic Web Environment for Building Autonomous Agents , author=. The Twelfth International Conference on Learning Representations , year=

  14. [17]

    Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing , pages=

    Retrieval-augmented GUI Agents with Generative Guidelines , author=. Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing , pages=

  15. [20]

    Retrieval-Augmented Generation for Large Language Models: A Survey

    Retrieval-augmented generation for large language models: A survey , author=. arXiv preprint arXiv:2312.10997 , volume=

  16. [22]

    2025 , eprint=

    Qwen3-VL Technical Report , author=. 2025 , eprint=

  17. [25]

    Aguvis: Unified Pure Vision Agents for Autonomous GUI Interaction , author=

  18. [27]

    The Thirty-ninth Annual Conference on Neural Information Processing Systems Datasets and Benchmarks Track , year=

    Mind2Web 2: Evaluating Agentic Search with Agent-as-a-Judge , author=. The Thirty-ninth Annual Conference on Neural Information Processing Systems Datasets and Benchmarks Track , year=

  19. [28]

    2025 , month=sep, day=

    Introducing Claude Sonnet 4.5 , author=. 2025 , month=sep, day=

  20. [29]

    2024 , eprint=

    OS-ATLAS: A Foundation Action Model for Generalist GUI Agents , author=. 2024 , eprint=

  21. [30]

    2025 , eprint=

    Mano Technical Report , author=. 2025 , eprint=

  22. [31]

    2025 , eprint=

    OpenCUA: Open Foundations for Computer-Use Agents , author=. 2025 , eprint=

  23. [32]

    2024 , eprint=

    CogAgent: A Visual Language Model for GUI Agents , author=. 2024 , eprint=

  24. [33]

    2024 , eprint=

    OS-Copilot: Towards Generalist Computer Agents with Self-Improvement , author=. 2024 , eprint=

  25. [34]

    2023 , eprint=

    Android in the Wild: A Large-Scale Dataset for Android Device Control , author=. 2023 , eprint=

  26. [35]

    ScreenSpot-Pro:

    Kaixin Li and Meng Ziyang and Hongzhan Lin and Ziyang Luo and Yuchen Tian and Jing Ma and Zhiyong Huang and Tat-Seng Chua , booktitle=. ScreenSpot-Pro:

  27. [36]

    2024 , eprint=

    Spider2-V: How Far Are Multimodal Agents From Automating Data Science and Engineering Workflows? , author=. 2024 , eprint=

  28. [39]

    2025 , eprint=

    TongUI: Internet-Scale Trajectories from Multimodal Web Tutorials for Generalized GUI Agents , author=. 2025 , eprint=

  29. [49]

    2026 , //url=

    Zhenyu Li and Xuefeng Bai and YUNFEI LONG and Kehai Chen and Yaoyin Zhang and Xuchen Wei and Juntao Li and Min Zhang , booktitle=. 2026 , //url=

  30. [50]

    Qwen3-vl technical report, 2025

    Bai, S., Cai, Y., Chen, R., Chen, K., Chen, X., Cheng, Z., Deng, L., Ding, W., Gao, C., Ge, C., Ge, W., Guo, Z., Huang, Q., Huang, J., Huang, F., Hui, B., Jiang, S., Li, Z., Li, M., Li, M., Li, K., Lin, Z., Lin, J., Liu, X., Liu, J., Liu, C., Liu, Y., Liu, D., Liu, S., Lu, D., Luo, R., Lv, C., Men, R., Meng, L., Ren, X., Ren, X., Song, S., Sun, Y., Tang, ...

  31. [51]

    What limits virtual agent application? omnibench: A scalable multi-dimensional benchmark for essential virtual agent capabilities

    Bu, W., Wu, Y., Yu, Q., Gao, M., Miao, B., Zhang, Z., Pan, K., Li, Y., Li, M., Ji, W., et al. What limits virtual agent application? omnibench: A scalable multi-dimensional benchmark for essential virtual agent capabilities. arXiv preprint arXiv:2506.08933, 2025

  32. [52]

    Spider2-v: How far are multimodal agents from automating data science and engineering workflows?, 2024

    Cao, R., Lei, F., Wu, H., Chen, J., Fu, Y., Gao, H., Xiong, X., Zhang, H., Mao, Y., Hu, W., Xie, T., Xu, H., Zhang, D., Wang, S., Sun, R., Yin, P., Xiong, C., Ni, A., Liu, Q., Zhong, V., Chen, L., Yu, K., and Yu, T. Spider2-v: How far are multimodal agents from automating data science and engineering workflows?, 2024

  33. [53]

    Benchmarking LLM s for translating classical C hinese poetry: Evaluating adequacy, fluency, and elegance

    Chen, A., Lou, L., Chen, K., Bai, X., Xiang, Y., Yang, M., Zhao, T., and Zhang, M. Benchmarking LLM s for translating classical C hinese poetry: Evaluating adequacy, fluency, and elegance. In Christodoulopoulos, C., Chakraborty, T., Rose, C., and Peng, V. (eds.), Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pp.\ ...

  34. [54]

    Gui-world: A video benchmark and dataset for multimodal gui-oriented understanding

    Chen, D., Huang, Y., Wu, S., Tang, J., Chen, L., Bai, Y., He, Z., Wang, C., Zhou, H., Li, Y., et al. Gui-world: A video benchmark and dataset for multimodal gui-oriented understanding. arXiv preprint arXiv:2406.10819, 2024

  35. [55]

    Mind2web: Towards a generalist agent for the web, 2023

    Deng, X., Gu, Y., Zheng, B., Chen, S., Stevens, S., Wang, B., Sun, H., and Su, Y. Mind2web: Towards a generalist agent for the web, 2023

  36. [56]

    Mano technical report, 2025

    Fu, T., Su, A., Zhao, C., Wang, H., Wu, M., Yu, Z., Hu, F., Shi, M., Dong, W., Wang, J., Chen, Y., Yu, R., Peng, S., Li, M., Huang, N., Wei, H., Yu, J., Xin, Y., Zhao, X., Gu, K., Jiang, P., Zhou, S., and Wang, S. Mano technical report, 2025

  37. [57]

    H., Gutierrez, B

    Gou, B., Huang, Z., Ning, Y., Gu, Y., Lin, M., Yu, B., Kopanev, A., Qi, W., Shu, Y., Wu, J., Song, C. H., Gutierrez, B. J., Li, Y., Liao, Z., Moussa, H. N., ZHANG, T., Xie, J., Xue, T., Chen, S., Zheng, B., Zhang, K., Cai, Z., Rozgic, V., Ziyadi, M., Sun, H., and Su, Y. Mind2web 2: Evaluating agentic search with agent-as-a-judge. In The Thirty-ninth Annua...

  38. [58]

    Cogagent: A visual language model for gui agents, 2024

    Hong, W., Wang, W., Lv, Q., Xu, J., Yu, W., Ji, J., Wang, Y., Wang, Z., Zhang, Y., Li, J., Xu, B., Dong, Y., Ding, M., and Tang, J. Cogagent: A visual language model for gui agents, 2024

  39. [59]

    Screenspot: Multidimensional resource discovery for distributed applications in smart spaces

    Jurmu, M., Boring, S., and Riekki, J. Screenspot: Multidimensional resource discovery for distributed applications in smart spaces. In Proceedings of the 5th Annual International Conference on Mobile and Ubiquitous Systems: Computing, Networking, and Services, pp.\ 1--9, 2008

  40. [60]

    P., Russak, M., Koh, J

    Kapoor, R., Butala, Y. P., Russak, M., Koh, J. Y., Kamble, K., AlShikh, W., and Salakhutdinov, R. Omniact: A dataset and benchmark for enabling multimodal generalist autonomous agents for desktop and web. In European Conference on Computer Vision, pp.\ 161--178. Springer, 2024

  41. [61]

    VisualWebArena: Evaluating Multimodal Agents on Realistic Visual Web Tasks

    Koh, J. Y., Lo, R., Jang, L., Duvvur, V., Lim, M. C., Huang, P.-Y., Neubig, G., Zhou, S., Salakhutdinov, R., and Fried, D. Visualwebarena: Evaluating multimodal agents on realistic visual web tasks. arXiv preprint arXiv:2401.13649, 2024

  42. [62]

    Screenspot-pro: GUI grounding for professional high-resolution computer use

    Li, K., Ziyang, M., Lin, H., Luo, Z., Tian, Y., Ma, J., Huang, Z., and Chua, T.-S. Screenspot-pro: GUI grounding for professional high-resolution computer use. In Workshop on Reasoning and Planning for Large Language Models, 2025

  43. [63]

    End-to-end speech translation with adversarial training

    Li, X., Kehai, C., Zhao, T., and Yang, M. End-to-end speech translation with adversarial training. In Wu, H., Cherry, C., Huang, L., He, Z., Liberman, M., Cross, J., and Liu, Y. (eds.), Proceedings of the First Workshop on Automatic Simultaneous Translation, pp.\ 10--14, Seattle, Washington, July 2020. Association for Computational Linguistics. doi:10.186...

  44. [64]

    XIFB ench: Evaluating large language models on multilingual instruction following

    Li, Z., Bai, X., LONG, Y., Chen, K., Zhang, Y., Wei, X., Li, J., and Zhang, M. XIFB ench: Evaluating large language models on multilingual instruction following. In The Thirty-ninth Annual Conference on Neural Information Processing Systems Datasets and Benchmarks Track, 2026

  45. [65]

    Tool learning via inference-time scaling and cycle verifier

    Liang, X., Xie, W., Li, J., Wang, W., Chen, Y., Chen, K., and Zhang, M. Tool learning via inference-time scaling and cycle verifier. In Che, W., Nabende, J., Shutova, E., and Pilehvar, M. T. (eds.), Findings of the Association for Computational Linguistics: ACL 2025, pp.\ 24658--24671, Vienna, Austria, July 2025. Association for Computational Linguistics....

  46. [66]

    Q., Li, L., Gao, D., Wu, Q., Yan, M., Yang, Z., Wang, L., and Shou, M

    Lin, K. Q., Li, L., Gao, D., Wu, Q., Yan, M., Yang, Z., Wang, L., and Shou, M. Z. Videogui: A benchmark for gui automation from instructional videos. In NeurIPS, 2024

  47. [67]

    L., Sun, J., Wang, J., et al

    Liu, X., Qin, B., Liang, D., Dong, G., Lai, H., Zhang, H., Zhao, H., Iong, I. L., Sun, J., Wang, J., et al. Autoglm: Autonomous foundation agents for guis. arXiv preprint arXiv:2411.00820, 2024

  48. [68]

    Towards conversational recommendation over multi-type dialogs

    Liu, Z., Wang, H., Niu, Z.-Y., Wu, H., Che, W., and Liu, T. Towards conversational recommendation over multi-type dialogs. In Jurafsky, D., Chai, J., Schluter, N., and Tetreault, J. (eds.), Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pp.\ 1036--1049, Online, July 2020. Association for Computational Linguistics....

  49. [69]

    D u R ec D ial 2.0: A bilingual parallel corpus for conversational recommendation

    Liu, Z., Wang, H., Niu, Z.-Y., Wu, H., and Che, W. D u R ec D ial 2.0: A bilingual parallel corpus for conversational recommendation. In Moens, M.-F., Huang, X., Specia, L., and Yih, S. W.-t. (eds.), Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pp.\ 4335--4347, Online and Punta Cana, Dominican Republic, November ...

  50. [70]

    Where to go for the holidays: Towards mixed-type dialogs for clarification of user goals

    Liu, Z., Xu, J., Lei, Z., Wang, H., Niu, Z.-Y., and Wu, H. Where to go for the holidays: Towards mixed-type dialogs for clarification of user goals. In Muresan, S., Nakov, P., and Villavicencio, A. (eds.), Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp.\ 1024--1034, Dublin, Ireland, May ...

  51. [71]

    A survey on the feedback mechanism of llm-based ai agents

    Liu, Z., Bai, X., Chen, K., Chen, X., Li, X., Xiang, Y., Liu, J., Li, H.-D., Wang, Y., Nie, L., and Zhang, M. A survey on the feedback mechanism of llm-based ai agents. In Kwok, J. (ed.), Proceedings of the Thirty-Fourth International Joint Conference on Artificial Intelligence, IJCAI-25 , pp.\ 10582--10592. International Joint Conferences on Artificial I...

  52. [72]

    T rans B ench: Breaking barriers for transferable graphical user interface agents in dynamic digital environments

    Lu, Y., Yu, Q., Wang, H., Liu, Z., Su, W., Liu, Y., Guo, Y., Liang, M., Wang, Y., and Wang, H. T rans B ench: Breaking barriers for transferable graphical user interface agents in dynamic digital environments. In Che, W., Nabende, J., Shutova, E., and Pilehvar, M. T. (eds.), Findings of the Association for Computational Linguistics: ACL 2025, pp.\ 12464--...

  53. [73]

    Gui agents: A survey

    Nguyen, D., Chen, J., Wang, Y., Wu, G., Park, N., Hu, Z., Lyu, H., Wu, J., Aponte, R., Xia, Y., et al. Gui agents: A survey. In Findings of the Association for Computational Linguistics: ACL 2025, pp.\ 22522--22538, 2025

  54. [74]

    WebCanvas: Benchmarking Web Agents in Online Environments

    Pan, Y., Kong, D., Zhou, S., Cui, C., Leng, Y., Jiang, B., Liu, H., Shang, Y., Zhou, S., Wu, T., et al. Webcanvas: Benchmarking web agents in online environments. arXiv preprint arXiv:2406.12373, 2024

  55. [75]

    UI-TARS: Pioneering Automated GUI Interaction with Native Agents

    Qin, Y., Ye, Y., Fang, J., Wang, H., Liang, S., Tian, S., Zhang, J., Li, J., Li, Y., Huang, S., et al. Ui-tars: Pioneering automated gui interaction with native agents. arXiv preprint arXiv:2501.12326, 2025

  56. [76]

    Android in the wild: A large-scale dataset for android device control, 2023

    Rawles, C., Li, A., Rodriguez, D., Riva, O., and Lillicrap, T. Android in the wild: A large-scale dataset for android device control, 2023

  57. [77]

    M id M ed: Towards mixed-type dialogues for medical consultation

    Shi, X., Liu, Z., Wang, C., Leng, H., Xue, K., Zhang, X., and Zhang, S. M id M ed: Towards mixed-type dialogues for medical consultation. In Rogers, A., Boyd-Graber, J., and Okazaki, N. (eds.), Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp.\ 8145--8157, Toronto, Canada, July 2023. Assoc...

  58. [78]

    CoRR , volume =

    Wang, L., Deng, Y., Zha, Y., Mao, G., Wang, Q., Min, T., Chen, W., and Chen, S. Mobileagentbench: An efficient and user-friendly benchmark for mobile LLM agents. CoRR, abs/2406.08184, 2024 a . doi:10.48550/ARXIV.2406.08184

  59. [79]

    Gui agents with foundation models: A comprehensive survey

    Wang, S., Liu, W., Chen, J., Zhou, Y., Gan, W., Zeng, X., Che, Y., Yu, S., Hao, X., Shao, K., et al. Gui agents with foundation models: A comprehensive survey. arXiv preprint arXiv:2411.04890, 2024b

  60. [80]

    Wang, X., Wang, B., Lu, D., Yang, J., Xie, T., Wang, J., Deng, J., Guo, X., Xu, Y., Wu, C. H., Shen, Z., Li, Z., Li, R., Li, X., Chen, J., Zheng, B., Li, P., Lei, F., Cao, R., Fu, Y., Shin, D., Shin, M., Hu, J., Wang, Y., Chen, J., Ye, Y., Zhang, D., Du, D., Hu, H., Chen, H., Zhou, Z., Yao, H., Chen, Z., Gu, Q., Wang, Y., Wang, H., Yang, D., Zhong, V., Su...

  61. [81]

    Os-copilot: Towards generalist computer agents with self-improvement

    Wu, Z., Han, C., Ding, Z., Weng, Z., Liu, Z., Yao, S., Yu, T., and Kong, L. Os-copilot: Towards generalist computer agents with self-improvement, 2024a

  62. [82]

    Os-atlas: A foundation action model for generalist gui agents

    Wu, Z., Wu, Z., Xu, F., Wang, Y., Sun, Q., Jia, C., Cheng, K., Ding, Z., Chen, L., Liang, P. P., and Qiao, Y. Os-atlas: A foundation action model for generalist gui agents, 2024b

  63. [83]

    GUI-explorer: Autonomous exploration and mining of transition-aware knowledge for GUI agent

    Xie, B., Shao, R., Chen, G., Zhou, K., Li, Y., Liu, J., Zhang, M., and Nie, L. GUI-explorer: Autonomous exploration and mining of transition-aware knowledge for GUI agent. In Che, W., Nabende, J., Shutova, E., and Pilehvar, M. T. (eds.), Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. ...

  64. [84]

    Osworld: Benchmarking multimodal agents for open-ended tasks in real computer environments

    Xie, T., Zhang, D., Chen, J., Li, X., Zhao, S., Cao, R., Hua, T. J., Cheng, Z., Shin, D., Lei, F., et al. Osworld: Benchmarking multimodal agents for open-ended tasks in real computer environments. Advances in Neural Information Processing Systems, 37:52040--52094, 2024

  65. [85]

    Retrieval-augmented gui agents with generative guidelines

    Xu, R., Ma, K., Yu, W., Zhang, H., Ho, J. C., Yang, C., and Yu, D. Retrieval-augmented gui agents with generative guidelines. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pp. 17877--17886, 2025a

  66. [86]

    Crab: Cross-environment agent benchmark for multimodal language model agents

    Xu, T., Chen, L., Wu, D.-J., Chen, Y., Zhang, Z., Yao, X., Xie, Z., Chen, Y., Liu, S., Qian, B., et al. Crab: Cross-environment agent benchmark for multimodal language model agents. In Findings of the Association for Computational Linguistics: ACL 2025, pp. 21607--21647, 2025b

  67. [87]

    Aguvis: Unified pure vision agents for autonomous gui interaction

    Xu, Y., Wang, Z., Wang, J., Lu, D., Xie, T., Saha, A., Sahoo, D., Yu, T., and Xiong, C. Aguvis: Unified pure vision agents for autonomous gui interaction. In Forty-second International Conference on Machine Learning

  68. [88]

    Step-gui technical report

    Yan, H., Wang, J., Huang, X., Shen, Y., Meng, Z., Fan, Z., Tan, K., Gao, J., Shi, L., Yang, M., et al. Step-gui technical report. arXiv preprint arXiv:2512.15431, 2025

  69. [89]

    Webshop: Towards scalable real-world web interaction with grounded language agents

    Yao, S., Chen, H., Yang, J., and Narasimhan, K. Webshop: Towards scalable real-world web interaction with grounded language agents. In Koyejo, S., Mohamed, S., Agarwal, A., Belgrave, D., Cho, K., and Oh, A. (eds.), Advances in Neural Information Processing Systems 35: Annual Conference on Neural Information Processing Systems 2022, NeurIPS 2022, New Orlea...

  70. [90]

    Mobile-Agent-v3: Fundamental Agents for GUI Automation

    Ye, J., Zhang, X., Xu, H., Liu, H., Wang, J., Zhu, Z., Zheng, Z., Gao, F., Cao, J., Lu, Z., et al. Mobile-agent-v3: Fundamental agents for gui automation. arXiv preprint arXiv:2508.15144, 2025

  71. [91]

    A survey on multimodal large language models

    Yin, S., Fu, C., Zhao, S., Li, K., Sun, X., Xu, T., and Chen, E. A survey on multimodal large language models. National Science Review, 11(12):nwae403, 2024

  72. [92]

    Tongui: Internet-scale trajectories from multimodal web tutorials for generalized gui agents

    Zhang, B., Shang, Z., Gao, Z., Zhang, W., Xie, R., Ma, X., Yuan, T., Wu, X., Zhu, S.-C., and Li, Q. Tongui: Internet-scale trajectories from multimodal web tutorials for generalized gui agents, 2025

  73. [93]

    Appagent: Multimodal agents as smartphone users

    Zhang, C., Yang, Z., Liu, J., Han, Y., Chen, X., Huang, Z., Fu, B., and Yu, G. Appagent: Multimodal agents as smartphone users, 2023

  74. [94]

    Mai-ui technical report: Real-world centric foundation gui agents

    Zhou, H., Zhang, X., Tong, P., Zhang, J., Chen, L., Kong, Q., Cai, C., Liu, C., Wang, Y., Zhou, J., et al. Mai-ui technical report: Real-world centric foundation gui agents. arXiv preprint arXiv:2512.22047, 2025

  75. [95]

    Webarena: A realistic web environment for building autonomous agents

    Zhou, S., Xu, F. F., Zhu, H., Zhou, X., Lo, R., Sridhar, A., Cheng, X., Ou, T., Bisk, Y., Fried, D., et al. Webarena: A realistic web environment for building autonomous agents. In The Twelfth International Conference on Learning Representations, 2024