X-OmniClaw Technical Report: A Unified Mobile Agent for Multimodal Understanding and Interaction

Binqiang Pan; Chao Li; Haobo Ji; Haonan Lu; Peng Liu; Qi Qi; Qiuxia Hou; Qi Wu; Quanlong Zheng; Ru Zhen

arxiv: 2605.05765 · v2 · pith:5JVWW2JHnew · submitted 2026-05-07 · 💻 cs.CV

Xiaoming Ren, Ru Zhen, Chao Li, Yang Song, Qiuxia Hou, Yanhao Zhang, Peng Liu, Qi Qi, Quanlong Zheng, Qi Wu, Zhenyi Liao, Binqiang Pan, Haobo Ji, Haonan Lu

Pith reviewed 2026-05-22 09:45 UTC · model grok-4.3

classification 💻 cs.CV

keywords mobile agent · multimodal perception · Android interaction · memory optimization · hybrid grounding · behavior cloning · personal assistant · UI navigation

The pith

X-OmniClaw combines perception, memory, and action into one mobile agent that processes UI states, visuals, and speech to perform complex Android tasks with contextual awareness.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces X-OmniClaw as a single architecture that links three modules to let a mobile agent understand and act on phones in a more natural way. Omni Perception turns raw inputs from screens, cameras, and microphones into structured intent using temporal alignment. Omni Memory keeps both short-term task details and long-term user preferences to support continuity. Omni Action mixes XML data with visual checks and learns reusable skills from user behavior through cloning and replay. A reader would care if this setup truly reduces the friction of everyday phone tasks by making the agent remember personal context and execute precisely.

Core claim

The paper claims that a unified architecture of perception, memory, and action enables the agent to handle complex mobile tasks with high contextual awareness. Omni Perception supplies a multimodal pipeline that aligns UI, visual, and speech data into intent representations. Omni Memory merges runtime working memory with distilled long-term personal memory. Omni Action uses a hybrid grounding method of XML metadata plus visual perception, together with behavior cloning and trajectory replay to turn user paths into direct skills. Demonstrations across scenarios indicate gains in efficiency and reliability.
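
The hybrid grounding idea can be illustrated with a minimal Python sketch. This is not the authors' implementation. It assumes the UI tree arrives as XML in the style of an Android `uiautomator dump` (nodes carrying `text`, `content-desc`, and `bounds="[x1,y1][x2,y2]"` attributes), and `visual_locate` is a hypothetical stand-in for whatever visual model supplies coordinates when the XML has no usable match.

```python
import re
import xml.etree.ElementTree as ET
from typing import Callable, Optional, Tuple

Point = Tuple[int, int]
BOUNDS_RE = re.compile(r"\[(\d+),(\d+)\]\[(\d+),(\d+)\]")


def locate_in_xml(xml_dump: str, label: str) -> Optional[Point]:
    """Return the centre of the first node whose text or content-desc equals label."""
    root = ET.fromstring(xml_dump)
    for node in root.iter("node"):
        candidates = (node.get("text", ""), node.get("content-desc", ""))
        if label.lower() not in (c.lower() for c in candidates):
            continue
        match = BOUNDS_RE.fullmatch(node.get("bounds", ""))
        if match:
            x1, y1, x2, y2 = map(int, match.groups())
            return ((x1 + x2) // 2, (y1 + y2) // 2)
    return None


def ground(xml_dump: str, label: str,
           visual_locate: Callable[[str], Optional[Point]]) -> Tuple[Optional[Point], str]:
    """Try structural metadata first; fall back to the (hypothetical) visual locator."""
    point = locate_in_xml(xml_dump, label)
    if point is not None:
        return point, "xml"
    return visual_locate(label), "visual"


# Illustrative dump; real hierarchies are far larger.
SAMPLE = """<hierarchy>
  <node text="" content-desc="" bounds="[0,0][1080,1920]">
    <node text="Search" content-desc="" bounds="[100,200][300,280]"/>
  </node>
</hierarchy>"""
```

Returning the source of the coordinates ("xml" or "visual") alongside the point lets a caller log how often the structural path fails, which is the quantity the load-bearing premise depends on.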

What carries the argument

The three Omni modules together form the agent's unified architecture: Perception for unified multimodal input processing with temporal alignment, Memory for combining runtime and personalized long-term storage, and Action for hybrid XML-visual grounding plus skill replay.

If this is right

User navigation paths become reusable skills that the agent can execute directly without step-by-step guidance.
The agent maintains task continuity across sessions through integrated working and personal memory.
Interaction efficiency rises because the system avoids repeated full perception cycles on familiar tasks.
The architecture serves as a practical blueprint for building other mobile-native assistants that stay on-device.
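
The first and third consequences above can be sketched as a small skill library that records a demonstrated navigation path and replays it later. This is an illustrative Python sketch under stated assumptions: `Step`, `SkillLibrary`, the action vocabulary, and the `execute` callback are all hypothetical names, not part of the paper.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass(frozen=True)
class Step:
    action: str   # e.g. "tap", "type", "open_app" (illustrative vocabulary)
    target: str   # element label, app name, or text to type


@dataclass
class SkillLibrary:
    skills: Dict[str, List[Step]] = field(default_factory=dict)

    def record(self, name: str, trajectory: List[Step]) -> None:
        """Store a demonstrated trajectory under a skill name."""
        self.skills[name] = list(trajectory)

    def replay(self, name: str, execute: Callable[[Step], bool]) -> int:
        """Run stored steps in order; return how many succeeded before stopping."""
        done = 0
        for step in self.skills.get(name, []):
            if not execute(step):
                break  # caller can hand control back to the full agent loop
            done += 1
        return done
```

A replay that returns fewer steps than were recorded signals that the UI has drifted from the demonstration, at which point a full perception cycle would presumably be needed again.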

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Similar three-part designs could be tested on non-Android platforms to check whether the hybrid grounding still holds without Android-specific XML.
If the memory module scales, it might allow agents to handle multi-day tasks that span several unrelated apps without losing user intent.
Behavior cloning from real users could reduce the need for hand-coded scripts, but only if the replay mechanism generalizes beyond the recorded trajectories.

Load-bearing premise

The assumption that combining structural XML metadata with visual perception will deliver reliable interaction across all diverse mobile apps and conditions.

What would settle it

A test where the agent is given a new app lacking usable XML metadata and with partial screen occlusion, then measured for successful task completion rate compared to the reported demonstrations.

Figures

Figures reproduced from arXiv: 2605.05765 by Binqiang Pan, Chao Li, Haobo Ji, Haonan Lu, Peng Liu, Qi Qi, Qiuxia Hou, Qi Wu, Quanlong Zheng, Ru Zhen, Xiaoming Ren, Yang Song, Yanhao Zhang, Zhenyi Liao.

**Figure 1.** Summary of the concrete architecture: integrated multimodal perception (Voice, Screen, and Camera) drives on-device execution via the agent loop, which is then transformed into refined experience and persistent memory to iteratively optimize future performance. Panel labels: X-OmniClaw Local Engine (Your Devices, Your Driver), Omni Perception, Omni Action…

**Figure 2.** Overview of Omni Perception: multimodal entry, multimodal perception, and scene-grounded intent

**Figure 3.** Overview of Omni Memory: runtime context, long-term artifacts, and Skill–Tool coordination.

**Figure 4.** Overview of Omni Action in the app ecosystem: agent loop and trajectory-cloned execution.

**Figure 5.** Scenario A illustrations: camera-informed execution with direct app entry and result extraction (a);

**Figure 6.** Illustration of the theme-based one-tap video composition: (a) multimodal gallery memory and (b)

**Figure 7.** Illustration of instant portal to a Meituan flash-sale page (Demo C).

Original abstract

Inspired by the development of OpenClaw, there is a growing demand for mobile-based personal agents capable of handling complex and intuitive interactions. In this technical report, we introduce X-OmniClaw, a unified mobile agent designed for multimodal understanding and interaction in the Android ecosystem. This unified architecture of perception, memory, and action enables the agent to handle complex mobile tasks with high contextual awareness. Specifically, Omni Perception provides a unified multimodal ingress pipeline that integrates UI states, real-world visual contexts, and speech inputs, leveraging a temporal alignment module to decompose raw data into structured multimodal intent representations. Omni Memory leverages multimodal memory optimization to enhance personalized intelligence by integrating runtime working memory for task continuity with long-term personal memory distilled from local data, enabling highly context-aware and personalized interactions. Finally, Omni Action employs a hybrid grounding strategy that combines structural XML metadata with visual perception for robust interaction. Through Behavior Cloning and Trajectory Replay, the system captures user navigation as reusable skills, enabling precise direct-access execution. Demonstrations across diverse scenarios show that X-OmniClaw effectively enhances interaction efficiency and task reliability, providing a practical architectural blueprint for the next generation of mobile-native personal assistants.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

This is a clear but incremental system description of an Android agent that combines existing techniques without new measurements or derivations.

The paper gives a practical blueprint for X-OmniClaw, a mobile agent that ties together perception, memory, and action for Android tasks. It builds directly on OpenClaw, adds temporal alignment for multimodal inputs, uses working memory plus distilled personal memory for continuity, and applies hybrid XML-plus-visual grounding for actions. Behavior cloning and trajectory replay turn user paths into reusable skills. The components are described in enough detail that an engineer could see how they fit together for context-aware interactions.
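
To show how working memory and distilled personal memory might fit together, here is a toy Python sketch. It is an assumption-laden illustration, not the paper's design: the class name `OmniMemorySketch`, the bounded-queue working memory, and the rule that promotes a fact to long-term memory after it recurs `promote_after` times are all invented for this example. The paper says only that long-term memory is distilled from local data.

```python
from collections import Counter, deque
from typing import Deque, Dict, List, Tuple


class OmniMemorySketch:
    """Toy two-tier memory: bounded working memory plus distilled preferences."""

    def __init__(self, working_size: int = 5, promote_after: int = 2) -> None:
        self.working: Deque[Tuple[str, str]] = deque(maxlen=working_size)
        self.long_term: Dict[str, str] = {}
        self._counts: Counter = Counter()
        self.promote_after = promote_after  # hypothetical distillation threshold

    def observe(self, key: str, value: str) -> None:
        """Record a runtime fact; promote it once it has recurred often enough."""
        self.working.append((key, value))
        self._counts[(key, value)] += 1
        if self._counts[(key, value)] >= self.promote_after:
            self.long_term[key] = value

    def context(self) -> List[str]:
        """Build prompt context: stable preferences first, then recent events."""
        prefs = [f"pref {k}={v}" for k, v in sorted(self.long_term.items())]
        recent = [f"recent {k}={v}" for k, v in self.working]
        return prefs + recent
```

Keeping the two tiers separate means the working queue can be discarded at the end of a task without losing anything that was promoted.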

Referee Report

2 major / 2 minor

Summary. The paper introduces X-OmniClaw, a unified mobile agent for multimodal understanding and interaction in the Android ecosystem. It describes an architecture with three main components: Omni Perception, which integrates UI states, real-world visual contexts, and speech inputs using a temporal alignment module to produce structured multimodal intent representations; Omni Memory, which combines runtime working memory for task continuity with long-term personal memory distilled from local data; and Omni Action, which employs a hybrid grounding strategy of structural XML metadata and visual perception, along with behavior cloning and trajectory replay to capture and reuse user navigation skills. The central claim is that this unified architecture of perception, memory, and action enables the agent to handle complex mobile tasks with high contextual awareness, as shown through demonstrations across diverse scenarios that enhance interaction efficiency and task reliability.

Significance. If the architectural integration functions as described, this technical report supplies a coherent blueprint for mobile-native personal assistants that fuse multimodal inputs, personalized memory, and precise action execution. The emphasis on local data distillation for personalization and the use of behavior cloning to create reusable skills are constructive elements that could support practical implementations in mobile AI. The work contributes to the broader area of context-aware agents by outlining specific modules such as temporal alignment and hybrid grounding, potentially guiding subsequent empirical validation in computer vision and human-computer interaction.

major comments (2)

[Omni Perception] In the Omni Perception section: the temporal alignment module is described as decomposing raw data into structured multimodal intent representations, but no details are given on the alignment procedure, synchronization mechanism, or handling of asynchronous inputs, which is load-bearing for the multimodal understanding claim.
[Demonstrations] In the description of demonstrations: the claim that X-OmniClaw effectively enhances interaction efficiency and task reliability rests on undescribed demonstrations across diverse scenarios, with no metrics, baselines, error analysis, or qualitative breakdown provided to substantiate the high contextual awareness assertion.

minor comments (2)

[Introduction] The abstract mentions inspiration from OpenClaw but does not include a brief comparison or reference to prior work in the introduction, which would help situate the contribution.
Notation for components (e.g., 'Omni Perception') is used consistently but could be accompanied by a diagram or table summarizing data flows between modules for improved clarity.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive feedback and positive assessment of the technical report. We address each major comment below and indicate the revisions we will make to the manuscript.

Referee: [Omni Perception] In the Omni Perception section: the temporal alignment module is described as decomposing raw data into structured multimodal intent representations, but no details are given on the alignment procedure, synchronization mechanism, or handling of asynchronous inputs, which is load-bearing for the multimodal understanding claim.

Authors: We agree that the current description lacks sufficient implementation details on the temporal alignment module. In the revised version, we will expand this section to describe the alignment procedure, including timestamp-based synchronization across modalities, buffering for asynchronous inputs, and the mechanism for producing structured multimodal intent representations. revision: yes
Referee: [Demonstrations] In the description of demonstrations: the claim that X-OmniClaw effectively enhances interaction efficiency and task reliability rests on undescribed demonstrations across diverse scenarios, with no metrics, baselines, error analysis, or qualitative breakdown provided to substantiate the high contextual awareness assertion.

Authors: We appreciate this point. As this is a technical report presenting an architectural blueprint rather than a benchmark study, the demonstrations are intended to be illustrative. We will revise the demonstrations section to include a qualitative breakdown of the scenarios, specific examples of contextual awareness in action, and an error analysis of observed failure modes. Quantitative metrics and baselines are outside the scope of this report but could be addressed in follow-up work. revision: partial
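
The first response mentions timestamp-based synchronization and buffering for asynchronous inputs without giving a procedure. One way such alignment could work is sketched below in Python. Everything here is an assumption for illustration: the `Event` type, the fixed `window` of one second, and the rule that the latest payload per modality wins are not taken from the paper or the rebuttal.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass(frozen=True)
class Event:
    t: float        # capture time in seconds
    modality: str   # "speech", "screen", or "camera"
    payload: str


def align(events: List[Event], window: float = 1.0) -> List[Dict[str, str]]:
    """Group events whose timestamps fall within `window` of a group's first event."""
    groups: List[Dict[str, str]] = []
    start: Optional[float] = None
    # Sorting the buffered events absorbs out-of-order arrival across modalities.
    for ev in sorted(events, key=lambda e: e.t):
        if start is None or ev.t - start > window:
            groups.append({})
            start = ev.t
        groups[-1][ev.modality] = ev.payload  # latest payload per modality wins
    return groups
```

Each returned dictionary is one candidate intent record: a spoken request together with the screen and camera content captured around the same moment.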

Circularity Check

0 steps flagged

No significant circularity identified

The paper is a technical report presenting a high-level architectural blueprint for the X-OmniClaw mobile agent, with descriptive modules for Omni Perception (temporal alignment for multimodal intent), Omni Memory (runtime and long-term optimization), and Omni Action (hybrid XML+visual grounding plus behavior cloning). No equations, derivations, fitted parameters, or quantitative predictions appear. Central claims rest on the logical integration of these system components rather than any self-referential definitions, self-citation load-bearing steps, or reductions of outputs to inputs by construction. The argument is self-contained within the genre of a system design description.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only text supplies no explicit free parameters, mathematical axioms, or newly postulated entities; all components are described at a conceptual level without supporting derivations or data.

pith-pipeline@v0.9.0 · 5783 in / 1086 out tokens · 37915 ms · 2026-05-22T09:45:46.194360+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

Each entry gives the theorem, its tag, and the paper passage related to the cited Recognition theorem.

IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · unclear
Paper passage: "unified architecture of perception, memory, and action... hybrid grounding strategy that combines structural XML metadata with visual perception"

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
Paper passage: "Behavior Cloning and Trajectory Replay... reusable skills"

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

22 extracted references · 22 canonical work pages · 9 internal anchors

[1] a2a mcp. Hermes agent vs openclaw 2026: Memory, ecosystem, integrations & self-improvement. https://a2a-mcp.org/blog/hermes-agent-vs-openclaw, 2026.

[2] AAswordman. Operit GitHub Repository. https://github.com/AAswordman/Operit, 2026. Accessed: 2026-05-21.

[3] Alibaba Cloud. Wuying cloud phone. https://www.aliyun.com/product/cloud-phone, 2026. Retrieved from Alibaba Cloud Official Product Page.

[4] Hao Bai et al. DigiRL: Training in-the-wild device-control agents with autonomous reinforcement learning. arXiv preprint arXiv:2406.11896, 2024.

[5] Yucheng Han et al. AppAgent: Multimodal agents as smartphone users. arXiv preprint arXiv:2312.13771, 2023.

[6] NousResearch. Hermes agent architecture documentation. https://hermes-agent.nousresearch.com/docs/developer-guide/architecture/, 2026.

[7] NousResearch. Hermes agent: Persistent memory and emergent skills in an open-source AI agent framework. https://github.com/NousResearch/hermes-agent, 2026. Version v2026.3.23.

[8] OpenClaw. OpenClaw memory concept documentation. https://github.com/openclaw/openclaw/blob/main/docs/concepts/memory.md, 2026. Project documentation, accessed 2026-04-13.

[9] OpenClaw Team. OpenClaw GitHub Repository. https://github.com/openclaw/openclaw, 2026. Accessed: 2026-04-13.

[10] Yujia Qin et al. UI-TARS: Pioneering automated GUI interaction with native agents. arXiv preprint arXiv:2501.12326,

[11] Christopher Rawles et al. AndroidWorld: A dynamic benchmarking environment for autonomous agents. arXiv preprint arXiv:2405.14573, 2025.

[12] Christopher Rawles, Alice Li, Daniel Rodriguez, Oriana Riva, and Timothy Lillicrap. Android in the wild: A large-scale dataset for Android device control. arXiv preprint arXiv:2307.10088, 2023.

[13] RedFinger. Redfinger cloud phone. https://www.gc.com.cn/, 2026. Retrieved from RedFinger Official Website.

[14] SelectXn00b. AndroidClaw Source Code Branch. https://github.com/SelectXn00b/HermesApp/tree/AndroidClaw, 2026. Accessed: 2026-04-23.

[15] Tencent Cloud. Cloud virtual phone (cvp). https://cloud.tencent.cn/document/product/1801,

[16] Retrieved from Tencent Cloud Official Document.

[17] Guanzhi Wang, Yuqi Xie, Yunfan Jiang, Ajay Mandlekar, Chaowei Xiao, Yuke Zhu, Linxi Fan, and Anima Anandkumar. Voyager: An open-ended embodied agent with large language models. arXiv preprint arXiv:2305.16291, 2023.

[18] Junyang Wang, Haiyang Xu, Jiabo Ye, Ming Yan, Weizhou Shen, Ji Zhang, Fei Huang, and Jitao Sang. Mobile-Agent: Autonomous multi-modal mobile device agent with visual perception. arXiv preprint arXiv:2401.16158, 2024.

[19] Zhiyong Wu, Zhenyu Wu, Fangzhi Xu, Yian Wang, Qiushi Sun, Chengyou Jia, Kanzhi Cheng, Zichen Ding, Liheng Chen, Paul Pu Liang, and Yu Qiao. OS-ATLAS: A foundation action model for generalist GUI agents. arXiv preprint arXiv:2410.23218, 2024.

[20] Tianbao Xie, Danyang Zhang, Jixuan Chen, Xiaochuan Li, Siheng Zhao, Ruisheng Cao, Toh Jing Hua, Zhoujun Cheng, Dongchan Shin, Fangyu Lei, Yitao Liu, Yiheng Liu, Yiheng Xu, Shuyan Zhou, Silvio Savarese, Caiming Xiong, Victor Zhong, and Tao Yu. OSWorld: Benchmarking multimodal agents for open-ended tasks in real computer environments. arXiv preprint arXiv:2...

[21] Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik Narasimhan, and Yuan Cao. ReAct: Synergizing reasoning and acting in language models. arXiv preprint arXiv:2210.03629, 2023.

[22] Shuyan Zhou, Frank F. Xu, Hao Zhu, Xuhui Zhou, Robert Lo, Abishek Sridhar, Xianyi Cheng, Tianyue Ou, Yonatan Bisk, Daniel Fried, Uri Alon, and Graham Neubig. WebArena: A realistic web environment for building autonomous agents. arXiv preprint arXiv:2307.13854, 2023.
