arxiv: 2606.19926 · v1 · pith:IEAJIHSDnew · submitted 2026-06-18 · 💻 cs.HC

MemGUI-Agent: An End-to-End Long-Horizon Mobile GUI Agent with Proactive Context Management

Pith reviewed 2026-06-26 15:59 UTC · model grok-4.3

classification 💻 cs.HC
keywords mobile GUI agents · long-horizon tasks · context management · ConAct · MLLM-based agents · supervised fine-tuning · MemGUI-Bench · MobileWorld benchmark

The pith

MemGUI-Agent treats context management as first-class actions to enable reliable long-horizon mobile GUI performance.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper argues that passive history accumulation in ReAct-style agents causes prompt explosion and dilutes key facts in long mobile tasks that span several apps. It instead makes context management a set of proactive actions chosen by the same model that selects UI actions, which keeps three structured context fields compact while retaining critical information. The evidence is MemGUI-3K, a dataset of 2,956 annotated trajectories, and an 8B model trained on it that reports the best open-data 8B result on MemGUI-Bench and also transfers to the out-of-distribution MobileWorld benchmark. A sympathetic reader would care because the approach offers a way to scale GUI agents beyond short tasks without external memory systems.

Core claim

MemGUI-Agent is built on Context-as-Action (ConAct), which casts context management as first-class actions emitted by the same policy that selects UI actions. Instead of passively appending history, ConAct maintains three structured context fields: folded action history, folded UI state, and recent step record, preserving critical UI facts while keeping context compact. Training an 8B model on the 2,956-trajectory MemGUI-3K dataset produces MemGUI-8B-SFT that achieves the best open-data 8B performance on MemGUI-Bench and generalizes to the out-of-distribution MobileWorld benchmark.

What carries the argument

Context-as-Action (ConAct), which casts context management as first-class actions emitted by the same policy that selects UI actions, maintaining folded action history, folded UI state, and recent step record.
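
To make the three-field idea concrete, here is a minimal sketch of what such a context state could look like. The class name, method names, field types, and the range-based fold are all assumptions made for illustration; the paper's actual implementation may differ.

```python
from dataclasses import dataclass, field


@dataclass
class ConActContext:
    """Hypothetical container for the three context fields named in the text."""

    # (start_step, end_step, summary) tuples; the tuple layout is an assumption.
    folded_action_history: list = field(default_factory=list)
    # Critical UI facts kept verbatim, keyed by a short label (assumed layout).
    folded_ui_state: dict = field(default_factory=dict)
    # Full details of the most recent step, waiting to be folded.
    recent_step_record: str = ""

    def fold(self, start_step: int, current_step: int, summary: str) -> None:
        """Context action: replace history in [start_step, current_step] with one summary."""
        kept = [e for e in self.folded_action_history if e[1] < start_step]
        kept.append((start_step, current_step, summary))
        self.folded_action_history = kept
        self.recent_step_record = ""

    def store_fact(self, key: str, value: str) -> None:
        """Context action: pin a UI fact so later folds cannot drop it."""
        self.folded_ui_state[key] = value

    def render(self) -> str:
        """Serialize the three fields into prompt text for the next step."""
        history = "\n".join(f"[{s}-{e}] {text}" for s, e, text in self.folded_action_history)
        facts = "\n".join(f"{k}: {v}" for k, v in self.folded_ui_state.items())
        return (
            f"Folded Action History:\n{history}\n"
            f"Folded UI State:\n{facts}\n"
            f"Recent Step Record:\n{self.recent_step_record}"
        )


ctx = ConActContext()
ctx.store_fact("price", "$19.99")
ctx.fold(1, 1, "Opened shop app")
ctx.fold(2, 2, "Searched for the item")
ctx.fold(1, 3, "Found the item price in the shop app")
```

In this sketch the pinned fact survives the range fold because it lives in a separate field from the action history, which is one plausible reading of why the paper separates folded UI state from folded action history.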

If this is right

  • The same policy learns to decide when and how to fold context, preserving critical cross-app facts.
  • Supervised training on annotated trajectories makes proactive management learnable across model scales.
  • The resulting 8B agent sets the best open-data performance on MemGUI-Bench.
  • It generalizes to out-of-distribution benchmarks like MobileWorld.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • This method might reduce the need for separate memory architectures in agent systems.
  • Similar context-as-action ideas could apply to web or desktop agents facing similar horizon limits.
  • If models learn context actions well, reliability on tasks with many app transitions could improve.
  • The three-field structure could be adapted for other structured memory needs.

Load-bearing premise

That the model will learn to emit useful context management actions rather than unhelpful or noisy ones that fail to preserve critical facts.

What would settle it

If the 8B model trained on MemGUI-3K does not achieve the highest open-data 8B score on MemGUI-Bench or fails to retain facts on long sequences, the claim would not hold.
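
One way the fact-retention half of this test could be operationalized is a simple string probe over the agent's context at the end of a trajectory. The function name, the substring-matching rule, and the example facts below are hypothetical; the paper's benchmarks may score retention differently.

```python
def fact_retention(context_text: str, critical_facts: list) -> float:
    """Fraction of critical facts still present verbatim in the context (assumed metric)."""
    if not critical_facts:
        return 1.0
    haystack = context_text.lower()
    kept = sum(1 for fact in critical_facts if fact.lower() in haystack)
    return kept / len(critical_facts)


# Made-up example: three of the four facts survived folding.
final_context = "Folded UI State: flight AA102 departs 09:40; hotel total $412"
score = fact_retention(final_context, ["AA102", "09:40", "$412", "Gate B7"])
```

Verbatim matching is deliberately strict: a paraphrased price or time counts as lost, which matches the concern that folding might blur exact values.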

Original abstract

MLLM-based mobile GUI agents have made substantial progress on short-horizon tasks, yet remain unreliable on long-horizon tasks that require retaining intermediate facts across many steps and app transitions. We attribute this limitation to ReAct-style prompting, which passively accumulates per-step records, leading to prompt explosion and dilution of critical cross-app facts. To address this, we introduce MemGUI-Agent, an end-to-end long-horizon mobile GUI agent with proactive context management. MemGUI-Agent is built on Context-as-Action (ConAct), which casts context management as first-class actions emitted by the same policy that selects UI actions. Instead of passively appending history, ConAct maintains three structured context fields: folded action history, folded UI state, and recent step record, preserving critical UI facts while keeping context compact. To make proactive context management learnable across model scales, we construct MemGUI-3K, a 2,956-trajectory dataset with full ConAct annotations for supervised training and offline analysis. Training an 8B model on MemGUI-3K produces MemGUI-8B-SFT, an 8B MemGUI-Agent that achieves the best open-data 8B performance on MemGUI-Bench and generalizes to the out-of-distribution MobileWorld benchmark. Code, data, and trained models will be released at https://memgui-agent.github.io/.
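
The prompt-explosion argument can be illustrated with a toy size model. All token counts and the folding schedule below are made-up assumptions for illustration, not measurements from the paper: passive accumulation keeps every full per-step record, while the folded variate keeps one full recent record plus short summaries that are merged every few steps.

```python
def react_context_tokens(steps: int, tokens_per_record: int) -> list:
    """Passive accumulation: every full per-step record stays in the prompt."""
    return [t * tokens_per_record for t in range(1, steps + 1)]


def folded_context_tokens(steps: int, tokens_per_record: int,
                          tokens_per_summary: int, fold_every: int) -> list:
    """Folding: one full recent record plus short summaries of earlier steps.

    Assumed schedule: each block of `fold_every` past steps is merged into a
    single summary; the remaining past steps keep one summary each.
    """
    sizes = []
    for t in range(1, steps + 1):
        merged_blocks, loose_steps = divmod(t - 1, fold_every)
        summaries = merged_blocks + loose_steps
        sizes.append(tokens_per_record + summaries * tokens_per_summary)
    return sizes


react = react_context_tokens(40, 300)
folded = folded_context_tokens(40, 300, 30, 5)
print(react[-1], folded[-1])  # prints: 12000 630
```

Under these assumed numbers the folded context at step 40 is roughly a twentieth of the accumulated one. The folded size still grows with task length in this model, only much more slowly, so how compact ConAct stays in practice depends on how aggressively the policy merges ranges.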

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The manuscript introduces MemGUI-Agent, an MLLM-based mobile GUI agent that frames context management as first-class actions (ConAct) emitted by the same policy as UI actions. ConAct maintains three structured fields (folded action history, folded UI state, recent step record) to avoid passive accumulation and prompt explosion in long-horizon tasks. The authors construct MemGUI-3K, a dataset of 2,956 fully annotated trajectories, perform SFT on an 8B model to obtain MemGUI-8B-SFT, and claim this yields the best open-data 8B performance on MemGUI-Bench while generalizing to the out-of-distribution MobileWorld benchmark.

Significance. If the results hold, the work provides a concrete mechanism for making context management proactive and learnable within the policy itself, which could improve reliability of long-horizon agents across GUI and related domains. The public release of code, data, and trained models is a clear strength that supports reproducibility and follow-on research.

major comments (2)
  1. [Abstract] Abstract: The headline performance claims for MemGUI-8B-SFT (best open-data 8B result on MemGUI-Bench and generalization to MobileWorld) are stated without any experimental details, error bars, dataset statistics, ablation studies, or baseline comparisons, rendering the central empirical result unevaluable from the provided text.
  2. [ConAct / SFT sections] Section on ConAct and SFT training (likely §3–4): The performance attribution to ConAct requires that SFT on MemGUI-3K induces the 8B policy to emit useful, fact-preserving context actions at appropriate times rather than defaulting to passive accumulation or noisy/empty folds. No analysis of emitted ConAct actions (frequency, fidelity to annotations, effect on context length, or comparison to ReAct baselines) is described, leaving this load-bearing assumption unverified.
minor comments (1)
  1. [Abstract] Abstract: The phrase 'best open-data 8B performance' is undefined; the manuscript should clarify what 'open-data' baselines are considered and how they are selected.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their thoughtful comments, which highlight important aspects for improving the clarity and verifiability of our empirical claims. We address each point below and commit to revisions that will strengthen the manuscript.

Point-by-point responses
  1. Referee: [Abstract] Abstract: The headline performance claims for MemGUI-8B-SFT (best open-data 8B result on MemGUI-Bench and generalization to MobileWorld) are stated without any experimental details, error bars, dataset statistics, ablation studies, or baseline comparisons, rendering the central empirical result unevaluable from the provided text.

    Authors: The abstract is designed to provide a high-level overview of the contributions and key results within the typical length constraints. Full details on the experimental setup, including the MemGUI-Bench and MobileWorld benchmarks, dataset statistics for MemGUI-3K (2,956 trajectories), SFT training on the 8B model, and comparisons to baselines are presented in the Experiments section. We will revise the abstract to briefly mention the evaluation on MemGUI-Bench with open-data 8B models and generalization to MobileWorld, while noting that detailed results and ablations appear in the main text. Error bars and full ablations are not standard in abstracts but are included in the paper body. revision: partial

  2. Referee: [ConAct / SFT sections] Section on ConAct and SFT training (likely §3–4): The performance attribution to ConAct requires that SFT on MemGUI-3K induces the 8B policy to emit useful, fact-preserving context actions at appropriate times rather than defaulting to passive accumulation or noisy/empty folds. No analysis of emitted ConAct actions (frequency, fidelity to annotations, effect on context length, or comparison to ReAct baselines) is described, leaving this load-bearing assumption unverified.

    Authors: We agree that verifying the policy's use of ConAct is important for attributing performance gains. The current manuscript focuses on the design of ConAct and the construction of the annotated dataset but does not include a post-hoc analysis of the trained model's ConAct emissions. In the revision, we will add such an analysis, including quantitative measures of ConAct emission frequency, fidelity to the ground-truth annotations in MemGUI-3K, impact on context length compared to ReAct-style accumulation, and qualitative examples. This will be incorporated into the experimental results section to directly address the load-bearing assumption. revision: yes
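
The analysis promised in this response could start from per-step comparisons between emitted and annotated context actions. The sketch below is a hypothetical illustration: the function name, the per-step representation (a dict with a "range" key, or None when no context action occurs), and the exact-match agreement rule are assumptions, not the authors' protocol.

```python
def conact_emission_stats(predicted: list, annotated: list) -> dict:
    """Compare emitted context actions with annotations, step by step.

    Each list holds one entry per step: None, or a dict with a "range" key
    such as {"range": [1, 3], "summary": "..."} (assumed format).
    """
    if len(predicted) != len(annotated):
        raise ValueError("trajectories must be aligned step by step")
    steps = len(predicted)
    emitted = sum(p is not None for p in predicted)
    # Fidelity is scored only on steps where the annotation has a context action.
    pairs = [(p, a) for p, a in zip(predicted, annotated) if a is not None]
    matches = sum(p is not None and p["range"] == a["range"] for p, a in pairs)
    return {
        "emission_frequency": emitted / steps if steps else 0.0,
        "range_agreement": matches / len(pairs) if pairs else 0.0,
    }


predicted = [None, {"range": [1, 1]}, {"range": [2, 3]}, None]
annotated = [None, {"range": [1, 1]}, {"range": [1, 3]}, {"range": [4, 4]}]
stats = conact_emission_stats(predicted, annotated)
```

Range agreement checks only whether the model folds the same span as the annotation; judging whether the summary text preserves the right facts would need a separate measure.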

Circularity Check

0 steps flagged

No circularity detected; empirical construction with external benchmarks

full rationale

The paper describes an empirical pipeline: define ConAct as context actions, annotate a new 2,956-trajectory dataset (MemGUI-3K) with those actions, perform SFT on an 8B model, and report performance on MemGUI-Bench plus out-of-distribution MobileWorld. No equations, parameter fits, or derivations appear. No self-citations are invoked as load-bearing uniqueness theorems. The central claim (SFT on annotated trajectories yields a policy that emits useful ConAct actions) is tested against separate benchmarks rather than reducing to the training data by construction. This is a standard supervised-learning result whose validity is externally falsifiable.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review provides no equations or implementation details, so no free parameters, axioms, or invented entities beyond the high-level concepts named in the text can be identified.

pith-pipeline@v0.9.1-grok · 5803 in / 1152 out tokens · 21870 ms · 2026-06-26T15:59:56.540442+00:00 · methodology

