arxiv: 2605.29324 · v1 · pith:HBPNMDHTnew · submitted 2026-05-28 · 💻 cs.CL · cs.CV

STAMP: Training Explicit Memory for Mobile GUI Agents in Controllable and Scalable Virtual Environments

Pith reviewed 2026-06-29 07:42 UTC · model grok-4.3

classification 💻 cs.CL cs.CV
keywords mobile GUI agents · explicit memory · virtual environments · long-horizon tasks · reinforcement learning · memory benchmark · context window limits · supervised memory data

The pith

Virtual environments with injected memory variables train GUI agents to explicitly remember information across long tasks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that mobile GUI agents lose critical past information because screenshot inputs overwhelm context windows, forcing them to discard history. To fix this, STAMP creates simulated environments where specific facts are programmatically inserted into tasks at controlled moments, generating labeled examples of what to encode and when to retrieve it. This produces scalable supervised data plus environment rewards for reinforcement learning. The resulting agent then performs better on long-horizon mobile tasks that require memory while keeping general navigation skills intact.
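The pruning failure described above can be illustrated with a toy simulation. This is a minimal sketch, not the paper's implementation: the `Step` type, the `run_episode` helper, the window size, and the "order code 1234" screen are all hypothetical, and plain strings stand in for screenshots.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class Step:
    index: int
    screen_text: str  # hypothetical stand-in for a screenshot's visible content


def run_episode(steps, window, use_memory):
    """Return everything the agent can still consult after the final step."""
    history = deque(maxlen=window)  # oldest screens fall out automatically
    memory = []                     # explicit notes are never pruned
    for step in steps:
        history.append(step.screen_text)
        # Assumed encoding rule for the toy: note any screen that shows a code.
        if use_memory and "code" in step.screen_text:
            memory.append(step.screen_text)
    return list(history) + memory


# One transient fact at step 0, followed by five filler screens.
steps = [Step(0, "order code 1234")]
steps += [Step(i, f"filler screen {i}") for i in range(1, 6)]

without_memory = run_episode(steps, window=3, use_memory=False)
with_memory = run_episode(steps, window=3, use_memory=True)
```

With a window of three screens, the fact shown at step 0 is gone from `without_memory` by the end of the episode, while `with_memory` still holds it as an explicit note.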

Core claim

STAMP trains explicit memory in mobile GUI agents by building controllable virtual environments that inject deterministic memory variables into synthesized tasks, specifying exactly what must be memorized, when it must be encoded, and when it must be retrieved, thereby yielding verifiable training signals at scale and enabling online reinforcement learning with environment-driven rewards.

What carries the argument

Programmatically injected deterministic memory variables in virtual environments that control encoding and retrieval timing to create supervised memory data.
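One way to picture such an injected variable is as a record that fixes what is memorized, when it is shown, and when it is needed. The sketch below is an assumption-laden illustration only: `MemoryVariable`, `synthesize_task`, the field names, and the string screens are hypothetical and do not come from the paper.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class MemoryVariable:
    name: str           # what must be memorized
    value: str          # deterministic ground-truth value
    encode_step: int    # step at which the value appears on screen
    retrieve_step: int  # step at which the agent must use it


def synthesize_task(seed, horizon=12):
    """Build one toy task containing a single injected memory variable."""
    rng = random.Random(seed)  # seeded, so the same seed yields the same task
    encode_step = rng.randrange(0, horizon // 2)
    retrieve_step = rng.randrange(encode_step + 1, horizon)
    value = f"{rng.randrange(10_000):04d}"
    var = MemoryVariable("order_code", value, encode_step, retrieve_step)

    # Screens are plain strings here; only the encode step shows the value.
    screens = [f"filler screen {i}" for i in range(horizon)]
    screens[encode_step] = f"order code: {value}"

    # Supervised labels: the note to write at each step (None means no note).
    memory_labels = [None] * horizon
    memory_labels[encode_step] = f"{var.name}={value}"
    return var, screens, memory_labels


var, screens, memory_labels = synthesize_task(seed=0)
```

Because the generator controls the value and both step indices, every synthesized task comes with a checkable label for the encoding step and a known answer for the retrieval step.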

If this is right

  • Agents gain the ability to retain transient details across dozens of steps without exhausting context windows.
  • Training data for memory behavior can be generated programmatically instead of collected from expensive real interactions.
  • Online reinforcement learning becomes feasible because environment rewards directly score correct recall timing.
  • General mobile navigation performance remains intact while memory accuracy improves on the Memory-World benchmark.
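The reward idea in the third point can be sketched as a small scoring function. This is a hypothetical illustration, not the paper's reward: the `memory_reward` name, its arguments, and the equal 0.5 weights are assumptions chosen for readability.

```python
def memory_reward(gold, encode_step, retrieve_step, notes, answer_step, answer):
    """Toy reward: 0.5 for encoding on time plus 0.5 for recalling on time.

    notes maps a step index to the text the agent wrote to memory at that step.
    """
    reward = 0.0
    # Encoding credit: the gold value was noted at the step where it appeared.
    if gold in notes.get(encode_step, ""):
        reward += 0.5
    # Recall credit: the exact gold value was submitted at the required step.
    if answer == gold and answer_step == retrieve_step:
        reward += 0.5
    return reward


full = memory_reward("1234", 2, 9, {2: "order_code=1234"}, 9, "1234")
no_note = memory_reward("1234", 2, 9, {}, 9, "1234")
wrong_answer = memory_reward("1234", 2, 9, {2: "order_code=1234"}, 9, "4321")
```

Since the environment knows the gold value and both step indices, the score needs no learned judge; it is a direct comparison against the injected ground truth.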

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same injection technique could be adapted to train memory in non-GUI agents such as web or desktop controllers.
  • If virtual memory variables transfer well, real-device data collection for long-horizon tasks could be reduced.
  • Benchmark results on Memory-World may understate failure modes that appear only when memory variables are not explicitly provided.

Load-bearing premise

That the memory variables and rewards programmed into virtual environments match the memory demands of actual mobile GUI use.

What would settle it

A test showing that an agent trained only on the virtual memory tasks performs no better than baseline agents when given the same long real-world mobile navigation sequences.

Figures

Figures reproduced from arXiv: 2605.29324 by Haiyang Xu, Jieping Ye, Jitao Sang, Junyang Wang, Ming Yan, Xi Zhang, Zhaoqing Zhu.

Figure 1: Existing agents fail by pruning critical in…
Figure 2: The end-to-end framework of STAMP. The proposed pipeline consists of three sequential phases: …
Figure 3: Example of Stamp-GUI predict memory content when appropriate and using memory when needed on an …
Figure 4: General capability evaluation on the full 116 …

Original abstract

Mobile GUI agents excel at immediate reactive control but frequently fail in realistic, long-horizon tasks that require memory. This failure stems from a fundamental conflict between limited context windows and token-heavy screenshots. To save the limited context, agents must progressively discard older visual history, permanently losing crucial transient information. Furthermore, existing action-centric datasets fail to teach agents what or when to explicitly memorize, and augmenting static real-world data is prohibitively expensive and lacks interactive verification. To resolve this, we present STAMP, a framework that trains explicit memory in mobile agents through controllable virtual environments, where deterministic memory variables are programmatically injected into synthesized tasks to control what must be memorized, when it should be encoded, and when it must later be retrieved, thereby producing verifiable supervised data at scale and enabling online reinforcement learning through environment-driven reward feedback. Evaluated on our newly introduced Memory-World benchmark, the resulting Stamp-GUI agent achieves state-of-the-art performance among GUI-specialized models and sets a new high watermark on our Memory-World benchmark, demonstrating exceptional memory accuracy and task resilience while maintaining strong general mobile navigation capabilities.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 0 minor

Summary. The paper introduces the STAMP framework for training explicit memory capabilities in mobile GUI agents. It uses controllable virtual environments in which deterministic memory variables are programmatically injected into synthesized tasks to generate verifiable supervised data at scale and to supply environment-driven rewards for online reinforcement learning. The work also introduces the Memory-World benchmark; the resulting Stamp-GUI agent is claimed to achieve state-of-the-art performance among GUI-specialized models, set a new high watermark on Memory-World, exhibit exceptional memory accuracy and task resilience, and retain strong general mobile navigation capabilities.

Significance. If the central claim holds and the virtual-environment training signal transfers, the approach would supply a scalable, controllable route to teaching long-horizon memory policies that current context-window-limited agents lack, addressing a recognized bottleneck in realistic GUI task execution.

major comments (2)
  1. [Abstract] The claim that Stamp-GUI 'achieves state-of-the-art performance among GUI-specialized models and sets a new high watermark on our Memory-World benchmark' is asserted without any reported metrics, baselines, ablation results, or experimental protocol, rendering the central empirical claim unsupported in the provided text.
  2. [Method description, Abstract] The training signal rests on the assumption that programmatically injected deterministic memory variables and environment rewards accurately capture the memory requirements of real mobile GUI tasks. Real GUIs contain non-deterministic state, implicit visual/layout cues, and app-internal state not exposed as variables; no cross-domain evaluation on unmodified real-device traces or human-annotated memory tasks is described to test transfer of the learned policies.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the thoughtful and constructive comments. We address each major comment below.

Point-by-point responses
  1. Referee: [Abstract] The claim that Stamp-GUI 'achieves state-of-the-art performance among GUI-specialized models and sets a new high watermark on our Memory-World benchmark' is asserted without any reported metrics, baselines, ablation results, or experimental protocol, rendering the central empirical claim unsupported in the provided text.

    Authors: We agree that the abstract asserts the central empirical claims without including supporting metrics or protocol details. The full manuscript reports these results in the experiments section. We will revise the abstract to include key quantitative results (e.g., Memory-World accuracy scores and baseline comparisons) and a concise reference to the evaluation setup. revision: yes

  2. Referee: [Method description, Abstract] The training signal rests on the assumption that programmatically injected deterministic memory variables and environment rewards accurately capture the memory requirements of real mobile GUI tasks. Real GUIs contain non-deterministic state, implicit visual/layout cues, and app-internal state not exposed as variables; no cross-domain evaluation on unmodified real-device traces or human-annotated memory tasks is described to test transfer of the learned policies.

    Authors: We acknowledge the assumption underlying the virtual-environment training signal and the absence of cross-domain transfer experiments on real devices. The manuscript deliberately centers on controllable virtual environments to enable scalable, verifiable memory supervision; it does not claim or evaluate transfer to unmodified real-device traces. We will revise the method description and abstract to state this scope more explicitly and will add a limitations paragraph discussing the gap to real-world non-deterministic GUIs as future work. revision: partial

Circularity Check

0 steps flagged

No circularity: training and benchmark both rely on explicitly injected variables, but the performance metric is not forced by construction.

full rationale

The paper describes a methodological framework that programmatically injects deterministic memory variables into synthesized virtual tasks to generate training signals and then evaluates the resulting agent on the newly introduced Memory-World benchmark. No equations, self-citations, or derivations are presented that reduce the reported SOTA memory accuracy or task resilience to a tautological fit or renaming of the injected variables themselves. The central claim remains an empirical statement about agent behavior in the constructed environments rather than a self-referential prediction. This is the most common honest non-finding for a methods paper that introduces both its training distribution and its evaluation distribution.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The abstract provides no information on free parameters, axioms, or invented entities.

pith-pipeline@v0.9.1-grok · 5741 in / 894 out tokens · 22442 ms · 2026-06-29T07:42:48.304483+00:00 · methodology
