arxiv: 2605.18652 · v1 · pith:AY5EP5PCnew · submitted 2026-05-18 · 💻 cs.CV

MementoGUI: Learning Agentic Multimodal Memory Control for Long-Horizon GUI Agents

Pith reviewed 2026-05-20 11:52 UTC · model grok-4.3

classification 💻 cs.CV
keywords GUI agents · multimodal memory · long-horizon tasks · memory controller · episodic memory · memory compression · agentic framework · MementoCore

The pith

GUI agents handle long tasks better when a learned controller selects and compresses multimodal memory instead of replaying full history or using text alone.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents MementoGUI as a plug-in system that adds a controllable memory layer to existing MLLM-based GUI agents for tasks spanning many interface steps. Standard approaches either flood the model with every past screenshot or strip away visual details by summarizing in text only, leading to brittle performance when context must persist across dozens of actions. MementoGUI treats memory management as a learnable online control problem: a dedicated MementoCore module decides which events to retain with short text summaries plus targeted image regions, writes reusable past episodes, and retrieves them when relevant. Experiments across three benchmarks show consistent gains over no-history, full-replay, and text-only baselines, with larger memory-controller models amplifying the benefit. If the approach holds, agents could sustain coherent behavior on extended computer-use workflows without manual history engineering or context overflow.

Core claim

MementoGUI formulates long-horizon GUI control as an online memory-control problem solved by MementoCore, a modular learned controller that performs step processing, memory compression into textual summaries with ROI-level visual evidence, episodic writing of reusable trajectories, and relevance-based episodic selection, allowing plug-in augmentation of any MLLM GUI agent backbone without finetuning while delivering measurable gains on GUI-Odyssey, MM-Mind2Web, and the introduced MementoGUI-Bench.

What carries the argument

MementoCore, a modular learned controller with specialized operators for step processing, memory compression, episodic writing, and episodic selection that selectively preserves task-relevant interface events using both text and visual regions.
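
As a reading aid, the control loop described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: the names StepDecision, MemoryEntry, WorkingMemory, merge_oldest, token_overlap, and select_episodes are hypothetical, the capacity and write threshold are arbitrary, and both the compressor and the relevance score are simple stand-ins for what the paper describes as learned operators. The StepDecision fields mirror the per-step outputs named in the Figure 2 caption excerpt (write-salience score, event summary, ROI box, retrieval flag).

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional, Tuple

Box = Tuple[int, int, int, int]  # hypothetical ROI format: (x0, y0, x1, y1)


@dataclass
class StepDecision:
    """Per-step controller output (field names are illustrative)."""
    salience: float         # write-salience score in [0, 1]
    summary: str            # short text summary of the interface event
    roi: Optional[Box]      # task-relevant region of the screenshot, if any
    needs_retrieval: bool   # whether episodic memory should be consulted


@dataclass
class MemoryEntry:
    step: int
    summary: str
    roi: Optional[Box]


Compressor = Callable[[List[MemoryEntry], int], List[MemoryEntry]]


def merge_oldest(entries: List[MemoryEntry], capacity: int) -> List[MemoryEntry]:
    """Stand-in compressor: fold the oldest entries into one text-only entry."""
    overflow = len(entries) - capacity + 1
    head, tail = entries[:overflow], entries[overflow:]
    merged = MemoryEntry(head[0].step, " ; ".join(e.summary for e in head), None)
    return [merged] + tail


@dataclass
class WorkingMemory:
    capacity: int = 4              # assumed bound, not taken from the paper
    write_threshold: float = 0.5   # assumed cutoff on the salience score
    entries: List[MemoryEntry] = field(default_factory=list)

    def update(self, step: int, decision: StepDecision,
               compress: Compressor = merge_oldest) -> None:
        if decision.salience < self.write_threshold:
            return  # low-salience steps are not written
        self.entries.append(MemoryEntry(step, decision.summary, decision.roi))
        if len(self.entries) > self.capacity:
            self.entries = compress(self.entries, self.capacity)

    def serialize(self) -> str:
        """Render memory as text lines with ROI references for the backbone."""
        return "\n".join(
            f"[step {e.step}] {e.summary}"
            + (f" roi={e.roi}" if e.roi is not None else "")
            for e in self.entries
        )


def token_overlap(a: str, b: str) -> float:
    """Stand-in relevance score: Jaccard overlap of lowercase tokens."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb) if (ta | tb) else 0.0


def select_episodes(query: str, bank: List[str], k: int = 1) -> List[str]:
    """Return the k stored episode descriptions most relevant to the query."""
    return sorted(bank, key=lambda ep: token_overlap(query, ep), reverse=True)[:k]
```

In this toy version a low-salience step is never written, and overflow folds the oldest entries into one text-only line, which drops their ROI references. The paper describes compression as retaining ROI-level visual evidence, so a faithful version would also have to decide which regions survive.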

If this is right

  • Agents maintain coherent task state across dozens of steps without being overwhelmed by redundant screenshots.
  • Retaining localized visual evidence alongside text summaries improves decision accuracy over text-only memory.
  • Performance scales with the size of the MementoCore backbone, indicating that stronger memory control models yield larger gains.
  • MementoGUI-Bench and the associated MLLM-based metrics provide a standardized way to evaluate memory consistency in long GUI tasks.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same selective multimodal memory pattern could transfer to other long-horizon agent domains such as web navigation or robotic manipulation where full history replay is impractical.
  • If the curation pipeline proves robust, agent designers may shift effort from hand-crafted history rules to training data generation for memory controllers.
  • Testing whether the learned controller remains effective when the underlying GUI agent backbone is updated or replaced would clarify the degree of modularity achieved.

Load-bearing premise

The automated pipeline that converts raw computer-use trajectories into training data for the memory controller produces unbiased, high-quality examples that generalize directly to the evaluation benchmarks without post-hoc filtering or task-specific adjustments.

What would settle it

Apply MementoGUI to a new long-horizon GUI benchmark whose trajectories were collected independently of the training data curation pipeline and measure whether accuracy, task progress, and memory consistency still exceed the no-history, history-replay, and text-only baselines.

Figures

Figures reproduced from arXiv: 2605.18652 by Bocheng Zou, Hang Hua, Jiebo Luo, Mu Cai, Rogerio Feris, Ziyun Zeng.

Figure 1: Overview of the MementoGUI data curation pipeline. (A) Raw computer-use videos are parsed into hierarchical frame- and subgoal-level annotations. (B) These annotations are converted into SFT data for four MementoCore operators: step processing, memory compression, episodic memory writing, and episodic memory selection. (C) Step-processing and memory-compression samples are further corrupted and VLM-filtere…

Figure 2: MementoGUI augments a frozen GUI action backbone with multimodal working and episodic memory. It updates, retrieves, and writes memory, then serializes textual summaries and ROI references as multimodal context for GUI action prediction. where $o_t \in [0, 1]$ is a write-salience score, $s_t$ is an event summary, $b_t$ is a task-relevant ROI box, and $\gamma_t$ indicates whether episodic retrieval is needed. This yields a pr…

Figure 3: GUI-Odyssey performance by trajectory length on UI-Venus-1.5-8B.

Figure 4: Effect of episodic memory bank size on GUI-Odyssey across frozen GUI backbones.
Original abstract

Recent GUI agents have made substantial progress in visual grounding and action prediction, yet they remain brittle in long-horizon tasks that require maintaining task state across many interface transitions. Existing agents typically rely on raw history replay or text-only memory, which either overwhelms the model with redundant screenshots or discards localized visual evidence needed for future decisions. To address these limitations, we introduce MementoGUI, a plug-in agentic memory framework that equips MLLM-based GUI agents with MementoCore, a learned controller for online memory selection, compression, and retrieval. Rather than treating interaction history as a fixed context, MementoGUI formulates long-horizon GUI control as an online memory-control problem: working memory selectively preserves task-relevant interface events with textual summaries and ROI-level visual evidence, while episodic memory retrieves reusable past trajectories through learned relevance selection. MementoCore modularizes memory control into specialized operators for step processing, memory compression, episodic writing, and episodic selection, enabling plug-in memory augmentation without finetuning the GUI agent backbone. We further develop a scalable data curation pipeline that converts computer-use trajectories into memory-controller training data, introduce MementoGUI-Bench for evaluating long-horizon decision-making in GUI agents, and design MLLM-based metrics for semantic action matching, task progress, and memory consistency. Experiments on GUI-Odyssey, MM-Mind2Web, and MementoGUI-Bench show that MementoGUI consistently improves GUI agents over no-history, history-replay, and text-only memory baselines, with larger MementoCore backbones further strengthening memory-augmented GUI control.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces MementoGUI, a plug-in agentic memory framework for MLLM-based GUI agents that equips them with MementoCore, a learned controller for online memory selection, compression, and retrieval. It modularizes memory control into operators for step processing, compression, episodic writing, and selection, trained via a scalable data curation pipeline that converts computer-use trajectories into supervision. The authors introduce MementoGUI-Bench and MLLM-based metrics for semantic action matching, task progress, and memory consistency. Experiments on GUI-Odyssey, MM-Mind2Web, and MementoGUI-Bench claim consistent improvements over no-history, history-replay, and text-only baselines, with larger MementoCore backbones strengthening results.

Significance. If the central claims hold after verification, the modular plug-in design of MementoCore (without finetuning the GUI agent backbone) and the new benchmark could meaningfully advance long-horizon GUI agent research by addressing memory brittleness in a reusable way. The framework's separation of working and episodic memory with ROI-level visual evidence is a clear strength over raw replay or text-only approaches.

major comments (2)
  1. [Method / Data Curation] Data curation pipeline (described in the method section following the abstract): no mechanics are provided for negative sampling, relevance labeling, overlap detection between curated trajectories and the evaluation benchmarks (GUI-Odyssey, MM-Mind2Web, MementoGUI-Bench), or explicit controls against distribution shift and post-hoc filtering. This is load-bearing for the central claim of unbiased generalization and consistent gains, as the improvements are attributed to the learned MementoCore trained on this pipeline.
  2. [Experiments] Experimental results (abstract and §5): the claims of 'consistent improvements' and of 'larger MementoCore backbones further strengthening' results are stated without reference to specific quantitative tables, error bars, ablation details on the curation choices, or statistical significance tests. This prevents assessment of whether baseline comparisons or metric definitions contain post-hoc choices that affect the reported gains.
minor comments (2)
  1. [Method] The description of MementoCore operators would benefit from a single pseudocode listing or diagram showing the flow from step processing to episodic selection.
  2. [Benchmarks] Clarify whether MementoGUI-Bench tasks were held out from the data curation pipeline or if any filtering was applied to avoid train-test leakage.
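
The leakage concern in minor comment 2 can be made concrete. The sketch below assumes each trajectory is available as a sequence of action strings and drops a training trajectory when its normalized edit distance to any benchmark trajectory falls below a threshold. The function names, the action vocabulary, and the 0.2 threshold are hypothetical and not taken from the paper; a visual-similarity check would be a separate step and is omitted here.

```python
from typing import List, Sequence


def edit_distance(a: Sequence[str], b: Sequence[str]) -> int:
    """Levenshtein distance over action tokens (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            cost = 0 if x == y else 1
            curr.append(min(prev[j] + 1,          # delete x
                            curr[j - 1] + 1,      # insert y
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def normalized_edit_distance(a: Sequence[str], b: Sequence[str]) -> float:
    """Edit distance divided by the longer length, so the result lies in [0, 1]."""
    longest = max(len(a), len(b))
    return edit_distance(a, b) / longest if longest else 0.0


def filter_overlaps(train: List[Sequence[str]],
                    bench: List[Sequence[str]],
                    min_distance: float = 0.2) -> List[Sequence[str]]:
    """Keep training trajectories that are far enough from every benchmark one."""
    return [
        t for t in train
        if all(normalized_edit_distance(t, b) >= min_distance for b in bench)
    ]
```

Action-sequence distance alone would miss trajectories that reach the same screens through different actions, which is one reason a report of the actual procedure and its thresholds matters.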

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback and the recommendation for major revision. The comments highlight opportunities to improve transparency in the data curation pipeline and the reporting of experimental results. We address each point below and will incorporate the necessary clarifications and additions in the revised manuscript.

Point-by-point responses
  1. Referee: [Method / Data Curation] Data curation pipeline (described in the method section following the abstract): no mechanics are provided for negative sampling, relevance labeling, overlap detection between curated trajectories and the evaluation benchmarks (GUI-Odyssey, MM-Mind2Web, MementoGUI-Bench), or explicit controls against distribution shift and post-hoc filtering. This is load-bearing for the central claim of unbiased generalization and consistent gains, as the improvements are attributed to the learned MementoCore trained on this pipeline.

    Authors: We agree that the current description of the data curation pipeline in Section 3.3 would benefit from greater detail on these aspects. In the revised manuscript we will expand the pipeline description to explicitly cover: (1) negative sampling by selecting trajectories with low semantic overlap to the target task and assigning corresponding low relevance labels via MLLM scoring; (2) relevance labeling using a combination of action-sequence similarity and visual ROI matching; (3) overlap detection between curated training trajectories and the three evaluation benchmarks via normalized edit distance on action sequences and visual feature cosine similarity, with any overlapping trajectories removed; and (4) controls for distribution shift including temporal hold-out splits and post-hoc filtering based on trajectory length and task category balance. These additions will make the training data generation process fully reproducible and strengthen the claim of unbiased generalization. revision: yes

  2. Referee: [Experiments] Experimental results (abstract and §5): the claims of 'consistent improvements' and of 'larger MementoCore backbones further strengthening' results are stated without reference to specific quantitative tables, error bars, ablation details on the curation choices, or statistical significance tests. This prevents assessment of whether baseline comparisons or metric definitions contain post-hoc choices that affect the reported gains.

    Authors: We acknowledge that the experimental claims in the abstract and Section 5 would be clearer with tighter linkage to the reported numbers. In the revised version we will: (1) insert explicit cross-references (e.g., “as shown in Table 2, row 3”) for every statement of improvement; (2) add error bars representing standard deviation across five random seeds for all main results; (3) include a new ablation table examining the impact of negative-sampling ratio and relevance-labeling threshold on final performance; and (4) report paired t-test p-values comparing MementoGUI against each baseline to establish statistical significance. These changes will allow readers to directly evaluate the robustness of the gains and the absence of post-hoc metric adjustments. revision: yes
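
For readers who want to see what the promised seed-level reporting involves, the sketch below computes a mean with sample standard deviation and a paired t statistic from per-seed scores. It uses only the Python standard library and stops at the t statistic and degrees of freedom; turning those into a p-value needs a t-distribution table or a statistics library. The function names are hypothetical, and nothing here reproduces numbers from the paper.

```python
import math
import statistics
from typing import Sequence, Tuple


def mean_and_std(scores: Sequence[float]) -> Tuple[float, float]:
    """Mean and sample standard deviation across seeds (for error bars)."""
    return statistics.mean(scores), statistics.stdev(scores)


def paired_t_statistic(treatment: Sequence[float],
                       baseline: Sequence[float]) -> Tuple[float, int]:
    """Paired t statistic and degrees of freedom for matched per-seed scores."""
    if len(treatment) != len(baseline):
        raise ValueError("paired test needs one baseline score per treatment score")
    if len(treatment) < 2:
        raise ValueError("need at least two paired observations")
    diffs = [t - b for t, b in zip(treatment, baseline)]
    sd = statistics.stdev(diffs)
    if sd == 0:
        raise ValueError("all paired differences are identical; t is undefined")
    # t = mean difference divided by its standard error
    t_stat = statistics.mean(diffs) / (sd / math.sqrt(len(diffs)))
    return t_stat, len(diffs) - 1
```

The pairing matters: each seed's memory-augmented score is compared with the baseline score from the same seed, so seed-to-seed variance shared by both runs cancels in the differences.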

Circularity Check

0 steps flagged

No circularity: empirical results from independent training and evaluation

The paper introduces MementoGUI as a modular plug-in memory controller with MementoCore operators trained via a data curation pipeline on computer-use trajectories. Reported gains on GUI-Odyssey, MM-Mind2Web, and MementoGUI-Bench are presented as direct experimental comparisons to no-history, history-replay, and text-only baselines. No equations, self-definitional reductions, fitted inputs renamed as predictions, or load-bearing self-citations appear in the derivation; the central claims rest on external benchmark performance rather than quantities defined by the same parameters or inputs used to generate the result. The framework is self-contained against the stated evaluation metrics.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 2 invented entities

The central claim rests on the existence of a scalable, unbiased trajectory-to-training-data pipeline and on the assumption that a separate learned controller can be attached without retraining the base MLLM. No explicit numerical free parameters are named, but the learned relevance selection and compression operators implicitly contain fitted weights. MementoCore and MementoGUI-Bench are new entities introduced without external validation.

free parameters (1)
  • Learned relevance and compression parameters inside MementoCore
    Weights of the memory-selection and compression operators are fitted on the curated trajectory data.
axioms (1)
  • domain assumption: A plug-in memory controller can be trained and attached to an unchanged MLLM backbone while still producing measurable gains on long-horizon tasks
    The design explicitly avoids finetuning the GUI agent backbone.
invented entities (2)
  • MementoCore (no independent evidence)
    purpose: Learned controller that performs step processing, memory compression, episodic writing, and episodic selection
    New modular component introduced to manage multimodal memory.
  • MementoGUI-Bench (no independent evidence)
    purpose: Benchmark for long-horizon decision-making and memory consistency in GUI agents
    New evaluation suite developed for the paper.

pith-pipeline@v0.9.0 · 5844 in / 1536 out tokens · 41985 ms · 2026-05-20T11:52:20.522655+00:00 · methodology

Reference graph

Works this paper leans on

87 extracted references · 87 canonical work pages · 20 internal anchors

  1. [1]

    Agent S2: A Compositional Generalist-Specialist Framework for Computer Use Agents

    Saaket Agashe, Kyle Wong, Vincent Tu, Jiachen Yang, Ang Li, and Xin Eric Wang. Agent s2: A compositional generalist-specialist framework for computer use agents.arXiv preprint arXiv:2504.00906, 2025

  2. [2]

    Sparc: Separating perception and reasoning circuits for test-time scaling of vlms.arXiv preprint arXiv:2602.06566, 2026

    Niccolo Avogaro, Nayanika Debnath, Li Mi, Thomas Frick, Junling Wang, Zexue He, Hang Hua, Konrad Schindler, and Mattia Rigotti. Sparc: Separating perception and reasoning circuits for test-time scaling of vlms.arXiv preprint arXiv:2602.06566, 2026

  3. [3]

    Qwen3-VL Technical Report

    Shuai Bai, Yuxuan Cai, Ruizhe Chen, Keqin Chen, Xionghui Chen, Zesen Cheng, Lianghao Deng, Wei Ding, Chang Gao, Chunjiang Ge, et al. Qwen3-vl technical report.arXiv preprint arXiv:2511.21631, 2025

  4. [4]

    Memori: A persistent memory layer for efficient, context-aware llm agents.arXiv preprint arXiv:2603.19935, 2026

    Luiz C Borro, Luiz AB Macarini, Gordon Tindall, Michael Montero, and Adam B Struck. Memori: A persistent memory layer for efficient, context-aware llm agents.arXiv preprint arXiv:2603.19935, 2026

  5. [5]

    Presto: Pro- gressive pretraining enhances synthetic chemistry outcomes.arXiv preprint arXiv:2406.13193, 2024

    He Cao, Yanjun Shao, Zhiyuan Liu, Zijing Liu, Xiangru Tang, Yuan Yao, and Yu Li. Presto: Pro- gressive pretraining enhances synthetic chemistry outcomes.arXiv preprint arXiv:2406.13193, 2024

  6. [6]

    Telemem: Building long-term and multimodal memory for agentic ai.arXiv preprint arXiv:2601.06037, 2025

    Chunliang Chen, Ming Guan, Xiao Lin, Jiaxu Li, Luxi Lin, Qiyi Wang, Xiangyu Chen, Jixiang Luo, Changzhi Sun, Dell Zhang, et al. Telemem: Building long-term and multimodal memory for agentic ai.arXiv preprint arXiv:2601.06037, 2025

  7. [7]

    Less is more: Empowering gui agent with context- aware simplification

    Gongwei Chen, Xurui Zhou, Rui Shao, Yibo Lyu, Kaiwen Zhou, Shuai Wang, Wentao Li, Yinchuan Li, Zhongang Qi, and Liqiang Nie. Less is more: Empowering gui agent with context- aware simplification. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 5901–5911, 2025

  8. [8]

    Seeclick: Harnessing gui grounding for advanced visual gui agents

    Kanzhi Cheng, Qiushi Sun, Yougang Chu, Fangzhi Xu, Li YanTao, Jianbing Zhang, and Zhiyong Wu. Seeclick: Harnessing gui grounding for advanced visual gui agents. InProceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 9313–9332, 2024

  9. [9]

    MGA: Memory-Driven GUI Agent for Observation-Centric Interaction

    Weihua Cheng, Ersheng Ni, Wenlong Wang, Yifei Sun, Junming Liu, Wangyu Shen, Yirong Chen, Botian Shi, and Ding Wang. Mga: Memory-driven gui agent for observation-centric interaction.arXiv preprint arXiv:2510.24168, 2025

  10. [10]

    Mind2web: Towards a generalist agent for the web.Advances in Neural Information Processing Systems, 36:28091–28114, 2023

    Xiang Deng, Yu Gu, Boyuan Zheng, Shijie Chen, Sam Stevens, Boshi Wang, Huan Sun, and Yu Su. Mind2web: Towards a generalist agent for the web.Advances in Neural Information Processing Systems, 36:28091–28114, 2023

  11. [11]

    Memp: Exploring Agent Procedural Memory

    Runnan Fang, Yuan Liang, Xiaobin Wang, Jialong Wu, Shuofei Qiao, Pengjun Xie, Fei Huang, Huajun Chen, and Ningyu Zhang. Memp: Exploring agent procedural memory.arXiv preprint arXiv:2508.06433, 2025

  12. [12]

    Ui-venus-1.5 technical report.arXiv preprint arXiv:2602.09082, 2026

    Changlong Gao, Zhangxuan Gu, Yulin Liu, Xinyu Qiu, Shuheng Shen, Yue Wen, Tianyu Xia, Zhenyu Xu, Zhengwen Zeng, Beitong Zhou, et al. Ui-venus-1.5 technical report.arXiv preprint arXiv:2602.09082, 2026

  13. [13]

    Chain-of-memory: Enhancing gui agents for cross-application navigation.arXiv preprint arXiv:2506.18158, 2025

    Xinzge Gao, Chuanrui Hu, Bin Chen, and Teng Li. Chain-of-memory: Enhancing gui agents for cross-application navigation.arXiv preprint arXiv:2506.18158, 2025

  14. [14]

    Gemini 3.1 Pro Model Card

    Google DeepMind. Gemini 3.1 Pro Model Card. https://storage.googleapis.com/ deepmind-media/Model-Cards/Gemini-3-1-Pro-Model-Card.pdf , February 2026. Published February 2026; updated 19 February 2026

  15. [15]

    Navigating the Digital World as Humans Do: Universal Visual Grounding for GUI Agents

    Boyu Gou, Ruohan Wang, Boyuan Zheng, Yanan Xie, Cheng Chang, Yiheng Shu, Huan Sun, and Yu Su. Navigating the digital world as humans do: Universal visual grounding for gui agents.arXiv preprint arXiv:2410.05243, 2024. 10

  16. [16]

    MemFactory: Unified Inference & Training Framework for Agent Memory

    Ziliang Guo, Ziheng Li, and Zhiyu Li. Memfactory: Unified inference & training framework for agent memory.arXiv preprint arXiv:2603.29493, 2026

  17. [17]

    Cogagent: A visual language model for gui agents

    Wenyi Hong, Weihan Wang, Qingsong Lv, Jiazheng Xu, Wenmeng Yu, Junhui Ji, Yan Wang, Zihan Wang, Yuxiao Dong, Ming Ding, et al. Cogagent: A visual language model for gui agents. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 14281–14290, 2024

  18. [18]

    Computer use data - paradigm shift ai, 2025

    Anais Howland, Ashwin Thinnappan, and Jameel Shahid Mohammed. Computer use data - paradigm shift ai, 2025

  19. [19]

    Hiagent: Hierarchical working memory management for solving long-horizon agent tasks with large lan- guage model

    Mengkang Hu, Tianxing Chen, Qiguang Chen, Yao Mu, Wenqi Shao, and Ping Luo. Hiagent: Hierarchical working memory management for solving long-horizon agent tasks with large lan- guage model. InProceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 32779–32798, 2025

  20. [20]

    Smith, and Jiebo Luo

    Yushi Hu, Hang Hua, Zhengyuan Yang, Weijia Shi, Noah A. Smith, and Jiebo Luo. Promptcap: Prompt-guided image captioning for vqa with gpt-3.2023 IEEE/CVF International Conference on Computer Vision (ICCV), pages 2951–2963, 2023

  21. [21]

    Finecaption: Compositional image captioning focusing on wherever you want at any granularity

    Hang Hua, Qing Liu, Lingzhi Zhang, Jing Shi, Soo Ye Kim, Zhifei Zhang, Yilin Wang, Jianming Zhang, Zhe Lin, and Jiebo Luo. Finecaption: Compositional image captioning focusing on wherever you want at any granularity. InProceedings of the computer vision and pattern recognition conference, pages 24763–24773, 2025

  22. [22]

    Collomosse, Scott Cohen, and Jiebo Luo

    Hang Hua, Jing Shi, Kushal Kafle, Simon Jenni, Daoan Zhang, John P. Collomosse, Scott Cohen, and Jiebo Luo. Finematch: Aspect-based fine-grained image and text mismatch detection and correction. InEuropean Conference on Computer Vision, 2024

  23. [23]

    V2xum-llm: Cross-modal video summarization with temporal prompt instruction tuning

    Hang Hua, Yunlong Tang, Chenliang Xu, and Jiebo Luo. V2xum-llm: Cross-modal video summarization with temporal prompt instruction tuning. InProceedings of the AAAI Conference on Artificial Intelligence, volume 39, pages 3599–3607, 2025

  24. [24]

    Mmcomposition: Revisiting the compositionality of pre-trained vision-language models.arXiv preprint arXiv:2410.09733,

    Hang Hua, Yunlong Tang, Ziyun Zeng, Liangliang Cao, Zhengyuan Yang, Hangfeng He, Chenliang Xu, and Jiebo Luo. Mmcomposition: Revisiting the compositionality of pre-trained vision-language models.arXiv preprint arXiv:2410.09733, 2024

  25. [25]

    Aliaga, Wei Xiong, and Jiebo Luo

    Hang Hua, Ziyun Zeng, Yizhi Song, Yunlong Tang, Liu He, Daniel G. Aliaga, Wei Xiong, and Jiebo Luo. Mmig-bench: Towards comprehensive and explainable evaluation of multi-modal image generation models.ArXiv, abs/2505.19415, 2025

  26. [26]

    Dave: A vlm vision encoder for document understanding and web agents.arXiv preprint arXiv:2512.17221, 2025

    Brandon Huang, Hang Hua, Zhuoran Yu, Trevor Darrell, Rogerio Feris, and Roei Herzig. Dave: A vlm vision encoder for document understanding and web agents.arXiv preprint arXiv:2512.17221, 2025

  27. [27]

    Building a foundational guardrail for general agentic systems via synthetic data.arXiv preprint arXiv:2510.09781, 2025

    Yue Huang, Hang Hua, Yujun Zhou, Pengcheng Jing, Manish Nagireddy, Inkit Padhi, Greta Dolcetti, Zhangchen Xu, Subhajit Chaudhury, Ambrish Rawat, et al. Building a foundational guardrail for general agentic systems via synthetic data.arXiv preprint arXiv:2510.09781, 2025

  28. [28]

    Ap- pagentx: Evolving gui agents as proficient smartphone users.arXiv preprint arXiv:2503.02268, 2025

    Wenjia Jiang, Yangyang Zhuang, Chenxi Song, Xu Yang, Joey Tianyi Zhou, and Chi Zhang. Ap- pagentx: Evolving gui agents as proficient smartphone users.arXiv preprint arXiv:2503.02268, 2025

  29. [29]

    Visualwebarena: Evaluating multimodal agents on realistic visual web tasks

    Jing Yu Koh, Robert Lo, Lawrence Jang, Vikram Duvvur, Ming Lim, Po-Yu Huang, Graham Neubig, Shuyan Zhou, Russ Salakhutdinov, and Daniel Fried. Visualwebarena: Evaluating multimodal agents on realistic visual web tasks. InProceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 881–905, 2024

  30. [30]

    Mobilegpt: Augmenting llm with human-like app memory for mobile task automation

    Sunjae Lee, Junyoung Choi, Jungjae Lee, Munim Hasan Wasi, Hojun Choi, Steve Ko, Sangeun Oh, and Insik Shin. Mobilegpt: Augmenting llm with human-like app memory for mobile task automation. InProceedings of the 30th Annual International Conference on Mobile Computing and Networking, pages 1119–1133, 2024. 11

  31. [31]

    Grounding multimodal large language model in gui world

    Weixian Lei, Difei Gao, and Mike Zheng Shou. Grounding multimodal large language model in gui world. InThe Thirteenth International Conference on Learning Representations, 2025

  32. [32]

    EchoTrail-GUI: Building Actionable Memory for GUI Agents via Critic-Guided Self-Exploration

    Runze Li, Yuwen Zhai, Bo Xu, LiWu Xu, Nian Shi, Wei Zhang, Ran Lin, and Liang Wang. Echotrail-gui: Building actionable memory for gui agents via critic-guided self-exploration. arXiv preprint arXiv:2512.19396, 2025

  33. [33]

    Optimus-1: Hybrid multimodal memory empowered agents excel in long-horizon tasks.Advances in neural information processing systems, 37:49881–49913, 2024

    Zaijing Li, Yuquan Xie, Rui Shao, Gongwei Chen, Dongmei Jiang, and Liqiang Nie. Optimus-1: Hybrid multimodal memory empowered agents excel in long-horizon tasks.Advances in neural information processing systems, 37:49881–49913, 2024

  34. [34]

    Showui: One vision-language-action model for gui visual agent

    Kevin Qinghong Lin, Linjie Li, Difei Gao, Zhengyuan Yang, Shiwei Wu, Zechen Bai, Stan Weix- ian Lei, Lijuan Wang, and Mike Zheng Shou. Showui: One vision-language-action model for gui visual agent. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 19498–19508, 2025

  35. [35]

    Visual instruction tuning.Advances in neural information processing systems, 36:34892–34916, 2023

    Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning.Advances in neural information processing systems, 36:34892–34916, 2023

  36. [36]

    SimpleMem: Efficient Lifelong Memory for LLM Agents

    Jiaqi Liu, Yaofeng Su, Peng Xia, Siwei Han, Zeyu Zheng, Cihang Xie, Mingyu Ding, and Huaxiu Yao. Simplemem: Efficient lifelong memory for llm agents.arXiv preprint arXiv:2601.02553, 2026

  37. [37]

    Osexpert: Computer-use agents learning professional skills via exploration.arXiv preprint arXiv:2603.07978, 2026

    Jiateng Liu, Zhenhailong Wang, Rushi Wang, Bingxuan Li, Jeonghwan Kim, Aditi Tiwari, Pengfei Yu, Denghui Zhang, and Heng Ji. Osexpert: Computer-use agents learning professional skills via exploration.arXiv preprint arXiv:2603.07978, 2026

  38. [38]

    Memverse: Multimodal memory for lifelong learning agents.arXiv preprint arXiv:2512.03627, 2025

    Junming Liu, Yifei Sun, Weihua Cheng, Haodong Lei, Yirong Chen, Licheng Wen, Xuemeng Yang, Daocheng Fu, Pinlong Cai, Nianchen Deng, et al. Memverse: Multimodal memory for lifelong learning agents.arXiv preprint arXiv:2512.03627, 2025

  39. [39]

    Scalecua: Scaling open-source computer use agents with cross-platform data.arXiv preprint arXiv:2509.15221, 2025

    Zhaoyang Liu, JingJing Xie, Zichen Ding, Zehao Li, Bowen Yang, Zhenyu Wu, Xuehui Wang, Qiushi Sun, Shi Liu, Weiyun Wang, et al. Scalecua: Scaling open-source computer use agents with cross-platform data.arXiv preprint arXiv:2509.15221, 2025

  40. [40]

    Guiodyssey: A comprehensive dataset for cross-app gui navigation on mobile devices

    Quanfeng Lu, Wenqi Shao, Zitao Liu, Lingxiao Du, Fanqing Meng, Boxuan Li, Botong Chen, Siyuan Huang, Kaipeng Zhang, and Ping Luo. Guiodyssey: A comprehensive dataset for cross-app gui navigation on mobile devices. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 22404–22414, 2025

  41. [41]

    Locas: Your models are principled initializers of locally-supported parametric memories.arXiv preprint arXiv:2602.05085, 2026

    Sidi Lu, Zhenwen Liang, Dongyang Ma, Yan Wang, Haitao Mi, and Dong Yu. Locas: Your models are principled initializers of locally-supported parametric memories.arXiv preprint arXiv:2602.05085, 2026

  42. [42]

    ishift: Lightweight slow-fast gui agent with adaptive perception.arXiv preprint arXiv:2512.22009, 2025

    Sarthak Mehrotra, Sairam VC Rebbapragada, Mani Hemanth Reddy Bonthu, and Vineeth N Balasubramanian. ishift: Lightweight slow-fast gui agent with adaptive perception.arXiv preprint arXiv:2512.22009, 2025

  43. [43]

    Generative agents: Interactive simulacra of human behavior

    Joon Sung Park, Joseph O’Brien, Carrie Jun Cai, Meredith Ringel Morris, Percy Liang, and Michael S Bernstein. Generative agents: Interactive simulacra of human behavior. InProceed- ings of the 36th annual acm symposium on user interface software and technology, pages 1–22, 2023

  44. [44]

    UI-TARS: Pioneering Automated GUI Interaction with Native Agents

    Yujia Qin, Yining Ye, Junjie Fang, Haoming Wang, Shihao Liang, Shizuo Tian, Junda Zhang, Jiahao Li, Yunxin Li, Shijue Huang, et al. Ui-tars: Pioneering automated gui interaction with native agents.arXiv preprint arXiv:2501.12326, 2025

  45. [45]

    An- droidinthewild: A large-scale dataset for android device control.Advances in Neural Information Processing Systems, 36:59708–59728, 2023

    Christopher Rawles, Alice Li, Daniel Rodriguez, Oriana Riva, and Timothy Lillicrap. An- droidinthewild: A large-scale dataset for android device control.Advances in Neural Information Processing Systems, 36:59708–59728, 2023

  46. [46]

    Vlm agents generate their own memories: Distilling experience into embodied programs of thought

    Gabriel Sarch, Lawrence Jang, Michael J Tarr, William W Cohen, Kenneth Marino, and Katerina Fragkiadaki. Vlm agents generate their own memories: Distilling experience into embodied programs of thought. Advances in Neural Information Processing Systems, 37:75942–75985, 2024

  47. [47]

    Reflexion: Language agents with verbal reinforcement learning

    Noah Shinn, Federico Cassano, Ashwin Gopinath, Karthik Narasimhan, and Shunyu Yao. Reflexion: Language agents with verbal reinforcement learning. Advances in Neural Information Processing Systems, 36:8634–8652, 2023

  48. [48]

    OpenAI GPT-5 System Card

    Aaditya Singh, Adam Fry, Adam Perelman, Adam Tart, Adi Ganesh, Ahmed El-Kishky, Aidan McLaughlin, Aiden Low, AJ Ostrow, Akhila Ananthram, et al. Openai gpt-5 system card. arXiv preprint arXiv:2601.03267, 2026

  49. [49]

    Beyond heuristics: A decision-theoretic framework for agent memory management

    Changzhi Sun, Xiangyu Chen, Jixiang Luo, Dell Zhang, and Xuelong Li. Beyond heuristics: A decision-theoretic framework for agent memory management. arXiv preprint arXiv:2512.21567, 2025

  50. [50]

    Latent chain-of-thought for visual reasoning

    Guohao Sun, Hang Hua, Jian Wang, Jiebo Luo, Sohail A. Dianat, Majid Rabbani, Raghuveer Rao, and Zhiqiang Tao. Latent chain-of-thought for visual reasoning. arXiv preprint arXiv:2510.23925, 2025

  51. [51]

    Seagent: Self-evolving computer use agent with autonomous learning from experience

    Zeyi Sun, Ziyu Liu, Yuhang Zang, Yuhang Cao, Xiaoyi Dong, Tong Wu, Dahua Lin, and Jiaqi Wang. Seagent: Self-evolving computer use agent with autonomous learning from experience. arXiv preprint arXiv:2508.04700, 2025

  52. [52]

    Cradle: Empowering foundation agents towards general computer control

    Weihao Tan, Wentao Zhang, Xinrun Xu, Haochong Xia, Ziluo Ding, Boyu Li, Bohan Zhou, Junpeng Yue, Jiechuan Jiang, Yewen Li, et al. Cradle: Empowering foundation agents towards general computer control. arXiv preprint arXiv:2403.03186, 2024

  53. [53]

    ChemAgent: Self-updating library in large language models improves chemical reasoning

    Xiangru Tang, Tianyu Hu, Muyang Ye, Yanjun Shao, Xunjian Yin, Siru Ouyang, Wangchunshu Zhou, Pan Lu, Zhuosheng Zhang, Yilun Zhao, et al. Chemagent: Self-updating library in large language models improves chemical reasoning. arXiv preprint arXiv:2501.06590, 2025

  54. [54]

    Medagentsbench: Benchmarking thinking models and agent frameworks for complex medical reasoning

    Xiangru Tang, Daniel Shao, Jiwoong Sohn, Jiapeng Chen, Jiayi Zhang, Jinyu Xiang, Fang Wu, Yilun Zhao, Chenglin Wu, Wenqi Shi, et al. Medagentsbench: Benchmarking thinking models and agent frameworks for complex medical reasoning. arXiv preprint arXiv:2503.07459, 2025

  55. [55]

    Eigen-1: Adaptive multi-agent refinement with monitor-based rag for scientific reasoning

    Xiangru Tang, Wanghan Xu, Yujie Wang, Zijie Guo, Daniel Shao, Jiapeng Chen, Cixuan Zhang, Ziyi Wang, Lixin Zhang, Guancheng Wan, et al. Eigen-1: Adaptive multi-agent refinement with monitor-based rag for scientific reasoning. arXiv preprint arXiv:2509.21193, 2025

  56. [56]

    Cellforge: Agentic design of virtual cell models

    Xiangru Tang, Zhuoyun Yu, Jiapeng Chen, Yan Cui, Daniel Shao, Weixu Wang, Fang Wu, Yuchen Zhuang, Wenqi Shi, Zhi Huang, et al. Cellforge: Agentic design of virtual cell models. arXiv preprint arXiv:2508.02276, 2025

  57. [57]

    Winoground: Probing vision and language models for visio-linguistic compositionality

    Tristan Thrush, Ryan Jiang, Max Bartolo, Amanpreet Singh, Adina Williams, Douwe Kiela, and Candace Ross. Winoground: Probing vision and language models for visio-linguistic compositionality. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 5238–5248, 2022

  58. [58]

    AgentProg: Empowering Long-Horizon GUI Agents with Program-Guided Context Management

    Shizuo Tian, Hao Wen, Yuxuan Chen, Jiacheng Liu, Shanhui Zhao, Guohong Liu, Ju Ren, Yunxin Liu, and Yuanchun Li. Agentprog: Empowering long-horizon gui agents with program-guided context management. arXiv preprint arXiv:2512.10371, 2025

  59. [59]

    Voyager: An Open-Ended Embodied Agent with Large Language Models

    Guanzhi Wang, Yuqi Xie, Yunfan Jiang, Ajay Mandlekar, Chaowei Xiao, Yuke Zhu, Linxi Fan, and Anima Anandkumar. Voyager: An open-ended embodied agent with large language models. arXiv preprint arXiv:2305.16291, 2023

  60. [60]

    Mobile-agent-v2: Mobile device operation assistant with effective navigation via multi-agent collaboration

    Junyang Wang, Haiyang Xu, Haitao Jia, Xi Zhang, Ming Yan, Weizhou Shen, Ji Zhang, Fei Huang, and Jitao Sang. Mobile-agent-v2: Mobile device operation assistant with effective navigation via multi-agent collaboration. Advances in Neural Information Processing Systems, 37:2686–2710, 2024

  61. [61]

    InternVL3.5: Advancing Open-Source Multimodal Models in Versatility, Reasoning, and Efficiency

    Weiyun Wang, Zhangwei Gao, Lixin Gu, Hengjun Pu, Long Cui, Xingguang Wei, Zhaoyang Liu, Linglin Jing, Shenglong Ye, Jie Shao, et al. Internvl3.5: Advancing open-source multimodal models in versatility, reasoning, and efficiency. arXiv preprint arXiv:2508.18265, 2025

  62. [62]

    Infmem: Learning system-2 memory control for long-context agent

    Xinyu Wang, Mingze Li, Peng Lu, Xiao-Wen Chang, Lifeng Shang, Jinping Li, Fei Mi, Prasanna Parthasarathi, and Yufei Cui. Infmem: Learning system-2 memory control for long-context agent. arXiv preprint arXiv:2602.02704, 2026

  63. [63]

    Jarvis-1: Open-world multi-task agents with memory-augmented multimodal language models

    Zihao Wang, Shaofei Cai, Anji Liu, Yonggang Jin, Jinbing Hou, Bowei Zhang, Haowei Lin, Zhaofeng He, Zilong Zheng, Yaodong Yang, et al. Jarvis-1: Open-world multi-task agents with memory-augmented multimodal language models. IEEE Transactions on Pattern Analysis and Machine Intelligence, 47(3):1894–1907, 2024

  64. [64]

    History-aware reasoning for gui agents

    Ziwei Wang, Leyang Yang, Xiaoxuan Tang, Sheng Zhou, Dajun Chen, Wei Jiang, and Yong Li. History-aware reasoning for gui agents. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 40, pages 36448–36456, 2026

  65. [65]

    Agent Workflow Memory

    Zora Zhiruo Wang, Jiayuan Mao, Daniel Fried, and Graham Neubig. Agent workflow memory. arXiv preprint arXiv:2409.07429, 2024

  66. [66]

    Auto-scaling continuous memory for gui agent

    Wenyi Wu, Kun Zhou, Ruoxin Yuan, Vivian Yu, Stephen Wang, Zhiting Hu, and Biwei Huang. Auto-scaling continuous memory for gui agent. arXiv preprint arXiv:2510.09038, 2025

  67. [67]

    OS-ATLAS: A Foundation Action Model for Generalist GUI Agents

    Zhiyong Wu, Zhenyu Wu, Fangzhi Xu, Yian Wang, Qiushi Sun, Chengyou Jia, Kanzhi Cheng, Zichen Ding, Liheng Chen, Paul Pu Liang, et al. Os-atlas: A foundation action model for generalist gui agents. arXiv preprint arXiv:2410.23218, 2024

  68. [68]

    Ui-mem: Self-evolving experience memory for online reinforcement learning in mobile gui agents

    Han Xiao, Guozhi Wang, Hao Wang, Shilong Liu, Yuxiang Chai, Yue Pan, Yufeng Zhou, Xiaoxin Chen, Yafei Wen, and Hongsheng Li. Ui-mem: Self-evolving experience memory for online reinforcement learning in mobile gui agents. arXiv preprint arXiv:2602.05832, 2026

  69. [69]

    Osworld: Benchmarking multimodal agents for open-ended tasks in real computer environments

    Tianbao Xie, Danyang Zhang, Jixuan Chen, Xiaochuan Li, Siheng Zhao, Ruisheng Cao, Toh J Hua, Zhoujun Cheng, Dongchan Shin, Fangyu Lei, et al. Osworld: Benchmarking multimodal agents for open-ended tasks in real computer environments. Advances in Neural Information Processing Systems, 37:52040–52094, 2024

  70. [70]

    Mobile-agent-v3.5: Multi-platform fundamental gui agents

    Haiyang Xu, Xi Zhang, Haowei Liu, Junyang Wang, Zhaozai Zhu, Shengjie Zhou, Xuhao Hu, Feiyu Gao, Junjie Cao, Zihua Wang, et al. Mobile-agent-v3.5: Multi-platform fundamental gui agents. arXiv preprint arXiv:2602.16855, 2026

  71. [71]

    Retrieval-augmented gui agents with generative guidelines

    Ran Xu, Kaixin Ma, Wenhao Yu, Hongming Zhang, Joyce C Ho, Carl Yang, and Dong Yu. Retrieval-augmented gui agents with generative guidelines. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 17877–17886, 2025

  72. [72]

    A-MEM: Agentic Memory for LLM Agents

    Wujiang Xu, Zujie Liang, Kai Mei, Hang Gao, Juntao Tan, and Yongfeng Zhang. A-mem: Agentic memory for llm agents. arXiv preprint arXiv:2502.12110, 2025

  73. [73]

    Evocua: Evolving computer use agents via learning from scalable synthetic experience

    Taofeng Xue, Chong Peng, Mianqiu Huang, Linsen Guo, Tiancheng Han, Haozhe Wang, Jianing Wang, Xiaocheng Zhang, Xin Yang, Dengchang Zhao, et al. Evocua: Evolving computer use agents via learning from scalable synthetic experience. arXiv preprint arXiv:2601.15876, 2026

  74. [74]

    Step-gui technical report

    Haolong Yan, Jia Wang, Xin Huang, Yeqing Shen, Ziyang Meng, Zhimin Fan, Kaijun Tan, Jin Gao, Lieyu Shi, Mi Yang, et al. Step-gui technical report. arXiv preprint arXiv:2512.15431, 2025

  75. [75]

    Worldmm: Dynamic multimodal memory agent for long video reasoning

    Woongyeong Yeo, Kangsan Kim, Jaehong Yoon, and Sung Ju Hwang. Worldmm: Dynamic multimodal memory agent for long video reasoning. arXiv preprint arXiv:2512.02425, 2025

  76. [76]

    Agentic Memory: Learning Unified Long-Term and Short-Term Memory Management for Large Language Model Agents

    Yi Yu, Liuyi Yao, Yuexiang Xie, Qingquan Tan, Jiaqi Feng, Yaliang Li, and Libing Wu. Agentic memory: Learning unified long-term and short-term memory management for large language model agents. arXiv preprint arXiv:2601.01885, 2026

  77. [77]

    Promptfix: You prompt and we fix the photo

    Yongsheng Yu, Ziyun Zeng, Hang Hua, Jianlong Fu, and Jiebo Luo. Promptfix: You prompt and we fix the photo. arXiv preprint arXiv:2405.16785, 2024

  78. [78]

    Omnipaint: Mastering object-oriented editing via disentangled insertion-removal inpainting

    Yongsheng Yu, Ziyun Zeng, Haitian Zheng, and Jiebo Luo. Omnipaint: Mastering object-oriented editing via disentangled insertion-removal inpainting. arXiv preprint arXiv:2503.08677, 2025

  79. [79]

    Automated detection and quantitative assessment of dental plaque in intraoral images

    Ziyun Zeng, Junyu Chen, Noha Rashwan, Nisreen Al Jallad, Jin Xiao, and Jiebo Luo. Automated detection and quantitative assessment of dental plaque in intraoral images. ACM Transactions on Computing for Healthcare, 7(2):1–12, 2026

  80. [80]

    Mira: Multimodal iterative reasoning agent for image editing

    Ziyun Zeng, Hang Hua, and Jiebo Luo. Mira: Multimodal iterative reasoning agent for image editing. arXiv preprint arXiv:2511.21087, 2025
