MementoGUI: Learning Agentic Multimodal Memory Control for Long-Horizon GUI Agents
Pith reviewed 2026-05-20 11:52 UTC · model grok-4.3
The pith
GUI agents handle long tasks better when a learned controller selects and compresses multimodal memory instead of replaying full history or using text alone.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
MementoGUI formulates long-horizon GUI control as an online memory-control problem solved by MementoCore, a modular learned controller. MementoCore performs step processing, compresses memory into textual summaries with ROI-level visual evidence, writes reusable trajectories to episodic memory, and selects episodes by relevance. This allows plug-in augmentation of any MLLM GUI agent backbone without finetuning, and the paper reports gains on GUI-Odyssey, MM-Mind2Web, and the newly introduced MementoGUI-Bench.
What carries the argument
MementoCore, a modular learned controller with specialized operators for step processing, memory compression, episodic writing, and episodic selection that selectively preserves task-relevant interface events using both text and visual regions.
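To make the four operators concrete, here is a minimal sketch of how such a controller could be wired together. It is not the paper's implementation: every class, method, and parameter name is hypothetical, and the learned operators are replaced by hand-written heuristics (string concatenation for compression, Jaccard word overlap for relevance) purely for illustration.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple

# All names below are hypothetical; the paper's operators are learned models,
# replaced here by simple hand-written heuristics for illustration only.
ROI = Tuple[int, int, int, int]  # (left, top, right, bottom) screenshot crop


@dataclass
class MemoryEntry:
    step: int
    summary: str                 # textual summary of an interface event
    roi: Optional[ROI] = None    # localized visual evidence, if any


@dataclass
class Episode:
    task: str
    entries: List[MemoryEntry]


@dataclass
class MemoryController:
    max_working: int = 4
    working: List[MemoryEntry] = field(default_factory=list)
    episodic: List[Episode] = field(default_factory=list)

    def process_step(self, step: int, action: str, observation: str,
                     roi: Optional[ROI] = None) -> None:
        """Step processing: store a compact entry, compress on overflow."""
        self.working.append(MemoryEntry(step, f"{action}: {observation}", roi))
        if len(self.working) > self.max_working:
            self.compress()

    def compress(self) -> None:
        """Memory compression: merge the two oldest entries into one."""
        if len(self.working) < 2:
            return
        old, new = self.working[0], self.working[1]
        roi = new.roi if new.roi is not None else old.roi
        merged = MemoryEntry(new.step, f"{old.summary} | {new.summary}", roi)
        self.working = [merged] + self.working[2:]

    def write_episode(self, task: str) -> None:
        """Episodic writing: archive working memory under the task name."""
        self.episodic.append(Episode(task, list(self.working)))
        self.working = []

    def select_episodes(self, task: str, k: int = 1) -> List[Episode]:
        """Episodic selection: rank stored episodes by Jaccard word overlap."""
        query = set(task.lower().split())

        def relevance(episode: Episode) -> float:
            words = set(episode.task.lower().split())
            union = query | words
            return len(query & words) / len(union) if union else 0.0

        return sorted(self.episodic, key=relevance, reverse=True)[:k]
```

In this sketch the GUI agent backbone is never touched: it would only receive the contents of `working` and the selected episodes as extra context, which is the plug-in property the review describes.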
If this is right
- Agents maintain coherent task state across dozens of steps without being overwhelmed by redundant screenshots.
- Retaining localized visual evidence alongside text summaries improves decision accuracy over text-only memory.
- Performance scales with the size of the MementoCore backbone, indicating that stronger memory control models yield larger gains.
- MementoGUI-Bench and the associated MLLM-based metrics provide a standardized way to evaluate memory consistency in long GUI tasks.
Where Pith is reading between the lines
- The same selective multimodal memory pattern could transfer to other long-horizon agent domains such as web navigation or robotic manipulation where full history replay is impractical.
- If the curation pipeline proves robust, agent designers may shift effort from hand-crafted history rules to training data generation for memory controllers.
- Testing whether the learned controller remains effective when the underlying GUI agent backbone is updated or replaced would clarify the degree of modularity achieved.
Load-bearing premise
The automated pipeline that converts raw computer-use trajectories into training data for the memory controller produces unbiased, high-quality examples that generalize directly to the evaluation benchmarks without post-hoc filtering or task-specific adjustments.
What would settle it
Apply MementoGUI to a new long-horizon GUI benchmark whose trajectories were collected independently of the training data curation pipeline and measure whether accuracy, task progress, and memory consistency still exceed the no-history, history-replay, and text-only baselines.
Original abstract
Recent GUI agents have made substantial progress in visual grounding and action prediction, yet they remain brittle in long-horizon tasks that require maintaining task state across many interface transitions. Existing agents typically rely on raw history replay or text-only memory, which either overwhelms the model with redundant screenshots or discards localized visual evidence needed for future decisions. To address these limitations, we introduce MementoGUI, a plug-in agentic memory framework that equips MLLM-based GUI agents with MementoCore, a learned controller for online memory selection, compression, and retrieval. Rather than treating interaction history as a fixed context, MementoGUI formulates long-horizon GUI control as an online memory-control problem: working memory selectively preserves task-relevant interface events with textual summaries and ROI-level visual evidence, while episodic memory retrieves reusable past trajectories through learned relevance selection. MementoCore modularizes memory control into specialized operators for step processing, memory compression, episodic writing, and episodic selection, enabling plug-in memory augmentation without finetuning the GUI agent backbone. We further develop a scalable data curation pipeline that converts computer-use trajectories into memory-controller training data, introduce MementoGUI-Bench for evaluating long-horizon decision-making in GUI agents, and design MLLM-based metrics for semantic action matching, task progress, and memory consistency. Experiments on GUI-Odyssey, MM-Mind2Web, and MementoGUI-Bench show that MementoGUI consistently improves GUI agents over no-history, history-replay, and text-only memory baselines, with larger MementoCore backbones further strengthening memory-augmented GUI control.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces MementoGUI, a plug-in agentic memory framework for MLLM-based GUI agents that equips them with MementoCore, a learned controller for online memory selection, compression, and retrieval. It modularizes memory control into operators for step processing, compression, episodic writing, and selection, trained via a scalable data curation pipeline that converts computer-use trajectories into supervision. The authors introduce MementoGUI-Bench and MLLM-based metrics for semantic action matching, task progress, and memory consistency. Experiments on GUI-Odyssey, MM-Mind2Web, and MementoGUI-Bench claim consistent improvements over no-history, history-replay, and text-only baselines, with larger MementoCore backbones strengthening results.
Significance. If the central claims hold after verification, the modular plug-in design of MementoCore (without finetuning the GUI agent backbone) and the new benchmark could meaningfully advance long-horizon GUI agent research by addressing memory brittleness in a reusable way. The framework's separation of working and episodic memory with ROI-level visual evidence is a clear strength over raw replay or text-only approaches.
major comments (2)
- [Method / Data Curation] Data curation pipeline (described in the method section following the abstract): no mechanics are provided for negative sampling, relevance labeling, overlap detection between curated trajectories and the evaluation benchmarks (GUI-Odyssey, MM-Mind2Web, MementoGUI-Bench), or explicit controls against distribution shift and post-hoc filtering. This is load-bearing for the central claim of unbiased generalization and consistent gains, as the improvements are attributed to the learned MementoCore trained on this pipeline.
- [Experiments] Experimental results (abstract and §5): the claim of 'consistent improvements' and 'larger MementoCore backbones further strengthening' results is stated without reference to specific quantitative tables, error bars, ablation details on the curation choices, or statistical significance tests. This prevents assessment of whether baseline comparisons or metric definitions contain post-hoc choices that affect the reported gains.
minor comments (2)
- [Method] The description of MementoCore operators would benefit from a single pseudocode listing or diagram showing the flow from step processing to episodic selection.
- [Benchmarks] Clarify whether MementoGUI-Bench tasks were held out from the data curation pipeline or if any filtering was applied to avoid train-test leakage.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback and the recommendation for major revision. The comments highlight opportunities to improve transparency in the data curation pipeline and the reporting of experimental results. We address each point below and will incorporate the necessary clarifications and additions in the revised manuscript.
- Referee: [Method / Data Curation] Data curation pipeline (described in the method section following the abstract): no mechanics are provided for negative sampling, relevance labeling, overlap detection between curated trajectories and the evaluation benchmarks (GUI-Odyssey, MM-Mind2Web, MementoGUI-Bench), or explicit controls against distribution shift and post-hoc filtering. This is load-bearing for the central claim of unbiased generalization and consistent gains, as the improvements are attributed to the learned MementoCore trained on this pipeline.
  Authors: We agree that the current description of the data curation pipeline in Section 3.3 would benefit from greater detail on these aspects. In the revised manuscript we will expand the pipeline description to explicitly cover: (1) negative sampling by selecting trajectories with low semantic overlap to the target task and assigning corresponding low relevance labels via MLLM scoring; (2) relevance labeling using a combination of action-sequence similarity and visual ROI matching; (3) overlap detection between curated training trajectories and the three evaluation benchmarks via normalized edit distance on action sequences and visual feature cosine similarity, with any overlapping trajectories removed; and (4) controls for distribution shift including temporal hold-out splits and post-hoc filtering based on trajectory length and task category balance. These additions will make the training data generation process fully reproducible and strengthen the claim of unbiased generalization.
  Revision: yes
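The overlap check named in point (3) of this simulated response can be sketched as follows. The response gives only the two measures, so the thresholds, the function names, and the decision to flag a pair when either signal fires are assumptions made for illustration, not details from the paper.

```python
import math
from typing import Sequence


def normalized_edit_distance(a: Sequence[str], b: Sequence[str]) -> float:
    """Levenshtein distance between two action sequences, scaled to [0, 1]."""
    if not a and not b:
        return 0.0
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        curr = [i]
        for j, y in enumerate(b, 1):
            curr.append(min(prev[j] + 1,              # deletion
                            curr[j - 1] + 1,          # insertion
                            prev[j - 1] + (x != y)))  # substitution
        prev = curr
    return prev[-1] / max(len(a), len(b))


def cosine_similarity(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity of two equal-length feature vectors."""
    dot = sum(p * q for p, q in zip(u, v))
    norm_u = math.sqrt(sum(p * p for p in u))
    norm_v = math.sqrt(sum(q * q for q in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def is_overlapping(train_actions: Sequence[str], eval_actions: Sequence[str],
                   train_feat: Sequence[float], eval_feat: Sequence[float],
                   max_distance: float = 0.2,       # hypothetical threshold
                   min_similarity: float = 0.9) -> bool:  # hypothetical threshold
    """Flag a training trajectory if EITHER signal indicates a near-duplicate.

    Combining the signals with 'or' is an assumption; it errs toward removing
    more training data rather than risking train-test leakage.
    """
    near_actions = normalized_edit_distance(train_actions, eval_actions) <= max_distance
    near_visuals = cosine_similarity(train_feat, eval_feat) >= min_similarity
    return near_actions or near_visuals
```

A curation pipeline would call `is_overlapping` for each training trajectory against each benchmark trajectory and drop every training trajectory that is flagged at least once.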
- Referee: [Experiments] Experimental results (abstract and §5): the claim of 'consistent improvements' and 'larger MementoCore backbones further strengthening' results is stated without reference to specific quantitative tables, error bars, ablation details on the curation choices, or statistical significance tests. This prevents assessment of whether baseline comparisons or metric definitions contain post-hoc choices that affect the reported gains.
  Authors: We acknowledge that the experimental claims in the abstract and Section 5 would be clearer with tighter linkage to the reported numbers. In the revised version we will: (1) insert explicit cross-references (e.g., “as shown in Table 2, row 3”) for every statement of improvement; (2) add error bars representing standard deviation across five random seeds for all main results; (3) include a new ablation table examining the impact of negative-sampling ratio and relevance-labeling threshold on final performance; and (4) report paired t-test p-values comparing MementoGUI against each baseline to establish statistical significance. These changes will allow readers to directly evaluate the robustness of the gains and the absence of post-hoc metric adjustments.
  Revision: yes
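The seed-level reporting promised in points (2) and (4) reduces to a short calculation, sketched below with the standard library only. The function names are hypothetical, and the per-seed scores in the usage note are placeholders invented for the example, not results from the paper.

```python
import math
import statistics
from typing import Sequence, Tuple


def mean_and_std(scores: Sequence[float]) -> Tuple[float, float]:
    """Mean and sample standard deviation across seeds (for error bars)."""
    return statistics.mean(scores), statistics.stdev(scores)


def paired_t_statistic(method: Sequence[float], baseline: Sequence[float]) -> float:
    """Paired t statistic over per-seed score differences.

    The seeds must be matched: method[i] and baseline[i] share seed i.
    """
    if len(method) != len(baseline) or len(method) < 2:
        raise ValueError("need two equal-length samples with at least 2 seeds")
    diffs = [m - b for m, b in zip(method, baseline)]
    spread = statistics.stdev(diffs)
    if spread == 0:
        raise ValueError("differences have zero variance; t is undefined")
    return statistics.mean(diffs) / (spread / math.sqrt(len(diffs)))


# Two-sided critical value of Student's t at alpha = 0.05 with 4 degrees of
# freedom, which is the case for five paired seeds.
T_CRITICAL_DF4 = 2.776
```

With placeholder scores of `[62, 64, 61, 65, 63]` for the method and `[58, 59, 60, 57, 61]` for a baseline, `paired_t_statistic` returns a value above `T_CRITICAL_DF4`, so the difference would count as significant at the 5% level. An exact p-value needs the t distribution's CDF, which the standard library does not provide; a statistics package would be used for that in practice.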
Circularity Check
No circularity: empirical results from independent training and evaluation
The paper introduces MementoGUI as a modular plug-in memory controller with MementoCore operators trained via a data curation pipeline on computer-use trajectories. Reported gains on GUI-Odyssey, MM-Mind2Web, and MementoGUI-Bench are presented as direct experimental comparisons to no-history, history-replay, and text-only baselines. No equations, self-definitional reductions, fitted inputs renamed as predictions, or load-bearing self-citations appear in the derivation; the central claims rest on external benchmark performance rather than quantities defined by the same parameters or inputs used to generate the result. The framework is self-contained against the stated evaluation metrics.
Axiom & Free-Parameter Ledger
free parameters (1)
- Learned relevance and compression parameters inside MementoCore
axioms (1)
- Domain assumption: A plug-in memory controller can be trained and attached to an unchanged MLLM backbone while still producing measurable gains on long-horizon tasks
invented entities (2)
- MementoCore: no independent evidence
- MementoGUI-Bench: no independent evidence
Lean theorems connected to this paper
- IndisputableMonolith/Foundation/AbsoluteFloorClosure.lean, reality_from_one_distinction
  Tag: unclear (relation between the paper passage and the cited Recognition theorem).
  Passage: "MementoCore modularizes memory control into specialized operators for step processing, memory compression, episodic writing, and episodic selection... scalable data curation pipeline that converts computer-use trajectories into memory-controller training data"
- IndisputableMonolith/Cost/FunctionalEquation.lean, washburn_uniqueness_aczel
  Tag: unclear (relation between the paper passage and the cited Recognition theorem).
  Passage: "working memory selectively preserves task-relevant interface events with textual summaries and ROI-level visual evidence, while episodic memory retrieves reusable past trajectories through learned relevance selection"
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Reference graph
Works this paper leans on
-
[1]
Agent S2: A Compositional Generalist-Specialist Framework for Computer Use Agents
Saaket Agashe, Kyle Wong, Vincent Tu, Jiachen Yang, Ang Li, and Xin Eric Wang. Agent s2: A compositional generalist-specialist framework for computer use agents.arXiv preprint arXiv:2504.00906, 2025
work page internal anchor Pith review arXiv 2025
-
[2]
Niccolo Avogaro, Nayanika Debnath, Li Mi, Thomas Frick, Junling Wang, Zexue He, Hang Hua, Konrad Schindler, and Mattia Rigotti. Sparc: Separating perception and reasoning circuits for test-time scaling of vlms.arXiv preprint arXiv:2602.06566, 2026
-
[3]
Shuai Bai, Yuxuan Cai, Ruizhe Chen, Keqin Chen, Xionghui Chen, Zesen Cheng, Lianghao Deng, Wei Ding, Chang Gao, Chunjiang Ge, et al. Qwen3-vl technical report.arXiv preprint arXiv:2511.21631, 2025
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[4]
Luiz C Borro, Luiz AB Macarini, Gordon Tindall, Michael Montero, and Adam B Struck. Memori: A persistent memory layer for efficient, context-aware llm agents.arXiv preprint arXiv:2603.19935, 2026
-
[5]
He Cao, Yanjun Shao, Zhiyuan Liu, Zijing Liu, Xiangru Tang, Yuan Yao, and Yu Li. Presto: Pro- gressive pretraining enhances synthetic chemistry outcomes.arXiv preprint arXiv:2406.13193, 2024
-
[6]
Chunliang Chen, Ming Guan, Xiao Lin, Jiaxu Li, Luxi Lin, Qiyi Wang, Xiangyu Chen, Jixiang Luo, Changzhi Sun, Dell Zhang, et al. Telemem: Building long-term and multimodal memory for agentic ai.arXiv preprint arXiv:2601.06037, 2025
-
[7]
Less is more: Empowering gui agent with context- aware simplification
Gongwei Chen, Xurui Zhou, Rui Shao, Yibo Lyu, Kaiwen Zhou, Shuai Wang, Wentao Li, Yinchuan Li, Zhongang Qi, and Liqiang Nie. Less is more: Empowering gui agent with context- aware simplification. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 5901–5911, 2025
work page 2025
-
[8]
Seeclick: Harnessing gui grounding for advanced visual gui agents
Kanzhi Cheng, Qiushi Sun, Yougang Chu, Fangzhi Xu, Li YanTao, Jianbing Zhang, and Zhiyong Wu. Seeclick: Harnessing gui grounding for advanced visual gui agents. InProceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 9313–9332, 2024
work page 2024
-
[9]
MGA: Memory-Driven GUI Agent for Observation-Centric Interaction
Weihua Cheng, Ersheng Ni, Wenlong Wang, Yifei Sun, Junming Liu, Wangyu Shen, Yirong Chen, Botian Shi, and Ding Wang. Mga: Memory-driven gui agent for observation-centric interaction.arXiv preprint arXiv:2510.24168, 2025
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[10]
Xiang Deng, Yu Gu, Boyuan Zheng, Shijie Chen, Sam Stevens, Boshi Wang, Huan Sun, and Yu Su. Mind2web: Towards a generalist agent for the web.Advances in Neural Information Processing Systems, 36:28091–28114, 2023
work page 2023
-
[11]
Memp: Exploring Agent Procedural Memory
Runnan Fang, Yuan Liang, Xiaobin Wang, Jialong Wu, Shuofei Qiao, Pengjun Xie, Fei Huang, Huajun Chen, and Ningyu Zhang. Memp: Exploring agent procedural memory.arXiv preprint arXiv:2508.06433, 2025
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[12]
Ui-venus-1.5 technical report.arXiv preprint arXiv:2602.09082, 2026
Changlong Gao, Zhangxuan Gu, Yulin Liu, Xinyu Qiu, Shuheng Shen, Yue Wen, Tianyu Xia, Zhenyu Xu, Zhengwen Zeng, Beitong Zhou, et al. Ui-venus-1.5 technical report.arXiv preprint arXiv:2602.09082, 2026
-
[13]
Xinzge Gao, Chuanrui Hu, Bin Chen, and Teng Li. Chain-of-memory: Enhancing gui agents for cross-application navigation.arXiv preprint arXiv:2506.18158, 2025
-
[14]
Google DeepMind. Gemini 3.1 Pro Model Card. https://storage.googleapis.com/ deepmind-media/Model-Cards/Gemini-3-1-Pro-Model-Card.pdf , February 2026. Published February 2026; updated 19 February 2026
work page 2026
-
[15]
Navigating the Digital World as Humans Do: Universal Visual Grounding for GUI Agents
Boyu Gou, Ruohan Wang, Boyuan Zheng, Yanan Xie, Cheng Chang, Yiheng Shu, Huan Sun, and Yu Su. Navigating the digital world as humans do: Universal visual grounding for gui agents.arXiv preprint arXiv:2410.05243, 2024. 10
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[16]
MemFactory: Unified Inference & Training Framework for Agent Memory
Ziliang Guo, Ziheng Li, and Zhiyu Li. Memfactory: Unified inference & training framework for agent memory.arXiv preprint arXiv:2603.29493, 2026
work page internal anchor Pith review Pith/arXiv arXiv 2026
-
[17]
Cogagent: A visual language model for gui agents
Wenyi Hong, Weihan Wang, Qingsong Lv, Jiazheng Xu, Wenmeng Yu, Junhui Ji, Yan Wang, Zihan Wang, Yuxiao Dong, Ming Ding, et al. Cogagent: A visual language model for gui agents. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 14281–14290, 2024
work page 2024
-
[18]
Computer use data - paradigm shift ai, 2025
Anais Howland, Ashwin Thinnappan, and Jameel Shahid Mohammed. Computer use data - paradigm shift ai, 2025
work page 2025
-
[19]
Mengkang Hu, Tianxing Chen, Qiguang Chen, Yao Mu, Wenqi Shao, and Ping Luo. Hiagent: Hierarchical working memory management for solving long-horizon agent tasks with large lan- guage model. InProceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 32779–32798, 2025
work page 2025
-
[20]
Yushi Hu, Hang Hua, Zhengyuan Yang, Weijia Shi, Noah A. Smith, and Jiebo Luo. Promptcap: Prompt-guided image captioning for vqa with gpt-3.2023 IEEE/CVF International Conference on Computer Vision (ICCV), pages 2951–2963, 2023
work page 2023
-
[21]
Finecaption: Compositional image captioning focusing on wherever you want at any granularity
Hang Hua, Qing Liu, Lingzhi Zhang, Jing Shi, Soo Ye Kim, Zhifei Zhang, Yilin Wang, Jianming Zhang, Zhe Lin, and Jiebo Luo. Finecaption: Compositional image captioning focusing on wherever you want at any granularity. InProceedings of the computer vision and pattern recognition conference, pages 24763–24773, 2025
work page 2025
-
[22]
Collomosse, Scott Cohen, and Jiebo Luo
Hang Hua, Jing Shi, Kushal Kafle, Simon Jenni, Daoan Zhang, John P. Collomosse, Scott Cohen, and Jiebo Luo. Finematch: Aspect-based fine-grained image and text mismatch detection and correction. InEuropean Conference on Computer Vision, 2024
work page 2024
-
[23]
V2xum-llm: Cross-modal video summarization with temporal prompt instruction tuning
Hang Hua, Yunlong Tang, Chenliang Xu, and Jiebo Luo. V2xum-llm: Cross-modal video summarization with temporal prompt instruction tuning. InProceedings of the AAAI Conference on Artificial Intelligence, volume 39, pages 3599–3607, 2025
work page 2025
-
[24]
Hang Hua, Yunlong Tang, Ziyun Zeng, Liangliang Cao, Zhengyuan Yang, Hangfeng He, Chenliang Xu, and Jiebo Luo. Mmcomposition: Revisiting the compositionality of pre-trained vision-language models.arXiv preprint arXiv:2410.09733, 2024
-
[25]
Aliaga, Wei Xiong, and Jiebo Luo
Hang Hua, Ziyun Zeng, Yizhi Song, Yunlong Tang, Liu He, Daniel G. Aliaga, Wei Xiong, and Jiebo Luo. Mmig-bench: Towards comprehensive and explainable evaluation of multi-modal image generation models.ArXiv, abs/2505.19415, 2025
-
[26]
Brandon Huang, Hang Hua, Zhuoran Yu, Trevor Darrell, Rogerio Feris, and Roei Herzig. Dave: A vlm vision encoder for document understanding and web agents.arXiv preprint arXiv:2512.17221, 2025
-
[27]
Yue Huang, Hang Hua, Yujun Zhou, Pengcheng Jing, Manish Nagireddy, Inkit Padhi, Greta Dolcetti, Zhangchen Xu, Subhajit Chaudhury, Ambrish Rawat, et al. Building a foundational guardrail for general agentic systems via synthetic data.arXiv preprint arXiv:2510.09781, 2025
-
[28]
Wenjia Jiang, Yangyang Zhuang, Chenxi Song, Xu Yang, Joey Tianyi Zhou, and Chi Zhang. Ap- pagentx: Evolving gui agents as proficient smartphone users.arXiv preprint arXiv:2503.02268, 2025
-
[29]
Visualwebarena: Evaluating multimodal agents on realistic visual web tasks
Jing Yu Koh, Robert Lo, Lawrence Jang, Vikram Duvvur, Ming Lim, Po-Yu Huang, Graham Neubig, Shuyan Zhou, Russ Salakhutdinov, and Daniel Fried. Visualwebarena: Evaluating multimodal agents on realistic visual web tasks. InProceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 881–905, 2024
work page 2024
-
[30]
Mobilegpt: Augmenting llm with human-like app memory for mobile task automation
Sunjae Lee, Junyoung Choi, Jungjae Lee, Munim Hasan Wasi, Hojun Choi, Steve Ko, Sangeun Oh, and Insik Shin. Mobilegpt: Augmenting llm with human-like app memory for mobile task automation. InProceedings of the 30th Annual International Conference on Mobile Computing and Networking, pages 1119–1133, 2024. 11
work page 2024
-
[31]
Grounding multimodal large language model in gui world
Weixian Lei, Difei Gao, and Mike Zheng Shou. Grounding multimodal large language model in gui world. InThe Thirteenth International Conference on Learning Representations, 2025
work page 2025
-
[32]
EchoTrail-GUI: Building Actionable Memory for GUI Agents via Critic-Guided Self-Exploration
Runze Li, Yuwen Zhai, Bo Xu, LiWu Xu, Nian Shi, Wei Zhang, Ran Lin, and Liang Wang. Echotrail-gui: Building actionable memory for gui agents via critic-guided self-exploration. arXiv preprint arXiv:2512.19396, 2025
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[33]
Zaijing Li, Yuquan Xie, Rui Shao, Gongwei Chen, Dongmei Jiang, and Liqiang Nie. Optimus-1: Hybrid multimodal memory empowered agents excel in long-horizon tasks.Advances in neural information processing systems, 37:49881–49913, 2024
work page 2024
-
[34]
Showui: One vision-language-action model for gui visual agent
Kevin Qinghong Lin, Linjie Li, Difei Gao, Zhengyuan Yang, Shiwei Wu, Zechen Bai, Stan Weix- ian Lei, Lijuan Wang, and Mike Zheng Shou. Showui: One vision-language-action model for gui visual agent. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 19498–19508, 2025
work page 2025
-
[35]
Visual instruction tuning.Advances in neural information processing systems, 36:34892–34916, 2023
Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning.Advances in neural information processing systems, 36:34892–34916, 2023
work page 2023
-
[36]
SimpleMem: Efficient Lifelong Memory for LLM Agents
Jiaqi Liu, Yaofeng Su, Peng Xia, Siwei Han, Zeyu Zheng, Cihang Xie, Mingyu Ding, and Huaxiu Yao. Simplemem: Efficient lifelong memory for llm agents.arXiv preprint arXiv:2601.02553, 2026
work page internal anchor Pith review arXiv 2026
-
[37]
Jiateng Liu, Zhenhailong Wang, Rushi Wang, Bingxuan Li, Jeonghwan Kim, Aditi Tiwari, Pengfei Yu, Denghui Zhang, and Heng Ji. Osexpert: Computer-use agents learning professional skills via exploration.arXiv preprint arXiv:2603.07978, 2026
-
[38]
Memverse: Multimodal memory for lifelong learning agents.arXiv preprint arXiv:2512.03627, 2025
Junming Liu, Yifei Sun, Weihua Cheng, Haodong Lei, Yirong Chen, Licheng Wen, Xuemeng Yang, Daocheng Fu, Pinlong Cai, Nianchen Deng, et al. Memverse: Multimodal memory for lifelong learning agents.arXiv preprint arXiv:2512.03627, 2025
-
[39]
Zhaoyang Liu, JingJing Xie, Zichen Ding, Zehao Li, Bowen Yang, Zhenyu Wu, Xuehui Wang, Qiushi Sun, Shi Liu, Weiyun Wang, et al. Scalecua: Scaling open-source computer use agents with cross-platform data.arXiv preprint arXiv:2509.15221, 2025
-
[40]
Guiodyssey: A comprehensive dataset for cross-app gui navigation on mobile devices
Quanfeng Lu, Wenqi Shao, Zitao Liu, Lingxiao Du, Fanqing Meng, Boxuan Li, Botong Chen, Siyuan Huang, Kaipeng Zhang, and Ping Luo. Guiodyssey: A comprehensive dataset for cross-app gui navigation on mobile devices. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 22404–22414, 2025
work page 2025
-
[41]
Sidi Lu, Zhenwen Liang, Dongyang Ma, Yan Wang, Haitao Mi, and Dong Yu. Locas: Your models are principled initializers of locally-supported parametric memories.arXiv preprint arXiv:2602.05085, 2026
-
[42]
Sarthak Mehrotra, Sairam VC Rebbapragada, Mani Hemanth Reddy Bonthu, and Vineeth N Balasubramanian. ishift: Lightweight slow-fast gui agent with adaptive perception.arXiv preprint arXiv:2512.22009, 2025
-
[43]
Generative agents: Interactive simulacra of human behavior
Joon Sung Park, Joseph O’Brien, Carrie Jun Cai, Meredith Ringel Morris, Percy Liang, and Michael S Bernstein. Generative agents: Interactive simulacra of human behavior. InProceed- ings of the 36th annual acm symposium on user interface software and technology, pages 1–22, 2023
work page 2023
-
[44]
UI-TARS: Pioneering Automated GUI Interaction with Native Agents
Yujia Qin, Yining Ye, Junjie Fang, Haoming Wang, Shihao Liang, Shizuo Tian, Junda Zhang, Jiahao Li, Yunxin Li, Shijue Huang, et al. Ui-tars: Pioneering automated gui interaction with native agents.arXiv preprint arXiv:2501.12326, 2025
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[45]
Christopher Rawles, Alice Li, Daniel Rodriguez, Oriana Riva, and Timothy Lillicrap. An- droidinthewild: A large-scale dataset for android device control.Advances in Neural Information Processing Systems, 36:59708–59728, 2023
work page 2023
-
[46]
Gabriel Sarch, Lawrence Jang, Michael J Tarr, William W Cohen, Kenneth Marino, and Katerina Fragkiadaki. Vlm agents generate their own memories: Distilling experience into embodied programs of thought.Advances in Neural Information Processing Systems, 37:75942–75985, 2024. 12
work page 2024
-
[47]
Noah Shinn, Federico Cassano, Ashwin Gopinath, Karthik Narasimhan, and Shunyu Yao. Reflexion: Language agents with verbal reinforcement learning.Advances in neural information processing systems, 36:8634–8652, 2023
work page 2023
-
[48]
Aaditya Singh, Adam Fry, Adam Perelman, Adam Tart, Adi Ganesh, Ahmed El-Kishky, Aidan McLaughlin, Aiden Low, AJ Ostrow, Akhila Ananthram, et al. Openai gpt-5 system card.arXiv preprint arXiv:2601.03267, 2026
work page internal anchor Pith review Pith/arXiv arXiv 2026
-
[49]
Changzhi Sun, Xiangyu Chen, Jixiang Luo, Dell Zhang, and Xuelong Li. Beyond heuristics: A decision-theoretic framework for agent memory management.arXiv preprint arXiv:2512.21567, 2025
-
[50]
Dianat, Majid Rabbani, Raghuveer Rao, and Zhiqiang Tao
Guohao Sun, Hang Hua, Jian Wang, Jiebo Luo, Sohail A. Dianat, Majid Rabbani, Raghuveer Rao, and Zhiqiang Tao. Latent chain-of-thought for visual reasoning.ArXiv, abs/2510.23925, 2025
-
[51]
Seagent: Self-evolving computer use agent with autonomous learning from experience
Zeyi Sun, Ziyu Liu, Yuhang Zang, Yuhang Cao, Xiaoyi Dong, Tong Wu, Dahua Lin, and Jiaqi Wang. Seagent: Self-evolving computer use agent with autonomous learning from experience. arXiv preprint arXiv:2508.04700, 2025
-
[52]
Cradle: Empowering foundation agents towards general computer control,
Weihao Tan, Wentao Zhang, Xinrun Xu, Haochong Xia, Ziluo Ding, Boyu Li, Bohan Zhou, Junpeng Yue, Jiechuan Jiang, Yewen Li, et al. Cradle: Empowering foundation agents towards general computer control.arXiv preprint arXiv:2403.03186, 2024
-
[53]
ChemAgent: Self-updating memories in large language models improves chemical reasoning
Xiangru Tang, Tianyu Hu, Muyang Ye, Yanjun Shao, Xunjian Yin, Siru Ouyang, Wangchunshu Zhou, Pan Lu, Zhuosheng Zhang, Yilun Zhao, et al. Chemagent: Self-updating library in large language models improves chemical reasoning.arXiv preprint arXiv:2501.06590, 2025
-
[54]
Xiangru Tang, Daniel Shao, Jiwoong Sohn, Jiapeng Chen, Jiayi Zhang, Jinyu Xiang, Fang Wu, Yilun Zhao, Chenglin Wu, Wenqi Shi, et al. Medagentsbench: Benchmarking thinking models and agent frameworks for complex medical reasoning.arXiv preprint arXiv:2503.07459, 2025
-
[55]
Xiangru Tang, Wanghan Xu, Yujie Wang, Zijie Guo, Daniel Shao, Jiapeng Chen, Cixuan Zhang, Ziyi Wang, Lixin Zhang, Guancheng Wan, et al. Eigen-1: Adaptive multi-agent refinement with monitor-based rag for scientific reasoning.arXiv preprint arXiv:2509.21193, 2025
-
[56]
Cellforge: Agentic design of virtual cell models
Xiangru Tang, Zhuoyun Yu, Jiapeng Chen, Yan Cui, Daniel Shao, Weixu Wang, Fang Wu, Yuchen Zhuang, Wenqi Shi, Zhi Huang, et al. Cellforge: Agentic design of virtual cell models. arXiv preprint arXiv:2508.02276, 2025
-
[57]
Winoground: Probing vision and language models for visio-linguistic compositionality
Tristan Thrush, Ryan Jiang, Max Bartolo, Amanpreet Singh, Adina Williams, Douwe Kiela, and Candace Ross. Winoground: Probing vision and language models for visio-linguistic compositionality. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 5238–5248, 2022
work page 2022
-
[58]
AgentProg: Empowering Long-Horizon GUI Agents with Program-Guided Context Management
Shizuo Tian, Hao Wen, Yuxuan Chen, Jiacheng Liu, Shanhui Zhao, Guohong Liu, Ju Ren, Yunxin Liu, and Yuanchun Li. Agentprog: Empowering long-horizon gui agents with program- guided context management.arXiv preprint arXiv:2512.10371, 2025
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[59]
Voyager: An Open-Ended Embodied Agent with Large Language Models
Guanzhi Wang, Yuqi Xie, Yunfan Jiang, Ajay Mandlekar, Chaowei Xiao, Yuke Zhu, Linxi Fan, and Anima Anandkumar. V oyager: An open-ended embodied agent with large language models. arXiv preprint arXiv:2305.16291, 2023
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[60]
Junyang Wang, Haiyang Xu, Haitao Jia, Xi Zhang, Ming Yan, Weizhou Shen, Ji Zhang, Fei Huang, and Jitao Sang. Mobile-agent-v2: Mobile device operation assistant with effective navigation via multi-agent collaboration.Advances in Neural Information Processing Systems, 37:2686–2710, 2024
work page 2024
-
[61]
InternVL3.5: Advancing Open-Source Multimodal Models in Versatility, Reasoning, and Efficiency
Weiyun Wang, Zhangwei Gao, Lixin Gu, Hengjun Pu, Long Cui, Xingguang Wei, Zhaoyang Liu, Linglin Jing, Shenglong Ye, Jie Shao, et al. Internvl3. 5: Advancing open-source multimodal models in versatility, reasoning, and efficiency.arXiv preprint arXiv:2508.18265, 2025. 13
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[62]
Xinyu Wang, Mingze Li, Peng Lu, Xiao-Wen Chang, Lifeng Shang, Jinping Li, Fei Mi, Prasanna Parthasarathi, and Yufei Cui. Infmem: Learning system-2 memory control for long-context agent.arXiv preprint arXiv:2602.02704, 2026
-
[63]
Zihao Wang, Shaofei Cai, Anji Liu, Yonggang Jin, Jinbing Hou, Bowei Zhang, Haowei Lin, Zhaofeng He, Zilong Zheng, Yaodong Yang, et al. Jarvis-1: Open-world multi-task agents with memory-augmented multimodal language models.IEEE Transactions on Pattern Analysis and Machine Intelligence, 47(3):1894–1907, 2024
work page 1907
-
[64]
History-aware reasoning for gui agents
Ziwei Wang, Leyang Yang, Xiaoxuan Tang, Sheng Zhou, Dajun Chen, Wei Jiang, and Yong Li. History-aware reasoning for gui agents. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 40, pages 36448–36456, 2026
-
[65]
Zora Zhiruo Wang, Jiayuan Mao, Daniel Fried, and Graham Neubig. Agent workflow memory. arXiv preprint arXiv:2409.07429, 2024
-
[66]
Wenyi Wu, Kun Zhou, Ruoxin Yuan, Vivian Yu, Stephen Wang, Zhiting Hu, and Biwei Huang. Auto-scaling continuous memory for gui agent. arXiv preprint arXiv:2510.09038, 2025
-
[67]
OS-ATLAS: A Foundation Action Model for Generalist GUI Agents
Zhiyong Wu, Zhenyu Wu, Fangzhi Xu, Yian Wang, Qiushi Sun, Chengyou Jia, Kanzhi Cheng, Zichen Ding, Liheng Chen, Paul Pu Liang, et al. Os-atlas: A foundation action model for generalist gui agents. arXiv preprint arXiv:2410.23218, 2024
-
[68]
Ui-mem: Self-evolving experience memory for online reinforcement learning in mobile gui agents
Han Xiao, Guozhi Wang, Hao Wang, Shilong Liu, Yuxiang Chai, Yue Pan, Yufeng Zhou, Xiaoxin Chen, Yafei Wen, and Hongsheng Li. Ui-mem: Self-evolving experience memory for online reinforcement learning in mobile gui agents. arXiv preprint arXiv:2602.05832, 2026
-
[69]
Tianbao Xie, Danyang Zhang, Jixuan Chen, Xiaochuan Li, Siheng Zhao, Ruisheng Cao, Toh J Hua, Zhoujun Cheng, Dongchan Shin, Fangyu Lei, et al. Osworld: Benchmarking multimodal agents for open-ended tasks in real computer environments. Advances in Neural Information Processing Systems, 37:52040–52094, 2024
-
[70]
Haiyang Xu, Xi Zhang, Haowei Liu, Junyang Wang, Zhaozai Zhu, Shengjie Zhou, Xuhao Hu, Feiyu Gao, Junjie Cao, Zihua Wang, et al. Mobile-agent-v3.5: Multi-platform fundamental gui agents. arXiv preprint arXiv:2602.16855, 2026
-
[71]
Retrieval-augmented gui agents with generative guidelines
Ran Xu, Kaixin Ma, Wenhao Yu, Hongming Zhang, Joyce C Ho, Carl Yang, and Dong Yu. Retrieval-augmented gui agents with generative guidelines. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 17877–17886, 2025
-
[72]
A-MEM: Agentic Memory for LLM Agents
Wujiang Xu, Zujie Liang, Kai Mei, Hang Gao, Juntao Tan, and Yongfeng Zhang. A-mem: Agentic memory for llm agents. arXiv preprint arXiv:2502.12110, 2025
-
[73]
Taofeng Xue, Chong Peng, Mianqiu Huang, Linsen Guo, Tiancheng Han, Haozhe Wang, Jianing Wang, Xiaocheng Zhang, Xin Yang, Dengchang Zhao, et al. Evocua: Evolving computer use agents via learning from scalable synthetic experience. arXiv preprint arXiv:2601.15876, 2026
-
[74]
Haolong Yan, Jia Wang, Xin Huang, Yeqing Shen, Ziyang Meng, Zhimin Fan, Kaijun Tan, Jin Gao, Lieyu Shi, Mi Yang, et al. Step-gui technical report. arXiv preprint arXiv:2512.15431, 2025
-
[75]
Woongyeong Yeo, Kangsan Kim, Jaehong Yoon, and Sung Ju Hwang. Worldmm: Dynamic multimodal memory agent for long video reasoning. arXiv preprint arXiv:2512.02425, 2025
-
[76]
Yi Yu, Liuyi Yao, Yuexiang Xie, Qingquan Tan, Jiaqi Feng, Yaliang Li, and Libing Wu. Agentic memory: Learning unified long-term and short-term memory management for large language model agents. arXiv preprint arXiv:2601.01885, 2026
-
[77]
Yongsheng Yu, Ziyun Zeng, Hang Hua, Jianlong Fu, and Jiebo Luo. Promptfix: You prompt and we fix the photo. arXiv preprint arXiv:2405.16785, 2024
-
[78]
Yongsheng Yu, Ziyun Zeng, Haitian Zheng, and Jiebo Luo. Omnipaint: Mastering object-oriented editing via disentangled insertion-removal inpainting. arXiv preprint arXiv:2503.08677, 2025
-
[79]
Ziyun Zeng, Junyu Chen, Noha Rashwan, Nisreen Al Jallad, Jin Xiao, and Jiebo Luo. Automated detection and quantitative assessment of dental plaque in intraoral images. ACM Transactions on Computing for Healthcare, 7(2):1–12, 2026
-
[80]
Ziyun Zeng, Hang Hua, and Jiebo Luo. Mira: Multimodal iterative reasoning agent for image editing. arXiv preprint arXiv:2511.21087, 2025