MementoGUI: Learning Agentic Multimodal Memory Control for Long-Horizon GUI Agents

Bocheng Zou; Hang Hua; Jiebo Luo; Mu Cai; Rogerio Feris; Ziyun Zeng

arxiv: 2605.18652 · v1 · pith:AY5EP5PCnew · submitted 2026-05-18 · 💻 cs.CV

Ziyun Zeng, Hang Hua, Bocheng Zou, Mu Cai, Rogerio Feris, Jiebo Luo

Pith reviewed 2026-05-20 11:52 UTC · model grok-4.3

classification 💻 cs.CV

keywords: GUI agents · multimodal memory · long-horizon tasks · memory controller · episodic memory · memory compression · agentic framework · MementoCore

The pith

GUI agents handle long tasks better when a learned controller selects and compresses multimodal memory instead of replaying full history or using text alone.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents MementoGUI as a plug-in system that adds a controllable memory layer to existing MLLM-based GUI agents for tasks spanning many interface steps. Standard approaches either flood the model with every past screenshot or strip away visual details by summarizing in text only, leading to brittle performance when context must persist across dozens of actions. MementoGUI treats memory management as a learnable online control problem: a dedicated MementoCore module decides which events to retain with short text summaries plus targeted image regions, writes reusable past episodes, and retrieves them when relevant. Experiments across three benchmarks show consistent gains over no-history, full-replay, and text-only baselines, with larger memory-controller models amplifying the benefit. If the approach holds, agents could sustain coherent behavior on extended computer-use workflows without manual history engineering or context overflow.

Core claim

MementoGUI formulates long-horizon GUI control as an online memory-control problem solved by MementoCore, a modular learned controller that performs step processing, memory compression into textual summaries with ROI-level visual evidence, episodic writing of reusable trajectories, and relevance-based episodic selection, allowing plug-in augmentation of any MLLM GUI agent backbone without finetuning while delivering measurable gains on GUI-Odyssey, MM-Mind2Web, and the introduced MementoGUI-Bench.

What carries the argument

MementoCore, a modular learned controller with specialized operators for step processing, memory compression, episodic writing, and episodic selection that selectively preserves task-relevant interface events using both text and visual regions.
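
The four operators named above can be pictured as one control loop. The sketch below is illustrative only and is not the authors' implementation: in the paper the operators are learned, while here they are plain callables, and the names `MemoryController`, `MemoryEvent`, `Episode`, `write_threshold`, `max_working`, and every `toy_*` function are hypothetical stand-ins.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional, Tuple

Box = Tuple[int, int, int, int]  # hypothetical ROI format: (x1, y1, x2, y2)


@dataclass
class MemoryEvent:
    step: int
    summary: str         # short textual summary of the event
    roi: Optional[Box]   # localized visual evidence, if any


@dataclass
class Episode:
    task: str
    events: List[MemoryEvent]


@dataclass
class MemoryController:
    # The four operators; learned models in the paper, plain callables here.
    process_step: Callable[[str, str], Tuple[float, str, Optional[Box], bool]]
    compress: Callable[[List[MemoryEvent]], List[MemoryEvent]]
    write_episode: Callable[[str, List[MemoryEvent]], Episode]
    score_episode: Callable[[str, Episode], float]
    write_threshold: float = 0.5  # assumed gating value, not from the paper
    max_working: int = 3          # assumed working-memory budget
    working: List[MemoryEvent] = field(default_factory=list)
    bank: List[Episode] = field(default_factory=list)

    def observe(self, step: int, task: str, observation: str) -> Optional[Episode]:
        """Update working memory for one step; return a retrieved episode if requested."""
        salience, summary, roi, need_retrieval = self.process_step(task, observation)
        if salience >= self.write_threshold:
            self.working.append(MemoryEvent(step, summary, roi))
        if len(self.working) > self.max_working:
            self.working = self.compress(self.working)
        if need_retrieval and self.bank:
            return max(self.bank, key=lambda ep: self.score_episode(task, ep))
        return None

    def finish(self, task: str) -> None:
        """Write the finished trajectory to the episodic bank and reset working memory."""
        self.bank.append(self.write_episode(task, list(self.working)))
        self.working = []


# Rule-based toy operators so the loop can run without any model.
def toy_process_step(task: str, observation: str):
    salient = "cart" in observation
    roi = (0, 0, 100, 40) if salient else None
    return (0.9 if salient else 0.1), observation[:40], roi, observation.startswith("start")


def toy_compress(events: List[MemoryEvent]) -> List[MemoryEvent]:
    return events[-2:]  # keep only the two most recent events


def toy_write_episode(task: str, events: List[MemoryEvent]) -> Episode:
    return Episode(task, events)


def toy_score(task: str, episode: Episode) -> float:
    return float(len(set(task.split()) & set(episode.task.split())))


controller = MemoryController(toy_process_step, toy_compress, toy_write_episode, toy_score)
for step, obs in enumerate(["start page", "cart shows 1 item", "settings", "cart shows 2 items"]):
    controller.observe(step, "buy two items", obs)
kept_steps = [event.step for event in controller.working]

controller.finish("buy two items")
retrieved = controller.observe(0, "buy three items", "start page")
```

Keeping the operators as injected callables mirrors the plug-in claim: the loop never touches the GUI action backbone, so swapping in a different controller model would only change the four functions passed to the constructor.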

If this is right

Agents maintain coherent task state across dozens of steps without being overwhelmed by redundant screenshots.
Retaining localized visual evidence alongside text summaries improves decision accuracy over text-only memory.
Performance scales with the size of the MementoCore backbone, indicating that stronger memory control models yield larger gains.
MementoGUI-Bench and the associated MLLM-based metrics provide a standardized way to evaluate memory consistency in long GUI tasks.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The same selective multimodal memory pattern could transfer to long-horizon agent domains beyond GUI and web tasks, such as robotic manipulation, where full history replay is impractical.
If the curation pipeline proves robust, agent designers may shift effort from hand-crafted history rules to training data generation for memory controllers.
Testing whether the learned controller remains effective when the underlying GUI agent backbone is updated or replaced would clarify the degree of modularity achieved.

Load-bearing premise

The automated pipeline that converts raw computer-use trajectories into training data for the memory controller produces unbiased, high-quality examples that generalize directly to the evaluation benchmarks without post-hoc filtering or task-specific adjustments.

What would settle it

Apply MementoGUI to a new long-horizon GUI benchmark whose trajectories were collected independently of the training data curation pipeline and measure whether accuracy, task progress, and memory consistency still exceed the no-history, history-replay, and text-only baselines.

Figures

Figures reproduced from arXiv: 2605.18652 by Bocheng Zou, Hang Hua, Jiebo Luo, Mu Cai, Rogerio Feris, Ziyun Zeng.

**Figure 1.** Overview of the MementoGUI data curation pipeline. (A) Raw computer-use videos are parsed into hierarchical frame- and subgoal-level annotations. (B) These annotations are converted into SFT data for four MementoCore operators: step processing, memory compression, episodic memory writing, and episodic memory selection. (C) Step-processing and memory-compression samples are further corrupted and VLM-filtere…

**Figure 2.** MementoGUI augments a frozen GUI action backbone with multimodal working and episodic memory. It updates, retrieves, and writes memory, then serializes textual summaries and ROI references as multimodal context for GUI action prediction.

where o_t ∈ [0, 1] is a write-salience score, s_t is an event summary, b_t is a task-relevant ROI box, and γ_t indicates whether episodic retrieval is needed. This yields a pr…
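
As a rough illustration of the step-level quantities in this caption, the sketch below stores one record per step and serializes the retained ones as text with ROI references. The field names follow the caption's symbols; the `StepOutput` class, the 0.5 threshold, and the `<roi ...>` tag format are assumptions made for the example, not the paper's actual prompt format.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

Box = Tuple[int, int, int, int]  # hypothetical ROI format: (x1, y1, x2, y2)


@dataclass(frozen=True)
class StepOutput:
    o_t: float          # write-salience score in [0, 1]
    s_t: str            # event summary
    b_t: Optional[Box]  # task-relevant ROI box, if any
    gamma_t: bool       # whether episodic retrieval is needed

    def __post_init__(self) -> None:
        if not 0.0 <= self.o_t <= 1.0:
            raise ValueError("o_t must lie in [0, 1]")


def serialize_memory(history: List[Tuple[int, StepOutput]], threshold: float = 0.5) -> str:
    """Render retained events as text lines, each with an optional ROI reference."""
    lines = []
    for step, out in history:
        if out.o_t < threshold:  # assumed gating rule: drop low-salience steps
            continue
        roi = f" <roi step={step} box={out.b_t}>" if out.b_t is not None else ""
        lines.append(f"[step {step}] {out.s_t}{roi}")
    return "\n".join(lines)


history = [
    (0, StepOutput(0.2, "Opened home screen", None, False)),
    (1, StepOutput(0.8, "Entered address in search field", (12, 40, 300, 88), False)),
]
context = serialize_memory(history)
```

In a real system the ROI reference would presumably point at an image crop passed alongside the text; the tag here only marks where that crop would be attached.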

**Figure 3.** GUI-Odyssey performance by trajectory length on UI-Venus-1.5-8B.

**Figure 4.** Effect of episodic memory bank size on GUI-Odyssey across frozen GUI backbones.

Original abstract

Recent GUI agents have made substantial progress in visual grounding and action prediction, yet they remain brittle in long-horizon tasks that require maintaining task state across many interface transitions. Existing agents typically rely on raw history replay or text-only memory, which either overwhelms the model with redundant screenshots or discards localized visual evidence needed for future decisions. To address these limitations, we introduce **MementoGUI**, a plug-in agentic memory framework that equips MLLM-based GUI agents with **MementoCore**, a learned controller for online memory selection, compression, and retrieval. Rather than treating interaction history as a fixed context, MementoGUI formulates long-horizon GUI control as an online memory-control problem: working memory selectively preserves task-relevant interface events with textual summaries and ROI-level visual evidence, while episodic memory retrieves reusable past trajectories through learned relevance selection. MementoCore modularizes memory control into specialized operators for step processing, memory compression, episodic writing, and episodic selection, enabling plug-in memory augmentation without finetuning the GUI agent backbone. We further develop a scalable data curation pipeline that converts computer-use trajectories into memory-controller training data, introduce **MementoGUI-Bench** for evaluating long-horizon decision-making in GUI agents, and design MLLM-based metrics for semantic action matching, task progress, and memory consistency. Experiments on GUI-Odyssey, MM-Mind2Web, and MementoGUI-Bench show that MementoGUI consistently improves GUI agents over no-history, history-replay, and text-only memory baselines, with larger MementoCore backbones further strengthening memory-augmented GUI control.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

MementoGUI adds a modular four-operator memory controller for GUI agents and reports gains over replay baselines, but the data curation pipeline for training MementoCore is the unverified piece that could affect whether those gains hold up.

Hi, the main point is that this work treats long-horizon GUI control as an online memory-management task and decomposes it into four learned operators inside MementoCore: step processing, compression, episodic writing, and selection. That modular split is new relative to the raw-replay or text-only baselines they compare against, and the plug-in design means you can attach it to an existing MLLM agent without retraining the backbone. They also ship MementoGUI-Bench and some MLLM-based metrics for action matching and memory consistency, which is concrete engineering progress on a real pain point for multi-step interface tasks.

The reported improvements on GUI-Odyssey, MM-Mind2Web, and the new bench, plus the note that larger MementoCore backbones help more, give a clear signal that selective multimodal memory can reduce overload while keeping useful visual evidence. The soft spot is the data curation pipeline that turns raw trajectories into training examples for those operators. The abstract does not detail negative sampling, relevance labeling, overlap checks with the evaluation sets, or controls for distribution shift, so it is hard to know whether the gains are robust or partly an artifact of how the supervision was built. Without seeing the full tables, error bars, and ablations it is also difficult to judge how fair the baselines really are.

This is aimed at researchers and engineers building practical long-horizon GUI agents who need better state tracking than full history or text summaries provide. A reader focused on agent memory mechanisms would get usable ideas from the operator breakdown even if the empirical claims need more scrutiny. I would send it to peer review; the architecture is coherent and the problem is worth referee attention, so the details can be checked properly.

Referee Report

2 major / 2 minor

Summary. The paper introduces MementoGUI, a plug-in agentic memory framework for MLLM-based GUI agents that equips them with MementoCore, a learned controller for online memory selection, compression, and retrieval. It modularizes memory control into operators for step processing, compression, episodic writing, and selection, trained via a scalable data curation pipeline that converts computer-use trajectories into supervision. The authors introduce MementoGUI-Bench and MLLM-based metrics for semantic action matching, task progress, and memory consistency. Experiments on GUI-Odyssey, MM-Mind2Web, and MementoGUI-Bench claim consistent improvements over no-history, history-replay, and text-only baselines, with larger MementoCore backbones strengthening results.

Significance. If the central claims hold after verification, the modular plug-in design of MementoCore (without finetuning the GUI agent backbone) and the new benchmark could meaningfully advance long-horizon GUI agent research by addressing memory brittleness in a reusable way. The framework's separation of working and episodic memory with ROI-level visual evidence is a clear strength over raw replay or text-only approaches.

major comments (2)

[Method / Data Curation] Data curation pipeline (described in the method section following the abstract): no mechanics are provided for negative sampling, relevance labeling, overlap detection between curated trajectories and the evaluation benchmarks (GUI-Odyssey, MM-Mind2Web, MementoGUI-Bench), or explicit controls against distribution shift and post-hoc filtering. This is load-bearing for the central claim of unbiased generalization and consistent gains, as the improvements are attributed to the learned MementoCore trained on this pipeline.
[Experiments] Experimental results (abstract and §5): the claim of 'consistent improvements' and 'larger MementoCore backbones further strengthening' results is stated without reference to specific quantitative tables, error bars, ablation details on the curation choices, or statistical significance tests. This prevents assessment of whether baseline comparisons or metric definitions contain post-hoc choices that affect the reported gains.

minor comments (2)

[Method] The description of MementoCore operators would benefit from a single pseudocode listing or diagram showing the flow from step processing to episodic selection.
[Benchmarks] Clarify whether MementoGUI-Bench tasks were held out from the data curation pipeline or if any filtering was applied to avoid train-test leakage.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback and the recommendation for major revision. The comments highlight opportunities to improve transparency in the data curation pipeline and the reporting of experimental results. We address each point below and will incorporate the necessary clarifications and additions in the revised manuscript.

Referee: [Method / Data Curation] Data curation pipeline (described in the method section following the abstract): no mechanics are provided for negative sampling, relevance labeling, overlap detection between curated trajectories and the evaluation benchmarks (GUI-Odyssey, MM-Mind2Web, MementoGUI-Bench), or explicit controls against distribution shift and post-hoc filtering. This is load-bearing for the central claim of unbiased generalization and consistent gains, as the improvements are attributed to the learned MementoCore trained on this pipeline.

Authors: We agree that the current description of the data curation pipeline in Section 3.3 would benefit from greater detail on these aspects. In the revised manuscript we will expand the pipeline description to explicitly cover: (1) negative sampling by selecting trajectories with low semantic overlap to the target task and assigning corresponding low relevance labels via MLLM scoring; (2) relevance labeling using a combination of action-sequence similarity and visual ROI matching; (3) overlap detection between curated training trajectories and the three evaluation benchmarks via normalized edit distance on action sequences and visual feature cosine similarity, with any overlapping trajectories removed; and (4) controls for distribution shift including temporal hold-out splits and post-hoc filtering based on trajectory length and task category balance. These additions will make the training data generation process fully reproducible and strengthen the claim of unbiased generalization.
revision: yes

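
The overlap check described in point (3) of this response could look roughly like the following. This is a sketch under stated assumptions, not the procedure from the paper: the `Trajectory` structure, both thresholds (`max_dist`, `min_cos`), and the choice to flag a pair only when both criteria agree are hypothetical, since the response does not say how the two signals are combined.

```python
import math
from dataclasses import dataclass
from typing import List, Sequence


@dataclass
class Trajectory:
    name: str
    actions: List[str]    # action sequence, e.g. ["click", "type", ...]
    feature: List[float]  # pooled visual feature vector (assumed precomputed)


def edit_distance(a: Sequence[str], b: Sequence[str]) -> int:
    """Levenshtein distance between two action sequences (row-by-row DP)."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        cur = [i]
        for j, y in enumerate(b, 1):
            cur.append(min(prev[j] + 1,              # deletion
                           cur[j - 1] + 1,           # insertion
                           prev[j - 1] + (x != y)))  # substitution or match
        prev = cur
    return prev[-1]


def normalized_edit_distance(a: Sequence[str], b: Sequence[str]) -> float:
    longest = max(len(a), len(b))
    return edit_distance(a, b) / longest if longest else 0.0


def cosine_similarity(u: Sequence[float], v: Sequence[float]) -> float:
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def is_overlap(train: Trajectory, test: Trajectory,
               max_dist: float = 0.2, min_cos: float = 0.95) -> bool:
    # Assumed rule: flag only when actions AND visuals both look near-identical.
    return (normalized_edit_distance(train.actions, test.actions) <= max_dist
            and cosine_similarity(train.feature, test.feature) >= min_cos)


def remove_overlaps(train: List[Trajectory], evaluation: List[Trajectory]) -> List[Trajectory]:
    """Drop every training trajectory that overlaps any evaluation trajectory."""
    return [t for t in train if not any(is_overlap(t, e) for e in evaluation)]


train_set = [
    Trajectory("a", ["click", "type", "click"], [1.0, 0.0]),
    Trajectory("b", ["scroll", "click", "type", "click"], [0.0, 1.0]),
]
eval_set = [Trajectory("e", ["click", "type", "click"], [0.9, 0.1])]
kept = [t.name for t in remove_overlaps(train_set, eval_set)]
```

Requiring both signals avoids discarding training data merely because short action sequences such as click, type, click recur across unrelated tasks; a stricter leakage policy would flag a pair when either signal fires.
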
Referee: [Experiments] Experimental results (abstract and §5): the claim of 'consistent improvements' and 'larger MementoCore backbones further strengthening' results is stated without reference to specific quantitative tables, error bars, ablation details on the curation choices, or statistical significance tests. This prevents assessment of whether baseline comparisons or metric definitions contain post-hoc choices that affect the reported gains.

Authors: We acknowledge that the experimental claims in the abstract and Section 5 would be clearer with tighter linkage to the reported numbers. In the revised version we will: (1) insert explicit cross-references (e.g., “as shown in Table 2, row 3”) for every statement of improvement; (2) add error bars representing standard deviation across five random seeds for all main results; (3) include a new ablation table examining the impact of negative-sampling ratio and relevance-labeling threshold on final performance; and (4) report paired t-test p-values comparing MementoGUI against each baseline to establish statistical significance. These changes will allow readers to directly evaluate the robustness of the gains and the absence of post-hoc metric adjustments.
revision: yes
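
For reference, the per-seed statistics promised in points (2) and (4) reduce to a few lines of standard-library Python. The sketch below computes mean, sample standard deviation, and the paired t statistic. The score lists are synthetic placeholders invented for this example and are NOT results from the paper; the function names are likewise hypothetical.

```python
import math
import statistics
from typing import Sequence, Tuple


def mean_and_std(scores: Sequence[float]) -> Tuple[float, float]:
    """Mean and sample standard deviation across seeds (the error bar)."""
    return statistics.mean(scores), statistics.stdev(scores)


def paired_t_statistic(method: Sequence[float], baseline: Sequence[float]) -> Tuple[float, int]:
    """Paired t statistic and degrees of freedom for seed-matched runs."""
    if len(method) != len(baseline) or len(method) < 2:
        raise ValueError("need two equally long lists with at least two seeds")
    diffs = [m - b for m, b in zip(method, baseline)]
    sd = statistics.stdev(diffs)
    if sd == 0:
        raise ValueError("differences are constant; t is undefined")
    t = statistics.mean(diffs) / (sd / math.sqrt(len(diffs)))
    return t, len(diffs) - 1


# Synthetic placeholder scores for five seeds; NOT numbers from the paper.
synthetic_baseline = [0.50, 0.52, 0.48, 0.51, 0.49]
synthetic_method = [0.55, 0.56, 0.54, 0.55, 0.55]

baseline_mean, baseline_std = mean_and_std(synthetic_baseline)
t_stat, dof = paired_t_statistic(synthetic_method, synthetic_baseline)
```

Turning the t statistic into a p-value requires the Student t distribution with `dof` degrees of freedom, which the standard library does not provide; `scipy.stats.ttest_rel` computes both in one call. With only five seeds the test has little power, so effect sizes and error bars remain worth reporting alongside any p-value.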

Circularity Check

0 steps flagged

No circularity: empirical results from independent training and evaluation

The paper introduces MementoGUI as a modular plug-in memory controller with MementoCore operators trained via a data curation pipeline on computer-use trajectories. Reported gains on GUI-Odyssey, MM-Mind2Web, and MementoGUI-Bench are presented as direct experimental comparisons to no-history, history-replay, and text-only baselines. No equations, self-definitional reductions, fitted inputs renamed as predictions, or load-bearing self-citations appear in the derivation; the central claims rest on external benchmark performance rather than quantities defined by the same parameters or inputs used to generate the result. The framework is self-contained against the stated evaluation metrics.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 2 invented entities

The central claim rests on the existence of a scalable, unbiased trajectory-to-training-data pipeline and on the assumption that a separate learned controller can be attached without retraining the base MLLM. No explicit numerical free parameters are named, but the learned relevance selection and compression operators implicitly contain fitted weights. MementoCore and MementoGUI-Bench are new entities introduced without external validation.

free parameters (1)

Learned relevance and compression parameters inside MementoCore
Weights of the memory-selection and compression operators are fitted on the curated trajectory data.

axioms (1)

domain assumption: A plug-in memory controller can be trained and attached to an unchanged MLLM backbone while still producing measurable gains on long-horizon tasks
The design explicitly avoids finetuning the GUI agent backbone.

invented entities (2)

MementoCore (no independent evidence)
purpose: Learned controller that performs step processing, memory compression, episodic writing, and episodic selection
New modular component introduced to manage multimodal memory.
MementoGUI-Bench (no independent evidence)
purpose: Benchmark for long-horizon decision-making and memory consistency in GUI agents
New evaluation suite developed for the paper.


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/AbsoluteFloorClosure.lean · reality_from_one_distinction
Relation between the paper passage and the cited Recognition theorem: unclear
Paper passage: "MEMENTOCORE modularizes memory control into specialized operators for step processing, memory compression, episodic writing, and episodic selection... scalable data curation pipeline that converts computer-use trajectories into memory-controller training data"

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel
Relation between the paper passage and the cited Recognition theorem: unclear
Paper passage: "working memory selectively preserves task-relevant interface events with textual summaries and ROI-level visual evidence, while episodic memory retrieves reusable past trajectories through learned relevance selection"

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

87 extracted references · 87 canonical work pages · 20 internal anchors

[1]

Agent S2: A Compositional Generalist-Specialist Framework for Computer Use Agents

Saaket Agashe, Kyle Wong, Vincent Tu, Jiachen Yang, Ang Li, and Xin Eric Wang. Agent s2: A compositional generalist-specialist framework for computer use agents.arXiv preprint arXiv:2504.00906, 2025

work page internal anchor Pith review arXiv 2025
[2]

Sparc: Separating perception and reasoning circuits for test-time scaling of vlms.arXiv preprint arXiv:2602.06566, 2026

Niccolo Avogaro, Nayanika Debnath, Li Mi, Thomas Frick, Junling Wang, Zexue He, Hang Hua, Konrad Schindler, and Mattia Rigotti. Sparc: Separating perception and reasoning circuits for test-time scaling of vlms.arXiv preprint arXiv:2602.06566, 2026

work page arXiv 2026
[3]

Qwen3-VL Technical Report

Shuai Bai, Yuxuan Cai, Ruizhe Chen, Keqin Chen, Xionghui Chen, Zesen Cheng, Lianghao Deng, Wei Ding, Chang Gao, Chunjiang Ge, et al. Qwen3-vl technical report.arXiv preprint arXiv:2511.21631, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[4]

Memori: A persistent memory layer for efficient, context-aware llm agents.arXiv preprint arXiv:2603.19935, 2026

Luiz C Borro, Luiz AB Macarini, Gordon Tindall, Michael Montero, and Adam B Struck. Memori: A persistent memory layer for efficient, context-aware llm agents.arXiv preprint arXiv:2603.19935, 2026

work page arXiv 2026
[5]

Presto: Pro- gressive pretraining enhances synthetic chemistry outcomes.arXiv preprint arXiv:2406.13193, 2024

He Cao, Yanjun Shao, Zhiyuan Liu, Zijing Liu, Xiangru Tang, Yuan Yao, and Yu Li. Presto: Pro- gressive pretraining enhances synthetic chemistry outcomes.arXiv preprint arXiv:2406.13193, 2024

work page arXiv 2024
[6]

Telemem: Building long-term and multimodal memory for agentic ai.arXiv preprint arXiv:2601.06037, 2025

Chunliang Chen, Ming Guan, Xiao Lin, Jiaxu Li, Luxi Lin, Qiyi Wang, Xiangyu Chen, Jixiang Luo, Changzhi Sun, Dell Zhang, et al. Telemem: Building long-term and multimodal memory for agentic ai.arXiv preprint arXiv:2601.06037, 2025

work page arXiv 2025
[7]

Less is more: Empowering gui agent with context- aware simplification

Gongwei Chen, Xurui Zhou, Rui Shao, Yibo Lyu, Kaiwen Zhou, Shuai Wang, Wentao Li, Yinchuan Li, Zhongang Qi, and Liqiang Nie. Less is more: Empowering gui agent with context- aware simplification. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 5901–5911, 2025

work page 2025
[8]

Seeclick: Harnessing gui grounding for advanced visual gui agents

Kanzhi Cheng, Qiushi Sun, Yougang Chu, Fangzhi Xu, Li YanTao, Jianbing Zhang, and Zhiyong Wu. Seeclick: Harnessing gui grounding for advanced visual gui agents. InProceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 9313–9332, 2024

work page 2024
[9]

MGA: Memory-Driven GUI Agent for Observation-Centric Interaction

Weihua Cheng, Ersheng Ni, Wenlong Wang, Yifei Sun, Junming Liu, Wangyu Shen, Yirong Chen, Botian Shi, and Ding Wang. Mga: Memory-driven gui agent for observation-centric interaction.arXiv preprint arXiv:2510.24168, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[10]

Mind2web: Towards a generalist agent for the web.Advances in Neural Information Processing Systems, 36:28091–28114, 2023

Xiang Deng, Yu Gu, Boyuan Zheng, Shijie Chen, Sam Stevens, Boshi Wang, Huan Sun, and Yu Su. Mind2web: Towards a generalist agent for the web.Advances in Neural Information Processing Systems, 36:28091–28114, 2023

work page 2023
[11]

Memp: Exploring Agent Procedural Memory

Runnan Fang, Yuan Liang, Xiaobin Wang, Jialong Wu, Shuofei Qiao, Pengjun Xie, Fei Huang, Huajun Chen, and Ningyu Zhang. Memp: Exploring agent procedural memory.arXiv preprint arXiv:2508.06433, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[12]

Ui-venus-1.5 technical report.arXiv preprint arXiv:2602.09082, 2026

Changlong Gao, Zhangxuan Gu, Yulin Liu, Xinyu Qiu, Shuheng Shen, Yue Wen, Tianyu Xia, Zhenyu Xu, Zhengwen Zeng, Beitong Zhou, et al. Ui-venus-1.5 technical report.arXiv preprint arXiv:2602.09082, 2026

work page arXiv 2026
[13]

Chain-of-memory: Enhancing gui agents for cross-application navigation.arXiv preprint arXiv:2506.18158, 2025

Xinzge Gao, Chuanrui Hu, Bin Chen, and Teng Li. Chain-of-memory: Enhancing gui agents for cross-application navigation.arXiv preprint arXiv:2506.18158, 2025

work page arXiv 2025
[14]

Gemini 3.1 Pro Model Card

Google DeepMind. Gemini 3.1 Pro Model Card. https://storage.googleapis.com/ deepmind-media/Model-Cards/Gemini-3-1-Pro-Model-Card.pdf , February 2026. Published February 2026; updated 19 February 2026

work page 2026
[15]

Navigating the Digital World as Humans Do: Universal Visual Grounding for GUI Agents

Boyu Gou, Ruohan Wang, Boyuan Zheng, Yanan Xie, Cheng Chang, Yiheng Shu, Huan Sun, and Yu Su. Navigating the digital world as humans do: Universal visual grounding for gui agents.arXiv preprint arXiv:2410.05243, 2024. 10

work page internal anchor Pith review Pith/arXiv arXiv 2024
[16]

MemFactory: Unified Inference & Training Framework for Agent Memory

Ziliang Guo, Ziheng Li, and Zhiyu Li. Memfactory: Unified inference & training framework for agent memory.arXiv preprint arXiv:2603.29493, 2026

work page internal anchor Pith review Pith/arXiv arXiv 2026
[17]

Cogagent: A visual language model for gui agents

Wenyi Hong, Weihan Wang, Qingsong Lv, Jiazheng Xu, Wenmeng Yu, Junhui Ji, Yan Wang, Zihan Wang, Yuxiao Dong, Ming Ding, et al. Cogagent: A visual language model for gui agents. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 14281–14290, 2024

work page 2024
[18]

Computer use data - paradigm shift ai, 2025

Anais Howland, Ashwin Thinnappan, and Jameel Shahid Mohammed. Computer use data - paradigm shift ai, 2025

work page 2025
[19]

Hiagent: Hierarchical working memory management for solving long-horizon agent tasks with large lan- guage model

Mengkang Hu, Tianxing Chen, Qiguang Chen, Yao Mu, Wenqi Shao, and Ping Luo. Hiagent: Hierarchical working memory management for solving long-horizon agent tasks with large lan- guage model. InProceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 32779–32798, 2025

work page 2025
[20]

Smith, and Jiebo Luo

Yushi Hu, Hang Hua, Zhengyuan Yang, Weijia Shi, Noah A. Smith, and Jiebo Luo. Promptcap: Prompt-guided image captioning for vqa with gpt-3.2023 IEEE/CVF International Conference on Computer Vision (ICCV), pages 2951–2963, 2023

work page 2023
[21]

Finecaption: Compositional image captioning focusing on wherever you want at any granularity

Hang Hua, Qing Liu, Lingzhi Zhang, Jing Shi, Soo Ye Kim, Zhifei Zhang, Yilin Wang, Jianming Zhang, Zhe Lin, and Jiebo Luo. Finecaption: Compositional image captioning focusing on wherever you want at any granularity. InProceedings of the computer vision and pattern recognition conference, pages 24763–24773, 2025

work page 2025
[22]

Collomosse, Scott Cohen, and Jiebo Luo

Hang Hua, Jing Shi, Kushal Kafle, Simon Jenni, Daoan Zhang, John P. Collomosse, Scott Cohen, and Jiebo Luo. Finematch: Aspect-based fine-grained image and text mismatch detection and correction. InEuropean Conference on Computer Vision, 2024

work page 2024
[23]

V2xum-llm: Cross-modal video summarization with temporal prompt instruction tuning

Hang Hua, Yunlong Tang, Chenliang Xu, and Jiebo Luo. V2xum-llm: Cross-modal video summarization with temporal prompt instruction tuning. InProceedings of the AAAI Conference on Artificial Intelligence, volume 39, pages 3599–3607, 2025

work page 2025
[24]

Mmcomposition: Revisiting the compositionality of pre-trained vision-language models.arXiv preprint arXiv:2410.09733,

Hang Hua, Yunlong Tang, Ziyun Zeng, Liangliang Cao, Zhengyuan Yang, Hangfeng He, Chenliang Xu, and Jiebo Luo. Mmcomposition: Revisiting the compositionality of pre-trained vision-language models.arXiv preprint arXiv:2410.09733, 2024

work page arXiv 2024
[25]

Aliaga, Wei Xiong, and Jiebo Luo

Hang Hua, Ziyun Zeng, Yizhi Song, Yunlong Tang, Liu He, Daniel G. Aliaga, Wei Xiong, and Jiebo Luo. Mmig-bench: Towards comprehensive and explainable evaluation of multi-modal image generation models.ArXiv, abs/2505.19415, 2025

work page arXiv 2025
[26]

Dave: A vlm vision encoder for document understanding and web agents.arXiv preprint arXiv:2512.17221, 2025

Brandon Huang, Hang Hua, Zhuoran Yu, Trevor Darrell, Rogerio Feris, and Roei Herzig. Dave: A vlm vision encoder for document understanding and web agents.arXiv preprint arXiv:2512.17221, 2025

work page arXiv 2025
[27]

Building a foundational guardrail for general agentic systems via synthetic data.arXiv preprint arXiv:2510.09781, 2025

Yue Huang, Hang Hua, Yujun Zhou, Pengcheng Jing, Manish Nagireddy, Inkit Padhi, Greta Dolcetti, Zhangchen Xu, Subhajit Chaudhury, Ambrish Rawat, et al. Building a foundational guardrail for general agentic systems via synthetic data.arXiv preprint arXiv:2510.09781, 2025

work page arXiv 2025
[28]

Ap- pagentx: Evolving gui agents as proficient smartphone users.arXiv preprint arXiv:2503.02268, 2025

Wenjia Jiang, Yangyang Zhuang, Chenxi Song, Xu Yang, Joey Tianyi Zhou, and Chi Zhang. Ap- pagentx: Evolving gui agents as proficient smartphone users.arXiv preprint arXiv:2503.02268, 2025

work page arXiv 2025
[29]

Visualwebarena: Evaluating multimodal agents on realistic visual web tasks

Jing Yu Koh, Robert Lo, Lawrence Jang, Vikram Duvvur, Ming Lim, Po-Yu Huang, Graham Neubig, Shuyan Zhou, Russ Salakhutdinov, and Daniel Fried. Visualwebarena: Evaluating multimodal agents on realistic visual web tasks. InProceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 881–905, 2024

work page 2024
[30]

Mobilegpt: Augmenting llm with human-like app memory for mobile task automation

Sunjae Lee, Junyoung Choi, Jungjae Lee, Munim Hasan Wasi, Hojun Choi, Steve Ko, Sangeun Oh, and Insik Shin. Mobilegpt: Augmenting llm with human-like app memory for mobile task automation. InProceedings of the 30th Annual International Conference on Mobile Computing and Networking, pages 1119–1133, 2024. 11

work page 2024
[31]

Grounding multimodal large language model in gui world

Weixian Lei, Difei Gao, and Mike Zheng Shou. Grounding multimodal large language model in gui world. InThe Thirteenth International Conference on Learning Representations, 2025

work page 2025
[32]

EchoTrail-GUI: Building Actionable Memory for GUI Agents via Critic-Guided Self-Exploration

Runze Li, Yuwen Zhai, Bo Xu, LiWu Xu, Nian Shi, Wei Zhang, Ran Lin, and Liang Wang. Echotrail-gui: Building actionable memory for gui agents via critic-guided self-exploration. arXiv preprint arXiv:2512.19396, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[33]

Optimus-1: Hybrid multimodal memory empowered agents excel in long-horizon tasks.Advances in neural information processing systems, 37:49881–49913, 2024

Zaijing Li, Yuquan Xie, Rui Shao, Gongwei Chen, Dongmei Jiang, and Liqiang Nie. Optimus-1: Hybrid multimodal memory empowered agents excel in long-horizon tasks.Advances in neural information processing systems, 37:49881–49913, 2024

work page 2024
[34]

Showui: One vision-language-action model for gui visual agent

Kevin Qinghong Lin, Linjie Li, Difei Gao, Zhengyuan Yang, Shiwei Wu, Zechen Bai, Stan Weix- ian Lei, Lijuan Wang, and Mike Zheng Shou. Showui: One vision-language-action model for gui visual agent. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 19498–19508, 2025

work page 2025
[35]

Visual instruction tuning.Advances in neural information processing systems, 36:34892–34916, 2023

Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning.Advances in neural information processing systems, 36:34892–34916, 2023

work page 2023
[36]

SimpleMem: Efficient Lifelong Memory for LLM Agents

Jiaqi Liu, Yaofeng Su, Peng Xia, Siwei Han, Zeyu Zheng, Cihang Xie, Mingyu Ding, and Huaxiu Yao. Simplemem: Efficient lifelong memory for llm agents.arXiv preprint arXiv:2601.02553, 2026

[37] Jiateng Liu, Zhenhailong Wang, Rushi Wang, Bingxuan Li, Jeonghwan Kim, Aditi Tiwari, Pengfei Yu, Denghui Zhang, and Heng Ji. Osexpert: Computer-use agents learning professional skills via exploration. arXiv preprint arXiv:2603.07978, 2026.

[38] Junming Liu, Yifei Sun, Weihua Cheng, Haodong Lei, Yirong Chen, Licheng Wen, Xuemeng Yang, Daocheng Fu, Pinlong Cai, Nianchen Deng, et al. Memverse: Multimodal memory for lifelong learning agents. arXiv preprint arXiv:2512.03627, 2025.

[39] Zhaoyang Liu, JingJing Xie, Zichen Ding, Zehao Li, Bowen Yang, Zhenyu Wu, Xuehui Wang, Qiushi Sun, Shi Liu, Weiyun Wang, et al. Scalecua: Scaling open-source computer use agents with cross-platform data. arXiv preprint arXiv:2509.15221, 2025.

[40] Quanfeng Lu, Wenqi Shao, Zitao Liu, Lingxiao Du, Fanqing Meng, Boxuan Li, Botong Chen, Siyuan Huang, Kaipeng Zhang, and Ping Luo. Guiodyssey: A comprehensive dataset for cross-app gui navigation on mobile devices. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 22404–22414, 2025.

[41] Sidi Lu, Zhenwen Liang, Dongyang Ma, Yan Wang, Haitao Mi, and Dong Yu. Locas: Your models are principled initializers of locally-supported parametric memories. arXiv preprint arXiv:2602.05085, 2026.

[42] Sarthak Mehrotra, Sairam VC Rebbapragada, Mani Hemanth Reddy Bonthu, and Vineeth N Balasubramanian. ishift: Lightweight slow-fast gui agent with adaptive perception. arXiv preprint arXiv:2512.22009, 2025.

[43] Joon Sung Park, Joseph O’Brien, Carrie Jun Cai, Meredith Ringel Morris, Percy Liang, and Michael S Bernstein. Generative agents: Interactive simulacra of human behavior. In Proceedings of the 36th Annual ACM Symposium on User Interface Software and Technology, pages 1–22, 2023.

[44] Yujia Qin, Yining Ye, Junjie Fang, Haoming Wang, Shihao Liang, Shizuo Tian, Junda Zhang, Jiahao Li, Yunxin Li, Shijue Huang, et al. Ui-tars: Pioneering automated gui interaction with native agents. arXiv preprint arXiv:2501.12326, 2025.

[45] Christopher Rawles, Alice Li, Daniel Rodriguez, Oriana Riva, and Timothy Lillicrap. Androidinthewild: A large-scale dataset for android device control. Advances in Neural Information Processing Systems, 36:59708–59728, 2023.

[46] Gabriel Sarch, Lawrence Jang, Michael J Tarr, William W Cohen, Kenneth Marino, and Katerina Fragkiadaki. Vlm agents generate their own memories: Distilling experience into embodied programs of thought. Advances in Neural Information Processing Systems, 37:75942–75985, 2024.

[47] Noah Shinn, Federico Cassano, Ashwin Gopinath, Karthik Narasimhan, and Shunyu Yao. Reflexion: Language agents with verbal reinforcement learning. Advances in neural information processing systems, 36:8634–8652, 2023.

[48] Aaditya Singh, Adam Fry, Adam Perelman, Adam Tart, Adi Ganesh, Ahmed El-Kishky, Aidan McLaughlin, Aiden Low, AJ Ostrow, Akhila Ananthram, et al. Openai gpt-5 system card. arXiv preprint arXiv:2601.03267, 2026.

[49] Changzhi Sun, Xiangyu Chen, Jixiang Luo, Dell Zhang, and Xuelong Li. Beyond heuristics: A decision-theoretic framework for agent memory management. arXiv preprint arXiv:2512.21567, 2025.

[50] Guohao Sun, Hang Hua, Jian Wang, Jiebo Luo, Sohail A. Dianat, Majid Rabbani, Raghuveer Rao, and Zhiqiang Tao. Latent chain-of-thought for visual reasoning. arXiv preprint arXiv:2510.23925, 2025.

[51] Zeyi Sun, Ziyu Liu, Yuhang Zang, Yuhang Cao, Xiaoyi Dong, Tong Wu, Dahua Lin, and Jiaqi Wang. Seagent: Self-evolving computer use agent with autonomous learning from experience. arXiv preprint arXiv:2508.04700, 2025.

[52] Weihao Tan, Wentao Zhang, Xinrun Xu, Haochong Xia, Ziluo Ding, Boyu Li, Bohan Zhou, Junpeng Yue, Jiechuan Jiang, Yewen Li, et al. Cradle: Empowering foundation agents towards general computer control. arXiv preprint arXiv:2403.03186, 2024.

[53] Xiangru Tang, Tianyu Hu, Muyang Ye, Yanjun Shao, Xunjian Yin, Siru Ouyang, Wangchunshu Zhou, Pan Lu, Zhuosheng Zhang, Yilun Zhao, et al. Chemagent: Self-updating library in large language models improves chemical reasoning. arXiv preprint arXiv:2501.06590, 2025.

[54] Xiangru Tang, Daniel Shao, Jiwoong Sohn, Jiapeng Chen, Jiayi Zhang, Jinyu Xiang, Fang Wu, Yilun Zhao, Chenglin Wu, Wenqi Shi, et al. Medagentsbench: Benchmarking thinking models and agent frameworks for complex medical reasoning. arXiv preprint arXiv:2503.07459, 2025.

[55] Xiangru Tang, Wanghan Xu, Yujie Wang, Zijie Guo, Daniel Shao, Jiapeng Chen, Cixuan Zhang, Ziyi Wang, Lixin Zhang, Guancheng Wan, et al. Eigen-1: Adaptive multi-agent refinement with monitor-based rag for scientific reasoning. arXiv preprint arXiv:2509.21193, 2025.

[56] Xiangru Tang, Zhuoyun Yu, Jiapeng Chen, Yan Cui, Daniel Shao, Weixu Wang, Fang Wu, Yuchen Zhuang, Wenqi Shi, Zhi Huang, et al. Cellforge: Agentic design of virtual cell models. arXiv preprint arXiv:2508.02276, 2025.

[57] Tristan Thrush, Ryan Jiang, Max Bartolo, Amanpreet Singh, Adina Williams, Douwe Kiela, and Candace Ross. Winoground: Probing vision and language models for visio-linguistic compositionality. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 5238–5248, 2022.

[58] Shizuo Tian, Hao Wen, Yuxuan Chen, Jiacheng Liu, Shanhui Zhao, Guohong Liu, Ju Ren, Yunxin Liu, and Yuanchun Li. Agentprog: Empowering long-horizon gui agents with program-guided context management. arXiv preprint arXiv:2512.10371, 2025.

[59] Guanzhi Wang, Yuqi Xie, Yunfan Jiang, Ajay Mandlekar, Chaowei Xiao, Yuke Zhu, Linxi Fan, and Anima Anandkumar. Voyager: An open-ended embodied agent with large language models. arXiv preprint arXiv:2305.16291, 2023.

[60] Junyang Wang, Haiyang Xu, Haitao Jia, Xi Zhang, Ming Yan, Weizhou Shen, Ji Zhang, Fei Huang, and Jitao Sang. Mobile-agent-v2: Mobile device operation assistant with effective navigation via multi-agent collaboration. Advances in Neural Information Processing Systems, 37:2686–2710, 2024.

[61] Weiyun Wang, Zhangwei Gao, Lixin Gu, Hengjun Pu, Long Cui, Xingguang Wei, Zhaoyang Liu, Linglin Jing, Shenglong Ye, Jie Shao, et al. Internvl3.5: Advancing open-source multimodal models in versatility, reasoning, and efficiency. arXiv preprint arXiv:2508.18265, 2025.

[62] Xinyu Wang, Mingze Li, Peng Lu, Xiao-Wen Chang, Lifeng Shang, Jinping Li, Fei Mi, Prasanna Parthasarathi, and Yufei Cui. Infmem: Learning system-2 memory control for long-context agent. arXiv preprint arXiv:2602.02704, 2026.

[63] Zihao Wang, Shaofei Cai, Anji Liu, Yonggang Jin, Jinbing Hou, Bowei Zhang, Haowei Lin, Zhaofeng He, Zilong Zheng, Yaodong Yang, et al. Jarvis-1: Open-world multi-task agents with memory-augmented multimodal language models. IEEE Transactions on Pattern Analysis and Machine Intelligence, 47(3):1894–1907, 2024.

[64] Ziwei Wang, Leyang Yang, Xiaoxuan Tang, Sheng Zhou, Dajun Chen, Wei Jiang, and Yong Li. History-aware reasoning for gui agents. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 40, pages 36448–36456, 2026.

[65] Zora Zhiruo Wang, Jiayuan Mao, Daniel Fried, and Graham Neubig. Agent workflow memory. arXiv preprint arXiv:2409.07429, 2024.

[66] Wenyi Wu, Kun Zhou, Ruoxin Yuan, Vivian Yu, Stephen Wang, Zhiting Hu, and Biwei Huang. Auto-scaling continuous memory for gui agent. arXiv preprint arXiv:2510.09038, 2025.

[67] Zhiyong Wu, Zhenyu Wu, Fangzhi Xu, Yian Wang, Qiushi Sun, Chengyou Jia, Kanzhi Cheng, Zichen Ding, Liheng Chen, Paul Pu Liang, et al. Os-atlas: A foundation action model for generalist gui agents. arXiv preprint arXiv:2410.23218, 2024.

[68] Han Xiao, Guozhi Wang, Hao Wang, Shilong Liu, Yuxiang Chai, Yue Pan, Yufeng Zhou, Xiaoxin Chen, Yafei Wen, and Hongsheng Li. Ui-mem: Self-evolving experience memory for online reinforcement learning in mobile gui agents. arXiv preprint arXiv:2602.05832, 2026.

[69] Tianbao Xie, Danyang Zhang, Jixuan Chen, Xiaochuan Li, Siheng Zhao, Ruisheng Cao, Toh J Hua, Zhoujun Cheng, Dongchan Shin, Fangyu Lei, et al. Osworld: Benchmarking multimodal agents for open-ended tasks in real computer environments. Advances in Neural Information Processing Systems, 37:52040–52094, 2024.

[70] Haiyang Xu, Xi Zhang, Haowei Liu, Junyang Wang, Zhaozai Zhu, Shengjie Zhou, Xuhao Hu, Feiyu Gao, Junjie Cao, Zihua Wang, et al. Mobile-agent-v3.5: Multi-platform fundamental gui agents. arXiv preprint arXiv:2602.16855, 2026.

[71] Ran Xu, Kaixin Ma, Wenhao Yu, Hongming Zhang, Joyce C Ho, Carl Yang, and Dong Yu. Retrieval-augmented gui agents with generative guidelines. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 17877–17886, 2025.

[72] Wujiang Xu, Zujie Liang, Kai Mei, Hang Gao, Juntao Tan, and Yongfeng Zhang. A-mem: Agentic memory for llm agents. arXiv preprint arXiv:2502.12110, 2025.

[73] Taofeng Xue, Chong Peng, Mianqiu Huang, Linsen Guo, Tiancheng Han, Haozhe Wang, Jianing Wang, Xiaocheng Zhang, Xin Yang, Dengchang Zhao, et al. Evocua: Evolving computer use agents via learning from scalable synthetic experience. arXiv preprint arXiv:2601.15876, 2026.

[74] Haolong Yan, Jia Wang, Xin Huang, Yeqing Shen, Ziyang Meng, Zhimin Fan, Kaijun Tan, Jin Gao, Lieyu Shi, Mi Yang, et al. Step-gui technical report. arXiv preprint arXiv:2512.15431, 2025.

[75] Woongyeong Yeo, Kangsan Kim, Jaehong Yoon, and Sung Ju Hwang. Worldmm: Dynamic multimodal memory agent for long video reasoning. arXiv preprint arXiv:2512.02425, 2025.

[76] Yi Yu, Liuyi Yao, Yuexiang Xie, Qingquan Tan, Jiaqi Feng, Yaliang Li, and Libing Wu. Agentic memory: Learning unified long-term and short-term memory management for large language model agents. arXiv preprint arXiv:2601.01885, 2026.

[77] Yongsheng Yu, Ziyun Zeng, Hang Hua, Jianlong Fu, and Jiebo Luo. Promptfix: You prompt and we fix the photo. arXiv preprint arXiv:2405.16785, 2024.

[78] Yongsheng Yu, Ziyun Zeng, Haitian Zheng, and Jiebo Luo. Omnipaint: Mastering object-oriented editing via disentangled insertion-removal inpainting. arXiv preprint arXiv:2503.08677, 2025.

[79] Ziyun Zeng, Junyu Chen, Noha Rashwan, Nisreen Al Jallad, Jin Xiao, and Jiebo Luo. Automated detection and quantitative assessment of dental plaque in intraoral images. ACM Transactions on Computing for Healthcare, 7(2):1–12, 2026.

[80] Ziyun Zeng, Hang Hua, and Jiebo Luo. Mira: Multimodal iterative reasoning agent for image editing. arXiv preprint arXiv:2511.21087, 2025.


Showing first 80 references.

[1] [1]

Agent S2: A Compositional Generalist-Specialist Framework for Computer Use Agents

Saaket Agashe, Kyle Wong, Vincent Tu, Jiachen Yang, Ang Li, and Xin Eric Wang. Agent s2: A compositional generalist-specialist framework for computer use agents.arXiv preprint arXiv:2504.00906, 2025

work page internal anchor Pith review arXiv 2025

[2] [2]

Sparc: Separating perception and reasoning circuits for test-time scaling of vlms.arXiv preprint arXiv:2602.06566, 2026

Niccolo Avogaro, Nayanika Debnath, Li Mi, Thomas Frick, Junling Wang, Zexue He, Hang Hua, Konrad Schindler, and Mattia Rigotti. Sparc: Separating perception and reasoning circuits for test-time scaling of vlms.arXiv preprint arXiv:2602.06566, 2026

work page arXiv 2026

[3] [3]

Qwen3-VL Technical Report

Shuai Bai, Yuxuan Cai, Ruizhe Chen, Keqin Chen, Xionghui Chen, Zesen Cheng, Lianghao Deng, Wei Ding, Chang Gao, Chunjiang Ge, et al. Qwen3-vl technical report.arXiv preprint arXiv:2511.21631, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025

[4] [4]

Memori: A persistent memory layer for efficient, context-aware llm agents.arXiv preprint arXiv:2603.19935, 2026

Luiz C Borro, Luiz AB Macarini, Gordon Tindall, Michael Montero, and Adam B Struck. Memori: A persistent memory layer for efficient, context-aware llm agents.arXiv preprint arXiv:2603.19935, 2026

work page arXiv 2026

[5] [5]

Presto: Pro- gressive pretraining enhances synthetic chemistry outcomes.arXiv preprint arXiv:2406.13193, 2024

He Cao, Yanjun Shao, Zhiyuan Liu, Zijing Liu, Xiangru Tang, Yuan Yao, and Yu Li. Presto: Pro- gressive pretraining enhances synthetic chemistry outcomes.arXiv preprint arXiv:2406.13193, 2024

work page arXiv 2024

[6] [6]

Telemem: Building long-term and multimodal memory for agentic ai.arXiv preprint arXiv:2601.06037, 2025

Chunliang Chen, Ming Guan, Xiao Lin, Jiaxu Li, Luxi Lin, Qiyi Wang, Xiangyu Chen, Jixiang Luo, Changzhi Sun, Dell Zhang, et al. Telemem: Building long-term and multimodal memory for agentic ai.arXiv preprint arXiv:2601.06037, 2025

work page arXiv 2025

[7] [7]

Less is more: Empowering gui agent with context- aware simplification

Gongwei Chen, Xurui Zhou, Rui Shao, Yibo Lyu, Kaiwen Zhou, Shuai Wang, Wentao Li, Yinchuan Li, Zhongang Qi, and Liqiang Nie. Less is more: Empowering gui agent with context- aware simplification. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 5901–5911, 2025

work page 2025

[8] [8]

Seeclick: Harnessing gui grounding for advanced visual gui agents

Kanzhi Cheng, Qiushi Sun, Yougang Chu, Fangzhi Xu, Li YanTao, Jianbing Zhang, and Zhiyong Wu. Seeclick: Harnessing gui grounding for advanced visual gui agents. InProceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 9313–9332, 2024

work page 2024

[9] [9]

MGA: Memory-Driven GUI Agent for Observation-Centric Interaction

Weihua Cheng, Ersheng Ni, Wenlong Wang, Yifei Sun, Junming Liu, Wangyu Shen, Yirong Chen, Botian Shi, and Ding Wang. Mga: Memory-driven gui agent for observation-centric interaction.arXiv preprint arXiv:2510.24168, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025

[10] [10]

Mind2web: Towards a generalist agent for the web.Advances in Neural Information Processing Systems, 36:28091–28114, 2023

Xiang Deng, Yu Gu, Boyuan Zheng, Shijie Chen, Sam Stevens, Boshi Wang, Huan Sun, and Yu Su. Mind2web: Towards a generalist agent for the web.Advances in Neural Information Processing Systems, 36:28091–28114, 2023

work page 2023

[11] [11]

Memp: Exploring Agent Procedural Memory

Runnan Fang, Yuan Liang, Xiaobin Wang, Jialong Wu, Shuofei Qiao, Pengjun Xie, Fei Huang, Huajun Chen, and Ningyu Zhang. Memp: Exploring agent procedural memory.arXiv preprint arXiv:2508.06433, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025

[12] [12]

Ui-venus-1.5 technical report.arXiv preprint arXiv:2602.09082, 2026

Changlong Gao, Zhangxuan Gu, Yulin Liu, Xinyu Qiu, Shuheng Shen, Yue Wen, Tianyu Xia, Zhenyu Xu, Zhengwen Zeng, Beitong Zhou, et al. Ui-venus-1.5 technical report.arXiv preprint arXiv:2602.09082, 2026

work page arXiv 2026

[13] [13]

Chain-of-memory: Enhancing gui agents for cross-application navigation.arXiv preprint arXiv:2506.18158, 2025

Xinzge Gao, Chuanrui Hu, Bin Chen, and Teng Li. Chain-of-memory: Enhancing gui agents for cross-application navigation.arXiv preprint arXiv:2506.18158, 2025

work page arXiv 2025

[14] [14]

Gemini 3.1 Pro Model Card

Google DeepMind. Gemini 3.1 Pro Model Card. https://storage.googleapis.com/ deepmind-media/Model-Cards/Gemini-3-1-Pro-Model-Card.pdf , February 2026. Published February 2026; updated 19 February 2026

work page 2026

[15] [15]

Navigating the Digital World as Humans Do: Universal Visual Grounding for GUI Agents

Boyu Gou, Ruohan Wang, Boyuan Zheng, Yanan Xie, Cheng Chang, Yiheng Shu, Huan Sun, and Yu Su. Navigating the digital world as humans do: Universal visual grounding for gui agents.arXiv preprint arXiv:2410.05243, 2024. 10

work page internal anchor Pith review Pith/arXiv arXiv 2024

[16] [16]

MemFactory: Unified Inference & Training Framework for Agent Memory

Ziliang Guo, Ziheng Li, and Zhiyu Li. Memfactory: Unified inference & training framework for agent memory.arXiv preprint arXiv:2603.29493, 2026

work page internal anchor Pith review Pith/arXiv arXiv 2026

[17] [17]

Cogagent: A visual language model for gui agents

Wenyi Hong, Weihan Wang, Qingsong Lv, Jiazheng Xu, Wenmeng Yu, Junhui Ji, Yan Wang, Zihan Wang, Yuxiao Dong, Ming Ding, et al. Cogagent: A visual language model for gui agents. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 14281–14290, 2024

work page 2024

[18] [18]

Computer use data - paradigm shift ai, 2025

Anais Howland, Ashwin Thinnappan, and Jameel Shahid Mohammed. Computer use data - paradigm shift ai, 2025

work page 2025

[19] [19]

Hiagent: Hierarchical working memory management for solving long-horizon agent tasks with large lan- guage model

Mengkang Hu, Tianxing Chen, Qiguang Chen, Yao Mu, Wenqi Shao, and Ping Luo. Hiagent: Hierarchical working memory management for solving long-horizon agent tasks with large lan- guage model. InProceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 32779–32798, 2025

work page 2025

[20] [20]

Smith, and Jiebo Luo

Yushi Hu, Hang Hua, Zhengyuan Yang, Weijia Shi, Noah A. Smith, and Jiebo Luo. Promptcap: Prompt-guided image captioning for vqa with gpt-3.2023 IEEE/CVF International Conference on Computer Vision (ICCV), pages 2951–2963, 2023

work page 2023

[21] [21]

Finecaption: Compositional image captioning focusing on wherever you want at any granularity

Hang Hua, Qing Liu, Lingzhi Zhang, Jing Shi, Soo Ye Kim, Zhifei Zhang, Yilin Wang, Jianming Zhang, Zhe Lin, and Jiebo Luo. Finecaption: Compositional image captioning focusing on wherever you want at any granularity. InProceedings of the computer vision and pattern recognition conference, pages 24763–24773, 2025

work page 2025

[22] [22]

Collomosse, Scott Cohen, and Jiebo Luo

Hang Hua, Jing Shi, Kushal Kafle, Simon Jenni, Daoan Zhang, John P. Collomosse, Scott Cohen, and Jiebo Luo. Finematch: Aspect-based fine-grained image and text mismatch detection and correction. InEuropean Conference on Computer Vision, 2024

work page 2024

[23] [23]

V2xum-llm: Cross-modal video summarization with temporal prompt instruction tuning

Hang Hua, Yunlong Tang, Chenliang Xu, and Jiebo Luo. V2xum-llm: Cross-modal video summarization with temporal prompt instruction tuning. InProceedings of the AAAI Conference on Artificial Intelligence, volume 39, pages 3599–3607, 2025

work page 2025

[24] [24]

Mmcomposition: Revisiting the compositionality of pre-trained vision-language models.arXiv preprint arXiv:2410.09733,

Hang Hua, Yunlong Tang, Ziyun Zeng, Liangliang Cao, Zhengyuan Yang, Hangfeng He, Chenliang Xu, and Jiebo Luo. Mmcomposition: Revisiting the compositionality of pre-trained vision-language models.arXiv preprint arXiv:2410.09733, 2024

work page arXiv 2024

[25] [25]

Aliaga, Wei Xiong, and Jiebo Luo

Hang Hua, Ziyun Zeng, Yizhi Song, Yunlong Tang, Liu He, Daniel G. Aliaga, Wei Xiong, and Jiebo Luo. Mmig-bench: Towards comprehensive and explainable evaluation of multi-modal image generation models.ArXiv, abs/2505.19415, 2025

work page arXiv 2025

[26] [26]

Dave: A vlm vision encoder for document understanding and web agents.arXiv preprint arXiv:2512.17221, 2025

Brandon Huang, Hang Hua, Zhuoran Yu, Trevor Darrell, Rogerio Feris, and Roei Herzig. Dave: A vlm vision encoder for document understanding and web agents.arXiv preprint arXiv:2512.17221, 2025

work page arXiv 2025

[27] [27]

Building a foundational guardrail for general agentic systems via synthetic data.arXiv preprint arXiv:2510.09781, 2025

Yue Huang, Hang Hua, Yujun Zhou, Pengcheng Jing, Manish Nagireddy, Inkit Padhi, Greta Dolcetti, Zhangchen Xu, Subhajit Chaudhury, Ambrish Rawat, et al. Building a foundational guardrail for general agentic systems via synthetic data.arXiv preprint arXiv:2510.09781, 2025

work page arXiv 2025

[28] [28]

Ap- pagentx: Evolving gui agents as proficient smartphone users.arXiv preprint arXiv:2503.02268, 2025

Wenjia Jiang, Yangyang Zhuang, Chenxi Song, Xu Yang, Joey Tianyi Zhou, and Chi Zhang. Ap- pagentx: Evolving gui agents as proficient smartphone users.arXiv preprint arXiv:2503.02268, 2025

work page arXiv 2025

[29] [29]

Visualwebarena: Evaluating multimodal agents on realistic visual web tasks

Jing Yu Koh, Robert Lo, Lawrence Jang, Vikram Duvvur, Ming Lim, Po-Yu Huang, Graham Neubig, Shuyan Zhou, Russ Salakhutdinov, and Daniel Fried. Visualwebarena: Evaluating multimodal agents on realistic visual web tasks. InProceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 881–905, 2024

work page 2024

[30] [30]

Mobilegpt: Augmenting llm with human-like app memory for mobile task automation

Sunjae Lee, Junyoung Choi, Jungjae Lee, Munim Hasan Wasi, Hojun Choi, Steve Ko, Sangeun Oh, and Insik Shin. Mobilegpt: Augmenting llm with human-like app memory for mobile task automation. InProceedings of the 30th Annual International Conference on Mobile Computing and Networking, pages 1119–1133, 2024. 11

work page 2024

[31] [31]

Grounding multimodal large language model in gui world

Weixian Lei, Difei Gao, and Mike Zheng Shou. Grounding multimodal large language model in gui world. InThe Thirteenth International Conference on Learning Representations, 2025

work page 2025

[32] [32]

EchoTrail-GUI: Building Actionable Memory for GUI Agents via Critic-Guided Self-Exploration

Runze Li, Yuwen Zhai, Bo Xu, LiWu Xu, Nian Shi, Wei Zhang, Ran Lin, and Liang Wang. Echotrail-gui: Building actionable memory for gui agents via critic-guided self-exploration. arXiv preprint arXiv:2512.19396, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025

[33] [33]

Optimus-1: Hybrid multimodal memory empowered agents excel in long-horizon tasks.Advances in neural information processing systems, 37:49881–49913, 2024

Zaijing Li, Yuquan Xie, Rui Shao, Gongwei Chen, Dongmei Jiang, and Liqiang Nie. Optimus-1: Hybrid multimodal memory empowered agents excel in long-horizon tasks.Advances in neural information processing systems, 37:49881–49913, 2024

work page 2024

[34] [34]

Showui: One vision-language-action model for gui visual agent

Kevin Qinghong Lin, Linjie Li, Difei Gao, Zhengyuan Yang, Shiwei Wu, Zechen Bai, Stan Weix- ian Lei, Lijuan Wang, and Mike Zheng Shou. Showui: One vision-language-action model for gui visual agent. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 19498–19508, 2025

work page 2025

[35] [35]

Visual instruction tuning.Advances in neural information processing systems, 36:34892–34916, 2023

Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning.Advances in neural information processing systems, 36:34892–34916, 2023

work page 2023

[36] [36]

SimpleMem: Efficient Lifelong Memory for LLM Agents

Jiaqi Liu, Yaofeng Su, Peng Xia, Siwei Han, Zeyu Zheng, Cihang Xie, Mingyu Ding, and Huaxiu Yao. Simplemem: Efficient lifelong memory for llm agents.arXiv preprint arXiv:2601.02553, 2026

work page internal anchor Pith review arXiv 2026

[37] [37]

Osexpert: Computer-use agents learning professional skills via exploration.arXiv preprint arXiv:2603.07978, 2026

Jiateng Liu, Zhenhailong Wang, Rushi Wang, Bingxuan Li, Jeonghwan Kim, Aditi Tiwari, Pengfei Yu, Denghui Zhang, and Heng Ji. Osexpert: Computer-use agents learning professional skills via exploration.arXiv preprint arXiv:2603.07978, 2026

work page arXiv 2026

[38] [38]

Memverse: Multimodal memory for lifelong learning agents.arXiv preprint arXiv:2512.03627, 2025

Junming Liu, Yifei Sun, Weihua Cheng, Haodong Lei, Yirong Chen, Licheng Wen, Xuemeng Yang, Daocheng Fu, Pinlong Cai, Nianchen Deng, et al. Memverse: Multimodal memory for lifelong learning agents.arXiv preprint arXiv:2512.03627, 2025

work page arXiv 2025

[39] [39]

Scalecua: Scaling open-source computer use agents with cross-platform data.arXiv preprint arXiv:2509.15221, 2025

Zhaoyang Liu, JingJing Xie, Zichen Ding, Zehao Li, Bowen Yang, Zhenyu Wu, Xuehui Wang, Qiushi Sun, Shi Liu, Weiyun Wang, et al. Scalecua: Scaling open-source computer use agents with cross-platform data.arXiv preprint arXiv:2509.15221, 2025

work page arXiv 2025

[40] [40]

Guiodyssey: A comprehensive dataset for cross-app gui navigation on mobile devices

Quanfeng Lu, Wenqi Shao, Zitao Liu, Lingxiao Du, Fanqing Meng, Boxuan Li, Botong Chen, Siyuan Huang, Kaipeng Zhang, and Ping Luo. Guiodyssey: A comprehensive dataset for cross-app gui navigation on mobile devices. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 22404–22414, 2025

work page 2025

[41] [41]

Locas: Your models are principled initializers of locally-supported parametric memories.arXiv preprint arXiv:2602.05085, 2026

Sidi Lu, Zhenwen Liang, Dongyang Ma, Yan Wang, Haitao Mi, and Dong Yu. Locas: Your models are principled initializers of locally-supported parametric memories.arXiv preprint arXiv:2602.05085, 2026

work page arXiv 2026

[42] [42]

ishift: Lightweight slow-fast gui agent with adaptive perception.arXiv preprint arXiv:2512.22009, 2025

Sarthak Mehrotra, Sairam VC Rebbapragada, Mani Hemanth Reddy Bonthu, and Vineeth N Balasubramanian. ishift: Lightweight slow-fast gui agent with adaptive perception.arXiv preprint arXiv:2512.22009, 2025

work page arXiv 2025

[43] [43]

Generative agents: Interactive simulacra of human behavior

Joon Sung Park, Joseph O’Brien, Carrie Jun Cai, Meredith Ringel Morris, Percy Liang, and Michael S Bernstein. Generative agents: Interactive simulacra of human behavior. InProceed- ings of the 36th annual acm symposium on user interface software and technology, pages 1–22, 2023

work page 2023

[44] [44]

UI-TARS: Pioneering Automated GUI Interaction with Native Agents

Yujia Qin, Yining Ye, Junjie Fang, Haoming Wang, Shihao Liang, Shizuo Tian, Junda Zhang, Jiahao Li, Yunxin Li, Shijue Huang, et al. Ui-tars: Pioneering automated gui interaction with native agents.arXiv preprint arXiv:2501.12326, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025

[45] [45]

An- droidinthewild: A large-scale dataset for android device control.Advances in Neural Information Processing Systems, 36:59708–59728, 2023

Christopher Rawles, Alice Li, Daniel Rodriguez, Oriana Riva, and Timothy Lillicrap. An- droidinthewild: A large-scale dataset for android device control.Advances in Neural Information Processing Systems, 36:59708–59728, 2023

work page 2023

[46]

Gabriel Sarch, Lawrence Jang, Michael J Tarr, William W Cohen, Kenneth Marino, and Katerina Fragkiadaki. Vlm agents generate their own memories: Distilling experience into embodied programs of thought. Advances in Neural Information Processing Systems, 37:75942–75985, 2024

[47]

Noah Shinn, Federico Cassano, Ashwin Gopinath, Karthik Narasimhan, and Shunyu Yao. Reflexion: Language agents with verbal reinforcement learning. Advances in neural information processing systems, 36:8634–8652, 2023

[48]

Aaditya Singh, Adam Fry, Adam Perelman, Adam Tart, Adi Ganesh, Ahmed El-Kishky, Aidan McLaughlin, Aiden Low, AJ Ostrow, Akhila Ananthram, et al. Openai gpt-5 system card. arXiv preprint arXiv:2601.03267, 2026

[49]

Changzhi Sun, Xiangyu Chen, Jixiang Luo, Dell Zhang, and Xuelong Li. Beyond heuristics: A decision-theoretic framework for agent memory management. arXiv preprint arXiv:2512.21567, 2025

[50]

Guohao Sun, Hang Hua, Jian Wang, Jiebo Luo, Sohail A. Dianat, Majid Rabbani, Raghuveer Rao, and Zhiqiang Tao. Latent chain-of-thought for visual reasoning. ArXiv, abs/2510.23925, 2025

[51]

Zeyi Sun, Ziyu Liu, Yuhang Zang, Yuhang Cao, Xiaoyi Dong, Tong Wu, Dahua Lin, and Jiaqi Wang. Seagent: Self-evolving computer use agent with autonomous learning from experience. arXiv preprint arXiv:2508.04700, 2025

[52]

Weihao Tan, Wentao Zhang, Xinrun Xu, Haochong Xia, Ziluo Ding, Boyu Li, Bohan Zhou, Junpeng Yue, Jiechuan Jiang, Yewen Li, et al. Cradle: Empowering foundation agents towards general computer control. arXiv preprint arXiv:2403.03186, 2024

[53]

Xiangru Tang, Tianyu Hu, Muyang Ye, Yanjun Shao, Xunjian Yin, Siru Ouyang, Wangchunshu Zhou, Pan Lu, Zhuosheng Zhang, Yilun Zhao, et al. Chemagent: Self-updating library in large language models improves chemical reasoning. arXiv preprint arXiv:2501.06590, 2025

[54]

Xiangru Tang, Daniel Shao, Jiwoong Sohn, Jiapeng Chen, Jiayi Zhang, Jinyu Xiang, Fang Wu, Yilun Zhao, Chenglin Wu, Wenqi Shi, et al. Medagentsbench: Benchmarking thinking models and agent frameworks for complex medical reasoning. arXiv preprint arXiv:2503.07459, 2025

[55]

Xiangru Tang, Wanghan Xu, Yujie Wang, Zijie Guo, Daniel Shao, Jiapeng Chen, Cixuan Zhang, Ziyi Wang, Lixin Zhang, Guancheng Wan, et al. Eigen-1: Adaptive multi-agent refinement with monitor-based rag for scientific reasoning. arXiv preprint arXiv:2509.21193, 2025

[56]

Xiangru Tang, Zhuoyun Yu, Jiapeng Chen, Yan Cui, Daniel Shao, Weixu Wang, Fang Wu, Yuchen Zhuang, Wenqi Shi, Zhi Huang, et al. Cellforge: Agentic design of virtual cell models. arXiv preprint arXiv:2508.02276, 2025

[57]

Tristan Thrush, Ryan Jiang, Max Bartolo, Amanpreet Singh, Adina Williams, Douwe Kiela, and Candace Ross. Winoground: Probing vision and language models for visio-linguistic compositionality. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 5238–5248, 2022

[58]

Shizuo Tian, Hao Wen, Yuxuan Chen, Jiacheng Liu, Shanhui Zhao, Guohong Liu, Ju Ren, Yunxin Liu, and Yuanchun Li. Agentprog: Empowering long-horizon gui agents with program-guided context management. arXiv preprint arXiv:2512.10371, 2025

[59]

Guanzhi Wang, Yuqi Xie, Yunfan Jiang, Ajay Mandlekar, Chaowei Xiao, Yuke Zhu, Linxi Fan, and Anima Anandkumar. Voyager: An open-ended embodied agent with large language models. arXiv preprint arXiv:2305.16291, 2023

[60]

Junyang Wang, Haiyang Xu, Haitao Jia, Xi Zhang, Ming Yan, Weizhou Shen, Ji Zhang, Fei Huang, and Jitao Sang. Mobile-agent-v2: Mobile device operation assistant with effective navigation via multi-agent collaboration. Advances in Neural Information Processing Systems, 37:2686–2710, 2024

[61]

Weiyun Wang, Zhangwei Gao, Lixin Gu, Hengjun Pu, Long Cui, Xingguang Wei, Zhaoyang Liu, Linglin Jing, Shenglong Ye, Jie Shao, et al. Internvl3.5: Advancing open-source multimodal models in versatility, reasoning, and efficiency. arXiv preprint arXiv:2508.18265, 2025

[62]

Xinyu Wang, Mingze Li, Peng Lu, Xiao-Wen Chang, Lifeng Shang, Jinping Li, Fei Mi, Prasanna Parthasarathi, and Yufei Cui. Infmem: Learning system-2 memory control for long-context agent. arXiv preprint arXiv:2602.02704, 2026

[63]

Zihao Wang, Shaofei Cai, Anji Liu, Yonggang Jin, Jinbing Hou, Bowei Zhang, Haowei Lin, Zhaofeng He, Zilong Zheng, Yaodong Yang, et al. Jarvis-1: Open-world multi-task agents with memory-augmented multimodal language models. IEEE Transactions on Pattern Analysis and Machine Intelligence, 47(3):1894–1907, 2024

[64]

Ziwei Wang, Leyang Yang, Xiaoxuan Tang, Sheng Zhou, Dajun Chen, Wei Jiang, and Yong Li. History-aware reasoning for gui agents. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 40, pages 36448–36456, 2026

[65]

Zora Zhiruo Wang, Jiayuan Mao, Daniel Fried, and Graham Neubig. Agent workflow memory. arXiv preprint arXiv:2409.07429, 2024

[66]

Wenyi Wu, Kun Zhou, Ruoxin Yuan, Vivian Yu, Stephen Wang, Zhiting Hu, and Biwei Huang. Auto-scaling continuous memory for gui agent. arXiv preprint arXiv:2510.09038, 2025

[67]

Zhiyong Wu, Zhenyu Wu, Fangzhi Xu, Yian Wang, Qiushi Sun, Chengyou Jia, Kanzhi Cheng, Zichen Ding, Liheng Chen, Paul Pu Liang, et al. Os-atlas: A foundation action model for generalist gui agents. arXiv preprint arXiv:2410.23218, 2024

[68]

Han Xiao, Guozhi Wang, Hao Wang, Shilong Liu, Yuxiang Chai, Yue Pan, Yufeng Zhou, Xiaoxin Chen, Yafei Wen, and Hongsheng Li. Ui-mem: Self-evolving experience memory for online reinforcement learning in mobile gui agents. arXiv preprint arXiv:2602.05832, 2026

[69]

Tianbao Xie, Danyang Zhang, Jixuan Chen, Xiaochuan Li, Siheng Zhao, Ruisheng Cao, Toh J Hua, Zhoujun Cheng, Dongchan Shin, Fangyu Lei, et al. Osworld: Benchmarking multimodal agents for open-ended tasks in real computer environments. Advances in Neural Information Processing Systems, 37:52040–52094, 2024

[70]

Haiyang Xu, Xi Zhang, Haowei Liu, Junyang Wang, Zhaozai Zhu, Shengjie Zhou, Xuhao Hu, Feiyu Gao, Junjie Cao, Zihua Wang, et al. Mobile-agent-v3.5: Multi-platform fundamental gui agents. arXiv preprint arXiv:2602.16855, 2026

[71]

Ran Xu, Kaixin Ma, Wenhao Yu, Hongming Zhang, Joyce C Ho, Carl Yang, and Dong Yu. Retrieval-augmented gui agents with generative guidelines. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 17877–17886, 2025

[72]

Wujiang Xu, Zujie Liang, Kai Mei, Hang Gao, Juntao Tan, and Yongfeng Zhang. A-mem: Agentic memory for llm agents. arXiv preprint arXiv:2502.12110, 2025

[73]

Taofeng Xue, Chong Peng, Mianqiu Huang, Linsen Guo, Tiancheng Han, Haozhe Wang, Jianing Wang, Xiaocheng Zhang, Xin Yang, Dengchang Zhao, et al. Evocua: Evolving computer use agents via learning from scalable synthetic experience. arXiv preprint arXiv:2601.15876, 2026

[74]

Haolong Yan, Jia Wang, Xin Huang, Yeqing Shen, Ziyang Meng, Zhimin Fan, Kaijun Tan, Jin Gao, Lieyu Shi, Mi Yang, et al. Step-gui technical report. arXiv preprint arXiv:2512.15431, 2025

[75]

Woongyeong Yeo, Kangsan Kim, Jaehong Yoon, and Sung Ju Hwang. Worldmm: Dynamic multimodal memory agent for long video reasoning. arXiv preprint arXiv:2512.02425, 2025

[76]

Yi Yu, Liuyi Yao, Yuexiang Xie, Qingquan Tan, Jiaqi Feng, Yaliang Li, and Libing Wu. Agentic memory: Learning unified long-term and short-term memory management for large language model agents. arXiv preprint arXiv:2601.01885, 2026

[77]

Yongsheng Yu, Ziyun Zeng, Hang Hua, Jianlong Fu, and Jiebo Luo. Promptfix: You prompt and we fix the photo. arXiv preprint arXiv:2405.16785, 2024

[78]

Yongsheng Yu, Ziyun Zeng, Haitian Zheng, and Jiebo Luo. Omnipaint: Mastering object-oriented editing via disentangled insertion-removal inpainting. arXiv preprint arXiv:2503.08677, 2025

[79]

Ziyun Zeng, Junyu Chen, Noha Rashwan, Nisreen Al Jallad, Jin Xiao, and Jiebo Luo. Automated detection and quantitative assessment of dental plaque in intraoral images. ACM Transactions on Computing for Healthcare, 7(2):1–12, 2026

[80]

Ziyun Zeng, Hang Hua, and Jiebo Luo. Mira: Multimodal iterative reasoning agent for image editing. arXiv preprint arXiv:2511.21087, 2025