pith. sign in

arxiv: 2605.13527 · v3 · pith:G3QQ7POGnew · submitted 2026-05-13 · 💻 cs.AI

MMSkills: Towards Multimodal Skills for General Visual Agents

Pith reviewed 2026-06-30 21:29 UTC · model grok-4.3

classification 💻 cs.AI
keywords multimodal skillsvisual agentsprocedural knowledgeGUI agentstrajectory to skillreusable proceduresmultimodal agentsstate cards
0
0 comments X

The pith

Multimodal procedural skills extracted from trajectories improve visual agents on GUI and game benchmarks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper claims that visual agents need multimodal procedural knowledge that links textual steps to visual state recognition, progress evidence, and next actions. It introduces MMSkills as compact packages that bundle a textual procedure with runtime state cards and multi-view keyframes. These packages are built from public non-evaluation trajectories by an agentic generator that performs workflow grouping, procedure induction, visual grounding, and auditing. At runtime a branch-loaded agent temporarily inspects selected cards and keyframes, aligns them to the live scene, and distills guidance without flooding the main context. Experiments show consistent gains for both frontier and smaller multimodal models, indicating that external multimodal skills can supplement model-internal priors.

Core claim

MMSkills is a framework for representing, generating, and using reusable multimodal procedures for runtime visual decision making. Each MMSkill is a compact, state-conditioned package that couples a textual procedure with runtime state cards and multi-view keyframes. The packages are constructed by an agentic trajectory-to-skill Generator that transforms public non-evaluation trajectories through workflow grouping, procedure induction, visual grounding, and meta-skill-guided auditing. At inference time a branch-loaded multimodal skill agent selects and inspects state cards and keyframes in a temporary branch, aligns them with the live environment, and distills structured guidance for the mai

What carries the argument

The branch-loaded multimodal skill agent, which inspects selected state cards and keyframes in a temporary branch, aligns them with the live environment, and distills structured guidance for the main agent.

If this is right

  • Visual agents gain reusable procedural knowledge that includes visual evidence of states and progress rather than text alone.
  • Both frontier and smaller multimodal agents show improved performance on GUI and game-based benchmarks.
  • Reusable multimodal skills can be derived from public interaction experience without requiring evaluation-specific trajectories.
  • The three challenges of skill content, sourcing from public data, and efficient runtime consultation are addressed by the package format and branch mechanism.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Agents could accumulate growing libraries of skills across repeated interactions, enabling compounding capability gains over time.
  • The same trajectory-to-skill process might apply to other embodied or web-based agent settings where visual state evidence matters.
  • Providing structured external multimodal memory could reduce the need for ever-larger model sizes by offloading procedural details.

Load-bearing premise

An agentic trajectory-to-skill Generator can reliably transform public non-evaluation trajectories into high-quality reusable multimodal skills via workflow grouping, procedure induction, visual grounding, and meta-skill-guided auditing without introducing significant biases or loss of applicability.

What would settle it

Running the same GUI and game benchmarks with MMSkills-augmented agents versus identical agents given no skills or randomly generated skills would show no consistent performance gain.

read the original abstract

Reusable skills have become a core substrate for improving agent capabilities, yet most existing skill packages encode reusable behavior primarily as textual prompts, executable code, or learned routines. For visual agents, however, procedural knowledge is inherently multimodal: reuse depends not only on what operation to perform, but also on recognizing the relevant state, interpreting visual evidence of progress or failure, and deciding what to do next. We formalize this requirement as multimodal procedural knowledge and address three practical challenges: (I) what a multimodal skill package should contain; (II) where such packages can be derived from public interaction experience; and (III) how agents can consult multimodal evidence at inference time without excessive image context or over-anchoring to reference screenshots. We introduce MMSkills, a framework for representing, generating, and using reusable multimodal procedures for runtime visual decision making. Each MMSkill is a compact, state-conditioned package that couples a textual procedure with runtime state cards and multi-view keyframes. To construct these packages, we develop an agentic trajectory-to-skill Generator that transforms public non-evaluation trajectories into reusable multimodal skills through workflow grouping, procedure induction, visual grounding, and meta-skill-guided auditing. To use them, we introduce a branch-loaded multimodal skill agent: selected state cards and keyframes are inspected in a temporary branch, aligned with the live environment, and distilled into structured guidance for the main agent. Experiments across GUI and game-based visual-agent benchmarks show that MMSkills consistently improve both frontier and smaller multimodal agents, suggesting that external multimodal procedural knowledge complements model-internal priors.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper introduces MMSkills, a framework for reusable multimodal procedural knowledge in visual agents. Each skill couples a textual procedure with state cards and multi-view keyframes. An agentic trajectory-to-skill Generator derives these from public non-evaluation trajectories via workflow grouping, procedure induction, visual grounding, and meta-skill-guided auditing. A branch-loaded multimodal skill agent consults selected cards and keyframes at inference time in a temporary branch. Experiments on GUI and game-based visual-agent benchmarks report consistent improvements for both frontier and smaller multimodal agents, supporting the claim that external multimodal procedural knowledge complements model-internal priors.

Significance. If the generator produces unbiased, reusable skills and the reported gains are attributable to the multimodal evidence rather than curation artifacts, the work would offer a practical route to augmenting visual agents with external procedural knowledge. The explicit separation of skill generation from evaluation trajectories and the branch-loaded inference mechanism are concrete contributions that could be adopted or extended by other agent frameworks.

major comments (3)
  1. [§3] §3 (Generator pipeline): The description of workflow grouping, procedure induction, visual grounding, and meta-skill-guided auditing provides no quantitative validation (e.g., inter-annotator agreement on skill quality, grounding error rates, or comparison against human-authored skills). This is load-bearing for the central claim, because any systematic selection or distortion in the generated MMSkills could produce benchmark gains without demonstrating complementarity of multimodal procedural knowledge.
  2. [§4] §4 (Experiments): The reported improvements are summarized as 'consistent' across benchmarks, but no per-benchmark tables, baseline comparisons, or ablation isolating the contribution of state cards versus keyframes are referenced. Without these, it is impossible to determine whether the gains exceed what could be obtained by simply increasing context length or adding textual skills.
  3. [§4.2] §4.2 (Agent integration): The branch-loaded inference procedure is described at a high level; the paper does not report the additional token or latency cost of the temporary branch or how alignment failures between reference keyframes and live states are handled. These details are necessary to evaluate whether the method scales beyond the evaluated benchmarks.
minor comments (2)
  1. [Abstract] The abstract and introduction use the term 'multimodal procedural knowledge' without an explicit formal definition or contrast to existing textual or code-based skill representations; a short definitional paragraph would improve clarity.
  2. Figure captions and table headers should explicitly state the number of runs or seeds used for the reported improvements to allow readers to assess statistical reliability.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive feedback. The comments identify key areas where additional evidence and details will strengthen the manuscript. We address each point below and will incorporate the suggested revisions.

read point-by-point responses
  1. Referee: [§3] §3 (Generator pipeline): The description of workflow grouping, procedure induction, visual grounding, and meta-skill-guided auditing provides no quantitative validation (e.g., inter-annotator agreement on skill quality, grounding error rates, or comparison against human-authored skills). This is load-bearing for the central claim, because any systematic selection or distortion in the generated MMSkills could produce benchmark gains without demonstrating complementarity of multimodal procedural knowledge.

    Authors: We agree that quantitative validation of the generator pipeline is necessary to support the central claim. The current manuscript emphasizes the pipeline design and downstream agent improvements but does not include these metrics. In the revised manuscript we will add a dedicated analysis subsection reporting: inter-annotator agreement (Cohen’s kappa) from three human raters on a random sample of 100 generated skills for quality and grounding accuracy; measured grounding error rates against manual annotations; and a side-by-side comparison of 20 MMSkills versus human-authored equivalents on a held-out agent performance task. These additions will directly address concerns about potential curation artifacts. revision: yes

  2. Referee: [§4] §4 (Experiments): The reported improvements are summarized as 'consistent' across benchmarks, but no per-benchmark tables, baseline comparisons, or ablation isolating the contribution of state cards versus keyframes are referenced. Without these, it is impossible to determine whether the gains exceed what could be obtained by simply increasing context length or adding textual skills.

    Authors: We will expand §4 with full per-benchmark result tables containing exact metrics for all evaluated agents and conditions. New baselines will be added: (i) textual-only skill variants, (ii) context-length-matched baselines without multimodal elements, and (iii) explicit ablations that remove state cards or keyframes independently. These results will be presented in additional tables and an ablation figure to isolate the contribution of each multimodal component. revision: yes

  3. Referee: [§4.2] §4.2 (Agent integration): The branch-loaded inference procedure is described at a high level; the paper does not report the additional token or latency cost of the temporary branch or how alignment failures between reference keyframes and live states are handled. These details are necessary to evaluate whether the method scales beyond the evaluated benchmarks.

    Authors: We will augment §4.2 with concrete measurements: average additional tokens consumed by the temporary branch (reported per benchmark) and measured latency overhead in milliseconds on the same hardware used for the main experiments. Alignment handling will be described in detail: a CLIP-based similarity threshold determines whether keyframes are usable; below-threshold cases fall back to the textual procedure only. We will also report the observed failure rate (<5 % in our runs) to demonstrate practical scalability. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical framework with external benchmark validation

full rationale

The paper introduces MMSkills as a practical framework for generating multimodal procedural skills from public trajectories and demonstrates gains via experiments on GUI and game benchmarks. No equations, derivations, fitted parameters, or self-citations are invoked as load-bearing steps in the provided text. The generator pipeline and branch-loaded agent are described as engineering contributions whose value is assessed externally through benchmark improvements, making the central claim independent of any self-referential reduction.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review yields no identifiable free parameters, axioms, or invented entities; the framework description does not specify any fitted values or unproven background assumptions beyond standard AI agent concepts.

pith-pipeline@v0.9.1-grok · 5839 in / 1096 out tokens · 36704 ms · 2026-06-30T21:29:10.597206+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. VISUALSKILL: Multimodal Skills for Computer-Use Agents

    cs.CL 2026-06 unverdicted novelty 6.0

    Multimodal skills retaining visual figures improve CUA benchmark scores by 8.3 points over text-only equivalents generated from the same source content.

Reference graph

Works this paper leans on

40 extracted references · 38 canonical work pages · cited by 1 Pith paper · 22 internal anchors

  1. [1]

    URL https://arxiv.org/abs/2410.08164. Michael Ahn, Anthony Brohan, Noah Brown, Yevgen Chebotar, Omar Cortes, Byron David, Chelsea Finn, Chuyuan Fu, Keerthana Gopalakrishnan, Karol Hausman, Alex Herzog, Daniel Ho, Jasmine Hsu, Julian Ibarz, Brian Ichter, Alex Irpan, Eric Jang, Rosario Jauregui Ruano, Kyle Jeffrey, Sally Jesmonth, Nikhil J Joshi, Ryan Julia...

  2. [2]

    Do As I Can, Not As I Say: Grounding Language in Robotic Affordances

    URL https://arxiv.org/abs/2204.01691. Salaheddin Alzubi, Noah Provenzano, Jaydon Bingham, Weiyuan Chen, and Tu Vu. Evoskill: Automated skill discovery for multi- agent systems,

  3. [3]

    URL https://arxiv.org/abs/2603.02766. Shuai Bai, Yuxuan Cai, Ruizhe Chen, Keqin Chen, Xionghui Chen, Zesen Cheng, Lianghao Deng, Wei Ding, Chang Gao, Chunjiang Ge, Wenbin Ge, Zhifang Guo, Qidong Huang, Jie Huang, Fei Huang, Binyuan Hui, Shutong Jiang, Zhaohai Li, Mingsheng Li, Mei Li, Kaixin Li, Zicheng Lin, Junyang Lin, Xuejing Liu, Jiawei Liu, Chenglong...

  4. [4]

    Qwen3-VL Technical Report

    URL https://arxiv.org/abs/2511.21631. Yushi Bai, Xin Lv, Jiajie Zhang, Hongchang Lyu, Jiankai Tang, Zhidian Huang, Zhengxiao Du, Xiao Liu, Aohan Zeng, Lei Hou, Yuxiao Dong, Jie Tang, and Juanzi Li. Longbench: A bilingual, multitask benchmark for long context understanding,

  5. [5]

    URL https://arxiv.org/abs/2308.14508. Tianyi Chen, Yinheng Li, Michael Solodko, Sen Wang, Nan Jiang, Tingyuan Cui, Junheng Hao, Jongwoo Ko, Sara Abdali, Leon Xu, Suzhen Zheng, Hao Fan, Pashmina Cameron, Justin Wagle, and Kazuhito Koishida. CUA-skill: Develop skills for computer using agent,

  6. [6]

    CUA- Skill: Develop skills for computer using agent.arXiv preprint arXiv:2601.21123, 2026

    URL https://arxiv.org/abs/2601.21123. Kanzhi Cheng, Qiushi Sun, Yougang Chu, Fangzhi Xu, Yantao Li, Jianbing Zhang, and Zhiyong Wu. Seeclick: Harnessing GUI grounding for advanced visual GUI agents. InProceedings of the 62nd Annual Meeting of the Association for Computational Linguistics, pages 9313–9332. Association for Computational Linguistics,

  7. [7]

    S ee C lick: Harnessing GUI Grounding for Advanced Visual GUI Agents

    doi: 10.18653/V1/2024.ACL-LONG.505. URL https://doi.org/10.18653/v1/2024.acl-long.505. Xiang Deng, Yu Gu, Boyuan Zheng, Shijie Chen, Samuel Stevens, Boshi Wang, Huan Sun, and Yu Su. Mind2web: Towards a generalist agent for the web,

  8. [8]

    Mind2Web: Towards a Generalist Agent for the Web

    URL https://arxiv.org/abs/2306.06070. Boyu Gou, Ruohan Wang, Boyuan Zheng, Yanan Xie, Cheng Chang, Yiheng Shu, Huan Sun, and Yu Su. Navigating the digital world as humans do: Universal visual grounding for gui agents,

  9. [9]

    Navigating the Digital World as Humans Do: Universal Visual Grounding for GUI Agents

    URL https://arxiv.org/abs/2410.05243. Hongliang He, Wenlin Yao, Kaixin Ma, Wenhao Yu, Yong Dai, Hongming Zhang, Zhenzhong Lan, and Dong Yu. Web- voyager: Building an end-to-end web agent with large multimodal models. InProceedings of the 62nd Annual Meeting of the Association for Computational Linguistics, pages 6864–6890. Association for Computational Li...

  10. [10]

    URL https://doi.org/10.18653/v1/2024.acl-long.371

    doi: 10.18653/V1/2024.ACL-LONG.371. URL https://doi.org/10.18653/v1/2024.acl-long.371. 10 Wenyi Hong, Weihan Wang, Qingsong Lv, Jiazheng Xu, Wenmeng Yu, Junhui Ji, Yan Wang, Zihan Wang, Yuxuan Zhang, Juanzi Li, Bin Xu, Yuxiao Dong, Ming Ding, and Jie Tang. Cogagent: A visual language model for gui agents,

  11. [11]

    arXiv preprint arXiv:2312.08914 , year =

    URL https: //arxiv.org/abs/2312.08914. Lanxiang Hu, Mingjia Huo, Yuxuan Zhang, Haoyang Yu, Eric P. Xing, Ion Stoica, Tajana Rosing, Haojian Jin, and Hao Zhang. lmgame-bench: How good are llms at playing games?,

  12. [12]

    lmgame-bench: How good are llms at playing games?arXiv preprint arXiv:2505.15146, 2025

    URL https://arxiv.org/abs/2505.15146. Auke Jan Ijspeert, Jun Nakanishi, Heiko Hoffmann, Peter Pastor, and Stefan Schaal. Dynamical movement primitives: Learning attractor models for motor behaviors.Neural Computation, 25(2):328–373,

  13. [13]

    URL https: //doi.org/10.1162/NECO_a_00393

    doi: 10.1162/NECO_a_00393. URL https: //doi.org/10.1162/NECO_a_00393. Guanyu Jiang, Zhaochen Su, Xiaoye Qu, and Yi R. Fung. Xskill: Continual learning from experience and skills in multimodal agents,

  14. [14]

    XSkill: Continual Learning from Experience and Skills in Multimodal Agents

    URL https://arxiv.org/abs/2603.12056. Jing Yu Koh, Robert Lo, Lawrence Jang, Vikram Duvvur, Ming Chong Lim, Po-Yu Huang, Graham Neubig, Shuyan Zhou, Russ Salakhutdinov, and Daniel Fried. Visualwebarena: Evaluating multimodal agents on realistic visual web tasks. InProceedings of the 62nd Annual Meeting of the Association for Computational Linguistics, pag...

  15. [15]

    V isual W eb A rena: Evaluating Multimodal Agents on Realistic Visual Web Tasks

    doi: 10.18653/V1/2024.ACL-LONG.50. URL https://doi.org/10.18653/v1/2024.acl-long.50. Kaixin Li, Ziyang Meng, Hongzhan Lin, Ziyang Luo, Yuchen Tian, Jing Ma, Zhiyong Huang, and Tat-Seng Chua. Screenspot-pro: Gui grounding for professional high-resolution computer use, 2025a. URL https://arxiv.org/abs/2504.07981. Qingyao Li, Wei Xia, Kounianhua Du, Xinyi Da...

  16. [16]

    Mask3D: Mask Transformer for 3D Semantic Instance Segmentation

    doi: 10.1109/ICRA48891.2023.10160591. URL https://doi.org/10.1109/ICRA48891.2023.10160591. Nelson F. Liu, Kevin Lin, John Hewitt, Ashwin Paranjape, Michele Bevilacqua, Fabio Petroni, and Percy Liang. Lost in the middle: How language models use long contexts,

  17. [17]

    URL https://arxiv.org/abs/2307.03172. Xiao Liu, Tianjie Zhang, Yu Gu, Iat Long Iong, Yifan Xu, Xixuan Song, Shudan Zhang, Hanyu Lai, Xinyi Liu, Hanlin Zhao, Jiadai Sun, Xinyue Yang, Yu Yang, Zehan Qi, Shuntian Yao, Xueqiao Sun, Siyi Cheng, Qinkai Zheng, Hao Yu, Hanchen Zhang, Wenyi Hong, Ming Ding, Lihang Pan, Xiaotao Gu, Aohan Zeng, Zhengxiao Du, Chan He...

  18. [18]

    How Well Do Agentic Skills Work in the Wild: Benchmarking LLM Skill Usage in Realistic Settings

    URL https://arxiv.org/abs/2604.04323. Yadong Lu, Jianwei Yang, Yelong Shen, and Ahmed Awadallah. Omniparser for pure vision based gui agent,

  19. [19]

    OmniParser for pure vision based GUI agent,

    URL https: //arxiv.org/abs/2408.00203. Ziyu Ma, Shidong Yang, Yuxiang Ji, Xucong Wang, Yong Wang, Yiming Hu, Tongwen Huang, and Xiangxiang Chu. Skillclaw: Let skills evolve collectively with agentic evolver,

  20. [20]

    SkillClaw: Let Skills Evolve Collectively with Agentic Evolver

    URL https://arxiv.org/abs/2604.08377. Richard E. Mayer.Multimedia Learning. Cambridge University Press,

  21. [21]

    URL https://doi.org/ 10.1017/CBO9780511811678

    doi: 10.1017/CBO9780511811678. URL https://doi.org/ 10.1017/CBO9780511811678. Charles Packer, Sarah Wooders, Kevin Lin, Vivian Fang, Shishir G. Patil, Ion Stoica, and Joseph E. Gonzalez. Memgpt: Towards llms as operating systems,

  22. [22]

    MemGPT: Towards LLMs as Operating Systems

    URL https://arxiv.org/abs/2310.08560. 11 Joon Sung Park, Joseph C. O’Brien, Carrie J. Cai, Meredith Ringel Morris, Percy Liang, and Michael S. Bernstein. Generative agents: Interactive simulacra of human behavior,

  23. [23]

    URL https://arxiv.org/abs/2304.03442. Yujia Qin, Yining Ye, Junjie Fang, Haoming Wang, Shihao Liang, Shizuo Tian, Junda Zhang, Jiahao Li, Yunxin Li, Shijue Huang, Wanjun Zhong, Kuanye Li, Jiale Yang, Yu Miao, Woyu Lin, Longxiang Liu, Xu Jiang, Qianli Ma, Jingyu Li, Xiaojun Xiao, Kai Cai, Chuang Li, Yaowei Zheng, Chaolin Jin, Chen Li, Xiao Zhou, Minchao Wa...

  24. [24]

    UI-TARS: Pioneering Automated GUI Interaction with Native Agents

    URL https://arxiv.org/abs/2501.12326. Christopher Rawles, Alice Li, Daniel Rodriguez, Oriana Riva, and Timothy Lillicrap. Android in the wild: A large-scale dataset for android device control,

  25. [25]

    URL https://arxiv.org/abs/2307.10088. Christopher Rawles, Sarah Clinckemaillie, Yifan Chang, Jonathan Waltz, Gabrielle Lau, Marybeth Fair, Alice Li, William Bishop, Wei Li, Folawiyo Campbell-Ajala, Daniel Toyama, Robert Berry, Divya Tyamagundlu, Timothy Lillicrap, and Oriana Riva. An- droidworld: A dynamic benchmarking environment for autonomous agents,

  26. [26]

    AndroidWorld: A Dynamic Benchmarking Environment for Autonomous Agents

    URL https://arxiv.org/abs/2405.14573. Shuai Shao, Yixiang Liu, Bingwei Lu, and Weinan Zhang. Monoscale: Scaling multi-agent system with monotonic improvement, 2026a. URL https://arxiv.org/abs/2601.23219. Shuai Shao, Qihan Ren, Chen Qian, Boyi Wei, Dadi Guo, Jingyi Yang, Xinhao Song, Linfeng Zhang, Weinan Zhang, Dongrui Liu, and Jing Shao. Your agent may m...

  27. [27]

    nips.cc/paper_files/paper/2023/hash/1b44b878bb782e6954cd888628510e90-Abstract-Conference.html

    URL http://papers. nips.cc/paper_files/paper/2023/hash/1b44b878bb782e6954cd888628510e90-Abstract-Conference.html. Richard S. Sutton, Doina Precup, and Satinder Singh. Between MDPs and semi-MDPs: A framework for temporal abstraction in reinforcement learning.Artificial Intelligence, 112(1–2):181–211,

  28. [28]

    Kimi K2.5: Visual Agentic Intelligence

    doi: 10.1016/S0004-3702(99)00052-1. URL https: //doi.org/10.1016/S0004-3702(99)00052-1. Kimi Team, Tongtong Bai, Yifan Bai, Yiping Bao, S. H. Cai, Yuan Cao, Y. Charles, H. S. Che, Cheng Chen, Guanduo Chen, Huarong Chen, Jia Chen, Jiahao Chen, Jianlong Chen, Jun Chen, Kefan Chen, Liang Chen, Ruijue Chen, Xinhao Chen, Yanru Chen, Yanxu Chen, Yicun Chen, Yim...

  29. [29]

    Describe, Explain, Plan and Select: Interactive Planning with Large Language Models Enables Open-World Multi-Task Agents

    URL https://arxiv.org/abs/2302.01560. Zhiyong Wu, Zhenyu Wu, Fangzhi Xu, Yian Wang, Qiushi Sun, Chengyou Jia, Kanzhi Cheng, Zichen Ding, Liheng Chen, Paul Pu Liang, and Yu Qiao. OS-ATLAS: A foundation action model for generalist GUI agents,

  30. [30]

    URL https://arxiv.org/abs/2602.08234. Tianbao Xie, Danyang Zhang, Jixuan Chen, Xiaochuan Li, Siheng Zhao, Ruisheng Cao, Toh Jing Hua, Zhoujun Cheng, Dongchan Shin, Fangyu Lei, Yitao Liu, Yiheng Xu, Shuyan Zhou, Silvio Savarese, Caiming Xiong, Victor Zhong, and Tao Yu. Osworld: Benchmarking multimodal agents for open-ended tasks in real computer environments,

  31. [31]

    13 Renjun Xu and Yang Yan

    URL https://arxiv.org/abs/2506.10387. 13 Renjun Xu and Yang Yan. Agent skills for large language models: Architecture, acquisition, security, and the path forward,

  32. [32]

    Agent Skills for Large Language Models: Architecture, Acquisition, Security, and the Path Forward

    URL https://arxiv.org/abs/2602.12430. Yibin Xu, Liang Yang, Hao Chen, Hua Wang, Zhi Chen, and Yaohua Tang. Deskvision: Large scale desktop region captioning for advanced gui agents,

  33. [33]

    Jingyi Yang, Shuai Shao, Dongrui Liu, and Jing Shao

    URL https://arxiv.org/abs/2503.11170. Jingyi Yang, Shuai Shao, Dongrui Liu, and Jing Shao. Riosworld: Benchmarking the risk of multimodal computer-use agents, 2025a. URL https://arxiv.org/abs/2506.00618. Pei Yang, Hai Ci, and Mike Zheng Shou. macosworld: A multilingual interactive benchmark for GUI agents, 2025b. URL https: //arxiv.org/abs/2506.04135. Yin...

  34. [34]

    AppAgent: Multimodal Agents as Smartphone Users

    URL https://arxiv.org/abs/2312.13771. Kangning Zhang, Yingjie Qin, Jiarui Jin, Yifan Liu, Ruilong Su, Weinan Zhang, and Yong Yu. Dream: A dual representation learning model for multimodal recommendation,

  35. [35]

    Kangning Zhang, Wenxiang Jiao, Kounianhua Du, Yuan Lu, Weiwen Liu, Weinan Zhang, and Yong Yu

    URL https://arxiv.org/abs/2404.11119. Kangning Zhang, Wenxiang Jiao, Kounianhua Du, Yuan Lu, Weiwen Liu, Weinan Zhang, and Yong Yu. Looptool: Closing the data-training loop for robust llm tool calls,

  36. [36]

    Zhixin Zhang, Yiyuan Zhang, Xiaohan Ding, and Xiangyu Yue

    URL https://arxiv.org/abs/2511.09148. Boyuan Zheng, Boyu Gou, Jihyung Kil, Huan Sun, and Yu Su. Gpt-4v(ision) is a generalist web agent, if grounded,

  37. [37]

    GPT-4V(ision) is a Generalist Web Agent, if Grounded

    URL https://arxiv.org/abs/2401.01614. Boyuan Zheng, Michael Y. Fatemi, Xiaolong Jin, Zora Zhiruo Wang, Apurva Gandhi, Yueqi Song, Yu Gu, Jayanth Srinivasa, Gaowen Liu, Graham Neubig, and Yu Su. Skillweaver: Web agents can self-improve by discovering and honing skills,

  38. [38]

    SkillWeaver: Web Agents can Self-Improve by Discovering and Honing Skills

    URL https://arxiv.org/abs/2504.07079. Shuyan Zhou, Frank F. Xu, Hao Zhu, Xuhui Zhou, Robert Lo, Abishek Sridhar, Xianyi Cheng, Tianyue Ou, Yonatan Bisk, Daniel Fried, Uri Alon, and Graham Neubig. Webarena: A realistic web environment for building autonomous agents,

  39. [39]

    WebArena: A Realistic Web Environment for Building Autonomous Agents

    URL https://arxiv.org/abs/2307.13854. 14 Appendix A Benchmark Statistics We use four visual-agent benchmarks.OSWorldis the primary GUI benchmark and contains Ubuntu desktop tasks across browsers, office software, creative tools, media applications, system settings, code editors, email, and multi- application workflows (Xie et al., 2024).macOSWorldprovides...

  40. [40]

    Recent LLM agents have made skills a practical interface for storing and composing procedural knowledge in language-conditioned environments

    Skills for agents.Skill reuse has a long history in temporal abstraction for reinforcement learning and motor primitives for robotics (Sutton et al., 1999; Ijspeert et al., 2013). Recent LLM agents have made skills a practical interface for storing and composing procedural knowledge in language-conditioned environments. Early systems connected language mo...