pith. sign in

arxiv: 2606.01414 · v1 · pith:CB5EVLIPnew · submitted 2026-05-31 · 💻 cs.CV

Agent Skills Should Go Beyond Text: The Case for Visual Skills

Pith reviewed 2026-06-28 17:09 UTC · model grok-4.3

classification 💻 cs.CV
keywords agent skillsmultimodal skillsvisual priorsGUI agentsreusable experienceskill learningtrajectory extraction
0
0 comments X

The pith

Text-only skills limit agents on visual tasks because reusable knowledge often requires spatial layout and visual grounding that text cannot reliably encode.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper claims that storing reusable agent experience solely as text instructions or summaries creates a bottleneck for tasks that depend on spatial layout, fine-grained appearance, and localized state changes. It introduces a multimodal skill approach that pairs textual logic with three visual forms: static priors for fixed spatial conventions, dynamic priors for temporary visual memory, and interleaved skills that link each text step to the exact frames or regions that justify it. An automatic extraction system pulls these elements from task trajectories while keeping spatial references and interaction patterns intact. Experiments on GUI and similar visual tasks show the multimodal version outperforms text-only skills, especially when success hinges on knowing where to look and how to confirm visual outcomes.

Core claim

Reusable agent skills should be stored as multimodal assets rather than text alone; the three forms of visual support—static priors, dynamic priors, and interleaved visual bindings—directly encode where to look, how to inspect, and how to verify results, and an automatic trajectory-to-skill converter can extract them at scale while preserving spatial and visual detail.

What carries the argument

The multimodal skill paradigm that augments declarative text with explicit visual support in the form of static priors for stable spatial conventions, dynamic priors for in-situ visual working memory, and interleaved skills that bind ordered steps to source frames or regions.

If this is right

  • Visual skills improve success rates on tasks that require spatial correspondence or state-aware visual verification.
  • The extraction system scales creation of reusable skills by preserving textual reasoning together with the visual evidence that supports it.
  • Multimodal skills encode verification steps that text descriptions alone cannot capture reliably.
  • Agents accumulate experience more effectively when skills retain the original visual context rather than summaries.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same extraction approach could be applied to non-GUI visual domains such as robotic manipulation where trajectory data also contains camera frames.
  • Maintaining a growing library of these multimodal skills might reduce the need for agents to re-learn visual patterns across similar tasks.
  • Interleaved skills could serve as a bridge between language-based planning and pixel-level control in future agent architectures.

Load-bearing premise

Visual skills can be extracted automatically from trajectories while keeping accurate spatial references and visual boundaries without introducing errors that cancel out the reported performance gains.

What would settle it

A controlled test in which the automatic extraction system produces visual skills whose error rate in spatial references or boundary detection causes lower task success than equivalent text-only skills on the same GUI benchmarks.

Figures

Figures reproduced from arXiv: 2606.01414 by Binxiao Xu, Bocheng Zou, Hang Hua, Ruichuan An.

Figure 1
Figure 1. Figure 1: From text-only reuse to visual skills. Direct answering solves each visual task as a one-off prompt, while text-only skills can reuse rules but leave spatial conventions implicit. VISUAL SKILL combines reusable text rules with explicit visual priors, making the visual protocol visible, reusable, and grounded. nearby distractors [6]. Similarly, UI design and poster generation depend on reusable visual patte… view at source ↗
Figure 2
Figure 2. Figure 2: VISUAL SKILL capability demo. Visual skills are organized by the visual bottleneck they solve: static skills clarify reusable spatial conventions; dynamic skills write intermediate state back onto the task image, including slide critique and ARC route-state planning; interleaved skills keep reasoning steps adjacent to their visual evidence. To address this issue, we propose VISUAL SKILL, a new agent-skill … view at source ↗
Figure 3
Figure 3. Figure 3: Authoring reusable visual skills from multimodal context. AUTOVISUALSKILL converts a user goal and optional multimodal materials into a reusable visual skill by analyzing task constraints, generating visual priors or source-grounded references when needed, and packaging them with textual logic and binding manifests for cross-instance transfer. with skill.md, manifest.json, visual assets, and provenance rec… view at source ↗
Figure 4
Figure 4. Figure 4: Qualitative examples of visual skills. Top: GUI grounding examples where visual priors improve hitbox localization. Bottom: dense-counting examples where dynamic visual traces improve count prediction. D, T, and V denote direct prompting, text-only skill prompting, and visual-skill prompting, respectively; GUI numbers are IoU, and counting numbers are predicted counts. 6 Discussion 6.1 VISUAL SKILL vs. Ima… view at source ↗
Figure 5
Figure 5. Figure 5: Representative failure cases. Visual Skills can fail when the intended spatial convention conflicts with semantic granularity: GUI priors may over-focus on the smallest glyph-like component, while counting trajectories may expose ambiguity between composite objects, subparts, and low￾salience instances. 18 [PITH_FULL_IMAGE:figures/full_fig_p018_5.png] view at source ↗
Figure 6
Figure 6. Figure 6: When to use text-only rules, static priors, dynamic priors, or interleaved visual skills. The boundary is determined by the type of bottleneck, not by the dataset name. 22 [PITH_FULL_IMAGE:figures/full_fig_p022_6.png] view at source ↗
Figure 7
Figure 7. Figure 7: Static visual skill examples. These examples use source-backed overlays to clarify reusable spatial conventions. 24 [PITH_FULL_IMAGE:figures/full_fig_p024_7.png] view at source ↗
Figure 8
Figure 8. Figure 8: Dynamic visual skill examples. Dynamic skills render intermediate state onto the current task image, making progress auditable and reusable across reasoning steps. 25 [PITH_FULL_IMAGE:figures/full_fig_p025_8.png] view at source ↗
Figure 9
Figure 9. Figure 9: Interleaved visual skill examples. Interleaved skills preserve ordered text–visual evidence bindings for tutorials, documentation, videos, and document-like sources. 26 [PITH_FULL_IMAGE:figures/full_fig_p026_9.png] view at source ↗
read the original abstract

Reusable skills are a key mechanism for extending agent capabilities, allowing agents to accumulate experience and solve increasingly complex tasks. Yet most existing skill-learning methods store reusable experience as text-only assets, such as instructions, reasoning traces, or summarized trajectories. We argue that this text-only paradigm creates a fundamental bottleneck for visual-centric tasks, where reusable knowledge often depends on spatial layout, visual grounding, fine-grained appearance, and localized state changes. To address this limitation, we propose \textbf{\NAME}, a multimodal skill paradigm that combines declarative textual logic with explicit visual support. We distinguish three reusable forms: static priors for stable spatial conventions, dynamic priors for in-situ visual working memory, and interleaved visual skills that bind ordered text steps to the source frames, screenshots, or page regions that justify them. Rather than only describing what to do, visual skills also encode where to look, how to inspect, and how to verify visual outcomes. To scale visual-skill construction, we introduce \textbf{\SYSTEM}, an automatic system that converts agent experience into reusable multimodal skills by preserving textual reasoning, spatial references, visual boundaries, and interaction patterns from task trajectories. Experiments on GUI and other visual-centric tasks show that visual skills consistently outperform text-only skills, particularly when success requires spatial correspondence, visual evidence, and state-aware interaction. These results support our central position: reusable agent skills should go beyond text and become multimodal assets for future multimodal agents.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The manuscript argues that text-only reusable skills create a bottleneck for visual-centric agent tasks and proposes extbf{ ame}, a multimodal skill paradigm with three forms (static priors for stable spatial conventions, dynamic priors for in-situ visual working memory, and interleaved visual skills binding ordered text steps to source frames/screenshots/regions). It introduces extbf{ extsc{system}}, an automatic converter that extracts these from trajectories while preserving textual reasoning, spatial references, visual boundaries, and interaction patterns. Experiments on GUI and other visual-centric tasks are reported to show consistent outperformance of visual skills over text-only skills, especially when spatial correspondence, visual evidence, and state-aware interaction are required.

Significance. If the empirical results hold after addressing extraction fidelity and experimental controls, the work could meaningfully influence skill representation in multimodal agents by demonstrating that explicit visual assets improve performance on tasks where text alone is insufficient. The automatic extraction approach addresses a key scalability issue for accumulating reusable multimodal experience.

major comments (3)
  1. [ extsc{system} description and extraction pipeline] The central claim rests on extsc{system} producing faithful visual skills without offsetting errors in spatial references or boundaries. No quantitative evaluation of extraction fidelity (e.g., precision/recall of bounding boxes, frame selection accuracy, or state-change localization error rates) is provided in the extsc{system} section or experiments; this is load-bearing because misalignment would directly undermine the reported gains over text-only skills.
  2. [Experiments and evaluation] The experiments section states 'consistent outperformance' but provides no details on baselines (e.g., specific text-only skill methods), metrics, statistical tests, number of runs, or controls for post-hoc selection. Without these, it is impossible to determine whether the results support the multimodal advantage or reflect experimental artifacts.
  3. [Results on GUI tasks] The claim that visual skills are particularly advantageous 'when success requires spatial correspondence, visual evidence, and state-aware interaction' requires explicit supporting evidence (e.g., per-task breakdowns or ablation tables) rather than aggregate statements; the current presentation leaves the differential benefit unquantified.
minor comments (2)
  1. [Abstract and introduction] The manuscript uses placeholder macros extbf{ ame} and extbf{ extsc{system}}; replace with the actual names and expand the first use of each acronym.
  2. [Multimodal skill paradigm] Clarify whether the three reusable forms are mutually exclusive or can be combined in a single skill asset, and provide a concrete example trajectory showing all three forms.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We address each major comment below and commit to revisions that directly strengthen the manuscript's claims regarding extraction fidelity, experimental rigor, and differential benefits of visual skills.

read point-by-point responses
  1. Referee: [system description and extraction pipeline] The central claim rests on extsc{system} producing faithful visual skills without offsetting errors in spatial references or boundaries. No quantitative evaluation of extraction fidelity (e.g., precision/recall of bounding boxes, frame selection accuracy, or state-change localization error rates) is provided in the extsc{system} section or experiments; this is load-bearing because misalignment would directly undermine the reported gains over text-only skills.

    Authors: We agree that the absence of quantitative fidelity metrics for the extraction pipeline is a limitation that weakens the central claim. In the revised manuscript we will add a dedicated evaluation subsection (and associated table) reporting precision/recall for bounding-box extraction, frame-selection accuracy, and state-change localization error on a manually annotated subset of trajectories. These metrics will be computed against human-verified ground truth and will be used to bound the potential impact of misalignment on the reported performance gains. revision: yes

  2. Referee: [Experiments and evaluation] The experiments section states 'consistent outperformance' but provides no details on baselines (e.g., specific text-only skill methods), metrics, statistical tests, number of runs, or controls for post-hoc selection. Without these, it is impossible to determine whether the results support the multimodal advantage or reflect experimental artifacts.

    Authors: We acknowledge that the current experimental description lacks the necessary methodological detail. The revised Experiments section will explicitly list the text-only baselines (including ReAct-style and Reflexion-style skill variants), the primary metric (task success rate), the number of independent runs (five random seeds), the statistical tests performed (paired t-tests with reported p-values), and the procedure used to avoid post-hoc selection bias. Variance across runs will also be reported. revision: yes

  3. Referee: [Results on GUI tasks] The claim that visual skills are particularly advantageous 'when success requires spatial correspondence, visual evidence, and state-aware interaction' requires explicit supporting evidence (e.g., per-task breakdowns or ablation tables) rather than aggregate statements; the current presentation leaves the differential benefit unquantified.

    Authors: We agree that aggregate statements alone are insufficient to substantiate the differential advantage. The revised results section will include per-task breakdown tables (and an accompanying ablation) that separate tasks according to whether they require spatial correspondence, visual evidence, or state-aware interaction. These tables will report success-rate deltas between visual and text-only skills for each category, thereby quantifying the claimed benefit. revision: yes

Circularity Check

0 steps flagged

No significant circularity detected

full rationale

The paper is a conceptual proposal introducing multimodal visual skills (static priors, dynamic priors, interleaved forms) and an automatic extraction system SYSTEM, supported by empirical results on GUI tasks. No mathematical derivations, equations, fitted parameters, predictions, or self-citations appear in the provided text that reduce any claim to its inputs by construction. The central position is advanced as an argument plus experiments rather than a closed definitional or self-referential loop, making the derivation chain self-contained.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 2 invented entities

Based solely on the abstract, the paper introduces new conceptual categories for skills and a new extraction system but contains no free parameters, mathematical axioms, or invented physical entities.

invented entities (2)
  • Multimodal skill paradigm with three reusable forms no independent evidence
    purpose: To encode spatial layout, visual grounding, and state changes alongside text for agent skills
    New conceptual framework proposed to address the stated text-only bottleneck.
  • SYSTEM automatic conversion system no independent evidence
    purpose: To scale construction of visual skills from agent trajectories
    New method introduced for preserving textual reasoning and visual references.

pith-pipeline@v0.9.1-grok · 5789 in / 1176 out tokens · 35181 ms · 2026-06-28T17:09:33.922268+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

53 extracted references · 37 canonical work pages · 16 internal anchors

  1. [1]

    DeFT: A conceptual framework for considering learning with multiple representations

    Shaaron Ainsworth. DeFT: A conceptual framework for considering learning with multiple representations. Learning and Instruction, 16(3):183–198, 2006. doi: 10.1016/j.learninstruc.2006.03.001

  2. [2]

    Qwen Technical Report

    Jinze Bai, Shuai Bai, Yunfei Chu, Zeyu Cui, Kai Dang, Xiaodong Deng, Yang Fan, Wenbin Ge, Yu Han, Fei Huang, et al. Qwen technical report.arXiv preprint arXiv:2309.16609, 2023. doi: 10.48550/arXiv. 2309.16609. URLhttps://arxiv.org/abs/2309.16609

  3. [3]

    Qwen3-VL Technical Report

    Shuai Bai, Yuxuan Cai, Ruizhe Chen, Keqin Chen, Xionghui Chen, Zesen Cheng, Lianghao Deng, Wei Ding, Chang Gao, Chunjiang Ge, et al. Qwen3-VL technical report, 2025. URL https://arxiv.org/ abs/2511.21631

  4. [4]

    PaliGemma: A versatile 3B VLM for transfer

    Lucas Beyer, Andreas Steiner, André Susano Pinto, Alexander Kolesnikov, Xiao Wang, Daniel Salz, Maxim Neumann, Ibrahim Alabdulmohsin, Michael Tschannen, Emanuele Bugliarello, et al. PaliGemma: A versatile 3b VLM for transfer.arXiv preprint arXiv:2407.07726, 2024. doi: 10.48550/arXiv.2407.07726. URLhttps://arxiv.org/abs/2407.07726. 11

  5. [5]

    Cognitive load theory and the format of instruction.Cognition and Instruction, 8(4):293–332, 1991

    Paul Chandler and John Sweller. Cognitive load theory and the format of instruction.Cognition and Instruction, 8(4):293–332, 1991. doi: 10.1207/s1532690xci0804_2

  6. [6]

    SeeClick: Harnessing GUI grounding for advanced visual GUI agents

    Kanzhi Cheng, Qiushi Sun, Yougang Chu, Fangzhi Xu, YanTao Li, Jianbing Zhang, and Zhiyong Wu. SeeClick: Harnessing GUI grounding for advanced visual GUI agents. In Lun-Wei Ku, Andre Martins, and Vivek Srikumar, editors,Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 9313–9332, Bangkok,...

  7. [7]

    Gemini 2.5: Pushing the Frontier with Advanced Reasoning, Multimodality, Long Context, and Next Generation Agentic Capabilities

    Gheorghe Comanici, Eric Bieber, Mike Schaekermann, Ice Pasupat, Noveen Sachdeva, Inderjit Dhillon, Marcel Blistein, Ori Ram, Dan Zhang, Evan Rosen, et al. Gemini 2.5: Pushing the frontier with advanced reasoning, multimodality, long context, and next generation agentic capabilities, 2025. URL https: //arxiv.org/abs/2507.06261

  8. [8]

    DeepSeek-V3 Technical Report

    DeepSeek-AI, Aixin Liu, Bei Feng, Bing Xue, Bingxuan Wang, Bochao Wu, Chengda Lu, Chenggang Zhao, Chengqi Deng, Chenyu Zhang, Chong Ruan, et al. DeepSeek-V3 technical report.arXiv preprint arXiv:2412.19437, 2024. doi: 10.48550/arXiv.2412.19437. URL https://arxiv.org/abs/2412. 19437

  9. [9]

    Mind2Web: Towards a generalist agent for the web

    Xiang Deng, Yu Gu, Boyuan Zheng, Shijie Chen, Samuel Stevens, Boshi Wang, Huan Sun, and Yu Su. Mind2Web: Towards a generalist agent for the web. InAdvances in Neural Information Processing Systems, volume 36, pages 28091–28114, 2023. URL https://proceedings.neurips.cc/paper_files/ paper/2023/file/5950bf290a1570ea401bf98882128160-Paper-Datasets_and_Benchmarks. pdf

  10. [10]

    Gemini: A Family of Highly Capable Multimodal Models

    Gemini Team, Rohan Anil, Sebastian Borgeaud, Jean-Baptiste Alayrac, Jiahui Yu, Radu Soricut, Johan Schalkwyk, Andrew M. Dai, Anja Hauth, Katie Millican, et al. Gemini: A family of highly capable multimodal models.arXiv preprint arXiv:2312.11805, 2023. doi: 10.48550/arXiv.2312.11805. URL https://arxiv.org/abs/2312.11805

  11. [11]

    UI-Venus technical report: Building high-performance UI agents with RFT.arXiv preprint arXiv:2508.10833, 2025

    Zhangxuan Gu, Zhengwen Zeng, Zhenyu Xu, Xingran Zhou, Shuheng Shen, Yunfei Liu, Beitong Zhou, Changhua Meng, Tianyu Xia, Weizhi Chen, et al. UI-Venus technical report: Building high-performance UI agents with RFT.arXiv preprint arXiv:2508.10833, 2025. doi: 10.48550/arXiv.2508.10833. URL https://arxiv.org/abs/2508.10833

  12. [12]

    Black, and Otmar Hilliges

    Tanmay Gupta and Aniruddha Kembhavi. Visual programming: Compositional visual reasoning without training. InProceedings of the IEEE/CVF Conference on Computer Vision and Pat- tern Recognition, pages 14953–14962, June 2023. doi: 10.1109/CVPR52729.2023.01436. URL https://openaccess.thecvf.com/content/CVPR2023/html/Gupta_Visual_Programming_ Compositional_Vis...

  13. [13]

    SugarCrepe: Fixing hackable benchmarks for vision-language compositionality.Advances in Neural Information Processing Systems, 36:31096–31116, 2023

    Cheng-Yu Hsieh, Jieyu Zhang, Zixian Ma, Aniruddha Kembhavi, and Ranjay Krishna. SugarCrepe: Fixing hackable benchmarks for vision-language compositionality.Advances in Neural Information Processing Systems, 36:31096–31116, 2023

  14. [14]

    Black, and Otmar Hilliges

    Hsiao Yuan Hsu, Xiangteng He, Yuxin Peng, Hao Kong, and Qing Zhang. PosterLay- out: A new benchmark and approach for content-aware visual-textual presentation layout. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recogni- tion, pages 6018–6026, June 2023. doi: 10.1109/CVPR52729.2023.00583. URL https: //openaccess.thecvf.com/conte...

  15. [15]

    Promptcap: Prompt- guided image captioning for vqa with gpt-3

    Yushi Hu, Hang Hua, Zhengyuan Yang, Weijia Shi, Noah A Smith, and Jiebo Luo. Promptcap: Prompt- guided image captioning for vqa with gpt-3. InProceedings of the IEEE/CVF international conference on computer vision, pages 2963–2975, 2023

  16. [16]

    FineMatch: Aspect-based fine-grained image and text mismatch detection and correction

    Hang Hua, Jing Shi, Kushal Kafle, Simon Jenni, Daoan Zhang, John Collomosse, Scott Cohen, and Jiebo Luo. FineMatch: Aspect-based fine-grained image and text mismatch detection and correction. In European Conference on Computer Vision, pages 474–491. Springer, 2024

  17. [17]

    MMCOMPOSITION: Revisiting the compositionality of pre-trained vision-language models.arXiv preprint arXiv:2410.09733, 2024

    Hang Hua, Yunlong Tang, Ziyun Zeng, Liangliang Cao, Zhengyuan Yang, Hangfeng He, Chenliang Xu, and Jiebo Luo. MMCOMPOSITION: Revisiting the compositionality of pre-trained vision-language models.arXiv preprint arXiv:2410.09733, 2024. doi: 10.48550/arXiv.2410.09733. URL https://arxiv. org/abs/2410.09733. 12

  18. [18]

    V2Xum-LLM: Cross-modal video summarization with temporal prompt instruction tuning

    Hang Hua, Yolo Yunlong Tang, Chenliang Xu, and Jiebo Luo. V2Xum-LLM: Cross-modal video summarization with temporal prompt instruction tuning. InProceedings of the AAAI Conference on Artificial Intelligence, volume 39, pages 3599–3607, 2025. doi: 10.1609/aaai.v39i4.32374. URL https://arxiv.org/abs/2404.12353

  19. [19]

    Aliaga, Wei Xiong, and Jiebo Luo

    Hang Hua, Ziyun Zeng, Yizhi Song, Yunlong Tang, Liu He, Daniel G. Aliaga, Wei Xiong, and Jiebo Luo. MMIG-Bench: Towards comprehensive and explainable evaluation of multi-modal image generation models.arXiv preprint arXiv:2505.19415, 2025. doi: 10.48550/arXiv.2505.19415. URL https://arxiv. org/abs/2505.19415

  20. [20]

    Building a foundational guardrail for general agentic systems via synthetic data.arXiv preprint arXiv:2510.09781, 2025

    Yue Huang, Hang Hua, Yujun Zhou, Pengcheng Jing, Manish Nagireddy, Inkit Padhi, Greta Dolcetti, Zhangchen Xu, Subhajit Chaudhury, Ambrish Rawat, et al. Building a foundational guardrail for general agentic systems via synthetic data.arXiv preprint arXiv:2510.09781, 2025

  21. [21]

    LangChain. Agents. https://docs.langchain.com/oss/python/langchain/agents, 2026. URL https://docs.langchain.com/oss/python/langchain/agents. Documentation; accessed 2026- 05-03

  22. [22]

    Larkin and Herbert A

    Jill H. Larkin and Herbert A. Simon. Why a diagram is (sometimes) worth ten thousand words.Cognitive Science, 11(1):65–100, 1987. doi: 10.1111/j.1551-6708.1987.tb00863.x. URL https://doi.org/10. 1111/j.1551-6708.1987.tb00863.x

  23. [23]

    Laurençon, L

    Hugo Laurençon, Léo Tronchon, and Victor Sanh. Unlocking the conversion of web screenshots into HTML code with the WebSight dataset, 2024. URLhttps://arxiv.org/abs/2403.09029

  24. [24]

    JobBench: Aligning Agent Work With Human Will

    Yuetai Li, Yichen Feng, Zhangchen Xu, Zixian Ma, Kaiyuan Zheng, Fengqing Jiang, Xinghua Sun, Rulin Shao, Zichen Chen, Yue Huang, et al. Jobbench: Aligning agent work with human will.arXiv preprint arXiv:2605.26329, 2026

  25. [25]

    Mayer and Roxana Moreno

    Richard E. Mayer and Roxana Moreno. Nine ways to reduce cognitive load in multimedia learning. Educational Psychologist, 38(1):43–52, 2003. doi: 10.1207/S15326985EP3801_6

  26. [26]

    GPT-4 Technical Report

    OpenAI, Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. GPT-4 technical report.arXiv preprint arXiv:2303.08774, 2023. doi: 10.48550/arXiv.2303.08774. URL https://arxiv. org/abs/2303.08774

  27. [27]

    Teaching CLIP to count to ten.arXiv preprint arXiv:2302.12066, 2023

    Roni Paiss, Ariel Ephrat, Omer Tov, Shiran Zada, Inbar Mosseri, Michal Irani, and Tali Dekel. Teaching CLIP to count to ten.arXiv preprint arXiv:2302.12066, 2023. doi: 10.48550/arXiv.2302.12066. URL https://arxiv.org/abs/2302.12066

  28. [28]

    Number 9 in Oxford Psychology Series

    Allan Paivio.Mental Representations: A Dual Coding Approach. Number 9 in Oxford Psychology Series. Oxford University Press, New York, 1986. ISBN 0-19-503936-X. URL https://academic.oup.com/ book/10932

  29. [29]

    Toolformer: Language models can teach themselves to use tools

    Timo Schick, Jane Dwivedi-Yu, Roberto Dessì, Roberta Raileanu, Maria Lomeli, Eric Hambro, Luke Zettlemoyer, Nicola Cancedda, and Thomas Scialom. Toolformer: Language models can teach themselves to use tools. InAdvances in Neural Information Processing Systems, volume 36, pages 68539–68551, 2023. URL https://proceedings.neurips.cc/paper_files/paper/2023/ha...

  30. [30]

    Claude E. Shannon. Coding theorems for a discrete source with a fidelity criterion. InIRE National Convention Record, Part 4, pages 142–163, 1959

  31. [31]

    Design2Code: Benchmarking multimodal code generation for automated front-end engineering

    Chenglei Si, Yanzhe Zhang, Ryan Li, Zhengyuan Yang, Ruibo Liu, and Diyi Yang. Design2Code: Benchmarking multimodal code generation for automated front-end engineering. InProceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), pages 3956...

  32. [32]

    Latent chain-of-thought for visual reasoning.Advances in neural information processing systems, 38:103739–103762, 2026

    Guohao Sun, Hang Hua, Jian Wang, Jiebo Luo, Sohail Dianat, Majid Rabbani, Raghuveer Rao, and Zhiqiang Tao. Latent chain-of-thought for visual reasoning.Advances in neural information processing systems, 38:103739–103762, 2026

  33. [33]

    Cognitive load during problem solving: Effects on learning.Cognitive Science, 12(2): 257–285, 1988

    John Sweller. Cognitive load during problem solving: Effects on learning.Cognitive Science, 12(2): 257–285, 1988. doi: 10.1207/s15516709cog1202_4. 13

  34. [34]

    John Sweller, Jeroen J. G. van Merrienboer, and Fred G. W. C. Paas. Cognitive architecture and instructional design.Educational Psychology Review, 10(3):251–296, 1998. doi: 10.1023/A:1022193728205

  35. [35]

    Winoground: Probing vision and language models for visio-linguistic compositionality

    Tristan Thrush, Ryan Jiang, Max Bartolo, Amanpreet Singh, Adina Williams, Douwe Kiela, and Candace Ross. Winoground: Probing vision and language models for visio-linguistic compositionality. InPro- ceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 5238–5248, 2022

  36. [36]

    The information bottleneck method

    Naftali Tishby, Fernando C. Pereira, and William Bialek. The information bottleneck method. In Proceedings of the 37th Annual Allerton Conference on Communication, Control, and Computing, pages 368–377, Monticello, IL, USA, 1999. URLhttps://arxiv.org/abs/physics/0004057

  37. [37]

    Cognitive fit: A theory-based analysis of the graphs versus tables literature.Decision Sciences, 22(2):219–240, 1991

    Iris Vessey. Cognitive fit: A theory-based analysis of the graphs versus tables literature.Decision Sciences, 22(2):219–240, 1991. doi: 10.1111/j.1540-5915.1991.tb00344.x. URL https://doi.org/10.1111/j. 1540-5915.1991.tb00344.x

  38. [38]

    White, Doug Burger, and Chi Wang

    Qingyun Wu, Gagan Bansal, Jieyu Zhang, Yiran Wu, Beibin Li, Erkang Zhu, Li Jiang, Xiaoyun Zhang, Shaokun Zhang, Jiale Liu, Ahmed Hassan Awadallah, Ryen W. White, Doug Burger, and Chi Wang. AutoGen: Enabling next-gen LLM applications via multi-agent conversations. InProceedings of the First Conference on Language Modeling, 2024. URL https://openreview.net/...

  39. [39]

    OS-ATLAS: A Foundation Action Model for Generalist GUI Agents

    Zhiyong Wu, Zhenyu Wu, Fangzhi Xu, Yian Wang, Qiushi Sun, Chengyou Jia, Kanzhi Cheng, Zichen Ding, Liheng Chen, Paul Pu Liang, and Yu Qiao. OS-ATLAS: A foundation action model for generalist GUI agents.arXiv preprint arXiv:2410.23218, 2024. doi: 10.48550/arXiv.2410.23218. URL https: //arxiv.org/abs/2410.23218

  40. [40]

    ReAct: Synergizing Reasoning and Acting in Language Models

    Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik Narasimhan, and Yuan Cao. Re- Act: Synergizing reasoning and acting in language models. InInternational Conference on Learning Representations, 2023. doi: 10.48550/arXiv.2210.03629. URL https://arxiv.org/abs/2210.03629

  41. [41]

    Mobile-Agent-v3: Fundamental Agents for GUI Automation

    Jiabo Ye, Xi Zhang, Haiyang Xu, Haowei Liu, Junyang Wang, Zhaoqing Zhu, Ziwei Zheng, Feiyu Gao, Junjie Cao, Zhengxi Lu, et al. Mobile-agent-v3: Fundamental agents for GUI automation.arXiv preprint arXiv:2508.15144, 2025. doi: 10.48550/arXiv.2508.15144. URL https://arxiv.org/abs/2508. 15144

  42. [42]

    Aurora: Unified Video Editing with a Tool-Using Agent

    Yongsheng Yu, Ziyun Zeng, Zhiyuan Xiao, Zhenghong Zhou, Hang Hua, Wei Xiong, and Jiebo Luo. Aurora: Unified video editing with a tool-using agent.arXiv preprint arXiv:2605.18748, 2026

  43. [43]

    MIRA: Multimodal iterative reasoning agent for image editing

    Ziyun Zeng, Hang Hua, and Jiebo Luo. MIRA: Multimodal iterative reasoning agent for image editing. arXiv preprint arXiv:2511.21087, 2025. doi: 10.48550/arXiv.2511.21087. URL https://arxiv.org/ abs/2511.21087

  44. [44]

    MementoGUI: Learning Agentic Multimodal Memory Control for Long-Horizon GUI Agents

    Ziyun Zeng, Hang Hua, Bocheng Zou, Mu Cai, Rogerio Feris, and Jiebo Luo. Mementogui: Learning agentic multimodal memory control for long-horizon gui agents.arXiv preprint arXiv:2605.18652, 2026

  45. [45]

    AppAgent: Multimodal Agents as Smartphone Users

    Chi Zhang, Zhao Yang, Jiaxuan Liu, Yucheng Han, Xin Chen, Zebiao Huang, Bin Fu, and Gang Yu. Ap- pAgent: Multimodal agents as smartphone users, 2023. URLhttps://arxiv.org/abs/2312.13771

  46. [46]

    AgentStudio: A toolkit for building general virtual agents, 2024

    Longtao Zheng, Zhiyuan Huang, Zhenghai Xue, Xinrun Wang, Bo An, and Shuicheng Yan. AgentStudio: A toolkit for building general virtual agents, 2024. URL https://arxiv.org/abs/2403.17918. ICLR 2025

  47. [47]

    goal": ...,

    Hanzhang Zhou, Xu Zhang, Panrong Tong, Jianan Zhang, Liangyu Chen, Quyu Kong, Chenglin Cai, Chen Liu, Yue Wang, Jingren Zhou, and Steven Hoi. MAI-UI technical report: Real-world centric foundation GUI agents.arXiv preprint arXiv:2512.22047, 2025. doi: 10.48550/arXiv.2512.22047. URL https://arxiv.org/abs/2512.22047

  48. [48]

    WebArena: A Realistic Web Environment for Building Autonomous Agents

    Shuyan Zhou, Frank F. Xu, Hao Zhu, Xuhui Zhou, Robert Lo, Abishek Sridhar, Xianyi Cheng, Tianyue Ou, Yonatan Bisk, Daniel Fried, Uri Alon, and Graham Neubig. WebArena: A realistic web environment for building autonomous agents. InInternational Conference on Learning Representations, 2024. doi: 10. 48550/arXiv.2307.13854. URL https://proceedings.iclr.cc/pa...

  49. [49]

    If nested in a row/card, choose the smallest child action, not the parent con- tainer

  50. [50]

    20 Table 4:Dynamic Visual Skill for CountBenchQA.The visual skill converts counting into point- anchored enumeration

    Return the hitbox center and a tight bbox_2d in the task screenshot coordi- nate space. 20 Table 4:Dynamic Visual Skill for CountBenchQA.The visual skill converts counting into point- anchored enumeration. ♂¶ap-¶arker-altPoint-Anchored Enumeration for Visual Counting Description Count dense or ambiguous objects by producing one spatial anchor per valid ta...

  51. [51]

    Ignore legends, labels, embedded media, reflections, fragments, and sub-parts unless explicitly requested

  52. [52]

    Scan exhaustively; low-contrast or partially occluded valid instances still receive anchors

  53. [53]

    Dynamic Prior The prior is not a fixed image template

    Report points_2d and total_count; the count is the length of the an- chored instance list. Dynamic Prior The prior is not a fixed image template. It is generatedduring inferenceby overlaying numberedgreen anchorson the task image. This makes the model’s counting process visual, auditable, and reusable across object categories. 21 1 Text-only rules Use whe...