Agent Skills Should Go Beyond Text: The Case for Visual Skills

Binxiao Xu; Bocheng Zou; Hang Hua; Ruichuan An

arxiv: 2606.01414 · v1 · pith:CB5EVLIPnew · submitted 2026-05-31 · 💻 cs.CV

Agent Skills Should Go Beyond Text: The Case for Visual Skills

Binxiao Xu , Ruichuan An , Bocheng Zou , Hang Hua This is my paper

Pith reviewed 2026-06-28 17:09 UTC · model grok-4.3

classification 💻 cs.CV

keywords agent skillsmultimodal skillsvisual priorsGUI agentsreusable experienceskill learningtrajectory extraction

0 comments

The pith

Text-only skills limit agents on visual tasks because reusable knowledge often requires spatial layout and visual grounding that text cannot reliably encode.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper claims that storing reusable agent experience solely as text instructions or summaries creates a bottleneck for tasks that depend on spatial layout, fine-grained appearance, and localized state changes. It introduces a multimodal skill approach that pairs textual logic with three visual forms: static priors for fixed spatial conventions, dynamic priors for temporary visual memory, and interleaved skills that link each text step to the exact frames or regions that justify it. An automatic extraction system pulls these elements from task trajectories while keeping spatial references and interaction patterns intact. Experiments on GUI and similar visual tasks show the multimodal version outperforms text-only skills, especially when success hinges on knowing where to look and how to confirm visual outcomes.

Core claim

Reusable agent skills should be stored as multimodal assets rather than text alone; the three forms of visual support—static priors, dynamic priors, and interleaved visual bindings—directly encode where to look, how to inspect, and how to verify results, and an automatic trajectory-to-skill converter can extract them at scale while preserving spatial and visual detail.

What carries the argument

The multimodal skill paradigm that augments declarative text with explicit visual support in the form of static priors for stable spatial conventions, dynamic priors for in-situ visual working memory, and interleaved skills that bind ordered steps to source frames or regions.

If this is right

Visual skills improve success rates on tasks that require spatial correspondence or state-aware visual verification.
The extraction system scales creation of reusable skills by preserving textual reasoning together with the visual evidence that supports it.
Multimodal skills encode verification steps that text descriptions alone cannot capture reliably.
Agents accumulate experience more effectively when skills retain the original visual context rather than summaries.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same extraction approach could be applied to non-GUI visual domains such as robotic manipulation where trajectory data also contains camera frames.
Maintaining a growing library of these multimodal skills might reduce the need for agents to re-learn visual patterns across similar tasks.
Interleaved skills could serve as a bridge between language-based planning and pixel-level control in future agent architectures.

Load-bearing premise

Visual skills can be extracted automatically from trajectories while keeping accurate spatial references and visual boundaries without introducing errors that cancel out the reported performance gains.

What would settle it

A controlled test in which the automatic extraction system produces visual skills whose error rate in spatial references or boundary detection causes lower task success than equivalent text-only skills on the same GUI benchmarks.

Figures

Figures reproduced from arXiv: 2606.01414 by Binxiao Xu, Bocheng Zou, Hang Hua, Ruichuan An.

**Figure 1.** Figure 1: From text-only reuse to visual skills. Direct answering solves each visual task as a one-off prompt, while text-only skills can reuse rules but leave spatial conventions implicit. VISUAL SKILL combines reusable text rules with explicit visual priors, making the visual protocol visible, reusable, and grounded. nearby distractors [6]. Similarly, UI design and poster generation depend on reusable visual patte… view at source ↗

**Figure 2.** Figure 2: VISUAL SKILL capability demo. Visual skills are organized by the visual bottleneck they solve: static skills clarify reusable spatial conventions; dynamic skills write intermediate state back onto the task image, including slide critique and ARC route-state planning; interleaved skills keep reasoning steps adjacent to their visual evidence. To address this issue, we propose VISUAL SKILL, a new agent-skill … view at source ↗

**Figure 3.** Figure 3: Authoring reusable visual skills from multimodal context. AUTOVISUALSKILL converts a user goal and optional multimodal materials into a reusable visual skill by analyzing task constraints, generating visual priors or source-grounded references when needed, and packaging them with textual logic and binding manifests for cross-instance transfer. with skill.md, manifest.json, visual assets, and provenance rec… view at source ↗

**Figure 4.** Figure 4: Qualitative examples of visual skills. Top: GUI grounding examples where visual priors improve hitbox localization. Bottom: dense-counting examples where dynamic visual traces improve count prediction. D, T, and V denote direct prompting, text-only skill prompting, and visual-skill prompting, respectively; GUI numbers are IoU, and counting numbers are predicted counts. 6 Discussion 6.1 VISUAL SKILL vs. Ima… view at source ↗

**Figure 5.** Figure 5: Representative failure cases. Visual Skills can fail when the intended spatial convention conflicts with semantic granularity: GUI priors may over-focus on the smallest glyph-like component, while counting trajectories may expose ambiguity between composite objects, subparts, and lowsalience instances. 18 [PITH_FULL_IMAGE:figures/full_fig_p018_5.png] view at source ↗

**Figure 6.** Figure 6: When to use text-only rules, static priors, dynamic priors, or interleaved visual skills. The boundary is determined by the type of bottleneck, not by the dataset name. 22 [PITH_FULL_IMAGE:figures/full_fig_p022_6.png] view at source ↗

**Figure 7.** Figure 7: Static visual skill examples. These examples use source-backed overlays to clarify reusable spatial conventions. 24 [PITH_FULL_IMAGE:figures/full_fig_p024_7.png] view at source ↗

**Figure 8.** Figure 8: Dynamic visual skill examples. Dynamic skills render intermediate state onto the current task image, making progress auditable and reusable across reasoning steps. 25 [PITH_FULL_IMAGE:figures/full_fig_p025_8.png] view at source ↗

**Figure 9.** Figure 9: Interleaved visual skill examples. Interleaved skills preserve ordered text–visual evidence bindings for tutorials, documentation, videos, and document-like sources. 26 [PITH_FULL_IMAGE:figures/full_fig_p026_9.png] view at source ↗

read the original abstract

Reusable skills are a key mechanism for extending agent capabilities, allowing agents to accumulate experience and solve increasingly complex tasks. Yet most existing skill-learning methods store reusable experience as text-only assets, such as instructions, reasoning traces, or summarized trajectories. We argue that this text-only paradigm creates a fundamental bottleneck for visual-centric tasks, where reusable knowledge often depends on spatial layout, visual grounding, fine-grained appearance, and localized state changes. To address this limitation, we propose \textbf{\NAME}, a multimodal skill paradigm that combines declarative textual logic with explicit visual support. We distinguish three reusable forms: static priors for stable spatial conventions, dynamic priors for in-situ visual working memory, and interleaved visual skills that bind ordered text steps to the source frames, screenshots, or page regions that justify them. Rather than only describing what to do, visual skills also encode where to look, how to inspect, and how to verify visual outcomes. To scale visual-skill construction, we introduce \textbf{\SYSTEM}, an automatic system that converts agent experience into reusable multimodal skills by preserving textual reasoning, spatial references, visual boundaries, and interaction patterns from task trajectories. Experiments on GUI and other visual-centric tasks show that visual skills consistently outperform text-only skills, particularly when success requires spatial correspondence, visual evidence, and state-aware interaction. These results support our central position: reusable agent skills should go beyond text and become multimodal assets for future multimodal agents.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

The paper makes a practical case for multimodal agent skills via a three-part taxonomy and auto-extraction from trajectories, but the abstract leaves the extraction accuracy and experimental controls unexamined.

read the letter

The main takeaway is that reusable skills for visual agents need explicit visual components beyond text, and the authors supply a taxonomy plus an automatic conversion system to create them.

What is new is the split into static priors for stable layouts, dynamic priors for in-situ visual memory, and interleaved skills that tie ordered text steps to the actual frames or regions. The SYSTEM they introduce tries to pull these out of task trajectories while keeping spatial references and interaction patterns. This directly targets the gap in text-only skill work for GUI and similar tasks.

The paper does a clean job naming the limitation: text summaries drop the spatial and appearance details that matter when success depends on looking at the right place or confirming a visual state change. The idea of turning experience into reusable multimodal assets is straightforward and fits applied agent work.

The soft spots sit in the missing specifics. The abstract claims consistent outperformance but gives no baselines, metrics, or controls, so it is impossible to judge whether the gains are robust or post-hoc. The stress-test concern about extraction fidelity is on point: if frame selection or region bounding introduces misalignment, the reported advantage could disappear. Without quantitative checks on how well SYSTEM preserves boundaries and state changes, the central claim stays provisional.

This is for readers working on agent skill reuse and GUI automation who want a concrete way to add visual grounding. It is not reshaping theory but could help practical systems.

It deserves a serious referee because the position is coherent and the proposal is implementable, even if the evidence needs tightening. I would send it to review.

Referee Report

3 major / 2 minor

Summary. The manuscript argues that text-only reusable skills create a bottleneck for visual-centric agent tasks and proposes extbf{ ame}, a multimodal skill paradigm with three forms (static priors for stable spatial conventions, dynamic priors for in-situ visual working memory, and interleaved visual skills binding ordered text steps to source frames/screenshots/regions). It introduces extbf{ extsc{system}}, an automatic converter that extracts these from trajectories while preserving textual reasoning, spatial references, visual boundaries, and interaction patterns. Experiments on GUI and other visual-centric tasks are reported to show consistent outperformance of visual skills over text-only skills, especially when spatial correspondence, visual evidence, and state-aware interaction are required.

Significance. If the empirical results hold after addressing extraction fidelity and experimental controls, the work could meaningfully influence skill representation in multimodal agents by demonstrating that explicit visual assets improve performance on tasks where text alone is insufficient. The automatic extraction approach addresses a key scalability issue for accumulating reusable multimodal experience.

major comments (3)

[ extsc{system} description and extraction pipeline] The central claim rests on extsc{system} producing faithful visual skills without offsetting errors in spatial references or boundaries. No quantitative evaluation of extraction fidelity (e.g., precision/recall of bounding boxes, frame selection accuracy, or state-change localization error rates) is provided in the extsc{system} section or experiments; this is load-bearing because misalignment would directly undermine the reported gains over text-only skills.
[Experiments and evaluation] The experiments section states 'consistent outperformance' but provides no details on baselines (e.g., specific text-only skill methods), metrics, statistical tests, number of runs, or controls for post-hoc selection. Without these, it is impossible to determine whether the results support the multimodal advantage or reflect experimental artifacts.
[Results on GUI tasks] The claim that visual skills are particularly advantageous 'when success requires spatial correspondence, visual evidence, and state-aware interaction' requires explicit supporting evidence (e.g., per-task breakdowns or ablation tables) rather than aggregate statements; the current presentation leaves the differential benefit unquantified.

minor comments (2)

[Abstract and introduction] The manuscript uses placeholder macros extbf{ ame} and extbf{ extsc{system}}; replace with the actual names and expand the first use of each acronym.
[Multimodal skill paradigm] Clarify whether the three reusable forms are mutually exclusive or can be combined in a single skill asset, and provide a concrete example trajectory showing all three forms.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We address each major comment below and commit to revisions that directly strengthen the manuscript's claims regarding extraction fidelity, experimental rigor, and differential benefits of visual skills.

read point-by-point responses

Referee: [system description and extraction pipeline] The central claim rests on extsc{system} producing faithful visual skills without offsetting errors in spatial references or boundaries. No quantitative evaluation of extraction fidelity (e.g., precision/recall of bounding boxes, frame selection accuracy, or state-change localization error rates) is provided in the extsc{system} section or experiments; this is load-bearing because misalignment would directly undermine the reported gains over text-only skills.

Authors: We agree that the absence of quantitative fidelity metrics for the extraction pipeline is a limitation that weakens the central claim. In the revised manuscript we will add a dedicated evaluation subsection (and associated table) reporting precision/recall for bounding-box extraction, frame-selection accuracy, and state-change localization error on a manually annotated subset of trajectories. These metrics will be computed against human-verified ground truth and will be used to bound the potential impact of misalignment on the reported performance gains. revision: yes
Referee: [Experiments and evaluation] The experiments section states 'consistent outperformance' but provides no details on baselines (e.g., specific text-only skill methods), metrics, statistical tests, number of runs, or controls for post-hoc selection. Without these, it is impossible to determine whether the results support the multimodal advantage or reflect experimental artifacts.

Authors: We acknowledge that the current experimental description lacks the necessary methodological detail. The revised Experiments section will explicitly list the text-only baselines (including ReAct-style and Reflexion-style skill variants), the primary metric (task success rate), the number of independent runs (five random seeds), the statistical tests performed (paired t-tests with reported p-values), and the procedure used to avoid post-hoc selection bias. Variance across runs will also be reported. revision: yes
Referee: [Results on GUI tasks] The claim that visual skills are particularly advantageous 'when success requires spatial correspondence, visual evidence, and state-aware interaction' requires explicit supporting evidence (e.g., per-task breakdowns or ablation tables) rather than aggregate statements; the current presentation leaves the differential benefit unquantified.

Authors: We agree that aggregate statements alone are insufficient to substantiate the differential advantage. The revised results section will include per-task breakdown tables (and an accompanying ablation) that separate tasks according to whether they require spatial correspondence, visual evidence, or state-aware interaction. These tables will report success-rate deltas between visual and text-only skills for each category, thereby quantifying the claimed benefit. revision: yes

Circularity Check

0 steps flagged

No significant circularity detected

full rationale

The paper is a conceptual proposal introducing multimodal visual skills (static priors, dynamic priors, interleaved forms) and an automatic extraction system SYSTEM, supported by empirical results on GUI tasks. No mathematical derivations, equations, fitted parameters, predictions, or self-citations appear in the provided text that reduce any claim to its inputs by construction. The central position is advanced as an argument plus experiments rather than a closed definitional or self-referential loop, making the derivation chain self-contained.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 2 invented entities

Based solely on the abstract, the paper introduces new conceptual categories for skills and a new extraction system but contains no free parameters, mathematical axioms, or invented physical entities.

invented entities (2)

Multimodal skill paradigm with three reusable forms no independent evidence
purpose: To encode spatial layout, visual grounding, and state changes alongside text for agent skills
New conceptual framework proposed to address the stated text-only bottleneck.
SYSTEM automatic conversion system no independent evidence
purpose: To scale construction of visual skills from agent trajectories
New method introduced for preserving textual reasoning and visual references.

pith-pipeline@v0.9.1-grok · 5789 in / 1176 out tokens · 35181 ms · 2026-06-28T17:09:33.922268+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

53 extracted references · 37 canonical work pages · 16 internal anchors

[1]

DeFT: A conceptual framework for considering learning with multiple representations

Shaaron Ainsworth. DeFT: A conceptual framework for considering learning with multiple representations. Learning and Instruction, 16(3):183–198, 2006. doi: 10.1016/j.learninstruc.2006.03.001

work page doi:10.1016/j.learninstruc.2006.03.001 2006
[2]

Qwen Technical Report

Jinze Bai, Shuai Bai, Yunfei Chu, Zeyu Cui, Kai Dang, Xiaodong Deng, Yang Fan, Wenbin Ge, Yu Han, Fei Huang, et al. Qwen technical report.arXiv preprint arXiv:2309.16609, 2023. doi: 10.48550/arXiv. 2309.16609. URLhttps://arxiv.org/abs/2309.16609

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv 2023
[3]

Qwen3-VL Technical Report

Shuai Bai, Yuxuan Cai, Ruizhe Chen, Keqin Chen, Xionghui Chen, Zesen Cheng, Lianghao Deng, Wei Ding, Chang Gao, Chunjiang Ge, et al. Qwen3-VL technical report, 2025. URL https://arxiv.org/ abs/2511.21631

work page internal anchor Pith review Pith/arXiv arXiv 2025
[4]

PaliGemma: A versatile 3B VLM for transfer

Lucas Beyer, Andreas Steiner, André Susano Pinto, Alexander Kolesnikov, Xiao Wang, Daniel Salz, Maxim Neumann, Ibrahim Alabdulmohsin, Michael Tschannen, Emanuele Bugliarello, et al. PaliGemma: A versatile 3b VLM for transfer.arXiv preprint arXiv:2407.07726, 2024. doi: 10.48550/arXiv.2407.07726. URLhttps://arxiv.org/abs/2407.07726. 11

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2407.07726 2024
[5]

Cognitive load theory and the format of instruction.Cognition and Instruction, 8(4):293–332, 1991

Paul Chandler and John Sweller. Cognitive load theory and the format of instruction.Cognition and Instruction, 8(4):293–332, 1991. doi: 10.1207/s1532690xci0804_2

work page doi:10.1207/s1532690xci0804_2 1991
[6]

SeeClick: Harnessing GUI grounding for advanced visual GUI agents

Kanzhi Cheng, Qiushi Sun, Yougang Chu, Fangzhi Xu, YanTao Li, Jianbing Zhang, and Zhiyong Wu. SeeClick: Harnessing GUI grounding for advanced visual GUI agents. In Lun-Wei Ku, Andre Martins, and Vivek Srikumar, editors,Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 9313–9332, Bangkok,...

work page doi:10.18653/v1/2024.acl-long.505 2024
[7]

Gemini 2.5: Pushing the Frontier with Advanced Reasoning, Multimodality, Long Context, and Next Generation Agentic Capabilities

Gheorghe Comanici, Eric Bieber, Mike Schaekermann, Ice Pasupat, Noveen Sachdeva, Inderjit Dhillon, Marcel Blistein, Ori Ram, Dan Zhang, Evan Rosen, et al. Gemini 2.5: Pushing the frontier with advanced reasoning, multimodality, long context, and next generation agentic capabilities, 2025. URL https: //arxiv.org/abs/2507.06261

work page internal anchor Pith review Pith/arXiv arXiv 2025
[8]

DeepSeek-V3 Technical Report

DeepSeek-AI, Aixin Liu, Bei Feng, Bing Xue, Bingxuan Wang, Bochao Wu, Chengda Lu, Chenggang Zhao, Chengqi Deng, Chenyu Zhang, Chong Ruan, et al. DeepSeek-V3 technical report.arXiv preprint arXiv:2412.19437, 2024. doi: 10.48550/arXiv.2412.19437. URL https://arxiv.org/abs/2412. 19437

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2412.19437 2024
[9]

Mind2Web: Towards a generalist agent for the web

Xiang Deng, Yu Gu, Boyuan Zheng, Shijie Chen, Samuel Stevens, Boshi Wang, Huan Sun, and Yu Su. Mind2Web: Towards a generalist agent for the web. InAdvances in Neural Information Processing Systems, volume 36, pages 28091–28114, 2023. URL https://proceedings.neurips.cc/paper_files/ paper/2023/file/5950bf290a1570ea401bf98882128160-Paper-Datasets_and_Benchmarks. pdf

2023
[10]

Gemini: A Family of Highly Capable Multimodal Models

Gemini Team, Rohan Anil, Sebastian Borgeaud, Jean-Baptiste Alayrac, Jiahui Yu, Radu Soricut, Johan Schalkwyk, Andrew M. Dai, Anja Hauth, Katie Millican, et al. Gemini: A family of highly capable multimodal models.arXiv preprint arXiv:2312.11805, 2023. doi: 10.48550/arXiv.2312.11805. URL https://arxiv.org/abs/2312.11805

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2312.11805 2023
[11]

UI-Venus technical report: Building high-performance UI agents with RFT.arXiv preprint arXiv:2508.10833, 2025

Zhangxuan Gu, Zhengwen Zeng, Zhenyu Xu, Xingran Zhou, Shuheng Shen, Yunfei Liu, Beitong Zhou, Changhua Meng, Tianyu Xia, Weizhi Chen, et al. UI-Venus technical report: Building high-performance UI agents with RFT.arXiv preprint arXiv:2508.10833, 2025. doi: 10.48550/arXiv.2508.10833. URL https://arxiv.org/abs/2508.10833

work page doi:10.48550/arxiv.2508.10833 2025
[12]

Black, and Otmar Hilliges

Tanmay Gupta and Aniruddha Kembhavi. Visual programming: Compositional visual reasoning without training. InProceedings of the IEEE/CVF Conference on Computer Vision and Pat- tern Recognition, pages 14953–14962, June 2023. doi: 10.1109/CVPR52729.2023.01436. URL https://openaccess.thecvf.com/content/CVPR2023/html/Gupta_Visual_Programming_ Compositional_Vis...

work page doi:10.1109/cvpr52729.2023.01436 2023
[13]

SugarCrepe: Fixing hackable benchmarks for vision-language compositionality.Advances in Neural Information Processing Systems, 36:31096–31116, 2023

Cheng-Yu Hsieh, Jieyu Zhang, Zixian Ma, Aniruddha Kembhavi, and Ranjay Krishna. SugarCrepe: Fixing hackable benchmarks for vision-language compositionality.Advances in Neural Information Processing Systems, 36:31096–31116, 2023

2023
[14]

Black, and Otmar Hilliges

Hsiao Yuan Hsu, Xiangteng He, Yuxin Peng, Hao Kong, and Qing Zhang. PosterLay- out: A new benchmark and approach for content-aware visual-textual presentation layout. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recogni- tion, pages 6018–6026, June 2023. doi: 10.1109/CVPR52729.2023.00583. URL https: //openaccess.thecvf.com/conte...

work page doi:10.1109/cvpr52729.2023.00583 2023
[15]

Promptcap: Prompt- guided image captioning for vqa with gpt-3

Yushi Hu, Hang Hua, Zhengyuan Yang, Weijia Shi, Noah A Smith, and Jiebo Luo. Promptcap: Prompt- guided image captioning for vqa with gpt-3. InProceedings of the IEEE/CVF international conference on computer vision, pages 2963–2975, 2023

2023
[16]

FineMatch: Aspect-based fine-grained image and text mismatch detection and correction

Hang Hua, Jing Shi, Kushal Kafle, Simon Jenni, Daoan Zhang, John Collomosse, Scott Cohen, and Jiebo Luo. FineMatch: Aspect-based fine-grained image and text mismatch detection and correction. In European Conference on Computer Vision, pages 474–491. Springer, 2024

2024
[17]

MMCOMPOSITION: Revisiting the compositionality of pre-trained vision-language models.arXiv preprint arXiv:2410.09733, 2024

Hang Hua, Yunlong Tang, Ziyun Zeng, Liangliang Cao, Zhengyuan Yang, Hangfeng He, Chenliang Xu, and Jiebo Luo. MMCOMPOSITION: Revisiting the compositionality of pre-trained vision-language models.arXiv preprint arXiv:2410.09733, 2024. doi: 10.48550/arXiv.2410.09733. URL https://arxiv. org/abs/2410.09733. 12

work page doi:10.48550/arxiv.2410.09733 2024
[18]

V2Xum-LLM: Cross-modal video summarization with temporal prompt instruction tuning

Hang Hua, Yolo Yunlong Tang, Chenliang Xu, and Jiebo Luo. V2Xum-LLM: Cross-modal video summarization with temporal prompt instruction tuning. InProceedings of the AAAI Conference on Artificial Intelligence, volume 39, pages 3599–3607, 2025. doi: 10.1609/aaai.v39i4.32374. URL https://arxiv.org/abs/2404.12353

work page doi:10.1609/aaai.v39i4.32374 2025
[19]

Aliaga, Wei Xiong, and Jiebo Luo

Hang Hua, Ziyun Zeng, Yizhi Song, Yunlong Tang, Liu He, Daniel G. Aliaga, Wei Xiong, and Jiebo Luo. MMIG-Bench: Towards comprehensive and explainable evaluation of multi-modal image generation models.arXiv preprint arXiv:2505.19415, 2025. doi: 10.48550/arXiv.2505.19415. URL https://arxiv. org/abs/2505.19415

work page doi:10.48550/arxiv.2505.19415 2025
[20]

Building a foundational guardrail for general agentic systems via synthetic data.arXiv preprint arXiv:2510.09781, 2025

Yue Huang, Hang Hua, Yujun Zhou, Pengcheng Jing, Manish Nagireddy, Inkit Padhi, Greta Dolcetti, Zhangchen Xu, Subhajit Chaudhury, Ambrish Rawat, et al. Building a foundational guardrail for general agentic systems via synthetic data.arXiv preprint arXiv:2510.09781, 2025

work page arXiv 2025
[21]

LangChain. Agents. https://docs.langchain.com/oss/python/langchain/agents, 2026. URL https://docs.langchain.com/oss/python/langchain/agents. Documentation; accessed 2026- 05-03

2026
[22]

Larkin and Herbert A

Jill H. Larkin and Herbert A. Simon. Why a diagram is (sometimes) worth ten thousand words.Cognitive Science, 11(1):65–100, 1987. doi: 10.1111/j.1551-6708.1987.tb00863.x. URL https://doi.org/10. 1111/j.1551-6708.1987.tb00863.x

work page doi:10.1111/j.1551-6708.1987.tb00863.x 1987
[23]

Laurençon, L

Hugo Laurençon, Léo Tronchon, and Victor Sanh. Unlocking the conversion of web screenshots into HTML code with the WebSight dataset, 2024. URLhttps://arxiv.org/abs/2403.09029

work page arXiv 2024
[24]

JobBench: Aligning Agent Work With Human Will

Yuetai Li, Yichen Feng, Zhangchen Xu, Zixian Ma, Kaiyuan Zheng, Fengqing Jiang, Xinghua Sun, Rulin Shao, Zichen Chen, Yue Huang, et al. Jobbench: Aligning agent work with human will.arXiv preprint arXiv:2605.26329, 2026

work page internal anchor Pith review Pith/arXiv arXiv 2026
[25]

Mayer and Roxana Moreno

Richard E. Mayer and Roxana Moreno. Nine ways to reduce cognitive load in multimedia learning. Educational Psychologist, 38(1):43–52, 2003. doi: 10.1207/S15326985EP3801_6

work page doi:10.1207/s15326985ep3801_6 2003
[26]

GPT-4 Technical Report

OpenAI, Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. GPT-4 technical report.arXiv preprint arXiv:2303.08774, 2023. doi: 10.48550/arXiv.2303.08774. URL https://arxiv. org/abs/2303.08774

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2303.08774 2023
[27]

Teaching CLIP to count to ten.arXiv preprint arXiv:2302.12066, 2023

Roni Paiss, Ariel Ephrat, Omer Tov, Shiran Zada, Inbar Mosseri, Michal Irani, and Tali Dekel. Teaching CLIP to count to ten.arXiv preprint arXiv:2302.12066, 2023. doi: 10.48550/arXiv.2302.12066. URL https://arxiv.org/abs/2302.12066

work page doi:10.48550/arxiv.2302.12066 2023
[28]

Number 9 in Oxford Psychology Series

Allan Paivio.Mental Representations: A Dual Coding Approach. Number 9 in Oxford Psychology Series. Oxford University Press, New York, 1986. ISBN 0-19-503936-X. URL https://academic.oup.com/ book/10932

1986
[29]

Toolformer: Language models can teach themselves to use tools

Timo Schick, Jane Dwivedi-Yu, Roberto Dessì, Roberta Raileanu, Maria Lomeli, Eric Hambro, Luke Zettlemoyer, Nicola Cancedda, and Thomas Scialom. Toolformer: Language models can teach themselves to use tools. InAdvances in Neural Information Processing Systems, volume 36, pages 68539–68551, 2023. URL https://proceedings.neurips.cc/paper_files/paper/2023/ha...

2023
[30]

Claude E. Shannon. Coding theorems for a discrete source with a fidelity criterion. InIRE National Convention Record, Part 4, pages 142–163, 1959

1959
[31]

Design2Code: Benchmarking multimodal code generation for automated front-end engineering

Chenglei Si, Yanzhe Zhang, Ryan Li, Zhengyuan Yang, Ruibo Liu, and Diyi Yang. Design2Code: Benchmarking multimodal code generation for automated front-end engineering. InProceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), pages 3956...

work page doi:10.18653/v1/2025.naacl-long.199 2025
[32]

Latent chain-of-thought for visual reasoning.Advances in neural information processing systems, 38:103739–103762, 2026

Guohao Sun, Hang Hua, Jian Wang, Jiebo Luo, Sohail Dianat, Majid Rabbani, Raghuveer Rao, and Zhiqiang Tao. Latent chain-of-thought for visual reasoning.Advances in neural information processing systems, 38:103739–103762, 2026

2026
[33]

Cognitive load during problem solving: Effects on learning.Cognitive Science, 12(2): 257–285, 1988

John Sweller. Cognitive load during problem solving: Effects on learning.Cognitive Science, 12(2): 257–285, 1988. doi: 10.1207/s15516709cog1202_4. 13

work page doi:10.1207/s15516709cog1202_4 1988
[34]

John Sweller, Jeroen J. G. van Merrienboer, and Fred G. W. C. Paas. Cognitive architecture and instructional design.Educational Psychology Review, 10(3):251–296, 1998. doi: 10.1023/A:1022193728205

work page doi:10.1023/a:1022193728205 1998
[35]

Winoground: Probing vision and language models for visio-linguistic compositionality

Tristan Thrush, Ryan Jiang, Max Bartolo, Amanpreet Singh, Adina Williams, Douwe Kiela, and Candace Ross. Winoground: Probing vision and language models for visio-linguistic compositionality. InPro- ceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 5238–5248, 2022

2022
[36]

The information bottleneck method

Naftali Tishby, Fernando C. Pereira, and William Bialek. The information bottleneck method. In Proceedings of the 37th Annual Allerton Conference on Communication, Control, and Computing, pages 368–377, Monticello, IL, USA, 1999. URLhttps://arxiv.org/abs/physics/0004057

work page internal anchor Pith review Pith/arXiv arXiv 1999
[37]

Cognitive fit: A theory-based analysis of the graphs versus tables literature.Decision Sciences, 22(2):219–240, 1991

Iris Vessey. Cognitive fit: A theory-based analysis of the graphs versus tables literature.Decision Sciences, 22(2):219–240, 1991. doi: 10.1111/j.1540-5915.1991.tb00344.x. URL https://doi.org/10.1111/j. 1540-5915.1991.tb00344.x

work page doi:10.1111/j.1540-5915.1991.tb00344.x 1991
[38]

White, Doug Burger, and Chi Wang

Qingyun Wu, Gagan Bansal, Jieyu Zhang, Yiran Wu, Beibin Li, Erkang Zhu, Li Jiang, Xiaoyun Zhang, Shaokun Zhang, Jiale Liu, Ahmed Hassan Awadallah, Ryen W. White, Doug Burger, and Chi Wang. AutoGen: Enabling next-gen LLM applications via multi-agent conversations. InProceedings of the First Conference on Language Modeling, 2024. URL https://openreview.net/...

2024
[39]

OS-ATLAS: A Foundation Action Model for Generalist GUI Agents

Zhiyong Wu, Zhenyu Wu, Fangzhi Xu, Yian Wang, Qiushi Sun, Chengyou Jia, Kanzhi Cheng, Zichen Ding, Liheng Chen, Paul Pu Liang, and Yu Qiao. OS-ATLAS: A foundation action model for generalist GUI agents.arXiv preprint arXiv:2410.23218, 2024. doi: 10.48550/arXiv.2410.23218. URL https: //arxiv.org/abs/2410.23218

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2410.23218 2024
[40]

ReAct: Synergizing Reasoning and Acting in Language Models

Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik Narasimhan, and Yuan Cao. Re- Act: Synergizing reasoning and acting in language models. InInternational Conference on Learning Representations, 2023. doi: 10.48550/arXiv.2210.03629. URL https://arxiv.org/abs/2210.03629

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2210.03629 2023
[41]

Mobile-Agent-v3: Fundamental Agents for GUI Automation

Jiabo Ye, Xi Zhang, Haiyang Xu, Haowei Liu, Junyang Wang, Zhaoqing Zhu, Ziwei Zheng, Feiyu Gao, Junjie Cao, Zhengxi Lu, et al. Mobile-agent-v3: Fundamental agents for GUI automation.arXiv preprint arXiv:2508.15144, 2025. doi: 10.48550/arXiv.2508.15144. URL https://arxiv.org/abs/2508. 15144

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2508.15144 2025
[42]

Aurora: Unified Video Editing with a Tool-Using Agent

Yongsheng Yu, Ziyun Zeng, Zhiyuan Xiao, Zhenghong Zhou, Hang Hua, Wei Xiong, and Jiebo Luo. Aurora: Unified video editing with a tool-using agent.arXiv preprint arXiv:2605.18748, 2026

work page internal anchor Pith review Pith/arXiv arXiv 2026
[43]

MIRA: Multimodal iterative reasoning agent for image editing

Ziyun Zeng, Hang Hua, and Jiebo Luo. MIRA: Multimodal iterative reasoning agent for image editing. arXiv preprint arXiv:2511.21087, 2025. doi: 10.48550/arXiv.2511.21087. URL https://arxiv.org/ abs/2511.21087

work page doi:10.48550/arxiv.2511.21087 2025
[44]

MementoGUI: Learning Agentic Multimodal Memory Control for Long-Horizon GUI Agents

Ziyun Zeng, Hang Hua, Bocheng Zou, Mu Cai, Rogerio Feris, and Jiebo Luo. Mementogui: Learning agentic multimodal memory control for long-horizon gui agents.arXiv preprint arXiv:2605.18652, 2026

work page internal anchor Pith review Pith/arXiv arXiv 2026
[45]

AppAgent: Multimodal Agents as Smartphone Users

Chi Zhang, Zhao Yang, Jiaxuan Liu, Yucheng Han, Xin Chen, Zebiao Huang, Bin Fu, and Gang Yu. Ap- pAgent: Multimodal agents as smartphone users, 2023. URLhttps://arxiv.org/abs/2312.13771

work page internal anchor Pith review Pith/arXiv arXiv 2023
[46]

AgentStudio: A toolkit for building general virtual agents, 2024

Longtao Zheng, Zhiyuan Huang, Zhenghai Xue, Xinrun Wang, Bo An, and Shuicheng Yan. AgentStudio: A toolkit for building general virtual agents, 2024. URL https://arxiv.org/abs/2403.17918. ICLR 2025

work page arXiv 2024
[47]

goal": ...,

Hanzhang Zhou, Xu Zhang, Panrong Tong, Jianan Zhang, Liangyu Chen, Quyu Kong, Chenglin Cai, Chen Liu, Yue Wang, Jingren Zhou, and Steven Hoi. MAI-UI technical report: Real-world centric foundation GUI agents.arXiv preprint arXiv:2512.22047, 2025. doi: 10.48550/arXiv.2512.22047. URL https://arxiv.org/abs/2512.22047

work page doi:10.48550/arxiv.2512.22047 2025
[48]

WebArena: A Realistic Web Environment for Building Autonomous Agents

Shuyan Zhou, Frank F. Xu, Hao Zhu, Xuhui Zhou, Robert Lo, Abishek Sridhar, Xianyi Cheng, Tianyue Ou, Yonatan Bisk, Daniel Fried, Uri Alon, and Graham Neubig. WebArena: A realistic web environment for building autonomous agents. InInternational Conference on Learning Representations, 2024. doi: 10. 48550/arXiv.2307.13854. URL https://proceedings.iclr.cc/pa...

work page internal anchor Pith review Pith/arXiv arXiv 2024
[49]

If nested in a row/card, choose the smallest child action, not the parent con- tainer
[50]

20 Table 4:Dynamic Visual Skill for CountBenchQA.The visual skill converts counting into point- anchored enumeration

Return the hitbox center and a tight bbox_2d in the task screenshot coordi- nate space. 20 Table 4:Dynamic Visual Skill for CountBenchQA.The visual skill converts counting into point- anchored enumeration. ♂¶ap-¶arker-altPoint-Anchored Enumeration for Visual Counting Description Count dense or ambiguous objects by producing one spatial anchor per valid ta...
[51]

Ignore legends, labels, embedded media, reflections, fragments, and sub-parts unless explicitly requested
[52]

Scan exhaustively; low-contrast or partially occluded valid instances still receive anchors
[53]

Dynamic Prior The prior is not a fixed image template

Report points_2d and total_count; the count is the length of the an- chored instance list. Dynamic Prior The prior is not a fixed image template. It is generatedduring inferenceby overlaying numberedgreen anchorson the task image. This makes the model’s counting process visual, auditable, and reusable across object categories. 21 1 Text-only rules Use whe...

[1] [1]

DeFT: A conceptual framework for considering learning with multiple representations

Shaaron Ainsworth. DeFT: A conceptual framework for considering learning with multiple representations. Learning and Instruction, 16(3):183–198, 2006. doi: 10.1016/j.learninstruc.2006.03.001

work page doi:10.1016/j.learninstruc.2006.03.001 2006

[2] [2]

Qwen Technical Report

Jinze Bai, Shuai Bai, Yunfei Chu, Zeyu Cui, Kai Dang, Xiaodong Deng, Yang Fan, Wenbin Ge, Yu Han, Fei Huang, et al. Qwen technical report.arXiv preprint arXiv:2309.16609, 2023. doi: 10.48550/arXiv. 2309.16609. URLhttps://arxiv.org/abs/2309.16609

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv 2023

[3] [3]

Qwen3-VL Technical Report

Shuai Bai, Yuxuan Cai, Ruizhe Chen, Keqin Chen, Xionghui Chen, Zesen Cheng, Lianghao Deng, Wei Ding, Chang Gao, Chunjiang Ge, et al. Qwen3-VL technical report, 2025. URL https://arxiv.org/ abs/2511.21631

work page internal anchor Pith review Pith/arXiv arXiv 2025

[4] [4]

PaliGemma: A versatile 3B VLM for transfer

Lucas Beyer, Andreas Steiner, André Susano Pinto, Alexander Kolesnikov, Xiao Wang, Daniel Salz, Maxim Neumann, Ibrahim Alabdulmohsin, Michael Tschannen, Emanuele Bugliarello, et al. PaliGemma: A versatile 3b VLM for transfer.arXiv preprint arXiv:2407.07726, 2024. doi: 10.48550/arXiv.2407.07726. URLhttps://arxiv.org/abs/2407.07726. 11

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2407.07726 2024

[5] [5]

Cognitive load theory and the format of instruction.Cognition and Instruction, 8(4):293–332, 1991

Paul Chandler and John Sweller. Cognitive load theory and the format of instruction.Cognition and Instruction, 8(4):293–332, 1991. doi: 10.1207/s1532690xci0804_2

work page doi:10.1207/s1532690xci0804_2 1991

[6] [6]

SeeClick: Harnessing GUI grounding for advanced visual GUI agents

Kanzhi Cheng, Qiushi Sun, Yougang Chu, Fangzhi Xu, YanTao Li, Jianbing Zhang, and Zhiyong Wu. SeeClick: Harnessing GUI grounding for advanced visual GUI agents. In Lun-Wei Ku, Andre Martins, and Vivek Srikumar, editors,Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 9313–9332, Bangkok,...

work page doi:10.18653/v1/2024.acl-long.505 2024

[7] [7]

Gemini 2.5: Pushing the Frontier with Advanced Reasoning, Multimodality, Long Context, and Next Generation Agentic Capabilities

Gheorghe Comanici, Eric Bieber, Mike Schaekermann, Ice Pasupat, Noveen Sachdeva, Inderjit Dhillon, Marcel Blistein, Ori Ram, Dan Zhang, Evan Rosen, et al. Gemini 2.5: Pushing the frontier with advanced reasoning, multimodality, long context, and next generation agentic capabilities, 2025. URL https: //arxiv.org/abs/2507.06261

work page internal anchor Pith review Pith/arXiv arXiv 2025

[8] [8]

DeepSeek-V3 Technical Report

DeepSeek-AI, Aixin Liu, Bei Feng, Bing Xue, Bingxuan Wang, Bochao Wu, Chengda Lu, Chenggang Zhao, Chengqi Deng, Chenyu Zhang, Chong Ruan, et al. DeepSeek-V3 technical report.arXiv preprint arXiv:2412.19437, 2024. doi: 10.48550/arXiv.2412.19437. URL https://arxiv.org/abs/2412. 19437

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2412.19437 2024

[9] [9]

Mind2Web: Towards a generalist agent for the web

Xiang Deng, Yu Gu, Boyuan Zheng, Shijie Chen, Samuel Stevens, Boshi Wang, Huan Sun, and Yu Su. Mind2Web: Towards a generalist agent for the web. InAdvances in Neural Information Processing Systems, volume 36, pages 28091–28114, 2023. URL https://proceedings.neurips.cc/paper_files/ paper/2023/file/5950bf290a1570ea401bf98882128160-Paper-Datasets_and_Benchmarks. pdf

2023

[10] [10]

Gemini: A Family of Highly Capable Multimodal Models

Gemini Team, Rohan Anil, Sebastian Borgeaud, Jean-Baptiste Alayrac, Jiahui Yu, Radu Soricut, Johan Schalkwyk, Andrew M. Dai, Anja Hauth, Katie Millican, et al. Gemini: A family of highly capable multimodal models.arXiv preprint arXiv:2312.11805, 2023. doi: 10.48550/arXiv.2312.11805. URL https://arxiv.org/abs/2312.11805

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2312.11805 2023

[11] [11]

UI-Venus technical report: Building high-performance UI agents with RFT.arXiv preprint arXiv:2508.10833, 2025

Zhangxuan Gu, Zhengwen Zeng, Zhenyu Xu, Xingran Zhou, Shuheng Shen, Yunfei Liu, Beitong Zhou, Changhua Meng, Tianyu Xia, Weizhi Chen, et al. UI-Venus technical report: Building high-performance UI agents with RFT.arXiv preprint arXiv:2508.10833, 2025. doi: 10.48550/arXiv.2508.10833. URL https://arxiv.org/abs/2508.10833

work page doi:10.48550/arxiv.2508.10833 2025

[12] [12]

Black, and Otmar Hilliges

Tanmay Gupta and Aniruddha Kembhavi. Visual programming: Compositional visual reasoning without training. InProceedings of the IEEE/CVF Conference on Computer Vision and Pat- tern Recognition, pages 14953–14962, June 2023. doi: 10.1109/CVPR52729.2023.01436. URL https://openaccess.thecvf.com/content/CVPR2023/html/Gupta_Visual_Programming_ Compositional_Vis...

work page doi:10.1109/cvpr52729.2023.01436 2023

[13] [13]

SugarCrepe: Fixing hackable benchmarks for vision-language compositionality.Advances in Neural Information Processing Systems, 36:31096–31116, 2023

Cheng-Yu Hsieh, Jieyu Zhang, Zixian Ma, Aniruddha Kembhavi, and Ranjay Krishna. SugarCrepe: Fixing hackable benchmarks for vision-language compositionality.Advances in Neural Information Processing Systems, 36:31096–31116, 2023

2023

[14] [14]

Black, and Otmar Hilliges

Hsiao Yuan Hsu, Xiangteng He, Yuxin Peng, Hao Kong, and Qing Zhang. PosterLay- out: A new benchmark and approach for content-aware visual-textual presentation layout. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recogni- tion, pages 6018–6026, June 2023. doi: 10.1109/CVPR52729.2023.00583. URL https: //openaccess.thecvf.com/conte...

work page doi:10.1109/cvpr52729.2023.00583 2023

[15] [15]

Promptcap: Prompt- guided image captioning for vqa with gpt-3

Yushi Hu, Hang Hua, Zhengyuan Yang, Weijia Shi, Noah A Smith, and Jiebo Luo. Promptcap: Prompt- guided image captioning for vqa with gpt-3. InProceedings of the IEEE/CVF international conference on computer vision, pages 2963–2975, 2023

2023

[16] [16]

FineMatch: Aspect-based fine-grained image and text mismatch detection and correction

Hang Hua, Jing Shi, Kushal Kafle, Simon Jenni, Daoan Zhang, John Collomosse, Scott Cohen, and Jiebo Luo. FineMatch: Aspect-based fine-grained image and text mismatch detection and correction. In European Conference on Computer Vision, pages 474–491. Springer, 2024

2024

[17] [17]

MMCOMPOSITION: Revisiting the compositionality of pre-trained vision-language models.arXiv preprint arXiv:2410.09733, 2024

Hang Hua, Yunlong Tang, Ziyun Zeng, Liangliang Cao, Zhengyuan Yang, Hangfeng He, Chenliang Xu, and Jiebo Luo. MMCOMPOSITION: Revisiting the compositionality of pre-trained vision-language models.arXiv preprint arXiv:2410.09733, 2024. doi: 10.48550/arXiv.2410.09733. URL https://arxiv. org/abs/2410.09733. 12

work page doi:10.48550/arxiv.2410.09733 2024

[18] [18]

V2Xum-LLM: Cross-modal video summarization with temporal prompt instruction tuning

Hang Hua, Yolo Yunlong Tang, Chenliang Xu, and Jiebo Luo. V2Xum-LLM: Cross-modal video summarization with temporal prompt instruction tuning. InProceedings of the AAAI Conference on Artificial Intelligence, volume 39, pages 3599–3607, 2025. doi: 10.1609/aaai.v39i4.32374. URL https://arxiv.org/abs/2404.12353

work page doi:10.1609/aaai.v39i4.32374 2025

[19] [19]

Aliaga, Wei Xiong, and Jiebo Luo

Hang Hua, Ziyun Zeng, Yizhi Song, Yunlong Tang, Liu He, Daniel G. Aliaga, Wei Xiong, and Jiebo Luo. MMIG-Bench: Towards comprehensive and explainable evaluation of multi-modal image generation models.arXiv preprint arXiv:2505.19415, 2025. doi: 10.48550/arXiv.2505.19415. URL https://arxiv. org/abs/2505.19415

work page doi:10.48550/arxiv.2505.19415 2025

[20] [20]

Building a foundational guardrail for general agentic systems via synthetic data.arXiv preprint arXiv:2510.09781, 2025

Yue Huang, Hang Hua, Yujun Zhou, Pengcheng Jing, Manish Nagireddy, Inkit Padhi, Greta Dolcetti, Zhangchen Xu, Subhajit Chaudhury, Ambrish Rawat, et al. Building a foundational guardrail for general agentic systems via synthetic data.arXiv preprint arXiv:2510.09781, 2025

work page arXiv 2025

[21] [21]

LangChain. Agents. https://docs.langchain.com/oss/python/langchain/agents, 2026. URL https://docs.langchain.com/oss/python/langchain/agents. Documentation; accessed 2026- 05-03

2026

[22] [22]

Larkin and Herbert A

Jill H. Larkin and Herbert A. Simon. Why a diagram is (sometimes) worth ten thousand words.Cognitive Science, 11(1):65–100, 1987. doi: 10.1111/j.1551-6708.1987.tb00863.x. URL https://doi.org/10. 1111/j.1551-6708.1987.tb00863.x

work page doi:10.1111/j.1551-6708.1987.tb00863.x 1987

[23] [23]

Laurençon, L

Hugo Laurençon, Léo Tronchon, and Victor Sanh. Unlocking the conversion of web screenshots into HTML code with the WebSight dataset, 2024. URLhttps://arxiv.org/abs/2403.09029

work page arXiv 2024

[24] [24]

JobBench: Aligning Agent Work With Human Will

Yuetai Li, Yichen Feng, Zhangchen Xu, Zixian Ma, Kaiyuan Zheng, Fengqing Jiang, Xinghua Sun, Rulin Shao, Zichen Chen, Yue Huang, et al. Jobbench: Aligning agent work with human will.arXiv preprint arXiv:2605.26329, 2026

work page internal anchor Pith review Pith/arXiv arXiv 2026

[25] [25]

Mayer and Roxana Moreno

Richard E. Mayer and Roxana Moreno. Nine ways to reduce cognitive load in multimedia learning. Educational Psychologist, 38(1):43–52, 2003. doi: 10.1207/S15326985EP3801_6

work page doi:10.1207/s15326985ep3801_6 2003

[26] [26]

GPT-4 Technical Report

OpenAI, Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. GPT-4 technical report.arXiv preprint arXiv:2303.08774, 2023. doi: 10.48550/arXiv.2303.08774. URL https://arxiv. org/abs/2303.08774

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2303.08774 2023

[27] [27]

Teaching CLIP to count to ten.arXiv preprint arXiv:2302.12066, 2023

Roni Paiss, Ariel Ephrat, Omer Tov, Shiran Zada, Inbar Mosseri, Michal Irani, and Tali Dekel. Teaching CLIP to count to ten.arXiv preprint arXiv:2302.12066, 2023. doi: 10.48550/arXiv.2302.12066. URL https://arxiv.org/abs/2302.12066

work page doi:10.48550/arxiv.2302.12066 2023

[28] [28]

Number 9 in Oxford Psychology Series

Allan Paivio.Mental Representations: A Dual Coding Approach. Number 9 in Oxford Psychology Series. Oxford University Press, New York, 1986. ISBN 0-19-503936-X. URL https://academic.oup.com/ book/10932

1986

[29] [29]

Toolformer: Language models can teach themselves to use tools

Timo Schick, Jane Dwivedi-Yu, Roberto Dessì, Roberta Raileanu, Maria Lomeli, Eric Hambro, Luke Zettlemoyer, Nicola Cancedda, and Thomas Scialom. Toolformer: Language models can teach themselves to use tools. InAdvances in Neural Information Processing Systems, volume 36, pages 68539–68551, 2023. URL https://proceedings.neurips.cc/paper_files/paper/2023/ha...

2023

[30] [30]

Claude E. Shannon. Coding theorems for a discrete source with a fidelity criterion. InIRE National Convention Record, Part 4, pages 142–163, 1959

1959

[31] [31]

Design2Code: Benchmarking multimodal code generation for automated front-end engineering

Chenglei Si, Yanzhe Zhang, Ryan Li, Zhengyuan Yang, Ruibo Liu, and Diyi Yang. Design2Code: Benchmarking multimodal code generation for automated front-end engineering. InProceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), pages 3956...

work page doi:10.18653/v1/2025.naacl-long.199 2025

[32] [32]

Latent chain-of-thought for visual reasoning.Advances in neural information processing systems, 38:103739–103762, 2026

Guohao Sun, Hang Hua, Jian Wang, Jiebo Luo, Sohail Dianat, Majid Rabbani, Raghuveer Rao, and Zhiqiang Tao. Latent chain-of-thought for visual reasoning.Advances in neural information processing systems, 38:103739–103762, 2026

2026

[33] [33]

Cognitive load during problem solving: Effects on learning.Cognitive Science, 12(2): 257–285, 1988

John Sweller. Cognitive load during problem solving: Effects on learning.Cognitive Science, 12(2): 257–285, 1988. doi: 10.1207/s15516709cog1202_4. 13

work page doi:10.1207/s15516709cog1202_4 1988

[34] [34]

John Sweller, Jeroen J. G. van Merrienboer, and Fred G. W. C. Paas. Cognitive architecture and instructional design.Educational Psychology Review, 10(3):251–296, 1998. doi: 10.1023/A:1022193728205

work page doi:10.1023/a:1022193728205 1998

[35] [35]

Winoground: Probing vision and language models for visio-linguistic compositionality

Tristan Thrush, Ryan Jiang, Max Bartolo, Amanpreet Singh, Adina Williams, Douwe Kiela, and Candace Ross. Winoground: Probing vision and language models for visio-linguistic compositionality. InPro- ceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 5238–5248, 2022

2022

[36] [36]

The information bottleneck method

Naftali Tishby, Fernando C. Pereira, and William Bialek. The information bottleneck method. In Proceedings of the 37th Annual Allerton Conference on Communication, Control, and Computing, pages 368–377, Monticello, IL, USA, 1999. URLhttps://arxiv.org/abs/physics/0004057

work page internal anchor Pith review Pith/arXiv arXiv 1999

[37] [37]

Cognitive fit: A theory-based analysis of the graphs versus tables literature.Decision Sciences, 22(2):219–240, 1991

Iris Vessey. Cognitive fit: A theory-based analysis of the graphs versus tables literature.Decision Sciences, 22(2):219–240, 1991. doi: 10.1111/j.1540-5915.1991.tb00344.x. URL https://doi.org/10.1111/j. 1540-5915.1991.tb00344.x

work page doi:10.1111/j.1540-5915.1991.tb00344.x 1991

[38] [38]

White, Doug Burger, and Chi Wang

Qingyun Wu, Gagan Bansal, Jieyu Zhang, Yiran Wu, Beibin Li, Erkang Zhu, Li Jiang, Xiaoyun Zhang, Shaokun Zhang, Jiale Liu, Ahmed Hassan Awadallah, Ryen W. White, Doug Burger, and Chi Wang. AutoGen: Enabling next-gen LLM applications via multi-agent conversations. InProceedings of the First Conference on Language Modeling, 2024. URL https://openreview.net/...

2024

[39] [39]

OS-ATLAS: A Foundation Action Model for Generalist GUI Agents

Zhiyong Wu, Zhenyu Wu, Fangzhi Xu, Yian Wang, Qiushi Sun, Chengyou Jia, Kanzhi Cheng, Zichen Ding, Liheng Chen, Paul Pu Liang, and Yu Qiao. OS-ATLAS: A foundation action model for generalist GUI agents.arXiv preprint arXiv:2410.23218, 2024. doi: 10.48550/arXiv.2410.23218. URL https: //arxiv.org/abs/2410.23218

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2410.23218 2024

[40] [40]

ReAct: Synergizing Reasoning and Acting in Language Models

Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik Narasimhan, and Yuan Cao. Re- Act: Synergizing reasoning and acting in language models. InInternational Conference on Learning Representations, 2023. doi: 10.48550/arXiv.2210.03629. URL https://arxiv.org/abs/2210.03629

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2210.03629 2023

[41] [41]

Mobile-Agent-v3: Fundamental Agents for GUI Automation

Jiabo Ye, Xi Zhang, Haiyang Xu, Haowei Liu, Junyang Wang, Zhaoqing Zhu, Ziwei Zheng, Feiyu Gao, Junjie Cao, Zhengxi Lu, et al. Mobile-agent-v3: Fundamental agents for GUI automation.arXiv preprint arXiv:2508.15144, 2025. doi: 10.48550/arXiv.2508.15144. URL https://arxiv.org/abs/2508. 15144

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2508.15144 2025

[42] [42]

Aurora: Unified Video Editing with a Tool-Using Agent

Yongsheng Yu, Ziyun Zeng, Zhiyuan Xiao, Zhenghong Zhou, Hang Hua, Wei Xiong, and Jiebo Luo. Aurora: Unified video editing with a tool-using agent.arXiv preprint arXiv:2605.18748, 2026

work page internal anchor Pith review Pith/arXiv arXiv 2026

[43] [43]

MIRA: Multimodal iterative reasoning agent for image editing

Ziyun Zeng, Hang Hua, and Jiebo Luo. MIRA: Multimodal iterative reasoning agent for image editing. arXiv preprint arXiv:2511.21087, 2025. doi: 10.48550/arXiv.2511.21087. URL https://arxiv.org/ abs/2511.21087

work page doi:10.48550/arxiv.2511.21087 2025

[44] [44]

MementoGUI: Learning Agentic Multimodal Memory Control for Long-Horizon GUI Agents

Ziyun Zeng, Hang Hua, Bocheng Zou, Mu Cai, Rogerio Feris, and Jiebo Luo. Mementogui: Learning agentic multimodal memory control for long-horizon gui agents.arXiv preprint arXiv:2605.18652, 2026

work page internal anchor Pith review Pith/arXiv arXiv 2026

[45] [45]

AppAgent: Multimodal Agents as Smartphone Users

Chi Zhang, Zhao Yang, Jiaxuan Liu, Yucheng Han, Xin Chen, Zebiao Huang, Bin Fu, and Gang Yu. Ap- pAgent: Multimodal agents as smartphone users, 2023. URLhttps://arxiv.org/abs/2312.13771

work page internal anchor Pith review Pith/arXiv arXiv 2023

[46] [46]

AgentStudio: A toolkit for building general virtual agents, 2024

Longtao Zheng, Zhiyuan Huang, Zhenghai Xue, Xinrun Wang, Bo An, and Shuicheng Yan. AgentStudio: A toolkit for building general virtual agents, 2024. URL https://arxiv.org/abs/2403.17918. ICLR 2025

work page arXiv 2024

[47] [47]

goal": ...,

Hanzhang Zhou, Xu Zhang, Panrong Tong, Jianan Zhang, Liangyu Chen, Quyu Kong, Chenglin Cai, Chen Liu, Yue Wang, Jingren Zhou, and Steven Hoi. MAI-UI technical report: Real-world centric foundation GUI agents.arXiv preprint arXiv:2512.22047, 2025. doi: 10.48550/arXiv.2512.22047. URL https://arxiv.org/abs/2512.22047

work page doi:10.48550/arxiv.2512.22047 2025

[48] [48]

WebArena: A Realistic Web Environment for Building Autonomous Agents

Shuyan Zhou, Frank F. Xu, Hao Zhu, Xuhui Zhou, Robert Lo, Abishek Sridhar, Xianyi Cheng, Tianyue Ou, Yonatan Bisk, Daniel Fried, Uri Alon, and Graham Neubig. WebArena: A realistic web environment for building autonomous agents. InInternational Conference on Learning Representations, 2024. doi: 10. 48550/arXiv.2307.13854. URL https://proceedings.iclr.cc/pa...

work page internal anchor Pith review Pith/arXiv arXiv 2024

[49] [49]

If nested in a row/card, choose the smallest child action, not the parent con- tainer

[50] [50]

20 Table 4:Dynamic Visual Skill for CountBenchQA.The visual skill converts counting into point- anchored enumeration

Return the hitbox center and a tight bbox_2d in the task screenshot coordi- nate space. 20 Table 4:Dynamic Visual Skill for CountBenchQA.The visual skill converts counting into point- anchored enumeration. ♂¶ap-¶arker-altPoint-Anchored Enumeration for Visual Counting Description Count dense or ambiguous objects by producing one spatial anchor per valid ta...

[51] [51]

Ignore legends, labels, embedded media, reflections, fragments, and sub-parts unless explicitly requested

[52] [52]

Scan exhaustively; low-contrast or partially occluded valid instances still receive anchors

[53] [53]

Dynamic Prior The prior is not a fixed image template

Report points_2d and total_count; the count is the length of the an- chored instance list. Dynamic Prior The prior is not a fixed image template. It is generatedduring inferenceby overlaying numberedgreen anchorson the task image. This makes the model’s counting process visual, auditable, and reusable across object categories. 21 1 Text-only rules Use whe...