
arxiv: 2605.24785 · v2 · pith:RUIYZ3LSnew · submitted 2026-05-24 · 💻 cs.AI

PANDO: Efficient Multimodal AI Agents via Online Skill Distillation

Pith reviewed 2026-06-30 11:59 UTC · model grok-4.3

classification 💻 cs.AI
keywords multimodal web agents · online skill distillation · skill library · VisualWebArena · token efficiency · hierarchical routing · progress reflection · efficiency metrics

The pith

PANDO shows a multimodal web agent can grow more efficient with experience by distilling skills online in a single rollout.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Recent multimodal web agents have grown more capable by spending more inference-time computation on search, verification, and offline discovery. The paper examines whether efficiency can improve instead as agents gain experience in a single rollout. Analysis of VisualWebArena trajectories reveals three main inefficiencies: repeating the same actions, paying hidden costs to discover skills on the fly, and failing to reuse prompt caches. PANDO addresses them by building and using a Skill Library online through progress reflection, demoting weak skills by confidence, routing at multiple levels, compressing images, and prompting with cache awareness. The result is higher task success at substantially lower token cost on the full benchmark suite.

Core claim

PANDO is a single-rollout online skill-distillation framework that maintains a structured Skill Library and combines progress reflection, confidence-based skill demotion, hierarchical routing, visual compression, and cache-aware prompting. This allows the agent to become more efficient as it accumulates experience rather than more expensive, achieving 58.3% success rate on 910 VisualWebArena tasks with 58% fewer tokens than SGV and 61% fewer than WALT, without any pre-evaluation discovery budget. A 300-task ablation shows rules and routines provide most success gains while routing, compression, and cache-aware prompting convert the larger library into lower marginal token cost. The paper als

What carries the argument

The structured Skill Library maintained through progress reflection and confidence-based demotion, supported by hierarchical routing, visual compression, and cache-aware prompting.
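
A minimal sketch of what a confidence-tracked skill library with demotion might look like. This page does not give the paper's data structure, update rule, or threshold, so every name and number here (`Skill`, `SkillLibrary`, the smoothed success ratio, the 0.4 threshold, the minimum of 3 trials) is an illustrative assumption rather than PANDO's implementation.

```python
from dataclasses import dataclass, field


@dataclass
class Skill:
    """Hypothetical skill record; the field names are assumptions."""
    name: str
    successes: int = 0
    failures: int = 0
    active: bool = True

    @property
    def confidence(self) -> float:
        # Laplace-smoothed success ratio, so an untried skill starts at 0.5.
        return (self.successes + 1) / (self.successes + self.failures + 2)


@dataclass
class SkillLibrary:
    demote_below: float = 0.4   # assumed threshold, not from the paper
    min_trials: int = 3         # assumed: avoid demoting on too little evidence
    skills: dict = field(default_factory=dict)

    def add(self, name: str) -> Skill:
        return self.skills.setdefault(name, Skill(name))

    def record(self, name: str, succeeded: bool) -> None:
        """Update counts after a skill is used, then demote it if it is weak."""
        skill = self.skills[name]
        if succeeded:
            skill.successes += 1
        else:
            skill.failures += 1
        trials = skill.successes + skill.failures
        if trials >= self.min_trials and skill.confidence < self.demote_below:
            skill.active = False

    def active_skills(self) -> list:
        return sorted(n for n, s in self.skills.items() if s.active)


library = SkillLibrary()
library.add("search_listing")
library.add("post_comment")
for outcome in (True, True, False):
    library.record("search_listing", outcome)
for outcome in (False, False, False):
    library.record("post_comment", outcome)

print(library.active_skills())
```

In this toy run the routine that fails three times in a row is demoted and drops out of the active set, while the one that mostly succeeds stays available. A real system would also need a rule for re-promoting or deleting demoted skills, which this sketch leaves out.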

If this is right

  • Agents reach higher success without requiring any pre-evaluation discovery budget.
  • Token consumption falls as the skill library grows and is reused within one rollout.
  • Rules and routines account for most of the success improvement; the efficiency techniques mainly reduce marginal costs.
  • Efficiency becomes directly measurable with the three new trajectory metrics rather than only terminal success.
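
The three trajectory metrics are named on this page but not defined, so the sketch below uses plausible working definitions that are assumptions: Action Repetition Rate as the share of steps repeating an earlier action, Step Overhead Ratio as steps taken over a reference step count, and Prompt Cache Utilization as the share of prompt tokens served from cache. The paper's own formulas may differ.

```python
from typing import Sequence


def action_repetition_rate(actions: Sequence[str]) -> float:
    """Assumed definition: share of steps that repeat an earlier action."""
    if not actions:
        return 0.0
    seen = set()
    repeats = 0
    for action in actions:
        if action in seen:
            repeats += 1
        seen.add(action)
    return repeats / len(actions)


def step_overhead_ratio(steps_taken: int, reference_steps: int) -> float:
    """Assumed definition: steps taken divided by a reference step count."""
    if reference_steps <= 0:
        raise ValueError("reference_steps must be positive")
    return steps_taken / reference_steps


def prompt_cache_utilization(cached_tokens: int, prompt_tokens: int) -> float:
    """Assumed definition: share of prompt tokens served from the cache."""
    if prompt_tokens == 0:
        return 0.0
    return cached_tokens / prompt_tokens


# Hypothetical trajectory: "click[12]" is repeated twice after its first use.
trajectory = ["click[12]", "type[5]", "click[12]", "click[12]", "stop"]
arr = action_repetition_rate(trajectory)
sor = step_overhead_ratio(steps_taken=5, reference_steps=4)
pcu = prompt_cache_utilization(cached_tokens=600, prompt_tokens=1000)
print(arr, sor, pcu)
```

All three are computed from the trajectory log alone, which is what lets them expose waste in runs that still end in success.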

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same online distillation pattern could reduce compute waste in other agent domains that suffer repeated actions and low reuse.
  • If the library continues to grow, long-term management of skill relevance may become a new limiting factor.
  • The approach suggests that general inefficiency patterns in LLM agents are addressable without domain-specific engineering.

Load-bearing premise

The three identified sources of inefficiency are the dominant bottlenecks and the proposed components can be combined without creating new offsetting costs in a single rollout.

What would settle it

An evaluation on the same 910 tasks in which PANDO either fails to reduce tokens below the SGV and WALT baselines or loses its success-rate advantage when all listed components are active.
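
The falsification condition above amounts to a dominance check, sketched here. The success rates are the reported ones; the token figures are relative units derived from the "58% fewer than SGV, 61% fewer than WALT" statements, not absolute counts, and the `Result`, `dominates`, and `claim_survives` names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Result:
    name: str
    success_rate: float   # percent
    tokens: float         # relative units (SGV per-task tokens = 1.0)


def dominates(a: Result, b: Result) -> bool:
    """True if `a` has strictly higher success and strictly fewer tokens."""
    return a.success_rate > b.success_rate and a.tokens < b.tokens


def claim_survives(candidate: Result, baselines: list) -> bool:
    """The claim holds only if the candidate dominates every baseline."""
    return all(dominates(candidate, b) for b in baselines)


# PANDO uses 0.42x SGV's tokens and 0.39x WALT's, so WALT = 0.42 / 0.39.
pando = Result("PANDO", 58.3, 0.42)
sgv = Result("SGV", 54.0, 1.0)
walt = Result("WALT (reproduction)", 45.2, 0.42 / 0.39)

reported = claim_survives(pando, [sgv, walt])

# Hypothetical replication in which PANDO loses its success-rate lead.
hypothetical = claim_survives(Result("PANDO", 53.0, 0.42), [sgv, walt])
print(reported, hypothetical)
```

With the reported numbers the check passes; the made-up replication fails because losing either the token advantage or the success-rate advantage against any single baseline is enough to break it.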

Figures

Figures reproduced from arXiv: 2605.24785 by Yidi Miao, Yubo Li, Yuntian Shen, Yuxin Liu.

Figure 1: PANDO architecture. The Planner decomposes tasks into subgoals; the Skill Selector …
Figure 2: Efficiency and online learning diagnostics. Left: PANDO is the only evaluated point with both higher SR and fewer tokens than all baselines. Right: the skill library grows, demotes brittle routines, and reduces the rolling average steps from an unstable cold start to about 8.5 steps/task.
[Table fragment spilled into the caption. Columns: Configuration, SR (%), ∆SR (pp), Steps, Tok. (K), ARR (%), Cache (%), Dominant effect. First row: Backbone: SoM-Qwen (M), 38.6, –, 15.2, 223, 3…]
Figure 3 summarizes the multi-metric pattern from the main table; …
Figure 4: Step composition per method under our LLM-call + action accounting. PANDO's lower step count comes from deterministic routine invocations replacing repeated Actor calls and primitive action chains.
[Chart legend and axis spilled from an adjacent figure. Categories: Repeat-action early stop, Max-step exhaust, Wrong final answer, Environment error, Other. Axis: Share of failed tasks (%). Methods: Text-only, SoM (M), WALT, PANDO.]
Figure 5: Failure-mode composition across four methods (VWA-Classifieds, 300 tasks). Repeat-action loops dominate text-only and SoM methods and are cut by roughly 4× under PANDO; grounding errors are backbone-limited.
Figure 6: Prompt-cache utilization on VWA. Cache utilization rises as the skill-library prefix stops churning, complementing the online skill-dynamics panel in …
Figure 7: Cost–success Pareto frontier on VWA. PANDO defines a new Pareto point: no other method in Tab. 8 simultaneously achieves higher SR and lower per-task cost; every baseline lies strictly north-east of PANDO ($0.085). WALT is drawn at both its headline cost ($0.592) and its 910-task-amortized cost ($0.641); both lie strictly north-east of PANDO. Headline numbers derived from Tab. 8. PANDO is 86% cheaper per ta…
Figure 8: Ablation progression from Tab. 9. Skill components account for most of the success-rate lift, while routing, visual compression, and cache-aware prompt layout convert the larger library into a lower-cost execution path. The full system ends with both the largest SR gain and the lowest per-task cost.
Figure 9: Cost compounds with task index. Learning during evaluation produces a monotonically decreasing per-task cost (left) and a sub-linear cumulative spend (right). The gap between PANDO and the fixed-library counterfactual quantifies the dollar value of in-evaluation skill distillation.
Figure 10 decomposes per-task token spend into Planner, Reflector, Actor, and (for WALT) offline tool-discovery tokens. The offline bar is reported at the 910-task-amortized rate; the headline WALT figure reported in its paper corresponds to omitting that bar entirely. Across the full baseline set, PANDO has the lowest total token load (115K per task).
[Bar chart. Axis: Tokens per task (K). Visible labels: Text-only 132K, Caption 166K, …]
Original abstract

Recent advances in multimodal web agents often rely on increased inference-time computation, including rollout search, verifier passes, offline skill discovery, and specialist model stacks. This raises a central question: can a web agent become more efficient as it accumulates experience, rather than more expensive? We first analyze trajectories from VisualWebArena and identify three recurring sources of inefficiency: repeat-action loops, hidden discovery costs, and low prompt-cache reuse. We then introduce PANDO, a single-rollout online skill-distillation framework that maintains a structured Skill Library and combines progress reflection, confidence-based skill demotion, hierarchical routing, visual compression, and cache-aware prompting. On the full set of 910 VisualWebArena tasks, PANDO achieves a 58.3% success rate, outperforming SGV (54.0%) and our WALT reproduction (45.2%), while using 58% fewer tokens than SGV and 61% fewer tokens than WALT, without any pre-evaluation discovery budget. A 300-task ablation further shows that rules and routines provide most of the success gains, while routing, compression, and cache-aware prompting convert the larger skill library into lower marginal token cost. Finally, we introduce three trajectory-level efficiency metrics -- Action Repetition Rate, Step Overhead Ratio, and Prompt Cache Utilization -- to make efficiency visible beyond terminal success.
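
A small sketch of the idea behind cache-aware prompting, under the assumption that the prompt cache matches on a shared leading prefix, as is common for LLM serving APIs. The segment layout, the helper names `build_prompt` and `shared_prefix_segments`, and the example strings are hypothetical; the abstract does not describe PANDO's actual prompt layout.

```python
def build_prompt(system: str, skills: list, task: str, observation: str) -> list:
    """Order segments from most stable to most volatile (an assumed layout)."""
    stable_skills = sorted(skills)   # fixed order keeps the prefix identical
    return [system, *stable_skills, task, observation]


def shared_prefix_segments(previous: list, current: list) -> int:
    """Count leading segments that are identical in both prompts."""
    count = 0
    for old, new in zip(previous, current):
        if old != new:
            break
        count += 1
    return count


system = "You are a web agent."
skills = ["routine: open_listing", "rule: confirm before posting"]

step_1 = build_prompt(system, skills, "Find the cheapest bike", "page: home")
step_2 = build_prompt(system, skills, "Find the cheapest bike", "page: results")
reusable = shared_prefix_segments(step_1, step_2)

# Placing the changing observation first breaks the shared prefix at once.
naive_1 = ["page: home", system, *skills]
naive_2 = ["page: results", system, *skills]
naive_reusable = shared_prefix_segments(naive_1, naive_2)
print(reusable, naive_reusable)
```

In the stable-first layout, everything except the final observation is shared between consecutive steps, whereas the observation-first layout shares nothing. The same reasoning suggests why a skill library that keeps changing would hurt reuse: any edit to an early segment invalidates the prefix after it.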

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper proposes PANDO, a single-rollout online skill-distillation framework for multimodal web agents. It identifies three sources of inefficiency in prior agents (repeat-action loops, hidden discovery costs, low prompt-cache reuse) and introduces a structured Skill Library maintained via progress reflection, confidence-based skill demotion, hierarchical routing, visual compression, and cache-aware prompting. On the full 910-task VisualWebArena benchmark, PANDO reports 58.3% success (vs. SGV 54.0% and WALT reproduction 45.2%), with 58% and 61% token reductions respectively and zero pre-evaluation discovery budget. A 300-task ablation attributes most success gains to rules/routines and efficiency gains to the remaining modules; three new trajectory-level efficiency metrics (Action Repetition Rate, Step Overhead Ratio, Prompt Cache Utilization) are introduced.

Significance. If the reported performance numbers hold under independent replication, the work provides concrete evidence that multimodal web agents can improve both success rate and token efficiency through online experience accumulation rather than increased inference-time search or offline discovery. The introduction of the three trajectory-level efficiency metrics is a clear positive contribution that makes efficiency claims more falsifiable. The zero pre-evaluation budget and single-rollout constraint are also notable strengths relative to prior approaches that rely on specialist stacks or rollout search.

major comments (3)
  1. [Results on 910 VisualWebArena tasks] The central performance claims (58.3% success and 58–61% token reductions on all 910 tasks) rest on the authors' own WALT reproduction (45.2% success); without release of the reproduction code, exact hyper-parameters, or a side-by-side comparison against the original WALT implementation, the token-reduction numbers cannot be independently verified and are load-bearing for the efficiency claim.
  2. [Ablation study] The 300-task ablation attributes success gains primarily to 'rules and routines' and efficiency gains to routing/compression/cache-aware prompting, yet provides no quantitative breakdown (e.g., per-module token or step counts) for the full 910-task PANDO system; this leaves open whether skill-library maintenance and hierarchical routing introduce offsetting overheads that cancel the reported marginal savings when all five components run together.
  3. [Experimental results] No error bars, standard deviations across runs, or statistical significance tests are reported for the 58.3% vs. 54.0% success-rate difference on the full benchmark, making it impossible to assess whether the observed improvement is robust or could be explained by variance in the evaluation.
minor comments (2)
  1. The abstract and results section would benefit from explicit statement of the number of independent runs or random seeds used for the 910-task evaluation.
  2. Notation for the Skill Library structure and the exact demotion threshold is introduced without a dedicated figure or pseudocode, reducing reproducibility.
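
To make major comment 3 concrete, the following back-of-envelope sketch asks how large the 58.3% vs. 54.0% gap is relative to sampling noise on 910 tasks. It treats the two systems as independent samples and uses a normal approximation, both of which are simplifying assumptions: the systems ran on the same tasks, so a paired test (for example McNemar's) on per-task outcomes would be more appropriate, but those outcomes are not available here.

```python
import math


def two_proportion_z(p1: float, p2: float, n1: int, n2: int):
    """Unpooled z statistic and two-sided p-value for p1 - p2."""
    se = math.sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
    z = (p1 - p2) / se
    p_value = math.erfc(abs(z) / math.sqrt(2))   # two-sided normal tail
    return z, p_value


def wald_interval(p: float, n: int, z_crit: float = 1.96):
    """Approximate 95% confidence interval for a single proportion."""
    half = z_crit * math.sqrt(p * (1 - p) / n)
    return p - half, p + half


# Reported success rates on the 910-task benchmark.
z, p_value = two_proportion_z(0.583, 0.540, 910, 910)
low, high = wald_interval(0.583, 910)
print(round(z, 2), round(p_value, 3), round(low, 3), round(high, 3))
```

Under these assumptions the z statistic comes out a little below the conventional 1.96 cutoff, which is consistent with the referee's concern that a single run cannot establish the robustness of the gap. It does not bear on the token reductions, which are far larger in relative terms.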

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the detailed and constructive comments. We address each major comment below with clarifications and indicate the revisions we will make to improve the manuscript.

point-by-point responses
  1. Referee: [Results on 910 VisualWebArena tasks] The central performance claims (58.3% success and 58–61% token reductions on all 910 tasks) rest on the authors' own WALT reproduction (45.2% success); without release of the reproduction code, exact hyper-parameters, or a side-by-side comparison against the original WALT implementation, the token-reduction numbers cannot be independently verified and are load-bearing for the efficiency claim.

    Authors: We agree that independent verification requires releasing the WALT reproduction details. In the revised version we will provide the reproduction code, exact hyperparameters, and a side-by-side comparison table against the original WALT implementation, either in the supplementary material or via a public repository link. revision: yes

  2. Referee: [Ablation study] The 300-task ablation attributes success gains primarily to 'rules and routines' and efficiency gains to routing/compression/cache-aware prompting, yet provides no quantitative breakdown (e.g., per-module token or step counts) for the full 910-task PANDO system; this leaves open whether skill-library maintenance and hierarchical routing introduce offsetting overheads that cancel the reported marginal savings when all five components run together.

    Authors: We concur that a quantitative per-module breakdown on the full 910-task set would strengthen the efficiency claims. We will add an extended ablation table (or appendix) reporting token and step counts for each component when all modules operate together on the complete benchmark, explicitly quantifying any overhead from skill-library maintenance and routing. revision: yes

  3. Referee: [Experimental results] No error bars, standard deviations across runs, or statistical significance tests are reported for the 58.3% vs. 54.0% success-rate difference on the full benchmark, making it impossible to assess whether the observed improvement is robust or could be explained by variance in the evaluation.

    Authors: The 910-task evaluation was executed as a single run due to the high computational cost. We will revise the manuscript to state this limitation explicitly and discuss its implications for assessing robustness. We cannot supply error bars or significance tests without additional independent runs, which we will note as future work. revision: partial

Circularity Check

0 steps flagged

No significant circularity; results are direct empirical measurements on fixed benchmark

full rationale

The paper reports success rates (58.3%) and token reductions on the fixed 910-task VisualWebArena benchmark via direct evaluation of the described agent framework. No equations, fitted parameters, or derivation steps are present that reduce these outcomes to self-defined inputs or predictions by construction. Ablations attribute gains to specific modules but remain experimental measurements. Self-citation to WALT reproduction is present but not load-bearing for the central claims, which rest on external benchmark results rather than internal reduction.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The framework rests on the domain assumption that the three inefficiency sources identified in VisualWebArena trajectories are primary and that the listed components can be combined online without negative interactions; no new physical entities are postulated.

axioms (2)
  • domain assumption VisualWebArena trajectories exhibit repeat-action loops, hidden discovery costs, and low prompt-cache reuse as the main inefficiencies.
    The paper begins its contribution by analyzing trajectories to identify these three sources.
  • ad hoc to paper The listed mechanisms (progress reflection, confidence-based demotion, hierarchical routing, visual compression, cache-aware prompting) can be integrated into a single-rollout online skill library without offsetting costs.
    The framework description assumes the combination yields the reported net gains.

pith-pipeline@v0.9.1-grok · 5775 in / 1566 out tokens · 40426 ms · 2026-06-30T11:59:50.831611+00:00 · methodology

