PANDO: Efficient Multimodal AI Agents via Online Skill Distillation

Yidi Miao; Yubo Li; Yuntian Shen; Yuxin Liu

arxiv: 2605.24785 · v2 · pith:RUIYZ3LSnew · submitted 2026-05-24 · 💻 cs.AI

Pith reviewed 2026-06-30 11:59 UTC · model grok-4.3

classification 💻 cs.AI

keywords multimodal web agents · online skill distillation · skill library · VisualWebArena · token efficiency · hierarchical routing · progress reflection · efficiency metrics

The pith

PANDO shows a multimodal web agent can grow more efficient with experience by distilling skills online in a single rollout.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Recent multimodal web agents have grown more capable by spending more inference-time computation on search, verification, and offline discovery. The paper examines whether efficiency can improve instead as agents gain experience in a single rollout. Analysis of VisualWebArena trajectories reveals three main inefficiencies: repeating the same actions, paying hidden costs to discover skills on the fly, and failing to reuse prompt caches. PANDO addresses them by building and using a Skill Library online through progress reflection, demoting weak skills by confidence, routing at multiple levels, compressing images, and prompting with cache awareness. The result is higher task success at substantially lower token cost on the full benchmark suite.

Core claim

PANDO is a single-rollout online skill-distillation framework that maintains a structured Skill Library and combines progress reflection, confidence-based skill demotion, hierarchical routing, visual compression, and cache-aware prompting. This allows the agent to become more efficient as it accumulates experience rather than more expensive, achieving 58.3% success rate on 910 VisualWebArena tasks with 58% fewer tokens than SGV and 61% fewer than WALT, without any pre-evaluation discovery budget. A 300-task ablation shows rules and routines provide most success gains while routing, compression, and cache-aware prompting convert the larger library into lower marginal token cost. The paper als

What carries the argument

The structured Skill Library maintained through progress reflection and confidence-based demotion, supported by hierarchical routing, visual compression, and cache-aware prompting.
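
To make the demotion idea concrete, here is a minimal sketch of a skill library that tracks per-skill outcomes and demotes skills whose confidence falls too low. Everything in it is an illustrative assumption rather than the paper's implementation: the `Skill` and `SkillLibrary` names, the Laplace-smoothed confidence estimate, the 0.4 threshold, and the three-trial minimum are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Skill:
    """A reusable routine with a running outcome record (hypothetical schema)."""
    name: str
    successes: int = 0
    failures: int = 0
    demoted: bool = False

    @property
    def confidence(self) -> float:
        # Laplace-smoothed success rate: an assumed estimator, not the paper's.
        return (self.successes + 1) / (self.successes + self.failures + 2)


@dataclass
class SkillLibrary:
    demote_below: float = 0.4  # assumed confidence threshold
    min_trials: int = 3        # assumed minimum evidence before demoting
    skills: dict = field(default_factory=dict)

    def add(self, name: str) -> Skill:
        # Return the existing skill or register a new one.
        return self.skills.setdefault(name, Skill(name))

    def record(self, name: str, success: bool) -> None:
        skill = self.add(name)
        if success:
            skill.successes += 1
        else:
            skill.failures += 1
        trials = skill.successes + skill.failures
        # Demote only once enough outcomes have been observed.
        if trials >= self.min_trials and skill.confidence < self.demote_below:
            skill.demoted = True

    def active(self) -> list:
        # Skills still eligible for routing, in a deterministic order.
        return sorted(n for n, s in self.skills.items() if not s.demoted)
```

In this sketch a skill that fails three times in a row drops out of `active()`, while a skill with one success stays available. Demotion is one-way here for brevity; a fuller design might let a demoted skill recover.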

If this is right

Agents reach higher success without requiring any pre-evaluation discovery budget.
Token consumption falls as the skill library grows and is reused within one rollout.
Rules and routines account for most of the success improvement; the efficiency techniques mainly reduce marginal costs.
Efficiency becomes directly measurable with the three new trajectory metrics rather than only terminal success.
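
The three trajectory metrics are named on this page but not defined, so the sketch below uses plausible working definitions that are assumptions, not the paper's formulas: repetition as the share of steps identical to the previous step, overhead as steps taken divided by a reference step count, and cache utilization as cached prompt tokens over total prompt tokens. All function and parameter names are hypothetical.

```python
from typing import Sequence


def action_repetition_rate(actions: Sequence[str]) -> float:
    """Share of steps that repeat the immediately preceding action (assumed definition)."""
    if len(actions) < 2:
        return 0.0
    repeats = sum(1 for prev, cur in zip(actions, actions[1:]) if prev == cur)
    return repeats / (len(actions) - 1)


def step_overhead_ratio(steps_taken: int, reference_steps: int) -> float:
    """Steps used relative to a reference trajectory length (assumed definition)."""
    if reference_steps <= 0:
        raise ValueError("reference_steps must be positive")
    return steps_taken / reference_steps


def prompt_cache_utilization(cached_tokens: int, prompt_tokens: int) -> float:
    """Fraction of prompt tokens served from a prompt cache (assumed definition)."""
    return cached_tokens / prompt_tokens if prompt_tokens else 0.0
```

Under these definitions a trajectory can succeed and still score badly, which is the reason to report such metrics alongside terminal success.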

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The same online distillation pattern could reduce compute waste in other agent domains that suffer repeated actions and low reuse.
If the library continues to grow, long-term management of skill relevance may become a new limiting factor.
The approach suggests that general inefficiency patterns in LLM agents are addressable without domain-specific engineering.

Load-bearing premise

The three identified sources of inefficiency are the dominant bottlenecks and the proposed components can be combined without creating new offsetting costs in a single rollout.

What would settle it

An evaluation on the same 910 tasks in which PANDO either fails to reduce tokens below the SGV and WALT baselines or loses its success-rate advantage when all listed components are active.

Figures

Figures reproduced from arXiv: 2605.24785 by Yidi Miao, Yubo Li, Yuntian Shen, Yuxin Liu.

**Figure 2.** Efficiency and online learning diagnostics. Left: PANDO is the only evaluated point with both higher SR and fewer tokens than all baselines. Right: the skill library grows, demotes brittle routines, and reduces the rolling average steps from an unstable cold start to about 8.5 steps/task.

Table fragment fused into this caption (truncated in the source): columns Configuration, SR (%), ∆SR (pp), Steps, Tok. (K), ARR (%), Cache (%), Dominant effect; first row: Backbone: SoM-Qwen (M), 38.6, –, 15.2, 223, 3…

**Figure 3.** Summarizes the multi-metric pattern from the main table.

**Figure 4.** Step composition per method under our LLM-call + action accounting. PANDO’s lower step count comes from deterministic routine invocations replacing repeated Actor calls and primitive action chains.

**Figure 5.** Failure-mode composition across four methods (VWA-Classifieds, 300 tasks). Repeat-action loops dominate text-only and SoM methods and are cut by roughly 4× under PANDO; grounding errors are backbone-limited. Plot legend recovered from the extraction: Repeat-action early stop, Max-step exhaust, Wrong final answer, Environment error, Other; axis: share of failed tasks (%); methods: Text-only, SoM (M), WALT, PANDO.

**Figure 6.** Prompt-cache utilization on VWA. Cache utilization rises as the skill-library prefix stops churning, complementing the online skill-dynamics panel in …

**Figure 7.** Cost–success Pareto frontier on VWA. PANDO defines a new Pareto point: no other method in Tab. 8 simultaneously achieves higher SR and lower per-task cost; every baseline lies strictly north-east of PANDO ($0.085). WALT is drawn at both its headline cost ($0.592) and its 910-task-amortized cost ($0.641); both lie strictly north-east of PANDO. Headline numbers derived from Tab. 8. PANDO is 86% cheaper per ta…

**Figure 8.** Ablation progression from Tab. 9. Skill components account for most of the success-rate lift, while routing, visual compression, and cache-aware prompt layout convert the larger library into a lower-cost execution path. The full system ends with both the largest SR gain and the lowest per-task cost.

I.3 Learning Curve: Cost Compounds with Task Index. Figures 9a and 9b show the per-task and cumulative cost c…

**Figure 9.** Cost compounds with task index. Learning during evaluation produces a monotonically decreasing per-task cost (left) and a sub-linear cumulative spend (right). The gap between PANDO and the fixed-library counterfactual quantifies the dollar value of in-evaluation skill distillation.

I.4 Token-Level Composition per Method

**Figure 10.** Decomposes per-task token spend into Planner, Reflector, Actor, and (for WALT) offline tool-discovery tokens. The offline bar is reported at the 910-task-amortized rate; the headline WALT figure reported in its paper corresponds to omitting that bar entirely. Across the full baseline set, PANDO has the lowest total token load (115K per task). Bar labels recovered from the extraction (axis: tokens per task, K): Text-only 132K, Caption 166K, 2…
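
The per-task costs quoted in the Figure 7 caption can be turned into relative savings and a simple dominance check. The sketch below is only arithmetic on numbers that appear on this page; pairing the 45.2% WALT reproduction success rate with the two WALT cost figures is an assumption made for illustration, and the function names are hypothetical.

```python
def relative_saving(cost: float, baseline_cost: float) -> float:
    """Fraction by which `cost` undercuts `baseline_cost`."""
    return 1.0 - cost / baseline_cost


def dominates(a: tuple, b: tuple) -> bool:
    """True if a = (success_rate, cost) is no worse than b on both axes and better on one."""
    (sr_a, cost_a), (sr_b, cost_b) = a, b
    no_worse = sr_a >= sr_b and cost_a <= cost_b
    strictly_better = sr_a > sr_b or cost_a < cost_b
    return no_worse and strictly_better


# (success rate in %, dollars per task); the SR/cost pairing for WALT is assumed.
pando = (58.3, 0.085)
walt_headline = (45.2, 0.592)
walt_amortized = (45.2, 0.641)

for label, baseline in [("headline", walt_headline), ("amortized", walt_amortized)]:
    saving = relative_saving(pando[1], baseline[1])
    print(f"PANDO vs WALT ({label}): {saving:.1%} cheaper, dominates={dominates(pando, baseline)}")
```

A check like this only compares two operating points at a time; establishing a Pareto frontier requires running it against every baseline in the cost table.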

Original abstract

Recent advances in multimodal web agents often rely on increased inference-time computation, including rollout search, verifier passes, offline skill discovery, and specialist model stacks. This raises a central question: can a web agent become more efficient as it accumulates experience, rather than more expensive? We first analyze trajectories from VisualWebArena and identify three recurring sources of inefficiency: repeat-action loops, hidden discovery costs, and low prompt-cache reuse. We then introduce PANDO, a single-rollout online skill-distillation framework that maintains a structured Skill Library and combines progress reflection, confidence-based skill demotion, hierarchical routing, visual compression, and cache-aware prompting. On the full set of 910 VisualWebArena tasks, PANDO achieves a 58.3% success rate, outperforming SGV (54.0%) and our WALT reproduction (45.2%), while using 58% fewer tokens than SGV and 61% fewer tokens than WALT, without any pre-evaluation discovery budget. A 300-task ablation further shows that rules and routines provide most of the success gains, while routing, compression, and cache-aware prompting convert the larger skill library into lower marginal token cost. Finally, we introduce three trajectory-level efficiency metrics -- Action Repetition Rate, Step Overhead Ratio, and Prompt Cache Utilization -- to make efficiency visible beyond terminal success.
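
Cache-aware prompting is listed in the abstract without layout details. Since provider prompt caches typically match on an exact prefix, one plausible reading is to keep stable content first and volatile content last. The sketch below illustrates that ordering under this assumption; `build_prompt` and `shared_prefix_chars` are hypothetical helpers, not the paper's code.

```python
import os


def build_prompt(system: str, skills: list, observation: str) -> str:
    # Stable content first: system text, then skills in a deterministic order,
    # so that the same library always serializes to the same characters.
    stable = [system] + sorted(skills)
    # Volatile content last, so earlier text stays identical across steps.
    return "\n".join(stable + [observation])


def shared_prefix_chars(a: str, b: str) -> int:
    """Length of the common leading substring of two prompts."""
    return len(os.path.commonprefix([a, b]))


step1 = build_prompt("SYSTEM", ["open_listing", "add_to_cart"], "page: search results")
step2 = build_prompt("SYSTEM", ["add_to_cart", "open_listing"], "page: item detail")
print(shared_prefix_chars(step1, step2), "of", len(step2), "characters shared")
```

Sorting the skills means insertion order cannot perturb the prefix. Adding or demoting a skill still changes it, which matches the Figure 6 observation that cache utilization rises once the skill-library prefix stops churning.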

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

PANDO reports solid token savings and success gains on VisualWebArena via online skill management, but the evidence is still mostly abstract-level and needs fuller verification.

The main point is that this paper shows an online-only way to cut token use in multimodal web agents while holding or improving success rates. The authors identify repeat loops, discovery costs, and cache misses as the problems, then combine reflection, confidence-based demotion, hierarchical routing, visual compression, and cache-aware prompting into a single-rollout system with a growing skill library. On the full 910 VisualWebArena tasks they get 58.3% success versus 54.0% for SGV and 45.2% for their WALT reproduction, with 58–61% fewer tokens and no separate discovery phase. The 300-task ablation credits most of the success lift to the rules and routines, and the efficiency lift to the other modules. They also define three new trajectory metrics that make the efficiency claims trackable.

That is useful work. The numbers are concrete, the benchmark is fixed, and the approach directly targets deployment cost rather than just accuracy. The new metrics are a clear addition.

The soft spots are the usual ones at this stage. The WALT comparison rests on the authors' own reproduction, no error bars or statistical tests are mentioned, and the ablation does not show that the full five-component stack preserves both success and token savings on the same tasks. The central claim, that the pieces combine without new overhead in a single rollout, rests on the abstract's summary rather than detailed per-module cost breakdowns. The stress-test concern about offsetting costs from library maintenance and routing remains reasonable until the methods section is checked.

This is for groups working on practical agent efficiency rather than pure capability scaling. It has enough empirical grounding and a clear problem statement to deserve referee time, even if the paper will need more implementation detail and independent checks before it lands.

Referee Report

3 major / 2 minor

Summary. The paper proposes PANDO, a single-rollout online skill-distillation framework for multimodal web agents. It identifies three sources of inefficiency in prior agents (repeat-action loops, hidden discovery costs, low prompt-cache reuse) and introduces a structured Skill Library maintained via progress reflection, confidence-based skill demotion, hierarchical routing, visual compression, and cache-aware prompting. On the full 910-task VisualWebArena benchmark, PANDO reports 58.3% success (vs. SGV 54.0% and WALT reproduction 45.2%), with 58% and 61% token reductions respectively and zero pre-evaluation discovery budget. A 300-task ablation attributes most success gains to rules/routines and efficiency gains to the remaining modules; three new trajectory-level efficiency metrics (Action Repetition Rate, Step Overhead Ratio, Prompt Cache Utilization) are introduced.

Significance. If the reported performance numbers hold under independent replication, the work provides concrete evidence that multimodal web agents can improve both success rate and token efficiency through online experience accumulation rather than increased inference-time search or offline discovery. The introduction of the three trajectory-level efficiency metrics is a clear positive contribution that makes efficiency claims more falsifiable. The zero pre-evaluation budget and single-rollout constraint are also notable strengths relative to prior approaches that rely on specialist stacks or rollout search.

major comments (3)

[Results on 910 VisualWebArena tasks] The central performance claims (58.3% success and 58–61% token reductions on all 910 tasks) rest on the authors' own WALT reproduction (45.2% success); without release of the reproduction code, exact hyper-parameters, or a side-by-side comparison against the original WALT implementation, the token-reduction numbers cannot be independently verified and are load-bearing for the efficiency claim.
[Ablation study] The 300-task ablation attributes success gains primarily to 'rules and routines' and efficiency gains to routing/compression/cache-aware prompting, yet provides no quantitative breakdown (e.g., per-module token or step counts) for the full 910-task PANDO system; this leaves open whether skill-library maintenance and hierarchical routing introduce offsetting overheads that cancel the reported marginal savings when all five components run together.
[Experimental results] No error bars, standard deviations across runs, or statistical significance tests are reported for the 58.3% vs. 54.0% success-rate difference on the full benchmark, making it impossible to assess whether the observed improvement is robust or could be explained by variance in the evaluation.
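
To gauge the third major comment, a rough two-proportion z-test can be run on the headline success rates. This is a sketch under stated assumptions: it treats both systems as evaluated on 910 tasks each and as independent samples. Because the systems share the same tasks, a paired test on per-task outcomes (for example McNemar's test) would be more appropriate, but those outcomes are not available here.

```python
import math


def two_proportion_z(p1: float, p2: float, n1: int, n2: int) -> tuple:
    """Pooled two-proportion z statistic and two-sided p-value (normal approximation)."""
    pooled = (p1 * n1 + p2 * n2) / (n1 + n2)
    se = math.sqrt(pooled * (1.0 - pooled) * (1.0 / n1 + 1.0 / n2))
    z = (p1 - p2) / se
    # Two-sided tail probability of a standard normal: erfc(|z| / sqrt(2)).
    p_value = math.erfc(abs(z) / math.sqrt(2.0))
    return z, p_value


# 58.3% (PANDO) vs 54.0% (SGV); n = 910 for both is an assumption.
z, p = two_proportion_z(0.583, 0.540, 910, 910)
print(f"z = {z:.2f}, two-sided p = {p:.3f}")
```

The unpaired approximation ignores per-task correlation between the two systems, so it should be read as a sanity check on the referee's concern rather than a verdict.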

minor comments (2)

The abstract and results section would benefit from explicit statement of the number of independent runs or random seeds used for the 910-task evaluation.
Notation for the Skill Library structure and the exact demotion threshold is introduced without a dedicated figure or pseudocode, reducing reproducibility.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the detailed and constructive comments. We address each major comment below with clarifications and indicate the revisions we will make to improve the manuscript.

Referee: [Results on 910 VisualWebArena tasks] The central performance claims (58.3% success and 58–61% token reductions on all 910 tasks) rest on the authors' own WALT reproduction (45.2% success); without release of the reproduction code, exact hyper-parameters, or a side-by-side comparison against the original WALT implementation, the token-reduction numbers cannot be independently verified and are load-bearing for the efficiency claim.

Authors: We agree that independent verification requires releasing the WALT reproduction details. In the revised version we will provide the reproduction code, exact hyperparameters, and a side-by-side comparison table against the original WALT implementation, either in the supplementary material or via a public repository link.
revision: yes

Referee: [Ablation study] The 300-task ablation attributes success gains primarily to 'rules and routines' and efficiency gains to routing/compression/cache-aware prompting, yet provides no quantitative breakdown (e.g., per-module token or step counts) for the full 910-task PANDO system; this leaves open whether skill-library maintenance and hierarchical routing introduce offsetting overheads that cancel the reported marginal savings when all five components run together.

Authors: We concur that a quantitative per-module breakdown on the full 910-task set would strengthen the efficiency claims. We will add an extended ablation table (or appendix) reporting token and step counts for each component when all modules operate together on the complete benchmark, explicitly quantifying any overhead from skill-library maintenance and routing.
revision: yes

Referee: [Experimental results] No error bars, standard deviations across runs, or statistical significance tests are reported for the 58.3% vs. 54.0% success-rate difference on the full benchmark, making it impossible to assess whether the observed improvement is robust or could be explained by variance in the evaluation.

Authors: The 910-task evaluation was executed as a single run due to the high computational cost. We will revise the manuscript to state this limitation explicitly and discuss its implications for assessing robustness. We cannot supply error bars or significance tests without additional independent runs, which we will note as future work.
revision: partial

Circularity Check

0 steps flagged

No significant circularity; results are direct empirical measurements on fixed benchmark

The paper reports success rates (58.3%) and token reductions on the fixed 910-task VisualWebArena benchmark via direct evaluation of the described agent framework. No equations, fitted parameters, or derivation steps are present that reduce these outcomes to self-defined inputs or predictions by construction. Ablations attribute gains to specific modules but remain experimental measurements. Self-citation to WALT reproduction is present but not load-bearing for the central claims, which rest on external benchmark results rather than internal reduction.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The framework rests on the domain assumption that the three inefficiency sources identified in VisualWebArena trajectories are primary and that the listed components can be combined online without negative interactions; no new physical entities are postulated.

axioms (2)

domain assumption: VisualWebArena trajectories exhibit repeat-action loops, hidden discovery costs, and low prompt-cache reuse as the main inefficiencies.
The paper begins its contribution by analyzing trajectories to identify these three sources.
ad hoc to paper: The listed mechanisms (progress reflection, confidence-based demotion, hierarchical routing, visual compression, cache-aware prompting) can be integrated into a single-rollout online skill library without offsetting costs.
The framework description assumes the combination yields the reported net gains.

pith-pipeline@v0.9.1-grok · 5775 in / 1566 out tokens · 40426 ms · 2026-06-30T11:59:50.831611+00:00 · methodology


[1] [1]

OSWorld-Human: Benchmarking the Efficiency of Computer-Use Agents

Reyna Abhyankar, Qi Qi, and Yiying Zhang. OSWorld-Human: Benchmarking the efficiency of computer-use agents.arXiv preprint arXiv:2506.16042,

work page internal anchor Pith review Pith/arXiv arXiv

[2] [2]

Windows Agent Arena: Evaluating multi-modal OS agents at scale.arXiv preprint arXiv:2409.08264,

Rogerio Bonatti, Dan Zhao, Francesco Bonacci, Dillon Dupont, Sara Abdali, Yinheng Li, Yadong Lu, Justin Wagle, Kazuhito Koishida, Arthur Bucker, Lawrence Jang, and Zack Hui. Windows Agent Arena: Evaluating multi-modal OS agents at scale.arXiv preprint arXiv:2409.08264,

work page arXiv

[3] [3]

The unreasonable effectiveness of scaling agents for computer use.arXiv preprint arXiv:2510.02250,

Gonzalo Gonzalez-Pumariega, Vincent Tu, Chih-Lun Lee, Jiachen Yang, Ang Li, and Xin Eric Wang. The unreasonable effectiveness of scaling agents for computer use.arXiv preprint arXiv:2510.02250,

work page arXiv

[4] [4]

Recon-Act: A self-evolving multi-agent browser-use system via web reconnaissance, tool generation, and task execution.arXiv preprint arXiv:2509.21072,

Kaiwen He et al. Recon-Act: A self-evolving multi-agent browser-use system via web reconnaissance, tool generation, and task execution.arXiv preprint arXiv:2509.21072,

work page arXiv

[5] [5]

VisualWebArena: Evaluating multimodal agents on realistic visual web tasks

Jing Yu Koh, Robert Lo, Lawrence Jang, Vikram Duvvur, Ming Chong Lim, Po-Yu Huang, Graham Neubig, Shuyan Zhou, Ruslan Salakhutdinov, and Daniel Fried. VisualWebArena: Evaluating multimodal agents on realistic visual web tasks. InProceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2024a. Jing Yu ...

work page arXiv

[6] [6]

Power hungry processing: Watts driving the cost of ai deployment? InProceedings of the 2024 ACM Conference on Fairness, Accountability, and Transparency (FAccT),

Alexandra Sasha Luccioni, Yacine Jernite, and Emma Strubell. Power hungry processing: Watts driving the cost of ai deployment? InProceedings of the 2024 ACM Conference on Fairness, Accountability, and Transparency (FAccT),

2024

[7] [7]

GAIA: a benchmark for General AI Assistants

Gregoire Mialon, Clementine Fourrier, Craig Swift, Thomas Wolf, Yann LeCun, and Thomas Scialom. GAIA: A benchmark for general AI assistants.arXiv preprint arXiv:2311.12983,

work page internal anchor Pith review Pith/arXiv arXiv

[8] [8]

s1: Simple test-time scaling

Niklas Muennighoff, Zitong Yang, Weijia Shi, Xiang Lisa Li, Li Fei-Fei, Hannaneh Hajishirzi, Luke Zettlemoyer, Percy Liang, Emmanuel Candes, and Tatsunori Hashimoto. s1: Simple test-time scaling.arXiv preprint arXiv:2501.19393,

work page internal anchor Pith review Pith/arXiv arXiv

[9] [9]

Energy Use of AI Inference, Efficiency Pathways, and Test-Time Scaling

Felipe Oviedo, Fiodar Kazhamiaka, Esha Choukse, Allen Kim, Amy Luers, Melanie Nakagawa, Ricardo Bianchini, and Juan M. Lavista Ferres. Energy use of AI inference: Efficiency pathways and test-time compute.arXiv preprint arXiv:2509.20241,

work page internal anchor Pith review Pith/arXiv arXiv

[10] [10]

UI-TARS: Pioneering Automated GUI Interaction with Native Agents

Yujia Qin et al. UI-TARS: Pioneering automated GUI interaction with native agents.arXiv preprint arXiv:2501.12326,

work page internal anchor Pith review Pith/arXiv arXiv

[11] [11]

arXiv preprint arXiv:2503.21614 (2025)

11 Xiaoye Qu, Yafu Li, Zhao-Chen Su, Weigao Sun, Jianhao Yan, et al. A survey of efficient reasoning for large reasoning models: Language, multimodality, and beyond.arXiv preprint arXiv:2503.21614,

work page arXiv

[12] [12]

Qwen2.5-VL Technical Report

Qwen Team. Qwen2.5-VL technical report.arXiv preprint arXiv:2502.13923,

work page internal anchor Pith review Pith/arXiv arXiv

[13] [13]

LLaV A-PruMerge: Adaptive token reduction for efficient large multimodal models.arXiv preprint arXiv:2403.15388,

Yuzhang Shang, Mu Cai, Bingxin Xu, Yong Jae Lee, and Yan Yan. LLaV A-PruMerge: Adaptive token reduction for efficient large multimodal models.arXiv preprint arXiv:2403.15388,

[14]

2024 United States Data Center Energy Usage Report

Arman Shehabi, Sarah J. Smith, Alex Hubbard, Adam Newkirk, Nuoa Lei, Md Abu Bakar Siddik, Billie Holecek, Jonathan Koomey, Eric Masanet, and Dale Sartor. 2024 United States data center energy usage report. https://eta-publications.lbl.gov/publications/2024-united-states-data-center-energy, 2024.

[15]

Stop Overthinking: A Survey on Efficient Reasoning for Large Language Models

Yang Sui, Yu-Neng Chuang, Guanchu Wang, Jiamu Zhang, Tianyi Zhang, Jiayi Yuan, Hongyi Liu, Andrew Wen, Shaochen Zhong, Hanjie Chen, Hongye Jin, and Xia Hu. Stop overthinking: A survey on efficient reasoning for large language models. arXiv preprint arXiv:2503.16419, 2025.

[16]

Voyager: An Open-Ended Embodied Agent with Large Language Models

Guanzhi Wang, Yuqi Xie, Yunfan Jiang, Ajay Mandlekar, Chaowei Xiao, Yuke Zhu, Linxi Fan, and Anima Anandkumar. VOYAGER: An open-ended embodied agent with large language models. arXiv preprint arXiv:2305.16291, 2023.

[17]

UI-TARS-2 Technical Report: Advancing GUI Agent with Multi-Turn Reinforcement Learning

Haoming Wang et al. UI-TARS-2 technical report: Advancing GUI agent with multi-turn reinforcement learning. arXiv preprint arXiv:2509.02544, 2025a.

Junlin Wang, Jue Wang, Ben Athiwaratkun, Ce Zhang, and James Zou. Mixture-of-agents enhances large language model capabilities. In International Conference on Learning Representations (ICLR), 2025b.

Xiaoqiang ...

[18]

Inducing programmatic skills for agentic tasks

Zora Zhiruo Wang, Apurva Gandhi, Graham Neubig, and Daniel Fried. Inducing programmatic skills for agentic tasks. In Conference on Language Modeling (COLM), 2025c.

Zora Zhiruo Wang, Jiayuan Mao, Daniel Fried, and Graham Neubig. Agent workflow memory. In International Conference on Machine Learning (ICML), 2025d.

Zhiyong Wu, Chengcheng Han, Zichen Ding, ...

2024

[19]

TheAgentCompany: Benchmarking LLM Agents on Consequential Real World Tasks

Frank F. Xu, Yufan Song, Boxuan Li, Yuxuan Tang, Kritanjali Jain, Mengxue Bao, Zora Z. Wang, Xuhui Zhou, Zhitong Guo, Murong Cao, Mingyang Yang, Hao Yang Lu, Amaad Martin, Zhe Su, Leander Melroy Maben, Raj Mehta, Wayne Chi, Lawrence Jang, Yiqing Xie, Shuyan Zhou, and Graham Neubig. TheAgentCompany: Benchmarking LLM agents on consequential real world tasks...

[20]

Chain of Draft: Thinking Faster by Writing Less

Silei Xu, Wenhao Xie, Lingxiao Zhao, and Pengcheng He. Chain of draft: Thinking faster by writing less. arXiv preprint arXiv:2502.18600, 2025a.

Yiheng Xu, Zekun Wang, Junli Wang, Dunjie Lu, Tianbao Xie, Amrita Saha, Doyen Sahoo, Tao Yu, and Caiming Xiong. Aguvis: Unified pure vision agents for autonomous GUI interaction. In International Conference on Mach...

[21]

SkillWeaver: Web Agents can Self-Improve by Discovering and Honing Skills

Boyuan Zheng, Michael Y. Fatemi, Xiaolong Jin, Zora Zhiruo Wang, Apurva Gandhi, Yueqi Song, Yu Gu, Jayanth Srinivasa, Gaowen Liu, Graham Neubig, and Yu Su. SkillWeaver: Web agents can self-improve by discovering and honing skills. arXiv preprint arXiv:2504.07079, 2025.

[22]

I spread, I extend, I unfold

C A Note on the Name

Pando, the first-person singular present indicative of the Latin verb pandere, means “I spread, I extend, I unfold.” It is also the name of a grove: a single clonal colony of quaking aspen (Populus tremuloides) in Fishlake National Forest, Utah, whose roughly 47,000 visible trunks share one genome and one root system. The colony’s age ...

2026

[23]

Amortized

relative to the single-rollout, single-model, no-pre-evaluation baseline π0.
‡ SGV’s ρ ≈ 2.2 comes from its two-pass self-grounded verifier (Eq. 4); it is not a dollar ratio against π0 but against its own no-verifier single-rollout form.
⋆ WALT’s at-eval ρ = 1 is preserved but Cpre is unreported in the original paper (Eq. 5); the “Amortized” column bounds the ...

2026

[24]

58.3 +19.9 SGV ∼45.0 (Gemini-Flash, single-pass; Andrade et al., 2026, Tab

2026

[25]

PANDO achieves the highest reproduced SR

(single-pass Gemini-Flash ≈45%, PANDO-on-Gemini 50.3%), confirming that the lift is mechanism-driven rather than Opus-specific. The remaining 4.4 pp gap to Opus-backboned PANDO matches the Opus-vs-Gemini capability gap on multimodal web tasks reported in concurrent benchmarks. We will scale both runs to full VWA-910 for the camera-ready and report the ...

[26]

discard vs. compress

Row ordering follows the narrative order of the subsections.

Method | Benchmark(s) | Grounding | Cost axis | Headline SR
WebVoyager [He et al., 2024] | 643 live-web tasks | Screenshot+SoM | single-rollout VLM | 59.1
SeeAct [Zheng et al., 2024] | Mind2Web-Live | HTML+SoM hybrid | single-rollout VLM | 51.1 (oracle)
OS-Copilot/FRIDAY [Wu et al., 2024] | GAIA L1 | Text+tools code+APIs ...

2024