Introduces the VG-GUIBench benchmark and TASKER, a keyframe extraction algorithm that improves performance on VideoQA and video-guided agentic tasks.
LearnAct: Few-shot mobile GUI agent with a unified demonstration benchmark, 2025
9 Pith papers cite this work.
citation-role summary
roles: background (3)
representative citing papers
OS-SPEAR is a new evaluation toolkit that tests 22 OS agents and identifies trade-offs between efficiency and safety or robustness.
RiskWebWorld is the first realistic interactive benchmark for GUI agents in e-commerce risk management, revealing a large gap between generalist and specialized models as well as gains from RL.
MAS-Bench introduces 139 tasks, 88 predefined shortcuts, and 9 metrics to evaluate hybrid GUI-shortcut mobile agents, reporting up to 68.3% success and 39% efficiency gains over GUI-only baselines.
MetaPS trains models via simulation rollouts to select from programmatic strategy libraries for market agents, yielding better performance than fixed or direct LLM baselines across model sizes.
Skill-SD turns an agent's completed trajectories into dynamic natural-language skills that condition only the teacher in self-distillation, yielding 14-42% gains over RL and OPSD baselines on multi-turn agent benchmarks.
Hierarchically grouped demonstrations raise pass rates from 76.7% to 90.7% on 43 vague-description tasks, while flat logs show smaller, non-significant gains.
SE-GA combines Test-Time Memory Extension for dynamic context retrieval with Memory-Augmented Self-Evolution training to reach 89.0% on ScreenSpot and 75.8% on AndroidControl-High.
Trajectory mining produces readable skill clusters with high purity, but GRPO training on them improves skill-step accuracy only from 18.5% to 20.5% and underperforms frequency priors.
citing papers explorer
- RiskWebWorld: A Realistic Interactive Benchmark for GUI Agents in E-commerce Risk Management
- MetaPS: Adaptive Programmatic Strategy Selection for Market Agents
- How Should Agents Read Demonstrations? Hierarchical Structure Beats Flat Action Logs
- Automating SKILL.md Generation for Computer-Using Agents via Interaction Trajectory Mining