arXiv: 2307.13854 · v4 · submitted 2023-07-25 · cs.AI · cs.CL · cs.LG

WebArena: A Realistic Web Environment for Building Autonomous Agents

Shuyan Zhou, Frank F. Xu, Hao Zhu, Xuhui Zhou, Robert Lo, Abishek Sridhar, Xianyi Cheng, Tianyue Ou, Yonatan Bisk, Daniel Fried, Uri Alon, Graham Neubig

Pith reviewed 2026-05-10 22:59 UTC · model grok-4.3

keywords: autonomous agents · web environment · benchmark tasks · language models · task success rate · realistic simulation · GPT-4 evaluation

The pith

The best GPT-4-based agent completes only about 14 percent of realistic web tasks, while humans complete about 78 percent of the same tasks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper builds WebArena, a realistic and reproducible environment consisting of fully functional websites across e-commerce, social forums, software development, and content management. It augments these sites with tools and external knowledge bases to support everyday task solving and supplies a benchmark of diverse, long-horizon tasks that emulate routine internet use. Experiments with baseline agents that incorporate reasoning before acting show that the strongest GPT-4 setup reaches just 14.41 percent end-to-end success. Humans achieve 78.24 percent on the identical tasks. The results indicate that existing models fall short on real web interactions and that WebArena offers a way to track future gains in agent capability.

Core claim

We introduce WebArena as a highly realistic web environment for language-guided autonomous agents, populated with operational websites from four everyday domains and augmented with tools and knowledge bases. We also provide a diverse set of long-horizon benchmark tasks, evaluated on the functional correctness of task completion. Our evaluations of several baseline agents, including those using reasoning before acting, show that the best GPT-4-based agent attains an end-to-end success rate of 14.41 percent, well below the 78.24 percent achieved by humans.

What carries the argument

The WebArena environment, which supplies fully functional websites across multiple domains together with a benchmark task suite that tests functional correctness on extended, realistic interactions.

If this is right

Agents for web tasks need substantial further development to reach reliable performance.
Current large language models do not yet deliver near-perfect results on these real-life tasks.
WebArena supplies a standardized, reproducible measure for tracking progress in agent robustness.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Results on WebArena could serve as a predictor for agent behavior on live web services if the simulations prove faithful enough.
Adding more domains or introducing time pressure and multi-user elements might expose additional failure modes in current agents.
Pairing WebArena with benchmarks from other domains could help develop agents that transfer skills across different kinds of interaction.

Load-bearing premise

The constructed websites and tasks capture enough of the complexity, variability, and edge cases of actual web use that performance gaps will hold outside the simulated setting.

What would settle it

Running the same agents on live websites that match the simulated ones and checking whether end-to-end success rates remain near 14 percent.
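One way to make "remain near 14 percent" concrete is to ask whether the reported simulated rate falls inside a confidence interval around the live-site result. The sketch below is only an illustration under stated assumptions: it uses a Wilson score interval at roughly 95 percent, and the live-site counts (30 successes out of 200 tasks) are hypothetical placeholders, not data from the paper. The function names are likewise invented for this example.

```python
import math


def wilson_interval(successes: int, trials: int, z: float = 1.96) -> tuple:
    """Wilson score interval for a binomial success rate (z=1.96 is ~95%)."""
    if trials <= 0:
        raise ValueError("trials must be positive")
    p = successes / trials
    denom = 1 + z * z / trials
    center = (p + z * z / (2 * trials)) / denom
    half = z * math.sqrt(p * (1 - p) / trials + z * z / (4 * trials * trials)) / denom
    return center - half, center + half


def consistent_with(reference_rate: float, successes: int, trials: int) -> bool:
    """True if reference_rate lies inside the interval around the observed rate."""
    low, high = wilson_interval(successes, trials)
    return low <= reference_rate <= high


SIMULATED_RATE = 0.1441  # best GPT-4 agent, as reported in the paper

# Hypothetical live-site run: these counts are assumptions for illustration.
live_successes, live_trials = 30, 200
print(consistent_with(SIMULATED_RATE, live_successes, live_trials))
```

A Wilson interval is used here because it behaves better than the plain normal approximation at low success rates; any comparable interval or a two-proportion test would serve the same purpose.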

Original abstract

With advances in generative AI, there is now potential for autonomous agents to manage daily tasks via natural language commands. However, current agents are primarily created and tested in simplified synthetic environments, leading to a disconnect with real-world scenarios. In this paper, we build an environment for language-guided agents that is highly realistic and reproducible. Specifically, we focus on agents that perform tasks on the web, and create an environment with fully functional websites from four common domains: e-commerce, social forum discussions, collaborative software development, and content management. Our environment is enriched with tools (e.g., a map) and external knowledge bases (e.g., user manuals) to encourage human-like task-solving. Building upon our environment, we release a set of benchmark tasks focusing on evaluating the functional correctness of task completions. The tasks in our benchmark are diverse, long-horizon, and designed to emulate tasks that humans routinely perform on the internet. We experiment with several baseline agents, integrating recent techniques such as reasoning before acting. The results demonstrate that solving complex tasks is challenging: our best GPT-4-based agent only achieves an end-to-end task success rate of 14.41%, significantly lower than the human performance of 78.24%. These results highlight the need for further development of robust agents, that current state-of-the-art large language models are far from perfect performance in these real-life tasks, and that WebArena can be used to measure such progress.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

WebArena releases a multi-domain, functional web testbed on which the best GPT-4 agent reaches about 14% success versus 78% for humans on long-horizon tasks; the gap and the public release make it worth using.

WebArena gives the agent community a concrete, released environment with working sites instead of the usual toy setups. The core result is that even their strongest GPT-4 baseline with reasoning reaches only 14.41% end-to-end success on the benchmark tasks while humans hit 78.24%. That gap is the main thing to take away, and the fact that the sites are actually functional across e-commerce, forums, dev tools, and CMS makes the measurement more relevant than prior synthetic work. They also wire in external tools like maps and knowledge bases, which forces agents to handle realistic information lookup rather than pure text prompts. The tasks are long-horizon and scored on functional correctness, not just surface matches, and the paper reports the numbers plainly. Releasing the full environment is the practical win here; others can now run their own agents and compare directly.

The construction looks solid: the sites behave as described, the baselines include recent techniques, and there are no obvious circular definitions in the success metric. Citation coverage of prior agent environments is standard and sufficient.

The soft spots are limited. The sites are still controlled simulations, so they will miss some live-web noise such as changing layouts or intermittent failures, though the paper does not claim perfect replication. A few task specifications could use more explicit detail for exact reproduction, but this does not affect the reported gap. Overall the empirical claim stands on its own measurements.

This is for anyone building or testing web agents with LLMs. Researchers who need a reproducible yardstick beyond simple environments will get direct value from the tasks and the released code. It deserves a serious referee because the setup is constructive, the data is there, and the release lowers the barrier for follow-up work. I would engage with it by trying the benchmark on new agent variants and citing the environment when discussing realistic evaluation.

Referee Report

0 major / 3 minor

Summary. The manuscript presents WebArena, a realistic and reproducible web environment for language-guided autonomous agents. It constructs fully functional websites across four domains (e-commerce, social forum discussions, collaborative software development, and content management), augmented with tools (e.g., maps) and external knowledge bases (e.g., user manuals). A benchmark of diverse, long-horizon tasks is released with evaluation criteria based on functional correctness of completions. Baseline agents incorporating recent techniques such as reasoning before acting are evaluated, with the best GPT-4-based agent achieving 14.41% end-to-end success versus 78.24% for humans.

Significance. If the results hold, this provides a significant contribution by bridging synthetic environments and real-world web scenarios for agent evaluation. The functional sites, concrete success metrics, human baseline, and large reported performance gap (14.41% vs. 78.24%) offer a reproducible way to measure progress in LLM-based agents for practical tasks. The environment release and task suite enable community-driven advancements and address a key disconnect in current agent research.

minor comments (3)

§3 (Environment Construction): Expand on the specific limitations of the functional websites, such as which interactions or edge cases (e.g., network errors, dynamic content changes) are not fully supported, to better contextualize the realism claim.
§4 (Benchmark Tasks): Include additional concrete examples of task templates and success criteria for each domain to improve reproducibility and allow readers to assess task diversity without accessing the full release.
Abstract: The final sentence contains a grammatical issue ('highlight the need for further development of robust agents, that current...') that should be rephrased for clarity, e.g., by inserting 'and' or restructuring.
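To illustrate what a concrete task template with an explicit success criterion might look like, here is a minimal sketch. Everything in it is hypothetical: the `TaskSpec` fields, the `exact_match` and `must_include` criteria, and the helper names are assumptions made for this example and are not taken from the paper's released evaluation code. It covers only tasks that end in a textual answer; tasks that change site state would need a check against the site itself.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class TaskSpec:
    """Hypothetical task record; field names are illustrative only."""
    intent: str
    exact_match: Optional[str] = None          # answer must equal this string
    must_include: List[str] = field(default_factory=list)  # required substrings


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so formatting noise is ignored."""
    return " ".join(text.lower().split())


def is_success(task: TaskSpec, answer: str) -> bool:
    """Check an agent's final answer against the task's criteria."""
    got = normalize(answer)
    if task.exact_match is not None and got != normalize(task.exact_match):
        return False
    return all(normalize(s) in got for s in task.must_include)


def success_rate(results: List[bool]) -> float:
    """End-to-end success rate in percent over a list of task outcomes."""
    return 100.0 * sum(results) / len(results) if results else 0.0


# Illustrative price-lookup task.
task = TaskSpec(
    intent="What is the price of HP Inkjet Fax Machine",
    must_include=["$279.49"],
)
print(is_success(task, "The price is $279.49"))
```

Keeping the criterion inside the task record means a reader can judge task difficulty and scoring strictness from the specification alone, which is what the §4 comment asks for.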

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for their positive summary of the manuscript, recognition of its significance in bridging synthetic and real-world web agent evaluation, and recommendation to accept. No major comments were raised in the report.

Circularity Check

0 steps flagged

No significant circularity in empirical benchmark construction and evaluation

The paper introduces a new web environment and task suite as an explicit contribution, then reports direct empirical measurements of agent success rates (14.41% for best GPT-4 agent vs. 78.24% human) on those tasks. No derivations, first-principles predictions, fitted parameters renamed as predictions, or self-citation chains are present; the central results are observable task-completion outcomes within the released benchmark, which remain independent of any internal reduction to the paper's own inputs.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The contribution rests on constructing a new simulated environment rather than deriving results from prior equations or postulates; no free parameters or invented entities are introduced to support a mathematical claim.

axioms (1)

domain assumption: Functional websites can be implemented in a controlled, reproducible simulator that preserves the interaction patterns of real web applications.
Invoked when stating that the environment is 'highly realistic' and supports human-like task solving.

Forward citations

Cited by 60 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

MedMemoryBench: Benchmarking Agent Memory in Personalized Healthcare
cs.AI 2026-05 conditional novelty 8.0

MedMemoryBench supplies a 2,000-session synthetic medical trajectory dataset and an evaluate-while-constructing streaming protocol to expose memory saturation and reasoning failures in current agent architectures for ...
Agent-BRACE: Decoupling Beliefs from Actions in Long-Horizon Tasks via Verbalized State Uncertainty
cs.CL 2026-05 unverdicted novelty 8.0

Agent-BRACE improves LLM agent performance on long-horizon partially observable tasks by 5.3-14.5% through a decoupled belief state of verbalized atomic claims with certainty labels that keeps context length constant.
WildClawBench: A Benchmark for Real-World, Long-Horizon Agent Evaluation
cs.CL 2026-05 unverdicted novelty 8.0

A new native-runtime benchmark reveals that current frontier AI agents succeed on at most 62 percent of realistic long-horizon CLI tasks.
WindowsWorld: A Process-Centric Benchmark of Autonomous GUI Agents in Professional Cross-Application Environments
cs.AI 2026-04 accept novelty 8.0

WindowsWorld benchmark shows leading GUI agents achieve under 21% success on multi-application professional tasks, with failures especially on conditional judgment across three or more apps and inefficient execution.
AgentDojo: A Dynamic Environment to Evaluate Prompt Injection Attacks and Defenses for LLM Agents
cs.CR 2024-06 unverdicted novelty 8.0

AgentDojo introduces an extensible evaluation framework populated with realistic agent tasks and security test cases to measure prompt injection robustness in tool-using LLM agents.
OSWorld: Benchmarking Multimodal Agents for Open-Ended Tasks in Real Computer Environments
cs.AI 2024-04 accept novelty 8.0

OSWorld provides the first unified real-computer benchmark for open-ended multimodal agent tasks, exposing large performance gaps between humans and state-of-the-art LLM/VLM agents.
Beyond Binary: Reframing GUI Critique as Continuous Semantic Alignment
cs.LG 2026-05 unverdicted novelty 7.0

BBCritic uses contrastive learning to align GUI actions in a continuous affordance space, outperforming larger binary critic models on a new four-level hierarchical benchmark while enabling zero-shot transfer.
ClawForge: Generating Executable Interactive Benchmarks for Command-Line Agents
cs.AI 2026-05 conditional novelty 7.0

ClawForge supplies a generator that turns scenario templates into reproducible command-line tasks testing state conflict handling, where the strongest frontier model scores only 45.3 percent strict accuracy.
MMSkills: Towards Multimodal Skills for General Visual Agents
cs.AI 2026-05 unverdicted novelty 7.0

MMSkills creates compact multimodal skill packages from trajectories and uses a branch-loaded agent to improve visual decision-making on GUI and game benchmarks.
AgentLens: Revealing The Lucky Pass Problem in SWE-Agent Evaluation
cs.SE 2026-05 conditional novelty 7.0

10.7% of passing SWE-agent trajectories are Lucky Passes with chaotic behaviors, and a quality score based on process references changes model rankings across eight backends.
State-Centric Decision Process
cs.AI 2026-05 unverdicted novelty 7.0

SDP constructs a task-induced state space from raw text by having agents commit to and certify natural-language predicates as states, enabling structured planning and analysis in unstructured language environments.
Do Androids Dream of Breaking the Game? Systematically Auditing AI Agent Benchmarks with BenchJack
cs.AI 2026-05 conditional novelty 7.0

BenchJack audits 10 AI agent benchmarks, synthesizes exploits achieving near-perfect scores without task completion, surfaces 219 flaws, and reduces hackable-task ratios to under 10% on four benchmarks via iterative patching.
Covering Human Action Space for Computer Use: Data Synthesis and Benchmark
cs.CV 2026-05 unverdicted novelty 7.0

Presents CUActSpot benchmark and renderer-LLM data synthesis that lets a 4B model outperform larger open-source models on complex computer interactions.
Checkup2Action: A Multimodal Clinical Check-up Report Dataset for Patient-Oriented Action Card Generation
cs.CL 2026-05 conditional novelty 7.0

Checkup2Action is a new multimodal dataset and benchmark for generating safe, prioritized action cards from real-world clinical check-up reports using large language models.
Checkup2Action: A Multimodal Clinical Check-up Report Dataset for Patient-Oriented Action Card Generation
cs.CL 2026-05 unverdicted novelty 7.0

Checkup2Action is a new multimodal dataset and benchmark for generating patient-oriented action cards from real-world clinical check-up reports.
Can a Single Message Paralyze the AI Infrastructure? The Rise of AbO-DDoS Attacks through Targeted Mobius Injection
cs.CR 2026-05 unverdicted novelty 7.0

Mobius Injection exploits semantic closure in LLM agents to enable single-message AbO-DDoS attacks achieving up to 51x call amplification and 229x latency inflation.
ReVision: Scaling Computer-Use Agents via Temporal Visual Redundancy Reduction
cs.CL 2026-05 unverdicted novelty 7.0

ReVision reduces visual token usage by 46% on average in agent trajectories via a learned patch selector and improves success rates by 3% on three benchmarks, showing that history saturation stems from inefficient rep...
TacoMAS: Test-Time Co-Evolution of Topology and Capability in LLM-based Multi-Agent Systems
cs.CL 2026-05 unverdicted novelty 7.0

TacoMAS performs test-time co-evolution of agent capabilities and communication topology in LLM multi-agent systems via fast capability updates and slow meta-LLM topology edits, delivering 13.3% average gains over str...
AgentForesight: Online Auditing for Early Failure Prediction in Multi-Agent Systems
cs.CL 2026-05 unverdicted novelty 7.0

AgentForesight introduces an online auditor model that predicts decisive errors in multi-agent trajectories at the earliest step using a coarse-to-fine reinforcement learning recipe on a new curated dataset AFTraj-2K.
AgentForesight: Online Auditing for Early Failure Prediction in Multi-Agent Systems
cs.CL 2026-05 unverdicted novelty 7.0

AgentForesight trains a 7B model to perform online auditing of multi-agent LLM trajectories, detecting early decisive errors and outperforming larger models on custom and external benchmarks.
CyBiasBench: Benchmarking Bias in LLM Agents for Cyber-Attack Scenarios
cs.CR 2026-05 unverdicted novelty 7.0

LLM agents exhibit persistent attack-selection biases as fixed traits independent of success rates, with a bias momentum effect that resists steering and yields no performance gain.
When Stored Evidence Stops Being Usable: Scale-Conditioned Evaluation of Agent Memory
cs.AI 2026-05 unverdicted novelty 7.0

A new evaluation protocol shows agent memory reliability degrades variably with added irrelevant sessions depending on agent, memory interface, and scale.
Can Agents Price a Reaction? Evaluating LLMs on Chemical Cost Reasoning
cs.AI 2026-05 unverdicted novelty 7.0

LLM agents reach only 50.6% accuracy on chemical cost estimation within 25% error even with tools, dropping with noise due to parsing, pack selection, and tool-use failures.
Weblica: Scalable and Reproducible Training Environments for Visual Web Agents
cs.AI 2026-05 unverdicted novelty 7.0

Weblica scales RL training for visual web agents by building thousands of reproducible environments through HTTP caching for stable replays and LLM synthesis from real sites, yielding an 8B model that beats similar op...
Inference-Time Budget Control for LLM Search Agents
cs.AI 2026-05 unverdicted novelty 7.0

A VOI-based controller for dual inference budgets improves multi-hop QA performance by prioritizing search actions and selectively finalizing answers.
WAAA! Web Adversaries Against Agentic Browsers
cs.CR 2026-05 unverdicted novelty 7.0

Agentic browsers are vulnerable to 20 web and LLM attacks with 18 implemented, exposing five failure modes across four major LLM models that require redesign before safe deployment.
AcademiClaw: When Students Set Challenges for AI Agents
cs.AI 2026-05 unverdicted novelty 7.0

AcademiClaw is a new benchmark of 80 student-sourced academic tasks where the best frontier AI agents achieve only a 55% pass rate.
NeuroState-Bench: A Human-Calibrated Benchmark for Commitment Integrity in LLM Agent Profiles
cs.AI 2026-05 accept novelty 7.0

NeuroState-Bench is a human-calibrated benchmark with 144 tasks and 306 side-query probes showing that commitment integrity in LLM agent profiles diverges from task success, with 31 of 32 profiles changing rank under ...
DV-World: Benchmarking Data Visualization Agents in Real-World Scenarios
cs.CL 2026-04 unverdicted novelty 7.0

DV-World is a benchmark of 260 tasks across spreadsheet manipulation, visual evolution, and interactive intent alignment that shows state-of-the-art AI models achieve less than 50% overall performance on real-world da...
ClawMark: A Living-World Benchmark for Multi-Turn, Multi-Day, Multimodal Coworker Agents
cs.CV 2026-04 unverdicted novelty 7.0

ClawMark is a new benchmark for multi-turn multi-day multimodal coworker agents in stateful evolving services, with deterministic Python checkers showing frontier models achieve only 20% strict task success.
A Systematic Survey of Security Threats and Defenses in LLM-Based AI Agents: A Layered Attack Surface Framework
cs.CR 2026-04 unverdicted novelty 7.0

A new 7x4 taxonomy organizes agentic AI security threats by architectural layer and persistence timescale, revealing under-explored upper layers and missing defenses after surveying 116 papers.
AJ-Bench: Benchmarking Agent-as-a-Judge for Environment-Aware Evaluation
cs.AI 2026-04 unverdicted novelty 7.0

AJ-Bench provides 155 tasks in three domains to evaluate environment-interacting agent judges, showing performance gains over LLM-as-a-Judge but exposing remaining verification challenges.
The Moltbook Observatory Archive: an incremental dataset of agent-only social network activity
cs.SI 2026-04 unverdicted novelty 7.0

The Moltbook Observatory Archive is the first large-scale dataset from a social network populated exclusively by autonomous AI agents, covering 78 days with 2.6 million posts and 1.2 million comments.
Beyond Chat and Clicks: GUI Agents for In-Situ Assistance via Live Interface Transformation
cs.HC 2026-04 unverdicted novelty 7.0

GUI agents can transform live web interfaces in real-time via DOM manipulations to deliver contextual assistance directly within the application.
RiskWebWorld: A Realistic Interactive Benchmark for GUI Agents in E-commerce Risk Management
cs.AI 2026-04 unverdicted novelty 7.0

RiskWebWorld is the first realistic interactive benchmark for GUI agents in e-commerce risk management, revealing a large gap between generalist and specialized models plus RL gains.
From Imitation to Discrimination: Progressive Curriculum Learning for Robust Web Navigation
cs.LG 2026-04 unverdicted novelty 7.0

A new 590k-instance dataset built with hard-negative mining and dual-agent verification, plus progressive SFT-to-ORPO-to-GRPO training, yields 58.7% step success on Mind2Web, beating GPT-4.5 and Claude-4.5.
ClawVM: Harness-Managed Virtual Memory for Stateful Tool-Using LLM Agents
cs.AI 2026-04 unverdicted novelty 7.0

ClawVM introduces a harness-managed virtual memory system for LLM agents that ensures deterministic residency and durability of state under token budgets by using typed pages and validated writeback.
EE-MCP: Self-Evolving MCP-GUI Agents via Automated Environment Generation and Experience Learning
cs.AI 2026-04 unverdicted novelty 7.0

A self-evolving MCP-GUI agent system with automated environment generation and an experience bank achieves up to 77.8% pass rates by matching distillation or experience augmentation to task type across three desktop a...
HiL-Bench (Human-in-Loop Benchmark): Do Agents Know When to Ask for Help?
cs.AI 2026-04 unverdicted novelty 7.0

HiL-Bench shows frontier AI agents fail to ask for help on incomplete tasks, recovering only a fraction of full-information performance, but RL training on Ask-F1 reward improves judgment and transfers across domains.
SAGE: A Service Agent Graph-guided Evaluation Benchmark
cs.AI 2026-04 unverdicted novelty 7.0

SAGE is a new multi-agent benchmark that formalizes service SOPs as dynamic dialogue graphs to measure LLM agents on logical compliance and path coverage, uncovering an execution gap and empathy resilience across 27 m...
MolmoWeb: Open Visual Web Agent and Open Data for the Open Web
cs.CV 2026-04 unverdicted novelty 7.0

Open 4B and 8B visual web agents achieve state-of-the-art results on browser benchmarks by predicting actions from screenshots and instructions, outperforming similar open models and some closed larger-model agents, w...
WebSP-Eval: Evaluating Web Agents on Website Security and Privacy Tasks
cs.CR 2026-04 unverdicted novelty 7.0

WebSP-Eval shows that multimodal LLM-based web agents fail more than 45% of the time on security and privacy tasks involving stateful UI elements such as toggles and checkboxes.
FrontierFinance: A Long-Horizon Computer-Use Benchmark of Real-World Financial Tasks
cs.CL 2026-04 unverdicted novelty 7.0

FrontierFinance benchmark shows human financial experts outperform state-of-the-art LLMs by achieving higher scores and more client-ready outputs on realistic long-horizon tasks.
GUIDE: Interpretable GUI Agent Evaluation via Hierarchical Diagnosis
cs.AI 2026-04 unverdicted novelty 7.0

GUIDE decomposes GUI agent evaluation into trajectory segmentation, subtask diagnosis, and overall summary to deliver higher accuracy and structured error reports than holistic baselines.
M-CARE: Standardized Clinical Case Reporting for AI Model Behavioral Disorders, with a 20-Case Atlas and Experimental Validation
cs.CY 2026-03 conditional novelty 7.0

M-CARE provides a medical-inspired reporting system for AI behavioral disorders, demonstrated through 20 cases and a validated experiment showing shell instructions overriding cooperative behavior across game domains.
StressWeb: A Diagnostic Benchmark for Web Agent Robustness under Realistic Interaction Variability
cs.SE 2026-03 unverdicted novelty 7.0

StressWeb is a new diagnostic benchmark that applies structured perturbations to web interactions to expose robustness failures in LLM-based agents that standard clean evaluations miss.
CLAG: Adaptive Memory Organization via Agent-Driven Clustering for Small Language Model Agents
cs.CL 2026-03 unverdicted novelty 7.0

CLAG organizes agent memory into clusters via an SLM router and uses cluster profiles for two-stage retrieval, yielding better answer quality on QA benchmarks than prior memory systems.
TelcoAgent-Bench: A Multilingual Benchmark for Telecom AI Agents
cs.CL 2026-03 unverdicted novelty 7.0

TelcoAgent-Bench is a new framework that evaluates how well multilingual LLM agents recognize intents, execute troubleshooting steps, and stay consistent across variations in telecom scenarios.
$\tau^2$-Bench: Evaluating Conversational Agents in a Dual-Control Environment
cs.AI 2025-06 unverdicted novelty 7.0

τ²-bench provides a Dec-POMDP-based telecom domain with compositional task generation and a tool-constrained user simulator to measure agent performance drops in dual-control versus single-control settings.
$\tau$-bench: A Benchmark for Tool-Agent-User Interaction in Real-World Domains
cs.AI 2024-06 unverdicted novelty 7.0

τ-bench shows state-of-the-art agents like GPT-4o succeed on under 50% of tool-using, rule-following tasks and are inconsistent across repeated trials.
WorkArena: How Capable Are Web Agents at Solving Common Knowledge Work Tasks?
cs.LG 2024-03 unverdicted novelty 7.0

WorkArena benchmark shows LLM web agents achieve partial success on enterprise tasks but have a substantial gap to full automation and perform worse with open-source models.
Resolving Action Bottleneck: Agentic Reinforcement Learning Informed by Token-Level Energy
cs.LG 2026-05 conditional novelty 6.0

ActFocus resolves the action bottleneck in agentic RL by reweighting token gradients toward action tokens using observed reward variance and an energy-based uncertainty term, outperforming PPO and GRPO by up to 65 per...
Web Agents Should Adopt the Plan-Then-Execute Paradigm
cs.CR 2026-05 unverdicted novelty 6.0

Web agents should default to planning a complete task program before observing live web content to reduce prompt injection exposure, since WebArena tasks are compatible and 80% need no runtime LLM calls.
How to Interpret Agent Behavior
cs.AI 2026-05 conditional novelty 6.0

ACT*ONOMY is a Grounded-Theory-derived hierarchical taxonomy and open repository that enables systematic comparison and characterization of autonomous agent behavior across trajectories.
MMSkills: Towards Multimodal Skills for General Visual Agents
cs.AI 2026-05 unverdicted novelty 6.0

MMSkills turns public interaction trajectories into compact multimodal skill packages that visual agents can consult at runtime to improve decision-making on benchmarks.
MMCL-Bench: Multimodal Context Learning from Visual Rules, Procedures, and Evidence
cs.CV 2026-05 unverdicted novelty 6.0

MMCL-Bench shows that even the strongest frontier multimodal models solve fewer than one-third of tasks requiring recovery and application of visual rules, procedures, and empirical patterns.
SkillSafetyBench: Evaluating Agent Safety under Skill-Facing Attack Surfaces
cs.CR 2026-05 unverdicted novelty 6.0

SkillSafetyBench shows that localized non-user attacks via skills and artifacts can consistently induce unsafe agent behavior across domains and model backends, independent of user intent.
Domain Restriction via Multi SAE Layer Transitions
cs.AI 2026-05 unverdicted novelty 6.0

Multi-layer SAE transitions capture domain-specific signatures that distinguish OOD texts in Gemma-2 models.
On-Policy Self-Evolution via Failure Trajectories for Agentic Safety Alignment
cs.AI 2026-05 unverdicted novelty 6.0

FATE lets LLM agents self-evolve safer behaviors by generating and filtering repairs from their own failure trajectories using verifiers and Pareto optimization.
ReVision: Scaling Computer-Use Agents via Temporal Visual Redundancy Reduction
cs.CL 2026-05 unverdicted novelty 6.0

ReVision reduces visual tokens in computer-use agent histories by 46% on average and raises success rates by 3% by learning to drop redundant patches across screenshots, allowing longer histories to keep improving per...
