WebArena: A Realistic Web Environment for Building Autonomous Agents
Pith reviewed 2026-05-10 22:59 UTC · model grok-4.3
The pith
Current language-model agents complete only about 14 percent of realistic web tasks, against roughly 78 percent for humans.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
We introduce WebArena as a highly realistic web environment for language-guided autonomous agents, populated with operational websites from four everyday domains and augmented with tools and knowledge bases. We also provide a diverse set of long-horizon benchmark tasks that evaluate the functional correctness of task completions. Our evaluations of several baseline agents, including those using reasoning before acting, show that the best GPT-4-based agent attains an end-to-end success rate of 14.41 percent, well below the 78.24 percent achieved by humans.
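To make the size of that gap concrete, the short calculation below works only from the two success rates quoted above. It is an illustrative sketch; the variable names are ours and do not come from the paper.

```python
# Success rates quoted in the core claim (percent of tasks completed end to end).
agent_rate = 14.41  # best GPT-4-based agent
human_rate = 78.24  # human performance

# Absolute gap in percentage points, and how many times more often humans succeed.
gap_points = human_rate - agent_rate
ratio = human_rate / agent_rate

print(f"gap: {gap_points:.2f} percentage points")
print(f"humans succeed about {ratio:.1f}x as often as the agent")
```

The gap comes to 63.83 percentage points, with humans succeeding roughly 5.4 times as often.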
What carries the argument
The WebArena environment, which supplies fully functional websites across multiple domains together with a benchmark task suite that tests functional correctness on extended, realistic interactions.
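One way to picture "functional correctness" is a checker that inspects the outcome of an episode rather than the exact sequence of actions taken. The sketch below is a hypothetical illustration: the `Task` class, the `State` dictionary, the example tasks, and `success_rate` are all assumptions made for this page, not WebArena's actual evaluation code.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, List

# Hypothetical snapshot of what the site (or the agent's answer) looks like
# once the episode has ended.
State = Dict[str, Any]


@dataclass
class Task:
    intent: str                     # natural-language instruction
    check: Callable[[State], bool]  # outcome-based success test


def success_rate(tasks: List[Task], final_states: List[State]) -> float:
    """Return the percentage of tasks whose final state passes its checker."""
    passed = sum(task.check(state) for task, state in zip(tasks, final_states))
    return 100.0 * passed / len(tasks)


# Two made-up tasks: one judged on site state, one on the reported answer.
tasks = [
    Task("Add the printer to the cart",
         lambda s: "printer" in s.get("cart", [])),
    Task("Report the price of the printer",
         lambda s: s.get("answer") == "$100.00"),
]
final_states = [
    {"cart": ["printer"]},   # first task succeeded
    {"answer": "$120.00"},   # second task reported the wrong price
]

print(success_rate(tasks, final_states))
```

Scoring the outcome means any action path that reaches the right end state counts as a success, which is why a single percentage can summarize long, varied interactions.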
If this is right
- Agents for web tasks need substantial further development to reach reliable performance.
- Current large language models do not yet deliver near-perfect results on these real-life tasks.
- WebArena supplies a standardized, reproducible measure for tracking progress in agent robustness.
Where Pith is reading between the lines
- Results on WebArena could serve as a predictor for agent behavior on live web services if the simulations prove faithful enough.
- Adding more domains or introducing time pressure and multi-user elements might expose additional failure modes in current agents.
- Pairing WebArena with benchmarks from other domains could help develop agents that transfer skills across different kinds of interaction.
Load-bearing premise
The constructed websites and tasks capture enough of the complexity, variability, and edge cases of actual web use that performance gaps will hold outside the simulated setting.
What would settle it
Running the same agents on live websites that match the simulated ones and checking whether end-to-end success rates remain near 14 percent.
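Such a comparison could be framed as a test of whether two success proportions differ. The sketch below uses a standard two-proportion z-test built from the Python standard library; the task counts are hypothetical placeholders, not figures from the paper, and the test is our choice of method rather than anything the authors propose.

```python
import math


def two_proportion_z(successes_a: int, n_a: int,
                     successes_b: int, n_b: int) -> float:
    """z statistic for the difference between two success proportions.

    Assumes the pooled proportion is strictly between 0 and 1.
    """
    p_a = successes_a / n_a
    p_b = successes_b / n_b
    pooled = (successes_a + successes_b) / (n_a + n_b)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    return (p_a - p_b) / std_err


def two_sided_p_value(z: float) -> float:
    """Two-sided p-value under the standard normal distribution."""
    return math.erfc(abs(z) / math.sqrt(2))


# Hypothetical counts: 29 of 200 simulated tasks solved, 31 of 200 live tasks.
z = two_proportion_z(29, 200, 31, 200)
p = two_sided_p_value(z)
print(f"z = {z:.2f}, p = {p:.2f}")
```

A large p-value here would only mean the data do not show a difference; it would not by itself establish that the simulated sites are faithful.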
Original abstract
With advances in generative AI, there is now potential for autonomous agents to manage daily tasks via natural language commands. However, current agents are primarily created and tested in simplified synthetic environments, leading to a disconnect with real-world scenarios. In this paper, we build an environment for language-guided agents that is highly realistic and reproducible. Specifically, we focus on agents that perform tasks on the web, and create an environment with fully functional websites from four common domains: e-commerce, social forum discussions, collaborative software development, and content management. Our environment is enriched with tools (e.g., a map) and external knowledge bases (e.g., user manuals) to encourage human-like task-solving. Building upon our environment, we release a set of benchmark tasks focusing on evaluating the functional correctness of task completions. The tasks in our benchmark are diverse, long-horizon, and designed to emulate tasks that humans routinely perform on the internet. We experiment with several baseline agents, integrating recent techniques such as reasoning before acting. The results demonstrate that solving complex tasks is challenging: our best GPT-4-based agent only achieves an end-to-end task success rate of 14.41%, significantly lower than the human performance of 78.24%. These results highlight the need for further development of robust agents, that current state-of-the-art large language models are far from perfect performance in these real-life tasks, and that WebArena can be used to measure such progress.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript presents WebArena, a realistic and reproducible web environment for language-guided autonomous agents. It constructs fully functional websites across four domains (e-commerce, social forum discussions, collaborative software development, and content management), augmented with tools (e.g., maps) and external knowledge bases (e.g., user manuals). A benchmark of diverse, long-horizon tasks is released with evaluation criteria based on functional correctness of completions. Baseline agents incorporating recent techniques such as reasoning before acting are evaluated, with the best GPT-4-based agent achieving 14.41% end-to-end success versus 78.24% for humans.
Significance. If the results hold, this provides a significant contribution by bridging synthetic environments and real-world web scenarios for agent evaluation. The functional sites, concrete success metrics, human baseline, and large reported performance gap (14.41% vs. 78.24%) offer a reproducible way to measure progress in LLM-based agents for practical tasks. The environment release and task suite enable community-driven advancements and address a key disconnect in current agent research.
minor comments (3)
- [§3] §3 (Environment Construction): Expand on the specific limitations of the functional websites, such as which interactions or edge cases (e.g., network errors, dynamic content changes) are not fully supported, to better contextualize the realism claim.
- [§4] §4 (Benchmark Tasks): Include additional concrete examples of task templates and success criteria for each domain to improve reproducibility and allow readers to assess task diversity without accessing the full release.
- [Abstract] Abstract: The final sentence contains a grammatical issue ('highlight the need for further development of robust agents, that current...') that should be rephrased for clarity, e.g., by inserting 'and' or restructuring.
Simulated Author's Rebuttal
We thank the referee for their positive summary of the manuscript, recognition of its significance in bridging synthetic and real-world web agent evaluation, and recommendation to accept. No major comments were raised in the report.
Circularity Check
No significant circularity in empirical benchmark construction and evaluation
full rationale
The paper introduces a new web environment and task suite as an explicit contribution, then reports direct empirical measurements of agent success rates (14.41% for best GPT-4 agent vs. 78.24% human) on those tasks. No derivations, first-principles predictions, fitted parameters renamed as predictions, or self-citation chains are present; the central results are observable task-completion outcomes within the released benchmark, which remain independent of any internal reduction to the paper's own inputs.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Functional websites can be implemented in a controlled, reproducible simulator that preserves the interaction patterns of real web applications.
Forward citations
Cited by 60 Pith papers
-
MedMemoryBench: Benchmarking Agent Memory in Personalized Healthcare
MedMemoryBench supplies a 2,000-session synthetic medical trajectory dataset and an evaluate-while-constructing streaming protocol to expose memory saturation and reasoning failures in current agent architectures for ...
-
Agent-BRACE: Decoupling Beliefs from Actions in Long-Horizon Tasks via Verbalized State Uncertainty
Agent-BRACE improves LLM agent performance on long-horizon partially observable tasks by 5.3-14.5% through a decoupled belief state of verbalized atomic claims with certainty labels that keeps context length constant.
-
WildClawBench: A Benchmark for Real-World, Long-Horizon Agent Evaluation
A new native-runtime benchmark reveals that current frontier AI agents succeed on at most 62 percent of realistic long-horizon CLI tasks.
-
WindowsWorld: A Process-Centric Benchmark of Autonomous GUI Agents in Professional Cross-Application Environments
WindowsWorld benchmark shows leading GUI agents achieve under 21% success on multi-application professional tasks, with failures especially on conditional judgment across three or more apps and inefficient execution.
-
AgentDojo: A Dynamic Environment to Evaluate Prompt Injection Attacks and Defenses for LLM Agents
AgentDojo introduces an extensible evaluation framework populated with realistic agent tasks and security test cases to measure prompt injection robustness in tool-using LLM agents.
-
OSWorld: Benchmarking Multimodal Agents for Open-Ended Tasks in Real Computer Environments
OSWorld provides the first unified real-computer benchmark for open-ended multimodal agent tasks, exposing large performance gaps between humans and state-of-the-art LLM/VLM agents.
-
Beyond Binary: Reframing GUI Critique as Continuous Semantic Alignment
BBCritic uses contrastive learning to align GUI actions in a continuous affordance space, outperforming larger binary critic models on a new four-level hierarchical benchmark while enabling zero-shot transfer.
-
ClawForge: Generating Executable Interactive Benchmarks for Command-Line Agents
ClawForge supplies a generator that turns scenario templates into reproducible command-line tasks testing state conflict handling, where the strongest frontier model scores only 45.3 percent strict accuracy.
-
MMSkills: Towards Multimodal Skills for General Visual Agents
MMSkills creates compact multimodal skill packages from trajectories and uses a branch-loaded agent to improve visual decision-making on GUI and game benchmarks.
-
AgentLens: Revealing The Lucky Pass Problem in SWE-Agent Evaluation
10.7% of passing SWE-agent trajectories are Lucky Passes with chaotic behaviors, and a quality score based on process references changes model rankings across eight backends.
-
State-Centric Decision Process
SDP constructs a task-induced state space from raw text by having agents commit to and certify natural-language predicates as states, enabling structured planning and analysis in unstructured language environments.
-
Do Androids Dream of Breaking the Game? Systematically Auditing AI Agent Benchmarks with BenchJack
BenchJack audits 10 AI agent benchmarks, synthesizes exploits achieving near-perfect scores without task completion, surfaces 219 flaws, and reduces hackable-task ratios to under 10% on four benchmarks via iterative patching.
-
Covering Human Action Space for Computer Use: Data Synthesis and Benchmark
Presents CUActSpot benchmark and renderer-LLM data synthesis that lets a 4B model outperform larger open-source models on complex computer interactions.
-
Checkup2Action: A Multimodal Clinical Check-up Report Dataset for Patient-Oriented Action Card Generation
Checkup2Action is a new multimodal dataset and benchmark for generating safe, prioritized action cards from real-world clinical check-up reports using large language models.
-
Checkup2Action: A Multimodal Clinical Check-up Report Dataset for Patient-Oriented Action Card Generation
Checkup2Action is a new multimodal dataset and benchmark for generating patient-oriented action cards from real-world clinical check-up reports.
-
Can a Single Message Paralyze the AI Infrastructure? The Rise of AbO-DDoS Attacks through Targeted Mobius Injection
Mobius Injection exploits semantic closure in LLM agents to enable single-message AbO-DDoS attacks achieving up to 51x call amplification and 229x latency inflation.
-
ReVision: Scaling Computer-Use Agents via Temporal Visual Redundancy Reduction
ReVision reduces visual token usage by 46% on average in agent trajectories via a learned patch selector and improves success rates by 3% on three benchmarks, showing that history saturation stems from inefficient rep...
-
TacoMAS: Test-Time Co-Evolution of Topology and Capability in LLM-based Multi-Agent Systems
TacoMAS performs test-time co-evolution of agent capabilities and communication topology in LLM multi-agent systems via fast capability updates and slow meta-LLM topology edits, delivering 13.3% average gains over str...
-
AgentForesight: Online Auditing for Early Failure Prediction in Multi-Agent Systems
AgentForesight introduces an online auditor model that predicts decisive errors in multi-agent trajectories at the earliest step using a coarse-to-fine reinforcement learning recipe on a new curated dataset AFTraj-2K.
-
AgentForesight: Online Auditing for Early Failure Prediction in Multi-Agent Systems
AgentForesight trains a 7B model to perform online auditing of multi-agent LLM trajectories, detecting early decisive errors and outperforming larger models on custom and external benchmarks.
-
CyBiasBench: Benchmarking Bias in LLM Agents for Cyber-Attack Scenarios
LLM agents exhibit persistent attack-selection biases as fixed traits independent of success rates, with a bias momentum effect that resists steering and yields no performance gain.
-
When Stored Evidence Stops Being Usable: Scale-Conditioned Evaluation of Agent Memory
A new evaluation protocol shows agent memory reliability degrades variably with added irrelevant sessions depending on agent, memory interface, and scale.
-
Can Agents Price a Reaction? Evaluating LLMs on Chemical Cost Reasoning
LLM agents reach only 50.6% accuracy on chemical cost estimation within 25% error even with tools, dropping with noise due to parsing, pack selection, and tool-use failures.
-
Weblica: Scalable and Reproducible Training Environments for Visual Web Agents
Weblica scales RL training for visual web agents by building thousands of reproducible environments through HTTP caching for stable replays and LLM synthesis from real sites, yielding an 8B model that beats similar op...
-
Inference-Time Budget Control for LLM Search Agents
A VOI-based controller for dual inference budgets improves multi-hop QA performance by prioritizing search actions and selectively finalizing answers.
-
WAAA! Web Adversaries Against Agentic Browsers
Agentic browsers are vulnerable to 20 web and LLM attacks with 18 implemented, exposing five failure modes across four major LLM models that require redesign before safe deployment.
-
AcademiClaw: When Students Set Challenges for AI Agents
AcademiClaw is a new benchmark of 80 student-sourced academic tasks where the best frontier AI agents achieve only a 55% pass rate.
-
NeuroState-Bench: A Human-Calibrated Benchmark for Commitment Integrity in LLM Agent Profiles
NeuroState-Bench is a human-calibrated benchmark with 144 tasks and 306 side-query probes showing that commitment integrity in LLM agent profiles diverges from task success, with 31 of 32 profiles changing rank under ...
-
DV-World: Benchmarking Data Visualization Agents in Real-World Scenarios
DV-World is a benchmark of 260 tasks across spreadsheet manipulation, visual evolution, and interactive intent alignment that shows state-of-the-art AI models achieve less than 50% overall performance on real-world da...
-
ClawMark: A Living-World Benchmark for Multi-Turn, Multi-Day, Multimodal Coworker Agents
ClawMark is a new benchmark for multi-turn multi-day multimodal coworker agents in stateful evolving services, with deterministic Python checkers showing frontier models achieve only 20% strict task success.
-
A Systematic Survey of Security Threats and Defenses in LLM-Based AI Agents: A Layered Attack Surface Framework
A new 7x4 taxonomy organizes agentic AI security threats by architectural layer and persistence timescale, revealing under-explored upper layers and missing defenses after surveying 116 papers.
-
AJ-Bench: Benchmarking Agent-as-a-Judge for Environment-Aware Evaluation
AJ-Bench provides 155 tasks in three domains to evaluate environment-interacting agent judges, showing performance gains over LLM-as-a-Judge but exposing remaining verification challenges.
-
The Moltbook Observatory Archive: an incremental dataset of agent-only social network activity
The Moltbook Observatory Archive is the first large-scale dataset from a social network populated exclusively by autonomous AI agents, covering 78 days with 2.6 million posts and 1.2 million comments.
-
Beyond Chat and Clicks: GUI Agents for In-Situ Assistance via Live Interface Transformation
GUI agents can transform live web interfaces in real-time via DOM manipulations to deliver contextual assistance directly within the application.
-
RiskWebWorld: A Realistic Interactive Benchmark for GUI Agents in E-commerce Risk Management
RiskWebWorld is the first realistic interactive benchmark for GUI agents in e-commerce risk management, revealing a large gap between generalist and specialized models plus RL gains.
-
From Imitation to Discrimination: Progressive Curriculum Learning for Robust Web Navigation
A new 590k-instance dataset built with hard-negative mining and dual-agent verification, plus progressive SFT-to-ORPO-to-GRPO training, yields 58.7% step success on Mind2Web, beating GPT-4.5 and Claude-4.5.
-
ClawVM: Harness-Managed Virtual Memory for Stateful Tool-Using LLM Agents
ClawVM introduces a harness-managed virtual memory system for LLM agents that ensures deterministic residency and durability of state under token budgets by using typed pages and validated writeback.
-
EE-MCP: Self-Evolving MCP-GUI Agents via Automated Environment Generation and Experience Learning
A self-evolving MCP-GUI agent system with automated environment generation and an experience bank achieves up to 77.8% pass rates by matching distillation or experience augmentation to task type across three desktop a...
-
HiL-Bench (Human-in-Loop Benchmark): Do Agents Know When to Ask for Help?
HiL-Bench shows frontier AI agents fail to ask for help on incomplete tasks, recovering only a fraction of full-information performance, but RL training on Ask-F1 reward improves judgment and transfers across domains.
-
SAGE: A Service Agent Graph-guided Evaluation Benchmark
SAGE is a new multi-agent benchmark that formalizes service SOPs as dynamic dialogue graphs to measure LLM agents on logical compliance and path coverage, uncovering an execution gap and empathy resilience across 27 m...
-
MolmoWeb: Open Visual Web Agent and Open Data for the Open Web
Open 4B and 8B visual web agents achieve state-of-the-art results on browser benchmarks by predicting actions from screenshots and instructions, outperforming similar open models and some closed larger-model agents, w...
-
WebSP-Eval: Evaluating Web Agents on Website Security and Privacy Tasks
WebSP-Eval shows that multimodal LLM-based web agents fail more than 45% of the time on security and privacy tasks involving stateful UI elements such as toggles and checkboxes.
-
FrontierFinance: A Long-Horizon Computer-Use Benchmark of Real-World Financial Tasks
FrontierFinance benchmark shows human financial experts outperform state-of-the-art LLMs by achieving higher scores and more client-ready outputs on realistic long-horizon tasks.
-
GUIDE: Interpretable GUI Agent Evaluation via Hierarchical Diagnosis
GUIDE decomposes GUI agent evaluation into trajectory segmentation, subtask diagnosis, and overall summary to deliver higher accuracy and structured error reports than holistic baselines.
-
M-CARE: Standardized Clinical Case Reporting for AI Model Behavioral Disorders, with a 20-Case Atlas and Experimental Validation
M-CARE provides a medical-inspired reporting system for AI behavioral disorders, demonstrated through 20 cases and a validated experiment showing shell instructions overriding cooperative behavior across game domains.
-
StressWeb: A Diagnostic Benchmark for Web Agent Robustness under Realistic Interaction Variability
StressWeb is a new diagnostic benchmark that applies structured perturbations to web interactions to expose robustness failures in LLM-based agents that standard clean evaluations miss.
-
CLAG: Adaptive Memory Organization via Agent-Driven Clustering for Small Language Model Agents
CLAG organizes agent memory into clusters via an SLM router and uses cluster profiles for two-stage retrieval, yielding better answer quality on QA benchmarks than prior memory systems.
-
TelcoAgent-Bench: A Multilingual Benchmark for Telecom AI Agents
TelcoAgent-Bench is a new framework that evaluates how well multilingual LLM agents recognize intents, execute troubleshooting steps, and stay consistent across variations in telecom scenarios.
-
$\tau^2$-Bench: Evaluating Conversational Agents in a Dual-Control Environment
τ²-bench provides a Dec-POMDP-based telecom domain with compositional task generation and a tool-constrained user simulator to measure agent performance drops in dual-control versus single-control settings.
-
$\tau$-bench: A Benchmark for Tool-Agent-User Interaction in Real-World Domains
τ-bench shows state-of-the-art agents like GPT-4o succeed on under 50% of tool-using, rule-following tasks and are inconsistent across repeated trials.
-
WorkArena: How Capable Are Web Agents at Solving Common Knowledge Work Tasks?
WorkArena benchmark shows LLM web agents achieve partial success on enterprise tasks but have a substantial gap to full automation and perform worse with open-source models.
-
Resolving Action Bottleneck: Agentic Reinforcement Learning Informed by Token-Level Energy
ActFocus resolves the action bottleneck in agentic RL by reweighting token gradients toward action tokens using observed reward variance and an energy-based uncertainty term, outperforming PPO and GRPO by up to 65 per...
-
Web Agents Should Adopt the Plan-Then-Execute Paradigm
Web agents should default to planning a complete task program before observing live web content to reduce prompt injection exposure, since WebArena tasks are compatible and 80% need no runtime LLM calls.
-
How to Interpret Agent Behavior
ACT*ONOMY is a Grounded-Theory-derived hierarchical taxonomy and open repository that enables systematic comparison and characterization of autonomous agent behavior across trajectories.
-
MMSkills: Towards Multimodal Skills for General Visual Agents
MMSkills turns public interaction trajectories into compact multimodal skill packages that visual agents can consult at runtime to improve decision-making on benchmarks.
-
MMCL-Bench: Multimodal Context Learning from Visual Rules, Procedures, and Evidence
MMCL-Bench shows that even the strongest frontier multimodal models solve fewer than one-third of tasks requiring recovery and application of visual rules, procedures, and empirical patterns.
-
SkillSafetyBench: Evaluating Agent Safety under Skill-Facing Attack Surfaces
SkillSafetyBench shows that localized non-user attacks via skills and artifacts can consistently induce unsafe agent behavior across domains and model backends, independent of user intent.
-
Domain Restriction via Multi SAE Layer Transitions
Multi-layer SAE transitions capture domain-specific signatures that distinguish OOD texts in Gemma-2 models.
-
On-Policy Self-Evolution via Failure Trajectories for Agentic Safety Alignment
FATE lets LLM agents self-evolve safer behaviors by generating and filtering repairs from their own failure trajectories using verifiers and Pareto optimization.
-
ReVision: Scaling Computer-Use Agents via Temporal Visual Redundancy Reduction
ReVision reduces visual tokens in computer-use agent histories by 46% on average and raises success rates by 3% by learning to drop redundant patches across screenshots, allowing longer histories to keep improving per...