super hub Canonical reference

WebArena: A Realistic Web Environment for Building Autonomous Agents

Abishek Sridhar, Frank F. Xu, Hao Zhu, Robert Lo, Shuyan Zhou, Xuhui Zhou · 2023 · cs.AI · arXiv 2307.13854

Canonical reference. 76% of citing Pith papers cite this work as background.

232 Pith papers citing it

Background 76% of classified citations

open full Pith review browse 232 citing papers more from Abishek Sridhar arXiv PDF

abstract

With advances in generative AI, there is now potential for autonomous agents to manage daily tasks via natural language commands. However, current agents are primarily created and tested in simplified synthetic environments, leading to a disconnect with real-world scenarios. In this paper, we build an environment for language-guided agents that is highly realistic and reproducible. Specifically, we focus on agents that perform tasks on the web, and create an environment with fully functional websites from four common domains: e-commerce, social forum discussions, collaborative software development, and content management. Our environment is enriched with tools (e.g., a map) and external knowledge bases (e.g., user manuals) to encourage human-like task-solving. Building upon our environment, we release a set of benchmark tasks focusing on evaluating the functional correctness of task completions. The tasks in our benchmark are diverse, long-horizon, and designed to emulate tasks that humans routinely perform on the internet. We experiment with several baseline agents, integrating recent techniques such as reasoning before acting. The results demonstrate that solving complex tasks is challenging: our best GPT-4-based agent only achieves an end-to-end task success rate of 14.41%, significantly lower than the human performance of 78.24%. These results highlight the need for further development of robust agents, that current state-of-the-art large language models are far from perfect performance in these real-life tasks, and that WebArena can be used to measure such progress.

hub tools

JSON dossier citing papers JSON arXiv source

citation-role summary

background 35 dataset 4 baseline 2 method 1

citation-polarity summary

background 32 use dataset 4 baseline 2 support 2 unclear 1 use method 1

claims ledger

abstract With advances in generative AI, there is now potential for autonomous agents to manage daily tasks via natural language commands. However, current agents are primarily created and tested in simplified synthetic environments, leading to a disconnect with real-world scenarios. In this paper, we build an environment for language-guided agents that is highly realistic and reproducible. Specifically, we focus on agents that perform tasks on the web, and create an environment with fully functional websites from four common domains: e-commerce, social forum discussions, collaborative software develop

authors

Abishek Sridhar Frank F. Xu Hao Zhu Robert Lo Shuyan Zhou Xuhui Zhou

co-cited works

representative citing papers

EnergyAgentBench: Benchmarking LLM Agents on Live Energy Infrastructure Data

econ.EM · 2026-05-13 · accept · novelty 8.0

EnergyAgentBench is a new benchmark with 70 task variants that evaluates LLM agents on live energy data for datacenter siting, long-horizon optimization, and causal grid diagnosis.

MedMemoryBench: Benchmarking Agent Memory in Personalized Healthcare

cs.AI · 2026-05-12 · conditional · novelty 8.0

MedMemoryBench supplies a 2,000-session synthetic medical trajectory dataset and an evaluate-while-constructing streaming protocol to expose memory saturation and reasoning failures in current agent architectures for personalized healthcare.

Agent-BRACE: Decoupling Beliefs from Actions in Long-Horizon Tasks via Verbalized State Uncertainty

cs.CL · 2026-05-12 · unverdicted · novelty 8.0

Agent-BRACE improves LLM agent performance on long-horizon partially observable tasks by 5.3-14.5% through a decoupled belief state of verbalized atomic claims with certainty labels that keeps context length constant.

WildClawBench: A Benchmark for Real-World, Long-Horizon Agent Evaluation

cs.CL · 2026-05-11 · unverdicted · novelty 8.0

A new native-runtime benchmark reveals that current frontier AI agents succeed on at most 62 percent of realistic long-horizon CLI tasks.

WindowsWorld: A Process-Centric Benchmark of Autonomous GUI Agents in Professional Cross-Application Environments

cs.AI · 2026-04-30 · accept · novelty 8.0

WindowsWorld benchmark shows leading GUI agents achieve under 21% success on multi-application professional tasks, with failures especially on conditional judgment across three or more apps and inefficient execution.

MCP-Atlas: A Large-Scale Benchmark for Tool-Use Competency with Real MCP Servers

cs.SE · 2026-01-31 · accept · novelty 8.0 · 2 refs

MCP-Atlas is a new benchmark with 1000 tasks on production MCP servers that uses claim-level scoring to evaluate LLM agents on realistic multi-step tool-use competency.

AgentDojo: A Dynamic Environment to Evaluate Prompt Injection Attacks and Defenses for LLM Agents

cs.CR · 2024-06-19 · unverdicted · novelty 8.0

AgentDojo introduces an extensible evaluation framework populated with realistic agent tasks and security test cases to measure prompt injection robustness in tool-using LLM agents.

OSWorld: Benchmarking Multimodal Agents for Open-Ended Tasks in Real Computer Environments

cs.AI · 2024-04-11 · accept · novelty 8.0

OSWorld provides the first unified real-computer benchmark for open-ended multimodal agent tasks, exposing large performance gaps between humans and state-of-the-art LLM/VLM agents.

Whose Side Is Your Agent On? Multi-Party Principal Loyalty in LLM Agents

cs.AI · 2026-06-29 · unverdicted · novelty 7.0

PrincipalBench exposes a sharp split in frontier LLMs between selective and over-refusing behavior on multi-party loyalty, with prompt scaffolding and KL distillation reducing harm rates but only along an existing leak/over-refusal trade-off.

SpreadsheetBench 2: Evaluating Agents on End-to-End Business Spreadsheet Workflows

cs.SE · 2026-06-29 · unverdicted · novelty 7.0

SpreadsheetBench 2 provides 321 expert-validated tasks from authentic business data showing frontier LLMs reach only 34.89% overall accuracy on end-to-end spreadsheet workflows.

Agent-Computer Observation Interfaces Enable Dynamic Computer Use

cs.AI · 2026-06-28 · conditional · novelty 7.0

AOI adds keyframe capture, volume-gated audio transcription, and visual narration to computer-use agents, producing +17 to +48 pp gains over screenshot baselines on DynaCU-Bench with no retraining.

SEATauBench: Adapting Tool-Agent-User Evaluation Into Low-Resource Southeast Asian Languages

cs.CL · 2026-06-27 · unverdicted · novelty 7.0

SEATauBench is the first agent benchmark for SEA languages, finding that performance holds for language-only changes but degrades sharply with full domain localization.

Same-Origin Policy for Agentic Browsers

cs.CR · 2026-06-12 · unverdicted · novelty 7.0

The paper builds SOPBench showing frequent SOP violations in agentic browsers and introduces SOPGuard to enforce the policy with low overhead in BrowserOS.

OR-Space: A Full-Lifecycle Workspace Benchmark for Industrial Optimization Agents

cs.AI · 2026-05-27 · unverdicted · novelty 7.0

OR-Space is a benchmark for LLM agents performing full-lifecycle optimization tasks across Build, Revise, and Explain modes in executable multi-artifact workspaces.

AndroidDaily: A Verifiable Benchmark for Mobile GUI Agents on Real-World Closed-Source Applications

cs.CV · 2026-05-26 · unverdicted · novelty 7.0

AndroidDaily supplies 350 verifiable tasks on 94 closed-source Android apps evaluated by GRADE (87.37% human agreement), with the strongest model achieving 62% success.

AgenticVBench: Can AI Agents Complete Real-World Post-Production Tasks?

cs.CR · 2026-05-26 · unverdicted · novelty 7.0

AgenticVBench evaluates frontier VLMs on 100 real-world video post-production tasks across four families, with the best agent stack scoring just over 30% versus human experts.

LogDx-CI: Benchmarking Log Reduction Tools for LLM Root-Cause Diagnosis

cs.SE · 2026-05-26 · conditional · novelty 7.0

LogDx-CI benchmark shows hybrid grep+tail reducers achieve top diagnosis quality at low cost, agent loops shrink quality variance across reducers, and cross-family LLM summarizers outperform same-family pairs.

JobBench: Aligning Agent Work With Human Will

cs.AI · 2026-05-25 · unverdicted · novelty 7.0

JobBench is a new benchmark with 130 occupational tasks where the best of 36 tested AI models achieves only 45.9% success.

CyberEvolver: Structured Self-Evolution for Cybersecurity Agents On the Fly

cs.CR · 2026-05-25 · unverdicted · novelty 7.0

CyberEvolver introduces a four-layer self-evolving agent architecture with trace-to-diagnosis and population beam search that raises seed agent success rates by 13.6% on CTF, exploitation, and penetration tasks across four LLMs.

ScaleWoB: Guiding GUI Agents with Coding Agents via Large-Scale Environmental Synthesis

cs.AI · 2026-05-24 · unverdicted · novelty 7.0

ScaleWoB generates 100+ synthetic interactive GUI environments and 1000+ verifiable tasks as web pages, releasing a 120-task mobile benchmark where state-of-the-art agents achieve 27.92% success (17.82% on long-horizon tasks) versus 92.08% for humans, with synthetic results generalizing to real apps

VISTA: An End-to-End Benchmark for Visual Spec-to-Web-App Coding Agents

cs.SE · 2026-05-22 · unverdicted · novelty 7.0

VISTA is a new benchmark for end-to-end visual spec-to-web-app generation by LLM agents, featuring five prompt conditions, manual UI annotations, multi-metric evaluation, and results on four agent systems showing partial decoupling of visual and functional performance.

Boiling the Frog: A Multi-Turn Benchmark for Agentic Safety

cs.CL · 2026-05-21 · unverdicted · novelty 7.0 · 2 refs

Boiling the Frog is a new stateful multi-turn benchmark that finds an aggregate 44.4% strict attack success rate for incremental safety violations across nine AI models, with rates ranging from 20.5% to 92.9%.

IdleSpec: Exploiting Idle Time via Speculative Planning for LLM Agents

cs.AI · 2026-05-21 · conditional · novelty 7.0

IdleSpec improves LLM agent accuracy by generating and aggregating speculative plans during idle time between tool calls and observations using complementary drafting strategies.

MemGym: a Long-Horizon Memory Environment for LLM Agents

cs.CL · 2026-05-20 · unverdicted · novelty 7.0

MemGym unifies agent gyms into a memory benchmark with isolated scoring across tool-use, research, coding, and computer-use regimes plus a lightweight reward model for tractable coding evaluation.

citing papers explorer

Showing 50 of 197 citing papers after filters.

EnergyAgentBench: Benchmarking LLM Agents on Live Energy Infrastructure Data econ.EM · 2026-05-13 · accept · none · ref 17 · internal anchor
EnergyAgentBench is a new benchmark with 70 task variants that evaluates LLM agents on live energy data for datacenter siting, long-horizon optimization, and causal grid diagnosis.
MedMemoryBench: Benchmarking Agent Memory in Personalized Healthcare cs.AI · 2026-05-12 · conditional · none · ref 44 · internal anchor
MedMemoryBench supplies a 2,000-session synthetic medical trajectory dataset and an evaluate-while-constructing streaming protocol to expose memory saturation and reasoning failures in current agent architectures for personalized healthcare.
Agent-BRACE: Decoupling Beliefs from Actions in Long-Horizon Tasks via Verbalized State Uncertainty cs.CL · 2026-05-12 · unverdicted · none · ref 24 · internal anchor
Agent-BRACE improves LLM agent performance on long-horizon partially observable tasks by 5.3-14.5% through a decoupled belief state of verbalized atomic claims with certainty labels that keeps context length constant.
WildClawBench: A Benchmark for Real-World, Long-Horizon Agent Evaluation cs.CL · 2026-05-11 · unverdicted · none · ref 59 · internal anchor
A new native-runtime benchmark reveals that current frontier AI agents succeed on at most 62 percent of realistic long-horizon CLI tasks.
WindowsWorld: A Process-Centric Benchmark of Autonomous GUI Agents in Professional Cross-Application Environments cs.AI · 2026-04-30 · accept · none · ref 7 · internal anchor
WindowsWorld benchmark shows leading GUI agents achieve under 21% success on multi-application professional tasks, with failures especially on conditional judgment across three or more apps and inefficient execution.
MCP-Atlas: A Large-Scale Benchmark for Tool-Use Competency with Real MCP Servers cs.SE · 2026-01-31 · accept · none · ref 3 · 2 links · internal anchor
MCP-Atlas is a new benchmark with 1000 tasks on production MCP servers that uses claim-level scoring to evaluate LLM agents on realistic multi-step tool-use competency.
Whose Side Is Your Agent On? Multi-Party Principal Loyalty in LLM Agents cs.AI · 2026-06-29 · unverdicted · none · ref 45 · internal anchor
PrincipalBench exposes a sharp split in frontier LLMs between selective and over-refusing behavior on multi-party loyalty, with prompt scaffolding and KL distillation reducing harm rates but only along an existing leak/over-refusal trade-off.
SpreadsheetBench 2: Evaluating Agents on End-to-End Business Spreadsheet Workflows cs.SE · 2026-06-29 · unverdicted · none · ref 44 · internal anchor
SpreadsheetBench 2 provides 321 expert-validated tasks from authentic business data showing frontier LLMs reach only 34.89% overall accuracy on end-to-end spreadsheet workflows.
Agent-Computer Observation Interfaces Enable Dynamic Computer Use cs.AI · 2026-06-28 · conditional · none · ref 25 · internal anchor
AOI adds keyframe capture, volume-gated audio transcription, and visual narration to computer-use agents, producing +17 to +48 pp gains over screenshot baselines on DynaCU-Bench with no retraining.
SEATauBench: Adapting Tool-Agent-User Evaluation Into Low-Resource Southeast Asian Languages cs.CL · 2026-06-27 · unverdicted · none · ref 14 · internal anchor
SEATauBench is the first agent benchmark for SEA languages, finding that performance holds for language-only changes but degrades sharply with full domain localization.
Same-Origin Policy for Agentic Browsers cs.CR · 2026-06-12 · unverdicted · none · ref 16 · internal anchor
The paper builds SOPBench showing frequent SOP violations in agentic browsers and introduces SOPGuard to enforce the policy with low overhead in BrowserOS.
OR-Space: A Full-Lifecycle Workspace Benchmark for Industrial Optimization Agents cs.AI · 2026-05-27 · unverdicted · none · ref 51 · internal anchor
OR-Space is a benchmark for LLM agents performing full-lifecycle optimization tasks across Build, Revise, and Explain modes in executable multi-artifact workspaces.
AndroidDaily: A Verifiable Benchmark for Mobile GUI Agents on Real-World Closed-Source Applications cs.CV · 2026-05-26 · unverdicted · none · ref 64 · internal anchor
AndroidDaily supplies 350 verifiable tasks on 94 closed-source Android apps evaluated by GRADE (87.37% human agreement), with the strongest model achieving 62% success.
AgenticVBench: Can AI Agents Complete Real-World Post-Production Tasks? cs.CR · 2026-05-26 · unverdicted · none · ref 13 · internal anchor
AgenticVBench evaluates frontier VLMs on 100 real-world video post-production tasks across four families, with the best agent stack scoring just over 30% versus human experts.
LogDx-CI: Benchmarking Log Reduction Tools for LLM Root-Cause Diagnosis cs.SE · 2026-05-26 · conditional · none · ref 8 · internal anchor
LogDx-CI benchmark shows hybrid grep+tail reducers achieve top diagnosis quality at low cost, agent loops shrink quality variance across reducers, and cross-family LLM summarizers outperform same-family pairs.
JobBench: Aligning Agent Work With Human Will cs.AI · 2026-05-25 · unverdicted · none · ref 46 · internal anchor
JobBench is a new benchmark with 130 occupational tasks where the best of 36 tested AI models achieves only 45.9% success.
CyberEvolver: Structured Self-Evolution for Cybersecurity Agents On the Fly cs.CR · 2026-05-25 · unverdicted · none · ref 79 · internal anchor
CyberEvolver introduces a four-layer self-evolving agent architecture with trace-to-diagnosis and population beam search that raises seed agent success rates by 13.6% on CTF, exploitation, and penetration tasks across four LLMs.
ScaleWoB: Guiding GUI Agents with Coding Agents via Large-Scale Environmental Synthesis cs.AI · 2026-05-24 · unverdicted · none · ref 31 · internal anchor
ScaleWoB generates 100+ synthetic interactive GUI environments and 1000+ verifiable tasks as web pages, releasing a 120-task mobile benchmark where state-of-the-art agents achieve 27.92% success (17.82% on long-horizon tasks) versus 92.08% for humans, with synthetic results generalizing to real apps
VISTA: An End-to-End Benchmark for Visual Spec-to-Web-App Coding Agents cs.SE · 2026-05-22 · unverdicted · none · ref 14 · internal anchor
VISTA is a new benchmark for end-to-end visual spec-to-web-app generation by LLM agents, featuring five prompt conditions, manual UI annotations, multi-metric evaluation, and results on four agent systems showing partial decoupling of visual and functional performance.
Boiling the Frog: A Multi-Turn Benchmark for Agentic Safety cs.CL · 2026-05-21 · unverdicted · none · ref 99 · 2 links · internal anchor
Boiling the Frog is a new stateful multi-turn benchmark that finds an aggregate 44.4% strict attack success rate for incremental safety violations across nine AI models, with rates ranging from 20.5% to 92.9%.
IdleSpec: Exploiting Idle Time via Speculative Planning for LLM Agents cs.AI · 2026-05-21 · conditional · none · ref 5 · internal anchor
IdleSpec improves LLM agent accuracy by generating and aggregating speculative plans during idle time between tool calls and observations using complementary drafting strategies.
MemGym: a Long-Horizon Memory Environment for LLM Agents cs.CL · 2026-05-20 · unverdicted · none · ref 60 · internal anchor
MemGym unifies agent gyms into a memory benchmark with isolated scoring across tool-use, research, coding, and computer-use regimes plus a lightweight reward model for tractable coding evaluation.
Llamas on the Web: Memory-Efficient, Performance-Portable, and Multi-Precision LLM Inference with WebGPU cs.DC · 2026-05-20 · conditional · none · ref 71 · internal anchor
LlamaWeb is a WebGPU backend for llama.cpp that uses static memory planning, tunable kernels, and templated multi-precision support to cut memory use by 29-33% and raise decode throughput by 45-69% versus prior browser frameworks on tested hardware.
Agent Meltdowns: The Road to Hell Is Paved with Helpful Agents cs.CL · 2026-05-18 · unverdicted · none · ref 31 · 2 links · internal anchor
The paper defines accidental meltdowns as unsafe agent behavior triggered by benign errors and reports that such meltdowns occur in 64.7% of evaluated rollouts across GPT, Grok, and Gemini agents.
SCICONVBENCH: Benchmarking LLMs on Multi-Turn Clarification for Task Formulation in Computational Science cs.AI · 2026-05-18 · unverdicted · none · ref 84 · internal anchor
SCICONVBENCH is a new benchmark evaluating LLMs on multi-turn disambiguation and inconsistency resolution for task formulation in computational science, with frontier models reaching only 52.7% success on fluid mechanics disambiguation cases.
Skim: Speculative Execution for Fast and Efficient Web Agents cs.AI · 2026-05-15 · unverdicted · none · ref 35 · internal anchor
Skim profiles website patterns offline to enable fast-path speculative execution for web agents, cutting median cost by 1.9x and latency by 33.4% with no accuracy loss on benchmarks.
Known By Their Actions: Fingerprinting LLM Browser Agents via UI Traces cs.CR · 2026-05-14 · unverdicted · none · ref 10 · internal anchor
UI traces of actions and timings from LLM browser agents enable identification of the underlying model with up to 96% F1 across 14 models and multiple tasks.
$\pi$-Bench: Evaluating Proactive Personal Assistant Agents in Long-Horizon Workflows cs.AI · 2026-05-14 · unverdicted · none · ref 49 · 2 links · internal anchor
π-Bench is a new benchmark for evaluating proactive personal assistant agents on 100 multi-turn tasks that include hidden intents, inter-task dependencies, and cross-session continuity.
Beyond Binary: Reframing GUI Critique as Continuous Semantic Alignment cs.LG · 2026-05-14 · unverdicted · none · ref 132 · 2 links · internal anchor
BBCritic reframes GUI critique as continuous semantic alignment via contrastive learning in an affordance space, outperforming larger binary SOTA models on a new four-level hierarchical benchmark without extra annotations.
ClawForge: Generating Executable Interactive Benchmarks for Command-Line Agents cs.AI · 2026-05-13 · unverdicted · none · ref 20 · 2 links · internal anchor
ClawForge is a generator framework that creates reproducible executable benchmarks for command-line agents under state conflict, with ClawForge-Bench showing frontier models reach at most 45.3% strict accuracy and that state inspection drives most performance gaps.
MMSkills: Towards Multimodal Skills for General Visual Agents cs.AI · 2026-05-13 · unverdicted · none · ref 39 · 3 links · internal anchor
MMSkills packages multimodal procedural knowledge into state-conditioned skills with text, state cards, and multi-view keyframes, generated from public trajectories via an agentic process and used at inference via branch-loaded inspection to improve visual agents on GUI and game benchmarks.
AgentLens: Revealing The Lucky Pass Problem in SWE-Agent Evaluation cs.SE · 2026-05-13 · unverdicted · none · ref 51 · internal anchor
AgentLens reveals 10.7% of passing SWE-agent trajectories exhibit Lucky Pass behaviors and introduces a process-level evaluation framework with a new annotated dataset of 1,815 trajectories.
State-Centric Decision Process cs.AI · 2026-05-12 · unverdicted · none · ref 63 · internal anchor
SDP constructs a task-induced state space from raw text by having agents commit to and certify natural-language predicates as states, enabling structured planning and analysis in unstructured language environments.
Do Androids Dream of Breaking the Game? Systematically Auditing AI Agent Benchmarks with BenchJack cs.AI · 2026-05-12 · conditional · none · ref 65 · internal anchor
BenchJack audits 10 AI agent benchmarks, synthesizes exploits achieving near-perfect scores without task completion, surfaces 219 flaws, and reduces hackable-task ratios to under 10% on four benchmarks via iterative patching.
Covering Human Action Space for Computer Use: Data Synthesis and Benchmark cs.CV · 2026-05-12 · unverdicted · none · ref 37 · internal anchor
Presents CUActSpot benchmark and renderer-LLM data synthesis that lets a 4B model outperform larger open-source models on complex computer interactions.
SkillSafetyBench: Evaluating Agent Safety under Skill-Facing Attack Surfaces cs.CR · 2026-05-12 · unverdicted · none · ref 94 · internal anchor
SkillSafetyBench is a benchmark of 155 cases across 47 tasks and 6 risk domains showing that non-user attacks via skills, artifacts, or environments can consistently induce unsafe agent behavior.
Checkup2Action: A Multimodal Clinical Check-up Report Dataset for Patient-Oriented Action Card Generation cs.CL · 2026-05-12 · conditional · none · ref 33 · 2 links · internal anchor
Checkup2Action is a new multimodal dataset and benchmark for generating safe, prioritized action cards from real-world clinical check-up reports using large language models.
Can a Single Message Paralyze the AI Infrastructure? The Rise of AbO-DDoS Attacks through Targeted Mobius Injection cs.CR · 2026-05-12 · unverdicted · none · ref 9 · internal anchor
Mobius Injection exploits semantic closure in LLM agents to enable single-message AbO-DDoS attacks achieving up to 51x call amplification and 229x latency inflation.
TacoMAS: Test-Time Co-Evolution of Topology and Capability in LLM-based Multi-Agent Systems cs.CL · 2026-05-10 · unverdicted · none · ref 28 · internal anchor
TacoMAS performs test-time co-evolution of agent capabilities and communication topology in LLM multi-agent systems via fast capability updates and slow meta-LLM topology edits, delivering 13.3% average gains over strong baselines on four benchmarks.
AgentForesight: Online Auditing for Early Failure Prediction in Multi-Agent Systems cs.CL · 2026-05-09 · unverdicted · none · ref 68 · 2 links · internal anchor
AgentForesight introduces an online auditor model that predicts decisive errors in multi-agent trajectories at the earliest step using a coarse-to-fine reinforcement learning recipe on a new curated dataset AFTraj-2K.
CyBiasBench: Benchmarking Bias in LLM Agents for Cyber-Attack Scenarios cs.CR · 2026-05-08 · unverdicted · none · ref 35 · internal anchor
LLM agents exhibit persistent attack-selection biases as fixed traits independent of success rates, with a bias momentum effect that resists steering and yields no performance gain.
When Stored Evidence Stops Being Usable: Scale-Conditioned Evaluation of Agent Memory cs.AI · 2026-05-08 · unverdicted · none · ref 38 · internal anchor
A new evaluation protocol shows agent memory reliability degrades variably with added irrelevant sessions depending on agent, memory interface, and scale.
Can Agents Price a Reaction? Evaluating LLMs on Chemical Cost Reasoning cs.AI · 2026-05-08 · unverdicted · none · ref 38 · internal anchor
LLM agents reach only 50.6% accuracy on chemical cost estimation within 25% error even with tools, dropping with noise due to parsing, pack selection, and tool-use failures.
Weblica: Scalable and Reproducible Training Environments for Visual Web Agents cs.AI · 2026-05-07 · unverdicted · none · ref 50 · internal anchor
Weblica scales RL training for visual web agents by building thousands of reproducible environments through HTTP caching for stable replays and LLM synthesis from real sites, yielding an 8B model that beats similar open baselines on navigation benchmarks.
Inference-Time Budget Control for LLM Search Agents cs.AI · 2026-05-07 · unverdicted · none · ref 77 · internal anchor
A VOI-based controller for dual inference budgets improves multi-hop QA performance by prioritizing search actions and selectively finalizing answers.
DV-World: Benchmarking Data Visualization Agents in Real-World Scenarios cs.CL · 2026-04-28 · unverdicted · none · ref 1 · internal anchor
DV-World is a benchmark of 260 tasks across spreadsheet manipulation, visual evolution, and interactive intent alignment that shows state-of-the-art AI models achieve less than 50% overall performance on real-world data visualization challenges.
ClawMark: A Living-World Benchmark for Multi-Turn, Multi-Day, Multimodal Coworker Agents cs.CV · 2026-04-26 · unverdicted · none · ref 6 · internal anchor
ClawMark is a new benchmark for multi-turn multi-day multimodal coworker agents in stateful evolving services, with deterministic Python checkers showing frontier models achieve only 20% strict task success.
A Systematic Survey of Security Threats and Defenses in LLM-Based AI Agents: A Layered Attack Surface Framework cs.CR · 2026-04-25 · unverdicted · none · ref 96 · internal anchor
A new 7x4 taxonomy organizes agentic AI security threats by architectural layer and persistence timescale, revealing under-explored upper layers and missing defenses after surveying 116 papers.
AJ-Bench: Benchmarking Agent-as-a-Judge for Environment-Aware Evaluation cs.AI · 2026-04-20 · unverdicted · none · ref 7 · internal anchor
AJ-Bench provides 155 tasks in three domains to evaluate environment-interacting agent judges, showing performance gains over LLM-as-a-Judge but exposing remaining verification challenges.
The Moltbook Observatory Archive: an incremental dataset of agent-only social network activity cs.SI · 2026-04-16 · unverdicted · none · ref 11 · internal anchor
The Moltbook Observatory Archive is the first large-scale dataset from a social network populated exclusively by autonomous AI agents, covering 78 days with 2.6 million posts and 1.2 million comments.

WebArena: A Realistic Web Environment for Building Autonomous Agents

hub tools

citation-role summary

citation-polarity summary

claims ledger

authors

co-cited works

fields

years

verdicts

roles

polarities

representative citing papers

citing papers explorer