HealthAdminBench evaluates LLM computer-use agents on healthcare admin tasks and finds only 36.3% end-to-end success for the best agent despite 82.8% subtask success, revealing a substantial reliability gap.
hub
REAL: Benchmarking Autonomous Agents on Deterministic Simulations of Real Websites.arXiv preprint arXiv:2504.11543, April 2025
11 Pith papers cite this work. Polarity classification is still indexing.
hub tools
representative citing papers
The paper defines accidental meltdowns as unsafe agent behavior triggered by benign errors and reports that such meltdowns occur in 64.7% of evaluated rollouts across GPT, Grok, and Gemini agents.
FlowEval evaluates generated UIs by measuring how closely their navigation flows match real websites via reference-based similarity metrics and shows strong correlation with human expert judgments.
ClawBench is a benchmark of 153 live-web tasks where AI agents achieve low success rates, e.g. 33.3% for Claude Sonnet 4.6.
RAG, MCP, and NLWeb interfaces let LLM web agents achieve higher F1 scores (0.75-0.77 vs 0.67) and much lower token usage and runtime than HTML in controlled e-commerce tasks.
WebMall is the first offline multi-shop benchmark for evaluating LLM web agents on complex comparison shopping tasks across heterogeneous product data from multiple simulated e-shops.
A blind replay script matches frontier model performance on static CUA benchmarks due to non-principled environments and evaluation methods, prompting PRISM design principles and the DigiWorld benchmark with improved statistical aggregation.
Small open-weight models match GPT-5 on routine agent tool-use tasks but lag on long-horizon planning, supporting tiered routing to reduce costs in agentic systems.
WebFactory is a fully automated RL pipeline that compresses LLM-encoded internet knowledge into grounded web agents, achieving performance comparable to human-annotated training but using synthetic data from only 10 websites.
PRAXIS enables AI agents to acquire procedural knowledge in real time by indexing and retrieving state-action-result experiences, leading to better accuracy, reliability, and efficiency on web browsing benchmarks.
citing papers explorer
-
HealthAdminBench: Evaluating Computer-Use Agents on Healthcare Administration Tasks
HealthAdminBench evaluates LLM computer-use agents on healthcare admin tasks and finds only 36.3% end-to-end success for the best agent despite 82.8% subtask success, revealing a substantial reliability gap.
-
Agent Meltdowns: The Road to Hell Is Paved with Helpful Agents
The paper defines accidental meltdowns as unsafe agent behavior triggered by benign errors and reports that such meltdowns occur in 64.7% of evaluated rollouts across GPT, Grok, and Gemini agents.
-
FlowEval: Reference-based Evaluation of Generated User Interfaces
FlowEval evaluates generated UIs by measuring how closely their navigation flows match real websites via reference-based similarity metrics and shows strong correlation with human expert judgments.
-
ClawBench: Can AI Agents Complete Everyday Online Tasks?
ClawBench is a benchmark of 153 live-web tasks where AI agents achieve low success rates, e.g. 33.3% for Claude Sonnet 4.6.
-
MCP vs RAG vs NLWeb vs HTML: A Comparison of the Effectiveness and Efficiency of Different Agent Interfaces to the Web (Technical Report)
RAG, MCP, and NLWeb interfaces let LLM web agents achieve higher F1 scores (0.75-0.77 vs 0.67) and much lower token usage and runtime than HTML in controlled e-commerce tasks.
-
WebMall -- A Multi-Shop Benchmark for Evaluating Web Agents
WebMall is the first offline multi-shop benchmark for evaluating LLM web agents on complex comparison shopping tasks across heterogeneous product data from multiple simulated e-shops.
-
Computer Use at the Edge of the Statistical Precipice
A blind replay script matches frontier model performance on static CUA benchmarks due to non-principled environments and evaluation methods, prompting PRISM design principles and the DigiWorld benchmark with improved statistical aggregation.
-
AgentFloor: How Far Up the tool use Ladder Can Small Open-Weight Models Go?
Small open-weight models match GPT-5 on routine agent tool-use tasks but lag on long-horizon planning, supporting tiered routing to reduce costs in agentic systems.
-
WebFactory: Automated Compression of Foundational Language Intelligence into Grounded Web Agents
WebFactory is a fully automated RL pipeline that compresses LLM-encoded internet knowledge into grounded web agents, achieving performance comparable to human-annotated training but using synthetic data from only 10 websites.
-
Real-Time Procedural Learning From Experience for AI Agents
PRAXIS enables AI agents to acquire procedural knowledge in real time by indexing and retrieving state-action-result experiences, leading to better accuracy, reliability, and efficiency on web browsing benchmarks.
- MobiBench: Multi-Branch, Modular Benchmark for Mobile GUI Agents