EComAgentBench is a new benchmark of 662 tasks that distributes hidden intent across sources and uses source-tagged rubrics; the strongest of seven tested models reaches only 57.1% accuracy.
2407.15711
13 Pith papers cite this work. Polarity classification is still indexing.
roles: background 1
polarities: background 1
representative citing papers
AutoMedBench evaluates AI agents on long-horizon medical workflows across five stages and finds validation and submission as dominant failure points based on thousands of runs.
Proposes a three-step benchmark design method (define work activity, specify tested setting, score work product) derived from work studies and O*NET, demonstrated via three case analyses.
Open-world evaluations using qualitative review of real-world tasks can give earlier warnings of frontier AI capabilities than automated benchmarks, as demonstrated by an AI agent publishing a simple iOS app with one minor human fix.
VLAA-GUI adds mandatory visual verifiers, multi-tier loop breakers, and on-demand search to GUI agents, reaching 77.5% on OSWorld and 61.0% on WindowsAgentArena with some models exceeding human performance.
A queueing framework segments vulnerability data with Gaussian mixture models, fits arrival/service/resource parameters by KL-divergence minimization, and reports 91-96% accuracy in estimating organizational cyber resources from timestamps.
Structured synthetic trajectory generation from Gemini 3 Pro enables a 9B open-weight model to reach 41.5% on WebArena, outperforming Claude 3.5 Sonnet and GPT-4o while generalizing to unseen enterprise environments.
RISK introduces a dataset, benchmark, and R1-style RL fine-tuning for GUI agents that achieve 6.8-8.8% offline gains and 70.5% online task success in e-commerce risk management using 7.2% of baseline parameters.
SiRA uses LLM world models for simulative reasoning to achieve up to 124% higher task completion and 32.2% navigation success versus reactive baselines in web environments.
AWM induces reusable workflows from agent experiences and provides them selectively to improve success rates by 24.6% on Mind2Web and 51.1% on WebArena while reducing steps taken.
A survey of evaluation methods for LLM-based agents from five perspectives, identifying trends toward realistic benchmarks and gaps in safety, cost-efficiency, and robustness.