hub

2407.15711 , archivePrefix=

AssistantBench: Can Web Agents Solve Realistic, Time-Consuming Tasks? , author= · 2024 · arXiv 2407.15711

13 Pith papers cite this work. Polarity classification is still indexing.

13 Pith papers citing it

read on arXiv browse 13 citing papers

hub tools

JSON dossier citing papers JSON arXiv source

citation-role summary

background 1

citation-polarity summary

background 1

representative citing papers

EComAgentBench: Benchmarking Shopping Agents on Long-Horizon Tasks with Distributed Hidden Intent

cs.AI · 2026-06-16 · unverdicted · novelty 7.0

EComAgentBench is a new benchmark with 662 tasks distributing hidden intent across sources and using source-tagged rubrics, where the strongest of seven tested models reaches only 57.1% accuracy.

AutoMedBench: Towards Medical AutoResearch with Agentic AI Models

cs.AI · 2026-06-01 · conditional · novelty 7.0

AutoMedBench evaluates AI agents on long-horizon medical workflows across five stages and finds validation and submission as dominant failure points based on thousands of runs.

Design and Report Benchmarks for Knowledge Work

cs.AI · 2026-05-22 · unverdicted · novelty 6.0

Proposes a three-step benchmark design method (define work activity, specify tested setting, score work product) derived from work studies and O*NET, demonstrated via three case analyses.

Open-World Evaluations for Measuring Frontier AI Capabilities

cs.AI · 2026-05-19 · conditional · novelty 6.0

Open-world evaluations using qualitative review of real-world tasks can give earlier warnings of frontier AI capabilities than automated benchmarks, as demonstrated by an AI agent publishing a simple iOS app with one minor human fix.

VLAA-GUI: Knowing When to Stop, Recover, and Search, A Modular Framework for GUI Automation

cs.CL · 2026-04-23 · conditional · novelty 6.0

VLAA-GUI adds mandatory visual verifiers, multi-tier loop breakers, and on-demand search to GUI agents, reaching 77.5% on OSWorld and 61.0% on WindowsAgentArena with some models exceeding human performance.

Organizational Security Resource Estimation via Vulnerability Queueing

cs.CR · 2026-04-11 · unverdicted · novelty 6.0

A queueing framework segments vulnerability data with Gaussian mixture models, fits arrival/service/resource parameters by KL-divergence minimization, and reports 91-96% accuracy in estimating organizational cyber resources from timestamps.

Structured Distillation of Web Agent Capabilities Enables Generalization

cs.LG · 2026-04-09 · unverdicted · novelty 6.0

Structured synthetic trajectory generation from Gemini 3 Pro enables a 9B open-weight model to reach 41.5% on WebArena, outperforming Claude 3.5 Sonnet and GPT-4o while generalizing to unseen enterprise environments.

RISK: A Framework for GUI Agents in E-commerce Risk Management

cs.AI · 2025-09-26 · unverdicted · novelty 6.0

RISK introduces a dataset, benchmark, and R1-style RL fine-tuning for GUI agents that achieve 6.8-8.8% offline gains and 70.5% online task success in e-commerce risk management using 7.2% of baseline parameters.

General Agentic Planning Through Simulative Reasoning with World Models

cs.AI · 2025-07-31 · conditional · novelty 6.0

SiRA uses LLM world models for simulative reasoning to achieve up to 124% higher task completion and 32.2% navigation success versus reactive baselines in web environments.

Agent Workflow Memory

cs.CL · 2024-09-11 · unverdicted · novelty 6.0

AWM induces reusable workflows from agent experiences and provides them selectively to improve success rates by 24.6% on Mind2Web and 51.1% on WebArena while reducing steps taken.

Survey on Evaluation of LLM-based Agents

cs.AI · 2025-03-20 · unverdicted · novelty 3.0

A survey of evaluation methods for LLM-based agents from five perspectives, identifying trends toward realistic benchmarks and gaps in safety, cost-efficiency, and robustness.

ClawEnvKit: Automatic Environment Generation for Claw-Like Agents

cs.AI · 2026-04-20

WebSP-Eval: Evaluating Web Agents on Website Security and Privacy Tasks

cs.CR · 2026-04-07

citing papers explorer

Showing 2 of 2 citing papers after filters.

ClawEnvKit: Automatic Environment Generation for Claw-Like Agents cs.AI · 2026-04-20 · unreviewed · ref 57
WebSP-Eval: Evaluating Web Agents on Website Security and Privacy Tasks cs.CR · 2026-04-07 · unreviewed · ref 54

2407.15711 , archivePrefix=

hub tools

citation-role summary

citation-polarity summary

fields

years

verdicts

roles

polarities

representative citing papers

citing papers explorer