EnterpriseClawBench is a benchmark for enterprise agents built from proprietary real-world sessions; its reusable contribution is the construction and evaluation protocol rather than the data itself.
Entworld: A holistic environment and benchmark for verifiable enterprise GUI agents
4 Pith papers cite this work. Polarity classification is still indexing.
citation summary
years: 2026 (4)
verdicts: UNVERDICTED (4)
roles: background (1)
polarities: background (1)
representative citing papers
GauntletBench reveals frontier AI agents achieve 19.1% success on 100 tasks in video editing, 3D modeling, and similar tools versus over 80% for humans, exposing limitations in overlooked capabilities.
EntCollabBench shows that today's LLM agents still struggle with delegation, context transfer, parameter grounding, workflow closure, and decision commitment when tested in a simulated enterprise with 11 role-specialized agents.
WebForge is an automated multi-agent framework that creates realistic and reproducible browser agent benchmarks at scale, demonstrated via a 934-task benchmark that reveals distinct model capability profiles through multi-dimensional difficulty analysis.
citing papers explorer
- Running the Gauntlet: Re-evaluating the Capabilities of Agents Beyond Familiar Environments