Entworld: A holistic environment and benchmark for verifiable enterprise GUI agents

Ying Mo, Yu Bai, Dapeng Sun, Yuqian Shi, Yukai Miao, Li Chen, Dan Li · 2026 · arXiv 2601.17722

4 Pith papers cite this work. Polarity classification is still indexing.

citation-role summary

background 1

citation-polarity summary

background 1

representative citing papers

EnterpriseClawBench: Benchmarking Agents from Real Workplace Sessions

cs.CL · 2026-06-22 · unverdicted · novelty 7.0

EnterpriseClawBench is a benchmark for enterprise agents constructed from proprietary real-world sessions, with the reusable contribution being the construction and evaluation protocol rather than the data itself.

Running the Gauntlet: Re-evaluating the Capabilities of Agents Beyond Familiar Environments

cs.LG · 2026-06-12 · unverdicted · novelty 7.0

GauntletBench reveals frontier AI agents achieve 19.1% success on 100 tasks in video editing, 3D modeling, and similar tools versus over 80% for humans, exposing limitations in overlooked capabilities.

Beyond the All-in-One Agent: Benchmarking Role-Specialized Multi-Agent Collaboration in Enterprise Workflows

cs.MA · 2026-05-09 · unverdicted · novelty 7.0

EntCollabBench shows that today's LLM agents still struggle with delegation, context transfer, parameter grounding, workflow closure, and decision commitment when tested in a simulated enterprise with 11 role-specialized agents.

WebForge: Breaking the Realism-Reproducibility-Scalability Trilemma in Browser Agent Benchmark

cs.AI · 2026-04-13 · unverdicted · novelty 7.0

WebForge is an automated multi-agent framework that creates realistic and reproducible browser agent benchmarks at scale, demonstrated via a 934-task benchmark that reveals distinct model capability profiles through multi-dimensional difficulty analysis.

citing papers explorer

Showing 4 of 4 citing papers.

EnterpriseClawBench: Benchmarking Agents from Real Workplace Sessions · cs.CL · 2026-06-22 · unverdicted · none · ref 8
Running the Gauntlet: Re-evaluating the Capabilities of Agents Beyond Familiar Environments · cs.LG · 2026-06-12 · unverdicted · none · ref 22
Beyond the All-in-One Agent: Benchmarking Role-Specialized Multi-Agent Collaboration in Enterprise Workflows · cs.MA · 2026-05-09 · unverdicted · none · ref 11
WebForge: Breaking the Realism-Reproducibility-Scalability Trilemma in Browser Agent Benchmark · cs.AI · 2026-04-13 · unverdicted · none · ref 19
