TheAgentCompany: Benchmarking LLM Agents on Consequential Real World Tasks

Frank F. Xu, Yufan Song, Boxuan Li, Yuxuan Tang, Kritanjali Jain, Mengxue Bao · 2024 · cs.CL · arXiv 2412.14161

Canonical reference. 100% of citing Pith papers cite this work as background.

46 Pith papers citing it

Background: 100% of classified citations

abstract

We interact with computers every day, whether in daily life or at work, and many aspects of work can be done entirely with access to a computer and the Internet. At the same time, thanks to improvements in large language models (LLMs), there has also been rapid development in AI agents that interact with and effect change in their surrounding environments. But how performant are AI agents at accelerating or even autonomously performing work-related tasks? The answer to this question has important implications both for industry looking to adopt AI into their workflows and for economic policy to understand the effects that adoption of AI may have on the labor market. To measure the progress of these LLM agents on real-world professional tasks, in this paper we introduce TheAgentCompany, an extensible benchmark for evaluating AI agents that interact with the world in similar ways to those of a digital worker: by browsing the Web, writing code, running programs, and communicating with other coworkers. We build a self-contained environment with internal websites and data that mimics a small software company environment, and create a variety of tasks that may be performed by workers in such a company. We test baseline agents powered by both closed API-based and open-weights language models (LMs), and find that the most competitive agent can complete 30% of tasks autonomously. This paints a nuanced picture of task automation with LM agents: in a setting simulating a real workplace, a good portion of simpler tasks could be solved autonomously, but more difficult long-horizon tasks are still beyond the reach of current systems. We release code, data, environment, and experiments at https://the-agent-company.com.

citation-role summary

background 7

citation-polarity summary

background 7

representative citing papers

Continual Learning Bench: Evaluating Frontier AI Systems in Real-World Stateful Environments

cs.AI · 2026-06-04 · unverdicted · novelty 8.0

CL-Bench is the first expert-validated benchmark for continual learning in frontier LLMs across six real-world domains, showing that gains are limited and that naive in-context learning outperforms dedicated memory systems.

AutoLab: Can Frontier Models Solve Long-Horizon Auto Research and Engineering Tasks?

cs.AI · 2026-06-03 · unverdicted · novelty 8.0

AutoLab benchmark shows frontier models mostly fail at sustained iterative optimization due to premature termination, with persistence as the key success factor.

WildClawBench: A Benchmark for Real-World, Long-Horizon Agent Evaluation

cs.CL · 2026-05-11 · unverdicted · novelty 8.0

A new native-runtime benchmark reveals that current frontier AI agents succeed on at most 62 percent of realistic long-horizon CLI tasks.

CUJBench: Benchmarking LLM-Agent on Cross-Modal Failure Diagnosis from Browser to Backend

cs.SE · 2026-04-25 · unverdicted · novelty 8.0

CUJBench is the first benchmark for cross-modal LLM-agent failure diagnosis, reporting 19.7% accuracy and identifying evidence attribution as the core bottleneck across six models.

Coding Agents Are Guessing: Measuring Action-Boundary Violations in Underspecified DevOps Instructions

cs.SE · 2026-07-02 · unverdicted · novelty 7.0

UnderSpecBench shows coding agents guess and violate boundaries in 55.8-67.8% of underspecified DevOps tasks rather than clarifying or refusing.

EnterpriseClawBench: Benchmarking Agents from Real Workplace Sessions

cs.CL · 2026-06-22 · unverdicted · novelty 7.0

EnterpriseClawBench is a benchmark for enterprise agents constructed from proprietary real-world sessions, with the reusable contribution being the construction and evaluation protocol rather than the data itself.

MacAgentBench: Benchmarking AI Agents on Real-World macOS Desktop

cs.AI · 2026-06-21 · unverdicted · novelty 7.0

MacAgentBench is a new benchmark for macOS AI agents with 676 tasks, deterministic multi-checkpoint evaluation, and tests across frameworks showing skill libraries drive performance more than framework design.

Specialize Roles, Mix Deployments: Pushing the Cost-Accuracy Frontier of LLM Agent Teams

cs.MA · 2026-05-28 · unverdicted · novelty 7.0

AgentCARD benchmark shows heterogeneous LLM agent teams with mixed deployments reach the cost-accuracy frontier, delivering up to 44% higher accuracy or 12x lower cost than uniform teams, with domain-specific role bottlenecks.

PANDO: Efficient Multimodal AI Agents via Online Skill Distillation

cs.AI · 2026-05-24 · unverdicted · novelty 7.0

PANDO introduces an online skill-distillation method with a structured library, reflection, demotion, routing, compression, and cache-aware prompting that reaches 58.3% success on 910 VisualWebArena tasks using 58-61% fewer tokens than prior methods.

SkillSafetyBench: Evaluating Agent Safety under Skill-Facing Attack Surfaces

cs.CR · 2026-05-12 · unverdicted · novelty 7.0

SkillSafetyBench is a benchmark of 155 cases across 47 tasks and 6 risk domains showing that non-user attacks via skills, artifacts, or environments can consistently induce unsafe agent behavior.

Beyond the All-in-One Agent: Benchmarking Role-Specialized Multi-Agent Collaboration in Enterprise Workflows

cs.MA · 2026-05-09 · unverdicted · novelty 7.0

EntCollabBench shows that today's LLM agents still struggle with delegation, context transfer, parameter grounding, workflow closure, and decision commitment when tested in a simulated enterprise with 11 role-specialized agents.

SalesSim: Benchmarking and Aligning Multimodal Language Models as Retail User Simulators

cs.CL · 2026-05-08 · unverdicted · novelty 7.0

SalesSim benchmarks MLLMs as retail user simulators, finds gaps in persona adherence and over-persuasion, and introduces UserGRPO RL to raise decision alignment by 13.8%.

AcademiClaw: When Students Set Challenges for AI Agents

cs.AI · 2026-05-04 · unverdicted · novelty 7.0

AcademiClaw is a new benchmark of 80 student-sourced academic tasks where the best frontier AI agents achieve only a 55% pass rate.

ClawMark: A Living-World Benchmark for Multi-Turn, Multi-Day, Multimodal Coworker Agents

cs.CV · 2026-04-26 · unverdicted · novelty 7.0

ClawMark is a new benchmark for multi-turn multi-day multimodal coworker agents in stateful evolving services, with deterministic Python checkers showing frontier models achieve only 20% strict task success.

FinTrace: Holistic Trajectory-Level Evaluation of LLM Tool Calling for Long-Horizon Financial Tasks

cs.AI · 2026-04-11 · unverdicted · novelty 7.0

FinTrace supplies trajectory-level metrics for LLM financial tool calling, exposing gaps in information use and output quality, while its preference dataset enables DPO training that boosts intermediate metrics.

HiL-Bench (Human-in-Loop Benchmark): Do Agents Know When to Ask for Help?

cs.AI · 2026-04-10 · unverdicted · novelty 7.0

HiL-Bench shows frontier AI agents fail to ask for help on incomplete tasks, recovering only a fraction of full-information performance, but RL training on Ask-F1 reward improves judgment and transfers across domains.

OSWorld2.0: Benchmarking Computer Use Agents on Long-Horizon Real-World Tasks

cs.AI · 2026-06-28 · unverdicted · novelty 6.0

OSWorld 2.0 is a benchmark of 108 realistic long-horizon computer-use tasks where current agents achieve only 20.6% binary completion, struggling with state inference and constraint tracking.

PhoneBuddy: Training Open Models for Agentic Phone Use

cs.CL · 2026-06-22 · unverdicted · novelty 6.0

PhoneBuddy combines real-app and mock-app RL after shared SFT, raising real-phone task success from 36.67% to 45.33% and AndroidWorld from 60.3% to 83.2%.

ChainWorld: Composing Long-Horizon Desktop Workloads from Atomic OSWorld Tasks

cs.AI · 2026-06-19 · unverdicted · novelty 6.0

ChainWorld builds 347 chains from atomic OSWorld tasks and benchmarks four agents under single-turn and multi-turn protocols, reporting a maximum 31% completion rate with distinct failure profiles.

Offline Preference-Based Trajectory Evaluation

cs.LG · 2026-06-16 · unverdicted · novelty 6.0

Preference-based trajectory evaluation reduces tied comparisons from roughly 75% to 35% across agentic benchmarks by using temporal preferences over progress and return profiles.

The Containment Gap: How Deployed Agentic AI Frameworks Fail Public-Facing Safety Requirements

cs.AI · 2026-06-11 · unverdicted · novelty 6.0

No major agentic AI framework complies with six safety containment principles. A memory poisoning attack on LangChain causes persistent targeted errors, with up to 88.9% wrongful denials and a 3.5x increase under complex policies; two sub-millisecond validators fix the problem.

SWE-Marathon: Can Agents Autonomously Complete Ultra-Long-Horizon Software Work?

cs.SE · 2026-06-05 · unverdicted · novelty 6.0

SWE-Marathon benchmark of 20 ultra-long-horizon tasks shows frontier AI agents solve fewer than 30%, highlighting gaps in long-context planning and self-verification.

AgensFlow: A Coordination-Policy Substrate for Multi-Agent Systems

cs.MA · 2026-05-26 · unverdicted · novelty 6.0

AgensFlow learns coordination policies from task trajectories and outperforms fixed pipelines on distributed-systems incident and security-advisory tasks.

Anchor: Mitigating Artifact Drift in Agent Benchmark Generation

cs.AI · 2026-05-25 · unverdicted · novelty 6.0

Anchor generates consistent long-horizon agent tasks from parametric constraint programs, yielding ERP-Bench of 300 ERP tasks where frontier models reach optimal solutions in 17.4% of trials.
