A new native-runtime benchmark reveals that current frontier AI agents succeed on at most 62 percent of realistic long-horizon CLI tasks.
hub Canonical reference
TheAgentCompany: Benchmarking LLM Agents on Consequential Real World Tasks
Canonical reference. 100% of citing Pith papers cite this work as background.
abstract
We interact with computers on an everyday basis, be it in everyday life or work, and many aspects of work can be done entirely with access to a computer and the Internet. At the same time, thanks to improvements in large language models (LLMs), there has also been a rapid development in AI agents that interact with and affect change in their surrounding environments. But how performant are AI agents at accelerating or even autonomously performing work-related tasks? The answer to this question has important implications both for industry looking to adopt AI into their workflows and for economic policy to understand the effects that adoption of AI may have on the labor market. To measure the progress of these LLM agents' performance on performing real-world professional tasks, in this paper we introduce TheAgentCompany, an extensible benchmark for evaluating AI agents that interact with the world in similar ways to those of a digital worker: by browsing the Web, writing code, running programs, and communicating with other coworkers. We build a self-contained environment with internal web sites and data that mimics a small software company environment, and create a variety of tasks that may be performed by workers in such a company. We test baseline agents powered by both closed API-based and open-weights language models (LMs), and find that the most competitive agent can complete 30% of tasks autonomously. This paints a nuanced picture on task automation with LM agents--in a setting simulating a real workplace, a good portion of simpler tasks could be solved autonomously, but more difficult long-horizon tasks are still beyond the reach of current systems. We release code, data, environment, and experiments on https://the-agent-company.com.
hub tools
citation-role summary
citation-polarity summary
roles
background 7polarities
background 7representative citing papers
CUJBench is the first benchmark for cross-modal LLM-agent failure diagnosis, reporting 19.7% accuracy and identifying evidence attribution as the core bottleneck across six models.
AgentCARD benchmark shows heterogeneous LLM agent teams with mixed deployments reach the cost-accuracy frontier, delivering up to 44% higher accuracy or 12x lower cost than uniform teams, with domain-specific role bottlenecks.
PANDO introduces an online skill-distillation method with a structured library, reflection, demotion, routing, compression, and cache-aware prompting that reaches 58.3% success on 910 VisualWebArena tasks using 58-61% fewer tokens than prior methods.
EntCollabBench shows that today's LLM agents still struggle with delegation, context transfer, parameter grounding, workflow closure, and decision commitment when tested in a simulated enterprise with 11 role-specialized agents.
SalesSim benchmarks MLLMs as retail user simulators, finds gaps in persona adherence and over-persuasion, and introduces UserGRPO RL to raise decision alignment by 13.8%.
AcademiClaw is a new benchmark of 80 student-sourced academic tasks where the best frontier AI agents achieve only a 55% pass rate.
ClawMark is a new benchmark for multi-turn multi-day multimodal coworker agents in stateful evolving services, with deterministic Python checkers showing frontier models achieve only 20% strict task success.
FinTrace supplies trajectory-level metrics for LLM financial tool calling, exposing gaps in information use and output quality, while its preference dataset enables DPO training that boosts intermediate metrics.
HiL-Bench shows frontier AI agents fail to ask for help on incomplete tasks, recovering only a fraction of full-information performance, but RL training on Ask-F1 reward improves judgment and transfers across domains.
OSWorld 2.0 is a benchmark of 108 realistic long-horizon computer-use tasks where current agents achieve only 20.6% binary completion, struggling with state inference and constraint tracking.
Anchor generates consistent long-horizon agent tasks from parametric constraint programs, yielding ERP-Bench of 300 ERP tasks where frontier models reach optimal solutions in 17.4% of trials.
DITTO uses RL with verbal feedback to train LLMs for human behavior simulation, reporting 36% average gains over base models and outperforming GPT-5.4 on 6 of 10 SOUL benchmark tasks.
CHI-Bench shows current AI agents achieve at most 28% success on long-horizon healthcare workflows that require dense policy adherence, multi-role handoffs, and multi-turn interactions.
EduAgentBench is a new source-grounded benchmark that evaluates tutor agents across pedagogical judgment, situated multi-turn tutoring, and Canvas-style workflow completion, finding frontier models capable of basic judgment but inadequate for professional teaching standards.
A meta-skill authors and refines prose-and-code skills for agents by learning from post-deployment failures with an overfit audit, achieving 56.8% accuracy on SkillsBench tasks versus 43.6% for human-curated skills.
Agent benchmarks can report evidence-supported score bounds instead of single misleading success rates by adding a layer that checks required artifacts for outcome verification.
EnvTrustBench is a new agentic benchmark that measures evidence-grounding defects where LLM agents overtrust faulty environmental observations and take incorrect actions.
GAAP guarantees confidentiality of private user data for AI agents by enforcing user-specified permissions deterministically through persistent information flow tracking, without trusting the agent or requiring attack-free models.
LLMs corrupt an average of 25% of document content during long delegated editing workflows across 52 domains, even frontier models, and agentic tools do not mitigate the issue.
The paper maps agent memory research via three forms (token-level, parametric, latent), three functions (factual, experiential, working), and dynamics of formation/evolution/retrieval, plus benchmarks and future directions.
A decision-theoretic model based on the observed Confirmation-Diagnosis-Correction-Redo user pattern places intermediate confirmations in AI agent tasks, yielding 81% user preference and 13.54% faster completion versus confirm-at-end.
RExBench is a new benchmark showing that LLM coding agents fail to autonomously implement most realistic research extensions to prior AI papers.
MiniMax-M1 is a 456B parameter hybrid-attention MoE model trained with CISPO RL that achieves performance comparable or superior to DeepSeek-R1 and Qwen3-235B on reasoning and software engineering tasks while training in three weeks on 512 GPUs.
citing papers explorer
-
ClawMark: A Living-World Benchmark for Multi-Turn, Multi-Day, Multimodal Coworker Agents
ClawMark is a new benchmark for multi-turn multi-day multimodal coworker agents in stateful evolving services, with deterministic Python checkers showing frontier models achieve only 20% strict task success.