TheAgentCompany: Benchmarking LLM Agents on Consequential Real World Tasks
Pith reviewed 2026-05-14 22:34 UTC · model grok-4.3
The pith
LLM agents autonomously complete only 30% of tasks in a simulated software company environment.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
We build a self-contained environment with internal web sites and data that mimics a small software company environment, and create a variety of tasks that may be performed by workers in such a company. We test baseline agents powered by both closed API-based and open-weights language models, and find that the most competitive agent can complete 30% of tasks autonomously. This paints a nuanced picture on task automation with LM agents--in a setting simulating a real workplace, a good portion of simpler tasks could be solved autonomously, but more difficult long-horizon tasks are still beyond the reach of current systems.
What carries the argument
TheAgentCompany benchmark: a simulated small software company with internal websites, data stores, and tasks that require web browsing, code writing, program execution, and coworker communication.
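To make that machinery concrete, below is a minimal Python sketch of how a checkpoint-scored task in such an environment could be defined. The names (Task, Checkpoint, evaluate) and the example task are illustrative assumptions for this review, not the benchmark's actual API.

```python
# Hypothetical sketch of a checkpoint-style task, not TheAgentCompany's actual API.
from dataclasses import dataclass, field
from typing import Callable

@dataclass
class Checkpoint:
    description: str
    # A verifier inspects a snapshot of environment state (files, service data,
    # chat logs) and reports whether this intermediate goal has been reached.
    verifier: Callable[[dict], bool]
    points: int = 1

@dataclass
class Task:
    name: str
    instructions: str
    checkpoints: list[Checkpoint] = field(default_factory=list)

    def evaluate(self, env_state: dict) -> tuple[bool, float]:
        """Return (fully_completed, partial_score in [0, 1])."""
        total = sum(c.points for c in self.checkpoints)
        earned = sum(c.points for c in self.checkpoints if c.verifier(env_state))
        return earned == total, (earned / total) if total else 0.0

# Example: a short-horizon task spanning code execution and coworker communication.
task = Task(
    name="fix-ci-and-notify",
    instructions="Fix the failing unit test in api-server, then tell the PM on chat.",
    checkpoints=[
        Checkpoint("test suite passes", lambda s: s.get("tests_passing", False)),
        Checkpoint("PM was notified", lambda s: "fixed" in s.get("chat_log", "")),
    ],
)
done, score = task.evaluate({"tests_passing": True, "chat_log": "fixed the test"})
```

Scoring against explicit verifiers of environment state, rather than a judge model, is what makes a benchmark like this repeatable across agents and over time.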
If this is right
- Simpler, short-horizon workplace tasks can be completed autonomously by current LLM agents.
- Long-horizon tasks that require chaining many steps remain beyond the capability of existing systems.
- The benchmark supplies a repeatable way to measure incremental gains in agent performance over time.
- Industry adoption of AI agents should initially target routine digital tasks where autonomous completion is already feasible.
Where Pith is reading between the lines
- If the simulation captures real workplace friction, companies could automate routine digital work to reallocate human effort toward judgment-intensive activities.
- Replicating the benchmark in other sectors such as law or finance would show whether automation ceilings are similar across professional domains.
- Fine-tuning agents on traces from the simulated company might raise success rates on the harder long-horizon tasks without changing the core architecture.
Load-bearing premise
The self-contained simulated environment with internal websites and data is representative enough of a real small software company to make the tasks consequential and realistic for professional work.
What would settle it
Running the identical agents on equivalent tasks inside an actual operating small software company and checking whether the autonomous success rate stays near 30%.
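As a sketch of what that check could look like quantitatively: with illustrative counts (not figures from the paper), a two-proportion z-test can ask whether a real-company success rate differs from the simulated one beyond sampling noise.

```python
# Illustrative comparison of simulated vs. real success rates; all counts are
# hypothetical, not reported by the paper.
from math import sqrt, erf

def two_proportion_z(k1: int, n1: int, k2: int, n2: int) -> tuple[float, float]:
    """Return (z statistic, two-sided p-value) for H0: p1 == p2."""
    p1, p2 = k1 / n1, k2 / n2
    pooled = (k1 + k2) / (n1 + n2)
    se = sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / se
    p_value = 2 * (1 - 0.5 * (1 + erf(abs(z) / sqrt(2))))  # normal approximation
    return z, p_value

# Hypothetical: 52/175 tasks solved in simulation vs. 41/175 in a real company.
z, p = two_proportion_z(52, 175, 41, 175)
print(f"z = {z:.2f}, p = {p:.3f}")  # a large p would mean the ~30% rate holds up
```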
read the original abstract
We interact with computers on an everyday basis, be it in everyday life or work, and many aspects of work can be done entirely with access to a computer and the Internet. At the same time, thanks to improvements in large language models (LLMs), there has also been a rapid development in AI agents that interact with and affect change in their surrounding environments. But how performant are AI agents at accelerating or even autonomously performing work-related tasks? The answer to this question has important implications both for industry looking to adopt AI into their workflows and for economic policy to understand the effects that adoption of AI may have on the labor market. To measure the progress of these LLM agents' performance on performing real-world professional tasks, in this paper we introduce TheAgentCompany, an extensible benchmark for evaluating AI agents that interact with the world in similar ways to those of a digital worker: by browsing the Web, writing code, running programs, and communicating with other coworkers. We build a self-contained environment with internal web sites and data that mimics a small software company environment, and create a variety of tasks that may be performed by workers in such a company. We test baseline agents powered by both closed API-based and open-weights language models (LMs), and find that the most competitive agent can complete 30% of tasks autonomously. This paints a nuanced picture on task automation with LM agents--in a setting simulating a real workplace, a good portion of simpler tasks could be solved autonomously, but more difficult long-horizon tasks are still beyond the reach of current systems. We release code, data, environment, and experiments on https://the-agent-company.com.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces TheAgentCompany, a benchmark consisting of a self-contained simulated small software company environment with internal websites, data, and tasks that agents perform via web browsing, code writing, program execution, and coworker communication. It evaluates baseline LLM agents (both closed and open-weights models) and reports that the strongest agent completes 30% of tasks autonomously, concluding that simpler tasks are within reach while long-horizon ones remain challenging for current systems. The code, data, and environment are released.
Significance. If the simulated environment and tasks are representative, the benchmark offers a useful new testbed for measuring LLM agent progress on consequential professional work, providing an empirical baseline that could inform industry adoption and labor-market policy. The release of the full environment supports reproducibility and extension by the community.
major comments (2)
- [§3] Benchmark Construction: The headline interpretation—that the 30% rate shows simpler tasks are solvable but long-horizon tasks are out of reach—depends on the claim that the self-contained internal websites and data 'mimics a small software company environment.' No external validation (expert ratings of task realism, comparison against real company logs, or ablation with live external services) is provided, leaving open the possibility that the simulation is systematically narrower or easier than the real-world work it claims to represent.
- [§4] Experiments and Results: The abstract states a concrete 30% completion rate for the most competitive agent, yet the manuscript does not report the total number of tasks, their breakdown by horizon/complexity, precise success criteria, number of trials per task, or statistical details (e.g., variance or confidence intervals). Without these, the robustness of the claim that long-horizon tasks remain beyond reach cannot be fully assessed.
minor comments (2)
- [Abstract] The phrase 'a variety of tasks' would be more informative if it included the total task count and a high-level categorization (e.g., by duration or required skills).
- [Tables/Figures] Performance tables should include per-agent success rates with error bars or standard deviations across runs to allow readers to gauge variability.
Simulated Author's Rebuttal
We thank the referee for their constructive feedback on the benchmark's construction and experimental reporting. We have revised the manuscript to address the points raised while maintaining the integrity of our claims.
read point-by-point responses
- Referee: [§3] Benchmark Construction: The headline interpretation—that the 30% rate shows simpler tasks are solvable but long-horizon tasks are out of reach—depends on the claim that the self-contained internal websites and data 'mimics a small software company environment.' No external validation (expert ratings of task realism, comparison against real company logs, or ablation with live external services) is provided, leaving open the possibility that the simulation is systematically narrower or easier than the real-world work it claims to represent.
Authors: We agree that external validation would strengthen claims about the environment's representativeness of real-world software company work. The tasks were designed around standard professional workflows involving internal web tools, code execution, and coworker interactions to enable controlled, reproducible evaluation. In the revised manuscript, we have expanded §3 with a new subsection on task design principles and added an explicit limitations paragraph acknowledging that the self-contained simulation may not capture the full variability of live external services or specific company contexts. This revision clarifies the scope without overstating generalizability.
revision: partial
- Referee: [§4] Experiments and Results: The abstract states a concrete 30% completion rate for the most competitive agent, yet the manuscript does not report the total number of tasks, their breakdown by horizon/complexity, precise success criteria, number of trials per task, or statistical details (e.g., variance or confidence intervals). Without these, the robustness of the claim that long-horizon tasks remain beyond reach cannot be fully assessed.
Authors: We thank the referee for highlighting this reporting gap. We have revised §4 to explicitly include the total number of tasks, a breakdown by horizon and complexity categories, the precise success criteria (verified via automated state and output checks), the number of trials per task, and statistical details including variance and confidence intervals. These additions provide the necessary transparency to evaluate the robustness of our findings on task difficulty.
revision: yes
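As an aside on the statistics the rebuttal promises: for a binary success metric, a Wilson score interval is one standard way to attach uncertainty to a completion rate. A minimal sketch follows, with hypothetical counts rather than the manuscript's actual numbers.

```python
# Wilson score interval for a binomial success rate; counts are hypothetical.
from math import sqrt

def wilson_interval(k: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Approximate 95% confidence interval for a proportion k/n."""
    p = k / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half = (z / denom) * sqrt(p * (1 - p) / n + z**2 / (4 * n**2))
    return center - half, center + half

lo, hi = wilson_interval(k=52, n=175)  # hypothetical: 52 of 175 tasks solved
print(f"success rate ~{52/175:.1%}, 95% CI [{lo:.1%}, {hi:.1%}]")
```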
Circularity Check
No circularity: the results are direct empirical measurements on a new benchmark.
full rationale
The paper introduces TheAgentCompany benchmark and reports direct empirical success rates (e.g., the best agent completes 30% of tasks) obtained by running agents in the constructed self-contained environment. No mathematical derivations, equations, fitted parameters presented as predictions, or self-citation chains appear in the load-bearing claims. The central result is a straightforward measurement against the newly defined tasks and environment; the environment-fidelity assumption is stated explicitly rather than derived from prior results or discharged by construction. This is a standard empirical benchmark paper with no circular steps.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: The simulated company environment with internal websites and data is representative of real-world software company tasks.
Forward citations
Cited by 21 Pith papers
- WildClawBench: A Benchmark for Real-World, Long-Horizon Agent Evaluation · A new native-runtime benchmark reveals that current frontier AI agents succeed on at most 62 percent of realistic long-horizon CLI tasks.
- CUJBench: Benchmarking LLM-Agent on Cross-Modal Failure Diagnosis from Browser to Backend · CUJBench is the first benchmark for cross-modal LLM-agent failure diagnosis, reporting 19.7% accuracy and identifying evidence attribution as the core bottleneck across six models.
- Beyond the All-in-One Agent: Benchmarking Role-Specialized Multi-Agent Collaboration in Enterprise Workflows · EntCollabBench shows that today's LLM agents still struggle with delegation, context transfer, parameter grounding, workflow closure, and decision commitment when tested in a simulated enterprise with 11 role-speciali...
- SalesSim: Benchmarking and Aligning Multimodal Language Models as Retail User Simulators · SalesSim benchmarks MLLMs as retail user simulators, finds gaps in persona adherence and over-persuasion, and introduces UserGRPO RL to raise decision alignment by 13.8%.
- AcademiClaw: When Students Set Challenges for AI Agents · AcademiClaw is a new benchmark of 80 student-sourced academic tasks where the best frontier AI agents achieve only a 55% pass rate.
- ClawMark: A Living-World Benchmark for Multi-Turn, Multi-Day, Multimodal Coworker Agents · ClawMark is a new benchmark for multi-turn, multi-day, multimodal coworker agents in stateful evolving services, with deterministic Python checkers showing frontier models achieve only 20% strict task success.
- FinTrace: Holistic Trajectory-Level Evaluation of LLM Tool Calling for Long-Horizon Financial Tasks · FinTrace supplies trajectory-level metrics for LLM financial tool calling, exposing gaps in information use and output quality, while its preference dataset enables DPO training that boosts intermediate metrics.
- HiL-Bench (Human-in-Loop Benchmark): Do Agents Know When to Ask for Help? · HiL-Bench shows frontier AI agents fail to ask for help on incomplete tasks, recovering only a fraction of full-information performance, but RL training on Ask-F1 reward improves judgment and transfers across domains.
- SkillSafetyBench: Evaluating Agent Safety under Skill-Facing Attack Surfaces · SkillSafetyBench shows that localized non-user attacks via skills and artifacts can consistently induce unsafe agent behavior across domains and model backends, independent of user intent.
- SkillEvolver: Skill Learning as a Meta-Skill · A meta-skill authors and refines prose-and-code skills for agents by learning from post-deployment failures with an overfit audit, achieving 56.8% accuracy on SkillsBench tasks versus 43.6% for human-curated skills.
- Can Agent Benchmarks Support Their Scores? Evidence-Supported Bounds for Interactive-Agent Evaluation · Agent benchmarks can report evidence-supported score bounds instead of single misleading success rates by adding a layer that checks required artifacts for outcome verification.
- When Agents Overtrust Environmental Evidence: An Extensible Agentic Framework for Benchmarking Evidence-Grounding Defects in LLM Agents · EnvTrustBench benchmarks evidence-grounding defects in LLM agents and finds they occur consistently across workflows.
- When Agents Overtrust Environmental Evidence: An Extensible Agentic Framework for Benchmarking Evidence-Grounding Defects in LLM Agents · EnvTrustBench is a new agentic benchmark that measures evidence-grounding defects where LLM agents overtrust faulty environmental observations and take incorrect actions.
- An AI Agent Execution Environment to Safeguard User Data · GAAP guarantees confidentiality of private user data for AI agents by enforcing user-specified permissions deterministically through persistent information flow tracking, without trusting the agent or requiring attack...
- LLMs Corrupt Your Documents When You Delegate · LLMs corrupt an average of 25% of document content during long delegated editing workflows across 52 domains, even frontier models, and agentic tools do not mitigate the issue.
- Memory in the Age of AI Agents · The paper maps agent memory research via three forms (token-level, parametric, latent), three functions (factual, experiential, working), and dynamics of formation/evolution/retrieval, plus benchmarks and future directions.
- MiniMax-M1: Scaling Test-Time Compute Efficiently with Lightning Attention · MiniMax-M1 is a 456B parameter hybrid-attention MoE model trained with CISPO RL that achieves performance comparable or superior to DeepSeek-R1 and Qwen3-235B on reasoning and software engineering tasks while training...
- On Training Large Language Models for Long-Horizon Tasks: An Empirical Study of Horizon Length · Longer action horizons bottleneck LLM agent training through instability, but training with reduced horizons stabilizes learning and enables better generalization to longer horizons.
- GUI Agents with Reinforcement Learning: Toward Digital Inhabitants · The paper delivers the first comprehensive overview of RL for GUI agents, organizing methods into offline, online, and hybrid strategies while analyzing trends in rewards, efficiency, and deliberation to outline a fut...
- AlphaEval: Evaluating Agents in Production · AlphaEval is a benchmark of 94 production-sourced tasks from seven companies for evaluating full AI agent products across six domains using multiple judgment methods, plus a framework to build similar benchmarks.
- A Survey of Self-Evolving Agents: What, When, How, and Where to Evolve on the Path to Artificial Super Intelligence · The paper delivers the first systematic review of self-evolving agents, structured around what components evolve, when adaptation occurs, and how it is implemented.