TheAgentCompany: Benchmarking LLM Agents on Consequential Real World Tasks
Pith reviewed 2026-05-14 22:34 UTC · model grok-4.3
The pith
LLM agents autonomously complete only 30% of tasks in a simulated software company environment.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
We build a self-contained environment with internal web sites and data that mimics a small software company environment, and create a variety of tasks that may be performed by workers in such a company. We test baseline agents powered by both closed API-based and open-weights language models, and find that the most competitive agent can complete 30% of tasks autonomously. This paints a nuanced picture on task automation with LM agents--in a setting simulating a real workplace, a good portion of simpler tasks could be solved autonomously, but more difficult long-horizon tasks are still beyond the reach of current systems.
What carries the argument
TheAgentCompany benchmark: a simulated small software company with internal websites, data stores, and tasks that require web browsing, code writing, program execution, and coworker communication.
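To make that machinery concrete, below is a minimal Python sketch of how a checkpoint-scored task in such an environment could be defined. The names (Task, Checkpoint, evaluate) and the example task are illustrative assumptions for this review, not the benchmark's actual API.

```python
# Hypothetical sketch of a checkpoint-style task, not TheAgentCompany's actual API.
from dataclasses import dataclass, field
from typing import Callable

@dataclass
class Checkpoint:
    description: str
    # A verifier inspects a snapshot of environment state (files, service data,
    # chat logs) and reports whether this intermediate goal has been reached.
    verifier: Callable[[dict], bool]
    points: int = 1

@dataclass
class Task:
    name: str
    instructions: str
    checkpoints: list[Checkpoint] = field(default_factory=list)

    def evaluate(self, env_state: dict) -> tuple[bool, float]:
        """Return (fully_completed, partial_score in [0, 1])."""
        total = sum(c.points for c in self.checkpoints)
        earned = sum(c.points for c in self.checkpoints if c.verifier(env_state))
        return earned == total, (earned / total) if total else 0.0

# Example: a short-horizon task spanning code execution and coworker communication.
task = Task(
    name="fix-ci-and-notify",
    instructions="Fix the failing unit test in api-server, then tell the PM on chat.",
    checkpoints=[
        Checkpoint("test suite passes", lambda s: s.get("tests_passing", False)),
        Checkpoint("PM was notified", lambda s: "fixed" in s.get("chat_log", "")),
    ],
)
done, score = task.evaluate({"tests_passing": True, "chat_log": "fixed the test"})
```

Scoring against explicit verifiers of environment state, rather than a judge model, is what makes a benchmark like this repeatable across agents and over time.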
If this is right
- Simpler, short-horizon workplace tasks can be completed autonomously by current LLM agents.
- Long-horizon tasks that require chaining many steps remain beyond the capability of existing systems.
- The benchmark supplies a repeatable way to measure incremental gains in agent performance over time.
- Industry adoption of AI agents should initially target routine digital tasks where autonomous completion is already feasible.
Where Pith is reading between the lines
- If the simulation captures real workplace friction, companies could automate routine digital work to reallocate human effort toward judgment-intensive activities.
- Replicating the benchmark in other sectors such as law or finance would show whether automation ceilings are similar across professional domains.
- Fine-tuning agents on traces from the simulated company might raise success rates on the harder long-horizon tasks without changing the core architecture.
Load-bearing premise
The self-contained simulated environment with internal websites and data is representative enough of a real small software company to make the tasks consequential and realistic for professional work.
What would settle it
Running the identical agents on equivalent tasks inside an actual operating small software company and checking whether the autonomous success rate stays near 30%.
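As a sketch of what that check could look like quantitatively: with illustrative counts (not figures from the paper), a two-proportion z-test can ask whether a real-company success rate differs from the simulated one beyond sampling noise.

```python
# Illustrative comparison of simulated vs. real success rates; all counts are
# hypothetical, not reported by the paper.
from math import sqrt, erf

def two_proportion_z(k1: int, n1: int, k2: int, n2: int) -> tuple[float, float]:
    """Return (z statistic, two-sided p-value) for H0: p1 == p2."""
    p1, p2 = k1 / n1, k2 / n2
    pooled = (k1 + k2) / (n1 + n2)
    se = sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / se
    p_value = 2 * (1 - 0.5 * (1 + erf(abs(z) / sqrt(2))))  # normal approximation
    return z, p_value

# Hypothetical: 52/175 tasks solved in simulation vs. 41/175 in a real company.
z, p = two_proportion_z(52, 175, 41, 175)
print(f"z = {z:.2f}, p = {p:.3f}")  # a large p would mean the ~30% rate holds up
```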
read the original abstract
We interact with computers on an everyday basis, be it in everyday life or work, and many aspects of work can be done entirely with access to a computer and the Internet. At the same time, thanks to improvements in large language models (LLMs), there has also been a rapid development in AI agents that interact with and affect change in their surrounding environments. But how performant are AI agents at accelerating or even autonomously performing work-related tasks? The answer to this question has important implications both for industry looking to adopt AI into their workflows and for economic policy to understand the effects that adoption of AI may have on the labor market. To measure the progress of these LLM agents' performance on performing real-world professional tasks, in this paper we introduce TheAgentCompany, an extensible benchmark for evaluating AI agents that interact with the world in similar ways to those of a digital worker: by browsing the Web, writing code, running programs, and communicating with other coworkers. We build a self-contained environment with internal web sites and data that mimics a small software company environment, and create a variety of tasks that may be performed by workers in such a company. We test baseline agents powered by both closed API-based and open-weights language models (LMs), and find that the most competitive agent can complete 30% of tasks autonomously. This paints a nuanced picture on task automation with LM agents--in a setting simulating a real workplace, a good portion of simpler tasks could be solved autonomously, but more difficult long-horizon tasks are still beyond the reach of current systems. We release code, data, environment, and experiments on https://the-agent-company.com.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces TheAgentCompany, a benchmark consisting of a self-contained simulated small software company environment with internal websites, data, and tasks that agents perform via web browsing, code writing, program execution, and coworker communication. It evaluates baseline LLM agents (both closed and open-weights models) and reports that the strongest agent completes 30% of tasks autonomously, concluding that simpler tasks are within reach while long-horizon ones remain challenging for current systems. The code, data, and environment are released.
Significance. If the simulated environment and tasks are representative, the benchmark offers a useful new testbed for measuring LLM agent progress on consequential professional work, providing an empirical baseline that could inform industry adoption and labor-market policy. The release of the full environment supports reproducibility and extension by the community.
major comments (2)
- [§3] Benchmark Construction: The headline interpretation—that the 30% rate shows simpler tasks are solvable but long-horizon tasks are out of reach—depends on the claim that the self-contained internal websites and data 'mimics a small software company environment.' No external validation (expert ratings of task realism, comparison against real company logs, or ablation with live external services) is provided, leaving open the possibility that the simulation is systematically narrower or easier than the real-world work it claims to represent.
- [§4] Experiments and Results: The abstract states a concrete 30% completion rate for the most competitive agent, yet the manuscript does not report the total number of tasks, their breakdown by horizon/complexity, precise success criteria, number of trials per task, or statistical details (e.g., variance or confidence intervals). Without these, the robustness of the claim that long-horizon tasks remain beyond reach cannot be fully assessed.
minor comments (2)
- [Abstract] The phrase 'a variety of tasks' would be more informative if it included the total task count and a high-level categorization (e.g., by duration or required skills).
- [Tables/Figures] Performance tables should include per-agent success rates with error bars or standard deviations across runs to allow readers to gauge variability.
Simulated Author's Rebuttal
We thank the referee for their constructive feedback on the benchmark's construction and experimental reporting. We have revised the manuscript to address the points raised while maintaining the integrity of our claims.
read point-by-point responses
- Referee: [§3] Benchmark Construction: The headline interpretation—that the 30% rate shows simpler tasks are solvable but long-horizon tasks are out of reach—depends on the claim that the self-contained internal websites and data 'mimics a small software company environment.' No external validation (expert ratings of task realism, comparison against real company logs, or ablation with live external services) is provided, leaving open the possibility that the simulation is systematically narrower or easier than the real-world work it claims to represent.
Authors: We agree that external validation would strengthen claims about the environment's representativeness of real-world software company work. The tasks were designed around standard professional workflows involving internal web tools, code execution, and coworker interactions to enable controlled, reproducible evaluation. In the revised manuscript, we have expanded §3 with a new subsection on task design principles and added an explicit limitations paragraph acknowledging that the self-contained simulation may not capture the full variability of live external services or specific company contexts. This revision clarifies the scope without overstating generalizability.
revision: partial
- Referee: [§4] Experiments and Results: The abstract states a concrete 30% completion rate for the most competitive agent, yet the manuscript does not report the total number of tasks, their breakdown by horizon/complexity, precise success criteria, number of trials per task, or statistical details (e.g., variance or confidence intervals). Without these, the robustness of the claim that long-horizon tasks remain beyond reach cannot be fully assessed.
Authors: We thank the referee for highlighting this reporting gap. We have revised §4 to explicitly include the total number of tasks, a breakdown by horizon and complexity categories, the precise success criteria (verified via automated state and output checks), the number of trials per task, and statistical details including variance and confidence intervals. These additions provide the necessary transparency to evaluate the robustness of our findings on task difficulty.
revision: yes
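As an aside on the statistics the rebuttal promises: for a binary success metric, a Wilson score interval is one standard way to attach uncertainty to a completion rate. A minimal sketch follows, with hypothetical counts rather than the manuscript's actual numbers.

```python
# Wilson score interval for a binomial success rate; counts are hypothetical.
from math import sqrt

def wilson_interval(k: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Approximate 95% confidence interval for a proportion k/n."""
    p = k / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half = (z / denom) * sqrt(p * (1 - p) / n + z**2 / (4 * n**2))
    return center - half, center + half

lo, hi = wilson_interval(k=52, n=175)  # hypothetical: 52 of 175 tasks solved
print(f"success rate ~{52/175:.1%}, 95% CI [{lo:.1%}, {hi:.1%}]")
```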
Circularity Check
No circularity: the results are direct empirical measurements on a new benchmark.
full rationale
The paper introduces TheAgentCompany benchmark and reports direct empirical success rates (e.g., the best agent completes 30% of tasks) obtained by running agents in the constructed self-contained environment. No mathematical derivations, equations, fitted parameters presented as predictions, or self-citation chains appear in the load-bearing claims. The central result is a straightforward measurement against the newly defined tasks and environment; the environment-fidelity assumption is stated explicitly rather than derived from prior results or discharged by construction. This is a standard empirical benchmark paper with no circular steps.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: The simulated company environment with internal websites and data is representative of real-world software company tasks.
Forward citations
Cited by 21 Pith papers
- WildClawBench: A Benchmark for Real-World, Long-Horizon Agent Evaluation · A new native-runtime benchmark reveals that current frontier AI agents succeed on at most 62 percent of realistic long-horizon CLI tasks.
- CUJBench: Benchmarking LLM-Agent on Cross-Modal Failure Diagnosis from Browser to Backend · CUJBench is the first benchmark for cross-modal LLM-agent failure diagnosis, reporting 19.7% accuracy and identifying evidence attribution as the core bottleneck across six models.
- Beyond the All-in-One Agent: Benchmarking Role-Specialized Multi-Agent Collaboration in Enterprise Workflows · EntCollabBench shows that today's LLM agents still struggle with delegation, context transfer, parameter grounding, workflow closure, and decision commitment when tested in a simulated enterprise with 11 role-speciali...
- SalesSim: Benchmarking and Aligning Multimodal Language Models as Retail User Simulators · SalesSim benchmarks MLLMs as retail user simulators, finds gaps in persona adherence and over-persuasion, and introduces UserGRPO RL to raise decision alignment by 13.8%.
- AcademiClaw: When Students Set Challenges for AI Agents · AcademiClaw is a new benchmark of 80 student-sourced academic tasks where the best frontier AI agents achieve only a 55% pass rate.
- ClawMark: A Living-World Benchmark for Multi-Turn, Multi-Day, Multimodal Coworker Agents · ClawMark is a new benchmark for multi-turn, multi-day, multimodal coworker agents in stateful evolving services, with deterministic Python checkers showing frontier models achieve only 20% strict task success.
- FinTrace: Holistic Trajectory-Level Evaluation of LLM Tool Calling for Long-Horizon Financial Tasks · FinTrace supplies trajectory-level metrics for LLM financial tool calling, exposing gaps in information use and output quality, while its preference dataset enables DPO training that boosts intermediate metrics.
- HiL-Bench (Human-in-Loop Benchmark): Do Agents Know When to Ask for Help? · HiL-Bench shows frontier AI agents fail to ask for help on incomplete tasks, recovering only a fraction of full-information performance, but RL training on Ask-F1 reward improves judgment and transfers across domains.
- SkillSafetyBench: Evaluating Agent Safety under Skill-Facing Attack Surfaces · SkillSafetyBench shows that localized non-user attacks via skills and artifacts can consistently induce unsafe agent behavior across domains and model backends, independent of user intent.
- SkillEvolver: Skill Learning as a Meta-Skill · A meta-skill authors and refines prose-and-code skills for agents by learning from post-deployment failures with an overfit audit, achieving 56.8% accuracy on SkillsBench tasks versus 43.6% for human-curated skills.
- Can Agent Benchmarks Support Their Scores? Evidence-Supported Bounds for Interactive-Agent Evaluation · Agent benchmarks can report evidence-supported score bounds instead of single misleading success rates by adding a layer that checks required artifacts for outcome verification.
- When Agents Overtrust Environmental Evidence: An Extensible Agentic Framework for Benchmarking Evidence-Grounding Defects in LLM Agents · EnvTrustBench benchmarks evidence-grounding defects in LLM agents and finds they occur consistently across workflows.
- When Agents Overtrust Environmental Evidence: An Extensible Agentic Framework for Benchmarking Evidence-Grounding Defects in LLM Agents · EnvTrustBench is a new agentic benchmark that measures evidence-grounding defects where LLM agents overtrust faulty environmental observations and take incorrect actions.
- An AI Agent Execution Environment to Safeguard User Data · GAAP guarantees confidentiality of private user data for AI agents by enforcing user-specified permissions deterministically through persistent information flow tracking, without trusting the agent or requiring attack...
- LLMs Corrupt Your Documents When You Delegate · LLMs corrupt an average of 25% of document content during long delegated editing workflows across 52 domains, even frontier models, and agentic tools do not mitigate the issue.
- Memory in the Age of AI Agents · The paper maps agent memory research via three forms (token-level, parametric, latent), three functions (factual, experiential, working), and dynamics of formation/evolution/retrieval, plus benchmarks and future directions.
- MiniMax-M1: Scaling Test-Time Compute Efficiently with Lightning Attention · MiniMax-M1 is a 456B parameter hybrid-attention MoE model trained with CISPO RL that achieves performance comparable or superior to DeepSeek-R1 and Qwen3-235B on reasoning and software engineering tasks while training...
- On Training Large Language Models for Long-Horizon Tasks: An Empirical Study of Horizon Length · Longer action horizons bottleneck LLM agent training through instability, but training with reduced horizons stabilizes learning and enables better generalization to longer horizons.
- GUI Agents with Reinforcement Learning: Toward Digital Inhabitants · The paper delivers the first comprehensive overview of RL for GUI agents, organizing methods into offline, online, and hybrid strategies while analyzing trends in rewards, efficiency, and deliberation to outline a fut...
- AlphaEval: Evaluating Agents in Production · AlphaEval is a benchmark of 94 production-sourced tasks from seven companies for evaluating full AI agent products across six domains using multiple judgment methods, plus a framework to build similar benchmarks.
- A Survey of Self-Evolving Agents: What, When, How, and Where to Evolve on the Path to Artificial Super Intelligence · The paper delivers the first systematic review of self-evolving agents, structured around what components evolve, when adaptation occurs, and how it is implemented.