EnterpriseClawBench: Benchmarking Agents from Real Workplace Sessions
Pith reviewed 2026-06-26 08:18 UTC · model grok-4.3
The pith
EnterpriseClawBench shows top agent configurations reach only 0.663 success on recovered real workplace tasks and requires reporting harness-model pairs plus multiple dimensions.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
EnterpriseClawBench converts an archive of real enterprise agent sessions into 852 reproducible tasks without releasing the underlying data. Evaluation across configurations shows a maximum score of 0.663 with Codex and GPT-5.5. The results establish that valid enterprise agent assessment must publish harness-model combinations, artifact delivery quality, visual quality, cost, runtime, and skill-transfer behavior rather than a single aggregate number.
What carries the argument
The construction and evaluation protocol that recovers tasks from proprietary sessions and equips them with fixtures and semantic rubrics for cross-model comparison.
If this is right
- Evaluations must report harness-model combinations rather than model-only results.
- Artifact delivery and visual quality must be scored separately from task completion.
- Cost and runtime must be measured and reported for each configuration.
- Skill-transfer behavior across role and skill subclasses must be tracked.
- Single-number summaries are insufficient for enterprise agent assessment.
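The reporting practice listed above can be pictured as one record per harness-model pair that keeps each dimension in its own field. The sketch below is illustrative only: the class name, field names, and units are assumptions rather than the paper's schema, and only the harness, model, and 0.663 score come from the text.

```python
from dataclasses import dataclass, fields
from typing import Optional


@dataclass(frozen=True)
class ConfigReport:
    """One row of a multi-dimensional report (hypothetical schema)."""
    harness: str
    model: str
    task_score: Optional[float] = None         # task completion score in [0, 1]
    artifact_delivery: Optional[float] = None  # scored separately from completion
    visual_quality: Optional[float] = None
    cost_usd: Optional[float] = None           # unit is an assumption
    runtime_s: Optional[float] = None          # unit is an assumption
    skill_transfer: Optional[float] = None


def missing_dimensions(report: ConfigReport) -> list[str]:
    """Return the names of the dimensions that were left unreported."""
    return [f.name for f in fields(report) if getattr(report, f.name) is None]


# Only the harness, the model, and the 0.663 score are taken from the source.
best = ConfigReport(harness="Codex", model="GPT-5.5", task_score=0.663)
print(missing_dimensions(best))
```

Leaving unreported dimensions as `None` makes a single-number summary visibly incomplete instead of silently passing as a full report.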
Where Pith is reading between the lines
- Companies considering agent deployment could use the protocol to test internal tasks before scaling.
- The gap at 0.663 suggests that current agents may require additional tooling or human oversight for routine enterprise work.
- Future benchmarks built from private data could adopt similar fixture-and-rubric methods to enable comparison while preserving confidentiality.
Load-bearing premise
Tasks recovered and rewritten from proprietary sessions can be made reproducible with fixtures and rubrics in a manner that supports valid cross-model comparisons without access to the original data.
What would settle it
A new harness-model combination that produces a single score above 0.663 while also producing consistent rankings across artifact quality, cost, runtime, and skill transfer would contradict the claim that multiple separate dimensions are required.
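Whether rankings are consistent across dimensions can be checked with a rank correlation. The following is a minimal sketch using a hand-written Kendall tau-a; the configuration names and all numbers are invented placeholders, not results from the paper.

```python
from itertools import combinations


def ranking(scores: dict[str, float], higher_is_better: bool = True) -> list[str]:
    """Order configuration names from best to worst on one dimension."""
    return sorted(scores, key=lambda name: scores[name], reverse=higher_is_better)


def kendall_tau(a: dict[str, float], b: dict[str, float]) -> float:
    """Kendall tau-a between two score tables over the same configurations."""
    names = sorted(a)
    if len(names) < 2:
        raise ValueError("need at least two configurations")
    concordant = discordant = 0
    for x, y in combinations(names, 2):
        sign = (a[x] - a[y]) * (b[x] - b[y])
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
    pairs = len(names) * (len(names) - 1) // 2
    return (concordant - discordant) / pairs


# Invented placeholder values for three hypothetical configurations.
aggregate = {"cfg_a": 0.6, "cfg_b": 0.5, "cfg_c": 0.4}
cost = {"cfg_a": 3.0, "cfg_b": 1.0, "cfg_c": 2.0}
# Lower cost is better, so negate it before comparing with a higher-is-better score.
neg_cost = {name: -value for name, value in cost.items()}

print(ranking(aggregate), ranking(cost, higher_is_better=False))
print(kendall_tau(aggregate, neg_cost))
```

A tau near 1 on every pair of dimensions would mean one number ranks configurations as well as the full report does; values near 0 or below mean the dimensions disagree.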
Original abstract
Enterprise agents increasingly operate inside workspaces: they read heterogeneous files, invoke tools, and deliver business artifacts. We introduce EnterpriseClawBench, an enterprise agent benchmark constructed from proprietary, real-world agent sessions. Starting from a large archive of workplace sessions, EnterpriseClawBench produces 852 reproducible tasks, each paired with recovered fixtures, rewritten prompts, role classes, skill subclasses, hard rules, and semantic rubrics. Because the sessions contain internal enterprise content, we do not release the benchmark data; instead, our reusable contribution is the construction and evaluation protocol. On EnterpriseClawBench, the best configuration reaches only 0.663 (Codex with GPT-5.5). These results show that enterprise agent evaluation must report harness-model combinations, artifact delivery, visual quality, cost, runtime, and skill-transfer behavior, rather than collapsing performance into a single score. Code: https://github.com/FrontisAI/EnterpriseClawBench
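The task components named in the abstract can be sketched as a small record type. Everything here is hypothetical: the field names, the example task, and especially the aggregation rule (any failed hard rule gives zero, otherwise a weighted mean of rubric marks) are assumptions for illustration, since the abstract does not say how hard rules and rubrics are combined into a score.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class RubricItem:
    description: str
    weight: float


@dataclass
class Task:
    task_id: str
    rewritten_prompt: str
    role_class: str
    skill_subclass: str
    fixtures: list[str] = field(default_factory=list)
    hard_rules: list[Callable[[str], bool]] = field(default_factory=list)
    rubric: list[RubricItem] = field(default_factory=list)


def score(task: Task, output_path: str, rubric_marks: list[float]) -> float:
    """Assumed aggregation: hard rules gate, rubric marks in [0, 1] are averaged by weight."""
    if not all(rule(output_path) for rule in task.hard_rules):
        return 0.0
    total_weight = sum(item.weight for item in task.rubric)
    if total_weight == 0:
        return 0.0
    weighted = sum(item.weight * mark for item, mark in zip(task.rubric, rubric_marks))
    return weighted / total_weight


# Hypothetical task; identifiers and rubric text are placeholders.
task = Task(
    task_id="example-task",
    rewritten_prompt="Calibrate the revenue sheet and deliver a report.",
    role_class="Finance / ops",
    skill_subclass="spreadsheet",
    fixtures=["inputs/revenue.xlsx"],
    hard_rules=[lambda path: path.endswith(".xlsx")],
    rubric=[RubricItem("totals are consistent", 2.0), RubricItem("layout is readable", 1.0)],
)
print(score(task, "outputs/report.xlsx", [1.0, 0.5]))
print(score(task, "outputs/report.txt", [1.0, 0.5]))
```

In practice the rubric marks would come from a judge (human or LLM); here they are passed in directly to keep the sketch self-contained.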
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims to introduce EnterpriseClawBench, a benchmark constructed from proprietary real-world workplace sessions yielding 852 reproducible tasks with associated fixtures, prompts, role classes, skill subclasses, hard rules, and semantic rubrics. Due to proprietary content, the data is not released, but a construction and evaluation protocol is provided as the reusable contribution. The best configuration achieves a score of 0.663, leading to the conclusion that enterprise agent evaluation should report multiple dimensions rather than a single score.
Significance. If the protocol is sound, this work could be significant for the field by demonstrating the gap between current agent capabilities and enterprise requirements, and by promoting more comprehensive evaluation practices that include cost, runtime, and skill transfer. The open-sourcing of the protocol code is a strength that allows others to apply similar methods. However, the proprietary data limits the benchmark's immediate utility and verifiability.
major comments (2)
- [Abstract] The abstract reports the creation of 852 tasks and a performance of 0.663 but supplies no information on validation, inter-rater reliability, or how exclusions or rewrites were performed. This absence makes it impossible to judge the reliability of the tasks and the support they provide for the stated conclusions about agent performance.
- [Benchmark Construction] The central claim that the tasks are reproducible relies on the protocol, but without release of the tasks, fixtures, rewritten prompts, and rubrics, external parties cannot re-execute the harness or recompute the 0.663 score. This is load-bearing for the recommendation that evaluations must report harness-model combinations and other metrics.
minor comments (1)
- Clarify in the abstract or introduction whether the GitHub repository includes any sample data or only the protocol code.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback. We address the two major comments point by point below.
- Referee: [Abstract] The abstract reports the creation of 852 tasks and a performance of 0.663 but supplies no information on validation, inter-rater reliability, or how exclusions or rewrites were performed. This absence makes it impossible to judge the reliability of the tasks and the support they provide for the stated conclusions about agent performance.
  Authors: We agree that the abstract omits these details. The full manuscript describes the validation steps, inter-rater reliability checks, and procedures for exclusions and rewrites in the Benchmark Construction and Evaluation sections. We will revise the abstract to include a concise summary of these elements.
  Revision: partial
- Referee: [Benchmark Construction] The central claim that the tasks are reproducible relies on the protocol, but without release of the tasks, fixtures, rewritten prompts, and rubrics, external parties cannot re-execute the harness or recompute the 0.663 score. This is load-bearing for the recommendation that evaluations must report harness-model combinations and other metrics.
  Authors: The manuscript states that the data cannot be released due to proprietary enterprise content and positions the open-sourced construction and evaluation protocol as the reusable contribution. The protocol and harness code enable others to apply the same methodology to their own data. The 0.663 result illustrates the framework on our dataset; the recommendation to report multiple dimensions (harness-model pairs, cost, runtime, etc.) follows from the evaluation design itself and does not require external recomputation of this specific score.
  Revision: no
Circularity Check
No circularity; empirical benchmark protocol is self-contained
The paper introduces EnterpriseClawBench via an empirical construction protocol that recovers tasks from proprietary sessions, produces 852 tasks with fixtures/rubrics, and reports an observed maximum score of 0.663 under specific harness-model pairs. No equations, fitted parameters, predictions, or first-principles derivations appear. The central claim (need for multi-dimensional reporting) follows directly from the empirical results and protocol description without reducing to self-definition, self-citation chains, or renamed inputs. The non-release of data affects verifiability but does not create circularity in the reported construction or measurements.
A. Appendix: Judge Ablation
The main text reports the judge-reliability conclusion. Here we separate the LLM-judge analysis into horizontal and vertical views. Horizontally, different judges use different absolute score scales; Stron...
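The observation that judges use different absolute score scales suggests comparing them after per-judge standardization, or by rank order, instead of by raw marks. This is a minimal sketch of that idea, not the paper's ablation procedure; the two judges and all marks below are invented.

```python
from statistics import mean, pstdev


def standardize(scores: list[float]) -> list[float]:
    """Map one judge's scores to zero mean and unit variance (z-scores)."""
    mu = mean(scores)
    sigma = pstdev(scores)
    if sigma == 0:
        return [0.0 for _ in scores]
    return [(s - mu) / sigma for s in scores]


def order(scores: list[float]) -> list[int]:
    """Indices of configurations sorted from highest to lowest score."""
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)


# Invented marks for three configurations from two hypothetical judges.
strict_judge = [0.2, 0.4, 0.3]
lenient_judge = [0.7, 0.9, 0.8]

same_order = order(strict_judge) == order(lenient_judge)
max_gap = max(
    abs(a - b)
    for a, b in zip(standardize(strict_judge), standardize(lenient_judge))
)
print(same_order, max_gap < 1e-9)
```

In this toy case the judges differ by a constant offset, so their raw scores disagree while their rankings and standardized scores coincide.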
English translation:
benchmark_id: fb-****_i****
business_task: configure automatic delivery for ****
user_messages:
- Employee ****: aut...
Do you want to:
1. generate an evaluation for Employee ****?
2. build an automatic-delivery workflow for ****?
3. do something else?
Do you have a **** file or ******** link attached?
self_contain_decision:
  self_contained: false
  reason_code: ambiguous_task
  brief_reason: the user message is too short and ambiguous. Even the original agent could not infer the user's intent, so there is not enough information to recover a concrete task.
  rewritten_prompt: ""
Here the ori...
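The `self_contain_decision` record above can be read as a filter on candidate tasks. The sketch below mirrors its four fields; the filtering rule and the accepted example are assumptions made for illustration, not the pipeline's actual logic.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SelfContainDecision:
    self_contained: bool
    reason_code: str
    brief_reason: str
    rewritten_prompt: str


def keep_for_benchmark(decision: SelfContainDecision) -> bool:
    """Assumed rule: keep a session only if it is self-contained and has a rewritten prompt."""
    return decision.self_contained and bool(decision.rewritten_prompt.strip())


# Values taken from the rejected example shown above.
rejected = SelfContainDecision(
    self_contained=False,
    reason_code="ambiguous_task",
    brief_reason="the user message is too short and ambiguous",
    rewritten_prompt="",
)

# Hypothetical accepted record; every value here is a placeholder.
accepted = SelfContainDecision(
    self_contained=True,
    reason_code="",
    brief_reason="materials and expected output are explicit",
    rewritten_prompt="Summarize the attached meeting notes into a progress report.",
)

print(keep_for_benchmark(rejected), keep_for_benchmark(accepted))
```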
1. the text version of today's meeting recording of ****
2. today's meeting notes
3. detailed progress updates from project-group owners.
Wait until I send all three to you.
--- Quoted message, for reference only ---
[Quote 1] [**** 19:50:17 CST] cli_a9f48d3258****: @Employee **** @Employee**** @Employee**** @Employee**** @Employee**** @Employee**** @Employee**** @Employee**** ... Received! I will continue collecting replies from the owners ...
1. Today's progress (quantified)
2. Risks/blockers (whether they affect the **** node)
3. External dependencies (waiting for whom, for what + deadline)
Please make sure to reply before 20:00. If there is a **** node risk, please directly mark it as WARNING.
[**** 19:52:03 CST] ou_f9f43034440321aad669a034 ******: @_user_1
Progress on ****:
4. The backend Agent-related **** information and performance report information still require ** to finalize the table structure before data retrieval and integration can be determined.
... (truncated, original length 4803 characters)
The pipeline rewrites the messy raw prompt into a single task specification with explicit materials, template context, and...
1. text version of today's meeting recording, file path: /root/inputs/****** daily-meeting recording text.docx
2. today's meeting notes, file path: /root/inputs/****** daily-meeting notes.docx
3. Detailed progress replies from each owner in the project group (excerpt)
[19:52] Employee**** (** & *** and ***** Marketing) Progress on * month * day:
1. The product design demo has been updated according to the latest **** data status
2. On the design side, the lists and detail pages for ******, ***, and the **** marketing page have been completed
3. The frontend UI layout for the **** list and details has been completed, and frontend-backend integration for the list and details has been completed
4. The backend Agent-related **** information and performance report information still require ** to finalize the table structure before data retrieval and integration can be determined.
... (truncated, original length 1859 characters)
Taxonomy and skill-subclass labeling. The same rewritten task is then labeled with both a benchmark task class and a role/sk...
1. fb-****_i****: Organize engineering safety-risk and algorithm-list files into an AI-recognition summary spreadsheet
2. fb-****_i****: Process a task-effort spreadsheet into project scope, owner, and fee-assignment fields
3. fb-****_i****: Complete an indoor-finishing process-node collection table from examples and domain logic
4. fb-****_i****: Generate a frontend page from a business-flow diagram and demo script
5. fb-****_i****: Analyze a new product form and turn the analysis into a PRD plus an HTML product page
Engineering / IT
Development tools, system architecture, APIs/SDKs, message queues, frontend integration, and engineering implementation plans; the output should guide development, configuration, integration, debugging, or technical choice.
1. fb-****_i****: Explain how to configure an API key and base URL in Cursor for a masked model proxy
2. fb-****_i****: Compare Vercel AI SDK with LangChain/LangGraph and explain abort semantics
3. fb-****_i****: Explain RabbitMQ, Kafka, and RocketMQ, including scenarios and backend connection patterns
4. fb-****_i****: Propose a low-intrusion way to embed one SPA page into another
5. fb-****_i****: Analyze logs or scripts and report likely bugs, failure modes, and engineering fixes
HR / admin
HR, organization design, interview evaluation, attendance or time-report checks, internal notices, and administrative coordination where personnel context and policy constraints are central.
1. fb-****_i****: Calculate compensation costs for two employee departure-plan options
2. fb-****_i****: Refine an organization-structure document, including role responsibilities and OKRs
3. fb-****_i****: Produce an interview evaluation report from a resume and interview recording
4. fb-****_i****: Check missing Q1 time-report entries and generate owner notifications
5. fb-****_i****: Review an employee remote-work request caused by office-environment issues and propose a response
Executive
Management reporting, organizational coordination, OKR review, and internal decision support; outputs support managerial judgment, cross-team alignment, formal reporting, or leadership-facing communication.
1. fb-****_i****: Evaluate a weekly report, identify weak sections, and give concrete revision suggestions
2. fb-****_i****: Draft a first-level department weekly report from several second-level department reports
3. fb-****_i****: Revise organization-structure wording for a leadership-facing weekly report
4. fb-****_i****: Review OKRs, propose execution plans for each KR, and flag numerical risks
5. fb-****_i****: Update a company-level quarterly OKR document and export the revised document
Sales / customer
Customer-facing solution communication, visit planning, account research, sales intelligence, and external stakeholder materials; emphasis is on customer context, decision roles, risks, and executable next steps.
1. fb-****_i****: Enhance and visually reorganize a masked commodity-warehousing AI solution deck
2. fb-****_i****: Build customer-relationship, reverse-plan, risk-assessment, and decision-map HTML pages
3. fb-****_i****: Convert a customer research report into a PPT-like light-theme webpage
4. fb-****_i****: Optimize a masked customer visit plan and check agenda feasibility
5. fb-****_i****: Analyze participant roles and company concerns from a roadshow image and prepare positioning notes
Marketing
Brand, PR, launch, event, public communication, and marketing-content delivery; outputs are externally readable or presentation-oriented and must preserve message hierarchy and visual style.
1. fb-****_i****: Revise an activity-summary document according to event-feedback comments
2. fb-****_i****: Generate customer and investor versions of a business presentation for a product launch
3. fb-****_i****: Beautify an existing presentation according to a specified color scheme
4. fb-****_i****: Generate a complete case presentation from a PDF and a template deck
5. fb-****_i****: Draft a professional official-account article from media-report materials about a masked AI case
Finance / ops
Finance, accounting, business operations, structured data, spreadsheets, operating metrics, and reconciliation; the core action is calibration, cleaning, aggregation, categorization, completion, or analysis.
1. fb-****_i****: Calibrate and analyze masked revenue, cost, gross-profit, and cash-flow data across business lines
2. fb-****_i****: Fill a public-payment accounting-entry template from approval-flow data
3. fb-****_i****: Check accounting entries against cost-category rules and flag problematic rows
4. fb-****_i****: Reorder disrupted accounting entries by debit-credit balance and fill voucher numbers
5. fb-****_i****: Generate a department operating-analysis report covering revenue, payments, outsourcing, contracts, and receivables
Table 4 | Representative tasks from the Lite subset by role class. Each row gives the classification boundary and five release-safe prompt excerpts.
English translation:
Completed! The calibration analysis report has been generated.
Artifact path: /workspace/outputs/business_data_calibration_report.xlsx
Key findings:
- Data calibration status: all passed; all gross-profit calculations and gross-margin rates are accurate.
- Total revenue: **** ten thousand yuan
- Total cost: **** ten thous...
1. Sheet 1 - revenue-cost calibration analysis: detailed data for **** projects, including original values, calibrated values, deviation analysis, and calibration status
2. Sheet 2 - business-unit statistics: summary data by business unit, convenient for benchmark comparison.
Delivered artifact. The agent produced two sheets. The following excerpt masks all project names, business-unit names, and sensitive numerical values:
Original Chinese: ...