arxiv: 2606.23654 · v1 · pith:3C2J23RTnew · submitted 2026-06-22 · 💻 cs.CL · cs.SE

EnterpriseClawBench: Benchmarking Agents from Real Workplace Sessions

Pith reviewed 2026-06-26 08:18 UTC · model grok-4.3

classification 💻 cs.CL cs.SE
keywords enterprise agents · agent benchmarking · workplace tasks · agent evaluation · reproducible tasks · artifact delivery · skill transfer · harness evaluation

The pith

EnterpriseClawBench shows that the best agent configuration reaches a score of only 0.663 on tasks recovered from real workplace sessions, and argues that evaluations must report harness-model pairs across several dimensions rather than a single number.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper builds EnterpriseClawBench by recovering 852 tasks from proprietary workplace sessions and pairing each with fixtures, rewritten prompts, role classes, skill subclasses, hard rules, and semantic rubrics. It evaluates several harness-model combinations and reports the highest score at 0.663. A sympathetic reader would care because the work argues that enterprise agent evaluation cannot collapse results into a single score and must instead track harness-model combinations, artifact delivery, visual quality, cost, runtime, and skill-transfer behavior. The reusable contribution is the construction protocol itself, since the original data stays private.

Core claim

EnterpriseClawBench converts an archive of real enterprise agent sessions into 852 reproducible tasks without releasing the underlying data. Evaluation across configurations shows a maximum score of 0.663, reached by Codex with GPT-5.5. From these results the paper concludes that enterprise agent assessment must report harness-model combinations, artifact delivery, visual quality, cost, runtime, and skill-transfer behavior rather than a single aggregate number.

What carries the argument

The construction and evaluation protocol that recovers tasks from proprietary sessions and equips them with fixtures and semantic rubrics for cross-model comparison.
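
As a rough illustration of what one such task record might look like, the sketch below bundles the components the abstract lists: fixtures, a rewritten prompt, a role class, a skill subclass, hard rules, and semantic rubrics. The field names, the example values (including the fixture path), and the scoring rule (any failed hard rule gives zero, otherwise a weighted mean of rubric scores) are all assumptions made for illustration; the paper's actual schema and scoring formula are not stated on this page.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class Task:
    """Hypothetical task record; field names are illustrative, not the paper's schema."""
    task_id: str
    rewritten_prompt: str
    role_class: str
    skill_subclass: str
    fixtures: List[str] = field(default_factory=list)  # input file paths
    # Hard rules: binary checks on the delivered outputs (assumed form).
    hard_rules: List[Callable[[Dict[str, str]], bool]] = field(default_factory=list)
    # Semantic rubrics: criterion name -> weight (assumed form).
    rubric_weights: Dict[str, float] = field(default_factory=dict)


def score_task(task: Task, outputs: Dict[str, str], rubric_scores: Dict[str, float]) -> float:
    """Assumed scoring: a failed hard rule gives 0.0, else the weighted rubric mean."""
    if not all(rule(outputs) for rule in task.hard_rules):
        return 0.0
    total_weight = sum(task.rubric_weights.values())
    if total_weight == 0:
        return 0.0
    weighted = sum(
        weight * rubric_scores.get(name, 0.0)
        for name, weight in task.rubric_weights.items()
    )
    return weighted / total_weight


# Hypothetical example task; every value here is a placeholder.
task = Task(
    task_id="demo-001",
    rewritten_prompt="Summarize the meeting notes into a report.",
    role_class="Executive",
    skill_subclass="reporting",
    fixtures=["/root/inputs/notes.docx"],
    hard_rules=[lambda outputs: "report.docx" in outputs],
    rubric_weights={"coverage": 2.0, "clarity": 1.0},
)
```

Separating binary hard rules from graded rubrics keeps a missing deliverable from being masked by a high semantic score, which is one plausible reason to carry both on each task.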

If this is right

  • Evaluations must report harness-model combinations rather than model-only results.
  • Artifact delivery and visual quality must be scored separately from task completion.
  • Cost and runtime must be measured and reported for each configuration.
  • Skill-transfer behavior across role and skill subclasses must be tracked.
  • Single-number summaries are insufficient for enterprise agent assessment.
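
A minimal sketch of what reporting along these lines could look like: one record per harness-model pair, with a separate field for each dimension instead of one aggregate. The field names, the harness and model labels, and every number below are hypothetical placeholders, not results from the paper. Skill-transfer behavior is left out because this page does not define a scalar measure for it.

```python
from dataclasses import dataclass, asdict


@dataclass(frozen=True)
class ConfigReport:
    """One row per harness-model pair; all field names are illustrative."""
    harness: str
    model: str
    task_score: float         # mean task score in [0, 1]
    artifact_delivery: float  # fraction of tasks with a deliverable produced
    visual_quality: float     # mean visual-quality score in [0, 1]
    cost_usd: float           # mean cost per task
    runtime_s: float          # mean wall-clock seconds per task


def format_table(reports):
    """Render a non-empty list of reports as plain text, one column per dimension."""
    headers = list(asdict(reports[0]).keys())
    rows = [[str(value) for value in asdict(report).values()] for report in reports]
    widths = [
        max(len(header), *(len(row[i]) for row in rows))
        for i, header in enumerate(headers)
    ]
    lines = ["  ".join(h.ljust(w) for h, w in zip(headers, widths))]
    lines += ["  ".join(c.ljust(w) for c, w in zip(row, widths)) for row in rows]
    return "\n".join(lines)


# Placeholder configurations; the numbers are invented for the example.
reports = [
    ConfigReport("harness_a", "model_x", 0.5, 0.9, 0.6, 1.25, 300.0),
    ConfigReport("harness_b", "model_x", 0.4, 0.95, 0.7, 0.5, 120.0),
]
table = format_table(reports)
```

Keeping harness and model as separate key fields makes it impossible to publish a model-only row by accident, which is the point of the first bullet above.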

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Companies considering agent deployment could use the protocol to test internal tasks before scaling.
  • The gap at 0.663 suggests that current agents may require additional tooling or human oversight for routine enterprise work.
  • Future benchmarks built from private data could adopt similar fixture-and-rubric methods to enable comparison while preserving confidentiality.

Load-bearing premise

Tasks recovered and rewritten from proprietary sessions can be made reproducible with fixtures and rubrics in a manner that supports valid cross-model comparisons without access to the original data.

What would settle it

A new harness-model combination that produces a single score above 0.663 while also producing consistent rankings across artifact quality, cost, runtime, and skill transfer would contradict the claim that multiple separate dimensions are required.
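
One way to make this test concrete is to compare how configurations rank under each dimension. The sketch below computes Kendall's tau between two score lists: values near 1 for every pair of dimensions would suggest that a single score loses little, while low or negative values would support reporting the dimensions separately. Choosing Kendall's tau is an assumption of this sketch, not a statistic the paper is known to use, and all scores are hypothetical placeholders.

```python
from itertools import combinations


def kendall_tau(xs, ys):
    """Kendall tau-a between two equal-length score lists (tied pairs count as zero)."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length lists with at least two items")
    concordant = discordant = 0
    for i, j in combinations(range(len(xs)), 2):
        sign = (xs[i] - xs[j]) * (ys[i] - ys[j])
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
    n_pairs = len(xs) * (len(xs) - 1) // 2
    return (concordant - discordant) / n_pairs


# Hypothetical scores for four configurations (placeholders, not paper results).
task_score = [0.60, 0.55, 0.50, 0.40]
visual_quality = [0.70, 0.65, 0.50, 0.45]   # same ordering as task_score
cost_efficiency = [0.20, 0.50, 0.40, 0.90]  # higher means cheaper; mostly reversed

tau_visual = kendall_tau(task_score, visual_quality)
tau_cost = kendall_tau(task_score, cost_efficiency)
```

In this toy data the visual-quality ranking agrees fully with the task-score ranking while the cost ranking mostly reverses it, which is the kind of disagreement that would make a single aggregate misleading.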

Original abstract

Enterprise agents increasingly operate inside workspaces: they read heterogeneous files, invoke tools, and deliver business artifacts. We introduce EnterpriseClawBench, an enterprise agent benchmark constructed from proprietary, real-world agent sessions. Starting from a large archive of workplace sessions, the EnterpriseClawBench produces 852 reproducible tasks, each paired with recovered fixtures, rewritten prompts, role classes, skill subclasses, hard rules, and semantic rubrics. Because the sessions contain internal enterprise content, we do not release the benchmark data; instead, our reusable contribution is the construction and evaluation protocol. On EnterpriseClawBench, the best configuration reaches only 0.663 (Codex with GPT-5.5). These results show that enterprise agent evaluation must report harness-model combinations, artifact delivery, visual quality, cost, runtime, and skill-transfer behavior, rather than collapsing performance into a single score. Code: https://github.com/FrontisAI/EnterpriseClawBench

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper introduces EnterpriseClawBench, a benchmark built from proprietary real-world workplace sessions that yields 852 reproducible tasks with associated fixtures, prompts, role classes, skill subclasses, hard rules, and semantic rubrics. Because the content is proprietary, the data is not released; the construction and evaluation protocol is offered as the reusable contribution. The best configuration achieves a score of 0.663, from which the authors conclude that enterprise agent evaluation should report multiple dimensions rather than a single score.

Significance. If the protocol is sound, this work could be significant for the field by demonstrating the gap between current agent capabilities and enterprise requirements, and by promoting more comprehensive evaluation practices that include cost, runtime, and skill transfer. The open-sourcing of the protocol code is a strength that allows others to apply similar methods. However, the proprietary data limits the benchmark's immediate utility and verifiability.

major comments (2)
  1. [Abstract] The abstract reports the creation of 852 tasks and a performance of 0.663 but supplies no information on validation, inter-rater reliability, or how exclusions or rewrites were performed. This absence makes it impossible to judge the reliability of the tasks and the support they provide for the stated conclusions about agent performance.
  2. [Benchmark Construction] The central claim that the tasks are reproducible relies on the protocol, but without release of the tasks, fixtures, rewritten prompts, and rubrics, external parties cannot re-execute the harness or recompute the 0.663 score. This is load-bearing for the recommendation that evaluations must report harness-model combinations and other metrics.
minor comments (1)
  1. Clarify in the abstract or introduction whether the GitHub repository includes any sample data or only the protocol code.
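
For reference, the inter-rater reliability raised in major comment 1 is commonly summarized with Cohen's kappa, which corrects raw agreement between two raters for agreement expected by chance. The sketch below is a plain-Python version of that standard statistic. The pass/fail labels are hypothetical, and nothing on this page says which reliability measure, if any, the paper reports.

```python
from collections import Counter


def cohen_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters labelling the same items."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(rater_a)
    # Observed agreement: fraction of items with identical labels.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement from each rater's marginal label frequencies.
    count_a, count_b = Counter(rater_a), Counter(rater_b)
    expected = sum(count_a[label] * count_b[label] for label in count_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (observed - expected) / (1 - expected)


# Hypothetical judgments on six rubric items (placeholders, not paper data).
judge_1 = ["pass", "pass", "fail", "fail", "pass", "fail"]
judge_2 = ["pass", "pass", "fail", "pass", "pass", "fail"]
kappa = cohen_kappa(judge_1, judge_2)
```

Here the raters agree on five of six items, but half of that agreement is expected by chance, so kappa is noticeably lower than the raw agreement rate.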

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We address the two major comments point by point below.

  1. Referee: [Abstract] The abstract reports the creation of 852 tasks and a performance of 0.663 but supplies no information on validation, inter-rater reliability, or how exclusions or rewrites were performed. This absence makes it impossible to judge the reliability of the tasks and the support they provide for the stated conclusions about agent performance.

    Authors: We agree that the abstract omits these details. The full manuscript describes the validation steps, inter-rater reliability checks, and procedures for exclusions and rewrites in the Benchmark Construction and Evaluation sections. We will revise the abstract to include a concise summary of these elements.
    revision: partial

  2. Referee: [Benchmark Construction] The central claim that the tasks are reproducible relies on the protocol, but without release of the tasks, fixtures, rewritten prompts, and rubrics, external parties cannot re-execute the harness or recompute the 0.663 score. This is load-bearing for the recommendation that evaluations must report harness-model combinations and other metrics.

    Authors: The manuscript states that the data cannot be released due to proprietary enterprise content and positions the open-sourced construction and evaluation protocol as the reusable contribution. The protocol and harness code enable others to apply the same methodology to their own data. The 0.663 result illustrates the framework on our dataset; the recommendation to report multiple dimensions (harness-model pairs, cost, runtime, etc.) follows from the evaluation design itself and does not require external recomputation of this specific score.
    revision: no

Circularity Check

0 steps flagged

No circularity; empirical benchmark protocol is self-contained

The paper introduces EnterpriseClawBench via an empirical construction protocol that recovers tasks from proprietary sessions, produces 852 tasks with fixtures/rubrics, and reports an observed maximum score of 0.663 under specific harness-model pairs. No equations, fitted parameters, predictions, or first-principles derivations appear. The central claim (need for multi-dimensional reporting) follows directly from the empirical results and protocol description without reducing to self-definition, self-citation chains, or renamed inputs. The non-release of data affects verifiability but does not create circularity in the reported construction or measurements.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The work centers on an empirical benchmark construction protocol extracted from existing sessions rather than any mathematical model, so no free parameters, axioms, or invented entities are introduced.

pith-pipeline@v0.9.1-grok · 5713 in / 1177 out tokens · 51409 ms · 2026-06-26T08:18:22.252882+00:00 · methodology
