Design and Report Benchmarks for Knowledge Work
Pith reviewed 2026-05-25 04:37 UTC · model grok-4.3
The pith
Benchmark scores for knowledge-work AI support reliable claims only when tasks define the work activity, the tested setting, and the scored product.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The central claim is that benchmarked tasks for knowledge-work AI represent work claims only when the design process first defines the work activity under evaluation, then specifies the tested setting with its local materials, tools, roles, and constraints, and finally scores the work product that must remain usable in downstream workflows. The paper derives this mapping from work studies and demonstrates it through an inventory of eighteen activities plus case analyses of three benchmarks.
What carries the argument
The three-step approach of defining the work activity under evaluation, specifying the tested setting, and scoring the appropriate work product.
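One way to make the three steps concrete is as a reporting record that a benchmark author fills in. The sketch below is only an illustration under stated assumptions: the class names `SettingSpec` and `BenchmarkSpec`, the field names, and the helper `unspecified_elements` are hypothetical and do not come from the paper, and the example values are placeholders rather than a description of any real benchmark.

```python
from dataclasses import dataclass, field


@dataclass
class SettingSpec:
    """Step 2: the tested setting (hypothetical field names)."""
    materials: list[str] = field(default_factory=list)
    tools: list[str] = field(default_factory=list)
    roles: list[str] = field(default_factory=list)
    constraints: list[str] = field(default_factory=list)


@dataclass
class BenchmarkSpec:
    """Hypothetical reporting record covering the three steps."""
    work_activity: str    # step 1: the work activity under evaluation
    setting: SettingSpec  # step 2: the tested setting
    scored_product: str   # step 3: the work product that is scored


def unspecified_elements(spec: BenchmarkSpec) -> list[str]:
    """Return the names of elements that the spec leaves empty."""
    missing = []
    if not spec.work_activity.strip():
        missing.append("work_activity")
    for name in ("materials", "tools", "roles", "constraints"):
        if not getattr(spec.setting, name):
            missing.append(f"setting.{name}")
    if not spec.scored_product.strip():
        missing.append("scored_product")
    return missing


# Placeholder values, not taken from any real benchmark.
example = BenchmarkSpec(
    work_activity="draft a written analysis from provided documents",
    setting=SettingSpec(materials=["source documents"], tools=["spreadsheet"]),
    scored_product="final report file",
)
print(unspecified_elements(example))
```

A record like this only checks that each element has been stated; whether the stated activity, setting, and product actually match real work is the question the rest of the review turns on.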
Load-bearing premise
Insights from studies of human knowledge work on roles, responsibilities, materials, tools, and downstream-usable artifacts can be translated directly into benchmark design rules without losing critical aspects or creating new mismatches.
What would settle it
A benchmark redesigned with explicit activity definition, setting specification, and product scoring that still yields scores unrelated to performance on the corresponding real-world work activity in actual deployment settings.
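As a rough sketch of how "scores unrelated to performance" might be checked, one could compare how systems rank on the benchmark with how they rank on the deployed activity. The choice of Spearman rank correlation is an assumption made here for illustration; the paper does not prescribe a statistic. The function names and all numbers below are hypothetical placeholders, not results.

```python
def ranks(values):
    """Average ranks (1-based); tied values share their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j across the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = mean_rank
        i = j + 1
    return result


def pearson(x, y):
    """Pearson correlation; undefined (division by zero) if x or y is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / (vx * vy) ** 0.5


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


# Hypothetical placeholder scores for five systems (not real data).
benchmark_scores = [0.62, 0.71, 0.55, 0.80, 0.67]
deployment_scores = [0.40, 0.35, 0.50, 0.38, 0.45]
print(spearman(benchmark_scores, deployment_scores))
```

A correlation near zero or below on a redesigned benchmark would count against the framework; with only a handful of systems, any such estimate would be noisy and should be read with that in mind.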
Original abstract
The development of LLM agents has led to a growing body of work on knowledge-work AI, including coding, research, and healthcare. However, current knowledge-work evaluation and benchmark design still largely follow the logic of traditional NLP tasks. As a result, higher benchmark performance does not reliably show that a system can carry out knowledge work in real-world deployment settings. This paper contributes a three-step approach for making explicit how benchmarked tasks represent the work claims attached to their scores: defining the work activity under evaluation, specifying the tested setting, and scoring the appropriate work product. We review work studies showing that knowledge work is organized through roles and responsibilities, local materials and tools, and artifacts that must remain usable in downstream workflows. We then translate these concerns into benchmark design and reporting guidance, covering how tasks should be mapped to work activities, how tested settings should specify materials, tools, roles, and constraints, and how scoring should focus on the work product left by the system. To name the work activity being evaluated and distinguish it from common benchmark tasks, we derive an inventory of 18 work activities from the O*NET occupational task database. We demonstrate the approach through three benchmark case analyses: GDPval, a non-code occupational deliverable benchmark; OfficeQA Pro, a grounded document-analysis benchmark scored by final answers; and APEX-SWE, a software-engineering benchmark with executable scored products. These cases show how benchmark design choices shape the strongest work claim a score can support, and where gaps arise between the benchmarked task, tested setting, scored product, and broader work claim.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims that current benchmarks for knowledge-work AI (e.g., coding, research) follow traditional NLP task logic and thus do not reliably support real-world work claims. It contributes a three-step approach: defining the work activity under evaluation (via an inventory of 18 activities derived from the public O*NET database), specifying the tested setting (materials, tools, roles, constraints), and scoring the appropriate work product. The approach follows from a review of work studies on roles/responsibilities, local materials/tools, and downstream-usable artifacts, which the paper translates into design/reporting guidance and illustrates via post-hoc case analyses of GDPval, OfficeQA Pro, and APEX-SWE.
Significance. If the framework holds, it would provide a structured, explicit method to align benchmark scores with deployment claims for LLM agents in knowledge work, filling a recognized gap. The derivation of the 18-activity inventory from the public O*NET database is a reproducible, non-circular strength that supports the proposal's grounding in external work studies rather than self-referential fitting.
major comments (2)
- [paragraph on reviewing work studies and translating concerns into guidance] The translation from reviewed work-study concerns (roles, local materials/tools, downstream artifacts) into benchmark design rules is the load-bearing step for the central claim that the three-step approach makes work claims explicit. The manuscript acknowledges situated contextual factors in the work-studies review but provides no systematic mapping, coverage check, or independent validation that the resulting rules avoid omitting those factors or introducing new mismatches.
- [demonstration through three benchmark case analyses] The three case analyses (GDPval, OfficeQA Pro, APEX-SWE) demonstrate how design choices shape the strongest supported work claim and identify gaps, but they are retrospective applications to existing benchmarks. This does not test whether prospectively applying the three-step approach during benchmark design would produce measurably better fidelity between scores and work claims.
minor comments (2)
- The derivation of the 18 work activities from O*NET is referenced but not accompanied by an explicit aggregation procedure or example mappings from specific occupational tasks; an appendix with this detail would aid reproducibility.
- Terms such as 'work product' and 'tested setting' are used throughout without early formal definitions, which may reduce clarity for readers outside work-studies literature.
Simulated Author's Rebuttal
We thank the referee for the constructive review and the recognition of the framework's potential value. We respond point-by-point to the major comments below.
- Referee: The translation from reviewed work-study concerns (roles, local materials/tools, downstream artifacts) into benchmark design rules is the load-bearing step for the central claim that the three-step approach makes work claims explicit. The manuscript acknowledges situated contextual factors in the work-studies review but provides no systematic mapping, coverage check, or independent validation that the resulting rules avoid omitting those factors or introducing new mismatches.
  Authors: We agree that the translation step would be strengthened by an explicit mapping. The current manuscript synthesizes the reviewed work-study literature directly into the three design rules and the 18-activity inventory (derived from O*NET), but does not include a tabular or systematic coverage check against the original concerns. We will add a dedicated subsection that maps each reviewed factor (roles/responsibilities, local materials/tools, downstream-usable artifacts) to the corresponding guidance on activity definition, setting specification, and product scoring, and will note any potential gaps or unaddressed contextual elements.
  Revision: yes
- Referee: The three case analyses (GDPval, OfficeQA Pro, APEX-SWE) demonstrate how design choices shape the strongest supported work claim and identify gaps, but they are retrospective applications to existing benchmarks. This does not test whether prospectively applying the three-step approach during benchmark design would produce measurably better fidelity between scores and work claims.
  Authors: The case analyses are retrospective by design: the paper's primary contribution is a methodological proposal illustrated on three published benchmarks to show how the framework surfaces gaps between tasks, settings, products, and work claims. A prospective test (designing a new benchmark from scratch with the method and then measuring improved fidelity) would require a separate empirical study and is outside the scope of the current manuscript. We will revise the text to state this limitation explicitly and to frame the cases as illustrative rather than as a validation of prospective efficacy.
  Revision: partial
- A prospective empirical test of the framework on newly designed benchmarks cannot be performed within the bounds of this methodological paper.
Circularity Check
No circularity; derivation draws on external work studies and O*NET
The paper proposes a three-step approach (define work activity, specify tested setting, score work product) by reviewing external work studies on roles/responsibilities/materials/tools/artifacts and translating those into benchmark guidance. It derives an inventory of 18 activities from the public O*NET database. No self-citations, fitted parameters, or self-definitional reductions are present in the derivation chain. The case analyses apply the method to existing benchmarks without reducing claims to inputs by construction. The central claim remains independent of its own outputs.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Knowledge work is organized through roles and responsibilities, local materials and tools, and artifacts that must remain usable in downstream workflows.