
arxiv: 2605.23262 · v1 · pith:V3CLH7PVnew · submitted 2026-05-22 · 💻 cs.AI

Design and Report Benchmarks for Knowledge Work

Pith reviewed 2026-05-25 04:37 UTC · model grok-4.3

keywords benchmark design · knowledge work · AI evaluation · LLM agents · work studies · occupational tasks · software engineering benchmarks · document analysis

The pith

Benchmark scores for knowledge-work AI support reliable claims only when tasks define the work activity, the tested setting, and the scored product.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that current benchmarks for LLM agents in areas like coding and research follow traditional NLP task formats, so higher scores do not reliably indicate ability to perform real knowledge work in deployment. It contributes a three-step method to make the connection explicit: name the work activity being evaluated, specify the setting with its materials, tools, roles and constraints, and score the actual work product left for downstream use. This method comes from reviewing studies of how knowledge work organizes around roles, local resources, and usable artifacts, then turning those observations into concrete design and reporting rules. The paper supplies an inventory of eighteen work activities drawn from an occupational task database to help distinguish the evaluated activity from common benchmark tasks. Three case analyses of existing benchmarks illustrate how the choices in each step determine the strength of the work claim a score can support.

Core claim

The central claim is that benchmarked tasks for knowledge-work AI represent work claims only when the design process first defines the work activity under evaluation, then specifies the tested setting with its local materials, tools, roles and constraints, and finally scores the work product that must remain usable in downstream workflows; this mapping is derived from work studies and demonstrated through an inventory of eighteen activities plus case analyses of three benchmarks.

What carries the argument

The three-step approach of defining the work activity under evaluation, specifying the tested setting, and scoring the appropriate work product.
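
As an illustration only, the three steps can be read as fields of a reporting record that a benchmark either fills in or leaves blank. The sketch below is not from the paper: the class names `TestedSetting` and `BenchmarkReport`, the field names, and the example values are all hypothetical, chosen to mirror the activity, setting (materials, tools, roles, constraints), and scored-product elements named above.

```python
from dataclasses import dataclass, field


@dataclass
class TestedSetting:
    """Step 2: the setting in which the task is run (hypothetical schema)."""
    materials: list[str] = field(default_factory=list)
    tools: list[str] = field(default_factory=list)
    roles: list[str] = field(default_factory=list)
    constraints: list[str] = field(default_factory=list)


@dataclass
class BenchmarkReport:
    """One benchmark's reporting record under the three-step approach."""
    name: str
    work_activity: str = ""  # Step 1: the work activity under evaluation
    setting: TestedSetting = field(default_factory=TestedSetting)  # Step 2
    scored_product: str = ""  # Step 3: the work product that is scored

    def missing_elements(self) -> list[str]:
        """Return the names of reporting elements left unspecified."""
        missing = []
        if not self.work_activity.strip():
            missing.append("work_activity")
        for name in ("materials", "tools", "roles", "constraints"):
            if not getattr(self.setting, name):
                missing.append(f"setting.{name}")
        if not self.scored_product.strip():
            missing.append("scored_product")
        return missing


# Hypothetical example: a benchmark that reports no roles or constraints.
report = BenchmarkReport(
    name="example-benchmark",
    work_activity="analyze documents to answer a grounded question",
    setting=TestedSetting(materials=["provided document set"], tools=["search"]),
    scored_product="final answer string",
)
print(report.missing_elements())  # ['setting.roles', 'setting.constraints']
```

A record like this only checks that each element is stated; it says nothing about whether the stated setting or product is a faithful stand-in for the real work, which is the question the paper's guidance addresses.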

Load-bearing premise

Insights from studies of human knowledge work on roles, responsibilities, materials, tools, and downstream-usable artifacts can be translated directly into benchmark design rules without losing critical aspects or creating new mismatches.

What would settle it

A benchmark redesigned with explicit activity definition, setting specification, and product scoring that still yields scores unrelated to performance on the corresponding real-world work activity in actual deployment settings.
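
One way to make "unrelated" concrete is a rank correlation between per-system benchmark scores and some measure of how the same systems perform on the corresponding work activity in deployment. The paper does not name a statistic; Spearman correlation is an assumption made here for illustration, and the function names `ranks` and `spearman` are hypothetical helpers. A value near zero across enough systems would be the kind of evidence described above.

```python
def ranks(values):
    """Assign 1-based ranks, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result


def spearman(xs, ys):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length sequences with at least 2 items")
    rx, ry = ranks(xs), ranks(ys)
    mean_x, mean_y = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(rx, ry))
    var_x = sum((a - mean_x) ** 2 for a in rx)
    var_y = sum((b - mean_y) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined when one sequence is constant")
    return cov / (var_x * var_y) ** 0.5
```

Calling `spearman(benchmark_scores, deployment_outcomes)` with one entry per system gives a value between -1 and 1. How deployment performance is measured, and how many systems are needed for the estimate to be meaningful, are left open here.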

Figures

Figures reproduced from arXiv: 2605.23262 by Cyrus Ayubcha, Hongbin Na, Levi Lian, Yining Hua.

Figure 1: Construction pipeline for the 18-work-activity inventory.
Figure 2: Work-activity atlas of O*NET knowledge work.
Original abstract

The development of LLM agents has led to a growing body of work on knowledge-work AI, including coding, research, and healthcare. However, current knowledge-work evaluation and benchmark design still largely follow the logic of traditional NLP tasks. As a result, higher benchmark performance does not reliably show that a system can carry out knowledge work in real-world deployment settings. This paper contributes a three-step approach for making explicit how benchmarked tasks represent the work claims attached to their scores: defining the work activity under evaluation, specifying the tested setting, and scoring the appropriate work product. We review work studies showing that knowledge work is organized through roles and responsibilities, local materials and tools, and artifacts that must remain usable in downstream workflows. We then translate these concerns into benchmark design and reporting guidance, covering how tasks should be mapped to work activities, how tested settings should specify materials, tools, roles, and constraints, and how scoring should focus on the work product left by the system. To name the work activity being evaluated and distinguish it from common benchmark tasks, we derive an inventory of 18 work activities from the O*NET occupational task database. We demonstrate the approach through three benchmark case analyses: GDPval, a non-code occupational deliverable benchmark; OfficeQA Pro, a grounded document-analysis benchmark scored by final answers; and APEX-SWE, a software-engineering benchmark with executable scored products. These cases show how benchmark design choices shape the strongest work claim a score can support, and where gaps arise between the benchmarked task, tested setting, scored product, and broader work claim.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper claims that current benchmarks for knowledge-work AI (e.g., coding, research) follow traditional NLP task logic and thus do not reliably support real-world work claims. It contributes a three-step approach—defining the work activity under evaluation (via an inventory of 18 activities derived from the public O*NET database), specifying the tested setting (materials, tools, roles, constraints), and scoring the appropriate work product—after reviewing work studies on roles/responsibilities, local materials/tools, and downstream-usable artifacts, translating those into design/reporting guidance, and illustrating via post-hoc case analyses of GDPval, OfficeQA Pro, and APEX-SWE.

Significance. If the framework holds, it would provide a structured, explicit method to align benchmark scores with deployment claims for LLM agents in knowledge work, filling a recognized gap. The derivation of the 18-activity inventory from the public O*NET database is a reproducible, non-circular strength that supports the proposal's grounding in external work studies rather than self-referential fitting.

major comments (2)
  1. [paragraph on reviewing work studies and translating concerns into guidance] The translation from reviewed work-study concerns (roles, local materials/tools, downstream artifacts) into benchmark design rules is the load-bearing step for the central claim that the three-step approach makes work claims explicit. The manuscript acknowledges situated contextual factors in the work-studies review but provides no systematic mapping, coverage check, or independent validation that the resulting rules avoid omitting those factors or introducing new mismatches.
  2. [demonstration through three benchmark case analyses] The three case analyses (GDPval, OfficeQA Pro, APEX-SWE) demonstrate how design choices shape the strongest supported work claim and identify gaps, but they are retrospective applications to existing benchmarks. This does not test whether prospectively applying the three-step approach during benchmark design would produce measurably better fidelity between scores and work claims.
minor comments (2)
  1. The derivation of the 18 work activities from O*NET is referenced but not accompanied by an explicit aggregation procedure or example mappings from specific occupational tasks; an appendix with this detail would aid reproducibility.
  2. Terms such as 'work product' and 'tested setting' are used throughout without early formal definitions, which may reduce clarity for readers outside work-studies literature.

Simulated Author's Rebuttal

2 responses · 1 unresolved

We thank the referee for the constructive review and the recognition of the framework's potential value. We respond point-by-point to the major comments below.

  1. Referee: The translation from reviewed work-study concerns (roles, local materials/tools, downstream artifacts) into benchmark design rules is the load-bearing step for the central claim that the three-step approach makes work claims explicit. The manuscript acknowledges situated contextual factors in the work-studies review but provides no systematic mapping, coverage check, or independent validation that the resulting rules avoid omitting those factors or introducing new mismatches.

    Authors: We agree that the translation step would be strengthened by an explicit mapping. The current manuscript synthesizes the reviewed work-study literature directly into the three design rules and the 18-activity inventory (derived from O*NET), but does not include a tabular or systematic coverage check against the original concerns. We will add a dedicated subsection that maps each reviewed factor (roles/responsibilities, local materials/tools, downstream-usable artifacts) to the corresponding guidance on activity definition, setting specification, and product scoring, and will note any potential gaps or unaddressed contextual elements. revision: yes

  2. Referee: The three case analyses (GDPval, OfficeQA Pro, APEX-SWE) demonstrate how design choices shape the strongest supported work claim and identify gaps, but they are retrospective applications to existing benchmarks. This does not test whether prospectively applying the three-step approach during benchmark design would produce measurably better fidelity between scores and work claims.

    Authors: The case analyses are retrospective by design: the paper's primary contribution is a methodological proposal illustrated on three published benchmarks to show how the framework surfaces gaps between tasks, settings, products, and work claims. A prospective test—designing a new benchmark from scratch with the method and then measuring improved fidelity—would require a separate empirical study and is outside the scope of the current manuscript. We will revise the text to state this limitation explicitly and to frame the cases as illustrative rather than as a validation of prospective efficacy. revision: partial

standing simulated objections not resolved
  • A prospective empirical test of the framework on newly designed benchmarks cannot be performed within the bounds of this methodological paper.

Circularity Check

0 steps flagged

No circularity; derivation draws on external work studies and O*NET


The paper proposes a three-step approach (define work activity, specify tested setting, score work product) by reviewing external work studies on roles/responsibilities/materials/tools/artifacts and translating those into benchmark guidance. It derives an inventory of 18 activities from the public O*NET database. No self-citations, fitted parameters, or self-definitional reductions are present in the derivation chain. The case analyses apply the method to existing benchmarks without reducing claims to inputs by construction. The central claim remains independent of its own outputs.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The framework rests on the domain assumption that knowledge-work organization described in work studies can be operationalized as benchmark design rules; no free parameters or invented entities are introduced.

axioms (1)
  • domain assumption: Knowledge work is organized through roles and responsibilities, local materials and tools, and artifacts that must remain usable in downstream workflows.
    Invoked in the abstract as the basis for translating concerns into benchmark guidance.

pith-pipeline@v0.9.0 · 5815 in / 1329 out tokens · 25201 ms · 2026-05-25T04:37:06.557193+00:00 · methodology

