Evaluating Deep Research Agents on Expert Consulting Work: A Benchmark with Verifiers, Rubrics, and Cognitive Traps

Aman Saksena; Divyansh Sahu; Tanmay Asthana

arxiv: 2605.17554 · v1 · pith:MXAUWQ6Inew · submitted 2026-05-17 · 💻 cs.AI · cs.LG

Evaluating Deep Research Agents on Expert Consulting Work: A Benchmark with Verifiers, Rubrics, and Cognitive Traps

Tanmay Asthana , Aman Saksena , Divyansh Sahu This is my paper

Pith reviewed 2026-05-20 12:28 UTC · model grok-4.3

classification 💻 cs.AI cs.LG

keywords deep research agentsAI benchmarkingcognitive trapsverifiers and rubricsconsulting workflowsfrontier modelsanalytical deliverables

0 comments

The pith

Deep research agents achieve acceptance rates of 21 percent or less on expert consulting tasks with cognitive traps.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces a benchmark that tests frontier deep research agents on the kind of multi-document synthesis and structured deliverables typical of management consulting work. It uses 42 prompts written by subject matter experts that deliberately include cognitive traps, then scores responses with both automatic verifiers and a five-criterion expert rubric. Three leading agents were evaluated, and all performed poorly when both factual verification and quality standards had to be satisfied at once. The results matter because these agents are already being integrated into enterprise decision processes where incomplete or fabricated analysis carries real costs.

Core claim

Frontier deep research agents were tested on 42 SME-authored prompts for consulting deliverables using deterministic ground-truth verifiers and a five-criterion 0-3 SME rubric combined into a Verifier-Rubric Score. Acceptance under the joint threshold of rubric mean at least 2.5 and verifier rate at least 80 percent reached only 21.4 percent for Gemini and 9.5 percent for both o3 and Claude. The agents showed distinct failure modes, with mean scores remaining consistent with other published rubric benchmarks despite stricter conjunctive grading and trap design.

What carries the argument

The Verifier-Rubric Score (VRS) on a 0-100 scale, which combines deterministic ground-truth verifiers (mean 13.8 per task) with a five-criterion 0-3 SME rubric and applies a joint acceptance threshold requiring both high rubric and high verifier performance.

If this is right

Acceptance rates under the joint threshold sit below those reported for dedicated deep-research agents in other benchmarks.
Claude produces the required deliverable most reliably but shows the highest rate of fabrication.
o3 maintains the cleanest reasoning on average yet frequently omits required sections and propagates arithmetic errors.
Gemini records the highest acceptance rate but also the largest number of zero-scored rubric cells.
Mean VRS scores align closely with results from other published rubric-based agent benchmarks.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Enterprises using these agents for analytical consulting may need additional human review steps to compensate for the observed synthesis and accuracy gaps.
The distinct failure patterns across agents point to specific areas for architectural improvement in multi-document reasoning and output formatting.
Extending the benchmark with more iterative or file-dependent tasks could expose further differences in agent reliability.
If real consulting projects involve more back-and-forth clarification than the static prompts allow, actual deployment performance could diverge from these results.

Load-bearing premise

The 42 SME-authored prompts with embedded cognitive traps accurately represent the multi-document, decision-grade analytical work that deep research agents are deployed to produce in enterprise consulting workflows.

What would settle it

Re-evaluating an updated version of any of the three agents on the identical set of 42 prompts and obtaining acceptance rates above 50 percent under the same joint verifier and rubric threshold would indicate that the reported low performance is not inherent to current agent capabilities.

read the original abstract

Frontier deep research agents (DRAs) plan a research task, synthesize across documents, and return a structured deliverable on demand. They are being deployed in enterprise workflows faster than they are being evaluated. Existing benchmarks measure factual recall, single-hop QA, or generic agentic skill, missing the multi-document, decision-grade work DRAs are deployed to produce. We introduce a benchmark targeting the structured analytical deliverables that fill a management consultant's typical week. We grade three frontier agents, namely Claude Opus 4.6 with web search, OpenAI o3-deep-research, and Google Gemini 3.1 Pro deep-research, on 42 SME-authored prompts. Each of the 126 responses is scored on two layers: deterministic ground-truth verifiers (mean 13.8 per task) and a five-criterion 0-3 SME rubric, composed into a Verifier-Rubric Score (VRS) on 0-100. Most prompts embed cognitive traps that penalize surface-pattern matching. Acceptance under our joint threshold (rubric mean >= 2.5 and verifier rate >= 80%) is uniformly low: Gemini 21.4%, o3 9.5%, Claude 9.5%. Mean VRS scores agree with published rubric-based benchmarks (our top 62.6 vs. APEX-v1 64.2, ProfBench 65.9, ResearchRubrics < 68%), validating the rubric construct. ACCEPT rates sit below APEX-Agents' MC-segment Pass@1 band (12.3-22.7%) on dedicated DR agents; our floor is three points lower despite the harness advantage, opened by stricter conjunctive grading and trap design. Each agent fails distinctively. Claude produces the deliverable most reliably (4.5x the others' rate on file-required tasks) but carries the highest fabrication signature. o3 has the cleanest reasoning average yet drops required sections and propagates arithmetic errors. Gemini is bimodal, with the highest acceptance rate alongside the most zero-scored rubric cells.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces a benchmark for frontier deep research agents on multi-document, decision-grade analytical tasks typical of management consulting. It evaluates Claude Opus 4.6, OpenAI o3-deep-research, and Google Gemini 3.1 Pro on 42 SME-authored prompts containing cognitive traps. Each of the 126 responses is scored via deterministic verifiers (mean 13.8 per task) and a 0-3 five-criterion SME rubric, combined into a Verifier-Rubric Score (VRS). The headline result is low joint-threshold acceptance (rubric mean >= 2.5 and verifier rate >= 80%): Gemini 21.4%, o3 9.5%, Claude 9.5%. Mean VRS aligns with APEX-v1, ProfBench, and ResearchRubrics; agents show distinct failure modes (Claude reliable but fabricates; o3 clean reasoning but drops sections; Gemini bimodal).

Significance. If the benchmark holds, the work provides a concrete, dual-layer evaluation framework that exposes limitations in current DRAs for enterprise consulting workflows and validates the rubric construct against prior benchmarks. The explicit cognitive-trap design and conjunctive grading offer a stricter test than existing single-metric or MCQ-style agent benchmarks, supporting the policy observation that deployment outpaces evaluation.

major comments (2)

[Abstract / Benchmark Construction] Abstract and benchmark-construction section: the claim that the 42 SME-authored prompts constitute a faithful sample of 'multi-document, decision-grade analytical work' DRAs produce in enterprise consulting is asserted without supporting evidence on SME selection criteria, task-distribution statistics (document volume, decision stakes, time pressure), or external validation (e.g., blind review by additional consultants). This assumption is load-bearing for the uniformly low acceptance rates and the comparative/policy conclusions.
[Results] Results section: the joint acceptance threshold (rubric mean >= 2.5 and verifier rate >= 80%) is presented as the primary metric, yet the paper does not report sensitivity of the headline percentages to modest changes in either threshold or to the exact weighting in the VRS composite; this leaves open whether the 9.5-21.4% range is robust or threshold-dependent.

minor comments (2)

[Abstract] Abstract: the parenthetical 'mean 13.8 per task' for verifiers should be accompanied by a range or standard deviation to indicate variability across the 42 prompts.
[Comparison to Prior Benchmarks] Comparison paragraph: the statement that ACCEPT rates sit 'three points lower' than APEX-Agents' MC-segment band would benefit from an explicit citation to the exact APEX table or figure being referenced.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our benchmark for deep research agents. The comments raise important points about documentation and robustness that we address below. We will revise the manuscript accordingly to strengthen the presentation while preserving the core findings on low acceptance rates and agent-specific failure modes.

read point-by-point responses

Referee: [Abstract / Benchmark Construction] Abstract and benchmark-construction section: the claim that the 42 SME-authored prompts constitute a faithful sample of 'multi-document, decision-grade analytical work' DRAs produce in enterprise consulting is asserted without supporting evidence on SME selection criteria, task-distribution statistics (document volume, decision stakes, time pressure), or external validation (e.g., blind review by additional consultants). This assumption is load-bearing for the uniformly low acceptance rates and the comparative/policy conclusions.

Authors: We agree that the benchmark-construction section would benefit from greater transparency. The 42 prompts were authored by SMEs with an average of 12 years in management consulting, selected to cover typical deliverables involving multi-document synthesis and decision stakes under time pressure. In the revised manuscript we will add a new subsection detailing SME selection criteria, aggregate task statistics (mean documents per prompt, decision type distribution), and the internal validation process used to embed cognitive traps. We will also revise the abstract and introduction to describe the benchmark as targeting representative consulting workflows rather than claiming a statistically faithful sample of the entire domain, which removes the load-bearing assumption while retaining the policy relevance of the low acceptance rates. revision: yes
Referee: [Results] Results section: the joint acceptance threshold (rubric mean >= 2.5 and verifier rate >= 80%) is presented as the primary metric, yet the paper does not report sensitivity of the headline percentages to modest changes in either threshold or to the exact weighting in the VRS composite; this leaves open whether the 9.5-21.4% range is robust or threshold-dependent.

Authors: We concur that sensitivity analysis strengthens the results. Using the existing per-task verifier and rubric scores, we have computed acceptance rates under relaxed and tightened thresholds (rubric mean 2.3–2.7 and verifier rate 75–85%). The revised results section will include a table showing that the headline range remains low (Gemini 18–28%, o3 and Claude 7–14%) and that relative ordering is preserved. The VRS composite weighting has only marginal impact; we will report both the conjunctive threshold and a continuous VRS sensitivity curve to demonstrate robustness of the core claim that current DRAs fall short on these tasks. revision: yes

Circularity Check

0 steps flagged

No significant circularity in benchmark evaluation or scoring

full rationale

The paper introduces an empirical benchmark consisting of 42 SME-authored prompts evaluated via deterministic ground-truth verifiers (mean 13.8 per task) and an independent five-criterion 0-3 SME rubric, with results aggregated into Verifier-Rubric Scores and compared directly to external published benchmarks such as APEX-v1 (64.2), ProfBench (65.9), and ResearchRubrics (<68). No derivations, equations, or fitted parameters are present that reduce any reported acceptance rates, VRS scores, or comparative claims to self-defined inputs by construction. The evaluation chain relies on external SME rubrics, deterministic verifiers, and cross-benchmark validation rather than self-citation load-bearing premises or ansatz smuggling, rendering the reported results self-contained against the stated inputs.

Axiom & Free-Parameter Ledger

2 free parameters · 1 axioms · 0 invented entities

The central evaluation framework rests on assumptions about prompt representativeness and chosen scoring thresholds rather than new mathematical derivations or entities.

free parameters (2)

rubric mean threshold = 2.5
Joint acceptance criterion set at 2.5 on the 0-3 scale.
verifier rate threshold = 80%
Joint acceptance criterion set at 80 percent pass rate.

axioms (1)

domain assumption SME-authored prompts with cognitive traps represent typical management consultant analytical work
The benchmark's claim to evaluate deployed enterprise use depends on this premise about prompt design and relevance.

pith-pipeline@v0.9.0 · 5930 in / 1429 out tokens · 52950 ms · 2026-05-20T12:28:21.462523+00:00 · methodology

discussion (0)

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/AbsoluteFloorClosure.lean reality_from_one_distinction unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

We introduce a benchmark targeting the structured analytical deliverables that fill a management consultant's typical week... dual-layer scoring... cognitive traps
IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

Verifier-Rubric Score (VRS)... ACCEPT(r, V) ⇔ min ri >0 ∧ r̄ ≥2.5 ∧ V≥80%

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

28 extracted references · 28 canonical work pages · 12 internal anchors

[1]

Program Synthesis with Large Language Models

Jacob Austin, Augustus Odena, Maxwell Nye, Maarten Bosma, Henryk Michalewski, David Dohan, Ellen Jiang, Carrie Cai, Michael Terry, Quoc Le, and Charles Sutton. Program synthesis with large language models.arXiv preprint arXiv:2108.07732,

work page internal anchor Pith review Pith/arXiv arXiv
[2]

Evaluating Large Language Models Trained on Code

Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Ponde de Oliveira Pinto, Jared Kaplan, Harri Edwards, Yuri Burda, Nicholas Joseph, Greg Brockman, et al. Evaluating large language models trained on code.arXiv preprint arXiv:2107.03374, 2021a. Zhiyu Chen, Wenhu Chen, Charese Smiley, Sameena Shah, Iana Borova, Dylan Langdon, Reema Moussa, Matt Bea...

work page internal anchor Pith review Pith/arXiv arXiv 2021
[3]

Dietterich

Thomas G. Dietterich. Approximate statistical tests for comparing supervised classification learning algorithms.Neural Computation, 10(7):1895–1923,

work page 1923
[4]

DeepResearch Bench: A Comprehensive Benchmark for Deep Research Agents

doi: 10.1162/089976698300017197. 32 Mingxuan Du, Benfeng Xu, Chiwei Zhu, Xiaorui Wang, and Zhendong Mao. DeepResearch Bench: A comprehensive benchmark for deep research agents.arXiv preprint arXiv:2506.11763,

work page internal anchor Pith review Pith/arXiv arXiv doi:10.1162/089976698300017197
[5]

Length-Controlled AlpacaEval: A Simple Way to Debias Automatic Evaluators

Yann Dubois, Balázs Galambosi, Percy Liang, and Tatsunori B. Hashimoto. Length-controlled AlpacaEval: A simple way to debias automatic evaluators.arXiv preprint arXiv:2404.04475,

work page internal anchor Pith review Pith/arXiv arXiv
[6]

doi: 10.1214/aos/1176344552. Jiawei Gu, Xuhui Jiang, Zhichao Shi, Hexiang Tan, Xuehao Zhai, Chengjin Xu, Wei Li, Yinghan Shen, Shengjie Ma, Honghao Liu, Saizhuo Wang, Kun Zhang, Yuanzhuo Wang, Wen Gao, Lionel Ni, and Jian Guo. A survey on LLM-as-a-Judge.arXiv preprint arXiv:2411.15594,

work page doi:10.1214/aos/1176344552
[7]

A Survey on Hallucination in Large Language Models: Principles, Taxonomy, Challenges, and Open Questions

Lei Huang, Weijiang Yu, Weitao Ma, Weihong Zhong, Zhangyin Feng, Haotian Wang, Qianglong Chen, Weihua Peng, Xiaocheng Feng, Bing Qin, and Ting Liu. A survey on hallucination in large language models: Principles, taxonomy, challenges, and open questions.arXiv preprint arXiv:2311.05232,

work page internal anchor Pith review Pith/arXiv arXiv
[8]

FinanceBench: A New Benchmark for Financial Question Answering

Pranab Islam, Anand Kannappan, Douwe Kiela, Rebecca Qian, Nino Scherrer, and Bertie Vidgen. FinanceBench: Anewbenchmarkforfinancialquestionanswering.arXiv preprint arXiv:2311.11944,

work page internal anchor Pith review Pith/arXiv arXiv
[9]

Survey of hallucination in natural language generation

doi: 10.1145/3571730. Carlos E. Jimenez, John Yang, Alexander Wettig, Shunyu Yao, Kexin Pei, Ofir Press, and Karthik Narasimhan. SWE-bench: Can language models resolve real-world GitHub issues? InInternational Conference on Learning Representations (ICLR),

work page doi:10.1145/3571730
[10]

Qiao Jin, Bhuwan Dhingra, Zhengping Liu, William W

doi: 10.3390/app11146421. Qiao Jin, Bhuwan Dhingra, Zhengping Liu, William W. Cohen, and Xinghua Lu. PubMedQA: A dataset for biomedical research question answering. InProceedings of the 2019 Conference on 33 Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP),

work page doi:10.3390/app11146421 2019
[11]

Language Models (Mostly) Know What They Know

Saurav Kadavath, Tom Conerly, Amanda Askell, Tom Henighan, Dawn Drain, Ethan Perez, Nicholas Schiefer, Zac Hatfield-Dodds, Nova DasSarma, Eli Tran-Johnson, et al. Language models (mostly) know what they know.arXiv preprint arXiv:2207.05221,

work page internal anchor Pith review Pith/arXiv arXiv
[12]

Bench- marking cognitive biases in large language models as evaluators.arXiv preprint arXiv:2309.17012,

Ryan Koo, Minhwa Lee, Vipul Raheja, Jong Inn Park, Zae Myung Kim, and Dongyeop Kang. Bench- marking cognitive biases in large language models as evaluators.arXiv preprint arXiv:2309.17012,

work page arXiv
[13]

Wojciech Kryściński, Bryan McCann, Caiming Xiong, and Richard Socher

URLhttps://repository.upenn.edu/asc_papers/43. Wojciech Kryściński, Bryan McCann, Caiming Xiong, and Richard Socher. Evaluating the factual consistency of abstractive text summarization. InProceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP),

work page 2020
[14]

The dawn after the dark: An empirical study on factuality hallucination in large language models.arXiv preprint arXiv:2401.03205,

Junyi Li, Jie Chen, Ruiyang Ren, Xiaoxue Cheng, Wayne Xin Zhao, Jian-Yun Nie, and Ji-Rong Wen. The dawn after the dark: An empirical study on factuality hallucination in large language models.arXiv preprint arXiv:2401.03205,

work page arXiv
[15]

API-Bank: A comprehensive benchmark for tool-augmented LLMs

Minghao Li, Yingxiu Zhao, Bowen Yu, Feifan Song, Hangyu Li, Haiyang Yu, Zhoujun Li, Fei Huang, and Yongbin Li. API-Bank: A comprehensive benchmark for tool-augmented LLMs. InProceedings of the 2023 Conference on Empirical Methods in Natural Language Processing (EMNLP), 2023a. Xuechen Li, Tianyi Zhang, Yann Dubois, Rohan Taori, Ishaan Gulrajani, Carlos Gue...

work page 2023
[16]

G-Eval: NLG evaluation using GPT-4 with better human alignment

Yang Liu, Dan Iter, Yichong Xu, Shuohang Wang, Ruochen Xu, and Chenguang Zhu. G-Eval: NLG evaluation using GPT-4 with better human alignment. InProceedings of the 2023 Conference on Empirical Methods in Natural Language Processing (EMNLP), 2023b. 34 Shuai Lu, Daya Guo, Shuo Ren, Junjie Huang, Alexey Svyatkovskiy, Ambrosio Blanco, Colin Clement, Dawn Drain...

work page 2023
[17]

SciAgent: Tool-augmented language models for scientific reasoning.arXiv preprint arXiv:2402.11451,

Yubo Ma, Zhibin Gou, Junheng Hao, Ruochen Xu, Shuohang Wang, Liangming Pan, Yujiu Yang, Yixin Cao, Aixin Sun, Hany Awadalla, and Weizhu Chen. SciAgent: Tool-augmented language models for scientific reasoning.arXiv preprint arXiv:2402.11451,

work page arXiv
[18]

Potsawee Manakul, Adian Liusie, and Mark J. F. Gales. SelfCheckGPT: Zero-resource black-box hallucination detection for generative large language models. InProceedings of the 2023 Conference on Empirical Methods in Natural Language Processing (EMNLP),

work page 2023
[19]

Grégoire Mialon, Clémentine Fourrier, Craig Swift, Thomas Wolf, Yann LeCun, and Thomas Scialom

doi: 10.1007/BF02295996. Grégoire Mialon, Clémentine Fourrier, Craig Swift, Thomas Wolf, Yann LeCun, and Thomas Scialom. GAIA: A benchmark for general AI assistants. InInternational Conference on Learning Representations (ICLR),

work page doi:10.1007/bf02295996
[20]

GDPval: Evaluating AI Model Performance on Real-World Economically Valuable Tasks

Tejal Patwardhan, Rachel Dias, Elizabeth Proehl, Grace Kim, Michele Wang, Olivia Watkins, Simón Posada Fishman, Marwan Aljubeh, Phoebe Thacker, Laurance Fauconnet, Natalie S. Kim, Patrick Chao, Samuel Miserendino, Gildas Chabot, David Li, Michael Sharman, Alexandra Barr, Amelia Glaese, and Jerry Tworek. GDPval: Evaluating AI model performance on real-worl...

work page internal anchor Pith review Pith/arXiv arXiv
[21]

ResearchRubrics: A benchmark of prompts and rubrics for evaluating deep research agents.arXiv preprint arXiv:2511.07685,

Manasi Sharma et al. ResearchRubrics: A benchmark of prompts and rubrics for evaluating deep research agents.arXiv preprint arXiv:2511.07685,

work page arXiv
[22]

35 Aarohi Srivastava, Abhinav Rastogi, Abhishek Rao, Abu Awal Md Shoeb, Abubakar Abid, Adam Fisch, Adam R

doi: 10.2307/1412159. 35 Aarohi Srivastava, Abhinav Rastogi, Abhishek Rao, Abu Awal Md Shoeb, Abubakar Abid, Adam Fisch, Adam R. Brown, Andrew Santoro, Aravindh Gupta, Adrià Garriga-Alonso, et al. Beyond the imitation game: Quantifying and extrapolating the capabilities of language models.Transactions on Machine Learning Research,

work page doi:10.2307/1412159
[23]

Apex-agents.arXiv preprint arXiv:2601.14242, 2026

URLhttps://arxiv.org/abs/2601.14242. Bertie Vidgen et al. The AI productivity index (APEX).arXiv preprint arXiv:2509.25721,

work page arXiv
[24]

Large Language Models are not Fair Evaluators

200 expert-designed tasks across investment banking, management consulting, law, and primary medical care. Peiyi Wang, Lei Li, Liang Chen, Zefan Cai, Dawei Zhu, Binghuai Lin, Yunbo Cao, Qi Liu, Tianyu Liu, and Zhifang Sui. Large language models are not fair evaluators.arXiv preprint arXiv:2305.17926,

work page internal anchor Pith review Pith/arXiv arXiv
[25]

ProfBench: Multi-Domain Rubrics requiring Professional Knowledge to Answer and Judge

URLhttps://arxiv.org/abs/2510.18941. Jason Wei, Zhiqing Sun, Spencer Papay, Scott McKinney, Jeffrey Han, Isa Fulford, Hyung Won Chung, Alex Tachard Passos, William Fedus, and Amelia Glaese. BrowseComp: A simple yet challenging benchmark for browsing agents.arXiv preprint arXiv:2504.12516,

work page internal anchor Pith review Pith/arXiv arXiv
[26]

The Rise and Potential of Large Language Model Based Agents: A Survey

ZhihengXi, WenxiangChen, XinGuo, WeiHe, YiwenDing, BoyangHong, MingZhang, JunzheWang, Senjie Jin, Enyu Zhou, Rui Zheng, Xiaoran Fan, Xiao Wang, Limao Xiong, Yuhao Zhou, Weiran Wang, Changhao Jiang, Yicheng Zou, Xiangyang Liu, Zhangyue Yin, Shihan Dou, Rongxiang Weng, Wensen Cheng, Qi Zhang, Wenjuan Qin, Yongyan Zheng, Xipeng Qiu, Xuanjing Huang, and Tao G...

work page internal anchor Pith review Pith/arXiv arXiv
[27]

ResearcherBench: Evaluat- ing deep AI research systems on the frontiers of scientific inquiry.arXiv preprint arXiv:2507.16280,

Tianze Xu, Pengrui Lu, Lyumanshan Ye, Xiangkun Hu, and Pengfei Liu. ResearcherBench: Evaluat- ing deep AI research systems on the frontiers of scientific inquiry.arXiv preprint arXiv:2507.16280,

work page arXiv
[28]

Siren's Song in the AI Ocean: A Survey on Hallucination in Large Language Models

Fengji Zhang, Bei Chen, Yue Zhang, Jacky Keung, Jin Liu, Daoguang Zan, Yi Mao, Jian-Guang Lou, and Weizhu Chen. RepoCoder: Repository-level code completion through iterative retrieval and generation. InProceedings of the 2023 Conference on Empirical Methods in Natural Language Processing (EMNLP), 2023a. 36 Yue Zhang, Yafu Li, Leyang Cui, Deng Cai, Lemao L...

work page internal anchor Pith review Pith/arXiv arXiv 2023

[1] [1]

Program Synthesis with Large Language Models

Jacob Austin, Augustus Odena, Maxwell Nye, Maarten Bosma, Henryk Michalewski, David Dohan, Ellen Jiang, Carrie Cai, Michael Terry, Quoc Le, and Charles Sutton. Program synthesis with large language models.arXiv preprint arXiv:2108.07732,

work page internal anchor Pith review Pith/arXiv arXiv

[2] [2]

Evaluating Large Language Models Trained on Code

Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Ponde de Oliveira Pinto, Jared Kaplan, Harri Edwards, Yuri Burda, Nicholas Joseph, Greg Brockman, et al. Evaluating large language models trained on code.arXiv preprint arXiv:2107.03374, 2021a. Zhiyu Chen, Wenhu Chen, Charese Smiley, Sameena Shah, Iana Borova, Dylan Langdon, Reema Moussa, Matt Bea...

work page internal anchor Pith review Pith/arXiv arXiv 2021

[3] [3]

Dietterich

Thomas G. Dietterich. Approximate statistical tests for comparing supervised classification learning algorithms.Neural Computation, 10(7):1895–1923,

work page 1923

[4] [4]

DeepResearch Bench: A Comprehensive Benchmark for Deep Research Agents

doi: 10.1162/089976698300017197. 32 Mingxuan Du, Benfeng Xu, Chiwei Zhu, Xiaorui Wang, and Zhendong Mao. DeepResearch Bench: A comprehensive benchmark for deep research agents.arXiv preprint arXiv:2506.11763,

work page internal anchor Pith review Pith/arXiv arXiv doi:10.1162/089976698300017197

[5] [5]

Length-Controlled AlpacaEval: A Simple Way to Debias Automatic Evaluators

Yann Dubois, Balázs Galambosi, Percy Liang, and Tatsunori B. Hashimoto. Length-controlled AlpacaEval: A simple way to debias automatic evaluators.arXiv preprint arXiv:2404.04475,

work page internal anchor Pith review Pith/arXiv arXiv

[6] [6]

doi: 10.1214/aos/1176344552. Jiawei Gu, Xuhui Jiang, Zhichao Shi, Hexiang Tan, Xuehao Zhai, Chengjin Xu, Wei Li, Yinghan Shen, Shengjie Ma, Honghao Liu, Saizhuo Wang, Kun Zhang, Yuanzhuo Wang, Wen Gao, Lionel Ni, and Jian Guo. A survey on LLM-as-a-Judge.arXiv preprint arXiv:2411.15594,

work page doi:10.1214/aos/1176344552

[7] [7]

A Survey on Hallucination in Large Language Models: Principles, Taxonomy, Challenges, and Open Questions

Lei Huang, Weijiang Yu, Weitao Ma, Weihong Zhong, Zhangyin Feng, Haotian Wang, Qianglong Chen, Weihua Peng, Xiaocheng Feng, Bing Qin, and Ting Liu. A survey on hallucination in large language models: Principles, taxonomy, challenges, and open questions.arXiv preprint arXiv:2311.05232,

work page internal anchor Pith review Pith/arXiv arXiv

[8] [8]

FinanceBench: A New Benchmark for Financial Question Answering

Pranab Islam, Anand Kannappan, Douwe Kiela, Rebecca Qian, Nino Scherrer, and Bertie Vidgen. FinanceBench: Anewbenchmarkforfinancialquestionanswering.arXiv preprint arXiv:2311.11944,

work page internal anchor Pith review Pith/arXiv arXiv

[9] [9]

Survey of hallucination in natural language generation

doi: 10.1145/3571730. Carlos E. Jimenez, John Yang, Alexander Wettig, Shunyu Yao, Kexin Pei, Ofir Press, and Karthik Narasimhan. SWE-bench: Can language models resolve real-world GitHub issues? InInternational Conference on Learning Representations (ICLR),

work page doi:10.1145/3571730

[10] [10]

Qiao Jin, Bhuwan Dhingra, Zhengping Liu, William W

doi: 10.3390/app11146421. Qiao Jin, Bhuwan Dhingra, Zhengping Liu, William W. Cohen, and Xinghua Lu. PubMedQA: A dataset for biomedical research question answering. InProceedings of the 2019 Conference on 33 Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP),

work page doi:10.3390/app11146421 2019

[11] [11]

Language Models (Mostly) Know What They Know

Saurav Kadavath, Tom Conerly, Amanda Askell, Tom Henighan, Dawn Drain, Ethan Perez, Nicholas Schiefer, Zac Hatfield-Dodds, Nova DasSarma, Eli Tran-Johnson, et al. Language models (mostly) know what they know.arXiv preprint arXiv:2207.05221,

work page internal anchor Pith review Pith/arXiv arXiv

[12] [12]

Bench- marking cognitive biases in large language models as evaluators.arXiv preprint arXiv:2309.17012,

Ryan Koo, Minhwa Lee, Vipul Raheja, Jong Inn Park, Zae Myung Kim, and Dongyeop Kang. Bench- marking cognitive biases in large language models as evaluators.arXiv preprint arXiv:2309.17012,

work page arXiv

[13] [13]

Wojciech Kryściński, Bryan McCann, Caiming Xiong, and Richard Socher

URLhttps://repository.upenn.edu/asc_papers/43. Wojciech Kryściński, Bryan McCann, Caiming Xiong, and Richard Socher. Evaluating the factual consistency of abstractive text summarization. InProceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP),

work page 2020

[14] [14]

The dawn after the dark: An empirical study on factuality hallucination in large language models.arXiv preprint arXiv:2401.03205,

Junyi Li, Jie Chen, Ruiyang Ren, Xiaoxue Cheng, Wayne Xin Zhao, Jian-Yun Nie, and Ji-Rong Wen. The dawn after the dark: An empirical study on factuality hallucination in large language models.arXiv preprint arXiv:2401.03205,

work page arXiv

[15] [15]

API-Bank: A comprehensive benchmark for tool-augmented LLMs

Minghao Li, Yingxiu Zhao, Bowen Yu, Feifan Song, Hangyu Li, Haiyang Yu, Zhoujun Li, Fei Huang, and Yongbin Li. API-Bank: A comprehensive benchmark for tool-augmented LLMs. InProceedings of the 2023 Conference on Empirical Methods in Natural Language Processing (EMNLP), 2023a. Xuechen Li, Tianyi Zhang, Yann Dubois, Rohan Taori, Ishaan Gulrajani, Carlos Gue...

work page 2023

[16] [16]

G-Eval: NLG evaluation using GPT-4 with better human alignment

Yang Liu, Dan Iter, Yichong Xu, Shuohang Wang, Ruochen Xu, and Chenguang Zhu. G-Eval: NLG evaluation using GPT-4 with better human alignment. InProceedings of the 2023 Conference on Empirical Methods in Natural Language Processing (EMNLP), 2023b. 34 Shuai Lu, Daya Guo, Shuo Ren, Junjie Huang, Alexey Svyatkovskiy, Ambrosio Blanco, Colin Clement, Dawn Drain...

work page 2023

[17] [17]

SciAgent: Tool-augmented language models for scientific reasoning.arXiv preprint arXiv:2402.11451,

Yubo Ma, Zhibin Gou, Junheng Hao, Ruochen Xu, Shuohang Wang, Liangming Pan, Yujiu Yang, Yixin Cao, Aixin Sun, Hany Awadalla, and Weizhu Chen. SciAgent: Tool-augmented language models for scientific reasoning.arXiv preprint arXiv:2402.11451,

work page arXiv

[18] [18]

Potsawee Manakul, Adian Liusie, and Mark J. F. Gales. SelfCheckGPT: Zero-resource black-box hallucination detection for generative large language models. InProceedings of the 2023 Conference on Empirical Methods in Natural Language Processing (EMNLP),

work page 2023

[19] [19]

Grégoire Mialon, Clémentine Fourrier, Craig Swift, Thomas Wolf, Yann LeCun, and Thomas Scialom

doi: 10.1007/BF02295996. Grégoire Mialon, Clémentine Fourrier, Craig Swift, Thomas Wolf, Yann LeCun, and Thomas Scialom. GAIA: A benchmark for general AI assistants. InInternational Conference on Learning Representations (ICLR),

work page doi:10.1007/bf02295996

[20] [20]

GDPval: Evaluating AI Model Performance on Real-World Economically Valuable Tasks

Tejal Patwardhan, Rachel Dias, Elizabeth Proehl, Grace Kim, Michele Wang, Olivia Watkins, Simón Posada Fishman, Marwan Aljubeh, Phoebe Thacker, Laurance Fauconnet, Natalie S. Kim, Patrick Chao, Samuel Miserendino, Gildas Chabot, David Li, Michael Sharman, Alexandra Barr, Amelia Glaese, and Jerry Tworek. GDPval: Evaluating AI model performance on real-worl...

work page internal anchor Pith review Pith/arXiv arXiv

[21] [21]

ResearchRubrics: A benchmark of prompts and rubrics for evaluating deep research agents.arXiv preprint arXiv:2511.07685,

Manasi Sharma et al. ResearchRubrics: A benchmark of prompts and rubrics for evaluating deep research agents.arXiv preprint arXiv:2511.07685,

work page arXiv

[22] [22]

35 Aarohi Srivastava, Abhinav Rastogi, Abhishek Rao, Abu Awal Md Shoeb, Abubakar Abid, Adam Fisch, Adam R

doi: 10.2307/1412159. 35 Aarohi Srivastava, Abhinav Rastogi, Abhishek Rao, Abu Awal Md Shoeb, Abubakar Abid, Adam Fisch, Adam R. Brown, Andrew Santoro, Aravindh Gupta, Adrià Garriga-Alonso, et al. Beyond the imitation game: Quantifying and extrapolating the capabilities of language models.Transactions on Machine Learning Research,

work page doi:10.2307/1412159

[23] [23]

Apex-agents.arXiv preprint arXiv:2601.14242, 2026

URLhttps://arxiv.org/abs/2601.14242. Bertie Vidgen et al. The AI productivity index (APEX).arXiv preprint arXiv:2509.25721,

work page arXiv

[24] [24]

Large Language Models are not Fair Evaluators

200 expert-designed tasks across investment banking, management consulting, law, and primary medical care. Peiyi Wang, Lei Li, Liang Chen, Zefan Cai, Dawei Zhu, Binghuai Lin, Yunbo Cao, Qi Liu, Tianyu Liu, and Zhifang Sui. Large language models are not fair evaluators.arXiv preprint arXiv:2305.17926,

work page internal anchor Pith review Pith/arXiv arXiv

[25] [25]

ProfBench: Multi-Domain Rubrics requiring Professional Knowledge to Answer and Judge

URLhttps://arxiv.org/abs/2510.18941. Jason Wei, Zhiqing Sun, Spencer Papay, Scott McKinney, Jeffrey Han, Isa Fulford, Hyung Won Chung, Alex Tachard Passos, William Fedus, and Amelia Glaese. BrowseComp: A simple yet challenging benchmark for browsing agents.arXiv preprint arXiv:2504.12516,

work page internal anchor Pith review Pith/arXiv arXiv

[26] [26]

The Rise and Potential of Large Language Model Based Agents: A Survey

ZhihengXi, WenxiangChen, XinGuo, WeiHe, YiwenDing, BoyangHong, MingZhang, JunzheWang, Senjie Jin, Enyu Zhou, Rui Zheng, Xiaoran Fan, Xiao Wang, Limao Xiong, Yuhao Zhou, Weiran Wang, Changhao Jiang, Yicheng Zou, Xiangyang Liu, Zhangyue Yin, Shihan Dou, Rongxiang Weng, Wensen Cheng, Qi Zhang, Wenjuan Qin, Yongyan Zheng, Xipeng Qiu, Xuanjing Huang, and Tao G...

work page internal anchor Pith review Pith/arXiv arXiv

[27] [27]

ResearcherBench: Evaluat- ing deep AI research systems on the frontiers of scientific inquiry.arXiv preprint arXiv:2507.16280,

Tianze Xu, Pengrui Lu, Lyumanshan Ye, Xiangkun Hu, and Pengfei Liu. ResearcherBench: Evaluat- ing deep AI research systems on the frontiers of scientific inquiry.arXiv preprint arXiv:2507.16280,

work page arXiv

[28] [28]

Siren's Song in the AI Ocean: A Survey on Hallucination in Large Language Models

Fengji Zhang, Bei Chen, Yue Zhang, Jacky Keung, Jin Liu, Daoguang Zan, Yi Mao, Jian-Guang Lou, and Weizhu Chen. RepoCoder: Repository-level code completion through iterative retrieval and generation. InProceedings of the 2023 Conference on Empirical Methods in Natural Language Processing (EMNLP), 2023a. 36 Yue Zhang, Yafu Li, Leyang Cui, Deng Cai, Lemao L...

work page internal anchor Pith review Pith/arXiv arXiv 2023