Evaluating Deep Research Agents on Expert Consulting Work: A Benchmark with Verifiers, Rubrics, and Cognitive Traps
Pith reviewed 2026-05-20 12:28 UTC · model grok-4.3
The pith
Deep research agents achieve acceptance rates of at most 21.4 percent on expert consulting tasks with cognitive traps.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Frontier deep research agents were tested on 42 SME-authored prompts for consulting deliverables using deterministic ground-truth verifiers and a five-criterion 0-3 SME rubric combined into a Verifier-Rubric Score. Acceptance under the joint threshold of rubric mean at least 2.5 and verifier rate at least 80 percent reached only 21.4 percent for Gemini and 9.5 percent for both o3 and Claude. The agents showed distinct failure modes, with mean scores remaining consistent with other published rubric benchmarks despite stricter conjunctive grading and trap design.
What carries the argument
The Verifier-Rubric Score (VRS) on a 0-100 scale, which combines deterministic ground-truth verifiers (mean 13.8 per task) with a five-criterion 0-3 SME rubric and applies a joint acceptance threshold requiring both high rubric and high verifier performance.
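The joint acceptance rule can be sketched in a few lines. This is a minimal illustration, not the authors' harness: the function name `accept` and its argument names are hypothetical, and the VRS composite weighting is deliberately left out because this page does not state it. The no-zero-cell condition follows the ACCEPT formula quoted from the paper further down this page.

```python
RUBRIC_MEAN_MIN = 2.5        # rubric threshold stated in the paper
VERIFIER_RATE_MIN_PCT = 80   # verifier threshold stated in the paper, in percent


def accept(rubric_scores, verifiers_passed, verifiers_total):
    """Return True when a response clears the joint (conjunctive) threshold.

    rubric_scores: five integers in 0..3, one per rubric criterion.
    verifiers_passed / verifiers_total: deterministic verifier outcomes for the task.
    """
    if len(rubric_scores) != 5 or any(not 0 <= s <= 3 for s in rubric_scores):
        raise ValueError("expected five rubric scores in the range 0..3")
    if not 0 <= verifiers_passed <= verifiers_total or verifiers_total == 0:
        raise ValueError("verifier counts are inconsistent")

    rubric_mean = sum(rubric_scores) / len(rubric_scores)
    # Integer comparison avoids any floating-point edge case at exactly 80%.
    verifier_ok = verifiers_passed * 100 >= VERIFIER_RATE_MIN_PCT * verifiers_total
    # With five criteria scored 0..3, a single zero caps the mean at 12/5 = 2.4,
    # so this check is implied by the mean threshold; it is kept for fidelity
    # to the quoted formula.
    no_zero_cell = min(rubric_scores) > 0
    return no_zero_cell and rubric_mean >= RUBRIC_MEAN_MIN and verifier_ok
```

For example, `accept([3, 3, 3, 2, 2], 12, 14)` passes (mean 2.6, verifier rate about 86 percent), while `accept([3, 3, 3, 3, 3], 10, 14)` fails on the verifier layer alone. That second case is the conjunctive design at work: a perfect rubric score cannot compensate for missed ground-truth checks.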
If this is right
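- The lowest acceptance rate under the joint threshold (9.5 percent) sits about three points below the floor of APEX-Agents' MC-segment Pass@1 band (12.3-22.7 percent) for dedicated deep-research agents, although Gemini's 21.4 percent falls inside that band.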
- Claude produces the required deliverable most reliably but shows the highest rate of fabrication.
- o3 maintains the cleanest reasoning on average yet frequently omits required sections and propagates arithmetic errors.
- Gemini records the highest acceptance rate but also the largest number of zero-scored rubric cells.
- The top mean VRS (62.6) sits within a few points of other published rubric-based agent benchmarks (APEX-v1 64.2, ProfBench 65.9).
Where Pith is reading between the lines
- Enterprises using these agents for analytical consulting may need additional human review steps to compensate for the observed synthesis and accuracy gaps.
- The distinct failure patterns across agents point to specific areas for architectural improvement in multi-document reasoning and output formatting.
- Extending the benchmark with more iterative or file-dependent tasks could expose further differences in agent reliability.
- If real consulting projects involve more back-and-forth clarification than the static prompts allow, actual deployment performance could diverge from these results.
Load-bearing premise
The 42 SME-authored prompts with embedded cognitive traps accurately represent the multi-document, decision-grade analytical work that deep research agents are deployed to produce in enterprise consulting workflows.
What would settle it
Re-evaluating an updated version of any of the three agents on the identical set of 42 prompts and obtaining acceptance rates above 50 percent under the same joint verifier and rubric threshold would indicate that the reported low performance is not inherent to current agent capabilities.
Original abstract
Frontier deep research agents (DRAs) plan a research task, synthesize across documents, and return a structured deliverable on demand. They are being deployed in enterprise workflows faster than they are being evaluated. Existing benchmarks measure factual recall, single-hop QA, or generic agentic skill, missing the multi-document, decision-grade work DRAs are deployed to produce. We introduce a benchmark targeting the structured analytical deliverables that fill a management consultant's typical week. We grade three frontier agents, namely Claude Opus 4.6 with web search, OpenAI o3-deep-research, and Google Gemini 3.1 Pro deep-research, on 42 SME-authored prompts. Each of the 126 responses is scored on two layers: deterministic ground-truth verifiers (mean 13.8 per task) and a five-criterion 0-3 SME rubric, composed into a Verifier-Rubric Score (VRS) on 0-100. Most prompts embed cognitive traps that penalize surface-pattern matching. Acceptance under our joint threshold (rubric mean >= 2.5 and verifier rate >= 80%) is uniformly low: Gemini 21.4%, o3 9.5%, Claude 9.5%. Mean VRS scores agree with published rubric-based benchmarks (our top 62.6 vs. APEX-v1 64.2, ProfBench 65.9, ResearchRubrics < 68%), validating the rubric construct. ACCEPT rates sit below APEX-Agents' MC-segment Pass@1 band (12.3-22.7%) on dedicated DR agents; our floor is three points lower despite the harness advantage, opened by stricter conjunctive grading and trap design. Each agent fails distinctively. Claude produces the deliverable most reliably (4.5x the others' rate on file-required tasks) but carries the highest fabrication signature. o3 has the cleanest reasoning average yet drops required sections and propagates arithmetic errors. Gemini is bimodal, with the highest acceptance rate alongside the most zero-scored rubric cells.
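The reported percentages can be translated back into response counts. The sketch below assumes each agent's acceptance rate is computed over all 42 prompts, which the abstract implies but does not state outright; the helper name `implied_count` is hypothetical.

```python
N_PROMPTS = 42
N_AGENTS = 3

# Acceptance rates as reported in the abstract, in percent.
reported = {"Gemini": 21.4, "o3": 9.5, "Claude": 9.5}


def implied_count(rate_percent, n):
    """Return the unique k such that 100*k/n rounds to rate_percent, else None."""
    matches = [k for k in range(n + 1) if abs(100 * k / n - rate_percent) < 0.05]
    return matches[0] if len(matches) == 1 else None


counts = {agent: implied_count(rate, N_PROMPTS) for agent, rate in reported.items()}
total_responses = N_PROMPTS * N_AGENTS
total_accepted = sum(counts.values())
```

Under that assumption the rates correspond to 9, 4, and 4 accepted responses, or 17 of the 126 graded responses overall. With counts this small, a single task flipping from reject to accept moves an agent's rate by about 2.4 points.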
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces a benchmark for frontier deep research agents on multi-document, decision-grade analytical tasks typical of management consulting. It evaluates Claude Opus 4.6, OpenAI o3-deep-research, and Google Gemini 3.1 Pro on 42 SME-authored prompts containing cognitive traps. Each of the 126 responses is scored via deterministic verifiers (mean 13.8 per task) and a 0-3 five-criterion SME rubric, combined into a Verifier-Rubric Score (VRS). The headline result is low joint-threshold acceptance (rubric mean >= 2.5 and verifier rate >= 80%): Gemini 21.4%, o3 9.5%, Claude 9.5%. Mean VRS aligns with APEX-v1, ProfBench, and ResearchRubrics; agents show distinct failure modes (Claude reliable but fabricates; o3 clean reasoning but drops sections; Gemini bimodal).
Significance. If the benchmark holds, the work provides a concrete, dual-layer evaluation framework that exposes limitations in current DRAs for enterprise consulting workflows and validates the rubric construct against prior benchmarks. The explicit cognitive-trap design and conjunctive grading offer a stricter test than existing single-metric or MCQ-style agent benchmarks, supporting the policy observation that deployment outpaces evaluation.
major comments (2)
- [Abstract / Benchmark Construction] Abstract and benchmark-construction section: the claim that the 42 SME-authored prompts constitute a faithful sample of 'multi-document, decision-grade analytical work' DRAs produce in enterprise consulting is asserted without supporting evidence on SME selection criteria, task-distribution statistics (document volume, decision stakes, time pressure), or external validation (e.g., blind review by additional consultants). This assumption is load-bearing for the uniformly low acceptance rates and the comparative/policy conclusions.
- [Results] Results section: the joint acceptance threshold (rubric mean >= 2.5 and verifier rate >= 80%) is presented as the primary metric, yet the paper does not report sensitivity of the headline percentages to modest changes in either threshold or to the exact weighting in the VRS composite; this leaves open whether the 9.5-21.4% range is robust or threshold-dependent.
minor comments (2)
- [Abstract] Abstract: the parenthetical 'mean 13.8 per task' for verifiers should be accompanied by a range or standard deviation to indicate variability across the 42 prompts.
- [Comparison to Prior Benchmarks] Comparison paragraph: the statement that ACCEPT rates sit 'three points lower' than APEX-Agents' MC-segment band would benefit from an explicit citation to the exact APEX table or figure being referenced.
Simulated Authors' Rebuttal
We thank the referee for the constructive feedback on our benchmark for deep research agents. The comments raise important points about documentation and robustness that we address below. We will revise the manuscript accordingly to strengthen the presentation while preserving the core findings on low acceptance rates and agent-specific failure modes.
Point-by-point responses
- Referee: [Abstract / Benchmark Construction] Abstract and benchmark-construction section: the claim that the 42 SME-authored prompts constitute a faithful sample of 'multi-document, decision-grade analytical work' DRAs produce in enterprise consulting is asserted without supporting evidence on SME selection criteria, task-distribution statistics (document volume, decision stakes, time pressure), or external validation (e.g., blind review by additional consultants). This assumption is load-bearing for the uniformly low acceptance rates and the comparative/policy conclusions.
  Authors: We agree that the benchmark-construction section would benefit from greater transparency. The 42 prompts were authored by SMEs with an average of 12 years in management consulting, selected to cover typical deliverables involving multi-document synthesis and decision stakes under time pressure. In the revised manuscript we will add a new subsection detailing SME selection criteria, aggregate task statistics (mean documents per prompt, decision type distribution), and the internal validation process used to embed cognitive traps. We will also revise the abstract and introduction to describe the benchmark as targeting representative consulting workflows rather than claiming a statistically faithful sample of the entire domain, which removes the load-bearing assumption while retaining the policy relevance of the low acceptance rates.
  Revision: yes
- Referee: [Results] Results section: the joint acceptance threshold (rubric mean >= 2.5 and verifier rate >= 80%) is presented as the primary metric, yet the paper does not report sensitivity of the headline percentages to modest changes in either threshold or to the exact weighting in the VRS composite; this leaves open whether the 9.5-21.4% range is robust or threshold-dependent.
  Authors: We concur that sensitivity analysis strengthens the results. Using the existing per-task verifier and rubric scores, we have computed acceptance rates under relaxed and tightened thresholds (rubric mean 2.3–2.7 and verifier rate 75–85%). The revised results section will include a table showing that the headline range remains low (Gemini 18–28%, o3 and Claude 7–14%) and that relative ordering is preserved. The VRS composite weighting has only marginal impact; we will report both the conjunctive threshold and a continuous VRS sensitivity curve to demonstrate robustness of the core claim that current DRAs fall short on these tasks.
  Revision: yes
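A threshold sweep of the kind described in this exchange could look like the following. It is a sketch under stated assumptions: the function names are hypothetical, and `toy_scores` is invented purely for illustration and is not data from the paper.

```python
from itertools import product


def acceptance_rate(scores, rubric_min, verifier_min):
    """Percentage of responses that pass both thresholds.

    scores: list of (rubric_mean, verifier_rate) pairs, one per task.
    """
    if not scores:
        raise ValueError("no scores supplied")
    accepted = sum(1 for r, v in scores if r >= rubric_min and v >= verifier_min)
    return 100.0 * accepted / len(scores)


def sensitivity_grid(scores, rubric_grid, verifier_grid):
    """Acceptance rate for every (rubric threshold, verifier threshold) pair."""
    return {
        (r, v): acceptance_rate(scores, r, v)
        for r, v in product(rubric_grid, verifier_grid)
    }


# Hypothetical per-task scores for one agent (NOT from the paper).
toy_scores = [(2.8, 0.90), (2.6, 0.82), (2.4, 0.85), (2.6, 0.76), (1.8, 0.50)]

grid = sensitivity_grid(
    toy_scores,
    rubric_grid=(2.3, 2.5, 2.7),       # range named in the rebuttal
    verifier_grid=(0.75, 0.80, 0.85),  # range named in the rebuttal
)
```

One design point worth keeping in mind when reading such a table: with five integer criteria the rubric mean moves in steps of 0.2, so rubric thresholds of 2.3, 2.5, and 2.7 effectively admit means of at least 2.4, 2.6, and 2.8 respectively.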
Circularity Check
No significant circularity in benchmark evaluation or scoring
Full rationale
The paper introduces an empirical benchmark consisting of 42 SME-authored prompts evaluated via deterministic ground-truth verifiers (mean 13.8 per task) and an independent five-criterion 0-3 SME rubric, with results aggregated into Verifier-Rubric Scores and compared directly to external published benchmarks such as APEX-v1 (64.2), ProfBench (65.9), and ResearchRubrics (<68%). No derivations, equations, or fitted parameters are present that reduce any reported acceptance rates, VRS scores, or comparative claims to self-defined inputs by construction. The evaluation chain relies on external SME rubrics, deterministic verifiers, and cross-benchmark validation rather than on load-bearing self-citations or a smuggled-in ansatz, so the reported results are self-contained with respect to the stated inputs.
Axiom & Free-Parameter Ledger
free parameters (2)
- rubric mean threshold = 2.5
- verifier rate threshold = 80%
axioms (1)
- domain assumption: SME-authored prompts with cognitive traps represent typical management consultant analytical work
Lean theorems connected to this paper
- IndisputableMonolith/Foundation/AbsoluteFloorClosure.lean: reality_from_one_distinction
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Paper passage: We introduce a benchmark targeting the structured analytical deliverables that fill a management consultant's typical week... dual-layer scoring... cognitive traps
- IndisputableMonolith/Cost/FunctionalEquation.lean: washburn_uniqueness_aczel
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Paper passage: Verifier-Rubric Score (VRS)... $\mathrm{ACCEPT}(r, V) \iff \min_i r_i > 0 \,\land\, \bar{r} \ge 2.5 \,\land\, V \ge 80\%$
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.