arxiv: 2605.17554 · v3 · pith:MXAUWQ6Inew · submitted 2026-05-17 · 💻 cs.AI · cs.LG

Evaluating Deep Research Agents on Expert Consulting Work: A Benchmark with Verifiers, Rubrics, and Cognitive Traps

Pith reviewed 2026-05-20 12:28 UTC · model grok-4.3

classification 💻 cs.AI cs.LG
keywords deep research agents · AI benchmarking · cognitive traps · verifiers and rubrics · consulting workflows · frontier models · analytical deliverables

The pith

Deep research agents achieve acceptance rates of 21.4 percent or lower on expert consulting tasks with cognitive traps.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces a benchmark that tests frontier deep research agents on the kind of multi-document synthesis and structured deliverables typical of management consulting work. It uses 42 prompts written by subject matter experts that deliberately include cognitive traps, then scores responses with both automatic verifiers and a five-criterion expert rubric. Three leading agents were evaluated, and all performed poorly when both factual verification and quality standards had to be satisfied at once. The results matter because these agents are already being integrated into enterprise decision processes where incomplete or fabricated analysis carries real costs.

Core claim

Frontier deep research agents were tested on 42 SME-authored prompts for consulting deliverables using deterministic ground-truth verifiers and a five-criterion 0-3 SME rubric combined into a Verifier-Rubric Score. Acceptance under the joint threshold of rubric mean at least 2.5 and verifier rate at least 80 percent reached only 21.4 percent for Gemini and 9.5 percent for both o3 and Claude. The agents showed distinct failure modes, with mean scores remaining consistent with other published rubric benchmarks despite stricter conjunctive grading and trap design.
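The joint threshold described above can be sketched in a few lines of Python. This is a minimal illustration rather than the paper's evaluation code: the two threshold values come from the text, while the `Response` container, the function names, and the example scores are hypothetical.

```python
from dataclasses import dataclass
from statistics import mean

RUBRIC_MEAN_MIN = 2.5     # from the text: rubric mean of at least 2.5 on the 0-3 scale
VERIFIER_RATE_MIN = 0.80  # from the text: verifier pass rate of at least 80 percent


@dataclass
class Response:
    """Hypothetical container for one scored agent response."""
    rubric_scores: list[int]      # five criterion scores, each in 0..3
    verifier_results: list[bool]  # one pass/fail flag per deterministic verifier


def is_accepted(r: Response) -> bool:
    """Conjunctive rule: BOTH conditions must hold for acceptance."""
    rubric_mean = mean(r.rubric_scores)
    verifier_rate = sum(r.verifier_results) / len(r.verifier_results)
    return rubric_mean >= RUBRIC_MEAN_MIN and verifier_rate >= VERIFIER_RATE_MIN


def acceptance_rate(responses: list[Response]) -> float:
    """Fraction of responses that clear the joint threshold."""
    return sum(is_accepted(r) for r in responses) / len(responses)


# Illustrative scores only, not data from the paper.
strong = Response([3, 3, 2, 3, 2], [True] * 12 + [False] * 2)         # mean 2.6, 12/14 pass
polished_but_wrong = Response([3] * 5, [True] * 11 + [False] * 3)     # mean 3.0, 11/14 pass
correct_but_weak = Response([2, 2, 3, 2, 3], [True] * 14)             # mean 2.4, 14/14 pass

sample_rate = acceptance_rate([strong, polished_but_wrong, correct_but_weak])
```

Only the first example is accepted: the second fails on verifier rate and the third on rubric mean, which is what makes conjunctive grading stricter than either layer alone. If each agent was scored on all 42 prompts, the reported 21.4 percent and 9.5 percent would correspond to 9 and 4 accepted responses (9/42 ≈ 0.214, 4/42 ≈ 0.095).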

What carries the argument

The Verifier-Rubric Score (VRS) on a 0-100 scale, which combines deterministic ground-truth verifiers (mean 13.8 per task) with a five-criterion 0-3 SME rubric and applies a joint acceptance threshold requiring both high rubric and high verifier performance.
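The summary does not state how the two layers are weighted inside the VRS, so the following is only a sketch of one plausible composite. It assumes the rubric mean is normalised to 0-1, the verifier layer is a simple pass rate, and the two are blended with a weight `w_rubric` that defaults to 0.5. The function name, the weight, and the linear form are all assumptions, not the paper's definition.

```python
def vrs_sketch(rubric_scores, verifier_results, w_rubric=0.5):
    """Illustrative 0-100 composite; the linear blend and weight are ASSUMED."""
    # Normalise the 0-3 rubric scores to a 0-1 fraction of the maximum.
    rubric_norm = sum(rubric_scores) / (3 * len(rubric_scores))
    # Fraction of deterministic verifiers that passed.
    verifier_rate = sum(verifier_results) / len(verifier_results)
    return 100 * (w_rubric * rubric_norm + (1 - w_rubric) * verifier_rate)


# Illustrative inputs only, not data from the paper.
perfect = vrs_sketch([3] * 5, [True] * 10)
half_verified = vrs_sketch([3] * 5, [True, False])
empty_handed = vrs_sketch([0] * 5, [False] * 4)
```

Under these assumptions a response with a perfect rubric but only half of its verifiers passing scores 75, which shows how a continuous composite can look respectable even when the joint acceptance rule rejects the response.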

If this is right

  • Acceptance rates under the joint threshold sit below those reported for dedicated deep-research agents in other benchmarks.
  • Claude produces the required deliverable most reliably but shows the highest rate of fabrication.
  • o3 maintains the cleanest reasoning on average yet frequently omits required sections and propagates arithmetic errors.
  • Gemini records the highest acceptance rate but also the largest number of zero-scored rubric cells.
  • Mean VRS scores align closely with results from other published rubric-based agent benchmarks.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Enterprises using these agents for analytical consulting may need additional human review steps to compensate for the observed synthesis and accuracy gaps.
  • The distinct failure patterns across agents point to specific areas for architectural improvement in multi-document reasoning and output formatting.
  • Extending the benchmark with more iterative or file-dependent tasks could expose further differences in agent reliability.
  • If real consulting projects involve more back-and-forth clarification than the static prompts allow, actual deployment performance could diverge from these results.

Load-bearing premise

The 42 SME-authored prompts with embedded cognitive traps accurately represent the multi-document, decision-grade analytical work that deep research agents are deployed to produce in enterprise consulting workflows.

What would settle it

Re-evaluating an updated version of any of the three agents on the identical set of 42 prompts and obtaining acceptance rates above 50 percent under the same joint verifier and rubric threshold would indicate that the reported low performance is not inherent to current agent capabilities.

Original abstract

Frontier deep research agents (DRAs) are being deployed in enterprise workflows faster than they are being evaluated. Existing benchmarks measure factual recall, single-hop QA, or generic agentic skill, and miss the multi-document, decision-grade deliverables DRAs are asked to produce. We introduce a benchmark of 70 SME-authored management consulting prompts, each embedding cognitive traps that penalize surface-pattern reasoning. Three frontier agents, namely Claude Opus 4.6, OpenAI o3-deep-research and Gemini 3.1 Pro deep-research, are scored on two complementary layers: deterministic binary verifiers (mean 14.9 per task) and a five-criterion 0-3 SME rubric (Data Integrity, Analytical Rigor, Relevance & Focus, Execution Precision, Format & Deliverability), combined into a Verifier-Rubric Score (VRS, 0-100). Acceptance under a joint threshold (rubric mean >= 2.5 and verifier pass rate >= 80%) is uniformly low: o3 15.7%, Claude 12.9%, Gemini 12.9%. Pairwise differences are statistically indistinguishable. On the continuous VRS, o3 leads (61.4 [CI: 55.2, 67.5]), followed by Gemini (52.6) and Claude (38.5); the o3-Claude gap (Δ = 22.9, p < 0.001) survives Bonferroni correction. No agent averages above the rubric's "adequate" threshold of 2.0; no agent's mean verifier pass rate reaches the 80% acceptance floor. Each agent fails distinctively: Claude leads on data fabrication and file-access failures; o3 propagates cascading computation errors; Gemini oscillates between the highest perfect-verifier rate and the most catastrophic collapses. The benchmark, evaluation code, and full prompt corpus are publicly released.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces a benchmark for frontier deep research agents on multi-document, decision-grade analytical tasks typical of management consulting. It evaluates Claude Opus 4.6, OpenAI o3-deep-research, and Google Gemini 3.1 Pro on 42 SME-authored prompts containing cognitive traps. Each of the 126 responses is scored via deterministic verifiers (mean 13.8 per task) and a 0-3 five-criterion SME rubric, combined into a Verifier-Rubric Score (VRS). The headline result is low joint-threshold acceptance (rubric mean >= 2.5 and verifier rate >= 80%): Gemini 21.4%, o3 9.5%, Claude 9.5%. Mean VRS aligns with APEX-v1, ProfBench, and ResearchRubrics; agents show distinct failure modes (Claude reliable but fabricates; o3 clean reasoning but drops sections; Gemini bimodal).

Significance. If the benchmark holds, the work provides a concrete, dual-layer evaluation framework that exposes limitations in current DRAs for enterprise consulting workflows and validates the rubric construct against prior benchmarks. The explicit cognitive-trap design and conjunctive grading offer a stricter test than existing single-metric or MCQ-style agent benchmarks, supporting the policy observation that deployment outpaces evaluation.

major comments (2)
  1. [Abstract / Benchmark Construction] Abstract and benchmark-construction section: the claim that the 42 SME-authored prompts constitute a faithful sample of 'multi-document, decision-grade analytical work' DRAs produce in enterprise consulting is asserted without supporting evidence on SME selection criteria, task-distribution statistics (document volume, decision stakes, time pressure), or external validation (e.g., blind review by additional consultants). This assumption is load-bearing for the uniformly low acceptance rates and the comparative/policy conclusions.
  2. [Results] Results section: the joint acceptance threshold (rubric mean >= 2.5 and verifier rate >= 80%) is presented as the primary metric, yet the paper does not report sensitivity of the headline percentages to modest changes in either threshold or to the exact weighting in the VRS composite; this leaves open whether the 9.5-21.4% range is robust or threshold-dependent.
minor comments (2)
  1. [Abstract] Abstract: the parenthetical 'mean 13.8 per task' for verifiers should be accompanied by a range or standard deviation to indicate variability across the 42 prompts.
  2. [Comparison to Prior Benchmarks] Comparison paragraph: the statement that ACCEPT rates sit 'three points lower' than APEX-Agents' MC-segment band would benefit from an explicit citation to the exact APEX table or figure being referenced.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our benchmark for deep research agents. The comments raise important points about documentation and robustness that we address below. We will revise the manuscript accordingly to strengthen the presentation while preserving the core findings on low acceptance rates and agent-specific failure modes.

point-by-point responses
  1. Referee: [Abstract / Benchmark Construction] Abstract and benchmark-construction section: the claim that the 42 SME-authored prompts constitute a faithful sample of 'multi-document, decision-grade analytical work' DRAs produce in enterprise consulting is asserted without supporting evidence on SME selection criteria, task-distribution statistics (document volume, decision stakes, time pressure), or external validation (e.g., blind review by additional consultants). This assumption is load-bearing for the uniformly low acceptance rates and the comparative/policy conclusions.

    Authors: We agree that the benchmark-construction section would benefit from greater transparency. The 42 prompts were authored by SMEs with an average of 12 years in management consulting, selected to cover typical deliverables involving multi-document synthesis and decision stakes under time pressure. In the revised manuscript we will add a new subsection detailing SME selection criteria, aggregate task statistics (mean documents per prompt, decision type distribution), and the internal validation process used to embed cognitive traps. We will also revise the abstract and introduction to describe the benchmark as targeting representative consulting workflows rather than claiming a statistically faithful sample of the entire domain, which removes the load-bearing assumption while retaining the policy relevance of the low acceptance rates. revision: yes

  2. Referee: [Results] Results section: the joint acceptance threshold (rubric mean >= 2.5 and verifier rate >= 80%) is presented as the primary metric, yet the paper does not report sensitivity of the headline percentages to modest changes in either threshold or to the exact weighting in the VRS composite; this leaves open whether the 9.5-21.4% range is robust or threshold-dependent.

    Authors: We concur that sensitivity analysis strengthens the results. Using the existing per-task verifier and rubric scores, we have computed acceptance rates under relaxed and tightened thresholds (rubric mean 2.3–2.7 and verifier rate 75–85%). The revised results section will include a table showing that the headline range remains low (Gemini 18–28%, o3 and Claude 7–14%) and that relative ordering is preserved. The VRS composite weighting has only marginal impact; we will report both the conjunctive threshold and a continuous VRS sensitivity curve to demonstrate robustness of the core claim that current DRAs fall short on these tasks. revision: yes
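The kind of threshold sweep the authors describe can be sketched as a small grid computation. This is an illustration under stated assumptions, not the authors' analysis: the function name is hypothetical and the per-task scores are invented for the example. Only the swept ranges (rubric mean 2.3 to 2.7, verifier rate 75 to 85 percent) come from the rebuttal.

```python
from itertools import product


def acceptance_grid(scores, rubric_thresholds, verifier_thresholds):
    """Acceptance rate for every (rubric_min, verifier_min) pair.

    scores: list of (rubric_mean, verifier_rate) tuples, one per task.
    """
    grid = {}
    for r_min, v_min in product(rubric_thresholds, verifier_thresholds):
        accepted = sum(1 for rm, vr in scores if rm >= r_min and vr >= v_min)
        grid[(r_min, v_min)] = accepted / len(scores)
    return grid


# Hypothetical per-task (rubric mean, verifier rate) pairs, NOT data from the paper.
scores = [(2.8, 0.90), (2.6, 0.78), (2.4, 0.85), (1.8, 0.50), (2.2, 0.95)]

grid = acceptance_grid(
    scores,
    rubric_thresholds=[2.3, 2.5, 2.7],      # range named in the rebuttal
    verifier_thresholds=[0.75, 0.80, 0.85],  # range named in the rebuttal
)

for (r_min, v_min), rate in sorted(grid.items()):
    print(f"rubric >= {r_min}, verifier >= {v_min:.2f}: {rate:.0%}")
```

Because both conditions are conjunctive, the acceptance rate can only stay flat or fall as either threshold tightens, so a table of this shape makes it easy to see whether a headline figure sits on a plateau or at a cliff edge.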

Circularity Check

0 steps flagged

No significant circularity in benchmark evaluation or scoring

full rationale

The paper introduces an empirical benchmark consisting of 42 SME-authored prompts evaluated via deterministic ground-truth verifiers (mean 13.8 per task) and an independent five-criterion 0-3 SME rubric, with results aggregated into Verifier-Rubric Scores and compared directly to external published benchmarks such as APEX-v1 (64.2), ProfBench (65.9), and ResearchRubrics (<68). No derivations, equations, or fitted parameters are present that reduce any reported acceptance rates, VRS scores, or comparative claims to self-defined inputs by construction. The evaluation chain relies on external SME rubrics, deterministic verifiers, and cross-benchmark validation rather than self-citation load-bearing premises or ansatz smuggling, rendering the reported results self-contained against the stated inputs.

Axiom & Free-Parameter Ledger

2 free parameters · 1 axiom · 0 invented entities

The central evaluation framework rests on assumptions about prompt representativeness and chosen scoring thresholds rather than new mathematical derivations or entities.

free parameters (2)
  • rubric mean threshold = 2.5
    Joint acceptance criterion set at 2.5 on the 0-3 scale.
  • verifier rate threshold = 80%
    Joint acceptance criterion set at 80 percent pass rate.
axioms (1)
  • domain assumption: SME-authored prompts with cognitive traps represent typical management consultant analytical work
    The benchmark's claim to evaluate deployed enterprise use depends on this premise about prompt design and relevance.

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. MetaResearcher: Scaling Deep Research via Self-Reflective Reinforcement Learning in Adversarial Virtual Environments

    cs.AI 2026-06 unverdicted novelty 5.0

    MetaResearcher is a proposed multi-component framework for scaling deep research agent training via adversarial virtual worlds, discovery tasks, meta-rewards, and multi-agent collaboration.