arxiv: 2606.21627 · v1 · pith:7SVZSUQ7new · submitted 2026-06-19 · 💻 cs.AI · cs.LG

Counsel: A Meta-Evaluation Dataset for Agentic Tasks

Pith reviewed 2026-06-26 14:06 UTC · model grok-4.3

classification 💻 cs.AI cs.LG
keywords meta-evaluation · LLM-as-a-judge · agentic tasks · process-level critiques · human alignment · tau-bench · DA-Code · Krippendorff alpha

The pith

Counsel is the first public dataset of human meta-evaluations for LLM judge critiques on agent trajectories, with annotators reaching an inter-annotator agreement of 0.78 (Krippendorff's alpha).

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Evaluating full trajectories of AI agents on complex tasks takes too long for humans to do at scale, so developers turn to LLM-as-a-judge systems. Counsel supplies the first open dataset that lets researchers check how well those judges perform by collecting their process-level critiques on two agent benchmarks and then recording human ratings of the same critiques. Annotators mark each flagged error as spot on, correct location but poor reasoning, or should not have flagged, and reach reliable agreement across raters. Stronger judge models plus extra reasoning steps produce higher alignment, with the best reaching roughly 88 percent on error location and 65 percent on reasoning quality. The released dataset is meant to serve as calibration data for training or tuning better automated evaluators of agent behavior.

Core claim

Counsel consists of process-level critiques from open-weight LLMJs on tau-bench and DA-Code paired with human meta-evaluations of these critiques. Human annotators label critiques on each flagged error as spot on, correct location but poor reasoning, or should not have flagged, achieving reliable inter-annotator agreement (Krippendorff's alpha of 0.78). The resulting dataset stratifies LLMJ critiques by human alignment across both error location within a trajectory and reasoning quality, serving as valuable data to calibrate, improve, or train LLMJs for agents. Comparing open-weight judges, more capable judge models and more reasoning effort both enabled improved human agreement, with the strongest judge reaching ~88% agreement on location and ~65% on reasoning.

What carries the argument

Human meta-evaluation labels that classify each LLMJ critique as spot on, correct location but poor reasoning, or should not have flagged, thereby measuring alignment on both error location and reasoning quality.
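
One way to read the two headline percentages is as simple aggregates over the three labels. The sketch below assumes that location alignment counts both "spot on" and "correct location but poor reasoning", and that reasoning alignment counts only "spot on". These definitions are an assumption rather than something the paper confirms here, and the label strings, function name, and toy data are invented for illustration.

```python
from collections import Counter

# Label strings follow the three categories named in the text; the exact
# spelling used in the released files is an assumption.
SPOT_ON = "spot on"
POOR_REASONING = "correct location but poor reasoning"
SHOULD_NOT_FLAG = "should not have flagged"


def alignment_rates(labels):
    """Return (location_rate, reasoning_rate) for a list of human labels.

    Assumed definitions (not confirmed by the source):
      location_rate  = share of critiques whose flagged location was accepted
      reasoning_rate = share of critiques accepted on location and reasoning
    """
    if not labels:
        raise ValueError("need at least one label")
    counts = Counter(labels)
    unknown = set(counts) - {SPOT_ON, POOR_REASONING, SHOULD_NOT_FLAG}
    if unknown:
        raise ValueError(f"unexpected labels: {sorted(unknown)}")
    total = len(labels)
    location_ok = counts[SPOT_ON] + counts[POOR_REASONING]
    return location_ok / total, counts[SPOT_ON] / total


# Toy data, invented for illustration only.
toy = [SPOT_ON] * 6 + [POOR_REASONING] * 2 + [SHOULD_NOT_FLAG] * 2
location_rate, reasoning_rate = alignment_rates(toy)
print(f"location: {location_rate:.2f}, reasoning: {reasoning_rate:.2f}")
```

Under these definitions the reasoning rate can never exceed the location rate, which is consistent with the reported gap between roughly 88 percent and 65 percent.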

If this is right

  • The dataset supplies training and calibration data for improving LLM judges on agent trajectories.
  • Stronger open-weight models combined with greater reasoning effort produce measurably higher human alignment.
  • Process-level critiques can be stratified by location accuracy and reasoning quality for targeted judge improvement.
  • Permissively licensed open-weight generation enables community reuse for agent evaluator development.
  • The approach directly addresses the hours-long human annotation cost that currently limits scaling of agent evaluations.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same meta-evaluation protocol could be applied to additional agent benchmarks to test whether alignment patterns hold more broadly.
  • Labels distinguishing location from reasoning quality could guide prompt engineering that separately targets each aspect of judge performance.
  • The dataset opens a route to building reward models or self-critique loops that improve agents using meta-evaluation signals.
  • If agreement remains high, hybrid pipelines that use LLM judges for most work and humans only for low-alignment cases become feasible.

Load-bearing premise

Human meta-evaluations collected on tau-bench and DA-Code provide a reliable and generalizable signal for calibrating LLM judges across the broader space of agentic tasks and error types.

What would settle it

A large drop in agreement rates when the same LLM judges are tested on a new agent benchmark with substantially different task structure or error distribution would undermine the dataset's claimed utility.

Figures

Figures reproduced from arXiv: 2606.21627 by Antonia Calvi, Charlie Wang, Eujeong Choi, Henry Broomfield, Max Bartolo, Patrick Lewis, Roman Engeler, Sashank Pisupati.

Figure 1: Example agent trajectories, judge critiques, and human meta-annotations from Counsel. Each side represents a trajectory with multiple spans, where each box shows the output generation of a span that is conditioned on information in only preceding boxes. Left: A trajectory from τ-bench, a customer support benchmark. In this interaction an LLMJ (red) correctly flags the error location and reasons correctly …
Figure 2: Judge critique rates across agents, benchmarks, and judge models. The figure shows the proportion of agent spans flagged as containing an error by each judge model, stratified by agent model for τ-bench retail (left) and DA-Code (right). The vertical axis is the proportion of judgements.
Figure 3: Human meta-annotated quality of judge outputs. Proportion of critique and judgment labeled by human annotators as Spot On, Poor Reasoning (correct location), or Should Not Have Flagged, broken down by agent model and judge model for each of τ-bench retail (left) and DA-Code (right).
Figure 4: Example of task underspecification in DA-Code. The task (top) does not specify numerical precision or rounding requirements. The agent produces correctly computed prices rounded to two decimal places (middle), while the benchmark gold output (bottom) retains full floating-point precision. Despite the agent’s output being semantically correct and consistent with common data reporting practices, the mismatch…
Figure 5: Example prompt for the judge models. Sections omitted for brevity are delineated with angular brackets.
Figure 6: Agent trajectory statistics across benchmarks and agent models. Top left: Distribution of the number of agent steps per trajectory on τ-bench retail for GPT-OSS-20B and Qwen3 agents. Top right: Distribution of the number of agent steps per trajectory on DA-Code for Qwen3 agents. Bottom left: Distribution of the number of output tokens (not including reasoning) per agent step on τ-bench retail for GPT-OSS…
Figure 7: Analysis of self-preference bias in LLM-as-a-judge critiques. Proportion of agent spans critiqued by each judge model, grouped by whether the judge and agent belong to the same model family or different model families. p-values for one-sided normal tests for proportions of whether the same-family judgments are less prevalent than different-family judgments are above their respective bars.
Figure 8: Effect of in-context meta-evaluation examples on agent performance. Ablations compare no feedback, only wrong location or poor-reasoning feedback, mixed feedback, and only spot-on feedback. Each point in a violin is the average task reward on τ-bench retail across all 115 tasks in a benchmark run. There were 10 full benchmark runs for each violin. Top: Examples provided to the agent’s system prompt. Bottom…
Figure 9: Example prompt for the judge model when acting as an evaluator in the loop with few-shot examples from Counsel. It takes a very similar structure to …
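
The Figure 7 caption refers to one-sided normal tests for proportions. A minimal sketch of such a test follows, assuming a pooled two-proportion z-test with the alternative that same-family judges flag a smaller share of spans than different-family judges. The pooling choice, the function name, and the counts are assumptions made for illustration; the paper's exact procedure is not stated here.

```python
import math


def one_sided_two_proportion_z(flag_same, n_same, flag_diff, n_diff):
    """Test H1: p_same < p_diff with a pooled two-proportion z-test.

    Returns (z, p_value). A small p-value suggests same-family judges flag
    proportionally fewer spans than different-family judges.
    """
    if n_same <= 0 or n_diff <= 0:
        raise ValueError("sample sizes must be positive")
    p_same = flag_same / n_same
    p_diff = flag_diff / n_diff
    pooled = (flag_same + flag_diff) / (n_same + n_diff)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_same + 1 / n_diff))
    if se == 0:
        raise ValueError("test is undefined when the pooled proportion is 0 or 1")
    z = (p_same - p_diff) / se
    # Lower-tail probability of the standard normal distribution
    p_value = 0.5 * (1 + math.erf(z / math.sqrt(2)))
    return z, p_value


# Invented counts: 30 of 100 same-family spans flagged, 45 of 100 otherwise.
z, p_value = one_sided_two_proportion_z(30, 100, 45, 100)
print(f"z = {z:.3f}, one-sided p = {p_value:.4f}")
```

Because the test is one-sided, equal flag rates give a p-value of 0.5, and only a lower same-family rate pushes the p-value toward zero.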
Original abstract

As agentic systems tackle increasingly complex multi-step tasks, evaluating their trajectories presents a major bottleneck - human annotation of a single trajectory on popular agentic benchmarks can take hours, making it difficult to scale evaluations for measuring performance or curating training data. This has driven widespread reliance on automated approaches such as LLM-as-a-judge (LLMJ) to critique agents at the process and outcome-levels at scale, however, the soundness of LLMJ critiques often goes unmeasured. Here, we introduce Counsel, the first public dataset of meta-evaluations for agentic tasks. Counsel consists of process-level critiques from open-weight LLMJs on two agent benchmarks: tau-bench (customer support agents) and DA-Code (coding agents), and human meta-evaluations of these critiques. Human annotators label critiques on each flagged error as "spot on", "correct location but poor reasoning", or "should not have flagged", achieving reliable inter-annotator agreement (Krippendorff's alpha of 0.78). The resulting dataset stratifies LLMJ critiques by human alignment across both error location within a trajectory and reasoning quality, serving as valuable data to calibrate, improve, or train LLMJs for agents. Comparing open-weight judges, we find that more capable judge models and more reasoning effort both enabled improved human agreement, with the strongest judge reaching ~88% agreement on location and ~65% on reasoning. Counsel is generated using open-weight models and is permissively licensed for broad community use, which we hope will enable rigorous study and improved alignment of LLM-based evaluators for agentic systems.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper introduces Counsel, the first public dataset of meta-evaluations for agentic tasks. It consists of process-level critiques generated by open-weight LLM judges on trajectories from tau-bench (customer-support dialogues) and DA-Code (coding trajectories), paired with human meta-evaluations of those critiques using three labels (spot on, correct location but poor reasoning, should not have flagged). The dataset reports Krippendorff's alpha of 0.78 for inter-annotator agreement and finds that more capable judges with greater reasoning effort achieve higher human agreement (~88% on location, ~65% on reasoning). The authors position Counsel as calibration data for improving LLM judges on agentic systems and release it under a permissive license using open-weight models.

Significance. If the reported human alignment holds and the dataset is adopted, it would provide a concrete, open resource for studying and improving LLM-as-a-judge reliability on multi-step agent trajectories, directly addressing the scaling bottleneck noted in the abstract. The use of open-weight models and permissive licensing strengthens its potential for community follow-on work on calibration and training.

major comments (2)
  1. [Abstract] The central utility claim that Counsel supplies 'valuable data to calibrate, improve, or train LLMJs for agents' broadly is load-bearing for the paper's contribution but rests only on human meta-evaluations collected exclusively from tau-bench and DA-Code; no cross-domain hold-out, error-type taxonomy coverage, or third benchmark is reported, so domain-specific alignment patterns cannot be ruled out.
  2. [Abstract, dataset construction paragraph] The soundness of the reported Krippendorff's alpha of 0.78 and the location/reasoning agreement figures is only partially supported because the manuscript provides no details on annotation guidelines, sampling of critiques, or controls for annotator bias.
minor comments (1)
  1. The abstract would benefit from stating the total number of critiques, trajectories, and annotators to give readers immediate scale context.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on the scope of our claims and the transparency of our annotation process. We address each major comment below and will revise the manuscript to strengthen these aspects while preserving the core contribution of releasing the first public meta-evaluation dataset for agentic tasks.

Point-by-point responses
  1. Referee: [Abstract] The central utility claim that Counsel supplies 'valuable data to calibrate, improve, or train LLMJs for agents' broadly is load-bearing for the paper's contribution but rests only on human meta-evaluations collected exclusively from tau-bench and DA-Code; no cross-domain hold-out, error-type taxonomy coverage, or third benchmark is reported, so domain-specific alignment patterns cannot be ruled out.

    Authors: We acknowledge that the human alignment results are derived from only two benchmarks and that no cross-domain hold-out or third benchmark is included. Tau-bench and DA-Code do represent meaningfully different agentic settings (multi-turn customer support dialogues versus coding trajectories), but this does not fully address the risk of domain-specific patterns. We will revise the abstract to qualify the utility claim, add an explicit limitations paragraph discussing the absence of broader domain coverage, and note that Counsel is intended as an initial calibration resource rather than a definitive cross-domain benchmark. No new experiments are feasible within the current scope.

    Revision: partial

  2. Referee: [Abstract, dataset construction paragraph] The soundness of the reported Krippendorff's alpha of 0.78 and the location/reasoning agreement figures is only partially supported because the manuscript provides no details on annotation guidelines, sampling of critiques, or controls for annotator bias.

    Authors: We agree that the main text should contain more explicit details on the annotation protocol. The full manuscript includes a human annotation section, but we will expand it in revision to include the full annotation guidelines, the sampling strategy used to select critiques for labeling, and the specific procedures employed to mitigate annotator bias (e.g., training, adjudication process, and demographic considerations). These additions will directly support the reported agreement statistics.

    Revision: yes

Circularity Check

0 steps flagged

Empirical dataset release with no derivations or self-referential predictions

The paper is a dataset release describing collection of human meta-evaluations on LLM critiques from two fixed benchmarks (tau-bench and DA-Code). No equations, fitted parameters, predictions, or derivation chains appear; reported statistics (Krippendorff α=0.78, location/reasoning agreement rates) are direct measurements from the human annotations rather than quantities derived from the dataset itself or from self-citations. The central claim is the dataset's existence and permissively licensed availability for downstream use, which is self-contained and externally verifiable without circular reduction.
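
To make concrete what a direct measurement of agreement looks like, here is a minimal sketch of Krippendorff's alpha for nominal labels, computed from a coincidence matrix. It is a generic textbook formulation and not the authors' code; the function name, the short label strings, and the toy ratings are invented for illustration.

```python
from collections import Counter
from itertools import permutations


def krippendorff_alpha_nominal(units):
    """Krippendorff's alpha for nominal labels.

    `units` is a list of lists: the labels given to one item by the raters
    who saw it. Items with fewer than two labels are ignored.
    """
    coincidences = Counter()
    for labels in units:
        m = len(labels)
        if m < 2:
            continue
        # Every ordered pair of labels from different raters on the same item
        for a, b in permutations(labels, 2):
            coincidences[(a, b)] += 1.0 / (m - 1)

    totals = Counter()
    for (a, _b), weight in coincidences.items():
        totals[a] += weight
    n = sum(totals.values())
    if n <= 1:
        raise ValueError("need at least one item with two or more labels")

    # Observed and expected disagreement under the nominal distance metric
    observed = sum(w for (a, b), w in coincidences.items() if a != b) / n
    expected = sum(
        totals[a] * totals[b] for a in totals for b in totals if a != b
    ) / (n * (n - 1))
    if expected == 0:
        raise ValueError("alpha is undefined when only one label is used")
    return 1.0 - observed / expected


# Toy ratings: three critiques, two raters each (invented for illustration).
toy_units = [["spot on", "spot on"], ["no flag", "no flag"], ["spot on", "no flag"]]
alpha = krippendorff_alpha_nominal(toy_units)
print(f"alpha = {alpha:.3f}")
```

Alpha equals 1 when raters always agree and falls toward 0 as agreement approaches what chance alone would produce, so a value of 0.78 sits well above chance.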

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

No mathematical derivations or fitted parameters are present. The work rests on the domain assumption that human meta-evaluations constitute a usable ground truth for LLM judge quality.

axioms (1)
  • Domain assumption: Human annotations on critique quality provide a reliable signal for improving LLM-as-a-judge systems.
    The entire utility of the dataset depends on this premise, which is invoked when claiming the data can calibrate or train judges.

pith-pipeline@v0.9.1-grok · 5840 in / 1236 out tokens · 23957 ms · 2026-06-26T14:06:01.450828+00:00 · methodology
