pith. sign in

arxiv: 2606.22737 · v1 · pith:NF7B3YFJnew · submitted 2026-06-22 · 💻 cs.AI · cs.CL· cs.SE

GroundEval: A Deterministic Replacement for LLM-as-Judge in Stateful Agent Evaluation

Pith reviewed 2026-06-26 09:08 UTC · model grok-4.3

classification 💻 cs.AI cs.CLcs.SE
keywords GroundEvalagent evaluationdeterministic scoringevidence groundingLLM-as-judge alternativestateful agentstrajectory analysisbenchmarking
0
0 comments X

The pith

GroundEval replaces LLM judges with a deterministic check of whether agents retrieved and used the right evidence at the right time.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents GroundEval as a judge-free method for evaluating stateful agents that operate over context. It defines a domain configuration that specifies what evidence is available when, generates questions from that setup, and then scores both the final answer and the full trajectory of searches, fetches, and citations the agent actually performed. The framework targets three specific grounding failures: claiming information is absent without checking, reasoning from evidence the agent could not yet have seen, and invoking the wrong causal mechanism. Case studies show frontier LLM judges awarding scores above 0.85 to answers whose required artifacts were never retrieved, while GroundEval returns 0.000 and supplies turn-level diagnostics that link tool calls to the agent's narration.

Core claim

GroundEval is a deterministic framework that uses a domain configuration to define time-bounded and access-controlled evidence, generates questions from it, lets the agent answer using its own tools, and then scores the answer together with the recorded trajectory against the permitted evidence paths; this produces three tracks (Silence, Perspective, Counterfactual) that expose when plausible answers rest on invalid evidence access.

What carries the argument

The domain configuration, which encodes the evidence the agent is permitted to access at each time step and serves both to generate questions and to validate the agent's recorded sequence of tool uses and citations.

If this is right

  • Agents receive zero scores when they claim absence without having performed the required check.
  • Evaluation distinguishes reasoning from evidence available at the relevant time versus evidence the agent could not yet have seen.
  • Scores reflect whether the correct causal mechanism was used rather than any plausible one.
  • Each question yields inspectable diagnostics that pair tool activity with the agent's turn-level narration.
  • The method applies to any stateful agent whose trajectory can be logged against a defined evidence domain.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same domain-configuration approach could be applied to audit live agent deployments by logging trajectories in production.
  • Per-question diagnostics could guide iterative debugging of specific agent failure modes during development.
  • The framework might be extended to generate new test questions automatically from updates to the domain configuration.
  • Integration with existing agent runtimes could make evidence-path verification a standard step before deployment.

Load-bearing premise

The domain configuration must correctly and completely list every piece of evidence the agent could have accessed at each relevant time, and the recorded trajectory must capture every search, fetch, and citation action without omissions.

What would settle it

An agent that answers a question correctly by inventing details from an artifact it never retrieved, yet still receives a non-zero GroundEval score on that question.

Figures

Figures reproduced from arXiv: 2606.22737 by Jeffrey Flynt.

Figure 1
Figure 1. Figure 1: GROUNDEVAL pipeline. A machine-readable contract over the event log, artifact corpus, access policy, and evalua￾tion configuration generates question contracts and expected answer schemas. Context mode observes evidence through structured artifact citations, while tool mode observes evidence through a gated fetch/search trace. Both modes terminate in the same deter￾ministic scorer, which produces answer, t… view at source ↗
Figure 2
Figure 2. Figure 2: The Visibility Cone in the Perspective Track. G [PITH_FULL_IMAGE:figures/full_fig_p008_2.png] view at source ↗
Figure 3
Figure 3. Figure 3: Comparison of how traditional judges and G [PITH_FULL_IMAGE:figures/full_fig_p008_3.png] view at source ↗
Figure 4
Figure 4. Figure 4: Same response, two verdicts. Given the agent’s answer that the Confluence page was not created, two frontier LLM judges scored the response 0.90 and 0.85 based on prose plausibility alone. GROUNDEVAL’s trace check verified whether the required artifact, CONF-PROD-002, was ever fetched. It was not, so search-space coverage is zero and the deterministic answer score is 0.000. The judges evaluated plausibilit… view at source ↗
read the original abstract

Before letting an agent operate over real context, can you prove it used the right evidence? GroundEval turns that question into a deterministic test of what the agent searched, fetched, cited, and was permitted to access. In one case study, two frontier LLM judges scored a plausible agent response above 0.85. But the trace told a different story: the agent had never retrieved the artifact its answer depended on, yielding a GroundEval score of 0.000. We introduce GroundEval, a judge-free framework for evaluating agents against grounded, time-bounded, and access-controlled evidence. GroundEval uses a domain configuration to generate questions, lets the agent choose how to answer, and then scores both the final answer and the recorded trajectory that produced it. The benchmark targets three failures that LLM-as-judge evaluation struggles to detect: whether an agent checked before claiming absence, reasoned only from evidence available to the actor at the relevant time, and used the correct causal mechanism rather than a plausible one. These correspond to three tracks: Silence, Perspective, and Counterfactual. GroundEval exposes when plausible answers rest on invalid evidence paths, and produces structured per-question diagnostics that pair tool activity with the agent's turn-level narration, making each score inspectable rather than merely reported. What our case studies turned up is that this gap isn't some rare corner case. It's exactly the blind spot that final-answer and judge-based scoring were never built to catch.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 0 minor

Summary. The paper introduces GroundEval, a deterministic, judge-free framework for evaluating stateful agents against grounded, time-bounded, and access-controlled evidence. It uses a domain configuration to generate questions, records the agent's trajectory of searches/fetches/citations, and scores both the final answer and trajectory for three failure modes (Silence, Perspective, Counterfactual) that LLM-as-judge methods miss. A case study is presented in which frontier LLM judges assign scores >0.85 to a response while GroundEval assigns 0.000 because the required artifact was never retrieved.

Significance. If the framework can be shown to work, it would provide a more reliable and inspectable alternative to LLM-as-judge evaluation for agent grounding, with structured diagnostics that pair tool activity to narration. The deterministic nature and focus on evidence paths address a genuine blind spot in current evaluation practices.

major comments (2)
  1. [Abstract] Abstract / case study: the claim that GroundEval correctly identifies invalid evidence paths (LLM judges >0.85 vs GroundEval=0) rests on the unverified assumption that the domain configuration exhaustively encodes all permitted access paths at each time step and that the recorded trajectory logs every search/fetch/citation without omissions. No verification method (e.g., exhaustive enumeration of paths or cross-validation against the agent runtime) is described; this assumption is load-bearing for the central claim.
  2. [Abstract] Abstract: no implementation details, scoring formulas, algorithms, or validation experiments are supplied for how the three tracks (Silence, Perspective, Counterfactual) are operationalized or how deterministic scores are computed from the trajectory and domain config. This prevents assessment of whether the method is actually parameter-free or reproducible.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on the assumptions underlying our case study and the need for explicit implementation details. We address each major comment below.

read point-by-point responses
  1. Referee: [Abstract] Abstract / case study: the claim that GroundEval correctly identifies invalid evidence paths (LLM judges >0.85 vs GroundEval=0) rests on the unverified assumption that the domain configuration exhaustively encodes all permitted access paths at each time step and that the recorded trajectory logs every search/fetch/citation without omissions. No verification method (e.g., exhaustive enumeration of paths or cross-validation against the agent runtime) is described; this assumption is load-bearing for the central claim.

    Authors: The referee is correct that the case-study claim depends on the domain configuration serving as an exhaustive encoding of permitted paths and on complete trajectory logging. The manuscript presents the domain configuration as derived directly from the environment specification (state machine plus access controls), which is treated as ground truth by construction. However, no explicit verification procedure is described. We will add a new subsection detailing how the configuration is cross-validated against the runtime and how tool-call logging ensures trajectory completeness. revision: yes

  2. Referee: [Abstract] Abstract: no implementation details, scoring formulas, algorithms, or validation experiments are supplied for how the three tracks (Silence, Perspective, Counterfactual) are operationalized or how deterministic scores are computed from the trajectory and domain config. This prevents assessment of whether the method is actually parameter-free or reproducible.

    Authors: We agree that the abstract supplies no scoring formulas or algorithms. The manuscript defines the tracks at a high level (Silence: failure to retrieve before claiming absence; Perspective: use of only temporally available evidence; Counterfactual: matching the correct causal mechanism) and states that scores are computed deterministically from trajectory events versus domain-config rules. To enable assessment of reproducibility and parameter-freeness, we will insert pseudocode for the scoring procedure and a minimal validation experiment in the revised manuscript. revision: yes

Circularity Check

0 steps flagged

No circularity: framework is definitionally independent of fitted quantities or self-citations

full rationale

The paper presents GroundEval as a deterministic scoring procedure driven by an externally supplied domain configuration and recorded agent trajectory. No equations, parameters, or predictions are defined; the method does not fit anything to data and then relabel the fit as a prediction. No self-citations are invoked to justify uniqueness or load-bearing premises. The central distinction between LLM-judge scores and GroundEval scores is shown via case-study comparison against the supplied config/trace, which the paper treats as given inputs rather than derived outputs. This satisfies the self-contained criterion with no reduction of claims to their own inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The framework rests on the assumption that a manually authored domain configuration can faithfully encode time-bounded and access-controlled evidence availability for any given evaluation scenario.

axioms (1)
  • domain assumption Domain configurations accurately represent time-bounded and access-controlled evidence availability for the agent.
    This premise is required to generate questions and to compute the Silence, Perspective, and Counterfactual scores from trajectories.

pith-pipeline@v0.9.1-grok · 5788 in / 1176 out tokens · 36447 ms · 2026-06-26T09:08:17.032903+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

22 extracted references · 6 linked inside Pith

  1. [1]

    Gonzalez, and Ion Stoica

    Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zi Lin, Zhuohan Li, Dacheng Li, Eric Xing, Hao Zhang, Joseph E. Gonzalez, and Ion Stoica. Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena. InAdvances in Neural Information Processing Systems 36, Datasets and Benchmarks Track, 2023

  2. [2]

    G-Eval: NLG Evaluation using GPT-4 with Better Human Alignment

    Yang Liu, Dan Iter, Yichong Xu, Shuohang Wang, Ruochen Xu, and Chenguang Zhu. G-Eval: NLG Evaluation using GPT-4 with Better Human Alignment. InProceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 2511–2522, 2023

  3. [3]

    A Systematic Study of Po- sition Bias in LLM-as-a-Judge

    Lin Shi, Chiyu Ma, Wenhua Liang, Weicheng Ma, and Soroush V osoughi. A Systematic Study of Po- sition Bias in LLM-as-a-Judge. InProceedings of the 31st International Conference on Computational Linguistics, pages 292–314, 2025

  4. [4]

    Self-Preference Bias in LLM-as-a-Judge.arXiv preprint arXiv:2410.21819, 2024

    Koki Wataoka, Tsubasa Takahashi, and Ryokan Ri. Self-Preference Bias in LLM-as-a-Judge.arXiv preprint arXiv:2410.21819, 2024

  5. [5]

    Chawla, and Xiangliang Zhang

    Jiayi Ye, Yanbo Wang, Yue Huang, Dongping Chen, Qihui Zhang, Nuno Moniz, Tian Gao, Werner Geyer, Chao Huang, Pin-Yu Chen, Nitesh V . Chawla, and Xiangliang Zhang. Justice or Prejudice? Quantifying Biases in LLM-as-a-Judge. InInternational Conference on Learning Representations, 2025

  6. [6]

    Let’s Verify Step by Step

    Hunter Lightman, Vineet Kosaraju, Yuri Burda, Harri Edwards, Bowen Baker, Teddy Lee, Jan Leike, John Schulman, Ilya Sutskever, and Karl Cobbe. Let’s Verify Step by Step. InInternational Conference on Learning Representations, 2024

  7. [7]

    Faithful Chain-of-Thought Reasoning

    Qing Lyu, Shreya Havaldar, Adam Stein, Li Zhang, Delip Rao, Eric Wong, Marianna Apidianaki, and Chris Callison-Burch. Faithful Chain-of-Thought Reasoning. InProceedings of the 13th International Joint Conference on Natural Language Processing and the 3rd Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics, pages 305–329, 2023

  8. [8]

    Bow- man, and Ethan Perez

    Tamera Lanham, Anna Chen, Ansh Radhakrishnan, Benoit Steiner, Carson Denison, Danny Hernan- dez, Dustin Li, Esin Durmus, Evan Hubinger, Jackson Kernion, Kamil ˙Lukoši¯ut˙e, Karina Nguyen, Newton Cheng, Nicholas Joseph, Nicholas Schiefer, Oliver Rausch, Robin Larson, Sam McCandlish, Sandipan Kundu, Saurav Kadavath, Shannon Yang, Thomas Henighan, Timothy Ma...

  9. [9]

    TRAJECT- Bench: A Trajectory-Aware Benchmark for Evaluating Agentic Tool Use

    Pengfei He, Zhenwei Dai, Bing He, Hui Liu, Xianfeng Tang, Hanqing Lu, Juanhui Li, Jiayuan Ding, Subhabrata Mukherjee, Suhang Wang, Yue Xing, Jiliang Tang, and Benoit Dumoulin. TRAJECT- Bench: A Trajectory-Aware Benchmark for Evaluating Agentic Tool Use. InInternational Conference on Learning Representations, 2026

  10. [10]

    OrgForge-IT: A Verifiable Synthetic Benchmark for LLM-Based Insider Threat Detec- tion.arXiv preprint arXiv:2603.22499, 2026

    Jeffrey Flynt. OrgForge-IT: A Verifiable Synthetic Benchmark for LLM-Based Insider Threat Detec- tion.arXiv preprint arXiv:2603.22499, 2026

  11. [11]

    OrgForge: A Multi-Agent Simulation Framework for Verifiable Synthetic Corporate Corpora.arXiv preprint arXiv:2603.14997, 2026

    Jeffrey Flynt. OrgForge: A Multi-Agent Simulation Framework for Verifiable Synthetic Corporate Corpora.arXiv preprint arXiv:2603.14997, 2026

  12. [12]

    Structured Belief State and the First Precision-Aware Benchmark for LLM Memory Retrieval.arXiv preprint arXiv:2605.11325, 2026

    Jeffrey Flynt. Structured Belief State and the First Precision-Aware Benchmark for LLM Memory Retrieval.arXiv preprint arXiv:2605.11325, 2026

  13. [13]

    AgentBoard: An Analytical Evaluation Board of Multi-turn LLM Agents

    Chang Ma, Junlei Zhang, Zhihao Zhu, Cheng Yang, Yujiu Yang, Yaohui Jin, Zhenzhong Lan, Lingpeng Kong, and Junxian He. AgentBoard: An Analytical Evaluation Board of Multi-turn LLM Agents. In Advances in Neural Information Processing Systems 37, Datasets and Benchmarks Track, 2024

  14. [14]

    Pal, and Siva Reddy

    Xing Han Lù, Amirhossein Kazemnejad, Nicholas Meade, Arkil Patel, Dongchan Shin, Alejandra Zambrano, Karolina Sta ´nczak, Peter Shaw, Christopher J. Pal, and Siva Reddy. AgentRewardBench: Evaluating Automatic Evaluations of Web Agent Trajectories. InConference on Language Modeling, 2025

  15. [15]

    RAGAs: Automated Evaluation of Retrieval Augmented Generation

    Shahul Es, Jithin James, Luis Espinosa Anke, and Steven Schockaert. RAGAs: Automated Evaluation of Retrieval Augmented Generation. InProceedings of the 18th Conference of the European Chapter of the Association for Computational Linguistics: System Demonstrations, pages 150–158, 2024

  16. [16]

    ARES: An Automated Evalua- tion Framework for Retrieval-Augmented Generation Systems

    Jon Saad-Falcon, Omar Khattab, Christopher Potts, and Matei Zaharia. ARES: An Automated Evalua- tion Framework for Retrieval-Augmented Generation Systems. InProceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics, 2024

  17. [17]

    RAGTruth: A Hallucination Corpus for Developing Trustworthy Retrieval-Augmented Language Models

    Cheng Niu, Yuanhao Wu, Juno Zhu, Siliang Xu, Kashun Shum, Randy Zhong, Juntong Song, and Tong Zhang. RAGTruth: A Hallucination Corpus for Developing Trustworthy Retrieval-Augmented Language Models. InProceedings of the 62nd Annual Meeting of the Association for Computational Linguistics, pages 10862–10878, 2024

  18. [18]

    Measuring Attribution in Natural Language Generation Models.Computational Linguistics, 49(4):777–840, 2023

    Hannah Rashkin, Vitaly Nikolaev, Matthew Lamm, Lora Aroyo, Michael Collins, Dipanjan Das, Slav Petrov, Gaurav Singh Tomar, Iulia Turc, and David Reitter. Measuring Attribution in Natural Language Generation Models.Computational Linguistics, 49(4):777–840, 2023

  19. [19]

    WebGPT: Browser-Assisted Question-Answering with Human Feedback.arXiv preprint arXiv:2112.09332, 2021

    Reiichiro Nakano, Jacob Hilton, Suchir Balaji, Jeff Wu, Long Ouyang, Christina Kim, Christopher Hesse, Shantanu Jain, Vineet Kosaraju, William Saunders, Xu Jiang, Karl Cobbe, Tyna Eloundou, Gretchen Krueger, Kevin Button, Matthew Knight, Benjamin Chess, and John Schulman. WebGPT: Browser-Assisted Question-Answering with Human Feedback.arXiv preprint arXiv...

  20. [20]

    Evaluating Very Long-Term Conversational Memory of LLM Agents

    Adyasha Maharana, Dong-Ho Lee, Sergey Tulyakov, Mohit Bansal, Francesco Barbieri, and Yuwei Fang. Evaluating Very Long-Term Conversational Memory of LLM Agents. InProceedings of the 62nd Annual Meeting of the Association for Computational Linguistics, pages 13851–13870, 2024

  21. [21]

    LongMemEval: Benchmarking Chat Assistants on Long-Term Interactive Memory

    Di Wu, Hongwei Wang, Wenhao Yu, Yuwei Zhang, Kai-Wei Chang, and Dong Yu. LongMemEval: Benchmarking Chat Assistants on Long-Term Interactive Memory. InInternational Conference on Learning Representations, 2025. 17

  22. [22]

    LongMemEval-V2: Evaluating Long-Term Agent Memory Toward Experienced Colleagues.arXiv preprint arXiv:2605.12493, 2026

    Di Wu, Zixiang Ji, Asmi Kawatkar, Bryan Kwan, Jia-Chen Gu, Nanyun Peng, and Kai-Wei Chang. LongMemEval-V2: Evaluating Long-Term Agent Memory Toward Experienced Colleagues.arXiv preprint arXiv:2605.12493, 2026. 18