pith. machine review for the scientific record. sign in

arxiv: 2605.10516 · v1 · submitted 2026-05-11 · 💻 cs.AI

Recognition: no theorem link

Consistency as a Testable Property: Statistical Methods to Evaluate AI Agent Reliability

Authors on Pith no claims yet

Pith reviewed 2026-05-12 04:38 UTC · model grok-4.3

classification 💻 cs.AI
keywords AI agentsreliabilityconsistency metricsU-statisticskernel methodssemantically preserving perturbationsexecution robustnesstrajectory stability
0
0 comments X

The pith

AI agents often break down on minor task variations despite knowing the answer, and trajectory-level consistency metrics catch these failures better than pass rates.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces a statistical framework to quantify AI agent reliability by testing consistency when tasks are altered in semantically equivalent ways. It separates an agent's core capability from its execution robustness, arguing that small perturbations can trigger complete strategy shifts even when the agent has the necessary knowledge. Traditional pass@1 rates fail to detect these issues, while the proposed U-statistics for outputs and kernel metrics for trajectories offer more sensitive diagnostics. This matters for deploying agents in real-world settings where inconsistent behavior can be costly. Experiments across three benchmarks support the claim that trajectory metrics provide clearer signals of where and why agents deviate.

Core claim

Using U-statistics to measure output-level reliability and kernel-based metrics to assess trajectory-level stability under semantically preserving perturbations, the approach shows that agents can possess core capability for a task yet lack execution robustness, resulting in strategy breakdowns from minor variations. This separation is demonstrated on three agentic benchmarks, where trajectory consistency metrics prove more diagnostically sensitive than standard pass@1 rates.

What carries the argument

U-statistics for output-level reliability and kernel-based metrics for trajectory-level stability under semantically preserving perturbations, which together isolate execution robustness from core capability.

If this is right

  • Trajectory-level metrics identify robustness problems that pass@1 rates overlook in agent evaluations.
  • The framework enables pinpointing architectural issues that cause inconsistent agent behavior.
  • Agents intended for high-stakes use can be tested for consistency beyond basic task success.
  • Minor task variations can expose execution failures even when agents have the required knowledge.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • These consistency tests could be applied to non-agent systems like language models responding to varied prompts to check general reliability.
  • Training methods might incorporate these metrics as optimization targets to build more stable agents.
  • Re-evaluating existing agent benchmarks with trajectory metrics could uncover hidden failure patterns not visible in standard scores.

Load-bearing premise

Semantically preserving perturbations can be reliably defined and applied so that the metrics accurately measure execution robustness without introducing unintended biases or altering task meaning.

What would settle it

Applying the metrics to the three benchmarks and finding no greater sensitivity than pass@1 rates, or no evidence of strategy breakdowns under the perturbations.

Figures

Figures reproduced from arXiv: 2605.10516 by Aritra Guha, Cheryl Flynn, Harsh Raj, Niranjan Orkat, Subhabrata Majumdar, Suvrorup Mukherjee.

Figure 1
Figure 1. Figure 1: Accuracy and consistency (± standard error) on base and perturbed prompts from three agentic benchmarks. Statistically significant values of consistency metrics are marked by (∗ ). (0.75-0.90) under prompt-level perturbations but exhibit degradation under task-level perturbations. Trajectory metrics reveal deeper inconsistencies invisible to output-only evaluation. On SWE-bench Verified, the linear mcp per… view at source ↗
Figure 2
Figure 2. Figure 2: Performance of SWE-Bench Verified subset under varying perturbation strength. [PITH_FULL_IMAGE:figures/full_fig_p008_2.png] view at source ↗
Figure 3
Figure 3. Figure 3: Left: average edit position vs. (exponential - unweighted) score. Points above zero indicate [PITH_FULL_IMAGE:figures/full_fig_p009_3.png] view at source ↗
Figure 4
Figure 4. Figure 4: Trajectory summary statistics with mean ± standard error. Token counts done using cl100k base [PITH_FULL_IMAGE:figures/full_fig_p032_4.png] view at source ↗
Figure 5
Figure 5. Figure 5: Average edit position vs. (linear − unweighted) score, complementing the exponential panel in [PITH_FULL_IMAGE:figures/full_fig_p032_5.png] view at source ↗
read the original abstract

This paper establishes a rigorous measurement science for AI agent reliability, providing a foundational framework for quantifying consistency under semantically preserving perturbations. By leveraging $U$-statistics for output-level reliability and kernel-based metrics for trajectory-level stability, we offer a principled approach to evaluating agents across diverse operating conditions. Our proposal highlights the important distinction between the core capability and execution robustness of an agent, showing that minor task-level variations can induce complete strategy breakdowns despite the agent possessing the requisite knowledge for the task. We validate our framework through extensive experiments on three agentic benchmarks, demonstrating that trajectory-level consistency metrics provide far greater diagnostic sensitivity than traditional pass@1 rates. By providing the mathematical tools to isolate where and why agents deviate, we enable the identification and rectification of architectural concerns that hinder the deployment of agents in high-stakes, real-world environments.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper proposes a measurement framework for AI agent reliability that treats consistency as a testable statistical property. It applies U-statistics to assess output-level reliability and kernel-based metrics to quantify trajectory-level stability under semantically preserving perturbations. The central claim is that these trajectory-level metrics distinguish core capability from execution robustness more effectively than pass@1 rates, with experiments on three agentic benchmarks demonstrating greater diagnostic sensitivity for detecting strategy breakdowns.

Significance. If the perturbation generation process can be shown to preserve task semantics without introducing unstated biases, the framework would supply a principled way to isolate execution fragility from underlying competence. This could improve reliability assessment beyond binary success metrics and support targeted debugging of agent architectures in high-stakes settings. The use of established statistical tools (U-statistics, kernels) is a positive feature when properly anchored.

major comments (2)
  1. [Perturbation generation and validation (likely §3–4)] The distinction between core capability and execution robustness, and the superiority claim for trajectory-level kernel metrics over pass@1, rests on the unverified assumption that perturbations preserve task semantics while exposing execution differences. The paper must supply an explicit validation procedure (e.g., human or automated checks that success criteria remain unchanged) and report any failure rates of this preservation; without it, the metrics risk conflating capability shifts with robustness differences.
  2. [Experiments and results (likely §5)] The abstract states that trajectory-level metrics provide 'far greater diagnostic sensitivity,' yet no quantitative comparison (effect sizes, statistical tests, or confidence intervals) is visible in the provided summary. The experiments section must include direct head-to-head results on the three benchmarks, with error bars and controls for perturbation strength, to substantiate this load-bearing claim.
minor comments (2)
  1. [Methods] Notation for the kernel metric and U-statistic should be defined with explicit formulas and any hyperparameters (bandwidth, kernel choice) stated clearly to allow reproduction.
  2. [Related work] The paper should cite prior work on semantic equivalence checking or perturbation robustness in NLP/RL to situate the contribution.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive and detailed feedback. The comments highlight important aspects of validation and statistical rigor that will strengthen the manuscript. We address each major comment below and will incorporate the suggested revisions in the next version of the paper.

read point-by-point responses
  1. Referee: [Perturbation generation and validation (likely §3–4)] The distinction between core capability and execution robustness, and the superiority claim for trajectory-level kernel metrics over pass@1, rests on the unverified assumption that perturbations preserve task semantics while exposing execution differences. The paper must supply an explicit validation procedure (e.g., human or automated checks that success criteria remain unchanged) and report any failure rates of this preservation; without it, the metrics risk conflating capability shifts with robustness differences.

    Authors: We agree that an explicit validation procedure is necessary to rigorously support the distinction between core capability and execution robustness. In the revised manuscript, we will add a dedicated subsection to §3 that describes the perturbation generation process and includes both automated validation (via embedding-based semantic similarity thresholds) and human evaluation on a stratified sample of perturbations across the three benchmarks. We will report the observed preservation failure rates (preliminary internal checks indicate rates below 8%). This addition will directly address the risk of conflating capability shifts with robustness differences while preserving the framework's focus on trajectory-level stability. revision: yes

  2. Referee: [Experiments and results (likely §5)] The abstract states that trajectory-level metrics provide 'far greater diagnostic sensitivity,' yet no quantitative comparison (effect sizes, statistical tests, or confidence intervals) is visible in the provided summary. The experiments section must include direct head-to-head results on the three benchmarks, with error bars and controls for perturbation strength, to substantiate this load-bearing claim.

    Authors: The experiments in §5 already present comparative results across the three benchmarks (AgentBench, WebArena, and ToolBench) showing trajectory-level kernel metrics detect strategy breakdowns where pass@1 does not. To make the quantitative evidence fully explicit, we will revise §5 to include head-to-head tables with effect sizes, paired statistical tests (e.g., Wilcoxon signed-rank), 95% confidence intervals, and error bars from repeated runs with different random seeds. We will also add a control analysis varying perturbation strength (low/medium/high) and reporting metric sensitivity at each level. These changes will substantiate the diagnostic sensitivity claim with the requested statistical detail. revision: yes

Circularity Check

0 steps flagged

No circularity; standard statistical tools applied directly

full rationale

The paper's core framework applies U-statistics to output-level reliability and kernel-based metrics to trajectory-level stability as direct, off-the-shelf statistical constructions. These are not defined in terms of the target consistency quantities, nor are any parameters fitted to a subset of data and then relabeled as predictions. The distinction between core capability and execution robustness is a conceptual framing rather than a mathematical reduction. No load-bearing self-citations, uniqueness theorems, or ansatzes imported from prior author work appear in the derivation. The metrics remain independent of the perturbation-generation process they are applied to, satisfying the self-contained benchmark criterion.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract provides no explicit free parameters, axioms, or invented entities; the approach relies on standard statistical methods applied to the AI agent domain without additional postulates detailed here.

pith-pipeline@v0.9.0 · 5456 in / 1025 out tokens · 43119 ms · 2026-05-12T04:38:15.643093+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

40 extracted references · 40 canonical work pages · 10 internal anchors

  1. [1]

    arXiv preprint arXiv:2411.07763 (2024)

    R. Cao et al. Spider2-V: How far are multimodal agents from automating data science and engineering workflows?arXiv preprint arXiv:2411.07763, 2024. URLhttps://arxiv. org/abs/2411.07763

  2. [2]

    Z. Chen, W. Du, W. Zhang, K. Liu, J. Liu, M. Zheng, J. Zhuo, S. Zhang, D. Lin, K. Chen, and F. Zhao. T-eval: Evaluating the tool utilization capability of large language models step by step, 2024. URLhttps://arxiv.org/abs/2312.14033

  3. [3]

    Cuturi, J.-P

    M. Cuturi, J.-P. Vert, O. Birkenes, and T. Matsui. A kernel for time series based on global align- ments. In2007 IEEE International Conference on Acoustics, Speech and Signal Processing - ICASSP ’07, volume 2, pages II–413–II–416, 2007. doi: 10.1109/ICASSP.2007.366260

  4. [4]

    Lost in the Middle: How Language Models Use Long Contexts

    Y . Elazar, N. Kassner, S. Ravfogel, et al. Measuring and Improving Consistency in Pretrained Language Models.Transactions of the Association for Computational Linguistics, 2021. doi: 10.1162/tacl a 00410

  5. [5]

    E. Gan, Y . Zhao, L. Cheng, Y . Mao, A. Goyal, K. Kawaguchi, M.-Y . Kan, and M. Shieh. Reasoning robustness of llms to adversarial typographical errors. 2024. URLhttps: //arxiv.org/abs/2411.05345

  6. [6]

    Harbor: A framework for containerized agent execution.https://github

    Harbor Team. Harbor: A framework for containerized agent execution.https://github. com/harbor-framework/harbor, 2024

  7. [7]

    Hoeffding

    W. Hoeffding. A class of statistics with asymptotically normal distribution.The Annals of Mathematical Statistics, 19(3):293–325, 1948

  8. [8]

    S. Hong, M. Zhuge, J. Chen, X. Zheng, Y . Cheng, C. Zhang, J. Wang, Z. Wang, S. K. S. Yau, Z. Lin, L. Zhou, C. Ran, L. Xiao, C. Wu, and J. Schmidhuber. Metagpt: Meta programming for a multi-agent collaborative framework. 2024. URLhttps://arxiv.org/abs/2308.00352

  9. [9]

    C. E. Jimenez, J. Yang, A. Wettig, S. Yao, K. Pei, O. Press, and K. Narasimhan. Swe-bench: Can language models resolve real-world github issues? 2024. URLhttps://arxiv.org/ abs/2310.06770

  10. [10]

    Certifying llm safety against adversarial prompting

    A. Kumar, C. Agarwal, S. Srinivas, A. J. Li, S. Feizi, and H. Lakkaraju. Certifying llm safety against adversarial prompting. 2025. URLhttps://arxiv.org/abs/2309.02705

  11. [11]

    Lee.U-Statistics: Theory and Practice

    A. Lee.U-Statistics: Theory and Practice. Marcel Dekker, Inc., 1990

  12. [12]

    Mcp server.https://linear.app/docs/mcp

    Linear Team. Mcp server.https://linear.app/docs/mcp. accessed on May 4, 2026

  13. [13]

    Ma et al

    C. Ma et al. Agentboard: An analytical evaluation board of multi-turn llm agents.arXiv preprint, 2024

  14. [14]

    M. A. Merrill et al. Terminal-bench: Benchmarking agents on hard, realistic tasks in command line interfaces, 2026. URLhttps://arxiv.org/abs/2601.11868

  15. [15]

    Novikova, C

    J. Novikova, C. Anderson, B. Blili-Hamelin, D. Rosati, and S. Majumdar. Consistency in language models: Current landscape, challenges, and future directions, 2025. URLhttps: //arxiv.org/abs/2505.00268

  16. [16]

    S. G. Patil, T. Zhang, X. Wang, and J. E. Gonzalez. Gorilla: Large language model connected with massive APIs.arXiv preprint arXiv:2305.15334, 2024. URLhttps://arxiv.org/ abs/2305.15334

  17. [17]

    C. Qian, W. Liu, H. Liu, N. Chen, Y . Dang, J. Li, C. Yang, W. Chen, Y . Su, X. Cong, J. Xu, D. Li, Z. Liu, and M. Sun. Chatdev: Communicative agents for software development. 2024. URLhttps://arxiv.org/abs/2307.07924

  18. [18]

    Towards a science of ai agent reliability, 2026

    S. Rabanser, S. Kapoor, P. Kirgis, K. Liu, S. Utpala, and A. Narayanan. Towards a science of ai agent reliability, 2026. URLhttps://arxiv.org/abs/2602.16666. 10

  19. [19]

    Rabinovich, S

    E. Rabinovich, S. Ackerman, O. Raz, E. Farchi, and A. Anaby Tavor. Predicting question-answering performance of large language models through semantic consistency. In S. Gehrmann, A. Wang, J. Sedoc, E. Clark, K. Dhole, K. R. Chandu, E. Santus, and H. Sedghamiz, editors,Proceedings of the Third Workshop on Natural Language Genera- tion, Evaluation, and Met...

  20. [20]

    H. Raj, D. Rosati, and S. Majumdar. Measuring reliability of large language models through semantic consistency, 2023. URLhttps://arxiv.org/abs/2211.05853

  21. [21]

    Quantifying perturbation impacts for large language models,

    P. Rauba, Q. Wei, and M. van der Schaar. Quantifying perturbation impacts for large language models, 2024. URLhttps://arxiv.org/abs/2412.00868

  22. [22]

    Deepseek/kimi models produce malformed XML tool calls

    r/RooCode Community. Deepseek/kimi models produce malformed XML tool calls. Reddit discussion, Jan. 2025. URLhttps://www.reddit.com/r/RooCode/comments/1rk90b7/ deepseek_kimi_models_produce_malformed_xml_tool/. Accessed May 4, 2025

  23. [23]

    Y . Ruan, H. Dong, A. Wang, S. Pitis, Y . Zhou, J. Ba, Y . Dubois, C. J. Maddison, and T. Hashimoto. Identifying the risks of lm agents with an lm-emulated sandbox. 2024. URL https://arxiv.org/abs/2309.15817

  24. [24]

    Sadhuka, D

    S. Sadhuka, D. Prinster, C. Fannjiang, G. Scalia, A. Regev, and H. Wang. E-valuator: Reliable agent verifiers with sequential hypothesis testing, 2025. URLhttps://arxiv.org/abs/ 2512.03109

  25. [25]

    R. J. Serfling.Approximation Theorems of Mathematical Statistics. John Wiley & Sons, 1980

  26. [26]

    F. Shi, X. Chen, K. Misra, N. Scales, D. Dohan, E. Chi, N. Sch ¨arli, and D. Zhou. Large language models can be easily distracted by irrelevant context. 2023. URLhttps://arxiv. org/abs/2302.00093

  27. [27]

    Challenging BIG-Bench Tasks and Whether Chain-of-Thought Can Solve Them

    M. Suzgun, N. Scales, N. Sch ¨arli, S. Gehrmann, Y . Tay, H. W. Chung, A. Chowdhery, Q. V . Le, E. H. Chi, D. Zhou, and J. Wei. Challenging big-bench tasks and whether chain-of-thought can solve them, 2022. URLhttps://arxiv.org/abs/2210.09261

  28. [28]

    Swe-bench leaderboard.https://www.swebench.com, 2024

    SWE-bench Team. Swe-bench leaderboard.https://www.swebench.com, 2024. Accessed October 2024

  29. [29]

    Swe-bench live leaderboard.https://www.swebench.com/live, 2025

    SWE-bench Team. Swe-bench live leaderboard.https://www.swebench.com/live, 2025. Accessed January 2025

  30. [30]

    Large language models still can’t plan (A benchmark for llms on planning and reasoning about change).CoRR, abs/2206.10498, 2022

    K. Valmeekam, M. Marquez, A. Olmo, S. Sreedharan, and S. Kambhampati. Planbench: An extensible benchmark for evaluating large language models on planning and reasoning about change. 2023. URLhttps://arxiv.org/abs/2206.10498

  31. [31]

    X. Wang, Z. Wang, J. Liu, Y . Chen, L. Yuan, H. Peng, and H. Ji. Mint: Evaluating llms in multi-turn interaction with tools and language feedback, 2024. URLhttps://arxiv.org/ abs/2309.10691

  32. [32]

    C. Yang, Y . Shi, Q. Ma, M. X. Liu, C. K ¨astner, and T. Wu. What prompts don’t say: Under- standing and managing underspecification in llm prompts, 2025. URLhttps://arxiv.org/ abs/2505.13360

  33. [33]

    S. Yao, J. Zhao, D. Yu, N. Du, I. Shafran, K. Narasimhan, and Y . Cao. React: Synergizing reasoning and acting in language models. 2023. URLhttps://arxiv.org/abs/2210. 03629

  34. [34]

    S. Yao, N. Shinn, P. Razavi, and K. Narasimhan.τ-bench: A benchmark for tool-agent-user interaction in real-world domains. 2024. URLhttps://arxiv.org/abs/2406.12045

  35. [35]

    K. Zhu, J. Wang, J. Zhou, Z. Wang, H. Chen, Y . Wang, L. Yang, W. Ye, Y . Zhang, N. Z. Gong, and X. Xie. Promptrobust: Towards evaluating the robustness of large language models on adversarial prompts. 2024. URLhttps://arxiv.org/abs/2306.04528. 11 A Proofs A.1 Proof of Theorem 1 Under Assumption 2, the instance-level U-statistics{U m n }M m=1 are independ...

  36. [36]

    g e t _ b i g f i v e _ s c o r e s

    =O p(1)by Theorem 1. ThereforeP H1(TM,n <−z 1−α)→1. A.3 Proof of Theorem 3 By our assumptions, each \MMD 2,m u is an independent realization with common mean E[ \MMD 2,m u ] =δwhereδ= MMD 2(P0,{P j}n j=1)and common varianceσ 2 MMD = V( \MMD 2,m u )∈(0,∞). Under the null hypothesisH tr 0 :P 0 =P 1 =· · ·=P n, we haveδ= 0. The aggregate statistic is: MMD 2 ...

  37. [37]

    ** Query Linear **: Use the Linear MCP tools to re tri ev e issue ‘ django__django -16819 ‘ 22

  38. [38]

    ** Read Issue Details **: Review the issue d e s c r i p t i o n and r e q u i r e m e n t s

  39. [39]

    ** I m p l e m e n t Fix **: Make the n e c e s s a r y code changes to resolve the issue

  40. [40]

    f u n c t i o n _ n a m e

    ** Verify **: Ensure your changes work c o r r e c t l y ## A v a i l a b l e MCP Tools - ‘ get_issue ‘: Get a Linear issue by ID - ‘ list_issues ‘: List all issues , o p t i o n a l l y fi lt ere d by project - ‘ search_issues ‘: Search issues by query string - ‘ g e t _ i s s u e _ c o m m e n t s ‘: Get all com me nts for an issue BFCL decoy-func.Inser...