arxiv: 2602.16666 · v3 · pith:EDF5OQSHnew · submitted 2026-02-18 · 💻 cs.AI · cs.CY · cs.LG

Towards a Science of AI Agent Reliability

classification 💻 cs.AI cs.CY cs.LG
keywords agents, agent, fail, reliability, across, benchmarks, evaluations, metrics

AI agents are increasingly deployed to execute important tasks. While rising accuracy scores on standard benchmarks suggest rapid progress, many agents still continue to fail in practice. This discrepancy highlights a fundamental limitation of current evaluations: compressing agent behavior into a single success metric obscures critical operational flaws. Notably, it ignores whether agents behave consistently across runs, withstand perturbations, fail predictably, or have bounded error severity. Grounded in safety-critical engineering, we provide a holistic performance profile by proposing twelve concrete metrics that decompose agent reliability along four key dimensions: consistency, robustness, predictability, and safety. Evaluating 15 models across two complementary benchmarks, we find that recent capability gains have only yielded small improvements in reliability. By exposing these persistent limitations, our metrics complement traditional evaluations while offering tools for reasoning about how agents perform, degrade, and fail.

Forward citations

Cited by 16 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Agent Meltdowns: The Road to Hell Is Paved with Helpful Agents

    cs.CL 2026-05 unverdicted novelty 7.0

    The paper defines accidental meltdowns as unsafe agent behavior triggered by benign errors and reports that such meltdowns occur in 64.7% of evaluated rollouts across GPT, Grok, and Gemini agents.

  2. Life After Benchmark Saturation: A Case Study of CORE-Bench

    cs.AI 2026-06 unverdicted novelty 6.0

    Using CORE-Bench as a case study, the paper shows that saturated benchmarks can still deliver insights on efficiency, reliability, model-scaffold differences, and human collaboration even after accuracy plateaus, and ...

  3. On the Reliability of Networks of AI Agents: Density Evolution, Stopping Sets, and Architecture Optimization

    cs.MA 2026-06 unverdicted novelty 6.0

    Extends density evolution to role-typed factor graphs with nonlinear Boolean verifiers to predict asymptotic unresolved subclaims in AI agent networks under three erasure failure modes.

  4. How Much Coordination Gain Is Real? A Paired Noise-Floor Protocol for Multi-Agent LLM Benchmarks

    cs.MA 2026-06 unverdicted novelty 6.0

    Paired configuration-equivalent trials on Claude Haiku 4.5 yield a noise floor of roughly [-3, +18]pp with no significant coordination contrast after correction, placing most recent multi-agent papers inside or below ...

  5. PocketAgents: A Manifest-Driven Library of Autonomous Defense Agents

    cs.CR 2026-05 unverdicted novelty 6.0

    PocketAgents introduces a manifest-driven library for LLM-based autonomous defense agents, evaluated in 18 closed-loop trials against a DarkSide-inspired attack where 13 trials produced validated blocking actions.

  6. Open-World Evaluations for Measuring Frontier AI Capabilities

    cs.AI 2026-05 conditional novelty 6.0

    Open-world evaluations using qualitative review of real-world tasks can give earlier warnings of frontier AI capabilities than automated benchmarks, as demonstrated by an AI agent publishing a simple iOS app with one ...

  7. Property-Level Reconstructability of Agent Decisions: An Anchor-Level Pilot Across Vendor SDK Adapter Regimes

    cs.SE 2026-05 unverdicted novelty 6.0

    Pilot study shows agent decision reconstructability varies by vendor SDK regime, with completeness scores from 42.9% to 85.7% and consistent gaps in reasoning traces.

  8. Hallucinations Undermine Trust; Metacognition is a Way Forward

    cs.CL 2026-05 unverdicted novelty 6.0

    LLMs need metacognition to align expressed uncertainty with their actual knowledge boundaries, moving beyond knowledge expansion to reduce confident errors.

  9. MarketBench: Evaluating AI Agents as Market Participants

    cs.AI 2026-04 unverdicted novelty 6.0

    LLMs show poor calibration in predicting task success and token use on software engineering benchmarks, causing market auctions to underperform compared to perfect information scenarios, with limited improvement from ...

  10. RemoteShield: Enable Robust Multimodal Large Language Models for Earth Observation

    cs.CV 2026-04 unverdicted novelty 6.0

    RemoteShield improves robustness of Earth observation MLLMs by training on semantic equivalence clusters of clean and perturbed inputs via preference learning to maintain consistent reasoning under noise.

  11. Monitoring Agentic Systems Before They're Reliable

    cs.SE 2026-06 unverdicted novelty 5.0

    Introduces a three-dimension three-scope variance-based monitoring and FMEA-adapted triage methodology for structural defects in partially integrated agentic systems, validated on synthetic data showing scope-specific...

  12. Nautilus: From One Prompt to Plug-and-Play Robot Learning

    cs.RO 2026-05 unverdicted novelty 5.0

    NAUTILUS is a prompt-driven harness that automates plug-and-play adapters, typed contracts, and validation for policies, benchmarks, and robots in learning research.

  13. Consistency as a Testable Property: Statistical Methods to Evaluate AI Agent Reliability

    cs.AI 2026-05 unverdicted novelty 5.0

    A framework with U-statistics and kernel-based metrics quantifies AI agent consistency and robustness, showing trajectory metrics outperform pass@1 rates in diagnosing failures.

  14. ClayBuddy: A Framework, Evaluation, & Mitigation of Coding Agent Failures

    cs.SE 2026-06 unverdicted novelty 4.0

    ClayBuddy adds agent-editable context, extended prompts, a command classifier, and deterministic guardrails to coding agent harnesses and shows statistically significant safety gains across 8 evaluations.

  15. The Agentic Web Requires New Normative Infrastructure

    cs.CY 2026-06 unverdicted novelty 3.0

    The agentic web requires new normative infrastructure of laws, norms, and practices to allow user-delegated AI agents to access online properties without being blocked as malicious bots.

  16. Security, Privacy, and Ethical Risks in OpenClaw

    cs.CR 2026-05 unverdicted novelty 3.0

    The paper analyzes security, privacy, and ethical risks in the OpenClaw AI agent system arising from its architecture, storage, tool use, and integrations, arguing these form major barriers to trustworthy adoption.