Quantitative Certification of Agentic Tool Selection

Gagandeep Singh; Isha Chaudhary; Jehyeok Yeon

arxiv: 2510.03992 · v2 · pith:3QL3UTVInew · submitted 2025-10-05 · 💻 cs.CR · cs.AI

Pith reviewed 2026-05-18 11:05 UTC · model grok-4.3

keywords: tool selection, LLM agents, safety certification, statistical bounds, Bernoulli estimation, Clopper-Pearson, distractor selection

The pith

LLMCert-T supplies high-confidence upper bounds on the probability that an LLM tool-selection pipeline meets a declared safety specification under realistic tool distributions.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces LLMCert-T to quantify how reliably an agent's tool choices satisfy safety rules when the pool of available tools matches what would appear in an open, third-party registry. It frames the task as repeated Bernoulli trials in which sequences of inserted tools are generated by a stochastic process that conditions each round on the agent's prior selection. A sympathetic reader would care because existing benchmarks use fixed, benign tool sets that do not capture the distractors and saturation an agent actually faces, so measured accuracy can overstate real-world safety. LLMCert-T converts the trial outcomes into a one-sided Clopper-Pearson upper bound that serves as a statistically guaranteed certificate.

Core claim

LLMCert-T models tool-selection evaluation as a Bernoulli estimation problem, drawing inserted-tool sequences from a distribution fixed by the safety specification, and aggregates the per-trial outcomes into a one-sided Clopper-Pearson upper bound on the probability that the specification is satisfied. Across popular BFCL and OpenAPI tool pools the resulting certificates show current agents remain fragile: their certified correctness upper bounds drop to approximately 20 percent under Distractor Selection and Top-N Saturation specifications.
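As a minimal sketch of the aggregation step, the one-sided Clopper-Pearson upper bound can be computed from a count of specification-satisfying trials by inverting the binomial CDF. The function names, the default `alpha`, and the example counts below are illustrative assumptions, not values or code from the paper.

```python
from math import comb


def binom_cdf(k: int, n: int, p: float) -> float:
    """P(X <= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1.0 - p) ** (n - i) for i in range(k + 1))


def clopper_pearson_upper(successes: int, trials: int, alpha: float = 0.05) -> float:
    """One-sided (1 - alpha) Clopper-Pearson upper bound on a Bernoulli success probability.

    Assumes the trials are i.i.d. Bernoulli; `alpha` is a hypothetical default.
    """
    if trials <= 0 or not 0 <= successes <= trials:
        raise ValueError("need trials > 0 and 0 <= successes <= trials")
    if successes == trials:
        return 1.0  # every trial succeeded, so the bound is trivial
    # The bound is the p at which P(X <= successes) falls to alpha.
    # binom_cdf is decreasing in p, so plain bisection finds that point.
    lo, hi = 0.0, 1.0
    for _ in range(100):
        mid = (lo + hi) / 2.0
        if binom_cdf(successes, trials, mid) > alpha:
            lo = mid
        else:
            hi = mid
    return hi  # return the conservative end of the final bracket


# Illustrative counts only (not results from the paper).
example_bound = clopper_pearson_upper(successes=15, trials=100, alpha=0.05)
print(f"certified upper bound: {example_bound:.3f}")
```

Bisection on the exact binomial CDF keeps the sketch dependency-free; a Beta quantile function from a statistics library would give the same value.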

What carries the argument

LLMCert-T, a framework that treats safety evaluation as Bernoulli estimation over sequences generated by a stochastic process conditioning each round on the agent's previous selection and returns Clopper-Pearson upper bounds with explicit statistical guarantees.

If this is right

Current agents receive certified upper bounds near 20 percent for distractor-selection and top-N-saturation specifications, well below their clean-pool performance.
Safety claims for tool-selection pipelines become directly comparable across models, retrievers, mitigations, and registry policies.
Developers obtain an actionable numeric certificate rather than only empirical accuracy on curated test sets.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Registry operators could require or publish such certificates as a condition for listing new tools.
The same bounding technique might be applied to certify other agent properties such as privacy compliance or multi-step execution correctness.
Over successive versions of an agent, the evolution of its certified bound could serve as a quantitative progress metric for safety engineering.

Load-bearing premise

The stochastic process that generates inserted-tool sequences round by round, conditioning each round on the agent's selection in the previous round, accurately models the distribution of tools an agent would actually encounter in an open registry.
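To make the premise concrete, here is a toy sketch of a round-by-round insertion process in which each inserted tool depends on the agent's previous selection. Everything in it is hypothetical: the `select` and `make_distractor` callables, the tool names, and the insertion rule are stand-ins, and the paper's actual process may differ.

```python
import random
from typing import Callable, List

Tool = str


def sample_inserted_sequence(
    base_pool: List[Tool],
    select: Callable[[List[Tool]], Tool],
    make_distractor: Callable[[Tool, random.Random], Tool],
    rounds: int,
    rng: random.Random,
) -> List[Tool]:
    """Generate inserted tools one round at a time, each conditioned on the previous pick."""
    pool = list(base_pool)
    inserted: List[Tool] = []
    previous = select(pool)  # selection on the clean pool before any insertion
    for _ in range(rounds):
        new_tool = make_distractor(previous, rng)  # depends on the last selection
        inserted.append(new_tool)
        pool.append(new_tool)
        previous = select(pool)  # the agent re-selects from the enlarged pool
    return inserted


def toy_select(pool: List[Tool]) -> Tool:
    """Hypothetical agent: picks the tool with the longest name."""
    return max(pool, key=len)


def toy_distractor(previous: Tool, rng: random.Random) -> Tool:
    """Hypothetical insertion rule: a near-copy of the previously selected tool."""
    return f"{previous}_v{rng.randint(0, 9)}"


demo_sequence = sample_inserted_sequence(
    ["get_weather", "send_email"], toy_select, toy_distractor, rounds=3, rng=random.Random(0)
)
print(demo_sequence)
```

Whether a rule of this kind resembles what an open registry actually produces is exactly the modeling question the premise raises.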

What would settle it

Measuring the fraction of trials in which an agent satisfies the safety specification when tools are drawn from a live, growing open registry and finding that this fraction lies consistently above the LLMCert-T upper bound would show the modeled distribution does not match reality.

Original abstract

Large language models (LLMs) are increasingly deployed in agentic systems, where a fundamental task is mapping user intents to relevant external tools. Errors in tool selection can have severe outcomes, such as unauthorized data access, even without modifying the agent's underlying model. Existing evaluations measure performance on curated, benign benchmarks. However, a pipeline's behavior in deployment depends on the tool pool the agent actually encounters, which in open registries is shaped by third parties. We introduce LLMCert-T, the first statistical framework that returns high-confidence upper bounds on the probability that a tool-selection pipeline satisfies a declared safety specification under a realistic tool distribution. LLMCert-T models tool-selection evaluation as a Bernoulli estimation problem, drawing inserted-tool sequences from a distribution that the safety specification fixes. To evaluate robustness against realistic deployment conditions, we instantiate this distribution as a stochastic process that generates inserted-tool sequences round by round, conditioning each round on the agent's selection in the previous round. LLMCert-T aggregates the per-trial outcomes into a one-sided Clopper-Pearson upper bound on the probability that the specification is satisfied. By returning this bound as a certificate with statistical guarantees over the inserted-tool sequence distribution, LLMCert-T makes safety claims intuitive, actionable, and comparable across models, retrievers, mitigations, and registry policies. Across popular BFCL and OpenAPI tool pools, LLMCert-T shows that current LLM agents remain fragile under Distractor Selection and Top-N Saturation specifications: their certified correctness upper bounds drop to approximately 20%, far below their clean-pool lower bounds.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 2 minor

Summary. The manuscript introduces LLMCert-T, the first statistical framework for certifying tool-selection pipelines in LLM agents. It models evaluation as a Bernoulli estimation problem over inserted-tool sequences drawn from a distribution fixed by the safety specification. The distribution is realized as a stochastic process that generates sequences round by round, conditioning each round on the agent's prior selection. Per-trial binary outcomes (whether the declared safety specification is satisfied) are aggregated via a one-sided Clopper-Pearson upper bound, yielding high-confidence certificates on the probability of satisfaction under the modeled distribution. Experiments on BFCL and OpenAPI tool pools report that current agents achieve certified upper bounds of approximately 20% under Distractor Selection and Top-N Saturation specifications, well below their clean-pool performance.

Significance. If the statistical validity of the bounds is established, the work supplies a concrete, comparable metric for robustness of agentic tool use against third-party tool registries, moving beyond curated benchmarks. The empirical demonstration that certified correctness can fall to ~20% under realistic insertion policies is actionable for model developers and registry operators. The approach re-uses a standard, parameter-free statistical tool (Clopper-Pearson) and produces falsifiable numerical certificates rather than heuristic scores.

major comments (1)

[Abstract and stochastic process description] Clopper-Pearson supplies valid one-sided coverage only under i.i.d. Bernoulli trials. The described process conditions each round's tool insertion on the agent's selection from the previous round, producing a Markovian dependence within each trajectory. If a trial is defined as a single round (or if per-round satisfaction indicators are aggregated to produce the binary outcome), the sequence of outcomes is dependent rather than i.i.d.; the reported coverage guarantees and the ~20% upper bounds therefore lack the claimed statistical validity. The manuscript must either (a) prove that the dependence does not invalidate coverage, (b) redefine trials as fully independent trajectories with a single aggregate outcome per trajectory, or (c) replace Clopper-Pearson with a method valid for dependent observations.

minor comments (2)

[Abstract] The abstract states that the safety specification is 'declared' and that it fixes the inserted-tool distribution, yet provides no concrete syntax or encoding for a specification; a short formal example in the main text would clarify how a user encodes 'Distractor Selection' or 'Top-N Saturation'.
[Method] Notation for the per-trial outcome random variable and the exact mapping from a sampled sequence to the binary success indicator should be introduced once and used consistently.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for their careful reading and for identifying a key point regarding the statistical validity of the Clopper-Pearson bounds in LLMCert-T. We address the major comment below and will revise the manuscript to clarify the definition of trials.

Point-by-point responses

Referee: Abstract and § on the stochastic process description: Clopper-Pearson supplies valid one-sided coverage only under i.i.d. Bernoulli trials. The described process conditions each round's tool insertion on the agent's selection from the previous round, producing a Markovian dependence within each trajectory. If a trial is defined as a single round (or if per-round satisfaction indicators are aggregated to produce the binary outcome), the sequence of outcomes is dependent rather than i.i.d.; the reported coverage guarantees and the ~20% upper bounds therefore lack the claimed statistical validity. The manuscript must either (a) prove that the dependence does not invalidate coverage, (b) redefine trials as fully independent trajectories with a single aggregate outcome per trajectory, or (c) replace Clopper-Pearson with a method valid for dependent observations.

Authors: We appreciate the referee's precise identification of the dependence structure. Our framework defines each trial as a complete, independently sampled trajectory generated by the stochastic process. The binary outcome per trial is a single aggregate indicator of whether the declared safety specification is satisfied for the entire trajectory. Because trajectories are drawn independently, the trial-level binary outcomes form an i.i.d. Bernoulli sequence to which the one-sided Clopper-Pearson bound applies directly. Any Markovian dependence is internal to a trajectory and does not affect independence across trials. We will revise the abstract and the stochastic-process section to state this trial definition explicitly and to include a short argument confirming that the coverage guarantee holds at the trajectory level. This implements option (b) suggested by the referee.
Revision: yes
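The trial definition in this response can be sketched as follows: each trial runs one whole trajectory with its own random stream and reports a single bit, so dependence between rounds stays inside the trial. The toy trajectory, its probabilities, and the choice of `all(...)` as the aggregate indicator are assumptions made for illustration, not the paper's specification.

```python
import random
from typing import Callable, List


def run_trials(
    run_trajectory: Callable[[random.Random], List[bool]],
    n_trials: int,
    seed: int = 0,
) -> List[int]:
    """Return one binary outcome per independently sampled trajectory."""
    master = random.Random(seed)
    outcomes: List[int] = []
    for _ in range(n_trials):
        # A fresh generator per trial: trajectories share no state with each other.
        trial_rng = random.Random(master.getrandbits(64))
        per_round_ok = run_trajectory(trial_rng)
        # Hypothetical aggregate: the trajectory passes only if every round passes.
        outcomes.append(int(all(per_round_ok)))
    return outcomes


def toy_trajectory(rng: random.Random, rounds: int = 3, p_ok: float = 0.6) -> List[bool]:
    """Toy multi-round run with Markovian dependence between consecutive rounds."""
    ok = True
    flags: List[bool] = []
    for _ in range(rounds):
        # A failure in the previous round makes the next round more likely to fail.
        ok = rng.random() < (p_ok if ok else p_ok / 2)
        flags.append(ok)
    return flags


outcomes = run_trials(toy_trajectory, n_trials=50, seed=1)
print(sum(outcomes), "of", len(outcomes), "trajectories satisfied the toy specification")
```

Only the trajectory-level bits in `outcomes` would be passed to the Clopper-Pearson computation; the per-round flags never enter the bound directly.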

Circularity Check

0 steps flagged

No circularity in LLMCert-T's application of Clopper-Pearson bounds to sampled tool-selection outcomes

Full rationale

The paper frames tool-selection evaluation as a Bernoulli estimation problem and computes one-sided Clopper-Pearson upper bounds directly from the empirical frequency of specification-satisfying outcomes across independent trials drawn from the declared distribution. The distribution is instantiated as a round-by-round conditional stochastic process, but the bound itself is produced by a fixed external statistical formula applied to the resulting binary trial outcomes. No parameters are fitted to the target probability, no self-citations justify the core bounding step, and the claimed high-confidence upper bound does not reduce to a self-definition or renaming of the input samples. The derivation remains self-contained as a standard Monte Carlo application of an established method.
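For reference, the fixed formula in question is the standard one-sided Clopper-Pearson bound. The symbols n, k, \alpha, p, and \hat{p}_U are notation introduced here for illustration and are not taken from the paper.

```latex
% n independent trials, k of which satisfy the specification; confidence level 1 - \alpha.
\[
  \hat{p}_U(k, n, \alpha) =
  \begin{cases}
    1, & k = n, \\[4pt]
    \text{the unique } p \in (0, 1) \text{ with }
    \displaystyle\sum_{i=0}^{k} \binom{n}{i} p^{i} (1 - p)^{n - i} = \alpha, & k < n.
  \end{cases}
\]
% Equivalently, for k < n, \hat{p}_U is the (1 - \alpha) quantile of a Beta(k + 1, n - k) distribution.
\[
  \Pr\bigl[\, p \le \hat{p}_U(K, n, \alpha) \,\bigr] \ge 1 - \alpha
  \quad \text{for every true success probability } p,
  \text{ where } K \sim \mathrm{Binomial}(n, p).
\]
```

The coverage statement depends on the n outcomes being i.i.d. Bernoulli(p), which is why the definition of a trial matters for the certificate.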

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The framework rests on standard statistical assumptions for Bernoulli estimation and Clopper-Pearson bounds plus one domain modeling choice for the tool distribution; no free parameters or new invented entities are introduced.

axioms (2)

domain assumption Tool-selection outcomes can be treated as independent Bernoulli trials for the purpose of constructing a one-sided upper bound.
The paper explicitly models evaluation as a Bernoulli estimation problem.
domain assumption The stochastic process that conditions each round of inserted tools on the agent's previous selection faithfully represents realistic deployment conditions.
The paper instantiates the distribution this way to evaluate robustness against realistic conditions.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
Paper passage: "TOOLCERT models tool selection as a Bernoulli success process... aggregates the per-trial outcomes into a one-sided Clopper-Pearson upper bound"

IndisputableMonolith/Foundation/AbsoluteFloorClosure.lean · absolute_floor_iff_bare_distinguishability · unclear
Paper passage: "multi-round, stochastic process... Markov process... conditional distribution Δadv"

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 2 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Five Attacks on x402 Agentic Payment Protocol
cs.CR 2026-05 conditional novelty 7.0

Five practical attacks on the x402 agentic payment protocol are demonstrated across authorization, binding, replay protection, and web handling, validated on local chains, Base Sepolia, live endpoints, and three open-...

BEAVER: An Efficient Deterministic LLM Verifier
cs.AI 2025-12 unverdicted novelty 7.0

BEAVER is the first practical deterministic verifier that maintains sound probability bounds on LLM safety properties using token tries and frontier data structures, finding 2-3x more violations than sampling at 1/10 ...