Quantitative Certification of Agentic Tool Selection
Pith reviewed 2026-05-18 11:05 UTC · model grok-4.3
The pith
LLMCert-T supplies high-confidence upper bounds on the probability that an LLM tool-selection pipeline meets a declared safety specification under realistic tool distributions.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
LLMCert-T models tool-selection evaluation as a Bernoulli estimation problem, drawing inserted-tool sequences from a distribution fixed by the safety specification, and aggregates the per-trial outcomes into a one-sided Clopper-Pearson upper bound on the probability that the specification is satisfied. Across popular BFCL and OpenAPI tool pools the resulting certificates show current agents remain fragile: their certified correctness upper bounds drop to approximately 20 percent under Distractor Selection and Top-N Saturation specifications.
What carries the argument
LLMCert-T, a framework that treats safety evaluation as Bernoulli estimation over sequences generated by a stochastic process conditioning each round on the agent's previous selection and returns Clopper-Pearson upper bounds with explicit statistical guarantees.
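The bounding step can be sketched in a few lines. The following is a minimal illustration of a one-sided Clopper-Pearson upper bound computed by bisection on the binomial CDF, using only the standard library. The function names `binom_cdf` and `clopper_pearson_upper` and the default `alpha` are assumptions made for this sketch, not the paper's implementation.

```python
import math


def binom_cdf(k: int, n: int, p: float) -> float:
    """P(X <= k) for X ~ Binomial(n, p)."""
    return sum(math.comb(n, i) * p**i * (1.0 - p) ** (n - i) for i in range(k + 1))


def clopper_pearson_upper(k: int, n: int, alpha: float = 0.05) -> float:
    """One-sided (1 - alpha) upper confidence bound on the success probability,
    given k successes in n i.i.d. Bernoulli trials (illustrative sketch)."""
    if n <= 0 or not 0 <= k <= n:
        raise ValueError("need n > 0 and 0 <= k <= n")
    if k == n:
        return 1.0  # every trial succeeded: the bound is trivial
    lo, hi = 0.0, 1.0
    # binom_cdf(k, n, p) decreases in p; the bound is the p where it equals alpha.
    for _ in range(100):
        mid = (lo + hi) / 2.0
        if binom_cdf(k, n, mid) > alpha:
            lo = mid
        else:
            hi = mid
    return hi


if __name__ == "__main__":
    # Hypothetical counts: 6 specification-satisfying trials out of 50.
    print(clopper_pearson_upper(6, 50, alpha=0.05))
```

Bisection is used here only to avoid a dependency; the same value can be obtained from a Beta quantile in a statistics library.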
If this is right
- Current agents receive certified upper bounds near 20 percent for distractor-selection and top-N-saturation specifications, well below their clean-pool performance.
- Safety claims for tool-selection pipelines become directly comparable across models, retrievers, mitigations, and registry policies.
- Developers obtain an actionable numeric certificate rather than only empirical accuracy on curated test sets.
Where Pith is reading between the lines
- Registry operators could require or publish such certificates as a condition for listing new tools.
- The same bounding technique might be applied to certify other agent properties such as privacy compliance or multi-step execution correctness.
- Over successive versions of an agent, the evolution of its certified bound could serve as a quantitative progress metric for safety engineering.
Load-bearing premise
The stochastic process that generates inserted-tool sequences round by round, conditioning each round on the agent's selection in the previous round, accurately models the distribution of tools an agent would actually encounter in an open registry.
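To make the premise concrete, here is a toy sketch of a round-by-round insertion process in which each inserted tool depends on the agent's previous selection. Everything in it is hypothetical: the tool names, the `toy_agent` stand-in for an LLM selector, and the insertion rule in `insert_tool` are assumptions chosen for illustration, not the distribution used in the paper.

```python
import random

CLEAN_POOL = ["get_weather", "send_email", "search_docs"]  # hypothetical clean pool
TARGET = "get_weather"  # hypothetical correct tool for the user query


def toy_agent(pool, rng):
    """Placeholder selector: picks uniformly among tools whose name mentions 'weather'."""
    candidates = [tool for tool in pool if "weather" in tool]
    return rng.choice(candidates)


def insert_tool(prev_selection, round_idx, rng):
    """Sample the tool inserted this round, conditioned on the previous selection."""
    if prev_selection is not None and prev_selection != TARGET:
        # The previous distractor was selected: insert a near-copy of it.
        return f"{prev_selection}_v{round_idx}"
    # Otherwise insert a fresh distractor.
    return f"weather_{rng.choice(['pro', 'fast', 'plus'])}_{round_idx}"


def sample_trajectory(num_rounds, rng):
    """Return the agent's selection in each round of one inserted-tool sequence."""
    pool = list(CLEAN_POOL)
    prev = None
    selections = []
    for r in range(num_rounds):
        pool.append(insert_tool(prev, r, rng))
        prev = toy_agent(pool, rng)
        selections.append(prev)
    return selections


if __name__ == "__main__":
    print(sample_trajectory(5, random.Random(0)))
```

Whether a rule of this kind resembles what third parties actually publish to an open registry is exactly the question the premise leaves open.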
What would settle it
Measuring the fraction of trials in which an agent satisfies the safety specification when tools are drawn from a live, growing open registry and finding that this fraction lies consistently above the LLMCert-T upper bound would show the modeled distribution does not match reality.
Original abstract
Large language models (LLMs) are increasingly deployed in agentic systems, where a fundamental task is mapping user intents to relevant external tools. Errors in tool selection can have severe outcomes, such as unauthorized data access, even without modifying the agent's underlying model. Existing evaluations measure performance on curated, benign benchmarks. However, a pipeline's behavior in deployment depends on the tool pool the agent actually encounters, which in open registries is shaped by third parties. We introduce LLMCert-T, the first statistical framework that returns high-confidence upper bounds on the probability that a tool-selection pipeline satisfies a declared safety specification under a realistic tool distribution. LLMCert-T models tool-selection evaluation as a Bernoulli estimation problem, drawing inserted-tool sequences from a distribution that the safety specification fixes. To evaluate robustness against realistic deployment conditions, we instantiate this distribution as a stochastic process that generates inserted-tool sequences round by round, conditioning each round on the agent's selection in the previous round. LLMCert-T aggregates the per-trial outcomes into a one-sided Clopper-Pearson upper bound on the probability that the specification is satisfied. By returning this bound as a certificate with statistical guarantees over the inserted-tool sequence distribution, LLMCert-T makes safety claims intuitive, actionable, and comparable across models, retrievers, mitigations, and registry policies. Across popular BFCL and OpenAPI tool pools, LLMCert-T shows that current LLM agents remain fragile under Distractor Selection and Top-N Saturation specifications: their certified correctness upper bounds drop to approximately 20%, far below their clean-pool lower bounds.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces LLMCert-T, the first statistical framework for certifying tool-selection pipelines in LLM agents. It models evaluation as a Bernoulli estimation problem over inserted-tool sequences drawn from a distribution fixed by the safety specification. The distribution is realized as a stochastic process that generates sequences round by round, conditioning each round on the agent's prior selection. Per-trial binary outcomes (whether the declared safety specification is satisfied) are aggregated via a one-sided Clopper-Pearson upper bound, yielding high-confidence certificates on the probability of satisfaction under the modeled distribution. Experiments on BFCL and OpenAPI tool pools report that current agents achieve certified upper bounds of approximately 20% under Distractor Selection and Top-N Saturation specifications, well below their clean-pool performance.
Significance. If the statistical validity of the bounds is established, the work supplies a concrete, comparable metric for robustness of agentic tool use against third-party tool registries, moving beyond curated benchmarks. The empirical demonstration that certified correctness can fall to ~20% under realistic insertion policies is actionable for model developers and registry operators. The approach re-uses a standard, parameter-free statistical tool (Clopper-Pearson) and produces falsifiable numerical certificates rather than heuristic scores.
major comments (1)
- [Abstract and stochastic process description] Abstract and § on the stochastic process: Clopper-Pearson supplies valid one-sided coverage only under i.i.d. Bernoulli trials. The described process conditions each round's tool insertion on the agent's selection from the previous round, producing a Markovian dependence within each trajectory. If a trial is defined as a single round (or if per-round satisfaction indicators are aggregated to produce the binary outcome), the sequence of outcomes is dependent rather than i.i.d.; the reported coverage guarantees and the ~20% upper bounds therefore lack the claimed statistical validity. The manuscript must either (a) prove that the dependence does not invalidate coverage, (b) redefine trials as fully independent trajectories with a single aggregate outcome per trajectory, or (c) replace Clopper-Pearson with a method valid for dependent observations.
minor comments (2)
- [Abstract] The abstract refers to a 'declared' safety specification that 'fixes' the sampling distribution, yet provides no concrete syntax or encoding for a specification; a short formal example in the main text would clarify how a user encodes 'Distractor Selection' or 'Top-N Saturation'.
- [Method] Notation for the per-trial outcome random variable and the exact mapping from a sampled sequence to the binary success indicator should be introduced once and used consistently.
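One possible way to fix the notation requested above is sketched below. All symbols (the specification \varphi, the trajectory distribution \mathcal{D}_\varphi, the trajectories \tau_i, the indicators X_i, the count K, the level \alpha, and the bound \bar{p}) are assumptions introduced for this sketch and are not taken from the manuscript.

```latex
% Assumed notation, not the paper's: \varphi, \mathcal{D}_\varphi, \tau_i, X_i, K, \alpha, \bar{p}
\begin{align*}
\tau_1,\dots,\tau_n &\overset{\text{i.i.d.}}{\sim} \mathcal{D}_\varphi, \qquad
X_i = \mathbf{1}[\tau_i \models \varphi], \qquad
p = \Pr_{\tau\sim\mathcal{D}_\varphi}[\tau \models \varphi], \\
K &= \sum_{i=1}^{n} X_i \sim \mathrm{Binomial}(n, p), \\
\bar{p}(k) &=
\begin{cases}
1, & k = n, \\
\text{the } q \in (0,1) \text{ solving } \sum_{j=0}^{k} \binom{n}{j} q^{j} (1-q)^{n-j} = \alpha, & k < n,
\end{cases} \\
\Pr\bigl[\,p \le \bar{p}(K)\,\bigr] &\ge 1 - \alpha .
\end{align*}
```

Under this reading, the coverage statement in the last line relies on the X_i being i.i.d., which is the condition the major comment asks the authors to justify.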
Simulated Author's Rebuttal
We thank the referee for their careful reading and for identifying a key point regarding the statistical validity of the Clopper-Pearson bounds in LLMCert-T. We address the major comment below and will revise the manuscript to clarify the definition of trials.
Point-by-point responses
- Referee: Abstract and § on the stochastic process description: Clopper-Pearson supplies valid one-sided coverage only under i.i.d. Bernoulli trials. The described process conditions each round's tool insertion on the agent's selection from the previous round, producing a Markovian dependence within each trajectory. If a trial is defined as a single round (or if per-round satisfaction indicators are aggregated to produce the binary outcome), the sequence of outcomes is dependent rather than i.i.d.; the reported coverage guarantees and the ~20% upper bounds therefore lack the claimed statistical validity. The manuscript must either (a) prove that the dependence does not invalidate coverage, (b) redefine trials as fully independent trajectories with a single aggregate outcome per trajectory, or (c) replace Clopper-Pearson with a method valid for dependent observations.
Authors: We appreciate the referee's precise identification of the dependence structure. Our framework defines each trial as a complete, independently sampled trajectory generated by the stochastic process. The binary outcome per trial is a single aggregate indicator of whether the declared safety specification is satisfied for the entire trajectory. Because trajectories are drawn independently, the trial-level binary outcomes form an i.i.d. Bernoulli sequence to which the one-sided Clopper-Pearson bound applies directly. Any Markovian dependence is internal to a trajectory and does not affect independence across trials. We will revise the abstract and the stochastic-process section to state this trial definition explicitly and to include a short argument confirming that the coverage guarantee holds at the trajectory level. This implements option (b) suggested by the referee.
Revision: yes
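The trial definition described in the response can be illustrated with a small simulation: rounds inside a trajectory are dependent, but each trajectory contributes exactly one Bernoulli outcome and trajectories use separately seeded random streams. The transition probabilities, the "all rounds must be safe" aggregation rule, and the function names are assumptions made for this sketch, not details confirmed by the manuscript.

```python
import random


def run_trajectory(num_rounds, rng):
    """Toy trajectory: per-round safety flags with Markov dependence on the previous round."""
    flags = []
    prev_ok = True
    for _ in range(num_rounds):
        p_ok = 0.9 if prev_ok else 0.4  # assumed toy transition probabilities
        prev_ok = rng.random() < p_ok
        flags.append(prev_ok)
    return flags


def trial_outcome(flags):
    """Single aggregate indicator per trajectory (assumed rule: every round must be safe)."""
    return int(all(flags))


def collect_outcomes(num_trials, num_rounds, seed):
    """One 0/1 outcome per independently seeded trajectory."""
    master = random.Random(seed)
    outcomes = []
    for _ in range(num_trials):
        rng = random.Random(master.getrandbits(64))  # separate stream per trajectory
        outcomes.append(trial_outcome(run_trajectory(num_rounds, rng)))
    return outcomes


if __name__ == "__main__":
    outcomes = collect_outcomes(num_trials=200, num_rounds=5, seed=1)
    print(sum(outcomes), "of", len(outcomes), "trajectories satisfied the toy specification")
```

The success count and the number of trajectories are then the only inputs a Clopper-Pearson bound needs; the per-round flags never enter the bound directly.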
Circularity Check
No circularity in LLMCert-T's application of Clopper-Pearson bounds to sampled tool-selection outcomes
Full rationale
The paper frames tool-selection evaluation as a Bernoulli estimation problem and computes one-sided Clopper-Pearson upper bounds directly from the empirical frequency of specification-satisfying outcomes across independent trials drawn from the declared distribution. The distribution is instantiated as a round-by-round conditional stochastic process, but the bound itself is produced by a fixed external statistical formula applied to the resulting binary trial outcomes. No parameters are fitted to the target probability, no self-citations justify the core bounding step, and the claimed high-confidence upper bound does not reduce to a self-definition or renaming of the input samples. The derivation remains self-contained as a standard Monte Carlo application of an established method.
Axiom & Free-Parameter Ledger
axioms (2)
- domain assumption: Tool-selection outcomes can be treated as independent Bernoulli trials for the purpose of constructing a one-sided upper bound.
- domain assumption: The stochastic process that conditions each round of inserted tools on the agent's previous selection faithfully represents realistic deployment conditions.
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean, washburn_uniqueness_aczel
  Relation between the paper passage and the cited Recognition theorem: unclear
  Passage: "TOOLCERT models tool selection as a Bernoulli success process... aggregates the per-trial outcomes into a one-sided Clopper-Pearson upper bound"
- IndisputableMonolith/Foundation/AbsoluteFloorClosure.lean, absolute_floor_iff_bare_distinguishability
  Relation between the paper passage and the cited Recognition theorem: unclear
  Passage: "multi-round, stochastic process... Markov process... conditional distribution Δadv"
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Forward citations
Cited by 2 Pith papers
- Five Attacks on x402 Agentic Payment Protocol
  Five practical attacks on the x402 agentic payment protocol are demonstrated across authorization, binding, replay protection, and web handling, validated on local chains, Base Sepolia, live endpoints, and three open-...
- BEAVER: An Efficient Deterministic LLM Verifier
  BEAVER is the first practical deterministic verifier that maintains sound probability bounds on LLM safety properties using token tries and frontier data structures, finding 2-3x more violations than sampling at 1/10 ...