
arxiv: 2604.22985 · v1 · submitted 2026-04-24 · 💻 cs.CL

Uncertainty Quantification for LLM Function-Calling

Pith reviewed 2026-05-08 11:43 UTC · model grok-4.3

classification 💻 cs.CL
keywords uncertainty quantification · function calling · LLM · semantic entropy · abstract syntax tree · tool use · confidence estimation

The pith

In the function-calling setting for LLMs, multi-sample uncertainty quantification methods offer no clear advantage over simple single-sample approaches, but both can be enhanced by exploiting the structured nature of function call outputs.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper provides the first evaluation of uncertainty quantification methods specifically for LLM function-calling. It shows that methods relying on multiple samples, which work well for natural language questions, do not outperform single-sample methods here. The authors demonstrate that the structured format of function calls allows targeted improvements, such as grouping outputs by their abstract syntax tree for multi-sample techniques and limiting logit calculations to meaningful tokens for single-sample ones. If these findings hold, deploying LLMs for tasks like financial transactions or data management becomes safer by better estimating when a function call is likely wrong before executing it.

Core claim

The central discovery is that in LLM function-calling, multi-sample UQ methods like Semantic Entropy do not provide a clear performance edge over single-sample UQ methods, in contrast to their strength in natural language Q&A. However, the particularities of FC outputs can be leveraged: multi-sample methods improve with clustering based on abstract syntax tree parsing, and single-sample methods benefit from selecting only semantically meaningful tokens for logit-based uncertainty scores.

What carries the argument

Abstract syntax tree (AST) parsing for clustering function call outputs and selection of semantically meaningful tokens for uncertainty score calculation.
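
A minimal sketch of how AST-based clustering could feed a multi-sample score, assuming the sampled calls are written in Python call syntax such as `divide(numerator=19, denominator=53)`. The names `canonical_call`, `cluster_key`, and `ast_clustered_entropy` are hypothetical, and the frequency-based entropy over clusters is an assumption about the estimator rather than the paper's exact implementation.

```python
import ast
import math
from collections import Counter


def canonical_call(text: str) -> str:
    """Return an order-insensitive key for one call (needs Python 3.9+ for ast.unparse)."""
    node = ast.parse(text.strip(), mode="eval").body
    if not isinstance(node, ast.Call):
        raise ValueError(f"not a function call: {text!r}")
    name = ast.unparse(node.func)
    positional = [ast.dump(arg) for arg in node.args]
    # Sorting keyword arguments makes f(a=1, b=2) and f(b=2, a=1) identical.
    keywords = sorted((kw.arg or "**", ast.dump(kw.value)) for kw in node.keywords)
    return repr((name, positional, keywords))


def cluster_key(text: str) -> str:
    """Unparseable samples fall back to the raw string (an illustrative choice)."""
    try:
        return canonical_call(text)
    except (SyntaxError, ValueError):
        return "RAW:" + text.strip()


def ast_clustered_entropy(samples: list[str]) -> float:
    """Entropy of the empirical distribution over AST-equivalence clusters."""
    if not samples:
        raise ValueError("need at least one sampled call")
    counts = Counter(cluster_key(s) for s in samples)
    n = len(samples)
    return -sum((c / n) * math.log(c / n) for c in counts.values())
```

Under this sketch, two samples that differ only in keyword order or whitespace land in the same cluster, while a call with swapped argument values forms a cluster of its own and raises the entropy.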

If this is right

  • Improved UQ methods can reduce the risk of executing incorrect function calls in real-world applications with irreversible effects.
  • Clustering FC outputs by AST structure enhances the effectiveness of multi-sample uncertainty estimates.
  • Filtering to semantically meaningful tokens refines single-sample logit-based uncertainty measures.
  • These adaptations make existing UQ techniques more suitable for the structured outputs typical in tool-use scenarios.
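
The token-filtering idea in the third bullet can be sketched as follows. This is an illustration under stated assumptions, not the paper's algorithm: the rule in `is_meaningful` (a token counts if it contains any character other than brackets, quotes, separators, or whitespace) is hypothetical, and the three aggregations (sum, mean, max of negative log-probabilities) are common logit-based scores chosen for the example.

```python
# Hypothetical set of purely structural characters in a function call.
STRUCTURAL = set("()[]{},:=\"' \n\t")


def is_meaningful(token: str) -> bool:
    """Assumed rule: keep a token if it has at least one non-structural character."""
    return any(ch not in STRUCTURAL for ch in token)


def nll_scores(tokens: list[str], logprobs: list[float], only_meaningful: bool = True) -> dict:
    """Aggregate negative log-probabilities, optionally over meaningful tokens only."""
    if not tokens or len(tokens) != len(logprobs):
        raise ValueError("tokens and logprobs must be non-empty and equally long")
    kept = [
        -lp
        for tok, lp in zip(tokens, logprobs)
        if not only_meaningful or is_meaningful(tok)
    ]
    if not kept:
        # No token passed the filter: fall back to every token.
        kept = [-lp for lp in logprobs]
    return {"sum": sum(kept), "mean": sum(kept) / len(kept), "max": max(kept)}
```

With the filter on, near-certain punctuation tokens no longer dilute the mean, so a single doubtful argument value has more influence on the score.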

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Similar structural adaptations might improve UQ in other LLM output domains that have parseable formats, such as code generation.
  • Testing these UQ methods on benchmarks involving actual tool executions could reveal their practical impact on error rates.
  • Developers of LLM agents may prioritize simpler single-sample UQ with these tweaks for efficiency without sacrificing reliability.

Load-bearing premise

That the observed performance patterns and improvements are due to inherent properties of function-calling outputs rather than specific choices of models, datasets, or metrics in the experiments.

What would settle it

A result where multi-sample UQ clearly outperforms single-sample UQ on a new function-calling benchmark, or where AST clustering fails to improve multi-sample performance, would challenge the main findings.
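
A comparison of this kind rests on a discrimination metric with an error estimate. The sketch below computes AUROC for "uncertainty flags incorrect calls" and a bootstrap standard error using only the standard library. The function names are hypothetical; the default of 1000 resamples mirrors the bootstrap size quoted in the Figure 4 caption, and everything else is an assumption made for illustration.

```python
import random
import statistics


def auroc(uncertainty: list[float], is_incorrect: list[bool]) -> float:
    """Probability that a random incorrect call is scored as more uncertain
    than a random correct one (ties count one half)."""
    pos = [u for u, bad in zip(uncertainty, is_incorrect) if bad]
    neg = [u for u, bad in zip(uncertainty, is_incorrect) if not bad]
    if not pos or not neg:
        raise ValueError("need at least one correct and one incorrect call")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def bootstrap_se(uncertainty, is_incorrect, n_boot: int = 1000, seed: int = 0) -> float:
    """Standard deviation of AUROC over resampled datasets."""
    auroc(uncertainty, is_incorrect)  # validates that both classes are present
    rng = random.Random(seed)
    indices = range(len(uncertainty))
    estimates = []
    while len(estimates) < n_boot:
        sample = rng.choices(indices, k=len(uncertainty))
        labels = [is_incorrect[i] for i in sample]
        if any(labels) and not all(labels):  # skip one-class resamples
            estimates.append(auroc([uncertainty[i] for i in sample], labels))
    return statistics.stdev(estimates)
```

Two methods whose AUROC values differ by less than a couple of standard errors would not count as a "clear" win for either under this reading.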

Figures

Figures reproduced from arXiv: 2604.22985 by Adam Golinski, Arno Blaas, Luca Zappella, Lukas Aichberger, Michael Kirchhof, Sinead Williamson, Yarin Gal, Zihuiwen Ye.

Figure 1: SmoothECE for the All Combined task, aggregated over all models. Beyond being able to distinguish incorrect from correct outputs, which we have measured with AUROC in the previous sections, a desirable quality of a UQ score is to be calibrated, i.e., to correctly predict the probability of a sample being correct, which is important in decision-making settings (Kiyani et al., 2025). Among several calibration me…
Figure 2: Top token probabilities for an incorrect …
Figure 3: Qwen2.5-7B-Instruct response top token probabilities (lowest 5 bolded) for a question wrongly …
Figure 4: Individual tasks: AUROC ± standard error (bootstrap estimate with n = 1000) for the individual tasks described in Sec. 3.1. Boxplots summarize the variation across LLMs. [Plot: y-axis AUROC; x-axis UQ methods MAX-SMT, MAX, G-NLL-SMT, G-NLL, AVG-SMT, AVG, SE-AST, SE-EXM, DSE-AST, DSE-EXM, PE, PTRUE, LEN; legend models google_gemma-3-12b-it, google_gemma-3-4b-it, google_gemma-2-9b-it, Qwen_Qwen3-4B-Instruct-2507, Qwen_Qwen2.5-0.5B-Instruct, Qwen_Qwen2…]
Figure 5: Combinations of tasks: AUROC ± std. error for the combinations of tasks described in Sec. 3.2. Boxplots summarize the variation across LLMs for a given UQ method.
Figure 6: Unanswerable requests: AUROC ± standard error for the combinations of answerable and unanswerable tasks described in Sec. 3.3. Boxplots summarize the variation across LLMs for a given UQ method. [Adjacent plot: x-axis Coverage, y-axis Accuracy; methods SE-EXM, SE-AST, LEN, PTRUE, G-NLL, G-NLL-SMT; panel (a) Simple.]
Figure 7: Individual tasks: Risk-coverage curves for the individual tasks described in Sec. 3.1, averaged across …
Figure 8: Combination of tasks: Risk-coverage curves for the combinations of tasks described in Sec. 3.2, …
Figure 9: Unanswerable requests: Risk-coverage curves for the combinations of answerable and unanswerable …
Figure 10: SmoothECE for all tasks aggregated over models.
Figure 11: The algorithmic implementation of the classification of semantically meaningful tokens as defined …
Figure 12: The algorithmic implementation of the classification of semantically meaningful tokens as defined …
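
The risk-coverage curves named in the captions above can be approximated with a few lines. This is a sketch under assumptions: the function name is hypothetical, the curve reports accuracy (one minus risk) at each coverage level, and the rounding rule for how many calls to keep is an illustrative choice.

```python
def accuracy_coverage_curve(uncertainty, is_correct, coverages=(1.0, 0.75, 0.5, 0.25)):
    """For each coverage level, keep the most confident fraction of calls
    and report the accuracy among the calls that were kept."""
    if not uncertainty or len(uncertainty) != len(is_correct):
        raise ValueError("inputs must be non-empty and equally long")
    # Most confident (lowest uncertainty) calls come first.
    order = sorted(range(len(uncertainty)), key=lambda i: uncertainty[i])
    curve = []
    for coverage in coverages:
        k = max(1, int(coverage * len(order) + 0.5))  # round half up, keep at least one
        kept = order[:k]
        accuracy = sum(bool(is_correct[i]) for i in kept) / k
        curve.append((coverage, accuracy))
    return curve
```

A useful UQ score yields accuracy that rises as coverage falls, because the calls abstained on first are the ones most likely to be wrong.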

Large Language Models (LLMs) are increasingly deployed to autonomously solve real-world tasks. A key ingredient for this is the LLM Function-Calling paradigm, a widely used approach for equipping LLMs with tool-use capabilities. However, an LLM calling functions incorrectly can have severe implications, especially when their effects are irreversible, e.g., transferring money or deleting data. Hence, it is of paramount importance to consider the LLM's confidence that a function call solves the task correctly prior to executing it. Uncertainty Quantification (UQ) methods can be used to quantify this confidence and prevent potentially incorrect function calls. In this work, we present what is, to our knowledge, the first evaluation of UQ methods for LLM Function-Calling (FC). While multi-sample UQ methods, such as Semantic Entropy, show strong performance for natural language Q&A tasks, we find that in the FC setting, they offer no clear advantage over simple single-sample UQ methods. Additionally, we find that the particularities of FC outputs can be leveraged to improve the performance of existing UQ methods in this setting. Specifically, multi-sample UQ methods benefit from clustering FC outputs based on their abstract syntax tree parsing, while single-sample UQ methods can be improved by selecting only semantically meaningful tokens when calculating logit-based uncertainty scores.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 3 minor

Summary. The paper presents the first evaluation of uncertainty quantification (UQ) methods for LLM function-calling (FC). It claims that multi-sample UQ methods such as Semantic Entropy show no clear advantage over simple single-sample methods in the FC setting (in contrast to natural language Q&A tasks), and that FC output particularities can be leveraged for improvements: AST-based clustering for multi-sample UQ and selection of semantically meaningful tokens for logit-based single-sample UQ.

Significance. If the empirical findings prove robust, this work is significant for improving the safety and reliability of LLM-based agents that perform tool use with potentially irreversible effects. It identifies domain-specific challenges in applying existing UQ techniques to structured, executable outputs and offers targeted, practical adaptations that could be adopted in real deployments.

major comments (2)
  1. [§4] §4 (Experimental Setup and Results): The central claim that multi-sample UQ offers no advantage over single-sample methods due to intrinsic properties of FC outputs (rather than model/benchmark choices) is load-bearing. The manuscript must demonstrate this via systematic variation across multiple LLMs, diverse FC benchmarks, and reported effect sizes with statistical tests; without such controls the 'intrinsic' interpretation does not follow from the presented comparisons.
  2. [§5] §5 (Proposed Adaptations): The reported gains from AST clustering (multi-sample) and semantic token selection (single-sample) require ablation controls showing these outperform simpler baselines (e.g., random clustering or token selection) to establish that the improvements stem from FC structural properties.
minor comments (3)
  1. [Abstract] Abstract: The high-level findings are stated without any quantitative metrics, datasets, or effect sizes, which reduces the abstract's informativeness.
  2. [§2] Notation and Definitions: Ensure consistent definitions for all UQ methods (e.g., how Semantic Entropy is computed on FC outputs) with clear references to prior work.
  3. [Tables] Tables/Figures: Include error bars or significance tests in performance comparison tables to allow readers to assess whether differences are meaningful.
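
The kind of significance reporting requested here could look like the sketch below, which compares two UQ methods on per-model AUROC values. It is an assumption-laden illustration: the function names are hypothetical, the effect size is Cohen's d on paired differences, and an exact sign-flip permutation test is used only because it needs no external library. It is not a test the paper reports.

```python
import itertools
import statistics


def paired_effect_size(a: list[float], b: list[float]) -> float:
    """Cohen's d for paired samples: mean difference over its standard deviation."""
    diffs = [x - y for x, y in zip(a, b)]
    spread = statistics.stdev(diffs)
    if spread == 0:
        raise ValueError("all paired differences are identical")
    return statistics.mean(diffs) / spread


def sign_flip_p_value(a: list[float], b: list[float]) -> float:
    """Two-sided exact permutation test: under the null, each paired
    difference is equally likely to have either sign."""
    diffs = [x - y for x, y in zip(a, b)]
    observed = abs(sum(diffs))
    hits = total = 0
    # Exhaustive over 2**n sign patterns, so keep n small (e.g. one value per model).
    for signs in itertools.product((1, -1), repeat=len(diffs)):
        total += 1
        if abs(sum(s * d for s, d in zip(signs, diffs))) >= observed - 1e-12:
            hits += 1
    return hits / total
```

With only a handful of models the smallest attainable p-value is 2 / 2**n, so four models can never reach the conventional 0.05 threshold; that limit is worth stating next to any table that uses such a test.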

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. The comments highlight important ways to strengthen the robustness of our claims about the behavior of UQ methods in the function-calling setting and the value of our proposed adaptations. We address each major comment below and will incorporate the suggested revisions.

  1. Referee: [§4] §4 (Experimental Setup and Results): The central claim that multi-sample UQ offers no advantage over single-sample methods due to intrinsic properties of FC outputs (rather than model/benchmark choices) is load-bearing. The manuscript must demonstrate this via systematic variation across multiple LLMs, diverse FC benchmarks, and reported effect sizes with statistical tests; without such controls the 'intrinsic' interpretation does not follow from the presented comparisons.

    Authors: We agree that the interpretation of our findings as reflecting intrinsic properties of function-calling outputs (structured, executable sequences with limited semantic diversity) rather than artifacts of specific model or benchmark choices requires broader empirical support. Our original experiments focused on representative open-source and proprietary LLMs together with standard FC benchmarks, but we acknowledge that additional controls are needed. In the revision we will (i) evaluate at least two further LLMs spanning different scales and training regimes, (ii) include results on at least one additional diverse FC benchmark, (iii) report effect sizes (Cohen’s d or similar) alongside raw metrics, and (iv) apply appropriate statistical tests (paired t-tests or Wilcoxon signed-rank tests with multiple-comparison correction) to the performance differences between multi-sample and single-sample methods. These additions will allow readers to assess whether the observed lack of advantage for multi-sample UQ generalizes beyond the original setup. revision: yes

  2. Referee: [§5] §5 (Proposed Adaptations): The reported gains from AST clustering (multi-sample) and semantic token selection (single-sample) require ablation controls showing these outperform simpler baselines (e.g., random clustering or token selection) to establish that the improvements stem from FC structural properties.

    Authors: We concur that explicit ablations are necessary to isolate the contribution of function-calling structure. The original manuscript motivated AST-based clustering by the executable syntax of FC outputs and semantic-token selection by the fact that only certain tokens carry task-relevant semantics, yet we did not compare against random baselines. In the revised version we will add two ablation studies: (1) for multi-sample UQ, we will replace AST clustering with random partitioning of the sampled outputs and report the resulting uncertainty scores; (2) for single-sample logit-based scores, we will compare semantic-token selection against both uniform random token subsampling and selection of the first/last k tokens. Performance deltas and statistical significance will be reported so that readers can verify that the gains derive from leveraging FC-specific structure rather than from generic clustering or token filtering. revision: yes
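
A random-partition baseline of the kind described in the second response might be sketched as below. The design is an assumption, not the authors' protocol: each sampled output is assigned uniformly at random to one of `n_clusters` groups (for example, the number of clusters the AST method found), and the frequency-based entropy is averaged over repeats. All names are hypothetical.

```python
import math
import random
from collections import Counter


def entropy_from_labels(labels: list) -> float:
    """Entropy of the empirical distribution of cluster labels."""
    n = len(labels)
    return -sum((c / n) * math.log(c / n) for c in Counter(labels).values())


def random_partition_entropy(n_samples: int, n_clusters: int,
                             n_repeats: int = 200, seed: int = 0) -> float:
    """Average entropy when samples are assigned to clusters uniformly at random."""
    if n_samples < 1 or n_clusters < 1:
        raise ValueError("n_samples and n_clusters must be positive")
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_repeats):
        labels = [rng.randrange(n_clusters) for _ in range(n_samples)]
        total += entropy_from_labels(labels)
    return total / n_repeats
```

One design point follows directly from the formula: a shuffle that preserves the AST cluster sizes would leave a frequency-based entropy unchanged, since that entropy depends only on the sizes. The baseline therefore has to randomize the sizes as well, which is why this sketch assigns samples independently.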

Circularity Check

0 steps flagged

No circularity: purely empirical comparison of UQ methods with no derivations or self-referential predictions


The paper conducts an experimental evaluation of existing UQ methods (single-sample and multi-sample) on LLM function-calling tasks, reporting performance comparisons and targeted adaptations such as AST-based clustering and semantic token selection. No mathematical derivations, fitted parameters renamed as predictions, or load-bearing self-citations appear in the abstract or described setup. Claims rest on observed experimental outcomes rather than any chain that reduces to its own inputs by construction. This is the standard case of a self-contained empirical study whose central findings can be falsified by replication on other models or benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

This is an empirical evaluation study that applies and modestly adapts existing UQ methods; it introduces no new free parameters, no invented entities, and relies only on standard domain assumptions about benchmark representativeness and metric appropriateness.

axioms (2)
  • domain assumption The chosen benchmarks and tasks are representative of real-world LLM function-calling scenarios with irreversible effects.
    The claim that multi-sample methods offer no advantage and that the proposed adaptations improve performance rests on this representativeness.
  • domain assumption Existing UQ methods developed for natural language can be meaningfully compared and adapted to structured function-call outputs.
    The entire comparative evaluation and improvement strategy presupposes that the methods transfer to the FC regime.

pith-pipeline@v0.9.0 · 5548 in / 1575 out tokens · 68935 ms · 2026-05-08T11:43:25.383121+00:00 · methodology

