UCCI: Calibrated Uncertainty for Cost-Optimal LLM Cascade Routing

Varun Kotte

arxiv: 2605.18796 · v1 · pith:D7S6WHHNnew · submitted 2026-05-11 · 💻 cs.LG · cs.CL

Pith reviewed 2026-05-20 23:20 UTC · model grok-4.3

classification: cs.LG, cs.CL

keywords: LLM cascades, model routing, uncertainty calibration, isotonic regression, cost optimization, expected calibration error, inference efficiency, named entity recognition

The pith

Threshold policies on isotonic-calibrated uncertainty are cost-optimal for LLM cascades under three assumptions.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that calibrating uncertainty to error probabilities allows for better routing decisions in systems that use a small LLM for most queries and escalate difficult ones to a larger LLM. UCCI performs this calibration with isotonic regression on token-level margin uncertainty and then solves for the best escalation threshold using cost minimization under accuracy constraints. The resulting threshold policy is proven cost-optimal when three assumptions hold, and the calibration method has a known rate of convergence for its error. On a production named entity recognition workload of 75,000 queries, the paper reports a 31% reduction in inference cost at micro-F1 = 0.91, with expected calibration error falling from 0.12 to 0.03.

Core claim

UCCI maps token-level margin uncertainty to a per-query error probability via isotonic regression and selects the escalation threshold by constrained cost minimization. Under three explicit assumptions, threshold policies on the calibrated score are cost-optimal, and isotonic calibration achieves O(n^{-1/3}) sample complexity for expected calibration error. On a production named entity recognition workload of 75,000 queries served by 4B and 12B instruction-tuned LLMs, UCCI cuts inference cost by 31% at micro-F1 = 0.91 while reducing ECE from 0.12 to 0.03.
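The calibration step can be sketched with the pool-adjacent-violators algorithm, the standard solver for isotonic regression. This is a minimal illustration, not the paper's implementation: the names `fit_isotonic` and `calibrate`, the step-function prediction rule, and the toy data are assumptions introduced here.

```python
from bisect import bisect_left

def fit_isotonic(scores, errors):
    """Fit a non-decreasing step function from uncertainty score to error rate."""
    # Pool tied scores first so equal inputs always map to the same output.
    pooled = {}
    for s, y in zip(scores, errors):
        total, count = pooled.get(s, (0.0, 0))
        pooled[s] = (total + y, count + 1)

    # Each block holds [sum of error labels, number of points, right-most score].
    blocks = []
    for s in sorted(pooled):
        total, count = pooled[s]
        blocks.append([total, count, s])
        # Merge backwards while the previous block's mean exceeds the last one's.
        while (len(blocks) > 1
               and blocks[-2][0] * blocks[-1][1] > blocks[-1][0] * blocks[-2][1]):
            total, count, right = blocks.pop()
            blocks[-1][0] += total
            blocks[-1][1] += count
            blocks[-1][2] = right

    knots = [b[2] for b in blocks]
    values = [b[0] / b[1] for b in blocks]
    return knots, values

def calibrate(score, knots, values):
    """Look up the calibrated error probability for a new uncertainty score."""
    i = bisect_left(knots, score)
    return values[min(i, len(values) - 1)]

# Toy calibration set (hypothetical): higher score means less confident,
# error = 1 when the small model's answer was wrong.
toy_scores = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6]
toy_errors = [0, 0, 1, 0, 1, 1]
toy_knots, toy_values = fit_isotonic(toy_scores, toy_errors)
# toy_values is [0.0, 0.0, 0.5, 1.0, 1.0]: the violating pair at 0.3 and 0.4 is pooled.
```

Library implementations such as scikit-learn's `IsotonicRegression` interpolate linearly between knots rather than using a step function; either choice yields a monotone map from raw uncertainty to estimated error probability.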

What carries the argument

Isotonic regression that converts token-level margin uncertainty into a calibrated per-query error probability for use in cost-minimizing threshold selection.

If this is right

Threshold policies on the calibrated score become cost-optimal when the three assumptions hold.
Isotonic calibration reaches O(n^{-1/3}) sample complexity for expected calibration error.
The approach yields a 31% inference cost reduction on the 75,000-query NER workload at fixed accuracy.
It outperforms entropy thresholding, split-conformal routing, and FrugalGPT-style learned thresholds on the same task with measured latencies.
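For readers unfamiliar with the expected calibration error quoted in these claims, the sketch below computes a common equal-width binned estimate. The bin count, the function name, and the binary error labels are assumptions made for illustration; the source does not state which ECE estimator the paper uses.

```python
def expected_calibration_error(probs, errors, n_bins=10):
    """Binned ECE: weighted mean gap between predicted and observed error rates."""
    # Per bin: [sum of predicted probabilities, sum of observed errors, count].
    bins = [[0.0, 0.0, 0] for _ in range(n_bins)]
    for p, y in zip(probs, errors):
        b = min(int(p * n_bins), n_bins - 1)  # p == 1.0 goes into the last bin
        bins[b][0] += p
        bins[b][1] += y
        bins[b][2] += 1

    n = len(probs)
    ece = 0.0
    for sum_p, sum_y, count in bins:
        if count:
            gap = abs(sum_y / count - sum_p / count)
            ece += gap * count / n
    return ece
```

A value near zero means that, within each bin, the predicted error probability matches the observed error frequency; this is the quantity a reliability diagram displays bin by bin.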

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The method could extend to cascades with more than two model sizes if similar calibration holds.
Online adaptation of the isotonic map might maintain performance under changing query distributions.
Similar calibration-first routing may apply to other resource allocation problems in machine learning inference.
Testing on additional tasks would reveal how general the three assumptions are in practice.

Load-bearing premise

The three explicit assumptions that establish the cost-optimality of threshold policies on the calibrated score; violation of any one means the optimality result does not apply.
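One way to see why a threshold rule can be optimal is the exchange argument sketched below. It rests on simplifying assumptions chosen here for illustration (perfect calibration, constant large-model accuracy, a fixed escalation cost), which may differ from the three assumptions the paper actually states.

```latex
% Illustrative sketch only; the assumptions and symbols are NOT taken from the paper.
% p(x): calibrated probability that the small model errs on query x
% a_L:  large-model accuracy, assumed constant across queries
% c_S, c_L: per-query costs; the small model always runs, escalation adds c_L
% K:    set of queries answered by the small model (not escalated)
\begin{align*}
\mathrm{Cost}(K) &= c_S + c_L \,\Pr(x \notin K), \\
\mathrm{Acc}(K)  &= \mathbb{E}\big[\mathbf{1}\{x \in K\}\,(1 - p(x))
                    + \mathbf{1}\{x \notin K\}\, a_L\big] \\
                 &= a_L - \mathbb{E}\big[\mathbf{1}\{x \in K\}\,
                    \big(p(x) - (1 - a_L)\big)\big].
\end{align*}
```

Under these assumptions every retained query saves the same amount $c_L$, while its accuracy penalty $p(x) - (1 - a_L)$ increases with $p(x)$. For any fixed fraction of retained queries, accuracy is therefore highest when the retained set consists of the queries with the smallest $p(x)$, which is a set of the form $\{x : p(x) \le \theta\}$.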

What would settle it

An experiment on the same models and workload that identifies a non-threshold routing policy with lower cost at the target accuracy, or a new workload where UCCI fails to reduce cost by a comparable amount while keeping ECE low.

Figures

Figures reproduced from arXiv: 2605.18796 by Varun Kotte.

**Figure 1.** Reliability diagram on the calibration set. Isotonic regression reduces ECE from 0.12 (uncalibrated token margin) to 0.03 (UCCI).

**Figure 2.** Cost-accuracy Pareto frontier on the test set. UCCI dominates the FrugalGPT-style and single-model baselines. Comparisons against entropy and conformal at matched operating points appear in …

Original abstract

LLM cascades and model routing promise lower inference cost by sending easy queries to a small model and escalating hard ones to a large model, but most deployed routers use uncalibrated confidence scores and require per-workload threshold tuning. We present UCCI, a calibration-first router that maps token-level margin uncertainty to a per-query error probability via isotonic regression and selects the escalation threshold by constrained cost minimization. Under three explicit assumptions, threshold policies on the calibrated score are cost-optimal, and isotonic calibration achieves O(n^{-1/3}) sample complexity for expected calibration error (ECE). On a production named entity recognition workload of 75,000 queries served by 4B and 12B instruction-tuned LLMs on H100 GPUs, UCCI cuts inference cost by 31% (95% CI: [27%, 35%]) at micro-F1 = 0.91 while reducing ECE from 0.12 to 0.03. At the same operating point, UCCI beats entropy thresholding, split-conformal routing, and a FrugalGPT-style learned threshold. All cascade results use end-to-end routing on actual model outputs and measured H100 latency, not simulated routing from global accuracies or nominal API prices.
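The threshold-selection step in the abstract can be illustrated with a brute-force sweep over candidate thresholds on a held-out set. This is a sketch under stated assumptions rather than the paper's procedure: plain accuracy stands in for micro-F1, the small model is assumed to run on every query so escalation adds the large-model cost on top, and all names and parameters are hypothetical.

```python
def select_threshold(p_hat, small_ok, large_ok, cost_small, cost_large, min_accuracy):
    """Return (threshold, mean cost, accuracy) with the lowest cost meeting the floor.

    p_hat:    calibrated error probabilities for the small model, one per query
    small_ok: 1 if the small model's answer was correct, else 0
    large_ok: 1 if the large model's answer was correct, else 0
    Policy:   keep the small model's answer when p_hat <= threshold, else escalate.
    """
    n = len(p_hat)
    best = None
    # -inf escalates every query; each observed p_hat value is a candidate cut.
    candidates = [float("-inf")] + sorted(set(p_hat))
    for theta in candidates:
        cost = 0.0
        correct = 0
        for p, s_ok, l_ok in zip(p_hat, small_ok, large_ok):
            if p <= theta:
                cost += cost_small               # small model only
                correct += s_ok
            else:
                cost += cost_small + cost_large  # small model ran, then escalated
                correct += l_ok
        mean_cost = cost / n
        accuracy = correct / n
        if accuracy >= min_accuracy and (best is None or mean_cost < best[1]):
            best = (theta, mean_cost, accuracy)
    return best  # None when no threshold satisfies the accuracy floor
```

The sweep costs O(n·k) for k distinct scores, which is adequate for a sketch; sorting once and accumulating running sums would make it linear after the sort.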

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

UCCI gives a practical calibration step for LLM cascade routing that delivers 31% cost cuts on real production data, but the optimality guarantee hinges on three assumptions that the experiments do not directly test.

The main point is that this paper takes token-level margin scores, runs isotonic regression to turn them into per-query error probabilities, and then solves a constrained optimization to set the escalation threshold. On a 75k-query NER workload with actual 4B and 12B model outputs and measured H100 latency, it reports 31% lower cost at micro-F1 0.91 and drops ECE from 0.12 to 0.03, beating the baselines they compare against. That combination and the end-to-end measurement setup are the concrete pieces worth noting.

Referee Report

1 major / 2 minor

Summary. The manuscript presents UCCI, a calibration-first router for LLM cascades that maps token-level margin uncertainty to per-query error probabilities via isotonic regression and selects the escalation threshold by constrained cost minimization. Under three explicit assumptions, threshold policies on the calibrated score are claimed to be cost-optimal, with isotonic calibration achieving O(n^{-1/3}) sample complexity for expected calibration error (ECE). On a production named entity recognition workload of 75,000 queries served by 4B and 12B instruction-tuned LLMs, UCCI achieves a 31% inference cost reduction (95% CI: [27%, 35%]) at micro-F1 = 0.91 while reducing ECE from 0.12 to 0.03, outperforming entropy thresholding, split-conformal routing, and FrugalGPT-style learned thresholds. All results use end-to-end routing on actual model outputs and measured H100 latencies rather than simulations.

Significance. If the three assumptions hold, the work provides a principled, calibration-based alternative to heuristic routing in LLM cascades, backed by a non-parametric calibration method with explicit sample-complexity guarantees and a large-scale production experiment that reports real measured costs and confidence intervals. The end-to-end evaluation on actual outputs and direct baseline comparisons are notable strengths that increase practical relevance for cost-sensitive deployments.

major comments (1)

[§3] §3 (or §4): The central optimality claim—that threshold policies on the isotonic-calibrated score are cost-optimal—is explicitly conditioned on three assumptions (independence of the per-query error indicator given the calibrated probability, linearity of the cost function in the escalation decision, and monotonicity of the calibrated score in true error probability). The production NER experiments report a 31% cost reduction at fixed micro-F1 but contain no direct verification of these assumptions for the 4B/12B pair on the 75k-query workload, such as stratified calibration plots by query difficulty or a counterfactual comparison of threshold versus non-threshold policies. This verification is load-bearing for translating the calibration step into a guaranteed optimality result.

minor comments (2)

[Abstract] Abstract: The three assumptions are referenced but not enumerated; a brief parenthetical listing would improve readability without lengthening the abstract.
[Methods] The manuscript should clarify how token-level margin scores are aggregated into the per-query input for isotonic regression (e.g., mean, max, or learned pooling).
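To make the aggregation question concrete, the sketch below shows two simple pooling choices. Everything in it is hypothetical: the source does not define the token-level margin or the pooling rule, so the margin is assumed here to be the gap between the two most probable tokens at each decoding step, and `token_margins` and `query_uncertainty` are illustrative names.

```python
import math

def token_margins(step_logprobs):
    """Per-token margin: top-1 probability minus top-2 probability (assumed definition)."""
    margins = []
    for logprobs in step_logprobs:
        top = sorted((math.exp(lp) for lp in logprobs), reverse=True)
        margins.append(top[0] - top[1])
    return margins

def query_uncertainty(margins, how="mean"):
    """Pool token margins into one score in [0, 1]; larger means less confident."""
    if how == "mean":
        pooled = sum(margins) / len(margins)
    elif how == "min":
        pooled = min(margins)  # dominated by the single least confident token
    else:
        raise ValueError(f"unknown pooling rule: {how}")
    return 1.0 - pooled
```

Mean pooling smooths over isolated low-confidence tokens, while min pooling flags a query as soon as any one token is uncertain. Which behaviour suits entity extraction is the kind of detail the referee asks the authors to specify.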

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for their constructive feedback and for emphasizing the importance of verifying the assumptions underlying our optimality claims. We respond to the major comment below.

Referee: [§3] §3 (or §4): The central optimality claim—that threshold policies on the isotonic-calibrated score are cost-optimal—is explicitly conditioned on three assumptions (independence of the per-query error indicator given the calibrated probability, linearity of the cost function in the escalation decision, and monotonicity of the calibrated score in true error probability). The production NER experiments report a 31% cost reduction at fixed micro-F1 but contain no direct verification of these assumptions for the 4B/12B pair on the 75k-query workload, such as stratified calibration plots by query difficulty or a counterfactual comparison of threshold versus non-threshold policies. This verification is load-bearing for translating the calibration step into a guaranteed optimality result.

Authors: We agree that explicit verification of the three assumptions would strengthen the link between the isotonic calibration and the cost-optimality result. The assumptions are stated clearly in §3 as conditions for the theoretical claim. In the revised manuscript we will add stratified calibration plots (by input length and entity density) to provide empirical support for monotonicity on the 75k-query workload. We will also expand the discussion of the independence and linearity assumptions in light of the NER task structure. A full counterfactual comparison of threshold versus non-threshold policies is not included because it would require additional simulation assumptions outside the current end-to-end evaluation on real model outputs; the observed 31% cost reduction at fixed micro-F1 supplies indirect practical evidence that the assumptions hold sufficiently well for this deployment.

Revision: partial

Circularity Check

0 steps flagged

No significant circularity; optimality claim is conditional on explicit assumptions and uses standard isotonic regression

The paper explicitly conditions its central claim of cost-optimality for threshold policies on three assumptions rather than deriving the result tautologically from fitted parameters or self-referential definitions. Isotonic calibration is presented as a standard non-parametric technique whose O(n^{-1/3}) ECE sample complexity bound aligns with known statistical results for isotonic regression, not a self-derived or fitted prediction. No load-bearing self-citations, uniqueness theorems imported from prior author work, or ansatzes smuggled via citation are evident in the abstract or skeptic analysis. The production NER results report empirical cost and ECE improvements at fixed micro-F1 without reducing the optimality statement to a construction from the same inputs. The derivation chain therefore remains self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claims rest on three explicit but undetailed assumptions for cost-optimality plus the standard properties of isotonic regression; no free parameters or invented entities are introduced in the abstract.

axioms (1)

Domain assumption: three explicit assumptions under which threshold policies on the calibrated score are cost-optimal.
Invoked in the abstract to support the claim that the routing policy is cost-optimal.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel unclear

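We fit $g$ by isotonic regression ... threshold policy $\pi_\theta(x) = s$ if $\hat{p}(x) \le \theta$ ... $\theta^* = \arg\min_\theta \left[\mathrm{Cost}(\pi_\theta) \ \text{s.t.}\ \widehat{\mathrm{Acc}}(\pi_\theta) \ge \tau\right]$
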
IndisputableMonolith/Foundation/BranchSelection.lean branch_selection unclear

Under three explicit assumptions, threshold policies on the calibrated score are cost-optimal

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

[1] [2]

Statistical Inference Under Order Restrictions: The Theory and Application of Isotonic Regression , author =

work page

[2] [3]

Chen, Lingjiao and Zaharia, Matei and Zou, James , journal =

work page

[3] [4]

Advances in Neural Information Processing Systems (NeurIPS) , year =

Selective Classification for Deep Neural Networks , author =. Advances in Neural Information Processing Systems (NeurIPS) , year =

work page

[4] [5]

International Conference on Machine Learning (ICML) , year =

On Calibration of Modern Neural Networks , author =. International Conference on Machine Learning (ICML) , year =

work page

[5] [6]

Transactions of the Association for Computational Linguistics (TACL) , year =

How Can We Know When Language Models Know? On the Calibration of Language Models for Question Answering , author =. Transactions of the Association for Computational Linguistics (TACL) , year =

work page

[6] [7]

International Conference on Learning Representations (ICLR) , year =

Semantic Uncertainty: Linguistic Invariances for Uncertainty Estimation in Natural Language Generation , author =. International Conference on Learning Representations (ICLR) , year =

work page

[7] [8]

and Zhang, Hao and Stoica, Ion , booktitle =

Kwon, Woosuk and Li, Zhuohan and Zhuang, Siyuan and Sheng, Ying and Zheng, Lianmin and Yu, Cody Hao and Gonzalez, Joseph E. and Zhang, Hao and Stoica, Ion , booktitle =. Efficient Memory Management for Large Language Model Serving with

work page

[8] [9]

Advances in Neural Information Processing Systems (NeurIPS) , year =

Predict Responsibly: Improving Fairness and Accuracy by Learning to Defer , author =. Advances in Neural Information Processing Systems (NeurIPS) , year =

work page

[9] [10]

International Conference on Machine Learning (ICML) , year =

Consistent Estimators for Learning to Defer to an Expert , author =. International Conference on Machine Learning (ICML) , year =

work page

[10] [11]

Journal of the American Statistical Association , volume =

Least Ambiguous Set-Valued Classifiers with Bounded Error Levels , author =. Journal of the American Statistical Association , volume =

work page

[11] [12]

Advances in Neural Information Processing Systems (NeurIPS) , year =

Confident Adaptive Language Modeling , author =. Advances in Neural Information Processing Systems (NeurIPS) , year =

work page

[12] [13]

ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD) , year =

Transforming Classifier Scores into Accurate Multiclass Probability Estimates , author =. ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD) , year =

work page

[13] [14]

Cascade Speculative Drafting for Even Faster

Chen, Ziyi and Yang, Xiaocong and Lin, Jiacheng and Sun, Chenkai and Chang, Kevin Chen-Chuan and Huang, Jie , journal =. Cascade Speculative Drafting for Even Faster

work page

[14] [15]

Advances in Large Margin Classifiers , volume =

Probabilistic Outputs for Support Vector Machines and Comparisons to Regularized Likelihood Methods , author =. Advances in Large Margin Classifiers , volume =

work page

[15] [16]

and Hauskrecht, Milos , booktitle =

Naeini, Mahdi Pakdaman and Cooper, Gregory F. and Hauskrecht, Milos , booktitle =. Obtaining Well Calibrated Probabilities Using

work page

[16] [17]

AAAI 2024 Workshop on Scientific Document Understanding , year =

Retrieval Augmented Generation for Domain-specific Question Answering , author =. AAAI 2024 Workshop on Scientific Document Understanding , year =

work page 2024

[17] [18]

2025 , note =

Generating Answers to Contextual Queries Within a Closed Domain , author =. 2025 , note =

work page 2025

[18] [19]

Kotte, Varun , journal =

work page

[19] [20]

Ding, Dujian and Mallick, Ankur and Wang, Chi and Sim, Robert and Mukherjee, Subhabrata and Ruhle, Victor and Lakshmanan, Laks V. S. and Awadallah, Ahmed Hassan , booktitle =. Hybrid. 2024 , note =

work page 2024

[20] [21]

and Kadous, M

Ong, Isaac and Almahairi, Amjad and Wu, Vincent and Chiang, Wei-Lin and Wu, Tianhao and Gonzalez, Joseph E. and Kadous, M. Waleed and Stoica, Ion , journal =

work page

[21] [22]

Hu, Qitian Jason and Bieker, Jacob and Li, Xiuyu and Jiang, Nan and Keigwin, Benjamin and Ranganath, Gaurav and Keutzer, Kurt and Upadhyay, Shriyash Kaustubh , journal =

work page

[22] [23]

Nature , volume =

Detecting Hallucinations in Large Language Models Using Semantic Entropy , author =. Nature , volume =. 2024 , doi =

work page 2024

[23] [24]

and Liu, Rex and Thomson, Matt , journal =

Zellinger, Michael J. and Liu, Rex and Thomson, Matt , journal =. Cost-Saving

work page

[24] [25]

Angelopoulos, A. N. and Bates, S. A gentle introduction to conformal prediction and distribution-free uncertainty quantification. arXiv preprint arXiv:2107.07511, 2021

work page internal anchor Pith review Pith/arXiv arXiv 2021

[25] [26]

E., Bartholomew, D

Barlow, R. E., Bartholomew, D. J., Bremner, J. M., and Brunk, H. D. Statistical Inference Under Order Restrictions: The Theory and Application of Isotonic Regression. John Wiley & Sons, 1972

work page 1972

[26] [27]

FrugalGPT: How to Use Large Language Models While Reducing Cost and Improving Performance

Chen, L., Zaharia, M., and Zou, J. FrugalGPT : How to use large language models while reducing cost and improving performance. arXiv preprint arXiv:2305.05176, 2023 a

work page internal anchor Pith review Pith/arXiv arXiv 2023

[27] [28]

C.-C., and Huang, J

Chen, Z., Yang, X., Lin, J., Sun, C., Chang, K. C.-C., and Huang, J. Cascade speculative drafting for even faster LLM inference. arXiv preprint arXiv:2312.11462, 2023 b

work page arXiv 2023

[28] [29]

Ding, D., Mallick, A., Wang, C., Sim, R., Mukherjee, S., Ruhle, V., Lakshmanan, L. V. S., and Awadallah, A. H. Hybrid LLM : Cost-efficient and quality-aware query routing. In International Conference on Learning Representations (ICLR), 2024. arXiv:2404.14618

work page arXiv 2024

[29] [30]

Farquhar, J

Farquhar, S., Kossen, J., Kuhn, L., and Gal, Y. Detecting hallucinations in large language models using semantic entropy. Nature, 630 0 (8017): 0 625--630, 2024. doi:10.1038/s41586-024-07421-0

work page doi:10.1038/s41586-024-07421-0 2024

[30] [31]

and El-Yaniv, R

Geifman, Y. and El-Yaniv, R. Selective classification for deep neural networks. In Advances in Neural Information Processing Systems (NeurIPS), 2017

work page 2017

[31] [32]

Guo, C., Pleiss, G., Sun, Y., and Weinberger, K. Q. On calibration of modern neural networks. In International Conference on Machine Learning (ICML), 2017

work page 2017

[32] [33]

Hu, Q. J., Bieker, J., Li, X., Jiang, N., Keigwin, B., Ranganath, G., Keutzer, K., and Upadhyay, S. K. RouterBench: A benchmark for multi-LLM routing system. arXiv preprint arXiv:2403.12031, 2024

[33] [34]

Jiang, Z., Araki, J., Ding, H., and Neubig, G. How can we know when language models know? on the calibration of language models for question answering. In Transactions of the Association for Computational Linguistics (TACL), 2021

[34] [35]

Kotte, V. PromptPort: A reliability layer for cross-model structured extraction. arXiv preprint arXiv:2601.06151, 2026

[35] [36]

Kotte, V., Bui, T., Yoon, D. S., Sharma, S., Dernoncourt, F., and Sultania, D. Generating answers to contextual queries within a closed domain, 2025. U.S. Patent Application 18/432,938; Pub. US 20250252265 A1

[36] [37]

Kuhn, L., Gal, Y., and Farquhar, S. Semantic uncertainty: Linguistic invariances for uncertainty estimation in natural language generation. In International Conference on Learning Representations (ICLR), 2023

[37] [38]

Kwon, W., Li, Z., Zhuang, S., Sheng, Y., Zheng, L., Yu, C. H., Gonzalez, J. E., Zhang, H., and Stoica, I. Efficient memory management for large language model serving with PagedAttention. In ACM Symposium on Operating Systems Principles (SOSP), 2023

[38] [39]

Madras, D., Pitassi, T., and Zemel, R. Predict responsibly: Improving fairness and accuracy by learning to defer. In Advances in Neural Information Processing Systems (NeurIPS), 2018

[39] [40]

Mozannar, H. and Sontag, D. Consistent estimators for learning to defer to an expert. In International Conference on Machine Learning (ICML), 2020

[40] [41]

Naeini, M. P., Cooper, G. F., and Hauskrecht, M. Obtaining well calibrated probabilities using Bayesian binning. In AAAI Conference on Artificial Intelligence (AAAI), 2015

[41] [42]

Ong, I., Almahairi, A., Wu, V., Chiang, W.-L., Wu, T., Gonzalez, J. E., Kadous, M. W., and Stoica, I. RouteLLM: Learning to route LLMs with preference data. arXiv preprint arXiv:2406.18665, 2024

[42] [43]

Platt, J. C. Probabilistic outputs for support vector machines and comparisons to regularized likelihood methods. Advances in Large Margin Classifiers, 10(3):61--74, 1999

[43] [44]

Sadinle, M., Lei, J., and Wasserman, L. Least ambiguous set-valued classifiers with bounded error levels. Journal of the American Statistical Association, 114(525):223--234, 2019

[44] [45]

Schuster, T., Fisch, A., Gupta, J., Dehghani, M., Bahri, D., Tran, V. Q., Tay, Y., and Metzler, D. Confident adaptive language modeling. In Advances in Neural Information Processing Systems (NeurIPS), 2022

[45] [46]

Sharma, S., Yoon, D. S., Dernoncourt, F., Sultania, D., Bagga, K., Zhang, M., Bui, T., and Kotte, V. Retrieval augmented generation for domain-specific question answering. In AAAI 2024 Workshop on Scientific Document Understanding, 2024. arXiv:2404.14760

[46] [47]

Zadrozny, B. and Elkan, C. Transforming classifier scores into accurate multiclass probability estimates. In ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD), 2002

[47] [48]

Zellinger, M. J., Liu, R., and Thomson, M. Cost-saving LLM cascades with early abstention. arXiv preprint arXiv:2502.09054, 2025
