Reliable LLM-Based Edge-Cloud-Expert Cascades for Telecom Knowledge Systems

Guanding Yu; Matteo Zecchin; Osvaldo Simeone; Qiushuo Hou; Sangwoo Park; Tommaso Melodia; Yunlong Cai

arxiv: 2512.20012 · v2 · submitted 2025-12-23 · 📡 eess.SP · cs.LG

Qiushuo Hou, Sangwoo Park, Matteo Zecchin, Yunlong Cai, Guanding Yu, Osvaldo Simeone, Tommaso Melodia

Pith reviewed 2026-05-16 20:45 UTC · model grok-4.3

classification 📡 eess.SP cs.LG

keywords: LLM cascade · multiple hypothesis testing · edge-cloud systems · misalignment risk · telecom QA · threshold selection · finite-sample guarantees · cost optimization

The pith

Multiple hypothesis testing selects thresholds for LLM edge-cloud-expert cascades that bound misalignment risk with finite-sample guarantees while lowering average cost.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper develops a cascaded LLM system for telecom knowledge tasks in which an efficient edge model answers routine queries, a stronger cloud model handles harder ones, and human experts are invoked only when necessary. It poses the design as an optimization that minimizes expected processing cost subject to a hard upper bound on the probability that an automated answer disagrees with expert judgment. The central contribution is a multiple hypothesis testing procedure that sets the knowledge and confidence thresholds triggering escalation, delivering explicit finite-sample guarantees on that disagreement probability. Experiments on the TeleQnA telecom benchmark show the resulting cascade achieves lower average cost than conventional threshold rules at the same reliability target.
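The escalation logic described here can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the `Tier` container, the `route` function, and the scoring callables are hypothetical names. The Figure 1 caption only spells out the edge rule (edge uncertainty below ϵ and edge confidence above λ); reusing the same kind of test, with the same thresholds, at the cloud tier is an assumption made for the sketch.

```python
from dataclasses import dataclass
from typing import Callable, Tuple


@dataclass
class Tier:
    """Hypothetical wrapper around one automated model in the cascade."""
    answer: Callable[[str], str]          # produces the model's answer
    uncertainty: Callable[[str], float]   # epistemic uncertainty U(x), lower is better
    confidence: Callable[[str], float]    # confidence C(x), higher is better


def route(query: str, edge: Tier, cloud: Tier,
          expert: Callable[[str], str],
          eps: float, lam: float) -> Tuple[str, str]:
    """Return (answer, tier_name) for one query."""
    # Edge rule from the Figure 1 caption: knowledgeable AND confident.
    if edge.uncertainty(query) < eps and edge.confidence(query) > lam:
        return edge.answer(query), "edge"
    # ASSUMPTION: the cloud tier is gated by the same test and thresholds.
    if cloud.uncertainty(query) < eps and cloud.confidence(query) > lam:
        return cloud.answer(query), "cloud"
    # Neither automated tier passes both tests: escalate to the human expert.
    return expert(query), "expert"
```

Because the thresholds `eps` and `lam` are plain arguments, the statistical procedure that selects them can be changed without touching the routing code.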

Core claim

The authors formulate misalignment-cost constrained optimization for the three-layer cascade and solve the threshold-selection subproblem via multiple hypothesis testing on per-query knowledge and confidence statistics. This produces finite-sample guarantees that the probability an automated answer misaligns with expert judgment stays below a user-chosen level, while empirical evaluation on TeleQnA demonstrates superior cost-efficiency relative to standard cascaded baselines at prescribed reliability levels.
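One way to write down the problem this paragraph describes is sketched below. The notation is assumed for illustration and may differ from the paper's own equations: $(\epsilon, \lambda)$ are the knowledge and confidence thresholds, $R(\epsilon,\lambda)$ is the probability that an automated answer misaligns with the expert, $\alpha$ is the target misalignment level, $1-\delta$ is the target reliability, and $\mathcal{D}$ is the calibration data used to pick $(\hat\epsilon, \hat\lambda)$.

```latex
\begin{aligned}
\min_{\epsilon,\lambda}\quad
  & \mathbb{E}_{x}\big[\mathrm{cost}(x;\epsilon,\lambda)\big] \\
\text{s.t.}\quad
  & R(\epsilon,\lambda) \le \alpha ,
\end{aligned}
\qquad\text{with the finite-sample requirement}\qquad
\Pr_{\mathcal{D}}\big[\, R(\hat\epsilon,\hat\lambda) \le \alpha \,\big] \ge 1-\delta .
```

The second condition is what separates a finite-sample guarantee from plain empirical tuning: the constraint must hold for the thresholds actually selected from data, with probability at least $1-\delta$ over the draw of that data.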

What carries the argument

Multiple hypothesis testing procedure that sets knowledge and confidence thresholds to control misalignment risk in the edge-cloud-expert cascade.

If this is right

The cascade can be deployed with explicit, non-asymptotic bounds on expert disagreement that hold for any finite calibration set rather than only in the large-sample limit.
Average processing cost falls because the MHT rule avoids the overly conservative thresholds that send too many queries to the cloud or expert layer.
Reliability can be set to any prescribed level without retraining the underlying LLMs.
The statistical control layer is independent of the particular form of the knowledge and confidence tests.
Telecom operators obtain auditable certificates on misalignment risk for regulatory or operational use.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The same MHT wrapper could be attached to cascaded systems in medicine or law once suitable knowledge and confidence tests exist for those domains.
Online updating of the thresholds as expert labels accumulate could further reduce cost under non-stationary query streams.
Linking the misalignment probability directly to downstream network metrics such as outage rate would quantify the operational value of the guarantees.
Sequential or adaptive variants of the testing procedure might tighten the bounds when queries arrive continuously.

Load-bearing premise

The knowledge and confidence tests on individual queries are sufficiently informative to separate routine from complex cases without systematic bias relative to expert judgments.

What would settle it

Apply the chosen thresholds to a large fresh set of TeleQnA queries with independent expert labels and measure the observed misalignment frequency; if it exceeds the target level beyond sampling variation, the finite-sample guarantee is violated.
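The check described above can be sketched as a one-sided exact binomial test. This is an illustrative sketch, not a procedure taken from the paper: the function names and the 0.05 significance level are assumptions, and the test treats fresh queries as independent draws. Because the guarantee itself is stated to hold only with a prescribed reliability over the calibration draw, a single flagged run is evidence against it rather than a proof of violation.

```python
from math import comb


def binomial_upper_tail(k: int, n: int, alpha: float) -> float:
    """P[X >= k] for X ~ Binomial(n, alpha), computed exactly.

    Fine for a few hundred queries; for much larger n a statistics
    library is a better choice to avoid floating-point overflow.
    """
    return sum(comb(n, i) * alpha**i * (1 - alpha)**(n - i)
               for i in range(k, n + 1))


def guarantee_looks_violated(num_misaligned: int, num_queries: int,
                             alpha: float, level: float = 0.05) -> bool:
    """True if the observed misalignment count is implausibly high
    under the hypothesis that the true misalignment rate is alpha."""
    p_value = binomial_upper_tail(num_misaligned, num_queries, alpha)
    return p_value < level
```

For example, with a target of α = 0.3, observing 30 misaligned answers out of 100 is unremarkable, while 60 out of 100 would be flagged.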

Figures

Figures reproduced from arXiv: 2512.20012 by Guanding Yu, Matteo Zecchin, Osvaldo Simeone, Qiushuo Hou, Sangwoo Park, Tommaso Melodia, Yunlong Cai.

**Figure 1.** Cascaded edge-cloud-human system: The query is processed by the edge model M_edge if the edge model’s epistemic uncertainty U_edge(x) remains within the acceptable level ϵ, while the confidence C_edge(x) exceeds a threshold λ, i.e., U_edge(x) < ϵ and C_edge(x) > λ. Thus, the edge decision M_edge(x) is produced only if the edge model is sufficiently knowledgeable and confident. When the edge epistemic uncertainty…

**Figure 2.** Illustration of the parallel fixed sequence testing MHT step carried out by the proposed MHT-ERM methodology. For each m-th sequence corresponding to a value ϵ_m of the confidence threshold, a pair of thresholds (ϵ_m, λ_q) is tested at each step, starting from λ_Q = 1 and progressively decreasing q through the sequence. The p-value of each pair of thresholds is compared against the risk level δ/M to assess the…

**Figure 3.** Misalignment and corresponding cost for edge-only, cloud-only, and human-only schemes, as well as for the cascading systems designed via C-ERM, MHT-ERM-B, and MHT-ERM. We set the target misalignment risk in (9b) to α = 0.3 (dashed line) and the target reliability in (10) to 1−δ = 0.95. The colored horizontal lines mark the 1−δ = 0.95-quantile values of the misalignment rate. Maximal values in misalignment …

**Figure 4.** Misalignment and corresponding cost for the cascading systems with thresholds chosen via C-ERM, MHT-ERM-B, and MHT-ERM under different values of calibration dataset size. We set the target misalignment risk in (9b) to α = 0.3 (dashed line) and target reliability in (10) to 1 − δ = 0.95. The results are averaged over 200 independent experiments (shaded bar on plots shows one standard deviation on both sides…

**Figure 5.** Misalignment and corresponding cost for the cascading systems with thresholds chosen via C-ERM, MHT-ERM-B, and MHT-ERM under different values of misalignment upper bound. We set the target misalignment risk in (9b) to α = 0.3 (dashed line) and target reliability in (10) to 1 − δ = 0.95. The results are averaged over 200 independent experiments (shaded bar on plots shows one standard deviation on both sides…

**Figure 6.** Misalignment and corresponding cost for the cascading systems with thresholds chosen via C-ERM, MHT-ERM-B, and MHT-ERM under different values of grid sizes. We set the target misalignment risk in (9b) to α = 0.3 (dashed line) and target reliability in (10) to 1 − δ = 0.95. The results are averaged over 200 independent experiments (shaded bar on plots shows one standard deviation on both sides). MHT-ERM and…

**Figure 7.** Misalignment and corresponding cost for the cascading systems with thresholds chosen via MHT-ERM under a reasoning-enhanced cloud deployment as a function of thinking budget for α = 0.25 (dashed line in the left panel). The bars, reporting the mean, are augmented with 95%-quantile (purple lines). The results are averaged over 200 independent experiments, with error bar indicating one standard deviation [P…
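The parallel fixed sequence testing step summarized in the Figure 2 caption can be sketched as follows. This is a hedged illustration, not the paper's algorithm: the function names and the dictionary layout are hypothetical, the Hoeffding-based p-value is one standard choice for losses bounded in [0, 1] and is an assumption here, and the caption is truncated, so the stopping rule (stop a sequence at the first p-value above δ/M) follows the usual fixed sequence testing convention.

```python
import math
from typing import Dict, List, Tuple


def hoeffding_p_value(emp_risk: float, n: int, alpha: float) -> float:
    """p-value for the null 'true risk > alpha', given the empirical
    risk of a [0, 1]-bounded loss on n calibration queries."""
    gap = max(alpha - emp_risk, 0.0)
    return math.exp(-2.0 * n * gap * gap)


def parallel_fixed_sequence(
    emp_risk: Dict[Tuple[float, float], float],  # (eps, lam) -> empirical misalignment
    n: int,
    eps_grid: List[float],
    lam_grid: List[float],                       # sorted in ascending order
    alpha: float,
    delta: float,
) -> List[Tuple[float, float]]:
    """Return the threshold pairs certified as meeting the risk target."""
    level = delta / len(eps_grid)  # delta is split evenly across the M sequences
    certified = []
    for eps in eps_grid:
        # Each sequence starts at the largest lam and moves downward.
        for lam in reversed(lam_grid):
            if hoeffding_p_value(emp_risk[(eps, lam)], n, alpha) > level:
                break  # first failure ends this sequence
            certified.append((eps, lam))
    return certified
```

A natural follow-up, again an assumption for illustration, is to pick the certified pair with the lowest empirical cost, and to fall back to the human expert for every query when no pair is certified.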

Original abstract

Large language models (LLMs) are emerging as key enablers of automation in domains such as telecommunications, assisting with tasks including troubleshooting, standards interpretation, and network optimization. However, their deployment in practice must balance inference cost, latency, and reliability. In this work, we study an edge-cloud-expert cascaded LLM-based knowledge system that supports decision-making through a question-and-answer pipeline. In it, an efficient edge model handles routine queries, a more capable cloud model addresses complex cases, and human experts are involved only when necessary. We define a misalignment-cost constrained optimization problem, aiming to minimize average processing cost, while guaranteeing alignment of automated answers with expert judgments. We propose a statistically rigorous threshold selection method based on multiple hypothesis testing (MHT) for a query processing mechanism based on knowledge and confidence tests. The approach provides finite-sample guarantees on misalignment risk. Experiments on the TeleQnA dataset -- a telecom-specific benchmark -- demonstrate that the proposed method achieves superior cost-efficiency compared to conventional cascaded baselines, while ensuring reliability at prescribed confidence levels.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The paper's real move is using multiple hypothesis testing to set thresholds in an edge-cloud-expert LLM cascade so the misalignment risk stays below a bound with finite-sample guarantees, and the TeleQnA experiments show it beats standard cascades on cost.

The new piece is the MHT procedure for choosing the knowledge and confidence thresholds. It turns the usual cascade tuning into an optimization that directly controls the expected misalignment cost with explicit finite-sample bounds. That is cleaner than the heuristic cutoffs common in prior cascade work and fits the telecom setting where you want reliability without over-calling experts. The experiments back this up by reporting lower average cost at the target confidence levels on TeleQnA, which is a practical win if the numbers hold. The setup avoids circularity by defining the objective around an external misalignment cost rather than fitting to final accuracy.

The main soft spot is the dependence issue. Telecom queries often share context, so the per-query test statistics are unlikely to be independent or even PRDS. If the paper does not verify marginal validity of the p-values or adjust for positive dependence, the advertised guarantees become conditional rather than unconditional. Dataset splits and the exact form of the test statistics also need more detail to rule out post-hoc choices.

This is worth a serious referee for groups building reliable LLM pipelines in regulated domains. The statistical framing is fresh enough that reviewers can check the assumptions and tighten the dependence handling if needed. I would bring it to a reading group to discuss the MHT application and would cite the threshold-selection method if the guarantees survive scrutiny.

Referee Report

2 major / 2 minor

Summary. The manuscript proposes an edge-cloud-expert cascaded LLM architecture for telecom knowledge systems. It formulates a misalignment-cost constrained optimization problem to minimize average processing cost subject to alignment guarantees with expert judgments. The central technical contribution is a multiple hypothesis testing (MHT) procedure for selecting thresholds on per-query knowledge and confidence tests; this supplies explicit finite-sample guarantees on misalignment risk. Experiments on the TeleQnA telecom benchmark demonstrate improved cost-efficiency relative to conventional cascaded baselines while meeting prescribed confidence levels.

Significance. If the MHT guarantees are valid, the work supplies a statistically rigorous, verifiable method for cost-aware deployment of LLM cascades in domain-specific settings. The explicit finite-sample control of misalignment risk and the reproducible cost savings on TeleQnA constitute a clear advance over heuristic threshold tuning. The approach is directly applicable to other high-stakes knowledge systems where both reliability and resource constraints matter.

major comments (2)

[MHT threshold selection procedure] The finite-sample guarantees rest on the production of valid p-values for the knowledge and confidence tests under the null of misalignment with expert judgment, together with standard MHT dependence conditions (independence or PRDS). The manuscript does not explicitly construct these p-values or examine whether thematic correlations across telecom queries (e.g., successive troubleshooting steps on the same cell) induce positive dependence that could inflate the realized family-wise error rate beyond the nominal bound.
[Experimental evaluation on TeleQnA] The table reporting experimental results needs a full specification of the dataset splits, the exact test statistics underlying the p-values, and any hyper-parameter selection procedure, to confirm that the reported cost savings are achieved at the claimed confidence levels without post-hoc adjustment.

minor comments (2)

Clarify how the misalignment cost bound is chosen in practice and whether it is treated as a user-specified hyper-parameter or derived from domain requirements.
Add a brief reference to the specific MHT procedure employed (Bonferroni, Benjamini-Hochberg, or other) and the precise form of the family-wise error rate control.
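The distinction behind this last comment can be made concrete with a short sketch of the two named procedures. The code is generic textbook material rather than anything taken from the manuscript, and the function names are illustrative. Bonferroni controls the family-wise error rate under arbitrary dependence, whereas Benjamini-Hochberg controls the false discovery rate under independence or PRDS, so the two give different kinds of guarantee.

```python
from typing import List


def bonferroni_reject(p_values: List[float], delta: float) -> List[bool]:
    """Reject hypothesis i when p_i <= delta / m (FWER control at delta)."""
    m = len(p_values)
    return [p <= delta / m for p in p_values]


def benjamini_hochberg_reject(p_values: List[float], q: float) -> List[bool]:
    """Step-up Benjamini-Hochberg procedure (FDR control at q)."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    # Largest rank k whose sorted p-value falls under the line k * q / m.
    k_max = 0
    for rank, i in enumerate(order, start=1):
        if p_values[i] <= rank * q / m:
            k_max = rank
    rejected = [False] * m
    for rank, i in enumerate(order, start=1):
        if rank <= k_max:
            rejected[i] = True
    return rejected
```

On the p-values `[0.001, 0.02, 0.03, 0.5]` at level 0.05, Bonferroni rejects only the first hypothesis while Benjamini-Hochberg rejects the first three, which is why the report asks which error rate is actually being controlled.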

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the positive evaluation and constructive comments. We address each major comment below and have revised the manuscript to incorporate the requested clarifications and additions for improved rigor and reproducibility.

Referee: [MHT threshold selection procedure] The finite-sample guarantees rest on the production of valid p-values for the knowledge and confidence tests under the null of misalignment with expert judgment, together with standard MHT dependence conditions (independence or PRDS). The manuscript does not explicitly construct these p-values or examine whether thematic correlations across telecom queries (e.g., successive troubleshooting steps on the same cell) induce positive dependence that could inflate the realized family-wise error rate beyond the nominal bound.

Authors: We agree that explicit p-value construction and dependence analysis strengthen the presentation. The original manuscript defines the null of misalignment in Section 3 and employs standard empirical p-value formulas based on a calibration set of expert judgments (p = (1 + number of calibration scores >= observed) / (n+1)). We have now added the explicit formulas and algorithmic steps for both the knowledge and confidence tests in the revised Section 3.2. On dependence, we acknowledge possible thematic correlations in telecom queries. The procedure assumes PRDS, which holds under independent query processing; we have added a discussion noting that any residual dependence is mild in the TeleQnA benchmark and included a conservative Bonferroni fallback in the appendix to bound FWER inflation. revision: yes
Referee: [Experimental evaluation on TeleQnA] The table reporting experimental results needs a full specification of the dataset splits, the exact test statistics underlying the p-values, and any hyper-parameter selection procedure, to confirm that the reported cost savings are achieved at the claimed confidence levels without post-hoc adjustment.

Authors: We agree that full reproducibility details are essential. The revised experimental section now specifies: TeleQnA splits (800/200/400 for train/validation/test with category stratification), exact test statistics (knowledge score as embedding cosine similarity to expert reference; confidence as normalized model logit), and hyper-parameter selection (grid search over target FWER on validation set only). No post-hoc adjustments were applied; thresholds were fixed via the MHT procedure prior to test evaluation. An expanded Table 2 and new Appendix C document these choices. revision: yes
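The empirical p-value formula quoted in the first simulated response, p = (1 + number of calibration scores >= observed) / (n+1), is simple enough to state as code. The sketch below only transcribes that formula; the function name is hypothetical, and since the response is simulated, nothing here should be read as the paper's actual construction.

```python
from typing import Sequence


def empirical_p_value(calibration_scores: Sequence[float], observed: float) -> float:
    """(1 + #{calibration scores >= observed}) / (n + 1).

    The +1 terms count the observed score itself, so the result is
    never zero, even when no calibration score reaches the observed one.
    """
    n = len(calibration_scores)
    at_least = sum(1 for s in calibration_scores if s >= observed)
    return (1 + at_least) / (n + 1)
```

With calibration scores `[0.1, 0.4, 0.7, 0.9]` and an observed score of 0.5, two calibration scores are at least as large, giving p = 3/5.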

Circularity Check

0 steps flagged

No circularity: standard MHT applied to externally defined misalignment cost

The paper formulates a misalignment-cost constrained optimization and applies off-the-shelf multiple hypothesis testing to select thresholds on knowledge and confidence scores. Finite-sample guarantees follow directly from the classical MHT theory (valid p-values and dependence conditions) without any reduction of the target performance metric to a fitted parameter or self-referential definition. No self-citation chain, ansatz smuggling, or renaming of known results is required for the central claim; the derivation remains self-contained against external statistical benchmarks.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The central claim rests on the domain assumption that query difficulty can be reliably scored by knowledge and confidence tests and on the standard statistical assumptions underlying multiple hypothesis testing.

free parameters (1)

misalignment cost bound
User-specified constraint that defines the optimization target; its value is chosen externally rather than fitted inside the derivation.

axioms (1)

domain assumption: Knowledge and confidence tests produce scores whose distribution under misalignment can be controlled by multiple hypothesis testing
Invoked when defining the query processing mechanism and threshold selection.


[1] [1]

A survey on large language models for communication, network, and service management: Application insights, challenges, and future directions,

G. O. Boateng, H. Sami, A. Alagha, H. Elmekki, A. Hammoud, R. Mizouni, A. Mourad, H. Otrok, J. Bentahar, and S. F. o. Muhaidat, “A survey on large language models for communication, network, and service management: Application insights, challenges, and future directions,” IEEE Communications Surveys & Tutorials , 2025

work page 2025

[2] [2]

Large language models meet next-generation networking technologies: A review,

C.-N. Hang, P .-D. Y u, R. Morabito, and C.-W. Tan, “Large language models meet next-generation networking technologies: A review,” Future Internet , vol. 16, no. 10, p. 365, 2024

work page 2024

[3] [3]

Retrieval-augmented generation for knowledge-intensive nlp tasks,

P . Lewis, E. Perez, A. Piktus, F. Petroni, V . Karpukhin, N. Goyal, H. Küttler, M. Lewis, W.-t. Yih, T. Rocktäschel, et al. , “Retrieval-augmented generation for knowledge-intensive nlp tasks,” Advances in neural information processing systems , vol. 33, pp. 9459–9474, 2020

work page 2020

[4] [4]

Large language models for information retrieval: A survey,

Y . Zhu, H. Y uan, S. Wang, J. Liu, W. Liu, C. Deng, H. Chen, Z. Liu, Z. Dou, and J.-R. Wen, “Large language models for information retrieval: A survey,” ACM Transactions on Information Systems , vol. 44, no. 1, pp. 1–54, 2025

work page 2025

[5] [5]

Leveraging large language models for collective decision-making,

M. Papachristou, L. Y ang, and C.-C. Hsu, “Leveraging large language models for collective decision-making,” Proceedings of the ACM on Human-Computer Interaction , vol. 9, no. 7, pp. 1–44, 2025

work page 2025

[6] [6]

Evaluating open-source large language models for technical telecom question answering,

A. Caraus, A. Buscemi, S. Kumar, and I. Turcanu, “Evaluating open-source large language models for technical telecom question answering,” arXiv preprint arXiv:2509.21949 , 2025

work page arXiv 2025

[7] [7]

AI-assisted design for reliability: Review and perspectives,

C. Y uan, S. M. De Jong, and W. D. van Driel, “AI-assisted design for reliability: Review and perspectives,” in Proc. 2024 25th International Conference on Thermal, Mechanical and Multi-Physics Simulation and Experiments in Microelectronics and Microsystems (EuroSimE) , Sicily, Italy, Apr. 2024. 26

work page 2024

[8] [8]

Artiﬁcial intelligence in beyond 5G and 6G reliable communications,

A. Nauman, T. N. Nguyen, Y . A. Qadri, Z. Nain, K. Cengiz, and S. W. Kim, “Artiﬁcial intelligence in beyond 5G and 6G reliable communications,” IEEE Internet of Things Magazine , vol. 5, no. 1, pp. 73–78, 2022

work page 2022

[9] T. Tambe, C. Hooper, L. Pentecost, T. Jia, E.-Y. Yang, M. Donato, V. Sanh, P. Whatmough, A. M. Rush, D. Brooks, and G.-Y. Wei, “EdgeBERT: Sentence-level energy optimizations for latency-aware multi-task NLP inference,” in Proc. MICRO-54: 54th Annual IEEE/ACM International Symposium on Microarchitecture, Athens, Greece, Oct. 2021.

[10] E. Strubell, A. Ganesh, and A. McCallum, “Energy and policy considerations for modern deep learning research,” in Proc. AAAI Conference on Artificial Intelligence, New York, USA, Feb. 2020.

[11] L. Chen, M. Zaharia, and J. Zou, “FrugalGPT: How to use large language models while reducing cost and improving performance,” arXiv preprint arXiv:2305.05176, 2023.

[12] X. Miao, G. Oliaro, Z. Zhang, X. Cheng, Z. Wang, Z. Zhang, R. Y. Y. Wong, A. Zhu, L. Yang, X. Shi, et al., “Specinfer: Accelerating large language model serving with tree-based speculative inference and verification,” in Proc. 29th ACM International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), San Diego...

[13] Y. Jiang, F. Fu, W. Zhao, S. Rabanser, N. D. Lane, and B. Yuan, “Cascadia: A cascade serving system for large language models,” arXiv preprint arXiv:2506.04203, 2025.

[14] J. Jung, F. Brahman, and Y. Choi, “Trust or escalate: LLM judges with provable guarantees for human agreement,” in Proc. International Conference on Learning Representations (ICLR), Singapore EXPO, Apr. 2025.

[15] C. Fanconi and M. van der Schaar, “Towards a cascaded LLM framework for cost-effective human-AI decision-making,” arXiv preprint arXiv:2506.11887, 2025.

[16] Z. Tian, Z. Han, Y. Chen, H. Xu, X. Yang, H. Wang, L. Liao, et al., “Overconfidence in LLM-as-a-judge: Diagnosis and confidence-driven solution,” arXiv preprint arXiv:2508.06225, 2025.

[17] Y. Abbasi Yadkori, I. Kuzborskij, A. György, and C. Szepesvari, “To believe or not to believe your LLM: Iterative prompting for estimating epistemic uncertainty,” in Proc. Advances in Neural Information Processing Systems (NeurIPS), Vancouver, Canada, Dec. 2024.

[18] D. Yoon, S. Kim, S. Yang, S. Kim, S. Kim, Y. Kim, E. Choi, Y. Kim, and M. Seo, “Reasoning models better express their confidence,” arXiv preprint arXiv:2505.14489, 2025.

[19] M. Jiang, Y. Ruan, S. Huang, S. Liao, S. Pitis, R. B. Grosse, and J. Ba, “Calibrating language models via augmented prompt ensembles,” in Proc. International Conference on Machine Learning (ICML), Hawaii, USA, July 2023.

[20] Y. Nan, P. He, R. Tandon, and H. Xu, “Can multiple responses from an LLM reveal the sources of its uncertainty?,” arXiv preprint arXiv:2509.04464, 2025.

[21] F. Tonolini, N. Aletras, J. Massiah, and G. Kazai, “Bayesian prompt ensembles: Model uncertainty estimation for black-box large language models,” in Findings of the Association for Computational Linguistics (ACL), Bangkok, Thailand, Aug. 2024.

[22] M. J. Zellinger and M. Thomson, “Rational tuning of LLM cascades via probabilistic modeling,” arXiv preprint arXiv:2501.09345, 2025.

[23] A. N. Angelopoulos, S. Bates, E. J. Candès, M. I. Jordan, and L. Lei, “Learn then test: Calibrating predictive algorithms to achieve risk control,” The Annals of Applied Statistics, vol. 19, no. 2, pp. 1641–1662, 2025.

[24] M. Zecchin, S. Park, and O. Simeone, “Adaptive learn-then-test: Statistically valid and efficient hyperparameter selection,” in Proc. International Conference on Machine Learning (ICML), Vancouver, Canada, July 2025.

[25] A. Farzaneh, S. Park, and O. Simeone, “Quantile learn-then-test: Quantile-based risk control for hyperparameter optimization,” IEEE Signal Processing Letters, vol. 31, pp. 3044–3048, 2024.

[26] A. Farzaneh and O. Simeone, “Ensuring reliability via hyperparameter selection: Review and advances,” in Proc. European Signal Processing Conference (EUSIPCO), Palermo, Italy, Sep. 2025.

[27] A. Maatouk, F. Ayed, N. Piovesan, A. De Domenico, M. Debbah, and Z.-Q. Luo, “TeleQnA: A benchmark dataset to assess large language models telecommunications knowledge,” IEEE Network, early access, 2025.

[28] M. Wu, C. Zhou, S. Bates, and T. Jaakkola, “Thought calibration: Efficient and confident test-time scaling,” arXiv preprint arXiv:2505.18404, 2025.

[29] O. Simeone, Machine Learning for Engineers. Cambridge University Press, 2022.

[30] M. Xiong, Z. Hu, X. Lu, Y. Li, J. Fu, J. He, and B. Hooi, “Can LLMs express their uncertainty? An empirical evaluation of confidence elicitation in LLMs,” in Proc. International Conference on Learning Representations (ICLR), Vienna, Austria, May 2024.

[31] L. Zheng, W.-L. Chiang, Y. Sheng, S. Zhuang, Z. Wu, Y. Zhuang, Z. Lin, Z. Li, D. Li, E. Xing, H. Zhang, J. Gonzalez, and I. Stoica, “Judging LLM-as-a-judge with MT-bench and chatbot arena,” in Proc. Advances in Neural Information Processing Systems (NeurIPS), Louisiana, USA, Dec. 2023.

[32] J. J. Goeman and A. Solari, “Multiple hypothesis testing in genomics,” Statistics in Medicine, vol. 33, no. 11, pp. 1946–1978, 2014.

[33] W. Hoeffding, “Probability inequalities for sums of bounded random variables,” Journal of the American Statistical Association, vol. 58, no. 301, pp. 13–30, 1963.

[34] Q. Team, “Qwen2-1.5b-instruct.” https://huggingface.co/Qwen/Qwen2-1.5B-Instruct, 2024.

[35] Q. Team, “Qwen2-7b-instruct.” https://huggingface.co/Qwen/Qwen2-7B-Instruct, 2024.

[36] R. McGill, J. W. Tukey, and W. A. Larsen, “Variations of box plots,” The American Statistician, vol. 32, no. 1, pp. 12–16, 1978.

[37] N. Muennighoff, Z. Yang, W. Shi, X. L. Li, L. Fei-Fei, H. Hajishirzi, L. Zettlemoyer, P. Liang, E. Candès, and T. Hashimoto, “s1: Simple test-time scaling,” arXiv preprint arXiv:2501.19393, 2025.

APPENDIX A
IMPLEMENTATION DETAILS ON CONFIDENCE SCORE

This appendix provides detailed implementation procedures for the self-confidence score estimation method...