pith. machine review for the scientific record.

arxiv: 2604.13271 · v1 · submitted 2026-04-14 · 💻 cs.LG

Recognition: unknown

Enhancing Confidence Estimation in Telco LLMs via Twin-Pass CoT-Ensembling

Authors on Pith: no claims yet

Pith reviewed 2026-05-10 15:46 UTC · model grok-4.3

classification 💻 cs.LG
keywords LLM confidence calibration · chain-of-thought ensembling · telecommunications · expected calibration error · Gemma-3 · twin-pass method · overconfidence mitigation

The pith

Twin-pass CoT-ensembling reduces expected calibration error by up to 88% for confidence estimates in telecom LLMs.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Large language models applied to telecommunications tasks frequently assign high confidence to incorrect answers, making their self-assessments unreliable. The paper evaluates this problem on the Gemma-3 model family across TeleQnA, ORANBench, and srsRANBench and finds that standard single-pass verbalized confidence fails to track actual correctness. To correct the mismatch, the authors introduce twin-pass CoT-ensembling, which runs two independent reasoning chains and aggregates their assessments into a single calibrated score. When this procedure works, model outputs become substantially easier to verify and safer to use in network analysis and troubleshooting.

Core claim

The paper establishes that performing two separate Chain-of-Thought reasoning passes on the same query and then combining their confidence assessments produces a markedly better-calibrated confidence score than a single pass, as shown by reductions in Expected Calibration Error of up to 88% on representative telecommunications benchmarks.

What carries the argument

Twin-Pass CoT-Ensembling, which executes two independent reasoning evaluations and aggregates their confidence assessments into one calibrated score.
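The mechanics can be sketched in a few lines. In the toy below, `run_cot_pass` is a hypothetical stand-in for prompting the model once with a chain-of-thought template and parsing out a verbalized confidence score; aggregation by plain averaging follows the description in the simulated rebuttal, and all names and values are illustrative rather than the paper's exact protocol.

```python
import statistics

def twin_pass_confidence(query, run_cot_pass):
    # Pass 1: a chain-of-thought answer plus a verbalized confidence score.
    answer, conf_1 = run_cot_pass(query, pass_id=1)
    # Pass 2: an independent reasoning pass rates confidence again,
    # blind to Pass 1's output.
    _, conf_2 = run_cot_pass(query, pass_id=2)
    # Aggregate the two self-assessments into a single score (the simulated
    # rebuttal describes averaging; the figures also show a median variant).
    return answer, statistics.mean([conf_1, conf_2])

# Toy stand-in for the model call; a real system would prompt Gemma-3 twice.
def fake_pass(query, pass_id):
    return "Option 2", 0.9 if pass_id == 1 else 0.6

answer, score = twin_pass_confidence("Which counter tracks UL PDCP SDUs?", fake_pass)
# score is the mean of the two per-pass confidences, here 0.75
```

The two passes must be genuinely independent (no shared context) for the ensembling argument to apply; with a shared transcript the second "pass" would just echo the first.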

If this is right

  • Model outputs for 3GPP specification analysis become easier to verify because reported confidence more closely tracks correctness.
  • O-RAN network troubleshooting gains a practical route to safer reliance on LLM assistance without retraining.
  • The approach supplies a direct way to mitigate overconfidence in domain-specific LLMs through ensembled reasoning rather than post-training adjustments.
  • Trustworthy self-assessment becomes feasible for additional telecom tasks such as configuration validation and fault diagnosis.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same two-pass structure could be tested on other technical domains that rely on precise specification following.
  • Combining the method with existing calibration techniques might produce still larger gains in reliability.
  • Real-world deployment would require checking whether the added inference cost remains acceptable for high-volume network operations.

Load-bearing premise

That combining two independent reasoning passes genuinely improves calibration rather than merely smoothing random variation or introducing new systematic bias.
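A toy simulation makes the worry concrete: if both passes share the same systematic overconfidence, averaging them halves the noise but leaves the bias untouched, so variance smoothing alone cannot account for a genuine ECE reduction. All numbers below are hypothetical.

```python
import random
import statistics

random.seed(0)

# Each pass reports the "true" confidence plus a shared systematic bias
# (overconfidence) plus independent noise. Averaging two passes shrinks
# the noise term, but the shared bias survives untouched.
TRUE_CONF, BIAS, NOISE = 0.60, 0.25, 0.10

def one_pass():
    return TRUE_CONF + BIAS + random.gauss(0, NOISE)

singles = [one_pass() for _ in range(10_000)]
paired = [(one_pass() + one_pass()) / 2 for _ in range(10_000)]

var_ratio = statistics.pvariance(paired) / statistics.pvariance(singles)
residual_bias = statistics.mean(paired) - TRUE_CONF
# var_ratio lands near 0.5 (noise halves); residual_bias stays near 0.25.
```

If the paper's gains exceeded what this kind of variance reduction predicts, that would support the premise; if not, the improvement could be smoothing alone.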

What would settle it

Running the method on a fresh, previously unseen telecom benchmark and finding no reduction in Expected Calibration Error relative to single-pass baselines would show the claimed improvement does not hold.
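For reference, the metric such a test would compute can be sketched directly. The binned ECE below follows the standard definition; ten equal-width confidence bins is a common convention, assumed here because the page does not state the paper's binning.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Binned ECE: weight each confidence bin by its share of samples and
    sum the absolute gaps between bin accuracy and bin mean confidence."""
    n = len(confidences)
    ece = 0.0
    for b in range(n_bins):
        lo, hi = b / n_bins, (b + 1) / n_bins
        # half-open bins (lo, hi], with 0.0 assigned to the first bin
        in_bin = [i for i, c in enumerate(confidences)
                  if (c > lo or (b == 0 and c == 0.0)) and c <= hi]
        if not in_bin:
            continue
        acc = sum(correct[i] for i in in_bin) / len(in_bin)
        avg_conf = sum(confidences[i] for i in in_bin) / len(in_bin)
        ece += (len(in_bin) / n) * abs(acc - avg_conf)
    return ece

# A systematically overconfident model: always 0.9 sure, right half the time.
overconfident = expected_calibration_error([0.9] * 8, [1, 0, 1, 0, 1, 0, 1, 0])
# gap per bin is |0.5 - 0.9| = 0.4, so ECE is 0.4
```

On a fresh benchmark, running this on single-pass and twin-pass scores side by side is exactly the comparison the falsification test calls for.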

Figures

Figures reproduced from arXiv: 2604.13271 by Abiodun Ganiyu, Anton Saenko, Pranshav Gajjar, Vijay K. Shah.

Figure 1
Figure 1: The system prompt utilized during Pass 2 to facilitate the blind meta-cognitive self-evaluation and extract the confidence score (si). IV. EXPERIMENTAL SETUP A. Benchmark and Task Setup To validate our approach, we leverage two standardized open-source telecom benchmarks: OT-Lite and OT-Full, both sourced from the GSMA Open Telecom Benchmarking Suite [26]. Each benchmark contains multiple-choice question… view at source ↗
Figure 2
Figure 2: The prompt template utilized by the baseline single-pass verbalized confidence estimation method. V. RESULTS This section presents a systematic analysis of confidence calibration in telecom-domain LLMs. We first demonstrate the failure of single-pass verbalized confidence, then evaluate our proposed Twin-Pass CoT-Ensemble across three benchmarks and three model scales. A. The Failure of Single-Pass Verbali… view at source ↗
Figure 4
Figure 4: Reliability diagrams for Gemma-3-4B (OT-Lite, N = 1,300, all benchmarks pooled). Left (red): Raw single-pass verbalized confidence. Center (green): Twin-Pass CoT Ensemble Mean (E). Right (blue): Ensemble Median. B. Success of Twin-Pass CoT-Ensemble Conversely, the Blind Self-Evaluation approach via Twin-Pass CoT-Ensemble demonstrated exceptional discriminative power. By forcing the model to explicitly rate… view at source ↗
Figure 9
Figure 9. view at source ↗
read the original abstract

Large Language Models (LLMs) are increasingly applied to complex telecommunications tasks, including 3GPP specification analysis and O-RAN network troubleshooting. However, a critical limitation remains: LLM-generated confidence scores are often biased and unreliable, frequently exhibiting systematic overconfidence. This lack of trustworthy self-assessment makes it difficult to verify model outputs and safely rely on them in practice. In this paper, we study confidence calibration in telecom-domain LLMs using the representative Gemma-3 model family (4B, 12B, and 27B parameters), evaluated on TeleQnA, ORANBench, and srsRANBench. We show that standard single-pass, verbalized confidence estimates fail to reflect true correctness, often assigning high confidence to incorrect predictions. To address this, we propose a novel Twin-Pass Chain of Thought (CoT)-Ensembling methodology for improving confidence estimation by leveraging multiple independent reasoning evaluations and aggregating their assessments into a calibrated confidence score. Our approach reduces Expected Calibration Error (ECE) by up to 88% across benchmarks, significantly improving the reliability of model self-assessment. These results highlight the limitations of current confidence estimation practices and demonstrate a practical path toward more trustworthy evaluation of LLM outputs in telecommunications.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

2 major / 0 minor

Summary. The manuscript proposes Twin-Pass CoT-Ensembling, a method that performs two independent chain-of-thought reasoning passes on Gemma-3 models (4B, 12B, 27B) and aggregates their verbalized confidence scores to improve calibration on telecom benchmarks (TeleQnA, ORANBench, srsRANBench). It claims that standard single-pass verbalized confidence exhibits systematic overconfidence and that the proposed ensembling reduces Expected Calibration Error (ECE) by up to 88%.

Significance. If the empirical gains prove robust, the work would offer a simple, training-free technique for mitigating overconfidence in LLM self-assessments within a high-stakes domain. This could meaningfully support safer deployment of LLMs for 3GPP analysis and O-RAN troubleshooting, where reliable uncertainty quantification is practically important.

major comments (2)
  1. Abstract: the central claim of up to 88% ECE reduction is presented without any aggregation formula, statistical significance tests, baseline comparisons (e.g., multi-sample averaging of non-CoT scores), data-split details, or controls for prompt sensitivity. These omissions are load-bearing because they prevent verification that the reported improvement reflects genuine calibration gains rather than variance reduction or benchmark-specific artifacts.
  2. Methodology (implied by abstract description): no ablation is described that isolates the contribution of CoT reasoning from simple ensembling of independent passes. Without this comparison, it remains unclear whether the ECE reduction arises from improved self-assessment or from averaging correlated reasoning errors that may persist on held-out telecom distributions.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. We address each major comment below and outline the revisions we will make to improve clarity and rigor.

read point-by-point responses
  1. Referee: Abstract: the central claim of up to 88% ECE reduction is presented without any aggregation formula, statistical significance tests, baseline comparisons (e.g., multi-sample averaging of non-CoT scores), data-split details, or controls for prompt sensitivity. These omissions are load-bearing because they prevent verification that the reported improvement reflects genuine calibration gains rather than variance reduction or benchmark-specific artifacts.

    Authors: We acknowledge that the abstract's brevity omits key details. The full manuscript describes Twin-Pass CoT-Ensembling as performing two independent CoT passes and aggregating the verbalized confidence scores (via averaging) to produce the final estimate. To address the concern, we will revise the abstract to briefly reference the aggregation approach and add statistical significance testing (e.g., paired t-tests on ECE differences) in the results. We will also expand the experimental section with explicit baseline comparisons to multi-sample non-CoT averaging, full data-split specifications, and prompt-sensitivity controls across multiple prompt variants. These changes will help confirm the gains reflect calibration improvements rather than artifacts. revision: yes

  2. Referee: Methodology (implied by abstract description): no ablation is described that isolates the contribution of CoT reasoning from simple ensembling of independent passes. Without this comparison, it remains unclear whether the ECE reduction arises from improved self-assessment or from averaging correlated reasoning errors that may persist on held-out telecom distributions.

    Authors: We agree this ablation would strengthen the claims. The current work evaluates the combined Twin-Pass CoT-Ensembling method but does not include a direct comparison to non-CoT ensembling of independent passes. We will add this ablation study in the revised manuscript, reporting ECE for both CoT-based and non-CoT ensembling variants on the same benchmarks. This will clarify whether the observed reductions primarily stem from the CoT reasoning component or from ensembling effects alone, addressing potential concerns about correlated errors on telecom data. revision: yes
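The significance check the rebuttal promises is straightforward to sketch. The paired t statistic below operates on per-split ECE differences; the function and every number in it are hypothetical illustrations, not the paper's reported results.

```python
import math
import statistics

def paired_t_statistic(ece_baseline, ece_method):
    """t statistic for paired ECE differences across evaluation splits.
    Positive values mean the method lowers ECE relative to the baseline;
    a p-value would additionally need the t distribution's CDF."""
    diffs = [b - m for b, m in zip(ece_baseline, ece_method)]
    n = len(diffs)
    return statistics.mean(diffs) / (statistics.stdev(diffs) / math.sqrt(n))

# Hypothetical per-benchmark ECE values, not taken from the paper.
baseline_ece = [0.32, 0.28, 0.35, 0.30]
twin_pass_ece = [0.05, 0.06, 0.04, 0.05]
t_stat = paired_t_statistic(baseline_ece, twin_pass_ece)  # roughly 13.9
```

With only a handful of benchmark splits the test is low-powered, which is one reason the referee's call for explicit data-split details matters.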

Circularity Check

0 steps flagged

No circularity: empirical method validated on external benchmarks

full rationale

The paper describes an empirical procedure (Twin-Pass CoT-Ensembling) that aggregates two independent CoT passes to produce a confidence score, with performance measured via ECE reduction on held-out benchmarks (TeleQnA, ORANBench, srsRANBench). No equations, fitted parameters, or derivations are presented that reduce to self-definition or input data by construction. No load-bearing self-citations, uniqueness theorems, or ansatzes are invoked. The central result is an observed experimental improvement, not a tautological renaming or statistical artifact forced by the method's own definition.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The work assumes standard LLM evaluation practices and that the three named benchmarks are representative of real telecom tasks; no new entities are postulated.

axioms (1)
  • domain assumption: Expected Calibration Error is an appropriate and sufficient metric for assessing confidence reliability in this domain.
    The paper centers all claims on ECE reduction without discussing limitations of the metric for verbalized confidence in LLMs.

pith-pipeline@v0.9.0 · 5529 in / 1124 out tokens · 29995 ms · 2026-05-10T15:46:42.268519+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

30 extracted references · 6 canonical work pages · 1 internal anchor

  1. [1]

    N. D. Tripathi and V. K. Shah, Fundamentals of O-RAN. John Wiley & Sons, 2025

  2. [2]

    Large language model (llm) for telecommunications: A comprehensive survey on principles, key techniques, and opportunities,

    H. Zhou, C. Hu, Y. Yuan, Y. Cui, Y. Jin, C. Chen, H. Wu, D. Yuan, L. Jiang, D. Wu, et al., “Large language model (llm) for telecommunications: A comprehensive survey on principles, key techniques, and opportunities,” IEEE Communications Surveys & Tutorials, vol. 27, no. 3, pp. 1955–2005, 2024

  3. [3]

    A survey on large language models for communication, network, and service management: Application insights, challenges, and future directions,

    G. O. Boateng, H. Sami, A. Alagha, H. Elmekki, A. Hammoud, R. Mizouni, A. Mourad, H. Otrok, J. Bentahar, S. Muhaidat, et al., “A survey on large language models for communication, network, and service management: Application insights, challenges, and future directions,” IEEE Communications Surveys & Tutorials, 2025

  4. [4]

    Telecomgpt: A framework to build telecom-specific large language models,

    H. Zou, Q. Zhao, Y. Tian, L. Bariah, F. Bader, T. Lestable, and M. Debbah, “Telecomgpt: A framework to build telecom-specific large language models,” IEEE Transactions on Machine Learning in Communications and Networking, 2025

  5. [5]

    Teleqna: A benchmark dataset to assess large language models telecommunications knowledge,

    A. Maatouk, F. Ayed, N. Piovesan, A. De Domenico, M. Debbah, and Z.-Q. Luo, “Teleqna: A benchmark dataset to assess large language models telecommunications knowledge,” IEEE Network, 2025

  6. [6]

    arXiv preprint arXiv:2401.03804

    Z. He, Z. Wang, X. Liu, S. Liu, Y. Yao, Y. Huang, X. Li, Y. Li, Z. Che, Z. Zhang, et al. , “Telechat technical report,” arXiv preprint arXiv:2401.03804, 2024

  7. [7]

    Oransight-2.0: Foundational llms for o-ran,

    P. Gajjar and V. K. Shah, “Oransight-2.0: Foundational llms for o-ran,” IEEE Transactions on Machine Learning in Communications and Networking, 2025

  8. [8]

    Ai5gtest: Ai-driven specification-aware automated testing and validation of 5g o-ran components,

    A. Ganiyu, P. Gajjar, and V. K. Shah, “Ai5gtest: Ai-driven specification-aware automated testing and validation of 5g o-ran components,” in 18th ACM Conference on Security and Privacy in Wireless and Mobile Networks, pp. 53–64, 2025

  9. [9]

    Ai5gtest: Llm based automation for 5g o-ran testing,

    A. Ganiyu, P. Gajjar, and V. K. Shah, “Ai5gtest: Llm based automation for 5g o-ran testing,” in 18th ACM Conference on Security and Privacy in Wireless and Mobile Networks, pp. 298–299, 2025

  10. [10]

    Semantic routing for enhanced performance of llm-assisted intent-based 5g core network management and orchestration,

    D. M. Manias, A. Chouman, and A. Shami, “Semantic routing for enhanced performance of llm-assisted intent-based 5g core network management and orchestration,” in GLOBECOM 2024 - 2024 IEEE Global Communications Conference, pp. 2924–2929, IEEE, 2024

  11. [11]

    Netconfeval: Can llms facilitate network configuration?,

    C. Wang, M. Scazzariello, A. Farshin, S. Ferlin, D. Kostić, and M. Chiesa, “Netconfeval: Can llms facilitate network configuration?,” Proceedings of the ACM on Networking, vol. 2, no. CoNEXT2, pp. 1–25, 2024

  12. [12]

    Netllm: Adapting large language models for networking,

    D. Wu, X. Wang, Y. Qiao, Z. Wang, J. Jiang, S. Cui, and F. Wang, “Netllm: Adapting large language models for networking,” in Proceedings of the ACM SIGCOMM 2024 Conference, pp. 661–678, 2024

  13. [13]

    Netgpt: A native-ai network architecture beyond provisioning personalized generative services,

    Y. Chen, R. Li, Z. Zhao, C. Peng, J. Wu, E. Hossain, and H. Zhang, “Netgpt: A native-ai network architecture beyond provisioning personalized generative services,” arXiv preprint arXiv:2307.06148, 2023

  14. [14]

    An intent-based networks framework based on large language models,

    A. Fuad, A. H. Ahmed, M. A. Riegler, and T. Čičić, “An intent-based networks framework based on large language models,” in 2024 IEEE 10th International Conference on Network Softwarization (NetSoft), pp. 7–12, IEEE, 2024

  15. [15]

    Lowest span confidence: A zero-shot metric for efficient and black-box hallucination detection in llms

    Y. Qiao, L. Pan, Y. Mi, L. Liu, Y. Shen, F. Sun, and Z. Chu, “Lowest span confidence: A zero-shot metric for efficient and black-box hallucination detection in llms,” arXiv preprint arXiv:2601.19918, 2026

  16. [16]

    Calibration of neural networks,

    R. Vasilev and A. D’yakonov, “Calibration of neural networks,” arXiv preprint arXiv:2303.10761, 2023

  17. [17]

    When confidence meets accuracy: Exploring the effects of multiple performance indicators on trust in machine learning models,

    A. Rechkemmer and M. Yin, “When confidence meets accuracy: Exploring the effects of multiple performance indicators on trust in machine learning models,” in Proceedings of the 2022 chi conference on human factors in computing systems, pp. 1–14, 2022

  18. [18]

    Can LLMs Express Their Uncertainty? An Empirical Evaluation of Confidence Elicitation in LLMs

    M. Xiong, Z. Hu, X. Lu, Y. Li, J. Fu, J. He, and B. Hooi, “Can llms express their uncertainty? an empirical evaluation of confidence elicitation in llms,” arXiv preprint arXiv:2306.13063 , 2023

  19. [19]

    A survey of confidence estimation and calibration in large language models,

    J. Geng, F. Cai, Y. Wang, H. Koeppl, P. Nakov, and I. Gurevych, “A survey of confidence estimation and calibration in large language models,” in Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), pp. 6577–6595, 2024

  20. [20]

    Calibration is not enough: Evaluating confidence estimation under language variations,

    Y. Xia, D. Ulmer, T. Blevins, Y. Liu, H. Schütze, and B. Roth, “Calibration is not enough: Evaluating confidence estimation under language variations,” arXiv preprint arXiv:2601.08064, 2026

  21. [21]

    Gemma 3 technical report,

    G. Team, A. Kamath, J. Ferret, S. Pathak, N. Vieillard, R. Merhej, S. Perrin, T. Matejovicova, A. Ramé, M. Rivière, L. Rouillard, T. Mesnard, G. Cideron, J.-B. Grill, S. Ramos, E. Yvinec, M. Casbon, E. Pot, I. Penchev, G. Liu, F. Visin, K. Kenealy, L. Beyer, X. Zhai, A. Tsitsulin, R. Busa-Fekete, A. Feng, N. Sachdeva, B. Coleman, Y. Gao, B. Musta...

  22. [22]

    Language models are few-shot learners,

    T. Brown, B. Mann, N. Ryder, M. Subbiah, J. D. Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam, G. Sastry, A. Askell, et al., “Language models are few-shot learners,” Advances in neural information processing systems, vol. 33, pp. 1877–1901, 2020

  23. [23]

    Chain-of-thought prompting elicits reasoning in large language models,

    J. Wei, X. Wang, D. Schuurmans, M. Bosma, F. Xia, E. Chi, Q. V. Le, D. Zhou, et al., “Chain-of-thought prompting elicits reasoning in large language models,” Advances in neural information processing systems, vol. 35, pp. 24824–24837, 2022

  24. [24]

    Read your own mind: Reasoning helps surface self-confidence signals in llms,

    J. Podolak and R. Verma, “Read your own mind: Reasoning helps surface self-confidence signals in llms,” in Proceedings of the 2nd Workshop on Uncertainty-Aware NLP (UncertaiNLP 2025), pp. 247–258, 2025

  25. [25]

    Large language models are better reasoners with self-verification,

    Y. Weng, M. Zhu, F. Xia, B. Li, S. He, S. Liu, B. Sun, K. Liu, and J. Zhao, “Large language models are better reasoners with self-verification,” in Findings of the Association for Computational Linguistics: EMNLP 2023, pp. 2550–2575, 2023

  26. [26]

    Open teleco

    GSMA, “Open teleco.” https://github.com/gsma-labs/evals, 2026

  27. [27]

    Oran-bench-13k: An open source benchmark for assessing llms in open radio access networks,

    P. Gajjar and V. K. Shah, “Oran-bench-13k: An open source benchmark for assessing llms in open radio access networks,” in 2025 IEEE 22nd Consumer Communications & Networking Conference (CCNC), pp. 1–4, IEEE, 2025

  28. [28]

    ORANBench Assessment: Correct Question ID: oranbench 1486 Question: Which of the following is a measurement type used to track the volume of UL PDCP SDUs received via the Xn-U UL GTP-U tunnel for data forwarding? Model Generation: Answer: Option 1 Reasoning: Identifies the component keywords required: UL (uplink), Rx (received), Xn-U, and Data Forwarding...

  29. [29]

    Iteratively recalculates 16 * 2 = 32 bits iteratively in a circle, and then concludes: ”None of the options match this value

    srsRANBench Assessment: Incorrect Question ID: srsranbench 203 Question: What is the number of bits to modulate in the PBCH modulator? Model Generation: Answer: Option 3 Reasoning: Identifies QPSK means 2 bits per symbol. Iteratively recalculates 16 * 2 = 32 bits iteratively in a circle, and then concludes: ”None of the options match this value... so Op...

  30. [30]

    Single-Pass Confidence Score: 5 (Peak confidence)

    TeleQnA Assessment: Incorrect Question ID: teleqna 8938 Question: What does CSAT stand for? Model Generation: Answer: Option 3 Reasoning: CSAT stands for ”Coexistence Support for Access Technologies” in cellular networks, detailing operational coexistence between LTE, WiMAX, and NB-IoT to optimize resource utilization. Single-Pass Confidence Score: 5 ...