Can LLMs Use Linguistic Uncertainty Markers to Reliably Reflect Intrinsic Confidence?

Arman Cohan; Gabrielle Kaili-May Liu

arxiv: 2605.28778 · v1 · pith:WOJFG56Gnew · submitted 2026-05-27 · 💻 cs.CL

Pith reviewed 2026-06-29 12:14 UTC · model grok-4.3

classification 💻 cs.CL

keywords LLMs · linguistic uncertainty · epistemic markers · calibration · marker internal confidence · model-centric evaluation · trustworthiness

The pith

LLMs fail to link uncertainty markers like 'likely' to stable internal confidence levels.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests whether large language models can use epistemic markers such as 'likely' or 'probably' to express their own intrinsic uncertainty in a consistent way. It defines marker internal confidence as the level a model associates with each marker within a task and introduces seven metrics to check whether these associations stay stable inside one distribution and across different ones. The results show that models keep a rough ordering of markers by confidence but cannot make the distinctions hold reliably when tasks or data change. Readers would care because verbalized confidence is a main way people judge whether to trust an LLM output, and unstable use undermines that signal. The work supplies evidence that calibration problems persist even when markers are interpreted through the model's own lens.

Core claim

LLMs remain faithfully miscalibrated even under model-centric interpretation of marker meanings, struggling to differentiate markers by internal confidence across distributions despite preserving a somewhat consistent ranking order across tasks.

What carries the argument

Marker internal confidence (MIC), the estimated intrinsic confidence a model associates with a specific epistemic marker in a given task domain, measured by seven stability metrics within and across distributions.
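As a rough illustration of the definition above, the sketch below estimates a per-marker MIC by averaging intrinsic-confidence scores over the responses in one dataset that use a given marker. The record layout, the `estimate_mic` helper, and the toy numbers are all hypothetical; this summary does not describe how the paper actually estimates intrinsic confidence, so the scores are assumed to come from some separate estimator.

```python
from collections import defaultdict
from statistics import mean


def estimate_mic(records):
    """Average intrinsic confidence per (dataset, marker) pair.

    `records` is an iterable of (dataset, marker, confidence) tuples.
    `confidence` is assumed to be a score in [0, 1] produced by a separate
    intrinsic-confidence estimator (an assumption, not the paper's method).
    """
    buckets = defaultdict(list)
    for dataset, marker, confidence in records:
        buckets[(dataset, marker)].append(confidence)
    return {key: mean(values) for key, values in buckets.items()}


# Toy, made-up observations purely for illustration.
records = [
    ("popqa", "likely", 0.8),
    ("popqa", "likely", 0.6),
    ("popqa", "possibly", 0.5),
    ("mmlu", "likely", 0.9),
]
mic = estimate_mic(records)
```

Keying the result by `(dataset, marker)` keeps the "in a given task domain" part of the definition explicit, so the same marker can carry a different MIC on each dataset.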

If this is right

Models keep a consistent ranking order of markers by associated confidence across different tasks.
Models cannot reliably differentiate markers by their internal confidence values when distributions change.
Linguistic expressions of uncertainty stay miscalibrated from the model's own viewpoint.
Trustworthiness of LLM outputs requires more stable and aligned marker use.
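The first consequence, a consistent ranking of markers across tasks, is the kind of property a rank correlation can check. The sketch below computes a Spearman correlation between the MIC values of the same markers on two tasks. It is an illustrative stand-in, not one of the paper's seven metrics; the marker list and MIC numbers are invented.

```python
from statistics import mean


def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1  # mean of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def spearman(xs, ys):
    """Spearman correlation: Pearson correlation computed on the ranks.

    Raises ZeroDivisionError if either input is constant.
    """
    rx, ry = average_ranks(xs), average_ranks(ys)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5


# Hypothetical MIC values for the same four markers on two tasks.
markers = ["almost certainly", "likely", "possibly", "unlikely"]
mic_task_a = [0.74, 0.71, 0.69, 0.66]
mic_task_b = [0.58, 0.55, 0.56, 0.51]

rho = spearman(mic_task_a, mic_task_b)
```

In this toy example the ordering is mostly preserved (`rho` is high) even though the absolute values shift between tasks and sit close together, which mirrors the pattern the list describes.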

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Training objectives that only target numeric calibration may leave linguistic calibration untouched.
Applications that let users read verbal confidence statements could benefit from explicit marker-stability constraints during fine-tuning.
The same stability metrics could be applied to other forms of model-generated uncertainty language beyond epistemic markers.

Load-bearing premise

The seven metrics can reliably detect and quantify whether marker-to-confidence associations stay stable when viewed from the model's own perspective.

What would settle it

A finding that the seven metrics show distinct MIC values for different markers that remain stable when the same models are tested on new task distributions would falsify the central claim.
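One way to picture the falsification test above is to look at two quantities side by side: how far each marker's MIC moves across datasets, and how spread out the markers are within a dataset. The sketch below uses a plain coefficient of variation for both. This is an assumption made for illustration; it is not the paper's D-AvgCV or any other of its metrics, whose definitions are not given in this summary, and the MIC table is made up.

```python
from statistics import mean, pstdev


def coefficient_of_variation(values):
    """Population standard deviation divided by the mean (mean must be nonzero)."""
    return pstdev(values) / mean(values)


# Hypothetical MIC table: marker -> {dataset: MIC}. Numbers are invented.
mic = {
    "likely": {"task_a": 0.70, "task_b": 0.50},
    "possibly": {"task_a": 0.60, "task_b": 0.40},
}

# Stability: how much one marker's MIC moves across datasets (lower = more stable).
shift_per_marker = {
    marker: coefficient_of_variation(list(by_task.values()))
    for marker, by_task in mic.items()
}

# Discriminability: how spread out the markers are within one dataset
# (higher = markers are easier to tell apart).
datasets = sorted({d for by_task in mic.values() for d in by_task})
spread_per_dataset = {
    d: coefficient_of_variation([mic[m][d] for m in mic])
    for d in datasets
}
```

Under this reading, evidence against the central claim would look like a large within-dataset spread combined with a small per-marker shift across datasets.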

Figures

Figures reproduced from arXiv: 2605.28778 by Arman Cohan, Gabrielle Kaili-May Liu.

**Figure 2.** KDE plots of MIC values on PopQA. Despite limited marker discriminability, models encode multiple coarse uncertainty levels. Low D-AvgCV values alone do not reveal whether markers form a single undifferentiated cluster or a small number of stable but closely spaced uncertainty bands. To investigate this, we plot the kernel density estimate (KDE) of MIC values for each model and dataset, which reveals the …

**Figure 3.** Representative violin plots of models’ MIC densities across datasets, stratified by correctness. Dataset-level accuracy is indicated by black points (values along the second y-axis).

**Figure 4.** Representative violin plots of models’ MIC densities across datasets, stratified by faithful calibration level. Dataset-level faithful calibration is indicated by black stars (values along the second y-axis).

**Figure 5.** Representative plots of models’ MICs in relation to per-marker average absolute difference between internal confidence and human-interpreted linguistic decisiveness (MF). We observe positive trends between internal confidence and human-model decisiveness divergence per marker, suggesting high-MIC markers are a primary driver of faithful miscalibration in LLMs when we interpret markers according to human pe…

**Figure 6.** System prompt used to elicit uncertainty-bearing model responses across all experiments.

**Figure 7.** Task-specific prompts used to elicit model responses across experimental settings.

**Figure 8.** Metacognitive system prompt adapted from Liu et al.

**Figure 9.** Prompt used to score correctness of model responses via LLM-as-a-Judge.

**Figure 10.** Prompt used to extract hedge expressions and epistemic markers from model response.

**Figure 11.** Prompt used to standardize the format of epistemic markers extracted from model response.

**Figure 12.** Prompt [49, 52] used to assess sentence-response consistency when estimating models’ intrinsic confidence.

**Figure 13.** Prompt [49] used to score linguistic decisiveness of model responses in a human-aligned fashion via LLM-as-a-Judge.

**Figure 14.** Representative heatmaps of MIC values for randomly selected epistemic markers across datasets. We observe that models of varying sizes associate epistemic markers with substantially different internal confidence levels across datasets, often related to task difficulty. Despite this, confidence meanings of individual markers are generally not well-distinguished, consistent with our analysis of the concent…

**Figure 15.** Representative violin plots of models’ MIC densities across datasets. Per-dataset accuracy and faithful calibration level (measured via cMFG) are plotted using dot and star marks, respectively, with values indicated along the second y-axis. It can be seen that MIC values sometimes track with accuracy, as suggested by MAC scores in §5.

**Figure 16.** Representative KDE plots of MIC values per dataset for Llama3.1-8B-Instruct. While MIC distributions are generally concentrated within a narrow confidence range, several distinct, albeit weak peaks are observed per task, suggesting the model has some ability to encode multiple distinct uncertainty levels among epistemic markers despite limited marker discriminability.

**Figure 17.** Representative KDE plots of MIC values per dataset for Qwen3-32B. While MIC distributions are generally concentrated within a narrow confidence range, several distinct, often comparable peaks are observed per task, suggesting a multimodal structure with larger models bearing greater ability to encode distinct uncertainty levels among epistemic markers despite limited marker discriminability.
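The KDE captions above (Figures 2, 16, and 17) reason about how many peaks a density of MIC values has. A minimal sketch of that kind of analysis follows, using a fixed-bandwidth Gaussian kernel density estimate and a simple local-maximum count on a grid. The bandwidth, the grid, the helper names, and the MIC values are all assumptions chosen for illustration; the paper's plotting settings are not stated in this summary.

```python
import math


def gaussian_kde(samples, bandwidth):
    """Return a function that evaluates a fixed-bandwidth Gaussian KDE."""
    norm = 1.0 / (len(samples) * bandwidth * math.sqrt(2 * math.pi))

    def density(x):
        return norm * sum(
            math.exp(-0.5 * ((x - s) / bandwidth) ** 2) for s in samples
        )

    return density


def count_peaks(density, grid):
    """Count strict local maxima of the density evaluated on the grid."""
    ys = [density(x) for x in grid]
    return sum(1 for i in range(1, len(ys) - 1) if ys[i - 1] < ys[i] > ys[i + 1])


# Hypothetical per-marker MIC values forming two closely spaced bands.
mic_values = [0.50, 0.51, 0.52, 0.70, 0.71, 0.72]
grid = [i / 100 for i in range(0, 101)]

density = gaussian_kde(mic_values, bandwidth=0.02)
peaks = count_peaks(density, grid)
```

The peak count depends heavily on the bandwidth: a wider kernel would merge the two bands into one mode, so any reading of "distinct uncertainty levels" from such a plot should be checked against several bandwidths.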

Original abstract

LLMs' linguistically expressed confidence should faithfully reflect their intrinsic uncertainty. While recent work shows LLMs struggle to use epistemic markers (e.g., "it is likely...") in a human-aligned fashion, it remains unclear whether models can apply their own linguistic confidence framework to associate markers with specific confidence levels in a stable and generalizable way, and how contextual features impact this ability. We conduct the first systematic study of this question, formalizing _marker internal confidence_ (MIC) as the estimated intrinsic confidence a model associates with a specific epistemic marker in a given task domain. We present 7 metrics to evaluate the stability of MICs within and across distributions. Applying our analysis framework to diverse models and tasks, we find that LLMs remain faithfully miscalibrated even under model-centric interpretation of marker meanings, struggling to differentiate markers by internal confidence across distributions despite preserving a somewhat consistent ranking order across tasks. This supplies critical, complementary evidence to existing work toward a holistic understanding of faithful calibration in LLMs, emphasizing the need for more aligned and stable marker use to improve trustworthiness and reliability.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The paper formalizes a model-centric view of marker internal confidence with seven metrics and reports persistent miscalibration, but the abstract gives almost no experimental details so the strength of that finding is hard to judge.

The new piece is the shift to asking whether models can tie epistemic markers to their own internal confidence in a stable way, rather than just matching human usage. They define MIC as the confidence a model associates with a marker in a domain and supply seven metrics for within- and across-distribution stability. That framing is cleaner than most prior calibration work and could be useful if the metrics turn out to be well-grounded.

What they actually show is that models keep a rough ranking of markers across tasks but fail to differentiate them by internal confidence when distributions change. The abstract calls this faithful miscalibration under a model-centric lens.

The soft spot is the lack of any description of how MIC is estimated, which models were tested, what tasks or datasets were used, or whether the metrics rely on output patterns alone versus logit or hidden-state signals. If the metrics are built only from marker selection frequencies, the instability could be prompt or formatting noise rather than a true internal mismatch, which would undercut the central claim. The stress-test note lands on exactly that gap.

This is for calibration and alignment researchers who want a new measurement lens. A reader could extract the framework and try it themselves even if the reported numbers need checking. It is coherent enough on its own terms to deserve referee time, mainly because the question is live and the proposed metrics are concrete.

Referee Report

1 major / 0 minor

Summary. The paper formalizes Marker Internal Confidence (MIC) as the intrinsic confidence an LLM associates with a given epistemic marker in a task domain. It introduces seven metrics to quantify the stability of these MIC values within and across distributions. Experiments across models and tasks lead to the conclusion that LLMs remain faithfully miscalibrated: they struggle to differentiate markers according to internal confidence levels across distributions, even while preserving a roughly consistent ranking order across tasks.

Significance. If the empirical measurements are robust, the work supplies useful complementary evidence on the limits of linguistic uncertainty expression in LLMs, extending prior calibration studies by adopting an explicitly model-centric interpretation of marker meanings. The emphasis on stability across distributions and the preservation of ranking order are potentially actionable for trustworthiness research.

major comments (1)

[Abstract] Abstract (paragraph defining MIC and the seven metrics): the central claim that LLMs are 'faithfully miscalibrated' under a model-centric view depends on MIC being an independently estimable intrinsic quantity. The abstract provides no information on whether MIC is computed from token-level probabilities, hidden-state entropy, or solely from patterns of marker selection and prompted self-reports; without this grounding the observed instability could reflect prompt sensitivity rather than a true failure to link markers to internal confidence.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for their careful reading and constructive feedback on the abstract. We address the major comment below and will revise the manuscript to improve clarity.

Referee: [Abstract] Abstract (paragraph defining MIC and the seven metrics): the central claim that LLMs are 'faithfully miscalibrated' under a model-centric view depends on MIC being an independently estimable intrinsic quantity. The abstract provides no information on whether MIC is computed from token-level probabilities, hidden-state entropy, or solely from patterns of marker selection and prompted self-reports; without this grounding the observed instability could reflect prompt sensitivity rather than a true failure to link markers to internal confidence.

Authors: We agree that the abstract is too concise on the grounding of MIC and should explicitly indicate its estimation method to support the model-centric interpretation. In the full manuscript, MIC is formalized in Section 3 as an intrinsic quantity estimated solely from the model's own patterns of marker selection across controlled prompting setups and prompted self-reports within a task domain; it is not derived from token-level probabilities or hidden-state entropy. The seven metrics then evaluate the stability of these estimates. We will revise the abstract to include one additional sentence clarifying this estimation approach, which should address the concern that instability might stem from prompt sensitivity rather than a genuine failure to associate markers with internal confidence levels.

revision: yes

Circularity Check

0 steps flagged

Empirical measurement study with independent definitions and metrics

The paper formalizes MIC as an estimated quantity and introduces 7 metrics for stability evaluation, then applies them empirically across models and tasks. No equations, derivations, or self-citations are shown that reduce MIC or the metrics to fitted parameters defined by the same outputs, self-referential definitions, or load-bearing prior work by the authors. The central findings on miscalibration and ranking consistency follow from direct measurement rather than construction from inputs. This is a standard empirical analysis with no detectable circularity per the enumerated patterns.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review provides no information on free parameters, background axioms, or new entities; the study is purely empirical and introduces no mathematical derivations or postulated constructs beyond the defined MIC concept.

pith-pipeline@v0.9.1-grok · 5718 in / 1151 out tokens · 25941 ms · 2026-06-29T12:14:42.363225+00:00 · methodology

Reference graph

Works this paper leans on

109 extracted references · 67 canonical work pages · 12 internal anchors

[1]

The internal state of an LLM knows when it’s lying

Amos Azaria and Tom Mitchell. The internal state of an LLM knows when it’s lying. In Houda Bouamor, Juan Pino, and Kalika Bali, editors,Findings of the Association for Computational Linguistics: EMNLP 2023, pages 967–976, Singapore, December 2023. Association for Computational Linguistics. doi: 10.18653/v1/2023.findings-emnlp.68. URL https:// aclanthology...

work page doi:10.18653/v1/2023.findings-emnlp.68 2023
[2]

Linguistic calibration of long-form generations,

Neil Band, Xuechen Li, Tengyu Ma, and Tatsunori Hashimoto. Linguistic calibration of long-form generations, 2024. URLhttps://arxiv.org/abs/2404.00474

work page arXiv 2024
[3]

Cycles of thought: Measuring llm confidence through stable explanations, 2024

Evan Becker and Stefano Soatto. Cycles of thought: Measuring llm confidence through stable explanations, 2024. URLhttps://arxiv.org/abs/2406.03441. 9

work page arXiv 2024
[4]

Perceptions of linguistic uncertainty by language models and humans

Catarina G Belém, Markelle Kelly, Mark Steyvers, Sameer Singh, and Padhraic Smyth. Perceptions of linguistic uncertainty by language models and humans. In Yaser Al-Onaizan, Mohit Bansal, and Yun-Nung Chen, editors,Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pages 8467–8502, Miami, Florida, USA, November

2024
[5]

doi: 10.18653/v1/2024.emnlp-main.483

Association for Computational Linguistics. doi: 10.18653/v1/2024.emnlp-main.483. URLhttps://aclanthology.org/2024.emnlp-main.483/

work page doi:10.18653/v1/2024.emnlp-main.483 2024
[6]

Consistency in interpretation of probabilistic phrases.Organizational Behavior and Human Decision Processes, 36(3):391–405, 1985

David V Budescu and Thomas S Wallsten. Consistency in interpretation of probabilistic phrases.Organizational Behavior and Human Decision Processes, 36(3):391–405, 1985. ISSN 0749-5978. doi: https://doi.org/10.1016/0749-5978(85)90007-X. URL https://www. sciencedirect.com/science/article/pii/074959788590007X

work page doi:10.1016/0749-5978(85)90007-x 1985
[7]

Discovering Latent Knowledge in Language Models Without Supervision

Collin Burns, Haotian Ye, Dan Klein, and Jacob Steinhardt. Discovering latent knowledge in language models without supervision, 2024. URL https://arxiv.org/abs/2212.03827

work page internal anchor Pith review Pith/arXiv arXiv 2024
[8]

hello ai

Carrie J. Cai, Samantha Winter, David Steiner, Lauren Wilcox, and Michael Terry. "hello ai": Uncovering the onboarding needs of medical practitioners for human-ai collaborative decision-making.Proc. ACM Hum.-Comput. Interact., 3(CSCW), November 2019. doi: 10.1145/3359206. URLhttps://doi.org/10.1145/3359206

work page doi:10.1145/3359206 2019
[9]

Finetuning language models to emit linguistic expressions of uncertainty, 2024

Arslan Chaudhry, Sridhar Thiagarajan, and Dilan Gorur. Finetuning language models to emit linguistic expressions of uncertainty, 2024. URLhttps://arxiv.org/abs/2409.12180

work page arXiv 2024
[10]

Quantifying uncertainty in answers from any language model and enhancing their trustworthiness

Jiuhai Chen and Jonas Mueller. Quantifying uncertainty in answers from any language model and enhancing their trustworthiness. In Lun-Wei Ku, Andre Martins, and Vivek Srikumar, editors,Proceedings of the 62nd Annual Meeting of the Association for Computa- tional Linguistics (Volume 1: Long Papers), pages 5186–5200, Bangkok, Thailand, August
[11]

doi: 10.18653/v1/2024.acl-long.283

Association for Computational Linguistics. doi: 10.18653/v1/2024.acl-long.283. URL https://aclanthology.org/2024.acl-long.283/

work page doi:10.18653/v1/2024.acl-long.283 2024
[12]

Think you have Solved Question Answering? Try ARC, the AI2 Reasoning Challenge

Peter Clark, Isaac Cowhey, Oren Etzioni, Tushar Khot, Ashish Sabharwal, Carissa Schoenick, and Oyvind Tafjord. Think you have solved question answering? try arc, the ai2 reasoning challenge.arXiv:1803.05457v1, 2018

work page internal anchor Pith review Pith/arXiv arXiv 2018
[13]

Matthew Dahl, Varun Magesh, Mirac Suzgun, and Daniel E. Ho. Large Legal Fictions: Profiling Legal Hallucinations in Large Language Models, 2024

2024
[14]

Communicating uncertainty using words and numbers.Trends in Cognitive Sciences, 26(6):514–526, 2022

Mandeep K Dhami and David R Mandel. Communicating uncertainty using words and numbers.Trends in Cognitive Sciences, 26(6):514–526, 2022

2022
[15]

Shifting attention to relevance: Towards the predictive uncertainty quantification of free-form large language models

Jinhao Duan, Hao Cheng, Shiqi Wang, Alex Zavalny, Chenan Wang, Renjing Xu, Bhavya Kailkhura, and Kaidi Xu. Shifting attention to relevance: Towards the predictive uncertainty quantification of free-form large language models. In Lun-Wei Ku, Andre Martins, and Vivek Srikumar, editors,Proceedings of the 62nd Annual Meeting of the Association for Computation...

work page doi:10.18653/v1/2024.acl-long.276 2024
[16]

Bryan Eikema, Evgenia Ilia, José G. C. de Souza, Chrysoula Zerva, and Wilker Aziz. Teaching language models to faithfully express their uncertainty, 2025. URL https://arxiv.org/ abs/2510.12587

work page arXiv 2025
[17]

Perception of probability words, 2023

Wade Fagen-Ulmschneider. Perception of probability words, 2023. URL https://waf.cs. illinois.edu/visualizations/Perception-of-Probability-Words/

2023
[18]

R. A. Fisher. Frequency distribution of the values of the correlation coefficient in samples from an indefinitely large population.Biometrika, 10(4):507–521, 1915. ISSN 00063444. URLhttp://www.jstor.org/stable/2331838

work page arXiv 1915
[19]

probable error

Ronald A Fisher. On the" probable error" of a coefficient of correlation deduced from a small sample.Metron, 1:3–32, 1921. 10

1921
[20]

Epistemic integrity in large language models

Bijean Ghafouri, Shahrad Mohammadzadeh, James Zhou, Pratheeksha Nair, Jacob-Junqi Tian, Mayank Goel, Reihaneh Rabbany, Jean-François Godbout, and Kellin Pelrine. Epistemic integrity in large language models. InNeurips Safe Generative AI Workshop 2024, 2024. URL https://openreview.net/forum?id=o3wQbxRaKo

2024
[21]

Gemini 2.5 flash model card

Google DeepMind. Gemini 2.5 flash model card. https://storage.googleapis.com/ deepmind-media/Model-Cards/Gemini-2-5-Flash-Model-Card.pdf, 2025

2025
[22]

Gemini 3 flash model card

Google DeepMind. Gemini 3 flash model card. https://storage.googleapis.com/ deepmind-media/Model-Cards/Gemini-3-Flash-Model-Card.pdf, 2025

2025
[23]

Gemini 3.1 pro model card

Google DeepMind. Gemini 3.1 pro model card. https://storage.googleapis.com/ deepmind-media/Model-Cards/Gemini-3-1-Pro-Model-Card.pdf, 2026

2026
[24]

Aaron Grattafiori, Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Alex Vaughan, Amy Yang, Angela Fan, Anirudh Goyal, Anthony Hartshorn, Aobo Yang, Archi Mitra, Archie Sravanku- mar, Artem Korenev, Arthur Hinsvark, Arun Rao, Aston Zhang, Aurelien Rodriguez, Austen Gregerson, Ava...

work page internal anchor Pith review Pith/arXiv arXiv 2024
[25]

S., Bonilla, E

Yashvir S. Grewal, Edwin V . Bonilla, and Thang D. Bui. Improving uncertainty quantification in large language models via semantic embeddings, 2024. URL https://arxiv.org/abs/ 2410.22685

work page arXiv 2024
[26]

Quantifying uncertainty in natural language explanations of large language models

Sree Harsha Tanneru, Chirag Agarwal, and Himabindu Lakkaraju. Quantifying uncertainty in natural language explanations of large language models. In Sanjoy Dasgupta, Stephan Mandt, and Yingzhen Li, editors,Proceedings of The 27th International Conference on Artificial Intelligence and Statistics, volume 238 ofProceedings of Machine Learning Research, pages...

2024
[27]

Measuring Massive Multitask Language Understanding

Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, and Jacob Steinhardt. Measuring massive multitask language understanding, 2021. URL https://arxiv.org/abs/2009.03300

work page internal anchor Pith review Pith/arXiv arXiv 2021
[28]

Decom- posing uncertainty for large language models through input clarification ensembling, 2024

Bairu Hou, Yujian Liu, Kaizhi Qian, Jacob Andreas, Shiyu Chang, and Yang Zhang. Decom- posing uncertainty for large language models through input clarification ensembling, 2024. URLhttps://arxiv.org/abs/2311.08718

work page arXiv 2024
[29]

A survey of uncertainty estimation in llms: Theory meets practice, 2024

Hsiu-Yuan Huang, Yutong Yang, Zhaoxi Zhang, Sanwoo Lee, and Yunfang Wu. A survey of uncertainty estimation in llms: Theory meets practice, 2024. URL https://arxiv.org/ abs/2410.15326

work page arXiv 2024
[30]

A survey on hallucination in large language models: Principles, taxonomy, challenges, and open questions.ACM Trans

Lei Huang, Weijiang Yu, Weitao Ma, Weihong Zhong, Zhangyin Feng, Haotian Wang, Qiang- long Chen, Weihua Peng, Xiaocheng Feng, Bing Qin, and Ting Liu. A survey on hallucination in large language models: Principles, taxonomy, challenges, and open questions.ACM Trans. Inf. Syst., 43(2), January 2025. ISSN 1046-8188. doi: 10.1145/3703155. URL https://doi.org/...

work page doi:10.1145/3703155 2025
[31]

Look before you leap: An exploratory study of uncertainty analysis for large language models.IEEE Transactions on Software Engineering, 51(2):413–429, February 2025

Yuheng Huang, Jiayang Song, Zhijie Wang, Shengming Zhao, Huaming Chen, Felix Juefei-Xu, and Lei Ma. Look before you leap: An exploratory study of uncertainty analysis for large language models.IEEE Transactions on Software Engineering, 51(2):413–429, February 2025. ISSN 2326-3881. doi: 10.1109/tse.2024.3519464. URL http://dx.doi.org/10.1109/ TSE.2024.3519464

work page doi:10.1109/tse.2024.3519464 2025
[32]

Calibrating verbal uncertainty as a linear feature to reduce hallucinations.arXiv preprint arXiv:2503.14477, 2025

Ziwei Ji, Lei Yu, Yeskendir Koishekenov, Yejin Bang, Anthony Hartshorn, Alan Schelten, Cheng Zhang, Pascale Fung, and Nicola Cancedda. Calibrating verbal uncertainty as a linear feature to reduce hallucinations.arXiv preprint arXiv:2503.14477, 2025

work page arXiv 2025
[33]

Calibrating language models via augmented prompt ensembles

Mingjian Jiang, Yangjun Ruan, Sicong Huang, Saifei Liao, Silviu Pitis, Roger Baker Grosse, and Jimmy Ba. Calibrating language models via augmented prompt ensembles. 2023. URL https://api.semanticscholar.org/CorpusID:271797871

2023
[34]

Conformal linguistic calibration: Trading-off between factuality and specificity, 2025

Zhengping Jiang, Anqi Liu, and Benjamin Van Durme. Conformal linguistic calibration: Trading-off between factuality and specificity, 2025. URL https://arxiv.org/abs/2502. 19110

2025
[35]

Assessing the accuracy and reliability of ai-generated medical responses: An evaluation of the chat-gpt model.Research square, page rs.3.rs—2566942, February

Douglas Johnson, Rachel Goodman, J Patrinely, Cosby Stone, Eli Zimmerman, Rebecca Donald, Sam Chang, Sean Berkowitz, Avni Finn, Eiman Jahangir, Elizabeth Scoville, Tyler Reese, Debra Friedman, Julie Bastarache, Yuri van der Heijden, Jordan Wright, Nicholas Carter, Matthew Alexander, Jennifer Choe, Cody Chastain, John Zic, Sara Horst, Isik Turker, Rajiv Ag...
[36]

doi: 10.21203/rs.3.rs-2566942/v1

ISSN 2693-5015. doi: 10.21203/rs.3.rs-2566942/v1. URL https://europepmc.org/ articles/PMC10002821

work page doi:10.21203/rs.3.rs-2566942/v1
[37]

i am uncer- tain

Marie Juanchich, Amélie Gourdon-Kanhukamwe, and Miroslav Sirota. “i am uncer- tain” vs “it is uncertain”. how linguistic markers of the uncertainty source affect un- certainty communication.Judgment and Decision Making, 12(5):445–465, 2017. doi: 10.1017/S1930297500006483

work page doi:10.1017/s1930297500006483 2017
[38]

Language Models (Mostly) Know What They Know

Saurav Kadavath, Tom Conerly, Amanda Askell, Tom Henighan, Dawn Drain, Ethan Perez, Nicholas Schiefer, Zac Hatfield-Dodds, Nova DasSarma, Eli Tran-Johnson, Scott Johnston, Sheer El-Showk, Andy Jones, Nelson Elhage, Tristan Hume, Anna Chen, Yuntao Bai, Sam Bowman, Stanislav Fort, Deep Ganguli, Danny Hernandez, Josh Jacobson, Jackson Kernion, Shauna Kravec,...

work page internal anchor Pith review Pith/arXiv arXiv 2022
[39]

Cobb, Anirban Roy, Brian Matejek, Manoj Acharya, Daniel Elenius, Alexander Michael Berenbeim, John A

Ramneet Kaur, Colin Samplawski, Adam D. Cobb, Anirban Roy, Brian Matejek, Manoj Acharya, Daniel Elenius, Alexander Michael Berenbeim, John A. Pavlik, Nathaniel D. Bastian, and Susmit Jha. Addressing uncertainty in LLMs to enhance reliability in generative AI. InNeurips Safe Generative AI Workshop 2024, 2024. URL https://openreview.net/ forum?id=Z3DS4Pcxct

2024
[40]

i’m not sure, but

Sunnie S. Y . Kim, Q. Vera Liao, Mihaela V orvoreanu, Stephanie Ballard, and Jennifer Wortman Vaughan. "i’m not sure, but...": Examining the impact of large language models’ uncertainty expression on user reliance and trust. InProceedings of the 2024 ACM Conference on Fairness, Accountability, and Transparency, FAccT ’24, page 822–835, New York, NY , USA,...

work page doi:10.1145/3630106.3658941 2024
[41]

To hedge or not to hedge: the use of epistemic modal expressions in popular science in english texts, english–german translations, and german original texts

Svenja Kranich. To hedge or not to hedge: the use of epistemic modal expressions in popular science in english texts, english–german translations, and german original texts. 2011. URL https://api.semanticscholar.org/CorpusID:154907527

2011
[42]

Semantic uncertainty: Linguistic invariances for uncertainty estimation in natural language generation

Lorenz Kuhn, Yarin Gal, and Sebastian Farquhar. Semantic uncertainty: Linguistic invariances for uncertainty estimation in natural language generation. InThe Eleventh International Conference on Learning Representations, 2023. URL https://openreview.net/forum? id=VD-AYtP0dve

2023
[43]

Dai, Jakob Uszkoreit, Quoc Le, and Slav Petrov

Tom Kwiatkowski, Jennimaria Palomaki, Olivia Redfield, Michael Collins, Ankur Parikh, Chris Alberti, Danielle Epstein, Illia Polosukhin, Jacob Devlin, Kenton Lee, Kristina Toutanova, Llion Jones, Matthew Kelcey, Ming-Wei Chang, Andrew M. Dai, Jakob Uszkoreit, Quoc Le, and Slav Petrov. Natural questions: A benchmark for question answering research. Transac...

2019
[44]

Hedges: A study in meaning criteria and the logic of fuzzy concepts.Journal of philosophical logic, 2(4):458–508, 1973

George Lakoff. Hedges: A study in meaning criteria and the logic of fuzzy concepts.Journal of philosophical logic, 2(4):458–508, 1973

1973
[45]

Hedges in japanese conversation: The influence of age, sex, and formality

Shizuka Lauwereyns. Hedges in japanese conversation: The influence of age, sex, and formality. Language Variation and Change, 14(2):239–259, 2002. doi: 10.1017/S0954394502142049

work page doi:10.1017/s0954394502142049 2002
[46]

Are llm-judges robust to expressions of uncertainty? investigating the effect of epistemic markers on llm-based evaluation

Dongryeol Lee, Yerin Hwang, Yongil Kim, Joonsuk Park, and Kyomin Jung. Are llm-judges robust to expressions of uncertainty? investigating the effect of epistemic markers on llm-based evaluation. InProceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume ...

2025
[47]

The Winograd schema challenge

Hector J Levesque, Ernest Davis, and Leora Morgenstern. The Winograd schema challenge. InAAAI Spring Symposium: Logical Formalizations of Commonsense Reasoning, volume 46, page 47, 2011

2011
[48]

Haitao Li, Junjie Chen, Jingli Yang, Qingyao Ai, Wei Jia, Youfeng Liu, Kai Lin, Yueyue Wu, Guozhi Yuan, Yiran Hu, Wuyue Wang, Yiqun Liu, and Minlie Huang. LegalAgentBench: Evaluating LLM agents in legal domain. In Wanxiang Che, Joyce Nabende, Ekaterina Shutova, and Mohammad Taher Pilehvar, editors, Proceedings of the 63rd Annual Meeting of the Association ...

doi: 10.18653/v1/2025.acl-long.116
[49]

Junyi Li, Xiaoxue Cheng, Wayne Xin Zhao, Jian-Yun Nie, and Ji-Rong Wen. HaluEval: A large-scale hallucination evaluation benchmark for large language models, 2023. URL https://arxiv.org/abs/2305.11747
[50]

Stephanie Lin, Jacob Hilton, and Owain Evans. TruthfulQA: Measuring how models mimic human falsehoods. In Smaranda Muresan, Preslav Nakov, and Aline Villavicencio, editors, Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 3214–3252, Dublin, Ireland, May 2022. Association for Computat...

doi: 10.18653/v1/2022.acl-long.229
[51]

Stephanie Lin, Jacob Hilton, and Owain Evans. Teaching models to express their uncertainty in words. Transactions on Machine Learning Research, 2022. ISSN 2835-8856. URL https://openreview.net/forum?id=8s8K2UZGTZ
[52]

Gabrielle Kaili-May Liu, Gal Yona, Avi Caciularu, Idan Szpektor, Tim G. J. Rudner, and Arman Cohan. MetaFaith: Faithful natural language uncertainty expression in LLMs. In Christos Christodoulopoulos, Tanmoy Chakraborty, Carolyn Rose, and Violet Peng, editors, Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages ...

doi: 10.18653/v1/2025.emnlp-main.1505
[53]

Jiayu Liu, Qing Zong, Weiqi Wang, and Yangqiu Song. Revisiting epistemic markers in confidence estimation: Can markers accurately reflect large language models’ uncertainty? In Wanxiang Che, Joyce Nabende, Ekaterina Shutova, and Mohammad Taher Pilehvar, editors, Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volum...

doi: 10.18653/v1/2025.acl-short.18
[54]

Alex Mallen, Akari Asai, Victor Zhong, Rajarshi Das, Hannaneh Hajishirzi, and Daniel Khashabi. When not to trust language models: Investigating effectiveness and limitations of parametric and non-parametric memories. arXiv preprint, 2022
[55]

Potsawee Manakul, Adian Liusie, and Mark Gales. SelfCheckGPT: Zero-resource black-box hallucination detection for generative large language models. In Houda Bouamor, Juan Pino, and Kalika Bali, editors, Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 9004–9017, Singapore, December 2023. Association for Computa...

doi: 10.18653/v1/2023.emnlp-main.557
[56]

Clara Meister, Gian Wiher, Tiago Pimentel, and Ryan Cotterell. On the probability–quality paradox in language generation. In Smaranda Muresan, Preslav Nakov, and Aline Villavicencio, editors, Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 36–45, Dublin, Ireland, May 2022. Associat...

doi: 10.18653/v1/2022.acl-short.5
[57]

Sabrina J. Mielke, Arthur Szlam, Emily Dinan, and Y-Lan Boureau. Reducing conversational agents’ overconfidence through linguistic calibration. Transactions of the Association for Computational Linguistics, 10:857–872, 2022. doi: 10.1162/tacl_a_00494. URL https://aclanthology.org/2022.tacl-1.50/
[58]

Sewon Min, Julian Michael, Hannaneh Hajishirzi, and Luke Zettlemoyer. AmbigQA: Answering ambiguous open-domain questions. In Bonnie Webber, Trevor Cohn, Yulan He, and Yang Liu, editors, Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 5783–5797, Online, November 2020. Association for Computational ...

doi: 10.18653/v1/2020.emnlp-main.466
[59]

Pilar Mur-Dueñas. There may be differences: Analysing the use of hedges in English and Spanish research articles. Lingua, 260:103131, 2021. ISSN 0024-3841. doi: https://doi.org/10.1016/j.lingua.2021.103131. URL https://www.sciencedirect.com/science/article/pii/S0024384121001030
[60]

Leann Myers and Maria J. Sirois. Spearman Correlation Coefficients, Differences between. John Wiley & Sons, Ltd, 2014. ISBN 9781118445112. doi: https://doi.org/10.1002/9781118445112.stat02802. URL https://onlinelibrary.wiley.com/doi/abs/10.1002/9781118445112.stat02802
[61]

Thu Nguyen Thi Thuy. A corpus-based study on cross-cultural divergence in the use of hedges in academic research articles written by Vietnamese and native English-speaking authors. Social Sciences, 7(4), 2018. ISSN 2076-0760. doi: 10.3390/socsci7040070. URL https://www.mdpi.com/2076-0760/7/4/70
[62]

Alexander Nikitin, Jannik Kossen, Yarin Gal, and Pekka Marttinen. Kernel language entropy: Fine-grained uncertainty quantification for LLMs from semantic similarities, 2024. URL https://arxiv.org/abs/2405.20003
[63]

Pia Pennekamp, Jamal K Mansour, and Rhiannon J Batstone. Variability in verbal eyewitness confidence. Applied Cognitive Psychology, 38(2):e4190, 2024
[64]

Mauricio Rivera, Jean-François Godbout, Reihaneh Rabbany, and Kellin Pelrine. Combining confidence elicitation and sample-based methods for uncertainty quantification in misinformation mitigation. In Raúl Vázquez, Hande Celikkanat, Dennis Ulmer, Jörg Tiedemann, Swabha Swayamdipta, Wilker Aziz, Barbara Plank, Joris Baan, and Marie-Catherine de Marneffe,...

doi: 10.18653/v1/2024.uncertainlp-1.12
[65]

Maohao Shen, Subhro Das, Kristjan Greenewald, Prasanna Sattigeri, Gregory Wornell, and Soumya Ghosh. Thermometer: Towards universal calibration for large language models, 2024. URL https://arxiv.org/abs/2403.08819
[66]

Vaishnavi Shrivastava, Percy Liang, and Ananya Kumar. Llamas know what GPTs don’t show: Surrogate models for confidence estimation, 2023. URL https://arxiv.org/abs/2311.08877
[67]

N Clayton Silver and William P Dunlap. Averaging correlation coefficients: Should Fisher’s z transformation be used? Journal of Applied Psychology, 72(1):146, 1987
[68]

Adi Simhi, Itay Itzhak, Fazl Barez, Gabriel Stanovsky, and Yonatan Belinkov. Trust me, I’m wrong: High-certainty hallucinations in LLMs. arXiv preprint arXiv:2502.12964, 2025
[69]

OpenAI GPT-5 System Card

Aaditya Singh, Adam Fry, Adam Perelman, Adam Tart, Adi Ganesh, Ahmed El-Kishky, Aidan McLaughlin, Aiden Low, AJ Ostrow, Akhila Ananthram, Akshay Nathan, Alan Luo, Alec Helyar, Aleksander Madry, Aleksandr Efremov, Aleksandra Spyra, Alex Baker-Whitcomb, Alex Beutel, Alex Karpenko, Alex Makelov, Alex Neitz, Alex Wei, Alexandra Barr, Alexandre Kirchmeyer, Ale...

arXiv, 2025
[70]

Zhangde Song, Jieyu Lu, Yuanqi Du, Botao Yu, Thomas M. Pruyn, Yue Huang, Kehan Guo, Xiuzhe Luo, Yuanhao Qu, Yi Qu, Yinkai Wang, Haorui Wang, Jeff Guo, Jingru Gan, Parshin Shojaee, Di Luo, Andres M Bran, Gen Li, Qiyuan Zhao, Shao-Xiong Lennon Luo, Yuxuan Zhang, Xiang Zou, Wanru Zhao, Yifan F. Zhang, Wucheng Zhang, Shunan Zheng, Saiyang Zhang, Sartaaj Takri...

arXiv, 2025
[71]

Elias Stengel-Eskin, Peter Hase, and Mohit Bansal. LACIE: Listener-aware finetuning for calibration in large language models. In The Thirty-eighth Annual Conference on Neural Information Processing Systems, 2024. URL https://openreview.net/forum?id=RnvgYd9RAh
[72]

Mark Steyvers and Megan AK Peters. Metacognition and uncertainty communication in humans and large language models. Current Directions in Psychological Science, page 09637214251391158, 2025
[73]

Mark Steyvers, Heliodoro Tejeda, Aakriti Kumar, Catarina Belem, Sheer Karny, Xinyue Hu, Lukas W Mayer, and Padhraic Smyth. What large language models know and what people think they know. Nature Machine Intelligence, 7(2):221–231, 2025
[74]

Zhisheng Tang, Ke Shen, and Mayank Kejriwal. An evaluation of estimative uncertainty in large language models. npj Complexity, 3(1):8, 2026
[75]

Linwei Tao, Yi-Fan Yeh, Bo Kai, Minjing Dong, Tao Huang, Tom A Lamb, Jialin Yu, Philip HS Torr, and Chang Xu. Can large language models express uncertainty like human? arXiv preprint arXiv:2509.24202, 2025
[76]

Katherine Tian, Eric Mitchell, Allan Zhou, Archit Sharma, Rafael Rafailov, Huaxiu Yao, Chelsea Finn, and Christopher Manning. Just ask for calibration: Strategies for eliciting calibrated confidence scores from language models fine-tuned with human feedback. In Houda Bouamor, Juan Pino, and Kalika Bali, editors, Proceedings of the 2023 Conference on Empiri... Association for Computational Linguistics. doi: 10.18653/v1/2023.emnlp-main.330. URL https://aclanthology.org/2023.emnlp-main.330/
[78]

Katherine Tian, Eric Mitchell, Huaxiu Yao, Christopher D Manning, and Chelsea Finn. Fine-tuning language models for factuality. In The Twelfth International Conference on Learning Representations, 2024. URL https://openreview.net/forum?id=WPZ2yPag4K
[79]

SM Tonmoy, SM Zaman, Vinija Jain, Anku Rani, Vipula Rawte, Aman Chadha, and Amitava Das. A comprehensive survey of hallucination mitigation techniques in large language models. arXiv preprint arXiv:2401.01313, 6, 2024
[80]

Thomas S Wallsten, David V Budescu, Rami Zwick, and Steven M Kemp. Preferences and reasons for communicating probabilistic information in verbal or numerical terms. Bulletin of the Psychonomic Society, 31(2):135–138, 1993

Showing first 80 references.

[1]

Amos Azaria and Tom Mitchell. The internal state of an LLM knows when it’s lying. In Houda Bouamor, Juan Pino, and Kalika Bali, editors, Findings of the Association for Computational Linguistics: EMNLP 2023, pages 967–976, Singapore, December 2023. Association for Computational Linguistics. doi: 10.18653/v1/2023.findings-emnlp.68. URL https:// aclanthology...

[2]

Neil Band, Xuechen Li, Tengyu Ma, and Tatsunori Hashimoto. Linguistic calibration of long-form generations, 2024. URL https://arxiv.org/abs/2404.00474

[3]

Evan Becker and Stefano Soatto. Cycles of thought: Measuring LLM confidence through stable explanations, 2024. URL https://arxiv.org/abs/2406.03441

[4]

Catarina G Belém, Markelle Kelly, Mark Steyvers, Sameer Singh, and Padhraic Smyth. Perceptions of linguistic uncertainty by language models and humans. In Yaser Al-Onaizan, Mohit Bansal, and Yun-Nung Chen, editors, Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pages 8467–8502, Miami, Florida, USA, November 2024. Association for Computational Linguistics. doi: 10.18653/v1/2024.emnlp-main.483. URL https://aclanthology.org/2024.emnlp-main.483/

[6]

David V Budescu and Thomas S Wallsten. Consistency in interpretation of probabilistic phrases. Organizational Behavior and Human Decision Processes, 36(3):391–405, 1985. ISSN 0749-5978. doi: https://doi.org/10.1016/0749-5978(85)90007-X. URL https://www.sciencedirect.com/science/article/pii/074959788590007X

[7]

Collin Burns, Haotian Ye, Dan Klein, and Jacob Steinhardt. Discovering latent knowledge in language models without supervision, 2024. URL https://arxiv.org/abs/2212.03827

[8]

Carrie J. Cai, Samantha Winter, David Steiner, Lauren Wilcox, and Michael Terry. "Hello AI": Uncovering the onboarding needs of medical practitioners for human-AI collaborative decision-making. Proc. ACM Hum.-Comput. Interact., 3(CSCW), November 2019. doi: 10.1145/3359206. URL https://doi.org/10.1145/3359206

[9]

Arslan Chaudhry, Sridhar Thiagarajan, and Dilan Gorur. Finetuning language models to emit linguistic expressions of uncertainty, 2024. URL https://arxiv.org/abs/2409.12180

[10]

Jiuhai Chen and Jonas Mueller. Quantifying uncertainty in answers from any language model and enhancing their trustworthiness. In Lun-Wei Ku, Andre Martins, and Vivek Srikumar, editors, Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 5186–5200, Bangkok, Thailand, August 2024. Association for Computational Linguistics. doi: 10.18653/v1/2024.acl-long.283. URL https://aclanthology.org/2024.acl-long.283/

[12]

Peter Clark, Isaac Cowhey, Oren Etzioni, Tushar Khot, Ashish Sabharwal, Carissa Schoenick, and Oyvind Tafjord. Think you have solved question answering? Try ARC, the AI2 reasoning challenge. arXiv:1803.05457v1, 2018

[13]

Matthew Dahl, Varun Magesh, Mirac Suzgun, and Daniel E. Ho. Large Legal Fictions: Profiling Legal Hallucinations in Large Language Models, 2024

[14]

Mandeep K Dhami and David R Mandel. Communicating uncertainty using words and numbers. Trends in Cognitive Sciences, 26(6):514–526, 2022

[15]

Jinhao Duan, Hao Cheng, Shiqi Wang, Alex Zavalny, Chenan Wang, Renjing Xu, Bhavya Kailkhura, and Kaidi Xu. Shifting attention to relevance: Towards the predictive uncertainty quantification of free-form large language models. In Lun-Wei Ku, Andre Martins, and Vivek Srikumar, editors, Proceedings of the 62nd Annual Meeting of the Association for Computation...

doi: 10.18653/v1/2024.acl-long.276

[16]

Bryan Eikema, Evgenia Ilia, José G. C. de Souza, Chrysoula Zerva, and Wilker Aziz. Teaching language models to faithfully express their uncertainty, 2025. URL https://arxiv.org/abs/2510.12587

[17]

Wade Fagen-Ulmschneider. Perception of probability words, 2023. URL https://waf.cs.illinois.edu/visualizations/Perception-of-Probability-Words/

[18]

R. A. Fisher. Frequency distribution of the values of the correlation coefficient in samples from an indefinitely large population. Biometrika, 10(4):507–521, 1915. ISSN 00063444. URL http://www.jstor.org/stable/2331838

[19]

Ronald A Fisher. On the "probable error" of a coefficient of correlation deduced from a small sample. Metron, 1:3–32, 1921

[20]

Bijean Ghafouri, Shahrad Mohammadzadeh, James Zhou, Pratheeksha Nair, Jacob-Junqi Tian, Mayank Goel, Reihaneh Rabbany, Jean-François Godbout, and Kellin Pelrine. Epistemic integrity in large language models. In NeurIPS Safe Generative AI Workshop 2024, 2024. URL https://openreview.net/forum?id=o3wQbxRaKo

[21]

Google DeepMind. Gemini 2.5 flash model card. https://storage.googleapis.com/deepmind-media/Model-Cards/Gemini-2-5-Flash-Model-Card.pdf, 2025

[22]

Google DeepMind. Gemini 3 flash model card. https://storage.googleapis.com/deepmind-media/Model-Cards/Gemini-3-Flash-Model-Card.pdf, 2025

[23]

Google DeepMind. Gemini 3.1 pro model card. https://storage.googleapis.com/deepmind-media/Model-Cards/Gemini-3-1-Pro-Model-Card.pdf, 2026

[24]

Aaron Grattafiori, Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Alex Vaughan, Amy Yang, Angela Fan, Anirudh Goyal, Anthony Hartshorn, Aobo Yang, Archi Mitra, Archie Sravankumar, Artem Korenev, Arthur Hinsvark, Arun Rao, Aston Zhang, Aurelien Rodriguez, Austen Gregerson, Ava...

arXiv, 2024

[25]

Yashvir S. Grewal, Edwin V. Bonilla, and Thang D. Bui. Improving uncertainty quantification in large language models via semantic embeddings, 2024. URL https://arxiv.org/abs/2410.22685

[26]

Sree Harsha Tanneru, Chirag Agarwal, and Himabindu Lakkaraju. Quantifying uncertainty in natural language explanations of large language models. In Sanjoy Dasgupta, Stephan Mandt, and Yingzhen Li, editors, Proceedings of The 27th International Conference on Artificial Intelligence and Statistics, volume 238 of Proceedings of Machine Learning Research, pages...

2024

[27]

Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, and Jacob Steinhardt. Measuring massive multitask language understanding, 2021. URL https://arxiv.org/abs/2009.03300

[28]

Bairu Hou, Yujian Liu, Kaizhi Qian, Jacob Andreas, Shiyu Chang, and Yang Zhang. Decomposing uncertainty for large language models through input clarification ensembling, 2024. URL https://arxiv.org/abs/2311.08718

[29]

Hsiu-Yuan Huang, Yutong Yang, Zhaoxi Zhang, Sanwoo Lee, and Yunfang Wu. A survey of uncertainty estimation in LLMs: Theory meets practice, 2024. URL https://arxiv.org/abs/2410.15326

[30]

Lei Huang, Weijiang Yu, Weitao Ma, Weihong Zhong, Zhangyin Feng, Haotian Wang, Qianglong Chen, Weihua Peng, Xiaocheng Feng, Bing Qin, and Ting Liu. A survey on hallucination in large language models: Principles, taxonomy, challenges, and open questions. ACM Trans. Inf. Syst., 43(2), January 2025. ISSN 1046-8188. doi: 10.1145/3703155. URL https://doi.org/...

[31]

Yuheng Huang, Jiayang Song, Zhijie Wang, Shengming Zhao, Huaming Chen, Felix Juefei-Xu, and Lei Ma. Look before you leap: An exploratory study of uncertainty analysis for large language models. IEEE Transactions on Software Engineering, 51(2):413–429, February 2025. ISSN 2326-3881. doi: 10.1109/tse.2024.3519464. URL http://dx.doi.org/10.1109/TSE.2024.3519464

[32]

Ziwei Ji, Lei Yu, Yeskendir Koishekenov, Yejin Bang, Anthony Hartshorn, Alan Schelten, Cheng Zhang, Pascale Fung, and Nicola Cancedda. Calibrating verbal uncertainty as a linear feature to reduce hallucinations. arXiv preprint arXiv:2503.14477, 2025

[33]

Mingjian Jiang, Yangjun Ruan, Sicong Huang, Saifei Liao, Silviu Pitis, Roger Baker Grosse, and Jimmy Ba. Calibrating language models via augmented prompt ensembles. 2023. URL https://api.semanticscholar.org/CorpusID:271797871

[34]

Zhengping Jiang, Anqi Liu, and Benjamin Van Durme. Conformal linguistic calibration: Trading-off between factuality and specificity, 2025. URL https://arxiv.org/abs/2502.19110

[35]

Douglas Johnson, Rachel Goodman, J Patrinely, Cosby Stone, Eli Zimmerman, Rebecca Donald, Sam Chang, Sean Berkowitz, Avni Finn, Eiman Jahangir, Elizabeth Scoville, Tyler Reese, Debra Friedman, Julie Bastarache, Yuri van der Heijden, Jordan Wright, Nicholas Carter, Matthew Alexander, Jennifer Choe, Cody Chastain, John Zic, Sara Horst, Isik Turker, Rajiv Ag...

Assessing the accuracy and reliability of AI-generated medical responses: An evaluation of the Chat-GPT model. Research Square, page rs.3.rs—2566942, February

ISSN 2693-5015. doi: 10.21203/rs.3.rs-2566942/v1. URL https://europepmc.org/articles/PMC10002821

[37]

Marie Juanchich, Amélie Gourdon-Kanhukamwe, and Miroslav Sirota. “I am uncertain” vs “it is uncertain”. How linguistic markers of the uncertainty source affect uncertainty communication. Judgment and Decision Making, 12(5):445–465, 2017. doi: 10.1017/S1930297500006483

[38]

Language Models (Mostly) Know What They Know

Saurav Kadavath, Tom Conerly, Amanda Askell, Tom Henighan, Dawn Drain, Ethan Perez, Nicholas Schiefer, Zac Hatfield-Dodds, Nova DasSarma, Eli Tran-Johnson, Scott Johnston, Sheer El-Showk, Andy Jones, Nelson Elhage, Tristan Hume, Anna Chen, Yuntao Bai, Sam Bowman, Stanislav Fort, Deep Ganguli, Danny Hernandez, Josh Jacobson, Jackson Kernion, Shauna Kravec,...

arXiv, 2022

[39]

Ramneet Kaur, Colin Samplawski, Adam D. Cobb, Anirban Roy, Brian Matejek, Manoj Acharya, Daniel Elenius, Alexander Michael Berenbeim, John A. Pavlik, Nathaniel D. Bastian, and Susmit Jha. Addressing uncertainty in LLMs to enhance reliability in generative AI. In NeurIPS Safe Generative AI Workshop 2024, 2024. URL https://openreview.net/forum?id=Z3DS4Pcxct

[40]

Sunnie S. Y. Kim, Q. Vera Liao, Mihaela Vorvoreanu, Stephanie Ballard, and Jennifer Wortman Vaughan. "I’m not sure, but...": Examining the impact of large language models’ uncertainty expression on user reliance and trust. In Proceedings of the 2024 ACM Conference on Fairness, Accountability, and Transparency, FAccT ’24, page 822–835, New York, NY, USA,...

doi: 10.1145/3630106.3658941

[41]

Svenja Kranich. To hedge or not to hedge: The use of epistemic modal expressions in popular science in English texts, English–German translations, and German original texts. 2011. URL https://api.semanticscholar.org/CorpusID:154907527

[42]

Lorenz Kuhn, Yarin Gal, and Sebastian Farquhar. Semantic uncertainty: Linguistic invariances for uncertainty estimation in natural language generation. In The Eleventh International Conference on Learning Representations, 2023. URL https://openreview.net/forum?id=VD-AYtP0dve

[43] [43]

Dai, Jakob Uszkoreit, Quoc Le, and Slav Petrov

Tom Kwiatkowski, Jennimaria Palomaki, Olivia Redfield, Michael Collins, Ankur Parikh, Chris Alberti, Danielle Epstein, Illia Polosukhin, Jacob Devlin, Kenton Lee, Kristina Toutanova, Llion Jones, Matthew Kelcey, Ming-Wei Chang, Andrew M. Dai, Jakob Uszkoreit, Quoc Le, and Slav Petrov. Natural questions: A benchmark for question answering research. Transac...

2019

[44] [44]

Hedges: A study in meaning criteria and the logic of fuzzy concepts.Journal of philosophical logic, 2(4):458–508, 1973

George Lakoff. Hedges: A study in meaning criteria and the logic of fuzzy concepts.Journal of philosophical logic, 2(4):458–508, 1973

1973

[45] [45]

Hedges in japanese conversation: The influence of age, sex, and formality

Shizuka Lauwereyns. Hedges in japanese conversation: The influence of age, sex, and formality. Language Variation and Change, 14(2):239–259, 2002. doi: 10.1017/S0954394502142049

work page doi:10.1017/s0954394502142049 2002

[46] [46]

Are llm-judges robust to expressions of uncertainty? investigating the effect of epistemic markers on llm-based evaluation

Dongryeol Lee, Yerin Hwang, Yongil Kim, Joonsuk Park, and Kyomin Jung. Are llm-judges robust to expressions of uncertainty? investigating the effect of epistemic markers on llm-based evaluation. InProceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume ...

2025

[47] [47]

The Winograd schema challenge

Hector J Levesque, Ernest Davis, and Leora Morgenstern. The Winograd schema challenge. InAAAI Spring Symposium: Logical Formalizations of Commonsense Reasoning, volume 46, page 47, 2011

2011

[48] [48]

LegalAgentBench: Evaluating LLM agents in legal domain

Haitao Li, Junjie Chen, Jingli Yang, Qingyao Ai, Wei Jia, Youfeng Liu, Kai Lin, Yueyue Wu, Guozhi Yuan, Yiran Hu, Wuyue Wang, Yiqun Liu, and Minlie Huang. LegalAgentBench: Evaluating LLM agents in legal domain. In Wanxiang Che, Joyce Nabende, Ekaterina Shutova, and Mohammad Taher Pilehvar, editors,Proceedings of the 63rd Annual Meeting of the Association ...

work page doi:10.18653/v1/2025.acl-long.116 2025

[49] [49]

Halueval: A large-scale hallucination evaluation benchmark for large language models, 2023

Junyi Li, Xiaoxue Cheng, Wayne Xin Zhao, Jian-Yun Nie, and Ji-Rong Wen. Halueval: A large-scale hallucination evaluation benchmark for large language models, 2023. URL https://arxiv.org/abs/2305.11747

work page arXiv 2023

[50] [50]

TruthfulQA: Measuring how models mimic human false- hoods

Stephanie Lin, Jacob Hilton, and Owain Evans. TruthfulQA: Measuring how models mimic human falsehoods. In Smaranda Muresan, Preslav Nakov, and Aline Villavicen- cio, editors,Proceedings of the 60th Annual Meeting of the Association for Computa- tional Linguistics (Volume 1: Long Papers), pages 3214–3252, Dublin, Ireland, May 2022. Association for Computat...

work page doi:10.18653/v1/2022.acl-long.229 2022

[51] [51]

Teaching models to express their uncertainty in words.Transactions on Machine Learning Research, 2022

Stephanie Lin, Jacob Hilton, and Owain Evans. Teaching models to express their uncertainty in words.Transactions on Machine Learning Research, 2022. ISSN 2835-8856. URL https://openreview.net/forum?id=8s8K2UZGTZ

2022

[52] [52]

Gabrielle Kaili-May Liu, Gal Yona, Avi Caciularu, Idan Szpektor, Tim G. J. Rudner, and Arman Cohan. MetaFaith: Faithful natural language uncertainty expression in LLMs. In Christos Christodoulopoulos, Tanmoy Chakraborty, Carolyn Rose, and Violet Peng, editors, Proceedings of the 2025 Conference on Empirical Methods in Natural Language Process- ing, pages ...

work page doi:10.18653/v1/2025.emnlp-main.1505 2025

[53] [53]

Jiayu Liu, Qing Zong, Weiqi Wang, and Yangqiu Song. Revisiting epistemic markers in confidence estimation: Can markers accurately reflect large language models’ uncertainty? In Wanxiang Che, Joyce Nabende, Ekaterina Shutova, and Mohammad Taher Pilehvar, editors, Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volum...

work page doi:10.18653/v1/2025.acl-short.18 2025

[54] [54]

When not to trust language models: Investigating effectiveness and limitations of parametric and non-parametric memories.arXiv preprint, 2022

Alex Mallen, Akari Asai, Victor Zhong, Rajarshi Das, Hannaneh Hajishirzi, and Daniel Khashabi. When not to trust language models: Investigating effectiveness and limitations of parametric and non-parametric memories.arXiv preprint, 2022

2022

[55] [55]

SelfCheckGPT: Zero-resource black- box hallucination detection for generative large language models

Potsawee Manakul, Adian Liusie, and Mark Gales. SelfCheckGPT: Zero-resource black- box hallucination detection for generative large language models. In Houda Bouamor, Juan Pino, and Kalika Bali, editors,Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 9004–9017, Singapore, December 2023. Association for Computa...

work page doi:10.18653/v1/2023.emnlp-main.557 2023

[56]

Clara Meister, Gian Wiher, Tiago Pimentel, and Ryan Cotterell. On the probability–quality paradox in language generation. In Smaranda Muresan, Preslav Nakov, and Aline Villavicencio, editors, Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 36–45, Dublin, Ireland, May 2022. Associat...

doi: 10.18653/v1/2022.acl-short.5

[57]

Sabrina J. Mielke, Arthur Szlam, Emily Dinan, and Y-Lan Boureau. Reducing conversational agents’ overconfidence through linguistic calibration. Transactions of the Association for Computational Linguistics, 10:857–872, 2022. doi: 10.1162/tacl_a_00494. URL https://aclanthology.org/2022.tacl-1.50/

[58]

Sewon Min, Julian Michael, Hannaneh Hajishirzi, and Luke Zettlemoyer. AmbigQA: Answering ambiguous open-domain questions. In Bonnie Webber, Trevor Cohn, Yulan He, and Yang Liu, editors, Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 5783–5797, Online, November 2020. Association for Computational ...

doi: 10.18653/v1/2020.emnlp-main.466

[59]

Pilar Mur-Dueñas. There may be differences: Analysing the use of hedges in English and Spanish research articles. Lingua, 260:103131, 2021. ISSN 0024-3841. doi: https://doi.org/10.1016/j.lingua.2021.103131. URL https://www.sciencedirect.com/science/article/pii/S0024384121001030

[60]

Leann Myers and Maria J. Sirois. Spearman Correlation Coefficients, Differences between. John Wiley & Sons, Ltd, 2014. ISBN 9781118445112. doi: https://doi.org/10.1002/9781118445112.stat02802. URL https://onlinelibrary.wiley.com/doi/abs/10.1002/9781118445112.stat02802

[61]

Thu Nguyen Thi Thuy. A corpus-based study on cross-cultural divergence in the use of hedges in academic research articles written by Vietnamese and native English-speaking authors. Social Sciences, 7(4), 2018. ISSN 2076-0760. doi: 10.3390/socsci7040070. URL https://www.mdpi.com/2076-0760/7/4/70

[62]

Alexander Nikitin, Jannik Kossen, Yarin Gal, and Pekka Marttinen. Kernel language entropy: Fine-grained uncertainty quantification for LLMs from semantic similarities, 2024. URL https://arxiv.org/abs/2405.20003

[63]

Pia Pennekamp, Jamal K Mansour, and Rhiannon J Batstone. Variability in verbal eyewitness confidence. Applied Cognitive Psychology, 38(2):e4190, 2024

[64]

Mauricio Rivera, Jean-François Godbout, Reihaneh Rabbany, and Kellin Pelrine. Combining confidence elicitation and sample-based methods for uncertainty quantification in misinformation mitigation. In Raúl Vázquez, Hande Celikkanat, Dennis Ulmer, Jörg Tiedemann, Swabha Swayamdipta, Wilker Aziz, Barbara Plank, Joris Baan, and Marie-Catherine de Marneffe,...

doi: 10.18653/v1/2024.uncertainlp-1.12

[65]

Maohao Shen, Subhro Das, Kristjan Greenewald, Prasanna Sattigeri, Gregory Wornell, and Soumya Ghosh. Thermometer: Towards universal calibration for large language models, 2024. URL https://arxiv.org/abs/2403.08819

[66]

Vaishnavi Shrivastava, Percy Liang, and Ananya Kumar. Llamas know what GPTs don’t show: Surrogate models for confidence estimation, 2023. URL https://arxiv.org/abs/2311.08877

[67]

N Clayton Silver and William P Dunlap. Averaging correlation coefficients: should Fisher’s z transformation be used? Journal of Applied Psychology, 72(1):146, 1987

[68]

Adi Simhi, Itay Itzhak, Fazl Barez, Gabriel Stanovsky, and Yonatan Belinkov. Trust me, I’m wrong: High-certainty hallucinations in LLMs. arXiv preprint arXiv:2502.12964, 2025

[69]

OpenAI GPT-5 System Card

Aaditya Singh, Adam Fry, Adam Perelman, Adam Tart, Adi Ganesh, Ahmed El-Kishky, Aidan McLaughlin, Aiden Low, AJ Ostrow, Akhila Ananthram, Akshay Nathan, Alan Luo, Alec Helyar, Aleksander Madry, Aleksandr Efremov, Aleksandra Spyra, Alex Baker-Whitcomb, Alex Beutel, Alex Karpenko, Alex Makelov, Alex Neitz, Alex Wei, Alexandra Barr, Alexandre Kirchmeyer, Ale...

arXiv, 2025

[70]

Zhangde Song, Jieyu Lu, Yuanqi Du, Botao Yu, Thomas M. Pruyn, Yue Huang, Kehan Guo, Xiuzhe Luo, Yuanhao Qu, Yi Qu, Yinkai Wang, Haorui Wang, Jeff Guo, Jingru Gan, Parshin Shojaee, Di Luo, Andres M Bran, Gen Li, Qiyuan Zhao, Shao-Xiong Lennon Luo, Yuxuan Zhang, Xiang Zou, Wanru Zhao, Yifan F. Zhang, Wucheng Zhang, Shunan Zheng, Saiyang Zhang, Sartaaj Takri...

arXiv, 2025

[71]

Elias Stengel-Eskin, Peter Hase, and Mohit Bansal. LACIE: Listener-aware finetuning for calibration in large language models. In The Thirty-eighth Annual Conference on Neural Information Processing Systems, 2024. URL https://openreview.net/forum?id=RnvgYd9RAh

[72]

Mark Steyvers and Megan AK Peters. Metacognition and uncertainty communication in humans and large language models. Current Directions in Psychological Science, page 09637214251391158, 2025

[73]

Mark Steyvers, Heliodoro Tejeda, Aakriti Kumar, Catarina Belem, Sheer Karny, Xinyue Hu, Lukas W Mayer, and Padhraic Smyth. What large language models know and what people think they know. Nature Machine Intelligence, 7(2):221–231, 2025

[74]

Zhisheng Tang, Ke Shen, and Mayank Kejriwal. An evaluation of estimative uncertainty in large language models. npj Complexity, 3(1):8, 2026

[75]

Linwei Tao, Yi-Fan Yeh, Bo Kai, Minjing Dong, Tao Huang, Tom A Lamb, Jialin Yu, Philip HS Torr, and Chang Xu. Can large language models express uncertainty like human? arXiv preprint arXiv:2509.24202, 2025

[76]

Katherine Tian, Eric Mitchell, Allan Zhou, Archit Sharma, Rafael Rafailov, Huaxiu Yao, Chelsea Finn, and Christopher Manning. Just ask for calibration: Strategies for eliciting calibrated confidence scores from language models fine-tuned with human feedback. In Houda Bouamor, Juan Pino, and Kalika Bali, editors, Proceedings of the 2023 Conference on Empiri...

[77]

Association for Computational Linguistics. doi: 10.18653/v1/2023.emnlp-main.330. URL https://aclanthology.org/2023.emnlp-main.330/

[78]

Katherine Tian, Eric Mitchell, Huaxiu Yao, Christopher D Manning, and Chelsea Finn. Fine-tuning language models for factuality. In The Twelfth International Conference on Learning Representations, 2024. URL https://openreview.net/forum?id=WPZ2yPag4K

[79]

SM Tonmoy, SM Zaman, Vinija Jain, Anku Rani, Vipula Rawte, Aman Chadha, and Amitava Das. A comprehensive survey of hallucination mitigation techniques in large language models. arXiv preprint arXiv:2401.01313, 6, 2024

[80]

Thomas S Wallsten, David V Budescu, Rami Zwick, and Steven M Kemp. Preferences and reasons for communicating probabilistic information in verbal or numerical terms. Bulletin of the Psychonomic Society, 31(2):135–138, 1993