Not All Uncertainty Is Equal: How Uncertainty Granularity Shapes Human Verification in LLM-Assisted Decision Making

Mauricio Villavicencio; Qianwen Wang; Sitong Pan

arxiv: 2605.28571 · v1 · pith:Q7FYNG36new · submitted 2026-05-27 · 💻 cs.HC

Pith reviewed 2026-06-29 09:57 UTC · model grok-4.3

classification 💻 cs.HC

keywords uncertainty granularity · LLM-assisted decision making · human verification · trust calibration · medical decision support · AI reliability · user behavior

The pith

Uncertainty shown at the token level increases agreement with LLM answers while relation-level uncertainty reduces external verification.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper investigates uncertainty granularity in LLM responses, defined as expressing confidence at output, relation, or token levels within a single answer. A between-subjects experiment with 192 participants answering medical questions found that token-level uncertainty raised agreement with the AI, output- and relation-level uncertainty lowered users' confidence in their own answers, and relation-level uncertainty decreased external checks such as internet searches. These patterns indicate that the way uncertainty is broken down can steer users toward or away from independent fact-checking. The results supply concrete guidance for designing uncertainty displays that support appropriate reliance on LLMs.

Core claim

Token-level uncertainty increased users' agreement with the AI; output- and relation-level uncertainty did not increase agreement but reduced users' confidence in their own answers; relation-level uncertainty also reduced external verification behaviors such as internet searches and URL checks.

What carries the argument

Uncertainty granularity, the extent to which uncertainty is expressed at different levels (output-level for the entire response, relation-level for individual reasoning steps, token-level for specific words) within an LLM response.
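
To make the three levels concrete, here is a minimal Python sketch, not the authors' implementation. It assumes each token already carries an uncertainty score in [0, 1] and that coarser levels are plain averages of finer ones; the summary above does not state the aggregation rule, so `Relation`, `relation_uncertainty`, `output_uncertainty`, and the toy scores are all hypothetical.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class Relation:
    """One reasoning step: its tokens and a hypothetical score per token."""
    tokens: list
    token_uncertainty: list  # assumed to lie in [0, 1], one value per token


def relation_uncertainty(rel):
    # Assumption: a reasoning step is scored by the mean of its token scores.
    return mean(rel.token_uncertainty)


def output_uncertainty(relations):
    # Assumption: the whole response is scored by the mean over all tokens.
    return mean(u for rel in relations for u in rel.token_uncertainty)


# Toy response with placeholder tokens and made-up scores (not study data).
response = [
    Relation(["w1", "w2", "w3"], [0.1, 0.2, 0.3]),
    Relation(["w4", "w5"], [0.6, 0.8]),
]

token_level = [u for rel in response for u in rel.token_uncertainty]  # one per token
relation_level = [relation_uncertainty(rel) for rel in response]      # one per step
output_level = output_uncertainty(response)                           # one per response
```

The same underlying token scores yield many, a few, or a single displayed value depending on the level, which is the property the study manipulates.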

If this is right

Token-level uncertainty may increase acceptance of LLM answers without additional user scrutiny.
Relation-level uncertainty may steer users away from independent fact-checking toward reliance on the provided cues.
Output- and relation-level uncertainty may lower users' confidence in their own judgments without raising agreement with the AI.
Design choices for uncertainty communication should select granularity levels according to the desired balance between trust calibration and verification behavior.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

Designers could experiment with hybrid displays that switch granularity based on task risk or user expertise to tune verification rates.
The findings may extend to non-medical domains if similar between-subjects designs are run on factual or technical questions.
Future work could test whether combining granularity levels with other trust-calibration cues produces additive or interactive effects on verification.

Load-bearing premise

Differences in user behavior can be attributed to the granularity level rather than to how the underlying LLM outputs were generated, how uncertainty values were computed, or how the displays were presented.

What would settle it

A controlled replication that holds the LLM outputs and underlying uncertainty values identical across conditions, varies only the granularity at which uncertainty is displayed, and finds no differences in agreement rates, self-confidence, or external verification rates would falsify the claim.

Figures

Figures reproduced from arXiv: 2605.28571 by Mauricio Villavicencio, Qianwen Wang, Sitong Pan.

**Figure 1.** Uncertainty granularity in LLM responses. (a) Output-level uncertainty displays a single score for the entire response. (b) Finer-grained uncertainty highlights varying uncertainty across reasoning steps (relation-level) and individual words (token-level). Cat and AI assistant icons by Freepik from Flaticon (https://www.flaticon.com)

**Figure 2.** LLM uncertainty quantification and visualization across…

**Figure 3.** Study Procedure and Task. (a) An overview of the user study workflow. Participants completed onboarding, answered eight information-seeking questions under one of four conditions, and then completed a post-study questionnaire. (b) A screenshot of the in-study task interface. Participants are required to complete the question-answering in-study task (B3) for all eight questions under the assigned condition.…

**Figure 4.** Comparison of DVs across conditions. We show model-estimated marginal means from the confirmatory mixed-effects…

**Figure 5.** Q1 – Baseline condition

**Figure 6.** Q1 – Output-Level UQ condition

**Figure 7.** Q1 – Relation-Level UQ condition

**Figure 8.** Q1 – Token-Level UQ condition

**Figure 9.** Q2 – Baseline condition

**Figure 10.** Q2 – Output-Level UQ condition

**Figure 11.** Q2 – Relation-Level UQ condition

**Figure 12.** Q2 – Token-Level UQ condition

**Figure 13.** Q3 – Baseline condition

**Figure 14.** Q3 – Output-Level UQ condition

**Figure 15.** Q3 – Relation-Level UQ condition

**Figure 16.** Q3 – Token-Level UQ condition

**Figure 17.** Q4 – Baseline condition

**Figure 18.** Q4 – Output-Level UQ condition

**Figure 19.** Q4 – Relation-Level UQ condition

**Figure 20.** Q4 – Token-Level UQ condition

**Figure 21.** Q5 – Baseline condition

**Figure 22.** Q5 – Output-Level UQ condition

**Figure 23.** Q5 – Relation-Level UQ condition

**Figure 24.** Q5 – Token-Level UQ condition

**Figure 25.** Q6 – Baseline condition

**Figure 26.** Q6 – Output-Level UQ condition

**Figure 27.** Q6 – Relation-Level UQ condition

**Figure 28.** Q6 – Token-Level UQ condition

**Figure 29.** Q7 – Baseline condition

**Figure 30.** Q7 – Output-Level UQ condition

**Figure 31.** Q7 – Relation-Level UQ condition

**Figure 32.** Q7 – Token-Level UQ condition

**Figure 33.** Q8 – Baseline condition

**Figure 34.** Q8 – Output-Level UQ condition

**Figure 35.** Q8 – Relation-Level UQ condition

**Figure 36.** Q8 – Token-Level UQ condition

Original abstract

Despite warnings that LLMs can make mistakes, users often develop inappropriate trust and accept incorrect answers without critical evaluation. Uncertainty quantification (UQ), displaying LLMs' confidence, has emerged as a promising approach to calibrate user trust. However, prior empirical studies on uncertainty communication have treated uncertainty as a single numerical score or simple natural language expression. This simplification fails to capture a key property of LLM outputs: a single response often comprises multiple claims and reasoning steps, each with distinct levels of uncertainty. To address this gap, this study investigates uncertainty granularity (i.e., the extent to which uncertainty is expressed at different levels within an LLM response) and examines its impact on LLM-assisted decision-making. We conducted a large-scale, between-subjects study (N=192) in which participants answered medical questions using LLMs that displayed uncertainty at three different granularities: output-level (entire response), relation-level (individual reasoning steps), and token-level (specific words). Our findings reveal distinct behavioral effects as a function of uncertainty granularity. Token-level uncertainty increased users' agreement with the AI, whereas output- and relation-level uncertainty did not increase agreement but instead reduced users' confidence in their own answers. Notably, relation-level uncertainty also reduced external verification (i.e., internet searches, checking provided URLs), steering users away from independent fact-checking and toward reliance on the LLM and its accompanying uncertainty cues. Our findings demonstrate that uncertainty granularity significantly shapes how users interact with and verify LLM outputs, providing concrete design guidance for building responsible LLM applications that encourage appropriate skepticism and verification behaviors.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

Token-level uncertainty raised agreement while relation-level cut verification and self-confidence, but the abstract leaves the implementation details too vague to rule out confounds.

The main thing to know is that this between-subjects study with 192 participants found different behavioral patterns depending on whether uncertainty was shown for the whole output, for individual reasoning steps, or for specific tokens. Token-level display increased agreement with the LLM; relation-level display lowered participants' own confidence and reduced how often they checked external sources.

What is new is the direct comparison of these three granularity levels inside the same response on medical questions. Earlier work mostly used a single score or phrase for the entire answer, so isolating the scope of the uncertainty signal addresses a stated gap.

The study is a standard HCI experiment with a plausible sample size and a between-subjects design that avoids carry-over effects. That part is straightforward and worth having on record.

The soft spots are real and sit right where the stress-test note points. The abstract reports no statistical results, effect sizes, or controls for LLM output variability, and it gives no information on whether the base responses, uncertainty values, or visual presentations were held constant across conditions. If token-level displays involved extra highlighting or different prompting, those differences could explain the patterns without any special role for granularity. Until the methods section shows the conditions were matched on everything except scope, the causal claim stays under-specified.

This paper is for HCI researchers and interface designers working on LLM decision support, especially in domains where verification behavior matters. A reader looking for concrete design ideas on uncertainty presentation would find the setup relevant.

It deserves peer review. The question is practical, the design is simple, and the gap it targets is real; the current draft just needs the methods and results written out in full so referees can judge whether the attribution holds.

Referee Report

2 major / 1 minor

Summary. The manuscript reports results from a between-subjects experiment (N=192) on how uncertainty granularity in LLM responses—output-level, relation-level, or token-level—affects user agreement with the LLM, self-confidence, and external verification behaviors during medical question answering. The central claim is that granularity produces distinct behavioral patterns, with token-level increasing agreement and relation-level reducing both self-confidence and external verification.

Significance. If the experimental controls are sound, the work supplies concrete, granularity-specific evidence on trust calibration and verification in LLM-assisted decisions, which is directly relevant to HCI design guidelines for uncertainty communication in high-stakes domains.

major comments (2)

[Methods] Methods section: the implementation details do not establish that the base LLM responses (specific claims, reasoning steps, and underlying uncertainty values) were identical across the three granularity conditions. Without this control, observed differences in agreement and verification cannot be unambiguously attributed to granularity rather than to systematic differences in content, uncertainty computation, or display properties.
[Results] Results section: the reported behavioral effects lack accompanying statistical test details, effect sizes, or power analyses in the abstract and main text summary, leaving the strength and reliability of the granularity-dependent patterns difficult to evaluate.

minor comments (1)

[Abstract] Abstract: no quantitative results, p-values, or effect sizes are supplied. Reporting them is standard practice for empirical HCI studies and would make the summary more informative.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for these constructive comments on experimental controls and statistical reporting. We address each point below with clarifications from the study design and indicate planned revisions.

Referee: [Methods] Methods section: the implementation details do not establish that the base LLM responses (specific claims, reasoning steps, and underlying uncertainty values) were identical across the three granularity conditions. Without this control, observed differences in agreement and verification cannot be unambiguously attributed to granularity rather than to systematic differences in content, uncertainty computation, or display properties.

Authors: The base LLM outputs were generated once per question using a fixed model, prompt, and temperature setting; uncertainty values for claims and tokens were computed from the same logits and then selectively rendered according to the assigned granularity condition. The three conditions therefore differed only in presentation format, not in underlying content or numerical uncertainty estimates. We will revise the Methods section to include explicit confirmation of this shared generation pipeline, along with pseudocode and example outputs demonstrating identical base responses across conditions. revision: yes
Referee: [Results] Results section: the reported behavioral effects lack accompanying statistical test details, effect sizes, or power analyses in the abstract and main text summary, leaving the strength and reliability of the granularity-dependent patterns difficult to evaluate.

Authors: The full Results section reports ANOVA and post-hoc tests with exact p-values, Cohen’s d effect sizes, and a pre-registered power analysis (target power 0.80, achieved with N=192). These details appear in the body but were condensed in the abstract and opening summary. We will expand the abstract and main-text summary to include key statistical results and effect sizes while remaining within length limits. revision: yes
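
The shared generation pipeline described in the first response can be sketched as follows. This is an illustrative Python sketch under stated assumptions, not the study's code: the condition labels, the bracketed text annotations, the mean aggregation, and the helpers `render` and `strip_marks` are hypothetical. It renders one fixed set of tokens and scores four ways, then checks that removing the annotations recovers the same base text in every condition.

```python
import re

# Hypothetical labels: a baseline plus the three uncertainty granularities.
CONDITIONS = ("baseline", "output", "relation", "token")


def _mean(xs):
    return sum(xs) / len(xs)


def render(relations, condition):
    """Render one fixed response under a single display condition.

    relations: list of (tokens, scores) pairs shared by all conditions.
    """
    if condition not in CONDITIONS:
        raise ValueError(f"unknown condition: {condition}")
    if condition == "token":
        # Annotate every token with its own score.
        return " ".join(
            f"{tok}[{score:.2f}]"
            for tokens, scores in relations
            for tok, score in zip(tokens, scores)
        )
    if condition == "relation":
        # Annotate each reasoning step with an assumed mean of its token scores.
        return " ".join(
            f"{' '.join(tokens)} [{_mean(scores):.2f}]" for tokens, scores in relations
        )
    text = " ".join(" ".join(tokens) for tokens, _ in relations)
    if condition == "output":
        # Annotate the whole response with an assumed mean over all tokens.
        all_scores = [s for _, scores in relations for s in scores]
        return f"{text} [{_mean(all_scores):.2f}]"
    return text  # baseline: no uncertainty shown


def strip_marks(rendered):
    """Remove bracketed annotations to recover the underlying text."""
    return re.sub(r"\s*\[[^\]]*\]", "", rendered)


# Placeholder tokens and made-up scores (not study data).
base = [(["w1", "w2"], [0.1, 0.3]), (["w3", "w4"], [0.5, 0.5])]
views = {c: render(base, c) for c in CONDITIONS}
matched = all(strip_marks(v) == views["baseline"] for v in views.values())
```

A check like `matched` is the kind of evidence the referee asks for: it holds only if every condition is derived from the same base response.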

Circularity Check

0 steps flagged

No circularity: purely empirical behavioral study with no derivations or self-referential chains

This paper reports a between-subjects experiment (N=192) comparing user behavior across three uncertainty-granularity conditions in an LLM-assisted medical QA task. The abstract and described methods contain no equations, fitted parameters, uniqueness theorems, ansatzes, or self-citations that serve as load-bearing premises. All central claims rest on observed differences in agreement, self-confidence, and verification rates collected from participants; these outcomes are not equivalent to the input conditions by construction. The study is therefore self-contained against external benchmarks and receives the default non-circularity finding.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

Empirical user study with no mathematical model. Relies on standard assumptions of experimental psychology and statistics.

axioms (1)

domain assumption: The between-subjects design with randomization avoids carry-over effects, and self-reported agreement, confidence, and verification behaviors are valid proxies for the constructs of interest.
The study design and measures depend on these standard experimental assumptions holding.
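
One way to realize the between-subjects assignment this axiom relies on is balanced random allocation. The sketch below is a hedged Python illustration, not the authors' procedure: the function `assign_conditions`, the condition labels, the seed, and the equal cell sizes are assumptions. N=192 and the four conditions come from the study description; 48 per condition is only what an even split would give.

```python
import random
from collections import Counter


def assign_conditions(participant_ids, conditions, seed=0):
    """Randomly assign each participant to exactly one condition, in equal cells."""
    ids = list(participant_ids)
    if len(ids) % len(conditions) != 0:
        raise ValueError("participants cannot be split evenly across conditions")
    rng = random.Random(seed)  # fixed seed so the allocation is reproducible
    rng.shuffle(ids)
    per_cell = len(ids) // len(conditions)
    # After shuffling, consecutive blocks of per_cell ids share a condition.
    return {pid: conditions[i // per_cell] for i, pid in enumerate(ids)}


# Hypothetical labels for the four study conditions.
conditions = ["baseline", "output", "relation", "token"]
assignment = assign_conditions(range(192), conditions)
cell_sizes = Counter(assignment.values())
```

Because each participant sees only one condition, no participant's responses under one display can be influenced by exposure to another.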

pith-pipeline@v0.9.1-grok · 5818 in / 1235 out tokens · 28185 ms · 2026-06-29T09:57:00.384029+00:00 · methodology

Reference graph

Works this paper leans on

47 extracted references · 12 canonical work pages · 4 internal anchors

[1] Yavuz Faruk Bakman, Duygu Nur Yaldiz, Baturalp Buyukates, Chenyang Tao, Dimitrios Dimitriadis, and Salman Avestimehr. 2024. Mars: Meaning-aware response scoring for uncertainty estimation in generative llms. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 7752–7767.

[2] Neil Band, Xuechen Li, Tengyu Ma, and Tatsunori Hashimoto. 2024. Linguistic Calibration of Long-Form Generations. In International Conference on Machine Learning. PMLR, 2732–2778.

[3] Gagan Bansal, Tongshuang Wu, Joyce Zhou, Raymond Fok, Besmira Nushi, Ece Kamar, Marco Tulio Ribeiro, and Daniel Weld. 2021. Does the whole exceed its parts? the effect of ai explanations on complementary team performance. In Proceedings of the 2021 CHI conference on human factors in computing systems. 1–16.

[4] Evan Becker and Stefano Soatto. 2024. Cycles of thought: Measuring llm confidence through stable explanations. arXiv preprint arXiv:2406.03441 (2024).

[5] Asma Ben Abacha and Dina Demner-Fushman. 2019. A question-entailment approach to question answering. BMC bioinformatics 20, 1 (2019), 511.

[6] Virginia Braun and Victoria Clarke. 2006. Using thematic analysis in psychology. Qualitative Research in Psychology 3, 2 (2006), 77–101.

[7] Zana Buçinca, Maja Barbara Malaya, and Krzysztof Z Gajos. 2021. To trust or to think: cognitive forcing functions can reduce overreliance on AI in AI-assisted decision-making. Proceedings of the ACM on Human-computer Interaction 5, CSCW1 (2021), 1–21.

[8] Carrie J Cai, Samantha Winter, David Steiner, Lauren Wilcox, and Michael Terry. 2019. Hello AI: uncovering the onboarding needs of medical practitioners for human-AI collaborative decision-making. Proceedings of the ACM on Human-computer Interaction 3, CSCW (2019), 1–24.

[9] Shiye Cao, Anqi Liu, and Chien-Ming Huang. 2024. Designing for appropriate reliance: The roles of AI uncertainty presentation, initial user decision, and user demographics in AI-assisted decision-making. Proceedings of the ACM on Human-Computer Interaction 8, CSCW1 (2024), 1–32.

[10] Federico Maria Cau, Hanna Hauptmann, Lucio Davide Spano, and Nava Tintarev. 2023. Effects of AI and logic-style explanations on users’ decisions under different levels of uncertainty. ACM Transactions on Interactive Intelligent Systems 13, 4 (2023), 1–42.

FAccT ’26, June 25–28, 2026, Montreal, QC, Canada Villavicencio, Pan, Wang

[11] Longchao Da, Xiaoou Liu, Jiaxin Dai, Lu Cheng, Yaqing Wang, and Hua Wei. 2025. Understanding the uncertainty of llm explanations: A perspective based on reasoning topology. arXiv preprint arXiv:2502.17026 (2025).

[12] Valdemar Danry, Pat Pataranutaporn, Yaoli Mao, and Pattie Maes. 2023. Don’t just tell me, ask me: Ai systems that intelligently frame explanations as questions improve human logical discernment accuracy over causal ai explanations. In Proceedings of the 2023 CHI Conference on Human Factors in Computing Systems. 1–13.

[13] Jinhao Duan, Hao Cheng, Shiqi Wang, Alex Zavalny, Chenan Wang, Renjing Xu, Bhavya Kailkhura, and Kaidi Xu. 2024. Shifting attention to relevance: Towards the predictive uncertainty quantification of free-form large language models. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 5050–5063.

[14] Ekaterina Fadeeva, Aleksandr Rubashevskii, Artem Shelmanov, Sergey Petrakov, Haonan Li, Hamdy Mubarak, Evgenii Tsymbalov, Gleb Kuzmin, Alexander Panchenko, Timothy Baldwin, et al. 2024. Fact-Checking the Output of Large Language Models via Token-Level Uncertainty Quantification. In Findings of the Association for Computational Linguistics ACL 2024. 9367–9385.

[15] Gabriel Freedman, Adam Dejl, Deniz Gorur, Xiang Yin, Antonio Rago, and Francesca Toni. 2025. Argumentative Large Language Models for Explainable and Contestable Claim Verification. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 39. 14930–14939.

[16] Mouadh Guesmi, Mohamed Amine Chatti, Shoeb Joarder, Qurat Ul Ain, R. Alatrash, Clara Siepmann, and Tannaz Vahidi. 2023. Interactive Explanation with Varying Level of Details in an Explainable Scientific Literature Recommender System. International Journal of Human–Computer Interaction 40 (2023), 7248–7269. https://api.semanticscholar.org/CorpusId:259129445

[17] Gaole He, Patrick Hemmer, Michael Vössing, Max Schemmer, and Ujwal Gadiraju. 2025. Fine-Grained Appropriate Reliance: Human-AI Collaboration with a Multi-Step Transparent Decision Workflow for Complex Task Decomposition. arXiv preprint arXiv:2501.10909 (2025).

[18] Saurav Kadavath, Tom Conerly, Amanda Askell, Tom Henighan, Dawn Drain, Ethan Perez, Nicholas Schiefer, Zac Hatfield-Dodds, Nova DasSarma, Eli Tran-Johnson, et al. 2022. Language models (mostly) know what they know. arXiv preprint arXiv:2207.05221 (2022).

[19] Sunnie SY Kim, Q Vera Liao, Mihaela Vorvoreanu, Stephanie Ballard, and Jennifer Wortman Vaughan. 2024. "I’m Not Sure, But...": Examining the Impact of Large Language Models’ Uncertainty Expression on User Reliance and Trust. In Proceedings of the 2024 ACM conference on Fairness, Accountability, and Transparency (FAccT). 822–835.

[20] René F Kizilcec. 2016. How much information? Effects of transparency on trust in an algorithmic interface. In Proceedings of the 2016 CHI conference on human factors in computing systems. 2390–2395.

[21] Jannik Kossen, Jiatong Han, Muhammed Razzak, Lisa Schut, Shreshth Malik, and Yarin Gal. 2024. Semantic entropy probes: Robust and cheap hallucination detection in llms. arXiv preprint arXiv:2406.15927 (2024).

[22] Lorenz Kuhn, Yarin Gal, and Sebastian Farquhar. 2023. Semantic uncertainty: Linguistic invariances for uncertainty estimation in natural language generation. arXiv preprint arXiv:2302.09664 (2023).

[23] Min Hun Lee, Renee Bao Xuan Ng, Silvana Xinyi Choo, and Shamala Thilarajah. 2024. Interactive Example-based Explanations to Improve Health Professionals’ Onboarding with AI for Human-AI Collaborative Decision Making. arXiv preprint arXiv:2409.15814 (2024).

[24] Min Hun Lee and Martyn Zhe Yu Tok. 2025. Towards Uncertainty Aware Task Delegation and Human-AI Collaborative Decision-Making. In Proceedings of the 2025 ACM Conference on Fairness, Accountability, and Transparency. 2274–2289.

[25] Zhuoran Lu and Ming Yin. 2021. Human reliance on machine learning models when performance feedback is limited: Heuristics and risks. In Proceedings of the 2021 CHI Conference on Human Factors in Computing Systems. 1–16.

[26] Andrey Malinin and Mark Gales. 2020. Uncertainty estimation in autoregressive structured prediction. arXiv preprint arXiv:2002.07650 (2020).

[27] Potsawee Manakul, Adian Liusie, and Mark Gales. 2023. Selfcheckgpt: Zero-resource black-box hallucination detection for generative large language models. In Proceedings of the 2023 conference on empirical methods in natural language processing. 9004–9017.

[28] Hannes Matuschek, Reinhold Kliegl, Shravan Vasishth, Harald Baayen, and Douglas Bates. 2017. Balancing Type I error and power in linear mixed models. Journal of Memory and Language 94 (2017), 305–315.

[29] Luise Metzger, Linda Miller, Martin Baumann, and Johannes Kraus. 2024. Empowering calibrated (dis-) trust in conversational agents: a user study on the persuasive power of limitation disclaimers vs. authoritative style. In Proceedings of the 2024 CHI Conference on Human Factors in Computing Systems. 1–19.

[30] Jakub Podolak and Rajeev Verma. 2025. Read Your Own Mind: Reasoning Helps Surface Self-Confidence Signals in LLMs. arXiv preprint arXiv:2505.23845 (2025).

[31] Snehal Prabhudesai, Leyao Yang, Sumit Asthana, Xun Huan, Q Vera Liao, and Nikola Banovic. 2023. Understanding uncertainty: how lay decision-makers perceive and interpret uncertainty in human-AI decision making. In Proceedings of the 28th International Conference on Intelligent User Interfaces (IUI). 379–396.

[32] Xin Qiu and Risto Miikkulainen. 2024. Semantic density: Uncertainty quantification for large language models through confidence measurement in semantic space. Advances in neural information processing systems 37 (2024), 134507–134533.

[33] Olya Rezaeian, Alparslan Emrah Bayrak, and Onur Asan. 2025. Explainability and AI confidence in clinical decision support systems: Effects on trust, diagnostic performance, and cognitive load in breast cancer care. International Journal of Human–Computer Interaction (2025), 1–21.

[34] Sara Salimzadeh, Gaole He, and Ujwal Gadiraju. 2024. Dealing with Uncertainty: Understanding the Impact of Prognostic Versus Diagnostic Tasks on Trust and Reliance in Human-AI Decision-Making. In Proceedings of the CHI Conference on Human Factors in Computing Systems. ACM, 1–17.

[35] Ola Shorinwa, Zhiting Mei, Justin Lidard, Allen Z Ren, and Anirudha Majumdar. 2025. A survey on uncertainty quantification of large language models: Taxonomy, open research challenges, and future directions. Comput. Surveys (2025).

[36] Mark Steyvers, Heliodoro Tejeda, Aakriti Kumar, Catarina Belem, Sheer Karny, Xinyue Hu, Lukas W Mayer, and Padhraic Smyth. 2025. What large language models know and what people think they know. Nature Machine Intelligence 7, 2 (2025), 221–231.

[37] Sree Harsha Tanneru, Chirag Agarwal, and Himabindu Lakkaraju. 2024. Quantifying uncertainty in natural language explanations of large language models. In International Conference on Artificial Intelligence and Statistics. PMLR, 1072–1080.

[38] Shuchang Tao, Liuyi Yao, Hanxing Ding, Yuexiang Xie, Qi Cao, Fei Sun, Jinyang Gao, Huawei Shen, and Bolin Ding. 2024. When to trust llms: Aligning confidence with response quality. arXiv preprint arXiv:2404.17287 (2024).

[39] Helena Vasconcelos, Gagan Bansal, Adam Fourney, Q Vera Liao, and Jennifer Wortman Vaughan. 2025. Generation probabilities are not enough: Uncertainty highlighting in ai code completions. ACM Transactions on Computer-Human Interaction 32, 1 (2025), 1–30.

[40] Qianwen Wang, Kexin Huang, Payal Chandak, Marinka Zitnik, and Nils Gehlenborg. 2022. Extending the nested model for user-centric xai: A design study on gnn-based drug repurposing. IEEE Transactions on Visualization and Computer Graphics 29, 1 (2022), 1266–1276.

[41] Miao Xiong, Zhiyuan Hu, Xinyang Lu, Yifei Li, Jie Fu, Junxian He, and Bryan Hooi. 2023. Can llms express their uncertainty? an empirical evaluation of confidence elicitation in llms. arXiv preprint arXiv:2306.13063 (2023).

[42] Zhengtao Xu, Tianqi Song, and Yi-Chieh Lee. 2025. Confronting verbalized uncertainty: Understanding how LLM’s verbalized uncertainty influences users in AI-assisted decision-making. International Journal of Human-Computer Studies 197 (2025), 103455.

[43] Fumeng Yang, Zhuanyi Huang, Jean Scholtz, and Dustin L Arendt. 2020. How do visual explanations foster end users’ appropriate trust in machine learning? In Proceedings of the 25th International Conference on Intelligent User Interfaces (IUI). 189–201.

[44] Zhangyue Yin, Qiushi Sun, Qipeng Guo, Zhiyuan Zeng, Xiaonan Li, Junqi Dai, Qinyuan Cheng, Xuan-Jing Huang, and Xipeng Qiu. 2024. Reasoning in flux: Enhancing large language models reasoning through uncertainty-aware adaptive guidance. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 2401–2416.

[45] Gal Yona, Roee Aharoni, and Mor Geva. 2024. Can large language models faithfully express their intrinsic uncertainty in words? arXiv preprint arXiv:2405.16908 (2024).

[46] Yunfeng Zhang, Q Vera Liao, and Rachel KE Bellamy. 2020. Effect of confidence and explanation on accuracy and trust calibration in AI-assisted decision making. In Proceedings of the 2020 conference on Fairness, Accountability, and Transparency (FAccT). 295–305.

[47] Jieqiong Zhao, Yixuan Wang, Michelle V Mancenido, Erin K Chiou, and Ross Maciejewski. 2023. Evaluating the impact of uncertainty visualization on model reliance. IEEE Transactions on Visualization and Computer Graphics 30, 7 (2023), 4093–4107.

APPENDIX

The appendix is structured in the following way. • Appendix A: Participant Demographics and Background • App...


2023

[11] [11]

Longchao Da, Xiaoou Liu, Jiaxin Dai, Lu Cheng, Yaqing Wang, and Hua Wei. 2025. Understanding the uncertainty of llm explanations: A perspective based on reasoning topology.arXiv preprint arXiv:2502.17026(2025)

work page arXiv 2025

[12] [12]

Valdemar Danry, Pat Pataranutaporn, Yaoli Mao, and Pattie Maes. 2023. Don’t just tell me, ask me: Ai systems that intelligently frame explanations as questions improve human logical discernment accuracy over causal ai explanations. InProceedings of the 2023 CHI Conference on Human Factors in Computing Systems. 1–13

2023

[13] [13]

Jinhao Duan, Hao Cheng, Shiqi Wang, Alex Zavalny, Chenan Wang, Renjing Xu, Bhavya Kailkhura, and Kaidi Xu. 2024. Shifting attention to relevance: Towards the predictive uncertainty quantification of free-form large language models. InProceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 5050–5063

2024

[14] [14]

Ekaterina Fadeeva, Aleksandr Rubashevskii, Artem Shelmanov, Sergey Petrakov, Haonan Li, Hamdy Mubarak, Evgenii Tsymbalov, Gleb Kuzmin, Alexander Panchenko, Timothy Baldwin, et al. 2024. Fact-Checking the Output of Large Language Models via Token-Level Uncertainty Quantification. InFindings of the Association for Computational Linguistics ACL 2024. 9367–9385

2024

[15] [15]

Gabriel Freedman, Adam Dejl, Deniz Gorur, Xiang Yin, Antonio Rago, and Francesca Toni. 2025. Argumentative Large Language Models for Explainable and Contestable Claim Verification. InProceedings of the AAAI Conference on Artificial Intelligence, Vol. 39. 14930–14939

2025

[16] [16]

Alatrash, Clara Siepmann, and Tannaz Vahidi

Mouadh Guesmi, Mohamed Amine Chatti, Shoeb Joarder, Qurat Ul Ain, R. Alatrash, Clara Siepmann, and Tannaz Vahidi. 2023. Interactive Explanation with Varying Level of Details in an Explainable Scientific Literature Recommender System.International Journal of Human–Computer Interaction40 (2023), 7248 – 7269. https://api.semanticscholar.org/CorpusId:259129445

2023

[17] [17]

Gaole He, Patrick Hemmer, Michael Vössing, Max Schemmer, and Ujwal Gadiraju. 2025. Fine-Grained Appropriate Reliance: Human-AI Collaboration with a Multi-Step Transparent Decision Workflow for Complex Task Decomposition.arXiv preprint arXiv:2501.10909 (2025)

work page arXiv 2025

[18] [18]

Saurav Kadavath, Tom Conerly, Amanda Askell, Tom Henighan, Dawn Drain, Ethan Perez, Nicholas Schiefer, Zac Hatfield-Dodds, Nova DasSarma, Eli Tran-Johnson, et al. 2022. Language models (mostly) know what they know.arXiv preprint arXiv:2207.05221(2022)

work page internal anchor Pith review Pith/arXiv arXiv 2022

[19] [19]

I’m Not Sure, But

Sunnie SY Kim, Q Vera Liao, Mihaela Vorvoreanu, Stephanie Ballard, and Jennifer Wortman Vaughan. 2024. " I’m Not Sure, But... ": Examining the Impact of Large Language Models’ Uncertainty Expression on User Reliance and Trust. InProceedings of the 2024 ACM conference on Fairness, Accountability, and Transparency (FAccT). 822–835

2024

[20] [20]

René F Kizilcec. 2016. How much information? Effects of transparency on trust in an algorithmic interface. InProceedings of the 2016 CHI conference on human factors in computing systems. 2390–2395

2016

[21] [21]

Jannik Kossen, Jiatong Han, Muhammed Razzak, Lisa Schut, Shreshth Malik, and Yarin Gal. 2024. Semantic entropy probes: Robust and cheap hallucination detection in llms.arXiv preprint arXiv:2406.15927(2024)

work page internal anchor Pith review Pith/arXiv arXiv 2024

[22] [22]

Lorenz Kuhn, Yarin Gal, and Sebastian Farquhar. 2023. Semantic uncertainty: Linguistic invariances for uncertainty estimation in natural language generation.arXiv preprint arXiv:2302.09664(2023)

work page internal anchor Pith review Pith/arXiv arXiv 2023

[23] [23]

Min Hun Lee, Renee Bao Xuan Ng, Silvana Xinyi Choo, and Shamala Thilarajah. 2024. Interactive Example-based Explanations to Improve Health Professionals’ Onboarding with AI for Human-AI Collaborative Decision Making.arXiv preprint arXiv:2409.15814 (2024)

work page arXiv 2024

[24] [24]

Min Hun Lee and Martyn Zhe Yu Tok. 2025. Towards Uncertainty Aware Task Delegation and Human-AI Collaborative Decision-Making. InProceedings of the 2025 ACM Conference on Fairness, Accountability, and Transparency. 2274–2289

2025

[25] [25]

Zhuoran Lu and Ming Yin. 2021. Human reliance on machine learning models when performance feedback is limited: Heuristics and risks. InProceedings of the 2021 CHI Conference on Human Factors in Computing Systems. 1–16

2021

[26] [26]

Andrey Malinin and Mark Gales. 2020. Uncertainty estimation in autoregressive structured prediction.arXiv preprint arXiv:2002.07650 (2020)

work page arXiv 2020

[27] [27]

Potsawee Manakul, Adian Liusie, and Mark Gales. 2023. Selfcheckgpt: Zero-resource black-box hallucination detection for generative large language models. InProceedings of the 2023 conference on empirical methods in natural language processing. 9004–9017

2023

[28] [28]

Hannes Matuschek, Reinhold Kliegl, Shravan Vasishth, Harald Baayen, and Douglas Bates. 2017. Balancing Type I error and power in linear mixed models.Journal of Memory and Language94 (2017), 305–315

2017

[29] [29]

Luise Metzger, Linda Miller, Martin Baumann, and Johannes Kraus. 2024. Empowering calibrated (dis-) trust in conversational agents: a user study on the persuasive power of limitation disclaimers vs. authoritative style. InProceedings of the 2024 CHI Conference on Human Factors in Computing Systems. 1–19

2024

[30] [30]

Jakub Podolak and Rajeev Verma. 2025. Read Your Own Mind: Reasoning Helps Surface Self-Confidence Signals in LLMs.arXiv preprint arXiv:2505.23845(2025)

work page arXiv 2025

[31] [31]

Snehal Prabhudesai, Leyao Yang, Sumit Asthana, Xun Huan, Q Vera Liao, and Nikola Banovic. 2023. Understanding uncertainty: how lay decision-makers perceive and interpret uncertainty in human-AI decision making. InProceedings of the 28th International Conference on Intelligent User Interfaces (IUI). 379–396

2023

[32] [32]

Xin Qiu and Risto Miikkulainen. 2024. Semantic density: Uncertainty quantification for large language models through confidence measurement in semantic space.Advances in neural information processing systems37 (2024), 134507–134533

2024

[33] [33]

Olya Rezaeian, Alparslan Emrah Bayrak, and Onur Asan. 2025. Explainability and AI confidence in clinical decision support systems: Effects on trust, diagnostic performance, and cognitive load in breast cancer care.International Journal of Human–Computer Interaction Not All Uncertainty Is Equal FAccT ’26, June 25–28, 2026, Montreal, QC, Canada (2025), 1–21

2025

[34] [34]

Sara Salimzadeh, Gaole He, and Ujwal Gadiraju. 2024. Dealing with Uncertainty: Understanding the Impact of Prognostic Versus Diagnostic Tasks on Trust and Reliance in Human-AI Decision-Making. InProceedings of the CHI Conference on Human Factors in Computing Systems. ACM, 1–17

2024

[35] [35]

Ola Shorinwa, Zhiting Mei, Justin Lidard, Allen Z Ren, and Anirudha Majumdar. 2025. A survey on uncertainty quantification of large language models: Taxonomy, open research challenges, and future directions.Comput. Surveys(2025)

2025

[36] [36]

Mark Steyvers, Heliodoro Tejeda, Aakriti Kumar, Catarina Belem, Sheer Karny, Xinyue Hu, Lukas W Mayer, and Padhraic Smyth. 2025. What large language models know and what people think they know.Nature Machine Intelligence7, 2 (2025), 221–231

2025

[37] [37]

Sree Harsha Tanneru, Chirag Agarwal, and Himabindu Lakkaraju. 2024. Quantifying uncertainty in natural language explanations of large language models. InInternational Conference on Artificial Intelligence and Statistics. PMLR, 1072–1080

2024

[38] [38]

Shuchang Tao, Liuyi Yao, Hanxing Ding, Yuexiang Xie, Qi Cao, Fei Sun, Jinyang Gao, Huawei Shen, and Bolin Ding. 2024. When to trust llms: Aligning confidence with response quality.arXiv preprint arXiv:2404.17287(2024)

work page arXiv 2024

[39] [39]

Helena Vasconcelos, Gagan Bansal, Adam Fourney, Q Vera Liao, and Jennifer Wortman Vaughan. 2025. Generation probabilities are not enough: Uncertainty highlighting in ai code completions.ACM Transactions on Computer-Human Interaction32, 1 (2025), 1–30

2025

[40] [40]

Qianwen Wang, Kexin Huang, Payal Chandak, Marinka Zitnik, and Nils Gehlenborg. 2022. Extending the nested model for user-centric xai: A design study on gnn-based drug repurposing.IEEE Transactions on Visualization and Computer Graphics29, 1 (2022), 1266–1276

2022

[41] [41]

Miao Xiong, Zhiyuan Hu, Xinyang Lu, Yifei Li, Jie Fu, Junxian He, and Bryan Hooi. 2023. Can llms express their uncertainty? an empirical evaluation of confidence elicitation in llms.arXiv preprint arXiv:2306.13063(2023)

work page internal anchor Pith review Pith/arXiv arXiv 2023

[42] [42]

Zhengtao Xu, Tianqi Song, and Yi-Chieh Lee. 2025. Confronting verbalized uncertainty: Understanding how LLM’s verbalized uncertainty influences users in AI-assisted decision-making.International Journal of Human-Computer Studies197 (2025), 103455

2025

[43] [43]

Fumeng Yang, Zhuanyi Huang, Jean Scholtz, and Dustin L Arendt. 2020. How do visual explanations foster end users’ appropriate trust in machine learning?. InProceedings of the 25th International Conference on Intelligent User Interfaces (IUI). 189–201

2020

[44] [44]

Zhangyue Yin, Qiushi Sun, Qipeng Guo, Zhiyuan Zeng, Xiaonan Li, Junqi Dai, Qinyuan Cheng, Xuan-Jing Huang, and Xipeng Qiu. 2024. Reasoning in flux: Enhancing large language models reasoning through uncertainty-aware adaptive guidance. InProceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 2401–2416

2024

[45] [45]

Gal Yona, Roee Aharoni, and Mor Geva. 2024. Can large language models faithfully express their intrinsic uncertainty in words?arXiv preprint arXiv:2405.16908(2024)

work page arXiv 2024

[46] [46]

Yunfeng Zhang, Q Vera Liao, and Rachel KE Bellamy. 2020. Effect of confidence and explanation on accuracy and trust calibration in AI-assisted decision making. InProceedings of the 2020 conference on Fairness, Accountability, and Transparency (FAccT). 295–305

2020

[47] [47]

Jieqiong Zhao, Yixuan Wang, Michelle V Mancenido, Erin K Chiou, and Ross Maciejewski. 2023. Evaluating the impact of uncertainty visualization on model reliance.IEEE Transactions on Visualization and Computer Graphics30, 7 (2023), 4093–4107. APPENDIX The appendix is structured in the following way. •Appendix A: Participant Demographics and Background •App...

2023