pith. sign in

arxiv: 2605.28571 · v1 · pith:Q7FYNG36new · submitted 2026-05-27 · 💻 cs.HC

Not All Uncertainty Is Equal: How Uncertainty Granularity Shapes Human Verification in LLM-Assisted Decision Making

Pith reviewed 2026-06-29 09:57 UTC · model grok-4.3

classification 💻 cs.HC
keywords uncertainty granularityLLM-assisted decision makinghuman verificationtrust calibrationmedical decision supportAI reliabilityuser behavior
0
0 comments X

The pith

Uncertainty shown at the token level increases agreement with LLM answers while relation-level uncertainty reduces external verification.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper investigates uncertainty granularity in LLM responses, defined as expressing confidence at output, relation, or token levels within a single answer. A between-subjects experiment with 192 participants answering medical questions found that token-level uncertainty raised agreement with the AI, output- and relation-level uncertainty lowered users' confidence in their own answers, and relation-level uncertainty decreased external checks such as internet searches. These patterns indicate that the way uncertainty is broken down can steer users toward or away from independent fact-checking. The results supply concrete guidance for designing uncertainty displays that support appropriate reliance on LLMs.

Core claim

Token-level uncertainty increased users' agreement with the AI; output- and relation-level uncertainty did not increase agreement but reduced users' confidence in their own answers; relation-level uncertainty also reduced external verification behaviors such as internet searches and URL checks.

What carries the argument

Uncertainty granularity, the extent to which uncertainty is expressed at different levels (output-level for the entire response, relation-level for individual reasoning steps, token-level for specific words) within an LLM response.

If this is right

  • Token-level uncertainty may increase acceptance of LLM answers without additional user scrutiny.
  • Relation-level uncertainty may steer users away from independent fact-checking toward reliance on the provided cues.
  • Output- and relation-level uncertainty may lower users' confidence in their own judgments without raising agreement with the AI.
  • Design choices for uncertainty communication should select granularity levels according to the desired balance between trust calibration and verification behavior.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Designers could experiment with hybrid displays that switch granularity based on task risk or user expertise to tune verification rates.
  • The findings may extend to non-medical domains if similar between-subjects designs are run on factual or technical questions.
  • Future work could test whether combining granularity levels with other trust-calibration cues produces additive or interactive effects on verification.

Load-bearing premise

Differences in user behavior can be attributed to the granularity level rather than to how the underlying LLM outputs were generated, how uncertainty values were computed, or how the displays were presented.

What would settle it

A controlled replication that holds LLM outputs, uncertainty values, and visual displays identical across conditions and still observes no differences in agreement rates, self-confidence, or external verification rates would falsify the claim.

Figures

Figures reproduced from arXiv: 2605.28571 by Mauricio Villavicencio, Qianwen Wang, Sitong Pan.

Figure 1
Figure 1. Figure 1: Uncertainty granularity in LLM responses. (a) Output-level uncertainty displays a single score for the entire response. (b) Finer-grained uncertainty highlights varying uncertainty across reasoning steps (relation-level) and individual words (token-level). Cat and AI assistant icons by Freepik from Flaticon (https://www.flaticon.com) sufficient critical reflection [3, 7, 25, 36, 46]. Therefore, calibrating… view at source ↗
Figure 2
Figure 2. Figure 2: LLM uncertainty quantification and visualization across [PITH_FULL_IMAGE:figures/full_fig_p005_2.png] view at source ↗
Figure 3
Figure 3. Figure 3: Study Procedure and Task. (a) An overview of the user study workflow. Participants completed onboarding, answered eight information-seeking questions under one of four conditions, and then completed a post-study questionnaire. (b) A screenshot of the in-study task interface. Participants are required to complete the question-answering in-study task (B3) for all eight questions under the assigned condition.… view at source ↗
Figure 4
Figure 4. Figure 4: Comparison of DVs across conditions. We show model-estimated marginal means from the confirmatory mixed-effects [PITH_FULL_IMAGE:figures/full_fig_p010_4.png] view at source ↗
Figure 5
Figure 5. Figure 5: Q1 – Baseline condition [PITH_FULL_IMAGE:figures/full_fig_p023_5.png] view at source ↗
Figure 6
Figure 6. Figure 6: Q1 – Output-Level UQ condition [PITH_FULL_IMAGE:figures/full_fig_p024_6.png] view at source ↗
Figure 7
Figure 7. Figure 7: Q1 – Relation-Level UQ condition [PITH_FULL_IMAGE:figures/full_fig_p025_7.png] view at source ↗
Figure 8
Figure 8. Figure 8: Q1 – Token-Level UQ condition [PITH_FULL_IMAGE:figures/full_fig_p026_8.png] view at source ↗
Figure 9
Figure 9. Figure 9: Q2 – Baseline condition [PITH_FULL_IMAGE:figures/full_fig_p027_9.png] view at source ↗
Figure 10
Figure 10. Figure 10: Q2 – Output-Level UQ condition [PITH_FULL_IMAGE:figures/full_fig_p028_10.png] view at source ↗
Figure 11
Figure 11. Figure 11: Q2 – Relation-Level UQ condition [PITH_FULL_IMAGE:figures/full_fig_p029_11.png] view at source ↗
Figure 12
Figure 12. Figure 12: Q2 – Token-Level UQ condition [PITH_FULL_IMAGE:figures/full_fig_p030_12.png] view at source ↗
Figure 13
Figure 13. Figure 13: Q3 – Baseline condition [PITH_FULL_IMAGE:figures/full_fig_p031_13.png] view at source ↗
Figure 14
Figure 14. Figure 14: Q3 – Output-Level UQ condition [PITH_FULL_IMAGE:figures/full_fig_p032_14.png] view at source ↗
Figure 15
Figure 15. Figure 15: Q3 – Relation-Level UQ condition [PITH_FULL_IMAGE:figures/full_fig_p033_15.png] view at source ↗
Figure 16
Figure 16. Figure 16: Q3 – Token-Level UQ condition [PITH_FULL_IMAGE:figures/full_fig_p034_16.png] view at source ↗
Figure 17
Figure 17. Figure 17: Q4 – Baseline condition [PITH_FULL_IMAGE:figures/full_fig_p035_17.png] view at source ↗
Figure 18
Figure 18. Figure 18: Q4 – Output-Level UQ condition [PITH_FULL_IMAGE:figures/full_fig_p036_18.png] view at source ↗
Figure 19
Figure 19. Figure 19: Q4 – Relation-Level UQ condition [PITH_FULL_IMAGE:figures/full_fig_p037_19.png] view at source ↗
Figure 20
Figure 20. Figure 20: Q4 – Token-Level UQ condition [PITH_FULL_IMAGE:figures/full_fig_p038_20.png] view at source ↗
Figure 21
Figure 21. Figure 21: Q5 – Baseline condition [PITH_FULL_IMAGE:figures/full_fig_p039_21.png] view at source ↗
Figure 22
Figure 22. Figure 22: Q5 – Output-Level UQ condition [PITH_FULL_IMAGE:figures/full_fig_p040_22.png] view at source ↗
Figure 23
Figure 23. Figure 23: Q5 – Relation-Level UQ condition [PITH_FULL_IMAGE:figures/full_fig_p041_23.png] view at source ↗
Figure 24
Figure 24. Figure 24: Q5 – Token-Level UQ condition [PITH_FULL_IMAGE:figures/full_fig_p042_24.png] view at source ↗
Figure 25
Figure 25. Figure 25: Q6 – Baseline condition [PITH_FULL_IMAGE:figures/full_fig_p043_25.png] view at source ↗
Figure 26
Figure 26. Figure 26: Q6 – Output-Level UQ condition [PITH_FULL_IMAGE:figures/full_fig_p044_26.png] view at source ↗
Figure 27
Figure 27. Figure 27: Q6 – Relation-Level UQ condition [PITH_FULL_IMAGE:figures/full_fig_p045_27.png] view at source ↗
Figure 28
Figure 28. Figure 28: Q6 – Token-Level UQ condition [PITH_FULL_IMAGE:figures/full_fig_p046_28.png] view at source ↗
Figure 29
Figure 29. Figure 29: Q7 – Baseline condition [PITH_FULL_IMAGE:figures/full_fig_p047_29.png] view at source ↗
Figure 30
Figure 30. Figure 30: Q7 – Output-Level UQ condition [PITH_FULL_IMAGE:figures/full_fig_p048_30.png] view at source ↗
Figure 31
Figure 31. Figure 31: Q7 – Relation-Level UQ condition [PITH_FULL_IMAGE:figures/full_fig_p049_31.png] view at source ↗
Figure 32
Figure 32. Figure 32: Q7 – Token-Level UQ condition [PITH_FULL_IMAGE:figures/full_fig_p050_32.png] view at source ↗
Figure 33
Figure 33. Figure 33: Q8 – Baseline condition [PITH_FULL_IMAGE:figures/full_fig_p051_33.png] view at source ↗
Figure 34
Figure 34. Figure 34: Q8 – Output-Level UQ condition [PITH_FULL_IMAGE:figures/full_fig_p052_34.png] view at source ↗
Figure 35
Figure 35. Figure 35: Q8 – Relation-Level UQ condition [PITH_FULL_IMAGE:figures/full_fig_p053_35.png] view at source ↗
Figure 36
Figure 36. Figure 36: Q8 – Token-Level UQ condition [PITH_FULL_IMAGE:figures/full_fig_p054_36.png] view at source ↗
read the original abstract

Despite warnings that LLMs can make mistakes, users often develop inappropriate trust and accept incorrect answers without critical evaluation. Uncertainty quantification (UQ), displaying LLMs' confidence, has emerged as a promising approach to calibrate user trust. However, prior empirical studies on uncertainty communication have treated uncertainty as a single numerical score or simple natural language expression. This simplification fails to capture a key property of LLM outputs: a single response often comprises multiple claims and reasoning steps, each with distinct levels of uncertainty. To address this gap, this study investigates uncertainty granularity (i.e., the extent to which uncertainty is expressed at different levels within an LLM response) and examines its impact on LLM-assisted decision-making. We conducted a large-scale, between-subjects study (N=192) in which participants answered medical questions using LLMs that displayed uncertainty at three different granularities: output-level (entire response), relation-level (individual reasoning steps), and token-level (specific words). Our findings reveal distinct behavioral effects as a function of uncertainty granularity. Token-level uncertainty increased users' agreement with the AI, whereas output- and relation-level uncertainty did not increase agreement but instead reduced users' confidence in their own answers. Notably, relation-level uncertainty also reduced external verification (i.e., internet searches, checking provided URLs), steering users away from independent fact-checking and toward reliance on the LLM and its accompanying uncertainty cues. Our findings demonstrate that uncertainty granularity significantly shapes how users interact with and verify LLM outputs, providing concrete design guidance for building responsible LLM applications that encourage appropriate skepticism and verification behaviors.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The manuscript reports results from a between-subjects experiment (N=192) on how uncertainty granularity in LLM responses—output-level, relation-level, or token-level—affects user agreement with the LLM, self-confidence, and external verification behaviors during medical question answering. The central claim is that granularity produces distinct behavioral patterns, with token-level increasing agreement and relation-level reducing both self-confidence and external verification.

Significance. If the experimental controls are sound, the work supplies concrete, granularity-specific evidence on trust calibration and verification in LLM-assisted decisions, which is directly relevant to HCI design guidelines for uncertainty communication in high-stakes domains.

major comments (2)
  1. [Methods] Methods section: the implementation details do not establish that the base LLM responses (specific claims, reasoning steps, and underlying uncertainty values) were identical across the three granularity conditions. Without this control, observed differences in agreement and verification cannot be unambiguously attributed to granularity rather than to systematic differences in content, uncertainty computation, or display properties.
  2. [Results] Results section: the reported behavioral effects lack accompanying statistical test details, effect sizes, or power analyses in the abstract and main text summary, leaving the strength and reliability of the granularity-dependent patterns difficult to evaluate.
minor comments (1)
  1. [Abstract] Abstract: no quantitative results, p-values, or effect sizes are supplied, which is standard practice for empirical HCI studies and would improve the summary's informativeness.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for these constructive comments on experimental controls and statistical reporting. We address each point below with clarifications from the study design and indicate planned revisions.

read point-by-point responses
  1. Referee: [Methods] Methods section: the implementation details do not establish that the base LLM responses (specific claims, reasoning steps, and underlying uncertainty values) were identical across the three granularity conditions. Without this control, observed differences in agreement and verification cannot be unambiguously attributed to granularity rather than to systematic differences in content, uncertainty computation, or display properties.

    Authors: The base LLM outputs were generated once per question using a fixed model, prompt, and temperature setting; uncertainty values for claims and tokens were computed from the same logits and then selectively rendered according to the assigned granularity condition. The three conditions therefore differed only in presentation format, not in underlying content or numerical uncertainty estimates. We will revise the Methods section to include explicit confirmation of this shared generation pipeline, along with pseudocode and example outputs demonstrating identical base responses across conditions. revision: yes

  2. Referee: [Results] Results section: the reported behavioral effects lack accompanying statistical test details, effect sizes, or power analyses in the abstract and main text summary, leaving the strength and reliability of the granularity-dependent patterns difficult to evaluate.

    Authors: The full Results section reports ANOVA and post-hoc tests with exact p-values, Cohen’s d effect sizes, and a pre-registered power analysis (target power 0.80, achieved with N=192). These details appear in the body but were condensed in the abstract and opening summary. We will expand the abstract and main-text summary to include key statistical results and effect sizes while remaining within length limits. revision: yes

Circularity Check

0 steps flagged

No circularity: purely empirical behavioral study with no derivations or self-referential chains

full rationale

This paper reports a between-subjects experiment (N=192) comparing user behavior across three uncertainty-granularity conditions in an LLM-assisted medical QA task. The abstract and described methods contain no equations, fitted parameters, uniqueness theorems, ansatzes, or self-citations that serve as load-bearing premises. All central claims rest on observed differences in agreement, self-confidence, and verification rates collected from participants; these outcomes are not equivalent to the input conditions by construction. The study is therefore self-contained against external benchmarks and receives the default non-circularity finding.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

Empirical user study with no mathematical model. Relies on standard assumptions of experimental psychology and statistics.

axioms (1)
  • domain assumption Between-subjects randomization eliminates carry-over effects and that self-reported agreement, confidence, and verification behaviors are valid proxies for the constructs of interest.
    The study design and measures depend on these standard experimental assumptions holding.

pith-pipeline@v0.9.1-grok · 5818 in / 1235 out tokens · 28185 ms · 2026-06-29T09:57:00.384029+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

47 extracted references · 12 canonical work pages · 4 internal anchors

  1. [1]

    Yavuz Faruk Bakman, Duygu Nur Yaldiz, Baturalp Buyukates, Chenyang Tao, Dimitrios Dimitriadis, and Salman Avestimehr. 2024. Mars: Meaning-aware response scoring for uncertainty estimation in generative llms. InProceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 7752–7767

  2. [2]

    Neil Band, Xuechen Li, Tengyu Ma, and Tatsunori Hashimoto. 2024. Linguistic Calibration of Long-Form Generations. InInternational Conference on Machine Learning. PMLR, 2732–2778

  3. [3]

    Gagan Bansal, Tongshuang Wu, Joyce Zhou, Raymond Fok, Besmira Nushi, Ece Kamar, Marco Tulio Ribeiro, and Daniel Weld. 2021. Does the whole exceed its parts? the effect of ai explanations on complementary team performance. InProceedings of the 2021 CHI conference on human factors in computing systems. 1–16

  4. [4]

    Evan Becker and Stefano Soatto. 2024. Cycles of thought: Measuring llm confidence through stable explanations.arXiv preprint arXiv:2406.03441(2024)

  5. [5]

    Asma Ben Abacha and Dina Demner-Fushman. 2019. A question-entailment approach to question answering.BMC bioinformatics20, 1 (2019), 511

  6. [6]

    Virginia Braun and Victoria Clarke. 2006. Using thematic analysis in psychology.Qualitative Research in Psychology3, 2 (2006), 77–101

  7. [7]

    Zana Buçinca, Maja Barbara Malaya, and Krzysztof Z Gajos. 2021. To trust or to think: cognitive forcing functions can reduce overreliance on AI in AI-assisted decision-making.Proceedings of the ACM on Human-computer Interaction5, CSCW1 (2021), 1–21

  8. [8]

    Carrie J Cai, Samantha Winter, David Steiner, Lauren Wilcox, and Michael Terry. 2019. Hello AI: uncovering the onboarding needs of medical practitioners for human-AI collaborative decision-making.Proceedings of the ACM on Human-computer Interaction3, CSCW (2019), 1–24

  9. [9]

    Shiye Cao, Anqi Liu, and Chien-Ming Huang. 2024. Designing for appropriate reliance: The roles of AI uncertainty presentation, initial user decision, and user demographics in AI-assisted decision-making.Proceedings of the ACM on Human-Computer Interaction8, CSCW1 (2024), 1–32

  10. [10]

    Federico Maria Cau, Hanna Hauptmann, Lucio Davide Spano, and Nava Tintarev. 2023. Effects of AI and logic-style explanations on users’ decisions under different levels of uncertainty.ACM Transactions on Interactive Intelligent Systems13, 4 (2023), 1–42. FAccT ’26, June 25–28, 2026, Montreal, QC, Canada Villavicencio, Pan, Wang

  11. [11]

    Longchao Da, Xiaoou Liu, Jiaxin Dai, Lu Cheng, Yaqing Wang, and Hua Wei. 2025. Understanding the uncertainty of llm explanations: A perspective based on reasoning topology.arXiv preprint arXiv:2502.17026(2025)

  12. [12]

    Valdemar Danry, Pat Pataranutaporn, Yaoli Mao, and Pattie Maes. 2023. Don’t just tell me, ask me: Ai systems that intelligently frame explanations as questions improve human logical discernment accuracy over causal ai explanations. InProceedings of the 2023 CHI Conference on Human Factors in Computing Systems. 1–13

  13. [13]

    Jinhao Duan, Hao Cheng, Shiqi Wang, Alex Zavalny, Chenan Wang, Renjing Xu, Bhavya Kailkhura, and Kaidi Xu. 2024. Shifting attention to relevance: Towards the predictive uncertainty quantification of free-form large language models. InProceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 5050–5063

  14. [14]

    Ekaterina Fadeeva, Aleksandr Rubashevskii, Artem Shelmanov, Sergey Petrakov, Haonan Li, Hamdy Mubarak, Evgenii Tsymbalov, Gleb Kuzmin, Alexander Panchenko, Timothy Baldwin, et al. 2024. Fact-Checking the Output of Large Language Models via Token-Level Uncertainty Quantification. InFindings of the Association for Computational Linguistics ACL 2024. 9367–9385

  15. [15]

    Gabriel Freedman, Adam Dejl, Deniz Gorur, Xiang Yin, Antonio Rago, and Francesca Toni. 2025. Argumentative Large Language Models for Explainable and Contestable Claim Verification. InProceedings of the AAAI Conference on Artificial Intelligence, Vol. 39. 14930–14939

  16. [16]

    Alatrash, Clara Siepmann, and Tannaz Vahidi

    Mouadh Guesmi, Mohamed Amine Chatti, Shoeb Joarder, Qurat Ul Ain, R. Alatrash, Clara Siepmann, and Tannaz Vahidi. 2023. Interactive Explanation with Varying Level of Details in an Explainable Scientific Literature Recommender System.International Journal of Human–Computer Interaction40 (2023), 7248 – 7269. https://api.semanticscholar.org/CorpusId:259129445

  17. [17]

    Gaole He, Patrick Hemmer, Michael Vössing, Max Schemmer, and Ujwal Gadiraju. 2025. Fine-Grained Appropriate Reliance: Human-AI Collaboration with a Multi-Step Transparent Decision Workflow for Complex Task Decomposition.arXiv preprint arXiv:2501.10909 (2025)

  18. [18]

    Saurav Kadavath, Tom Conerly, Amanda Askell, Tom Henighan, Dawn Drain, Ethan Perez, Nicholas Schiefer, Zac Hatfield-Dodds, Nova DasSarma, Eli Tran-Johnson, et al. 2022. Language models (mostly) know what they know.arXiv preprint arXiv:2207.05221(2022)

  19. [19]

    I’m Not Sure, But

    Sunnie SY Kim, Q Vera Liao, Mihaela Vorvoreanu, Stephanie Ballard, and Jennifer Wortman Vaughan. 2024. " I’m Not Sure, But... ": Examining the Impact of Large Language Models’ Uncertainty Expression on User Reliance and Trust. InProceedings of the 2024 ACM conference on Fairness, Accountability, and Transparency (FAccT). 822–835

  20. [20]

    René F Kizilcec. 2016. How much information? Effects of transparency on trust in an algorithmic interface. InProceedings of the 2016 CHI conference on human factors in computing systems. 2390–2395

  21. [21]

    Jannik Kossen, Jiatong Han, Muhammed Razzak, Lisa Schut, Shreshth Malik, and Yarin Gal. 2024. Semantic entropy probes: Robust and cheap hallucination detection in llms.arXiv preprint arXiv:2406.15927(2024)

  22. [22]

    Lorenz Kuhn, Yarin Gal, and Sebastian Farquhar. 2023. Semantic uncertainty: Linguistic invariances for uncertainty estimation in natural language generation.arXiv preprint arXiv:2302.09664(2023)

  23. [23]

    Min Hun Lee, Renee Bao Xuan Ng, Silvana Xinyi Choo, and Shamala Thilarajah. 2024. Interactive Example-based Explanations to Improve Health Professionals’ Onboarding with AI for Human-AI Collaborative Decision Making.arXiv preprint arXiv:2409.15814 (2024)

  24. [24]

    Min Hun Lee and Martyn Zhe Yu Tok. 2025. Towards Uncertainty Aware Task Delegation and Human-AI Collaborative Decision-Making. InProceedings of the 2025 ACM Conference on Fairness, Accountability, and Transparency. 2274–2289

  25. [25]

    Zhuoran Lu and Ming Yin. 2021. Human reliance on machine learning models when performance feedback is limited: Heuristics and risks. InProceedings of the 2021 CHI Conference on Human Factors in Computing Systems. 1–16

  26. [26]

    Andrey Malinin and Mark Gales. 2020. Uncertainty estimation in autoregressive structured prediction.arXiv preprint arXiv:2002.07650 (2020)

  27. [27]

    Potsawee Manakul, Adian Liusie, and Mark Gales. 2023. Selfcheckgpt: Zero-resource black-box hallucination detection for generative large language models. InProceedings of the 2023 conference on empirical methods in natural language processing. 9004–9017

  28. [28]

    Hannes Matuschek, Reinhold Kliegl, Shravan Vasishth, Harald Baayen, and Douglas Bates. 2017. Balancing Type I error and power in linear mixed models.Journal of Memory and Language94 (2017), 305–315

  29. [29]

    Luise Metzger, Linda Miller, Martin Baumann, and Johannes Kraus. 2024. Empowering calibrated (dis-) trust in conversational agents: a user study on the persuasive power of limitation disclaimers vs. authoritative style. InProceedings of the 2024 CHI Conference on Human Factors in Computing Systems. 1–19

  30. [30]

    Jakub Podolak and Rajeev Verma. 2025. Read Your Own Mind: Reasoning Helps Surface Self-Confidence Signals in LLMs.arXiv preprint arXiv:2505.23845(2025)

  31. [31]

    Snehal Prabhudesai, Leyao Yang, Sumit Asthana, Xun Huan, Q Vera Liao, and Nikola Banovic. 2023. Understanding uncertainty: how lay decision-makers perceive and interpret uncertainty in human-AI decision making. InProceedings of the 28th International Conference on Intelligent User Interfaces (IUI). 379–396

  32. [32]

    Xin Qiu and Risto Miikkulainen. 2024. Semantic density: Uncertainty quantification for large language models through confidence measurement in semantic space.Advances in neural information processing systems37 (2024), 134507–134533

  33. [33]

    Olya Rezaeian, Alparslan Emrah Bayrak, and Onur Asan. 2025. Explainability and AI confidence in clinical decision support systems: Effects on trust, diagnostic performance, and cognitive load in breast cancer care.International Journal of Human–Computer Interaction Not All Uncertainty Is Equal FAccT ’26, June 25–28, 2026, Montreal, QC, Canada (2025), 1–21

  34. [34]

    Sara Salimzadeh, Gaole He, and Ujwal Gadiraju. 2024. Dealing with Uncertainty: Understanding the Impact of Prognostic Versus Diagnostic Tasks on Trust and Reliance in Human-AI Decision-Making. InProceedings of the CHI Conference on Human Factors in Computing Systems. ACM, 1–17

  35. [35]

    Ola Shorinwa, Zhiting Mei, Justin Lidard, Allen Z Ren, and Anirudha Majumdar. 2025. A survey on uncertainty quantification of large language models: Taxonomy, open research challenges, and future directions.Comput. Surveys(2025)

  36. [36]

    Mark Steyvers, Heliodoro Tejeda, Aakriti Kumar, Catarina Belem, Sheer Karny, Xinyue Hu, Lukas W Mayer, and Padhraic Smyth. 2025. What large language models know and what people think they know.Nature Machine Intelligence7, 2 (2025), 221–231

  37. [37]

    Sree Harsha Tanneru, Chirag Agarwal, and Himabindu Lakkaraju. 2024. Quantifying uncertainty in natural language explanations of large language models. InInternational Conference on Artificial Intelligence and Statistics. PMLR, 1072–1080

  38. [38]

    Shuchang Tao, Liuyi Yao, Hanxing Ding, Yuexiang Xie, Qi Cao, Fei Sun, Jinyang Gao, Huawei Shen, and Bolin Ding. 2024. When to trust llms: Aligning confidence with response quality.arXiv preprint arXiv:2404.17287(2024)

  39. [39]

    Helena Vasconcelos, Gagan Bansal, Adam Fourney, Q Vera Liao, and Jennifer Wortman Vaughan. 2025. Generation probabilities are not enough: Uncertainty highlighting in ai code completions.ACM Transactions on Computer-Human Interaction32, 1 (2025), 1–30

  40. [40]

    Qianwen Wang, Kexin Huang, Payal Chandak, Marinka Zitnik, and Nils Gehlenborg. 2022. Extending the nested model for user-centric xai: A design study on gnn-based drug repurposing.IEEE Transactions on Visualization and Computer Graphics29, 1 (2022), 1266–1276

  41. [41]

    Miao Xiong, Zhiyuan Hu, Xinyang Lu, Yifei Li, Jie Fu, Junxian He, and Bryan Hooi. 2023. Can llms express their uncertainty? an empirical evaluation of confidence elicitation in llms.arXiv preprint arXiv:2306.13063(2023)

  42. [42]

    Zhengtao Xu, Tianqi Song, and Yi-Chieh Lee. 2025. Confronting verbalized uncertainty: Understanding how LLM’s verbalized uncertainty influences users in AI-assisted decision-making.International Journal of Human-Computer Studies197 (2025), 103455

  43. [43]

    Fumeng Yang, Zhuanyi Huang, Jean Scholtz, and Dustin L Arendt. 2020. How do visual explanations foster end users’ appropriate trust in machine learning?. InProceedings of the 25th International Conference on Intelligent User Interfaces (IUI). 189–201

  44. [44]

    Zhangyue Yin, Qiushi Sun, Qipeng Guo, Zhiyuan Zeng, Xiaonan Li, Junqi Dai, Qinyuan Cheng, Xuan-Jing Huang, and Xipeng Qiu. 2024. Reasoning in flux: Enhancing large language models reasoning through uncertainty-aware adaptive guidance. InProceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 2401–2416

  45. [45]

    Gal Yona, Roee Aharoni, and Mor Geva. 2024. Can large language models faithfully express their intrinsic uncertainty in words?arXiv preprint arXiv:2405.16908(2024)

  46. [46]

    Yunfeng Zhang, Q Vera Liao, and Rachel KE Bellamy. 2020. Effect of confidence and explanation on accuracy and trust calibration in AI-assisted decision making. InProceedings of the 2020 conference on Fairness, Accountability, and Transparency (FAccT). 295–305

  47. [47]

    Jieqiong Zhao, Yixuan Wang, Michelle V Mancenido, Erin K Chiou, and Ross Maciejewski. 2023. Evaluating the impact of uncertainty visualization on model reliance.IEEE Transactions on Visualization and Computer Graphics30, 7 (2023), 4093–4107. APPENDIX The appendix is structured in the following way. •Appendix A: Participant Demographics and Background •App...