
arxiv: 2605.18036 · v1 · pith:AFMZRHSCnew · submitted 2026-05-18 · 💻 cs.HC · cs.AI

Exploring Trust Calibration in XAI - The Impact of Exposing Model Limitations to Lay Users

Pith reviewed 2026-05-20 09:15 UTC · model grok-4.3

classification 💻 cs.HC cs.AI
keywords trust calibration · explainable AI · XAI · limitation disclosure · skin lesion classification · human-AI interaction · onboarding · user study

The pith

Disclosing model limitations improves trust calibration only for case-specific judgments in an XAI skin-lesion classifier.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests whether different ways of telling users about an AI model's limits at the start help them judge its trustworthiness more accurately over repeated uses. Participants evaluated 15 real skin lesion cases with a fixed set of explanations and rated their trust in each one as well as overall. The study finds that only the onboarding condition that explicitly disclosed limitations produced better alignment between those ratings and the model's actual correctness on the cases seen. The specific mix of easy and hard cases each person encountered explained more of the differences than any of the onboarding messages did, and short sessions did not produce steady improvement in calibration.

Core claim

In the preregistered between-subjects study with 418 participants, only the limitation-disclosure onboarding condition reliably reduced the deviation between case-wise trust judgments and objective performance on the 15 encountered cases. Global trust measures and short-term experience showed no progressive calibration effect. The particular stimulus package each participant received accounted for substantially more variance in calibration outcomes than the experimental manipulation of onboarding information.

What carries the argument

The argument rests on a deviation score: the gap between trust-related judgments (the Trust in AI Scale, TAIS, and case-wise ratings) and objective correctness on the 15 encountered cases. This score is compared across five onboarding conditions that vary example-based information and limitation disclosures.
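
To make the idea of a deviation between trust judgments and objective correctness concrete, here is a minimal Python sketch. The summary does not give the paper's exact formula, so the signed difference used here, the 1-to-7 rating range, and the names `rescale` and `deviation_score` are all assumptions chosen for illustration, and the participant data is invented.

```python
from statistics import mean


def rescale(rating, low=1, high=7):
    """Map a Likert-type rating onto [0, 1]. The 1-7 range is an assumption."""
    return (rating - low) / (high - low)


def deviation_score(ratings, correct, low=1, high=7):
    """Mean rescaled rating minus the share of encountered cases the model got right.

    Positive values suggest over-trust, negative values under-trust.
    This is one possible operationalization, not necessarily the paper's.
    """
    if len(ratings) != len(correct):
        raise ValueError("one rating per encountered case is required")
    experienced_accuracy = mean(1.0 if c else 0.0 for c in correct)
    return mean(rescale(r, low, high) for r in ratings) - experienced_accuracy


# Hypothetical participant: 15 case-wise ratings and the model's correctness per case.
ratings = [7, 7, 7, 4, 4, 4, 1, 1, 1, 4, 4, 4, 7, 7, 7]
correct = [True] * 12 + [False] * 3
score = deviation_score(ratings, correct)
print(round(score, 3))
```

In this invented example the mean rescaled rating is 0.6 and the model was right on 12 of 15 cases (0.8), so the score is negative, which under these assumptions would read as mild under-trust.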

If this is right

  • Case-specific limitation disclosures should be prioritized in XAI interfaces to improve alignment of user trust with actual performance.
  • Short-term interaction alone is unlikely to produce better calibration, so designers cannot rely on exposure to fix miscalibration.
  • Stimulus selection and case difficulty must be controlled or reported in future XAI calibration studies because they dominate variance.
  • Measurement tools need refinement because users do not clearly separate perceived trust, trustworthiness, and accuracy estimates.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Dynamic, per-case limitation statements tied to model confidence could produce stronger calibration effects than static onboarding.
  • Evaluations of XAI systems should randomize or balance case difficulty across participants to isolate the effect of communication strategies.
  • Longer or repeated exposure sessions may be required to test whether progressive calibration can emerge beyond short-term use.
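
The suggestion to balance case difficulty across participants can be sketched as a blocked random assignment over the crossing of onboarding conditions and stimulus packages. This is a hypothetical procedure, not the authors' documented one; the labels `C1`..`C5` and `P1`..`P5`, the function name `balanced_assignment`, and the seed are placeholders.

```python
import random
from collections import Counter
from itertools import product


def balanced_assignment(participant_ids, conditions, packages, seed=0):
    """Assign each participant a (condition, package) cell.

    Participants are shuffled, then handed out in blocks that each contain
    every cell once, so cell sizes differ by at most one.
    """
    cells = list(product(conditions, packages))
    rng = random.Random(seed)
    ids = list(participant_ids)
    rng.shuffle(ids)
    assignment = {}
    for start in range(0, len(ids), len(cells)):
        block = cells[:]
        rng.shuffle(block)  # fresh random cell order for every block
        for pid, cell in zip(ids[start:start + len(cells)], block):
            assignment[pid] = cell
    return assignment


# Placeholder labels for five onboarding conditions and five stimulus packages.
conditions = [f"C{i}" for i in range(1, 6)]
packages = [f"P{i}" for i in range(1, 6)]
assignment = balanced_assignment(range(418), conditions, packages)
cell_sizes = Counter(assignment.values())
print(min(cell_sizes.values()), max(cell_sizes.values()))
```

With 418 participants and 25 cells, every cell ends up with 16 or 17 participants, so no onboarding condition is systematically paired with an easier or harder package.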

Load-bearing premise

The chosen correctness scores on the 15 cases provide a valid external benchmark for how far trust judgments deviate from actual model performance, and the trust and accuracy ratings capture distinct intended constructs without large measurement error or overlap.

What would settle it

A replication in which the same limitation-disclosure condition produces no reduction in deviation between case-wise trust ratings and model correctness when a different set of 15 cases or a different validated trust scale is used.

Figures

Figures reproduced from arXiv: 2605.18036 by Alfio Ventura, Jan Corazza, Mustafa Yalçıner, Tim Katzke.

Figure 1. Graphical overview of our overarching study concept, the major survey procedure steps with their key components, and the core analyses performed.
Figure 2. Example prediction presented to participants in the experimental group Limitation Condition introduction. Participants in the Low-Information group did not see such example predictions; the other groups were only presented with the original image on the left without any manipulations.
Figure 3. TAIS T3 trust assessment by experimental condition. The "objective" trustworthiness indicators are added as reference lines, either as general system-wide indicators or as mean experienced indicators based on our stimulus packages.
Figure 4. On-task assessments of combined trust and trustworthiness (top) and estimated accuracy (bottom) by experimental condition. The "objective" trustworthiness indicators are shown as reference lines, either as system-wide indicators or as mean experienced indicators based on the stimulus packages.
Figure 5. Combined trust and trustworthiness on-task assessment over time. The "objective" trustworthiness indicators are added as reference lines, either as general system-wide indicators or as mean experienced indicators based on our stimulus packages. The shaded band shows the ± 95% CI around the measures.
Original abstract

Trust calibration -- aligning user trust judgment with model capability -- is crucial for safe deployment of explainable AI (XAI), yet is often evaluated via global trust ratings detached from objective performance evidence. We present a preregistered, incentivized between-subject online study (N=418 representative UK sample) on explainable skin-lesion classification that disentangles expectation-setting from experienced performance. Participants completed 15 case evaluations using a fixed XAI panel (malignancy score, reliability score, and saliency map). We systematically manipulated five experimental onboarding conditions varying example-based information and limitation disclosures with five stimulus packages naturally varying observed prediction quality. Calibration was operationalized as the deviation between trust-related judgments (TAIS and case-wise ratings) and objective performance benchmarks for the encountered cases, analysed with hierarchical mixed-effects models. Only limitation disclosure for case-wise measures reliably impacts trust calibration, and short-term experience did not yield progressive calibration. Further, the experienced package of stimuli explained substantially more variance than the experimental manipulation. However, participants were hard-pressed to differentiate between case-wise perceived trust, trustworthiness, and accuracy estimation. We discuss implications for designing limitation communication and for measuring and analysing calibration metrics in XAI evaluations. All study materials and data of this study are publicly available for replication and further academic use.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 2 minor

Summary. The paper presents a preregistered, incentivized between-subjects online experiment (N=418, representative UK sample) investigating trust calibration in an XAI system for skin-lesion classification. Participants evaluated 15 cases using a fixed panel of malignancy score, reliability score, and saliency map. Five onboarding conditions manipulated example-based information and limitation disclosures, crossed with five stimulus packages that varied in observed prediction quality. Calibration was defined as the deviation between trust-related judgments (TAIS global scale and three case-wise ratings) and objective correctness on the encountered cases, analyzed via hierarchical mixed-effects models. The central claims are that only limitation disclosure reliably affects case-wise trust calibration, short-term experience does not produce progressive calibration, and the stimulus package accounts for substantially more variance than the experimental manipulation. The authors also report that participants struggled to differentiate the case-wise trust, trustworthiness, and accuracy judgments.

Significance. If the results hold, the work provides evidence that explicit limitation disclosure can improve case-wise trust calibration in XAI while showing that brief case experience alone is insufficient. The finding that stimulus package explains more variance than the onboarding manipulation is a useful reminder that case difficulty and individual differences often dominate experimental effects in trust studies. The preregistered design, large representative sample, incentivized task, hierarchical modeling, and public release of materials and data are clear strengths that enhance credibility and enable future replication or extension.
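
The claim that the stimulus package explains more variance than the onboarding manipulation can be illustrated with a simple one-way eta-squared per grouping factor. The paper itself uses hierarchical mixed-effects models, so this is a deliberately simplified stand-in; the toy data, the labels, and the function name `eta_squared` are invented for the example.

```python
from collections import defaultdict
from statistics import mean


def eta_squared(values, groups):
    """Share of total variance in `values` explained by group membership.

    Computed as between-group sum of squares divided by total sum of squares.
    """
    grand = mean(values)
    total_ss = sum((v - grand) ** 2 for v in values)
    by_group = defaultdict(list)
    for v, g in zip(values, groups):
        by_group[g].append(v)
    between_ss = sum(len(vs) * (mean(vs) - grand) ** 2 for vs in by_group.values())
    return between_ss / total_ss if total_ss else 0.0


# Invented per-participant deviation scores with placeholder package/condition labels.
deviation = [0.10, 0.12, 0.30, 0.32, 0.11, 0.13, 0.31, 0.33]
package = ["P1", "P1", "P2", "P2", "P1", "P1", "P2", "P2"]
condition = ["C1", "C1", "C1", "C1", "C2", "C2", "C2", "C2"]

eta_package = eta_squared(deviation, package)
eta_condition = eta_squared(deviation, condition)
print(round(eta_package, 3), round(eta_condition, 3))
```

In this toy data nearly all variance follows the package labels and almost none follows the condition labels, which mirrors the qualitative pattern the report describes without reproducing its numbers.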

major comments (1)
  1. Results section (discussion of case-wise ratings): The manuscript reports that participants were hard-pressed to differentiate the three case-wise judgments (trust, trustworthiness, and accuracy estimation). Because the headline claim—that only limitation disclosure reliably impacts trust calibration—treats these ratings as separable constructs whose deviations from objective performance can be interpreted distinctly, the reported lack of differentiation raises a direct threat to construct validity. It is unclear whether the observed condition effects reflect calibrated trust or a single undifferentiated impression or shared method variance. This issue is load-bearing for the specificity of the central finding and requires either additional analyses (e.g., factor analysis or correlation matrices among the three ratings) or a revised interpretation in the discussion.
minor comments (2)
  1. Abstract and Methods: The acronym TAIS is used without expansion on first mention; spelling out 'Trust in AI Scale' (or the precise instrument name) would improve readability for readers outside the immediate subfield.
  2. Discussion: The claim that 'short-term experience did not yield progressive calibration' would benefit from a brief clarification of how 'progressive' was operationalized (e.g., trial-by-trial improvement in deviation scores) to allow readers to assess the strength of the null result.
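
One way to make the "progressive" operationalization raised in the second minor comment concrete is the slope of absolute deviation over trial index: a negative slope would indicate that calibration improves with experience. This is a hypothetical operationalization for illustration, not the one the manuscript reports; the function names and both example series are invented.

```python
def ols_slope(xs, ys):
    """Ordinary least-squares slope of ys regressed on xs."""
    n = len(xs)
    if n < 2 or n != len(ys):
        raise ValueError("need at least two paired observations")
    mx = sum(xs) / n
    my = sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    return sxy / sxx


def calibration_trend(abs_deviations):
    """Slope of absolute deviation across trials 1..n for one participant."""
    trials = list(range(1, len(abs_deviations) + 1))
    return ols_slope(trials, abs_deviations)


# Invented series over 15 trials: one that shrinks steadily, one that stays flat.
improving = [0.50 - 0.02 * t for t in range(15)]
flat = [0.30] * 15

improving_slope = calibration_trend(improving)
flat_slope = calibration_trend(flat)
print(round(improving_slope, 3), round(flat_slope, 3))
```

A null result on progressive calibration would correspond to slopes like the flat series, with per-participant trends that are not reliably below zero.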

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for their constructive feedback and for recognizing the strengths of our preregistered design, sample, and modeling approach. We address the single major comment below and outline revisions that directly respond to the concern about construct validity while preserving the integrity of our reported findings.

Point-by-point responses
  1. Referee: Results section (discussion of case-wise ratings): The manuscript reports that participants were hard-pressed to differentiate the three case-wise judgments (trust, trustworthiness, and accuracy estimation). Because the headline claim—that only limitation disclosure reliably impacts trust calibration—treats these ratings as separable constructs whose deviations from objective performance can be interpreted distinctly, the reported lack of differentiation raises a direct threat to construct validity. It is unclear whether the observed condition effects reflect calibrated trust or a single undifferentiated impression or shared method variance. This issue is load-bearing for the specificity of the central finding and requires either additional analyses (e.g., factor analysis or correlation matrices among the three ratings) or a revised interpretation in the discussion.

    Authors: We appreciate the referee drawing attention to this measurement challenge, which we already flag in the manuscript as participants being 'hard-pressed to differentiate' the three case-wise ratings. This observation is consistent with broader literature on the difficulty of isolating distinct trust constructs in applied XAI settings. To strengthen the manuscript, we will add supplementary analyses in the revised Results section: (1) a correlation matrix among the three case-wise ratings (trust, trustworthiness, accuracy estimation) and (2) an exploratory factor analysis to quantify shared variance and potential single-factor structure. These will be reported alongside the existing hierarchical models. While the high intercorrelations indicate some shared method variance, the models still detected a selective effect of the limitation-disclosure condition on calibration for the trust-related measures (but not uniformly across all judgments), which we interpret as partial evidence for specificity. In the Discussion we will expand the interpretation to explicitly address construct overlap, note that the headline claim concerns case-wise calibration broadly rather than narrowly differentiated trust, and discuss implications for future XAI trust measurement. These additions will not alter the core empirical results but will improve transparency and mitigate the validity concern. revision: yes
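
The correlation matrix among the three case-wise ratings that the rebuttal promises could be computed along the following lines. This sketch covers only the pairwise Pearson correlations, not the exploratory factor analysis; the ratings are invented and the column names are placeholders rather than the study's variable names.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long numeric sequences."""
    n = len(xs)
    if n < 2 or n != len(ys):
        raise ValueError("need at least two paired observations")
    mx = sum(xs) / n
    my = sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


def correlation_matrix(columns):
    """Nested dict of pairwise Pearson correlations between named columns."""
    names = list(columns)
    return {a: {b: pearson(columns[a], columns[b]) for b in names} for a in names}


# Invented case-wise ratings for six cases; column names are placeholders.
case_ratings = {
    "trust": [5, 6, 3, 2, 7, 4],
    "trustworthiness": [5, 6, 3, 3, 7, 4],
    "accuracy_estimate": [4, 6, 3, 2, 6, 5],
}
matrix = correlation_matrix(case_ratings)
for name, row in matrix.items():
    print(name, {other: round(r, 2) for other, r in row.items()})
```

Off-diagonal values close to 1, as in this invented data, would be the pattern consistent with participants forming a single undifferentiated impression rather than three separable judgments.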

Circularity Check

0 steps flagged

Empirical analysis self-contained with no circular derivations

full rationale

The manuscript reports results from a preregistered between-subjects experiment (N=418) that operationalizes calibration as the deviation between trust-related judgments (TAIS and case-wise ratings) and independent objective performance benchmarks on 15 encountered cases. These deviations are analyzed via hierarchical mixed-effects models comparing five onboarding conditions and five stimulus packages. No equations, fitted parameters, or derivations reduce the reported effects (e.g., limitation disclosure impacting case-wise measures, or stimulus package explaining more variance) to inputs by construction. The manuscript contains no load-bearing self-citations, uniqueness theorems, or ansatzes; the noted difficulty participants had differentiating case-wise ratings is a measurement-validity concern rather than a circular reduction in the derivation chain. The central findings therefore rest on direct empirical comparisons against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claim rests on the assumption that the UK online sample behaves like lay users in real deployment contexts and that the chosen trust and performance metrics validly operationalize calibration without substantial construct overlap.

axioms (2)
  • domain assumption: The representative UK sample generalizes to lay users of medical XAI systems.
    The study recruits a representative UK sample but treats findings as applicable to broader lay-user populations.
  • domain assumption: TAIS and case-wise ratings measure distinct trust constructs that can be meaningfully compared to objective accuracy.
    Calibration is defined as deviation between these judgments and performance benchmarks.



