Exploring Trust Calibration in XAI - The Impact of Exposing Model Limitations to Lay Users
Pith reviewed 2026-05-20 09:15 UTC · model grok-4.3
The pith
Disclosing model limitations improves trust calibration only for case-specific judgments in an XAI skin-lesion classifier.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
In the preregistered between-subjects study with 418 participants, only the limitation-disclosure onboarding condition reliably reduced the deviation between case-wise trust judgments and objective performance on the 15 encountered cases. Global trust measures and short-term experience showed no progressive calibration effect. The particular stimulus package each participant received accounted for substantially more variance in calibration outcomes than the experimental manipulation of onboarding information.
What carries the argument
Deviation score between trust-related judgments (TAIS scale and case-wise ratings) and objective correctness on the 15 encountered cases, produced by five onboarding conditions that combine example information with limitation disclosures.
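As a rough illustration of the deviation score described above, the sketch below compares a participant's case-wise ratings with the model's correctness on the same cases. It assumes, hypothetically, that ratings sit on a 0 to 100 scale and that correctness is binary; the function names and the absolute-difference form are illustrative choices, not the paper's documented operationalization.

```python
from statistics import mean


def casewise_deviation(rating, correct, scale_max=100.0):
    """Absolute gap between a rating rescaled to [0, 1] and binary correctness.

    Assumption: 0 means perfectly calibrated on this case, 1 means maximally off.
    """
    target = 1.0 if correct else 0.0
    return abs(rating / scale_max - target)


def participant_deviation(ratings, correctness, scale_max=100.0):
    """Mean case-wise deviation over the cases one participant encountered."""
    return mean(
        casewise_deviation(r, c, scale_max) for r, c in zip(ratings, correctness)
    )


if __name__ == "__main__":
    # Made-up ratings for three cases, for illustration only.
    ratings = [90, 20, 60]
    correctness = [True, False, True]
    print(participant_deviation(ratings, correctness))
```

A lower value would indicate judgments that track correctness more closely; a signed difference could be used instead if over-trust and under-trust need to be told apart.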
If this is right
- Case-specific limitation disclosures should be prioritized in XAI interfaces to improve alignment of user trust with actual performance.
- Short-term interaction alone is unlikely to produce better calibration, so designers cannot rely on exposure to fix miscalibration.
- Stimulus selection and case difficulty must be controlled or reported in future XAI calibration studies because they dominate variance.
- Measurement tools need refinement because users do not clearly separate perceived trust, trustworthiness, and accuracy estimates.
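The point about stimulus packages dominating variance can be made concrete with a simple variance-share calculation. The sketch below computes eta squared (between-group sum of squares over total sum of squares) for one grouping factor at a time. This is a deliberate simplification and not the hierarchical mixed-effects analysis the paper reports; the function name and the toy data layout are assumptions.

```python
from collections import defaultdict
from statistics import mean


def variance_share(values, groups):
    """Eta squared: share of total sum of squares explained by group means."""
    grand = mean(values)
    ss_total = sum((v - grand) ** 2 for v in values)
    if ss_total == 0:
        return 0.0  # no variance to explain
    by_group = defaultdict(list)
    for value, group in zip(values, groups):
        by_group[group].append(value)
    ss_between = sum(
        len(vs) * (mean(vs) - grand) ** 2 for vs in by_group.values()
    )
    return ss_between / ss_total


if __name__ == "__main__":
    # Made-up deviation scores for four participants, for illustration only.
    deviations = [0.1, 0.1, 0.5, 0.5]
    package = ["A", "A", "B", "B"]
    condition = ["c1", "c2", "c1", "c2"]
    print("package share:", variance_share(deviations, package))
    print("condition share:", variance_share(deviations, condition))
```

Reporting both shares side by side is one way a calibration study could show how much of the outcome is driven by which cases a participant happened to see.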
Where Pith is reading between the lines
- Dynamic, per-case limitation statements tied to model confidence could produce stronger calibration effects than static onboarding.
- Evaluations of XAI systems should randomize or balance case difficulty across participants to isolate the effect of communication strategies.
- Longer or repeated exposure sessions may be required to test whether progressive calibration can emerge beyond short-term use.
Load-bearing premise
The chosen correctness scores on the 15 cases provide a valid external benchmark for how far trust judgments deviate from actual model performance, and the trust and accuracy ratings capture distinct intended constructs without large measurement error or overlap.
What would settle it
A replication in which the same limitation-disclosure condition produces no reduction in deviation between case-wise trust ratings and model correctness when a different set of 15 cases or a different validated trust scale is used.
Original abstract
Trust calibration -- aligning user trust judgment with model capability -- is crucial for safe deployment of explainable AI (XAI), yet is often evaluated via global trust ratings detached from objective performance evidence. We present a preregistered, incentivized between-subject online study (N=418 representative UK sample) on explainable skin-lesion classification that disentangles expectation-setting from experienced performance. Participants completed 15 case evaluations using a fixed XAI panel (malignancy score, reliability score, and saliency map). We systematically manipulated five experimental onboarding conditions varying example-based information and limitation disclosures with five stimulus packages naturally varying observed prediction quality. Calibration was operationalized as the deviation between trust-related judgments (TAIS and case-wise ratings) and objective performance benchmarks for the encountered cases, analysed with hierarchical mixed-effects models. Only limitation disclosure for case-wise measures reliably impacts trust calibration, and short-term experience did not yield progressive calibration. Further, the experienced package of stimuli explained substantially more variance than the experimental manipulation. However, participants were hard-pressed to differentiate between case-wise perceived trust, trustworthiness, and accuracy estimation. We discuss implications for designing limitation communication and for measuring and analysing calibration metrics in XAI evaluations. All study materials and data of this study are publicly available for replication and further academic use.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper presents a preregistered, incentivized between-subjects online experiment (N=418, representative UK sample) investigating trust calibration in an XAI system for skin-lesion classification. Participants evaluated 15 cases using a fixed panel of malignancy score, reliability score, and saliency map. Five onboarding conditions manipulated example-based information and limitation disclosures, crossed with five stimulus packages that varied in observed prediction quality. Calibration was defined as the deviation between trust-related judgments (TAIS global scale and three case-wise ratings) and objective correctness on the encountered cases, analyzed via hierarchical mixed-effects models. The central claims are that only limitation disclosure reliably affects case-wise trust calibration, short-term experience does not produce progressive calibration, and the stimulus package accounts for substantially more variance than the experimental manipulation. The authors also report that participants struggled to differentiate the case-wise trust, trustworthiness, and accuracy judgments.
Significance. If the results hold, the work provides evidence that explicit limitation disclosure can improve case-wise trust calibration in XAI while showing that brief case experience alone is insufficient. The finding that stimulus package explains more variance than the onboarding manipulation is a useful reminder that case difficulty and individual differences often dominate experimental effects in trust studies. The preregistered design, large representative sample, incentivized task, hierarchical modeling, and public release of materials and data are clear strengths that enhance credibility and enable future replication or extension.
major comments (1)
- Results section (discussion of case-wise ratings): The manuscript reports that participants were hard-pressed to differentiate the three case-wise judgments (trust, trustworthiness, and accuracy estimation). Because the headline claim—that only limitation disclosure reliably impacts trust calibration—treats these ratings as separable constructs whose deviations from objective performance can be interpreted distinctly, the reported lack of differentiation raises a direct threat to construct validity. It is unclear whether the observed condition effects reflect calibrated trust or a single undifferentiated impression or shared method variance. This issue is load-bearing for the specificity of the central finding and requires either additional analyses (e.g., factor analysis or correlation matrices among the three ratings) or a revised interpretation in the discussion.
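The correlation matrix suggested in this comment could be sketched as follows. The code computes pairwise Pearson correlations among the three case-wise ratings; the column names and the toy values are hypothetical, and the paper's actual data layout is not known here.

```python
from math import sqrt
from statistics import mean


def pearson(x, y):
    """Pearson correlation of two equal-length sequences with nonzero variance."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def correlation_matrix(columns):
    """Nested dict of pairwise correlations for a dict of named rating columns."""
    names = list(columns)
    return {a: {b: pearson(columns[a], columns[b]) for b in names} for a in names}


if __name__ == "__main__":
    # Made-up case-wise ratings, for illustration only.
    ratings = {
        "trust": [70, 40, 85, 55],
        "trustworthiness": [65, 45, 80, 60],
        "accuracy_estimate": [75, 35, 90, 50],
    }
    for name, row in correlation_matrix(ratings).items():
        print(name, {k: round(v, 2) for k, v in row.items()})
```

Off-diagonal values close to 1 would be consistent with the concern that participants formed a single undifferentiated impression rather than three distinct judgments.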
minor comments (2)
- Abstract and Methods: The acronym TAIS is used without expansion on first mention; spelling out 'Trust in AI Scale' (or the precise instrument name) would improve readability for readers outside the immediate subfield.
- Discussion: The claim that 'short-term experience did not yield progressive calibration' would benefit from a brief clarification of how 'progressive' was operationalized (e.g., trial-by-trial improvement in deviation scores) to allow readers to assess the strength of the null result.
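One possible reading of "progressive" calibration, following the trial-by-trial example in the comment above, is a negative trend in deviation scores across the 15 cases. The sketch below fits an ordinary least-squares slope of deviation on trial index for one participant; this is an assumed operationalization for illustration, not the one the paper necessarily used, and the function name is hypothetical.

```python
def trial_slope(deviations):
    """OLS slope of deviation on trial index (1, 2, ..., n); needs n >= 2.

    A negative slope means deviations shrink over trials.
    """
    n = len(deviations)
    trials = range(1, n + 1)
    t_mean = sum(trials) / n
    d_mean = sum(deviations) / n
    num = sum((t - t_mean) * (d - d_mean) for t, d in zip(trials, deviations))
    den = sum((t - t_mean) ** 2 for t in trials)
    return num / den


if __name__ == "__main__":
    # Made-up deviation scores over five trials, for illustration only.
    print(trial_slope([0.5, 0.45, 0.5, 0.4, 0.42]))
```

Under this reading, the reported null result would correspond to slopes that do not differ reliably from zero across participants.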
Simulated Author's Rebuttal
We thank the referee for their constructive feedback and for recognizing the strengths of our preregistered design, sample, and modeling approach. We address the single major comment below and outline revisions that directly respond to the concern about construct validity while preserving the integrity of our reported findings.
Point-by-point responses
-
Referee: Results section (discussion of case-wise ratings): The manuscript reports that participants were hard-pressed to differentiate the three case-wise judgments (trust, trustworthiness, and accuracy estimation). Because the headline claim—that only limitation disclosure reliably impacts trust calibration—treats these ratings as separable constructs whose deviations from objective performance can be interpreted distinctly, the reported lack of differentiation raises a direct threat to construct validity. It is unclear whether the observed condition effects reflect calibrated trust or a single undifferentiated impression or shared method variance. This issue is load-bearing for the specificity of the central finding and requires either additional analyses (e.g., factor analysis or correlation matrices among the three ratings) or a revised interpretation in the discussion.
Authors: We appreciate the referee drawing attention to this measurement challenge, which we already flag in the manuscript as participants being 'hard-pressed to differentiate' the three case-wise ratings. This observation is consistent with broader literature on the difficulty of isolating distinct trust constructs in applied XAI settings. To strengthen the manuscript, we will add supplementary analyses in the revised Results section: (1) a correlation matrix among the three case-wise ratings (trust, trustworthiness, accuracy estimation) and (2) an exploratory factor analysis to quantify shared variance and potential single-factor structure. These will be reported alongside the existing hierarchical models. While the high intercorrelations indicate some shared method variance, the models still detected a selective effect of the limitation-disclosure condition on calibration for the trust-related measures (but not uniformly across all judgments), which we interpret as partial evidence for specificity. In the Discussion we will expand the interpretation to explicitly address construct overlap, note that the headline claim concerns case-wise calibration broadly rather than narrowly differentiated trust, and discuss implications for future XAI trust measurement. These additions will not alter the core empirical results but will improve transparency and mitigate the validity concern.
Revision: yes
Circularity Check
Empirical analysis self-contained with no circular derivations
Full rationale
The manuscript reports results from a preregistered between-subjects experiment (N=418) that operationalizes calibration as the deviation between trust-related judgments (TAIS and case-wise ratings) and independent objective performance benchmarks on 15 encountered cases. These deviations are analyzed via hierarchical mixed-effects models comparing five onboarding conditions and five stimulus packages. No equations, fitted parameters, or derivations reduce the reported effects (e.g., limitation disclosure impacting case-wise measures, or stimulus package explaining more variance) to inputs by construction. The manuscript contains no load-bearing self-citations, uniqueness theorems, or ansatzes; the noted difficulty participants had differentiating case-wise ratings is a measurement-validity concern rather than a circular reduction in the derivation chain. The central findings therefore rest on direct empirical comparisons against external benchmarks.
Axiom & Free-Parameter Ledger
axioms (2)
- domain assumption: The representative UK sample generalizes to lay users of medical XAI systems
- domain assumption: TAIS and case-wise ratings measure distinct trust constructs that can be meaningfully compared to objective accuracy
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean: washburn_uniqueness_aczel (tag: unclear)
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Passage: "Calibration was operationalized as the deviation between trust-related judgments (TAIS and case-wise ratings) and objective performance benchmarks for the encountered cases, analysed with hierarchical mixed-effects models."
- IndisputableMonolith/Foundation/RealityFromDistinction.lean: reality_from_one_distinction (tag: unclear)
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Passage: "Only limitation disclosure for case-wise measures reliably impacts trust calibration"
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.