Exploring Trust Calibration in XAI - The Impact of Exposing Model Limitations to Lay Users

Alfio Ventura; Jan Corazza; Mustafa Yalçıner; Tim Katzke

arxiv: 2605.18036 · v1 · pith:AFMZRHSCnew · submitted 2026-05-18 · 💻 cs.HC · cs.AI


Pith reviewed 2026-05-20 09:15 UTC · model grok-4.3

classification 💻 cs.HC cs.AI

keywords: trust calibration, explainable AI, XAI, limitation disclosure, skin lesion classification, human-AI interaction, onboarding, user study


The pith

Disclosing model limitations improves trust calibration only for case-specific judgments in an XAI skin-lesion classifier.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests whether different ways of telling users about an AI model's limits at the start help them judge its trustworthiness more accurately over repeated uses. Participants evaluated 15 real skin lesion cases with a fixed set of explanations and rated their trust in each one as well as overall. The study finds that only the onboarding condition that explicitly disclosed limitations produced better alignment between those ratings and the model's actual correctness on the cases seen. The specific mix of easy and hard cases each person encountered explained more of the differences than any of the onboarding messages did, and short sessions did not produce steady improvement in calibration.

Core claim

In the preregistered between-subjects study with 418 participants, only the limitation-disclosure onboarding condition reliably reduced the deviation between case-wise trust judgments and objective performance on the 15 encountered cases. Global trust measures showed no reliable calibration effect of onboarding, and short-term experience did not produce progressive calibration. The particular stimulus package each participant received accounted for substantially more variance in calibration outcomes than the experimental manipulation of onboarding information.
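The variance comparison in this claim can be illustrated with a simplified stand-in. The sketch below is not the authors' analysis (the paper reports hierarchical mixed-effects models); it computes a plain eta-squared, the share of total variance in a deviation score explained by one grouping factor. The function name `eta_squared`, the labels, and all numbers are hypothetical.

```python
from collections import defaultdict
from statistics import mean


def eta_squared(values, groups):
    """Share of the total variance in `values` explained by `groups`."""
    grand_mean = mean(values)
    ss_total = sum((v - grand_mean) ** 2 for v in values)
    if ss_total == 0:
        return 0.0  # no variance at all, so nothing to explain
    by_group = defaultdict(list)
    for value, group in zip(values, groups):
        by_group[group].append(value)
    ss_between = sum(
        len(vs) * (mean(vs) - grand_mean) ** 2 for vs in by_group.values()
    )
    return ss_between / ss_total


# Hypothetical absolute deviation scores for four participants.
deviations = [0.10, 0.12, 0.30, 0.32]
package = ["pkg_a", "pkg_a", "pkg_b", "pkg_b"]        # assumed stimulus packages
condition = ["cond_a", "cond_b", "cond_a", "cond_b"]  # assumed onboarding conditions

eta_package = eta_squared(deviations, package)
eta_condition = eta_squared(deviations, condition)
print(f"package: {eta_package:.3f}, condition: {eta_condition:.3f}")
```

In this toy data the package factor explains almost all of the variance, which mirrors the direction of the reported finding but says nothing about its actual size.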

What carries the argument

A deviation score between trust-related judgments (the TAIS scale and case-wise ratings) and objective correctness on the 15 encountered cases, compared across five onboarding conditions that vary example information and limitation disclosures.
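A deviation score of this kind can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's exact formula: ratings are rescaled to the range 0 to 1, the benchmark is the share of encountered cases the model classified correctly, and the function name `calibration_deviation` and the sample values are hypothetical.

```python
from statistics import mean


def calibration_deviation(ratings, correct, scale_max=1.0):
    """Return (signed, absolute) deviation of mean trust from observed accuracy.

    A positive signed value indicates over-trust, a negative one under-trust.
    """
    if len(ratings) != len(correct) or not ratings:
        raise ValueError("ratings and correct must be non-empty and equally long")
    mean_trust = mean(r / scale_max for r in ratings)
    accuracy = mean(1.0 if c else 0.0 for c in correct)
    signed = mean_trust - accuracy
    return signed, abs(signed)


# Hypothetical participant: 15 case-wise ratings on a 0-100 scale,
# with the model correct on 12 of the 15 encountered cases.
ratings = [90] * 15
correct = [True] * 12 + [False] * 3
signed, absolute = calibration_deviation(ratings, correct, scale_max=100)
print(f"signed deviation: {signed:+.2f}, absolute deviation: {absolute:.2f}")
```

Reporting the signed and absolute values together keeps over-trust and under-trust distinguishable while still giving a single magnitude to compare across conditions.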

If this is right

Case-specific limitation disclosures should be prioritized in XAI interfaces to improve alignment of user trust with actual performance.
Short-term interaction alone is unlikely to produce better calibration, so designers cannot rely on exposure to fix miscalibration.
Stimulus selection and case difficulty must be controlled or reported in future XAI calibration studies because they dominate variance.
Measurement tools need refinement because users do not clearly separate perceived trust, trustworthiness, and accuracy estimates.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Dynamic, per-case limitation statements tied to model confidence could produce stronger calibration effects than static onboarding.
Evaluations of XAI systems should randomize or balance case difficulty across participants to isolate the effect of communication strategies.
Longer or repeated exposure sessions may be required to test whether progressive calibration can emerge beyond short-term use.
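The balancing idea in the second extension above can be sketched as a stratified deal of cases into packages. This is an alternative design, not what the study did (its packages naturally varied in prediction quality), and the function name, case identifiers, and counts are hypothetical.

```python
import random


def build_balanced_packages(cases, n_packages, seed=0):
    """Deal (case_id, stratum) pairs into packages, stratum by stratum.

    Each package receives a near-equal share of every stratum, for example
    the same number of correctly and incorrectly classified cases.
    """
    rng = random.Random(seed)
    strata = {}
    for case_id, stratum in cases:
        strata.setdefault(stratum, []).append(case_id)
    packages = [[] for _ in range(n_packages)]
    offset = 0  # continue dealing where the previous stratum stopped
    for stratum in sorted(strata):
        ids = strata[stratum]
        rng.shuffle(ids)
        for i, case_id in enumerate(ids):
            packages[(offset + i) % n_packages].append(case_id)
        offset += len(ids)
    return packages


# Hypothetical pool: 60 cases the model got right, 15 it got wrong.
pool = [(f"c{i}", "correct") for i in range(60)]
pool += [(f"w{i}", "wrong") for i in range(15)]
packages = build_balanced_packages(pool, n_packages=5, seed=1)
for number, package in enumerate(packages, start=1):
    n_wrong = sum(1 for case_id in package if case_id.startswith("w"))
    print(f"package {number}: {len(package)} cases, {n_wrong} model errors")
```

With a pool like this, every package exposes participants to the same observed accuracy, so differences in calibration can be attributed more cleanly to the onboarding condition.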

Load-bearing premise

The chosen correctness scores on the 15 cases provide a valid external benchmark for how far trust judgments deviate from actual model performance, and the trust and accuracy ratings capture distinct intended constructs without large measurement error or overlap.

What would settle it

A replication in which the same limitation-disclosure condition produces no reduction in deviation between case-wise trust ratings and model correctness when a different set of 15 cases or a different validated trust scale is used.

Figures

Figures reproduced from arXiv: 2605.18036 by Alfio Ventura, Jan Corazza, Mustafa Yalçıner, Tim Katzke.

**Figure 1.** Graphical overview of our overarching study concept, the major survey procedure steps with their key components, and the core analyses performed.

**Figure 2.** Example prediction presented to participants in the experimental group Limitation Condition introduction. Participants in the Low-Information group did not see such example predictions; the other groups were only presented with the original image on the left without any manipulations.

**Figure 3.** TAIS T3 trust assessment by experimental condition. The "objective" trustworthiness indicators are added as reference lines, either as general system-wide indicators or as mean experienced indicators based on our stimulus packages.

**Figure 4.** On-task assessments of combined trust and trustworthiness (top) and estimated accuracy (bottom) by experimental condition. The "objective" trustworthiness indicators are shown as reference lines, either as system-wide indicators or as mean experienced indicators based on the stimulus packages.

**Figure 5.** Combined trust and trustworthiness on-task assessment over time. The "objective" trustworthiness indicators are added as reference lines, either as general system-wide indicators or as mean experienced indicators based on our stimulus packages. The shaded band shows the 95% CI around the measures.

Original abstract

Trust calibration -- aligning user trust judgment with model capability -- is crucial for safe deployment of explainable AI (XAI), yet is often evaluated via global trust ratings detached from objective performance evidence. We present a preregistered, incentivized between-subject online study (N=418 representative UK sample) on explainable skin-lesion classification that disentangles expectation-setting from experienced performance. Participants completed 15 case evaluations using a fixed XAI panel (malignancy score, reliability score, and saliency map). We systematically manipulated five experimental onboarding conditions varying example-based information and limitation disclosures with five stimulus packages naturally varying observed prediction quality. Calibration was operationalized as the deviation between trust-related judgments (TAIS and case-wise ratings) and objective performance benchmarks for the encountered cases, analysed with hierarchical mixed-effects models. Only limitation disclosure for case-wise measures reliably impacts trust calibration, and short-term experience did not yield progressive calibration. Further, the experienced package of stimuli explained substantially more variance than the experimental manipulation. However, participants were hard-pressed to differentiate between case-wise perceived trust, trustworthiness, and accuracy estimation. We discuss implications for designing limitation communication and for measuring and analysing calibration metrics in XAI evaluations. All study materials and data of this study are publicly available for replication and further academic use.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 2 minor

Summary. The paper presents a preregistered, incentivized between-subjects online experiment (N=418, representative UK sample) investigating trust calibration in an XAI system for skin-lesion classification. Participants evaluated 15 cases using a fixed panel of malignancy score, reliability score, and saliency map. Five onboarding conditions manipulated example-based information and limitation disclosures, crossed with five stimulus packages that varied in observed prediction quality. Calibration was defined as the deviation between trust-related judgments (TAIS global scale and three case-wise ratings) and objective correctness on the encountered cases, analyzed via hierarchical mixed-effects models. The central claims are that only limitation disclosure reliably affects case-wise trust calibration, short-term experience does not produce progressive calibration, and the stimulus package accounts for substantially more variance than the experimental manipulation. The authors also report that participants struggled to differentiate the case-wise trust, trustworthiness, and accuracy judgments.

Significance. If the results hold, the work provides evidence that explicit limitation disclosure can improve case-wise trust calibration in XAI while showing that brief case experience alone is insufficient. The finding that stimulus package explains more variance than the onboarding manipulation is a useful reminder that case difficulty and individual differences often dominate experimental effects in trust studies. The preregistered design, large representative sample, incentivized task, hierarchical modeling, and public release of materials and data are clear strengths that enhance credibility and enable future replication or extension.

major comments (1)

Results section (discussion of case-wise ratings): The manuscript reports that participants were hard-pressed to differentiate the three case-wise judgments (trust, trustworthiness, and accuracy estimation). Because the headline claim—that only limitation disclosure reliably impacts trust calibration—treats these ratings as separable constructs whose deviations from objective performance can be interpreted distinctly, the reported lack of differentiation raises a direct threat to construct validity. It is unclear whether the observed condition effects reflect calibrated trust or a single undifferentiated impression or shared method variance. This issue is load-bearing for the specificity of the central finding and requires either additional analyses (e.g., factor analysis or correlation matrices among the three ratings) or a revised interpretation in the discussion.
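The correlation matrix requested in this comment could be computed along the following lines. This is a minimal sketch in plain Python; the column names and the six rating triples are hypothetical and do not come from the study data.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for constant input")
    return sxy / sqrt(sxx * syy)


def correlation_matrix(columns):
    """Pairwise Pearson correlations for a dict of name -> values."""
    names = list(columns)
    return {a: {b: pearson(columns[a], columns[b]) for b in names} for a in names}


# Hypothetical case-wise ratings from one participant.
case_ratings = {
    "trust": [4, 5, 2, 3, 5, 1],
    "trustworthiness": [4, 4, 2, 3, 5, 2],
    "accuracy_estimate": [70, 80, 40, 55, 90, 30],
}
matrix = correlation_matrix(case_ratings)
for name, row in matrix.items():
    print(name, {other: round(r, 2) for other, r in row.items()})
```

Correlations close to 1 across all three pairs would be consistent with a single undifferentiated impression, which is the concern the comment raises. A factor analysis would be the natural next step.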

minor comments (2)

Abstract and Methods: The acronym TAIS is used without expansion on first mention; spelling out 'Trust in AI Scale' (or the precise instrument name) would improve readability for readers outside the immediate subfield.
Discussion: The claim that 'short-term experience did not yield progressive calibration' would benefit from a brief clarification of how 'progressive' was operationalized (e.g., trial-by-trial improvement in deviation scores) to allow readers to assess the strength of the null result.
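One possible operationalization of "progressive" in the sense of the last comment is the slope of per-case absolute deviation over trial order. The sketch below assumes ratings rescaled to the range 0 to 1 and binary correctness per case. It is not taken from the paper, and the names and values are hypothetical.

```python
def absolute_deviations(ratings, correct):
    """Per-case gap between a 0-1 trust rating and the case's correctness."""
    return [abs(r - (1.0 if c else 0.0)) for r, c in zip(ratings, correct)]


def ols_slope(y):
    """Least-squares slope of y regressed on trial index 0, 1, 2, ..."""
    n = len(y)
    if n < 2:
        raise ValueError("need at least two trials")
    x_mean = (n - 1) / 2
    y_mean = sum(y) / n
    sxy = sum((i - x_mean) * (v - y_mean) for i, v in enumerate(y))
    sxx = sum((i - x_mean) ** 2 for i in range(n))
    return sxy / sxx


# Hypothetical participant whose ratings track correctness better over time.
ratings = [0.5, 0.6, 0.4, 0.8, 0.2, 0.9]
correct = [True, True, False, True, False, True]
deviations = absolute_deviations(ratings, correct)
slope = ols_slope(deviations)
print(f"deviation slope per trial: {slope:+.3f}")
```

A reliably negative slope would indicate progressive calibration; a slope near zero corresponds to the null result the comment asks the authors to qualify.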

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for their constructive feedback and for recognizing the strengths of our preregistered design, sample, and modeling approach. We address the single major comment below and outline revisions that directly respond to the concern about construct validity while preserving the integrity of our reported findings.


Referee: Results section (discussion of case-wise ratings): The manuscript reports that participants were hard-pressed to differentiate the three case-wise judgments (trust, trustworthiness, and accuracy estimation). Because the headline claim—that only limitation disclosure reliably impacts trust calibration—treats these ratings as separable constructs whose deviations from objective performance can be interpreted distinctly, the reported lack of differentiation raises a direct threat to construct validity. It is unclear whether the observed condition effects reflect calibrated trust or a single undifferentiated impression or shared method variance. This issue is load-bearing for the specificity of the central finding and requires either additional analyses (e.g., factor analysis or correlation matrices among the three ratings) or a revised interpretation in the discussion.

Authors: We appreciate the referee drawing attention to this measurement challenge, which we already flag in the manuscript as participants being 'hard-pressed to differentiate' the three case-wise ratings. This observation is consistent with broader literature on the difficulty of isolating distinct trust constructs in applied XAI settings. To strengthen the manuscript, we will add supplementary analyses in the revised Results section: (1) a correlation matrix among the three case-wise ratings (trust, trustworthiness, accuracy estimation) and (2) an exploratory factor analysis to quantify shared variance and potential single-factor structure. These will be reported alongside the existing hierarchical models. While the high intercorrelations indicate some shared method variance, the models still detected a selective effect of the limitation-disclosure condition on calibration for the trust-related measures (but not uniformly across all judgments), which we interpret as partial evidence for specificity. In the Discussion we will expand the interpretation to explicitly address construct overlap, note that the headline claim concerns case-wise calibration broadly rather than narrowly differentiated trust, and discuss implications for future XAI trust measurement. These additions will not alter the core empirical results but will improve transparency and mitigate the validity concern.

Revision: yes

Circularity Check

0 steps flagged

Empirical analysis self-contained with no circular derivations


The manuscript reports results from a preregistered between-subjects experiment (N=418) that operationalizes calibration as the deviation between trust-related judgments (TAIS and case-wise ratings) and independent objective performance benchmarks on 15 encountered cases. These deviations are analyzed via hierarchical mixed-effects models comparing five onboarding conditions and five stimulus packages. No equations, fitted parameters, or derivations reduce the reported effects (e.g., limitation disclosure impacting case-wise measures, or stimulus package explaining more variance) to inputs by construction. The manuscript contains no load-bearing self-citations, uniqueness theorems, or ansatzes; the noted difficulty participants had differentiating case-wise ratings is a measurement-validity concern rather than a circular reduction in the derivation chain. The central findings therefore rest on direct empirical comparisons against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claim rests on the assumption that the UK online sample behaves like lay users in real deployment contexts and that the chosen trust and performance metrics validly operationalize calibration without substantial construct overlap.

axioms (2)

domain assumption: The representative UK sample generalizes to lay users of medical XAI systems.
The study recruits a representative UK sample but treats findings as applicable to broader lay-user populations.

domain assumption: TAIS and case-wise ratings measure distinct trust constructs that can be meaningfully compared to objective accuracy.
Calibration is defined as deviation between these judgments and performance benchmarks.



Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
Paper passage: Calibration was operationalized as the deviation between trust-related judgments (TAIS and case-wise ratings) and objective performance benchmarks for the encountered cases, analysed with hierarchical mixed-effects models.

IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · unclear
Paper passage: Only limitation disclosure for case-wise measures reliably impacts trust calibration

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

36 extracted references · 36 canonical work pages · 1 internal anchor

[1] Abbaspour Onari, M., Baer, G., Zhang, C., Grau, I., Nobile, M.S., Zhang, Y.: The dynamics of trust in XAI: Assessing perceived and demonstrated trust across interaction modes and risk treatments. In: Guidotti, R., Schmid, U., Longo, L. (eds.) Explainable Artificial Intelligence. xAI 2025, Communications in Computer and Information Science, vol. 2576, pp. ...
[2] Atf, Z., Lewis, P.R.: Is trust correlated with explainability in AI? A meta-analysis. IEEE Transactions on Technology and Society 7(1), 70–77 (2026)
[3] Buçinca, Z., Malaya, M.B., Gajos, K.Z.: To trust or to think: Cognitive forcing functions can reduce overreliance on AI in AI-assisted decision-making. Proceedings of the ACM on Human-Computer Interaction 5(CSCW1), 188:1–188:21 (2021)
[4] Chromik, M., Eiband, M., Buchner, F., Krüger, A., Butz, A.: I think I get your point, AI! The illusion of explanatory depth in explainable AI. In: Proceedings of the 26th International Conference on Intelligent User Interfaces. pp. 307–317. ACM (2021)
[5] Dietvorst, B.J., Simmons, J.P., Massey, C.: Algorithm aversion: People erroneously avoid algorithms after seeing them err. Journal of Experimental Psychology: General 144(1), 114–126 (2015)
[6] Directorate-General for Communications Networks, Content and Technology (European Commission), PwC: Study on eHealth, Interoperability of Health Data and Artificial Intelligence for Health and Care in the European Union – Final Study Report. Lot 2, Artificial Intelligence for Health and Care in the EU. Publications Office of the European Union (2021)
[7] Gebru, T., Morgenstern, J., Vecchione, B., Wortman Vaughan, J., Wallach, H., Daumé III, H., Crawford, K.: Datasheets for datasets. Communications of the ACM 64(12), 86–92 (2021)
[8] Grand View Research: Europe AI in healthcare market size & outlook, 2025–2030, https://www.grandviewresearch.com/horizon/outlook/ai-in-healthcare-market/europe
[9] Ha, Q., Liu, B., Liu, F.: Identifying melanoma images using EfficientNet ensemble: Winning solution to the SIIM-ISIC melanoma classification challenge. CoRR abs/2010.05351 (2020)
[10] Hoff, K.A., Bashir, M.: Trust in automation: Integrating empirical evidence on factors that influence trust. Human Factors 57(3), 407–434 (2015)
[11] Jian, J.Y., Bisantz, A.M., Drury, C.G.: Foundations for an empirically determined scale of trust in automated systems. International Journal of Cognitive Ergonomics 4(1), 53–71 (2000)
[12] Katzke, T., Yalçıner, M., Corazza, J., Ventura, A., Bündert, T.M., Müller, E.: SkinSplain: An XAI framework for trust calibration in skin lesion analysis. In: Joint Proceedings of the xAI 2025 Late-breaking Work, Demos and Doctoral Consortium. CEUR Workshop Proceedings, vol. 4017, pp. 305–312. CEUR-WS (2025)
[13] Körber, M.: Theoretical considerations and development of a questionnaire to measure trust in automation. In: Bagnara, S., Tartaglia, R., Albolino, S., Alexander, T., Fujita, Y. (eds.) Proceedings of the 20th Congress of the International Ergonomics Association (IEA 2018), pp. 13–30. Springer International Publishing, Cham (2019)
[14] Lee, J.D., See, K.A.: Trust in automation: Designing for appropriate reliance. Human Factors 46(1), 50–80 (2004)
[15] Leichtmann, B., Humer, C., Hinterreiter, A., Streit, M., Mara, M.: Effects of explainable artificial intelligence on trust and human behavior in a high-risk decision task. Computers in Human Behavior 139, 107539 (2023)
[16] Logg, J.M., Minson, J.A., Moore, D.A.: Algorithm appreciation: People prefer algorithmic to human judgment. Organizational Behavior and Human Decision Processes 151, 90–103 (2019)
[17] Ma, S., Lei, Y., Wang, X., Zheng, C., Shi, C., Yin, M., Ma, X.: Who should I trust: AI or myself? Leveraging human and AI correctness likelihood to promote appropriate trust in AI-assisted decision-making. In: Proceedings of the 2023 CHI Conference on Human Factors in Computing Systems. pp. 759:1–759:19. ACM (2023)
[18] Mitchell, M., Wu, S., Zaldivar, A., Barnes, P., Vasserman, L., Hutchinson, B., Spitzer, E., Raji, I.D., Gebru, T.: Model cards for model reporting. In: Proceedings of the 2019 Conference on Fairness, Accountability, and Transparency. pp. 220–229. ACM (2019)
[19] Naiseh, M., Cemiloglu, D., Al-Thani, D., Jiang, N., Ali, R.: Explainable recommendations and calibrated trust: Two systematic user errors. Computer 54(10), 28–37 (2021)
[20] National Cancer Institute: Moles to melanoma: Recognizing the ABCDE features, https://moles-melanoma-tool.cancer.gov/
[21] National Cancer Institute: Did you know? Melanoma cancer statistics (2014), https://www.cancer.gov/types/skin/did-you-know-melanoma-cancer-2014-video
[22] Nourani, M., Roy, C., Block, J.E., Honeycutt, D.R., Rahman, T., Ragan, E.D., Gogate, V.: Anchoring bias affects mental model formation and user reliance in explainable AI systems. In: Proceedings of the 26th International Conference on Intelligent User Interfaces. pp. 340–350. ACM (2021)
[23] Papernot, N., McDaniel, P.D.: Deep k-nearest neighbors: Towards confident, interpretable and robust deep learning. CoRR abs/1803.04765 (2018)
[24] Parasuraman, R., Riley, V.: Humans and automation: Use, misuse, disuse, abuse. Human Factors 39(2), 230–253 (1997)
[25] Rechkemmer, A., Yin, M.: When confidence meets accuracy: Exploring the effects of multiple performance indicators on trust in machine learning models. In: Proceedings of the 2022 CHI Conference on Human Factors in Computing Systems. pp. 535:1–535:14. ACM (2022)
[26] Salinas, M.P., Sepúlveda, J., Hidalgo, L., Peirano, D., Morel, M., Uribe, P., Rotemberg, V., Briones, J., Mery, D., Navarrete-Dechent, C.: A systematic review and meta-analysis of artificial intelligence versus clinicians for skin cancer diagnosis. npj Digital Medicine 7(1), 125 (2024)
[27] Skitka, L.J., Mosier, K., Burdick, M.: Accountability and automation bias. International Journal of Human-Computer Studies 52(4), 701–717 (2000)
[28] Skitka, L.J., Mosier, K., Burdick, M., Rosenblatt, B.: Does automation bias decision-making? International Journal of Human-Computer Studies 51(5), 991–1006 (1999)
[29] Sun, Q., Akman, A., Schuller, B.W.: Explainable artificial intelligence for medical applications: A review. ACM Trans. Comput. Heal. 6(2), 17:1–17:31 (2025)
[30] Sundararajan, M., Taly, A., Yan, Q.: Axiomatic attribution for deep networks. In: Proceedings of the 34th International Conference on Machine Learning. Proceedings of Machine Learning Research, vol. 70, pp. 3319–3328. PMLR (2017)
[31] Tan, M., Le, Q.: EfficientNetV2: Smaller models and faster training. In: Proceedings of the 38th International Conference on Machine Learning. Proceedings of Machine Learning Research, vol. 139, pp. 10096–10106. PMLR (2021)
[32] Wischnewski, M., Doebler, P., Krämer, N.: Development and validation of the trust in AI scale (TAIS). PsyArXiv (2025)
[33] Wischnewski, M., Krämer, N., Müller, E.: Measuring and understanding trust calibrations for automated systems: A survey of the state-of-the-art and future directions. In: Proceedings of the 2023 CHI Conference on Human Factors in Computing Systems. pp. 1–16. ACM (2023)
[34] Wojton, H.M., Porter, D., Lane, S.T., Bieber, C., Madhavan, P.: Initial validation of the trust of automated systems test (TOAST). The Journal of Social Psychology 160(6), 735–750 (2020)
[35] Wongvibulsin, S., Yan, M.J., Pahalyants, V., Murphy, W., Daneshjou, R., Rotemberg, V.: Current state of dermatology mobile applications with artificial intelligence features. JAMA Dermatology 160(6), 646–650 (2024)
[36] Zhang, Y., Liao, Q.V., Bellamy, R.K.E.: Effect of confidence and explanation on accuracy and trust calibration in AI-assisted decision making. In: Proceedings of the 2020 Conference on Fairness, Accountability, and Transparency. pp. 295–305. ACM (2020)

[1] [1]

In: Guidotti, R., Schmid, U., Longo, L

Abbaspour Onari, M., Baer, G., Zhang, C., Grau, I., Nobile, M.S., Zhang, Y.: The dynamics of trust in XAI: Assessing perceived and demonstrated trust across interaction modes and risk treatments. In: Guidotti, R., Schmid, U., Longo, L. (eds.) Explainable Artificial Intelligence. xAI 2025, Communications in Computer and Information Science, vol. 2576, pp. ...

work page 2025

[2] [2]

IEEE Transactions on Technology and Society7(1), 70–77 (2026)

Atf, Z., Lewis, P.R.: Is trust correlated with explainability in AI? a meta-analysis. IEEE Transactions on Technology and Society7(1), 70–77 (2026)

work page 2026

[3] [3]

Proceed- ings of the ACM on Human-Computer Interaction5(CSCW1), 188:1–188:21 (2021)

Buçinca, Z., Malaya, M.B., Gajos, K.Z.: To trust or to think: Cognitive forcing functions can reduce overreliance on AI in AI-assisted decision-making. Proceed- ings of the ACM on Human-Computer Interaction5(CSCW1), 188:1–188:21 (2021)

work page 2021

[4] [4]

In: Proceedings of the 26th International Conference on Intelligent User Interfaces

Chromik, M., Eiband, M., Buchner, F., Krüger, A., Butz, A.: I think i get your point, AI! the illusion of explanatory depth in explainable AI. In: Proceedings of the 26th International Conference on Intelligent User Interfaces. pp. 307–317. ACM (2021) Exploring Trust Calibration in XAI - Impact of Exposing Model Limitations 19

work page 2021

[5] [5]

Journal of Experimental Psychology: Gen- eral144(1), 114–126 (2015)

Dietvorst, B.J., Simmons, J.P., Massey, C.: Algorithm aversion: People erroneously avoid algorithms after seeing them err. Journal of Experimental Psychology: Gen- eral144(1), 114–126 (2015)

work page 2015

[6] [6]

Lot 2, Artificial Intelligence for Health and Care in the EU

Directorate-General for Communications Networks, Content and Technology (Eu- ropean Commission), PwC: Study on eHealth, Interoperability of Health Data and Artificial Intelligence for Health and Care in the European Union – Final Study Report. Lot 2, Artificial Intelligence for Health and Care in the EU. Publications Office of the European Union (2021)

work page 2021

[7] [7]

Communications of the ACM64(12), 86–92 (2021)

Gebru, T., Morgenstern, J., Vecchione, B., Wortman Vaughan, J., Wallach, H., Daumé III, H., Crawford, K.: Datasheets for datasets. Communications of the ACM64(12), 86–92 (2021)

work page 2021

[8] [8]

Grand View Research: Europe AI in healthcare market size & out- look, 2025–2030,https://www.grandviewresearch.com/horizon/outlook/ ai-in-healthcare-market/europe

work page 2025

[9] [9]

CoRR abs/2010.05351(2020)

Ha, Q., Liu, B., Liu, F.: Identifying melanoma images using EfficientNet ensem- ble: Winning solution to the SIIM-ISIC melanoma classification challenge. CoRR abs/2010.05351(2020)

work page arXiv 2010

[10] [10]

Human Factors57(3), 407–434 (2015)

Hoff, K.A., Bashir, M.: Trust in automation: Integrating empirical evidence on factors that influence trust. Human Factors57(3), 407–434 (2015)

work page 2015

[11] [11]

International Journal of Cognitive Ergonomics 4(1), 53–71 (2000)

Jian, J.Y., Bisantz, A.M., Drury, C.G.: Foundations for an empirically determined scale of trust in automated systems. International Journal of Cognitive Ergonomics 4(1), 53–71 (2000)

work page 2000

[12] Katzke, T., Yalçıner, M., Corazza, J., Ventura, A., Bündert, T.M., Müller, E.: SkinSplain: An XAI framework for trust calibration in skin lesion analysis. In: Joint Proceedings of the xAI 2025 Late-breaking Work, Demos and Doctoral Consortium. CEUR Workshop Proceedings, vol. 4017, pp. 305–312. CEUR-WS (2025)

[13] Körber, M.: Theoretical considerations and development of a questionnaire to measure trust in automation. In: Bagnara, S., Tartaglia, R., Albolino, S., Alexander, T., Fujita, Y. (eds.) Proceedings of the 20th Congress of the International Ergonomics Association (IEA 2018), pp. 13–30. Springer International Publishing, Cham (2019)

[14] Lee, J.D., See, K.A.: Trust in automation: Designing for appropriate reliance. Human Factors 46(1), 50–80 (2004)

[15] Leichtmann, B., Humer, C., Hinterreiter, A., Streit, M., Mara, M.: Effects of explainable artificial intelligence on trust and human behavior in a high-risk decision task. Computers in Human Behavior 139, 107539 (2023)

[16] Logg, J.M., Minson, J.A., Moore, D.A.: Algorithm appreciation: People prefer algorithmic to human judgment. Organizational Behavior and Human Decision Processes 151, 90–103 (2019)

[17] Ma, S., Lei, Y., Wang, X., Zheng, C., Shi, C., Yin, M., Ma, X.: Who should I trust: AI or myself? leveraging human and AI correctness likelihood to promote appropriate trust in AI-assisted decision-making. In: Proceedings of the 2023 CHI Conference on Human Factors in Computing Systems. pp. 759:1–759:19. ACM (2023)

[18] Mitchell, M., Wu, S., Zaldivar, A., Barnes, P., Vasserman, L., Hutchinson, B., Spitzer, E., Raji, I.D., Gebru, T.: Model cards for model reporting. In: Proceedings of the 2019 Conference on Fairness, Accountability, and Transparency. pp. 220–229. ACM (2019)

[19] Naiseh, M., Cemiloglu, D., Al-Thani, D., Jiang, N., Ali, R.: Explainable recommendations and calibrated trust: Two systematic user errors. Computer 54(10), 28–37 (2021)

[20] National Cancer Institute: Moles to melanoma: Recognizing the ABCDE features, https://moles-melanoma-tool.cancer.gov/

[21] National Cancer Institute: Did you know? melanoma cancer statistics (2014), https://www.cancer.gov/types/skin/did-you-know-melanoma-cancer-2014-video

[22] Nourani, M., Roy, C., Block, J.E., Honeycutt, D.R., Rahman, T., Ragan, E.D., Gogate, V.: Anchoring bias affects mental model formation and user reliance in explainable AI systems. In: Proceedings of the 26th International Conference on Intelligent User Interfaces. pp. 340–350. ACM (2021)

[23] Papernot, N., McDaniel, P.D.: Deep k-nearest neighbors: Towards confident, interpretable and robust deep learning. CoRR abs/1803.04765 (2018)

[24] Parasuraman, R., Riley, V.: Humans and automation: Use, misuse, disuse, abuse. Human Factors 39(2), 230–253 (1997)

[25] Rechkemmer, A., Yin, M.: When confidence meets accuracy: Exploring the effects of multiple performance indicators on trust in machine learning models. In: Proceedings of the 2022 CHI Conference on Human Factors in Computing Systems. pp. 535:1–535:14. ACM (2022)

[26] Salinas, M.P., Sepúlveda, J., Hidalgo, L., Peirano, D., Morel, M., Uribe, P., Rotemberg, V., Briones, J., Mery, D., Navarrete-Dechent, C.: A systematic review and meta-analysis of artificial intelligence versus clinicians for skin cancer diagnosis. npj Digital Medicine 7(1), 125 (2024)

[27] Skitka, L.J., Mosier, K., Burdick, M.: Accountability and automation bias. International Journal of Human-Computer Studies 52(4), 701–717 (2000)

[28] Skitka, L.J., Mosier, K., Burdick, M., Rosenblatt, B.: Does automation bias decision-making? International Journal of Human-Computer Studies 51(5), 991–1006 (1999)

[29] Sun, Q., Akman, A., Schuller, B.W.: Explainable artificial intelligence for medical applications: A review. ACM Trans. Comput. Heal. 6(2), 17:1–17:31 (2025)

[30] Sundararajan, M., Taly, A., Yan, Q.: Axiomatic attribution for deep networks. In: Proceedings of the 34th International Conference on Machine Learning. Proceedings of Machine Learning Research, vol. 70, pp. 3319–3328. PMLR (2017)

[31] Tan, M., Le, Q.: EfficientNetV2: Smaller models and faster training. In: Proceedings of the 38th International Conference on Machine Learning. Proceedings of Machine Learning Research, vol. 139, pp. 10096–10106. PMLR (2021)

[32] Wischnewski, M., Doebler, P., Krämer, N.: Development and validation of the trust in AI scale (TAIS). PsyArXiv (2025)

[33] Wischnewski, M., Krämer, N., Müller, E.: Measuring and understanding trust calibrations for automated systems: A survey of the state-of-the-art and future directions. In: Proceedings of the 2023 CHI Conference on Human Factors in Computing Systems. pp. 1–16. ACM (2023)

[34] Wojton, H.M., Porter, D., Lane, S.T., Bieber, C., Madhavan, P.: Initial validation of the trust of automated systems test (TOAST). The Journal of Social Psychology 160(6), 735–750 (2020)

[35] Wongvibulsin, S., Yan, M.J., Pahalyants, V., Murphy, W., Daneshjou, R., Rotemberg, V.: Current state of dermatology mobile applications with artificial intelligence features. JAMA Dermatology 160(6), 646–650 (2024)

[36] Zhang, Y., Liao, Q.V., Bellamy, R.K.E.: Effect of confidence and explanation on accuracy and trust calibration in AI-assisted decision making. In: Proceedings of the 2020 Conference on Fairness, Accountability, and Transparency. pp. 295–305. ACM (2020)

Table 2. Linear Mixed Effect...