LLM Doesn't Know What It Doesn't Know: Detecting Epistemic Blind Spots via Cross-Model Attribution Divergence on Clinical Tabular Data
Pith reviewed 2026-06-26 20:53 UTC · model grok-4.3
The pith
A cross-model calibrator using attribution divergence between an LLM and XGBoost reduces expected calibration error on clinical tabular data from 0.254 to 0.080.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The central claim is that cross-model attribution divergence serves as a usable proxy for an LLM's epistemic uncertainty on clinical tabular prediction; a calibrator trained on this signal reduces expected calibration error from 0.254 to 0.080, yields patient-specific reliability estimates, and does so without model internals or repeated inference.
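A minimal sketch of what such a calibrator could look like, assuming it is a logistic regression that maps per-patient signals to the probability that the LLM's prediction is correct. The summary does not specify the calibrator's model class or inputs, so the feature layout (`[ads, verbalized_conf]`), the function names, and the toy data are all illustrative assumptions rather than the authors' implementation.

```python
import math

def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def fit_reliability_calibrator(features, correct, lr=0.1, epochs=2000):
    """Fit a logistic regression by batch gradient descent.

    features: per-patient feature lists, here assumed to be [ads, verbalized_conf]
    correct:  0/1 flags, 1 if the LLM prediction was right
    Returns (weights, bias).
    """
    n, d = len(features), len(features[0])
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * d, 0.0
        for x, y in zip(features, correct):
            err = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b) - y
            for j in range(d):
                grad_w[j] += err * x[j]
            grad_b += err
        w = [wi - lr * g / n for wi, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b

def predict_reliability(model, x):
    """Estimated probability that the LLM is correct for feature vector x."""
    w, b = model
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)

# Synthetic toy data (not from the paper): low ADS cases are correct,
# high ADS cases are wrong, and verbalized confidence is constant.
X = [[0.2, 0.9], [0.3, 0.9], [0.4, 0.9], [1.4, 0.9], [1.6, 0.9], [1.8, 0.9]]
y = [1, 1, 1, 0, 0, 0]
model = fit_reliability_calibrator(X, y)
low_ads = predict_reliability(model, [0.3, 0.9])
high_ads = predict_reliability(model, [1.6, 0.9])
```

Because verbalized confidence is constant in the toy data, only the divergence feature can separate reliable from unreliable cases, which mirrors the claim that the divergence signal carries the information verbalized confidence lacks.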
What carries the argument
The Attribution Disagreement Score (ADS) derived from comparing LLM and XGBoost feature attributions, which feeds a cross-model calibrator that outputs reliability estimates.
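The summary does not give the ADS formula, so the following is only a stand-in showing one plausible way to turn two attribution vectors into a disagreement number. It assumes each model's attributions are first converted to shares of total absolute importance and then compared by L1 distance; the function names and the feature names in the example are hypothetical.

```python
def normalize_abs(attributions):
    """Scale absolute attributions so they sum to 1 (share of total importance)."""
    total = sum(abs(a) for a in attributions.values())
    if total == 0:
        raise ValueError("all attributions are zero")
    return {f: abs(a) / total for f, a in attributions.items()}

def attribution_disagreement(llm_attr, ref_attr):
    """Hypothetical disagreement score: L1 distance between importance shares.

    0.0 means identical importance profiles; 2.0 means the two models
    place all of their weight on disjoint features.
    This is NOT the paper's ADS definition, which the summary does not state.
    """
    p, q = normalize_abs(llm_attr), normalize_abs(ref_attr)
    features = set(p) | set(q)
    return sum(abs(p.get(f, 0.0) - q.get(f, 0.0)) for f in features)

# Made-up attributions for a single patient (feature names are illustrative).
llm = {"creatinine": 0.1, "age": 0.6, "heart_rate": 0.3}
xgb = {"creatinine": 0.7, "age": 0.2, "heart_rate": 0.1}
score = attribution_disagreement(llm, xgb)
```

Normalizing before comparing matters because LLM-elicited attributions and SHAP values generally live on different scales; without it the score would mostly measure scale mismatch.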
If this is right
- LLM verbalized confidence tracks prompt format rather than prediction quality and stays in a narrow high band even when accuracy falls to 49 percent.
- An inverse difficulty effect appears: LLM accuracy drops when the XGBoost model is near-certain, yet matches the tree model when the latter is only moderately confident.
- Few-shot examples and SHAP-derived feature evidence act as orthogonal, super-additive interventions that jointly cut the Attribution Disagreement Score and raise accuracy.
- The calibrator supplies patient-specific reliability without requiring repeated model calls or internal logit access.
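The inverse difficulty effect in the second bullet is a stratified comparison: group cases by how confident the reference model is, then measure LLM accuracy within each group. A minimal sketch follows, assuming binary classification and that the reference model's confidence is `max(p, 1 - p)`. The bin edges and the toy inputs are illustrative choices, not values from the paper.

```python
def accuracy_by_reference_confidence(ref_prob, llm_correct,
                                     edges=(0.5, 0.7, 0.9, 1.0)):
    """Group cases by reference-model confidence; report LLM accuracy per group.

    ref_prob:    reference model's predicted probability of the positive class
    llm_correct: 0/1 flags, 1 if the LLM prediction was right
    Returns {(lo, hi): (count, accuracy)} for non-empty bins.
    """
    bins = {}
    for p, ok in zip(ref_prob, llm_correct):
        conf = max(p, 1.0 - p)  # confidence in whichever class was predicted
        for lo, hi in zip(edges[:-1], edges[1:]):
            # The last bin is closed on the right so conf == 1.0 is counted.
            if lo <= conf < hi or (hi == edges[-1] and conf == hi):
                bins.setdefault((lo, hi), []).append(ok)
                break
    return {k: (len(v), sum(v) / len(v)) for k, v in sorted(bins.items())}

# Synthetic toy inputs (not from the paper).
ref_prob = [0.55, 0.40, 0.65, 0.95, 0.02, 0.99, 0.97]
llm_correct = [1, 1, 0, 0, 1, 0, 0]
table = accuracy_by_reference_confidence(ref_prob, llm_correct)
```

An inverse difficulty pattern would show up as lower LLM accuracy in the highest-confidence bin than in the moderate-confidence bins, as in this toy table.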
Where Pith is reading between the lines
- The method could be tested on other tabular domains such as finance or sensor data to check whether attribution divergence remains informative outside clinical settings.
- Replacing the XGBoost reference model with a different non-LLM baseline might reveal whether the signal is tied to tree-based structure or works more generally.
- The approach suggests that production clinical systems could maintain a lightweight tree model in parallel with the LLM solely to monitor when the LLM's outputs are likely to be unreliable.
- If the divergence signal proves stable across prompt variations, it could serve as a lightweight audit layer for any LLM deployed on structured inputs.
Load-bearing premise
That divergence in attributions between the LLM and XGBoost measures the LLM's epistemic uncertainty rather than merely reflecting differences in model architecture or training data.
What would settle it
On a fresh clinical tabular dataset, if the divergence-based calibrator fails to produce lower expected calibration error than raw verbalized confidence while the LLM's accuracy remains comparable, the proxy claim would be falsified.
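The test above comes down to computing expected calibration error twice on the same predictions, once with verbalized confidence and once with the calibrator's output, and comparing the two. A minimal sketch of the standard equal-width binned ECE is shown below. The bin count of 10 is a common default and an assumption here, since the summary does not state the paper's binning.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Equal-width binned ECE: weighted mean of |accuracy - mean confidence|.

    confidences: predicted probabilities in [0, 1] that each prediction is right
    correct:     0/1 flags, 1 if the prediction was right
    """
    n = len(confidences)
    bins = [[] for _ in range(n_bins)]
    for c, ok in zip(confidences, correct):
        idx = min(int(c * n_bins), n_bins - 1)  # put c == 1.0 in the top bin
        bins[idx].append((c, ok))
    ece = 0.0
    for members in bins:
        if not members:
            continue
        avg_conf = sum(c for c, _ in members) / len(members)
        acc = sum(ok for _, ok in members) / len(members)
        ece += len(members) / n * abs(acc - avg_conf)
    return ece

# Synthetic toy case: constant high confidence with 50% accuracy.
outcomes = [1, 0, 1, 0]
ece_verbalized = expected_calibration_error([0.9, 0.9, 0.9, 0.9], outcomes)
ece_calibrated = expected_calibration_error([0.5, 0.5, 0.5, 0.5], outcomes)
```

In this toy case a near-constant high confidence is penalized heavily, while a confidence that matches the observed accuracy scores zero. The proxy claim would be in trouble if the calibrator's ECE on held-out data were not lower than the verbalized one.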
Original abstract
Large language models (LLMs) are increasingly applied to structured clinical data, yet whether they can recognize the limits of their own knowledge on such tasks remains unexplored. We study this question through the lens of cross-model attribution divergence, with the goal of reducing epistemic uncertainty for structured tasks, comparing Qwen 2.5 7B and XGBoost on a prediction task via attribution divergence analysis. We report four findings. First, LLM verbalized confidence is epistemically vacuous: it outputs a near-constant value (0.856-0.937) regardless of whether accuracy is 49% or 75.3%, tracking prompt format rather than prediction quality. Second, the LLM exhibits an inverse difficulty effect: accuracy drops to 64.8% when XGBoost is 99% correct, but matches XGBoost (73.8% vs. 73.1%) when it is moderately uncertain. Third, few-shot examples and SHAP-derived feature evidence are orthogonal, super-additive interventions: they reduce the Attribution Disagreement Score (ADS) from 1.54 to 0.38 and improve accuracy from 49% to 75.3% without training. Fourth, a cross-model calibrator that determines LLM reliability using attribution divergence signals reduces expected calibration error from 0.254 to 0.080, replacing uninformative verbalized confidence with patient-specific reliability estimates, without accessing model internals or requiring repeated inference. We frame these findings as a cold start problem for LLMs on structured data and outline a path toward genuine epistemic self-awareness.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper examines LLMs on clinical tabular prediction tasks and reports four results. First, verbalized confidence is uninformative: it stays near-constant (0.856-0.937) across accuracy levels of 49-75.3%. Second, an inverse difficulty effect exists: LLM accuracy is 64.8% when XGBoost is 99% correct, yet matches XGBoost when the latter is moderately uncertain. Third, few-shot examples and SHAP evidence are super-additive, reducing the Attribution Disagreement Score (ADS) from 1.54 to 0.38 and boosting accuracy to 75.3%. Fourth, a cross-model calibrator using ADS between Qwen 2.5 7B attributions and XGBoost SHAP values reduces expected calibration error from 0.254 to 0.080 without model internals or repeated inference.
Significance. If the central calibration result holds after addressing controls, the work would offer a practical, training-free method for patient-specific reliability estimates on structured clinical data, addressing a documented gap in LLM epistemic awareness for tabular tasks and providing falsifiable empirical patterns (inverse difficulty, super-additivity) that could guide future self-calibration research.
major comments (2)
- [Abstract] Abstract (fourth finding) and the cross-model calibrator claim: the reduction in ECE from 0.254 to 0.080 is presented as evidence that ADS proxies LLM epistemic uncertainty, but the manuscript provides no same-architecture control (e.g., LLM-vs-LLM attribution divergence) or ablation isolating epistemic signal from fixed differences in inductive bias, optimization, and feature handling between Qwen 2.5 7B and XGBoost; without this, the divergence and the reported calibration benefit may reflect model-type mismatch rather than epistemic blind spots.
- [Abstract] Abstract (second finding on inverse difficulty effect): the reported accuracy drop to 64.8% when XGBoost is 99% correct is load-bearing for the epistemic interpretation, yet the abstract supplies no statistical test, confidence intervals, or dataset size that would allow assessment of whether this pattern is robust or an artifact of the specific model pair.
minor comments (2)
- [Abstract] Abstract: quantitative results (accuracy, ECE, ADS values) are reported without any dataset description, patient cohort size, feature count, or attribution method implementation details, which hinders reproducibility assessment even if the full text supplies them.
- [Abstract] Abstract: the invented term 'Attribution Disagreement Score (ADS)' is introduced without an explicit formula or normalization in the summary paragraph, requiring the reader to infer its construction from later text.
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed comments. We respond point-by-point to the major concerns below, indicating where revisions will be made.
- Referee: [Abstract] Abstract (fourth finding) and the cross-model calibrator claim: the reduction in ECE from 0.254 to 0.080 is presented as evidence that ADS proxies LLM epistemic uncertainty, but the manuscript provides no same-architecture control (e.g., LLM-vs-LLM attribution divergence) or ablation isolating epistemic signal from fixed differences in inductive bias, optimization, and feature handling between Qwen 2.5 7B and XGBoost; without this, the divergence and the reported calibration benefit may reflect model-type mismatch rather than epistemic blind spots.
  Authors: We agree that a same-architecture control would help isolate whether the divergence signal is specifically epistemic rather than arising from differences in model class. Our experimental design intentionally pairs the LLM with XGBoost because the latter is a strong, widely used baseline for tabular clinical data; the practical goal is to detect LLM epistemic blind spots relative to such a reference model. Nevertheless, the referee's point is valid. In revision we will add an explicit discussion of this limitation and include, where data permit, a supplementary LLM-to-LLM attribution divergence ablation to quantify the contribution of architectural mismatch.
  Revision: partial
- Referee: [Abstract] Abstract (second finding on inverse difficulty effect): the reported accuracy drop to 64.8% when XGBoost is 99% correct is load-bearing for the epistemic interpretation, yet the abstract supplies no statistical test, confidence intervals, or dataset size that would allow assessment of whether this pattern is robust or an artifact of the specific model pair.
  Authors: The full manuscript contains the dataset sizes and reports the accuracy figures with supporting statistics. We will revise the abstract to include the relevant sample size, confidence intervals, and a brief statement on the statistical assessment of the inverse difficulty effect so that readers can evaluate robustness directly from the abstract.
  Revision: yes
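For a sense of what the promised confidence intervals would involve, here is a minimal sketch of a Wilson score interval for an accuracy estimate within a single stratum. The stratum size used in the example (n = 1000) is purely hypothetical, since the summary does not report it, and the summary does not say which interval method the authors will use.

```python
import math

def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 gives ~95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = successes / n
    denom = 1.0 + z * z / n
    center = (p + z * z / (2.0 * n)) / denom
    half = z * math.sqrt(p * (1.0 - p) / n + z * z / (4.0 * n * n)) / denom
    return center - half, center + half

# Hypothetical stratum: 648 correct out of an ASSUMED 1000 cases,
# chosen only to match the reported 64.8% accuracy.
lo, hi = wilson_interval(648, 1000)

# The same accuracy with a tenth of the cases gives a much wider interval.
lo_small, hi_small = wilson_interval(65, 100)
```

The width of the interval depends strongly on the stratum size, which is why the referee's request for the dataset size is tied to the request for intervals.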
Circularity Check
No circularity; empirical comparisons are self-contained
full rationale
The paper's claims rest on direct empirical measurements: verbalized confidence ranges, accuracy under varying XGBoost certainty, ADS reductions from interventions, and ECE drop from 0.254 to 0.080 via a cross-model signal. No equations, parameters, or results are defined in terms of themselves; attribution divergence is computed from independent model outputs rather than fitted to the target reliability metric. No self-citations or uniqueness theorems appear in the provided text. The derivation chain consists of observable comparisons and interventions without reduction to inputs by construction.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: Feature attributions computed for the LLM and XGBoost are directly comparable, so disagreement between them can be quantified.
invented entities (2)
- Attribution Disagreement Score (ADS): no independent evidence
- Cross-model calibrator: no independent evidence