pith. machine review for the scientific record.

arxiv: 2605.06382 · v1 · submitted 2026-05-07 · 💻 cs.AI

Recognition: unknown

Rethinking Vacuity for OOD Detection in Evidential Deep Learning

Authors on Pith · no claims yet

Pith reviewed 2026-05-08 09:47 UTC · model grok-4.3

classification 💻 cs.AI
keywords out-of-distribution detection · evidential deep learning · vacuity · uncertainty mass · class cardinality · language models · multiple choice QA

The pith

Vacuity for OOD detection in evidential deep learning changes with even small differences in class count between ID and OOD sets.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper demonstrates that the vacuity metric, formed by dividing the number of classes K by the total belief strength S from the Dirichlet parameters, yields inconsistent out-of-distribution detection results unless K is identical for in-distribution and out-of-distribution evaluations. A mismatch of only one class can shift AUROC by as much as 0.318 and AUPR by 0.613 for standard EDL, and AUROC by 0.360 and AUPR by 0.683 for IB-EDL, creating artificial differences even when the model's evidence assignments stay fixed. This sensitivity arises because S does not scale linearly with K in practice, owing to how EDL suppresses incorrectly assigned evidence. The work also examines the use of EDL on causal language models with multiple-choice QA datasets and calls for consistent class cardinalities and clearer ID/OOD definitions in that setting.

Core claim

Vacuity defined as K divided by total strength S produces misleading AUROC and AUPR scores for OOD detection whenever the class cardinality K differs between the in-distribution and out-of-distribution sets, even by one, without any change in the underlying model predictions or evidence values.

What carries the argument

The vacuity formula UM = K/S, where S is the sum of the Dirichlet parameters and represents the model's total evidence strength in Evidential Deep Learning.
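As a concrete illustration, a minimal sketch of the metric (not the paper's code; the evidence vector below is made up):

```python
import numpy as np

def vacuity(alpha):
    """Uncertainty Mass UM = K/S for one Dirichlet evidence vector.

    alpha holds the Dirichlet parameters (per-class evidence + 1);
    S is their sum, the total strength of belief.
    """
    alpha = np.asarray(alpha, dtype=float)
    return alpha.size / alpha.sum()

# Illustrative evidence vector: K = 3 classes, S = 10, so UM = 3/10
print(vacuity([1.0, 1.0, 8.0]))  # 0.3
```

Because K sits in the numerator, any change in the assumed class count moves UM even when the underlying alpha values are untouched, which is the sensitivity the paper isolates.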

If this is right

  • K_ID must equal K_OOD in any valid comparison of OOD detection performance, otherwise the reported metrics are unreliable.
  • Both standard EDL and IB-EDL show large swings in AUROC and AUPR from a single-class difference.
  • Evaluations of EDL on causal language models with MCQA datasets require matched class counts to avoid artifacts.
  • Clearer and more consistent definitions of in-distribution versus out-of-distribution are required when fine-tuning language models.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • OOD benchmarks for any uncertainty method should enforce identical class counts across ID and OOD splits to remove this confound.
  • The same K-dependence may appear in other uncertainty measures that incorporate class cardinality.
  • A version of vacuity normalized to remove explicit K dependence could be tested on the same models and datasets.
  • Researchers should verify that model outputs remain unchanged when they artificially vary K in post-processing checks.

Load-bearing premise

That observed differences in AUROC and AUPR when K differs by one arise solely from the vacuity formula rather than from any change in how the model assigns evidence to the classes.

What would settle it

Hold model predictions and evidence values fixed, then recompute AUROC and AUPR after varying only the numerical value of K inserted into the vacuity formula and check whether the metrics still differ.
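That control can be sketched directly. The snippet below holds synthetic total-evidence values S fixed and varies only the scalar K inserted into K/S before recomputing AUROC; the distributions and class counts are illustrative assumptions, not the paper's data:

```python
import numpy as np

def auroc(scores_pos, scores_neg):
    """AUROC for 'higher score means OOD', via the Mann-Whitney rank statistic."""
    s = np.concatenate([scores_neg, scores_pos])
    n_neg, n_pos = len(scores_neg), len(scores_pos)
    ranks = np.empty(len(s))
    ranks[np.argsort(s)] = np.arange(1, len(s) + 1)
    return (ranks[n_neg:].sum() - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)

rng = np.random.default_rng(0)
# Fixed total evidence strengths S; these never change between the two checks.
S_id = rng.uniform(8.0, 20.0, size=500)   # ID: stronger evidence
S_ood = rng.uniform(4.0, 10.0, size=500)  # OOD: weaker evidence

auc_matched = auroc(4 / S_ood, 4 / S_id)     # K_ID = K_OOD = 4
auc_mismatched = auroc(5 / S_ood, 4 / S_id)  # only the scalar K changes
print(auc_matched, auc_mismatched)  # the mismatch alone moves AUROC
```

If the metrics still differ between the two calls, the gap is attributable to the formula alone, since the evidence strengths were never recomputed.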

Figures

Figures reproduced from arXiv: 2605.06382 by Claire McNamara.

Figure 1. OOD detection performance for OBQA → ARC-C as the effective number of classes K increases.
Original abstract

Vacuity, or Uncertainty Mass (UM), is commonly used as a metric to evaluate Out-of-Distribution (OOD) detection in Evidential Deep Learning (EDL). It generally involves dividing the number of classes ($K$) by the total strength of belief ($S$) of the model's predictions, where $S$ is derived from summing the Dirichlet parameters. As such, UM is sensitive to the cardinality of $K$. In particular, it is unlikely in practice that there is a linear relationship between $K$ and $S$ as $K$ and $S$ increase due to the nature of EDL (suppressing incorrectly assigned evidence). As a result, when comparing In Distribution (ID) and OOD results, it is important that $K_{\mathrm{ID}}$ and $K_{\mathrm{OOD}}$ are equal; something that is not always ensured in practice. We provide an empirical demonstration of how results for AUROC and AUPR can substantially differ when class cardinality between ID and OOD differs by 1, with AUROC differing by as much as 0.318 and AUPR by 0.613 for standard EDL, and AUROC by 0.360 and AUPR by 0.683 for IB-EDL. More concretely, our findings isolate an evaluation artefact: when K differs between ID and OOD, AUROC/AUPR can be artificially inflated without any change in model predictions. We further discuss the evaluation of EDL over causal language models using Multiple-Choice Question-Answer (MCQA) datasets and argue for clearer definitions of ID and OOD in this context. Our primary contribution is an empirical and theoretical demonstration that vacuity-based OOD detection in EDL-fine-tuned LLMs is highly sensitive to uncontrolled differences in evaluated class cardinality.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper claims that vacuity (Uncertainty Mass UM = K/S) for OOD detection in Evidential Deep Learning is highly sensitive to mismatches in class cardinality K between ID and OOD settings. It provides an empirical demonstration on EDL- and IB-EDL-fine-tuned LLMs using MCQA datasets showing that a K difference of 1 can inflate AUROC by up to 0.318 and AUPR by up to 0.613 (and larger for IB-EDL) even without changes in model predictions, framing this as an evaluation artefact. The work also discusses non-linear scaling of total evidence S with K due to evidence suppression and calls for clearer ID/OOD definitions in LLM/MCQA contexts.

Significance. If the central empirical isolation holds, the result identifies a previously under-appreciated but practically consequential evaluation pitfall that could affect the reliability of many published OOD results using vacuity in EDL. The reported metric gaps are large enough to alter conclusions in typical benchmarks, and the focus on LLM fine-tuning makes the finding timely. The theoretical remark on non-linear S-K behavior is consistent with EDL mechanics and, if paired with the fixed-prediction control, supplies a clear prescription for matched-cardinality evaluation.

major comments (2)
  1. [Experiments / Results] Experiments section (results on AUROC/AUPR gaps): the claim that observed differences occur 'without any change in model predictions' and isolate an 'evaluation artefact' is load-bearing. The manuscript must explicitly state and demonstrate that the same evidence vector (Dirichlet parameters and thus S) is used for both the matched-K and mismatched-K cases, with only the scalar K substituted into UM = K/S. If separate models were trained for different output cardinalities, the learned alphas would differ and the attribution to the formula alone would not hold.
  2. [Theoretical discussion] Theoretical discussion (non-linear scaling of S with K): while the abstract notes that S is unlikely to scale linearly with K because of evidence suppression, the manuscript should supply a short concrete illustration (e.g., a two-class vs. three-class toy Dirichlet example) showing how the suppression mechanism produces the observed non-linearity; this would make the theoretical argument self-contained rather than asserted.
minor comments (2)
  1. [Abstract / Experiments] Define IB-EDL on first use and state the precise K values employed for each ID/OOD pair in the reported tables or figures.
  2. [Results] Add a small table or figure panel that directly juxtaposes AUROC/AUPR for matched-K versus mismatched-K under identical evidence vectors; this would make the artefact visually immediate.
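The toy example the referee asks for might look like the following sketch, assuming the stylized case where suppression leaves zero evidence on every incorrect class (the numbers are illustrative, not from the paper):

```python
import numpy as np

# Stylized setting: EDL suppresses evidence on incorrect classes, so
# alpha_i = e_i + 1 stays at 1 for every wrong class and only the true
# class accumulates evidence. e_correct is an assumed, fixed value.
e_correct = 9.0

vacuities = {}
for K in (2, 3, 10):
    alpha = np.ones(K)       # suppressed classes contribute alpha_i = 1 each
    alpha[0] += e_correct    # all real evidence sits on the true class
    S = alpha.sum()          # S = e_correct + K: grows by +1 per extra class
    vacuities[K] = K / S     # while the numerator K grows with K itself

print(vacuities)  # vacuity climbs with K even though the evidence is fixed
```

Under this assumption S grows additively rather than proportionally with K, so K/S is not invariant to class count, which is the non-linearity the abstract asserts.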

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive comments and recommendation for minor revision. We address each point below and will incorporate the requested clarifications into the revised manuscript.

Point-by-point responses
  1. Referee: [Experiments / Results] Experiments section (results on AUROC/AUPR gaps): the claim that observed differences occur 'without any change in model predictions' and isolate an 'evaluation artefact' is load-bearing. The manuscript must explicitly state and demonstrate that the same evidence vector (Dirichlet parameters and thus S) is used for both the matched-K and mismatched-K cases, with only the scalar K substituted into UM = K/S. If separate models were trained for different output cardinalities, the learned alphas would differ and the attribution to the formula alone would not hold.

    Authors: We thank the referee for this important clarification. In the experiments, the same evidence vectors (i.e., the same Dirichlet parameters α and total evidence S) were used for both the matched-K and mismatched-K cases, with only the scalar K substituted into the vacuity formula UM = K/S. This isolates the effect to the evaluation metric while holding model predictions fixed. We agree that the manuscript should state this procedure more explicitly. In the revised version we will add a clear description in the Experiments section, including an explicit statement that the α vectors are held constant across the compared settings and a brief demonstration of the fixed parameters. revision: yes

  2. Referee: [Theoretical discussion] Theoretical discussion (non-linear scaling of S with K): while the abstract notes that S is unlikely to scale linearly with K because of evidence suppression, the manuscript should supply a short concrete illustration (e.g., a two-class vs. three-class toy Dirichlet example) showing how the suppression mechanism produces the observed non-linearity; this would make the theoretical argument self-contained rather than asserted.

    Authors: We agree that a concrete illustration would make the theoretical discussion more self-contained. We will add a short toy example to the revised manuscript (e.g., a two-class versus three-class Dirichlet comparison) that shows how evidence suppression produces non-linear scaling of total evidence S with K. This addition will strengthen the argument without changing the core claims. revision: yes

Circularity Check

0 steps flagged

No significant circularity; empirical claim is independent of inputs.

Full rationale

The paper's central demonstration is an empirical observation that AUROC/AUPR for vacuity (UM = K/S) shift when K differs by 1 between ID and OOD, presented as an evaluation artefact. This rests on reported metric gaps (e.g., 0.318 AUROC) and the explicit formula rather than any reduction of outputs to fitted parameters or self-citations by construction. No load-bearing self-citation chains, ansatzes, or uniqueness theorems are invoked; the derivation chain is self-contained against the stated experiments and EDL properties.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The paper relies on the standard EDL Dirichlet parameterization and the conventional definition of vacuity as K/S; no new free parameters, invented entities, or ad-hoc axioms are introduced beyond the domain assumption that evidence suppression prevents linear scaling of S with K.

axioms (1)
  • domain assumption: There is no linear relationship between K and S as they increase, due to the nature of EDL (suppression of incorrectly assigned evidence).
    Invoked in the abstract to explain why UM is sensitive to cardinality and why K_ID must equal K_OOD.

pith-pipeline@v0.9.0 · 5627 in / 1504 out tokens · 36102 ms · 2026-05-08T09:47:46.845607+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

24 extracted references · 9 canonical work pages · 3 internal anchors

  1. [1]

    https://arxiv.org/abs/1806.01768

    Murat Sensoy, Lance Kaplan, and Melih Kandemir. Evidential Deep Learning to Quantify Classification Uncertainty, October 2018. arXiv:1806.01768 [cs]

  2. [2]

    Calibrating LLMs with Information-Theoretic Evidential Deep Learning, February 2025

    Yawei Li, David Rügamer, Bernd Bischl, and Mina Rezaei. Calibrating LLMs with Information-Theoretic Evidential Deep Learning, February 2025. arXiv:2502.06351 [cs]

  3. [3]

    Evidential Transformation Network: Turning Pretrained Models into Evidential Models for Post-hoc Uncertainty Estimation

    Yongchan Chun, Chanhee Park, Jeongho Yoon, Jaehyung Seo, and Heuiseok Lim. Evidential Transformation Network: Turning Pretrained Models into Evidential Models for Post-hoc Uncertainty Estimation, April 2026. arXiv:2604.08627 [cs]

  4. [4]

    Continual Evidential Deep Learning for Out-of-Distribution Detection

    Eduardo Aguilar, Bogdan Raducanu, Petia Radeva, and Joost Van De Weijer. Continual Evidential Deep Learning for Out-of-Distribution Detection. In 2023 IEEE/CVF International Conference on Computer Vision Workshops (ICCVW), pages 3436–3446, Paris, France, October 2023. IEEE

  5. [5]

    CEDL+: Exploiting evidential deep learning for continual out-of-distribution detection

    Eduardo Aguilar, Bogdan Raducanu, Petia Radeva, and Joost Van De Weijer. CEDL+: Exploiting evidential deep learning for continual out-of-distribution detection. Expert Systems with Applications, 283:127774, July 2025

  6. [6]

    LoRA: Low-Rank Adaptation of Large Language Models

    Edward J. Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. LoRA: Low-Rank Adaptation of Large Language Models, October 2021. arXiv:2106.09685 [cs]

  7. [7]

    Estimating expected calibration errors

    Nicolas Posocco and Antoine Bonnefoy. Estimating expected calibration errors. In International Conference on Artificial Neural Networks, pages 139–150. Springer, 2021

  8. [8]

    Lakshmana Sri Harsha Nemani, P. K. Srijith, and Tomasz Kuśmierczyk. Efficient Uncertainty in LLMs through Evidential Knowledge Distillation, July 2025

  9. [9]

    Receiver operating characteristics curves and related decision measures: A tutorial

    Christopher Brown and Herbert Davis. Receiver operating characteristics curves and related decision measures: A tutorial. Chemometrics and Intelligent Laboratory Systems, 80:24–38, 01 2006

  10. [10]

    Foundations of statistical natural language processing

    Christopher Manning and Hinrich Schutze. Foundations of statistical natural language processing. MIT press, 1999

  11. [11]

    Revisiting Essential and Nonessential Settings of Evidential Deep Learning

    Mengyuan Chen, Junyu Gao, and Changsheng Xu. Revisiting Essential and Nonessential Settings of Evidential Deep Learning, October 2024. arXiv:2410.00393 [cs]

  12. [12]

    A comprehensive survey on evidential deep learning and its applications

    Junyu Gao, Mengyuan Chen, Liangyu Xiang, and Changsheng Xu. A comprehensive survey on evidential deep learning and its applications. IEEE Transactions on Pattern Analysis and Machine Intelligence, 48(3):2118–2138, 2026

  13. [13]

    Can a suit of armor conduct electricity? a new dataset for open book question answering

    Todor Mihaylov, Peter Clark, Tushar Khot, and Ashish Sabharwal. Can a suit of armor conduct electricity? a new dataset for open book question answering. In EMNLP, 2018

  14. [14]

    Think you have Solved Question Answering? Try ARC, the AI2 Reasoning Challenge

    Peter Clark, Isaac Cowhey, Oren Etzioni, Tushar Khot, Ashish Sabharwal, Carissa Schoenick, and Oyvind Tafjord. Think you have solved question answering? Try ARC, the AI2 reasoning challenge. arXiv preprint arXiv:1803.05457, 2018

  15. [15]

    The llama 3 herd of models, 2024

    Aaron Grattafiori et al. The llama 3 herd of models, 2024

  16. [16]

    Uncertainty Estimation by Fisher Information-based Evidential Deep Learning, June 2023

    Danruo Deng, Guangyong Chen, Yang Yu, Furui Liu, and Pheng-Ann Heng. Uncertainty Estimation by Fisher Information-based Evidential Deep Learning, June 2023. arXiv:2303.02045 [cs]

  17. [17]

    CommonsenseQA: A question answering challenge targeting commonsense knowledge

    Alon Talmor, Jonathan Herzig, Nicholas Lourie, and Jonathan Berant. CommonsenseQA: A question answering challenge targeting commonsense knowledge. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4149–4158, Minneapolis, ...

  18. [18]

    Contrastive training for improved out-of-distribution detection

    Jim Winkens, Rudy Bunel, Abhijit Guha Roy, Robert Stanforth, Vivek Natarajan, Joseph R Ledsam, Patricia MacWilliams, Pushmeet Kohli, Alan Karthikesalingam, Simon Kohl, et al. Contrastive training for improved out-of-distribution detection. arXiv preprint arXiv:2007.05566, 2020

  19. [19]

    Measuring massive multitask language understanding

    Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, and Jacob Steinhardt. Measuring massive multitask language understanding. Proceedings of the International Conference on Learning Representations (ICLR), 2021

  20. [20]

    Ups and downs: Modeling the visual evolution of fashion trends with one-class collaborative filtering

    Ruining He and Julian McAuley. Ups and downs: Modeling the visual evolution of fashion trends with one-class collaborative filtering. In Proceedings of the 25th International Conference on World Wide Web, WWW ’16, page 507–517, Republic and Canton of Geneva, CHE, 2016. International World Wide Web Conferences Steering Committee

  21. [21]

    Character-level convolutional networks for text classification

    Xiang Zhang, Junbo Jake Zhao, and Yann LeCun. Character-level convolutional networks for text classification. CoRR, abs/1509.01626, 2015

  22. [22]

    Are Uncertainty Quantification Capabilities of Evidential Deep Learning a Mirage?

    Maohao Shen, J. Jon Ryu, Soumya Ghosh, Yuheng Bu, Prasanna Sattigeri, Subhro Das, and Gregory W. Wornell. Are Uncertainty Quantification Capabilities of Evidential Deep Learning a Mirage? In A. Globerson, L. Mackey, D. Belgrave, A. Fan, U. Paquet, J. Tomczak, and C. Zhang, editors, Advances in Neural Information Processing Systems, volume 37, pages 107830–...

  23. [23]

    Uncertainty Quantification for Multiple-Choice Questions is Just One-Token Deep

    Qingcheng Zeng, Mingyu Jin, Qinkai Yu, Zhenting Wang, Wenyue Hua, Guangyan Sun, Yanda Meng, Shiqing Ma, Qifan Wang, Felix Juefei-Xu, Fan Yang, Kaize Ding, Ruixiang Tang, and Yongfeng Zhang. Uncertainty Quantification for Multiple-Choice Questions is Just One-Token Deep. In Proceedings of the 34th ACM International Conference on Information and Knowledge ...

  24. [24]

    LLMs may perform MCQA by selecting the least incorrect option

    Haochun Wang, Sendong Zhao, Zewen Qiang, Nuwa Xi, Bing Qin, and Ting Liu. Llms may perform mcqa by selecting the least incorrect option. In Proceedings of the 31st International Conference on Computational Linguistics, pages 5852–5862, 2025