Vendi Novelty Scores for Out-of-Distribution Detection

Adji Bousso Dieng; Amey P. Pasarkar

arxiv: 2602.10062 · v2 · pith:VI7DC5KTnew · submitted 2026-02-10 · 💻 cs.LG · cs.CV

Pith reviewed 2026-05-22 10:36 UTC · model grok-4.3

keywords: out-of-distribution detection, Vendi score, novelty detection, diversity metrics, machine learning safety, image classification, non-parametric methods, post-hoc detection

The pith

The Vendi Novelty Score detects out-of-distribution inputs by measuring how much they increase the diversity of in-distribution features.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper proposes a new method for spotting out-of-distribution samples in image classifiers by tracking the boost a test point gives to the overall diversity of the training features. It draws on Vendi Scores, which are similarity-based measures of set diversity, to create a score that needs no density estimates or strong distributional assumptions. The approach runs in linear time, works without class labels in some variants, and maintains high performance even when the training set is reduced to one percent of its original size. A reader would care because current OOD detectors often break under realistic constraints such as limited data access or non-standard feature distributions, and a diversity-based alternative could improve safety in deployed systems.

Core claim

The central claim is that out-of-distribution detection can be reframed as quantifying the increase in Vendi Score diversity when a test sample is added to the in-distribution feature collection. This Vendi Novelty Score combines dataset-level and class-conditional signals in a single non-parametric quantity, requires no explicit density modeling, and delivers state-of-the-art detection accuracy across standard image benchmarks and multiple network architectures while preserving that accuracy when computed on only one percent of the training data.

What carries the argument

The Vendi Novelty Score: the Vendi Score of the in-distribution feature set after the test sample is appended, minus the Vendi Score of the original set. This difference measures the marginal diversity contribution of the new point.
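
The subtraction described above can be illustrated with a minimal sketch. It assumes the order-1 Vendi Score (the exponential of the Shannon entropy of the eigenvalues of the normalized similarity matrix) and a cosine-similarity kernel; the kernel choice and the function names `vendi_score` and `vendi_novelty_score` are illustrative assumptions, not taken from the paper. This naive version recomputes a full eigendecomposition for every test point, so it is not the linear-time procedure the paper claims.

```python
import numpy as np


def vendi_score(features: np.ndarray) -> float:
    """Order-1 Vendi Score of the rows of `features`.

    Assumption: cosine similarity is used as the kernel, so every
    sample has similarity 1 with itself. Rows must be non-zero.
    """
    # L2-normalize rows so the Gram matrix has ones on its diagonal.
    unit = features / np.linalg.norm(features, axis=1, keepdims=True)
    n = unit.shape[0]
    gram = unit @ unit.T
    # Eigenvalues of K / n are non-negative and sum to 1.
    eigvals = np.linalg.eigvalsh(gram / n)
    # Drop numerically zero eigenvalues (convention: 0 * log 0 = 0).
    eigvals = eigvals[eigvals > 1e-12]
    entropy = -np.sum(eigvals * np.log(eigvals))
    return float(np.exp(entropy))


def vendi_novelty_score(reference: np.ndarray, test_feature: np.ndarray) -> float:
    """Increase in Vendi Score when `test_feature` joins `reference`."""
    augmented = np.vstack([reference, test_feature[None, :]])
    return vendi_score(augmented) - vendi_score(reference)
```

Under these assumptions, a test feature that duplicates an existing direction in the reference set leaves the score essentially unchanged, while a feature orthogonal to every reference row adds a new eigen-direction and yields a positive score.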

If this is right

OOD detectors no longer require fitting density models or making parametric assumptions about feature distributions.
Effective detectors become feasible in settings where only a small fraction of training data is available or accessible.
Local class-conditional novelty and global dataset novelty can be fused inside one scalar score without additional tuning.
Memory and compute budgets for detection can be reduced dramatically while retaining competitive performance.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same diversity-increase logic could be tested on non-image modalities such as text embeddings or sensor streams where similarity kernels are still definable.
If the method generalizes, it suggests rethinking uncertainty quantification around set diversity rather than point-wise likelihood or distance.
One could explore whether VNS correlates with human notions of novelty or surprise on the same inputs.

Load-bearing premise

That the amount by which a sample increases feature-set diversity is a reliable indicator of whether the sample lies outside the training distribution.

What would settle it

A controlled experiment on an image dataset and architecture outside those tested in the paper where VNS ranks OOD samples no better than a random baseline.

Original abstract

Out-of-distribution (OOD) detection is critical for the safe deployment of machine learning systems. Existing post-hoc detectors typically rely on model confidence scores or likelihood estimates in feature space, often under restrictive distributional assumptions. In this work, we introduce a third paradigm and formulate OOD detection from a diversity perspective. We propose the Vendi Novelty Score (VNS), an OOD detector based on the Vendi Scores (VS), a family of similarity-based diversity metrics. VNS quantifies how much a test sample increases the VS of the in-distribution feature set, providing a principled notion of novelty that does not require density modeling. VNS is linear-time, non-parametric, and naturally combines class-conditional (local) and dataset-level (global) novelty signals. Across multiple image classification benchmarks and network architectures, VNS achieves state-of-the-art OOD detection performance. Remarkably, VNS retains this performance when computed using only 1% of the training data, enabling deployment in memory- or access-constrained settings.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

VNS reframes OOD detection around diversity increase using Vendi Scores, with a practical edge on tiny data subsets, though the 1% stability claim needs variance checks to hold up.

The main thing here is a shift to measuring novelty as the boost a test point gives to the Vendi Score diversity of the in-distribution features. This sits apart from confidence-based or likelihood-based detectors and stays non-parametric with linear runtime. It also folds in both local class-conditional and global signals without extra fitting steps.

That framing is the clearest novelty, and the paper earns credit for testing it across standard image benchmarks and multiple network architectures while reporting competitive detection numbers. The 1% training-data result is the part that could matter most in practice, since many deployment settings cannot store or access the full reference set. If the full experiments back the SOTA numbers with reasonable controls, this is a useful addition for constrained environments.

The soft spot is exactly the one the stress-test flags. A 1% random subset can change the spectrum of the similarity matrix in high-dimensional feature space, and without reported variance across several draws or sensitivity plots, the claim that performance is retained feels under-supported. The abstract gives no error bars or ablation details either, so the central empirical story rests on unshown stability. If those checks exist in the body they close the gap; if not, it is a real but fixable weakness rather than a fatal one.

The math builds directly on prior Vendi Score work without circular redefinition, and the citations look appropriate. This is for readers who need simple, assumption-light OOD tools in computer vision or similar domains, especially when data access is limited. It deserves a serious referee because the idea is distinct enough and the practical angle sharp enough to be worth checking in detail. I would send it to review and ask specifically for the small-subset variance results.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces the Vendi Novelty Score (VNS) for out-of-distribution (OOD) detection. VNS quantifies novelty as the increase in Vendi Score (a similarity-based diversity metric) of an in-distribution feature set upon addition of a test sample. The central claims are that VNS achieves state-of-the-art OOD detection performance across multiple image classification benchmarks and network architectures, and that this performance is retained when VNS is computed using only 1% of the training data.

Significance. If the empirical claims hold after addressing stability concerns, the work is significant because it establishes a new non-parametric, linear-time paradigm for OOD detection that avoids density modeling and class-conditional fitting. The reported robustness to 1% data subsets is a practical strength that could support deployment in memory-constrained settings. The paper is credited for grounding the method in an existing family of diversity metrics (Vendi Scores) and for emphasizing falsifiable, post-hoc applicability.

major comments (2)

[Experiments (1% data results)] The claim that VNS retains SOTA OOD performance when computed on only 1% of the training data (abstract and experiments) is load-bearing for the 'remarkably retains' assertion but lacks variance analysis across multiple independent random subsamples of the reference set. In high-dimensional feature spaces the Vendi Score depends on the spectrum of the Gram matrix; a 1% subset may alter the eigenspace for near-ID or boundary samples, causing the delta-VS signal to fluctuate without reported standard deviations or sensitivity results.
[§3] §3 (method): The precise definition of VNS as the difference in Vendi Score when adding a test point should include an explicit equation and a short argument or empirical check showing that the delta is systematically larger for OOD than ID samples under the chosen similarity kernel; without this the 'principled notion of novelty' remains partly heuristic.
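
The variance check requested in the first major comment could look roughly like the sketch below: draw several independent 1% reference subsets, score held-out ID and OOD features against each, and report the mean and standard deviation of AUROC. Everything here is an assumption for illustration, including the cosine kernel, the order-1 Vendi Score, the helper names (`toy_features`, `subsample_stability`), and the synthetic Gaussian features, which merely stand in for real network embeddings and say nothing about the paper's actual results.

```python
import numpy as np


def vendi_score(features):
    """Order-1 Vendi Score with an assumed cosine-similarity kernel."""
    unit = features / np.linalg.norm(features, axis=1, keepdims=True)
    eigvals = np.linalg.eigvalsh(unit @ unit.T / unit.shape[0])
    eigvals = eigvals[eigvals > 1e-12]  # convention: 0 * log 0 = 0
    return float(np.exp(-np.sum(eigvals * np.log(eigvals))))


def auroc(id_scores, ood_scores):
    """Probability that a random OOD score exceeds a random ID score.

    Ties count as one half. Higher scores are read as "more novel".
    """
    id_col = np.asarray(id_scores, dtype=float)[:, None]
    ood_row = np.asarray(ood_scores, dtype=float)[None, :]
    return float(np.mean((ood_row > id_col) + 0.5 * (ood_row == id_col)))


def subsample_stability(train, id_test, ood_test, fraction=0.01, n_draws=5, seed=0):
    """Mean and standard deviation of AUROC over random reference subsets."""
    rng = np.random.default_rng(seed)
    subset_size = max(2, int(round(fraction * len(train))))
    aurocs = []
    for _ in range(n_draws):
        idx = rng.choice(len(train), size=subset_size, replace=False)
        reference = train[idx]
        base = vendi_score(reference)
        # Novelty of each test point = Vendi Score increase over this subset.
        id_scores = [vendi_score(np.vstack([reference, x])) - base for x in id_test]
        ood_scores = [vendi_score(np.vstack([reference, x])) - base for x in ood_test]
        aurocs.append(auroc(id_scores, ood_scores))
    return float(np.mean(aurocs)), float(np.std(aurocs))


def toy_features(seed=0, dim=16):
    """Synthetic stand-in for embeddings: two Gaussian blobs (toy data only)."""
    rng = np.random.default_rng(seed)
    id_mean = np.zeros(dim)
    id_mean[0] = 8.0
    ood_mean = np.zeros(dim)
    ood_mean[1] = 8.0
    train = rng.normal(size=(2000, dim)) + id_mean
    id_test = rng.normal(size=(50, dim)) + id_mean
    ood_test = rng.normal(size=(50, dim)) + ood_mean
    return train, id_test, ood_test
```

Calling `subsample_stability(*toy_features())` returns a `(mean, std)` pair over five draws of 20 reference points each. A large standard deviation relative to the gap between competing methods would indicate the instability the referee is worried about.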

minor comments (2)

The abstract states SOTA performance but does not name the exact benchmarks or architectures; these should be listed early in the introduction or experiments section for immediate clarity.
[Tables/Figures] All performance tables and figures should report error bars or standard deviations over multiple runs or seeds to allow readers to assess statistical reliability of the SOTA comparisons.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments on our manuscript. We address each major comment below and describe the revisions we plan to incorporate.

Referee: [Experiments (1% data results)] The claim that VNS retains SOTA OOD performance when computed on only 1% of the training data (abstract and experiments) is load-bearing for the 'remarkably retains' assertion but lacks variance analysis across multiple independent random subsamples of the reference set. In high-dimensional feature spaces the Vendi Score depends on the spectrum of the Gram matrix; a 1% subset may alter the eigenspace for near-ID or boundary samples, causing the delta-VS signal to fluctuate without reported standard deviations or sensitivity results.

Authors: We agree that additional analysis of variance across independent subsamples would strengthen the robustness claims. In the revised manuscript we will report OOD detection metrics averaged over multiple independent random draws of the 1% reference subset, including standard deviations. This will directly address potential sensitivity of the Gram-matrix spectrum to particular subsamples. revision: yes
Referee: [§3] §3 (method): The precise definition of VNS as the difference in Vendi Score when adding a test point should include an explicit equation and a short argument or empirical check showing that the delta is systematically larger for OOD than ID samples under the chosen similarity kernel; without this the 'principled notion of novelty' remains partly heuristic.

Authors: We thank the referee for highlighting this presentational gap. We will revise Section 3 to include the explicit definition VNS(x) = VS(F ∪ {f(x)}) − VS(F), where F denotes the in-distribution feature matrix and f(x) the embedding of the test sample x. We will also add a concise argument based on the eigenvalue-entropy formulation of the Vendi Score, together with a small-scale empirical verification on a synthetic mixture of Gaussians, confirming that the delta is systematically larger for points drawn from a dissimilar distribution. revision: yes
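
For reference, the definition quoted in this response can be written out together with the eigenvalue-entropy form of the Vendi Score. The block below assumes the order-1 (Shannon) member of the Vendi Score family and a kernel k normalized so that k(f, f) = 1; which member and which kernel the paper actually uses is not stated in the text above.

```latex
% Assumed setup: in-distribution features f_1, ..., f_n and a
% positive semidefinite kernel k with k(f, f) = 1.
\begin{align}
  K_{ij} &= k(f_i, f_j), \qquad
  \lambda_1, \dots, \lambda_n \ \text{the eigenvalues of } K / n, \\
  \mathrm{VS}(F) &= \exp\!\Big( -\sum_{i=1}^{n} \lambda_i \log \lambda_i \Big),
  \qquad 0 \log 0 := 0, \\
  \mathrm{VNS}(x) &= \mathrm{VS}\big( F \cup \{ f(x) \} \big) - \mathrm{VS}(F).
\end{align}
```

Two limiting cases show the intended behaviour under these assumptions. If all n reference features are identical and f(x) duplicates them, both Vendi Scores equal 1 and VNS(x) = 0. If the n reference features are mutually orthogonal (K is the identity) and f(x) is orthogonal to all of them, the Vendi Score rises from n to n + 1 and VNS(x) = 1.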

Circularity Check

0 steps flagged

VNS introduced as direct application of prior diversity metric; no derivation reduces to inputs by construction

The paper defines VNS explicitly as the increase in Vendi Score (VS) upon adding a test sample to the in-distribution feature set, presented as a new OOD paradigm rather than a derived result. No equations or first-principles claims are shown that equate a prediction back to fitted parameters or self-cited uniqueness theorems. Performance results are empirical evaluations on benchmarks, and the 1% subsampling observation is reported as an empirical finding without statistical forcing or self-referential fitting. The approach remains self-contained against external benchmarks with no load-bearing self-citation chains or ansatz smuggling identified in the provided text.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Only abstract available; no explicit free parameters, axioms, or invented entities are stated. Vendi Scores are referenced as prior work, so any parameters internal to VS computation are inherited rather than introduced here.

pith-pipeline@v0.9.0 · 5709 in / 1160 out tokens · 38380 ms · 2026-05-22T10:36:39.177797+00:00 · methodology
