Near OOD Detection for Vision-Language Prompt Learning with Contrastive Logit Score

Belinda Gabbe; He Zhao; Joanna Dipnall; Myong Chol Jung

arxiv: 2405.16091 · v2 · submitted 2024-05-25 · 💻 cs.CV

Pith reviewed 2026-05-24 01:15 UTC · model grok-4.3

keywords near OOD detection · vision-language models · prompt learning · CLIP · out-of-distribution detection · contrastive logit score · post-hoc scoring

The pith

A contrastive logit score added post-hoc improves near-OOD detection for vision-language prompt models without retraining or architecture changes.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces Contrastive Logit Score (CLS) as a plug-and-play function that raises near out-of-distribution detection performance for pre-trained prompt learning methods on models such as CLIP. It reports gains of up to 11.67 percent in AUROC while requiring no model changes or extra training. A sympathetic reader would care because prompt learning offers an efficient way to adapt large vision-language models, yet near-OOD cases remain difficult to flag reliably with existing approaches. The work focuses on near shifts rather than far OOD and positions CLS as general across several prompt techniques.

Core claim

The central claim is that the Contrastive Logit Score serves as an effective post-hoc scoring function for near out-of-distribution detection in vision-language prompt learning models, achieving substantial AUROC improvements without requiring model modifications or additional training.

What carries the argument

The Contrastive Logit Score (CLS), a scoring function that compares logits in a contrastive manner to produce a distribution-shift signal for near-OOD samples.

If this is right

Prompt learning methods can gain better near-OOD detection simply by switching to the CLS scoring function at inference time.
No retraining or extra data is needed to obtain the reported AUROC gains.
The approach applies with minimal added computation across multiple prompt techniques and near-OOD benchmarks.
CLS can be combined with existing prompt-learning pipelines without altering their training procedures.
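
To make "switching the scoring function at inference time" concrete, here is a minimal sketch of a post-hoc detector whose score is a swappable argument. It does not implement CLS, whose formula is not given on this page. It uses two standard stand-ins (maximum softmax probability and the energy score), and the function names, the `logit_scale` value of 100, and the threshold are all assumptions made for illustration.

```python
import numpy as np

def similarity_logits(image_feat, text_feats, logit_scale=100.0):
    """Scaled cosine similarities between one image and K class prompts.

    logit_scale is an assumed constant for this sketch, not a value
    taken from the paper.
    """
    image_feat = np.asarray(image_feat, dtype=float)
    text_feats = np.asarray(text_feats, dtype=float)
    image_feat = image_feat / np.linalg.norm(image_feat)
    text_feats = text_feats / np.linalg.norm(text_feats, axis=1, keepdims=True)
    return logit_scale * (text_feats @ image_feat)

def msp_score(logits):
    """Maximum softmax probability (higher means more in-distribution)."""
    z = logits - logits.max()          # subtract max for numerical stability
    probs = np.exp(z) / np.exp(z).sum()
    return float(probs.max())

def energy_score(logits):
    """Log-sum-exp of the logits (higher means more in-distribution)."""
    m = logits.max()
    return float(m + np.log(np.exp(logits - m).sum()))

def is_ood(logits, score_fn, threshold):
    """Flag a sample as OOD when its score falls below the threshold.

    score_fn is the only piece that changes when one scoring function
    is swapped for another; the model and its logits stay untouched.
    """
    return score_fn(logits) < threshold
```

Because the detector only consumes logits the model already produces, replacing `msp_score` with another function of the same signature needs no retraining, which is the sense in which a score like CLS can be called plug-and-play.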

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same logit-contrast idea could be tested on other vision-language tasks such as retrieval or captioning to check whether the near-OOD signal transfers.
CLS might serve as a lightweight baseline when developing hybrid OOD detectors that combine prompt-based and non-prompt models.
Examining performance on datasets with controlled degrees of shift could clarify the boundary between near and far OOD for these scoring functions.

Load-bearing premise

That comparing logits contrastively produces a reliable near-distribution-shift signal that works across prompt-learning methods and near-OOD datasets without method-specific tuning.

What would settle it

Applying CLS to a new prompt learning method on a held-out near-OOD dataset and finding no AUROC improvement or a drop would falsify the generalizability claim.
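
The test described above comes down to comparing two AUROC values computed on the same in-distribution and near-OOD samples. The sketch below shows one way to run that comparison, assuming the convention that a higher score means "more in-distribution". The helper names `auroc` and `auroc_gain` are hypothetical, and the paper's own evaluation code may differ.

```python
import numpy as np

def auroc(id_scores, ood_scores):
    """AUROC as the probability that a random ID sample outscores a
    random OOD sample, with ties counted as one half."""
    id_scores = np.asarray(id_scores, dtype=float)
    ood_scores = np.asarray(ood_scores, dtype=float)
    # Pairwise differences: rows are ID samples, columns are OOD samples.
    diff = id_scores[:, None] - ood_scores[None, :]
    wins = (diff > 0).sum() + 0.5 * (diff == 0).sum()
    return float(wins / diff.size)

def auroc_gain(id_base, ood_base, id_new, ood_new):
    """AUROC of the new score minus AUROC of the baseline score.

    A value at or below zero on a held-out method and dataset would
    count against the generalizability claim.
    """
    return auroc(id_new, ood_new) - auroc(id_base, ood_base)
```

The pairwise form is quadratic in the number of samples, which is fine for a sanity check; a rank-based implementation would be the usual choice for full benchmarks.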

Original abstract

Prompt learning has emerged as an efficient and effective method for fine-tuning vision-language models such as CLIP. While many studies have explored generalisation abilities of these models in few-shot classification tasks and a few studies have addressed far out-of-distribution (OOD) of the models, their potential for addressing near OOD detection remains underexplored. Existing methods either require training from scratch, need fine-tuning, or are not designed for vision-language prompt learning. To address this, we introduce the Contrastive Logit Score (CLS), a novel post-hoc, plug-and-play scoring function. CLS significantly improves near OOD detection of pre-trained vision-language prompt learning methods without modifying their model architectures or requiring retraining. Our method achieves up to an 11.67% improvement in AUROC for near OOD detection with minimal computational overhead. Extensive evaluations validate the effectiveness, efficiency, and generalisability of our approach. Our code is available at https://github.com/davidmcjung/near-OOD-prompt-learning.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

CLS gives a simple post-hoc lift to near-OOD detection in prompt-tuned VLMs, but the no-tuning generalization claim needs tighter experimental backing.

CLS is a contrastive comparison of image and text logits meant to flag near-OOD samples after prompt learning on models like CLIP. The paper positions it as a drop-in score that needs no retraining or architecture changes and reports AUROC gains up to 11.67% on near-OOD tasks. That is the core new piece: a scoring function tailored to the prompt-learning regime rather than a new training procedure. The work is useful because it fills a narrow but real gap (most existing OOD detectors either assume from-scratch training or target far-OOD settings), and the authors release code, which makes the claim checkable. The evaluations are described as extensive, which is the right direction for a methods paper in this area.

The soft spot is the plug-and-play premise. The stress-test note is on target: if different prompt-learning objectives or depths change the relative scaling between image and text logits, the same contrastive formula can lose its edge. The abstract does not show whether the method was tested across a wide range of prompt variants with zero hyperparameter search per method, or whether temperature or margin choices were fixed once and for all. If those choices were dataset- or method-specific, the reported gains become less portable.

The paper is aimed at researchers who already run prompt learning on vision-language models and want a lightweight OOD add-on. A reader in that subfield can extract a usable score and the accompanying numbers. It is coherent enough on its own terms to deserve referee time; the idea is testable and the overhead is low, even if the generalization story needs more scrutiny in review.
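
The scaling concern can be illustrated with a softmax-based score, used here as a stand-in because the CLS formula is not reproduced on this page. The sketch below feeds identical cosine similarities through two logit scales; the similarity values and both scales are hypothetical, chosen only to show how far the resulting confidence can move.

```python
import numpy as np

def softmax(z):
    z = z - z.max()                    # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum()

def max_softmax(cosine_sims, logit_scale):
    """Maximum softmax probability of the scaled similarities."""
    sims = np.asarray(cosine_sims, dtype=float)
    return float(softmax(logit_scale * sims).max())

# Hypothetical cosine similarities between one image and three class prompts.
sims = [0.30, 0.27, 0.20]

low_scale = max_softmax(sims, logit_scale=1.0)     # close to uniform, about 0.35
high_scale = max_softmax(sims, logit_scale=100.0)  # sharply peaked, about 0.95
```

With a fixed threshold of, say, 0.5, the same image would be flagged under the first scale and accepted under the second. Whether CLS shares this sensitivity is exactly what the fixed-hyperparameter question is meant to establish.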

Referee Report

3 major / 2 minor

Summary. The paper proposes Contrastive Logit Score (CLS), a post-hoc plug-and-play scoring function for near out-of-distribution (OOD) detection in pre-trained vision-language prompt learning methods (e.g., based on CLIP). It claims that CLS improves AUROC by up to 11.67% over existing approaches without architecture changes or retraining, supported by extensive evaluations on effectiveness, efficiency, and generalizability, with code released.

Significance. If the central claim holds, CLS offers a lightweight, training-free enhancement to near-OOD detection for widely used prompt-tuned VLMs, addressing an underexplored area. The public code release supports reproducibility and allows direct verification of the reported gains.

major comments (3)

[§3] §3 (Method, CLS definition): The contrastive formulation must explicitly state whether any scalar (temperature, margin, or normalization constant) is fixed globally or selected per prompt-learning method; any per-method choice would undermine the 'zero hyperparameter search' and 'plug-and-play' premise for arbitrary pre-trained prompt learners.
[§4] §4 (Experiments): The reported AUROC gains (up to 11.67%) are load-bearing for the generalization claim; the tables must include an explicit statement or ablation confirming that CLS hyperparameters were held fixed across all tested prompt-learning methods (CoOp, CoCoOp, etc.) and near-OOD pairs with no post-hoc selection, otherwise the tuning-free advantage is not demonstrated.
[§4.3] §4.3 (Baselines and near-OOD pairs): The comparison set must be expanded or justified to include strong post-hoc OOD baselines designed for VLMs; if only ID-tuned or far-OOD methods are used, the relative improvement cannot be taken as evidence that CLS is superior for the near-OOD regime.

minor comments (2)

[Abstract] Abstract: The sentence 'achieves up to an 11.67% improvement in AUROC' should name the exact baseline method and dataset pair for immediate clarity.
[§3] Notation: Define the image and text logit vectors (I and T) and the contrastive operation in a single equation early in §3 to avoid ambiguity when readers compare CLS to standard softmax or energy scores.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive feedback. We address each major comment below and will make revisions to improve clarity and strengthen the claims where appropriate.

Referee: [§3] §3 (Method, CLS definition): The contrastive formulation must explicitly state whether any scalar (temperature, margin, or normalization constant) is fixed globally or selected per prompt-learning method; any per-method choice would undermine the 'zero hyperparameter search' and 'plug-and-play' premise for arbitrary pre-trained prompt learners.

Authors: In the CLS definition, all scalars including the temperature parameter are fixed globally to the default values inherited from the pre-trained CLIP model (temperature set to 1, with no margin or additional normalization constants introduced). These values are identical for every prompt-learning method and are never selected or tuned per method. We will add an explicit statement in the revised §3 confirming this global fixation to reinforce the zero-hyperparameter and plug-and-play claims.

revision: yes

Referee: [§4] §4 (Experiments): The reported AUROC gains (up to 11.67%) are load-bearing for the generalization claim; the tables must include an explicit statement or ablation confirming that CLS hyperparameters were held fixed across all tested prompt-learning methods (CoOp, CoCoOp, etc.) and near-OOD pairs with no post-hoc selection, otherwise the tuning-free advantage is not demonstrated.

Authors: CLS hyperparameters were held strictly fixed across all prompt-learning methods (CoOp, CoCoOp, etc.) and all near-OOD pairs, with no post-hoc selection or per-experiment tuning performed. We will insert an explicit statement in the revised §4 and add a short confirmation note (or small ablation if space allows) documenting that the same fixed settings were used throughout.

revision: yes

Referee: [§4.3] §4.3 (Baselines and near-OOD pairs): The comparison set must be expanded or justified to include strong post-hoc OOD baselines designed for VLMs; if only ID-tuned or far-OOD methods are used, the relative improvement cannot be taken as evidence that CLS is superior for the near-OOD regime.

Authors: The baselines were chosen as the standard post-hoc scoring functions that can be applied directly to VLMs without retraining. We will expand §4.3 with a justification paragraph explaining their relevance to the near-OOD prompt-learning setting and, where feasible within page limits, include one or two additional VLM-oriented post-hoc baselines to strengthen the comparison.

revision: partial

Circularity Check

0 steps flagged

No circularity: CLS introduced as independent post-hoc score with empirical validation

The paper defines CLS explicitly as a novel post-hoc, plug-and-play scoring function applied to pre-trained vision-language prompt learning models without retraining or architecture modification. No equations, parameters, or predictions in the provided text reduce by construction to fitted inputs, self-citations, or ansatzes from the same work. The AUROC gains are reported as results from extensive evaluations across methods and datasets, not as quantities forced by the method's own definition. The derivation chain is self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

No free parameters or invented entities are mentioned; the approach rests on standard domain assumptions about logit-based OOD signals.

axioms (1)

domain assumption: Logit-based scores can separate near-OOD from in-distribution samples
Core premise underlying any logit-derived OOD detector.

pith-pipeline@v0.9.0 · 5713 in / 1108 out tokens · 31212 ms · 2026-05-24T01:15:18.511694+00:00 · methodology
