arxiv: 2405.16091 · v2 · submitted 2024-05-25 · 💻 cs.CV

Near OOD Detection for Vision-Language Prompt Learning with Contrastive Logit Score

Pith reviewed 2026-05-24 01:15 UTC · model grok-4.3

classification 💻 cs.CV
keywords near OOD detection · vision-language models · prompt learning · CLIP · out-of-distribution detection · contrastive logit score · post-hoc scoring

The pith

A contrastive logit score added post-hoc improves near-OOD detection for vision-language prompt models without retraining or architecture changes.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces Contrastive Logit Score (CLS) as a plug-and-play function that raises near out-of-distribution detection performance for pre-trained prompt learning methods on models such as CLIP. It reports gains of up to 11.67 percent in AUROC while requiring no model changes or extra training. A sympathetic reader would care because prompt learning offers an efficient way to adapt large vision-language models, yet near-OOD cases remain difficult to flag reliably with existing approaches. The work focuses on near shifts rather than far OOD and positions CLS as general across several prompt techniques.

Core claim

The central claim is that the Contrastive Logit Score serves as an effective post-hoc scoring function for near out-of-distribution detection in vision-language prompt learning models, achieving substantial AUROC improvements without requiring model modifications or additional training.

What carries the argument

The Contrastive Logit Score (CLS), a scoring function that compares logits in a contrastive manner to produce a distribution-shift signal for near-OOD samples.

If this is right

  • Prompt learning methods can gain better near-OOD detection simply by switching to the CLS scoring function at inference time.
  • No retraining or extra data is needed to obtain the reported AUROC gains.
  • The approach applies with minimal added computation across multiple prompt techniques and near-OOD benchmarks.
  • CLS can be combined with existing prompt-learning pipelines without altering their training procedures.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same logit-contrast idea could be tested on other vision-language tasks such as retrieval or captioning to check whether the near-OOD signal transfers.
  • CLS might serve as a lightweight baseline when developing hybrid OOD detectors that combine prompt-based and non-prompt models.
  • Examining performance on datasets with controlled degrees of shift could clarify the boundary between near and far OOD for these scoring functions.

Load-bearing premise

That comparing logits contrastively produces a reliable near-distribution-shift signal that works across prompt-learning methods and near-OOD datasets without method-specific tuning.

What would settle it

Applying CLS to a new prompt learning method on a held-out near-OOD dataset and finding no AUROC improvement or a drop would falsify the generalizability claim.
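
The check described above amounts to comparing AUROC under a baseline score and under the candidate score on the same in-distribution (ID) and near-OOD samples. The following is a minimal sketch of that comparison, not the paper's evaluation code. It assumes that a higher score means "more in-distribution"; the helper names `auroc` and `auroc_gain` are hypothetical, and the numbers are toy values rather than results from the paper.

```python
def auroc(id_scores, ood_scores):
    """Fraction of (ID, OOD) pairs where the ID sample scores higher; ties count half."""
    if not id_scores or not ood_scores:
        raise ValueError("both score lists must be non-empty")
    wins = 0.0
    for s_id in id_scores:
        for s_ood in ood_scores:
            if s_id > s_ood:
                wins += 1.0
            elif s_id == s_ood:
                wins += 0.5
    return wins / (len(id_scores) * len(ood_scores))


def auroc_gain(baseline, candidate):
    """Each argument is an (id_scores, ood_scores) pair; returns candidate minus baseline AUROC."""
    return auroc(*candidate) - auroc(*baseline)


if __name__ == "__main__":
    # Toy numbers for illustration only; not results from the paper.
    baseline = ([0.9, 0.4], [0.5, 0.1])
    candidate = ([0.9, 0.6], [0.5, 0.1])
    print(auroc(*baseline), auroc(*candidate), auroc_gain(baseline, candidate))
```

A zero or negative gain on a held-out prompt-learning method and near-OOD dataset would be the kind of outcome that counts against the generalizability claim. The pairwise loop is quadratic in the number of samples, so a rank-based implementation would be preferable for full benchmarks.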

Original abstract

Prompt learning has emerged as an efficient and effective method for fine-tuning vision-language models such as CLIP. While many studies have explored generalisation abilities of these models in few-shot classification tasks and a few studies have addressed far out-of-distribution (OOD) of the models, their potential for addressing near OOD detection remains underexplored. Existing methods either require training from scratch, need fine-tuning, or are not designed for vision-language prompt learning. To address this, we introduce the Contrastive Logit Score (CLS), a novel post-hoc, plug-and-play scoring function. CLS significantly improves near OOD detection of pre-trained vision-language prompt learning methods without modifying their model architectures or requiring retraining. Our method achieves up to an 11.67% improvement in AUROC for near OOD detection with minimal computational overhead. Extensive evaluations validate the effectiveness, efficiency, and generalisability of our approach. Our code is available at https://github.com/davidmcjung/near-OOD-prompt-learning.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper proposes Contrastive Logit Score (CLS), a post-hoc plug-and-play scoring function for near out-of-distribution (OOD) detection in pre-trained vision-language prompt learning methods (e.g., based on CLIP). It claims that CLS improves AUROC by up to 11.67% over existing approaches without architecture changes or retraining, supported by extensive evaluations on effectiveness, efficiency, and generalizability, with code released.

Significance. If the central claim holds, CLS offers a lightweight, training-free enhancement to near-OOD detection for widely used prompt-tuned VLMs, addressing an underexplored area. The public code release supports reproducibility and allows direct verification of the reported gains.

major comments (3)
  1. [§3] §3 (Method, CLS definition): The contrastive formulation must explicitly state whether any scalar (temperature, margin, or normalization constant) is fixed globally or selected per prompt-learning method; any per-method choice would undermine the 'zero hyperparameter search' and 'plug-and-play' premise for arbitrary pre-trained prompt learners.
  2. [§4] §4 (Experiments): The reported AUROC gains (up to 11.67%) are load-bearing for the generalization claim; the tables must include an explicit statement or ablation confirming that CLS hyperparameters were held fixed across all tested prompt-learning methods (CoOp, CoCoOp, etc.) and near-OOD pairs with no post-hoc selection, otherwise the tuning-free advantage is not demonstrated.
  3. [§4.3] §4.3 (Baselines and near-OOD pairs): The comparison set must be expanded or justified to include strong post-hoc OOD baselines designed for VLMs; if only ID-tuned or far-OOD methods are used, the relative improvement cannot be taken as evidence that CLS is superior for the near-OOD regime.
minor comments (2)
  1. [Abstract] Abstract: The sentence 'achieves up to an 11.67% improvement in AUROC' should name the exact baseline method and dataset pair for immediate clarity.
  2. [§3] Notation: Define the image and text logit vectors (I and T) and the contrastive operation in a single equation early in §3 to avoid ambiguity when readers compare CLS to standard softmax or energy scores.
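
The second minor comment contrasts CLS with standard softmax and energy scores. For reference, those two baselines can be sketched as below. This is an illustrative sketch of the baseline scores only; the CLS formula itself is not given on this page and is not reproduced here. The function names and the `temperature` parameter are assumptions, and `logits` stands for the per-class image-text similarity logits of a single image.

```python
import math


def msp_score(logits, temperature=1.0):
    """Maximum softmax probability; higher suggests in-distribution."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in scaled]
    return max(exps) / sum(exps)


def negative_energy_score(logits, temperature=1.0):
    """Negative free energy, T * logsumexp(logits / T); higher suggests in-distribution."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    return temperature * (m + math.log(sum(math.exp(z - m) for z in scaled)))


if __name__ == "__main__":
    confident = [5.0, 0.0, 0.0]  # one class clearly dominates
    uncertain = [1.0, 0.9, 0.8]  # near-uniform logits
    print(msp_score(confident), msp_score(uncertain))
    print(negative_energy_score(confident), negative_energy_score(uncertain))
```

Both functions consume the same logit vector a prompt-tuned model already produces, which is what makes this family of scores post-hoc: swapping one for another changes only the inference-time scoring step.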

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive feedback. We address each major comment below and will make revisions to improve clarity and strengthen the claims where appropriate.

Point-by-point responses
  1. Referee: [§3] §3 (Method, CLS definition): The contrastive formulation must explicitly state whether any scalar (temperature, margin, or normalization constant) is fixed globally or selected per prompt-learning method; any per-method choice would undermine the 'zero hyperparameter search' and 'plug-and-play' premise for arbitrary pre-trained prompt learners.

    Authors: In the CLS definition, all scalars including the temperature parameter are fixed globally to the default values inherited from the pre-trained CLIP model (temperature set to 1, with no margin or additional normalization constants introduced). These values are identical for every prompt-learning method and are never selected or tuned per method. We will add an explicit statement in the revised §3 confirming this global fixation to reinforce the zero-hyperparameter and plug-and-play claims. revision: yes

  2. Referee: [§4] §4 (Experiments): The reported AUROC gains (up to 11.67%) are load-bearing for the generalization claim; the tables must include an explicit statement or ablation confirming that CLS hyperparameters were held fixed across all tested prompt-learning methods (CoOp, CoCoOp, etc.) and near-OOD pairs with no post-hoc selection, otherwise the tuning-free advantage is not demonstrated.

    Authors: CLS hyperparameters were held strictly fixed across all prompt-learning methods (CoOp, CoCoOp, etc.) and all near-OOD pairs, with no post-hoc selection or per-experiment tuning performed. We will insert an explicit statement in the revised §4 and add a short confirmation note (or small ablation if space allows) documenting that the same fixed settings were used throughout. revision: yes

  3. Referee: [§4.3] §4.3 (Baselines and near-OOD pairs): The comparison set must be expanded or justified to include strong post-hoc OOD baselines designed for VLMs; if only ID-tuned or far-OOD methods are used, the relative improvement cannot be taken as evidence that CLS is superior for the near-OOD regime.

    Authors: The baselines were chosen as the standard post-hoc scoring functions that can be applied directly to VLMs without retraining. We will expand §4.3 with a justification paragraph explaining their relevance to the near-OOD prompt-learning setting and, where feasible within page limits, include one or two additional VLM-oriented post-hoc baselines to strengthen the comparison. revision: partial

Circularity Check

0 steps flagged

No circularity: CLS introduced as independent post-hoc score with empirical validation

The paper defines CLS explicitly as a novel post-hoc, plug-and-play scoring function applied to pre-trained vision-language prompt learning models without retraining or architecture modification. No equations, parameters, or predictions in the provided text reduce by construction to fitted inputs, self-citations, or ansatzes from the same work. The AUROC gains are reported as results from extensive evaluations across methods and datasets, not as quantities forced by the method's own definition. The derivation chain is self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

No free parameters or invented entities are mentioned; the approach rests on standard domain assumptions about logit-based OOD signals.

axioms (1)
  • domain assumption: Logit-based scores can separate near-OOD from in-distribution samples
    Core premise underlying any logit-derived OOD detector.
