Conditional Compatibility Learning for Context-Dependent Anomaly Detection
Recognition: 2 Lean theorem links
Pith reviewed 2026-05-16 09:55 UTC · model grok-4.3
The pith
Global representations that mix subject and context are provably non-identifiable for context-dependent anomalies.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Any detector reasoning from a global representation that conflates subject and context is provably non-identifiable: two different subject-context configurations can map to the same embedding while requiring opposite labels, and no such detector can be correct on both. This impossibility motivates conditional compatibility learning, in which the model determines whether subjects are compatible with their surrounding context. The framework is instantiated in CC-CLIP, a vision-language architecture that learns disentangled subject- and context-aware representations from a single image and fuses visual evidence through text-conditioned attention.
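A minimal formal sketch of the impossibility claim, using assumed notation rather than the paper's exact statement: g(a, c) is the observation generated by subject a in context c, ϕ is the global encoder, h is the ground-truth anomaly label, and d is any detector that reads only the global embedding.

```latex
% Sketch of the non-identifiability argument (assumed notation, not the paper's exact proposition).
% Suppose two subject-context configurations collide under the global encoder,
% while requiring opposite labels:
\phi(g(a,c)) = \phi(g(a,c')) \quad\text{(embedding collision)}, \qquad
h(a,c) = 0 \;\neq\; 1 = h(a,c') \quad\text{(opposite labels)}.
% Any detector that factors through the global encoder assigns both the same score,
%   d(\phi(g(a,c))) = d(\phi(g(a,c'))),
% so it must mislabel at least one configuration: h is not identifiable from \phi alone
% whenever such collisions exist.
```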
What carries the argument
Conditional compatibility learning, which asks whether a subject is compatible with its specific context rather than whether the observation deviates from global normality, realized through disentangled subject- and context-aware representations fused by text-conditioned attention in CC-CLIP.
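A hypothetical sketch of how text-conditioned attention could fuse subject- and context-aware streams into a single compatibility score. This is not the paper's CC-CLIP implementation; module names, shapes, and the pooling choice are illustrative assumptions.

```python
# Hypothetical sketch of text-conditioned fusion over subject and context features.
# Not the paper's CC-CLIP code; names and shapes are illustrative only.
import torch
import torch.nn as nn


class TextConditionedFusion(nn.Module):
    """Fuse subject- and context-aware visual tokens, using text prompts as queries."""

    def __init__(self, dim: int = 512, heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.score_head = nn.Linear(dim, 1)  # compatibility logit

    def forward(self, text_emb, subject_emb, context_emb):
        # text_emb:    (B, T, D) embeddings of compatibility prompts
        # subject_emb: (B, S, D) subject-aware visual tokens
        # context_emb: (B, C, D) context-aware visual tokens
        visual = torch.cat([subject_emb, context_emb], dim=1)        # (B, S+C, D)
        fused, _ = self.attn(query=text_emb, key=visual, value=visual)
        # One score per image: is the subject compatible with its context?
        return self.score_head(fused.mean(dim=1)).squeeze(-1)        # (B,)
```

The design choice here is that the text prompts act as queries over the concatenated visual tokens, so the compatibility question is literally asked in language and answered by attending over both streams.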
If this is right
- CC-CLIP reaches state-of-the-art performance on real-world contextual anomaly detection benchmarks.
- A single-branch variant of CC-CLIP remains competitive on structural anomaly detection tasks.
- Anomaly labels that depend on subject-context relations become identifiable once representations are explicitly disentangled.
- Vision-language models can be adapted to perform compatibility checks without assuming abnormality is intrinsic to the observation.
Where Pith is reading between the lines
- The same disentanglement could be tested on relational tasks beyond anomaly detection, such as determining whether an object belongs in a given scene.
- If the single-image disentanglement holds, extending the approach to video sequences might allow context to be inferred across frames without additional labels.
- Datasets that explicitly annotate subject-context pairs would allow direct measurement of whether global embeddings actually collapse distinct configurations.
Load-bearing premise
The disentangled subject- and context-aware representations can be learned from single images without extra supervision or labels that would recreate the original identifiability problem.
What would settle it
A pair of images showing the same subject in two different contexts that produce identical global embeddings yet require opposite anomaly labels.
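One way to run that test in practice, sketched below with open_clip as an illustrative global encoder; the checkpoint choice and image paths are placeholders, not the paper's setup. A cosine similarity near 1.0 between the two embeddings would exhibit the collision the claim predicts.

```python
# Sketch of the proposed test: do two images (same subject, different contexts,
# opposite labels) nearly collapse under a global CLIP-style embedding?
import torch
import open_clip
from PIL import Image

model, _, preprocess = open_clip.create_model_and_transforms("ViT-B-32", pretrained="openai")
model.eval()

def global_embedding(path: str) -> torch.Tensor:
    image = preprocess(Image.open(path)).unsqueeze(0)
    with torch.no_grad():
        emb = model.encode_image(image)
    return emb / emb.norm(dim=-1, keepdim=True)

e_normal = global_embedding("runner_on_track.jpg")     # placeholder, labeled normal
e_anomal = global_embedding("runner_on_highway.jpg")   # placeholder, labeled anomalous
cosine = (e_normal @ e_anomal.T).item()
# Cosine close to 1.0: the global embedding barely separates the two configurations,
# even though they require opposite anomaly labels.
print(f"cosine similarity: {cosine:.4f}")
```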
Original abstract
Anomaly detection usually assumes that abnormality is an intrinsic property of an observation. A defect is a defect, and a rare object is rare, regardless of where it appears. Many real-world anomalies do not work this way. A runner on a track is normal, but the same runner on a highway is not. The subject is unchanged; only the context makes it anomalous. This setting, long recognized as contextual anomaly detection, remains largely underexplored in modern vision-language systems. The difficulty is not merely empirical; it is formal. When anomaly labels depend on the relation between a subject and its context, any detector reasoning from a global representation that conflates subject and context is provably non-identifiable: two different subject-context configurations can map to the same embedding while requiring opposite labels, and no such detector can be correct on both. This impossibility motivates a different formulation: instead of asking whether an observation deviates from a global notion of normality, the model should ask whether subjects are compatible with their surrounding context. We define this as conditional compatibility learning. We instantiate this framework in CC-CLIP, a vision-language architecture that learns disentangled subject- and context-aware representations from a single image and fuses visual evidence through text-conditioned attention. CC-CLIP achieves state-of-the-art results on real-world contextual anomaly detection, substantially outperforming all existing CLIP-based and context-reasoning baselines. A single-branch variant of CC-CLIP also achieves competitive performance on structural anomaly benchmarks.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper argues that global representations conflating subject and context are provably non-identifiable for context-dependent anomalies, as two configurations can share an embedding yet require opposite labels. It proposes conditional compatibility learning as an alternative formulation and instantiates it in CC-CLIP, a vision-language model that learns disentangled subject- and context-aware representations from single images via text-conditioned attention. The work reports state-of-the-art results on real-world contextual anomaly detection benchmarks and competitive performance on structural anomaly tasks with a single-branch variant.
Significance. If the non-identifiability argument holds and the CC-CLIP architecture demonstrably resolves the identifiability issue through its disentanglement mechanism, the contribution would be significant: it supplies a formal motivation for moving beyond global embeddings in contextual anomaly detection and offers a concrete VLM-based implementation that could influence downstream work on context-aware vision-language systems.
major comments (3)
- [Introduction / non-identifiability claim] The non-identifiability argument is presented as a general proof in the abstract and introduction, but the manuscript does not reduce it to an explicit pair of subject-context configurations (e.g., runner-on-track vs. runner-on-highway) that map to identical embeddings while demanding opposite anomaly labels; without this concrete counterexample, it is difficult to verify that the impossibility result applies to standard CLIP-style global pooling.
- [CC-CLIP architecture description] CC-CLIP is described as learning disentangled subject- and context-aware representations through text-conditioned attention, yet no auxiliary loss, orthogonality constraint, or explicit separation objective is specified to guarantee independence of the two streams; absent such a mechanism, the fused output could remain partially entangled and inherit the same non-identifiability counterexamples.
- [Experiments] The claim of state-of-the-art results on real-world contextual anomaly detection lacks any reference to the specific datasets, evaluation metrics, or baseline implementations used; without these details or ablations isolating the contribution of the disentangled representations, the empirical superiority cannot be assessed.
minor comments (1)
- [Abstract] The abstract mentions a 'single-branch variant' achieving competitive results on structural anomaly benchmarks but does not clarify how this variant differs architecturally from the main two-branch CC-CLIP model.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback. We address each major comment below and describe the revisions we will make to strengthen the manuscript.
Point-by-point responses
Referee: [Introduction / non-identifiability claim] The non-identifiability argument is presented as a general proof in the abstract and introduction, but the manuscript does not reduce it to an explicit pair of subject-context configurations (e.g., runner-on-track vs. runner-on-highway) that map to identical embeddings while demanding opposite anomaly labels; without this concrete counterexample, it is difficult to verify that the impossibility result applies to standard CLIP-style global pooling.
Authors: We agree that an explicit counterexample would make the non-identifiability claim easier to verify. The abstract already uses the runner-on-track (normal) versus runner-on-highway (anomalous) scenario to motivate the issue. In the revised manuscript we will add a short formal subsection that constructs these two configurations, shows they collide under global CLIP-style pooling, and derives the opposite required labels, thereby confirming the impossibility result for standard global embeddings. revision: yes
Referee: [CC-CLIP architecture description] CC-CLIP is described as learning disentangled subject- and context-aware representations through text-conditioned attention, yet no auxiliary loss, orthogonality constraint, or explicit separation objective is specified to guarantee independence of the two streams; absent such a mechanism, the fused output could remain partially entangled and inherit the same non-identifiability counterexamples.
Authors: The text-conditioned attention mechanism produces separate subject and context streams by conditioning on distinct prompts. We acknowledge that an explicit independence guarantee is currently missing. In the revision we will introduce an orthogonality loss between the subject and context feature vectors and report ablations quantifying its effect on disentanglement and downstream performance. revision: yes
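A minimal sketch of the kind of orthogonality penalty the rebuttal proposes between the subject and context streams; this is an assumed formulation, not the authors' implementation, and the weighting is left as a hyperparameter.

```python
# Illustrative orthogonality penalty between subject and context features.
# A sketch of the mechanism named in the rebuttal, not the authors' code.
import torch
import torch.nn.functional as F


def orthogonality_loss(subject_feat: torch.Tensor, context_feat: torch.Tensor) -> torch.Tensor:
    """Penalize alignment between per-image subject and context embeddings.

    Both inputs have shape (B, D); the loss is the mean squared cosine
    similarity, which is zero exactly when the two streams are orthogonal.
    """
    s = F.normalize(subject_feat, dim=-1)
    c = F.normalize(context_feat, dim=-1)
    return ((s * c).sum(dim=-1) ** 2).mean()


# Usage sketch: total_loss = task_loss + lambda_orth * orthogonality_loss(subj, ctx)
```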
Referee: [Experiments] The claim of state-of-the-art results on real-world contextual anomaly detection lacks any reference to the specific datasets, evaluation metrics, or baseline implementations used; without these details or ablations isolating the contribution of the disentangled representations, the empirical superiority cannot be assessed.
Authors: The full manuscript already specifies the real-world contextual benchmarks, AUROC/AUPRC metrics, and the full set of CLIP-based and context-reasoning baselines. To improve accessibility we will add explicit citations to these elements in the abstract and introduction, and include a new ablation table that isolates the contribution of the disentangled representations. revision: partial
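For reference, the metrics named in this response are standard threshold-free scores computed from per-image anomaly scores; a generic sketch with scikit-learn, not tied to the paper's evaluation code, with toy labels and scores shown only as placeholders.

```python
# Generic computation of the reported metric types from anomaly scores.
from sklearn.metrics import roc_auc_score, average_precision_score

# labels: 1 = contextual anomaly, 0 = normal; scores: higher = more anomalous.
labels = [0, 0, 1, 1, 0, 1]            # placeholder data
scores = [0.1, 0.3, 0.8, 0.6, 0.2, 0.9]

auroc = roc_auc_score(labels, scores)            # area under the ROC curve
auprc = average_precision_score(labels, scores)  # area under the precision-recall curve
print(f"AUROC={auroc:.3f}  AUPRC={auprc:.3f}")
```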
Circularity Check
No circularity: non-identifiability claim and CC-CLIP instantiation remain independent
Full rationale
The paper states a general non-identifiability result for any global embedding that conflates subject and context, then defines conditional compatibility learning as an alternative formulation and instantiates it in CC-CLIP via text-conditioned attention over disentangled representations. No equation or definition reduces the non-identifiability proof to the CC-CLIP parameters, no fitted input is relabeled as a prediction, and no self-citation chain is invoked to justify the core impossibility result. The derivation chain is therefore self-contained, and the empirical claims are checked against external benchmarks rather than against the theory's own outputs.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: Global representations that conflate subject and context are provably non-identifiable for context-dependent anomaly labels.
Lean theorems connected to this paper
- IndisputableMonolith/Foundation/AbsoluteFloorClosure.lean · absolute_floor_iff_bare_distinguishability (tagged unclear)
  Relation between the paper passage and the cited Recognition theorem is unclear.
  Linked passage: "Proposition 4.1 (Non-identifiability under intrinsic representation collisions)... ϕ(g(a,c)) = ϕ(g(a,c′)) but h(a,c) ≠ h(a,c′)"
- IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel (tagged unclear)
  Relation between the paper passage and the cited Recognition theorem is unclear.
  Linked passage: "Context-Selective Residuals (CSR)... Compatibility Reasoning Module (CRM) fuses... text-conditioned attention"
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.