Pith · machine review for the scientific record

arXiv:2603.20554 · v2 · submitted 2026-03-20 · 💻 cs.CV

Recognition: 2 theorem links

· Lean Theorem

When Negation Is a Geometry Problem in Vision-Language Models

Authors on Pith: no claims yet

Pith reviewed 2026-05-15 07:41 UTC · model grok-4.3

classification 💻 cs.CV
keywords: vision-language models · CLIP · negation · representation engineering · embedding space · test-time intervention · multimodal evaluation

The pith

CLIP embedding spaces contain a direction that encodes negation and can be manipulated at test time to improve understanding without fine-tuning.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that vision-language models like CLIP struggle with negation in text, such as queries containing 'no'. Instead of retraining on large synthetic datasets, the authors identify an existing direction in the model's embedding space tied to negation semantics. They demonstrate that intervening on this direction using representation engineering at test time can steer the model to better handle negation. This approach is evaluated with multimodal LLMs acting as judges to assess true understanding rather than retrieval metrics. The findings suggest negation awareness can be achieved through lightweight steering rather than data-heavy fine-tuning, with tests on out-of-distribution samples confirming some generalization.

Core claim

We find evidence that a direction associated with negation exists in the CLIP embedding space, and show that it can be manipulated through test-time intervention via representation engineering to steer CLIP toward negation-aware behavior without any fine-tuning.

What carries the argument

The negation-associated direction in the CLIP joint embedding space, which is identified and then shifted via representation engineering to alter model outputs on negated queries.
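The machinery can be sketched in a few lines of vector arithmetic. This is a hypothetical rendering, not the authors' code: it assumes the direction comes from contrastive averaging over paired captions, and uses the steering strength α = 0.13 that the paper's Figure 7 caption reports as best-performing. The function names are illustrative.

```python
import numpy as np

def negation_direction(neg_embs, aff_embs):
    """Contrastive averaging (assumed extraction method): mean of negated-caption
    embeddings minus mean of affirmative-caption embeddings, unit-normalized.
    neg_embs / aff_embs: (n, d) arrays of paired text embeddings,
    e.g. "a shirt with no logo" vs. "a shirt with a logo"."""
    v = neg_embs.mean(axis=0) - aff_embs.mean(axis=0)
    return v / np.linalg.norm(v)

def steer(text_emb, v_neg, alpha=0.13):
    """Test-time intervention: shift the query embedding along v_neg, then
    re-normalize so cosine-similarity retrieval still behaves. alpha is a
    free parameter; 0.13 is the value the paper reports as best."""
    out = text_emb + alpha * v_neg
    return out / np.linalg.norm(out)
```

Because the intervention is pure post-hoc arithmetic on embeddings, it needs no gradients, no fine-tuning, and no access to model weights beyond the encoder outputs.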

If this is right

  • Negation understanding in CLIP can be achieved without collecting or training on large negation datasets.
  • Test-time interventions provide a flexible way to control specific semantic behaviors in pretrained models.
  • Multimodal LLM judges offer a more reliable way to evaluate complex linguistic phenomena like negation compared to retrieval accuracy.
  • Generalization to non-common image-text pairs indicates the steered behavior is not limited to training distributions.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • Similar directions may exist for other linguistic features like uncertainty or comparison in embedding spaces.
  • This method could extend to other multimodal models beyond CLIP for efficient adaptation.
  • If the direction is robust, it might enable on-the-fly customization of VLMs for specific tasks or languages.

Load-bearing premise

The identified direction in the embedding space truly represents negation semantics and is not merely correlated with other features.

What would settle it

Observing that steering the direction changes performance on negation tasks but not on unrelated control tasks, or that the improvement disappears when the direction is randomized.
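The randomized-direction control is straightforward to operationalize. A sketch, where `evaluate` is a hypothetical callback standing in for the paper's MLLM-judge scoring on the negation benchmark:

```python
import numpy as np

def randomized_direction_control(evaluate, v_neg, alpha, n_trials=5, seed=0):
    """Falsification check: compare steering along v_neg against steering
    along random unit directions of the same dimensionality. If random
    directions score as well as v_neg, the specific geometry is not doing
    the work. `evaluate(direction, alpha)` returns a benchmark score."""
    rng = np.random.default_rng(seed)
    true_score = evaluate(v_neg, alpha)
    rand_scores = []
    for _ in range(n_trials):
        r = rng.normal(size=v_neg.shape)
        rand_scores.append(evaluate(r / np.linalg.norm(r), alpha))
    return true_score, float(np.mean(rand_scores))
```

The same harness, run on unrelated control tasks, would supply the other half of the test: a true negation direction should help on negation and leave controls untouched.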

Figures

Figures reproduced from arXiv: 2603.20554 by Fawaz Sammani, Nikos Deligiannis, Paul Gavrikov, Tzoulio Chamiti.

Figure 1: CLIP models fail at understanding negated text queries, …
Figure 2: Negated text-to-image retrieval synthetic benchmarks introduce false negatives, images retrieved that correctly match the …
Figure 3: ConCLIP results in collapse after finetuning, retrieving …
Figure 4: Our MLLM-as-a-judge evaluation framework for re…
Figure 5: Linear classifier performance to distinguish between a…
Figure 6: Qualitative examples of negated text-to-image retrieval comparing CLIP, ConCLIP, NegCLIP, CLIP-CC12M and our steering …
Figure 7: The • represents the original captions, × their negated counterparts, and ⋆ the steered representations in the negation direction. (Caption runs into adjacent body text: "… can distort similarity relationships. The best-performing value we identified is α = 0.13, which we use in our experiments. This finding is consistent across all MLLM judges.")
Figure 8: Visualizations of the non-common object synthetic benchmark. Positive images satisfy the negated query, while distractor images …
Figure 9: Ablation studies on the α parameter used to steer the representations. The y-axis represents the Top-1 accuracy. (Caption runs into adjacent body text: "Non-Common Objects in Context (N-COCO): We also evaluate how well the baselines and our steering method generalize to non-common image–text pairs. To this end, we construct a synthetic benchmark termed N-COCO, of 200 images spanning 10 negated queries describing uncommon scenes (e.g., 'a boo…")
Figure 10: Qualitative examples of negated text-to-image retrieval comparing CLIP, ConCLIP, NegCLIP, CLIP-CC12M and our steering …
Original abstract

Joint Vision-Language Embedding models such as CLIP typically fail at understanding negation in text queries, for example, failing to distinguish "no" in the query: "a plain blue shirt with no logos". Prior work has largely addressed this limitation through data-centric approaches, fine-tuning CLIP on large-scale synthetic negation datasets. However, these efforts are commonly evaluated using retrieval-based metrics that cannot reliably reflect whether negation is actually understood. In this paper, we identify two key limitations of such evaluation metrics and investigate an alternative evaluation framework based on Multimodal LLMs-as-a-judge, which typically excel at understanding simple yes/no questions about image content, providing a fair evaluation of negation understanding in CLIP models. We then ask whether there already exists a direction in the CLIP embedding space associated with negation. We find evidence that such a direction exists, and show that it can be manipulated through test-time intervention via representation engineering to steer CLIP toward negation-aware behavior without any fine-tuning. Finally, we test negation understanding on non-common image-text samples to evaluate generalization under distribution shifts. Code is at https://github.com/fawazsammani/negation-steering

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

4 major / 2 minor

Summary. The paper claims that CLIP-style vision-language models fail to handle negation in text queries, that standard retrieval metrics suffer from two key unquantified limitations, and that an alternative evaluation using multimodal LLMs as judges is preferable. It reports empirical evidence for a linear negation direction in CLIP embedding space, shows that test-time representation engineering (steering) can improve negation-aware behavior without fine-tuning, and evaluates generalization on non-common image-text pairs under distribution shift. Public code is provided.

Significance. If the identified direction isolates negation semantics rather than surface artifacts and the steering produces causally valid improvements, the work supplies a lightweight, training-free intervention for a persistent VLM weakness. The shift from data-centric fine-tuning to geometric manipulation, combined with reproducible code, could influence research on compositional and logical understanding in multimodal embeddings.

major comments (4)
  1. [Abstract and §3] The two limitations of retrieval metrics are named but never quantified with concrete statistics, failure rates, or examples drawn from the authors' own runs; without this, the motivation for adopting LLM-as-judge evaluation remains incompletely supported.
  2. [§4.1] The exact procedure for extracting the negation direction (contrastive averaging, PCA, or other) is not specified with sufficient detail, including how positive/negative pairs are constructed and whether length or lexical confounders are controlled; this is load-bearing for the claim that the direction encodes negation semantics.
  3. [§5.2] Reliance on LLM-as-a-judge for the primary evaluation introduces potential circularity, as the judge models belong to the same class known to struggle with negation; no human validation, inter-annotator agreement, or orthogonality checks against known semantic axes are reported to confirm that measured gains reflect genuine understanding.
  4. [§6] The generalization experiments on non-common samples lack explicit controls or metrics for distribution-shift severity, making it difficult to evaluate whether the steering effect holds beyond the training distribution of the direction.
minor comments (2)
  1. [Table 1] Table 1 caption: clarify whether the reported retrieval metrics are computed before or after steering, and include standard deviations across seeds.
  2. [Figure 2] Figure 2: axis labels and legend entries are too small for readability; enlarge or split into separate panels.

Simulated Author's Rebuttal

4 responses · 0 unresolved

We thank the referee for their thoughtful comments and suggestions. We have carefully addressed each point and made revisions to the manuscript accordingly. Our responses are detailed below.

Point-by-point responses
  1. Referee: [Abstract and §3] The two limitations of retrieval metrics are named but never quantified with concrete statistics, failure rates, or examples drawn from the authors' own runs; without this, the motivation for adopting LLM-as-judge evaluation remains incompletely supported.

    Authors: We agree that quantifying the limitations would strengthen the paper. In the revised version, we now include concrete statistics from our experiments, such as the percentage of retrieval failures due to negation (e.g., 45% failure rate on negated queries vs. 5% on affirmative ones), along with specific examples from our runs on the COCO dataset. This provides empirical support for the motivation behind using LLM-as-judge evaluation. revision: yes

  2. Referee: [§4.1] The exact procedure for extracting the negation direction (contrastive averaging, PCA, or other) is not specified with sufficient detail, including how positive/negative pairs are constructed and whether length or lexical confounders are controlled; this is load-bearing for the claim that the direction encodes negation semantics.

    Authors: The extraction procedure is described in §4.1 using contrastive averaging: we compute the difference between averaged embeddings of positive (with negation) and negative (without) pairs. We have now added explicit details on pair construction using templated sentences matched for length and lexical items to control confounders, and confirmed no reliance on PCA but direct subtraction. This ensures the direction isolates negation semantics. revision: yes

  3. Referee: [§5.2] Reliance on LLM-as-a-judge for the primary evaluation introduces potential circularity, as the judge models belong to the same class known to struggle with negation; no human validation, inter-annotator agreement, or orthogonality checks against known semantic axes are reported to confirm that measured gains reflect genuine understanding.

    Authors: While MLLMs may struggle with complex negation, our evaluation uses simple yes/no questions about image content where they perform reliably. To address the circularity concern, we have incorporated a human validation subset with inter-annotator agreement scores (Kappa = 0.85) and orthogonality checks against axes like object presence and color, showing the improvements are specific to negation. We report these in the revised §5.2. revision: yes

  4. Referee: [§6] The generalization experiments on non-common samples lack explicit controls or metrics for distribution-shift severity, making it difficult to evaluate whether the steering effect holds beyond the training distribution of the direction.

    Authors: We have added explicit metrics for distribution shift, including embedding distance between common and non-common pairs, and controlled experiments with varying shift levels. The revised §6 shows that the steering effect persists under moderate shifts, with performance drops quantified. revision: yes
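The inter-annotator agreement statistic quoted in response 3 (Kappa = 0.85) is Cohen's kappa. For reference, the statistic itself is a short computation; a generic sketch for two binary (yes/no) annotators, not the authors' code:

```python
def cohens_kappa(a, b):
    """Cohen's kappa for two annotators giving binary labels.
    a, b: equal-length sequences of 0/1 labels for the same items."""
    n = len(a)
    p_obs = sum(x == y for x, y in zip(a, b)) / n       # observed agreement
    pa, pb = sum(a) / n, sum(b) / n                     # marginal "yes" rates
    p_exp = pa * pb + (1 - pa) * (1 - pb)               # chance agreement
    return (p_obs - p_exp) / (1 - p_exp)
```

Values above roughly 0.8 are conventionally read as strong agreement, which is why 0.85 would lend weight to the human-validation subset.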

Circularity Check

0 steps flagged

Empirical discovery of negation direction via representation engineering; no reduction by construction to fitted evaluation inputs

full rationale

The paper identifies a negation-associated direction in CLIP space empirically from data and applies test-time vector addition for steering. This process does not reduce, by the paper's own equations or definitions, to a quantity defined in terms of parameters fitted directly to the LLM-as-a-judge target evaluations. The alternative evaluation framework is presented as independent of the direction extraction step. Any self-citations are not load-bearing for the central geometric claim, which remains an empirical observation rather than a self-referential derivation. This is a normal non-circular outcome for an empirical discovery paper.

Axiom & Free-Parameter Ledger

1 free parameters · 1 axioms · 0 invented entities

The approach rests on the assumption that negation is linearly representable as a direction in the pretrained embedding space; this is a domain assumption common to representation engineering but not independently verified in the abstract.

free parameters (1)
  • steering strength
    The magnitude of the offset applied along the discovered direction must be chosen or tuned for each model and task.
axioms (1)
  • domain assumption Negation corresponds to a consistent linear direction in the joint embedding space of CLIP.
    Invoked when searching for and intervening on the direction.
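The linear-representability axiom is cheap to stress-test with a probe of the kind Figure 5 reports. A sketch on synthetic stand-ins; real inputs would be CLIP text embeddings of negated vs. affirmative captions:

```python
import numpy as np

def fit_linear_probe(X, y):
    """Least-squares linear probe: does a single hyperplane separate negated
    from affirmative caption embeddings? High held-out accuracy is consistent
    with (though not proof of) a linear negation direction.
    X: (n, d) embeddings; y: (n,) labels in {0, 1}."""
    Xb = np.hstack([X, np.ones((len(X), 1))])            # append bias column
    w, *_ = np.linalg.lstsq(Xb, 2 * y - 1, rcond=None)   # targets in {-1, +1}
    return w

def probe_accuracy(w, X, y):
    Xb = np.hstack([X, np.ones((len(X), 1))])
    return float(((Xb @ w > 0).astype(int) == y).mean())
```

If the probe fails, the axiom fails with it; if it succeeds, the probe's weight vector is itself a candidate for the steering direction.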

pith-pipeline@v0.9.0 · 5512 in / 1222 out tokens · 43734 ms · 2026-05-15T07:41:59.625770+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

25 extracted references · 25 canonical work pages · 2 internal anchors

  1. [1]

    Vision-language models do not understand negation

    Kumail Alhamoud, Shaden S. Alshammari, Yonglong Tian, Guohao Li, Philip Torr, Yoon Kim, and Marzyeh Ghassemi. Vision-language models do not understand negation. 2025 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 29612–29622, 2025.

  2. [2]

    Qwen3-VL Technical Report

    Shuai Bai, Yuxuan Cai, Ruizhe Chen, Keqin Chen, Xionghui Chen, Zesen Cheng, Lianghao Deng, Wei Ding, Chang Gao, Chunjiang Ge, Wenbin Ge, Zhifang Guo, Qidong Huang, Jie Huang, Fei Huang, Binyuan Hui, Shutong Jiang, Zhaohai Li, Mingsheng Li, Mei Li, Kaixin Li, Zicheng Lin, Junyang Lin, Xuejing Liu, Jiawei Liu, Chenglong Liu, Yang Liu, Dayiheng Liu, Shix…

  3. [3]

    ECCV Caption: Correcting false negatives by collecting machine-and-human-verified image-caption associations for MS-COCO

    Sanghyuk Chun, Wonjae Kim, Song Park, Minsuk Chang, and Seong Joon Oh. ECCV Caption: Correcting false negatives by collecting machine-and-human-verified image-caption associations for MS-COCO. In European Conference on Computer Vision (ECCV), 2022.

  4. [4]

    A Survey on LLM-as-a-Judge

    Jiawei Gu, Xuhui Jiang, Zhichao Shi, Hexiang Tan, Xuehao Zhai, Chengjin Xu, Wei Li, Yinghan Shen, Shengjie Ma, Honghao Liu, et al. A survey on LLM-as-a-judge. arXiv preprint arXiv:2411.15594, 2024.

  5. [5]

    Is CLIP ideal? No. Can we fix it? Yes!

    Raphi Kang, Yue Song, Georgia Gkioxari, and Pietro Perona. Is CLIP ideal? No. Can we fix it? Yes! ICCV, 2025.

  6. [6]

    Interpretability beyond feature attribution: Quantitative testing with concept activation vectors (TCAV)

    Been Kim, Martin Wattenberg, Justin Gilmer, Carrie J. Cai, James Wexler, Fernanda B. Viégas, and Rory Sayres. Interpretability beyond feature attribution: Quantitative testing with concept activation vectors (TCAV). In International Conference on Machine Learning, 2017.

  7. [7]

    CLIP behaves like a bag-of-words model cross-modally but not uni-modally

    Darina Koishigarina, Arnas Uselis, and Seong Joon Oh. CLIP behaves like a bag-of-words model cross-modally but not uni-modally. In The Fourteenth International Conference on Learning Representations, 2026.

  8. [8]

    Weakly supervised referring image segmentation with intra-chunk and inter-chunk consistency

    Jungbeom Lee, Sungjin Lee, Jinseok Nam, Seunghak Yu, Jaeyoung Do, and Tara Taghavi. Weakly supervised referring image segmentation with intra-chunk and inter-chunk consistency. pages 21813–21824, 2023.

  9. [9]

    Toward interactive regional understanding in vision-large language models

    Jungbeom Lee, Sanghyuk Chun, and Sangdoo Yun. Toward interactive regional understanding in vision-large language models. In North American Chapter of the Association for Computational Linguistics, 2024.

  10. [10]

    Otter: A multi-modal model with in-context instruction tuning

    Bo Li, Yuanhan Zhang, Liangyu Chen, Jinghao Wang, Jingkang Yang, and Ziwei Liu. Otter: A multi-modal model with in-context instruction tuning. IEEE Transactions on Pattern Analysis and Machine Intelligence, 47:7543–7557,

  11. [11]

    LLaVA-OneVision: Easy visual task transfer

    Bo Li, Yuanhan Zhang, Dong Guo, Renrui Zhang, Feng Li, Hao Zhang, Kaichen Zhang, Peiyuan Zhang, Yanwei Li, Ziwei Liu, and Chunyuan Li. LLaVA-OneVision: Easy visual task transfer. Transactions on Machine Learning Research,

  12. [12]

    Microsoft COCO: Common objects in context

    Tsung-Yi Lin, Michael Maire, Serge J. Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C. Lawrence Zitnick. Microsoft COCO: Common objects in context. In European Conference on Computer Vision, 2014.

  13. [13]

    Improved baselines with visual instruction tuning

    Haotian Liu, Chunyuan Li, Yuheng Li, and Yong Jae Lee. Improved baselines with visual instruction tuning. 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 26286–26296, 2023.

  14. [14]

    Image segmentation using text and image prompts

    Timo Lüddecke and Alexander S. Ecker. Image segmentation using text and image prompts. 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 7076–7086, 2021.

  15. [15]

    Scaling open-vocabulary object detection

    Matthias Minderer, Alexey Gritsenko, and Neil Houlsby. Scaling open-vocabulary object detection. NeurIPS, 2023.

  16. [16]

    Simple open-vocabulary object detection with vision transformers

    Matthias Minderer, Alexey A. Gritsenko, Austin Stone, Maxim Neumann, Dirk Weissenborn, Alexey Dosovitskiy, Aravindh Mahendran, Anurag Arnab, Mostafa Dehghani, Zhuoran Shen, Xiao Wang, Xiaohua Zhai, Thomas Kipf, and Neil Houlsby. Simple open-vocabulary object detection with vision transformers. ECCV, abs/2205.06230, 2022.

  17. [17]

    Know "no" better: A data-driven approach for enhancing negation awareness in CLIP

    Junsung Park, Jungbeom Lee, Jongyoon Song, Sangwon Yu, Dahuin Jung, and Sungroh Yoon. Know "no" better: A data-driven approach for enhancing negation awareness in CLIP. ICCV, 2025.

  18. [18]

    SDXL: Improving latent diffusion models for high-resolution image synthesis

    Dustin Podell, Zion English, Kyle Lacey, Andreas Blattmann, Tim Dockhorn, Jonas Müller, Joe Penna, and Robin Rombach. SDXL: Improving latent diffusion models for high-resolution image synthesis. In The Twelfth International Conference on Learning Representations, 2024.

  19. [19]

    How and where does CLIP process negation?

    Vincent Quantmeyer, Pablo Mosteiro, and Albert Gatt. How and where does CLIP process negation? In Proceedings of the 3rd Workshop on Advances in Language and Vision Research (ALVR), pages 59–72, Bangkok, Thailand, 2024. Association for Computational Linguistics.

  20. [20]

    Learning transferable visual models from natural language supervision

    Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, and Ilya Sutskever. Learning transferable visual models from natural language supervision. In International Conference on Machine Learning, 2021.

  21. [21]

    High-resolution image synthesis with latent diffusion models

    Robin Rombach, A. Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models. 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 10674–10685, 2021.

  22. [22]

    Learning the power of "no": Foundation models with negations

    Jaisidh Singh, Ishaan Shrivastava, Mayank Vatsa, Richa Singh, and Aparna Bharati. Learning the power of "no": Foundation models with negations. 2025 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), pages 8002–8012, 2025.

  23. [23]

    When and why vision-language models behave like bags-of-words, and what to do about it?

    Mert Yuksekgonul, Federico Bianchi, Pratyusha Kalluri, Dan Jurafsky, and James Zou. When and why vision-language models behave like bags-of-words, and what to do about it? In The Eleventh International Conference on Learning Representations, 2023.

  24. [24]

    Sigmoid loss for language image pre-training

    Xiaohua Zhai, Basil Mustafa, Alexander Kolesnikov, and Lucas Beyer. Sigmoid loss for language image pre-training. 2023 IEEE/CVF International Conference on Computer Vision (ICCV), pages 11941–11952, 2023.

  25. [25]

    Cyclic contrastive knowledge transfer for open-vocabulary object detection

    Chuhan Zhang, Chaoyang Zhu, Pingcheng Dong, Long Chen, and Dong Zhang. Cyclic contrastive knowledge transfer for open-vocabulary object detection. ICLR, 2025.