GraSP-VL: Length as a Semantic Granularity Interface for Vision-Language Representations

Chengchang Pan; Honggang Qi; Zesheng Li

arxiv: 2605.17727 · v1 · pith:E2J37SDCnew · submitted 2026-05-18 · 💻 cs.CV

Pith reviewed 2026-05-19 21:46 UTC · model grok-4.3

keywords: vision-language models · semantic granularity · prefix embeddings · matryoshka representations · frozen embeddings · truncatable interfaces · embedding reorganization · multi-resolution semantics

The pith

A learned prefix transform reorganizes frozen vision-language embeddings so that length directly controls semantic granularity from coarse to fine.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests whether the length of an embedding vector can serve as a controllable dial for accessing different levels of semantic detail in frozen vision-language models. It introduces a single shared transform that produces a Semantic Matryoshka structure: shorter prefixes handle basic object identity while longer prefixes add attributes, relations, and caption-level meaning. Because the transform is near-orthogonal and the underlying model stays frozen, the full-dimensional geometry stays intact and the same interface works for both images and text. Experiments on COCO, Flickr30K, SugarCrepe, and CIFAR-100 show that longer prefixes improve hard-negative discrimination, while full-dimensional zero-shot accuracy is preserved and full-space drift stays below 10^{-6}. This approach reframes embedding compression as a semantic access mechanism rather than a lossy reduction.

Core claim

Frozen vision-language embeddings already encode signals at multiple semantic resolutions; a shared near-orthogonal prefix transform learned over these frozen vectors creates a Semantic Matryoshka interface in which short prefixes receive coarse semantic roles and successively longer prefixes expose finer language-grounded distinctions, all while preserving the original full-dimensional geometry and enabling truncation without retraining the underlying model.

What carries the argument

The Semantic Matryoshka interface produced by a single shared near-orthogonal prefix transform that progressively assigns coarse-to-fine semantic roles to increasing prefix lengths.
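
As a rough illustration of this interface, the sketch below applies one shared orthogonal matrix to image and text embeddings and scores them using only the first `k` coordinates. The random matrix `Q`, the helper `prefix_similarity`, and the toy dimensions are assumptions made for illustration; the paper's learned transform and training procedure are not reproduced here.

```python
import numpy as np

def l2_normalize(x, eps=1e-12):
    """Scale each row to unit length."""
    return x / np.maximum(np.linalg.norm(x, axis=-1, keepdims=True), eps)

def prefix_similarity(img, txt, Q, k):
    """Cosine similarity using only the first k coordinates after applying Q."""
    zi = l2_normalize((img @ Q)[:, :k])
    zt = l2_normalize((txt @ Q)[:, :k])
    return zi @ zt.T

rng = np.random.default_rng(0)
d = 64
img = l2_normalize(rng.normal(size=(5, d)))   # stand-in frozen image embeddings
txt = l2_normalize(rng.normal(size=(5, d)))   # stand-in frozen text embeddings

# Assumption: a random orthogonal matrix stands in for the learned transform.
Q, _ = np.linalg.qr(rng.normal(size=(d, d)))

short = prefix_similarity(img, txt, Q, 8)     # truncated 8-dim interface
full = prefix_similarity(img, txt, Q, d)      # full length: same as img @ txt.T
```

Because `Q` is orthogonal, the full-length scores match the original cosine similarities up to floating-point error; only the truncated scores depend on how `Q` orders the coordinates.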

If this is right

Embeddings become truncatable at inference time to trade semantic detail for speed or memory without retraining the VLM.
The same transform applies uniformly to image and text sides, enabling consistent granularity control across modalities.
Zero-shot performance on downstream tasks such as CIFAR-100 classification remains essentially unchanged after the reorganization.
Hard-negative selectivity improves as prefix length grows, indicating that additional dimensions add discriminative power rather than noise.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Resource-constrained deployments could dynamically select prefix length based on available compute while still accessing the same underlying representation.
The approach might extend to pure language models if a similar shared prefix transform can be learned over frozen text embeddings alone.
Downstream tasks that require variable semantic depth, such as hierarchical retrieval or progressive captioning, could use prefix length as an explicit control knob.
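
One way the hierarchical-retrieval idea could look in practice is a two-stage search: shortlist candidates with a short prefix, then rerank the shortlist with the full vectors. This is a sketch under stated assumptions, not something the paper describes; `coarse_to_fine_search`, the prefix length, and the shortlist size are hypothetical, and the embeddings are assumed to have already passed through the prefix transform.

```python
import numpy as np

def normalize(x):
    """Scale vectors along the last axis to unit length."""
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

def coarse_to_fine_search(query, gallery, k_short, shortlist_size):
    """Shortlist with a k_short-dim prefix, then rerank with full vectors."""
    coarse_scores = normalize(gallery[:, :k_short]) @ normalize(query[:k_short])
    shortlist = np.argsort(-coarse_scores)[:shortlist_size]
    fine_scores = normalize(gallery[shortlist]) @ normalize(query)
    return shortlist[np.argsort(-fine_scores)]

rng = np.random.default_rng(0)
gallery = rng.normal(size=(200, 64))                  # toy transformed embeddings
query = gallery[17] + 0.05 * rng.normal(size=64)      # noisy copy of item 17
ranked = coarse_to_fine_search(query, gallery, k_short=16, shortlist_size=20)
```

The coarse stage touches only 16 of the 64 coordinates per gallery item, so the cost of the first pass scales with the prefix length rather than the full dimension.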

Load-bearing premise

A single learned transform can progressively map longer prefixes to finer semantic distinctions while leaving the full embedding geometry unchanged.

What would settle it

Measure whether accuracy on fine-grained attribute or relation distinctions plateaus or drops when prefix length is increased beyond the point where coarse object identity is already recovered, or check whether full-space drift, measured on full-dimensional cosine similarities before and after the transform, exceeds 10^{-5}.
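
The drift check could be run along the following lines. The page does not give the paper's exact definition of drift, so this sketch assumes it is the largest absolute change in image-text cosine similarity once both sides pass through the transform; `full_space_drift` and the toy matrices are hypothetical.

```python
import numpy as np

def cosine_matrix(a, b):
    """Pairwise cosine similarities between rows of a and rows of b."""
    a = a / np.linalg.norm(a, axis=1, keepdims=True)
    b = b / np.linalg.norm(b, axis=1, keepdims=True)
    return a @ b.T

def full_space_drift(img, txt, T):
    """Largest absolute change in cosine similarity after applying T (assumed metric)."""
    before = cosine_matrix(img, txt)
    after = cosine_matrix(img @ T, txt @ T)
    return float(np.abs(before - after).max())

rng = np.random.default_rng(0)
d = 32
img = rng.normal(size=(10, d))
txt = rng.normal(size=(10, d))

Q, _ = np.linalg.qr(rng.normal(size=(d, d)))     # exactly orthogonal transform
noisy = Q + 1e-2 * rng.normal(size=(d, d))       # deliberately non-orthogonal

drift_orthogonal = full_space_drift(img, txt, Q)
drift_noisy = full_space_drift(img, txt, noisy)
```

An exactly orthogonal transform gives drift at floating-point level, while the perturbed matrix is expected to exceed the 10^{-5} threshold, which is the kind of failure the test above is meant to catch.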

Figures

Figures reproduced from arXiv: 2605.17727 by Chengchang Pan, Honggang Qi, Zesheng Li.

**Figure 1.** Sample case. Longer prefixes progressively expose attribute, relation/action, and full-caption distinctions.

**Figure 2.** GraSP-VL overview. A shared near-isometric transform reorders frozen image/text embeddings into a …

**Figure 3.** Effect of the shared orthogonal rotation: prefix selectivity moves toward the intended semantic ladder …

**Figure 4.** GraSP-VL separates semantic selectivity from …

**Figure 5.** Transform-structure analysis. The learned Cayley transform is dense and near-isometric rather than a hard …

Original abstract

Frozen vision-language embeddings contain signals at multiple semantic resolutions, from object identity to attributes, relations, and full-caption meaning, but they expose these signals through a fixed-length vector interface. We study whether embedding length can be turned into a controllable semantic access interface. We propose GraSP-VL, which learns a shared near-orthogonal prefix transform over frozen VLM embeddings. GraSP-VL instantiates a Semantic Matryoshka interface: short prefixes are assigned coarse semantic roles, while longer prefixes progressively expose finer language-grounded distinctions. Because the transform is shared across image and text embeddings and preserves full-dimensional geometry, prefix behavior changes without rewriting the original VLM space. On a 20,147-example COCO/Flickr30K annotation pool, GraSP-VL reaches a staircase score of 53.01 and hard-negative selectivity of 89.76, while keeping full-space drift below 10^{-6}. It also transfers to SugarCrepe-clean with 86.03 object accuracy and 11.96 mean external emergence, and preserves full-dimensional zero-shot CIFAR-100 accuracy. These results show that frozen VLM embeddings can be reorganized into a truncatable semantic prefix interface rather than merely compressed.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

GraSP-VL shows a shared prefix transform can reorganize frozen VLM embeddings so length acts as a truncatable semantic interface, with decent benchmark numbers but indirect evidence for true coarse-to-fine role assignment.

The one or two things to know about this paper: it proposes turning the length of vision-language embeddings into a controllable interface for semantic granularity using a learned shared prefix transform, and its empirical results look promising on standard benchmarks while preserving the original embedding space.

What is actually new is the specific setup: a Semantic Matryoshka interface in which short prefixes get coarse roles and longer ones get finer language-grounded distinctions, applied to frozen VLMs through a shared near-orthogonal transform. This goes beyond simple compression or prior Matryoshka work by emphasizing truncatable semantic access without rewriting the VLM.

The paper does well in showing concrete gains: a staircase score of 53.01 and hard-negative selectivity of 89.76 on its 20,147-example pool from COCO and Flickr30K. Full-space drift below 10^{-6} is good evidence that the transform does not distort the geometry. The method also transfers reasonably to SugarCrepe-clean, with 86.03 object accuracy and 11.96 mean external emergence, and it keeps full-dimensional zero-shot performance on CIFAR-100. These results suggest the approach is practical for existing models.

The soft spot is how strongly the results support the progressive coarse-to-fine role assignment. The length-dependent improvements are clear, but they do not isolate whether short prefixes specifically encode object identity while longer ones add attributes and relations; the same trend could come from a general increase in information capacity as length grows. The stress-test concern holds some weight here because the metrics are aggregate and include no targeted tests of semantic content at different lengths. If the full paper has more qualitative analysis or probing experiments, that would help, but based on what is described, the central claim relies on indirect evidence from benchmark scores.

This paper is for people in the multimodal representation learning community who want practical ways to adapt embedding interfaces for efficiency or adaptive querying. A reader working on VLMs or representation compression would get value from the method and the reported numbers. I would recommend sending it for peer review. The work has enough novelty and empirical support to merit a full review, though the authors may need to address the verification of semantic granularity in revisions.

Referee Report

2 major / 3 minor

Summary. The manuscript proposes GraSP-VL, a method that learns a single shared near-orthogonal prefix transform over frozen vision-language model embeddings to instantiate a Semantic Matryoshka interface. In this interface, shorter prefixes are claimed to capture coarse semantic roles (e.g., object identity) while progressively longer prefixes expose finer language-grounded distinctions (attributes, relations, caption-level meaning), all without modifying the original VLM parameter space or geometry. Empirical support is provided via a staircase score of 53.01 and hard-negative selectivity of 89.76 on a 20,147-example COCO/Flickr30K pool, plus transfer results on SugarCrepe-clean (86.03 object accuracy) and preserved zero-shot CIFAR-100 accuracy, with full-space drift below 10^{-6}.

Significance. If the central claim is substantiated, the work demonstrates that embedding length can serve as a controllable semantic granularity interface for frozen VLMs rather than a simple compression mechanism. The shared transform design and explicit geometry preservation (drift < 10^{-6}) are strengths that maintain compatibility with existing models. The reported transfer performance and length-dependent gains on aggregate metrics suggest practical utility for variable-resolution multimodal tasks, though the absence of direct probes for role assignment limits immediate impact assessment.

major comments (2)

[Abstract and §4 (Evaluation)] The staircase score of 53.01 and hard-negative selectivity of 89.76 demonstrate monotonic performance gains with prefix length, but these aggregate metrics are consistent with any increase in usable information capacity and do not isolate whether short prefixes specifically encode object identity while longer prefixes add attributes or relations, as required by the Semantic Matryoshka claim. Direct evidence such as per-length probing for semantic categories or controlled ablations on role alignment is needed to support the progressive granularity assignment over generic scaling.
[§3 (Method)] The shared near-orthogonal prefix transform is described as assigning coarse-to-fine roles via length, yet the manuscript provides no derivation showing that the transform parameters or loss enforce semantic granularity alignment with prefix length rather than simply increasing representational capacity. Without this link (e.g., via an explicit objective term or analysis of the learned basis), the central claim that length becomes a semantic interface rests on unverified assumptions about the transform's effect.

minor comments (3)

[Abstract] Report error bars or variance across runs for the staircase score, hard-negative selectivity, and drift metric to allow assessment of result stability.
[§4] Clarify the exact definition and computation of the 'staircase score' and 'mean external emergence' with reference to equations or pseudocode, as these appear central to the claims but are not derived in the provided summary.
[Figure captions and §5] Ensure all figures include axis labels, legend details, and statistical significance markers where length-dependent comparisons are shown.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed and constructive comments. We address each major point below with clarifications and indicate where we will strengthen the manuscript through revisions.

Referee: [Abstract and §4 (Evaluation)] The staircase score of 53.01 and hard-negative selectivity of 89.76 demonstrate monotonic performance gains with prefix length, but these aggregate metrics are consistent with any increase in usable information capacity and do not isolate whether short prefixes specifically encode object identity while longer prefixes add attributes or relations, as required by the Semantic Matryoshka claim. Direct evidence such as per-length probing for semantic categories or controlled ablations on role alignment is needed to support the progressive granularity assignment over generic scaling.

Authors: We agree that the reported aggregate metrics demonstrate length-dependent gains but do not directly isolate specific semantic role assignments such as object identity at short prefixes versus attributes or relations at longer ones. While the hard-negative selectivity results are consistent with progressive exposure of finer distinctions, this remains indirect support. In the revised manuscript we will add direct per-length probing experiments, including linear probes trained on prefixes of varying lengths to classify object categories versus attributes/relations on a controlled subset of the data, to provide more targeted evidence for the granularity claim.

Revision: yes

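The per-length probing proposed in this response might look like the sketch below, which fits a simple probe on prefixes of different lengths and compares accuracy on a coarse label against a fine label. Everything here is a toy assumption: the data is synthetic, with the coarse signal planted in the first coordinates and the fine signal planted later, and a nearest-centroid probe stands in for the linear probes the response mentions.

```python
import numpy as np

def centroid_probe_accuracy(x_train, y_train, x_test, y_test):
    """Nearest-class-centroid probe: fit centroids on train, score on test."""
    classes = np.unique(y_train)
    centroids = np.stack([x_train[y_train == c].mean(axis=0) for c in classes])
    dists = ((x_test[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=-1)
    pred = classes[dists.argmin(axis=1)]
    return float((pred == y_test).mean())

rng = np.random.default_rng(0)
n, d = 400, 32
coarse = rng.integers(0, 2, size=n)      # e.g. object category (toy label)
fine = rng.integers(0, 2, size=n)        # e.g. attribute (toy label)

# Synthetic embeddings: coarse signal in dims 0-3, fine signal in dims 16-19.
x = rng.normal(size=(n, d))
x[:, :4] += 3.0 * coarse[:, None]
x[:, 16:20] += 3.0 * fine[:, None]

train, test = slice(0, 200), slice(200, 400)
results = {}
for k in (8, 32):
    for name, y in (("coarse", coarse), ("fine", fine)):
        results[(name, k)] = centroid_probe_accuracy(
            x[train, :k], y[train], x[test, :k], y[test]
        )
```

In this toy setup the coarse label is recoverable from the 8-dimensional prefix while the fine label only becomes recoverable at full length. A pattern like that on real embeddings would be direct evidence for role assignment; uniform gains on both labels would point to generic capacity scaling instead.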
Referee: [§3 (Method)] The shared near-orthogonal prefix transform is described as assigning coarse-to-fine roles via length, yet the manuscript provides no derivation showing that the transform parameters or loss enforce semantic granularity alignment with prefix length rather than simply increasing representational capacity. Without this link (e.g., via an explicit objective term or analysis of the learned basis), the central claim that length becomes a semantic interface rests on unverified assumptions about the transform's effect.

Authors: The shared near-orthogonal transform is trained with a multi-length contrastive objective that encourages prefixes to support retrieval at different granularities, and the near-orthogonality constraint is intended to distribute information across lengths rather than simply expand capacity. However, we acknowledge that the current draft lacks an explicit derivation linking these design choices to semantic granularity. We will revise §3 to include a concise analysis of how the objective and orthogonality promote length-based role assignment, supported by additional inspection of the learned basis vectors.

Revision: yes
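
A multi-length contrastive objective of the kind this response describes could be sketched as a symmetric InfoNCE loss averaged over several prefix lengths. The page does not state the actual loss, so the form below, the temperature value, the prefix lengths, and the names `info_nce` and `multi_length_loss` are all assumptions for illustration.

```python
import numpy as np

def normalize(x):
    return x / np.linalg.norm(x, axis=1, keepdims=True)

def log_softmax(logits, axis):
    shifted = logits - logits.max(axis=axis, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=axis, keepdims=True))

def info_nce(img, txt, temperature=0.07):
    """Symmetric contrastive loss; row i of img is paired with row i of txt."""
    logits = normalize(img) @ normalize(txt).T / temperature
    diag = np.arange(len(img))
    loss_i2t = -log_softmax(logits, axis=1)[diag, diag].mean()
    loss_t2i = -log_softmax(logits, axis=0)[diag, diag].mean()
    return 0.5 * (loss_i2t + loss_t2i)

def multi_length_loss(img, txt, Q, lengths):
    """Average the contrastive loss over prefixes of the transformed embeddings."""
    zi, zt = img @ Q, txt @ Q
    return float(np.mean([info_nce(zi[:, :k], zt[:, :k]) for k in lengths]))

rng = np.random.default_rng(0)
n, d = 32, 64
img = rng.normal(size=(n, d))
txt = img + 0.1 * rng.normal(size=(n, d))        # toy matched pairs
Q, _ = np.linalg.qr(rng.normal(size=(d, d)))     # stand-in orthogonal transform
lengths = (8, 16, 32, 64)

loss_matched = multi_length_loss(img, txt, Q, lengths)
loss_shuffled = multi_length_loss(img, np.roll(txt, 1, axis=0), Q, lengths)
```

Note that a plain average over lengths, as written here, rewards every prefix for the same pairing task. That is exactly the referee's concern: without per-length targets of differing granularity, nothing in such a loss forces short prefixes to carry coarse semantics specifically.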

Circularity Check

0 steps flagged

No significant circularity; results rest on external benchmarks

The paper learns a shared near-orthogonal prefix transform on frozen VLM embeddings and evaluates the resulting Semantic Matryoshka interface via empirical metrics (staircase score 53.01, hard-negative selectivity 89.76, transfer accuracies on SugarCrepe and CIFAR-100). These outcomes are measured on held-out annotation pools and zero-shot tasks rather than derived from the transform parameters by algebraic identity or self-referential fitting. No self-citations, uniqueness theorems, or ansatzes are invoked as load-bearing premises in the abstract or described derivation; the central claim of reorganizing embeddings into a truncatable granularity interface is supported by observable performance scaling, not tautological redefinition of inputs.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 1 invented entity

The central claim rests on one learned transform whose parameters are fitted to data and on the domain assumption that near-orthogonality plus sharing across modalities will produce the desired progressive semantic behavior without geometry drift.

free parameters (1)

shared prefix transform parameters
The near-orthogonal prefix transform is learned from data; its weights constitute free parameters that directly determine the reported staircase and selectivity scores.
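
Figure 5's caption refers to a learned Cayley transform, so one plausible way to hold these weights is as an unconstrained matrix that the Cayley map turns into an orthogonal one. The sketch below shows that map only; the function name `cayley`, the initialization scale, and the assumption that the paper uses this exact form are not confirmed by the page.

```python
import numpy as np

def cayley(W):
    """Map an unconstrained square matrix W to an orthogonal matrix."""
    A = W - W.T                              # skew-symmetric: A.T == -A
    I = np.eye(W.shape[0])
    return np.linalg.solve(I + A, I - A)     # (I + A)^{-1} (I - A)

rng = np.random.default_rng(0)
d = 16
W = 0.1 * rng.normal(size=(d, d))            # free parameters (toy values)
Q = cayley(W)

orth_error = np.abs(Q.T @ Q - np.eye(d)).max()
Q_identity = cayley(np.zeros((d, d)))        # zero parameters give the identity
n_free = d * (d - 1) // 2                    # independent entries of A
```

Under this parametrization only the skew-symmetric part of `W` matters, so a `d`-dimensional transform has `d(d-1)/2` effective free parameters, and initializing `W` at zero starts training from the unmodified embedding space.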

axioms (1)

Domain assumption: a single shared near-orthogonal prefix transform can assign progressively finer language-grounded distinctions to longer prefixes while preserving full-dimensional geometry.
Invoked in the abstract description of the Semantic Matryoshka interface and the claim that prefix behavior changes without rewriting the original VLM space.

invented entities (1)

Semantic Matryoshka interface (no independent evidence)
purpose: To provide a length-based controllable semantic access mechanism over frozen embeddings
New conceptual framing introduced by the paper to describe the truncatable prefix behavior.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost and Foundation patterns (recognition lattices, octave duality) reality_from_one_distinction and ladder constant derivations echoes

ECHOES: this paper passage has the same mathematical shape or conceptual pattern as the Recognition theorem, but is not a direct formal dependency.

GraSP-VL instantiates a Semantic Matryoshka interface: short prefixes are assigned coarse semantic roles, while longer prefixes progressively expose finer language-grounded distinctions... prefix behavior changes without rewriting the original VLM space.

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

[1] [1]

Proceedings of the 38th International Conference on Machine Learning , pages =

Learning Transferable Visual Models From Natural Language Supervision , author =. Proceedings of the 38th International Conference on Machine Learning , pages =. 2021 , volume =

work page 2021

[2] [2]

Proceedings of the 38th International Conference on Machine Learning , pages =

Scaling Up Visual and Vision-Language Representation Learning With Noisy Text Supervision , author =. Proceedings of the 38th International Conference on Machine Learning , pages =. 2021 , volume =

work page 2021

[3] [3]

Microsoft COCO: common objects in context,

Lin, Tsung-Yi and Maire, Michael and Belongie, Serge and Hays, James and Perona, Pietro and Ramanan, Deva and Doll. Microsoft. Computer Vision -- ECCV 2014 , pages =. 2014 , volume =. doi:10.1007/978-3-319-10602-1_48 , url =

work page doi:10.1007/978-3-319-10602-1_48 2014

[4] [4]

Transactions of the Association for Computational Linguistics , volume =

From Image Descriptions to Visual Denotations: New Similarity Metrics for Semantic Inference over Event Descriptions , author =. Transactions of the Association for Computational Linguistics , volume =. 2014 , doi =

work page 2014

[5] [5]

2022 , url =

Zhai, Xiaohua and Wang, Xiao and Mustafa, Basil and Steiner, Andreas and Keysers, Daniel and Kolesnikov, Alexander and Beyer, Lucas , booktitle =. 2022 , url =

work page 2022

[6] [6]

Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition , pages =

Reproducible Scaling Laws for Contrastive Language-Image Learning , author =. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition , pages =. 2023 , url =

work page 2023

[7] [7]

Advances in Neural Information Processing Systems , volume =

Matryoshka Representation Learning , author =. Advances in Neural Information Processing Systems , volume =. 2022 , url =

work page 2022

[8] [8]

Matryoshka-Adaptor: Unsupervised and Supervised Tuning for Smaller Embedding Dimensions

Matryoshka-Adaptor: Unsupervised and Supervised Tuning for Smaller Embedding Dimensions , author =. Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing , pages =. 2024 , address =. doi:10.18653/v1/2024.emnlp-main.576 , url =

work page doi:10.18653/v1/2024.emnlp-main.576 2024

[9] Zhang, Biao and Chen, Lixin and Liu, Tong and Zheng, Bo. SMEC: Rethinking Matryoshka Representation Learning for Retrieval Embedding Compression. 2025. doi:10.18653/v1/2025.emnlp-main.1332

[10] Rege, Aniket. 2023.

[11] Learning to Prompt for Vision-Language Models. International Journal of Computer Vision, 2022.

[12] Conditional Prompt Learning for Vision-Language Models. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022.

[13] Khattak, Muhammad Uzair and Rasheed, Hanoona and Maaz, Muhammad and Khan, Salman and Khan, Fahad Shahbaz. 2023.

[14] Gao, Peng and Geng, Shijie and Zhang, Renrui and Ma, Teli and Fang, Rongyao and Zhang, Yongfeng and Li, Hongsheng and Qiao, Yu. 2021.

[15] Zhang, Renrui and Zhang, Wei and Fang, Rongyao and Gao, Peng and Li, Kunchang and Dai, Jifeng and Qiao, Yu and Li, Hongsheng. Tip-Adapter: Training-Free Adaption of. Computer Vision – ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXXV, 2022. doi:10.1007/978-3-031-19833-5_29

[16] Houlsby, Neil and Giurgiu, Andrei and Jastrzebski, Stanislaw and Morrone, Bruna and de Laroussilhe, Quentin and Gesmundo, Andrea and Attariyan, Mona and Gelly, Sylvain. Parameter-Efficient Transfer Learning for. 2019.

[17] Hu, Edward J. and Shen, Yelong and Wallis, Phillip and Allen-Zhu, Zeyuan and Li, Yuanzhi and Wang, Shean and Wang, Lu and Chen, Weizhu. 2022.

[18] When and Why Vision-Language Models Behave like Bags-of-Words, and What to Do About It? International Conference on Learning Representations.

[19] Oh, Youngtaek and Cho, Jae Won and Kim, Dong-Jin and Kweon, In So and Kim, Junmo. Preserving Multi-Modal Capabilities of Pre-trained. 2024. doi:10.18653/v1/2024.emnlp-main.1062

[20] Interpretable Composition Attribution Enhancement for Visio-linguistic Compositional Understanding. Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, 2024. doi:10.18653/v1/2024.emnlp-main.810

[21] Patel, Maitreya and Kusumba, Abhiram and Cheng, Sheng and Kim, Changhoon and Gokhale, Tejas and Baral, Chitta and Yang, Yezhou. 2024.

[22] Visual Perturbation and Adaptive Hard Negative Contrastive Learning for Compositional Reasoning in Vision-Language Models. Proceedings of the Thirty-Fourth International Joint Conference on Artificial Intelligence, 2025.

[23] Lo, Kin Ian and Hawashin, Hala and Abbaszadeh, Mina and Limb. Proceedings of the 14th Joint Conference on Lexical and Computational Semantics, 2025. doi:10.18653/v1/2025.starsem-1.25

[24] Xie, Chunyu and Wang, Bin and Kong, Fanjing and Li, Jincheng and Liang, Dawei and Zhang, Gengshen and Leng, Dawei and Yin, Yuhui. 2025.

[25] Sun, Quan and Fang, Yuxin and Wu, Ledell and Wang, Xinlong and Cao, Yue. 2023.

[26] Sigmoid Loss for Language Image Pre-Training. Proceedings of the IEEE/CVF International Conference on Computer Vision, 2023.

[27] Asokan, Mothilal and Wu, Kebin and Albreiki, Fatima. 2025.

[28] Faghri, Fartash and Fleet, David J. and Kiros, Jamie Ryan and Fidler, Sanja. 2018.

[29] Stacked Cross Attention for Image-Text Matching. Proceedings of the European Conference on Computer Vision, 2018.

[30] Yao, Lewei and Huang, Runhui and Hou, Lu and Lu, Guansong and Niu, Minzhe and Xu, Hang and Liang, Xiaodan and Li, Zhenguo and Jiang, Xin and Xu, Chunjing. 2022.

[31] Miller, George A. 1995.

[32] Deng, Jia and Dong, Wei and Socher, Richard and Li, Li-Jia and Li, Kai and Fei-Fei, Li. 2009.

[33] Visual Genome: Connecting Language and Vision Using Crowdsourced Dense Image Annotations. International Journal of Computer Vision, 2017.

[34] Order-Embeddings of Images and Language. International Conference on Learning Representations.

[35] Nickel, Maximillian and Kiela, Douwe. Poincar. 2017.

[36] Hyperbolic Image-Text Representations. Proceedings of the 40th International Conference on Machine Learning, 2023.

[37] Li, Xianming and Li, Zongxi and Li, Jing and Xie, Haoran and Li, Qing. 2024.

[38] Khattab, Omar and Zaharia, Matei. 2020.

[39] Matryoshka Multimodal Models. arXiv preprint arXiv:2405.17430.

[40] Xiao, Zilin and Ma, Qi and Gu, Mengting and Chen, Chun-cheng Jason and Chen, Xintao and Ordonez, Vicente and Mohan, Vijai. 2025.

[41] Hsieh, Cheng-Yu and Zhang, Jieyu and Ma, Zixian and Kembhavi, Aniruddha and Krishna, Ranjay. 2023.

[42] Abbasi, Reza and Nazari, Ali and Sefid, Aminreza and Banayeeanzade, Mohammadali and Rohban, Mohammad Hossein and Soleymani Baghshah, Mahdieh. 2025.

[43] Wittenmayer, Kai and Rao, Sukrut and Parchami-Araghi, Amin and Schiele, Bernt and Fischer, Jonas. 2026.

[44] Contrasting Intra-Modal and Ranking Cross-Modal Hard Negatives to Enhance Visio-Linguistic Compositional Understanding. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024.

[45] Castro, Santiago and Ziai, Amir and Saluja, Avneesh and Yuan, Zhuoning and Mihalcea, Rada. 2024.

[46] Decoupled Global-Local Alignment for Improving Compositional Understanding. arXiv preprint arXiv:2504.16801.