GraSP-VL: Length as a Semantic Granularity Interface for Vision-Language Representations
Pith reviewed 2026-05-19 21:46 UTC · model grok-4.3
The pith
A learned prefix transform reorganizes frozen vision-language embeddings so that length directly controls semantic granularity from coarse to fine.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Frozen vision-language embeddings already encode signals at multiple semantic resolutions; a shared near-orthogonal prefix transform learned over these frozen vectors creates a Semantic Matryoshka interface in which short prefixes receive coarse semantic roles and successively longer prefixes expose finer language-grounded distinctions, all while preserving the original full-dimensional geometry and enabling truncation without retraining the underlying model.
What carries the argument
The Semantic Matryoshka interface produced by a single shared near-orthogonal prefix transform that progressively assigns coarse-to-fine semantic roles to increasing prefix lengths.
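The interface described above can be illustrated with a minimal sketch. It is not the paper's implementation: random vectors stand in for frozen VLM embeddings, and a Householder reflection (an exactly orthogonal matrix) stands in for the learned near-orthogonal transform. All names (`householder`, `apply_transform`, `prefix_similarity`) are hypothetical. The sketch only shows the two mechanical properties the claim relies on: full-length cosine similarity is unchanged by an orthogonal map, while truncated prefixes give different similarities.

```python
import math
import random


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def cosine(a, b):
    return dot(a, b) / math.sqrt(dot(a, a) * dot(b, b))


def householder(v):
    """Exactly orthogonal matrix H = I - 2 v v^T / (v^T v), as nested lists."""
    vv = dot(v, v)
    n = len(v)
    return [[(1.0 if i == j else 0.0) - 2.0 * v[i] * v[j] / vv
             for j in range(n)] for i in range(n)]


def apply_transform(matrix, x):
    """Matrix-vector product z = matrix @ x."""
    return [dot(row, x) for row in matrix]


rng = random.Random(0)
DIM = 8

# Assumption: random stand-ins for real frozen image/text embeddings, and a
# Householder reflection in place of the learned shared transform.
image_emb = [rng.gauss(0.0, 1.0) for _ in range(DIM)]
text_emb = [rng.gauss(0.0, 1.0) for _ in range(DIM)]
transform = householder([rng.gauss(0.0, 1.0) for _ in range(DIM)])

# The same transform is applied to both modalities.
z_image = apply_transform(transform, image_emb)
z_text = apply_transform(transform, text_emb)

full_before = cosine(image_emb, text_emb)
full_after = cosine(z_image, z_text)
full_drift = abs(full_before - full_after)

# Truncation: keep only the first k coordinates of the transformed vectors.
prefix_similarity = {k: cosine(z_image[:k], z_text[:k]) for k in (2, 4, DIM)}

print(f"full-space drift: {full_drift:.2e}")
for k, sim in prefix_similarity.items():
    print(f"prefix length {k}: cosine = {sim:+.4f}")
```

With a random orthogonal map the prefixes carry no semantic ordering; the paper's claim is that a learned transform can make that ordering coarse-to-fine.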
If this is right
- Embeddings become truncatable at inference time to trade semantic detail for speed or memory without retraining the VLM.
- The same transform applies uniformly to image and text sides, enabling consistent granularity control across modalities.
- Zero-shot performance on downstream tasks such as CIFAR-100 classification remains essentially unchanged after the reorganization.
- Hard-negative selectivity improves as prefix length grows, indicating that additional dimensions add discriminative power rather than noise.
Where Pith is reading between the lines
- Resource-constrained deployments could dynamically select prefix length based on available compute while still accessing the same underlying representation.
- The approach might extend to pure language models if a similar shared prefix transform can be learned over frozen text embeddings alone.
- Downstream tasks that require variable semantic depth, such as hierarchical retrieval or progressive captioning, could use prefix length as an explicit control knob.
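As a hedged illustration of using prefix length as a control knob in retrieval, the sketch below shortlists candidates with a short prefix and reranks the shortlist at full length. This funnel pattern is a common way to use truncatable embeddings and is not claimed to be part of GraSP-VL; the function names, the random gallery, and all sizes are assumptions chosen for the example.

```python
import math
import random


def cosine(a, b):
    num = sum(x * y for x, y in zip(a, b))
    den = math.sqrt(sum(x * x for x in a) * sum(y * y for y in b))
    return num / den if den > 0.0 else 0.0


def funnel_search(query, gallery, short_k, shortlist_size, top_n):
    """Shortlist with a short prefix, then rerank the shortlist at full length."""
    coarse = sorted(
        range(len(gallery)),
        key=lambda i: cosine(query[:short_k], gallery[i][:short_k]),
        reverse=True,
    )[:shortlist_size]
    fine = sorted(coarse, key=lambda i: cosine(query, gallery[i]), reverse=True)
    return fine[:top_n]


def exhaustive_search(query, gallery, top_n):
    """Reference ranking that scores every gallery item at full length."""
    return sorted(
        range(len(gallery)),
        key=lambda i: cosine(query, gallery[i]),
        reverse=True,
    )[:top_n]


rng = random.Random(1)
DIM, GALLERY_SIZE = 16, 50

# Assumption: random vectors stand in for transformed embeddings.
gallery = [[rng.gauss(0.0, 1.0) for _ in range(DIM)] for _ in range(GALLERY_SIZE)]
query = [rng.gauss(0.0, 1.0) for _ in range(DIM)]

funnel_hits = funnel_search(query, gallery, short_k=4, shortlist_size=10, top_n=3)
exact_hits = exhaustive_search(query, gallery, top_n=3)

print("funnel:", funnel_hits)
print("exact: ", exact_hits)
```

How often the funnel recovers the exhaustive result depends on how well short prefixes rank candidates, which is exactly the property the reorganized embeddings would need to provide.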
Load-bearing premise
A single learned transform can progressively map longer prefixes to finer semantic distinctions while leaving the full embedding geometry unchanged.
What would settle it
Measure whether accuracy on fine-grained attribute or relation distinctions plateaus or drops when prefix length is increased beyond the point where coarse object identity is already recovered, or check whether the drift in full-dimensional cosine similarity between original and transformed embeddings exceeds 10^{-5}, which would contradict the reported drift bound of 10^{-6}.
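The geometry half of this test can be made concrete with a short worked bound. The definition of drift below is an assumption (the source does not give one), and the symbols W for the transform and E for its orthogonality defect are introduced here only for illustration.

```latex
% Assumed definition, not taken from the paper: worst-case change in
% pairwise cosine similarity over image embeddings x_i and text embeddings y_j.
\[
\mathrm{drift}(W) \;=\; \max_{i,j}\,
  \bigl|\cos(W x_i,\, W y_j) - \cos(x_i,\, y_j)\bigr|
\]
% If W is exactly orthogonal, W^\top W = I, inner products and norms are
% preserved, so drift(W) = 0.
% If W is only near-orthogonal, write W^\top W = I + E. For unit vectors x, y:
\[
\bigl|\langle W x,\, W y\rangle - \langle x,\, y\rangle\bigr|
  \;=\; \bigl|x^{\top} E\, y\bigr|
  \;\le\; \lVert E \rVert_{2}
\]
```

Under these assumptions, a small spectral norm of E is sufficient for small changes in inner products between unit-norm embeddings, so a measured drift well above the defect would indicate that something other than the transform is altering the space.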
Original abstract
Frozen vision-language embeddings contain signals at multiple semantic resolutions, from object identity to attributes, relations, and full-caption meaning, but they expose these signals through a fixed-length vector interface. We study whether embedding length can be turned into a controllable semantic access interface. We propose GraSP-VL, which learns a shared near-orthogonal prefix transform over frozen VLM embeddings. GraSP-VL instantiates a Semantic Matryoshka interface: short prefixes are assigned coarse semantic roles, while longer prefixes progressively expose finer language-grounded distinctions. Because the transform is shared across image and text embeddings and preserves full-dimensional geometry, prefix behavior changes without rewriting the original VLM space. On a 20,147-example COCO/Flickr30K annotation pool, GraSP-VL reaches a staircase score of 53.01 and hard-negative selectivity of 89.76, while keeping full-space drift below 10^{-6}. It also transfers to SugarCrepe-clean with 86.03 object accuracy and 11.96 mean external emergence, and preserves full-dimensional zero-shot CIFAR-100 accuracy. These results show that frozen VLM embeddings can be reorganized into a truncatable semantic prefix interface rather than merely compressed.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript proposes GraSP-VL, a method that learns a single shared near-orthogonal prefix transform over frozen vision-language model embeddings to instantiate a Semantic Matryoshka interface. In this interface, shorter prefixes are claimed to capture coarse semantic roles (e.g., object identity) while progressively longer prefixes expose finer language-grounded distinctions (attributes, relations, caption-level meaning), all without modifying the original VLM parameter space or geometry. Empirical support is provided via a staircase score of 53.01 and hard-negative selectivity of 89.76 on a 20,147-example COCO/Flickr30K pool, plus transfer results on SugarCrepe-clean (86.03 object accuracy) and preserved zero-shot CIFAR-100 accuracy, with full-space drift below 10^{-6}.
Significance. If the central claim is substantiated, the work demonstrates that embedding length can serve as a controllable semantic granularity interface for frozen VLMs rather than a simple compression mechanism. The shared transform design and explicit geometry preservation (drift < 10^{-6}) are strengths that maintain compatibility with existing models. The reported transfer performance and length-dependent gains on aggregate metrics suggest practical utility for variable-resolution multimodal tasks, though the absence of direct probes for role assignment limits immediate impact assessment.
major comments (2)
- [Abstract and §4] Abstract and §4 (Evaluation): The staircase score of 53.01 and hard-negative selectivity of 89.76 demonstrate monotonic performance gains with prefix length, but these aggregate metrics are consistent with any increase in usable information capacity and do not isolate whether short prefixes specifically encode object identity while longer prefixes add attributes or relations, as required by the Semantic Matryoshka claim. Direct evidence such as per-length probing for semantic categories or controlled ablations on role alignment is needed to support the progressive granularity assignment over generic scaling.
- [§3] §3 (Method): The shared near-orthogonal prefix transform is described as assigning coarse-to-fine roles via length, yet the manuscript provides no derivation showing that the transform parameters or loss enforce semantic granularity alignment with prefix length rather than simply increasing representational capacity. Without this link (e.g., via an explicit objective term or analysis of the learned basis), the central claim that length becomes a semantic interface rests on unverified assumptions about the transform's effect.
minor comments (3)
- [Abstract] Abstract: Report error bars or variance across runs for the staircase score, hard-negative selectivity, and drift metric to allow assessment of result stability.
- [§4] §4: Clarify the exact definition and computation of the 'staircase score' and 'mean external emergence' with reference to equations or pseudocode, as these appear central to the claims but are not derived in the provided summary.
- [Figure captions and §5] Figure captions and §5: Ensure all figures include axis labels, legend details, and statistical significance markers where length-dependent comparisons are shown.
Simulated Author's Rebuttal
We thank the referee for the detailed and constructive comments. We address each major point below with clarifications and indicate where we will strengthen the manuscript through revisions.
Referee: [Abstract and §4] Abstract and §4 (Evaluation): The staircase score of 53.01 and hard-negative selectivity of 89.76 demonstrate monotonic performance gains with prefix length, but these aggregate metrics are consistent with any increase in usable information capacity and do not isolate whether short prefixes specifically encode object identity while longer prefixes add attributes or relations, as required by the Semantic Matryoshka claim. Direct evidence such as per-length probing for semantic categories or controlled ablations on role alignment is needed to support the progressive granularity assignment over generic scaling.
Authors: We agree that the reported aggregate metrics demonstrate length-dependent gains but do not directly isolate specific semantic role assignments such as object identity at short prefixes versus attributes or relations at longer ones. While the hard-negative selectivity results are consistent with progressive exposure of finer distinctions, this remains indirect support. In the revised manuscript we will add direct per-length probing experiments, including linear probes trained on prefixes of varying lengths to classify object categories versus attributes/relations on a controlled subset of the data, to provide more targeted evidence for the granularity claim.
Revision: yes
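The per-length probing experiment proposed in this response could look roughly like the sketch below. It is a toy illustration, not the authors' protocol: the data is synthetic, with a coarse label planted in dimension 0 and a fine label planted in dimension 3, and a nearest-centroid classifier is swapped in for the linear probes mentioned above to keep the example dependency-free. All names and sizes are hypothetical.

```python
import random


def make_dataset(n, rng, dim=6, noise=0.1):
    """Toy data: the coarse label lives in dimension 0, the fine label in dimension 3."""
    rows = []
    for i in range(n):
        coarse, fine = i % 2, (i // 2) % 2
        x = [rng.gauss(0.0, noise) for _ in range(dim)]
        x[0] += 1.0 if coarse else -1.0
        x[3] += 1.0 if fine else -1.0
        rows.append((x, coarse, fine))
    return rows


def centroid(vectors):
    return [sum(col) / len(vectors) for col in zip(*vectors)]


def probe_accuracy(train, test, k, label_index):
    """Nearest-centroid probe fitted and evaluated on the first k dimensions."""
    labels = sorted({row[label_index] for row in train})
    centroids = {
        label: centroid([row[0][:k] for row in train if row[label_index] == label])
        for label in labels
    }
    correct = 0
    for row in test:
        x = row[0][:k]
        predicted = min(
            labels,
            key=lambda label: sum((a - b) ** 2 for a, b in zip(x, centroids[label])),
        )
        correct += predicted == row[label_index]
    return correct / len(test)


rng = random.Random(0)
train, test = make_dataset(200, rng), make_dataset(200, rng)

PREFIX_LENGTHS = (2, 4, 6)
coarse_acc = {k: probe_accuracy(train, test, k, label_index=1) for k in PREFIX_LENGTHS}
fine_acc = {k: probe_accuracy(train, test, k, label_index=2) for k in PREFIX_LENGTHS}

for k in PREFIX_LENGTHS:
    print(f"prefix {k}: coarse acc = {coarse_acc[k]:.2f}, fine acc = {fine_acc[k]:.2f}")
```

The pattern that would support the granularity claim is the one planted here: the coarse label is decodable from the shortest prefix, while the fine label only becomes decodable once the prefix is long enough. Generic capacity scaling would instead raise both curves together.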
Referee: [§3] §3 (Method): The shared near-orthogonal prefix transform is described as assigning coarse-to-fine roles via length, yet the manuscript provides no derivation showing that the transform parameters or loss enforce semantic granularity alignment with prefix length rather than simply increasing representational capacity. Without this link (e.g., via an explicit objective term or analysis of the learned basis), the central claim that length becomes a semantic interface rests on unverified assumptions about the transform's effect.
Authors: The shared near-orthogonal transform is trained with a multi-length contrastive objective that encourages prefixes to support retrieval at different granularities, and the near-orthogonality constraint is intended to distribute information across lengths rather than simply expand capacity. However, we acknowledge that the current draft lacks an explicit derivation linking these design choices to semantic granularity. We will revise §3 to include a concise analysis of how the objective and orthogonality promote length-based role assignment, supported by additional inspection of the learned basis vectors.
Revision: yes
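For readers unfamiliar with this kind of objective, the sketch below shows one plausible form: an InfoNCE loss averaged over several prefix lengths plus a penalty on the deviation of the transform from orthogonality. This is an assumption about the general shape only. The actual loss, weighting, temperature, and constraint used in GraSP-VL are not given in the source, and every name here (`info_nce`, `orthogonality_penalty`, `multi_length_loss`, `lam`) is hypothetical.

```python
import math
import random


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def cosine(a, b):
    return dot(a, b) / math.sqrt(dot(a, a) * dot(b, b))


def apply_transform(W, x):
    return [dot(row, x) for row in W]


def info_nce(images, texts, temperature=0.07):
    """Image-to-text InfoNCE over one batch; matched pairs share an index."""
    total = 0.0
    for i, image in enumerate(images):
        logits = [cosine(image, text) / temperature for text in texts]
        peak = max(logits)  # subtract the max for a stable log-sum-exp
        log_partition = peak + math.log(sum(math.exp(l - peak) for l in logits))
        total += log_partition - logits[i]
    return total / len(images)


def orthogonality_penalty(W):
    """Squared Frobenius norm of W^T W - I."""
    n = len(W[0])
    penalty = 0.0
    for a in range(n):
        for b in range(n):
            gram = sum(row[a] * row[b] for row in W)
            penalty += (gram - (1.0 if a == b else 0.0)) ** 2
    return penalty


def multi_length_loss(W, images, texts, lengths, lam=1.0):
    """Mean prefix-wise InfoNCE plus lam times the orthogonality penalty."""
    z_images = [apply_transform(W, x) for x in images]
    z_texts = [apply_transform(W, y) for y in texts]
    contrastive = sum(
        info_nce([z[:k] for z in z_images], [z[:k] for z in z_texts])
        for k in lengths
    ) / len(lengths)
    return contrastive + lam * orthogonality_penalty(W)


rng = random.Random(0)
DIM, BATCH = 4, 6

# Assumption: synthetic paired data, where each text is a noisy copy of its image.
images = [[rng.gauss(0.0, 1.0) for _ in range(DIM)] for _ in range(BATCH)]
texts = [[x + rng.gauss(0.0, 0.1) for x in image] for image in images]

identity = [[1.0 if i == j else 0.0 for j in range(DIM)] for i in range(DIM)]
doubled = [[2.0 * value for value in row] for row in identity]

loss_identity = multi_length_loss(identity, images, texts, lengths=(2, 4))
loss_doubled = multi_length_loss(doubled, images, texts, lengths=(2, 4))

print(f"loss with W = I:  {loss_identity:.4f}")
print(f"loss with W = 2I: {loss_doubled:.4f}")
```

Because cosine similarity ignores scale, the two transforms give the same contrastive term and differ only by the penalty. That also illustrates the referee's concern: nothing in a loss of this shape, by itself, dictates which semantic distinctions land in which prefix.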
Circularity Check
No significant circularity; results rest on external benchmarks
The paper learns a shared near-orthogonal prefix transform on frozen VLM embeddings and evaluates the resulting Semantic Matryoshka interface via empirical metrics (staircase score 53.01, hard-negative selectivity 89.76, transfer accuracies on SugarCrepe and CIFAR-100). These outcomes are measured on held-out annotation pools and zero-shot tasks rather than derived from the transform parameters by algebraic identity or self-referential fitting. No self-citations, uniqueness theorems, or ansatzes are invoked as load-bearing premises in the abstract or described derivation; the central claim of reorganizing embeddings into a truncatable granularity interface is supported by observable performance scaling, not tautological redefinition of inputs.
Axiom & Free-Parameter Ledger
free parameters (1)
- shared prefix transform parameters
axioms (1)
- domain assumption: A single shared near-orthogonal prefix transform can assign progressively finer language-grounded distinctions to longer prefixes while preserving full-dimensional geometry.
invented entities (1)
- Semantic Matryoshka interface: no independent evidence
Lean theorems connected to this paper
- IndisputableMonolith/Cost and Foundation patterns (recognition lattices, octave duality): reality_from_one_distinction and ladder constant derivations (tag: echoes)
  ECHOES: this paper passage has the same mathematical shape or conceptual pattern as the Recognition theorem, but is not a direct formal dependency.
  Passage: "GraSP-VL instantiates a Semantic Matryoshka interface: short prefixes are assigned coarse semantic roles, while longer prefixes progressively expose finer language-grounded distinctions... prefix behavior changes without rewriting the original VLM space."
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.