arxiv: 2605.17727 · v1 · pith:E2J37SDC · submitted 2026-05-18 · 💻 cs.CV

GraSP-VL: Length as a Semantic Granularity Interface for Vision-Language Representations

Pith reviewed 2026-05-19 21:46 UTC · model grok-4.3

classification 💻 cs.CV
keywords vision-language models · semantic granularity · prefix embeddings · matryoshka representations · frozen embeddings · truncatable interfaces · embedding reorganization · multi-resolution semantics

The pith

A learned prefix transform reorganizes frozen vision-language embeddings so that length directly controls semantic granularity from coarse to fine.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests whether the length of an embedding vector can serve as a controllable dial for accessing different levels of semantic detail in frozen vision-language models. It introduces a single shared transform that produces a Semantic Matryoshka structure: shorter prefixes handle basic object identity while longer prefixes add attributes, relations, and caption-level meaning. Because the transform is near-orthogonal and applied without altering the original embeddings, the full-dimensional space stays intact and the same interface works for both images and text. Experiments on COCO, Flickr30K, SugarCrepe, and CIFAR-100 show that progressive prefix lengths improve hard-negative discrimination and zero-shot accuracy while keeping geometric drift negligible. This approach reframes embedding compression as a semantic access mechanism rather than a lossy reduction.

Core claim

Frozen vision-language embeddings already encode signals at multiple semantic resolutions; a shared near-orthogonal prefix transform learned over these frozen vectors creates a Semantic Matryoshka interface in which short prefixes receive coarse semantic roles and successively longer prefixes expose finer language-grounded distinctions, all while preserving the original full-dimensional geometry and enabling truncation without retraining the underlying model.

What carries the argument

The Semantic Matryoshka interface produced by a single shared near-orthogonal prefix transform that progressively assigns coarse-to-fine semantic roles to increasing prefix lengths.
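The geometric half of this claim can be illustrated with a toy example. The sketch below is not the paper's transform: it assumes a hand-picked chain of Givens rotations (an exactly orthogonal map) and hypothetical 4-dimensional vectors standing in for frozen image and text embeddings. It shows that a shared orthogonal map leaves full-dimensional cosine similarity unchanged while changing what a short prefix sees.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def cosine(u, v):
    return dot(u, v) / math.sqrt(dot(u, u) * dot(v, v))

def givens_rotate(x, i, j, theta):
    """Rotate coordinates i and j of x by theta; this is an orthogonal map."""
    y = list(x)
    c, s = math.cos(theta), math.sin(theta)
    y[i] = c * x[i] - s * x[j]
    y[j] = s * x[i] + c * x[j]
    return y

def transform(x, rotations):
    """Apply the same chain of rotations to any embedding (image or text)."""
    for i, j, theta in rotations:
        x = givens_rotate(x, i, j, theta)
    return x

# Hypothetical stand-ins for frozen embeddings and for a learned transform.
image_vec = [0.2, -0.1, 0.7, 0.4]
text_vec = [0.1, 0.3, 0.6, 0.5]
rotations = [(0, 2, 0.9), (1, 3, -0.4), (0, 1, 0.3)]

z_img = transform(image_vec, rotations)
z_txt = transform(text_vec, rotations)

full_before = cosine(image_vec, text_vec)
full_after = cosine(z_img, z_txt)            # equal up to float rounding
prefix_before = cosine(image_vec[:2], text_vec[:2])
prefix_after = cosine(z_img[:2], z_txt[:2])  # differs: the prefix was reorganized

print(f"full-space change: {abs(full_after - full_before):.2e}")
print(f"2-dim prefix similarity: {prefix_before:.3f} -> {prefix_after:.3f}")
```

Whether a learned transform places coarse semantics in the short prefix is a separate empirical question that this toy does not address.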

If this is right

  • Embeddings become truncatable at inference time to trade semantic detail for speed or memory without retraining the VLM.
  • The same transform applies uniformly to image and text sides, enabling consistent granularity control across modalities.
  • Zero-shot performance on downstream tasks such as CIFAR-100 classification remains essentially unchanged after the reorganization.
  • Hard-negative selectivity improves as prefix length grows, indicating that additional dimensions add discriminative power rather than noise.
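A minimal sketch of what prefix truncation and hard-negative selectivity could look like at inference time. The vectors are hand-built for illustration, with dimensions 0 and 1 assumed to carry object identity and dimensions 2 and 3 assumed to carry an attribute; `prefix_cosine` and `hard_negative_margin` are hypothetical helper names, not the paper's metrics.

```python
import math

def prefix_cosine(u, v, k):
    """Cosine similarity computed on the first k dimensions only."""
    u, v = u[:k], v[:k]
    num = sum(a * b for a, b in zip(u, v))
    den = math.sqrt(sum(a * a for a in u) * sum(b * b for b in v))
    return num / den

def hard_negative_margin(query, positive, negative, k):
    """How much better the true caption scores than the hard negative."""
    return prefix_cosine(query, positive, k) - prefix_cosine(query, negative, k)

# Toy embeddings (assumed layout: [object, object, attribute, attribute]).
image_red_dog = [1.0, 0.0, 1.0, 0.0]
text_red_dog = [1.0, 0.0, 1.0, 0.0]    # matching caption
text_blue_dog = [1.0, 0.0, 0.0, 1.0]   # hard negative: same object, wrong attribute

margin_short = hard_negative_margin(image_red_dog, text_red_dog, text_blue_dog, k=2)
margin_full = hard_negative_margin(image_red_dog, text_red_dog, text_blue_dog, k=4)

print(margin_short)  # 0.0: the short prefix cannot separate the two captions
print(margin_full)   # 0.5: the longer prefix exposes the attribute difference
```

In this toy the short prefix is enough for object-level matching, and only the longer prefix separates the hard negative, which is the behavior the bullets above attribute to the reorganized embeddings.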

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Resource-constrained deployments could dynamically select prefix length based on available compute while still accessing the same underlying representation.
  • The approach might extend to pure language models if a similar shared prefix transform can be learned over frozen text embeddings alone.
  • Downstream tasks that require variable semantic depth, such as hierarchical retrieval or progressive captioning, could use prefix length as an explicit control knob.

Load-bearing premise

A single learned transform can progressively map longer prefixes to finer semantic distinctions while leaving the full embedding geometry unchanged.

What would settle it

Measure whether accuracy on fine-grained attribute or relation distinctions plateaus or drops when prefix length is increased beyond the point where coarse object identity is already recovered, or check whether full-space drift (the change in full-dimensional cosine similarity between original and transformed embeddings) exceeds 10^{-5}, an order of magnitude above the reported 10^{-6} bound.
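The geometric check could be scripted roughly as follows. This is a sketch under a stated assumption: drift is taken here to be the largest absolute change in pairwise cosine similarity after applying the transform, which may differ from the paper's exact definition. The matrices and vectors are hypothetical.

```python
import math
from itertools import combinations

def matvec(matrix, x):
    return [sum(a * b for a, b in zip(row, x)) for row in matrix]

def cosine(u, v):
    num = sum(a * b for a, b in zip(u, v))
    return num / math.sqrt(sum(a * a for a in u) * sum(b * b for b in v))

def max_cosine_drift(matrix, vectors):
    """Assumed drift metric: max |cos(Mx, My) - cos(x, y)| over all pairs."""
    moved = [matvec(matrix, v) for v in vectors]
    return max(
        abs(cosine(moved[i], moved[j]) - cosine(vectors[i], vectors[j]))
        for i, j in combinations(range(len(vectors)), 2)
    )

c, s = math.cos(0.7), math.sin(0.7)
rotation = [[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]]   # exactly orthogonal
sheared = [[c, -s, 0.0], [s, c, 0.05], [0.0, 0.0, 1.0]]   # slightly non-orthogonal
vectors = [[0.3, 0.5, 0.2], [0.9, -0.1, 0.4], [-0.2, 0.6, 0.7]]

drift_rotation = max_cosine_drift(rotation, vectors)
drift_sheared = max_cosine_drift(sheared, vectors)

print(f"orthogonal map drift: {drift_rotation:.2e}")
print(f"sheared map drift:    {drift_sheared:.2e}")
```

An exactly orthogonal map should sit at floating-point noise, while even a small departure from orthogonality produces drift several orders of magnitude above the threshold discussed above.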

Figures

Figures reproduced from arXiv: 2605.17727 by Chengchang Pan, Honggang Qi, Zesheng Li.

Figure 1: Sample case. Longer prefixes progressively expose attribute, relation/action, and full-caption distinctions.
Figure 2: GraSP-VL overview. A shared near-isometric transform reorders frozen image/text embeddings into a …
Figure 3: Effect of the shared orthogonal rotation: prefix selectivity moves toward the intended semantic ladder …
Figure 4: GraSP-VL separates semantic selectivity from …
Figure 5: Transform-structure analysis. The learned Cayley transform is dense and near-isometric rather than a hard …
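The Figure 5 caption refers to a Cayley transform. As background, and without assuming anything about the paper's exact parameterization, the standard Cayley map turns any real skew-symmetric matrix A into an orthogonal matrix Q, which is one common way to keep a learned transform isometric:

```latex
% Standard Cayley map (textbook form; the paper's variant may differ).
% A is real and skew-symmetric, so I + A is always invertible.
\[
  Q = (I - A)(I + A)^{-1}, \qquad A^{\top} = -A .
\]
% Orthogonality: (I + A)^{\top} = I - A and (I - A)^{\top} = I + A, hence
\[
  Q^{\top} Q
  = (I - A)^{-1}(I + A)\,(I - A)(I + A)^{-1}
  = (I - A)^{-1}(I - A)\,(I + A)(I + A)^{-1}
  = I ,
\]
% where the middle step uses that (I + A) and (I - A) commute.
% Norms and inner products are therefore preserved:
\[
  \langle Qx, Qy \rangle = x^{\top} Q^{\top} Q\, y = \langle x, y \rangle .
\]
```

This is why a transform of this family can reorder coordinates freely while full-dimensional similarities stay fixed.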
Original abstract

Frozen vision-language embeddings contain signals at multiple semantic resolutions, from object identity to attributes, relations, and full-caption meaning, but they expose these signals through a fixed-length vector interface. We study whether embedding length can be turned into a controllable semantic access interface. We propose GraSP-VL, which learns a shared near-orthogonal prefix transform over frozen VLM embeddings. GraSP-VL instantiates a Semantic Matryoshka interface: short prefixes are assigned coarse semantic roles, while longer prefixes progressively expose finer language-grounded distinctions. Because the transform is shared across image and text embeddings and preserves full-dimensional geometry, prefix behavior changes without rewriting the original VLM space. On a 20,147-example COCO/Flickr30K annotation pool, GraSP-VL reaches a staircase score of 53.01 and hard-negative selectivity of 89.76, while keeping full-space drift below 10^{-6}. It also transfers to SugarCrepe-clean with 86.03 object accuracy and 11.96 mean external emergence, and preserves full-dimensional zero-shot CIFAR-100 accuracy. These results show that frozen VLM embeddings can be reorganized into a truncatable semantic prefix interface rather than merely compressed.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 3 minor

Summary. The manuscript proposes GraSP-VL, a method that learns a single shared near-orthogonal prefix transform over frozen vision-language model embeddings to instantiate a Semantic Matryoshka interface. In this interface, shorter prefixes are claimed to capture coarse semantic roles (e.g., object identity) while progressively longer prefixes expose finer language-grounded distinctions (attributes, relations, caption-level meaning), all without modifying the original VLM parameter space or geometry. Empirical support is provided via a staircase score of 53.01 and hard-negative selectivity of 89.76 on a 20,147-example COCO/Flickr30K pool, plus transfer results on SugarCrepe-clean (86.03 object accuracy) and preserved zero-shot CIFAR-100 accuracy, with full-space drift below 10^{-6}.

Significance. If the central claim is substantiated, the work demonstrates that embedding length can serve as a controllable semantic granularity interface for frozen VLMs rather than a simple compression mechanism. The shared transform design and explicit geometry preservation (drift < 10^{-6}) are strengths that maintain compatibility with existing models. The reported transfer performance and length-dependent gains on aggregate metrics suggest practical utility for variable-resolution multimodal tasks, though the absence of direct probes for role assignment limits immediate impact assessment.

major comments (2)
  1. [Abstract and §4] Abstract and §4 (Evaluation): The staircase score of 53.01 and hard-negative selectivity of 89.76 demonstrate monotonic performance gains with prefix length, but these aggregate metrics are consistent with any increase in usable information capacity and do not isolate whether short prefixes specifically encode object identity while longer prefixes add attributes or relations, as required by the Semantic Matryoshka claim. Direct evidence such as per-length probing for semantic categories or controlled ablations on role alignment is needed to support the progressive granularity assignment over generic scaling.
  2. [§3] §3 (Method): The shared near-orthogonal prefix transform is described as assigning coarse-to-fine roles via length, yet the manuscript provides no derivation showing that the transform parameters or loss enforce semantic granularity alignment with prefix length rather than simply increasing representational capacity. Without this link (e.g., via an explicit objective term or analysis of the learned basis), the central claim that length becomes a semantic interface rests on unverified assumptions about the transform's effect.
minor comments (3)
  1. [Abstract] Abstract: Report error bars or variance across runs for the staircase score, hard-negative selectivity, and drift metric to allow assessment of result stability.
  2. [§4] §4: Clarify the exact definition and computation of the 'staircase score' and 'mean external emergence' with reference to equations or pseudocode, as these appear central to the claims but are not derived in the provided summary.
  3. [Figure captions and §5] Figure captions and §5: Ensure all figures include axis labels, legend details, and statistical significance markers where length-dependent comparisons are shown.
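The per-length probing requested in major comment 1 could be prototyped along these lines. This is a sketch only: it swaps in a nearest-centroid classifier for a trained linear probe to stay dependency-free, and the four training and four test vectors are hand-built so that dimension 0 carries the object and dimension 1 carries the attribute. None of it comes from the paper.

```python
def centroid_probe_accuracy(train, test, k):
    """Fit class centroids on the first k dims of train, then score test.

    train and test are lists of (vector, label) pairs.
    """
    sums, counts = {}, {}
    for vec, label in train:
        acc = sums.setdefault(label, [0.0] * k)
        for d in range(k):
            acc[d] += vec[d]
        counts[label] = counts.get(label, 0) + 1
    centroids = {lab: [v / counts[lab] for v in acc] for lab, acc in sums.items()}

    def predict(vec):
        # Ties resolve to the alphabetically first label.
        return min(
            sorted(centroids),
            key=lambda lab: sum((vec[d] - centroids[lab][d]) ** 2 for d in range(k)),
        )

    return sum(predict(vec) == label for vec, label in test) / len(test)

# Toy data: (vector, object label, attribute label).
train = [([1.0, 1.0], "dog", "red"), ([1.0, -1.0], "dog", "blue"),
         ([-1.0, 1.0], "cat", "red"), ([-1.0, -1.0], "cat", "blue")]
test = [([0.8, 0.9], "dog", "red"), ([0.9, -0.8], "dog", "blue"),
        ([-0.7, 0.9], "cat", "red"), ([-0.9, -0.7], "cat", "blue")]

def split(data, field):
    """Pair each vector with either its object (1) or attribute (2) label."""
    return [(row[0], row[field]) for row in data]

object_acc = {k: centroid_probe_accuracy(split(train, 1), split(test, 1), k) for k in (1, 2)}
attribute_acc = {k: centroid_probe_accuracy(split(train, 2), split(test, 2), k) for k in (1, 2)}

print("object probe:   ", object_acc)     # solved already at k=1
print("attribute probe:", attribute_acc)  # chance at k=1, solved at k=2
```

A role-aligned prefix would show this signature on real data: object probes saturating early while attribute and relation probes rise only at longer prefixes. A generic capacity effect would instead lift all probes together.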

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed and constructive comments. We address each major point below with clarifications and indicate where we will strengthen the manuscript through revisions.

Point-by-point responses
  1. Referee: [Abstract and §4] Abstract and §4 (Evaluation): The staircase score of 53.01 and hard-negative selectivity of 89.76 demonstrate monotonic performance gains with prefix length, but these aggregate metrics are consistent with any increase in usable information capacity and do not isolate whether short prefixes specifically encode object identity while longer prefixes add attributes or relations, as required by the Semantic Matryoshka claim. Direct evidence such as per-length probing for semantic categories or controlled ablations on role alignment is needed to support the progressive granularity assignment over generic scaling.

    Authors: We agree that the reported aggregate metrics demonstrate length-dependent gains but do not directly isolate specific semantic role assignments such as object identity at short prefixes versus attributes or relations at longer ones. While the hard-negative selectivity results are consistent with progressive exposure of finer distinctions, this remains indirect support. In the revised manuscript we will add direct per-length probing experiments, including linear probes trained on prefixes of varying lengths to classify object categories versus attributes/relations on a controlled subset of the data, to provide more targeted evidence for the granularity claim.
    Revision: yes

  2. Referee: [§3] §3 (Method): The shared near-orthogonal prefix transform is described as assigning coarse-to-fine roles via length, yet the manuscript provides no derivation showing that the transform parameters or loss enforce semantic granularity alignment with prefix length rather than simply increasing representational capacity. Without this link (e.g., via an explicit objective term or analysis of the learned basis), the central claim that length becomes a semantic interface rests on unverified assumptions about the transform's effect.

    Authors: The shared near-orthogonal transform is trained with a multi-length contrastive objective that encourages prefixes to support retrieval at different granularities, and the near-orthogonality constraint is intended to distribute information across lengths rather than simply expand capacity. However, we acknowledge that the current draft lacks an explicit derivation linking these design choices to semantic granularity. We will revise §3 to include a concise analysis of how the objective and orthogonality promote length-based role assignment, supported by additional inspection of the learned basis vectors.
    Revision: yes
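For readers unfamiliar with the idea, a multi-length contrastive objective of the kind the simulated response mentions might look like the sketch below. The exact loss used in the paper is not given in this summary, so everything here is an assumption: an image-to-text InfoNCE term is averaged over a set of prefix lengths, with a hypothetical temperature of 0.1 and toy embeddings.

```python
import math

def normalized_prefix(vec, k):
    head = vec[:k]
    norm = math.sqrt(sum(x * x for x in head))
    return [x / norm for x in head]

def info_nce(images, texts, k, temperature=0.1):
    """Image-to-text InfoNCE on the first k dims; pair i is the positive for row i."""
    img = [normalized_prefix(v, k) for v in images]
    txt = [normalized_prefix(v, k) for v in texts]
    total = 0.0
    for i, query in enumerate(img):
        logits = [sum(a * b for a, b in zip(query, t)) / temperature for t in txt]
        peak = max(logits)  # subtract the max for a stable log-sum-exp
        log_norm = peak + math.log(sum(math.exp(l - peak) for l in logits))
        total += log_norm - logits[i]
    return total / len(img)

def multi_length_loss(images, texts, lengths):
    """Average the contrastive loss over several prefix lengths."""
    return sum(info_nce(images, texts, k) for k in lengths) / len(lengths)

# Toy embeddings (hypothetical); texts are aligned with images by index.
images = [[1.0, 0.2, 0.1, 0.3], [0.2, 1.0, 0.3, 0.1], [0.1, 0.3, 1.0, 0.2]]
texts_aligned = [list(v) for v in images]
texts_shuffled = texts_aligned[1:] + texts_aligned[:1]  # every pair mismatched

aligned_loss = multi_length_loss(images, texts_aligned, lengths=(2, 4))
shuffled_loss = multi_length_loss(images, texts_shuffled, lengths=(2, 4))

print(f"aligned:  {aligned_loss:.4f}")
print(f"shuffled: {shuffled_loss:.4f}")
```

Note that a loss of this shape rewards every prefix for retrieval in general. On its own it does not assign object identity to short prefixes and attributes to long ones, which is the gap the referee's second major comment points at.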

Circularity Check

0 steps flagged

No significant circularity; results rest on external benchmarks

The paper learns a shared near-orthogonal prefix transform on frozen VLM embeddings and evaluates the resulting Semantic Matryoshka interface via empirical metrics (staircase score 53.01, hard-negative selectivity 89.76, transfer accuracies on SugarCrepe and CIFAR-100). These outcomes are measured on held-out annotation pools and zero-shot tasks rather than derived from the transform parameters by algebraic identity or self-referential fitting. No self-citations, uniqueness theorems, or ansatzes are invoked as load-bearing premises in the abstract or described derivation; the central claim of reorganizing embeddings into a truncatable granularity interface is supported by observable performance scaling, not tautological redefinition of inputs.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 1 invented entity

The central claim rests on one learned transform whose parameters are fitted to data and on the domain assumption that near-orthogonality plus sharing across modalities will produce the desired progressive semantic behavior without geometry drift.

free parameters (1)
  • shared prefix transform parameters
    The near-orthogonal prefix transform is learned from data; its weights constitute free parameters that directly determine the reported staircase and selectivity scores.
axioms (1)
  • domain assumption: A single shared near-orthogonal prefix transform can assign progressively finer language-grounded distinctions to longer prefixes while preserving full-dimensional geometry
    Invoked in the abstract description of the Semantic Matryoshka interface and the claim that prefix behavior changes without rewriting the original VLM space.
invented entities (1)
  • Semantic Matryoshka interface (no independent evidence)
    purpose: To provide a length-based controllable semantic access mechanism over frozen embeddings
    New conceptual framing introduced by the paper to describe the truncatable prefix behavior.
