arxiv: 2509.19207 · v2 · submitted 2025-09-23 · 💻 cs.CV

Long Story Short: Disentangling Compositionality and Long-Caption Understanding in Contrastive VLMs

Pith reviewed 2026-05-18 14:37 UTC · model grok-4.3

classification 💻 cs.CV
keywords contrastive vision-language models · compositionality · long captions · visual grounding · bidirectional transfer · generalization · training conditions · architectural design

The pith

High-quality long-caption data with strong visual grounding simultaneously improves compositional reasoning and long-caption understanding in contrastive vision-language models.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper examines how compositional reasoning and the ability to understand long captions interact in contrastive vision-language models. It shows through experiments that these two capabilities have a bidirectional relationship that depends heavily on training conditions. When models use poorly grounded captions or make only limited updates to parameters, they fail to generalize well to new tasks. In contrast, training on high-quality long captions that are strongly grounded in visuals helps both skills develop together. This matters for building VLMs that can handle complex, detailed descriptions of images without sacrificing basic alignment.

Core claim

Through controlled experiments across diverse training objectives, datasets, and architectural designs, we find a bidirectional but sensitive relationship between compositional reasoning and long-caption understanding. Models trained on poorly grounded captions or with limited parameter updates fail to generalize, while high-quality long-caption data with strong visual grounding promotes both capabilities simultaneously. Architectural choices such as frozen positional embeddings can inadvertently limit compositional learning while aiming to preserve general alignment.

What carries the argument

Controlled experiments that vary training objectives, datasets, and architectural designs to isolate transfer effects between compositionality and long-caption understanding.

If this is right

  • High-quality long-caption data with strong visual grounding promotes both compositional reasoning and long-caption understanding at the same time.
  • Models trained on poorly grounded captions or with only limited parameter updates fail to generalize these capabilities.
  • Frozen positional embeddings can limit compositional learning even when intended to preserve overall alignment.
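As a minimal illustration of what "frozen" means in the last two points, the sketch below applies a plain gradient-descent step to every parameter group except those named as frozen. The group names, values, and learning rate are hypothetical and are not taken from the paper; the paper's actual optimizer and parameter layout are not described here.

```python
def sgd_step(params, grads, lr, frozen=()):
    """Return updated parameters, skipping any group named in `frozen`."""
    updated = {}
    for name, values in params.items():
        if name in frozen:
            # Frozen group: copied unchanged, so it cannot adapt to new data.
            updated[name] = list(values)
        else:
            updated[name] = [v - lr * g for v, g in zip(values, grads[name])]
    return updated


# Hypothetical two-group model: positional embeddings and everything else.
params = {"positional_embedding": [0.5, -0.5], "text_encoder": [1.0, 2.0]}
grads = {"positional_embedding": [0.1, 0.1], "text_encoder": [0.2, -0.2]}
new_params = sgd_step(params, grads, lr=0.5, frozen={"positional_embedding"})
```

In this toy setup the positional embeddings receive a non-zero gradient but never move, which is the kind of constraint the summary says may limit compositional learning.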

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Prioritizing visual grounding during data curation may advance both skills more reliably than simply increasing model scale.
  • A curriculum that builds from grounded short captions toward longer compositional ones could efficiently develop both abilities together.
  • The sensitive relationship observed here may appear in other multimodal tasks that require detailed visual-textual alignment.

Load-bearing premise

The controlled experiments across diverse training objectives, datasets, and architectural designs sufficiently isolate the effects of compositionality versus long-caption understanding without major confounding from unmeasured factors such as dataset biases or model scale differences.

What would settle it

Training models on high-quality long-caption data with strong visual grounding and observing no corresponding gains on compositional reasoning benchmarks, while holding other factors fixed, would falsify the claim of bidirectional promotion.
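One common way to score compositional reasoning, sketched here under stated assumptions, is to check whether a model places an image closer to its correct caption than to a minimally edited "hard negative" caption. The function names and toy embeddings below are hypothetical, and this is not necessarily the exact protocol of the benchmarks used in the paper.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def hard_negative_accuracy(examples):
    """Fraction of examples where the image is closer to the correct
    caption than to the hard-negative caption."""
    correct = 0
    for image_emb, pos_emb, neg_emb in examples:
        if cosine(image_emb, pos_emb) > cosine(image_emb, neg_emb):
            correct += 1
    return correct / len(examples)


# Hypothetical toy embeddings: (image, correct caption, hard negative).
toy_examples = [
    ([1.0, 0.0], [0.9, 0.1], [0.1, 0.9]),  # model prefers the correct caption
    ([0.0, 1.0], [0.8, 0.2], [0.2, 0.8]),  # model prefers the hard negative
]
baseline_acc = hard_negative_accuracy(toy_examples)
```

Under the falsification test described above, one would compare such a score before and after training on well-grounded long captions while holding everything else fixed.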

Figures

Figures reproduced from arXiv: 2509.19207 by Desmond Elliott, Israfel Salazar, Yova Kementchedjhieva.

Figure 1: Training Dynamics Across Long-Caption Datasets. We track the evolution of performance for models trained on four long-caption datasets, evaluating at fixed training steps. Results are reported for both long-caption retrieval (Urban1K, DOCCI, sDCI, IiW) and compositional benchmark, SC++. Models trained on ShareGPT4V and DOCCI show consistently stronger generalization, suggesting that caption quality, groundi…
Figure 2: Long-caption Retrieval Performance. LSS achieves strong performance on all benchmarks, with LongCLIP outperforming by a small margin despite its three times larger input token processing capacity. When truncating LongCLIP to 70 words, LSS closes the gap and outperforms it across all benchmarks. These results demonstrate that full parameter adaptation enables more effective long-caption understanding. word…
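The retrieval comparison in Figure 2 can be pictured with a small sketch: truncate captions to a fixed word budget, then measure how often the matching image is ranked in the top k. This assumes whitespace word splitting and that caption i matches image i; the helper names and the similarity matrix are hypothetical, and the paper's actual evaluation code is not shown here.

```python
def truncate_words(caption, max_words):
    """Keep only the first max_words whitespace-separated words."""
    return " ".join(caption.split()[:max_words])


def recall_at_k(similarity, k=1):
    """Text-to-image Recall@k, assuming caption i matches image i.

    similarity[i][j] is the score between caption i and image j.
    """
    hits = 0
    for i, row in enumerate(similarity):
        # Image indices sorted from most to least similar to caption i.
        ranked = sorted(range(len(row)), key=lambda j: row[j], reverse=True)
        if i in ranked[:k]:
            hits += 1
    return hits / len(similarity)


# Hypothetical similarity matrix for three caption-image pairs.
toy_similarity = [
    [0.9, 0.2, 0.1],  # caption 0 retrieves image 0 first
    [0.3, 0.4, 0.8],  # caption 1 ranks image 2 above image 1
    [0.1, 0.2, 0.7],  # caption 2 retrieves image 2 first
]
r1 = recall_at_k(toy_similarity, k=1)
r2 = recall_at_k(toy_similarity, k=2)
short = truncate_words("a red bicycle leaning against a brick wall", 3)
```

Truncating every model's input to the same word budget, as the caption describes for the 70-word setting, removes context length as a source of difference between the models.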
Original abstract

Contrastive vision-language models (VLMs) have made significant progress in binding visual and textual information, yet understanding long, compositional captions remains an open challenge. While these capabilities are often assumed to be closely related, the conditions under which they reinforce each other remain unclear. In this paper, we empirically analyze when compositional reasoning and long-caption understanding transfer across tasks, and when this relationship fails. Through controlled experiments across diverse training objectives, datasets, and architectural designs, we find a bidirectional but sensitive relationship between the two capabilities. Models trained on poorly grounded captions or with limited parameter updates fail to generalize, while high-quality long-caption data with strong visual grounding promotes both capabilities simultaneously. We further show that architectural choices aimed at preserving general alignment, such as frozen positional embeddings, can inadvertently limit compositional learning. Our analysis provides actionable guidelines for data selection and model design to improve VLM generalization.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 2 minor

Summary. The paper empirically analyzes the relationship between compositional reasoning and long-caption understanding in contrastive vision-language models. Through controlled experiments varying training objectives, datasets, and architectural designs, it reports a bidirectional but sensitive relationship: models trained on poorly grounded captions or with limited updates fail to generalize, while high-quality long-caption data with strong visual grounding promotes both capabilities simultaneously. It further identifies that choices like frozen positional embeddings can limit compositional learning and offers guidelines for data selection and model design.

Significance. If the central empirical findings hold after addressing potential confounds, the work would be significant for multimodal learning by clarifying when and how compositionality and long-caption capabilities transfer in VLMs. It supplies actionable, data- and architecture-focused recommendations that could improve generalization on complex visual-textual tasks, an open challenge in the field. The use of diverse conditions across objectives and architectures is a strength, though the value hinges on whether the experiments truly isolate the two capabilities.

major comments (1)
  1. [Experimental Setup / Dataset Selection] Dataset construction and experimental design sections: The description of dataset selection does not report explicit controls for caption compositionality metrics (e.g., scene graph density or dependency parse depth) independent of length and grounding quality. This is load-bearing for the central claim because if 'high-quality long-caption' datasets systematically differ in compositional structure due to shared curation or visual complexity, the observed bidirectional transfer and 'sensitive relationship' could be driven by a single latent factor rather than independent reinforcement across conditions.
minor comments (2)
  1. [Abstract] Abstract and results presentation: The abstract summarizes the bidirectional relationship clearly but would benefit from a brief quantitative sense of effect sizes or number of conditions tested to help readers gauge the strength of the reported transfers.
  2. [Results / Figures] Figure and table captions: Ensure all result visualizations explicitly label the training conditions (e.g., what constitutes 'poorly grounded' captions quantitatively) so that the isolation of compositionality versus length/grounding effects is immediately verifiable.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for their constructive and detailed feedback, which helps clarify the strength of our empirical claims. We respond to the major comment below and will revise the manuscript to incorporate additional controls and analysis.

Point-by-point responses
  1. Referee: [Experimental Setup / Dataset Selection] Dataset construction and experimental design sections: The description of dataset selection does not report explicit controls for caption compositionality metrics (e.g., scene graph density or dependency parse depth) independent of length and grounding quality. This is load-bearing for the central claim because if 'high-quality long-caption' datasets systematically differ in compositional structure due to shared curation or visual complexity, the observed bidirectional transfer and 'sensitive relationship' could be driven by a single latent factor rather than independent reinforcement across conditions.

    Authors: We agree that reporting explicit compositionality metrics would further isolate the contributions of grounding quality and length. Our dataset selection prioritized variation in visual grounding and caption length as the primary axes, with multiple training objectives and architectural ablations used to test transfer. However, we did not include quantitative compositionality statistics such as scene graph density or dependency parse depth in the original submission. In the revision we will add these metrics (computed via standard parsers and scene graph tools) for all datasets in a new table or appendix, along with a brief discussion of their correlation with the observed effects. This will allow readers to assess whether compositional structure acts as a latent confound. Our cross-condition results (e.g., training on short vs. long captions while holding architecture fixed) already provide evidence that the relationship is not reducible to a single factor, but the added metrics will make this explicit.

    revision: yes
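To make one of the metrics named in this exchange concrete, the sketch below computes dependency parse depth from a head-index list, the form many parsers can export. The input format, function name, and hand-written example parse are assumptions for illustration; neither the referee report nor the rebuttal specifies a particular parser or definition of depth.

```python
def parse_depth(heads):
    """Depth of a dependency tree given as a head-index list.

    heads[i] is the index of token i's head, or -1 for the root.
    Depth is the number of edges on the longest root-to-token path.
    Assumes a well-formed tree (a single root and no cycles).
    """
    def depth_of(i):
        steps = 0
        while heads[i] != -1:
            i = heads[i]
            steps += 1
        return steps

    return max(depth_of(i) for i in range(len(heads)))


# Hypothetical hand-written parse of "the dog chased a small cat":
# the -> dog, dog -> chased, chased = root, a -> cat, small -> cat, cat -> chased
toy_heads = [1, 2, -1, 5, 5, 2]
toy_depth = parse_depth(toy_heads)
```

Reporting a statistic like this per dataset, alongside caption length, is what would let readers check whether compositional structure varies independently of length and grounding quality.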

Circularity Check

0 steps flagged

No circularity: empirical findings from controlled experiments

full rationale

The paper conducts an empirical analysis of VLMs through controlled experiments varying training objectives, datasets, and architectures. It reports observed performance differences supporting a bidirectional but sensitive relationship between compositionality and long-caption understanding. No mathematical derivation, first-principles result, or predictive equation is presented that could reduce to fitted inputs or self-referential definitions by construction. The central claims rest on experimental outcomes rather than any chain that equates outputs to inputs via self-definition, renaming, or load-bearing self-citation. This is the normal case for an experimental study whose validity can be assessed against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claims rest on standard machine-learning assumptions about data quality driving generalization and the ability of contrastive objectives to measure alignment; no new free parameters, invented entities, or ad-hoc axioms are introduced beyond domain conventions.

axioms (1)
  • domain assumption: Contrastive objectives in VLMs capture meaningful visual-textual alignment that can be measured through downstream task performance.
    Invoked implicitly when interpreting transfer between compositionality and caption understanding as evidence of improved binding.
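For readers unfamiliar with the contrastive objective this axiom refers to, here is a minimal sketch of a symmetric softmax contrastive loss over a batch similarity matrix, in the style popularized by CLIP. It is an illustration under assumptions: the function names and the temperature value are illustrative choices, and the paper also considers other objectives that are not shown here.

```python
import math


def diagonal_cross_entropy(logits):
    """Mean negative log-probability of the matching (diagonal) entry per row."""
    total = 0.0
    for i, row in enumerate(logits):
        m = max(row)  # subtract the row max for numerical stability
        log_z = m + math.log(sum(math.exp(x - m) for x in row))
        total += log_z - row[i]
    return total / len(logits)


def symmetric_contrastive_loss(similarity, temperature=0.07):
    """Average of image-to-text and text-to-image cross-entropy.

    similarity[i][j] is the similarity between image i and caption j;
    matching pairs are assumed to lie on the diagonal.
    """
    logits = [[s / temperature for s in row] for row in similarity]
    transposed = [list(col) for col in zip(*logits)]
    return 0.5 * (diagonal_cross_entropy(logits) + diagonal_cross_entropy(transposed))


# Hypothetical 2x2 batches: one well aligned, one with the pairs swapped.
aligned = [[1.0, 0.0], [0.0, 1.0]]
confused = [[0.0, 1.0], [1.0, 0.0]]
loss_aligned = symmetric_contrastive_loss(aligned)
loss_confused = symmetric_contrastive_loss(confused)
```

The loss is small when matching image-caption pairs score highest in both directions and large when they do not, which is the sense in which such an objective is taken to measure alignment.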

A. Training Parameters

In this section, we present the training parameters for our models. Models are named after the datasets on w...