Long Story Short: Disentangling Compositionality and Long-Caption Understanding in Contrastive VLMs

Desmond Elliott; Israfel Salazar; Yova Kementchedjhieva

arXiv: 2509.19207 · v2 · submitted 2025-09-23 · cs.CV

Pith reviewed 2026-05-18 14:37 UTC · model grok-4.3

classification: cs.CV

keywords: contrastive vision-language models · compositionality · long captions · visual grounding · bidirectional transfer · generalization · training conditions · architectural design

The pith

High-quality long-caption data with strong visual grounding simultaneously improves compositional reasoning and long-caption understanding in contrastive vision-language models.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper examines how compositional reasoning and the ability to understand long captions interact in contrastive vision-language models. It shows through experiments that these two capabilities have a bidirectional relationship that depends heavily on training conditions. When models use poorly grounded captions or make only limited updates to parameters, they fail to generalize well to new tasks. In contrast, training on high-quality long captions that are strongly grounded in visuals helps both skills develop together. This matters for building VLMs that can handle complex, detailed descriptions of images without sacrificing basic alignment.

Core claim

Through controlled experiments across diverse training objectives, datasets, and architectural designs, we find a bidirectional but sensitive relationship between compositional reasoning and long-caption understanding. Models trained on poorly grounded captions or with limited parameter updates fail to generalize, while high-quality long-caption data with strong visual grounding promotes both capabilities simultaneously. Architectural choices such as frozen positional embeddings can inadvertently limit compositional learning while aiming to preserve general alignment.

What carries the argument

Controlled experiments that vary training objectives, datasets, and architectural designs to isolate transfer effects between compositionality and long-caption understanding.

If this is right

High-quality long-caption data with strong visual grounding promotes both compositional reasoning and long-caption understanding at the same time.
Models trained on poorly grounded captions or with only limited parameter updates fail to generalize these capabilities.
Frozen positional embeddings can limit compositional learning even when intended to preserve overall alignment.
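
To make the contrast between "limited parameter updates" and full adaptation concrete, here is a minimal sketch of a gradient step that skips a frozen parameter group. It is an illustration only, not the paper's training code: the `sgd_step` helper, the parameter names, and the learning rate are all hypothetical.

```python
def sgd_step(params, grads, lr=0.1, frozen=()):
    """One plain SGD step over a dict of named parameter lists.

    Any parameter whose name starts with a prefix in `frozen` is copied
    unchanged, mimicking a frozen module such as positional embeddings.
    All names here are hypothetical and chosen for illustration.
    """
    updated = {}
    for name, values in params.items():
        if any(name.startswith(prefix) for prefix in frozen):
            updated[name] = list(values)  # frozen: no update applied
        else:
            updated[name] = [v - lr * g for v, g in zip(values, grads[name])]
    return updated


# Toy parameters and gradients (made-up values).
params = {"text.positional_embedding": [1.0, 2.0], "text.projection": [0.5]}
grads = {"text.positional_embedding": [1.0, 1.0], "text.projection": [1.0]}

partial = sgd_step(params, grads, lr=0.5, frozen=("text.positional_embedding",))
full = sgd_step(params, grads, lr=0.5)
```

In the `partial` result the positional embeddings keep their initial values no matter what the gradient says, which is the mechanism by which freezing could cap what the text encoder learns about word order and position.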

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Prioritizing visual grounding during data curation may advance both skills more reliably than simply increasing model scale.
A curriculum that builds from grounded short captions toward longer compositional ones could efficiently develop both abilities together.
The sensitive relationship observed here may appear in other multimodal tasks that require detailed visual-textual alignment.

Load-bearing premise

The controlled experiments across diverse training objectives, datasets, and architectural designs sufficiently isolate the effects of compositionality versus long-caption understanding without major confounding from unmeasured factors such as dataset biases or model scale differences.

What would settle it

Training models on high-quality long-caption data with strong visual grounding and observing no corresponding gains on compositional reasoning benchmarks, while holding other factors fixed, would falsify the claim of bidirectional promotion.

Figures

Figures reproduced from arXiv: 2509.19207 by Desmond Elliott, Israfel Salazar, Yova Kementchedjhieva.

**Figure 1.** Training Dynamics Across Long-Caption Datasets. We track the evolution of performance for models trained on four long-caption datasets, evaluating at fixed training steps. Results are reported for both long-caption retrieval (Urban1K, DOCCI, sDCI, IiW) and a compositional benchmark, SC++. Models trained on ShareGPT4V and DOCCI show consistently stronger generalization, suggesting that caption quality, groundi…

**Figure 2.** Long-caption Retrieval Performance. LSS achieves strong performance on all benchmarks, with LongCLIP outperforming it by only a small margin despite its three-times-larger input token capacity. When LongCLIP is truncated to 70 words, LSS closes the gap and outperforms it across all benchmarks. These results demonstrate that full parameter adaptation enables more effective long-caption understanding. word…
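
The truncated comparison described in the Figure 2 caption can be sketched as follows. This is a rough illustration under stated assumptions: the caption does not name the retrieval metric, so recall@1 is assumed here, and `truncate_words`, `cosine`, and `recall_at_1` are hypothetical helpers rather than the paper's evaluation code. In a real run the embeddings would come from the model's image and text encoders.

```python
from math import sqrt


def truncate_words(caption, max_words=70):
    """Keep only the first `max_words` whitespace-separated words."""
    return " ".join(caption.split()[:max_words])


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def recall_at_1(image_embs, text_embs):
    """Fraction of captions whose top-scoring image is the paired one.

    Pairing is by index: caption i belongs to image i.
    """
    hits = 0
    for i, text in enumerate(text_embs):
        scores = [cosine(text, image) for image in image_embs]
        best = max(range(len(scores)), key=scores.__getitem__)
        hits += int(best == i)
    return hits / len(text_embs)
```

Under this setup, a caption would be passed through `truncate_words` before encoding, and the same `recall_at_1` call would score both the full-length and the 70-word conditions so that the two numbers are directly comparable.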

Original abstract

Contrastive vision-language models (VLMs) have made significant progress in binding visual and textual information, yet understanding long, compositional captions remains an open challenge. While these capabilities are often assumed to be closely related, the conditions under which they reinforce each other remain unclear. In this paper, we empirically analyze when compositional reasoning and long-caption understanding transfer across tasks, and when this relationship fails. Through controlled experiments across diverse training objectives, datasets, and architectural designs, we find a bidirectional but sensitive relationship between the two capabilities. Models trained on poorly grounded captions or with limited parameter updates fail to generalize, while high-quality long-caption data with strong visual grounding promotes both capabilities simultaneously. We further show that architectural choices aimed at preserving general alignment, such as frozen positional embeddings, can inadvertently limit compositional learning. Our analysis provides actionable guidelines for data selection and model design to improve VLM generalization.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note

private letter to a colleague

High-quality grounded long captions can improve both compositionality and long-caption understanding in VLMs while frozen positional embeddings tend to limit gains, though dataset construction might still mix the two factors.

The main takeaway is that training on high-quality long captions with strong visual grounding helps both compositional reasoning and long-caption understanding at once in contrastive VLMs, but poor grounding or frozen positional embeddings block that transfer. The experiments test this across different objectives, datasets, and architectures and report a bidirectional but sensitive link. Models trained on weak data fail to generalize, while the better data lifts both skills together. The authors also note that keeping positional embeddings frozen to hold onto general alignment can restrict compositional learning instead. This produces some practical pointers on data selection and model tweaks.

What the paper does well is run the tests across multiple setups rather than one narrow case, which makes the patterns easier to trust. The focus stays on observable performance differences and ends with guidelines that people building VLMs could actually use.

One soft spot is whether the datasets cleanly separate compositionality from caption length and grounding. If the high-quality long captions already carry more compositional structure through how they were made, the transfer results could trace back to that shared feature rather than independent reinforcement. The abstract does not mention explicit checks like scene-graph density or parse depth measured apart from the main variables, so that part of the isolation needs a close read in the methods.

This work is for researchers who train or tune VLMs and want rules of thumb for handling complex captions. Anyone running similar empirical checks on multimodal models will find the comparisons useful. It deserves a serious referee because the experiments target a real gap and the outcomes are concrete enough to test further.

Referee Report

1 major / 2 minor

Summary. The paper empirically analyzes the relationship between compositional reasoning and long-caption understanding in contrastive vision-language models. Through controlled experiments varying training objectives, datasets, and architectural designs, it reports a bidirectional but sensitive relationship: models trained on poorly grounded captions or with limited updates fail to generalize, while high-quality long-caption data with strong visual grounding promotes both capabilities simultaneously. It further identifies that choices like frozen positional embeddings can limit compositional learning and offers guidelines for data selection and model design.

Significance. If the central empirical findings hold after addressing potential confounds, the work would be significant for multimodal learning by clarifying when and how compositionality and long-caption capabilities transfer in VLMs. It supplies actionable, data- and architecture-focused recommendations that could improve generalization on complex visual-textual tasks, an open challenge in the field. The use of diverse conditions across objectives and architectures is a strength, though the value hinges on whether the experiments truly isolate the two capabilities.

major comments (1)

[Experimental Setup / Dataset Selection] Dataset construction and experimental design sections: The description of dataset selection does not report explicit controls for caption compositionality metrics (e.g., scene graph density or dependency parse depth) independent of length and grounding quality. This is load-bearing for the central claim because if 'high-quality long-caption' datasets systematically differ in compositional structure due to shared curation or visual complexity, the observed bidirectional transfer and 'sensitive relationship' could be driven by a single latent factor rather than independent reinforcement across conditions.

minor comments (2)

[Abstract] Abstract and results presentation: The abstract summarizes the bidirectional relationship clearly but would benefit from a brief quantitative sense of effect sizes or number of conditions tested to help readers gauge the strength of the reported transfers.
[Results / Figures] Figure and table captions: Ensure all result visualizations explicitly label the training conditions (e.g., what constitutes 'poorly grounded' captions quantitatively) so that the isolation of compositionality versus length/grounding effects is immediately verifiable.

Simulated Authors' Rebuttal

1 response · 0 unresolved

We thank the referee for their constructive and detailed feedback, which helps clarify the strength of our empirical claims. We respond to the major comment below and will revise the manuscript to incorporate additional controls and analysis.

Referee: [Experimental Setup / Dataset Selection] Dataset construction and experimental design sections: The description of dataset selection does not report explicit controls for caption compositionality metrics (e.g., scene graph density or dependency parse depth) independent of length and grounding quality. This is load-bearing for the central claim because if 'high-quality long-caption' datasets systematically differ in compositional structure due to shared curation or visual complexity, the observed bidirectional transfer and 'sensitive relationship' could be driven by a single latent factor rather than independent reinforcement across conditions.

Authors: We agree that reporting explicit compositionality metrics would further isolate the contributions of grounding quality and length. Our dataset selection prioritized variation in visual grounding and caption length as the primary axes, with multiple training objectives and architectural ablations used to test transfer. However, we did not include quantitative compositionality statistics such as scene graph density or dependency parse depth in the original submission. In the revision we will add these metrics (computed via standard parsers and scene graph tools) for all datasets in a new table or appendix, along with a brief discussion of their correlation with the observed effects. This will allow readers to assess whether compositional structure acts as a latent confound. Our cross-condition results (e.g., training on short vs. long captions while holding architecture fixed) already provide evidence that the relationship is not reducible to a single factor, but the added metrics will make this explicit.

revision: yes
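
As an illustration of the kind of compositionality statistic discussed in this exchange, the sketch below computes dependency parse depth from a list of head indices. It is a minimal example under assumptions: the input format (one head index per token, `-1` for the root) and the helper names `parse_depth` and `mean_parse_depth` are hypothetical, and the parser that would produce the heads is left unspecified, as it is in the rebuttal.

```python
def parse_depth(heads):
    """Maximum depth of a dependency tree (the root token has depth 1).

    `heads[i]` is the index of token i's syntactic head, or -1 for the root.
    This input format is an assumption; real parsers expose it differently.
    """
    depths = {}  # token index -> depth
    for start in range(len(heads)):
        path = []
        node = start
        # Walk up towards the root until reaching a token of known depth.
        while node != -1 and node not in depths:
            if node in path:
                raise ValueError("head indices contain a cycle")
            path.append(node)
            node = heads[node]
        depth = 0 if node == -1 else depths[node]
        # Unwind the path: each token sits one level below its head.
        for token in reversed(path):
            depth += 1
            depths[token] = depth
    return max(depths.values(), default=0)


def mean_parse_depth(parses):
    """Average parse depth over a collection of captions (0.0 if empty)."""
    return sum(parse_depth(h) for h in parses) / len(parses) if parses else 0.0


# "The dog chased the cat": The->dog, dog->chased, chased=root, the->cat, cat->chased
example_heads = [1, 2, -1, 4, 2]
```

Reporting a per-dataset average of this kind alongside caption length would let a reader check whether the "high-quality" datasets are also the structurally deeper ones, which is the confound the referee raises.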

Circularity Check

0 steps flagged

No circularity: empirical findings from controlled experiments

The paper conducts an empirical analysis of VLMs through controlled experiments varying training objectives, datasets, and architectures. It reports observed performance differences supporting a bidirectional but sensitive relationship between compositionality and long-caption understanding. No mathematical derivation, first-principles result, or predictive equation is presented that could reduce to fitted inputs or self-referential definitions by construction. The central claims rest on experimental outcomes rather than any chain that equates outputs to inputs via self-definition, renaming, or load-bearing self-citation. This is the normal case for an experimental study whose validity can be assessed against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claims rest on standard machine-learning assumptions about data quality driving generalization and the ability of contrastive objectives to measure alignment; no new free parameters, invented entities, or ad-hoc axioms are introduced beyond domain conventions.

axioms (1)

Domain assumption: contrastive objectives in VLMs capture meaningful visual-textual alignment that can be measured through downstream task performance.
Invoked implicitly when interpreting transfer between compositionality and caption understanding as evidence of improved binding.

Reference graph

Works this paper leans on

51 extracted references · 51 canonical work pages · 5 internal anchors

[1]

Amro Kamal Mohamed Abbas, Kushal Tirumala, Daniel Simig, Surya Ganguli, and Ari S. Morcos. Semdedup: Data- efficient learning at web-scale through semantic deduplica- tion. InICLR 2023 Workshop on Multimodal Representation Learning: Perks and Pitfalls, 2023. 2

work page 2023
[2]

GPT-4 Technical Report

Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ah- mad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. Gpt-4 technical report.arXiv preprint arXiv:2303.08774,

work page internal anchor Pith review Pith/arXiv arXiv
[3]

Closure: Assessing systematic general- ization of clevr models.arXiv preprint arXiv:1912.05783,

Dzmitry Bahdanau, Harm de Vries, Timothy J O’Donnell, Shikhar Murty, Philippe Beaudoin, Yoshua Bengio, and Aaron Courville. Closure: Assessing systematic general- ization of clevr models.arXiv preprint arXiv:1912.05783,

work page arXiv 1912
[4]

Food-101–mining discriminative components with random forests

Lukas Bossard, Matthieu Guillaumin, and Luc Van Gool. Food-101–mining discriminative components with random forests. InComputer vision–ECCV 2014: 13th European conference, zurich, Switzerland, September 6-12, 2014, pro- ceedings, part VI 13, pages 446–461. Springer, 2014. 2

work page 2014
[5]

Coyo-700m: Image-text pair dataset.https : / / github

Minwoo Byeon, Beomhee Park, Haecheon Kim, Sungjun Lee, Woonhyuk Baek, and Saehoon Kim. Coyo-700m: Image-text pair dataset.https : / / github . com / kakaobrain/coyo-dataset, 2022. 3

work page 2022
[6]

Spatialvlm: Endow- ing vision-language models with spatial reasoning capabili- ties

Boyuan Chen, Zhuo Xu, Sean Kirmani, Brain Ichter, Dorsa Sadigh, Leonidas Guibas, and Fei Xia. Spatialvlm: Endow- ing vision-language models with spatial reasoning capabili- ties. InProceedings of the IEEE/CVF Conference on Com- puter Vision and Pattern Recognition, pages 14455–14465,

work page
[7]

Sharegpt4v: Improving large multi-modal models with better captions

Lin Chen, Jinsong Li, Xiaoyi Dong, Pan Zhang, Conghui He, Jiaqi Wang, Feng Zhao, and Dahua Lin. Sharegpt4v: Improving large multi-modal models with better captions. In European Conference on Computer Vision, pages 370–387. Springer, 2024. 3

work page 2024
[8]

PaLI: A Jointly-Scaled Multilingual Language-Image Model

Xi Chen, Xiao Wang, Soravit Changpinyo, Anthony J Pier- giovanni, Piotr Padlewski, Daniel Salz, Sebastian Goodman, Adam Grycner, Basil Mustafa, Lucas Beyer, et al. Pali: A jointly-scaled multilingual language-image model.arXiv preprint arXiv:2209.06794, 2022. 11

work page internal anchor Pith review Pith/arXiv arXiv 2022
[9]

Fine-grained image captioning with clip reward.arXiv preprint arXiv:2205.13115, 2022

Jaemin Cho, Seunghyun Yoon, Ajinkya Kale, Franck Der- noncourt, Trung Bui, and Mohit Bansal. Fine-grained image captioning with clip reward.arXiv preprint arXiv:2205.13115, 2022. 1

work page arXiv 2022
[10]

Why is winoground hard? investigating failures in visuolinguistic compositionality.arXiv preprint arXiv:2211.00768, 2022

Anuj Diwan, Layne Berry, Eunsol Choi, David Harwath, and Kyle Mahowald. Why is winoground hard? investigating failures in visuolinguistic compositionality.arXiv preprint arXiv:2211.00768, 2022. 2, 3, 12

work page arXiv 2022
[11]

Dense and aligned captions (dac) promote compositional reasoning in vl models.Advances in Neural Information Processing Systems, 36:76137–76150, 2023

Sivan Doveh, Assaf Arbelle, Sivan Harary, Roei Herzig, Donghyun Kim, Paola Cascante-Bonilla, Amit Alfassy, Rameswar Panda, Raja Giryes, Rogerio Feris, et al. Dense and aligned captions (dac) promote compositional reasoning in vl models.Advances in Neural Information Processing Systems, 36:76137–76150, 2023. 2

work page 2023
[12]

Sugarcrepe++ dataset: Vision-language model sensitivity to semantic and lexical alterations.Advances in Neural Information Processing Systems, 37:17972–18018,

Sri Harsha Dumpala, Aman Jaiswal, Chandramouli Shama Sastry, Evangelos Milios, Sageev Oore, and Hassan Sajjad. Sugarcrepe++ dataset: Vision-language model sensitivity to semantic and lexical alterations.Advances in Neural Information Processing Systems, 37:17972–18018,

work page
[13]

Im- ageinwords: Unlocking hyper-detailed image descriptions

Roopal Garg, Andrea Burns, Burcu Karagol Ayan, Yonatan Bitton, Ceslee Montgomery, Yasumasa Onoe, Andrew Bun- ner, Ranjay Krishna, Jason Baldridge, and Radu Soricut. Im- ageinwords: Unlocking hyper-detailed image descriptions. arXiv preprint arXiv:2405.02793, 2024. 1, 3, 4

work page arXiv 2024
[14]

Sugarcrepe: Fixing hackable benchmarks for vision-language compositionality.Advances in neural information processing systems, 36:31096–31116,

Cheng-Yu Hsieh, Jieyu Zhang, Zixian Ma, Aniruddha Kem- bhavi, and Ranjay Krishna. Sugarcrepe: Fixing hackable benchmarks for vision-language compositionality.Advances in neural information processing systems, 36:31096–31116,

work page
[15]

Lora: Low-rank adaptation of large language models.ICLR, 1(2):3, 2022

Edward J Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen- Zhu, Yuanzhi Li, Shean Wang, Lu Wang, Weizhu Chen, et al. Lora: Low-rank adaptation of large language models.ICLR, 1(2):3, 2022. 2

work page 2022
[16]

Gqa: A new dataset for real-world visual reasoning and compositional question answering

Drew A Hudson and Christopher D Manning. Gqa: A new dataset for real-world visual reasoning and compositional question answering. InProceedings of the IEEE/CVF con- ference on computer vision and pattern recognition, pages 6700–6709, 2019. 2

work page 2019
[17]

Scaling up visual and vision-language representa- tion learning with noisy text supervision

Chao Jia, Yinfei Yang, Ye Xia, Yi-Ting Chen, Zarana Parekh, Hieu Pham, Quoc Le, Yun-Hsuan Sung, Zhen Li, and Tom Duerig. Scaling up visual and vision-language representa- tion learning with noisy text supervision. InInternational conference on machine learning, pages 4904–4916. PMLR,

work page
[18]

Kamath, J

Amita Kamath, Jack Hessel, and Kai-Wei Chang. What’s” up” with vision-language models? investigating their strug- gle with spatial reasoning.arXiv preprint arXiv:2310.19785,

work page arXiv
[19]

Segment any- thing

Alexander Kirillov, Eric Mintun, Nikhila Ravi, Hanzi Mao, Chloe Rolland, Laura Gustafson, Tete Xiao, Spencer White- head, Alexander C Berg, Wan-Yen Lo, et al. Segment any- thing. InProceedings of the IEEE/CVF international confer- ence on computer vision, pages 4015–4026, 2023. 2

work page 2023
[20]

3d object representations for fine-grained categorization

Jonathan Krause, Michael Stark, Jia Deng, and Li Fei-Fei. 3d object representations for fine-grained categorization. In Proceedings of the IEEE international conference on com- puter vision workshops, pages 554–561, 2013. 2

work page 2013
[21]

Visual genome: Connecting language and vision using crowdsourced dense image annotations.International journal of computer vision, 123:32–73, 2017

Ranjay Krishna, Yuke Zhu, Oliver Groth, Justin Johnson, Kenji Hata, Joshua Kravitz, Stephanie Chen, Yannis Kalan- tidis, Li-Jia Li, David A Shamma, et al. Visual genome: Connecting language and vision using crowdsourced dense image annotations.International journal of computer vision, 123:32–73, 2017. 2, 3

work page 2017
[22]

Learning multiple layers of features from tiny images, 2009

Alex Krizhevsky, Geoffrey Hinton, et al. Learning multiple layers of features from tiny images, 2009. 2

work page 2009
[23]

Alina Kuznetsova, Hassan Rom, Neil Alldrin, Jasper Ui- jlings, Ivan Krasin, Jordi Pont-Tuset, Shahab Kamali, Stefan Popov, Matteo Malloci, Alexander Kolesnikov, et al. The open images dataset v4: Unified image classification, object detection, and visual relationship detection at scale.Interna- tional journal of computer vision, 128(7):1956–1981, 2020. 3

work page 1956
[24]

Enhancing vision-language com- positional understanding with multimodal synthetic data

Haoxin Li and Boyang Li. Enhancing vision-language com- positional understanding with multimodal synthetic data. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 24849–24861, 2025. 3

work page 2025
[25]

Microsoft coco: Common objects in context

Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Doll´ar, and C Lawrence Zitnick. Microsoft coco: Common objects in context. In Computer vision–ECCV 2014: 13th European conference, zurich, Switzerland, September 6-12, 2014, proceedings, part v 13, pages 740–755. Springer, 2014. 2

work page 2014
[26]

Crepe: Can vision-language foundation models reason compositionally? InProceedings of the IEEE/CVF Conference on Computer Vision and Pat- tern Recognition, pages 10910–10921, 2023

Zixian Ma, Jerry Hong, Mustafa Omer Gul, Mona Gandhi, Irena Gao, and Ranjay Krishna. Crepe: Can vision-language foundation models reason compositionally? InProceedings of the IEEE/CVF Conference on Computer Vision and Pat- tern Recognition, pages 10910–10921, 2023. 2

work page 2023
[27]

Docci: De- scriptions of connected and contrasting images

Yasumasa Onoe, Sunayana Rane, Zachary Berger, Yonatan Bitton, Jaemin Cho, Roopal Garg, Alexander Ku, Zarana Parekh, Jordi Pont-Tuset, Garrett Tanzer, et al. Docci: De- scriptions of connected and contrasting images. InEuropean Conference on Computer Vision, pages 291–309. Springer,

work page
[28]

Representation Learning with Contrastive Predictive Coding

Aaron van den Oord, Yazhe Li, and Oriol Vinyals. Repre- sentation learning with contrastive predictive coding.arXiv preprint arXiv:1807.03748, 2018. 2

work page internal anchor Pith review Pith/arXiv arXiv 2018
[29]

Cats and dogs

Omkar M Parkhi, Andrea Vedaldi, Andrew Zisserman, and CV Jawahar. Cats and dogs. In2012 IEEE conference on computer vision and pattern recognition, pages 3498–3505. IEEE, 2012. 2

work page 2012
[30]

Triplet- clip: Improving compositional reasoning of clip via synthetic vision-language negatives.Advances in neural information processing systems, 37:32731–32760, 2024

Maitreya Patel, Naga Sai Abhiram Kusumba, Sheng Cheng, Changhoon Kim, Tejas Gokhale, Chitta Baral, et al. Triplet- clip: Improving compositional reasoning of clip via synthetic vision-language negatives.Advances in neural information processing systems, 37:32731–32760, 2024. 2

work page 2024
[31]

Combined scal- ing for zero-shot transfer learning.Neurocomputing, 555: 126658, 2023

Hieu Pham, Zihang Dai, Golnaz Ghiasi, Kenji Kawaguchi, Hanxiao Liu, Adams Wei Yu, Jiahui Yu, Yi-Ting Chen, Minh-Thang Luong, Yonghui Wu, et al. Combined scal- ing for zero-shot transfer learning.Neurocomputing, 555: 126658, 2023. 2

work page 2023
[32]

Flickr30k entities: Collecting region-to-phrase corre- spondences for richer image-to-sentence models

Bryan A Plummer, Liwei Wang, Chris M Cervantes, Juan C Caicedo, Julia Hockenmaier, and Svetlana Lazeb- nik. Flickr30k entities: Collecting region-to-phrase corre- spondences for richer image-to-sentence models. InPro- ceedings of the IEEE international conference on computer vision, pages 2641–2649, 2015. 2

work page 2015
[33]

Connecting vision and lan- guage with localized narratives

Jordi Pont-Tuset, Jasper Uijlings, Soravit Changpinyo, Radu Soricut, and Vittorio Ferrari. Connecting vision and lan- guage with localized narratives. InComputer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23– 28, 2020, Proceedings, Part V 16, pages 647–664. Springer,

work page 2020
[34]

Learning transferable visual models from natural language supervi- sion

Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervi- sion. InInternational conference on machine learning, pages 8748–8763. PmLR, 2021. 1, 2, 8

work page 2021
[35]

Imagenet large scale visual recognition challenge.International journal of computer vision, 115:211–252, 2015

Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, San- jeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, et al. Imagenet large scale visual recognition challenge.International journal of computer vision, 115:211–252, 2015. 2

work page 2015
[36]

Large-scale Classification of Fine-Art Paintings: Learning The Right Metric on The Right Feature

Babak Saleh and Ahmed Elgammal. Large-scale classifica- tion of fine-art paintings: Learning the right metric on the right feature.arXiv preprint arXiv:1505.00855, 2015. 3

work page internal anchor Pith review Pith/arXiv arXiv 2015
[37]

LAION-400M: Open Dataset of CLIP-Filtered 400 Million Image-Text Pairs

Christoph Schuhmann, Richard Vencu, Romain Beaumont, Robert Kaczmarczyk, Clayton Mullis, Aarush Katta, Theo Coombes, Jenia Jitsev, and Aran Komatsuzaki. Laion-400m: Open dataset of clip-filtered 400 million image-text pairs. arXiv preprint arXiv:2111.02114, 2021. 3

work page internal anchor Pith review Pith/arXiv arXiv 2021
[38]

Conceptual captions: A cleaned, hypernymed, im- age alt-text dataset for automatic image captioning

Piyush Sharma, Nan Ding, Sebastian Goodman, and Radu Soricut. Conceptual captions: A cleaned, hypernymed, im- age alt-text dataset for automatic image captioning. InPro- ceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 2556–2565, 2018. 3

work page 2018
[39]

Yfcc100m: The new data in multimedia research

Bart Thomee, David A Shamma, Gerald Friedland, Ben- jamin Elizalde, Karl Ni, Douglas Poland, Damian Borth, and Li-Jia Li. Yfcc100m: The new data in multimedia research. Communications of the ACM, 59(2):64–73, 2016. 3

work page 2016
[40]

Winoground: Probing vision and language models for visio- linguistic compositionality

Tristan Thrush, Ryan Jiang, Max Bartolo, Amanpreet Singh, Adina Williams, Douwe Kiela, and Candace Ross. Winoground: Probing vision and language models for visio- linguistic compositionality. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 5238–5248, 2022. 1, 2

work page 2022
[41]

A picture is worth more than 77 text tokens: Evaluating clip-style models on dense captions

Jack Urbanek, Florian Bordes, Pietro Astolfi, Mary Williamson, Vasu Sharma, and Adriana Romero-Soriano. A picture is worth more than 77 text tokens: Evaluating clip-style models on dense captions. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 26700–26709, 2024. 1, 2, 3, 4

work page 2024
[42]

Lotlip: Improving language-image pre-training for long text understanding.arXiv preprint arXiv:2410.05249, 2024

Wei Wu, Kecheng Zheng, Shuailei Ma, Fan Lu, Yuxin Guo, Yifei Zhang, Wei Chen, Qingpei Guo, Yujun Shen, and Zheng-Jun Zha. Lotlip: Improving language-image pre-training for long text understanding.arXiv preprint arXiv:2410.05249, 2024. 3

work page arXiv 2024
[43]

When are lemons purple? the concept asso- ciation bias of vision-language models.arXiv preprint arXiv:2212.12043, 2022

Yutaro Yamada, Yingtian Tang, Yoyo Zhang, and Ilker Yildirim. When are lemons purple? the concept asso- ciation bias of vision-language models.arXiv preprint arXiv:2212.12043, 2022. 1, 2

work page arXiv 2022
[44]

A model and an hypothesis for language structure.Proceedings of the American philosophical soci- ety, 104(5):444–466, 1960

Victor H Yngve. A model and an hypothesis for language structure.Proceedings of the American philosophical soci- ety, 104(5):444–466, 1960. 13

work page 1960
[45]

International Conference on Learning Representations (ICLR) , year =

Mert Yuksekgonul, Federico Bianchi, Pratyusha Kalluri, Dan Jurafsky, and James Zou. When and why vision- language models behave like bags-of-words, and what to do about it?arXiv preprint arXiv:2210.01936, 2022. 1, 2

work page arXiv 2022
[46]

Sigmoid loss for language image pre-training

Xiaohua Zhai, Basil Mustafa, Alexander Kolesnikov, and Lucas Beyer. Sigmoid loss for language image pre-training. InProceedings of the IEEE/CVF international conference on computer vision, pages 11975–11986, 2023. 4, 11

work page 2023
[47]

Long-clip: Unlocking the long-text capability of clip

Beichen Zhang, Pan Zhang, Xiaoyi Dong, Yuhang Zang, and Jiaqi Wang. Long-clip: Unlocking the long-text capability of clip. InEuropean Conference on Computer Vision, pages 310–325. Springer, 2024. 1, 2, 3, 4

work page 2024
[48]

Con- trasting intra-modal and ranking cross-modal hard negatives to enhance visio-linguistic compositional understanding

Le Zhang, Rabiul Awal, and Aishwarya Agrawal. Con- trasting intra-modal and ranking cross-modal hard negatives to enhance visio-linguistic compositional understanding. In Proceedings of the IEEE/CVF Conference on Computer Vi- sion and Pattern Recognition, pages 13774–13784, 2024. 3

work page 2024
[49] Tiancheng Zhao, Tianqi Zhang, Mingwei Zhu, Haozhan Shen, Kyusong Lee, Xiaopeng Lu, and Jianwei Yin. An explainable toolbox for evaluating pre-trained vision-language models. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pages 30–37, 2022.
[50] Kecheng Zheng, Yifei Zhang, Wei Wu, Fan Lu, Shuailei Ma, Xin Jin, Wei Chen, and Yujun Shen. Dreamlip: Language-image pre-training with long captions. In European Conference on Computer Vision, pages 73–90. Springer, 2024.
[51] Bolei Zhou, Hang Zhao, Xavier Puig, Tete Xiao, Sanja Fidler, Adela Barriuso, and Antonio Torralba. Semantic understanding of scenes through the ade20k dataset. International Journal of Computer Vision, 127:302–321, 2019.

A. Training Parameters

In this section, we present the training parameters for our models. Models are named after the datasets on w...
