Long Story Short: Disentangling Compositionality and Long-Caption Understanding in Contrastive VLMs
Pith reviewed 2026-05-18 14:37 UTC · model grok-4.3
The pith
High-quality long-caption data with strong visual grounding simultaneously improves compositional reasoning and long-caption understanding in contrastive vision-language models.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Through controlled experiments across diverse training objectives, datasets, and architectural designs, we find a bidirectional but sensitive relationship between compositional reasoning and long-caption understanding. Models trained on poorly grounded captions or with limited parameter updates fail to generalize, while high-quality long-caption data with strong visual grounding promotes both capabilities simultaneously. Architectural choices such as frozen positional embeddings can inadvertently limit compositional learning while aiming to preserve general alignment.
What carries the argument
Controlled experiments that vary training objectives, datasets, and architectural designs to isolate transfer effects between compositionality and long-caption understanding.
If this is right
- High-quality long-caption data with strong visual grounding promotes both compositional reasoning and long-caption understanding at the same time.
- Models trained on poorly grounded captions or with only limited parameter updates fail to generalize these capabilities.
- Frozen positional embeddings can limit compositional learning even when intended to preserve overall alignment.
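The interaction between frozen parameters and limited updates can be made concrete with a toy update rule. The sketch below is only an illustration in plain Python, not the paper's training code: the parameter-group names `pos_embed` and `token_embed`, the gradient values, and the learning rate are all hypothetical. It shows that a frozen group receives no update regardless of its gradient, which is how a design that preserves existing alignment also removes any chance of that group adapting to longer or more compositional captions.

```python
def sgd_step(params, grads, lr=0.1, frozen=()):
    """Return updated parameters, leaving any group named in `frozen` untouched."""
    updated = {}
    for name, values in params.items():
        if name in frozen:
            # Frozen groups are copied as-is; their gradients are ignored.
            updated[name] = list(values)
        else:
            # Plain gradient descent on every trainable group.
            updated[name] = [v - lr * g for v, g in zip(values, grads[name])]
    return updated


# Hypothetical parameter groups and gradients, for illustration only.
params = {"pos_embed": [0.5, -0.2], "token_embed": [0.1, 0.3]}
grads = {"pos_embed": [1.0, 1.0], "token_embed": [1.0, -1.0]}

after = sgd_step(params, grads, lr=0.1, frozen=("pos_embed",))
```

Here `after["pos_embed"]` equals the original values while `after["token_embed"]` has moved, even though both groups had non-zero gradients.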
Where Pith is reading between the lines
- Prioritizing visual grounding during data curation may advance both skills more reliably than simply increasing model scale.
- A curriculum that builds from grounded short captions toward longer compositional ones could efficiently develop both abilities together.
- The sensitive relationship observed here may appear in other multimodal tasks that require detailed visual-textual alignment.
Load-bearing premise
The controlled experiments across diverse training objectives, datasets, and architectural designs sufficiently isolate the effects of compositionality versus long-caption understanding without major confounding from unmeasured factors such as dataset biases or model scale differences.
What would settle it
Training models on high-quality long-caption data with strong visual grounding and observing no corresponding gains on compositional reasoning benchmarks, while holding other factors fixed, would falsify the claim of bidirectional promotion.
Original abstract
Contrastive vision-language models (VLMs) have made significant progress in binding visual and textual information, yet understanding long, compositional captions remains an open challenge. While these capabilities are often assumed to be closely related, the conditions under which they reinforce each other remain unclear. In this paper, we empirically analyze when compositional reasoning and long-caption understanding transfer across tasks, and when this relationship fails. Through controlled experiments across diverse training objectives, datasets, and architectural designs, we find a bidirectional but sensitive relationship between the two capabilities. Models trained on poorly grounded captions or with limited parameter updates fail to generalize, while high-quality long-caption data with strong visual grounding promotes both capabilities simultaneously. We further show that architectural choices aimed at preserving general alignment, such as frozen positional embeddings, can inadvertently limit compositional learning. Our analysis provides actionable guidelines for data selection and model design to improve VLM generalization.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper empirically analyzes the relationship between compositional reasoning and long-caption understanding in contrastive vision-language models. Through controlled experiments varying training objectives, datasets, and architectural designs, it reports a bidirectional but sensitive relationship: models trained on poorly grounded captions or with limited updates fail to generalize, while high-quality long-caption data with strong visual grounding promotes both capabilities simultaneously. It further identifies that choices like frozen positional embeddings can limit compositional learning and offers guidelines for data selection and model design.
Significance. If the central empirical findings hold after addressing potential confounds, the work would be significant for multimodal learning by clarifying when and how compositionality and long-caption capabilities transfer in VLMs. It supplies actionable, data- and architecture-focused recommendations that could improve generalization on complex visual-textual tasks, an open challenge in the field. The use of diverse conditions across objectives and architectures is a strength, though the value hinges on whether the experiments truly isolate the two capabilities.
major comments (1)
- [Experimental Setup / Dataset Selection] Dataset construction and experimental design sections: The description of dataset selection does not report explicit controls for caption compositionality metrics (e.g., scene graph density or dependency parse depth) independent of length and grounding quality. This is load-bearing for the central claim because if 'high-quality long-caption' datasets systematically differ in compositional structure due to shared curation or visual complexity, the observed bidirectional transfer and 'sensitive relationship' could be driven by a single latent factor rather than independent reinforcement across conditions.
minor comments (2)
- [Abstract] Abstract and results presentation: The abstract summarizes the bidirectional relationship clearly but would benefit from a brief quantitative sense of effect sizes or number of conditions tested to help readers gauge the strength of the reported transfers.
- [Results / Figures] Figure and table captions: Ensure all result visualizations explicitly label the training conditions (e.g., what constitutes 'poorly grounded' captions quantitatively) so that the isolation of compositionality versus length/grounding effects is immediately verifiable.
Simulated Author's Rebuttal
We thank the referee for their constructive and detailed feedback, which helps clarify the strength of our empirical claims. We respond to the major comment below and will revise the manuscript to incorporate additional controls and analysis.
Point-by-point responses
- Referee: [Experimental Setup / Dataset Selection] Dataset construction and experimental design sections: The description of dataset selection does not report explicit controls for caption compositionality metrics (e.g., scene graph density or dependency parse depth) independent of length and grounding quality. This is load-bearing for the central claim because if 'high-quality long-caption' datasets systematically differ in compositional structure due to shared curation or visual complexity, the observed bidirectional transfer and 'sensitive relationship' could be driven by a single latent factor rather than independent reinforcement across conditions.
  Authors: We agree that reporting explicit compositionality metrics would further isolate the contributions of grounding quality and length. Our dataset selection prioritized variation in visual grounding and caption length as the primary axes, with multiple training objectives and architectural ablations used to test transfer. However, we did not include quantitative compositionality statistics such as scene graph density or dependency parse depth in the original submission. In the revision we will add these metrics (computed via standard parsers and scene graph tools) for all datasets in a new table or appendix, along with a brief discussion of their correlation with the observed effects. This will allow readers to assess whether compositional structure acts as a latent confound. Our cross-condition results (e.g., training on short vs. long captions while holding architecture fixed) already provide evidence that the relationship is not reducible to a single factor, but the added metrics will make this explicit.
  Revision: yes
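The dependency parse depth metric raised in this exchange, and the check for whether it merely tracks caption length, can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' pipeline: the head-index encoding (one head position per token, `-1` for the root) is a hypothetical intermediate format that real parser output would have to be converted into, and the helper names `parse_depth` and `pearson` are invented for this example.

```python
from statistics import mean


def parse_depth(heads):
    """Depth of a dependency tree given as a list of head indices.

    heads[i] is the position of token i's head, or -1 if token i is the root.
    The depth is the longest token-to-root path, counted in edges.
    """
    def depth(i):
        steps = 0
        while heads[i] != -1:
            i = heads[i]
            steps += 1
        return steps

    return max(depth(i) for i in range(len(heads)))


def pearson(xs, ys):
    """Pearson correlation between two equal-length numeric sequences."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


# "a dog chases a red ball": the verb is the root; determiners and the
# adjective attach to their nouns (a hand-written, hypothetical parse).
example_heads = [1, 2, -1, 5, 5, 2]
example_depth = parse_depth(example_heads)
```

Computing `parse_depth` and token count per caption, then `pearson` across a dataset, gives one rough indication of how strongly depth and length move together. A correlation near 1 would suggest the two are hard to separate in that dataset, which is the confound the referee describes.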
Circularity Check
No circularity: empirical findings from controlled experiments
The paper conducts an empirical analysis of VLMs through controlled experiments varying training objectives, datasets, and architectures. It reports observed performance differences supporting a bidirectional but sensitive relationship between compositionality and long-caption understanding. No mathematical derivation, first-principles result, or predictive equation is presented that could reduce to fitted inputs or self-referential definitions by construction. The central claims rest on experimental outcomes rather than any chain that equates outputs to inputs via self-definition, renaming, or load-bearing self-citation. This is the normal case for an experimental study whose validity can be assessed against external benchmarks.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Contrastive objectives in VLMs capture meaningful visual-textual alignment that can be measured through downstream task performance.
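For readers unfamiliar with what a contrastive objective computes, the sketch below shows one common form: a CLIP-style symmetric cross-entropy over an image-caption similarity matrix. It is an assumption that this form is representative, since this page does not specify which objectives the paper trains with. The function names and the temperature value of 0.07 are illustrative choices, and the code uses plain Python lists rather than any particular deep learning library.

```python
import math


def log_softmax_row(row):
    """Numerically stable log-softmax of one list of logits."""
    m = max(row)
    lse = m + math.log(sum(math.exp(v - m) for v in row))
    return [v - lse for v in row]


def symmetric_contrastive_loss(sim, temperature=0.07):
    """Symmetric contrastive loss over a square similarity matrix.

    sim[i][j] is the similarity between image i and caption j; matched
    pairs sit on the diagonal. The loss averages the image-to-text and
    text-to-image cross-entropies.
    """
    n = len(sim)
    logits = [[v / temperature for v in row] for row in sim]
    logits_t = [list(col) for col in zip(*logits)]  # transpose for text-to-image
    i2t = -sum(log_softmax_row(logits[i])[i] for i in range(n)) / n
    t2i = -sum(log_softmax_row(logits_t[i])[i] for i in range(n)) / n
    return 0.5 * (i2t + t2i)


# Toy 2x2 cases: perfectly aligned pairs versus swapped pairs.
aligned = symmetric_contrastive_loss([[1.0, 0.0], [0.0, 1.0]])
swapped = symmetric_contrastive_loss([[0.0, 1.0], [1.0, 0.0]])
```

The loss is low when each image is most similar to its own caption and high when the pairing is swapped. Downstream benchmark performance is then used as the measurable proxy for alignment quality, which is the assumption recorded in the ledger entry above.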
A. Training Parameters
In this section, we present the training parameters for our models. Models are named after the datasets on w...