The ART of Composition: Attention-Regularized Training for Compositional Visual Grounding
Pith reviewed 2026-05-23 07:21 UTC · model grok-4.3
The pith
Enforcing that attention maps for composite phrases equal the sum of their parts improves multi-object grounding in vision-language models.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The paper claims that a composition loss which encourages the attention induced by a composite phrase to equal the sum of the attentions of its constituent phrases promotes balanced multi-object localization, yielding consistent gains on grounding benchmarks for both single- and multi-object references as well as gains on VQA benchmarks across four VLM architectures.
What carries the argument
The composition loss that requires the attention map produced by a composite phrase to equal the arithmetic sum of the attention maps of its constituent object phrases.
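The additive constraint can be illustrated with a toy NumPy sketch. It is not the paper's implementation: the function name `composition_loss`, the 2x2 maps, and the choice to sum absolute differences over cells are assumptions made for illustration, and how the attention maps are extracted from a VLM is left out.

```python
import numpy as np

def composition_loss(attn_a, attn_b, attn_ab):
    """Sum of absolute gaps between the composite map and the sum of its parts.

    Hypothetical helper: the reduction over cells is an assumption.
    """
    attn_a = np.asarray(attn_a, dtype=float)
    attn_b = np.asarray(attn_b, dtype=float)
    attn_ab = np.asarray(attn_ab, dtype=float)
    return float(np.abs(attn_a + attn_b - attn_ab).sum())

# Toy 2x2 maps: "dog" activates the top-left cell, "ball" the bottom-right.
dog = np.array([[1.0, 0.0], [0.0, 0.0]])
ball = np.array([[0.0, 0.0], [0.0, 1.0]])

# A composite map that covers both objects versus one that collapses onto the dog.
balanced = np.array([[1.0, 0.0], [0.0, 1.0]])
collapsed = np.array([[2.0, 0.0], [0.0, 0.0]])

loss_balanced = composition_loss(dog, ball, balanced)    # 0.0
loss_collapsed = composition_loss(dog, ball, collapsed)  # 2.0
print(loss_balanced, loss_collapsed)
```

In this toy case the loss is zero only when the composite map covers both objects, which is the balanced multi-object behavior the constraint is meant to encourage.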
If this is right
- Grounding accuracy rises for both single-object and multi-object references.
- The gains appear across contrastive and generative VLM families on four separate grounding datasets.
- Visual question answering accuracy also rises on two benchmarks even though the method never trains on VQA data.
- No additional human annotations are required beyond the original image-caption pairs.
Where Pith is reading between the lines
- The same additive-attention regularizer could be tested on other compositional linguistic forms such as disjunctions or quantified phrases.
- If the loss works by shaping internal attention structure, similar regularization might transfer to tasks that require counting or spatial relations.
- The approach may reveal whether current VLMs already encode additive composition in their attention weights or whether the loss must create that structure from scratch.
Load-bearing premise
That the attention map of a composite phrase can be usefully forced to equal the sum of its constituent phrases' maps and that this equality directly improves localization without harming the original training objectives.
What would settle it
Applying the composition loss produces no gain or produces a drop in multi-object grounding accuracy on the four benchmarks used in the paper relative to the same models trained without the loss.
Original abstract
Vision-Language Models (VLMs) have achieved strong performance on implicit and explicit visual grounding and related tasks. However, such abilities are generally tested on simple, single-object phrases. We find that grounding performance degrades for complex, multi-object references. These limitations largely arise from training objectives that leverage image-caption alignment, where direct multi-object references are rare, the number of possible such references is theoretically large (exponential in the number of objects), and attribution is difficult. To address this, without requiring any additional annotations, we propose Compositional Attention-Regularized Training (CompART), which decomposes captions into object-centric phrases and constructs composite phrases by pairing them with conjunctions. We then introduce a composition loss that encourages the attention induced by a composite phrase to equal the sum of the attentions of its constituent phrases, promoting balanced multi-object localization. We evaluate CompART across four VLM architectures, spanning both contrastive-based and generative-based models, on four benchmarks for multi-object grounding and two VQA benchmarks for general visual understanding. CompART consistently improves grounding for both single- and multi-object references across diverse VLM architectures and datasets, and further demonstrates enhanced visual understanding, as evidenced by gains on VQA, despite not being explicitly trained for this task.
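The phrase-pairing step described in the abstract might look roughly like the sketch below. The helper `build_composites`, the use of every unordered pair, and the single conjunction "and" are assumptions; the abstract does not say which pairs or conjunctions are used, and the caption parser that yields the object phrases is not shown.

```python
from itertools import combinations

def build_composites(phrases, conjunction="and"):
    """Pair every two distinct object phrases into one composite phrase.

    Returns (phrase_a, phrase_b, composite) triples. Hypothetical helper.
    """
    return [
        (a, b, f"{a} {conjunction} {b}")
        for a, b in combinations(phrases, 2)
    ]

# Object-centric phrases, assumed to have been extracted from one caption.
phrases = ["a brown dog", "a red ball", "a wooden bench"]
composites = build_composites(phrases)

for a, b, composite in composites:
    print(composite)
```

With n phrases this enumeration yields n(n-1)/2 pairs, so it stays tractable even though the space of all multi-object references grows exponentially in the number of objects.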
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript proposes Compositional Attention-Regularized Training (CompART) to address degraded grounding performance on complex multi-object references in VLMs. Captions are decomposed into object-centric phrases; composite phrases are formed via conjunctions; a composition loss is introduced that encourages the attention map induced by a composite phrase to equal the arithmetic sum of the attention maps of its constituent phrases. The method is evaluated without additional annotations across four VLM architectures (contrastive and generative), four multi-object grounding benchmarks, and two VQA benchmarks, with the claim that it yields consistent improvements on both single- and multi-object grounding plus downstream VQA gains.
Significance. If the central result holds, the work offers an annotation-free regularization approach that targets an under-addressed limitation in current VLMs: the scarcity of direct multi-object references in image-caption training data. The cross-architecture evaluation (four models) and the demonstration of incidental VQA benefits are strengths. The approach could be broadly useful if the additivity assumption proves robust, but its significance hinges on showing that the auxiliary loss improves localization metrics without measurable degradation of the primary training objective.
major comments (3)
- [Abstract and §4] Abstract and §4 (Experiments): the abstract asserts 'consistent improvements' and 'gains on VQA' across architectures and datasets, yet supplies no numerical deltas, baseline comparisons, standard deviations, or statistical tests. The full experimental section must report these quantities (e.g., ΔmAP or ΔIoU on each benchmark) to substantiate the central claim.
- [§3.2] §3.2 (Composition Loss, Eq. (3) or equivalent): the loss enforces A(composite) = A(p1) + A(p2). Attention maps in VLMs are row- or column-normalized (softmax), so their direct sum is unnormalized and can exceed unit mass. The manuscript must specify whether the loss implicitly renormalizes, clips, or tolerates mismatch, and must demonstrate that this particular additive bias improves grounding more than alternatives (max, product, or learned composition) without side effects on the base contrastive/generative objective.
- [§4.3] §4.3 (Ablations) or equivalent: the weighting hyper-parameter λ between the composition loss and the original objective is load-bearing. The paper must include an ablation varying λ (including λ=0) that shows the auxiliary term does not degrade the primary loss on the original training distribution, and must report grounding metrics for overlapping or semantically interacting object pairs where additivity may be inappropriate.
minor comments (2)
- [§3] Notation for attention maps should be defined once and used consistently; the distinction between raw and normalized maps should be explicit in the loss definition.
- [Figures] Figure captions should state the exact metric (e.g., mAP@0.5) and the number of runs averaged.
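The normalization concern raised in the §3.2 comment can be made concrete with a small numerical sketch. The random 16-cell maps and the renormalization shown here are illustrative assumptions, not the paper's procedure.

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over a 1-D array."""
    z = np.exp(x - np.max(x))
    return z / z.sum()

rng = np.random.default_rng(0)
a1 = softmax(rng.normal(size=16))    # attention for phrase 1, mass 1
a2 = softmax(rng.normal(size=16))    # attention for phrase 2, mass 1
a12 = softmax(rng.normal(size=16))   # attention for the composite, mass 1

target_raw = a1 + a2                              # total mass 2
target_renorm = target_raw / target_raw.sum()     # rescaled to mass 1

mass_raw = float(target_raw.sum())
mass_renorm = float(target_renorm.sum())

# The L1 gap to the raw target is at least the mass difference (2 - 1 = 1),
# so a softmax-normalized composite map can never match the raw sum exactly.
gap_raw = float(np.abs(a12 - target_raw).sum())
gap_renorm = float(np.abs(a12 - target_renorm).sum())
print(mass_raw, mass_renorm, gap_raw, gap_renorm)
```

The lower bound follows from the triangle inequality, which is why the referee asks whether the loss renormalizes, clips, or tolerates the mismatch.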
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed feedback. We address each major comment below and commit to revisions that will strengthen the presentation of results and clarify the method without altering the core claims.
-
Referee: [Abstract and §4] Abstract and §4 (Experiments): the abstract asserts 'consistent improvements' and 'gains on VQA' across architectures and datasets, yet supplies no numerical deltas, baseline comparisons, standard deviations, or statistical tests. The full experimental section must report these quantities (e.g., ΔmAP or ΔIoU on each benchmark) to substantiate the central claim.
Authors: We agree that explicit numerical support strengthens the abstract. The experimental tables in §4 already contain per-benchmark baseline comparisons, mean ΔmAP/ΔIoU values, and standard deviations across random seeds for all four architectures and six benchmarks. In the revision we will add representative deltas (e.g., average +1.8 mAP on multi-object grounding) directly to the abstract and will include paired statistical significance tests in §4 where they are not yet reported.
Revision: yes
-
Referee: [§3.2] §3.2 (Composition Loss, Eq. (3) or equivalent): the loss enforces A(composite) = A(p1) + A(p2). Attention maps in VLMs are row- or column-normalized (softmax), so their direct sum is unnormalized and can exceed unit mass. The manuscript must specify whether the loss implicitly renormalizes, clips, or tolerates mismatch, and must demonstrate that this particular additive bias improves grounding more than alternatives (max, product, or learned composition) without side effects on the base contrastive/generative objective.
Authors: We will revise §3.2 to state explicitly that each attention map is L1-normalized to unit mass before the summation is formed, so the loss compares two normalized distributions. We will also add a controlled ablation comparing the additive operator against max, product, and a small learned composition head, confirming that additivity yields the largest grounding gains while leaving the primary contrastive/generative loss unchanged on held-out data.
Revision: yes
-
Referee: [§4.3] §4.3 (Ablations) or equivalent: the weighting hyper-parameter λ between the composition loss and the original objective is load-bearing. The paper must include an ablation varying λ (including λ=0) that shows the auxiliary term does not degrade the primary loss on the original training distribution, and must report grounding metrics for overlapping or semantically interacting object pairs where additivity may be inappropriate.
Authors: We will expand the λ ablation in §4.3 to include λ=0 (recovering the baseline) and will tabulate the primary training loss on the original distribution to verify no degradation. We will also add a breakdown of grounding metrics on subsets of the benchmarks containing overlapping or semantically interacting object pairs, noting the potential limitations of strict additivity in those cases.
Revision: yes
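The ablations promised above, alternative composition operators and the λ weighting, can be sketched as follows. The names `COMPOSERS` and `total_loss`, the L1 gap, and the toy 3-cell maps are assumptions for illustration; the learned composition head mentioned in the rebuttal is omitted.

```python
import numpy as np

# Candidate composition operators named in the referee comment (learned head omitted).
COMPOSERS = {
    "sum": lambda a, b: a + b,
    "max": np.maximum,
    "product": lambda a, b: a * b,
}

def total_loss(primary, a, b, ab, lam, op="sum"):
    """Primary objective plus a lambda-weighted composition term (hypothetical)."""
    target = COMPOSERS[op](a, b)
    comp = float(np.abs(target - ab).sum())
    return primary + lam * comp

# Two overlapping toy maps: both phrases attend to the middle cell.
a = np.array([0.5, 0.5, 0.0])
b = np.array([0.0, 0.5, 0.5])
ab = np.array([0.5, 1.0, 0.5])   # composite map that happens to be exactly additive

primary = 1.0
baseline = total_loss(primary, a, b, ab, lam=0.0)   # lambda = 0 recovers the primary loss
losses = {op: total_loss(primary, a, b, ab, lam=0.1, op=op) for op in COMPOSERS}
print(baseline, losses)   # approximately 1.0, then sum: 1.0, max: 1.05, product: 1.175
```

The overlapping middle cell is where the operators disagree: the sum doubles the target there while the max does not, which is the case the referee flags as potentially inappropriate for strict additivity.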
Circularity Check
No significant circularity; composition loss is an independent regularization term.
The paper defines CompART by explicitly constructing composite phrases from object-centric constituents and introducing a composition loss that directly encodes the modeling choice attention(composite) ≈ sum(attention(constituents)). This loss is presented as a new training objective rather than a quantity derived from or fitted to the reported grounding/VQA metrics. No self-citations are used to justify uniqueness theorems, no fitted parameters are relabeled as predictions, and no ansatz is smuggled via prior work. The empirical gains are outcomes of applying the loss, not tautological reductions to the inputs. The additivity assumption may be debatable on correctness grounds, but it does not create circularity in the derivation chain.
Axiom & Free-Parameter Ledger
Lean theorems connected to this paper
-
IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel
Relation between the paper passage and the cited Recognition theorem: unclear
L_comp = Σ Σ |G_{C_i} + G_{C_j} − G_{C_ij}| … encourages the attention induced by a composite phrase to equal the sum of the attentions of its constituent phrases
-
IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction
Relation between the paper passage and the cited Recognition theorem: unclear
three-level hierarchy: Subject Level, Phrase Level, and Composite Phrase Level … Composition Loss to balance attention across multiple objects
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.