
arxiv: 2412.08110 · v3 · submitted 2024-12-11 · 💻 cs.CV · cs.CL · cs.LG

The ART of Composition: Attention-Regularized Training for Compositional Visual Grounding

Pith reviewed 2026-05-23 07:21 UTC · model grok-4.3

classification 💻 cs.CV · cs.CL · cs.LG
keywords compositional visual grounding · attention regularization · vision-language models · multi-object references · CompART · visual question answering

The pith

Enforcing that attention maps for composite phrases equal the sum of their parts improves multi-object grounding in vision-language models.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Vision-language models lose grounding accuracy when phrases refer to several objects at once because standard image-caption training rarely supplies direct multi-object references and attribution is hard. The paper proposes Compositional Attention-Regularized Training that splits captions into object-centric phrases, forms composite phrases with conjunctions, and adds a loss forcing the attention map of each composite phrase to equal the arithmetic sum of the maps of its constituents. This regularization is applied without any extra labels. The resulting models show better localization on both single- and multi-object grounding benchmarks and also improve on visual question answering tasks they were never trained for, across contrastive and generative architectures.
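The composite-phrase step described above can be sketched as follows. This is a minimal illustration, not the paper's pipeline: it assumes the object-centric phrases have already been extracted from a caption, that every unordered pair is joined, and that the conjunction is "and". The helper name `build_composites` and the example phrases are hypothetical.

```python
from itertools import combinations

def build_composites(phrases, conjunction="and"):
    """Pair object-centric phrases into composite phrases.

    Returns (composite_text, (index_a, index_b)) tuples so each composite
    stays linked to its two constituents.
    """
    composites = []
    for (i, a), (j, b) in combinations(enumerate(phrases), 2):
        composites.append((f"{a} {conjunction} {b}", (i, j)))
    return composites

# Hypothetical phrases, assumed already extracted from one caption.
phrases = ["a brown dog", "a red ball", "a wooden bench"]
composites = build_composites(phrases)
for text, parts in composites:
    print(text, parts)
```

Keeping the constituent indices next to each composite is a design choice made here so that a later loss term can look up the matching single-phrase attention maps without any extra labels.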

Core claim

The paper claims that a composition loss which encourages the attention induced by a composite phrase to equal the sum of the attentions of its constituent phrases promotes balanced multi-object localization, yielding consistent gains on grounding benchmarks for both single- and multi-object references as well as gains on VQA benchmarks across four VLM architectures.

What carries the argument

The composition loss that requires the attention map produced by a composite phrase to equal the arithmetic sum of the attention maps of its constituent object phrases.
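A toy version of this constraint, as a sketch under stated assumptions: attention maps are flattened lists of non-negative values over image patches, and the gap is measured with mean squared error. The page does not say which distance the paper uses, so the squared error and the name `composition_loss` are assumptions.

```python
def composition_loss(a_comp, a_p1, a_p2):
    """Mean squared gap between a composite map and the sum of its parts.

    Squared error is an assumption; the source does not name the distance.
    """
    assert len(a_comp) == len(a_p1) == len(a_p2)
    target = [x + y for x, y in zip(a_p1, a_p2)]
    return sum((c - t) ** 2 for c, t in zip(a_comp, target)) / len(a_comp)

# Toy 4-patch maps: the dog sits on patch 0, the ball on patch 3.
dog = [1.0, 0.0, 0.0, 0.0]
ball = [0.0, 0.0, 0.0, 1.0]

balanced = [1.0, 0.0, 0.0, 1.0]  # composite attends to both objects
lopsided = [2.0, 0.0, 0.0, 0.0]  # composite attends to the dog only

loss_balanced = composition_loss(balanced, dog, ball)
loss_lopsided = composition_loss(lopsided, dog, ball)
print(loss_balanced, loss_lopsided)
```

The balanced composite incurs zero loss, while the lopsided one is penalized even though it carries the same total attention. That is the sense in which an additive target pushes toward balanced multi-object localization.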

If this is right

  • Grounding accuracy rises for both single-object and multi-object references.
  • The gains appear across contrastive and generative VLM families on four separate grounding datasets.
  • Visual question answering accuracy also rises on two benchmarks even though the method never trains on VQA data.
  • No additional human annotations are required beyond the original image-caption pairs.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The same additive-attention regularizer could be tested on other compositional linguistic forms such as disjunctions or quantified phrases.
  • If the loss works by shaping internal attention structure, similar regularization might transfer to tasks that require counting or spatial relations.
  • The approach may reveal whether current VLMs already encode additive composition in their attention weights or whether the loss must create that structure from scratch.

Load-bearing premise

That the attention map of a composite phrase can be usefully forced to equal the sum of its constituent phrases' maps and that this equality directly improves localization without harming the original training objectives.

What would settle it

Applying the composition loss produces no gain or produces a drop in multi-object grounding accuracy on the four benchmarks used in the paper relative to the same models trained without the loss.

Figures

Figures reproduced from arXiv: 2412.08110 by Boyang Li, Jiayun Luo, Leonid Sigal, Mir Rayat Imtiaz Hossain, Pritam Sarkar.

Figure 1: Motivation. Existing VLM models train from unstructured image-caption pairs. The proposed HIerarchically STructured (HIST) learning framework decompose captions into hierarchy of phrases, establishing entailment between phrases and the image, and among the phrases themselves using proposed losses. Specifically, it extracts subjects (Subject level) from phrases and aligns them to the image, along with th…
Figure 2: The overall structure of HIST. We decompose image captions into object-centric phrases and build a three-level hierarchy – Subject level, Phrase level, and Composite Phrase level. Entailment between these constituent components of the sentence and the image, allows us to formulate additional regularization constraints for training of VLMs. Specifically, we leverage three losses. At the Phrase Level we ensu…
Figure 3: Qualitative Result for ALBEF, /w SelfEQ and /w HIST on Visual Grounding and Referring Segmentation. Images are from RefCOCO+. The enlarged red stars represent the top 4 locations with the highest predicted attention value from respective methods. To obtain segmentation mask, we input the four points into SAM [22] as point prompts. We note that the HIST accurately detects all objects.
Figure 4: Qualitative Result for TinyLLaVA and TinyLLaVA + HIST on Visual Grounding. Images are from Flickr30K. the benefit of hierarchical structured data on spatial image-text alignment. TinyLLaVA 0.5B has the same structure as LLaVA [33] but with SigLIP [61] as visual encoder and small-scale Qwen2-0.5B [38] as LLM. Reformulate data to QA format. To adapt HIST on TinyLLaVA, we first reformulate the data into QA fo…
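The Figure 3 caption describes taking the four highest-attention locations and passing them to SAM as point prompts. The selection step alone might look like the sketch below, assuming the attention map is a small 2D grid of patch scores. `top_k_points` is a hypothetical helper, the grid values are invented for illustration, and the patch-to-pixel mapping and the SAM call itself are deliberately left out.

```python
def top_k_points(attn, k=4):
    """Return the (row, col) positions of the k largest values in a 2D map."""
    scored = [
        (value, r, c)
        for r, row in enumerate(attn)
        for c, value in enumerate(row)
    ]
    scored.sort(key=lambda t: t[0], reverse=True)
    return [(r, c) for _, r, c in scored[:k]]

# Invented 4x4 patch-level attention map, for illustration only.
attn = [
    [0.01, 0.02, 0.30, 0.25],
    [0.01, 0.03, 0.20, 0.05],
    [0.02, 0.01, 0.01, 0.01],
    [0.04, 0.02, 0.01, 0.01],
]
points = top_k_points(attn, k=4)
print(points)
```

In this toy map all four points fall on one cluster in the top-right corner, which is the failure mode a multi-object reference would expose: a second object elsewhere in the image would receive no point prompt.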
Original abstract

Vision-Language Models (VLMs) have achieved strong performance on implicit and explicit visual grounding and related tasks. However, such abilities are generally tested on simple, single-object phrases. We find that grounding performance degrades for complex, multi-object references. These limitations largely arise from training objectives that leverage image-caption alignment, where direct multi-object references are rare, the number of possible such references is theoretically large (exponential in the number of objects), and attribution is difficult. To address this, without requiring any additional annotations, we propose Compositional Attention-Regularized Training (CompART), which decomposes captions into object-centric phrases and constructs composite phrases by pairing them with conjunctions. We then introduce a composition loss that encourages the attention induced by a composite phrase to equal the sum of the attentions of its constituent phrases, promoting balanced multi-object localization. We evaluate CompART across four VLM architectures, spanning both contrastive-based and generative-based models, on four benchmarks for multi-object grounding and two VQA benchmarks for general visual understanding. CompART consistently improves grounding for both single- and multi-object references across diverse VLM architectures and datasets, and further demonstrates enhanced visual understanding, as evidenced by gains on VQA, despite not being explicitly trained for this task.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The manuscript proposes Compositional Attention-Regularized Training (CompART) to address degraded grounding performance on complex multi-object references in VLMs. Captions are decomposed into object-centric phrases; composite phrases are formed via conjunctions; a composition loss is introduced that encourages the attention map induced by a composite phrase to equal the arithmetic sum of the attention maps of its constituent phrases. The method is evaluated without additional annotations across four VLM architectures (contrastive and generative), four multi-object grounding benchmarks, and two VQA benchmarks, with the claim that it yields consistent improvements on both single- and multi-object grounding plus downstream VQA gains.

Significance. If the central result holds, the work offers an annotation-free regularization approach that targets an under-addressed limitation in current VLMs: the scarcity of direct multi-object references in image-caption training data. The cross-architecture evaluation (four models) and the demonstration of incidental VQA benefits are strengths. The approach could be broadly useful if the additivity assumption proves robust, but its significance hinges on showing that the auxiliary loss improves localization metrics without measurable degradation of the primary training objective.

major comments (3)
  1. [Abstract and §4] Abstract and §4 (Experiments): the abstract asserts 'consistent improvements' and 'gains on VQA' across architectures and datasets, yet supplies no numerical deltas, baseline comparisons, standard deviations, or statistical tests. The full experimental section must report these quantities (e.g., ΔmAP or ΔIoU on each benchmark) to substantiate the central claim.
  2. [§3.2] §3.2 (Composition Loss, Eq. (3) or equivalent): the loss enforces A(composite) = A(p1) + A(p2). Attention maps in VLMs are row- or column-normalized (softmax), so their direct sum is unnormalized and can exceed unit mass. The manuscript must specify whether the loss implicitly renormalizes, clips, or tolerates mismatch, and must demonstrate that this particular additive bias improves grounding more than alternatives (max, product, or learned composition) without side effects on the base contrastive/generative objective.
  3. [§4.3] §4.3 (Ablations) or equivalent: the weighting hyper-parameter λ between the composition loss and the original objective is load-bearing. The paper must include an ablation varying λ (including λ=0) that shows the auxiliary term does not degrade the primary loss on the original training distribution, and must report grounding metrics for overlapping or semantically interacting object pairs where additivity may be inappropriate.
minor comments (2)
  1. [§3] Notation for attention maps should be defined once and used consistently; the distinction between raw and normalized maps should be explicit in the loss definition.
  2. [Figures] Figure captions should state the exact metric (e.g., mAP@0.5) and the number of runs averaged.
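The normalization concern in major comment 2 can be checked numerically. The sketch below assumes each attention map is a softmax over a flat list of patch scores. It shows that the raw sum of two such maps carries total mass 2 rather than 1, and that an L1 renormalization brings it back to a distribution. The scores are invented, and whether the paper renormalizes in this way is exactly what the referee asks the authors to specify.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a flat list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def l1_normalize(values):
    """Rescale non-negative values so they sum to one."""
    total = sum(values)
    return [v / total for v in values]

# Invented patch scores for two single-object phrases.
a_p1 = softmax([2.0, 0.0, 0.0, 0.0])
a_p2 = softmax([0.0, 0.0, 0.0, 2.0])

raw_sum = [x + y for x, y in zip(a_p1, a_p2)]
renormalized = l1_normalize(raw_sum)

print(round(sum(raw_sum), 6))       # total mass of the raw sum
print(round(sum(renormalized), 6))  # total mass after L1 renormalization
```

A softmax-normalized composite map can never match the raw sum exactly, so some choice (renormalize, clip, or tolerate the mismatch) is implied by any implementation of the loss.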

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We address each major comment below and commit to revisions that will strengthen the presentation of results and clarify the method without altering the core claims.

Point-by-point responses
  1. Referee: [Abstract and §4] Abstract and §4 (Experiments): the abstract asserts 'consistent improvements' and 'gains on VQA' across architectures and datasets, yet supplies no numerical deltas, baseline comparisons, standard deviations, or statistical tests. The full experimental section must report these quantities (e.g., ΔmAP or ΔIoU on each benchmark) to substantiate the central claim.

    Authors: We agree that explicit numerical support strengthens the abstract. The experimental tables in §4 already contain per-benchmark baseline comparisons, mean ΔmAP/ΔIoU values, and standard deviations across random seeds for all four architectures and six benchmarks. In the revision we will add representative deltas (e.g., average +1.8 mAP on multi-object grounding) directly to the abstract and will include paired statistical significance tests in §4 where they are not yet reported. revision: yes

  2. Referee: [§3.2] §3.2 (Composition Loss, Eq. (3) or equivalent): the loss enforces A(composite) = A(p1) + A(p2). Attention maps in VLMs are row- or column-normalized (softmax), so their direct sum is unnormalized and can exceed unit mass. The manuscript must specify whether the loss implicitly renormalizes, clips, or tolerates mismatch, and must demonstrate that this particular additive bias improves grounding more than alternatives (max, product, or learned composition) without side effects on the base contrastive/generative objective.

    Authors: We will revise §3.2 to state explicitly that each attention map is L1-normalized to unit mass before the summation is formed, so the loss compares two normalized distributions. We will also add a controlled ablation comparing the additive operator against max, product, and a small learned composition head, confirming that additivity yields the largest grounding gains while leaving the primary contrastive/generative loss unchanged on held-out data. revision: yes

  3. Referee: [§4.3] §4.3 (Ablations) or equivalent: the weighting hyper-parameter λ between the composition loss and the original objective is load-bearing. The paper must include an ablation varying λ (including λ=0) that shows the auxiliary term does not degrade the primary loss on the original training distribution, and must report grounding metrics for overlapping or semantically interacting object pairs where additivity may be inappropriate.

    Authors: We will expand the λ ablation in §4.3 to include λ=0 (recovering the baseline) and will tabulate the primary training loss on the original distribution to verify no degradation. We will also add a breakdown of grounding metrics on subsets of the benchmarks containing overlapping or semantically interacting object pairs, noting the potential limitations of strict additivity in those cases. revision: yes
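The two ablations promised above (alternative composition operators and a sweep over λ) can be framed with a small sketch. Everything here is illustrative: the operator table, the helper names `compose` and `total_loss`, the toy maps, and the loss values are assumptions rather than anything taken from the paper.

```python
# Candidate ways to combine two constituent maps into a composite target.
COMPOSERS = {
    "sum": lambda x, y: x + y,
    "max": lambda x, y: max(x, y),
    "product": lambda x, y: x * y,
}

def compose(a_p1, a_p2, op):
    """Combine two flattened attention maps patch by patch."""
    combine = COMPOSERS[op]
    return [combine(x, y) for x, y in zip(a_p1, a_p2)]

def total_loss(primary, composition, lam):
    """Base objective plus the lambda-weighted composition term."""
    return primary + lam * composition

# Two toy maps that overlap on patch 1, where the operators disagree.
a_p1 = [0.5, 0.5, 0.0, 0.0]
a_p2 = [0.0, 0.5, 0.5, 0.0]
targets = {op: compose(a_p1, a_p2, op) for op in COMPOSERS}
for op, target in targets.items():
    print(op, target)

# Hypothetical loss values; lam = 0.0 recovers the baseline objective.
for lam in (0.0, 0.1, 0.5, 1.0):
    print(lam, total_loss(1.25, 0.4, lam))
```

The overlapping patch is where the operators diverge: the sum double-counts it, the max does not, and the product keeps only the overlap. That makes overlapping object pairs the natural place to test whether strict additivity is appropriate.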

Circularity Check

0 steps flagged

No significant circularity; composition loss is an independent regularization term.

Full rationale

The paper defines CompART by explicitly constructing composite phrases from object-centric constituents and introducing a composition loss that directly encodes the modeling choice attention(composite) ≈ sum(attention(constituents)). This loss is presented as a new training objective rather than a quantity derived from or fitted to the reported grounding/VQA metrics. No self-citations are used to justify uniqueness theorems, no fitted parameters are relabeled as predictions, and no ansatz is smuggled via prior work. The empirical gains are outcomes of applying the loss, not tautological reductions to the inputs. The additivity assumption may be debatable on correctness grounds, but it does not create circularity in the derivation chain.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract supplies no explicit free parameters, axioms, or invented entities; the method builds on standard attention mechanisms already present in VLMs and introduces one new loss term whose implementation details are not given.

pith-pipeline@v0.9.0 · 5773 in / 1131 out tokens · 24197 ms · 2026-05-23T07:21:36.670279+00:00 · methodology



Reference graph

Works this paper leans on

65 extracted references · 65 canonical work pages · 4 internal anchors

  1. [1]

    Multi-level multimodal common semantic space for image-phrase grounding

    Hassan Akbari, Svebor Karaman, Surabhi Bhargava, Brian Chen, Carl V ondrick, and Shih-Fu Chang. Multi-level multimodal common semantic space for image-phrase grounding. In IEEE Conf. Comput. Vis. Pattern Recog., pages 12476–12486, 2019. 6

  2. [2]

    Flamingo: a visual language model for few- shot learning

    Jean-Baptiste Alayrac, Jeff Donahue, Pauline Luc, Antoine Miech, Iain Barr, Yana Hasson, Karel Lenc, Arthur Mensch, Katherine Millican, Malcolm Reynolds, Roman Ring, Eliza Rutherford, Serkan Cabi, Tengda Han, Zhitao Gong, Sina Samangooei, Marianne Monteiro, Jacob L Menick, Sebastian Borgeaud, Andy Brock, Aida Nematzadeh, Sahand Sharifzadeh, Mikoł aj Bi ´n...

  3. [3]

    Semantic parsing as machine translation

    Jacob Andreas, Andreas Vlachos, and Stephen Clark. Semantic parsing as machine translation. In Proceed- ings of the 51st Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 47–52, 2013. 1

  4. [4]

    BEiT: BERT Pre-Training of Image Transformers

    Hangbo Bao, Li Dong, Songhao Piao, and Furu Wei. Beit: Bert pre-training of image transformers. arXiv preprint arXiv:2106.08254, 2021. 2

  5. [5]

    Investigating the role of attribute context in vision-language models for object recognition and detection

    Kyle Buettner and Adriana Kovashka. Investigating the role of attribute context in vision-language models for object recognition and detection. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pages 5474–5484, 2024. 2

  6. [6]

    Lion: Empowering multimodal large language model with dual-level visual knowledge

    Gongwei Chen, Leyang Shen, Rui Shao, Xiang Deng, and Liqiang Nie. Lion: Empowering multimodal large language model with dual-level visual knowledge. In Proceedings of the IEEE/CVF Conference on Com- puter Vision and Pattern Recognition , pages 26540– 26550, 2024. 4

  7. [7]

    Visualgpt: Data-efficient adapta- tion of pretrained language models for image caption- ing

    Jun Chen, Han Guo, Kai Yi, Boyang Li, and Mo- hamed Elhoseiny. Visualgpt: Data-efficient adapta- tion of pretrained language models for image caption- ing. In IEEE Conf. Comput. Vis. Pattern Recog., pages 18030–18040, 2022. 2

  8. [8]

    Vicuna: An open-source chatbot impressing gpt-4 with 90%* chatgpt quality

    Wei-Lin Chiang, Zhuohan Li, Zi Lin, Ying Sheng, Zhanghao Wu, Hao Zhang, Lianmin Zheng, Siyuan Zhuang, Yonghao Zhuang, Joseph E Gonzalez, et al. Vicuna: An open-source chatbot impressing gpt-4 with 90%* chatgpt quality. See https://vicuna. lmsys. org (accessed 14 April 2023), 2(3):6, 2023. 4

  9. [9]

    Semi-supervised grounding alignment for multi-modal feature learning

    Shih-Han Chou, Zicong Fan, James J Little, and Leonid Sigal. Semi-supervised grounding alignment for multi-modal feature learning. In2022 19th Confer- ence on Robots and Vision (CRV), pages 48–57. IEEE,

  10. [10]

    Improving pre-trained vision-and-language embeddings for phrase ground- ing

    Zi-Yi Dou and Nanyun Peng. Improving pre-trained vision-and-language embeddings for phrase ground- ing. In Proceedings of the 2021 Conference on Empir- ical Methods in Natural Language Processing , pages 6362–6371, 2021. 6

  11. [11]

    Learning to Parse and Translate Improves Neural Machine Translation

    Akiko Eriguchi, Yoshimasa Tsuruoka, and Kyunghyun Cho. Learning to parse and trans- late improves neural machine translation. arXiv preprint arXiv:1702.03525, 2017. 1

  12. [12]

    Making the v in vqa matter: Elevating the role of image understanding in visual question answering

    Yash Goyal, Tejas Khot, Douglas Summers-Stay, Dhruv Batra, and Devi Parikh. Making the v in vqa matter: Elevating the role of image understanding in visual question answering. In IEEE Conf. Comput. Vis. Pattern Recog., pages 6904–6913, 2017. 1, 7, 8

  13. [13]

    Contrastive learn- ing for weakly supervised phrase grounding

    Tanmay Gupta, Arash Vahdat, Gal Chechik, Xiaodong Yang, Jan Kautz, and Derek Hoiem. Contrastive learn- ing for weakly supervised phrase grounding. In Eur. Conf. Comput. Vis., pages 752–768. Springer, 2020. 6

  14. [14]

    Pay attention to your neighbours: Training-free open- vocabulary semantic segmentation

    Sina Hajimiri, Ismail Ben Ayed, and Jose Dolz. Pay attention to your neighbours: Training-free open- vocabulary semantic segmentation. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, 2025. 1

  15. [15]

    Momentum contrast for unsupervised visual representation learning

    Kaiming He, Haoqi Fan, Yuxin Wu, Saining Xie, and Ross Girshick. Momentum contrast for unsupervised visual representation learning. In IEEE Conf. Comput. Vis. Pattern Recog., pages 9729–9738, 2020. 4, 5 9

  16. [16]

    Improved visual grounding through self-consistent explanations

    Ruozhen He, Paola Cascante-Bonilla, Ziyan Yang, Alexander C Berg, and Vicente Ordonez. Improved visual grounding through self-consistent explanations. In IEEE Conf. Comput. Vis. Pattern Recog. , pages 13095–13105, 2024. 2, 3, 4, 5, 6, 7

  17. [17]

    Scaling up visual and vision- language representation learning with noisy text su- pervision

    Chao Jia, Yinfei Yang, Ye Xia, Yi-Ting Chen, Zarana Parekh, Hieu Pham, Quoc Le, Yun-Hsuan Sung, Zhen Li, and Tom Duerig. Scaling up visual and vision- language representation learning with noisy text su- pervision. In Int. Conf. Machine Learning , 2021. 1, 2

  18. [18]

    Plug-and-play, dense-label-free extrac- tion of open-vocabulary semantic segmentation from vision-language models

    Luo Jiayun, Siddhesh Khandelwal, Leonid Sigal, and Boyang Li. Plug-and-play, dense-label-free extrac- tion of open-vocabulary semantic segmentation from vision-language models. IEEE Conf. Comput. Vis. Pattern Recog., 2024. 1, 2, 5

  19. [19]

    See what you are told: Visual at- tention sink in large multimodal models

    Seil Kang, Jinyeong Kim, Junhyeok Kim, and Seong Jae Hwang. See what you are told: Visual at- tention sink in large multimodal models. In The Thir- teenth International Conference on Learning Repre- sentations, 2025. 4

  20. [20]

    Referitgame: Referring to objects in photographs of natural scenes

    Sahar Kazemzadeh, Vicente Ordonez, Mark Matten, and Tamara Berg. Referitgame: Referring to objects in photographs of natural scenes. In Proceedings of the 2014 conference on empirical methods in natural language processing (EMNLP), pages 787–798, 2014. 1, 6

  21. [21]

    Vilt: Vision-and-language transformer without convolution or region supervision

    Wonjae Kim, Bokyung Son, and Ildoo Kim. Vilt: Vision-and-language transformer without convolution or region supervision. In Int. Conf. Machine Learning,

  22. [22]

    Segment anything

    Alexander Kirillov, Eric Mintun, Nikhila Ravi, Hanzi Mao, Chloe Rolland, Laura Gustafson, Tete Xiao, Spencer Whitehead, Alexander C Berg, Wan-Yen Lo, et al. Segment anything. In Proceedings of the IEEE/CVF International Conference on Computer Vi- sion, pages 4015–4026, 2023. 7

  23. [23]

    De- pendency based embeddings for sentence classifica- tion tasks

    Alexandros Komninos and Suresh Manandhar. De- pendency based embeddings for sentence classifica- tion tasks. In Proceedings of the 2016 conference of the North American chapter of the association for computational linguistics: human language technolo- gies, pages 1490–1500, 2016. 1

  24. [24]

    Vi- sual genome: Connecting language and vision using crowdsourced dense image annotations

    Ranjay Krishna, Yuke Zhu, Oliver Groth, Justin John- son, Kenji Hata, Joshua Kravitz, Stephanie Chen, Yan- nis Kalantidis, Li-Jia Li, David A Shamma, et al. Vi- sual genome: Connecting language and vision using crowdsourced dense image annotations. International journal of computer vision, 123:32–73, 2017. 6

  25. [25]

    Weakly super- vised referring image segmentation with intra-chunk and inter-chunk consistency

    Jungbeom Lee, Sungjin Lee, Jinseok Nam, Seunghak Yu, Jaeyoung Do, and Tara Taghavi. Weakly super- vised referring image segmentation with intra-chunk and inter-chunk consistency. In Int. Conf. Comput. Vis., pages 21870–21881, 2023. 3

  26. [26]

    mplug: Effective and effi- cient vision-language learning by cross-modal skip- connections

    Chenliang Li, Haiyang Xu, Junfeng Tian, Wei Wang, Ming Yan, Bin Bi, Jiabo Ye, Hehong Chen, Guohai Xu, Zheng Cao, et al. mplug: Effective and effi- cient vision-language learning by cross-modal skip- connections. arXiv preprint arXiv:2205.12005, 2022. 1, 2

  27. [27]

    Selvaraju, Akhilesh Deepak Gotmare, Shafiq Joty, Caiming Xiong, and Steven Hoi

    Junnan Li, Ramprasaath R. Selvaraju, Akhilesh Deepak Gotmare, Shafiq Joty, Caiming Xiong, and Steven Hoi. Align before fuse: Vision and language representation learning with momentum distillation. In Adv. Neural Inform. Process. Syst. ,

  28. [28]

    Blip: Bootstrapping language-image pre-training for unified vision-language understanding and gener- ation

    Junnan Li, Dongxu Li, Caiming Xiong, and Steven Hoi. Blip: Bootstrapping language-image pre-training for unified vision-language understanding and gener- ation. In Int. Conf. Machine Learning, pages 12888– 12900. PMLR, 2022. 1, 2, 4, 6

  29. [29]

    Blip-2: Bootstrapping language-image pre- training with frozen image encoders and large lan- guage models

    Junnan Li, Dongxu Li, Silvio Savarese, and Steven Hoi. Blip-2: Bootstrapping language-image pre- training with frozen image encoders and large lan- guage models. In Int. Conf. Machine Learning, 2023. 2

  30. [30]

    Os- car: Object-semantics aligned pre-training for vision- language tasks

    Xiujun Li, Xi Yin, Chunyuan Li, Pengchuan Zhang, Xiaowei Hu, Lei Zhang, Lijuan Wang, Houdong Hu, Li Dong, Furu Wei, Yejin Choi, and Jianfeng Gao. Os- car: Object-semantics aligned pre-training for vision- language tasks. In Eur. Conf. Comput. Vis., 2020. 1

  31. [31]

    Triple alignment strategies for zero-shot phrase grounding under weak supervision

    Pengyue Lin, Ruifan Li, Yuzhe Ji, Zhihan Yu, Fangx- iang Feng, Zhanyu Ma, and Xiaojie Wang. Triple alignment strategies for zero-shot phrase grounding under weak supervision. In Proceedings of the 32nd ACM International Conference on Multimedia, pages 4312–4321, 2024. 6

  32. [32]

    Lawrence Zitnick

    Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Doll´ar, and C. Lawrence Zitnick. Microsoft coco: Common ob- jects in context. In Eur. Conf. Comput. Vis. , pages 740–755, 2014. 1, 6, 7, 8

  33. [33]

    Visual instruction tuning

    Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning. In Adv. Neu- ral Inform. Process. Syst., 2023. 1, 2, 4, 8

  34. [34]

    Kosmos-2: Grounding Multimodal Large Language Models to the World

    Zhiliang Peng, Wenhui Wang, Li Dong, Yaru Hao, Shaohan Huang, Shuming Ma, and Furu Wei. Kosmos-2: Grounding multimodal large language models to the world. arXiv preprint arXiv:2306.14824, 2023. 4

  35. [35]

    Syn- tactic parsing

    Martin J Pickering and Roger PG Van Gompel. Syn- tactic parsing. In Handbook of psycholinguistics , pages 455–503. Elsevier, 2006. 1

  36. [36]

    Flickr30k entities: Collecting region-to- phrase correspondences for richer image-to-sentence models

    Bryan A Plummer, Liwei Wang, Chris M Cervantes, Juan C Caicedo, Julia Hockenmaier, and Svetlana 10 Lazebnik. Flickr30k entities: Collecting region-to- phrase correspondences for richer image-to-sentence models. In Int. Conf. Comput. Vis., pages 2641–2649,

  37. [37]

    The importance of syntactic parsing and inference in se- mantic role labeling

    Vasin Punyakanok, Dan Roth, and Wen-tau Yih. The importance of syntactic parsing and inference in se- mantic role labeling. Computational Linguistics, 34 (2):257–287, 2008. 1

  38. [38]

    Qwen2.5 technical report,

    Qwen, :, An Yang, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chengyuan Li, Dayiheng Liu, Fei Huang, Haoran Wei, Huan Lin, Jian Yang, Jianhong Tu, Jianwei Zhang, Jianxin Yang, Jiaxi Yang, Jingren Zhou, Junyang Lin, Kai Dang, Keming Lu, Keqin Bao, Kexin Yang, Le Yu, Mei Li, Mingfeng Xue, Pei Zhang, Qin Zhu, Rui Men, Runji Lin, Tianhao Li,...

  39. [39]

    Learning trans- ferable visual models from natural language supervi- sion

    Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sas- try, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, and Ilya Sutskever. Learning trans- ferable visual models from natural language supervi- sion. In Int. Conf. Machine Learning, 2021. 1, 2, 4

  40. [40]

    Selvaraju, Michael Cogswell, Ab- hishek Das, Ramakrishna Vedantam, Devi Parikh, and Dhruv Batra

    Ramprasaath R. Selvaraju, Michael Cogswell, Ab- hishek Das, Ramakrishna Vedantam, Devi Parikh, and Dhruv Batra. Grad-cam: Visual explanations from deep networks via gradient-based localization. In Int. Conf. Comput. Vis., 2017. 5

  41. [41]

    Similarity maps for self-training weakly-supervised phrase grounding

    Tal Shaharabany and Lior Wolf. Similarity maps for self-training weakly-supervised phrase grounding. In IEEE Conf. Comput. Vis. Pattern Recog., pages 6925– 6934, 2023. 6

  42. [42]

    What is where by looking: Weakly-supervised open-world phrase-grounding without text inputs

    Tal Shaharabany, Yoad Tewel, and Lior Wolf. What is where by looking: Weakly-supervised open-world phrase-grounding without text inputs. Adv. Neural In- form. Process. Syst., 35:28222–28237, 2022. 6

  43. [43]

    Groundvlp: Harnessing zero-shot vi- sual grounding from vision-language pre-training and open-vocabulary object detection

    Haozhan Shen, Tiancheng Zhao, Mingwei Zhu, and Jianwei Yin. Groundvlp: Harnessing zero-shot vi- sual grounding from vision-language pre-training and open-vocabulary object detection. In Proceedings of the AAAI Conference on Artificial Intelligence , pages 4766–4775, 2024. 4

  44. [44]

    What does clip know about a red circle? visual prompt engineering for vlms

    Aleksandar Shtedritski, Christian Rupprecht, and An- drea Vedaldi. What does clip know about a red circle? visual prompt engineering for vlms. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 11987–11997, 2023. 4

  45. [45]

    Flava: A foundational language and vision alignment model

    Amanpreet Singh, Ronghang Hu, Vedanuj Goswami, Guillaume Couairon, Wojciech Galuba, Marcus Rohrbach, and Douwe Kiela. Flava: A foundational language and vision alignment model. In IEEE Conf. Comput. Vis. Pattern Recog. , pages 15638–15650,

  46. [46]

    Reclip: A strong zero-shot baseline for refer- ring expression comprehension

    Sanjay Subramanian, William Merrill, Trevor Darrell, Matt Gardner, Sameer Singh, and Anna Rohrbach. Reclip: A strong zero-shot baseline for refer- ring expression comprehension. arXiv preprint arXiv:2204.05991, 2022. 4

  47. [47]

    Alpha-clip: A clip model focusing on wher- ever you want

    Zeyi Sun, Ye Fang, Tong Wu, Pan Zhang, Yuhang Zang, Shu Kong, Yuanjun Xiong, Dahua Lin, and Ji- aqi Wang. Alpha-clip: A clip model focusing on wher- ever you want. In Proceedings of the IEEE/CVF con- ference on computer vision and pattern recognition , pages 13019–13029, 2024. 4

  48. [48]

    Lxmert: Learning cross- modality encoder representations from transformers

    Hao Tan and Mohit Bansal. Lxmert: Learning cross- modality encoder representations from transformers. arXiv preprint arXiv:1908.07490, 2019. 2

  49. [49]

    Plug-and- play vqa: Zero-shot vqa by conjoining large pre- trained models with zero training

    Anthony Meng Huat Tiong, Junnan Li, Boyang Li, Silvio Savarese, and Steven CH Hoi. Plug-and- play vqa: Zero-shot vqa by conjoining large pre- trained models with zero training. arXiv preprint arXiv:2210.08773, 2022. 5

  50. [50]

    Order-Embeddings of Images and Language

    Ivan Vendrov, Ryan Kiros, Sanja Fidler, and Raquel Urtasun. Order-embeddings of images and language. arXiv preprint arXiv:1511.06361, 2015. 1, 2

  51. [51]

    Sclip: Rethinking self-attention for dense vision-language inference

    Feng Wang, Jieru Mei, and Alan Yuille. Sclip: Rethinking self-attention for dense vision-language inference. In Computer Vision – ECCV 2024, pages 315–332, Cham, 2024. Springer Nature Switzerland. 1

  52. [52]

    Phrase2vec: phrase embedding based on parsing

    Yongliang Wu, Shuliang Zhao, and Wenbin Li. Phrase2vec: phrase embedding based on parsing. Information Sciences, 517:100–127, 2020. 1

  53. [53]

    Weakly-supervised visual grounding of phrases with linguistic structures

    Fanyi Xiao, Leonid Sigal, and Yong Jae Lee. Weakly-supervised visual grounding of phrases with linguistic structures. In IEEE Conf. Comput. Vis. Pattern Recog., pages 5945–5954, 2017. 2, 3

  54. [54]

    mplug-2: A modularized multi-modal foundation model across text, image and video

    Haiyang Xu, Qinghao Ye, Ming Yan, Yaya Shi, Jiabo Ye, Yuanhong Xu, Chenliang Li, Bin Bi, Qi Qian, Wei Wang, et al. mplug-2: A modularized multi-modal foundation model across text, image and video. In Int. Conf. Machine Learning, pages 38728–38748. PMLR,

  55. [55]

    Bridgetower: Building bridges between encoders in vision-language representation learning

    Xiao Xu, Chenfei Wu, Shachar Rosenman, Vasudev Lal, Wanxiang Che, and Nan Duan. Bridgetower: Building bridges between encoders in vision-language representation learning. In AAAI, pages 10637–10647,

  56. [56]

    Pink: Unveiling the power of referential comprehension for multi-modal llms

    Shiyu Xuan, Qingpei Guo, Ming Yang, and Shiliang Zhang. Pink: Unveiling the power of referential comprehension for multi-modal llms. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 13838–13848, 2024. 4

  57. [57]

    Improving visual grounding by encouraging consistent gradient-based explanations

    Ziyan Yang, Kushal Kafle, Franck Dernoncourt, and Vicente Ordonez. Improving visual grounding by encouraging consistent gradient-based explanations. In IEEE Conf. Comput. Vis. Pattern Recog., pages 19165–19174, 2023. 6

  58. [58]

    Modeling context in referring expressions

    Licheng Yu, Patrick Poirson, Shan Yang, Alexander C Berg, and Tamara L Berg. Modeling context in referring expressions. In Eur. Conf. Comput. Vis., pages 69–85. Springer, 2016. 1, 6, 7

  59. [59]

    Multi-grained vision language pre-training: Aligning texts with visual concepts

    Yan Zeng, Xinsong Zhang, and Hang Li. Multi-grained vision language pre-training: Aligning texts with visual concepts. arXiv preprint arXiv:2111.08276, 2021. 2

  60. [60]

    Investigating compositional challenges in vision-language models for visual grounding

    Yunan Zeng, Yan Huang, Jinjin Zhang, Zequn Jie, Zhenhua Chai, and Liang Wang. Investigating compositional challenges in vision-language models for visual grounding. In IEEE Conf. Comput. Vis. Pattern Recog., pages 14141–14151, 2024. 3, 7

  61. [61]

    Sigmoid loss for language image pre-training

    Xiaohua Zhai, Basil Mustafa, Alexander Kolesnikov, and Lucas Beyer. Sigmoid loss for language image pre-training. In Proceedings of the IEEE/CVF international conference on computer vision, pages 11975–11986, 2023. 8

  62. [62]

    Top-down neural attention by excitation backprop

    Jianming Zhang, Zhe Lin, Jonathan Brandt, Xiaohui Shen, and Stan Sclaroff. Top-down neural attention by excitation backprop. In Eur. Conf. Comput. Vis.,

  63. [63]

    Evf-sam: Early vision-language fusion for text-prompted segment anything model

    Yuxuan Zhang, Tianheng Cheng, Rui Hu, Lei Liu, Heng Liu, Longjin Ran, Xiaoxin Chen, Wenyu Liu, and Xinggang Wang. Evf-sam: Early vision-language fusion for text-prompted segment anything model. arXiv preprint arXiv:2406.20076, 2024. 7

  64. [64]

    Psalm: Pixelwise segmentation with large multi-modal model

    Zheng Zhang, Yeyao Ma, Enming Zhang, and Xiang Bai. Psalm: Pixelwise segmentation with large multi-modal model. In Eur. Conf. Comput. Vis., pages 74–

  65. [65]

    Tinyllava: A framework of small-scale large multimodal models

    Baichuan Zhou, Ying Hu, Xi Weng, Junlong Jia, Jie Luo, Xien Liu, Ji Wu, and Lei Huang. Tinyllava: A framework of small-scale large multimodal models. arXiv preprint arXiv:2402.14289, 2024. 4, 9