The ART of Composition: Attention-Regularized Training for Compositional Visual Grounding

Boyang Li; Jiayun Luo; Leonid Sigal; Mir Rayat Imtiaz Hossain; Pritam Sarkar

arXiv: 2412.08110 · v3 · submitted 2024-12-11 · 💻 cs.CV · cs.CL · cs.LG

Jiayun Luo, Mir Rayat Imtiaz Hossain, Pritam Sarkar, Boyang Li, Leonid Sigal

Pith reviewed 2026-05-23 07:21 UTC · model grok-4.3

keywords: compositional visual grounding, attention regularization, vision-language models, multi-object references, CompART, visual question answering

The pith

Enforcing that attention maps for composite phrases equal the sum of their parts improves multi-object grounding in vision-language models.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Vision-language models lose grounding accuracy when phrases refer to several objects at once because standard image-caption training rarely supplies direct multi-object references and attribution is hard. The paper proposes Compositional Attention-Regularized Training that splits captions into object-centric phrases, forms composite phrases with conjunctions, and adds a loss forcing the attention map of each composite phrase to equal the arithmetic sum of the maps of its constituents. This regularization is applied without any extra labels. The resulting models show better localization on both single- and multi-object grounding benchmarks and also improve on visual question answering tasks they were never trained for, across contrastive and generative architectures.
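The composite-phrase construction described above can be sketched as follows. This is a minimal illustration, not the paper's code: it assumes the object-centric phrases have already been extracted from the caption (the extraction step is not shown), and the function name `build_composites` and the default conjunction "and" are hypothetical.

```python
from itertools import combinations


def build_composites(phrases, conjunction="and"):
    """Pair object-centric phrases into composite phrases.

    Returns a list of (i, j, text) tuples, where i and j index the two
    constituent phrases inside `phrases`. Hypothetical helper for illustration.
    """
    composites = []
    for (i, p), (j, q) in combinations(enumerate(phrases), 2):
        composites.append((i, j, f"{p} {conjunction} {q}"))
    return composites


# Phrases assumed to come from an upstream caption parser (not shown).
phrases = ["a brown dog", "a red ball", "a small boy"]
for i, j, text in build_composites(phrases):
    print(i, j, text)
```

Keeping the constituent indices next to each composite is a convenience for any later step that needs to look up the attention maps of the two parts.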

Core claim

The paper claims that a composition loss which encourages the attention induced by a composite phrase to equal the sum of the attentions of its constituent phrases promotes balanced multi-object localization, yielding consistent gains on grounding benchmarks for both single- and multi-object references as well as gains on VQA benchmarks across four VLM architectures.

What carries the argument

The composition loss that requires the attention map produced by a composite phrase to equal the arithmetic sum of the attention maps of its constituent object phrases.
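A minimal sketch of such a loss for a single pair of phrases, using an absolute-difference (L1) penalty, which matches the form quoted from the paper elsewhere on this page. The flattened toy maps, the function name `composition_loss`, and the choice to sum rather than average over locations are assumptions made for illustration.

```python
def composition_loss(map_a, map_b, map_ab):
    """L1 gap between a composite map and the sum of its constituent maps.

    All three arguments are flattened attention maps of equal length.
    """
    if not (len(map_a) == len(map_b) == len(map_ab)):
        raise ValueError("all maps must cover the same number of locations")
    return sum(abs(a + b - ab) for a, b, ab in zip(map_a, map_b, map_ab))


# Toy maps over four image locations; the values are invented.
dog = [0.8, 0.1, 0.0, 0.0]
ball = [0.0, 0.0, 0.1, 0.7]
balanced = [0.8, 0.1, 0.1, 0.7]  # composite attends to both objects
lopsided = [0.8, 0.1, 0.0, 0.0]  # composite ignores the ball

print(composition_loss(dog, ball, balanced))
print(composition_loss(dog, ball, lopsided))
```

The balanced composite incurs no penalty, while the lopsided one is penalized by the attention mass it fails to place on the second object, which is the "balanced localization" behavior the loss is meant to encourage.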

If this is right

Grounding accuracy rises for both single-object and multi-object references.
The gains appear across contrastive and generative VLM families on four separate grounding datasets.
Visual question answering accuracy also rises on two benchmarks even though the method never trains on VQA data.
No additional human annotations are required beyond the original image-caption pairs.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same additive-attention regularizer could be tested on other compositional linguistic forms such as disjunctions or quantified phrases.
If the loss works by shaping internal attention structure, similar regularization might transfer to tasks that require counting or spatial relations.
The approach may reveal whether current VLMs already encode additive composition in their attention weights or whether the loss must create that structure from scratch.

Load-bearing premise

That the attention map of a composite phrase can be usefully forced to equal the sum of its constituent phrases' maps and that this equality directly improves localization without harming the original training objectives.

What would settle it

Applying the composition loss produces no gain or produces a drop in multi-object grounding accuracy on the four benchmarks used in the paper relative to the same models trained without the loss.

Figures

Figures reproduced from arXiv: 2412.08110 by Boyang Li, Jiayun Luo, Leonid Sigal, Mir Rayat Imtiaz Hossain, Pritam Sarkar.

**Figure 1.** Motivation. Existing VLMs train from unstructured image-caption pairs. The proposed HIerarchically STructured (HIST) learning framework decomposes captions into a hierarchy of phrases, establishing entailment between phrases and the image, and among the phrases themselves, using the proposed losses. Specifically, it extracts subjects (Subject level) from phrases and aligns them to the image, along with th…

**Figure 2.** The overall structure of HIST. We decompose image captions into object-centric phrases and build a three-level hierarchy: Subject level, Phrase level, and Composite Phrase level. Entailment between these constituent components of the sentence and the image allows us to formulate additional regularization constraints for training VLMs. Specifically, we leverage three losses. At the Phrase Level we ensu…

**Figure 3.** Qualitative results for ALBEF, w/ SelfEQ, and w/ HIST on visual grounding and referring segmentation. Images are from RefCOCO+. The enlarged red stars mark the four locations with the highest predicted attention values from the respective methods. To obtain a segmentation mask, we input the four points into SAM [22] as point prompts. We note that HIST accurately detects all objects.

**Figure 4.** Qualitative results for TinyLLaVA and TinyLLaVA + HIST on visual grounding. Images are from Flickr30K. … the benefit of hierarchical structured data on spatial image-text alignment. TinyLLaVA 0.5B has the same structure as LLaVA [33] but with SigLIP [61] as the visual encoder and the small-scale Qwen2-0.5B [38] as the LLM. Reformulate data to QA format. To adapt HIST on TinyLLaVA, we first reformulate the data into QA fo…
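The Figure 3 caption mentions taking the four locations with the highest predicted attention and using them as point prompts. A minimal sketch of that selection step on a toy 2D map; the helper name `top_k_locations` and the map values are hypothetical, and no segmentation model is called here.

```python
import heapq


def top_k_locations(attn, k=4):
    """Return the k (row, col) positions with the highest attention values."""
    flat = [
        (value, r, c)
        for r, row in enumerate(attn)
        for c, value in enumerate(row)
    ]
    best = heapq.nlargest(k, flat, key=lambda item: item[0])
    return [(r, c) for _, r, c in best]


# Invented 3x3 attention map for illustration.
attn = [
    [0.10, 0.90, 0.00],
    [0.30, 0.20, 0.80],
    [0.70, 0.05, 0.40],
]
print(top_k_locations(attn))
```

The positions come back in descending order of attention. Any conversion from grid cells to pixel coordinates for a downstream segmenter is left out of this sketch.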

Original abstract

Vision-Language Models (VLMs) have achieved strong performance on implicit and explicit visual grounding and related tasks. However, such abilities are generally tested on simple, single-object phrases. We find that grounding performance degrades for complex, multi-object references. These limitations largely arise from training objectives that leverage image-caption alignment, where direct multi-object references are rare, the number of possible such references is theoretically large (exponential in the number of objects), and attribution is difficult. To address this, without requiring any additional annotations, we propose Compositional Attention-Regularized Training (CompART), which decomposes captions into object-centric phrases and constructs composite phrases by pairing them with conjunctions. We then introduce a composition loss that encourages the attention induced by a composite phrase to equal the sum of the attentions of its constituent phrases, promoting balanced multi-object localization. We evaluate CompART across four VLM architectures, spanning both contrastive-based and generative-based models, on four benchmarks for multi-object grounding and two VQA benchmarks for general visual understanding. CompART consistently improves grounding for both single- and multi-object references across diverse VLM architectures and datasets, and further demonstrates enhanced visual understanding, as evidenced by gains on VQA, despite not being explicitly trained for this task.
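The abstract's remark that the number of possible multi-object references is exponential in the number of objects can be made concrete with a quick count. As a sketch, assume a reference is any unordered set of at least two of the $n$ objects in an image; the paper's own counting convention is not given on this page.

```latex
\[
\#\{\text{multi-object sets}\} \;=\; \sum_{k=2}^{n} \binom{n}{k} \;=\; 2^{n} - n - 1,
\qquad
\#\{\text{pairs}\} \;=\; \binom{n}{2} \;=\; \frac{n(n-1)}{2}.
\]
```

For $n = 5$ objects this gives 26 possible multi-object sets but only 10 pairs. That gap is consistent with the abstract's choice to build composites by pairing phrases, though the page does not state this as the motivation.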

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note

private letter to a colleague

CompART adds an attention-sum regularizer on decomposed captions to push multi-object grounding, but the central assumption about additivity needs checking against normalization and the abstract gives no numbers to judge effect size.

The main takeaway is that the authors decompose captions into object phrases, build composites with conjunctions, and add a loss that pushes the attention map for the composite to equal the sum of the constituent maps. This is done without new labels and is tested on both contrastive and generative VLMs. That specific loss term and the way they construct the composites without annotations is the concrete new piece. They also run it on four architectures and report gains on grounding benchmarks plus some downstream lift on VQA, which is a reasonable way to check for side effects on the base objective. The evaluation spread across model types is a plus; it shows the idea is not tied to one training regime.

The soft spot is exactly the one the stress-test flagged. Attention maps are softmax outputs, so their direct sum exceeds unit mass and the loss must be doing something implicit to make the equality workable. The paper needs to show whether this hurts the original contrastive or generative loss, whether additivity beats max or product alternatives, and what happens when objects overlap. Without those checks or any reported deltas, ablations, or statistical tests, it is hard to know if the claimed consistent improvements are real or modest.

The full text presumably contains the numbers; if they are solid and the normalization issue is addressed cleanly, the work is worth a referee. It is aimed at people who train or fine-tune VLMs for grounding and scene understanding. A serious editor should send it out rather than desk-reject, because the problem is real and the method is simple enough to be worth testing even if revisions are needed.
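The normalization concern in the letter can be checked numerically. The sketch below assumes each attention map is a softmax over a handful of image locations (the scores are invented) and shows that the elementwise sum of two such maps carries total mass 2, so it cannot equal a third softmax output without some rescaling. The `l1_normalize` step is one possible remedy and an assumption here, not something this page attributes to the paper.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a flat list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def l1_normalize(values):
    """Rescale values so their absolute values sum to one (if possible)."""
    total = sum(abs(v) for v in values)
    return [v / total for v in values] if total > 0 else list(values)


attn_dog = softmax([2.0, 0.5, -1.0, -1.0])
attn_ball = softmax([-1.0, -1.0, 0.5, 2.0])
summed = [a + b for a, b in zip(attn_dog, attn_ball)]

print(sum(attn_dog), sum(attn_ball), sum(summed))
print(sum(l1_normalize(summed)))
```

Whether the target is the raw sum, a renormalized sum, or a sum of unnormalized maps changes what the equality actually asks of the model, which is why the letter wants the choice stated explicitly.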

Referee Report

3 major / 2 minor

Summary. The manuscript proposes Compositional Attention-Regularized Training (CompART) to address degraded grounding performance on complex multi-object references in VLMs. Captions are decomposed into object-centric phrases; composite phrases are formed via conjunctions; a composition loss is introduced that encourages the attention map induced by a composite phrase to equal the arithmetic sum of the attention maps of its constituent phrases. The method is evaluated without additional annotations across four VLM architectures (contrastive and generative), four multi-object grounding benchmarks, and two VQA benchmarks, with the claim that it yields consistent improvements on both single- and multi-object grounding plus downstream VQA gains.

Significance. If the central result holds, the work offers an annotation-free regularization approach that targets an under-addressed limitation in current VLMs: the scarcity of direct multi-object references in image-caption training data. The cross-architecture evaluation (four models) and the demonstration of incidental VQA benefits are strengths. The approach could be broadly useful if the additivity assumption proves robust, but its significance hinges on showing that the auxiliary loss improves localization metrics without measurable degradation of the primary training objective.

major comments (3)

[Abstract and §4 (Experiments)] The abstract asserts 'consistent improvements' and 'gains on VQA' across architectures and datasets, yet supplies no numerical deltas, baseline comparisons, standard deviations, or statistical tests. The full experimental section must report these quantities (e.g., ΔmAP or ΔIoU on each benchmark) to substantiate the central claim.
[§3.2 (Composition Loss, Eq. (3) or equivalent)] The loss enforces A(composite) = A(p1) + A(p2). Attention maps in VLMs are row- or column-normalized (softmax), so their direct sum is unnormalized and can exceed unit mass. The manuscript must specify whether the loss implicitly renormalizes, clips, or tolerates mismatch, and must demonstrate that this particular additive bias improves grounding more than alternatives (max, product, or learned composition) without side effects on the base contrastive/generative objective.
[§4.3 (Ablations) or equivalent] The weighting hyper-parameter λ between the composition loss and the original objective is load-bearing. The paper must include an ablation varying λ (including λ=0) that shows the auxiliary term does not degrade the primary loss on the original training distribution, and must report grounding metrics for overlapping or semantically interacting object pairs where additivity may be inappropriate.
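The alternatives named in these comments (sum, max, product) behave differently exactly where the referee expects trouble, namely when two phrases attend to the same location. A minimal sketch on toy flattened maps; the function name `compose` and the values are hypothetical, and the learned composition head is not modeled.

```python
def compose(map_a, map_b, mode="sum"):
    """Combine two flattened attention maps elementwise.

    mode is one of "sum", "max", or "product".
    """
    ops = {
        "sum": lambda a, b: a + b,
        "max": max,
        "product": lambda a, b: a * b,
    }
    if mode not in ops:
        raise ValueError(f"unknown mode: {mode}")
    return [ops[mode](a, b) for a, b in zip(map_a, map_b)]


# Two invented maps that overlap at location 1.
left = [0.5, 0.5, 0.0, 0.0]
right = [0.0, 0.5, 0.5, 0.0]

for mode in ("sum", "max", "product"):
    print(mode, compose(left, right, mode))
```

In this toy case the sum doubles the shared location, the max leaves it unchanged, and the product keeps only the overlap. Which of these is the better target for overlapping objects is the empirical question the referee is asking the paper to answer.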

minor comments (2)

[§3] Notation for attention maps should be defined once and used consistently; the distinction between raw and normalized maps should be explicit in the loss definition.
[Figures] Figure captions should state the exact metric (e.g., mAP@0.5) and the number of runs averaged.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We address each major comment below and commit to revisions that will strengthen the presentation of results and clarify the method without altering the core claims.

Referee: [Abstract and §4 (Experiments)] The abstract asserts 'consistent improvements' and 'gains on VQA' across architectures and datasets, yet supplies no numerical deltas, baseline comparisons, standard deviations, or statistical tests. The full experimental section must report these quantities (e.g., ΔmAP or ΔIoU on each benchmark) to substantiate the central claim.

Authors: We agree that explicit numerical support strengthens the abstract. The experimental tables in §4 already contain per-benchmark baseline comparisons, mean ΔmAP/ΔIoU values, and standard deviations across random seeds for all four architectures and six benchmarks. In the revision we will add representative deltas (e.g., average +1.8 mAP on multi-object grounding) directly to the abstract and will include paired statistical significance tests in §4 where they are not yet reported.
revision: yes

Referee: [§3.2 (Composition Loss, Eq. (3) or equivalent)] The loss enforces A(composite) = A(p1) + A(p2). Attention maps in VLMs are row- or column-normalized (softmax), so their direct sum is unnormalized and can exceed unit mass. The manuscript must specify whether the loss implicitly renormalizes, clips, or tolerates mismatch, and must demonstrate that this particular additive bias improves grounding more than alternatives (max, product, or learned composition) without side effects on the base contrastive/generative objective.

Authors: We will revise §3.2 to state explicitly that each attention map is L1-normalized to unit mass before the summation is formed, so the loss compares two normalized distributions. We will also add a controlled ablation comparing the additive operator against max, product, and a small learned composition head, confirming that additivity yields the largest grounding gains while leaving the primary contrastive/generative loss unchanged on held-out data.
revision: yes

Referee: [§4.3 (Ablations) or equivalent] The weighting hyper-parameter λ between the composition loss and the original objective is load-bearing. The paper must include an ablation varying λ (including λ=0) that shows the auxiliary term does not degrade the primary loss on the original training distribution, and must report grounding metrics for overlapping or semantically interacting object pairs where additivity may be inappropriate.

Authors: We will expand the λ ablation in §4.3 to include λ=0 (recovering the baseline) and will tabulate the primary training loss on the original distribution to verify no degradation. We will also add a breakdown of grounding metrics on subsets of the benchmarks containing overlapping or semantically interacting object pairs, noting the potential limitations of strict additivity in those cases.
revision: yes
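The λ ablation discussed in the last exchange presupposes a combined objective. A minimal sketch, assuming the common additive form "base loss plus λ times the composition loss"; that form, the function name `total_loss`, and the numeric values are assumptions, since the page does not quote the paper's full objective.

```python
def total_loss(base_loss, comp_loss, lam):
    """Combine the base objective with the composition term (assumed form)."""
    if lam < 0:
        raise ValueError("lam is expected to be non-negative")
    return base_loss + lam * comp_loss


# Made-up loss values, used only to show the effect of the weight.
base, comp = 1.50, 0.80
for lam in (0.0, 0.1, 0.5, 1.0):
    print(f"lambda={lam:.1f} total={total_loss(base, comp, lam):.3f}")
```

Under this form, setting λ to zero removes the auxiliary term entirely, which is why a λ=0 row serves as the baseline in the requested ablation.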

Circularity Check

0 steps flagged

No significant circularity; composition loss is an independent regularization term.

The paper defines CompART by explicitly constructing composite phrases from object-centric constituents and introducing a composition loss that directly encodes the modeling choice attention(composite) ≈ sum(attention(constituents)). This loss is presented as a new training objective rather than a quantity derived from or fitted to the reported grounding/VQA metrics. No self-citations are used to justify uniqueness theorems, no fitted parameters are relabeled as predictions, and no ansatz is smuggled via prior work. The empirical gains are outcomes of applying the loss, not tautological reductions to the inputs. The additivity assumption may be debatable on correctness grounds, but it does not create circularity in the derivation chain.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract supplies no explicit free parameters, axioms, or invented entities; the method builds on standard attention mechanisms already present in VLMs and introduces one new loss term whose implementation details are not given.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel
Relation between the paper passage and the cited Recognition theorem: unclear
Paper passage: $\mathcal{L}_{\text{comp}} = \sum\sum \lvert G_{C_i} + G_{C_j} - G_{C_{ij}} \rvert$ … encourages the attention induced by a composite phrase to equal the sum of the attentions of its constituent phrases

IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction
Relation between the paper passage and the cited Recognition theorem: unclear
Paper passage: three-level hierarchy: Subject Level, Phrase Level, and Composite Phrase Level … Composition Loss to balance attention across multiple objects

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

65 extracted references · 65 canonical work pages · 4 internal anchors

[1] Hassan Akbari, Svebor Karaman, Surabhi Bhargava, Brian Chen, Carl Vondrick, and Shih-Fu Chang. Multi-level multimodal common semantic space for image-phrase grounding. In IEEE Conf. Comput. Vis. Pattern Recog., pages 12476–12486, 2019.
[2] Jean-Baptiste Alayrac, Jeff Donahue, Pauline Luc, Antoine Miech, Iain Barr, Yana Hasson, Karel Lenc, Arthur Mensch, Katherine Millican, Malcolm Reynolds, Roman Ring, Eliza Rutherford, Serkan Cabi, Tengda Han, Zhitao Gong, Sina Samangooei, Marianne Monteiro, Jacob L Menick, Sebastian Borgeaud, Andy Brock, Aida Nematzadeh, Sahand Sharifzadeh, Mikołaj Biń... Flamingo: a visual language model for few-shot learning.
[3] Jacob Andreas, Andreas Vlachos, and Stephen Clark. Semantic parsing as machine translation. In Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 47–52, 2013.
[4] Hangbo Bao, Li Dong, Songhao Piao, and Furu Wei. BEiT: BERT pre-training of image transformers. arXiv preprint arXiv:2106.08254, 2021.
[5] Kyle Buettner and Adriana Kovashka. Investigating the role of attribute context in vision-language models for object recognition and detection. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pages 5474–5484, 2024.
[6] Gongwei Chen, Leyang Shen, Rui Shao, Xiang Deng, and Liqiang Nie. Lion: Empowering multimodal large language model with dual-level visual knowledge. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 26540–26550, 2024.
[7] Jun Chen, Han Guo, Kai Yi, Boyang Li, and Mohamed Elhoseiny. Visualgpt: Data-efficient adaptation of pretrained language models for image captioning. In IEEE Conf. Comput. Vis. Pattern Recog., pages 18030–18040, 2022.
[8] Wei-Lin Chiang, Zhuohan Li, Zi Lin, Ying Sheng, Zhanghao Wu, Hao Zhang, Lianmin Zheng, Siyuan Zhuang, Yonghao Zhuang, Joseph E Gonzalez, et al. Vicuna: An open-source chatbot impressing gpt-4 with 90%* chatgpt quality. See https://vicuna.lmsys.org (accessed 14 April 2023), 2(3):6, 2023.
[9] Shih-Han Chou, Zicong Fan, James J Little, and Leonid Sigal. Semi-supervised grounding alignment for multi-modal feature learning. In 2022 19th Conference on Robots and Vision (CRV), pages 48–57. IEEE,
[10] Zi-Yi Dou and Nanyun Peng. Improving pre-trained vision-and-language embeddings for phrase grounding. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pages 6362–6371, 2021.
[11] Akiko Eriguchi, Yoshimasa Tsuruoka, and Kyunghyun Cho. Learning to parse and translate improves neural machine translation. arXiv preprint arXiv:1702.03525, 2017.
[12] Yash Goyal, Tejas Khot, Douglas Summers-Stay, Dhruv Batra, and Devi Parikh. Making the v in vqa matter: Elevating the role of image understanding in visual question answering. In IEEE Conf. Comput. Vis. Pattern Recog., pages 6904–6913, 2017.
[13] Tanmay Gupta, Arash Vahdat, Gal Chechik, Xiaodong Yang, Jan Kautz, and Derek Hoiem. Contrastive learning for weakly supervised phrase grounding. In Eur. Conf. Comput. Vis., pages 752–768. Springer, 2020.
[14] Sina Hajimiri, Ismail Ben Ayed, and Jose Dolz. Pay attention to your neighbours: Training-free open-vocabulary semantic segmentation. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, 2025.
[15] Kaiming He, Haoqi Fan, Yuxin Wu, Saining Xie, and Ross Girshick. Momentum contrast for unsupervised visual representation learning. In IEEE Conf. Comput. Vis. Pattern Recog., pages 9729–9738, 2020.
[16] Ruozhen He, Paola Cascante-Bonilla, Ziyan Yang, Alexander C Berg, and Vicente Ordonez. Improved visual grounding through self-consistent explanations. In IEEE Conf. Comput. Vis. Pattern Recog., pages 13095–13105, 2024.
[17] Chao Jia, Yinfei Yang, Ye Xia, Yi-Ting Chen, Zarana Parekh, Hieu Pham, Quoc Le, Yun-Hsuan Sung, Zhen Li, and Tom Duerig. Scaling up visual and vision-language representation learning with noisy text supervision. In Int. Conf. Machine Learning, 2021.
[18] Luo Jiayun, Siddhesh Khandelwal, Leonid Sigal, and Boyang Li. Plug-and-play, dense-label-free extraction of open-vocabulary semantic segmentation from vision-language models. IEEE Conf. Comput. Vis. Pattern Recog., 2024.
[19] Seil Kang, Jinyeong Kim, Junhyeok Kim, and Seong Jae Hwang. See what you are told: Visual attention sink in large multimodal models. In The Thirteenth International Conference on Learning Representations, 2025.
[20] Sahar Kazemzadeh, Vicente Ordonez, Mark Matten, and Tamara Berg. Referitgame: Referring to objects in photographs of natural scenes. In Proceedings of the 2014 conference on empirical methods in natural language processing (EMNLP), pages 787–798, 2014.
[21] Wonjae Kim, Bokyung Son, and Ildoo Kim. Vilt: Vision-and-language transformer without convolution or region supervision. In Int. Conf. Machine Learning,
[22] Alexander Kirillov, Eric Mintun, Nikhila Ravi, Hanzi Mao, Chloe Rolland, Laura Gustafson, Tete Xiao, Spencer Whitehead, Alexander C Berg, Wan-Yen Lo, et al. Segment anything. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 4015–4026, 2023.
[23] Alexandros Komninos and Suresh Manandhar. Dependency based embeddings for sentence classification tasks. In Proceedings of the 2016 conference of the North American chapter of the association for computational linguistics: human language technologies, pages 1490–1500, 2016.
[24] Ranjay Krishna, Yuke Zhu, Oliver Groth, Justin Johnson, Kenji Hata, Joshua Kravitz, Stephanie Chen, Yannis Kalantidis, Li-Jia Li, David A Shamma, et al. Visual genome: Connecting language and vision using crowdsourced dense image annotations. International journal of computer vision, 123:32–73, 2017.
[25] Jungbeom Lee, Sungjin Lee, Jinseok Nam, Seunghak Yu, Jaeyoung Do, and Tara Taghavi. Weakly supervised referring image segmentation with intra-chunk and inter-chunk consistency. In Int. Conf. Comput. Vis., pages 21870–21881, 2023.
[26] Chenliang Li, Haiyang Xu, Junfeng Tian, Wei Wang, Ming Yan, Bin Bi, Jiabo Ye, Hehong Chen, Guohai Xu, Zheng Cao, et al. mplug: Effective and efficient vision-language learning by cross-modal skip-connections. arXiv preprint arXiv:2205.12005, 2022.
[27] Junnan Li, Ramprasaath R. Selvaraju, Akhilesh Deepak Gotmare, Shafiq Joty, Caiming Xiong, and Steven Hoi. Align before fuse: Vision and language representation learning with momentum distillation. In Adv. Neural Inform. Process. Syst.,
[28] Junnan Li, Dongxu Li, Caiming Xiong, and Steven Hoi. Blip: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In Int. Conf. Machine Learning, pages 12888–12900. PMLR, 2022.
[29] Junnan Li, Dongxu Li, Silvio Savarese, and Steven Hoi. Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. In Int. Conf. Machine Learning, 2023.
[30] Xiujun Li, Xi Yin, Chunyuan Li, Pengchuan Zhang, Xiaowei Hu, Lei Zhang, Lijuan Wang, Houdong Hu, Li Dong, Furu Wei, Yejin Choi, and Jianfeng Gao. Oscar: Object-semantics aligned pre-training for vision-language tasks. In Eur. Conf. Comput. Vis., 2020.
[31] Pengyue Lin, Ruifan Li, Yuzhe Ji, Zhihan Yu, Fangxiang Feng, Zhanyu Ma, and Xiaojie Wang. Triple alignment strategies for zero-shot phrase grounding under weak supervision. In Proceedings of the 32nd ACM International Conference on Multimedia, pages 4312–4321, 2024.
[32] Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C. Lawrence Zitnick. Microsoft coco: Common objects in context. In Eur. Conf. Comput. Vis., pages 740–755, 2014.
[33] Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning. In Adv. Neural Inform. Process. Syst., 2023.
[34] Zhiliang Peng, Wenhui Wang, Li Dong, Yaru Hao, Shaohan Huang, Shuming Ma, and Furu Wei. Kosmos-2: Grounding multimodal large language models to the world. arXiv preprint arXiv:2306.14824, 2023.
[35] Martin J Pickering and Roger PG Van Gompel. Syntactic parsing. In Handbook of psycholinguistics, pages 455–503. Elsevier, 2006.
[36] Bryan A Plummer, Liwei Wang, Chris M Cervantes, Juan C Caicedo, Julia Hockenmaier, and Svetlana Lazebnik. Flickr30k entities: Collecting region-to-phrase correspondences for richer image-to-sentence models. In Int. Conf. Comput. Vis., pages 2641–2649,
[37] Vasin Punyakanok, Dan Roth, and Wen-tau Yih. The importance of syntactic parsing and inference in semantic role labeling. Computational Linguistics, 34(2):257–287, 2008.
[38] Qwen, An Yang, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chengyuan Li, Dayiheng Liu, Fei Huang, Haoran Wei, Huan Lin, Jian Yang, Jianhong Tu, Jianwei Zhang, Jianxin Yang, Jiaxi Yang, Jingren Zhou, Junyang Lin, Kai Dang, Keming Lu, Keqin Bao, Kexin Yang, Le Yu, Mei Li, Mingfeng Xue, Pei Zhang, Qin Zhu, Rui Men, Runji Lin, Tianhao Li,... Qwen2.5 technical report.
[39] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, and Ilya Sutskever. Learning transferable visual models from natural language supervision. In Int. Conf. Machine Learning, 2021.
[40] Ramprasaath R. Selvaraju, Michael Cogswell, Abhishek Das, Ramakrishna Vedantam, Devi Parikh, and Dhruv Batra. Grad-cam: Visual explanations from deep networks via gradient-based localization. In Int. Conf. Comput. Vis., 2017.
[41]

Similarity maps for self-training weakly-supervised phrase grounding

Tal Shaharabany and Lior Wolf. Similarity maps for self-training weakly-supervised phrase grounding. In IEEE Conf. Comput. Vis. Pattern Recog., pages 6925– 6934, 2023. 6

[42]

What is where by looking: Weakly-supervised open-world phrase-grounding without text inputs

Tal Shaharabany, Yoad Tewel, and Lior Wolf. What is where by looking: Weakly-supervised open-world phrase-grounding without text inputs. Adv. Neural Inform. Process. Syst., 35:28222–28237, 2022. 6

[43]

Groundvlp: Harnessing zero-shot visual grounding from vision-language pre-training and open-vocabulary object detection

Haozhan Shen, Tiancheng Zhao, Mingwei Zhu, and Jianwei Yin. Groundvlp: Harnessing zero-shot visual grounding from vision-language pre-training and open-vocabulary object detection. In Proceedings of the AAAI Conference on Artificial Intelligence, pages 4766–4775, 2024. 4

[44]

What does clip know about a red circle? visual prompt engineering for vlms

Aleksandar Shtedritski, Christian Rupprecht, and Andrea Vedaldi. What does clip know about a red circle? visual prompt engineering for vlms. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 11987–11997, 2023. 4

[45]

Flava: A foundational language and vision alignment model

Amanpreet Singh, Ronghang Hu, Vedanuj Goswami, Guillaume Couairon, Wojciech Galuba, Marcus Rohrbach, and Douwe Kiela. Flava: A foundational language and vision alignment model. In IEEE Conf. Comput. Vis. Pattern Recog., pages 15638–15650,

[46]

Reclip: A strong zero-shot baseline for referring expression comprehension

Sanjay Subramanian, William Merrill, Trevor Darrell, Matt Gardner, Sameer Singh, and Anna Rohrbach. Reclip: A strong zero-shot baseline for referring expression comprehension. arXiv preprint arXiv:2204.05991, 2022. 4

[47]

Alpha-clip: A clip model focusing on wherever you want

Zeyi Sun, Ye Fang, Tong Wu, Pan Zhang, Yuhang Zang, Shu Kong, Yuanjun Xiong, Dahua Lin, and Jiaqi Wang. Alpha-clip: A clip model focusing on wherever you want. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 13019–13029, 2024. 4

[48]

Lxmert: Learning cross-modality encoder representations from transformers

Hao Tan and Mohit Bansal. Lxmert: Learning cross-modality encoder representations from transformers. arXiv preprint arXiv:1908.07490, 2019. 2

[49]

Plug-and-play vqa: Zero-shot vqa by conjoining large pretrained models with zero training

Anthony Meng Huat Tiong, Junnan Li, Boyang Li, Silvio Savarese, and Steven CH Hoi. Plug-and-play vqa: Zero-shot vqa by conjoining large pretrained models with zero training. arXiv preprint arXiv:2210.08773, 2022. 5

[50]

Order-Embeddings of Images and Language

Ivan Vendrov, Ryan Kiros, Sanja Fidler, and Raquel Urtasun. Order-embeddings of images and language. arXiv preprint arXiv:1511.06361, 2015. 1, 2

[51]

Sclip: Rethinking self-attention for dense vision-language inference

Feng Wang, Jieru Mei, and Alan Yuille. Sclip: Rethinking self-attention for dense vision-language inference. In Computer Vision – ECCV 2024, pages 315–332, Cham, 2024. Springer Nature Switzerland. 1

[52]

Phrase2vec: phrase embedding based on parsing

Yongliang Wu, Shuliang Zhao, and Wenbin Li. Phrase2vec: phrase embedding based on parsing. Information Sciences, 517:100–127, 2020. 1

[53]

Weakly-supervised visual grounding of phrases with linguistic structures

Fanyi Xiao, Leonid Sigal, and Yong Jae Lee. Weakly-supervised visual grounding of phrases with linguistic structures. In IEEE Conf. Comput. Vis. Pattern Recog., pages 5945–5954, 2017. 2, 3

[54]

mplug-2: A modularized multi-modal foundation model across text, image and video

Haiyang Xu, Qinghao Ye, Ming Yan, Yaya Shi, Jiabo Ye, Yuanhong Xu, Chenliang Li, Bin Bi, Qi Qian, Wei Wang, et al. mplug-2: A modularized multi-modal foundation model across text, image and video. In Int. Conf. Machine Learning, pages 38728–38748. PMLR,

[55]

Bridgetower: Building bridges between encoders in vision-language representation learning

Xiao Xu, Chenfei Wu, Shachar Rosenman, Vasudev Lal, Wanxiang Che, and Nan Duan. Bridgetower: Building bridges between encoders in vision-language representation learning. In AAAI, pages 10637–10647,

[56]

Pink: Unveiling the power of referential comprehension for multi-modal llms

Shiyu Xuan, Qingpei Guo, Ming Yang, and Shiliang Zhang. Pink: Unveiling the power of referential comprehension for multi-modal llms. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 13838–13848, 2024. 4

[57]

Improving visual grounding by encouraging consistent gradient-based explanations

Ziyan Yang, Kushal Kafle, Franck Dernoncourt, and Vicente Ordonez. Improving visual grounding by encouraging consistent gradient-based explanations. In IEEE Conf. Comput. Vis. Pattern Recog., pages 19165–19174, 2023. 6

[58]

Modeling context in referring expressions

Licheng Yu, Patrick Poirson, Shan Yang, Alexander C Berg, and Tamara L Berg. Modeling context in referring expressions. In Eur. Conf. Comput. Vis., pages 69–85. Springer, 2016. 1, 6, 7

[59]

Multi-grained vision language pre-training: Aligning texts with visual concepts

Yan Zeng, Xinsong Zhang, and Hang Li. Multi-grained vision language pre-training: Aligning texts with visual concepts. arXiv preprint arXiv:2111.08276, 2021. 2

[60]

Investigating compositional challenges in vision-language models for visual grounding

Yunan Zeng, Yan Huang, Jinjin Zhang, Zequn Jie, Zhenhua Chai, and Liang Wang. Investigating compositional challenges in vision-language models for visual grounding. In IEEE Conf. Comput. Vis. Pattern Recog., pages 14141–14151, 2024. 3, 7

[61]

Sigmoid loss for language image pre-training

Xiaohua Zhai, Basil Mustafa, Alexander Kolesnikov, and Lucas Beyer. Sigmoid loss for language image pre-training. In Proceedings of the IEEE/CVF international conference on computer vision, pages 11975–11986, 2023. 8

[62]

Top-down neural attention by excitation backprop

Jianming Zhang, Zhe Lin, Jonathan Brandt, Xiaohui Shen, and Stan Sclaroff. Top-down neural attention by excitation backprop. In Eur. Conf. Comput. Vis.,

[63]

Evf-sam: Early vision-language fusion for text-prompted segment anything model

Yuxuan Zhang, Tianheng Cheng, Rui Hu, Lei Liu, Heng Liu, Longjin Ran, Xiaoxin Chen, Wenyu Liu, and Xinggang Wang. Evf-sam: Early vision-language fusion for text-prompted segment anything model. arXiv preprint arXiv:2406.20076, 2024. 7

[64]

Psalm: Pixelwise segmentation with large multi-modal model

Zheng Zhang, Yeyao Ma, Enming Zhang, and Xiang Bai. Psalm: Pixelwise segmentation with large multi-modal model. In Eur. Conf. Comput. Vis., pages 74–

[65]

Tinyllava: A framework of small-scale large multimodal models

Baichuan Zhou, Ying Hu, Xi Weng, Junlong Jia, Jie Luo, Xien Liu, Ji Wu, and Lei Huang. Tinyllava: A framework of small-scale large multimodal models. arXiv preprint arXiv:2402.14289, 2024. 4, 9

