Do Composed Image Retrieval Benchmarks Require Multimodal Composition?

Alessandro De Bellis; Alessandro Suglia; Aryo Pradipta Gema; Claudio Pomo; Dietmar Jannach; Matteo Attimonelli; Monica Sekoyan; Pasquale Minervini; Rohit Saxena; Tommaso Di Noia

arxiv: 2605.14787 · v2 · pith:YNYPSIM2new · submitted 2026-05-14 · 💻 cs.CV · cs.CL

Do Composed Image Retrieval Benchmarks Require Multimodal Composition?

Matteo Attimonelli , Alessandro De Bellis , Aryo Pradipta Gema , Rohit Saxena , Monica Sekoyan , Wai-Chung Kwan , Claudio Pomo , Alessandro Suglia

show 3 more authors

Dietmar Jannach Tommaso Di Noia Pasquale Minervini

This is my paper

Pith reviewed 2026-05-20 21:16 UTC · model grok-4.3

classification 💻 cs.CV cs.CL

keywords composed image retrievalmultimodal compositionunimodal shortcutsbenchmark auditimage retrievalmultimodal modelsshortcut detectionquery validation

0 comments

The pith

Many queries in composed image retrieval benchmarks can be solved using only a single modality instead of requiring true multimodal composition.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests the assumption that strong performance on composed image retrieval benchmarks requires models to combine information from a reference image and a text modification. It shows that across four common benchmarks and eleven models, between 32 and 84 percent of queries can be answered from either the image or the text alone. A two-stage audit first flags shortcut-solvable queries via cross-model patterns and then applies human review to the remainder, finding that most of the reviewed queries are malformed or ambiguous. On the small set of well-formed queries that remain, models can no longer succeed with one input and must use both, though overall accuracy falls.

Core claim

The authors demonstrate that a large fraction of queries in four standard CIR benchmarks can be solved using unimodal signals alone, ranging from 32.2% to 83.6% across eleven models. Through cross-model analysis and human validation of 4,741 shortcut-free queries, they identify only 1,689 as well-formed, with issues like ambiguous edits and mismatched targets common in the rest. Re-evaluation on this subset shows that successful retrieval now requires multimodal composition and cannot be achieved with a single modality.

What carries the argument

A two-stage audit that first uses cross-model agreement to detect shortcut-solvable queries and then applies human validation to confirm which remaining queries are well-formed and truly compositional.

If this is right

High scores on existing CIR benchmarks can come from unimodal shortcuts rather than actual composition of image and text.
Accuracy drops on the cleaned subset while models must now combine both inputs to succeed.
Benchmarks mix shortcut-solvable queries, noisy queries, and genuinely compositional queries.
Reported multimodal capabilities of current models are overestimated on standard tests.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Future benchmarks could add explicit checks to remove unimodal shortcuts before release.
The same shortcut problem may appear in other multimodal retrieval or editing tasks.
Training on only the validated compositional queries might push models toward better use of both modalities.

Load-bearing premise

Cross-model patterns and human judgments correctly separate queries that need both modalities from those solvable by shortcuts or that are malformed.

What would settle it

If models still reach high accuracy on the 1,689 validated queries when given only the reference image or only the text, the claim that these queries require multimodal composition would be false.

Figures

Figures reproduced from arXiv: 2605.14787 by Alessandro De Bellis, Alessandro Suglia, Aryo Pradipta Gema, Claudio Pomo, Dietmar Jannach, Matteo Attimonelli, Monica Sekoyan, Pasquale Minervini, Rohit Saxena, Tommaso Di Noia, Wai-Chung Kwan.

**Figure 2.** Figure 2: Representative failures from the audited CIRCUS [PITH_FULL_IMAGE:figures/full_fig_p007_2.png] view at source ↗

**Figure 3.** Figure 3: Retriever-averaged normalised composition gap based on full-catalogue nDCG. Each panel [PITH_FULL_IMAGE:figures/full_fig_p009_3.png] view at source ↗

**Figure 4.** Figure 4: Retriever-averaged normalised composition gap based on full-catalogue MRR. Each panel [PITH_FULL_IMAGE:figures/full_fig_p024_4.png] view at source ↗

**Figure 5.** Figure 5: Pairwise Cohen’s κ values on the binary VALID/invalid decision across the 9-annotator overlap subset, shown with anonymized annotator identities. 28 [PITH_FULL_IMAGE:figures/full_fig_p028_5.png] view at source ↗

**Figure 6.** Figure 6: Pairwise exact agreement rates on the full issue signature across the same 9-annotator [PITH_FULL_IMAGE:figures/full_fig_p029_6.png] view at source ↗

**Figure 7.** Figure 7: Annotation interface used in the human validation study. [PITH_FULL_IMAGE:figures/full_fig_p033_7.png] view at source ↗

**Figure 8.** Figure 8: Representative FashionIQ annotation outcomes. Each page now shows three stacked panels [PITH_FULL_IMAGE:figures/full_fig_p035_8.png] view at source ↗

**Figure 9.** Figure 9: Additional FashionIQ and CIRR examples, including the remaining FashionIQ [PITH_FULL_IMAGE:figures/full_fig_p036_9.png] view at source ↗

**Figure 10.** Figure 10: Representative CIRR outcomes showing non-unique targets, broad queries, and ambiguous [PITH_FULL_IMAGE:figures/full_fig_p037_10.png] view at source ↗

**Figure 11.** Figure 11: Representative LaSCo outcomes showing retained composition-required queries, problem [PITH_FULL_IMAGE:figures/full_fig_p038_11.png] view at source ↗

**Figure 12.** Figure 12: Representative CIRCO outcomes. Even in the smallest benchmark, the audit still surfaces [PITH_FULL_IMAGE:figures/full_fig_p039_12.png] view at source ↗

**Figure 13.** Figure 13: Text-only shortcut on CIRR for E5-Omni (q3981). Text-only finds the target at rank 1 [PITH_FULL_IMAGE:figures/full_fig_p041_13.png] view at source ↗

**Figure 14.** Figure 14: Text-only shortcut on CIRR for GME-Qwen2VL (q765). Text-only finds the target at rank [PITH_FULL_IMAGE:figures/full_fig_p041_14.png] view at source ↗

**Figure 15.** Figure 15: Image-only shortcut on CIRR for LamRA (q496). Image-only finds the target at rank 1 [PITH_FULL_IMAGE:figures/full_fig_p041_15.png] view at source ↗

**Figure 16.** Figure 16: Image-only shortcut on CIRR for MM-Embed (q1789). Image-only finds the target at [PITH_FULL_IMAGE:figures/full_fig_p042_16.png] view at source ↗

**Figure 17.** Figure 17: Both-conditions shortcut on CIRR for E5-Omni (q540). Both text-only and image-only [PITH_FULL_IMAGE:figures/full_fig_p042_17.png] view at source ↗

**Figure 18.** Figure 18: Both-conditions shortcut on CIRR for GME-Qwen2VL (q316). Both text-only and [PITH_FULL_IMAGE:figures/full_fig_p042_18.png] view at source ↗

**Figure 19.** Figure 19: Composition-required on CIRR for GME-Qwen2VL (q1167). Only multimodal retrieval [PITH_FULL_IMAGE:figures/full_fig_p043_19.png] view at source ↗

**Figure 20.** Figure 20: Composition-required on CIRR for LamRA (q1230). Only multimodal retrieval places the [PITH_FULL_IMAGE:figures/full_fig_p043_20.png] view at source ↗

**Figure 21.** Figure 21: Unresolved on CIRR for GME-Qwen2VL (q385). All three variants miss the target [PITH_FULL_IMAGE:figures/full_fig_p043_21.png] view at source ↗

**Figure 22.** Figure 22: Unresolved on CIRR for LamRA (q3102). All three variants miss the target (MM/T/I = [PITH_FULL_IMAGE:figures/full_fig_p044_22.png] view at source ↗

**Figure 23.** Figure 23: Text-only shortcut on FashionIQ for E5-Omni (q4369). Text-only finds the target at rank 1 [PITH_FULL_IMAGE:figures/full_fig_p044_23.png] view at source ↗

**Figure 24.** Figure 24: Text-only shortcut on FashionIQ for GME-Qwen2VL (q5276). Text-only finds the target [PITH_FULL_IMAGE:figures/full_fig_p044_24.png] view at source ↗

**Figure 25.** Figure 25: Image-only shortcut on FashionIQ for GME-Qwen2VL (q2289). Image-only finds the [PITH_FULL_IMAGE:figures/full_fig_p045_25.png] view at source ↗

**Figure 26.** Figure 26: Image-only shortcut on FashionIQ for LamRA (q2289). Image-only finds the target at [PITH_FULL_IMAGE:figures/full_fig_p045_26.png] view at source ↗

**Figure 27.** Figure 27: Both-conditions shortcut on FashionIQ for E5-Omni (q3054). Both text-only and image [PITH_FULL_IMAGE:figures/full_fig_p045_27.png] view at source ↗

**Figure 28.** Figure 28: Both-conditions shortcut on FashionIQ for GME-Qwen2VL (q2791). Both text-only and [PITH_FULL_IMAGE:figures/full_fig_p046_28.png] view at source ↗

**Figure 29.** Figure 29: Composition-required on FashionIQ for E5-Omni (q200). Only multimodal retrieval [PITH_FULL_IMAGE:figures/full_fig_p046_29.png] view at source ↗

**Figure 30.** Figure 30: Composition-required on FashionIQ for GME-Qwen2VL (q459). Only multimodal [PITH_FULL_IMAGE:figures/full_fig_p046_30.png] view at source ↗

**Figure 31.** Figure 31: Unresolved on FashionIQ for GME-Qwen2VL (q2575). All three variants miss the target [PITH_FULL_IMAGE:figures/full_fig_p047_31.png] view at source ↗

**Figure 32.** Figure 32: Unresolved on FashionIQ for LamRA (q1101). All three variants miss the target (MM/T/I [PITH_FULL_IMAGE:figures/full_fig_p047_32.png] view at source ↗

**Figure 33.** Figure 33: Text-only shortcut on LaSCo for E5-Omni (q6127). Text-only finds the target at rank 1 [PITH_FULL_IMAGE:figures/full_fig_p047_33.png] view at source ↗

**Figure 34.** Figure 34: Text-only shortcut on LaSCo for GME-Qwen2VL (q19817). Text-only finds the target at [PITH_FULL_IMAGE:figures/full_fig_p048_34.png] view at source ↗

**Figure 35.** Figure 35: Image-only shortcut on LaSCo for E5-Omni (q13862). Image-only finds the target at rank [PITH_FULL_IMAGE:figures/full_fig_p048_35.png] view at source ↗

**Figure 36.** Figure 36: Image-only shortcut on LaSCo for GME-Qwen2VL (q11010). Image-only finds the target [PITH_FULL_IMAGE:figures/full_fig_p048_36.png] view at source ↗

**Figure 37.** Figure 37: Both-conditions shortcut on LaSCo for E5-Omni (q22248). Both text-only and image-only [PITH_FULL_IMAGE:figures/full_fig_p049_37.png] view at source ↗

**Figure 38.** Figure 38: Both-conditions shortcut on LaSCo for GME-Qwen2VL (q21623). Both text-only and [PITH_FULL_IMAGE:figures/full_fig_p049_38.png] view at source ↗

**Figure 39.** Figure 39: Composition-required on LaSCo for E5-Omni (q4400). Only multimodal retrieval places [PITH_FULL_IMAGE:figures/full_fig_p049_39.png] view at source ↗

**Figure 40.** Figure 40: Composition-required on LaSCo for GME-Qwen2VL (q23545). Only multimodal retrieval [PITH_FULL_IMAGE:figures/full_fig_p050_40.png] view at source ↗

**Figure 41.** Figure 41: Unresolved on LaSCo for E5-Omni (q3030). All three variants miss the target (MM/T/I = [PITH_FULL_IMAGE:figures/full_fig_p050_41.png] view at source ↗

**Figure 42.** Figure 42: Unresolved on LaSCo for GME-Qwen2VL (q22517). All three variants miss the target [PITH_FULL_IMAGE:figures/full_fig_p050_42.png] view at source ↗

read the original abstract

Composed Image Retrieval (CIR) is a multimodal retrieval task where a query consists of a reference image and a textual modification, and the goal is to retrieve a target image satisfying both. In principle, strong performance on CIR benchmarks is assumed to require multimodal composition, i.e., combining complementary information from reference image and textual modification. In this work, we show that this assumption does not always hold. Across four widely used CIR benchmarks and eleven Generalist Multimodal Embedding models, a large fraction of queries can be solved using a single modality (from 32.2% to 83.6%), revealing pervasive unimodal shortcuts. Thus, high CIR performance can arise from unimodal signals rather than true multimodal composition. To better understand this issue, we perform a two-stage audit. First, we identify shortcut-solvable queries through cross-model analysis. Second, we conduct human validation on 4,741 shortcut-free queries, of which only 1,689 are well-formed, with common issues including ambiguous edits and mismatched targets. Re-evaluating models on this validated subset reveals qualitatively different behaviour: queries can no longer be solved with a single modality, and successful retrieval requires combining both inputs. While accuracy decreases, reliance on multimodal information increases. Overall, current CIR benchmarks conflate shortcut-solvable, noisy, and genuinely compositional queries, leading to an overestimation of model capability in multimodal composition.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

This paper shows that many CIR benchmark queries are solvable with one modality alone, and the cleaned subset forces real multimodal use, but the audit steps need tighter validation.

read the letter

The main thing to know is that a big chunk of queries in standard composed image retrieval benchmarks can be solved without actually combining the reference image and the text modification. Across four benchmarks and eleven models the paper finds shortcut rates between 32 and 84 percent, then narrows to 1,689 human-validated well-formed queries where single-modality solutions stop working and models have to use both inputs. That shift in behavior is the useful observation.

Referee Report

2 major / 2 minor

Summary. The paper claims that Composed Image Retrieval (CIR) benchmarks do not always require multimodal composition. Across four standard benchmarks and eleven generalist multimodal embedding models, 32.2% to 83.6% of queries are solvable using only the reference image or only the modification text. A two-stage audit first uses cross-model analysis to flag shortcut-solvable queries, then performs human validation on the remaining 4,741 queries, of which only 1,689 are deemed well-formed. Re-evaluation on this validated subset shows that single-modality solutions no longer suffice and that successful retrieval requires combining both inputs, leading to the conclusion that current benchmarks conflate shortcut-solvable, noisy, and genuinely compositional queries.

Significance. If the two-stage audit is shown to be reliable, the result would be significant for multimodal retrieval research. It provides concrete evidence that high performance on existing CIR benchmarks can arise from unimodal signals rather than composition, and it supplies an auditing procedure (cross-model filtering followed by human review) that could be adopted to improve future benchmark construction. The scale of the analysis (four benchmarks, eleven models, 4,741 human-validated queries) adds weight to the practical implications for model evaluation.

major comments (2)

[§3.2] §3.2 (Cross-model analysis): The precise operational criterion for labeling a query as shortcut-solvable (e.g., minimum number of the eleven models that must succeed with a single modality, retrieval rank threshold, or agreement rule) is not stated. Without this definition the reported range 32.2–83.6% cannot be reproduced or stress-tested for sensitivity to model correlation.
[Human validation subsection] Human validation subsection: No inter-annotator agreement metric (Cohen’s κ, Fleiss’ κ, or raw percentage) is reported for the labeling of the 4,741 shortcut-free queries. This statistic is load-bearing for the claim that only 1,689 queries are well-formed and for the subsequent finding that multimodal reliance increases on the validated subset.

minor comments (2)

[Abstract] Abstract: The percentages 32.2%–83.6% are given as a range; stating the per-benchmark values would make the per-dataset variation immediately visible.
[§3.1] The manuscript would benefit from a short table listing the eleven models, their training data sources, and architectural families to allow readers to assess possible correlation in the cross-model analysis.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive and detailed feedback on our manuscript. We address each major comment below and will revise the paper to improve methodological clarity and reproducibility.

read point-by-point responses

Referee: [§3.2] §3.2 (Cross-model analysis): The precise operational criterion for labeling a query as shortcut-solvable (e.g., minimum number of the eleven models that must succeed with a single modality, retrieval rank threshold, or agreement rule) is not stated. Without this definition the reported range 32.2–83.6% cannot be reproduced or stress-tested for sensitivity to model correlation.

Authors: We thank the referee for pointing out this omission. The exact operational criteria used to label queries as shortcut-solvable (including the minimum number of models succeeding on a single modality, the retrieval rank threshold, and any agreement rules across the eleven models) were not stated with sufficient precision in §3.2. We will add a complete description of these criteria in the revised manuscript so that the reported percentages can be reproduced and sensitivity to model correlation can be evaluated. revision: yes
Referee: [Human validation subsection] Human validation subsection: No inter-annotator agreement metric (Cohen’s κ, Fleiss’ κ, or raw percentage) is reported for the labeling of the 4,741 shortcut-free queries. This statistic is load-bearing for the claim that only 1,689 queries are well-formed and for the subsequent finding that multimodal reliance increases on the validated subset.

Authors: We agree that an inter-annotator agreement metric is important for establishing the reliability of the human validation. We will include this statistic (raw percentage agreement or Fleiss’ κ, as appropriate) for the labeling of the 4,741 shortcut-free queries in the revised human validation subsection. This addition will strengthen the claim that 1,689 queries are well-formed and support the observed increase in multimodal reliance on the validated subset. revision: yes

Circularity Check

0 steps flagged

Empirical audit relies on external models and human validation with no definitional or fitted reduction

full rationale

The paper conducts a two-stage empirical audit: cross-model analysis across 11 independent generalist multimodal embedding models to flag shortcut-solvable queries, followed by human validation on the remaining 4,741 queries. No equations, parameter fitting, or self-citations are used to derive the central percentages (32.2%–83.6% shortcut-solvable) or the final 1,689 well-formed count. These quantities are direct observations from retrieval experiments and annotations, not reductions to inputs by construction. The derivation chain is self-contained against external benchmarks and does not match any enumerated circularity pattern.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claim rests on the domain assumption that single-modality solvability demonstrates absence of required multimodal composition and on the reliability of human judgments for identifying well-formed queries.

axioms (1)

domain assumption Strong performance on CIR benchmarks is assumed to require multimodal composition of reference image and textual modification
Explicitly stated as the principle being tested in the opening of the abstract.

pith-pipeline@v0.9.0 · 5824 in / 1317 out tokens · 57575 ms · 2026-05-20T21:16:41.196347+00:00 · methodology

discussion (0)

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

A query is said to admit a unimodal shortcut if any model can retrieve the target using a single modality... We define the best achievable unimodal ranks: rtxt*(i) = min_p rtxt_p(i), rimg*(i) = min_p rimg_p(i).
IndisputableMonolith/Foundation/RealityFromDistinction.lean reality_from_one_distinction unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

Re-evaluating models on this validated subset reveals qualitatively different behaviour: queries can no longer be solved with a single modality, and successful retrieval requires combining both inputs.

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

44 extracted references · 44 canonical work pages · 2 internal anchors

[1]

Image retrieval on real-life images with pre-trained vision-and-language models

Zheyuan Liu, Cristian Rodriguez-Opazo, Damien Teney, and Stephen Gould. Image retrieval on real-life images with pre-trained vision-and-language models. InICCV, pages 2125–2134. IEEE, 2021

work page 2021
[2]

Fashion IQ: A new dataset towards retrieving images by natural language feedback

Hui Wu, Yupeng Gao, Xiaoxiao Guo, Ziad Al-Halah, Steven Rennie, Kristen Grauman, and Rogério Feris. Fashion IQ: A new dataset towards retrieving images by natural language feedback. InCVPR, pages 11307–11317. Computer Vision Foundation / IEEE, 2021

work page 2021
[3]

Data roaming and quality assessment for composed image retrieval

Matan Levy, Rami Ben-Ari, Nir Darshan, and Dani Lischinski. Data roaming and quality assessment for composed image retrieval. InAAAI, pages 13601–13609. AAAI Press, 2024

work page 2024
[4]

Zero-shot composed image retrieval with textual inversion

Alberto Baldrati, Lorenzo Agnolucci, Marco Bertini, and Alberto Del Bimbo. Zero-shot composed image retrieval with textual inversion. InProceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 15338–15347, 2023

work page 2023
[5]

Making the V in VQA matter: Elevating the role of image understanding in visual question answering

Yash Goyal, Tejas Khot, Douglas Summers-Stay, Dhruv Batra, and Devi Parikh. Making the V in VQA matter: Elevating the role of image understanding in visual question answering. In CVPR, pages 6325–6334. IEEE Computer Society, 2017

work page 2017
[6]

Zemel, Wieland Brendel, Matthias Bethge, and Felix A

Robert Geirhos, Jörn-Henrik Jacobsen, Claudio Michaelis, Richard S. Zemel, Wieland Brendel, Matthias Bethge, and Felix A. Wichmann. Shortcut learning in deep neural networks.Nature Machine Intelligence, 2(11):665–673, 2020

work page 2020
[7]

arXiv preprint arXiv:2603.21687 , year=

Mohammad Asadi, Jack W O’Sullivan, Fang Cao, Tahoura Nedaee, Kamyar Fardi, Fei-Fei Li, Ehsan Adeli, and Euan Ashley. Mirage the illusion of visual understanding.arXiv preprint arXiv:2603.21687, 2026

work page arXiv 2026
[8]

GME: Improving Universal Multimodal Retrieval by Multimodal LLMs

Xin Zhang, Yanzhao Zhang, Wen Xie, Mingxin Li, Ziqi Dai, Dingkun Long, Pengjun Xie, Meishan Zhang, Wenjie Li, and Min Zhang. GME: improving universal multimodal retrieval by multimodal llms.CoRR, abs/2412.16855, 2024. 10

work page internal anchor Pith review Pith/arXiv arXiv 2024
[9]

Mm-embed: Universal multimodal retrieval with multimodal LLMS

Sheng-Chieh Lin, Chankyu Lee, Mohammad Shoeybi, Jimmy Lin, Bryan Catanzaro, and Wei Ping. Mm-embed: Universal multimodal retrieval with multimodal LLMS. InICLR. OpenReview.net, 2025

work page 2025
[10]

Lamra: Large multimodal model as your advanced retrieval assistant

Yikun Liu, Yajie Zhang, Jiayin Cai, Xiaolong Jiang, Yao Hu, Jiangchao Yao, Yanfeng Wang, and Weidi Xie. Lamra: Large multimodal model as your advanced retrieval assistant. InCVPR, pages 4015–4025. Computer Vision Foundation / IEEE, 2025

work page 2025
[11]

e5-omni: Explicit cross-modal alignment for omni-modal embeddings.CoRR, abs/2601.03666, 2026

Haonan Chen, Sicheng Gao, Radu Timofte, Tetsuya Sakai, and Zhicheng Dou. e5-omni: Explicit cross-modal alignment for omni-modal embeddings.CoRR, abs/2601.03666, 2026

work page arXiv 2026
[13]

Mllms are deeply affected by modality bias.CoRR, abs/2505.18657, 2025

Xu Zheng, Chenfei Liao, Yuqian Fu, Kaiyu Lei, Yuanhuiyi Lyu, Lutao Jiang, Bin Ren, Jialei Chen, Jiawen Wang, Chengxin Li, Linfeng Zhang, Danda Pani Paudel, Xuanjing Huang, Yu- Gang Jiang, Nicu Sebe, Dacheng Tao, Luc Van Gool, and Xuming Hu. Mllms are deeply affected by modality bias.CoRR, abs/2505.18657, 2025

work page arXiv 2025
[14]

Vlm2vec-v2: Advancing multimodal embedding for videos, images, and visual documents.Trans

Rui Meng, Ziyan Jiang, Ye Liu, Mingyi Su, Xinyi Yang, Yuepeng Fu, Can Qin, Raghuveer Thirukovalluru, Xuan Zhang, Zeyuan Chen, Ran Xu, Caiming Xiong, Yingbo Zhou, Wenhu Chen, and Semih Yavuz. Vlm2vec-v2: Advancing multimodal embedding for videos, images, and visual documents.Trans. Mach. Learn. Res., 2026, 2026

work page 2026
[15]

Rzenembed: Towards comprehensive multimodal retrieval.CoRR, abs/2510.27350, 2025

Weijian Jian, Yajun Zhang, Dawei Liang, Chunyu Xie, Yixiao He, Dawei Leng, and Yuhui Yin. Rzenembed: Towards comprehensive multimodal retrieval.CoRR, abs/2510.27350, 2025

work page arXiv 2025
[16]

Qwen3-VL-Embedding and Qwen3-VL-Reranker: A Unified Framework for State-of-the-Art Multimodal Retrieval and Ranking

Mingxin Li, Yanzhao Zhang, Dingkun Long, Keqin Chen, Sibo Song, Shuai Bai, Zhibo Yang, Pengjun Xie, An Yang, Dayiheng Liu, Jingren Zhou, and Junyang Lin. Qwen3-vl-embedding and qwen3-vl-reranker: A unified framework for state-of-the-art multimodal retrieval and ranking, 2026. URLhttps://arxiv.org/abs/2601.04720

work page internal anchor Pith review Pith/arXiv arXiv 2026
[17]

Composing text and image for image retrieval - an empirical odyssey

Nam V o, Lu Jiang, Chen Sun, Kevin Murphy, Li-Jia Li, Li Fei-Fei, and James Hays. Composing text and image for image retrieval - an empirical odyssey. InCVPR, pages 6439–6448. Computer Vision Foundation / IEEE, 2019

work page 2019
[18]

Compositional learning of image-text query for image retrieval

Muhammad Umer Anwaar, Egor Labintcev, and Martin Kleinsteuber. Compositional learning of image-text query for image retrieval. InWACV, pages 1140–1149. IEEE, 2021

work page 2021
[19]

SAC: semantic attention composition for text-conditioned image retrieval

Surgan Jandial, Pinkesh Badjatiya, Pranit Chawla, Ayush Chopra, Mausoom Sarkar, and Balaji Krishnamurthy. SAC: semantic attention composition for text-conditioned image retrieval. In IEEE/CVF Winter Conference on Applications of Computer Vision, WACV 2022, Waikoloa, HI, USA, January 3-8, 2022, pages 597–606. IEEE, 2022. doi: 10.1109/W ACV51458.2022.00067....

work page doi:10.1109/w 2022
[20]

Pic2word: Mapping pictures to words for zero-shot composed image retrieval

Kuniaki Saito, Kihyuk Sohn, Xiang Zhang, Chun-Liang Li, Chen-Yu Lee, Kate Saenko, and Tomas Pfister. Pic2word: Mapping pictures to words for zero-shot composed image retrieval. InCVPR, pages 19305–19314. IEEE, 2023

work page 2023
[21]

Zero-shot composed image retrieval with textual inversion

Alberto Baldrati, Lorenzo Agnolucci, Marco Bertini, and Alberto Del Bimbo. Zero-shot composed image retrieval with textual inversion. InICCV, pages 15338–15347. IEEE, 2023

work page 2023
[22]

Language-only training of zero-shot composed image retrieval

Geonmo Gu, Sanghyuk Chun, Wonjae Kim, Yoohoon Kang, and Sangdoo Yun. Language-only training of zero-shot composed image retrieval. InCVPR, pages 13225–13234. IEEE, 2024

work page 2024
[23]

Vision-by- language for training-free compositional image retrieval

Shyamgopal Karthik, Karsten Roth, Massimiliano Mancini, and Zeynep Akata. Vision-by- language for training-free compositional image retrieval. InThe Twelfth International Con- ference on Learning Representations, ICLR 2024, Vienna, Austria, May 7-11, 2024. OpenRe- view.net, 2024. URLhttps://openreview.net/forum?id=EDPxCjXzSb. 11

work page 2024
[24]

CompoDiff: Versatile composed image retrieval with latent diffusion.Trans

Geonmo Gu, Sanghyuk Chun, Wonjae Kim, HeeJae Jun, Yoohoon Kang, and Sangdoo Yun. CompoDiff: Versatile composed image retrieval with latent diffusion.Trans. Mach. Learn. Res., 2024, 2024. URLhttps://openreview.net/forum?id=mKtlzW0bWc

work page 2024
[25]

Spherical linear interpolation and text-anchoring for zero-shot composed image retrieval

Young Kyun Jang, Dat Huynh, Ashish Shah, Wen-Kai Chen, and Ser-Nam Lim. Spherical linear interpolation and text-anchoring for zero-shot composed image retrieval. InComputer Vision - ECCV 2024 - 18th European Conference, Milan, Italy, September 29-October 4, 2024, Proceedings, Part XIX, Lecture Notes in Computer Science, pages 239–254. Springer, 2024. doi:...

work page doi:10.1007/978-3-031-72655-2 2024
[26]

Magiclens: Self-supervised image retrieval with open-ended instructions

Kai Zhang, Yi Luan, Hexiang Hu, Kenton Lee, Siyuan Qiao, Wenhu Chen, Yu Su, and Ming-Wei Chang. Magiclens: Self-supervised image retrieval with open-ended instructions. InICML. OpenReview.net, 2024

work page 2024
[27]

Collm: A large language model for composed image retrieval

Chuong Huynh, Jinyu Yang, Ashish Tawari, Mubarak Shah, Son Tran, Raffay Hamid, Trishul Chilimbi, and Abhinav Shrivastava. Collm: A large language model for composed image retrieval. InCVPR, pages 3994–4004. Computer Vision Foundation / IEEE, 2025

work page 2025
[28]

UniIR: Training and benchmarking universal multimodal information retrievers

Cong Wei, Yang Chen, Haonan Chen, Hexiang Hu, Ge Zhang, Jie Fu, Alan Ritter, and Wenhu Chen. UniIR: Training and benchmarking universal multimodal information retrievers. In ECCV, volume 15145 ofLecture Notes in Computer Science, pages 383–404. Springer, 2024

work page 2024
[29]

Hypothesis only baselines in natural language inference

Adam Poliak, Jason Naradowsky, Aparajita Haldar, Rachel Rudinger, and Benjamin Van Durme. Hypothesis only baselines in natural language inference. InProceedings of the Seventh Joint Conference on Lexical and Computational Semantics, *SEM@NAACL-HLT 2018, New Orleans, Louisiana, USA, June 5-6, 2018, pages 180–191. Association for Computational Linguistics,

work page 2018
[30]

doi: 10.18653/V1/S18-2023

work page doi:10.18653/v1/s18-2023 2023
[31]

Bowman, and Noah A

Suchin Gururangan, Swabha Swayamdipta, Omer Levy, Roy Schwartz, Samuel R. Bowman, and Noah A. Smith. Annotation artifacts in natural language inference data. InNAACL-HLT, pages 107–112. Association for Computational Linguistics, 2018

work page 2018
[32]

Don’t take the easy way out: Ensemble based methods for avoiding known dataset biases

Christopher Clark, Mark Yatskar, and Luke Zettlemoyer. Don’t take the easy way out: Ensemble based methods for avoiding known dataset biases. InEMNLP-IJCNLP, pages 4069–4082. Association for Computational Linguistics, 2019

work page 2019
[33]

Don’t just assume; look and answer: Overcoming priors for visual question answering

Aishwarya Agrawal, Dhruv Batra, Devi Parikh, and Aniruddha Kembhavi. Don’t just assume; look and answer: Overcoming priors for visual question answering. InCVPR, pages 4971–4980. Computer Vision Foundation / IEEE Computer Society, 2018

work page 2018
[34]

Overcoming language priors in visual question answering with adversarial regularization

Sainandan Ramakrishnan, Aishwarya Agrawal, and Stefan Lee. Overcoming language priors in visual question answering with adversarial regularization. InAdvances in Neural Information Processing Systems 31: Annual Conference on Neural Information Processing Systems 2018, NeurIPS 2018, December 3-8, 2018, Montréal, Canada, pages 1548–1558, 2018

work page 2018
[35]

RUBi: Reducing unimodal biases for visual question answering

Rémi Cadène, Corentin Dancette, Hedi Ben-Younes, Matthieu Cord, and Devi Parikh. RUBi: Reducing unimodal biases for visual question answering. InNeurIPS, pages 841–852, 2019

work page 2019
[36]

Counterfactual VQA: A cause-effect look at language bias

Yulei Niu, Kaihua Tang, Hanwang Zhang, Zhiwu Lu, Xian-Sheng Hua, and Ji-Rong Wen. Counterfactual VQA: A cause-effect look at language bias. InCVPR, pages 12700–12710. Computer Vision Foundation / IEEE, 2021

work page 2021
[37]

Beyond question-based biases: Assessing multimodal shortcut learning in visual question answering

Corentin Dancette, Rémi Cadène, Damien Teney, and Matthieu Cord. Beyond question-based biases: Assessing multimodal shortcut learning in visual question answering. InICCV, pages 1574–1583. IEEE, 2021

work page 2021
[38]

Nan Wu, Stanislaw Jastrzebski, Kyunghyun Cho, and Krzysztof J. Geras. Characterizing and overcoming the greedy nature of learning in multi-modal deep neural networks. InICML, volume 162 ofProceedings of Machine Learning Research, pages 24043–24055. PMLR, 2022

work page 2022
[39]

Vision-and-language or vision-for- language? On cross-modal influence in multimodal transformers

Stella Frank, Emanuele Bugliarello, and Desmond Elliott. Vision-and-language or vision-for- language? On cross-modal influence in multimodal transformers. InEMNLP, pages 9847–9857. Association for Computational Linguistics, 2021. 12

work page 2021
[40]

When and why vision-language models behave like bags-of-words, and what to do about it? InICLR

Mert Yuksekgonul, Federico Bianchi, Pratyusha Kalluri, Dan Jurafsky, and James Zou. When and why vision-language models behave like bags-of-words, and what to do about it? InICLR. OpenReview.net, 2023

work page 2023
[41]

Winoground: Probing vision and language models for visio-linguistic compositionality

Tristan Thrush, Ryan Jiang, Max Bartolo, Amanpreet Singh, Adina Williams, Douwe Kiela, and Candace Ross. Winoground: Probing vision and language models for visio-linguistic compositionality. InCVPR, pages 5238–5248. IEEE, 2022

work page 2022
[42]

Does my multimodal model learn cross-modal interactions? It’s harder to tell than you might think! InEMNLP, pages 861–877

Jack Hessel and Lillian Lee. Does my multimodal model learn cross-modal interactions? It’s harder to tell than you might think! InEMNLP, pages 861–877. Association for Computational Linguistics, 2020

work page 2020
[43]

Limitations

Bradley Efron and Robert J. Tibshirani.An Introduction to the Bootstrap. Chapman and Hall, New York, 1993. doi: 10.1007/978-1-4899-4541-9. 13 NeurIPS Paper Checklist 1.Claims Question: Do the main claims made in the abstract and introduction accurately reflect the paper’s contributions and scope? Answer: [Yes] Justification: The abstract and Introduction ...

work page doi:10.1007/978-1-4899-4541-9 1993
[44]

Guidelines: • The answer [N/A] means that the paper does not involve crowdsourcing nor research with human subjects

Institutional review board (IRB) approvals or equivalent for research with human subjects Question: Does the paper describe potential risks incurred by study participants, whether such risks were disclosed to the subjects, and whether Institutional Review Board (IRB) approvals (or an equivalent approval/review based on the requirements of your country or ...

work page
[45]

Is white with shorter sleeves and a flower and is shorter sleeved

We report the original benchmark (Full) together with theshortcut-freesubset (SF) and the validatedsubset (V). For each split, we show the multimodal query score (MM) and the three signed delta columns ∆MM-I, ∆MM-T, and ∆I-T, where I and T denote image-only and text-only queries. Full SF V Dataset Retriever MM∆MM-I∆MM-T∆I-T MM∆MM-I∆MM-T∆I-T MM∆MM-I∆MM-T∆I...

work page

[1] [1]

Image retrieval on real-life images with pre-trained vision-and-language models

Zheyuan Liu, Cristian Rodriguez-Opazo, Damien Teney, and Stephen Gould. Image retrieval on real-life images with pre-trained vision-and-language models. InICCV, pages 2125–2134. IEEE, 2021

work page 2021

[2] [2]

Fashion IQ: A new dataset towards retrieving images by natural language feedback

Hui Wu, Yupeng Gao, Xiaoxiao Guo, Ziad Al-Halah, Steven Rennie, Kristen Grauman, and Rogério Feris. Fashion IQ: A new dataset towards retrieving images by natural language feedback. InCVPR, pages 11307–11317. Computer Vision Foundation / IEEE, 2021

work page 2021

[3] [3]

Data roaming and quality assessment for composed image retrieval

Matan Levy, Rami Ben-Ari, Nir Darshan, and Dani Lischinski. Data roaming and quality assessment for composed image retrieval. InAAAI, pages 13601–13609. AAAI Press, 2024

work page 2024

[4] [4]

Zero-shot composed image retrieval with textual inversion

Alberto Baldrati, Lorenzo Agnolucci, Marco Bertini, and Alberto Del Bimbo. Zero-shot composed image retrieval with textual inversion. InProceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 15338–15347, 2023

work page 2023

[5] [5]

Making the V in VQA matter: Elevating the role of image understanding in visual question answering

Yash Goyal, Tejas Khot, Douglas Summers-Stay, Dhruv Batra, and Devi Parikh. Making the V in VQA matter: Elevating the role of image understanding in visual question answering. In CVPR, pages 6325–6334. IEEE Computer Society, 2017

work page 2017

[6] [6]

Zemel, Wieland Brendel, Matthias Bethge, and Felix A

Robert Geirhos, Jörn-Henrik Jacobsen, Claudio Michaelis, Richard S. Zemel, Wieland Brendel, Matthias Bethge, and Felix A. Wichmann. Shortcut learning in deep neural networks.Nature Machine Intelligence, 2(11):665–673, 2020

work page 2020

[7] [7]

arXiv preprint arXiv:2603.21687 , year=

Mohammad Asadi, Jack W O’Sullivan, Fang Cao, Tahoura Nedaee, Kamyar Fardi, Fei-Fei Li, Ehsan Adeli, and Euan Ashley. Mirage the illusion of visual understanding.arXiv preprint arXiv:2603.21687, 2026

work page arXiv 2026

[8] [8]

GME: Improving Universal Multimodal Retrieval by Multimodal LLMs

Xin Zhang, Yanzhao Zhang, Wen Xie, Mingxin Li, Ziqi Dai, Dingkun Long, Pengjun Xie, Meishan Zhang, Wenjie Li, and Min Zhang. GME: improving universal multimodal retrieval by multimodal llms.CoRR, abs/2412.16855, 2024. 10

work page internal anchor Pith review Pith/arXiv arXiv 2024

[9] [9]

Mm-embed: Universal multimodal retrieval with multimodal LLMS

Sheng-Chieh Lin, Chankyu Lee, Mohammad Shoeybi, Jimmy Lin, Bryan Catanzaro, and Wei Ping. Mm-embed: Universal multimodal retrieval with multimodal LLMS. InICLR. OpenReview.net, 2025

work page 2025

[10] [10]

Lamra: Large multimodal model as your advanced retrieval assistant

Yikun Liu, Yajie Zhang, Jiayin Cai, Xiaolong Jiang, Yao Hu, Jiangchao Yao, Yanfeng Wang, and Weidi Xie. Lamra: Large multimodal model as your advanced retrieval assistant. InCVPR, pages 4015–4025. Computer Vision Foundation / IEEE, 2025

work page 2025

[11] [11]

e5-omni: Explicit cross-modal alignment for omni-modal embeddings.CoRR, abs/2601.03666, 2026

Haonan Chen, Sicheng Gao, Radu Timofte, Tetsuya Sakai, and Zhicheng Dou. e5-omni: Explicit cross-modal alignment for omni-modal embeddings.CoRR, abs/2601.03666, 2026

work page arXiv 2026

[12] [13]

Mllms are deeply affected by modality bias.CoRR, abs/2505.18657, 2025

Xu Zheng, Chenfei Liao, Yuqian Fu, Kaiyu Lei, Yuanhuiyi Lyu, Lutao Jiang, Bin Ren, Jialei Chen, Jiawen Wang, Chengxin Li, Linfeng Zhang, Danda Pani Paudel, Xuanjing Huang, Yu- Gang Jiang, Nicu Sebe, Dacheng Tao, Luc Van Gool, and Xuming Hu. Mllms are deeply affected by modality bias.CoRR, abs/2505.18657, 2025

work page arXiv 2025

[13] [14]

Vlm2vec-v2: Advancing multimodal embedding for videos, images, and visual documents.Trans

Rui Meng, Ziyan Jiang, Ye Liu, Mingyi Su, Xinyi Yang, Yuepeng Fu, Can Qin, Raghuveer Thirukovalluru, Xuan Zhang, Zeyuan Chen, Ran Xu, Caiming Xiong, Yingbo Zhou, Wenhu Chen, and Semih Yavuz. Vlm2vec-v2: Advancing multimodal embedding for videos, images, and visual documents.Trans. Mach. Learn. Res., 2026, 2026

work page 2026

[14] [15]

Rzenembed: Towards comprehensive multimodal retrieval.CoRR, abs/2510.27350, 2025

Weijian Jian, Yajun Zhang, Dawei Liang, Chunyu Xie, Yixiao He, Dawei Leng, and Yuhui Yin. Rzenembed: Towards comprehensive multimodal retrieval.CoRR, abs/2510.27350, 2025

work page arXiv 2025

[15] [16]

Qwen3-VL-Embedding and Qwen3-VL-Reranker: A Unified Framework for State-of-the-Art Multimodal Retrieval and Ranking

Mingxin Li, Yanzhao Zhang, Dingkun Long, Keqin Chen, Sibo Song, Shuai Bai, Zhibo Yang, Pengjun Xie, An Yang, Dayiheng Liu, Jingren Zhou, and Junyang Lin. Qwen3-vl-embedding and qwen3-vl-reranker: A unified framework for state-of-the-art multimodal retrieval and ranking, 2026. URLhttps://arxiv.org/abs/2601.04720

work page internal anchor Pith review Pith/arXiv arXiv 2026

[16] [17]

Composing text and image for image retrieval - an empirical odyssey

Nam V o, Lu Jiang, Chen Sun, Kevin Murphy, Li-Jia Li, Li Fei-Fei, and James Hays. Composing text and image for image retrieval - an empirical odyssey. InCVPR, pages 6439–6448. Computer Vision Foundation / IEEE, 2019

work page 2019

[17] [18]

Compositional learning of image-text query for image retrieval

Muhammad Umer Anwaar, Egor Labintcev, and Martin Kleinsteuber. Compositional learning of image-text query for image retrieval. InWACV, pages 1140–1149. IEEE, 2021

work page 2021

[18] [19]

SAC: semantic attention composition for text-conditioned image retrieval

Surgan Jandial, Pinkesh Badjatiya, Pranit Chawla, Ayush Chopra, Mausoom Sarkar, and Balaji Krishnamurthy. SAC: semantic attention composition for text-conditioned image retrieval. In IEEE/CVF Winter Conference on Applications of Computer Vision, WACV 2022, Waikoloa, HI, USA, January 3-8, 2022, pages 597–606. IEEE, 2022. doi: 10.1109/W ACV51458.2022.00067....

work page doi:10.1109/w 2022

[19] [20]

Pic2word: Mapping pictures to words for zero-shot composed image retrieval

Kuniaki Saito, Kihyuk Sohn, Xiang Zhang, Chun-Liang Li, Chen-Yu Lee, Kate Saenko, and Tomas Pfister. Pic2word: Mapping pictures to words for zero-shot composed image retrieval. InCVPR, pages 19305–19314. IEEE, 2023

work page 2023

[20] [21]

Zero-shot composed image retrieval with textual inversion

Alberto Baldrati, Lorenzo Agnolucci, Marco Bertini, and Alberto Del Bimbo. Zero-shot composed image retrieval with textual inversion. InICCV, pages 15338–15347. IEEE, 2023

work page 2023

[21] [22]

Language-only training of zero-shot composed image retrieval

Geonmo Gu, Sanghyuk Chun, Wonjae Kim, Yoohoon Kang, and Sangdoo Yun. Language-only training of zero-shot composed image retrieval. InCVPR, pages 13225–13234. IEEE, 2024

work page 2024

[22] [23]

Vision-by- language for training-free compositional image retrieval

Shyamgopal Karthik, Karsten Roth, Massimiliano Mancini, and Zeynep Akata. Vision-by- language for training-free compositional image retrieval. InThe Twelfth International Con- ference on Learning Representations, ICLR 2024, Vienna, Austria, May 7-11, 2024. OpenRe- view.net, 2024. URLhttps://openreview.net/forum?id=EDPxCjXzSb. 11

work page 2024

[23] [24]

CompoDiff: Versatile composed image retrieval with latent diffusion.Trans

Geonmo Gu, Sanghyuk Chun, Wonjae Kim, HeeJae Jun, Yoohoon Kang, and Sangdoo Yun. CompoDiff: Versatile composed image retrieval with latent diffusion.Trans. Mach. Learn. Res., 2024, 2024. URLhttps://openreview.net/forum?id=mKtlzW0bWc

work page 2024

[24] [25]

Spherical linear interpolation and text-anchoring for zero-shot composed image retrieval

Young Kyun Jang, Dat Huynh, Ashish Shah, Wen-Kai Chen, and Ser-Nam Lim. Spherical linear interpolation and text-anchoring for zero-shot composed image retrieval. InComputer Vision - ECCV 2024 - 18th European Conference, Milan, Italy, September 29-October 4, 2024, Proceedings, Part XIX, Lecture Notes in Computer Science, pages 239–254. Springer, 2024. doi:...

work page doi:10.1007/978-3-031-72655-2 2024

[25] [26]

Magiclens: Self-supervised image retrieval with open-ended instructions

Kai Zhang, Yi Luan, Hexiang Hu, Kenton Lee, Siyuan Qiao, Wenhu Chen, Yu Su, and Ming-Wei Chang. Magiclens: Self-supervised image retrieval with open-ended instructions. InICML. OpenReview.net, 2024

work page 2024

[26] [27]

Collm: A large language model for composed image retrieval

Chuong Huynh, Jinyu Yang, Ashish Tawari, Mubarak Shah, Son Tran, Raffay Hamid, Trishul Chilimbi, and Abhinav Shrivastava. Collm: A large language model for composed image retrieval. InCVPR, pages 3994–4004. Computer Vision Foundation / IEEE, 2025

work page 2025

[27] [28]

UniIR: Training and benchmarking universal multimodal information retrievers

Cong Wei, Yang Chen, Haonan Chen, Hexiang Hu, Ge Zhang, Jie Fu, Alan Ritter, and Wenhu Chen. UniIR: Training and benchmarking universal multimodal information retrievers. In ECCV, volume 15145 ofLecture Notes in Computer Science, pages 383–404. Springer, 2024

work page 2024

[28] [29]

Hypothesis only baselines in natural language inference

Adam Poliak, Jason Naradowsky, Aparajita Haldar, Rachel Rudinger, and Benjamin Van Durme. Hypothesis only baselines in natural language inference. InProceedings of the Seventh Joint Conference on Lexical and Computational Semantics, *SEM@NAACL-HLT 2018, New Orleans, Louisiana, USA, June 5-6, 2018, pages 180–191. Association for Computational Linguistics,

work page 2018

[29] [30]

doi: 10.18653/V1/S18-2023

work page doi:10.18653/v1/s18-2023 2023

[30] [31]

Bowman, and Noah A

Suchin Gururangan, Swabha Swayamdipta, Omer Levy, Roy Schwartz, Samuel R. Bowman, and Noah A. Smith. Annotation artifacts in natural language inference data. InNAACL-HLT, pages 107–112. Association for Computational Linguistics, 2018

work page 2018

[31] [32]

Don’t take the easy way out: Ensemble based methods for avoiding known dataset biases

Christopher Clark, Mark Yatskar, and Luke Zettlemoyer. Don’t take the easy way out: Ensemble based methods for avoiding known dataset biases. InEMNLP-IJCNLP, pages 4069–4082. Association for Computational Linguistics, 2019

work page 2019

[32] [33]

Don’t just assume; look and answer: Overcoming priors for visual question answering

Aishwarya Agrawal, Dhruv Batra, Devi Parikh, and Aniruddha Kembhavi. Don’t just assume; look and answer: Overcoming priors for visual question answering. InCVPR, pages 4971–4980. Computer Vision Foundation / IEEE Computer Society, 2018

work page 2018

[33] [34]

Overcoming language priors in visual question answering with adversarial regularization

Sainandan Ramakrishnan, Aishwarya Agrawal, and Stefan Lee. Overcoming language priors in visual question answering with adversarial regularization. InAdvances in Neural Information Processing Systems 31: Annual Conference on Neural Information Processing Systems 2018, NeurIPS 2018, December 3-8, 2018, Montréal, Canada, pages 1548–1558, 2018

work page 2018

[34] [35]

RUBi: Reducing unimodal biases for visual question answering

Rémi Cadène, Corentin Dancette, Hedi Ben-Younes, Matthieu Cord, and Devi Parikh. RUBi: Reducing unimodal biases for visual question answering. InNeurIPS, pages 841–852, 2019

work page 2019

[35] [36]

Counterfactual VQA: A cause-effect look at language bias

Yulei Niu, Kaihua Tang, Hanwang Zhang, Zhiwu Lu, Xian-Sheng Hua, and Ji-Rong Wen. Counterfactual VQA: A cause-effect look at language bias. InCVPR, pages 12700–12710. Computer Vision Foundation / IEEE, 2021

work page 2021

[36] [37]

Beyond question-based biases: Assessing multimodal shortcut learning in visual question answering

Corentin Dancette, Rémi Cadène, Damien Teney, and Matthieu Cord. Beyond question-based biases: Assessing multimodal shortcut learning in visual question answering. InICCV, pages 1574–1583. IEEE, 2021

work page 2021

[37] [38]

Nan Wu, Stanislaw Jastrzebski, Kyunghyun Cho, and Krzysztof J. Geras. Characterizing and overcoming the greedy nature of learning in multi-modal deep neural networks. InICML, volume 162 ofProceedings of Machine Learning Research, pages 24043–24055. PMLR, 2022

work page 2022

[38] [39]

Vision-and-language or vision-for- language? On cross-modal influence in multimodal transformers

Stella Frank, Emanuele Bugliarello, and Desmond Elliott. Vision-and-language or vision-for- language? On cross-modal influence in multimodal transformers. InEMNLP, pages 9847–9857. Association for Computational Linguistics, 2021. 12

work page 2021

[39] [40]

When and why vision-language models behave like bags-of-words, and what to do about it? InICLR

Mert Yuksekgonul, Federico Bianchi, Pratyusha Kalluri, Dan Jurafsky, and James Zou. When and why vision-language models behave like bags-of-words, and what to do about it? InICLR. OpenReview.net, 2023

work page 2023

[40] [41]

Winoground: Probing vision and language models for visio-linguistic compositionality

Tristan Thrush, Ryan Jiang, Max Bartolo, Amanpreet Singh, Adina Williams, Douwe Kiela, and Candace Ross. Winoground: Probing vision and language models for visio-linguistic compositionality. InCVPR, pages 5238–5248. IEEE, 2022

work page 2022

[41] [42]

Does my multimodal model learn cross-modal interactions? It’s harder to tell than you might think! InEMNLP, pages 861–877

Jack Hessel and Lillian Lee. Does my multimodal model learn cross-modal interactions? It’s harder to tell than you might think! InEMNLP, pages 861–877. Association for Computational Linguistics, 2020

work page 2020

[42] [43]

Limitations

Bradley Efron and Robert J. Tibshirani.An Introduction to the Bootstrap. Chapman and Hall, New York, 1993. doi: 10.1007/978-1-4899-4541-9. 13 NeurIPS Paper Checklist 1.Claims Question: Do the main claims made in the abstract and introduction accurately reflect the paper’s contributions and scope? Answer: [Yes] Justification: The abstract and Introduction ...

work page doi:10.1007/978-1-4899-4541-9 1993

[43] [44]

Guidelines: • The answer [N/A] means that the paper does not involve crowdsourcing nor research with human subjects

Institutional review board (IRB) approvals or equivalent for research with human subjects Question: Does the paper describe potential risks incurred by study participants, whether such risks were disclosed to the subjects, and whether Institutional Review Board (IRB) approvals (or an equivalent approval/review based on the requirements of your country or ...

work page

[44] [45]

Is white with shorter sleeves and a flower and is shorter sleeved

We report the original benchmark (Full) together with theshortcut-freesubset (SF) and the validatedsubset (V). For each split, we show the multimodal query score (MM) and the three signed delta columns ∆MM-I, ∆MM-T, and ∆I-T, where I and T denote image-only and text-only queries. Full SF V Dataset Retriever MM∆MM-I∆MM-T∆I-T MM∆MM-I∆MM-T∆I-T MM∆MM-I∆MM-T∆I...

work page