
arxiv: 2605.14787 · v2 · pith:YNYPSIM2new · submitted 2026-05-14 · 💻 cs.CV · cs.CL

Do Composed Image Retrieval Benchmarks Require Multimodal Composition?

Pith reviewed 2026-05-20 21:16 UTC · model grok-4.3

classification 💻 cs.CV cs.CL
keywords composed image retrieval · multimodal composition · unimodal shortcuts · benchmark audit · image retrieval · multimodal models · shortcut detection · query validation

The pith

Many queries in composed image retrieval benchmarks can be solved using only a single modality instead of requiring true multimodal composition.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests the assumption that strong performance on composed image retrieval benchmarks requires models to combine information from a reference image and a text modification. It shows that across four common benchmarks and eleven models, between 32.2% and 83.6% of queries can be answered from the image alone or the text alone. A two-stage audit first flags shortcut-solvable queries via cross-model patterns and then applies human review to the remainder, finding that most of the reviewed queries are malformed or ambiguous. On the small set of well-formed queries that remain, models can no longer succeed with one input and must use both, though overall accuracy falls.

Core claim

The authors demonstrate that a large fraction of queries in four standard CIR benchmarks can be solved using unimodal signals alone: from 32.2% to 83.6% across the four benchmarks and eleven models. Through cross-model analysis and human validation of 4,741 shortcut-free queries, they identify only 1,689 as well-formed; ambiguous edits and mismatched targets are common in the rest. Re-evaluation on this validated subset shows that successful retrieval there requires multimodal composition and cannot be achieved with a single modality.
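As a quick check on the counts above, the share of human-reviewed queries that survive validation can be computed directly. This is a minimal sketch that uses only the two figures quoted in the text; the variable names are illustrative.

```python
# Counts quoted in the core claim.
reviewed = 4741      # shortcut-free queries sent to human validation
well_formed = 1689   # queries judged well-formed by annotators

rejected = reviewed - well_formed
retained_pct = 100 * well_formed / reviewed
rejected_pct = 100 * rejected / reviewed

print(f"retained: {well_formed} ({retained_pct:.1f}%)")  # retained: 1689 (35.6%)
print(f"rejected: {rejected} ({rejected_pct:.1f}%)")     # rejected: 3052 (64.4%)
```

Roughly one in three shortcut-free queries is retained, which is why the validated subset is small relative to the original benchmarks.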

What carries the argument

A two-stage audit that first uses cross-model agreement to detect shortcut-solvable queries and then applies human validation to confirm which remaining queries are well-formed and truly compositional.
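The first stage can be pictured as a per-query labelling step. The sketch below is illustrative only: the five outcome names follow the figure captions on this page, but the rank threshold `k`, the `min_models` agreement rule, and both function names are assumptions, since the paper's exact criterion is not given in this summary.

```python
from enum import Enum


class Outcome(Enum):
    TEXT_ONLY_SHORTCUT = "text-only shortcut"
    IMAGE_ONLY_SHORTCUT = "image-only shortcut"
    BOTH_SHORTCUT = "both-conditions shortcut"
    COMPOSITION_REQUIRED = "composition-required"
    UNRESOLVED = "unresolved"


def classify_query(rank_mm, rank_text, rank_image, k=1):
    """Label one query for one retriever.

    Each argument is the rank of the target image (1 = best) under the
    multimodal, text-only, or image-only query. The cut-off k is a
    hypothetical threshold, not the paper's.
    """
    text_hit = rank_text <= k
    image_hit = rank_image <= k
    if text_hit and image_hit:
        return Outcome.BOTH_SHORTCUT
    if text_hit:
        return Outcome.TEXT_ONLY_SHORTCUT
    if image_hit:
        return Outcome.IMAGE_ONLY_SHORTCUT
    if rank_mm <= k:
        return Outcome.COMPOSITION_REQUIRED
    return Outcome.UNRESOLVED


SHORTCUTS = {
    Outcome.TEXT_ONLY_SHORTCUT,
    Outcome.IMAGE_ONLY_SHORTCUT,
    Outcome.BOTH_SHORTCUT,
}


def is_shortcut_solvable(ranks_per_model, k=1, min_models=1):
    """Cross-model rule (assumed): flag the query if at least min_models
    retrievers solve it from a single modality.

    ranks_per_model holds one (rank_mm, rank_text, rank_image) tuple per retriever.
    """
    n_shortcut = sum(classify_query(*r, k=k) in SHORTCUTS for r in ranks_per_model)
    return n_shortcut >= min_models


# Two hypothetical retrievers on one query.
print(classify_query(1, 12, 30).value)                      # composition-required
print(is_shortcut_solvable([(1, 12, 30), (1, 1, 30)]))      # True
```

Queries that no retriever solves from a single modality would then pass to the second stage, human validation.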

If this is right

  • High scores on existing CIR benchmarks can come from unimodal shortcuts rather than actual composition of image and text.
  • Accuracy drops on the cleaned subset while models must now combine both inputs to succeed.
  • Benchmarks mix shortcut-solvable queries, noisy queries, and genuinely compositional queries.
  • Reported multimodal capabilities of current models are overestimated on standard tests.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Future benchmarks could add explicit checks to remove unimodal shortcuts before release.
  • The same shortcut problem may appear in other multimodal retrieval or editing tasks.
  • Training on only the validated compositional queries might push models toward better use of both modalities.

Load-bearing premise

Cross-model patterns and human judgments correctly separate queries that need both modalities from those solvable by shortcuts or that are malformed.

What would settle it

If models still reach high accuracy on the 1,689 validated queries when given only the reference image or only the text, the claim that these queries require multimodal composition would be false.
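One way to run this check is to compare recall under the three query conditions on the validated subset. The following is a sketch under stated assumptions: the metric (Recall@k), the value of `k`, and the function names are illustrative choices, and the ranks in the usage example are toy values rather than results from the paper.

```python
def recall_at_k(ranks, k):
    """Fraction of queries whose target is ranked within the top k (rank 1 = best)."""
    return sum(r <= k for r in ranks) / len(ranks)


def composition_check(ranks_mm, ranks_text, ranks_image, k=10):
    """Compare multimodal recall with the best single-modality recall.

    A margin near zero (or negative) would suggest the subset is still
    solvable from one modality; a large positive margin is what the
    paper's claim predicts.
    """
    mm = recall_at_k(ranks_mm, k)
    text = recall_at_k(ranks_text, k)
    image = recall_at_k(ranks_image, k)
    return {"mm": mm, "text": text, "image": image, "margin": mm - max(text, image)}


# Toy ranks for four queries (not data from the paper).
result = composition_check([1, 2, 1, 5], [20, 30, 1, 40], [15, 3, 60, 70], k=1)
print(result)
```

What counts as "high accuracy" for the unimodal conditions is not fixed by the text, so any decision threshold on the margin would have to be chosen and justified separately.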

Figures

Figures reproduced from arXiv: 2605.14787 by Alessandro De Bellis, Alessandro Suglia, Aryo Pradipta Gema, Claudio Pomo, Dietmar Jannach, Matteo Attimonelli, Monica Sekoyan, Pasquale Minervini, Rohit Saxena, Tommaso Di Noia, Wai-Chung Kwan.

Figure 1: Examples of text-only shortcuts (top), image-only shortcuts (middle), and valid queries ...
Figure 2: Representative failures from the audited CIRCUS ...
Figure 3: Retriever-averaged normalised composition gap based on full-catalogue nDCG. Each panel ...
Figure 4: Retriever-averaged normalised composition gap based on full-catalogue MRR. Each panel ...
Figure 5: Pairwise Cohen’s κ values on the binary VALID/invalid decision across the 9-annotator overlap subset, shown with anonymized annotator identities.
Figure 6: Pairwise exact agreement rates on the full issue signature across the same 9-annotator ...
Figure 7: Annotation interface used in the human validation study.
Figure 8: Representative FashionIQ annotation outcomes. Each page now shows three stacked panels ...
Figure 9: Additional FashionIQ and CIRR examples, including the remaining FashionIQ ...
Figure 10: Representative CIRR outcomes showing non-unique targets, broad queries, and ambiguous ...
Figure 11: Representative LaSCo outcomes showing retained composition-required queries, problem ...
Figure 12: Representative CIRCO outcomes. Even in the smallest benchmark, the audit still surfaces ...
Figure 13: Text-only shortcut on CIRR for E5-Omni (q3981). Text-only finds the target at rank 1 ...
Figure 14: Text-only shortcut on CIRR for GME-Qwen2VL (q765). Text-only finds the target at rank ...
Figure 15: Image-only shortcut on CIRR for LamRA (q496). Image-only finds the target at rank 1 ...
Figure 16: Image-only shortcut on CIRR for MM-Embed (q1789). Image-only finds the target at ...
Figure 17: Both-conditions shortcut on CIRR for E5-Omni (q540). Both text-only and image-only ...
Figure 18: Both-conditions shortcut on CIRR for GME-Qwen2VL (q316). Both text-only and ...
Figure 19: Composition-required on CIRR for GME-Qwen2VL (q1167). Only multimodal retrieval ...
Figure 20: Composition-required on CIRR for LamRA (q1230). Only multimodal retrieval places the ...
Figure 21: Unresolved on CIRR for GME-Qwen2VL (q385). All three variants miss the target ...
Figure 22: Unresolved on CIRR for LamRA (q3102). All three variants miss the target (MM/T/I = ...
Figure 23: Text-only shortcut on FashionIQ for E5-Omni (q4369). Text-only finds the target at rank 1 ...
Figure 24: Text-only shortcut on FashionIQ for GME-Qwen2VL (q5276). Text-only finds the target ...
Figure 25: Image-only shortcut on FashionIQ for GME-Qwen2VL (q2289). Image-only finds the ...
Figure 26: Image-only shortcut on FashionIQ for LamRA (q2289). Image-only finds the target at ...
Figure 27: Both-conditions shortcut on FashionIQ for E5-Omni (q3054). Both text-only and image ...
Figure 28: Both-conditions shortcut on FashionIQ for GME-Qwen2VL (q2791). Both text-only and ...
Figure 29: Composition-required on FashionIQ for E5-Omni (q200). Only multimodal retrieval ...
Figure 30: Composition-required on FashionIQ for GME-Qwen2VL (q459). Only multimodal ...
Figure 31: Unresolved on FashionIQ for GME-Qwen2VL (q2575). All three variants miss the target ...
Figure 32: Unresolved on FashionIQ for LamRA (q1101). All three variants miss the target (MM/T/I ...
Figure 33: Text-only shortcut on LaSCo for E5-Omni (q6127). Text-only finds the target at rank 1 ...
Figure 34: Text-only shortcut on LaSCo for GME-Qwen2VL (q19817). Text-only finds the target at ...
Figure 35: Image-only shortcut on LaSCo for E5-Omni (q13862). Image-only finds the target at rank ...
Figure 36: Image-only shortcut on LaSCo for GME-Qwen2VL (q11010). Image-only finds the target ...
Figure 37: Both-conditions shortcut on LaSCo for E5-Omni (q22248). Both text-only and image-only ...
Figure 38: Both-conditions shortcut on LaSCo for GME-Qwen2VL (q21623). Both text-only and ...
Figure 39: Composition-required on LaSCo for E5-Omni (q4400). Only multimodal retrieval places ...
Figure 40: Composition-required on LaSCo for GME-Qwen2VL (q23545). Only multimodal retrieval ...
Figure 41: Unresolved on LaSCo for E5-Omni (q3030). All three variants miss the target (MM/T/I = ...
Figure 42: Unresolved on LaSCo for GME-Qwen2VL (q22517). All three variants miss the target ...

Original abstract

Composed Image Retrieval (CIR) is a multimodal retrieval task where a query consists of a reference image and a textual modification, and the goal is to retrieve a target image satisfying both. In principle, strong performance on CIR benchmarks is assumed to require multimodal composition, i.e., combining complementary information from reference image and textual modification. In this work, we show that this assumption does not always hold. Across four widely used CIR benchmarks and eleven Generalist Multimodal Embedding models, a large fraction of queries can be solved using a single modality (from 32.2% to 83.6%), revealing pervasive unimodal shortcuts. Thus, high CIR performance can arise from unimodal signals rather than true multimodal composition. To better understand this issue, we perform a two-stage audit. First, we identify shortcut-solvable queries through cross-model analysis. Second, we conduct human validation on 4,741 shortcut-free queries, of which only 1,689 are well-formed, with common issues including ambiguous edits and mismatched targets. Re-evaluating models on this validated subset reveals qualitatively different behaviour: queries can no longer be solved with a single modality, and successful retrieval requires combining both inputs. While accuracy decreases, reliance on multimodal information increases. Overall, current CIR benchmarks conflate shortcut-solvable, noisy, and genuinely compositional queries, leading to an overestimation of model capability in multimodal composition.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper claims that Composed Image Retrieval (CIR) benchmarks do not always require multimodal composition. Across four standard benchmarks and eleven generalist multimodal embedding models, 32.2% to 83.6% of queries are solvable using only the reference image or only the modification text. A two-stage audit first uses cross-model analysis to flag shortcut-solvable queries, then performs human validation on the remaining 4,741 queries, of which only 1,689 are deemed well-formed. Re-evaluation on this validated subset shows that single-modality solutions no longer suffice and that successful retrieval requires combining both inputs, leading to the conclusion that current benchmarks conflate shortcut-solvable, noisy, and genuinely compositional queries.

Significance. If the two-stage audit is shown to be reliable, the result would be significant for multimodal retrieval research. It provides concrete evidence that high performance on existing CIR benchmarks can arise from unimodal signals rather than composition, and it supplies an auditing procedure (cross-model filtering followed by human review) that could be adopted to improve future benchmark construction. The scale of the analysis (four benchmarks, eleven models, 4,741 human-validated queries) adds weight to the practical implications for model evaluation.

major comments (2)
  1. [§3.2] §3.2 (Cross-model analysis): The precise operational criterion for labeling a query as shortcut-solvable (e.g., minimum number of the eleven models that must succeed with a single modality, retrieval rank threshold, or agreement rule) is not stated. Without this definition the reported range 32.2–83.6% cannot be reproduced or stress-tested for sensitivity to model correlation.
  2. [Human validation subsection] Human validation subsection: No inter-annotator agreement metric (Cohen’s κ, Fleiss’ κ, or raw percentage) is reported for the labeling of the 4,741 shortcut-free queries. This statistic is load-bearing for the claim that only 1,689 queries are well-formed and for the subsequent finding that multimodal reliance increases on the validated subset.
minor comments (2)
  1. [Abstract] Abstract: The percentages 32.2%–83.6% are given as a range; stating the per-benchmark values would make the per-dataset variation immediately visible.
  2. [§3.1] The manuscript would benefit from a short table listing the eleven models, their training data sources, and architectural families to allow readers to assess possible correlation in the cross-model analysis.
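The sensitivity analysis requested in the first major comment could look like the sketch below. It assumes a hypothetical agreement rule (a query is shortcut-solvable when at least `min_models` retrievers solve it from a single modality) and sweeps that parameter; the function names and the toy matrix are illustrative, not taken from the paper.

```python
def shortcut_rate(unimodal_hits, min_models):
    """Share of queries flagged as shortcut-solvable.

    unimodal_hits has one row per query and one boolean per model:
    True if that model retrieved the target from a single modality.
    """
    flagged = sum(sum(row) >= min_models for row in unimodal_hits)
    return flagged / len(unimodal_hits)


def sweep_min_models(unimodal_hits):
    """Shortcut rate for every agreement level from 1 model to all models."""
    n_models = len(unimodal_hits[0])
    return {m: shortcut_rate(unimodal_hits, m) for m in range(1, n_models + 1)}


# Toy matrix: 4 queries x 3 models (not data from the paper).
hits = [
    [True, True, True],
    [True, False, False],
    [False, False, False],
    [True, True, False],
]
print(sweep_min_models(hits))  # {1: 0.75, 2: 0.5, 3: 0.25}
```

A steep drop between the loosest and strictest setting would indicate that the reported shortcut rate depends heavily on the agreement rule, which is the concern the comment raises about model correlation.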

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive and detailed feedback on our manuscript. We address each major comment below and will revise the paper to improve methodological clarity and reproducibility.

Point-by-point responses
  1. Referee: [§3.2] §3.2 (Cross-model analysis): The precise operational criterion for labeling a query as shortcut-solvable (e.g., minimum number of the eleven models that must succeed with a single modality, retrieval rank threshold, or agreement rule) is not stated. Without this definition the reported range 32.2–83.6% cannot be reproduced or stress-tested for sensitivity to model correlation.

    Authors: We thank the referee for pointing out this omission. The exact operational criteria used to label queries as shortcut-solvable (including the minimum number of models succeeding on a single modality, the retrieval rank threshold, and any agreement rules across the eleven models) were not stated with sufficient precision in §3.2. We will add a complete description of these criteria in the revised manuscript so that the reported percentages can be reproduced and sensitivity to model correlation can be evaluated. revision: yes

  2. Referee: [Human validation subsection] Human validation subsection: No inter-annotator agreement metric (Cohen’s κ, Fleiss’ κ, or raw percentage) is reported for the labeling of the 4,741 shortcut-free queries. This statistic is load-bearing for the claim that only 1,689 queries are well-formed and for the subsequent finding that multimodal reliance increases on the validated subset.

    Authors: We agree that an inter-annotator agreement metric is important for establishing the reliability of the human validation. We will include this statistic (raw percentage agreement or Fleiss’ κ, as appropriate) for the labeling of the 4,741 shortcut-free queries in the revised human validation subsection. This addition will strengthen the claim that 1,689 queries are well-formed and support the observed increase in multimodal reliance on the validated subset. revision: yes
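For reference, pairwise Cohen's κ on a binary VALID/invalid decision can be computed as in the sketch below. This is a generic implementation of the standard formula, κ = (p_o − p_e) / (1 − p_e), and not the authors' code; the label lists are toy values.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators who labelled the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)
    # Observed agreement: share of items with identical labels.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement from each annotator's marginal label frequencies.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[c] * freq_b[c] for c in set(freq_a) | set(freq_b)) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used a single identical label throughout
    return (observed - expected) / (1 - expected)


# Toy labels for eight queries (not data from the paper).
ann_a = ["VALID", "VALID", "invalid", "invalid", "VALID", "invalid", "VALID", "VALID"]
ann_b = ["VALID", "invalid", "invalid", "invalid", "VALID", "invalid", "VALID", "VALID"]
print(cohens_kappa(ann_a, ann_b))  # 0.75
```

With more than two annotators, this pairwise statistic is computed for every annotator pair; Fleiss' κ is the usual single-number alternative.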

Circularity Check

0 steps flagged

Empirical audit relies on external models and human validation with no definitional or fitted reduction

Full rationale

The paper conducts a two-stage empirical audit: cross-model analysis across 11 independent generalist multimodal embedding models to flag shortcut-solvable queries, followed by human validation on the remaining 4,741 queries. No equations, parameter fitting, or self-citations are used to derive the central percentages (32.2%–83.6% shortcut-solvable) or the final 1,689 well-formed count. These quantities are direct observations from retrieval experiments and annotations, not reductions to inputs by construction. The derivation chain is self-contained against external benchmarks and does not match any enumerated circularity pattern.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the domain assumption that single-modality solvability demonstrates absence of required multimodal composition and on the reliability of human judgments for identifying well-formed queries.

axioms (1)
  • domain assumption: Strong performance on CIR benchmarks is assumed to require multimodal composition of reference image and textual modification.
    Explicitly stated as the principle being tested in the opening of the abstract.

pith-pipeline@v0.9.0 · 5824 in / 1317 out tokens · 57575 ms · 2026-05-20T21:16:41.196347+00:00 · methodology


