pith. machine review for the scientific record.

arxiv: 2604.05393 · v1 · submitted 2026-04-07 · 💻 cs.CV · cs.MM

Recognition: no theorem link

Beyond Semantic Search: Towards Referential Anchoring in Composed Image Retrieval

Authors on Pith no claims yet

Pith reviewed 2026-05-10 19:37 UTC · model grok-4.3

classification 💻 cs.CV cs.MM
keywords composed image retrieval · instance-level fidelity · bounding box anchoring · attention modulator · fine-grained retrieval · multimodal queries · OACIR task · OACIRR benchmark

The pith

Bounding boxes anchor composed image queries to enforce exact instance matches instead of loose semantics.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces Object-Anchored Composed Image Retrieval as a task that adds a user bounding box to a reference image and modification text, requiring the system to retrieve only the specific instance while applying the described changes. Standard composed retrieval often succeeds on broad meaning but returns the wrong object when similar instances exist in the gallery. The authors release OACIRR, a benchmark of more than 160,000 real-world quadruples that includes hard-negative instance distractors across multiple domains. They then present AdaFocal, whose Context-Aware Attention Modulator dynamically strengthens focus on the boxed region while still processing the text instruction, producing higher instance fidelity than prior models.
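
To make the task's input concrete, here is a minimal sketch of what one OACIRR quadruple might look like as a data record; the field names, types, and example values are illustrative assumptions, not the benchmark's released schema.

```python
from dataclasses import dataclass
from typing import Tuple

@dataclass
class OACIRQuery:
    """One OACIR quadruple: reference image, anchoring box, edit text, target.

    All field names are hypothetical; the released OACIRR format may differ.
    """
    reference_image: str                     # path or URL of the reference image
    instance_box: Tuple[int, int, int, int]  # (x_min, y_min, x_max, y_max) in pixels, anchoring the instance
    modification_text: str                   # natural-language edit to apply
    target_image: str                        # gallery image containing the same instance after the edit

query = OACIRQuery(
    reference_image="ref_0001.jpg",
    instance_box=(120, 64, 340, 290),
    modification_text="The car is now viewed from a front angle on a snowy mountain road.",
    target_image="gallery_0458.jpg",
)
```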

Core claim

Object-Anchored Composed Image Retrieval requires strict instance-level consistency: the output image must contain the exact object marked by the bounding box in the reference image after the modification described in text. AdaFocal implements this via a Context-Aware Attention Modulator that adaptively intensifies attention inside the anchored region and balances it against the broader compositional context, outperforming existing compositional retrieval models on the OACIRR benchmark in instance preservation.

What carries the argument

The Context-Aware Attention Modulator, which adaptively intensifies attention inside the user-specified bounding-box region while integrating the modification text.
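
The page gives no equations for the modulator, so what follows is a hedged sketch of one plausible reading: image patch tokens whose centers fall inside the user's box receive an additive boost on the attention logits, controlled by a modulation scalar β (Figure 5 ablates a scalar of that name, though its exact role is not spelled out here). The function names and the additive-logit formulation are assumptions, not the authors' implementation.

```python
import torch

def box_to_patch_mask(box, image_size, grid_size):
    """Mark ViT patch tokens whose centers fall inside a pixel-space box.

    box: (x_min, y_min, x_max, y_max) in pixels; image_size: (H, W);
    grid_size: (rows, cols) of the patch grid. Returns a flat bool mask.
    """
    H, W = image_size
    rows, cols = grid_size
    ys = (torch.arange(rows) + 0.5) * (H / rows)   # patch-center y coordinates
    xs = (torch.arange(cols) + 0.5) * (W / cols)   # patch-center x coordinates
    cy, cx = torch.meshgrid(ys, xs, indexing="ij")
    x0, y0, x1, y1 = box
    inside = (cx >= x0) & (cx <= x1) & (cy >= y0) & (cy <= y1)
    return inside.flatten()                        # (rows * cols,)

def modulated_attention(q, k, v, patch_mask, beta=1.0):
    """Scaled dot-product attention with a logit boost on boxed patch tokens.

    q: (T_q, d) query tokens (e.g. fused text tokens); k, v: (T_k, d) image
    patch tokens; patch_mask: (T_k,) bool; beta: modulation scalar (assumed
    additive on the logits in this sketch).
    """
    logits = q @ k.T / k.shape[-1] ** 0.5          # (T_q, T_k)
    logits = logits + beta * patch_mask.float()    # intensify attention inside the box
    weights = torch.softmax(logits, dim=-1)
    return weights @ v                             # (T_q, d)

# Toy usage: 4 fused query tokens attending over a 14x14 patch grid.
d, grid = 32, (14, 14)
mask = box_to_patch_mask((60, 40, 180, 200), image_size=(224, 224), grid_size=grid)
q = torch.randn(4, d)
patches = torch.randn(grid[0] * grid[1], d)
out = modulated_attention(q, patches, patches, mask, beta=2.0)
print(out.shape)  # torch.Size([4, 32])
```

A context-aware version would presumably predict β, or a per-token gate, from the query itself rather than fixing it, which is where the adaptive balancing between the anchored instance and the surrounding context would enter.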

If this is right

  • Composed retrieval models must treat instance identity as a first-class constraint alongside semantic modification.
  • Benchmarks for multimodal retrieval will need galleries that deliberately include near-identical distractors at the object level.
  • Attention mechanisms in vision-language models can be conditioned on explicit spatial anchors supplied at query time.
  • Future systems may separate instance selection from textual editing as two distinct stages in the retrieval pipeline.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Search interfaces could add lightweight object selection tools so that users mark the target instance directly on the reference photo.
  • The same anchoring idea might transfer to video retrieval or 3D scene search where a spatial or temporal pointer disambiguates the instance.
  • If bounding boxes prove cumbersome for users, point clicks or scribbles could serve as lighter alternatives while preserving the core instance-fidelity goal.

Load-bearing premise

Users will supply bounding boxes as a practical and sufficient way to specify which exact instance they want retrieved.

What would settle it

An ablation or user study showing that instance-level accuracy gains disappear once the bounding-box input is removed or that users rarely provide such boxes in realistic queries.

Figures

Figures reproduced from arXiv: 2604.05393 by Bing Li, Chunfeng Yuan, Jun Gao, Weiming Hu, Yinan Zhou, Yuxin Chen, Yuxin Yang, Ziqi Zhang, Zongyang Ma.

Figure 1. Overview of the Object-Anchored Composed Image Retrieval (OACIR) task and our OACIRR dataset.

Figure 2. The multi-stage construction pipeline for the OACIRR benchmark.

Figure 3. Instance distribution of the OACIRR benchmark.

Figure 4. Overall architecture of our proposed AdaFocal framework.

Figure 5. Ablation study on the Modulation Scalar β.

Figure 6. Qualitative comparison of our AdaFocal and the Baseline on the OACIRR benchmark. Green boxes indicate the ground-truth target, yellow boxes indicate instance-correct but semantically incorrect results, and all other retrieved images are marked with red boxes.

Figure 7. A curated collage of representative instances from the OACIRR benchmark.

Figure 8. Qualitative comparison of our AdaFocal and the Baseline on the OACIRR benchmark. Green boxes indicate the ground-truth target, yellow boxes indicate instance-correct but semantically incorrect results, and all other retrieved images are marked with red boxes.
original abstract

Composed Image Retrieval (CIR) has demonstrated significant potential by enabling flexible multimodal queries that combine a reference image and modification text. However, CIR inherently prioritizes semantic matching, struggling to reliably retrieve a user-specified instance across contexts. In practice, emphasizing concrete instance fidelity over broad semantics is often more consequential. In this work, we propose Object-Anchored Composed Image Retrieval (OACIR), a novel fine-grained retrieval task that mandates strict instance-level consistency. To advance research on this task, we construct OACIRR (OACIR on Real-world images), the first large-scale, multi-domain benchmark comprising over 160K quadruples and four challenging candidate galleries enriched with hard-negative instance distractors. Each quadruple augments the compositional query with a bounding box that visually anchors the object in the reference image, providing a precise and flexible way to ensure instance preservation. To address the OACIR task, we propose AdaFocal, a framework featuring a Context-Aware Attention Modulator that adaptively intensifies attention within the specified instance region, dynamically balancing focus between the anchored instance and the broader compositional context. Extensive experiments demonstrate that AdaFocal substantially outperforms existing compositional retrieval models, particularly in maintaining instance-level fidelity, thereby establishing a robust baseline for this challenging task while opening new directions for more flexible, instance-aware retrieval systems.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 2 minor

Summary. The manuscript introduces Object-Anchored Composed Image Retrieval (OACIR), a fine-grained task that augments composed image retrieval queries with an explicit user-provided bounding box to enforce instance-level fidelity rather than broad semantic matching. It releases the OACIRR benchmark (over 160K quadruples across multiple domains, with four galleries containing hard-negative instance distractors) and proposes the AdaFocal framework, whose core component is a Context-Aware Attention Modulator that adaptively intensifies attention inside the anchored region while balancing compositional context. The central empirical claim is that AdaFocal substantially outperforms existing compositional retrieval models on instance-level consistency metrics.

Significance. If the performance gains are shown to arise from the modulator rather than from the additional bounding-box input, the work would be significant: it supplies a concrete task definition, a large-scale multi-domain benchmark with controlled hard negatives, and an initial baseline that shifts emphasis from semantic retrieval toward referential anchoring. The benchmark construction itself is a reusable contribution for future instance-aware retrieval research.

major comments (1)
  1. [Abstract and Experimental Evaluation] The headline claim that AdaFocal 'substantially outperforms existing compositional retrieval models' is not yet supported by a controlled comparison. Standard baselines (CLIP, ARTEMIS, etc.) were designed for text-plus-reference-image queries without spatial anchors. The manuscript gives no indication that these baselines were re-implemented with equivalent box-conditioned attention or feature masking inside the anchored region. Without that adaptation, any reported gap could be explained by the extra input modality rather than by the proposed Context-Aware Attention Modulator, undermining the central empirical contribution.
minor comments (2)
  1. [Abstract] Quantitative results, error bars, ablation tables, and dataset statistics are entirely absent, making it impossible to assess the magnitude or statistical reliability of the claimed gains.
  2. [Benchmark Construction] The exact procedure for constructing the 160K quadruples, the four candidate galleries, and the hard-negative instance distractors should be detailed with statistics (e.g., number of domains, distractor selection criteria) to allow reproducibility.

Simulated Author's Rebuttal

1 response · 0 unresolved

We sincerely thank the referee for the detailed and constructive review. We appreciate the acknowledgment of the OACIR task definition, the OACIRR benchmark, and the potential value of shifting toward referential anchoring. We address the single major comment below and will revise the manuscript to incorporate a controlled comparison that isolates the contribution of the Context-Aware Attention Modulator.

point-by-point responses
  1. Referee: [Abstract and Experimental Evaluation] The headline claim that AdaFocal 'substantially outperforms existing compositional retrieval models' is not yet supported by a controlled comparison. Standard baselines (CLIP, ARTEMIS, etc.) were designed for text-plus-reference-image queries without spatial anchors. The manuscript gives no indication that these baselines were re-implemented with equivalent box-conditioned attention or feature masking inside the anchored region. Without that adaptation, any reported gap could be explained by the extra input modality rather than by the proposed Context-Aware Attention Modulator, undermining the central empirical contribution.

    Authors: We agree that the current evaluation does not fully isolate the effect of the Context-Aware Attention Modulator from the simple availability of the bounding-box input. The original experiments applied standard CIR baselines to the OACIR queries (reference image plus modification text) without explicit box conditioning, because the baselines were not originally designed for spatial anchors. In the revised manuscript we will add controlled experiments in which the baselines are adapted to the anchored setting via feature masking outside the provided bounding box and/or by injecting box-derived spatial embeddings. We will report the resulting metrics on the OACIRR galleries and update both the abstract and the experimental section to reflect these comparisons. This will allow readers to assess whether AdaFocal's adaptive focusing yields gains beyond what can be achieved by straightforward box conditioning alone. revision: yes
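
As a rough illustration of the control the rebuttal promises, a box-conditioned baseline could be approximated by down-weighting image patch features outside the user box before the baseline's usual fusion step. The sketch below is an assumption about one such adaptation, not the authors' protocol; the `keep_context` factor and the grid geometry are hypothetical.

```python
import torch

def mask_outside_box(patch_feats, box, image_size, grid_size, keep_context=0.1):
    """Hypothetical box-conditioned adaptation of a standard CIR image encoder.

    Patch tokens outside the user box are down-weighted to keep_context
    (0.0 would mask them entirely); patch_feats: (rows*cols, d) embeddings
    from the frozen baseline encoder.
    """
    H, W = image_size
    rows, cols = grid_size
    ys = (torch.arange(rows) + 0.5) * (H / rows)   # patch-center y coordinates
    xs = (torch.arange(cols) + 0.5) * (W / cols)   # patch-center x coordinates
    cy, cx = torch.meshgrid(ys, xs, indexing="ij")
    x0, y0, x1, y1 = box
    inside = ((cx >= x0) & (cx <= x1) & (cy >= y0) & (cy <= y1)).flatten()
    scale = torch.where(inside,
                        torch.ones_like(inside, dtype=torch.float),
                        torch.full_like(inside, keep_context, dtype=torch.float))
    return patch_feats * scale.unsqueeze(-1)

# The adapted baseline would pool these masked features and fuse them with the
# modification text exactly as before, so any remaining gap to AdaFocal could be
# attributed to adaptive modulation rather than to the box input itself.
patch_feats = torch.randn(14 * 14, 768)
masked = mask_outside_box(patch_feats, box=(60, 40, 180, 200),
                          image_size=(224, 224), grid_size=(14, 14))
print(masked.shape)  # torch.Size([196, 768])
```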

Circularity Check

0 steps flagged

No significant circularity; claims are empirical

full rationale

The paper introduces a new task (OACIR) that augments standard CIR queries with an explicit bounding box for instance anchoring, constructs the OACIRR benchmark with over 160K quadruples and hard-negative galleries, and proposes the AdaFocal model featuring a Context-Aware Attention Modulator. All performance claims, including substantial outperformance on instance-level fidelity, are presented strictly as outcomes of extensive experiments on this new benchmark. No equations, analytical derivations, first-principles predictions, or parameter-fitting steps are described that could reduce by construction to self-referential definitions or fitted inputs. Potential concerns about baseline adaptation for the bounding-box input relate to experimental controls rather than any load-bearing circularity in a derivation chain. The work is therefore self-contained as an empirical contribution with no circular steps.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 3 invented entities

The central claim rests on the domain assumption that instance-level fidelity is often more important than semantic matching and on the newly introduced task, benchmark, and attention modulator; no free parameters or external axioms are stated in the abstract.

axioms (1)
  • domain assumption Emphasizing concrete instance fidelity over broad semantics is often more consequential in retrieval applications.
    Explicitly stated in the abstract as motivation for the new task.
invented entities (3)
  • OACIR task no independent evidence
    purpose: Mandates strict instance-level consistency in composed image retrieval via bounding-box anchors.
    Newly defined task not present in prior CIR work.
  • OACIRR benchmark no independent evidence
    purpose: Large-scale multi-domain dataset with 160K quadruples and hard-negative galleries for evaluating instance fidelity.
    First such benchmark claimed in the abstract.
  • AdaFocal framework no independent evidence
    purpose: Context-Aware Attention Modulator that adaptively intensifies attention on the anchored instance region.
    Newly proposed method to solve the OACIR task.

pith-pipeline@v0.9.0 · 5564 in / 1402 out tokens · 46523 ms · 2026-05-10T19:37:55.188350+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

96 extracted references · 11 canonical work pages · 3 internal anchors

  1. [1]

     MyVLM: Personalizing VLMs for user-specific queries

     Yuval Alaluf, Elad Richardson, Sergey Tulyakov, Kfir Aberman, and Daniel Cohen-Or. MyVLM: Personalizing VLMs for user-specific queries. In European Conference on Computer Vision, pages 73–91. Springer, 2024.

  2. [2]

     Qwen2.5-VL Technical Report

     Shuai Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Sibo Song, Kai Dang, Peng Wang, Shijie Wang, Jun Tang, et al. Qwen2.5-VL technical report. arXiv preprint arXiv:2502.13923, 2025.

  3. [3]

     Products-10K: A large-scale product recognition dataset

     Yalong Bai, Yuxiang Chen, Wei Yu, Linfang Wang, and Wei Zhang. Products-10K: A large-scale product recognition dataset. arXiv preprint arXiv:2008.10545, 2020.

  4. [4]

     Sentence-level prompts benefit composed image retrieval

     Yang Bai, Xinxing Xu, Yong Liu, Salman Khan, Fahad Khan, Wangmeng Zuo, Rick Siow Mong Goh, and Chun-Mei Feng. Sentence-level prompts benefit composed image retrieval. In The Twelfth International Conference on Learning Representations, 2024.

  5. [5]

     Effective conditioned and composed image retrieval combining CLIP-based features

     Alberto Baldrati, Marco Bertini, Tiberio Uricchio, and Alberto Del Bimbo. Effective conditioned and composed image retrieval combining CLIP-based features. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 21466–21474, 2022.

  6. [6]

     Zero-shot composed image retrieval with textual inversion

     Alberto Baldrati, Lorenzo Agnolucci, Marco Bertini, and Alberto Del Bimbo. Zero-shot composed image retrieval with textual inversion. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 15338–15347.

  7. [7]

     Composed image retrieval using contrastive learning and task-oriented CLIP-based features

     Alberto Baldrati, Marco Bertini, Tiberio Uricchio, and Alberto Del Bimbo. Composed image retrieval using contrastive learning and task-oriented CLIP-based features. ACM Transactions on Multimedia Computing, Communications and Applications, 20(3):1–24, 2023.

  8. [8]

     InstructPix2Pix: Learning to follow image editing instructions

     Tim Brooks, Aleksander Holynski, and Alexei A Efros. InstructPix2Pix: Learning to follow image editing instructions. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 18392–18402, 2023.

  9. [9]

     Conceptual 12M: Pushing web-scale image-text pre-training to recognize long-tail visual concepts

     Soravit Changpinyo, Piyush Sharma, Nan Ding, and Radu Soricut. Conceptual 12M: Pushing web-scale image-text pre-training to recognize long-tail visual concepts. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 3558–3568, 2021.

  10. [10]

     Image search with text feedback by visiolinguistic attention learning

     Yanbei Chen, Shaogang Gong, and Loris Bazzani. Image search with text feedback by visiolinguistic attention learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 3001–3011.

  11. [11]

     "This is my unicorn, Fluffy": Personalizing frozen vision-language representations

     Niv Cohen, Rinon Gal, Eli A Meirom, Gal Chechik, and Yuval Atzmon. "This is my unicorn, Fluffy": Personalizing frozen vision-language representations. In European Conference on Computer Vision, pages 558–577. Springer, 2022.

  12. [12]

     ARTEMIS: Attention-based retrieval with text-explicit matching and implicit similarity

     Ginger Delmas, Rafael S Rezende, Gabriela Csurka, and Diane Larlus. ARTEMIS: Attention-based retrieval with text-explicit matching and implicit similarity. In The Tenth International Conference on Learning Representations, 2022.

  13. [13]

     DeepFashion2: A versatile benchmark for detection, pose estimation, segmentation and re-identification of clothing images

     Yuying Ge, Ruimao Zhang, Xiaogang Wang, Xiaoou Tang, and Ping Luo. DeepFashion2: A versatile benchmark for detection, pose estimation, segmentation and re-identification of clothing images. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 5337–5345, 2019.

  14. [14]

     Deep image retrieval: Learning global representations for image search

     Albert Gordo, Jon Almazán, Jerome Revaud, and Diane Larlus. Deep image retrieval: Learning global representations for image search. In European Conference on Computer Vision, pages 241–257. Springer, 2016.

  15. [15]

     CompoDiff: Versatile composed image retrieval with latent diffusion

     Geonmo Gu, Sanghyuk Chun, Wonjae Kim, HeeJae Jun, Yoohoon Kang, and Sangdoo Yun. CompoDiff: Versatile composed image retrieval with latent diffusion. Transactions on Machine Learning Research, 2024. Expert Certification.

  16. [16]

     Language-only training of zero-shot composed image retrieval

     Geonmo Gu, Sanghyuk Chun, Wonjae Kim, Yoohoon Kang, and Sangdoo Yun. Language-only training of zero-shot composed image retrieval. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 13225–13234, 2024.

  17. [17]

     Celebrities-ReID: A benchmark for clothes variation in long-term person re-identification

     Yan Huang, Qiang Wu, Jingsong Xu, and Yi Zhong. Celebrities-ReID: A benchmark for clothes variation in long-term person re-identification. In 2019 International Joint Conference on Neural Networks (IJCNN), pages 1–8. IEEE, 2019.

  18. [18]

     CaLa: Complementary association learning for augmenting composed image retrieval

     Xintong Jiang, Yaxiong Wang, Mengjian Li, Yujiao Wu, Bingwen Hu, and Xueming Qian. CaLa: Complementary association learning for augmenting composed image retrieval. In Proceedings of the 47th International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 2177–2187, 2024.

  19. [19]

     Vision-by-language for training-free compositional image retrieval

     Shyamgopal Karthik, Karsten Roth, Massimiliano Mancini, and Zeynep Akata. Vision-by-language for training-free compositional image retrieval. In The Twelfth International Conference on Learning Representations, 2024.

  20. [20]

     3D object representations for fine-grained categorization

     Jonathan Krause, Michael Stark, Jia Deng, and Li Fei-Fei. 3D object representations for fine-grained categorization. In 2013 IEEE International Conference on Computer Vision Workshops, pages 554–561, 2013.

  21. [21]

     Data roaming and quality assessment for composed image retrieval

     Matan Levy, Rami Ben-Ari, Nir Darshan, and Dani Lischinski. Data roaming and quality assessment for composed image retrieval. In Proceedings of the AAAI Conference on Artificial Intelligence, pages 2991–2999, 2024.

  22. [22]

     Automatic synthesis of high-quality triplet data for composed image retrieval

     Haiwen Li, Delong Liu, Zhaohui Hou, Zhicheng Zhao, and Fei Su. Automatic synthesis of high-quality triplet data for composed image retrieval. arXiv preprint arXiv:2507.05970, 2025.

  23. [23]

     BLIP-2: Bootstrapping language-image pre-training with frozen image encoders and large language models

     Junnan Li, Dongxu Li, Silvio Savarese, and Steven Hoi. BLIP-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. In International Conference on Machine Learning, pages 19730–19742. PMLR, 2023.

  24. [24]

     U-MARVEL: Unveiling key factors for universal multimodal retrieval via embedding learning with MLLMs

     Xiaojie Li, Chu Li, Shi-Zhe Chen, and Xi Chen. U-MARVEL: Unveiling key factors for universal multimodal retrieval via embedding learning with MLLMs. arXiv preprint arXiv:2507.14902, 2025.

  25. [25]

     Large language-geometry model: When LLM meets equivariance

     Zongzhao Li, Jiacheng Cen, Bing Su, Tingyang Xu, Yu Rong, Deli Zhao, and Wenbing Huang. Large language-geometry model: When LLM meets equivariance. In Proceedings of the 42nd International Conference on Machine Learning, 2025.

  26. [26]

     From macro to micro: Benchmarking microscopic spatial intelligence on molecules via vision-language models

     Zongzhao Li, Xiangzhe Kong, Jiahui Su, Zongyang Ma, Mingze Li, Songyou Li, Yuelin Zhang, Yu Rong, Tingyang Xu, Deli Zhao, et al. From macro to micro: Benchmarking microscopic spatial intelligence on molecules via vision-language models. arXiv preprint arXiv:2512.10867, 2025.

  27. [27]

     STAR-R1: Spatial transformation reasoning by reinforcing multimodal LLMs

     Zongzhao Li, Zongyang Ma, Mingze Li, Songyou Li, Yu Rong, Tingyang Xu, Ziqi Zhang, Deli Zhao, and Wenbing Huang. STAR-R1: Spatial transformation reasoning by reinforcing multimodal LLMs. arXiv preprint arXiv:2505.15804, 2025.

  28. [28]

     MM-Embed: Universal multimodal retrieval with multimodal LLMs

     Sheng-Chieh Lin, Chankyu Lee, Mohammad Shoeybi, Jimmy Lin, Bryan Catanzaro, and Wei Ping. MM-Embed: Universal multimodal retrieval with multimodal LLMs. In The Thirteenth International Conference on Learning Representations, 2025.

  29. [29]

     Automatic synthetic data and fine-grained adaptive feature alignment for composed person retrieval

     Delong Liu, Haiwen Li, Zhaohui Hou, Zhicheng Zhao, Fei Su, and Yuan Dong. Automatic synthetic data and fine-grained adaptive feature alignment for composed person retrieval. In The Thirty-ninth Annual Conference on Neural Information Processing Systems, 2025.

  30. [30]

     LamRA: Large multimodal model as your advanced retrieval assistant

     Yikun Liu, Yajie Zhang, Jiayin Cai, Xiaolong Jiang, Yao Hu, Jiangchao Yao, Yanfeng Wang, and Weidi Xie. LamRA: Large multimodal model as your advanced retrieval assistant. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 4015–4025, 2025.

  31. [31]

     Image retrieval on real-life images with pre-trained vision-and-language models

     Zheyuan Liu, Cristian Rodriguez-Opazo, Damien Teney, and Stephen Gould. Image retrieval on real-life images with pre-trained vision-and-language models. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 2125–2134, 2021.

  32. [32]

     Candidate set re-ranking for composed image retrieval with dual multi-modal encoder

     Zheyuan Liu, Weixuan Sun, Damien Teney, and Stephen Gould. Candidate set re-ranking for composed image retrieval with dual multi-modal encoder. Transactions on Machine Learning Research, 2024.

  33. [33]

     Decoupled weight decay regularization

     I. Loshchilov. Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101, 2017.

  34. [34]

     Bag of tricks and a strong baseline for deep person re-identification

     Hao Luo, Youzhi Gu, Xingyu Liao, Shenqi Lai, and Wei Jiang. Bag of tricks and a strong baseline for deep person re-identification. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, 2019.

  35. [35]

     MTEB: Massive text embedding benchmark

     Niklas Muennighoff, Nouamane Tazi, Loïc Magne, and Nils Reimers. MTEB: Massive text embedding benchmark. In Proceedings of the 17th Conference of the European Chapter of the Association for Computational Linguistics, pages 2014–2037, 2023.

  36. [36]

     Adversarial NLI: A new benchmark for natural language understanding

     Yixin Nie, Adina Williams, Emily Dinan, Mohit Bansal, Jason Weston, and Douwe Kiela. Adversarial NLI: A new benchmark for natural language understanding. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 4885–4901, 2020.

  37. [37]

     Large-scale image retrieval with attentive deep local features

     Hyeonwoo Noh, Andre Araujo, Jack Sim, Tobias Weyand, and Bohyung Han. Large-scale image retrieval with attentive deep local features. In Proceedings of the IEEE International Conference on Computer Vision, pages 3456–3465, 2017.

  38. [38]

     Pic2Word: Mapping pictures to words for zero-shot composed image retrieval

     Kuniaki Saito, Kihyuk Sohn, Xiang Zhang, Chun-Liang Li, Chen-Yu Lee, Kate Saenko, and Tomas Pfister. Pic2Word: Mapping pictures to words for zero-shot composed image retrieval. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 19305–19314, 2023.

  39. [39]

     CoVR: Learning composed video retrieval from web video captions

     Lucas Ventura, Antoine Yang, Cordelia Schmid, and Gül Varol. CoVR: Learning composed video retrieval from web video captions. In Proceedings of the AAAI Conference on Artificial Intelligence, pages 5270–5279, 2024.

  40. [40]

     Composing text and image for image retrieval - an empirical odyssey

     Nam Vo, Lu Jiang, Chen Sun, Kevin Murphy, Li-Jia Li, Li Fei-Fei, and James Hays. Composing text and image for image retrieval - an empirical odyssey. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 6439–6448, 2019.

  41. [41]

     CAMP: Cross-modal adaptive message passing for text-image retrieval

     Zihao Wang, Xihui Liu, Hongsheng Li, Lu Sheng, Junjie Yan, Xiaogang Wang, and Jing Shao. CAMP: Cross-modal adaptive message passing for text-image retrieval. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 5764–5773, 2019.

  42. [42]

     Learning to reduce dual-level discrepancy for infrared-visible person re-identification

     Zhixiang Wang, Zheng Wang, Yinqiang Zheng, Yung-Yu Chuang, and Shin'ichi Satoh. Learning to reduce dual-level discrepancy for infrared-visible person re-identification. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 618–626, 2019.

  43. [43]

     UniIR: Training and benchmarking universal multimodal information retrievers

     Cong Wei, Yang Chen, Haonan Chen, Hexiang Hu, Ge Zhang, Jie Fu, Alan Ritter, and Wenhu Chen. UniIR: Training and benchmarking universal multimodal information retrievers. In European Conference on Computer Vision, pages 387–404. Springer, 2024.

  44. [44]

     Google Landmarks Dataset v2: A large-scale benchmark for instance-level recognition and retrieval

     Tobias Weyand, Andre Araujo, Bingyi Cao, and Jack Sim. Google Landmarks Dataset v2: A large-scale benchmark for instance-level recognition and retrieval. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 2575–2584, 2020.

  45. [45]

     Fashion IQ: A new dataset towards retrieving images by natural language feedback

     Hui Wu, Yupeng Gao, Xiaoxiao Guo, Ziad Al-Halah, Steven Rennie, Kristen Grauman, and Rogerio Feris. Fashion IQ: A new dataset towards retrieving images by natural language feedback. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 11307–11317, 2021.

  46. [46]

     DetailFusion: A dual-branch framework with detail enhancement for composed image retrieval

     Yuxin Yang, Yinan Zhou, Yuxin Chen, Ziqi Zhang, Zongyang Ma, Chunfeng Yuan, Bing Li, Lin Song, Jun Gao, Peng Li, and Weiming Hu. DetailFusion: A dual-branch framework with detail enhancement for composed image retrieval. arXiv preprint arXiv:2505.17796, 2025.

  47. [47]

     Good is bad: Causality inspired cloth-debiasing for cloth-changing person re-identification

     Zhengwei Yang, Meng Lin, Xian Zhong, Yu Wu, and Zheng Wang. Good is bad: Causality inspired cloth-debiasing for cloth-changing person re-identification. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 1472–1481, 2023.

  48. [48]

     LDRE: LLM-based divergent reasoning and ensemble for zero-shot composed image retrieval

     Zhenyu Yang, Dizhan Xue, Shengsheng Qian, Weiming Dong, and Changsheng Xu. LDRE: LLM-based divergent reasoning and ensemble for zero-shot composed image retrieval. In Proceedings of the 47th International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 80–90, 2024.

  49. [49]

     Meta-personalizing vision-language models to find named instances in video

     Chun-Hsiao Yeh, Bryan Russell, Josef Sivic, Fabian Caba Heilbron, and Simon Jenni. Meta-personalizing vision-language models to find named instances in video. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 19123–19132, 2023.

  50. [50]

     MagicLens: Self-supervised image retrieval with open-ended instructions

     Kai Zhang, Yi Luan, Hexiang Hu, Kenton Lee, Siyuan Qiao, Wenhu Chen, Yu Su, and Ming-Wei Chang. MagicLens: Self-supervised image retrieval with open-ended instructions. In Proceedings of the 41st International Conference on Machine Learning, pages 59403–59420. PMLR, 2024.

  51. [51]

     Context-aware attention network for image-text retrieval

     Qi Zhang, Zhen Lei, Zhaoxiang Zhang, and Stan Z Li. Context-aware attention network for image-text retrieval. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 3536–3545, 2020.

  52. [52]

     GME: Improving universal multimodal retrieval by multimodal LLMs

     Xin Zhang, Yanzhao Zhang, Wen Xie, Mingxin Li, Ziqi Dai, Dingkun Long, Pengjun Xie, Meishan Zhang, Wenjie Li, and Min Zhang. GME: Improving universal multimodal retrieval by multimodal LLMs. arXiv preprint arXiv:2412.16855, 2024.

  53. [53]

     Prompt Highlighter: Interactive control for multi-modal LLMs

     Yuechen Zhang, Shengju Qian, Bohao Peng, Shu Liu, and Jiaya Jia. Prompt Highlighter: Interactive control for multi-modal LLMs. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 13215–13224, 2024.

  54. [54]

     An open and comprehensive pipeline for unified object grounding and detection

     Xiangyu Zhao, Yicheng Chen, Shilin Xu, Xiangtai Li, Xinjiang Wang, Yining Li, and Haian Huang. An open and comprehensive pipeline for unified object grounding and detection. arXiv preprint arXiv:2401.02361, 2024.

  55. [55]

     Group-aware label transfer for domain adaptive person re-identification

     Kecheng Zheng, Wu Liu, Lingxiao He, Tao Mei, Jiebo Luo, and Zheng-Jun Zha. Group-aware label transfer for domain adaptive person re-identification. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 5310–5319, 2021.
