Recognition: no theorem link
Beyond Semantic Search: Towards Referential Anchoring in Composed Image Retrieval
Pith reviewed 2026-05-10 19:37 UTC · model grok-4.3
The pith
Bounding boxes anchor composed image queries to enforce exact instance matches instead of loose semantics.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Object-Anchored Composed Image Retrieval requires strict instance-level consistency: the output image must contain the exact object marked by the bounding box in the reference image after the modification described in text. AdaFocal implements this via a Context-Aware Attention Modulator that adaptively intensifies attention inside the anchored region and balances it against the broader compositional context, outperforming existing compositional retrieval models on the OACIRR benchmark in instance preservation.
What carries the argument
The Context-Aware Attention Modulator, which adaptively intensifies attention inside the user-specified bounding-box region while integrating the modification text.
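The review does not reproduce the modulator's equations, so the following is a minimal sketch of what box-conditioned attention modulation could look like: attention logits for patches inside the anchored region get a multiplicative boost, and a blend weight balances the anchored pooled feature against the global context. All names (`box_patch_mask`, `modulated_attention`, `alpha`, `beta`) and the grid assumptions are illustrative, not AdaFocal's actual implementation.

```python
import numpy as np

def box_patch_mask(box, grid=14, image_size=224):
    """Boolean mask over a grid x grid layout of ViT patches: True where
    a patch center falls inside the (x0, y0, x1, y1) bounding box."""
    step = image_size / grid
    centers = (np.arange(grid) + 0.5) * step
    ys, xs = np.meshgrid(centers, centers, indexing="ij")
    x0, y0, x1, y1 = box
    return ((xs >= x0) & (xs < x1) & (ys >= y0) & (ys < y1)).reshape(-1)

def modulated_attention(query, patch_feats, inside, alpha=2.0, beta=0.5):
    """Softmax attention over patch features with inside-box logits boosted
    by log(alpha); beta blends the anchored pooled feature with the global
    context vector. Purely illustrative, not the paper's module."""
    logits = patch_feats @ query / np.sqrt(query.size)
    logits = logits + np.log(alpha) * inside           # intensify anchored region
    weights = np.exp(logits - logits.max())
    weights /= weights.sum()                           # softmax over all patches
    anchored = weights[inside] @ patch_feats[inside] / max(weights[inside].sum(), 1e-8)
    global_ctx = weights @ patch_feats
    return beta * anchored + (1.0 - beta) * global_ctx
```

A dynamic version of this sketch would learn `alpha` and `beta` from the query rather than fixing them, which is presumably what "adaptively intensifies" and "dynamically balancing" refer to.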
If this is right
- Composed retrieval models must treat instance identity as a first-class constraint alongside semantic modification.
- Benchmarks for multimodal retrieval will need galleries that deliberately include near-identical distractors at the object level.
- Attention mechanisms in vision-language models can be conditioned on explicit spatial anchors supplied at query time.
- Future systems may separate instance selection from textual editing as two distinct stages in the retrieval pipeline.
Where Pith is reading between the lines
- Search interfaces could add lightweight object selection tools so that users mark the target instance directly on the reference photo.
- The same anchoring idea might transfer to video retrieval or 3D scene search where a spatial or temporal pointer disambiguates the instance.
- If bounding boxes prove cumbersome for users, point clicks or scribbles could serve as lighter alternatives while preserving the core instance-fidelity goal.
Load-bearing premise
Users will supply bounding boxes as a practical and sufficient way to specify which exact instance they want retrieved.
What would settle it
An ablation or user study showing that instance-level accuracy gains disappear once the bounding-box input is removed or that users rarely provide such boxes in realistic queries.
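Such an ablation reduces to comparing a standard retrieval metric with and without the box input. A minimal Recall@K scorer over embedding similarities (a hypothetical helper, not code from the paper):

```python
import numpy as np

def recall_at_k(query_embs, gallery_embs, targets, k=1):
    """Fraction of queries whose ground-truth gallery index appears in the
    top-k of the cosine-similarity ranking."""
    q = query_embs / np.linalg.norm(query_embs, axis=1, keepdims=True)
    g = gallery_embs / np.linalg.norm(gallery_embs, axis=1, keepdims=True)
    topk = np.argsort(-(q @ g.T), axis=1)[:, :k]      # indices of k most similar
    return float(np.mean([t in row for t, row in zip(targets, topk)]))
```

Running this on query embeddings produced with and without the bounding-box anchor would directly quantify how much of the instance-level gain the box itself is responsible for.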
Original abstract
Composed Image Retrieval (CIR) has demonstrated significant potential by enabling flexible multimodal queries that combine a reference image and modification text. However, CIR inherently prioritizes semantic matching, struggling to reliably retrieve a user-specified instance across contexts. In practice, emphasizing concrete instance fidelity over broad semantics is often more consequential. In this work, we propose Object-Anchored Composed Image Retrieval (OACIR), a novel fine-grained retrieval task that mandates strict instance-level consistency. To advance research on this task, we construct OACIRR (OACIR on Real-world images), the first large-scale, multi-domain benchmark comprising over 160K quadruples and four challenging candidate galleries enriched with hard-negative instance distractors. Each quadruple augments the compositional query with a bounding box that visually anchors the object in the reference image, providing a precise and flexible way to ensure instance preservation. To address the OACIR task, we propose AdaFocal, a framework featuring a Context-Aware Attention Modulator that adaptively intensifies attention within the specified instance region, dynamically balancing focus between the anchored instance and the broader compositional context. Extensive experiments demonstrate that AdaFocal substantially outperforms existing compositional retrieval models, particularly in maintaining instance-level fidelity, thereby establishing a robust baseline for this challenging task while opening new directions for more flexible, instance-aware retrieval systems.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces Object-Anchored Composed Image Retrieval (OACIR), a fine-grained task that augments composed image retrieval queries with an explicit user-provided bounding box to enforce instance-level fidelity rather than broad semantic matching. It releases the OACIRR benchmark (over 160K quadruples across multiple domains, with four galleries containing hard-negative instance distractors) and proposes the AdaFocal framework, whose core component is a Context-Aware Attention Modulator that adaptively intensifies attention inside the anchored region while balancing compositional context. The central empirical claim is that AdaFocal substantially outperforms existing compositional retrieval models on instance-level consistency metrics.
Significance. If the performance gains are shown to arise from the modulator rather than from the additional bounding-box input, the work would be significant: it supplies a concrete task definition, a large-scale multi-domain benchmark with controlled hard negatives, and an initial baseline that shifts emphasis from semantic retrieval toward referential anchoring. The benchmark construction itself is a reusable contribution for future instance-aware retrieval research.
major comments (1)
- [Abstract and Experimental Evaluation] The headline claim that AdaFocal 'substantially outperforms existing compositional retrieval models' is not yet supported by a controlled comparison. Standard baselines (CLIP, ARTEMIS, etc.) were designed for text-plus-reference-image queries without spatial anchors, and the manuscript gives no indication that they were re-implemented with equivalent box-conditioned attention or feature masking inside the anchored region. Without that adaptation, any reported gap could be explained by the extra input modality rather than by the proposed Context-Aware Attention Modulator, undermining the central empirical contribution.
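One form such a baseline adaptation could take (a hedged sketch, not anything the manuscript describes): scale down patch features outside the anchored box before the baseline's own pooling step, so the box signal enters without any architectural change. `mask_outside_box` and the `keep_context` default are invented names.

```python
import numpy as np

def mask_outside_box(patch_feats, inside, keep_context=0.1):
    """Box-conditioned adaptation of a box-agnostic baseline: down-weight
    patch features outside the anchored region instead of changing the
    architecture. keep_context > 0 retains a residue of global context
    so the modification text can still reference the scene."""
    scale = np.where(inside, 1.0, keep_context)[:, None]
    return patch_feats * scale
```

If baselines equipped with this kind of masking close most of the gap to AdaFocal, the modulator's contribution is small; if they do not, the adaptive focusing is doing real work.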
minor comments (2)
- [Abstract] Quantitative results, error bars, ablation tables, and dataset statistics are entirely absent, making it impossible to assess the magnitude or statistical reliability of the claimed gains.
- [Benchmark Construction] The exact procedure for constructing the 160K quadruples, the four candidate galleries, and the hard-negative instance distractors should be detailed with statistics (e.g., number of domains, distractor selection criteria) to allow reproducibility.
Simulated Author's Rebuttal
We sincerely thank the referee for the detailed and constructive review. We appreciate the acknowledgment of the OACIR task definition, the OACIRR benchmark, and the potential value of shifting toward referential anchoring. We address the single major comment below and will revise the manuscript to incorporate a controlled comparison that isolates the contribution of the Context-Aware Attention Modulator.
Point-by-point responses
- Referee: [Abstract and Experimental Evaluation] The headline claim that AdaFocal 'substantially outperforms existing compositional retrieval models' is not yet supported by a controlled comparison. Standard baselines (CLIP, ARTEMIS, etc.) were designed for text-plus-reference-image queries without spatial anchors, and the manuscript gives no indication that they were re-implemented with equivalent box-conditioned attention or feature masking inside the anchored region. Without that adaptation, any reported gap could be explained by the extra input modality rather than by the proposed Context-Aware Attention Modulator, undermining the central empirical contribution.
Authors: We agree that the current evaluation does not fully isolate the effect of the Context-Aware Attention Modulator from the simple availability of the bounding-box input. The original experiments applied standard CIR baselines to the OACIR queries (reference image plus modification text) without explicit box conditioning, because these baselines were not designed for spatial anchors. In the revised manuscript we will add controlled experiments in which the baselines are adapted to the anchored setting via feature masking outside the provided bounding box and/or by injecting box-derived spatial embeddings. We will report the resulting metrics on the OACIRR galleries and update both the abstract and the experimental section to reflect these comparisons, allowing readers to assess whether AdaFocal's adaptive focusing yields gains beyond straightforward box conditioning alone. Revision: yes.
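The rebuttal does not specify what a "box-derived spatial embedding" would look like; one plausible, purely illustrative form is a sinusoidal encoding of normalized box coordinates appended to the baseline's input sequence. `box_embedding` and its defaults are invented for this sketch.

```python
import numpy as np

def box_embedding(box, image_size=224, dim=16):
    """Sinusoidal embedding of normalized (cx, cy, w, h) box coordinates,
    one hypothetical way to inject the spatial anchor into a baseline
    encoder. dim must be divisible by 8."""
    x0, y0, x1, y1 = box
    coords = np.array([(x0 + x1) / 2, (y0 + y1) / 2, x1 - x0, y1 - y0]) / image_size
    freqs = 2.0 ** np.arange(dim // 8)          # dim // 8 frequencies per coordinate
    angles = np.outer(coords, freqs).ravel()    # 4 * (dim // 8) angles
    return np.concatenate([np.sin(angles), np.cos(angles)])
```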
Circularity Check
No significant circularity; claims are empirical
full rationale
The paper introduces a new task (OACIR) that augments standard CIR queries with an explicit bounding box for instance anchoring, constructs the OACIRR benchmark with over 160K quadruples and hard-negative galleries, and proposes the AdaFocal model featuring a Context-Aware Attention Modulator. All performance claims, including substantial outperformance on instance-level fidelity, are presented strictly as outcomes of extensive experiments on this new benchmark. No equations, analytical derivations, first-principles predictions, or parameter-fitting steps are described that could reduce by construction to self-referential definitions or fitted inputs. Potential concerns about baseline adaptation for the bounding-box input relate to experimental controls rather than any load-bearing circularity in a derivation chain. The work is therefore self-contained as an empirical contribution with no circular steps.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: emphasizing concrete instance fidelity over broad semantics is often more consequential in retrieval applications.
invented entities (3)
- OACIR task: no independent evidence
- OACIRR benchmark: no independent evidence
- AdaFocal framework: no independent evidence
Reference graph
Works this paper leans on
- [1] Yuval Alaluf, Elad Richardson, Sergey Tulyakov, Kfir Aberman, and Daniel Cohen-Or. MyVLM: Personalizing VLMs for user-specific queries. In European Conference on Computer Vision, pages 73–91. Springer, 2024.
- [2] Shuai Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Sibo Song, Kai Dang, Peng Wang, Shijie Wang, Jun Tang, et al. Qwen2.5-VL technical report. arXiv preprint arXiv:2502.13923, 2025.
- [3] Yalong Bai, Yuxiang Chen, Wei Yu, Linfang Wang, and Wei Zhang. Products-10K: A large-scale product recognition dataset. arXiv preprint arXiv:2008.10545, 2020.
- [4] Yang Bai, Xinxing Xu, Yong Liu, Salman Khan, Fahad Khan, Wangmeng Zuo, Rick Siow Mong Goh, and Chun-Mei Feng. Sentence-level prompts benefit composed image retrieval. In The Twelfth International Conference on Learning Representations, 2024.
- [5] Alberto Baldrati, Marco Bertini, Tiberio Uricchio, and Alberto Del Bimbo. Effective conditioned and composed image retrieval combining CLIP-based features. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 21466–21474, 2022.
- [6] Alberto Baldrati, Lorenzo Agnolucci, Marco Bertini, and Alberto Del Bimbo. Zero-shot composed image retrieval with textual inversion. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 15338–15347, 2023.
- [7] Alberto Baldrati, Marco Bertini, Tiberio Uricchio, and Alberto Del Bimbo. Composed image retrieval using contrastive learning and task-oriented CLIP-based features. ACM Transactions on Multimedia Computing, Communications and Applications, 20(3):1–24, 2023.
- [8] Tim Brooks, Aleksander Holynski, and Alexei A Efros. InstructPix2Pix: Learning to follow image editing instructions. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 18392–18402, 2023.
- [9] Soravit Changpinyo, Piyush Sharma, Nan Ding, and Radu Soricut. Conceptual 12M: Pushing web-scale image-text pre-training to recognize long-tail visual concepts. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 3558–3568, 2021.
- [10] Yanbei Chen, Shaogang Gong, and Loris Bazzani. Image search with text feedback by visiolinguistic attention learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 3001–3011, 2020.
- [11] Niv Cohen, Rinon Gal, Eli A Meirom, Gal Chechik, and Yuval Atzmon. "This is my unicorn, Fluffy": Personalizing frozen vision-language representations. In European Conference on Computer Vision, pages 558–577. Springer, 2022.
- [12] Ginger Delmas, Rafael S Rezende, Gabriela Csurka, and Diane Larlus. ARTEMIS: Attention-based retrieval with text-explicit matching and implicit similarity. In The Tenth International Conference on Learning Representations, 2022.
- [13] Yuying Ge, Ruimao Zhang, Xiaogang Wang, Xiaoou Tang, and Ping Luo. DeepFashion2: A versatile benchmark for detection, pose estimation, segmentation and re-identification of clothing images. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 5337–5345, 2019.
- [14] Albert Gordo, Jon Almazán, Jerome Revaud, and Diane Larlus. Deep image retrieval: Learning global representations for image search. In European Conference on Computer Vision, pages 241–257. Springer, 2016.
- [15] Geonmo Gu, Sanghyuk Chun, Wonjae Kim, HeeJae Jun, Yoohoon Kang, and Sangdoo Yun. CompoDiff: Versatile composed image retrieval with latent diffusion. Transactions on Machine Learning Research, 2024.
- [16] Geonmo Gu, Sanghyuk Chun, Wonjae Kim, Yoohoon Kang, and Sangdoo Yun. Language-only training of zero-shot composed image retrieval. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 13225–13234, 2024.
- [17] Yan Huang, Qiang Wu, Jingsong Xu, and Yi Zhong. Celebrities-ReID: A benchmark for clothes variation in long-term person re-identification. In 2019 International Joint Conference on Neural Networks (IJCNN), pages 1–8. IEEE, 2019.
- [18] Xintong Jiang, Yaxiong Wang, Mengjian Li, Yujiao Wu, Bingwen Hu, and Xueming Qian. CaLa: Complementary association learning for augmenting composed image retrieval. In Proceedings of the 47th International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 2177–2187, 2024.
- [19] Shyamgopal Karthik, Karsten Roth, Massimiliano Mancini, and Zeynep Akata. Vision-by-language for training-free compositional image retrieval. In The Twelfth International Conference on Learning Representations, 2024.
- [20] Jonathan Krause, Michael Stark, Jia Deng, and Li Fei-Fei. 3D object representations for fine-grained categorization. In 2013 IEEE International Conference on Computer Vision Workshops, pages 554–561, 2013.
- [21] Matan Levy, Rami Ben-Ari, Nir Darshan, and Dani Lischinski. Data roaming and quality assessment for composed image retrieval. In Proceedings of the AAAI Conference on Artificial Intelligence, pages 2991–2999, 2024.
- [22] Haiwen Li, Delong Liu, Zhaohui Hou, Zhicheng Zhao, and Fei Su. Automatic synthesis of high-quality triplet data for composed image retrieval. arXiv preprint arXiv:2507.05970, 2025.
- [23] Junnan Li, Dongxu Li, Silvio Savarese, and Steven Hoi. BLIP-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. In International Conference on Machine Learning, pages 19730–19742. PMLR, 2023.
- [24] Xiaojie Li, Chu Li, Shi-Zhe Chen, and Xi Chen. U-MARVEL: Unveiling key factors for universal multimodal retrieval via embedding learning with MLLMs. arXiv preprint arXiv:2507.14902, 2025.
- [25] Zongzhao Li, Jiacheng Cen, Bing Su, Tingyang Xu, Yu Rong, Deli Zhao, and Wenbing Huang. Large language-geometry model: When LLM meets equivariance. In Proceedings of the 42nd International Conference on Machine Learning, 2025.
- [26] Zongzhao Li, Xiangzhe Kong, Jiahui Su, Zongyang Ma, Mingze Li, Songyou Li, Yuelin Zhang, Yu Rong, Tingyang Xu, Deli Zhao, et al. From macro to micro: Benchmarking microscopic spatial intelligence on molecules via vision-language models. arXiv preprint arXiv:2512.10867, 2025.
- [27] Zongzhao Li, Zongyang Ma, Mingze Li, Songyou Li, Yu Rong, Tingyang Xu, Ziqi Zhang, Deli Zhao, and Wenbing Huang. STAR-R1: Spatial transformation reasoning by reinforcing multimodal LLMs. arXiv preprint arXiv:2505.15804, 2025.
- [28] Sheng-Chieh Lin, Chankyu Lee, Mohammad Shoeybi, Jimmy Lin, Bryan Catanzaro, and Wei Ping. MM-Embed: Universal multimodal retrieval with multimodal LLMs. In The Thirteenth International Conference on Learning Representations, 2025.
- [29] Delong Liu, Haiwen Li, Zhaohui Hou, Zhicheng Zhao, Fei Su, and Yuan Dong. Automatic synthetic data and fine-grained adaptive feature alignment for composed person retrieval. In The Thirty-ninth Annual Conference on Neural Information Processing Systems, 2025.
- [30] Yikun Liu, Yajie Zhang, Jiayin Cai, Xiaolong Jiang, Yao Hu, Jiangchao Yao, Yanfeng Wang, and Weidi Xie. LamRA: Large multimodal model as your advanced retrieval assistant. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 4015–4025, 2025.
- [31] Zheyuan Liu, Cristian Rodriguez-Opazo, Damien Teney, and Stephen Gould. Image retrieval on real-life images with pre-trained vision-and-language models. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 2125–2134, 2021.
- [32] Zheyuan Liu, Weixuan Sun, Damien Teney, and Stephen Gould. Candidate set re-ranking for composed image retrieval with dual multi-modal encoder. Transactions on Machine Learning Research, 2024.
- [33] I Loshchilov. Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101, 2017.
- [34] Hao Luo, Youzhi Gu, Xingyu Liao, Shenqi Lai, and Wei Jiang. Bag of tricks and a strong baseline for deep person re-identification. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, 2019.
- [35] Niklas Muennighoff, Nouamane Tazi, Loïc Magne, and Nils Reimers. MTEB: Massive text embedding benchmark. In Proceedings of the 17th Conference of the European Chapter of the Association for Computational Linguistics, pages 2014–2037, 2023.
- [36] Yixin Nie, Adina Williams, Emily Dinan, Mohit Bansal, Jason Weston, and Douwe Kiela. Adversarial NLI: A new benchmark for natural language understanding. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 4885–4901, 2020.
- [37] Hyeonwoo Noh, Andre Araujo, Jack Sim, Tobias Weyand, and Bohyung Han. Large-scale image retrieval with attentive deep local features. In Proceedings of the IEEE International Conference on Computer Vision, pages 3456–3465, 2017.
- [38] Kuniaki Saito, Kihyuk Sohn, Xiang Zhang, Chun-Liang Li, Chen-Yu Lee, Kate Saenko, and Tomas Pfister. Pic2Word: Mapping pictures to words for zero-shot composed image retrieval. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 19305–19314, 2023.
- [39] Lucas Ventura, Antoine Yang, Cordelia Schmid, and Gül Varol. CoVR: Learning composed video retrieval from web video captions. In Proceedings of the AAAI Conference on Artificial Intelligence, pages 5270–5279, 2024.
- [40] Nam Vo, Lu Jiang, Chen Sun, Kevin Murphy, Li-Jia Li, Li Fei-Fei, and James Hays. Composing text and image for image retrieval - an empirical odyssey. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 6439–6448, 2019.
- [41] Zihao Wang, Xihui Liu, Hongsheng Li, Lu Sheng, Junjie Yan, Xiaogang Wang, and Jing Shao. CAMP: Cross-modal adaptive message passing for text-image retrieval. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 5764–5773, 2019.
- [42] Zhixiang Wang, Zheng Wang, Yinqiang Zheng, Yung-Yu Chuang, and Shin'ichi Satoh. Learning to reduce dual-level discrepancy for infrared-visible person re-identification. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 618–626, 2019.
- [43] Cong Wei, Yang Chen, Haonan Chen, Hexiang Hu, Ge Zhang, Jie Fu, Alan Ritter, and Wenhu Chen. UniIR: Training and benchmarking universal multimodal information retrievers. In European Conference on Computer Vision, pages 387–404. Springer, 2024.
- [44] Tobias Weyand, Andre Araujo, Bingyi Cao, and Jack Sim. Google Landmarks Dataset v2: A large-scale benchmark for instance-level recognition and retrieval. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 2575–2584, 2020.
- [45] Hui Wu, Yupeng Gao, Xiaoxiao Guo, Ziad Al-Halah, Steven Rennie, Kristen Grauman, and Rogerio Feris. Fashion IQ: A new dataset towards retrieving images by natural language feedback. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 11307–11317, 2021.
- [46] Yuxin Yang, Yinan Zhou, Yuxin Chen, Ziqi Zhang, Zongyang Ma, Chunfeng Yuan, Bing Li, Lin Song, Jun Gao, Peng Li, and Weiming Hu. DetailFusion: A dual-branch framework with detail enhancement for composed image retrieval. arXiv preprint arXiv:2505.17796, 2025.
- [47] Zhengwei Yang, Meng Lin, Xian Zhong, Yu Wu, and Zheng Wang. Good is bad: Causality inspired cloth-debiasing for cloth-changing person re-identification. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 1472–1481, 2023.
- [48] Zhenyu Yang, Dizhan Xue, Shengsheng Qian, Weiming Dong, and Changsheng Xu. LDRE: LLM-based divergent reasoning and ensemble for zero-shot composed image retrieval. In Proceedings of the 47th International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 80–90, 2024.
- [49] Chun-Hsiao Yeh, Bryan Russell, Josef Sivic, Fabian Caba Heilbron, and Simon Jenni. Meta-personalizing vision-language models to find named instances in video. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 19123–19132, 2023.
- [50] Kai Zhang, Yi Luan, Hexiang Hu, Kenton Lee, Siyuan Qiao, Wenhu Chen, Yu Su, and Ming-Wei Chang. MagicLens: Self-supervised image retrieval with open-ended instructions. In Proceedings of the 41st International Conference on Machine Learning, pages 59403–59420. PMLR, 2024.
- [51] Qi Zhang, Zhen Lei, Zhaoxiang Zhang, and Stan Z Li. Context-aware attention network for image-text retrieval. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 3536–3545, 2020.
- [52] Xin Zhang, Yanzhao Zhang, Wen Xie, Mingxin Li, Ziqi Dai, Dingkun Long, Pengjun Xie, Meishan Zhang, Wenjie Li, and Min Zhang. GME: Improving universal multimodal retrieval by multimodal LLMs. arXiv preprint arXiv:2412.16855, 2024.
- [53] Yuechen Zhang, Shengju Qian, Bohao Peng, Shu Liu, and Jiaya Jia. Prompt highlighter: Interactive control for multimodal LLMs. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 13215–13224, 2024.
- [54] Xiangyu Zhao, Yicheng Chen, Shilin Xu, Xiangtai Li, Xinjiang Wang, Yining Li, and Haian Huang. An open and comprehensive pipeline for unified object grounding and detection. arXiv preprint arXiv:2401.02361, 2024.
- [55] Kecheng Zheng, Wu Liu, Lingxiao He, Tao Mei, Jiebo Luo, and Zheng-Jun Zha. Group-aware label transfer for domain adaptive person re-identification. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 5310–5319, 2021.
Supplementary material (OACIRR benchmark details)
- The paper's appendix documents the OACIRR construction pipeline: subset-specific protocols, the prompts used for MLLM-based annotation, detailed dataset statistics, and an instance diversity analysis.
- Common-object tagging prompt: exactly one object is the same product across all images, possibly in different states, environments, or viewing angles. The model must output only a short, specific but not overly detailed English noun phrase for that object, with no brand names, no description of state, background, viewing angle, or inter-image differences, and no introductory phrases such as "The common object is:".
- Landmark subset: a single prompt concurrently performs visual consistency filtering and class-label annotation, classifying each image set as "visual" (with a landmark class label) or "knowledge" (label set to null), and returning only a JSON object with no surrounding text.
- Modification-text generation (Person, Car, and Product subsets): domain-specific prompts require fluent, natural English not exceeding 25 words, focused exclusively on the most significant definite changes (background or environment, viewing angle, state, packaging, interaction), never describing identical parts between the two images or the anchored "Object to Ignore" and its attributes, and avoiding repetitive sentence structures. Example outputs include "The scene changes to a desert at sunset" and "The car is now viewed from a front angle on a snowy mountain road with its headlights turned on."