Thinking Before Retrieving: Robust Zero-Shot Composed Image Retrieval via Strategic Planning and Self-Criticism

Gunho Jung; Jeong-Woo Park; Seon Bin Kim; Seong-Whan Lee

arxiv: 2606.31222 · v1 · pith:I4ZDJKH5new · submitted 2026-06-30 · 💻 cs.AI

Thinking Before Retrieving: Robust Zero-Shot Composed Image Retrieval via Strategic Planning and Self-Criticism

Gunho Jung , Jeong-Woo Park , Seon Bin Kim , Seong-Whan Lee This is my paper

Pith reviewed 2026-07-01 05:50 UTC · model grok-4.3

classification 💻 cs.AI

keywords composed image retrievalzero-shot retrievalquery generationmulti-stage reasoningself-criticismvision-language modelstraining-free methods

0 comments

The pith

A Planner-Executor-Critic pipeline improves zero-shot composed image retrieval by evaluating multiple candidate queries before retrieval.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper claims that single-pass generation of textual queries for composed image retrieval often distorts reference attributes or fails to meet modification instructions, degrading retrieval precision. It establishes that structuring the process as a multi-stage pipeline with explicit constraint extraction, multiple candidate generation, and constraint-based criticism reduces error propagation in a training-free setting. A sympathetic reader would care because this approach uses only frozen vision-language models and directly targets the interference between preserving image content and applying text changes. The central mechanism reframes query construction as staged inference rather than direct output.

Core claim

PEC-CIR reframes query construction for composed image retrieval as a Planner-Executor-Critic pipeline. The Planner extracts explicit constraints from the reference image and modification text. The Executor produces multiple candidate target descriptions. The Critic evaluates these candidates for constraint compliance and selects the best one for retrieval, thereby reducing the propagation of generative errors compared to single-pass methods.

What carries the argument

The Planner-Executor-Critic architecture, which turns query construction into a multi-stage reasoning pipeline that evaluates candidates before retrieval.

If this is right

Semantic distortions and omissions during generation are detected before they reach the retrieval stage.
Preservation of reference attributes and integration of textual requirements interfere less with each other.
Retrieval stability increases in training-free zero-shot settings without any model updates.
The same staged structure can be applied to other tasks that rely on generating retrieval-oriented text from multimodal inputs.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The approach may extend to other multimodal generation tasks where single-pass outputs frequently violate constraints.
Self-criticism with frozen models could serve as a general substitute for task-specific fine-tuning in retrieval pipelines.
Explicit constraint lists produced by the Planner offer a transparent way to audit and debug query failures.

Load-bearing premise

The Critic stage using only frozen models can reliably detect and select candidate descriptions that better preserve reference attributes and satisfy textual constraints than a single-pass generation would.

What would settle it

A direct comparison on standard composed image retrieval benchmarks measuring whether retrieval accuracy rises when replacing single-pass generation with the full Planner-Executor-Critic selection process.

Figures

Figures reproduced from arXiv: 2606.31222 by Gunho Jung, Jeong-Woo Park, Seon Bin Kim, Seong-Whan Lee.

**Figure 2.** Figure 2: Overview of PEC-CIR. The framework performs a Planner–Executor–Critic loop to iteratively [PITH_FULL_IMAGE:figures/full_fig_p008_2.png] view at source ↗

**Figure 3.** Figure 3: Qualitative success cases of PEC-CIR. Each example displays the reference image, the modifi [PITH_FULL_IMAGE:figures/full_fig_p025_3.png] view at source ↗

**Figure 4.** Figure 4: Representative failure cases and error taxonomy of PEC-CIR. [PITH_FULL_IMAGE:figures/full_fig_p026_4.png] view at source ↗

**Figure 5.** Figure 5: Intermediate outputs of PEC-CIR across iterations. The figure visualizes the constraints derived [PITH_FULL_IMAGE:figures/full_fig_p028_5.png] view at source ↗

read the original abstract

Composed image retrieval requires identifying a target image from a gallery by integrating a reference image with a textual modification instruction. In a training-free zero-shot setting, this task relies on constructing a retrieval-oriented textual query within a frozen vision--language embedding space at inference time. Existing approaches predominantly rely on a single-pass generation strategy that fuses the reference context and modification text into a unified description. This strategy makes it difficult to detect or correct semantic distortions and omissions during generation. Consequently, the preservation of reference attributes and the integration of textual requirements interfere with each other, which degrades retrieval precision. To address these challenges, we introduce PEC-CIR, a training-free framework that structures query construction as a multi-stage reasoning pipeline. The framework operates through a Planner--Executor--Critic architecture where the Planner extracts explicit constraints, the Executor generates multiple candidate target descriptions, and the Critic evaluates these candidates based on constraint compliance. By reframing query construction as a staged inference process instead of a single-pass output, PEC-CIR reduces the propagation of generative errors by explicitly evaluating candidate queries before retrieval, thereby improving retrieval stability.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

PEC-CIR adds an explicit planner-executor-critic loop to zero-shot composed image retrieval, but the abstract supplies no results to show the critic actually helps.

read the letter

The main takeaway is that this paper reframes query construction for zero-shot CIR as a three-stage process instead of a single generation pass. The planner pulls out constraints, the executor makes several candidate descriptions, and the critic checks them against those constraints before retrieval. That decomposition is the concrete difference from the single-pass methods they target.

It does a clean job naming the interference problem: reference attributes and modification text can fight each other in one shot, and an explicit check step is a logical way to catch distortions. The architecture description is straightforward and stays within frozen models, which keeps the training-free claim intact.

The soft spot is exactly the one the stress-test flagged. The abstract gives no scoring function, prompt template, or mechanism for the critic, so it is unclear why the same frozen VLM family would reliably prefer better candidates than the executor produced in the first place. Without any numbers, ablations, or even a worked example, the stability claim stays untested. The paper is still early-stage on evidence.

This is for people already working on practical zero-shot retrieval pipelines who might want to try staged reasoning. A reader looking for new ideas on error reduction in generation could pull something useful, but anyone needing proven gains on benchmarks will have to wait for the experiments.

If the full paper includes solid results on standard CIR datasets and shows the critic step actually moves the needle, it is worth sending to referees. The core idea is simple enough to evaluate directly.

Referee Report

2 major / 1 minor

Summary. The paper claims that single-pass generation in training-free zero-shot composed image retrieval (CIR) leads to semantic distortions and omissions because reference attributes and textual modifications interfere; it proposes PEC-CIR, a Planner–Executor–Critic pipeline that extracts constraints, generates multiple candidate descriptions, and evaluates them for compliance before retrieval, thereby reducing error propagation and improving stability.

Significance. If the Critic stage can reliably select superior candidates using only frozen VLMs, the staged pipeline would address a genuine limitation of single-pass methods in zero-shot CIR and provide a general template for error mitigation in generative retrieval pipelines. The training-free nature is a strength, but the absence of any reported results, ablations, or implementation details prevents assessment of whether the architecture delivers the claimed stability gains.

major comments (2)

[Abstract, §3] Abstract and §3 (method): the central claim that the Critic 'evaluates these candidates based on constraint compliance' and thereby reduces generative errors is load-bearing, yet the manuscript supplies no scoring function, prompt template, or decision rule for the Critic; without an explicit mechanism it is impossible to determine why a frozen model would succeed at meta-evaluation where the Executor failed at generation.
[§4] §4 (experiments): no quantitative retrieval metrics, ablation studies, or baseline comparisons are reported, so the stability improvement asserted in the abstract cannot be verified and the contribution of the Planner–Executor–Critic decomposition remains untested.

minor comments (1)

[§3] Notation for the three stages is introduced without a compact diagram or pseudocode listing the exact sequence of calls to the frozen VLM, which would aid reproducibility.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback highlighting the need for explicit implementation details and empirical validation. We will revise the manuscript to incorporate the requested clarifications and results while preserving the core contribution of the PEC-CIR framework.

read point-by-point responses

Referee: [Abstract, §3] Abstract and §3 (method): the central claim that the Critic 'evaluates these candidates based on constraint compliance' and thereby reduces generative errors is load-bearing, yet the manuscript supplies no scoring function, prompt template, or decision rule for the Critic; without an explicit mechanism it is impossible to determine why a frozen model would succeed at meta-evaluation where the Executor failed at generation.

Authors: We agree that the absence of explicit prompt templates and the Critic's decision rule limits reproducibility. In the revised manuscript we will add the full prompt templates for the Planner, Executor, and Critic stages along with the precise scoring function (constraint-compliance verification via the frozen VLM) and the selection rule. This will clarify how the Critic performs reliable meta-evaluation. revision: yes
Referee: [§4] §4 (experiments): no quantitative retrieval metrics, ablation studies, or baseline comparisons are reported, so the stability improvement asserted in the abstract cannot be verified and the contribution of the Planner–Executor–Critic decomposition remains untested.

Authors: The current manuscript presents the conceptual architecture; we acknowledge that quantitative evidence is required to substantiate the claimed stability gains. The revised version will include standard zero-shot CIR metrics (Recall@K, mAP), comparisons against single-pass baselines, and ablations isolating each stage of the pipeline. revision: yes

Circularity Check

0 steps flagged

No circularity: procedural pipeline with no equations or self-referential reductions

full rationale

The paper presents PEC-CIR as a training-free, multi-stage procedural framework (Planner extracts constraints, Executor generates candidates, Critic evaluates compliance) for zero-shot composed image retrieval. The provided text contains no equations, fitted parameters, uniqueness theorems, or derivations. The central claim—that staging reduces generative error propagation—is an architectural assertion about inference ordering, not a quantity that reduces to its own inputs by construction. No self-citation load-bearing or ansatz smuggling is identifiable in the abstract or description. The method is self-contained as an empirical pipeline choice evaluated against single-pass baselines.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The framework rests on the assumption that existing frozen vision-language models possess sufficient reasoning capability to perform planning, candidate generation, and criticism without task-specific training or additional parameters.

axioms (1)

domain assumption Frozen vision-language models can perform effective constraint extraction, multi-candidate generation, and constraint-compliance evaluation for image retrieval queries
The entire pipeline is built on this capability of the base models; if the models cannot reliably execute these roles the staged process collapses.

pith-pipeline@v0.9.1-grok · 5738 in / 1226 out tokens · 29634 ms · 2026-07-01T05:50:33.738554+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

57 extracted references · 3 canonical work pages

[1]

N. V o, L. Jiang, C. Sun, K. Murphy, L.-J. Li, L. Fei-Fei, J. Hays, Composing text and image for image retrieval - an empirical odyssey, in: Proc. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2019, pp. 6439–6448. 30

2019
[2]

Philbin, O

J. Philbin, O. Chum, M. Isard, J. Sivic, A. Zisserman, Object retrieval with large vocabularies and fast spatial matching, in: Proc. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), IEEE, 2007, pp. 1–8

2007
[3]

Gordo, J

A. Gordo, J. Almazán, J. Revaud, D. Larlus, Deep image retrieval: Learning global representations for image search, in: European Conference on Computer Vision (ECCV), Springer, 2016, pp. 241–257

2016
[4]

Y .-B. Lee, U. Park, A. K. Jain, S.-W. Lee, Pill-id: Matching and retrieval of drug pill images, Pattern Recognition Letters 33 (7) (2012) 904–910

2012
[5]

S. Sun, Q. Li, S. Gong, W. Cai, P. Torr, J. Gu, Benchcir: Benchmarking robust- ness in composed image retrieval across modalities, Pattern Recognition (2026) 113724

2026
[6]

L. Wang, W. Ao, V . N. Boddeti, S.-N. Lim, Generative zero-shot composed image retrieval, in: Proc. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2025, pp. 29690–29700

2025
[7]

E. Xing, P. Kolouju, R. Pless, A. Stylianou, N. Jacobs, Context-cir: Learning from concepts in text for composed image retrieval, in: Proc. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2025, pp. 19638–19648

2025
[8]

D. Kang, H. Han, A. K. Jain, S.-W. Lee, Nighttime face recognition at large standoff: Cross-distance and cross-spectral matching, Pattern Recognition 47 (12) (2014) 3750–3766

2014
[9]

Delmas, R

G. Delmas, R. S. de Rezende, G. Csurka, D. Larlus, Artemis: Attention-based retrieval with text-explicit matching and implicit similarity, 2022

2022
[10]

Ju, H.-J

Y .-J. Ju, H.-J. Kim, S.-W. Lee, Mire: Enhancing multimodal queries representa- tion via fusion-free modality interaction for multimodal retrieval, in: Findings of the Association for Computational Linguistics: ACL 2025, 2025, pp. 5350–5363

2025
[11]

Baldrati, M

A. Baldrati, M. Bertini, T. Uricchio, A. Del Bimbo, Effective conditioned and composed image retrieval combining clip-based features, in: Proc. IEEE Con- 31 ference on Computer Vision and Pattern Recognition (CVPR), 2022, pp. 21466– 21474

2022
[12]

Zhang, Y

K. Zhang, Y . Luan, H. Hu, K. Lee, S. Qiao, W. Chen, Y . Su, M.-W. Chang, Magiclens: Self-supervised image retrieval with open-ended instructions, in: In- ternational Conference on Machine Learning (ICML), PMLR, 2024, pp. 59403– 59420

2024
[13]

Agnolucci, A

L. Agnolucci, A. Baldrati, A. Del Bimbo, M. Bertini, isearle: Improving textual inversion for zero-shot composed image retrieval, IEEE Trans. on Pattern Analy- sis and Machine Intelligence (2025)

2025
[14]

Z. Chen, Z. Zhao, F. Su, S. Lu, Data-efficient generalization for zero-shot com- posed image retrieval, Pattern Recognition (2026) 113187

2026
[15]

Radford, J

A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry, A. Askell, P. Mishkin, J. Clark, et al., Learning transferable visual models from natural language supervision, in: International Conference on Machine Learning (ICML), PmLR, 2021, pp. 8748–8763

2021
[16]

Park, S.-W

J.-W. Park, S.-W. Lee, Mcot-re: Multi-faceted chain-of-thought and re-ranking for training-free zero-shot composed image retrieval, in: 2025 IEEE International Conference on Systems, Man, and Cybernetics (SMC), IEEE, 2025, pp. 984–989

2025
[17]

Y . Tang, J. Zhang, X. Qin, J. Yu, G. Gou, G. Xiong, Q. Lin, S. Rajmohan, D. Zhang, Q. Wu, Reason-before-retrieve: One-stage reflective chain-of-thoughts for training-free zero-shot composed image retrieval, in: Proc. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2025, pp. 14400–14410

2025
[18]

Z. Yang, D. Xue, S. Qian, W. Dong, C. Xu, Ldre: Llm-based divergent reasoning and ensemble for zero-shot composed image retrieval, in: Proc. ACM SIGIR Con- ference on Research and Development in Information Retrieval (SIGIR), 2024, pp. 80–90. 32

2024
[19]

R.-C. Tu, W. Sun, H. You, Y . Wang, J. Huang, L. Shen, D. Tao, Multi- modal reasoning agent for zero-shot composed image retrieval, arXiv preprint arXiv:2505.19952 (2025)

work page arXiv 2025
[20]

Cheng, Y

Z. Cheng, Y . Ma, J. Lang, K. Zhang, T. Zhong, Y . Wang, F. Zhou, Generative thinking, corrective action: User-friendly composed image retrieval via automatic multi-agent collaboration, in: Proc. 31st ACM SIGKDD Conference on Knowl- edge Discovery and Data Mining V . 2, 2025, pp. 334–344

2025
[21]

H. Wu, Y . Gao, X. Guo, Z. Al-Halah, S. Rennie, K. Grauman, R. Feris, Fashion iq: A new dataset towards retrieving images by natural language feedback, in: Proc. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2021, pp. 11307–11317

2021
[22]

Z. Liu, C. Rodriguez-Opazo, D. Teney, S. Gould, Image retrieval on real-life images with pre-trained vision-and-language models, in: Proc. IEEE/CVF Inter- national Conference on Computer Vision (ICCV), 2021, pp. 2125–2134

2021
[23]

Park, Y .-E

J.-W. Park, Y .-E. Kim, S.-W. Lee, Far-net: Multi-stage fusion network with en- hanced semantic alignment and adaptive reconciliation for composed image re- trieval, in: 2025 IEEE International Conference on Systems, Man, and Cybernet- ics (SMC), IEEE, 2025, pp. 990–995

2025
[24]

Yun, W.-J

M.-S. Yun, W.-J. Nam, S.-W. Lee, Coarse-to-fine deep metric learning for remote sensing image retrieval, Remote Sensing 12 (2) (2020) 219

2020
[25]

H. Wen, X. Song, J. Yin, J. Wu, W. Guan, L. Nie, Self-training boosted multi- factor matching network for composed image retrieval, IEEE Trans. on Pattern Analysis and Machine Intelligence 46 (5) (2023) 3665–3678

2023
[26]

Q. Yang, M. Ye, Z. Cai, K. Su, B. Du, Composed image retrieval via cross re- lation network with hierarchical aggregation transformer, IEEE Trans. on Image Processing 32 (2023) 4543–4554. 33

2023
[27]

S. Qian, D. Xue, Q. Fang, C. Xu, Integrating multi-label contrastive learning with dual adversarial graph neural networks for cross-modal retrieval, IEEE Transac- tions on Pattern Analysis and Machine Intelligence 45 (4) (2022) 4794–4811

2022
[28]

S. Qian, D. Xue, H. Zhang, Q. Fang, C. Xu, Dual adversarial graph neural net- works for multi-label cross-modal retrieval, in: Proceedings of the AAAI Confer- ence on Artificial Intelligence, V ol. 35, 2021, pp. 2440–2448

2021
[29]

S. Qian, D. Xue, Q. Fang, C. Xu, Adaptive label-aware graph convolutional net- works for cross-modal retrieval, IEEE Transactions on Multimedia 24 (2021) 3520–3532

2021
[30]

Lee, Multilayer cluster neural network for totally unconstrained handwrit- ten numeral recognition, Neural Networks 8 (5) (1995) 783–792

S.-W. Lee, Multilayer cluster neural network for totally unconstrained handwrit- ten numeral recognition, Neural Networks 8 (5) (1995) 783–792

1995
[31]

Roh, H.-K

M.-C. Roh, H.-K. Shin, S.-W. Lee, View-independent human action recognition with volume motion template on single stereo camera, Pattern Recognition Let- ters 31 (7) (2010) 639–647

2010
[32]

S.-W. Lee, J. H. Kim, F. C. Groen, Translation-, rotation-and scale-invariant recognition of hand-drawn symbols in schematic diagrams, International Journal of Pattern Recognition and Artificial Intelligence 4 (01) (1990) 1–25

1990
[33]

Y . Liu, J. Du, X. Gao, J. Han, L. Shao, Relation-aware meta-learning for zero- shot sketch-based image retrieval, IEEE Transactions on Circuits and Systems for Video Technology (2025)

2025
[34]

J. Du, Y . Liu, X. Gao, J. Han, L. Zhang, Zero-shot sketch-based image retrieval with teacher-guided and student-centered cross-modal bidirectional knowledge distillation, Pattern Recognition 164 (2025) 111529

2025
[35]

Y . Liu, J. Du, X. Gao, J. Han, L. Shao, Zero-shot sketch-based image retrieval via mixed species augmentation and bidirectional mining, IEEE Transactions on Circuits and Systems for Video Technology (2026). 34

2026
[36]

Baldrati, L

A. Baldrati, L. Agnolucci, M. Bertini, A. Del Bimbo, Zero-shot composed image retrieval with textual inversion, in: Proc. IEEE/CVF International Conference on Computer Vision (ICCV), 2023, pp. 15338–15347

2023
[37]

Karthik, K

S. Karthik, K. Roth, M. Mancini, Z. Akata, Vision-by-language for training- free compositional image retrieval, in: The Twelfth International Conference on Learning Representations, International Conference on Learning Represen- tations, ICLR, 2024

2024
[38]

Wu, Y .-Y

R.-D. Wu, Y .-Y . Lin, H.-F. Yang, Training-free zero-shot composed image re- trieval via weighted modality fusion and similarity, in: Proc. International Con- ference on Technologies and Applications of Artificial Intelligence (TAAI), Springer, 2024, pp. 77–90

2024
[39]

Kim, Y .-E

Y . Kim, Y .-E. Kim, S.-W. Lee, Enhancing spatio-temporal zero-shot action recog- nition with language-driven description attributes, Pattern Recognition (2025) 112687

2025
[40]

J. Byun, S. Jeong, W. Kim, S. Chun, T. Moon, An efficient post-hoc framework for reducing task discrepancy of text encoders for composed image retrieval, in: Proc. IEEE/CVF International Conference on Computer Vision (ICCV), 2025, pp. 3895–3904

2025
[41]

Y . K. Jang, D. Huynh, A. Shah, W.-K. Chen, S.-N. Lim, Spherical linear interpo- lation and text-anchoring for zero-shot composed image retrieval, in: European Conference on Computer Vision (ECCV), Springer, 2024, pp. 239–254

2024
[42]

Saito, K

K. Saito, K. Sohn, X. Zhang, C.-L. Li, C.-Y . Lee, K. Saenko, T. Pfister, Pic2word: Mapping pictures to words for zero-shot composed image retrieval, in: Proc. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2023, pp. 19305–19314

2023
[43]

Y . Wang, Z. Tian, Q. Guo, Z. Qin, S. Zhou, M. Yang, L. Wang, From mapping to composing: A two-stage framework for zero-shot composed image retrieval, arXiv preprint arXiv:2504.17990 (2025). 35

work page arXiv 2025
[44]

H. Lin, H. Wen, X. Song, M. Liu, Y . Hu, L. Nie, Fine-grained textual inver- sion network for zero-shot composed image retrieval, in: Proc. 47th International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR), 2024, pp. 240–250

2024
[45]

Y . Suo, F. Ma, L. Zhu, Y . Yang, Knowledge-enhanced dual-stream zero-shot com- posed image retrieval, in: Proc. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2024, pp. 26951–26962

2024
[46]

Z. Yang, S. Qian, D. Xue, J. Wu, F. Yang, W. Dong, C. Xu, Semantic editing increment benefits zero-shot composed image retrieval, in: Proceedings of the 32nd ACM International Conference on Multimedia, 2024, pp. 1245–1254

2024
[47]

S. Yao, J. Zhao, D. Yu, N. Du, I. Shafran, K. R. Narasimhan, Y . Cao, React: Syn- ergizing reasoning and acting in language models, in: The Eleventh International Conference on Learning Representations (ICLR), 2023

2023
[48]

Kojima, S

T. Kojima, S. S. Gu, M. Reid, Y . Matsuo, Y . Iwasawa, Large language mod- els are zero-shot reasoners, Advances in Neural Information Processing Systems (NeurIPS) 35 (2022) 22199–22213

2022
[49]

Zubkova, J.-H

H. Zubkova, J.-H. Park, S.-W. Lee, Leveraging contextual confidence for smarter retrieval in large language models, Neural Networks (2026) 108862

2026
[50]

D. Zhou, N. Schärli, L. Hou, J. Wei, N. Scales, X. Wang, D. Schuurmans, C. Cui, O. Bousquet, Q. Le, et al., Least-to-most prompting enables complex reasoning in large language models, 2023

2023
[51]

X. Wang, J. Wei, D. Schuurmans, Q. V . Le, E. H. Chi, S. Narang, A. Chowdh- ery, D. Zhou, Self-consistency improves chain of thought reasoning in language models, in: The Eleventh International Conference on Learning Representations (ICLR), 2023

2023
[52]

Cherti, R

M. Cherti, R. Beaumont, R. Wightman, M. Wortsman, G. Ilharco, C. Gordon, C. Schuhmann, L. Schmidt, J. Jitsev, Reproducible scaling laws for contrastive 36 language-image learning, in: Proc. IEEE/CVF conference on computer vision and pattern recognition (CVPR), 2023, pp. 2818–2829

2023
[53]

Google DeepMind, Introducing gemini 2.0: our new ai model for the agentic era, https://blog.google/technology/ai/google-gemini-2-0/(2024)

2024
[54]

G. Gu, S. Chun, W. Kim, Y . Kang, S. Yun, Language-only efficient training of zero-shot composed image retrieval, in: Proc. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2024, pp. 13225–13234

2024
[55]

Y . Tang, J. Yu, K. Gai, J. Zhuang, G. Xiong, Y . Hu, Q. Wu, Context-i2w: Map- ping images to context-dependent words for accurate zero-shot composed image retrieval, in: Proc. AAAI Conference on Artificial Intelligence, V ol. 38, 2024, pp. 5180–5188

2024
[56]

S. Sun, F. Ye, S. Gong, Training-free zero-shot composed image retrieval with local concept re-ranking, arXiv preprint arXiv:2312.08924 (2023)

work page arXiv 2023
[57]

this is my unicorn, fluffy

N. Cohen, R. Gal, E. A. Meirom, G. Chechik, Y . Atzmon, “this is my unicorn, fluffy”: Personalizing frozen vision-language representations, in: European Con- ference on Computer Vision (ECCV), Springer, 2022, pp. 558–577. 37

2022

[1] [1]

N. V o, L. Jiang, C. Sun, K. Murphy, L.-J. Li, L. Fei-Fei, J. Hays, Composing text and image for image retrieval - an empirical odyssey, in: Proc. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2019, pp. 6439–6448. 30

2019

[2] [2]

Philbin, O

J. Philbin, O. Chum, M. Isard, J. Sivic, A. Zisserman, Object retrieval with large vocabularies and fast spatial matching, in: Proc. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), IEEE, 2007, pp. 1–8

2007

[3] [3]

Gordo, J

A. Gordo, J. Almazán, J. Revaud, D. Larlus, Deep image retrieval: Learning global representations for image search, in: European Conference on Computer Vision (ECCV), Springer, 2016, pp. 241–257

2016

[4] [4]

Y .-B. Lee, U. Park, A. K. Jain, S.-W. Lee, Pill-id: Matching and retrieval of drug pill images, Pattern Recognition Letters 33 (7) (2012) 904–910

2012

[5] [5]

S. Sun, Q. Li, S. Gong, W. Cai, P. Torr, J. Gu, Benchcir: Benchmarking robust- ness in composed image retrieval across modalities, Pattern Recognition (2026) 113724

2026

[6] [6]

L. Wang, W. Ao, V . N. Boddeti, S.-N. Lim, Generative zero-shot composed image retrieval, in: Proc. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2025, pp. 29690–29700

2025

[7] [7]

E. Xing, P. Kolouju, R. Pless, A. Stylianou, N. Jacobs, Context-cir: Learning from concepts in text for composed image retrieval, in: Proc. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2025, pp. 19638–19648

2025

[8] [8]

D. Kang, H. Han, A. K. Jain, S.-W. Lee, Nighttime face recognition at large standoff: Cross-distance and cross-spectral matching, Pattern Recognition 47 (12) (2014) 3750–3766

2014

[9] [9]

Delmas, R

G. Delmas, R. S. de Rezende, G. Csurka, D. Larlus, Artemis: Attention-based retrieval with text-explicit matching and implicit similarity, 2022

2022

[10] [10]

Ju, H.-J

Y .-J. Ju, H.-J. Kim, S.-W. Lee, Mire: Enhancing multimodal queries representa- tion via fusion-free modality interaction for multimodal retrieval, in: Findings of the Association for Computational Linguistics: ACL 2025, 2025, pp. 5350–5363

2025

[11] [11]

Baldrati, M

A. Baldrati, M. Bertini, T. Uricchio, A. Del Bimbo, Effective conditioned and composed image retrieval combining clip-based features, in: Proc. IEEE Con- 31 ference on Computer Vision and Pattern Recognition (CVPR), 2022, pp. 21466– 21474

2022

[12] [12]

Zhang, Y

K. Zhang, Y . Luan, H. Hu, K. Lee, S. Qiao, W. Chen, Y . Su, M.-W. Chang, Magiclens: Self-supervised image retrieval with open-ended instructions, in: In- ternational Conference on Machine Learning (ICML), PMLR, 2024, pp. 59403– 59420

2024

[13] [13]

Agnolucci, A

L. Agnolucci, A. Baldrati, A. Del Bimbo, M. Bertini, isearle: Improving textual inversion for zero-shot composed image retrieval, IEEE Trans. on Pattern Analy- sis and Machine Intelligence (2025)

2025

[14] [14]

Z. Chen, Z. Zhao, F. Su, S. Lu, Data-efficient generalization for zero-shot com- posed image retrieval, Pattern Recognition (2026) 113187

2026

[15] [15]

Radford, J

A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry, A. Askell, P. Mishkin, J. Clark, et al., Learning transferable visual models from natural language supervision, in: International Conference on Machine Learning (ICML), PmLR, 2021, pp. 8748–8763

2021

[16] [16]

Park, S.-W

J.-W. Park, S.-W. Lee, Mcot-re: Multi-faceted chain-of-thought and re-ranking for training-free zero-shot composed image retrieval, in: 2025 IEEE International Conference on Systems, Man, and Cybernetics (SMC), IEEE, 2025, pp. 984–989

2025

[17] [17]

Y . Tang, J. Zhang, X. Qin, J. Yu, G. Gou, G. Xiong, Q. Lin, S. Rajmohan, D. Zhang, Q. Wu, Reason-before-retrieve: One-stage reflective chain-of-thoughts for training-free zero-shot composed image retrieval, in: Proc. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2025, pp. 14400–14410

2025

[18] [18]

Z. Yang, D. Xue, S. Qian, W. Dong, C. Xu, Ldre: Llm-based divergent reasoning and ensemble for zero-shot composed image retrieval, in: Proc. ACM SIGIR Con- ference on Research and Development in Information Retrieval (SIGIR), 2024, pp. 80–90. 32

2024

[19] [19]

R.-C. Tu, W. Sun, H. You, Y . Wang, J. Huang, L. Shen, D. Tao, Multi- modal reasoning agent for zero-shot composed image retrieval, arXiv preprint arXiv:2505.19952 (2025)

work page arXiv 2025

[20] [20]

Cheng, Y

Z. Cheng, Y . Ma, J. Lang, K. Zhang, T. Zhong, Y . Wang, F. Zhou, Generative thinking, corrective action: User-friendly composed image retrieval via automatic multi-agent collaboration, in: Proc. 31st ACM SIGKDD Conference on Knowl- edge Discovery and Data Mining V . 2, 2025, pp. 334–344

2025

[21] [21]

H. Wu, Y . Gao, X. Guo, Z. Al-Halah, S. Rennie, K. Grauman, R. Feris, Fashion iq: A new dataset towards retrieving images by natural language feedback, in: Proc. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2021, pp. 11307–11317

2021

[22] [22]

Z. Liu, C. Rodriguez-Opazo, D. Teney, S. Gould, Image retrieval on real-life images with pre-trained vision-and-language models, in: Proc. IEEE/CVF Inter- national Conference on Computer Vision (ICCV), 2021, pp. 2125–2134

2021

[23] [23]

Park, Y .-E

J.-W. Park, Y .-E. Kim, S.-W. Lee, Far-net: Multi-stage fusion network with en- hanced semantic alignment and adaptive reconciliation for composed image re- trieval, in: 2025 IEEE International Conference on Systems, Man, and Cybernet- ics (SMC), IEEE, 2025, pp. 990–995

2025

[24] [24]

Yun, W.-J

M.-S. Yun, W.-J. Nam, S.-W. Lee, Coarse-to-fine deep metric learning for remote sensing image retrieval, Remote Sensing 12 (2) (2020) 219

2020

[25] [25]

H. Wen, X. Song, J. Yin, J. Wu, W. Guan, L. Nie, Self-training boosted multi- factor matching network for composed image retrieval, IEEE Trans. on Pattern Analysis and Machine Intelligence 46 (5) (2023) 3665–3678

2023

[26] [26]

Q. Yang, M. Ye, Z. Cai, K. Su, B. Du, Composed image retrieval via cross re- lation network with hierarchical aggregation transformer, IEEE Trans. on Image Processing 32 (2023) 4543–4554. 33

2023

[27] [27]

S. Qian, D. Xue, Q. Fang, C. Xu, Integrating multi-label contrastive learning with dual adversarial graph neural networks for cross-modal retrieval, IEEE Transac- tions on Pattern Analysis and Machine Intelligence 45 (4) (2022) 4794–4811

2022

[28] [28]

S. Qian, D. Xue, H. Zhang, Q. Fang, C. Xu, Dual adversarial graph neural net- works for multi-label cross-modal retrieval, in: Proceedings of the AAAI Confer- ence on Artificial Intelligence, V ol. 35, 2021, pp. 2440–2448

2021

[29] [29]

S. Qian, D. Xue, Q. Fang, C. Xu, Adaptive label-aware graph convolutional net- works for cross-modal retrieval, IEEE Transactions on Multimedia 24 (2021) 3520–3532

2021

[30] [30]

Lee, Multilayer cluster neural network for totally unconstrained handwrit- ten numeral recognition, Neural Networks 8 (5) (1995) 783–792

S.-W. Lee, Multilayer cluster neural network for totally unconstrained handwrit- ten numeral recognition, Neural Networks 8 (5) (1995) 783–792

1995

[31] [31]

Roh, H.-K

M.-C. Roh, H.-K. Shin, S.-W. Lee, View-independent human action recognition with volume motion template on single stereo camera, Pattern Recognition Let- ters 31 (7) (2010) 639–647

2010

[32] [32]

S.-W. Lee, J. H. Kim, F. C. Groen, Translation-, rotation-and scale-invariant recognition of hand-drawn symbols in schematic diagrams, International Journal of Pattern Recognition and Artificial Intelligence 4 (01) (1990) 1–25

1990

[33] [33]

Y . Liu, J. Du, X. Gao, J. Han, L. Shao, Relation-aware meta-learning for zero- shot sketch-based image retrieval, IEEE Transactions on Circuits and Systems for Video Technology (2025)

2025

[34] [34]

J. Du, Y . Liu, X. Gao, J. Han, L. Zhang, Zero-shot sketch-based image retrieval with teacher-guided and student-centered cross-modal bidirectional knowledge distillation, Pattern Recognition 164 (2025) 111529

2025

[35] [35]

Y . Liu, J. Du, X. Gao, J. Han, L. Shao, Zero-shot sketch-based image retrieval via mixed species augmentation and bidirectional mining, IEEE Transactions on Circuits and Systems for Video Technology (2026). 34

2026

[36] [36]

Baldrati, L

A. Baldrati, L. Agnolucci, M. Bertini, A. Del Bimbo, Zero-shot composed image retrieval with textual inversion, in: Proc. IEEE/CVF International Conference on Computer Vision (ICCV), 2023, pp. 15338–15347

2023

[37] [37]

Karthik, K

S. Karthik, K. Roth, M. Mancini, Z. Akata, Vision-by-language for training- free compositional image retrieval, in: The Twelfth International Conference on Learning Representations, International Conference on Learning Represen- tations, ICLR, 2024

2024

[38] [38]

Wu, Y .-Y

R.-D. Wu, Y .-Y . Lin, H.-F. Yang, Training-free zero-shot composed image re- trieval via weighted modality fusion and similarity, in: Proc. International Con- ference on Technologies and Applications of Artificial Intelligence (TAAI), Springer, 2024, pp. 77–90

2024

[39] [39]

Kim, Y .-E

Y . Kim, Y .-E. Kim, S.-W. Lee, Enhancing spatio-temporal zero-shot action recog- nition with language-driven description attributes, Pattern Recognition (2025) 112687

2025

[40] [40]

J. Byun, S. Jeong, W. Kim, S. Chun, T. Moon, An efficient post-hoc framework for reducing task discrepancy of text encoders for composed image retrieval, in: Proc. IEEE/CVF International Conference on Computer Vision (ICCV), 2025, pp. 3895–3904

2025

[41] [41]

Y . K. Jang, D. Huynh, A. Shah, W.-K. Chen, S.-N. Lim, Spherical linear interpo- lation and text-anchoring for zero-shot composed image retrieval, in: European Conference on Computer Vision (ECCV), Springer, 2024, pp. 239–254

2024

[42] [42]

Saito, K

K. Saito, K. Sohn, X. Zhang, C.-L. Li, C.-Y . Lee, K. Saenko, T. Pfister, Pic2word: Mapping pictures to words for zero-shot composed image retrieval, in: Proc. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2023, pp. 19305–19314

2023

[43] [43]

Y . Wang, Z. Tian, Q. Guo, Z. Qin, S. Zhou, M. Yang, L. Wang, From mapping to composing: A two-stage framework for zero-shot composed image retrieval, arXiv preprint arXiv:2504.17990 (2025). 35

work page arXiv 2025

[44] [44]

H. Lin, H. Wen, X. Song, M. Liu, Y . Hu, L. Nie, Fine-grained textual inver- sion network for zero-shot composed image retrieval, in: Proc. 47th International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR), 2024, pp. 240–250

2024

[45] [45]

Y . Suo, F. Ma, L. Zhu, Y . Yang, Knowledge-enhanced dual-stream zero-shot com- posed image retrieval, in: Proc. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2024, pp. 26951–26962

2024

[46] [46]

Z. Yang, S. Qian, D. Xue, J. Wu, F. Yang, W. Dong, C. Xu, Semantic editing increment benefits zero-shot composed image retrieval, in: Proceedings of the 32nd ACM International Conference on Multimedia, 2024, pp. 1245–1254

2024

[47] [47]

S. Yao, J. Zhao, D. Yu, N. Du, I. Shafran, K. R. Narasimhan, Y . Cao, React: Syn- ergizing reasoning and acting in language models, in: The Eleventh International Conference on Learning Representations (ICLR), 2023

2023

[48] [48]

Kojima, S

T. Kojima, S. S. Gu, M. Reid, Y . Matsuo, Y . Iwasawa, Large language mod- els are zero-shot reasoners, Advances in Neural Information Processing Systems (NeurIPS) 35 (2022) 22199–22213

2022

[49] [49]

Zubkova, J.-H

H. Zubkova, J.-H. Park, S.-W. Lee, Leveraging contextual confidence for smarter retrieval in large language models, Neural Networks (2026) 108862

2026

[50] [50]

D. Zhou, N. Schärli, L. Hou, J. Wei, N. Scales, X. Wang, D. Schuurmans, C. Cui, O. Bousquet, Q. Le, et al., Least-to-most prompting enables complex reasoning in large language models, 2023

2023

[51] [51]

X. Wang, J. Wei, D. Schuurmans, Q. V . Le, E. H. Chi, S. Narang, A. Chowdh- ery, D. Zhou, Self-consistency improves chain of thought reasoning in language models, in: The Eleventh International Conference on Learning Representations (ICLR), 2023

2023

[52] [52]

Cherti, R

M. Cherti, R. Beaumont, R. Wightman, M. Wortsman, G. Ilharco, C. Gordon, C. Schuhmann, L. Schmidt, J. Jitsev, Reproducible scaling laws for contrastive 36 language-image learning, in: Proc. IEEE/CVF conference on computer vision and pattern recognition (CVPR), 2023, pp. 2818–2829

2023

[53] [53]

Google DeepMind, Introducing gemini 2.0: our new ai model for the agentic era, https://blog.google/technology/ai/google-gemini-2-0/(2024)

2024

[54] [54]

G. Gu, S. Chun, W. Kim, Y . Kang, S. Yun, Language-only efficient training of zero-shot composed image retrieval, in: Proc. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2024, pp. 13225–13234

2024

[55] [55]

Y . Tang, J. Yu, K. Gai, J. Zhuang, G. Xiong, Y . Hu, Q. Wu, Context-i2w: Map- ping images to context-dependent words for accurate zero-shot composed image retrieval, in: Proc. AAAI Conference on Artificial Intelligence, V ol. 38, 2024, pp. 5180–5188

2024

[56] [56]

S. Sun, F. Ye, S. Gong, Training-free zero-shot composed image retrieval with local concept re-ranking, arXiv preprint arXiv:2312.08924 (2023)

work page arXiv 2023

[57] [57]

this is my unicorn, fluffy

N. Cohen, R. Gal, E. A. Meirom, G. Chechik, Y . Atzmon, “this is my unicorn, fluffy”: Personalizing frozen vision-language representations, in: European Con- ference on Computer Vision (ECCV), Springer, 2022, pp. 558–577. 37

2022