STiTch: Semantic Transition and Transportation in Collaboration for Training-Free Zero-Shot Composed Image Retrieval

Dongsheng Wang; Jingcai Guo; Jinsen Zhang; Miaoge Li; Wenhan Luo; Zening Sun

arxiv: 2605.21261 · v1 · pith:HW6AH7POnew · submitted 2026-05-20 · 💻 cs.CV

Pith reviewed 2026-05-21 05:26 UTC · model grok-4.3

keywords: composed image retrieval · zero-shot retrieval · training-free methods · LLM caption refinement · semantic transition · transportation distance · set-to-set alignment · multimodal retrieval

The pith

A transition vector in embedding space refines LLM captions and bidirectional transportation distances enable set-to-set matching for improved training-free zero-shot composed image retrieval.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper aims to solve two problems in zero-shot composed image retrieval where large language models generate target descriptions from a reference image and a modification instruction. The generated text often adds unwanted details from the image because text cannot capture all its content, and simple similarity scores during retrieval miss varied ways the instruction can combine with the image. The proposed approach adds a transition vector to move the caption embedding closer to the intended target while using the user's instruction to remove noise, then models both text and images as collections of features whose best alignments are measured by a bidirectional transport cost. If this holds, retrieval systems could find edited images accurately on new tasks without any labeled training pairs or model updates.

Core claim

The central claim is that the Semantic Transition and Transportation framework refines an LLM-generated composed caption via a transition vector in embedding space, combined with user instructions to emphasize core modifications and filter reference-image noise, while reformulating retrieval as alignment between two discrete distributions and scoring matches with a bidirectional transportation distance that accounts for fine-grained cross-modal correspondences.

What carries the argument

The semantic transition vector that adjusts the LLM caption embedding toward the target image when fused with user instruction, paired with bidirectional transportation distance that computes retrieval scores by treating captions and images as sets of features for set-to-set matching.
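To make the set-to-set idea concrete, here is a minimal NumPy sketch of a bidirectional transport score between a set of caption features and a set of image features. It is not the authors' formulation: the softmax-based transport plans, the cosine cost, the uniform weights over set elements, and the `temperature` parameter are all assumptions chosen for illustration, and the function name is hypothetical.

```python
import numpy as np

def _softmax(x, axis):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def _l2_normalize(x):
    # Scale each row to unit length so dot products are cosine similarities.
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

def bidirectional_transport_distance(text_feats, image_feats, temperature=0.1):
    """Hypothetical set-to-set distance between (n, d) text features and
    (m, d) image features. Assumes uniform mass on every set element."""
    t = _l2_normalize(np.asarray(text_feats, dtype=float))
    v = _l2_normalize(np.asarray(image_feats, dtype=float))
    sim = t @ v.T                # (n, m) cosine similarities
    cost = 1.0 - sim             # assumed ground cost: cosine distance
    # Text -> image: each text feature spreads its mass over image features.
    plan_t2v = _softmax(sim / temperature, axis=1)
    forward = (plan_t2v * cost).sum(axis=1).mean()
    # Image -> text: each image feature spreads its mass over text features.
    plan_v2t = _softmax(sim / temperature, axis=0)
    backward = (plan_v2t * cost).sum(axis=0).mean()
    return float(forward + backward)

# Toy usage: three caption features against five image features.
rng = np.random.default_rng(0)
caption_set = rng.normal(size=(3, 8))
image_set = rng.normal(size=(5, 8))
retrieval_score = -bidirectional_transport_distance(caption_set, image_set)
```

Under these assumptions a lower distance means a better match, so the negated distance can serve as a retrieval score. Because the two directions are summed, the score stays the same when the two sets swap roles, which a single point-to-point cosine between pooled vectors does not need to guarantee at the feature level.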

If this is right

Refined captions focus on the core modification intent and exclude extraneous details present in the reference image.
Retrieval scores capture multiple possible feature alignments instead of forcing a single point-to-point match.
The overall pipeline works across diverse composed image retrieval tasks without any task-specific training or fine-tuning.
LLM outputs become more reliable inputs for retrieval once adjusted in the shared embedding space.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same caption-refinement step could help other text-to-image matching tasks where generated descriptions contain image-specific artifacts.
Bidirectional transportation distances might improve performance in broader cross-modal retrieval problems that currently rely on cosine similarity.
Testing the method on compositions involving multiple simultaneous changes would reveal whether the transport formulation scales beyond single-instruction edits.

Load-bearing premise

The transition vector computed in embedding space, when combined with user instruction, will reliably filter out noise from the LLM-generated caption and bring it closer to the target image without discarding necessary details or introducing new mismatches.
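One way to probe this premise is to measure whether a refined embedding actually ends up closer to the target image embedding than the original caption embedding did. The sketch below is an illustrative diagnostic, not the paper's procedure: the additive update, the step size `alpha`, and both function names are assumptions, and the transition vector is taken as given rather than derived.

```python
import numpy as np

def unit(x):
    # Return x scaled to unit L2 norm.
    x = np.asarray(x, dtype=float)
    return x / np.linalg.norm(x)

def apply_transition(caption_emb, transition_vec, alpha=0.5):
    """Assumed refinement rule: shift the normalized caption embedding
    along the transition vector, then renormalize."""
    shifted = unit(caption_emb) + alpha * np.asarray(transition_vec, dtype=float)
    return unit(shifted)

def transition_gain(caption_emb, refined_emb, target_emb):
    """Change in cosine similarity to the target. Positive values mean the
    refinement moved the caption closer to the target image embedding."""
    target = unit(target_emb)
    return float(unit(refined_emb) @ target - unit(caption_emb) @ target)

# Toy example: the caption carries noise along the second axis.
target = np.array([1.0, 0.0, 0.0])
caption = np.array([1.0, 1.0, 0.0])
helpful = apply_transition(caption, np.array([0.0, -1.0, 0.0]))
harmful = apply_transition(caption, np.array([-1.0, 0.0, 0.0]))
gain_helpful = transition_gain(caption, helpful, target)
gain_harmful = transition_gain(caption, harmful, target)
```

In the toy case the first transition removes the noisy component and yields a positive gain, while the second removes the signal and yields a negative one. Averaging such gains over a benchmark with known targets would show how often the premise holds in practice.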

What would settle it

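A controlled ablation on a standard composed image retrieval benchmark that removes the transition vector and the transportation distance, then compares the result against both the full method and a simple LLM-caption baseline. A clear accuracy drop without the two components would support the claim; no drop would undercut it.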
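Comparisons of this kind are commonly scored with Recall@K, the fraction of queries whose target image appears among the top K retrieved candidates. The helper below is a generic sketch of that metric; the score matrices are made-up toy values, not results from the paper, and the variable names are hypothetical.

```python
import numpy as np

def recall_at_k(scores, target_idx, k):
    """scores: (num_queries, num_candidates), higher means a better match.
    target_idx: index of the correct candidate for each query."""
    scores = np.asarray(scores, dtype=float)
    target_idx = np.asarray(target_idx)
    # Indices of the k highest-scoring candidates per query.
    topk = np.argsort(-scores, axis=1)[:, :k]
    hits = (topk == target_idx[:, None]).any(axis=1)
    return float(hits.mean())

# Toy score matrices for two queries and three candidates (illustrative only).
targets = [0, 1]
full_scores = [[0.9, 0.1, 0.2], [0.2, 0.8, 0.3]]
ablated_scores = [[0.5, 0.9, 0.2], [0.2, 0.8, 0.3]]
full_r1 = recall_at_k(full_scores, targets, k=1)
ablated_r1 = recall_at_k(ablated_scores, targets, k=1)
ablated_r2 = recall_at_k(ablated_scores, targets, k=2)
```

Running the same function on score matrices from the full pipeline, each single-component ablation, and the plain caption baseline gives directly comparable numbers for the test described above.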

Figures

Figures reproduced from arXiv: 2605.21261 by Dongsheng Wang, Jingcai Guo, Jinsen Zhang, Miaoge Li, Wenhan Luo, Zening Sun.

**Figure 1.** Motivation of our proposed model. Predicted captions from MLLMs typically consist of expected ground-truth sentences (red…

**Figure 2.** The overall framework of our method. STiTch first queries MLLMs to generate multiple captions and then refines them towards…

**Figure 3.** Ablation on the number of descriptions and image aug…

**Figure 5.** Visualization of the GeneCIS dataset on the 'Focus Object' task. Heatmaps before and after the transition on target image are…

**Figure 6.** Further comparison between SEIZE and STiTch in terms…

**Figure 7.** Visualization of the GeneCIS dataset on the 'Focus Object' task. Heatmaps before and after the transition on target image are…

**Figure 8.** Examples of our in-context learning on GeneCIS dataset. Each sample uses the same placeholder “…

Original abstract

Training-free zero-shot composed image retrieval models are recently gaining increasing research interest due to their generalizability and flexibility in unseen multimodal retrieval. Recent LLM-based advances focus on generating the expected target caption by exploring the compositional ability behind the LLMs. Although efficient, we find that 1) the generated captions tend to introduce unexpected features from the reference image due to the semantic gap between the input image and text modification, where the image contains much more details than the text; 2) the point-to-point alignment during the retrieval stage fails to capture diverse compositions. To address these challenges, we introduce a novel Semantic Transition and Transportation in collaboration framework for training-free zero-shot CIR tasks. Specifically, given the composed caption inferred by an LLM, we aim to refine it through a transition vector in the embedding space and make it closer to the target image. Combining LLMs with user instruction, the refined caption concentrates more on the core modification intent and thus filters out unnecessary noise. Moreover, to explore diverse alignment during the retrieval stage, we model the caption and image as discrete distributions and reformulate the retrieval task as a set-to-set alignment task. Finally, a bidirectional transportation distance is developed to consider fine-grained alignments across modalities and calculate the retrieval score. Extensive experiments demonstrate that our method can be general, effective, and beneficial for many CIR tasks.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

STiTch refines LLM captions via embedding transition vectors and scores with bidirectional transport, but the vector step's reliability rests on an unverified assumption about embedding geometry.

The paper's concrete contribution is a training-free fix for zero-shot composed image retrieval that first cleans an LLM-generated caption by adding a transition vector in embedding space, then replaces point-to-point matching with a bidirectional transportation distance over discrete distributions. This pairing directly targets the noise from reference-image details and the lack of diverse composition handling that the authors flag in prior LLM work. The approach stays simple, reuses existing embedders and LLMs, and avoids any training, which makes it easy to test on top of current pipelines. That is the part worth noting for someone already running CIR experiments.

The transport formulation is a reasonable way to turn retrieval into set-to-set alignment and should capture more fine-grained matches than cosine on single vectors. The transition step tries to keep the modification intent while dropping extraneous visual features that text prompts cannot fully control. Both moves are incremental but address real, observable problems in the current LLM-based line of work.

The soft spot is the transition vector itself. It assumes the joint embedding space is linear enough that a simple combination of LLM caption and user instruction will move the representation closer to the unknown target image without discarding needed details or creating fresh mismatches. The abstract gives no direct check on this, such as human ratings of refined versus original captions or distance measurements against ground-truth targets. If the geometry does not support the arithmetic cleanly, the refinement could hurt on harder cases rather than help. Experiments are claimed to show gains across tasks, yet without the tables or component ablations it is difficult to separate the contribution of the vector from the transport distance or to judge robustness across LLMs and datasets.

This is for researchers working on training-free multimodal retrieval who need a new baseline or want to try transport distances in this setting. A reader who cares about practical zero-shot CIR could extract a usable idea even if the numbers need confirmation. The paper has a focused claim and enough structure to deserve a serious referee who can examine the experimental controls and the embedding assumption.

Referee Report

3 major / 2 minor

Summary. The paper proposes STiTch, a training-free zero-shot composed image retrieval (CIR) framework. It uses an LLM to generate a target caption from a reference image and text modification instruction, refines this caption via a transition vector in the joint embedding space (combined with user instruction) to filter reference-image noise and focus on core intent, models the refined caption and candidate images as discrete distributions, and computes retrieval scores via a bidirectional transportation distance to enable set-to-set rather than point-to-point alignment.

Significance. If the transition-vector refinement and transportation-distance retrieval hold up under scrutiny, the approach could meaningfully advance training-free zero-shot CIR by mitigating semantic gaps between detailed images and terse instructions and by capturing compositional diversity, offering a general, parameter-light alternative to fine-tuned models for multimodal retrieval tasks.

major comments (3)

[§3 (Method)] The central claim that the transition vector (computed from the LLM-generated caption plus user instruction) reliably produces a refined caption closer to the target image embedding without discarding necessary details or introducing new mismatches is load-bearing yet unsupported by any verification. No human judgment of refined vs. original captions, embedding-distance analysis to ground-truth targets, or ablation isolating the vector's effect is described, leaving the assumption that embedding-space arithmetic is sufficiently linear and semantically meaningful untested.
[§3.2 (Transportation Distance)] The reformulation of retrieval as set-to-set alignment via bidirectional transportation distance is presented as addressing point-to-point limitations, but the manuscript provides no derivation or pseudocode showing how the discrete distributions are constructed from caption tokens and image features, nor any analysis of computational cost or sensitivity to distribution discretization choices.
[§4 (Experiments)] Extensive experiments are asserted to demonstrate generality and effectiveness, yet the abstract and available description contain no quantitative results, baseline comparisons, ablation tables, or error analysis. Without these, the claim that the method is 'general, effective, and beneficial for many CIR tasks' cannot be evaluated.

minor comments (2)

[§3] Notation for the transition vector and the bidirectional transportation distance should be introduced with explicit equations rather than prose descriptions to improve reproducibility.
[Abstract] The abstract states two problems (unexpected features from reference images and failure to capture diverse compositions) but does not quantify their prevalence or severity in prior LLM-based methods; a short motivating example or statistic would strengthen the motivation.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback on our manuscript. We address each major comment point by point below, providing clarifications and committing to revisions that strengthen the presentation without altering the core contributions.

Referee: [§3 (Method)] The central claim that the transition vector (computed from the LLM-generated caption plus user instruction) reliably produces a refined caption closer to the target image embedding without discarding necessary details or introducing new mismatches is load-bearing yet unsupported by any verification. No human judgment of refined vs. original captions, embedding-distance analysis to ground-truth targets, or ablation isolating the vector's effect is described, leaving the assumption that embedding-space arithmetic is sufficiently linear and semantically meaningful untested.

Authors: We agree that additional empirical verification would make the contribution of the transition vector more robust. The current manuscript motivates the approach via the semantic gap between detailed images and terse instructions, but we will revise §3 to include an ablation isolating the transition vector's effect, embedding-distance measurements to ground-truth targets, and qualitative examples of refined captions demonstrating noise reduction. These additions will directly test the linearity assumption in the joint embedding space.

Revision: yes

Referee: [§3.2 (Transportation Distance)] The reformulation of retrieval as set-to-set alignment via bidirectional transportation distance is presented as addressing point-to-point limitations, but the manuscript provides no derivation or pseudocode showing how the discrete distributions are constructed from caption tokens and image features, nor any analysis of computational cost or sensitivity to distribution discretization choices.

Authors: We appreciate this request for greater technical detail. In the revision we will add a formal derivation of the bidirectional transportation distance, explicit pseudocode for constructing the discrete distributions (caption tokens as one distribution, image patch features as the other), and a dedicated paragraph analyzing computational complexity together with sensitivity to discretization parameters such as token count or clustering granularity.

Revision: yes

Referee: [§4 (Experiments)] Extensive experiments are asserted to demonstrate generality and effectiveness, yet the abstract and available description contain no quantitative results, baseline comparisons, ablation tables, or error analysis. Without these, the claim that the method is 'general, effective, and beneficial for many CIR tasks' cannot be evaluated.

Authors: The full manuscript contains quantitative results, baseline comparisons, ablation tables, and error analysis in §4. To address the presentation concern, we will revise the abstract to report key performance metrics and will add an early summary table in the introduction that highlights main findings. This will make the empirical support immediately visible while preserving the existing detailed experimental section.

Revision: partial

Circularity Check

0 steps flagged

No circularity: method relies on external LLMs and embeddings with independent experimental validation

The paper introduces a training-free framework that refines LLM-generated captions via embedding-space transition vectors and reformulates retrieval as set-to-set alignment using bidirectional transportation distance. No equations, derivations, or load-bearing steps in the abstract or described method reduce the claimed performance gains to fitted parameters, self-definitions, or self-citation chains by construction. The approach explicitly builds on external components (LLMs for caption generation and pre-trained embedding spaces) and validates effectiveness through experiments on standard CIR benchmarks. This satisfies the default expectation of a self-contained proposal without circular reductions.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review yields no explicit free parameters, axioms, or invented entities; the transition vector and transportation distance are presented as new constructs but lack sufficient detail to classify further.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation reality_from_one_distinction unclear

a novel bidirectional transportation distance... $L_{bi}(P_t, Q_y) = L_{P_t \to Q_y} + L_{Q_y \to P_t}$

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

52 extracted references · 52 canonical work pages · 3 internal anchors

[1]

Compositional learning of image-text query for image retrieval

Muhammad Umer Anwaar, Egor Labintcev, and Martin Kle- insteuber. Compositional learning of image-text query for image retrieval. InProceedings of the IEEE/CVF Winter conference on Applications of Computer Vision, pages 1140– 1149, 2021. 1, 3

work page 2021
[2]

Effective conditioned and composed image retrieval combining clip-based features

Alberto Baldrati, Marco Bertini, Tiberio Uricchio, and Al- berto Del Bimbo. Effective conditioned and composed image retrieval combining clip-based features. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 21466–21474, 2022. 1

work page 2022
[3]

Zero-shot composed image retrieval with textual inversion

Alberto Baldrati, Lorenzo Agnolucci, Marco Bertini, and Alberto Del Bimbo. Zero-shot composed image retrieval with textual inversion. InProceedings of the IEEE/CVF international conference on computer vision, pages 15338– 15347, 2023. 1, 3, 5, 6

work page 2023
[4]

PLOT: Prompt learning with optimal transport for vision-language models

Guangyi Chen, Weiran Yao, Xiangchen Song, Xinyue Li, Yongming Rao, and Kun Zhang. PLOT: Prompt learning with optimal transport for vision-language models. InThe Eleventh International Conference on Learning Representations, 2023. 2

work page 2023
[5]

Graph optimal transport for cross-domain alignment

Liqun Chen, Zhe Gan, Yu Cheng, Linjie Li, Lawrence Carin, and Jingjing Liu. Graph optimal transport for cross-domain alignment. InInternational Conference on Machine Learning, pages 1542–1553. PMLR, 2020. 3

work page 2020
[6]

Learning joint visual seman- tic matching embeddings for language-guided retrieval

Yanbei Chen and Loris Bazzani. Learning joint visual seman- tic matching embeddings for language-guided retrieval. In Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXII 16, pages 136–152. Springer, 2020. 3

work page 2020
[7]

Image search with text feedback by visiolinguistic attention learning

Yanbei Chen, Shaogang Gong, and Loris Bazzani. Image search with text feedback by visiolinguistic attention learning. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 3001–3011, 2020. 1, 3

work page 2020
[8]

Reproducible scal- ing laws for contrastive language-image learning

Mehdi Cherti, Romain Beaumont, Ross Wightman, Mitchell Wortsman, Gabriel Ilharco, Cade Gordon, Christoph Schuh- mann, Ludwig Schmidt, and Jenia Jitsev. Reproducible scal- ing laws for contrastive language-image learning. InProceed- ings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 2818–2829, 2023. 6

work page 2023
[9]

Sinkhorn distances: Lightspeed computation of optimal transport.Advances in neural information processing systems, 26, 2013

Marco Cuturi. Sinkhorn distances: Lightspeed computation of optimal transport.Advances in neural information processing systems, 26, 2013. 3

work page 2013
[10]

Artemis: Attention-based retrieval with text-explicit matching and implicit similarity.arXiv preprint arXiv:2203.08101, 2022

Ginger Delmas, Rafael Sampaio de Rezende, Gabriela Csurka, and Diane Larlus. Artemis: Attention-based retrieval with text-explicit matching and implicit similarity.arXiv preprint arXiv:2203.08101, 2022. 1

work page arXiv 2022
[11]

An Image is Worth One Word: Personalizing Text-to-Image Generation using Textual Inversion

Rinon Gal, Yuval Alaluf, Yuval Atzmon, Or Patashnik, Amit H Bermano, Gal Chechik, and Daniel Cohen-Or. An image is worth one word: Personalizing text-to-image genera- tion using textual inversion.arXiv preprint arXiv:2208.01618,

work page internal anchor Pith review Pith/arXiv arXiv
[12]

Compodiff: Versatile composed image retrieval with latent diffusion,

Geonmo Gu, Sanghyuk Chun, Wonjae Kim, HeeJae Jun, Yoohoon Kang, and Sangdoo Yun. Compodiff: Versatile com- posed image retrieval with latent diffusion.arXiv preprint arXiv:2303.11916, 2023. 3

work page arXiv 2023
[13]

Language-only efficient training of zero- shot composed image retrieval

Geonmo Gu, Sanghyuk Chun, Wonjae Kim, Yoohoon Kang, and Sangdoo Yun. Language-only efficient training of zero- shot composed image retrieval. 2024 ieee. InCVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 13225–13234, 2023. 5

work page 2024
[14]

The Curious Case of Neural Text Degeneration

Ari Holtzman, Jan Buys, Li Du, Maxwell Forbes, and Yejin Choi. The curious case of neural text degeneration.arXiv preprint arXiv:1904.09751, 2019. 4

work page internal anchor Pith review Pith/arXiv arXiv 1904
[15]

Composed query image retrieval using locally bounded features

Mehrdad Hosseinzadeh and Yang Wang. Composed query image retrieval using locally bounded features. InProceed- ings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 3596–3605, 2020. 1

work page 2020
[16]

Hycir: Boosting zero-shot composed image retrieval with synthetic labels.CoRR, abs/2407.05795, 2024

Yingying Jiang, Hanchao Jia, Xiaobing Wang, and Peng Hao. Hycir: Boosting zero-shot composed image retrieval with synthetic labels.CoRR, abs/2407.05795, 2024. 1

work page arXiv 2024
[17]

Vision-by-language for training-free com- positional image retrieval.arXiv preprint arXiv:2310.09291,

Shyamgopal Karthik, Karsten Roth, Massimiliano Mancini, and Zeynep Akata. Vision-by-language for training-free com- positional image retrieval.arXiv preprint arXiv:2310.09291,

work page arXiv
[18]

Hierarchical optimal transport for multimodal distribution alignment.Advances in neural information processing sys- tems, 32, 2019

John Lee, Max Dabagia, Eva Dyer, and Christopher Rozell. Hierarchical optimal transport for multimodal distribution alignment.Advances in neural information processing sys- tems, 32, 2019. 3

work page 2019
[19]

Cosmo: Content-style modulation for image retrieval with text feed- back

Seungmin Lee, Dongwan Kim, and Bohyung Han. Cosmo: Content-style modulation for image retrieval with text feed- back. InProceedings of the IEEE/CVF Conference on Com- puter Vision and Pattern Recognition, pages 802–812, 2021. 1, 3

work page 2021
[20]

Blip- 2: Bootstrapping language-image pre-training with frozen image encoders and large language models

Junnan Li, Dongxu Li, Silvio Savarese, and Steven Hoi. Blip- 2: Bootstrapping language-image pre-training with frozen image encoders and large language models. InInternational conference on machine learning, pages 19730–19742. PMLR,

work page
[21]

Patchct: Aligning patch set and label set with conditional transport for multi- label image classification

Miaoge Li, Dongsheng Wang, Xinyang Liu, Zequn Zeng, Ruiying Lu, Bo Chen, and Mingyuan Zhou. Patchct: Aligning patch set and label set with conditional transport for multi- label image classification. InProceedings of the IEEE/CVF international conference on computer vision, pages 15348– 15358, 2023. 3

work page 2023
[22]

Tsca: on the semantic consistency alignment via conditional transport for compositional zero-shot learning

Miaoge Li, Jingcai Guo, Richard Yi Da Xu, Dongsheng Wang, Xiaofeng Cao, Zhijie Rao, and Song Guo. Tsca: on the semantic consistency alignment via conditional transport for compositional zero-shot learning. pages 5607–5615, 2025. 3

work page 2025
[23]

Improving context understanding in multimodal large language models via multimodal composition learning

Wei Li, Hehe Fan, Yongkang Wong, Yi Yang, and Mohan S Kankanhalli. Improving context understanding in multimodal large language models via multimodal composition learning. InICML, page 7, 2024. 4

work page 2024
[24]

Imagine and seek: Improving composed image retrieval with an imagined proxy

You Li, Fan Ma, and Yi Yang. Imagine and seek: Improving composed image retrieval with an imagined proxy. InPro- ceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 3984–3993, 2025. 3

work page 2025
[25]

Microsoft coco: Common objects in context

Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Doll´ar, and C Lawrence Zitnick. Microsoft coco: Common objects in context. In European conference on computer vision, pages 740–755. Springer, 2014. 5

work page 2014
[26]

Patch- prompt aligned bayesian prompt tuning for vision-language models.arXiv preprint arXiv:2303.09100, 2023

Xinyang Liu, Dongsheng Wang, Bowei Fang, Miaoge Li, Zhibin Duan, Yishi Xu, Bo Chen, and Mingyuan Zhou. Patch- prompt aligned bayesian prompt tuning for vision-language models.arXiv preprint arXiv:2303.09100, 2023. 3

work page arXiv 2023
[27]

Image retrieval on real-life images with pre- trained vision-and-language models

Zheyuan Liu, Cristian Rodriguez-Opazo, Damien Teney, and Stephen Gould. Image retrieval on real-life images with pre- trained vision-and-language models. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 2125–2134, 2021. 5, 6

work page 2021
[28]

Null-text inversion for editing real im- ages using guided diffusion models

Ron Mokady, Amir Hertz, Kfir Aberman, Yael Pritch, and Daniel Cohen-Or. Null-text inversion for editing real im- ages using guided diffusion models. InProceedings of the IEEE/CVF conference on computer vision and pattern recog- nition, pages 6038–6047, 2023. 1

work page 2023
[29]

Learning to predict visual attributes in the wild

Khoi Pham, Kushal Kafle, Zhe Lin, Zhihong Ding, Scott Co- hen, Quan Tran, and Abhinav Shrivastava. Learning to predict visual attributes in the wild. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 13018–13028, 2021. 5

work page 2021
[30]

Learning transferable visual models from natural language supervision

Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. InInternational Conference on Machine Learning, pages 8748–8763. PMLR, 2021. 6

work page 2021
[31]

Optimal transport for multi-source domain adaptation under target shift

Ievgen Redko, Nicolas Courty, R ´emi Flamary, and Devis Tuia. Optimal transport for multi-source domain adaptation under target shift. InThe 22nd International Conference on Artificial Intelligence and Statistics, pages 849–858. PMLR,

work page
[32]

Pic2word: Mapping pictures to words for zero-shot composed image retrieval

Kuniaki Saito, Kihyuk Sohn, Xiang Zhang, Chun-Liang Li, Chen-Yu Lee, Kate Saenko, and Tomas Pfister. Pic2word: Mapping pictures to words for zero-shot composed image retrieval. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 19305–19314,

work page
[33]

Cognitive load during problem solving: Effects on learning.Cognitive science, 12(2):257–285, 1988

John Sweller. Cognitive load during problem solving: Effects on learning.Cognitive science, 12(2):257–285, 1988. 1

work page 1988
[34]

Context-i2w: Mapping images to context-dependent words for accurate zero-shot composed image retrieval

Yuanmin Tang, Jing Yu, Keke Gai, Jiamin Zhuang, Gang Xiong, Yue Hu, and Qi Wu. Context-i2w: Mapping images to context-dependent words for accurate zero-shot composed image retrieval. InProceedings of the AAAI Conference on Artificial Intelligence, pages 5180–5188, 2024. 5

work page 2024
[35]

Missing target-relevant in- formation prediction with world model for accurate zero-shot composed image retrieval

Yuanmin Tang, Jing Yu, Keke Gai, Jiamin Zhuang, Gang Xiong, Gaopeng Gou, and Qi Wu. Missing target-relevant in- formation prediction with world model for accurate zero-shot composed image retrieval. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 24785– 24795, 2025. 3

work page 2025
[36]

Reason-before-retrieve: One-stage reflective chain-of-thoughts for training-free zero-shot com- posed image retrieval

Yuanmin Tang, Jue Zhang, Xiaoting Qin, Jing Yu, Gaopeng Gou, Gang Xiong, Qingwei Lin, Saravan Rajmohan, Dong- mei Zhang, and Qi Wu. Reason-before-retrieve: One-stage reflective chain-of-thoughts for training-free zero-shot com- posed image retrieval. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 14400–14410,

work page
[37]

Prototypes-oriented transductive few-shot learning with conditional transport

Long Tian, Jingyi Feng, Xiaoqiang Chai, Wenchao Chen, Liming Wang, Xiyang Liu, and Bo Chen. Prototypes-oriented transductive few-shot learning with conditional transport. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 16317–16326, 2023. 3

work page 2023
[38]

Genecis: A benchmark for general conditional image similarity

Sagar Vaze, Nicolas Carion, and Ishan Misra. Genecis: A benchmark for general conditional image similarity. InPro- ceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 6862–6872, 2023. 5

work page 2023
[39]

Springer,

C´edric Villani et al.Optimal transport: old and new. Springer,

work page
[40]

Nam Vo, Lu Jiang, Chen Sun, Kevin Murphy, Li-Jia Li, Li Fei-Fei, and James Hays. Composing text and image for image retrieval - an empirical odyssey. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 6439–6448, 2019. 1, 3

[41]

Dongsheng Wang, Miaoge Li, Xinyang Liu, MingSheng Xu, Bo Chen, and Hanwang Zhang. Tuning multi-mode token-level prompt alignment across modalities. Advances in Neural Information Processing Systems, 36:52792–52810, 2023. 2, 3

[42]

Dongsheng Wang, Jiequan Cui, Miaoge Li, Wang Lin, Bo Chen, and Hanwang Zhang. Instruction tuning-free visual token complement for multimodal LLMs. In European Conference on Computer Vision, pages 446–462. Springer, 2024. 1

[43]

Hui Wu, Yupeng Gao, Xiaoxiao Guo, Ziad Al-Halah, Steven Rennie, Kristen Grauman, and Rogerio Feris. Fashion IQ: A new dataset towards retrieving images by natural language feedback. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 11307–11317,

[44]

Zhenyu Yang, Shengsheng Qian, Dizhan Xue, Jiahong Wu, Fan Yang, Weiming Dong, and Changsheng Xu. Semantic editing increment benefits zero-shot composed image retrieval. In Proceedings of the 32nd ACM International Conference on Multimedia, pages 1245–1254, 2024. 2, 5

[45]

Zhenyu Yang, Dizhan Xue, Shengsheng Qian, Weiming Dong, and Changsheng Xu. LDRE: LLM-based divergent reasoning and ensemble for zero-shot composed image retrieval. In Proceedings of the 47th International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 80–90, 2024. 3, 5

[46]

Andy Zeng, Maria Attarian, Brian Ichter, Krzysztof Choromanski, Adrian Wong, Stefan Welker, Federico Tombari, Aveek Purohit, Michael Ryoo, Vikas Sindhwani, et al. Socratic models: Composing zero-shot multimodal reasoning with language. arXiv preprint arXiv:2204.00598, 2022. 1

[47]

Kai Zhang, Yi Luan, Hexiang Hu, Kenton Lee, Siyuan Qiao, Wenhu Chen, Yu Su, and Ming-Wei Chang. MagicLens: Self-supervised image retrieval with open-ended instructions. arXiv preprint arXiv:2403.19651, 2024. 3

[48]

Peng Zhao and Zhi-Hua Zhou. Label distribution learning by optimal transport. In Proceedings of the AAAI Conference on Artificial Intelligence, 2018. 3

[49]

Huangjie Zheng and Mingyuan Zhou. Exploiting chain rule and Bayes’ theorem to compare probability distributions. Advances in Neural Information Processing Systems, 34:14993–15006, 2021. 3

[50]

Xingyu Zhu, Shuo Wang, Beier Zhu, Miaoge Li, Yunfan Li, Junfeng Fang, Zhicai Wang, Dongsheng Wang, and Hanwang Zhang. Dynamic multimodal prototype learning in vision-language models. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 2501–2511,

[51]

Yuhan Zhu, Yuyang Ji, Zhiyu Zhao, Gangshan Wu, and Limin Wang. AWT: Transferring vision-language models via augmentation, weighting, and transportation. Advances in Neural Information Processing Systems, 37:25561–25591, 2024. 2

STiTch: Semantic Transition and Transportation in Collaboration for Training-Free Zero-Shot Composed Image Retrieval Supplementa...

models. Moreover, current benchmarks suffer from a false-negative problem. As noted in [27], each (reference image, modification) pair in FashionIQ can correspond to multiple valid target images, yet only one is annotated as ground truth. Consequently, semantically correct retrieval results may be unfairly penalized under existing evaluation protocols. W...
