STiTch: Semantic Transition and Transportation in Collaboration for Training-Free Zero-Shot Composed Image Retrieval
Pith reviewed 2026-05-21 05:26 UTC · model grok-4.3
The pith
A transition vector in embedding space refines LLM-generated captions, and a bidirectional transportation distance enables set-to-set matching, improving training-free zero-shot composed image retrieval.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The central claim is that the Semantic Transition and Transportation framework does two things. First, it refines an LLM-generated composed caption with a transition vector in embedding space, combined with the user instruction, so that the caption emphasizes the core modification and filters out reference-image noise. Second, it reformulates retrieval as alignment between two discrete distributions and scores matches with a bidirectional transportation distance that accounts for fine-grained cross-modal correspondences.
What carries the argument
The semantic transition vector that adjusts the LLM caption embedding toward the target image when fused with user instruction, paired with bidirectional transportation distance that computes retrieval scores by treating captions and images as sets of features for set-to-set matching.
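As a rough illustration of what a refinement step of this kind could look like, the sketch below shifts a caption embedding along a transition direction and renormalizes it. The specific form used here (instruction embedding minus reference-image embedding, scaled by a hypothetical weight `alpha`) is an assumption made for illustration only; the material reviewed here does not give the paper's actual definition, and `refine_caption_embedding` is a made-up name.

```python
import numpy as np

def l2_normalize(x: np.ndarray) -> np.ndarray:
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = np.linalg.norm(x)
    return x / norm if norm > 0 else x

def refine_caption_embedding(caption_emb, instruction_emb, reference_emb, alpha=0.5):
    """Hypothetical refinement: move the caption embedding along a transition vector.

    ASSUMPTION: the transition vector points from the reference-image embedding
    toward the user-instruction embedding. This is an illustrative guess, not
    the paper's definition.
    """
    transition = l2_normalize(instruction_emb) - l2_normalize(reference_emb)
    return l2_normalize(l2_normalize(caption_emb) + alpha * transition)

# Toy usage with random vectors standing in for real encoder outputs.
rng = np.random.default_rng(0)
caption, instruction, reference = rng.normal(size=(3, 8))
refined = refine_caption_embedding(caption, instruction, reference, alpha=0.5)
print(refined.shape, float(np.linalg.norm(refined)))
```

Setting `alpha=0` recovers the normalized original caption embedding, which makes the unrefined caption a natural baseline for the same code path.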
If this is right
- Refined captions focus on the core modification intent and exclude extraneous details present in the reference image.
- Retrieval scores capture multiple possible feature alignments instead of forcing a single point-to-point match.
- The overall pipeline works across diverse composed image retrieval tasks without any task-specific training or fine-tuning.
- LLM outputs become more reliable inputs for retrieval once adjusted in the shared embedding space.
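To make the set-to-set idea concrete, here is a minimal sketch of a bidirectional transport cost between a set of text features and a set of image features. It assumes softmax transport plans in the style of conditional transport, cosine cost, and a hypothetical `temperature` parameter; none of these choices is confirmed by the material reviewed here.

```python
import numpy as np

def unit_rows(x: np.ndarray) -> np.ndarray:
    """L2-normalize each row so dot products become cosine similarities."""
    return x / np.linalg.norm(x, axis=1, keepdims=True)

def softmax(z: np.ndarray, axis: int) -> np.ndarray:
    """Numerically stable softmax along the given axis."""
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def bidirectional_cost(text_feats, image_feats, temperature=0.1) -> float:
    """Sum of a text-to-image and an image-to-text expected transport cost.

    ASSUMPTION: plans are softmax-normalized similarities and every feature
    carries equal mass; this is one plausible reading, not the paper's formula.
    """
    sim = unit_rows(text_feats) @ unit_rows(image_feats).T  # (n, m) cosine similarities
    cost = 1.0 - sim
    plan_t2v = softmax(sim / temperature, axis=1)  # each text feature spreads mass over image features
    plan_v2t = softmax(sim / temperature, axis=0)  # each image feature spreads mass over text features
    forward = (plan_t2v * cost).sum(axis=1).mean()
    backward = (plan_v2t * cost).sum(axis=0).mean()
    return float(forward + backward)

# Toy usage: a set compared with itself should cost less than with an unrelated set.
rng = np.random.default_rng(0)
caption_set = rng.normal(size=(4, 16))
other_image = rng.normal(size=(6, 16))
same = bidirectional_cost(caption_set, caption_set)
diff = bidirectional_cost(caption_set, other_image)
print(same < diff)
```

A retrieval score can then be taken as the negative cost, so that candidates whose feature sets align well in both directions rank higher.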
Where Pith is reading between the lines
- The same caption-refinement step could help other text-to-image matching tasks where generated descriptions contain image-specific artifacts.
- Bidirectional transportation distances might improve performance in broader cross-modal retrieval problems that currently rely on cosine similarity.
- Testing the method on compositions involving multiple simultaneous changes would reveal whether the transport formulation scales beyond single-instruction edits.
Load-bearing premise
The transition vector computed in embedding space, when combined with user instruction, will reliably filter out noise from the LLM-generated caption and bring it closer to the target image without discarding necessary details or introducing new mismatches.
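One simple way to probe this premise is to measure, per query, whether the refined caption embedding ends up closer to the ground-truth target image embedding than the original caption did. The sketch below assumes the three sets of embeddings have already been computed in a shared space; the function name `refinement_gain` and the toy vectors are illustrative and carry no empirical meaning.

```python
import numpy as np

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def refinement_gain(original_caps, refined_caps, target_imgs) -> np.ndarray:
    """Per-query change in cosine similarity to the target; positive means closer."""
    return np.array([
        cosine(r, t) - cosine(o, t)
        for o, r, t in zip(original_caps, refined_caps, target_imgs)
    ])

# Toy vectors (made up): the first refinement moves toward its target, the second away.
targets = np.array([[1.0, 0.0], [0.0, 1.0]])
original = np.array([[1.0, 1.0], [1.0, 1.0]])
refined_caps = np.array([[2.0, 1.0], [1.0, 0.5]])
gains = refinement_gain(original, refined_caps, targets)
print(gains, "fraction improved:", float((gains > 0).mean()))
```

Reporting the fraction of queries with positive gain, alongside the mean gain, would show whether refinement helps broadly or only on a subset of queries.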
What would settle it
A controlled ablation on a standard composed image retrieval benchmark. If removing the transition vector and the transportation distance leaves retrieval accuracy no worse than a simple LLM-caption baseline, the claim fails; a clear and consistent drop would support it.
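A test of this kind usually comes down to comparing Recall@K across pipeline variants on the same queries. The sketch below assumes each variant has already produced a query-by-candidate score matrix; the variant names and the score values are made-up toy numbers, not results from the paper.

```python
import numpy as np

def recall_at_k(scores: np.ndarray, target_idx, k: int) -> float:
    """Fraction of queries whose ground-truth candidate is among the k best scores.

    scores has shape (num_queries, num_candidates); higher means better.
    """
    topk = np.argsort(-scores, axis=1)[:, :k]
    hits = (topk == np.asarray(target_idx)[:, None]).any(axis=1)
    return float(hits.mean())

def compare_variants(score_tables: dict, target_idx, k: int = 1) -> dict:
    """Recall@k for every named pipeline variant."""
    return {name: recall_at_k(s, target_idx, k) for name, s in score_tables.items()}

# Toy score matrices (fabricated for illustration only).
target_indices = [0, 2]
tables = {
    "llm_caption_baseline": np.array([[0.2, 0.9, 0.1], [0.3, 0.2, 0.8]]),
    "full_pipeline": np.array([[0.9, 0.2, 0.1], [0.1, 0.2, 0.8]]),
}
results = compare_variants(tables, target_indices, k=1)
print(results)
```

Adding one table per ablated component (transition vector only, transportation distance only) would separate the contribution of each part.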
Original abstract
Training-free zero-shot composed image retrieval models are recently gaining increasing research interest due to their generalizability and flexibility in unseen multimodal retrieval. Recent LLM-based advances focus on generating the expected target caption by exploring the compositional ability behind the LLMs. Although efficient, we find that 1) the generated captions tend to introduce unexpected features from the reference image due to the semantic gap between the input image and text modification, where the image contains much more details than the text; 2) the point-to-point alignment during the retrieval stage fails to capture diverse compositions. To address these challenges, we introduce a novel Semantic Transition and Transportation in collaboration framework for training-free zero-shot CIR tasks. Specifically, given the composed caption inferred by an LLM, we aim to refine it through a transition vector in the embedding space and make it closer to the target image. Combining LLMs with user instruction, the refined caption concentrates more on the core modification intent and thus filters out unnecessary noise. Moreover, to explore diverse alignment during the retrieval stage, we model the caption and image as discrete distributions and reformulate the retrieval task as a set-to-set alignment task. Finally, a bidirectional transportation distance is developed to consider fine-grained alignments across modalities and calculate the retrieval score. Extensive experiments demonstrate that our method can be general, effective, and beneficial for many CIR tasks.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes STiTch, a training-free zero-shot composed image retrieval (CIR) framework. It uses an LLM to generate a target caption from a reference image and text modification instruction, refines this caption via a transition vector in the joint embedding space (combined with user instruction) to filter reference-image noise and focus on core intent, models the refined caption and candidate images as discrete distributions, and computes retrieval scores via a bidirectional transportation distance to enable set-to-set rather than point-to-point alignment.
Significance. If the transition-vector refinement and transportation-distance retrieval hold up under scrutiny, the approach could meaningfully advance training-free zero-shot CIR by mitigating semantic gaps between detailed images and terse instructions and by capturing compositional diversity, offering a general, parameter-light alternative to fine-tuned models for multimodal retrieval tasks.
major comments (3)
- [§3 (Method)] The central claim that the transition vector (computed from the LLM-generated caption plus user instruction) reliably produces a refined caption closer to the target image embedding without discarding necessary details or introducing new mismatches is load-bearing yet unsupported by any verification. No human judgment of refined vs. original captions, embedding-distance analysis to ground-truth targets, or ablation isolating the vector's effect is described, leaving the assumption that embedding-space arithmetic is sufficiently linear and semantically meaningful untested.
- [§3.2 (Transportation Distance)] The reformulation of retrieval as set-to-set alignment via bidirectional transportation distance is presented as addressing point-to-point limitations, but the manuscript provides no derivation or pseudocode showing how the discrete distributions are constructed from caption tokens and image features, nor any analysis of computational cost or sensitivity to distribution discretization choices.
- [§4 (Experiments)] Extensive experiments are asserted to demonstrate generality and effectiveness, yet the abstract and available description contain no quantitative results, baseline comparisons, ablation tables, or error analysis. Without these, the claim that the method is 'general, effective, and beneficial for many CIR tasks' cannot be evaluated.
minor comments (2)
- [§3] Notation for the transition vector and the bidirectional transportation distance should be introduced with explicit equations rather than prose descriptions to improve reproducibility.
- [Abstract] The abstract states two problems (unexpected features from reference images and failure to capture diverse compositions) but does not quantify their prevalence or severity in prior LLM-based methods; a short motivating example or statistic would strengthen the motivation.
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed feedback on our manuscript. We address each major comment point by point below, providing clarifications and committing to revisions that strengthen the presentation without altering the core contributions.
-
Referee: [§3 (Method)] The central claim that the transition vector (computed from the LLM-generated caption plus user instruction) reliably produces a refined caption closer to the target image embedding without discarding necessary details or introducing new mismatches is load-bearing yet unsupported by any verification. No human judgment of refined vs. original captions, embedding-distance analysis to ground-truth targets, or ablation isolating the vector's effect is described, leaving the assumption that embedding-space arithmetic is sufficiently linear and semantically meaningful untested.
Authors: We agree that additional empirical verification would make the contribution of the transition vector more robust. The current manuscript motivates the approach via the semantic gap between detailed images and terse instructions, but we will revise §3 to include an ablation isolating the transition vector's effect, embedding-distance measurements to ground-truth targets, and qualitative examples of refined captions demonstrating noise reduction. These additions will directly test the linearity assumption in the joint embedding space.
revision: yes
-
Referee: [§3.2 (Transportation Distance)] The reformulation of retrieval as set-to-set alignment via bidirectional transportation distance is presented as addressing point-to-point limitations, but the manuscript provides no derivation or pseudocode showing how the discrete distributions are constructed from caption tokens and image features, nor any analysis of computational cost or sensitivity to distribution discretization choices.
Authors: We appreciate this request for greater technical detail. In the revision we will add a formal derivation of the bidirectional transportation distance, explicit pseudocode for constructing the discrete distributions (caption tokens as one distribution, image patch features as the other), and a dedicated paragraph analyzing computational complexity together with sensitivity to discretization parameters such as token count or clustering granularity.
revision: yes
-
Referee: [§4 (Experiments)] Extensive experiments are asserted to demonstrate generality and effectiveness, yet the abstract and available description contain no quantitative results, baseline comparisons, ablation tables, or error analysis. Without these, the claim that the method is 'general, effective, and beneficial for many CIR tasks' cannot be evaluated.
Authors: The full manuscript contains quantitative results, baseline comparisons, ablation tables, and error analysis in §4. To address the presentation concern, we will revise the abstract to report key performance metrics and will add an early summary table in the introduction that highlights main findings. This will make the empirical support immediately visible while preserving the existing detailed experimental section.
revision: partial
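The construction requested in the second comment can be sketched with a standard stand-in: entropic optimal transport solved by Sinkhorn iterations (Cuturi [9]). This is a swapped-in technique for illustration, not the paper's bidirectional distance. Uniform mass over caption tokens and image patches, cosine cost, and the `reg` and `n_iters` values are all assumptions.

```python
import numpy as np

def sinkhorn_plan(cost, a, b, reg=0.1, n_iters=200):
    """Entropic-regularized transport plan via Sinkhorn scaling iterations."""
    K = np.exp(-cost / reg)          # Gibbs kernel
    u = np.ones_like(a)
    for _ in range(n_iters):
        v = b / (K.T @ u)            # rescale to match column marginals
        u = a / (K @ v)              # rescale to match row marginals
    return u[:, None] * K * v[None, :]

def transport_distance(token_feats, patch_feats, reg=0.1, n_iters=200):
    """Transport cost between caption-token features and image-patch features.

    ASSUMPTION: both sets are treated as uniform discrete distributions.
    """
    t = token_feats / np.linalg.norm(token_feats, axis=1, keepdims=True)
    p = patch_feats / np.linalg.norm(patch_feats, axis=1, keepdims=True)
    cost = 1.0 - t @ p.T                         # cosine cost, shape (n, m)
    a = np.full(len(t), 1.0 / len(t))            # mass on each caption token
    b = np.full(len(p), 1.0 / len(p))            # mass on each image patch
    plan = sinkhorn_plan(cost, a, b, reg, n_iters)
    return float((plan * cost).sum()), plan

# Toy usage with random features standing in for encoder outputs.
rng = np.random.default_rng(0)
tokens = rng.normal(size=(5, 16))
patches = rng.normal(size=(9, 16))
dist, plan = transport_distance(tokens, patches)
print(plan.shape)
```

Each Sinkhorn iteration costs O(nm) for n tokens and m patches, and the solve is repeated per candidate image, so token count and patch count drive the retrieval-time cost that the referee asks about.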
Circularity Check
No circularity: method relies on external LLMs and embeddings with independent experimental validation
full rationale
The paper introduces a training-free framework that refines LLM-generated captions via embedding-space transition vectors and reformulates retrieval as set-to-set alignment using bidirectional transportation distance. No equations, derivations, or load-bearing steps in the abstract or described method reduce the claimed performance gains to fitted parameters, self-definitions, or self-citation chains by construction. The approach explicitly builds on external components (LLMs for caption generation and pre-trained embedding spaces) and validates effectiveness through experiments on standard CIR benchmarks. This satisfies the default expectation of a self-contained proposal without circular reductions.
Lean theorems connected to this paper
- IndisputableMonolith/Foundation: reality_from_one_distinction (tag: unclear)
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Paper passage: "a novel bidirectional transportation distance... $\mathcal{L}_{bi}(P_t, Q_y) = \mathcal{L}_{P_t \to Q_y} + \mathcal{L}_{Q_y \to P_t}$"
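The quoted passage gives only the sum of two directional terms. One way the terms could be spelled out, assuming uniform discrete distributions over caption features $t_i$ and image features $y_j$, a cosine cost, and softmax plans with a hypothetical temperature $\tau$ in the style of conditional transport, is sketched below. These definitions are assumptions for illustration and are not taken from the paper.

```latex
% Assumed definitions (illustrative only, not from the paper).
P_t = \sum_{i=1}^{n} \tfrac{1}{n}\,\delta_{t_i}, \qquad
Q_y = \sum_{j=1}^{m} \tfrac{1}{m}\,\delta_{y_j}, \qquad
c(t_i, y_j) = 1 - \cos(t_i, y_j)

\mathcal{L}_{P_t \to Q_y} = \frac{1}{n} \sum_{i=1}^{n} \sum_{j=1}^{m}
  \pi(y_j \mid t_i)\, c(t_i, y_j), \qquad
\pi(y_j \mid t_i) = \frac{\exp\!\big(\cos(t_i, y_j)/\tau\big)}
  {\sum_{j'=1}^{m} \exp\!\big(\cos(t_i, y_{j'})/\tau\big)}

\mathcal{L}_{Q_y \to P_t} = \frac{1}{m} \sum_{j=1}^{m} \sum_{i=1}^{n}
  \pi(t_i \mid y_j)\, c(t_i, y_j), \qquad
\pi(t_i \mid y_j) = \frac{\exp\!\big(\cos(t_i, y_j)/\tau\big)}
  {\sum_{i'=1}^{n} \exp\!\big(\cos(t_{i'}, y_j)/\tau\big)}

% Sum stated in the quoted passage.
\mathcal{L}_{bi}(P_t, Q_y) = \mathcal{L}_{P_t \to Q_y} + \mathcal{L}_{Q_y \to P_t}
```

Under this reading the forward term asks whether every caption feature finds support in the image, and the backward term asks whether every image feature is explained by the caption.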
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Reference graph
Works this paper leans on
- [1] Muhammad Umer Anwaar, Egor Labintcev, and Martin Kleinsteuber. Compositional learning of image-text query for image retrieval. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pages 1140–1149, 2021.
- [2] Alberto Baldrati, Marco Bertini, Tiberio Uricchio, and Alberto Del Bimbo. Effective conditioned and composed image retrieval combining clip-based features. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 21466–21474, 2022.
- [3] Alberto Baldrati, Lorenzo Agnolucci, Marco Bertini, and Alberto Del Bimbo. Zero-shot composed image retrieval with textual inversion. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 15338–15347, 2023.
- [4] Guangyi Chen, Weiran Yao, Xiangchen Song, Xinyue Li, Yongming Rao, and Kun Zhang. PLOT: Prompt learning with optimal transport for vision-language models. In The Eleventh International Conference on Learning Representations, 2023.
- [5] Liqun Chen, Zhe Gan, Yu Cheng, Linjie Li, Lawrence Carin, and Jingjing Liu. Graph optimal transport for cross-domain alignment. In International Conference on Machine Learning, pages 1542–1553. PMLR, 2020.
- [6] Yanbei Chen and Loris Bazzani. Learning joint visual semantic matching embeddings for language-guided retrieval. In Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXII 16, pages 136–152. Springer, 2020.
- [7] Yanbei Chen, Shaogang Gong, and Loris Bazzani. Image search with text feedback by visiolinguistic attention learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 3001–3011, 2020.
- [8] Mehdi Cherti, Romain Beaumont, Ross Wightman, Mitchell Wortsman, Gabriel Ilharco, Cade Gordon, Christoph Schuhmann, Ludwig Schmidt, and Jenia Jitsev. Reproducible scaling laws for contrastive language-image learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 2818–2829, 2023.
- [9] Marco Cuturi. Sinkhorn distances: Lightspeed computation of optimal transport. Advances in Neural Information Processing Systems, 26, 2013.
- [10] Ginger Delmas, Rafael Sampaio de Rezende, Gabriela Csurka, and Diane Larlus. Artemis: Attention-based retrieval with text-explicit matching and implicit similarity. arXiv preprint arXiv:2203.08101, 2022.
- [11] Rinon Gal, Yuval Alaluf, Yuval Atzmon, Or Patashnik, Amit H Bermano, Gal Chechik, and Daniel Cohen-Or. An image is worth one word: Personalizing text-to-image generation using textual inversion. arXiv preprint arXiv:2208.01618,
- [12] Geonmo Gu, Sanghyuk Chun, Wonjae Kim, HeeJae Jun, Yoohoon Kang, and Sangdoo Yun. Compodiff: Versatile composed image retrieval with latent diffusion. arXiv preprint arXiv:2303.11916, 2023.
- [13] Geonmo Gu, Sanghyuk Chun, Wonjae Kim, Yoohoon Kang, and Sangdoo Yun. Language-only efficient training of zero-shot composed image retrieval. 2024 IEEE. In CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 13225–13234, 2023.
- [14] Ari Holtzman, Jan Buys, Li Du, Maxwell Forbes, and Yejin Choi. The curious case of neural text degeneration. arXiv preprint arXiv:1904.09751, 2019.
- [15] Mehrdad Hosseinzadeh and Yang Wang. Composed query image retrieval using locally bounded features. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 3596–3605, 2020.
- [16] Yingying Jiang, Hanchao Jia, Xiaobing Wang, and Peng Hao. Hycir: Boosting zero-shot composed image retrieval with synthetic labels. CoRR, abs/2407.05795, 2024.
- [17] Shyamgopal Karthik, Karsten Roth, Massimiliano Mancini, and Zeynep Akata. Vision-by-language for training-free compositional image retrieval. arXiv preprint arXiv:2310.09291,
- [18] John Lee, Max Dabagia, Eva Dyer, and Christopher Rozell. Hierarchical optimal transport for multimodal distribution alignment. Advances in Neural Information Processing Systems, 32, 2019.
- [19] Seungmin Lee, Dongwan Kim, and Bohyung Han. Cosmo: Content-style modulation for image retrieval with text feedback. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 802–812, 2021.
- [20] Junnan Li, Dongxu Li, Silvio Savarese, and Steven Hoi. Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. In International Conference on Machine Learning, pages 19730–19742. PMLR,
- [21] Miaoge Li, Dongsheng Wang, Xinyang Liu, Zequn Zeng, Ruiying Lu, Bo Chen, and Mingyuan Zhou. Patchct: Aligning patch set and label set with conditional transport for multi-label image classification. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 15348–15358, 2023.
- [22] Miaoge Li, Jingcai Guo, Richard Yi Da Xu, Dongsheng Wang, Xiaofeng Cao, Zhijie Rao, and Song Guo. Tsca: on the semantic consistency alignment via conditional transport for compositional zero-shot learning. pages 5607–5615, 2025.
- [23] Wei Li, Hehe Fan, Yongkang Wong, Yi Yang, and Mohan S Kankanhalli. Improving context understanding in multimodal large language models via multimodal composition learning. In ICML, page 7, 2024.
- [24] You Li, Fan Ma, and Yi Yang. Imagine and seek: Improving composed image retrieval with an imagined proxy. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 3984–3993, 2025.
- [25] Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C Lawrence Zitnick. Microsoft coco: Common objects in context. In European Conference on Computer Vision, pages 740–755. Springer, 2014.
- [26] Xinyang Liu, Dongsheng Wang, Bowei Fang, Miaoge Li, Zhibin Duan, Yishi Xu, Bo Chen, and Mingyuan Zhou. Patch-prompt aligned bayesian prompt tuning for vision-language models. arXiv preprint arXiv:2303.09100, 2023.
- [27] Zheyuan Liu, Cristian Rodriguez-Opazo, Damien Teney, and Stephen Gould. Image retrieval on real-life images with pre-trained vision-and-language models. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 2125–2134, 2021.
- [28] Ron Mokady, Amir Hertz, Kfir Aberman, Yael Pritch, and Daniel Cohen-Or. Null-text inversion for editing real images using guided diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 6038–6047, 2023.
- [29] Khoi Pham, Kushal Kafle, Zhe Lin, Zhihong Ding, Scott Cohen, Quan Tran, and Abhinav Shrivastava. Learning to predict visual attributes in the wild. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 13018–13028, 2021.
- [30] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In International Conference on Machine Learning, pages 8748–8763. PMLR, 2021.
- [31] Ievgen Redko, Nicolas Courty, Rémi Flamary, and Devis Tuia. Optimal transport for multi-source domain adaptation under target shift. In The 22nd International Conference on Artificial Intelligence and Statistics, pages 849–858. PMLR,
- [32] Kuniaki Saito, Kihyuk Sohn, Xiang Zhang, Chun-Liang Li, Chen-Yu Lee, Kate Saenko, and Tomas Pfister. Pic2word: Mapping pictures to words for zero-shot composed image retrieval. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 19305–19314,
- [33] John Sweller. Cognitive load during problem solving: Effects on learning. Cognitive Science, 12(2):257–285, 1988.
- [34] Yuanmin Tang, Jing Yu, Keke Gai, Jiamin Zhuang, Gang Xiong, Yue Hu, and Qi Wu. Context-i2w: Mapping images to context-dependent words for accurate zero-shot composed image retrieval. In Proceedings of the AAAI Conference on Artificial Intelligence, pages 5180–5188, 2024.
- [35] Yuanmin Tang, Jing Yu, Keke Gai, Jiamin Zhuang, Gang Xiong, Gaopeng Gou, and Qi Wu. Missing target-relevant information prediction with world model for accurate zero-shot composed image retrieval. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 24785–24795, 2025.
- [36] Yuanmin Tang, Jue Zhang, Xiaoting Qin, Jing Yu, Gaopeng Gou, Gang Xiong, Qingwei Lin, Saravan Rajmohan, Dongmei Zhang, and Qi Wu. Reason-before-retrieve: One-stage reflective chain-of-thoughts for training-free zero-shot composed image retrieval. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 14400–14410,
- [37] Long Tian, Jingyi Feng, Xiaoqiang Chai, Wenchao Chen, Liming Wang, Xiyang Liu, and Bo Chen. Prototypes-oriented transductive few-shot learning with conditional transport. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 16317–16326, 2023.
- [38] Sagar Vaze, Nicolas Carion, and Ishan Misra. Genecis: A benchmark for general conditional image similarity. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 6862–6872, 2023.
- [39]
- [40] Nam Vo, Lu Jiang, Chen Sun, Kevin Murphy, Li-Jia Li, Li Fei-Fei, and James Hays. Composing text and image for image retrieval-an empirical odyssey. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 6439–6448, 2019.
- [41] Dongsheng Wang, Miaoge Li, Xinyang Liu, MingSheng Xu, Bo Chen, and Hanwang Zhang. Tuning multi-mode token-level prompt alignment across modalities. Advances in Neural Information Processing Systems, 36:52792–52810, 2023.
- [42] Dongsheng Wang, Jiequan Cui, Miaoge Li, Wang Lin, Bo Chen, and Hanwang Zhang. Instruction tuning-free visual token complement for multimodal llms. In European Conference on Computer Vision, pages 446–462. Springer, 2024.
- [43] Hui Wu, Yupeng Gao, Xiaoxiao Guo, Ziad Al-Halah, Steven Rennie, Kristen Grauman, and Rogerio Feris. Fashion iq: A new dataset towards retrieving images by natural language feedback. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 11307–11317,
- [44] Zhenyu Yang, Shengsheng Qian, Dizhan Xue, Jiahong Wu, Fan Yang, Weiming Dong, and Changsheng Xu. Semantic editing increment benefits zero-shot composed image retrieval. In Proceedings of the 32nd ACM International Conference on Multimedia, pages 1245–1254, 2024.
- [45] Zhenyu Yang, Dizhan Xue, Shengsheng Qian, Weiming Dong, and Changsheng Xu. Ldre: Llm-based divergent reasoning and ensemble for zero-shot composed image retrieval. In Proceedings of the 47th International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 80–90, 2024.
- [46] Andy Zeng, Maria Attarian, Brian Ichter, Krzysztof Choromanski, Adrian Wong, Stefan Welker, Federico Tombari, Aveek Purohit, Michael Ryoo, Vikas Sindhwani, et al. Socratic models: Composing zero-shot multimodal reasoning with language. arXiv preprint arXiv:2204.00598, 2022.
- [47] Kai Zhang, Yi Luan, Hexiang Hu, Kenton Lee, Siyuan Qiao, Wenhu Chen, Yu Su, and Ming-Wei Chang. Magiclens: Self-supervised image retrieval with open-ended instructions. arXiv preprint arXiv:2403.19651, 2024.
- [48] Peng Zhao and Zhi-Hua Zhou. Label distribution learning by optimal transport. In Proceedings of the AAAI Conference on Artificial Intelligence, 2018.
- [49] Huangjie Zheng and Mingyuan Zhou. Exploiting chain rule and bayes’ theorem to compare probability distributions. Advances in Neural Information Processing Systems, 34:14993–15006, 2021.
- [50] Xingyu Zhu, Shuo Wang, Beier Zhu, Miaoge Li, Yunfan Li, Junfeng Fang, Zhicai Wang, Dongsheng Wang, and Hanwang Zhang. Dynamic multimodal prototype learning in vision-language models. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 2501–2511,
- [51] Yuhan Zhu, Yuyang Ji, Zhiyu Zhao, Gangshan Wu, and Limin Wang. Awt: Transferring vision-language models via augmentation, weighting, and transportation. Advances in Neural Information Processing Systems, 37:25561–25591, 2024.
STiTch: Semantic Transition and Transportation in Collaboration for Training-Free Zero-Shot Composed Image Retrieval Supplementa...
- [52] models. Moreover, current benchmarks suffer from a false-negative problem. As noted in [27], each (reference image, modification) pair in FashionIQ can correspond to multiple valid target images, yet only one is annotated as ground truth. Consequently, semantically correct retrieval results may be unfairly penalized under existing evaluation protocols. W...