pith. machine review for the scientific record.

arxiv: 2605.08389 · v1 · submitted 2026-05-08 · 💻 cs.CV · cs.AI

Recognition: 1 theorem link · Lean Theorem

Decoupling Endpoint and Semantic Transition Learning for Zero-Shot Composed Image Retrieval

Authors on Pith · no claims yet

Pith reviewed 2026-05-12 01:38 UTC · model grok-4.3

classification 💻 cs.CV cs.AI
keywords zero-shot composed image retrieval · ZS-CIR · semantic transition · low-rank adapters · image retrieval · multimodal learning · decoupled learning · projection-based methods

The pith

Decoupling endpoint alignment from semantic transition learning resolves the bottleneck in projection-based zero-shot composed image retrieval.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Projection-based zero-shot composed image retrieval lags on complex edits because the text adapter treats the modification text as a target-side cue instead of a source-conditioned change. Adding explicit transition supervision to the same adapter creates a direct conflict between matching endpoints and learning transitions. DeCIR addresses the conflict by deriving forward and reverse edit pairs from ordinary image-caption data, training two separate low-rank text adapter branches, and combining them through Low-Rank Directional Merge into one deployable model. The result lifts accuracy on standard benchmarks while keeping inference as fast and lightweight as prior projection methods, without needing large language models or extra annotated triplets at test time.

Core claim

Endpoint-level matching in projection-based ZS-CIR lets edit text serve as a target attribute cue rather than grounding a source-conditioned semantic transition. Adding transition supervision to the shared text adapter produces an endpoint-transition conflict. DeCIR constructs paired forward and reverse edit tuples from image-caption pairs, trains separate low-rank text adapter branches for endpoint alignment and semantic transition alignment, and merges them with Low-Rank Directional Merge into a single deployable adapter.

What carries the argument

Paired forward and reverse edit tuples derived from image-caption pairs are used to train separate low-rank text adapter branches for endpoint alignment and semantic transition alignment; the branches are then merged via Low-Rank Directional Merge into one model.
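
The decoupled training can be pictured concretely. Below is a minimal sketch, assuming a frozen CLIP-style text encoder whose linear layers are wrapped with low-rank deltas. The class and loss names are illustrative, not the authors' code, and composing the transition query by feature addition is an assumption (Figure 7 suggests the paper's transition anchor also mixes in image-conditioned source evidence via a coefficient ω, omitted here).

```python
# Minimal sketch of DeCIR-style decoupled training. Not the authors' code:
# the LoRA wrapper, the loss forms, and the additive query composition are
# all assumptions made for illustration.
import torch
import torch.nn as nn
import torch.nn.functional as F

class LoRALinear(nn.Module):
    """A frozen linear layer plus a trainable low-rank delta (A @ B)."""
    def __init__(self, base: nn.Linear, rank: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False  # backbone stays frozen
        self.A = nn.Parameter(torch.zeros(base.out_features, rank))
        self.B = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
        self.scale = alpha / rank

    def forward(self, x):
        # base(x) + scale * ((x @ B^T) @ A^T)
        return self.base(x) + self.scale * F.linear(F.linear(x, self.B), self.A)

def endpoint_loss(text_feat, img_feat, tau=0.07):
    """Symmetric InfoNCE aligning caption embeddings with image embeddings."""
    logits = F.normalize(text_feat, dim=-1) @ F.normalize(img_feat, dim=-1).t() / tau
    labels = torch.arange(logits.size(0), device=logits.device)
    return 0.5 * (F.cross_entropy(logits, labels) + F.cross_entropy(logits.t(), labels))

def transition_loss(src_feat, edit_feat, tgt_feat, tau=0.07):
    """Source-conditioned alignment: (source + forward edit) should land on the
    target; the paired reverse tuple applies the same loss in the other direction."""
    query = F.normalize(src_feat + edit_feat, dim=-1)
    logits = query @ F.normalize(tgt_feat, dim=-1).t() / tau
    labels = torch.arange(logits.size(0), device=logits.device)
    return F.cross_entropy(logits, labels)
```

Because each loss would update a disjoint set of low-rank parameters (one LoRALinear branch per objective), neither objective can overwrite the other's adapter weights before the Low-Rank Directional Merge combines them.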

If this is right

  • Projection-based ZS-CIR can handle complex semantic modifications more effectively while remaining free of LLMs at inference.
  • The single merged adapter preserves the original lightweight inference cost of projection methods.
  • Gains appear consistently on CIRR, CIRCO, FashionIQ, and GeneCIS without extra annotation requirements.
  • Semantic transition supervision becomes usable without harming endpoint alignment.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The same forward-reverse pairing and branch decoupling may apply to other tasks where text must describe a change conditioned on a source image rather than an absolute target.
  • Reducing dependence on human-annotated triplets could extend to additional composed retrieval or image-editing settings.
  • Analogous endpoint-transition conflicts may exist in other adapter-based multimodal adaptation methods.

Load-bearing premise

Forward and reverse edit tuples derived from ordinary image-caption pairs supply unbiased supervision for genuine semantic transitions without domain shift or spurious correlations.
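
Figure 5 and the appendix fragments describe the structured fields the LLM is asked to emit for each image-caption pair. A hedged illustration of one such tuple follows; only the field names track the paper's template, while the caption text is invented here.

```python
# One illustrative forward/reverse edit tuple. Field names follow the
# structured output described in Figure 5; the caption text is invented.
tuple_example = {
    "source_caption":      "a red car parked beside a brick wall",
    "instruction":         "change the car's color from red to blue",  # forward edit
    "modified_caption":    "a blue car parked beside a brick wall",
    "reverse_instruction": "change the car's color from blue to red",  # undoes only the forward edit
}

# Each tuple yields two transition training pairs with swapped endpoints,
# so the transition branch sees the edit in both directions:
forward = (tuple_example["source_caption"],
           tuple_example["instruction"],
           tuple_example["modified_caption"])
reverse = (tuple_example["modified_caption"],
           tuple_example["reverse_instruction"],
           tuple_example["source_caption"])
```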

What would settle it

If the merged adapter shows no gain on complex semantic edits relative to a jointly trained baseline, or if the two branches interfere after Low-Rank Directional Merge, the claimed resolution of the endpoint-transition conflict would not hold.
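
Run as an experiment, this settling test is a direct retrieval comparison. A minimal sketch, assuming precomputed query and gallery features; `encode_query`, `joint_model`, and `merged_model` are hypothetical names, not the paper's API.

```python
import torch

def recall_at_k(query_feats, gallery_feats, target_idx, k=1):
    """Fraction of queries whose ground-truth target ranks in the top k."""
    sims = query_feats @ gallery_feats.t()
    topk = sims.topk(k, dim=1).indices                   # (num_queries, k)
    hits = (topk == target_idx.unsqueeze(1)).any(dim=1)
    return hits.float().mean().item()

# The claim fails if the LRDM-merged adapter does not beat a jointly trained
# shared adapter on complex-edit queries:
# for name, model in [("joint", joint_model), ("merged", merged_model)]:
#     q = encode_query(model, reference_images, edit_texts)  # hypothetical
#     print(name, recall_at_k(q, gallery_feats, target_idx, k=1))
```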

Figures

Figures reproduced from arXiv: 2605.08389 by Jian-Fang Hu, Jianhuang Lai, Mingyu Liu, Quan Zhang, Sihan Huang, Yijia Fan, Yinlin Yan.

Figure 1: Illustration of our motivation. The semantic transition bottleneck in projection-based ZS-CIR can lead to an endpoint shortcut, where the edit text is treated as a target-side cue while source-conditioned evidence from the reference is dropped. This motivates DeCIR to explicitly learn transition-aware matching. Composed image retrieval (CIR) searches for a target image that matches both a reference image a…
Figure 2: Overview of DeCIR. DeCIR decouples endpoint alignment and semantic transition alignment into two specialized branches during training. The text adapters in the two branches, i.e., E_T^end and E_T^trans, are then merged into a single deployable adapter via Low-Rank Directional Merge, enabling efficient inference. DeCIR keeps this lightweight projection-based inference form while adding endpoint alignment and…
Figure 3: Layer-wise gradient interference. Results are shown as mean ± std over five seeds. A larger interference score GI_ℓ = s_base^ℓ − s_cross^ℓ means that endpoint and transition gradients agree less than expected under ordinary same-objective stochastic variation. In other words, a high GI_ℓ indicates that the disagreement is not merely due to noisy gradients in that layer, but reflects additional conflict betw…
Figure 4: Qualitative comparisons from CIRR that illustrate endpoint shortcut cases. Pic2Word can follow target-side cues in the modification, such as count or category, but may drop source-conditioned evidence from the reference. In contrast, DeCIR better preserves the reference cue while applying the requested change, suggesting that semantic transition learning mitigates the endpoint shortcut and better …
Figure 5: Prompt template used for offline supervision construction. We prompt the LLM to produce the forward edit, edited caption, and reverse instruction in one structured output.
Figure 6: Shared-adapter transition-loss weight sweep. We sweep λ_trans in L_end + λ_trans·L_trans on CIRR validation. The sweep supports that joint underperformance is not simply due to an untuned transition-loss scale.
Figure 7: Parameter sensitivity analysis on the CIRR validation set. The α sweep shows that transition coefficients work best when injected into an endpoint alignment branch rather than used alone. The ω sweep studies how much image-conditioned source evidence should enter the transition anchor.
Figure 8: Qualitative comparison on CIRR. Pic2Word can follow parts of the modification but drop source-conditioned evidence. DeCIR retrieves images that satisfy the modification while preserving the reference constraint.
Figure 9: Qualitative comparison on FashionIQ. The examples cover fashion modifications involving coverage, sleeve length, color, logos, and style. DeCIR improves edit sensitivity while preserving source garment evidence. Green boxes indicate correct top-1 retrievals, and red boxes indicate incorrect top-1 retrievals.
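
Figure 3's interference score GI_ℓ = s_base^ℓ − s_cross^ℓ compares cross-objective gradient agreement against a same-objective noise baseline. A minimal sketch of that computation for one layer; the extract does not specify the agreement measure, so cosine similarity is assumed.

```python
import torch
import torch.nn.functional as F

def agreement(g1, g2):
    """Agreement between two flattened gradient vectors (cosine assumed)."""
    return F.cosine_similarity(g1.flatten(), g2.flatten(), dim=0).item()

def interference_score(g_end, g_trans, g_end_a, g_end_b):
    """GI_l = s_base^l - s_cross^l for one layer.

    s_cross: agreement between an endpoint gradient and a transition gradient.
    s_base:  agreement between two endpoint gradients from different
             mini-batches, calibrating how much disagreement ordinary
             stochastic noise already produces at this layer.
    A large positive GI_l means endpoint/transition disagreement exceeds
    same-objective noise, i.e., genuine conflict rather than noisy gradients.
    """
    return agreement(g_end_a, g_end_b) - agreement(g_end, g_trans)
```
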
Original abstract

Zero-shot composed image retrieval (ZS-CIR) retrieves a target image from a reference image and a text modification without human-annotated CIR triplets. Projection-based ZS-CIR methods are attractive because they do not rely on LLMs at inference and remain lightweight, but they often underperform LLM-based approaches on complex semantic modifications. This gap reflects a semantic transition bottleneck in projection-based ZS-CIR: endpoint-level matching can let the edit text act as a target-side attribute cue rather than grounding it as a source-conditioned semantic transition. We further show that adding semantic transition supervision to the same text adapter creates an endpoint-transition conflict between endpoint alignment and semantic transition alignment. To address this conflict, DeCIR decouples endpoint and transition learning. It constructs paired forward/reverse edit tuples from image-caption pairs, trains separate low-rank text adapter branches for endpoint alignment and semantic transition alignment, and merges them with Low-Rank Directional Merge (LRDM) into one deployable adapter. Extensive experiments on CIRR, CIRCO, FashionIQ, and GeneCIS demonstrate that DeCIR consistently improves projection-based ZS-CIR without increasing inference complexity.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper claims that projection-based zero-shot composed image retrieval (ZS-CIR) suffers from an endpoint-transition conflict, where edit text acts as a target attribute cue rather than a source-conditioned modification. To resolve this, DeCIR constructs paired forward/reverse edit tuples from image-caption pairs, trains separate low-rank text adapter branches for endpoint alignment and semantic transition alignment, and merges them via Low-Rank Directional Merge (LRDM) into a single deployable adapter. It reports consistent improvements on CIRR, CIRCO, FashionIQ, and GeneCIS without increasing inference complexity.

Significance. If the decoupling is shown to address a genuine conflict rather than partitioning correlated signals, the work could meaningfully advance lightweight projection-based ZS-CIR by narrowing the gap with LLM-based methods while preserving efficiency. The forward/reverse tuple construction and LRDM merge represent a targeted architectural response to the identified bottleneck. The evaluation across four datasets provides a broad empirical base, and the absence of added inference cost is a practical strength.

major comments (2)
  1. [§3.2] Tuple construction: The central claim that paired forward/reverse edit tuples supply independent, unbiased supervision for the transition branch rests on the assumption that reverse tuples from holistic image-caption pairs are true inverses. Because captions are not source-conditioned modifications, the reverse tuple can reinforce attribute matching or dataset-specific phrasing instead of genuine semantic transitions. This risks the separate branches simply partitioning correlated signals, so that LRDM gains may reflect extra capacity rather than conflict resolution.
  2. [§4] Experiments: The abstract asserts consistent benchmark gains from decoupling, yet no quantitative results, ablations isolating the conflict (e.g., joint vs. decoupled training with matched capacity), error bars, or verification that constructed tuples avoid post-hoc selection bias are referenced. Without these, it is impossible to confirm that reported improvements on CIRR/CIRCO stem from resolving the endpoint-transition conflict rather than added parameters or data volume.
minor comments (2)
  1. [§3.3] The LRDM merge operation is introduced as a novel component but lacks an explicit equation or pseudocode in the method description, making it difficult to reproduce the directional merge precisely; a hedged sketch of one plausible form follows this list.
  2. [Figure 2] Notation for the two low-rank branches (endpoint vs. transition) should be distinguished more clearly in figures and equations to avoid conflating the separate training objectives.
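
Since the extract gives no LRDM equation (minor comment 1), the following is only a guess at a consistent directional merge, in the spirit of the direction-aware merging methods the paper cites (e.g., TIES-merging, RobustMerge): elect a dominant sign per element, drop conflicting contributions, and refactor the merged delta back to a single deployable low-rank adapter.

```python
import torch

def lrdm_sketch(dW_end: torch.Tensor, dW_trans: torch.Tensor, rank: int):
    """A *plausible* directional merge of two low-rank deltas; the paper's
    actual LRDM operator is not specified in this extract."""
    stacked = torch.stack([dW_end, dW_trans])     # (2, out_dim, in_dim)
    sign = torch.sign(stacked.sum(dim=0))         # elect a direction per element
    keep = torch.sign(stacked) == sign            # mask sign-conflicting entries
    merged = (stacked * keep).sum(dim=0)
    # Refactor the dense merged delta back to one rank-`rank` adapter.
    U, S, Vh = torch.linalg.svd(merged, full_matrices=False)
    A = U[:, :rank] * S[:rank]                    # (out_dim, rank)
    B = Vh[:rank]                                 # (rank, in_dim)
    return A, B                                   # merged delta ≈ A @ B
```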

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed and constructive feedback. We address each major comment below, clarifying the design choices in tuple construction and committing to expanded experimental validation. These responses aim to strengthen the manuscript's claims regarding the endpoint-transition conflict and its resolution via decoupling.

Point-by-point responses
  1. Referee: [§3.2] Tuple construction: The central claim that paired forward/reverse edit tuples supply independent, unbiased supervision for the transition branch rests on the assumption that reverse tuples from holistic image-caption pairs are true inverses. Because captions are not source-conditioned modifications, the reverse tuple can reinforce attribute matching or dataset-specific phrasing instead of genuine semantic transitions. This risks the separate branches simply partitioning correlated signals, so that LRDM gains may reflect extra capacity rather than conflict resolution.

    Authors: We acknowledge that image captions provide holistic descriptions rather than explicit source-conditioned edits, which introduces a degree of correlation between forward and reverse signals. However, the paired construction derives the reverse tuple by inverting the semantic direction on the same image-caption pair (e.g., using the caption to define the target state and deriving the opposing edit text), thereby supplying complementary supervision focused on the modification delta rather than static attributes. The endpoint branch is trained exclusively on direct image-text alignment objectives, while the transition branch optimizes for the paired edit direction; this separation, combined with LRDM's directional merging, is intended to prevent simple signal partitioning. We will revise §3.2 to include a concrete example of tuple construction and a brief discussion of the limitations of caption-derived inverses, while retaining the core method. revision: partial

  2. Referee: [§4] Experiments: The abstract asserts consistent benchmark gains from decoupling, yet no quantitative results, ablations isolating the conflict (e.g., joint vs. decoupled training with matched capacity), error bars, or verification that constructed tuples avoid post-hoc selection bias are referenced. Without these, it is impossible to confirm that reported improvements on CIRR/CIRCO stem from resolving the endpoint-transition conflict rather than added parameters or data volume.

    Authors: Section 4 of the manuscript reports quantitative results across CIRR, CIRCO, FashionIQ, and GeneCIS with comparisons to prior projection-based and LLM-based methods. To directly isolate the contribution of decoupling, we will add new ablations in the revised §4: (1) joint training versus decoupled branches under matched parameter budgets, (2) standard error bars computed over three independent runs, and (3) explicit verification that all image-caption pairs are processed systematically without post-hoc filtering. These additions will confirm that gains arise from conflict resolution rather than capacity or data volume increases. revision: yes
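
The promised matched-capacity ablation is easy to specify. A sketch with hypothetical config names follows; the operative constraint is that both settings spend the same total low-rank budget, so any remaining gap isolates decoupling rather than extra parameters.

```python
# Hypothetical ablation grid for the revised Section 4. The ranks are
# illustrative; what matters is that 1 x rank-16 matches 2 x rank-8.
ablations = {
    # one shared adapter trained jointly on L_end + lambda_trans * L_trans
    "joint":     {"adapters": 1, "rank": 16, "loss": "end + lambda_trans * trans"},
    # two separate adapters, trained on one objective each, merged via LRDM
    "decoupled": {"adapters": 2, "rank": 8, "loss": ["end", "trans"], "merge": "LRDM"},
}
seeds = [0, 1, 2]  # three independent runs per setting for mean +/- std
```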

Circularity Check

0 steps flagged

No circularity: architectural decoupling is independent of fitted inputs

Full rationale

The paper's core contribution is an architectural change—constructing forward/reverse tuples from existing image-caption pairs, training separate low-rank adapter branches, and merging via LRDM—applied on top of standard projection-based ZS-CIR frameworks. No equations, derivations, or self-citations are presented that reduce the claimed performance gains to a quantity defined by the method itself or to a fitted parameter renamed as a prediction. The supervision signal is generated from external data sources rather than being tautological with the target metric, and the decoupling benefit is presented as an empirical architectural hypothesis rather than a mathematical necessity. This satisfies the criteria for a self-contained, non-circular derivation.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The abstract-only view limits visibility; the method rests on the domain assumption that image-caption pairs yield valid edit tuples and introduces LRDM as a new merge operator.

axioms (1)
  • domain assumption: Image-caption pairs can be transformed into forward/reverse edit tuples that provide clean supervision for both endpoint and semantic transition objectives.
    Invoked to justify the training data construction step described in the abstract.
invented entities (1)
  • Low-Rank Directional Merge (LRDM): no independent evidence
    purpose: Combine separate endpoint and transition adapter branches into a single deployable low-rank adapter.
    New operator introduced to resolve the endpoint-transition conflict at inference time.


