TEMA: Anchor the Image, Follow the Text for Multi-Modification Composed Image Retrieval
Pith reviewed 2026-05-09 22:42 UTC · model grok-4.3
The pith
TEMA is the first framework for multi-modification composed image retrieval, using entity mapping to improve accuracy on both new complex datasets and existing benchmarks while balancing efficiency.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
We propose TEMA, the Text-oriented Entity Mapping Architecture, which is the first CIR framework designed for multi-modification while also accommodating simple modifications. Extensive experiments on four benchmark datasets demonstrate TEMA's superiority in both original and multi-modification scenarios, while maintaining an optimal balance between retrieval accuracy and computational efficiency.
Load-bearing premise
That the constructed datasets M-FashionIQ and M-CIRR accurately represent real-world multi-modification queries, and that TEMA generalizes beyond the four tested benchmarks without overfitting to the new data.
Original abstract
Composed Image Retrieval (CIR) is an important image retrieval paradigm that enables users to retrieve a target image using a multimodal query that consists of a reference image and modification text. Although research on CIR has made significant progress, prevailing setups still rely on simple modification texts that typically cover only a limited range of salient changes, which induces two limitations highly relevant to practical applications, namely Insufficient Entity Coverage and Clause-Entity Misalignment. To address these issues and bring CIR closer to real-world use, we construct two instruction-rich multi-modification datasets, M-FashionIQ and M-CIRR. In addition, we propose TEMA, the Text-oriented Entity Mapping Architecture, which is the first CIR framework designed for multi-modification while also accommodating simple modifications. Extensive experiments on four benchmark datasets demonstrate TEMA's superiority in both original and multi-modification scenarios, while maintaining an optimal balance between retrieval accuracy and computational efficiency. Our code and constructed multi-modification datasets (M-FashionIQ and M-CIRR) are available at https://github.com/lee-zixu/ACL26-TEMA/.
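For readers new to the task, the CIR setup described in the abstract (a reference image plus modification text, used together to rank candidate images) can be illustrated with a toy sketch. This is not TEMA's method: the weighted-sum fusion, the three-dimensional embeddings, the gallery names, and every function name below are hypothetical and chosen only to make the query-then-rank flow concrete.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def compose_query(ref_image_emb, text_emb, text_weight=0.5):
    """Fuse the reference-image and modification-text embeddings.

    Assumption: a plain weighted sum stands in for the fusion step.
    It is a placeholder, not the entity-mapping mechanism the paper proposes.
    """
    return [(1 - text_weight) * r + text_weight * t
            for r, t in zip(ref_image_emb, text_emb)]


def rank_candidates(query_emb, gallery):
    """Return gallery image names sorted by descending similarity to the query."""
    return sorted(gallery, key=lambda name: cosine(query_emb, gallery[name]),
                  reverse=True)


def recall_at_k(ranked, target, k):
    """1.0 if the target image appears among the top-k results, else 0.0."""
    return 1.0 if target in ranked[:k] else 0.0


# Toy embeddings (hypothetical): axis 0 ~ "red dress", axis 1 ~ "long", axis 2 ~ "jeans".
reference_image = [1.0, 0.0, 0.0]    # a short red dress
modification_text = [0.0, 1.0, 0.0]  # "make it longer"
gallery = {
    "red_dress_short": [1.0, 0.0, 0.0],
    "red_dress_long": [0.7, 0.7, 0.0],
    "blue_jeans": [0.0, 0.0, 1.0],
}

query = compose_query(reference_image, modification_text)
ranked = rank_candidates(query, gallery)
print(ranked)
print(recall_at_k(ranked, "red_dress_long", k=1))
```

A multi-modification query would carry several such textual changes at once, which is the setting where the abstract argues that simple modification texts fall short.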
Editorial analysis
A structured set of objections, weighed in public.
Axiom & Free-Parameter Ledger
invented entities (1)
-
TEMA (Text-oriented Entity Mapping Architecture)
no independent evidence
Forward citations
Cited by 3 Pith papers
-
OmniTrend: Content-Context Modeling for Scalable Social Popularity Prediction
OmniTrend predicts popularity by combining separate content attractiveness and contextual exposure predictors using cross-modal and exogenous signals.
-
HotComment: A Benchmark for Evaluating Popularity of Online Comments
HotComment is a new multimodal benchmark that quantifies online comment popularity via content quality assessment, interaction-based prediction, and agent-simulated user engagement, accompanied by the StyleCmt stylist...
-
CurEvo: Curriculum-Guided Self-Evolution for Video Understanding
CurEvo integrates curriculum guidance into self-evolution to structure autonomous improvement of video understanding models, yielding gains on VideoQA benchmarks.