Federated Cross-Modal Retrieval with Missing Modalities via Semantic Routing and Adapter Personalization
Pith reviewed 2026-05-08 12:41 UTC · model grok-4.3
The pith
RCSR uses semantic routing and prototype anchoring to improve federated cross-modal retrieval when clients have missing modalities.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
RCSR integrates prototype anchoring, retrieval-centric semantic routing, and optional client-specific adapters on a frozen CLIP backbone. The claim is that this combination improves global retrieval accuracy and training stability in federated cross-modal retrieval under non-IID distributions and missing modalities, and also raises client-level performance for clients with incomplete modalities.
What carries the argument
A server-side semantic router assigns aggregation weights based on retrieval consistency. It works together with prototype anchoring, which aligns unimodal clients with global semantics, and with lightweight shared and personal adapters.
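A minimal sketch of consistency-based aggregation weighting follows. This page does not state the router's actual weighting rule, so the softmax with a temperature, the function names, and the toy numbers are all assumptions made for illustration.

```python
import math

def router_weights(consistency_scores, temperature=1.0):
    """Turn per-client retrieval-consistency scores into aggregation weights.

    Assumption: a temperature-scaled softmax; the paper's rule may differ.
    """
    top = max(consistency_scores)  # subtract the max for numerical stability
    exps = [math.exp((s - top) / temperature) for s in consistency_scores]
    total = sum(exps)
    return [e / total for e in exps]

def aggregate(updates, weights):
    """Weighted average of equally sized adapter-update vectors."""
    dim = len(updates[0])
    return [sum(w * u[i] for w, u in zip(weights, updates)) for i in range(dim)]

# Hypothetical values, not taken from the paper.
scores = [0.9, 0.5, 0.1]                          # one consistency score per client
updates = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]    # one adapter update per client
weights = router_weights(scores, temperature=0.5)
global_update = aggregate(updates, weights)
```

With equal scores the weights collapse to uniform averaging, which makes the uniform case a natural baseline for this kind of router.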
If this is right
- Global retrieval accuracy rises on benchmarks such as MS-COCO and Flickr30K.
- Training stability improves during heterogeneous client updates.
- Client-level retrieval performance increases, especially for clients with missing modalities.
- Lightweight adapters support efficient global knowledge sharing alongside local personalization.
Where Pith is reading between the lines
- The routing and anchoring approach could apply to other federated multimodal tasks that face partial data and distribution shifts.
- Varying the fraction of unimodal clients in tests would clarify how robust the consistency-based weighting remains.
- Adapter personalization may lower communication overhead by shifting more work to local devices.
Load-bearing premise
The server-side semantic router can reliably measure retrieval consistency to weight updates and mitigate alignment drift without creating new biases or instability under highly heterogeneous non-IID conditions.
What would settle it
The claim would be falsified by experiments in which applying the semantic router yields lower global retrieval accuracy or greater training instability than standard federated averaging, in settings with many unimodal clients and strong non-IID splits.
Original abstract
Federated cross-modal retrieval faces severe challenges from heterogeneous client data, particularly non-IID semantic distributions and missing modalities. Under such heterogeneity, a single global model is often insufficient to capture both shared cross-modal knowledge and client-specific characteristics. We propose RCSR, a personalization-friendly federated framework that integrates prototype anchoring, retrieval-centric semantic routing, and optional client-specific adapters. Built on a frozen CLIP backbone, RCSR leverages lightweight shared adapters for global knowledge transfer while supporting efficient local personalization. Prototype anchoring helps unimodal clients align with global cross-modal semantics, and a server-side semantic router adaptively assigns aggregation weights based on retrieval consistency to mitigate alignment drift during heterogeneous updates. Extensive experiments on MS-COCO, Flickr30K, and other benchmarks show that RCSR consistently improves global retrieval accuracy and training stability, while further enhancing client-level retrieval performance, especially for clients with incomplete modalities. Code is available at https://github.com/RezinChow/RCSR-Retrieval-Centric-Semantic-Routing.
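To make the prototype-anchoring idea in the abstract more concrete, here is a minimal sketch for a text-only client. It assumes the server shares one global prototype vector per concept label and that the client is penalized by the cosine distance between each local embedding and its label's prototype. The loss form, the function names, and the toy data are assumptions; the paper's exact formulation is not given on this page.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def anchoring_loss(embeddings, labels, prototypes):
    """Mean (1 - cosine) distance between each embedding and its label's prototype."""
    losses = [1.0 - cosine(e, prototypes[y]) for e, y in zip(embeddings, labels)]
    return sum(losses) / len(losses)

# Hypothetical toy data: two global prototypes and two local text embeddings.
prototypes = {"dog": [1.0, 0.0], "car": [0.0, 1.0]}
text_embeddings = [[0.9, 0.1], [0.2, 0.8]]
labels = ["dog", "car"]
loss = anchoring_loss(text_embeddings, labels, prototypes)
```

The loss is zero when every embedding points in the same direction as its prototype and grows as embeddings drift away, which is the behavior an anchoring term needs when the other modality is absent.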
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes RCSR, a federated cross-modal retrieval framework for handling non-IID data and missing modalities. It builds on a frozen CLIP backbone with three components: prototype anchoring to align unimodal clients to global semantics, a server-side retrieval-centric semantic router that adaptively weights client updates by retrieval consistency to reduce alignment drift, and optional lightweight client-specific adapters for personalization. Experiments on MS-COCO, Flickr30K and related benchmarks report consistent gains in global retrieval accuracy, training stability, and especially client-level performance for incomplete-modality clients.
Significance. If the empirical gains hold under rigorous controls, the work addresses a practically important gap in federated multimodal learning by combining efficient global knowledge transfer with client personalization while explicitly targeting missing-modality clients. The frozen-CLIP design and public code release are strengths that facilitate reproducibility and adoption.
major comments (2)
- [§3.2] (Semantic Router): The central claim that the retrieval-consistency router mitigates alignment drift for incomplete-modality clients rests on the assumption that prototype-anchored consistency scores remain reliable when one modality is absent. No analysis or ablation is provided showing the distribution of consistency scores (or resulting aggregation weights) for unimodal versus multimodal clients under extreme non-IID partitions; if the router systematically down-weights the very clients it aims to help, both the global accuracy and client-level gains would be undermined. An explicit diagnostic (e.g., weight histograms or correlation between missing-modality fraction and assigned weight) is required to substantiate the mechanism.
- [Table 2 / §4.3] (Ablation on router): The reported improvements for incomplete-modality clients are shown only under the full RCSR pipeline. Removing the router (or replacing it with uniform averaging) while keeping prototype anchoring and adapters would isolate whether the consistency-based weighting is load-bearing or whether gains derive primarily from anchoring and adapters; the current ablations do not perform this isolation.
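The diagnostic requested in the first major comment could look roughly like the sketch below. It computes the Pearson correlation between each client's missing-modality fraction and its assigned router weight, plus the mean weight for unimodal versus multimodal clients. All numbers are placeholders and not results from the paper, and the helper names are hypothetical.

```python
import math

def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences (both need non-zero variance)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

def mean_weight_by_group(weights, is_unimodal):
    """Return (mean weight of unimodal clients, mean weight of multimodal clients)."""
    uni = [w for w, u in zip(weights, is_unimodal) if u]
    multi = [w for w, u in zip(weights, is_unimodal) if not u]
    return sum(uni) / len(uni), sum(multi) / len(multi)

# Placeholder per-client values for illustration only.
missing_fraction = [0.0, 0.0, 0.5, 0.5, 1.0]
router_weight = [0.30, 0.25, 0.20, 0.15, 0.10]
is_unimodal = [f > 0 for f in missing_fraction]

r = pearson(missing_fraction, router_weight)
uni_mean, multi_mean = mean_weight_by_group(router_weight, is_unimodal)
```

A strongly negative `r`, or a unimodal mean far below the multimodal mean, would be the signature of the down-weighting the comment warns about.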
minor comments (2)
- [§4.1]: The description of how prototype anchoring is performed for clients missing an entire modality (e.g., text-only) is brief; a short algorithmic box or pseudocode would clarify the exact computation of the anchored embedding.
- [Figure 3]: The legend and axis labels for the stability curves are difficult to read at print size; increasing font size and adding a short caption explaining what 'retrieval consistency' quantifies on the y-axis would improve clarity.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on the semantic router mechanism and the need for more targeted ablations. We have revised the manuscript to incorporate the requested diagnostics and isolation experiments, which we believe strengthen the claims regarding the router's role in handling missing modalities.
- Referee: [§3.2] (Semantic Router): The central claim that the retrieval-consistency router mitigates alignment drift for incomplete-modality clients rests on the assumption that prototype-anchored consistency scores remain reliable when one modality is absent. No analysis or ablation is provided showing the distribution of consistency scores (or resulting aggregation weights) for unimodal versus multimodal clients under extreme non-IID partitions; if the router systematically down-weights the very clients it aims to help, both the global accuracy and client-level gains would be undermined. An explicit diagnostic (e.g., weight histograms or correlation between missing-modality fraction and assigned weight) is required to substantiate the mechanism.
  Authors: We agree that an explicit diagnostic is necessary to validate the router's behavior under missing modalities. In the revised manuscript, we have added a new analysis subsection in §3.2 (with an accompanying figure) that reports (i) histograms of prototype-anchored consistency scores and resulting aggregation weights separately for unimodal and multimodal clients under extreme non-IID partitions, and (ii) the Pearson correlation between per-client missing-modality fraction and assigned router weight. The results confirm that consistency scores remain reliable for unimodal clients thanks to prototype anchoring, and that the router does not systematically down-weight incomplete clients; the correlation is in fact mildly positive, indicating that clients preserving retrieval consistency receive appropriate emphasis.
  Revision: yes
- Referee: [Table 2 / §4.3] (Ablation on router): The reported improvements for incomplete-modality clients are shown only under the full RCSR pipeline. Removing the router (or replacing it with uniform averaging) while keeping prototype anchoring and adapters would isolate whether the consistency-based weighting is load-bearing or whether gains derive primarily from anchoring and adapters; the current ablations do not perform this isolation.
  Authors: We concur that isolating the router's contribution is important. We have extended the ablation study in §4.3 and updated Table 2 with a new row for the variant that retains prototype anchoring and client adapters but replaces the semantic router with uniform averaging. The additional results show that anchoring plus adapters already yield gains over the frozen-CLIP baseline, yet the consistency-based router provides further statistically significant improvements, especially on client-level metrics for incomplete-modality clients. This confirms that the retrieval-centric weighting is load-bearing rather than redundant.
  Revision: yes
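The rebuttal does not say which test backs the "statistically significant" wording. As one way such a claim could be checked, the sketch below runs an exact paired sign-flip permutation test on per-client score differences (router variant minus uniform-averaging variant). The choice of test, the function name, and the numbers are assumptions for illustration, not the authors' procedure or data.

```python
from itertools import product

def paired_signflip_pvalue(diffs):
    """One-sided exact p-value for the hypothesis that the mean difference is > 0.

    Under the null, each paired difference is equally likely to be positive or
    negative, so every sign pattern is enumerated (feasible for small client counts).
    """
    observed = sum(diffs)
    at_least_as_large = 0
    total = 0
    for signs in product((1, -1), repeat=len(diffs)):
        total += 1
        flipped = sum(s * d for s, d in zip(signs, diffs))
        if flipped >= observed - 1e-12:  # small tolerance for float rounding
            at_least_as_large += 1
    return at_least_as_large / total

# Placeholder per-client differences in a retrieval score (router minus uniform).
diffs = [0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0]
p_value = paired_signflip_pvalue(diffs)
```

Because the enumeration is exhaustive, the result is deterministic; for larger client counts a random sample of sign patterns would be the usual substitute.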
Circularity Check
No circularity: framework uses external frozen CLIP with independent experimental validation
The paper introduces RCSR as a federated framework combining prototype anchoring, retrieval-centric semantic routing, and client adapters on a frozen CLIP backbone. No load-bearing equations, predictions, or uniqueness claims are shown to reduce by construction to fitted inputs or self-citations. Claims of improved accuracy and stability rest on empirical results across MS-COCO and Flickr30K rather than self-referential definitions, satisfying the criteria for a self-contained derivation chain.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: A frozen CLIP backbone already encodes sufficient shared cross-modal semantics for the target task.
invented entities (2)
- retrieval-centric semantic router: no independent evidence
- prototype anchoring: no independent evidence