pith. sign in

arxiv: 2606.11689 · v1 · pith:HP36UCSNnew · submitted 2026-06-10 · 💻 cs.CV

RankVR: Low-Rank Structure Perception and Value Recalibration for Robust Composed Image Retrieval

Pith reviewed 2026-06-27 10:33 UTC · model grok-4.3

classification 💻 cs.CV
keywords composed image retrievalnoisy tripletslow-rank structurevalue recalibrationcorrelation matrixfashioniqcirrdenoising
0
0 comments X

The pith

The RankVR framework improves composed image retrieval by detecting noisy triplets through changes in the rank of their correlation matrix and recalibrating their training value.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Composed image retrieval requires models to understand a reference image plus a text modification to find the target image. Large datasets often contain noisy triplets where the image and text do not correspond correctly, which hurts performance. The paper shows that tracking the effective rank of the correlation matrix across samples reveals which ones break the global semantic pattern. It also tracks how much each triplet can still improve during training to decide its value. This allows the model to focus on useful hard examples while ignoring those with logical conflicts.

Core claim

RankVR addresses noisy triplet correspondence in composed image retrieval by introducing the Global Structure Consistency Perception module that uses the effective rank of the correlation matrix to identify samples disrupting macroscopic semantic symmetry, and the Adaptive Semantic Value Calibration module that integrates training potential and reliability to quantify each triplet's semantic value, resulting in more robust models on benchmarks like FashionIQ and CIRR.

What carries the argument

The Effective Rank of the Correlation Matrix in the GSCP module, which identifies samples disrupting macroscopic semantic symmetry by rank difference.

If this is right

  • RankVR achieves superior performance compared to existing state-of-the-art methods on FashionIQ and CIRR under noisy conditions.
  • The GSCP module decouples clean samples from those causing global structural inconsistency.
  • The ASVC module ensures effective use of hard clean samples while suppressing noise with logical conflicts.
  • The approach handles both global correlations and dynamic value changes during training.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Applying similar rank-based detection could help in other multimodal tasks with noisy alignments, such as video-text retrieval.
  • The dynamic calibration might reduce the need for extensive data cleaning in large-scale training pipelines.
  • If the correlation matrix is computed efficiently, this could scale to very large datasets without much overhead.

Load-bearing premise

Differences in the effective rank of the correlation matrix reliably flag samples that break semantic symmetry, and combining training potential with reliability gives an accurate value score without adding bias.

What would settle it

Running the same experiments on FashionIQ and CIRR but replacing the rank-based selection with random or scalar-based selection and finding no performance improvement would falsify the claim that the rank difference provides unique value for denoising.

Figures

Figures reproduced from arXiv: 2606.11689 by Jiale Huang, Qinlei Huang, Yupeng Hu, Zhiheng Fu, Zhiwei Chen, Zixu Li.

Figure 1
Figure 1. Figure 1: (a) CIR task with NTC. (b) Global Structural In [PITH_FULL_IMAGE:figures/full_fig_p002_1.png] view at source ↗
Figure 4
Figure 4. Figure 4: Impact of Weights on CIRR and FashionIQ. [PITH_FULL_IMAGE:figures/full_fig_p007_4.png] view at source ↗
Figure 3
Figure 3. Figure 3: Ablation Study of GSCP Module on CIRR and Fash [PITH_FULL_IMAGE:figures/full_fig_p007_3.png] view at source ↗
Figure 6
Figure 6. Figure 6: Sensitivity analysis of hyperparameters on CIRR [PITH_FULL_IMAGE:figures/full_fig_p008_6.png] view at source ↗
Figure 7
Figure 7. Figure 7: Case Study on (a) FashionIQ and (b) CIRR. [PITH_FULL_IMAGE:figures/full_fig_p008_7.png] view at source ↗
read the original abstract

Composed Image Retrieval (CIR) constitutes a pivotal paradigm requiring models to perform joint reasoning on reference images and modification texts. However, the prevalence of Noisy Triplet Correspondence (NTC) in large-scale datasets severely constrains model performance. Existing denoising methods either target binary mismatches or rely on scalar-based point-wise estimation, neglecting rich global structural correlations among sample populations and dynamic value variations during training, thereby yielding suboptimal results. This paper identifies two critical unresolved challenges: Global Structural Inconsistency of Semantic Correlations and Hard Sample Discrimination Uncertainty. To address these, we propose RankVR, a framework designed to construct a robust CIR model via global structure consistency and dynamic value perception. Specifically, we introduce the Global Structure Consistency Perception (GSCP) module, which utilizes the Effective Rank of the Correlation Matrix to decouple clean samples from structural noise. By measuring rank difference, GSCP identifies samples disrupting macroscopic semantic symmetry. Furthermore, we develop the Adaptive Semantic Value Calibration (ASVC) module to distinguish high-value hard clean samples. By integrating training potential and reliability, it dynamically quantifies the semantic value of each triplet, ensuring effective utilization of hard samples while suppressing noise characterized by logical conflicts. Extensive experiments on the FashionIQ and CIRR benchmark datasets demonstrate that RankVR significantly outperforms existing state-of-the-art methods, validating its superior robustness in noisy environments.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 0 minor

Summary. The paper proposes RankVR, a framework for robust Composed Image Retrieval under Noisy Triplet Correspondence (NTC). It introduces the Global Structure Consistency Perception (GSCP) module, which uses the effective rank of the correlation matrix to identify samples disrupting macroscopic semantic symmetry via rank difference, and the Adaptive Semantic Value Calibration (ASVC) module, which integrates training potential and reliability to dynamically quantify triplet semantic value and distinguish high-value hard clean samples from noise. The central claim is that RankVR significantly outperforms existing SOTA methods on the FashionIQ and CIRR benchmarks, demonstrating superior robustness in noisy environments.

Significance. If the GSCP and ASVC modules are shown to reliably separate structural noise from clean hard samples without introducing new biases, the work could provide a useful global-structure-aware alternative to point-wise denoising methods in CIR. The approach leverages population-level correlations and dynamic training signals, which addresses stated limitations of prior binary-mismatch or scalar-estimation techniques.

major comments (3)
  1. [Abstract] Abstract: The central claim of significant outperformance on FashionIQ and CIRR is stated without any quantitative metrics, tables, error bars, ablation results, or statistical details, which is load-bearing for assessing whether the reported robustness gains can be attributed to the proposed modules rather than implementation details.
  2. [Abstract] Abstract (GSCP description): No derivation or construction details are supplied for the correlation matrix (e.g., which embeddings, batch size, centering), nor any controlled experiment showing that rank difference correlates with injected NTC levels or cleanly isolates noise without discarding hard clean samples; this directly underpins the claim that GSCP decouples clean samples from structural noise.
  3. [Abstract] Abstract (ASVC description): The integration of training potential and reliability is asserted to quantify semantic value without new biases, but no sensitivity analysis, formulation, or validation against the weakest assumption (rank-difference heuristic separating clean vs. noisy triplets) is referenced, which is load-bearing for the hard-sample discrimination claim.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. We address each major comment regarding the abstract below and outline the planned revisions.

read point-by-point responses
  1. Referee: [Abstract] Abstract: The central claim of significant outperformance on FashionIQ and CIRR is stated without any quantitative metrics, tables, error bars, ablation results, or statistical details, which is load-bearing for assessing whether the reported robustness gains can be attributed to the proposed modules rather than implementation details.

    Authors: We agree that the abstract would be strengthened by including quantitative support. In the revised version, we will incorporate specific metrics such as the percentage improvements in recall@K on FashionIQ and CIRR under noisy conditions, with references to the primary results tables and error bars from our experiments. revision: yes

  2. Referee: [Abstract] Abstract (GSCP description): No derivation or construction details are supplied for the correlation matrix (e.g., which embeddings, batch size, centering), nor any controlled experiment showing that rank difference correlates with injected NTC levels or cleanly isolates noise without discarding hard clean samples; this directly underpins the claim that GSCP decouples clean samples from structural noise.

    Authors: The full derivation and construction of the correlation matrix (using reference image and modification text embeddings within each batch, with centering applied) are detailed in Section 3.2. We will add a concise description of this construction to the abstract. The controlled experiments demonstrating the correlation of rank difference with NTC injection levels and isolation of noise (without discarding hard clean samples) appear in Section 4.3; we will reference these results in the revised abstract. revision: partial

  3. Referee: [Abstract] Abstract (ASVC description): The integration of training potential and reliability is asserted to quantify semantic value without new biases, but no sensitivity analysis, formulation, or validation against the weakest assumption (rank-difference heuristic separating clean vs. noisy triplets) is referenced, which is load-bearing for the hard-sample discrimination claim.

    Authors: We agree the abstract description of ASVC is high-level. The formulation combining training potential and reliability, along with sensitivity analysis and validation against the rank-difference heuristic, is presented in Section 3.3 and the supplementary material. We will revise the abstract to briefly describe the formulation and reference the validation experiments. revision: yes

Circularity Check

0 steps flagged

No circularity: modules reference external rank concepts without self-referential reduction

full rationale

The abstract and described framework introduce GSCP (using effective rank of correlation matrix to measure rank difference for identifying structural noise) and ASVC (integrating training potential and reliability for semantic value). No equations, derivations, or self-citations are shown that would reduce any claimed prediction or uniqueness result to a fitted input or prior author ansatz by construction. The central robustness claims are positioned as empirically validated on FashionIQ/CIRR rather than tautological. This matches the default non-circular outcome when no load-bearing step exhibits the enumerated patterns.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract supplies no explicit free parameters, axioms, or invented entities; modules are described conceptually without mathematical definitions or assumptions listed.

pith-pipeline@v0.9.1-grok · 5788 in / 1010 out tokens · 30617 ms · 2026-06-27T10:33:07.753229+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

116 extracted references · 1 canonical work pages

  1. [1]

    Xinxing Xu, Yong Liu, Salman Khan, Fahad Khan, Wangmeng Zuo, Rick Siow Mong Goh, Chun-Mei Feng, et al . 2024. Sentence-level Prompts Bene- fit Composed Image Retrieval. InICLR

  2. [2]

    Haokun Wen, Xuemeng Song, Jianhua Yin, Jianlong Wu, Weili Guan, and Liqiang Nie. 2023. Self-Training Boosted Multi-Factor Matching Network for Composed Image Retrieval.IEEE TPAMI(2023)

  3. [3]

    Zixu Li, Zhiwei Chen, Haokun Wen, Zhiheng Fu, Yupeng Hu, and Weili Guan

  4. [4]

    InAAAI, Vol

    Encoder: Entity mining and modification relation binding for composed image retrieval. InAAAI, Vol. 39. 5101–5109

  5. [5]

    Guozhi Qiu, Zhiwei Chen, Zixu Li, Qinlei Huang, Zhiheng Fu, Xuemeng Song, and Yupeng Hu. 2026. MELT: Improve Composed Image Retrieval via the Modification Frequentation-Rarity Balance Network.arXiv preprint arXiv:2603.29291(2026)

  6. [6]

    Peng Hu, Zhenyu Huang, Dezhong Peng, Xu Wang, and Xi Peng. 2023. Cross- modal retrieval with partially mismatched pairs.IEEE TPAMI45, 8 (2023), 9595– 9610

  7. [7]

    Yang Qin, Yingke Chen, Dezhong Peng, Xi Peng, Joey Tianyi Zhou, and Peng Hu

  8. [8]

    Noisy-correspondence learning for text-to-image person re-identification. InCVPR. 27197–27206

  9. [9]

    Yupeng Hu, Zixu Li, Zhiwei Chen, Qinlei Huang, Zhiheng Fu, Mingzhu Xu, and Liqiang Nie. 2026. REFINE: Composed Video Retrieval via Shared and Differential Semantics Enhancement.ACM ToMM(2026)

  10. [10]

    Zixu Li, Yupeng Hu, Zhiwei Chen, Qinlei Huang, Guozhi Qiu, Zhiheng Fu, and Meng Liu. 2026. ReTrack: Evidence-Driven Dual-Stream Directional Anchor Calibration Network for Composed Video Retrieval. InAAAI, Vol. 40. 23373– 23381

  11. [11]

    Zixu Li, Zhiheng Fu, Yupeng Hu, Zhiwei Chen, Haokun Wen, and Liqiang Nie

  12. [12]

    FineCIR: Explicit Parsing of Fine-Grained Modification Semantics for Composed Image Retrieval.https://arxiv.org/abs/2503.21309(2025)

  13. [13]

    Jiale Huang, Zixu Li, Zhiwei Chen, Zhiheng Fu, Chunxiao Wang, and Yupeng Hu. 2026. IMAGINE: Adaptive Schema-Imagery Enhanced Composition for Composed Video Retrieval.arXiv preprint arXiv:2606.08144(2026)

  14. [14]

    Zhiwei Chen, Yupeng Hu, Zixu Li, Zhiheng Fu, Guozhi Qiu, Weili Guan, and Liqiang Nie. 2026. EgoAdapt: A Multi-Scene Egocentric Adaptation Method for CVPR 2026 HD-EPIC VQA Challenge.arXiv preprint arXiv:2605.24500(2026)

  15. [15]

    Yuxuan Jiang and Francis Ferraro. 2026. Beyond math: Stories as a testbed for memorization-constrained reasoning in llms. InEACL. 5590–5607

  16. [16]

    Jinhe Bi, Minglai Yang, Xingcheng Zhou, Wenke Huang, Sikuan Yan, Yujun Wang, Zixuan Cao, Michael Färber, Xun Xiao, Volker Tresp, et al. 2026. EchoRL: Reinforcement Learning via Rollout Echoing.arXiv preprint arXiv:2605.31228 (2026)

  17. [17]

    Jinghan Cao, Yu Ma, Xinjin Li, Qingyang Ren, and Xiangyun Chen. 2026. Task- Specific Efficiency Analysis: When Small Language Models Outperform Large Language Models.arXiv preprint arXiv:2603.21389(2026)

  18. [18]

    Zequn Xie, Chuxin Wang, Yeqiang Wang, Sihang Cai, Shulei Wang, and Tao Jin

  19. [19]

    Chat-driven text generation and interaction for person retrieval. InEMNLP. 5259–5270

  20. [20]

    Panqi Yang, Haodong Jing, Nanning Zheng, and Yongqiang Ma. 2026. UniBVR: Balancing visual and reasoning abilities in unified 3D scene understanding.Neu- rocomputing671 (2026), 132599

  21. [21]

    Yun Song, Wenjia Zheng, Tiedan Chen, Ziyu Wang, Jiazhao Shi, and Yisong Chen

  22. [22]

    Deep neural network architectures for electrocardiogram classification: A comprehensive evaluation.arXiv preprint arXiv:2602.17701(2026)

  23. [23]

    Yang Shi, Yifeng Xie, Minzhe Guo, Liangsi Lu, Mingxuan Huang, Jingchao Wang, Zhihong Zhu, Boyan Xu, and Zhiqi Huang. 2026. MMErroR: A Benchmark for Erroneous Reasoning in Vision-Language Models.arXiv preprint arXiv:2601.03331 (2026)

  24. [24]

    Yuan Sun, Xu Wang, Dezhong Peng, Zhenwen Ren, and Xiaobo Shen. 2023. Hierarchical hashing learning for image set classification.IEEE TIP32 (2023), 1732–1744

  25. [25]

    Zong Ke, Yuqing Cao, Zhenrui Chen, Yuchen Yin, Shouchao He, and Yu Cheng

  26. [26]

    Finance Research Letters(2025), 107890

    Early warning of cryptocurrency reversal risks via multi-source data. Finance Research Letters(2025), 107890

  27. [27]

    Yang Tian, Fan Liu, Jingyuan Zhang, Yupeng Hu, Liqiang Nie, et al. 2025. CoRe- MMRAG: Cross-Source Knowledge Reconciliation for Multimodal RAG. InACL. 32967–32982

  28. [28]

    Qianyun Yang, Zhiwei Chen, Yupeng Hu, Zixu Li, Zhiheng Fu, and Liqiang Nie. 2026. STABLE: Efficient Hybrid Nearest Neighbor Search via Magnitude- Uniformity and Cardinality-Robustness.IEEE TKDE(2026)

  29. [29]

    Qianyun Yang, Peizhuo Lv, Yingjiu Li, Shengzhi Zhang, Yuxuan Chen, Zhiwei Chen, Zixu Li, and Yupeng Hu. 2026. ERASE: Bypassing Collaborative Detection of AI Counterfeit Via Comprehensive Artifacts Elimination.IEEE TDSC(March 2026), 1–18. doi:10.1109/TDSC.2026.3677794

  30. [30]

    Senmao Tian, Xiang Wei, and Shunli Zhang. 2026. Neutral-Reference Prompting for Vision-Language Models.arXiv(2026), 2605.15615

  31. [31]

    Xingfeng Li, Yinghui Sun, Quansen Sun, Zhenwen Ren, and Yuan Sun. 2023. Cross-view graph matching guided anchor alignment for incomplete multi-view clustering.Information Fusion100 (2023), 101941

  32. [32]

    Zixu Li, Yupeng Hu, Zhiwei Chen, Haokun Wen, Xuemeng Song, and Liqiang Nie. 2026. COMBINER: Composed Image Retrieval Guided by Attribute-based Neighbor Relations.IEEE TIP(2026)

  33. [33]

    Senmao Tian, Xiang Wei, and Shunli Zhang. 2026. Sampling Control for Imbal- anced Calibration in Semi-Supervised Learning. InAAAI, Vol. 40. 25914–25922

  34. [34]

    Liangsi Lu, Xuhang Chen, Minzhe Guo, Shichu Li, Jingchao Wang, and Yang Shi. 2026. Chordedit: One-step low-energy transport for image editing. InCVPR. 14398–14407

  35. [35]

    Zichao Li and Zong Ke. 2025. Domain meets typology: Predicting verb-final order from universal dependencies for financial and blockchain nlp. InSIGTYP. 156–164

  36. [36]

    Zichao Li and Zong Ke. 2025. Cross-modal augmentation for low-resource language understanding and generation. InMAGMaR. 90–99

  37. [37]

    Shilin Lu, Zilan Wang, Leyang Li, Yanzhu Liu, and Adams Wai-Kin Kong. 2024. Mace: Mass concept erasure in diffusion models. InCVPR. 6430–6440

  38. [38]

    Jinlai Zhang, Xiaolong Song, Yucheng Li, Diqing Liang, Zhiyong Zhang, and Jinhu Cai. 2026. Adaptive dual cross-attention network for multispectral object detection in autonomous driving.ESW A(2026), 132012

  39. [39]

    Zhiheng Fu, Zixu Li, Zhiwei Chen, Fangxu Liu, Yupeng Hu, Weili Guan, and Liqiang Nie. 2026. EgoAction: Egocentric Action Composition with Reliability- Aware Temporal Fusion for the EPIC-KITCHENS Action Detection Challenge at CVPR 2026.arXiv preprint arXiv:2605.24496(2026)

  40. [40]

    Panqi Yang, Haodong Jing, Nanning Zheng, and Yongqiang Ma. 2026. UniHOI: Unified Human-Object Interaction Understanding via Unified Token Space. In AAAI, Vol. 40. 11640–11648

  41. [41]

    Gengyuan Zhang, Jinhe Bi, Jindong Gu, Yanyu Chen, and Volker Tresp. 2023. SPOT! Revisiting Video-Language Models for Event Understanding.arXiv preprint arXiv:2311.12919(2023)

  42. [42]

    Xinlei Yu, Chengming Xu, Zhangquan Chen, Yudong Zhang, Shilin Lu, Cheng Yang, Jiangning Zhang, Shuicheng Yan, and Xiaobin Hu. 2025. Visual Document Understanding and Reasoning: A Multi-Agent Collaboration Framework with Agent-Wise Adaptive Test-Time Scaling.arXiv preprint arXiv:2508.03404(2025)

  43. [43]

    Zixu Li, Yupeng Hu, Zhiheng Fu, Zhiwei Chen, Weili Guan, and Liqiang Nie. 2026. R3: Composed Video Retrieval via Reasoning-Guided Recalling and Re-ranking. arXiv preprint arXiv:2606.01113(2026)

  44. [44]

    Yuxuan Jiang, Dawei Li, and Frank Ferraro. 2025. Drp: Distilled reasoning pruning with skill-aware step decomposition for efficient large reasoning models.arXiv preprint arXiv:2505.13975(2025)

  45. [45]

    Yuan Sun, Yang Qin, Yongxiang Li, Dezhong Peng, Xi Peng, and Peng Hu. 2024. Robust multi-view clustering with noisy correspondence.IEEE TKDE36, 12 (2024), 9150–9162

  46. [46]

    Zixu Li, Yupeng Hu, Zhiwei Chen, Zhiheng Fu, Xiaowei Zhu, Weili Guan, and Liqiang Nie. 2026. TempRet: Temporal Enhancement and Two-Stage Reranking for CVPR 2026 EPIC-KITCHENS-100 Multi-Instance Retrieval Challenge.arXiv preprint arXiv:2605.24470(2026)

  47. [47]

    Jinhe Bi, Danqi Yan, Yifan Wang, Wenke Huang, Haokun Chen, Guancheng Wan, Mang Ye, Xun Xiao, Hinrich Schuetze, Volker Tresp, and Yunpu Ma. 2026. The Geometry of Reasoning: Self-Evaluation via Layerwise Trajectory Evolution. In ICML. https://openreview.net/forum?id=WQyrwQwzmK

  48. [48]

    Yuxuan Jiang and Francis Ferraro. 2026. SCRIBE: Structured Mid-Level Supervi- sion for Tool-Using Language Models.arXiv preprint arXiv:2601.03555(2026)

  49. [49]

    Jiaye Lin, Yifu Guo, Yuzhen Han, Sen Hu, Ziyi Ni, Licheng Wang, Mingguang Chen, Hongzhang Liu, Ronghao Chen, Yangfan He, et al. 2025. Se-agent: Self- evolution trajectory optimization in multi-step reasoning with llm-based agents. arXiv preprint arXiv:2508.02085(2025)

  50. [50]

    Zhenyu Huang, Guocheng Niu, Xiao Liu, Wenbiao Ding, Xinyan Xiao, Hua Wu, and Xi Peng. 2021. Learning with noisy correspondence for cross-modal matching. NeurIPS34 (2021), 29406–29419

  51. [51]

    Jingjie Zhang, Lingli Tang, Jiachen Li, Jinyu Xu, Yanchun Ma, and Qing Xie

  52. [52]

    InACM MM Asia

    Robust Dual Embedding Contrastive Learning for Text-to-Image Person Re-identification with Noisy Correspondence. InACM MM Asia. 1–8

  53. [53]

    Zhiwei Chen, Yupeng Hu, Zhiheng Fu, Zixu Li, Jiale Huang, Qinlei Huang, and Yinwei Wei. 2026. INTENT: Invariance and Discrimination-aware Noise Mitiga- tion for Robust Composed Image Retrieval. InAAAI, Vol. 40. 20463–20471

  54. [54]

    Shuxian Li, Changhao He, Xiting Liu, Joey Tianyi Zhou, Xi Peng, and Peng Hu. 2025. Learning with Noisy Triplet Correspondence for Composed Image Retrieval. InCVPR. 19628–19637

  55. [55]

    Siyuan Li, Youyuan Zhang, Fangming Liu, and Jing Li. 2026. Modality-Decoupled Online Recursive Editing.arXiv preprint arXiv:2605.20273(2026)

  56. [56]

    Nam Vo, Lu Jiang, Chen Sun, Kevin Murphy, Li-Jia Li, Li Fei-Fei, and James Hays

  57. [57]

    Composing Text and Image for Image Retrieval - an Empirical Odyssey. In CVPR. IEEE, 6439–6448

  58. [58]

    Yanbei Chen, Shaogang Gong, and Loris Bazzani. 2020. Image Search With Text Feedback by Visiolinguistic Attention Learning. InCVPR. IEEE, 2998–3008

  59. [59]

    Zhiwei Chen, Yupeng Hu, Zixu Li, Zhiheng Fu, Xuemeng Song, and Liqiang Nie

  60. [60]

    Retrieval

    OFFSET: Segmentation-based Focus Shift Revision for Composed Image ICMR ’26, June 16–19, 2026, Amsterdam, Netherlands Huang et al. Retrieval. InACM MM. 6113–6122

  61. [61]

    Yida Zhao, Yuqing Song, and Qin Jin. 2022. Progressive learning for image retrieval with hybrid-modality queries. InACM SIGIR. 1012–1021

  62. [62]

    Matan Levy, Rami Ben-Ari, Nir Darshan, and Dani Lischinski. 2024. Data roaming and quality assessment for composed image retrieval. InAAAI, Vol. 38. 2991– 2999

  63. [63]

    Mingyu Zhang, Zixu Li, Zhiwei Chen, Zhiheng Fu, Xiaowei Zhu, Jiajia Nie, Yinwei Wei, and Yupeng Hu. 2026. Hint: Composed image retrieval with dual- path compositional contextualized network.arXiv preprint arXiv:2603.26341 (2026)

  64. [64]

    Qinlei Huang, Zhiwei Chen, Zixu Li, Chunxiao Wang, Xuemeng Song, Yupeng Hu, and Liqiang Nie. 2025. Median: Adaptive intermediate-grained aggregation network for composed image retrieval. InICASSP. IEEE, 1–5

  65. [65]

    Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. 2021. Learning transferable visual models from natural language supervision. InICML. PMLR, 8748–8763

  66. [66]

    Junnan Li, Dongxu Li, Caiming Xiong, and Steven Hoi. 2022. Blip: Bootstrapping language-image pre-training for unified vision-language understanding and generation. InICML. PMLR, 12888–12900

  67. [67]

    Zhiheng Fu, Zixu Li, Zhiwei Chen, Chunxiao Wang, Xuemeng Song, Yupeng Hu, and Liqiang Nie. 2025. Pair: Complementarity-guided disentanglement for composed image retrieval. InICASSP. IEEE, 1–5

  68. [68]

    Zixu Li, Yupeng Hu, Zhiwei Chen, Shiqi Zhang, Qinlei Huang, Zhiheng Fu, and Yinwei Wei. 2026. HABIT: Chrono-Synergia Robust Progressive Learning Framework for Composed Image Retrieval. InAAAI, Vol. 40. 6762–6770

  69. [69]

    Yiyang Chen, Zhedong Zheng, Wei Ji, Leigang Qu, and Tat-Seng Chua. 2024. Composed image retrieval with text feedback via multi-grained uncertainty regularization.ICLR(2024)

  70. [70]

    Zhiwei Chen, Yupeng Hu, Zixu Li, Zhiheng Fu, Haokun Wen, and Weili Guan

  71. [71]

    InACM MM

    HUD: Hierarchical Uncertainty-Aware Disambiguation Network for Com- posed Video Retrieval. InACM MM. 6143–6152

  72. [72]

    Yuxuan Jiang, Runchao Li, Shubhashis Roy Dipta, Dawei Li, and Zhao Yang. 2026. Cornerstones or Stumbling Blocks? Deciphering the Rock Tokens in On-Policy Distillation.arXiv preprint arXiv:2605.09253(2026)

  73. [73]

    Jinlai Zhang, Mingchao Xiang, Yongheng Hu, Wei Hao, Linlong Lei, and Kefu Yi. 2026. Multivariate feature learning and associative spatial information en- hancement for snow object detection in autonomous driving.EAAI175 (2026), 114672

  74. [74]

    Kailin Jiang, Hongbo Jiang, Ning Jiang, Zhi Gao, Jinhe Bi, Yuchen Ren, Bin Li, Yuntao Du, Lei Liu, and Qing Li. 2025. KORE: Enhancing Knowledge Injec- tion for Large Multimodal Models via Knowledge-Oriented Augmentations and Constraints.arXiv preprint arXiv:2510.19316(2025)

  75. [75]

    Guancheng Wan, Xiaoran Shang, Yuxin Wu, Guibin Zhang, Jinhe Bi, et al. 2025. HYPERION: Fine-Grained Hypersphere Alignment for Robust Federated Graph Learning. InNeurIPS

  76. [76]

    Ningning Xu, Yuxuan Jiang, Shubhashis Roy Dipta, and Hengyuan Zhang. 2025. Learning how to use tools, not just when: Pattern-aware tool-integrated reason- ing.arXiv preprint arXiv:2509.23292(2025)

  77. [77]

    Yang Qin, Yuan Sun, Dezhong Peng, Joey Tianyi Zhou, Xi Peng, and Peng Hu. 2023. Cross-modal active complementary learning with self-refining correspondence. NeurIPS36 (2023), 24829–24840

  78. [78]

    Yiming Zeng, Wanhao Yu, Zexin Li, Tao Ren, Yu Ma, Jinghan Cao, Xiyan Chen, and Tingting Yu. 2025. Bridging the editing gap in LLMs: FineEdit for precise and targeted text modifications.EMNLP Findings(2025), 2193–2206

  79. [79]

    Shilin Lu, Zihan Zhou, Jiayou Lu, Yuanzhi Zhu, and Adams Wai-Kin Kong. 2024. Robust watermarking using generative priors against image editing: From bench- marking to advances.arXiv preprint arXiv:2410.18775(2024)

  80. [80]

    Lai Wei, Zhiquan Tan, Chenghai Li, Jindong Wang, and Weiran Huang. 2024. Diff-erank: A novel rank-based metric for evaluating large language models. NeurIPS37 (2024), 39501–39521

Showing first 80 references.