ConeSep: Cone-based Robust Noise-Unlearning Compositional Network for Composed Image Retrieval
Pith reviewed 2026-05-10 00:43 UTC · model grok-4.3
The pith
Cone-based noise boundaries and optimal transport unlearning correct hard noise in composed image retrieval from flawed triplet annotations.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
ConeSep locates noisy triplet correspondences by first applying Geometric Fidelity Quantization to establish and estimate a cone-shaped noise boundary in embedding space. It then performs Negative Boundary Learning to construct a diagonal negative combination for each query as an explicit semantic opposite-anchor. Finally, Boundary-based Targeted Unlearning frames the correction of identified noise as an optimal transport problem that isolates and adjusts only the erroneous pairs, thereby resolving modality suppression, negative anchor deficiency, and unlearning backlash without collateral damage to clean data.
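As a rough illustration of what a cone-shaped boundary in embedding space can mean, the sketch below flags a target embedding as falling outside a cone around a query embedding when the angle between them exceeds a half-aperture. This is only one plausible reading: the function names, the 30-degree threshold, and the 2-D toy vectors are assumptions for illustration, and the paper's actual fidelity score and boundary estimate are not reproduced here.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def outside_cone(query_emb, target_emb, half_aperture_deg):
    """Return True when the target lies outside the cone around the query.

    half_aperture_deg is a hypothetical threshold; the paper estimates its
    own boundary, which this sketch does not reproduce.
    """
    # Clamp to [-1, 1] so floating-point error cannot break acos.
    cos_sim = max(-1.0, min(1.0, cosine(query_emb, target_emb)))
    angle_deg = math.degrees(math.acos(cos_sim))
    return angle_deg > half_aperture_deg

# Toy 2-D embeddings (illustrative values only).
query = [1.0, 0.0]
aligned_target = [0.9, 0.1]   # roughly 6 degrees away from the query
drifted_target = [0.5, 0.8]   # roughly 58 degrees away from the query

flags = [outside_cone(query, t, half_aperture_deg=30.0)
         for t in (aligned_target, drifted_target)]
```

With these toy values only the drifted target is flagged. A fixed angular threshold like this is the simplest possible boundary; a learned or per-query aperture would replace the constant.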
What carries the argument
The cone-shaped geometric fidelity boundary that quantizes and isolates hard noise, combined with negative diagonal anchors and optimal transport for targeted unlearning in the embedding space.
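For readers unfamiliar with the optimal transport ingredient, here is a minimal sketch of the generic Sinkhorn algorithm for entropy-regularised transport with uniform marginals. It is not the paper's formulation: the cost matrix, the regularisation strength `reg`, and the iteration count are all placeholder assumptions, and the paper's own cost and marginals are not described in this review.

```python
import math

def sinkhorn(cost, reg=0.5, n_iters=200):
    """Entropy-regularised optimal transport with uniform marginals.

    cost: list of rows (n x m) of non-negative costs.
    Returns the n x m transport plan as a list of rows.
    """
    n, m = len(cost), len(cost[0])
    a = [1.0 / n] * n   # source marginal (assumed uniform)
    b = [1.0 / m] * m   # target marginal (assumed uniform)
    kernel = [[math.exp(-c / reg) for c in row] for row in cost]
    u = [1.0] * n
    v = [1.0] * m
    for _ in range(n_iters):
        # Alternately rescale rows and columns to match the marginals.
        u = [a[i] / sum(kernel[i][j] * v[j] for j in range(m)) for i in range(n)]
        v = [b[j] / sum(kernel[i][j] * u[i] for i in range(n)) for j in range(m)]
    return [[u[i] * kernel[i][j] * v[j] for j in range(m)] for i in range(n)]

# Toy cost matrix: matching items are cheap, mismatched items are expensive.
cost = [[0.1, 1.0],
        [1.0, 0.1]]
plan = sinkhorn(cost)
row_sums = [sum(row) for row in plan]
col_sums = [sum(plan[i][j] for i in range(2)) for j in range(2)]
```

The plan concentrates mass on the cheap diagonal while still honouring both marginals, which is the property that makes transport attractive for moving only selected pairs.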
If this is right
- Composed image retrieval models can learn effectively from real-world annotations that contain mismatches between reference images, target images, and modification text.
- Hard noise cases are isolated by the boundary rather than by the small-loss assumption, which fails on them, so performance on clean triplets is preserved.
- Each query gains an explicit semantic opposite in the embedding space, reducing ambiguity in distinguishing intended modifications.
- Noise corrections occur only at the boundary without broad unlearning that erases valid correspondences.
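The small-loss point in the list above can be made concrete with a toy sketch. The loss values below are invented purely to illustrate the failure mode the abstract describes (they are not measurements), and `small_loss_split` with its threshold is a hypothetical stand-in for the selection rule used by earlier noise-learning methods.

```python
def small_loss_split(losses, threshold):
    """Classic small-loss rule: treat a sample as clean when its loss is low."""
    return ["clean" if loss < threshold else "noisy" for loss in losses]

# Toy per-triplet losses; the descriptions are the assumed ground truth.
samples = [
    ("clean triplet", 0.20),
    ("easy noise: unrelated target image", 2.10),
    ("hard noise: similar images, wrong text", 0.35),
]
predicted = small_loss_split([loss for _, loss in samples], threshold=1.0)
```

Under these assumed values the hard-noise triplet is labelled clean because its loss stays low, which is the gap a geometric boundary is meant to close.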
Where Pith is reading between the lines
- The boundary-plus-transport pattern could extend to other multimodal tasks where labels link two images or an image and text but contain localized mismatches.
- Tolerating higher noise rates during annotation might lower the expense of building large-scale retrieval datasets.
- Varying the cone angle or transport cost parameters could adapt the method to datasets with different noise distributions.
Load-bearing premise
The three identified challenges are the primary barriers to handling noisy triplets, and the cone boundary plus transport-based correction can isolate hard noise without suppressing useful signals or creating fresh errors.
What would settle it
A controlled test on FashionIQ or CIRR where ConeSep shows no accuracy gain over prior noise-learning methods when the proportion of hard noise (similar images with mismatched text) is increased.
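A controlled test of this kind needs a way to raise the hard-noise rate and a retrieval metric. The sketch below is one hedged way to set that up: `inject_hard_noise` swaps modification texts among a chosen fraction of triplets while leaving each reference/target image pair intact, and `recall_at_k` is the usual top-k hit rate. Both helpers, the triplet tuple layout, and the toy data are assumptions for illustration, not the benchmarks' official protocol.

```python
import random

def inject_hard_noise(triplets, noise_rate, seed=0):
    """Swap modification texts on a fraction of triplets, keeping image pairs.

    triplets: list of (reference_id, modification_text, target_id).
    Returns (noisy_triplets, sorted indices that were corrupted).
    """
    rng = random.Random(seed)
    n_noisy = int(round(noise_rate * len(triplets)))
    if n_noisy < 2:
        # A swap needs at least two triplets; leave the data untouched.
        return list(triplets), []
    chosen = sorted(rng.sample(range(len(triplets)), n_noisy))
    texts = [triplets[i][1] for i in chosen]
    # Rotate by one so every chosen triplet receives another triplet's text.
    rotated = texts[1:] + texts[:1]
    noisy = list(triplets)
    for idx, new_text in zip(chosen, rotated):
        ref, _, tgt = triplets[idx]
        noisy[idx] = (ref, new_text, tgt)
    return noisy, chosen

def recall_at_k(ranked_lists, targets, k):
    """Fraction of queries whose target id appears in the top-k ranked ids."""
    hits = sum(1 for ranked, tgt in zip(ranked_lists, targets) if tgt in ranked[:k])
    return hits / len(targets)

# Toy triplets with distinct texts (illustrative only).
triplets = [("r0", "make it red", "t0"), ("r1", "add sleeves", "t1"),
            ("r2", "shorter hem", "t2"), ("r3", "remove logo", "t3")]
noisy, corrupted = inject_hard_noise(triplets, noise_rate=0.5)
score = recall_at_k([["a", "b"], ["c", "d"]], ["b", "x"], k=2)
```

Sweeping `noise_rate` and comparing Recall@K between methods at each level would give the curve this section asks for. The rotation assumes the chosen texts are distinct; duplicated texts would need an extra check.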
Original abstract
The Composed Image Retrieval (CIR) task provides a flexible retrieval paradigm via a reference image and modification text, but it heavily relies on expensive and error-prone triplet annotations. This paper systematically investigates the Noisy Triplet Correspondence (NTC) problem introduced by annotations. We find that NTC noise, particularly "hard noise" (i.e., the reference and target images are highly similar but the modification text is incorrect), poses a unique challenge to existing Noise Correspondence Learning (NCL) methods because it breaks the traditional "small loss hypothesis". We identify and elucidate three key, yet overlooked, challenges in the NTC task, namely (C1) Modality Suppression, (C2) Negative Anchor Deficiency, and (C3) Unlearning Backlash. To address these challenges, we propose a Cone-based robuSt noisE-unlearning comPositional network (ConeSep). Specifically, we first propose Geometric Fidelity Quantization, theoretically establishing and practically estimating a noise boundary to precisely locate noisy correspondence. Next, we introduce Negative Boundary Learning, which learns a "diagonal negative combination" for each query as its explicit semantic opposite-anchor in the embedding space. Finally, we design Boundary-based Targeted Unlearning, which models the noisy correction process as an optimal transport problem, elegantly avoiding Unlearning Backlash. Extensive experiments on benchmark datasets (FashionIQ and CIRR) demonstrate that ConeSep significantly outperforms current state-of-the-art methods, which fully demonstrates the effectiveness and robustness of our method.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript addresses the Noisy Triplet Correspondence (NTC) problem in Composed Image Retrieval (CIR), where hard noise (highly similar reference/target images with incorrect modification text) breaks the small-loss hypothesis used by existing Noise Correspondence Learning methods. It identifies three overlooked challenges—C1 Modality Suppression, C2 Negative Anchor Deficiency, and C3 Unlearning Backlash—and proposes ConeSep, a compositional network with three components: Geometric Fidelity Quantization (theoretically establishing and estimating a cone-based noise boundary to locate noisy correspondences), Negative Boundary Learning (learning explicit 'diagonal negative combination' anchors), and Boundary-based Targeted Unlearning (framing correction as an optimal transport problem to avoid backlash). Experiments on FashionIQ and CIRR are reported to show significant outperformance over SOTA methods.
Significance. If the empirical gains and component contributions hold under scrutiny, the work offers a principled, geometrically motivated solution to a practical annotation-noise issue in multimodal retrieval. Explicitly targeting hard noise and unlearning backlash via cone boundaries and optimal transport is a clear strength, as is the focus on supplying negative anchors where prior NCL methods are deficient. The approach could influence robust learning pipelines for other triplet-based tasks if the boundary estimation and transport formulation prove stable across datasets.
major comments (2)
- [§3.1] §3.1 (Geometric Fidelity Quantization): The theoretical establishment of the cone-based noise boundary is central to locating hard noise without false positives, yet the manuscript does not provide a derivation showing why the chosen cone aperture isolates incorrect modification text while preserving useful signals; without this, the claim that it 'precisely locate[s] noisy correspondence' remains under-supported relative to the performance gains asserted in §5.
- [§4.3] §4.3 (Boundary-based Targeted Unlearning): The optimal-transport formulation is presented as elegantly avoiding Unlearning Backlash, but the paper must demonstrate (via ablation or bound) that the transport plan does not inadvertently suppress modality-specific features (C1) or create new negative-anchor deficiencies; this is load-bearing because the central claim is robustness to all three challenges simultaneously.
minor comments (3)
- [§5] The abstract and §1 claim 'significantly outperforms' SOTA without quoting the exact margins or listing the baselines; §5 tables should include per-metric deltas and statistical significance tests for reproducibility.
- [§3.2] Notation for the 'diagonal negative combination' in Negative Boundary Learning is introduced without an explicit equation or embedding-space diagram; a small illustrative figure would clarify how it differs from standard negative sampling.
- The manuscript should add a limitations paragraph discussing failure cases (e.g., when the estimated cone boundary misclassifies clean but ambiguous triplets) to balance the robustness claims.
Simulated Author's Rebuttal
We are grateful to the referee for the thorough and constructive review of our manuscript. The comments identify key areas where additional clarification and validation will strengthen the presentation of the theoretical and empirical contributions. We address each major comment point by point below and will revise the manuscript accordingly.
- Referee: [§3.1] §3.1 (Geometric Fidelity Quantization): The theoretical establishment of the cone-based noise boundary is central to locating hard noise without false positives, yet the manuscript does not provide a derivation showing why the chosen cone aperture isolates incorrect modification text while preserving useful signals; without this, the claim that it 'precisely locate[s] noisy correspondence' remains under-supported relative to the performance gains asserted in §5.
  Authors: We thank the referee for highlighting this aspect of §3.1. The Geometric Fidelity Quantization is motivated by the geometry of the joint embedding space, where the cone aperture is selected to capture the region of high reference-target image similarity that is inconsistent with the modification text. We agree that an explicit derivation would provide stronger support for the isolation property. In the revised manuscript, we will add a detailed step-by-step derivation in §3.1 that formally shows how the aperture threshold separates hard noise from valid correspondences while preserving useful cross-modal signals, directly addressing the under-supported claim.
  Revision: yes
- Referee: [§4.3] §4.3 (Boundary-based Targeted Unlearning): The optimal-transport formulation is presented as elegantly avoiding Unlearning Backlash, but the paper must demonstrate (via ablation or bound) that the transport plan does not inadvertently suppress modality-specific features (C1) or create new negative-anchor deficiencies; this is load-bearing because the central claim is robustness to all three challenges simultaneously.
  Authors: We appreciate the referee's emphasis on verifying the side effects of the optimal-transport formulation in §4.3. The existing experiments in §5, including component ablations, show that ConeSep simultaneously mitigates C1, C2, and C3 through overall performance improvements on FashionIQ and CIRR. Nevertheless, we agree that targeted validation is warranted. In the revision, we will incorporate additional ablations that explicitly measure modality-specific feature preservation (e.g., via separate image and text similarity metrics before and after transport) and negative-anchor quality to confirm that the transport plan does not reintroduce deficiencies related to C1 or C2.
  Revision: yes
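One hedged way to read the promised "separate image and text similarity metrics before and after transport" is a per-modality drift score: the mean cosine similarity between each embedding and its post-correction counterpart. The helper name `mean_self_similarity` and the toy vectors below are assumptions for illustration; the authors' actual ablation design is not specified in the rebuttal.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def mean_self_similarity(before, after):
    """Average cosine similarity between each embedding and its updated version."""
    sims = [cosine(b, a) for b, a in zip(before, after)]
    return sum(sims) / len(sims)

# Toy embeddings (illustrative only): images are untouched by the correction,
# while the first text embedding is rotated by 90 degrees.
image_before = [[1.0, 0.0], [0.0, 1.0]]
image_after = [[1.0, 0.0], [0.0, 1.0]]
text_before = [[1.0, 0.0], [0.0, 1.0]]
text_after = [[0.0, 1.0], [0.0, 1.0]]

image_score = mean_self_similarity(image_before, image_after)
text_score = mean_self_similarity(text_before, text_after)
```

A score near 1 for one modality and a much lower score for the other would suggest that the correction step moves that second modality disproportionately, which is the kind of side effect the referee asks about.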
Circularity Check
No significant circularity detected
The paper identifies three challenges in noisy triplet correspondence for composed image retrieval and proposes three targeted components (Geometric Fidelity Quantization to establish a noise boundary, Negative Boundary Learning for explicit negative anchors, and Boundary-based Targeted Unlearning modeled as optimal transport). No equations, self-citations, or fitted parameters are shown that reduce any claimed prediction or result to the inputs by construction. The derivation chain is presented as a direct response to the stated challenges without self-definitional loops, renamed known results, or load-bearing reliance on prior author work. Experiments on external benchmarks (FashionIQ, CIRR) provide independent validation, making the central claims self-contained.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: NTC noise, particularly hard noise, breaks the traditional small loss hypothesis used in noise correspondence learning.
invented entities (1)
- ConeSep network: no independent evidence
Forward citations
Cited by 1 Pith paper
- HotComment: A Benchmark for Evaluating Popularity of Online Comments
  HotComment is a new multimodal benchmark that quantifies online comment popularity via content quality assessment, interaction-based prediction, and agent-simulated user engagement, accompanied by the StyleCmt stylist...