Pith · machine review for the scientific record

arxiv: 2604.08598 · v2 · submitted 2026-04-07 · 💻 cs.IR · cs.CV

Recognition: 2 Lean theorem links

Pretrain-then-Adapt: Uncertainty-Aware Test-Time Adaptation for Text-based Person Search

Authors on Pith: no claims yet

Pith reviewed 2026-05-10 19:09 UTC · model grok-4.3

classification 💻 cs.IR cs.CV
keywords text-based person search · test-time adaptation · uncertainty estimation · domain shift · bidirectional retrieval · cross-modal retrieval · pretrain-then-adapt · unlabeled adaptation

The pith

A pretrain-then-adapt approach uses bidirectional retrieval disagreements to estimate uncertainty and recalibrate text-based person search models on unlabeled test data alone.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Text-based person search models are usually pretrained on synthetic data and then fine-tuned on labeled real-world sets, but privacy rules and annotation costs make large-scale target-domain labels unavailable in practice. The paper replaces this with a pretrain-then-adapt paradigm that performs offline test-time adaptation using only unlabeled test samples and minimal extra compute. It introduces an uncertainty signal based on whether an image-text pair ranks highly in both image-to-text and text-to-image retrieval; high mutual ranking signals low uncertainty and strong alignment. Recalibration driven by this signal reduces domain shift and raises retrieval accuracy on four standard benchmarks for both one-stage and two-stage backbones.

Core claim

The central claim is that a bidirectional retrieval disagreement mechanism supplies a usable proxy for uncertainty in cross-modal person search. Pairs that rank highly in both retrieval directions receive low uncertainty and are treated as reliable; the rest drive model recalibration in an offline test-time step. This process, applied after pretraining, mitigates domain shift without any target labels and yields consistent gains across CLIP-based and XVLM-based frameworks on CUHK-PEDES, ICFG-PEDES, RSTPReid, and PAB.

What carries the argument

The Uncertainty-Aware Test-Time Adaptation (UATTA) framework, whose core component is the bidirectional retrieval disagreement score that labels image-text pairs as low- or high-uncertainty to guide label-free recalibration.
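A minimal sketch of how such a bidirectional agreement score could be computed from a similarity matrix. This is illustrative only: the function name, the similarity-matrix layout, and the top-K threshold are assumptions, not the paper's implementation (the paper's Figure 4 ablates K on RSTPReid).

```python
import numpy as np

def mutual_topk_pairs(sim, k=3):
    """Select image-text pairs that appear in each other's top-K in both
    retrieval directions of a similarity matrix (rows: images, cols: texts).

    Returns a list of (image_index, text_index) low-uncertainty pairs.
    """
    # For each image, indices of its top-K texts (image-to-text retrieval).
    i2t = np.argsort(-sim, axis=1)[:, :k]
    # For each text, indices of its top-K images (text-to-image retrieval).
    t2i = np.argsort(-sim, axis=0)[:k, :]

    pairs = []
    for i in range(sim.shape[0]):
        for t in i2t[i]:
            # Mutual agreement: the image is also in that text's top-K.
            if i in t2i[:, t]:
                pairs.append((i, int(t)))
    return pairs
```

Pairs that fail this mutual check would receive high uncertainty under the paper's indicator; how exactly each group feeds the recalibration loss is not specified in the material above.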

If this is right

  • The method removes dependence on large-scale target-domain labels for practical deployment.
  • Adaptation incurs only minimal post-training time cost while still handling domain shift.
  • Performance gains hold for both single-stage CLIP-style and two-stage XVLM-style retrieval pipelines.
  • The same offline procedure establishes a new baseline for label-efficient person search systems.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same disagreement-based uncertainty signal could serve as a lightweight proxy in other cross-modal retrieval settings where labeled target data is scarce.
  • If the recalibration step proves robust, future systems might rely less on expensive synthetic pretraining corpora.
  • Applying the method to streaming test data rather than a fixed offline batch would test whether continuous domain drift can be tracked without labels.

Load-bearing premise

That disagreement between the two retrieval directions reliably indicates uncertainty, and that recalibration driven by this signal improves alignment without introducing new errors.

What would settle it

The claim would be undermined if, on any of the four benchmarks, the adapted model after UATTA recalibration showed lower top-1 or mAP retrieval accuracy than the same pretrained model left unchanged.
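That check reduces to comparing a standard retrieval metric before and after adaptation on the same queries. A hedged sketch of rank-1 accuracy from a query-by-gallery similarity matrix (the metric is standard; the function and variable names are mine):

```python
import numpy as np

def rank1_accuracy(sim, gt):
    """Rank-1 accuracy: fraction of queries whose top-ranked gallery
    item is the ground-truth match.

    sim: (num_queries, num_gallery) similarity matrix.
    gt:  ground-truth gallery index for each query.
    """
    top1 = np.argmax(sim, axis=1)
    return float(np.mean(top1 == np.asarray(gt)))
```

The falsification condition would then read: for the same ground truth, `rank1_accuracy(sim_adapted, gt) < rank1_accuracy(sim_pretrained, gt)` on any benchmark.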

Figures

Figures reproduced from arXiv: 2604.08598 by Jiahao Zhang, Shaofei Huang, Yaxiong Wang, Zhedong Zheng.

Figure 1. Accuracy vs. efficiency trade-off on the PAB benchmark. view at source ↗
Figure 2. Statistical overview of the uncertainty indicator. view at source ↗
Figure 3. Uncertainty-aware Test-Time Adaptation framework (UATTA), given the image gallery set. view at source ↗
Figure 4. Ablation study of bidirectional top-K retrieval consistent sample selection on RSTPReid. K denotes the mutual top range in bidirectional retrieval. Best performance is achieved at K = 3. Since each identity in RSTPReid contains 5 ground-truth images, K = 5 is adopted as the default setting to represent the borderline of true and false positives. view at source ↗
Figure 5. Top-5 text-based person search results on RSTPReid and PAB, with the similarity score of each retrieved image reported below the corresponding result. Correctly matched person images are highlighted with green bounding boxes; false matches are indicated in red. view at source ↗
Figure 6. t-SNE visualization of feature space shifts on RSTPReid. view at source ↗
Original abstract

Text-based person search faces inherent limitations due to data scarcity, driven by stringent privacy constraints and the high cost of manual annotation. To mitigate this, existing methods usually rely on a Pretrain-then-Finetune paradigm, where models are first pretrained on synthetic person-caption data to establish cross-modal alignment, followed by fine-tuning on labeled real-world datasets. However, this paradigm lacks practicality in real-world deployment scenarios, where large-scale annotated target-domain data is typically inaccessible. In this work, we propose a new Pretrain-then-Adapt paradigm that eliminates reliance on extensive target-domain supervision through an offline test-time adaptation manner, enabling dynamic model adaptation using only unlabeled test data with minimal post-train time cost. To mitigate overconfidence with false positives of previous entropy-based test-time adaptation, we propose an Uncertainty-Aware Test-Time Adaptation (UATTA) framework, which introduces a bidirectional retrieval disagreement mechanism to estimate uncertainty, i.e., low uncertainty is assigned when an image-text pair ranks highly in both image-to-text and text-to-image retrieval, indicating high alignment; otherwise, high uncertainty is detected. This indicator drives offline test-time model recalibration without labels, effectively mitigating domain shift. We validate UATTA on four benchmarks, i.e., CUHK-PEDES, ICFG-PEDES, RSTPReid, and PAB, showing consistent improvements across both CLIP-based (one-stage) and XVLM-based (two-stage) frameworks. Ablation studies confirm that UATTA outperforms existing offline test-time adaptation strategies, establishing a new benchmark for label-efficient, deployable person search systems. Our code is available at https://github.com/nkuzjh/UATTA.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 1 minor

Summary. The paper proposes a Pretrain-then-Adapt paradigm for text-based person search that replaces labeled fine-tuning with offline test-time adaptation on unlabeled target data. It introduces an Uncertainty-Aware Test-Time Adaptation (UATTA) framework whose core component is a bidirectional retrieval disagreement mechanism: an image-text pair is assigned low uncertainty (and used for recalibration) when it ranks highly in both image-to-text and text-to-image retrieval. The method is evaluated on CUHK-PEDES, ICFG-PEDES, RSTPReid and PAB, reporting consistent gains for both one-stage (CLIP) and two-stage (XVLM) backbones, with ablations against prior offline TTA baselines and public code release.

Significance. If the central claim holds, the work addresses a practically important gap: privacy-constrained deployment of cross-modal retrieval where target labels are unavailable. The offline, low-cost adaptation and evaluation across two model families and four benchmarks are positive features; public code further supports reproducibility.

major comments (1)
  1. The bidirectional high-rank agreement mechanism (described in the UATTA framework) implicitly assumes that mutual top ranking is diagnostic of correct cross-modal alignment. Under domain shift this can fail if the pretrained model shares the same systematic error in both retrieval directions, causing incorrect pairs to receive low uncertainty scores and be used for recalibration. The manuscript provides no error analysis, confusion-matrix breakdown, or verification that the uncertainty signal improves alignment rather than reinforcing noise; this assumption is load-bearing for the claim that UATTA safely mitigates shift without labels or new errors.
minor comments (1)
  1. Abstract states 'consistent improvements' and 'outperforms existing strategies' but supplies no numerical deltas, tables, or metrics; adding at least headline numbers would improve readability.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the detailed and constructive review. The major comment raises an important point about the core assumption in our bidirectional retrieval disagreement mechanism. We address it point-by-point below and commit to strengthening the manuscript with additional analysis.

Point-by-point responses
  1. Referee: The bidirectional high-rank agreement mechanism (described in the UATTA framework) implicitly assumes that mutual top ranking is diagnostic of correct cross-modal alignment. Under domain shift this can fail if the pretrained model shares the same systematic error in both retrieval directions, causing incorrect pairs to receive low uncertainty scores and be used for recalibration. The manuscript provides no error analysis, confusion-matrix breakdown, or verification that the uncertainty signal improves alignment rather than reinforcing noise; this assumption is load-bearing for the claim that UATTA safely mitigates shift without labels or new errors.

    Authors: We agree that the manuscript currently lacks a direct error analysis or verification of the selected low-uncertainty pairs. While the empirical gains on four benchmarks and two backbones (CLIP and XVLM) suggest the mechanism is effective in practice, we acknowledge this does not fully rule out reinforcement of systematic errors. In the revision we will add: (1) post-hoc precision of the top-ranked image-text pairs used for recalibration (computed against ground-truth labels available in the test sets), (2) a breakdown showing how uncertainty scores correlate with alignment quality, and (3) an ablation comparing adaptation performance when using only high-agreement pairs versus random or entropy-based selection. These additions will directly address whether the signal improves alignment or risks reinforcing noise. revision: yes
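The promised post-hoc precision check is simple to sketch: given the pairs selected as low-uncertainty and the test set's ground-truth labels, measure what fraction of selected pairs are genuine matches. Function and argument names here are hypothetical, not from the paper:

```python
def selection_precision(selected_pairs, true_matches):
    """Fraction of selected low-uncertainty (image, text) pairs that are
    genuine matches according to held-out ground-truth labels.

    selected_pairs: iterable of (image_id, text_id) chosen for recalibration.
    true_matches:   set of ground-truth (image_id, text_id) pairs.
    """
    selected = list(selected_pairs)
    if not selected:
        return 0.0
    hits = sum(1 for pair in selected if pair in true_matches)
    return hits / len(selected)
```

A low value of this precision on any benchmark would substantiate the referee's concern that mutual top ranking can reward shared systematic errors.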

Circularity Check

0 steps flagged

No significant circularity detected

full rationale

The derivation chain introduces a bidirectional retrieval disagreement heuristic to estimate uncertainty for offline test-time recalibration on unlabeled data. This proxy is computed from the model's current rankings and used to drive adaptation, with performance then measured against external ground-truth benchmarks (CUHK-PEDES, ICFG-PEDES, RSTPReid, PAB). No step reduces a claimed prediction or result to its own inputs by definition, no fitted parameter is relabeled as a prediction, and no load-bearing premise rests solely on self-citation. The method is a proposed heuristic validated empirically rather than a self-referential construction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the domain assumption that mutual high ranking in bidirectional retrieval indicates reliable alignment suitable for label-free recalibration. No free parameters or invented entities are explicitly introduced in the abstract.

axioms (1)
  • domain assumption Pretrained cross-modal models can be recalibrated at test time using uncertainty estimates derived from retrieval rankings without target-domain labels.
    This is the load-bearing premise enabling the Pretrain-then-Adapt paradigm and UATTA mechanism.

pith-pipeline@v0.9.0 · 5619 in / 1342 out tokens · 75243 ms · 2026-05-10T19:09:53.286978+00:00 · methodology


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

