arxiv: 2605.04531 · v1 · submitted 2026-05-06 · 💻 cs.CV

Recognition: unknown

Reward-Guided Semantic Evolution for Test-time Adaptive Object Detection

Lihua Zhou , Mao Ye , Xiatian Zhu , Nianxin Li , Changyi Ma , Shuaifeng Li , Yitong Qin , Hongbin Liu

show 2 more authors

Jiebo Luo Zhen Lei

Authors on Pith no claims yet

Pith reviewed 2026-05-08 17:08 UTC · model grok-4.3

classification 💻 cs.CV

keywords open-vocabulary object detectiontest-time adaptationvision-language modelssemantic misalignmenttraining-free adaptationevolutionary searchGrounding DINO

0 comments

The pith

RGSE refines text embeddings at test time through reward-guided perturbations to correct semantic misalignment in open-vocabulary object detection without any training.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces Reward-Guided Semantic Evolution (RGSE) to adapt open-vocabulary detectors like Grounding DINO when test images come from a shifted distribution. It generates perturbed variants of the original text embeddings, scores each variant by its cosine similarity to high-confidence visual region proposals drawn from the current image and from earlier ones, and produces a refined embedding as a weighted average of the variants according to those scores. This process runs entirely at inference time with no backpropagation, model updates, or stored external memory. A sympathetic reader would care because it supplies a direct, low-cost way to restore alignment between text and vision embeddings whenever real-world data drifts.

Core claim

RGSE treats text embedding adaptation as a semantic search process: it perturbs text embeddings as candidate variants, evaluates them via cosine similarity with current and historical high-confidence visual proposals as a reward signal, and fuses them into a refined embedding through reward-weighted averaging.

What carries the argument

Reward-guided semantic evolution that perturbs text embeddings, scores variants by cosine similarity to high-confidence visual proposals, and fuses them by reward-weighted averaging.

If this is right

Achieves state-of-the-art detection accuracy across multiple benchmarks under test-time distribution shifts.
Adds only minimal computational overhead relative to standard forward passes.
Bypasses both backpropagation-based adaptation and external-memory methods used in prior work.
Directly realigns text and vision embeddings in a fully training-free manner.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same perturbation-plus-reward mechanism could be applied to other vision-language tasks such as open-vocabulary segmentation or captioning where embedding drift occurs at test time.
Historical proposals already collected during a session might make the method especially stable for video or streaming detection.
Refining the perturbation distribution or the number of candidates evaluated could further reduce the already-low overhead.
The approach suggests evolutionary search in embedding space as a general lightweight substitute for gradient-based test-time adaptation.

Load-bearing premise

Cosine similarity between perturbed text embeddings and high-confidence visual proposals provides a reliable signal of better semantic alignment.

What would settle it

A benchmark run in which RGSE produces lower average precision than the original unadapted model on a dataset known to contain distribution shift, especially if the reward scores fail to track actual detection improvements.

Figures

Figures reproduced from arXiv: 2605.04531 by Changyi Ma, Hongbin Liu, Jiebo Luo, Lihua Zhou, Mao Ye, Nianxin Li, Shuaifeng Li, Xiatian Zhu, Yitong Qin, Zhen Lei.

**Figure 1.** Figure 1: Comparison between previous methods and RGSE. (a) Previous view at source ↗

**Figure 2.** Figure 2: (1) Perturbation: We generate multiple candidate text view at source ↗

**Figure 2.** Figure 2: Overview of Reward-Guided Semantic Evolution (RGSE). Given a test image, we first obtain the initial outputs from Grounding DINO: region view at source ↗

**Figure 3.** Figure 3: Hyperparameter sensitivity on PASCAL-C. RGSE shows stable performance across a broad range of view at source ↗

**Figure 4.** Figure 4: Qualitative results on PASCAL-C-Brit, PASCAL-C-Contrast, PASCAL-C-GaussNoise, and FoggyCityscapes (Swin-T). Green, red and blue boxes view at source ↗

**Figure 5.** Figure 5: t-SNE visualization of text embedding trajectories during RGSE view at source ↗

read the original abstract

Open-vocabulary object detection with vision-language models (VLMs) such as Grounding DINO suffers from performance degradation under test-time distribution shifts, primarily due to semantic misalignment between text embeddings and shifted visual embeddings of region proposals. While recent test-time adaptive object detection methods for VLM-based either rely on costly backpropagation or bypass semantic misalignment via external memory, none directly and efficiently align text and vision in a training-free manner. To address this, we propose Reward-Guided Semantic Evolution (RGSE), a training-free framework that directly refines the text embeddings at test time. Inspired by evolutionary search, RGSE treats text embedding adaptation as a semantic search process: it perturbs text embeddings as candidate variants, evaluates them via cosine similarity with current and historical high-confidence visual proposals as a reward signal, and fuses them into a refined embedding through reward-weighted averaging. Without any backpropagation, RGSE achieves state-of-the-art performance across multiple detection benchmarks while adding minimal computational overhead. Our code will be open source upon publication.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

RGSE gives a training-free evolutionary tweak to text embeddings for test-time detection, but the reward from proposal similarity looks shaky when the base detector is already off under shift.

read the letter

The main thing to know is that this paper introduces Reward-Guided Semantic Evolution to refine text embeddings at test time for open-vocabulary detectors like Grounding DINO. It perturbs the embeddings to create variants, scores them by cosine similarity to high-confidence visual proposals from the current frame and a historical buffer, then fuses the best ones via reward-weighted averaging. No backprop, no external memory banks, just this search process to reduce semantic misalignment caused by distribution shifts.

Referee Report

2 major / 1 minor

Summary. The paper introduces Reward-Guided Semantic Evolution (RGSE), a training-free test-time adaptation framework for open-vocabulary object detection with VLMs such as Grounding DINO. Text embeddings are perturbed to generate candidate variants; each variant is scored by cosine similarity to high-confidence region proposals drawn from the current input and a historical buffer; the scores serve as rewards to compute a refined embedding via weighted averaging. The authors claim this process corrects semantic misalignment, yields state-of-the-art results on multiple detection benchmarks, and incurs only minimal computational overhead without any back-propagation or parameter updates.

Significance. If the empirical claims hold and the reward signal proves robust, RGSE would constitute a lightweight, training-free alternative to existing test-time adaptation techniques that rely on optimization or external memory banks. The emphasis on direct semantic alignment via evolutionary search and the commitment to open-sourcing code are positive contributions to reproducibility in the field.

major comments (2)

[Method (reward signal definition)] The central claim that cosine similarity to high-confidence visual proposals supplies a reliable reward signal rests on the assumption that the base detector's proposals remain sufficiently accurate under distribution shift. The manuscript provides no analysis or ablation of proposal quality (e.g., precision of high-confidence boxes before versus after adaptation) or of how the historical buffer accumulates usable signal before the reward collapses. This assumption is load-bearing for the assertion that RGSE corrects misalignment without training.
[Experiments] The SOTA performance claims require explicit ablations isolating the contribution of reward-weighted averaging, historical buffer size, perturbation variance, and the high-confidence threshold. Without these controls, it is impossible to determine whether observed gains stem from the proposed mechanism or from other implementation choices.

minor comments (1)

[Abstract] The abstract refers to 'multiple detection benchmarks' without naming them; the introduction or experimental section should list the exact datasets and metrics used.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the thoughtful and constructive comments. We have carefully considered each point and provide detailed responses below, along with plans for revisions to strengthen the manuscript.

read point-by-point responses

Referee: [Method (reward signal definition)] The central claim that cosine similarity to high-confidence visual proposals supplies a reliable reward signal rests on the assumption that the base detector's proposals remain sufficiently accurate under distribution shift. The manuscript provides no analysis or ablation of proposal quality (e.g., precision of high-confidence boxes before versus after adaptation) or of how the historical buffer accumulates usable signal before the reward collapses. This assumption is load-bearing for the assertion that RGSE corrects misalignment without training.

Authors: We acknowledge the importance of validating the reward signal's reliability. While the current manuscript demonstrates performance improvements through the overall framework, we agree that explicit analysis of proposal quality would provide stronger support. In the revised manuscript, we will add a new subsection with ablations on proposal precision (e.g., comparing IoU or classification accuracy of high-confidence boxes pre- and post-adaptation) across benchmarks. We will also include plots showing the evolution of average reward scores over test sequences to illustrate that the historical buffer maintains usable signal without collapse, supporting the training-free adaptation claim. revision: yes
Referee: [Experiments] The SOTA performance claims require explicit ablations isolating the contribution of reward-weighted averaging, historical buffer size, perturbation variance, and the high-confidence threshold. Without these controls, it is impossible to determine whether observed gains stem from the proposed mechanism or from other implementation choices.

Authors: We agree that isolating the contributions of each component is essential for rigorous validation of the SOTA claims. The original manuscript includes some component studies, but to fully address this, we will expand the experimental section with dedicated ablations: (1) comparing reward-weighted averaging against uniform or no averaging, (2) varying historical buffer sizes and reporting performance curves, (3) sweeping perturbation variances and their impact on adaptation, and (4) ablating the high-confidence threshold with corresponding results. These additions will clarify that the gains arise from the RGSE mechanism. revision: yes

Circularity Check

0 steps flagged

No significant circularity in RGSE derivation chain

full rationale

The paper defines the reward signal explicitly as cosine similarity between perturbed text embeddings and independent high-confidence visual proposals produced by the base detector (Grounding DINO). This signal is computed from external visual data rather than being defined in terms of the target detection performance or the refined embeddings themselves. The subsequent reward-weighted averaging is a direct, non-iterative fusion step with no fitted parameters or self-referential loops. No equations, self-citations, or uniqueness theorems are invoked in the provided description to justify the core process, and the method remains self-contained against external benchmarks without reducing any claimed prediction to its own inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Only the abstract is available, so no specific free parameters, axioms, or invented entities are detailed in the provided text.

pith-pipeline@v0.9.0 · 5507 in / 1111 out tokens · 24752 ms · 2026-05-08T17:08:53.923719+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

60 extracted references · 7 canonical work pages

[1]

Learning transferable visual models from natural language supervision,

A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry, A. Askell, P. Mishkin, J. Clarket al., “Learning transferable visual models from natural language supervision,” inInternational conference on machine learning. PmLR, 2021, pp. 8748–8763

2021
[2]

Grounding dino: Marrying dino with grounded pre-training for open-set object detection,

S. Liu, Z. Zeng, T. Ren, F. Li, H. Zhang, J. Yang, Q. Jiang, C. Li, J. Yang, H. Suet al., “Grounding dino: Marrying dino with grounded pre-training for open-set object detection,” inEuropean Conference on Computer Vision. Springer, 2024, pp. 38–55

2024
[3]

Generalizing vision-language models to novel domains: A comprehensive survey,

X. Li, J. Li, F. Li, L. Zhu, Y . Yang, and H. T. Shen, “Generalizing vision-language models to novel domains: A comprehensive survey,” arXiv preprint arXiv:2506.18504, 2025

work page arXiv 2025
[4]

A survey on transfer learning,

S. J. Pan and Q. Yang, “A survey on transfer learning,”IEEE Trans- actions on knowledge and data engineering, vol. 22, no. 10, pp. 1345– 1359, 2009

2009
[5]

Dual domain- attribute learning framework with asynchronous adapters for continual test-time adaptation,

Y . Tian, K. Li, T. He, L. Wan, P.-A. Heng, and W. Feng, “Dual domain- attribute learning framework with asynchronous adapters for continual test-time adaptation,”IEEE Transactions on Image Processing, 2026

2026
[6]

Consistent assistant domains transformer for source-free domain adaptation,

R. Shao, W. Zhang, K. Luo, Q. Li, and J. Wang, “Consistent assistant domains transformer for source-free domain adaptation,”IEEE Trans- actions on Image Processing, 2025

2025
[7]

Deep label propagation with nuclear norm maximization for visual domain adaptation,

W. Wang, H. Li, C. Wang, C. Huang, Z. Ding, F. Nie, and X. Cao, “Deep label propagation with nuclear norm maximization for visual domain adaptation,”IEEE Transactions on Image Processing, vol. 34, pp. 1246– 1258, 2025

2025
[8]

Adaptive dispersal and collaborative clustering for few-shot unsupervised domain adaptation,

Y . Lu, H. Huang, W. K. Wong, X. Hu, Z. Lai, and X. Li, “Adaptive dispersal and collaborative clustering for few-shot unsupervised domain adaptation,”IEEE Transactions on Image Processing, 2025

2025
[9]

A comprehensive survey on test-time adaptation under distribution shifts,

J. Liang, R. He, and T. Tan, “A comprehensive survey on test-time adaptation under distribution shifts,”International Journal of Computer Vision, vol. 133, no. 1, pp. 31–64, 2025

2025
[10]

Towards efficient test time adaptation with hierarchical distribution alignment,

Y . Liu, C. Huang, Y . Xu, X. Cao, and J. Wang, “Towards efficient test time adaptation with hierarchical distribution alignment,”IEEE Transactions on Image Processing, 2025

2025
[11]

A3-tta: Adaptive anchor alignment test-time adaptation for image segmentation,

J. Wu, X. Luo, Y . Zhou, L. Wu, G. Wang, and S. Zhang, “A3-tta: Adaptive anchor alignment test-time adaptation for image segmentation,” IEEE Transactions on Image Processing, vol. 34, pp. 8511–8522, 2025

2025
[12]

Test-time prompt tuning for zero-shot generalization in vision- language models,

M. Shu, W. Nie, D.-A. Huang, Z. Yu, T. Goldstein, A. Anandkumar, and C. Xiao, “Test-time prompt tuning for zero-shot generalization in vision- language models,” inProceedings of the 36th International Conference on Neural Information Processing Systems, 2022, pp. 14 274–14 289

2022
[13]

Diverse data augmentation with diffusions for effective test-time prompt tuning,

C.-M. Feng, K. Yu, Y . Liu, S. Khan, and W. Zuo, “Diverse data augmentation with diffusions for effective test-time prompt tuning,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2023, pp. 2704–2714

2023
[14]

Efficient test-time adaptation of vision-language models,

A. Karmanov, D. Guan, S. Lu, A. El Saddik, and E. Xing, “Efficient test-time adaptation of vision-language models,” inProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024, pp. 14 162–14 171

2024
[15]

R-tpt: Improving adversarial robustness of vision-language models through test-time prompt tuning,

L. Sheng, J. Liang, Z. Wang, and R. He, “R-tpt: Improving adversarial robustness of vision-language models through test-time prompt tuning,” inProceedings of the Computer Vision and Pattern Recognition Confer- ence, 2025, pp. 29 958–29 967

2025
[16]

O- tpt: Orthogonality constraints for calibrating test-time prompt tuning in vision-language models,

A. Sharifdeen, M. A. Munir, S. Baliah, S. Khan, and M. H. Khan, “O- tpt: Orthogonality constraints for calibrating test-time prompt tuning in vision-language models,” inProceedings of the Computer Vision and Pattern Recognition Conference, 2025, pp. 19 942–19 951

2025
[17]

Bayesian test-time adaptation for vision-language models,

L. Zhou, M. Ye, S. Li, N. Li, X. Zhu, L. Deng, H. Liu, and Z. Lei, “Bayesian test-time adaptation for vision-language models,” inPro- ceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2025

2025
[18]

Dual prototype evolving for test-time generalization of vision-language models,

C. Zhang, S. Stepputtis, K. Sycara, and Y . Xie, “Dual prototype evolving for test-time generalization of vision-language models,” vol. 37, 2024, pp. 32 111–32 136

2024
[19]

Cola: Context-aware language-driven test-time adaptation,

A. Zhang, T. Yu, L. Bai, J. Tang, Y . Guo, Y . Ruan, Y . Zhou, and Z. Lu, “Cola: Context-aware language-driven test-time adaptation,”IEEE Transactions on Image Processing, 2025. JOURNAL OF LATEX CLASS FILES, VOL. 14, NO. 8, AUGUST 2021 10

2025
[20]

Vlod-tta: Test-time adaptation of vision-language object detectors,

A. Belal, H. R. Medeiros, M. Pedersoli, and E. Granger, “Vlod-tta: Test-time adaptation of vision-language object detectors,”arXiv preprint arXiv:2510.00458, 2025

work page arXiv 2025
[21]

Bayesian test-time adaptation for object recognition and de- tection with vision-language models,

L. Zhou, M. Ye, S. Li, N. Li, J. Wu, X. Zhu, L. Deng, H. Liu, J. Luo, and Z. Lei, “Bayesian test-time adaptation for object recognition and de- tection with vision-language models,”arXiv preprint arXiv:2510.02750, 2025

work page arXiv 2025
[22]

Historical test- time prompt tuning for vision foundation models,

J. Zhang, J. Huang, X. Zhang, L. Shao, and S. Lu, “Historical test- time prompt tuning for vision foundation models,” inProceedings of the 38th International Conference on Neural Information Processing Systems, 2024, pp. 12 872–12 896

2024
[23]

Test-time adaptive object de- tection with foundation model,

Y . Gao, Y . Zhang, Z. Cai, and D. Huang, “Test-time adaptive object de- tection with foundation model,” inThe Thirty-ninth Annual Conference on Neural Information Processing Systems, 2025

2025
[24]

2016, arXiv e-prints, arXiv:1604.00772, doi: 10.48550/arXiv.1604.00772

N. Hansen, “The cma evolution strategy: A tutorial,”arXiv preprint arXiv:1604.00772, 2016

work page arXiv 2016
[25]

Max Jaderberg, Valentin Dalibard, Simon Osindero, Wojciech M Czarnecki, Jeff Donahue, Ali Razavi, Oriol Vinyals, Tim Green, Iain Dunning, Karen Simonyan, et al

M. Jaderberg, V . Dalibard, S. Osindero, W. M. Czarnecki, J. Don- ahue, A. Razavi, O. Vinyals, T. Green, I. Dunning, K. Simonyan et al., “Population based training of neural networks,”arXiv preprint arXiv:1711.09846, 2017

work page arXiv 2017
[26]

Deep residual learning for image recognition,

K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” inProceedings of the IEEE conference on computer vision and pattern recognition, 2016, pp. 770–778

2016
[27]

Faster r-cnn: Towards real-time object detection with region proposal networks,

S. Ren, K. He, R. Girshick, and J. Sun, “Faster r-cnn: Towards real-time object detection with region proposal networks,”IEEE transactions on pattern analysis and machine intelligence, vol. 39, no. 6, pp. 1137–1149, 2016

2016
[28]

Scaling up visual and vision-language representation learning with noisy text supervision,

C. Jia, Y . Yang, Y . Xia, Y .-T. Chen, Z. Parekh, H. Pham, Q. Le, Y .-H. Sung, Z. Li, and T. Duerig, “Scaling up visual and vision-language representation learning with noisy text supervision,” inInternational conference on machine learning. PMLR, 2021, pp. 4904–4916

2021
[29]

Filip: Fine-grained interactive language-image pre-training,

L. Yao, R. Huang, L. Hou, G. Lu, M. Niu, H. Xu, X. Liang, Z. Li, X. Jiang, and C. Xu, “Filip: Fine-grained interactive language-image pre-training,” inInternational Conference on Learning Representations
[30]

Pali: A jointly-scaled multilingual language-image model,

X. Chen, X. Wang, S. Changpinyo, A. Piergiovanni, P. Padlewski, D. Salz, S. Goodman, A. Grycner, B. Mustafa, L. Beyeret al., “Pali: A jointly-scaled multilingual language-image model,” inThe Eleventh International Conference on Learning Representations
[31]

Lit: Zero-shot transfer with locked-image text tuning,

X. Zhai, X. Wang, B. Mustafa, A. Steiner, D. Keysers, A. Kolesnikov, and L. Beyer, “Lit: Zero-shot transfer with locked-image text tuning,” inProceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2022, pp. 18 123–18 133

2022
[32]

Contrastive vision-language pre-training with limited resources,

Q. Cui, B. Zhou, Y . Guo, W. Yin, H. Wu, O. Yoshie, and Y . Chen, “Contrastive vision-language pre-training with limited resources,” in European Conference on Computer Vision. Springer, 2022, pp. 236– 253

2022
[33]

Open-vocabulary object detec- tion via vision and language knowledge distillation,

X. Gu, T.-Y . Lin, W. Kuo, and Y . Cui, “Open-vocabulary object detec- tion via vision and language knowledge distillation,” inInternational Conference on Learning Representations, 2022

2022
[34]

Detclip: dictionary-enriched visual-concept paralleled pre- training for open-world detection,

L. Yao, J. Han, Y . Wen, X. Liang, D. Xu, W. Zhang, Z. Li, C. Xu, and H. Xu, “Detclip: dictionary-enriched visual-concept paralleled pre- training for open-world detection,” inProceedings of the 36th Interna- tional Conference on Neural Information Processing Systems, 2022, pp. 9125–9138

2022
[35]

Open-vocabulary object detection using captions,

A. Zareian, K. D. Rosa, D. H. Hu, and S.-F. Chang, “Open-vocabulary object detection using captions,” inProceedings of the IEEE/CVF con- ference on computer vision and pattern recognition, 2021, pp. 14 393– 14 402

2021
[36]

End-to-end object detection with transformers,

N. Carion, F. Massa, G. Synnaeve, N. Usunier, A. Kirillov, and S. Zagoruyko, “End-to-end object detection with transformers,” in European conference on computer vision. Springer, 2020, pp. 213– 229

2020
[37]

Grounded language-image pre-training,

L. H. Li, P. Zhang, H. Zhang, J. Yang, C. Li, Y . Zhong, L. Wang, L. Yuan, L. Zhang, J.-N. Hwanget al., “Grounded language-image pre-training,” inProceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2022, pp. 10 965–10 975

2022
[38]

Yolo- world: Real-time open-vocabulary object detection,

T. Cheng, L. Song, Y . Ge, W. Liu, X. Wang, and Y . Shan, “Yolo- world: Real-time open-vocabulary object detection,” inProceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2024, pp. 16 901–16 911

2024
[39]

Tent: Fully test-time adaptation by entropy minimization,

D. Wang, E. Shelhamer, S. Liu, B. Olshausen, and T. Darrell, “Tent: Fully test-time adaptation by entropy minimization,” inInternational Conference on Learning Representations
[40]

Stfar: Improving object detection robustness at test-time by self-training with feature alignment regular- ization,

Y . Chen, X. Xu, Y . Su, and K. Jia, “Stfar: Improving object detection robustness at test-time by self-training with feature alignment regular- ization,”arXiv preprint arXiv:2303.17937, 2023

work page arXiv 2023
[41]

What how and when should object detectors update in continually changing test domains?

J. Yoo, D. Lee, I. Chung, D. Kim, and N. Kwak, “What how and when should object detectors update in continually changing test domains?” inProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024, pp. 23 354–23 363

2024
[42]

Towards online domain adaptive object detection,

V . VS, P. Oza, and V . M. Patel, “Towards online domain adaptive object detection,” inProceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, 2023, pp. 478–488

2023
[43]

Mlfa: Toward realistic test time adaptive object detection by multi-level feature alignment,

Y . Liu, J. Wang, C. Huang, Y . Wu, Y . Xu, and X. Cao, “Mlfa: Toward realistic test time adaptive object detection by multi-level feature alignment,”IEEE Transactions on Image Processing, vol. 33, pp. 5837– 5848, 2024

2024
[44]

Efficient test-time adaptive object detection via sensitivity-guided pruning,

K. Wang, X. Fu, X. Lu, C. Ge, C. Cao, W. Zhai, and Z.-J. Zha, “Efficient test-time adaptive object detection via sensitivity-guided pruning,” in Proceedings of the Computer Vision and Pattern Recognition Confer- ence, 2025, pp. 10 577–10 586

2025
[45]

Fully test-time adaptation for object detection,

X. Ruan and W. Tang, “Fully test-time adaptation for object detection,” inProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024, pp. 1038–1047

2024
[46]

Large-scale evolution of image classifiers,

E. Real, S. Moore, A. Selle, S. Saxena, Y . L. Suematsu, J. Tan, Q. V . Le, and A. Kurakin, “Large-scale evolution of image classifiers,” in International conference on machine learning. PMLR, 2017, pp. 2902– 2911

2017
[47]

Regularized evolution for image classifier architecture search,

E. Real, A. Aggarwal, Y . Huang, and Q. V . Le, “Regularized evolution for image classifier architecture search,” inProceedings of the aaai conference on artificial intelligence, vol. 33, no. 01, 2019, pp. 4780– 4789

2019
[48]

A survey on evolutionary computation approaches to feature selection,

B. Xue, M. Zhang, W. N. Browne, and X. Yao, “A survey on evolutionary computation approaches to feature selection,”IEEE Transactions on evolutionary computation, vol. 20, no. 4, pp. 606–626, 2015

2015
[49]

Bert: Pre-training of deep bidirectional transformers for language understanding,

J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, “Bert: Pre-training of deep bidirectional transformers for language understanding,” inPro- ceedings of the 2019 conference of the North American chapter of the association for computational linguistics: human language technologies, volume 1 (long and short papers), 2019, pp. 4171–4186

2019
[50]

Swin transformer: Hierarchical vision transformer using shifted windows,

Z. Liu, Y . Lin, Y . Cao, H. Hu, Y . Wei, Z. Zhang, S. Lin, and B. Guo, “Swin transformer: Hierarchical vision transformer using shifted windows,” inProceedings of the IEEE/CVF international conference on computer vision, 2021, pp. 10 012–10 022

2021
[51]

Dino: Detr with improved denoising anchor boxes for end-to-end object detection,

H. Zhang, F. Li, S. Liu, L. Zhang, H. Su, J. Zhu, L. Ni, and H.-Y . Shum, “Dino: Detr with improved denoising anchor boxes for end-to-end object detection,” inThe Eleventh International Conference on Learning Representations, 2022

2022
[52]

Do we really need to access the source data? source hypothesis transfer for unsupervised domain adaptation,

J. Liang, D. Hu, and J. Feng, “Do we really need to access the source data? source hypothesis transfer for unsupervised domain adaptation,” inInternational conference on machine learning. PMLR, 2020, pp. 6028–6039

2020
[53]

Test-time classifier adjustment module for model-agnostic domain generalization,

Y . Iwasawa and Y . Matsuo, “Test-time classifier adjustment module for model-agnostic domain generalization,” vol. 34, 2021, pp. 2427–2440

2021
[54]

End-to-end semi-supervised object detection with soft teacher,

M. Xu, Z. Zhang, H. Hu, J. Wang, L. Wang, F. Wei, X. Bai, and Z. Liu, “End-to-end semi-supervised object detection with soft teacher,” inProceedings of the IEEE/CVF international conference on computer vision, 2021, pp. 3060–3069

2021
[55]

Revisiting realistic test-time training: Sequential inference and adaptation by anchored clustering,

Y . Su, X. Xu, and K. Jia, “Revisiting realistic test-time training: Sequential inference and adaptation by anchored clustering,” vol. 35, 2022, pp. 17 543–17 555

2022
[56]

Semantic foggy scene under- standing with synthetic data,

C. Sakaridis, D. Dai, and L. Van Gool, “Semantic foggy scene under- standing with synthetic data,”International Journal of Computer Vision, vol. 126, no. 9, pp. 973–992, 2018

2018
[57]

The cityscapes dataset for semantic urban scene understanding,

M. Cordts, M. Omran, S. Ramos, T. Rehfeld, M. Enzweiler, R. Be- nenson, U. Franke, S. Roth, and B. Schiele, “The cityscapes dataset for semantic urban scene understanding,” inProceedings of the IEEE conference on computer vision and pattern recognition, 2016, pp. 3213– 3223

2016
[58]

Ecker, Matthias Bethge, and Wieland Brendel

C. Michaelis, B. Mitzkus, R. Geirhos, E. Rusak, O. Bringmann, A. S. Ecker, M. Bethge, and W. Brendel, “Benchmarking robustness in object detection: Autonomous driving when winter is coming,”arXiv preprint arXiv:1907.07484, 2019

work page arXiv 1907
[59]

The pascal visual object classes challenge: A retrospective,

M. Everingham, S. A. Eslami, L. Van Gool, C. K. Williams, J. Winn, and A. Zisserman, “The pascal visual object classes challenge: A retrospective,”International journal of computer vision, vol. 111, no. 1, pp. 98–136, 2015

2015
[60]

Microsoft coco: Common objects in context,

T.-Y . Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Doll ´ar, and C. L. Zitnick, “Microsoft coco: Common objects in context,” inEuropean conference on computer vision. Springer, 2014, pp. 740–755

2014