VFM⁴SDG: Unveiling the Power of VFMs for Single-Domain Generalized Object Detection
Pith reviewed 2026-05-25 06:01 UTC · model grok-4.3
The pith
Vision foundation models preserve stable relational structures that compensate for missed detections in DETR detectors under domain shifts.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Performance degradation under domain shift is dominated by an increase in missed detections. These arise from two failures: disrupted encoder-side object-background and inter-instance relations, and weakened semantic-spatial binding between decoder queries and objects. Vision foundation models preserve stable relational structures and object responses under severe shifts, so they can supply cross-domain stability priors when their encoder relations are distilled into the detector and their category semantics are injected into the queries.
What carries the argument
Dual-prior learning framework that performs Cross-domain Stable Relational Prior Distillation from a frozen VFM into the detector encoder and Semantic-Contextual Prior-based Query Enhancement that adds category semantic prototypes and global object context to decoder queries.
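The encoder-side prior can be pictured as a relation-matching loss. The sketch below is an illustration under stated assumptions, not the paper's formulation: it assumes relations are pairwise cosine similarities between tokens, and that the distillation loss is the mean squared gap between the detector's and the frozen VFM's relation matrices. The function names and tensor shapes are hypothetical.

```python
import numpy as np

def relation_matrix(tokens: np.ndarray) -> np.ndarray:
    """Pairwise cosine similarity between tokens; input (N, d), output (N, N)."""
    norms = np.linalg.norm(tokens, axis=1, keepdims=True)
    unit = tokens / np.clip(norms, 1e-8, None)
    return unit @ unit.T

def relation_distillation_loss(student_tokens: np.ndarray,
                               teacher_tokens: np.ndarray) -> float:
    """Assumed loss: mean squared gap between student and teacher relations."""
    r_student = relation_matrix(student_tokens)
    r_teacher = relation_matrix(teacher_tokens)  # teacher is frozen: no update
    return float(np.mean((r_student - r_teacher) ** 2))

rng = np.random.default_rng(0)
teacher = rng.normal(size=(6, 16))  # stand-in for frozen VFM tokens
student = rng.normal(size=(6, 8))   # stand-in for detector encoder tokens

loss_random = relation_distillation_loss(student, teacher)
# Rescaling features leaves cosine relations unchanged, so the loss vanishes.
loss_matched = relation_distillation_loss(teacher * 3.0, teacher)
```

Because only token-to-token relations are compared, the student and teacher may have different feature widths, provided they share the same token grid.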
If this is right
- VFM⁴SDG outperforms prior SDGOD methods on standard benchmarks while remaining compatible with two mainstream DETR-based detectors.
- Relational stability in the encoder and query-object binding in the decoder are the primary factors that determine cross-domain detection reliability.
- A single frozen VFM can serve as a fixed source of priors that compensates for domain-induced degradation without requiring domain-specific adjustments.
Where Pith is reading between the lines
- The same relational-distillation approach could be tested on non-DETR detectors or on tasks such as instance segmentation that also rely on query-object binding.
- If the stability priors generalize, the method might lower the amount of data augmentation needed in other single-domain generalization pipelines.
- Measuring the preservation of VFM relations across a wider range of imaging degradations would clarify whether the observed stability holds beyond the weather and illumination shifts examined.
Load-bearing premise
Missed detections are the main source of degradation, and distilling from a frozen VFM restores the broken relations and bindings without creating new instabilities.
What would settle it
A controlled test in which the VFM-derived relations are shown to be as unstable as the detector's own relations under the same shifts, or in which adding the distilled priors fails to reduce the count of missed detections on the standard SDGOD benchmarks.
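A test of this kind needs a concrete stability score. The sketch below shows one possible choice, assuming stability is measured as the cosine similarity between the off-diagonal entries of a model's token relation matrices for the same scene in the source domain and in a shifted domain. The metric, the function names, and the synthetic noise used as a stand-in for domain shift are all assumptions made for illustration; the paper does not specify them.

```python
import numpy as np

def relation_matrix(tokens: np.ndarray) -> np.ndarray:
    """Pairwise cosine similarity between tokens; input (N, d), output (N, N)."""
    norms = np.linalg.norm(tokens, axis=1, keepdims=True)
    unit = tokens / np.clip(norms, 1e-8, None)
    return unit @ unit.T

def off_diagonal(matrix: np.ndarray) -> np.ndarray:
    """Drop the diagonal, which is always 1 and would inflate the score."""
    return matrix[~np.eye(matrix.shape[0], dtype=bool)]

def relation_stability(source_tokens: np.ndarray,
                       shifted_tokens: np.ndarray) -> float:
    """Cosine similarity between the relation patterns of two domains."""
    a = off_diagonal(relation_matrix(source_tokens))
    b = off_diagonal(relation_matrix(shifted_tokens))
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

rng = np.random.default_rng(0)
source = rng.normal(size=(32, 64))                      # synthetic features
mild = source + 0.1 * rng.normal(size=source.shape)     # small perturbation
severe = source + 2.0 * rng.normal(size=source.shape)   # large perturbation

stability_mild = relation_stability(source, mild)
stability_severe = relation_stability(source, severe)
```

Running the same function on frozen VFM features and on detector encoder features, for matched source and target images, would give the two numbers the settling test compares.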
Original abstract
Real-world weather, illumination, and imaging variations often induce severe domain shifts, degrading single-source detectors in unseen environments. Existing single-domain generalized object detection (SDGOD) methods mainly rely on data augmentation or domain-invariant learning, while largely overlooking how domain shift disrupts detector prediction stability. Through analytical experiments, we find that performance degradation is mainly dominated by increasing missed detections. Further analysis shows that this phenomenon stems from reduced cross-domain stability in DETR-style detectors: domain shift disrupts encoder-side object-background and inter-instance relations, and further weakens the semantic-spatial binding between decoder queries and real objects. Motivated by this, we find that vision foundation models (VFMs) still preserve stable relational structures and object responses under severe shifts, making them suitable cross-domain stability priors to compensate for detector degradation. To this end, we propose VFM⁴SDG, a dual-prior learning framework for SDGOD, which introduces a frozen VFM into encoder representation learning and decoder query modeling. Specifically, we propose Cross-domain Stable Relational Prior Distillation to distill stable object-background and inter-instance relations from the VFM into the encoder, compensating for relational degradation. Meanwhile, we propose Semantic-Contextual Prior-based Query Enhancement, which injects category semantic prototypes and global object context into queries before they enter the decoder layer, enhancing semantic-spatial query-object binding stability. Extensive experiments show that VFM⁴SDG significantly outperforms existing advanced methods on standard SDGOD benchmarks and two mainstream DETR-based detection frameworks, demonstrating its effectiveness, robustness, and generality.
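The decoder-side idea in the abstract, injecting category prototypes and global context into queries, can be sketched as follows. This is a guess at one plausible form, not the paper's module: it assumes each query attends over the class prototypes with a scaled softmax, and that the global context is a mean pool over feature tokens. The name `enhance_queries` and all shapes are hypothetical.

```python
import numpy as np

def softmax(x: np.ndarray, axis: int = -1) -> np.ndarray:
    """Numerically stable softmax along the given axis."""
    shifted = x - x.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)

def enhance_queries(queries: np.ndarray,
                    prototypes: np.ndarray,
                    context_tokens: np.ndarray) -> np.ndarray:
    """Add a prototype-attention term and a pooled context vector to queries.

    queries:        (num_queries, d) decoder queries before the first layer
    prototypes:     (num_classes, d) category semantic prototypes
    context_tokens: (num_tokens, d)  features pooled into a global context
    """
    d = queries.shape[1]
    attn = softmax(queries @ prototypes.T / np.sqrt(d))   # (num_queries, num_classes)
    semantic = attn @ prototypes                          # (num_queries, d)
    context = context_tokens.mean(axis=0, keepdims=True)  # (1, d), broadcast below
    return queries + semantic + context

rng = np.random.default_rng(0)
queries = rng.normal(size=(10, 32))
prototypes = rng.normal(size=(7, 32))
context_tokens = rng.normal(size=(100, 32))

enhanced = enhance_queries(queries, prototypes, context_tokens)
```

The additive form keeps the original query intact, so the enhancement reduces to the identity when the prototypes and context carry no signal.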
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims that domain shift in DETR-style detectors primarily increases missed detections by disrupting encoder-side object-background/inter-instance relations and decoder query-object bindings; vision foundation models (VFMs) preserve stable relational structures and object responses across shifts, so the proposed VFM⁴SDG dual-prior framework distills cross-domain stable relational priors from a frozen VFM into the detector encoder and injects category semantic prototypes plus global context into decoder queries, yielding significant gains over prior SDGOD methods on standard benchmarks and two DETR frameworks.
Significance. If the analytical experiments confirm that VFM relational stability is quantitatively distinct from detector degradation and the performance lift is specifically due to the two priors rather than generic regularization or capacity, the work would establish a concrete mechanism for using frozen VFMs as cross-domain stability anchors in single-source generalized detection, with potential generality across detection architectures.
major comments (2)
- [analytical experiments / motivation section] The central motivation rests on the analytical finding that VFMs preserve stable relational structures under shift while detectors do not, yet no explicit quantitative stability metric (e.g., cosine similarity of relation matrices or query-object binding scores) is reported on frozen VFM features versus detector features across source and target domains; without such a table or figure in the analytical experiments section, the claimed cross-domain stability prior cannot be separated from post-hoc performance gains.
- [method and experiments] The two proposed modules (Cross-domain Stable Relational Prior Distillation and Semantic-Contextual Prior-based Query Enhancement) are motivated as directly compensating the identified degradation modes, but the manuscript provides no ablation that measures the reduction in missed detections attributable to each module separately on the target domains, nor any statistical test confirming the modules address the encoder-relation and decoder-binding issues rather than generic regularization.
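The per-module ablation requested above depends on a fixed definition of a missed detection. The sketch below uses one common convention, assumed here rather than taken from the paper: a ground-truth box is missed when no unused prediction overlaps it with IoU of at least 0.5 under greedy one-to-one matching. Class labels and confidence scores are ignored for brevity, and the boxes are toy values, not results.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def missed_detection_rate(gt_boxes, pred_boxes, iou_thr=0.5):
    """Fraction of ground-truth boxes left unmatched by any prediction."""
    if not gt_boxes:
        return 0.0
    used = set()   # indices of predictions already matched to a ground truth
    missed = 0
    for gt in gt_boxes:
        best_j, best_iou = None, iou_thr
        for j, pred in enumerate(pred_boxes):
            if j in used:
                continue
            overlap = iou(gt, pred)
            if overlap >= best_iou:
                best_j, best_iou = j, overlap
        if best_j is None:
            missed += 1
        else:
            used.add(best_j)
    return missed / len(gt_boxes)

# Toy boxes for illustration only.
gt = [(0, 0, 10, 10), (20, 20, 30, 30), (50, 50, 60, 60)]
preds_baseline = [(1, 1, 10, 10)]
preds_with_module = [(1, 1, 10, 10), (20, 21, 30, 30)]

rate_baseline = missed_detection_rate(gt, preds_baseline)
rate_with_module = missed_detection_rate(gt, preds_with_module)
```

Reporting this rate on each target domain with each module enabled alone would attribute the reduction in misses to a specific prior.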
minor comments (2)
- [method] Notation for the two priors and their loss terms should be introduced with explicit equations early in the method section to improve readability.
- [figures] Figure captions for the analytical experiments should explicitly state the domains and models compared so that the stability claim can be verified from the visuals alone.
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed comments. We address each major comment below and commit to revisions that directly strengthen the motivation and experimental validation.
- Referee: [analytical experiments / motivation section] The central motivation rests on the analytical finding that VFMs preserve stable relational structures under shift while detectors do not, yet no explicit quantitative stability metric (e.g., cosine similarity of relation matrices or query-object binding scores) is reported on frozen VFM features versus detector features across source and target domains; without such a table or figure in the analytical experiments section, the claimed cross-domain stability prior cannot be separated from post-hoc performance gains.
  Authors: We agree that an explicit quantitative comparison would more rigorously separate the claimed stability prior from performance observations. While the current analytical experiments section links missed detections to relational degradation via indirect metrics and visualizations, it does not report direct cross-domain stability scores (e.g., cosine similarity of relation matrices or query-object binding) between frozen VFM and detector features. We will add a dedicated table and accompanying figure in the revised analytical experiments section that computes and reports these metrics on both source and target domains for VFM versus detector features. This addition will be placed before the method section to better ground the motivation.
  revision: yes
- Referee: [method and experiments] The two proposed modules (Cross-domain Stable Relational Prior Distillation and Semantic-Contextual Prior-based Query Enhancement) are motivated as directly compensating the identified degradation modes, but the manuscript provides no ablation that measures the reduction in missed detections attributable to each module separately on the target domains, nor any statistical test confirming the modules address the encoder-relation and decoder-binding issues rather than generic regularization.
  Authors: We concur that module-specific ablations focused on missed-detection reduction and statistical validation would strengthen the causal claims. The existing ablations demonstrate overall gains but do not isolate per-module effects on missed detections or include formal statistical tests. In the revision we will add targeted ablations that report the change in missed-detection rate on target domains when each module is enabled individually. We will also run multiple random seeds and include paired statistical significance tests (e.g., t-tests) comparing the full model against ablated variants to confirm the improvements exceed generic regularization effects. These results will appear in the main experiments section or supplementary material.
  revision: yes
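The paired test mentioned in the rebuttal can be sketched with the standard library alone. The example assumes one missed-detection rate per random seed for the full model and for an ablated variant, paired by seed. The numbers are placeholders invented for illustration and are not results from the paper; the function name is hypothetical.

```python
import math
import statistics

def paired_t_statistic(full, ablated):
    """Return (t, degrees_of_freedom) for seed-paired measurements."""
    if len(full) != len(ablated) or len(full) < 2:
        raise ValueError("need at least two paired measurements")
    diffs = [f - a for f, a in zip(full, ablated)]
    n = len(diffs)
    mean_diff = statistics.mean(diffs)
    std_diff = statistics.stdev(diffs)  # sample standard deviation (n - 1)
    t = mean_diff / (std_diff / math.sqrt(n))
    return t, n - 1

# Placeholder missed-detection rates over five seeds (NOT real results).
full_model = [0.30, 0.28, 0.31, 0.29, 0.27]
ablated_model = [0.35, 0.34, 0.33, 0.36, 0.31]

t_stat, dof = paired_t_statistic(full_model, ablated_model)

# Two-sided 5% critical value of Student's t with 4 degrees of freedom.
T_CRIT_DF4 = 2.776
significant = abs(t_stat) > T_CRIT_DF4
```

With SciPy available, `scipy.stats.ttest_rel(full_model, ablated_model)` returns the same statistic along with a p-value. Pairing by seed removes seed-to-seed variance from the comparison, which matters when only a handful of runs are feasible.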
Circularity Check
No circularity: empirical motivation and proposed modules are independent of fitted inputs or self-citations.
The paper's chain consists of analytical observations on detector degradation under domain shift, followed by a proposed dual-prior framework (Cross-domain Stable Relational Prior Distillation and Semantic-Contextual Prior-based Query Enhancement) that injects VFM features. No equations, parameter fits, or derivations are presented that reduce the claimed stability priors or performance gains to quantities defined from the same data by construction. No self-citation load-bearing steps or uniqueness theorems from prior author work appear in the provided text. The approach is self-contained against external SDGOD benchmarks.
Axiom & Free-Parameter Ledger
axioms (2)
- domain assumption: Performance degradation under domain shift is mainly dominated by increasing missed detections caused by disrupted encoder relations and weakened decoder query-object binding.
- domain assumption: Vision foundation models preserve stable relational structures and object responses under severe domain shifts.