arxiv: 2412.06341 · v2 · pith:OJ4WVOEInew · submitted 2024-12-09 · 💻 cs.CV · cs.AI

Visual Accommodation: Rethinking Image Scale as a Learnable Variable for Object Detection

Pith reviewed 2026-05-23 07:37 UTC · model grok-4.3

classification 💻 cs.CV cs.AI
keywords object detection · scale adaptation · test-time resolution · DETR · parametric scaling · single-pass inference · input scale prediction

The pith

A lightweight predictor lets object detectors choose input scale at test time by optimizing a parametric scaling model with loss objectives.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper treats image scale as a learnable variable instead of a fixed setting during inference. It adds a scale predictor that estimates suitable resolutions for each test image, trained through a parametric description of desired scaling behavior that produces usable loss signals even though the true best scale is never observed directly. This setup aims to carry the benefits of multi-scale training over to test time without running multiple resolutions or passes. The result is claimed to be a single forward pass that adapts resolution on the fly, closing the gap between training robustness and deployment flexibility.

Core claim

Ciliary-DETR adds a scale predictor that dynamically estimates test-time scale factors. Because the optimal scale cannot be observed in standard training, a parametric formulation of desired scaling behavior is introduced; this formulation yields loss-driven objectives that train the predictor, producing flexible single-pass inference that adapts resolution without retraining the detector.
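
A minimal sketch of the single-pass flow described in the core claim, assuming the predictor returns one scalar scale factor per image and that the factor is clipped to a range [tau_min, tau_max]. The names `predict_scale`, `detect`, `tau_min`, `tau_max` and the default bounds are illustrative assumptions, not values taken from the paper, and the nearest-neighbour resize stands in for whatever interpolation a real pipeline would use.

```python
from typing import Callable, List, Tuple

Image = List[List[float]]  # toy grayscale image stored as rows of pixel values


def clip(value: float, lo: float, hi: float) -> float:
    """Clamp value to the closed interval [lo, hi]."""
    return max(lo, min(hi, value))


def resize_nearest(image: Image, scale: float) -> Image:
    """Nearest-neighbour resize; a placeholder for a real interpolation routine."""
    h, w = len(image), len(image[0])
    new_h, new_w = max(1, round(h * scale)), max(1, round(w * scale))
    return [
        [image[min(h - 1, int(r / scale))][min(w - 1, int(c / scale))]
         for c in range(new_w)]
        for r in range(new_h)
    ]


def adaptive_inference(
    image: Image,
    predict_scale: Callable[[Image], float],  # hypothetical scale predictor
    detect: Callable[[Image], object],        # hypothetical, unchanged detector
    tau_min: float = 0.5,                     # assumed lower clipping bound
    tau_max: float = 2.0,                     # assumed upper clipping bound
) -> Tuple[float, object]:
    """Predict a scale, clip it, resize once, and run the detector once."""
    tau = clip(predict_scale(image), tau_min, tau_max)
    resized = resize_nearest(image, tau)
    return tau, detect(resized)
```

The detector is called exactly once per image; only the resolution it sees changes, which is the property the single-pass claim depends on.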

What carries the argument

Lightweight scale predictor that estimates test-time scale factors from a parametric model of desired scaling behavior.

If this is right

  • Detectors perform resolution adaptation in a single forward pass rather than multiple fixed-resolution runs.
  • Multi-scale training robustness carries over to test-time inputs without extra inference cost.
  • The same detector can handle inputs whose object sizes vary widely without manual scale selection.
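
The cost argument behind these points can be made concrete with a rough pixel count. The sketch below assumes inference cost grows linearly with the number of input pixels, which is a simplification (attention cost in a transformer grows faster), and the scale set used for multi-scale testing is a made-up example. Only the 800×1333 base resolution comes from the Figure 2 caption.

```python
from typing import Iterable


def pixel_cost(height: int, width: int, scale: float) -> int:
    """Number of pixels after resizing by `scale` (side lengths truncated to int)."""
    return int(height * scale) * int(width * scale)


def multi_scale_cost(height: int, width: int, scales: Iterable[float]) -> int:
    """Total pixels processed when the detector runs once per scale."""
    return sum(pixel_cost(height, width, s) for s in scales)


def relative_saving(height: int, width: int,
                    predicted_scale: float, scales: Iterable[float]) -> float:
    """Fraction of pixel cost saved by one pass at the predicted scale."""
    scales = list(scales)
    single = pixel_cost(height, width, predicted_scale)
    multi = multi_scale_cost(height, width, scales)
    return 1.0 - single / multi


# Hypothetical comparison: one pass at a predicted scale of 1.0 versus
# a three-scale test pipeline on an 800x1333 input.
saving = relative_saving(800, 1333, 1.0, [0.5, 1.0, 1.5])
```

Under these assumptions, a single pass costs less than the multi-scale pipeline whenever the predicted scale lies inside the tested range, since the pipeline already pays for its largest scale.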

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same predictor mechanism could be attached to non-DETR detectors to test whether the parametric loss approach generalizes.
  • If the scale choices prove stable across datasets, real-time systems could drop multi-scale test pipelines entirely.
  • Extending the predictor to predict per-object scale adjustments inside one image would be a direct next step.

Load-bearing premise

The optimal input scale is inherently unobservable under standard training setups, and a parametric formulation of desired scaling behavior can be turned into loss-driven objectives that successfully guide scale optimization.
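
One way to picture how a loss can steer the scale without an observed optimum is sketched below. It is not the paper's formulation: it is a toy binary cross-entropy built only to match the behaviour stated in the Figure 4 caption (low loss for high scale factors on small objects, low loss for low scale factors on large objects, and a probability that bottoms out at tau_min / tau_max under clipping). The mapping `p = tau / tau_max`, the soft target `y_up`, and the default bounds are all assumptions.

```python
import math


def scale_loss(tau: float, y_up: float,
               tau_min: float = 0.5, tau_max: float = 2.0,
               eps: float = 1e-7) -> float:
    """Toy cross-entropy between an up-scaling target and a scale-derived probability.

    tau   : predicted scale factor (clipped to [tau_min, tau_max])
    y_up  : assumed soft target in [0, 1]; near 1 for small objects
            (up-scaling desired), near 0 for large objects
    """
    tau = max(tau_min, min(tau_max, tau))        # scale factor clipping
    p = tau / tau_max                            # assumed "up-scaling probability"
    p = min(1.0 - eps, max(eps, p))              # keep log() finite
    return -(y_up * math.log(p) + (1.0 - y_up) * math.log(1.0 - p))


# Small objects (y_up = 1): a larger scale factor gives a lower loss.
small_high, small_low = scale_loss(2.0, 1.0), scale_loss(0.5, 1.0)

# Large objects (y_up = 0): the loss is lowest at the clipped minimum,
# where p equals tau_min / tau_max.
large_low, large_high = scale_loss(0.5, 0.0), scale_loss(2.0, 0.0)
```

The target here depends only on object size, which is available from the ground-truth boxes, so the loss can be computed even though the best scale itself is never labeled.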

What would settle it

A controlled test on images whose best scales are known in advance that shows the predictor's chosen scales produce no accuracy gain over a fixed-resolution baseline.
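
A sketch of how such a test could be scored, assuming a per-image quality score has already been measured at every scale in a small grid. The per-image score is a stand-in (mAP does not decompose per image), and the function names and table layout are hypothetical.

```python
from typing import Dict, List

# One dict per image, mapping a candidate scale to the score measured at that scale.
ScoreTable = List[Dict[float, float]]


def mean_score(table: ScoreTable, chosen: List[float]) -> float:
    """Average score when image i is processed at scale chosen[i]."""
    return sum(row[s] for row, s in zip(table, chosen)) / len(table)


def compare_policies(table: ScoreTable, predicted: List[float],
                     fixed_scale: float = 1.0) -> Dict[str, float]:
    """Score a fixed-resolution baseline, the predictor's choices, and the oracle."""
    oracle = [max(row, key=row.get) for row in table]  # best scale per image
    fixed = [fixed_scale] * len(table)
    return {
        "fixed": mean_score(table, fixed),
        "predicted": mean_score(table, predicted),
        "oracle": mean_score(table, oracle),
    }
```

The premise would be in trouble if "predicted" failed to exceed "fixed" on data where "oracle" shows there is headroom to gain.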

Figures

Figures reproduced from arXiv: 2412.06341 by Daeun Seo, Hoeseok Yang, Hyungshin Kim, Sihyeong Park.

Figure 1: A preliminary experiment for the effect of image res…
Figure 2: Comparison on COCO val. The marker size indicates the maximum image resolution, which is 800×1333 for MS training. The base backbone network is R50.
Figure 3: Overview of Elastic-DETR. The image resolution is scaled according to the scale factor, which is obtained from the scale…
Figure 4: Scale loss produces a low value for a high (low) scale factor for small (large) objects. When y_up = 0, the input probability converges to τ_min/τ_max, indicating that scale loss can function effectively with scale factor clipping. … is because up-scaling and down-scaling probabilities mutually appear, except in cases where the object is excessively small or large. Consequently, we introduce a continuous-valued…
Figure 5: Comparison between two forms of the loss function uti…
Figure 6: Distribution of scale factors based on the average of object sizes from the image. Normalized scale factors are utilized to obtain…
Figure 7: Trained beta distribution and per-scale loss distribution on COCO
Figure 8: Example of encoder attention map.
Class-wise Comparison. Tab. 5 displays the class-wise performance gain compared to MS. Our method can enhance performance for most classes, except six classes.
  • Positive (total 74): Baseball Bat (11.1%) / Remote (8.9%) / Toaster (7.3%) / Toothbrush (6.2%) / Tennis Racket (5.6%)
  • Negative (total 6): Oven (-3.1%) / Refrigerator (-3.0%) / Hair Drier (-3.0%) / Cat (-1.5%) / Orange (…
Figure 9: Examples of positive and negative classes.
Figure 10: Convergence of boundaries in early training iterations.
Original abstract

We propose Ciliary-DETR (previous name: Elastic-DETR), a framework for test-time resolution adjustment analogous to biological accommodation. While multi-scale data augmentation improves robustness to scale variation, modern detectors rely on fixed inference resolutions, potentially limiting flexibility and robustness. Similar to the ciliary muscle, we introduce a lightweight scale predictor that dynamically estimates test-time scale factors across a wide range of input scales. The core challenge is that the optimal input scale is inherently unobservable under standard training setups. To address this challenge, we introduce a parametric formulation of desired scaling behavior, leading to loss-driven objectives that guide scale optimization. Overall, our method enables flexible and efficient single-pass inference, bridging the gap between training-time robustness and test-time adaptation.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

0 major / 2 minor

Summary. The manuscript proposes Ciliary-DETR (formerly Elastic-DETR), a framework for test-time resolution adjustment in object detection analogous to biological accommodation. It introduces a lightweight scale predictor that dynamically estimates test-time scale factors across a wide range of input scales. The central technical contribution is a parametric formulation of desired scaling behavior that yields loss-driven objectives to train the predictor, addressing the claim that optimal input scale is inherently unobservable under standard training setups. The method is presented as enabling flexible and efficient single-pass inference that bridges training-time multi-scale robustness with test-time adaptation.

Significance. If the parametric formulation and resulting scale predictions prove effective, the work would provide a practical mechanism for adaptive single-pass inference without the overhead of multi-scale testing or post-hoc ensembling. This addresses a real deployment gap in modern detectors and could influence subsequent work on learnable input transformations. The internal logic of the approach is consistent and does not contain an evident circularity or unsupported leap once the concrete formulation is examined.

minor comments (2)
  1. The abstract refers to both 'Ciliary-DETR' and the previous name 'Elastic-DETR'; a single consistent name should be used throughout, with any name change noted only in a footnote.
  2. The claim that the approach 'bridges the gap between training-time robustness and test-time adaptation' would benefit from a brief quantitative comparison (e.g., mAP vs. latency) against standard multi-scale inference baselines in the results section.

Simulated Authors' Rebuttal

0 responses · 0 unresolved

We thank the referee for reviewing our manuscript. We appreciate the assessment that the internal logic is consistent, that there is no evident circularity, and that the approach could address a real deployment gap in modern detectors. The recommendation is listed as uncertain, yet the report contains no specific major comments or requests for clarification. We therefore provide no point-by-point responses and remain available to address any concrete questions the referee may wish to raise.

Circularity Check

0 steps flagged

No significant circularity

full rationale

The paper's central mechanism introduces a parametric formulation of desired scaling behavior to derive loss-driven objectives for training a lightweight scale predictor, directly addressing the stated unobservability of optimal input scale under standard setups. This constitutes an independent modeling choice rather than any reduction of a claimed prediction or result to its own fitted inputs by construction. No load-bearing step relies on self-citation chains, uniqueness theorems imported from the authors' prior work, or renaming of known empirical patterns; the single-pass inference claim follows from the proposed architecture and training objectives without evident self-definitional equivalence. The derivation chain remains self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The method depends on the unobservability of optimal scale and on the existence of a parametric description of desired scaling that can be converted into usable training losses; both are introduced without external validation in the provided abstract.

axioms (1)
  • domain assumption: The optimal input scale is inherently unobservable under standard training setups.
    Explicitly stated as the core challenge that the parametric formulation is meant to solve.
invented entities (1)
  • lightweight scale predictor (no independent evidence)
    purpose: Dynamically estimate test-time scale factors
    New module introduced to perform the accommodation function; no independent evidence of its behavior outside the paper is supplied.

Appendix A. Discussions

A.1. Multi-scale Image Resolution

The preliminary experiment depicted in Fig. 1 was performed to assess the flexibility of the MS technique. The performance improvement ac...