Visual Accommodation: Rethinking Image Scale as a Learnable Variable for Object Detection

Daeun Seo; Hoeseok Yang; Hyungshin Kim; Sihyeong Park

arxiv: 2412.06341 · v2 · pith:OJ4WVOEInew · submitted 2024-12-09 · 💻 cs.CV · cs.AI

Pith reviewed 2026-05-23 07:37 UTC · model grok-4.3

classification: cs.CV, cs.AI

keywords: object detection, scale adaptation, test-time resolution, DETR, parametric scaling, single-pass inference, input scale prediction

The pith

A lightweight predictor lets object detectors choose input scale at test time by optimizing a parametric scaling model with loss objectives.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper treats image scale as a learnable variable instead of a fixed setting during inference. It adds a scale predictor that estimates suitable resolutions for each test image, trained through a parametric description of desired scaling behavior that produces usable loss signals even though the true best scale is never observed directly. This setup aims to carry the benefits of multi-scale training over to test time without running multiple resolutions or passes. The result is claimed to be a single forward pass that adapts resolution on the fly, closing the gap between training robustness and deployment flexibility.

Core claim

Ciliary-DETR adds a scale predictor that dynamically estimates test-time scale factors. Because the optimal scale cannot be observed in standard training, a parametric formulation of desired scaling behavior is introduced; this formulation yields loss-driven objectives that train the predictor, producing flexible single-pass inference that adapts resolution without retraining the detector.
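
The single-pass flow in this claim can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the paper's implementation: the names `predict_scale`, `resize_nearest`, `single_pass_inference`, `TAU_MIN`, and `TAU_MAX` are hypothetical, the clipping bounds are made up, and the stand-in predictor uses a simple image statistic where the paper trains a small network with its parametric loss (not reproduced in this summary).

```python
import math

# Hypothetical clipping bounds for the scale factor (not taken from the paper).
TAU_MIN, TAU_MAX = 0.5, 2.0


def predict_scale(image):
    """Stand-in for a learned scale predictor (an assumption for illustration).

    `image` is a grayscale image given as a list of rows with values in [0, 1].
    The mean intensity is squashed by a sigmoid and mapped into the allowed range.
    """
    pixels = [p for row in image for p in row]
    mean = sum(pixels) / len(pixels)
    score = 1.0 / (1.0 + math.exp(-(mean - 0.5)))  # value in (0, 1)
    return TAU_MIN + score * (TAU_MAX - TAU_MIN)


def resize_nearest(image, scale):
    """Nearest-neighbour resize of a list-of-rows image by one scale factor."""
    h, w = len(image), len(image[0])
    new_h, new_w = max(1, round(h * scale)), max(1, round(w * scale))
    rows = [min(int(r / scale), h - 1) for r in range(new_h)]
    cols = [min(int(c / scale), w - 1) for c in range(new_w)]
    return [[image[r][c] for c in cols] for r in rows]


def single_pass_inference(image, detector):
    """Predict one scale, clip it, resize once, and call the detector once."""
    scale = min(max(predict_scale(image), TAU_MIN), TAU_MAX)
    return scale, detector(resize_nearest(image, scale))
```

The point of the sketch is the control flow: the detector is called exactly once per image, and the only extra work is the predictor and one resize. Any callable that accepts the resized image can stand in for `detector`.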

What carries the argument

Lightweight scale predictor that estimates test-time scale factors from a parametric model of desired scaling behavior.

If this is right

Detectors perform resolution adaptation in a single forward pass rather than multiple fixed-resolution runs.
Multi-scale training robustness carries over to test-time inputs without extra inference cost.
The same detector can handle inputs whose object sizes vary widely without manual scale selection.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The same predictor mechanism could be attached to non-DETR detectors to test whether the parametric loss approach generalizes.
If the scale choices prove stable across datasets, real-time systems could drop multi-scale test pipelines entirely.
Extending the predictor to output per-object scale adjustments inside one image would be a direct next step.

Load-bearing premise

The optimal input scale is inherently unobservable under standard training setups, and a parametric formulation of desired scaling behavior can be turned into loss-driven objectives that successfully guide scale optimization.

What would settle it

A controlled test on images whose best scales are known in advance that shows the predictor's chosen scales produce no accuracy gain over a fixed-resolution baseline.
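
One way such a test could be scored is sketched below. It assumes the detector has already been run offline at every candidate scale, so each image has a table of per-scale accuracy; the function names `accuracy_gain` and `oracle_agreement` and the table layout are hypothetical and are not taken from the paper.

```python
def accuracy_gain(per_scale_accuracy, predicted_scales, fixed_scale):
    """Mean accuracy gain of the predicted scales over a fixed-scale baseline.

    per_scale_accuracy: one dict per image mapping scale -> accuracy,
    measured offline by running the detector at every candidate scale.
    predicted_scales: the scale the predictor chose for each image.
    """
    gains = [acc[pred] - acc[fixed_scale]
             for acc, pred in zip(per_scale_accuracy, predicted_scales)]
    return sum(gains) / len(gains)


def oracle_agreement(per_scale_accuracy, predicted_scales):
    """Fraction of images where the predicted scale is the best known scale."""
    hits = sum(pred == max(acc, key=acc.get)
               for acc, pred in zip(per_scale_accuracy, predicted_scales))
    return hits / len(predicted_scales)
```

Under this framing, a mean gain indistinguishable from zero would count against the claim, while a clearly positive gain together with high oracle agreement would support it.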

Figures

Figures reproduced from arXiv: 2412.06341 by Daeun Seo, Hoeseok Yang, Hyungshin Kim, Sihyeong Park.

**Figure 2.** Comparison on COCO val. The marker size indicates the maximum image resolution, which is 800×1333 for MS training. The base backbone network is R50. As depicted in…

**Figure 3.** Overview of Elastic-DETR. The image resolution is scaled according to the scale factor, which is obtained from the scale…

**Figure 4.** Scale loss produces a low value for high (low)-scale factor for small (large) objects. When y_up = 0, the input probability converges to τ_min/τ_max, indicating that scale loss can function effectively with scale factor clipping.

… is because up-scaling and down-scaling probabilities mutually appear, except in cases where the object is excessively small or large. Consequently, we introduce a continuous-valued…

**Figure 5.** Comparison between two forms of the loss function uti…

**Figure 6.** Distribution of scale factors based on the average of object sizes from the image. Normalized scale factors are utilized to obtain…

**Figure 8.** Example of encoder attention map.

Class-wise Comparison. Tab. 5 displays the class-wise performance gain compared to MS. Our method can enhance performance for most classes, except six classes.
Positive (total: 74): Baseball Bat (11.1%) / Remote (8.9%) / Toaster (7.3%) / Toothbrush (6.2%) / Tennis Racket (5.6%)
Negative (total: 6): Oven (-3.1%) / Refrigerator (-3.0%) / Hair Drier (-3.0%) / Cat (-1.5%) / Orange (…

**Figure 7.** Trained beta distribution and per-scale loss distribution on COCO

**Figure 9.** Examples of positive and negative classes.

**Figure 10.** Convergence of boundaries in early training iterations.

Original abstract

We propose Ciliary-DETR (previous name: Elastic-DETR), a framework for test-time resolution adjustment analogous to biological accommodation. While multi-scale data augmentation improves robustness to scale variation, modern detectors rely on fixed inference resolutions, potentially limiting flexibility and robustness. Similar to the ciliary muscle, we introduce a lightweight scale predictor that dynamically estimates test-time scale factors across a wide range of input scales. The core challenge is that the optimal input scale is inherently unobservable under standard training setups. To address this challenge, we introduce a parametric formulation of desired scaling behavior, leading to loss-driven objectives that guide scale optimization. Overall, our method enables flexible and efficient single-pass inference, bridging the gap between training-time robustness and test-time adaptation.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

The paper frames scale as a learnable test-time variable via a lightweight predictor and parametric loss, aiming for single-pass adaptive detection, but the approach needs concrete equations and results to verify it delivers.

The central move here is treating optimal input scale as unobservable during standard training and then using a parametric formulation to create loss objectives that train a small scale predictor for test-time use. This lets the detector adjust resolution on the fly without multiple forward passes or heavy augmentation at inference. The biological analogy to ciliary accommodation is just framing; the actual claim is that this bridges multi-scale training robustness with efficient single-pass inference.

Referee Report

0 major / 2 minor

Summary. The manuscript proposes Ciliary-DETR (formerly Elastic-DETR), a framework for test-time resolution adjustment in object detection analogous to biological accommodation. It introduces a lightweight scale predictor that dynamically estimates test-time scale factors across a wide range of input scales. The central technical contribution is a parametric formulation of desired scaling behavior that yields loss-driven objectives to train the predictor, addressing the claim that optimal input scale is inherently unobservable under standard training setups. The method is presented as enabling flexible and efficient single-pass inference that bridges training-time multi-scale robustness with test-time adaptation.

Significance. If the parametric formulation and resulting scale predictions prove effective, the work would provide a practical mechanism for adaptive single-pass inference without the overhead of multi-scale testing or post-hoc ensembling. This addresses a real deployment gap in modern detectors and could influence subsequent work on learnable input transformations. The internal logic of the approach is consistent and does not contain an evident circularity or unsupported leap once the concrete formulation is examined.

minor comments (2)

The abstract refers to both 'Ciliary-DETR' and the previous name 'Elastic-DETR'; a single consistent name should be used throughout, with any name change noted only in a footnote.
The claim that the approach 'bridges the gap between training-time robustness and test-time adaptation' would benefit from a brief quantitative comparison (e.g., mAP vs. latency) against standard multi-scale inference baselines in the results section.
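
The latency half of the comparison the referee asks for could be measured with a small harness like the one below. It is a sketch under assumptions: `detector(image, scale)` is a hypothetical callable, `choose_scale` stands in for the paper's scale predictor, and mAP would have to come from a separate evaluator that is not shown here.

```python
import time


def time_strategy(run, images, repeats=3):
    """Return mean wall-clock seconds per image for one inference strategy."""
    start = time.perf_counter()
    for _ in range(repeats):
        for image in images:
            run(image)
    return (time.perf_counter() - start) / (repeats * len(images))


def multi_scale_inference(detector, scales):
    """Baseline: one detector call per test scale for every image."""
    def run(image):
        return [detector(image, s) for s in scales]
    return run


def single_scale_inference(detector, choose_scale):
    """Adaptive variant: one detector call at a per-image chosen scale."""
    def run(image):
        return [detector(image, choose_scale(image))]
    return run
```

Passing both strategies to `time_strategy` with the same images gives the latency axis of an mAP-versus-latency plot; the multi-scale baseline issues one detector call per scale, while the adaptive variant issues one call per image.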

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for reviewing our manuscript. We appreciate the assessment that the internal logic is consistent, that there is no evident circularity, and that the approach could address a real deployment gap in modern detectors. The recommendation is listed as uncertain, yet the report contains no specific major comments or requests for clarification. We therefore provide no point-by-point responses and remain available to address any concrete questions the referee may wish to raise.

Circularity Check

0 steps flagged

No significant circularity

The paper's central mechanism introduces a parametric formulation of desired scaling behavior to derive loss-driven objectives for training a lightweight scale predictor, directly addressing the stated unobservability of optimal input scale under standard setups. This constitutes an independent modeling choice rather than any reduction of a claimed prediction or result to its own fitted inputs by construction. No load-bearing step relies on self-citation chains, uniqueness theorems imported from the authors' prior work, or renaming of known empirical patterns; the single-pass inference claim follows from the proposed architecture and training objectives without evident self-definitional equivalence. The derivation chain remains self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The method depends on the unobservability of optimal scale and on the existence of a parametric description of desired scaling that can be converted into usable training losses; both are introduced without external validation in the provided abstract.

axioms (1)

Domain assumption: The optimal input scale is inherently unobservable under standard training setups.
Explicitly stated as the core challenge that the parametric formulation is meant to solve.

invented entities (1)

Lightweight scale predictor (no independent evidence)
Purpose: dynamically estimate test-time scale factors.
New module introduced to perform the accommodation function; no independent evidence of its behavior outside the paper is supplied.

pith-pipeline@v0.9.0 · 5657 in / 1183 out tokens · 19101 ms · 2026-05-23T07:37:07.278643+00:00 · methodology

Appendix A. Discussions

A.1. Multi-scale Image Resolution

The preliminary experiment depicted in Fig. 1 was performed to assess the flexibility of the MS technique. The performance improvement ac...

[1] [1]

YOLOv4: Optimal Speed and Accuracy of Object Detection

Alexey Bochkovskiy, Chien-Yao Wang, and Hong- Yuan Mark Liao. Yolov4: Optimal speed and accuracy of object detection. arXiv preprint arXiv:2004.10934, 2020. 3

work page internal anchor Pith review Pith/arXiv arXiv 2004

[2] [2]

Cf-detr: Coarse-to-fine transformers for end-to-end object detection

Xipeng Cao, Peng Yuan, Bailan Feng, and Kun Niu. Cf-detr: Coarse-to-fine transformers for end-to-end object detection. In Proceedings of the AAAI conference on artificial intelli- gence, pages 185–193, 2022. 3

work page 2022

[3] [3]

End-to- end object detection with transformers

Nicolas Carion, Francisco Massa, Gabriel Synnaeve, Nicolas Usunier, Alexander Kirillov, and Sergey Zagoruyko. End-to- end object detection with transformers. In European confer- ence on computer vision, pages 213–229. Springer, 2020. 1, 3, 6

work page 2020

[4] [4]

Emerg- ing properties in self-supervised vision transformers

Mathilde Caron, Hugo Touvron, Ishan Misra, Herv ´e J´egou, Julien Mairal, Piotr Bojanowski, and Armand Joulin. Emerg- ing properties in self-supervised vision transformers. In Pro- ceedings of the IEEE/CVF international conference on com- puter vision, pages 9650–9660, 2021. 3

work page 2021

[5] [5]

A new coefficient of correlation

Sourav Chatterjee. A new coefficient of correlation. Jour- nal of the American Statistical Association, 116(536):2009– 2022, 2021. 5

work page 2009

[6] [6]

Scale-aware automatic augmentation for object detection

Yukang Chen, Yanwei Li, Tao Kong, Lu Qi, Ruihang Chu, Lei Li, and Jiaya Jia. Scale-aware automatic augmentation for object detection. In Proceedings of the IEEE/CVF con- ference on computer vision and pattern recognition , pages 9563–9572, 2021. 2

work page 2021

[7] [7]

AutoAugment: Learning Augmentation Policies from Data

Ekin D Cubuk, Barret Zoph, Dandelion Mane, Vijay Vasude- van, and Quoc V Le. Autoaugment: Learning augmentation policies from data. arXiv preprint arXiv:1805.09501, 2018. 2

work page internal anchor Pith review Pith/arXiv arXiv 2018

[8] [8]

Dynamic head: Unifying object detection heads with attentions

Xiyang Dai, Yinpeng Chen, Bin Xiao, Dongdong Chen, Mengchen Liu, Lu Yuan, and Lei Zhang. Dynamic head: Unifying object detection heads with attentions. In Proceed- ings of the IEEE/CVF conference on computer vision and pattern recognition, pages 7373–7382, 2021. 3

work page 2021

[9] [9]

Object detection in aerial im- ages: A large-scale benchmark and challenges

Jian Ding, Nan Xue, Gui-Song Xia, Xiang Bai, Wen Yang, Michael Ying Yang, Serge Belongie, Jiebo Luo, Mihai Datcu, Marcello Pelillo, et al. Object detection in aerial im- ages: A large-scale benchmark and challenges. IEEE trans- actions on pattern analysis and machine intelligence, 44(11): 7778–7796, 2021. 12

work page 2021

[10] [10]

Learning with a wasserstein loss

Charlie Frogner, Chiyuan Zhang, Hossein Mobahi, Mauricio Araya, and Tomaso A Poggio. Learning with a wasserstein loss. Advances in neural information processing systems, 28,

work page

[11] [11]

Rational decisions

Irving John Good. Rational decisions. Journal of the Royal Statistical Society: Series B (Methodological) , 14(1):107– 114, 1952. 4

work page 1952

[12] [12]

Dynamic neural networks: A sur- vey

Yizeng Han, Gao Huang, Shiji Song, Le Yang, Honghui Wang, and Yulin Wang. Dynamic neural networks: A sur- vey. IEEE Transactions on Pattern Analysis and Machine Intelligence, 44(11):7436–7456, 2021. 2

work page 2021

[13] [13]

Channel gating neural networks

Weizhe Hua, Yuan Zhou, Christopher M De Sa, Zhiru Zhang, and G Edward Suh. Channel gating neural networks. Ad- vances in Neural Information Processing Systems, 32, 2019. 2

work page 2019

[14] [14]

Dq-detr: Detr with dynamic query for tiny object detection

Yi-Xin Huang, Hou-I Liu, Hong-Han Shuai, and Wen-Huang Cheng. Dq-detr: Detr with dynamic query for tiny object detection. arXiv preprint arXiv:2404.03507, 2024. 3

work page arXiv 2024

[15] [15]

Dn-detr: Accelerate detr training by intro- ducing query denoising

Feng Li, Hao Zhang, Shilong Liu, Jian Guo, Lionel M Ni, and Lei Zhang. Dn-detr: Accelerate detr training by intro- ducing query denoising. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition , pages 13619–13627, 2022. 3, 5, 6

work page 2022

[16] [16]

Dynamic anchor feature selection for single-shot object detection

Shuai Li, Lingxiao Yang, Jianqiang Huang, Xian-Sheng Hua, and Lei Zhang. Dynamic anchor feature selection for single-shot object detection. In Proceedings of the IEEE/CVF international conference on computer vision , pages 6609–6618, 2019. 2

work page 2019

[17] Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C Lawrence Zitnick. Microsoft COCO: Common objects in context. In Computer Vision–ECCV 2014: 13th European Conference, Zurich, Switzerland, September 6-12, 2014, Proceedings, Part V 13, pages 740–755. Springer, 2014.

[18] Tsung-Yi Lin, Piotr Dollár, Ross Girshick, Kaiming He, Bharath Hariharan, and Serge Belongie. Feature pyramid networks for object detection. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 2117–2125, 2017.

[19] Tsung-Yi Lin, Priya Goyal, Ross Girshick, Kaiming He, and Piotr Dollár. Focal loss for dense object detection. In Proceedings of the IEEE international conference on computer vision, pages 2980–2988, 2017.

[20] Zhihao Lin, Yongtao Wang, Jinhe Zhang, and Xiaojie Chu. DynamicDet: A unified dynamic architecture for object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 6282–6291, 2023.

[21] Shilong Liu, Feng Li, Hao Zhang, Xiao Yang, Xianbiao Qi, Hang Su, Jun Zhu, and Lei Zhang. DAB-DETR: Dynamic anchor boxes are better queries for DETR. arXiv preprint arXiv:2201.12329, 2022.

[22] Wei Liu, Dragomir Anguelov, Dumitru Erhan, Christian Szegedy, Scott Reed, Cheng-Yang Fu, and Alexander C Berg. SSD: Single shot multibox detector. In Computer Vision–ECCV 2016: 14th European Conference, Amsterdam, The Netherlands, October 11–14, 2016, Proceedings, Part I 14, pages 21–37. Springer, 2016.

[23] Depu Meng, Xiaokang Chen, Zejia Fan, Gang Zeng, Houqiang Li, Yuhui Yuan, Lei Sun, and Jingdong Wang. Conditional DETR for fast training convergence. In Proceedings of the IEEE/CVF international conference on computer vision, pages 3651–3660, 2021.

[24] Mahyar Najibi, Bharat Singh, and Larry S Davis. Autofocus: Efficient multi-scale inference. In Proceedings of the IEEE/CVF international conference on computer vision, pages 9745–9755, 2019.

[25] Eunhyeok Park, Dongyoung Kim, Soobeom Kim, Yong-Deok Kim, Gunhee Kim, Sungroh Yoon, and Sungjoo Yoo. Big/little deep neural network for ultra low power inference. In 2015 International Conference on Hardware/Software Codesign and System Synthesis (CODES+ISSS), pages 124–132. IEEE, 2015.

[26] Joseph Redmon, Santosh Divvala, Ross Girshick, and Ali Farhadi. You only look once: Unified, real-time object detection. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 779–788, 2016.

[27] Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. Faster R-CNN: Towards real-time object detection with region proposal networks. Advances in neural information processing systems, 28, 2015.

[28] Tianhe Ren, Shilong Liu, Feng Li, Hao Zhang, Ailing Zeng, Jie Yang, Xingyu Liao, Ding Jia, Hongyang Li, He Cao, Jianan Wang, Zhaoyang Zeng, Xianbiao Qi, Yuhui Yuan, Jianwei Yang, and Lei Zhang. detrex: Benchmarking detection transformers, 2023.

[29] Bharat Singh and Larry S Davis. An analysis of scale invariance in object detection SNIP. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 3578–3587, 2018.

[30] Hwanjun Song, Deqing Sun, Sanghyuk Chun, Varun Jampani, Dongyoon Han, Byeongho Heo, Wonjae Kim, and Ming-Hsuan Yang. ViDT: An efficient and effective fully transformer-based object detector. arXiv preprint arXiv:2110.03921, 2021.

[31] Mingxing Tan and Quoc Le. EfficientNet: Rethinking model scaling for convolutional neural networks. In International conference on machine learning, pages 6105–6114. PMLR,

[32] Mingxing Tan, Ruoming Pang, and Quoc V Le. EfficientDet: Scalable and efficient object detection. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 10781–10790, 2020.

[33] Zhi Tian, Chunhua Shen, Hao Chen, and Tong He. FCOS: Fully convolutional one-stage object detection. In Proceedings of the IEEE/CVF international conference on computer vision, pages 9627–9636, 2019.

[34] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. Advances in neural information processing systems, 30, 2017.

[35] Chien-Yao Wang, Alexey Bochkovskiy, and Hong-Yuan Mark Liao. YOLOv7: Trainable bag-of-freebies sets new state-of-the-art for real-time object detectors. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 7464–7475, 2023.

[36] Fei Wang, Mengqing Jiang, Chen Qian, Shuo Yang, Cheng Li, Honggang Zhang, Xiaogang Wang, and Xiaoou Tang. Residual attention network for image classification. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 3156–3164, 2017.

[37] Huiyu Wang, Aniruddha Kembhavi, Ali Farhadi, Alan L Yuille, and Mohammad Rastegari. Elastic: Improving CNNs with dynamic scaling policies. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 2258–2267, 2019.

[38] Xin Wang, Yujia Luo, Daniel Crankshaw, Alexey Tumanov, Fisher Yu, and Joseph E Gonzalez. IDK cascades: Fast deep learning by learning not to overthink. arXiv preprint arXiv:1706.00885, 2017.

[39] Yulin Wang, Rui Huang, Shiji Song, Zeyi Huang, and Gao Huang. Not all images are worth 16x16 words: Dynamic transformers for efficient image recognition. Advances in neural information processing systems, 34:11960–11973,

[40] Yingming Wang, Xiangyu Zhang, Tong Yang, and Jian Sun. Anchor DETR: Query design for transformer-based detector. In Proceedings of the AAAI conference on artificial intelligence, pages 2567–2575, 2022.

[41] Le Yang, Yizeng Han, Xi Chen, Shiji Song, Jifeng Dai, and Gao Huang. Resolution adaptive networks for efficient inference. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 2369–2378,

[42] Xiaohua Zhai, Alexander Kolesnikov, Neil Houlsby, and Lucas Beyer. Scaling vision transformers. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 12104–12113, 2022.

[43] Chi Zhang, Lijuan Liu, Xiaoxue Zang, Frederick Liu, Hao Zhang, Xinying Song, and Jindong Chen. DETR++: Taming your multi-scale detection transformer. arXiv preprint arXiv:2206.02977, 2022.

[44] Gongjie Zhang, Zhipeng Luo, Yingchen Yu, Kaiwen Cui, and Shijian Lu. Accelerating DETR convergence via semantic-aligned matching. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 949–958, 2022.

[45] Gongjie Zhang, Zhipeng Luo, Zichen Tian, Jingyi Zhang, Xiaoqin Zhang, and Shijian Lu. Towards efficient use of multi-scale features in transformer-based object detectors. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 6206–6216, 2023.

[46] Lei Zhu, Zijun Deng, Xiaowei Hu, Chi-Wing Fu, Xuemiao Xu, Jing Qin, and Pheng-Ann Heng. Bidirectional feature pyramid network with recurrent attention residual modules for shadow detection. In Proceedings of the European Conference on Computer Vision (ECCV), pages 121–136, 2018.

[47] Mingjian Zhu, Kai Han, Enhua Wu, Qiulin Zhang, Ying Nie, Zhenzhong Lan, and Yunhe Wang. Dynamic resolution network. Advances in Neural Information Processing Systems, 34:27319–27330, 2021.

[48] Xizhou Zhu, Weijie Su, Lewei Lu, Bin Li, Xiaogang Wang, and Jifeng Dai. Deformable DETR: Deformable transformers for end-to-end object detection. arXiv preprint arXiv:2010.04159, 2020.

[49] Zhengxia Zou, Keyan Chen, Zhenwei Shi, Yuhong Guo, and Jieping Ye. Object detection in 20 years: A survey. Proceedings of the IEEE, 111(3):257–276, 2023.

Appendix

A. Discussions

A.1. Multi-scale Image Resolution

The preliminary experiment depicted in Fig. 1 was performed to assess the flexibility of the MS technique. The performance improvement ac...