Recognition: unknown
Adaptive Slicing-Assisted Hyper Inference for Enhanced Small Object Detection in High-Resolution Imagery
Pith reviewed 2026-05-10 02:11 UTC · model grok-4.3
The pith
By choosing the number of overlapping patches adaptively from image resolution, ASAHI improves small object detection in high-resolution aerial imagery while cutting inference time.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
ASAHI determines the optimal number of slices according to image resolution, thereby generating either 6 or 12 overlapping patches via a learned threshold; it augments training with both full-resolution and sliced patches and applies Cluster-DIoU-NMS to merge detections in crowded scenes, yielding 56.8 percent mAP on VisDrone2019-DET-val and 22.7 percent on xView-test at reduced inference cost.
What carries the argument
The adaptive resolution-aware slicing algorithm that uses a learned threshold to decide between 6 and 12 overlapping patches, replacing fixed slice sizes to balance overlap and redundant computation.
If this is right
- Detection accuracy reaches state-of-the-art levels on the VisDrone and xView benchmarks.
- Inference time drops 20-25 percent compared with fixed-size slicing baselines.
- Training on the mixture of full and sliced patches improves the detector's handling of small targets.
- Cluster-DIoU-NMS reduces duplicate boxes more reliably in dense object clusters than either component alone.
- Beneficial patch overlap is retained while overall redundant computation is lowered.
Where Pith is reading between the lines
- The same resolution-driven patch count rule could be applied to video streams where object scale changes frame to frame.
- The learned threshold might be re-optimized per camera type or altitude to match typical object densities.
- Similar adaptive slicing could speed up high-resolution segmentation or instance counting tasks outside aerial domains.
- Lower compute cost opens the possibility of running these detectors on embedded hardware without accuracy loss.
Load-bearing premise
Dynamically selecting 6 or 12 patches from a learned resolution threshold will always keep enough overlap to find small objects while trimming unnecessary work across every scene type and density level.
What would settle it
A test set of varied-resolution aerial images where the adaptive method either misses more small objects than fixed slicing or shows no reduction in measured inference time would falsify the central efficiency claim.
Figures
read the original abstract
Deep learning-based object detectors have achieved remarkable success across numerous computer vision applications, yet they continue to struggle with small object detection in high-resolution aerial and satellite imagery, where dense object distributions, variable shooting angles, diminutive target sizes, and substantial inter-class variability pose formidable challenges. Existing slicing strategies that partition high-resolution images into manageable patches have demonstrated promising results for enlarging the effective receptive field of small targets; however, their reliance on fixed slice dimensions introduces significant redundant computation, inflating inference cost and undermining detection speed. In this paper, we propose \textbf{Adaptive Slicing-Assisted Hyper Inference (ASAHI)}, a novel slicing framework that shifts the paradigm from prescribing a fixed slice size to adaptively determining the optimal number of slices according to image resolution, thereby substantially mitigating redundant computation while preserving beneficial overlap between adjacent patches. ASAHI integrates three synergistic components: (1)an adaptive resolution-aware slicing algorithm that dynamically generates 6 or 12 overlapping patches based on a learned threshold, (2)a slicing-assisted fine-tuning (SAF) strategy that constructs augmented training data comprising both full-resolution and sliced image patches, and (3)a Cluster-DIoU-NMS (CDN) post-processing module that combines the geometric merging efficiency of Cluster-NMS with the center-distance-aware suppression of DIoU-NMS to achieve robust duplicate elimination in crowded scenes. Extensive experiments on VisDrone2019 and xView, demonstrate that ASAHI achieves state-of-the-art performance with 56.8% on VisDrone2019-DET-val and 22.7% on xView-test, while reducing inference time by 20-25% compared to the baseline SAHI method.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces Adaptive Slicing-Assisted Hyper Inference (ASAHI) for small-object detection in high-resolution aerial imagery. It replaces fixed-size slicing with an adaptive resolution-aware algorithm that selects either 6 or 12 overlapping patches according to a learned threshold, augments training via slicing-assisted fine-tuning (SAF), and applies Cluster-DIoU-NMS (CDN) for post-processing. The central empirical claim is that ASAHI reaches 56.8% mAP on VisDrone2019-DET-val and 22.7% mAP on xView-test while cutting inference time 20-25% relative to the SAHI baseline.
Significance. If the reported gains are reproducible and the adaptive threshold generalizes, the work would offer a practical improvement over fixed slicing methods by reducing redundant computation while maintaining overlap for small targets. The combination of SAF and CDN also targets training and crowded-scene issues that are common in aerial detection benchmarks.
major comments (3)
- [Abstract] Abstract: the headline SOTA numbers (56.8% VisDrone, 22.7% xView) and the 20-25% speed-up are stated without any reference to the base detector, full baseline list, or ablation results that isolate the contribution of the adaptive slicing component versus SAF and CDN.
- [Abstract] Abstract (adaptive resolution-aware slicing paragraph): the method relies on a learned threshold that selects between exactly two discrete patch counts (6 or 12); no value of the threshold, training procedure for it, or density-conditioned ablation is supplied, leaving the central assumption that resolution alone suffices to preserve overlap and avoid missed small objects untested.
- [Abstract] Abstract: the manuscript reports no error analysis or failure-case study on scenes with varying object density, which directly bears on whether the claimed accuracy and speed improvements hold when the fixed 6/12 choices under-slice dense regions or over-slice sparse ones.
minor comments (1)
- [Abstract] Abstract: formatting inconsistency in the component list ('(1)an' lacks space after the parenthesis).
Simulated Author's Rebuttal
We thank the referee for the constructive comments on the abstract. We agree that additional context will improve clarity and will revise the abstract accordingly while preserving its conciseness. We address each point below.
read point-by-point responses
-
Referee: [Abstract] Abstract: the headline SOTA numbers (56.8% VisDrone, 22.7% xView) and the 20-25% speed-up are stated without any reference to the base detector, full baseline list, or ablation results that isolate the contribution of the adaptive slicing component versus SAF and CDN.
Authors: The abstract summarizes key outcomes with reference to the SAHI baseline. The base detector, complete baseline comparisons, and component-wise ablations (isolating adaptive slicing, SAF, and CDN) are provided in Sections 4.1–4.3 and Table 3. We will revise the abstract to explicitly name the base detector and note that ablations confirm the individual contributions. revision: yes
-
Referee: [Abstract] Abstract (adaptive resolution-aware slicing paragraph): the method relies on a learned threshold that selects between exactly two discrete patch counts (6 or 12); no value of the threshold, training procedure for it, or density-conditioned ablation is supplied, leaving the central assumption that resolution alone suffices to preserve overlap and avoid missed small objects untested.
Authors: The learned threshold and its training procedure are detailed in Section 3.1. The abstract omits the numerical value and procedure for brevity. We will add a concise clause specifying the threshold and training approach. Our experiments already evaluate performance across datasets with varying densities; we will incorporate an explicit density-conditioned ablation summary into the revised abstract and experiments section. revision: yes
-
Referee: [Abstract] Abstract: the manuscript reports no error analysis or failure-case study on scenes with varying object density, which directly bears on whether the claimed accuracy and speed improvements hold when the fixed 6/12 choices under-slice dense regions or over-slice sparse ones.
Authors: The abstract is space-constrained and therefore omits detailed error analysis. The manuscript provides qualitative results and density-stratified quantitative evaluations in Section 4.4 and the supplementary material. We will revise the abstract to include a brief summary of robustness across densities and expand the failure-case discussion in the main text. revision: yes
Circularity Check
No significant circularity: empirical claims rest on external benchmarks
full rationale
The paper proposes ASAHI with three components—an adaptive resolution-aware slicing algorithm using a learned threshold to select 6 or 12 patches, a slicing-assisted fine-tuning strategy, and a Cluster-DIoU-NMS post-processor—and reports empirical SOTA results (56.8% on VisDrone2019-DET-val, 22.7% on xView-test) plus 20-25% faster inference than SAHI. These performance numbers are obtained via standard evaluation on held-out external datasets rather than any derivation, equation, or fitted parameter that reduces to the method's own inputs by construction. No mathematical first-principles chain, self-definitional quantities, or load-bearing self-citations appear in the provided text; the learned threshold is an internal design choice whose effect is measured externally, not presupposed.
Axiom & Free-Parameter Ledger
free parameters (1)
- learned threshold for slice count
axioms (1)
- domain assumption Training on a mixture of full-resolution and sliced patches improves small-object detection performance
Reference graph
Works this paper leans on
-
[1]
Slicing aided hyper inference and fine-tuning for small object detection
Fatih Cagatay Akyon, Sinan Onur Altinuc, and Alptekin Temizel. Slicing aided hyper inference and fine-tuning for small object detection. InProceedings of the IEEE Interna- tional Conference on Image Processing (ICIP), pages 966–
-
[2]
SOD-MTGAN: Small object detection via multi- task generative adversarial network
Yancheng Bai, Yongqiang Zhang, Mingli Ding, and Bernard Ghanem. SOD-MTGAN: Small object detection via multi- task generative adversarial network. InProceedings of the 8 European Conference on Computer Vision (ECCV), pages 210–226. Springer, 2018
2018
-
[3]
YOLOv4: Optimal Speed and Accuracy of Object Detection
Alexey Bochkovskiy, Chien-Yao Wang, and Hong- Yuan Mark Liao. YOLOv4: Optimal speed and accuracy of object detection.arXiv preprint arXiv:2004.10934, 2020
work page internal anchor Pith review arXiv 2004
-
[4]
Soft-NMS–improving object detection with one line of code
Navaneeth Bodla, Bharat Singh, Rama Chellappa, and Larry S Davis. Soft-NMS–improving object detection with one line of code. InProceedings of the IEEE International Conference on Computer Vision (ICCV), pages 5562–5570. IEEE, 2017
2017
-
[5]
VisDrone-DET2021: The vision meets drone object detection challenge results
Yuren Cao, Zhijian He, Longjia Wang, Wengao Wang, Yix- ian Yuan, Daiwei Zhang, Jialiang Zhang, Pengfei Zhu, Luc Van Gool, Junwei Han, et al. VisDrone-DET2021: The vision meets drone object detection challenge results. In Proceedings of the IEEE/CVF International Conference on Computer Vision Workshops (ICCVW), pages 2847–2854. IEEE, 2021
2021
-
[6]
End- to-end object detection with transformers
Nicolas Carion, Francisco Massa, Gabriel Synnaeve, Nicolas Usunier, Alexander Kirillov, and Sergey Zagoruyko. End- to-end object detection with transformers. InProceedings of the European Conference on Computer Vision (ECCV), pages 213–229. Springer, 2020
2020
-
[7]
Diffu- sionDet: Diffusion model for object detection
Shoufa Chen, Peize Sun, Yibing Song, and Ping Luo. Diffu- sionDet: Diffusion model for object detection. InProceed- ings of the IEEE/CVF International Conference on Com- puter Vision (ICCV), pages 19830–19843. IEEE, 2023
2023
- [8]
-
[9]
A global-local self- adaptive network for drone-view object detection.IEEE Transactions on Image Processing, 30:1556–1569, 2021
Sutao Deng, Shuai Li, Kai Xie, Wenfeng Song, Xiang- wen Liao, Aimin Hao, and Hong Qin. A global-local self- adaptive network for drone-view object detection.IEEE Transactions on Image Processing, 30:1556–1569, 2021
2021
-
[10]
CSWin Transformer: A general vision transformer backbone with cross-shaped windows
Xiaoyi Dong, Jianmin Bao, Dongdong Chen, Weiming Zhang, Nenghai Yu, Lu Yuan, Dong Chen, and Baining Guo. CSWin Transformer: A general vision transformer backbone with cross-shaped windows. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 12114–12124. IEEE, 2022
2022
-
[11]
An image is worth 16×16 words: Trans- formers for image recognition at scale
Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Syl- vain Gelly, et al. An image is worth 16×16 words: Trans- formers for image recognition at scale. InProceedings of the International Conference on Learning Representations (ICLR), 2021
2021
-
[12]
VisDrone-DET2019: The vi- sion meets drone object detection in image challenge results
Dawei Du, Pengfei Zhu, Longyin Wen, Xiao Bian, Haibin Lin, Qinghua Hu, Tao Peng, Jiangyu Zheng, Xiangyu Wang, Yifan Zhang, et al. VisDrone-DET2019: The vi- sion meets drone object detection in image challenge results. InProceedings of the IEEE/CVF International Conference on Computer Vision Workshops (ICCVW), pages 213–226. IEEE, 2019
2019
-
[13]
Ad- vancing sequential numerical prediction in autoregressive models
Xingjian Fei, Jinghui Lu, Quan Sun, Hao Feng, Yanjie Wang, Wei Shi, Anlu Wang, Jingqun Tang, and Can Huang. Ad- vancing sequential numerical prediction in autoregressive models. InProceedings of the Annual Meeting of the As- sociation for Computational Linguistics (ACL), 2025
2025
-
[14]
TOOD: Task-aligned one-stage object detection
Chengjian Feng, Yujie Zhong, Yu Gao, Matthew R Scott, and Weilin Huang. TOOD: Task-aligned one-stage object detection. InProceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 3490–3499. IEEE, 2021
2021
-
[15]
DocPedia: Unleashing the power of large multimodal model in the frequency domain for ver- satile document understanding
Hao Feng, Qi Liu, Hao Liu, Jingqun Tang, Wenqing Zhou, Hezhi Li, and Can Huang. DocPedia: Unleashing the power of large multimodal model in the frequency domain for ver- satile document understanding. 2024
2024
-
[16]
Hao Feng, Wei Shi, Kai Zhang, Xingjian Fei, Liangtao Liao, Dingkang Yang, Yingying Du, Xuecheng Wu, Jingqun Tang, Yang Liu, et al. Dolphin-v2: Universal document parsing via scalable anchor prompting.arXiv preprint arXiv:2602.05384, 2026
-
[17]
UniDoc: A universal large multimodal model for simultaneous text detection, recogni- tion, spotting and understanding
Hao Feng, Zijian Wang, Jingqun Tang, Jinghui Lu, Wenqing Zhou, Hezhi Li, and Can Huang. UniDoc: A universal large multimodal model for simultaneous text detection, recogni- tion, spotting and understanding. 2023
2023
-
[18]
Dolphin: Document image parsing via heterogeneous anchor prompting
Hao Feng, Shuai Wei, Xingjian Fei, Wei Shi, Yunhao Han, Liangtao Liao, Jinghui Lu, Binghong Wu, Qi Liu, Chunhui Lin, Jingqun Tang, et al. Dolphin: Document image parsing via heterogeneous anchor prompting. InFindings of the As- sociation for Computational Linguistics: ACL, pages 21919– 21936, 2025
2025
-
[19]
Rich feature hierarchies for accurate object detec- tion and semantic segmentation
Ross Girshick, Jeff Donahue, Trevor Darrell, and Jitendra Malik. Rich feature hierarchies for accurate object detec- tion and semantic segmentation. InProceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 580–587. IEEE, 2014
2014
-
[20]
Mask R-CNN
Kaiming He, Georgia Gkioxari, Piotr Doll ´ar, and Ross Gir- shick. Mask R-CNN. InProceedings of the IEEE Interna- tional Conference on Computer Vision (ICCV), pages 2980–
-
[21]
Spatial pyramid pooling in deep convolutional networks for visual recognition.IEEE Transactions on Pattern Analysis and Machine Intelligence, 37(9):1904–1916, 2015
Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Spatial pyramid pooling in deep convolutional networks for visual recognition.IEEE Transactions on Pattern Analysis and Machine Intelligence, 37(9):1904–1916, 2015
1904
-
[22]
Coordinate atten- tion for efficient mobile network design
Qibin Hou, Daquan Zhou, and Jiashi Feng. Coordinate atten- tion for efficient mobile network design. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 13708–13717. IEEE, 2021
2021
-
[23]
MinDEV: Multi-modal integrated diffusion framework for video reconstruction from EEG sig- nals
Shuo Huang, Yuxuan Wang, Hanchi Luo, Huanyu Jing, Chao Qin, and Jingqun Tang. MinDEV: Multi-modal integrated diffusion framework for video reconstruction from EEG sig- nals. InProceedings of the ACM International Conference on Multimedia, pages 3350–3359, 2025
2025
-
[24]
Ultr- alytics YOLOv8
Glenn Jocher, Ayush Chaurasia, and Jing Qiu. Ultr- alytics YOLOv8. 2023.https://github.com/ ultralytics/ultralytics
2023
-
[25]
Ultralyt- ics/YOLOv5: v5.0–YOLOv5-P6 1280 models, AWS, Super- vise.ly and YouTube integrations, 2021
Glenn Jocher, Alex Stoken, Jiri Borovec, et al. Ultralyt- ics/YOLOv5: v5.0–YOLOv5-P6 1280 models, AWS, Super- vise.ly and YouTube integrations, 2021
2021
-
[26]
Augmentation for small object detection
M ´at´e Kisantal, Zbigniew Wojna, Jakub Muber, Jakub Jezier- ski, and Jarosław Kowalczyk. Augmentation for small object detection. InProceedings of the International Conference on Advances in Computer Vision, 2019. 9
2019
-
[27]
Focus-and-detect: A small object detection framework for aerial images.Signal Processing: Image Communication, 104:116675, 2022
Onur Can Koyun, Ramazan Kadir Keser, ˙Ibrahim Batuhan Akkaya, and Bu ˘gra Ufuk T ¨oreyin. Focus-and-detect: A small object detection framework for aerial images.Signal Processing: Image Communication, 104:116675, 2022
2022
-
[28]
xview: Objects in context in overhead imagery,
Darius Lam, Richard Kuzma, Kevin McGee, Samuel Doo- ley, Michael Laielli, Matthew Klaric, Yaroslav Bulatov, and Brendan McCord. xview: Objects in context in overhead imagery.arXiv preprint arXiv:1802.07856, 2018
-
[29]
Density map guided object detection in aerial images
Changlin Li, Taojiannan Yang, Sijie Zhu, Chen Chen, and Shanyue Guan. Density map guided object detection in aerial images. InProceedings of the IEEE/CVF Confer- ence on Computer Vision and Pattern Recognition Work- shops (CVPRW), pages 737–746. IEEE, 2020
2020
-
[30]
DN-DETR: Accelerate DETR training by in- troducing query denoising
Feng Li, Hao Zhang, Shilong Liu, Jian Guo, Lionel M Ni, and Lei Zhang. DN-DETR: Accelerate DETR training by in- troducing query denoising. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 13619–13627. IEEE, 2022
2022
-
[31]
Feature pyramid networks for object detection
Tsung-Yi Lin, Piotr Doll ´ar, Ross Girshick, Kaiming He, Bharath Hariharan, and Serge Belongie. Feature pyramid networks for object detection. InProceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 936–944. IEEE, 2017
2017
-
[32]
Focal loss for dense object detection
Tsung-Yi Lin, Priya Goyal, Ross Girshick, Kaiming He, and Piotr Doll´ar. Focal loss for dense object detection. InPro- ceedings of the IEEE International Conference on Computer Vision (ICCV), pages 2980–2988. IEEE, 2017
2017
-
[33]
Path aggregation network for instance segmentation
Shu Liu, Lu Qi, Haifang Qin, Jianping Shi, and Jiaya Jia. Path aggregation network for instance segmentation. InPro- ceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 8759–8768. IEEE, 2018
2018
-
[34]
Grounding DINO: Marrying DINO with grounded pre-training for open-set object detection
Shilong Liu, Zhaoyang Zeng, Tianhe Ren, Feng Li, Hao Zhang, Jie Yang, Chunyuan Li, Jianwei Yang, Hang Su, Jun Zhu, et al. Grounding DINO: Marrying DINO with grounded pre-training for open-set object detection. InProceedings of the European Conference on Computer Vision (ECCV), pages 38–55. Springer, 2024
2024
-
[35]
SSD: Single shot multibox detector
Wei Liu, Dragomir Anguelov, Dumitru Erhan, Christian Szegedy, Scott Reed, Cheng-Yang Fu, and Alexander C Berg. SSD: Single shot multibox detector. InProceedings of the European Conference on Computer Vision (ECCV), pages 21–37. Springer, 2016
2016
-
[36]
SPTS v2: Single-point scene text spotting.IEEE Transactions on Pattern Analysis and Machine Intelligence, 45(12):15047–15063, 2023
Yuliang Liu, Jiaxin Zhang, Dezhi Peng, Mingxin Huang, Xinyu Wang, Jingqun Tang, Can Huang, Dahua Lin, et al. SPTS v2: Single-point scene text spotting.IEEE Transactions on Pattern Analysis and Machine Intelligence, 45(12):15047–15063, 2023
2023
-
[37]
Swin Transformer: Hierarchical vision transformer using shifted windows
Ze Liu, Yutong Lin, Yue Cao, Han Hu, Yixuan Wei, Zheng Zhang, Stephen Lin, and Baining Guo. Swin Transformer: Hierarchical vision transformer using shifted windows. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 10012–10022. IEEE, 2021
2021
-
[38]
A bounding box is worth one token— interleaving layout and text in a large language model for document understanding
Jinghui Lu, Haiyang Yu, Yanjie Wang, Yongjie Ye, Jingqun Tang, Ziwei Yang, Binghong Wu, Qi Liu, Hao Feng, Han Wang, et al. A bounding box is worth one token— interleaving layout and text in a large language model for document understanding. InFindings of the Association for Computational Linguistics: ACL, pages 7252–7273, 2025
2025
-
[39]
Meta-DermDiagnosis: Few-shot skin disease identification using meta-learning
Kushagra Mahajan, Monika Sharma, and Lovekesh Vig. Meta-DermDiagnosis: Few-shot skin disease identification using meta-learning. InProceedings of the IEEE/CVF Con- ference on Computer Vision and Pattern Recognition Work- shops (CVPRW), pages 730–731. IEEE, 2020
2020
-
[40]
Better to follow, follow to be better: Towards precise supervision of feature super-resolution for small ob- ject detection
Junhyug Noh, Wonho Bae, Wonhee Lee, Jinhwan Seo, and Gunhee Kim. Better to follow, follow to be better: Towards precise supervision of feature super-resolution for small ob- ject detection. InProceedings of the IEEE/CVF Interna- tional Conference on Computer Vision (ICCV), pages 9725–
-
[41]
Power of tiling and merging in small object detection
Fatih ¨Ozge Unel, Burak Osman Ozkalayci, and Cevahir Cigla. Power of tiling and merging in small object detection. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), pages 0–0. IEEE, 2019
2019
-
[42]
De- tectoRS: Detecting objects with recursive feature pyramid and switchable atrous convolution
Siyuan Qiao, Liang-Chieh Chen, and Alan Yuille. De- tectoRS: Detecting objects with recursive feature pyramid and switchable atrous convolution. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 10208–10219. IEEE, 2021
2021
-
[43]
You only look once: Unified, real-time object de- tection
Joseph Redmon, Santosh Divvala, Ross Girshick, and Ali Farhadi. You only look once: Unified, real-time object de- tection. InProceedings of the IEEE Conference on Com- puter Vision and Pattern Recognition (CVPR), pages 779–
-
[44]
YOLO9000: Better, faster, stronger
Joseph Redmon and Ali Farhadi. YOLO9000: Better, faster, stronger. InProceedings of the IEEE Conference on Com- puter Vision and Pattern Recognition (CVPR), pages 7263–
-
[45]
Faster R-CNN: Towards real-time object detection with re- gion proposal networks.IEEE Transactions on Pattern Anal- ysis and Machine Intelligence, 39(6):1137–1149, 2017
Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. Faster R-CNN: Towards real-time object detection with re- gion proposal networks.IEEE Transactions on Pattern Anal- ysis and Machine Intelligence, 39(6):1137–1149, 2017
2017
-
[46]
Biluo Shan, Xingjian Fei, Wei Shi, Anlu Wang, Guozhi Tang, Liangtao Liao, Jingqun Tang, Xiang Bai, and Can Huang. MCTBench: Multimodal cognition towards text-rich visual scenes benchmark.arXiv preprint arXiv:2410.11538, 2024
-
[47]
Weighted boxes fusion: Ensembling boxes from different object detection models.Image and Vision Computing, 107:104117, 2021
Roman Solovyev, Weimin Wang, and Tatiana Gabruseva. Weighted boxes fusion: Ensembling boxes from different object detection models.Image and Vision Computing, 107:104117, 2021
2021
-
[48]
HIT-UA V: A high-altitude in- frared thermal dataset for unmanned aerial vehicle-based ob- ject detection.Scientific Data, 10(1):227, 2023
Jiashun Suo, Tianyi Wang, Xingzhou Zhang, Haiyang Chen, Wei Zhou, and Weisong Shi. HIT-UA V: A high-altitude in- frared thermal dataset for unmanned aerial vehicle-based ob- ject detection.Scientific Data, 10(1):227, 2023
2023
-
[49]
EfficientDet: Scalable and efficient object detection
Mingxing Tan, Ruoming Pang, and Quoc V Le. EfficientDet: Scalable and efficient object detection. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 10778–10787. IEEE, 2020
2020
-
[50]
Character recognition competition for street view shop signs.National Science Review, 10(6):nwad141, 2023
Jingqun Tang, Wei Du, Bochao Wang, Wenqing Zhou, Song Mei, Tao Xue, Xiang Xu, and Haoyue Zhang. Character recognition competition for street view shop signs.National Science Review, 10(6):nwad141, 2023. 10
2023
-
[51]
TextSquare: Scaling up text-centric visual instruction tuning,
Jingqun Tang, Chunhui Lin, Zhen Zhao, Shuai Wei, Binghong Wu, Qi Liu, Yang He, Kang Lu, Hao Feng, Yu Li, et al. TextSquare: Scaling up text-centric visual instruction tuning.arXiv preprint arXiv:2404.12803, 2024
-
[52]
MTVQA: Benchmarking multilingual text-centric vi- sual question answering
Jingqun Tang, Qi Liu, Yongjie Ye, Jinghui Lu, Shuai Wei, Anlu Wang, Chunhui Lin, Hao Feng, Zhen Zhao, et al. MTVQA: Benchmarking multilingual text-centric vi- sual question answering. InFindings of the Association for Computational Linguistics: ACL, pages 7748–7763, 2025
2025
-
[53]
Optimal boxes: Boosting end-to- end scene text recognition by adjusting annotated bounding boxes via reinforcement learning
Jingqun Tang, Wenqiang Qian, Lei Song, Xiaobin Dong, Lingxi Li, and Xiang Bai. Optimal boxes: Boosting end-to- end scene text recognition by adjusting annotated bounding boxes via reinforcement learning. InProceedings of the Eu- ropean Conference on Computer Vision (ECCV), pages 233–
-
[54]
You can even annotate text with voice: Transcription-only-supervised text spotting
Jingqun Tang, Shaobo Qiao, Benlei Cui, Yuhang Ma, Shengjie Zhang, and Dimitrios Kanoulas. You can even annotate text with voice: Transcription-only-supervised text spotting. InProceedings of the 30th ACM International Con- ference on Multimedia, pages 4154–4163, 2022
2022
-
[55]
Few could be better than all: Feature sampling and grouping for scene text detec- tion
Jingqun Tang, Wenqing Zhang, Hao Liu, Min-Ke Yang, Bo Jiang, Guangliang Hu, and Xiang Bai. Few could be better than all: Feature sampling and grouping for scene text detec- tion. InProceedings of the IEEE/CVF Conference on Com- puter Vision and Pattern Recognition (CVPR), pages 4563–
-
[56]
Recent advances in small object detection based on deep learning: A review
Kang Tong, Yiquan Wu, and Fei Zhou. Recent advances in small object detection based on deep learning: A review. Image and Vision Computing, 97:103910, 2020
2020
-
[57]
Attention is all you need
Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszko- reit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. InAdvances in Neu- ral Information Processing Systems (NeurIPS), volume 30, 2017
2017
-
[58]
PARGO: Bridging vision-language with partial and global views
Anlu Wang, Biluo Shan, Wei Shi, Kai-Yun Lin, Xingjian Fei, Guozhi Tang, Liangtao Liao, Jingqun Tang, Can Huang, et al. PARGO: Bridging vision-language with partial and global views. InProceedings of the AAAI Conference on Ar- tificial Intelligence, 2025
2025
-
[59]
Anlu Wang, Jingqun Tang, Liangtao Liao, Hao Feng, Qi Liu, Xingjian Fei, Jinghui Lu, Han Wang, Hao Liu, Yang Liu, et al. WildDoc: How far are we from achieving comprehen- sive and robust document understanding in the wild? InPro- ceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP), 2025
2025
-
[60]
YOLOv7: Trainable bag-of-freebies sets new state-of-the-art for real-time object detectors
Chien-Yao Wang, Alexey Bochkovskiy, and Hong- Yuan Mark Liao. YOLOv7: Trainable bag-of-freebies sets new state-of-the-art for real-time object detectors. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 7464–7475. IEEE, 2023
2023
-
[61]
A normal- ized Gaussian Wasserstein distance for tiny object detection
Jinwang Wang, Chang Xu, Wen Yang, and Lei Yu. A normal- ized Gaussian Wasserstein distance for tiny object detection. arXiv preprint arXiv:2110.13389, 2021
-
[62]
Tao Wang, Yang Chen, Munan Qiao, and Hichem Snoussi. A fast and robust convolutional neural network-based de- fect detection model in product quality control.The Inter- national Journal of Advanced Manufacturing Technology, 94(9):3465–3471, 2018
2018
-
[63]
A-Fast-RCNN: Hard positive generation via adversary for object detection
Xiaolong Wang, Abhinav Shrivastava, and Abhinav Gupta. A-Fast-RCNN: Hard positive generation via adversary for object detection. InProceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 3039–3048. IEEE, 2017
2017
-
[64]
Object detec- tion using clustering algorithm adaptive searching regions in aerial images
Yang Wang, Yuliang Yang, and Xin Zhao. Object detec- tion using clustering algorithm adaptive searching regions in aerial images. InProceedings of the European Conference on Computer Vision Workshops (ECCVW), pages 651–664. Springer, 2020
2020
-
[65]
CBAM: Convolutional block attention module
Sanghyun Woo, Jongchan Park, Joon-Young Lee, and In So Kweon. CBAM: Convolutional block attention module. In Proceedings of the European Conference on Computer Vi- sion (ECCV), pages 3–19. Springer, 2018
2018
-
[66]
PP-YOLOE: An evolved version of YOLO
Shangliang Xu, Xinxin Wang, Wenyu Lv, Qinyao Chang, Cheng Cui, Kaipeng Deng, Guanzhong Wang, Qingqing Dang, Shengyu Wei, Yuning Du, et al. PP-YOLOE: An evolved version of YOLO.arXiv preprint arXiv:2203.16250, 2022
-
[67]
Query- Det: Cascaded sparse query for accelerating high-resolution small object detection
Chenhongyi Yang, Zehao Huang, and Naiyan Wang. Query- Det: Cascaded sparse query for accelerating high-resolution small object detection. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 13658–13667. IEEE, 2022
2022
-
[68]
Clustered object detection in aerial images
Fan Yang, Heng Fan, Peng Chu, Erik Blasch, and Haibin Ling. Clustered object detection in aerial images. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 8310–8319. IEEE, 2019
2019
-
[69]
DINO: DETR with improved denoising anchor boxes for end-to-end object detection
Hao Zhang, Feng Li, Shilong Liu, Lei Zhang, Hang Su, Jun Zhu, Lionel M Ni, and Heung-Yeung Shum. DINO: DETR with improved denoising anchor boxes for end-to-end object detection. InProceedings of the International Conference on Learning Representations (ICLR), 2023
2023
-
[70]
FFCA-YOLO for small object detection in remote sensing images.IEEE Transactions on Geoscience and Remote Sensing, 62:5611215, 2024
Yuting Zhang, Mang Ye, Jianbing Zhu, Siming Liu, Lei Zhang, and Bo Du. FFCA-YOLO for small object detection in remote sensing images.IEEE Transactions on Geoscience and Remote Sensing, 62:5611215, 2024
2024
-
[71]
TabPedia: Towards comprehensive visual table understanding with concept synergy
Weichao Zhao, Hao Feng, Qi Liu, Jingqun Tang, Shuai Wei, Binghong Wu, Liangtao Liao, Yongjie Ye, Hao Liu, Wenqing Zhou, et al. TabPedia: Towards comprehensive visual table understanding with concept synergy. InAdvances in Neural Information Processing Systems (NeurIPS), 2024
2024
-
[72]
Multi-modal in-context learning makes an ego-evolving scene text recognizer
Zhen Zhao, Jingqun Tang, Binghong Wu, Chunhui Lin, Hao Liu, Zhizhong Zhang, Xin Tan, Can Huang, and Yuan Xie. Multi-modal in-context learning makes an ego-evolving scene text recognizer. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 15230–15241. IEEE, 2024
2024
-
[73]
Harmonizing visual text comprehension and genera- tion
Zhen Zhao, Jingqun Tang, Binghong Wu, Chunhui Lin, Shuai Wei, Hao Liu, Xin Tan, Zhizhong Zhang, Can Huang, et al. Harmonizing visual text comprehension and genera- tion. InAdvances in Neural Information Processing Systems (NeurIPS), 2024
2024
-
[74]
Distance-IoU loss: Faster and better 11 learning for bounding box regression
Zhaohui Zheng, Ping Wang, Wei Liu, Jinze Li, Rongguang Ye, and Dongwei Ren. Distance-IoU loss: Faster and better 11 learning for bounding box regression. InProceedings of the AAAI Conference on Artificial Intelligence, volume 34, pages 12993–13000, 2020
2020
-
[75]
Enhancing ge- ometric factors in model learning and inference for object detection and instance segmentation.IEEE Transactions on Cybernetics, 52(8):8574–8586, 2022
Zhaohui Zheng, Ping Wang, Dongwei Ren, Wei Liu, Rong- guang Ye, Qinghua Hu, and Wangmeng Zuo. Enhancing ge- ometric factors in model learning and inference for object detection and instance segmentation.IEEE Transactions on Cybernetics, 52(8):8574–8586, 2022
2022
-
[76]
SSA-CNN: Semantic self-attention CNN for pedestrian detection.arXiv preprint arXiv:1902.09080, 2019
Chengji Zhou, Meiqing Wu, and Siew-Kei Lam. SSA-CNN: Semantic self-attention CNN for pedestrian detection.arXiv preprint arXiv:1902.09080, 2019
-
[77]
TPH- YOLOv5: Improved YOLOv5 based on transformer predic- tion head for object detection on drone-captured scenarios
Xingkui Zhu, Shuchang Lyu, Xu Wang, and Qi Zhao. TPH- YOLOv5: Improved YOLOv5 based on transformer predic- tion head for object detection on drone-captured scenarios. InProceedings of the IEEE/CVF International Conference on Computer Vision Workshops (ICCVW), pages 2778–2788. IEEE, 2021
2021
-
[78]
Deformable DETR: Deformable transform- ers for end-to-end object detection
Xizhou Zhu, Weijie Su, Lewei Lu, Bin Li, Xiaogang Wang, and Jifeng Dai. Deformable DETR: Deformable transform- ers for end-to-end object detection. InProceedings of the In- ternational Conference on Learning Representations (ICLR), 2021. 12
2021
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.