pith. sign in

arxiv: 2605.25046 · v1 · pith:RHH7GLGWnew · submitted 2026-05-24 · 💻 cs.CV · cs.AI

TinyFormer: Preserving Tiny Objects in YOLO-DETR Hybrid Real-time Detectors

Pith reviewed 2026-06-30 12:07 UTC · model grok-4.3

classification 💻 cs.CV cs.AI
keywords tiny object detectionreal-time object detectionYOLO-DETR hybridfeature pyramidset predictionspatial adapterMS COCO
0
0 comments X

The pith

TinyFormer hybrid fuses YOLO pyramids with DETR set prediction to retain tiny objects.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper claims that standard YOLO detectors lose tiny objects through deep large-stride backbones while DETR detectors miss them inside coarse token grids. TinyFormer creates a hybrid that keeps a YOLO-style pyramid neck, adds NMS-free set prediction, and inserts two new modules to restore the missing spatial detail. The Parallel Bi-fusion Module supplies high-resolution shortcuts from shallow layers, and the Spatial Semantic Adapter pulls early-stage cues into the transformer tokens. On MS COCO these changes produce a 1.6 percent AP gain on small objects and push overall AP to 58.5 percent, or 60.2 percent after Objects365 pre-training.

Core claim

TinyFormer unifies dense YOLO-style feature fusion and DETR-style set prediction through a Parallel Bi-fusion Module that builds high-resolution shortcuts from shallow stages and a Spatial Semantic Adapter that injects early high-resolution cues into transformer embeddings, yielding 58.5 percent AP overall and a 1.6 percent AP gain on small objects.

What carries the argument

Parallel Bi-fusion Module (PBM) that creates high-resolution shortcuts from shallow stages to the feature pyramid, together with Spatial Semantic Adapter (SSA) that extracts and injects early-stage spatial cues into transformer token embeddings.

If this is right

  • Real-time detectors can reach higher small-object accuracy by combining dense pyramid fusion with set prediction.
  • Adding PBM alone raises small-object AP by 1.6 percent while overall AP increases slightly.
  • Objects365 pre-training lifts the same model to 60.2 percent AP with lower parameter count than competing pretrained detectors.
  • The hybrid architecture offers a concrete accuracy-efficiency trade-off for applications that must run at video rates.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same shortcut and adapter pattern may transfer to other transformer detectors that currently under-perform on fine-scale features.
  • Testing the modules on datasets with even smaller average object size could reveal whether the gains scale with object size.
  • Replacing the ViT backbone with a lighter CNN might isolate how much of the improvement comes from the fusion modules versus the transformer itself.

Load-bearing premise

The measured gains on small objects are produced by the PBM and SSA modules rather than by any differences in training schedule, data augmentation, or hyper-parameters relative to the YOLO and DEIMv2 baselines.

What would settle it

An experiment that retrains the DEIMv2 or YOLO baselines with exactly the same schedule, augmentation, and hyper-parameters used for TinyFormer and obtains equal or larger AP gains on small objects would show the modules are not responsible.

Figures

Figures reproduced from arXiv: 2605.25046 by Ghufron Wahyu Kurniawan, Jun-Wei Hsieh, Kuan-Chuan Peng, Meng-Yu Kao.

Figure 1
Figure 1. Figure 1: Comparison of TinyFormer with the state-of-the-art detectors on COCO [14] and VisDrone [31]. By effectively recovering spatial priors, TinyFormer explicitly surpasses CNN-based YOLOs in small object detection while maintaining higher overall AP than ViT-based DETRs. bias in favor of large instances, suppressing small ones. Despite architectural differences, both CNN and Transformer detectors share a key bo… view at source ↗
Figure 2
Figure 2. Figure 2: Overall architecture of TinyFormer. The framework consists of a DINOv3 [ [PITH_FULL_IMAGE:figures/full_fig_p003_2.png] view at source ↗
Figure 3
Figure 3. Figure 3: Parallel Bi-fusion Module (PBM). (a) Standard FPN and PAN structures. (b) PBM (purple) replaces conventional top-down fusion with two parallel bi-fusion blocks that aggregate features from three scales: current (Fi), deeper semantic context (Fi+1), and shallower high-resolution details (Fi−1), enabling bidirectional feature flow. (c) Architecture of the bi-fusion block. Channel Split Xpart Xid  convolut… view at source ↗
Figure 4
Figure 4. Figure 4: Detailed architecture of the fusion block in Neck. [PITH_FULL_IMAGE:figures/full_fig_p004_4.png] view at source ↗
Figure 5
Figure 5. Figure 5: Architecture of the Spatial Semantic Adapter (SSA). (a) SSA integrated with the DINOv3 backbone, featuring the Spatial Detail Extractor (SDE) and Semantic Purification Block (SPB). (b) Detailed structure of the SDE. (c) Intermediate fusion method at the F3 scale. C denotes the base channel dimension. where Fi denotes the features on the target scale, Fi+1 represents the deep branch that provides high￾level… view at source ↗
Figure 6
Figure 6. Figure 6: Grad-CAM visualization on the VisDrone 2019 val. Top: baseline DEIMv2-X; bottom: TinyFormer-X-PBM. From left to right, columns show detection results and multi-scale neck activa￾tion maps (F¯ 3, F¯ 4, F¯ 5, defined in [PITH_FULL_IMAGE:figures/full_fig_p008_6.png] view at source ↗
read the original abstract

YOLO-series and DETR-based detectors struggle with tiny-object detection. YOLO-style models benefit from efficient dense prediction, but their large-stride backbones may suppress tiny instances in deep feature maps and make grid assignment ambiguous. DETR-based models remove hand-crafted post-processing through set prediction, yet they reason over coarse token grids, where tiny objects occupy only a few weak tokens and are easily overlooked during matching. To address these limitations, we propose TinyFormer, a unified YOLO--DETR hybrid real-time detector that combines ViT representations, NMS-free set prediction, and a YOLO-style pyramid neck for accurate small-object detection. TinyFormer introduces a Parallel Bi-fusion Module (PBM), which builds high-resolution shortcuts from shallow stages to the feature pyramid, preserving fine spatial details during multi-scale fusion. We further design a Spatial Semantic Adapter (SSA) to compensate for the spatial loss caused by coarse tokenization. SSA extracts high-resolution cues from early stages and injects them into transformer token embeddings, improving tiny-object localization without sacrificing the global modeling ability of DETR. Experiments on MS COCO show that TinyFormer consistently outperforms recent YOLO-series detectors and the strong DEIMv2 baseline. TinyFormer-X achieves 58.4% AP even without PBM, while adding PBM improves the overall AP to 58.5% and brings a 1.6% AP gain on small objects. With Objects365 pre-training, TinyFormer-X-PBM reaches 60.2% AP, surpassing RF-DETR and other Objects365-pretrained detectors with fewer parameters and lower computation. These results demonstrate that TinyFormer bridges dense YOLO-style feature fusion and DETR-style set prediction, providing a strong accuracy-efficiency trade-off for real-time tiny-object detection. Code is available at https://github.com/mmpmmpmmpjosh/TinyFormer.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper proposes TinyFormer, a YOLO-DETR hybrid real-time object detector that integrates ViT backbones, NMS-free set prediction, and a YOLO-style pyramid neck. It introduces the Parallel Bi-fusion Module (PBM) to create high-resolution shortcuts from shallow stages to the feature pyramid and the Spatial Semantic Adapter (SSA) to inject high-resolution cues from early stages into transformer token embeddings. On MS COCO, TinyFormer-X reaches 58.5% AP (58.4% without PBM) with a 1.6% AP gain on small objects; with Objects365 pre-training it reaches 60.2% AP while using fewer parameters than RF-DETR. Code is released.

Significance. If the small-object gains are robustly attributable to PBM and SSA, the work provides a concrete accuracy-efficiency trade-off for real-time tiny-object detection by bridging dense YOLO-style fusion with DETR-style matching. The public code release and use of the standard MS COCO benchmark are positive for reproducibility and comparability.

major comments (2)
  1. [Abstract and §4] Abstract and §4 (experimental comparisons): the claim that TinyFormer outperforms DEIMv2 and YOLO-series detectors does not include an explicit statement that all baselines were trained with identical epoch counts, learning-rate schedules, data-augmentation policies, and optimizer settings. Without this, the 1.6% AP_s improvement cannot be confidently credited to PBM and SSA rather than uncontrolled training differences.
  2. [§4.1] §4.1 (ablation on PBM): while the abstract reports a 0.1% overall AP increase and 1.6% AP_s increase when adding PBM to TinyFormer-X, the corresponding table or text does not report the AP_s value for the 58.4% configuration, preventing direct verification that the small-object gain is isolated to PBM.
minor comments (1)
  1. [§3] Notation for PBM and SSA is introduced in the abstract and §3 but the precise tensor shapes and injection points into the DETR decoder are not restated in a single equation or diagram caption, making the modules harder to re-implement from text alone.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on experimental clarity and ablation reporting. We address each major comment below and will make the necessary revisions to strengthen the manuscript.

read point-by-point responses
  1. Referee: [Abstract and §4] Abstract and §4 (experimental comparisons): the claim that TinyFormer outperforms DEIMv2 and YOLO-series detectors does not include an explicit statement that all baselines were trained with identical epoch counts, learning-rate schedules, data-augmentation policies, and optimizer settings. Without this, the 1.6% AP_s improvement cannot be confidently credited to PBM and SSA rather than uncontrolled training differences.

    Authors: We agree that an explicit statement on training parity is needed for full confidence in attributing gains to PBM and SSA. While we followed the official training recipes and hyperparameters reported in each baseline paper (standard practice for such comparisons), the manuscript does not document this explicitly. We will revise §4 to include a dedicated paragraph detailing that all models were trained for the same number of epochs using identical data-augmentation pipelines, learning-rate schedules, and optimizers (with minor adjustments only for convergence stability as noted in the original works). This addition will directly address the concern and allow readers to verify the attribution. revision: yes

  2. Referee: [§4.1] §4.1 (ablation on PBM): while the abstract reports a 0.1% overall AP increase and 1.6% AP_s increase when adding PBM to TinyFormer-X, the corresponding table or text does not report the AP_s value for the 58.4% configuration, preventing direct verification that the small-object gain is isolated to PBM.

    Authors: We acknowledge the omission in the ablation table. The text states that adding PBM yields a 1.6% AP_s gain, but the table in §4.1 does not list the corresponding AP_s for the 58.4% AP (no-PBM) variant. We will update the ablation table to explicitly report AP, AP_s, AP_m, and AP_l for both configurations, enabling direct verification that the small-object improvement is attributable to PBM. revision: yes

Circularity Check

0 steps flagged

No circularity: purely empirical architecture paper with externally verifiable results

full rationale

The paper presents a new YOLO-DETR hybrid detector and reports direct AP metrics on the fixed public MS COCO test set (e.g., 58.4% AP without PBM, 58.5% with PBM, 1.6% AP_s gain). No derivation chain, equations, or fitted parameters exist that could reduce a claimed prediction to its own inputs by construction. Self-citations, if present for baselines, are not load-bearing for the central accuracy claims, which remain falsifiable outside the paper. This matches the default non-circular outcome for empirical work.

Axiom & Free-Parameter Ledger

1 free parameters · 1 axioms · 2 invented entities

The central claim rests on the empirical effectiveness of two newly introduced modules whose internal design choices (fusion ratios, injection points, token dimensions) are not visible in the abstract. No new physical entities are postulated.

free parameters (1)
  • fusion and injection hyperparameters inside PBM and SSA
    These control how high-resolution cues are combined with transformer tokens and are chosen during architecture search or validation.
axioms (1)
  • domain assumption High-resolution shortcuts from shallow stages preserve usable spatial detail for tiny objects
    Invoked when the paper states that PBM builds these shortcuts to avoid suppression in deep feature maps.
invented entities (2)
  • Parallel Bi-fusion Module (PBM) no independent evidence
    purpose: Build high-resolution shortcuts from shallow stages into the feature pyramid
    New module introduced in the paper; no independent evidence outside the reported experiments.
  • Spatial Semantic Adapter (SSA) no independent evidence
    purpose: Extract high-resolution cues from early stages and inject into transformer token embeddings
    New module introduced in the paper; no independent evidence outside the reported experiments.

pith-pipeline@v0.9.1-grok · 5897 in / 1533 out tokens · 34645 ms · 2026-06-30T12:07:42.823215+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

31 extracted references · 9 canonical work pages · 3 internal anchors

  1. [1]

    Yolov4: Optimal speed and accuracy of object detection.arXiv, 2020

    Alexey Bochkovskiy, Chien-Yao Wang, and Hong-Yuan Mark Liao. Yolov4: Optimal speed and accuracy of object detection.arXiv, 2020

  2. [2]

    End-to-end object detection with transformers

    Nicolas Carion, Francisco Massa, Gabriel Synnaeve, Nicolas Usunier, Alexander Kirillov, and Sergey Zagoruyko. End-to-end object detection with transformers. InEuropean conference on computer vision, pages 213–229. Springer, 2020

  3. [3]

    Parallel residual bi-fusion feature pyramid network for accurate single-shot object detection.IEEE Transactions on Image Processing, 30:9099–9111, 2021

    Ping-Yang Chen, Ming-Ching Chang, Jun-Wei Hsieh, and Yong-Sheng Chen. Parallel residual bi-fusion feature pyramid network for accurate single-shot object detection.IEEE Transactions on Image Processing, 30:9099–9111, 2021. doi: 10.1109/TIP.2021.3118953

  4. [4]

    An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale

    Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, et al. An image is worth 16x16 words: Transformers for image recognition at scale.arXiv preprint arXiv:2010.11929, 2020

  5. [5]

    Yolov8.https: // docs

    Jocher Glenn. Yolov8.https: // docs. ultralytics. com/ models/ yolov8/, 2023

  6. [6]

    Yolo11.https: // docs

    Jocher Glenn. Yolo11.https: // docs. ultralytics. com/ models/ yolo11/, 2024

  7. [7]

    Deep residual learning for image recognition

    Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. InCVPR, 2016

  8. [8]

    Real-time object detection meets dinov3.arXiv, 2025

    Shihua Huang, Yongjie Hou, Longfei Liu, Xuanlong Yu, and Xi Shen. Real-time object detection meets dinov3.arXiv, 2025

  9. [9]

    Deim: Detr with improved matching for fast convergence

    Shihua Huang, Zhichao Lu, Xiaodong Cun, Yongjun Yu, Xiao Zhou, and Xi Shen. Deim: Detr with improved matching for fast convergence. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 15162–15171, 2025

  10. [10]

    yolov11.https://github.com/ultralytics, 2024

    Glenn Jocher. yolov11.https://github.com/ultralytics, 2024

  11. [11]

    Glenn Jocher, K Nishimura, T Mineeva, and RJAM Vilariño. yolov5. https://github.com/ultralytics/yolov5/tree, 2, 2020

  12. [12]

    Yolov13: Real-time object detection with hypergraph-enhanced adaptive visual perception.arXiv preprint arXiv:2506.17733, 2025

    Mengqi Lei, Siqi Li, Yihong Wu, and et al. Yolov13: Real-time object detection with hypergraph-enhanced adaptive visual perception.arXiv preprint arXiv:2506.17733, 2025

  13. [13]

    A deep learning-based hybrid framework for object detection and recognition in autonomous driving.IEEE Access, 8:194228–194239, 2020

    Yanfen Li, Hanxiang Wang, L Minh Dang, Tan N Nguyen, Dongil Han, Ahyun Lee, Insung Jang, and Hyeonjoon Moon. A deep learning-based hybrid framework for object detection and recognition in autonomous driving.IEEE Access, 8:194228–194239, 2020

  14. [14]

    Microsoft coco: Common objects in context

    Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C Lawrence Zitnick. Microsoft coco: Common objects in context. InEuropean conference on computer vision, pages 740–755. Springer, 2014

  15. [15]

    Feature pyramid networks for object detection

    Tsung-Yi Lin, Piotr Dollár, Ross Girshick, Kaiming He, Bharath Hariharan, and Serge Belongie. Feature pyramid networks for object detection. InCVPR, 2017

  16. [16]

    Path aggregation network for instance seg- mentation

    Shu Liu, Lu Qi, Haifang Qin, Jianping Shi, and Jiaya Jia. Path aggregation network for instance seg- mentation. InProceedings of IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018. 10

  17. [17]

    Rt-detrv2: Improved baseline with bag-of-freebies for real-time detection transformer, 2024

    Wenyu Lv, Yian Zhao, Qinyao Chang, Kui Huang, Guanzhong Wang, and Yi Liu. Rt-detrv2: Improved baseline with bag-of-freebies for real-time detection transformer, 2024. URL https://arxiv.org/abs/ 2407.17140

  18. [18]

    D-fine: Redefine regression task in detrs as fine-grained distribution refinement.arXiv, 2024

    Yansong Peng, Hebei Li, Peixi Wu, Yueyi Zhang, Xiaoyan Sun, and Feng Wu. D-fine: Redefine regression task in detrs as fine-grained distribution refinement.arXiv, 2024

  19. [19]

    Rf-detr: Neural architecture search for real-time detection transformers, 2025

    Isaac Robinson, Peter Robicheaux, Matvei Popov, Deva Ramanan, and Neehar Peri. Rf-detr: Neural architecture search for real-time detection transformers, 2025. URL https://arxiv.org/abs/2511. 09554

  20. [20]

    Yolo26: key architec- tural enhancements and performance benchmarking for real-time object detection.arXiv preprint arXiv:2509.25164, 2025

    Ranjan Sapkota, Rahul Harsha Cheppally, Ajay Sharda, and Manoj Karkee. Yolo26: key architec- tural enhancements and performance benchmarking for real-time object detection.arXiv preprint arXiv:2509.25164, 2025

  21. [21]

    Objects365: A large-scale, high-quality dataset for object detection

    Shuai Shao, Zeming Li, Tianyuan Zhang, Chao Peng, Gang Yu, Xiangyu Zhang, Jing Li, and Jian Sun. Objects365: A large-scale, high-quality dataset for object detection. InProceedings of the IEEE/CVF international conference on computer vision, pages 8430–8439, 2019

  22. [22]

    Oriane Siméoni, Huy V . V o, Maximilian Seitzer, Federico Baldassarre, Maxime Oquab, Cijo Jose, Vasil Khalidov, Marc Szafraniec, Seungeun Yi, Michaël Ramamonjisoa, Francisco Massa, Daniel Haziza, Luca Wehrstedt, Jianyuan Wang, Timothée Darcet, Théo Moutakanni, Leonel Sentana, Claire Roberts, Andrea Vedaldi, Jamie Tolan, John Brandt, Camille Couprie, Julie...

  23. [23]

    YOLOv12: Attention-Centric Real-Time Object Detectors

    Yunjie Tian, Qixiang Ye, and David Doermann. Yolov12: Attention-centric real-time object detectors. arXiv preprint arXiv:2502.12524, 2025

  24. [24]

    Yolov10: Real-time end-to-end object detection.arXiv preprint arXiv:2405.14458, 2024

    Ao Wang, Hui Chen, Lihao Liu, Kai Chen, Zijia Lin, Jungong Han, and Guiguang Ding. Yolov10: Real-time end-to-end object detection.arXiv preprint arXiv:2405.14458, 2024

  25. [25]

    Cspnet: A new backbone that can enhance learning capability of cnn

    Chien-Yao Wang, Hong-Yuan Mark Liao, Yueh-Hua Wu, Ping-Yang Chen, Jun-Wei Hsieh, and I-Hau Yeh. Cspnet: A new backbone that can enhance learning capability of cnn. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition workshops, pages 390–391, 2020

  26. [26]

    Yolov9: Learning what you want to learn using programmable gradient information.arXiv, 2024

    Chien-Yao Wang, I-Hau Yeh, and Hong-Yuan Mark Liao. Yolov9: Learning what you want to learn using programmable gradient information.arXiv, 2024

  27. [27]

    Surveiledge: Real-time video query based on collaborative cloud-edge deep learning

    Shibo Wang, Shusen Yang, and Cong Zhao. Surveiledge: Real-time video query based on collaborative cloud-edge deep learning. InIEEE INFOCOM 2020-IEEE Conference on Computer Communications, pages 2519–2528. IEEE, 2020

  28. [28]

    mixup: Beyond empirical risk minimization

    Hongyi Zhang. mixup: Beyond empirical risk minimization. InICLR, 2017

  29. [29]

    So-detr: leveraging dual-domain features and knowledge distillation for small object detection

    Huaxiang Zhang, Hao Zhang, Aoran Mei, Zhongxue Gan, and Guo-Niu Zhu. So-detr: leveraging dual-domain features and knowledge distillation for small object detection. In2025 International Joint Conference on Neural Networks (IJCNN), pages 1–8. IEEE, 2025

  30. [30]

    Detrs beat yolos on real-time object detection, 2023

    Yian Zhao, Wenyu Lv, Shangliang Xu, Jinman Wei, Guanzhong Wang, Qingqing Dang, Yi Liu, and Jie Chen. Detrs beat yolos on real-time object detection, 2023

  31. [31]

    motor" and

    Pengfei Zhu, Longyin Wen, Dawei Du, Xiao Bian, Heng Fan, Qinghua Hu, and Haibin Ling. Detection and tracking meet drones challenge.IEEE Transactions on Pattern Analysis and Machine Intelligence, 44 (11):7380–7399, 2021. 11 Appendix A Implementation Details We implement TinyFormer in PyTorch 2.5.1 with CUDA 12.2, building upon the DEIMv2 [8]. To ensure a s...