
arxiv: 2603.09405 · v2 · pith:YFGFDFP6new · submitted 2026-03-10 · 💻 cs.CV

YOLO-NAS-Bench: A Surrogate Benchmark with Self-Evolving Predictors for YOLO Architecture Search

Pith reviewed 2026-05-21 11:05 UTC · model grok-4.3

classification 💻 cs.CV
keywords YOLO architecture search · surrogate benchmark · self-evolving predictor · object detection · neural architecture search · COCO-mini · evolutionary search

The pith

A self-evolving surrogate predictor trained on sampled YOLO architectures can guide evolutionary search to detectors outperforming official YOLO baselines at comparable latency.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper seeks to remove the bottleneck of full training for each candidate during neural architecture search for YOLO object detectors. It samples 1,000 architectures across channel widths, depths, and operators in backbones and necks, trains them on the reduced COCO-mini set, and trains a LightGBM predictor. A Self-Evolving Mechanism then repeatedly uses the current predictor to select and train additional high-performing candidates, expanding the pool to 1,500 and lifting the ensemble's R-squared to 0.815 along with its ranking consistency. When the final predictor serves as the fitness function in evolutionary search, the resulting architectures exceed all official YOLOv8 through YOLO12 models on COCO-mini at matched latency, showing that the surrogate can reliably identify strong designs without exhaustive evaluation.
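The sampling step can be illustrated with a minimal Latin Hypercube sketch. Everything here is an assumption made for illustration: the `SEARCH_SPACE` names and choices are hypothetical placeholders rather than the paper's actual search space, and the paper's own sampler may differ in detail.

```python
import random

# Hypothetical search space: names and choices are illustrative only,
# not the dimensions used by YOLO-NAS-Bench.
SEARCH_SPACE = {
    "width": [0.25, 0.5, 0.75, 1.0],
    "depth": [1, 2, 3],
    "operator": ["op_a", "op_b", "op_c"],
}

def latin_hypercube(space, n, seed=0):
    """Draw n configurations so that, per dimension, the unit interval is
    split into n equal strata and every stratum is used exactly once."""
    rng = random.Random(seed)
    columns = {}
    for name, choices in space.items():
        # One uniform point inside each stratum, shuffled independently
        # for every dimension.
        points = [(i + rng.random()) / n for i in range(n)]
        rng.shuffle(points)
        # Map each point in [0, 1) onto a discrete choice.
        columns[name] = [choices[int(p * len(choices))] for p in points]
    return [{name: columns[name][i] for name in space} for i in range(n)]

samples = latin_hypercube(SEARCH_SPACE, n=12)
```

When `n` is a multiple of the number of choices in a dimension, every choice appears equally often in that dimension, which is the coverage property that motivates Latin Hypercube sampling over purely random draws.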

Core claim

YOLO-NAS-Bench defines a search space covering core YOLO modules from versions 8 through 12, samples 1,000 architectures via random, stratified, and Latin Hypercube methods, and trains them on COCO-mini. The Self-Evolving Mechanism iteratively deploys the predictor to locate and evaluate additional informative architectures near the high-performance frontier, growing the training pool to 1,500 while raising ensemble R² from 0.770 to 0.815 and Sparse Kendall Tau from 0.694 to 0.752. Using the refined predictor directly as the fitness function in evolutionary search produces architectures that surpass all official YOLOv8-YOLO12 baselines at comparable latency on COCO-mini.
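For readers unfamiliar with the two reported metrics, the sketch below computes R² and a Kendall Tau with tie correction in plain Python. The "sparse" variant shown is an assumption: the text here does not define it, so the code follows one convention from the surrogate-NAS literature, which rounds predictions to a fixed precision before ranking. The `decimals` parameter is hypothetical.

```python
from itertools import combinations
from math import sqrt

def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot

def kendall_tau(a, b):
    """Kendall tau-b, which corrects the denominator for tied pairs."""
    concordant = discordant = ties_a = ties_b = 0
    for i, j in combinations(range(len(a)), 2):
        da, db = a[i] - a[j], b[i] - b[j]
        ties_a += da == 0
        ties_b += db == 0
        if da * db > 0:
            concordant += 1
        elif da * db < 0:
            discordant += 1
    n_pairs = len(a) * (len(a) - 1) // 2
    denom = sqrt((n_pairs - ties_a) * (n_pairs - ties_b))
    return (concordant - discordant) / denom if denom else 0.0

def sparse_kendall_tau(y_true, y_pred, decimals=1):
    """Assumed definition: round predictions first so that rank swaps
    below the chosen precision are treated as ties."""
    rounded = [round(p, decimals) for p in y_pred]
    return kendall_tau(y_true, rounded)
```

Rounding makes the metric insensitive to reorderings among architectures whose predicted scores differ by less than the chosen precision, which is arguably the regime where rank swaps matter least for search.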

What carries the argument

The Self-Evolving Mechanism, which progressively aligns the predictor training distribution to the high-performance frontier by using the current model to discover and train new informative architectures each iteration.
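The loop can be sketched in a toy setting. The bucket count, picks per bucket, and round count follow the Figure 3 caption; everything else is an assumption for illustration. `train_and_eval` is a synthetic stand-in for COCO-mini training, `fit` is a simple additive model standing in for the LightGBM ensemble, the latency bucket is read off the first gene, and ranking random candidates by prediction stands in for the EA search.

```python
import itertools
import random

GENES, LEVELS = 4, 10                  # toy encoding: 4 genes, 10 levels each
N_BUCKETS, TOP_K, ROUNDS = 10, 5, 10   # values from the Figure 3 caption

def train_and_eval(arch):
    """Stand-in for training on COCO-mini; returns a synthetic 'mAP'."""
    w, d, o, n = arch
    return 20 + 2.0 * w + 1.5 * d - 0.2 * (o - 6) ** 2 + 0.1 * d * n

def latency_bucket(arch):
    """Toy latency model: the first gene alone decides the bucket."""
    return arch[0]

def fit(pool):
    """Additive per-gene model, a simple stand-in for the real predictor."""
    mean = sum(pool.values()) / len(pool)
    sums = [[0.0] * LEVELS for _ in range(GENES)]
    counts = [[0] * LEVELS for _ in range(GENES)]
    for arch, y in pool.items():
        for g, v in enumerate(arch):
            sums[g][v] += y - mean
            counts[g][v] += 1
    effects = [[s / c if c else 0.0 for s, c in zip(srow, crow)]
               for srow, crow in zip(sums, counts)]
    return lambda arch: mean + sum(effects[g][v] for g, v in enumerate(arch))

def self_evolve(seed=0, n_init=1000, n_candidates=200):
    rng = random.Random(seed)
    space = list(itertools.product(range(LEVELS), repeat=GENES))
    pool = {a: train_and_eval(a) for a in rng.sample(space, n_init)}
    for _ in range(ROUNDS):
        predict = fit(pool)  # retrain on the current pool
        picked = []
        for b in range(N_BUCKETS):
            unseen = [a for a in space
                      if latency_bucket(a) == b and a not in pool]
            candidates = rng.sample(unseen, n_candidates)
            # Keep the candidates the current predictor rates highest.
            picked += sorted(candidates, key=predict, reverse=True)[:TOP_K]
        # "Train" the picks and merge their true scores into the pool.
        pool.update({a: train_and_eval(a) for a in picked})
    return pool, fit(pool)

pool, predictor = self_evolve()
print(len(pool))  # 1000 initial + 10 rounds * 10 buckets * 5 picks = 1500
```

The point the sketch makes is structural: the predictor only proposes candidates, and every addition to the pool carries a measured score rather than a predicted one.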

If this is right

  • The refined predictor supplies a low-cost fitness signal that lets evolutionary search explore the YOLO design space more thoroughly than full training permits.
  • The benchmark supplies a standardized way to compare different NAS algorithms for detection without incurring repeated full training costs.
  • Architectures surfaced by the predictor demonstrate that ranking consistency in the high-performance regime is sufficient to locate designs better than current hand-designed baselines.
  • The iterative alignment process shows that predictor accuracy improves most when training data is deliberately shifted toward top-performing candidates rather than uniform sampling.
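To make the first point concrete, here is a minimal sketch of an evolutionary search that uses a predictor as its fitness function under a latency budget. It is a generic tournament-and-mutation loop, not the paper's algorithm; `predicted_map` and `latency_ms` are hypothetical stand-ins for the trained surrogate and a real latency measurement.

```python
import random

GENES, LEVELS = 4, 10  # hypothetical architecture encoding

def predicted_map(arch):
    """Hypothetical surrogate; in practice, the trained predictor."""
    w, d, o, n = arch
    return 20 + 2.0 * w + 1.5 * d - 0.2 * (o - 6) ** 2 + 0.5 * n

def latency_ms(arch):
    """Hypothetical latency model standing in for a real measurement."""
    w, d, o, n = arch
    return 1.0 + 0.4 * w + 0.3 * d + 0.1 * n

def mutate(arch, rng):
    """Resample one randomly chosen gene."""
    child = list(arch)
    child[rng.randrange(GENES)] = rng.randrange(LEVELS)
    return tuple(child)

def evolve(max_latency, generations=200, pop_size=20, tournament=5, seed=0):
    rng = random.Random(seed)
    # Start from the smallest architecture so the population is feasible
    # (assumes max_latency is at least its latency).
    population = [(0,) * GENES for _ in range(pop_size)]
    best = population[0]
    for _ in range(generations):
        # Tournament selection on predicted mAP: no training is involved.
        parent = max(rng.sample(population, tournament), key=predicted_map)
        child = mutate(parent, rng)
        if latency_ms(child) > max_latency:
            continue  # reject children that exceed the latency budget
        population.append(child)
        population.pop(0)  # drop the oldest member
        best = max(best, child, key=predicted_map)
    return best

best = evolve(max_latency=4.0)
```

Each fitness evaluation is a single function call, so the search can afford many more generations than a loop that trains every candidate.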

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same self-evolving sampling idea could be tested on other detection families or segmentation tasks where evaluation cost is similarly high.
  • If the COCO-mini to full-COCO correlation holds, the approach offers a practical route to architecture search on larger-scale or video detection problems.
  • The benchmark could serve as a testbed for hybrid search methods that combine the predictor with gradient-based or reinforcement-learning NAS strategies.

Load-bearing premise

Performance rankings and absolute metrics measured on the reduced COCO-mini dataset are sufficiently correlated with results on the full COCO dataset and real deployment conditions to make the surrogate useful for guiding architecture search.

What would settle it

Fully train the top architectures found by the evolutionary search on the complete COCO dataset and verify whether they still exceed the official YOLOv8-YOLO12 baselines in accuracy at matched latency.
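One way to phrase that check is a dominance test over (latency, mAP) points, sketched below. The function names are hypothetical and the numbers are placeholders chosen to exercise the code, not measurements from the paper or from any YOLO release.

```python
def dominates(candidate, baseline):
    """True if candidate is no slower and no less accurate than baseline,
    and strictly better on at least one of the two axes."""
    (lat_c, map_c), (lat_b, map_b) = candidate, baseline
    return (lat_c <= lat_b and map_c >= map_b
            and (lat_c < lat_b or map_c > map_b))

def undominated_baselines(searched, baselines):
    """Names of baselines that no searched architecture dominates."""
    return [name for name, point in baselines.items()
            if not any(dominates(s, point) for s in searched)]

# Placeholder (latency_ms, mAP) pairs, for illustration only.
searched = [(2.0, 38.0), (4.0, 45.0)]
baselines = {
    "baseline_a": (2.5, 37.0),
    "baseline_b": (4.0, 44.0),
    "baseline_c": (1.5, 35.0),
}
remaining = undominated_baselines(searched, baselines)
```

With full-COCO measurements in place of the placeholders, an empty `remaining` list would mean every baseline is dominated; any name left in it marks a latency point where the claim does not hold.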

Figures

Figures reproduced from arXiv: 2603.09405 by Jiaxin Zheng, Xiaoyu Ding, Yongtao Wang, Zhe Li.

Figure 1: Latency vs. mAP on COCO-mini. Architectures discovered by our predictor-guided EA search consistently Pareto-dominate all official YOLO baselines (v8–v12) across the full latency spectrum, demonstrating the strong discriminative power of the YOLO-NAS-Bench surrogate predictor.
Figure 2: Overview of the YOLO-NAS-Bench pipeline. (1) A YOLO-style search space spanning channel, depth, and operator dimensions across both backbone and neck is defined. (2) 1,000 architectures are sampled via three complementary strategies and trained on COCO-mini. (3) A LightGBM predictor is trained on the resulting {architecture, mAP} pairs. (4) The Self-Evolving Predictor iteratively expands the pool with hig…
Figure 3: Self-Evolving Predictor. Starting from 1,000 architectures, the loop partitions latency into 10 buckets. For each bucket, EA search selects the top 5 architectures using predicted mAP as fitness and real latency as constraint. In each round, these 50 new architectures are trained on COCO-mini, merged into the pool, and the predictor is retrained. After 10 rounds the pool grows to 1,500 architectures, yiel…
Figure 4: Predicted vs. ground-truth mAP on the full 1,500-architecture pool. Each point is an architecture colored by its sampling source. Points cluster closely around the y=x diagonal, confirming strong agreement between the ensemble predictor and ground-truth performance.
Original abstract

Neural Architecture Search (NAS) for object detection is severely bottlenecked by high evaluation cost, as fully training each candidate YOLO architecture on COCO demands days of GPU time. Meanwhile, existing NAS benchmarks largely target image classification, leaving the detection community without a comparable benchmark for NAS evaluation. To address this gap, we introduce YOLO-NAS-Bench, the first surrogate benchmark tailored to YOLO-style detectors. YOLO-NAS-Bench defines a search space spanning channel width, block depth, and operator type across both backbone and neck, covering the core modules of YOLOv8 through YOLO12. We sample 1,000 architectures via random, stratified, and Latin Hypercube strategies, train them on COCO-mini, and build a LightGBM surrogate predictor. To sharpen the predictor in the high-performance regime most relevant to NAS, we propose a Self-Evolving Mechanism that progressively aligns the predictor's training distribution with the high-performance frontier, by using the predictor itself to discover and evaluate informative architectures in each iteration. This method grows the pool to 1,500 architectures and raises the ensemble predictor's R² from 0.770 to 0.815 and Sparse Kendall Tau from 0.694 to 0.752, demonstrating strong predictive accuracy and ranking consistency. Using the final predictor as the fitness function for evolutionary search, we discover architectures that surpass all official YOLOv8-YOLO12 baselines at comparable latency on COCO-mini, confirming the predictor's discriminative power for top-performing detection architectures. The code is available at https://github.com/VDIGPKU/YOLO-NAS-Bench.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 3 minor

Summary. The manuscript introduces YOLO-NAS-Bench as the first surrogate benchmark for neural architecture search targeting YOLO-style detectors. It defines a search space over channel widths, block depths, and operator types for backbone and neck modules spanning YOLOv8 to YOLO12. 1,000 architectures are sampled and trained on COCO-mini; a LightGBM surrogate is trained on these results. A Self-Evolving Mechanism then uses the predictor to propose and evaluate additional high-performance architectures, expanding the set to 1,500 samples and lifting ensemble R² from 0.770 to 0.815 and Sparse Kendall Tau from 0.694 to 0.752. The final predictor serves as the fitness function in an evolutionary search that yields architectures outperforming official YOLOv8–YOLO12 baselines at comparable latency on COCO-mini.

Significance. If the surrogate rankings prove transferable, the benchmark and self-evolving predictor would meaningfully lower the barrier to NAS for object detection by replacing multi-day full trainings with fast inference. The explicit focus on the high-performance regime via self-evolution is a constructive technical contribution, and public code release aids reproducibility. The central limitation is that all quantitative claims, including the headline result of surpassing official baselines, rest exclusively on COCO-mini without reported transfer or rank-correlation checks to the full COCO dataset.

major comments (2)
  1. [Abstract] Abstract and experimental results: All performance numbers, ranking metrics, and the claim that evolved architectures surpass official YOLOv8–YOLO12 baselines are obtained exclusively on COCO-mini. No experiment trains the discovered models on the full COCO train set, evaluates mAP on the standard val set, or measures rank correlation between COCO-mini and full-COCO orderings. Because the usefulness of the surrogate for guiding real NAS hinges on this correlation, the absence of such validation is load-bearing for the central claim.
  2. [Self-Evolving Mechanism] Self-Evolving Mechanism (described in abstract and §3): The loop uses the current predictor to select new candidate architectures, evaluates them on COCO-mini, and folds the results back into the training pool. While presented as active learning, the manuscript does not report a held-out validation set, an analysis of selection bias, or an ablation showing that the observed R²/Kendall-Tau gains are not partly artifacts of the predictor reinforcing its own preferences. This circularity risk directly affects the reliability of the final predictor used for evolutionary search.
minor comments (3)
  1. The term 'Sparse Kendall Tau' is used without definition or citation; a short explanation or reference to the exact variant employed would improve clarity.
  2. [Experimental setup] LightGBM hyperparameters and the precise number of architectures added per self-evolving iteration are treated as free parameters but are not tabulated or subjected to sensitivity analysis; including these values would help readers reproduce the reported metric improvements.
  3. Figure captions and axis labels should explicitly state that all latency and accuracy numbers are measured on COCO-mini rather than full COCO to avoid misinterpretation.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We are grateful to the referee for their positive evaluation of the paper's significance and for highlighting important aspects that require further clarification and validation. We address the two major comments below, committing to revisions that will strengthen the manuscript's claims regarding the surrogate's applicability and the robustness of the self-evolving approach.

Point-by-point responses
  1. Referee: [Abstract] Abstract and experimental results: All performance numbers, ranking metrics, and the claim that evolved architectures surpass official YOLOv8–YOLO12 baselines are obtained exclusively on COCO-mini. No experiment trains the discovered models on the full COCO train set, evaluates mAP on the standard val set, or measures rank correlation between COCO-mini and full-COCO orderings. Because the usefulness of the surrogate for guiding real NAS hinges on this correlation, the absence of such validation is load-bearing for the central claim.

    Authors: We acknowledge that all reported results, including the outperformance claims, are based on COCO-mini. This dataset was selected as a computationally tractable proxy to enable sampling and full training of 1,500 architectures, which would otherwise demand infeasible resources on full COCO. To directly address transferability, we will add to the revised manuscript: (i) training and evaluation of a subset of the evolved architectures on the full COCO train set with mAP reported on the standard validation set, and (ii) rank-correlation analysis between COCO-mini and full-COCO performance for a representative sample of architectures. These additions will provide concrete evidence on the proxy's reliability for NAS guidance. revision: yes

  2. Referee: [Self-Evolving Mechanism] Self-Evolving Mechanism (described in abstract and §3): The loop uses the current predictor to select new candidate architectures, evaluates them on COCO-mini, and folds the results back into the training pool. While presented as active learning, the manuscript does not report a held-out validation set, an analysis of selection bias, or an ablation showing that the observed R²/Kendall-Tau gains are not partly artifacts of the predictor reinforcing its own preferences. This circularity risk directly affects the reliability of the final predictor used for evolutionary search.

    Authors: We thank the referee for identifying this potential circularity concern. The self-evolving mechanism was intended to enrich the training distribution toward high-performing architectures relevant to NAS. In the revision we will add: (1) explicit description and results from a held-out validation set used to track predictor generalization across iterations, (2) distributional analysis comparing selected versus non-selected architectures to quantify selection bias, and (3) an ablation contrasting self-evolution against continued random or stratified sampling, demonstrating that the observed R² and Sparse Kendall Tau improvements arise from frontier enrichment rather than self-reinforcement alone. These elements will substantiate the mechanism's reliability. revision: yes

Circularity Check

0 steps flagged

No significant circularity in the surrogate benchmark construction

full rationale

The paper trains a LightGBM predictor on 1,000 randomly/stratified/LHS-sampled YOLO architectures whose true performance is obtained by full training on COCO-mini. The self-evolving step iteratively uses the current predictor only to propose additional candidates that are then actually trained and added to the pool, raising R² and Kendall-τ on the expanded set; this is ordinary active learning, not a definitional loop in which a prediction is forced by its own inputs. The final evolutionary search employs the trained predictor as a cheap fitness function, after which the discovered candidates are evaluated to produce the reported superiority claim. No equation reduces to another by construction, no uniqueness theorem is imported from self-citation, and no ansatz is smuggled. The derivation remains self-contained against the COCO-mini benchmark data.

Axiom & Free-Parameter Ledger

2 free parameters · 1 axiom · 1 invented entity

The central claim rests on training a modest number of architectures on a reduced dataset and iteratively refining a surrogate; no machine-checked proofs, external validation sets, or parameter-free derivations are mentioned.

free parameters (2)
  • LightGBM model hyperparameters
    Control the surrogate predictor but are not enumerated in the abstract.
  • Number of architectures added per self-evolving iteration
    Determines how the training distribution shifts toward high performers.
axioms (1)
  • Domain assumption: COCO-mini performance rankings transfer to full COCO and deployment settings
    All training, evaluation, and final claims rely on the mini dataset without stated justification for generalization.
invented entities (1)
  • Self-Evolving Mechanism (no independent evidence)
    purpose: Progressively align the predictor's training distribution with the high-performance frontier
    Introduced by the authors to improve accuracy in the regime most relevant to NAS.

pith-pipeline@v0.9.0 · 5840 in / 1696 out tokens · 56339 ms · 2026-05-21T11:05:28.421054+00:00 · methodology


