XferNAS: Transfer Neural Architecture Search
Pith reviewed 2026-05-24 19:33 UTC · model grok-4.3
The pith
A transfer framework lets existing NAS optimizers reuse prior task knowledge and cuts search time from 200 to 6 GPU days.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The central claim is that a transfer framework, realized through only minor modifications to existing NAS optimizers, reuses knowledge learned on source tasks to reduce search time and improve final architectures on target tasks. Integrated with one optimizer and evaluated on CIFAR-10 and CIFAR-100, it yields a 33-fold reduction in search cost (from 200 to 6 GPU days) and reported record error rates of 1.99% and 14.06%, respectively. In the paper's experiments the framework is robust across different quantities of source and target data, matching or exceeding the performance of the base optimizer in every reported case.
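The headline speedup follows directly from the two search costs quoted in the abstract. A minimal check, with variable names chosen here purely for illustration:

```python
# Search costs quoted in the paper's abstract, in GPU days.
base_gpu_days = 200
transfer_gpu_days = 6

# Ratio of the two costs: how many times cheaper the transferred search is.
speedup = base_gpu_days / transfer_gpu_days

# Fraction of the original search budget that is no longer spent.
saved_fraction = 1 - transfer_gpu_days / base_gpu_days

print(f"speedup: {speedup:.1f}x, budget saved: {saved_fraction:.0%}")
```

The ratio is 200/6, roughly 33.3, which the paper reports as a factor of 33.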
What carries the argument
The transfer framework that injects knowledge reuse from prior NAS searches into existing optimizers via minor code changes.
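This summary does not describe how the paper encodes or injects source knowledge, so the following is only a hypothetical sketch of the general idea: a toy surrogate-based optimizer whose transfer variant differs by a single change, seeding its observation history with results from a source task. Every name here (`SurrogateSearch`, `TransferSearch`, `toy_error`, the tuple encoding) is an assumption made for illustration and is not taken from the paper.

```python
import random
from typing import Callable, List, Optional, Tuple

# Hypothetical encoding: an architecture is a tuple of integer choices, one per slot.
Arch = Tuple[int, ...]
Observation = Tuple[Arch, float]  # (architecture, measured error)


def hamming(a: Arch, b: Arch) -> int:
    """Number of slots in which two architectures differ."""
    return sum(x != y for x, y in zip(a, b))


class SurrogateSearch:
    """Toy base optimizer (an assumption, not the paper's optimizer).

    It predicts a candidate's error from the nearest architecture observed
    so far and evaluates the random candidate with the lowest prediction.
    """

    def __init__(self, choices: int, slots: int, seed: int = 0) -> None:
        self.choices = choices
        self.slots = slots
        self.rng = random.Random(seed)
        self.history: List[Observation] = []

    def predict(self, arch: Arch) -> float:
        if not self.history:
            return 0.0  # no knowledge yet: every candidate looks equally good
        nearest = min(self.history, key=lambda obs: hamming(arch, obs[0]))
        return nearest[1]

    def propose(self, n_candidates: int = 32) -> Arch:
        candidates = [
            tuple(self.rng.randrange(self.choices) for _ in range(self.slots))
            for _ in range(n_candidates)
        ]
        return min(candidates, key=self.predict)

    def search(
        self, evaluate: Callable[[Arch], float], budget: int
    ) -> Optional[Observation]:
        best: Optional[Observation] = None
        for _ in range(budget):
            arch = self.propose()
            error = evaluate(arch)  # the expensive step in a real NAS run
            self.history.append((arch, error))
            if best is None or error < best[1]:
                best = (arch, error)
        return best  # best architecture evaluated on the target task


class TransferSearch(SurrogateSearch):
    """Transfer variant: the only change is seeding the surrogate's history."""

    def __init__(
        self, choices: int, slots: int, source_history: List[Observation], seed: int = 0
    ) -> None:
        super().__init__(choices, slots, seed)
        self.history.extend(source_history)  # reuse source-task observations


def toy_error(arch: Arch) -> float:
    """Synthetic stand-in for training and validating a network."""
    return sum(arch) / (2 * len(arch))


# Synthetic source-task observations, for illustration only.
source = [((0, 0, 0, 0), 0.05), ((2, 2, 2, 2), 0.90)]
searcher = TransferSearch(choices=3, slots=4, source_history=source)
best = searcher.search(toy_error, budget=5)
```

The base class is untouched and the subclass adds one line of behaviour, which mirrors the "minor changes" framing. Whether seeded observations help or hurt depends on how similar the source and target tasks are, and this toy does not address that question.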
If this is right
- Any existing NAS optimizer can be upgraded to a transfer version without redesigning its core logic.
- Repeated architecture searches become feasible on modest compute budgets instead of hundreds of GPU days.
- New state-of-the-art error rates on CIFAR-10 and CIFAR-100 are reachable while preserving the original optimizer's strengths.
- Performance remains at least as good as the base optimizer even when source and target data quantities vary.
- The same minor-change approach can be applied to other NAS benchmarks without task-specific redesign.
Where Pith is reading between the lines
- The framework could be applied sequentially so that each new task accumulates knowledge from all earlier ones.
- Similar reuse patterns might shorten other expensive hyperparameter searches that share architectural structure.
- If negative transfer appears on dissimilar tasks, lightweight detection heuristics could be added without altering the core claim.
- Industry pipelines that run many related searches could adopt the framework to amortize search cost across projects.
Load-bearing premise
Knowledge from source-task searches transfers positively to the target task even after only minor modifications and without substantial negative transfer.
What would settle it
An experiment on a new target task in which the transferred optimizer returns higher error rates or requires more search time than the unmodified optimizer would falsify the claimed benefit.
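The falsification condition above can be written as a simple predicate over two search outcomes. This is a minimal sketch; `SearchOutcome` and `benefit_falsified` are hypothetical names, and the numbers are placeholders rather than results from the paper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SearchOutcome:
    error_rate: float  # final test error of the discovered architecture
    gpu_days: float    # total search cost


def benefit_falsified(base: SearchOutcome, transfer: SearchOutcome) -> bool:
    """True if the transferred optimizer is worse on either axis."""
    return (
        transfer.error_rate > base.error_rate
        or transfer.gpu_days > base.gpu_days
    )


# Placeholder outcomes for illustration only; not measurements from the paper.
base = SearchOutcome(error_rate=5.0, gpu_days=10.0)
better = SearchOutcome(error_rate=4.5, gpu_days=2.0)
slower = SearchOutcome(error_rate=4.5, gpu_days=12.0)

print(benefit_falsified(base, better), benefit_falsified(base, slower))
```

A single comparison like this ignores run-to-run variance, so in practice the predicate would need to be applied to aggregated results over several independent runs.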
Original abstract
The term Neural Architecture Search (NAS) refers to the automatic optimization of network architectures for a new, previously unknown task. Since testing an architecture is computationally very expensive, many optimizers need days or even weeks to find suitable architectures. However, this search time can be significantly reduced if knowledge from previous searches on different tasks is reused. In this work, we propose a generally applicable framework that introduces only minor changes to existing optimizers to leverage this feature. As an example, we select an existing optimizer and demonstrate the complexity of the integration of the framework as well as its impact. In experiments on CIFAR-10 and CIFAR-100, we observe a reduction in the search time from 200 to only 6 GPU days, a speed up by a factor of 33. In addition, we observe new records of 1.99 and 14.06 for NAS optimizers on the CIFAR benchmarks, respectively. In a separate study, we analyze the impact of the amount of source and target data. Empirically, we demonstrate that the proposed framework generally gives better results and, in the worst case, is just as good as the unmodified optimizer.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes XferNAS, a general framework that adds only minor modifications to existing NAS optimizers to transfer knowledge from searches on source tasks, thereby accelerating architecture search on new target tasks. On CIFAR-10 and CIFAR-100 it reports reducing search cost from 200 to 6 GPU days (33× speedup) while setting new records of 1.99 % and 14.06 % error; a separate ablation studies the effect of source/target data volume and claims the method is never worse than the unmodified baseline.
Significance. If the positive-transfer premise holds across task distributions, the work would materially lower the computational barrier to NAS by amortizing prior search effort, making automated architecture design practical for a wider range of users. The design choice of minimal changes to existing optimizers is a pragmatic strength that facilitates adoption.
major comments (2)
- [Abstract and §4] Abstract and §4 (data-volume study): the claim that the framework “generally gives better results and, in the worst case, is just as good” rests on an unquantified assumption of positive transfer; no metric of source–target task similarity, no worst-case bound on negative transfer under distribution shift, and no analysis beyond data-volume ablation are provided, yet these are load-bearing for both the 33× speedup and the record claims.
- [§5] §5 (CIFAR benchmark results): the reported records (1.99 % / 14.06 %) and wall-clock reduction are given without the number of independent runs, standard deviations, or statistical comparison against the unmodified optimizer, so it is impossible to assess whether the observed gains are reliable or merely within run-to-run variance.
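The statistical comparison requested in the second major comment could look like the sketch below, which computes Welch's t statistic from repeated runs using only the Python standard library. The error rates are synthetic placeholders; the paper reports single runs, so no such data exists in the source.

```python
import math
import statistics
from typing import Sequence, Tuple


def welch_t(a: Sequence[float], b: Sequence[float]) -> Tuple[float, float]:
    """Return Welch's t statistic and its approximate degrees of freedom."""
    # Squared standard error of each sample mean (sample variance / n).
    se2_a = statistics.variance(a) / len(a)
    se2_b = statistics.variance(b) / len(b)
    t = (statistics.mean(a) - statistics.mean(b)) / math.sqrt(se2_a + se2_b)
    # Welch-Satterthwaite approximation of the degrees of freedom.
    df = (se2_a + se2_b) ** 2 / (
        se2_a ** 2 / (len(a) - 1) + se2_b ** 2 / (len(b) - 1)
    )
    return t, df


# Synthetic placeholder error rates (percent) from three hypothetical runs each.
transfer_errors = [1.0, 2.0, 3.0]
base_errors = [2.0, 3.0, 4.0]

t_stat, dof = welch_t(transfer_errors, base_errors)
print(f"mean transfer: {statistics.mean(transfer_errors):.2f}, "
      f"std: {statistics.stdev(transfer_errors):.2f}")
print(f"Welch t = {t_stat:.3f}, df = {dof:.1f}")
```

A negative t statistic here means the transferred optimizer has the lower mean error. Converting it to a p-value requires the t distribution, which is available in third-party packages but not in the standard library.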
minor comments (2)
- [Abstract] Abstract: the phrase “new records of 1.99 and 14.06 for NAS optimizers” is ambiguous; clarify whether these are test error rates of the discovered architectures or some other metric.
- [§3] Notation: the transfer mechanism (how source knowledge is encoded and injected) is described only at a high level; a concise pseudocode or diagram in §3 would improve reproducibility.
Simulated Author's Rebuttal
We thank the referee for the constructive comments. We respond point-by-point to the major comments below, indicating where revisions will be made and where limitations prevent further changes.
- Referee: [Abstract and §4] Abstract and §4 (data-volume study): the claim that the framework “generally gives better results and, in the worst case, is just as good” rests on an unquantified assumption of positive transfer; no metric of source–target task similarity, no worst-case bound on negative transfer under distribution shift, and no analysis beyond data-volume ablation are provided, yet these are load-bearing for both the 33× speedup and the record claims.
  Authors: We agree that the claim is supported only by the empirical data-volume ablation in §4 rather than by a similarity metric or theoretical bound. No such metric or worst-case analysis appears in the manuscript. In revision we will qualify the statement in the abstract and §4 to read “in our experiments” and will add an explicit limitations paragraph noting the absence of a bound on negative transfer under arbitrary distribution shift.
  revision: yes
- Referee: [§5] §5 (CIFAR benchmark results): the reported records (1.99 % / 14.06 %) and wall-clock reduction are given without the number of independent runs, standard deviations, or statistical comparison against the unmodified optimizer, so it is impossible to assess whether the observed gains are reliable or merely within run-to-run variance.
  Authors: The CIFAR benchmark results in §5 were obtained from single runs of each search; the manuscript therefore contains no report of independent runs, standard deviations, or statistical tests. Because the original experiments were not replicated, we cannot supply these quantities. We will add a sentence in §5 stating that results are from single runs and noting the computational cost as the reason.
  revision: no
Left unresolved by the rebuttal: provision of standard deviations or statistical comparisons for the §5 CIFAR results, because multiple independent runs were not performed in the original study.
Circularity Check
No circularity; empirical validation on external benchmarks
The paper proposes a transfer framework for NAS optimizers and reports observed speedups (200 to 6 GPU days) and accuracies (1.99/14.06) from direct experiments on CIFAR-10/100. No equations, predictions, or first-principles results are derived; all central claims are measurements against fixed external benchmarks. No self-definitional steps, fitted inputs renamed as predictions, or load-bearing self-citations appear in the provided text. The work is self-contained against external data and receives the default low score.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Architectural knowledge from source tasks transfers positively to target tasks with only minor optimizer changes.