arxiv: 1907.00274 · v1 · pith:J2LHXH4Fnew · submitted 2019-06-29 · 💻 cs.CV · cs.LG

NetTailor: Tuning the Architecture, Not Just the Weights

Pith reviewed 2026-05-25 12:33 UTC · model grok-4.3

classification 💻 cs.CV cs.LG
keywords NetTailor · transfer learning · CNN architecture adaptation · task-specific networks · soft attention · complexity regularization · multi-task object recognition · network compression

The pith

NetTailor reuses pre-trained CNN layers as blocks to build task-specific networks whose size scales with task difficulty.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Standard fine-tuning learns an entirely new CNN for each recognition task, so final model size stays the same regardless of whether the task is easy or hard. NetTailor instead treats layers from a strong pre-trained network as reusable universal blocks, assembles them with small task-specific layers, and trains the result to match the original network's internal activations while using soft attention and complexity penalties to drop unnecessary blocks. Experiments show the resulting networks become markedly smaller for simple tasks such as character or traffic-sign recognition than for fine-grained recognition, yet retain accuracy and allow the same blocks to be shared across tasks.

Core claim

The paper shows that a transfer procedure can adapt network architecture, not merely weights, by combining pre-trained layers as universal blocks with task-specific layers, training the assembly to mimic a strong unconstrained CNN's activations, and applying soft-attention over blocks together with complexity regularization; this produces networks whose complexity automatically matches task difficulty while preserving classification accuracy and cross-task parameter sharing.

What carries the argument

NetTailor assembly of pre-trained universal blocks with task-specific layers, trained under activation mimicking, soft-attention, and complexity regularization.
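
As a rough illustration of how these pieces could fit together, the sketch below scores two candidate blocks at one layer with softmax attention and adds an attention-weighted complexity penalty and an activation-mimicry term to the classification loss. The function names, the softmax parameterization, the squared-error mimicry term, and the weights `lam_mimic` and `lam_cplx` are all assumptions made for illustration; the paper's exact formulation may differ.

```python
import math

def soft_attention(logits):
    """Softmax over per-block logits; the returned weights sum to 1."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def expected_complexity(weights, costs):
    """Attention-weighted cost (e.g. FLOPs or parameters) of candidate blocks."""
    return sum(w * c for w, c in zip(weights, costs))

def mimic_loss(student_act, teacher_act):
    """Mean squared error between student and teacher activations."""
    diffs = [(s - t) ** 2 for s, t in zip(student_act, teacher_act)]
    return sum(diffs) / len(diffs)

def joint_loss(cls_loss, student_act, teacher_act, logits, costs,
               lam_mimic=1.0, lam_cplx=0.1):
    """Classification + mimicry + complexity; the lam_* weights are hypothetical."""
    weights = soft_attention(logits)
    return (cls_loss
            + lam_mimic * mimic_loss(student_act, teacher_act)
            + lam_cplx * expected_complexity(weights, costs))

# Two candidates at one layer: an expensive pre-trained block and a cheap proxy.
costs = [10.0, 1.0]  # made-up relative costs
acts = [1.0, 2.0]    # identical student/teacher activations, so mimicry is 0
favor_big = joint_loss(0.5, acts, acts, logits=[2.0, -2.0], costs=costs)
favor_small = joint_loss(0.5, acts, acts, logits=[-2.0, 2.0], costs=costs)
```

With classification and mimicry terms held equal, shifting attention toward the cheap proxy lowers the loss. Under this reading, an expensive block survives only when dropping it would hurt classification or mimicry by more than the penalty saves.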

If this is right

  • Simple tasks receive networks with substantially fewer parameters than complex tasks.
  • Classification accuracy stays comparable to training an unconstrained CNN from the same starting point.
  • The same pre-trained blocks can be reused across many tasks without duplication.
  • A single platform can host more tasks simultaneously because per-task memory and compute scale with difficulty.
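
The storage argument behind these points can be sketched with simple bookkeeping: shared blocks are stored once, and each task adds only its own small layers. The parameter counts and task names below are made up for illustration and are not taken from the paper.

```python
def shared_storage(shared_block_params, task_specific_params):
    """Shared blocks are stored once; each task adds only its own layers."""
    per_task = sum(sum(layers) for layers in task_specific_params.values())
    return sum(shared_block_params) + per_task

def finetune_storage(shared_block_params, num_tasks):
    """Standard fine-tuning keeps a full copy of the network per task."""
    return num_tasks * sum(shared_block_params)

# Hypothetical parameter counts in arbitrary units.
shared = [100, 200, 300]
tasks = {
    "digits": [5, 5],       # easy task: few small task-specific layers
    "birds": [20, 30, 40],  # harder task: more task-specific capacity
}

modular_total = shared_storage(shared, tasks)
finetuned_total = finetune_storage(shared, len(tasks))
```

This counts storage only. Per-task compute also depends on which shared blocks each task keeps, which the sketch does not model.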

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The block-selection process could be extended to decide at inference time which blocks to activate based on input difficulty.
  • The same modular reuse might combine with quantization or hardware-aware search to further reduce deployment cost on edge devices.
  • If the attention mechanism generalizes, it could serve as a diagnostic for which layers matter most for a given task family.

Load-bearing premise

Training under the combined objectives of classification error, activation mimicking, soft attention, and complexity regularization will select blocks that remain accurate for the target task without separate validation of the resulting architecture.

What would settle it

Applying the procedure to tasks of graded difficulty and finding no reliable reduction in selected blocks or parameters for the simpler tasks relative to standard fine-tuning.

Figures

Figures reproduced from arXiv: 1907.00274 by Nuno Vasconcelos, Pedro Morgado.

Figure 1: Architecture fine-tuning with NetTailor. Pre-trained blocks shown in gray, task-specific in green. Left: Pre-trained CNN augmented with low-complexity blocks that introduce skip connections. Center: Optimization prunes blocks of poor trade-off between complexity and impact on recognition. Right: The final network is a combination of pre-trained and task-specific blocks.
Figure 2: Augmentation of pre-trained block G_l at layer l with multiple proxy layers A_p^l. x_i represents the network activation after layer i.
Figure 3: Block removal criteria. (a) Self-exclusion. (b) Input exclusion. (c) Output exclusion.
Figure 4: Reduction of network complexity and final architecture after adapting ResNet34 to three datasets using N
Figure 5: Accuracy vs. complexity of models of increasing depth.
Figure 6: Accuracy vs. complexity of models discovered by
Original abstract

Real-world applications of object recognition often require the solution of multiple tasks in a single platform. Under the standard paradigm of network fine-tuning, an entirely new CNN is learned per task, and the final network size is independent of task complexity. This is wasteful, since simple tasks require smaller networks than more complex tasks, and limits the number of tasks that can be solved simultaneously. To address these problems, we propose a transfer learning procedure, denoted NetTailor, in which layers of a pre-trained CNN are used as universal blocks that can be combined with small task-specific layers to generate new networks. Besides minimizing classification error, the new network is trained to mimic the internal activations of a strong unconstrained CNN, and minimize its complexity by the combination of 1) a soft-attention mechanism over blocks and 2) complexity regularization constraints. In this way, NetTailor can adapt the network architecture, not just its weights, to the target task. Experiments show that networks adapted to simple tasks, such as character or traffic sign recognition, become significantly smaller than those adapted to hard tasks, such as fine-grained recognition. More importantly, due to the modular nature of the procedure, this reduction in network complexity is achieved without compromise of either parameter sharing across tasks, or classification accuracy.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper proposes NetTailor, a transfer learning procedure that treats layers of a pre-trained CNN as universal blocks which are combined with small task-specific layers to form task-adapted networks. Training minimizes classification error while also mimicking internal activations of an unconstrained CNN and applying soft-attention over blocks plus complexity regularization, thereby adapting architecture (not just weights) to task complexity. Experiments are claimed to show that networks for simple tasks (character/traffic-sign recognition) become significantly smaller than those for hard tasks (fine-grained recognition) while preserving accuracy and cross-task parameter sharing.

Significance. If the experimental claims hold with proper controls, the method offers a practical route to multi-task deployment on constrained hardware by producing task-dependent network sizes without retraining separate CNNs from scratch or sacrificing modularity. The combination of activation mimicking with attention-based block selection is a distinctive element that could influence subsequent work on modular transfer learning.

major comments (2)
  1. [Experiments] The central claim that size reduction is task-driven (rather than regularization-driven) requires evidence that the same complexity penalty produces different block selections across tasks of varying difficulty. The experiments section reports smaller architectures for simple tasks but does not include a control in which block selection is performed with task-independent regularization strength or with attention disabled, leaving open the possibility that observed size differences are artifacts of the regularization term alone.
  2. [Method (joint loss and attention mechanism)] The procedure relies on soft-attention during training to select blocks, yet no post-training enumeration or comparison (e.g., exhaustive search over subsets of the universal blocks at matched complexity) is provided to establish that the learned selection is near-optimal for the observed accuracy. Without this, the claim that the joint loss reliably yields task-appropriate architectures remains unverified.
minor comments (2)
  1. [Method] Notation for the soft-attention weights and the complexity regularization term should be introduced with explicit equations rather than descriptive text only.
  2. [Abstract and Experiments] The abstract states experimental support for size reduction without accuracy loss, but quantitative tables or figures are not referenced in the provided summary; ensure all reported accuracies and parameter counts appear with standard deviations or multiple runs.

Simulated Author's Rebuttal

2 responses · 1 unresolved

We thank the referee for the detailed review and constructive suggestions. Below we address each major comment in turn.

Point-by-point responses
  1. Referee: [Experiments] The central claim that size reduction is task-driven (rather than regularization-driven) requires evidence that the same complexity penalty produces different block selections across tasks of varying difficulty. The experiments section reports smaller architectures for simple tasks but does not include a control in which block selection is performed with task-independent regularization strength or with attention disabled, leaving open the possibility that observed size differences are artifacts of the regularization term alone.

    Authors: The regularization strength is indeed task-independent, as stated in the method. The differences arise because the attention mechanism is optimized jointly with the task-specific classification and mimicry losses, which vary with task difficulty. To strengthen this point, we will add an experiment that disables the attention mechanism (forcing all blocks to be kept) and show that the resulting networks do not exhibit the same task-dependent size variation. We will also clarify this in the text. revision: yes

  2. Referee: [Method (joint loss and attention mechanism)] The procedure relies on soft-attention during training to select blocks, yet no post-training enumeration or comparison (e.g., exhaustive search over subsets of the universal blocks at matched complexity) is provided to establish that the learned selection is near-optimal for the observed accuracy. Without this, the claim that the joint loss reliably yields task-appropriate architectures remains unverified.

    Authors: We acknowledge that an exhaustive search would provide stronger evidence of optimality. However, such a search is intractable for the number of blocks considered (exponential complexity). The soft-attention serves as a practical, differentiable proxy trained end-to-end with the joint loss. We will add a paragraph discussing this limitation and note that the method prioritizes practicality and modularity over guaranteed optimality. No revision to the experiments is planned as the requested comparison is not feasible. revision: no

standing simulated objections not resolved
  • Verification of near-optimality via exhaustive enumeration of block subsets, which is computationally infeasible.
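
To put the infeasibility claim in numbers, the sketch below counts the keep/drop configurations of a set of candidate blocks. The block count of 40 and the one-second evaluation time are hypothetical, chosen only to show the scale; the paper's actual search space may be larger or smaller.

```python
import math

def num_block_subsets(num_blocks):
    """Each block is either kept or dropped, giving 2**n configurations."""
    return 2 ** num_blocks

def num_subsets_of_size(num_blocks, kept):
    """Configurations that keep exactly `kept` blocks (a crude matched-size proxy)."""
    return math.comb(num_blocks, kept)

def years_to_enumerate(num_configs, seconds_per_eval):
    """Wall-clock years to evaluate every configuration sequentially."""
    return num_configs * seconds_per_eval / (365 * 24 * 3600)

n = 40  # hypothetical number of candidate blocks
all_configs = num_block_subsets(n)
half_kept = num_subsets_of_size(n, n // 2)
years_all = years_to_enumerate(all_configs, seconds_per_eval=1.0)
```

Even an optimistic one second per evaluation leaves full enumeration far out of reach at this scale, and restricting the search to subsets of one fixed size still leaves a very large space. A sampled comparison against random subsets of matched size would be a cheaper, weaker check.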

Circularity Check

0 steps flagged

No circularity; new training procedure with independent objectives

full rationale

The paper defines NetTailor as a transfer procedure that combines classification loss with explicit new terms (activation mimicking of an unconstrained CNN, soft-attention over blocks, and complexity regularization). These objectives are introduced as design choices rather than derived from or reduced to quantities already present in the inputs or prior fits. The central experimental claim—that simpler tasks yield smaller adapted networks—is presented as an observed outcome of training, not a prediction forced by construction or by self-citation. No equations, uniqueness theorems, or ansatzes are shown to collapse the result back onto the method's own fitted parameters. The derivation chain is therefore self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

Based solely on the abstract, the method rests on domain assumptions about layer reusability and the effectiveness of mimicry plus regularization; no free parameters or invented entities are explicitly quantified.

axioms (2)
  • domain assumption: Layers from a pre-trained CNN can serve as universal blocks that combine with task-specific layers for new tasks.
    Central premise of the NetTailor construction stated in the abstract.
  • domain assumption: Mimicking internal activations of a strong unconstrained CNN plus complexity regularization produces accurate yet smaller task-adapted networks.
    Core training objective described in the abstract.

pith-pipeline@v0.9.0 · 5756 in / 1447 out tokens · 46311 ms · 2026-05-25T12:33:20.791238+00:00 · methodology

Reference graph

Works this paper leans on

76 extracted references · 76 canonical work pages · 16 internal anchors

  1. Rahaf Aljundi, Punarjay Chakravarty, and Tinne Tuytelaars. Expert gate: Lifelong learning with a network of experts. In Computer Vision and Pattern Recognition (CVPR), 2017
  2. Jimmy Ba and Rich Caruana. Do deep nets really need to be deep? In Advances in Neural Information Processing Systems (NeurIPS), 2014
  3. Yoshua Bengio, Jérôme Louradour, Ronan Collobert, and Jason Weston. Curriculum learning. In International Conference on Machine Learning (ICML), 2009
  4. Cristian Buciluǎ, Rich Caruana, and Alexandru Niculescu-Mizil. Model compression. In International Conference on Knowledge Discovery and Data Mining (SIGKDD), 2006
  5. Zhaowei Cai, Mohammad Saberian, and Nuno Vasconcelos. Learning complexity-aware cascades for deep pedestrian detection. In International Conference on Computer Vision (ICCV), 2015
  6. Rich Caruana. Multitask learning. Machine learning, 28(1):41–75, 1997
  7. Tianqi Chen, Ian Goodfellow, and Jonathon Shlens. Net2net: Accelerating learning via knowledge transfer. arXiv preprint arXiv:1511.05641, 2015
  8. Mircea Cimpoi, Subhransu Maji, Iasonas Kokkinos, Sammy Mohamed, and Andrea Vedaldi. Describing textures in the wild. In Computer Vision and Pattern Recognition (CVPR), 2014
  9. Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. Imagenet: A large-scale hierarchical image database. In Computer Vision and Pattern Recognition (CVPR), 2009
  10. Mathias Eitz, James Hays, and Marc Alexa. How do humans sketch objects? ACM Trans. Graph. (Proc. SIGGRAPH), 31(4):44:1–44:10, 2012
  11. M. Everingham, L. Van Gool, C. K. I. Williams, J. Winn, and A. Zisserman. The PASCAL Visual Object Classes Challenge 2012 (VOC2012) Results. http://www.pascal-network.org/challenges/VOC/voc2012/workshop/index.html
  12. Yang Fan, Fei Tian, Tao Qin, Xiang-Yang Li, and Tie-Yan Liu. Learning to teach. arXiv preprint arXiv:1805.03643, 2018
  13. Michael Figurnov, Maxwell D Collins, Yukun Zhu, Li Zhang, Jonathan Huang, Dmitry Vetrov, and Ruslan Salakhutdinov. Spatially adaptive computation time for residual networks. In Computer Vision and Pattern Recognition (CVPR), 2017
  14. Yaroslav Ganin and Victor Lempitsky. Unsupervised domain adaptation by backpropagation. In International Conference on Machine Learning (ICML), 2015
  15. Ross Girshick, Jeff Donahue, Trevor Darrell, and Jitendra Malik. Rich feature hierarchies for accurate object detection and semantic segmentation. In Computer Vision and Pattern Recognition (CVPR), 2014
  16. Ian J Goodfellow, Mehdi Mirza, Da Xiao, Aaron Courville, and Yoshua Bengio. An empirical investigation of catastrophic forgetting in gradient-based neural networks. arXiv preprint arXiv:1312.6211, 2013
  17. Song Han, Jeff Pool, John Tran, and William Dally. Learning both weights and connections for efficient neural network. In Advances in Neural Information Processing Systems (NeurIPS), 2015
  18. Babak Hassibi and David G Stork. Second order derivatives for network pruning: Optimal brain surgeon. In Advances in Neural Information Processing Systems (NeurIPS), 1993
  19. Kaiming He, Georgia Gkioxari, Piotr Dollár, and Ross Girshick. Mask r-cnn. In International Conference on Computer Vision (ICCV), 2017
  20. Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Computer Vision and Pattern Recognition (CVPR), 2016
  21. Geoffrey Hinton, Oriol Vinyals, and Jeff Dean. Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531, 2015
  22. Gao Huang, Danlu Chen, Tianhong Li, Felix Wu, Laurens van der Maaten, and Kilian Q Weinberger. Multi-scale dense networks for resource efficient image classification. In International Conference on Learning Representations (ICLR), 2018
  23. Alex Kendall, Yarin Gal, and Roberto Cipolla. Multi-task learning using uncertainty to weigh losses for scene geometry and semantics. In Computer Vision and Pattern Recognition (CVPR), 2018
  24. James Kirkpatrick, Razvan Pascanu, Neil Rabinowitz, Joel Veness, Guillaume Desjardins, Andrei A Rusu, Kieran Milan, John Quan, Tiago Ramalho, Agnieszka Grabska-Barwinska, et al. Overcoming catastrophic forgetting in neural networks. National Academy of Sciences, 2017
  25. Jonathan Krause, Michael Stark, Jia Deng, and Li Fei-Fei. 3D object representations for fine-grained categorization. In International IEEE Workshop on 3D Representation and Recognition (3dRR), 2013
  26. Alex Krizhevsky and Geoffrey Hinton. Learning multiple layers of features from tiny images. Technical report, 2009
  27. Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. Imagenet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems (NeurIPS), 2012
  28. Brenden M Lake, Ruslan Salakhutdinov, and Joshua B Tenenbaum. Human-level concept learning through probabilistic program induction. Science, 350(6266):1332–1338, 2015
  29. Yann LeCun, John S Denker, and Sara A Solla. Optimal brain damage. In Advances in Neural Information Processing Systems (NeurIPS), 1990
  30. Sang-Woo Lee, Jin-Hwa Kim, Jaehyun Jun, Jung-Woo Ha, and Byoung-Tak Zhang. Overcoming catastrophic forgetting by incremental moment matching. In Advances in Neural Information Processing Systems (NeurIPS), 2017
  31. Hao Li, Asim Kadav, Igor Durdanovic, Hanan Samet, and Hans Peter Graf. Pruning filters for efficient convnets. arXiv preprint arXiv:1608.08710, 2016
  32. Zhizhong Li and Derek Hoiem. Learning without forgetting. IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 2017
  33. Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C Lawrence Zitnick. Microsoft coco: Common objects in context. In European Conference on Computer Vision (ECCV), 2014
  34. Chenxi Liu, Barret Zoph, Maxim Neumann, Jonathon Shlens, Wei Hua, Li-Jia Li, Li Fei-Fei, Alan Yuille, Jonathan Huang, and Kevin Murphy. Progressive neural architecture search. In European Conference on Computer Vision (ECCV), 2018
  35. Hanxiao Liu, Karen Simonyan, and Yiming Yang. Darts: Differentiable architecture search. arXiv preprint arXiv:1806.09055, 2018
  36. David Lopez-Paz et al. Gradient episodic memory for continual learning. In Advances in Neural Information Processing Systems (NeurIPS), 2017
  37. Subhransu Maji, Esa Rahtu, Juho Kannala, Matthew Blaschko, and Andrea Vedaldi. Fine-grained visual classification of aircraft. arXiv preprint arXiv:1306.5151, 2013
  38. Arun Mallya, Dillon Davis, and Svetlana Lazebnik. Piggyback: Adapting a single network to multiple tasks by learning to mask weights. In European Conference on Computer Vision (ECCV), 2018
  39. Arun Mallya and Svetlana Lazebnik. Packnet: Adding multiple tasks to a single network by iterative pruning. In Conference on Computer Vision and Pattern Recognition (CVPR), 2018
  40. Tambet Matiisen, Avital Oliver, Taco Cohen, and John Schulman. Teacher-student curriculum learning. arXiv preprint arXiv:1707.00183, 2017
  41. Ishan Misra, Abhinav Shrivastava, Abhinav Gupta, and Martial Hebert. Cross-stitch networks for multi-task learning. In Computer Vision and Pattern Recognition (CVPR), 2016
  42. Tom Mitchell, William Cohen, Estevam Hruschka, Partha Talukdar, Bo Yang, Justin Betteridge, Andrew Carlson, B Dalvi, Matt Gardner, Bryan Kisiel, et al. Never-ending learning. Communications of the ACM, 61(5):103–115, 2018
  43. Pavlo Molchanov, Stephen Tyree, Tero Karras, Timo Aila, and Jan Kautz. Pruning convolutional neural networks for resource efficient inference. arXiv preprint arXiv:1611.06440, 2016
  44. Stefan Munder and Dariu M Gavrila. An experimental study on pedestrian classification. IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 28(11):1863–1868, 2006
  45. Yuval Netzer, Tao Wang, Adam Coates, Alessandro Bissacco, Bo Wu, and Andrew Y Ng. Reading digits in natural images with unsupervised feature learning. In Advances in Neural Information Processing Systems Workshop (NeurIPS), 2011
  46. M-E. Nilsback and A. Zisserman. Automated flower classification over a large number of classes. In Indian Conference on Computer Vision, Graphics and Image Processing, 2008
  47. Rajeev Ranjan, Vishal M Patel, and Rama Chellappa. Hyperface: A deep multi-task learning framework for face detection, landmark localization, pose estimation, and gender recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 2017
  48. Amal Rannen, Rahaf Aljundi, Matthew B Blaschko, and Tinne Tuytelaars. Encoder based lifelong learning. In International Conference on Computer Vision (ICCV), 2017
  49. Sylvestre-Alvise Rebuffi, Hakan Bilen, and Andrea Vedaldi. Learning multiple visual domains with residual adapters. In Advances in Neural Information Processing Systems (NeurIPS), 2017
  50. Sylvestre-Alvise Rebuffi, Hakan Bilen, and Andrea Vedaldi. Efficient parametrization of multi-domain deep neural networks. In Computer Vision and Pattern Recognition (CVPR), 2018
  51. Sylvestre-Alvise Rebuffi, Alexander Kolesnikov, Georg Sperl, and Christoph H Lampert. icarl: Incremental classifier and representation learning. In Computer Vision and Pattern Recognition (CVPR), 2017
  52. Joseph Redmon, Santosh Divvala, Ross Girshick, and Ali Farhadi. You only look once: Unified, real-time object detection. In Computer Vision and Pattern Recognition (CVPR), 2016
  53. Adriana Romero, Nicolas Ballas, Samira Ebrahimi Kahou, Antoine Chassang, Carlo Gatta, and Yoshua Bengio. Fitnets: Hints for thin deep nets. arXiv preprint arXiv:1412.6550, 2014
  54. Amir Rosenfeld and John K Tsotsos. Incremental learning through deep adaptation. IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI), 2018
  55. Andrei A Rusu, Neil C Rabinowitz, Guillaume Desjardins, Hubert Soyer, James Kirkpatrick, Koray Kavukcuoglu, Razvan Pascanu, and Raia Hadsell. Progressive neural networks. arXiv preprint arXiv:1606.04671, 2016
  56. Babak Saleh and Ahmed Elgammal. Large-scale classification of fine-art paintings: Learning the right metric on the right feature. arXiv preprint arXiv:1505.00855, 2015
  57. Florian Schroff, Dmitry Kalenichenko, and James Philbin. Facenet: A unified embedding for face recognition and clustering. In Computer Vision and Pattern Recognition (CVPR), 2015
  58. Ali Sharif Razavian, Hossein Azizpour, Josephine Sullivan, and Stefan Carlsson. Cnn features off-the-shelf: an astounding baseline for recognition. In Computer Vision and Pattern Recognition Workshops (CVPRw), 2014
  59. Khurram Soomro, Amir Roshan Zamir, and Mubarak Shah. Ucf101: A dataset of 101 human actions classes from videos in the wild. arXiv preprint arXiv:1212.0402, 2012
  60. Johannes Stallkamp, Marc Schlipsing, Jan Salmen, and Christian Igel. Man vs. computer: Benchmarking machine learning algorithms for traffic sign recognition. Neural networks, 32:323–332, 2012
  61. Surat Teerapittayanon, Bradley McDanel, and HT Kung. Branchynet: Fast inference via early exiting from deep neural networks. In International Conference on Pattern Recognition (ICPR), 2016
  62. Sebastian Thrun. A lifelong learning perspective for mobile robot control. In Intelligent Robots and Systems, pages 201–214, 1995
  63. Eric Tzeng, Judy Hoffman, Trevor Darrell, and Kate Saenko. Simultaneous deep transfer across domains and tasks. In International Conference on Computer Vision (ICCV), 2015
  64. Eric Tzeng, Judy Hoffman, Kate Saenko, and Trevor Darrell. Adversarial discriminative domain adaptation. In Computer Vision and Pattern Recognition (CVPR), 2017
  65. Andreas Veit and Serge Belongie. Convolutional networks with adaptive inference graphs. In European Conference on Computer Vision (ECCV), 2018
  66. Paul Viola, Michael Jones, et al. Rapid object detection using a boosted cascade of simple features. In Computer Vision and Pattern Recognition (CVPR), 2001
  67. C. Wah, S. Branson, P. Welinder, P. Perona, and S. Belongie. The Caltech-UCSD Birds-200-2011 Dataset. Technical Report CNS-TR-2011-001, California Institute of Technology, 2011
  68. Fan Yang, Wongun Choi, and Yuanqing Lin. Exploit all the layers: Fast and accurate CNN object detector with scale dependent pooling and cascaded rejection classifiers. In Computer Vision and Pattern Recognition (CVPR), 2016
  69. Jaehong Yoon, Eunho Yang, et al. Lifelong learning with dynamically expandable networks. arXiv preprint arXiv:1708.01547, 2017
  70. Jason Yosinski, Jeff Clune, Yoshua Bengio, and Hod Lipson. How transferable are features in deep neural networks? In Advances in Neural Information Processing Systems (NeurIPS), 2014
  71. Sergey Zagoruyko and Nikos Komodakis. Wide residual networks. In British Machine Vision Conference (BMVC), 2016
  72. Zhanpeng Zhang, Ping Luo, Chen Change Loy, and Xiaoou Tang. Facial landmark detection by deep multi-task learning. In European Conference on Computer Vision (ECCV), 2014
  73. Bolei Zhou, Agata Lapedriza, Jianxiong Xiao, Antonio Torralba, and Aude Oliva. Learning deep features for scene recognition using places database. In Advances in Neural Information Processing Systems (NeurIPS), 2014
  74. Hao Zhou, Jose M Alvarez, and Fatih Porikli. Less is more: Towards compact cnns. In European Conference on Computer Vision (ECCV), 2016
  75. Barret Zoph and Quoc V Le. Neural architecture search with reinforcement learning. arXiv preprint arXiv:1611.01578, 2016
  76. Barret Zoph, Vijay Vasudevan, Jonathon Shlens, and Quoc V Le. Learning transferable architectures for scalable image recognition. arXiv preprint arXiv:1707.07012, 2(6), 2017