
arxiv: 1511.05641 · v4 · pith:THCPO3NSnew · submitted 2015-11-18 · 💻 cs.LG

Net2Net: Accelerating Learning via Knowledge Transfer

classification 💻 cs.LG
keywords neural · knowledge · network · process · during · experimentation · net2net · previous

We introduce techniques for rapidly transferring the information stored in one neural net into another neural net. The main purpose is to accelerate the training of a significantly larger neural net. During real-world workflows, one often trains very many different neural networks during the experimentation and design process. This is a wasteful process in which each new model is trained from scratch. Our Net2Net technique accelerates the experimentation process by instantaneously transferring the knowledge from a previous network to each new deeper or wider network. Our techniques are based on the concept of function-preserving transformations between neural network specifications. This differs from previous approaches to pre-training that altered the function represented by a neural net when adding layers to it. Using our knowledge transfer mechanism to add depth to Inception modules, we demonstrate a new state of the art accuracy rating on the ImageNet dataset.
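As an illustration of what a function-preserving transformation can look like, here is a minimal NumPy sketch for a two-layer ReLU network. The abstract does not spell out the construction, so everything here is an assumption made for illustration: the function names (`forward`, `widen`, `identity_layer`), the random choice of which units to replicate, and the restriction to fully connected layers. Widening copies existing hidden units and divides their outgoing weights by the number of copies; deepening inserts a layer initialized to the identity, which leaves the output unchanged here because ReLU activations are already non-negative.

```python
import numpy as np


def relu(x):
    return np.maximum(x, 0.0)


def forward(x, W1, b1, W2, b2):
    """Two-layer network: linear -> ReLU -> linear."""
    return relu(x @ W1 + b1) @ W2 + b2


def widen(W1, b1, W2, new_width, rng):
    """Grow the hidden layer to `new_width` units without changing the function.

    Hypothetical helper: extra units are copies of randomly chosen existing
    units, and each unit's outgoing weights are divided by its copy count so
    the summed contribution to the next layer is unchanged.
    """
    old_width = W1.shape[1]
    assert new_width >= old_width
    extra = rng.integers(0, old_width, size=new_width - old_width)
    mapping = np.concatenate([np.arange(old_width), extra])
    counts = np.bincount(mapping, minlength=old_width)

    W1_new = W1[:, mapping]                             # copy incoming weights
    b1_new = b1[mapping]                                # copy biases
    W2_new = W2[mapping, :] / counts[mapping][:, None]  # split outgoing weights
    return W1_new, b1_new, W2_new


def identity_layer(width):
    """Weights for a new layer that initially computes the identity.

    Placed after a ReLU, relu(h @ I + 0) == h because h is non-negative.
    """
    return np.eye(width), np.zeros(width)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    x = rng.normal(size=(5, 4))
    W1, b1 = rng.normal(size=(4, 3)), rng.normal(size=3)
    W2, b2 = rng.normal(size=(3, 2)), rng.normal(size=2)

    y_small = forward(x, W1, b1, W2, b2)
    W1w, b1w, W2w = widen(W1, b1, W2, new_width=7, rng=rng)
    y_wide = forward(x, W1w, b1w, W2w, b2)
    print(np.allclose(y_small, y_wide))
```

The larger network starts out computing the same outputs as the smaller one, so further training continues from the smaller network's performance rather than from a random initialization.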



Forward citations

Cited by 10 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Expert Upcycling: Shifting the Compute-Efficient Frontier of Mixture-of-Experts

    cs.LG 2026-04 unverdicted novelty 7.0

    Expert upcycling duplicates experts in an existing MoE checkpoint and continues pre-training to match fixed-size baseline performance with 32% less compute.

  2. Isotropic Activation Functions Enable Deindividuated Neurons and Adaptive Topologies

    cs.NE 2026-02 unverdicted novelty 7.0

    Isotropic activation functions derived from reparameterisation symmetries and SVD diagonalisation enable function-preserving neuron removal and addition in dense networks, supporting up to 50% sparsification and real-...

  3. Dota 2 with Large Scale Deep Reinforcement Learning

    cs.LG 2019-12 accept novelty 7.0

    OpenAI Five achieved superhuman performance in Dota 2 by defeating the world champions using scaled self-play reinforcement learning.

  4. Language Models as Knowledge Bases?

    cs.CL 2019-09 accept novelty 7.0

    BERT stores relational knowledge extractable via cloze queries without fine-tuning and matches supervised baselines on open-domain QA tasks.

  5. NetTailor: Tuning the Architecture, Not Just the Weights

    cs.CV 2019-06 unverdicted novelty 7.0

    NetTailor adapts CNN architecture for new tasks by assembling pre-trained universal blocks with task-specific layers, trained via activation mimicry and complexity penalties to match accuracy while reducing size for s...

  6. Recursive Block-Diagonal Coupling for Resource-Efficient Training of Vision Models

    cs.CV 2026-05 unverdicted novelty 6.0

    RBDC trains wide vision models by recursive block-diagonal coupling of narrower pre-trained models, reducing training FLOPs by 30% at similar ImageNet accuracy for DeiT and ResNet while outperforming model growth baselines.

  7. Expert Upcycling: Shifting the Compute-Efficient Frontier of Mixture-of-Experts

    cs.LG 2026-04 unverdicted novelty 6.0

    Expert upcycling expands MoE models by duplicating experts and continuing pre-training, matching baseline performance while saving 32% GPU hours in 7B-13B experiments.

  8. Chain-of-Models Pre-Training: Rethinking Training Acceleration of Vision Foundation Models

    cs.CV 2026-04 unverdicted novelty 6.0

    CoM-PT trains vision foundation models in ascending size order using inverse knowledge transfer, allowing larger models to achieve superior performance with significantly reduced overall computational cost compared to...

  9. Nexusformer: Nonlinear Attention Expansion for Stable and Inheritable Transformer Scaling

    cs.LG 2026-04 unverdicted novelty 5.0

    Nexusformer uses a three-stage nonlinear mapping in attention to enable stable, inheritable scaling of transformers, matching baseline perplexity with up to 41.5% less compute when growing from 240M to 440M parameters.

  10. Beyond Sunk Costs: Boosting LLM Pre-training Efficiency via Orthogonal Growth of Mixture-of-Experts

    cs.LG 2025-10 unverdicted novelty 5.0

    Orthogonal growth recycles pre-trained MoE checkpoints via layer copying and noisy expert duplication, delivering 10.6% higher accuracy than training from scratch with equivalent extra compute.