Net2Net: Accelerating Learning via Knowledge Transfer

Tianqi Chen , Ian Goodfellow , Jonathon Shlens

Authors on Pith no claims yet

classification 💻 cs.LG

keywords neuralknowledgenetworkprocessduringexperimentationnet2netprevious

read the original abstract

We introduce techniques for rapidly transferring the information stored in one neural net into another neural net. The main purpose is to accelerate the training of a significantly larger neural net. During real-world workflows, one often trains very many different neural networks during the experimentation and design process. This is a wasteful process in which each new model is trained from scratch. Our Net2Net technique accelerates the experimentation process by instantaneously transferring the knowledge from a previous network to each new deeper or wider network. Our techniques are based on the concept of function-preserving transformations between neural network specifications. This differs from previous approaches to pre-training that altered the function represented by a neural net when adding layers to it. Using our knowledge transfer mechanism to add depth to Inception modules, we demonstrate a new state of the art accuracy rating on the ImageNet dataset.

This paper has not been read by Pith yet.

discussion (0)

Forward citations

Cited by 5 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Expert Upcycling: Shifting the Compute-Efficient Frontier of Mixture-of-Experts
cs.LG 2026-04 unverdicted novelty 7.0

Expert upcycling duplicates experts in an existing MoE checkpoint and continues pre-training to match fixed-size baseline performance with 32% less compute.
Dota 2 with Large Scale Deep Reinforcement Learning
cs.LG 2019-12 accept novelty 7.0

OpenAI Five achieved superhuman performance in Dota 2 by defeating the world champions using scaled self-play reinforcement learning.
Expert Upcycling: Shifting the Compute-Efficient Frontier of Mixture-of-Experts
cs.LG 2026-04 unverdicted novelty 6.0

Expert upcycling expands MoE models by duplicating experts and continuing pre-training, matching baseline performance while saving 32% GPU hours in 7B-13B experiments.
Chain-of-Models Pre-Training: Rethinking Training Acceleration of Vision Foundation Models
cs.CV 2026-04 unverdicted novelty 6.0

CoM-PT trains vision foundation models in ascending size order using inverse knowledge transfer, allowing larger models to achieve superior performance with significantly reduced overall computational cost compared to...
Nexusformer: Nonlinear Attention Expansion for Stable and Inheritable Transformer Scaling
cs.LG 2026-04 unverdicted novelty 5.0

Nexusformer uses a three-stage nonlinear mapping in attention to enable stable, inheritable scaling of transformers, matching baseline perplexity with up to 41.5% less compute when growing from 240M to 440M parameters.