pith. machine review for the scientific record.

arxiv: 1511.05641 · v4 · submitted 2015-11-18 · 💻 cs.LG

Recognition: unknown

Net2Net: Accelerating Learning via Knowledge Transfer

Authors on Pith: no claims yet
classification: 💻 cs.LG
keywords: neural · knowledge · network · process · during · experimentation · net2net · previous
0 comments
read the original abstract

We introduce techniques for rapidly transferring the information stored in one neural net into another neural net. The main purpose is to accelerate the training of a significantly larger neural net. During real-world workflows, one often trains very many different neural networks during the experimentation and design process. This is a wasteful process in which each new model is trained from scratch. Our Net2Net technique accelerates the experimentation process by instantaneously transferring the knowledge from a previous network to each new deeper or wider network. Our techniques are based on the concept of function-preserving transformations between neural network specifications. This differs from previous approaches to pre-training that altered the function represented by a neural net when adding layers to it. Using our knowledge transfer mechanism to add depth to Inception modules, we demonstrate a new state of the art accuracy rating on the ImageNet dataset.
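The function-preserving transformations the abstract refers to are the paper's Net2WiderNet and Net2DeeperNet operations: a hidden layer is widened by replicating units and splitting their outgoing weights among the copies, or deepened by inserting a new layer initialised to the identity, so the grown network starts out computing the same function as its parent. The following is a minimal NumPy sketch of both operations for a toy two-layer ReLU MLP, not the authors' released code; the layer sizes, variable names, and the small check at the end are illustrative assumptions.

```python
# Illustrative sketch of Net2Net-style function-preserving growth
# (toy sizes and names are assumptions, not the paper's reference code).
import numpy as np

def net2wider(W1, b1, W2, new_width, noise=0.0):
    """Widen a hidden layer from W1.shape[1] to new_width units.

    Extra units replicate randomly chosen originals; each original unit's
    outgoing weights are divided by its replication count, so (W1, W2)
    still computes the same function (small `noise` can be added to break
    the symmetry between copies, at the cost of exactness).
    """
    old_width = W1.shape[1]
    assert new_width >= old_width
    mapping = np.concatenate([
        np.arange(old_width),                                   # keep originals
        np.random.randint(0, old_width, new_width - old_width), # random copies
    ])
    W1_new = W1[:, mapping] + noise * np.random.randn(W1.shape[0], new_width)
    b1_new = b1[mapping]
    counts = np.bincount(mapping, minlength=old_width)          # copies per unit
    W2_new = W2[mapping, :] / counts[mapping][:, None]
    return W1_new, b1_new, W2_new

def net2deeper(width):
    """New fully connected layer initialised to the identity, so inserting it
    after a ReLU layer leaves the network's function unchanged
    (relu(h @ I + 0) == h whenever h >= 0)."""
    return np.eye(width), np.zeros(width)

if __name__ == "__main__":
    # Quick check that both transformations preserve the toy network's outputs.
    rng = np.random.default_rng(0)
    x = rng.standard_normal((5, 3))
    W1, b1 = rng.standard_normal((3, 4)), rng.standard_normal(4)
    W2, b2 = rng.standard_normal((4, 2)), rng.standard_normal(2)
    relu = lambda z: np.maximum(z, 0)

    y_old = relu(x @ W1 + b1) @ W2 + b2

    # Widen the hidden layer from 4 to 6 units (noise=0 for an exact check).
    W1w, b1w, W2w = net2wider(W1, b1, W2, new_width=6)
    y_wide = relu(x @ W1w + b1w) @ W2w + b2

    # Deepen by inserting an identity-initialised layer after the hidden ReLU.
    Wd, bd = net2deeper(6)
    y_deep = relu(relu(x @ W1w + b1w) @ Wd + bd) @ W2w + b2

    print(np.allclose(y_old, y_wide), np.allclose(y_old, y_deep))  # True True
```

Because the grown network reproduces its parent's outputs at initialisation, training can continue from that point rather than from scratch, which is the source of the acceleration the abstract claims.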

This paper has not been read by Pith yet.

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Forward citations

Cited by 5 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Expert Upcycling: Shifting the Compute-Efficient Frontier of Mixture-of-Experts

    cs.LG 2026-04 unverdicted novelty 7.0

    Expert upcycling duplicates experts in an existing MoE checkpoint and continues pre-training to match fixed-size baseline performance with 32% less compute.

  2. Dota 2 with Large Scale Deep Reinforcement Learning

    cs.LG 2019-12 accept novelty 7.0

    OpenAI Five achieved superhuman performance in Dota 2 by defeating the world champions using scaled self-play reinforcement learning.

  3. Expert Upcycling: Shifting the Compute-Efficient Frontier of Mixture-of-Experts

    cs.LG 2026-04 unverdicted novelty 6.0

    Expert upcycling expands MoE models by duplicating experts and continuing pre-training, matching baseline performance while saving 32% GPU hours in 7B-13B experiments.

  4. Chain-of-Models Pre-Training: Rethinking Training Acceleration of Vision Foundation Models

    cs.CV 2026-04 unverdicted novelty 6.0

    CoM-PT trains vision foundation models in ascending size order using inverse knowledge transfer, allowing larger models to achieve superior performance with significantly reduced overall computational cost compared to...

  5. Nexusformer: Nonlinear Attention Expansion for Stable and Inheritable Transformer Scaling

    cs.LG 2026-04 unverdicted novelty 5.0

    Nexusformer uses a three-stage nonlinear mapping in attention to enable stable, inheritable scaling of transformers, matching baseline perplexity with up to 41.5% less compute when growing from 240M to 440M parameters.